48 tests passed. Zero failures. 94% coverage.
I deployed with confidence. First production request: authentication failed. Every request failed. Nothing worked.
The tests lied.
I’ve seen this pattern more times than I can count. A team shows me their test suite with pride — high coverage, fast execution, green across the board. Then I ask one question: when did you last start your server fresh and make a real HTTP request?
Silence.
The problem isn’t that they don’t have tests. It’s that their tests verify an idealized version of their system that exists only in the test suite. Production uses the real HTTP stack, real database connections, real integration points. The tests bypass all of this.
This is worse than having no tests. No tests give no confidence. Passing tests that verify nothing give false confidence.
How tests lie
I was debugging a web service last month. 127 unit tests, all passing. The authentication endpoint failed on every real request.
The bug was embarrassing once I found it:
# The test
def test_auth(mock_request):
    mock_request.headers = {"http-authorization": "Bearer xxx"}
    result = authenticate(mock_request)
    assert result.user  # Passes!

# Production expects "Authorization", not "http-authorization".
# The mock used the wrong header name.
# The test verified the mock worked, not the code.
The test passed because it mocked the HTTP request. The mock was wrong. The test never touched the real HTTP stack, so it never noticed.
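To make the failure mode concrete, here's a sketch with made-up names (the handler, the exception, all of it); the point is that the code looks up the same wrong key the mock provides, so the test and the code agree with each other while both disagree with what a real request carries:

class AuthError(Exception):
    pass

def authenticate(request):
    # Reads the key the mock supplied, so the mocked test passes.
    token = request.headers.get("http-authorization")
    if token is None:
        # A real HTTP stack exposes the header as "Authorization",
        # so in production this lookup misses and every request lands here.
        raise AuthError("missing credentials")
    return token  # stand-in for real token verification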
Here’s a Common Lisp example that bit me harder:
;; REPL Session 1: Manual setup while developing
(setf *jwt-secret* "my-secret-key")

;; REPL Session 2: Write the actual code
(defvar *jwt-secret* "default")

;; Run tests in the same REPL
(deftest test-jwt
  (let ((token (create-token (list :user "alice"))))
    (is (verify-token token))))  ; PASSES
Why does this pass? Because defvar doesn’t reinitialize a variable that’s already bound. The test uses “my-secret-key” from Session 1, not “default” from the code.
Production starts fresh. It uses “default”. Every token verification fails.
The test passed in accumulated state. Production starts with zero state.
The three tests that actually matter
First: the smoke test.
Start the actual server. Make a real HTTP request. See if authentication works end-to-end.
import subprocess
import time

import requests

def test_server_smoke():
    server = subprocess.Popen(["python", "-m", "myapp"])
    time.sleep(2)
    try:
        # Real HTTP, real headers, real everything
        response = requests.get("http://localhost:8080/health")
        assert response.status_code == 200

        # Auth flow with real HTTP stack
        login = requests.post(
            "http://localhost:8080/login",
            json={"username": "test", "password": "test"},
        )
        token = login.json()["token"]

        protected = requests.get(
            "http://localhost:8080/protected",
            headers={"Authorization": f"Bearer {token}"},
        )
        assert protected.status_code == 200
    finally:
        server.terminate()
This test would have caught my authentication bug. It uses the real HTTP stack with real headers.
Second: the fresh-start test.
Start the server in a completely fresh environment. No accumulated state. No leftover config. Nothing.
import shutil
import subprocess
import sys
import tempfile
import time

import requests

def test_fresh_start():
    temp_dir = tempfile.mkdtemp()
    env = {
        "DATABASE_URL": f"sqlite:///{temp_dir}/db.sqlite",
        "JWT_SECRET": "fresh-secret",
        "CONFIG_PATH": f"{temp_dir}/config.yaml",
    }
    # sys.executable is an absolute path, so the stripped-down env
    # doesn't need a PATH to find the interpreter.
    process = subprocess.Popen(
        [sys.executable, "-m", "myapp", "--port", "9999"],
        env=env,
    )
    try:
        time.sleep(2)
        response = requests.get("http://localhost:9999/health")
        assert response.status_code == 200
    finally:
        process.terminate()
        shutil.rmtree(temp_dir)
This catches the defvar bug. Fresh process, fresh environment, no REPL state to hide behind.
Third: the critical path test.
Pick your most important user journeys. Test them end-to-end through the same code path production uses.
Not user.save() called directly. POST /users through the HTTP handler, middleware, validation, serialization — the whole stack.
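Here's a sketch of what that looks like, assuming a server already running on localhost:8080 as in the smoke test; the /users endpoint, the payload, and the status codes are placeholders for whatever your own critical path is:

import requests

def test_create_user_critical_path():
    # Goes through routing, middleware, validation, and serialization:
    # the same path production traffic takes.
    created = requests.post(
        "http://localhost:8080/users",
        json={"username": "alice", "email": "alice@example.com"},
    )
    assert created.status_code == 201
    user = created.json()

    # Verify the result through the public API too, not by poking the database.
    fetched = requests.get(f"http://localhost:8080/users/{user['id']}")
    assert fetched.status_code == 200
    assert fetched.json()["username"] == "alice"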
The confidence score
Use a simple heuristic when auditing test suites:
Start at 100. Deduct for gaps.
- No smoke tests: -30
- No fresh-start tests: -20
- Mocked integration points: -15
- No E2E tests at all: -25
- Tests call internal functions instead of API endpoints: -10
That service with the 127 passing tests? 91% coverage, confidence score of 10.
After adding smoke tests, one fresh-start test, and three E2E tests for critical paths: score of 85.
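If you want the heuristic as something you can run, it's nothing more than subtraction; a minimal sketch, with flag names I'm inventing for illustration:

def confidence_score(*, has_smoke_tests, has_fresh_start_tests,
                     mocks_integration_points, has_e2e_tests,
                     tests_hit_real_endpoints):
    # Start at 100, deduct for each gap.
    score = 100
    if not has_smoke_tests:
        score -= 30
    if not has_fresh_start_tests:
        score -= 20
    if mocks_integration_points:
        score -= 15
    if not has_e2e_tests:
        score -= 25
    if not tests_hit_real_endpoints:
        score -= 10
    return score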
The objections I always hear
“E2E tests are slow.”
One smoke test runs in 5 seconds. It catches bugs that slip through hours of unit testing. Fast tests that verify nothing are useless. Slow tests that catch real bugs are valuable.
“Subprocess tests are flaky.”
Flaky tests mean unstable code. If your server can’t start reliably in tests, it can’t start reliably in production. Fix the stability problem, don’t skip the test.
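That said, one self-inflicted source of flakiness is the fixed time.sleep(2) in the tests above: on a slow CI machine the server may not be up yet. Polling for readiness is more reliable; a sketch:

import time

import requests

def wait_until_ready(url, timeout=10.0, interval=0.2):
    # Poll a health endpoint until the server answers, instead of
    # guessing how long startup takes.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(url, timeout=1).status_code == 200:
                return
        except requests.RequestException:
            pass  # not accepting connections yet
        time.sleep(interval)
    raise RuntimeError(f"server at {url} never became ready")

Call wait_until_ready with the health URL right after Popen in the smoke and fresh-start tests, in place of the sleep.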
“We have 100% coverage.”
Coverage measures which lines executed, not whether they worked correctly. This test has 100% coverage:
def test_discount():
    result = calculate_discount(100, "VIP")
    assert result is not None  # Technically passes
It executes the function. It doesn’t verify the result is correct. High coverage with weak assertions creates false confidence.
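The fix is to assert the business rule itself. A sketch, assuming purely for illustration that VIP customers get 20% off:

def test_discount_vip():
    # Assert the actual rule, not merely that something came back.
    assert calculate_discount(100, "VIP") == 80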
Start here
This week, add one smoke test. Start your server, make a real HTTP request to your most critical endpoint, verify it works.
Next week, add a fresh-start test. Completely clean environment, no inherited state.
That’s it. Two tests. They’ll catch more bugs than the next hundred unit tests you write.
Your tests pass. But do they verify the code actually works?
The techniques here are part of what I do when auditing test suites for clients. If your tests pass but production keeps breaking, that’s a test strategy problem — and it’s fixable.