Your AI Wrote the Code, the Tests, and the Review. Who Caught the Bug?
A pattern showed up in three different code reviews I did last week. Same shape every time. An engineer opens a pull request. The implementation is clean. The test suite is thorough. The PR description reads like a senior engineer wrote it. Every check is green.
None of it was written by a human. The implementation came from Claude Code. The tests came from the same session, generated right after the function compiled. The PR description came from a summarizer agent. The first reviewer was an automated review bot running on the same model family.
The code shipped. It also had a bug that took down a payment flow for forty minutes.
The bug was simple. The test for the broken case had been generated from the implementation, so it asserted the wrong behavior with confidence. The review bot read the code and the test together and concluded both were consistent. They were. They were also wrong in exactly the same way.
This is the problem nobody is talking about loudly enough. AI coding tools have started forming a closed loop. The thing that writes the code also writes the things that grade the code. When everything in that loop agrees, you get a pull request that looks like a triumph and ships a regression.
How the loop closed without anyone deciding to close it
A year ago the workflow had natural friction. You wrote code. Someone else wrote tests, or you wrote them yourself while the spec, not the implementation, was fresh in your head. A different human read the diff. Each step was an independent check on the last one, performed by a different brain with different blind spots.
Each of those steps got automated separately. None of the automations were obviously bad on their own. Generating tests from a function saves hours. Auto-summarizing a PR makes reviews faster. Running an LLM reviewer catches the easy stuff before a human has to look. The pitch for each tool is honest.
What nobody priced in is what happens when you run all of them together. The implementation, the tests, the description, and the review now share a common source of truth: the same model's interpretation of what the code is supposed to do. Every stage in the pipeline derives its judgment from the stage before it. The independence is gone.
You used to have four people looking at a problem from four angles. Now you have one model looking at it four times.
Why this fails differently than human review fails
Human review has its own failure modes. Engineers rubber-stamp diffs from teammates they trust. Tired reviewers miss obvious bugs. The author and the reviewer share a mental model and miss the same edge case. None of this is new.
The closed AI loop fails in a more dangerous way because it fails silently and consistently. When a human reviewer rubber-stamps a PR, you can usually tell from the comments. When a model-generated test confirms a model-generated implementation, the PR looks like the system is working. Green checks. Coverage up. Everyone happy.
Three things make the AI version qualitatively worse.
The errors correlate. A human reviewer brings a different prior than the author. A model running twice in a row brings the same prior twice. If the model misreads the spec, it misreads the spec everywhere in the pipeline. The mistake propagates through every check that's supposed to catch it.
The artifacts look authoritative. A test suite written by a senior engineer and a test suite generated from the implementation it's testing look almost identical on the surface. Both have descriptive names. Both cover edge cases. Both run green. The difference is whether the test encodes the spec or encodes the bug. You can't tell from reading it.
The feedback loop has no external ground truth. Tests written from a spec are anchored to something outside the code. Tests generated from the code are anchored to the code itself. If the code is wrong, the tests confirm it confidently. There's no place in the loop where reality gets a vote.
The places this is already breaking
A few patterns I've seen play out in real codebases.
The off-by-one that became canon. A team used a coding agent to implement a pagination helper. The model generated tests at the same time. One test asserted that page 1 returned items 0 through 9. The actual product spec said page 1 should return items 1 through 10. The test passed, the code shipped, and three downstream services started rendering empty first pages. The test was wrong. The code was wrong. They agreed with each other, so nobody noticed for a week.
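A minimal sketch of how that failure mode looks in code. The helper name and the exact numbers are illustrative, not the team's actual code; the point is that the test was derived from the implementation, so it asserts whatever the implementation happens to do:

```python
def get_page(items, page, page_size=10):
    # Bug: treats `page` as zero-based, but the product spec says
    # page numbering starts at 1.
    start = page * page_size
    return items[start:start + page_size]

# Test generated in the same session, by reading the implementation.
# It encodes the observed behavior, not the spec, so it runs green
# and the off-by-one becomes canon.
def test_first_page():
    items = list(range(1, 101))  # items numbered 1..100
    assert get_page(items, page=0) == list(range(1, 11))
```

A spec-derived test would have called `get_page(items, page=1)` and expected those same first ten items, and it would have failed on the spot.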
The mocked dependency that mocked the bug. An agent was asked to write integration tests for a service that called Stripe. It generated mocks for the Stripe client, including the response shapes. The mocks reflected the agent's guess at the Stripe API, not the actual API. The tests passed. Production calls failed because the real Stripe response had a field the agent had hallucinated.
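Here's a sketch of that shape of failure, using Python's `unittest.mock`. The field name `charge_state` is deliberately invented for illustration; this is not a claim about Stripe's real response schema, which is exactly the problem:

```python
from unittest.mock import MagicMock

# Mock generated in the same session as the code under test. The response
# shape is the model's guess at the API, not a recorded real response.
stripe = MagicMock()
stripe.PaymentIntent.create.return_value = {
    "id": "pi_test_123",
    "charge_state": "succeeded",  # hallucinated field, confirmed only by the mock
}

def payment_succeeded(client, amount_cents):
    intent = client.PaymentIntent.create(amount=amount_cents, currency="usd")
    # Green against the mock; KeyError against any real response
    # that lacks the invented field.
    return intent["charge_state"] == "succeeded"
```

Against the mock, `payment_succeeded` returns `True`. Against a response shaped like the real API, the lookup raises `KeyError`, which is precisely the class of failure the generated tests could never see.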
The review bot that approved its own work. A team ran an LLM reviewer on every PR. The same team also let engineers use Claude Code to write the PRs. The reviewer flagged style issues and tiny refactors. It rarely flagged logic problems, because the logic problems were the ones the original model had also missed. The reviewer was a confidence machine, not a check.
None of these stories require malice or incompetence. They require ordinary engineers using tools the way the tools want to be used.
What independence actually looks like
The fix isn't to stop using AI for any of these jobs. The fix is to make sure the steps in your pipeline don't all derive from the same model run.
A few things that work, each of which I've watched teams do.
Write tests from the spec, not from the code. If you're generating tests with an AI tool, give it the requirements document or the ticket description as the source. Don't give it the implementation. The whole point of a test is to encode an expectation that exists independently of the code. Generating tests from the code defeats that purpose, no matter how good the model is.
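As a concrete contrast with the pagination story above, here's what a spec-anchored test looks like. This is a sketch; the helper name and ticket wording are illustrative. The assertion is transcribed from the ticket, so a zero-based implementation fails it instead of agreeing with it:

```python
def get_page(items, page, page_size=10):
    # Written to the spec: pages are one-indexed.
    start = (page - 1) * page_size
    return items[start:start + page_size]

def test_page_one_matches_ticket():
    # Assertion copied from the ticket ("page 1 returns the first ten
    # items"), written without reading the implementation.
    items = list(range(1, 101))  # items numbered 1..100
    assert get_page(items, page=1) == list(range(1, 11))
```

The test's value comes from where the expectation originated, not from how thorough it looks.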
Use a different model for review than for authoring. If your engineers write code with Claude, run your review bot on a different model family. The errors won't correlate as cleanly. You'll catch a class of mistakes that two passes of the same model would have agreed on.
Keep at least one human eye on the contract layer. API shapes, database migrations, public function signatures, anything other systems depend on. These are the places where a hallucinated detail becomes a production incident. Let the AI write the implementation. Let a human sign off on the interface.
Run the tests against something real before merging. Not mocks generated in the same session as the code. A staging environment, a real database, a real downstream service. The friction is the point. It's the place where ground truth gets to talk back.
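One low-ceremony way to enforce this, sketched with Python's `unittest`. The `STAGING_URL` variable and the endpoint path are conventions I'm assuming, not a standard; the idea is that the integration suite is gated on a real endpoint, so it gets skipped loudly rather than silently satisfied by mocks:

```python
import json
import os
import unittest
import urllib.request

# Set by CI when a staging environment is available, e.g.
# STAGING_URL=https://staging.example.com (hypothetical).
STAGING_URL = os.environ.get("STAGING_URL")

class FirstPageIntegrationTest(unittest.TestCase):
    @unittest.skipUnless(STAGING_URL, "set STAGING_URL to run against a real service")
    def test_first_page_from_deployed_service(self):
        # Hypothetical endpoint. The point is that the response comes
        # from a deployed service, not a mock generated alongside the code.
        with urllib.request.urlopen(f"{STAGING_URL}/items?page=1") as resp:
            items = json.load(resp)
        self.assertEqual(len(items), 10)
```

A skipped test shows up in CI as a visible hole. A mock that agrees with the code shows up as a pass, which is an invisible one.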
The honest tradeoff
These fixes cost speed. That's the trade. The closed loop is fast precisely because it removed all the friction. Putting the friction back puts the bugs back where you can catch them: outside the loop.
Most teams will not do this until they ship a forty-minute payment outage of their own. The economics of "nine green checks and a clean PR description" are too good to give up voluntarily. The teams that do invest in independence early are the ones who'll spend the next year shipping fewer outages than their competitors and won't be able to fully explain why.
The original promise of AI coding tools was that they'd handle the boring parts so engineers could focus on judgment. That promise still holds. The catch is that the boring parts include the parts where independent judgment was hiding. Tests, reviews, and PR descriptions felt like overhead. They were also the places where a second brain looked at the same problem and caught the thing the first brain missed.
You can automate any one of those steps and come out ahead. Automate all of them at once and you've built a machine that grades its own homework. The grades will be excellent. The homework won't be.