The Macro: AI Wrote the Bug, AI Has to Find It
Here’s the problem nobody wants to say out loud: AI coding tools created the very problem that AI code review tools now exist to solve. Copilot, Cursor, and Claude Code itself have made individual engineers dramatically more productive. Code output per Anthropic engineer reportedly grew 200% in the last year, according to their own blog. That number sounds like a win until you think about what’s on the other side of it. More code means more PRs. More PRs means review becomes the bottleneck, and human reviewers can’t skim-read their way through AI-generated output the way they skimmed through human-written code.
The AI code tools market is projected to hit $37.34 billion by 2032, according to SNS Insider, growing at a clip that makes the underlying problem pretty clear. Everyone is shipping more. Review hasn’t scaled with it.
The competitors in this space are real. Devin Review, from Cognition, launched in January 2026 and is probably the most direct comparison, according to emelia.io. CodeAnt has written up the comparison directly. GitHub’s own Copilot review features exist and are baked into a platform most teams already pay for. The question for any of these tools isn’t whether AI can catch bugs (it can, sometimes) but whether it catches the right bugs without burying you in false positives that train engineers to ignore the alerts.
That false positive problem is actually the hard part. I’ve seen teams deal with alert fatigue in AI tooling before, and it’s a real culture issue, not just a UX one. If your review bot cries wolf on every PR, engineers route around it. That’s worse than no bot.
The Micro: Multi-Agent, Per-PR, and It’ll Cost You
What Claude Code Review actually does is dispatch multiple agents on every pull request in parallel, each one analyzing different aspects of the code, then verifying findings against each other before surfacing anything to the developer. The verification step is the product decision I find most interesting. It’s not just one model pass. It’s agents checking each other’s work specifically to reduce noise before anything hits your review queue.
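One rough way to picture that verification step is a quorum filter over parallel agent passes: a finding only surfaces if more than one independent pass reports it. Everything below is a hypothetical sketch, not Anthropic’s published architecture; the agent names, the finding format, and the quorum threshold are my own inventions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical specialist agents. In a real system each would be a
# separate model call with a different prompt and focus area; here they
# just return hard-coded finding IDs for a given diff.
def security_agent(diff):
    return {"sql-injection:L12", "unused-import:L3"}

def logic_agent(diff):
    return {"off-by-one:L40", "sql-injection:L12"}

def style_agent(diff):
    return {"unused-import:L3", "sql-injection:L12"}

AGENTS = [security_agent, logic_agent, style_agent]

def review(diff, quorum=2):
    """Run all agents in parallel, then keep only findings that at
    least `quorum` agents independently reported."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda agent: agent(diff), AGENTS))
    all_findings = set().union(*results)
    return sorted(f for f in all_findings
                  if sum(f in r for r in results) >= quorum)

# "off-by-one:L40" was reported by only one agent, so it is filtered out.
print(review("fake diff"))
```

The design point is the trade: demanding agreement between independent passes sacrifices some recall to buy precision, which is exactly the signal-to-noise bet that decides whether engineers trust the bot or route around it.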
According to the Anthropic blog, this is modeled on the internal system they run on nearly every PR at Anthropic. That’s either a strong signal or good marketing. Probably some of both. But shipping your internal tooling externally is at least a more honest origin story than most.
It’s in research preview right now, available for Team and Enterprise tiers only. No solo plan, no free tier.
The pricing is where things get interesting and also a little uncomfortable. Claude Code Max costs $200 per month as a subscription. Claude Code Review, according to multiple sources including ZDNet and emelia.io, costs $15 to $25 per review. Per review. If your team is merging ten PRs a day, that math gets unwieldy fast. It’s clearly priced for high-stakes code, not high-volume teams.
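Taking the reported figures at face value, the monthly math looks like this. The workload assumptions (ten merged PRs per workday, 21 workdays a month) are mine, not from any source:

```python
# Back-of-envelope monthly cost at the reported $15–$25 per-review price.
# Assumed workload: 10 merged PRs per workday, 21 workdays per month.
prs_per_day = 10
workdays_per_month = 21

for per_review in (15, 25):
    monthly = per_review * prs_per_day * workdays_per_month
    print(f"${per_review}/review -> ${monthly:,}/month")
```

Even the low end of that range is more than fifteen times the $200-a-month Max subscription, which is why “priced for high-stakes code, not high-volume teams” is the only reading that makes sense.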
It got solid traction on launch day, which tracks for anything Anthropic ships with a clear developer use case.
The focus on AI-generated code specifically is worth paying attention to. This isn’t positioned as a general linter. Claude Code already has known quirks around context and memory, and a review layer that’s aware of how AI code tends to fail (pattern repetition, subtle logic errors, confident-sounding hallucinations) is a different thing than a tool that was built for human-written code and adapted.
Whether that specialization holds up in practice is the open question.
The Verdict
I think this is a real product solving a real problem, and the internal dogfooding story gives it more credibility than a cold launch. But the per-review pricing is a genuine blocker for most mid-size teams, and Anthropic knows that, which is probably why it’s still in research preview.
At 30 days, I’d want to know how the false positive rate compares to GitHub Copilot review in actual production environments, not curated demos. That’s the number that determines whether engineers adopt this or learn to ignore it.
At 60 days, the question is whether Enterprise customers renew the usage or quietly stop running it on every PR because the costs normalized uncomfortably.
The multi-agent verification architecture is the right instinct. The coding AI space is crowded and the teams that win will be the ones that get signal-to-noise right, not just accuracy on benchmarks. Anthropic has the model quality to compete. Whether $25 a review is a price the market accepts is a different bet entirely, and I genuinely don’t know how that one lands.