
84% of Us Use AI Code Review. Only 29% Trust It. I Built One That Had to Earn It.
There's a pair of numbers from this year's developer surveys that should embarrass our entire industry: 84% of developers now use AI coding tools, and only 29% trust the output. Read that again. We have collectively adopted a coworker we don't believe.
Nowhere is that gap more visible than code review. Depending on whose survey you read, somewhere between 40 and 50% of professional developers now have some form of AI reviewing their pull requests — up from maybe 15–20% in 2024 — and GitHub's numbers show well over a million repositories running an AI review integration. "Agentic code review" is officially a trend. The listicles have arrived.
I'm in that statistic twice. I use these tools, and I built one — a GitHub PR-reviewer agent on LangChain and Mastra that has been reviewing real pull requests for months now. So when I say the first version of my reviewer deserved every bit of that 71% distrust, I'm not dunking on the hype from the sidelines. I'm telling you what it cost to fix. This is the honest version.
What I actually built
The shape of the thing is simple — and honestly, the simplicity is part of the trap. A GitHub webhook fires when a PR opens or updates. A small service pulls the diff and the PR metadata. A Mastra agent runs the review loop, with LangChain supplying the tools it can call: fetch full file contents, pull related tests, read the PR description. Findings come back as structured output and get posted as review comments.
const reviewer = new Agent({
name: "pr-reviewer",
instructions: REVIEW_PROMPT,
model: anthropic("claude-sonnet-4-6"),
tools: { getFileContents, getDiffHunks, getRelatedTests, getPrMetadata },
});Why two frameworks? No grand theory. Mastra gave me the agent loop and typed workflow primitives I wanted; LangChain had the mature tool ecosystem I was already using. Could you build this with either one alone? Yes. The combination is an artifact of when I started, not a recommendation.
The afternoon I got it working, it caught a genuine off-by-one in a date-range filter on its very first review. I felt like I'd hired a tireless senior engineer for cents. That feeling lasted about a week. Then the taxes came due — three of them.
Tax #1: It reviewed like it was paid per comment
The first real PR my agent reviewed was about 40 changed lines. It left 23 comments.
Variable naming suggestions. "Consider extracting this into a helper." Three separate notes about things the linter had already flagged. A meditation on whether for...of would be more idiomatic than forEach. Somewhere in there, buried at comment 17, was one genuinely useful observation about a missing error case — and nobody saw it, because by then everyone had stopped reading.
Here's the thing the demos never show: an AI reviewer that comments on everything is worse than no reviewer at all, because it trains your team to ignore the bot. One real finding drowned in nitpicks has the same value as zero findings, except it also costs you goodwill.
The fix was less about prompting and more about plumbing. Every finding now has to come back through a severity schema, and the posting layer is ruthless about what gets through:
const finding = z.object({
severity: z.enum(["blocker", "bug", "improvement", "nitpick"]),
confidence: z.number().min(0).max(1),
file: z.string(),
line: z.number(),
body: z.string(),
});
// Only `bug` and above get posted, ranked by severity.
// Hard cap: five comments per review.
// Nitpicks collapse into one optional summary comment.I also added one line to the system prompt that did more work than any other: "If the linter could have said it, don't." Comment volume dropped by roughly 80%, and — this is the part that matters — people started reading the comments again.
Tax #2: Confidently wrong
Noise erodes trust slowly. Hallucinations torch it instantly.
My favorite specimens from v1: it flagged a "potential race condition" in code that was single-threaded by construction. It told a teammate a null check was missing when the check sat ten lines above the diff hunk, just outside the context it was shown. It once cited a function signature that did not exist anywhere in the repository.
The root cause wasn't the model being stupid. It was me being cheap. I was feeding it the diff and nothing else, and a diff is a terrible unit of comprehension — it answers "what changed" but not "what is true here." Asking for a confident review of a diff in isolation is asking to be lied to.
Two fixes. First, grounding: the agent got tools to pull the full file, its imports, and the PR description before forming an opinion, instead of free-associating over a patch. Second — and this is the one I'd tell you not to skip — a verification pass. Before anything gets posted, a smaller, cheaper model re-reads each finding against the actual file contents and answers one question: is this claim supported by the code shown? In the early weeks, that pass was killing roughly a third of the findings. A third. Every one of those was a comment that would have made a developer trust the bot a little less.
The principle underneath both fixes: precision over recall. A missed bug costs you nothing you didn't already have — humans still review. A wrong comment costs trust you can't easily buy back.
Tax #3: The bill
Nobody puts the unit economics in the launch post, so here are mine. The naive version — full files in context, the big model end to end, no caching — cost around $0.40 per medium-sized PR and took three to four minutes to come back. For one repo and a handful of PRs a day, that's a rounding error. Scale it to a team's full PR volume and it's a real line item, and four minutes of latency means the review often lands after the human one did. Which defeats the point.
Three changes fixed the economics without gutting the quality. Review changed hunks by default and let the agent fetch wider context only when it decides it needs it — that one change cut input tokens by more than half. Tier the models: the big model does the actual review pass, the small one handles triage and the verification step. And cache the stable stuff — the system prompt, the repo conventions document — which provider-side prompt caching makes nearly free on repeat reviews.
Where it landed: about $0.08 per PR on average, with reviews posting in under a minute. Your numbers will vary with PR size and model choice; the ratios are what matter.
What it looks like now
The current version is, above all, quiet. Two or three comments on a typical PR, sometimes zero — and zero is a feature. It's best at exactly the things tired humans miss: a renamed field that didn't get renamed in one last place, an edge case the new branch doesn't handle, a contract quietly drifting between a function and its callers. It does not argue about style. It does not block merges. A human is still the gate, and that's not a temporary concession — it's the design.
And the strange thing is that doing less is what made people trust it more. The 29% trust number from the surveys isn't really about model capability. It's about product behavior. Silence when there's nothing worth saying turns out to be the most trust-building feature an AI reviewer can ship.
Verdict: it works — and the caveats are the product
So: does it work? Yes. Genuinely. It catches real bugs every week that humans would plausibly have missed, and the team treats its comments as signal now. But I want to be precise about what "it" is. The agent loop — webhook, diff, model, comments — was an afternoon. Everything that makes it worth listening to is the caveats: the severity gating, the grounding, the verification pass, the cost engineering. That took months, and it's 90% of the code.
Should you build one? Buy, if what you want is the outcome — the off-the-shelf tools have converged on most of these lessons and they're decent now. Build, if your repos have conventions and private context an off-the-shelf tool can't see, or if you want to actually understand how agentic systems fail. There is no better teacher than your own bot confidently inventing a race condition in single-threaded code.
Either way, hold whatever you adopt to the standard the demos skip: not "can it find bugs" but "is it right enough, quiet enough, and cheap enough that your team still reads it in month three." That gap between 84% adoption and 29% trust isn't a model gap. It's everyone shipping the afternoon version. Mine deserved the distrust too. The difference is what you do in the months after.
More writing