The bottleneck moved to code review

The constraint in shipping software has quietly moved. For most of the last decade, the thing that limited how fast a team shipped was how fast people could write correct code. Agentic coding tools relaxed that limit, the pressure slid one stage downstream, and it now sits on review. The industry’s reflex has been to point the same agents at the review queue—and that reflex optimizes the one thing code review was never really about.

The shift itself is not subtle. In a controlled experiment, developers given an AI pair programmer finished a defined programming task 55.8% faster than a control group.¹ Multiply even a fraction of that across a team and the authoring side of the pipeline accelerates while the review side, still bound to human attention, does not. The engineering-intelligence vendor Faros AI, drawing on two years of workflow telemetry from 22,000 developers across 4,000 teams, reports that as AI adoption climbed, the median time a change spends in review rose roughly fivefold and the average pull request grew about 51%.² Same reviewers, more changes, bigger changes, longer queue. The bottleneck did not vanish when we automated writing. It relocated.

The constraint moved; it didn’t disappear

Widen one stage of a pipeline and leave the next one alone, and pressure builds at the stage you didn’t touch. That is the entire mechanism, and it is worth stating plainly because the marketing around it is loud enough to obscure something this simple: if you double how fast code is produced and hold review capacity flat, you get a review backlog, not a delivery speedup.
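The mechanism can be checked with a toy queue model. The sketch below is illustrative only: the function name `simulate_backlog`, the ten-day window, and the rates of 10 and 20 changes per day are assumptions picked to make the arithmetic easy to follow, not figures from any of the cited studies.

```python
def simulate_backlog(days, authored_per_day, review_capacity_per_day):
    """Toy model of a two-stage pipeline: authoring feeds a review queue.

    Each day `authored_per_day` changes enter the queue and at most
    `review_capacity_per_day` changes leave it (reviewed and merged).
    Returns the end-of-day backlog for every day and the total merged.
    """
    backlog = 0
    merged_total = 0
    history = []
    for _ in range(days):
        backlog += authored_per_day                       # new changes arrive
        reviewed = min(backlog, review_capacity_per_day)  # review is the cap
        backlog -= reviewed
        merged_total += reviewed
        history.append(backlog)
    return history, merged_total


# Hypothetical numbers: review capacity stays at 10 changes per day.
before, merged_before = simulate_backlog(10, authored_per_day=10, review_capacity_per_day=10)
after, merged_after = simulate_backlog(10, authored_per_day=20, review_capacity_per_day=10)

print(before)                       # backlog per day at the original authoring rate
print(after)                        # backlog per day with authoring doubled
print(merged_before, merged_after)  # changes actually delivered in each run
```

In this model, doubling the authoring rate leaves the number of merged changes unchanged while the queue grows by ten changes every day. Delivery is set by the slower stage, which here is review.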

GitHub’s own framing gives the game away. When Copilot code review reached general availability—a feature more than a million developers had used within about a month of its public preview—the pitch was that you can hand the routine pass to an agent that finds bugs and suggests fixes, so you can keep moving “while waiting for a human review.”³ Read that second clause twice. The waiting is for the human. The agent fills the idle time; the human is still the gate. As a description of where the constraint actually sits, it is honest.

So the tempting next move is obvious: make the human gate unnecessary. Stop having the agent merely comment, and let it approve. Clear the queue. If review is what is slow, and an agent can review, then point the agent at review and the problem dissolves. The logic is clean. It also rests on a premise—that the point of review is to process changes quickly—that the evidence does not support.

What review was actually for

The research on modern code review has been remarkably consistent for more than a decade, and it says something the throughput framing leaves out: finding defects is the stated motivation for review, but it is not where most of the value lands.

Bacchelli and Bird studied tool-based review across diverse teams at Microsoft, combining direct observation, interviews, surveys of more than a thousand developers and managers, and a hand-classification of hundreds of real review comments. They found that defect-related comments were a small proportion of the total, that the ones that did appear mostly covered minor low-level issues, and that the larger realized benefits were knowledge transfer, increased team awareness, and better solutions to the problem under discussion.⁴ Their central observation was that understanding the change—its context and intent—is the actual work of reviewing, and that most tools barely support it.

The Google study reached a compatible conclusion from a very different vantage point. Analyzing logs from nine million reviewed changes alongside interviews and a developer survey, the authors describe a process that is deliberately lightweight—usually a single reviewer, small changes, quick turnaround—and valued primarily as an educational and norm-setting practice rather than a defect net.⁵ They situate it among a set of findings that converge across many organizations, including one stated without hedging:

Review has changed from a defect finding activity to a group problem solving activity.⁶

This is where the automation reflex misfires. AI reviewers are genuinely good at the mechanical layer: the style inconsistency, the missing null check, the obvious off-by-one, the test you forgot to add. That layer is real and worth automating. But it was always the smaller, cheaper part of the job—the part a linter and a careful author already caught most of. The expensive part, the part the research keeps pointing at, is the human transfer that happens when a second engineer has to understand your change well enough to have an opinion about it: the shared context, the “why is it done this way,” the junior who learns the system by reading senior changes, the senior who catches a design problem no diff-level tool can see because the problem is not in the diff. An agent can tell you the code is wrong. It cannot make your team know the code.

There is a second-order effect that makes this worse, and it runs through the author. When an engineer writes a change by hand, they arrive at review having already thought it through; the reviewer is checking work the author understands. When an agent writes the change, that is not guaranteed. The author may have read it, or may have skimmed it and trusted the tool—and the larger pull requests that come with agentic development make a thorough self-read less likely, not more.⁷ In that case the human review is no longer a second pair of eyes on a well-understood change. It is sometimes the first time any human engages with the logic at all. My view is that this raises the stakes of the human review rather than lowering them, which is the precise opposite of the assumption baked into “let the agent approve it.”

The rubber stamp

There is a specific failure mode waiting at the end of this, and it is not hypothetical. When an automated check sits in front of a human under time pressure, people defer to it. The human-factors literature has a name for this—automation complacency—and a well-replicated finding: it shows up precisely under high task load, when other work competes for attention; it affects experts as readily as novices; and it cannot be trained or instructed away.⁸ Aviation has studied this for decades, because in safety-critical systems the cost of a human quietly trusting an automated signal that happens to be wrong is measured in wreckage. The pattern does not require anyone to be lazy or incompetent. It requires only that the automation is usually right and the human is busy.

A review queue under agentic-development load is an almost perfect incubator for it. The pull requests are more numerous and larger than they used to be. An agent has already looked at each one and left a tidy set of comments and, increasingly, an approval. The reviewer is behind. The path of least resistance—glance at the green check, skim, approve—is not a character flaw; it is the predictable output of the situation we built.

[Image: a wooden rubber stamp resting beside a glowing blue check inside a circle, on a surface of dark printed documents. Caption: The mark is cheap; the reading it stands for is not.]

And it is already measurable. The same Faros telemetry that recorded review times climbing also recorded that 31.3% more pull requests are now merging with no review at all.⁹ I want to be careful here, because that is one vendor’s data and it measures correlation, not a controlled cause. But it is consistent with the mechanism, and the mechanism is well understood. The risk is not that AI review is bad. It is that an AI approval comes to mean “the review happened,” and the human layer—the knowledge transfer, the shared ownership, the design judgment that the research says was the actual product of review—silently stops happening, at exactly the moment the volume of code that needs understanding is going up.

The size trend sharpens the point. The same telemetry shows pull requests growing by roughly half under AI adoption.¹⁰ A bigger change is harder to hold in your head, which means even the reviews that do get a human’s attention are getting a worse one: more surface, less depth, more of the diff skimmed rather than understood. So the two pressures compound. There are more changes than there used to be, each is larger than it used to be, and the easiest way to cope with both is to lean harder on the green check. Nothing in that loop corrects itself. It has to be designed against.

What the standard should protect

So here is my read, marked clearly as opinion. The agents belong in the review loop. Not as the gate, but as the thing that clears the mechanical layer before a human spends a minute on it—and the standard should say so explicitly, in two separable parts.

Let the agent own the mechanical pass, and gate hard on it. Style, obvious defects, missing tests, security lint: if the bot is unhappy, the change does not move, no human time required. That is throughput work, and automating it is pure gain. Then design the human review around the things the human is the only one who can do, and stop measuring that review by how fast the queue drains. Queue depth is a capacity metric; it tells you nothing about whether anyone learned anything. A standard that rewards reviewers for closing changes fast is a standard that will, with perfect efficiency, produce rubber stamps. Keep changes small enough that a human can actually hold them in their head—the research that predates this whole moment already said small changes were the convergent practice, and bigger AI-authored pull requests are pushing the wrong way. And treat an AI approval as input to the review, never as the review.
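One way to write that standard down is as a merge gate. What follows is a minimal sketch, not the API of any real code-hosting platform: the `PullRequest` fields, the `merge_decision` function, and the 400-line size limit are all hypothetical, chosen only to show the shape of the policy.

```python
from dataclasses import dataclass

# Hypothetical size limit; a real team would pick its own number.
MAX_REVIEWABLE_LINES = 400


@dataclass
class PullRequest:
    lines_changed: int
    bot_checks_passed: bool   # style, obvious defects, missing tests, security lint
    bot_approved: bool        # recorded as input, deliberately not a merge condition
    human_approvals: int


def merge_decision(pr: PullRequest) -> str:
    """Apply the two-part standard: the bot gates first, a human gates last."""
    # Part 1: the mechanical pass is a hard gate and costs no human time.
    if not pr.bot_checks_passed:
        return "blocked: mechanical checks failing"
    # Keep changes small enough for a person to hold in their head.
    if pr.lines_changed > MAX_REVIEWABLE_LINES:
        return "blocked: split the change"
    # Part 2: a bot approval never substitutes for a human approval.
    if pr.human_approvals < 1:
        return "waiting: needs human review"
    return "mergeable"


# A bot-approved change with no human approval still waits.
print(merge_decision(PullRequest(120, True, True, 0)))
```

The design choice that matters is that `bot_approved` is never read by `merge_decision`. The bot's verdict can be shown to the reviewer, but only the mechanical checks and the human approval move the change. Nothing in the gate measures how fast the queue drains.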

Two specifics follow from that. First, make the human reviewer accountable for understanding the change, not for clearing it—the question at sign-off is “could I explain why this is built this way,” not “did the bot find anything.” That is the part automation cannot do on your behalf, and it is the part worth a person’s time. Second, watch for the failure where the human review degenerates into reviewing the agent’s review—reading the bot’s comments and adjudicating them rather than reading the code. That feels like diligence and produces almost none of what review is for. The agent’s comments are a starting point for the human’s own reading, not a substitute for it.

None of this is anti-automation. It is the opposite: it is taking automation seriously enough to be specific about which half of the work it is doing. The mistake is not using agents to review code. The mistake is letting “the queue is faster now” stand in for “the review is working,” when the two were never the same thing. A review that a human only rubber-stamps is not a faster review. It is no review, wearing the costume of one.

Notes

1. Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer, “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot,” arXiv:2302.06590, February 13, 2023. Controlled experiment; the 55.8% figure is for a single defined task (implementing an HTTP server in JavaScript). https://arxiv.org/abs/2302.06590
2. Faros AI, “The AI Engineering Report 2026: The Acceleration Whiplash,” 2026. Vendor telemetry across 22,000 developers and 4,000 teams (measured workflow data, not a survey). This reference: a roughly fivefold increase in median review time and a ~51% increase in pull request size. Vendor research, cited as such. https://www.faros.ai/research/ai-acceleration-whiplash
3. GitHub, “Copilot code review now generally available,” GitHub Changelog, April 4, 2025. Product announcement (vendor); source of the “offload basic reviews” / “while waiting for a human review” framing and the figure of over one million developers within about a month of public preview. https://github.blog/changelog/2025-04-04-copilot-code-review-now-generally-available/
4. Alberto Bacchelli and Christian Bird, “Expectations, Outcomes, and Challenges of Modern Code Review,” Proceedings of the 35th International Conference on Software Engineering (ICSE), IEEE, May 2013, pp. 712–721. https://www.microsoft.com/en-us/research/publication/expectations-outcomes-and-challenges-of-modern-code-review/
5. Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli, “Modern Code Review: A Case Study at Google,” Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), May 2018, pp. 181–190. https://doi.org/10.1145/3183519.3183525
6. Sadowski et al., “Modern Code Review: A Case Study at Google,” ICSE-SEIP 2018, Table 1 (convergent practice CP5), drawing on Peter C. Rigby and Christian Bird, “Convergent Contemporary Software Peer Review Practices,” Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2013), pp. 202–212. https://doi.org/10.1145/2491411.2491444
7. Faros AI, “The AI Engineering Report 2026: The Acceleration Whiplash,” 2026. This reference: the reported ~51% increase in pull request size under AI adoption. Vendor telemetry, cited as such. https://www.faros.ai/research/ai-acceleration-whiplash
8. Raja Parasuraman and Dietrich H. Manzey, “Complacency and Bias in Human Use of Automation: An Attentional Integration,” Human Factors, vol. 52, no. 3, June 2010, pp. 381–410. https://doi.org/10.1177/0018720810376055
9. Faros AI, “The AI Engineering Report 2026: The Acceleration Whiplash,” 2026. This figure: 31.3% more pull requests merging without review. Vendor telemetry, cited as such. https://www.faros.ai/research/ai-acceleration-whiplash
10. Faros AI, “The AI Engineering Report 2026: The Acceleration Whiplash,” 2026. This figure: pull request size up roughly 51% under AI adoption. Vendor telemetry, cited as such. https://www.faros.ai/research/ai-acceleration-whiplash

