Most agentic AI pilots die in the long tail

In February 2024, Klarna told the market that its OpenAI-powered customer service agent was doing the work of 700 human agents, handling two-thirds of customer chats in its first month and operating in more than 35 languages.¹ The story became the canonical demo of agentic AI replacing labor. Fifteen months later, in a May 2025 Bloomberg interview, CEO Sebastian Siemiatkowski conceded that the AI-only approach had produced “lower quality” output and that Klarna was rehiring humans on flexible contracts to restore the customer experience the chatbot had eroded.² The pilot had been a success on the metric Klarna’s leadership chose to publish. The production system was a failure on a metric the customer noticed first.

Deloitte’s most recent enterprise survey on agentic AI puts numbers around the same dynamic. As of late 2025, 30% of organizations are exploring agents, 38% are piloting them, 14% have something ready to deploy, and only 11% have agents actually running in production.³ Gartner, on a separate dataset, predicts that more than 40% of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls.⁴

The gap between pilot and production is the central operational story of agentic AI right now. The popular diagnoses for why so few projects survive that gap (governance, integration, ROI, talent) are not wrong. They are downstream. The mechanism that kills these projects is more specific, and noticing it changes what you do about it.

The popular diagnoses are downstream

If you read the analyst coverage of agentic AI cancellations, the same four causes appear in roughly the same order. Governance is missing. Integration with legacy systems is harder than expected. The ROI case never closes. Talent to operate the system is hard to hire. Each of these is true on its own terms and traceable to specific research. Deloitte’s January 2026 follow-up on agent scaling found that only 21% of organizations describe their governance for agentic AI as mature, and that roughly four in five lack the audit trails, decision boundaries, and monitoring they would need to govern an agent the way they govern any other production system.⁵

What that list misses, I think, is that these are descriptions of the wreckage, not the cause. An agentic project that has been running in production for eighteen months is going to develop governance, because production exposes the absence of it. An agentic project that lives in a pilot environment is governed by the champion’s email thread and a few rules in a config file, and that is sufficient for as long as the pilot lives. The difference between the pilot and the production version of the same agent is not that one is governed and the other is not. The difference is that the pilot runs against a curated slice of reality and the production system runs against the whole thing.

Gartner’s June 2025 press release on the 40% cancellation prediction includes a useful line from senior director analyst Anushree Verma: “Most agentic AI projects right now are early stage experiments or proof of concepts that are mostly driven by hype and are often misapplied.”⁴ The interesting question is what makes a proof of concept fail in production after appearing to succeed in pilot. The answer is not always hype. Sometimes the demo was real. The demo was just measured on a slice of the world the production system never gets to choose.

Pilots optimize for the median path

A pilot has a champion. The champion picks the use case, the data, the test users, and the success metric. The pilot runs in an environment where the prompts are well-formed because the people writing them know what they are testing. The data is clean because it was selected for the pilot. The users are early-adopter colleagues with a high tolerance for “let me try that again.” The agent is evaluated on the median path it was designed for.

Production has none of those handles. Customers do not write well-formed prompts; they write angry, vague, contradictory, partial prompts at one in the morning. The data is whatever the production pipeline produces, including the rows that are malformed, stale, or missing fields the pilot data did not have. The users are not early adopters; they are the median customer, plus a long tail of customers whose inputs the pilot designers never imagined. The agent is evaluated on every path, weighted by how much the worst path hurts.
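A toy scoring sketch makes the difference in evaluation concrete. The path names, traffic shares, success rates, and harm weights below are invented for illustration and are not drawn from any of the cited cases. The point is only that the same agent can score well on a curated sample and badly once every path is counted and weighted by what a failure costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    traffic_share: float     # fraction of production traffic (hypothetical)
    success_rate: float      # fraction handled correctly (hypothetical)
    harm_per_failure: float  # relative cost of one failure (hypothetical)


# Illustrative numbers only; traffic shares sum to 1.0.
PATHS = [
    Path("well-formed request", 0.70, 0.98, 1.0),
    Path("vague or partial request", 0.20, 0.80, 3.0),
    Path("malformed or stale data", 0.08, 0.50, 10.0),
    Path("unanticipated input", 0.02, 0.10, 50.0),
]


def pilot_score(paths, curated_names):
    """Success rate measured only on the paths the champion selected."""
    chosen = [p for p in paths if p.name in curated_names]
    total_share = sum(p.traffic_share for p in chosen)
    return sum(p.traffic_share * p.success_rate for p in chosen) / total_share


def production_success_rate(paths):
    """Success rate over all traffic, with no path excluded."""
    return sum(p.traffic_share * p.success_rate for p in paths)


def expected_harm_by_path(paths):
    """Expected harm per request, split by path: share x failure rate x harm."""
    return {
        p.name: p.traffic_share * (1.0 - p.success_rate) * p.harm_per_failure
        for p in paths
    }


if __name__ == "__main__":
    print("pilot score:", pilot_score(PATHS, {"well-formed request"}))
    print("production success rate:", production_success_rate(PATHS))
    for name, harm in expected_harm_by_path(PATHS).items():
        print(f"expected harm from {name}: {harm:.3f}")
```

With these made-up numbers the pilot reports 98%, production succeeds on about 89% of requests, and the path that carries 2% of traffic accounts for most of the expected harm.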

The McDonald’s drive-through partnership with IBM, ended in mid-2024 after roughly three years of testing across more than 100 restaurants, is the clearest cited example of this pattern outside customer support.⁶ The pilot proved the technology could take voice orders. Production proved that voice orders include background noise from the next lane over, accents the model was not tuned for, drunk orderers, customers correcting themselves mid-sentence, and the now-famous failure modes — nine sweet teas, bacon on ice cream — that turned into screenshots before turning into a decision to end the test.⁶ The pilot was not lying about what the system could do. It was telling the truth about a narrower slice of what the system would be asked to do.

[Illustration: a developer sits with head in hands before a monitor showing a long project checklist with only 'AI Generation' checked; on the wall behind, a dart pins a note labeled 'Production' to the outer ring of a dartboard. Caption: The agent shipped one item. The dart found the rest.]

The pilot tells the truth about a slice of the world. The production system tells the truth about the rest of it.

This is the asymmetry that kills agentic projects, and it is not specific to consumer-facing surfaces. An internal agent that summarizes engineering tickets works flawlessly on the tickets in the pilot sample, then runs into the ticket from the team that uses a different template, the ticket that was filed as a screenshot, the ticket that contains a customer’s full chat log pasted into the description field. The summary metric on the pilot was 92%. The complaint volume in week three of production is what closes the project.

The Klarna case is the same pattern in customer-support clothing. A pilot can show that an agent handles two-thirds of chats. A production system reveals which third it does not, what those customers were trying to do, and how loudly they say so on social media when the agent fails them on something Klarna’s brand depends on. The cost saving is real. The damage is also real. The accounting catches up.

The cancellation rate is the market correcting

Gartner’s 40% cancellation prediction is usually framed as a warning. I think it is mostly a sign that the market is correcting faster than it did in any of the previous AI cycles. The reasoning here is straightforward. In earlier waves — the chatbot boom of 2016, the enterprise machine-learning push of 2018, the conversational AI investments of 2020 — projects without honest accounting lingered on roadmaps for years, because the inference cost was buried inside a multi-year platform contract and the failures were not visible until the renewal conversation. Agentic AI changes the timing on both ends.

The inference cost shows up on a monthly invoice. The CFO can see, every thirty days, how much each agent is costing and whether the cost per resolved task is going up or down. That is a forcing function the last cycle did not have at the same cadence. Gartner’s underlying January 2025 poll, of 3,412 webinar attendees, found that 19% of organizations had made significant investments in agentic AI, 42% had made conservative investments, 8% had made none, and 31% were on a wait-and-see footing.⁴ The conservative-investment group is the one to watch. They are running cheap-enough pilots that they can kill them when the per-call accounting goes upside down, and they do.
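The per-task accounting is simple arithmetic, and a minimal sketch shows how little it takes to run it every month. The invoice figures and the `human_cost_per_task` benchmark here are hypothetical, and "resolved" is assumed to mean resolved without human rework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyInvoice:
    month: str
    inference_cost: float  # total spend on the invoice (hypothetical)
    tasks_attempted: int
    tasks_resolved: int    # assumed: resolved without human rework


def cost_per_resolved_task(invoice):
    """Spend divided by tasks actually resolved, not tasks attempted."""
    if invoice.tasks_resolved == 0:
        return float("inf")
    return invoice.inference_cost / invoice.tasks_resolved


def months_over_benchmark(invoices, human_cost_per_task):
    """Months in which the agent cost more per resolved task than the benchmark."""
    return [
        inv.month
        for inv in invoices
        if cost_per_resolved_task(inv) > human_cost_per_task
    ]


# Illustrative figures: volume grows while the resolved count falls.
INVOICES = [
    MonthlyInvoice("2025-01", 4000.0, 2000, 1600),
    MonthlyInvoice("2025-02", 6000.0, 2500, 1500),
    MonthlyInvoice("2025-03", 9000.0, 3000, 1200),
]

if __name__ == "__main__":
    for inv in INVOICES:
        print(inv.month, round(cost_per_resolved_task(inv), 2))
    print("over benchmark:", months_over_benchmark(INVOICES, human_cost_per_task=5.0))
```

In this example the cost per resolved task rises from 2.50 to 4.00 to 7.50, and only the third month crosses the assumed benchmark of 5.00. Dividing by resolved tasks rather than attempted ones is the design choice that matters: attempts keep growing in all three months, so a per-attempt figure would hide the deterioration.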

The failures are visible too. An agent that misbehaves does so in a way that produces a screenshot, a complaint, a postmortem, or in the unfortunate cases, a regulatory inquiry. A poorly-targeted machine-learning model in 2018 produced bad recommendations that nobody screenshotted. An agent that rewrites a customer’s contract or sends an email it should not have sent ends up on a vice president’s desk. The failure mode is more legible, which means the decision to kill the project happens faster.

In my view, an agentic AI project that does not survive its first eighteen months in production is, in most cases, a project that should not have survived. The cancellation rate is the market doing what the last cycle did not: pricing failure honestly and on a short cycle. The projects that survive, fewer than 60% on Gartner’s forecast, are the ones whose teams solved the production problem, not the ones that got lucky with a generous CFO.

There is a counter-reading, which I will state because the data is mixed. Some of the 40% will be killed by organizations that lack the patience to debug a real production problem and write the project off as the technology’s fault. Gartner’s September 2025 survey of 360 IT application leaders found that 75% were piloting or deploying AI agents in some form but only 15% were considering or running fully autonomous ones, citing governance, solution maturity, and agent sprawl as the brakes.⁷ Some of that 15% versus 75% gap is the right answer; some of it is reflexive conservatism. The honest version of the prediction is that 40% will be cancelled, of which a significant fraction will have deserved to be, and a smaller fraction will be killed by the wrong people for the wrong reason. Both are healthier than the previous cycle’s pattern of letting bad projects rot.

What survives the gap

The 11% in production are not running magical agents. They are running agents whose scope is small enough that the long tail is bounded. An agent that submits expense reports against a fixed schema, with a human approval gate on anything above a threshold, has a long tail an engineer can enumerate. An agent that “handles operations” has a long tail nobody can enumerate, and that is the project the executive sponsor watches die in month nine.
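The expense-report case is small enough to write down, which is the point. The sketch below is one possible shape for it, not a description of any real deployment: the field names, the allowed categories, the single-currency rule, and the 500.00 threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical fixed schema and threshold, chosen only for illustration.
ALLOWED_CATEGORIES = {"travel", "meals", "lodging", "supplies"}
ALLOWED_CURRENCY = "USD"
APPROVAL_THRESHOLD = 500.00


class Decision(Enum):
    AUTO_SUBMIT = "auto_submit"
    NEEDS_HUMAN_APPROVAL = "needs_human_approval"
    REJECT_MALFORMED = "reject_malformed"


@dataclass(frozen=True)
class ExpenseReport:
    employee_id: str
    category: str
    amount: float
    currency: str


def route(report):
    """Decide what the agent may do with one report.

    Anything outside the fixed schema is rejected rather than guessed at,
    and anything above the threshold goes to a person.
    """
    well_formed = (
        bool(report.employee_id)
        and report.category in ALLOWED_CATEGORIES
        and report.currency == ALLOWED_CURRENCY
        and report.amount > 0  # also False for NaN
    )
    if not well_formed:
        return Decision.REJECT_MALFORMED
    if report.amount > APPROVAL_THRESHOLD:
        return Decision.NEEDS_HUMAN_APPROVAL
    return Decision.AUTO_SUBMIT


if __name__ == "__main__":
    print(route(ExpenseReport("e-17", "meals", 42.50, "USD")))
    print(route(ExpenseReport("e-17", "travel", 1200.00, "USD")))
    print(route(ExpenseReport("e-17", "crypto", 10.00, "USD")))
```

The whole long tail of this agent fits in three outcomes an engineer can list before launch. There is no equivalent function to write for an agent that "handles operations".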

The shape of agentic systems that survive production has three properties, in my view, that the cancelled ones tend to lack. The first is narrow scope: one task, one data shape, one set of failure modes a human can articulate before the agent is turned on. The second is a human-in-the-loop checkpoint at the points where the long tail does its damage — approvals, irreversible actions, customer-facing communication on anything that touches the brand. The third is an observability surface that distinguishes what the agent did from why, because the audit logs that show API calls are not the same as the audit logs that show intent, and the difference is where most root-cause investigations stall.
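The third property can be sketched as a log record that carries the "why" next to the "what". This is a hedged illustration, not a standard format: the `AuditRecord` fields, the example actions, and the `missing_intent` helper are all hypothetical names introduced here.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    # What the agent did: the part an API-call log already captures.
    action: str
    arguments: dict
    # Why it did it: the part a root-cause investigation needs.
    stated_intent: str
    evidence: tuple = ()
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record):
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


def missing_intent(records):
    """Records that show an action but no reason for it."""
    return [r for r in records if not r.stated_intent.strip()]


# Illustrative records; the actions and identifiers are made up.
RECORDS = [
    AuditRecord(
        action="send_email",
        arguments={"to": "customer@example.com", "template": "refund_denied"},
        stated_intent="Order is outside the 30-day return window",
        evidence=("order_date=2025-01-02",),
    ),
    AuditRecord(
        action="update_contract",
        arguments={"contract_id": "C-123", "field": "renewal_date"},
        stated_intent="",
    ),
]

if __name__ == "__main__":
    for record in RECORDS:
        print(to_log_line(record))
    print("no intent recorded:", [r.action for r in missing_intent(RECORDS)])
```

The second record is the kind that stalls an investigation: the call is logged, but nothing says what the agent thought it was doing.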

Safety-critical software engineering has a phrase for this kind of system: the human-machine system. The aviation industry does not certify pilots and aircraft separately and hope the combination works; it certifies the handoffs and the failure modes in advance.⁸ The agentic systems that survive the next two years will be built on the same premise, not because their authors read the DO-178 family of standards, but because the long tail forces the same conclusion on anyone honest about what production looks like.

The pilot-to-production gap is real and the popular diagnoses are not wrong. They are downstream. The thing that kills the most projects is the asymmetry between what the pilot measured and what production measures, and the gap closes only when the team running the project changes what they measure on the pilot side to match what they will be held to in production. Until then, the 11% who make it across will keep being the ones who picked a problem narrow enough that the long tail did not eat them. The other 89% will keep producing impressive demos and disappointing invoices, on a cycle that, mercifully, is now short enough to learn from.

Notes

1. Klarna, “Klarna AI assistant handles two-thirds of customer service chats in its first month,” Klarna corporate press release, February 27, 2024. https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/
2. Sherin Shibu, “Klarna Is Hiring Customer Service Agents After AI Couldn’t Cut It on Calls, According to the Company’s CEO,” Entrepreneur, May 9, 2025, covering Sebastian Siemiatkowski’s May 8, 2025 Bloomberg interview. https://www.entrepreneur.com/business-news/klarna-ceo-reverses-course-by-hiring-more-humans-not-ai/491396
3. Deloitte Insights, “Agentic AI strategy: The new architecture of an autonomous enterprise,” Tech Trends 2026, published December 10, 2025. Statistics from Deloitte’s 2025 Emerging Technology Trends in the Enterprise Survey and 2025 Tech Value Survey. https://www.deloitte.com/us/en/insights/topics/technology-management/tech-trends/2026/agentic-ai-strategy.html
4. Gartner press release, “Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027,” June 25, 2025. Analyst quoted: Anushree Verma, Senior Director Analyst. Underlying data: January 2025 Gartner webinar poll of 3,412 attendees. https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
5. Deloitte Insights, “Business and IT leaders report AI agents are scaling faster than their guardrails,” published January 2026. Survey of 3,235 IT and business leaders across 24 countries. https://www.deloitte.com/us/en/insights/topics/emerging-technologies/ai-agents-scaling-faster.html
6. Joe Guszkowski, “McDonald’s is ending its drive-thru AI test,” Restaurant Business, June 14, 2024. McDonald’s confirmed that the test, conducted in partnership with IBM across more than 100 restaurants since 2021, would be turned off no later than July 26, 2024. https://www.restaurantbusinessonline.com/technology/mcdonalds-ending-its-drive-thru-ai-test
7. “Survey finds slow adoption of autonomous AI agents in enterprises,” IT Brief, September 30, 2025, reporting on the Gartner press release “Gartner Survey Finds Just 15% of IT Application Leaders Are Considering, Piloting, or Deploying Fully Autonomous AI Agents,” September 30, 2025. Underlying data: Gartner survey of 360 IT application leaders in North America, Europe, and Asia/Pacific. https://itbrief.news/story/survey-finds-slow-adoption-of-autonomous-ai-agents-in-enterprises
8. RTCA DO-178C, “Software Considerations in Airborne Systems and Equipment Certification,” RTCA, December 2011. See also SAE ARP4754A on the development of civil aircraft and systems.
