You've probably heard about the first METR study from July 2025 - it made the rounds at every conference and every newsletter. 16 experienced open-source developers, a proper randomized controlled trial (not a vendor survey), and the result: 19% slower with AI. Seriously.
That statistic lived a life of its own for half a year - some used it as proof that AI is hype, others criticized the methodology, still others ignored it because it didn't confirm their thesis. The standard cycle.
But have you heard about the second study?
In February 2026, METR published an update - and this is where it gets interesting, because the results don't fit any simple narrative. After a year of agentic AI adoption (Claude Code, Codex), a newer subset of developers shows signs of acceleration in the range of 4–18%. And what's even more interesting - and here's the real punchline - 30–50% of developers in the study refused to submit tasks rather than complete them without AI. Up to half the sample self-excludes because people don't want to go back to working without agents. Draw your own conclusions, but that selection bias says more than the study result itself.
At the same time, we have GitHub reporting in Octoverse 2025 that Copilot now generates 46% of active users' code, 90% of the Fortune 100 have purchased licenses, and Java developers are hitting 61% AI share in their code. Microsoft reports a 55% speedup in a controlled experiment with Accenture. Stack Overflow says 84% of developers use AI tools, more than half daily.
So the tools work, people want them, nobody plans to give them up. 90% of Fortune 100 are buying licenses. 84% of developers use them. 46% of the code is AI-generated.
And yet - in a Deloitte survey from fall 2025 (3,200 leaders, 24 countries), only 20% of organizations report that AI actually translates into revenue growth. One fifth. With adoption above 80%. Another 37% admit they use AI "superficially, with no changes to processes."
But before we conclude that AI doesn't work, here's an interesting counterpoint. Laurie Voss (npm co-founder) analyzed data from Crunchbase, Carta, and Revelio Labs and showed that somewhere AI really works - in startups building from scratch. AI-native startups operate with teams 40% smaller than traditional SaaS companies. Revenue per employee is 6x higher - $3.48M vs $580K. The average seed round in 2024 is 3.5 people, down from 6.4 in 2022. Computers are replacing labor at a pace visible to the naked eye.
But those are startups. Greenfield. A blank slate, architecture designed for agents from day zero. And here's the question that should keep anyone managing an organization with twenty years of history up at night: what happens when that speed hits your market?
Because these aren't abstractions anymore. One Cloudflare engineer with an AI agent rewrote 94% of the API surface of Next.js - the world's most popular frontend framework, 194 thousand lines of code built over a decade - in a week. For $1,100 in tokens. IBM stock dropped 7% in a single day after Anthropic published a case study showing that Claude can analyze and rewrite COBOL code - the core of banking infrastructure that IBM had been building a business on for forty years. These aren't science fiction scenarios. These are things that have already happened.
The difference between those getting 6x revenue per employee and those where AI slows down seniors by 19% doesn't lie in the model. It lies in the chassis. CI speed. Context quality. Test determinism. Boring, infrastructural things that are absolutely critical - and that's what the rest of this series will cover.
What Exactly Doesn't Work
If the problem lies in the environment, not the model, then what exactly is it? The whole problem boils down to a simple model you could sketch on a window - yes, that's a nod to Fincher's "The Social Network". The solutions, though, aren't that obvious.
On the input side, the agent has no precise instructions or context. It doesn't know unwritten conventions. It doesn't know that "run X" means "run X after doing Y and Z, which nobody documented." The README lies - or rather, nobody updated it for 18 months because nobody was reading it.
Now someone (or rather something) started reading. And that someone takes it literally. The Jira ticket says "fix login bug" and the agent generates something plausible but fundamentally inconsistent with the architecture - because the architecture exists in the heads of three people, one of whom left last year.
Garbage In, Garbage Out - except now Garbage Out flies at machine speed.
On the output side, the verification infrastructure is too slow.
The agent iterates in a Write → Run → Error → Fix loop - and that loop needs to take seconds, not quarter-hours. If CI takes 15 minutes, then 50 iterations = 12.5 hours. If CI takes 30 seconds, those same 50 iterations = 25 minutes. The difference between "the agent finished the task by lunch" and "the agent didn't finish by end of day" really doesn't lie in the model.
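The arithmetic above is blunt enough to script. A minimal sketch, using the hypothetical iteration counts and CI times from this section:

```python
def wall_clock_hours(iterations: int, ci_seconds: float) -> float:
    """Total wall-clock time an agent spends waiting on CI."""
    return iterations * ci_seconds / 3600

# 50 iterations against a 15-minute pipeline vs a 30-second one
slow = wall_clock_hours(50, 15 * 60)   # 12.5 hours
fast = wall_clock_hours(50, 30)        # 25 minutes
print(f"slow CI: {slow:.1f} h, fast CI: {fast * 60:.0f} min")
```

The lever here is the per-iteration constant, not the iteration count - the agent will burn as many loops as the task demands either way.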
And between input and output - the agent treats CI like an oracle. Literally - it's its only feedback signal from reality. If the oracle lies (a flaky test gives a false positive), the agent enters a loop fixing something that isn't broken. If the oracle goes silent for 15 minutes, the agent loses context. Google reports that ~16% of their tests exhibit flaky behavior. Microsoft identified ~49 thousand flaky tests in their systems. And that's in environments with the best testing infrastructure on the planet.
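One blunt way to measure how much your oracle lies is to rerun the same suite against the same commit and count disagreements. A sketch under stated assumptions: `run_test` is a stand-in for whatever invokes a single test in your setup, and the function name and toy tests are mine, not from any real tool:

```python
import random
from collections import Counter

def flakiness_report(run_test, test_ids, reruns=10):
    """Run each test `reruns` times against identical code; any test
    whose outcomes disagree with themselves is flaky by definition."""
    flaky = []
    for tid in test_ids:
        outcomes = Counter(run_test(tid) for _ in range(reruns))
        if len(outcomes) > 1:  # same code, different results
            flaky.append((tid, dict(outcomes)))
    return flaky

# Toy stand-in: one deterministic test, one that fails ~30% of the time
def run_test(tid):
    if tid == "test_login":
        return "pass"
    return "pass" if random.random() > 0.3 else "fail"

print(flakiness_report(run_test, ["test_login", "test_timeout"], reruns=20))
```

Rerunning is crude and expensive, but it is the one detector that needs zero knowledge of why a test flakes - which is exactly the position an agent is in.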
And each of those iterations costs money - not just in compute, but in organizational attention. Every CI run consumes infrastructure, every LLM call burns tokens, and the meter is running the entire time the agent is stuck in a retry loop.
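Even a back-of-the-envelope model makes that meter visible. A sketch with made-up unit prices - every number below is a placeholder, plug in your own CI and token rates:

```python
def iteration_cost(ci_minutes, ci_rate_per_min, tokens, token_price_per_1k):
    """Rough dollar cost of one Write -> Run -> Error -> Fix iteration."""
    return ci_minutes * ci_rate_per_min + tokens / 1000 * token_price_per_1k

# Hypothetical numbers: 15-minute CI at $0.05/min, 40k tokens at $0.01/1k
per_iter = iteration_cost(15, 0.05, 40_000, 0.01)
print(f"${per_iter:.2f} per iteration, ${per_iter * 50:.2f} for 50 retries")
```

Note what dominates: with numbers in this ballpark, CI compute outweighs tokens - another reason the chassis matters more than the engine.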
But there's a subtler cost hiding underneath: traceability. When a human developer writes code, you can ask them why they made a given decision. You can read the PR description, check the commit message, follow the thread in Slack. When an agent makes fifty iterative changes across a retry loop, the decision trail gets murky fast. Which iteration introduced the actual fix? Which ones were dead ends? Why did the agent choose approach A over approach B? Most current agent tooling treats the reasoning trace as disposable - an ephemeral chain-of-thought that evaporates once the final output lands. That's fine for a demo, but shipping an artifact produced that way to production is terrifying.
No single tool will solve this - not because the tools are bad, but because each addresses one-third of the problem. Copilot helps write code. Observability helps monitor it. The IDE helps edit it. But nobody is wiring this into a system where an agent can autonomously go from understanding a task to a verified deploy.
Success requires three things simultaneously: context on the input, fast verification on the output, and auto-evaluation in between. Three pillars - Context Fabric, Machine-Speed CI, Auto-Evaluation - that we'll break down to their component parts in this series.
Because the Ferrari engine is already sitting in the garage. Everyone has one, or will have one soon. The question is: what about the rest of the car? Right now most of us are driving a Fiat 500 with a Ferrari engine under the hood - and I say this as a happy Fiat 500L driver. The Italian chassis has its charm, but nobody in their right mind puts a V12 in it. Whoever builds the chassis that can handle that power first, wins.
And you'd better hope that answer comes from you, not from your competitors.
Quick Self-Assessment: 10 Questions
And to close, a checklist.
One point for each honest "yes." Half a point for "partially." Zero for "no."
- Is tribal knowledge captured anywhere outside people's heads?
- Are architectural rules encoded as lint rules, rather than in a document nobody updates?
- Does the CI feedback loop take under 2 minutes?
- Do tests pass deterministically - same code, same result, every time?
- Is there an audit trail that lets you answer "where did this code come from, which model, who approved it"?
- Is the documentation synchronized with the code?
- Is there policy-based auto-approval for low-risk changes?
- Do you know how much one agent iteration costs - CI plus tokens plus review time?
- Is there a dedicated sandbox where the agent can iterate without blocking the team's queue?
- Are you measuring anything beyond DORA metrics - because DORA was designed for commits every few hours, not every minute?
And now my (opinionated) take on scoring:
0-3: Agent-Hostile. The environment actively hinders agents. ROI is negative, and rightly so. This is where most companies are - and this is where the conviction that "AI doesn't work" is born.
4-6: Agent-Tolerant. The foundation is there, but bottlenecks eat up the gains. Quick wins exist - CI caching, stabilizing tests, basic docs. Without a plan, this won't change.
7-8: Agent-Ready. Agents are starting to deliver real ROI. Focus: review automation, per-iteration metrics, context delivery.
9-10: Agent-Optimized. Top 5%. The next step is a full autonomy loop - and that's what post #5 will cover.
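If you want the scoring to be mechanical, the checklist and the bands above translate directly. A sketch using the post's own point values (yes = 1, partially = 0.5, no = 0); half-point totals that land between two bands round up into the higher one here, which is my choice, not part of the rubric:

```python
def maturity_band(answers):
    """answers: one of 'yes' / 'partially' / 'no' per checklist question."""
    points = {"yes": 1.0, "partially": 0.5, "no": 0.0}
    score = sum(points[a] for a in answers)
    if score <= 3:
        return score, "Agent-Hostile"
    if score <= 6:
        return score, "Agent-Tolerant"
    if score <= 8:
        return score, "Agent-Ready"
    return score, "Agent-Optimized"

# Example: 5 yes, 3 partially, 2 no -> 6.5 points, Agent-Ready
print(maturity_band(["yes"] * 5 + ["partially"] * 3 + ["no"] * 2))
```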
Ok, we've laid out the problem - time for solutions. In this series, we'll break the whole thing down to its component parts.
Post #2 ("Your README Is a Lie") - the first bottleneck. A framework of 5 context levels and the concept of Autonomous Requirements: analyst-agents that turn vague Jira tickets into deterministic specifications before a developer even gets involved.
Post #3 ("Your CI Is an Oracle That Lies") - the second bottleneck. Feedback loop math, benchmarks from Bazel and remote caching, the concept of "CI as Sandbox," and new metrics for the agent era. Infrastructure, Agentic Swarm.
Post #4 ("Shadow AI Is Already in Your Codebase") - governance. A Green/Yellow/Red system for auto-evaluation, Continuous Modernization, and why the EU AI Act 2026 means that an audit trail is not a nice-to-have.
Post #5 ("The AI-Ready SDLC Maturity Model") - pulling it all together. 5 maturity levels, criteria per level, ROI curve, and practical advice.
Post #6 ("The Opinionated E2E Workflow") - a concrete stack from ticket to production, updated monthly. Because the AI landscape changes every week, and you have a business to run.
We started with context (pun intended). See you in post #2.




