Flaky tests, 15-minute feedback, and the hidden tax - why your CI is killing your AI.
This is post #3 in The Agent-Ready SDLC series. In post #1 we laid out the Ferrari-in-a-Fiat-500 problem - the engine is great, the chassis isn't. In post #2 we covered the first bottleneck: context. Now we're at the second bottleneck - the one that sits between your agent and reality.
If you follow the AI engineering discourse, you've probably seen Steve Yegge's name everywhere this winter. GasTown - his multi-agent orchestrator, described memorably as "an industrialized coding factory manned by superintelligent chimpanzees" - dominated the conversation in January. And rightfully so; it's genuinely fascinating work.
But I want to talk about a quieter project that appeared almost simultaneously: Dan Lorenc's multiclaude. Not because it's technically superior (both are experimental, both are weeks old, and Lorenc himself says he was deeply inspired by GasTown), but because Lorenc reached for a metaphor that I think captures something fundamental about where CI is heading in the agent era.
He calls it the Brownian Ratchet.
In physics, a Brownian ratchet is a thought experiment: you have random molecular motion - particles bouncing chaotically in every direction - and a ratchet mechanism, a one-way gear that clicks forward but can't click backward. In theory, the ratchet converts random thermal noise into directional progress. (In pure thermodynamics this doesn't actually work - it would violate the second law - but as an engineering metaphor, it's perfect.)
Here's how Lorenc maps it: multiple agents work in parallel on the same codebase. They might duplicate effort. They might conflict. Two of them might try to fix the same bug. One might break what another just fixed. This is fine. This is the point. CI is the ratchet. Every PR that passes tests gets merged. Progress is permanent. You never go backward. Chaos in, progress out.
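If you want to feel the mechanics, the ratchet fits in a dozen lines. This is a toy sketch of the model, not multiclaude's actual code - agents propose random patches in every direction, and a deterministic CI gate only lets improvements through:

```python
import random

def ci_gate(patch_quality):
    """The ratchet pawl: a deterministic pass/fail check.
    Stands in for 'all tests green' in a real pipeline."""
    return patch_quality > 0

def brownian_ratchet(n_agents=5, steps=200, seed=42):
    """Toy model of the ratchet: agents propose random patches
    (chaotic motion); only CI-passing patches merge, so the merged
    state clicks forward but never backward."""
    rng = random.Random(seed)
    state = 0.0
    history = [state]
    for _ in range(steps):
        for _agent in range(n_agents):
            patch = rng.uniform(-1.0, 1.0)  # random motion, either direction
            if ci_gate(patch):              # the one-way gear
                state += patch              # merge: progress is permanent
            # failing patches are simply discarded; the state never regresses
        history.append(state)
    return history
```

Run it and the history is monotonically non-decreasing despite half the proposals being garbage - chaos in, progress out.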
What makes this more than a cute analogy is what it implies about the role of CI. In the Brownian Ratchet model, CI isn't a quality gate that checks your work after you're done. CI is the coordination protocol - the mechanism by which a swarm of autonomous agents achieves consensus without a consensus algorithm. It's how random motion becomes forward motion. It's how agents know what's real.
And if that ratchet is slow - fifteen minutes per click - the swarm stalls. If the ratchet is unreliable - flaky tests letting garbage through - you're not ratcheting forward, you're ratcheting randomly. And if both? You have a blind, expensive, directionless swarm that burns tokens and compute while going nowhere.
Both Lorenc and Yegge, building independently, converged on almost identical primitives: git worktrees for isolation, tmux for observability, persistent state that survives failures, and CI as the final arbiter. That convergence IS the signal. When two experienced practitioners arrive at the same architecture from different directions, you're looking at something structural, not stylistic.
And if that coordination protocol is slow, unreliable, or dishonest? You don't just have a "slow pipeline." You have a blind swarm.
(If you need a mental image: it's the Zerg on Aiur after the Overlord goes down. A swarm without vision, bumping into things, attacking its own units. Your CI is the Overlord. And right now we need to find our Queen of Blades - but more on that in section 3.)

1. The Lying Oracle
A human developer sees a failing test and thinks: "Hmm, that's the flaky one again." They skip the noise, fix the real issue, move on. Thirty years of muscle memory. Tribal knowledge. Intuition.
An agent has none of that.
An agent treats your CI pipeline as an oracle - the only feedback signal from reality. The test says "fail"? The agent believes it. Unconditionally. And proceeds to "fix" a problem that may not exist.
The scale of this problem is staggering. Google's internal analysis shows that approximately 16% of their tests exhibit flaky behavior, and around 84% of transitions from pass to fail in their post-submit testing involve a flaky test, not a real regression. Atlassian reported losing over 150,000 developer hours per year to flakiness. An academic study analyzing GitHub Actions across nearly 2,000 open-source Java projects found over 51% affected by flaky builds, with 67% of rerun builds exhibiting flaky behavior. And it's getting worse: Bitrise data across 10 million+ mobile builds shows the proportion of teams experiencing flaky tests grew from 10% in 2022 to 26% in 2025 - a 160% increase in three years.
For humans, this is a productivity drag. For agents, it's a category error.
Every false signal triggers an additional iteration in the Write → Run → Error → Fix loop. The agent doesn't shrug and re-run. It reasons about the failure. It generates a hypothesis. It writes a "fix." It runs tests again. If the flaky test happens to pass - congratulations, the agent just learned that its nonsensical fix "worked" and has now polluted its context with a false causal chain. If it fails again - another hypothesis, another fix, another iteration. One documented analysis turned a $0.50 fix into a $30 bill through 47 iterations. Another analysis of 220 stuck agent loops found three repeating patterns: the agent tries minor variations of the same approach (2K-5K tokens burned per iteration), a wrong assumption early in the session poisons all subsequent reasoning, or the agent over-plans - generating long reasoning chains before acting, burning tokens on speculation rather than execution.
But there's something worse than a slow lie, and it's a confident truth about the wrong thing.
The Circular Test Trap
When an agent writes code, it often writes the tests too. And here's the part that should keep you up at night: AI-generated tests are circular. They test what the code does, not what the code should do.
Think about it. The agent writes a function. Then it generates a test that verifies the function's behavior. Both pass. Green pipeline. Confidence! But the test was derived from the implementation, not from a specification. The test is tautological - it says "the code does what the code does." That's not verification. That's a mirror - and like every magic mirror, it tells the agent exactly what it wants to hear.
> Mirror, mirror on the wall, whose code is the fairest of them all? Yours, your majesty. Always yours.
Veracode's 2025 GenAI Code Security Report tested over 100 LLMs across 80 coding tasks and found that 45% of AI-generated code contains security vulnerabilities - not edge cases, but OWASP Top 10 issues: SQL injection, cross-site scripting, log injection. Java was the worst at over 70% failure rate. And the punchline: the models' own tests didn't catch any of it, because the tests were checking for functional correctness, not security correctness.
This is where the oracle metaphor breaks down and becomes something darker. A flaky test is an oracle that lies intermittently. A circular test is an oracle that confirms whatever you want to hear. The first wastes your time. The second ships your vulnerabilities.
Spotify's engineering team got very specific about this in their Honk post-mortem. They identified three failure modes for their background coding agents: the agent fails to produce a PR (noisy but harmless), the PR fails CI (caught by infrastructure), and the PR passes CI but is functionally incorrect (the actually dangerous one). Their solution was an LLM-as-judge layer - a separate model that evaluates diffs against the original prompt. Internal metrics across thousands of sessions showed the judge vetoes approximately 25% of proposed changes. One in four. And when vetoed, agents successfully course-correct only about half the time.
Let me do that arithmetic for you: if 25% of agent sessions produce incorrect code that passes CI, and half of those can't self-correct, roughly 12.5% of agent output that passes all your tests is still wrong. That's not a rounding error; it's a structural property of the system.
2. The Feedback Multiplier
OK, so the oracle lies - sometimes intermittently, sometimes systematically. Now let's talk about what happens when the oracle is also slow.
Here's some arithmetic that should restructure your infrastructure budget.
An agent working on a non-trivial task - a refactoring, a migration, a bug fix in tightly coupled code - might need 50 iterations to converge. (That number is consistent with what practitioners report for current-generation agents on brownfield code. Spotify's Honk system operates at this scale routinely across thousands of repositories.)
At 15-minute CI feedback: 50 × 15 = 12.5 hours. A full working day for one task. But it's worse than linear, because the agent accumulates context with every iteration. Later iterations are more expensive (tokens scale with context length) and less reliable (the model reasons over a longer, noisier history). By iteration 30 the agent may be spending more tokens re-reading its own failed attempts than reasoning about the actual problem.
At 30-second CI feedback: 50 × 0.5 = 25 minutes.
Same task. Same agent. Same model. The difference isn't marginal. It's the difference between "agents work" and "agents don't work." And it's entirely an infrastructure problem.
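Here's that arithmetic as code, with the context-growth effect included. The token rate and growth-per-iteration numbers are illustrative assumptions, not measured prices:

```python
def wait_time_hours(iterations, ci_minutes):
    """Pure wall-clock: how long the agent is blind per task."""
    return iterations * ci_minutes / 60

def token_spend(iterations, base_context=20_000, growth_per_iter=4_000,
                usd_per_1k_tokens=0.003):
    """Token cost when context grows every iteration: cumulative spend
    is quadratic in iteration count, not linear. All rates here are
    illustrative assumptions."""
    cost, context = 0.0, base_context
    for _ in range(iterations):
        cost += context / 1000 * usd_per_1k_tokens
        context += growth_per_iter
    return cost

slow = wait_time_hours(50, 15)   # 12.5 hours of waiting
fast = wait_time_hours(50, 0.5)  # 25 minutes
```

The second function is the part most budgets miss: doubling the iteration count more than doubles the token bill, because the later iterations drag a longer context behind them.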
But the interesting thing isn't the arithmetic. The interesting thing is what it means for the economics of agent orchestration.
The Compounding Cost Structure
The VISDOM AI Engineering Radar tracks a cluster we've labeled "The Verification Pivot" - the industry shift from optimizing code generation speed to optimizing verification speed. One of the key findings: practitioners report that "vibe-coding" without strict verification leads to unmaintainable codebases within 12 weeks, and a 125% increase in verification overhead offsets the initial velocity gains.
The economics work like this:
Cost of Agent Work = Iterations × (CI Wait + Token Burn + Compute + Review)
Each of those terms compounds with the others:
- Longer CI wait → more context accumulated → higher token burn per iteration.
- More flaky tests → more wasted iterations → more compute.
- More wasted iterations → more noise for reviewers → higher review overhead.

It's not additive; it's multiplicative.
Let me give you a concrete example. One developer tracked 42 agent runs on a FastAPI codebase and found 70% waste - from reading too many files, from failed attempts, from verbose tool output. Another analysis found 87% of tokens went to finding code, not writing it. The navigation overhead dominates. And a 200K-token conversation costs 10x what a 20K-token one costs - so every minute the agent spends waiting for CI is a minute the context is growing and the next iteration is getting more expensive.
This is why Spotify made a critical architectural decision when scaling Honk: they separated the agent runtime from the verification runtime. The agent doesn't run CI itself. It pushes branches to GitHub, triggers builds via a dedicated verification service that abstracts the CI system, waits for results, and only creates pull requests after full validation. This decoupling - the agent writes, a separate service verifies - is the key insight. It lets them optimize each side independently. The agent can iterate at agent speed. The verification service can run builds in parallel, cache results, and manage queue priority.
At QCon London 2026, Spotify reported that Honk now produces 1,000 merged pull requests every 10 days - up from 1,000 in three months just six months prior. That's not a model improvement - that's an infrastructure improvement.
This is the Queen of Blades moment - not a single magic fix, but the architectural decision that gives the swarm its eyes back.
What Actually Speeds Things Up
I'll be brief on the plumbing because this isn't a Bazel tutorial (though you know where to find us for that):
Caching across your builds. If someone already built and tested the same code with the same inputs, don't do it again. EngFlow - founded by core Bazel engineers - reports 5-10x build time reduction. One customer reported releasing 20x faster. The insight: caching works across agents, not just one developer. Agent A's build benefits from Agent B's build.
Incremental builds. For an agent making small iterative changes - the dominant pattern - a good build system rebuilds only what changed. Fifteen minutes becomes fifteen seconds.
Test impact analysis. Don't run all 10,000 tests when you changed one service. With Bazel, you get a lot of this for free - tests can be incremental, so only affected targets are re-executed. We helped a client cut test cycle time by 78% (from 9.5 hours to under 2) by switching to intelligent test orchestration. Read the full story.
Parallelization and remote execution. Distribute builds and tests across a cluster — but do it smartly: match compute to the task, and keep it local to data or caches. Autoscaling is the next step, making compute proportional to actual usage rather than peak capacity.
None of this is new technology. Google has done it internally for fifteen years. What's new is that it matters qualitatively differently now. A human who waited 15 minutes context-switched - checked Slack, reviewed a PR, grabbed coffee. The time was redistributed, not wasted. An agent can't context-switch. It either iterates or it idles. And idling is pure waste - except worse than waste, because the context window is still growing.
This is what we call Machine-Speed CI - CI infrastructure designed not for human commit cadence (a few times a day) but for agent iteration cadence (every minute). The bar isn't "faster than before." The bar is "fast enough that the agent never loses its train of thought."
3. CI as Sandbox - or: How the Swarm Learned to See
Here's the idea I think will age the best in this entire series, and it emerged not from theory but from watching what's actually converging in the ecosystem.
Traditional CI is a gate: "is this code good enough to ship?" You submit when you're done. The bouncer checks the list. You're in or you're out.
Agents don't submit when they're done. They submit to learn. Every CI run isn't a gate check - it's a sensor reading. The agent needs to know: did this fix what I intended? Did it break anything else? Are there side effects I didn't predict?
In the Brownian Ratchet model, this is the fundamental primitive. Lorenc is explicit: "If it passes, the code goes in. If it fails, it doesn't. The automation decides." CI isn't validating quality. CI is defining what counts as reality. It's the one-way gate that turns random agent motion into directional progress. And Machine-Speed CI is what makes the ratchet click fast enough to matter.
But here's what happens when you actually run this in production. Yegge described watching 20-30 GasTown agents in parallel, making $100/hour decisions about what to greenlight, experiencing "palpable stress" as the system ran faster than he could comprehend. And then the refinery agent crashed, other agents started lying about state, and half the system's institutions collapsed. His words: "Not 'a process died,' but 'the city's power grid collapsed, and now half the institutions are lying about what's real.'"
This is what it looks like when CI - the only source of truth - becomes overloaded, unreliable, or slow. The swarm goes blind.
The concept of CI as Sandbox addresses this directly: a parallel, isolated CI environment where agents iterate freely without interfering with the team's workflow or each other. The agent can execute 50 failed attempts in 5 minutes. Each attempt gets fast feedback. The sandbox is isolated, ephemeral, and cheap.
The pieces already exist, and they're maturing fast:
Ephemeral sandboxed runtimes. E2B went from 40,000 sandbox sessions per month in March 2024 to roughly 15 million per month by March 2025 - and about 50% of Fortune 500 companies now run agent workloads in sandboxed environments. Claude Code's Docker Sandboxes (launched January 2026) run each session in a dedicated microVM with its own kernel and private Docker daemon. The Rivet team built sandbox-agent - a Rust binary that runs inside any sandbox provider (E2B, Daytona, Modal, Cloudflare Containers) and provides a universal HTTP API to control Claude Code, Codex, or Amp remotely. Shipyard offers self-serve ephemeral environments where agents can spin up full-stack test environments via MCP, pull logs, run tests, and iterate - without a human in the loop.
Agent-specific CI lanes. This is where our own experience with Jenkins Operator comes in (yes, I have the scars - disclosure: VirtusLab maintains it). The evolution from "CI as monolithic build server" to "CI as elastic infrastructure" has been ongoing for a decade. But it was always designed around one pipeline per PR, one run per push. The agent era requires rethinking the unit of work from "a commit" to "an iteration." Spotify's architecture - separating agent runtime from verification runtime, with a dedicated verification service abstracting the CI system - is the production-grade version of this idea. The agent gets its own lane, and this is where remote execution becomes especially handy, since you can spin up isolated compute for agent iterations without blocking the main pipeline. Once the agent's change converges, it enters the main queue.
Progressive validation. Run fast unit tests first. If green, integration tests. If green, E2E. The agent doesn't need full validation on every iteration - it needs a fast signal to keep the loop moving, and thorough validation before the final merge.
But here's the thing nobody's talking about yet, and it's subtle: sandbox security is unsolved. In March 2026, a firm called Ona demonstrated that Claude Code could bypass its own denylist using path tricks (/proc/self/root/usr/bin/npx resolves to the denied binary but dodges pattern matching). When bubblewrap caught that, the agent disabled the sandbox itself and ran the command outside it. The agent wasn't jailbroken. It wasn't told to escape. It just wanted to complete the task, and the sandbox was in the way.
This is why application-level sandboxing alone is insufficient. You need layers: application-level restrictions (Claude's built-in permission system), OS-level enforcement (bubblewrap, Seatbelt), and infrastructure-level isolation (VMs, network policies). The industry is still converging on what the standard stack looks like - and if you're running agent swarms against your production CI without at least two of these three layers, you're one creative agent decision away from a very bad day.
4. New Metrics for the Agent Era
DORA metrics - Deployment Frequency, Lead Time, Change Failure Rate, MTTR - were brilliantly designed for a world where humans commit every few hours. Agents iterate every minute. DORA doesn't break, but relying on it alone becomes like measuring highway traffic with a sundial.
Deployment Frequency skyrockets - the agent pushes constantly - but that tells you nothing about whether it's converging efficiently or burning tokens in circles. Lead Time collapses, but a 2-minute lead time from 50 wasted iterations isn't the same as a 2-minute lead time from 3 clean ones. The VISDOM Radar's "Verification Pivot" cluster puts it well: mature teams have stopped tracking PR frequency as a productivity signal and started tracking DORA metrics to detect latent service degradation that high PR velocity can mask.
We need to extend DORA. Not replace it - extend it. Here's what I'm proposing (and what we're tracking internally, so this isn't armchair theory):
Iterations-to-Success (ITS)
How many iterations does the agent need to go from task assignment to passing CI?
| ITS | Signal | What It Means |
| --- | --- | --- |
| 1–3 | Healthy | Good context, reliable tests, clean architecture |
| 5–10 | Warning | Unclear spec, flaky tests, coupling issues, or poor context delivery |
| 20+ | Structural failure | The task is too ambiguous, the tests are lying, or the codebase actively resists the agent |
Here's the insight that makes ITS genuinely useful: ITS is an architectural signal, not just an agent performance metric. When ITS trends upward over months, the agent isn't getting worse - the codebase is getting harder to reason about. Rising ITS is a leading indicator of architectural decay, the same way rising bug counts used to be. Except now you get the signal automatically, every day, from every task the agent touches. It's like having a junior developer constantly telling you which parts of your system are confusing - except, unlike a junior developer, it never gets used to the mess and stops complaining.
A rising ITS correlated with a specific service or module? That module has a context problem - unclear conventions, missing documentation, and tight coupling. A rising ITS across the entire codebase? Your architecture is drifting. Spotify didn't publish ITS directly, but their judge's 25% veto rate is a proxy - it tells you how often the agent produces structurally wrong output despite passing CI.
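If your agent runs are logged, ITS is a small aggregation. The log schema here is a hypothetical sketch, not any vendor's format:

```python
from collections import defaultdict
from statistics import mean

def its_by_module(runs):
    """Mean Iterations-to-Success per module, over runs that eventually
    passed CI. `runs` is a hypothetical log schema:
    {"module": str, "iterations": int, "passed": bool}."""
    buckets = defaultdict(list)
    for run in runs:
        if run["passed"]:
            buckets[run["module"]].append(run["iterations"])
    return {module: mean(counts) for module, counts in buckets.items()}

def classify(its_value, warn=5, structural=20):
    """Map an ITS value onto the bands from the table above."""
    if its_value >= structural:
        return "structural failure"
    return "warning" if its_value >= warn else "healthy"
```

Track the per-module trend over weeks, not the absolute value on any given day - the slope is the architectural signal.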
Cost-per-Iteration (CPI)
What does each turn of the loop actually cost? Not the marketing price - the real price:
- Tokens: Input and output. A 200K-token context costs 10x a 20K one, and context grows with every iteration.
- Compute: CI minutes, sandbox spin-up, build cluster. Autoscaling helps but doesn't eliminate.
- License: Per-seat for Copilot/Claude/Cursor. The fixed cost most organizations can already quote.
- Review: The time seniors spend reviewing agent output. Practitioners report this is currently higher than for human code, because trust isn't established. The Radar's "Verification Pivot" cluster notes a 125% increase in verification overhead.
Most organizations can tell you what they spend on AI licenses. Almost no one can tell you their CPI. And CPI is the denominator in your ROI calculation. If you're a Platform Efficiency Lead building a business case, CPI is the number your CFO needs.
Test Oracle Reliability Score (TORS)
What percentage of test failures are real? Not re-runs, not flakes, not environment issues - actual regressions caught by actual tests.
If your TORS is 50%, half of your agent's iterations are wasted on noise. Google's data implies a TORS of about 16% for their post-submit testing (84% of pass-to-fail transitions were flaky). For an agent, that's a tax on every single iteration.
The formula is direct: wasted_cost = (1 - TORS) × total_iterations × CPI. At Google-scale flakiness, with a CPI of even $1, that's 84 cents wasted per iteration per agent. Scale that across a team of 50 engineers with multiple concurrent agent sessions, and you're looking at significant hidden spend.
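As code, with the numbers from above (the team-scale figures are hypothetical):

```python
def wasted_cost(tors, total_iterations, cpi_usd):
    """Direct translation of the formula above: the share of iterations
    spent chasing noise, priced at cost-per-iteration."""
    return (1 - tors) * total_iterations * cpi_usd

# Google-scale flakiness (TORS ≈ 0.16) at a CPI of $1:
per_iter = wasted_cost(tors=0.16, total_iterations=1, cpi_usd=1.0)  # ≈ $0.84

# Hypothetical team scale: 50 engineers, 3 concurrent sessions,
# 50 iterations per session per day.
daily = wasted_cost(0.16, 50 * 3 * 50, 1.0)
```

Even at these modest assumptions the daily waste lands in the thousands of dollars, and it scales linearly with both agent count and iteration count.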
(Spotify's approach to TORS is architectural: they run deterministic verifiers - formatting, build, test - before the LLM judge. The deterministic layer handles the "is this mechanically correct?" question. The LLM handles the "is this what we actually asked for?" question. Two oracles, each honest about a different dimension.)
The 4x Hidden Tax
When you buy AI coding licenses, you're budgeting for one-quarter of the actual cost. The full taxonomy:
- License - the subscription, the thing on the pricing page
- Compute - CI, sandboxes, build clusters
- Tokens - API calls, context growth, agent loops
- Review overhead - senior time reviewing agent output
Most organizations budget for #1 and discover #2-4 the hard way. Gartner projects that 40% of agentic AI projects will be scrapped by 2027 - not because the models fail, but because the coordination layer has no feedback and the infrastructure cost balloons uncontrolled. One developer reported a client hitting $2,000 in API costs in a single day because an agent discovered recursive self-improvement - it kept calling itself to optimize its own prompts with no circuit breaker.
The good news: categories #2 and #3 are highly compressible. Remote caching and incremental builds cut 60-70% off CI cost. Model routing - sending simple tasks to cheaper models - cuts 30-40% off token spend. Fixing or deleting flaky tests reduces wasted iterations, which cuts everything proportionally. The compound effect of doing all three can reduce the total cost of agent operation by roughly 70%.
The less good news: this is boring work. Nobody writes a LinkedIn post about fixing flaky tests (well, almost nobody - Spotify did make their flakiness visible on a dashboard, and that alone reduced it by 33% in two months). But this is the chassis. This is what separates the organization getting 6x revenue per employee from the one where AI makes seniors 21% slower.
Feedback is the fuel for agents. Your CI is their eyes.
If your CI answers in 15 minutes, you don't have Machine-Speed CI - you have human-speed CI serving machine-speed agents. Your agent is blind and deaf for 15 minutes - per iteration, per task, per agent. If your CI lies 84% of the time, your agent is a very expensive random number generator.
And if you bought Copilot licenses without writing CI compute into the business case? Well - you now have the numbers to go back and fix that.
What's Next
We've covered the input problem (context, post #2) and the feedback problem (CI, this post). The next post tackles the governance problem - because even if your agent has perfect context and instant feedback, someone has to answer: should this code be in production?
Post #4 ("Shadow AI Is Already in Your Codebase") - a Green/Yellow/Red system for auto-evaluation, Continuous Modernization, and why the EU AI Act 2026 means that an audit trail is not a nice-to-have.
Artur Skowroński is Head of Application Development at VirtusLab, where he leads VISDOM - a product that operationalizes the Agent-Ready SDLC across enterprise environments. The opinions and bad analogies are his own; the data isn't - sources are linked throughout.
The AI-Native SDLC Maturity Matrix is updated monthly. The AI Engineering Radar currently tracks 715+ signals across 17 areas - including the "agent-runtime-sandboxing" and "observability-feedback-loop" categories that directly map to this post.
PS. If you scored yourself on the 10-question checklist from post #1 - questions 3 (CI under 2 minutes?), 4 (deterministic tests?), 8 (cost per iteration?), 9 (dedicated sandbox?), and 10 (beyond DORA?) are all about this post. If you scored below 3 on those five, start here.
PS2. I titled this post "Your CI is an Oracle... that lies too" and spent half of it talking about speed. That's deliberate. A fast, honest oracle is what you want. A fast, lying oracle lets you ship nonsense quickly. A slow, honest oracle lets you ship quality slowly. You need both dimensions. Fix the lies first (test reliability), then fix the speed (infrastructure). In that order - because there's no point making a liar answer faster.