The industry is in the same spot the DevOps community was in 2013. We all know something is happening. None of us can name it properly.
A few weeks ago I was sitting at a table with fifteen senior people from large European companies - banks, retail, infrastructure, that sort of thing. Roundtable, Switzerland, Chatham House Rule, so I won't share details. But one thing stuck in my head so hard that it basically started this post.
In the first hour I asked every participant the same question: where are you with AI? And I got fifteen different answers which - once I started writing them down - turned out to describe maybe four different realities. But every one of those fifteen people was deeply convinced they were somewhere in the middle of the pack, doing roughly what most organizations are doing.
Three conversations stayed with me. The first person (where exactly, I won't say) said: "we have agents in the pipeline." The second: "we're starting to test agentic workflows, but it's still R&D." The third: "we've launched a proof of concept for autonomous coding." Three sentences. Three different companies. And - as became clear after the third coffee and a few follow-up questions - three completely different realities.
The first one had Copilot turned on for maybe thirty percent of the team. The second used Cursor and had a CLAUDE.md file in every repository. The third actually had a fleet of a hundred parallel agents producing a thousand commits a week (that last case is, for the record, still an industry extreme - the rest of the room took it with appropriate disbelief).
All three considered themselves "average plus." Statistically, they couldn't all be right - yet each of them genuinely felt average, which is easy when your reference point is a feed full of Twitter (sorry, X) bros.
And that moment - me sitting there with a notebook, coffee getting cold - was when I realized the problem we don't know how to name isn't a problem with tools. It's a problem with vocabulary.
Remember the first State of DevOps?
I do, because I caught it just after starting my career and I remember the feeling - ah, so that's what it's about. Nicole Forsgren, Jez Humble, Gene Kim, data from thousands of organizations, four concrete metrics: lead time, deployment frequency, MTTR, change failure rate. Five maturity levels. A brutal demonstration that most organizations were "low performers," even though all of them were deeply convinced they were at least decent. Self-reporting didn't match the data, to put it diplomatically.
That was 2014. We had exactly the same problem then that we have now: everyone knew something was happening with software delivery. Continuous integration, microservices, DevOps in the cultural sense, infrastructure as code. Every company had its own language for describing those things. Every company thought it was "kind of halfway there." It was only when concrete metrics and concrete maturity stages appeared that you could have an honest conversation about where you were and where you wanted to be.
That was a revolution most readers of this post probably don't remember as a revolution, because for anyone who entered the industry after 2018, DORA is just background. What we now treat as obvious was, at the time, an act of naming things.
And now I look at the AI-native development industry in April 2026 and I see exactly the same moment. With one unpleasant difference: we don't have five years for that vocabulary to mature organically. The pace of change is ten times faster than in 2014. We have maybe twelve months before the lack of a common language starts costing the industry not just time, but real money in the form of bad investment decisions. Because, as I showed in the first post of this series, Deloitte's fall 2025 survey already has only one in five organizations reporting that AI actually translates into revenue growth - and that's at adoption above eighty percent. That gap is what our industry currently feels but doesn't know how to name.
And I didn't know how either, for a long time.
What we were actually trying to do
The Visdom AI-Native SDLC Maturity Matrix was born of pain. Specifically, the pain of having the third, tenth, and twentieth conversation with a client's C-level where, after an hour of trading generalities, it turned out that we meant something completely different than they did when we said "AI-ready." And both sides walked away deeply convinced we'd understood each other.
The first version of the matrix was a draft, had nine practices, and was really just a checklist for internal workshops. The April 2026 version has sixty practices, four perspectives, and five levels. Not because we planned that number - but because that's what accumulated after running it through dozens of engagements, from a startup with twenty developers to a regulated enterprise with two thousand. Not from theory. From patterns that actually worked, or actually failed, in actually existing companies.
And I'll say something consulting copywriters usually don't write: the first version we made was bad. The second was better, but bad in different places. It was somewhere around the fourth iteration that we started to feel the taxonomy was catching something - that people, after going through the assessment, stopped arguing about definitions and started arguing about priorities. That's the moment you know a tool is starting to work.
Five levels. Brutally.
I won't walk you through all sixty practices - that's what the matrix is for, which we built specifically so you don't have to click through slide decks (and if anyone wants to walk through it with me - be my guest, drop me an email). But I'll give you two sentences on each of the five levels, the way I'd describe them at a whiteboard during a workshop.
L1 is Ad-hoc. Copilot autocomplete, sidebar chat, the agent only sees the open file. Zero instruction files, zero MCP, README last updated eighteen months ago (well, never, but we pretend it was eighteen months). Manual review of one hundred percent of code. Test coverage below forty percent, flaky tests not flagged because "eh, it'll probably pass next time." This is where the vast majority of companies are - and this is exactly the majority that will tell you "AI doesn't work." And they're right. AI really doesn't work at L1, because there's nothing for it to work with.
L2 is Guided. A CLAUDE.md or .cursorrules appears, Cursor 3 or Claude Code in agentic mode is used by half the team. CodeRabbit or Qodo helps with review. The agent generates unit tests, humans write acceptance tests. Flaky tests go into quarantine. This is the state where companies announce in press releases that they're "transforming with AI." And they're right, but only partially - because governance still lives in 2024, and the agent still works blind, because the context is fragmented, out of sync, and contradictory.
L3 is Systematic. This is where things get serious. CLI agents as the primary interface. MCP servers deliver structured context - architecture, ownership, SLAs. Code conventions are written as agent-parseable rules, not as tribal knowledge living in the heads of three seniors (one of whom is on parental leave, another is looking for a new job). Lint-as-architecture: the "bug → codify → lint rule" pipeline works. TORS (Test Oracle Reliability Score, one of the metrics I proposed in the third post) is above ninety percent. The audit trail for AI-generated code exists and is searchable. This is where break-even appears and real ROI begins.
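To make the "bug → codify → lint rule" idea concrete, here's a minimal sketch of what one such codified rule could look like in Python. The module names and the convention it enforces (handlers must not import the ORM layer directly) are hypothetical, and in practice your rules will live in whatever lint framework you already run. The point is only the shape: a convention an agent can be told about in prose is also a check it cannot ignore in CI.

```python
# Sketch of one "bug -> codify -> lint rule" step. Hypothetical convention:
# after an incident caused by a handler querying the database directly, the team
# codifies "handlers must not import the ORM layer" as a rule that both agents
# and CI enforce, instead of leaving it as tribal knowledge.
import ast
import pathlib
import sys

FORBIDDEN_PREFIX = "app.orm"                  # hypothetical internal module
HANDLER_DIR = pathlib.Path("app/handlers")    # hypothetical repo layout


def violations(path: pathlib.Path) -> list[str]:
    """Return every import of the forbidden module found in one file."""
    tree = ast.parse(path.read_text(), filename=str(path))
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.startswith(FORBIDDEN_PREFIX):
                found.append(f"{path}:{node.lineno}: handlers must not import {name}")
    return found


if __name__ == "__main__":
    problems = [v for f in sorted(HANDLER_DIR.rglob("*.py")) for v in violations(f)]
    print("\n".join(problems))
    sys.exit(1 if problems else 0)   # a red build makes the rule non-negotiable
```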
L4 is Optimized. Unattended one-shot agents - the Stripe Minions model, Cursor Automations. Agents invocable from multiple channels: Slack, CLI, web, PagerDuty. Three to five parallel sessions per developer. Green/Yellow/Red auto-evaluation with auto-merge for Green. Continuous Modernization in the background - agents bumping dependencies and runtime versions while the team sleeps. Mutation testing validates that tests actually catch defects rather than just hitting coverage. This is the top five percent. And for most companies, this is the realistic ceiling.
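And since Green/Yellow/Red is the practice people ask about most, here's a sketch of what the triage logic can look like once it's encoded rather than tribal. Everything in it is an assumption for illustration - the signal names, the thresholds, the idea of keying Red off public-API changes - not a description of how Stripe or Cursor actually do it.

```python
# Illustrative sketch of Green/Yellow/Red triage for agent-produced changes.
# All field names and thresholds are hypothetical; the point is that the
# decision is encoded, versioned, and auditable rather than made ad hoc.
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    GREEN = "auto-merge"        # merges unattended
    YELLOW = "human review"     # fine, but a person looks first
    RED = "architect review"    # an architectural decision, not a line change


@dataclass
class AgentChange:
    tests_passed: bool
    mutation_score: float       # share of injected defects the tests caught
    touched_public_api: bool    # did the diff move a module boundary or contract?
    changed_lines: int


def triage(change: AgentChange) -> Verdict:
    if change.touched_public_api:
        return Verdict.RED
    if change.tests_passed and change.mutation_score >= 0.85 and change.changed_lines <= 400:
        return Verdict.GREEN
    return Verdict.YELLOW


if __name__ == "__main__":
    example = AgentChange(tests_passed=True, mutation_score=0.91,
                          touched_public_api=False, changed_lines=120)
    print(triage(example).value)   # -> "auto-merge"
```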
L5 is Autonomous. Multi-agent orchestration, planner-worker hierarchy. A hundred-plus parallel agents, a thousand-plus commits a week with no manual dispatch. Persistent agent identity with memory across sessions. Production telemetry automatically updates agent context. A self-healing test suite. Humans only review Red - meaning architectural decisions, not line changes. Two years ago I would have called this science fiction. Today, a handful of companies do it, mostly in the Bay Area, mostly with budgets that for most readers are otherworldly. Most readers of this post don't need to reach L5. Most readers of this post shouldn't aim for L5. The goal is solid L4. L5 is for those who compete on delivery speed as a primary competitive dimension, and there are fewer such firms than Twitter hype would suggest.
Why four perspectives, not one list
I have to pause here, because this is the place where I most often see companies fool themselves with their own "AI maturity assessments."
The most common mistake I see is reducing the problem to a single dimension - usually tool adoption. Percentage of the team using Copilot. Number of Cursor sessions per week. Share of code generated by AI. These are activities, not maturity. You can have a hundred percent of your team on Cursor 3 and still be at L1, if context is fragmented, CI takes eighteen minutes, and governance has no idea what model produced the code that just went to production.
Let me put it differently. Two months ago I ran an assessment with a company that was deeply convinced they were at L3. Tool adoption: a hundred percent. Cursor everywhere, MCP wired in, lint rules existed. But after an hour of conversation it turned out their Organization perspective was at L1. Zero audit trail, zero knowledge of who was generating what, zero compliance gates. Their Infrastructure perspective was at L2 - CI nine minutes, no sandbox for the agent, the agent blocking the team queue every time it iterated. And in Delivery Management they were somewhere between L2 and L3, because some teams had release management for agents and some didn't.
This is typical. Very typical. One metric lies. That's why the matrix describes four perspectives:
Development - how developers work with AI day to day. Coding Agent Usage, Context Engineering, Code Review & Quality, Testing Strategy. This is the perspective most companies start with and the one that's easiest to see.
Delivery Management - how that code actually gets shipped. Branching, PR flow, release management in an era where the agent generates forty commits a day instead of four. This is the perspective companies discover when their 2024 pipeline starts to clog.
Organization - who's accountable. Roles, governance, audit trail, EU AI Act readiness, team structure. Do we have an AI Platform Team, or is everyone on their own? This is the perspective companies discover when an auditor shows up and asks where a specific commit came from.
Infrastructure - the chassis, meaning what I wrote about in posts one and three. CI compute, sandbox isolation, remote caching, hermetic builds, ephemeral environments. This is the perspective companies discover when the GitHub Actions bill quadruples and nobody knows why.
Sixty practices total. Each with concrete criteria and evidence examples - not "we have MCP," but "MCP servers provide structured context, evidence: configuration files listing active context sources, token budget configuration, context coverage audit showing 3+ context levels populated." Because maturity that relies on self-assessment without criteria isn't maturity. It's exactly that median illusion this post started with.
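To show what "evidence, not self-assessment" means in practice, here's a toy version of that context coverage audit. The file names mix common conventions (CLAUDE.md, .cursorrules, CODEOWNERS) with placeholders I made up for the example; the actual criteria live in the matrix, this is only the flavor of turning a claim into something you can run.

```python
# Toy context coverage audit: does the repository actually give an agent
# something to work with? File names below mix common conventions with
# placeholders - adapt the map to your own layout before trusting the score.
import pathlib

CONTEXT_LEVELS = {
    "agent instructions": ["CLAUDE.md", ".cursorrules", "AGENTS.md"],
    "architecture": ["ARCHITECTURE.md", "docs/architecture.md"],
    "ownership": ["CODEOWNERS", ".github/CODEOWNERS"],
    "conventions": ["CONVENTIONS.md", "docs/conventions.md"],
}


def audit(repo: pathlib.Path) -> dict[str, bool]:
    """For each context level, check whether at least one candidate file exists."""
    return {
        level: any((repo / candidate).is_file() for candidate in candidates)
        for level, candidates in CONTEXT_LEVELS.items()
    }


if __name__ == "__main__":
    result = audit(pathlib.Path("."))
    for level, present in result.items():
        print(f"{'OK  ' if present else '--  '}{level}")
    populated = sum(result.values())
    verdict = "meets" if populated >= 3 else "is below"
    print(f"{populated}/{len(result)} context levels populated ({verdict} the 3+ criterion)")
```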
That ROI curve everyone has in their head
There's one moment in every workshop where I see people start to understand why this whole matrix even matters.
I show them a chart. X-axis: maturity level. Y-axis: ROI on AI investment. The curve looks roughly like this: at L1 it's below the line (and deeply below), at L2 it's still below the line (though less than at L1), at L3 it crosses zero, at L4 it climbs fast, at L5 it's asymptotically high but unreachable for most companies.
And then someone (always someone) asks the same question: so if we're at L2, AI is slowing us down?
And I have to say: yes. Statistically, yes. METR demonstrated this in their first study from July 2025. Experienced developers with good tools, in an environment that isn't ready for them, are nineteen percent slower, even though they're convinced they're twenty percent faster. Feeling fast isn't a metric. METR's second study from February 2026 showed that organizations that invested in agentic infrastructure actually accelerated by four to eighteen percent - but the difference between those two groups doesn't lie in the model they use. It lies in the chassis.
And that's the punchline we keep coming back to in this series: most companies are at L1-L2 and think AI doesn't work. Because at L1-L2 it really doesn't. AI works great - from L3 up. But to get there, you have to invest in infrastructure you can't simply buy off a license price list.
That, ultimately, is why the matrix exists. So you can have this conversation in finite time, with concrete pictures, with concrete investment decisions.
Now, the uncomfortable part
Here I need to be honest.
The Visdom Maturity Matrix is, today, a research preview. We publish it under our own domain, with an open taxonomy, with a changelog, no paywall, no "enterprise edition" hidden behind a login. But there's another question we ask ourselves on the team every few weeks, and I want to ask it openly here.
Should a framework for assessing the maturity of an entire industry, in the long term, belong to a single consulting firm?
My answer - and to be clear, this is my personal answer, not VirtusLab's official position - is: no.
And I'll say why, because this isn't false modesty.
DORA started as independent research by Nicole Forsgren, Jez Humble, and Gene Kim, published in the State of DevOps Reports. It became a standard because the metrics were open and vendor-neutral from the start - and they stayed that way even after Google Cloud absorbed the DORA organization. SLSA originated at Google and almost immediately moved to OpenSSF. CMMI started at the Software Engineering Institute at Carnegie Mellon and over time moved to the CMMI Institute, today part of ISACA. OpenTelemetry - straight to CNCF.
Every one of those frameworks started as the artifact of a single organization. Every one became a common language only when it stopped belonging exclusively to that organization.
The Visdom Matrix is too young today to announce that kind of move. There's not enough input from other organizations, not enough mileage in production, not enough weird edge cases ground through it. But that's the direction we want to take this framework - and let me say openly that if it catches on, it shouldn't stay VirtusLab property. It should go where the other frameworks like it have gone: to a foundation where the maintainers come from ten different companies, not one.
And finally - this is a business argument, not an altruistic one - being a steward of a standard is far more valuable to VirtusLab than trying to sell a closed framework. A standard builds your brand and sets the context for conversations with the C-level. A closed framework doesn't scale beyond your client list. The fact that someone is using our taxonomy in an internal workshop we weren't even invited to is good news, not bad.
For that to happen, three things have to fall into place. First - the taxonomy has to be good enough that people want to use it. We're working on it. Second - there has to be a contribution ecosystem from people outside VirtusLab. We're working on that more intensely. Third - the right foundation or organization has to appear, one we can hand it over to. Here we're watching what's happening in CNCF, in Linux Foundation AI & Data, sometimes in newly forming organizations. Time will tell.
In the meantime, the matrix is what it is: opinionated but open, unfinished but used.
What you can do this week
Coming back to where this post started - the industry needs a common vocabulary not so we can all agree, but so that when we disagree, we disagree about the same thing.
Three concrete moves to close, in order from least to most engaging - the third one gets its own section below:
Open the Maturity Matrix and read through the four perspectives. The act of reading concrete practices per level is often more educational than the scoring itself - because it names things you previously didn't have a word for. Twenty minutes.
Run the Workshop Assessment with your team. Not solo - absolutely cross-functional, ideally tech lead plus senior engineer plus platform lead plus manager. The discussion during scoring is, in ninety percent of cases, more valuable than the final number, because that's where it comes out that one half of the team thinks you're at L3, and the other half knows it's L1 with good marketing.
Want a hand?
If you take a look at the Maturity Matrix and have any doubts about how to use it - drop us a line. I'm personally happy to meet up and walk through it with you, adding commentary on each section 😊
And if you have feedback on the matrix itself - criticism, missing practices, edge cases it doesn't catch - send it. The best improvements between v0.1 and v0.4 came from developers at companies we've never worked with, and it's their submissions that will ultimately decide whether this matrix matures enough to one day deserve to stop belonging to us.
Because as the previous cycle in this industry showed, a vocabulary matures when the people who didn't write it start using it.
Post 5 in the "Visdom: The Agent-Ready SDLC" series. Next, and last - a practical end-to-end workflow from ticket to production, updated monthly. Because the AI landscape changes every week, and you have a business to run.

