What top engineers read about AI in September 2025
Artur Skowroński
Head of Java/Kotlin Space
Published: Oct 13, 2025 | 21 min read
Welcome to the latest edition. This newsletter is a monthly, noise-free roundup of AI developments that truly matter to engineers and tech leaders — practical, skeptical, and ready for implementation. Instead of chasing every new model or “top 50 tools” list, we focus on what will stand the test of time and genuinely change how we build software.
This month, we’re diving into agents — a concept that’s finally moving beyond experimentation and maturing into a true engineering discipline.
September brought a flurry of launches that could make anyone’s head spin. Two of the leading players, OpenAI and Anthropic, unveiled their visions for the future of agentic engineering by releasing comprehensive toolkits for building AI agents.
OpenAI introduced AgentKit, described as “a complete toolkit for building, deploying, and optimizing agents.” It includes Agent Builder — a visual environment for designing agent flows, ChatKit — a ready-to-embed UI component for integrating agents into applications, and a Connector Registry for centrally managing access to data and tools.
Anthropic responded with the Claude Agent SDK, a suite built on a fundamentally different philosophy: giving agents access to a computer so they can operate much like human developers.
However, focusing on comparing individual features would be a mistake. The real story runs deeper: we’re witnessing the crystallization of two distinct paradigms for building agents. The strategies of both companies reveal fundamentally different philosophies.
OpenAI’s AgentKit, with its visual editor and ready-made UI, appears aimed at a broader audience — including product teams and low-code developers. It’s a platform-centric approach, designed to lower the barrier to entry and accelerate the deployment of agents as turnkey products.
Anthropic’s Claude Agent SDK, on the other hand, takes a more foundational, developer-focused stance. Its guiding principle is to equip agents with the same tools human programmers use — access to the terminal, file system, and the ability to execute scripts. Instead of providing a visual builder, Anthropic gives developers primitives to construct agentic loops built on the cycle: gather context → perform action → verify work → repeat.
We’re observing a classic pattern in the maturation of technology. At first, pioneers build everything from scratch. Then come the platforms, abstracting away repetitive, difficult elements — orchestration, interfaces, connectors — allowing developers to focus on business logic. The simultaneous release of such comprehensive toolkits from the two biggest players signals that, on their end, the underlying technology has become stable enough to be packaged as a product.
The battlefield is shifting from the question, “Can we build an agent?” to “What is the best paradigm for developers to build, deploy, and trust agents on our platform?”
OpenAI is betting that agents will evolve into products in their own right — complete with polished interfaces and visual tools for creation. Anthropic believes that the most powerful agents will emerge when we give them the same raw yet potent capabilities as human experts, integrating them deeply into existing developer workflows. Both visions are likely correct, but they target different market entry points and represent fundamentally different bets on the future of agentic engineering.
Before we go any further, we need to confront a fundamental problem: the term “agent” is used so loosely that it has almost lost its meaning. Depending on the context, it can refer to anything from a chatbot to a fully autonomous “digital employee” meant to replace a human. Such ambiguity is useless from an engineering perspective.
Fortunately, a pragmatic definition is beginning to crystallize within the technical community — one popularized by Simon Willison: “An LLM agent is a model that runs tools in a loop to achieve a goal.” This concise formula, also adopted by Anthropic, is valuable because it breaks the “magic” down into understandable mechanics.
The first part, “tools in a loop,” captures the core mechanism. An LLM doesn’t just generate text; it can invoke external functions (tools): call an API, execute a script, query a database. The tool’s output is then fed back into the model as new context, allowing it to take the next step. This loop continues until the problem is solved. The second part, “to achieve a goal,” is the crucial complement. It clarifies that this isn’t an endless, “autonomous” cycle. The loop is bounded and directed toward a specific, well-defined objective: the stopping condition.
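To make the mechanics concrete, here is a minimal sketch of such a loop in Python. Everything in it is an assumption for illustration: the fake_llm stand-in, the tool names, and the message format are not any particular vendor’s API. What matters is the shape of the loop: call the model, run the requested tool, feed the result back, and stop once the goal is reached (or the step budget runs out).

```python
from typing import Callable

# Hypothetical tools; in a real agent these would call APIs, run scripts,
# or query databases. Names and signatures are illustrative only.
TOOLS: dict[str, Callable[..., str]] = {
    "search_docs": lambda query: f"Top documentation hit for '{query}' ...",
}

def fake_llm(messages: list[dict]) -> dict:
    """Stand-in for a real chat-completion call: first request a tool,
    then return a final answer once a tool result is in the context."""
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool", "name": "search_docs",
                "args": {"query": messages[0]["content"]}}
    return {"type": "final", "content": "Goal reached: summary written."}

def run_agent(goal: str, llm=fake_llm, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):               # bounded, not an endless cycle
        reply = llm(messages)
        if reply["type"] == "final":          # the stopping condition
            return reply["content"]
        result = TOOLS[reply["name"]](**reply["args"])   # run the requested tool
        # Feed the tool's output back into the context for the next step.
        messages.append({"role": "assistant", "content": str(reply)})
        messages.append({"role": "tool", "content": result})
    return "Stopped: step budget exhausted before the goal was reached."

print(run_agent("Find the docs for the deploy command"))
```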
The strength of this definition lies in its freedom from anthropomorphic baggage — the kind that dominates popular discussions about AI “thinking” or “reasoning.” Instead of vague notions of artificial minds, we can talk about concrete engineering problems: what tools the agent has access to, how reliably it executes the loop, and how precisely its goal is defined. Instead of an abstract idea like “let’s build a customer service agent,” a team can ask specific, actionable questions: what tools does our agent need (e.g., fetch_order_history, process_refund), and what goal-oriented loop will it perform to resolve a ticket? This reframes the problem from science fiction to software architecture.
Moreover, this definition also demystifies the concept of “memory” in agents — turning it from a fuzzy metaphor into a design question about state management and context persistence. Short-term memory is simply the history of calls and results within a single loop — everything that fits into the model’s context window. Long-term memory, as Willison suggests, can itself be implemented as just another tool: for example, a pair of functions like save_note(topic, content) and read_notes(topic) that interact with a persistent database.
This means we don’t have to wait for some mythical model with “built-in long-term memory.” We can implement it today, using well-understood engineering patterns.
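As a minimal sketch of that pattern: the function names save_note and read_notes come from the definition above, while the SQLite backing store, the schema, and the example note are illustrative assumptions.

```python
import sqlite3

# Long-term memory as "just another tool": two functions backed by a
# persistent store (here SQLite) that the agent can call like any other tool.
conn = sqlite3.connect("agent_memory.db")
conn.execute("CREATE TABLE IF NOT EXISTS notes (topic TEXT, content TEXT)")

def save_note(topic: str, content: str) -> str:
    conn.execute("INSERT INTO notes VALUES (?, ?)", (topic, content))
    conn.commit()
    return f"Saved note under '{topic}'."

def read_notes(topic: str) -> list[str]:
    rows = conn.execute("SELECT content FROM notes WHERE topic = ?", (topic,))
    return [content for (content,) in rows]

# Registered alongside the agent's other tools, these give it memory that
# survives across sessions and context windows.
print(save_note("deploy", "Staging deploys require an extra approval step."))
print(read_notes("deploy"))
```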
If there’s a product that perfectly embodies the definition of “tools-in-the-loop,” it’s Claude Code 2.0. Threads on Hacker News are full of examples where developers use it to solve complex, multi-step problems that only recently seemed beyond AI’s reach. But where does its extraordinary effectiveness really come from?
As argued in Alephic’s analysis, the “magic” of Claude Code doesn’t lie in a revolutionary language model, but in a far better environment for reasoning and execution. Its power rests on two pillars. The first is native integration with Unix commands. Giving the model (as a tool) access to the command line was a masterstroke. Unix tools — ls, cat, grep, sed — are perfect companions for LLMs: simple, well-documented, and composable. They embody the “Unix philosophy”: do one thing, do it well, and expect the output of one program to become the input of the next. That’s exactly the same mental pattern a model follows when processing the result of one tool to decide how to use the next.
The second pillar is access to the file system, a feature that changes everything. The file system gives the agent state and memory beyond a single reasoning loop and its limited context window. The agent can write notes, create files, analyze intermediate outputs, and accumulate knowledge throughout the problem-solving process.
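Neither pillar requires anything exotic. A rough sketch of both, expressed as two tools an agent could call, might look like the following; this is an illustration of the idea, not a description of Claude Code’s actual internals, and the helper names are made up.

```python
import subprocess
from pathlib import Path

WORKDIR = Path("agent_workspace")
WORKDIR.mkdir(exist_ok=True)

def run_shell(command: str, timeout: int = 30) -> str:
    """Expose the Unix toolbox (ls, cat, grep, sed, ...) as a single tool.
    Output goes back into the loop as context for the next decision.
    shell=True keeps the sketch short; a real agent would sandbox this."""
    result = subprocess.run(command, shell=True, cwd=WORKDIR,
                            capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr

def write_file(path: str, content: str) -> str:
    """File system as memory: notes and intermediate artifacts persist
    beyond a single reasoning step and a single context window."""
    target = WORKDIR / path
    target.write_text(content)
    return f"Wrote {len(content)} characters to {target}"

# One iteration of the loop might look like:
print(write_file("notes.md", "# Findings\n- grep found 3 usages of deploy()\n"))
print(run_shell("grep -n 'Findings' notes.md"))
```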
Viewed through this lens, Claude Code is an LLM that runs a powerful suite of tools — the entire Unix ecosystem — in the loop (an interactive terminal session) to accomplish a user’s goal (for example, “build a web application from scratch”).
Claude Code’s success proves that the future of AI engineering may depend less on inventing entirely new paradigms and more on intelligently combining LLM reasoning with existing, proven, and powerful ecosystems. Its “magic” is the application of AI within one of the most robust engineering environments ever created — the terminal. Moreover, the interface here is not just a means of communication; it is the agent’s environment. Unlike a web UI, which isolates and constrains the model, the terminal exposes the full power of the system. It’s a delicate balance between safety and capability — one that leads us to the next big question.
To learn more about how Claude Code is built, check out The Pragmatic Engineer’s blog; the full article is paywalled, but the free portion is already quite substantial.
Alright, since we now know what makes agents like Claude Code so powerful, it’s time to ask how engineers can harness that power efficiently and safely. This is where a new, essential skill comes in—what Simon Willison calls “designing agentic loops.” It’s no longer just about writing prompts, but about the architecture of the entire problem-solving process an agent goes through.
The biggest challenge is risk. Giving an agent access to the system shell creates, as Solomon Hykes put it, “an LLM destroying its own environment in a loop.” At the same time, to get the best results you often need to take the human out of the loop and run the agent in auto-approval mode, the so-called “YOLO mode” (You Only Live Once), which is inherently dangerous.
Willison therefore proposes three strategies for managing this risk and enabling safe “YOLO-mode” operation. The first is sandboxing—running the agent in a secure, isolated environment, such as a Docker container with no internet access. The second, Willison’s preferred option, is using someone else’s computer. Running agents in ephemeral cloud environments like GitHub Codespaces limits potential damage to a temporary virtual machine. The third strategy is tightly scoped credentials. If an agent needs API keys, you should generate dedicated keys for test environments with very low limits. Willison’s own example is creating a Fly.io key with a $5 spending cap—so even if the agent goes rogue, it can only spend five dollars at most.
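As a sketch of the first strategy, here is one way to wrap an agent-generated command in a throwaway Docker container with no network access. The base image and scratch directory are placeholders; the docker flags used (--rm, --network none, -v, -w) are standard.

```python
import subprocess
from pathlib import Path

def run_sandboxed(agent_command: str, scratch_dir: str = "/tmp/agent-scratch") -> str:
    """Run a command produced by the agent inside an ephemeral, offline container.
    Worst case, a rogue `rm -rf` only destroys the scratch directory."""
    Path(scratch_dir).mkdir(parents=True, exist_ok=True)
    docker_cmd = [
        "docker", "run", "--rm",
        "--network", "none",                 # no internet access
        "-v", f"{scratch_dir}:/workspace",   # only this directory is writable
        "-w", "/workspace",
        "python:3.12-slim",                  # placeholder base image
        "sh", "-c", agent_command,
    ]
    result = subprocess.run(docker_cmd, capture_output=True, text=True, timeout=120)
    return result.stdout + result.stderr

print(run_sandboxed("ls -la && echo 'hello from the sandbox'"))
```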
Agentic loop design works best for problems with clear success criteria that require tedious, repetitive trial-and-error work. These include debugging, performance tuning, dependency upgrades, or reducing container image size. The common denominator for all such tasks is the absolute necessity of a solid automated test suite, acting as the agent’s nervous system and providing immediate feedback on its progress.
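A sketch of what that looks like in practice: the loop below treats a pytest run as the success criterion and hands the failure output back to the agent. The propose_fix function is a placeholder for the LLM step and is assumed, not any real API.

```python
import subprocess

def tests_pass() -> tuple[bool, str]:
    """The automated test suite as the agent's feedback signal:
    its output tells the loop whether the goal has been reached."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def propose_fix(failure_report: str) -> None:
    """Placeholder for the agent step: given the failing output, edit the code.
    In a real setup this is where the model calls its editing tools."""
    raise NotImplementedError

def fix_until_green(max_attempts: int = 5) -> bool:
    for _ in range(max_attempts):
        ok, report = tests_pass()
        if ok:                      # clear, automated success criterion
            return True
        propose_fix(report)         # feed the failures back to the agent
    return False                    # give up and hand back to a human
```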
Willison’s approach shows how to use agents. But how do we build the tools these agents use? Vercel’s blog introduces a key distinction between the first and second waves of AI tooling.
The first wave consisted of simple wrappers around existing APIs designed for humans. The problem is that developers and LLMs use APIs in fundamentally different ways. A developer calls create_project, stores the ID, and then calls deploy(ID); they manage state and orchestration themselves, and only have to learn the sequence once. An LLM, by contrast, is stateless in every new conversation: it must rediscover the entire sequence of steps each time, which is inefficient and error-prone.
The solution of the second wave is intent-based tools. Instead of exposing separate functions like create_project, add_env, and deploy, we build a single atomic tool: deploy_project. This tool internally handles all the multi-step logic deterministically through standard code. The LLM’s role is reduced to invoking one high-level function that matches the user’s intent.
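A sketch of the difference, with hypothetical function names that loosely echo the example above (this is not Vercel’s actual API):

```python
# First wave: thin wrappers around an API designed for humans. The model has
# to orchestrate the sequence itself and carry the project ID in its context.
def create_project(name: str) -> str:
    return f"prj_{name}"                          # dummy implementation

def add_env(project_id: str, key: str, value: str) -> None:
    print(f"set {key} on {project_id}")

def deploy(project_id: str) -> str:
    return f"https://{project_id}.example.com"

# Second wave: a single intent-based tool exposed to the LLM. The multi-step
# orchestration is ordinary, deterministic code; the model only expresses the
# user's intent, and the response is a natural-language message.
def deploy_project(name: str, env: dict[str, str]) -> str:
    project_id = create_project(name)
    for key, value in env.items():
        add_env(project_id, key, value)
    return f"The project has been deployed at {deploy(project_id)}"

print(deploy_project("docs-site", {"NODE_ENV": "production"}))
```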
This concept is the missing piece of the architectural puzzle. The agentic loops Willison designs in his terminal are the “frontend” of agentic programming. The intent-based tools described by Vercel are the robust and reliable “backend” those loops should communicate with. This division of responsibilities—reasoning on the LLM side, deterministic execution on the code side—dramatically increases system reliability. It also changes how we think about API design. Instead of documenting individual endpoints, we’ll describe user goals and conversational flows, with responses coming not as 200 OK codes, but as natural-language messages like: “The project has been deployed at…”
How do these concepts translate into everyday work? The picture emerging from leading engineers’ experiences is coherent and fascinating.
Simon Willison describes his new workflow as living in parallel coding agent mode. It’s not about multitasking, but about delegating distinct categories of tasks that agents perform in the background while he focuses on high-level, strategic work such as code review. Typical tasks he delegates include research and proof-of-concept work (evaluating a new library, understanding unfamiliar code), small maintenance chores such as fixing linter warnings or updating dependencies, and well-specified implementation work that follows a detailed plan.
This new model of work demands a new ethos. Tim Kellogg, in How I use AI, proposes two principles. The first is Ownership: you are fully responsible for the AI-generated code. Using an agent is like managing a junior developer—you must review their work and take responsibility for it.
The second is Exploit Gradients: the real power lies in identifying tasks where a small investment in prompt preparation yields huge value, such as quickly creating a prototype or an analytics dashboard.
Together, these perspectives paint a consistent picture: the role of a senior engineer is evolving from that of a craftsman-coder to a manager-orchestrator of a small team of AI agents. The most valuable skills are now problem decomposition, writing precise specifications, managing parallel workstreams, and rigorous quality verification. Skills once reserved for tech leads are becoming essential for every effective engineer.
After this dose of powerful capabilities, let’s end with a story that brings us back down to earth: a reminder of the limits of the technology we work with.
As described in Why do LLMs freak out over the seahorse emoji?, it turns out that leading LLMs, including GPT-5 and Claude 4.5, are absolutely convinced that there exists… a seahorse emoji. The problem is, no such emoji has ever been added to the Unicode standard. When you ask the model to produce it, something strange happens. Instead of admitting the error, the model “freaks out” and starts spitting out long, nonsensical chains of other emoji — fish, horses, cows, pigs...
A technical analysis shows that the model internally constructs a correct vector-space representation of the concept “seahorse + emoji.” The problem arises in the final decoding stage, when it tries to translate that abstract concept into a specific token from its available vocabulary. Since no seahorse token exists, the model “jumps” to the closest semantically related tokens — 🐟 and 🐎 — which causes a feedback loop that generates further nonsense.
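A toy illustration of that failure mode (the vectors below are made up, not taken from any real model): when no exactly matching token exists, the nearest tokens still win the softmax, so the model outputs something plausible-looking rather than nothing.

```python
import numpy as np

# Toy vocabulary embeddings (entirely invented). There is no "seahorse emoji"
# token, only semantically nearby ones.
token_embeddings = {
    "🐟": np.array([0.9, 0.1, 0.0]),   # fish
    "🐎": np.array([0.1, 0.9, 0.0]),   # horse
    "🐄": np.array([0.0, 0.2, 0.9]),   # cow
}

# The model's internal representation of "seahorse + emoji": a sensible blend
# of fish-like and horse-like features.
seahorse_concept = np.array([0.6, 0.6, 0.05])

# Decoding must pick from the tokens that exist, by similarity (softmax over scores).
scores = {tok: float(vec @ seahorse_concept) for tok, vec in token_embeddings.items()}
probs = np.exp(list(scores.values()))
probs /= probs.sum()
for tok, p in zip(scores, probs):
    print(tok, round(float(p), 2))
# The closest existing tokens (🐟, 🐎) dominate, even though neither is what the
# model "meant"; that mismatch is the seed of the feedback loop described above.
```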
This anecdote is a perfect miniature of the hallucination problem and a reminder of why the principle of “ownership” matters so much. If an agent can be so unshakably confident about something so trivial, it can be just as confident while making a subtle business-logic error or introducing a critical security flaw. It’s proof that our role as engineers — our judgment, our responsibility, and our capacity for critical verification — remains, and will long remain, the most important part of the entire system.