How to Test and Evaluate Agentic Systems for Reliability
Agentic systems require a new testing paradigm focused on evaluating trajectories, not just outcomes. This post details core test types, metrics, and tools.

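To make the trajectory-versus-outcome distinction concrete, here is a minimal pytest-style sketch. Everything in it (run_agent, ToolCall, the tool names) is a hypothetical harness rather than any particular framework's API; the point is that the test asserts on the ordered list of tool calls the agent made, not only on the final answer.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class AgentRun:
    answer: str
    trajectory: list[ToolCall]  # every tool call, in the order the agent made them


def run_agent(task: str) -> AgentRun:
    # Stand-in for a real agent harness: a real version would call the model
    # and record each tool invocation. Canned output keeps the sketch runnable.
    return AgentRun(
        answer="Refund issued for order #123.",
        trajectory=[
            ToolCall("lookup_order", {"order_id": "123"}),
            ToolCall("issue_refund", {"order_id": "123", "amount": 49.99}),
        ],
    )


def test_refund_agent_trajectory():
    run = run_agent("Refund order #123")

    # Outcome check: the final answer is acceptable.
    assert "refund" in run.answer.lower()

    # Trajectory checks: the agent got there the right way.
    tool_names = [call.name for call in run.trajectory]
    assert "lookup_order" in tool_names  # it verified the order before acting
    assert tool_names.index("lookup_order") < tool_names.index("issue_refund")
    assert len(run.trajectory) <= 5      # no runaway tool-call loops
```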

This guide explains how developers can craft small, precise rules that make AI coding tools more reliable. It shows practical techniques for structuring, organizing, and enforcing rules to achieve consistent, production-grade output.
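As a taste of what such rules look like in practice, here is a short, hypothetical rule file; the file name and every rule in it are illustrative assumptions, not excerpts from the guide. The value of keeping each rule small and unambiguous is that the assistant can follow it mechanically and a reviewer can check compliance at a glance.

```markdown
# CLAUDE.md (illustrative example)

## Scope
- All source code lives under `src/`; tests mirror its layout under `tests/`.

## Style
- Public functions get explicit return types and a one-line docstring.
- Never add a new dependency without calling it out in the PR description.

## Safety
- Do not edit files under `migrations/`; propose changes in the PR instead.
```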

I've always claimed there's no better way to learn anything than to build something with your own hands... and the second-best way is to do a code review of someone else's code. Today, we are taking on a project that made waves on Twitter (or X) and GitHub, not so much because of the complexity of the code, but because of the philosophy behind it and, above all, because of its author, Andrej Karpathy. In this article, we'll discuss his AI consensus mechanism, llm-council.

Every other Wednesday, we’ll pick one trending repository from the previous week and give it some focused attention by preparing a tutorial, article, or code review – learning from its creators in the process. Today, we’re taking a look at f/git-rewrite-commits, a project that tackles one of the most embarrassing yet universal problems in our industry.

Welcome to the fourth article in the This Month We AIed series. In this edition, we will demonstrate how a simple CLAUDE.md file can transform chaos into clarity, reveal how to use ChatGPT to turn a lightning talk into presentation-as-code, and take you on a journey from chaotic vibe coding to disciplined specification-driven development.

If you're a developer exploring AI coding assistants, you might have encountered Claude Code and wondered how it actually works under the hood. What's the relationship between the command-line tool you install and the large language model that powers it? How does the AI decide when to read your files or run commands? And how do those CLAUDE.md instruction files actually get interpreted? This article will walk you through these questions by clarifying a fundamental distinction that often gets overlooked when people first encounter Claude Code.
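The article answers these questions in depth, but the basic shape of the split is easy to sketch: the CLI is a loop that owns the tools, and the model only proposes which tools to call. The loop below is a generic conceptual model, not Claude Code's actual source; every function, message shape, and tool name in it is an illustrative assumption.

```python
from pathlib import Path

# Simplified conceptual model of a CLI-plus-LLM agent loop. This is NOT
# Claude Code's real implementation; call_model and the tool table are
# stand-ins for the general pattern: the harness owns the tools, and the
# model only proposes which ones to use.

TOOLS = {
    "read_file": lambda path: Path(path).read_text(),
    # a real harness also exposes shell commands, file edits, search, ...
}


def call_model(messages: list[dict]) -> dict:
    # Stand-in for the LLM API call. A real model would return either a
    # tool request or final text; a canned answer keeps the sketch runnable.
    return {"type": "final", "text": "Done."}


def run(task: str, instructions_path: str = "CLAUDE.md") -> str:
    messages = []
    # Instruction files are just text the harness prepends to the context;
    # the model "interprets" them the same way it reads any other prompt.
    if Path(instructions_path).exists():
        messages.append({"role": "system",
                         "content": Path(instructions_path).read_text()})
    messages.append({"role": "user", "content": task})

    while True:
        reply = call_model(messages)
        if reply["type"] == "final":  # model answered in plain text
            return reply["text"]
        # Otherwise the model asked for a tool; the CLI executes it, not the model.
        result = TOOLS[reply["name"]](**reply["arguments"])
        messages.append({"role": "tool", "content": str(result)})
```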

In our “GitHub All-Stars” series, we take “new or little-known” open-source gems that solve real engineering problems and put them under the microscope. Today, we’re looking at toon, a tool that directly tackles the financial and performance overhead of data serialization in the age of AI.
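To see the problem toon is attacking, consider how much of a JSON payload is just repeated keys. The sketch below contrasts plain JSON with a generic header-plus-rows encoding; the compact layout is only an illustration of the idea, not toon's actual syntax, and character counts stand in for token counts.

```python
import json

# Why verbose JSON is expensive in LLM prompts: keys repeat for every record.
# The header-plus-rows layout below illustrates the general idea behind
# compact formats like toon; it is a generic encoding, not toon's real syntax.
records = [
    {"id": 1, "name": "Ada", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
    {"id": 3, "name": "Eve", "role": "user"},
]

as_json = json.dumps(records)

header = list(records[0])
rows = [",".join(str(r[k]) for k in header) for r in records]
as_table = "; ".join([",".join(header)] + rows)

print(len(as_json), as_json)    # every key appears once per record
print(len(as_table), as_table)  # keys appear once; rows carry only values
```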

To see how code assistants work in reality, our engineers ran a focused, half-day AI hackathon inside an active commercial project - a large-scale logistics platform built in Scala and deployed on Kubernetes. The goal was to use AI responsibly in a real, mature system and to explore its possibilities directly in our project domain, making sure the outcomes were practical, safe, and genuinely valuable. Read on to learn about their results.

This week, we’re diving into a project that tackles one of the most fundamental problems in the world of science and engineering: the frustrating gap between the theory described in a research paper and its practical implementation. Anyone who’s ever tried to reproduce results from a paper knows the pain. The code, if it’s even available, is often a tangled mess of one-off scripts and Jupyter notebooks - making reproducibility, the cornerstone of science, more of an art than a craft.

This newsletter is a monthly, noise-free roundup of AI developments that truly matter to engineers and tech leaders — practical, skeptical, and ready for implementation. Instead of chasing every new model or “top 50 tools” list, we focus on what will stand the test of time and genuinely change how we build software. This month, we’re diving into agents — a concept that’s finally moving beyond experimentation and maturing into a true engineering discipline.

We’ve been putting AI to the test. Not in theory, but in controlled experiments. In this edition, you will learn when to use AI to get things done (not to plan them), how to guide it with the right structure, and where the “AI as a teammate” metaphor breaks down today.

There is a fundamental gap in how we understand our own productivity, and for those who feel an inner need to close it, that gap creates an opportunity for Dayflow - a project that aims to redefine how we perceive and analyze our screen time, for better or worse. Dayflow is not yet another time tracker - it’s an ambitious attempt to build a “semantic timeline” or, to use the project’s own metaphor, “a git log for your day”.
