April 24, 2026 edition

zeroeval

Auto-optimizer for AI agents. Calibrated LLM judges, 10x faster optimization.

ZeroEval Makes AI Agents Fix Themselves, and It Actually Works

AI · Developer Tools · AIOps · SaaS · Generative AI

The Macro: The AI Agent Reliability Problem Nobody Has Solved

The AI agent gold rush is in full swing. Every company with more than three engineers is building agents that write code, answer customer questions, process documents, or automate workflows. The tools for building these agents are plentiful: LangChain, CrewAI, AutoGen, and dozens of frameworks make it easy to get a demo working in an afternoon.

Getting a demo working is the easy part. Getting an agent to work reliably in production is a completely different problem, and it is the one that is currently killing most AI agent projects.

The core issue is evaluation. When you build traditional software, you can write unit tests. Input A produces output B, every time, or the test fails. When you build an AI agent, the outputs are probabilistic. The same input might produce three slightly different outputs, and deciding whether each one is “good enough” requires judgment that is hard to codify.
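The contrast is easy to see in code. This is a toy sketch of the gap described above: the `judge` heuristic is a stand-in we made up for illustration, not a real LLM call or any vendor's API.

```python
# Traditional software: one input, one correct output, exact-match test.
def add(a: int, b: int) -> int:
    return a + b

assert add(2, 3) == 5  # passes or fails deterministically, every run

# An agent can return several acceptable phrasings of the same answer,
# so there is no single expected string to assert against.
agent_outputs = [
    "Your refund has been processed.",
    "The refund is complete; you should see it in 3-5 days.",
    "Refund issued successfully.",
]

def judge(output: str) -> bool:
    """Stand-in for an LLM judge: is this output 'good enough'?"""
    # A real judge would score against a rubric; this crude heuristic
    # just checks topical relevance and length.
    return "refund" in output.lower() and len(output) < 200

# Each output must be scored on quality, not matched against a constant.
assert all(judge(o) for o in agent_outputs)
```

The hard part is that the real `judge` must encode fuzzy, team-specific standards, which is exactly where calibration comes in.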

Most teams handle this by having humans review agent outputs. A product manager reads through conversation logs, flags bad responses, and tells the engineer to tweak the prompt. The engineer changes a few words, redeploys, and waits to see if things get better. This cycle takes days to weeks per iteration. At scale, it is completely unsustainable.

The alternative is using LLMs to evaluate LLMs, which sounds circular but is actually the most promising approach. The problem is calibration. An LLM judge that has not been trained on your specific quality standards will flag things that do not matter and miss things that do. “Too verbose” means something different for a customer support agent than for a code review agent. Without calibration, LLM judges produce noise, and teams either ignore the evaluations or spend as much time reviewing the evaluations as they would have spent reviewing the original outputs.

Companies like Braintrust, Langfuse, and Arize are building observability and evaluation platforms. Humanloop focuses on prompt management. But the gap between “we can see what our agent is doing” and “our agent automatically gets better” remains wide. Observability tells you there is a problem. It does not fix the problem.

The Micro: Datadog and Atrato Alumni Closing the Loop

ZeroEval was founded by Jonathan Chavez and Sebastian Crossa. Both are repeat founders who previously worked together at Atrato, a YC W21 company. Jonathan spent time at Datadog working on LLM observability, which means he watched the evaluation problem develop from inside one of the companies best positioned to see it. Sebastian was a founding engineer at Micro, an a16z-backed startup, and co-created llm-stats.com, which reached 60,000 monthly active users.

They came through Y Combinator’s Summer 2025 batch and are based in New York. The team is two people.

The product works in four steps. First, you install ZeroEval by integrating it with your existing agent setup. It supports 30+ coding agents including Cursor, Claude Code, and Codex. Second, it evaluates your agent’s outputs using calibrated LLM judges that score on pass/fail or 1-10 rubrics. Third, and this is the key differentiator, you calibrate the judges by giving feedback on their assessments. When ZeroEval flags something as a failure and you disagree, you tell it. The judge learns from your corrections and aligns with your specific quality standards over time. Fourth, the system automatically rewrites prompts based on the failure patterns it identifies.
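The evaluate-calibrate loop at the heart of steps two and three can be sketched in a few lines. Everything here is hypothetical, our own illustration of the concept; the class, method names, and the length-based failure check are not ZeroEval's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class CalibratedJudge:
    """Toy judge that learns which failure modes a team actually cares about."""
    # Human corrections keyed by failure label:
    # True = this label is a real failure for us, False = we overruled it.
    corrections: dict = field(default_factory=dict)

    def score(self, output: str) -> tuple[bool, str]:
        """Return (passed, reason). A real judge would call an LLM here."""
        if len(output) > 120:
            label = "too_verbose"
            # Respect a prior human override for this failure mode.
            if self.corrections.get(label) is False:
                return True, label
            return False, label
        return True, "ok"

    def correct(self, label: str, is_real_failure: bool) -> None:
        """Step three: feedback that aligns the judge with your standards."""
        self.corrections[label] = is_real_failure

judge = CalibratedJudge()
long_answer = "x" * 150
first_pass, reason = judge.score(long_answer)   # flagged: too_verbose
judge.correct("too_verbose", False)             # team: verbosity is fine here
second_pass, _ = judge.score(long_answer)       # now passes
```

The point of the sketch is the state carried in `corrections`: the same output that failed before calibration passes after a human overrules the judge, which is the "learns your standards" behavior the product claims.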

The optimization results they showcase are specific: improving an agent from 78% to 96% pass rate by extracting patterns from 83 failure signals, categorizing them (43 instances of “too verbose,” 28 of “didn’t address error,” 12 of “too generic”), and rewriting the prompt to address each failure mode.
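The categorization step amounts to tallying labeled failure signals into buckets. The labels and counts below mirror the showcased example; the aggregation code itself is our own illustration, not ZeroEval's implementation.

```python
from collections import Counter

# 83 failure signals, each tagged with the judge's reason for the failure.
failure_signals = (
    ["too verbose"] * 43
    + ["didn't address error"] * 28
    + ["too generic"] * 12
)

patterns = Counter(failure_signals)
assert sum(patterns.values()) == 83
assert patterns.most_common(1)[0] == ("too verbose", 43)

# Each dominant bucket then maps to a targeted edit in the rewritten prompt,
# e.g. the verbosity bucket might yield "keep answers under three sentences".
```

Ranking buckets by frequency is what lets the optimizer attack the highest-impact failure mode first rather than rewriting the prompt blindly.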

The product includes a CLI for querying evaluations and triggering optimization, an MCP server integration that lets agents self-diagnose issues, version control with diff views for prompt iterations, and SOC 2 Type II compliance. That last point matters for enterprise sales. Large companies will not send their agent outputs to an evaluation platform that cannot demonstrate security compliance.

They list engineers from Datadog, Hugging Face, and other companies as users, which suggests the product is already finding traction in the developer tooling community that the founders know best.

The Verdict

I think ZeroEval is attacking the right problem at the right layer. Observability platforms tell you your agent is failing. Prompt management tools let you version your prompts. ZeroEval connects these two things and automates the improvement cycle in between. That feedback loop, from “we see the failure” to “the prompt is automatically better,” is the missing piece in most AI agent workflows.

The calibrated judge concept is sound. The biggest complaint about LLM-as-judge approaches is that they do not match human preferences without significant tuning. By building the calibration step directly into the workflow, ZeroEval sidesteps this problem. You are not trusting a generic judge. You are training a judge that learns your standards.

The competitive risk is real. Braintrust is well-funded and expanding from evaluation into optimization. Langfuse has strong open-source adoption. If either company ships a calibrated optimization feature that works well enough, ZeroEval’s differentiation narrows. The advantage ZeroEval has is focus. They are not building a general observability platform with optimization bolted on. The entire product is oriented around the evaluate-calibrate-optimize loop.

Thirty days, I want to see whether teams that adopt ZeroEval are actually shipping prompt improvements faster than before, measured in iterations per week rather than customer testimonials. Sixty days, the question is whether the calibrated judges hold up across different agent architectures and use cases, or whether they need significant recalibration for each new deployment. Ninety days, I want to know the retention curve. Developer tools live and die on whether teams keep using them after the initial setup. If engineers are actively calibrating judges and reviewing optimizations three months in, the product has found real utility. If usage drops off after the first week, the value proposition is weaker than it appears.