The Macro: AI Agents Have a Stamina Problem
The AI agent hype cycle is in full swing, and most of it deserves the skepticism it’s getting. Every company with an API is calling its product an “agent” now, which has made the term almost meaningless. But underneath the marketing noise, there’s a real technical problem that matters: AI agents are bad at long tasks.
Ask an LLM-based agent to do something that takes three steps and it works fine. Ask it to do something that takes thirty steps and it falls apart. The failure mode isn’t dramatic. The agent doesn’t crash or throw an error. It just starts making slightly wrong decisions around step ten or fifteen, and those small errors compound until the output is useless. Anyone who has tried to use an AI agent for a multi-step research task or a complex workflow has experienced this.
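The compounding is easy to quantify with a back-of-the-envelope model: if each step succeeds with probability p and errors are independent (a simplification, not a claim about any specific agent), a clean n-step run happens with probability p^n. At 95% per-step accuracy, a 30-step run succeeds only about a fifth of the time:

```python
# Back-of-the-envelope model: per-step reliability p compounds over n steps.
# Assumes independent errors -- a simplification, not a real agent model.
def run_success_rate(p: float, n: int) -> float:
    """Probability of completing all n steps without a single error."""
    return p ** n

for steps in (3, 10, 30):
    rate = run_success_rate(0.95, steps)
    print(f"{steps:>2} steps at 95% per-step accuracy: {rate:.0%}")
```

Real agents aren't independent coin flips, but the shape of the curve is the point: per-step accuracy that looks fine at three steps becomes unusable at thirty.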
The reason is architectural. Most agents are built on top of language models that process each step somewhat independently. They don’t have a meaningful sense of where they are in a longer process, what they’ve already tried, or how to recover from a wrong turn. It’s like giving someone driving directions one turn at a time without ever showing them the map.
Cognition (Devin), Adept, and Multi have all taken runs at making agents better at complex tasks, mostly through better prompting, planning layers, or tool-use frameworks. Those approaches help, but they’re working around the core problem rather than solving it. The agent isn’t learning from experience. It’s just getting better instructions.
Reinforcement learning is the obvious answer that’s been surprisingly underused in the agent space. RL is how you teach a system to get better at sequential decision-making through trial and feedback. It’s how AlphaGo learned to play Go. It’s how robotics systems learn to walk. Applying it to LLM-based agents is conceptually straightforward and practically very hard, which is probably why so few companies are doing it.
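To make "trial and feedback" concrete, here is a minimal tabular Q-learning loop on a toy chain task where reward only arrives at the end — a crude stand-in for a long-horizon workflow. This illustrates the general RL idea, not how Aviro or anyone else implements it:

```python
import random

# Toy Q-learning on a 5-step chain: action 1 advances, action 0 stays put,
# and reward arrives only at the final step. Purely illustrative.
N = 5                                   # steps needed to complete the task
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.3       # learning rate, discount, exploration
q = {(s, a): 0.0 for s in range(N) for a in (0, 1)}

random.seed(0)
for _ in range(2000):                   # episodes: cheap trial runs
    s = 0
    for _ in range(50):                 # cap steps per episode
        if random.random() < EPS:
            a = random.choice((0, 1))                     # explore
        else:
            a = max((0, 1), key=lambda x: q[(s, x)])      # exploit
        s2 = s + a                      # action 1 moves one step forward
        r = 1.0 if s2 == N else 0.0     # reward only on completion
        best_next = 0.0 if s2 == N else max(q[(s2, 0)], q[(s2, 1)])
        q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])
        if s2 == N:
            break
        s = s2

# Greedy policy after training: the end-of-task reward has propagated
# backward, so every state prefers "advance".
policy = [max((0, 1), key=lambda x: q[(s, x)]) for s in range(N)]
print(policy)
```

The interesting part is that the reward signal at step five eventually shapes the decision at step one — exactly the credit-assignment machinery that prompting-based approaches lack.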
The Micro: Two Founders, One Benchmark, and Garry Tan’s Attention
Aviro is a two-person team out of San Francisco. Aarush Kukreja and Ayush Karupakula are building Cortex, which they describe as an RL-based intelligence layer for enterprise AI agents. The idea is that Cortex sits on top of existing agent infrastructure and teaches agents to learn from experience so they maintain consistency across complex workflows.
The headline number is striking: Aviro’s deep research agent reportedly outperformed OpenAI’s by 70% on enterprise search tasks and reached number one on Microsoft’s Deep Research benchmark. That’s a bold claim to make publicly, and the fact that they’re making it suggests they have the receipts. Benchmark performance doesn’t always translate to real-world utility, but beating OpenAI by that margin in a category it defined gets your attention.
The product approach is building simulated environments where frontier models can train on long-horizon tool use. Think of it as a gym for AI agents. Instead of deploying an agent into production and hoping it handles a 30-step workflow correctly, you train it in a simulation first, let it fail cheaply, and deploy the version that’s learned from thousands of runs.
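The "gym for agents" shape is familiar from RL tooling: an environment exposes reset/step, the agent acts, and failures cost nothing but compute. A minimal sketch of that interface — every name here is hypothetical, not Cortex's actual API:

```python
class ToolUseEnv:
    """Toy simulated environment for a multi-step tool-use workflow.
    The reset()/step() shape mirrors classic RL gym interfaces; this is
    a hypothetical sketch, not Aviro's product."""

    def __init__(self, n_steps=30):
        self.n_steps = n_steps
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos            # observation: current step index

    def step(self, action):
        # A correct tool call advances the workflow; a wrong one ends the run.
        ok = (action == f"tool_{self.pos}")
        if ok:
            self.pos += 1
        done = (not ok) or self.pos == self.n_steps
        reward = 1.0 if (ok and self.pos == self.n_steps) else 0.0
        return self.pos, reward, done

# A scripted "perfect" agent completes the run; a failed run here costs
# nothing, which is the whole point of training in simulation first.
env = ToolUseEnv(n_steps=5)
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done = env.step(f"tool_{obs}")
    total += reward
print(total)  # 1.0 -- the run completed all 5 steps
```

In production, the scripted agent would be replaced by the policy being trained, and the environment by a simulation of the actual enterprise workflow.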
They’re part of YC’s Spring 2025 batch, and their primary YC partner is Garry Tan, which is notable because Tan doesn’t typically take companies unless he sees something technically distinctive. The go-to-market is enterprise partnerships. Companies building complex automation systems can integrate Cortex to make their agents more reliable over long task horizons.
The site is clean and functional, with links to benchmarks, research writeups, and a demo booking page through Cal.com. There’s no pricing visible, which is typical for enterprise AI infrastructure. The GitHub presence suggests some open-source component, though the core product appears to be commercial.
What’s interesting about the positioning is the specificity. They’re not saying “we make AI agents better.” They’re saying “we fix the specific failure mode where agents degrade over long task sequences.” That’s a narrow enough claim to be testable and broad enough to matter across many use cases.
The Verdict
I think Aviro is working on one of the genuinely hard problems in AI right now. Long-horizon agent reliability is the bottleneck that’s keeping AI agents in the “demo well, deploy badly” category. If RL-based training environments can meaningfully improve agent consistency over 20+ step workflows, the demand for that is enormous.
The benchmark result is the thing that makes me take this seriously. Beating OpenAI by 70% on enterprise search isn’t something you say casually if you’re trying to build a credible enterprise business. Either it’s true, in which case the technical foundation is genuinely strong, or it’s cherry-picked, in which case the first enterprise customer who runs their own evaluation will find out quickly.
The risk is the gap between “better benchmarks” and “better products.” Enterprise AI buyers don’t purchase benchmark scores. They purchase reliability, integration ease, and support. A two-person team selling RL infrastructure to enterprise AI teams is a hard sales motion, because the buyer needs to trust that the company will exist in twelve months and that the integration won’t become a maintenance burden.
Thirty days, I want to see a customer case study. Not a benchmark. A real company that integrated Cortex and measured the improvement in agent reliability on their specific workflow. Sixty days, I want to see whether the enterprise partnerships are generating revenue or still in pilot mode. Ninety days, the team size question becomes critical. RL infrastructure for enterprise AI is a deep technical product that needs deep technical support. Two people can build it. Two people cannot sell, support, and iterate it simultaneously.