March 30, 2027 edition

polymath

Applied Intuition for AI agents

Polymath Is Building the Obstacle Courses That Make AI Agents Smarter

AI · Reinforcement Learning · Infrastructure · Developer Tools

The Macro: AI Agents Need Better Training Grounds

Here is the problem with AI agents right now. They work fine on simple tasks. Ask one to summarize a document or write a function and it performs well. But give it a multi-step task that requires using five different tools over several hours and things fall apart quickly. The agent gets confused about state. It makes errors that compound. It forgets what it was doing three steps ago.

The reason for this is straightforward. Most agents were not trained on long-horizon, multi-tool tasks. The training environments that exist today are too simple. They test whether an agent can use one tool in isolation, not whether it can coordinate across Slack, email, GitHub, Linear, and a spreadsheet simultaneously to complete a realistic piece of work.

Applied Intuition solved a version of this problem for autonomous vehicles. They built simulation environments where self-driving cars could train against realistic road conditions without crashing real vehicles. The AI agent world needs something similar. Training environments that are complex enough, realistic enough, and abundant enough that agents can actually learn to handle the messy reality of knowledge work.

That is what Polymath is building.

The Micro: RL Environments as a Product Category

Polymath is an applied research lab focused on building long-horizon, multi-tool environments for training AI agents. The core thesis is that the key to reliable agent performance is the quality and complexity of the RL environments they train in. Not bigger models. Not more RLHF. Better training grounds.

Their first public output is Horizon-SWE, a benchmark that gives agents a running application with real tools and long-horizon tasks covering the entire software development lifecycle. Leading models currently achieve about 25% on this benchmark. That number tells you a lot. If the best models in the world can only handle a quarter of realistic multi-tool tasks, there is enormous room for improvement.

Dylan Ma and Naren Yenuganti are the cofounders, both UC Berkeley alumni. Dylan previously worked at Hume AI and AWS. Naren came from Plaid and Amazon. They are a two-person team from YC Winter 2026 working with Ankit Gupta. The backgrounds are strong. Both have experience building infrastructure at scale, and both understand the gap between demo-ready AI and production-ready AI.

The environments they build include applications like Slack, email, GitHub, Linear, Excel, and web browsers. Agents get dropped into these environments with realistic data, receive task descriptions, and their performance gets measured against verification rubrics. It is essentially a gym for AI agents, and the workout is designed to expose every weakness.
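To make the "gym for AI agents" idea concrete, here is a minimal sketch of what a multi-tool environment with a verification rubric might look like, written against a Gymnasium-style reset/step interface. All names here (`MultiToolEnv`, `Step`, the rubric checkpoints) are hypothetical illustrations; Polymath's actual API is not public.

```python
# Hypothetical sketch of a multi-tool RL environment. The agent receives a
# task description, acts through named tools, and is scored against a
# verification rubric of checkpoints. Illustrative only, not Polymath's API.
from dataclasses import dataclass


@dataclass
class Step:
    observation: str
    reward: float
    done: bool


class MultiToolEnv:
    """Drop an agent into a task; measure it against a rubric."""

    def __init__(self, task: str, rubric: dict[str, bool]):
        self.task = task
        self.rubric = rubric  # checkpoint name -> completed?
        self.log: list[tuple[str, str]] = []

    def reset(self) -> str:
        """Start a fresh episode and hand the agent its task description."""
        self.log.clear()
        for key in self.rubric:
            self.rubric[key] = False
        return f"Task: {self.task}"

    def step(self, tool: str, action: str) -> Step:
        """Record a tool call and re-score the rubric.

        A real environment would execute the call against live application
        state (Slack, GitHub, Linear, ...); here we just mark any rubric
        checkpoint whose name matches the action.
        """
        self.log.append((tool, action))
        if action in self.rubric:
            self.rubric[action] = True
        done = all(self.rubric.values())
        reward = sum(self.rubric.values()) / len(self.rubric)
        return Step(observation=f"{tool}: ran {action}", reward=reward, done=done)
```

A single episode then reads the way the paragraph above describes it: the agent resets into a task, issues tool calls, and its reward is simply the fraction of rubric checkpoints satisfied, reaching 1.0 only when the whole task is verified complete.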

The Verdict

I think Polymath is building something important. The bottleneck for AI agents is not model capability. It is training data quality. And training data for agents means environments, not text corpora. Someone needs to build the infrastructure that makes RL environments a scalable, repeatable product. Polymath is positioning itself to be that company.

The competitive question is interesting. Scale AI dominates traditional AI training data. Braintrust and LangSmith are building evaluation frameworks. But nobody is really focused on building rich, long-horizon RL environments as a standalone product category. Polymath has a real opportunity to define this space.

The risk is that the big labs build this internally. If Anthropic or another major research lab decides that training environments are a core capability, they will build their own. Polymath needs to move fast enough, and build environments complex enough, that it becomes easier for labs to buy than to build.

In 30 days, I want to see adoption numbers for Horizon-SWE. Is anyone actually using it to benchmark their agents? In 60 days, the question is whether Polymath has enterprise customers paying for custom environments. In 90 days, I want to know whether agents trained in Polymath environments actually perform better on real tasks than agents trained without them. If the answer is yes, this is foundational infrastructure. If the answer is unclear, it is a research project looking for a business model.