The Macro: AI Agents Have a Cost Problem Nobody’s Solving
The AI agent hype cycle is in full swing, and the economics are quietly terrible. Every time an agent performs a multi-step task, it burns through tokens at each decision point. Navigate to a page, call a tool, parse the result, decide what to do next, call another tool. Each step requires an LLM inference. Each inference costs money and adds latency. For a simple five-step workflow, you might be looking at 30 seconds and $0.15 in API costs. Run that workflow 1,000 times a day and you’re spending $150 daily on what is essentially the same sequence of actions repeating with minor variations.
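The back-of-envelope math is worth spelling out. Using the article's illustrative figures (not measured data), the daily waste looks like this:

```python
# Back-of-envelope cost model for a repetitive agent workflow.
# All numbers are the illustrative figures above, not measurements.

STEPS_PER_TASK = 5        # decision points (LLM inferences) per workflow
COST_PER_TASK = 0.15      # USD in API fees for one run
LATENCY_PER_TASK = 30.0   # seconds end-to-end for one run
RUNS_PER_DAY = 1_000

daily_cost = COST_PER_TASK * RUNS_PER_DAY
daily_inference_calls = STEPS_PER_TASK * RUNS_PER_DAY
daily_latency_hours = LATENCY_PER_TASK * RUNS_PER_DAY / 3600

print(f"${daily_cost:.2f}/day in API spend")      # $150.00/day in API spend
print(f"{daily_inference_calls} LLM calls/day")   # 5000 LLM calls/day
print(f"{daily_latency_hours:.1f} hours/day spent waiting on inference")
```

Five thousand inference calls a day to execute what is, most of the time, one known sequence.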
This is the dirty secret of the agent economy. Most agent tasks aren’t novel. They’re repetitive. A customer support agent that resolves a password reset follows roughly the same trajectory every time. A data entry agent that fills out a form from a structured input does the same steps in the same order. An automated testing agent that runs a login flow hits the same buttons every run. The LLM is in the hot path making “decisions” that aren’t really decisions at all.
The existing approaches to this problem are either too coarse or too manual. You could write a deterministic script, but then you lose the flexibility that made agents useful in the first place. You could cache LLM responses, but response caching doesn’t capture the full trajectory of tool calls and their dependencies. You could fine-tune a smaller model, but that’s expensive and brittle. Nobody has cleanly solved the problem of “this agent task has been done before, just replay it.”
The Micro: A GPU Infrastructure Guy Who Noticed the Waste
Erik Dunteman is building Butter as a solo founder, and his background explains why this problem caught his attention. He previously founded Banana.dev, a serverless GPU platform that served over 3,000 AI teams. He saw firsthand how much compute was being wasted on repetitive inference. After Banana, he worked as a growth engineer at Modal.com, another infrastructure company focused on running code in the cloud. The pattern he kept seeing: teams deploying AI agents that burned through GPU hours doing the same thing over and over.
Butter sits between your agent and your LLM as a proxy layer. It records the full trajectory of tool calls, inputs, and outputs for each agent task. When the same or sufficiently similar task comes in again, Butter replays the recorded trajectory deterministically. No LLM call. No token cost. No variability. It’s like muscle memory for agents.
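Butter's internals aren't public, but the record-and-replay idea can be sketched in a few lines. Everything below is hypothetical (class and method names are mine, not Butter's API), and the exact-hash lookup deliberately oversimplifies the matching problem:

```python
import hashlib
import json
from typing import Callable

class TrajectoryCache:
    """Minimal record/replay proxy sketch (hypothetical, not Butter's API).

    On a cache miss, runs the real agent loop and records each tool call.
    On a hit, replays the recorded results verbatim: no LLM inference.
    """

    def __init__(self):
        # task_key -> list of (tool_name, args, result) tuples
        self._trajectories = {}

    def _key(self, task: dict) -> str:
        # Naive exact-match key; real matching must tolerate variation.
        return hashlib.sha256(json.dumps(task, sort_keys=True).encode()).hexdigest()

    def run(self, task: dict, agent_loop: Callable) -> list:
        key = self._key(task)
        if key in self._trajectories:
            # Cache hit: deterministic replay, zero token cost.
            return [result for (_, _, result) in self._trajectories[key]]
        # Cache miss: run the LLM-driven loop, recording every tool call.
        steps = []
        def record(tool_name, args, call_tool):
            result = call_tool(tool_name, args)
            steps.append((tool_name, args, result))
            return result
        agent_loop(task, record)
        self._trajectories[key] = steps
        return [result for (_, _, result) in steps]
```

The proxy placement is the key design choice: because it wraps the agent-to-LLM boundary rather than any one model's API, the same cache works across providers.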
The team is two people, based in San Francisco, in the YC Winter 2025 batch with Jared Friedman as their group partner. They’ve also shipped a related product called Pig, an API for automating Windows desktops using AI, which launched in February 2025; its bVisor sandbox runtime provides a lightweight execution environment for agent actions without requiring a full VM.
What I find compelling about the architecture is that it addresses three problems simultaneously. Speed goes up because replayed trajectories don’t wait for LLM inference. Cost goes down because you’re not paying for tokens on known paths. And reliability improves because deterministic replay eliminates the variance that makes agents unpredictable. An agent that succeeds 95% of the time with the LLM in the loop succeeds 100% of the time on any specific trajectory that replays deterministically.
The Verdict
Butter is building infrastructure for a problem that gets worse as the agent market grows. Every new agent framework, every new deployment of autonomous workflows, every new company building on top of LLMs for task execution creates more demand for exactly this kind of optimization layer. The timing is good.
The technical challenge is trajectory matching. How similar does a new task need to be before you can safely replay a cached trajectory? Too strict and you rarely get cache hits. Too loose and you replay incorrect sequences. The matching algorithm is the product, really, and getting it right requires deep understanding of how agent tool calls compose and where variance matters versus where it doesn’t.
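To make the strictness tradeoff concrete, here is one naive way such a matcher could look. This is purely illustrative, not Butter's algorithm: structural fields must match exactly, free-text fields may drift up to a similarity threshold, and that single threshold is the strict-versus-loose dial.

```python
from difflib import SequenceMatcher

def trajectory_match(new_task: dict, cached_task: dict,
                     threshold: float = 0.9) -> bool:
    """Hypothetical matching rule for deciding whether to replay a cache entry.

    Non-string fields (IDs, flags, counts) must match exactly; string fields
    may vary up to `threshold` similarity. Push threshold toward 1.0 and cache
    hits vanish; loosen it and you risk replaying the wrong sequence.
    """
    if new_task.keys() != cached_task.keys():
        return False
    for field, value in new_task.items():
        cached = cached_task[field]
        if isinstance(value, str) and isinstance(cached, str):
            if SequenceMatcher(None, value, cached).ratio() < threshold:
                return False
        elif value != cached:
            return False
    return True
```

A real system would match on the whole tool-call graph, not a flat dict, and would need per-field rules: a username can vary freely, a destructive action name cannot. That asymmetry, knowing where variance is safe, is the hard part.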
There’s a competitive risk from the LLM providers themselves. OpenAI, Anthropic, and others have strong incentives to build caching and optimization into their APIs. Prompt caching already exists. It’s not a huge leap to imagine trajectory caching becoming a platform feature. Butter’s advantage is that it’s model-agnostic and sits at the proxy layer, which means it works regardless of which LLM you’re using. That’s a real moat as long as the market stays multi-model, which it likely will for a while.
In 30 days, I’d want to see cache hit rates from production deployments. What percentage of agent tasks are actually being replayed versus falling through to the LLM? At 60 days, the cost savings data becomes the selling point. If Butter can show customers a 50%+ reduction in API spend with equivalent task completion rates, that sells itself. By 90 days, the question is adoption. How many agent frameworks have integrated Butter as a middleware layer, and is it becoming a default part of the agent deployment stack? Infrastructure plays live or die on developer adoption curves, and the window for establishing a new layer in the stack doesn’t stay open forever.