June 11, 2026 edition

halluminate

Data and RL environments to train computer use AI agents.

Halluminate Is Building the Gym Where AI Agents Learn to Do Real Work

AI · Reinforcement Learning · Computer Use · Infrastructure · Developer Tools

The Macro: Computer Use Agents Have a Training Data Problem

The biggest bet in AI right now is computer use. Every major lab is working on agents that can operate a computer the way a human does: clicking buttons, filling forms, navigating between applications, and completing multi-step workflows. Anthropic showed it with Claude. The open-source community is racing to build alternatives. The entire RPA industry is watching nervously.

But there is a fundamental problem that nobody has cleanly solved. How do you train these agents?

You cannot train a computer use agent on a live Salesforce instance. If the agent clicks the wrong button, it modifies real customer data. You cannot train on a live EHR system because the consequences of errors are potentially life-threatening. You cannot train on production banking software for obvious regulatory reasons. The training environments need to look and behave exactly like the real applications, but they cannot be the real applications.

This is not a hypothetical concern. It is the core bottleneck for every team building computer use agents. Reinforcement learning requires thousands or millions of interaction episodes where the agent tries things, fails, learns, and tries again. Each episode needs to start from a clean state. The environment needs to provide reward signals. The state space needs to be rich enough that the agent learns generalizable skills, not just memorized click sequences.
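The episode structure described above maps directly onto the classic Gym-style reset/step contract. The sketch below is purely illustrative: a toy "CRM" environment with one hypothetical task (set an account name field) and a sparse reward on completion. The class, fields, and action format are invented for this example, not anything Halluminate has published.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class SandboxEnv:
    """Toy episodic environment with the reset/step contract RL training needs.

    The 'application' here is a single hypothetical record; real environments
    would expose full application state (DOM trees, screenshots, etc.).
    """
    target_value: str = "Acme Corp"                 # task: set the account name
    state: Dict[str, str] = field(default_factory=dict)

    def reset(self) -> Dict[str, str]:
        # Every episode starts from the same clean snapshot.
        self.state = {"account_name": ""}
        return dict(self.state)

    def step(self, action: Tuple[str, str]) -> Tuple[Dict[str, str], float, bool]:
        # Actions are (field, value) writes; a real agent would emit clicks/keys.
        field_name, value = action
        self.state[field_name] = value
        done = self.state.get("account_name") == self.target_value
        reward = 1.0 if done else 0.0               # sparse reward on completion
        return dict(self.state), reward, done

env = SandboxEnv()
obs = env.reset()
obs, reward, done = env.step(("account_name", "Acme Corp"))
```

The key properties the article calls out are all visible here: `reset()` guarantees a clean starting state for each episode, and `step()` returns a reward signal the learning algorithm can optimize against.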

The existing options are bad. Synthetic environments that do not look like real software teach agents skills that do not transfer. Recorded demonstrations give you imitation learning but not reinforcement learning. Browser-based sandboxes work for web apps but miss desktop applications entirely. There is no clean, scalable infrastructure for training computer use agents on realistic enterprise software.

This is the layer Halluminate is building.

The Micro: Two Founders Building the OpenAI Gym for Enterprise Software

Jerry Wu (CEO) and Wyatt Marshall (CTO) founded Halluminate in 2024 and went through Y Combinator Summer 2025. The team is two people in San Francisco. Wu previously worked in roles that gave him direct exposure to the computer use agent space. Marshall’s background is technical and research-oriented.

The product has three components. First, realistic sandbox environments that replicate popular enterprise systems. Think Salesforce, Slack, and other business applications, but running in managed sandboxes that agents can interact with freely without touching real data. These environments are modeled closely enough that skills learned in the sandbox transfer to the real application.

Second, evaluation and benchmarking. Halluminate has built proprietary datasets for measuring how well computer use agents perform on realistic tasks. They have launched two public benchmarks: BrowserBench at browserbench.ai and WebBench at webbench.ai. They also have an open-source project called Westworld on GitHub. The benchmarking angle is smart because it positions Halluminate as the standard for measuring computer use agent quality, which is a powerful place to be in a market that currently has no agreed-upon evaluation methodology.
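The internals of BrowserBench and WebBench are not public, but the general shape of such an eval harness is simple: each task pairs a prompt with a grader that inspects the agent's final environment state, and the benchmark score is the fraction of graders that pass. Everything below (function names, task schema, the scripted agent) is a hypothetical sketch of that shape, not Halluminate's actual API.

```python
from typing import Callable, Dict, List

Task = Dict[str, object]

def run_benchmark(agent: Callable[[str], dict], tasks: List[Task]) -> float:
    """Return the fraction of tasks whose success checker passes."""
    passed = 0
    for task in tasks:
        final_state = agent(task["prompt"])       # agent acts in a sandbox
        if task["check"](final_state):            # task-specific grader
            passed += 1
    return passed / len(tasks)

tasks = [
    {"prompt": "Create a lead named Jane Doe",
     "check": lambda s: s.get("lead") == "Jane Doe"},
    {"prompt": "Archive the oldest email",
     "check": lambda s: s.get("archived") is True},
]

def scripted_agent(prompt: str) -> dict:
    # Stand-in agent that only handles the lead-creation task.
    return {"lead": "Jane Doe"} if "lead" in prompt else {}

score = run_benchmark(scripted_agent, tasks)      # 1 of 2 tasks pass -> 0.5
```

Note that the grader checks final state rather than the agent's click sequence, which is what lets a benchmark reward any working strategy instead of one memorized path.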

Third, annotation services. When an agent fails at a task, understanding why it failed requires human analysis of the interaction trace. Halluminate provides annotation services that identify failure modes and generate training data from those failures. This creates a flywheel: agents train in the sandbox, fail in specific ways, human annotators label those failures, and the labels feed back into better training data.
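The flywheel above implies a simple data shape: agent runs produce interaction traces, human annotators attach failure-mode labels, and labeled failures are filtered back out as training examples. The schema below is illustrative only; field names and the example failure modes are invented, not Halluminate's format.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Step:
    action: str          # e.g. "click:#save-button"
    observation: str     # e.g. "modal: unsaved changes"

@dataclass
class Trace:
    task: str
    steps: List[Step]
    success: bool
    failure_mode: str = ""   # filled in by a human annotator on failure

def to_training_examples(traces: List[Trace]) -> List[Dict[str, object]]:
    """Keep only annotated failures; each becomes a labeled example."""
    return [
        {"task": t.task, "failure_mode": t.failure_mode, "n_steps": len(t.steps)}
        for t in traces
        if not t.success and t.failure_mode
    ]

traces = [
    Trace("update opportunity stage",
          [Step("click:#stage", "dropdown open")],
          success=False, failure_mode="clicked wrong dropdown item"),
    Trace("log a call", [Step("click:#log", "form open")], success=True),
]
examples = to_training_examples(traces)   # only the labeled failure survives
```

The filter in `to_training_examples` is the feedback edge of the flywheel: successes need no annotation, while each labeled failure becomes a targeted training signal.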

The company is currently focused on financial services, specifically investment banking and private equity workflows. This is a shrewd vertical choice. Financial workflows are complex, high-value, and involve multiple applications. A single analyst might touch Bloomberg, Excel, a CRM, an email client, and a document management system in a single workflow. Automating even parts of that workflow is worth significant money.

The business model appears to combine managed environment access with evaluation services, with pricing likely based on usage volume and the complexity of the environments being replicated. The company is clearly at an early revenue stage, with sales happening through direct outreach and Calendly-scheduled calls.

The Verdict

Halluminate is building infrastructure that the entire computer use agent industry needs but that almost nobody is building independently. Every team training computer use agents is currently cobbling together their own sandbox environments, their own evaluation frameworks, and their own failure analysis pipelines. That is redundant work that could be outsourced to a company that does it well.

The analogy to OpenAI Gym is apt. When OpenAI released Gym in 2016, it became the standard environment for training reinforcement learning agents because it was easier to use a shared set of well-built environments than to build your own. Halluminate could play the same role for computer use agents, and the market is orders of magnitude larger because the end applications are enterprise workflows worth billions.

The risk is timing. Computer use agents are still early enough that the number of teams actively training them is relatively small: Anthropic, a handful of startups such as Cyberdesk, and some research labs. The market for training environments grows in proportion to the number of teams building agents, and that number needs to grow significantly for Halluminate's business to scale.

I also wonder about the moat. Building realistic sandbox environments requires understanding the applications being replicated at a deep level. But the applications themselves change. Salesforce ships updates constantly. Enterprise software evolves. Keeping sandbox environments current with the real applications is ongoing work, and that maintenance burden could be either a moat (because it is hard for competitors to replicate) or a drain (because it consumes engineering resources that could go elsewhere).

At 30 days, I want to see which AI labs are using Halluminate’s environments for actual training runs. At 60 days, the question is whether agents trained in Halluminate’s sandboxes demonstrably outperform agents trained on other data. At 90 days, I want to know whether the benchmarks (BrowserBench and WebBench) are becoming the standard that the community references when comparing computer use agents.

Two people, a very clear thesis, and a market that will either be enormous or will not exist. I think it will be enormous. The question is whether Halluminate can establish itself as the default training infrastructure before the big labs build their own.