March 10, 2026 edition

janus

Evaluate AI Agents with Simulation Environments

Janus Wants to Break Your AI Agent Before Your Customers Do

The Macro: Nobody Knows If Their AI Agent Actually Works

There is a growing gap between how fast companies are deploying AI agents and how well they understand what those agents are doing. Customer service bots, sales assistants, coding agents, and compliance tools are shipping into production at a pace that makes traditional QA look quaint. And the testing infrastructure has not kept up.

The problem is not that people are not trying to test. It is that testing AI agents is fundamentally different from testing traditional software. A conventional application has deterministic behavior. You send the same input, you get the same output. You can write unit tests. You can build regression suites. An AI agent, by contrast, can hallucinate, misinterpret context, violate policies, call the wrong tool, or give a plausible but wrong answer. And it might do all of those things differently every time you run the same scenario.
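To make the contrast concrete, here is a minimal sketch of the difference. Exact-match assertions break on non-deterministic output, so agent tests have to check invariants that hold across repeated runs. Everything here is invented for illustration: `run_agent`, the refund scenario, and the $50 policy cap are hypothetical stand-ins, not anyone's real API.

```python
import random

def run_agent(prompt: str) -> dict:
    # Hypothetical stand-in for a real agent call: the same input
    # produces different phrasing on every run.
    phrasing = random.choice([
        "Your refund has been issued.",
        "I've processed the refund for you.",
        "Refund complete. Anything else?",
    ])
    return {"text": phrasing, "tool_calls": [("issue_refund", {"amount": 40})]}

def check_invariants(result: dict) -> list:
    """Property checks that hold regardless of wording."""
    failures = []
    calls = dict(result["tool_calls"])
    if "issue_refund" not in calls:
        failures.append("agent never called issue_refund")
    elif calls["issue_refund"]["amount"] > 50:
        failures.append("refund exceeds the $50 policy cap")
    return failures

# An exact-output assertion would fail on most runs here;
# the invariant check passes on every run.
for _ in range(20):
    assert check_invariants(run_agent("Customer asks for a $40 refund")) == []
```

The point is that the unit-test contract inverts: instead of asserting one known-good output, you assert properties of behavior over many samples.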

The AI testing and evaluation market is still early, but it is growing fast. Companies like Braintrust, Humanloop, and Arize are building evaluation frameworks, mostly focused on LLM output quality. LangSmith and Langfuse offer tracing and monitoring for LLM applications. But most of these tools are built around evaluating text outputs. They score whether the model’s response was good. That is useful, but it does not tell you whether the agent made the right decisions, used the right tools, and avoided doing something dangerous in a multi-step workflow.

The harder problem is simulation. Not just grading what the agent said, but putting it in a realistic environment and watching what it does. Does it follow the escalation policy? Does it handle edge cases? Does it call the right API with the right parameters? Does it recover when something fails? Those questions require a different kind of testing infrastructure, one that can simulate the world the agent operates in, not just the prompts it receives.
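In code, "simulate the world" usually means stubbing the agent's tools and recording every call against those stubs, so you can inject failures and watch how the agent responds. A toy sketch, with a hypothetical `SimulatedEnv` and agent (none of this is Janus's actual interface):

```python
class SimulatedEnv:
    """Stubbed backends that record every tool call the agent makes."""
    def __init__(self):
        self.trace = []

    def call(self, tool: str, **params):
        self.trace.append((tool, params))
        # Stubbed responses stand in for real APIs; the lookup is made
        # to fail so we can observe how the agent recovers.
        if tool == "lookup_order":
            return {"status": "not_found"}
        return {"ok": True}

def agent_under_test(env: SimulatedEnv):
    # Hypothetical agent logic: on a failed lookup it should escalate
    # to a human rather than fabricate an answer.
    order = env.call("lookup_order", order_id="A-1001")
    if order.get("status") == "not_found":
        env.call("escalate_to_human", reason="order not found")

env = SimulatedEnv()
agent_under_test(env)
tools_called = [tool for tool, _ in env.trace]
assert "escalate_to_human" in tools_called  # the escalation policy held
```

Because the environment is simulated, you can assert on what the agent did, not just on what it said, and you can replay the same injected failure as often as you like.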

That is the problem Janus is going after.

The Micro: Simulate, Trigger, Trace, Judge

Janus (Y Combinator S25) was co-founded by Shivum Pandove, who studied ML and CS at Carnegie Mellon and previously scaled three startups as a software engineer and product manager. His co-founder Jet rounds out a two-person team based in San Francisco.

The product is structured around a pipeline: generate, trigger, trace, judge, results, improve. You synthetically create tasks for your agent. You execute the agent’s workflows inside a simulation environment. You capture every function call and API interaction. You run those traces through proprietary verification models. You get structured results showing where the agent failed. And then you can annotate those results and re-test immediately.
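The shape of that pipeline can be sketched in a few lines. This is a speculative reconstruction from the stage names above, not Janus's implementation: the task generator, trigger, and judge are all toy stand-ins (the real judge stage uses proprietary verification models).

```python
from dataclasses import dataclass, field

@dataclass
class CaseResult:
    task: str
    trace: list = field(default_factory=list)
    failures: list = field(default_factory=list)

def generate_tasks():
    # Stage 1, generate: synthetic tasks (in practice model-generated).
    return ["refund over policy cap", "order lookup that 404s"]

def trigger(task, trace):
    # Stage 2, trigger: run the agent's workflow in simulation,
    # capturing every step into the trace as we go (stage 3).
    trace.append(("plan", task))
    trace.append(("tool_call", "lookup_order"))

def judge(trace):
    # Stage 4, judge: a toy rule standing in for verification models.
    return [] if any(kind == "tool_call" for kind, _ in trace) else ["no tool used"]

def run_suite():
    results = []
    for task in generate_tasks():
        result = CaseResult(task=task)
        trigger(task, result.trace)
        result.failures = judge(result.trace)
        results.append(result)       # stage 5, results; annotate and
    return results                   # re-run to improve (stage 6)

for r in run_suite():
    assert r.failures == []
```

The value of the structure is the loop at the end: failures become annotations, annotations become new test cases, and the suite re-runs immediately.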

The detection capabilities cover the failure modes that matter most in production: hallucinations, policy violations, tool-call failures, risky answers, and custom evaluations you define yourself. That last one is important. Every company has its own definition of what “wrong” looks like for their specific agent, and a testing tool that cannot accommodate that is limited.
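A custom evaluation is typically just a predicate over the trace, registered alongside the built-in detectors. A hedged sketch of what one might look like; the check, the trace format, and the competitor names are all made up for illustration:

```python
def no_competitor_mentions(trace) -> list:
    # Company-specific check: flag any agent reply naming a competitor.
    banned = {"acme", "globex"}
    return [
        f"mentioned competitor in: {text!r}"
        for kind, text in trace
        if kind == "reply" and any(name in text.lower() for name in banned)
    ]

CUSTOM_CHECKS = [no_competitor_mentions]  # runs next to built-in detectors

trace = [
    ("reply", "You could also try Acme's tool."),
    ("reply", "Happy to help!"),
]
failures = [f for check in CUSTOM_CHECKS for f in check(trace)]
assert len(failures) == 1
```

Because each check is an ordinary function, teams can encode exactly their own definition of "wrong" without waiting for the vendor to support it.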

The product decision I find most interesting is the simulation-first approach. Instead of just monitoring what agents do in production and flagging problems after the fact, Janus wants you to find failures before deployment. That is the difference between a smoke detector and a fire drill. Both have value, but one of them keeps the building from burning down.

The website says Janus is currently available to select enterprises, with a consultation-based onboarding flow. No public pricing. That tracks for an infrastructure product at this stage. Enterprise AI teams evaluating agents at scale are the obvious first buyers, and those deals are not self-serve.

The company has a presence on both X and LinkedIn, and the site includes a Y Combinator backing badge. The positioning is clean: “Before Your AI Goes Live, Make It Earn Your Trust.” That is a good line, and more importantly, it reflects a real concern that I hear from engineering leaders constantly. They know their agents work in demos. They do not know if they work in the wild.

There is a parallel to what companies like Weights & Biases did for ML experiment tracking. Before W&B, people were logging model metrics in spreadsheets and terminal outputs. W&B built the infrastructure layer that made experiment tracking systematic. Janus is trying to do something similar for agent evaluation, turning ad hoc testing into a structured, repeatable process.

The Verdict

I think the market need is genuine and growing. Every company deploying AI agents in production needs better evaluation infrastructure, and most of them are currently doing some combination of manual testing, vibes-based assessment, and hoping for the best. That is not sustainable as these agents take on higher-stakes tasks.

The simulation-first approach is the right architecture. Monitoring production failures is necessary but insufficient. You want to catch the hallucination in staging, not when a customer is on the phone.

At 30 days, I would want to see how many enterprise pilots are running and whether the simulation environments are realistic enough to surface failures that actually happen in production. A simulation that is too simple will miss real problems. One that is too complex will be too slow to iterate on.

At 60 days, the question is whether the resulting datasets are actually feeding post-training loops. That is the flywheel Janus describes on its YC profile: using evaluation data to improve the agents themselves. It is where the long-term value lives.

At 90 days, I would want to see the competitive picture. The evaluation and testing space is getting crowded, and the companies that win will be the ones with the best simulation fidelity and the tightest integration into existing development workflows. Two people is a small team for that ambition, but the focus is sharp.

If you are shipping AI agents into production and your current testing process involves a Slack thread and the phrase “looks good to me,” Janus is worth a conversation.