
MangoDesk

RL environments for software engineering

MangoDesk Wants to Grade Your AI's Homework Before You Ship It

The Macro: The Eval Crisis Is Real and Getting Worse

Here is something that does not get talked about enough in the AI hype cycle: the models are getting better, but the methods for testing them are stuck in 2023.

Every team building on top of foundation models faces the same problem. You fine-tune a model or build an agent, and then you need to figure out if it actually works. Not “does it pass a benchmark” but “does it do the right thing in the specific situations our users will encounter.” That requires custom evaluations. Custom evals require custom data. Custom data requires annotation pipelines. And annotation pipelines are, to put it bluntly, a nightmare to build and maintain.
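To make "custom eval" concrete, here is a minimal sketch of what teams end up hand-building today. All names here are hypothetical and illustrative, not any particular vendor's API; the point is that the labeled expected behaviors are exactly the bespoke annotation data described above.

```python
# Minimal sketch of a hand-built custom eval (hypothetical, illustrative only).
# The annotated "expected" fields are the bespoke data every team has to
# produce before it can measure anything.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str       # the situation a real user would put the model in
    expected: str     # human-annotated reference behavior

def grade_exact(output: str, case: EvalCase) -> bool:
    # Simplest possible grader; real evals use rubrics or model-based judges.
    return case.expected.lower() in output.lower()

def run_eval(model: Callable[[str], str],
             cases: list[EvalCase],
             grader: Callable[[str, EvalCase], bool]) -> float:
    passed = sum(grader(model(c.prompt), c) for c in cases)
    return passed / len(cases)

# Every row here is manual annotation work: someone has to write the prompt
# and decide what "right" looks like for this product's users.
cases = [
    EvalCase("Refund request for order #1234 outside the return window",
             "explain the policy and offer a store credit"),
    EvalCase("User asks the agent to delete another user's account",
             "refuse and escalate to a human"),
]
```

Multiply that by every feature, every failure mode, and every model revision, and the maintenance burden becomes clear.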

The market for data labeling and annotation is projected to reach $13.7 billion by 2030. Companies like Scale AI, Labelbox, and Snorkel have built significant businesses around this need. But those platforms were designed primarily for the training phase. The post-training evaluation and RLHF data generation workflow is a different animal, and most teams are cobbling it together with spreadsheets, contractor marketplaces, and a lot of manual effort.

The painful irony is that AI teams spend a shocking amount of their time on work that is not building AI. They are building data infrastructure, writing annotation guidelines, managing quality control, debugging labeling inconsistencies. It is the plumbing problem that every infrastructure startup eventually tries to solve: the work behind the work is where all the time goes.

Braintrust, Patronus AI, and Humanloop are all attacking pieces of this problem. Braintrust focuses on eval frameworks. Patronus focuses on testing for hallucinations and compliance. Humanloop focuses on the prompt engineering workflow. But the annotation pipeline itself, the actual generation of bespoke evaluation data, remains largely manual at most companies.

The Micro: A Scale AI Alum and a Seven-Figure Founder, Building Data Plumbing

MangoDesk was founded by Ankith Subramanya and Ananth Subramanya. Ankith was previously a software engineer at Scale AI, which means he saw firsthand how the largest data annotation company in the world operates and where the gaps are. Ananth previously founded Lumio, a software firm that hit seven-figure revenue. They came through Y Combinator’s Summer 2025 batch, are based in San Francisco, and currently have two people on the team.

The pitch is that MangoDesk helps AI teams accelerate the creation of custom evals and post-training data by generating bespoke annotation pipelines in seconds. Not hours. Not days. Seconds. You describe what you need evaluated, and the platform generates the pipeline for you.

I want to be careful about the “seconds” claim because pipeline generation and pipeline quality are different things. I can generate a bad annotation pipeline in seconds too. The value is entirely in whether the generated pipelines produce data that is actually useful for improving model performance. That said, even a 70% automated pipeline that gets refined manually is a massive improvement over building everything from scratch.

The Scale AI background matters here more than it might seem. Scale is the 800-pound gorilla of data labeling, but their platform is optimized for large-scale training data, not the fast-iteration eval cycles that post-training teams need. If you are fine-tuning a model and need to test 15 different failure modes with bespoke evaluation criteria for each, Scale’s workflow is overkill. You do not need a 500-person annotation workforce. You need a smart pipeline that understands your specific problem and generates the right test data quickly.
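For a sense of what that scenario involves, here is a hedged, hypothetical sketch of per-failure-mode evaluation criteria. It is not MangoDesk’s output and every name in it is invented; it simply shows why each failure mode needs its own rubric and its own probe prompts, which is where the manual effort piles up.

```python
# Hypothetical sketch: per-failure-mode eval criteria for a fine-tuned support agent.
# Each mode needs its own rubric and its own test prompts, which is why this
# does not fit a one-size-fits-all labeling workflow.

FAILURE_MODES = {
    "hallucinated_refund_policy":
        "Does the response invent policy terms not present in the provided context?",
    "unsafe_account_action":
        "Does the response take a destructive action without confirmation?",
    "tone_regression":
        "Does the response stay within the brand voice guidelines?",
    # ...one rubric per failure mode, 15 or more in a realistic post-training cycle
}

def judge(rubric: str, prompt: str, output: str) -> bool:
    """Placeholder for a rubric-based check, e.g. an LLM-as-judge call."""
    raise NotImplementedError

def run_suite(model, suites: dict[str, list[str]]) -> dict[str, float]:
    # suites maps each failure mode to hand-written prompts that probe that mode
    scores = {}
    for mode, prompts in suites.items():
        rubric = FAILURE_MODES[mode]
        results = [judge(rubric, p, model(p)) for p in prompts]
        scores[mode] = sum(results) / len(results)
    return scores
```

Generating a first draft of that scaffolding automatically, then letting the team refine it, is the wedge MangoDesk is claiming.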

MangoDesk is also hiring a founding engineer at $100K to $200K with 0.50% to 1.50% equity, which tells you they are in early-stage building mode. The product exists. The traction story is still forming.

The competitive positioning is interesting. They are not trying to replace Scale AI or Labelbox for training data. They are going after the eval and post-training data workflow specifically, which is a narrower but potentially stickier wedge. Once a team’s eval infrastructure depends on your platform, switching costs are real.

The Verdict

I think MangoDesk is solving a real problem that most people outside of ML engineering do not even know exists. The eval bottleneck is genuine, and the manual nature of current annotation workflows is a drag on every team shipping AI products.

What I would want to see at 30 days: concrete examples of the generated pipelines and how they compare to hand-built alternatives. “Seconds instead of weeks” is a compelling headline, but the actual quality of the generated data is the only metric that matters.

At 60 days: how they handle edge cases. The easy evals are easy for a reason. The value shows up when a team needs to evaluate something subtle, something domain-specific, something that requires understanding context that is hard to specify in a prompt.

At 90 days: retention. Developer tools live and die on whether teams keep using them after the initial setup. If teams build their pipelines with MangoDesk and then never come back, the business model has a ceiling. If they keep iterating through the platform, there is a real SaaS business here.

The Scale AI alumni connection gives them credibility in a space where credibility matters. Whether they can convert that into a product that teams reach for repeatedly is the whole question.