The Macro: AI Is Bad at Learning from Small Amounts of Data
Modern AI is remarkable at tasks where you can throw billions of examples at a model. Language, image recognition, code generation. These are all domains with effectively unlimited training data. But there are entire fields where data is scarce, expensive, or impossible to generate at scale.
Drug discovery requires testing molecules in wet labs. New materials need to be synthesized and characterized. Physics simulations take hours per data point. In these domains, you cannot collect a billion examples. You might have a few thousand, or a few hundred, or even fewer. And current AI approaches that need massive datasets simply do not work here.
This is a fundamental limitation. The most important scientific problems, the ones that could cure diseases, create new materials, and advance physics, are exactly the problems where AI training data is hardest to get. Scaling laws work beautifully when data is abundant. They break when data is sparse.
Confluence Labs, backed by Y Combinator and Paul Graham, builds foundation models optimized for learning efficiency. Their goal is AI that can learn from small amounts of data and accelerate research in domains where every data point is expensive.
The Micro: Two Young Founders, One Record-Setting Benchmark Score
Brent Burdick (23, self-taught engineer and researcher) and Niranjan Baskaran (20, math research award winner at ISEF, Dartmouth dropout) founded Confluence Labs with a bold claim: their approach to AI can handle domains where data is sparse or costly.
The proof point is remarkable. Confluence Labs achieved a 97.9% score on the ARC-AGI-2 benchmark at $11.77 per task, a state-of-the-art result. ARC-AGI-2 is the gold standard for measuring whether AI systems can genuinely reason and acquire new skills from minimal examples.
Their approach combines large language models with discrete program synthesis. Instead of trying to learn patterns from data alone, the AI writes code that describes the transformation it sees in a problem. This is a fundamentally different approach from scaling up neural networks with more data and compute. The AI is not pattern-matching. It is hypothesis-generating.
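To make the idea concrete, here is a minimal sketch of discrete program synthesis in the ARC style: given a few input/output grid pairs, enumerate compositions of primitive transformations until one program explains every example. Everything here is illustrative; the primitive set, the `synthesize` function, and the brute-force search are my own toy construction, not Confluence's actual system (which reportedly uses an LLM to propose candidate programs rather than exhaustive enumeration).

```python
# Toy program synthesis over a tiny DSL of grid transformations.
# All names are illustrative, not Confluence's actual code.
from itertools import product

def rotate90(g):
    # Rotate a grid (list of lists) 90 degrees clockwise.
    return [list(row) for row in zip(*g[::-1])]

def flip_h(g):
    # Mirror each row left-to-right.
    return [row[::-1] for row in g]

def invert(g):
    # Swap 0s and 1s in a binary grid.
    return [[1 - c for c in row] for row in g]

PRIMITIVES = {"rotate90": rotate90, "flip_h": flip_h, "invert": invert}

def run_program(ops, g):
    # Apply a sequence of named primitives to a grid.
    for name in ops:
        g = PRIMITIVES[name](g)
    return g

def synthesize(examples, max_depth=3):
    """Return the first composition of primitives (a tuple of op names)
    consistent with every (input, output) example, or None."""
    for depth in range(1, max_depth + 1):
        for ops in product(PRIMITIVES, repeat=depth):
            if all(run_program(ops, x) == y for x, y in examples):
                return ops
    return None

# Two demonstrations of the same hidden rule: rotate, then invert.
examples = [
    ([[0, 1], [0, 0]], [[1, 1], [1, 0]]),
    ([[1, 1], [0, 1]], [[1, 0], [0, 0]]),
]
program = synthesize(examples)
print(program)  # ('rotate90', 'invert')
```

The key property is the one the article highlights: the search produces an explicit, testable hypothesis (a program) from two examples, rather than statistical pattern-matching over millions.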
The implications extend far beyond benchmarks. If you can build AI that forms and tests hypotheses efficiently, it becomes useful for drug design, materials science, physics research, and any domain where understanding the underlying rules matters more than memorizing examples.
The company has open-sourced its ARC-AGI-2 solver on GitHub, which builds credibility in the research community and attracts talent.
The Verdict
Confluence Labs is doing fundamental AI research with practical implications. The learning efficiency problem is real and underexplored compared to the scaling approach that dominates the industry.
At 30 days: are research labs in drug discovery or materials science testing Confluence’s models on their domain-specific problems?
At 60 days: does the program synthesis approach generalize to problems beyond visual reasoning tasks?
At 90 days: is Confluence building partnerships with research institutions that have valuable but small datasets?
I think Confluence Labs is working on one of the most important unsolved problems in AI. The industry’s obsession with scale has produced impressive results, but it has also created a blind spot around data-efficient learning. If Confluence can crack this, their models become essential tools for every scientific research lab in the world. The ARC-AGI-2 result suggests they are on the right track.