The Macro: Benchmarks Are Broken and Everyone Knows It
The AI industry has a measurement problem. The benchmarks everyone uses to evaluate language models (MMLU, HumanEval, GSM8K, and dozens of others) are increasingly meaningless. Models get trained on benchmark data. Scores go up. Whether capability actually improved is unclear. When every new model claims state-of-the-art performance on the same tests, and those tests are available in the training data, the benchmarks stop measuring intelligence and start measuring memorization.
This is not a controversial opinion. Researchers at major labs have publicly acknowledged that benchmark contamination is a serious problem. Models score 90%+ on tests that should be difficult, then fail at basic reasoning tasks that a five-year-old could handle. The gap between benchmark scores and real-world capability is growing, not shrinking.
The ARC-AGI benchmark, created by Francois Chollet, approaches the problem differently. Instead of testing knowledge or pattern-matching on familiar tasks, ARC-AGI tests whether a system can efficiently acquire new skills. Each task is a novel visual pattern-completion puzzle that requires reasoning from just a few examples. There is no training data to memorize. There is no shortcut through statistical pattern-matching. You either understand the underlying rule and apply it, or you do not.
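To make that concrete, here is a minimal sketch of what an ARC-style task looks like. The grid-of-integers, train/test JSON shape mirrors the public ARC dataset; the specific puzzle and the candidate_rule function below are invented for illustration.

```python
# A toy ARC-style task: each grid is a list of rows, each cell an integer 0-9.
# The {"train": [...], "test": [...]} shape mirrors the public ARC dataset;
# the specific puzzle below is invented for illustration.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # expected output is withheld from the solver
    ],
}

def candidate_rule(grid):
    """One hypothesis: the transformation mirrors each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A hypothesis only counts if it reproduces *every* training pair exactly.
if all(candidate_rule(pair["input"]) == pair["output"] for pair in task["train"]):
    print(candidate_rule(task["test"][0]["input"]))  # apply it to the test input
```

Two or three examples are all a solver gets. The rule has to be inferred, verified against every training pair, and then applied to an input it has never seen.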
ARC Prize Foundation, backed by Y Combinator (W25), is the nonprofit organization that maintains and evolves the ARC-AGI benchmark series. Founded by Mike Knoop and Francois Chollet, now led by Greg Kamradt, the foundation runs global competitions, provides research grants, and builds the infrastructure for measuring genuine AI progress.
The Micro: Competitions That Expose What AI Still Cannot Do
The numbers from the 2025 ARC-AGI competition tell the story. A total of 1,454 teams participated. Over $125,000 in prizes was awarded. And the best-performing systems still could not reliably solve the benchmark’s tasks. Current frontier models (the same ones that ace traditional benchmarks) score nowhere near human performance on ARC-AGI. The gap is not small. It is enormous.
This is exactly the point. ARC-AGI is designed to be resistant to the kind of scaling and memorization that makes other benchmarks easy. You cannot solve it by training a bigger model on more data. You need something qualitatively different: the ability to infer rules from minimal examples and apply them to new situations. This is closer to what cognitive scientists mean by “intelligence” than anything measured by multiple-choice knowledge tests.
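One family of approaches that competition entries have actually used is program search: enumerate candidate transformations from a domain-specific language and keep only those consistent with every training pair. A deliberately tiny sketch follows; the four-operation DSL is hypothetical, and real solvers compose far larger operation sets.

```python
# Minimal program search over a tiny, hypothetical DSL of grid operations.
# Real competition solvers use much richer operation sets and compose them.

def identity(g):  return [row[:] for row in g]
def flip_rows(g): return [list(reversed(row)) for row in g]   # mirror left-right
def flip_cols(g): return list(reversed([row[:] for row in g]))  # mirror top-bottom
def transpose(g): return [list(col) for col in zip(*g)]

DSL = [identity, flip_rows, flip_cols, transpose]

def induce(train_pairs):
    """Return every single-op program consistent with all training examples."""
    return [op for op in DSL
            if all(op(p["input"]) == p["output"] for p in train_pairs)]

train = [{"input": [[1, 2], [3, 4]], "output": [[1, 3], [2, 4]]}]
for program in induce(train):
    print(program.__name__)  # prints "transpose"
```

The catch, and the reason ARC-AGI resists brute force, is combinatorial: once transformations compose, the search space explodes, and a solver needs something like abstraction to prune it.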
The foundation is preparing ARC-AGI-3, launching March 25, 2026, which introduces interactive reasoning benchmarks. This is an evolution from static puzzle-solving to dynamic interaction, where the system can ask for additional information or test hypotheses before committing to an answer. This mirrors how humans actually reason: not in a single shot but through an iterative process of hypothesis formation and testing.
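ARC-AGI-3’s actual interface has not been published here, so the toy environment below is purely hypothetical. It only illustrates the structural shift: instead of one grid-in, grid-out call, the solver loops, probing the environment to eliminate hypotheses before committing.

```python
# Toy interactive task, invented to illustrate the static -> interactive shift.
# The hidden rule maps an integer to an integer; the solver may probe a few
# inputs before committing. None of this reflects ARC-AGI-3's real interface.

class ToyEnvironment:
    def __init__(self):
        self._rule = lambda x: 2 * x + 1   # hidden from the solver
        self.probes_used = 0

    def probe(self, x: int) -> int:
        self.probes_used += 1
        return self._rule(x)               # "ask for additional information"

    def submit(self, candidate) -> bool:
        return all(candidate(x) == self._rule(x) for x in range(-5, 6))

def solve(env: ToyEnvironment) -> bool:
    # Hypothesis space: affine rules x -> a*x + b for small a and b.
    hypotheses = [(a, b) for a in range(-3, 4) for b in range(-3, 4)]
    for x in (0, 1, 2):                    # iterate: probe, then prune
        y = env.probe(x)
        hypotheses = [(a, b) for (a, b) in hypotheses if a * x + b == y]
        if len(hypotheses) == 1:
            break                          # converged on a single explanation
    a, b = hypotheses[0]
    return env.submit(lambda x: a * x + b)

env = ToyEnvironment()
print(solve(env), "after", env.probes_used, "probes")  # True after 2 probes
```

The loop body is the point: each probe is a small experiment that shrinks the hypothesis set, which is the iterative hypothesis-formation-and-testing process the new benchmark is meant to measure.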
The organizational model is interesting. As a nonprofit, ARC Prize Foundation is not trying to commercialize the benchmark or sell access to it. Its revenue comes from grants and sponsorships, and the benchmark itself is open. This means anyone can test their models against it, and the results are publicly verifiable. There is no incentive for the foundation to make the benchmark easier or to inflate scores.
The fact that every major AI lab has used ARC-AGI to evaluate their systems is significant validation. When the companies building frontier models voluntarily test against a benchmark that makes their models look bad, the benchmark is measuring something real and important.
The community aspects (Discord, YouTube content, newsletters, and research grants) suggest the foundation is building an ecosystem around the benchmark. The competitions create attention. The grants fund research that might lead to breakthroughs. The community provides a place for researchers working on novel approaches to share ideas.
The Verdict
ARC Prize Foundation is not a company in the traditional sense, and evaluating it requires different criteria than a SaaS startup or a hardware company. The question is not whether it will generate revenue. The question is whether it is advancing the field.
At 30 days after ARC-AGI-3 launch: how many teams are participating, and are new approaches emerging? If the interactive format inspires fundamentally different research directions, the benchmark is doing its job.
At 60 days: is any lab publicly sharing progress on the new benchmark? Published results, even negative ones, push the entire field forward. The value of a good benchmark is in the research it inspires.
At 90 days: has any system achieved meaningfully higher scores on ARC-AGI-3 than on ARC-AGI-2? If scores improve, it suggests the field is making genuine progress on reasoning. If they plateau, it confirms that current approaches have fundamental limitations.
I think ARC Prize Foundation is doing some of the most important work in AI right now. While everyone else optimizes for benchmark scores that do not mean anything, ARC is asking the harder question: can these systems actually think? The answer, so far, is no. And knowing that honestly is more valuable than any inflated leaderboard ranking.