
design-arena

The world's largest crowdsourced benchmark for AI-generated design.

Design Arena Lets the Crowd Decide Which AI Design Tool Actually Works

AI · Design · Benchmarking · Crowdsourcing

The Macro: We Still Do Not Know Which AI Is Best at Design

There is a measurement problem in AI that nobody talks about enough. We have benchmarks for language. We have benchmarks for code. We have benchmarks for math, reasoning, and factual knowledge. We do not have good benchmarks for design.

This gap is not trivial. Billions of dollars are flowing into AI tools that generate websites, apps, logos, images, and videos. Figma, Canva, Adobe, and dozens of startups are all shipping AI design features. Cursor, Bolt, and v0 generate entire frontends from prompts. But if you ask “which model actually produces the best-looking website from a given prompt,” there is no authoritative answer. The evaluation is subjective, fragmented across individual reviews and Twitter threads, and rarely controlled for the same prompt across multiple models.

The traditional benchmarking approach does not work here. You cannot score visual design quality the way you score math problems. There is no ground truth for “does this landing page look good.” MMMU and similar multimodal benchmarks test whether models can understand images, not whether they can create appealing ones. FID and Inception Score measure statistical properties of generated images, not whether a human would actually want to use the output.

What you need is human judgment at scale. Not expert judgment from a panel of ten designers, but broad judgment from thousands of people across different contexts and categories. This is exactly what Chatbot Arena did for language models, turning Elo-style ratings from blind human comparisons into the de facto benchmark for LLM quality. Language models have LMSYS. Design models have had nothing comparable.

Until now.

The Micro: Blind Voting Across 190 Countries

Design Arena was built by Grace Li and Kamryn Ohly. Both studied computer science at Harvard (Grace doubled in neuroscience, Kamryn in education), and both previously worked at a Cupertino-based consumer electronics company. They are running a three-person team out of San Francisco after going through Y Combinator’s Summer 2025 batch.

The product works like this. You pick a category: websites, mobile apps, images, videos, game dev, 3D design, data visualization, UI components, logos, SVG, ASCII art, or text-to-speech. You enter a prompt. Four AI models generate responses simultaneously. You do not know which model produced which output. The results appear in head-to-head matchups, and you vote on which one is better.
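For intuition, here is a minimal sketch of what one blind session might look like, assuming each matchup hides model identities until after the vote. The field names and structure are illustrative assumptions, not Design Arena's actual schema.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Session:
    prompt: str
    category: str
    outputs: dict[str, str]  # model name -> generated output (names hidden from the voter)
    votes: list[tuple[str, str]] = field(default_factory=list)  # (winner, loser) pairs

    def next_matchup(self):
        """Show two outputs side by side without revealing which model made which."""
        a, b = random.sample(list(self.outputs), 2)
        return {"left": self.outputs[a], "right": self.outputs[b], "_models": (a, b)}

    def record_vote(self, matchup, picked_left: bool):
        a, b = matchup["_models"]
        self.votes.append((a, b) if picked_left else (b, a))

# Hypothetical usage: four anonymized outputs for one prompt, one recorded vote.
session = Session(
    prompt="Landing page for a coffee subscription",
    category="websites",
    outputs={"model_a": "<html>...</html>", "model_b": "<html>...</html>",
             "model_c": "<html>...</html>", "model_d": "<html>...</html>"},
)
matchup = session.next_matchup()
session.record_vote(matchup, picked_left=True)
```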

Your votes feed into a Bradley-Terry statistical model that generates Elo-style rankings. The leaderboards update continuously. New models stay flagged until they accumulate at least 50 pairwise evaluations, which prevents a single viral prompt from distorting the ratings. The tournament format generates five pairwise comparisons per session from each four-model bracket, which makes for efficient data collection from what is essentially a fun interactive experience.
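To make the rating mechanics concrete, here is a minimal sketch of a Bradley-Terry fit over pairwise vote tallies, converted to an Elo-style scale and flagged as provisional below 50 evaluations. The model names, vote counts, and the 1000-point anchor are illustrative assumptions, not Design Arena's actual data or implementation.

```python
import math
from collections import defaultdict

def bradley_terry(wins, iters=200):
    """wins[(a, b)] = number of votes preferring a over b. Assumes every model wins at least once."""
    models = {m for pair in wins for m in pair}
    strength = {m: 1.0 for m in models}
    games = defaultdict(int)      # matchups between each unordered pair
    win_count = defaultdict(int)  # total wins per model
    for (a, b), n in wins.items():
        games[frozenset((a, b))] += n
        win_count[a] += n
    for _ in range(iters):        # fixed-point (minorize-maximize style) updates
        new = {}
        for i in models:
            denom = sum(games[frozenset((i, j))] / (strength[i] + strength[j])
                        for j in models if j != i)
            new[i] = win_count[i] / denom if denom else strength[i]
        mean = sum(new.values()) / len(new)
        strength = {m: s / mean for m, s in new.items()}
    # Convert latent strengths to an Elo-style scale anchored at 1000.
    return {m: 1000 + 400 * math.log10(s) for m, s in strength.items()}

# Hypothetical vote tallies between three anonymized models.
votes = {("model_a", "model_b"): 30, ("model_b", "model_a"): 20,
         ("model_a", "model_c"): 8,  ("model_c", "model_a"): 4,
         ("model_b", "model_c"): 6,  ("model_c", "model_b"): 5}
ratings = bradley_terry(votes)
played = defaultdict(int)
for (a, b), n in votes.items():
    played[a] += n
    played[b] += n
for m, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    flag = "" if played[m] >= 50 else "  (provisional: fewer than 50 evaluations)"
    print(f"{m}: {r:.0f}{flag}")
```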

The category breadth is ambitious for a three-person team. Website generation and mobile app design are the obvious high-traffic categories, but including game development, 3D design, and data visualization shows they are thinking about design more broadly than just UI. The text-to-speech category is interesting because it extends the benchmark into a modality that most design benchmarks ignore entirely.

The numbers they are reporting are strong. Active users across 190-plus countries and millions of votes. For a benchmarking platform, community size is the product. A benchmark with ten thousand votes is interesting. A benchmark with millions of votes is authoritative. The gap between those two is enormous and very hard to close once established.

One detail I appreciate: the leaderboard spans models from multiple labs, including Claude Sonnet, GLM, GPT-5, and others. The benchmark is model-agnostic and does not appear to have commercial partnerships that would bias the rankings. That neutrality is essential for credibility.

The Verdict

I think Design Arena is filling a gap that the AI industry badly needs filled. The absence of a credible design benchmark has let every AI design tool claim superiority without evidence. Marketing copy says “best-in-class UI generation” and there is no independent measurement to confirm or deny it. Design Arena provides that measurement.

The LMSYS comparison is the right one. Chatbot Arena became influential because it was simple, fun, and produced rankings that people actually referenced when choosing models. Design Arena has the same structural advantages: low friction to participate, genuinely engaging as an activity, and useful output in the form of community-validated rankings.

In 30 days I want to see whether the leaderboard rankings start appearing in product marketing. When AI design tools begin citing their Design Arena Elo scores, the benchmark will have crossed from interesting project to industry standard.

In 60 days the question is sustainability. Crowdsourced benchmarks need continuous participation to stay relevant. How does Design Arena keep users coming back after the initial novelty? Are there competitive elements, community features, or API access for model developers that create ongoing engagement?

In 90 days I want to understand the business model. Chatbot Arena runs as a research project. Design Arena may need to become a business. Licensing benchmark data to AI companies, offering premium analytics on model performance by category, or providing evaluation-as-a-service for teams building design tools are all plausible paths. The data they are accumulating is genuinely valuable.

This is one of those products where the simplicity of the concept hides the difficulty of execution. Getting millions of design evaluations from people across 190 countries is not something you can replicate by throwing money at it. Design Arena has a head start that compounds with every vote.