The Macro: Voice AI Has a Data Supply Problem Nobody Wants to Admit
I want to talk about a dirty secret in voice AI. Everyone is building speech models. Conversational agents, voice cloning, real-time translation, assistive tech, call center automation. The model architectures are getting impressive. The inference speeds are getting faster. The demos sound almost human.
But where is the training data coming from?
If you dig into this, the answer is uncomfortable. A lot of it comes from scraped podcasts, YouTube videos with dubious licensing, crowdsourced recordings from platforms where quality control means “we played it back and it sounded fine,” and internal datasets that companies built by recording their own employees reading scripts in a conference room. The diversity problem is real too. Most public speech datasets skew heavily toward English, specifically North American English, specifically the kind of English spoken by people who volunteer for data collection tasks on the internet. That is not a representative sample of how humans actually talk.
This matters because voice AI models trained on narrow data produce narrow results. An agent that works great for a Standard American English speaker in a quiet room falls apart when it encounters a Singaporean accent, background noise, emotional speech, or conversational crosstalk. The companies building these models know this. They just have not had great options for fixing it.
The existing data supply chain looks something like this. You can use Common Voice from Mozilla, which is open and crowdsourced but limited in languages and recording quality. You can license datasets from companies like Appen or Lionbridge, which gets expensive fast and still has quality consistency issues. You can build your own pipeline, which is what the well-funded labs do, but it takes months and costs millions. Or you can scrape the internet and hope nobody sues you, which more companies do than will admit publicly.
What is missing is a marketplace layer that connects labs directly to vetted, diverse contributors who are paid to produce high-quality conversational audio on demand. That is what Panels is building.
The Micro: 10,000 Contributors, 20 Languages, and the Boring Infrastructure That Matters
Aaron Wenk and Jason Le came out of Y Combinator’s Summer 2025 batch with a straightforward pitch. They have built a network of over 10,000 vetted voice contributors spread across more than 100 countries, covering 20-plus languages. Labs come to Panels with a specification for the audio data they need. Panels sources it from their contributor network, handles quality control, and delivers structured datasets ready for training.
The word “vetted” is doing heavy lifting here, and I think it should be. The difference between a random person recording themselves on a laptop microphone and a vetted contributor recording in a controlled environment with proper equipment is enormous for training data quality. Background noise, clipping, inconsistent volume levels, room echo: these things compound at scale. A dataset with 10,000 hours of clean, properly segmented conversational audio is worth dramatically more than 100,000 hours of noisy web scrapes.
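To make that concrete, here is a minimal sketch of the kind of automated checks a quality gate might run on every submitted clip: clipping ratio, overall loudness, and a rough noise-floor estimate. This is not Panels' actual pipeline, just an illustration of how cheap these checks are to automate; it assumes mono-ish 16 kHz WAV files and the soundfile library for reading audio.

```python
import numpy as np
import soundfile as sf  # assumed dependency; any WAV reader would do

def qc_metrics(path, clip_threshold=0.999):
    """Compute basic quality metrics for a single speech recording."""
    audio, sr = sf.read(path)          # float samples in [-1, 1]
    if audio.ndim > 1:
        audio = audio.mean(axis=1)     # fold multi-channel down to mono

    # Clipping: fraction of samples at or near full scale.
    clipping_ratio = float(np.mean(np.abs(audio) >= clip_threshold))

    # Loudness: RMS level in dBFS (dB relative to full scale).
    rms = np.sqrt(np.mean(audio ** 2))
    rms_dbfs = 20 * np.log10(rms + 1e-12)

    # Crude noise floor: RMS of the quietest 10% of 100 ms frames,
    # a rough stand-in for room noise between utterances.
    frame = int(0.1 * sr)
    frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
    frame_rms = np.sqrt(np.mean(frames ** 2, axis=1))
    noise_dbfs = 20 * np.log10(np.percentile(frame_rms, 10) + 1e-12)

    return {
        "clipping_ratio": clipping_ratio,
        "rms_dbfs": float(rms_dbfs),
        "noise_floor_dbfs": float(noise_dbfs),
        "snr_estimate_db": float(rms_dbfs - noise_dbfs),
    }

def passes_qc(m):
    """Example gate: reject clipped, too-quiet, or too-noisy clips."""
    return (
        m["clipping_ratio"] < 0.001
        and -35 < m["rms_dbfs"] < -6
        and m["snr_estimate_db"] > 20
    )
```

A filter this simple will not catch everything, and the thresholds here are guesses, but run over every submission it screens out exactly the failure modes that quietly poison a training set at scale.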
The contributor network approach also solves the diversity problem structurally rather than as an afterthought. If you need Tagalog conversational speech with emotional variation, you can request it. If you need Hindi code-switching with English, you can request it. This is not a dataset you download and hope it covers your use case. It is a data pipeline you configure.
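To show what “a pipeline you configure” could look like in practice, here is a hypothetical request specification written as a plain Python dict. Every field name is my own invention; Panels has not published an intake format, so read this as a sketch of the shape of the ask rather than an actual API.

```python
# Hypothetical dataset request; field names are illustrative, not Panels' API.
request = {
    "language": "hi-IN",
    "style": "spontaneous_conversation",   # natural dialogue, not read speech
    "phenomena": ["code_switching_en"],    # Hindi-English code-switching
    "hours": 500,
    "speakers": {
        "min_unique": 400,
        "age_ranges": ["18-29", "30-44", "45-64"],
        "gender_balance": "roughly_even",
    },
    "recording": {
        "sample_rate_hz": 16000,
        "channels": 1,
        "max_noise_floor_dbfs": -55,
    },
    "annotations": ["verbatim_transcript", "speaker_turns", "emotion_tags"],
}
```

The point is less the specific fields than the inversion of the workflow: instead of downloading whatever exists and filtering, the lab states what it needs and the supply side is assembled to match.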
I find the competitive positioning interesting. Panels is not trying to compete with Common Voice on price or with the big data labeling companies on breadth. They are going after the specific intersection of audio quality, contributor diversity, and conversational naturalness that voice AI labs need most. Scale AI dominates the general data labeling market. Appen and Lionbridge have been doing audio collection for years. But the demand for conversational audio specifically, not read speech, not scripted prompts, but actual natural conversation patterns, has outpaced what those platforms were designed to deliver.
The business model is also worth noting. By building a contributor network rather than scraping or licensing existing audio, Panels owns the supply side. Contributors are compensated directly. The data provenance is clean. In a regulatory environment that is getting increasingly hostile to training on unlicensed content, that matters.
The 100-country coverage number is ambitious for a company this early. I would want to know how deep that coverage actually goes. Having three contributors in Nigeria is not the same as having three hundred. The language count matters less than the hours-per-language and the demographic variation within each language.
The Verdict
Panels is solving a real infrastructure problem at the right time. Voice AI investment is accelerating, model architectures are converging on needing more and better training data, and the regulatory landscape is making scrape-and-pray approaches increasingly risky.
At 30 days, I want to see what the actual delivery pipeline looks like. How fast can a lab go from specification to delivered dataset? The turnaround time is the product as much as the data quality is.
At 60 days, I want to know who the paying customers are. If it is the well-funded voice AI startups, the unit economics should work. If they are trying to sell into enterprise procurement at research labs, the sales cycle will be long and painful.
At 90 days, the question is scale. Can the contributor network grow fast enough to meet demand without compromising quality? The vetting process is the moat, but it is also the bottleneck. Every marketplace has to solve this tension eventually.
The boring truth about AI is that the models are only as good as the data. Panels is betting that the audio data layer needs a dedicated, quality-first platform, and I think that bet is right.