The Macro: The AI Industry Has a Data Quality Problem It Does Not Like Talking About
I have spent the last year watching AI companies sprint toward increasingly realistic voice and video generation. The demos are stunning. The results in production are often less stunning. And when you dig into why, the answer almost always comes back to training data.
The first wave of generative AI was built on text. Text is plentiful. The internet is a giant text corpus, and while there are licensing questions, the raw material was abundant enough to build foundation models without bespoke data collection. Voice and video are fundamentally different. You cannot just scrape podcasts and YouTube clips without running into consent issues, quality inconsistencies, and the kind of legal exposure that makes investors nervous. The models that sound most human were trained on data that was carefully collected, not passively harvested.
This is a supply chain problem. The AI labs building voice models (ElevenLabs, Resemble, PlayHT, Cartesia, and the growing list of startups building expressive speech synthesis) all need diverse, high-quality, rights-cleared audio data: accents, emotional range, and different speaking contexts, from conversational to formal, from sales calls to casual chats. The models are only as expressive as their training sets. And right now, the training sets are bottlenecked.
The same dynamic is playing out in video. Runway, Pika, Kling, and others are building video generation models that need human movement, facial expression, and gesture data. Synthetic data helps, but it has limits. Models trained only on synthetic data tend to produce outputs that look subtly wrong in ways that are hard to debug.
Scale AI built a massive business on the premise that data labeling is the picks-and-shovels play for machine learning. That thesis proved correct. The question now is whether the same thesis applies one layer deeper, to the raw data collection itself, particularly for modalities where quality and consent are non-negotiable.
The Micro: A Caltech Dropout and a Harvard Researcher Walk Into a Recording Studio
Liva AI collects authentic, rights-cleared voice and video datasets for companies developing realistic AI models. They capture diverse accents, emotional ranges, and varied contexts including sales calls, interviews, and casual conversations. The data is collected in-house with high production quality through a combination of crowdsourcing and partnerships. This is not scraping. This is structured data collection with consent built into the process from day one.
Ashley Mo is the co-founder and CEO. At 17, she published research on detecting lung disease through cough audio; at 18, she led patient audio data collection at MIT. That background in medical audio collection is directly relevant here. Medical data has some of the strictest consent and quality requirements of any domain, and building collection pipelines that satisfy those requirements is a transferable skill. Aoi Otani is the co-founder and CTO, a Harvard Computer Science and Biology graduate with publications in ICML and Nature focused on representation learning and image diffusion models. They are a two-person team based in San Francisco, part of Y Combinator’s Summer 2025 batch.
They are currently delivering a voice dataset to a lab training expressive foundation models. That is deliberately vague, which suggests the client is either a well-known AI lab or a stealth company. Either way, the fact that they have a paying customer at this stage validates that the demand exists.
The competitive landscape is interesting. Scale AI operates at massive scale but is primarily focused on data labeling and annotation, not raw collection. Defined.ai does data collection for NLP and speech, but their focus has historically been broader and less targeted at the cutting edge of generative models. Appen is the legacy player, but they have struggled to stay relevant as the market shifted toward generative AI use cases. The niche Liva occupies (targeted, high-quality, consent-first audio and video data specifically for expressive model training) is surprisingly underserved given how much money is flowing into voice and video AI.
What I keep coming back to is the consent angle. As regulations tighten around AI training data, particularly in the EU and increasingly in the US, companies that can prove their training data was collected with explicit consent will have a real advantage. Models trained on scraped data carry legal risk. Models trained on Liva’s data carry far less. That distinction will matter more with every new lawsuit filed against an AI company over training data rights.
The Verdict
I think Liva AI is positioned at exactly the right choke point. The voice and video AI market is growing fast, and every company in it needs training data that is diverse, high-quality, and legally clean. The team is small but technically strong, and the medical data collection background gives Ashley a head start on the consent and quality infrastructure that this business requires.
The risk is execution at scale. Collecting audio and video data is labor-intensive and expensive. Crowdsourcing helps, but maintaining consistent quality across hundreds or thousands of contributors is a genuine operational challenge. At 30 days, I would want to know the cost per hour of collected data and how that compares to what labs are currently paying through ad hoc collection. At 60 days, I would want to see whether they can serve multiple clients simultaneously without quality degradation. At 90 days, the question is whether this becomes a recurring revenue business or a project-based services business, because the valuation implications are completely different.
If the AI voice market keeps growing at its current pace, and I see no reason it would slow down, the companies supplying the training data will do well. Liva is early, small, and focused. Those are good qualities for a startup trying to own a supply chain that barely exists yet.