December 18, 2025 edition

Dollyglot

Real-time video avatars powered by AI

Dollyglot Turns a Photo Into a Talking AI Avatar and It Is Less Creepy Than You Think

AI · Consumer · Education · Generative AI

The Macro: AI Characters Are Getting Faces

Text-based AI characters have proven there is a massive market for talking to artificial people. Character.AI built one of the fastest-growing consumer apps in history on the premise that people want to chat with fictional and semi-fictional characters. Replika has been doing it since 2017. The demand is clear. But text is a constraint. Humans are wired for faces, eye contact, and vocal tone. A chatbot that can hold a conversation is interesting. A chatbot that can hold a conversation while looking at you and speaking in a natural voice is something different entirely.

The real-time avatar space has been growing from multiple directions. HeyGen and Synthesia built businesses around pre-recorded video avatars for enterprise use cases like training videos and marketing content. D-ID made “talking head” generation accessible through APIs. Tavus focused on personalized video at scale. But most of these are one-directional. You give them a script, they produce a video. Interactive, real-time, conversational avatars are a harder problem because the system has to generate facial animation, lip sync, and voice simultaneously while processing your input and formulating a response. Latency kills the illusion.

The gaming and metaverse world has been working on real-time avatars for years, but those rely on motion capture hardware or stylized 3D models. What is new is the ability to take a single photograph and generate a photorealistic, animated, talking avatar in real time using AI alone. That capability unlocks use cases that motion capture never could, because the barrier to entry drops from “expensive studio setup” to “upload a selfie.”

The competitive field is active. Soul Machines builds enterprise conversational avatars. Inworld AI (recently acquired) focused on game NPCs. Sieve and LiveKit are building infrastructure layers. But the consumer-facing, “make any person into an interactive avatar” space is still early enough that there is room for new entrants with strong technical foundations.

The Micro: Quant Finance Meets Open-Source Speech Models

Thomas Minassian and Paul-Henri Biojout founded Dollyglot in San Francisco. Thomas is a former quant with a background from HEC Paris. Paul-Henri studied at CentraleSupélec, one of France's top engineering schools. Before Dollyglot, they built Parler-TTS Mini Multilingual, an open-source multilingual text-to-speech model released through Hugging Face. That is no side project: building a multilingual TTS model from scratch and open-sourcing it demonstrates the kind of deep technical capability this product category demands.
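For a sense of what that open-source work looks like from the outside, here is a minimal generation sketch following the usage pattern documented in the Parler-TTS README. The checkpoint ID, prompt, and voice description below are illustrative assumptions; check the model card on Hugging Face for current names.

```python
# Minimal Parler-TTS generation sketch, following the pattern in the
# Parler-TTS README. Checkpoint ID, prompt, and description are
# illustrative placeholders -- consult the Hugging Face model card.
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
repo_id = "parler-tts/parler-tts-mini-multilingual-v1.1"  # assumed checkpoint name

model = ParlerTTSForConditionalGeneration.from_pretrained(repo_id).to(device)
prompt_tokenizer = AutoTokenizer.from_pretrained(repo_id)
# The multilingual checkpoint uses a separate tokenizer for the voice description.
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "Salut, comment vas-tu aujourd'hui ?"  # text to speak (French)
description = "A female speaker delivers her words expressively, at a moderate pace."

input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = prompt_tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generation is conditioned on the natural-language voice description.
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio = generation.cpu().numpy().squeeze()
sf.write("parler_out.wav", audio, model.config.sampling_rate)
```

The notable design choice is steering the voice with a plain-English description rather than a fixed speaker ID, which is part of what makes this kind of model a useful building block for avatar voices.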

The core product takes a single photo and a voice sample and generates a real-time interactive AI avatar. You can have a conversation with it. The system combines real-time video generation with reasoning models so the avatar does not just animate a face. It thinks, responds, and reacts in a way that feels conversational rather than scripted.

The use cases they highlight range from entertainment (talk to Harry Potter about Hogwarts) to education (brainstorm with historical figures) to more practical applications. The company describes itself as building “fully integrated AI companies,” which is a broader framing than just avatars, but the avatar technology is the core capability that everything else builds on.

Coming through YC’s Winter 2025 batch gives them credibility and capital, but the real credibility is the Parler-TTS work. Open-source contributions to the speech and audio AI space are a strong signal that the founders understand the underlying models at a level that goes beyond API integration.

The technical challenge they are solving is genuinely hard. Real-time video generation from a static image requires generating consistent facial movements, accurate lip sync, natural head motion, and appropriate emotional expression, all while maintaining the visual identity of the original photo. Doing this fast enough for a live conversation means the entire pipeline, from speech recognition to reasoning to video generation, has to complete within a few hundred milliseconds. Most avatar systems cheat by using pre-rendered animations or limiting the range of expressions. Dollyglot appears to be going for full generative video, which is more ambitious and more fragile.
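To make the latency constraint concrete, here is a toy comparison of a sequential pipeline against a streaming one. Every number is a hypothetical assumption, not a measurement of Dollyglot's system; the point is that only streaming, where each stage starts on the previous stage's first chunk, gets time-to-first-response into conversational range.

```python
# Illustrative latency budget for one conversational avatar turn.
# All numbers are hypothetical; real systems overlap stages rather
# than running them strictly one after another.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    latency_ms: float  # contribution to the delay before the avatar responds

# Sequential pipeline: each stage waits for the previous one to finish entirely.
sequential = [
    Stage("speech recognition (final transcript)", 300),
    Stage("LLM reasoning (full response)", 900),
    Stage("text-to-speech (full utterance)", 400),
    Stage("video generation (first frame)", 250),
]

# Streaming pipeline: each stage starts on the previous stage's first chunk,
# so only time-to-first-chunk adds to the user-visible delay.
streaming = [
    Stage("streaming ASR (first stable words)", 150),
    Stage("LLM (first token)", 200),
    Stage("TTS (first audio chunk)", 120),
    Stage("video (first frame)", 80),
]

for label, pipeline in [("sequential", sequential), ("streaming", streaming)]:
    total = sum(s.latency_ms for s in pipeline)
    print(f"{label}: {total:.0f} ms to first avatar response")
# Roughly half a second is where a pause starts to feel unnatural in
# conversation, which is why the sequential design is a non-starter.
```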

The website is light on product details and pricing, which is typical for a company this early. The technology is the story right now, not the business model.

The Verdict

I think Dollyglot is tackling one of the most technically challenging problems in consumer AI, and the founding team has the credentials to pull it off. The Parler-TTS open-source work is a genuine differentiator because it shows they can build the hard infrastructure, not just wrap someone else’s API.

The risk is that real-time video avatar generation at conversational latency is really, really hard. If the avatar stutters, if the lip sync drifts, if the face enters the uncanny valley, the entire experience falls apart. Users will tolerate imperfect text. They will not tolerate imperfect faces. The bar for “good enough” is much higher for visual AI than for text AI.

In 30 days, I would want to see average conversation length. If people are talking to avatars for more than a few minutes, the tech is working. In 60 days, the question is whether there is a killer use case emerging. Entertainment, education, companionship, and customer service are all plausible, but consumer products usually win on one use case before expanding. In 90 days, I would want to know the infrastructure cost per conversation minute, because real-time video generation is expensive and the unit economics have to work for this to scale (a rough sketch of that math follows below). The ambition is huge and the team is capable. Whether the technology is ready to deliver on the promise at consumer scale is the open question.
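That back-of-envelope, with every input a placeholder assumption rather than a reported figure:

```python
# Back-of-envelope unit economics for real-time avatar generation.
# Every number here is a placeholder assumption, not reported data.

gpu_cost_per_hour = 2.50         # assumed on-demand price for one inference GPU
concurrent_sessions_per_gpu = 4  # assumed sessions one GPU can serve in real time

cost_per_session_minute = gpu_cost_per_hour / 60 / concurrent_sessions_per_gpu
print(f"compute cost per conversation minute: ${cost_per_session_minute:.4f}")

# If a user talks for 20 minutes a day, 30 days a month:
monthly_minutes = 20 * 30
monthly_compute_cost = monthly_minutes * cost_per_session_minute
print(f"monthly compute cost per daily-active user: ${monthly_compute_cost:.2f}")
# ~$6.25/month at these assumptions -- which is why both the subscription
# price and the number of concurrent sessions per GPU matter so much.
```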