What happens when AI doesn’t just generate content, but embodies it? AI has already mastered the ability to produce realistic photos, videos, and voices, passing the visual and auditory Turing Test. The next big leap is in AI avatars: combining a face with a voice to create a talking character.
Can’t you just generate an image of a face, animate it, and add a voiceover? Not quite. The challenge isn’t just nailing the lip sync — it’s making facial expressions and body language move in tandem. It would be weird if your mouth opened in surprise, but your cheeks and chin didn’t budge! And if a voice sounds excited but the corresponding face doesn’t react, the human-like illusion falls apart.
We’re starting to see real progress here. AI avatars are already being used in content creation, advertising, and corporate communication. Today’s versions are still mostly talking heads — functional, but limited — but we’ve seen some exciting developments in the last few months, and there’s clearly meaningful progress on the horizon.
In this post, we’ll break down what’s working now, what’s next, and the most impressive AI avatar products today, drawn from my hands-on testing of over 20 of them.
AI avatars are a uniquely challenging research problem. To make a talking face, a model needs to learn realistic phoneme-to-viseme mapping: the relationship between speech sounds (phonemes) and their corresponding mouth movements (visemes). If this is “off,” the mouth and voice will look out of sync or even completely disconnected.
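To make the phoneme-to-viseme idea concrete, here's a minimal, illustrative sketch in Python. The viseme classes, phoneme groupings, and timings are simplified assumptions for demonstration, not any production model's actual mapping; real systems learn this relationship (including how neighboring sounds blend into each other) directly from data.

```python
# Illustrative only: a toy phoneme-to-viseme lookup, not a production mapping.
# A simplified viseme inventory: groups of phonemes that share a mouth shape.
PHONEME_TO_VISEME = {
    "P": "lips_closed", "B": "lips_closed", "M": "lips_closed",
    "F": "lip_to_teeth", "V": "lip_to_teeth",
    "AA": "open_jaw", "AE": "open_jaw", "AH": "open_jaw",
    "UW": "rounded_lips", "OW": "rounded_lips",
    "S": "teeth_together", "Z": "teeth_together",
}

def phonemes_to_visemes(timed_phonemes):
    """Map (phoneme, start_sec, end_sec) tuples to a viseme timeline."""
    timeline = []
    for phoneme, start, end in timed_phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        # Merge consecutive identical visemes so the mouth doesn't "re-trigger".
        if timeline and timeline[-1][0] == viseme:
            timeline[-1] = (viseme, timeline[-1][1], end)
        else:
            timeline.append((viseme, start, end))
    return timeline

# Example: "bob" -> B AA B, with rough timings from a forced aligner.
print(phonemes_to_visemes([("B", 0.00, 0.08), ("AA", 0.08, 0.25), ("B", 0.25, 0.33)]))
```

A fixed table like this is exactly what breaks down in practice, which is why learned models that capture timing and coarticulation end up looking so much more natural.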
To make matters even more complex, your mouth isn't the only thing that moves when you talk. The rest of your face moves in concert, along with your upper body and sometimes your hands. And everyone has their own distinct style of speaking. Think about how you speak, compared to your favorite celebrity: even if you're saying the same sentence, your mouths will move differently. If you tried to apply your lip sync to their face, it would look weird.
Over the last few years, this space has evolved significantly from a research perspective. I reviewed over 70 papers on AI talking heads since 2017 and saw a clear progression in model architecture — from CNNs and GANs, to 3D-based approaches like NeRFs and 3D Morphable Models, then to transformers and diffusion models, and most recently, to DiT (diffusion models based on the transformer architecture). The timeline below highlights the most cited papers from each year.
Both the quality of generations and the capabilities of models have improved dramatically. Early approaches were limited. Imagine starting with a single photo of a person, masking the bottom half of their face, and generating new mouth movements based on target facial landmarks derived from the audio input. These models were trained on a limited corpus of quality lip sync data, most of which was closely cropped at the face. More realistic results, like "lip-syncing Obama," required many hours of video of the target person and could only produce a narrow range of outputs.
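As a rough sketch of that early recipe (every component below is a hypothetical placeholder for a learned model, not a real library's API or any specific paper's code), the pipeline looked something like this: encode a short window of audio, predict target mouth landmarks from it, then inpaint the masked lower half of the reference face, frame by frame.

```python
# Schematic of the early "mask-and-inpaint" lip-sync recipe described above.
# Every component is passed in as a callable placeholder; none of these names
# correspond to a real library or a specific paper's implementation.

def early_lip_sync(reference_frame, audio_windows, audio_encoder,
                   landmark_predictor, mask_lower_half, inpaint_generator):
    """Generate lip-synced frames from one reference photo and windowed audio."""
    frames = []
    for window in audio_windows:
        features = audio_encoder(window)            # e.g., mel-spectrogram features
        landmarks = landmark_predictor(features)    # predicted mouth keypoints
        masked_face = mask_lower_half(reference_frame)  # hide the lower half
        # A generator (often a GAN in this era) fills in a new mouth region
        # consistent with the predicted landmarks.
        frames.append(inpaint_generator(masked_face, landmarks))
    return frames
```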
Today’s models are much more flexible and powerful. They can generate half-body or even full-body movement, realistic talking faces, and dynamic background motion — all in the same video! These newer models are trained more like traditional text-to-video models on much larger datasets, using a variety of techniques to maintain lip sync accuracy amid all the motion.
The first preview of this came with ByteDance’s OmniHuman-1 model, which was introduced in February (and was recently made available in Dreamina). The space is moving quickly — Hedra released Character-3 in March, which in our head-to-head testing is now best-in-class for most use cases. Hedra also works for non-human characters, like this talking Waymo, and enables users to prompt emotions and movement via text.
New use cases are also emerging around AI animation, spurred by trends like the Studio Ghibli movement. The video below was generated from a single starting image frame and an audio track. Hedra generated the character’s lip sync and face + upper body movement. And check out the moving characters in the background!
Presenting The Office x Studio Ghibli pic.twitter.com/nHYrGc2uDs
— Justine Moore (@venturetwins) March 27, 2025
There are countless use cases for AI avatars — just imagine all the different places where you interact with a character or watch a video where someone is speaking. We’ve already seen usage across consumers, SMBs, and even enterprises.
This is an early market map. The space is evolving quickly, and the product distinctions are relatively rough. Many products theoretically could make avatars for most or all of these use cases, but we’ve found, in practice, that it’s hard to build the workflow and tune the model to excel at everything. Below, we’ve outlined examples for how each segment of the market is leveraging AI avatars.
Anyone can now create animated characters from a single image, which is a massive unlock for creativity. It’s hard to overstate how meaningful this is for everyday people who want to use AI to tell a story. One of the reasons early AI videos were criticized as “slides of images” is that there were no talking characters (or speech only came in the form of voiceovers).
When you can make something talk, your content becomes much more interesting. And beyond traditional narrative video, you can create things like AI streamers, podcasters, and music videos. The videos linked here were all made on Hedra, which enables users to create dynamic, speaking characters from a single starting image and either an audio clip or a script.
If you’re starting with a video instead of an image, Sync can apply lip sync to make the character’s face fit your audio. And if you want to use real human performance to drive the movement of your character, tools like Runway Act-One and Viggle make it possible.
One of my favorite creators using AI to animate characters is Neural Viz, whose series, “The Monoverse,” imagines a post-human universe populated by Glurons. It’s only a matter of time before we see an explosion of AI-generated shows — or even just standalone influencers — now that the barrier to entry is so much lower.
As avatars become easier to stream in real-time, we also expect to see consumer-facing companies implement them as a core part of their UI. Imagine learning a language with a live AI “coach” that is not just a disembodied voice, but a full character with a face and personality. Companies like Praktika are already doing this, and it will only get more natural over time.
Ads have become one of the first killer use cases of AI avatars. Instead of hiring actors and a production crew, businesses can now have hyper-realistic AI characters promote their products. Companies like Creatify and Arcads make this seamless — just provide a product link and they generate an ad: writing the script, pulling B-roll and images, and “casting” an AI actor.
This has unlocked advertising for businesses that could never afford traditional ad production. It’s particularly popular among ecommerce companies, games, and consumer apps. Chances are, you’ve already seen AI-generated ads on YouTube or TikTok. Now B2B companies are exploring the tech as well, using AI avatars for content marketing or personalized outreach with tools like Yuzu Labs and Vidyard.
Many of these products combine an AI actor — whether a clone of a real person or a unique character — with other assets like product photos, video clips, and music. You can control where these assets appear, or put the video on “autopilot” and let the product pull it together for you. You can either write the script yourself or use an AI-generated one.
Beyond marketing, enterprises are finding a range of applications for AI avatars. A few examples:
Learning and development. Most large companies produce training and educational videos for employees, covering everything from onboarding to compliance, product tutorials, and skill development. AI tools like Synthesia can automate this process, making content creation faster and more scalable. Some roles also require ongoing, video-based training — imagine a salesperson practicing their negotiation skills with an AI avatar from a product like Anam.
Localization. If a company has customers or employees in different countries, it may want to localize content into different languages or switch out cultural references. AI actors make it fast and easy to personalize your videos for different geographies. Thanks to AI voice translation from companies like ElevenLabs, businesses can generate the same video in dozens of languages, with natural-sounding voices.
Executive presence. AI avatars let executives scale their presence by cloning their persona to create personalized content for employees or customers. Instead of filming every product announcement or a “thank you” message, companies can generate a realistic AI twin of their CEO or product lead. We’re also seeing companies like Delphi and Cicero make it easy for thought leaders to interact with and answer questions from people they’d never normally be able to meet 1:1.
Creating a believable AI avatar is a challenge, with each element of realism presenting its own technical hurdles. It’s not just about avoiding the uncanny valley; it’s about solving fundamental problems in animation, speech synthesis, and real-time rendering. Here’s a breakdown of what’s required, why it’s so hard to get right, and where we’re seeing progress:
If you want your avatar to engage in real-time conversations — like joining a Zoom meeting — there are a few other things you need to add:
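At a high level, those additions typically include streaming speech recognition, a dialogue model, text-to-speech, and low-latency lip-synced rendering, with end-to-end response time as the hard constraint. Here's a rough sketch of what that loop might look like; every component interface below is a hypothetical assumption for illustration, not any vendor's actual API.

```python
# Hypothetical sketch of a real-time conversational avatar loop. The objects
# and method names are assumed interfaces, not a specific product's API.

async def conversation_loop(mic, asr, llm, tts, avatar_renderer):
    """Stream: user speech -> text -> reply -> synthesized speech -> lip-synced video."""
    async for audio_chunk in mic.stream():
        user_text = await asr.transcribe(audio_chunk)   # streaming speech-to-text
        if not user_text:
            continue                                    # skip silence
        reply_text = await llm.respond(user_text)       # dialogue model
        reply_audio = await tts.speak(reply_text)       # text-to-speech
        # The renderer must generate lip-synced frames fast enough to keep the
        # total round trip in conversational range (ideally well under a second).
        await avatar_renderer.play(reply_audio)
```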
There’s still so much to build and improve in this space. A few areas that are top-of-mind:
Historically, each AI avatar had one fixed “look.” Their outfit, pose, and environment were static. Some products are starting to offer more options. For example, this character from HeyGen, Raul, has 20 looks! But it would be great to be able to easily transform a character however you want.
Faces have long been the weak link of AI avatars, often looking robotic. That’s starting to change with products like Captions’ new Mirage, which delivers a more natural look and broader range of expressions. We’d love to see AI avatars that understand the emotional context of a script and react appropriately, like looking scared if the character is fleeing from a monster.
Today, the vast majority of avatars have little movement below the face — even basic things like hand gestures. Gesture control has been fairly programmatic: for example, Argil allows you to select different types of body language for each segment of your video. We’re excited to see more natural, inferred motion in the future.
Right now, AI avatars can’t interact with their surroundings. An attainable near-term goal may be enabling them to hold products in ads. Topview has already made progress (see the video below for their process and outcome), and we’re excited to see what’s to come as models improve.
Real-time conversational avatars also unlock entirely new experiences. To name a few potential use cases: doing a video call with an AI doctor, browsing curated products with an AI sales assistant, or FaceTiming with a character from your favorite TV show. The latency and reliability aren’t quite human-level, but they’re getting close. Check out a demo of me chatting with Tavus’ latest model.
One of our main learnings from investing in both foundation model companies and AI applications over the past few years? It’s nearly impossible to predict with any degree of certainty where a given space is headed. However, it feels safe to say that the application layer is poised for rapid growth now that the underlying model quality finally feels good enough to generate AI talking heads that aren’t painful to watch.
We expect this space will give rise to multiple billion-dollar companies, with products segmented by use case and target customer. For example, an executive looking for an AI clone to film videos for customers will need (and be willing to pay for) a higher level of quality and realism than a fan making a quick clip of their favorite anime character to send to friends.
Workflow is also important. If you’re generating ads with AI influencers, you’ll want to use a platform that can automatically pull in product details, write scripts, add B-roll and product photos, push the videos to your social channels, and measure results. On the other hand, if you’re trying to tell a story using AI characters, you’ll prioritize tools that enable you to save and re-use characters and scenes, and easily splice together different types of clips.