Now is the time to reinvent the phone call. Thanks to gen AI, no human will ever again have to make a call. Humans will spend time on the phone only when a call has value to them.
For businesses, this may mean: (1) time and labor cost savings from human callers; (2) potential to re-allocate resources towards increased revenue generation; and (3) reduced risk with more compliant and consistent customer experiences.
For consumers, voice agents can provide access to human-grade services without the need to pay or “match with” an actual human. Currently, this includes therapists, coaches, and companions — in the future, this is likely to encompass a much broader range of experiences built around voice. Like most other consumer software, the “winners” will be unpredictable!
Phone calls are an API to the world — and AI takes this to the next level.
There is massive opportunity at each layer — infra players, consumer interfaces, and enterprise agents. For B2C and B2B voice agents, we have a few hypotheses around the most exciting emergent products:
If you are building here, reach out to omoore@a16z.com and anish@a16z.com.
New, multi-modal models like GPT-4o may change the structure of the stack by “running” several of these layers concurrently via one model. This may reduce latency and cost, and power more natural conversational interfaces — as many agents haven’t been able to reach true human-like quality with the composed stack below.
To function, voice agents need to ingest human speech (ASR), process this input with an LLM and return an output, and then speak back to the human (TTS).
For some companies/approaches, the LLM or a series of LLMs handles the conversational flow and emotionality. In other cases, there are unique engines to add emotion, manage interruptions, etc. “Full stack” voice providers offer this all in one place.
Consumer (B2C) and enterprise (B2B) apps sit on top of this stack. Even using third party providers, apps (typically) plug in a custom LLM — which often also serves as the conversational engine.
Voice agent founders can choose between spinning up an agent on a full stack platform (ex. Retell, Vapi, Bland) or assembling the stack themselves. In making this decision, there are a few key vectors:
These are some of the leading players across each stack level now. This is not a comprehensive market map, but represents the names most commonly raised by voice agent founders. We expect this stack to change significantly as multi-modal models emerge.
If you’re working on a voice agent infrastructure company, reach out to Jennifer Li (jli@a16z.com) and Yoko Li (yli@a16z.com) on our team.
We are transitioning from 1.0 AI voice (phone tree) → 2.0 wave of AI voice (LLM-based). 2.0 companies have been emerging in the last 6 months or so. 1.0 companies may be more accurate now, but the 2.0 approach should be much more scalable and accurate in the long term.
There is unlikely to be one horizontal model or platform that works across all types of enterprise voice agents. There are a few key differences across verticals: (1) call types, tones, and structures; (2) integrations and processes; and (3) GTM and “killer features.”
This is likely to mean an explosion of vertical agents that are highly opinionated in the UI. This requires founding teams with deep domain expertise or interest. Labor is the #1 cost center for many businesses — TAM is large for companies that “get it right.”
The most near-term opportunities may be in industries that live and die by phone appointments, have significant labor shortages, and have low call complexity. As agents become more sophisticated, they will be able to tackle more complex calls.
We’ve seen three primary waves of tech in the B2B voice agent space:
Many voice agent companies are taking a vertical-specific approach for a specific industry (ex. auto services) or a specific type of task (ex. appointment scheduling). This is for a few reasons:
The “strong form” of an AI voice agent will be an entirely LLM-driven conversation, not an interactive voice response (IVR) or phone tree approach. However, because LLMs are not 100% reliable all the way through, there is likely to be some (temporary) “human in the loop” for more sensitive/larger transactions. This also makes vertical-specific workflows particularly important, as they can maximize the probability of success while minimizing human interference with fewer edge cases.
B2B voice agents will need to navigate specialized (or vertical-specific) conversations for which a general LLM is likely insufficient. Many companies are tuning per customer models (using a few hundred or low thousands of data points) and will likely extrapolate this back to a company-wide base model. The custom tuning may even continue for enterprise clients. Note: Some companies may tune a “general” model (to be used across clients) for their specific use case(s), and then prompt on a per-customer basis.
Given their complexity, some prior AI background will be helpful — if not necessary — for spinning up and scaling high quality B2B voice agents. However, understanding how to package the product and wedge into the vertical is likely to be equally important — requiring either domain expertise or strong interest. You don’t need a PhD in AI to build and launch an enterprise voice agent!
Similar to the above, buyers in each vertical have a few specific features or integrations that they are typically looking to see before they will make a purchase. In fact, this may be the proof point that elevates the product from “useful” to “magic” in their assessment. This is another reason why it makes sense to start fairly verticalized.
For verticals with significant revenue concentration in the top companies/providers, voice agent companies may start with enterprises and eventually “trickle down” to SMBs with a self-serve product. SMB customers are desperate for solutions here and are willing to test a variety of options — but may not provide the scale/quality of data that allows a startup to tune the model to enterprise caliber.
In B2B, voice agents largely replace existing phone calls to complete a specific task. For consumer agents, the user has to choose to continue to engage, which is challenging, as voice is not always convenient to interact with. This means the product bar is “higher.”
The first and most obvious application of consumer voice agents is taking expensive or inaccessible human services, and replacing the supplier with an AI. This includes therapy, coaching, tutoring, and more — anything dialogue-based that can be completed virtually.
However, we believe the true magic in B2C voice agents is likely yet to come! We’re looking for products that use the power of voice to enable new kinds of “conversations” that didn’t exist before. This may reinvent the form factor of existing services, or create entirely new ones.
For products that nail the UX, voice agents provide an opportunity to engage consumers at a level never before seen in software — truly mimicking the human connection. This may manifest in the agent as the product, or voice as a mode of a broader product.
So far, the dominant consumer AI voice agents are from large companies, like ChatGPT Voice and Inflection’s Pi app. There are a few reasons consumer voice has been slower to emerge:
We are excited about products and founders that are opinionated on how voice brings unique value to the product — not just “voice for the sake of voice.” In many cases, a voice interface is actually a net negative versus a text interface, as it’s more inconvenient to consume and extract information from.
While voice is difficult to consume, real time voice is even more difficult (vs. async voice messages). We are excited about founders that have a perspective on why their product needs to be built around live conversations — perhaps it’s for human-like companionship, a practice environment, etc.
We suspect that the strong-form products will not be a direct translation of a previously human to human conversation, where the AI voice agent simply plugs in for the human provider. First, it’s difficult to live up to that standard — but more importantly, there is an opportunity to deliver the same value better (more efficiently, more joyfully) using AI.
The leading general consumer AI products (ChatGPT, Pi, Claude) have high quality voice modes. They can meaningfully engage in many types of conversations and interactions. And, they will likely win on latency and conversational flow in the near term as they host their own models and stack.
We are excited to see startups succeed either by tailoring or tuning for a specific type of conversation, or building a UI that provides more context and value to the voice agent experience — ex. tracking progress over time, or steering the conversation/experience in an opinionated way.
If you’re building an AI voice agent, reach out to omoore@a16z.com and anish@a16z.com — we’d love to hear from you.
Olivia Moore is a partner on the consumer investing team at Andreessen Horowitz, where she focuses on AI.
Anish Acharya Anish Acharya is an entrepreneur and general partner at Andreessen Horowitz. At a16z, he focuses on consumer investing, including AI-native products and companies that will help usher in a new era of abundance.