Voice is one of the most powerful unlocks for AI application companies. It is the most frequent (and most information-dense) form of human communication, made “programmable” for the first time by AI.
For enterprises, voice AI directly replaces human labor: it’s cheaper, faster, more reliable, and often outperforms humans. Voice agents also allow businesses to be available to their customers 24/7 to answer questions, schedule appointments, or complete purchases. Customer availability and business availability no longer have to match 1:1 (ever tried to call an East Coast bank after 3 p.m. PT?); with voice agents, every business can always be online.
For consumers, we believe voice will be the first, and perhaps the primary, way people interact with AI. This interaction could take the form of an always-available companion or coach, or of democratized services, such as language learning, that were previously inaccessible.
We are just now transitioning from the infrastructure layer to the application layer of AI voice. As models improve, voice will become the wedge, not the product. We are excited about startups using a voice wedge to unlock a broader platform.
2024 was a massive year for AI voice. Since we published our last AI voice update…
Advancements in model development have streamlined the infrastructure “stack,” resulting in voice agents with lower latency and improved performance. This improvement has largely materialized in the last six months with new conversational models.
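To make “streamlined the stack” concrete, here is a minimal sketch of the two architectures. Everything below is an illustrative stub (no real SDK or model API is assumed): the older cascaded pipeline chains three models per conversational turn, while newer conversational models handle speech natively in a single hop.

```python
# Illustrative sketch only: every function is a stub standing in for a
# model call; names and stages are assumptions, not a real SDK.

def speech_to_text(audio: bytes) -> str:
    """ASR stage (stub)."""
    return "caller utterance"

def llm_respond(text: str) -> str:
    """Text LLM stage (stub)."""
    return f"agent reply to: {text}"

def text_to_speech(text: str) -> bytes:
    """TTS stage (stub)."""
    return text.encode()

def cascaded_turn(audio_in: bytes) -> bytes:
    """Old stack: three model hops per turn, each adding latency,
    with tone and emotion lost at the transcription step."""
    return text_to_speech(llm_respond(speech_to_text(audio_in)))

def speech_to_speech(audio: bytes) -> bytes:
    """Single conversational model (stub)."""
    return b"spoken agent reply"

def conversational_turn(audio_in: bytes) -> bytes:
    """New stack: one speech-native model call per turn, which is the
    main source of the recent latency and naturalness gains."""
    return speech_to_speech(audio_in)
```

Collapsing three hops into one is what cuts end-to-end latency, and it also preserves paralinguistic signal (tone, pacing, emotion) that a transcription step throws away.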
These conversational models are also becoming more affordable over time. In December 2024, OpenAI dropped the price of the GPT-4o realtime API by 60% for input (to $40/1M tokens) and 87.5% for cached input (to $2.50/1M tokens). GPT-4o mini is also now available via the realtime API.
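For a rough sense of what that means per minute of conversation: OpenAI’s original realtime API guidance worked out to roughly $0.06 per minute of audio input at the old $100/1M rate, which implies on the order of 600 audio tokens per spoken minute. Treating that token rate as an assumption, a back-of-envelope sketch:

```python
# Back-of-envelope cost per minute of audio input at the new rates.
# ASSUMPTION: ~600 audio tokens per spoken minute, inferred from
# OpenAI's earlier ~$0.06/min guidance at the original $100/1M price.
TOKENS_PER_AUDIO_MINUTE = 600
INPUT_PRICE_PER_1M = 40.00        # USD, post-December 2024 cut
CACHED_INPUT_PRICE_PER_1M = 2.50  # USD, post-December 2024 cut

def input_cost_per_minute(price_per_1m: float) -> float:
    return price_per_1m * TOKENS_PER_AUDIO_MINUTE / 1_000_000

print(f"fresh input:  ${input_cost_per_minute(INPUT_PRICE_PER_1M):.4f}/min")        # ~$0.0240
print(f"cached input: ${input_cost_per_minute(CACHED_INPUT_PRICE_PER_1M):.4f}/min")  # ~$0.0015
```

At these rates, the model cost of listening is a fraction of a cent per minute, which is part of why per-minute pricing (discussed below) is coming under pressure.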
The voice agent market exploded in H2 2024. One data point: companies building with voice represented 22% of the most recent YC class, per Cartesia.
Voice agents are also being added as a capability to more horizontal or multi-modal products.
In 2024, we saw companies at several layers of the conversational voice stack attract both funding and traction.
The most natural early categories for voice agents typically have high existing call center/BPO spend. When calls are instead taken by onshore employees as part of their standard jobs: (1) the pain point (and revenue opportunity) is typically not strong enough, unless a significant number of employees solely take or make calls; and (2) it’s difficult to quantify results and savings to “make the case.”
Each of these primary verticals (financial services, B2C, B2B, government, and healthcare) is likely to have its own core providers, similar to how each has its own systems of record.
We expect to see significant founder activity in the following categories (reach out if you’re building here!):
Outside “call center categories,” we have seen willingness to pay for AI voice agents in coaching or training use cases, largely targeted at high-salary jobs. In these industries, realistic voice agents can essentially act as “simulators” that significantly improve on-the-job performance. This can replace spend on labor (such as sales coaches) or on less effective software.
As one indicator of where early stage founders are building, we look at YC companies.
Since 2020, there have been 90 voice agent companies in YC. The pace is accelerating with each new cohort: 10 of these are in the W25 class, which has yet to be fully announced. Most of the voice agent companies from pre-2023 cohorts pivoted into the space within the past year.
YC founders building voice agents are largely concentrated in B2B- (~69%) and healthcare-focused (~18%) use cases, followed by consumer (~13%).
Within B2B, the most common sub-industries are fintech (16.9%) and ops, largely customer support (12.4%). Within healthcare, voice agents target either the front office (patient-facing) or the back office (pharmacies, insurers, etc.), focusing on general human medicine (11.2%), dental (3.4%), veterinary (2.2%), or physical therapy (1.1%).
Job interviews feel like a non-obvious early use case for voice agents, given the complexity (conducting a full interview with a human) and sensitivity (maintaining a strong candidate experience). However, we’ve seen significant early traction from several startups here — some insights below from customers:
The pain point is especially strong in staffing (43 publicly traded agencies, $650B annual revenue), which involves higher-volume, lower- to medium-skill roles (likely not the 10x engineer hire at an early stage startup). AI interviews can easily replace screening calls, or even more of the process.
"Something like 90% of the candidates we send now make it to first round [with the employer], 75-80% make it to final round. Our numbers were half that before [AI voice interviewing start-up]." —Staffing agency for Fortune 100
Many AI interview products are already performing at or above the level of a human recruiter, for a few reasons:
"The interviewee often starts gaining trust with AI in a way that they might not with the human interviewer. A recruiter may not have the experience to understand what interviewee is saying. AI can read from systems and give responses that are smarter and more engaging." —$200M annual revenue staffing agency
Many companies initially adopted a price-per-minute model, but this approach is increasingly under pressure as model costs decrease (and some customers become aware of those reductions). What will the preferred pricing model look like going forward? It will likely combine a platform fee with a usage-based component. Where does it make sense to charge for implementation or to institute minimum usage requirements?
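One way to frame the platform-fee question is a simple break-even calculation. All rates below are invented for illustration only:

```python
# Hypothetical break-even between pure per-minute pricing and a
# platform fee plus discounted usage. All rates are invented examples.
PER_MINUTE_ONLY = 0.15        # USD/min under pure usage pricing
PLATFORM_FEE = 2_000.00       # USD/month platform fee
DISCOUNTED_PER_MINUTE = 0.08  # USD/min under the platform plan

# The customer is indifferent when: fee + d * minutes == p * minutes
break_even_minutes = PLATFORM_FEE / (PER_MINUTE_ONLY - DISCOUNTED_PER_MINUTE)
print(f"Platform plan wins above ~{break_even_minutes:,.0f} minutes/month")
# ~28,571 minutes/month under these assumptions
```

Below the break-even volume, the pure per-minute plan is cheaper for the customer, which is one argument for pairing platform fees with usage minimums.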
No business or industry relies entirely on calls — email, web chat, text, etc. are important channels. How quickly should companies expand beyond calls into other modalities? Is it better to capture one workflow, end to end, or all calls first?
Many voice agents pitch the end vision of replacing the xMS (the system of record software) in their category. In which categories is this actually possible or likely? And does it matter, if many businesses already pay more to handle calls than they do for the xMS?
Many of the early voice agents we’ve seen come from highly technical teams who put in the work to learn a vertical or market after being pulled in. As the technical barriers lower, will it become more of a GTM game, where teams with less technical but deeper industry expertise are advantaged? How will this look different across verticals?
In some categories, enterprises may want to build an agent themselves using a more horizontal product, versus adopting something built for their specific market or use case. In which industries and sizes of business will this make the most sense? How can vertical products serve enterprises that operate across many verticals (and may see benefits from working with one provider)?
In many cases, AI voice agents can already outperform humans on emotional vectors. They pay better attention, are more empathetic and patient, and have (theoretically) unlimited time to spend. There are categories where this will be particularly valuable, and voice agents can help businesses build deeper relationships with their customers — but this has been relatively untapped so far. We are excited to see how founders build around this theme in the most relevant verticals.
If you're building in voice AI, I'd love to hear from you. Email me at omoore@a16z.com, or reach out on X.