Just over a year ago, we highlighted 16 changes to the way enterprises approached building and buying gen AI. Since then, the landscape has continued to evolve quickly—so we revisited our conversations with over two dozen enterprise buyers and surveyed 100 CIOs across 15 industries to help founders understand how these leaders are using, buying, and budgeting for gen AI in 2025 and beyond.1
Even in a field where the only constant is change, the gen AI market structure has evolved significantly beyond our expectations since we ran our last survey over a year ago.
To give founders a more nuanced look at what’s top of mind for enterprise buyers today, we’ll dig into these shifts in resourcing, models, procurement, and application usage below.
LLM budgets have grown ahead of enterprises’ (already high) expectations from a year ago, and there are no signs of this slowing down. Enterprise leaders expect an average of ~75% growth over the next year. As one CIO noted, “what I spent in 2023 I now spend in a week.”
Spend growth is driven partially by enterprises discovering more relevant internal use cases and increasing employee adoption. On top of this, we’re beginning to see more customer-facing use cases—especially for tech-forward companies—that have the potential to drive exponential spend growth. One large technology company said, “we’ve been mostly focused on internal use cases so far, but this year we’re focused on customer-facing gen AI where spend will be significantly larger.”
Last year, innovation budgets still made up a quarter of LLM spending; this has now dropped to just 7%. Enterprises are increasingly paying for AI models and apps via centralized IT and business unit budgets, reflecting the growing sentiment that gen AI is no longer experimental but essential to business operations. One CTO noted that, “more of our products are adding AI enablement, so our spending growth will rise across all of these products”—suggesting this shift toward core budgets will only accelerate.
With several highly capable LLMs now available, it’s become the norm to have multiple models deployed in production use cases. While one reason for this is certainly to avoid vendor lock-in, model differentiation by use case has become increasingly pronounced and is the main reason enterprises buy models from multiple vendors. In this year’s survey, 37% of respondents are now using 5 or more models, up from 29% last year.
While in some cases models appear to have comparable scores on general purpose evaluations, it’s clear that the enterprise model layer has not become commoditized. It’s well known, for instance, that Anthropic’s models excel in coding-related tasks, but there’s more nuance to this claim. Within coding, some users report that Claude performs better for fine-grained code completion, while Gemini is stronger in higher-level system design and architecture. In other domains, such as text-based applications, one customer observed that “Anthropic is a bit better at writing tasks—language fluency, content generation, brainstorming—while OpenAI models are better for more complex question-answering.” These differences have made it best practice to use multiple models, and we expect this strategy to continue as customers optimize applications for performance while keeping an eye toward remaining vendor-agnostic.
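To make this multi-model pattern concrete, below is a minimal sketch of task-based routing across vendors. It assumes the OpenAI and Anthropic Python SDKs with API keys set in the environment; the task categories and model IDs are illustrative placeholders, not recommendations.

```python
# Illustrative sketch only: route each task category to the vendor/model a team
# has validated for it. Task names and model IDs below are hypothetical placeholders.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment
anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

ROUTES = {
    "code_completion": ("anthropic", "claude-sonnet-4-20250514"),
    "writing":         ("anthropic", "claude-sonnet-4-20250514"),
    "complex_qa":      ("openai", "gpt-4o"),
}

def complete(task_type: str, prompt: str) -> str:
    provider, model = ROUTES[task_type]
    if provider == "openai":
        resp = openai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    resp = anthropic_client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text
```

Keeping the routing table in one place also preserves the vendor-agnostic posture described above: swapping the model behind a task type becomes a configuration change rather than an application rewrite.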
While enterprises continue to use different models across both experimental and production use cases, as explored above, a few players have taken the lead on overall adoption: OpenAI maintained its overall market share leadership, while Google and Anthropic made considerable strides over the last year. Market share also differed somewhat by enterprise scale, with more open source adoption at the larger end of the market, where on-prem deployment is still a major consideration.
Double-clicking further into usage:
As we’ve previously discussed, model costs are coming down by an order of magnitude every 12 months. Against this backdrop, we’ve also seen the price-to-performance ratio of closed source become much more compelling for small and medium models, with xAI’s Grok 3 mini and Google’s Gemini 2.5 Flash taking the lead on this count. Given this shift, along with other ecosystem benefits, customers are increasingly opting for closed source models in these cases. As one customer said, “The pricing has gotten appealing and we’re already embedded with Google: we use everything from G Suite to databases, and their enterprise expertise is attractive.” Or more concisely put by another: “Gemini is cheap.”
Improved model capabilities—chiefly higher intelligence and longer context windows—have made fine-tuning less critical to achieving strong model performance for a specific use case. Instead, companies have found that prompt engineering can drive similar or better results, often at much lower cost. As one enterprise observed, “instead of taking the training data and parameter-efficient fine-tuning, you just dump it into a long context and get almost equivalent results.”
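A minimal sketch of what that looks like in practice, using a hypothetical query-rewriting task: the examples that would previously have been fine-tuning data are instead packed into the prompt as in-context demonstrations.

```python
# Illustrative sketch only: in-context examples stand in for fine-tuning.
# `domain_examples` is a hypothetical placeholder for what would have been a training set.
domain_examples = [
    {"query": "term sheet red flags",
     "rewritten": "common unfavorable clauses in venture term sheets"},
    {"query": "q3 churn drivers",
     "rewritten": "factors contributing to customer churn in the third quarter"},
]

def build_prompt(user_query: str) -> str:
    # Dump the "training data" into the context window as worked examples,
    # then ask the model to handle the new query the same way.
    shots = "\n\n".join(
        f"Query: {ex['query']}\nRewritten: {ex['rewritten']}" for ex in domain_examples
    )
    return (
        "Rewrite search queries into precise, domain-specific form.\n\n"
        f"{shots}\n\n"
        f"Query: {user_query}\nRewritten:"
    )

print(build_prompt("enterprise ai budget trends"))
```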
This move away from fine-tuning also helps companies avoid model lock-in, as fine-tuned models require high upfront costs and engineering work while prompts can be more easily ported from one model to another. This is important in a world where models are rapidly improving and companies want the benefits of staying on the leading edge.
That said, companies with hyper-specific use cases are still fine-tuning models. For instance, one streaming service fine-tunes open source models for query augmentation in video search “where you need more domain adaptation.” We might also see a rise in fine-tuning if newer methods, like reinforcement fine-tuning, become more widely adopted beyond the labs.
As model capabilities improve, most enterprises aren’t seeing as much ROI on fine-tuning as last year and mainly opt for open source models for highly cost-sensitive use cases.
By allowing LLMs to complete more complex tasks more accurately, reasoning models have expanded the range of use cases that LLMs can tackle. Enterprises are still early in their testing of reasoning models and few have deployed them in production, but companies are very optimistic about their potential. One executive we interviewed captured this well: “[reasoning models] allow us to solve newer, more complex use cases, so I anticipate a big jump in our usage. But we’re still early and testing today.”
Among early adopters, OpenAI’s reasoning models have seen the greatest traction. Despite significant industry buzz around DeepSeek, enterprises are overwhelmingly adopting OpenAI, with 23% of enterprises surveyed already using OpenAI’s o3 model in production compared to just 3% for DeepSeek. DeepSeek’s adoption was notably higher among startups than among enterprises, where pickup remained low.
Companies now approach model selection with disciplined evaluation frameworks, and factors such as security—which was heavily emphasized in our interviews—and cost have gained ground on overall accuracy and reliability. This shift underscores the increased trust enterprises have in model performance and the confidence that LLMs will be deployed at scale. As one leader succinctly summarized, “for most tasks, all the models perform well enough now—so pricing has become a much more important factor.”
As we mentioned in the “Models” section, enterprises are also becoming more sophisticated in matching specific use cases with the right model. For highly visible or performance-critical applications, companies typically prefer leading-edge models with strong brand recognition. In contrast, for simpler or internal tasks, model choice often comes down purely to cost. See below for how these LLM KPCs (key purchasing criteria) have changed over time.
While there is still some preference for existing cloud relationships (similar to other infra purchases), more enterprises are hosting either directly with model providers or via Databricks, particularly in cases where the model of choice is not hosted by their main cloud provider (e.g., OpenAI for AWS customers). This is typically because leaders “want direct access to the latest model with the best performance as soon as it’s available. Early access previews are important too.” The increased trust in going direct with model providers including OpenAI and Anthropic is a significant shift from what we heard in last year’s interviews with enterprises: many opted to access models via a cloud provider whenever possible, sometimes even if it wasn’t via their primary cloud provider.
Last year, we found that most enterprises were designing their applications to minimize switching costs and make models as interchangeable as possible. As a result, many enterprises treated models as “easy come, easy go.” That might have worked well for simple, one-shot use cases, but the rise of agentic workflows has started making it more difficult to switch between models.
As companies invest the time and resources into building guardrails and prompting for agentic workflows, they’re more hesitant to switch to other models in the event that their results won’t be replicable or that they’ll need to invest significant time into engineering the reliability of a different model. Agentic workflows often require multiple steps to complete a task, so changing one part of a model’s workflow could impact all downstream dependencies. As one leader told us, “all the prompts have been tuned for OpenAI. Each one of them has their own set of instructions and prompts and details. How LLMs get instructions to do agentic processing—it takes lots of pages of instruction. Also, quality assurance of agents is not super easy, so changing models is now a task that can take a lot of engineering time.”
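A simplified sketch of why switching gets harder: in a multi-step workflow, each step’s prompt has been tuned against one vendor’s model, and each step consumes the previous step’s output, so a model swap at any step ripples through everything downstream. The step names and prompts are hypothetical, and the OpenAI Python SDK is assumed.

```python
# Illustrative sketch only: a multi-step agentic workflow in which every prompt
# was iterated against a single vendor's model. Steps and prompts are hypothetical.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; every instruction below was tuned against this model

STEPS = [
    ("plan",   "Break the user's request into numbered sub-tasks, one per line."),
    ("draft",  "Complete each sub-task in order, quoting the sub-task before each answer."),
    ("verify", "Check the draft against the original request and list any gaps."),
]

def run_workflow(request: str) -> str:
    context = request
    for _step, instructions in STEPS:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": instructions},
                {"role": "user", "content": context},
            ],
        )
        # Each step consumes the previous step's output, so swapping the model
        # at any single step can change every downstream result.
        context = resp.choices[0].message.content
    return context
```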
As models proliferate, external evaluations offer a practical, Gartner-like filter that enterprises recognize from their traditional software procurement processes.
While internal benchmarks, golden datasets, and developer feedback are still critical parts of assessing LLM performance more deeply, the maturation of the LLM market has driven companies to increasingly reference external benchmarks like LM Arena. Though these external benchmarks help enterprise buyers sort the market, leaders also noted that these benchmarks are just one factor in a broader evaluation process: “we definitely look at the external benchmarks. But you still need to assess yourself. It’s hard to pick without really trialing things and getting employee feedback.”
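For illustration, here is a bare-bones version of the internal half of that process: scoring candidate models against a small golden dataset before trusting any leaderboard. The dataset, scoring rule, and model IDs are hypothetical placeholders, and the OpenAI Python SDK is assumed.

```python
# Illustrative sketch only: exact-match scoring of candidate models against a
# golden dataset. Dataset contents and model IDs are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

GOLDEN_SET = [
    {"prompt": "Classify the ticket: 'I was charged twice this month.'", "expected": "billing"},
    {"prompt": "Classify the ticket: 'The app crashes on login.'", "expected": "bug"},
]

CANDIDATE_MODELS = ["gpt-4o", "gpt-4o-mini"]

def exact_match_score(model: str) -> float:
    hits = 0
    for case in GOLDEN_SET:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Answer with a single lowercase category."},
                {"role": "user", "content": case["prompt"]},
            ],
        )
        answer = resp.choices[0].message.content.strip().lower()
        hits += int(answer == case["expected"])
    return hits / len(GOLDEN_SET)

for m in CANDIDATE_MODELS:
    print(m, exact_match_score(m))
```

In practice, teams layer this kind of internal scoring on top of external benchmarks, much as the leader quoted above describes.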
Early in the AI product cycle, enterprises largely opted to work directly with AI models and build their own applications. However, we’ve seen a marked shift towards buying third party applications over the last twelve months as the ecosystem of AI apps has started to mature. This is particularly true because the dynamic performance and cost differentiation across models means that incremental ROI gains come from constant evaluation and optimization by use case, work that is often best handled by a dedicated third-party application vendor rather than an internal team.
Moreover, in a space as dynamic as AI, companies are finding that internally developed tools are difficult to maintain and frequently don’t give them a business advantage—which further cements their interest in buying instead of building apps.
As more application categories mature, we’d expect this trend to swing harder toward third-party applications in the future; one leading indicator is that leaders are already weighing third-party apps more heavily when testing new use cases. In the case of customer support, for instance, over 90% of survey respondents noted that they were testing third-party apps. One public fintech noted that while they had started to build customer support internally, a recent review of third-party solutions on the market convinced them to buy instead of continuing the build. The one area where we haven’t seen this trend play out is in regulated or high-risk industries like healthcare, where data privacy and compliance are more top of mind.
While there’s a lot of hype around outcome-based pricing for AI, CIOs are still uncomfortable with how outcome metrics are set, measured, and billed.
Some of the top concerns with outcome-based pricing were lack of clear outcomes that map to business goals, unpredictable costs, and attribution—but there was no consensus on how vendors could mitigate these issues. This isn’t surprising, as AI is a relatively new technology and it’s not yet clear how to implement it so it drives real value for businesses. Buyers don’t know how much they’re going to be charged and don’t want to be left holding the bag. Given this, most CIOs still prefer paying by usage for AI applications.
While we’ve seen progressive adoption of AI use cases across the board—especially internal enterprise search, data analysis, and customer support—software development has seen a step change in adoption, driven by a perfect storm of extremely high-quality off-the-shelf apps, a significant increase in model capabilities, relevance for a broad set of companies and industries, and a no-brainer ROI use case.
One CTO at a high-growth SaaS company reported that nearly 90% of their code is now AI-generated through Cursor and Claude Code, up from 10–15% 12 months ago with GitHub Copilot. This level of adoption still represents the bleeding edge, but is likely a strong leading indicator for the enterprise.
Strong consumer brands are translating into strong enterprise demand.
Like some of the early platform shifts (e.g., the internet), much of the early growth across leading enterprise AI apps has been driven by the prosumer market. This was kicked off by ChatGPT and underscored by coding apps and creator tools like ElevenLabs. Many CIOs noted their decision to purchase enterprise ChatGPT was driven by “employees loving ChatGPT. It’s the brand name they know.” This dual market pull has led to much faster growth in the next generation of AI companies than we’ve seen in the past.
Incumbents have always benefited from established trust and existing distribution, but in the AI era, they’re increasingly outperformed by AI-native competitors from a product quality and velocity perspective.
Unsurprisingly, the primary reason buyers prefer AI-native vendors is their faster innovation rate. The second reason is the recognition that companies built around AI from the ground up deliver fundamentally better products with superior outcomes compared to incumbents retrofitting AI into existing solutions.
This gap is especially clear in software development today, where one public security company CIO highlighted a stark difference in capabilities between first-generation and second-generation AI coding tools as coding becomes more agentic. The shift is also echoed in user satisfaction data: users who have adopted Cursor, a gen AI-native coding solution, show notably lower satisfaction with previous-gen tools like GitHub Copilot, underscoring how quickly innovation fundamentally reshapes the outcomes buyers can and should expect from AI.
The enterprise AI landscape is no longer defined by experimentation: it’s shaped by strategic deployment, budget commitment, and maturing vendor ecosystems. As model choice diversifies, fragmentation by use case is not only expected but embraced, and a few key leaders are emerging. Enterprises are adopting structured procurement processes and increasingly turning to off-the-shelf applications to accelerate adoption. The result is a market that looks more like traditional software—yet moves with the speed and complexity unique to AI.
Sarah Wang is a general partner on the Growth team at Andreessen Horowitz, where she focuses on enterprise technology companies.
Shangda Xu is a partner on the Growth investing team, focused on enterprise technology companies.
Justin Kahl is a partner on the Growth investing team.
Tugce Erten is a partner on the Growth team, focused on pricing and packaging.