In this episode of AI + a16z, LMArena cofounders Anastasios N. Angelopoulos, Wei-Lin Chiang, and Ion Stoica sit down with a16z general partner Anjney Midha to talk about the future of AI evaluation. As benchmarks struggle to keep up with the pace of real-world deployment, LMArena is reframing the problem: what if the best way to test AI models is to put them in front of millions of users and let them vote? The team discusses how Arena evolved from a research side project into a key part of the AI stack, why fresh and subjective data is crucial for reliability, and what it means to build a CI/CD pipeline for large models.
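The "let users vote" framing boils down to a concrete aggregation problem: collect blind, side-by-side preferences and turn them into a ranking. As a minimal illustration of that idea (not LMArena's actual pipeline, which fits a Bradley-Terry model with confidence intervals over far more data), here is a sketch using simple Elo-style updates over hypothetical votes:

```python
# Minimal sketch: turning pairwise human votes into a leaderboard with
# Elo-style updates. Illustrative only; model names and votes are hypothetical.
from collections import defaultdict

K = 32          # update step size (illustrative choice)
BASE = 1000.0   # starting rating for every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rank_from_votes(votes):
    """votes: iterable of (model_a, model_b, winner), winner in {"a", "b"}."""
    ratings = defaultdict(lambda: BASE)
    for a, b, winner in votes:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = 1.0 if winner == "a" else 0.0
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Each tuple is one user's blind side-by-side preference.
    sample_votes = [
        ("model-x", "model-y", "a"),
        ("model-y", "model-z", "a"),
        ("model-x", "model-z", "a"),
        ("model-y", "model-x", "b"),
    ]
    for name, rating in rank_from_votes(sample_votes):
        print(f"{name}: {rating:.1f}")
```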
They also explore:
0:04 – LLM evaluation: From consumer chatbots to mission-critical systems
6:04 – Style and substance: Crowdsourcing expertise
18:51 – Building immunity to overfitting and gaming the system
29:49 – The roots of LMArena
41:29 – Proving the value of academic AI research
48:28 – Scaling LMArena and starting a company
59:59 – Benchmarks, evaluations, and the value of ranking LLMs
1:12:13 – The challenges of measuring AI reliability
1:17:57 – Expanding beyond binary rankings as models evolve
1:28:07 – A leaderboard for each prompt
1:31:28 – The LMArena roadmap
1:34:29 – The importance of open source and openness
1:43:10 – Adapting to agents (and other AI evolutions)
Artificial intelligence is changing everything from art to enterprise IT, and a16z is watching all of it closely. This podcast features discussions with leading AI engineers, founders, and experts, as well as our general partners, about where the technology and the industry are heading.