The AI industry has an evaluation crisis. Static benchmarks are contaminated the moment they’re published. Models overfit to metrics rather than utility. And no enterprise will bet their business on systems evaluated by their own creators.
So when frontier labs need to know whether their latest model actually works, they release it on LMArena and watch millions of real users vote with their preferences. Whether OpenAI is evaluating chat performance, Google is evaluating Gemini, xAI is testing Grok, or a team needs to evaluate code generation, we believe LMArena’s growing body of evaluations, such as Web Dev Arena, has become the de facto standard.
What started as a Berkeley research project has quickly become essential infrastructure, the continuous integration pipeline for intelligence. This isn’t because of marketing or sales. It’s because the platform solved a problem everyone had but no one addressed.
We believe the companies that make AI boring will create some of the most value. Not boring as in unimpressive, but boring as in reliable, predictable, and trustworthy. LMArena is building the infrastructure to make AI as boring as databases.
That’s why we’re thrilled to be founding investors in LMArena’s seed round alongside UC Investments (University of California) and partners who share the team’s commitment to open science.
What excites us most about LMArena is its north star: solving AI reliability at scale. The platform’s power comes from a simple flywheel: more models attract more users, generating more preferences, which in turn attract more models. With more than 400 models and millions of monthly users creating novel prompts daily, LMArena has built the largest living dataset of human preferences on AI outputs.
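For readers curious how pairwise preference votes can turn into a leaderboard, here is a minimal, illustrative sketch of an Elo-style rating update in Python. It is not LMArena’s actual pipeline; the model names, starting rating, and K-factor are assumptions made for the example.

```python
# Illustrative sketch: converting pairwise preference votes into a ranking
# with a simple Elo-style update. Not LMArena's actual methodology.
from collections import defaultdict

K = 32  # update step size (assumed value for illustration)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under a logistic (Elo) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed vote."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_win)
    ratings[loser] -= K * (1.0 - e_win)

# Hypothetical votes: (preferred model, other model) from head-to-head comparisons.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
for winner, loser in votes:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

With enough votes, this kind of update converges toward a stable ordering, which is why a large, continuously refreshed stream of human preferences is so valuable.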
When models become reliable enough for hospitals to trust diagnoses, for courts to trust analysis, or for infrastructure to trust automation, that’s a generational transformation for the economy. Government agencies are already engaging. Regulated industries are piloting private arena deployments. The demand signal is clear: neutral, continuous evaluation isn’t optional for mission-critical AI.
Moving beyond a research project and incorporating as a company allows LMArena to take things even further. Already, it has plans to expand its scope into new areas.
We envision a world where “Arena-tested” becomes the Good Housekeeping seal for AI: a signal that a system has been validated by millions of real users, not just on cherry-picked benchmarks. Where every AI interaction contributes to a shared understanding of what works. Where reliability isn’t promised by vendors, but proven through transparent, continuous evaluation.
The challenges are substantial: maintaining neutrality under commercial pressure, scaling infrastructure for billions of users, and evolving evaluation methods as AI capabilities expand. But this team has already achieved something remarkable. They’ve made the entire ecosystem collectively invested in human preference at scale. In the race to build more capable AI, LMArena is on a mission to ensure those capabilities actually serve the people who use them. If that’s the future you want to build, they’re hiring.