Eight months after launching its first commercial product, Arena – the crowdsourced leaderboard born as a UC Berkeley research project in 2023 – has reached $100 million in annualized revenue. The milestone signals how central the business of language model evaluation has become, but also how much of the self-hosted, on-premise side of AI remains uncovered.

From a Berkeley lab to $100 million in eight months

The Arena project took off in less than two years. The mechanism is simple but effective: users compare two anonymous responses generated by different LLMs and vote for the better one. The collected votes feed dynamic rankings that capture public preference and have become an almost mandatory reference point for anyone developing or comparing models. The rapid monetization shows that demand for benchmarking tools is not shrinking; on the contrary, it is consolidating into a market worth hundreds of millions of dollars.

Anonymous comparisons and voting: how Arena works

The platform does not reveal the identities of the models during the test. Participants see only two texts and pick the one they find more coherent, informative, or useful. The votes feed an Elo rating, the same system used in chess, which converts pairwise judgments into an ordered scale. This approach has the merit of bypassing brand-related biases, but it shifts attention to often subjective qualities such as fluency and style, leaving more engineering-oriented metrics like latency, throughput, or VRAM consumption in the background.
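As an illustration of how pairwise votes can become a ranking, here is a minimal sketch of the textbook Elo update used in chess. It is not Arena's actual implementation: the K-factor of 32, the starting rating of 1000, the handling of ties, and the model names are all assumptions chosen for the example.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    model_a: str,
    model_b: str,
    outcome: float,
    k: float = 32.0,         # assumed K-factor, not Arena's setting
    initial: float = 1000.0  # assumed starting rating
) -> Dict[str, float]:
    """Apply one vote. outcome is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    ra = ratings.get(model_a, initial)
    rb = ratings.get(model_b, initial)
    ea = expected_score(ra, rb)
    # The winner gains what the loser gives up, scaled by how surprising the result was.
    ratings[model_a] = ra + k * (outcome - ea)
    ratings[model_b] = rb + k * ((1.0 - outcome) - (1.0 - ea))
    return ratings


if __name__ == "__main__":
    ratings: Dict[str, float] = {}
    # Hypothetical votes: (model shown as A, model shown as B, outcome for A)
    votes = [("model-x", "model-y", 1.0), ("model-y", "model-z", 1.0), ("model-x", "model-z", 0.5)]
    for a, b, result in votes:
        update_elo(ratings, a, b, result)
    for name in sorted(ratings, key=ratings.get, reverse=True):
        print(f"{name}: {ratings[name]:.1f}")
```

Because each update depends only on which answer the voter preferred, nothing in the score reflects how fast or how cheaply the answer was produced.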

The huge blind spot for enterprises adopting LLMs

For an organization evaluating an on-premise LLM installation – to keep data under its control, comply with GDPR, or optimize total cost of ownership – the Arena ranking offers few actionable insights. Every prompt sent to the platform leaves the corporate perimeter, potentially violating data sovereignty policies. Moreover, the models tested on the leaderboard run on cloud infrastructure with generous hardware and often at full precision, whereas an on-premise deployment almost always requires quantization (INT8 or FP8) and must make do with limited compute resources. Arena says nothing about how a model performs after fine-tuning on proprietary data or in an air-gapped scenario.
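To see why quantization matters on constrained hardware, a rough back-of-the-envelope sketch can estimate the memory needed just to hold a model's weights at different precisions. The 7-billion-parameter figure is a hypothetical example, and the estimate deliberately ignores the KV cache, activations, and runtime overhead, so real VRAM usage will be higher.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Weights only: KV cache, activations and framework overhead are excluded.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    hypothetical_params = 7e9  # assumed model size, for illustration only
    for precision in ("fp32", "fp16", "fp8", "int8"):
        print(f"{precision}: {weight_memory_gb(hypothetical_params, precision):.1f} GB")
```

Under these assumptions, moving from 16-bit to 8-bit weights halves the footprint, which can decide whether a model fits on a single GPU at all. The accuracy cost of that step is exactly what a public leaderboard score does not show.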

From public ranking to custom tests: the parallel path

The most advanced organizations are building in-house evaluation pipelines on their own hardware. Frameworks like lm-evaluation-harness make it possible to replicate standard benchmarks locally, so that teams can test inference on specific GPUs and measure real latency, tokens-per-second throughput, and energy consumption. For those evaluating on-premise deployments, AI-RADAR provides analytical frameworks in the /llm-onpremise section to weigh trade-offs between models without blindly relying on public rankings. Arena’s commercial success confirms that evaluation has become an indispensable market, but the real leap for enterprise adoption will be bridging the gap between a score obtained on someone else’s server and the performance inside one’s own data center.
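The kind of local measurement described above can be sketched in a few lines. This is a minimal timing harness, not part of lm-evaluation-harness or any other framework: `benchmark` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that should be replaced with a call to the locally deployed model. Energy consumption is not covered, since it requires hardware-specific tooling.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Sequence


def benchmark(generate: Callable[[str], List[str]], prompts: Sequence[str]) -> Dict[str, float]:
    """Time each generation call and report mean latency and tokens per second."""
    latencies: List[float] = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        tokens = generate(prompt)  # expected to return the generated tokens
        latencies.append(time.perf_counter() - start)
        total_tokens += len(tokens)
    total_time = sum(latencies)
    return {
        "mean_latency_s": mean(latencies),
        "tokens_per_second": total_tokens / total_time if total_time > 0 else 0.0,
        "total_tokens": float(total_tokens),
    }


def fake_generate(prompt: str) -> List[str]:
    """Stand-in for a real model: sleeps briefly, then echoes the prompt's words."""
    time.sleep(0.01)
    return prompt.split()


if __name__ == "__main__":
    sample_prompts = ["summarize this contract clause", "classify the support ticket below"]
    print(benchmark(fake_generate, sample_prompts))
```

Run against the same quantized model, on the same GPU, with prompts drawn from real workloads, figures like these are the ones a public ranking cannot supply.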