Hugging Face has announced the creation of benchmark repositories for large language models (LLMs), with the goal of making performance evaluations more standardized and transparent.

Collaborative Benchmarks

The initiative, presented by Ben from Hugging Face, aims to address the inconsistencies in benchmark results that often arise when comparing different models. The new repositories let the community contribute evaluation results directly: to add a model to a leaderboard, you open a pull request (PR) against the model's repository containing the results and their sources. The system links the model to the leaderboard through the PR itself, so the PR does not need to be merged.
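
As an illustration, the sketch below opens such a PR programmatically with the huggingface_hub library. The repository id, file name, and JSON layout of the results are placeholder assumptions; the article does not specify a required format.

```python
# Sketch: submitting evaluation results as a pull request against a model
# repository on the Hugging Face Hub. File name, path, repo id, and the
# JSON layout are illustrative assumptions, not a prescribed format.
import json
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already authenticated (e.g. `huggingface-cli login`)

results = {
    "task": "hellaswag",                             # hypothetical benchmark name
    "accuracy": 0.78,                                # hypothetical score
    "source": "https://example.org/eval-run-logs",   # link to the evaluation source
}

with open("eval_results.json", "w") as f:
    json.dump(results, f, indent=2)

# create_pr=True opens a pull request instead of committing to main,
# matching the workflow described above: the PR itself carries the results.
api.upload_file(
    path_or_fileobj="eval_results.json",
    path_in_repo="eval_results.json",
    repo_id="org/some-model",                        # hypothetical model repository
    repo_type="model",
    create_pr=True,
    commit_message="Add external evaluation results",
)
```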

Transparency and Verification

To ensure results are verified, Hugging Face also supports running automated evaluation jobs. This increases the transparency of benchmarks and provides a more solid basis for comparing models. Community feedback is essential to improving the system further.
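
The article does not name the tooling behind these automated jobs. As a hedged illustration only, results could be produced reproducibly with EleutherAI's lm-evaluation-harness and then attached to a PR as sketched above; the model id and task below are placeholders.

```python
# Illustrative sketch: producing evaluation results with lm-evaluation-harness.
# This tool is an assumption for the example, not the mechanism described in
# the article; model id and task are placeholders.
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/gpt-neo-125m", # placeholder model
    tasks=["hellaswag"],                             # placeholder task
    num_fewshot=0,
)

# Persist the per-task metrics so they can be submitted alongside their source.
with open("eval_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```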