LLMs evaluate themselves: part two

A member of the LocalLLaMA community has repeated an earlier experiment: asking different language models to judge the output of other LLMs. The setup uses questions designed to elicit specific answers, which are then scored by the judging models.
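The post does not include the exact prompts or scoring scale, but the general LLM-as-judge loop looks roughly like the sketch below. The `query_model` helper, the judge prompt, and the 1-10 scale are assumptions for illustration, not details from the experiment.

```python
import re

def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical stand-in for calling a local model (e.g. through an
    OpenAI-compatible endpoint or llama.cpp); not part of the original post."""
    raise NotImplementedError

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Candidate answer: {answer}\n"
    "Rate the answer from 1 to 10 and reply with the number only."
)

def judge_answers(questions, candidate_model, judge_models):
    """Ask one model the questions, then have the judge models score each answer."""
    results = []
    for question in questions:
        answer = query_model(candidate_model, question)
        scores = {}
        for judge in judge_models:
            reply = query_model(judge, JUDGE_PROMPT.format(question=question, answer=answer))
            match = re.search(r"\d+(\.\d+)?", reply)  # pull the first number out of the judge's reply
            scores[judge] = float(match.group()) if match else None
        results.append({"question": question, "answer": answer, "scores": scores})
    return results
```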

The resulting scores are normalized and published on Hugging Face, so the community can analyze the data and compare the models' performance transparently.
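The post does not say which normalization scheme was used; a common choice is min-max scaling per judge, sketched below. The sample scores and the Hugging Face repository name in the comment are placeholders, not data from the experiment.

```python
def min_max_normalize(scores):
    """Scale a list of raw scores to the 0-1 range (one possible normalization;
    the original post does not specify the scheme actually used)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Placeholder scores, one list per judged model.
raw = {"model-a": [7, 8, 6], "model-b": [4, 9, 5]}
normalized = {name: min_max_normalize(vals) for name, vals in raw.items()}
print(normalized)

# The normalized table could then be published with the `datasets` library, e.g.:
# from datasets import Dataset
# Dataset.from_dict(table).push_to_hub("your-username/llm-judge-scores")  # hypothetical repo id
```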

Anyone evaluating on-premise deployments should weigh the trade-offs carefully; AI-RADAR's analytical frameworks at /llm-onpremise cover these aspects.