Evaluating the performance of large language models (LLMs) in real-world scenarios requires metrics that reflect the end-user experience. A member of the LocalLLaMA community highlighted the limitations of traditional benchmarks based on prompt-processing and token-generation throughput (pp/tg), proposing an alternative approach focused on total wait time.
The problem with traditional benchmarks
Benchmarks that measure tokens per second (or time per token) can be misleading, as they do not always translate into speed as the user perceives it. A setup that looks performant on paper may prove slow in everyday use, especially when processing large contexts.
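To see why a single tokens-per-second figure can mislead, it helps to combine the prompt-processing (pp) and token-generation (tg) rates into the wait time the user actually experiences. The sketch below is illustrative only; the rates used are hypothetical and are not taken from the benchmark discussed in the article.

```python
# Illustrative sketch (not the author's benchmark): how prompt-processing (pp)
# and token-generation (tg) rates combine into the wait time a user actually feels.
def total_wait_seconds(prompt_tokens: int, output_tokens: int,
                       pp_tok_per_s: float, tg_tok_per_s: float) -> float:
    """Wait time = time to ingest the prompt + time to generate the reply."""
    return prompt_tokens / pp_tok_per_s + output_tokens / tg_tok_per_s

# Hypothetical rates: a machine with decent generation speed but slow prompt
# processing looks fine on a tg-only benchmark, yet feels slow at 32k context.
print(total_wait_seconds(32_000, 500, pp_tok_per_s=200, tg_tok_per_s=25))  # ~180 s
print(total_wait_seconds(1_000, 500, pp_tok_per_s=200, tg_tok_per_s=25))   # ~25 s
```

With these assumed numbers, the same machine that answers a short prompt in about 25 seconds keeps the user waiting roughly three minutes at 32,000 tokens of context, even though its generation rate has not changed.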
A new approach: total wait time
The user therefore developed a benchmark that measures the time required to process contexts of varying sizes (from 1,000 to 64,000 tokens) and to generate a 500-token response. This approach aims to simulate realistic usage scenarios, giving a more accurate estimate of how long the user must wait for a response from the model.
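A minimal sketch of such a total-wait-time measurement is shown below, assuming an OpenAI-compatible local endpoint (for example, llama.cpp's llama-server). The endpoint URL, model name, and the way the prompt is padded to a target context size are assumptions for illustration, not details of the original benchmark.

```python
# Minimal sketch of a total-wait-time benchmark against an assumed
# OpenAI-compatible local server. URL and model name are placeholders.
import time
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # hypothetical local server
MODEL = "local-model"                                     # placeholder model name

def measure_wait(context_tokens: int, output_tokens: int = 500) -> float:
    # Pad the prompt with filler text to approximate the target context size
    # (roughly one token per word here; a real benchmark would use the tokenizer).
    filler = "lorem " * context_tokens
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": filler + "\nSummarize the above."}],
        "max_tokens": output_tokens,
    }
    start = time.perf_counter()
    requests.post(ENDPOINT, json=payload, timeout=3600).raise_for_status()
    return time.perf_counter() - start

for ctx in (1_000, 4_000, 16_000, 64_000):
    print(f"{ctx:>6} context tokens -> {measure_wait(ctx):7.1f} s total wait")
```

Measuring end-to-end wall-clock time in this way folds prompt processing and generation into a single number, which is the quantity the user actually waits for.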
The results of this benchmark have been published on a website, offering an overview of the performance of different hardware configurations, including a Strix Halo with 128 GB of memory.
For those evaluating on-premise deployments, there are trade-offs to consider; AI-RADAR offers analytical frameworks at /llm-onpremise to help weigh them.