Evaluating the performance of large language models (LLMs) in real-world scenarios requires metrics that reflect the end-user experience. A member of the LocalLLaMA community highlighted the limitations of traditional benchmarks based on prompt-processing and token-generation throughput (pp/tg), proposing an alternative approach focused on total wait time.
The problem with traditional benchmarks
Benchmarks that measure tokens per second (or time per token) can be misleading, as they do not always translate into speed as the user perceives it. A setup that looks performant on paper may prove slow in everyday use, especially when processing large contexts.
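To see why a single tokens-per-second figure can mislead, it helps to combine the prompt-processing (pp) and token-generation (tg) rates into the wait time the user actually experiences. The sketch below is illustrative only; the rates used are hypothetical and are not taken from the benchmark discussed in the article.

```python
# Illustrative sketch (not the author's benchmark): how prompt-processing (pp)
# and token-generation (tg) rates combine into the wait time a user actually feels.
def total_wait_seconds(prompt_tokens: int, output_tokens: int,
                       pp_tok_per_s: float, tg_tok_per_s: float) -> float:
    """Wait time = time to ingest the prompt + time to generate the reply."""
    return prompt_tokens / pp_tok_per_s + output_tokens / tg_tok_per_s

# Hypothetical rates: a machine with decent generation speed but slow prompt
# processing looks fine on a tg-only benchmark, yet feels slow at 32k context.
print(total_wait_seconds(32_000, 500, pp_tok_per_s=200, tg_tok_per_s=25))  # ~180 s
print(total_wait_seconds(1_000, 500, pp_tok_per_s=200, tg_tok_per_s=25))   # ~25 s
```

With these assumed numbers, the same machine that answers a short prompt in about 25 seconds keeps the user waiting roughly three minutes at 32,000 tokens of context, even though its generation rate has not changed.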
A new approach: total wait time
The user therefore developed a benchmark that measures the time required to process contexts of varying sizes (from 1,000 to 64,000 tokens) and to generate a 500-token response. This approach aims to simulate realistic usage scenarios, giving a more accurate estimate of how long the user must wait for a response from the model.
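A minimal sketch of such a total-wait-time measurement is shown below, assuming an OpenAI-compatible local endpoint (for example, llama.cpp's llama-server). The endpoint URL, model name, and the way the prompt is padded to a target context size are assumptions for illustration, not details of the original benchmark.

```python
# Minimal sketch of a total-wait-time benchmark against an assumed
# OpenAI-compatible local server. URL and model name are placeholders.
import time
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # hypothetical local server
MODEL = "local-model"                                     # placeholder model name

def measure_wait(context_tokens: int, output_tokens: int = 500) -> float:
    # Pad the prompt with filler text to approximate the target context size
    # (roughly one token per word here; a real benchmark would use the tokenizer).
    filler = "lorem " * context_tokens
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": filler + "\nSummarize the above."}],
        "max_tokens": output_tokens,
    }
    start = time.perf_counter()
    requests.post(ENDPOINT, json=payload, timeout=3600).raise_for_status()
    return time.perf_counter() - start

for ctx in (1_000, 4_000, 16_000, 64_000):
    print(f"{ctx:>6} context tokens -> {measure_wait(ctx):7.1f} s total wait")
```

Measuring end-to-end wall-clock time in this way folds prompt processing and generation into a single number, which is the quantity the user actually waits for.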
The results of this benchmark have been published on a website, offering an overview of the performance of different hardware configurations, including a Strix Halo with 128 GB of memory.
For those evaluating on-premise deployments, there are trade-offs to consider; AI-RADAR offers analytical frameworks at /llm-onpremise to help weigh them.