HELM

Benchmark

Holistic Evaluation of Language Models — 42 scenarios × 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency). A comprehensive multi-dimensional evaluation framework.

HELM (Liang et al., Stanford CRFM 2022) is not a single score but a framework for systematic, multi-dimensional model evaluation. Rather than optimising for a single accuracy number, HELM evaluates models on 42 diverse scenarios against 7 metrics, forcing a holistic view of capabilities and failure modes.

The 7 Metric Dimensions

Accuracy

Does the model give correct answers? (Task-specific: exact match, F1, ROUGE, etc.)

Calibration

Does the model's confidence match its accuracy? (Expected Calibration Error)

Robustness

Does performance degrade under input perturbations (typos, paraphrases)?

Fairness

Are performance gaps across demographic groups minimal?

Bias

Does the model exhibit stereotypical associations in WinoBias-style probes?

Toxicity

How often does the model generate harmful content under adversarial prompting?

Efficiency

How expensive is the model to run? (Tokens/second, cost per query, VRAM footprint.)
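Of these, calibration is the least familiar. It is scored with Expected Calibration Error: bin predictions by confidence, then average the gap between each bin's confidence and its actual accuracy. A minimal sketch (bin count and toy numbers are illustrative, not HELM's exact implementation):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the size-weighted
    average of |bin accuracy - bin confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# Toy case: a model that says 0.8 and is right 8 times out of 10.
print(expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2))  # ≈ 0.0
```

A model that reports 80% confidence and is right 80% of the time scores near zero; a model that reports 90% confidence but is right only half the time scores around 0.4, and that gap is exactly what HELM's calibration dimension surfaces.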

Scenario Categories

HELM's 42 scenarios span question answering (NaturalQuestions, TriviaQA), reasoning (CommonsenseQA), summarisation (CNN/DailyMail), code (HumanEval), disinformation, toxicity, and more. The breadth is the point: a model cannot look good on HELM by overfitting a single type of evaluation.
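HELM's leaderboards aggregate across these scenarios using mean win rate: for each scenario, the fraction of other models a given model beats, averaged over scenarios. A toy sketch (model names and scores are invented; real HELM win rates are computed per metric as well as per scenario):

```python
def mean_win_rate(scores):
    """scores: {model: {scenario: value}} with higher = better.
    Per scenario, a model's win rate is the fraction of other models
    it outperforms; the headline number averages over scenarios."""
    models = list(scores)
    scenarios = {s for per_model in scores.values() for s in per_model}
    rates = {}
    for m in models:
        wins = []
        for s in scenarios:
            others = [o for o in models if o != m and s in scores[o]]
            if not others or s not in scores[m]:
                continue
            wins.append(sum(scores[m][s] > scores[o][s] for o in others) / len(others))
        rates[m] = sum(wins) / len(wins)
    return rates

toy = {
    "model-a": {"nq": 0.70, "cnn-dm": 0.40},
    "model-b": {"nq": 0.60, "cnn-dm": 0.45},
    "model-c": {"nq": 0.50, "cnn-dm": 0.30},
}
print(mean_win_rate(toy))  # → {'model-a': 0.75, 'model-b': 0.75, 'model-c': 0.0}
```

Note how model-a and model-b tie overall despite winning different scenarios; breadth, not a single number, drives the ranking.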

HELM Instruct (2024 Update)

HELM Classic focuses on base-model evaluation. HELM Instruct (2024) adds instruction-following scenarios and GPT-4-judged open-ended tasks to cover the instruction-tuned model era. Both are maintained at crfm.stanford.edu/helm.

Why HELM Matters for Enterprise Decisions

When choosing an on-premise model, accuracy on MMLU alone is insufficient. HELM's calibration and fairness metrics can reveal when a model is confidently wrong (poorly calibrated) or performs systematically worse for certain demographic groups, both of which are critical for HR, legal, and public-facing deployments.
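The fairness check in particular reduces to a simple disparity measure: the largest accuracy gap between any two demographic groups on the same task. A hypothetical sketch (group names and counts are made up for illustration):

```python
def fairness_gap(results):
    """results: {group: (num_correct, num_total)} on the same task.
    Returns the max accuracy gap between any two groups, the kind of
    disparity HELM's fairness dimension is designed to surface."""
    accuracies = {g: correct / total for g, (correct, total) in results.items()}
    return max(accuracies.values()) - min(accuracies.values())

# Hypothetical audit: accuracy on prompts referencing two groups.
toy = {"group_a": (90, 100), "group_b": (78, 100)}
print(round(fairness_gap(toy), 4))  # → 0.12
```

A 12-point gap like this would be invisible in an aggregate accuracy score of 84%, which is precisely why HELM reports the dimensions separately rather than collapsing them.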