HELM, the Holistic Evaluation of Language Models (Liang et al., Stanford CRFM, 2022), is not a single score but a framework for systematic, multi-dimensional model evaluation. Rather than optimising for one accuracy number, HELM evaluates models on 42 diverse scenarios across 7 metric dimensions, forcing a holistic view of capabilities and failure modes.
The 7 Metric Dimensions
Accuracy: Does the model give correct answers? (Task-specific: exact match, F1, ROUGE, etc.)
Calibration: Does the model's confidence match its accuracy? (Expected Calibration Error; see the sketch after this list.)
Robustness: Does performance degrade under input perturbations such as typos and paraphrases?
Fairness: Are performance gaps across demographic groups minimal?
Bias: Does the model exhibit stereotypical associations, such as between gender and occupation, in its generations?
Toxicity: How often does the model generate harmful content under adversarial prompting?
Efficiency: How expensive is the model to train and query? (Inference runtime plus training energy and compute cost.)
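Calibration is the least familiar of these dimensions, so a concrete sketch helps: Expected Calibration Error bins predictions by confidence and takes the weighted average gap between each bin's mean confidence and its actual accuracy. The 10-bin scheme below is a common illustrative choice, not necessarily HELM's exact implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average |accuracy - confidence| over confidence bins.

    confidences: model's probability for its predicted answer, shape (N,)
    correct:     1.0 if the prediction was right, else 0.0, shape (N,)
    Illustrative equal-width binning; HELM's exact scheme may differ.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight gap by fraction of samples in bin
    return ece

# Well calibrated means 70%-confident answers are right ~70% of the time.
print(expected_calibration_error([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1]))
```

A model can score high on accuracy and still have a large ECE; that is exactly the failure mode a single-number leaderboard hides.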
Scenario Categories
HELM's 42 scenarios span question answering (NaturalQuestions, BoolQ), reasoning (GSM8K, bAbI), summarisation (CNN/DailyMail, XSUM), code (HumanEval, APPS), disinformation, toxicity (RealToxicityPrompts), and more. The breadth is the point: a model cannot look good on HELM by overfitting to a single type of evaluation.
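HELM's leaderboard condenses these per-scenario results into a mean win rate: for each scenario, the fraction of rival models a given model beats, averaged over all scenarios. Here is a minimal sketch of that aggregation; the table layout and the model and scenario names are invented for illustration.

```python
import pandas as pd

# Hypothetical accuracy table: rows = models, columns = scenarios.
scores = pd.DataFrame(
    {"natural_questions": [0.61, 0.55, 0.48],
     "cnn_dailymail":     [0.41, 0.44, 0.39],
     "humaneval":         [0.32, 0.18, 0.25]},
    index=["model_a", "model_b", "model_c"],
)

def mean_win_rate(scores: pd.DataFrame) -> pd.Series:
    """Per scenario, win rate = fraction of rival models beaten;
    the mean over scenarios is a HELM-leaderboard-style aggregate (sketch)."""
    n_rivals = len(scores) - 1
    win_rate = (scores.rank(axis=0, method="average") - 1) / n_rivals
    return win_rate.mean(axis=1).sort_values(ascending=False)

print(mean_win_rate(scores))
```

Because the win rate is recomputed per scenario, a model that dominates one scenario but trails everywhere else cannot buy its way to the top of the aggregate.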
HELM Instruct (2024 Update)
HELM Classic focuses on base model evaluation. HELM Instruct (2024) adds instruction-following scenarios and GPT-4-judged open-ended tasks to cover the instruction-tuned model era. Both are maintained at crfm.stanford.edu/helm.
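The GPT-4-judged tasks follow the now-standard model-graded evaluation pattern. The sketch below shows the general shape of that pattern, not HELM Instruct's actual prompts or rubric; the criteria names, scoring format, and judge model are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric; HELM Instruct's real criteria and wording differ.
RUBRIC = (
    "Rate the RESPONSE to the INSTRUCTION on a 1-5 scale for each of: "
    "helpfulness, completeness, harmlessness. "
    "Answer with three integers separated by spaces."
)

def judge(instruction: str, response: str) -> list[int]:
    """Model-graded scoring sketch: ask a judge model to apply the rubric."""
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"INSTRUCTION:\n{instruction}\n\nRESPONSE:\n{response}"},
        ],
        temperature=0,  # deterministic grading
    )
    return [int(tok) for tok in reply.choices[0].message.content.split()[:3]]
```

The appeal is scale: open-ended outputs that have no exact-match answer can still be scored consistently across thousands of instructions.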
Why HELM Matters for Enterprise Decisions
When choosing an on-premises model, accuracy on MMLU alone is insufficient. HELM's calibration and fairness metrics can reveal when a model is confidently wrong (poorly calibrated) or performs systematically worse for certain demographic groups, both critical considerations for HR, legal, and public-facing deployments.