IFEval (Zhou et al., 2023; Google) contains 541 prompts, each paired with one or more verifiable instructions: "write at least 500 words", "include the phrase 'AI is a tool'", "respond in JSON", "do not use commas", "respond only in English". Unlike LLM-judged benchmarks, IFEval can be scored programmatically; no judge model is needed.
Instruction Constraint Types

| Constraint type | Example |
|---|---|
| Format | Response must be in JSON, a bullet list, a table, a code block, or a specific markdown structure. |
| Length | "At least N words", "at most N sentences", "exactly N paragraphs." |
| Keyword inclusion | "Include the exact phrase X at least twice." |
| Keyword exclusion | "Do not use the word 'however'." |
| Case / language | "Respond entirely in lowercase." "Answer only in French." |
| Section structure | "Your response must have exactly 3 sections with specific headers." |
Metrics
- Prompt-level accuracy: the fraction of prompts for which ALL attached instructions were followed
- Instruction-level accuracy: the fraction of individual instruction constraints satisfied, pooled across all prompts

The paper reports each metric in a "strict" and a "loose" variant; the loose variant re-checks after stripping common decorations such as markdown emphasis or a leading "Sure, here is...". The sketch after this list computes the strict versions.
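A minimal sketch of both computations, assuming per-prompt results are stored as lists of booleans (one per instruction); this data layout is an assumption for illustration, not IFEval's actual output schema:

```python
# Each inner list holds one boolean per instruction attached to a prompt.
results = [
    [True, True],          # prompt 1: both instructions followed
    [True, False, True],   # prompt 2: one of three violated
    [False],               # prompt 3: the single instruction violated
]

# Prompt-level: a prompt counts only if ALL of its instructions pass.
prompt_acc = sum(all(r) for r in results) / len(results)

# Instruction-level: pool individual pass/fail results across prompts.
instruction_acc = sum(sum(r) for r in results) / sum(len(r) for r in results)

print(f"prompt accuracy:      {prompt_acc:.1%}")       # 33.3%
print(f"instruction accuracy: {instruction_acc:.1%}")  # 66.7%
```

Note how the two metrics diverge: a single violated instruction fails the whole prompt, so prompt-level accuracy is always the stricter number.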
Scores (Instruction Accuracy)
| Model | Inst. Accuracy |
|---|---|
| GPT-4o | 88.6% |
| Claude 3.5 Sonnet | 90.2% |
| Llama 3.1 70B | 87.5% |
| Llama 3.1 8B | 80.4% |
| Mistral 7B v0.3 | 55.1% |
Why IFEval Matters for On-Premise
Instruction following is critical for pipelines that rely on structured output (JSON APIs, report generation, email drafts). A model that ignores a format constraint silently breaks downstream parsers and workflows. IFEval provides a fast, objective signal for this capability without requiring a judge model, which makes it well suited to rapid local evaluation of fine-tuned or quantized variants.
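To make the failure mode concrete, here is a hedged sketch of a downstream guard; call_model and generate_report are hypothetical placeholders for a local inference client and a pipeline step, not part of IFEval:

```python
import json

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a local inference client. Here it returns
    # a response that violates the JSON-only constraint, to show the failure.
    return 'Sure! Here is the report: {"status": "ok"}'

def generate_report(prompt: str) -> dict:
    raw = call_model(prompt + "\nRespond with JSON only.")
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        # Catch the violation at the boundary instead of letting it
        # propagate silently into later pipeline stages.
        raise ValueError(f"model output was not valid JSON: {err}") from err

try:
    generate_report("Summarize Q3 incidents.")
except ValueError as err:
    print(err)  # model output was not valid JSON: Expecting value: line 1 ...
```

A model that scores well on IFEval's format constraints needs this guard to fire far less often, which is exactly the signal the benchmark is designed to surface.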