IFEval

Benchmark

Instruction Following Evaluation: 541 prompts, each with verifiable formatting constraints (use at least N words, include a keyword, respond in JSON). Objective, programmatic evaluation of instruction adherence.

IFEval (Zhou et al., Google 2023) contains 541 prompts, each paired with one or more verifiable formatting instructions: "write at least 500 words", "include the phrase 'AI is a tool'", "respond in JSON", "do not use commas", "answer only in English". Unlike LLM-judged benchmarks, IFEval can be evaluated programmatically; no judge model is needed.
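
A minimal sketch of what "verifiable" means in practice: each instruction maps to a deterministic check over the response text. The function names and rules below are illustrative, not the official IFEval checkers.

```python
# Minimal sketch (not the official IFEval implementation): verifying one
# prompt's paired instructions programmatically, with no judge model involved.

def check_min_words(response: str, n: int) -> bool:
    """'Write at least N words' -- count whitespace-separated tokens."""
    return len(response.split()) >= n

def check_keyword(response: str, phrase: str) -> bool:
    """'Include the phrase X' -- simple substring match."""
    return phrase in response

response = "AI is a tool ..."  # model output for one hypothetical prompt
instructions = [
    lambda r: check_min_words(r, 500),
    lambda r: check_keyword(r, "AI is a tool"),
]
results = [check(response) for check in instructions]
print(results)  # [False, True] for this short response
```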

Instruction Constraint Types

Format

Response must be in JSON, bullet list, table, code block, or specific markdown structure.

Length

"At least N words", "at most N sentences", "exactly N paragraphs."

Keyword Inclusion

"Include the exact phrase X at least twice."

Keyword Exclusion

"Do not use the word 'however'."

Case / Language

"Respond entirely in lowercase." "Answer only in French."

Section Structure

"Your response must have exactly 3 sections with specific headers."

Metrics

  • Prompt accuracy: fraction of prompts where ALL instructions were followed
  • Instruction accuracy: fraction of individual instruction constraints satisfied across all prompts
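
A short sketch of how the two metrics differ, assuming each prompt maps to a list of per-instruction booleans (hypothetical data, not real scores):

```python
# True = the constraint was satisfied for that prompt.
results = {
    "prompt_1": [True, True],
    "prompt_2": [True, False, True],
    "prompt_3": [False],
}

# Prompt accuracy: ALL instructions for a prompt must pass.
prompt_acc = sum(all(r) for r in results.values()) / len(results)

# Instruction accuracy: fraction of individual constraints satisfied overall.
flat = [ok for r in results.values() for ok in r]
instr_acc = sum(flat) / len(flat)

print(f"prompt accuracy: {prompt_acc:.2f}")      # 0.33
print(f"instruction accuracy: {instr_acc:.2f}")  # 0.67
```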

Scores (Instruction Accuracy)

Model                Inst. Accuracy
GPT-4o               88.6%
Claude 3.5 Sonnet    90.2%
Llama 3.1 70B        87.5%
Llama 3.1 8B         80.4%
Mistral 7B v0.3      55.1%

Why IFEval Matters for On-Premise

Instruction following is critical for pipelines that rely on structured output (JSON APIs, report generation, email drafts). A model that ignores format constraints silently breaks downstream parsers and workflows. IFEval provides a fast, objective signal for this capability without requiring a judge model — useful for rapid local evaluation of fine-tuned or quantized variants.
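
For example, a pipeline that expects JSON can guard its downstream parser so a format violation fails loudly rather than silently. A minimal sketch; the function name is hypothetical:

```python
import json

def parse_structured_output(raw: str) -> dict:
    """Guard the downstream parser: raise immediately if the model ignored
    the 'respond in JSON' constraint instead of silently breaking the
    workflow further down the pipeline."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"model output is not valid JSON: {err}") from err

# Hypothetical usage with an output that violated the constraint:
parse_structured_output("Here is the report you asked for.")  # raises ValueError
```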