VeRA: A New Approach to AI Evaluation
Evaluating artificial intelligence models often relies on static benchmarks that are reused over time, making them vulnerable to memorization and to exploitation of format quirks. To overcome these limitations, VeRA (Verified Reasoning Data Augmentation) has been proposed: a framework that automatically generates new benchmarks from existing problems.
VeRA transforms benchmark problems into executable specifications, composed of:
- A natural language template with placeholder slots.
- A generator that samples valid, internally consistent parameter configurations.
- A deterministic verifier that validates the parameters and computes the correct answer.
From a single seed problem, VeRA automatically creates unlimited verified variants, with reliable labels and at near-zero marginal cost, without human involvement.
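The template/generator/verifier pipeline described above can be sketched in Python. This is a minimal illustration on a toy arithmetic seed problem; all names (`generate`, `verify`, `instantiate`) and the problem itself are assumptions for illustration, not VeRA's actual API.

```python
import random

# Hypothetical executable specification for one seed problem.
# The template has placeholder slots filled by the generator.
TEMPLATE = "A shop sells apples at {price} dollars each. What do {count} apples cost?"

def generate(rng: random.Random) -> dict:
    """Sample a valid parameter configuration for the template."""
    return {"price": rng.randint(1, 20), "count": rng.randint(2, 50)}

def verify(params: dict) -> int:
    """Deterministically validate the parameters and compute the correct answer."""
    assert params["price"] > 0 and params["count"] > 0
    return params["price"] * params["count"]

def instantiate(seed: int) -> tuple[str, int]:
    """Produce one fresh, verified benchmark instance from the seed problem."""
    rng = random.Random(seed)
    params = generate(rng)
    return TEMPLATE.format(**params), verify(params)

question, answer = instantiate(seed=0)
```

Because the verifier is deterministic, every sampled variant ships with a reliable label, which is what allows variants to be produced at near-zero marginal cost.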
VeRA Operating Modes
VeRA operates in two complementary modes:
- VeRA-E (equivalent): rewrites problems while keeping the underlying logic intact, useful for detecting memorization versus genuine reasoning.
- VeRA-H (hardened): systematically increases complexity while remaining verifiable, enabling reliable creation and labeling of fresh difficult tasks.
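The contrast between the two modes can be illustrated on the same toy arithmetic problem. The transformations below are hypothetical stand-ins chosen for clarity, not the paper's actual rewriting procedures.

```python
import random

def base_instance(rng: random.Random) -> tuple[str, int]:
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return f"What is {a} + {b}?", a + b

def vera_e(rng: random.Random) -> tuple[str, int]:
    """Equivalent mode: reword the surface form; the underlying logic is unchanged."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return (f"You hold {a} coins and receive {b} more. How many do you hold now?",
            a + b)

def vera_h(rng: random.Random) -> tuple[str, int]:
    """Hardened mode: raise complexity (more terms, larger values), still verifiable."""
    terms = [rng.randint(10, 99) for _ in range(4)]
    return "What is " + " + ".join(map(str, terms)) + "?", sum(terms)
```

A model that solved the base instance by memorization should fail the reworded VeRA-E variant, while VeRA-H variants stay mechanically checkable no matter how hard they get.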
Evaluating 16 frontier models with VeRA showed that:
- VeRA-E improves evaluation quality and reveals contamination patterns.
- VeRA-H enables human-free generation of hard tasks with reliable labels.
- VeRA establishes verified benchmarks as a general paradigm.
VeRA reconceptualizes benchmarks: rather than static objects used until exhausted, they become executable specifications that generate fresh, verified instances on demand, making evaluation more robust and cost-effective.
VeRA has been released open-source to stimulate future research.