Enhancing LLMs: The Challenge Between Data and Evaluation

Improving the capabilities of Large Language Models (LLMs) is one of the central challenges of the pre-training phase. Capability is not directly observable: data shapes it prospectively, while evaluation reveals it only retrospectively, compressing samples, prompts, decoding, and scoring rules into a single, often noisy, score. In practice, optimization often runs backward: a failure is observed first, and the engineer must infer the necessary corpus fix. This inference is complicated by the incompatible vocabularies of the two domains: benchmark names and per-sample correctness on one side, and data sources, domains, and quality labels on the other. Consequently, diagnosing the cause of a problem is often left to intuition rather than rigorous methodology.

The "Capability Slice": A Precise Diagnostic Lens

To close this gap, the concept of a "capability slice" has been introduced. A capability slice is a group of evaluation samples that share the same background conditions, task type, solving operation, and output constraint. A benchmark name is too coarse and a single sample is too noisy; a slice is precise enough to localize a single model weakness, yet stable enough to survive aggregation. Built around this unit, the complete system includes an evaluation taxonomy, a non-instruction data taxonomy, and mapping rules between the two. The goal is to form a closed loop that transforms a benchmark-level failure into a targeted, testable data intervention.
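The grouping idea can be illustrated with a minimal Python sketch. It assumes each evaluation sample has already been labeled with the four slice attributes; the class name `SliceKey`, the function `slice_accuracy`, and the label strings are hypothetical and do not come from the original system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceKey:
    """The four attributes that define a slice (field names are assumptions)."""
    background: str
    task_type: str
    solving_operation: str
    output_constraint: str


def slice_accuracy(samples):
    """Aggregate per-sample correctness into per-slice accuracy.

    `samples` is an iterable of (SliceKey, bool) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for key, correct in samples:
        totals[key][0] += int(correct)
        totals[key][1] += 1
    return {key: hits / n for key, (hits, n) in totals.items()}


# Illustrative labels only; real taxonomies would be far richer.
modular = SliceKey("competition math", "short answer", "modular arithmetic", "integer")
counting = SliceKey("competition math", "short answer", "casework counting", "integer")

accuracy = slice_accuracy([(modular, True), (modular, False), (counting, False)])
```

Here the two samples from the same benchmark land in different slices once the solving operation differs, which is what lets a weakness be localized below the level of a benchmark name.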

From Intuition to Data: Concrete Use Cases

The effectiveness of this loop has been tested through two case studies that led to opposite but correct conclusions. In the first, the system ruled out the need for data intervention: continued pre-training had caused a 46.82% drop in the BBH benchmark score. The diagnosis traced the drop to a single masked <EOS> (End-Of-Sequence) token loss rather than to weakened reasoning capabilities. Once the loss on that token was restored, the BBH score recovered to 66.44, surpassing the original checkpoint, without any modification to the training data.
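How a masked <EOS> loss hides a problem from training can be shown with a small, dependency-free Python sketch. The source does not describe the actual training code, so everything here is an assumption made for illustration: the token id `EOS_ID`, the helper names, and the per-token loss values are invented, and the per-token losses are treated as already aligned with their target tokens.

```python
EOS_ID = 2  # hypothetical token id for <EOS>


def loss_mask(token_ids, mask_eos):
    """Return 1 for target positions that contribute to the loss, 0 otherwise."""
    return [0 if (mask_eos and t == EOS_ID) else 1 for t in token_ids]


def masked_mean_loss(per_token_loss, mask):
    """Average the loss over unmasked positions only."""
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(kept) / len(kept)


targets = [11, 12, 13, EOS_ID]
# Made-up losses: the model predicts the ordinary tokens well and <EOS> poorly.
per_token_loss = [0.5, 0.5, 0.5, 4.0]

buggy_mask = loss_mask(targets, mask_eos=True)   # <EOS> excluded from the loss
fixed_mask = loss_mask(targets, mask_eos=False)  # <EOS> contributes again

buggy_loss = masked_mean_loss(per_token_loss, buggy_mask)
fixed_loss = masked_mean_loss(per_token_loss, fixed_mask)
```

With the buggy mask, the poor <EOS> prediction never reaches the averaged loss, so training receives no signal about when to end a sequence even though the reported loss looks healthy. That is a configuration fault, not a data fault, which matches the verdict in the first case study.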

In the second case, the loop indicated the need for data intervention. A persistent weakness in mathematical reasoning was decomposed by solving operation into specific failing combinations. A targeted sampling procedure built from this analysis raised Pass@128 on AIME2025 and AIME2026 from 6.67 and 0.00, respectively, to 26.67 on both. Across the two examples, the same unmodified loop delivered opposite, correct verdicts, showing how evaluation-to-data inference can become a routine, auditable, and experimentally validated process rather than one based on intuition.
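For readers unfamiliar with the metric, the sketch below computes a simple empirical Pass@k: the percentage of problems for which at least one of k sampled attempts is correct. The source does not say how Pass@128 was computed or how many problems each AIME set contains, so the function `pass_at_k` and the 30-problem set are assumptions; under that assumption, 2 and 8 solved problems reproduce the reported 6.67 and 26.67.

```python
def pass_at_k(results_per_problem):
    """Percentage of problems with at least one correct attempt.

    `results_per_problem` is a list of lists of bools, one inner list of
    k sampled attempts per problem.
    """
    solved = sum(1 for attempts in results_per_problem if any(attempts))
    return 100.0 * solved / len(results_per_problem)


def make_results(num_problems, num_solved, k=128):
    """Build synthetic results where the first `num_solved` problems are
    each solved by exactly one of k attempts (illustrative data only)."""
    results = [[False] * k for _ in range(num_problems)]
    for i in range(num_solved):
        results[i][0] = True
    return results


before = round(pass_at_k(make_results(30, 2)), 2)
after = round(pass_at_k(make_results(30, 8)), 2)
```

Because a problem counts as solved after a single success in 128 attempts, this metric measures whether a capability is reachable at all rather than how reliably it is produced.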

Control and Auditability: The Value for On-Premise Deployments

For organizations evaluating on-premise LLM deployments, the "capability slice" methodology offers significant advantages. The ability to precisely diagnose the causes of model performance problems, distinguishing data issues from internal configuration issues, is fundamental to optimizing Total Cost of Ownership (TCO). In self-hosted environments, where control over data and infrastructure is a priority, a systematic and auditable approach to model correction reduces debugging time and the inefficient use of computational resources such as expensive GPUs. This improves operational efficiency and also strengthens data sovereignty and compliance, allowing companies to maintain rigorous governance over their models and sensitive data without relying on opaque cloud-provider logic. The transparency and reproducibility of results, as demonstrated by the case studies, are key to the trust in and adoption of LLMs in critical enterprise contexts. Analytical frameworks to help assess these trade-offs are available at /llm-onpremise.