Targeted Distillation for Language Models

Distillation of large language models (LLMs) is a well-established technique for transferring knowledge from a larger "teacher" model to a smaller, more efficient "student" model. However, traditional methods often waste valuable computational resources by training the student model on problems it has already mastered or on problems that are far beyond its current capabilities.

A new study introduces PACED, a framework that addresses this issue by focusing distillation on the student model's zone of proximal development: the frontier of its competence. The approach is grounded in a theoretical analysis showing that the signal-to-noise ratio of distillation gradients drops sharply at both extremes of student performance, on problems the student always solves and on problems it never solves.
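To make the intuition concrete, here is a toy sketch. It is not the paper's derivation: the choice of p(1-p) (the Bernoulli variance at success rate p) as a proxy for per-problem gradient signal is an assumption made for illustration only.

```python
# Toy illustration, not PACED's actual analysis: use the Bernoulli variance
# p * (1 - p) as a proxy for per-problem gradient signal, where p is the
# student's success rate. The signal vanishes at both extremes of competence.

def gradient_signal(p: float) -> float:
    """Proxy for distillation gradient magnitude at student success rate p."""
    return p * (1 - p)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    bar = "#" * int(40 * gradient_signal(p))
    print(f"success rate {p:.2f} | signal {gradient_signal(p):.3f} {bar}")
```

Under this proxy, problems the student solves half the time carry the most training signal, while mastered and out-of-reach problems carry none.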

The PACED Framework

PACED uses a weighting function, derived from the structure of distillation gradients, to give greater weight to problems at the edge of the student model's capabilities. Experimental results show that PACED delivers significant improvements over traditional distillation methods, both when distilling from a larger teacher model into a smaller student and in self-distillation. The approach is compatible with either Kullback-Leibler (KL) divergence direction and requires no architectural changes to the model.
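The shape of such an objective can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `paced_style_loss` and the p(1-p) weighting are hypothetical, not the weighting actually derived in the paper.

```python
import math

def kl_divergence(p, q):
    """Forward KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def paced_style_loss(teacher_dists, student_dists, success_rates):
    """Sketch of a competence-weighted distillation objective.

    Each problem's KL term is weighted by p * (1 - p), which peaks at the
    frontier of the student's competence (p ~ 0.5) and vanishes for mastered
    (p ~ 1) or out-of-reach (p ~ 0) problems. The exact weighting function
    used by PACED is not reproduced here.
    """
    weights = [p * (1 - p) for p in success_rates]
    total = sum(weights) or 1.0  # avoid division by zero if all weights vanish
    return sum((w / total) * kl_divergence(t, s)
               for w, t, s in zip(weights, teacher_dists, student_dists))

# A mastered problem (p = 1.0) contributes nothing; the frontier problem
# (p = 0.5) dominates the objective.
teacher = [[0.7, 0.3], [0.6, 0.4]]
student = [[0.5, 0.5], [0.5, 0.5]]
loss = paced_style_loss(teacher, student, success_rates=[1.0, 0.5])
```

Because the weighting only rescales per-problem loss terms, it slots into an existing distillation pipeline without touching the model architecture.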

Furthermore, a two-stage recipe, in which a first stage of distillation under the forward KL divergence is followed by a stage under the reverse KL divergence, appears to produce the best results: the process first expands mode coverage and then consolidates the acquired knowledge.
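The two KL directions, and a schedule that switches between them, can be sketched as below. The `switch_step` parameter and the hard switch are assumptions; the study does not prescribe this exact mechanism.

```python
import math

def forward_kl(teacher, student):
    """KL(teacher || student): mode-covering; the student is penalized
    wherever it misses teacher probability mass."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def reverse_kl(teacher, student):
    """KL(student || teacher): mode-seeking; the student concentrates on the
    teacher's dominant modes. Assumes the teacher assigns nonzero mass
    wherever the student does."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def staged_loss(teacher, student, step, switch_step=1000):
    """Hypothetical two-stage schedule: forward KL first to expand mode
    coverage, then reverse KL to consolidate what was learned."""
    if step < switch_step:
        return forward_kl(teacher, student)
    return reverse_kl(teacher, student)
```

In practice the switch point would be a tuned hyperparameter rather than a fixed step count.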
