Task-Specific Knowledge Distillation for LLMs

Knowledge distillation from large language models (LLMs) assumes that the teacher's output distribution is a high-quality training signal. On reasoning tasks, however, this assumption is frequently violated: a model's intermediate representations may encode the correct answer, yet that information is lost or distorted by the vocabulary projection, where sensitivity to prompt formatting and answer-token choices makes the outputs brittle and noisy.
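
To make the bottleneck concrete, the sketch below extracts both signals from a frozen teacher: the post-projection logits and an intermediate hidden state. This is a minimal illustration using the Hugging Face `transformers` API, not the paper's setup; the model name (`gpt2`), prompt, and layer choice are placeholder assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder teacher
teacher = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("Q: What is 7 * 8? A:", return_tensors="pt")
with torch.no_grad():
    out = teacher(**inputs, output_hidden_states=True)

# The usual distillation signal: logits after the vocabulary projection,
# sensitive to prompt format and to which surface token carries the answer.
logits = out.logits[:, -1, :]                          # (1, vocab_size)

# The signal probed instead: an intermediate representation, taken here
# from a mid-depth layer (the layer choice is an assumption).
mid = len(out.hidden_states) // 2
h = out.hidden_states[mid][:, -1, :]                   # (1, hidden_size)
```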

This paper introduces a distillation framework that bypasses this bottleneck by training lightweight probes on frozen teacher hidden states. The probes' predictions, rather than the teacher's output logits, are used as supervision for student training. This approach yields consistent improvements across several reasoning benchmarks, with gains most pronounced when training data is limited.
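
The two-stage recipe can be sketched compactly. The code below is a hedged, self-contained toy rather than the paper's implementation: it fits a linear probe on (synthetic stand-ins for) cached teacher hidden states, then distills a toy student against the probe's softened predictions. All sizes, the temperature, and the linear student head are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size, num_classes, n = 768, 4, 256      # illustrative sizes

# Stand-ins for cached teacher hidden states and gold answer labels.
H = torch.randn(n, hidden_size)
y = torch.randint(0, num_classes, (n,))

# Stage 1: fit a lightweight probe on the frozen teacher's states.
probe = torch.nn.Linear(hidden_size, num_classes)
p_opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    p_opt.zero_grad()
    F.cross_entropy(probe(H), y).backward()
    p_opt.step()

# Stage 2: distill the student against the probe's predictions instead
# of the teacher's output logits. A real student consumes its own inputs;
# a linear head on H keeps the sketch self-contained.
student = torch.nn.Linear(hidden_size, num_classes)
s_opt = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0                                        # temperature (assumed)
with torch.no_grad():
    soft = F.softmax(probe(H) / T, dim=-1)     # probe soft labels
for _ in range(100):
    s_opt.zero_grad()
    loss = F.kl_div(F.log_softmax(student(H) / T, dim=-1),
                    soft, reduction="batchmean") * T * T
    loss.backward()
    s_opt.step()
```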

Probes trained on intermediate representations provide cleaner labels than the teacher's own outputs, effectively denoising the distillation signal. The method is architecture-agnostic, requires no changes to student or teacher, and adds minimal compute, since probe training is cheap and teacher representations can be cached and reused. By exploiting internal representations, practitioners can extract more value from large teacher models without additional training data or architectural complexity.
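
The caching step is what keeps the overhead small: the teacher runs once per example, and every subsequent probe (and student) training run reads from the cache. Below is a minimal sketch under the same assumptions as above; the model name, probe layer, and output path are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder teacher
teacher = AutoModelForCausalLM.from_pretrained("gpt2").eval()
layer = 6                                              # probe layer (assumed)

prompts = ["Q: What is 7 * 8? A:", "Q: Is 91 prime? A:"]
states = []
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = teacher(**ids, output_hidden_states=True)
        # Keep only the last-token state at the chosen layer.
        states.append(out.hidden_states[layer][0, -1, :])

# One forward pass per example, amortized over all later probe runs.
torch.save(torch.stack(states), "teacher_states.pt")   # path is an assumption
```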