Distillation Techniques for LLMs

A Reddit thread has raised the question of LLM distillation, asking users which techniques they prefer and which starting models they would use. Distillation is a method of transferring knowledge from a larger model (the "teacher") to a smaller one (the "student"). The goal is a more compact, faster model suited to resource-constrained or latency-sensitive deployments.
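The core idea can be sketched with the classic soft-target approach: the student is trained to match the teacher's temperature-softened output distribution rather than only the hard labels. The snippet below is a minimal, framework-free illustration; the logits, temperature value, and function names are illustrative assumptions, not a specific library's API.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # relative confidence across non-target classes ("dark knowledge").
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between softened teacher and student distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl

# Hypothetical logits over 3 classes for one example:
teacher = [4.0, 1.0, 0.2]
student = [3.0, 1.5, 0.5]
loss = distillation_loss(student, teacher)
```

In practice this loss is usually combined with the standard cross-entropy on ground-truth labels, weighted by a mixing coefficient, and computed per token when distilling language models.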

For those evaluating on-premise deployments, there are significant trade-offs between model size, hardware requirements, and performance. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs.

The choice of distillation technique and starting model depends on several factors, including the target size of the distilled model, the available computational resources, and the intended application.