Pruning and Representations in Language Models

Network pruning is a widely used technique for improving the efficiency of language models by reducing their size and computational cost. The basic idea is to remove the least important parameters or structural components while preserving the desired performance. However, the effectiveness of pruning varies significantly depending on the type of task.
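As a concrete illustration, the sketch below implements unstructured magnitude pruning, one common instance of the general idea: the parameters with the smallest absolute values are treated as least important and zeroed out. This is a minimal toy example, not the method of any particular study; the function name and the 50% sparsity level are illustrative choices.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning).

    `sparsity` is the fraction of entries to remove, e.g. 0.5 prunes half.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude serves as the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))          # toy weight matrix
pruned = magnitude_prune(w, 0.5)     # at least half the entries become zero
```

Structured variants follow the same pattern but remove whole rows, heads, or layers instead of individual entries.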

Analysis of Representation Hierarchies

A recent study analyzed pruning from a representation hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). The results indicate that the representations in the embedding and logit spaces are generally robust to pruning-induced perturbations.
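The three spaces form a simple pipeline: hidden representations are projected to logits by an output matrix, and a softmax turns logits into a probability distribution. The toy sketch below makes that decomposition explicit; the dimensions and the random unembedding matrix `W_out` are assumptions for illustration, not details from the study.

```python
import numpy as np

def decompose(hidden, W_out):
    """Trace one hidden vector through the three sequential spaces."""
    logits = hidden @ W_out               # logit space (pre-softmax outputs)
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    probs = exp / exp.sum()               # probability space (post-softmax)
    return hidden, logits, probs

rng = np.random.default_rng(1)
hidden = rng.normal(size=8)               # embedding space (hidden representation)
W_out = rng.normal(size=(8, 5))           # toy output projection over 5 tokens
hidden, logits, probs = decompose(hidden, W_out)
```

Pruning perturbs the hidden vector and hence the logits; the study's claim is that these first two stages absorb such perturbations well, while the final softmax stage does not.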

Impact on Generative and Non-Generative Tasks

The nonlinear transformation from logits to probabilities amplifies the deviations caused by pruning, leading to a significant degradation of performance during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection.
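A small numeric example, with toy values chosen for illustration, shows why generation is fragile: when two logits are close, a perturbation much smaller than the logits themselves can flip which token has the highest probability, and under greedy decoding that single flip changes every subsequent token. A ranking-based task over well-separated options would be unaffected by the same perturbation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([5.0, 4.9, 0.0])                  # two near-tied top tokens
noise = np.array([-0.15, 0.15, 0.0])                # small pruning-induced shift (assumed)
perturbed = logits + noise

p, q = softmax(logits), softmax(perturbed)
logit_dev = float(np.abs(noise).max())              # 0.15 in logit space
flipped = int(p.argmax()) != int(q.argmax())        # greedy decoding picks a different token
```

Here `flipped` is true: a 0.15 shift in logit space is enough to change the argmax, so a generated sequence diverges even though the embedding- and logit-space deviations are tiny.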
