Unlearning in LLMs: A Crucial Challenge

Unlearning in large language models (LLMs) has become essential for addressing safety, copyright, and privacy concerns. Unlike preference alignment, unlearning offers a more explicit way to remove undesirable knowledge, with the removal scope defined by specific unlearning (forget) datasets.

TRU: A Reasoning-Based Approach

The research highlights significant limitations of gradient ascent (GA)-based methods, including degradation of general capabilities, incomplete knowledge removal, and the generation of incoherent responses. These problems stem from the lack of explicit guidance on what the model should unlearn and how it should respond afterward.

To fill this gap, a novel reasoning-based unlearning target is introduced, which encodes both the specified unlearning scope and the desired post-unlearning response. This led to the development of Targeted Reasoning Unlearning (TRU), which uses the reasoning-based unlearning target as explicit guidance during unlearning.

Implementation and Evaluation

TRU combines a supervised cross-entropy loss with a GA-based loss, enabling the model to reason its way to precise knowledge removal while preserving unrelated abilities. Evaluations against strong baselines across multiple benchmarks and LLM architectures demonstrate that TRU achieves more reliable unlearning while retaining general capabilities. Furthermore, TRU exhibits greater robustness under diverse attack scenarios, thanks to the reasoning ability acquired through the reasoning-based targets.
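The combined objective described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names (`cross_entropy`, `tru_style_loss`), the per-example logits/labels interface, and the `ga_weight` balancing coefficient are all assumptions introduced here for clarity.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch.
    logits: (batch, vocab) unnormalized scores; labels: (batch,) token ids."""
    z = logits - logits.max(axis=1, keepdims=True)          # stabilize softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def tru_style_loss(target_logits, target_labels,
                   forget_logits, forget_labels, ga_weight=1.0):
    """Sketch of a TRU-style objective: supervised CE toward the
    reasoning-based unlearning target, plus a gradient-ascent term
    (negated CE) on the original forget-set answer."""
    # Pull the model toward the desired post-unlearning response.
    supervised = cross_entropy(target_logits, target_labels)
    # Push probability mass away from the knowledge to be removed.
    ascent = -cross_entropy(forget_logits, forget_labels)
    return supervised + ga_weight * ascent
```

In practice the two terms would be computed per token over the model's output distributions, and `ga_weight` would trade off removal strength against fluency; the sketch only shows how the supervised and ascent terms combine into one scalar loss.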
