Large language models (LLMs), trained on vast web datasets, can generate toxic outputs, raising concerns about their safety.

REPO: A New Approach

Prior work has shown that the modifications existing detoxification methods make to models are often superficial. REPO (Representation Erasure-based Preference Optimization) instead reformulates detoxification as a token-level preference problem: it forces the representations of toxic continuations to converge toward those of their benign counterparts.
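
To make the idea concrete, here is a minimal PyTorch sketch of a loss in this spirit: an erasure term that pulls the hidden states of toxic continuation tokens toward their benign counterparts, plus a token-level preference term. Everything below is illustrative rather than the paper's exact objective; `repo_style_loss`, the use of the last hidden layer, the MSE erasure term, and the Bradley-Terry-style preference term are all assumptions. `model` is assumed to be a Hugging Face-style causal LM, and the toxic and benign continuations are assumed to be padded to the same length.

```python
import torch
import torch.nn.functional as F

def repo_style_loss(model, prompt_ids, toxic_ids, benign_ids, alpha=1.0):
    """Illustrative erase + token-level preference loss (not the paper's exact form)."""
    def run(cont_ids):
        ids = torch.cat([prompt_ids, cont_ids], dim=1)
        out = model(ids, output_hidden_states=True)
        p = prompt_ids.size(1)
        # Hidden states at the continuation token positions.
        hidden = out.hidden_states[-1][:, p:, :]
        # Log-prob assigned to each continuation token by the preceding position.
        logits = out.logits[:, p - 1:-1, :]
        logp = torch.gather(logits.log_softmax(-1), 2,
                            cont_ids.unsqueeze(-1)).squeeze(-1)
        return hidden, logp

    toxic_h, toxic_logp = run(toxic_ids)
    benign_h, benign_logp = run(benign_ids)

    # (1) Erasure term: pull each toxic-token representation toward the
    #     benign counterpart at the same position (benign side is the target).
    erase = F.mse_loss(toxic_h, benign_h.detach())

    # (2) Token-level preference term: per-position penalty that favours
    #     the benign token over the toxic one.
    prefer = -F.logsigmoid(benign_logp - toxic_logp).mean()

    return erase + alpha * prefer
```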

Analysis and Results

In-depth analysis reveals that this granular approach induces localized edits to toxicity-encoding neurons while preserving general model utility. Evaluations demonstrate that REPO achieves state-of-the-art robustness, stopping sophisticated threats, including relearning attacks and enhanced GCG jailbreaks, where existing methods fail.
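
One way such localization could be probed, as a rough sketch rather than the paper's actual analysis protocol, is to compare weights before and after detoxification and ask how much of the total change concentrates in a small set of neurons. The function name and the focus on MLP rows below are hypothetical choices for illustration:

```python
import torch

def neuron_edit_concentration(base_model, edited_model, top_k=50):
    """Share of total weight change concentrated in the top-k neurons per MLP matrix.

    Assumes both models share the same architecture and parameter names.
    Values close to 1.0 suggest the edit is localized to a few neurons.
    """
    scores = {}
    base = dict(base_model.named_parameters())
    for name, p_edit in edited_model.named_parameters():
        if "mlp" not in name or p_edit.dim() != 2:
            continue
        delta = (p_edit - base[name]).detach()
        # Treat each row as one neuron's weight vector.
        per_neuron = delta.norm(dim=1)
        total = per_neuron.sum()
        if total > 0:
            k = min(top_k, per_neuron.numel())
            scores[name] = (per_neuron.topk(k).values.sum() / total).item()
    return scores
```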