Large language models (LLMs), trained on vast web datasets, can generate toxic outputs, raising concerns about their safety.
REPO: A New Approach
Research has shown that the modifications made to models to mitigate this problem are often superficial. REPO (Representation Erasure-based Preference Optimization) reformulates detoxification as a token-level preference problem: it forces the internal representations of toxic continuations to converge towards those of their benign counterparts.
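The source describes REPO only at a high level, but the two stated ingredients, a token-level preference objective and a pull between toxic and benign representations, can be sketched as a combined loss. The function below is an illustrative assumption, not the paper's actual objective: `repo_style_loss`, the DPO-style preference term, the squared-distance erasure term, and the weights `beta` and `lam` are all hypothetical names and choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def repo_style_loss(h_toxic, h_benign, logp_benign, logp_toxic,
                    beta=0.1, lam=1.0):
    """Sketch of a token-level detox objective in the spirit of REPO.

    h_toxic, h_benign   : (T, d) hidden states for the toxic continuation
                          and its benign counterpart
    logp_benign, logp_toxic : (T,) per-token log-probabilities
    The exact form is an assumption for illustration only.
    """
    # Representation-erasure term: pull toxic hidden states toward the
    # benign ones (the benign side would be a fixed, no-gradient target).
    erase = np.mean(np.sum((h_toxic - h_benign) ** 2, axis=-1))
    # Token-level preference term (DPO-style logistic loss per token,
    # preferring the benign continuation).
    pref = -np.mean(np.log(sigmoid(beta * (logp_benign - logp_toxic))))
    return pref + lam * erase
```

Under this sketch, the loss shrinks as the toxic-side hidden states approach their benign counterparts, which is the convergence behaviour the text attributes to REPO.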
Analysis and Results
In-depth analysis reveals that this granular approach induces localized edits to toxicity-encoding neurons while preserving general model utility. Evaluations demonstrate that REPO achieves state-of-the-art robustness, stopping sophisticated threats (including relearning attacks and enhanced GCG jailbreaks) where existing methods fail.