LLM: Embedding Space Separation for Safety
Large language models (LLMs) exhibit remarkable capabilities, but protecting them from harmful prompts remains a crucial challenge. Recent research has shown that the latent representations (embeddings) of harmful and safe queries in LLMs tend to be linearly separable. Attackers have exploited this property to construct attacks by perturbing the embeddings of harmful queries toward the safe subspace.
To address this problem, a representation-level fine-tuning approach called Embedding Space Separation (ES2) has been proposed. ES2 aims to improve LLM safety by explicitly increasing the distance between harmful and safe representations in the embedding space. To avoid compromising the model's general capabilities, a Kullback-Leibler (KL) divergence regularization term has been introduced into the loss function. This constrains the logits of the fine-tuned model to align with those of the original base model on harmless inputs.
The methodology was evaluated on several open-source LLMs using standard safety benchmarks. Experimental results indicate that this approach significantly improves model safety while maintaining comparable general capabilities.