Knowledge Distillation and Multilingual LLM Safety
The safety of LLMs is an increasingly critical issue, especially in non-English contexts where model alignment is often less refined. A recent study explored the application of knowledge distillation (KD) to prevent multilingual jailbreak attacks.
Researchers used distillation to transfer the refusal behaviors of a proprietary model (OpenAI o1-mini) to three open-source models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B. This process was carried out using Low-Rank Adaptation (LoRA) and approximately 28,000 multilingual jailbreak prompts from XSafety, using a black-box response-based and parameter-efficient fine-tuning (PEFT) approach.
Unexpected Results: Increased Vulnerability
Evaluation using the MultiJail benchmark revealed a counterintuitive behavior: standard fine-tuning on the "safe" refusal data of the teacher model actually increased the Jailbreak Success Rate (JSR) for all student models, by up to 16.6 percentage points. This suggests that generalization to unseen languages during distillation can lead to divergent outcomes, depending on the base model.
Removing a primary source of safety degradation, namely "boundary" refusals, mitigated or even reversed the decline in safety in student models, although reductions in reasoning performance (GSM8K) persisted. The study highlights the challenges and potential of knowledge distillation as a technique for multilingual safety alignment, paving the way for future research in this direction.
For those evaluating on-premise deployments, there are trade-offs to consider carefully. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these aspects.
๐ฌ Commenti (0)
๐ Accedi o registrati per commentare gli articoli.
Nessun commento ancora. Sii il primo a commentare!