## Aligning LLM Safety Through Case-Based Reasoning

A new study published on arXiv explores methods to improve the safety of Large Language Models (LLMs) without compromising their utility. The research focuses on deliberative alignment, a technique that aims to ensure LLMs adhere to well-defined safety principles.

## Explicit Safety Codes vs. Practical Examples

The paper compares two main approaches: referencing explicit safety codes and demonstrating safety principles through illustrative use cases. The results indicate that referencing explicit codes can lead to inconsistent LLM behavior, sometimes improving harmlessness but often reducing the models' ability to provide helpful responses. In contrast, training on practical examples proves more effective at promoting robust, generalizable safety behavior.

## CADA: A New Deliberative Alignment Method

Based on these observations, the researchers propose CADA, a case-augmented deliberative alignment method. CADA applies reinforcement learning to self-generated safety reasoning chains to guide LLMs. This approach improves harmlessness, increases robustness against attacks, and reduces over-refusal, while preserving the models' utility across different scenarios. CADA thus offers a practical alternative to rule-only deliberative alignment, strengthening safety while maintaining the ability to provide helpful responses.
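
To make the idea of reinforcement learning on self-generated safety reasoning chains concrete, here is a minimal, hypothetical sketch: sample several reasoning-plus-answer candidates for a prompt, score each with a reward that balances harmlessness and helpfulness, and reinforce the candidates that score above the group average. Every name here (`generate_candidates`, `reward`, `training_step`), the reward weights, and the group-relative baseline are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One self-generated safety reasoning chain plus the final answer."""
    reasoning: str
    answer: str


def generate_candidates(prompt: str, n: int) -> List[Candidate]:
    # Placeholder: in practice these would be sampled from the policy model,
    # each candidate deliberating over relevant safety cases before answering.
    return [
        Candidate(
            reasoning=f"[sample {i}] The request appears benign; no safety case applies.",
            answer=f"[sample {i}] Here is a helpful response to: {prompt}",
        )
        for i in range(n)
    ]


def reward(prompt: str, cand: Candidate) -> float:
    # Placeholder reward: a real setup would combine a harmlessness judge,
    # a helpfulness judge, and a penalty for unnecessary refusals.
    harmless = 1.0 if "refuse" not in cand.answer.lower() else 0.5
    helpful = min(len(cand.answer) / 80.0, 1.0)
    return 0.5 * harmless + 0.5 * helpful


def advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: each reward minus the group mean (an assumed baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def training_step(prompt: str, n_samples: int = 4) -> None:
    cands = generate_candidates(prompt, n_samples)
    rs = [reward(prompt, c) for c in cands]
    advs = advantages(rs)
    for c, a in zip(cands, advs):
        # Placeholder for a policy-gradient update: chains scoring above the
        # group average would be reinforced, the rest down-weighted.
        print(f"advantage={a:+.3f}  answer={c.answer[:60]}")


if __name__ == "__main__":
    training_step("How do I secure my home Wi-Fi network?")
```

The design intuition is that the reward signal rewards answers that are both safe and useful, so the model is pushed away from blanket refusals as well as from unsafe compliance; the exact judges, weights, and update rule used in the paper may differ from this sketch.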