LLM Vulnerabilities: The Effect of Altered Language

A recent study published on arXiv examined how inducing altered language, similar to that of an intoxicated person, can expose large language models (LLMs) to new vulnerabilities.

The researchers explored three methods for inducing this type of language in LLMs: persona-based prompting, causal fine-tuning, and reinforcement-based post-training. The results obtained on five LLMs showed a greater susceptibility to jailbreaking techniques, measured using the JailbreakBench benchmark, and to privacy leaks, evaluated with ConfAIde. These results were observed even in the presence of defense mechanisms.

The analysis, conducted through manual and automated evaluations, suggests a correlation between human intoxicated behavior and anthropomorphism induced in LLMs through altered language. The simplicity and effectiveness of these altered language induction approaches make them potential tools for testing and improving the safety of LLMs, but at the same time highlight the significant risks to their reliability.