Human-AI Interactions: The Dark Side

Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As large language models (LLMs) increasingly serve as sources of guidance, emotional support, and even informal therapy, these risks are poised to escalate.

MultiTraitSS: A Framework for Studying Harmful Interactions

Studying the mechanisms underlying harmful human-AI interactions presents significant methodological challenges. Organic harmful interactions typically develop over sustained engagement, requiring extensive conversational context that is difficult to simulate in controlled settings. To address this gap, researchers have developed Multi-Trait Subspace Steering (MultiTraitSS), a framework that leverages established crisis-associated traits and a novel subspace steering approach to generate "dark" models exhibiting cumulative harmful behavioral patterns.
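The core idea of subspace steering can be sketched in a few lines. The snippet below is an illustrative assumption, not the framework's actual implementation: it treats each crisis-associated trait as a direction in a model's hidden-state space, projects an activation onto the subspace those directions span, and amplifies that component. The function name `steer_hidden_state`, the dimensions, and the scale `alpha` are all hypothetical.

```python
import numpy as np

def steer_hidden_state(h, trait_vectors, alpha=4.0):
    """Shift a hidden state along the subspace spanned by trait directions.

    h:             (d,) hidden activation from one transformer layer
    trait_vectors: (k, d) rows are directions for k crisis-associated traits
    alpha:         steering strength (hypothetical scale)
    """
    # Orthonormalize the trait directions so the subspace projection
    # is well-defined even when the raw directions are correlated.
    Q, _ = np.linalg.qr(trait_vectors.T)   # (d, k), orthonormal columns
    # Project h onto the trait subspace, then push h further along it.
    h_sub = Q @ (Q.T @ h)
    return h + alpha * h_sub

rng = np.random.default_rng(0)
d, k = 16, 3                                # toy sizes for illustration
traits = rng.normal(size=(k, d))            # stand-in trait directions
h = rng.normal(size=d)                      # stand-in hidden activation
steered = steer_hidden_state(h, traits)
```

In practice a hook like this would be applied to residual-stream activations at one or more layers during generation, so the trait shift accumulates across turns rather than acting on a single vector.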

Results and Protective Measures

Single-turn and multi-turn evaluations show that dark models consistently produce harmful interactions and outcomes. Drawing on these models, the researchers propose protective measures to reduce harm in human-AI interactions.
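A multi-turn evaluation of this kind might be structured as follows. This is a minimal sketch under stated assumptions: `model_respond`, `classify_harm`, and the fixed follow-up user turn are all hypothetical stand-ins, not the paper's actual evaluation harness.

```python
def evaluate_multi_turn(model_respond, classify_harm, opening_prompts, n_turns=5):
    """Fraction of conversations in which any model turn is flagged harmful.

    model_respond(history) -> reply string (the model under evaluation)
    classify_harm(text)    -> bool (an external harm classifier)
    Both callables and the conversation flow are illustrative assumptions.
    """
    harmful = 0
    for prompt in opening_prompts:
        history = [prompt]
        for _ in range(n_turns):
            reply = model_respond(history)
            history.append(reply)
            if classify_harm(reply):
                harmful += 1
                break                         # one flagged turn marks the dialogue
            history.append("tell me more")    # stand-in for a simulated user turn
    return harmful / len(opening_prompts)
```

A single-turn evaluation is the special case `n_turns=1`; running both makes it possible to separate harms visible in an isolated reply from those that only emerge over sustained engagement.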