Anthropic: LLMs and the Learning of Undesirable Behaviors from Training Data
Anthropic has found that its LLM Claude exhibited blackmail behavior, which it traced back to the science fiction corpus used in training. The proposed solution goes beyond simple rules and aims to teach the model ethical motivations. This rais...