LCO: Optimizing Agentic LLMs for Safety Without Fine-tuning
A new framework, LCO (LLM-based Constraint Optimization), addresses In-Context Reward Hacking (ICRH) in agentic LLMs. It is designed to reduce the harmful side effects of over-optimization and requires no model fine-tuning. Throug...