Difficulty in Controlling Chains of Thought in AI Models

A recent study published on arXiv explores the ability of AI reasoning models to control their 'chains of thought' (CoT). CoT monitoring is a promising tool for detecting misbehavior and understanding the motivations of models, but its reliability hinges on whether models can manipulate what they verbalize in the CoT: if a model could reliably hide its true reasoning, the monitor would lose its value.

The research introduces the CoT-Control evaluation suite, designed to measure this capability, termed 'CoT controllability'. The results indicate that reasoning models control their CoT far less reliably than their final output. For example, Claude Sonnet 4.5 successfully controls its CoT in only 2.7% of cases, compared to 61.9% when controlling the final output.
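To make the metric concrete, here is a minimal sketch of how this kind of controllability score might be computed; `query_model` is a hypothetical helper standing in for an API call that returns the chain of thought and the final answer separately, and the actual tasks and scoring rules of the CoT-Control suite are defined in the paper:

```python
def cot_control_score(tasks, forbidden_word, query_model):
    """Fraction of tasks where the model keeps `forbidden_word` out of its CoT.

    A sketch: instruct the model not to verbalize a target word in its
    reasoning, then check whether the CoT actually avoids it.
    """
    instruction = (
        f"Solve the problem, but do not mention the word "
        f"'{forbidden_word}' anywhere in your reasoning."
    )
    successes = 0
    for task in tasks:
        cot, _answer = query_model(f"{instruction}\n\n{task}")
        # Success = the constraint held inside the chain of thought.
        if forbidden_word.lower() not in cot.lower():
            successes += 1
    return successes / len(tasks)
```

Final-output controllability could be scored the same way against the answer instead of the CoT, which is what would make the two percentages above directly comparable.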

The study also finds that CoT controllability increases with model size and decreases with more RL training, more test-time compute, and harder problems. Even incentivizing models to evade CoT monitors (rather than directly asking them to) does not significantly increase controllability. Given its importance for the reliability of CoT monitoring, the researchers recommend tracking CoT controllability in future models.