When automatic feedback fails: what truly drives improvement in LLM agents

The dream of autonomous agents that correct themselves by talking to themselves has been doused with cold water. A group of researchers put thirteen open-weight models through a rigorous student-teacher protocol and found that the feedback LLMs generate for themselves adds very little beyond simply repeating attempts. The only significant improvement comes when an external teacher, equipped with privileged task information, provides targeted guidance.

The experiment that separates chance from substance

The team adopted a multi-turn setting in which a student agent tackles problems in mathematics (Omni-MATH), competitive programming (Codeforces), linguistic understanding (BBEH Linguini), and visual reasoning (ARC-AGI1). Each configuration compares three modes: external feedback, self-generated feedback, and unguided self-refinement. By varying interaction history, task difficulty, and the teacher's access to privileged information, the researchers show that apparent progress over multiple turns is often no proof that feedback was actually used. Improvement can stem from simple resampling, format corrections, or the sheer increase in the number of attempts, all of which inflate metrics without indicating deeper understanding.
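To see why extra attempts alone can look like progress, consider a deliberately simplified model that is not taken from the study: assume each attempt succeeds independently with a fixed probability p. The function name and the value of p below are illustrative assumptions.

```python
def success_within_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds.

    Assumption (not from the study): attempts are independent and each
    one succeeds with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Hypothetical per-attempt success rate; no feedback of any kind is modelled.
    for k in (1, 2, 4, 8):
        print(k, round(success_within_k(0.2, k), 3))
```

Under these assumptions, a student that solves a task 20% of the time on a single try reaches roughly 83% within eight tries without learning anything from feedback. Any multi-turn gain has to be read against a curve of this kind.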

Self-feedback: the illusion of improvement

The experiments undercut the idea that an LLM can be an effective critic of itself. Under self-feedback conditions, gains are statistically indistinguishable from those obtained by repeating the task with no guidance at all. In contrast, the strongest external teachers produce substantial and specific increases, suggesting that useful feedback must offer something beyond a generic “try again.” The dense interaction matrices assembled by the researchers reveal a second finding: what makes the difference is not so much the teacher's identity as the student's ability to absorb and apply suggestions. For a fixed student, teacher choice remains crucial, but the main bottleneck sits on the learner side.
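One way a team could check whether two conditions are "statistically indistinguishable" on its own tasks is a paired sign-flip permutation test. This is a minimal sketch of a standard technique, not the statistical procedure used in the study, which the article does not describe; the function name and parameters are assumptions.

```python
import random
from typing import Sequence


def paired_sign_flip_test(a: Sequence[float], b: Sequence[float],
                          n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the mean paired difference between a and b.

    a[i] and b[i] are scores for the same task under two conditions
    (for example, self-feedback versus plain retry). Under the null
    hypothesis the sign of each paired difference is arbitrary.
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Randomly flip the sign of every paired difference.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / n >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A large p-value means the data give no evidence that the two conditions differ; it does not prove they are identical, so the number of tasks still matters.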

What changes for those bringing agents into the enterprise

For teams evaluating on-premise deployment of LLM agents, these conclusions have immediate practical impact. Those planning local stacks for data sovereignty or total cost of ownership (TCO) reasons must ask: are we measuring our agent's effectiveness against a repeated-attempt baseline, or are we mistaking background noise for progress? Self-correction implemented without an authoritative teacher can become pure computational cost that does not move the performance needle. From a resource-optimization perspective, it makes more sense to invest in student robustness, through targeted fine-tuning or orchestration architectures, than to multiply automatic feedback loops. AI-Radar has already highlighted, through its analytical tools, how central the choice of validation metrics is to TCO calculations; the study adds another piece of the puzzle, showing that feedback quality is also a variable to measure, not a given.
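In practice, the repeated-attempt baseline amounts to comparing both conditions at the same turn budget. The sketch below assumes a hypothetical log format (task id mapped to a list of per-turn pass/fail results); the names `Runs`, `solved_within`, and `lift_over_retry` are illustrative and do not come from the released framework.

```python
from typing import Dict, List

# Hypothetical log format: task id -> pass/fail outcome of each turn.
Runs = Dict[str, List[bool]]


def solved_within(runs: Runs, k: int) -> float:
    """Fraction of tasks solved in at least one of the first k turns."""
    if not runs:
        raise ValueError("runs must contain at least one task")
    return sum(any(turns[:k]) for turns in runs.values()) / len(runs)


def lift_over_retry(feedback_runs: Runs, retry_runs: Runs, k: int) -> float:
    """Gain of the feedback condition over plain retries at equal budget k."""
    return solved_within(feedback_runs, k) - solved_within(retry_runs, k)


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the study.
    retry = {"t1": [False, True], "t2": [False, False]}
    feedback = {"t1": [False, True], "t2": [False, True]}
    print(lift_over_retry(feedback, retry, k=2))
```

A lift close to zero suggests the feedback loop is consuming compute without adding anything that plain resampling would not already deliver.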

Beyond a single benchmark

The research belongs to a line of work that is rethinking how interactive agents are evaluated. The controlled student-teacher framework has been publicly released, allowing other teams to replicate the analysis and test their own models against the same baselines. In an ecosystem where vendors promote self-improvement as a standard feature, this study serves as a reminder that the availability of feedback, internal or external, is neither free nor automatically productive. The true frontier is teaching LLMs to be good students before they become eager commentators.