LLM Agents with Foresight: A Three-Stage Training Pipeline for Internal World Models

LLM-based agents have demonstrated strong sequential decision-making skills, yet they remain fundamentally reactive in long-horizon tasks. While humans mentally explore hypothetical scenarios before committing to a plan, standard agents lack such an internal world model.

A research team now proposes to internalize this foresight capability by training a single autoregressive model to produce not only actions but also a textual rollout of prospective states and a plan-conditioned success estimate — a linguistic analogue of a Q-value. The goal: to give the agent an inner simulation of what comes next.
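
To make the idea concrete, here is a minimal Python sketch of how a caller might consume such an output: parse the plan, the textual rollout, and the success estimate, then prefer the candidate with the highest estimate. The tag layout (`<plan>`, `<rollout>`, `<success>`), the `->` step separator, and the names `Foresight`, `parse_foresight`, and `pick_plan` are all assumptions made for illustration; the source does not specify the model's actual output format.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Foresight:
    plan: str            # the proposed plan or action
    rollout: List[str]   # predicted future states, in order
    success: float       # plan-conditioned success estimate in [0, 1]


# Hypothetical tag layout, assumed for this sketch only.
_PATTERN = re.compile(
    r"<plan>(?P<plan>.*?)</plan>\s*"
    r"<rollout>(?P<rollout>.*?)</rollout>\s*"
    r"<success>(?P<success>.*?)</success>",
    re.DOTALL,
)


def parse_foresight(text: str) -> Foresight:
    """Extract plan, rollout steps, and success estimate from model text."""
    match = _PATTERN.search(text)
    if match is None:
        raise ValueError("no foresight block found")
    steps = [s.strip() for s in match.group("rollout").split("->") if s.strip()]
    success = float(match.group("success"))
    if not 0.0 <= success <= 1.0:
        raise ValueError("success estimate must lie in [0, 1]")
    return Foresight(match.group("plan").strip(), steps, success)


def pick_plan(candidates: List[str]) -> Foresight:
    """Choose the candidate whose own success estimate is highest."""
    return max((parse_foresight(c) for c in candidates), key=lambda f: f.success)
```

Selecting by the estimate mirrors how a Q-value is used to rank actions; whether the estimate deserves that trust is exactly the calibration question the training pipeline is meant to address.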

The gap between format and substance

The critical hurdle, identified by the authors, is what they call the “format-capability gap.” Simply fine-tuning an agent on look-ahead traces during post-training leads it to mimic the surface structure of foresight without genuine predictive grounding. The model produces text that looks like a simulation, yet lacks causal understanding of future states. In effect, the agent merely acts out foresight without possessing it.

To bridge this gap, the study introduces a three-stage training pipeline. First, World Model Agentic Mid-Training (WM-AMT) injects latent predictive capabilities into the policy during an intermediate training phase. Next, Format-Eliciting Supervised Fine-Tuning (FE-SFT) structures these emerging abilities into the desired format: the agent learns to verbalize state rollouts and success estimates in text. Finally, Foresight-Conditioned Reinforcement Learning (FC-RL) refines the calibration and utility of the generated simulations, ensuring the model actually uses its own predictions to make better decisions.
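
The ordering constraint can be sketched in a few lines of Python. This is an illustration of the capability-first sequence only, not the authors' training code: the `Stage` type, the `requires` field, and the placeholder `train` functions (which merely record the stage name in a dictionary standing in for a checkpoint) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Stand-in for a real model checkpoint: it only tracks which stages have run.
Checkpoint = Dict[str, List[str]]


@dataclass(frozen=True)
class Stage:
    name: str
    requires: Tuple[str, ...]                  # stages that must already have run
    train: Callable[[Checkpoint], Checkpoint]  # placeholder for real training


def _mark(stage_name: str) -> Callable[[Checkpoint], Checkpoint]:
    """Build a placeholder training step that only records its own name."""
    def train(ckpt: Checkpoint) -> Checkpoint:
        return {"history": ckpt["history"] + [stage_name]}
    return train


PIPELINE = [
    Stage("WM-AMT", (), _mark("WM-AMT")),                   # build predictive capacity
    Stage("FE-SFT", ("WM-AMT",), _mark("FE-SFT")),          # shape it into the format
    Stage("FC-RL", ("WM-AMT", "FE-SFT"), _mark("FC-RL")),   # tune for decisions
]


def run_pipeline(stages: List[Stage], ckpt: Checkpoint) -> Checkpoint:
    """Run stages in order, refusing any stage whose prerequisites are missing."""
    for stage in stages:
        missing = [r for r in stage.requires if r not in ckpt["history"]]
        if missing:
            raise RuntimeError(f"{stage.name} needs {missing} first")
        ckpt = stage.train(ckpt)
    return ckpt
```

Calling `run_pipeline(PIPELINE[1:], {"history": []})` raises an error, which corresponds to the shortcut the authors warn against: eliciting the format before the capability exists.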

Performance and context

Evaluated on search tasks and mathematical reasoning, the approach consistently outperformed other training baselines. The results suggest that effective internal world modeling in LLM agents requires a capability-first pipeline: first build predictive capacity, then shape it into the target format, and only then tune it for decision-making. Skipping the first step yields only an illusion of foresight.

Beyond the cloud: implications for on-premise deployment

For teams managing on-premise infrastructure, the work is directly relevant. Agents that genuinely plan by simulating alternative futures could reduce reliance on repeated round trips to cloud or other external services, because deliberation would happen locally. In scenarios where data sovereignty and low latency are critical, such as smart manufacturing, medical diagnostics, and sensitive automation, having agents that evaluate plans without leaving the corporate perimeter is a concrete advantage.

Yet adoption on bare metal raises total cost of ownership (TCO) questions. The three-stage pipeline is computationally demanding during training, and inference could cost more as well: generating textual rollouts and success estimates means more tokens per decision, which may increase VRAM use and latency. The source gives no estimates of energy costs or hardware requirements, so it remains unclear how efficiency would scale on enterprise-class GPUs, or whether quantization could shrink the memory footprint without degrading simulation quality.
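
In the absence of published figures, a back-of-envelope estimate is the best a planner can do. The Python sketch below uses two standard approximations: weight memory is parameter count times bits per weight, and the key-value cache grows linearly with the number of tokens held in context. The model shape in the usage note is hypothetical and not taken from the source, and the arithmetic ignores activations, quantization metadata, and framework overhead.

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Approximate memory for the model weights alone."""
    return n_params * bits_per_weight // 8


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_tokens: int,
    bytes_per_value: int = 2,   # 2 bytes corresponds to 16-bit cache entries
    batch_size: int = 1,
) -> int:
    """Approximate key-value cache size; the factor 2 covers keys and values."""
    return (
        2 * n_layers * n_kv_heads * head_dim
        * n_tokens * bytes_per_value * batch_size
    )
```

For a hypothetical 7-billion-parameter model, `weight_bytes(7_000_000_000, 16)` gives 14 GB and `weight_bytes(7_000_000_000, 4)` gives 3.5 GB. With 32 layers, 32 key-value heads, and a head dimension of 128, a 4,096-token context needs 2 GiB of cache, and every extra rollout token adds to it linearly. This is why longer internal simulations translate directly into memory pressure.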

Trust is another concern. The model’s simulations are calibrated to the training domain; if the agent operates in a different context, their reliability may degrade. System designers for on-premise deployments will need to assess how to anchor these predictions to real-world data and how to integrate validation loops without turning the world model into a self-referential black box.
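
One simple form such a validation loop could take is to log each success estimate alongside the real outcome and track the Brier score (the mean squared error between predicted probability and outcome) over a sliding window. The Python sketch below is an assumption about how a deployment team might do this, not something described in the source; the class name, window size, and threshold are illustrative.

```python
from collections import deque


class CalibrationMonitor:
    """Tracks how well predicted success matches observed outcomes."""

    def __init__(self, window: int = 100):
        # Only the most recent `window` episodes are kept.
        self._records = deque(maxlen=window)

    def record(self, predicted_success: float, succeeded: bool) -> None:
        """Store one (estimate, real outcome) pair."""
        self._records.append((predicted_success, 1.0 if succeeded else 0.0))

    def brier_score(self) -> float:
        """Mean squared error of the estimates; 0.0 is perfect."""
        if not self._records:
            raise ValueError("no episodes recorded yet")
        total = sum((p - y) ** 2 for p, y in self._records)
        return total / len(self._records)

    def needs_review(self, threshold: float = 0.25) -> bool:
        """Flag the agent when its estimates score worse than the threshold."""
        return self.brier_score() > threshold
```

The default threshold of 0.25 is the score obtained by always predicting 0.5, so an agent that exceeds it is doing worse than an uninformative guess in the current domain. Because the outcomes come from the real environment rather than from the model, the check stays anchored outside the world model itself.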

Looking ahead

The takeaway is clear: turning LLM agents from reactive executors into deliberative planners requires a training hierarchy that puts substance before form. If future research confirms that these mechanisms transfer to real-world domains and are compatible with on-premise configurations, the automation of complex tasks could take a significant step forward. For now, the study stands as a solid methodological advance with the potential to influence both academic research and next-generation industrial architectures.