ITPO: Implicit Optimization for Proactive User-LLM Interaction

ITPO: A new approach for collaborative AI interactions

Human-AI collaboration in multi-turn interactions is crucial for interactive services such as adaptive tutoring and professional consultation. Optimizing these interactions with reinforcement learning is difficult because verifiable intermediate rewards are sparse and user responses are highly stochastic.

Implicit Turn-wise Policy Optimization (ITPO) was introduced to address these challenges. ITPO uses an implicit reward model to derive fine-grained, turn-level rewards from sparse outcome signals. These turn-level signals are more robust than volatile token-level rewards, and an optional normalization mechanism can further improve training stability.
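To make the idea of turn-level rewards concrete, here is a minimal sketch. It assumes the implicit reward takes the log-likelihood-ratio form common in the implicit reward literature (a scaled difference between a scoring model's and a reference model's token log-probabilities), summed over the tokens of each turn. Whether ITPO uses exactly this form is an assumption, and so is the normalization shown, which simply redistributes the final outcome across turns with softmax weights. The names `turn_level_rewards`, `redistribute_outcome`, and `beta` are hypothetical.

```python
import math
from typing import List, Sequence


def turn_level_rewards(
    scorer_logps: Sequence[float],
    ref_logps: Sequence[float],
    turn_ids: Sequence[int],
    beta: float = 1.0,
) -> List[float]:
    """Aggregate per-token log-ratio rewards into one reward per turn.

    Assumed form: reward(token) = beta * (log p_scorer - log p_ref).
    turn_ids[i] is the 0-based index of the turn that token i belongs to.
    """
    if not (len(scorer_logps) == len(ref_logps) == len(turn_ids)):
        raise ValueError("all inputs must have one entry per token")
    if not turn_ids:
        return []
    rewards = [0.0] * (max(turn_ids) + 1)
    for lp, lr, t in zip(scorer_logps, ref_logps, turn_ids):
        rewards[t] += beta * (lp - lr)
    return rewards


def redistribute_outcome(turn_rewards: Sequence[float], outcome: float) -> List[float]:
    """Hypothetical normalization: split the outcome across turns.

    Softmax weights over the raw turn rewards keep every share bounded,
    and the shares sum to the trajectory's outcome reward.
    """
    if not turn_rewards:
        return []
    peak = max(turn_rewards)  # subtract the max for numerical stability
    weights = [math.exp(r - peak) for r in turn_rewards]
    total = sum(weights)
    return [outcome * w / total for w in weights]


if __name__ == "__main__":
    # Toy two-turn dialogue with two assistant tokens per turn.
    raw = turn_level_rewards(
        scorer_logps=[-1.0, -2.0, -0.5, -0.5],
        ref_logps=[-1.5, -2.5, -0.5, -0.5],
        turn_ids=[0, 0, 1, 1],
    )
    print(raw, redistribute_outcome(raw, outcome=1.0))
```

Summing over a whole turn averages out token-level noise, which is one plausible reading of why turn-level signals would be less volatile than token-level ones.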

ITPO was evaluated on three multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results show that ITPO, when combined with PPO, GRPO, or RLOO, converges better than existing baselines. Trajectory analysis indicates that the turn-level preferences ITPO infers are semantically aligned with human judgment. The code is publicly available on GitHub.
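As an illustration of how turn-level rewards could feed one of these policy-gradient methods, the sketch below computes RLOO-style leave-one-out advantages for a group of sampled dialogues. How ITPO actually plugs into PPO, GRPO, or RLOO is not described here, so summing turn rewards into a single per-trajectory return is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence


def trajectory_returns(group_turn_rewards: Sequence[Sequence[float]]) -> List[float]:
    """Collapse each dialogue's turn-level rewards into one scalar return."""
    return [sum(turns) for turns in group_turn_rewards]


def rloo_advantages(returns: Sequence[float]) -> List[float]:
    """Leave-one-out advantages.

    Each sample's baseline is the mean return of the other samples
    drawn for the same prompt, so at least two samples are required.
    """
    k = len(returns)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(returns)
    return [r - (total - r) / (k - 1) for r in returns]


if __name__ == "__main__":
    # Three sampled dialogues for one prompt, each with its own turn rewards.
    group = [[0.5, 0.5], [1.0, 1.0], [2.0, 1.0]]
    print(rloo_advantages(trajectory_returns(group)))
```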
