ITPO: A new approach for collaborative AI interactions

Human-AI collaboration in multi-turn interactions is crucial for interactive services such as adaptive tutoring and professional consultation. Optimizing these interactions via reinforcement learning is complex due to the sparsity of verifiable intermediate rewards and the high stochasticity of user responses.

To address these challenges, Implicit Turn-wise Policy Optimization (ITPO) has been introduced. ITPO leverages an implicit reward model to derive fine-grained, turn-level rewards from sparse outcome signals. Unlike volatile token-level rewards, these turn-level signals are more robust, and ITPO can combine them with a normalization mechanism to further stabilize training.
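As a rough illustration of the idea, the sketch below derives per-turn rewards from token-level policy/reference log-probability ratios (a DPO-style implicit reward, summed within each turn) and then normalizes them across turns. This is a minimal sketch under assumed conventions, not ITPO's actual implementation; the function name, the `beta` scaling, and the mean/std normalization are illustrative choices.

```python
import numpy as np

def turn_level_rewards(policy_logps, ref_logps, turn_bounds, beta=0.1, eps=1e-8):
    """Illustrative sketch: per-turn rewards from sparse/implicit signals.

    policy_logps, ref_logps: per-token log-probabilities of the same trajectory
    turn_bounds: list of (start, end) token-index pairs, one pair per turn
    """
    log_ratio = np.asarray(policy_logps) - np.asarray(ref_logps)
    # Implicit reward per turn: scaled sum of log-ratios over that turn's tokens
    raw = np.array([beta * log_ratio[s:e].sum() for s, e in turn_bounds])
    # Normalize across turns (zero mean, unit scale) to damp volatility
    return (raw - raw.mean()) / (raw.std() + eps)
```

Normalizing at the turn level rather than the token level is what the text credits with the added robustness: each turn's reward is judged relative to the other turns in the same trajectory.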

ITPO was evaluated across three multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results demonstrate that ITPO, when combined with PPO, GRPO, or RLOO, achieves improved convergence compared to existing baselines. Trajectory analysis confirms that ITPO infers turn-level preferences that are semantically aligned with human judgment. The code is publicly available on GitHub.