ATPO: A Novel Approach for Medical Dialogues with LLMs

Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions; the alignment problem is therefore formulated as a Hierarchical Markov Decision Process (H-MDP).
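To make the H-MDP framing concrete, here is a minimal sketch of the hierarchical structure: a high-level policy picks an option (ask a follow-up question vs. commit to a diagnosis), and a low-level policy realizes that option as dialogue text. The two-option action space, function names, and placeholder policies are illustrative assumptions, not the paper's actual design.

```python
import random

# Hypothetical high-level option space for a diagnostic agent.
HIGH_LEVEL_OPTIONS = ["ask_question", "give_diagnosis"]

def high_level_policy(dialogue_state):
    # Placeholder: a trained policy would score options from the state.
    return random.choice(HIGH_LEVEL_OPTIONS)

def low_level_policy(option, dialogue_state):
    # Placeholder for LLM generation conditioned on the chosen option.
    if option == "ask_question":
        return "Can you describe when the symptoms started?"
    return "Preliminary diagnosis: ..."

def dialogue_step(dialogue_state):
    # One H-MDP step: choose an option, then realize it as an utterance.
    option = high_level_policy(dialogue_state)
    utterance = low_level_policy(option, dialogue_state)
    return option, utterance
```

The hierarchy separates *what to do next* (the information-seeking decision) from *how to say it* (token-level generation), which is what makes credit assignment over long dialogues tractable.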

Overcoming the Limitations of Traditional Methods

Conventional Reinforcement Learning (RL) methods struggle with long-horizon credit assignment and unstable value estimation. To address this, a novel algorithm is proposed: Adaptive Tree Policy Optimization (ATPO). ATPO adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy yields more accurate value estimates while fostering more efficient and diverse exploration.
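The allocation idea above can be sketched as follows: score each state by a weighted sum of its Bellman error and the variance of its action values, then split a fixed rollout budget in proportion to those scores. The weights `alpha`/`beta` and the proportional allocation rule are assumptions for illustration; ATPO's exact combination may differ.

```python
import numpy as np

def state_uncertainty(q_values, reward, gamma, v_next, alpha=0.5, beta=0.5):
    """Composite uncertainty for a state: weighted Bellman error plus
    action-value variance. q_values holds Q(s, a) estimates for the
    candidate actions at s; alpha/beta are illustrative weights."""
    v_s = q_values.max()  # greedy value estimate V(s)
    bellman_error = abs(reward + gamma * v_next - v_s)
    action_variance = q_values.var()
    return alpha * bellman_error + beta * action_variance

def allocate_rollouts(uncertainties, total_budget):
    """Split a fixed rollout budget across states in proportion to
    their uncertainty scores; uniform split if all scores are zero."""
    u = np.asarray(uncertainties, dtype=float)
    weights = u / u.sum() if u.sum() > 0 else np.full(len(u), 1.0 / len(u))
    return np.floor(weights * total_budget).astype(int)
```

States whose value estimates are already consistent (low Bellman error) and whose actions look interchangeable (low variance) receive few rollouts, freeing budget for the genuinely ambiguous branch points of the dialogue.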

Optimizations for Computational Efficiency

To mitigate the high computational cost of tree-based RL, two key optimizations are introduced: an uncertainty-guided pruning mechanism that reduces the number of rollouts, and an asynchronous search architecture that reuses the KV cache to maximize inference throughput. Extensive experiments on three public medical dialogue benchmarks demonstrate that ATPO significantly outperforms several strong baselines, with the Qwen3-8B model surpassing the much larger GPT-4o in accuracy.
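As a minimal sketch of the pruning mechanism, assuming each child node carries the uncertainty score defined earlier: branches below an uncertainty threshold are not expanded further and simply keep their current value estimate, so rollouts are spent only where estimates are unreliable. The threshold rule and dictionary layout are assumptions for illustration.

```python
def prune_children(children, threshold):
    """Split child nodes into those worth expanding (high uncertainty)
    and those whose current value estimate is reused without further
    rollouts. Each child is a dict with an 'uncertainty' score."""
    expand = [c for c in children if c["uncertainty"] >= threshold]
    reuse = [c for c in children if c["uncertainty"] < threshold]
    return expand, reuse
```

In an asynchronous search loop, the `expand` set would be dispatched to inference workers that share the parent node's KV cache, while the `reuse` set contributes its cached values to backup immediately.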