## Introduction

Large reasoning models (LRMs) are commonly trained with reinforcement learning from verifiable rewards (RLVR) to strengthen their reasoning abilities, and this approach has produced impressive results across a wide range of reasoning tasks. However, the role of sample polarity (whether a rollout is rewarded as correct or penalized as incorrect) in RLVR training has received little attention. A recent paper takes up this question and examines how samples of different polarity shape the dynamics and behavior of RLVR training.

## Results

The analysis shows that positive samples sharpen reasoning patterns the model already executes correctly, while negative samples push the model to explore new reasoning paths. This suggests that RLVR training can be adjusted so that reward and advantage signals are assigned more precisely, which in turn improves the performance of reasoning models.

## Proposed Solution

Building on this observation, the authors propose A3PO, a token-level advantage shaping method. A3PO sharpens the advantage signals assigned to key tokens, conditioning the shaping on the polarity of each sample. A rough sketch of what such polarity-aware shaping might look like is given at the end of this post.

## Experiments

Experiments on five different reasoning benchmarks show that A3PO improves the performance of reasoning models across a variety of reasoning tasks.

## Conclusion

In conclusion, this work shows how sample polarity affects the dynamics and behavior of RLVR training. The proposed A3PO method offers a way to assign reward signals more precisely and, in doing so, improve the performance of reasoning models.

## Implications

The implications are significant for the field of artificial intelligence: the study demonstrates that refining how RLVR assigns credit can directly improve the reasoning performance of models across tasks.

## Future Work

Future work could explore how A3PO can be combined with other RLVR training methods and how the approach can be refined further.
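
The post does not give A3PO's exact formulation, so the following is only a minimal Python sketch of how polarity-aware, token-level advantage shaping could work in a GRPO-style setup. The function name `shape_advantages`, the `token_weights` importance scores, and the `boost`/`top_frac` parameters are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def shape_advantages(token_weights, sequence_advantage, boost=1.5, top_frac=0.2):
    """Illustrative sketch of polarity-aware token-level advantage shaping.

    token_weights: assumed per-token importance scores (e.g. entropy or
                   attribution values), shape (T,). Not from the paper.
    sequence_advantage: scalar advantage of the whole rollout; its sign is
                        the sample polarity (positive = correct rollout,
                        negative = incorrect rollout).
    """
    token_weights = np.asarray(token_weights, dtype=np.float64)
    T = token_weights.shape[0]

    # Baseline: broadcast the sequence-level advantage to every token,
    # as in standard GRPO-style training.
    adv = np.full(T, float(sequence_advantage))

    # Pick the "key" tokens: the top fraction by importance score.
    k = max(1, int(top_frac * T))
    key_idx = np.argsort(token_weights)[-k:]

    if sequence_advantage > 0:
        # Positive sample: amplify credit on key tokens, sharpening the
        # reasoning pattern the model already gets right.
        adv[key_idx] *= boost
    else:
        # Negative sample: concentrate the penalty on key tokens and soften
        # it elsewhere, so exploration of alternative reasoning paths is not
        # suppressed wholesale.
        adv[key_idx] *= boost
        mask = np.ones(T, dtype=bool)
        mask[key_idx] = False
        adv[mask] /= boost

    return adv
```

In this sketch, positive rollouts amplify credit on the key tokens only, while negative rollouts concentrate the penalty on them, mirroring the sharpening-versus-exploration asymmetry described above.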