Constrained Reinforcement Learning: A New Approach

Reinforcement learning (RL) is a powerful technique, but deploying it in real-world environments, such as robotics and autonomous driving, demands particular attention to safety. Constrained Markov Decision Processes (CMDPs) provide a principled framework for enforcing safety constraints while optimizing performance.

A recent study presents a new algorithm for CMDPs that addresses the limitations of existing methods, which often suffer from significant safety violations or high sample complexity. The proposed primal-dual algorithm balances regret and constraint violations, drawing on techniques from online RL and constrained optimization.
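To make the primal-dual idea concrete, here is a minimal sketch on a toy two-action constrained bandit. The rewards, costs, threshold, and step size are illustrative assumptions, not values from the paper; the paper's algorithm additionally handles unknown dynamics with optimistic exploration over full episodic CMDPs. The primal step acts greedily on the Lagrangian-adjusted reward, while the dual step performs projected gradient ascent on the multiplier.

```python
# Toy constrained bandit (hypothetical numbers, for illustration only):
# maximize reward subject to a bound on expected cost.
r = [1.0, 0.4]        # per-action reward
c = [1.0, 0.0]        # per-action cost
threshold = 0.5       # constraint: long-run average cost <= 0.5

lam = 0.0             # Lagrange multiplier (dual variable)
eta = 0.05            # dual step size
costs = []

for t in range(2000):
    # Primal step: act greedily on the Lagrangian reward r(a) - lam * c(a).
    a = 0 if r[0] - lam * c[0] >= r[1] - lam * c[1] else 1
    costs.append(c[a])
    # Dual step: projected gradient ascent on lam; it grows while the
    # constraint is violated and shrinks (toward 0) once it is satisfied.
    lam = max(0.0, lam + eta * (c[a] - threshold))

mean_cost = sum(costs[-500:]) / 500  # settles near the threshold of 0.5
```

The multiplier converges to the price at which the safe and unsafe actions trade off, so the induced (mixed) behavior satisfies the constraint on average; this is the same mechanism the paper's algorithm uses to balance regret against constraint violations.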

Algorithm Details and Results

The algorithm was analyzed in two settings: relaxed feasibility (where small constraint violations are allowed) and strict feasibility (no violations allowed). Under relaxed feasibility, the algorithm returns an ε-optimal policy with an ε-bounded violation using $\tilde{O}\left(\frac{SAH^3}{\varepsilon^2}\right)$ learning episodes. Under strict feasibility, it guarantees an ε-optimal policy with zero violations, with a sample complexity of $\tilde{O}\left(\frac{SAH^5}{\varepsilon^2\zeta^2}\right)$, where ζ is a problem-dependent Slater constant measuring the margin by which the constraint can be strictly satisfied.
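The gap between the two regimes can be read directly off the bounds: demanding zero violations costs an extra factor of $H^2/\zeta^2$. A quick计 check with hypothetical problem sizes (ignoring the constants and log factors hidden by the $\tilde{O}$ notation) makes this concrete:

```python
# Illustrative comparison of the two sample-complexity bounds,
# with hypothetical CMDP sizes (not from the paper).
S, A, H = 10, 4, 20          # states, actions, horizon
eps, zeta = 0.1, 0.2         # target accuracy and Slater constant

relaxed = S * A * H**3 / eps**2             # eps-bounded violations allowed
strict = S * A * H**5 / (eps**2 * zeta**2)  # zero violations required

ratio = strict / relaxed                    # equals H**2 / zeta**2 = 10000
```

A small Slater constant (a nearly tight constraint) thus makes strictly safe learning sharply more expensive, which matches the intuition that near-infeasible problems are harder to learn without ever violating the constraint.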

These results suggest that learning CMDPs in an online setting can be comparable to learning with a generative model and is no more challenging than learning unconstrained MDPs when small violations are allowed.