Journal
KNOWLEDGE-BASED SYSTEMS
Volume 255
Publisher
ELSEVIER
DOI: 10.1016/j.knosys.2022.109687
Keywords
Reinforcement learning; Policy gradient; Stochastic optimization; Non-convex optimization; Second-order stationary point
Funding
- National Key R&D Program of China [2020YFB1313501]
- Zhejiang Provincial Natural Science Foundation, China [LR19F020005]
- Key R&D program of Zhejiang Province [2021C03003, 2022C01022, 2022C01119]
- Fundamental Research Funds for the Central Universities, China [226-2022-00051]
- National Natural Science Foundation of China [61972347]
This paper proposes a policy gradient method that converges to second-order stationary point policies for any differentiable policy class. The method is computationally efficient and uses cubic-regularized subroutines to escape saddle points while minimizing Hessian-based computations. Experimental results demonstrate the effectiveness of the method.
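For context, the distinction between the two notions of stationarity can be stated as follows (these are the standard definitions from the nonconvex optimization literature, written for maximizing the expected return J(θ); the paper's exact tolerances may differ):

```latex
% First-order stationary point (FOSP): the gradient nearly vanishes
\|\nabla J(\theta)\| \le \epsilon
% Second-order stationary point (SOSP): additionally, no significant
% direction of positive curvature remains (for a maximization problem);
% \rho denotes the Lipschitz constant of the Hessian
\|\nabla J(\theta)\| \le \epsilon,
\qquad
\lambda_{\max}\!\big(\nabla^2 J(\theta)\big) \le \sqrt{\rho\,\epsilon}
```

The second condition is what rules out strict saddle points: at a strict saddle the gradient vanishes but an escape direction with curvature bounded away from zero still exists.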
Policy-based reinforcement learning methods have achieved great success in real-world decision-making problems. However, the theoretical understanding of policy-based methods is still limited. Specifically, existing works mainly focus on first-order stationary point policies (FOSPs); only in some very special reinforcement learning settings (e.g., the tabular case and function approximation with restricted parametric policy classes) do some works consider the globally optimal policy. It is well known that FOSPs can be undesirable local optima or saddle points, and obtaining a global optimum is generally NP-hard. In this paper, we propose a policy gradient method that provably converges to second-order stationary point policies (SOSPs) for any differentiable policy class. The proposed method is computationally efficient: it judiciously uses cubic-regularized subroutines to escape saddle points while minimizing the number of Hessian-based computations. We prove that the method enjoys a sample complexity of Õ(ε^-3.5), which improves upon the best previously known complexity of Õ(ε^-4.5). Finally, experimental results are provided to demonstrate the effectiveness of the method.
© 2022 Elsevier B.V. All rights reserved.
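The saddle-escaping mechanism the abstract refers to can be sketched outside the RL setting. The following is a minimal, hypothetical illustration (not the authors' algorithm): a plain gradient-descent solver for the cubic-regularized subproblem of Nesterov and Polyak, applied to a toy strict-saddle function. All function names, step sizes, and constants here are assumptions chosen for illustration.

```python
import numpy as np

def cubic_step(grad, hess, M, iters=200, lr=0.01):
    """Approximately solve the cubic-regularized model
        min_s  g^T s + 0.5 s^T H s + (M/6) ||s||^3
    by gradient descent on s.  The cubic term keeps the model bounded
    below even when H has negative eigenvalues, which is what lets the
    step exploit negative curvature at a saddle point."""
    s = np.zeros_like(grad)
    for _ in range(iters):
        # gradient of the cubic model; d/ds (M/6)||s||^3 = (M/2)||s|| s
        g_sub = grad + hess @ s + 0.5 * M * np.linalg.norm(s) * s
        s -= lr * g_sub
    return s

def minimize_with_cubic_steps(f_grad, f_hess, x, M=1.0, steps=50):
    """Repeatedly take cubic-regularized steps; at a strict saddle the
    gradient vanishes but the model step still moves along the
    negative-curvature direction of the Hessian."""
    for _ in range(steps):
        x = x + cubic_step(f_grad(x), f_hess(x), M)
    return x

# Toy strict-saddle objective: f(x, y) = x^2 - y^2 + y^4
# has a saddle at the origin and minima at (0, +-1/sqrt(2)).
grad = lambda v: np.array([2 * v[0], -2 * v[1] + 4 * v[1] ** 3])
hess = lambda v: np.array([[2.0, 0.0],
                           [0.0, -2.0 + 12 * v[1] ** 2]])

# Start just off the saddle; the iterates should escape along the
# y-direction toward a local minimum near y = 1/sqrt(2).
x = minimize_with_cubic_steps(grad, hess, np.array([0.0, 1e-3]))
```

A first-order method started at the exact saddle would stall there (zero gradient); the cubic-regularized step does not, which is the intuition behind using such subroutines sparingly, only when a saddle must be escaped, to keep Hessian-based computation low.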