Article

Stochastic cubic-regularized policy gradient method

Journal

Knowledge-Based Systems
Volume 255

Publisher

Elsevier
DOI: 10.1016/j.knosys.2022.109687

Keywords

Reinforcement learning; Policy gradient; Stochastic optimization; Non-convex optimization; Second-order stationary point

Funding

  1. National Key R&D Program of China [2020YFB1313501]
  2. Zhejiang Provincial Natural Science Foundation, China [LR19F020005]
  3. Key R&D program of Zhejiang Province [2021C03003, 2022C01022, 2022C01119]
  4. Fundamental Research Funds for the Central Universities, China [226-2022-00051]
  5. National Natural Science Foundation of China [61972347]


This paper proposes a policy gradient method that provably converges to second-order stationary point policies for any differentiable policy class. The method is computationally efficient: it uses cubic-regularized subroutines to escape saddle points while keeping Hessian-based computations to a minimum. Experimental results demonstrate the effectiveness of the method.
Policy-based reinforcement learning methods have achieved great success in real-world decision-making problems. However, the theoretical understanding of policy-based methods is still limited. Specifically, existing works mainly focus on first-order stationary point policies (FOSPs); only in some very special reinforcement learning settings (e.g., the tabular case and function approximation with restricted parametric policy classes) do some works consider globally optimal policies. It is well known that FOSPs can be undesirable local optima or saddle points, and obtaining a global optimum is generally NP-hard. In this paper, we propose a policy gradient method that provably converges to second-order stationary point policies (SOSPs) for any differentiable policy class. The proposed method is computationally efficient: it judiciously uses cubic-regularized subroutines to escape saddle points while at the same time minimizing Hessian-based computations. We prove that the method enjoys a sample complexity of Õ(ε^(-3.5)), which improves upon the best previously known complexity of Õ(ε^(-4.5)). Finally, experimental results are provided to demonstrate the effectiveness of the method.

© 2022 Elsevier B.V. All rights reserved.
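
To make the escaping mechanism concrete, below is a minimal Python sketch of the kind of cubic-regularized subroutine such methods rely on: it approximately minimizes the cubic model m(s) = gᵀs + ½ sᵀHs + (ρ/6)‖s‖³ using only Hessian-vector products. This is an illustrative stand-in, not the authors' exact algorithm; in the paper, the gradient g and the Hessian-vector products would be stochastic estimates computed from sampled trajectories, and the names and hyperparameters here (solve_cubic_subproblem, lr, iters, the toy saddle H) are hypothetical.

    import numpy as np

    def solve_cubic_subproblem(g, hvp, rho, lr=0.02, iters=2000):
        # Approximately minimize the cubic-regularized model
        #   m(s) = g^T s + 0.5 * s^T H s + (rho / 6) * ||s||^3
        # by gradient descent, touching H only through
        # Hessian-vector products (hvp), never the full matrix.
        s = np.zeros_like(g)
        for _ in range(iters):
            # grad m(s) = g + H s + (rho / 2) * ||s|| * s
            grad_m = g + hvp(s) + 0.5 * rho * np.linalg.norm(s) * s
            s -= lr * grad_m
        return s

    # Toy usage (hypothetical stand-in for the policy objective):
    # f(x) = 0.5 * x^T H x has a saddle at the origin with one
    # negative-curvature direction. A plain gradient step stalls
    # there; the cubic step walks out along that direction.
    H = np.diag([1.0, -0.5])
    x = np.array([1e-3, 1e-3])   # near the saddle point
    g = H @ x                    # exact gradient; stochastic in the paper
    step = solve_cubic_subproblem(g, hvp=lambda v: H @ v, rho=1.0)
    print("escape step:", step)  # roughly [0, 1]: negative curvature

Because the cubic term penalizes long steps, the subproblem minimizer moves along a negative-curvature direction just far enough to escape the saddle, which is the mechanism behind convergence to second-order stationary points rather than mere FOSPs.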

