Journal
KNOWLEDGE-BASED SYSTEMS
Volume 255
Publisher
ELSEVIER
DOI: 10.1016/j.knosys.2022.109687
Keywords
Reinforcement learning; Policy gradient; Stochastic optimization; Non-convex optimization; Second-order stationary point
Funding
- National Key R&D Program of China [2020YFB1313501]
- Zhejiang Provincial Natural Science Foundation, China [LR19F020005]
- Key R&D program of Zhejiang Province [2021C03003, 2022C01022, 2022C01119]
- Fundamental Research Funds for the Central Universities, China [226-2022-00051]
- National Natural Science Foundation of China [61972347]
This paper proposes a policy gradient method that converges to second-order stationary point policies for any differentiable policy class. The method is computationally efficient and uses cubic-regularized subroutines to escape saddle points while minimizing Hessian-based computations. Experimental results demonstrate the effectiveness of the method.
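For context, the distinction between the two notions of stationarity can be stated as follows (these are the standard definitions from the nonconvex optimization literature, written for maximizing the expected return J(θ); the paper's exact tolerances may differ):

```latex
% First-order stationary point (FOSP): the gradient nearly vanishes
\|\nabla J(\theta)\| \le \epsilon
% Second-order stationary point (SOSP): additionally, no significant
% direction of positive curvature remains (for a maximization problem);
% \rho denotes the Lipschitz constant of the Hessian
\|\nabla J(\theta)\| \le \epsilon,
\qquad
\lambda_{\max}\!\big(\nabla^2 J(\theta)\big) \le \sqrt{\rho\,\epsilon}
```

The second condition is what rules out strict saddle points: at a strict saddle the gradient vanishes but an escape direction with curvature bounded away from zero still exists.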
Policy-based reinforcement learning methods have achieved great success in real-world decision-making problems. However, the theoretical understanding of policy-based methods is still limited. Specifically, existing works mainly focus on first-order stationary point policies (FOSPs); only in some very special reinforcement learning settings (e.g., the tabular case and function approximation with restricted parametric policy classes) do some works consider the globally optimal policy. It is well known that FOSPs can be undesirable local optima or saddle points, and obtaining a global optimum is generally NP-hard. In this paper, we propose a policy gradient method that provably converges to second-order stationary point policies (SOSPs) for any differentiable policy class. The proposed method is computationally efficient: it judiciously uses cubic-regularized subroutines to escape saddle points while minimizing the number of Hessian-based computations. We prove that the method enjoys a sample complexity of Õ(ε^-3.5), which improves upon the best previously known complexity of Õ(ε^-4.5). Finally, experimental results are provided to demonstrate the effectiveness of the method.
© 2022 Elsevier B.V. All rights reserved.
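The saddle-escaping mechanism the abstract refers to can be sketched outside the RL setting. The following is a minimal, hypothetical illustration (not the authors' algorithm): a plain gradient-descent solver for the cubic-regularized subproblem of Nesterov and Polyak, applied to a toy strict-saddle function. All function names, step sizes, and constants here are assumptions chosen for illustration.

```python
import numpy as np

def cubic_step(grad, hess, M, iters=200, lr=0.01):
    """Approximately solve the cubic-regularized model
        min_s  g^T s + 0.5 s^T H s + (M/6) ||s||^3
    by gradient descent on s.  The cubic term keeps the model bounded
    below even when H has negative eigenvalues, which is what lets the
    step exploit negative curvature at a saddle point."""
    s = np.zeros_like(grad)
    for _ in range(iters):
        # gradient of the cubic model; d/ds (M/6)||s||^3 = (M/2)||s|| s
        g_sub = grad + hess @ s + 0.5 * M * np.linalg.norm(s) * s
        s -= lr * g_sub
    return s

def minimize_with_cubic_steps(f_grad, f_hess, x, M=1.0, steps=50):
    """Repeatedly take cubic-regularized steps; at a strict saddle the
    gradient vanishes but the model step still moves along the
    negative-curvature direction of the Hessian."""
    for _ in range(steps):
        x = x + cubic_step(f_grad(x), f_hess(x), M)
    return x

# Toy strict-saddle objective: f(x, y) = x^2 - y^2 + y^4
# has a saddle at the origin and minima at (0, +-1/sqrt(2)).
grad = lambda v: np.array([2 * v[0], -2 * v[1] + 4 * v[1] ** 3])
hess = lambda v: np.array([[2.0, 0.0],
                           [0.0, -2.0 + 12 * v[1] ** 2]])

# Start just off the saddle; the iterates should escape along the
# y-direction toward a local minimum near y = 1/sqrt(2).
x = minimize_with_cubic_steps(grad, hess, np.array([0.0, 1e-3]))
```

A first-order method started at the exact saddle would stall there (zero gradient); the cubic-regularized step does not, which is the intuition behind using such subroutines sparingly, only when a saddle must be escaped, to keep Hessian-based computation low.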