Article

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

Journal

JOURNAL OF MACHINE LEARNING RESEARCH

Publisher

MICROTOME PUBL

Keywords

Policy Gradient; Reinforcement Learning

Funding

  1. Washington Research Foundation for Innovation in Data-intensive Discovery
  2. ONR [N00014-18-1-2247]
  3. DARPA [FA8650-18-27836]
  4. ARO under MURI Award [W911NF-11-1-0303]

Abstract

This work provides provable characterizations of policy gradient methods in the context of discounted Markov Decision Processes, focusing on different policy parameterizations and providing approximation guarantees that avoid explicit worst-case dependencies on the size of state space.
Policy gradient methods are among the most effective methods for challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including if and how fast they converge to a globally optimal solution and how they cope with the approximation error incurred by using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both tabular policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy, and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average-case, avoiding explicit worst-case dependencies on the size of the state space, by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
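
To make the tabular setting described in the abstract concrete, below is a minimal sketch, assuming a small randomly generated MDP and exact (rather than sampled) gradients. The MDP, step size, and iteration count are illustrative choices and are not taken from the paper; the sketch only shows policy gradient ascent under the softmax tabular parameterization, the regime in which the paper establishes global convergence to the optimal policy.

# A minimal illustrative sketch (not the paper's code): exact policy gradient
# ascent with the tabular softmax parameterization on a small random MDP.
# The MDP, step size, and iteration count are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9                       # state/action space sizes, discount factor
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s'] = transition probability
R = rng.uniform(size=(S, A))                  # deterministic reward r(s, a)
rho = np.ones(S) / S                          # start-state distribution

def softmax_policy(theta):
    # pi_theta(a|s) from tabular logits theta[s, a]
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def value_and_q(pi):
    # Exact V^pi and Q^pi via the Bellman linear system (no sampling)
    P_pi = np.einsum('sab,sa->sb', P, pi)     # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V                     # Q[s, a] = r(s, a) + gamma * E[V(s')]
    return V, Q

def discounted_visitation(pi):
    # Unnormalized discounted state-visitation measure: sum_t gamma^t Pr(s_t = s)
    P_pi = np.einsum('sab,sa->sb', P, pi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)

theta = np.zeros((S, A))
for _ in range(2000):
    pi = softmax_policy(theta)
    V, Q = value_and_q(pi)
    d = discounted_visitation(pi)
    advantage = Q - V[:, None]
    # Policy gradient theorem for the softmax tabular parameterization:
    # dV^pi(rho)/dtheta[s, a] = d(s) * pi(a|s) * A^pi(s, a)
    theta += 0.5 * d[:, None] * pi * advantage

print("V^pi(rho) after training:", rho @ value_and_q(softmax_policy(theta))[0])

Using exact gradients isolates the optimization question (whether and how fast the iterates approach the optimal policy) from estimation error; the paper's results for log-linear and neural policy classes additionally account for approximation error and distribution shift.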
