Article

Optimistic reinforcement learning by forward Kullback-Leibler divergence optimization

Journal

NEURAL NETWORKS
Volume 152, Pages 169-180

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.neunet.2022.04.021

Keywords

Reinforcement learning; Control as probabilistic inference; Kullback-Leibler divergence; Optimistic learning


This paper reinterprets the traditional optimization method in reinforcement learning as KL-divergence optimization and introduces a new method based on forward KL divergence. The study demonstrates that moderate optimism can accelerate learning and lead to higher rewards.
This paper presents a new interpretation of the traditional optimization methods in reinforcement learning (RL) as optimization problems using reverse Kullback-Leibler (KL) divergence, and derives a new optimization method that uses forward KL divergence in place of reverse KL divergence in these problems. Although RL originally aims to maximize return indirectly through optimization of the policy, recent work by Levine proposed a different derivation process that explicitly treats optimality as a stochastic variable. This paper follows that concept and formulates the traditional learning rules for both the value function and the policy as optimization problems with reverse KL divergence including optimality. Focusing on the asymmetry of KL divergence, new optimization problems with forward KL divergence are then derived. Remarkably, these new optimization problems can be regarded as optimistic RL, where the degree of optimism is intuitively specified by a hyperparameter converted from an uncertainty parameter. The optimism can be further enhanced by integrating the method with prioritized experience replay and eligibility traces, both of which accelerate learning. The effects of this expected optimism were investigated through learning tendencies in numerical simulations using PyBullet. As a result, moderate optimism accelerated learning and yielded higher rewards. In a realistic robotic simulation, the proposed method with moderate optimism outperformed one of the state-of-the-art RL methods. (C) 2022 Elsevier Ltd. All rights reserved.
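As a minimal sketch of the contrast between the two objectives (assuming the standard control-as-inference setup of Levine, which the abstract refers to but does not spell out), let O be a binary optimality variable with p(O = 1 \mid s, a) \propto \exp(r(s, a) / \beta), where \beta > 0 is an illustrative temperature standing in for the paper's uncertainty parameter. The traditional and the proposed policy objectives then read:

\min_{\pi} \; \mathrm{KL}\bigl( \pi(a \mid s) \,\big\|\, p(a \mid s, O = 1) \bigr) \quad \text{(reverse KL, mode-seeking)}

\min_{\pi} \; \mathrm{KL}\bigl( p(a \mid s, O = 1) \,\big\|\, \pi(a \mid s) \bigr) \quad \text{(forward KL, mass-covering)}

Because the forward KL divergence penalizes the policy wherever the optimal posterior has probability mass, it encourages \pi to cover all plausibly optimal actions rather than collapse onto a single mode, which is one intuitive reading of the "optimism" the authors describe.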
