Article

Q-learning with heterogeneous update strategy

Journal

INFORMATION SCIENCES
Volume 656, Issue -, Pages -

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.ins.2023.119902

Keywords

Q-learning; Homogeneous update; Heterogeneous update; HetUp Q-learning; HetUpSoft Q-learning; HetUpSoft DQN

Abstract

This paper proposes a heterogeneous update idea and designs the HetUp Q-learning algorithm, which enlarges the normalized gap by overestimating the Q-value of the optimal action and underestimating the Q-values of the other actions. Because HetUp Q-learning requires the optimal action as input, a softmax strategy is applied to estimate it, yielding HetUpSoft Q-learning and HetUpSoft DQN. Extensive experimental results show significant improvements over SOTA baselines.
A variety of algorithms have been proposed to mitigate the overestimation bias of Q-learning. These algorithms reduce the estimate of the maximum Q-value, i.e., they perform a homogeneous update. As a result, some of them, such as Double Q-learning, suffer from underestimation bias. In contrast, this paper proposes a heterogeneous update idea, which aims to enlarge the normalized gap between the Q-value of the optimal action and the Q-values of the other actions. Based on heterogeneous update, we design HetUp Q-learning. More specifically, HetUp Q-learning increases the normalized gap by overestimating the Q-value of the optimal action and underestimating the Q-values of the other actions. However, one limitation is that HetUp Q-learning takes the optimal action as input to decide whether a state-action pair should be overestimated or underestimated. To address this challenge, we apply a softmax strategy to estimate the optimal action, obtaining HetUpSoft Q-learning. We also extend HetUpSoft Q-learning to HetUpSoft DQN for high-dimensional environments. Extensive experimental results show that our proposed methods substantially outperform SOTA baselines in different settings. In particular, HetUpSoft DQN improves the average score per episode over SOTA baselines by at least 55.49% and 32.26% in the Pixelcopter and Breakout environments, respectively.
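
For intuition, a minimal tabular sketch of the heterogeneous-update idea follows. The exact HetUp/HetUpSoft update rules are not reproduced in this record, so the gap bonus kappa, the softmax temperature tau, and the names softmax and hetupsoft_update are illustrative assumptions rather than the authors' formulation; the sketch only shows the mechanism described above: estimate the optimal action with a softmax over Q-values, then bias that action's target upward and the other actions' targets downward.

import numpy as np

def softmax(x):
    z = x - x.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def hetupsoft_update(Q, s, a, r, s_next,
                     alpha=0.1, gamma=0.99, tau=1.0, kappa=0.05):
    # Softmax strategy to estimate the optimal action in state s; plain HetUp
    # would require the true optimal action as input at this point.
    a_hat = np.random.choice(Q.shape[1], p=softmax(Q[s] / tau))

    # Standard bootstrapped Q-learning target.
    target = r + gamma * Q[s_next].max()

    # Heterogeneous update (assumed additive form): overestimate the target for
    # the estimated optimal action, underestimate it for every other action, so
    # the normalized gap between them grows.
    target += kappa if a == a_hat else -kappa

    Q[s, a] += alpha * (target - Q[s, a])
    return Q

In a full agent this function would replace the standard target computation inside the learning loop; HetUpSoft DQN presumably applies the same biased target to the network's bootstrapped values, but that extension is not shown here.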

