Article

Q-learning with heterogeneous update strategy

Journal

INFORMATION SCIENCES
Volume 656

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.ins.2023.119902

Keywords

Q-learning; Homogeneous update; Heterogeneous update; HetUp Q-learning; HetUpSoft Q-learning; HetUpSoft DQN


This paper proposes a heterogeneous update idea and designs the HetUp Q-learning algorithm, which enlarges the normalized gap by overestimating the Q-value of the optimal action and underestimating the Q-values of the other actions. Because HetUp Q-learning requires the optimal action as input, a softmax strategy is applied to estimate it, yielding HetUpSoft Q-learning and HetUpSoft DQN. Extensive experimental results show significant improvements over SOTA baselines.
A variety of algorithms have been proposed to mitigate the overestimation bias of Q-learning. These algorithms reduce the estimate of the maximum Q-value, i.e., they perform a homogeneous update. As a result, some of them, such as Double Q-learning, suffer from underestimation bias. In contrast, this paper proposes a heterogeneous update idea, which aims to enlarge the normalized gap between the Q-value of the optimal action and the Q-values of the other actions. Based on the heterogeneous update, we design HetUp Q-learning. More specifically, HetUp Q-learning increases the normalized gap by overestimating the Q-value of the optimal action and underestimating the Q-values of the other actions. However, one limitation is that HetUp Q-learning takes the optimal action as input to decide whether a state-action pair should be overestimated or underestimated. To address this challenge, we apply a softmax strategy to estimate the optimal action, obtaining HetUpSoft Q-learning. We also extend HetUpSoft Q-learning to HetUpSoft DQN for high-dimensional environments. Extensive experimental results show that our proposed methods drastically outperform SOTA baselines in different settings. In particular, HetUpSoft DQN improves the average score per episode over SOTA baselines by at least 55.49% and 32.26% in the Pixelcopter and Breakout environments, respectively.
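
The abstract gives enough of the idea to sketch what such an update might look like in tabular form. The Python sketch below is illustrative only and is not taken from the paper: the softmax estimate of the optimal action follows the HetUpSoft description, while the gap coefficient kappa, the inverse temperature beta, and the exact placement of the over-/underestimation bias in the bootstrap target are assumptions.

import numpy as np

def softmax(x, beta):
    # Numerically stable softmax with inverse temperature beta.
    z = beta * (x - np.max(x))
    e = np.exp(z)
    return e / e.sum()

def hetupsoft_update(Q, s, a, r, s_next, done,
                     alpha=0.1, gamma=0.99, beta=5.0, kappa=0.05):
    # Q      : np.ndarray of shape (n_states, n_actions), updated in place
    # beta   : inverse temperature of the softmax used to estimate the
    #          optimal action (the HetUpSoft idea from the abstract)
    # kappa  : hypothetical gap coefficient, not taken from the paper;
    #          it controls how strongly the estimated optimal action is
    #          pushed up and the other actions are pushed down
    n_actions = Q.shape[1]

    # Estimate the optimal action at state s with a softmax over Q[s]
    # instead of requiring the true optimal action as input.
    p_opt = softmax(Q[s], beta)
    a_opt = np.random.choice(n_actions, p=p_opt)

    # Standard Q-learning bootstrap target.
    target = r if done else r + gamma * np.max(Q[s_next])

    # Heterogeneous part: overestimate the target for the (estimated)
    # optimal action and underestimate it for all other actions,
    # enlarging the gap between them.
    bias = kappa if a == a_opt else -kappa
    Q[s, a] += alpha * (target + bias - Q[s, a])
    return Q

# Toy usage on a 5-state, 3-action table.
Q = np.zeros((5, 3))
Q = hetupsoft_update(Q, s=0, a=1, r=1.0, s_next=2, done=False)

A homogeneous update would apply the same correction to every action's target; the asymmetric bias term is what distinguishes the heterogeneous idea described in the abstract.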
