Article

Deterministic policy optimization with clipped value expansion and long-horizon planning

Journal

NEUROCOMPUTING
Volume 483, Pages 299-310

Publisher

ELSEVIER
DOI: 10.1016/j.neucom.2022.02.022

Keywords

Model-based reinforcement learning; Policy gradient; Sample efficiency; Planning; Imitation learning

Funding

  1. National Key R&D Program of China [2019YFC1906201]
  2. National Natural Science Foundation of China [91748122]

This paper presents a model-based deterministic policy gradient (MBDPG) method that exploits the learned dynamics model through multi-step gradient information. It demonstrates higher sample efficiency than state-of-the-art model-free methods and better convergence performance than state-of-the-art model-based reinforcement learning methods.
Model-based reinforcement learning (MBRL) approaches have demonstrated great potential in handling complex tasks with high sample efficiency. However, MBRL lags behind model-free reinforcement learning (MFRL) in asymptotic performance. In this paper, we present a long-horizon policy optimization method, namely model-based deterministic policy gradient (MBDPG), for efficient exploitation of the learned dynamics model through multi-step gradient information. First, we approximate the dynamics of the environment with a parameterized linear combination of an ensemble of Gaussian distributions. Moreover, the dynamics model is equipped with a memory module and trained on a multi-step prediction task to reduce cumulative error. Second, successful experience is used to guide the policy at the early stage of training to avoid ineffective exploration. Third, a clipped double value network is expanded in the learned dynamics model to reduce overestimation bias. Finally, we present a deterministic policy gradient approach that backpropagates multi-step gradients along trajectories imagined in the model. Our method shows higher sample efficiency than state-of-the-art MFRL methods while maintaining better convergence performance and time efficiency than state-of-the-art MBRL methods. (c) 2022 Elsevier B.V. All rights reserved.
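
To make the modeling step concrete, the following is a minimal PyTorch-style sketch of a dynamics model along the lines the abstract describes: an ensemble of Gaussian heads whose outputs are combined with learned weights, a GRU serving as the memory module, and a multi-step prediction loss to curb compounding error. All names, signatures, and hyperparameters (EnsembleGaussianDynamics, multistep_nll, hidden sizes, number of heads) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class EnsembleGaussianDynamics(nn.Module):
        # Ensemble of Gaussian heads over next-state deltas, combined with
        # learned weights; a GRU plays the role of the memory module.
        def __init__(self, state_dim, action_dim, hidden=200, n_heads=5):
            super().__init__()
            self.gru = nn.GRU(state_dim + action_dim, hidden, batch_first=True)
            self.mu = nn.ModuleList([nn.Linear(hidden, state_dim) for _ in range(n_heads)])
            self.log_std = nn.ModuleList([nn.Linear(hidden, state_dim) for _ in range(n_heads)])
            self.weights = nn.Linear(hidden, n_heads)  # combination weights

        def forward(self, state, action, h=None):
            x = torch.cat([state, action], dim=-1).unsqueeze(1)          # (B, 1, S+A)
            out, h = self.gru(x, h)
            feat = out.squeeze(1)                                        # (B, hidden)
            w = torch.softmax(self.weights(feat), dim=-1).unsqueeze(-1)  # (B, K, 1)
            mus = torch.stack([m(feat) for m in self.mu], dim=1)         # (B, K, S)
            stds = torch.stack([ls(feat).clamp(-5, 2).exp() for ls in self.log_std], dim=1)
            mean = (w * mus).sum(1)
            # Moment-matched variance of the weighted Gaussian combination.
            var = (w * (stds ** 2 + mus ** 2)).sum(1) - mean ** 2
            return state + mean, var.clamp_min(1e-6), h

    def multistep_nll(model, states, actions, horizon):
        # Multi-step prediction task: roll the model forward from states[:, 0]
        # and penalize the Gaussian NLL at every step along the horizon.
        s, h, loss = states[:, 0], None, 0.0
        for t in range(horizon):
            s, var, h = model(s, actions[:, t], h)
            loss = loss + (0.5 * (var.log() + (states[:, t + 1] - s) ** 2 / var)).mean()
        return loss / horizon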
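
The early-stage guidance from successful experience could, for instance, take a behavior-cloning form: regress the deterministic policy toward actions from successful rollouts before switching to model-based optimization. This formulation is an assumption for illustration, not taken from the paper.

    def imitation_loss(policy, demo_states, demo_actions):
        # Guide the policy with successful experience early in training
        # to avoid ineffective exploration (behavior-cloning-style loss).
        return ((policy(demo_states) - demo_actions) ** 2).mean()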
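
The third and fourth steps can be read together as expanding a clipped double value estimate along an imagined model rollout and differentiating the resulting multi-step return with respect to the policy. Below is a minimal sketch assuming a differentiable reward function reward_fn and the model interface from the first sketch; mbdpg_policy_loss and all signatures are hypothetical.

    import torch

    def mbdpg_policy_loss(model, policy, v1, v2, reward_fn, states, horizon, gamma=0.99):
        # Roll the deterministic policy through the learned model for H steps,
        # accumulating discounted (differentiable) rewards along the way.
        s, h, ret, disc = states, None, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s_next, _, h = model(s, a, h)
            ret = ret + disc * reward_fn(s, a, s_next)
            disc *= gamma
            s = s_next
        # Bootstrap with the clipped double estimate min(V1, V2) at the final
        # imagined state to reduce overestimation bias.
        ret = ret + disc * torch.min(v1(s), v2(s))
        return -ret.mean()  # ascend the expanded value: multi-step DPG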
