4.6 Article

A prioritized objective actor-critic method for deep reinforcement learning

Journal

Neural Computing & Applications
Volume 33, Issue 16, Pages 10335-10349

Publisher

Springer London Ltd
DOI: 10.1007/s00521-021-05795-0

Keywords

Deep learning; Reinforcement learning; Learning systems; Multi-objective optimization; Actor-critic architecture

Summary

The study introduces two actor-critic deep reinforcement learning methods, Multi-Critic Single Policy (MCSP) and Single Critic Multi-Policy (SCMP), to improve agent performance on complex problems. Both methods adjust agent behavior by adopting a weighted-sum scalarization of different objective functions.

Abstract

A growing number of complex problems pose significant challenges for decision-making theory and reinforcement learning practice. These problems often involve multiple conflicting reward signals, which inherently lead to poor exploration when an agent pursues a specific goal; in extreme cases, the agent gets stuck in a sub-optimal solution and starts behaving harmfully. To overcome such obstacles, we introduce two actor-critic deep reinforcement learning methods, Multi-Critic Single Policy (MCSP) and Single Critic Multi-Policy (SCMP), which adjust agent behavior to efficiently achieve a designated goal by adopting a weighted-sum scalarization of different objective functions. In particular, MCSP creates a human-centric policy that corresponds to a predefined priority weighting of the objectives, whereas SCMP generates a mixed policy from a set of priority weights, i.e., the resulting policy draws on the knowledge of several policies (each corresponding to one priority weight) to dynamically prioritize objectives in real time. We evaluate both methods on top of the Asynchronous Advantage Actor-Critic (A3C) algorithm, exploiting its multithreading mechanism to dynamically balance the training intensity of different policies within a single network. Simulation results show that MCSP and SCMP significantly outperform A3C with respect to the mean total reward in two complex problems: Food Collector and Seaquest.
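To make the weighted-sum scalarization concrete, the sketch below illustrates the multi-critic, single-policy idea in PyTorch: a shared trunk with one policy head and one critic head per objective, where per-objective advantages are combined with predefined priority weights before the policy-gradient update. This is a minimal illustration, not the authors' implementation; the names (MultiCriticActor, mcsp_loss) and the 0.7/0.3 priority weights are assumptions, and the A3C multithreaded-worker machinery described in the abstract is omitted.

```python
# Minimal sketch (not the authors' code) of a multi-critic, single-policy
# actor-critic update with weighted-sum scalarization of objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiCriticActor(nn.Module):
    """Shared trunk, one policy head, and one critic head per objective."""

    def __init__(self, obs_dim: int, n_actions: int, n_objectives: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # single policy
        self.value_heads = nn.ModuleList(                # one critic per objective
            [nn.Linear(hidden, 1) for _ in range(n_objectives)]
        )

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        logits = self.policy_head(h)
        values = torch.cat([head(h) for head in self.value_heads], dim=-1)  # (B, n_objectives)
        return logits, values


def mcsp_loss(logits, values, actions, returns, weights, value_coef=0.5, entropy_coef=0.01):
    """Actor-critic loss using a weighted sum of per-objective advantages.

    returns: (B, n_objectives) per-objective discounted returns
    weights: (n_objectives,)   predefined priority weights (sum to 1)
    """
    advantages = returns - values                              # per-objective advantages
    scalar_adv = (advantages * weights).sum(dim=-1).detach()   # weighted-sum scalarization
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(-1)).squeeze(-1)

    policy_loss = -(chosen * scalar_adv).mean()                # policy-gradient term
    value_loss = advantages.pow(2).mean()                      # regression for all critic heads
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy


if __name__ == "__main__":
    # Toy batch, just to show the shapes involved.
    B, obs_dim, n_actions, n_obj = 4, 8, 3, 2
    net = MultiCriticActor(obs_dim, n_actions, n_obj)
    logits, values = net(torch.randn(B, obs_dim))
    loss = mcsp_loss(
        logits, values,
        actions=torch.randint(0, n_actions, (B,)),
        returns=torch.randn(B, n_obj),
        weights=torch.tensor([0.7, 0.3]),  # hypothetical priority weights
    )
    loss.backward()
    print(float(loss))
```

An SCMP-style variant would instead keep a single critic and several policy heads, one per priority-weight setting, and mix their action distributions at run time to reprioritize objectives; that branching is left out here to keep the sketch short.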
