4.6 Article

A Prioritized objective actor-critic method for deep reinforcement learning

Journal

NEURAL COMPUTING & APPLICATIONS
Volume 33, Issue 16, Pages 10335-10349

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s00521-021-05795-0

Keywords

Deep learning; Reinforcement learning; Learning systems; Multi-objective optimization; Actor-critic architecture

Abstract

The study introduces two actor-critic deep reinforcement learning methods, Multi-Critic Single Policy (MCSP) and Single Critic Multi-Policy (SCMP), to improve agent performance on complex problems. Both methods adjust agent behavior by adopting a weighted-sum scalarization of different objective functions.
An increasing number of complex problems naturally pose significant challenges in decision-making theory and reinforcement learning practice. These problems often involve multiple conflicting reward signals that inherently lead to poor exploration when the agent seeks a specific goal. In extreme cases, the agent gets stuck in a sub-optimal solution and starts behaving harmfully. To overcome such obstacles, we introduce two actor-critic deep reinforcement learning methods, namely Multi-Critic Single Policy (MCSP) and Single Critic Multi-Policy (SCMP), which adjust agent behavior to efficiently achieve a designated goal by adopting a weighted-sum scalarization of different objective functions. In particular, MCSP creates a human-centric policy that corresponds to a predefined priority weighting of the objectives. In contrast, SCMP generates a mixed policy from a set of priority weights, i.e., the generated policy draws on the knowledge of several policies (each corresponding to one priority weight) to prioritize objectives dynamically in real time. We implement our methods on top of the Asynchronous Advantage Actor-Critic (A3C) algorithm, exploiting its multithreading mechanism to dynamically balance the training intensity of different policies within a single network. Finally, simulation results show that MCSP and SCMP significantly outperform A3C with respect to the mean total reward in two complex problems: Food Collector and Seaquest.
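
The weighted-sum scalarization described in the abstract can be illustrated with a short sketch. The PyTorch snippet below is a minimal, illustrative reading of the MCSP idea (one policy head, one critic head per objective, per-objective advantages combined by priority weights) together with an SCMP-style blending of policies trained under different weights. The names MultiCriticSinglePolicy, mcsp_loss, and mix_policies, the discrete action space, and all network sizes and coefficients are our own assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class MultiCriticSinglePolicy(nn.Module):
    """Shared trunk, one policy head, and one value head per objective (MCSP-style layout)."""

    def __init__(self, obs_dim, n_actions, n_objectives, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_objectives)])

    def forward(self, obs):
        h = self.trunk(obs)
        logits = self.policy_head(h)
        values = torch.cat([head(h) for head in self.value_heads], dim=-1)  # [batch, n_objectives]
        return logits, values


def mcsp_loss(model, obs, actions, returns, weights, value_coef=0.5, entropy_coef=0.01):
    """Actor-critic loss where per-objective advantages are combined by weighted-sum scalarization.

    obs:     [batch, obs_dim]
    actions: [batch]          actions taken in the rollout
    returns: [batch, n_obj]   per-objective discounted returns
    weights: [n_obj]          priority weights (assumed to sum to 1)
    """
    logits, values = model(obs)
    dist = Categorical(logits=logits)

    advantages = returns - values                          # one advantage per objective
    scalar_adv = (advantages.detach() * weights).sum(-1)   # weighted-sum scalarization

    policy_loss = -(dist.log_prob(actions) * scalar_adv).mean()
    value_loss = advantages.pow(2).mean()                  # each critic fits its own objective's return
    entropy = dist.entropy().mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy


def mix_policies(per_weight_logits, weights):
    """SCMP-style mixing (illustrative): blend policies trained under different priority weights."""
    probs = torch.stack([torch.softmax(l, dim=-1) for l in per_weight_logits])  # [K, batch, n_actions]
    mixed = (weights.view(-1, 1, 1) * probs).sum(dim=0)
    return Categorical(probs=mixed)


# Toy usage with made-up shapes: 2 objectives, 8-dim observations, 4 discrete actions.
model = MultiCriticSinglePolicy(obs_dim=8, n_actions=4, n_objectives=2)
obs = torch.randn(32, 8)
actions = torch.randint(0, 4, (32,))
returns = torch.randn(32, 2)
weights = torch.tensor([0.7, 0.3])
mcsp_loss(model, obs, actions, returns, weights).backward()
```

In the A3C-based setup described in the abstract, each worker thread would collect its own rollouts and apply such an update asynchronously to a shared network; that machinery is omitted from this sketch.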
