Article

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential

Journal

IEEE Transactions on Cybernetics
Volume 51, Issue 2, Pages 1015-1027

Publisher

IEEE (Institute of Electrical and Electronics Engineers, Inc.)
DOI: 10.1109/TCYB.2019.2932203

Keywords

Gradient method; multiagent reinforcement learning (MARL); multiagent system; reinforcement learning

Funding

  1. Qingdao PostDoctoral Applied Research Project
  2. International Cooperation Training Project for Outstanding Young and Middle-Aged Teachers of Universities in Shandong Province
  3. AGV Road Network Design and Path Planning Method Based on Multi-Agent Reinforcement Learning
  4. Shandong Provincial Natural Science Foundation of China [ZR2017PF005]
  5. National Natural Science Foundation of China [61873138, 61573205, 61603205]

Abstract

This study introduces a new MARL algorithm, the policy gradient potential (PGP) algorithm, which learns the optimal joint strategy in identical-interest games. Theoretical analysis and experimental studies demonstrate that the PGP algorithm outperforms other MARL algorithms in cumulative reward and in the number of time steps per episode.
Gradient-based methods are widely used in today's multiagent reinforcement learning (MARL). In a gradient-based MARL algorithm, each agent updates its parameterized strategy in the direction of the gradient of some performance index. However, few studies address the convergence of existing gradient-based MARL algorithms in identical-interest games. In this article, we propose a policy gradient potential (PGP) algorithm that takes the PGP, rather than the gradient itself, as the source of information guiding the strategy update, in order to learn the optimal joint strategy, that is, the one with maximal global reward. Since the payoff matrix and the joint strategy are often unavailable to the learning agents in practice, we take the probability of obtaining the maximal reward as the performance index. Theoretical analysis of the PGP algorithm on a continuous model of an identical-interest repeated game shows that if the component action of every optimal joint action is unique, then the critical points corresponding to all optimal joint actions are asymptotically stable. The PGP algorithm is studied experimentally and compared against other MARL algorithms on two commonly used collaborative tasks, the robots-leaving-a-room task and the distributed sensor network task, as well as on a real-world minefield navigation problem in which only local state and local reward information are available. The results show that the PGP algorithm outperforms the other algorithms in terms of cumulative reward and the number of time steps used in an episode.
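For a concrete picture of the setting, the sketch below runs a plain REINFORCE-style policy-gradient update on the performance index named in the abstract (the probability of obtaining the maximal reward) in a small identical-interest repeated game. It is a minimal illustration under assumed details: the 2x2 payoff matrix, the softmax parameterization, and the learning rate are all choices made here, and it does not reproduce the paper's actual potential-based PGP update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Identical-interest 2x2 payoff matrix (assumed for illustration);
# both agents receive the same reward, with a unique optimum at (0, 0).
PAYOFF = np.array([[10.0, 0.0],
                   [0.0, 5.0]])
R_MAX = PAYOFF.max()

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Each agent holds its own policy parameters; neither sees the payoff
# matrix or the other agent's strategy, matching the paper's setting.
theta = [np.zeros(2), np.zeros(2)]
alpha = 0.1  # learning rate (assumed)

for episode in range(2000):
    pis = [softmax(t) for t in theta]
    acts = [rng.choice(2, p=pi) for pi in pis]
    r = PAYOFF[acts[0], acts[1]]
    # Performance index: indicator of obtaining the maximal reward,
    # i.e., a one-sample estimate of P(r = r_max).
    success = float(r == R_MAX)
    for i in range(2):
        # REINFORCE-style gradient of log pi_i(a_i) w.r.t. theta_i.
        grad_log = -pis[i]
        grad_log[acts[i]] += 1.0
        theta[i] += alpha * success * grad_log

print("learned strategies:", [softmax(t).round(3) for t in theta])
```

On this payoff matrix the two strategies typically concentrate on the optimal joint action (0, 0); the paper's asymptotic-stability guarantees concern the PGP dynamics on the continuous model, not this simplified estimator.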
