Article

Policy Evaluation and Seeking for Multiagent Reinforcement Learning via Best Response

Journal

IEEE TRANSACTIONS ON AUTOMATIC CONTROL
Volume 67, Issue 4, Pages 1898-1913

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TAC.2021.3085171

Keywords

Best response; multiagent reinforcement learning; policy evaluation and seeking; sink equilibrium; stochastic stability

Funding

  1. National Natural Science Foundation of China [61374034]
  2. China Scholarship Council
  3. U.S. Air Force Office of Scientific Research [FA9550-15-1-0138]


Abstract

Multiagent policy evaluation and seeking are long-standing challenges in developing theories for multiagent reinforcement learning (MARL), owing to multidimensional learning goals, nonstationary environments, and scalability issues in the joint policy space. This article introduces two metrics, grounded in a game-theoretic solution concept called the sink equilibrium, for the evaluation, ranking, and computation of policies in multiagent learning. We adopt strict best response dynamics (SBRDs) to model selfish behaviors at a meta-level for MARL. Our approach can handle dynamical cyclical behaviors (unlike approaches based on Nash equilibria and Elo ratings), and is more compatible with single-agent reinforcement learning than α-rank, which relies on weakly better responses. We first consider settings where the difference between the largest and second-largest equilibrium metric has a known lower bound. With this knowledge, we propose a class of perturbed SBRDs with the following property: only policies with the maximum metric are observed with nonzero probability, for a broad class of stochastic games with finite memory. We then consider settings where this lower bound is unknown, and propose a class of perturbed SBRDs such that the metrics of the policies observed with nonzero probability differ from the optimal by any given tolerance. The proposed perturbed SBRDs address the scalability issue and opponent-induced nonstationarity by fixing the strategies of the other agents while one agent learns, and use empirical game-theoretic analysis to estimate the payoff of each strategy profile obtained under the perturbation.
