Dynamic Weights and Prior Reward in Policy Fusion for Compound Agent Learning

ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY (2023)

Journal

ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY

Volume 14, Issue 6, Pages -

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.1145/3623405

Keywords

Compound agent learning; deep reinforcement learning; policy fusion; dynamic weights; prior reward

Automated Summary

We propose a new method for policy fusion in deep reinforcement learning that dynamically selects sub-tasks and reduces fusion bias. Experimental results show significant improvements in task duration, episode reward, and score difference.

Abstract

In the deep reinforcement learning (DRL) domain, a compound learning task is often decomposed into several sub-tasks in a divide-and-conquer manner. Each sub-task is trained separately, and the resulting policies are then fused concurrently to achieve the original task, a process referred to as policy fusion. However, state-of-the-art (SOTA) policy fusion methods weight all sub-tasks equally throughout the task process, which prevents the agent from relying on different sub-tasks at different stages. To address this limitation, we propose a generic policy fusion approach, referred to as Policy Fusion Learning with Dynamic Weights and Prior Reward (PFLDWPR), to automate the time-varying selection of sub-tasks. Specifically, PFLDWPR produces a time-varying one-hot vector over the sub-tasks that dynamically selects a suitable sub-task and masks the rest throughout the entire task process, enabling the fused strategy to optimally guide the agent in executing the compound task. The sub-tasks, weighted by the dynamic one-hot vector, are then aggregated to obtain the action policy for the original task. Moreover, we collect the sub-tasks' rewards at the pre-training stage as a prior reward, which, along with the current reward, is used to train the policy fusion network. This approach thus reduces fusion bias by leveraging prior experience. Experimental results on three popular learning tasks demonstrate that the proposed method significantly improves three SOTA policy fusion methods in terms of task duration, episode reward, and score difference.
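
The selection and aggregation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of an argmax over gate scores to form the one-hot vector, and the linear mix of current and prior reward (with a hypothetical coefficient `beta`) are all assumptions made for illustration, since the abstract does not specify these details.

```python
from typing import Callable, List, Sequence

# A sub-policy maps a state to a list of action probabilities (assumed shape).
Policy = Callable[[Sequence[float]], List[float]]


def one_hot_argmax(scores: Sequence[float]) -> List[int]:
    """Return a one-hot vector marking the highest-scoring sub-task.

    Assumption: the selector emits one score per sub-task and the
    largest score wins; ties go to the lowest index.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == best else 0 for i in range(len(scores))]


def fuse_policies(
    state: Sequence[float],
    sub_policies: Sequence[Policy],
    gate_scores: Sequence[float],
) -> List[float]:
    """Aggregate sub-policy outputs under a one-hot mask.

    Only the selected sub-task contributes; the others are masked to zero.
    """
    mask = one_hot_argmax(gate_scores)
    outputs = [policy(state) for policy in sub_policies]
    n_actions = len(outputs[0])
    return [
        sum(m * out[a] for m, out in zip(mask, outputs))
        for a in range(n_actions)
    ]


def mixed_reward(current: float, prior: float, beta: float = 0.5) -> float:
    """Hypothetical training signal: current reward plus a scaled prior reward.

    The linear form and the coefficient beta are assumptions, not taken
    from the paper.
    """
    return current + beta * prior


# Two toy sub-policies over two actions (hypothetical).
def chase(state: Sequence[float]) -> List[float]:
    return [0.9, 0.1]


def avoid(state: Sequence[float]) -> List[float]:
    return [0.2, 0.8]


# At this step the gate favors the second sub-task, so its policy is used.
fused = fuse_policies([0.0], [chase, avoid], gate_scores=[0.3, 0.7])
reward = mixed_reward(current=1.0, prior=2.0, beta=0.5)
```

In a time-varying setting the gate scores would be recomputed at every step, so the selected sub-task can change as the episode progresses.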
