Article

Draw on advantages and avoid disadvantages by making a multi-step prediction

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 237

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2023.121345

Keywords

Reinforcement learning; Exploration; Intrinsic reward; Multi-step prediction; Policy optimization

Abstract

The article proposes a policy framework called PGMP, which uses multi-step prediction to guide exploration in reinforcement learning. The framework combines a curiosity mechanism with a safety bonus model to steer exploration toward safe, task-relevant areas. In addition, a looking-ahead model predicts future states, actions, and rewards, allowing the agent to optimize its policy against predicted future states.
Reinforcement learning agents learn about their environment through exploration, and the information collected during this interaction helps the agent predict future situations. However, uncontrolled exploration may lead the agent into dangerous regions of the environment, producing poor decisions and impairing performance. To address this issue, a framework referred to as the policy guided by multi-step prediction (PGMP) is proposed. PGMP uses a curiosity mechanism based on multi-step prediction errors to stimulate exploration. To encourage the agent to explore safe or task-relevant areas, a safety bonus model judges whether an exploration area is safe by predicting the reward that can be gained there. The combination of these two intrinsic rewards serves as a curiosity model that assigns high returns to unknown states and potentially safe actions. In addition, to avoid dangers within a limited number of future steps during exploration, a looking-ahead model is introduced that predicts future multi-step states, actions, and rewards. This future information is then combined with the policy network and included in the loss function of the policy update, allowing the agent to optimize its policy for predicted future states. Experiments on several tasks demonstrate that the proposed PGMP framework significantly improves the agent's performance.
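
The abstract does not give the exact formulation, but the interplay of the two intrinsic rewards can be illustrated with a minimal sketch. The Python/PyTorch code below is a hypothetical reconstruction, not the authors' implementation: the `ForwardModel` architecture, the `reward_predictor` interface, and the trade-off coefficients `beta_c` and `beta_s` are all assumptions introduced here for illustration.

```python
# Minimal sketch of a PGMP-style combined intrinsic reward (assumed, not
# the paper's code): curiosity from multi-step prediction error plus a
# safety bonus from a learned reward predictor.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next state from (state, action); architecture assumed."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def curiosity_bonus(forward_model, states, actions):
    """Multi-step prediction error: roll the forward model over the
    trajectory and accumulate error against the states actually visited.
    High error marks an unfamiliar region and yields a large bonus."""
    error, pred = 0.0, states[0]
    for t in range(len(actions)):
        pred = forward_model(pred, actions[t])
        error = error + (pred - states[t + 1]).pow(2).mean()
    return error

def safety_bonus(reward_predictor, state, action):
    """Predicted attainable reward; a low prediction flags the area as
    unpromising or unsafe, so the bonus stays small."""
    return reward_predictor(state, action)

def intrinsic_reward(forward_model, reward_predictor, states, actions,
                     beta_c=0.1, beta_s=0.1):
    """Weighted sum of the two bonuses; beta_c and beta_s are assumed
    coefficients, not values reported in the paper."""
    r_c = curiosity_bonus(forward_model, states, actions)
    r_s = safety_bonus(reward_predictor, states[0], actions[0])
    return beta_c * r_c + beta_s * r_s
```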
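
The looking-ahead idea of scoring the policy against a limited number of imagined future steps can likewise be sketched as a generic model-based surrogate loss. The horizon, the discount `gamma`, and the `dynamics_model`/`reward_model` interfaces below are assumptions rather than the paper's definitions.

```python
def lookahead_policy_loss(policy, dynamics_model, reward_model, state,
                          horizon=3, gamma=0.99):
    """Return the negative discounted predicted return over `horizon`
    imagined steps, so gradient descent pushes the policy toward futures
    the learned models rate highly. A generic model-based surrogate in
    the spirit of PGMP's looking-ahead term, not the paper's exact loss."""
    total, s = 0.0, state
    for t in range(horizon):
        a = policy(s)                 # predicted future action
        r = reward_model(s, a)        # predicted future reward
        total = total + (gamma ** t) * r
        s = dynamics_model(s, a)      # predicted future state
    return -total.mean()
```

In practice such a term would be added to the ordinary policy-gradient objective, so the update trades off observed returns against the models' assessment of the predicted future.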
