Journal
INTERNATIONAL JOURNAL OF ADAPTIVE CONTROL AND SIGNAL PROCESSING
Volume 36, Issue 2, Pages 334-353
Publisher
WILEY
DOI: 10.1002/acs.3282
Keywords
eligibility traces; instrumental variable method; least squares; reinforcement learning; temporal difference
Funding
- Jiangsu Double Innovation Talents Project for Jiangsu province [4207012004]
- National Natural Science Foundation of China [62073074]
This study proposes RLS-TD-f, a new reinforcement learning method that uses a forgetting factor instead of eligibility traces, and tests its effectiveness in a Policy Iteration setting.
We propose a new reinforcement learning method in the framework of Recursive Least Squares-Temporal Difference (RLS-TD). Instead of using the standard mechanism of eligibility traces (resulting in RLS-TD(lambda)), we propose to use the forgetting factor commonly used in gradient-based or least-squares estimation, and we show that it plays a role similar to that of eligibility traces. An instrumental variable perspective is adopted to formulate the new algorithm, referred to as RLS-TD with forgetting factor (RLS-TD-f). An interesting aspect of the proposed algorithm is that it can be interpreted as the minimizer of an appropriate cost function. We test the effectiveness of the algorithm in a Policy Iteration setting, meaning that we aim to improve the performance of an initially stabilizing control policy (over a large portion of the state space). We take a cart-pole benchmark and an adaptive cruise control benchmark as experimental platforms.
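To make the idea of a forgetting factor in recursive least-squares TD concrete, the following is a minimal sketch of the textbook RLS-TD(0) recursion with exponential forgetting, written from the instrumental variable viewpoint (the feature vector acts as the instrument, and the regressor is the feature difference). It is not taken from the paper and may differ from RLS-TD-f in its details. The function names, the forgetting factor `mu`, the initial covariance of 100, and the two-state chain are all assumptions chosen for illustration.

```python
def rls_td_forgetting_step(theta, P, phi, phi_next, reward, gamma, mu):
    """One RLS-TD(0) update with exponential forgetting factor mu (assumed name).

    theta: current weight vector (list of floats)
    P:     current inverse of the weighted matrix A (list of lists)
    phi, phi_next: feature vectors of the current and next state
    """
    n = len(theta)
    # Regressor d = phi - gamma * phi_next; phi itself is the instrument.
    d = [phi[i] - gamma * phi_next[i] for i in range(n)]
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    dT_P = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
    denom = mu + sum(d[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    # TD error: reward + gamma * V(next) - V(current) under the current weights.
    td_error = reward - sum(d[i] * theta[i] for i in range(n))
    theta_new = [theta[i] + gain[i] * td_error for i in range(n)]
    # Sherman-Morrison update of P for A_new = mu * A_old + phi * d^T.
    P_new = [[(P[i][j] - gain[i] * dT_P[j]) / mu for j in range(n)]
             for i in range(n)]
    return theta_new, P_new


def evaluate_two_state_chain(steps=400, gamma=0.5, mu=0.95):
    """Evaluate a hypothetical deterministic two-state chain with one-hot features.

    State 0 moves to state 1 with reward 1; state 1 moves to state 0 with reward 0.
    """
    features = [[1.0, 0.0], [0.0, 1.0]]
    rewards = [1.0, 0.0]
    theta = [0.0, 0.0]
    P = [[100.0, 0.0], [0.0, 100.0]]  # large initial covariance (assumption)
    state = 0
    for _ in range(steps):
        next_state = 1 - state
        theta, P = rls_td_forgetting_step(
            theta, P, features[state], features[next_state],
            rewards[state], gamma, mu)
        state = next_state
    return theta


if __name__ == "__main__":
    print(evaluate_two_state_chain())
```

For this chain with gamma = 0.5, the exact values solve V0 = 1 + 0.5 V1 and V1 = 0.5 V0, giving V0 = 4/3 and V1 = 2/3, so the estimated weights can be checked against those numbers. With mu = 1 the recursion reduces to ordinary RLS-TD(0); with mu below 1, older samples are discounted geometrically, which is the mechanism the abstract compares to eligibility traces.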