Article

The Boundedness Conditions for Model-Free HDP(lambda)

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2019)

Journal

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Volume 30, Issue 7, Pages 1928-1942

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TNNLS.2018.2875870

Keywords

lambda-return; action dependent (AD); approximate dynamic programing (ADP); heuristic dynamic programing (HDP); Lyapunov stability; model free; uniformly ultimately bounded (UUB)

Categories

Computer Science, Artificial Intelligence; Computer Science, Hardware & Architecture; Computer Science, Theory & Methods; Engineering, Electrical & Electronic

Funding

Missouri University of Science and Technology Mary K. Finley Endowment
Army Research Laboratory [W911NF-18-2-0260]
Higher Committee for Educational Development
Basrah Oil Company in Iraq
Intelligent Systems Center


Abstract

This paper provides a stability analysis for a model-free, action-dependent heuristic dynamic programing (HDP) approach with an eligibility-trace long-term prediction parameter (lambda). HDP(lambda) learns from more than one future reward. Eligibility traces have long been popular in Q-learning, and this paper proves and demonstrates that they are also worthwhile to use with HDP. We prove that HDP(lambda) is uniformly ultimately bounded (UUB) under certain conditions. Previous works present a UUB proof for traditional HDP [HDP(lambda = 0)]; we extend that proof to include the lambda parameter. Using Lyapunov stability, we demonstrate the boundedness of the estimated error for the critic and actor neural networks, as well as the learning rate parameters. Three case studies demonstrate the effectiveness of HDP(lambda). The first case study considers the trajectories of the internal reinforcement signal nonlinear system, where we compare the results with the performance of HDP and traditional temporal difference [TD(lambda)] for different lambda values. The second case study is a single-link inverted pendulum, where we compare HDP(lambda) with regular HDP under different levels of noise. The third case study is a 3-D maze navigation benchmark, on which we compare state-action-reward-state-action (SARSA), Q(lambda), HDP, and HDP(lambda). All of these simulation results illustrate that HDP(lambda) has competitive performance; thus, HDP(lambda) is not only UUB but also useful in comparison with traditional HDP.
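As a rough illustration of what "learning from more than one future reward" means, the sketch below computes the textbook lambda-return for a single finite trajectory. It is not the paper's HDP(lambda) update: the function name `lambda_returns`, its arguments, and the default values of `gamma` and `lam` are assumptions made for this example, and the recursion is the standard one from the reinforcement learning literature rather than anything taken from the abstract.

```python
def lambda_returns(rewards, next_values, gamma=0.95, lam=0.5):
    """Backward-recursive lambda-return for one finite trajectory.

    rewards[t]     : reward observed after step t (hypothetical input)
    next_values[t] : critic's estimate for the state reached after step t
                     (use 0.0 for a terminal state)
    gamma          : discount factor (assumed value)
    lam            : lambda, the long-term prediction parameter
    """
    if len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must have the same length")

    returns = [0.0] * len(rewards)
    # Beyond the last recorded step there is nothing left but the critic's estimate.
    g_next = next_values[-1] if rewards else 0.0
    for t in reversed(range(len(rewards))):
        # Blend the one-step bootstrap (weight 1 - lam) with the longer return (weight lam).
        g_next = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g_next)
        returns[t] = g_next
    return returns


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    next_values = [0.5, 0.5, 0.0]
    # lam = 0 uses only the next reward plus the critic's estimate (one-step target).
    print(lambda_returns(rewards, next_values, gamma=0.9, lam=0.0))
    # lam = 1 accumulates every remaining reward in the trajectory.
    print(lambda_returns(rewards, next_values, gamma=0.9, lam=1.0))
```

With `lam = 0` each target depends on a single future reward, which corresponds to the HDP(lambda = 0) case the abstract calls traditional HDP; larger values of `lam` give more weight to rewards further along the trajectory.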
