Article

A Method Integrating Q-Learning With Approximate Dynamic Programming for Gantry Work Cell Scheduling

Journal

IEEE Transactions on Automation Science and Engineering

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TASE.2020.2984739

Keywords

Approximate dynamic programming (ADP); gantry scheduling; Markov decision process (MDP); planning and learning; Q-learning

Funding

  1. U.S. National Science Foundation (NSF) [CMMI 1351160, CMMI 1853454]


This article proposes Q-ADP, a method that integrates reinforcement learning with approximate dynamic programming for real-time gantry scheduling in a gantry work cell. Numerical studies show that Q-ADP outperforms standard Q-learning and requires less data for convergence. By learning directly from interactions with the environment, the method avoids bias from model design, which makes it particularly useful when real data are limited.
This article formulates gantry real-time scheduling in a gantry work cell, where material transfer is driven by gantries, as a Markov decision process (MDP). Classical learning methods and planning methods for solving the optimization problems in an MDP are discussed. An innovative method, called Q-ADP, is proposed to integrate reinforcement learning (RL) with approximate dynamic programming (ADP). Q-ADP uses a model-free Q-learning algorithm to learn state values through interactions with the environment; meanwhile, the planning steps during learning use ADP to keep updating state values over several sample paths. A model of one-step transition probabilities, built from the machines' reliability model, serves the ADP algorithm. To demonstrate the effectiveness of this method, a numerical study compares its production performance with that of a standard Q-learning algorithm. The simulation results show that Q-ADP outperforms standard Q-learning for the same length of training. It is also shown that, by repeatedly updating state values through sample paths, Q-ADP requires less data for the gantry policy to converge, which makes the method promising when real data are limited.

Note to Practitioners: The goal of this work is to find a near-optimal gantry assignment policy that realizes real-time control of material-handling gantry/robot movements in gantry work cells. Properly assigning gantries based on the real-time state of the production system can prevent machine stoppages due to material shortage and consequently improve production performance. This gantry scheduling task is a sequential decision-making problem and can be represented as a Markov decision process (MDP). To solve the MDP, an algorithm integrating model-free Q-learning with model-based approximate dynamic programming (ADP) is proposed. By learning directly from interaction with the environment, the method avoids bias introduced by model design. Meanwhile, a planning process during learning can efficiently speed up convergence of the policy, which is particularly beneficial when real data are insufficient.
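
To make the abstract's description more concrete, the following Python sketch shows one common way such an integration can be structured: a model-free Q-learning update applied to real transitions, interleaved with planning sweeps that reuse the same update rule on sample paths drawn from a one-step transition model (e.g., one built from machine reliability data). This is only an illustrative, Dyna-style sketch, not the authors' implementation; the environment interface (env.reset, env.step), the transition_model and reward_model callables, and all hyperparameter values are assumptions introduced here for illustration.

    import random
    from collections import defaultdict

    # Illustrative sketch only: names and hyperparameters are assumptions,
    # not the paper's actual implementation.
    ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1      # learning rate, discount, exploration
    N_PLANNING_PATHS, PATH_LENGTH = 5, 20       # planning budget per real step

    Q = defaultdict(float)  # Q[(state, action)] -> value estimate, defaults to 0

    def epsilon_greedy(state, actions):
        """Pick a random action with probability EPSILON, else the greedy one."""
        if random.random() < EPSILON:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def q_update(s, a, r, s_next, actions):
        """Standard one-step Q-learning update."""
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

    def planning_sweep(transition_model, reward_model, actions, start_state):
        """Planning: roll out sample paths with an assumed one-step transition
        model and apply the same update rule to the simulated transitions."""
        for _ in range(N_PLANNING_PATHS):
            s = start_state
            for _ in range(PATH_LENGTH):
                a = epsilon_greedy(s, actions)
                s_next = transition_model(s, a)     # sampled from P(s' | s, a)
                r = reward_model(s, a, s_next)
                q_update(s, a, r, s_next, actions)
                s = s_next

    def train(env, transition_model, reward_model, actions, episodes=500):
        """Interleave model-free learning on real transitions with planning."""
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = epsilon_greedy(s, actions)
                s_next, r, done = env.step(a)       # real interaction (model-free)
                q_update(s, a, r, s_next, actions)
                planning_sweep(transition_model, reward_model, actions, s_next)
                s = s_next

The design choice being illustrated is the one the abstract emphasizes: real interactions keep the value estimates free of model bias, while the extra planning updates from sample paths let the policy converge with fewer real data points.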
