Journal
ANNALS OF STATISTICS
Volume 50, Issue 6, Pages 3364-3387
Publisher
INST MATHEMATICAL STATISTICS-IMS
DOI: 10.1214/22-AOS2231
Keywords
Markov decision process; average reward; policy optimization; doubly robust estimator
Funding
- NIH [P50DA039838, R01AA023187, P50DA054039, P41EB028242, U01 CA229437, UG3DE028723, UH3DE028723]
Abstract
We consider the batch (off-line) policy learning problem in the infinite-horizon Markov decision process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward and show that it achieves semiparametric efficiency. Further, we develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy, and we establish a finite-sample regret guarantee. The performance of the method is illustrated by simulation studies and an analysis of a mobile health study promoting physical activity.
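To make the abstract's key object concrete, the sketch below shows one common form of a doubly robust average-reward estimator: given an estimated density-ratio weight function ω(s, a) and an estimated relative value function h(s) (both names and signatures are illustrative assumptions, not the paper's code), the average reward η is obtained by solving the empirical estimating equation (1/n) Σ ω(s_i, a_i)(r_i − η + h(s'_i) − h(s_i)) = 0. This is a minimal sketch of the general technique, not the authors' implementation.

```python
import numpy as np

def dr_average_reward(states, actions, rewards, next_states, omega, h):
    """Doubly robust estimate of the long-run average reward eta.

    omega: estimated density-ratio weight function, omega(s, a)
    h:     estimated relative value (bias) function, h(s)

    Solves (1/n) * sum_i omega(s_i, a_i) * (r_i - eta + h(s'_i) - h(s_i)) = 0
    for eta, which gives a weighted average of temporal-difference terms.
    """
    rewards = np.asarray(rewards, dtype=float)
    w = np.array([omega(s, a) for s, a in zip(states, actions)])
    td = rewards + np.array([h(s) for s in next_states]) - np.array([h(s) for s in states])
    # Weighted solution of the estimating equation.
    return np.sum(w * td) / np.sum(w)

# Sanity check: with constant weights and h = 0, the estimator reduces
# to the plain sample mean of the observed rewards.
eta = dr_average_reward(
    states=[0, 1, 0], actions=[0, 1, 1],
    rewards=[1.0, 2.0, 3.0], next_states=[1, 0, 1],
    omega=lambda s, a: 1.0, h=lambda s: 0.0,
)
```

The doubly robust property referred to in the abstract means the estimating equation remains valid if either ω or h (but not necessarily both) is correctly specified.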