Article

BATCH POLICY LEARNING IN AVERAGE REWARD MARKOV DECISION PROCESSES

Journal

ANNALS OF STATISTICS
Volume 50, Issue 6, Pages 3364-3387

Publisher

Institute of Mathematical Statistics (IMS)
DOI: 10.1214/22-AOS2231

Keywords

Markov decision process; average reward; policy optimization; doubly robust estimator

Funding

  1. NIH [P50DA039838, R01AA023187, P50DA054039, P41EB028242, U01 CA229437, UG3DE028723, UH3DE028723]


This study focuses on the batch (off-line) policy learning problem in the infinite horizon Markov decision process and proposes a doubly robust estimator to estimate the average reward. Moreover, an optimization algorithm is developed to compute the optimal policy in a parameterized stochastic policy class.
We consider the batch (off-line) policy learning problem in the infinite horizon Markov decision process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward and show that it achieves semiparametric efficiency. Further, we develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy and we establish a finite-sample regret guarantee. The performance of the method is illustrated by simulation studies and an analysis of a mobile health study promoting physical activity.
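The abstract's doubly robust construction can be illustrated with a minimal numerical sketch. This is not the authors' implementation: it assumes plug-in estimates are already available for the stationary density ratio (omega), the relative (differential) value function Q, and an initial average-reward estimate, and the function name and signature are illustrative. The estimator augments the plug-in average reward with a density-ratio-weighted temporal-difference correction, which is what gives the double robustness.

```python
import numpy as np

def dr_average_reward(rewards, omega, q_sa, q_next_pi, eta_plugin):
    """Doubly robust estimate of the long-run average reward (sketch).

    rewards    : observed rewards R_i
    omega      : estimated stationary density ratios omega_hat(S_i, A_i)
    q_sa       : estimated relative value function Q_hat(S_i, A_i)
    q_next_pi  : sum_a pi(a | S'_i) * Q_hat(S'_i, a) under the target policy
    eta_plugin : plug-in estimate of the average reward
    """
    # Temporal-difference residual of the average-reward Bellman equation.
    td = rewards - eta_plugin + q_next_pi - q_sa
    # Augment the plug-in estimate with the weighted correction term.
    return eta_plugin + np.mean(omega * td)
```

As a sanity check, with unit weights and a constant relative value function the correction collapses and the estimate reduces to the empirical mean reward, regardless of the plug-in value.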
