Journal
ANNALS OF STATISTICS
Volume 50, Issue 6, Pages 3364-3387
Publisher
INST MATHEMATICAL STATISTICS-IMS
DOI: 10.1214/22-AOS2231
Keywords
Markov decision process; average reward; policy optimization; doubly robust estimator
Funding
- NIH [P50DA039838, R01AA023187, P50DA054039, P41EB028242, U01 CA229437, UG3DE028723, UH3DE028723]
Abstract
We consider the batch (off-line) policy learning problem in the infinite-horizon Markov decision process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward and show that it achieves semiparametric efficiency. Further, we develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy, and we establish a finite-sample regret guarantee. The performance of the method is illustrated by simulation studies and an analysis of a mobile health study promoting physical activity.
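To make the abstract's key object concrete, the sketch below shows one common form of a doubly robust average-reward estimator: given an estimated density-ratio weight function ω(s, a) and an estimated relative value function h(s) (both names and signatures are illustrative assumptions, not the paper's code), the average reward η is obtained by solving the empirical estimating equation (1/n) Σ ω(s_i, a_i)(r_i − η + h(s'_i) − h(s_i)) = 0. This is a minimal sketch of the general technique, not the authors' implementation.

```python
import numpy as np

def dr_average_reward(states, actions, rewards, next_states, omega, h):
    """Doubly robust estimate of the long-run average reward eta.

    omega: estimated density-ratio weight function, omega(s, a)
    h:     estimated relative value (bias) function, h(s)

    Solves (1/n) * sum_i omega(s_i, a_i) * (r_i - eta + h(s'_i) - h(s_i)) = 0
    for eta, which gives a weighted average of temporal-difference terms.
    """
    rewards = np.asarray(rewards, dtype=float)
    w = np.array([omega(s, a) for s, a in zip(states, actions)])
    td = rewards + np.array([h(s) for s in next_states]) - np.array([h(s) for s in states])
    # Weighted solution of the estimating equation.
    return np.sum(w * td) / np.sum(w)

# Sanity check: with constant weights and h = 0, the estimator reduces
# to the plain sample mean of the observed rewards.
eta = dr_average_reward(
    states=[0, 1, 0], actions=[0, 1, 1],
    rewards=[1.0, 2.0, 3.0], next_states=[1, 0, 1],
    omega=lambda s, a: 1.0, h=lambda s: 0.0,
)
```

The doubly robust property referred to in the abstract means the estimating equation remains valid if either ω or h (but not necessarily both) is correctly specified.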