| Literature DB >> 33716408 |
Tingting Xu, Henghui Zhu, Ioannis Ch. Paschalidis.
Abstract
We consider the problem of estimating the policy and transition probability model of a Markov Decision Process from data (state, action, next state tuples). The transition probability and policy are assumed to be parametric functions of a sparse set of features associated with the tuples. We propose two regularized maximum likelihood estimation algorithms for learning the transition probability model and policy, respectively. An upper bound is established on the regret, which is the difference between the average reward of the estimated policy under the estimated transition probabilities and that of the original unknown policy under the true (unknown) transition probabilities. We provide a sample complexity result showing that we can achieve a low regret with a relatively small amount of training samples. We illustrate the theoretical results with a healthcare example and a robot navigation experiment.Entities:
Keywords: Learning transition dynamics; Markov decision processes; Maximum likelihood estimation; Policy learning; Regularization
Year: 2020 PMID: 33716408 PMCID: PMC7944408 DOI: 10.1016/j.ejcon.2020.04.003
Source DB: PubMed Journal: Eur J Control ISSN: 0947-3580 Impact factor: 2.395