Authors: Bingyan Wang, Yuling Yan, Jianqing Fan
Abstract
The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S| · |A|, which can be prohibitively large when S or A is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K/((1-γ)^3 ε^2) (resp. K/((1-γ)^4 ε^2)), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDPs, both the model-based approach and Q-learning are sample-efficient when K is relatively small, and hence the title of this paper.
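As a reading aid, the following display sketches the feature-based transition model and the stated sample-size thresholds in standard notation; the feature map φ and weight functions ψ_k are conventions from the linear-MDP literature and are not defined in this record itself.

\[
  P(s' \mid s, a) \;=\; \phi(s,a)^{\top}\psi(s') \;=\; \sum_{k=1}^{K} \phi_k(s,a)\,\psi_k(s'),
  \qquad \phi(s,a) \in \mathbb{R}^{K},
\]
\[
  N \;\gtrsim\; \frac{K}{(1-\gamma)^{3}\,\varepsilon^{2}} \ \text{(model-based)},
  \qquad
  N \;\gtrsim\; \frac{K}{(1-\gamma)^{4}\,\varepsilon^{2}} \ \text{(Q-learning)},
\]

both up to logarithmic factors, where N denotes the total number of samples drawn from the generative model.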
Keywords: leave-one-out analysis; linear transition model; model-based reinforcement learning; sample complexity; vanilla Q-learning
Year: 2021 PMID: 36168331 PMCID: PMC9512142
Source DB: PubMed Journal: Adv Neural Inf Process Syst ISSN: 1049-5258