
Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model.

Authors:  Bingyan Wang; Yuling Yan; Jianqing Fan

Abstract

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S| · |A|, which can be prohibitively large when S or A is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K/((1 − γ)³ε²) (resp. K/((1 − γ)⁴ε²)), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDPs, both the model-based approach and Q-learning are sample-efficient when K is relatively small, and hence the title of this paper.
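For context, the linearly-parameterized transition kernel described in the abstract is commonly written as below; this is the standard linear-kernel MDP formulation, and the paper's exact normalization conditions on the features may differ:

```latex
% Linear transition model: K-dimensional state-action features \phi(s,a)
% and unknown functions \psi_k over next states
P(s' \mid s, a) \;=\; \sum_{k=1}^{K} \phi_k(s,a)\,\psi_k(s'),
\qquad \phi(s,a) \in \mathbb{R}^K.

% Sample complexities stated in the abstract, up to logarithmic factors:
% model-based approach (\varepsilon-optimal policy) and
% Q-learning (\varepsilon-optimal Q-function), respectively
\widetilde{O}\!\left(\frac{K}{(1-\gamma)^{3}\varepsilon^{2}}\right)
\quad\text{and}\quad
\widetilde{O}\!\left(\frac{K}{(1-\gamma)^{4}\varepsilon^{2}}\right).
```

Under this model, estimating the K-dimensional parameterization rather than the full |S| × |A| transition table is what removes the dependence on the sizes of the state and action spaces.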


Keywords:  leave-one-out analysis; linear transition model; model-based reinforcement learning; sample complexity; vanilla Q-learning

Year:  2021        PMID: 36168331      PMCID: PMC9512142     

Source DB:  PubMed          Journal:  Adv Neural Inf Process Syst        ISSN: 1049-5258


References:  6 in total

1.  Mastering the game of Go with deep neural networks and tree search.

Authors:  David Silver; Aja Huang; Chris J Maddison; Arthur Guez; Laurent Sifre; George van den Driessche; Julian Schrittwieser; Ioannis Antonoglou; Veda Panneershelvam; Marc Lanctot; Sander Dieleman; Dominik Grewe; John Nham; Nal Kalchbrenner; Ilya Sutskever; Timothy Lillicrap; Madeleine Leach; Koray Kavukcuoglu; Thore Graepel; Demis Hassabis
Journal:  Nature       Date:  2016-01-28       Impact factor: 49.962

2.  On the Theory of Dynamic Programming.

Authors:  R Bellman
Journal:  Proc Natl Acad Sci U S A       Date:  1952-08       Impact factor: 11.205

3.  Noisy matrix completion: understanding statistical guarantees for convex relaxation via nonconvex optimization.

Authors:  Yuxin Chen; Yuejie Chi; Jianqing Fan; Cong Ma; Yuling Yan
Journal:  SIAM J Optim       Date:  2020-10-28       Impact factor: 2.850

4.  Mastering the game of Go without human knowledge.

Authors:  David Silver; Julian Schrittwieser; Karen Simonyan; Ioannis Antonoglou; Aja Huang; Arthur Guez; Thomas Hubert; Lucas Baker; Matthew Lai; Adrian Bolton; Yutian Chen; Timothy Lillicrap; Fan Hui; Laurent Sifre; George van den Driessche; Thore Graepel; Demis Hassabis
Journal:  Nature       Date:  2017-10-18       Impact factor: 49.962

5.  Spectral method and regularized MLE are both optimal for top-K ranking.

Authors:  Yuxin Chen; Jianqing Fan; Cong Ma; Kaizheng Wang
Journal:  Ann Stat       Date:  2019-05-21       Impact factor: 4.028

6.  Bridging convex and nonconvex optimization in robust PCA: noise, outliers, and missing data.

Authors:  Yuxin Chen; Jianqing Fan; Cong Ma; Yuling Yan
Journal:  Ann Stat       Date:  2021-11-12       Impact factor: 4.904

