Authors: Bingyan Wang, Yuling Yan, Jianqing Fan
Abstract
The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S| · |A|, which can be prohibitively large when S or A is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K/((1-γ)^3 ε^2) (resp. K/((1-γ)^4 ε^2)), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDPs, both the model-based approach and Q-learning are sample-efficient when K is relatively small, and hence the title of this paper.
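As a reading aid, the following display sketches the feature-based transition model and the stated sample-size thresholds in standard notation; the feature map φ and weight functions ψ_k are conventions from the linear-MDP literature and are not defined in this record itself.

\[
  P(s' \mid s, a) \;=\; \phi(s,a)^{\top}\psi(s') \;=\; \sum_{k=1}^{K} \phi_k(s,a)\,\psi_k(s'),
  \qquad \phi(s,a) \in \mathbb{R}^{K},
\]
\[
  N \;\gtrsim\; \frac{K}{(1-\gamma)^{3}\,\varepsilon^{2}} \ \text{(model-based)},
  \qquad
  N \;\gtrsim\; \frac{K}{(1-\gamma)^{4}\,\varepsilon^{2}} \ \text{(Q-learning)},
\]

both up to logarithmic factors, where N denotes the total number of samples drawn from the generative model.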
Keywords: leave-one-out analysis; linear transition model; model-based reinforcement learning; sample complexity; vanilla Q-learning
Year: 2021 PMID: 36168331 PMCID: PMC9512142
Source DB: PubMed Journal: Adv Neural Inf Process Syst ISSN: 1049-5258