Bo Liu, Sanfeng Chen, Shuai Li, Yongsheng Liang.
Abstract
In this paper we propose a new framework, Compressive Kernelized Reinforcement Learning (CKRL), for computing near-optimal policies in sequential decision making under uncertainty by combining non-adaptive, data-independent Random Projections with nonparametric Kernelized Least-Squares Policy Iteration (KLSPI). Random Projections are a fast, non-adaptive dimensionality-reduction technique in which high-dimensional data are projected onto a random lower-dimensional subspace via spherically random rotation and coordinate sampling. KLSPI introduces the kernel trick into the LSPI framework for Reinforcement Learning, often achieving faster convergence and providing automatic feature selection via various kernel sparsification approaches. In our approach, policies are computed in a low-dimensional subspace generated by projecting the high-dimensional features onto a set of random bases. We first show how Random Projections constitute an efficient sparsification technique and how our method often converges faster than regular LSPI, at lower computational cost. The theoretical foundation underlying this approach is a fast approximation of the Singular Value Decomposition (SVD). Finally, simulation results on benchmark MDP domains confirm gains in both computation time and performance in large feature spaces.
Keywords: Kernelized Least-Squares Policy Iteration; Markov Decision Process; Random Projections; sensor-actuator systems
Year: 2012 PMID: 22736969 PMCID: PMC3376585 DOI: 10.3390/s120302632
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
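The projection step described in the abstract above can be sketched in a few lines. The following is a minimal illustration, not the authors' code; the feature dimensions, seed, and function name are arbitrary assumptions. A Gaussian random matrix maps D-dimensional features to a d-dimensional subspace, and no training data is needed to build the basis:

```python
import numpy as np

def random_projection(features, d, seed=0):
    """Project N rows of D-dimensional features onto a random d-dimensional
    subspace (d << D) using a Gaussian random matrix.

    The 1/sqrt(d) scaling approximately preserves pairwise distances
    (Johnson-Lindenstrauss lemma), which is what allows the projection
    to be non-adaptive and data-independent.
    """
    rng = np.random.default_rng(seed)
    D = features.shape[1]
    P = rng.standard_normal((D, d)) / np.sqrt(d)  # random basis, built without data
    return features @ P

# Hypothetical example: compress 3,000-dimensional features to 150 dimensions.
X = np.random.default_rng(1).standard_normal((500, 3000))
print(random_projection(X, d=150).shape)  # (500, 150)
```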
Figure 1. Illustration of Kernelized Bellman Error Decomposition.
General Framework of KLSPI.

Inputs:
• A sample data set D = {(s_i, a_i, r_i, s'_i)}, i = 1, …, N
• A kernel function k(·, ·)
• An initial policy π_0

Repeat until the policy converges:
1. Compute the kernel (Gram) matrix K over the sampled state-action pairs
2. Compute the least-squares fixed-point matrices A and b of the Bellman equation
3. Compute the solution α = A⁻¹b, giving Q̂(s, a) = Σ_i α_i k((s, a), (s_i, a_i))
4. Compute the policy π' that is greedy with respect to Q̂
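A minimal sketch of the policy-evaluation step in this loop, assuming a Gaussian kernel and a small ridge term for numerical stability (both are illustrative choices, not the paper's exact implementation):

```python
import numpy as np

def rbf(U, V, sigma=1.0):
    """Gaussian (RBF) kernel matrix between row-wise sample sets U and V."""
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def klspi_step(SA, SA_next, r, gamma=0.9, reg=1e-3):
    """One policy-evaluation step of KLSPI.

    SA      : (N, d) state-action features of the visited pairs
    SA_next : (N, d) state-action features chosen by the current policy
              at the successor states
    Returns kernel weights alpha of Q(s,a) = sum_i alpha_i * k((s,a), (s_i,a_i)).
    """
    K = rbf(SA, SA)            # Gram matrix over sampled pairs
    K_next = rbf(SA, SA_next)  # kernel values at successor pairs
    # Least-squares fixed point of the Bellman equation:
    #   (K - gamma * K_next) alpha = r   (ridge term added for stability)
    A = K - gamma * K_next + reg * np.eye(len(r))
    return np.linalg.solve(A, r)
```

Policy improvement then evaluates Q̂(s, a) at each candidate action and acts greedily.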
Figure 2. A Successful Run of Pendulum.
Figure 3. An Illustration of Acrobot.
Figure 4. A Typical Trajectory of a Successful Swing-up of Acrobot.
Figure 5. The Value Function of Pendulum.
Acrobot: Comparison of Compression Ratio.
| Compression Ratio | Balancing Steps | With Failure? |
|---|---|---|
| 100 | 412 | Y |
| 150 | 380 | Y |
| 300 | 312 | N |
| 400 | 227 | N |
Comparison of Sparsification Techniques: ALD and Random Projections.
| ALD Threshold | Size of ALD Dictionary | Compression Ratio | Avg. Balancing Steps |
|---|---|---|---|
| 0.1 | 184 | * | 178 |
| 0.3 | 117 | * | 270 |
| 0.1 | 117 | 117 | |
| * | * | 117/3,000 | 403 (with failure) |
| * | * | 184/3,000 | 341 (with failure) |
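For contrast with the data-independent Random Projections, the ALD sparsification in the table above grows its dictionary adaptively: a sample is admitted only if its kernel feature cannot be approximated by the current dictionary within a threshold, so larger thresholds yield smaller dictionaries. A minimal sketch, where the kernel and threshold are illustrative assumptions:

```python
import numpy as np

def ald_dictionary(X, kernel, threshold=0.1):
    """Approximate Linear Dependence (ALD) sparsification.

    A sample x joins the dictionary only if the residual of projecting
    its kernel feature onto the span of the current dictionary exceeds
    `threshold`; larger thresholds therefore give smaller dictionaries.
    """
    dic = [X[0]]
    for x in X[1:]:
        D = np.array(dic)
        K = kernel(D, D)                      # Gram matrix of the dictionary
        k_vec = kernel(D, x[None, :])[:, 0]   # kernel values against x
        c = np.linalg.solve(K + 1e-8 * np.eye(len(dic)), k_vec)
        delta = kernel(x[None, :], x[None, :])[0, 0] - k_vec @ c
        if delta > threshold:                 # x not well approximated: keep it
            dic.append(x)
    return np.array(dic)

# Illustrative use with a Gaussian kernel on random 4-D samples.
rbf = lambda U, V: np.exp(-((U[:, None, :] - V[None, :, :]) ** 2).sum(-1) / 2.0)
X = np.random.default_rng(0).standard_normal((300, 4))
print(len(ald_dictionary(X, rbf, threshold=0.1)))
```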
Figure 6. Learning Policy using GGK in Two Rooms Domain.
KLSPI with Random Projections.

Inputs:
• A sample data set D = {(s_i, a_i, r_i, s'_i)}, i = 1, …, N
• A kernel function k(·, ·)
• A compression dimension d
• An initial policy π_0

1. Use ALD to generate a compact dictionary of representative samples
Repeat until the policy converges:
2. Compute the kernel matrix K with respect to the dictionary
3. Construct the compression matrix Φ (a d × N matrix of random bases)
4. Compute the compressed fixed-point matrices Ã and b̃ in the projected subspace
5. Compute the solution w = Ã⁻¹b̃
6. Compute the policy π' that is greedy with respect to the compressed Q̂
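Putting the two pieces together, one compressed evaluation step might look as follows. This is a sketch under stated assumptions (the projection scaling, ridge term, and the exact point at which Φ is applied are illustrative, not taken from the paper): the random matrix Φ turns the N × N kernel solve into a d × d one.

```python
import numpy as np

def compressed_klspi_step(K, K_next, r, d, gamma=0.9, reg=1e-3, seed=0):
    """One KLSPI policy-evaluation step in a randomly compressed subspace.

    K      : (N, N) kernel matrix over sampled state-action pairs
    K_next : (N, N) kernel values at the successor pairs under the policy
    d      : compression dimension (d << N)
    Returns weights w such that Q(s, a) ~= w @ (Phi @ k_vec(s, a)).
    """
    N = K.shape[0]
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((d, N)) / np.sqrt(d)  # random compression matrix

    Psi = K @ Phi.T            # (N, d) compressed features of visited pairs
    Psi_next = K_next @ Phi.T  # (N, d) compressed features of successor pairs

    # d x d least-squares fixed point instead of the N x N kernel solve:
    A = Psi.T @ (Psi - gamma * Psi_next) + reg * np.eye(d)
    b = Psi.T @ r
    return np.linalg.solve(A, b)
```

This is where the computational gain comes from: solving a d × d system costs O(d³) rather than O(N³), which matters when N is in the thousands (cf. the 117/3,000 and 184/3,000 compression ratios in the table above).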