
Kernel Recursive Least-Squares Temporal Difference Algorithms with Sparsification and Regularization.

Chunyuan Zhang, Qingxin Zhu, Xinzheng Niu

Abstract

By combining with sparse kernel methods, least-squares temporal difference (LSTD) algorithms can construct the feature dictionary automatically and achieve better generalization. However, previous kernel-based LSTD algorithms do not consider regularization, and their sparsification processes are batch or offline, which hinders their widespread application in online learning problems. In this paper, we combine the following five techniques and propose two novel kernel recursive LSTD algorithms: (i) online sparsification, which can cope with unknown state regions and be used for online learning; (ii) L2 and L1 regularization, which can avoid overfitting and eliminate the influence of noise; (iii) recursive least squares, which can eliminate matrix-inversion operations and reduce computational complexity; (iv) a sliding-window approach, which can avoid caching all history samples and reduce the computational cost; and (v) fixed-point subiteration and online pruning, which make L1 regularization easy to implement. Finally, simulation results on two 50-state chain problems demonstrate the effectiveness of our algorithms.


Year:  2016        PMID: 27436996      PMCID: PMC4942627          DOI: 10.1155/2016/2305854

Source DB:  PubMed          Journal:  Comput Intell Neurosci


1. Introduction

Least-squares temporal difference (LSTD) learning may be the most popular approach for policy evaluation in reinforcement learning (RL) [1, 2]. Compared with standard temporal difference (TD) learning, LSTD uses samples more efficiently and eliminates all step-size parameters. However, LSTD also has some drawbacks. First, LSTD requires a matrix-inversion operation at each time step. To reduce computational complexity, Bradtke and Barto proposed a recursive LSTD (RLSTD) algorithm [1], and Xu et al. proposed an RLSTD(λ) algorithm [3]. But these two algorithms still require many features, especially for highly nonlinear RL problems, since the RLS approximator assumes a linear model [4]. Second, when the number of features is larger than the number of training samples, LSTD is prone to overfitting. To overcome this problem, Kolter and Ng proposed an L1-regularized LSTD algorithm called LARS-TD for feature selection [5], but it is only applicable to batch learning and its implementation is complicated. On this basis, Chen et al. proposed an L2-regularized RLSTD algorithm [6]. In contrast with LARS-TD, it has an analytical solution, but it cannot obtain a sparse solution. Third, LSTD requires users to design the feature vector manually, and poor design choices can result in estimates that diverge from the optimal value function [7]. In the last two decades, kernel methods have been intensively and extensively studied in supervised and unsupervised learning [8]. The basic idea behind kernel methods can be summarized as follows: by a nonlinear transform, the original input data are mapped into a high-dimensional feature space, and an inner product in this space can be interpreted as a Mercer kernel function. Thus, as long as a linear algorithm can be formulated in terms of inner products, there is no need to perform computations in the high-dimensional feature space [9].
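The kernel trick described above can be made concrete with a small sketch. The Gaussian kernel width below mirrors the one used later in the paper's simulations (denominator 16), but the sample states and helper names are illustrative assumptions, not from the paper:

```python
import numpy as np

def rbf_kernel(x, y, sigma2=16.0):
    """Gaussian (RBF) Mercer kernel: an inner product in an implicit
    high-dimensional feature space, evaluated without ever forming
    that space explicitly."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.exp(-np.sum((x - y) ** 2) / sigma2)

# The Gram (kernel) matrix over a few 1-D states stands in for the
# explicit feature-space inner products.
states = np.array([[1.0], [2.0], [5.0]])
K = np.array([[rbf_kernel(s, t) for t in states] for s in states])
```

Any linear algorithm expressible through inner products can operate on such a Gram matrix directly; this is the property the kernelized LSTD derivations exploit.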
Recently, there has also been much research on kernelizing least-squares algorithms [9-13]. Here, we only review the works most related to our proposed algorithms. One typical work is the sparse kernel recursive least-squares (SKRLS) algorithm with the approximate linear dependency (ALD) criterion [11]. Compared with traditional RLS algorithms, it not only has good nonlinear approximation ability but also can construct the feature dictionary automatically. Similarly, Chen et al. proposed an L2-regularized SKRLS algorithm with online vector quantization [12]. Besides retaining the good properties of SKRLS-ALD, it can avoid overfitting. In addition, Chen et al. proposed an L1-regularized SKRLS algorithm with the fixed-point subiteration [13], which can yield a much sparser dictionary. Intuitively, we can also bring the benefits of kernel machine learning to LSTD algorithms. In fact, kernel-based RL algorithms have become more and more popular in recent years [14-22], and several works have kernelized LSTD algorithms. In an earlier paper, Xu proposed a sparse kernel-based LSTD(λ) (SKLSTD(λ)) algorithm with the ALD criterion [19]. Although this algorithm can avoid selecting features manually, it is only applicable to batch learning and its derivation is complicated. After that, Xu et al. proposed an incremental version of the SKLSTD(λ) algorithm for policy iteration [20], but this algorithm still requires a matrix-inversion operation at each time step. Moreover, its feature dictionary must be constructed offline, so the algorithm approximates the value function correctly only in the area of the state space covered by the training samples. Recently, Jakab and Csató proposed a sparse kernel RLSTD (SKRLSTD) algorithm using a proximity-graph sparsification method [21]. Unfortunately, its sparsification process is also offline.
In addition, none of these algorithms considers regularization, whereas many real problems exhibit noise, and the high expressiveness of the kernel matrix can result in overfitting [22]. In this paper, we propose two online SKRLSTD algorithms with L2 and L1 regularization, called OSKRLSTD-L2 and OSKRLSTD-L1, respectively. Compared with the derivation of SKLSTD(λ), our derivation uses the Bellman operator along with the projection operator and is thus simpler. To cope with unknown state-space regions and avoid overfitting, our algorithms use online sparsification and regularization techniques. Besides, to reduce computational complexity and avoid caching all history samples, our algorithms also use recursive least squares and the sliding-window technique. Moreover, different from LARS-TD, OSKRLSTD-L1 uses the fixed-point subiteration and online pruning to find the fixed point. These techniques make our algorithms more suitable for online RL problems with a large or continuous state space. The rest of this paper is organized as follows. In Section 2, we present preliminaries and review the LSTD algorithm. Section 3 contains the main contribution of this paper: we derive the OSKRLSTD-L2 and OSKRLSTD-L1 algorithms in detail. In Section 4, we demonstrate the effectiveness of our algorithms on two 50-state chain problems. Finally, we conclude the paper in Section 5.

2. Background

In this section, we introduce the basic definitions and notations, which will be used throughout the paper without any further mention. We also review the LSTD algorithm, which is needed to establish our algorithms described in Section 3.

2.1. Preliminaries

In RL and dynamic programming (DP), an underlying sequential decision-making problem is often modeled as a Markov decision process (MDP). An MDP can be defined as a tuple ℳ = 〈𝒮, 𝒜, P, r, γ, d〉 [5], where 𝒮 is a set of states, 𝒜 is a set of actions, P : 𝒮 × 𝒜 × 𝒮 → [0,1] is a state transition probability function, where P(s, a, s′) denotes the probability of transitioning to state s′ when taking action a in state s, r : 𝒮 × 𝒜 × 𝒮 → ℝ is a reward function, γ ∈ [0,1] is the discount factor, and d is an initial state distribution. For simplicity of presentation, we assume that 𝒮 and 𝒜 are finite. Given an MDP ℳ and a policy π : 𝒮 → 𝒜, the sequence s_1, r_1, s_2, r_2,… is a Markov reward process ℛ = 〈𝒮, P^π, R^π, γ, d〉, where P^π(s, s′) = ∑_a π(a∣s)P(s, a, s′) and R^π(s) = ∑_a π(a∣s)∑_{s′} P(s, a, s′)r(s, a, s′). RL and DP often use the state-value function V^π(s) to evaluate how good the policy π is for the agent in state s. For an MDP, V^π(s) can be defined as V^π(s) = E_π[∑_{t=0}^∞ γ^t r_t ∣ s_0 = s], which must obey the Bellman equation [23]

  V^π(s) = R^π(s) + γ ∑_{s′} P^π(s, s′) V^π(s′), (1)

or be expressed in vector form,

  V^π = R^π + γP^π V^π. (2)

If P^π and R^π are known, V^π can be solved analytically; that is,

  V^π = (I − γP^π)^{-1} R^π, (3)

where I is the |𝒮| × |𝒮| identity matrix. However, different from the case in DP, P^π and R^π are unknown in RL. The agent has to estimate V^π by exploring the environment. Furthermore, many real problems have a large or continuous state space, which makes V^π(s) hard to express explicitly. To overcome this problem, we often resort to linear function approximation; that is,

  V̂^π(s) = ϕ(s)^T w,  or equivalently  V̂^π = Φw, (4)

where w ∈ ℝ^m is a parameter vector, ϕ(s) ∈ ℝ^m is the feature vector of state s, and Φ = [ϕ(s_1),…, ϕ(s_{|𝒮|})]^T is a |𝒮| × m feature matrix. Unfortunately, when approximating V^π in this manner, there is usually no way to satisfy the Bellman equation exactly, because R^π + γP^π Φw may lie outside the span of Φ [5].
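As a quick illustration of the analytic solution V^π = (I − γP^π)^{-1}R^π, the following sketch solves a made-up two-state Markov reward process (the transition matrix and rewards are invented for the example, not taken from the paper):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])   # P[s, s'] under the fixed policy (illustrative)
R = np.array([1.0, 0.0])     # expected one-step reward in each state

# Analytic solution: V = (I - gamma * P)^{-1} R
V = np.linalg.solve(np.eye(2) - gamma * P, R)
```

The result necessarily satisfies the vector Bellman equation V = R + γPV, which is what RL must instead approximate from samples when P and R are unknown.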

2.2. LSTD Algorithm

The LSTD algorithm presents an efficient way to find w such that V̂^π = Φw "approximately" satisfies the Bellman equation [5]. By solving the least-squares problem min_u ‖Φu − (R^π + γP^π Φw)‖², we can find the closest approximation Φu in the span of Φ to replace R^π + γP^π Φw. Then, from (2) and (4), we can use w = u for approximating V^π. That means we can find w by solving the fixed-point equation

  Φ^T D(Φw − (R^π + γP^π Φw)) = 0, (5)

where D is a nonnegative diagonal matrix indicating a distribution over states. Nevertheless, since P^π and R^π are unknown and since Φ is too large to form anyway in a large or continuous state space, we cannot solve (5) exactly. Instead, given a trajectory 𝒯_t = {(s_i, s_i′, r_i) ∣ i = 1,…, t} following policy π, LSTD uses the sampled quantities [ϕ(s_1),…, ϕ(s_t)]^T, [ϕ(s_1′),…, ϕ(s_t′)]^T, and [r_1,…, r_t]^T to replace Φ, P^π Φ, and R^π, respectively. Then, (5) can be approximately rewritten as

  ∑_{i=1}^t ϕ(s_i)(ϕ(s_i) − γϕ(s_i′))^T w = ∑_{i=1}^t ϕ(s_i) r_i. (6)

Let

  A_t = ∑_{i=1}^t ϕ(s_i)(ϕ(s_i) − γϕ(s_i′))^T,  b_t = ∑_{i=1}^t ϕ(s_i) r_i; (7)

we have A_t w = b_t. Thus, the fixed point can be found by

  w_t = A_t^{-1} b_t. (8)
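A minimal batch LSTD sketch following (6)-(8): accumulate A_t and b_t from sampled transitions and solve for w. The one-hot feature map and the toy two-state cycle are illustrative assumptions, not from the paper:

```python
import numpy as np

def lstd(transitions, phi, gamma, m):
    """Batch LSTD: A_t = sum phi(s)(phi(s) - gamma*phi(s'))^T,
    b_t = sum phi(s)*r, then solve A_t w = b_t."""
    A = np.zeros((m, m))
    b = np.zeros(m)
    for s, s_next, r in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)

# Toy check: a deterministic two-state cycle with reward 1 in state 1
# and gamma = 0.5, using one-hot features.
phi = lambda s: np.eye(2)[s]
w = lstd([(0, 1, 0.0), (1, 0, 1.0)], phi, 0.5, 2)
```

Because the one-hot features span the whole state space here, LSTD recovers the exact values V(0) = 2/3 and V(1) = 4/3.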

3. Regularized OSKRLSTD Algorithms

To overcome the weaknesses of the previous kernel-based LSTD algorithms, we propose two regularized OSKRLSTD algorithms in this section.

3.1. OSKRLSTD-L2 Algorithm

Now, we use L2 regularization and online sparsification to derive the first OSKRLSTD algorithm, which is called OSKRLSTD-L2. First, we use the kernel trick to kernelize (6). Suppose the feature dictionary is 𝒟_t = {d_j ∣ d_j ∈ 𝒮, j = 1,…, n_t}, and let Φ_t = [ϕ(d_1),…, ϕ(d_{n_t})] denote the corresponding feature matrix. By the Representer Theorem [24], w_t and u_t can be expressed as follows:

  w_t = Φ_t α_t,  u_t = Φ_t β_t, (9)

where α_t = [α_1,…, α_{n_t}]^T and β_t = [β_1,…, β_{n_t}]^T are the coefficient vectors of w_t and u_t, respectively. Then, from (6), we have

  Φ_t^T ∑_{i=1}^t ϕ(s_i)(ϕ(s_i) − γϕ(s_i′))^T Φ_t α_t = Φ_t^T ∑_{i=1}^t ϕ(s_i) r_i. (10)

By the Mercer Theorem [24], the inner product of two feature vectors can be calculated by k(s_i, s_j) = ϕ(s_i)^T ϕ(s_j). Thus, we can define k_t(·) = Φ_t^T ϕ(·) = [k(·, d_1),…, k(·, d_{n_t})]^T and Δk_t(s_i, s_i′) = k_t(s_i) − γk_t(s_i′). On this basis, (10) can be rewritten as

  ∑_{i=1}^t k_t(s_i) Δk_t(s_i, s_i′)^T α_t = ∑_{i=1}^t k_t(s_i) r_i. (11)

Second, we try to derive the L2-regularized solution of (11). Add an L2-norm penalty into the least-squares problem behind (11); that is,

  g(β) = (1/2) ∑_{i=1}^t (k_t(s_i)^T β − r_i − γ k_t(s_i′)^T α_t)² + (η/2)‖β‖²₂, (12)

where η ∈ [0, ∞) is a regularization parameter. Let ∇g(β) = 0; we have

  ∑_{i=1}^t k_t(s_i)(k_t(s_i)^T β_t − r_i − γ k_t(s_i′)^T α_t) + ηβ_t = 0. (13)

Since w_t = u_t, we easily have α_t = β_t from (9). Then, the above equation can be rewritten as

  (∑_{i=1}^t k_t(s_i) Δk_t(s_i, s_i′)^T + ηI_{n_t}) α_t = ∑_{i=1}^t k_t(s_i) r_i, (14)

where I_{n_t} is the n_t × n_t identity matrix. Thus, α_t can be analytically solved as

  α_t = A_t^{-1} b_t, (15)

where A_t ∈ ℝ^{n_t × n_t} and b_t ∈ ℝ^{n_t} denote

  A_t = ∑_{i=1}^t k_t(s_i) Δk_t(s_i, s_i′)^T + ηI_{n_t},  b_t = ∑_{i=1}^t k_t(s_i) r_i. (16)

Third, we derive the recursive formulas of A_t^{-1} and α_t. Under online sparsification, there are two cases: (1) 𝒟_t = 𝒟_{t−1}, n_t = n_{t−1}, k_t(·) = k_{t−1}(·), Δk_t(s_i, s_i′) = Δk_{t−1}(s_i, s_i′), and I_{n_t} = I_{n_{t−1}}; (2) 𝒟_t = 𝒟_{t−1} ∪ {s_t}, n_t = n_{t−1} + 1, k_t(·) = [k_{t−1}(·)^T, k(·, s_t)]^T, Δk_t(s_i, s_i′) = [Δk_{t−1}(s_i, s_i′)^T, Δk_{n_t}(s_i, s_i′)]^T, where Δk_{n_t}(s_i, s_i′) = k(s_i, s_t) − γk(s_i′, s_t), and I_{n_t} is expanded as

  I_{n_t} = [ I_{n_{t−1}}  0 ; 0^T  1 ], (17)

where 0 is the n_{t−1}-dimensional zero vector. For the first case, (16) can be rewritten as follows:

  A_t = A_{t−1} + k_t(s_t) Δk_t(s_t, s_t′)^T, (18)
  b_t = b_{t−1} + k_t(s_t) r_t. (19)

Applying the matrix-inversion lemma [25] to A_t^{-1}, we get

  A_t^{-1} = A_{t−1}^{-1} − (A_{t−1}^{-1} k_t(s_t) Δk_t(s_t, s_t′)^T A_{t−1}^{-1}) / (1 + Δk_t(s_t, s_t′)^T A_{t−1}^{-1} k_t(s_t)). (20)

Thus, plugging (19) and (20) into (15), we obtain

  α_t = α_{t−1} + A_t^{-1} k_t(s_t)(r_t − Δk_t(s_t, s_t′)^T α_{t−1}). (21)

For the second case, (16) can be rewritten as follows:

  A_t = [ Ā_t  h_t ; g_t^T  p_t ],  b_t = [ b̄_t ; q_t ], (22)

where Ā_t and b̄_t are the same as the updated A_t and b_t when the feature dictionary keeps unchanged, h_t = ∑_{i=1}^t Δk_{n_t}(s_i, s_i′) k_{t−1}(s_i), g_t = ∑_{i=1}^t k(s_i, s_t) Δk_{t−1}(s_i, s_i′), p_t = ∑_{i=1}^t k(s_i, s_t) Δk_{n_t}(s_i, s_i′) + η, and q_t = ∑_{i=1}^t k(s_i, s_t) r_i.
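The first-case update (20) is an application of the matrix-inversion (Sherman-Morrison) lemma to a rank-one change of A_t. A minimal sketch with hypothetical names, not from the paper:

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, via
    A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)
```

In the algorithm's terms, u plays the role of k_t(s_t) and v that of Δk_t(s_t, s_t′), so the inverse is maintained without any explicit matrix inversion.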
However, computing h_t, g_t, p_t, and q_t requires caching all history samples, and the computational cost becomes more and more expensive as t increases. Inspired by the work of Van Vaerenbergh et al. [26], we introduce a sliding window ℋ_t to deal with these problems. Let ℋ_t = {(s_j, s_j′, r_j) ∣ j = max(1, t − M + 1),…, t}, where M is the window size. We only use the samples in ℋ_t to evaluate h_t, g_t, p_t, and q_t; that is,

  h_t = ∑_{j∈ℋ_t} Δk_{n_t}(s_j, s_j′) k_{t−1}(s_j),  g_t = ∑_{j∈ℋ_t} k(s_j, s_t) Δk_{t−1}(s_j, s_j′),
  p_t = ∑_{j∈ℋ_t} k(s_j, s_t) Δk_{n_t}(s_j, s_j′) + η,  q_t = ∑_{j∈ℋ_t} k(s_j, s_t) r_j. (23)

Then, similar to the first case, A_t^{-1} and α_t can be derived as follows:

  A_t^{-1} = [ Ā_t^{-1} + Ā_t^{-1} h_t ρ_t^{-1} g_t^T Ā_t^{-1}   −Ā_t^{-1} h_t ρ_t^{-1} ; −ρ_t^{-1} g_t^T Ā_t^{-1}   ρ_t^{-1} ], (24)
  α_t = [ ᾱ_t − Ā_t^{-1} h_t ρ_t^{-1}(q_t − g_t^T ᾱ_t) ; ρ_t^{-1}(q_t − g_t^T ᾱ_t) ], (25)

where ρ_t = p_t − g_t^T Ā_t^{-1} h_t and ᾱ_t is the same as the updated α_t when the dictionary keeps unchanged. Finally, we summarize the whole algorithm in Algorithm 1.
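The sliding window ℋ_t can be kept with a fixed-capacity buffer; a deque with maxlen discards the oldest transition automatically. The placeholder transitions below are illustrative:

```python
from collections import deque

M = 5                          # window size, as in the paper's simulations
window = deque(maxlen=M)       # holds the last M transitions (s, s', r)
for t in range(8):
    window.append((t, t + 1, 0.0))
# Only transitions t = 3..7 remain; older ones were discarded.
```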
Algorithm 1

OSKRLSTD-L2.

Remark 1 .

Here, we do not restrict the OSKRLSTD-L2 algorithm to a specific online sparsification method. That means it can be combined with many popular sparsification methods, such as the novelty criterion (NC) [27] and the ALD criterion.
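For instance, a distance-based novelty criterion admits a state into the dictionary only if it is far from every stored element; the threshold below mirrors the min_j ‖s_t − d_j‖ > 2 condition used later in the paper's simulations, while the sample states are made up:

```python
import numpy as np

def is_novel(s, dictionary, threshold=2.0):
    """Novelty criterion (NC) check: True if s is farther than
    `threshold` from every element already in the dictionary."""
    if not dictionary:
        return True
    return min(np.linalg.norm(np.asarray(s, dtype=float) -
                              np.asarray(d, dtype=float))
               for d in dictionary) > threshold

dictionary = []
for s in [1.0, 2.0, 6.0, 7.0]:
    if is_novel(s, dictionary):
        dictionary.append(s)
# dictionary is now [1.0, 6.0]; 2.0 and 7.0 were too close to stored states.
```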

Remark 2 .

Although the OSKRLSTD-L2 algorithm is designed for infinite-horizon tasks, it can be modified for episodic tasks. When s_t′ is an absorbing state, it only requires setting γ = 0 temporarily for that transition and taking s_{t+1} to be the start state of the next episode.

Remark 3 .

Our simulation results show that a big sliding window cannot help improve the convergence performance of the OSKRLSTD-L2 algorithm. Thus, to save memory and reduce the computational cost, M should be set to a small integer.

3.2. OSKRLSTD-L1 Algorithm

In this subsection, we use L1 regularization and online sparsification to derive the second OSKRLSTD algorithm, which is called OSKRLSTD-L1. First, we try to derive the L1-regularized solution of (11). Add an L1-norm penalty into the least-squares problem behind (11); that is,

  g(β) = (1/2) ∑_{i=1}^t (k_t(s_i)^T β − r_i − γ k_t(s_i′)^T α_t)² + ξ‖β‖₁, (26)

where ξ ∈ [0, ∞) is a regularization parameter. However, ‖β‖₁ is not differentiable. Similar to Painter-Wakefield and Parr in [28], we resort to the subdifferential of g(β); that is,

  ∂g(β) = ∑_{i=1}^t k_t(s_i)(k_t(s_i)^T β − r_i − γ k_t(s_i′)^T α_t) + ξ sgn(β), (27)

where sgn(β) is the set-valued function defined component-wise as

  sgn(β_i) = {1} if β_i > 0;  {−1} if β_i < 0;  [−1, 1] if β_i = 0. (28)

Let 0 ∈ ∂g(β), so that

  0 ∈ ∑_{i=1}^t k_t(s_i)(k_t(s_i)^T β_t − r_i − γ k_t(s_i′)^T α_t) + ξ sgn(β_t). (29)

Since w_t = u_t, we also have α_t = β_t from (9). Then, the above equation can be rewritten as

  0 ∈ ∑_{i=1}^t k_t(s_i) Δk_t(s_i, s_i′)^T α_t − ∑_{i=1}^t k_t(s_i) r_i + ξ sgn(α_t), (30)

where sgn(α_t) has the same meaning as sgn(β). To avoid the singularity of ∑_{i=1}^t k_t(s_i) Δk_t(s_i, s_i′)^T and further reduce the complexity of the subsequent derivation, we introduce ηα_t into both sides; that is,

  0 ∈ (∑_{i=1}^t k_t(s_i) Δk_t(s_i, s_i′)^T + ηI_{n_t}) α_t − ∑_{i=1}^t k_t(s_i) r_i − ηα_t + ξ sgn(α_t), (31)

where η ∈ [0, ∞) is a regularization parameter. Obviously, the first term on the left-hand side of (31) is the same as that of (14). Thus, from (16), the above equation can be rewritten as

  0 ∈ A_t α_t − b_t − ηα_t + ξ sgn(α_t). (32)

Then, we have the following fixed-point equation:

  α_t = f(α_t), (33)

where f denotes

  f(α) = A_t^{-1}(b_t + ηα − ξ sgn(α)). (34)

Unfortunately, here, α_t cannot be solved analytically. Second, we investigate how to find the fixed point of (33). In L1-regularized LSTD algorithms [5, 29], researchers often used the LASSO method to tackle this problem. However, the LASSO method is inherently a batch method and is unsuitable for online learning. Instead, we resort to the fixed-point subiteration method introduced in [13]. We first use the sign function sign(α_t) to replace the set-valued sgn(α_t) in (33). Then, we can construct the following subiteration:

  α_t^{l+1} = A_t^{-1}(b_t + ηα_t^l − ξ sign(α_t^l)), (35)

where l ∈ ℕ+ denotes the lth subiteration and α_t^1 is initialized to A_t^{-1} b_t, since the fixed point will be close to A_t^{-1} b_t if η and ξ are small. If the subiteration number reaches a preset value N ∈ ℕ+ or ‖α_t^{l+1} − α_t^l‖ is less than or equal to a preset threshold ε ∈ ℝ+, the subiteration will stop. From (32) and (28), if |(b_t + ηα_t − A_t α_t)_i| < ξ, then α_{t,i} should be 0. Obviously, the replacement of sgn(α_t) makes α_t lose the ability to select features.
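The subiteration for solving (33), with its two stopping rules, can be sketched as follows; the helper name and toy inputs are hypothetical:

```python
import numpy as np

def fixed_point_subiteration(A_inv, b, eta, xi, N=10, eps=0.1):
    """Iterate alpha <- A^{-1}(b + eta*alpha - xi*sign(alpha)),
    starting from the L2-style solution A^{-1} b, until the change
    is at most eps or N subiterations have run."""
    alpha = A_inv @ b
    for _ in range(N):
        alpha_new = A_inv @ (b + eta * alpha - xi * np.sign(alpha))
        done = np.linalg.norm(alpha_new - alpha) <= eps
        alpha = alpha_new
        if done:
            break
    return alpha
```

With η = ξ = 0 the iteration reduces to the L2-style solution A^{-1}b, which is also its initialization.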
To remedy this situation, after the whole subiteration, we remove the weakly dependent elements from 𝒟_t according to the magnitude of α_t; that is,

  𝒟_t ← Ψ(𝒟_t), (36)

where Ψ(·) denotes the operation to remove the elements indexed by the set ℐ_t, which is determined by

  ℐ_t = {i ∣ |α_{t,i}| < v, 1 ≤ i < n_t}, (37)

where v ∈ ℝ+ is a preset threshold. Note that we do not remove the last element d_{n_t} of 𝒟_t, since |α_{t,n_t}| is probably very small, especially when d_{n_t} has just been added to 𝒟_t. Similarly, we perform Ψ(α_t) and Ψ(b_t) to remove the weakly dependent coefficients. From (16), A_t^{-1} also requires removing some rows and columns. Unfortunately, we cannot use the method in [30] to do this as Chen et al. did in [13], since A_t^{-1} is not a symmetric matrix. Considering that b_t will have the corresponding elements removed if 𝒟_t is pruned, we directly perform Ψ(A_t^{-1}) to remove the rows and columns indexed by ℐ_t. Although this method may bring some bias into A_t^{-1}, our simulation results show that it is feasible and effective. The whole fixed-point subiteration and online pruning procedure is summarized in Algorithm 2.
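The pruning operation Ψ(·) can be sketched directly: drop the indices whose coefficient magnitude falls below v, always keep the last (newest) element, and remove the matching rows and columns of A_t^{-1}. The names are illustrative:

```python
import numpy as np

def prune(alpha, A_inv, v):
    """Remove weakly dependent indices (|alpha_i| < v), except the
    last element, from alpha and from the rows/columns of A_inv."""
    n = len(alpha)
    keep = [i for i in range(n) if abs(alpha[i]) >= v or i == n - 1]
    return alpha[keep], A_inv[np.ix_(keep, keep)], keep
```

`np.ix_` selects the surviving rows and columns simultaneously, which is exactly the Ψ(A_t^{-1}) operation described above.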
Algorithm 2

Fixed-point subiteration and online pruning.

Remark 4 .

Our simulation results show that Algorithm 2 converges within a few iterations. Thus, Algorithm 2 does not become the computational bottleneck of the OSKRLSTD-L1 algorithm, and the maximum subiteration number N can be set to a small positive integer.

Third, we derive the recursive formulas of A_t^{-1} and α̃_t. Although the dictionary can be pruned by using Algorithm 2, it still risks growing rapidly if new samples are allowed to be added continually. Thus, the conventional sparsification method also needs to be considered here. Similar to Section 3.1, there are two cases under online sparsification. Since A_t and α̃_t have the same definitions as A_t and α_t in the OSKRLSTD-L2 algorithm, we can directly use (20) and (24) for updating A_t^{-1} and rewrite (21) and (25) for updating α̃_t. If s_t does not satisfy the sparsification condition, α̃_t will be updated by

  α̃_t = α_{t−1} + A_t^{-1} k_t(s_t)(r_t − Δk_t(s_t, s_t′)^T α_{t−1}). (38)

Otherwise, α̃_t will be updated by

  α̃_t = [ ᾱ_t − Ā_t^{-1} h_t ρ_t^{-1}(q_t − g_t^T ᾱ_t) ; ρ_t^{-1}(q_t − g_t^T ᾱ_t) ], (39)

where h_t, g_t, p_t, and q_t are also calculated by (23) and ρ_t = p_t − g_t^T Ā_t^{-1} h_t. Since 𝒟_t, A_t^{-1}, and α̃_t will be pruned by Algorithm 2 after the update, it is important to note that Ā_t^{-1} and ᾱ_t in (39) denote A_t^{-1} and α̃_t updated on 𝒟_{t−1} but not yet pruned by Ψ(·). Likewise, when (24) is used here, Ā_t^{-1} has the same meaning. Finally, we summarize the whole algorithm in Algorithm 3. For episodic tasks, the modification is the same as in Remark 2. In addition, similar to Remark 3, the sliding-window size M should also be set to a small integer.
Algorithm 3

OSKRLSTD-L1.

Remark 5 .

By pruning the weakly dependent features, the OSKRLSTD-L1 algorithm can yield a much sparser solution than the OSKRLSTD-L2 algorithm.

4. Simulations

In this section, we use a nonnoise chain and a noise chain [2, 20, 31] to demonstrate the effectiveness of our proposed algorithms. For comparison purposes, the RLSTD [1] and SKRLSTD [21] algorithms are also tested in the simulations. To analyze the effect of regularization and online pruning on the performance of our algorithms, the OSKRLSTD-L2 algorithm with η = 0 and the OSKRLSTD-L1 algorithm with v = 0 (called OSKRLSTD-0 and OSKRLSTD-L1u, resp.) are tested here, too. In addition, the effect of the sliding-window size on the performance of our algorithms and OSKRLSTD-L1u is evaluated as well.

4.1. Simulation Settings

As shown in Figure 1, in both chain problems, each chain consists of 50 states, which are numbered from 1 to 50. For each state, there are two actions available, that is, “left” (L) and “right” (R). Each action succeeds with probability 0.9, changing the state in the intended direction, and fails with probability 0.1, changing the state in the opposite direction. The two boundaries of each chain are dead-ends, and the discount factor γ of each chain is set to 0.9. For the nonnoise chain, the reward is 1 only in states 10 and 41, whereas, for the noise chain, the reward is corrupted by an additive Gaussian noise 0.3𝒩(0,1). Due to the symmetry, the optimal policy for both chains is to go right in states 1–9 and 26–41 and left in states 10–25 and 42–50. Here, we use it as the policy π to be evaluated. Note that the state transition probabilities are available only for solving the true state-value functions V , and they are assumed to be unknown for all algorithms compared here.
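Under one plausible reading of this setup (reward 1 received in states 10 and 41, and dead-end boundaries clipping failed moves), the chain and its true values can be sketched as follows. The indexing is 0-based, and the reward placement is an assumption made for the example, not a detail confirmed by the paper:

```python
import numpy as np

n, gamma = 50, 0.9
# Evaluated policy (1-indexed): right in states 1-9 and 26-41, left elsewhere.
policy = ['R' if (1 <= s + 1 <= 9) or (26 <= s + 1 <= 41) else 'L'
          for s in range(n)]

# Each action moves one step in the intended direction w.p. 0.9 and the
# opposite direction w.p. 0.1; moves past a boundary are clipped (dead-ends).
P = np.zeros((n, n))
for s in range(n):
    step = 1 if policy[s] == 'R' else -1
    fwd = min(max(s + step, 0), n - 1)
    back = min(max(s - step, 0), n - 1)
    P[s, fwd] += 0.9
    P[s, back] += 0.1

R = np.zeros(n)
R[[9, 40]] = 1.0                      # states 10 and 41, 0-indexed

# True values via the analytic solution V = (I - gamma*P)^{-1} R.
V_true = np.linalg.solve(np.eye(n) - gamma * P, R)
```

The resulting value function peaks around the two reward states, which is the shape the approximations in Figure 3 should recover.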
Figure 1

The 50-state chain problem.

In the implementations of all tested algorithms for both chain problems, the settings are summarized as follows: (i) For all OSKRLSTD algorithms, the Mercer kernel is defined as k(x, y) = exp(−‖x − y‖²/16), the sparsification condition is defined as min_j ‖s_t − d_j‖ > 2, and the sliding-window size M is set to 5. Besides, for the OSKRLSTD-L1 algorithm, the regularization parameters η and ξ are set to 0.8 and 0.3, respectively, the maximum subiteration number N is set to 10, the precision threshold ε is set to 0.1, and the pruning threshold v is set to 0.4; for the OSKRLSTD-L1u algorithm, η, ξ, and N are the same as those in the OSKRLSTD-L1 algorithm; for the OSKRLSTD-L2 algorithm, η is set to 1. (ii) For the SKRLSTD algorithm, the Mercer kernel and the sparsification condition are the same as those in each OSKRLSTD algorithm. (iii) For the RLSTD algorithm, the feature vector ϕ(s) consists of 19 Gaussian radial basis functions (GRBFs) plus a constant term 1, resulting in a total of 20 basis functions. The GRBF has the same definition as the Mercer kernel used in each OSKRLSTD algorithm, and the centers of the GRBFs are uniformly distributed over [1, 50]. In addition, the variance matrix C_0 of RLSTD is initialized to 0.4I, where I is the 20 × 20 identity matrix. (iv) In the simulations, each algorithm performs 50 runs, each run includes 100 episodes, and each episode is truncated after 100 time steps. In particular, the SKRLSTD algorithm requires an extra run for offline sparsification before each regular run.
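The RLSTD feature vector from setting (iii) can be sketched directly: 19 Gaussian RBFs with the same width as the Mercer kernel, preceded by a constant bias term. Placing the centers with linspace (so they include both endpoints of [1, 50]) is an assumption about "uniformly distributed":

```python
import numpy as np

centers = np.linspace(1, 50, 19)     # 19 GRBF centers over [1, 50]

def phi(s):
    """20-dimensional feature vector: constant 1, then 19 GRBFs
    matching the kernel k(x, y) = exp(-||x - y||^2 / 16)."""
    rbf = np.exp(-(s - centers) ** 2 / 16.0)
    return np.concatenate(([1.0], rbf))
```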

4.2. Simulation Results

We first report the comparison results of all tested algorithms with the simulation settings described in Section 4.1. Their learning curves are shown in Figure 2. At each episode, the root mean square error (RMSE) of each algorithm is calculated by RMSE = √((1/(50 × 50)) ∑_{j=1}^{50} ∑_{s=1}^{50} (V^π(s) − V̂_j^π(s))²), where V^π(s) is solved by (1) and V̂_j^π(s) is the approximate value of the jth run. From Figure 2, we can observe the following: (i) OSKRLSTD-L2 and OSKRLSTD-L1 obtain performance similar to RLSTD and converge much faster than SKRLSTD. (ii) Without regularization, the performance of OSKRLSTD-0 becomes very poor, especially in the noise chain. In contrast, OSKRLSTD-L2 and OSKRLSTD-L1 still perform well. (iii) The performance of OSKRLSTD-L1u is only slightly better than that of OSKRLSTD-L1, which indicates that online pruning has little effect on the performance. Figure 3 illustrates V̂^π(s) approximated by all tested algorithms at the final episode. Clearly, OSKRLSTD-0 has lost the ability to approximate V^π(s) of the noise chain. Figure 4 shows the dictionary growth curves of all tested algorithms. Compared with RLSTD and SKRLSTD, all OSKRLSTD algorithms can construct the dictionary automatically, and OSKRLSTD-L1 yields a much sparser dictionary. Figure 5 shows the average subiterations per time step in OSKRLSTD-L1 and OSKRLSTD-L1u. As episodes increase, the subiterations decline gradually. In addition, online pruning can reduce the subiterations significantly. Even in the noise chain, the subiterations are small. Finally, the main simulation results of all tested algorithms at the final episode are summarized in Table 1.
Figure 2

Learning curves of all tested algorithms.

Figure 3

V̂^π(s) approximated by all tested algorithms at the final episode.

Figure 4

Dictionary growth curves of all tested algorithms.

Figure 5

Average subiterations in the OSKRLSTD-L1 and OSKRLSTD-L1u algorithms.

Table 1

Main simulation results on both chains at the final episode.

Algorithm      | Nonnoise chain                          | Noise chain
               | RMSE         Dict. size    Subiter.     | RMSE           Dict. size    Subiter.
RLSTD          | 0.47 ± 0.03  20            —            | 0.50 ± 0.04    20            —
SKRLSTD        | 0.47 ± 0.05  15.36 ± 0.78  —            | 0.49 ± 0.06    15.32 ± 0.71  —
OSKRLSTD-L2    | 0.45 ± 0.05  15.30 ± 0.81  —            | 0.47 ± 0.04    15.32 ± 0.84  —
OSKRLSTD-L1    | 0.49 ± 0.08  11.52 ± 1.16  1.81 ± 1.82  | 0.53 ± 0.10    12.42 ± 1.13  2.60 ± 2.56
OSKRLSTD-0     | 2.21 ± 0.05  15.25 ± 0.87  —            | 32.92 ± 68.67  15.24 ± 0.77  —
OSKRLSTD-L1u   | 0.44 ± 0.05  15.40 ± 0.76  5.08 ± 3.24  | 0.47 ± 0.05    15.28 ± 0.88  4.90 ± 3.26
Next, we evaluate the effect of the sliding-window size on our proposed algorithms and OSKRLSTD-L1u with M = 1, 5, 10,…, 45, 50. The logarithmic RMSEs of each algorithm at the final episode are illustrated in Figure 6. Note that the parameter settings of these algorithms are the same as those described in Section 4.1 except for M. From Figure 6, OSKRLSTD-L1 and OSKRLSTD-L1u obviously become worse rather than better as the window size increases, whereas OSKRLSTD-L2 adapts well to different window sizes. The reason for this result can be analyzed as follows: from the derivation of our algorithms, the influence of the window size is mainly manifested in A_t^{-1}. Since A_t^{-1} is calculated here by recursive updates instead of matrix inversion and samples are used one by one, using too many history samples together may increase the calculation error. In OSKRLSTD-L2, a moderate regularization parameter η can relieve the influence of this error. In contrast, in OSKRLSTD-L1 and OSKRLSTD-L1u, the subiteration may amplify the influence. Especially for OSKRLSTD-L1, online pruning can introduce additional error, which further worsens the convergence performance. To verify this analysis, we reset η = 0.6, ξ = 0.3, and N = 1 for OSKRLSTD-L1 and OSKRLSTD-L1u and reevaluate the effect of the window size. The new results are illustrated in Figure 7. As expected, OSKRLSTD-L1 and OSKRLSTD-L1u can also adapt to M. Nevertheless, there is still no evidence that a big window size helps improve the convergence performance of OSKRLSTD-L2 and OSKRLSTD-L1. Thus, as stated in Remark 3, M is suggested to be set to a small integer in practice.
Figure 6

Effect of the sliding-window size M on three OSKRLSTD algorithms.

Figure 7

Effect of the sliding-window size M on three OSKRLSTD algorithms with new parameters.

5. Conclusion

As an important approach for policy evaluation, LSTD algorithms can use samples more efficiently and eliminate all step-size parameters. However, they require users to design the feature vector manually and often require many features to approximate state-value functions. Recently, some works have addressed these issues by combining LSTD with sparse kernel methods. However, these works do not consider regularization, and their sparsification processes are batch or offline. In this paper, we propose two online sparse kernel recursive least-squares TD algorithms with L2 and L1 regularization, that is, OSKRLSTD-L2 and OSKRLSTD-L1. By using the Bellman operator along with the projection operator, our derivation is simpler. By combining online sparsification, L2 and L1 regularization, recursive least squares, a sliding window, and the fixed-point subiteration, our algorithms not only construct the feature dictionary online but also avoid overfitting and eliminate the influence of noise. These advantages make them more suitable for online RL problems with a large or continuous state space. In particular, compared with the OSKRLSTD-L2 algorithm, the OSKRLSTD-L1 algorithm can yield a much sparser dictionary. Finally, we illustrate the performance of our algorithms and compare them with the RLSTD and SKRLSTD algorithms through several simulations. There are also some interesting topics to be studied in future work: (i) how to select proper regularization parameters should be investigated; (ii) a more thorough simulation analysis is needed, including an extension of our algorithms to learning control problems; (iii) eligibility traces could be incorporated to further improve the performance of our algorithms; (iv) the convergence and prediction error bounds of our algorithms should be analyzed theoretically.
