Pau Vilimelis Aceituno1, Gang Yan2, Yang-Yu Liu3. 1. Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA; Max Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany. 2. School of Physics Science and Engineering, Tongji University, 200092 Shanghai, China. 3. Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA. Electronic address: yyl@channing.harvard.edu.
Abstract
As one of the most important paradigms of recurrent neural networks, the echo state network (ESN) has been applied to a wide range of fields, from robotics to medicine, finance, and language processing. A key feature of the ESN paradigm is its reservoir, a directed and weighted network of neurons that projects the input time series into a high-dimensional space where linear regression or classification can be applied. By analyzing the dynamics of the reservoir we show that the ensemble of eigenvalues of the network contributes to the ESN memory capacity. Moreover, we find that adding short loops to the reservoir network can tailor ESN for specific tasks and optimize learning. We validate our findings by applying ESN to forecast both synthetic and real benchmark time series. Our results provide a simple way to design task-specific ESN and offer deep insights for other recurrent neural networks.
As a promising paradigm of recurrent neural networks, the echo state network (ESN) has a reservoir of neurons with randomly assigned and fixed synaptic connections (Jaeger 2001a, 2001b, 2002; Jaeger and Haas, 2004). For the ESN to model and predict specific temporal patterns, only the weights of output neurons need to be learned from training data. Owing to its simplicity, ESN and its variants have been applied to many different tasks such as electric load forecasting (Deihimi and Showkati, 2012), robotic control (Plöger et al., 2003), epilepsy forecasting (Buteneers et al., 2008), stock price prediction (Lin et al., 2009), grammar processing (Tong et al., 2007), and many others (Coulibaly, 2010; Newton and Smith, 2012; Pathak et al., 2018; Verplancke et al., 2010).
Over the last decade, a plethora of studies have focused on finding good reservoirs. Those studies fall broadly into two categories. The first is systematic parameter search. For specific tasks, this outperforms the classical Monte Carlo reservoir selection (Deng and Zhang, 2006; Ferreira and Ludermir, 2011; Jiang et al., 2008; Liebald, 2004; Rodriguez et al., 2019). Yet it is very time consuming and offers neither a significant performance improvement nor a better mechanistic understanding. The second is the search for particular reservoir characteristics. Many studies have explored reservoirs with particular characteristics that make them desirable, typically long memory (Farkaš et al., 2016; Rodan and Tiňo, 2012; Strauss et al., 2012) or “rich” dynamics (Boedecker et al., 2012; Ozturk et al., 2007). However, the desirability of those reservoir characteristics is typically task-specific, rather than applicable to general tasks. Here, we propose a new strategy. We first focus on a general mechanistic understanding of the reservoir dynamics, which then helps us optimize or tailor reservoirs in a task-specific manner. We find that the idea of tailoring reservoirs is applicable to general tasks.
Formally, the discrete-time dynamics of the simplest ESN (as shown in Figure 1 and in Supplemental Information Section I) with N neurons, one input, and one output is governed by
$$x(t) = f\big(W x(t-1) + w_{in}\, u(t) + w_{ofb}\, y(t-1)\big) \tag{1}$$
$$y(t) = w_{out}\, [x(t); u(t)] \tag{2}$$
Here the state vector x(t) ∈ ℝ^N denotes the state of the N neurons at time t, u(t) is the input signal at time t, and y(t) is the output at time t. The extended state vector [x(t); u(t)] is just the concatenation of x(t) and u(t). There are various possibilities for the nonlinear function f; here we take the hyperbolic tangent, as is often done in the ESN literature (Jaeger, 2002; Jaeger and Haas, 2004). The N × N matrix W is the weighted adjacency matrix of the reservoir network describing the fixed wiring diagram of N neurons in the reservoir. There is a rich literature on the conditions that the matrix W must fulfill (Buehner and Young, 2006; Gandhi et al., 2012; Jaeger, 2007; Yildiz et al., 2012). Here we adopt a conservative and simple condition that the reservoir must represent a stable dynamic system. The vector w_in captures the fixed weights of the input connections, which were drawn from a uniform distribution. The vector w_ofb denotes the fixed weights of the feedback connections from the output to the N neurons, which can induce instabilities if chosen carelessly and may be zero in some tasks (Jaeger, 2002). Finally, the row vector w_out represents the trainable weights of the readout connections from the N neurons and the input to the output.
Figure 1
The Basic Schema of an ESN
The input signal u(t) goes to each neuron in the reservoir with input weights w_in, the neurons send their states to their neighbors according to the matrix W, and the contribution of each neuron to the output y(t) is collected by w_out. The reservoir network may have self-loops, and can have both excitatory (yellow) and inhibitory (gray) synaptic connections.
A key feature of ESN is that W, w_in, and w_ofb are all fixed, and only w_out is trainable:
$$w_{out} = \arg\min_{w} \sum_{t=t_0}^{t_0+T} \big(w\,[x(t); u(t)] - \hat{y}(t)\big)^2,$$
where t_0 is the starting time, T is the training interval, and ŷ(t) is the target output obtained from the training data (see Supplemental Information Section I for details). In other words, w_out is the vector of linear regression weights approximating the desired output from the extended state vector [x(t); u(t)], which can be easily solved. Hence, w_out captures the underlying mechanism of the dynamic system that produces the training data. Indeed, the right choice of w_out can be used to forecast, reconstruct, or filter nonlinear time series.
It is worth noticing that although ESN is easy to train and very flexible, it is often outperformed by more sophisticated methods requiring larger training datasets and longer training time (Hammami and Bedda, 2010), or ad hoc architectures (Wan, 1993). We focus on ESN in this work because of its simplicity, which allows us to perform analytical calculations. Moreover, we want to explore how simple ideas from classical signal processing and network science can be applied to dissect ESN, a prototypical paradigm of recurrent neural networks.
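The state update and the readout training described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the feedback weights are set to zero, the reservoir size, connectivity, spectral radius, input-weight range, washout length, and the toy sine-wave task are all arbitrary choices made for this sketch, and the helper names `run_reservoir` and `train_readout` are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def run_reservoir(W, w_in, u, f=np.tanh):
    """Iterate x(t) = f(W x(t-1) + w_in u(t)), assuming no output feedback."""
    N, T = W.shape[0], len(u)
    X = np.zeros((T, N))
    x = np.zeros(N)
    for t in range(T):
        x = f(W @ x + w_in * u[t])
        X[t] = x
    return X

def train_readout(X, u, target, washout=100):
    """Least-squares fit of w_out on the extended state [x(t); u(t)]."""
    Z = np.hstack([X, u[:, None]])[washout:]  # drop the initial transient
    w_out, *_ = np.linalg.lstsq(Z, target[washout:], rcond=None)
    return w_out

rng = np.random.default_rng(0)
N = 100                                             # assumed reservoir size
W = rng.normal(size=(N, N)) * (rng.random((N, N)) < 0.05)
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))     # assumed spectral radius 0.9
w_in = rng.uniform(-0.5, 0.5, size=N)               # assumed input-weight range

# Toy task (an assumption of this sketch): one-step-ahead forecast of a sine wave.
s = np.sin(0.2 * np.arange(1201))
u, target = s[:-1], s[1:]

X = run_reservoir(W, w_in, u)
w_out = train_readout(X, u, target)
y = np.hstack([X, u[:, None]]) @ w_out
mse = np.mean((y[100:] - target[100:]) ** 2)
print("training MSE:", mse)
```

Only `w_out` is fitted here; `W` and `w_in` stay at their random initial values, which is the property that makes the training a plain linear regression.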
Results
ESN Performance Captured by Reservoir Spectrum
The success of ESN in tasks such as forecasting time series comes from the ability of its reservoir to retain memory of previous inputs (Cui et al., 2012). In ESN literature, this is quantified by the memory capacity (Jaeger, 2001a, 2001b):
$$M = \sum_{\tau=1}^{\infty} \frac{\mathrm{cov}^2\big(r(t-\tau),\, y_\tau(t)\big)}{\mathrm{var}\big(r(t)\big)\,\mathrm{var}\big(y_\tau(t)\big)}$$
Here r(t) is a random variable drawn from a normal distribution, serving as a random input; “cov” represents the covariance; y_τ(t) is the output as described in Equation 2; and its readout weights are obtained as a minimizer of the difference between y_τ(t) and r(t−τ).
To quantify the relationship between the reservoir dynamics and the memory capacity, we note that the extraction of information from the reservoir is made through a linear combination of the neurons' states. Hence, more linearly independent neurons would offer more variable states, and thus longer memory (Jaeger, 2005; Lukoševičius and Jaeger, 2009). To be more precise, we hypothesize that the memory capacity M strongly depends on the average correlation among neuron states, S, defined as the average over pairs of neurons i and j of the Pearson correlation coefficient between their states, cov(x_i(t), x_j(t)) / (std(x_i(t)) std(x_j(t))), where std(x_i(t)) is the standard deviation of the states of neuron i. Indeed, Figure 2A shows that for various network topologies there is a strong correlation between S and M, which can also be justified analytically (see Supplemental Information Section VI). Thus, hereafter we only need to understand how the reservoir structure affects the neuron correlations.
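As a rough illustration of how the memory capacity can be estimated numerically, the sketch below drives a tanh reservoir with Gaussian noise and, for each delay τ, fits a separate least-squares readout to reconstruct r(t−τ). Several choices are assumptions of this sketch and not the authors' protocol: the sum is truncated at `max_delay`, the squared correlations are measured on the training data itself (which biases the estimate slightly upward), and the reservoir size and input scaling are arbitrary. The function name `memory_capacity` is hypothetical.

```python
import numpy as np

def memory_capacity(W, w_in, T=3000, washout=200, max_delay=40, seed=0):
    """Estimate M as a sum over delays of squared correlations between
    the delayed input r(t - tau) and the best linear readout of the state."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=T)                 # random Gaussian input
    N = W.shape[0]
    x = np.zeros(N)
    X = np.zeros((T, N))
    for t in range(T):
        x = np.tanh(W @ x + w_in * r[t])
        X[t] = x
    Z = np.hstack([X, r[:, None]])[washout:]   # extended states at times t
    M = 0.0
    for tau in range(1, max_delay + 1):
        d = r[washout - tau:T - tau]           # delayed input r(t - tau)
        w, *_ = np.linalg.lstsq(Z, d, rcond=None)
        y = Z @ w
        # cov^2 / (var * var) is the squared Pearson correlation
        M += np.corrcoef(y, d)[0, 1] ** 2
    return M

rng = np.random.default_rng(1)
N = 50                                          # assumed reservoir size
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # assumed spectral radius 0.9
w_in = rng.uniform(-0.1, 0.1, size=N)            # assumed (small) input scaling
M = memory_capacity(W, w_in)
print("estimated memory capacity:", M)
```

Because every term in the sum is a squared correlation, each delay contributes at most 1, so the truncated estimate cannot exceed `max_delay`.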
Figure 2
Relations between Memory Capacity, Neuron Correlations, and Network Spectrum
(A) Memory capacity M versus average correlation of neuron states S.
(B) Average correlation of neuron states S versus average eigenvalue modulus ⟨|λ|⟩. ESNs were created using reservoirs of N = 400 neurons and sequences of 4,000 random inputs, and the error bars represent the standard deviation. ER represents reservoirs with structure generated by the classical Erdős-Rényi (ER) random graphs, with edge weights drawn from a normal distribution and varying spectral radii. PL represents reservoirs with structures generated from Erdős-Rényi random graphs, but with edge weights drawn from a power-law (PL) distribution with varying exponent β∈[2,5] and then normalized to have a spectral radius α = 1. Lower β renders lower M, higher S, and lower ⟨|λ|⟩. SF represents reservoirs with scale-free (SF) network structures with degree exponent γ∈[2,6], with edge weights drawn from a normal distribution and then normalized to have a spectral radius α = 1. More degree-heterogeneous networks render lower M, higher S, and lower ⟨|λ|⟩. RR represents reservoirs with structure generated by random regular (RR) graphs, with varying degrees and a spectral radius α = 1. All networks have a spectral radius α = 1, except the ER random graphs, where each point corresponds to a spectral radius in the range [0.2,1] to show the impact of spectral radius. It is worth noticing that although the theoretical upper bound for the memory capacity of a reservoir is M = N (Jaeger, 2001a, 2001b) and small input scalings (Farkaš et al., 2016) do achieve similar values, in our case the input scaling is large and thus the nonlinearity of the reservoir limits M to be lower than 18. A more detailed numerical exploration of the dependency between the various network parameters and M, and between the network parameters and ⟨|λ|⟩, is presented in the Supplemental Information Section V. The network generation algorithms are presented in Supplemental Information Section III.
See also Figures S1–S3.
For linear dynamical systems, the correlations between state variables depend on all the eigenvalues of the adjacency matrix (Boccaletti et al., 2006), with larger mean eigenvalue modulus meaning lower correlations (see Supplemental Information Section VII for details). Our system (Equation 1) is nonlinear, but the first-order approximation of f is the identity function f(z) = z. Hence we can use the eigenvalues {λ_i} of matrix W to approximately quantify how fast the input decays in the reservoir, and hence how poorly the ESN remembers. In other words, the eigenvalues of matrix W should be related to the memory capacity of the ESN. Indeed, we find that the average eigenvalue modulus
$$\langle|\lambda|\rangle = \frac{1}{N}\sum_{i=1}^{N} |\lambda_i|$$
strongly correlates with S (Figure 2B) and therefore with M as well.
Note that, as opposed to M and S, ⟨|λ|⟩ is much easier to compute and is solely determined by the reservoir network. This offers us a simple measure to quantify the ESN memory capacity and hence its performance. For example, this explains two recent studies in which it was found that ring networks and orthogonalized networks have high memory capacities (Farkaš et al., 2016; Rodan and Tiňo, 2012), as both networks have large eigenvalues with respect to their spectral radii. Similarly, the modularity of the network can have an effect on its memory capacity (Rodriguez et al., 2019), which is explained by the effect of modularity on the eigenvalue distribution (Newman and Girvan, 2004). Moreover, ⟨|λ|⟩ is consistent with the effects of scaling the adjacency matrix to tune the spectral radius (Jaeger et al., 2007) and it extends to network topologies with a fixed spectral radius (see Supplemental Information Figures S1 and S2). Our result is also consistent with studies on Intrinsic Plasticity, where the memory of the reservoir is increased by growing its entropy (Boedecker et al., 2009; Schrauwen et al., 2008), which is negatively correlated with the neuron correlations. Finally, it suggests that M can be maximized by using networks with large eigenvalues, the simplest one being a circulant network with degree 1 (Aceituno et al., 2019), which achieves M = 20, whereas the others go only up to M = 17 (see the example in Supplemental Information Section VIII).
To further demonstrate the validity of ⟨|λ|⟩ as a proxy measure for the ESN performance, we consider the following tasks (Figures 3A–3C): (1) forecasting the chaotic Mackey-Glass time series (Mackey and Glass, 1977), which is a benchmark task to evaluate the performance of ESN (Jaeger and Haas, 2004; Lukoševicius and Jaeger, 2007); (2) forecasting the Laser Intensity time series (Hübner et al., 1989) downloaded from the Santa Fe Institute; and (3) classifying Spoken Arabic Digits (Hammami and Bedda, 2010) downloaded from the UCI Machine Learning Repository (Lichman, 2013). For each task, we consider ESNs with a wide range of reservoir topologies and parameters, and plot the ESN performance (in terms of forecasting error or failure rate) as a function of ⟨|λ|⟩ (Figures 3D–3F). We find that the optimal parameters for all reservoir networks are within a consistent range of ⟨|λ|⟩ (highlighted in pink). In other words, the performance of ESN is indeed captured by ⟨|λ|⟩, rather than by other characteristics of the reservoir network.
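The average eigenvalue modulus is cheap to evaluate for any candidate reservoir. The following sketch compares a dense Gaussian random reservoir with a directed ring (a circulant network of degree 1), both scaled to spectral radius 1. The reservoir size and the helper names `mean_eigenvalue_modulus` and `scale_to_spectral_radius` are assumptions made for this illustration.

```python
import numpy as np

def mean_eigenvalue_modulus(W):
    """Average of |lambda_i| over all eigenvalues of the reservoir matrix W."""
    return np.mean(np.abs(np.linalg.eigvals(W)))

def scale_to_spectral_radius(W, alpha=1.0):
    """Rescale W so that its largest eigenvalue modulus equals alpha."""
    return W * (alpha / np.max(np.abs(np.linalg.eigvals(W))))

rng = np.random.default_rng(0)
N = 200  # assumed reservoir size for this sketch

# Dense random reservoir with Gaussian weights, spectral radius 1.
W_random = scale_to_spectral_radius(rng.normal(size=(N, N)))

# Directed ring: neuron i connects to neuron i+1 (mod N) with weight 1.
W_ring = np.roll(np.eye(N), 1, axis=1)

print("random:", mean_eigenvalue_modulus(W_random))
print("ring:  ", mean_eigenvalue_modulus(W_ring))
```

The eigenvalues of the ring are the N-th roots of unity, so all of them sit on the unit circle and the average modulus equals the spectral radius. For the dense Gaussian reservoir the eigenvalues spread over the whole disk, which gives a noticeably smaller average (roughly 2/3 for large N, following the circular law).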
Figure 3
Time Series Analyzed in This Work and the ESN Performance Explained by ⟨|λ|⟩
(A) The classical Mackey-Glass time series (Mackey and Glass, 1977) with 500 data points.
(B) The Laser Intensity time series (Hübner et al., 1989) with 300 data points.
(C) The average value of the first mel-frequency cepstral coefficient (MFCC) Channel of the first Spoken Arabic Digit (Hammami and Bedda, 2010); the error bars represent standard deviations over the training dataset.
(D) The ESN forecasting performance for the Mackey-Glass time series.
(E) The ESN forecasting performance for Laser Intensity time series.
(F) The ESN classification failure rate for Spoken Arabic Digits. For each task, we use scale-free (SF) networks, Erdős-Rényi (ER) random graphs with homogeneous link weights, and ER random graphs with heterogeneous link weights following a power-law distribution (PL) as reservoirs (see Supplemental Information Section III for more details on reservoir generation). The SF and PL reservoirs have various spectral radii α, chosen to be around the optimal value of α for the ER reservoirs. For each parameter set of each network type we created 200 ESN realizations, and then all the points obtained were grouped in 10 bins containing the same number of points. For each bin, we plotted the median ⟨|λ|⟩ against the median performance: log(σ) from Equation S5 for (D) and (E), and the failure rate for (F), with the error bars being the upper and lower quartiles. See Supplemental Information Section II for an expanded description of the three time series and the performance measurement.
Adapting ESN in the Frequency Domain
Intuitively, a reservoir can be understood as a set of coupled filters that extract features from the input signal, and the readout simply selects the right combination of those features. Hence, the reservoir should be designed to extract the features that are relevant for the problem at hand, and these features can be expressed in the Fourier domain. This idea can be translated to machine learning terms through a geometric argument (Figure 4), which is made rigorous in Supplemental Information Section IX. In particular, we derived an upper bound for the forecasting error in terms of the Fourier transforms of the neuron states x and the target y, combined through cross (×) and scalar (⟨⋅,⋅⟩) products. In terms of signal processing, the cross product (or the scalar product) can be expressed by how much the power spectral densities (PSD) of the neurons differ from (or resemble) that of y, respectively. This is quite a natural result, as it simply implies that if the time series of the variables x are similar to the target, then the readout will work better.
Figure 4
Sketch of the Frequency Adaptation Argument
(Left) The target output is a time series of length T, represented by a vector in the corresponding space. A reservoir consists of N nonlinear filters of the input time series, represented by N points in that same space. The readout simply selects the point in the subspace spanned by the N neurons that is closest to the target. (Center) Thanks to Parseval's theorem (Parseval, 1806), the distances between vectors do not change after Fourier transformation, hence the picture is still valid in the Fourier domain. (Right) However, in the Fourier domain it is possible to alter the filters so that the N points approach the target by making the reservoir resonate at appropriate frequencies, effectively reducing the forecasting error.
Thus, to achieve the optimal performance of ESN for any specific task, it is crucial to alter the PSD of the reservoir, which can be achieved by adding feedback loops with delay L to our neurons, encoded as cycles of length L in the reservoir network. We account for the strength of those cycles by using the following measure:
$$\rho_L = \frac{E_L^{+} - E_L^{-}}{E},$$
where E is the number of edges in the reservoir network and E_L^s (with s∈{+,−}) represents the number of edges embedded in cycles of length L and sign s. The value of E, the number of edges, depends on the specific ESN implementation (see Supplemental Information Section II). Note that standard ESNs (Jaeger and Haas, 2004; Pathak et al., 2018; Schrauwen et al., 2007) typically use fully random reservoirs, rendering E_L^+ − E_L^− ≈ 0 regardless of the value of E, and hence ρ_L ≈ 0 for all L.
As shown in Figure 5, reservoirs with different cycle lengths L and different ρ_L values, generated by an algorithm presented in Supplemental Information Section IV, can enhance different families of frequencies. This holds true even though the dynamics of the neurons are nonlinear (see Supplemental Information Section X for an analytical explanation). For example, the Mackey-Glass time series and the Spoken Arabic Digit time series are dominated by low frequencies. Any reservoir with ρ_L > 0 will enhance the low frequencies, meaning that such a reservoir would enhance the frequencies relevant to those two time series. Similarly, for Laser Intensity, there are three peaks in the center of its PSD, which are enhanced in the cases of ρ_L < 0 for L = 2 and L = 3, but not for L = 1. This implies a strategy to tailor the reservoir for any specific task.
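The way a feedback loop of delay L shapes the frequency response can be illustrated with a linear, single-neuron caricature, x(t) = a·x(t−L) + u(t), whose power gain at frequency f (in cycles per sample) is 1/|1 − a·e^(−2πifL)|². This toy filter is an assumption of the sketch, not the reservoir model analyzed in the Supplemental Information; the loop weight `a` and the function name `loop_power_gain` are hypothetical, with the sign of `a` playing the role of the sign of the cycle.

```python
import numpy as np

def loop_power_gain(a, L, freqs):
    """Power gain |H(f)|^2 of the linear filter x(t) = a * x(t - L) + u(t)."""
    return 1.0 / np.abs(1.0 - a * np.exp(-2j * np.pi * freqs * L)) ** 2

freqs = np.linspace(0.0, 0.5, 501)  # from zero up to the Nyquist frequency

for L in (1, 2, 3):
    for a in (0.5, -0.5):           # assumed loop weights: positive vs. negative cycle
        gain = loop_power_gain(a, L, freqs)
        low = gain[freqs < 0.05].mean()
        print(f"L={L}, a={a:+.1f}: mean power gain below f=0.05 is {low:.2f}")
```

In this caricature a positive loop of any length amplifies the frequencies near 0, while a negative loop suppresses them and shifts the resonance toward f = 1/(2L), for example f = 0.25 for L = 2. This is qualitatively in line with the observation that positive cycles favor low-frequency signals and negative cycles of length 2 or 3 favor mid-band peaks.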
Figure 5
Frequency Domain Analysis of Target Signals and Reservoir Frequencies
(A–I) Left y axis: the power spectral density (PSD) of three empirical time series (Mackey-Glass in A, D, and G; Laser Intensity in B, E, and H; and Spoken Digits in C, F, and I). Right y axis: the average PSD of the neuron states for reservoirs with various ρ_L when using a random Gaussian input. In each panel we plot the average PSD of 500 reservoirs with 400 neurons and connectivity 0.05. The length of cycles added into the reservoir is 1 (A–C), 2 (D–F), and 3 (G–I).
To prove the concept of tailoring the reservoir for specific tasks, we consider reservoirs with varying fractions of cycles and plot the ESN performance as a function of ρ_L for L = 1,2,3. As shown in Figure 6, the ESN performance does change with ρ_L for each L. To better understand this phenomenon, it is useful to consider the optimal value of ρ_L (highlighted in pink) and compare the corresponding average PSD of the neuron states for this reservoir with the PSD of the empirical time series.
Figure 6
Tailoring ESN through Frequency Adaptation
(A–I) ESN performance as a function of ρ_L, for Mackey-Glass Forecasting (A, D, and G), Laser Intensity Forecasting (B, E, and H), and Spoken Arabic Digit Recognition (C, F, and I). Every point corresponds to the median performance, measured by forecasting error in (A, B, D, E, G, and H) and failure rate in (C, F, and I), over 200 realizations with error bars representing upper and lower quartiles. The length of cycles added into the reservoir is L = 1 in (A–C), 2 in (D–F), and 3 in (G–I). For each L, the optimal values of ρ_L are highlighted in pink. Dotted lines represent the best ESN performance obtained by combining cycles of different lengths.
For the Mackey-Glass (or the Spoken Arabic Digits) time series, the optimal ρ_L is positive for L = 1,2,3 (see Figures 6A, 6D, and 6G; or 6C, 6F, and 6I). We also know that for ρ_L > 0 (especially for L = 2,3), the reservoir's average PSD response is enhanced for the frequencies close to 0, which is exactly the regime where the spectrum of the Mackey-Glass (or the Spoken Arabic Digits) time series is concentrated (see Figures 5A, 5D, and 5G; or 5C, 5F, and 5I).
As for the Laser Intensity time series, the dominating frequencies of its PSD are around 0.13, 0.27, and 0.38, thus the ESN performance improves when the response of the reservoir enhances those frequencies. As shown in Figures 5E and 5H, this happens when ρ_L < 0 for L = 2,3. Indeed, as shown in Figures 6E and 6H, negative ρ_L (for L = 2,3) improves the ESN performance. For L = 1, we observed in Figure 5 that the three peaks cannot all be enhanced simultaneously by setting ρ_1 to be either positive or negative. Instead, setting ρ_1 ≈ 0 would yield the optimal performance. This is exactly what we observed in Figure 6B.
Results shown in Figures 5 and 6 indicate that the reservoir should be tailored to resonate with the dominating frequencies present in the target signal. To achieve that, we designed a simple heuristic algorithm to find the optimal values of ρ_L for cycles of different lengths (see Supplemental Information Section XI). This heuristic algorithm offers much better ESN performance (dotted lines in Figure 6) than simply optimizing cycles of a fixed length L, and it also outperforms standard, state-of-the-art random reservoirs with random weights, where ρ_L ≈ 0 (Jaeger and Haas, 2004; Pathak et al., 2018; Schrauwen et al., 2007).
Discussion
In this work, we start by showing how the correlations between neurons define the memory of ESN and demonstrate that those correlations are determined by the eigenvalues of the reservoir's adjacency matrix. This result allows us to easily assess the memory capacity of a particular reservoir network, unifying previous results (Farkaš et al., 2016; Jaeger, 2001a, 2001b; Rodan and Tiňo, 2012; Strauss et al., 2012). Then we go beyond the current ESN practice and reveal previously unexplored optimization strategies. In particular, we show that adding short loops to the reservoir network can create resonant frequencies and enhance ESN performance by adapting the reservoir to specific tasks. It is important to note that we are not advocating the hand-tuning of reservoir topologies for specific tasks of ESN, but rather raising the point that notions from classical signal processing can help us understand and improve recurrent neural networks, either through selection of appropriate initial topologies in a pre-training stage or by designing learning algorithms that account for the principles outlined here. Given that most current learning strategies such as back-propagation focus on adapting single weights, we are convinced that many new learning algorithms can be created by focusing on network-level features. Moreover, our approach goes beyond improving current techniques. By studying which properties of a recurrent neural network make it well suited for a particular problem, we are also addressing the converse question of what a neural network should look like after it has been adapted to a specific task. Thus, we provide valuable insights into the training process of general recurrent neural networks, as our theory highlights structural features that the training process would enhance or inhibit.
Limitations of the Study
Finally, we would like to highlight some potential caveats of the current work. On the application side, although we demonstrate that adding short loops to the reservoir network can improve the ESN performance for the aforementioned tasks, ESN as a whole can be outperformed by other, more task-specialized approaches that often require larger training datasets and longer training time. For instance, with a standard MATLAB package we can obtain performances of log(σ) ≈ −6 for Mackey-Glass Time Series Forecasting, although the algorithm takes much longer to train and requires more data. An ad hoc method for the Laser Intensity task with linear Finite Impulse Response Filters and a long memory buffer achieved performance between log(σ) = −1.7 and log(σ) = −3, depending on the time interval (Wan, 1993). The Spoken Arabic Digit Recognition can be solved with a failure rate between 2.5% and 12% by using graphical models (Hammami and Sellam, 2009), although the graphical models used all 13 Mel Frequency Cepstral Coefficient (MFCC) channels as opposed to the single channel that we used in this work. We focus on ESN in this work because its simplicity allows us to perform analytical calculations.
On the theoretical side, there are two caveats. First, the heuristic we used to find the optimal values of ρ_L, presented in Figure 6 and Supplemental Information Section XI, does not have theoretical guarantees. We could create signals where the dominant frequency does not contain any relevant information, for instance, by adding a strong sinusoid to the time series to be processed. In this sense, we are counting on the domain knowledge of researchers who are interested in our method to filter out large non-informative components before feeding the time series to the ESN. Second, the relationship between eigenvalue moduli and neuron correlations assumes that linearization is a valid approximation. Although it is useful as an upper bound, the bound might become loose when the nonlinear effects are strong. This could in principle be addressed by studying the Lyapunov spectra of the reservoir, which depend partially on the network structure and partially on the dynamics, thus accounting for the nonlinearities. A rigorous mathematical analysis is likely to be very challenging and is beyond the scope of the current study.
Resource Availability
Lead Contact
Further information and requests should be addressed to Yang-Yu Liu (yyl@channing.harvard.edu).
Materials Availability
This study did not generate any new reagents.
Data and Code Availability
The data and code used in this work are available at: https://github.com/pvili/EchoStateNetworks_NetworkAdaptation/tree/master.
Methods
All methods can be found in the accompanying Transparent Methods supplemental file.
Authors: T Verplancke; S Van Looy; K Steurbaut; D Benoit; F De Turck; G De Moor; J Decruyenaere Journal: BMC Med Inform Decis Mak Date: 2010-01-21 Impact factor: 2.796