Erhard Wieser, Gordon Cheng.
Abstract
For spatiotemporal learning with neural networks, hyperparameters are often set manually by a human expert. This is especially the case with multiple timescale networks, which require a careful setting of the timescale values in order to learn spatiotemporal data. However, this implies a cumbersome trial-and-error process until suitable parameters are found, and it reduces the long-term autonomy of artificial agents, such as robots controlled by multiple timescale networks. To solve the problem, we propose the evolutionary optimized multiple timescale recurrent neural network (EO-MTRNN), inspired by the neural plasticity of the human cortex. Our proposed network uses evolutionary optimization to adjust its timescales and to rewire itself in terms of the number of neurons and synapses. Moreover, it does not require additional neural networks for pre- and postprocessing of input-output data. We validate our EO-MTRNN by applying it to a proposed benchmark training dataset with single and multiple sequence training cases, as well as to sensory-motor data from a robot. We compare different configuration modes of the network, and we compare the learning performance between a network configuration with manually set hyperparameters and a configuration with automatically estimated hyperparameters. The results show that automatically estimated hyperparameters yield approximately 43% better performance than manually estimated ones, without overfitting the given teaching data. We also validate the generalization ability by successfully learning data that were not included in the hyperparameter estimation process.
Keywords: Autonomous hyperparameter estimation; EO-MTRNN; Evolutionary optimization; Neural plasticity
Year: 2020 PMID: 32185485 PMCID: PMC7326924 DOI: 10.1007/s00422-020-00828-8
Source DB: PubMed Journal: Biol Cybern ISSN: 0340-1200 Impact factor: 2.086
Fig. 1 a Traditional: a human determines the hyperparameters. This may imply a series of cumbersome trial-and-error runs, each conducted with the human in the loop, until the system shows satisfactory learning performance. b Proposed: the system automatically determines its hyperparameters, i.e. the neural dynamics and network structure, through a process resembling biological evolution, in which each generation of the network performs better than the previous one. For each generation, the performance is determined by running the network in closed loop, yielding spatiotemporal patterns that are compared with the teaching data. The resulting fitness metric is fed into the optimizer, which determines the hyperparameters for the next generation. To the best of our knowledge, evolutionary hyperparameter estimation has not yet been realized for multiple timescale neural networks
Fig. 2 Structure of our MTRNN version, which uses sigmoid activation for all units. The network consists of units (depicted as squares) connected by a set of mathematical functions. The left half shows the input–output group IO, and the right half shows the context group C. The thick arrows between the bottom and the middle units indicate the connective weights represented by the weight matrix. The output of the top context units is fed back to the bottom context units. Extending this network with an evolutionary optimizer yields our proposed EO-MTRNN, shown in Fig. 3
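The extracted caption omits the formulas; MTRNN dynamics are commonly written as a leaky-integrator update with per-unit timescales and sigmoid activation (as in Yamashita and Tani 2008). A minimal sketch under that assumption, with all names illustrative:

```python
import numpy as np

def mtrnn_step(u, x, W, tau):
    """One discrete-time MTRNN update (leaky integrator, sigmoid output).

    u   : (N,) internal states of all units
    x   : (N,) current inputs (previous activations / external input)
    W   : (N, N) connective weight matrix
    tau : (N,) per-unit timescales; a large tau means slow dynamics
    Standard form: u[t+1] = (1 - 1/tau) * u[t] + (1/tau) * (W @ x[t])
    """
    u_new = (1.0 - 1.0 / tau) * u + (1.0 / tau) * (W @ x)
    y = 1.0 / (1.0 + np.exp(-u_new))  # sigmoid activation for all units
    return u_new, y
```

With `tau = 1` the unit has no memory and the update reduces to `u_new = W @ x`; larger timescales (as in the slow context group) make the state change more gradually.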
Fig. 3 Proposed EO-MTRNN. The MTRNN (top left) is extended by components for autonomous hyperparameter estimation. For this purpose, the MTRNN is trained and evaluated with benchmark sequences (teaching data). The fitness is computed and fed into the optimizer, which estimates a set H of hyperparameters. Here, H consists of the numbers of context neurons and the multiple timescales of the IO, FC, and SC groups. The hyperparameters are used to adjust the neural timescales and to restructure the MTRNN (timescale adjustment and restructuring indicated by the orange colour) (colour figure online)
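The estimation loop of Fig. 3 (generate a population of hyperparameter sets, train and evaluate the network, feed the fitness back into the optimizer) can be sketched generically. The paper uses SA-DE; this sketch substitutes a simpler truncation-selection scheme, and all names and constants are illustrative:

```python
import random

def evolve_hyperparameters(fitness_fn, bounds, pop_size=10, generations=40, seed=1):
    """Generic evolutionary loop over a hyperparameter vector H.

    fitness_fn : maps a hyperparameter vector to a scalar fitness (higher is better)
    bounds     : list of (low, high) pairs, one per hyperparameter
    """
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness_fn, reverse=True)
        parents = scored[: pop_size // 2]          # keep the fitter half
        children = []
        for p in parents:                          # mutate each survivor
            child = [min(hi, max(lo, g + rng.gauss(0.0, 0.1 * (hi - lo))))
                     for g, (lo, hi) in zip(p, bounds)]
            children.append(child)
        pop = parents + children                   # elitist replacement
    return max(pop, key=fitness_fn)
```

In the EO-MTRNN, `fitness_fn` would train the MTRNN with the candidate hyperparameters and score the closed-loop recall against the teaching data.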
Dimension of optimization problem and the corresponding MTRNN hyperparameters that are optimized through SA-DE
| Problem dimension | Hyperparameter |
|---|---|
| 1 | Number of fast context neurons |
| 2 | Number of slow context neurons |
| 3 | Timescale of the IO group |
| 4 | Timescale of the fast context group |
| 5 | Timescale of the slow context group |
Elementary benchmark training sequences with additive Gaussian noise
| Type | Length | Mathematical expression |
|---|---|---|
| Rising ramp | 50 | |
| | 100 | |
| | 150 | |
| Falling ramp | 50 | |
| | 100 | |
| | 150 | |
| Sigmoid-like upw. slope | 50 | |
| | 100 | |
| | 150 | |
| Sigmoid-like downw. slope | 50 | |
| | 100 | |
| | 150 | |
| Sine | 50 | |
| | 100 | |
| | 150 | |
| Irregular (type K) | 50 | |
| | 100 | Same as irregular (type K) |
| | 150 | Same as irregular (type K) |
Fig. 4 Elementary benchmark training sequences. See Table 2 for the mathematical descriptions used to generate them
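As an illustration of how such elementary sequences can be generated, here is a sketch of a noisy ramp; the exact expressions and noise constants of Table 2 are not reproduced here, so `mu` and `sigma` are placeholder values:

```python
import numpy as np

def ramp_with_noise(length=50, rising=True, mu=0.0, sigma=0.02, rng=None):
    """Illustrative generator for one elementary benchmark sequence:
    a rising or falling ramp with additive Gaussian noise N(mu, sigma)."""
    rng = rng or np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, length)      # normalized time axis
    y = t if rising else 1.0 - t           # linear ramp up or down
    return y + rng.normal(mu, sigma, size=length)
```

The other elementary types (sigmoid-like slopes, sine, irregular) would follow the same pattern with a different deterministic part.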
Elementary benchmark training sequences of irregular type
| Type | Length | Interval | |||
|---|---|---|---|---|---|
| A | 50 | 0.2 | 0 | 0.1 | |
| 100 | 51 | 0.3 | |||
| 150 | 0.3 | 101 | 0.1 | ||
| B | 50 | 0.2 | 0 | 0.3 | |
| 100 | 0.3 | 51 | 0.5 | ||
| 150 | 101 | 0.8 | |||
| C | 50 | 0 | 0.6 | ||
| 100 | 0.2 | 51 | 0.3 | ||
| 150 | 0.2 | 101 | 0.5 | ||
| D | 50 | 0.1 | 0 | 0.1 | |
| 100 | 0.2 | 51 | 0.2 | ||
| 150 | 101 | 0.4 | |||
| E | 50 | 0 | 0.3 | ||
| 100 | 0.3 | 51 | 0.1 | ||
| 150 | 101 | 0.4 | |||
| F | 50 | 0 | 0.6 | ||
| 100 | 0.2 | 51 | 0.4 | ||
| 150 | 101 | 0.6 | |||
| G | 50 | 0 | 0.9 | ||
| 100 | 51 | 0.6 | |||
| 150 | 101 | 0.4 | |||
| H | 50 | 0.3 | 0 | 0.5 | |
| 100 | 51 | 0.8 | |||
| 150 | 0.2 | 101 | 0.6 | ||
| I | 50 | 0.3 | 0 | 0.2 | |
| 100 | 51 | 0.5 | |||
| 150 | 0.3 | 101 | 0.2 | ||
| J | 50 | 0 | 0.5 | ||
| 100 | 0.2 | 51 | 0.4 | ||
| 150 | 101 | 0.6 |
Each sequence type can be described by a piecewise mathematical expression over the listed intervals, with additive Gaussian noise
Fig. 5 Elementary benchmark training sequences of irregular type. See Table 3 for the mathematical descriptions used to generate them
Multi-dimensional benchmark training sequences
| Spatial dimensions | Elementary type |
|---|---|
| 1 | D |
| 2 | D, E |
| 4 | D, E, C, B |
| 6 | D, E, C, B, A, F |
| 8 | D, E, C, B, A, F, H, G |
| 10 | D, E, C, B, A, F, H, G, J, I |
See Table 3 for each elementary type. For example, the two-dimensional sequence consists of type D as its first spatial dimension and type E as its second. Note that the order of the dimensions does not matter; it was composed randomly
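The composition rule described above amounts to stacking elementary one-dimensional sequences along the spatial axis; a minimal sketch:

```python
import numpy as np

def compose_multidim(elementary_seqs):
    """Stack 1-D elementary sequences into one multi-dimensional
    benchmark sequence of shape (length, n_dims); the order of the
    spatial dimensions is arbitrary, as noted in the paper."""
    return np.stack([np.asarray(s, dtype=float) for s in elementary_seqs], axis=1)
```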
Numbers of neurons and timescales used for learning
| Variable | 20 | 5 | 20 | 25 | 250 |
Learning rates and momentum values, kept fixed for all experiments
| 0.6 | 0.6 | 0.6 | 0.9 | |
| 0.4 | 0.4 | 0.4 | 0.9 |
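The fixed learning-rate/momentum scheme above corresponds to a standard momentum term in the gradient-descent weight update; a minimal sketch (default values are illustrative, and this is not the paper's exact BPTT implementation):

```python
def momentum_update(w, grad, velocity, lr=0.6, momentum=0.9):
    """One gradient-descent step with momentum: the velocity accumulates
    a decaying history of past gradients, smoothing the weight updates."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```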
Learning of one-dimensional benchmark sequences
| Length | Network configuration | Type of training sequence | ||||||
|---|---|---|---|---|---|---|---|---|
| Preprocessing | Early stopping | Ramp | Sigmoid-like | Sine | Irregular (type K) | |||
| Falling | Rising | Downward slope | Upward slope | |||||
| 50 | 0.992 | 0.997 | 0.998 | 0.937 | 0.479 | |||
| 50 | 0.966 | 0.994 | 0.986 | 0.988 | 0.120 | 0.333 | ||
| 50 | 0.998 | 0.998 | 0.995 | 0.996 | 0.945 | 0.318 | ||
| 50 | 0.965 | 0.970 | 0.987 | 0.996 | 0.900 | 0.501 | ||
| 100 | 0.994 | 0.999 | 1.00 | 0.827 | 0.451 | |||
| 100 | 0.953 | 0.982 | 0.943 | 0.944 | 0.285 | 0.0458 | ||
| 100 | 0.996 | 0.999 | 0.998 | 0.999 | 0.644 | 0.410 | ||
| 100 | 0.940 | 0.954 | 0.941 | 0.962 | 0.491 | 0.419 | ||
| 150 | 0.995 | 0.999 | 0.999 | 0.664 | 0.177 | |||
| 150 | 0.930 | 0.950 | 0.901 | 0.905 | 0.106 | |||
| 150 | 0.999 | 0.999 | 0.999 | 0.196 | 0.793 | |||
| 150 | 0.707 | 0.844 | 0.885 | 0.939 | 0.172 | 0.488 | ||
Learning performance is measured by the R-value. Four different network configuration modes were compared for each training sequence length
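A common choice for such an R-value is the Pearson correlation between the teaching sequence and the closed-loop recall; a sketch under that assumption (the paper's exact definition may differ):

```python
import numpy as np

def r_value(teach, recall):
    """Pearson correlation between teaching data and recalled sequence,
    a plausible reading of the R-value reported in the tables."""
    t = np.asarray(teach, dtype=float).ravel()
    p = np.asarray(recall, dtype=float).ravel()
    t = t - t.mean()
    p = p - p.mean()
    return float(t @ p / (np.linalg.norm(t) * np.linalg.norm(p)))
```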
Fig. 6 Learning of multi-dimensional benchmark training sequences. Learning performance is measured by the R-value, indicated by the colour bar. Four different network configuration modes were compared (colour figure online)
Fig. 7 Learning and recall of the noisy six-dimensional benchmark training sequence. Network configuration: preprocessing on, early stopping off. Achieved R-value: 0.974. See Table 4 for details on the training data (colour figure online)
Fig. 8 Recall with extrapolation from timestep 150 to 450. Dashed lines show the training sequence, labelled (T); see also Fig. 7a. Solid lines show the predicted sequence, labelled (P). The network tends to extrapolate the sequence based on the latest history of input–output activations, behaving like a type of predictive memory (colour figure online)
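The closed-loop recall behind Fig. 8 feeds the network's own prediction back as the next input, which is what allows extrapolation past the training horizon. A generic sketch with an arbitrary one-step predictor:

```python
import numpy as np

def closed_loop_rollout(step_fn, x0, n_steps):
    """Closed-loop recall: the prediction at step t becomes the input at
    step t+1, so the sequence can be extended beyond the teaching data.
    step_fn is any one-step predictor, e.g. a trained MTRNN wrapper."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        xs.append(step_fn(xs[-1]))
    return np.stack(xs)
```

For example, with a step function that rotates a 2-D state vector, the rollout traces out a sinusoid indefinitely, mirroring the periodic extrapolation seen in the figure.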
Comparison of numerical results using the benchmark function set F from Brest et al. (2006)
| # Gen. | SA-DE (own) | SA-DE (Brest et al. 2006) |
|---|---|---|---|
| 1500 | |||
| 2400 | |||
| 2000 | |||
| 3200 | |||
| 5000 | |||
| 13,000 |
| 5000 | 0 | ||
| 20,000 | 0 |
| 1500 | 0 | 0 | |
| 3000 | |||
| 9000 | |||
| 5000 | 0 | 0 | |
| 1500 | |||
| 2500 | |||
| 2000 | 0 | 0 | |
| 1500 | |||
| 2400 | |||
| 1500 | |||
| 2400 | |||
| 100 | |||
| 100 | 0.397887 | 0.397887 | |
| 100 | 3 | 3 |
The main results are the minima of the particular benchmark functions (the two SA-DE columns), averaged over 50 independent runs. The only purpose of this comparison is to validate our implementation of the SA-DE method. Since our results concur with those of Brest et al. (2006), we conclude that the implementation is correct
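For reference, the self-adaptation scheme of SA-DE (jDE, Brest et al. 2006) lets each individual carry its own F and CR, which are regenerated with small probabilities before the DE/rand/1/bin step. A sketch of one trial-vector construction (variable names are ours):

```python
import random

def sa_de_trial(pop, i, Fs, CRs, rng, tau1=0.1, tau2=0.1, Fl=0.1, Fu=0.9):
    """Build one SA-DE trial vector for individual i.

    pop      : list of parameter vectors
    Fs, CRs  : per-individual control parameters F and CR
    With probability tau1 (tau2), F (CR) is regenerated; otherwise the
    individual's stored value is reused (jDE scheme)."""
    dim = len(pop[i])
    # self-adaptation of the control parameters
    F = Fl + rng.random() * Fu if rng.random() < tau1 else Fs[i]
    CR = rng.random() if rng.random() < tau2 else CRs[i]
    # DE/rand/1 mutation with three distinct individuals, binomial crossover
    a, b, c = rng.sample([j for j in range(len(pop)) if j != i], 3)
    j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
    trial = [pop[a][k] + F * (pop[b][k] - pop[c][k])
             if (rng.random() < CR or k == j_rand) else pop[i][k]
             for k in range(dim)]
    return trial, F, CR
```

In a full run, the trial vector replaces individual i if its fitness is at least as good, and the successful F and CR are stored back into `Fs` and `CRs`.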
Default parameterization for comparison purposes
| Variable | 15 | 5 | 10 | 20 | 40 |
Fig. 9 Example of hyperparameter estimation: the target sequence (dashed lines) had 4 spatial dimensions and length 50. The configuration was preprocessing on and early stopping on, meaning that only 33 of the 50 samples were used for training. The recall performance of a weak to reasonable default parameterization is shown in a. Applying our proposed hyperparameter optimization increased the performance by 28.7% without overfitting the data, see b (colour figure online)
Fig. 10 Default versus optimized hyperparameterization: comparison of learning performance when learning a multi-dimensional target sequence with different lengths; the network configuration was preprocessing on and early stopping on. These cases yield an average performance gain of approximately 43% over the default parameterization given in Table 9 (colour figure online)
Benchmark training sequences of irregular type; all 7 sequences were trained simultaneously
| Sequence number | Spatial composition |
|---|---|
| 1 | D, E, C, B |
| 2 | A, F, H, G |
| 3 | J, I, D, E |
| 4 | B, C, E, D |
| 5 | E, A, J, B |
| 6 | I, J, H, A |
| 7 | F, G, D, C |
Each spatial dimension contains an elementary sequence of irregular type (see Fig. 5 for the visualization of each spatial dimension). The arrangement of the spatial dimensions does not matter; it was composed randomly
Fig. 11 Default versus optimized hyperparameterization for multiple sequences trained simultaneously: the network configuration was preprocessing on and early stopping off. The larger the training dataset, the more worthwhile the evolutionary optimization becomes. Two cases show large differences in the R-value: 0.530 versus 0.811 (53.0% gain) and 0.171 versus 0.531 (210.5% gain, albeit from poor default performance). As an example, one optimized case is visualized in Figs. 12 and 13 (colour figure online)
Fig. 12 Simultaneous training of 7 benchmark sequences and their recall: sequences 1–4. See Fig. 13 for sequences 5–7 (colour figure online)
Fig. 13 Simultaneous training of 7 benchmark sequences and their recall: sequences 5–7 (colour figure online)
Fig. 14 Initial activation states of the context group in principal component (PC) space after training with 7 sequences simultaneously, using optimized hyperparameters. All 7 sequences are clearly separable in the activation space of each context group. From each initial activation state of the fast and slow context, the corresponding sequence can be recalled using the learned weights; see Figs. 12 and 13
Fig. 15 Robot task in Yamashita and Tani (2008) used to obtain sensory-motor data through kinesthetic teaching. In the home position, the robot faces a box (blue) on a workbench (grey). It reaches for and grasps the box. Then, the robot moves the box up and down three times, with its head cameras always focusing on the box by moving the head-neck joint accordingly. Finally, the robot returns to the home position (colour figure online)
Fig. 16 Left side: teaching data of the behaviour sequence consisting of the reaching, grasping, and up-down behaviour in Yamashita and Tani (2008); this behaviour sequence is visualized in Fig. 15. Right side: recall of the sequence (i.e. mental simulation) by our EO-MTRNN, here with default (non-optimized) hyperparameters reflecting a worst-case scenario. The EO-MTRNN is still able to sufficiently learn the sequence despite this adverse choice of hyperparameters. Note that the hyperparameters are not optimized in this case; the question of interest was whether the proposed network can still preserve the task structure in a single learning procedure without going through the evolutionary optimization process (colour figure online)
Fig. 17 Left side (a, c, e): teaching data of the behaviour sequence used for autonomous hyperparameter estimation (AHE). It encodes reaching for and grasping the box, and moving it left and right three times. Right side (b, d, f): dashed lines are the teaching data of the task shown to the robot in Fig. 15; these data differ from the AHE data on the left, which encode moving the box left and right instead of up and down. The solid lines are the mental simulation of the robot task by the EO-MTRNN. These results show generalization ability: the EO-MTRNN is able to sufficiently learn the task (reach for and grasp the box, move it up and down three times) although its hyperparameters were estimated on another task (reach for and grasp the box, move it left and right three times). The approximation can be further improved by increasing the number of generations or the number of epochs per individual of the network population (colour figure online)