Nina M van Mastrigt, Katinka van der Kooij, Jeroen B J Smeets.
Abstract
When learning a movement based on binary success information, one is more variable following failure than following success. Theoretically, the additional variability post-failure might reflect exploration of possibilities to obtain success. When average behavior is changing (as in learning), variability can be estimated from differences between subsequent movements. Can one estimate exploration reliably from such trial-to-trial changes when studying reward-based motor learning? To answer this question, we tried to reconstruct the exploration underlying learning as described by four existing reward-based motor learning models. We simulated learning for various learner and task characteristics. If we simply determined the additional change post-failure, estimates of exploration were sensitive to learner and task characteristics. We identified two pitfalls in quantifying exploration based on trial-to-trial changes. Firstly, performance-dependent feedback can cause correlated samples of motor noise and exploration on successful trials, which biases exploration estimates. Secondly, the trial relative to which trial-to-trial change is calculated may also contain exploration, which causes underestimation. As a solution, we developed the additional trial-to-trial change (ATTC) method. By moving the reference trial one trial back and subtracting trial-to-trial changes following specific sequences of trial outcomes, exploration can be estimated reliably for the three models that explore based on the outcome of only the previous trial. Since ATTC estimates are based on a selection of trial sequences, this method requires many trials. In conclusion, if exploration is a binary function of previous trial outcome, the ATTC method allows for a model-free quantification of exploration.
Keywords: Exploration; Motor learning; Reinforcement; Reward; Trial-by-trial analysis; Variability
Year: 2021 PMID: 34341885 PMCID: PMC8382626 DOI: 10.1007/s00422-021-00884-8
Source DB: PubMed Journal: Biol Cybern ISSN: 0340-1200 Impact factor: 2.086
Fig. 1 Example task and simulation parameters. The example illustrates a class of motor learning tasks involving binary reward feedback based on one dimension of the movement, with no feedback on error size and sign. Left panel: simulation parameters consist of two task parameters and three learner parameters. Participants aim for a target but might end up somewhere else due to motor noise and exploration in their movement. Upon the next attempt, the learner may adjust their aim point with a learning fraction. The gray screen blocks the sight of the hand, so that task feedback is limited to binary reward feedback. Right panel: three types of the reward criterion task parameter. A fixed reward criterion only rewards movements at the target. An adaptive reward criterion additionally rewards movements that are closer to the target than the mean or median of the past five attempts. A random reward criterion rewards a random 50% of the movements, independent of performance
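To make the three reward criteria concrete, here is a minimal sketch in Python. The function name `rewarded`, the reward-zone half-width of 1 unit, and the use of the mean of the past five attempts are illustrative assumptions based on the caption above; the adaptive variant shown uses the mean, while the paper also uses a median variant.

```python
import numpy as np

def rewarded(endpoint, target, criterion, history, rng, zone=1.0):
    """Return True if a movement endpoint earns binary reward.

    'fixed':    reward only endpoints within a fixed zone around the target.
    'adaptive': additionally reward endpoints closer to the target than the
                mean of the past five attempts (a median variant also exists).
    'random':   reward a random 50% of movements, independent of performance.
    """
    if criterion == "random":
        return rng.random() < 0.5
    in_zone = abs(endpoint - target) <= zone
    if criterion == "fixed":
        return in_zone
    if criterion == "adaptive":
        if not history:                      # no attempts yet: fixed zone only
            return in_zone
        reference = np.mean(history[-5:])    # mean of the past five attempts
        return in_zone or abs(endpoint - target) < abs(reference - target)
    raise ValueError(f"unknown criterion: {criterion!r}")
```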
Fig. 2 Learning models translated into terminology similar to that of van Beers (2009). See Table 1 for terminology. The top part describes the construction of a movement within a trial. The bottom part describes learning from trial to trial. Bold red lines indicate the situation following a non-successful trial; thinner green lines indicate the situation following a successful trial. In all models, an endpoint is constructed by adding exploration and motor noise to the aim point. The actual observable behavior is the endpoint, which is then rewarded or non-rewarded. Learning always involves adjusting the aim point. This might consist of updating the aim point only following success and not following failure (Therrien16, Therrien18, Cashaback19), but the aim point might also be corrected following failure (Dhawale19)
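The within-trial structure in the top part of Fig. 2 can be summarized in a short sketch; names and signatures are hypothetical, and only the additive endpoint construction is taken from the caption. Exploration is gated on previous-trial failure here, as in the Therrien16 and Cashaback19 models; Therrien18 uses a smaller non-zero exploratory variance post-success, and Dhawale19 scales exploration with reward history.

```python
import numpy as np

def construct_endpoint(aim, sigma_m, sigma_e, prev_success, rng):
    """Endpoint = aim point + motor noise + (possibly) exploration."""
    exploration = 0.0 if prev_success else rng.normal(0.0, sigma_e)
    motor_noise = rng.normal(0.0, sigma_m)   # inevitable, always present
    return aim + exploration + motor_noise, exploration
```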
Terminology

| Term | Description | Model |
|---|---|---|
| Time | Trial | All |
| Motor noise | Inevitable variability that is always present. Also called unregulated variability (Dhawale et al. 2019) | Cashaback19, Dhawale19, Therrien16, Therrien18 |
| Exploration | Variability that can be added and can be learnt from. Also called regulated variability (Dhawale et al. 2019) | Cashaback19, Dhawale19, Therrien16, Therrien18 |
| Input exploration | Exploratory variance used as simulation input (see note † to the next table) | All |
| Exploration estimated following non-successful trials | | |
| Exploration estimated following successful trials | | |
| (A)TTC estimate of exploration | | |
| Aim point | Mean of the probability density of movement endpoints given a certain ideal motor command (van Beers 2009) | All |
| End point | Observable movement outcome | All |
| Reward presence or absence | R = 0: no reward; R = 1: reward | All |
| Reward prediction error | Difference between actual reward obtained and predicted reward. RPE > 0: reward obtained; RPE < 0: no reward obtained | Dhawale19 |
| Low-pass filtered reward history | Low-pass filtered reward history of the τ previous trials | Dhawale19 |
| Reward-based learning parameter | Learning gain, adjustment fraction | Cashaback19, Dhawale19, Therrien16, Therrien18 |
| Reward rate update fraction | Gain of updating the reward rate estimate | Dhawale19 |
| Number of trials in reward history memory window | Inferred memory window for reinforcement on past trials, or the time-scale of the experimentally observed decay of the effect of single-trial outcomes on variability (Dhawale et al. 2019) | Dhawale19 |
Model parameters used for simulating learning. See Table 1 for abbreviations
Varying values; the first three columns are learner parameters, the last two are task parameters.

| Motor noise variance | Exploration variance (input)† | Learning parameter (α)‡ | Target amplitude (units of …) | Reward criterion |
|---|---|---|---|---|
| 1 | 1 | 0 | 0 | Random: 50% of trials rewarded, independent of performance |
| 4 | 4 | 0.1 | 2 | Adaptive (median): if EP < target: …; if EP within fixed reward zone: target − 1 ≤ EP ≤ target + 1; if EP > target: target − 1 ≤ EP ≤ … |
| | | 0.15 | 4 | Adaptive (mean): if EP < target: …; if EP within fixed reward zone: target − 1 ≤ EP ≤ target + 1; if EP > target: target − 1 ≤ EP ≤ … |
| 16 | 36 | 0.2 | 6 | Fixed: reward if EP within fixed reward zone: target − 1 ≤ EP ≤ target + 1 |
| 25 | 64 | 1 | 8 | Fixed with lower target fraction (target fraction = 2): reward if EP within fixed reward zone: target − 1 ≤ EP ≤ target + 1 |
† The input is equal to the exploratory variance used following non-successful trials in the models of Cashaback19 and Therrien16. Using a variability control function (see Eq. 10, Fig. 3), it defines the two exploratory variances in the Therrien18 model, and a whole range of variances in the Dhawale19 model
‡ The values 0.1, 0.15 and 0.2 are not used in the Therrien16 and Therrien18 models, as their learning parameter is fixed at 1 (Eq. 4)
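For reference, the varying values in the table can be encoded as a simple grid. The assignment of the two variance columns and the inclusion of the middle (default) row values are assumptions made for illustration.

```python
# Hypothetical encoding of the parameter grid above. Each simulation set
# varies one parameter over the listed values while the others keep their
# default values.
PARAM_GRID = {
    "motor_noise_variance": [1, 4, 16, 25],
    "exploration_variance": [1, 4, 36, 64],        # input exploration, note †
    "learning_parameter": [0, 0.1, 0.15, 0.2, 1],  # 0.1-0.2 unused in Therrien16/18 (note ‡)
    "target_amplitude": [0, 2, 4, 6, 8],
    "reward_criterion": ["random", "adaptive_mean", "adaptive_median",
                         "fixed", "fixed_low_target_fraction"],
}
```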
Fig. 3 Exploratory variability control functions. The variability control function has been fitted to the variances reported by Dhawale et al. (2019) (c = 19) and scaled to the variances reported by Therrien et al. (2018) (c = 54). This variability control function is used for scaling with the different input values of the exploration parameter (Eq. 10)
Fig. 4 Example learning curves. One example simulation for each model: α = 0.15 (for the Cashaback19 and Dhawale19 models), reward criterion = adaptive (mean), target amplitude = 4. The top row shows the movement endpoint behavior that the TTC method uses to estimate exploration. Each movement endpoint is composed of an aim point, motor noise and possibly exploration. The bottom row shows the underlying aim points that are adjusted during learning. Non-successful trials are largely covered by successful trials because, following failure, the aim point is not updated (or only slightly, in the Dhawale19 model)
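A sketch of how one such learning curve could be simulated for a Therrien16-style learner with a fixed reward criterion. The ±1 reward zone follows the parameter table above; whether the aim-point update uses the full rewarded deviation (as here) or only the exploration component differs between model formulations, so treat this as a sketch rather than the paper's exact model.

```python
import numpy as np

def simulate_learning(n_trials=1000, target=4.0, sigma_m=2.0, sigma_e=2.0,
                      alpha=1.0, zone=1.0, seed=0):
    """Therrien16-style learner: explore only post-failure; after success,
    shift the aim point by a fraction alpha of the rewarded deviation."""
    rng = np.random.default_rng(seed)
    aim, prev_success = 0.0, False
    endpoints = np.empty(n_trials)
    rewards = np.empty(n_trials, dtype=bool)
    for t in range(n_trials):
        exploration = 0.0 if prev_success else rng.normal(0.0, sigma_e)
        endpoints[t] = aim + exploration + rng.normal(0.0, sigma_m)
        rewards[t] = abs(endpoints[t] - target) <= zone   # fixed reward zone
        if rewards[t]:
            aim += alpha * (endpoints[t] - aim)           # learn from success only
        prev_success = rewards[t]
    return endpoints, rewards
```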
Fig. 5 Exploration estimated by the TTC method. Means and standard deviations have been calculated per simulation set of 1000 simulations for the four learning models. Colors indicate the learning model; symbols indicate the parameter that is varied: variance of motor noise and of exploration, learning parameter (α), target amplitude and reward criterion. For the Dhawale19 model, various simulations resulted in values outside the plotted range; error bars have only been plotted when the mean was within the axis range. a TTC estimates of exploration as a function of input exploration. The diagonals indicate the unity line, at which estimates are perfect. Note that most simulations have been run with the default exploration value. b–f Similarity ratio (SR) as a function of the parameters varied. Horizontal lines at SR = 1 indicate perfect estimation
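In code, the naive TTC logic amounts to comparing trial-to-trial variability split by the previous trial's outcome. The factor 2 (a difference of two independent draws has twice the single-trial variance) and the exact splitting are assumptions, not the paper's exact estimator.

```python
import numpy as np

def ttc_estimate(endpoints, rewards):
    """Naive TTC estimate of exploration variance: the additional
    trial-to-trial variability following failure compared to success."""
    changes = np.diff(endpoints)               # change from trial t-1 to t
    prev = np.asarray(rewards, bool)[:-1]      # outcome of trial t-1
    var_post_failure = np.var(changes[~prev]) / 2.0
    var_post_success = np.var(changes[prev]) / 2.0
    return var_post_failure - var_post_success

# Usage with the simulator sketched above:
#   ep, r = simulate_learning()
#   print(ttc_estimate(ep, r))
```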
Fig. 6 TTC estimation of exploration is sensitive to the balance between exploration and motor noise. We compare model behavior for an adaptive (a, b) and a random reward criterion (c, d). a, c Similarity ratio as a function of the ratio of the exploration and motor noise variance parameters. The horizontal line at SR = 1 indicates perfect estimation of exploration. Symbols and other details as in Fig. 5. b, d Draws of motor noise and exploration on successful and non-successful trials in five simulations of a simulation set of the Therrien16 model, resulting in 2500 samples plotted in total for the rewarded and non-rewarded panels together. Except for the reward criterion, the other parameters in the simulations have their default values. The data are split based on whether these trials were rewarded or not. Dotted lines indicate …
Fig. 7 Reference trial exploration causes the TTC method to underestimate exploration. Here, we simplify the behavior of the Therrien16 model by assuming that the learner is not noisy (motor noise set to zero) and does not adjust their aim point following success (α = 0). The only contributor to trial-to-trial change (blue) is hence exploration. The hands indicate different trials over time. The reference trial is indicated with bold lines, the target trial with dotted lines. a, b Success or failure determines exploration on the next trial: on trials post-failure (a), exploration is drawn from a distribution with non-zero variance; on trials post-success (b), exploration is zero. c–f Trial-to-trial change is calculated relative to a reference trial, here the successful or non-successful trial. This trial may (c, d) or may not (e, f) have contained exploration that contributes to the trial-to-trial change. Post-double failure (c), trial-to-trial change consists of the difference in exploration between the target and reference trial. Post-single failure or success (d, e), trial-to-trial change consists of the difference between one exploration draw and zero. Post-double success (f), trial-to-trial change is zero
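The four cases can be written out directly. Under the Fig. 7 simplification (no motor noise, α = 0), with e denoting an exploration draw with variance σ²ₑ and F/S the outcomes that determine exploration on the reference trial (t − 1) and target trial (t), a minimal worked example:

```latex
\begin{aligned}
\operatorname{Var}(\Delta \mid \mathrm{FF}) &= \operatorname{Var}(e_t - e_{t-1}) = 2\sigma_e^2 && \text{(post-double failure)}\\
\operatorname{Var}(\Delta \mid \mathrm{SF}) &= \operatorname{Var}(e_t) = \sigma_e^2 && \text{(post-single failure)}\\
\operatorname{Var}(\Delta \mid \mathrm{FS}) &= \operatorname{Var}(e_{t-1}) = \sigma_e^2 && \text{(post-success, reference explored)}\\
\operatorname{Var}(\Delta \mid \mathrm{SS}) &= 0 && \text{(post-double success)}
\end{aligned}
```

Depending on whether the reference trial contained exploration, the same post-failure change thus has variance 2σ²ₑ or σ²ₑ; an estimator that assumes a fixed contribution from the reference trial therefore mis-estimates σ²ₑ, which the ATTC method avoids by conditioning on the outcome preceding the reference trial.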
Fig. 8 The ATTC method. The ATTC method has been developed based on the Therrien16 model and uses four trial sequences to estimate variability from. The learner only explores post-failure. Post-success, the learner learns by updating the aim point (horizontal lines) with a fraction α of the rewarded exploration. In the figure, motor noise is not displayed (set to zero). In the ATTC method, the reference trial (in bold) has been moved from the (non-)successful trial to the trial preceding it. Exploration is estimated twice, by pairwise subtracting variability estimates calculated from trial-to-trial changes of trial sequences post-single failure and post-double success, depending on reference trial exploration (a, b vs. c, d). A weighted average of the two is used to obtain one exploration estimate. a, c Trial sequences ending with a single failure (t − 1). If trial t − 3 was non-successful (a), trial-to-trial change is a difference between two exploration draws: exploration on the target trial and a fraction (1 − α) of the reference trial exploration. If trial t − 3 was successful (c), trial-to-trial change only consists of exploration on the target trial. b, d Trial sequences ending with double success, from which post-success variability is calculated. If trial t − 3 was non-successful (b), trial-to-trial change is a fraction (1 − α) of the reference trial exploration. If trial t − 3 was successful (d), trial-to-trial change is zero
The ATTC method. Each three-trial reward sequence results in a different variance estimate, depending on the contribution of exploration draws and learning to the trials. Only sets C (corresponding to panels a, b in Fig. 8) and D (corresponding to panels c, d in Fig. 8) result in an estimate of exploration variance and are used in Eq. 12. A code sketch of the computation follows the table
| Set | Post-failure sequence (t − 3, t − 2, t − 1) | Trial-to-trial change | Post-success sequence (t − 3, t − 2, t − 1) | Trial-to-trial change | Exploration estimate (post-failure − post-success) | Weight |
|---|---|---|---|---|---|---|
| A | … | … | … | … | … | 0 |
| B | … | … | … | … | … | 0 |
| C | F, S, F | … | F, S, S | … | … | 3 |
| D | S, S, F | … | S, S, S | 0 | … | 1 |
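A sketch of the ATTC computation for a Therrien16-style learner, following Fig. 8 and the table above: trial-to-trial change is taken between trial t and the shifted reference trial t − 2, sequences are selected on the outcomes of trials t − 3, t − 2 and t − 1, and the two exploration estimates (sets C and D) are combined with the listed weights. The indexing and the weighted combination are assumptions; Eq. 12 itself is not reproduced in this record.

```python
import numpy as np

def attc_estimate(endpoints, rewards):
    """ATTC sketch: subtract post-success from post-failure variability of
    matched three-outcome sequences, separately for sets C and D."""
    ep = np.asarray(endpoints, float)
    r = np.asarray(rewards, bool)
    change = ep[3:] - ep[1:-2]               # target trial t minus reference t-2
    o3, o2, o1 = r[:-3], r[1:-2], r[2:-1]    # outcomes of trials t-3, t-2, t-1

    def var(mask):
        return np.var(change[mask]) if np.count_nonzero(mask) > 1 else np.nan

    # Set C (Fig. 8a, b): reference trial t-2 contains exploration (t-3 failed)
    est_c = var(~o3 & o2 & ~o1) - var(~o3 & o2 & o1)
    # Set D (Fig. 8c, d): reference trial contains no exploration (t-3 succeeded)
    est_d = var(o3 & o2 & ~o1) - var(o3 & o2 & o1)
    # Weighted average with the weights 3 and 1 listed in the table above
    return (3.0 * est_c + 1.0 * est_d) / 4.0
```

Because only four of the eight three-outcome sequences are used, many trials are needed before each conditional variance is well estimated, which is the sample-size cost noted in the abstract.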
Fig. 9 Exploration estimated by the ATTC method. Means and standard deviations have been calculated per simulation set of 1000 simulations for the four learning models. Colors indicate the learning model; symbols indicate the parameter that is varied: variance of motor noise and of exploration, learning parameter (α), target amplitude and reward criterion. For the Dhawale19 model, various simulations resulted in values outside the plotted range; error bars have only been plotted when the mean was within the axis range. a ATTC estimates of exploration as a function of input exploration. The diagonals indicate the unity line, at which estimates are perfect. Note that most simulations have been run with the default exploration value. b–f Similarity ratio (SR) as a function of the parameters varied. Horizontal lines at SR = 1 indicate perfect estimation