
Ten guidelines for designing motor learning studies.

Rajiv Ranganathan¹, Mei-Hua Lee¹, Chandramouli Krishnan².

Abstract

Motor learning is a central focus of several disciplines including kinesiology, neuroscience and rehabilitation. However, given the different traditions of these fields, this interdisciplinarity can be a challenge when trying to interpret evidence and claims from motor learning experiments. To address this issue, we offer a set of ten guidelines for designing motor learning experiments starting from task selection to data analysis, primarily from the viewpoint of running lab-based experiments. The guidelines are not intended to serve as rigid rules, but instead to raise awareness about key issues in motor learning. We believe that addressing these issues can increase the robustness of work in the field and its relevance to the real-world.


Keywords: Analysis; Experiments; Motor learning; Practice; Skill; Task

Year: 2022 PMID: 36032270 PMCID: PMC9406239 DOI: 10.20338/bjmb.v16i2.283

Source DB: PubMed Journal: Braz J Mot Behav ISSN: 1980-5586

INTRODUCTION

Motor learning is a central focus of several disciplines including psychology, kinesiology, neurophysiology, neuroscience, rehabilitation, and engineering. While this diversity of perspectives is a positive feature in terms of the development of new ideas and theories, it also brings associated challenges in terms of interpreting evidence and claims about motor learning. In our experience leading journal clubs, it is not uncommon to discuss a paper with a claim about “motor learning” in the title and end up questioning if the paper was really even about motor learning! A large part of this challenge is due to the fact that theoretical, conceptual, and methodological issues related to the design of motor learning experiments that are ‘common knowledge’ to researchers in one particular discipline may not always be accessible to researchers from other disciplines. To address this issue, we provide a set of ten guidelines to raise awareness about these issues and navigate the design and analysis of motor learning experiments (Table 1). For each of these decision steps, we discuss common pitfalls and suggest recommendations, citing examples from both classic and recent studies of motor learning. Although several of these factors have been emphasized in prior work [1-5], the goal of this article is to synthesize this tacit knowledge to provide a step-by-step guide through the entire process from task selection to data analysis and interpretation. The paper is primarily intended for early-career researchers who are new to the field, but we hope that the issues raised can also serve as a starting point for discussions during interdisciplinary collaborations.

Table 1.

Summary of the ten guidelines for motor learning studies. Each step in the process is listed with associated pitfalls and recommendations.

Steps	Pitfalls	Recommendations
Task Selection	• Lack of generalizability to real-world motor learning	• Explicitly define what the learning in the task entails • Balance realism and insight
Instructions and Feedback	• Lack of clear instructions and feedback could affect strategy and outcomes	• Provide clear instructions • Provide feedback consistent with the instructions and maintain motivation
Practice Duration	• Practice duration affects the process being studied • Practice duration affects reliability and sensitivity	• Characterize a full learning curve for the task
Groups	• Two-group studies may not provide a complete picture of the learning effects and may result in misleading effect sizes	• Perform extensive initial characterization with multiple groups • Add realistic control groups
Sample Size	• Low sample sizes result in low power and unreliable estimation of effect sizes	• Provide sample size justification
Manipulation Checks	• Interpretations can be ambiguous without manipulation checks	• Provide appropriate manipulation checks
Tests of Learning	• Test conditions do not always align with learning in the task • Learning effects could be masked when viewing short snapshots of retention	• Ensure that test conditions are aligned with the definition of what the learning in the task involves • Increase the number of trials in the retention tests
Dependent Variable	• Choice between multiple dependent variables increases researcher degrees of freedom • Some dependent variables do not have desirable measurement characteristics	• Check if the dependent variable shows conceptual alignment, good measurement properties and mechanistic insight
Measure of Learning	• Pre-tests can be problematic in some circumstances • Gain scores are not easily interpretable	• Minimize baseline imbalances by appropriate strategies • Exercise caution when baseline differences are inherent • Pre-register measures of learning and check for robustness of results with multiple measures
Processing Data	• Learning effects could be confounded by experimental artifacts (e.g., fatigue), processing artifacts, and statistical artifacts	• Add control experiments to reject experimental artifacts • Provide robustness checks for statistical artifacts • Provide open data and enhance transparency in visualization

STEP 0: DEFINING WHAT MOTOR LEARNING IS

One of the main barriers in motor learning research is that there is no universally accepted definition of motor learning across all disciplines and contexts [6,7]. Even at the behavioral level, definitions of learning have focused on several aspects including improvements in outcome, consistency, stability, persistence, adaptability, and automaticity [8]. Other definitions of motor learning have emphasized adaptation and reorganization of existing skills [9], changes in coordination dynamics [10], speed-accuracy tradeoffs [11], information pick up [12] and even decision making [13]. While there are certainly common characteristics across many definitions that fall under the classic view of learning as a “relatively permanent change in behavior” [4], it is important to note that the specific definition of motor learning adopted can have a major influence on many of the guidelines suggested here. For example, the question of how to measure learning can depend on whether learning is viewed as an improvement in the practiced skill (where learning would be characterized by a retention test with the same task goal and same practice conditions), an improvement in the adaptability or flexibility [14] (where learning would be characterized in terms of achieving the same task goal under different conditions), a general change in the movement capability (where learning would be characterized by a transfer test to examine the generality to other task goals that were not practiced), or a change in the underlying movement repertoire (where learning would be characterized by a ‘scanning’ paradigm examining the stability of different coordination patterns [15]). In this paper, we focus on general guidelines that we believe apply to many motor learning contexts, but these guidelines always have to be considered within the context of how motor learning is defined in that specific context.

STEP 1: TASK SELECTION

The task has a critical role in motor learning and is perhaps the biggest source of interdisciplinary differences. Given the criticism of ‘applied research’ for resulting in “disconnected pockets of data” that are unsuitable for the development of general scientific principles [16], the goal in lab-based settings has been to use somewhat artificial tasks to isolate specific aspects of motor learning (e.g., sequence learning, reducing variability). Therefore, choosing a task in lab settings needs to address two issues: (i) provide the feasibility to do experiments (e.g., tasks that can be learned in relatively short periods of time) and (ii) provide insight that is generalizable to real-world tasks.

Pitfalls

The use of ‘simple’ motor tasks can threaten generalizability to ‘real-world’ motor learning [1,5]. While it is not trivial to define objective a priori criteria for what makes a task simple, such tasks can be roughly characterized as: (i) tasks where learning primarily involves figuring out the task goal and/or using existing movement capabilities to solve the task (rather than requiring a true change in the movement capability), and (ii) tasks with a lack of ‘intrinsic information’, which makes it difficult for a learner to judge their performance on their own, without feedback from the experimenter [17,18]. While it is likely difficult to identify these characteristics directly, they can be examined in the data, in which learning of such simple tasks is likely to be reflected as: (i) a sudden and very rapid improvement in performance over a few trials of practice [5] with very little subsequent improvement (indicating that participants ‘figured out’ what the task goal was with no change required in movement capability), and (ii) a dramatic drop-off in performance when conditions are slightly changed (e.g., when performing the task after a delay or when withholding augmented feedback). For example, early work examining the role of knowledge of results has been criticized for using such tasks (e.g., draw a line of length 50 cm) where the learner does not have sufficient intrinsic sources of information, and therefore, has to rely on augmented feedback to even understand what the task goal is [18].

Recommendations

Explicitly define what the “learning” in the task entails.

In view of improving our understanding of how studies from an experimental paradigm relate to others (including real-world learning), it is critical to answer two questions: (i) what is the problem that the learner has to solve to achieve good performance in the task, (ii) how much of this problem is known to the learner prior to learning (versus being discoverable only through practice)? For example, while most studies provide a ‘score’ that participants have to maximize or minimize, there may be important differences in how much participants ‘know’ about the task. In some cases, participants may know exactly how they need to improve this score (e.g., landing the ball closer to the target will result in higher scores). On the other hand, in typical reinforcement learning paradigms, the learner has no explicit knowledge of what results in higher scores and has to discover this relation through trial-and-error exploration. Making these types of distinctions explicit in the task description may be a critical step in separating out different types of learning studies, which potentially can lead to a better understanding of how findings from particular experimental paradigms generalize to other contexts.

Balance ‘realism’ and ‘insight’.

Given the inherent tradeoff between ecological validity and experimental control, it might be useful to consider tasks that resemble real-world contexts while also providing sufficient richness of measurement beyond just the task outcome [19]. For example, Haar and colleagues [20] devised a real-world pool task that measured the outcome (e.g., directional error), but also allowed for accurate measurement of whole-body kinematics and EEG. This is not to imply that ‘more variables are better’, but that well-justified dependent variables that go beyond the task outcome can provide insight into learning (also see Step 8). In addition, using tasks that allow for ‘multiple pathways’ to learn the task also allows insights into “how” learning occurred. For example, Sternad and colleagues devised a throwing task [21,22] where the main goal for participants is to reduce the variability of the task outcome, but the task was designed in a way that this improvement in variability could be achieved in multiple ways - decreasing noise, moving to more error-tolerant workspaces, or covarying movement parameters. The advantage of such tasks is that one can go beyond typical task outcome measures (where learning is almost certainly bound to have an effect) and look at the effects of learning at other levels of analysis (where the results may be more informative because it may not be easy to predict what the effects of learning may be) and the associated search strategies in the perceptual-motor workspace [23,24].

STEP 2: INSTRUCTIONS AND FEEDBACK

Once the task is defined, the next step is to decide on the instructions to the participant and the associated feedback that is given to the participant. As mentioned in Step 1, one of the important aspects of the instructions is how they define the task goal for the participants. In addition to instructions, participants also typically receive some feedback during the task (usually provided using a score). There is an extensive body of literature on the effects of instructions and feedback (e.g., manipulating attentional focus [25]). In our view, given the powerful ways that instructions and feedback can channel learning, they should be considered carefully even in tasks where they are not the primary experimental manipulation. First, in the absence of clear unambiguous instructions, the learning problem may be “ill-defined” as participants may perceive the task goal differently from the way the experimenter intends it. For example, an instruction to “move as quickly as possible” could be interpreted as minimizing reaction time, minimizing movement time or minimizing the total response time. Second, instructions or feedback may qualitatively change the strategy adopted to solve the task, especially when there are multiple, competing demands (e.g., “as fast and accurate as possible”). For example, in a speed-accuracy tradeoff experiment, even when other experimental parameters were closely matched, performance strategies under “time minimization” instructions (i.e., reaching to a target as quickly as possible) were qualitatively different from performance under temporal accuracy constraints (i.e., reaching a target as accurately as possible in a specified time) [26]. Similarly, feedback can also serve as a task constraint. For example, allowing for greater ‘tolerance’ in temporal precision changed the nature of the speed-accuracy function [27]. Given the recent emphasis on measuring speed-accuracy tradeoff functions as a measure of skill [11], these examples highlight the critical role of instructions and feedback in motor learning experiments.

Provide clear instructions.

Clear instructions reduce uncertainty about the task goal and reduce between-subject variation in interpreting the tasks (which in turn reduces the need to eliminate ‘outliers’). Adding a ‘familiarization’ block at the start of the experiment can also help provide a quick real-time check as to whether participants indeed understood the instructions.

Provide feedback consistent with the instructions.

The feedback to participants should closely align with the instructions. This not only ensures that participants understand their performance but, as seen later in Step 8, also allows for clearly defining dependent variables that can be used to track participants’ compliance with the instructions. For example, in a steering task [28], the task instructions were to move a cursor from the start to finish as fast as possible, while staying within the boundaries of a channel. The feedback, therefore, was a score with two terms - one that penalized the overall time and one that penalized the time outside the channel. By varying the weights of individual terms in an integrated score, it is possible to channel participants toward different strategies [29]. In addition to its informational role, feedback also has a motivational role [4] - so it may also be helpful to think of ways to keep participants engaged in the task over long periods of time - e.g., using visual or sound effects to indicate good and bad performance [4], or providing a summary score at the end of a block of trials.
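As a minimal sketch of how such an integrated score might be built, the snippet below combines the two penalty terms with adjustable weights. The function name, the linear form, and all numerical values are illustrative assumptions, not the scoring rule used in the cited studies.

```python
def integrated_score(total_time_s, time_outside_s, w_time=1.0, w_outside=1.0):
    """Hypothetical penalty score (lower is better) for a steering task.

    Combines a penalty on total movement time with a penalty on time
    spent outside the channel. The linear weighting is an assumption.
    """
    if min(total_time_s, time_outside_s, w_time, w_outside) < 0:
        raise ValueError("times and weights must be non-negative")
    return w_time * total_time_s + w_outside * time_outside_s


# Two hypothetical trials: a fast but sloppy one and a slow but careful one.
fast_sloppy = {"total_time_s": 2.0, "time_outside_s": 0.8}
slow_careful = {"total_time_s": 3.0, "time_outside_s": 0.1}

# Weighting the out-of-channel term heavily favors the careful trial...
accuracy_weighted = (
    integrated_score(**fast_sloppy, w_outside=5.0),
    integrated_score(**slow_careful, w_outside=5.0),
)
# ...while weighting total time heavily favors the fast trial.
speed_weighted = (
    integrated_score(**fast_sloppy, w_time=5.0),
    integrated_score(**slow_careful, w_time=5.0),
)
```

Under these assumed weights the same pair of trials is ranked in opposite order, which is one way the choice of weights can steer participants toward a speed-focused or accuracy-focused strategy.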

STEP 3: PRACTICE DURATION

Given the focus on ‘simple’ tasks, a large majority of motor learning studies tend to rely on a single session of practice, with possibly the addition of a 24-h retention test [4]. However, justifications for such practice durations are rarely specified explicitly. It is perhaps not surprising then that in addition to the type of tasks studied, these extremely short practice durations are another key barrier in translating findings from motor learning experiments to real-world learning or rehabilitation. First, from a theoretical standpoint, when the duration of practice is not clearly justified, it becomes unclear what process is being studied. Several theories have proposed that learning occurs in stages, with more ‘cognitive’ processes being engaged early on, followed by more automatic performance later in learning [16,30]. Second, from a measurement standpoint, the duration of practice impacts the reliability and sensitivity of the dependent variable. When practice durations are short, both within- [31] and between-subject [32] variability tends to be high, making measurements less reliable (although these factors depend to a large degree on the task and the stage of learning [33]). When practice durations are long, performance can reach a plateau, making the dependent variable less sensitive to detecting changes between different groups.

Characterize a full learning curve for the task.

Rather than rely on preliminary pilot data, we suggest that initial experiments with a new paradigm involve learning over a relatively extended period of time so that the full learning curve can be characterized. The actual duration of this period will depend on the task, with the goal of trying to estimate when the performance reaches a relative plateau. For example, Reis et al. [11] examined the effect of brain stimulation on a speed-accuracy task over 5 days, with retention periods up to 3 months. This type of initial characterization of tasks has three advantages - (i) it allows subsequent studies to potentially run shorter experiments depending on the research question, (ii) it can be helpful in determining if ‘non-significant differences’ between groups are due to performance plateaus, and (iii) it also allows for a more natural interpretation of effect sizes in terms of ‘practice time saved’. For example, Day et al. [34] determined that abbreviated exposures to split-belt walking had the same effect as 4 days of practice at the task. This type of ‘natural effect size’ may be more intuitive in conveying the magnitude of effects for motor learning studies than relying on default standardized effect sizes like Cohen’s d [35,36].
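One way to make ‘practice time saved’ concrete is sketched below. It assumes that error in a reference group follows a single exponential learning curve; this is a common modeling simplification, not something the guideline prescribes. The curve parameters and the intervention result are hypothetical.

```python
import math


def exp_curve(t, asymptote, gain, tau):
    """Error after t units of practice under an assumed exponential curve."""
    return asymptote + gain * math.exp(-t / tau)


def practice_to_reach(level, asymptote, gain, tau):
    """Invert the curve: practice needed for the reference group to reach `level`."""
    if not asymptote < level <= asymptote + gain:
        raise ValueError("level must lie between the asymptote and initial error")
    return -tau * math.log((level - asymptote) / gain)


def time_to_plateau(tau, tolerance=0.05):
    """Practice after which only `tolerance` of the initial gain remains."""
    return -tau * math.log(tolerance)


# Hypothetical reference curve: error falls from 10 to 2 (arbitrary units), tau = 1.5 days.
ref = dict(asymptote=2.0, gain=8.0, tau=1.5)

# Suppose an intervention group reaches an error of 4.0 after 1 day of practice.
days_needed_by_reference = practice_to_reach(4.0, **ref)
practice_time_saved = days_needed_by_reference - 1.0
plateau_day = time_to_plateau(ref["tau"])
```

The difference between the practice the reference group needs and the practice the intervention group actually received gives an effect size in units of days. The plateau estimate gives a rough guide to how long the initial characterization study would need to run.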

STEP 4: GROUPS

Most motor learning experiments that attempt to examine the effect of training paradigms rely on ‘between-group’ manipulations since it is reasonable to assume that motor learning cannot be ‘washed out’. In this regard, motor learning studies have relied heavily on ‘two group’ comparisons - e.g., studies on variable practice typically have a ‘constant’ group and a ‘variable group’ [37]. However, defining these experimental groups is critical in determining the insights that can be gained. The main pitfall of ‘two-group’ studies is that most manipulations involve a continuous variable that has non-monotonic effects on learning - i.e., there is an optimal parameter value or challenge point [38] at which learning effects are maximal, and the learning effects decrease on either side of this optimal value. Therefore, construction of two groups based on parameter values (e.g., selecting a low variability and high variability group) can yield different results depending on where the parameter values of these groups lie relative to the optimal value [2]. Second, even in cases where the groups are based on a variable that is categorical, two-group studies typically tend to be designed in a way to ‘exaggerate’ differences between groups [39], which can create misleading effect sizes and limit generalizability and real-world relevance.
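The dependence of two-group results on where the groups sit relative to the optimum can be illustrated with a toy model. The inverted-U function and every parameter value below are hypothetical and chosen only to show the logic.

```python
def learning_benefit(x, optimum=0.5, peak=10.0, curvature=40.0):
    """Hypothetical inverted-U relation between a practice parameter and learning."""
    return peak - curvature * (x - optimum) ** 2


def two_group_difference(low, high):
    """Learning benefit of the 'high' group minus that of the 'low' group."""
    return learning_benefit(high) - learning_benefit(low)


# Same underlying curve, three different choices of 'low' and 'high' parameter values.
below_optimum = two_group_difference(0.1, 0.4)   # 'high' group appears better
above_optimum = two_group_difference(0.6, 0.9)   # 'high' group appears worse
straddling = two_group_difference(0.3, 0.7)      # no apparent difference
```

All three comparisons sample the same curve, yet they would support opposite conclusions about whether ‘more’ of the manipulated variable helps learning.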

Characterize ‘dose-response curves’ across multiple parameter values.

Initial studies should characterize the dose-response across a reasonable range of parameter values to detect the presence of any optimal values. For example, an early study on the effect of distributed practice used five groups with inter-trial intervals ranging from 0 to 60 seconds and found no non-monotonic relation on performance or retention [40]. On the other hand, a study on summary feedback used multiple groups and found a non-monotonic change - i.e., both too frequent and too infrequent feedback had similar effects on learning [41]. While the actual research question is always critical in deciding how many groups are required, this type of extensive initial characterization with groups can help justify the choice of the number of groups (and the specific parameter values chosen for these groups) for subsequent studies using the same tasks.

Add realistic control groups.

Even when a variable is clearly categorical, adding ‘realistic’ control groups that represent reasonable choices for learning the task can provide greater clarity on the effect. A recent review characterizes in detail the different types of control groups that can be used in motor learning research [42]. One particularly important factor that adds a realistic control group in the context of motor learning is ‘practice specificity’ - i.e., when practice conditions match the test conditions [43,44]. For example, when examining the effects of variability on learning and consolidation, Wymbs et al. [45] used a number of control groups (including adding a ‘practice specificity’ group where no variability was added) to fully tease out the effect of adding variability. These types of data are not only useful for future studies, but also change the focus of motor learning studies from “whether” a particular type of practice matters to motor learning (which is almost always a “yes” in our view based on the idea that the ‘nil’ hypothesis of any intervention having zero effect is almost never true [46]) to “how much” it matters (which is much more informative).

STEP 5: SAMPLE SIZE

Sample size is one of the most important factors in experimental design. Most motor learning experiments tend to be small (typical sample sizes of 10–16/group) [2,47], likely indicating that sample size has become a heuristic determination rather than a justified one. A difficulty with performing typical justifications with a priori power analysis is that tasks tend to vary between studies [2], making it difficult to extract effect sizes from prior work. The pitfalls of low sample sizes have been extensively covered in other literature [48,49] - so we just highlight two main points: (i) low sample sizes have low power, which can lead to missing true effects, and (ii) a literature with low-powered studies means that published effects will tend to have inflated effect sizes [48].
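To give a rough sense of what typical group sizes imply for power, the sketch below uses a normal approximation to a two-sided, two-group comparison. The effect size of d = 0.5 is an assumed value for illustration, and the approximation is slightly optimistic for small samples compared with an exact t-test calculation.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(n_per_group, d, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    noncentrality = d * math.sqrt(n_per_group / 2)
    return _STD_NORMAL.cdf(noncentrality - z_crit) + _STD_NORMAL.cdf(-noncentrality - z_crit)


def approx_n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to detect effect size d."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_crit + z_power) / d) ** 2)


# Assumed standardized effect (d = 0.5) with 12 participants per group.
power_n12 = approx_power(12, d=0.5)
n_for_80 = approx_n_per_group(d=0.5)
```

Under these assumptions, a design with 12 participants per group has power well below the conventional 0.80 target, and the per-group sample size needed to reach that target is several times larger.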

Justify sample size.

Sample size justification statements are critical for judging the informational value of a study. Other than a priori power analysis, these can include other types of justifications such as accuracy or resource constraints [50]. McKay et al. [51] provide an example of sample size justification where the sample size was based on a prior study, with adjustments to the p-value for sequential analysis. In light of Steps 3 and 4, which raise the need for initial characterization of long-term studies with multiple groups, it is critical that these studies also have sufficient sample sizes for reliable estimation of effect sizes.

STEP 6: MANIPULATION CHECKS

Manipulation checks refer to tests that examine if the designed intervention had the desired effect on the participant. While certain types of interventions do not require such checks since they are directly under the control of the experimenter (e.g., the inter-trial interval), other variables are designed to elicit specific responses during practice (e.g., increasing motor variability using force perturbations). Given the emphasis that performance during practice may not be reflective of ‘true’ learning (i.e., the learning-performance distinction) [52], there is sometimes a tendency in motor learning studies to only focus on the end-result without examining if an intervention had the desired effect ‘during’ practice. Without a manipulation check, interpretations of results can be ambiguous. For example, a study on contextual interference (typically manipulated using a blocked vs. random schedule) needs a manipulation check that the random schedule indeed created more interference during practice. This manipulation check is usually done by examining if there were higher errors during practice for the random group. Without this check, it is difficult to disambiguate a true ‘null result’ (i.e., contextual interference did not have a significant influence on learning) from a ‘failed manipulation’ (i.e., the experimenter’s attempt at introducing contextual interference was not successful).
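A descriptive version of the contextual interference check could look like the sketch below, which compares practice errors between the two schedules using a standardized mean difference. The data are hypothetical, and a full analysis would add an interval estimate or a formal test rather than rely on the sign of the difference alone.

```python
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference (a minus b) using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Hypothetical mean practice errors (arbitrary units), one value per participant.
random_schedule = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]
blocked_schedule = [3.9, 4.2, 3.6, 4.4, 4.0, 3.8]

# Manipulation check: did the random schedule produce more error during practice?
practice_error_d = cohens_d(random_schedule, blocked_schedule)
manipulation_plausible = practice_error_d > 0
```

If the difference were near zero or negative, a null result on the retention test could not be cleanly attributed to contextual interference having no effect.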

Provide manipulation checks.

Manipulation checks can vary depending on the type of study. In some cases, manipulation checks can be done using variables that are already directly recorded during data collection. For example, Cardis et al. [53] showed that perturbations of the null and task space increased variability in these dimensions. In other cases, there may be the need for separate manipulation checks - e.g., Grand et al. [54] used both the intrinsic motivation inventory and electroencephalography to examine if the manipulation resulted in increased motivation. When manipulation checks ‘fail’, it might be critical to consider all aspects of experimental design mentioned in Steps 1–5 (the instructions to the participant, the amount of practice, etc.) as well as the measurement resolution of the variable being used for the manipulation check. However, one caveat when introducing these manipulation checks is to also minimize the possibility that the manipulation checks themselves change the behavior of the participant [55].

STEP 7: TESTS OF LEARNING

Perhaps one of the most important insights in motor learning is the ‘learning-performance’ distinction - i.e., that changes in performance during practice are not always indicative of learning [3,52]. This insight has resulted in the use of separate ‘tests’ of learning (e.g., delayed retention tests or transfer tests) as conditions for measuring ‘true’ learning. This raises a critical question as to what test is most applicable for measuring learning. First, there has been an overreliance on the “24h No-KR test” as the ‘only’ acceptable condition for measuring learning [17]. While the 24h No-KR condition can be a good indicator of learning in some cases, both the “24h” and the “No-KR” portions of the test can sometimes be problematic. The 24h period is intended for dissipating temporary effects, but whether this makes it a ‘true’ test of learning may depend to a large degree on the task and the timescale of learning involved. For example, rehabilitation paradigms that attempt to change movement capacity over several weeks typically involve follow-up tests in timescales of weeks or months. Likewise, the No-KR part of the test is intended to provide a ‘stable’ performance by minimizing possibilities for learning “during” the test. However, in many cases, because information about performance is available naturally through sensory information outside of “augmented feedback” provided by the experimenter, trying to prevent learning during the test has led to artificial measures such as ‘blindfolding’ the individual and/or using headphones playing white noise. In such cases, the dramatic change in feedback information may induce the learner to adopt qualitatively different strategies during the test, which are unrelated to the original learning. Second, these tests of learning are typically designed to measure only a ‘snapshot’ of the process (usually as short as 10 trials). This means that the snapshot itself can be unrepresentative of true learning because it can include temporary effects such as warm-up decrement, especially when administered after a time delay, or when there is a sudden change in the context [17].

Justify test conditions.

We recommend that the test conditions are aligned with the definition of what the learning in the task involves (mentioned in Step 1) and what aspects of information are considered ‘part of the task’. For example, Lokesh and Ranganathan [28] used a haptic feedback study where the test involved removal of haptic feedback, but retained concurrent visual information of the cursor because this visual information was considered part of the task. Similarly, the timeline of such a test can be aligned with respect to the total learning time and other task factors such as fatigue. For example, a 24 h retention test may be justifiable for tasks with a one-day practice duration, but longer timescales of learning may potentially require longer time intervals. For example, in a balancing task that was practiced over 6 weeks, the retention test was made 3 months after the last practice block [56].

Make tests more than a single ‘snapshot’.

Including a retention test with a relatively large number of trials can provide the ability to separate out temporary and relatively permanent effects. For example, studies of massed and distributed practice show that while there are effects in the first few trials of the next day, these differences are quickly eliminated within the next few trials of practice [57]. A further alternative is to consider a series of retention periods - for example, Reis et al. [11] measured retention at 5 time points from 3 days until approximately 3 months after the last practice. This type of data could characterize retention over time, which is critical for studies that examine learning in the context of a change in the movement capability. In addition to testing over time, transfer tests [1,58,59], which involve testing in conditions that were not originally practiced, can also provide a richer description of learning. However, a critical piece in transfer tests is determining the appropriate transfer conditions so that it is clear how these transfer conditions relate to the originally practiced task. van Rossum [60] emphasizes the use of task analysis as a means of designing transfer tests so that the results from transfer tests can yield useful information about what was learned. For example, in sequential timing tasks, transfer tests with the same or different relative timing patterns as those during practice were used as a critical test to distinguish whether the benefits of variable practice were due to the formation of motor schemata or contextual interference effects [61,62].
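The value of a longer retention test can be illustrated by splitting it into early and late trials. The sketch below uses hypothetical error values for two groups, and the choice of five ‘early’ trials is an arbitrary assumption.

```python
from statistics import mean


def early_late_means(retention_errors, n_early=5):
    """Split a retention test into its first n_early trials and the remainder."""
    if not 0 < n_early < len(retention_errors):
        raise ValueError("n_early must leave trials in both segments")
    return mean(retention_errors[:n_early]), mean(retention_errors[n_early:])


# Hypothetical retention-test errors for two groups over 15 trials.
group_a = [9, 8, 7, 6, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]  # transient early deficit
group_b = [5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]

a_early, a_late = early_late_means(group_a)
b_early, b_late = early_late_means(group_b)
early_gap = a_early - b_early  # difference visible in a short snapshot
late_gap = a_late - b_late     # difference once the transient has dissipated
```

In this constructed example, a five-trial snapshot would suggest a group difference that disappears over the remaining trials, which is the pattern expected from a temporary effect such as warm-up decrement.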

STEP 8: DEPENDENT VARIABLES

Dependent variables in motor learning studies have to accomplish two main goals - (i) the variable sufficiently captures the improvement in task performance, and (ii) it provides ‘insight’ into how improvements are occurring. Part of the challenge in defining these variables is that motor performance is multidimensional and there is usually no unambiguous choice of variables. For example, in a golf putting task, the dependent variables used to measure learning could be (i) the percent of putts made, (ii) the number of points scored in a scheme defined based on the distance from the hole (e.g., 10 points for making the putt), (iii) the absolute error in terms of the distance from the target, or (iv) the variable error in terms of the consistency of the putts. In some cases, the dependent variable may even be derived ‘post hoc’ at the data analysis stage - e.g., the number of putts that were within 1 inch of the hole, or a score that weights the absolute error in the putting task by the time taken to prepare for the putt. This multiplicity creates two problems. First, the choice between several dependent variables (especially when they are derived post hoc) can inflate researcher degrees of freedom [63]. For example, in the golf putting task described above, the choice between many possible dependent variables can result in a situation where some variables show statistically significant differences but others do not. If undisclosed, such flexibility in the choice of dependent variables can lead to a large increase in the false positive rate [63]. Second, some dependent variables do not have desirable characteristics either from a measurement standpoint or a mechanistic standpoint. For example, from a measurement standpoint, measures such as number of putts made can be insensitive to changes in the magnitude of errors. In terms of mechanistic insight, some dependent variables may not provide sufficient information about the performance. For example, in 2D aiming tasks, using only a scoring system (or radial error) may mask important changes in bias or variability [64,65].
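The point about radial error masking bias and variability can be illustrated with a minimal sketch. The endpoint coordinates below are hypothetical values chosen for illustration (target at the origin, units in cm), and the function names are our own; this is not the analysis used in the cited studies.

```python
import math

def centroid(points):
    """Mean endpoint (x, y) across trials."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def mean_radial_error(points, target=(0.0, 0.0)):
    """Average distance of each endpoint from the target."""
    return sum(math.dist(p, target) for p in points) / len(points)

def bias(points, target=(0.0, 0.0)):
    """Distance of the mean endpoint from the target (systematic error)."""
    return math.dist(centroid(points), target)

def variability(points):
    """Root-mean-square distance of endpoints from their own centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))

# Hypothetical endpoints (cm): one tight cluster away from the target,
# one cluster centred on the target but widely scattered.
biased_but_consistent = [(3.0, 0.5), (3.0, -0.5), (3.5, 0.0), (2.5, 0.0)]
unbiased_but_variable = [(3.0, 0.0), (-3.0, 0.0), (0.0, 3.0), (0.0, -3.0)]

for label, pts in [("biased", biased_but_consistent),
                   ("variable", unbiased_but_variable)]:
    print(label,
          round(mean_radial_error(pts), 2),
          round(bias(pts), 2),
          round(variability(pts), 2))
```

In this toy example both sets of endpoints have a mean radial error of about 3 cm, yet one reflects a systematic offset and the other reflects scatter, which would likely call for different interpretations of what the participant needs to learn.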

Check if the dependent variable shows conceptual alignment.

From a conceptual standpoint, the dependent variable should closely align with the instructions and feedback to the participant. As Newell [66] states, “it seems unreasonable to evaluate the effect of an independent variable primarily through criteria different from that originally stressed to the subject.” Moreover, alignment with instructions and feedback reduces researcher degrees of freedom by minimizing the potential for ‘post hoc’ derived dependent variables. For example, in a gait tracking task that required participants to match a target template with their foot, Krishnan et al. [67] define the error measure only in the spatial dimension using an area metric (as opposed to, say, RMSE, which would require assumptions about a temporal component that was not part of the instructions). In cases where the dependent variable may not be apparent immediately from the instructions and is a combination of multiple variables (say in a task that emphasizes both speed and accuracy), it might be ideal to pre-register this analysis so that there is transparency about flexibility in data analysis [63].

Check if the dependent variable shows good measurement properties.

From a statistical standpoint, dependent variables should be sensitive enough to track changes with learning. Although measures such as error rates or % success may be used in some contexts because they have functional significance (i.e., in a real game, “near-misses” do not count), in general they may be too coarse-grained to detect differences. While it is desirable to have these dependent variables in real-world units (e.g., error measured in cm), this may not always be feasible and may require other approaches. For example, studies of the free throw have used a point-based system (e.g., 5 points for a swish) to improve the sensitivity of the dependent variable [68]. However, it is important to note that any such scoring system should also be evaluated on a conceptual basis - i.e., if participants get scored higher for a free throw that goes directly in the hoop compared to one that goes in after bouncing off the rim or the backboard, this information should also be communicated to the participant through instructions.

Check if the dependent variable provides mechanistic insight.

Finally, consider dependent variables that yield ‘insight’ into the underlying process of motor learning. In many cases, these may be considered ‘secondary’ variables that help delineate different processes. For example, the dependent variable in many aiming tasks is an absolute error (AE), but this measure is a combination of constant error (CE) and variable error (VE) [69]. Therefore, while the AE can be the ‘primary’ dependent variable (based on conceptual grounds that it was the instruction to the participant) [66], breaking down the AE into secondary variables CE and VE can provide additional insight into ‘how’ participants got better. Similarly, Lee et al. [70] use movement time as the primary dependent variable in a cursor control task, but also measure the path length to examine if increased movement times are due to changes in movement speed (which would not change path length) or taking more circuitous paths (which would result in increased path length).
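As a minimal sketch of this decomposition, the snippet below computes CE, VE and AE from signed errors on a one-dimensional aiming task. The error values are hypothetical, the function names are our own, and VE is computed here as the population standard deviation about CE (an assumption; other conventions exist). Note that the exact identity relates CE and VE to the root-mean-square (total) error E, with E² = CE² + VE²; AE combines the two components but is not an exact sum of them.

```python
import math

def constant_error(errors):
    """CE: mean signed error (systematic bias)."""
    return sum(errors) / len(errors)

def variable_error(errors):
    """VE: standard deviation of signed errors about CE (population form)."""
    ce = constant_error(errors)
    return math.sqrt(sum((e - ce) ** 2 for e in errors) / len(errors))

def absolute_error(errors):
    """AE: mean unsigned error."""
    return sum(abs(e) for e in errors) / len(errors)

def total_error(errors):
    """E: root-mean-square error, which satisfies E**2 == CE**2 + VE**2."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Hypothetical signed errors (cm) early and late in practice.
early = [4.0, 6.0, 5.0, 7.0, 3.0]
late = [-2.0, 2.0, -1.0, 1.0, 0.0]

for label, errs in [("early", early), ("late", late)]:
    print(label,
          "AE=%.2f" % absolute_error(errs),
          "CE=%.2f" % constant_error(errs),
          "VE=%.2f" % variable_error(errs))
```

In this toy data set AE drops from 5.0 to 1.2, but the secondary variables show that the improvement comes entirely from removing the bias (CE goes from 5.0 to 0.0) while the consistency of the movements (VE) is unchanged.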

STEP 9: MEASURE OF LEARNING

The measure of learning used in the analysis is critical, as it forms the basis for inferences drawn from the experiment. Similar to the debate about measuring learning in many other fields, there are two distinct philosophies for measuring motor learning - proficiency (i.e., how good is performance after practice) and growth (i.e., how much has performance changed after practice), and a multitude of measures have been used to quantify learning based on both approaches [71]. Proficiency-based measures typically rely on the absolute performance level at the end of practice (e.g., on a retention test) to measure learning. On the other hand, growth-based measures typically use a pre-post experimental design, where the change between initial and final performance levels is compared across groups. This change can be (i) a gain score (i.e., final performance - initial performance), (ii) a gain score that is ‘normalized’ in some way (e.g., final performance/initial performance × 100) or (iii) a post-test score or gain score that uses the pre-test score as a covariate. In addition, the ‘rate’ of learning is often also computed as a way to capture if one group learns faster than the other. As in the case with the choice of dependent variable, the presence of multiple measures to assess learning can create a challenge, as it increases researcher degrees of freedom. Beyond this flexibility, each type of measure has specific limitations. First, the use of ‘pre-tests’ in motor learning has been criticized on two grounds [4] - (i) that pre-test scores in motor learning are unreliable because early trials at a task generally are extremely variable and show very poor correlation with final performance, and (ii) that extending the trials in the pre-test to get a reliable measure of baseline performance essentially provides practice at the task, thereby minimizing the room for improvement during the actual intervention.
Second, the use of gain scores has been criticized as a learning measure [72], especially in cases where the pre-test scores are not the same across groups. For example, since the gain scores are skewed by the initial performance level, individuals with lower pre-test scores could appear to have greater ‘gains’ even if their final performance was lower than or similar to that of individuals with higher pre-test scores [71]. This will especially be the case in situations where the performance approaches a plateau. While normalizing the gain score (e.g., expressed as a %) can offset some of these issues, this also has to be treated with caution since the normalization procedure makes assumptions about the form of the learning curve. The problem with a gain score is even more obvious when using a ‘relative retention’ measure (i.e., a difference score computed between the last block of practice and the retention test). In this case, the magnitude of relative retention is heavily dependent on performance during practice, a measure that may be affected by factors other than learning [71]. Moreover, if groups underwent different types of intervention during practice, the performance at the end of practice is contaminated by the effects of the intervention itself, making it an unsuitable measure to compare learning in the two groups. Finally, some motor learning studies use curve fitting (e.g., fitting an exponential) as a means to quantify ‘rates’ of learning independent of performance level. While this argument is true ‘in theory’, it relies on two major assumptions - (i) that the form of the function actually matches the form of the learning curve (i.e., if an exponential fit is made, that the curve is actually exponential), and (ii) that the data have minimal noise and the function fits these data extremely well at the individual level. Without knowing whether these assumptions hold, it is difficult to interpret results from curve fitting.
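The dependence of gain scores on initial performance can be sketched with a toy example. The snippet assumes, purely for illustration, that two learners follow the same exponential curve toward the same plateau with the same rate and differ only in their starting score (higher scores are better); the plateau, rate, number of trials and starting scores are all hypothetical.

```python
import math

def score_after(initial, trials, plateau=100.0, rate=0.05):
    """Score after practice on an assumed exponential curve toward a plateau."""
    return plateau - (plateau - initial) * math.exp(-rate * trials)

def learning_measures(initial, trials=40):
    """Three common learning measures computed from the same two scores."""
    final = score_after(initial, trials)
    return {
        "final": final,                          # proficiency-based measure
        "gain": final - initial,                 # raw gain score
        "normalized": final / initial * 100.0,   # gain normalized to pre-test
    }

# Identical learning process, different starting points (hypothetical values).
low_start = learning_measures(initial=20.0)
high_start = learning_measures(initial=60.0)

for label, m in [("low start", low_start), ("high start", high_start)]:
    print(label, {k: round(v, 1) for k, v in m.items()})
```

Here the learner who starts lower shows the larger raw and normalized gain despite having the lower final score and an identical underlying rate, so the three measures would rank the two learners differently.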

Minimize baseline imbalances.

When initial performances at baseline (i.e., the pre-test) are similar between groups, the type of the dependent variable used in the analysis has a minimal effect on the outcome - i.e., the use of a change score or the final score will lead to similar conclusions. Hence, it is critical to ensure that adequate attempts are made to minimize baseline imbalances. This can be done using three strategies - (i) increasing sample sizes: small sample sizes are more likely to create baseline imbalances; hence, we recommend using adequate sample sizes to ensure that the baseline performance reflects the true group means [73,74], (ii) stratifying groups based on initial performance including treatment of outliers [75], and (iii) using an analysis of covariance (ANCOVA) to adjust for any residual differences in baseline for estimating an unbiased intervention effect [76,77]. In cases where ‘pre-test’ measurements are either not feasible or not consistent, it might be helpful just to rely on larger sample sizes and use the final test performance as the measure of learning.
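A minimal sketch of strategy (iii) is shown below, comparing three estimates of a group effect when baselines differ: the post-test difference, the gain-score difference, and an ANCOVA-style adjusted difference. The adjusted difference is computed with the standard common-slope formula (post-test difference minus the pooled within-group slope times the pre-test difference). The data are hypothetical and constructed so that the true group effect is 5 points, the true pre-to-post slope is 0.5, and the treatment group starts 10 points higher; all names are our own.

```python
def mean(values):
    return sum(values) / len(values)

def pooled_within_slope(pre_by_group, post_by_group):
    """Slope of post on pre, pooled across groups after centering within group."""
    sxy = 0.0
    sxx = 0.0
    for pre, post in zip(pre_by_group, post_by_group):
        mx, my = mean(pre), mean(post)
        sxy += sum((x - mx) * (y - my) for x, y in zip(pre, post))
        sxx += sum((x - mx) ** 2 for x in pre)
    return sxy / sxx

def group_effects(pre_ctrl, post_ctrl, pre_trt, post_trt):
    """Three estimates of the treatment effect from the same data."""
    slope = pooled_within_slope([pre_ctrl, pre_trt], [post_ctrl, post_trt])
    post_diff = mean(post_trt) - mean(post_ctrl)
    pre_diff = mean(pre_trt) - mean(pre_ctrl)
    return {
        "post_only": post_diff,                   # ignores baseline
        "gain_score": post_diff - pre_diff,       # implicitly assumes slope = 1
        "ancova": post_diff - slope * pre_diff,   # uses the estimated slope
    }

# Hypothetical data: post = 50 + 0.5 * pre + 5 * treated + residual.
residuals = [1.0, -1.0, -1.0, 1.0]
pre_ctrl = [10.0, 20.0, 30.0, 40.0]
pre_trt = [20.0, 30.0, 40.0, 50.0]   # baseline imbalance of 10 points
post_ctrl = [50.0 + 0.5 * x + r for x, r in zip(pre_ctrl, residuals)]
post_trt = [50.0 + 0.5 * x + 5.0 + r for x, r in zip(pre_trt, residuals)]

effects = group_effects(pre_ctrl, post_ctrl, pre_trt, post_trt)
print(effects)
```

With this imbalance, the post-test difference overstates the constructed effect, the gain-score difference shows no effect at all, and only the adjusted estimate recovers the value of 5. The adjustment rests on the assumption of a common slope in both groups, so it is a sketch of the logic rather than a substitute for a full ANCOVA with its diagnostics.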

Interpret results with considerable caution when baseline performances are different.

When baseline imbalances are inherent to the research question - e.g., when comparing age-related differences [78,79] or the effects of neurological injury [80], the research question can be ‘ill-defined’ to some degree and the results have to be treated with considerable caution since statistical techniques like ANCOVAs that account for baseline differences can yield misleading results [81]. Instead, it might be fruitful to think about potential experiments where ambiguity can be resolved. For example, if a particular study shows that rates of learning are higher in children than in adults (but is ambiguous because children still perform worse in an ‘absolute’ sense relative to the adults), then by extending the practice period, one can test in a follow-up experiment whether there is a point where the children ‘outperform’ the adults even in terms of absolute levels of performance. This result would help resolve the ambiguity since both absolute and rate measures would show that the children outperformed the adults.

Pre-register learning measures and robustness checks.

Again, as with the choice of the dependent variable, an important way to improve transparency is to pre-register the learning measures of the study and make data openly available to other researchers. In particular, adding robustness checks (e.g., comparing learning using multiple learning measures) to check if the outcomes change with the type of learning measure will help improve the interpretation of the study results.

STEP 10: PROCESSING DATA

Given the fact that most motor learning data inherently involve change in performance over time (i.e., there is no ‘stable’ performance), with significant within- and between-subject variability, even seemingly standard data analysis procedures can sometimes result in ‘artifacts’ that can be misleading. In addition, the effect of non-learning related factors on performance (such as fatigue) can also pose challenges during data analysis. Therefore, it is important to examine the effects of steps in the data analysis pipeline. We highlight three examples of artifacts arising in motor learning studies at different levels within the data analysis pipeline:

Experimental artifacts.

While the general ‘learning - performance distinction’ is well recognized in the context of needing retention or transfer tests [3], many questions in motor learning require direct analysis of performance curves - e.g., studies of ‘online’ and ‘offline’ learning that examine changes in performance ‘during’ and ‘between’ practice sessions or even at the level of trials [82-85]. One phenomenon observed in these experiments is ‘offline consolidation’ (also referred to as ‘reminiscence’ [86]), where there is a seemingly distinct improvement in performance after a rest break. However, there is a potential for performance artifacts (e.g., effects of inhibition or fatigue) in these types of analyses [57,87] and it is worth noting that these consolidation effects are typically observed in speeded tasks (e.g., producing as many typing sequences as possible in a given time period, or tracking moving targets continuously).

Processing artifacts.

General data processing procedures involve some type of averaging to reduce the ‘noise’ in the data. However, relying on “averaged” data across individuals can substantially alter some types of inferences. From a theoretical standpoint, one of the most well-known artifacts due to averaging concerns the form of the learning curve, which can appear as a power law or an exponential depending on how the data are averaged [88-90]. As a result, there is potential for artifacts when estimating individual parameters based on group-averaged data. A second type of averaging, usually done across trials within a block, can mask temporary effects like warm-up decrement [17,89] and create the illusion of a ‘discontinuity’ in the learning curve even when the learning curve is continuous [87].
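The averaging artifact in the form of the learning curve can be sketched with a noise-free simulation. The snippet assumes two hypothetical learners whose error above asymptote decays as a pure exponential with different rates (0.05 and 0.5 per trial, values chosen for illustration), and computes the apparent exponential rate of the group-averaged curve early and late in practice.

```python
import math

RATES = [0.05, 0.5]  # hypothetical slow and fast learners

def individual_curve(rate, trial):
    """Error above asymptote for one learner: a pure exponential decay."""
    return math.exp(-rate * trial)

def slow_learner(trial):
    return individual_curve(RATES[0], trial)

def group_average(trial):
    """Mean of the individual curves at a given trial."""
    return sum(individual_curve(r, trial) for r in RATES) / len(RATES)

def local_rate(curve, trial):
    """Apparent exponential rate between consecutive trials: ln(y_t / y_{t+1})."""
    return math.log(curve(trial) / curve(trial + 1))

early_rate = local_rate(group_average, 0)
late_rate = local_rate(group_average, 30)

print("slow learner:", round(local_rate(slow_learner, 0), 3),
      round(local_rate(slow_learner, 30), 3))
print("group average:", round(early_rate, 3), round(late_rate, 3))
```

Each individual curve has a constant rate by construction, but the averaged curve has a rate that declines over practice, so it is no longer a single exponential and a rate parameter fitted to it would not describe either learner.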

Statistical artifacts.

In the analysis of individual differences, ‘mathematical coupling’ occurs when the response variable directly or indirectly contains all or a part of the predictor variable [91]. For instance, studies evaluating initial performance to predict ‘gain scores’ are expected to see significant associations simply due to the fact that the gain score depends on the initial performance. Relatedly, methods that involve creating ‘post hoc’ groups based on dichotomizing a continuous response (e.g., analyzing the difference between high and low responders based on a median split) have been criticized from a statistical viewpoint as creating arbitrary or illusory distinctions [92]. It is especially important to note that standard procedures used in other domains (such as repeated exposures to examine if the ‘response’ is reliable within an individual) are not generally applicable in motor learning since it is not possible to wash out prior learning. As a result, results from these types of analyses must be treated with even more caution.
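Mathematical coupling between initial performance and gain scores can be demonstrated with simulated data with known properties. In the sketch below, pre-test and post-test scores are drawn independently (so initial performance carries no information about final performance), yet the gain score is still strongly correlated with the pre-test. The sample size, means, standard deviations and random seed are arbitrary choices for illustration.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 2000

# Pre and post are generated independently: no true pre-post relationship.
pre = [rng.gauss(50.0, 10.0) for _ in range(n)]
post = [rng.gauss(70.0, 10.0) for _ in range(n)]
gain = [b - a for a, b in zip(pre, post)]

r_pre_post = pearson_r(pre, post)
r_pre_gain = pearson_r(pre, gain)

print("pre vs post:", round(r_pre_post, 2))
print("pre vs gain:", round(r_pre_gain, 2))
```

With equal variances and independent scores, the expected correlation between pre-test and gain is about -0.71, which arises only because the pre-test appears (with a negative sign) in the gain score. A sizeable negative correlation of this kind is therefore not, by itself, evidence that poorer initial performers learn more.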

Control experiments for experimental artifacts.

In cases where the analysis is based on performance curves, the use of control experiments that can disambiguate temporary effects on performance from learning is critical. For example, to examine if offline learning is truly distinct from ‘recovery from inhibition/fatigue’ effects, it might be useful to manipulate the practice/rest interval or perturb neural activity during the rest period [93].

Robustness checks for processing and statistical artifacts.

The impact of specific procedures can also be examined by showing how results are impacted by changes in specific parameter choices. For example, when using a parameter based on a group-averaged curve fit, Reis et al. [11] show sensitivity analyses to changes in this parameter. Using simulated data with known properties can also be a critical tool to check the impact of a particular processing procedure. For example, Smeets and Louw [94] showed how the decomposition of variability can be sensitive to the choice of variables used in defining the task.

Providing transparent visualization and open data.

Finally, it may be difficult (if not impossible) for one paper to identify all possible artifacts and/or perform all possible robustness checks. Therefore, providing transparent visualizations that go beyond simple bar graphs (e.g., showing individual performance curves with minimal or no averaging) [95] and using open science practices like publicly sharing analysis and data can be extremely critical in improving the quality of motor learning science [96].

CONCLUSION

Overall, the pitfalls and recommendations highlight two broad themes in motor learning that require attention. The first theme relates to the relevance of motor learning studies to the real world. As highlighted earlier, while we agree that definitions of motor learning will vary depending on context and discipline, it is perhaps also important to take a pragmatic perspective that in some way, the ultimate goal of motor learning experiments is to be able to apply this knowledge to the real world. In this regard, while broad ‘principles’ of motor learning are often mentioned in the context of fields like rehabilitation [6,97], it is difficult to gauge the actual impact of most current motor learning paradigms on these fields. For example, a recent review of stroke rehabilitation literature found that only 8% of studies even mention ‘basic science’ studies (motor learning experiments in animals or humans) in the Introduction [98]. We hope that by raising awareness of several issues that hamper real-world relevance (choice of task, length of practice duration, etc.), the guidelines spur researchers to move outside ‘traditional’ paradigms in their own subfield with the goal of increasing relevance. A second theme that emerges from the guidelines is the need for initiatives that are not just at the level of a single investigator or a lab but at the level of a whole research community. As highlighted in the context of developing ‘model task paradigms’ [2], many of the proposed recommendations (larger sample sizes, more groups, increasing practice duration, preregistration, sharing open data) require a greater investment of time and effort compared to current publication practices. As a result, we hope that the guidelines spur discussion not only about larger-scale collaborative efforts, but also about the need for recognition of such efforts at other levels, such as by journal editorial boards and by hiring, promotion and tenure committees.
