People routinely perform multiple visual judgments in the real world, yet, intermixing tasks or task variants during training can damage or even prevent learning. This paper explores why. We challenged theories of visual perceptual learning focused on plastic retuning of low-level retinotopic cortical representations by placing different task variants in different retinal locations, and tested theories of perceptual learning through reweighting (changes in readout) by varying task similarity. Discriminating different (but equivalent) and similar orientations in separate retinal locations interfered with learning, whereas training either with identical orientations or sufficiently different ones in different locations released rapid learning. This location crosstalk during learning renders it unlikely that the primary substrate of learning is retuning in early retinotopic visual areas; instead, learning likely involves reweighting from location-independent representations to a decision. We developed an Integrated Reweighting Theory (IRT), which has both V1-like location-specific representations and higher level (V4/IT or higher) location-invariant representations, and learns via reweighting the readout to decision, to predict the order of learning rates in different conditions. This model with suitable parameters successfully fit the behavioral data, as well as some microstructure of learning performance in a new trial-by-trial analysis.
As humans, our everyday interactions with a complex world often depend on well-practiced visual judgments. Correspondingly, many visual judgments can be substantially improved with training or practice, sometimes from near chance to excellent performance. Examples include judgments of orientation (Crist, Li, & Gilbert, 2001; Dosher & Lu, 1999), motion direction (Ball & Sekuler, 1987), texture pattern (Karni & Sagi, 1991), and many other tasks. Our interactions with the visual world also often require rapid and flexible intermixture (Kuai, Zhang, Klein, Levi, & Yu, 2005) of visual judgments in everyday behavior. Yet, when several visual tasks or task variants have been intermixed (roved) during training, especially those with similar stimuli and judgments, perceptual learning can sometimes be almost completely disrupted (Sagi, Adini, Tsodyks, & Wilkonsky, 2003; Yu, Klein, & Levi, 2004; Kuai et al., 2005; Parkosadze, Otto, Malania, Kezeli, & Herzog, 2008; Zhang et al., 2008).
These so-called roving effects may reveal important properties about how tasks are learned. Two broad theories about how visual perceptual learning occurs have been proposed (Seitz & Watanabe, 2005). We have schematically illustrated them in Figure 1a. According to sensory retuning, learning primarily reflects retuning neurons in early retinotopic cortical areas, as early as V1 (Karni & Sagi, 1991). In reweighting, learning optimizes the readout (reweighting) of evidence represented at one or more levels of the visual hierarchy (Dosher & Lu, 1998; Dosher & Lu, 1999; Dosher, Jeter, Liu, & Lu, 2013). One consequence of retuning theory, which has been a dominant proposal in the field, is that learning tasks in separate retinal locations should be largely independent, because they involve separate retinotopic neural populations (Karni & Sagi, 1991).
Under reweighting theory, on the other hand, tasks trained in different retinal locations may interact during learning through shared higher-level representations, and (as we will see) this will be especially consequential for tasks in which similar stimuli require different responses. In a neural network model of learning through reweighting, training intermixed tasks can exhibit different levels of interference (Grossberg, 1987; McCloskey & Cohen, 1989). Of course, learning might, in some circumstances, occur through both retuning and reweighting, as suggested in several integrated reviews of perceptual learning (Seitz & Watanabe, 2005).
Figure 1.
(a) A schematic illustration of the retuning and reweighting theories of visual perceptual learning, with associations to possible cortical substrates. Perceptual learning may change the tuning of neurons in early retinotopic areas (e.g. V1) (turquoise inset), or change the weights connecting sensory representations at several levels (e.g. V1, V4, IT, or higher) to decision (red oval). (b) A diagram showing weights fighting following two similar orientations with opposite responses (CW and CCW). Following the left image, a “CW” response is expected and weights move toward positive values (red); following the right image, a “CCW” response is expected and weights move toward negative values (blue). The overlapping weights for these two orientations hence fail to improve due to conflicting updates (gray dotted lines).
Roving effects in perceptual learning
The impact of task intermixture or roving during visual perceptual learning has been documented in a number of visual tasks in experiments that train in a single retinal location. Perhaps the most famous example showed that learning was largely disrupted in two-interval contrast increment detection when base contrast varied from trial to trial, whereas contrast increment detection otherwise showed robust learning with a fixed base contrast (Adini, Sagi, & Tsodyks, 2002; Kuai et al., 2005; Yu et al., 2004). In bisection tasks, learning has been shown to be disrupted, or at least very slow, when the distance between reference lines or dots was intermixed, or with certain other (but not all) forms of stimulus variation (Aberg & Herzog, 2009; Parkosadze et al., 2008). In some studies, though, contrast increment learning was “re-enabled” when the base contrasts were cycled in a fixed temporal order (Cong & Zhang, 2014; Kuai et al., 2005; Zhang et al., 2008). Similar damaging effects of roving have also been found in the auditory domain. In one example, learning two-interval frequency discrimination was robust with a fixed frequency standard, but showed very slow or no learning for standards roved within an auditory band and for wide roving (Amitay, Hawkey, & Moore, 2005); in another, learning was present in an auditory temporal-interval discrimination task with sequential training of two base intervals but not with intermixed training (Banai, Ortiz, Oppenheimer, & Wright, 2010).
The theoretical explanations for these roving effects have attributed them to task variation and some form of interference. First, roving damages learning when the stimuli and tasks are distinct but similar (Tartaglia, Aberg, & Herzog, 2009a). Such combinations would recruit overlapping neuron populations in which training could have interfering effects.
As for the form of interference, some explanations have focused on recurrent processing stages, a potential role of variations in position due to fixation fluctuation and nonlinear processes in registration (Otto, Herzog, Fahle, & Zhaoping, 2006; Zhaoping, Herzog, & Dayan, 2003). Other explanations recruited ideas about failure to develop a stable memory trace (Yu et al., 2004), or memory consolidation (Seitz et al., 2005). The theoretical discussion closest to our current modeling of these roving effects (by Tartaglia, Herzog, and colleagues) analyzes stimulus roving effects in the context of network learning models and their selective susceptibility to negative effects, depending on the nature of the learning algorithm and the overlap of representations (Tartaglia, Aberg, & Herzog, 2009b). A computational model can formalize these ideas and be used to test whether they can account for the behavioral learning data.
Modeling approach
Understanding why and under what circumstances learning is disrupted for intermixed tasks has broader implications for theories of brain plasticity (Herzog, Aberg, Frémaux, Gerstner, & Sprekeler, 2012) and may, in turn, be relevant to the design of real-world training protocols (Kuai et al., 2005; Tartaglia et al., 2009b). In this study, we investigated how training tasks in different retinal locations, together with manipulations of stimulus dissimilarity, impact intermixed learning. From a modeling perspective, we see the disruption of learning when stimuli/tasks are intermixed during training as related to the concept of catastrophic interference in neural networks (Grossberg, 1987). Figure 1b schematically illustrates why stimulus similarity might be critical for interference due to roved tasks or stimuli: interference arises essentially when very similar stimuli (with close stimulus representations) require distinct responses.
In order to investigate this theoretical hypothesis, we developed and tested the predictions of a computational model of visual perceptual learning, the integrated reweighting theory (IRT) (Dosher et al., 2013; Liu, Dosher, & Lu, 2015), which accounts for learning by reweighting (improving the readout) from stable sensory evidence using a hybrid (augmented Hebbian) learning rule. It learns by reweighting the evidence in visual representations at several levels to make perceptual decisions. Stimulus images are first processed through a front-end, which simulates early visual cortical responses in orientation and spatial frequency tuned representations (both location-specific and location-invariant); then the weights connecting these representations to a decision unit are updated (reweighted) using augmented Hebbian learning rules on each trial, simulating the outcomes in the actual experiments (Petrov, Dosher, & Lu, 2005; Petrov, Dosher, & Lu, 2006). (See Methods, Simulation Methods, for details of the computational model.)
The schematic principle of interference in decision weights for near stimuli is illustrated in Figure 1b, and this idea was implemented computationally in the IRT.
As outlined previously, simple retuning theories of visual perceptual learning, which locate learning primarily in the plasticity of neurons in early retinotopic visual areas, predict little interaction between tasks learned in separate retinal locations; that is, they predict that separating the tasks by location should eliminate effects of roving because the neural populations involved are distinct. In the reweighting model, by contrast, training task variants in different locations in intermixed protocols reweights both higher-level location-invariant representations and lower-level location-specific ones. Although roving interference would change the weights on both representation levels if the task variants were trained in the same location, it occurs through the weights on location-invariant representations when different task variants are trained in different locations. The interactions between locations, and different amounts of roving interference, were exercised in four groups, each of which experienced a different task mixture.
Experimental approach
We compared perceptual learning in four groups of observers who practiced judging the orientations of Gabor patterns (θ ± 12°) in four different peripheral locations, but with different combinations of reference angles (θ) across the locations in each group. On each trial, observers judged whether the stimulus was rotated clockwise or counterclockwise of the reference angle for the location indicated by a pre-cue. External noise was added on half the trials to constrain the estimates of internal noises in the IRT fits (Dosher et al., 2013; Lu & Dosher, 2008). Figure 2 shows the stimulus layout, sample stimuli with and without external noise, illustrations of the four intermixture (roving) conditions, and a typical trial sequence. The four groups were as follows. In the All condition, training intermixed four different reference angles spaced in orientation, one per location. This is a perfect setup for substantial roving interference in learning (if learning in different locations interacts), because opposite responses are required for every set of adjacent Gabor stimuli in representations of orientations (see Figure 2b). Two other groups intermixed training of two tasks, each occurring in two locations, either with more similar reference angles (Near) or quite dissimilar reference angles (Far). Finally, a no-roving condition (Single) trained the same reference in all four locations. (See Methods, Simulation Methods, for details.)
Figure 2.
Sample stimuli and responses, and the training task mixtures of orientation judgments in separate locations (“clockwise” (CW), “counterclockwise” (CCW) by ± 12° of a reference angle). (a) Trial sequence of a fixation (500 ms), a precue (100 ms) marking the location to judge, the stimulus display (100 ms), and a response cue. Adaptive methods estimated contrast thresholds at 75% correct. (b) Sample stimuli with and without external noise (“snow”), along with assigned CW or CCW responses. (c) Illustrations of possible task mixtures in the four roving groups: All interleaved different reference angles in each location (−67.5°, −22.5°, 22.5°, or 67.5° from vertical); Near interleaved two similar tasks in two locations each (e.g. 22.5° or 67.5°); Far interleaved two dissimilar tasks at two locations each (e.g. −22.5° or 67.5°); Single trained one reference angle (e.g. −67.5°) in four locations (no roving); showing here only the reference angles, not the actual stimuli. N = 12 observers per group performed 7,680 trials each in 8 sessions.
The IRT model predicted the observed differences in learning rates for the four groups, on the basis of the interaction of learning in the four locations and the similarity or dissimilarity of the set of tasks being learned. As we will see, it provided an excellent account of the data at both the session level and the microstructure level of trial-by-trial learning.
Methods
Behavioral experiment
Observers
Observers, with normal or corrected-to-normal vision, provided written consent under a protocol approved by the Institutional Review Board of the University of California Irvine. Forty-eight observers were randomly assigned, 12 each, to one of four experimental groups. Each observer who completed the study performed 7,680 experimental trials over 8 training sessions (960 per session) on different days, usually within a two-week period, for 368,640 trials over all observers. Other observers were excluded from analysis if their thresholds were not measurable because they were at ceiling (100% contrast) for more than one session, especially but not exclusively in the high external noise condition; this occurred more frequently in the more difficult All (n = 4) and Near (n = 4) conditions than in the easier Far (n = 1) and Single (n = 1) conditions. (As a consequence, if anything, the results may slightly underestimate the learning rate differences between groups.)
Stimuli and apparatus
A Gabor (windowed sine wave) pattern was presented on each trial at one of the four corners around fixation; its orientation was chosen at random (“clockwise” [CW] or “counterclockwise” [CCW] of the reference angle for each location) and was presented with or without Gaussian noise. The Gabor pattern, defined in a 64 × 64 pixel patch, is described by: $l(x, y) = l_0\left(1.0 + c\,\sin\left(2\pi f\,(x\cos\theta + y\sin\theta)\right)\exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)\right)$, with θ = reference angle ± 12°, spatial frequency f = 1.33 cpd, SD of the Gaussian envelope σ = 0.5 degrees, maximum contrast c, and mid-gray background luminance $l_0$. Each external noise image, newly generated for each trial, was composed of 2 × 2 pixel noise elements with contrasts randomly chosen from a Gaussian distribution with mean value 0 and SD 0.25. External noise images and signal Gabor images (NNSNN) were displayed sequentially at the frame rate (see Procedure). The 64 × 64 pixel images subtended 3° × 3° visual angle, located at 5.67° eccentricity, at a viewing distance of 72 cm. Stimuli were generated in MATLAB with PsychToolbox on a Macintosh G4 computer using the internal 10-bit video card, refresh rate of 67 Hz, resolution of 640 × 480 pixels, and displayed on a 19 in. Viewsonic color monitor in pseudo-monochrome. A lookup table, estimated by a visual calibration procedure (Lu & Dosher, 2013) and validated by photometric measurement, linearized the luminance range into 127 levels from 1 cd/m2 to 67 cd/m2; the mid-gray background luminance was 34 cd/m2. The observer's head was stabilized using a chin rest.
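As an illustration of the stimulus definition, the following is a minimal sketch (not the authors' MATLAB/PsychToolbox code) that evaluates a Gabor luminance profile on a 64 × 64 grid spanning 3° × 3°. It assumes the standard form of a sine-phase Gabor, luminance = background × (1 + contrast × carrier × Gaussian envelope); the function name `gabor_patch`, the pixel-centering convention, and the treatment of θ as the angle that enters the carrier directly (rather than an angle measured from vertical) are assumptions made for the example.

```python
import math

def gabor_patch(theta_deg, contrast, size=64, deg_per_image=3.0,
                freq_cpd=1.33, sigma_deg=0.5, l0=34.0):
    """Return a size x size nested list of luminances (cd/m^2).

    Parameter values follow the text (1.33 cpd, sigma = 0.5 deg,
    background 34 cd/m^2); the function itself is a hypothetical helper.
    """
    theta = math.radians(theta_deg)
    deg_per_pixel = deg_per_image / size
    half = (size - 1) / 2.0  # assumed: grid centered on the patch
    patch = []
    for row in range(size):
        y = (row - half) * deg_per_pixel
        line = []
        for col in range(size):
            x = (col - half) * deg_per_pixel
            # Sinusoidal carrier along the direction set by theta
            carrier = math.sin(2 * math.pi * freq_cpd *
                               (x * math.cos(theta) + y * math.sin(theta)))
            # Circular Gaussian envelope with SD sigma_deg
            envelope = math.exp(-(x * x + y * y) / (2 * sigma_deg ** 2))
            line.append(l0 * (1.0 + contrast * carrier * envelope))
        patch.append(line)
    return patch

# Example: reference angle 22.5 deg plus the 12 deg offset, at 50% contrast
patch = gabor_patch(theta_deg=22.5 + 12.0, contrast=0.5)
```

Because the carrier is in sine phase and the grid is symmetric about the center, the mean luminance of the patch stays at the background level, so contrast changes do not alter the average luminance.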
Design
Observers discriminated the orientation of a Gabor patch tilted ± 12° (CW or CCW, from a reference angle) in one of the four locations indicated by a pre-cue and a response post-cue (arrows). There were four roving (intermixture) conditions: In the All condition, each of the four locations used a different reference angle (i.e. -67.5°, -22.5°, 22.5°, or 67.5° from vertical for the lower left, upper left, upper right, and lower right positions). In the Near condition, two closer reference angles were used in two diagonal positions (e.g. -22.5° or 22.5°). In the Far condition, two dissimilar reference angles were used in two diagonal positions (e.g. -67.5° or 22.5°). In the Single condition, the same reference angle occurred in all locations. Zero and high external noise test conditions were intermixed. There were 8 sessions of 960 trials per session. Adaptive methods (see below) were used to measure contrast thresholds at 75% correct for each location and external noise condition separately (120 trials each within a session).
Procedure
Observers were instructed in the task, shown printed examples of the stimuli, and then participated in a small number of practice trials prior to collecting the experimental data. Each trial of the experiment started with a central fixation mark and four sets of location markers; 500 ms later, the stimulus sequence (external Gaussian noise or blank frames, signal, external Gaussian noise or blank frames) appeared for 2 refresh counts per frame, with a central pre-cue arrow appearing 100 ms prior to the signal Gabor frame. The contrast of the stimulus in a trial was determined by the adaptive procedure. Observers pressed the “j” key for CW or the “f” key for CCW. A feedback tone followed each correct response. Each session included 960 trials, or 120 trials in each of the four locations for each external noise level.
Adaptive threshold measurement
Thresholds were measured with the accelerated stochastic approximation algorithm (Kesten, 1958). The Gabor (signal) contrast on each trial was selected to track a target performance of 75% correct. In the first two trials, contrasts follow the stochastic approximation procedure (Robbins & Monro, 1951): $X_{n+1} = X_n - \frac{s}{n}(Z_n - \phi)$, where n is the trial number, $X_n$ is the stimulus contrast in trial n, $Z_n$ = 0 or 1 is the response accuracy in trial n, $X_{n+1}$ is the contrast for the next trial, $\phi$ = 0.75 is the target accuracy, and s is the pre-chosen step size at the beginning of the trial sequence. From the third trial on, the sequence is “accelerated”: $X_{n+1} = X_n - \frac{s}{2 + m}(Z_n - \phi)$, where m is the number of shifts in response category (from correct response to incorrect response and vice versa). In our application, the method was modified such that while m = 0 the increase in contrast on an error is capped at 0.125s. See Treutwein (1995) for a discussion of this adaptive method, and Lu and Dosher (2013) for an analysis of the convergence properties and guidelines for step size and starting values.
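The adaptive rule can be sketched in a few lines. The code below assumes the standard Robbins-Monro and Kesten update forms, a step of s/n on the first two trials and s/(2 + m) afterward, each multiplied by (Z − 0.75); the function names, the choice to apply the 0.125s cap on any error while m = 0, and the clipping of contrast to [0, 1] are assumptions for illustration rather than details confirmed by the text.

```python
def next_contrast(x_n, z_n, n, m_shifts, s, phi=0.75, cap=0.125):
    """One staircase update.

    x_n: contrast on trial n; z_n: 1 if correct, 0 if error;
    n: 1-based trial number; m_shifts: number of correct/error shifts so far;
    s: initial step size; phi: target proportion correct.
    """
    if n <= 2:
        step = (s / n) * (z_n - phi)               # stochastic approximation
    else:
        step = (s / (2 + m_shifts)) * (z_n - phi)  # accelerated version
    x_next = x_n - step
    # Assumed reading of the modification: before any shift has occurred,
    # an error raises contrast by at most cap * s.
    if m_shifts == 0 and z_n == 0:
        x_next = min(x_next, x_n + cap * s)
    return min(max(x_next, 0.0), 1.0)              # assumed contrast limits

def run_staircase(responses, x0, s):
    """Track contrast over a sequence of 0/1 accuracies."""
    x, m, prev = x0, 0, None
    track = [x0]
    for n, z in enumerate(responses, start=1):
        if prev is not None and z != prev:
            m += 1                                 # response category shifted
        x = next_contrast(x, z, n, m, s)
        prev = z
        track.append(x)
    return track
```

Since correct responses lower contrast by a step proportional to 0.25 and errors raise it by a step proportional to 0.75, the expected change is zero only where accuracy equals 75%, which is what makes the track converge on the 75% threshold.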
Bootstrap methods
Error bars were estimated by bootstrapping. In most cases, this involved (i) generating new pseudo-samples of observers (usually 1,000) by sampling with replacement from the set of actual observers, and then (ii) statistically processing these with the same methods used in the corresponding analysis of human data. For example, session thresholds (symbols in Figure 3) were computed by averaging the contrast thresholds from 12 observers in each condition; for each observer, the threshold was first computed as the average contrast over the final 30 trials in each testing location, and these values were then averaged over locations. The SD of the mean thresholds across the pseudo-sampled sets of 12 observers per condition was used to estimate the standard error of these mean values, which gives an estimate of error based on the observed population variability. Similar bootstrapping methods were used as the basis of error estimates for the parameters of power function models fit to the learning curves.
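A minimal sketch of the observer-level bootstrap for one condition and session follows. It resamples observers with replacement, recomputes the group mean for each pseudo-sample, and takes the SD of those means as the standard error. The function name `bootstrap_se`, the fixed seed, and the use of the sample SD are assumptions made for the example.

```python
import random
import statistics

def bootstrap_se(observer_thresholds, n_boot=1000, seed=0):
    """Bootstrap standard error of the group-mean threshold.

    observer_thresholds: one threshold per observer (e.g. 12 values).
    """
    rng = random.Random(seed)      # seeded only to make the sketch repeatable
    n = len(observer_thresholds)
    means = []
    for _ in range(n_boot):
        # Pseudo-sample of observers, drawn with replacement
        sample = [observer_thresholds[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.mean(sample))
    # SD of the resampled means estimates the SE of the mean
    return statistics.stdev(means)
```

The same resampling loop applies to derived quantities: replacing the mean with a model fit to each pseudo-sample yields bootstrap distributions for fitted parameters.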
Figure 3.
Contrast thresholds as a function of training session for the four roving groups: All, Near, Far, and Single, displayed separately for low and high noise test trials. (a) Contrast thresholds at 75% correct from adaptive staircases, averaged over observers. Error bars were bootstrapped (n = 1000) from the behavioral data. Smooth curves are the power fits with different learning rates for each group. (b) Post-training contrast thresholds for the four roving groups. Error bars are standard error of the mean.
Fitting power functions
Power functions were fit to the contrast threshold learning data (curves in Figure 3a) to estimate the learning rates. The learning curves were fit by power function improvements (Dosher & Lu, 2007; Heathcote, Brown, & Mewhort, 2000): $C(t) = \lambda (t + 1)^{-\beta} + \alpha$, with initial threshold of λ + α, asymptotic threshold of α, learning rate β, and training block t. The curves for the four roving conditions were tested for significant differences with a lattice of nested F-tests, each of which compares a restricted model to a fuller model of which it is a proper subset. For example, if roving conditions actually differ in learning rate (or any other parameter), constraining the model system to equate that parameter will significantly reduce the quality of fit. The proportion of variance accounted for by a model is $r^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$, where $y_i$ are the observed values and $\hat{y}_i$ the predicted values. The ∑ is over all N observations and $\bar{y}$ is the mean of the observed values. F-tests for nested models compared the fit of the fuller and reduced models: $F(df_1, df_2) = \frac{(r^2_{full} - r^2_{reduced})/df_1}{(1 - r^2_{full})/df_2}$, where $df_1 = k_{full} - k_{reduced}$, and $df_2 = N - k_{full} - 1$. The k's are the number of model parameters. The F-test computes the ratio of the improvement in error variance for each additional parameter in the fuller model to the (average) error variance per degree of freedom.
The variability in the estimated parameters was computed via bootstrapping as described above. A set of 1,000 samples of 12 observers per group, drawn randomly with replacement from the data of actual observers, was created, and then power function models were fit to each of these data sets. This was the basis of the estimated mean and SD of the estimated parameter values. Because the SDs were inflated by correlated structures between power function parameters, we also tabulated the frequency of the ordinal pattern between different rate parameters in the 1,000 fits to bootstrapped data (presented in Appendix A).
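The quantities in the model comparison can be written out directly. The sketch below assumes the conventional definitions: r² as one minus the ratio of residual to total sum of squares, and the nested-model F as the r² gain per added parameter divided by the unexplained variance per residual degree of freedom, with df2 = N − k_full − 1 as stated in the text. Function names are hypothetical, and the fitting step itself (the search over λ, β, and α) is not shown.

```python
def power_curve(t, lam, beta, alpha):
    """Threshold at training block t: lam * (t + 1) ** -beta + alpha."""
    return lam * (t + 1) ** (-beta) + alpha

def r_squared(observed, predicted):
    """Proportion of variance in the observations accounted for by the model."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

def nested_f(r2_full, r2_reduced, k_full, k_reduced, n_obs):
    """F statistic and degrees of freedom for a nested model comparison."""
    df1 = k_full - k_reduced          # parameters added by the fuller model
    df2 = n_obs - k_full - 1          # residual degrees of freedom
    f = ((r2_full - r2_reduced) / df1) / ((1.0 - r2_full) / df2)
    return f, df1, df2
```

For example, at t = 0 the curve returns λ + α (the initial threshold), and as t grows it approaches α, so β alone controls how quickly thresholds fall between those two levels.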
Simulation methods
Integrated reweighting theory
The integrated reweighting theory (IRT) (Dosher et al., 2013) was implemented in Matlab (The Mathworks, Inc., Santa Clara, CA). It includes a representation module, a decision module, and a learning module. The simulation model is tested by exactly reprising the experimental protocol (i.e. same number of trials, randomization of conditions, etc.), taking stimulus images as input, producing responses as output on each trial, and using exactly the same data analysis as the behavioral experiment. The descriptions of the modules below are similar to those of previous applications (Dosher et al., 2013; Petrov et al., 2005).
The representation module, inspired by units in early visual cortex, computes the activities in location-specific and location-invariant representations from stimulus images. Signal and external noise images are summed in the model to represent temporal integration by the visual system. This implementation used four sets of location-specific representations and one set of location-invariant (location-independent) representations that responds to inputs from all four locations. The spacing, orientation, and spatial frequency bandwidth parameters, and the spatial summation radius, were all set a priori from earlier applications. There were 5 spatial frequency bands (every ½ octave) centered at 0.7, 1, 1.4, 2, and 2.8 cycles/degree, 12 orientation bands (every 15°) centered at 0°, ± 15°, ± 30°, ± 45°, ± 60°, ± 75°, and + 90° (= -90°), and four spatial phases (0°, 90°, 180°, and 270°). The spatial frequency tuning bandwidth was set at $h_f$ = 1 octave and the bandwidth of the orientation tuning was set at $h_\theta$ = 30° (half-amplitude full-bandwidth), based on estimated cellular tuning bandwidths in primary visual cortex. The location-invariant representations were more broadly tuned, with bandwidths of 1.6 × those of the location-specific units, and also had more internal noise (Dosher et al., 2013).
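To make the size of the representation concrete, the channel layout can be enumerated as follows. This is a bookkeeping sketch only: the labels in `REPRESENTATIONS` are hypothetical, and the count of 300 read-out units assumes one activation per orientation/frequency channel in each of the five representations after pooling over phase, which is our reading of the description rather than a figure stated in the text.

```python
from itertools import product

SPATIAL_FREQS_CPD = [0.7, 1.0, 1.4, 2.0, 2.8]       # 5 bands, half-octave steps
ORIENTATIONS_DEG = list(range(-75, 91, 15))          # 12 bands, 15 deg steps
PHASES_DEG = [0, 90, 180, 270]                       # 4 spatial phases

# Hypothetical labels: four location-specific sets plus one invariant set
REPRESENTATIONS = ["loc1", "loc2", "loc3", "loc4", "invariant"]

# Tuning bandwidths; invariant units are 1.6 times broader
SPECIFIC_BW = {"orientation_deg": 30.0, "sf_octaves": 1.0}
INVARIANT_BW = {name: 1.6 * bw for name, bw in SPECIFIC_BW.items()}

# Phase-sensitive filters applied to the image, per representation
filters = list(product(ORIENTATIONS_DEG, SPATIAL_FREQS_CPD, PHASES_DEG))
# Channels remaining after pooling over phase
channels = list(product(ORIENTATIONS_DEG, SPATIAL_FREQS_CPD))
# Units that would carry a weight to the decision unit (assumed layout)
units = [(rep, theta, f) for rep in REPRESENTATIONS for theta, f in channels]

print(len(filters), len(channels), len(units))
```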
The descriptions of the representation, decision, and learning modules are similar to those in earlier treatments (Liu et al., 2015; Petrov et al., 2005).
The representation module computes the activation values A(θ, f) of the orientation- and frequency-selective representation units, whether location-specific or location-invariant, in response to the stimulus image(s). This measures the normalized spectral energy in those channels. Sets of retinotopic phase-sensitive maps S(x, y, θ, f, ϕ) are applied to the input image I(x, y): $S(x, y, \theta, f, \phi) = [RF_{\theta, f, \phi}(x, y) \otimes I(x, y)]_+^2$, for spatial frequency f, orientation θ, and spatial phase ϕ. The input (stimulus) image I(x, y) is convolved with the filter for each spatial-frequency/orientation unit by fast Fourier transform, followed by half-squaring rectification, to produce phase-sensitive activation maps analogous to “simple cells.” These are pooled over spatial phase: $E(x, y, \theta, f) = \sum_{\phi} S(x, y, \theta, f, \phi) + \varepsilon_1$, and subjected to inhibitory normalization (Heeger, 1992): $C(x, y, \theta, f) = \frac{a\,E(x, y, \theta, f)}{k + N(f)}$. The noise $\varepsilon_1$ is Gaussian-distributed internal noise with mean 0 and standard deviation $\sigma_1$. The normalization pool N(f) is independent of orientation and only modestly tuned for spatial frequency, as suggested by the physiology. The parameter a is a scaling factor, and k is the saturation constant to prevent division by zero at very low contrasts. For this behavioral task where the observer judges orientation, activations were pooled over spatial phase, and then spatially pooled with a Gaussian kernel of radius $W_r$ around the target Gabor. Another Gaussian-distributed noise $\varepsilon_2$ of mean 0 and SD $\sigma_2$ introduces another source of stochastic variability: $A'(\theta, f) = \sum_{x, y} W_r(x, y)\,C(x, y, \theta, f) + \varepsilon_2$. The activations of the representation units are limited within a range by a nonlinear function with gain parameter γ: $A(\theta, f) = \frac{1 - e^{-\gamma A'}}{1 + e^{-\gamma A'}} A_{max}$ if $A' \geq 0$, and $A(\theta, f) = 0$ otherwise.
The decision module takes the weighted sum of activity in the representation units and a bias unit as inputs and generates a predicted response on each trial. The decision variable is: $u = \sum_i w_i A(\theta_i, f_i) - w_b b + \varepsilon_d$.
The $w_i$ are the current weights on representation units, b is a bias term integrated with weight $w_b$, and $\varepsilon_d$ (Gaussian, mean 0, SD $\sigma_d$) is decision noise. A sigmoidal function with gain γ transforms the decision variable u into an “early” post-synaptic decision activation o′: $o' = G(u) = \frac{1 - e^{-\gamma u}}{1 + e^{-\gamma u}} A_{max}$, with negative and positive values mapping to CCW and CW responses, respectively.
The learning module updates the weights from the representation units to the decision unit on every trial using augmented Hebbian learning. Feedback (F = ± 1), when available, moves the “late” post-synaptic activation in the decision unit o toward the correct response: $o = G(u + w_f F)$. If the feedback weight $w_f$ is high, activation of the decision unit approaches the correct positive or negative maximum (± $A_{max}$ = ± 1). In the absence of feedback (F = 0), learning operates on the early decision activation (o = o′), which is often intermediate. Learning occurs by changing the connection strengths $w_i$ from sensory representation units i to the decision unit on each trial. The weight changes depend on the activation at the pre-synaptic connection, $A(\theta_i, f_i)$, the post-synaptic activation compared to its long-term average, $(o - \bar{o})$, the distance of the weight from the minimum or maximum saturation value ($w_{min}$ or $w_{max}$), and the (system) learning rate (η). So, the change in weight is: $\Delta w_i = (w_i - w_{min})[\delta_i]_- + (w_{max} - w_i)[\delta_i]_+$, where $\delta_i = \eta A(\theta_i, f_i)(o - \bar{o})$, and the time-weighted average of post-synaptic activation is $\bar{o}(t + 1) = \rho\, o(t) + (1 - \rho)\,\bar{o}(t)$. This Hebbian learning rule is augmented both by feedback (when it occurs in the behavioral experiment) and by information in the bias control unit b that contributes to the early decision activation. The bias is a time-weighted average of responses R(t), weighted exponentially with a time constant ρ = 0.02 (about 50 trials): $r(t + 1) = \rho R(t) + (1 - \rho)\, r(t)$. The bias serves as a counter to deviations from 50%/50% response histories (assuming symmetric experimental designs).
Bias control tracks the observer's responses, while feedback tracks the external teaching signals.
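The decision and learning steps described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the authors' implementation: the two toy "representation units", their activation values, and the parameter values for `eta`, `w_feedback`, `w_bias`, `w_min`, `w_max`, and `sigma_d` are hypothetical; only the gain of 3.5 and the time constant of 0.02 are taken from the text. The sigmoid is written as a hyperbolic tangent, which is algebraically the same as $(1 - e^{-\gamma u})/(1 + e^{-\gamma u})$.

```python
import math
import random

A_MAX = 1.0  # activation ceiling, as in the text (+/- A_max = +/- 1)


def squash(u, gamma):
    """Sigmoid G(u); tanh(gamma*u/2) equals (1 - e^(-gamma*u)) / (1 + e^(-gamma*u))."""
    return A_MAX * math.tanh(gamma * u / 2.0)


def run_trial(weights, activations, feedback, state, params, rng):
    """Decide on one trial, then update `weights` and `state` in place.

    feedback is +1 (CW correct), -1 (CCW correct), or 0 (no feedback).
    Returns the response: +1 for CW, -1 for CCW.
    """
    u = sum(w * a for w, a in zip(weights, activations))
    u += -params["w_bias"] * state["bias"] + rng.gauss(0.0, params["sigma_d"])
    o_early = squash(u, params["gamma"])
    response = 1 if o_early >= 0 else -1
    # Feedback shifts the late activation toward the correct response;
    # with feedback == 0 the late activation equals the early one.
    o_late = squash(u + params["w_feedback"] * feedback, params["gamma"])
    for i, a in enumerate(activations):
        delta = params["eta"] * a * (o_late - state["o_bar"])
        if delta < 0:
            weights[i] += (weights[i] - params["w_min"]) * delta
        else:
            weights[i] += (params["w_max"] - weights[i]) * delta
    rho = params["rho"]
    state["o_bar"] = rho * o_late + (1.0 - rho) * state["o_bar"]
    state["bias"] = rho * response + (1.0 - rho) * state["bias"]
    return response


def train_demo(n_trials=200, seed=0):
    """Alternate CW and CCW trials on two toy units and return the final weights."""
    # All values except gamma and rho are hypothetical, chosen for illustration.
    params = {"eta": 0.01, "gamma": 3.5, "w_feedback": 1.0, "w_bias": 0.5,
              "w_min": -1.0, "w_max": 1.0, "rho": 0.02, "sigma_d": 0.1}
    rng = random.Random(seed)
    weights = [0.0, 0.0]
    state = {"o_bar": 0.0, "bias": 0.0}
    for t in range(n_trials):
        cw = (t % 2 == 0)
        # Unit 0 prefers the CW stimulus, unit 1 the CCW stimulus.
        activations = [1.0, 0.1] if cw else [0.1, 1.0]
        run_trial(weights, activations, 1 if cw else -1, state, params, rng)
    return weights
```

Calling `train_demo()` drives the weight on the CW-preferring unit upward and the weight on the CCW-preferring unit downward, while the soft bounds keep both inside the `w_min` to `w_max` range.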
Fitting the simulated model to data
Predictions of the IRT were generally based on 1,000 simulated repetitions of the experiment, yielding estimates of the mean and SDs of simulated thresholds. Parameters of the representation module were fixed a priori from the physiology or from prior implementations of an augmented Hebbian reweighting model (the AHRM) and the IRT (Dosher et al., 2013; Petrov et al., 2005). The parameter for the nonlinear activation of the decision unit, γ, was set to 3.5 based on previous estimates. The parameters varied to achieve the best fit of the IRT to the behavioral data of the current experiment were: internal additive noise (σ1) and internal multiplicative noise (σ2) for location-specific and for location-independent representations, a decision noise (σ), and learning rate (η). A scaling factor (a) matched initial performance of the different randomly assigned observer groups, and could differ slightly between groups if necessary.
The best parameter values to fit the behavioral data were found using successive grid searches, followed by more detailed searches in identified regions of the parameter space. These searches are time consuming because of the computational demands of processing many different Gabor-plus-noise images through the representation module to create a large image activation cache, and of running the simulation many times for each parameter combination. Several key free parameters, listed above, were varied, and the best least-squares fit of the model to the average data among those tested was selected. The quality of the fit was measured by the r2 between the mean contrast thresholds from the simulation and the average contrast thresholds in the experiment. The fit was also assessed using Kendall's τ, which measures the consistency of the ordinal predictions between the model and the data. In this application, Kendall's τ was lower than r2 because some conditions led to similar predicted outcomes, and so could easily trade ordinal positions in the data (e.g. the Far and Single groups).
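The remark about Kendall's τ and r2 can be made concrete with a small sketch. It assumes that r2 is the squared Pearson correlation and uses the simplest form of Kendall's τ (tau-a, no tie correction); the source does not say which variants were used. The four "predicted" and "observed" thresholds are hypothetical numbers chosen so that two nearly equal conditions swap order.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)  # tied pairs contribute 0
    return score / (n * (n - 1) / 2)


# Hypothetical thresholds: the last two conditions are nearly equal
# in the prediction and trade places in the data.
predicted = [0.80, 0.60, 0.41, 0.40]
observed = [0.82, 0.58, 0.39, 0.41]
r2 = pearson_r2(predicted, observed)
tau = kendall_tau(predicted, observed)
```

With these numbers one of the six pairs is discordant, so τ is 4/6 while r2 remains above 0.95: a single swap between two near-tied conditions costs much more in the ordinal statistic than in the variance accounted for.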
Results
Perceptual learning in separate locations interacts, and depends on the task combination
Perceptual learning occurred at quite different learning rates in the four training groups, as seen in the average contrast threshold learning curves, graphed separately for zero and high external noise tests (Figure 3a). Contrast thresholds were measured per session, as is typical of many studies of perceptual learning. The SDs of the average thresholds (error bars) were estimated by bootstrap methods (see Methods, Behavioral experiment). The ultimate differences in learning between groups after 8 sessions were quite substantial (Figure 3b).
The effects of the training (intermixture) group (All, Near, Far, and Single) were tested using analysis of variance on the contrast thresholds, with external noise (zero and high) and training block as within-observer factors, roving group as a between-observer factor, and observers as the random factor (α = 0.05). Higher contrast was, of course, required to accurately judge the stimuli embedded in external noise (F(1, 44) = 418.44; p < 0.0001; $\eta_p^2$ = 0.905). Learning reduced contrast thresholds over training sessions (F(7, 308) = 52.85; p < 0.0001; $\eta_p^2$ = 0.5457). Of central importance for this study, the four training groups showed different amounts of learning (F(3, 44) = 5.025; p < 0.005; $\eta_p^2$ = 0.255), with large differences in learning after several sessions. There was also an interaction between external noise and training group (F(3, 44) = 2.750; p < 0.05; $\eta_p^2$ = 0.158), and a marginal interaction among training condition, external noise, and block (F(21, 308) = 1.48; p ≈ 0.08; $\eta_p^2$ = 0.092). The methods of Masson (2011) were used to compute the Bayes information criterion probabilities (pBIC(H1|D)) (essentially a transformation of $\eta_p^2$): pBIC(H1|D) > 0.999 for the effects of training (roving) group, blocks of training, external noise, and the group by external noise interaction; pBIC(H1|D) < 0.001 for the interaction among training condition, external noise, and session.
Post hoc tests indicated that the differences between the four intermixture (roving) groups were significant (all p < 0.0001 except All versus Near, p < 0.01), although the Far and Single conditions were statistically equivalent (NS) (Bonferroni correction α = 0.008). See Appendix A, 1 for equivalent results of separate analyses in the zero and high external noise conditions. As described later, the IRT reweighting model predicts this same order of learning rates: All < Near < Far ≈ Single, from least to most.
The rates of learning were estimated from power function learning curves fitted to the average contrast threshold data (smooth curves in Figure 3) (Dosher & Lu, 2007; Heathcote, Brown, & Mewhort, 2000). These functions are described by: $c(t) = \lambda (t + 1)^{-\beta} + \alpha$, where $c(t)$ is the threshold in session $t$, $\lambda + \alpha$ is the initial threshold, $\alpha$ is the asymptotic value late in training, and $\beta$ is the learning rate. Power functions provide a good description of average contrast threshold learning functions (Dosher & Lu, 2007). In this case, the training sessions started at t = 1 because the thresholds reflect session-end performance, and we tested the equality of pre-training thresholds at t = 0 in additional nested model tests (see Appendix A, 1).
The four learning functions differed only in the rate of learning: the (1λ - 4β - 1α) model provided an excellent fit with rates of: βs = 1.1478, 1.3763, 1.7446, and 2.3077 (λ = 1.0984, and α = 0.0713) for zero noise (r2 = 0.9411) and βs = 0.5538, 0.7936, 1.3242, and 1.2836 (λ = 0.8979, and α = 0.3262) for high external noise (r2 = 0.9554), listed for All, Near, Far, and Single, from slower to faster. A lattice of subcase models and nested significance tests rejected more complicated models (see the discussion in Appendix A, 2, and Tables A.1 and A.2). The SDs of the estimated parameters, computed using bootstrap methods, are listed in Table A.3. The parameter SDs are relatively large (reflecting slight threshold level differences between observer groups and parameter correlations; added variance from parameter correlations was partially discounted in SDs of normalized rates). Despite this, the ordinal consistency of the four rates across the bootstrapped fits, which is perhaps more meaningful, was very high. For example, in zero noise, βAll was slower than βSingle, slower than βFar, and slower than βNear in 998, 949, and 786 fits, respectively, out of 1,000 fits to resampled data sets; and in high noise, βAll was slower than βSingle in 1,000 fits, slower than βFar in 1,000 fits, and slower than βNear in 950 fits out of 1,000 fits to resampled data sets (ordinal statistics are also listed in Table A.3). Consistent with the ANOVA tests, in high noise, βFar is slower than βSingle in only 469 out of 1,000 fits; the two rates are not significantly different from each other.
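The fitted power functions can be evaluated directly from the reported estimates. The sketch below simply plugs the λ, α, and β values quoted in the text into $c(t) = \lambda (t + 1)^{-\beta} + \alpha$; the function and variable names are illustrative, and no fitting is performed.

```python
# Parameter estimates quoted in the text for the (1 lambda - 4 beta - 1 alpha) model.
ZERO_NOISE = {"lam": 1.0984, "alpha": 0.0713,
              "beta": {"All": 1.1478, "Near": 1.3763, "Far": 1.7446, "Single": 2.3077}}
HIGH_NOISE = {"lam": 0.8979, "alpha": 0.3262,
              "beta": {"All": 0.5538, "Near": 0.7936, "Far": 1.3242, "Single": 1.2836}}


def power_threshold(t, lam, beta, alpha):
    """Predicted contrast threshold in session t (t = 0 is pre-training)."""
    return lam * (t + 1) ** (-beta) + alpha


def learning_curves(fit, sessions=range(1, 9)):
    """Return {group: [threshold for each session]} for one noise condition."""
    return {group: [power_threshold(t, fit["lam"], beta, fit["alpha"]) for t in sessions]
            for group, beta in fit["beta"].items()}


zero_curves = learning_curves(ZERO_NOISE)
rate_ratio = ZERO_NOISE["beta"]["Single"] / ZERO_NOISE["beta"]["All"]
```

Because all four groups share λ and α, the curves start from the same pre-training value λ + α and separate only through β. In zero noise the ratio of the fastest to the slowest rate, `rate_ratio`, is close to 2.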
Table A.1.
Comparisons of power-function models for the threshold learning curves in high external noise.
Parameter estimates for the (1λ - 4β - 1α) power function models in zero and high noise.
| Parameter | Estimate | Mean | St. Dev. | St. Dev.* | βAll < | βNear < | βFar < | βSingle < |
|---|---|---|---|---|---|---|---|---|
| No External Noise | | | | | Ordinal Frequencies | | | |
| λ+α | 1.1697 | 1.2192 | 0.2511 | | | | | |
| α | 0.0713 | 0.0659 | 0.0202 | | | | | |
| βAll | 1.1478 | 1.2098 | 0.3627 | 0.2231 | - | | | |
| βNear | 1.3763 | 1.4247 | 0.2937 | 0.2454 | 786 | - | | |
| βFar | 1.7446 | 1.8394 | 0.5166 | 0.2960 | 949 | 827 | - | |
| βSingle | 2.3077 | 2.4284 | 0.5923 | 0.3438 | 998 | 994 | 875 | - |
| High External Noise | | | | | Ordinal Frequencies | | | |
| λ+α | 1.2241 | 1.2192 | 0.0796 | | | | | |
| α | 0.3262 | 0.3048 | 0.0490 | | | | | |
| βAll | 0.5538 | 0.5380 | 0.1435 | 0.1193 | - | | | |
| βNear | 0.7936 | 0.7707 | 0.1571 | 0.1329 | 950 | - | | |
| βFar | 1.3242 | 1.2885 | 0.3350 | 0.2101 | 1000 | 979 | - | |
| βSingle | 1.2836 | 1.2570 | 0.2965 | 0.1918 | 1000 | 981 | 469 | - |
Note: Parameter estimates for the power function fits of the (1λ - 4 β - 1 α) model to the observed group data (Estimates), and the resampled mean (Mean), standard deviation (St. Dev.), the normalized standard deviation (St. Dev.*), and the frequency of orders from power function fits to n=1000 bootstrapped data sets (12 observers sampled at random with replacement for each condition). The normalized standard deviation partially corrects for rate βs correlations with λ and α parameters (see text). The ordinal frequencies indicate high consistency in the order of learning rates in fits to bootstrapped data sets; e.g., the count out of 1000 fits to bootstrapped in the four training groups with βFar < βSingle in zero noise was 875.
Since observers were assigned to groups randomly, the threshold performance before training was expected to be equivalent in the four groups. Consistent with this, the difference in contrasts among groups was not significant at the beginning of the first session (p > 0.05), and steadily increased throughout the session ($\eta_p^2$ = 0.015 for trials 40 to 80, pBIC(H1|D) < 0.05, whereas $\eta_p^2$ = 0.221 for the last 40 trials, pBIC(H1|D) > 0.999; differences in contrast thresholds between groups emerged as early as 200 to 300 trials of training in the first session, p values < 0.05, uncorrected for multiple tests; see Appendix A, 3 for more detailed analysis for each noise level). Additionally, the contrast thresholds of a subset of observers in the All group showed a deterioration in the last few sessions that we believe may have reflected a lack of motivation in this more challenging roving condition.
Differential learning predicted by the IRT
Tasks trained in different retinal locations interacted strongly during learning, which rules out simple forms of the sensory retuning theory, in which learning primarily reflects retuning of separate neural populations in retinotopic representations in early visual cortex. In contrast, the reweighting theory of perceptual learning, implemented computationally in the IRT (Dosher et al., 2013), predicts different empirical learning rates for the four different task intermixtures, with the same order as seen in the data: All < Near < Far ≈ Single, least to most. These different learning rates predicted by the model are induced solely through differential training in the different roving conditions.
This computational IRT model processes stimulus images through a visual front end, including normalization and gain control (Heeger, 1992), producing activities in spatial-frequency and orientation-tuned units that approximately mimic early visual cortical responses. It then weights this evidence (activation scores) in location-specific and location-invariant sensory representations to make a decision (i.e. “counter-clockwise” or “clockwise”). Then, augmented Hebbian learning (with feedback and response bias inputs) changes the readout (e.g. weights on sensory evidence) with experience on every trial. It recapitulates the behavioral experiment exactly in the simulation (e.g. stimuli, training sequence, number of training trials, randomization, and adaptive algorithm to adjust contrast). That is, it takes stimulus images as input, produces responses, and learns by adjusting weights on each trial. The resulting responses are analysed as in the experiment. (See Methods, Simulation Methods, for model specification.)
The different training experiences in the four groups interact differently in neural network weight space. 
Weights connecting a location-invariant layer to the decision unit are affected by training in all four locations, in addition to the location-specific layer trained by tasks in each location separately. The success of learning at the higher level and the amount of training also influence the simultaneously learned weights connecting each location-specific representation to decision. The four groups experience different levels of interference in the connection weights from the location-invariant representations to decision because of the different task combinations. For an intuition, consider the stimulus-response mappings in the four roving conditions: any test stimulus in one location of the All condition that maps to a “clockwise” response has two stimuli adjacent in orientation (rotation) space that map to the competing “counter-clockwise” response, leading to substantial interference; each stimulus in the Near condition has a competitive response mapping with a stimulus on one side; the stimuli in the Far condition are widely separated in orientation space and so have no near stimuli with competing responses; while in the Single condition the same stimuli and responses are trained in all locations. Although the focus here is on the weights from the location-invariant representations to decision because the experiment trains different task variants in different locations, task roving in a single location would lead to interference in the weights from both the location-invariant and the location-specific levels of representation. These intuitions were validated by computing (nearly) optimal weights for the four training groups by simulating a very large number of training trials in zero noise. 
Despite different speeds of weight development during early and middle stages of training, with corresponding predicted differences in contrast thresholds, the four conditions nearly converge after very extensive training in zero external noise (this would require an amount of training far beyond the thousands of trials for each observer in the behavioral experiment).
The IRT provides a theoretical and intuitive understanding of the nature of the interaction between tasks trained in these intermixed training paradigms. It predicts the ordinal properties of the empirical learning rates in the four intermixed training groups, and this is true for many parameter sets, although fitting the data quantitatively requires optimizing parameter values.
The best-fitting parameters were estimated through modified grid search (see Methods, Simulation). The average predicted contrast thresholds (line), and ± 1 SD (shaded areas) are shown in Figure 4, along with the behavioral contrast threshold data (symbols). (Error bars for the behavioural data are in Figure 3.) The parameters free to vary included: internal multiplicative noise σ1, additive noise σ2, decision noise $\sigma_d$, scaling factor a, the weight on feedback $w_f$, and model learning rate η. Spatial frequency and orientation bandwidths of the sensory representations were selected a priori based on the physiology and some nonlinearity parameters were set from prior model applications, with the orientation and spatial frequency bandwidths of the location-invariant representations slightly broader than those of the location-specific representations. The location-invariant internal noises were set at twice the location-specific internal noises, based on prior applications of the model. With the model constrained from physiology and prior applications, there were 6 parameters (of 20 total parameters) free to vary to optimize the fit to the 64 data points (average contrast thresholds in 8 sessions × 4 groups × 2 external noise levels). 
Table 1 shows the best-fit parameter values.
Figure 4.
Predictions of the IRT model of perceptual learning are shown for the four training groups, which differ in the intermixture of trained tasks. (a) A best fitting IRT model for the behavioral data; the model simulations take the stimulus images as input, replay the behavioral experiment, and predict contrast thresholds with exactly the same parameters for all groups. Symbols are the behavioral data (see Figure 3 for error bars); the lines are the average prediction of the best-fitting IRT, and the bands are the mean ± 1 SD of the individual simulations. (b) Selected average weights for the four training conditions (All, Near, Far, and Single) are shown at the beginning of training (gray) and at the end of training for units tuned to the relevant orientation channels at two different spatial frequencies (1.4 cycles/degree, which matches the stimulus; and 2.8 cycles/degree, control) for both location-dependent and location-invariant representations. The increased ranges reflect higher values of positive and negative weights on the decision-relevant evidence.
Table 1.
Parameters of best fit IRT (r2 = 0.919, tau = 0.870).
A single learning rate parameter generated different predicted (“empirical”) learning rates in the All, Near, Far, and Single groups. The differences in zero and high external noise thresholds also emerge naturally from the same parameter values. These best-fit model predictions provided a good quantitative fit to the behavioural contrast threshold data (Kendall's τ = 0.870, p < 0.000001; and r2 = 0.919, p < 0.000001). We also looked at a more complicated model, one that allowed small differences in internal noise parameters between groups (i.e. small differences between the groups of observers), which slightly but nonsignificantly improved the fit to the data (Kendall's τ = 0.883; r2 = 0.938) (see Appendix B, 1 and Figure B.1 for the graph of the fit, and Table B.1 for details and significance tests), but even with slight level differences between the randomly assigned groups of observers, it is the intrinsic differences in learning experiences that control the learning rates. In sum, the learning rate differences between the training groups emerge organically from the model based on the intermixture of trial experiences and are qualitatively and quantitatively consistent with the behavioural data in the experiment.
Figure B.1.
A best fitting IRT model with slightly different parameters among groups. (Parameters are listed in Table B.1.) With these six additional parameters, Kendall's τ and r2 were slightly better but the improvement was not statistically significant. The line and shaded areas from the model predictions are the mean and ± 1 standard deviation from individual simulations (n=1000).
Table B.1.
Parameters of the IRT fits with differences between groups (r2 = 0.938, tau = 0.883).
Learning altered the weights connecting the sensory representations to decision; examining how the weights change in the best-fit model can reveal aspects of the learning process in the model. Initial weights were broadly set for the task, reflecting task instructions and general knowledge of orientation. This accounts for the initial above-chance performance in the behavioral data. The weights then change with training so as to focus on the most useful information in the stimulus. The weights on units tuned to the relevant clockwise or counter-clockwise stimulus orientation and spatial frequency of the Gabor (sf = 1.4 cpd, cycles per degree) increased or decreased during training, as appropriate, resulting in increased range in weights (Figure 4b). The weights on units tuned to relevant orientations in other spatial frequency channels (e.g. sf = 2.8 cpd) were relatively unchanged. In the All condition, changes in weights connecting the location-invariant representations to decision “fight” each other from one trial to the next because adjacent orientation stimuli require opposite correct responses in other locations. This, in turn, forces learning into the location-specific representation weights in this condition. In the Near condition, fewer weights on location-invariant representations conflict for the most similar stimuli, whereas in the Far condition stimuli are sufficiently dissimilar that weights for the tasks can be nearly independent. In the Single condition, the weights on the most relevant location-invariant representation units show the largest increases, due to the consistent training in all locations. The full sets of weight changes in the best-fitting IRT model are shown in Appendix B, 2, and in Figure B.2. In short, training intermixed tasks with similar stimuli that require different responses sets the conditions for catastrophic interference, a common property in neural network learning models. 
Furthermore, this interference occurs specifically in weights on higher-level location-invariant representations in this experiment.
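The intuition about weights that "fight" on shared location-invariant units can be illustrated with a toy calculation. This is a sketch under strong assumptions and is not the IRT: it uses 12 evenly spaced orientation channels (the count matches the model description, the even 15° spacing is assumed), a hypothetical Gaussian tuning width, a hypothetical ±5° offset around each reference angle, and hypothetical separations between the reference angles of two tasks. For each task, the "ideal" weight template is taken as the population response to the clockwise stimulus minus the response to the counter-clockwise stimulus; the cosine similarity of two templates indicates whether training both tasks pushes shared weights in the same direction, in opposite directions, or along nearly unrelated directions.

```python
import math

CHANNELS = [15.0 * k for k in range(12)]  # preferred orientations in degrees (assumed spacing)
SIGMA = 15.0   # hypothetical tuning width in degrees
OFFSET = 5.0   # hypothetical CW/CCW offset around the reference angle


def circ_diff(a, b):
    """Signed orientation difference in degrees, wrapped to [-90, 90)."""
    return (a - b + 90.0) % 180.0 - 90.0


def population(theta):
    """Gaussian-tuned channel responses to a stimulus at orientation theta."""
    return [math.exp(-circ_diff(c, theta) ** 2 / (2.0 * SIGMA ** 2)) for c in CHANNELS]


def template(reference):
    """Response to the CW stimulus minus response to the CCW stimulus."""
    cw = population(reference + OFFSET)
    ccw = population(reference - OFFSET)
    return [p - q for p, q in zip(cw, ccw)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def task_similarity(separation):
    """Cosine similarity of weight templates for reference angles `separation` apart."""
    return cosine(template(0.0), template(separation))
```

In this toy setting, `task_similarity(0.0)` is 1 (identical tasks reinforce each other), `task_similarity(30.0)` is clearly negative (a nearby task pushes shared weights in the opposite direction), and `task_similarity(90.0)` is close to 0 (a distant task is nearly independent). The pattern parallels the cooperative, conflicting, and independent regimes described above, although the specific angles are illustrative only.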
Figure B.2.
Initial (blue) and final (red) weights from the best-fitting IRT simulation model (n=1000 runs) for location-specific and location-invariant representations in the four training groups (Single, Far, Near and All) as a function of orientation, shown for (a) the most relevant spatial frequency channel (sf = 1.4 cycles/deg) and (b) a less relevant channel (sf = 2.8 cycles/deg). (The IRT representations used units tuned to all combinations centered on 5 spatial frequencies and 12 orientations, for 60 channel weights in each location or 300 overall. For visual clarity, each panel shows the weights at the 12 orientations as lines.) For location-specific units, weights increased for those channels tuned to “clockwise” orientations and decreased for “counterclockwise” orientations used in the different tasks for all four groups. When the tasks in the four locations are compatible (Single), there is substantial learning in location-invariant weights, while if they are more incompatible (All), there is almost no learning in location-invariant weights. In this case, the weights in less relevant channels (b) had little change over the course of learning.
Trial-by-trial learning, behavior, and IRT model
For this experiment, it was also possible to evaluate learning based on trial-by-trial data, and this analysis revealed some additional features of learning in the four groups. Trial-by-trial and other more fine-grained analyses are only beginning to be deployed in the literature on perceptual learning (Zhang, Zhao, Dosher, & Lu, 2019). Figure 5 graphs contrasts and corresponding accuracies from the human data (a) and model-simulated predictions (b). On each trial, the adaptive algorithm (here, the accelerated stochastic approximation staircase; Kesten, 1958) determined the change in the Gabor stimulus contrast based on the accuracy of the observer's responses in order to track 75% correct. The figure shows the contrast and proportion correct for every trial for the four roving groups, separately for zero and high external noise, averaged over locations and observers (4 locations and 12 observers per group, for 48 trials per point). The vertical lines indicate the session breaks. (See Appendix C, 1, Figure C.1 for graphs with error bars on all the contrast values; the variability in proportion correct data is seen directly in the data.) The adaptive algorithm did a good job of keeping the accuracy within ±1.6σ of (binomial) variability (horizontal dashed lines) of its target value of 75% correct, except for the first few trials in the session, where contrast step sizes are quite large (step sizes rapidly decrease thereafter). The patterns in the contrast thresholds are discussed next.
Figure 5.
Trial-by-trial contrast and accuracy data averaged over test location and observers (a) and an IRT simulation of these data (b). Trial-by-trial data track the emergence of differential learning between training groups (averaged over four locations and 12 observers per group), and within-session deterioration of performance, especially obvious in later training sessions in which within-session learning is of smaller magnitude. The IRT simulation gives a good account of these trial-by-trial micro-patterns (Kendall's τ = 0.847 and percent variance accounted for r2 = 0.929). Error bars in (a) are from bootstrapping (n = 1,000) data, and in (b) are ± 1 SD of the individual simulations. For figure clarity, only error bars in the middle of a session are shown. More detailed error bars are shown in Figure C.1.
Figure C.1.
Trial-by-trial contrast data (top) and simulation (bottom) with shaded error bars. The error bars for the data are ±1 standard deviation from bootstrapping data (n=1000); the error bars for the simulation are ±1 standard deviation of the individual simulations (n=1000).
This trial-by-trial analysis reveals several fine-grained observations about the adaptive method and the learning in different conditions, as well as some within- and between-session micro-patterns that may suggest other processes during learning. To begin with, at a qualitative level, the trial-by-trial data reveal differential rates of learning in the different intermixture (roving) groups. Learning emerges in the first session in the Single and Far conditions, especially in zero noise. The Near condition starts to improve next, somewhere during the second session, and clearly by the third session. The All condition is even more delayed in showing improvements. (See Appendix C for a detailed description of the analysis and results, and contrast error bands for all trials.)
Some within-session micro-patterns are visible in the contrast data. An exploratory analysis of these micro-patterns suggested several influences working together: the adaptive algorithm, overnight consolidation of learning, and within-session deterioration. The accelerated stochastic approximation staircase, like essentially all commonly used adaptive staircases, is designed to estimate an unchanging threshold, whereas the thresholds are changing in learning experiments. (In contrast, the newer quick-Change Detection methods build learning curves into the adaptive measurements; Zhang et al., 2019.) An early dip in average contrast followed by a gradual increase is a consequence of this adaptive algorithm, as the step sizes of contrast changes go from very large to small throughout the session (see Figure C.2). Essentially, if the starting contrast in each session is a bit high, so that the first response is accurate more often, then the next trial takes a large step down in contrast, and the staircase then adjusts back up with smaller and smaller step sizes. This dip is clearly visible at the beginning of each session.
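The step-size behavior described above can be sketched with the commonly cited form of the accelerated stochastic approximation rule: the first two trials use a step of `step0 / n`, and later trials use `step0 / (2 + m)`, where `m` is the number of changes in response correctness so far. Whether the experiment used exactly this schedule is an assumption, and the initial step size `step0` and the toy observer are hypothetical.

```python
def asa_next_contrast(contrast, correct, trial, n_reversals, *, step0, target=0.75):
    """Return the contrast for the next trial.

    trial is the 1-based index of the trial just completed; n_reversals counts
    how often the response switched between correct and incorrect so far.
    """
    z = 1.0 if correct else 0.0
    divisor = trial if trial <= 2 else 2 + n_reversals
    return contrast - (step0 / divisor) * (z - target)


def run_staircase(respond, start, n_trials, *, step0, target=0.75):
    """Run the staircase against `respond(contrast) -> bool`; return all contrasts."""
    contrasts = [start]
    n_reversals = 0
    previous = None
    for trial in range(1, n_trials + 1):
        correct = bool(respond(contrasts[-1]))
        if previous is not None and correct != previous:
            n_reversals += 1
        previous = correct
        contrasts.append(asa_next_contrast(contrasts[-1], correct, trial,
                                           n_reversals, step0=step0, target=target))
    return contrasts


# Toy usage: a deterministic observer who is correct whenever contrast >= 0.3.
trace = run_staircase(lambda c: c >= 0.3, start=0.5, n_trials=20, step0=0.4)
```

With a 75% target, a correct response lowers contrast by one quarter of the current step and an error raises it by three quarters, so the expected change is zero only when accuracy is 75%. A first-trial correct response from a starting contrast that is slightly too high therefore produces the large initial drop, and later corrections are smaller because the divisor grows with each reversal.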
Figure C.2.
Simulations illustrating micro-patterns of performance of the adaptive staircase procedure, with or without learning and/or within-session deterioration. In these illustrations, the ASA (accelerated stochastic approximation) staircase is simulated with three different starting contrasts for a “true” threshold that starts at a contrast of 0.5: (a) when the threshold is stable, with no within-staircase learning or lapsing; (b) with learning (decrease of contrast threshold) but no lapsing; (c) with an increasing lapse rate but without learning; (d) when there are both a decrease of threshold and an increase of lapse rate. When compared with the trial-by-trial data, the pattern is consistent with within-block learning, especially in early blocks (b or d), while the within-block deterioration is more prominent in late blocks (c).
The contrast values at the end of the previous session were used as the starting values for the next session; apparently, they were somewhat higher than the true threshold at the beginning of the next session. This could be consistent with the often-claimed consolidation improvements during overnight sleep (Censor, Karni, & Sagi, 2006; Mednick, Drummond, Boynton, Awh, & Serences, 2008; Mednick, Nakayama, & Stickgold, 2003). This situation is likely to lead to a higher than expected number of correct responses on the first trial, and so to the dip in contrast described above. Additionally, there is a trend for deterioration in performance during each session, especially visible in later training sessions (when the amount of within-session learning is very small in the asymptotic phases). This second pattern of within-session deterioration is also consistent with some reports in the literature (Censor et al., 2006; Mednick et al., 2008).
The IRT model naturally makes trial-by-trial predictions, which are shown in Figure 5b. Because it exactly reprises the experimental protocols in the simulations, it naturally predicts the within-session microstructure associated with the adaptive method. Despite some small systematic deviations, the IRT model provided a strong account of the data (rank order correlation Kendall's τ = 0.847; p < 0.00001; and proportion variance accounted for r2 = 0.929; p < 0.00001). Note that these model predictions used the parameter values estimated from the fits to the session contrast thresholds (e.g. they involved no additional optimization to fit the trial-by-trial data), except for an added lapse rate (i.e. rate of guessing trials) that increased within each session (0.0 to 0.2) to capture within-session deterioration. 
The same IRT model without the within-session lapse rate provided a reasonable, but somewhat worse fit to the data (F (1, 7658) = 2372.9; p < 0.0001, with rank order correlation Kendall's τ = 0.835; p < 0.00001; and r2 = 0.907; p < 0.00001). Although these models differed significantly (given the very large number of data points and so degrees of freedom), our choice of the model with an increasing lapse rate was also based on the presence of visible systematic errors in prediction without it, especially at the end of later sessions. A full discussion of trial-by-trial analyses, fits of the model to trial-by-trial data, increasing internal noise as an alternative to the lapse rate, and a discussion of adaptive methods all appear in Appendix C.
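The reported model comparison can be approximately recovered from the rounded r2 values, assuming the standard nested-model F test on the proportion of variance accounted for, with one added parameter (the lapse rate) and 7,658 error degrees of freedom. The exact computation used for the paper is not given in this section, so this is a consistency check rather than a description of the authors' procedure.

```python
def nested_f(r2_full, r2_reduced, extra_params, df_error):
    """F statistic for adding `extra_params` parameters to a nested model."""
    gain_per_param = (r2_full - r2_reduced) / extra_params
    unexplained_per_df = (1.0 - r2_full) / df_error
    return gain_per_param / unexplained_per_df


# r2 with and without the within-session lapse rate, as quoted in the text.
f_value = nested_f(r2_full=0.929, r2_reduced=0.907, extra_params=1, df_error=7658)
```

The result agrees with the reported F(1, 7658) = 2372.9 to within rounding, which also shows why a modest gain in r2 (0.907 to 0.929) is highly significant: the gain is compared against the residual variance per degree of freedom, and there are thousands of degrees of freedom.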
Discussion
Summary
This study asked the question: why and under what conditions does training multiple interleaved tasks interfere with perceptual learning? Can we model these patterns of interference? To answer these questions, we manipulated the mixture of tasks trained in four groups of observers. The behaviorally observed learning rates and final performance after thousands of training trials depended on the similarity between intermixed tasks, even when trained in different retinal locations. These interactive effects were substantial, leading to an approximately two-to-one relationship in learning rates (comparing the fastest to the slowest group; e.g. Single to All).
Training each task in a different retinal location, although it allowed some learning in even the most challenging condition, did not eliminate the damage to learning caused by intermixture (roving). These results contradicted the predictions of simple forms of pure retuning theory, which attributes perceptual learning to tuning of neurons in the early retinotopic visual cortex, in which plasticity for each retinal location is separate. In contrast, the IRT (Dosher et al., 2013), using an augmented Hebbian learning rule, both qualitatively predicted and quantitatively fitted the substantial differences in learning in the four groups. In the model, tasks trained in different locations interact by shared weights from higher-level location-invariant representations to decision through either destructive interference or cooperative reinforcement. Interference is especially powerful when stimuli that are very similar require opposite responses; or conversely cooperative learning may occur when the weight structures agree. 
Plasticity must involve both lower-level location-specific and higher-level location-invariant levels in the model to account for the data.The current explanation, based on the IRT and the general framework of reweighting evidence from a hierarchy of sensory representations, bears some similarities to and some differences from earlier related network-style explanations of roving (task intermixture) disruption of perceptual learning. In particular, the research of Tartaglia, Herzog, and colleagues also proposed an important role for overlapping representations in inducing loss of learning in roving paradigms (Tartaglia et al., 2009a). On the other hand, simple neural network learning models were rejected as a class in an earlier paper due to a failure of these models (in some cases) to show disruptions in learning with roving, essentially because the learning mechanisms were too powerful; the conclusion was to discount standard network learning models in favor of reinforcement learning in which roving reflected an inability to separately track a reward expectation for similar stimuli (Tartaglia et al., 2009b).Yet, the IRT network model of perceptual learning—together with the central role of higher-level location-invariant representations in the representational architecture—provided a good account of when and how roving challenges learning. These location-invariant representations were originally proposed as the mechanism of transfer across retinal locations, to account for the patterns of transfer when both stimuli and locations might be altered in transfer tests (Dosher et al., 2013). In more recent work, we have explored other kinds of invariant representations in order to account for other forms of transfer, such as transfer from trained to untrained spatial frequency stimuli in orientation judgments, or transfer from trained to untrained orientations in spatial frequency judgments. 
All of this suggests the importance of a hierarchy of representations in many visual perceptual learning tasks. In the IRT and its subsequent variants, the location-specific and location-invariant representations have been associated with retinotopic areas of early visual cortex and higher-level areas, respectively (Dosher et al., 2013; Sotiropoulos, Seitz, & Seriès, 2018).A similar argument against early retinotopic retuning as the primary basis for perceptual learning was also proposed by Otto, Ögmen, & Herzog (2010). Their study showed that perceptual learning in an illusory Vernier task with moving flankers partially transferred to different retinal locations but not to different orientations or to a standard Vernier stimulus. Furthermore, a series of studies by Yu and colleagues used “double training” and/or “training plus exposure” paradigms to promote the transfer of ordinarily specific training effects to a different retinal location, to other visual features such as orientation, and even to different types of stimuli (Wang et al., 2016; Xiao, Zhang, Wang, Klein, Levi & Yu, 2008; Zhang & Yang, 2014). These results prompted those researchers to suggest that perceptual learning is “rule-based” and may be mediated through conceptual inference. Another roving study showed that temporal sequencing of the task variants resulted in a release from roving disruptions in learning (Kuai et al., 2005). Together with the present study, this reveals a complicated set of learning and transfer effects, separate from roving.Some of the double training and exposure transfer experiments have been successfully modeled by the IRT by us (Liu, Lu, & Dosher, 2011) or by others using a slightly more flexible IRT variant (Talluri et al., 2015). Whether all double-training phenomena can be handled with some version of a reweighting model is an open question. Examples of independent task co-learning of different tasks (e.g. 
Vernier and bisection) were modeled using distinct decision units and weights for each task (Huang et al., 2012). Release from roving disruption when tasks were temporally sequenced can be similarly modeled if each task uses distinct decision units and weights—although location separation only partially released roving disruptions in the current study. The IRT is an example of a generative model—one that makes predictions for exact experimental designs on a trial-by-trial basis. Any specific experiment might require an extension either to replace the front-end for the stimulus domain or for the task. Further testing of any specific roving or transfer phenomenon would require its own computational modeling study. It remains an open question whether a successful model would utilize additional kinds or layers of representation or, as suggested by some researchers (e.g. Wang et al., 2016), general conceptual learning is involved.
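The interference mechanism summarized above can be illustrated with a deliberately stripped-down sketch. This is not the IRT as implemented in this paper: it is noise-free, uses a plain delta-rule update as a stand-in for the augmented Hebbian rule, and all names and values (`PREFS`, `encode`, `train`, the tuning widths, the learning rate) are hypothetical. It also uses an idealized extreme of the conflict condition, in which the same two orientations require opposite responses at the two locations. The only ingredients borrowed from the text are location-specific units, more broadly tuned location-invariant units, and a shared readout to a decision.

```python
import math

PREFS = list(range(-90, 91, 15))  # preferred orientations of toy channels (deg)


def tuning(theta, sigma):
    """Gaussian orientation tuning of every channel to a stimulus at theta."""
    return [math.exp(-((theta - mu) ** 2) / (2 * sigma ** 2)) for mu in PREFS]


def encode(theta, loc, n_loc=2, sigma_spec=10.0, sigma_inv=20.0):
    """Location-specific channels (active only at loc) followed by
    broader location-invariant channels (active at every location)."""
    acts = []
    for l in range(n_loc):
        acts.extend(tuning(theta, sigma_spec) if l == loc else [0.0] * len(PREFS))
    return acts + tuning(theta, sigma_inv)


def train(task, n_epochs=200, eta=0.02):
    """Train one linear decision unit with a delta rule.

    task is a list of (orientation, location, correct response in {-1, +1}).
    Returns the mean squared error of each pass through the task.
    """
    w = [0.0] * len(encode(0.0, 0))
    curve = []
    for _ in range(n_epochs):
        total = 0.0
        for theta, loc, target in task:
            a = encode(theta, loc)
            out = sum(wi * ai for wi, ai in zip(w, a))
            err = target - out
            total += err ** 2
            w = [wi + eta * err * ai for wi, ai in zip(w, a)]
        curve.append(total / len(task))
    return curve


# Same stimulus-response mapping at both locations: shared weights cooperate.
CONSISTENT = [(-5, 0, -1), (5, 0, 1), (-5, 1, -1), (5, 1, 1)]
# Identical stimuli require opposite responses at location 1: shared weights conflict.
CONFLICTING = [(-5, 0, -1), (5, 0, 1), (-5, 1, 1), (5, 1, -1)]

consistent_curve = train(CONSISTENT)
conflict_curve = train(CONFLICTING)
print("final error, consistent mapping: ", round(consistent_curve[-1], 4))
print("final error, conflicting mapping:", round(conflict_curve[-1], 4))
```

In this toy setting both mappings are eventually learnable, because the location-specific weights can carry the conflicting task, but the conflicting mapping improves more slowly: updates to the shared location-invariant weights made at one location are undone by trials at the other.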
Relation to physiology
These findings demonstrate that retuning of early retinotopic cortex (hypothesized by many researchers) is almost certainly not the only, or even the primary, form of plasticity in perceptual learning. We suggest that this conclusion is broadly consistent with evidence from single-cell recording studies. Although small shifts in V1 (Schoups, Vogels, Qian, & Orban, 2001) and V4 (Yang & Maunsell, 2004) tuning have sometimes been reported, they are generally too small to account for the large behavioral improvements associated with perceptual learning (Dosher & Lu, 2017; Law & Gold, 2008). In contrast, neural responses measured while animals are actively performing the task seem to show larger changes at higher levels of visual cortex (in V4 (Raiguel, Vogels, Mysore, & Orban, 2006) or in LIP in motion tasks (Law & Gold, 2008)) that account for a larger portion of the behavioral responses. This pattern of physiological results is consistent with the idea that plasticity must involve representations higher in the visual cortical hierarchy or even in multi-sensory or motor decision areas (Diaz, Queirazza, & Philiastides, 2017). Overall, improved readout (reweighting) appears to be one dominant mode of perceptual learning in low-level or mid-level visual tasks, even though modest sensory retuning may sometimes occur in certain tasks (Schoups et al., 2001; Seitz & Watanabe, 2005). Indeed, changes in V1 or V2, which have been estimated to account for < 10% of the behavioral improvements, are consistent with the magnitude of learning estimated for retuning within the reweighting models (Petrov et al., 2006). As Figure 1 suggests, even if specific low-level location-specific representations do undergo retuning during learning, the different evidence or activity in these units must still be read out to make a decision and control the motor response (Dosher & Lu, 2017). Although some retuning in early retinotopic areas cannot be ruled out, it is not necessary to account for the behavioral data with the IRT model.
Conclusions
The predictions of the IRT model provided a strong qualitative and quantitative account of the behavioral data in both the session and trial-by-trial measurements. In the model, cross-location interactions reflect learned weight changes for location-specific V1-like representations and for more broadly tuned (and noisier) location-invariant representations, as in higher visual cortex (i.e. IT or possibly V4-like). Learning is disrupted if the optimized weight structures of the different tasks are in conflict (which occurs when similar sensory stimuli require opposite responses), because updating the weights on one trial may reverse weight changes from other trials, so-called catastrophic interference. The IRT provides a promising framework for predicting the behavioral effects of multiplexed training, and has also been shown to account for many phenomena of transfer (Dosher et al., 2013) and feedback in perceptual learning (Dosher & Lu, 2009; Dosher & Lu, 2017). Visual perceptual plasticity occurs at multiple levels of the visual hierarchy. Further work in physiology or brain imaging may reveal the complex regions underlying this plasticity in particular tasks.