Literature DB >> 34029315

Sequence learning recodes cortical representations instead of strengthening initial ones.

Kristjan Kalm1, Dennis Norris1.   

Abstract

We contrast two computational models of sequence learning. The associative learner posits that learning proceeds by strengthening existing association weights. Alternatively, recoding posits that learning creates new and more efficient representations of the learned sequences. Importantly, both models propose that humans act as optimal learners but capture different statistics of the stimuli in their internal model. Furthermore, these models make dissociable predictions as to how learning changes the neural representation of sequences. We tested these predictions by using fMRI to extract neural activity patterns from the dorsal visual processing stream during a sequence recall task. We observed that only the recoding account can explain the similarity of neural activity patterns, suggesting that participants recode the learned sequences using chunks. We show that associative learning can theoretically store only very limited number of overlapping sequences, such as common in ecological working memory tasks, and hence an efficient learner should recode initial sequence representations.

Entities:  

Mesh:

Year:  2021        PMID: 34029315      PMCID: PMC8177667          DOI: 10.1371/journal.pcbi.1008969

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


Introduction

Here we investigate the neural mechanism involved in learning short visual sequences. The ability to remember or to perform events or actions in the correct order is critical to the performance of almost all cognitive tasks [1]. Understanding human sequence learning mechanism is crucial not only for understanding normal cognition, but also to understand the nature of the impairments and disabilities that follow when sequence learning is disrupted [2-4]. In this study we ask whether the changes in neural activity during sequence learning reflect a particular type of optimal learning strategy. An optimal learner is an agent whose internal model reflects the statistics of the environment [5, 6], and human learning has been shown to follow the optimal model in a wide range of domains such as speech and language [7, 8], visual scenes and objects [9-13], and sensorimotor control [14, 15]. However, statistical regularities across sequences can be represented in multiple ways [1]. First, sequences can be represented as simple associations (Fig 1A and 1B) and their statistics represented by weighting the associations based on their relative frequency (Fig 1A–1C). An optimal learner would update the association weights as new data comes in to reflect the statistics of the environment. Alternatively, learning can proceed by recoding frequently occurring associations using new latent representations. The latter approach has been termed ‘chunking’ in cognitive literature [16, 17] to describe learning where complex objects (words, faces) are constructed from lower-level features (phonemes, syllables, oriented lines, [10]). The crucial difference between these two learning approaches is that the dimensionality of sequence representations changes: for associative learning the sequence representations remain the same, whilst new codes are inferred with recoding (Fig 1D). Therefore, we can dissociate between these two mechanisms by comparing neural representations of novel sequences to learned ones.
Fig 1

Sequence learning.

(A) Four Gabor patches (items used in this study) associated with four sequence positions and a matrix representation of the sequence. (B) Item-position associations in monkey prefrontal cortex as observed by Berdyyeva et al. [38]. Each subplot displays spiking activity for a particular neuron: the first one responds most to items at the beginning of a three-item sequence, the second for the ones in the middle, and the last one for items at the end of the sequence. Numbers on x-axis mark the onset of the stimulus events. (C) Visual representation of three sequences as position-item associations and the resulting frequency of associations. The frequency of associations can be learned as a model of the environment. (D) Dissociating between learning mechanisms in terms of similarity between novel and learned sequences: with associative learning (left) learned sequences share the same item codes with novel ones. Furthermore, learning reduces noise in learned sequence representations. Recoding (right) changes item representations so that novel and learned stimuli do not share representations.

Sequence learning.

(A) Four Gabor patches (items used in this study) associated with four sequence positions and a matrix representation of the sequence. (B) Item-position associations in monkey prefrontal cortex as observed by Berdyyeva et al. [38]. Each subplot displays spiking activity for a particular neuron: the first one responds most to items at the beginning of a three-item sequence, the second for the ones in the middle, and the last one for items at the end of the sequence. Numbers on x-axis mark the onset of the stimulus events. (C) Visual representation of three sequences as position-item associations and the resulting frequency of associations. The frequency of associations can be learned as a model of the environment. (D) Dissociating between learning mechanisms in terms of similarity between novel and learned sequences: with associative learning (left) learned sequences share the same item codes with novel ones. Furthermore, learning reduces noise in learned sequence representations. Recoding (right) changes item representations so that novel and learned stimuli do not share representations. Research on sequence learning has provided evidence for both learning mechanisms. Manual motor skill learning has been shown to decrease noise in learned representations [18-20] whilst not changing the representations of individual items in the sequence [21, 22]. Similarly, in the auditory domain frequently co-occurring sequence items elicit a neural response that indicates an increase in association strength [23]. Contrastingly, chunking has been observed widely in tasks where separate movements are integrated into a unified sequence [24], and in auditory-verbal sequence learning [25-27], where multiple co-occurring sequence items are bound together and recalled in all-or-nothing fashion [28, 29]. Importantly, both learning mechanisms reduce the amount information required to represent stimuli [5, 11, 30] and therefore are hard to dissociate on the basis of simple univariate learning measures. For example, several past fMRI learning studies have observed two broad effects for learned stimuli: reduction of the BOLD signal and increase in pattern separability [18, 19, 31–34]. However, such results do not inform us of the computations underpinning the learning process: any statistical learning mechanism will reduce uncertainty and hence decrease resource requirements for learned stimuli [5, 35]. Therefore broad univariate measures indicating more efficient coding of learned stimuli, such as improvement in behavioural performance, reduction in the average BOLD response, or pattern separability, are expected a priori for any learning mechanism. Contrastingly, in this study we use fMRI to ask what is the computational mechanism underpinning learning in our task, rather than where in the brain can we detect learning effects. We first formally derive the associative and recoding models in the context of Bayesian optimal learner. We show that the two accounts make dissociable predictions as to how sequences are encoded in the brain, and these predictions can be expressed in terms of the similarity of neural pattern activity. We tested these predictions in the dorsal visual processing stream using a sequence recall task together with the representational similarity analysis (RSA, [36, 37]) of fMRI data. We observed that only the recoding account can explain the similarity of neural activity patterns. Specifically, the code for sequences changed from representing novel sequences as individual items to representing them as chunks after they had been presented several times. Finally, we show that associative learning can effectively store only a limited number of similar (overlapping) sequences. Therefore an efficient learner should benefit from recoding initial sequence representations, since ecological learning tasks, such as reading or navigating, often involve a large number of multiple overlapping sequences (e.g. words, directions, recipes). Taken together our findings represent strong theoretical and empirical evidence for a specific learning mechanism: human learning of short visual sequences proceeds by recoding initial sequence representations with new ones within the same brain regions. Such recoding is necessary to enable efficient behaviour in complex tasks.

Models

To contrast the predictions of the two learning models we used a simple sequence recall task: participants were asked to recall a sequence of four Gabor patches in the order they were presented after a brief delay period using a button-box. During the experiment some of the sequences were presented only once (novel sequences) and some were presented multiple times (repeating sequences, Fig 2). Only two individual sequences were repeated and we first presented them 12 times each during a practise session. This ensured that those two individual sequences were learned to criterion before the beginning of the main experiment. The repeating and novel sequences were designed maximally dissimilar to each other so that learning of the repeating sequences would not transfer to the novel ones. We proceeded to present the two familiar repeating sequences interleaved with novel sequences (Fig 2B).
Fig 2

Task.

(A) Single trial: participants had to recall a sequence of four Gabor patches in the order they were presented after a 4.8s delay period using a button-box. The size of the stimuli within the display area is exaggerated for illustrative purposes. (B) Trial types and progression: 2/3 of the trials were repetitions of the same two individual sequences (repeating sequences), while 1/3 of the trials were novel unseen orderings of the items (novel sequences). The identity and order of repeating and novel sequences were pseudo-randomised across participants.

Task.

(A) Single trial: participants had to recall a sequence of four Gabor patches in the order they were presented after a 4.8s delay period using a button-box. The size of the stimuli within the display area is exaggerated for illustrative purposes. (B) Trial types and progression: 2/3 of the trials were repetitions of the same two individual sequences (repeating sequences), while 1/3 of the trials were novel unseen orderings of the items (novel sequences). The identity and order of repeating and novel sequences were pseudo-randomised across participants.

Sequence learning models

Here we define two sequence learning mechanisms—associative and recoding– in terms of two components: a model of representation for novel sequences and another for learned sequences. We assume that the difference between these two representations is attributable to the effect of the learning mechanism. Specifically, associative and recoding mechanisms make different predictions about the similarity between novel and repeating sequences. These predictions are formalised as representational dissimilarity matrices (RDM, Fig 3), which are then compared to the fMRI activity patterns using the representational similarity analysis (RSA, [36], Fig 3).
Fig 3

Testing the predictions of learning models using RSA.

Left: model prediction expressed as a representational dissimilarity matrix (RDM) of pairwise between-stimulus distances (the cells in the matrix display the associative learner predictions in terms of item-position associations as quantified by the Hamming distance). The small matrices on the top refer to the representations of individual sequences in the matrix form (as shown on Fig 1). For example, second cell in the first row is the predicted Hamming distance between sequences presented on trials 1 and 2. Right: RDM of measured voxel activity patterns elicited by the stimuli. The small matrices are illustrative representations of voxel patterns from an arbitrary brain region. The correlation between these two RDMs reflects the evidence for the predictive model. The significance of the correlation can be evaluated via permuting the labels of the matrices and thus deriving the null-distribution. See Representational similarity analysis (RSA) in Methods for details.

Testing the predictions of learning models using RSA.

Left: model prediction expressed as a representational dissimilarity matrix (RDM) of pairwise between-stimulus distances (the cells in the matrix display the associative learner predictions in terms of item-position associations as quantified by the Hamming distance). The small matrices on the top refer to the representations of individual sequences in the matrix form (as shown on Fig 1). For example, second cell in the first row is the predicted Hamming distance between sequences presented on trials 1 and 2. Right: RDM of measured voxel activity patterns elicited by the stimuli. The small matrices are illustrative representations of voxel patterns from an arbitrary brain region. The correlation between these two RDMs reflects the evidence for the predictive model. The significance of the correlation can be evaluated via permuting the labels of the matrices and thus deriving the null-distribution. See Representational similarity analysis (RSA) in Methods for details.

Associative learning model

An ideal learner should infer internal representations that reflect the statistics of the environment. Intuitively, associative learning can be thought of as changing the weights of associations so that they reflect the frequency of past occurrences (Fig 1C). This can be formalised for sequences using the Dirichlet-Multinomial distribution, which encodes how many times particular discrete associations have occurred: the full description of the model and its parameters can be found in Associative learning in Methods. The associative learning model makes two predictions that can expressed in terms of similarity between novel and repeated sequences (Fig 3). First, both sets of sequences should be encoded with the same sequence representation model since associative learning only changes the noise levels in the representations. In other words, if novel sequences are encoded as item-position associations, so should the learned ones. We tested this hypothesis with two classical sequence representation models: item-position associations, where sequences are formed by mapping items to their ordinal positions; and item-item associations, where consecutive items are associated with each other (see Sequence representation models in Methods). Second, the associative learner predicts that the repeating (learned) sequences should be represented with less noise: the repetition of sequences should strengthen the weights of individual associations. Therefore, noise in the neural activity patterns generated by novel sequences should be greater than for repeating sequences and hence the expected similarity between repeating and novel sequences should be greater than within novel sequences (see Associative learning predictions for RSA in Methods).

Recoding model

The recoding model posits that statistical regularities across sequences can be used to infer representations where frequently co-occurring stimuli are recoded using a single code. For example, if two individual items in a sequence occur next to each more frequently than apart, then an optimal learner should infer a model of the environment where those two adjacent items have been generated by a single latent variable. Formally, participants’ internal representations of sequences are therefore recoded as latent variables given the observed sequences: where θ is the internal latent model of a set of sequences S = {y1, …, y}. Here we call this latent representation a chunking model, in line with previous literature [1, 16, 17]. The similarity between individual sequences is here defined by the inferred chunks, and not in terms of ‘initial’ items and their positions. A chunking model θ is defined by two parameters and their probability distributions p(x, z|θ), where x is a set of individual chunks, and z a set of mappings defining how chunks are arranged together to encode observed sequences. For example, regularities within a set of two sequences S = {ABCD, CDAB} can be expressed by two chunks x = {AB, CD} and their mappings to the observed data z = {((A, B), (C, D)), ((C, D), (A, B))}. Here we represent chunks formally as n-grams: for example, a four-item sequence ABCD can be represented by a tri-gram ABC and a uni-gram D; or two uni-grams A and B and a bi-gram CD, etc. Next, we estimated the optimal chunking model for the sequences in our task: given the many possible ways sequences could be chunked, we assumed that the optimal learner would employ a chunking model that finds the most efficient encoding. The full formal description of the chunking models, their parameters, and the process of inferring the optimal model can be found in Chunk Learning in Methods. Importantly, we designed the presentation of repeating and novel sequences so that the optimal model would remain the same for every trial across the experiment: every repeating sequence was encoded with a single four-gram chunk, and every novel sequence with four uni-grams (Fig 4, bottom row). Knowing the optimal chunk representation allowed us to calculate pairwise distances between sequences as defined by their constituent chunks. The resulting RDM of n-gram distances was then compared to the neural activity patterns using the RSA method (Fig 3).
Fig 4

Sequence representation in associative and recoding models.

The figure illustrates difference in sequence representations for four individual sequences: associative representations are displayed at the top and chunk codes at the bottom row. Differently coloured letters and boxes refer to individual item codes. For the chunk recoding model (bottom) item codes reflect the optimal chunking structure for the sequences presented in our experiment. Note that the representation of the novel sequence (Trial 2) contains the same number of item codes at same positions for both models.

Sequence representation in associative and recoding models.

The figure illustrates difference in sequence representations for four individual sequences: associative representations are displayed at the top and chunk codes at the bottom row. Differently coloured letters and boxes refer to individual item codes. For the chunk recoding model (bottom) item codes reflect the optimal chunking structure for the sequences presented in our experiment. Note that the representation of the novel sequence (Trial 2) contains the same number of item codes at same positions for both models. Note that the optimal chunking model predicts the same representation for novel sequences as the associative item-position model. This is because the optimal chunking model encodes novel sequences with four one-item chunks resulting in the same number of item codes associated with the same positions (see Fig 4). In other words, both models’ predictions for novel sequence representation are the same. However, the two models make different predictions about the similarity between the repeating and novel sequences.

Results

Behaviour

We observed that novel and repeating sequences were processed differently by participants. We calculated two behavioural measures of recall for both types of sequences: how many items were recalled in the correct position, and the average time between consecutive key presses. The proportion of correctly recalled items was roughly the same for novel and repeating sequences: 0.96 vs. 0.97, with no significant difference across participants (p = 0.22, df = 21). This was expected since both novel and repeating sequences were only four items long and should therefore fit within participants’ short term memory spans. However, participants were consistently faster in recalling repeating sequences: the average time between consecutive key presses was 0.018 seconds shorter for repeating sequences (0.534 vs. 0.552 sec, t = −3.04, p = 0.007, df = 21). Next, we sought to establish how the neural representation of novel sequences differs from the repeating, learned ones: specifically, whether there is a change in representation that supports either the associative learning or recoding hypotheses.

fMRI evidence for learning models

First, we parcellated the dorsal visual processing stream bilaterally into 74 anatomically distinct regions of interest (ROI). Here we focussed on the dorsal visual pathway, which covered the expected anatomical localisation of our task variables from visual presentation to manual recall (see Discussion for further details). In every ROI we compared the predictions of the sequence learning models to the voxel activation patterns for all three task phases (presentation, delay, response). Only the recoding model predicts the similarity between novel and repeating sequences We found significant evidence for the recoding model in three brain regions: the parietal inferior-supramarginal gyrus, the postcentral sulcus, and the occipital superior transversal sulcus (Fig 5). As predicted by the recoding model, the representation of sequences in all three regions followed a model where novel sequences are encoded with four one-item chunks but repeating sequences with single chunks, indicating a change in the representational code. The evidence for the recoding model was only statistically significant for the presentation phase of the task and not during the delay or the response phases.
Fig 5

Evidence for the recoding model.

The recoding model predicted the distance between pairs of voxel activity patterns corresponding to novel and repeating sequences in three brain regions. Models are shown on the x-axis (’R’: recoding, ‘A-IP’ and ‘A-II’ are associative item-position and item-item models respectively). Y-axis displays the model fit in terms of participants’ average Spearman’s rank-order correlation (r). Dots represent individual participants’ values and error bars the standard error of the mean (SEM). Coloured dashed lines represent the lower and upper bounds of the noise ceiling for the recoding model. In all displayed plots the lower noise ceilings were significantly greater than zero across participants. The anatomical contours of the regions are superimposed on the MNI152 glass-brain template (left).

Evidence for the recoding model.

The recoding model predicted the distance between pairs of voxel activity patterns corresponding to novel and repeating sequences in three brain regions. Models are shown on the x-axis (’R’: recoding, ‘A-IP’ and ‘A-II’ are associative item-position and item-item models respectively). Y-axis displays the model fit in terms of participants’ average Spearman’s rank-order correlation (r). Dots represent individual participants’ values and error bars the standard error of the mean (SEM). Coloured dashed lines represent the lower and upper bounds of the noise ceiling for the recoding model. In all displayed plots the lower noise ceilings were significantly greater than zero across participants. The anatomical contours of the regions are superimposed on the MNI152 glass-brain template (left). No evidence for associative learning in neural representations We found no evidence for the associative learning predictions: novel and repeating sequences were not encoded similarly in any of the brain regions. To further explore this null-result, we looked at the representation of novel and repeating sequences separately. We found significant evidence for novel sequence representation in eleven regions in the dorsal visual processing stream (Table 1; see also S1 Text for plots for individual brain regions). However, in all of the regions where the associative model predicted similarity between novel sequences, it failed to predict the similarity between novel and repeating sequences (df = 21, p > 10−3). This shows that, contrary to the predictions of the associative models, repeating and novel sequences did not share a common representational code in our task.
Table 1

Representation of novel sequences.

Anatomical region suffixes indicate gyrus (G) or sulcus (S). Asterisks (*) represent significant evidence for the item-position sequence representation model reaching the lower bound of the noise ceiling in any of the three task phases: presentation, delay, and response. The lower noise ceilings were significantly greater than zero for all regions displayed in the table (df = 21, p < 10−3); see see Noise ceiling estimation in Methods for details).

NameLobePresentationDelayResponse
Central SFrontal*
Cuneus GOccipital*
Occipital Superior GOccipital*
Occipital Superior Transversal SOccipital*
Paracentral G SFrontal**
Parietal Inferior-Supramarginal GParietal*
Parietal Superior GParietal*
Postcentral GParietal*
Postcentral SParietal**
Precuneus GParietal*
Temporal Superior STemporal*

Representation of novel sequences.

Anatomical region suffixes indicate gyrus (G) or sulcus (S). Asterisks (*) represent significant evidence for the item-position sequence representation model reaching the lower bound of the noise ceiling in any of the three task phases: presentation, delay, and response. The lower noise ceilings were significantly greater than zero for all regions displayed in the table (df = 21, p < 10−3); see see Noise ceiling estimation in Methods for details). The associative learning model also predicts that the noise in the activity patterns generated by novel sequences should be greater than for repeating sequences and hence the expected similarity between repeating and novel sequences should be greater than within novel sequences. In other words, it should be easier to find evidence for associative codes between repeating and novel sequences than for novel sequences alone. Hence the lack of evidence we observe for associative learning cannot be attributed to the lack of fMRI measurement sensitivity. No behavioural evidence for associative learning There was also no behavioural evidence for associative learning: increased probability for associations present in the repeating sequences should affect novel sequences where such associations are also present. For example, repeated exposure to a sequence ABCD should also boost BDCA since C appears at the 3rd position in both. We tested this prediction by comparing response times for individual item-position associations in novel sequences: there was no advantage for those individual associations which were shared with the two repeating sequences (t = 0.28, p = 0.78, df = 21).

Model-free fMRI analyses of learning effects

We carried out two additional model-free fRMI analyses contrasting the representation of novel sequences to repeating ones. This was done to gauge how consistent our results were with previous fMRI studies which have shown two broad fMRI learning effects: reduction of the BOLD signal and increase in fMRI pattern separability for learned stimuli [31-34].

Univariate BOLD difference for learned stimuli

We carried out a whole-brain univariate analysis to test whether the average BOLD response differed between novel and repeating sequences. We found extensive bilateral reduction in the mean BOLD response for repeating sequences (Fig 6). This extended across parietal and pre-motor regions and was mostly absent in the primary visual and motor areas.
Fig 6

Univariate effects of learning.

Statistical map of t-values (magenta-cyan) of the univariate BOLD difference for learned stimuli (repeating/learned < novel sequences) superimposed on the MNI152 glass-brain template. Regions which encode both novel and repeating sequences as predicted by the recoding model are projected on top of the statistical image with solid lines (red: the parietal inferior-supramarginal gyrus; green: the postcentral sulcus; blue: the occipital superior transversal sulcus).

Univariate effects of learning.

Statistical map of t-values (magenta-cyan) of the univariate BOLD difference for learned stimuli (repeating/learned < novel sequences) superimposed on the MNI152 glass-brain template. Regions which encode both novel and repeating sequences as predicted by the recoding model are projected on top of the statistical image with solid lines (red: the parietal inferior-supramarginal gyrus; green: the postcentral sulcus; blue: the occipital superior transversal sulcus). Note that the univariate change for the repeating sequences does not address the main hypothesis of this study, neither does it provide an alternative explanation of the data. Any neural learning mechanism is expected to make representations more efficient and therefore decrease the computational and metabolic cost of inference [5, 35]. Both associative learning and recoding predict more efficient representations: we cannot dissociate between retaining the same codes (associative learner) and recoding by simply measuring behavioural improvement or total change in metabolic cost (univariate BOLD). Changes in voxel pattern noise To gain more insight into learning-induced changes we tested whether the voxel pattern distances within and between novel and repeating sequences change across the experiment. For example, do the neural voxel patterns corresponding to the two repeating sequences become more dissociable over the experiment? Specifically, we tested for significant changes in voxel pattern distance (a) between the repeating sequences, (b) within the individual repeating sequences, (c) between the repeating and novel sequences, (d) within novel sequences. For full details on the distance analyses see Model-free fMRI analyses of learning effects in Methods. We found no brain regions where any of the voxel pattern distance change measures were statistically significant across the participants (df = 21, p > 10−3). Note that these model-free analyses were fundamentally different from the RSA approach employed for the comparison of the learning models above: here we did not measure the change in distances as predicted by a learning model but instead gauged whether the variance of the voxels’ responses changed across the experiment.

Recoding provides more efficient representation of multiple sequences than associative learning

To further explore why recoding might be an advantageous learning mechanism we contrasted the effective learning capacity between the two learning models. Specifically, we estimated how much would multiple to-be-learned sequences interfere with each other. For example, a single sequence can readily be learned by strengthening the item-position associations. However, such a coding scheme would struggle to effectively learn multiple overlapping sequences. For example, two sequences ABCD and BADC can be learned simultaneously by storing position-item associations (Fig 7A), but this would result in eight association weights of equal strength to represent four sequences (ABCD, BADC and ABDC, BACD; Fig 7B). Such a learning mechanism would suffer from catastrophic interference with multiple short sequences of overlapping items. Most naturally occurring sequences (words or sentences in a language, events in tasks like driving a car or preparing a dish) do not consist of items or events which only uniquely occur in that sequence. Hence an efficient sequence learning mechanism has to be able to learn multiple overlapping sequences, which are re-orderings of the same items.
Fig 7

Interference in sequence learning.

(A) Visual representation of two sequences as position-item associations (top) and the resulting frequency of associations (bottom) as defined by the associative sequence learning model. (B) Associative learning of two sequences on panel A would boost the representations of four individual sequences despite the statistical regularities being extracted from only two. See S4 Text for a worked example. (C) Histogram of the expected number of shared codes (item-position associations, x-axis) for a single 4-item sequence with all other possible 4-item sequences (n = 256, allowing repeats), measured as a proportion of sequences sharing the same number of codes (y-axis). (D) Histogram of the shared codes for a two-item (bi-gram) chunk representation. (E) Interference between sequence representations in the item-position model. X-axis displays how many sequences have been learned and lines on the plot display the proportion of other sequences affected by learning as a function of codes shared: the lines correspond to columns in panel C. The red line shows the proportion of sequences which have been unaffected by learning. (F) Interference between sequence representations in the chunk model.

Interference in sequence learning.

(A) Visual representation of two sequences as position-item associations (top) and the resulting frequency of associations (bottom) as defined by the associative sequence learning model. (B) Associative learning of two sequences on panel A would boost the representations of four individual sequences despite the statistical regularities being extracted from only two. See S4 Text for a worked example. (C) Histogram of the expected number of shared codes (item-position associations, x-axis) for a single 4-item sequence with all other possible 4-item sequences (n = 256, allowing repeats), measured as a proportion of sequences sharing the same number of codes (y-axis). (D) Histogram of the shared codes for a two-item (bi-gram) chunk representation. (E) Interference between sequence representations in the item-position model. X-axis displays how many sequences have been learned and lines on the plot display the proportion of other sequences affected by learning as a function of codes shared: the lines correspond to columns in panel C. The red line shows the proportion of sequences which have been unaffected by learning. (F) Interference between sequence representations in the chunk model. The interference resulting from strengthening of individual associations can be quantified for each learning model in terms of shared representations causing such interference. In the associative item-position model any individual sequence will have an expected similarity to all other possible sequences (Fig 7A). For example, ABCD shares two item-position associations with DBCA, and so forth. On the average, any individual 4-item sequence encoded with the item-position model shares two items with 21% of all the other possible sequences, while only 31% share no associations and about 5% share 3 out of 4 associations (Fig 7C). As more sequences are learned the interference between stored representations will inevitably increase since the number of possible associative codes in the item-position model is limited to the number of items × positions. Fig 7E shows that the effective capacity of learning different overlapping sequences with the associative item-position model is approximately five: at that point there are no sequences left which have been unaffected by learned sequences. Contrastingly, a chunk recoding model that only uses two item chunks (bi-grams), has a markedly different expected similarity distribution (Fig 7D), resulting in significantly reduced interference between learned sequences (Fig 7F). Note that the bi-gram chunking model used here for illustrative purposes is the most limited chunking model possible: any flexible chunking model—such as the one estimated for our participants—will perform significantly better. A chunking model that is free to infer any number of chunks of any length can represent any number of multiple overlapping sequences without interference [39]. In sum, associative learning employs fewer codes but therefore necessarily loses in the representational power or ‘coverage’ over multiple overlapping sequences. Contrastingly, chunking allows emergence of more dissimilar codes which can be used to cover the space of all possible sequences with little interference.

Discussion

In the current study we contrasted two classes of sequence learning models. First, we considered associative learning that proceeds by changing the signal-to-noise ratio of existing representations. Alternatively, repeated presentations might lead the initial representations to be recoded into more efficient representations such as chunks. Both mechanisms would result in more efficient codes and improve performance in the task: by either reducing uncertainty in the internal representations (associative learning) or reducing the necessary number of associations (chunk recoding). However, the two accounts make different predictions about changes the similarity between novel and repeating sequences.

Learning induces recoding of sequence representations

We found that novel visual sequences were represented as position-item associations in a number of anatomically distinct regions in the dorsal visual processing stream. This is in line with previous research reporting that initial sequence representations are associative, binding individual events to a temporal order signal which specifies their position in a sequence [40, 41]. However, we found no evidence that repeated sequences were also represented positionally, as would be predicted by the associative learning model. Instead, we observed that learning proceeds by recoding the initial stimuli using a different set of codes. Specifically, the similarity between repeated and novel sequences followed predictions of the optimal chunking model in three cortical regions in the dorsal visual processing stream. Such flexible recoding of stimuli in response to the changing statistics of the environment is a common and often necessary feature of probabilistic learning models (see [5] for a review). However, most neural learning models assume that different populations represent different stages of learning: for example, a traditional hippocampal-cortical learning account assumes that the fast acquisition of initial associations is supported by the dynamics of the hippocampus proper while the cortical areas encode the consolidated representations [42]. Here we show that the same cortical region encodes both initial and learned representations. Our findings are also consistent with behavioural data on memory for sequences where there is evidence for the use of positional coding when recalling novel sequences [43] while learned verbal sequences show little indication of positional coding [25].

Recoding provides more efficient encoding of multiple overlapping sequences

Recoding initial sequence representations is also advantageous from an efficient coding perspective: we show that in our task an associative learner would be only able to effectively learn very few overlapping sequences, as it is limited by the space of possible associations. Contrastingly, recoding by chunking creates higher-dimensional codes, which can effectively store a limitless number of overlapping sequences [5, 10, 39]. Although higher-dimensional codes require more information to store than simpler low-dimensional associations, they are necessary to cover the vast space of possible overlapping sequences present in ecological working memory tasks such as reading, speaking, or navigating. Importantly, sequence learning through recoding is likely only one of the multiple learning processes taking place in parallel. For example, accumulating evidence points to several streams of subcortical learning—mediated by the hippocampal formation and the basal ganglia—operating simultaneously [44-47] and the effects of both types of learning can be delineated for a single task in rodents [48]. Although in our task an efficient learner would benefit from recoding overlapping sequences, recoding might not offer significant advantages over associative learning in tasks with existing high-dimensional representations [49-51].

Multiple and parallel systems for sequence learning

Our experimental task is significantly different from standard motor-sequence learning paradigms where learning proceeds through repetition of movements and consolidation can take several hours or days [24, 52, 53]. Here we used a working memory task with serial recall where individual sequences are typically learned in as few as 2–4 repeated presentations [54, 55] and learning proceeds even when no recall is attempted [56, 57]. Therefore, we think it is likely that in our task the change of representations happens at the level of working memory: the initial associative working memory representations are recoded with chunks. Second, in our task visual presentation was followed by a brief delay and then a manual reproduction of the presented sequence. Therefore, we optimised the anatomical ‘coverage’ of the brain provided by the MRI scanner for the dorsal visuo-motor areas, where the anatomical locus of task representations should be expected: perceptual representations in the primary visual cortex [58], the delay-phase activity in the occipito-parietal and motor-parietal regions [59-62], and the manual response activity in the motor and pre-motor areas [20, 59]). Our MRI coverage did therefore not reach the deep subcortical regions such as the inferior and medio-temporal lobes and the basal ganglia, which have been shown to be crucial in both initial sequence encoding and learning (see S1 Fig). For example, a well-established model of neural learning [42, 47] proposes that the initial hippocampal-dependent associative memories are recoded as they undergo consolidation resulting in more efficient cortical representations [63, 64]. Specifically, while the initial encoding of individual events or items into novel sequences is facilitated by the hippocampal formation [65-67], learning modulates the parallel cortical representations so that they become independent of the hippocampus over time: humans and animals with hippocampal lesions are unable to learn new sequences but are able to recall already learned ones [66, 68]. This suggests a dissociation between the hippocampal and cortical representations: both are created simultaneously but affected differently by learning. Although we observed learning effects in the dorsal visual stream, we think it is very likely that similar effects could be observed in subcortical and frontal regions, given the fundamental role of these brain regions in structure learning. However, further research focussing on the direct comparison between the cortical and subcortical sequence representations is necessary to shed light on this relationship.

Alternative learning models

Associative learning and recoding, as two broad model classes, cover the whole space of statistical learning mechanisms for sequence representations. This is because all learning mechanisms can be labelled according to the simple distinction of whether the initial dimensionality of representations changes or not [69]. However, within each class we could construct different implementations, which would in turn lead to different predictions of similarity between sequences. For example, a chunk recoding model using only 4-item chunks would lead to significantly different predictions in our task compared to the optimal chunking model that we used. However, we designed our study so that a single chunking model would stand out in terms of prior probability since we wanted to maximise the chances of detecting the representational change. This meant that our stimuli were suboptimal for comparing individual chunking models: instead, longer and more individual learned sequences would be required [70-72]. Another common question raised with learning studies is to what extent could the results be explained by attentional differences between novel and learned stimuli: could participants allocate fewer resources to learned sequences once they have been identified as such? In our view such attentional effects are entirely expected in any statistical learning task, given that an optimal learner should allocate resources to stimuli proportional to the amount of information they provide about the environment [35]. Therefore, we would expect to see differential resource allocation in encoding novel and learned sequences, since learned sequences have higher prior probability and thus provide relatively little new information about the environment compared to novel ones. Since the probability of learned sequences is necessarily higher for any possible learning mechanism it follows that any learning model would predict different resource requirements. Univariate effects of resource allocation—decrease in the BOLD signal or reaction times, changes in perceived task difficulty—should be expected for any learning mechanism and do not allow dissociating between different computations underlying learning. Hence here we contrasted models which predict similar differences in of resource allocation but make dissociable predictions in terms of the dimensionality of neural responses. The associative learner would not change the representations (same dimensionality) while recoding would: this distinction is orthogonal to that of resource requirements.

Conclusions

Our results suggest that humans follow an efficient sequence learning strategy and recode initial sequence representations into chunks. We found no evidence for the hypothesis that learning involves strengthening existing associations in our task. Furthermore, we show that associative learning without recoding is not theoretically capable of supporting long-term storage of multiple overlapping items. Although the initial associative representations of novel sequences may be sufficient to support immediate recall, multiple sequences can only be learned by developing more efficient higher order representations such as chunks.

Methods

Ethics statement

The study was approved by the Cambridge Local Research Ethics Committee (CPREC, Cambridge, UK; application PRE.2017.024). Every participant gave a formal written consent to take part of the study.

Participants

In total, 25 right-handed volunteers (19–34 years old, 10 female) gave informed, written consent for participation in the study after its nature had been explained to them. Participants reported no history of psychiatric or neurological disorders and no current use of any psychoactive medications. Three participants were excluded from the study because of excessive inter-scan movements (see fMRI data acquisition and pre-processing).

Task

On each trial, participants saw a sequence of items (oriented Gabor patches) displayed in the centre of the screen (Fig 2A). Each item was displayed on the screen for 2.4s (the whole four-item sequence 9.6s). Presentation of a sequence was followed by a delay of 4.8s during which only a fixation cue ‘+’ was displayed on the screen. After the delay, participants either saw a response cue ‘*’ in the centre of the screen indicating that they should manually recall the sequence exactly as they had just seen it, or a cue ‘–’ indicating not to respond, and to wait for for the next sequence (rest phase; 10–18s). We used a four-button button-box where each button was mapped to a single item (see Stimuli below). The recall cue appeared on 3/4 of the trials and the length of the recall period was limited to 7.2s. We omitted the recall phase for 1/4 of the trials to ensure a sufficient degree of de-correlation between the estimates of the BOLD signal for the delay and recall phases of the task. Each participant was presented with 72 trials (36 trials per scanning run) in addition to an initial practice session outside the scanner. In the practice session participants had to recall two individual sequences 12 times as they learned the mapping of items to button-box buttons. Participants were not informed that there were different types of trials.

Stimuli

All presented sequences were permutations of the same four items (see Sequence generation and similarity). The items were Gabor patches which only differed with respect to the orientation of the patch. Orientations of the patches were equally spaced (0, 45, 90, 135 degrees) to ensure all items were equally similar to each other. The Gabor patches subtended a 6° visual angle around the fixation point in order to elicit an approximately foveal retinotopic representation. Stimuli were back-projected onto a screen in the scanner which participants viewed via a tilted mirror. We used sequences of four items to ensure that the entire sequence would fall within the participants’ short-term memory capacity and could be accurately retained in STM. If we had used longer sequences where participants might make errors (e.g. 8 items) then the representation of any given sequence would necessarily vary from trial to trial, and no consistent pattern of neural activity could be detected. All participants learned which four items corresponded to which buttons during a practice session before scanning. These mappings were shuffled between participants (8 different mappings) and controlled for heuristics (e.g. avoid buttons mapping orientations in a clockwise manner).

Sequence generation and similarity

We permuted the same four items (oriented Gabor patches) into different individual sequences to resemble sequences in the natural world, which are mostly permutations of a small set of items or events based on the desired outcome (e.g. driving the car, parsing a sentence, etc). We chose the 14 individual four-item sequences used in the experiment (2 repeating, 12 unique) to be as dissimilar to each other as possible in order to avoid significant statistical regularities between individual sequences themselves and instead be able to introduce regularities only through repeating the individual sequences (see Chunk learning for details). We constrained the possible set of individual sequences with two criteria: Dissimilarity between all individual sequences: all sequences needed to be at least three edits apart in the Hamming distance space (see Similarity between sequence representations for details on the Hamming distance between two sequences). For example, given a repeating sequence {A, B, C, D} we wanted to avoid a unique sequence {A, B, D, C} as these share the two first items and hence the latter would only be a partially unique sequence. This would allow in chunk learning to encode both sequences with a common multi-item chunk AB. N-gram dissimilarity between two repeating sequences: the two repeating sequences shared no items at common positions and no common n-grams, where n > 1 (see Chunk learning for n-gram definition and details). This ensured that the representations of repeating sequences would not interfere with each other and hence both could be learned to similar level of familiarity. Secondly, this ensured that for the chunking model the repetitions of these two sequences were optimally encoded with two four-grams since they shared no common bi-grams of tri-grams. Given these constraints, we wanted to find a set of sequences which maximised two statistical power measures: Between-sequence similarity score entropy: this was measured as the entropy of the lower triangle of the between-sequence similarity matrix. The pairwise similarity matrix between 14 sequences has 142 = 196 cells, but since it is diagonally identical only 84 cells can be used as a predictor of similarity for experimental data. Note that the maximum entropy set of scores would have an equal number of possible distances but since that is theoretically impossible, we chose the closest distribution given the restrictions above. Between-model dissimilarity: defined as the correlation between pairwise similarity matrices of different sequence representation models (see Similarity between sequence representations). We sought to maximise the dissimilarity between model predictions, that is, decrease the correlation between similarity matrices. The two measures described above, together with the constraints, were used as a cost function for a grid search over a space of all possible subsets of fourteen sequences (k = 14) out of possible total number of four-item sequences (n = 4!). Since the Binomial coefficient of possible choices of sequences is ca 2 × 106 we used a Monte Carlo approach of randomly sampling 104 sets of sequences to get a distribution for cost function parameters. This processes resulted in a set of individual sequences which were used in the study (see S6 Text).

Structure of the trials

To test our hypotheses we split the 14 individual sequences in to two classes: two of these were repeatedly presented throughout the experiment (repeating sequences, 2/3 of the trials) while the remaining 12 were previously unseen and were only presented once (unique sequences, 1/3 of the trials). The two individual repeating sequences were chosen randomly for each participant. The two repeating sequences were also used for training before the scanning experiment (each presented 12 times). This was done to ensure that the two repeating sequences would be 12 times more likely already at the start of the experiment and stay so throughout scanning (see Optimal chunking model for details). To keep the relative probability of repeating and unique sequences fixed throughout the experiment we divided the trials into sub-blocks of three trials so that each contained a single unique sequence and two repeating sequences in random order (Fig 2B). This ensured that after every three trials the participant exposure to repeated and unique sequences was the same (2/3 repeated, 1/3 unique sequences). For MRI scanning we repeated this experimental block twice for every participant so that in a 36-trial scanning session participants recalled each unique sequence once and repeating sequences 12 times each (Fig 2B). Over two scanning sessions this resulted in 48 trials with repeating sequences and 24 trials with unique sequences.

Sequence representation models

Sequences are associative codes: they are formed either by associating successive items to each other (item-item associations) or by associating items to some external signal specifying the temporal context for sequences (item-position associations). In the case of item-position associations sequences are formed by associating items to some external signal specifying the temporal context. This context signal can be a gradually changing temporal signal [73-75], a discrete value specifying the item’s position in a sequence [76], or a combination of multiple context signals [77, 78]. Common to all of these models is the underlying association of item representations to the positional signal, forming item-position associations (Fig 1A). Alternatively, for item-item associations the weights of the associations are usually expressed in terms of transitional probabilities [1] forming a ‘chain’ of associations [79]. Past research has provided evidence for both: sequences are represented as item-position associations in rodent, primate, and humans brains [80-82] and also as item-item associations [83] depending on task type and anatomical area (see [1] for a review). For our sequence processing task (Fig 2) we model the participants’ internal sequence representations μ given the presented sequence y as Bayesian inference (Eq 2), where the posterior distribution p(μ|y) represents a participant’s estimate of the presented stimulus, and their response can be thought of as a sample from the posterior distribution: Associations between discrete variables—such as items, or items and positions—can be formalised as a multinomial joint probability distribution. The multinomial representation can in turn be visualised as a matrix where each cell describes the probability of a particular item at a particular position (Fig 1). Formally, every item x in the sequence z is represented by a multinomial variable which can take K states parametrised by a vector = (μ1, …, μ) which denotes the probability of item x occurring at any of k positions: and the whole N-item sequence z = (x1, …, x) is given by: where the represents the probability of particular item-position associations and hence must satisfy 0 ≤ μ ≤ 1 and ∑ μ = 1. Exactly the same formalism applies to item-item associations: we simply replace the set of K position variables with another identical set of N items.

Similarity between sequence representations

Item-position associations When sequences are represented as item-position associations they can be described in terms of their similarity to each other: how similar one sequence is to another reflects whether common items appear at the same positions. Formally, this is measured by the Hamming distance between two sequences: where and are the i-th items from sequences y and y of equal length k. Consider two sequences ABCD and CBAD: they both share two item-position associations (B at the second and D at the fourth position) hence the Hamming distance between them is 2 (out of possible 4). We use the between-sequence similarity as defined by the Hamming distance as a prediction about the similarity between fMRI activation patterns: if sequences are coded as item-position associations then the similarity of their corresponding neural activity patterns, all else being equal, should follow the Hamming distance. This allows us to test whether a particular set of voxels encodes information about sequences using an item-position model. Representational similarity analysis of fMRI activity patterns below provides the details of the implementation. Item-item associations Here we use n-grams as associations between multiple consecutive items to define sequences as pairwise item-item associations: a four-item sequence ABCD can be unambiguously represented by three bi-grams AB, BC, CD so that every bi-gram represents associations between successive items. The bi-gram representation of item-item associations can be used to derive a hypothesis about the similarity between sequences: the between-sequence similarity is proportional to how many common item pairs they share. For example, the sequences FBI and BIN both could be encoded using a bi-gram where B is followed by I (but share no items at common positions and are hence dissimilar in terms of item-position associations). This allows us to define a pairwise sequence similarity measure which counts how many bi-grams are retained between two sequences: where C and C are the sets of n-grams required to encode sequences S and S so that card(C ∩ C) denotes the cardinality of the union of two sets of n-grams (i.e. the number of elements in the resulting set). All possible constituent n-grams of both sequences can be extracted iteratively starting from the beginning of sequence and choosing n consecutive items as an n-gram. For bi-grams this gives: where C is a set of all possible adjacent n-grams from sequence S of length k so that every bi-gram is a pair (tuple) of consecutive sequence items (x, x). Similarly for a set of n-grams C from any sequence of length k: where n is the length of n-gram. Effectively, the n-gram similarity counts the common members between two n-gram sets. Given sequence length k this similarity can accommodate n-grams of all sizes n (as long as n ≤ k). To make the measure comparable for different values of n we need to make the value proportional to the total number of possible n-grams in the sequence and convert it into a distance measure by subtracting it from 1: where γ is a normalising constant: Effectively, the n-gram distance D counts the common members between two n-gram sets. We then used the bi-gram distance measure to derive sequence representation predictions for item-item association models. The prediction made by the n-gram distance D is fundamentally different from the prediction made by the Hamming distance D (Eq 7): the n-gram distance assumes that sequences are encoded as item-item associations whilst the Hamming distance assumes sequences are encoded as item-position associations. To understand why the item-position and item-item models make inversely correlated predictions, consider again the example given above: two sequences of same items FBI and BIF are similar from a bi-gram perspective since both could be encoded using a bi-gram where B is followed by I (but share no items at common positions and are hence dissimilar in terms of item-position associations). Conversely, two sequences FBI and FIB share no item pairs (bi-grams) and are hence dissimilar form a bi-gram perspective but have both F at the first position and hence somewhat similar in terms of the item-position model (Hamming distance). Item mixture We also defined an additional control model which tested for a null-hypothesis that instead of sequence representations neural activity could be better explained by the superposition of patterns for constituent individual items in the sequence, called the item mixture model (e.g. see [22]). This model is not a sequence representation model but rather an alternative hypothesis of what is being captured by the fMRI data. This model posits that instead of sequence representations fMRI patterns reflect item representations overlaid on top of each other like a palimpsest so that the most recent item is most prominent. For example, a sequence ABCD could be represented as a mixture: 70% the last item (D), 20% the item before (C), and so forth. In this case the mixing coefficient increases along the sequence. Alternatively, the items in the beginning might contribute more and we would like to use a decreasing mixing coefficient. If all items were weighted equally the overall representations would be identical as each sequence is made up of the same four items. Here we only considered linearly changing coefficients: we did not consider non-linear or random weights. Formally, we model an item mixture representation M of a sequence as a weighted sum of the individual item representations: where I is the four-dimensional representation of individual items in the sequence and β is the vector of mixing coefficients so that β is the mixing coefficient of the n-th item in I so that where N is the length of the sequence. The rate of change of β (to give a particular β a value) was calculated as where θ is the rate of change and α normalising constant. In this study we chose the value of θ so that β = {0, 1/6, 1/3, 1/2} represents a recency-weighted slope over individual sequence items. The reason we only tested for the ‘recency mixture’ is that the distances between mixtures only depend on the absolute value of the slope of the mixture coefficients over the items. In other words, an RDM derived with a recency-based item mixture predicts the same similarity between voxel patterns as an RDM derived with a primacy based mixture given the absolute value of the gradient slope remains the same. Here we chose a middle point between two extreme slope values: all the mixtures become the same when the slope is horizontally flat and only a single item contributes when the slope is vertical. See S2 Text for more details and a worked example. Distances between two item mixture representations M and M (Eq 10) of sequences S and S were calculated as correlation distances:

Associative learning

The optimal way of encoding how many times particular discrete associations have occurred is given by the Dirichlet-Multinomial model. In short, past occurrences of items at certain positions are transformed into probabilities, which reflect the frequency of associations. Hence associative learning can be thought of as changing the weights of associations— parameter in the multinomial model above—so that they reflect the statistics of the environment. This is achieved by deriving p() from the Dirichlet distribution: where α = (α1, …, α) denotes the effective number of observations for individual associations. The optimal internal representation of associations for a sequence is therefore given by: We could also use to introduce additional biases into the model (e.g. recency or primacy effects) but since our task has short sequences and clearly distinctive individual items such additional biases are not significant (see Behaviour in Results). In formal terms this means specifying the conjugate prior for the parameter μ of the multinomial prior distribution (Eq 4), which is given by the Dirichlet distribution: where 0 ≤ μ ≤ 1 and ∑μ = 1 and ϕ is the normalisation constant. The parameters α of the prior distribution can be interpreted as an effective number of observations x = 1, or in other words, the counts of the value of the sequence position x previously. Effectively, the conjugate prior tracks the item-position occurrence history. Since this model reflects the expected value of item-position associations it is also an optimal model of sequence representation assuming that associations are independent of each other. Using position-item associations as defined above to encode a set of individual sequences S = (BACD, CABD, ABCD) will result in a following value for reflecting the position-item counts: Here matrix rows and columns reflect the items and position variables. Changing or adding new sequences to the set will only change the probabilities or association weights but not change individual items bound by associations. This matrix is visualised for three item-position associations in Fig 1C.

Chunk learning

A chunking model θ is defined by two parameters and their probability distributions p(x, z|θ), where x is a set of individual chunks and z a set of mappings defining how chunks are arranged together to encode observed sequences. Set of chunks We represent chunks formally as n-grams (used from hereon synonymously with the term ‘chunk’) that can take any length up the maximum possible length of a sequence to be encoded. For illustrative purposes we denote the individual items in our sequences here with letters: a four-item sequence of Gabors can be written as ABCD and in turn be represented by a tri-gram ABC and a uni-gram D. For 4-item sequences the set of all possible n-grams has the number of P members as the sum of partial permutations of n = 4 items taken k = {1, 2, 3, 4} at a time: A set of chunks x comprises J n-grams where each constituent n-gram c appears only once: x = {c1, …, c}, and 1 < J < P; for example x = {AB, BA, A, B, CDA, ACDB}. Each individual n-gram has a probability inversely proportional to its combinatorial complexity. Specifically, the probability of a particular n-gram c is proportional to the reciprocal of the number of partial permutations of n = 4 items taken k at a time: where k is the length of the n-gram and α is the normalising constant. For example, there are 4 possible uni-grams for a 4-item sequence, but 12 bi-grams, 24 tri-grams, etc. Hence the probability of a uni-gram is 3 times greater than a bi-gram and so forth. This also captures the intuition that longer and more complex chunks should be less likely than simple chunks. We also assume that chunks are independent each other and hence the probability of a set of n-grams x defined by the chunking model is the product of its constituent chunk probabilities: Mappings between chunks The second parameter of the chunking model describes how individual n-grams are combined together to encode the observed sequences. For example, given a single sequence ABCD and a set of n-grams x = {AB, BC, CD, A, B, C, D} we can encode the data as (AB, CD) or (A, BC, D) as both mappings are capable of representing the observed data without error. For any 4-item sequence there are eight possible ways n-grams can be linked together to reproduce the observed sequence. These mappings can be described as a set of eight tuples Z = {g1, …, g8}, where each tuple defines F ≤ 4 links g = ((l1), …, (l)) that exhaustively define all possible n-gram parses of a 4-item sequence: For example, given a sequence ABCD, the first tuple in the set g1 = ((1,), (2,3), (4,)) corresponds to linking three individual n-grams as ((A), (B, C), (D)). The mappings g in Z differ in terms of how many links are required to encode the sequence: for example, the first mapping comprises three links, the second two, and the fifth four. The probability of each mapping is a product of it’s individual link probabilities: where F is the number of links in the mapping g and η is a probability of each link which we assume to be constant (the reciprocal of the number of possible links). Such inverse relationship between the probability of a mapping and its length captures the intuition that complex mappings between multiple n-grams should be less likely than simple ones. Note that for longer sequences a different relationship might be defined as the ability to combine chunks is limited by human short term memory capacity which sets an effective limit to the length of sequence that can be encoded [84-86]. For a particular model θ the mapping parameter z defines how n-grams are combined together to generate observed sequences. For example, consider two models and a set of sequences S = {ABCD, ABCD, CDAB}. Both models use the same set of n-grams x = {AB, CD, A, B, C, D}, but encode the observed sequences differently: Although these two models are equally likely in terms of the chunks they employ, their mappings have different probabilities. The probability of mappings defined by a particular model is equal to the product of mappings for individual sequences: where M is the number of mappings (sequences encoded). For any model θ the probability of both parameters—n-grams and mappings—are assumed to be independent of each other and therefore the probability of any particular model is the product their parameter probabilities: where J and M are the number of n-grams and mappings. Therefore every model and its two parameters propose an encoding based on some chunks (e.g. examples above), which can subsequently compared to the observed data.

Optimal chunking model

The optimal model can be estimated by either randomly sampling the parameter distributions or using a systematic approach. Here we found the optimal model by only considering models that result in parsing the set of sequences into chunks and creating a ranking based on model evidence: see S3 Text for details. Furthermore, we designed the experiment so that the only regularities between the sequences that could be encoded with common chunks arose from repeating the same two sequences: this ensured that we effectively knew the optimal model beforehand. To recall, our task included 14 individual sequences made maximally dissimilar to each other with two of them repeated on 2/3 of the trials. We first presented the two repeating sequences 12 times each during the practice session immediately before the experiment. Since those two sequences shared no common bi-grams or tri-grams (see Sequence generation and similarity) the only efficient chunk encoding for the set of repetitions of those two sequences comprised just two four-item chunks (Fig 8A). Using 12 repetitions in the training session was enough to make the four-gram representation the most likely one (Fig 8B). Therefore at the start of the experiment (before the first trial t = 1) the optimal chunking model had a set chunks defined as: and a set of mappings where the set of 24 sequences (made up of just two individual sequences) were encoded with the same one-on-one mappings.
Fig 8

Optimal chunking model.

(A) Evidence for three alternative chunking models and their components at the beginning of the scanning experiment, when participants had seen the two repeating sequences 12 times each during the practice session. The three models use only single type n-grams: the 1-gram model encodes sequences using four single-item n-grams, 2-gram model with two bi-grams, and the 4-gram model with a single four-gram. The left panel shows the probability of the set of n-grams (code) each model specifies in terms of their negative log values. The centre panel shows the probability of their mappings (encoding) and the right panel the combination of the two into model evidence. The blue and red parts of the model evidence bar represent model code (n-grams) and encoding (mappings) probabilities in terms of their negative logs and the total length of the bar displays the model evidence as their sum. This allows intuitive visualisation of the code-encoding trade-off calculated by the Bayesian model comparison. The 4-gram model is the optimal model at the start of the experiment. (B) Model evidence across trials. X-axis shows the trial number and y-axis shows the log of model evidence. The optimal model is inferred at every trial; the 1-gram model encodes sequences only with four uni-grams, and the 4-gram model only uses four-grams. Note that at the beginning of the experiment the 4-gram model is equivalent to the optimal model: however, as new sequences are presented the optimal model encodes new data with shorter chunks (uni-grams) while the 4-gram model encodes new unique sequences with four-grams. Note that as new data is observed the evidence for any particular model decreases as the set of data becomes larger and the space of possible models increases exponentially. Also note that the log scale transforms the change of evidence over trials into linear form.

Optimal chunking model.

(A) Evidence for three alternative chunking models and their components at the beginning of the scanning experiment, when participants had seen the two repeating sequences 12 times each during the practice session. The three models use only single type n-grams: the 1-gram model encodes sequences using four single-item n-grams, 2-gram model with two bi-grams, and the 4-gram model with a single four-gram. The left panel shows the probability of the set of n-grams (code) each model specifies in terms of their negative log values. The centre panel shows the probability of their mappings (encoding) and the right panel the combination of the two into model evidence. The blue and red parts of the model evidence bar represent model code (n-grams) and encoding (mappings) probabilities in terms of their negative logs and the total length of the bar displays the model evidence as their sum. This allows intuitive visualisation of the code-encoding trade-off calculated by the Bayesian model comparison. The 4-gram model is the optimal model at the start of the experiment. (B) Model evidence across trials. X-axis shows the trial number and y-axis shows the log of model evidence. The optimal model is inferred at every trial; the 1-gram model encodes sequences only with four uni-grams, and the 4-gram model only uses four-grams. Note that at the beginning of the experiment the 4-gram model is equivalent to the optimal model: however, as new sequences are presented the optimal model encodes new data with shorter chunks (uni-grams) while the 4-gram model encodes new unique sequences with four-grams. Note that as new data is observed the evidence for any particular model decreases as the set of data becomes larger and the space of possible models increases exponentially. Also note that the log scale transforms the change of evidence over trials into linear form. We proceeded to estimate the optimal chunking model at every trial t as the set of sequences was updated with newly observed stimulus. For this purpose we kept the statistical structure of the sequences fixed across the experiment: otherwise an optimal model on trial one would be different to the one on the last trial. Therefore we organised the order of trials so that on the average there was a single unique sequence and two repeating sequences in three consecutive trials. This ensured that after every three trials the participant exposure to repeated and unique sequences was roughly the same. Since the unique sequences shared no significant statistical regularities with each other or with the repeating sequences they could not have been efficiently encoded with common n-grams (n > 1). Therefore, at trial t = 1 the optimal model to encode the previously seen repeating sequences and the new unique one included the previously inferred two four-grams and additionally four single item uni-grams: Since we kept the statistical structure of the sequences fixed across the experiment this ensured that the optimal model would remain the same for every trial across the experiment: on a trial when a repeating sequence was presented it was encoded with a single four-gram chunk, and when a unique sequence was presented it was encoded with four uni-grams, as shown on Fig 8B.

Representational similarity analysis (RSA)

Representational similarity analysis of fMRI activity patterns

First, we created a representational dissimilarity matrix (RDM) S for the stimuli by calculating the pairwise distances s between sequences {N1, …, N}: where s is the cell in the RDM S in row i and column j, and N and N are individual sequences. D(N, N) is the distance measure corresponding to any of the sequence representation models tested in this study: The item-position model: Hamming distance (Eq 7) The item-item model: bi-gram distance (Eq 9) The item mixture model: the item mixture distance (Eq 11) The optimal chunking model: n-gram distance (Eq 9) between the individual chunks Next, we measured the pairwise distances between the voxel activity patterns: where a is the cell in the RDM A in row i and column j, and P and P are voxel activity patterns corresponding to sequences N and N. As shown by Eq 22, the pairwise voxel pattern dissimilarity is calculated as a correlation distance. We then computed the Spearman’s rank-order correlation between the stimulus and voxel pattern RDMs for every task phase p and ROI r: where ρ is the Pearson correlation coefficient applied to the ranks rS and rA of data S and A. Finally, we tested whether the Spearman correlation coefficients r were significantly positive across participants (see Significance testing below). The steps of the analysis are outlined on Fig 3.

Noise ceiling estimation

Measurement noise in an fMRI experiment includes the physiological and neural noise in voxel activation patterns, fMRI measurement noise, and individual differences between subjects—even a perfect model would not result in a correlation of 1 with the voxel RDMs from each subject. Therefore an estimate of the noise ceiling is necessary to indicate how much variance in brain data—given the noise level—is expected to be explained by an ideal ‘true’ model. We calculated the upper bound of the noise ceiling by finding the average correlation of each individual single-subject voxel RDM (Eqs 21 and 22) with the group mean, where the group mean serves as a surrogate for the perfect model. Because the individual distance structure is also averaged into this group mean, this value slightly overestimates the true ceiling. As a lower bound, each individual RDM was also correlated with the group mean in which this individual was removed. We also tested whether a model’s noise ceiling was significantly greater than zero. We first Fisher transformed individual Spearman’s rank-order correlation values and then performed a one-sided t-test for the mean being greater than zero. The 5% significance threshold for the t-value was corrected for multiple comparisons (number of regions). For more information about the noise ceiling see [36]. In sum, we only considered a model evidence to be significant if it satisfied three criteria: (1) the model correlation with data was significantly greater across participants than the lower bound of the noise ceiling, (2) the lower bound of the noise ceiling was significantly greater than zero across participants, and (3) the average correlation for the item-mixture model (null-hypothesis) did not reach the noise ceiling.

Associative learning predictions for RSA

Associative learning makes two predictions: learning doesn’t change individual item representations and learning reduces noise in sequence representations. These hypotheses can be tested by measuring the similarity between neural activation patterns elicited by novel and repeating sequences. Noise in sequence representations can be estimated by assuming that the voxel pattern similarity A (Eq 21) is a function of the ‘true’ representational similarity between sequences S plus some measurement and/or cognitive noise ν: A = S + ν. Here the noise ν is the difference between predicted and measured similarity. Note that this is only a valid noise estimate when the predicted and measured similarity are significantly positively correlated (i.e. there is ‘signal’ in the channel). If learning reduces noise in sequence representations then the noise in activity patterns generated by novel sequences ν should be greater than for repeating sequences ν. To test this we measured whether the activity patterns of repeating sequences were similar to novel sequences as predicted by the Hamming distance. The analysis followed exactly the same RSA steps as above, except instead of carrying it out within novel sequences we do this between novel and repeating sequences. First, we computed the Hamming distances between individual repeating and novel sequences S, next the corresponding voxel pattern similarities A and finally computed the Spearman’s rank-order correlation between the stimulus and voxel pattern RDMs exactly as above (Eq 23). If this measured correlation is significantly greater than the one within novel sequences (r > r) across participants, it follows that the noise level in repeating representations is lower than in novel representations. This analysis was carried out for all task phases and in all ROIs and the outcome could fall into one of three categories: No significant correlation: the probability of r is less than the significance threshold (p < 10−4; see Significance testing below). This means that repeating sequences are not represented as item-position associations in this ROI and hence the test for noise levels is meaningless. Significant correlation, but consistently smaller across participants than the within-novel sequences measure: r < r. repeating sequence representations are noisier than novel sequence representations. Significant correlation, but consistently greater across participants than the within-novel sequences measure: r > r. repeating sequence representations are less noisy than novel sequence representations. To confirm that our assumptions regarding the effects of noise on sequence representation were correct we estimated the fMRI measurement noise for the participants in our task and tested to what degree the noise should be reduced (or signal-to-noise ratio increased) in the fMRI patterns for the changes to be detectable with the representational similarity analysis. The details of this simulation can be found in S5 Text.

Recoding predictions for RSA

The recoding model predicts that the repetition of individual sequences should recode individual associations of those sequences. We assumed that participants were ideal learners and inferred an optimal chunking model based on the statistics across previously seen sequences (Eq 20). See the Chunk learning and Optimal chunking model sections for estimation details. Importantly, we designed the presentation of repeating and novel sequences so that the optimal model would remain the same for every trial across the experiment: every repeating sequence was encoded with a single four-gram chunk, and every novel sequence with four uni-grams (Fig 4, bottom row). For every participant we then estimated an RDM predicting the distances across novel and repeating sequences using the n-gram distance method described above (Eq 9). First, we computed the n-gram distances between individual repeating and novel sequences S, next the corresponding voxel pattern similarities A and finally computed the Spearman’s rank-order correlation between the stimulus and voxel pattern RDMs exactly as above (Eq 23). All analyses were carried out with pre-processed data as detailed in the Functional data pre-processing section, expect for the univariate analyses, where we used regression coefficient values (beta values) the group level analysis.

Univariate analysis of novel vs. learned sequences

We fitted a GLM where every trial was labelled either novel or learned resulting in two regressors (beta values), plus additional nuisance regressors as described in the Functional data pre-processing section. The average difference between participants’ beta values for every voxel was then tested with 2-sided t-test and controlled for multiple comparisons using the family-wise error rate as implemented in the SPM-12 package.

Changes in pattern distances across the experiment

Between two repeating sequences We computed voxel pattern distance between each of the Nth repetition of the two repeating sequences and estimated a slope across repetitions (least squares linear regression) to see whether there was a significant change in distance across trials. Fig 9 below displays this for a single participant and region: y-axis displays the distance value, while x-axis the trial number. For example, the data point at x = 1 represents the distance between two individual repeated sequences R1 and R2 at their first presentation, and all 12 data points are calculated as: where superscript denotes the repetition number and subscript the identity of the sequences. In all distance analyses we used the cosine distance between two patterns u and v defined as:
Fig 9

Change in pattern distance across trials for a single participant and region.

Y-axis displays the distance value, while x-axis the trial number.

Change in pattern distance across trials for a single participant and region.

Y-axis displays the distance value, while x-axis the trial number. We then tested whether the participants’ slope values were significantly different from zero for all ROIs. The significance threshold was α = 0.05/(Number of ROIs). Within repeating sequences We measured whether voxel pattern distances within the individual two repeating sequences changed significantly across the experiment. This test measured whether there was a change in distances between consecutive presentations of the same individual repeated sequence: As with the previous analysis, the participants’ individual slopes—averaged across the two repeating sequences—were included in the group level t-test for every ROI. Between the two repeating sequences and the unique sequences We tested whether the distance between N-th repetition of the repeating sequence R and the twelve unique sequences U1, …, U12 changed significantly across the experiment. Within unique sequences We tested whether there was a change in pattern distances across successive presentations of individual unique sequences.

Behavioural measures

Significant differences in behavioural measures across participants were evaluated with a t-test for dependent measures. We chose not to inverse-transform reaction time data following recent advice by Schramm and Rouder [87] (see also [88]).

fMRI data acquisition and pre-processing

Acquisition

Participants were scanned at the Medical Research Council Cognition and Brain Sciences Unit (Cambridge, UK) on a 3T Siemens Prisma MRI scanner using a 32-channel head coil and simultaneous multi-slice data acquisition. Functional images were collected using 32 slices covering the whole brain (slice thickness 2 mm, in-plane resolution 2 × 2 mm) with acquisition time of 1.206 seconds, echo time of 30ms, and flip angle of 74 degrees. We used a multi-band factor of 2, partial Fourier 7/8, and applied no parallel imaging techniques (partial Fourier only in the form of in-plane acceleration). In addition, high-resolution MPRAGE structural images were acquired at 1mm isotropic resolution. (See http://imaging.mrc-cbu.cam.ac.uk/imaging/ImagingSequences for detailed information). Each participant performed two scanning runs and 510 scans were acquired per run. The initial ten volumes from the run were discarded to allow for T1 equilibration effects. Stimulus presentation was controlled by PsychToolbox software [89]. The trials were rear projected onto a translucent screen outside the bore of the magnet and viewed via a mirror system attached to the head coil.

Anatomical data pre-processing

All fMRI data were pre-processed using fMRIPprep 1.1.7 [90, 91], which is based on Nipype 1.1.3 [92, 93]. The T1-weighted (T1w) image was corrected for intensity non-uniformity (INU) using N4BiasFieldCorrection ([94], ANTs 2.2.0), and used as T1w-reference throughout the workflow. The T1w-reference was then skull-stripped using antsBrainExtraction.sh (ANTs 2.2.0), using OASIS as target template. Brain surfaces were reconstructed using recon-all ([95], FreeSurfer 6.0.1), and the brain mask estimated previously was refined with a custom variation of the method to reconcile ANTs-derived and FreeSurfer-derived segmentations of the cortical grey-matter of Mindboggle [96]. Spatial normalisation to the ICBM 152 Nonlinear Asymmetrical template version 2009c [97] was performed through nonlinear registration with antsRegistration ([98], ANTs 2.2.0), using brain-extracted versions of both T1w volume and template. Brain tissue segmentation of cerebrospinal fluid (CSF), white-matter (WM) and grey-matter (GM) was performed on the brain-extracted T1w using fast ([99], FSL 5.0.9).

Functional data pre-processing

The BOLD reference volume was co-registered to the T1w reference using bbregister (FreeSurfer) using boundary-based registration [100]. Co-registration was configured with nine degrees of freedom to account for distortions remaining in the BOLD reference. Head-motion parameters with respect to the BOLD reference (transformation matrices and six corresponding rotation and translation parameters) were estimated using mcflirt ([101], FSL 5.0.9). The BOLD time-series were slice-time corrected using 3dTshift from AFNI [102] package and then resampled onto their original, native space by applying a single, composite transform to correct for head motion and susceptibility distortions. Finally, the time-series were resampled to the MNI152 standard space (ICBM 152 Nonlinear Asymmetrical template version 2009c, [97]) with a single interpolation step including head-motion transformation, susceptibility distortion correction, and co-registrations to anatomical and template spaces. Volumetric resampling was performed using antsApplyTransforms (ANTs), configured with Lanczos interpolation to minimise the smoothing effects of other kernels [103]. Surface resamplings were performed using mri_vol2surf (FreeSurfer). Three participants were excluded from the study because more than 10% of the acquired volumes had extreme inter-scan movements (defined as inter-scan movement which exceeded a translation threshold of 0.5mm, rotation threshold of 1.33 degrees and between-images difference threshold of 0.035 calculated by dividing the summed squared difference of consecutive images by the squared global mean).

fMRI event regressors

To study sequence-based pattern similarity across all task phases we modelled the presentation, delay, and response phases of every trial (Fig 2A) as separate event regressors in the general linear model (GLM). We fitted a separate GLM for every event of interest by using an event-specific design matrix to obtain each event’s estimate including a regressor for that event as well as another regressor for all other events (LS-S approach in Mumford et al. [104]). Besides event regressors, we added six head motion movement parameters and additional scan-specific noise regressors to the GLM (see Functional data pre-processing above). The regressors were convolved with the canonical hemodynamic response (as defined by SPM12 analysis package) and passed through a high-pass filter (128 seconds) to remove low-frequency noise. This process generated parameter estimates (beta-values) representing every trial’s task phases for every voxel. The beta values were then transformed to t-values to provide noise-normalised measure of voxel activity for the event. We segmented each participants’ grey matter voxels into anatomically defined regions of interest (ROI, n = 74). These regions were specified by the Destrieux [105] brain atlas and automatically identified and segmented for each participant using mri_annotation2label and mri_label2vol (FreeSurfer).

Significance testing

We carried out the representational similarity analysis for every task phase (encoding, delay, response; n = 3) and ROI (n = 74 for RSA). To test whether the results were significantly different from chance across participants we used bootstrap sampling to create a null-distribution for every result and looked up the percentile for the observed result. We considered a result to be significant if it had a probability of p < α under the null distribution: this threshold α was derived by correcting an initial 5% threshold with the number of ROIs and task phases so that α = 0.05/74/3 ≈ 10−4. We next shuffled the sequence labels randomly to compute 1000 mean RSA correlation coefficients (Eq 23). To this permuted distribution we added the score obtained with the correct labelling. We then obtained the distribution of group-level (across participants) mean scores by randomly sampling mean scores (with replacement) from each participant’s permuted distribution. The number of random samples for the group mean distribution was dependent on the significant probability threshold: we took n = 10/α samples so that the number of samples was always an order of magnitude greater than the required resolution for the group chance distribution. Next, we found the true group-level mean score’s empirical probability based on its place in a rank ordering of this distribution.

Neural representation of novel sequences.

(PDF) Click here for additional data file.

Item mixture model parameters.

(PDF) Click here for additional data file.

Optimal chunking model estimation.

(PDF) Click here for additional data file.

Associative learning of overlapping sequences.

(PDF) Click here for additional data file.

Simulation of expected changes in pattern similarity.

(PDF) Click here for additional data file.

Individual sequences used in the task.

(PDF) Click here for additional data file.

Anatomical coverage of MRI functional scans.

(PDF) Click here for additional data file. 8 Feb 2021 Dear Kalm, Thank you very much for submitting your manuscript "Sequence learning recodes cortical representations instead of strengthening initial ones" for consideration at PLOS Computational Biology. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. My apologies for taking such a long time to come back to you. I had difficulties securing a third reviewer (blame the Christmas holidays), to the extent that I was still looking for one when the first two reviewers had submitted their evaluations. Since those reviews turned out to be highly detailed and consistent in their recommendation, I decided to move forward without a third reviewer. As you will see, both reviewers were generally enthusiastic about your manuscript, but they also had a number of serious concerns that should be addressed. I will highlight the main issues here, but please see the full reviews below for more details and a range of smaller concerns that have been raised. The first reviewer – an expert in cognitive modeling – found the manuscript difficult to understand, both in terms of its organization and the presentation and role of models in this study. I have a similar background as this reviewer and experienced the same problem. I believe that the various suggestions made by the reviewer can help to significantly improve the clarity of the modeling part of the paper. (Interestingly, the second reviewer – an expert in neuroimaging – indicated that the manuscript is well written and organized; hence, it seems that the manuscript is currently well adapted to a neuroimaging audience, but less so to a modeling audience.) The main concern of the second reviewer is that inferences about the neural mechanisms are a bit surprising and possibly too strong due to the possibility of alternative explanations. The reviewer makes several suggestions to address this concern. An additional comment that I have is that it could be useful to include a brief discussion about the possible role of long-term memory in these results. One question that I kept asking myself is whether part of the results may be explained as a consequence of a shift from working memory to long-term memory in trials with repeated sequences, while the novel sequences would always have to rely solely on working memory. I don’t know whether long-term memory processes should be seen as an alternative explanation or rather as a possible mechanism behind the recoding hypothesis. Some discussion on this would be helpful. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Ronald van den Berg Associate Editor PLOS Computational Biology Samuel Gershman Deputy Editor PLOS Computational Biology *********************** Reviewer's Responses to Questions Comments to the Authors: Reviewer #1: # Summary: In this paper the authors compare an associate learning account with a recoding learning account for learning of short sequences (of length 4) using fMRI and representational similarity analysis (RSA). To this end, the predicted similarities across all items of four different representation models are created: an item-position model, an item-item model, an item mixture model, and an optimal chunking model. All of these are in some sense optimal models that produce their predictions without any free parameters (i.e., if a model has parameters, these are fixed to constants). The resulting similarity matrices are compared with the representational similarity matrix of the fMRI. The only prediction matrix that shows a significant relationship with the observed fMRI similarity is the optimal chunking model suggesting that only recoding explains the fMRI data. This results is further supported by a theoretical analysis (which shows that associative learning does not make much sense in the first place) and further fMRI data. # Evaluation: I really like the idea of the paper, the design of the study, the analysis, and the results. It also addresses a current research question in an active research area and does so using state-of-the-art methods, so I would like to see it published in a first class venue such as PLOS Comp Biol. However, I found the paper rather difficult to understand. One reason for this might be my lack of familiarity with the relevant fMRI/RSA literature, but I expect that with a better organisation of the manuscript could be made more accessible for a larger audience. In general I think there are some details missing from introduction and results that should be added to make the exposition clearer. As someone with some experience in cognitive modelling, I had some problem understanding the role the models played here. To me, a model is usually something that has free parameters that are fitted to the data, but this is not the case here. All models make deterministic predictions a priori. I am not sure if I missed this in the manuscript, but it only became clear in the method section and was necessary in understanding the results. Maybe this could be presented clearer. There are also some phrases that seem to be somewhat misleading about the role of the models. For example, line 175 reads: "The resulting RDM of n-gram distances was then fit with neural activity patterns using the RSA method." However, the "fitting" here is the calculation of the distance between two similarity matrices using rank correlations, which serves here as a measure of model adequacy, and not the more common procedure in which free parameters are adjusted to make the prediction of the model be nearer to the observed data. Similar phrases (i.e., estimated or fitted) can be found in lines 101, 161, and 252. Likewise, Figure 4 refers to "Bayesian model comparison". In most contexts this phrase indicates that different models are compared against the data, but here it is only used to determine the structure of the the optimal chunking (i.e., recoding) model before any data is collected. I generally fund the organisation of the results section somewhat unclear as it jumps between different type of results (i.e., the main RSA results and other fMRI results) and is organised across models. Maybe a clearer organisation in which results are presented one by one for each analysis (and not for each model) might be better. For example, the main result seems to be that only the similarity matrix of the optimal chunking/recoding model shows a significant relationship with the observed similarity matrix, but not for the other three accounts. However, the only graphical representation of this result is Figure 5 which only contrasts 2 models and not all 4. I also found the additional prediction and analysis with respect to the noise activity the most difficult part of the analysis. Maybe this could be presented after the description of the main RSA results. On the positive side, the implementation of the four sequence representation models appears to be done very thoroughly and I do not see any large issues. The one thing I think would be interesting to discuss if there is the possibility that additional free parameters (e.g., learning rates or decay) could alter any of the predictions of the models substantially. I do not suggest that such a model should be analysed (it would also be unclear how), but rather whether the deterministic predictions are really the only plausible way associations models could create similarities. I list a few more comments below that I had while reading the manuscript: - The description of the results is difficult to understand without reading the method section. Perhaps a few more information about the task could be given. For example, I had the following two questions: What was the task of the participants? Does the task consist of two parts (initial learning phase without fMRI followed by final part with fMRI) or just one part? - A response time difference of 0.018 seconds (i.e., 18 milliseconds) does not seem to impressive. What is the average RT? - Fig. 1: I do not know the term "multinomial matrix representation". I understand the figure, but the term is at least uncommon to me. - Fig 2B. Either there is an error in the id column (trial 5 seems to have the wrong id) or I am misunderstanding what it means. - Fig. 3 (left): What is the distance measure used to derive the predicted distance between two sequences? - line 156: "recoded inferred": only one of the two words is necessary - Fig. 4: What does "Repeating 2" under "Trial 3" mean? - lines 373 - 377: It is a bit unclear if the randomisation guarantees that a repeating sequence would always occur in any series of three trials - with the consequence that if one hasn't seen a repeating sequence in the last two trials the next trial would be a repeating sequence with certainty - or whether a repeating sequence would only certainly follow after four non-repeating sequences. - Line 365: I think it would be good to alert the reader here that the reason why there are only 14 individual sequences is revealed later on. I spent too long wondering why 4! would result in only 14 sequences. Reviewer #2: The present study investigates the neural basis of sequence learning. In particular, two models of sequence learning, associative and recoding (chunking), are formulated and their fit to fMRI data is compared. Throughout several areas in the dorsal visual stream, the recoding model shows a superior fit to the fMRI data. Additionally, the authors provide model evidence that the associative learning model suffers from significant interference when many possible sequences are learned. Combined, the authors conclude that only the recoding model can account for visual sequence learning. The manuscript is well written and organized. The models appear well specified and the fMRI data analysis is well conducted, using state of the art tools and methods. Overall, the analyses and their description are of high quality. I also appreciate the authors’ commitment to open science by sharing all data and analysis code. The major strength of the present study over most previous ones is the explicit formulation of the learning models, which promises to yield significant additional insight into the neural mechanisms underlying human sequence learning. However, I do have some concerns with the design and interpretation of the results, questioning whether the study truly achieves to provide this additional insight. Major points 1. To my knowledge, the neural consequences of visual sequence learning (or visual statistical learning) are commonly reported in primary and ventral visual areas, as also reflected by several papers cited by the authors (e.g. [31-33] in the manuscript; also see de Lange et al., Trends Cogn Sci 2018 for a review). Another line of evidence suggests the involvement of subcortical structures, such as hippocampus, in statistical learning and positional codes (Hindy et al. 2016; Hsieh et al. 2014). Finally, frontal areas may also encode sequence information in serial order tasks (Berdyyeva and Olson, 2010). Given the prevalence of studies pointing towards sensory, frontal and subcortical areas as key nodes involved in sequence learning, I was surprised by the results in the present manuscript being located exclusively in dorsal visual stream regions (largely parietal cortex; Fig. 5). While the authors state that they want to address the underlying computational mechanism rather than where in the brain effects of learning are found (lines 54-56), I nonetheless believe that the unexpected localization of the results does at least warrants critical evaluation and discussion, also in the context of previous literature. Indeed, the motivation for the present study is framed in the context of previous work on (visual) sequence/statistical learning and aims to advance our understanding by investigating the underlying computation rather than only a univariate difference in response magnitude – a goal I fully support. However, if the observed neural effects drastically differ in localization, is this goal really achieved in the present study? The authors compare their results to several studies showing modulations in early visual and ventral visual stream areas (e.g. line 195), yet report univariate and RSA results exclusively in dorsal visual areas. Without a convincing explanation for this discrepancy, and why it should not be of concern, it would seem to me that the present results do not provide novel insight into the specific mechanism that was cited to motivate their study. 2. Relatedly, the specifics of the experiment design and the cortical areas reported by the authors may in fact suggest an alternative explanation for the results. In the task performed by the participants, the recall of stimulus order, responding to novel compared to learned sequence is more difficult. Accordingly, also behavioral results show slower responses to the more difficult to recall novel sequences. Arguably, trials with novel sequences require more attention, likely over a longer duration, to be devoted to the stimuli. The univariate fMRI results are in line with this interpretation, by showing enhanced neural responses to novel compared to learned sequences; i.e., possibly an enhanced neural responses to novel sequences due to attention. Moreover, there appears to be an appreciable overlap of the results reported in Fig. 5, with posterior parts of the default mode and attention related networks. Combined this evidence suggests an interpretation stressing the role of attention allocation and task difficulty as underlying (at least partially) the observed effects. Therefore, to me it appears unclear what neural mechanism is reflected in the present results. This is a notable limitation, which should be critically evaluated by the authors, or if possible be addressed with additional analyses – although, unfortunately I do currently not see how the existing data could directly answer this concern. 3. The critical question is then whether differences in attention or related effects, brought about by differences in task difficulty and task demands between the conditions of interest, could also account for (apparent) changes in neural representations; the RSA results at the heart of the manuscript. To me this seems possible, albeit less straight forward than for the univariate results. Nonetheless, one can speculate. If I understand correctly, the authors constrained sequence selection (maximized differences between sequences) and pseudo-randomized the order of trials (2 learned, 1 novel sequence for every set of 3 trials). This maximization of differences between sequences and 2nd order regularity across trials probably makes it possible for participants to rapidly and with high probably detect learned sequences after the first stimulus of a sequence is presented. I believe that the third sequence in a set of three trials can even be perfectly predicted before stimulus onset, if 1 learned and 1 novel trial have been seen so far. Even without this 2nd order regularity, learned sequences can be detected rapidly. That is, if the first stimulus belongs to one of the two learned sequence chances are very high that the sequence is the learned sequence; because only 1/3 of sequences are novel sequences and of those only 1/4 start with any one particular stimulus, while on the other hand 2/3 of trials are learned sequences of which each starts with a different stimulus. Thus, for learned sequences participants can rapidly disengage attention after recognizing the specific learned sequence as e.g. “sequence 1”, possibly already upon seeing the first stimulus in a sequence. Then participants can simply recall the required responses before preparing to perform the associated action. Note that no attention to subsequent stimuli is necessary in most cases. On the other hand, novel sequences require sustained attention to the stimulus sequence itself, do not allow observers to preemptively disengage, demand storing the associated stimuli in working memory, prepare a novel motor plan, etc. In other words, many aspects differ in how observers may respond to learned compared to novel sequences. Importantly, many of these differences do not appear to be related to how learning changes the representation of a sequence itself or the representation of stimuli in a sequence, but rather are secondary consequences of having learned sequences in this task, and of making use of what has been learned. It seems possible that one of these differences, or a combination thereof, may result in a better fit of the recoding model compared to the associative learning model. Consider that a learned sequence can be recognized rapidly (e.g. as “sequence 1”) and attention disengaged from the stimuli, only requiring the participant to recall the motor sequence to be performed. The associated representation of this recognition and the secondary processes outlined above would be highly unlikely to resemble the predictions of the associative learning model, but rather resemble a new “recoded” representation. However, this does not seem to me to indicate that the representation of the sequence has necessarily been recoded, but another secondary process may mask as a recoded sequence representation. Even so, it is interesting how these secondary consequences modulate neural responses – i.e., I do not want to argue that the results are uninteresting. Rather, the problem is again that we do not seem to know which mechanism, modulation, strategy, etc. is underlying the present results. In a study that aims to narrow down the precise neural mechanism underlying sequence learning this uncertainty about which process may underpin the observed results is, in my opinion, a crucial limitation worth discussing. Minor points 1. Related to the points above, one may also wonder whether the discrepancy in results compared to previous studies could also be a consequence of differences in stimulus duration. Compare for example the fast-paced statistical learning in [31-33] (stimulus presentation <=500ms), which resulted in reduced sensory responses to learned sequences in early and ventral visual areas, with the slow paced presentation (2.4s) in the present study. Pacing could thus be an important characteristic relevant for determining which coding strategy is employed. If this is the case, it would limit the generalization of the conclusions we can draw from the present study. 2. In lines 118-124, the authors explain the rationale of the test of the associative learning model. I understand the incentive to use RSA here as well. However, it seems to me that a decoding approach (e.g. SVM), comparing decoding accuracy of novel compared to learned sequences, may have provided a more powerful approach, assuming appropriate training data would have been acquired. It might be worth explaining why RSA would be preferable over a decoding analysis (or at least equally suited) to assess predictions of the associative learning model. Otherwise, it may appear to the reader that the study design and analysis procedure is optimized for the recoding model. 3. The authors conclude “humans follow an optimal sequence learning strategy and recode initial sequence representations into more efficient chunks” (line 314-315). What is meant here by “optimal”? I am not sure I quite see how the data support that an optimal strategy is necessarily used, but only that recoding fits the current data better than the associative learning model and the null model. I do understand that practical limitations constrain which models can be implemented, but (at least in my understanding) it has not been tested whether an optimal learning strategy is indeed used. 4. Lines 123-124 first introduce the dorsal visual stream as a ROI. This choice of ROI should be supported or introduced in some fashion. 5. Lines 126-127 state that “We found no evidence for the first associative learning prediction: novel and repeating sequences were not encoded similarly in any of the brain regions”. Would it be possible to clarify which results the authors are referring to here? 6. In addition to the lower and upper noise ceiling for the model results, it might be helpful to mention the variance explained by the models. 7. Important additional MR sequence parameters, such as the multi-band factor, etc., would be a good addition to the Methods section (around line 782). This would better allow the reader to evaluate potential shortcomings of the utilized sequence. 8. Has the MR sequence been evaluated for slice leakage? 9. Was mean scaling performed as part of the univariate fMRI analysis pipeline? 10. Several figures (e.g. Fig. 5, Fig 9) would benefit from labels on the y-axis and the colorbar. 11. The authors mention that in ¼ of trials the recall phase was omitted. Was this data used in any specific manner? What was the intention behind this design choice? The stated rationale, “to ensure a sufficient degree of decorrelation between the estimates of the BOLD signal for the delay and recall phases” would have been possible to achieve by using a variable duration for the delay window. 12. I could not find the fMRI dataset under the link provided in the manuscript. I assume that the authors will publish the dataset upon publication? 13. I can confirm that I could access the code shared on gitlab. The code seems well written and documented. Sharing code in a well readable format is much appreciated. That said, I did not have time to adequately review the code or test it, also due to the problems with data access. ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: No: I could not access the MRI dataset using the link in the manuscript, and I did not notice any separate numerical data underlying the figures. However, this could also have been an oversight/problem on my part. ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see 18 Mar 2021 Submitted filename: responses_to_comments.pdf Click here for additional data file. 16 Apr 2021 Dear Kalm, We are pleased to inform you that your manuscript 'Sequence learning recodes cortical representations instead of strengthening initial ones' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Ronald van den Berg Associate Editor PLOS Computational Biology Samuel Gershman Deputy Editor PLOS Computational Biology *********************************************************** Reviewer's Responses to Questions Comments to the Authors: Reviewer #1: The revision does a good job in making the manuscript clearer and I am happy to see it published. I only have a few comments the authors might want to consider: - lines 102 to 105: I think it would be helpful to clarify here that the prediction is parameter-free or does not depend on participants' responses. This is the important part where a misunderstanding of what is meant with model here can make it difficult to understand the results. - I am not 100% convinced the response to point 1.14 is sufficient. The manuscript now reads (ll. 492): "To keep the relative probability of repeating and unique sequences fixed throughout the experiment we pseudo-randomised the order of trials so that on the average there was a single unique sequence and two repeating sequences in three consecutive trials (Fig 2B)." The phrase "on average" suggests that it does not hold for any set of three trials. Maybe it would be clearer to describe the randomisation procedure. E.g.: The trial list was divided in sub-blocks of length 3 each of which contained exactly one repeated sequence with position randomly determined. - Figure 3 note: Two opening parentheses in new part, but only one closing parenthesis. - line 200 (likewise in line 243): "p > 10^-3" seems a surprisingly low threshold to declare an effect as absent. I understand the logic (ll. 942), but it still feels inappropriate. Reviewer #2: The authors well addressed the concerns and comments raised by the other reviewers and myself. Particularly, the major comments have been discussed thoroughly, which is much appreciated. My questions about the anatomical localization of the MRI results have been satisfactorily addressed. In my opinion the associated adjustments to the manuscript, and the discussion of alternative learning mechanisms, make the manuscript more accessible and better places the results in the context of the wider literature. Explaining the rationale and limitations of the fMRI slice position and coverage further clarifies the localization of the results and absence of results in other areas. In short, I believe that the manuscript is publishable in its current form. That said, I would briefly like to mention a few remaining minor comments for your consideration. Minor comments 1. Line 64: “Specifically, the code for of sequences changed from […]” accidentally says “for of” 2. I appreciate the adjustments to the Discussion. However, the new version of the discussion occasionally seem to lack focus; e.g. the paragraphs of lines 300, 319, 347 all deal with potential hippocampus related learning mechanisms in one way or another, but are separated by other parts of discussion. Although this might not be strictly necessary, it would be helpful if it were possible to slightly streamline the Discussion. 3. In their reply the authors state that: “In our view the identification-and-disengage mechanism is only possible using a latent variable that maps onto a longer sequence. This model can be called pattern completion or chunking — in our manuscript we use the neutral term recoding.” I appreciate that the authors provided an elaborate explanation how they believe this alternative account is covered in their model. I do also agree with the authors that such an “identification-and-disengage” mechanism may indeed require recoding. However, it does still raise the question what the changes in fMRI patterns reflect – is it necessarily the representation of the recoded sequence or a (secondary) consequence thereof; e.g. related to attention. To be clear, the manuscript answers the question whether recoding or associative learning better match the fMRI patterns, thus my comment does not question whether the authors achieved the goal of the study. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: None Reviewer #2: No: However the authors stated that: "[...] once we have received conditional acceptance for our manuscript we can link the data immediately. This can be verified by the editor and the reviewers before making the manuscript public on the PLOS website." I understand the explanation why the dataset is not available during review and I do realize that such SOPs are probably decided on an institute level. However, for review transparency it would be great if datasets could be made available for reviewers (e.g. with a personalized reviewer link) for future manuscripts. ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No 19 May 2021 PCOMPBIOL-D-20-02080R1 Sequence learning recodes cortical representations instead of strengthening initial ones Dear Dr Kalm, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Olena Szabo PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
  88 in total

1.  The temporal context model in spatial navigation and relational learning: toward a common explanation of medial temporal lobe function across domains.

Authors:  Marc W Howard; Mrigankka S Fotedar; Aditya V Datey; Michael E Hasselmo
Journal:  Psychol Rev       Date:  2005-01       Impact factor: 8.934

2.  Considerations arising from a complementary learning systems perspective on hippocampus and neocortex.

Authors:  J L McClelland; N H Goddard
Journal:  Hippocampus       Date:  1996       Impact factor: 3.899

Review 3.  Circuit changes in motor cortex during motor skill learning.

Authors:  Andrew E Papale; Bryan M Hooks
Journal:  Neuroscience       Date:  2017-09-14       Impact factor: 3.590

4.  Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature.

Authors:  Christophe Destrieux; Bruce Fischl; Anders Dale; Eric Halgren
Journal:  Neuroimage       Date:  2010-06-12       Impact factor: 6.556

Review 5.  Computational analysis of the role of the hippocampus in memory.

Authors:  A Treves; E T Rolls
Journal:  Hippocampus       Date:  1994-06       Impact factor: 3.899

Review 6.  Visual working memory capacity: from psychophysics and neurobiology to individual differences.

Authors:  Steven J Luck; Edward K Vogel
Journal:  Trends Cogn Sci       Date:  2013-07-11       Impact factor: 20.229

7.  Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in python.

Authors:  Krzysztof Gorgolewski; Christopher D Burns; Cindee Madison; Dav Clark; Yaroslav O Halchenko; Michael L Waskom; Satrajit S Ghosh
Journal:  Front Neuroinform       Date:  2011-08-22       Impact factor: 4.081

8.  Perceptual learning and representational learning in humans and animals.

Authors:  József Fiser
Journal:  Learn Behav       Date:  2009-05       Impact factor: 1.986

9.  Accurate and robust brain image alignment using boundary-based registration.

Authors:  Douglas N Greve; Bruce Fischl
Journal:  Neuroimage       Date:  2009-06-30       Impact factor: 6.556

10.  Effective learning is accompanied by high-dimensional and efficient representations of neural activity.

Authors:  Evelyn Tang; Marcelo G Mattar; Chad Giusti; David M Lydon-Staley; Sharon L Thompson-Schill; Danielle S Bassett
Journal:  Nat Neurosci       Date:  2019-05-20       Impact factor: 28.771

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.