Literature DB >> 34478442

Learning brain dynamics for decoding and predicting individual differences.

Joyneel Misra¹, Srinivas Govinda Surampudi¹, Manasij Venkatesh¹, Chirag Limbachia², Joseph Jaja¹, Luiz Pessoa^1,2.

Abstract

Insights from functional Magnetic Resonance Imaging (fMRI), as well as recordings of large numbers of neurons, reveal that many cognitive, emotional, and motor functions depend on the multivariate interactions of brain signals. To decode brain dynamics, we propose an architecture based on recurrent neural networks to uncover distributed spatiotemporal signatures. We demonstrate the potential of the approach using human fMRI data during movie-watching data and a continuous experimental paradigm. The model was able to learn spatiotemporal patterns that supported 15-way movie-clip classification (∼90%) at the level of brain regions, and binary classification of experimental conditions (∼60%) at the level of voxels. The model was also able to learn individual differences in measures of fluid intelligence and verbal IQ at levels comparable to that of existing techniques. We propose a dimensionality reduction approach that uncovers low-dimensional trajectories and captures essential informational (i.e., classification related) properties of brain dynamics. Finally, saliency maps and lesion analysis were employed to characterize brain-region/voxel importance, and uncovered how dynamic but consistent changes in fMRI activation influenced decoding performance. When applied at the level of voxels, our framework implements a dynamic version of multivariate pattern analysis. Our approach provides a framework for visualizing, analyzing, and discovering dynamic spatially distributed brain representations during naturalistic conditions.

Entities: Chemical

Mesh：

Year: 2021 PMID： 34478442 PMCID： PMC8445454 DOI： 10.1371/journal.pcbi.1008943

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

Introduction

As brain data become increasingly spatiotemporal, there is a great need to develop methods that can effectively capture how information across space and time support behavior. In the context of functional Magnetic Resonance Imaging (fMRI), although data are acquired temporally, they are frequently treated in a relatively static manner (e.g., in event-related designs, responses to short trials are estimated with multiple regression assuming canonical hemodynamics). However, a fuller understanding of the mechanisms that support mental functions necessitates the characterization of dynamic properties, particularly during the investigation of more naturalistic paradigms, including movie watching [1], and other continuous paradigms [2, 3]. A central goal of neuroscience is to decode brain activity, namely to infer mental processes and processes based on brain signals. In fMRI, most approaches are spatially constrained and consider voxels in a local neighborhood (“searchlight”) or across a few regions. Such models are useful to decode stimuli or mental state, particularly when the problem is well understood and localized anatomically [4-7]. Additional techniques decode brain activity based on whole-brain functional connectivity matrices [8], as well as machine learning methods [9]. The vast majority of decoding methods are static, that is, the inputs to classification are patterns of activation that are averaged across time (“snapshots”) [10]. Some studies have proposed using temporal information in addition to spatial data [11-16]. In such cases, features used for classification are extended by considering a temporal data segment instead of, for example, the average signal during the acquisition period of interest. Another strategy to make use of temporal information capitalizes on dynamic functional connectivity, and how it is related to different conditions and individual differences [17-20]. Despite some progress in utilizing temporal information in decoding methods, key issues deserve to be further investigated: How can we characterize brain signals generated by dynamic stimuli in terms of generalizable (across participants) spatiotemporal patterns? How are such patterns distributed across both space and time? For example, although a recent study used recurrent neural networks to decode brain states from fMRI data, they investigated different working memory conditions (remembering faces, bodies, tools, or places) which are associated with stable brain states at the temporal scale of technique [21]. Understanding the dimensionality of brain representations has become an important research question in recent years [22-24]. For example, for simple movements, neuronal signals across populations of neurons can be well approximated by a projection onto a two-dimensional representation [25]. This question is now starting to be addressed in the fMRI literature [26, 27]. Here, we tackle the problem of learning low-dimensional spatiotemporal signals from dynamic stimuli, including movie watching and continuous experimental paradigms. If spatiotemporal patterns capture important properties of brain dynamics, do they capture information about individual differences that are predictive of behavioral capabilities? Although multivariate techniques, including neural networks, can be successful at the level of prediction, interpretability is of key importance in the context of neuroscience. Therefore, here we quantified the relative importance of spatiotemporal features, specifically how brain regions contribute to input classification as a function of time. Can spatiotemporal decoding be applied at both the region-of-interest and voxel levels? Extending it to the voxel level allows investigators to probe finer spatial representations underlying task conditions or states of interest, and opens the way for dynamic representational similarity analysis [28]. In the present paper, we sought to address the above challenges by developing a dynamic computational framework built upon recurrent artificial neural networks. At the broadest level, our goal was to develop a multivariate and temporal technique to study fMRI signals that help address the following question: what combination of signals (from multiple regions or voxels), from particular temporal windows, support particular brain states and/or behavior? In essence, how can we characterize multivariate dynamics? A central goal of our study was to investigate how latent representations obtained by learning brain dynamics captures useful information, despite being detached from the input signals. More broadly, our objective was to outline and validate an approach that can be employed when dynamic paradigms are employed. Finally, although we used non-linear neural networks, we investigated ways in which the approach can be used that go beyond using a “black box”, including a combination of “importance” and lesion analysis.

Results

Our goal was to develop a spatiotemporal decoding framework to predict a stimulus, mental state, behavior, and/or personality traits based on brain signals. A key element of our model was a recurrent neural network based on Gated Recurrent Units (GRUs; [34]) sensitive to current and past inputs (Fig 1A). To determine the generalizability of our findings, all results reported below were obtained with data not used in any fashion for training and/or tuning. For a complete description of the architecture, please refer to Materials and methods.

Fig 1

Model architectures based on neural networks with gated recurrent units (GRUs).

(A) Classifier. At each time step, time series data,, provided inputs. The recurrent neural network transformed the inputs into a latent representation, , which then determined the output class scores, . The unit with highest activation determined the model’s prediction of the input stimulus at each time. (B) Dimensionality reduction. The encoder shares the GRU component shown in A. GRU outputs, , were first linearly projected to a lower-dimensional space using a fully-connected layer (DR-FC). Classification was then performed based on the low-dimensional representation, . (C) Input signal reconstruction. A separate GRU was trained independently to reconstruct the original brain signals based on the low-dimensional signals, .

Model architectures based on neural networks with gated recurrent units (GRUs).

Generalizability of spatiotemporal patterns

To test the system’s ability to generalize spatiotemporal brain patterns across participants, we employed the GRU classifier to predict movie clip labels from functional MRI data available from the Human Connectome Project. We performed 15-way classification as a function of time (Fig 2A). Accuracy increased sharply during the first 60 seconds, and stabilized around 90 seconds. Mean classification accuracy after this transient period was 89.46% (Fig 2B). For completeness, a formal evaluation of chance performance based on the null distribution obtained through permutation testing resulted in a mean accuracy of 8.40% (1000 iterations; [54]).

Fig 2

Prediction of movie clip (15-way classification).

(A) Average clip prediction accuracy using the neural network architecture with gated recurrent units (GRUs) as a function of time (dark yellow). Accuracy increased sharply during the first 60 seconds, and stabilized around 90 seconds. Results using multinomial logistic regression (Log-Reg), a feed-forward architecture (FF; 1 layer, 103 units), temporal convolutional networks (TCN; kernel widths of 5 and 40) also also shown (see part B for labels). Error bars show the 95% confidence interval of the mean across test participants. (B) Summary of accuracy results after the 90-second transient period (see dashed line in part A).

Prediction of movie clip (15-way classification).

Is temporal information necessary for clip prediction?

If capturing temporal information and long-term dependencies are important for classification, performance should be affected by temporal order. To evaluate this, we shuffled the temporal order of fMRI time series for test data. To preserve the autocorrelation structure in fMRI data while shuffling, we used a wavestrapping approach [55]. Classification accuracy reduced considerably to 54.14%. We then compared the GRU classifier to alternative schemes. Our objective was not to benchmark our proposal against possible competitors, but to further probe properties captured by our architecture. First, using a simple feed-forward (FF) network with one hidden layer, classification accuracy was 44.86%, which was considerably above the chance accuracy of 8.40% determined via permutation testing (the latter was essentially what we obtained when using multinomial logistic regression; 8.10%). We constrained our choice of units (103) such that the the total number of parameters of the feed-forward classifier (32, 563 parameters) was approximately equal to that of the GRU classifier (32, 559 parameters). As feed-forward classifiers do not capture temporal dependencies, we next employed temporal convolutional networks (TCNs). We used 25 kernels with a width (W) of 5 time steps such that the number of parameters of the classifier (37, 915 parameters) approximated that of the GRU classifier (see Materials and methods for kernel details). Classification accuracy was only 36.22%. Increasing the kernel width to 10 time steps (75, 415 parameters) modestly improved accuracy to 41.56%, and even using a kernel width of size 40 accuracy was only 54.85%, although the number of parameters (300, 415) was close to ten times that of the GRU classifier. Together, the results above are consistent with the notion that temporal information is essential for high-accuracy classification, and that our GRU-based architecture is capable of capturing long-term dependencies that are missed by temporal convolutional networks, for example. Note, however, that temporal shuffling of the time series still allowed the GRU classifier to attain classification at around 54% correct. As the temporal shuffling was unique for each training permutation, order information was broken down, indicating that non-temporal information supported considerably above-chance performance. It is conceivable that spatial information of specific movie scenes provided sufficient information to account for that level of accuracy.

Low-dimensional trajectories as spatiotemporal signatures

We investigated the dimensionality of the inputs required for successful stimulus prediction by using a supervised non-linear dimensionality reduction approach (Fig 1B). The original dimensionality of the input, , was the number of regions (N = 300). After learning, the GRU outputs provide a low-dimensional latent spatiotemporal representation of the input, which is encoded in the vector (N = 32). To further reduce dimensions, GRU signals, , were linearly projected onto a lower-dimensional space, , using a fully connected layer. We refer to this weight matrix as the Dimensionality Reduction Fully-Connected (DRFC) layer, and to this model as the GRU encoder. Since is a low-dimensional representation of the history of , the inputs are not treated independently, effectively leading to non-linear temporal dimensionality reduction. As in the original classifier, a final FC layer was used to predict labels based on . Performance with low-dimensional encoding was surprisingly high; 3 dimensions yielded 71.21% accuracy. In Fig 3A, we show low-dimensional signals for each clip, which we refer to as trajectories: consecutive low-dimensional states, , describe the trajectory in state space. In other words, at each time t, the value of each unit is plotted along the (x, y, z) axes.

Fig 3

Low-dimensional trajectories.

Low-dimensional trajectories.

(A) Trajectories for all clips. Solid line: mean trajectory averaged across participants (line thickness is scaled by variance, which was highest at the end of the clip). All trajectories progressed away from the center (see sample arrows). The inset provides clip abbreviations. (B) Euclidean distance between trajectories. The Euclidean distance between the clip trajectory while watching Home Alone and the mean trajectory across participants for a second clip was computed. The thicker line corresponds to the distance of participants’ Home Alone trajectories to the mean of this clip. The same results are shown for all clips in S1 Fig. (C) Clip prediction accuracy and fraction of variance captured after reconstruction using low-dimensional models. Error bars correspond to the standard error of the mean across participants. To quantify a notion of proximity between trajectories, we computed the Euclidean distance between them as a function of time (Fig 3B). To compute the distance between clips A and a reference clip B, we first computed the mean trajectory of B averaged across participants. Then, for each participant’s clip A trajectory, we computed the Euclidean distance between A and the mean trajectory of B at every time step. Accordingly, the proximity of a clip with itself was not zero (indicated by a thicker line), and reflected the variability of participant trajectories around the clip’s average. As expected, the evolution of low-dimensional trajectories closely reflected the temporal accuracy obtained using the original GRU classifier. Clip trajectories were initially close to one another, but slowly separated during the first 60- 90 seconds of the clip. The results in Fig 3B also help clarify why the trajectories in (Fig 3A) originate from a common part of space. In the first few seconds, discrimination is very low and all trajectories are very close to one another in three dimensions. As time progresses, discrimination increases and the trajectories gradually separate from one another. We further investigated low-dimensional projections with 4, 5, and 10 dimensions, at which point performance (87.84%) was not far (within 2%) from that of full dimensionality (Fig 3C). These results reveal that latent representations with as few as 10 dimensions captured essential discriminative information. The effectiveness of the GRU encoder in capturing low-dimensional information can be appreciated by applying principal components analysis on the input data, , which yielded very low prediction accuracy, although it was capable of recovering 60% of the input signal variance with 10 dimensions (Fig 3C). To investigate the content of the latent space uncovered by the GRU encoder, we performed the following analysis. By construction, the low-dimensional vector captures information to classify the input signals. How much information about the input does this projection preserve? To evaluate this, we considered the low-dimensional signal, , and used it to try to reconstruct the input signal, . Within our framework, it is natural to do so using a GRU decoder (Fig 1C), which reconstructs the input time series (the reconstructed signal is called ). Doing so captured a very modest amount of signal variance (less than 10%).

Spatiotemporal importance maps

The GRU representation that supports classification lies in a latent space that is fairly disconnected from the original brain activation signals, . It is therefore important to relate GRU states to brain activation. We defined a saliency measure to capture the contribution of a brain region to classification as a function of time (see Materials and methods). This analysis attempts to identify sets of brain regions, and particular temporal windows, that are important for the task at hand. Fig 4A shows a saliency map at t = 8 for the Star Wars clip, also shown dynamically for the first 60 seconds (S2 Video). The saliency time series in a few brain regions shows that values were relatively high initially and gradually decreased over the first minute of the clip (Fig 4B). The evolution of saliency for the Home Alone clip was somewhat similar (see also the saliency map video for the Brokovich clip in S1 Video).

Fig 4

Determining brain contributions to classification.

(A) Saliency map for the “Star Wars” clip. For illustration, the top 30 regions are shown. Color scale in arbritary units. See also the video for dynamics. (B) Saliency values for the first 60 seconds of two clips. The gray bands corresponds to the 95-th percentile of null saliency values generated via permutation testing. Scale of the y-axis is arbitrary.

Determining brain contributions to classification.

Lesion analysis

We further investigated region importance by performing a lesion analysis. After training, first we lesioned each of the seven large-scale networks described by Schaefer and colleagues [32] (Fig 5A). Lesioning the visual, somato-motor, and default networks produced the largest drop in test accuracy. Lesion of the other networks produced essentially no deficits. Next, we capitalized on the parcellation available in terms of seventeen networks, allowing us to investigate how sectors of the seven “canonical” networks contributed to performance(Fig 5B). We found that the temporal-parietal was the only one of the default network subgroups that had an appreciable impact on performance. In addition, only the central extra-striate regions in the visual network, and the early auditory regions in the somato-motor network had more appreciable impacts.

Fig 5

Lesion analysis.

Lesion analysis.

(A) The impact of a lesion in each of the seven standard networks is shown. The orange line shows accuracy without any lesion. (B) The impact of lesions to specific subnetworks was also evaluated, revealing greater contributions to clip classification by specific sets of brain regions. (C) Comparison of removal of regions from a given network (visual, somato-motor, or default mode) relative to a random set of regions from that same network. Removing regions in descending order of overall saliency (see text) impacted classification more than selecting regions without considering saliency (without replacement). For illustration, we also display the regions of the most impactful subnetworks observed in part B (red dots; also shown in the insets). The gray region shows the 95% confidence interval when the same number of regions were excluded at random. Both saliency and lesion analyses highlight brain regions that are important for classification. To relate the two, we tested the following conjecture: regions with higher saliency values should have larger effects on classification when lesioned. To test this, we sequentially excluded regions progressing from the highest to the lowest overall saliency values, which was defined as the mean saliency value across time, clips, and participants. We performed tests on the visual, somato-motor, and default networks, which were found to be important for classification. According to the conjecture, accuracy should more sharply decrease after excluding the most salient ROIs. To test for the alternate hypothesis, we lesioned ROIs based on random selection (repeated for 1000 iterations). Thus, we contrasted performance when selecting regions of the, say, visual network based on saliency values relative to selecting regions from the same network randomly. Fig 5C shows that when ROIs were lesioned in sequence of decreasing saliency there was a larger drop in accuracy compared to random selection, therefore supporting our conjecture.

Predicting behavior

Recent studies have employed brain data to predict participants’ behavioral capabilities or personality-based measures [56-60]. We tested the extent to which spatiotemporal information captures individual-based measures based on our recurrent neural network framework. Here, we targeted individual scores for fluid intelligence and verbal IQ. The same approach used for classification was employed, with the model predicting a scalar, namely, a participant’s score. We used all clips for training (rather than training a separate model based on each clip) to promote learning representations that are not idiosyncratic to a particular clip. Fig 6 shows model performance temporally for the Star Wars clip, one of the best predicting clips. At every time point, predicted scores are correlated with actual participants’ scores. For fluid intelligence, the observed correlations were often above chance levels (gray zone indicates 95% confidence level at specific time t), but more modestly for verbal IQ. As our model made predictions as a function of time, we developed a permutation-based test to evaluate the predictions while controlling for multiple comparisons (fluid intelligence, p = 0.0013; verbal IQ: p = 0.0038). The supplementary figures (S2, S3, S4 and S5 Figs) show results for all clips. As an alternative to evaluating prediction at every time point, we developed an “oracle test” that uses training data to learn to guess a time to test prediction once on the separate test set (See Materials and methods). For reference, we compared our approach to connectome-based predictive modeling based on patterns of functional connectivity (CPM; [51]), possibly the state-of-the-art in this regard (see also [56, 58]). We include values predicted with this technique to provide an informal comparison, as this method provides a single prediction per clip, unlike our approach which is time varying.

Fig 6

Prediction of behavior.

(A) Prediction of fluid intelligence scores as a function of time. Prediction (blue) fluctuates considerably but consistently exceeds chance values (indicated by the tic marks at the bottom). Values obtained by connectome-based predictive modeling (CPM) are indicated for comparison (red: CPM applied to Star Wars data; maroon: highest value applying CPM across all clips). The gray region indicates the 95th percentile region based on permutation testing (N.B.: applies to our method only, not CPM). (B) Prediction of verbal IQ as a function of time. Only short periods of time of the Star Wars clip exceeded chance levels. The green bar indicates a segment of the time series that is significant at the 0.05 level corrected for multiple comparisons.

Prediction of behavior.

Dynamic multivariate pattern analysis

Thus far, we applied our model to region-based activation patterns. The approach can also be employed to perform voxel-based (or grayordinate-based) dynamic multivariate pattern analysis (MVPA). Here, we applied our model to the prediction of experimental conditions from a dataset collected in our laboratory [3]. Briefly, in the “moving circles” paradigm, over periods of three-minute blocks, two circles moved on the screen, at times approaching and at times retreating from each other. If they touched, participants received a non-painful but highly unpleasant electrical shock. The movement of the circles was smooth but not predictable. In particular, the circle motions were set up to include multiple instances of “near misses”: periods of approach followed by retreat; in such instances, the circles came close to each other and retreated just before colliding. Based on near-misses, we defined approach and retreat states: each lasted seven time steps (total of 11.25 seconds) during which the circles approached or retreated from each other. We performed dynamic MVPA by using voxels from the anterior insula, a region strongly engaged by the experimental paradigm [2, 3]. Classification accuracy ranged from 59- 63% over the approach and retreat segments (the upper bound of the 95% confidence interval determined with permutation testing was 50.37%). We also computed saliency in a voxelwise manner; values were substantially higher at the outset of a approach/retreat segment and relatively lower for the subsequent time points. Fig 7 shows a snapshot at t = 6 seconds, illustrating the spatial organization of the map.

Fig 7

Voxelwise prediction of experimental conditions: Threat vs. safe.

Only voxels from the anterior insula were employed. Saliency values at t = 6 are shown; for illustration no thresholding was applied.

Voxelwise prediction of experimental conditions: Threat vs. safe.

Only voxels from the anterior insula were employed. Saliency values at t = 6 are shown; for illustration no thresholding was applied.

Discussion

We sought to characterize distributed spatiotemporal patterns of functional MRI data during movie watching and continuous task conditions. To do so, we employed recurrent neural networks sensitive to the long-term history of the input signals, and applied them to the problem of input classification. The model was tested on brain data both at the level of region of interest and voxels. All neural network training and model tuning were performed using a training set independent of the test set to establish the model’s ability to generalize learned representations to unseen participants. The framework we developed is general and its subcomponents can be easily exchanged to utilize other algorithms (e.g., long short-term memory networks, or novel developments).

Spatiotemporal representations

Spatiotemporal information learned in a given set of participants generalized very well to new participants, where challenging 15-way classification was ∼90% correct for movie clip data. These results are noteworthy because they suggest that activation patterns distributed across space and time are shared across participants when viewing naturalistic stimuli. To understand the type of information captured by our model, we compared it to a few simple schemes, including a simple feedforward network, and temporal convolutional networks. The low performance with these models (∼40%) suggests that long-term dependencies are essential for input classification. Indeed, temporal shuffling of the input time series decreased performance to ∼50%. As temporal shuffling preserves the input signals but scrambles their sequence, these results reveal that the precise temporal order is key for the ability to generalize to unseen data.

Latent dimensionality

The question of the “inherent” dimensionality of brain signals has attracted considerable attention in recent years [22, 23, 61–64]. Here, we investigated this issue for the movie clip data, where inputs were 300-dimensional based on the ROI time series. As the number of units in the hidden layer was 32, a considerable amount of dimensionality reduction was accomplished at the outset. Further investigation revealed that this signal could be further reduced to ∼10 dimensions without compromising prediction accuracy appreciably. Strikingly, even with three dimensions, accuracy was relatively high (77.30%). Two issues about dimensionality deserve attention here. First, as the transformation of the input by the model is non-linear, when stating that one has, say, a 10-dimensional space, the non-linearity of the space must be kept in mind, as contrasted to a space obtained by linear techniques, such as those based on singular value decomposition. Second, the system processes temporal data. Thus, in a sense the dimensionality of the system is considerably higher (for example, “10d+time”). Whereas the mapping onto a lower dimensionality space preserved classification information about the input category, it was very poor at reconstructing the input time series itself. Thus, the content of the latent low-dimensional representation was substantially distinct from the fMRI signal itself. This observation raises the concern that the non-linear projection of the original input signal is too far removed from the brain time series. In other words, that the system works in a strongly “black box” fashion. In the present case, we believe this concern is considerably assuaged by the results of the saliency and lesion analysis (see below). Another way to probe this issue would involve applying the present architecture to datasets with a more clear input or behavioral “space”. For example, using the model to learn about the spatiotemporal content of classes of movies, much like research with static stimuli organized along certain natural axes [28], allowing the representational content of the low-dimensional signals to be evaluated. Finally, note that related issues apply to linear dimensionality reduction, too. In many cases, the first principle component may carry high signal variance, but other components associated with low, and even very low, input signal variance explain most of the behavioral variance [65-68].

Saliency and lesion analysis

Although prediction is valuable in some settings (e.g., clinical diagnosis), our general goal was to develop a framework to characterize distributed spatiotemporal brain signals. Thus, one of our objectives was to identify sets of brain regions, and particular temporal windows, that are important for the classification task at hand, having in mind future applications of our model to characterizing spatiotemporal patterns linked to behavioral conditions or states. Together, the saliency and lesion analyses uncovered several notable findings. Regions of the of the visual and auditory cortex were important for classification. Whereas this was to be expected, it serves as an important confirmation that the modeling approach captured biologically relevant information. Corroborating this statement, the model also identified as important regions of the inferior temporal cortex (such as the fusiform gyrus and the parahippocampal cortex), which are involved in high-level object recognition (not simply “low-level” visual processing). Our findings also illustrate how the method can be used to formulate hypotheses that can be subsequently studied in more targeted studies. We found, for example, that the inferior frontal cortex and the orbitofrontal cortex supported clip classification, too. The lesion analysis also identified specific parts of the default network that are particularly important. Thus, the model makes novel predictions that can be further studied, for instance, by developing clips to address the types of spatiotemporal information being captured by these regions (social? emotional?). Taken together, our results support the idea that the model captured biologically relevant spatiotemporal information that can inform specific research questions.

Individual differences

Recent work has sought to predict individual differences based on functional MRI data [51, 56–60, 69, 70]. Our model was capable of predicting fluid intelligence and verbal IQ from unseen participants at levels comparable to, and possibly better than, established methods such as connectome-based predictive modeling, which is based on information from functional connectivity matrices [51]. Some movies (e.g., Star Wars, Social Net, Oceans) performed much better at prediction than others, and at particular segments of the movie [71]. Why does movie watching correlate with individual differences, including fluid intelligence? We note that the main reason for our analysis was to demonstrate the feasibility of our approach, not a priori hypotheses. The approach adopted here holds more promise when the content of the naturalistic task is chosen or designed to investigate the behavioral dimension of interest. For example, Finn and colleagues found that trait paranoia is linked to brain signals during ambiguous social narratives ([72]; for further discussion, see also [71]). We plan to apply the method described here to investigate individual differences in reward sensitivity and anxiety-related measures in work employing dynamic paradigms [3]. Should prediction of behavioral differences be performed in a time-varying manner? Such strategy could be potentially beneficial if particular segments were better “tuned” to particular individual differences; for example, very suspenseful parts might be particularly associated with anxiety-related traits. This possibility remains speculative and needs to be investigated in future studies. Nevertheless, we showed that it is possible to perform above-chance prediction by making a single prediction of behavior. In this approach, the best-predictive segment is selected based on training data, and this specific temporal window is used with unseen, test data. We applied the model developed here with inputs at the level of voxels, so as to implement dynamic MVPA. We tested the approach in a dataset involving continuous changes in threat level based on the proximity of two moving circles. Voxels from the anterior insula were used to predict stimulus condition (threat vs. safe). Although accuracy was relatively modest (∼60%), the results demonstrate the feasibility of the approach. Saliency maps also uncovered subsectors of the anterior insula with increased importance for prediction, illustrating how it is possible to identify voxels that contribute to decoding the experimental conditions, opening an avenue for the application of representational similarity analysis [28] in a dynamic setting.

Other approaches

Decoding approaches using some temporal information have been used in the past [11-16]. Recurrent neural networks have also been employed for brain decoding more recently, albeit in some cases on relatively static task paradigms, such as working memory which generates stable brain states at the temporal scale of fMRI [21, 73]; working memory conditions can be predicted even when temporal information is eliminated [68]. Because most fMRI datasets are typically small, they are often evaluated using cross-validation without assessing their generalization on held-out data (see Discussion by [74]). The size of the datasets studied allowed us to test the model in a separate set of participants thereby evaluating the model’s ability to generalize beyond trained data (see also [73]). Finally, we stress that the goal of the present study was not to describe a model that outcompetes others but to describe a general, modular modeling approach to investigating spatiotemporal dynamics of brain data, here applied to fMRI.

Materials and methods

Data

Movie data

We employed Human Connectome Project (HCP; [29]) data of participants scanned while watching movie excerpts (Hollywood and independent films) and other short videos, which we call “clips”. The legend of Fig 3 provides further information. All 15 clips contained some degree of social content. Participants viewed each clip once, except for the test-retest clip that was viewed 4 times. Clip lengths varied from 65 to 260 seconds (average: 198 seconds). We employed all available movie-watching data, except those of 8 participants with runs missing; thus we used N = 176 participants. Data were sampled every 1 second. The preprocessed HCP data included FIX-denoising, motion correction, and surface registration [30, 31]. We analyzed data at the region of interest (ROI) level, with one time series per ROI (average time series across locations). We employed a standard 300-ROI cortical parcellation [32]. Thus, the input time series consisted of a vector of N = 300 dimensions at each time, t. Individual ROI time courses were normalized by z-scoring (i.e. centered to zero mean and unit standard deviation) to help remove potential differences across runs/participants.

Moving-circles paradigm

We employed data from a separate dataset collected in our laboratory [3]. Briefly, the experimental paradigm was as follows. Over a period of 3 minutes, two circles moved on the screen, at times approaching and at times retreating from each other. If they touched, participants received a non-painful, but highly unpleasant electrical shock. The movement of the circles was smooth but not predictable. In particular, the circle motions were set up with multiple instances of what we called “near misses”: periods of approach followed by retreat; in such instances, the circles came close to each other and retreated just before colliding. Based on near-misses, we defined approach and retreat states: each lasted 6–9 seconds during which the circles approached or retreated from each other. Data from N = 122 participants were employed (sampled every 1.25 seconds). Preprocessing steps included motion correction (ICA-AROMA via FSL). We analyzed data at the voxel level from the right anterior insula, which corresponds to region numbers 230 and 231 of the Schaefer-300 17-network parcellation [32]. The input time series consisted of a vector of N = 745 dimensions at each time, t.

Performance evaluation

Accuracy

At each point in time, t, we computed accuracy as the average true positive rate (TPR) across inputs of a given class (movie clip or experimental condition). Specifically, for each class label c: where N is the total number of data samples belonging to class c, predicted_label is the label predicted by the model for the nth data sample, and I is an indicator function which equates to 1 if predicted_label = c and 0 otherwise. We computed accuracy at each time step and visualized it as a function of time (see Fig 2).

Training/testing split and cross-validation

Movie data from 100 participants were used for training, and the remaining 76 participants were used as a test set (thus completely invisible to model tuning/learning). There were more participants in the train set to have a little more data for training. To determine optimal hyperparameters for clip prediction, we employed a 5-fold cross-validation approach on the training set. In each fold, participants in the training set were not included in the validation set. After determining optimal hyperparameters, we retrained the model on the entire training data. Final results were generated exclusively based on test data. A similar approach was used for the the moving-circles paradigm, where data from 62 and 60 participants were used as a training set and test set, respectively. For cross validation, the folds had size {13, 13, 12, 12, 12}.

Basic recurrent architecture for classification

Recurrent neural networks allow signals at the current time step to be influenced by long-term past information. Although it was originally difficult to develop effective learning procedures for these algorithms (e.g., vanishing/exploding gradients prevented them from learning relationships beyond 5–10 time steps; [33]), recent developments have largely overcome these challenges. Here, we employed a recurrent neural network architecture based on Gated Recurrent Units (GRUs; [34]). As our goal was to use an architecture that would learn the long-term history of the inputs, we could have used different architectures, such as based on Long Short Term Memory models [35]. The choice of a GRU-based architecture was based on the fact that they perform well across many applications [36], as well as computational availability and expediency. Fig 1A shows the model. Briefly, inputs were provided to a GRU network that produced a “hidden” representation, , of the incoming signal based on current and past signals. For classification purposes, the hidden-layer signals were transformed to create a vector of class scores, the maximum of which determined the model’s prediction of the input. Next, we describe the model formally. Time series data from units of interest (ROIs or voxels) comprised the input vector at each time, . The GRU network (with one or more layers) transformed its input, , non-linearly onto its output, , the hidden layer of the classifier system (see S1 Appendix for details). Because our goal was input classification, GRU outputs at each time, , passed through an additionally fully-connected (FC) layer of units generating a vector of class scores: where Softmax is the generalized logistic activation function. As the outputs are in the range [0, 1] and and add up to 1, they can be treated as probabilities. The output of the model was the unit label, i, such that it selected the class with largest probability: . As all operations were performed at every time, t, the final output was a time series of label predictions. For movie clips, we employed data from 300 ROIs, so the dimensionality of the input was N = 300. The GRU network employed a single layer with 32 units (N = 32). The dimensionality of was N = 15 to implement 15-way classification (based on the number of clips). For the moving-circles data, we employed data from 745 voxels (N = 745). The GRU network contained three layers of 32 units each; the third layer corresponded to the vector (N = 32). The dimensionality of was N = 2 to implement threat vs. safe classification. All weight matrices, W, required for GRU tuning (see S1 Appendix) and the bias parameters, b, underwent supervised training. Training sought to minimize the cross-entropy (CE) loss between predicted labels, , and true labels, (as provided by the supervisor), and was evaluated at at each time step: where the superscript i indexes the dimension of the predicted and teaching signals. The set of trainable parameters was denoted as Θ. The cross-entropy attains its minimum value when the two distributions, and , are identical. We defined the total loss function, J(Θ), as the average cross-entropy loss across time points (i.e., the average value across time of Eq 3), thus encouraging predicted labels to be as close to the true labels as possible across time. We minimized J(Θ) using the backpropagation through time algorithm [37]; gradient descent was optimised with the Adam optimizer [38]. The model was implemented using TensorFlow [39]. As our goal was not necessarily maximizing model performance, we explored a small set of potential architectures by varying the number of hidden layers and number of units per layer (Fig 8). For the region-based analysis, we explored the {1, 2, 3} × {16, 32, 64} space (layers by units). As accuracy was relatively high overall (>70%) and improvements were relatively modest as a function of these parameters, we chose the simplest case (one layer) but with 32 units which improved performance over 16 units (for evaluation performance, we considered all time points, even those during the first 60–90 seconds when accuracy increased sharply; see Fig 2). For the voxel-based analysis, we explored the {1, 2, 3} × {16, 32, 64, 128, 256} space. Classification of approach vs. retreat was a challenging problem, so we utilized 3 layers; we employed 32 units per layer for consistency with the clip classification architecture.

Fig 8

Selection of model architectures via cross validation on data from 100 participants.

Val: validation portion of training data. Error bars show 95% confidence intervals across folds.

Selection of model architectures via cross validation on data from 100 participants.

Val: validation portion of training data. Error bars show 95% confidence intervals across folds.

Dimensionality reduction

The basic architecture (Fig 1A) automatically implemented considerable dimensionality reduction. For example, for movie data, the GRU layer contained 32 units, whereas inputs from 300 regions were employed. This initial step implemented a version of non-linear dimensionality reduction. However, we included a separate stage of linear dimensionality reduction that was obtained by adding an extra fully connected layer, as described next. This two-step procedures was adopted to create a modular architecture that could be probed as desired: an initial step used to learn a temporal prediction problem, and an additional dimensionality reduction stage. Our approach can thus be contrasted to employing, for instance, auto-encoder architectures with a single non-linear dimensionality reduction step associated with hidden layers [40-42]. Several dimensionality reduction techniques exist that could be used to probe lower-dimensional representations in our system. To capture the temporal relationships in the time series data in the dimensionality reduction process, we combined the GRU network with a simple additional fully connected layer for dimensionality reduction (DR-FC). We refer to this model as the GRU encoder (Fig 1B). Formally, GRU outputs, , were linearly projected onto a lower-dimensional space containing units with linear activation: where the weight matrix (WDRFC had dimensions ). We refer to the layer as the Dimensionality Reduction Fully Connected (DR-FC) layer. The output of this layer, , is sensitive to past inputs, , thus effectively capturing temporal dependencies. An additional fully connected layer was used to predict labels, as in the previous subsection. For dimensionality reduction, only the DRFC and final FC layers were trained (via standard backprogapation). In other words, the training weights of the GRU network for classification were frozen in place after that initial learning phase. For visualization, when the dimensionality of was reduced to three, we plotted “trajectories” in a state-space representation (Fig 3A).

GRU decoder

Can low-dimensional representations obtained for classification be used to reconstruct the original data? Another GRU-based module was used to reconstruct brain activation from the learned low-dimensional representations, , which we refer to as the GRU decoder (Fig 1C). The GRU decoder was trained independently (that is, separately from the GRU encoder), by minimizing the mean squared error between the GRU decoder output, , and the original time series input, , at each time step.

Saliency maps: Explaining model predictions via sensitivity analysis

Given an input activation vector, , the GRU classifier generates a prediction output vector, , encoding class-scores. As the process is supervised, the true class label, c ∈ {1, ⋯, C}, is known. Thus, the cth element of the output vector, , is the class score for the input’s true class label. To help evaluate the contribution of individual input features (ROIs or voxels) to classification, we employed a saliency measure [43] that was used in the context of sensitivity analysis, a widely used approach in machine learning applications [44-47]. Sensitivity analysis helps in the interpretation of complex nonlinear models by representing variation of class-scores in terms of variation of individual input features. Input features are more salient if their class-scores are more sensitive to changes in their values. Mathematically, the class-score, , is a multivariate function of input features, : The total differential of class-score, , can be expressed as a weighted sum of the differentials of individual input features, , weighted by their corresponding partial derivatives, : where the operator ⊤ is the transpose. Since the gradient vector, , captures the sensitivity of class-score w.r.t changes in input features, we define the saliency vector, , as the gradient vector of the class-score evaluated at the current input, ,: The value of the ith element, , is the saliency value for the ith input feature (ROI or voxel) at time t. The saliency indicates ROIs or voxels which, when their values change, have the largest impact on the true-class score, and consequently a larger effect on predicting the output class. The saliency vector is straightforward to implement for our model, as the gradient can be computed using the backpropagation through time algorithm [37]. To consider saliency values across participants, we z-scored the gradient values, , across inputs and time steps for each participant. Because positive or negative contributions were potentially equally important (i.e., increases or decreases in brain activation altered the class prediction), we considered the absolute values of the z-scores above. Finally, we averaged these values across participants to obtain group-level saliency estimates. To evaluate the robustness of the saliency results, we compared them to those obtained from a null model. To generate a null distribution, using test data, class labels were randomly shuffled, and saliency values computed as defined above. The procedure was repeated 1000 times, yielding a distribution of null-saliency values at each time step (see Fig 4). To further characterize region importance, we performed a lesion analysis. After the learning phase with the full input, particular groups of ROIs were “lesioned” (masked) by setting their input time series to zero in the test phase. Initially, we masked each network from the cortical Schaefer-300 seven-network parcellation [32]. We then grouped ROIs within the seven “standard” networks into smaller groups using the seventeen-network Schaefer-300 parcellation.

Baseline models

We compared the GRU classifier to two baseline models. Training was based on computing cross-entropy loss between predicted and true labels at each time step (similar to Eq 3). The standard backpropagation algorithm was employed (Adam optimizer), as implemented in TensorFlow.

Feed-forward network

The feed-forward (FF classifier) network consisted of three fully-connected (FC) layers. Brain inputs at time, , were fed into a fully-connected hidden layer, and then transformed into a vector of class scores, as with the GRU classifier. Formally, the hidden layer, , and output layer, , activations were obtained as follows: where the ReLU and Softmax operations are standard rectified linear and generalized logistic functions, respectively. The vectors and had dimensions and (for 15-way classification) respectively. The number of hidden units was set to 103 in order to maintain the total number of parameters of the classifier (32563) comparable to that of the GRU classifier (32559).

Temporal convolutional network

We investigated whether static fixed-sized windows would be sufficient for modeling dynamics. A temporal convolutional network (TCN classifier) uses fixed-sized convolutional filters, where the output at a time depends upon a fixed temporal window, which is determined by the filter length l [48, 49]. Activations after temporal convolution were then transformed to obtain class scores. Formally, brain inputs at each time, , were convolved temporally with k filters, each with length of l time steps. A single filter of size N × l convolved along the time dimension (input time series) and generated a scalar time series. Therefore, k such filters produced a k-dimensional convolutional layer activation at each time, , which was used to generate class scores: The vectors and were of dimensions k = 25 and , respectively. The choice of k = 25 filters, each of width of l = 5 time steps, was chosen to keep the the total number of parameters (37915) similar to that of the GRU classifier (32559). To predict participants’ scores, we employed a GRU-based model, essentially a GRU regressor. Separate models were trained to predict fluid intelligence and verbal IQ [50]. In each case, all movie clips were used for training (rather than training a separate model for each clip) to promote learning representations that were not idiosyncratic to a particular clip, and thus generalizable across clips. We applied the GRU classifier approach (Fig 1). The GRU outputs, (dimensionality: N = 32), were input to a fully-connected layer with a single unit having a linear activation function. The models learned to predict a single score, , at every time step t: Thus, we were able to generate predicted scores as a function of time, which could be compared with the true score, allowing evaluation of how model predictions evolved. We computed the mean squared error (MSE) between true, y, and predicted, , scores at each time point: where Θ was the set of learnable parameters. The total loss was defined as the average MSE loss across all time points. The models were trained using backprogagation, using the Adam optmizer for gradient descent. We compared our approach to connectome-based predictive modeling (CPM; [51]), possibly the state-of-the-art in this regard. In this approach, a functional connectivity matrix is initially formed for each clip based on the Pearson correlation between every pair of brain region time series. Subsequently, the entries in the matrix (or edges) that correlate with the individual differences measure beyond a set threshold (here, 0.2) are retained and a linear model is fit to predict scores based on the functional connectivity values. Given that functional connectivity is employed, CPM predictions rely on activations during the entire clip. We tested our model statistically when making predictions at every time point. At each time t, once the model was trained (training set only), we evaluated chance performance based upon a null distribution obtained through permutation testing. For each iteration, we randomly shuffled the true scores of test data across participants and then calculated Spearman’s rank correlation between predicted and shuffled scores (1000 iterations). The 95% confidence interval is shown by the gray zone in Fig 6. Because this test evaluates differences at every time t, it suffers from the problem of multiple comparisons. To address multiple comparisons, we performed a single test (hence eschewing multiple tests) that evaluated if the maximum average correlation over a temporal window of length l could be observed by chance. The null scenario was generated by shuffling participants (i.e., the fMRI and fluid intelligent were randomly associated, not based on participant ID). The permutation-based test draws from ideas used in the MEG/EEG field to evaluate if two time series differ in time [52]. Specifically, our method was as follows. Consider a sliding window of length l over the entire clip time series, and compute the value of the average correlation over the sliding window. This allows us to determine the window that has the highest average correlation. For example, in Fig 6, the window of length l = 10 extended from t = 139, …, 148, for fluid intelligence. To determine the probability that such value would be observed by chance, we computed the same highest average correlation based on values for shuffled participants (1000 iterations), providing a null distribution (Fig 6, S3 and S5 Figs). The above approach handles multiple comparisons by performing a single statistical evaluation based on the test dataset (after training). However, it considers all temporal windows to be able to select the one that has the highest average score. Another possible approach is to select the best temporal window based on training data, and apply that temporal interval to the test data (as a test of generalization). We call this test the oracle test because it uses test data to guess what will work (generalize) in the test data. No correction for multiple comparisons is needed because a single test is applied to the test dataset. For Fluid Intelligence prediction, the best windows were 20–29 seconds for Socialnet (p = 0.0014) and 73–84 seconds for Starwars (p = 0.0007). For Verbal IQ, the best window was 128–137 seconds for Oceans (p = 0.0034). These tests pass or just miss a Bonferroni correction with 15 movies (0.05/15 = 0.0033).

Summary of models used

Table 1 summarizes all model architectures and documents all hyperparameters used for training.

Table 1

Summary of models implemented in the paper.

Hyperparameters and other parameters: LR: controlled the rate of update for gradient descent; Dr: dropout for the GRU layer [53]; L2: L2 regularization coefficient for the GRU layer; BS: Batch Size, number of training samples before updating model parameters; Ep: Epochs, number of passes through the training dataset.

Model (Dataset)	Architecture Layer (unit(s))	# layers	LR	Dr	L2	BS	Ep
GRU classifier (movie-watching)	1 × GRU (32 units)	2	6 × 10⁻³	10⁻⁶	10⁻³	32	45
GRU classifier (movie-watching)	1 × FC (15 units)	2	6 × 10⁻³	10⁻⁶	10⁻³	32	45
FF classifier (movie-watching)	1 × FC (103 units)	2	10⁻³	NA	NA	32	45
FF classifier (movie-watching)	1 × FC (15 units)	2	10⁻³	NA	NA	32	45
TCN classifier (movie-watching)	1 × TCN (25 filters)	2	10⁻³	NA	NA	32	50
TCN classifier (movie-watching)	1 × FC (15 units)	2	10⁻³	NA	NA	32	50
GRU encoder (movie-watching)	1 × GRU (32 units)	3	6 × 10⁻³	10⁻⁶	10⁻³	32	45
	1 × FC (Nh^ units)
	1 × FC (15 units)
GRU regressor (movie-watching)	1 × GRU (1 unit)	2	10⁻³	NA	NA	30	20
GRU regressor (movie-watching)	1 × FC (1 unit)	2	10⁻³	NA	NA	30	20
GRU classifier (moving-circles)	3 × GRU (32 units)	4	6 × 10⁻³	10⁻⁶	10⁻³	32	28
GRU classifier (moving-circles)	1 × FC (1 unit)	4	6 × 10⁻³	10⁻⁶	10⁻³	32	28

Summary of models implemented in the paper.

Conclusion

In the present paper, we developed a computational approach to study and characterize spatiotemporal brain signals in the context of fMRI. The framework can be applied to other types of brain data for discovering and interpreting brain dynamics.

Euclidean distances between trajectories for all movie clips.

Euclidean distances between trajectories. In the inset, the duration of every clip is indicated in parenthesis. (PDF) Click here for additional data file.

Fluid intelligence predictions for all movie clips.

Fluid Intelligence predictions for all movie clips. Conventions as in Fig 6 in the main text. (PDF) Click here for additional data file.

Fluid intelligence predictions: Null distributions.

Null distributions of fluid intelligence predictions. Vertical blue lines indicate the prediction based on actual data. (PDF) Click here for additional data file.

Verbal IQ predictions for all movie clips.

Verbal IQ predictions for all movie clips. Conventions as in Fig 6 in the main text. (PDF) Click here for additional data file.

Verbal IQ predictions: Null distributions.

Verbal IQ predictions: null distributions. Vertical blue lines indicate the prediction based on actual data. (PDF) Click here for additional data file.

Saliency movie for Brokovich clip.

(MP4) Click here for additional data file.

Saliency movie for Star Wars clip.

(MP4) Click here for additional data file.

Gated recurrent units.

(PDF) Click here for additional data file. 26 May 2021 Dear Dr. Pessoa, Thank you very much for submitting your manuscript "Learning brain dynamics for decoding and predicting individual differences" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. Your paper was in general well received, but several issues, some of them serious, were raised. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Daniele Marinazzo Deputy Editor PLOS Computational Biology Daniele Marinazzo Deputy Editor PLOS Computational Biology *********************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: Thank you for inviting me to review this manuscript by Pessoa and colleagues, in which the authors train a GRU with functional MRI data in order to classify different movie clips and task states, while also retaining the ability to track low-dimensional trajectories over time. Overall, I found the study to be of interest, however there were a number of sections that I believe would benefit from more explicit justification and expansion. • The authors may wish to provide some insights into why a GRU architecture was chosen over other potential options (e.g., vanilla feed-forward networks; convNets; LSTMs; etc). I appreciate that they used two of these as benchmark networks, however it would benefit the reader if the reasons for including them (or not) were made more explicit. • I apologise if I missed something, but how does training a GRU with an extra fully-connected hidden layer lead to low-dimensional representations of the data? Is it simply due to the fact that the added hidden layer has a smaller number of nodes? And if so, how did the authors determine precisely how low-dimensional to make this hidden layer? • Fig 3. The results of Figure 3 are mildly surprising. Were the authors not surprised that their different test benchmarks performed so poorly, particularly when the only major differences between GRUs and ANNs/ConvNets is related to local nodal recurrence. It was also slightly strange that each of the null networks was strongly above chance (i.e., greater than 8%) but also hovered around the same value (~40%). What is the significance of this value? • Fig 4. Can the authors provide any intuition as to why the trajectories of the different movie clips arise from the origin of the 3D plot? The authors interest in fluid intelligence and fMRI may be augmented by reference to both https://doi.org/10.1038/nn.4135 and https://doi.org/10.1038/s41593-018-0312-0 Reviewer #2: In this work, Pessoa et al. test a neural network architecture to decode fMRI-based brain dynamics and classify brain patterns elicited by movie-watching and a continuous experimental paradigm. They explore the associations between the employed GRU network and dimensionality reduction, behavior and brain activation through salience maps. The manuscript is generally clearly written and logically implemented. However, some of the characteristics of this methodology should be more clearly explained and motivated, before I can recommend this work for publication. Please find my comments below. Major comments: - One of the goals of using this approach was to exploit the temporal richness of the fMRI brain signals. However, as the authors acknowledge, "temporal shuffling of the time series still allowed the GRU classifier to attain classification at around 50% correct". In other words, breaking temporal information still gives ~50% accuracy after the GRU modeling procedure. Isn't that worrisome? As a "sanity" check, could the authors include a resting-state condition in the data pool, to see how it will affect the classification results? - If I understood Fig. 4 correctly (I'd recoomend to improve the caption of 4C), the trajectories in 4A) are based on the first 3 PCA components of the GRU outputs h_t, which already allow for a high classification accuracy (~60%). However, these very same 3 components have very little to do with the original brain signal ( they only explain ~2 % of the variance of the original data!). This is somewhat in line with the common viewpoint of considering neural networks architectures as "black box operators", that output complicated non-linear projections of original signals, which are then harder to trace back. Therefore, I don't understand how informative are the trajectories in 4A), since they have very little to do with the original brain signal, and are only related to the specific GRU architecture. This point should be better elaborated in the manuscript, to warn the reader on potential misleading interpretations of these NN classifiers. - On a related note, this also makes me wonder: how accurate is the approximation in equation 5), on which the brain salience maps were based? - Similarly, what is the advantage of this complicated GRU-based model for the voxel prediction maps depicted in Fig. 7 with respect to, for instance, a simple GLM-based (SPM style) analysis? Wouldn't the GLM approach give you the same brain maps shown in Fig. 7? - I know that in our community there is the increasing need of connecting new computational models with behavior and/or clinical scores. However, I really have trouble understanding why the way a person watches Star Wars or Home Alone should be predictive of that person's "fluid intelligence". The current manuscript lacks a proper discussion of these results. I would appreciate a more in-depth reasoning on these "neuroscientific" findings, rooted in the state-of the art of naturalistic task analysis. In other words, it is clear that the GRU black box is very effective at classifying different naturalistic tasks, but what do we learn about the brain that we did not know before? Minor comments: - Methods- Was band pass filtering applied to the fMRI data? What about Global signal regression? Would those impact the classification results? - Methods. Moving-circles data: how were the voxels selected? - Behavior. "Because of multiple comparisons across time, we performed an additional test considering all time points simultaneously." It is not clear how the authors correct for multiple comparisons. Please clarify. Reviewer #3: The paper presents a spatiotemporal analysis framework for dynamic classification ,regression and visualization of fMRI data. The authors develop a deep learning architecture based on GRUs (a recurrent neural net architecture) using three related components: a classifier to predict the class of stimuli, an encoder for dimensionality reduction to perform visualization and a decoder. They also present saliency maps which enable an interpretability of the model and provide information on the importance of different brain regions / voxels to the classification. This is an advantage of the approach compared to other nonlinear techniques which cannot map back to the original input space. I enjoyed reading the paper and believe it is an important contribution of interest to the readers of PLOS Computational Biology. Major comments 1. The authors should comment in the paper why they choose to train three separate networks instead of training an end-to-end neural network for classification, encoding and decoding, i.e. an alternative architecture could have been a jointly trained autoencoder and classifier but perhaps this would be less stable to train or result in poorer performance. In addition regarding the encoder, since it shares the GRU with the classifier (these aren’t actually fully separate), this should be made more explicit in the figure. 2. Related work: the introduction should also briefly discuss how this paper relates to recent work on dynamical functional connectivity. 3. The authors make the point that decoding is typically static, based on for example reducing the activity to a functional connectivity matrix or averaged activity. A static method should be added as a benchmark to the classification task, similar to including CPM as a static baseline for predicting fluid intelligence and verbal IQ. 4. For the TCN classifier, increasing the length of the kernel from 5 to 10 time steps improved performance. Does a further increase to larger kernel widths (20?40?) further improve classification accuracy? This also relates to the author’s discussion on spatiotemporal representations that long-term dependencies are essential. 5. Regarding the low-dimensional trajectories as well as trajectory distances, what can be learned from this visualization, i.e. what is the aim of visualization? Is the organization of video clips meaningful in any way?? The result that classifying on the low dimensional space (3-10 dimensions) yields classification accuracy close to the full dimension is interesting as it indicates the intrinsic dimensionality of the data. Minor comments: • Can the authors comment on whether saliency maps can be generated for behavior prediction (regression task), as done for the classification task. • Line 12 “to infer mental processes and processes from brain signals.” Missing word? • GRU – main text should mention detailed explanation is in supplementary • Line 131 N_y should be N_h • Line 148 – should be layers by units • Line 256 – how was the threshold of 0.2 chosen? • Table 1 – add number of layers • Line 284 - “were obtained with not used” – missing word? • Red line missing in Fig S4 for flower clip Reviewer #4: The authors propose using modern recurrent neural networks (in this case, a network with gated recurrent units; GRUs) to analyze spatial-temporal brain data in rich, continuous experimental settings like movie watching. Considering most fMRI work is based on static images, this seems like a potentially interesting approach to reveal what previous work could not. Overall, the authors did a lot of work, but did not sufficiently motivate and explain the methods in a way that convinced me that we learnt much new - or what the new part was. The authors need to explain and motivate their approach, what their findings are, and why this is the way to go (at least for this kind of data). Below, I note major and minor points that I hope will be helpful to improve the manuscript. Major concerns 1. General point on motivation and introducing the method. It was not clear to me what we benefit from this analysis from a broad perspective, or from the specific analysis steps. There is a brief general note on what they will do and why (e.g. decode, reduce dimensionality), but not clear what the conceptual significance is (test what kind of information there is some brain area, explain what temporal information is capture and why it is interesting). 2. Interpretation problem from nonlinear decoders. What can we say from a non-linear decoder? Standard MVPA is linear - and the point is if you can read out information from a certain brain area from a linear decoder, the content is there in an explicit way. For example, you can decode anything from the retina if you use a non-linear decoder, but that doesn't mean everything is represented there. For example, see Tong and Pratte 2012, p487, DiCarlo et al., 2012 on tangled representations, Kriegeskorte & Kreiman 2011. 3. What does the decoder tell us? What did we learn? The classifying decode which movie clip it is across participants. This tells us the brain holds some info about movie clip. Is this the conclusion? What have we learnt? Is it based on visual, auditory information, or semantic content? More generally: what would decoding of a movie clip mean - we probably don't care that area X represents encodes movies, or distinguishes different movies… What is the important point? Although the authors argue that the method give spatial and temporal insights, there doesn't seem to be any spatial or temporal interpretation since it uses all the data to construct the classifier. Training on temporal data then testing on specific time points - does this give any info on that specific time point, is there a problem with the classifier having inputs/info from other timepoint? The authors use saliency maps for finding interesting spatial or temporal points, but it's unclear if this is meaningful (see point below on saliency maps). 4. Why GRUs, why is it good for the current setting? Lack of a motivation/explanation Typically, GRUs are used to predict the next time step. Here it seems like it tries to do that too, but then prediction is then on a 15-way classifier to classifier movie clips across participants. Not clear how the temporal prediction, applied to across subjects analysis and not predicting time, helps. Authors need to clarify this - and maybe it is an important method point that I missed. 5. Saliency maps - the paper authors cite was for images in convnets, a visualizing technique based on images which usually finds where the 'main object' is. Is this transferrable to brain data? What is a 'salient' activation/timepoint in the brain in this case? Input feature is time and/or space - what would a salient timepoint mean? Strong activation, or strong decoder? Is it just a (statistical) way to extract the strong areas/timepoints or something more? Is it related to saliency in the clips (probably not?)? Not much details in the results and discussion. Would be good to have this from the methods and to fully motivate the meaning and purpose of this measure. 6. Dimensionality reduction - this is non-linear, so again, this seems like there might be a problem for interpretation, especially when it is used to test if you can decode from the low-dimensional space - i.e. same problem as a non-linear decoder above. Furthermore, this is now a non-linear space - which could be high dimensional in linear space? Separately - dimensions increased variance explained - is it still a low-d manifold? Looks like it keeps increasing. That the first few dimensions take much of the variance is common (e.g. in PCA), and here it's non-linear which might mean it's more. This is not so much an issue of method but interpretation. Maybe more important: it is 10d with time - is this not much higher dimension than 10? 7. Predicting 'behavior' (e.g. IQ scores) - why is this done over time? Is it actually predicting a single score over time? Why would you predict something multiple times when there is only 1 score (stats point) - it seems more valid (and powerful) to predict one score per participants with all the data. More importantly, why would you theoretically predict that it varies, and correlates with movie watching? Minor 8. Discussion is very short and basic, reiterating the results without much interpretation of what they mean. If the contribution is more about a method, more discussion should go into the methods and comparison with previous methods. If it's about prediction (in the ML sense), then they can also be more clear about that, without interpretation. 9. The GRU model does better than other models that don't take into account time. These are some good comparisons. However, it was unclear to me if the GRU was therefore more complex (it is in the sense that it takes into account temporal info, so maybe that's unavoidable) so did better, and if there are more parameters (which would be unfair). 10. "As our goal was not necessarily maximizing model performance, we explored a small set of potential architectures by varying the number of hidden layers and number of units per layer " If it's not, what's the purpose of testing multiple models? Having multiple models show similar effects could be informative to show that it's reliable across different model settings, but it seems like the authors did choose the best model. Why in the HCP 1 layer enough and in the moving circles task 3 layers needed? Classification task was harder, probably less information (e.g. visual). But this is now more difficult to interpret. What does this tell us - does it tell us anything apart from decoding is harder because there less information? Then making the model more complex you can decode more - this is obvious, but not clear how it's interpretable neurally. 11. Train on subset of participants, test on others (100 train, 76 test). This is fine but why these numbers? Worth justifying in the text (e.g. more training data can be more desirable than split half). 12. Threat vs safe - it seems like the brain region would have come up in a more traditional univariate and/or multivariate (e.g. linear SVM) analysis as well. What is added here? ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: None Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: No: "All data and code will be made available upon publication." I can see the GitHub repo but not the processed data. Not sure if they will have a repo for that or will just point to the HCP? ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No Reviewer #4: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols 26 Jul 2021 Submitted filename: GRU_plos_responses_revision1.pdf Click here for additional data file. 19 Aug 2021 Dear Mr Misra, We are pleased to inform you that your manuscript 'Learning brain dynamics for decoding and predicting individual differences' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Daniele Marinazzo Deputy Editor PLOS Computational Biology Daniele Marinazzo Deputy Editor PLOS Computational Biology *********************************************************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The authors have adequately addressed my concerns. Reviewer #2: The authors have addressed all my concerns. Reviewer #4: The authors have made many clarifications and rephrasings that better expresses the purposes of the paper and its reach. The lesion results were very nice. One idea for an additional analysis is if you lesion brain regions across 2 (or more) salient networks – do they cause an additive amount of decrease in accuracy (because they are different, independent contributions, e.g. 10% drop + 10% drop in accuracy = 20%) or not (because they are not independent – e.g. visual-motor systems might be representing visual actions, and so you get still a 10% drop). However, it might not be necessary here. It might be worth discussing / doing in the future. Individual differences – OK, makes sense with anxiety and suspense. Right - with IQ, I guess one reason could be to test which timepoints might relate to IQ (e.g. complex storyline / whether participants understood something). With the results here I think it’s probably less interesting (or difficult to find out) but proposing it as a method makes sense. General comment: A suggestion to the authors for replying to reviews (a suggestion to me from my advisors) – include the points / changes to the manuscript in the reply verbatim (e.g. in quotation marks). It was inconvenient to go to the referenced lines (which is better than no reference), but difficult to figure out how the points were addressed when there were no references. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: None Reviewer #2: Yes Reviewer #4: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #4: No 30 Aug 2021 PCOMPBIOL-D-21-00636R1 Learning brain dynamics for decoding and predicting individual differences Dear Dr Misra, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Andrea Szabo PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

54 in total

Introduction

Results

Model architectures based on neural networks with gated recurrent units (GRUs).

Generalizability of spatiotemporal patterns

Prediction of movie clip (15-way classification).

Is temporal information necessary for clip prediction?

Low-dimensional trajectories as spatiotemporal signatures

Low-dimensional trajectories.

Spatiotemporal importance maps

Determining brain contributions to classification.

Lesion analysis

Lesion analysis.

Predicting behavior

Prediction of behavior.

Dynamic multivariate pattern analysis

Voxelwise prediction of experimental conditions: Threat vs. safe.

Discussion

Spatiotemporal representations

Latent dimensionality

Saliency and lesion analysis

Individual differences

Other approaches

Materials and methods

Data

Movie data

Moving-circles paradigm

Performance evaluation

Accuracy

Training/testing split and cross-validation

Basic recurrent architecture for classification

Selection of model architectures via cross validation on data from 100 participants.

Dimensionality reduction

GRU decoder

Saliency maps: Explaining model predictions via sensitivity analysis

Baseline models

Feed-forward network

Temporal convolutional network

Summary of models used

Summary of models implemented in the paper.

Conclusion

Euclidean distances between trajectories for all movie clips.

Fluid intelligence predictions for all movie clips.

Fluid intelligence predictions: Null distributions.

Verbal IQ predictions for all movie clips.

Verbal IQ predictions: Null distributions.

Saliency movie for Brokovich clip.

Saliency movie for Star Wars clip.

Gated recurrent units.

Review 2. Machine learning in resting-state fMRI analysis.

Review 3. Assessing and tuning brain decoders: Cross-validation, caveats, and guidelines.