Literature DB >> 36081469

Real-time neurofeedback to alter interpretations of a naturalistic narrative.

Anne C Mennen¹, Samuel A Nastase¹, Yaara Yeshurun², Uri Hasson^1,3, Kenneth A Norman^1,3.

Abstract

We explored the potential of using real-time fMRI (rt-fMRI) neurofeedback training to bias interpretations of naturalistic narrative stimuli. Participants were randomly assigned to one of two possible conditions, each corresponding to a different interpretation of an ambiguous spoken story. While participants listened to the story in the scanner, neurofeedback was used to reward neural activity corresponding to the assigned interpretation. After scanning, final interpretations were assessed. While neurofeedback did not change story interpretations on average, participants with higher levels of decoding accuracy during the neurofeedback procedure were more likely to adopt the assigned interpretation; additional control conditions are needed to establish the role of individualized feedback in driving this result. While naturalistic stimuli introduce a unique set of challenges in providing effective and individualized neurofeedback, we believe that this technique holds promise for individualized cognitive therapy.

Entities: Chemical

Keywords: Naturalistic stimuli; Neurofeedback; Real-time fMRI; Shared response model

Year: 2022 PMID： 36081469 PMCID： PMC9451129 DOI： 10.1016/j.ynirp.2022.100111

Source DB: PubMed Journal: Neuroimage Rep ISSN： 2666-9560

Introduction

Compared to healthy participants, depressed participants are more likely to negatively interpret ambiguous scenarios, especially those that contain self-referential prompts (Everaert et al., 2017). Past research has associated increased depression severity with the persistence of these negative interpretations even after learning that the scenarios ended positively (Everaert et al., 2018). To reduce this negative bias and encourage positive interpretations, researchers have used Cognitive Bias Modification for Interpretation (CBM-I) training (Mathews and Mackintosh, 2000). During positive CBM-I training, depressed participants are shown scenarios that differ in interpretation depending on the final word, e.g., The new people you meet will find you (boring/friendly). During training, the final word is revealed to disambiguate the meaning (e.g., friendly) so that participants become more likely to expect the positive outcome (Joormann et al., 2015). A meta-analysis collapsing across anxious and depressed individuals found that CBM-I training more effectively reduced negative biases than Attention Bias Modification (ABM) training, which involves training participants to direct their attention away from negatively valenced pictures (Hallion and Ruscio, 2011). One explanation for the increased efficacy is that CBM-I training uses self-relevant stimuli that are inherently more realistic and relatable than the negative faces or isolated words used in ABM training. As with other training paradigms, however, results are mixed in regards to whether CBM-I training actually improves depressive symptoms (Jones and Sharpe, 2017). One explanation for the mixed clinical results is that the training uses a one-size-fits-all approach. In other words, all participants are trained in the same way, regardless of variability in symptoms, momentary lapses in attention, effort, or belief in one of the interpretations. This raises the prospect that better results could be obtained using individualized training methods with more realistic stimuli. Specifically, given that different interpretations of complex narrative and social scenarios yield different neural responses (Yeshurun et al., 2017), it might be possible to use real-time neuroimaging to track how well participants are adopting the desired interpretation, and to provide feedback to bias them toward the desired interpretation (for recent reviews of real-time fMRI neurofeedback, see Stoeckel et al., 2014; Sitaram et al., 2017; Thibault et al., 2018; Watanabe et al., 2017; Hampson, 2021; Taschereau-Dumouchel et al., 2022). Before trying this in a clinical setting, we need to demonstrate that it is possible to decode interpretations in real time, and that feedback based on this decoded interpretation is effective in shaping participants’ interpretations. The present study takes some initial steps toward this goal, by assessing whether it is possible to nudge participants’ interpretation of an ambiguous narrative using real-time fMRI (rt-fMRI) neurofeedback. Here, we build on prior work demonstrating that high-level cortical areas differentiate interpretations of an ambiguous social narrative across individuals (Yeshurun et al., 2017; Finn et al., 2018; Nguyen et al., 2019). Specifically, we based our experiment on a prior study showing that neural responses to an ambiguous 12-min spoken narrative vary depending on how the story is interpreted (Yeshurun et al., 2017). In this study, Yeshurun et al. (2017) explicitly instructed participants to adopt one of two different interpretations before listening to the story in the fMRI scanner – in one interpretation, the main character’s wife is cheating on him, and in the other interpretation, the main character is just being paranoid (see Stimulus section below). This manipulation ensured that the two groups of participants would interpret the story in different ways, allowing the authors to measure neural signatures of the interpretations shared within each group. The authors successfully identified neural regions that accurately predicted the assigned interpretation in held-out participants. Within these regions, the parts of the story that varied most in meaning based on the two interpretations were more likely to yield higher neural classification accuracy. In the current study, we pre-trained a classifier using the data from Yeshurun et al. (2017) to decode which interpretation participants were adopting; the data from Yeshurun et al. (2017) are openly available as part of the Narratives dataset released by Nastase et al. (2021). Crucially, instead of explicitly instructing participants to adopt a particular interpretation, we randomly assigned participants to an interpretation condition (without telling them which condition they were assigned to) and attempted to nudge them towards the assigned interpretation using neurofeedback. Specifically, while real-time participants listened to the same story in the fMRI scanner, we used the pre-trained classifier to decode which interpretation the participant was most likely thinking about in a given moment, and then provided intermittent neurofeedback to push the participants toward their assigned interpretations – participants were rewarded when the classifier’s estimate of their interpretation of the story matched the participant’s randomly assigned interpretation condition. After listening to the story, participants answered questions to assess if neurofeedback successfully biased story interpretations. Thus, the critical comparison was between each participant’s actual interpretation at the end of neurofeedback and the target interpretation determined by random group assignment. If neurofeedback was successful, participants in each of the assigned groups would be more likely to adopt the target interpretation of that group, causing interpretations between assigned groups to differ. Note that we chose this design to obtain initial proof-of-concept evidence that we would be able to “nudge” participants’ interpretations, accepting that additional controls would be needed to establish the role of individualized neurofeedback in driving these results (see Discussion). Overall, we did not reliably bias participants toward the target interpretations. However, taking into account the decoding accuracy of the classifier for each participant, we found that participants with the highest decoding accuracy were more likely to choose the target interpretation.

Methods

Participants

Twenty-two participants from Princeton University and the surrounding local community consented to participate in this study, but 2 participants did not return for their second visit. Thus, 20 participants were included in the analysis (12 female, 2 left-handed, mean age = 20.5 years). Participants received monetary compensation for their participation, including an additional bonus based on their neurofeedback performance ($20 maximum). The study was approved by Princeton University’s Institutional Review Board.

Stimuli

All participants listened to a 12-min adapted version of “Pretty Mouth and Green My Eyes” by J. D. Salinger, read by a professional actor. The same recording was used in Yeshurun et al. (2017). The audio stimulus began with 18 s of music and ended in silence; see Yeshurun et al. (2017) for further details about the stimulus. The audio stimulus is openly available as part of the Narratives dataset (Nastase et al., 2021). The story begins when the character Arthur calls his friend, Lee, after returning home late at night without his wife, Joanie. Arthur is concerned about Joanie’s whereabouts. Lee is at home in bed with a woman, although the narrator purposefully leaves her identity ambiguous. Yeshurun et al. (2017) created two interpretation groups by briefing participants on this woman’s identity before they heard the story. The two interpretation groups were: Cheating: Joanie is the woman in Lee’s bed. In this case, Joanie is cheating on her husband, Arthur, with Lee. Paranoid: Lee’s girlfriend, Rose, is the woman in Lee’s bed. In this case, Arthur is paranoid that Joanie is cheating on him and is bothering Lee late at night.

Procedure

On the first visit (Visit 1), participants consented to the experiment and underwent initial structural and functional scans for image registration. Because our pre-trained classifier was trained in standard MNI space, registering each participant’s brain to MNI space before Visit 2 allowed us to apply the classifier in real-time. For registration details, see Methods, Sections 2.4.1 and 2.4.2. On the second visit (at least one day later, mean delay = 4.6 days), participants returned to complete 4 runs of neurofeedback training (one participant only completed 2 runs due to technical problems). Before training began, participants were randomly assigned in a double-blind fashion into either the cheating or paranoid interpretation group. Group assignment was performed via a Python script that saved the assignment as a text file to be used later during neurofeedback. This way, both the experimenter and participant were blind to group assignment. Of the 20 participants, 10 were randomly assigned to the cheating group and 10 were randomly assigned to the paranoid group. Importantly, the assigned interpretation group for a particular participant was the same across all 4 runs of neurofeedback training (i.e., if a participant was assigned to the cheating group for the first run, they were also in the cheating group for runs 2, 3, and 4). As noted above, we did not tell participants which interpretation to take at the onset of the experiment. Instead, we informed them of the two possible interpretations and that both were equally likely (meaning that there was no ground truth or correct interpretation). Thus, they would have to use neurofeedback to determine which was the correct interpretation for them. While participants listened to the story, we used a visual stimulus to present neurofeedback (Fig. 1). For most of the story, participants viewed a gray rectangle, indicating that they should simply continue listening. Four seconds prior to a specific period in the story when we would analyze brain activity (which we refer to as a station), participants indicated which interpretive “lens” (cheating or paranoid) they were going to adopt for the upcoming station. Participants were free to choose either interpretation at any time, as long as they were trying to maximize their scores. Participants pressed their index or middle fingers to indicate their choice of either the cheating or paranoid interpretation (left/right position of choices was counterbalanced across participants).

Fig. 1.

Experimental design. (a) Before scanning, we pre-trained a classification model on data from Yeshurun et al. (2017) to create template brain responses at each station. (b) On Visit 1, a high-resolution anatomical scan and a short functional scan were acquired for registration to MNI space. At least one day later, participants returned for 4 runs of story listening with neurofeedback training (Visit 2). Each run was about 12 min long, and featured the same audio stimulus used by Yeshurun et al. (2017). Each run included 7 stations where brain activity was analyzed. Stations varied in length from 3 to 16.5 s. (c) After each station, the neurofeedback score was displayed and participants received a reward based on the model prediction for that station, normalized by pilot experimental data. (d) Afterward, participants answered questions outside of the scanner to assess their final interpretations of the story.

Two seconds after the probe, the station began. During these stations, the rectangle turned red to signal that brain activity was being recorded. We did this so that participants were aware of when we were analyzing their brain activity. We explicitly marked stations to help participants determine which thoughts led to higher rewards, in hopes of decreasing the credit-assignment problem inherent in the delayed rt-fMRI signal (Oblak et al., 2017); see Appendix A.3 for details on how we chose the stations. To account for hemodynamic lag, the BOLD data that we analyzed for each station were shifted by 3 TRs (4.5 s) from the actual TRs when the “station recording signal” was on screen. After each station, the data were preprocessed and the pre-trained model was used to decode the participant’s interpretation at that station. The rectangle then displayed the neurofeedback score; a horizontal line through the center indicated the threshold to earn any extra money and the filled area corresponded to the score. If participants scored above threshold, the score was shown in green with the monetary reward underneath. Otherwise, the score was shown in gray with $0.00 below the rectangle. Prior work has argued that intermittent feedback of this kind is more effective than continuous feedback (Johnson et al., 2012; Emmert et al., 2017; Hellrung et al., 2018). Because of the aforementioned 3-TR hemodynamic shift, plus one TR of time for data analysis, participants typically had to wait 4 TRs (6 s) for the feedback display to appear after the end of the station (see Appendix B for more details regarding feedback timing). To assess the final interpretations after all runs, participants answered the same questions from Yeshurun et al. (2017). The 39 questions contained 27 comprehension questions (e.g., What was the girl doing when the phone rang?) and 12 questions that directly probed interpretation (e.g., Did you think Joanie was cheating on Arthur?). Additionally, participants provided numerical ratings (spanning 1–5) indicating their opinions on various topics, such as how much they empathized with the characters, enjoyed the story, thought neurofeedback helped, etc. Lastly, participants completed a survey to assess perception and strategies. Note: We emphasized the following in our instructions before scanning began, to make sure that participants were attending to neurofeedback throughout the entire experiment: The story was written to be purposefully ambiguous, instead of there being one true interpretation. You should try to interpret the story based on your neurofeedback, not what you think was intended by the author. Even if you think that the correct interpretation was revealed in the story, characters might be lying. Keep using neurofeedback to guide your interpretations. The neurofeedback scores reflect how clearly and correctly your brain is interpreting the story. Low scores can indicate either (1) noisy signals (e.g., poor focus) and/or (2) incorrect interpretations. The neurofeedback score only represents the neural information recorded during the station. We keep track of your responses to the probes for analysis purposes, but the neurofeedback score you see is only based on your brain activity. Although strategic instructions of this sort are not necessary for neurofeedback-induced learning (e.g. Shibata et al., 2011; Linden et al., 2012), we hoped that providing participants with explicit guidance would improve their performance (Scharnowski et al., 2015; Pamplona et al., 2020). The full set of instructions used in the experiment is included in Appendix D, Fig. 16.

Fig. 16.

Instructions that span pages 1–4 shown in a, b, c, d, respectively.

After scanning, participants completed a questionnaire and survey. Fig. 1 illustrates the full experimental design and the neurofeedback stimuli.

Data acquisition

All scanning data were acquired with a 3T MRI scanner (Siemens Skyra) and a 64-channel head coil. Sequences were matched to Yeshurun et al. (2017) as closely as possible. Both scanning sessions began with a Siemens scout scan for automated slice alignment to the ACPC axis. On Visit 1 only, we collected a high-resolution, T1-weighted magnetization-prepared rapid acquisition gradient-echo (MPRAGE) anatomical scan to facilitate normalizing each participant’s functional data to standard space: repetition time (TR) = 2300 ms, TE = 3.08 ms, flip angle = 9°, resolution = 0.86 × 0.86 × 0.9 mm3, FOV = 220 mm2. We ran functional scans on Visit 1 for registration purposes only (i.e., without presenting stimuli); and we ran functional scans for the experiment during Visit 2. All functional scans used a T2*-weighted echo-planar imaging sequence: 1.5 s TR, 28 ms echo time, flip angle = 64°, 3 × 3 × 4 mm3 voxel size, 64 × 64 matrix, 192 × 192 mm2 field of view, 27 slices, no gap between slices, interleaved slice acquisition. No fieldmap scans were collected, as we were trying to best match the real-time data to the data used to estimate the pre-trained model. As the data from Yeshurun et al. (2017) did not include fieldmap scans, we omitted real-time susceptibility distortion correction in this experiment.

Offline image registration

The following sections describe how we registered each participant to MNI space after the Visit 1 scans, in preparation for Visit 2. We used fMRIPprep 1.2.3 (Esteban et al., 2019; Esteban et al., 2018; RRID: SCR_016216), which is based on Nipype 1.1.6-dev (Gorgolewski et al., 2011; Gorgolewski et al., 2018; RRID:SCR_002502). The following description of preprocessing was generated with fMRIPrep. The T1-weighted (T1w) image was corrected for intensity non-uniformity (INU) using N4BiasFieldCorrection (ANTs 2.2.0; Tustison et al., 2010), and used as T1w-reference throughout the workflow. The T1w-reference was then skull-stripped using antsBrainExtraction.sh (ANTs 2.2.0), using OASIS as target template. Brain surfaces were reconstructed using recon-all (FreeSurfer 6.0.1, RRID:SCR_001847, Dale et al., 1999), and the brain mask estimated previously was refined with a custom variation of the method to reconcile ANTs-derived and FreeSurfer-derived segmentations of the cortical gray-matter of Mind-boggle (RRID:SCR_002438, Klein et al., 2017). Spatial normalization to the ICBM 152 Nonlinear Asymmetrical template version 2009c (Fonov et al., 2009, RRID:SCR_008796) was performed through nonlinear registration with antsRegistration (ANTs 2.2.0, RRID:SCR_004757, Avants et al., 2008), using brain-extracted versions of both T1w volume and template. Brain tissue segmentation of CSF, WM and GM was performed on the brain-extracted T1w using fast (FSL 5.0.9, RRID: SCR_002823, Zhang et al., 2001). For the BOLD run collected during Visit 1, the following registration processing was performed. First, a reference volume and its skull-stripped version were generated using a custom methodology of fMRIPrep. The BOLD reference was then co-registered to the T1w reference using bbregister (FreeSurfer) which implements boundary-based registration (Greve and Fischl, 2009). Co-registration was configured with nine degrees of freedom to account for distortions remaining in the BOLD reference. Head-motion parameters with respect to the BOLD reference (transformation matrices, and six corresponding rotation and translation parameters) were estimated before any spatiotemporal filtering using mcflirt (FSL 5.0.9, Jenkinson et al., 2002). The BOLD time series was resampled onto the original, native space by applying a single, composite transform to correct for head-motion. The BOLD time series was then resampled to MNI152NLin2009cAsym standard space. First, a reference volume and its skull-stripped version were generated using a custom methodology of fMRIPrep. All resamplings were performed with a single interpolation step by composing all the pertinent transformations (i.e. head-motion transform matrices and co-registrations to anatomical and template spaces). Gridded (volumetric) resamplings were performed using antsApplyTransforms (ANTs), configured with Lanczos interpolation to minimize the smoothing effects of other kernels (Lanczos, 1964). Many internal operations of fMRIPrep use Nilearn 0.4.2 (Abraham et al., 2014, RRID:SCR_001362), mostly within the functional processing workflow. For more details of the pipeline, see the section corresponding to workflows in fMRIPrep’s documentation.

Real-time image registration and processing

As BOLD data arrived from the scanner, the data were transferred as bytes to a secure cloud server, where all subsequent processing steps were performed. The data were returned to the local Linux machine as a .txt file containing the final neurofeedback score to be displayed. Real-time processing was handled using the RT-Cloud software package (Kumar et al., 2021; Wallace et al., 2022); see Appendix B for full details on the cloud setup. BOLD data were acquired in the participant’s native space, but the pre-trained model was in standard MNI space. Therefore, we had to transform each incoming BOLD volume to MNI space in real-time. To do this, we combined Visit 1’s previously calculated registration steps from fMRIPrep with real-time registration of each new BOLD volume. In real-time, we used mcflirt (FSL 5.0.9, Jenkinson et al., 2002) to register each incoming BOLD volume with the example functional image acquired on Visit 1. We combined this transformation matrix with the two transformation matrices calculated previously with fMRIPrep: (1) the transformation from functional → T1w space, and (2) the transformation from T1w space → MNI space. All transformations were concatenated and performed in a single step using antsApplyTransforms (Avants et al., 2008). Details of this pipeline performed on the cloud server are shown in Table 2 in Appendix B.

Table 2

List of preprocessing steps completed on the cloud server for real-time generation of neurofeedback scores.

Step	Description	Software used

1	Convert from DICOM bytes to save as a NIfTI image	nilearn (Abraham et al., 2014)
2	Calculate transformation matrix between current TR and reference image	mcflirt FSL 5.0.9 (Jenkinson et al., 2002)
3	Convert transformation matrix.mat →.txt	c3d_affine_tool Convert3D (Yushkevich et al., 2006)
4	Transform current TR to MNI space by concatenating all transformations	antsApplyTransforms ANTs 2.1.0 (Avants et al., 2008)
5	Read and mask NIfTI data in MNI space, store in memory	nibabel (Brett et al., 2019)
If current TR is the last TR in a station:
6	Z-score data within voxels from the beginning of the story to that TR (and remove voxels that were outside of the brain by setting activity to 0 if value was < 100 or std < 0.001)	custom Python code
7	Subtract the average signal from the Yeshurun et al. (2017) data set	custom Python code
8	Classify station and generate NF score	Scikit-learn (Pedregosa et al., 2011)
9	Save score as text file	RT-Cloud software (Wallace et al., 2022) Kumar et al. (2021)
10	Send text file back to local Linux to update display	RT-Cloud software (Wallace et al., 2022) Kumar et al. (2021)

Classification

We used 7 pre-trained logistic regression classifiers (scikit-learn solver = ‘lbfgs’, C = 1), corresponding to the 7 stations. Classifiers targeted regions of interest comprising the theory of mind (ToM) network and default mode network (DMN) based on interpretation-specific neural responses observed in these regions by Yeshurun et al. (2017), and prior work suggesting that these regions are effective targets for neurofeedback (Zhang et al., 2012; Harmelech et al., 2015; Skouras and Scharnowski, 2019; Pamplona et al., 2020). As noted above, the classifiers were trained on previously-collected data from Yeshurun et al. (2017); see Appendix A for details on classifier construction. Each classifier estimated p(c), the probability that the participant was interpreting the story in line with the cheating interpretation (the probability of the paranoid interpretation was 1 − p(c)). We generated this probability estimate using scikit-learn’s predict_proba function (Pedregosa et al., 2011). For more detailed information on how we optimized the classifier design, see Appendix A. To convert this prediction to the neurofeedback score delivered to the participant, we normalized p(c) based on pilot data. We learned from the pilot experiment (Appendix C) that – in our paradigm, where participants are allowed to form their own interpretations, as opposed to being explicitly told which interpretation to use as in Yeshurun et al. (2017) — the mean p(c) values varied considerably across stations, such that many stations were strongly biased toward particular interpretations (regardless of group assignment). To control for these biases, we decided to provide neurofeedback based on participants’ deviation from the “average neural interpretation trajectory” (where this “average neural interpretation trajectory” was computed by collapsing results across the two conditions in the pilot study), rather than providing neurofeedback based on whether participants’ decoded interpretation matched the assigned interpretation at a particular moment. That is, at a moment when the narrative leans toward the paranoid interpretation on average, we rewarded participants in the cheating group if their interpretation was closer to cheating than the average, even if (in absolute terms) their interpretation was closer to paranoid than to cheating. This kind of neurofeedback can be viewed as “nudging” individual trajectories off the default stimulus-driven path. Specifically, for each participant, at a given station st, we applied the following formula to transform p(c) to a neurofeedback score: where μ and σ were the mean and standard deviation, respectively, of p(c) for that station, from all participants and runs in the pilot experiment. We included a scaling factor of 3 so that the z-scored differences would range roughly between [−0.5,0.5]. Thus, by adding 0.5 to this ratio, we could set the range to be [0,1]. In case scores were much larger or much lower than station means, we added the thresholds: Next, we generated a score based on group assignment with: In this framework, participants received a neurofeedback score of .5 if their p(c) was equal to the mean across participants in the pilot experiment. To earn a reward, participants had to score above the station’s mean in the assigned direction. To enforce this constraint, neurofeedback scores ≤ .5 were converted to 0 values: In the results that follow, when reporting neurofeedback scores, we report this final value where scores ≤ 0.5 were converted to 0, since this corresponds to the actual rewards received by participants.

Story comprehension and interpretation scores

Participants answered 39 story questions: 27 general comprehension-based questions and 12 interpretation-specific questions (Yeshurun et al., 2017). We did not analyze one of the interpretation-specific questions that asked for Lee’s girlfriend’s name, since we informed participants that Lee had a girlfriend named Rose prior to the experiment. All participants answered this question correctly, regardless of their interpretation. To score the interpretation-specific questions, we assigned a +1 score to each answer corresponding to the cheating interpretation, and a −1 score to each answer corresponding to the paranoid interpretation. To calculate a final score, we totaled all interpretation-specific scores and divided by the total number of questions. Thus, a +1 indicates answering all questions consistent with a cheating interpretation, and a −1 indicates answering all questions consistent with a paranoid interpretation. To collapse across assigned groups and compare “correctness” of final interpretations (i.e., consistency with group assignment), we transformed the interpretation-specific scores by multiplying the scores for participants assigned to the paranoid group by −1. After this transformation, a +1 score indicates answering all questions consistently with one’s group assignment (correct), and a −1 score indicates answering all questions inconsistently with one’s group assignment (incorrect). When we use these transformed scores in the analyses below, we refer to them as “correct interpretation” scores.

Empathy ratings

After answering the story questions, participants rated how much they empathized with Arthur, Lee, Joanie, and “the girl” (the mysterious woman in Lee’s bed). We expected those who favored the cheating interpretation to empathize with Arthur, and not with Lee and Joanie. Likewise, we expected those who believed the paranoid interpretation to feel more empathy for Lee and Joanie, instead of Arthur. To compute one score that captured the empathy bias, we calculated the difference between each participant’s empathy ratings for Arthur and Lee. Analogously to how we computed “correct interpretation” scores, we also computed “correct empathy” scores by multiplying the empathy difference for all participants assigned to the paranoid group by −1. Thus, positive “correct empathy” scores indicate that the participant empathized with Arthur and Lee in a manner that was consistent with the assigned interpretation group.

Results

Comprehension scores indicated that all participants understood the story (Fig. 2a); there was no significant difference between assigned groups (t(18) = 1.41, p = 0.17). Our neurofeedback manipulation did not push participants to their assigned interpretation, as indicated by the interpretation scores. Fig. 2b shows the interpretation scores by assigned group. Neither group showed significant bias toward their assigned interpretation (all p > 0.20). Further, the groups did not differ in interpretation scores (one-tailed t(18) = −0.062, p = 0.48).

Fig. 2.

Average scores for (a) comprehension and (b) interpretation questions. (a): All participants understood the story. (b): Interpretations were not modified by neurofeedback – neither group was significantly pushed toward their assigned interpretation. Error bars = ±1 s.e.m.

The empathy ratings for some of the characters differed by group in the “correct” direction of the assigned interpretations. While participants in the cheating group had numerically more empathy for Arthur than participants in the paranoid group, this difference was not significant (one-tailed t(18) = 1.38, p = 0.093). Additionally, participants in the paranoid group had more empathy for Lee than the participants in the cheating group (one-tailed t(18) = −1.76, p = 0.048). The difference in empathy for Arthur and Lee was significantly different between the assigned groups, with those assigned to the cheating condition having more empathy for Arthur than Lee (one-tailed t(18) = 1.84, p = 0.041). Fig. 3 plots the empathy ratings by assigned group. Additionally, there was a significant positive correlation between the difference in empathy for Arthur and Lee and interpretation scores (Pearson r = 0.47, p = 0.037), implying that the more participants empathized with Arthur over Lee, the more likely they were to endorse the cheating interpretation in response to the interpretation questions.

Fig. 3.

Empathy ratings for (a) all characters and (b) the difference for Arthur and Lee, separated by assigned group. (a): Empathy ratings differed significantly in the predicted direction for Lee based on group assignment. (b): The difference in empathy for Arthur and Lee also differed significantly by assigned group. Error bars = ±1 s.e.m. * = p < 0.05.

Neurofeedback scores

If neurofeedback successfully modified neural responses to the story, we would expect participants to differ in the neurally-decoded cheating probability based on random group assignment. Fig. 4a plots cheating probability p(c) across all stations and runs. Against our expectations, participants in the cheating group did not have significantly larger p(c) during any individual runs – in fact, p(c) was numerically smaller for participants in the cheating group (compared to the paranoid group) in the fourth run. Similarly, participants did not show a significant improvement in neurofeedback scores from run 1 to run 4 (combining results from both groups, neurofeedback scores numerically decreased on average; neurofeedback scores numerically increased for the paranoid group and numerically decreased for the cheating group; Fig. 4d). Thus, we were unable to reliably push neurofeedback scores, both in terms of p(c) and normalized neurofeedback rewards, in the assigned group direction.

Fig. 4.

Neurofeedback results, in terms of p(c) (a, c) and normalized neurofeedback score (b, d). (a) Average cheating probability (not normalized), separated by assigned group. The dashed line represents mean p(c) from the pilot experiment. (b) Neurofeedback scores that were delivered to participants during real-time neurofeedback, divided by assigned group. Figures (c–d) show the run-wise averages for the same values shown in (a–b), respectively. Error bars = ±1 s.e.m.

Probe responses

Prior to each station, we probed participants as to which interpretation they were adopting for this station. Fig. 5 shows the behavioral probe responses that were chosen, divided by assigned interpretation groups. While the average choices numerically diverged between groups in the assigned or “correct” direction for each of the runs, none of these run-wise differences were significant after correction for multiple comparisons.

Fig. 5.

(a) Average probe responses during neurofeedback for individual stations, separated by assigned group. (b) Average probe responses during neurofeedback, collapsed across stations within each run and separated by assigned group. None of the run-wise differences were significant after correcting for multiple comparisons. Error bars = ±1 s.e.m.

Decoding accuracy

By collecting behavioral probe responses at each station, we were able to measure how accurately the classifier was decoding story interpretations in each participant. If 1) a participant was truly adopting distinct interpretations when they gave a probe response of “cheating” vs. “paranoid”, and 2) the classifier was sensitive to these interpretive differences, then we would expect to see higher values of p(c) (i.e., the classifier’s estimate of the strength of the cheating interpretation) for stations where participants (behaviorally) reported thinking of the cheating interpretation compared to the paranoid interpretation. To assess this, for each participant, we separated all classification probabilities by probe response. We averaged p(c) over all of the times the participant gave a “cheating” probe response and (separately) over all of the times the participant gave a “paranoid” probe response. We computed decoding accuracy for each participant by taking the difference of the average p(c) value for “cheating” probe responses and the average p(c) value for “paranoid” probe responses. Note that low decoding accuracy can have multiple causes – the participant might not be adopting distinct interpretations, or the classifier might be insensitive (see Discussion) – but high decoding accuracy indicates that, to some degree, the participant is adopting distinct interpretations and the classifier is detecting them. Note also that the classifier was trained on data from Yeshurun et al. (2017), where participants merely listened to the stories and did not give behavioral responses during the scan. As such, there is no plausible route for the classifier to be picking up on purely motoric features of the behavioral response; rather, the classifier must be picking up on more central (non-motoric) differences relating to participants’ interpretations. On average, decoding accuracy was numerically above zero (i.e., p(c) values were numerically larger for stations where participants endorsed the cheating interpretation compared to the paranoid interpretation), but this effect was not significant (one-tailed t(19) = 1.40, p = 0.088). Results are shown in Fig. 6a. The figure also shows that there was considerable variation across participants in the level of decoding accuracy.

Fig. 6.

Decoding accuracy results overall (a) and related to correct interpretation scores (b). (a) Cheating probability was numerically higher when participants endorsed the cheating interpretation compared to the paranoid interpretation, but this difference did not reach statistical significance. Note: the x-axis represents the probe responses at each station, not the assigned interpretation group. Lines connect each participant’s average probability for each response. (b) Correct interpretation plotted as a function of decoding accuracy. Decoding accuracy and correct interpretation scores were positively correlated, suggesting that participants with accurate neurofeedback were able to learn to adopt the assigned interpretation. The black line indicates the line of best fit. Error bars = ±1 s.e.m.

We reasoned that participants with higher decoding accuracy would be receiving higher-fidelity neurofeedback, and consequently they should show a larger effect of neurofeedback on their final interpretation of the story. Consistent with this prediction, there was a significant positive relationship between decoding accuracy and correct interpretation scores (Pearson r = 0.56, p = 0.010; Fig. 6b).

Results divided by decoding accuracy

Given that decoding accuracy varied considerably across participants, we assessed if the outcomes described above were different in participants with high vs. low decoding accuracy. For these analyses, we performed a median split for each assigned interpretation group based on decoding accuracy. Thus, 5 participants from the cheating group and 5 participants from the paranoid group were in the “best decoding” group and 5 from each group were in the “worst decoding” group. Next, we recalculated our results within each of these new groups. To combine participants who were assigned different interpretations, we again adjusted scores so that positive values relate to the assigned interpretation and negative values relate to the opposite interpretation.

Story comprehension and interpretation scores

Comprehension scores did not vary significantly between decoding accuracy groups (t(18) = −1.41, p = 0.18), shown in Fig. 7a. Interpretation scores, however, were significantly higher for the best compared to worst decoding accuracy group (one-tailed t(18) = 2.44, p = 0.013), which was expected based on the significant positive correlation between decoding accuracy and correct interpretation (Fig. 7b). For the 10 participants in the best decoding accuracy group, the average score was numerically positive (i.e., correct), but this score was not significantly greater than zero (one-tailed t(9) = 1.541, p = 0.079).

Fig. 7.

Average scores for (a) comprehension and (b) interpretation questions, divided by decoding accuracy. (a) Comprehension scores did not vary significantly between groups. (b) Correct interpretation was significantly larger for the participants with more accurate decoding. Key: B = best; W = worst. Error bars = ±1 s.e. m. * = p < 0.05.

Empathy ratings

The difference in empathy for Arthur and Lee did not vary significantly depending on decoding accuracy (one-tailed t(18) = 0.77, p = 0.23; Fig. 8). Still, participants with the best decoding accuracy had a numerically higher mean (indicating empathy scores that fall more in line with the “correct” interpretation) than those with the worst decoding accuracy.

Fig. 8.

The difference in empathy for Arthur and Lee, divided by decoding accuracy. Here, the empathy difference is adjusted for assigned group such that positive empathy differences indicate the correct direction. There was no significant difference between the best and worst classifier groups in terms of modifying empathy.

Neurofeedback scores

We plotted neurofeedback scores, both in terms of prediction probabilities and normalized neurofeedback scores (Fig. 9). Looking at individual stations, we only obtained one significant result after Bonferroni correction for multiple comparisons across stations: For the final station in the first run, neurofeedback scores were higher for participants that had the best decoding accuracy (one-tailed t(18) = 3.40, p = 0.0016 before correction and p < 0.05 after correction). When averaging over all stations within each run, the neurofeedback scores in the fourth run were numerically higher on average for participants that had the best decoding accuracy, but this was not significant after Bonferroni correction for four tests (one-tailed t(17) = 2.36, p = 0.015 before correction).

Fig. 9.

Neurofeedback results, in terms of classification output (a, c) and normalized neurofeedback score (b, d). (a) Average classifier probability for the assigned interpretation, for the participants with the best and worst decoding accuracy. (b) Neurofeedback reward, divided by the participants with the best and worst decoding accuracy. Figures (c–d) show the run-wise averages for the same values shown in (a–b), respectively. Error bars = ±1 s.e.m. Significance was corrected for multiple comparisons using Bonferroni correction. * = p < 0.05.

Probe responses

As shown in Fig. 10, choosing the correct interpretation during probes was closely related to decoding accuracy. For the second-to-last station, participants with the best decoding accuracy were significantly more likely than participants with the worst decoding accuracy to choose the correct assigned interpretation in both run 2 and run 4 (for run 2, one-tailed t(18) = 4.20, p < 0.01 after correction for multiple comparisons; for run 4, one-tailed t(15) = 4.97, p < 0.01 after correction for multiple comparisons). The participants with the best decoding accuracy were numerically more likely on average to choose the correct assigned interpretation for the same station in run 1, and for the preceding two stations in run 4, but these differences were not significant. Note that, across participants, the tendency to choose the assigned interpretation during these in-scan probes (averaged across all stations and runs) was positively correlated with participants’ tendency to choose the assigned interpretation on the post-scan questionnaire (Pearson r = 0.48, p = 0.033).

Fig. 10.

Probe choices split by decoding accuracy. Participants with the most accurate decoding were more likely to choose probes consistent with their assigned interpretation by the end of training. Error bars = ±1 s.e.m. Significance was corrected for multiple comparisons using Bonferroni correction. ** = p < 0.01.

Discussion

In summary, we examined the effect of neurofeedback during a naturalistic spoken story on several behavioral and neural measures of narrative interpretation. Behaviorally, we tracked participants’ overall story interpretation at the end of neurofeedback, their empathy for the different characters at the end of neurofeedback, and also their responses to probes (during neurofeedback) about what interpretation they planned to adopt at each station. Neurally, we used a classifier (trained on previously-collected data from Yeshurun et al., 2017) to track what interpretation participants adopted at each station; we also tracked how much reward participants received, which was tied to how strongly their brain activity matched the assigned interpretation. When considering the entire participant group, the effects of neurofeedback were fairly weak: Only one of the aforementioned measures (empathy for the characters) showed a statistically reliable effect of neurofeedback. The effects of neurofeedback were somewhat stronger when we grouped participants based on the station-to-station correspondence between their behavioral probe responses and classifier evidence (i.e., was the classifier more likely to favor the “cheating” interpretation at a station when participants behaviorally responded that they planned to adopt the cheating interpretation at that station). This decoding accuracy measure shows how well the classifier performs at decoding the interpretation that participants say they are adopting, regardless of whether that interpretation is the “correct” (i.e., assigned) interpretation. We found that participants who ranked high on this decoding accuracy measure were more likely to show the predicted effect of neurofeedback on behavioral interpretation scores. When we did a median split based on this measure (dividing participants into “best-decoding” and “worst-decoding” groups), the “best-decoding” participants showed a stronger effect of neurofeedback than the “worst-decoding” participants on behavioral probe responses for some stations in runs 2 and 4 (i.e., they were more likely to indicate that they were adopting the assigned interpretation for those stations). Also, for the final station in the first run, neurofeedback scores were higher for participants that had the best decoding accuracy. There are two (non-mutually-exclusive) explanations for why neurofeedback effects were generally larger in the “best-decoding” participants. One possibility is that our classification pipeline did a better job of decoding story interpretations in some participants than others. These participants may have received more accurate (and thus more useful) feedback, leading to larger neurofeedback effects. If some participants failed to show neurofeedback effects because of a poorly functioning classifier, then the most effective way to boost neurofeedback effects in future studies would be to improve our classification pipeline to make it work more consistently across participants (e.g., by trying to improve registration, ROI selection, classifier parameter choices, and the quality of the template data). Another possibility is that the classifier itself was working well in the “worst decoding” participants, but these participants had trouble shifting their interpretations (e.g., a participant might behaviorally signal that they intend to adopt a “cheating” interpretive lens for the upcoming station, but then fail to actually do this). That is, the problem may reflect a cognitive difference across participants (some participants can cleanly shift their interpretations, some cannot) rather than a problem with neural decoding. If this is the case, perhaps screening participants beforehand with various behavioral tasks (e.g., measuring cognitive flexibility) could help exclude participants who will ultimately not benefit from neurofeedback. Based on our current set of results, we can not adjudicate between these two interpretations of why the “worst-decoding” participants did not benefit from neurofeedback. For the neurofeedback effects that we did observe, we must critically assess the role of individualized neurofeedback in driving these effects. The premise of our approach is that measuring a particular individual’s interpretation neurally and using this signature as the basis for feedback is useful for changing their interpretation. Importantly, the presence of a difference between the two neurofeedback groups in our study does not necessarily mean that individualized neurofeedback was responsible for this difference. For example, say that everyone strongly adopts the cheating interpretation at a particular point in the story and the classifier registers this. In this scenario, participants in the cheating group will be given “correct” feedback, reinforcing the interpretation, and participants in the paranoid group will be given “incorrect” feedback, leading them to adopt the paranoid interpretation. The key point here is that we could achieve this same effect outside of the scanner (by using behavioral data to pick a point in time when the interpretation is unambiguous, and giving the two groups different feedback at that point). For our study, this scenario is unlikely, because we normalized by the mean interpretation (across all pilot-study participants) when giving feedback – as such, participants adopting the mean interpretation at a particular time point will receive the same (neutral) feedback, regardless of group assignment. Having said this, more work is needed to establish that individualized feedback is still important. The gold standard for establishing a role for individualized neurofeedback is to include a yoked control condition where participants receive feedback based on brain data from another participant in the same condition (deBettencourt et al., 2015; see Sorger et al., 2019, for an overview of different control types for neurofeedback studies). If providing neurofeedback based on another participant “breaks” the neurofeedback effect, this is strong evidence that individualized neurofeedback is important. In our study, this kind of yoked control could be accomplished by replacing the neurofeedback at each station with the neurofeedback that would have been given to another participant from the same overall condition (cheating or paranoid) who had also behaviorally chosen the same interpretation (cheating or paranoid) at that station. Only upon running this control could we then conclude whether or not this rt-fMRI paradigm is able to alter thoughts through individualized neurofeedback. Returning to the clinical applications discussed at the beginning of the paper: Our interest in this paradigm was driven by the idea that, eventually, we could use this approach to train depressed individuals to interpret realistic scenarios less negatively. What have we learned about the feasibility of this program of research from this study? Overall, we think there are reasons both for optimism and concern. On the optimistic side, we were able to see effects of neurofeedback on both behavioral and neural measures of story interpretation. On the pessimistic side, the effects were weak: The only dependent measure that showed a neurofeedback effect when looking at the entire participant pool was the “empathy” behavioral measure; all of the other effects only showed up when we filtered the participants based on how well the classifier output matched their behavioral probe responses. With the benefit of hindsight, we now think that the “mystery interpretation” approach may not have been the best way to proceed. In the present study, we told participants to explore interpretations and not to lock in on one. We thought that this would help participants pay attention to neurofeedback, but instead we may have encouraged participants to focus too much on hypothesis-testing strategies (e.g., adopting an interpretation contrary to the preferred one, to see if this results in a reduction in reward) – this approach may reduce engagement with the story overall, thereby weakening our effects. Going forward, we think that it might be more effective to simply reveal the assigned interpretation outright, and then use neurofeedback to maximize the degree to which participants adopt and sustain that interpretation over time. This approach would fit better with our long-term clinical goal of using this method to treat depression – there, you would want to inform participants that things will turn out well in the story and then help the depressed participants adopt and sustain that interpretation. Note that, if you reveal the desired interpretation, this makes it less informative to use behavioral interpretation questions as a dependent measure (since participants will know the “right answer”) but neural measures (e.g., the output of a classifier tracking the presence of the correct interpretation) could be used instead. We could also look at transfer to other measures of depression. This work constitutes the first attempt to use rt-fMRI to alter ongoing thoughts related to naturalistic narrative stimuli. In addition to having greater ecological validity than traditional experimental stimuli, naturalistic stimuli have the advantage of producing robust neural responses (Sonkusare et al., 2019; Nastase et al., 2020); they have been shown to be useful for studying individual differences (Vanderwal et al., 2017; Finn et al., 2020; Feilong et al., 2018) and for exposing neural correlates of clinical variables (Rikandi et al., 2017; Finn et al., 2018; Eickhoff et al., 2020; Salmi et al., 2020). Our use of naturalistic stimuli led us to use a design that differed in several ways from other fMRI neurofeedback studies (including studies using the popular Decoded Neurofeedback method; Watanabe et al., 2017; Taschereau-Dumouchel et al., 2022) – in particular, our study did not include a separate “induction” period where participants were trained to upregulate or downregulate a particular cognitive state. This is a consequence of the fact that naturalistic stimuli give rise to trajectories of meaning states (Baldassano et al., 2017) – as such, there was no guarantee that there would be a single neural state that we could induce or train to foster one interpretation of the narrative over the other. Our practice of providing feedback during the story (as opposed to doing this in a separate phase) and using station-specific classifiers (vs. a single classifier for all stations) was meant to accommodate the dynamically-shifting nature of how narratives are processed in the brain – instead of fostering a particular pattern, our procedure sought to encourage participants to adopt the “right pattern at the right time”. Overall, our study highlights both the challenges and future promise of using neurofeedback in a naturalistic context to reshape how we interpret ambiguous situations.

Table 1

All ROIs considered in our bootstrap analysis, with a description of how we created each mask. The index on the left corresponds to the ROI key shown in Fig. 12.

Index	Name	Method	N voxels

0	Default Mode Network (DMN)	Yeo et al. (2011):- resampled to functional space	3757 voxels
1	Theory of Mind (TOM)	Neurosynth (Yarkoni et al., 2011):- used the association test only- thresholded at FDR-corrected p ≤ 0.01 significance- resampled to functional space- masked by average brain mask across all participants	2414 voxels
2	TOM clusters	Yeshurun et al. (2017), Supplementary section (Table 1):- extracted significant clusters: left TPJ, precuneus, posterior cingulate, vmPFC, dmPFC, rIFG, lIFG, lSTS, and left temporal pole- converted to MNI space- created 10-mm cube around each cluster- resampled to functional space- masked by average brain mask across all participants	240 voxels

53 in total

1. Discovering Event Structure in Continuous Narrative Perception and Memory.

Authors: Christopher Baldassano; Janice Chen; Asieh Zadbood; Jonathan W Pillow; Uri Hasson; Kenneth A Norman
Journal: Neuron Date: 2017-08-02 Impact factor: 17.173

Review 2. Naturalistic Stimuli in Neuroscience: Critically Acclaimed.

Authors: Saurabh Sonkusare; Michael Breakspear; Christine Guo
Journal: Trends Cogn Sci Date: 2019-06-27 Impact factor: 20.229

3. Individual differences in functional connectivity during naturalistic viewing conditions.

Authors: Tamara Vanderwal; Jeffrey Eilbott; Emily S Finn; R Cameron Craddock; Adam Turnbull; F Xavier Castellanos
Journal: Neuroimage Date: 2017-06-16 Impact factor: 6.556

Review 4. Closed-loop brain training: the science of neurofeedback.

Authors: Ranganatha Sitaram; Tomas Ros; Luke Stoeckel; Sven Haller; Frank Scharnowski; Jarrod Lewis-Peacock; Nikolaus Weiskopf; Maria Laura Blefari; Mohit Rana; Ethan Oblak; Niels Birbaumer; James Sulzer
Journal: Nat Rev Neurosci Date: 2016-12-22 Impact factor: 34.870

Real-time neurofeedback to alter interpretations of a naturalistic narrative.

Introduction

Methods

Participants

Stimuli

Procedure

Data acquisition

Offline image registration

Real-time image registration and processing

Classification

Story comprehension and interpretation scores

Empathy ratings

Results

Neurofeedback scores

Probe responses

Decoding accuracy

Results divided by decoding accuracy

Story comprehension and interpretation scores

Empathy ratings

Neurofeedback scores

Probe responses

Discussion

1. Discovering Event Structure in Continuous Narrative Perception and Memory.

Review 2. Naturalistic Stimuli in Neuroscience: Critically Acclaimed.

3. Individual differences in functional connectivity during naturalistic viewing conditions.

Review 4. Closed-loop brain training: the science of neurofeedback.

Review 5. Towards clinical applications of movie fMRI.

6. Perceptual learning incepted by decoded fMRI neurofeedback without stimulus presentation.

7. ADHD desynchronizes brain activity during watching a distracted multi-talker conversation.

8. Accurate and robust brain image alignment using boundary-based registration.

9. Closed-loop training of attention with real-time brain imaging.

10. Machine learning for neuroimaging with scikit-learn.

1. RT-Cloud: A cloud-based software framework to simplify and standardize real-time fMRI.