
Mental operations in rhythm: Motor-to-sensory transformation mediates imagined singing.

Yanzhu Li1,2, Huan Luo3, Xing Tian1,2.   

Abstract

What enables the mental activities of thinking verbally or humming in our mind? We hypothesized that the interaction between motor and sensory systems induces speech and melodic mental representations, and that this motor-to-sensory transformation forms the neural basis of verbal thinking and covert singing. Analogous to neural entrainment to auditory stimuli, participants imagined singing the lyrics of well-known songs rhythmically while their neural electromagnetic signals were recorded using magnetoencephalography (MEG). We found that when participants imagined singing the same song in similar durations across trials, the delta frequency band (1–3 Hz, similar to the rhythm of the songs) showed more consistent phase coherence across trials. This neural phase tracking of imagined singing was observed in a frontal-parietal-temporal network: the proposed motor-to-sensory transformation pathway, including the inferior frontal gyrus (IFG), insula (INS), premotor area, intra-parietal sulcus (IPS), temporal-parietal junction (TPJ), primary auditory cortex (Heschl's gyrus [HG]), and superior temporal gyrus (STG) and sulcus (STS). These results suggest that neural responses can entrain to the rhythm of mental activity. Moreover, theta-band (4–8 Hz) phase coherence was localized in the auditory cortices. Mu-band (9–12 Hz) and beta-band (17–20 Hz) coherence was observed in right-lateralized sensorimotor systems, consistent with the singing context. Gamma-band coherence was broadly manifested in the observed network. The coherent and frequency-specific activations in the motor-to-sensory transformation network mediate the internal construction of perceptual representations and form the foundation of neural computations for mental operations.


Year:  2020        PMID: 33017389      PMCID: PMC7561264          DOI: 10.1371/journal.pbio.3000504

Source DB:  PubMed          Journal:  PLoS Biol        ISSN: 1544-9173            Impact factor:   8.029


Introduction

“What is this paper about?” You are probably asking this question in your mind. We think in a verbal form all the time in our everyday lives. Verbal thinking is one of the most common mental operations and manifests as inner speech—a type of mental imagery induced by covert speaking [1–3]. Another related common mental phenomenon is the “earworm”—a piece of music that repeats in someone’s mind, or the involuntary act of humming a melody. What enables these mental operations that take the form of speech and singing? Neural evidence suggests that modality-specific cortical processes mediate covert operations of mental functions. For example, previous studies have demonstrated that mental imagery is mediated by neural activity in modality-specific cortices, such as the motor system for motor imagery [4, 5] and sensory systems for visual imagery [6, 7] and auditory imagery [8, 9]. Recently, the internal forward model has been proposed to link the motor and sensory systems [10]. The presupposition of the model is a motor-to-sensory transformation—a copy of the motor command, termed “efference copy,” is internally sent to sensory regions to estimate the perceptual consequences of actions [11–13]. This motor-to-sensory transformation has been demonstrated in speech production, learning, and control [14–18] and has been extended to speech imagery [3, 19–26]. The motor-to-sensory transformation has been suggested to reside in a frontal-parietal-temporal network. Specifically, it has been assumed that the motor system in the frontal lobe simulates the motor action, while the sensory systems in the parietal and temporal lobes estimate the possible perceptual changes caused by the action [3, 22, 25]. Would the continuous simulation and estimation in the motor-to-sensory transformation network mediate the mental operations of covert singing and inner speech during verbal thinking [27]? Thinking verbally and singing covertly are similar to overt speech in that they all unfold over time. 
The analysis of time-series information in speech perception has been investigated using a neural entrainment paradigm. It has been demonstrated that neural responses can be temporally aligned to the frequency of acoustic features, such as speech envelopes [28, 29]. Neural responses can also entrain to perceptual and cognitive constructs, such as syllabic information [30], music beats [31, 32], syntactic structures [33], and language formality structures [34]. That is, the frequency of neural responses can mirror the rate of internal representations derived from external stimulation. Would neural responses track representations that are constructed without external stimulation, such as covert singing and inner speech during verbal thinking? This study uses a rhythmic entrainment paradigm to investigate the neural mechanisms that mediate mental operations such as inner speech and covert singing. We use “entrainment” in a broader sense, as suggested by Obleser and Kayser [35]. We implemented a natural and rhythmic setting in which participants imagined singing the lyrics of well-known songs by imitating the same songs heard before the imagery tasks (Fig 1A). A color change of the fixation cued the onset of imagery; participants indicated the offset of imagery with a button press. Unlike the frequency tracking of passive listening to an external stimulus, which has a consistent rate across trials, the production rate in the active imagery task inevitably varies across trials. We used 2 approaches to deal with this temporal variation. First, the musical context was intended to reduce the large temporal variability during imagery—participants would imagine singing at a more consistent rate than when saying the same lyrics. Second, we took advantage of the remaining temporal variation among imagery trials (Fig 1B, 1C and 1D). 
The variation in the duration of performing the imagery task relates to the temporal consistency of neural responses across trials. If neural responses tracked the rate of mental operations, the phase coherence of neural responses would differ between 2 groups of imagery trials with different duration variances (Fig 1E). According to our hypothesis that the motor-to-sensory transformation network mediates inner speech and covert singing, we predicted that the different degrees of neural tracking of the rate of mental operations would be observed in specific areas in the frontal, parietal, and temporal regions (Fig 1F), where the core computations for motor simulations and perceptual estimations in motor-to-sensory transformation have been indicated [3, 22, 25].
Fig 1

Rhythmic entrainment of imagined singing and hypothesis of neural phase tracking.

(A) Experimental paradigm. According to the color of the visual fixation, participants listened to the first sentence of 3 well-known songs, followed by imagined singing of the song they just heard (the Alphabet Song is used for illustration). Participants pressed a button to indicate the end of their imagery. (B) RT of imagined singing for the 3 songs. The red dashed lines indicate the durations of the 3 songs. The duration of imagined singing was longer than that of the preceding auditory stimuli. Each blue dot indicates an individual RT. (C) Distribution of imagined singing RT. The z-scores of RT followed a normal distribution, with about half of the trials within 2 standard deviations. (D) Grouping of imagined singing trials. Twenty-four trials of each song were sorted in ascending order based on their z-scores and separated into 2 groups. The 12 trials that were closest to the mean RT were selected for the center group, whereas the other 12 trials, further away from the mean RT, were included in the dispersed group. (E) Hypothesis about neural phase coherence across trials of imagined singing. Schematic display of 2 trials in each group. The short bars indicate the beginning and end of a trial. The wave lines represent neural oscillations. The trials in the dispersed group have different durations, so the temporal variance is large; the phase of the neural oscillation corresponding to the construction of syllabic representations during imagined singing does not align across trials. On the other hand, the temporal variance across trials in the center group is small, so the phase of the neural oscillation is more coherent across trials. (F) Hypothesis about phase coherence in the motor-to-sensory transformation network during imagined singing. 
The motor-to-sensory transformation was assumed to manifest in a frontal-parietal-temporal network, including the IFG, INS, premotor area, and SMA in the frontal lobe for simulating articulation; somatosensory areas (SI, SII), SMG, and its adjacent PO, AG, and TPJ in the parietal lobe for estimating somatosensory consequence; as well as the STG and STS with a possibility of extension to the HG in the temporal lobe for estimating auditory consequence. The more consistent phase coherence at the delta band (1–3 Hz)—the rate of imagined singing—was predicted to be observed in the motor-to-sensory transformation network. AG, angular gyrus; HG, Heschl’s gyrus; IFG, inferior frontal gyrus; INS, insula; PO, parietal operculum; RT, reaction time; SI, primary somatosensory area; SII, secondary somatosensory area; SMA, supplementary motor area; SMG, supramarginal gyrus; STG, superior temporal gyrus; STS, superior temporal sulcus; TPJ, temporal-parietal junction.


Results

The reaction time (RT) in the imagery condition suggested that the duration of imagined singing was longer than the duration of the auditory stimuli (Fig 1B) (one-sample t test; for song 1, t(15) = 6.04, p < 0.001; for song 2, t(15) = 2.64, p = 0.02; for song 3, t(15) = 2.82, p = 0.01). A repeated-measures one-way ANOVA did not reveal differences in the increase of duration among imagined singing of the 3 songs (F(2) = 2.17, p = 0.126). This suggested that the reduced speed in the imagery condition, presumably caused by the imagery task and the motor response of button pressing, was consistent across imagery of all songs. The distribution of RT (Fig 1C) followed a normal distribution (chi-squared goodness-of-fit test, χ²(4) = 2.30, p = 0.68) and revealed that about half of the trials fell within ±1 standard deviation of the mean. Two groups of trials, based on the deviation from the mean RT, were formed for further analysis of the imagined singing magnetoencephalography (MEG) responses. For imagery of each song, the 12 trials that were closest to the mean RT were included in the center group, whereas the 12 trials that were furthest from the mean RT were included in the dispersed group (Fig 1D). We first examined the MEG responses in the temporal domain. The event-related responses that were time-locked to the onset of the auditory stimuli revealed a clear peak and topography around 100 ms after stimulus onset—the typical M100 auditory response (Fig 2A). Moreover, in the imagery condition, a topographic pattern similar to the M100 auditory response was observed at a similar latency (Fig 2B), even though no external auditory stimulus was presented. Both M100 responses demonstrated a canonical auditory response topography, with a dipole pattern over the temporal lobe region in each hemisphere [e.g., 36]. 
This response topography contrasted with earlier visual responses in MEG that showed more posterior dipole patterns [e.g., 37, 38]. Therefore, the M100 responses in both the listening and imagery conditions were less likely to be induced by the fixation color changes. This similar event-related auditory response in the listening and imagery conditions was consistent with our previous findings [22] and suggested that auditory cortices were activated during the imagery condition.
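The center/dispersed grouping used above can be illustrated with a minimal sketch: z-score each trial's RT, sort trials by absolute deviation from the mean, and split them in half. The code and the RT values below are illustrative assumptions, not the study's actual data or analysis pipeline.

```python
import statistics

def split_by_rt(rts, n_center):
    """Sort trials by |z-score| of RT and split them into a 'center' group
    (closest to the mean RT) and a 'dispersed' group (furthest from it)."""
    mean_rt = statistics.mean(rts)
    sd_rt = statistics.stdev(rts)
    # Indices ordered by absolute deviation from the mean (in SD units)
    order = sorted(range(len(rts)),
                   key=lambda i: abs((rts[i] - mean_rt) / sd_rt))
    return order[:n_center], order[n_center:]

# Hypothetical RTs (in seconds) for 24 imagined-singing trials of one song
rts = [10.2, 10.5, 9.8, 10.1, 11.4, 9.1, 10.3, 10.0, 12.0, 8.7,
       10.4, 10.6, 9.9, 10.2, 11.8, 8.9, 10.1, 10.7, 9.6, 10.3,
       13.1, 7.9, 10.0, 10.5]
center, dispersed = split_by_rt(rts, n_center=12)
print(len(center), len(dispersed))  # 12 12
```

Every trial in the center group is, by construction, at least as close to the mean RT as any trial in the dispersed group, which is what makes the subsequent within-group phase-coherence comparison meaningful.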
Fig 2

MEG results of temporal responses and neural tracking in the delta band (1–3 Hz).

(A) Waveforms and topographies in the listening condition (the Alphabet Song). The vertical dotted line at time 0 indicates the onset of the auditory stimuli. Each black line represents the waveform responses from a sensor. The red bold line represents the RMS waveform across all sensors. Topographies are plotted every 333 ms from −1,000 ms to 4,000 ms. A clear auditory onset event-related response (M100, the single topography in the upper row) was observed. (B) Waveforms and topographies in the imagery condition, depicted as in A. The vertical dotted line at time 0 indicates the onset of imagined singing. No repetitive patterns in topographies were observed across the period. An event-related response in the range of the M100 latency, similar to the listening condition, was observed (the single topography in the upper row). (C) Phase-coherence results in the listening condition. Neural entrainment in the delta band was observed in the HG and its adjacent aSTG, aSTS, and pINS. (D) Phase-coherence results in the imagery condition. More consistent neural entrainment in the delta band was observed in the proposed motor-to-sensory network, including frontal areas (IFG, aINS, and premotor area), parietal areas (IPS and TPJ), and temporal areas (HG, aSTG, and m&pSTS). These observed cortical areas were consistent with the predicted frontal-parietal-temporal regions in Fig 1F. (E) Results of direct comparison between the listening and imagery conditions (Listening > Imagery). Greater phase coherence in listening than imagery was observed in bilateral HG and adjacent pINS. (F) Results of direct comparison between the listening and imagery conditions (Imagery > Listening). Greater phase coherence in imagery than listening was observed in premotor areas and the IPS. (G) Results of conjunction analysis between the listening and imagery conditions. Overlapping significant phase coherence was observed in the bilateral primary auditory cortex, left pINS, and aSTG. 
The underlying data for this figure can be found at https://osf.io/mc8wd/. aINS, anterior insula; aSTG, anterior superior temporal gyrus; aSTS, anterior superior temporal sulcus; HG, Heschl’s gyrus; IFG, inferior frontal gyrus; IPS, intra-parietal sulcus; MEG, magnetoencephalography; m&pSTS, middle and posterior superior temporal sulcus; pINS, posterior insula; RMS, root-mean-square; TPJ, temporal-parietal junction.

No repetitive topographic patterns were observed in the time course of listening or imagery (Fig 2A and 2B), suggesting that tracking of the acoustic stream or the rate of imagery was not reflected in the response magnitude. These results are consistent with previous studies that showed the absence of power coherence in speech tracking [29]. The lack of effects in response magnitude was probably due to repetition suppression [39]: repetitions of similar processes decrease the response magnitude. Moreover, the temporal variance in imagery could further reduce the magnitude of evoked responses. Therefore, we further investigated the timing information and neural tracking of the induced responses in the spectral domain. The MEG responses in the listening condition showed neural tracking of the rhythm in the songs. The phase-coherence analysis revealed that the significant differences (p < 0.05) between the within-group and between-group inter-trial phase coherence (ITC) were localized mostly in the primary auditory cortex (Heschl’s gyrus [HG]) and its adjacent areas, including the left anterior superior temporal gyrus (aSTG) and sulcus (aSTS), and bilateral posterior insula (pINS) (Fig 2C). (See S1 Fig and S1 Table for precise anatomical locations.) These results suggest that auditory systems can reliably follow the rhythm of acoustic signals and demonstrate the validity and accuracy of source localization based on phase coherence. 
For the imagery condition, the comparison between the ITC in the center group and dispersed group revealed 3 significant clusters (p < 0.05) in the frontal, parietal, and temporal regions (Fig 2D). Specifically, in the frontal region, more consistent phase coherence was observed in the inferior frontal gyrus (IFG), insular cortex (INS), and premotor cortex (PreM). In the parietal region, the differences were in the bilateral intra-parietal sulcus (IPS) and temporal-parietal junction (TPJ). In the temporal region, more consistent phase coherence was localized in the bilateral HG, left aSTG, and right middle and posterior superior temporal sulcus (m&pSTS). These phase-coherence results during imagined singing were observed in the proposed core computational regions of motor-to-sensory transformation (Fig 1F). These results suggest that neural responses track the dynamics of mental operations in the motor-based prediction pathway [3, 22, 23, 25]. The reliability of the neural entrainment in listening and imagery was further tested using an extended period of 6-s data epochs (S2 Fig). The results were consistent with those obtained using 4-s epochs (Fig 2C). The entrainment was localized at the bilateral primary auditory cortex, aSTG, and INS in the listening condition, whereas localization of phase-coherence differences in the imagery condition was observed in the proposed frontal-parietal-temporal network, similar to the results in Fig 2D. Moreover, the analysis of four 3-s consecutive time bins revealed that the entrainment to imagery required time to be established, as the phase-coherence differences became more stable in the later time bins (S3 Fig). The direct comparison between listening and imagery revealed stronger phase coherence in the HG and pINS in the listening condition compared with the imagery condition (Fig 2E) but greater phase coherence in the premotor areas and IPS in the imagery condition compared with the listening condition (Fig 2F). 
These results were consistent with common observations that perception elicits more robust neural activation than imagery and that additional frontal and parietal areas are engaged in speech imagery [e.g., 22]. More importantly, in the conjunction analysis, the significant phase coherence in the listening and imagery conditions overlapped over the primary auditory cortex and extended to the posterior part of the INS and the anterior part of the STG (Fig 2G). These overlaps suggest that the motor-to-sensory transformation during imagery can induce auditory representations similar to those in perception [21]. To further explore the functional specificity of dynamic processing in the motor-to-sensory pathway, we performed the phase-coherence analysis in a broad range of frequency bands (Fig 3). We found that in the 4–8 Hz (theta) band, only the primary auditory cortex and the anterior parts of the STG and STS in the left hemisphere showed significant phase coherence. This focused theta-band activation in the auditory cortices—in contrast with the collaborative frontal-parietal-temporal network activation of the entire motor-to-sensory pathway in the delta band (Fig 2D)—was consistent with the specific role of the theta band in auditory and speech processing [e.g., 40, 41].
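The inter-trial phase coherence underlying these comparisons can be sketched minimally: take each trial's phase at the frequency of interest via a single-bin DFT, then measure the length of the mean unit phase vector across trials. This toy example uses synthetic 2 Hz (delta-range) signals, not the study's MEG data or its actual pipeline; phase-locked trials give ITC near 1, while trials with random phase offsets give much lower ITC.

```python
import cmath
import math
import random

def trial_phase(signal, freq, fs):
    """Phase of one trial at `freq` (Hz), via a single-bin DFT."""
    coef = sum(x * cmath.exp(-2j * math.pi * freq * t / fs)
               for t, x in enumerate(signal))
    return cmath.phase(coef)

def itc(trials, freq, fs):
    """Inter-trial phase coherence: |mean of unit phase vectors|.
    1 = identical phase across trials; near 0 = random phases."""
    vecs = [cmath.exp(1j * trial_phase(tr, freq, fs)) for tr in trials]
    return abs(sum(vecs) / len(vecs))

fs, f, n = 100, 2.0, 400  # 4 s of a 2 Hz signal sampled at 100 Hz
rng = random.Random(0)
# Phase-locked trials: identical 2 Hz oscillations
locked = [[math.sin(2 * math.pi * f * t / fs) for t in range(n)]
          for _ in range(12)]
# Jittered trials: one random phase offset per trial
jittered = []
for _ in range(12):
    phi = rng.uniform(0, 2 * math.pi)
    jittered.append([math.sin(2 * math.pi * f * t / fs + phi)
                     for t in range(n)])

print(round(itc(locked, f, fs), 2))  # 1.0: fully phase-locked
print(itc(jittered, f, fs) < 0.9)    # True: phase jitter lowers ITC
```

This mirrors the logic of Fig 1E: trials with consistent timing (the center group) keep their oscillatory phase aligned and yield high ITC, whereas timing variance (the dispersed group) scatters the phases and drives ITC down.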
Fig 3

Localization results of phase coherence in the imagery condition in the theta (4–8 Hz), mu (9–12 Hz), low-beta (13–16 Hz), mid-beta (17–20 Hz), high-beta (21–28 Hz), and low-gamma (30–48 Hz) bands.

(A) The phase coherence in the theta band was observed in the left primary auditory cortex, STG, and STS. (B) The phase coherence in the mu band was observed over the sensorimotor cortices in the right hemisphere, as well as in the inferior parts of the subcentral gyrus, and the opercular part of the IFG in the left hemisphere. (C) Only a small patch of subcentral gyrus, IFG, and mSTG was observed in the low-beta band. (D) The phase coherence in the mid-beta band was observed in similar and broader right sensorimotor cortices as in the mu band in B. In the left hemisphere, the PreM was observed in addition to the observed areas in the mu band. (E) In the high-beta band, the phase coherence was observed in pSTG and pSTS in the right hemisphere and a small patch of aSTG in the left hemisphere. (F) Similar and broader networks, as observed in the neural tracking of the delta band in Fig 2D, were observed in the low-gamma band. The underlying data for this figure can be found at https://osf.io/mc8wd/. aSTG, anterior superior temporal gyrus; IFG, inferior frontal gyrus; mSTG, middle superior temporal gyrus; PreM, premotor cortex; pSTG, posterior superior temporal gyrus; pSTS, posterior superior temporal sulcus.

A broad sensorimotor network in the right hemisphere was observed at 9–12 Hz (the alpha band, or the “mu” rhythm) and 17–20 Hz (the mid-beta band). These frequency bands also showed higher phase coherence in the inferior parts of the subcentral gyrus and the opercular part of the IFG in the left hemisphere; the 17–20 Hz band also extended to the PreM. The specific involvement of sensory and motor systems in the mu and beta bands was consistent with the alpha-beta synchronization in motor control and motor imagery [e.g., 42, 43], suggesting specific dynamics for the motor simulation and somatosensory estimation in the motor-to-sensory transformation [3, 22, 25]. The right-hemisphere dominance was consistent with imagery in a singing context [e.g., 44–46]. 
In the low-gamma band, greater synchronization was observed in the center group than in the dispersed group over the entire motor and sensory systems that showed tracking of imagery in the delta band (Fig 2D). These results are consistent with the view that the gamma band reflects local computations [47, 48]. The distinctive neural involvement in tracking the rate of imagery in the delta band—as well as in the theta, mu, beta, and gamma bands—collectively indicates the extent of the motor-to-sensory transformation network and the specific functional dynamics of each frequency component in this network.

Discussion

We investigated the function and dynamics of the neural networks that mediate the mental operations of inner speech and covert singing. With a rhythmic entrainment imagined-singing paradigm, we found that frontal-parietal-temporal regions in the proposed motor-to-sensory network collaboratively synchronize at the rate of mental operations. Moreover, a double dissociation of operating frequencies and anatomical locations was found in the motor, somatosensory, and auditory cortices that distinctively relate to inner speech and covert singing. These results suggest that neural responses can entrain to the rhythm of mental activity and that synchronized neural activity in the motor-to-sensory transformation network mediates mental operations. Neural dynamics are crucial for understanding the computations that mediate cognitive functions [49, 50]. However, probing dynamics is difficult, especially dynamics that mediate mental operations without external stimulation. The active nature of mental tasks inevitably causes temporal variations that undermine the foundations of methods for probing dynamics. For example, the phase-coherence analysis in the neural entrainment approach is well established for speech perception. However, the same analysis cannot be directly applied to investigate the neural tracking of speech imagery—trials within a group would have too much temporal variance, and therefore the phase coherence would be too small to detect. In this study, we overcame this obstacle by taking advantage of the temporal variance—separating trials based on behavioral timing (Fig 1D). This methodological advance, along with the use of naturalistic sounds and source localization, makes it possible to investigate the dynamics of the neural networks that mediate mental operations. 
Using the functional constraints of phase coherence, the observed frontal-parietal-temporal network during imagined singing (Fig 2D) was consistent with the proposed motor-to-sensory transformation network. The IFG, premotor area, and INS observed in the frontal region during imagined singing have been demonstrated in articulatory preparation during overt speech [30, 51, 52] and covert speech [53, 54]. The responses in these frontal cortices were also consistent with findings in speech imagery, reflecting the function of motor simulation [25]. In the parietal region, the observation of the IPS and TPJ—an area close to the supramarginal gyrus (SMG), parietal operculum (PO), and angular gyrus (AG)—has also been implicated in sensorimotor integration and goal-directed, prediction-based speech feedback control [55–58]. The activation of similar parietal areas (the TPJ, SMG, PO, and adjacent IPS) was also observed during speech imagery, suggesting possible functions for estimating the somatosensory consequences of actions [22, 25]. The observations in the temporal region further support the motor-to-sensory transformation in covert singing and in inner speech during verbal thinking. The similar evoked M100 responses in both the listening and imagery conditions (Fig 2A and 2B) suggested that auditory representations can be established in both bottom-up and top-down manners. These auditory representations in imagery were probably induced by processes different from those mediating omission responses [e.g., 59], because this imagery study required the regular performance of an active task. Moreover, overlapping phase-coherence results were observed in the temporal regions of the primary and secondary auditory cortices between the listening and imagery conditions (Fig 2G). The STG has commonly been observed during musical imagery [60], and activation during auditory imagery has been shown to extend to the HG [8]. 
The observation of the HG during imagined singing in this study was consistent with the hypothesis that high task demand drives auditory estimation down to the primary sensory area [21]. The right STS was only observed in imagined singing but not in listening conditions, which is consistent with previous findings [25], suggesting a possible specific functional role of STS in auditory imagery. The additional frontal (premotor) and parietal (TPJ, IPS) activations in imagined singing suggested that auditory representations, similar to the representations established in perception, can be constructed via the motor-to-sensory transformation pathway [3, 22, 25]. The observations of neural structures that operate at separate frequency bands are consistent with our hypothesis regarding the algorithms and neural implementations in the motor-to-sensory transformation. The phase coherence and its distribution during imagery are distinct from previous observations in speech perception [e.g. 29, 40, 41, 61, 62–64], suggesting unique motor-to-sensory transformation during mental operations. Specifically, first—using songs with a rhythm in the delta range—a complete frontal-parietal-temporal network was revealed in the corresponding low-frequency band during the imagined singing tasks (Fig 2D). These results provide functional evidence suggesting the extent of the motor-to-sensory transformation pathway, consistent with findings in neuroimaging studies [65-67]. More importantly, the neural phase tracking in the delta band suggested that the fluctuation of excitability in the network temporally aligned with the unfolding of mental operations. Second, phase coherence at the theta band was localized at temporal auditory regions (Fig 3A). These results are consistent with the view that theta oscillations actively mediate the auditory and speech processes [40, 41]. 
In the context of motor-to-sensory transformation, the precise localization of the theta band suggested that theta oscillations could mediate specific computations for constructing speech-like representations. Third, the mu and beta bands are established neural rhythmic signatures of both motor control and motor imagery [e.g., 42, 43]. In this study, we observed a broad sensorimotor network operating in the mu and beta rhythms during the imagined singing tasks (Fig 3B and 3D). These results further suggested that computations could be executed at these rhythms to internally simulate action during imagery [3, 12, 20, 22, 25]. Such motor simulations presumably generate an efference copy that is used to emulate the somatosensory consequences of actions for estimating the status of motor effectors [3, 22, 25].

The hemispheric lateralization suggests possible commonalities and distinctions between inner speech and covert singing. The bilateral observations in the delta band supported neural entrainment to the common rhythmicity in both types of mental operations. The lateralized theta band in the left auditory cortices was consistent with speech-weighted processing [68], whereas the right-hemispheric dominance of the mu and beta bands over the sensorimotor cortices suggested the construction of melodic and tonal information in the singing context [e.g., 44, 45, 46]. The switch from right to left hemisphere is consistent with observations that the content of auditory information determines hemispheric lateralization: the right-lateralized processing of hummed tones switches to left lateralization when the tones are superimposed on syllabic content [69]. Fourth, the gamma band was observed over the entire motor and sensory systems. This result indicated the implementation of local computations in modality-specific regions [47, 48]. Together, the motor-to-sensory transformation network possesses distinct dynamics for mediating mental operations.
The frontal-parietal-temporal network entrains the rhythm of mental operations. This low-frequency entrainment is suggested to facilitate communication and coordination between cortical areas for specific tasks via traveling waves [70-72]. Moreover, via cross-frequency coupling [47], specific frequencies (theta for auditory, mu and beta for sensorimotor) may modulate the gamma band to achieve the computations of motor simulation as well as somatosensory and auditory estimation in the motor-to-sensory transformation in the imagery tasks [48, 68]. Future studies are necessary to investigate these hypotheses regarding connectivity across frequencies and cortical regions so that the functional contributions of the proposed model can be thoroughly examined.

This study also provides hints about the possible functions of neural oscillations in perception. Many studies have demonstrated that neural oscillations can entrain to speech signals [28, 29]. However, it is still debated whether the entrainment is driven by stimulus features or modulated by top-down factors acting on intrinsic oscillations [35, 73]. Previous neural entrainment experiments have used external stimuli and investigated how neural oscillations track physical features. The perceptual constructs are thus derived from the external stimulus features. It is hard, if not impossible, to separate perception from stimuli, and therefore the debate cannot be resolved. Our results of imagined singing without external stimulation suggest that the phase of neural oscillations can align with internal mental operations. That is, internally constructed representations can modulate the phase of neural oscillations during rhythmic mental imagery. These results support the view that top-down factors modulate intrinsic neural oscillations. Our results may also impact practical and clinical domains.
The motor-to-sensory transformation network for imagined speech suggests novel strategies for building a brain-computer interface (BCI). Previously, direct BCIs have mostly focused on the motor system [74]. Our findings of synchronized neural activity across motor and sensory domains during mental imagery suggest possible updates for decoding algorithms from a system-level, multimodal perspective, as hinted at in a recent advance [75]. Moreover, our results offer insights into the functional and anatomical foundations of auditory hallucinations. We have previously hypothesized that, from a cognitive perspective, auditory hallucinations may be caused by incorrect source monitoring of internally self-induced auditory representations [3, 12]. The present results of synchronized neural activity in the frontal-parietal-temporal network suggest possible neural pathways for the internal generation of auditory representations. These results are also consistent with the neural modulation treatment for auditory hallucinations, which targets the motor-to-sensory transformation network with electric stimulation [76].

In summary, using a rhythmic entrainment imagined singing paradigm, we observed that the neural phase was modulated at the rate of inner speech and covert singing. The synchronized activity spanned dedicated frontal-parietal-temporal regions at multiple frequency bands, providing evidence for the motor-to-sensory transformation. The coherent activation in the motor-to-sensory transformation network mediated the internal construction of perceptual representations and formed the neural computational foundation for mental operations.

Materials and methods

Ethics statement

The study was approved by the New York University Institutional Review Board (IRB# 10–7107) and conducted in conformity with the 45 Code of Federal Regulations (CFR) part 46 and the principles of the Belmont Report.

Participants

Sixteen volunteers (7 males; mean age = 25 years, range 19-32) participated in this experiment. All participants were right-handed, native English speakers, and without a history of neurological disorders. All participants provided informed written consent and received monetary compensation for their participation. The number of participants was predetermined based on our previous studies that investigated similar speech imagery using MEG [21, 24]. Moreover, 16 participants is in the upper range of sample sizes used in previous MEG studies that investigated neural entrainment [e.g., 33, 34, 41, 77]. The effect size required to achieve 80% power at an alpha level of 0.05 with a sample size of 16 is Cohen's d = 0.99. As the standard deviation estimated in our study was 0.07, the corresponding absolute effect size in the measure (phase coherence) was 0.07. The post hoc effect size from our data in all sensors was d = 1.50 (d = 2.32); the absolute effect size was 0.09. In the clustering analysis used in this study, we applied a pre-cluster threshold of 0.001 in the paired t test. Considering the sample size of 16, the minimal detectable effect size is d = 1.02. The post hoc effect size from the data in the HG that was independently defined by cortical parcellation (S1 Table and S1 Fig) was d = 1.21 (d = 1.89); the absolute effect size was 0.11.

Materials

Female vocals were recorded singing the first sentence of 4 well-known songs (Alphabet Song: 6.24 s, 16 syllables, 2.56 syllables/s; Itsy Bitsy Spider: 7.38 s, 23 syllables, 3.12 syllables/s; Take Me out to the Ball Game: 5.42 s, 13 syllables, 2.40 syllables/s; and Twinkle Twinkle Little Star: 6.10 s, 14 syllables, 2.30 syllables/s). Arguably, the faster the songs were, the more difficult it would be to sing them at a similar speed across trials (i.e., the greater the temporal variance). Therefore, we recorded the songs at a comfortable, intermediate speed (a syllabic rate of about 3 Hz) to minimize the temporal variance and thus maximize the statistical power of neural entrainment in the imagery conditions. All songs were recorded with a sampling rate of 44.1 kHz. During the experiment, stimuli were normalized and delivered at about 70 dB SPL via plastic air tubes connected to foam earpieces (E-A-R Tone Gold 3A Insert earphones, Aearo Technologies Auditory Systems, Indianapolis, IN).

Procedure

A fixation cross was presented in the center of the screen throughout the experiment. Color changes of the fixation were used as cues for the listening and imagery tasks to avoid contrast changes (onset and offset of the fixation) that could induce large visual responses. Participants were asked to listen to one of the 4 songs when the color of fixation was red (the listening condition). The fixation changed to yellow after the auditory stimulus offset. After 1.5 s, the fixation changed to purple, and participants were required to imagine singing the song that they just heard (the imagery condition). They were asked to covertly reproduce the song using the same rhythm and speed as the preceding auditory stimuli. Participants pressed a button to indicate the completion of imagery. RT was recorded between the onset of the visual cue for imagery and when the button was pressed to indicate the completion of imagery. After the button press, the fixation turned yellow and stayed on screen for 1.5 s–2.5 s (with an increment of 0.333 s) until the next trial began. Participants were required to refrain from any overt movement and vocalization during imagined singing. A video camera and a microphone were used to monitor any overt movement and vocalization throughout the experiment. Four blocks were included in this experiment, with 24 trials in each block (6 trials per song in each block, 24 trials per song in total). The presentation order was randomized. Participants were familiarized with the experimental procedure before the experiment.

Behavioral analysis

The song Twinkle Twinkle Little Star was used for a research question independent of this study. Therefore, only 3 songs were used for further analysis. The RT was quantified as the duration between the onset of the visual cue for imagery and the participants' button press that indicated the end of imagery. The mean RT was obtained for the imagery of each song. The RT data were further transformed into z-scores. The RT z-scores of the 24 trials were ranked from shortest to longest for each song and averaged across the 3 songs. Two groups were formed based on RT ranking: the center group consisted of the 12 trials closest to the mean RT, whereas the dispersed group comprised the other 12 trials that were farther from the mean RT. Because the separation of trials was defined by the difference between each trial's RT and the mean RT in the imagery condition, the RT differences among trials in the center group were much smaller than those among trials in the dispersed group (Fig 1D). Similar durations indicated that trials in the center group were more likely imagined in a similar temporal manner. These temporal differences between groups of trials were used in the phase-coherence analysis of the MEG data to investigate our hypothesis of neural entrainment to imagery.
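The center/dispersed split described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code; the function name and the simulated RTs are ours.

```python
import numpy as np

def split_center_dispersed(rts, n_center=12):
    """Split imagery trials into a 'center' and a 'dispersed' group by
    their distance from the mean reaction time (sketch of the split
    described above; names are illustrative)."""
    rts = np.asarray(rts, dtype=float)
    z = (rts - rts.mean()) / rts.std()      # z-score the RTs
    order = np.argsort(np.abs(z))           # rank by distance from the mean
    center = order[:n_center]               # 12 trials closest to the mean RT
    dispersed = order[n_center:]            # remaining 12 trials
    return center, dispersed

# Example with 24 simulated imagery RTs (seconds), one song
rng = np.random.default_rng(0)
rts = rng.normal(6.5, 0.8, size=24)
center, dispersed = split_center_dispersed(rts)
```

By construction, every trial in the center group lies at least as close to the mean RT as any trial in the dispersed group.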

MEG recording

Neuromagnetic signals were measured using a 157-channel whole-head axial gradiometer system (KIT, Kanazawa, Japan). Five electromagnetic coils were attached to each participant’s head to monitor the head position during MEG recording. The locations of the coils were determined with respect to 3 anatomical landmarks (nasion, left and right preauricular points) on the scalp using 3D digitizer software (Source Signal Imaging, San Diego, CA) and digitizing hardware (Polhemus, Colchester, VT). The coils were localized to the MEG sensors at the beginning and the end of the experiment. The MEG data were acquired with a sampling frequency of 1,000 Hz, filtered online between 1 Hz and 200 Hz, with a notch at 60 Hz.
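The acquisition settings described above (1-200 Hz online band-pass with a 60-Hz notch at a 1,000-Hz sampling rate) can be emulated offline with standard SciPy filters. This is an illustrative sketch under our own filter choices (4th-order Butterworth, notch Q = 30), not the recorder's exact online filters.

```python
import numpy as np
from scipy import signal

fs = 1000.0  # sampling frequency (Hz), as in the MEG recording

# 1-200 Hz band-pass plus 60-Hz notch (illustrative parameters)
b_band, a_band = signal.butter(4, [1.0, 200.0], btype="bandpass", fs=fs)
b_notch, a_notch = signal.iirnotch(60.0, Q=30.0, fs=fs)

def preprocess(x):
    """Apply zero-phase band-pass and then notch filtering to one channel."""
    x = signal.filtfilt(b_band, a_band, x)
    return signal.filtfilt(b_notch, a_notch, x)
```

`filtfilt` runs each filter forward and backward, so the emulation is zero-phase, unlike a causal online filter; for phase-coherence analyses this distinction matters and is worth keeping in mind.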

MEG analysis

Raw data were noise reduced offline using the continuously adjusted least-squares method [78] in the MEG160 software (MEG Laboratory 2.001 M, Yokogawa Corporation, Eagle Technology Corporation, Kanazawa Institute of Technology). We used independent component analysis (ICA) to reject artifacts caused by eye movements and cardiac activity. Epochs of 6,000 ms in duration (2,000 ms pre-stimulus and 4,000 ms post-stimulus) were extracted for trials in the listening and imagery conditions. For the listening condition, the 24 trials of each song were grouped to form 3 within-groups. Furthermore, 8 trials were randomly sampled from the 24 trials of each song to yield a new group of 24 trials (between-group). This sampling procedure was conducted 3 times to form 3 between-groups. The sampling was without replacement, such that each trial was used only once across the 3 between-groups. For the imagery conditions, MEG trials were separated into center groups and dispersed groups for each song according to RT z-scores (see the behavioral analysis above). The group separation in the imagery condition was different from the creation of groups in the listening condition. The comparison between the center group and the dispersed group overcame difficulties induced by the unique, active nature of imagery tasks. Unlike the fixed duration and dynamics in every listening trial, imagery performance varied from trial to trial. Had the established phase-coherence calculation been applied across all imagery trials, this variability would have reduced the statistical power, which was already lower for imagery than for perception. The split-group analysis extended the phase-coherence analysis by adapting it to this unique requirement of imagery. Another option was to use an intrinsic baseline (rest), but this could introduce additional confounding factors, whereas trials in the dispersed group had identical procedures and tasks.
The only difference between trials in the dispersed group and the center group was the response timing that reflects the measure of interest (the dynamic processes during imagery). Therefore, the dispersed group was a better-controlled baseline. Using the temporal differences while keeping everything else identical between the imagery groups can precisely reveal the dynamics of interest in the mental imagery tasks through the analysis of phase coherence across trials. A fast Fourier transform (FFT) was applied to each trial in a group with a 500-ms time window in steps of 200 ms, yielding 19 time points in each 4-s-long trial epoch. The phase values were extracted at each time point and frequency (1-48 Hz with a spectral resolution of 1 Hz). The ITC was calculated as in Eq 1 [29]:

ITC(t, f) = |(1/N) * sum_{j=1..N} exp(i * theta_j(t, f))|    (Eq 1)

where theta_j(t, f) represents the phase value at time point t and frequency f in the jth trial, and N represents the total number of trials in a group. The ITC values were obtained for each of the within-groups and between-groups in the listening condition, and for each of the center groups and dispersed groups in the imagery condition. The ITC characterized the consistency of the temporal (phase) neural responses across trials. If the phase responses were identical across trials, the ITC value would be 1. To investigate neural entrainment to the rhythms in the external stimuli and the mental operations, the ITC values were first averaged in the delta band (1-3 Hz). According to our hypothesis that neural responses track the rhythm in acoustic signals and imagery at the syllabic rate of 2-3 Hz, similar durations among trials in the center group should yield higher ITC values than those from trials in the dispersed group in the imagery condition, and trials in the within-group should yield higher ITC values than those from trials in the between-group in the listening condition.
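The sliding-window ITC computation can be sketched as follows. This is an illustrative NumPy implementation of Eq 1 with the 500-ms window and 200-ms step described above, not the authors' exact pipeline; in particular, we zero-pad each window to 1 s to obtain 1-Hz frequency bins, which is our assumption about how the 1-Hz resolution was achieved.

```python
import numpy as np

def itc(epochs, sfreq=1000, win=0.5, step=0.2, fmin=1, fmax=48):
    """Inter-trial phase coherence via a sliding-window FFT, as in Eq 1:
    ITC(t, f) = |(1/N) * sum_j exp(i * theta_j(t, f))|.
    `epochs` is an (n_trials, n_samples) array (sketch only)."""
    n_trials, n_samples = epochs.shape
    nwin, nstep = int(win * sfreq), int(step * sfreq)
    freqs = np.fft.rfftfreq(sfreq, 1.0 / sfreq)  # zero-pad to 1 s -> 1-Hz bins
    keep = (freqs >= fmin) & (freqs <= fmax)
    out = []
    for start in range(0, n_samples - nwin + 1, nstep):
        seg = epochs[:, start:start + nwin]
        spec = np.fft.rfft(seg, n=sfreq, axis=1)[:, keep]
        unit = np.exp(1j * np.angle(spec))       # unit-length phase vectors
        out.append(np.abs(unit.mean(axis=0)))    # resultant length across trials
    return np.array(out)                         # (n_times, n_freqs)
```

When all trials have identical phase at a given time-frequency point, the unit vectors align and the resultant length is 1; for random phases the vectors cancel and the ITC approaches 0 as the trial count grows.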
Next, the ITC values were averaged over time (0–4 s) and across the 3 songs to yield a single value in every MEG channel for the within-group and between-group in the listening condition, and the center group and dispersed group in the imagery condition. For the between-group in the listening condition, the random grouping procedure was repeated 100 times and yielded 100 ITC values. The 50th percentile of ITC was chosen as a baseline to compare with the ITC of the within-group in the next cluster-based analysis. Distributed source localization of ITC was obtained by using Brainstorm software [79]. The cortical surface was reconstructed from individual structural MRI using Freesurfer (Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Boston, MA). Current sources were represented by 15,002 vertices. The overlapping spheres method was used to compute the individual forward model [79]. The inverse solution was calculated by approximating the spatiotemporal activity distribution that best explained the ITC value. Dynamic statistical parametric mapping (dSPM) [80] was calculated using the noise covariance matrix estimated with the 1,000-ms pre-stimulus period. To compute and visualize the group results, each participant’s cortical surface was inflated and flattened [81] and morphed to a grand average surface [82]. Source data were spatially smoothed using a Gaussian smoothing function “SurStatSmooth” in the SurStat toolbox [83] with 3 mm of Full Width at Half Maximum (FWHM). The nonparametric cluster-based permutation test [84] was used to assess significant differences between groups in the source space [85]. For the listening condition, the ITC values of the within-group were compared with the baseline ITC values of the between-group. For the imagery condition, the ITC values of the center group were compared with the dispersed group. 
The empirical statistics were first obtained by a two-tailed paired t test, requiring 2 or more adjacent significant vertices, with a pre-cluster threshold of alpha = 0.001. Next, a null distribution was formed by randomly shuffling the group labels 1,000 times. Cluster-level, FDR-corrected results were obtained by comparing the empirical statistics with the null distribution (cluster threshold of alpha = 0.05 for both the listening and imagery conditions). To test possible differences between listening and imagery, we directly compared the activations in the listening and imagery conditions. The ITC values in the within-group of the listening condition were compared with the ITC values in the center group of the imagery condition using the same nonparametric cluster-based permutation test. Moreover, to test for common cortical regions that mediate both listening and imagery, we conducted a conjunction analysis. The cortical regions that had significant main effects in both the listening and imagery conditions were identified and depicted on the cortical surface. To further explore the functional specificity of dynamic processing in the motor-to-sensory pathway, we tested the phase coherence across trials in 6 more frequency bands: the theta (4-8 Hz), alpha (9-12 Hz), low-beta (13-16 Hz), mid-beta (17-20 Hz), high-beta (21-28 Hz), and low-gamma (30-48 Hz) bands. Similar procedures were implemented as in the delta-band investigation, except that the phase values were extracted at the corresponding frequencies. The phase coherence in each frequency band was obtained using Eq 1. The same distributed source localization and nonparametric cluster-based permutation tests were applied to each frequency band. To further test the reliability of the results, we used longer epochs in the phase-coherence analysis. The duration of imagery across trials and among participants could vary over a wide range.
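The logic of the cluster-based permutation test can be illustrated with a 1-D toy analogue. The real analysis clusters over adjacent cortical vertices (e.g., MNE-Python's `spatio_temporal_cluster_1samp_test` implements the full spatio-temporal version); the sketch below clusters over adjacent features of a 1-D array, builds the null distribution by sign-flipping each subject's paired differences, and is ours, not the authors' code.

```python
import numpy as np
from scipy.stats import t as tdist

def cluster_perm_test(a, b, n_perm=1000, p_pre=0.001, seed=0):
    """Paired cluster-based permutation test over adjacent features.
    `a`, `b`: (n_subjects, n_features) arrays (e.g., ITC per condition).
    Returns the observed maximal cluster mass and its permutation p-value."""
    rng = np.random.default_rng(seed)
    n = a.shape[0]
    tcrit = tdist.ppf(1 - p_pre / 2, df=n - 1)  # two-tailed pre-cluster threshold

    def max_cluster_mass(d):
        # one-sample t test on the paired differences, per feature
        tvals = d.mean(0) / (d.std(0, ddof=1) / np.sqrt(n))
        mass, best = 0.0, 0.0
        for tv in tvals:
            # accumulate |t| over contiguous supra-threshold runs
            mass = mass + abs(tv) if abs(tv) > tcrit else 0.0
            best = max(best, mass)
        return best

    d = a - b
    observed = max_cluster_mass(d)
    # null: randomly flip the sign of each subject's difference vector
    null = np.array([max_cluster_mass(d * rng.choice([-1.0, 1.0], size=(n, 1)))
                     for _ in range(n_perm)])
    return observed, (null >= observed).mean()

# Toy data: 16 subjects, 20 features, a genuine effect on features 5-9
rng_data = np.random.default_rng(2)
b = rng_data.standard_normal((16, 20))
a = b + 0.3 * rng_data.standard_normal((16, 20))
a[:, 5:10] += 2.0
mass, pval = cluster_perm_test(a, b)
```

Because whole clusters, rather than individual features, are compared against the permutation null, the test controls the family-wise error rate without a per-vertex multiple-comparison correction.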
The imagery durations in some trials, especially imagery of the shortest song, could be shorter than the length of the epochs if long epochs were extracted. To be conservative, we first chose a stringent threshold of 4 s. It is possible that more data points may offset the reduction in statistical power due to the variance of imagery durations across trials. Therefore, we applied the FFT to 10-s-long epochs (2 s pre-stimulus and 8 s post-stimulus) and applied the same phase-coherence analysis to the 6-s post-stimulus data. To probe the evolution of phase coherence across time, we performed an analysis on shorter, consecutive time bins. The 6-s post-stimulus data were binned into four 3-s-long time bins, with 2-s overlaps between consecutive time bins (0-3 s, 1-4 s, 2-5 s, and 3-6 s). The data in each time bin were subjected to the same phase-coherence analysis, followed by the same distributed source localization and nonparametric cluster-based permutation tests.
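The overlapping-bin scheme is simple enough to express directly (a sketch; the function name is ours):

```python
def time_bins(total=6.0, width=3.0, step=1.0):
    """Consecutive overlapping analysis windows over the post-stimulus
    period: 3-s bins advanced in 1-s steps (2-s overlap), as above."""
    bins, start = [], 0.0
    while start + width <= total:
        bins.append((start, start + width))
        start += step
    return bins

time_bins()  # -> [(0.0, 3.0), (1.0, 4.0), (2.0, 5.0), (3.0, 6.0)]
```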

Cortical parcellation of a grand average surface across 16 participants.

Parcellation superimposed on the inflated average cortical surface. Cortical areas related to this study were labeled with numbers on the left hemisphere. Refer to the S1 Table for the anatomical names for labels. The numeric labels were consistent with the ones used by Destrieux and colleagues [86]. (TIF)

Phase-coherence results using 6-s-long epochs.

The results for (A) the listening condition and (B) the imagery condition were consistent with the results using 4-s-long epochs in Fig 2C and 2D. The underlying data for this figure can be found at https://osf.io/mc8wd/. (TIF)

Evolution of phase coherence in four 3-s time bins.

The results were more reliable toward the end of trials and were consistent with the results in Fig 2D. The underlying data for this figure can be found at https://osf.io/mc8wd/. (TIF)

Cortical parcellation and labels, used in S1 Fig.

(DOCX)

10 Sep 2019

Dear Dr Tian,

Thank you for submitting your manuscript entitled "Verbal thinking in rhythm: motor-to-sensory transformation network mediates imagined singing" for consideration as a Short Report by PLOS Biology. Please accept my apologies for the delay in sending this initial decision to you. We were interested in your study, and thus, sought advice from an Academic Editor with relevant expertise. With that advice now in hand, I'm pleased to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire. Please re-submit your manuscript within two working days, i.e. by Sep 12 2019 11:59PM. Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit. Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review. Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.
Kind regards,
Gabriel Gasque, Ph.D., Senior Editor, PLOS Biology

13 Nov 2019

Dear Dr Tian,

Thank you very much for submitting your manuscript "Verbal thinking in rhythm: motor-to-sensory transformation network mediates imagined singing" for consideration as a Short Report at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, by an Academic Editor with relevant expertise, and by three independent reviewers. Please accept my sincere apologies for the long delay in sending the decision below to you.

As you will see, while the reviewers agree the study is mostly technically solid, there is disagreement about the novelty of the findings. More importantly, reviewer 3 has identified the low number of participants as a major problem; something that shouldn't "be advocated, particularly in a wide-reaching journal like Plos Biology". Having discussed this specific concern with the Academic Editor, we have decided to decline further consideration. However, we would be willing to consider a heavily revised manuscript that addresses all the reviewers' comments, particularly by increasing the number of volunteers. Please note that by leaving the door open for re-submission, we have decided that your findings, if confirmed by a larger sample size, offer the novelty that we aim to publish in our journal.

In addition, our Academic Editor agrees with reviewer 1, especially on the analysis of other frequency bands. You should reframe the "inner thinking" concept with imagining the rhythmicity and tonality of a song, as suggested by reviewer 3, and the direct statistical comparison between the Listening and Imagery groups is a must. We appreciate that these requests represent a great deal of extra work, and we are willing to relax our standard revision time to allow you six months to revise your manuscript.
Please email us (plosbiology@plos.org) to discuss this if you have any questions or concerns, or think that you would need longer than this. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not wish to submit a revision and instead wish to pursue publication elsewhere, so that we may end consideration of the manuscript at PLOS Biology. Your revisions should address the specific points made by each reviewer. Please submit a file detailing your responses to the editorial requests and a point-by-point response to all of the reviewers' comments that indicates the changes you have made to the manuscript. In addition to a clean copy of the manuscript, please upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type. You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Before you revise your manuscript, please review the following PLOS policy and formatting requirements checklist PDF: http://journals.plos.org/plosbiology/s/file?id=9411/plos-biology-formatting-checklist.pdf. It is helpful if you format your revision according to our requirements - should your paper subsequently be accepted, this will save time at the acceptance stage. Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. 
If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. For manuscripts submitted on or after 1st July 2019, we require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements. Upon resubmission, the editors will assess your revision and if the editors and Academic Editor feel that the revised manuscript remains appropriate for the journal, we will send the manuscript for re-review. We aim to consult the same Academic Editor and reviewers for revised manuscripts but may consult others if needed. If you still intend to submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record. Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. 
Sincerely,
Gabriel Gasque, Ph.D., Senior Editor, PLOS Biology

*****************************************************

Reviewer remarks:

Reviewer #1: The study by Li and colleagues investigated whether imagined singing was associated with activation of the motor-to-sensory-transformation network, as quantified by the M100 response and changes in phase coherence in the delta frequency band (which corresponded to the approximate rhythm of the song frequency). They reported that imagined singing produced changes in measures that were comparable (in pattern and magnitude) to those elicited by passive listening to the songs. This study follows on from the team's influential work on the role of the motor-to-sensory transformation network in inner speech. The design was creative and innovative; I particularly liked the idea of frequency tagging the rhythm of inner speech (discussed further below). I have some questions and comments regarding the methodology and results that the authors could consider:

- If I could just clarify my understanding of the experimental design: participants were first played the full song (~5-8 seconds; these constituted the 'listening trials'). There was then a gap of 1.5 seconds. Then the fixation cross changed color and participants were required to imagine themselves singing the song (these constituted the 'imagery trials'). Is this correct?

- The similarity in the form and magnitude of the M100 elicited by the song onset in the listening condition and the imagery onset in the imagery condition is impressive, and the authors' claim that the auditory cortices were activated in the imagery conditions seems plausible (to me). However, is it possible that this component was (partly) elicited by the change in color of the fixation cross?
It would presumably be easy to discount this possibility by measuring the MEG response to a colour change presented alone (i.e., in the absence of auditory stimulation or song imagery). - Regarding the temporal domain (waveform) analysis: I don’t understand why only the first sound (or sound image) of the song elicited an M100 – can the authors clarify? If this signal truly reflects the activation of the auditory cortex by the imagined song, wouldn’t each individual letter be expected to elicit an M100? If might be useful to illustrate the onset of each individual letter in the listening condition (e.g., by means of a dotted line), to see if each letter is followed by a M100 (albeit smaller than the M100 elicited by the first letter). Relatedly, in the Results the authors state: “Moreover, no repetitive patterns were observed in the time course of listening or imagery (Fig. 1g&h), suggesting that the tracking of the acoustic stream or the rate of imagery was not by the response magnitude” – as above, I don’t understand why the authors were not expecting a repetitive pattern if each specific ‘sound image’ was associated with an M100, and particularly given that they claim to have identified delta oscillations to the inner song. Can the authors clarify why they would expect the first sound image in the sound to elicit an M100, but not subsequent sound images? An alternative explanation is that that the signal actually reflects an ‘omission M100’ response (akin to the oN1 response in the ERP literature) caused by a violation in the expectation of hearing a sound – this could perhaps explain why the signal is only present to the first sound in the song. I would be interested to hear the authors’ thoughts on this. 
- Regarding the time-frequency (spectral) analysis: I think it would be helpful if the authors could add plots showing the change in phase coherence across time, and across a range of frequencies (i.e., demonstrate that the inner singing elicited oscillations). Without this information, how can the authors be sure that the ‘entrainment’ they report is specific to the delta band (i.e., the approximate frequency of the rhythm)? While their hypothesis is plausible, it seems to me first necessary to show that the entrainment is specific to delta, and does not also occur in other frequency bands (theta, alpha, beta, gamma, etc.). - I thought the ‘frequency-tagging’ aspect of the design was interesting and creative. However, given that the rhythm was approximately equivalent for all of the songs (and the data were averaged across songs in any case) I feel like this aspect of the design could be developed further. For example, it would be interesting to compare the oscillations elicited by ‘delta rhythm’ songs vs. ‘theta rhythm’ songs, to see if the oscillations they elicit are specific to their rhythm: a sort of ‘auditory steady-state response’ for imagined sounds. - I’m not sure I fully understand the rationale for comparing the ‘center-group’ (which would more accurately be labelled as ‘center-trials’) and the ‘dispersed-group’ in the imagined song analysis – is the idea that the motor-based predictions are better in the ‘center-trials’, as the RTs better matched the duration of the actual song? If so, I’m not sure I agree – given that participants overestimated the duration of the song in all imagery conditions, doesn’t this imply that the trials with the lowest RTs actually reflect the most accurate predictions? But in any case, why would the ‘motor-based-predictions’ be expected to be stronger in the center trials? 
What is the benefit of this approach against, say, simply comparing the ITC when (a) participants listened to songs, (b) participants produced inner songs, and (c) participants sat passively. Presumably the (b) vs. (c) comparison would be expected to be even stronger in such an analysis? - With regards to the source-localization: why did the critical p-values differ between the listening and imagery conditions – did the number of statistical comparisons differ between these conditions? As an aside, while the required number of trials per condition is a matter of debate, 12 trials per condition seems pretty minimal for a paradigm with no external stimulation where a large effect size would presumably not be expected – can the authors comment? Reviewer #2: In « Verbal thinking in rhythm: motor-to-sensory transformation network mediates imagined singing », Li and colleagues aimed to define the brain networks associated to imagined singing with an elegant design and methodological approach (frequency tagging on MEG data). The study is interesting and rigorously performed using state-of-the art analyses tools. However I am wondering about the novelty and significance of the findings, as the implication of fronto-temporo-parietal networks in speech and music mental imagery is already very well established (as pointed out by the authors in the introduction and discussion sections). Moreover the potential novelty of this study, as stated by the authors (naturalistic sounds, frequency tagging approach, and MEG source localization), is mainly methodological and is, in my opinion, not so novel because these approaches have been used by plenty of research groups in the domain. In its current form, I thus think if this report might be more suitable for a more specialized journal. 
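Both reviews turn on the inter-trial phase coherence (ITC) metric, ITC(t, f) = |(1/N) Σ_n exp(iθ_n(t, f))|, and on whether phase locking is specific to the delta band. As a minimal sketch of that check (synthetic single-channel trials only, with illustrative band edges and trial counts, not the authors' MEG pipeline):

```python
# Minimal illustration of inter-trial phase coherence (ITC):
#   ITC(t, f) = | (1/N) * sum_n exp(i * theta_n(t, f)) |
# Synthetic single-channel data only; NOT the authors' MEG pipeline.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def band_itc(trials, fs, band):
    """ITC over time for one frequency band.

    trials : (n_trials, n_samples) array of epoched data
    fs     : sampling rate in Hz
    band   : (low, high) passband in Hz
    """
    b, a = butter(4, band, btype="bandpass", fs=fs)
    phase = np.angle(hilbert(filtfilt(b, a, trials, axis=1), axis=1))
    # Length of the mean resultant phase vector across trials, in [0, 1].
    return np.abs(np.mean(np.exp(1j * phase), axis=0))

rng = np.random.default_rng(0)
fs, n_trials = 250, 24                      # 24 imagery trials per song
t = np.arange(int(fs * 4.0)) / fs           # 4-s epochs
# Trials share a 2-Hz (delta) rhythm with a fixed phase, plus noise.
trials = np.sin(2 * np.pi * 2.0 * t) + rng.standard_normal((n_trials, t.size))

delta_itc = band_itc(trials, fs, (1.0, 3.0)).mean()
beta_itc = band_itc(trials, fs, (17.0, 20.0)).mean()
# Phase locking should be band specific: high in delta, near chance in beta.
print(f"delta ITC ~ {delta_itc:.2f}; beta ITC ~ {beta_itc:.2f}")
```

For rhythmically consistent trials, only the band containing the shared rhythm shows ITC well above the chance level of roughly 1/sqrt(N), which is the frequency-specificity demonstration Reviewer #1 asks for.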
I have only minor comments regarding the methodology and the main text, listed below:

- The term Reaction Time (RT) is confusing; the authors could consider using a more explicit term that states that this metric measures the duration of the imagination period.

- In a related vein, the sentence « We found that when participants imagined with less temporal variation … » could be reformulated to clarify that it concerns the variation of the duration of the imagination period across trials.

- Could the authors justify why they used ICA instead of Signal Space Projection (SSP) for correction of eye movement and cardiac artifacts?

- « Furthermore, eight trails were randomly sampled from 24 trials of each song and yielded a new group of 24 trials (between-group) »: change 'trails' to 'trials'.

- « Comparing the phase coherence results in imagined singing with those in listening conditions, the observations were overlapped in the temporal regions of primary and secondary auditory cortices. » The authors did not compare the listening and imagination conditions and did not estimate the conjunction (overlap) between these conditions. These analyses should be added if the authors want to make this statement.

Reviewer #3: Li and colleagues use a clever design to study where in the brain we can find phase-synchronised activity while participants imagine singing a few lines of children's songs. Significant phase coherence in the delta band can be found in a large network comprising bilateral inferior frontal, motor, temporal, and parietal areas. The analysis is new in the context of imagined singing (though standard for speech studies). The study has the potential to inform about the neural representation of imagined singing, but I have a few major (essential) and minor concerns and questions.

Major

1. The study is framed in a "verbal thinking" or "inner speech" context (as indicated by the title, abstract, introduction and discussion), but I don't think it is appropriate to equate thinking with singing. Singing (production and perception) involves a wider/different network than thinking. The right hemisphere, for example, is much more involved in singing than in verbal thinking. In my opinion, the manuscript needs to reflect these basic differences, and the topic needs to be reframed.

2. Related to 1: I imagine that the same pattern of results would arise if participants were to just imagine humming the song, keeping the rhythmicity. I don't think the results necessarily show processes related to "inner thinking" or speech; rather, they show consistent neural processes related to the rhythmicity and tonality of imagining a song. This should at least be discussed.

3. The phase synchronisation results during Listening and Imagery are compared in the discussion, but a statistical comparison in the results is missing. This is the most important contrast in this study (to find out what the "inner" aspect of singing actually adds/changes when compared to simply listening to songs). This could, for example, be done on a trial-by-trial basis.

4. Related to the previous point: there were two different alpha levels used for analysis in the Listening and Imagery conditions, so their results can't be compared. I.e., just before the result section: "(alpha=0.05 for the listening conditions, and alpha=0.001 for the imagery conditions)." In the result section itself, there's even another alpha level mentioned: "For the imagery condition, [..] (pcorr(FDR)<0.01)". The same thresholds should be used for all conditions, unless there is a clear justification for this.

5. The study had only 16 participants (and no a priori estimated sample size), which is becoming less and less acceptable in psychological and neuroimaging studies (e.g. Simmons JP, Nelson LD, Simonsohn U (2011) False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22: 1359-1366. https://doi.org/10.1177/0956797611417632). While it would be unfair to request additional data, a sample size of 16 is not something that should be advocated, particularly in a wide-reaching journal like PLOS Biology. Please add an explicit justification for this low number of participants to demonstrate awareness that this is not good scientific practice.

6. The authors propose a model (Figure 1f) that describes Articulatory Simulation, Articulatory Estimation, and Auditory Estimation, including three different areas involved in these processes. While the results support that these areas (and others) show increased delta phase coherence, none of the three proposed processes can be disentangled using the present analysis. It might be possible to gain support for some aspects of the model using additional analyses. For example, using directed connectivity, the proposed direction of information flow could be supported. But the present paradigm cannot provide support for the different processes of the model. Please make it clear in the manuscript that the approach used is not suitable for studying the different functional contributions of the proposed model (only the involvement of the areas in delta phase alignment).

Minor

1. The language in the manuscript is problematic. I understand that it can be extremely difficult as a non-native speaker to write in English, and reviewers should as much as possible try to ignore this. But in the current manuscript it is often unclear whether a statement refers to previous research, the current study, or something that is proposed here (due to the unclear use of tenses). Please ask a native speaker to proofread the ms before re-submission.

2. Please add the fourth song to the analysis, even if it is also used for a different research question. With only 16 participants, this study needs as much data as possible.

3. In the introduction, particularly on page 3, a lot of information is missing at this point to understand the statements made there. It is unclear what is meant by "reaction time" or what the conditions are. The descriptions here don't make sense without knowing the paradigm. Please explain briefly what the participants had to do (maybe at the beginning of the last paragraph of the intro).

4. Please add page and line numbers to the manuscript.

5. The approach used here is described as "frequency tagging". Frequency tagging is well defined in the field, and I think neither the listening nor (even less so) the imagery condition can be classed as frequency tagging.

6. In the figure, please indicate the areas in the result plots (1i and 1j) that refer to the proposed areas in 1f, to make a comparison of hypotheses and results easier.

7. Please use a more informative plot for 1b. Bar plots are becoming unacceptable in scientific research.

8. Source data were spatially smoothed: using which parameters?

9. Please define the terms of the equation (i.e., t, f, N, θ). Should the θ be δ?

10. Why were only 4 s post-stimulus analysed? The songs were a minimum of 5.42 s long, and participants needed consistently longer to imagine the songs. I would think that 6 s post-stimulus would maximise available data for each epoch, and I would also add an additional 2 s (consistent with the pre-stimulus time) to optimise the FFT results (leading to 10-s [-2 s to 8 s] epochs).

8 Jun 2020
Submitted filename: responses_imagined_singing_MEG_20200622-final.docx

13 Aug 2020
Dear Dr Tian,

Thank you for submitting your revised Short Report entitled "Mental operations in rhythm: motor-to-sensory transformation mediates imagined singing" for publication in PLOS Biology.
I have now obtained advice from the original reviewers and have discussed their comments with the Academic Editor. You will note that reviewer 1, Thomas Whitford, has identified himself. Based on the reviews, we will probably accept this manuscript for publication, assuming that you will modify the manuscript to address the remaining points raised by the reviewers. Please also make sure to address the data and other policy-related requests noted at the end of this email. We expect to receive your revised manuscript within two weeks.

**IMPORTANT: Your revisions should address the specific points made by each reviewer. However, we will not press for the inclusion of additional analyses. In addition, please remove the following paragraph from your manuscript, because our expert on statistical analyses thinks it is a circular argument: "The post-hoc effect sizes obtained in multiple analyses are greater than the required effect size. Therefore, using the predetermined sample size of 16 that is consistent with previous studies should provide enough statistical power to reliably measure the well-documented neural entrainment effects using MEG."

Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file. This should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript. *NOTE: In your point-by-point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full, and each specific point should be responded to individually, point by point.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.
In addition to the remaining revisions, and before we will be able to formally accept your manuscript and consider it "in press", we also need to ensure that your article conforms to our guidelines. A member of our team will be in touch shortly with a set of requests. As we can't proceed until these requirements are met, your swift response will help prevent delays to publication.

*Copyediting* Upon acceptance of your article, your final files will be copyedited and typeset into the final PDF. While you will have an opportunity to review these files as proofs, PLOS will only permit corrections to spelling or significant scientific errors. Therefore, please take this final revision time to assess and make any remaining major changes to your manuscript. NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines: https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History* Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details: https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version* Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box.
Should you, your institution's press office, or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition* To enhance the reproducibility of your results, we recommend that, if applicable, you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

*Submitting Your Revision* To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include a cover letter, a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable), and a track-changes file indicating any changes that you have made to the manuscript. Please do not hesitate to contact me should you have any questions.

Sincerely, Gabriel Gasque, Ph.D., Senior Editor, ggasque@plos.org, PLOS Biology

------------------------------------------------------------------------

ETHICS STATEMENT:

-- Please indicate within your manuscript whether your protocols, as approved by the Institutional Review Board (IRB) at New York University, adhered to the Declaration of Helsinki or any other national or international ethical guidelines.

-- Please include the ID number of the protocol approved by the Institutional Review Board (IRB) at New York University.

-- Please indicate if participants gave informed written consent. If consent was oral, please explain why.

------------------------------------------------------------------------

DATA POLICY:

-- Please include in your deposition in OSF a README file that would allow the reader to link your data files to each of the figures displaying quantitative data, by explaining how the data were analyzed to generate the final plots and graphs.

-- In addition to your raw MEG data, please upload or provide as a supporting file a spreadsheet containing the individual numerical values that were used to generate the summary statistics shown in Figures 1A and 1B. For an example, see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

------------------------------------------------------------------------

Reviewer remarks:

Reviewer #1, Thomas Whitford: The authors have considered my comments carefully and made detailed responses. I have no further comments.

Reviewer #2: I thank the authors for this revised manuscript and for taking my comments into account. The article has been greatly improved. I have one final comment concerning the last analysis: the ITC for the theta, mu, beta and gamma bands has been computed only for the imagery condition; in order to conclude that these different oscillatory dynamics (and networks) are specific to mental imagery, a direct contrast between the imagery and listening conditions (for each frequency band) is needed.

Reviewer #3: I would like to thank the authors for their thorough reply and the additional analyses, in particular the statistical contrast between conditions and the conjunction plot. I also like the inclusion of the other frequency bands. Overall, my comments have been adequately addressed. The only thing I would like to add after reading the revision is that I would be very careful with the idea that the current study can highlight "commonalities and distinctions between inner speech and covert singing" (highlighted manuscript: page 25/line 3; clean manuscript: page 24/line 22).
It is true that nursery rhymes are somewhere in between these two stimulus types, but I don't think the study can highlight differences and commonalities between inner speech and covert singing. To do this, it would be necessary to use these two types of stimuli/tasks separately. Any conclusions regarding the differences/similarities based on the stimulus type used must remain speculative (which is okay in a discussion). Also, these conclusions use reverse inference based on previously found brain regions, and this is usually selective and not necessarily informative or applicable. Could the authors express their thoughts on page 25 more carefully and explicitly highlight that this is speculative?

25 Aug 2020
Submitted filename: response to reviewers.docx

1 Sep 2020
Dear Dr Tian,

On behalf of my colleagues and the Academic Editor, Hugo Merchant, I am pleased to inform you that we will be delighted to publish your Short Report in PLOS Biology. The files will now enter our production system. You will receive a copyedited version of the manuscript, along with your figures, for a final review. You will be given two business days to review and approve the copyedit. Then, within a week, you will receive a PDF proof of your typeset article. You will have two days to review the PDF and make any final corrections. If there is a chance that you'll be unavailable during the copyediting/proof review period, please provide us with the contact details of one of the other authors whom you nominate to handle these stages on your behalf. This will ensure that any requested corrections reach the production department in time for publication.

Early Version The version of your manuscript submitted at the copyedit stage will be posted online ahead of the final proof version, unless you have already opted out of the process. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

PRESS We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf. We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for submitting your manuscript to PLOS Biology and for your support of Open Access publishing. Please do not hesitate to contact me if I can provide any assistance during the production process.

Kind regards, Alice Musson, Publishing Editor, PLOS Biology, on behalf of Gabriel Gasque, Senior Editor, PLOS Biology
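A recurring statistical thread in the correspondence above is the "pcorr(FDR)" threshold, i.e. false-discovery-rate correction across the many source points tested. As a minimal sketch of the standard Benjamini-Hochberg procedure behind such thresholds (the p-values below are made up for illustration; this is not the authors' actual source-level analysis):

```python
# Benjamini-Hochberg FDR control, the kind of correction behind the
# "pcorr(FDR)" thresholds discussed in the reviews. P-values are made up.
import numpy as np

def fdr_bh(pvals, q=0.05):
    """Boolean mask of tests that survive FDR control at level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # BH: find the largest rank k with p_(k) <= (k/m) * q ...
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    mask = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))
        # ... and declare all tests up to and including rank k significant.
        mask[order[: k + 1]] = True
    return mask

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.210, 0.900]
sig = fdr_bh(pvals, q=0.05)
print(f"{int(sig.sum())} of {len(pvals)} tests survive at q = {0.05}")
```

Because the per-rank threshold (k/m)·q scales with the number of comparisons m, conditions with different numbers of source-level tests can yield different effective critical p-values, which is the point Reviewer #1 raises about the listening vs. imagery thresholds.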
  81 in total

1.  Memory's echo: vivid remembering reactivates sensory-specific cortex.

Authors:  M E Wheeler; S E Petersen; R L Buckner
Journal:  Proc Natl Acad Sci U S A       Date:  2000-09-26       Impact factor: 11.205

2.  The role of area 17 in visual imagery: convergent evidence from PET and rTMS.

Authors:  S M Kosslyn; A Pascual-Leone; O Felician; S Camposano; J P Keenan; W L Thompson; G Ganis; K E Sukel; N M Alpert
Journal:  Science       Date:  1999-04-02       Impact factor: 47.728

Review 3.  Computational neuroanatomy of speech production.

Authors:  Gregory Hickok
Journal:  Nat Rev Neurosci       Date:  2012-01-05       Impact factor: 34.870

4.  Theta and Gamma Bands Encode Acoustic Dynamics over Wide-Ranging Timescales.

Authors:  Xiangbin Teng; David Poeppel
Journal:  Cereb Cortex       Date:  2020-04-14       Impact factor: 5.357

Review 5.  Mechanisms of gamma oscillations.

Authors:  György Buzsáki; Xiao-Jing Wang
Journal:  Annu Rev Neurosci       Date:  2012-03-20       Impact factor: 12.449

6.  Lateralization in the dichotic listening of tones is influenced by the content of speech.

Authors:  Ning Mei; Adeen Flinker; Miaomiao Zhu; Qing Cai; Xing Tian
Journal:  Neuropsychologia       Date:  2020-02-10       Impact factor: 3.139

Review 7.  Neural Entrainment and Attentional Selection in the Listening Brain.

Authors:  Jonas Obleser; Christoph Kayser
Journal:  Trends Cogn Sci       Date:  2019-10-09       Impact factor: 20.229

Review 8.  Cortical travelling waves: mechanisms and computational principles.

Authors:  Lyle Muller; Frédéric Chavane; John Reynolds; Terrence J Sejnowski
Journal:  Nat Rev Neurosci       Date:  2018-03-22       Impact factor: 34.870

Review 9.  Effects and potential mechanisms of transcranial direct current stimulation (tDCS) on auditory hallucinations: A meta-analysis.

Authors:  Fuyin Yang; Xinyu Fang; Wei Tang; Li Hui; Yan Chen; Chen Zhang; Xing Tian
Journal:  Psychiatry Res       Date:  2019-01-15       Impact factor: 3.222

Review 10.  Repetition and the brain: neural models of stimulus-specific effects.

Authors:  Kalanit Grill-Spector; Richard Henson; Alex Martin
Journal:  Trends Cogn Sci       Date:  2006-01       Impact factor: 20.229

  1 in total

1.  Imagined speech can be decoded from low- and cross-frequency intracranial EEG features.

Authors:  Timothée Proix; Jaime Delgado Saa; Andy Christen; Stephanie Martin; Brian N Pasley; Robert T Knight; Xing Tian; David Poeppel; Werner K Doyle; Orrin Devinsky; Luc H Arnal; Pierre Mégevand; Anne-Lise Giraud
Journal:  Nat Commun       Date:  2022-01-10       Impact factor: 17.694

