Literature DB >> 32813712

Dynamic estimation of auditory temporal response functions via state-space models with Gaussian mixture process noise.

Sina Miran¹, Alessandro Presacco², Jonathan Z Simon^2,3,4, Michael C Fu^2,5, Steven I Marcus^2,3, Behtash Babadi^2,3.

Abstract

Estimating the latent dynamics underlying biological processes is a central problem in computational biology. State-space models with Gaussian statistics are widely used for estimation of such latent dynamics and have been successfully utilized in the analysis of biological data. Gaussian statistics, however, fail to capture several key features of the dynamics of biological processes (e.g., brain dynamics) such as abrupt state changes and exogenous processes that affect the states in a structured fashion. Although Gaussian mixture process noise models have been considered as an alternative to capture such effects, data-driven inference of their parameters is not well-established in the literature. The objective of this paper is to develop efficient algorithms for inferring the parameters of a general class of Gaussian mixture process noise models from noisy and limited observations, and to utilize them in extracting the neural dynamics that underlie auditory processing from magnetoencephalography (MEG) data in a cocktail party setting. We develop an algorithm based on Expectation-Maximization to estimate the process noise parameters from state-space observations. We apply our algorithm to simulated and experimentally-recorded MEG data from auditory experiments in the cocktail party paradigm to estimate the underlying dynamic Temporal Response Functions (TRFs). Our simulation results show that the richer representation of the process noise as a Gaussian mixture significantly improves state estimation and capturing the heterogeneity of the TRF dynamics. Application to MEG data reveals improvements over existing TRF estimation techniques, and provides a reliable alternative to current approaches for probing neural dynamics in a cocktail party scenario, as well as attention decoding in emerging applications such as smart hearing aids. Our proposed methodology provides a framework for efficient inference of Gaussian mixture process noise models, with application to a wide range of biological data with underlying heterogeneous and latent dynamics.

Entities: Chemical

Mesh：

Year: 2020 PMID： 32813712 PMCID： PMC7485982 DOI： 10.1371/journal.pcbi.1008172

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

Introduction

Extracting the latent dynamics that govern biological processes from noisy and limited measurements is a long-standing challenge in computational biology. From the signal processing perspective, state-space modeling is a natural and commonly-used framework for estimation of such latent dynamic processes, i.e., the states, under limited observations [1]. While traditionally used in application domains such as control system design [2], tracking [3], and finance [4], this framework has recently been utilized in the analysis of neural data [5-11]. State-space models (SSMs) often consist of two equations: the state (evolution) equation, to describe the dynamics of the latent process (e.g., the intrinsic level of an internal neural state variable), and the observation equation, to illustrate how the externally-measured observations are related to the process. In signal processing applications, these equations are typically described in a parametric fashion using domain-specific expert knowledge of the problem, and parameter estimation is mostly performed via Expectation Maximization (EM) [12, 13] or Variational Inference (VI) [14, 15]. To better model the state evolution, in addition to expected measurement uncertainties, additive noise terms are often explicitly included in both the state and observation equations. In traditional applications, i.i.d. Gaussian statistics are assumed/imposed on these noise terms to account for the aggregate uncertainties and mismatches in the model. Under linear dynamics and observations, Gaussian noise, and fixed model parameters, Minimum Mean Square Error (MMSE) state estimation is conducted by the well-known Kalman filter and smoother [1]. For more general SSMs, Sequential Monte Carlo (SMC) methods can be used for MMSE state estimation [16]. In the context of neuroimaging data analysis, SMC methods have been utilized in MEG dipole modeling and source localization [7-9]. Gaussian statistics, however, are often inconsistent with the empirical histograms of the observations in various applications, including neuroimaging data analysis. For instance, in MEG analysis, the observation noise consists of intrinsic magnetic noise, ocular or motion-induced artifacts, as well as background activity unrelated to the stimulus. While the intrinsic noise can be reliably modeled by Gaussian statistics and estimated from stimulus-free measurements in experimental settings, the artifacts and neural background activity are manifestly non-Gaussian and non-stationary. However, when there is direct access to the observed signals, source separation techniques have been successfully utilized to remove and mitigate these latter sources of uncertainty [17-22]. Similarly, the state model noise terms introduced above, often referred to as process noise, do not actually follow Gaussian statistics in various real-world applications [23, 24]. This is mainly due to two reasons: First, in time series analysis, abrupt state changes may not be well represented by Gaussian statistics. Second, in practice, the statistics of the process noise heavily depend on the specifics of the experimental design, such as the task demand and subject’s performance, as well as other exogenous variables not accounted for. Critically, unlike the case of the observations, states are only indirectly observed, which limits the utility of source separation techniques. Finally, despite the negative connotation of the word “noise”, the process noise also captures the model-critical stochasticity of the state evolution. As such, the goal is to model and account for said stochasticity, as opposed to removing it as in the case of observation noise. This issue is particularly important in modeling brain function as a latent dynamic process: taking the states to represent the underlying neural circuits that process sensory stimuli, the process noise then consists of both the underlying behaviorally- and stimulus-driven dynamics as well as the background neural activity (not necessarily evoked by the stimulus or behavior), which are typically quite structured and far from being Gaussian. In this context, the state evolution model is more prone to model mismatch and biases, as compared to the observation equation, considering that we generally have more control over the measurement system than the generative mechanism governing the latent process. As a result, the empirical histogram of the process noise (which can be computed from state estimates) could exhibit multimodal morphology, with each mode corresponding to a different exogenous process driving the state dynamics during specific portions of the experiment. This has led researchers to study SSMs with a Gaussian Mixture (GM) process noise [25-29] considering that a GM can, in principle, approximate any multimodal density [30]. These existing results primarily focus on state estimation and approximation of filtering and smoothing densities under a fixed or known GM noise density. As such, parameter estimation for a GM process noise in SSMs has not been well-studied. Switching SSMs has been another direction of research in extending linear Gaussian SSMs to cope with nonstationarity, model mismatch, and exogenous processes [15, 31–39]. In this approach, several linear Gaussian SSMs are considered to underlie the observed time series data, which switch according to a Hidden Markov Model (HMM). Although the filtering and smoothing densities in this model take a GM form, the potential multimodality of the process noise is not explored or modeled in this approach. In addition, parameter estimation for switching SSMs is a challenging task in general, due to the intricate dependence of the data likelihood on the parameters. When the states are directly observable, the resulting models are known as Markov-switching Autoregressive (MSAR) models, which notably admit parameter estimation via the EM algorithm [32, 40]. However, for general switching SSMs, parameter estimation often requires computationally intensive numerical optimization steps [33, 35, 36]. In this work, we fill this gap by developing an EM-based algorithm for estimating the parameters of a GM process noise model from the observations in an SSM. In our model, the process noise is not drawn i.i.d. from a GM but instead, a GM component is chosen at random for a window of fixed (but arbitrary) length, and the process noise within the window is drawn from said component. The parameters of the GM are unknown. The EM algorithm has been widely used for parameter estimation both in state-space modeling [13] and in GM clustering [41], which makes it a promising candidate for our setting. The EM framework in this setting, however, results in intractable expectations for parameter updates. We address this issue by leveraging a Sequential Monte Carlo Expectation Maximization (SMCEM)-type algorithm [42] to approximate the expectations using smoothed particles obtained through SMC. A major drawback of particle smoothing approaches is their excessive computational requirements, or equivalently suffering from sample depletion as the dimension of the target densities grows while fixing the computational costs [43]. As a more scalable alternative, we develop another method of approximating the expectations based on closed-form approximations to the smoothing densities as well as their one-step cross covariances. To this end, we adopt the two-filter formula for smoothing [26] and devise a belief propagation algorithm in our setting. As a result, the computational complexity of the E-step in EM for a GM process noise would be comparable to that of a conventional Gaussian process noise, akin to performing parallel Kalman filtering and smoothing. To demonstrate the benefits of a GM process noise and the efficacy of the developed estimation framework, we consider two experimental paradigms: a dynamic at-will attention switching task in a realistic cocktail party scenario, in which the listener maintains attention to one out of two competing speakers, while being able to switch attention between the two speech streams at will; and an instructed attention switching task in a more restricted cocktail party scenario, in which the listener maintains attention to one out of two competing speakers for the first half of a trial and then switches attention to the other speaker. The cocktail party is among the key paradigms in studying the neural dynamics underlying complex auditory processing [44, 45]. One of the most recent quantitative approaches in uncovering these neural dynamics from neuroimaging data is based on the Temporal Response Function (TRF) model [46]. The TRF can be considered as an evolving Finite Impulse Response (FIR) filter which gets convolved with speech features in time, e.g., the speech envelope, to produce the auditory neural response observed through neuroimaging modalities such as electroencephalography (EEG) and magnetoencephalography (MEG) [47]. The TRF framework has resulted in new insights into the mechanisms of speech processing in the brain in the cocktail party scenario [45, 48–50]. For instance, TRF components at specific lags may exhibit peaks which arise, persist, and disappear over time according to the attentional state of the listener [51]. The different local dynamics of TRF components under each of these conditions motivates a GM density to capture such evolution patterns. Dynamic estimation of TRFs was first discussed in [47] using a Recursive Least Square (RLS) algorithm. However, smoothing estimates and state-space modeling are more robust than RLS and filtering estimates in performing a comprehensive dynamic analysis of TRFs when data from multiple trials is available. Thus, we study dynamic estimation of TRFs using SSMs and apply our SSM framework with a GM process noise to both simulated and experimentally recorded MEG data under a dual-speaker environment where the subject switches attention between the two speakers at will. The results show that our proposed algorithm can effectively recover the multimodal structure of the process noise from SSM observations, and that having a richer and more realistic representation of the process noise allows capturing the TRF dynamics more precisely and more consistent with the subjects’ behavioral reports, as compared to the conventional Gaussian SSM or RLS estimation. While our proposed framework is motivated by and applied to data from auditory experiments, it is applicable to general state-space modeling problems in which states exhibit heterogeneous and recurring local dynamic patterns.

Results

In this section, we demonstrate the utility of our proposed algorithms in estimating TRFs from auditory neural responses to speech, using both simulated and experimentally-recorded MEG data. Before doing so, we will give an overview of the TRF model, existing estimation frameworks, and the benefits of our GM SSM framework for TRF estimation.

The TRF model

Consider a cocktail party setting [45], in which a subject is listening to two speakers simultaneously, but only attending to one of the speakers. While the subject is performing this task, the neural response is recorded using MEG. The TRF is a commonly used linear encoding model that relates the speech features to the neural response, by generalizing the concept of event-related evoked responses: instead of averaging over multiple trials with the same stimulus to obtain the evoked response, the TRF kernel is obtained by averaging the effect of a diverse set of speech stimuli, presented as a continuous time series, and hence results in a generalizable encoding model (See Fig 1 for a schematic depiction). The speech features used in TRF models have included the acoustic envelope, acoustic onsets, phoneme representations, word frequency measures, and semantic composition [52-54]. In a multi-speaker scenario, multiple TRFs are used to capture the effect of the speech features of each speaker to the neural response.

Fig 1

Schematic depiction of the TRF model. The speech features (left, e.g., acoustic envelope) are convolved with the TRF (top) to predict the auditory neural response (right).

Schematic depiction of the TRF model. The speech features (left, e.g., acoustic envelope) are convolved with the TRF (top) to predict the auditory neural response (right). Existing results in auditory neuroscience [11, 46–48, 51, 55] have focused on studying the behavioral significance of the various peaks in the TRF. For instance, the TRF exhibits an early positive peak at around 50 ms, referred to as the M50 component, which is known to represent the encoding of the acoustic envelope. A later negative peak at around 100 ms lag, referred to as the M100 component, has shown to have an attentional modulation effect, so that it appears to have a larger magnitude for the attended speaker’s TRF, compared to the unattended speaker’s TRF. The M50 component is attributed to the effect of early auditory processing in the brain and is equally represented in both speakers’ TRFs, while the M100 component represents the later processing stages that segregate the attended speaker from the unattended one [48]. TRFs are commonly assumed to be static during the duration of an auditory experiment, and are estimated using regularized least squares [56, 57] or boosting [46, 48, 58]. Dynamic estimation of the TRFs, on the other hand, can provide insights into the underlying neural dynamics that process speech in the cocktail party setting, and has significant implications for the design of non-invasive brain-machine interface devices involving auditory processing, such as the emerging ‘smart’ hearing aid technology that utilizes neural signals to steer the hearing aid parameters in real-time. Dynamic estimation of TRFs was first discussed using a regularized RLS framework in [47]. This method considers changes in the TRFs over consecutive non-overlapping time windows of small length, and updates the estimates of the TRFs in a recursive fashion as more data becomes available (See Methods for more details). As such, it provides filtered estimates of the TRFs and is suited for real-time applications. Leveraging SSMs for representation and estimation of the TRFs has the advantage of providing smoothed estimates and directly modeling the evolution of the TRFs through the state equation, and thereby resulting in a more precise dynamic analysis of the TRFs in the off-line fashion. In this work, we consider linear Gaussian SSMs and linear SSMs with GM process noise. As we will demonstrate in the following two subsections, the linear SSMs with GM process noise have the additional advantage of accounting for the heterogeneity of the TRF dynamics. In what follows, we consider regularized RLS estimates of the TRF, estimates of the TRF using a Markov Switching Autoregressive (MSAR) model, and smoothed TRF estimates from a linear Gaussian SSM as benchmarks (See Methods for more details on these algorithms).

Application to simulated data

Consider a 90 s long cocktail party experiment, in which the subject is listening to two speakers simultaneously and is instructed to switch attention between the two every 15 s starting at time 7.5 s. We synthesize the putative TRF dynamics as shown in Fig 2A, based on the relevance of different TRF peaks. We use a sampling rate of F = 100 Hz and a length of 250 ms for the TRFs. The TRFs are represented using a dictionary with five Gaussian atoms with variances of 0.018 s2 whose means are separated by 50 ms increments starting from a lag of 0 ms to 200 ms. Furthermore, we consider a piecewise-constant model for the TRFs over windows of length 300 ms. Letting be the dictionary, the TRF at the nth window is defined as , where x is the state vector at window n. The SSM governing the state evolution is of the form x = α x + w, where α < 1 is a constant and w is the process noise. Finally, the observed neural response is related to the states by , where S are the speech features of the two speakers relevant to window n and v is the i.i.d. Gaussian observation noise, i.e., (See Methods for more details on the TRF and state-space models).

Fig 2

Designed simulation study: A) Heatmaps of the synthetic TRFs in time for a two-speaker cocktail party scenario, where the M100 magnitudes are attention-modulated. B) Example instances of speaker 2’s TRF when the speaker is attended (left plane) and unattended (right plane). C) Oracle histogram of process noise in (14) along the M100 dimension of speaker 2, which is computed from (A), and the fitted GM as the oracle GM fit. Fig 2A shows the synthesized TRF heatmaps for speakers 1 and 2, where the corresponding states are designed such that the M50 component stays relatively constant for the two speakers, the M100 component is modulated by the attentional state, and a common high-latency component at 200 ms varies independently of the subject’s attention. Fig 2B shows two snapshots of the TRF of speaker 2 at 10 s, when speaker 2 is attended, and at 85 s, when speaker 1 is attended. It is worth noting that the corresponding states in Fig 2A are not generated from an SSM. However, the relatively smooth temporal changes of the TRFs in Fig 2A (representing neural activity in controlled experimental conditions) makes the SSM model a suitable candidate for dynamic TRF analysis. Indeed, the TRF components at lags of 100 ms and 200 ms exhibit heterogeneous dynamics across the trial, including periods of increasing, decreasing, and remaining relatively constant, which model the changes in auditory state throughout the experiment. Such dynamics can be modeled using a multimodal process noise w. Fig 2C shows the histogram of true w samples along with the 3rd state dimension of speaker 2’s TRF (corresponding to the M100 component). The true process noise samples are computed as , assuming that the true states (x’s) in Fig 2A are available to an oracle. We refer to this histogram as the oracle histogram and to the maximum-likelihood GM density fit to these oracle samples as the oracle GM fit in Fig 2C. The constant α is chosen close to and less than one to enforce temporal continuity. We assume that the TRF dynamics are governed by one mixture component in each window of length 1.5 s. We simulate the observed neural response y using two speech signal envelopes as the stimulus vectors (See Methods for more details on the parameter settings). Fig 3 shows the convergence of the estimated parameters using the proposed EM algorithm in comparison to those given by the oracle GM fit for a nominal observation SNR of 6.7 dB, using the closed-form approximation approach. The number of mixture components is chosen as 5 using the Akaike Information Criterion (AIC). The observation noise variance σ2 is also estimated within the EM algorithm. The panels for the means and diagonal covariances in Fig 3 correspond to the 3rd state dimension of speaker 2’s TRF (i.e., the M100 component) from Fig 2C. The mixture probabilities and means of the oracle GM fit are recovered within 30 EM iterations. The covariance elements, however, take more iterations to converge and tend to underestimate those of the oracle GM fit. This shows that at the nominal SNR of 6.7 dB in our simulation, the algorithm is more sensitive to recovering the average TRF dynamics in each 1.5 s window than to retrieving the detailed variations within the window.

Fig 3

Convergence of Gaussian mixture parameters for M = 5 in the EM algorithm with closed-form approximations: A) Mixture probabilities. B) Mixture means (along the M100 component of speaker 2 as an example). C) Mixture variances (along the M100 component of speaker 2 as an example). Bold dash lines show the corresponding parameters of the oracle GM fit. D) GM densities (along the M100 component of speaker 2 as an example). It is noteworthy that the initialization points in Fig 3C, given by the estimated process noise variance in a Gaussian SSM, are approximately 100 times larger than those given by the oracle GM fit. Fig 3D shows the corresponding estimated process noise density after 200 EM iterations (blue trace), the oracle GM fit (red trace), and the Gaussian model fit obtained from a linear Gaussian SSM used for EM initialization (yellow trace). While the estimated GM process noise density using our proposed approach closely matches that given by the oracle GM fit, the process noise density obtained by a linear Gaussian model is heavily biased and is not able to capture the multimodal nature of the process. Note that while Fig 2C alludes to a true density with 3 GM components, the AIC criterion chose 5 GM components. Nevertheless, the joint updating of the means, variances, and mixture components (Fig 3A, 3B and 3C) results in a final density estimate that matches the putative true density with 3 GM components (Fig 3D). As such, our algorithm exhibits robustness to overestimation of the number of mixture components. To ease reproducibility, we have archived a MATLAB implementation of the closed-form approximation method in the GitHub repository, which reproduces the results of Fig 3 [59]. Convergence curves for the Monte Carlo approximation method are previously presented in [60], and are omitted here for brevity. Fig 4 shows the normalized RMSE in state estimation with respect to the original states in Fig 2A for nominal observation SNRs in the range [-5.3, 9.7] dB with 3 dB increments. The results are averaged over 10 realizations at each SNR value. The SSMs clearly outperform the RLS and MSAR algorithms in recovering the true states. Also, the SSM with GM process noise with either the closed-form or particle smoothing approximations outperforms the Gaussian SSM. We have considered a total of 2000 particles for the particle smoothing algorithm (Approach 1) so that state estimates are comparable to those obtained by the closed-form approximation (Approach 2). This resulted in a ten-fold increase in the run-time compared to the closed-form approximation method (61.50 seconds and 5.57 seconds for Approaches 1 and 2, respectively, per EM iteration, on a typical desktop workstation for the settings used in the simulation), which shows the advantage of using the closed-form approximation method. Examples of the estimated TRFs of speaker 1 under the low nominal observation SNR of -5.3 dB are shown in Fig 5. The MSAR (panel B) and RLS estimates (panel C) exhibit the highest variability compared to the ground truth in Fig 5A (imported from Fig 2A). While the Gaussian SSM estimate in Fig 5D fails to capture the rapid M100 dynamics as well as the steady M50 component (note the M50 and M100 estimates within the dashed rectangles), the estimate from the SSM with GM process noise in Fig 5E is nearly indistinguishable from the ground truth TRF in Fig 5A.

Fig 4

Fig 5

Example dynamic TRF estimates for speaker 1 under the low nominal observation SNR of −5.3 dB: A) The ground truth TRF. B) MSAR. C) Regularized RLS. D) Linear Gaussian SSM. E) Linear SSM with GM process noise. The dashed rectangles highlight an example difference of these estimates for the sake of comparison.

Averaged normalized RMSE in state estimation computed over 10 runs of observation noise at each SNR value for dynamic TRF estimation algorithms, namely, MSAR, regularized RLS, linear Gaussian SSM, and linear SSM with GM process noise using closed-form and Monte Carlo particle smoothing approximations. States and noise parameters are both estimated simultaneously from the observations in each run. Example dynamic TRF estimates for speaker 1 under the low nominal observation SNR of −5.3 dB: A) The ground truth TRF. B) MSAR. C) Regularized RLS. D) Linear Gaussian SSM. E) Linear SSM with GM process noise. The dashed rectangles highlight an example difference of these estimates for the sake of comparison.

Application to experimentally-recorded MEG data

We present the analysis of data from two separate attention switching experiments, which we refer to as the at-will and instructed attention switching experiments. In the at-will attention switching experiment, subjects listened to a speech mixture, and were instructed to start attending to the male speaker first, and then to switch their attention between the two speakers at their own will for a minimum of one and a maximum of three times during each trial. In the instructed attention switching experiments, the subjects listened to a speech mixture, and were instructed to start attending to one speaker first and then to switch their attention to the other speaker halfway through the trial. The instructed attention switching experiment consists of data from 7 subjects, with 6 trials each, while the data from the at-will attention switching experiment pertains to 3 trials of one subject (See Subjects, Stimuli, and Procedures subsection in the Methods for more details). Although reliable group-level conclusions for the challenging at-will attention switching experiment require data from more subjects and trials, given the novelty of this experimental paradigm and its potential interest to the auditory attention decoding research community, we have included the analysis of data from this one subject, separately from the instructed attention switching data. In addition, a GM process noise in SSMs would be more beneficial in at-will attention switching experiments, as it can capture the rapid dynamics that underlie attention switching instances more reliably.

TRF estimation results and discussion

We set the TRF length to 300 ms and consider TRFs to be piece-wise constant over windows of length 400 ms. Also, we assume that the TRF dynamics are governed by one mixture component in each window of length 2 s. As before, we represent the TRFs over a Gaussian dictionary with means separated by 20 ms starting from 0 to 280 ms, and variances of 8.5 × 10−3 s2. To restrict the dynamic range of the process noise w for the sake of robustness, we consider Inverse Gamma (IG) conjugate priors [61] on the diagonal elements of the process noise covariance matrices. Note that in the absence of such priors, the EM algorithm would likely result in TRFs that are highly variable in time and with no meaningful morphological structure (See Methods for more details on parameter settings). Fig 6 shows example TRF estimates for two trials of the subject in the at-will attention switching experiment. The vertical dashed lines mark reported attention switches by the subject. For the sake of brevity, hereafter we only present TRF estimates based on the RLS, Linear Gaussian SSM and Linear SSM with GM process noise. The number of mixture components for the process noise was set to 3 for trial 1 and 4 for trial 2, using the AIC criterion. Row A shows speaker 1’s TRF estimate using RLS, which exhibits the highest variability. Rows B and C show the TRF for the Gaussian SSM and the SSM with GM process noise, inferred using the closed-form approximation, respectively. Although the estimated process noise variance in the GM case is controlled by that of the Gaussian case in each dimension, we observe that the estimates in row C clearly delineate the heterogeneity of the dynamics of the various TRF components, which are blurred by the linear Gaussian SSM estimates of row B. In other words, the multimodal representation of the process noise allows the model to adapt to rapid changes governed by the subjects’ behavior. Row D displays speaker 2’s TRF estimate using the linear SSM with GM process noise. Comparing rows C and D, we observe the aforementioned attention modulation effect in the magnitude of the M100 components. To illustrate this effect further, row E shows the difference between the M100 magnitudes of the TRFs of speakers 1 and 2, where we locate the M100 at each time as the smallest TRF elements in the [0.1, 0.2] s lag interval. Thus, when speaker 1 (2) is attended, we expect this difference to be positive (negative). The attention decoding accuracy in each trial can be computed by comparing the difference of the M100 magnitudes with level 0 at each time (horizontal dashed line in Fig 6E) considering the reported attended speaker and summing over all the intervals where the M100 of the attended speaker exhibits a larger magnitude than that of the unattended speaker. Note that this decoding strategy is purely based on the TRF estimates in a single trial. As such, it would not be as accurate as the state-of-the-art attention decoding methods that use more complex algorithms and extensive training data. The M100 differences for the RLS exhibit high variability (blue traces), and result in inconsistencies with the reported attended speakers (e.g., trial 1 after the 35 s mark, downward arrow). The M100 differences obtained by the linear Gaussian SSM estimates seem to overly smooth those of the RLS (e.g., trial 2, near the 10 s mark, downward arrow). The M100 differences obtained from the proposed linear SSM with GM process noise, however, provide a desirable compromise between these two extremes: Compared to the linear Gaussian SSM, the M100 differences benefit from the clearly delineated TRF dynamics and can result in earlier detection of an attention switch, leading to higher attention decoding accuracy. Instances of this advantage are marked by green arrows in row E, for both trials. Decoding the attention based on the sign of the M100 differences results in misclassification rates of (6.74%, 12.37%, 22.94%) for trial 1 and (31.72%, 43.08%, 39.03%) for trial 2, respectively for the SSM with GM process noise, Gaussian SSM, and RLS, in accordance with the foregoing qualitative analysis.

Fig 6

TRF estimates for two example trials in the at-will attention switching experiment with vertical dashed lines showing the reported times of attention switches by the subject: A) RLS estimate (speaker 1 TRF). B) Gaussian SSM (speaker 1 TRF). C) SSM with GM process noise (speaker 1 TRF). D) SSM with GM process noise (speaker 2 TRF). E) M100 magnitude differences between the TRFs of speaker 1 and 2 for the different methods. The SSM with GM process noise better delineates the heterogeneity of the TRF dynamics and is more consistent with the subjects’ behavioral reports (see green arrows), while the RLS estimate is highly variable and the estimate of the Gaussian SSM is overly smooth. Fig 7 displays example TRF estimates for two trials in the instructed attention switching experiment, in a similar fashion as Fig 6. The subjects were instructed to switch their attention at the 30 s mark, halfway through the trial. Again, we observe that the SSM with GM process noise emphasizes the detailed dynamics of the TRFs which are sometimes blurred out in the Gaussian SSM or shown with high variability in the filtering estimates of RLS. This can result in stronger attention modulation effects, i.e., larger magnitude for the M100 of the attended speaker, or quicker transitions at the 30 s attention switching mark (marked by green arrows). The misclassification rates for SSM with GM process noise, Gaussian SSM, and RLS are respectively (25.78%, 26.81%, 31.45%) for trial 1 and (8.55%, 9.8%, 17.01%) for trial 2.

Fig 7

TRF estimates for two example trials in the instructed attention switching experiment with vertical dashed lines showing the 30 s mark where subjects were instructed to switch their attentional focus: A) RLS estimate (speaker 1 TRF). B) Gaussian SSM (speaker 1 TRF). C) SSM with GM process noise (speaker 1 TRF). D) SSM with GM process noise (speaker 2 TRF). E) M100 magnitude differences between the TRFs of speaker 1 and 2 for the different methods. The SSM with GM process noise better delineates the heterogeneity of the TRF dynamics and is more consistent with the subjects’ behavioral reports (see green arrows), while the RLS estimate is highly variable and the estimate of the Gaussian SSM is overly smooth. Fig 8 shows the group-level analysis results for the subject in the at-will attention switching experiment (left column) and the seven subjects in the instructed attention switching experiment (right column). The upper panels display the scatter and box plots for the computed attention decoding accuracies for the at-will and instructed attention switching experiments, respectively. With respect to the mean attention decoding accuracy (red plus sign), RLS estimates exhibit the poorest performance, and SSM with GM process noise results in a modest 1% to 3% improvement over the linear Gaussian SSM. The attention decoding accuracy, however, is not an ideal metric to assess the TRF estimation performance, due to the absence of a ground truth and arbitrary attentional variabilities during a single trial. The lower panels in Fig 8 summarize the results of the AIC model selection criterion for number of components in the GM process noise density. For each trial, we considered one to four GM components for the process noise, and the number with the lowest AIC score was chosen for the SSM with GM process noise in that trial. If one GM component is chosen based on the AIC criterion, the SSM with GM process noise reduces to the linear Gaussian SSM. The lower panels in Fig 8 show the normalized histogram of the chosen number of GM components across subjects and trials. It is worth noting that for the trials where four components were chosen, it is possible that a higher number of process noise GM components would have resulted in a lower AIC score. We observe that in none of the trials (out of 45) a linear Gaussian SSM is preferred over an SSM with GM process noise. This suggests that even when accounting for model complexity, the SSM with GM process noise fits the observed MEG data better than a linear Gaussian SSM, and can thus serve as a better explanatory model for the underlying biological processes.

Fig 8

Group-level results for attention decoding accuracy (upper panels) and the AIC model selection criteria (lower panels) for the at-will (left column) and the instructed (right column) attention decoding experiments. The upper panels display the scatter/box plots of trial-level mean attention decoding accuracy across subjects and trials, while the lower panels show the normalized histogram of the chosen number of GM components in each trial. The mean attention decoding accuracies, marked by red plus signs in upper panels, are (75.31%, 72.04%, 66.45%) and (63.24%, 62.74%, 60.28%) for (SSM with GM process noise, Gaussian SSM, RLS) in the at-will and instructed attention switching experiments, respectively. Although the SSM with GM process noise results in modest improvements in mean attention decoding accuracy compared to the linear Gaussian SSM, the AIC model selection criteria always prefers the former to the latter.

Discussion

We considered the problem of estimating latent dynamics of biological processes from noisy and limited observations, in which the commonly-used Gaussian statistics fail to capture the heterogeneous and switching nature of the dynamics. An instance of such dynamics are the neural processes that underlie auditory attention switching in a cocktail party consisting of multiple speakers. To address this shortcoming of Gaussian models, we utilized a SSM with GM process noise and devised an EM algorithm to estimate the parameters of the GM density from SSM observations. To approximate the intractable expectations in EM, we considered two approaches, one based on particle smoothing and another based on closed-form GM approximations to the smoothing densities. The main limitation of the first approach based on particle smoothing is the exponential growth of the number of particles in terms of the dimension of the smoothing densities. The second approach based on closed-form GM approximations significantly reduces the computational complexity by requiring a cubic dependence in the state dimensions, with an additional cubic dependence in the number of GM mixtures. In addition, the closed-form approximations require a linear SSM model to hold. If the underlying state-space model is indeed non-linear, linearization techniques such as those used in the extended [1] or unscented Kalman filter [62] are required, which may result in model mismatch. While both the observation and process noise are often non-Gaussian in practice, in our proposed framework, we have assumed Gaussian statistics to model the observation noise. This is motivated by the conventional preprocessing techniques applied to the observed data (e.g., source separation), which are able to remove the non-Gaussian noise components in such a way that the resulting ‘denoised’ observations admit Gaussian noise models. Nevertheless, the observation noise can also be modeled by a GM density, whose parameters can be estimated in a similar fashion to those of the process noise in our proposed framework. The resulting inference algorithm, however, would be more intricate and is deemed as a future extension of our current methodology. As mentioned in the Introduction, existing parameter estimation techniques for general switching SSMs are computationally demanding and often require direct maximization of the data likelihood via numerical methods. The MSAR method presented here circumvents this challenge by using a surrogate of the states to perform parameter estimation. It is noteworthy that our SSM with GM process noise can be thought of as a special case of an SSM with underlying MSAR state dynamics, in which the state transition probability matrix is constrained to have equal rows. Thus, another potential extension of our proposed methodology is to utilize the closed-form approximation approach for estimating the parameters of more general switching SSMs in a computationally scalable fashion. As our primary application, we considered the problem of dynamic TRF estimation from auditory neural responses to speech. We formulated the problem as a linear SSM with Gaussian or GM process noise, and compared the TRF estimates to those obtained by the RLS and MSAR algorithms. Application to simulated data shows that the algorithm can effectively recover the parameters of the underlying GM process noise and that the GM representation improves state estimation for a synthesized latent process exhibiting heterogeneous and rapid dynamics. Application to experimentally-recorded MEG from both at-will and instructed attention switching two-speaker cocktail party settings revealed that the proposed SSM with GM process noise model and inference methodology clearly delineates the heterogeneous dynamics of the TRF components that are otherwise not captured by the other techniques. While the proposed methodology can be used as a reliable estimation technique for auditory attention decoding in a cocktail party settings, it can be applied to a wider range of biological problems in which the underlying model exhibits heterogeneous and switching dynamics.

Methods

Ethics statement

All experimental protocols and procedures were approved by the University of Maryland Institutional Review Board, and written informed consent was obtained from participants before the experiments.

The main problem formulation

Consider the following generic discrete-time SSM with additive noise: where and represent the states and the observations at time n, respectively. We assume that the functional forms of f(.) and g(.) are known and fixed for n = 1, …, N, from domain-specific knowledge of the problem. Following our arguments in the Introduction on the utility of source separation techniques in removing the non-Gaussian components of the observation noise, we let be the i.i.d. Gaussian sequence of observation noise. The process noise w, on the other hand, accounts for the stochasticity of the state evolution. Note that from a neuroscience perspective, the process noise consists of both the underlying behaviorally- and stimulus-driven dynamics as well as the background neural activity (not necessarily evoked by the stimulus or behavior). While the terminology alludes to a zero-mean Gaussian disturbance, the process noise in this context is typically quite structured and far from being a zero-mean Gaussian disturbance. Nevertheless, we adhere to this terminology for the sake of consistency with existing literature on state estimation. To represent the process noise w, consider a GM with M mixture components and parameter set Θ ≔ {p1:, 1:, Σ1:} containing the mixture probabilities p1:, mean vectors 1:, and covariance matrices Σ1:. We model the state dynamics over K ≔ N/W consecutive non-overlapping windows of length W. Within each window i ∈ {1, …, K}, the process noise is drawn from one of the mixture components, which we denote by z ∈ {1, …, M}. Therefore, we have for n = (i − 1)W + 1, …, iW, independent of v, and we consider the z’s to be i.i.d. with P(z = m) = p for m = 1, …, M. In other words, z determines the active mixture component that governs the state dynamics in window i. This can also be interpreted as a switching Gaussian process noise. For the special case of W = 1, the resulting model could in principle approximate any arbitrary i.i.d. process noise w, as it is fitting a GM model to the process noise. In this case, the labels z of mixture components can vary at the same rate as that of the states and observations. Note that regardless of the choice of W, w has a GM distribution, but is only i.i.d. when W = 1. We hereafter refer to the foregoing process noise model, for general W ≥ 1, as the GM process noise model. Let denote the set of observations from 1 to N, i.e., y1:, and similarly define and for x1: and z1:, respectively. Our goal is to estimate the GM process noise parameters Θ from SSM observations . As estimation of the observation noise covariance R in EM is straightforward [13], we assume R to be fixed for convenience and will briefly review the update equations for R in the following subsection, if it needs to be estimated from the observed data. As mentioned in the Introduction, R can also be estimated from stimulus-free conditions. Finally, we adopt the Maximum Likelihood (ML) estimation framework to estimate Θ as follows: Despite its simple statement, the problem of Eq (2) is challenging due to the difficulties in computing the optimization argument, i.e., data likelihood, in a computationally scalable fashion. We will address this challenge in the forthcoming section.

Parameter estimation

We use the EM algorithm as a solution method for the ML problem in (2). The EM framework provides an iterative procedure to update the estimated parameter set with the guarantee that at iteration (ℓ + 1) we have where is the parameter set estimate from the ℓth iteration [12]. The EM algorithm guarantees convergence to a local maximum, and most of the work on escaping the undesirable local maxima in EM theory have focused on providing an informed initialization of the algorithm [63, 64]. As explained in the Model Parameter Settings subsection of the Methods, we use the fixed-interval smoothed estimates based on a Gaussian model to choose and initialize the algorithm. Let denote the set of latent variables in the SSM, which includes the states and the labels of active mixture component in each window. The EM algorithm performs the following two steps at the (ℓ + 1)th iteration and repeats them until convergence to a parameter estimate : where the surrogate function is a lower bound on the data log-likelihood. The expectation in the E-step is over the conditional density of . As all of the following expectations are also conditioned on and , we drop the conditioning in the notation for convenience, but keep the expectation subscript to denote the random variable with respect to which the expectation is taken. Also, hereafter the subscript (i, j) represents the time index of the jth sample in the ith window, i.e., n = (i − 1)W + j for brevity. The EM algorithm in Eq (4) in our setting can be expressed as follows: E-step: The surrogate function in the SSM is computed as where denotes the indicator function, c1 and c2 are terms not dependent on Θ, and π( is defined as which is computed based on the Gaussian density for w( in Eq (1) when z = m. If we decompose the conditional expectation in Eq (5) into two iterated conditional expectations with respect to and (where is dropped in the latter due to conditional independence), this equation can be written as where c3 is a constant and is the membership probability and can be expressed using Bayes’ rule as: The variable is defined similarly to (6) but for , which makes a constant with respect to Θ. M-step: In this step, we maximize the log-likelihood lower bound with respect to Θ. Differentiating (7) with respect to Θ, enforcing the condition , and invoking the dominated convergence theorem to change the order of expectation and differentiation, we obtain the following parameter updates for m = 1, …, M: where v( ≔ x( − f((x(). Remark 1. If the covariance matrix R of the Gaussian observation noise in (1) also needs to be estimated from , it can be included in the parameter set Θ. The update formula for in the EM framework then becomes [65] In addition, if the function f(⋅) is only known in parametric form, it is in principle possible to estimate it via the same EM framework. As an example, which we use for TRF modeling, consider f(x) = α x, where α is an unknown constant. Then, the coordinate descent update for α takes the form [13]: In the definition of in Eq (8), both the numerator and the denominator include exponential functions of the states. Therefore, the conditional expectations in Eq (7) and in the update equations above are intractable even if the joint smoothing density of is known in closed-form [29]. In this work, we use two different approaches to address this challenge: In Approach 1, we use Monte Carlo Approximations for computing the aforementioned expectations. While this approach is rather straightforward to implement, the resulting algorithm is computationally intensive. In Approach 2, we instead derive closed-form approximations to the densities required for computing the expectations. The underlying parameters can be updated recursively, which makes the resulting algorithm scalable with the problem dimension. The details of these approaches are given in S1 Appendix.

Dynamic estimation of the TRF

Consider a cocktail party setting [45], in which a subject is listening to two speakers simultaneously, but only attending to one of the speakers. While the subject is performing this task, the neural response is recorded using MEG. Let denote the auditory component of the neural response at time t ∈ {1, …, T}, extracted from multichannel MEG recordings via the Denoising Source Separation (DSS) algorithm [22, 66]. Also, let be a speech feature of speaker q ∈ {1, 2} at time t, e.g., the acoustic envelope, and denote by the vector containing the previous L features up to (and including) time t. In this work, we consider to be the acoustic envelope in log scale, which is known to be a reliable predictor of the neural response [47]. Other features such as phoneme representations, word frequency measures, and semantic composition have also been considered in the literature [52-54], and can also be included in . A widely-used linear stimulus-response model is given by: where is the concatenation of and as the TRFs at time t corresponding to speakers 1 and 2, respectively. Also, is the concatenation of the speech feature vectors at time t, and v represents the observation noise. In light of this model, and as mentioned in the Introduction, the TRF can be thought of as the impulse response of a linear, but time-varying, system representing the neural activity and taking as input the speech features of speaker q, for q = 1, 2. We assume and define the nominal observation SNR as , where is the average of the signal component in Eq (13) over the trial of length T. It is common to consider a piecewise-constant approximation to the TRFs over consecutive non-overlapping time windows of length t0, which is comparable to the length of the TRF L. In other words, for t ∈ {(n − 1)t0 + 1, …, nt0} and n ∈ {1, …, N} where N = T/t0 is assumed to be an integer without loss of generality. We then define , , and .

TRF estimation via regularized RLS

First, the TRFs are represented over a dictionary G, i.e, , in order to enforce smoothness in the lag domain and to mimic static TRF estimates [48, 49]. The dynamic TRF estimation framework of [47] can be stated as: where λ ∈ (0, 1) is the forgetting factor, γ is the regularization coefficient, h(.) can either be an ℓ1 or ℓ2 penalty [57], and is a block diagonal matrix with G containing the dictionary atoms. Similar to [11, 47], we consider a Gaussian dictionary where the D columns of G are shifted Gaussian kernels. The parameter λ in Eq (14) induces a trade-off between adaptivity and robustness of TRF estimation.

TRF estimation via MSAR modeling

While the RLS estimates of the TRF capture the dynamics via the forgetting factor mechanism, they are not capable of capturing abrupt and/or recurring state dynamics. MSAR models, on the other hand, explicitly model such dynamics and are thus a suitable class of models for TRF estimation. Given that the TRF is not directly observable, the conventional MSAR models are not readily applicable. In addition, the SSM extensions of MSAR models do not admit simple parameter estimation procedures. We thus consider the regularized least squares (LS) estimates of the TRFs, i.e., the RLS estimates with λ = 0, as a surrogate of the true TRFs, which can then be modeled as an MSAR process. To this end, let be the regularized LS estimates of the TRF. To capture the dynamics of , we consider a first-order Markov-switching process with J states. The underlying HMM is parameterized by the initial probabilities π, i = 1, 2, ⋯, J and transition probability matrix P, i, j = 1, 2, ⋯, J. Let s ∈ {1, 2, ⋯, J} denote the state at time n. Then, we have: where α is the rate of change of the TRF in state j, and is the i.i.d. sequence of process noise in state j, j = 1, 2, ⋯, J. The parameters to be estimated are . Let ω denote . Then, the MSAR estimates are given by: In S1 Appendix, we provide an EM-based algorithm for estimating the parameters and recursively computing ω.

TRF estimation via state-space models

The RLS estimate in (14) is a filtering estimate by design and is suited for real-time estimation of TRFs. For a more precise dynamic analysis of the TRFs in an off-line fashion, SSMs have the advantage of providing smoothed estimates and directly modeling the evolution of the TRFs through the state equation. We use the SSM below to represent the TRF dynamics and its relation to the neural response: where α ∈ (0, 1) controls the nominal rate of change of the TRF, similar to the effect of the forgetting factor λ in Eq (14) for the RLS framework. In [67], a correspondence between α and λ has been discussed which can result in the same filtering estimates of the SSM in Eq (17) with Gaussian noise and the RLS model in Eq (14), without any penalization. The parameter α can either be estimated in the EM framework as in Eq (12) [13], or it can be set based on the domain-specific knowledge of the problem to provide a target adaptivity-robustness trade-off, akin to choosing the forgetting factor in the RLS algorithm. The estimated TRFs in (17) are computed from the smoothing estimates as . By assuming a GM density for w in Eq (17), we can similarly obtain smoothing estimates of the TRFs, by using the two approaches discussed in the preceding section.

Model parameter settings

The following subsections provide detailed information on the choice of the various model parameters used in the simulation study and application to experimentally-recorded data from the Results section.

Parameter settings of the simulation study

For the simulation study, we use a sampling rate of F = 100 Hz and a length of 250 ms for the TRFs, i.e., L = 0.25F. Let G be a dictionary consisting of five Gaussian atoms with variances of 0.018 s2 whose means are separated by 50 ms increments starting from a lag of 0 ms to 200 ms. This results in and in Eqs (14) and (17). We consider a piecewise-constant model for the TRFs over windows of length 300 ms resulting in N = 300 TRF samples over the trial for each speaker. We consider W = 5, i.e., the TRF dynamics are governed by one mixture component in each window of length Wt0/F = 1.5 s. For simplicity, we consider Σ1: to be diagonal, which makes the parameter update formulas of Eqs. (S3) and (S5) in S1 Appendix to also take diagonal forms. The number of mixture components is chosen as M = 5 using the AIC criterion and log-likelihoods computed using Eqs. (S19) and (S20) given in S1 Appendix. The number of states J in the MSAR model can also be chosen via AIC, but we here take J to be the same as M for fairness of comparison with the SSM model with GM process noise. We also set the parameters of Algorithm S2 given in S1 Appendix as Γ = Γ = Γ = M. To initialize the EM algorithm, we use two methods: 1) initializing with , random means close to zero, and equal to the estimated process noise covariance in the linear Gaussian SSM, and 2) setting as the GM fit to the empirical samples of process noise in the linear Gaussian SSM, which are computed from the smoothed state estimates. In other words, a GM is fit on the state residuals, i.e., empirical process noise samples, where denotes the smoothed states using a linear Gaussian SSM. The state residuals here do not necessarily exhibit a clear multimodal histogram due to the Gaussian assumption in the model and the inaccuracies in state estimation. Nevertheless, a GM fit on the state residuals serves as a reasonable initialization for the EM algorithm in our experience. Note that in the simulation studies, we have used the first initialization strategy to show that under reasonable SNR conditions, the algorithm is able to initialize with large covariances, i.e., based on the linear Gaussian SSM estimates, and subsequently retrieve the concentrated mixture components. This is analogous to particle smoothing methods where the initial samples are drawn from a broad density and through consecutive weighting and resampling, the particles can eventually capture the underlying densities. In our experience, the second initialization strategy results in faster convergence, especially under poor SNR conditions, due to the extra information extracted from the residual estimates from the linear Gaussian SSM. Thus, for the real data analysis, we have used the second initialization strategy. For the forgetting factor λ in RLS, an effective estimation length [47] of 2 s is chosen to result in comparable TRF estimates to those of the SSM with α = 0.99. Also, γ in Eq (14) for an ℓ2 penalty is tuned through two-fold cross-validation. For the Gaussian SSM and the SSM with GM process noise, diagonal process noise covariance matrices are considered, and both the process and observation noise parameters as well as the states are estimated simultaneously for each trial run. We have considered a total of U = 2000 particles in Algorithm S1 given in S1 Appendix to approximate densities of dimension 2D(W + 1) = 60, so that state estimates are comparable to those obtained by the closed-form approximation. Note that the choice of the number of particles is critical for the performance of particle smoothing, as the number of particles required for stable estimation grows exponentially in the dimension of the densities.

Parameter settings of the experimentally-recorded data analysis

We set the TRF length to 300 ms and consider TRFs to be piece-wise constant over windows of length 400 ms. Also, we choose W = 5 to enforce that the TRF dynamics are governed by one mixture component in each window of length 2 s. We represent the TRFs over a Gaussian dictionary with means separated by 20 ms starting from 0 to 280 ms, and variances of 8.5 × 10−3 s2. The parameters λ and α are set to 0.92 and 0.97, respectively, to achieve comparable TRF estimates from Eqs (14) and (17). The ℓ2 penalty γ in (14) is determined via two-fold cross-validation. We consider diagonal covariance matrices for the process noise to reduce the dimension of Θ, and estimate the observation noise σ2 in the EM framework. The forgetting factor in Eq (14) enforces a temporal continuity in TRF estimates and increases robustness to noise and artifacts. The same effect can be replicated in the SSM of Eq (17) by considering α close to one and restricting the dynamic range of the process noise w. To enforce the latter, we consider IG conjugate priors [61] on the diagonal elements of the process noise covariance matrices. For the Gaussian SSM with and Q = diag([q1, …, q2]), the log-prior takes the form where are the parameters of the IG prior and c4 includes terms not dependent on q’s. The log-prior is then added to the surrogate Q-function of the EM algorithm, and κ determines the strength of the prior with respect to the complete data log-likelihood. We choose κ = N for the linear Gaussian case and κ = N/M for the linear SSM with GM process noise, to correct for the number of mixture components. We tune the IG parameters using empirical samples of the process noise from the RLS estimates, computed as . Thus, the process noise variance is controlled by the IG prior, which prohibits drastic temporal changes in the TRF. For the SSM with GM process noise, we also bound the elements of in each EM iteration such that the variance of the estimated GM process noise along each dimension is not larger than those of the linear Gaussian case, i.e., estimated q’s using the EM algorithm. Note that in the absence of such priors, the EM algorithm would likely result in TRFs that are highly variable in time and with no meaningful morphological structure.

Subjects, stimuli, and procedures

We have used data from two separate attention switching experiments in this work, which we refer to as the at-will and instructed attention switching experiments. Neuromagnetic signals were recorded at a sampling frequency of 2 kHz using a 157-sensor whole-head MEG system (Kanazawa Institute of Technology, Nonoichi Ishikawa, Japan) in a dim magnetically shielded room. The at-will attention switching dataset is a subset of recordings in [51], where the participants included five younger-adult (22-33 years old) native English speakers with normal hearing recruited from the University of Maryland. Only one of the subjects exhibited a meaningful auditory neural response (i.e., auditory DSS rotation matrix; see MEG Data Preprocessing subsection for details) with a reliable behavioral report. Two stories were presented diotically to subjects’ ears, one narrated by a male speaker and the other one by a female speaker. The stimuli consisted of two segments from the book, The Legend of Sleepy Hollow by Washington Irving. Subjects listened to trials of the same speech mixture (each 90 s in duration), and were instructed to start attending to the male speaker first, and then to switch their attention between the two speakers at their own will for a minimum of one and a maximum of three times during each trial. Subjects were also given a switching button that they were instructed to press every time they decided to switch attention. For each subject, 3 trials were recorded. Prior to the experiment, a single-speaker pilot study was performed where subjects listened to three 60 s trials with similar stimuli. Further experimental details can be found in [51]. The instructed attention switching dataset is from the recordings in [50], where participants included seven normal hearing young adults (20-31 years old). The stimuli consist of four segments from the book A Child’s History of England by Charles Dickens narrated by a male and female reader. Two different 60 s-long speech mixtures of the two speakers were generated, and each mixture was presented to subjects diotically for three trials. In each trial, subjects were instructed to focus on one speaker in the first 28 s of the trial, switch their attention to the other speaker after hearing a 2 second pause (between 28 s and 30 s time stamps), and maintain their focus on the latter speaker through the end of the trial. After completing the trials for each mixture, subjects answered comprehensive questions related to the passages they attended to. The MEG recording and preprocessing setup for this experiment is similar to that of the at-will attention switching experiment, and more details can be found in [50].

MEG data preprocessing

Three reference channels were used to measure and cancel the environmental magnetic field by using time-shift PCA [21]. All MEG channels and speech envelopes were band-pass filtered between 2 Hz and 8 Hz (delta and theta bands), corresponding to the slow temporal modulations in speech [46, 48], and downsampled to F = 100 Hz. Similar to [47, 50, 51], we used the DSS algorithm [22] on pilot trials to decompose the MEG data into temporally uncorrelated components. By using an averaging bias filter for promoting consistency across trials, we ordered the DSS components according to their trial-to-trial phase-locking reliability and chose the first component as the auditory neural response.

Supplementary methods.

This appendix includes: (i) detailed derivations of the two approaches used for computing the expectations in Eqs (9), (10), and (11), (ii) the criteria for model order selection, and (iii) details of the MSAR estimation procedure. (PDF) Click here for additional data file. 22 Apr 2020 Dear Prof. Babadi, Thank you very much for submitting your manuscript "Dynamic Estimation of Auditory Temporal Response Functions via State-Space Models with Gaussian Mixture Process Noise" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. The technical quality has generally been very appreciated, even though some issues need to be addressed. An aspect necessary to work on is the (neuro)biological relevance. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Daniele Marinazzo Deputy Editor PLOS Computational Biology Daniele Marinazzo Deputy Editor PLOS Computational Biology *********************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: Summary: The paper introduces a Gaussian Mixture (GM) state space method (SSM) designed to estimate the latent dynamics of neural systems. The method is particularly suitable in experimental settings were sudden switches between neural states are expected since they induces multi-modal distributions that are well-captured by GM models. one of these settings is dichotic listening, where the participant can switch their attention towards one of two simultaneously presented speech recordings. The core of the paper is a family of expectation-maximization EM algorithms for efficient maximum likelihood inference. The proposed algorithms cleverly are well designed and properly validated. The proposed method is definitely relevant for cognitive neuroscience as multi-modality is common in the analysis of brain signals and it can lead to meaningful interpretation of the data in terms of simpler latent signals. In general, while I highly appreciate the technical quality of the paper, I think that at the present state it is more suitable for a machine learning/sigma processing venue. However, the paper could fit in PLOS comp bio if the authors put some effort in improving the exposition and stressing the neuroscientific motivations. Major Comments: 1) I think that the structure of the paper is not very suitable to the PLOS comp bio audience. The core of the paper is purely methodological without much reference to neuroscience. The method section shoud ideally be placed after the results and some of the most technical details should be moved to supplementary material. The result section should begin a high level treatment of the method focused on its underlying assumptions and its connections with alternative methods commonly used in neuroscience (i.e. TRF). This will allow the less technically minded reader to understand the main results without getting bogged on technical details that are of higher interest for the machine learning and statistics audience. Emphasis should be given on the formulation and justification of the model instead of the algorithmic solution. 2) The results of the method should be compared with a wider range of baseline models. In particular, there need to be a quantitative comparison with Markov-switching autoregressive models as they are a obvious alternative to the GM approach. These models should be also discussion in the introduction. Minor comments: 1) Referring to w_n as noise in Eq.1 is rather confusing from a neuroscience point of view as it actually represent the cognitively modulated signal. This terminology is particularly confusing since w_n has a non vanishing expected value. Reviewer #2: “Dynamic Estimation of Auditory Temporal Response Functions via State-Space Models with Gaussian Mixture Process Noise” by Miran and colleagues tackles the estimation of the neural dynamics underlying auditory process from MEG recordings in a cocktail party setting. The authors develop a method based on Expectation-Maximization algorithm to infer the parameters of a Gaussian mixture process noise from state-space observations. The method is applied both to simulated and real MEG data. They state that representing the process noise as a Gaussian mixture process significantly improves state estimation and the underlying dynamics. Finally, application to MEG data reveals improvements over existing techniques that estimate the Temporal Response Functions (TFR). However the study is well introduced respect to previous literature, some reference about application of state space model to MEG data misses (see e.g. Sorrentino et al, 2009, 2014, Sommariva 2014). Overall the article is well written and provides a lot of mathematical details about the developed algorithms, some of which could be moved to Appendix to facilitate the reading. Indeed sometime the reader could loose the final aim to estimate the parameters of the Gaussian Mixture process noise. I have very few comments: - line 112: "To represent the process noise" shoud be To represent the process noise wn - In line 126 for the sake of clarity should be better replacing n1, n2 with 1 to N - some information on computational time on a generic machine could be added - in both simulated and read data, a discussion on sensitivity to chosen parameters (e.g. M, number of particles) could be given - an overall analysis of the drawbacks of the method could be added Reviewer #3: The authors present a state-space model in which the process noise is modeled by a mixture of Gaussians. They derive the EM updating equations, where some of the variables turn out to be intractable. The authors then provide two algorithms to estimate these: one based on Monte-Carlo and one based on approximative closed-form expressions based on Taylor approximations. The method is applied in a dynamic TRF estimation problem. In simulations, both approaches are almost equally accurate, whereas the latter is much faster in terms of computations. The paper is interesting and very well written. I only have one major comment (and several minor ones, see below). While the simulations are convincing, I find the results provided on real data a bit too anecdotal (two trials of a single subject). It would be fine as an illustrative example of a potential application for the proposed algorithm, but as the title does emphasize TRF estimation as the focus of the paper, I feel the real-data experiments are not that convincing. Would it be possible to do a group analysis and to show that the GM-based model results in a statistically significant improvement wrt RLS or a linear Gaussian SSM? I know it is hard due to lack of a ground truth TRF, but perhaps the attention decoding accuracy could be used as an indirect performance metric. A group-wise version of Fig. 6E may also be interesting (after aligning the attention switches of the different subjects), showing medians and quantiles across subjects. Detail comments: - In the introduction the authors motivate the use of GMs for the process noise, while claiming that a single Gaussian is sufficient to model measurement noise. I don't fully agree with that statement. In the case of neural signals there are often artifacts, which may result in outliers in the signals which typically do not follow Gaussian statistics. Moreover, the noise process in (30) is background MEG activity (unrelated to the stimulus response). Background MEG is known to change over time, i.e., different covariance matrices R would be 'active' in different segments of the data. So it would also make sense to model measurement noise in a similar fashion as proposed for the process noise. Can the method be extended to also include gaussian mixtures in the observation noise? (I am not asking to include this, but rather to briefly comment on this case and how difficult/trivial it would be to add this extension). -The SMCEM acronym is not explained in intro -Which bias filter was used in the DSS method? Usually repeated stimuli are required, but the description of the experiment protocol does not mention any repeated stimuli. Please elaborate. -page 19: I guess y_t is one of the DSS components? Please explain this. -it is mentioned that the parameter alpha can be estimated within the EM framework. It may be useful for the reader to have the corresponding updating equation (could be added in an appendix). In fact, it could be useful to provide all the updating equations for the model defined in (32) as a special case of the more general updating equations. - it is mentioned that -alternatively- the alpha parameter can be set based on the domain-specific knowledge of the problem to provide a desired adaptivity-robustness trade-off. I find this statement a bit strange, as the state dynamics are defined by the dynamics of the underlying (neural) processes. The word 'desired' therefore seems a bit strange here. Although I do understand what the authors mean (this parameter is indeed akin to the forgetting factor in RLS), I would rephrase or elaborate a bit. -the second strategy to initialize the EM process (page 24, line375): Is this based on the oracle gaussian fit in Fig. 2-C? Or do you mean to first run a naive model with a single Gaussian? Please explain this a bit better. - It is mentioned that the results of Fig. 3 are based on the first initialization strategy for the EM procedure. Is that also the case for Fig. 4 and 5? Are there no results for the second initialization strategy? -Fig. 5: it would help if the ground truth TRF heatmap is also added here for convenience -In the last sentence of the 'results' section, the authors provide attention decoding accuracies. How are these defined? Are they computed over the entire trial on a sample-by-sample basis? - The description of the data set mentions five subjects, yet only one subject is reported in the results. -Typo: Although the estimated process noise variance in in the GM case is controlled by... --> 2x 'in' ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Luca Ambrogioni Reviewer #2: No Reviewer #3: Yes: Alexander Bertrand Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see 6 Jul 2020 Submitted filename: SSGM_PLoS_Response.pdf Click here for additional data file. 21 Jul 2020 Dear Prof. Babadi, We are pleased to inform you that your manuscript 'Dynamic Estimation of Auditory Temporal Response Functions via State-Space Models with Gaussian Mixture Process Noise' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Daniele Marinazzo Deputy Editor PLOS Computational Biology Daniele Marinazzo Deputy Editor PLOS Computational Biology *********************************************************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #2: My previous comments have been sufficiently addressed. The manuscript was significantly improved and in my opinion could be accepted in the present form. Best, Annalisa Pascarella Reviewer #3: The authors have adequately addressed my comments and modified the paper accordingly. It is ready for publication. Congratulations on this nice work. One final suggestion (just as a side remark, I am not requesting any changes in the manuscript here). Concerning the issue with poor DSS outcomes on 4/5 subjects: it may be worth to also try the stimulus-aware spatial filtering method from Das et al., which is akin to DSS, but it can achieve higher SNRs and it does not use repeated trials for the creation of a bias filter. Das et al. "Stimulus-aware spatial filtering for single-trial neural response and temporal response function estimation in high-density EEG with applications in auditory research", NeuroImage, Vol. 204, 116211, 2020. ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #2: None Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #2: Yes: Annalisa Pascarella Reviewer #3: Yes: Alexander Bertrand 12 Aug 2020 PCOMPBIOL-D-20-00251R1 Dynamic Estimation of Auditory Temporal Response Functions via State-Space Models with Gaussian Mixture Process Noise Dear Dr Babadi, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Matt Lyles PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

24 in total

1. Variational learning for switching state-space models.

Authors: Z Ghahramani; G E Hinton
Journal: Neural Comput Date: 2000-04 Impact factor: 2.026

2. Estimating a state-space model from point process observations.

Authors: Anne C Smith; Emery N Brown
Journal: Neural Comput Date: 2003-05 Impact factor: 2.026

3. At what time is the cocktail party? A late locus of selective attention to natural speech.

Authors: Alan J Power; John J Foxe; Emma-Jane Forde; Richard B Reilly; Edmund C Lalor
Journal: Eur J Neurosci Date: 2012-03-30 Impact factor: 3.386

4. Neural coding of continuous speech in auditory cortex during monaural and dichotic listening.

Authors: Nai Ding; Jonathan Z Simon
Journal: J Neurophysiol Date: 2011-10-05 Impact factor: 2.714

5. Denoising based on time-shift PCA.

Authors: Alain de Cheveigné; Jonathan Z Simon
Journal: J Neurosci Methods Date: 2007-06-08 Impact factor: 2.390

6. Dynamical MEG source modeling with multi-target Bayesian filtering.

Authors: Alberto Sorrentino; Lauri Parkkonen; Annalisa Pascarella; Cristina Campi; Michele Piana
Journal: Hum Brain Mapp Date: 2009-06 Impact factor: 5.038

7. Robust decoding of selective auditory attention from MEG in a competing-speaker environment via state-space modeling.

Authors: Sahar Akram; Alessandro Presacco; Jonathan Z Simon; Shihab A Shamma; Behtash Babadi
Journal: Neuroimage Date: 2015-10-04 Impact factor: 6.556

8. Dynamic Estimation of the Auditory Temporal Response Function From MEG in Competing-Speaker Environments.

Authors: Sahar Akram; Jonathan Z Simon; Behtash Babadi
Journal: IEEE Trans Biomed Eng Date: 2016-11-15 Impact factor: 4.538

9. Enhanced detection of artifacts in EEG data using higher-order statistics and independent component analysis.

Authors: Arnaud Delorme; Terrence Sejnowski; Scott Makeig
Journal: Neuroimage Date: 2006-12-26 Impact factor: 6.556

10. The Multivariate Temporal Response Function (mTRF) Toolbox: A MATLAB Toolbox for Relating Neural Signals to Continuous Stimuli.

Authors: Michael J Crosse; Giovanni M Di Liberto; Adam Bednar; Edmund C Lalor
Journal: Front Hum Neurosci Date: 2016-11-30 Impact factor: 3.169

2 in total

1. A Speech-Level-Based Segmented Model to Decode the Dynamic Auditory Attention States in the Competing Speaker Scenes.

Authors: Lei Wang; Yihan Wang; Zhixing Liu; Ed X Wu; Fei Chen
Journal: Front Neurosci Date: 2022-02-10 Impact factor: 4.677

2. EEG alpha and pupil diameter reflect endogenous auditory attention switching and listening effort.

Authors: Stephanie Haro; Hrishikesh M Rao; Thomas F Quatieri; Christopher J Smalt
Journal: Eur J Neurosci Date: 2022-02-16 Impact factor: 3.698

2 in total