Aisling E O'Sullivan, Michael J Crosse, Giovanni M Di Liberto, Edmund C Lalor.
Abstract
Speech is a multisensory percept, comprising an auditory and a visual component. While the content and processing pathways of audio speech have been well characterized, the visual component is less well understood. In this work, we expand current methodologies using system identification to introduce a framework that facilitates the study of visual speech in its natural, continuous form. Specifically, we use models based on the unheard acoustic envelope (E), the motion signal (M) and categorical visual speech features (V) to predict EEG activity during silent lipreading. Our results show that each of these models performs similarly at predicting EEG in visual regions and that respective combinations of the individual models (EV, MV, EM and EMV) provide an improved prediction of the neural activity over their constituent models. In comparing these different combinations, we find that the model incorporating all three types of features (EMV) outperforms the individual models, as well as both the EV and MV models, while it performs similarly to the EM model. Importantly, EM does not outperform EV and MV, which, considering the higher dimensionality of the V model, suggests that more data is needed to clarify this finding. Nevertheless, the performance of EMV, and comparisons of the subject performances for the three individual models, provide further evidence to suggest that visual regions are involved in both low-level processing of stimulus dynamics and categorical speech perception. This framework may prove useful for investigating modality-specific processing of visual speech under naturalistic conditions.
Keywords: EEG; EEG prediction; lipreading/speechreading; motion; temporal response function (TRF); visemes; visual speech
Year: 2017 PMID: 28123363 PMCID: PMC5225113 DOI: 10.3389/fnhum.2016.00679
Source DB: PubMed Journal: Front Hum Neurosci ISSN: 1662-5161 Impact factor: 3.169
Figure 1. Assessing the representation of visual speech in EEG (adapted from Di Liberto et al.). 128-channel EEG data were recorded while subjects watched videos of continuous, natural speech delivered by a well-known male speaker. Linear regression was used to fit multivariate temporal response functions (mTRFs) between the low-frequency (0.3–15 Hz) EEG recordings and seven different representations of the speech stimulus (EV, MV and EM models are not shown). Each mTRF model was then tested for its ability to predict EEG using leave-one-out cross-validation.
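The fitting and testing procedure described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a ridge-regularized linear regression on time-lagged stimulus features and a leave-one-trial-out loop that trains on the concatenated remaining trials. The function names, the sampling rate `fs`, the regularization value `lam` and the synthetic data are all hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def lag_matrix(stim, lags):
    """Stack time-shifted copies of stim (time x features), one block per lag (in samples)."""
    n_times, n_feats = stim.shape
    X = np.zeros((n_times, n_feats * len(lags)))
    for i, lag in enumerate(lags):
        cols = slice(i * n_feats, (i + 1) * n_feats)
        if lag > 0:            # EEG at time t is driven by the stimulus at t - lag
            X[lag:, cols] = stim[:-lag]
        elif lag < 0:          # negative lags look ahead in the stimulus
            X[:lag, cols] = stim[-lag:]
        else:
            X[:, cols] = stim
    return X

def fit_trf(X, eeg, lam):
    """Ridge solution w = (X'X + lam*I)^-1 X'Y (lam is an assumed hyperparameter)."""
    XtX = X.T @ X
    return np.linalg.solve(XtX + lam * np.eye(XtX.shape[0]), X.T @ eeg)

def pearson_columns(a, b):
    """Pearson's r between matching columns of a and b."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))

def loo_prediction_r(stims, eegs, lags, lam=1.0):
    """Leave-one-trial-out: fit on all other trials, predict the held-out trial."""
    Xs = [lag_matrix(s, lags) for s in stims]
    scores = []
    for k in range(len(Xs)):
        X_train = np.vstack([X for i, X in enumerate(Xs) if i != k])
        Y_train = np.vstack([y for i, y in enumerate(eegs) if i != k])
        w = fit_trf(X_train, Y_train, lam)
        scores.append(pearson_columns(Xs[k] @ w, eegs[k]))
    return np.mean(scores, axis=0)  # mean prediction correlation per channel

# Synthetic demonstration (not real EEG): 5 trials, 1 stimulus feature, 4 channels.
rng = np.random.default_rng(0)
fs = 64                                                   # assumed sampling rate in Hz
lags = np.arange(int(-0.05 * fs), int(0.5 * fs) + 1)      # roughly -50 ms to 500 ms
true_w = rng.standard_normal((len(lags), 4))
stims = [rng.standard_normal((fs * 30, 1)) for _ in range(5)]
eegs = [lag_matrix(s, lags) @ true_w + rng.standard_normal((fs * 30, 4)) for s in stims]
r = loo_prediction_r(stims, eegs, lags)
```

In this sketch a combined model such as EM would correspond to concatenating the feature columns of the individual stimulus representations before building the lag matrix; that is an assumption about how the combinations could be formed, not a statement of the authors' implementation.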
Figure 2. (A) Grand-average (N = 21) EEG prediction correlations (Pearson’s r) for the visual speech models (mean ± SEM) for low-frequency EEG (0.3–15 Hz). The ▴ indicates the models that outperform the other models (▴ p < 0.05), except for EV vs. M (p = 0.263). There is no difference in performance between these three models (p > 0.05). The dotted line represents the 95th percentile of chance-level prediction accuracy. (B) The prediction accuracy (N = 21) for normal and randomized visemes within their active time points. (C) Correlation values between recorded EEG and that predicted by each mTRF model for individual subjects. The subjects are sorted according to the prediction accuracies of the viseme model. (*p < 0.05, **p < 0.005, ***p < 0.001).
Figure 3. Spatiotemporal analysis of mTRF models for natural visual speech. mTRFs plotted for envelope (A), motion (B) and visemes (C) at peri-stimulus time lags from −50 ms to 500 ms, for representative central and occipital electrode channels. The dotted line separates vowel and consonant visemes. The phonemes contained within each viseme category are shown on the left. (D) Topographies of P1 (140 ms) and N1 (234 ms) TRF components for the envelope model. (E) Topographies of P1 (125 ms) and N1 (210 ms) TRF components for the motion model. (F) Topographies for representative vowel and consonant visemes corresponding to their P1 time points. All visemes have a similar distribution of scalp activity. The black markers in the topographies (D–F) indicate the channels for which the corresponding TRFs (A–C) were plotted. The viseme mTRF and the topographies use the same colorbar.