Christian Herff, Lorenz Diener, Miguel Angrick, Emily Mugler, Matthew C Tate, Matthew A Goldrick, Dean J Krusienski, Marc W Slutzky, Tanja Schultz.
Abstract
Neural interfaces that directly produce intelligible speech from brain activity would allow people with severe impairment from neurological disorders to communicate more naturally. Here, we record neural population activity in motor, premotor and inferior frontal cortices during speech production using electrocorticography (ECoG) and show that ECoG signals alone can be used to generate intelligible speech output that can preserve conversational cues. To produce speech directly from neural data, we adapted a method from the field of speech synthesis called unit selection, in which units of speech are concatenated to form audible output. In our approach, which we call Brain-To-Speech, we chose subsequent units of speech based on the measured ECoG activity to generate audio waveforms directly from the neural recordings. Brain-To-Speech employed the user's own voice to generate speech that sounded very natural and included features such as prosody and accentuation. By investigating the brain areas involved in speech production separately, we found that speech motor cortex provided more information for the reconstruction process than the other cortical areas.
Keywords: BCI; ECoG; brain-computer interface; brain-to-speech; speech; synthesis
Year: 2019; PMID: 31824257; PMCID: PMC6882773; DOI: 10.3389/fnins.2019.01267
Source DB: PubMed; Journal: Front Neurosci; ISSN: 1662-453X; Impact factor: 4.677
Figure 1. Experimental setup: ECoG and audible speech (light blue) were measured simultaneously while participants read words shown on a computer screen. We recorded ECoG data on inferior frontal (green), premotor (blue), and motor (purple) cortices.
Participant demographics and electrode information.
| 1 | 368 | 752.8 | 12 | 19 | 18 |
| 2 | 370 | 761.7 | 8 | 15 | 19 |
| 3 | 249 | 509.2 | 16 | 21 | 20 |
| 4 | 249 | 571.5 | 11 | 29 | 18 |
| 5 | 244 | 499.2 | 0 | 19 | 19 |
| 6 | 372 | 760.8 | 15 | 18 | 12 |
Figure 2. Electrode grid positions for all six participants. Grids always covered areas in inferior frontal gyrus pars opercularis (IFG, green), ventral premotor cortex (PMv, blue), and ventral motor cortex (M1v, purple).
Figure 3. Speech generation approach: For each window of high gamma activity in the test data (top left), the cosine similarity to each window in the training data (center bottom) was computed. The window in the training data that maximized the cosine similarity was determined and the corresponding speech unit (center top) was selected. The resulting overlapping speech units (top right) were combined using Hanning windows to form the generated speech output (bottom right). Also see Supplementary Video 1.
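The selection and recombination steps described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes that neural feature windows and speech units are plain lists of floats and that consecutive units are offset by a fixed hop; the function names (`select_units`, `overlap_add`), the `hop` parameter, and the toy data are hypothetical and not taken from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def select_units(test_feats, train_feats, train_units):
    """For each test window, pick the speech unit paired with the
    training window that maximizes cosine similarity."""
    selected = []
    for window in test_feats:
        best = max(range(len(train_feats)),
                   key=lambda i: cosine_similarity(window, train_feats[i]))
        selected.append(train_units[best])
    return selected

def hann(n):
    """Symmetric Hann (Hanning) window of length n."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def overlap_add(units, hop):
    """Combine equally long, overlapping units by Hann-weighted overlap-add.
    `hop` (assumed) is the offset in samples between consecutive units."""
    unit_len = len(units[0])
    window = hann(unit_len)
    out = [0.0] * (hop * (len(units) - 1) + unit_len)
    for i, unit in enumerate(units):
        start = i * hop
        for k in range(unit_len):
            out[start + k] += window[k] * unit[k]
    return out

# Toy data (illustrative only): two training windows with paired audio units.
train_feats = [[1.0, 0.0], [0.0, 1.0]]
train_units = [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]]
test_feats = [[0.9, 0.1], [0.2, 0.8]]

units = select_units(test_feats, train_feats, train_units)
audio = overlap_add(units, hop=2)
```

Because every output sample is copied from a recorded unit, a scheme of this kind reuses the speaker's own audio, which is consistent with the abstract's statement that the output employs the user's own voice.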
Figure 4. Generation example: Examples of actual (top) and generated (bottom) audio waveforms (A) and spectrograms (B) of seven words spoken by participant 5. Similarities between the generated and actual speech are striking, especially in the spectral domain (B). These generated examples can be found in Supplementary Audio 1.
Figure 5. Performance of our generation approach. (A) Correlation coefficients between the spectrograms of original and generated audio waveforms for the best (purple) and average (green) participant. While all regions yielded better than randomized results on average, M1v provided the most information for our reconstruction process. (B) Results of listening test with 55 human listeners. Accuracies in the 4-option forced intelligibility test were above chance level (25%, dashed line) for all listeners.
Figure 6. Detailed decoding results. (A) Correlations between original and reconstructed spectrograms (mel-scaled) for all participants and electrode locations. Stars indicate significance levels (* larger than 95% of random activations, *** larger than 99.9% of random activations). M1v contains the most information for our decoding approach. (B) Detailed results for best participant using all electrodes and the entire temporal context (blue) and only using activity prior to the current moment (cyan) across all frequency coefficients. Shaded areas denote 95% confidence intervals. Reconstruction is reliable across all frequency ranges and above chance level (maximum of all randomizations, red) for all frequency ranges.
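The per-coefficient correlations reported in Figures 5 and 6 can be illustrated with a short sketch. It assumes a Pearson correlation computed over time, separately for each frequency coefficient, with spectrograms stored as lists of frames; the captions do not specify the exact correlation measure or the randomization procedure, so the randomized baseline is omitted and the function names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def per_bin_correlation(original, generated):
    """Correlate two spectrograms over time, one value per frequency coefficient.
    Each spectrogram is a list of frames; each frame is a list of bin values."""
    n_bins = len(original[0])
    return [pearson([frame[b] for frame in original],
                    [frame[b] for frame in generated])
            for b in range(n_bins)]

# Toy spectrograms (illustrative only): 4 frames, 2 frequency coefficients.
original = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]]
generated = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0], [8.0, 4.0]]

correlations = per_bin_correlation(original, generated)
mean_correlation = sum(correlations) / len(correlations)
```

In this toy case the first coefficient tracks the original perfectly and the second is inverted, so the two values cancel in the mean; keeping the per-coefficient values separate, as in Figure 6B, avoids hiding such differences.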