Christian Herff, Dominic Heger, Adriana de Pesters, Dominic Telaar, Peter Brunner, Gerwin Schalk, Tanja Schultz.
Abstract
It has long been speculated whether communication between humans and machines based on natural speech-related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones, or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings. Specifically, we implemented a system, which we call Brain-To-Text, that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech.
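The word error rate quoted above is the standard ASR metric: the minimum number of word substitutions, insertions, and deletions needed to turn the decoded text into the reference, divided by the reference length. A minimal sketch of that computation (the function name and example sentences are illustrative, not from the paper):

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein (edit) distance between word sequences, normalized by
    reference length: (substitutions + insertions + deletions) / N."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("four score and seven years ago",
                      "four store and seven ago"))  # 2 errors / 6 words ≈ 0.33
```

The same formula applied to phone sequences instead of word sequences yields the phone error rate.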
Keywords: ECoG; automatic speech recognition; brain-computer interface; broadband gamma; electrocorticography; pattern recognition; speech decoding; speech production
Year: 2015 PMID: 26124702 PMCID: PMC4464168 DOI: 10.3389/fnins.2015.00217
Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 4.677
Figure 1. Electrode positions for all seven subjects. Captions include age [years old (y/o)] and sex of subjects. Electrode locations were identified in a post-operative CT and co-registered to the preoperative MRI. Electrodes for subject 3 are shown on an average Talairach brain. Combined electrode placement in joint Talairach space for comparison of all subjects: subject 1 (yellow), subject 2 (magenta), subject 3 (cyan), subject 5 (red), subject 6 (green), and subject 7 (blue). Subject 4 was excluded from the joint analysis as the data did not yield sufficient activations related to speech activity (see Section 2.4).
Data recording details for every session.

| Subject | Session | Text | Phrases | Duration (s) |
| --- | --- | --- | --- | --- |
| 1 | 1 | Gettysburg address | 36 | 279.87 |
| 1 | 2 | JFK inaugural | 38 | 326.90 |
| 2 | 1 | Humpty Dumpty | 21 | 129.87 |
| 2 | 2 | Humpty Dumpty | 21 | 129.07 |
| 2 | 3 | Humpty Dumpty | 21 | 126.37 |
| 3 | 1 | Charmed fan-fiction | 42 | 310.27 |
| 3 | 2 | Charmed fan-fiction | 40 | 310.93 |
| 3 | 3 | Charmed fan-fiction | 41 | 307.50 |
| 4 | 1 | Gettysburg address | 38 | 299.67 |
| 4 | 2 | Gettysburg address | 38 | 311.97 |
| 5 | 1 | JFK inaugural | 49 | 341.77 |
| 5 | 2 | Gettysburg address | 39 | 222.57 |
| 6 | 1 | Gettysburg address | 38 | 302.83 |
| 7 | 1 | JFK inaugural | 48 | 590.10 |
| 7 | 2 | Gettysburg address | 38 | 391.43 |
Figure 2. Synchronized recording of ECoG and acoustic data. Acoustic data are labeled using our in-house decoder BioKIT, i.e., the acoustic data samples are assigned to corresponding phones. These phone labels are then imposed on the neural data.
Grouping of phones.

| Phone group | IPA phones |
| --- | --- |
| aa | ɑ æ ʌ |
| b | b |
| ch | tʃ ʃ ʒ |
| eh | ɛ ɝ eɪ |
| f | f |
| hh | h |
| ih | i ɪ |
| jh | dʒ ɡ j |
| k | k |
| l | ɫ |
| m | m |
| n | n ŋ |
| ow | oʊ ɔ |
| p | p |
| r | r |
| s | s z ð θ |
| t | t d |
| uw | u ʊ |
| v | v |
| w | w |
| ow ih | ɔɪ |
| aa ih | aɪ |
| aa ow | aʊ |

English phones are based on the International Phonetic Alphabet (IPA).
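As a rough illustration of how such a grouping is applied, a lookup table can collapse each IPA phone onto its shared model class before training. The sketch below covers only a few of the groups from the table and all names are hypothetical:

```python
# Hypothetical reduced mapping: each IPA phone is collapsed onto the
# phone group that represents it (subset of the table above).
PHONE_GROUP = {
    "ɑ": "aa", "æ": "aa", "ʌ": "aa",
    "tʃ": "ch", "ʃ": "ch", "ʒ": "ch",
    "s": "s", "z": "s", "ð": "s", "θ": "s",
    "t": "t", "d": "t",
    "u": "uw", "ʊ": "uw",
}

def group_phones(ipa_sequence):
    """Map a sequence of IPA phones to their shared model classes."""
    return [PHONE_GROUP[p] for p in ipa_sequence]

print(group_phones(["ð", "æ", "t"]))  # ['s', 'aa', 't']
```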
Figure 3. Overview of the system: ECoG broadband gamma activities (50 ms segments) for every electrode are recorded. Stacked broadband gamma features are calculated (signal processing). Phone likelihoods over time can be calculated by evaluating all Gaussian ECoG phone models for every segment of ECoG features. Using the ECoG phone models, a dictionary, and an n-gram language model, phrases are decoded using the Viterbi algorithm. The most likely word sequence and corresponding phone sequence are calculated, and the phone likelihoods over time can be displayed. Red marked areas in the phone likelihoods show the most likely phone path. See also Supplementary Video.
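The Viterbi step in this pipeline can be sketched in isolation: given per-frame log-likelihoods from the Gaussian phone models and phone-transition log-probabilities, dynamic programming recovers the most likely phone path. This is a minimal stand-in for the full decoder, which additionally constrains paths with a dictionary and an n-gram language model; all array shapes and names here are assumptions:

```python
import numpy as np

def viterbi(log_likelihoods, log_transitions, log_prior):
    """Most likely state path given per-frame log-likelihoods
    (frames x phones), a phone-transition log-probability matrix,
    and an initial log-distribution."""
    T, P = log_likelihoods.shape
    score = np.full((T, P), -np.inf)   # best path score ending in each phone
    back = np.zeros((T, P), dtype=int)  # backpointers for traceback
    score[0] = log_prior + log_likelihoods[0]
    for t in range(1, T):
        # best predecessor phone for each current phone
        cand = score[t - 1][:, None] + log_transitions
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_likelihoods[t]
    # trace back the best path from the best final state
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 3 frames, 2 phone states; frame 0 favors phone 0,
# frames 1-2 favor phone 1, transitions and prior are uniform.
ll = np.log(np.array([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
tr = np.log(np.full((2, 2), 0.5))
pr = np.log(np.array([0.5, 0.5]))
print(viterbi(ll, tr, pr))  # [0, 1, 1]
```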
Figure 4. Mean Kullback-Leibler divergences between models for every electrode position of every subject. Combined electrode montage of all subjects except subject 4 in common Talairach space. Heat maps on the rendered average brain show regions of high discriminability (red). All shown discriminability exceeds chance level (larger than 99% of randomized discriminabilities). The temporal course of regions with high discriminability between phone models shows early differences in diverse areas up to 200 ms before the actual phone production. Phone models show high discriminability in sensorimotor cortex 50 ms before production and yield different models in auditory regions of the superior temporal gyrus 100 ms after production.
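The per-electrode discriminability in this figure is based on Kullback-Leibler divergences between phone models. For Gaussian models, the KL divergence has a closed form; a univariate sketch (the dimensionality and parameters of the actual models are as described in the paper, not reproduced here):

```python
import math

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL divergence D(P‖Q) between two univariate Gaussians:
    log(σq/σp) + (σp² + (μp − μq)²) / (2σq²) − 1/2."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

# Identical phone models diverge by 0; well-separated means diverge strongly:
print(kl_gaussian(0.0, 1.0, 0.0, 1.0))  # 0.0
print(kl_gaussian(0.0, 1.0, 3.0, 1.0))  # 4.5
```

Averaging such pairwise divergences over all phone-model pairs at one electrode gives a single discriminability score for that electrode.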
Figure 5. Results: (A) Frame-wise accuracy for all sessions. All sessions of all subjects show significantly higher true positive rates for Brain-To-Text (green bars) than for the randomized models (orange bars). (B) Confusion matrix for subject 7, session 1. The clearly visible diagonal indicates that all phones are decoded reliably. (C) Word error rates depending on dictionary size (lines). Word error rates for Brain-To-Text (green line) are lower than for the randomized models for all dictionary sizes. Average true positive rates across phones depending on dictionary size (bars) for subject 7, session 1. Phone true positive rates remain relatively stable for all dictionary sizes and are always much higher for Brain-To-Text than for the randomized models.
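The frame-wise accuracies in (A) and the confusion matrix in (B) are connected: per-phone true positive rates are the diagonal of the confusion matrix divided by its row sums. A small sketch with a made-up 3-phone matrix (not the paper's data):

```python
import numpy as np

def phone_true_positive_rates(confusion):
    """Per-phone true positive rate (recall) from a confusion matrix
    whose rows are true phones and columns are decoded phones."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / confusion.sum(axis=1)

# Toy 3-phone confusion matrix with a strong diagonal like the one in (B):
cm = [[8, 1, 1],
      [2, 7, 1],
      [0, 1, 9]]
print(phone_true_positive_rates(cm))  # [0.8 0.7 0.9]
```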