| Literature DB >> 32316162 |
Debadatta Dash (1,2), Paul Ferrari (3,4), Satwik Dutta (5), Jun Wang (2,6).
Abstract
Neural speech decoding-driven brain-computer interfaces (BCIs), or speech-BCIs, are a novel paradigm for exploring communication restoration for locked-in (fully paralyzed but aware) patients. Speech-BCIs aim to map a direct transformation from neural signals to text or speech, which has the potential for a higher communication rate than current BCIs. Although recent progress has demonstrated the potential of speech-BCIs from either invasive or non-invasive neural signals, most systems developed so far still assume that the onset and offset of speech utterances within the continuous neural recordings are known. This lack of real-time voice/speech activity detection (VAD) is an obstacle to future applications of neural speech decoding in which BCI users can hold a continuous conversation with other speakers. To address this issue, in this study we attempted to detect voice/speech activity automatically and directly from neural signals recorded using magnetoencephalography (MEG). First, we classified whole pre-speech, speech, and post-speech segments in the neural signals using a support vector machine (SVM). Second, for continuous prediction, we used a long short-term memory recurrent neural network (LSTM-RNN) to decode the voice activity at each time point via its sequential pattern-learning mechanism. Experimental results demonstrated the possibility of real-time VAD directly from non-invasive neural signals with about 88% accuracy.
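As a rough illustration of the abstract's second stage, the sketch below shows a sequence-labeling LSTM in PyTorch that emits one speech/non-speech logit per time step. It is not the authors' implementation; the four-feature input, the hidden size, and all shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class NeuroVAD(nn.Module):
    """Sequence-labeling LSTM: one speech/non-speech decision per time step."""
    def __init__(self, n_features=4, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)   # one logit per time step

    def forward(self, x):                        # x: (batch, time, n_features)
        out, _ = self.lstm(x)                    # out: (batch, time, hidden_size)
        return self.head(out).squeeze(-1)        # logits: (batch, time)

# Illustrative shapes: 8 trials, 500 time points (ms), 4 per-ms features.
model = NeuroVAD()
x = torch.randn(8, 500, 4)
labels = torch.randint(0, 2, (8, 500)).float()   # 1 = speech active at that ms
loss = nn.BCEWithLogitsLoss()(model(x), labels)
loss.backward()
```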
Keywords: LSTM-RNN; MEG; SVM; VAD; brain-computer interface; speech-BCI; wavelet
Year: 2020 PMID: 32316162 PMCID: PMC7218843 DOI: 10.3390/s20082248
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The MEG unit.
Figure 2. Time-locked protocol.
Figure 3. The four features selected for the experiment, shown for one example trial (in black); the corresponding speech signal is shown in blue. Feature 1: sum of absolute (abs) values; Feature 2: root mean square (RMS) values; Feature 3: standard deviation (STD) across sensors; Feature 4: index of the sensor with the highest magnitude at each millisecond.
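A minimal NumPy sketch of these four per-time-point features, assuming each MEG trial is a (sensors x samples) array sampled at 1 kHz; the array shapes and the absence of preprocessing (e.g., the wavelet denoising mentioned in the keywords) are assumptions, not the paper's exact pipeline.

```python
import numpy as np

def extract_features(meg):
    """Compute the four scalar features at each time point of an MEG trial.

    meg: array of shape (n_sensors, n_samples), one value per sensor per ms.
    Returns an array of shape (n_samples, 4).
    """
    f1 = np.sum(np.abs(meg), axis=0)         # Feature 1: sum of absolute values
    f2 = np.sqrt(np.mean(meg ** 2, axis=0))  # Feature 2: RMS across sensors
    f3 = np.std(meg, axis=0)                 # Feature 3: STD across sensors
    f4 = np.argmax(np.abs(meg), axis=0)      # Feature 4: index of strongest sensor
    return np.stack([f1, f2, f3, f4], axis=1)

meg = np.random.default_rng(0).standard_normal((204, 500))  # 204 sensors x 500 ms (illustrative)
feats = extract_features(meg)                               # -> (500, 4)
```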
Figure 4. Results of classification between pre-speech, speech, and post-speech segments; error bars show the standard deviation across 8 subjects.
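A hedged scikit-learn sketch of this three-class segment classification; the feature dimensionality, RBF kernel, and random placeholder data are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Placeholder data: in the study, each row would be a fixed-length feature
# vector summarizing one pre-speech, speech, or post-speech MEG segment.
X = rng.standard_normal((240, 40))       # 240 segments x 40 features (illustrative)
y = rng.integers(0, 3, size=240)         # 0 = pre-speech, 1 = speech, 2 = post-speech

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print(cross_val_score(clf, X, y, cv=5).mean())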
Figure 5. Prediction results with NeuroVAD: (a) while subject 8 was speaking “That’s perfect” and (b) prediction accuracy with the LSTM-RNN for all 8 subjects and their average; error bars indicate the standard deviation in predictions across trials.
Figure 6. Frequency plot of the most active sensors across all 8 subjects, shown in three views: (a) left, (b) right, and (c) axial. Each dot represents a sensor; color and size represent the count. (d) Histogram of the frequency (count) of all active sensors for each subject, with each subject represented by a unique color.
Figure 7. Difference between Feature 1 (sum of absolute values) and Feature 3 (standard deviation) for one example trial.