Beiming Cao ¹˒², Alan Wisler ³, Jun Wang ²˒⁴.
Abstract
Silent speech interfaces (SSIs) convert non-audio bio-signals, such as articulatory movement, to speech. This technology has the potential to recover the speech ability of individuals who have lost their voice but can still articulate (e.g., laryngectomees). Articulation-to-speech (ATS) synthesis is an algorithm design of SSI that has the advantages of easy-implementation and low-latency, and therefore is becoming more popular. Current ATS studies focus on speaker-dependent (SD) models to avoid large variations of articulatory patterns and acoustic features across speakers. However, these designs are limited by the small data size from individual speakers. Speaker adaptation designs that include multiple speakers' data have the potential to address the issue of limited data size from single speakers; however, few prior studies have investigated their performance in ATS. In this paper, we investigated speaker adaptation on both the input articulation and the output acoustic signals (with or without direct inclusion of data from test speakers) using the publicly available electromagnetic articulatory (EMA) dataset. We used Procrustes matching and voice conversion for articulation and voice adaptation, respectively. The performance of the ATS models was measured objectively by the mel-cepstral distortions (MCDs). The synthetic speech samples were generated and are provided in the supplementary material. The results demonstrated the improvement brought by both Procrustes matching and voice conversion on speaker-independent ATS. With the direct inclusion of target speaker data in the training process, the speaker-adaptive ATS achieved a comparable performance to speaker-dependent ATS. To our knowledge, this is the first study that has demonstrated that speaker-adaptive ATS can achieve a non-statistically different performance to speaker-dependent ATS.Entities:
Keywords: articulation-to-speech synthesis; silent speech interface; speaker adaptation; voice conversion
Year: 2022 PMID: 36015817 PMCID: PMC9416444 DOI: 10.3390/s22166056
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Number of sentences and duration recorded from each speaker.
| Speaker | Phrase Num. | Duration (min) |
|---|---|---|
| F01 | 1738 | 61.75 |
| F02 | 1560 | 60.58 |
| F03 | 1617 | 58.63 |
| F04 | 1618 | 59.71 |
| M01 | 1553 | 55.67 |
| M02 | 1554 | 57.47 |
| M03 | 1610 | 60.31 |
| M04 | 1620 | 59.65 |
| Sum. | 12,870 | 472.82 |
| Ave. | 1609 | 59.22 |
Figure 1. The overview illustration of a generic articulation-to-speech synthesis model.
Figure 2. The pipeline of ATS speaker adaptation using voice conversion. For each target speaker, the other N (seven) speakers were training speakers.
Figure 3. Example of shapes (motion paths of the articulators) before and after Procrustes matching for producing "the birch canoe slid on the smooth planks". In this coordinate system, y is vertical and z is anterior-posterior. (a) Before Procrustes matching. (b) After Procrustes matching.
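For readers unfamiliar with Procrustes matching, the alignment illustrated in Figure 3 can be sketched as a standard two-dimensional orthogonal Procrustes step: translate each shape to its centroid, normalize its scale, and rotate it onto the reference shape. This is a generic NumPy sketch with a hypothetical function name, not the paper's exact implementation:

```python
import numpy as np

def procrustes_match(shape, reference):
    """Align an articulatory shape (N x 2 points in the y-z plane) to a
    reference shape via translation, uniform scaling, and rotation."""
    # Translation: center both shapes at their centroids.
    a = shape - shape.mean(axis=0)
    b = reference - reference.mean(axis=0)
    # Scaling: normalize both shapes to unit Frobenius norm.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # Rotation: solve the orthogonal Procrustes problem via SVD of a^T b.
    u, _, vt = np.linalg.svd(a.T @ b)
    rotation = u @ vt
    return a @ rotation
```

Because the optimal rotation has a closed-form SVD solution, the alignment is deterministic and cheap, which fits the low-latency motivation stated in the abstract.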
Topologies of the neural networks in this study (implemented in PyTorch).

| Component | Parameter | Value |
|---|---|---|
| Acoustic features | Mel-spectrogram | 80-dim. vectors |
| | Sampling rate | 22,050 Hz |
| | Window length | 1024 |
| | Step size | 256 |
| Articulatory features | Articulatory movement (6 sensors) | 18-dim. vectors (with derived features: 54-dim.) |
| ATS model (training) | Input | 54-dim. articulatory features |
| | Output | 80-dim. acoustic features |
| | LSTM nodes per hidden layer | 256 |
| | Depth | 3 layers |
| | Batch size | 1 sentence (one whole sentence per batch) |
| | Max epochs | 50 |
| | Learning rate | 0.0003 |
| | Optimizer | Adam |
| ATS model (fine-tuning) | Input | 54-dim. articulatory features |
| | Output | 80-dim. acoustic features |
| | LSTM nodes per hidden layer | 256 |
| | Depth | 3 layers |
| | Batch size | 1 sentence (one whole sentence per batch) |
| | Max epochs | 30 |
| | Learning rate | 0.00001 |
| | Optimizer | Adam |
| Voice conversion model | Input | 80-dim. acoustic features |
| | Output | 80-dim. acoustic features |
| | LSTM nodes per hidden layer | 128 |
| | Depth | 3 layers |
| | Batch size | 1 sentence (one whole sentence per batch) |
| | Max epochs | 30 |
| | Learning rate | 0.00005 |
| | Optimizer | Adam |
| Framework | | PyTorch |
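The ATS topology in the table above can be sketched in PyTorch as a 3-layer LSTM (256 nodes per hidden layer) mapping 54-dim. articulatory frames to 80-dim. mel-spectrogram frames, trained with Adam at the listed learning rate. The class name, the output projection layer, and the use of a plain regression head are assumptions; the paper does not publish its code here:

```python
import torch
import torch.nn as nn

class ATSModel(nn.Module):
    """Sketch of the ATS topology: 54-dim. articulatory input,
    3-layer LSTM with 256 hidden nodes, 80-dim. mel-spectrogram output."""
    def __init__(self, in_dim=54, hidden=256, out_dim=80, depth=3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=depth, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x):            # x: (batch, frames, 54)
        h, _ = self.lstm(x)          # h: (batch, frames, 256)
        return self.proj(h)          # (batch, frames, 80)

model = ATSModel()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # lr from the table
# Batch size of 1 sentence: one whole variable-length utterance per step.
mel = model(torch.randn(1, 120, 54))  # e.g., a 120-frame sentence
```

The voice conversion network in the table has the same shape of sketch, with 80-dim. acoustic input/output and 128 hidden nodes.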
Figure 4. Average MCDs of the experiments in this study. SD: speaker-dependent. SI: speaker-independent. SI-P: speaker-independent with Procrustes matching. SI-VC: speaker-independent ATS with voice conversion. SI-VC-P: speaker-independent ATS with both voice conversion and Procrustes matching. SA: speaker-adaptive, i.e., data from target speakers were directly added to the ATS training set.
MCDs (dB) of the ATS experiments for each speaker.
| Speaker | SD | SI | SI-P | SI-VC | SI-VC-P | SA-P | SA-VC-P |
|---|---|---|---|---|---|---|---|
| F01 | 4.98 | 7.80 | 7.48 | 6.63 | 5.79 | 5.26 | 5.08 |
| F02 | 5.47 | 8.41 | 8.21 | 6.82 | 6.45 | 5.51 | 5.23 |
| F03 | 6.02 | 9.04 | 8.66 | 8.03 | 6.99 | 6.11 | 6.20 |
| F04 | 5.99 | 8.37 | 8.35 | 7.87 | 7.19 | 6.14 | 6.33 |
| M01 | 8.96 | 10.41 | 10.44 | 9.45 | 9.33 | 8.22 | 8.23 |
| M02 | 7.54 | 10.66 | 10.05 | 9.25 | 8.85 | 7.29 | 7.21 |
| M03 | 6.59 | 8.18 | 8.37 | 7.95 | 7.55 | 6.87 | 6.85 |
| M04 | 7.14 | 8.83 | 8.69 | 8.71 | 8.38 | 7.11 | 7.03 |
| Mean | 6.59 | 8.96 | 8.78 | 8.09 | 7.57 | 6.56 | 6.52 |
| STD | 1.27 | 1.04 | 0.98 | 1.03 | 1.21 | 0.99 | 1.05 |
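The MCD values above follow the standard mel-cepstral distortion definition. A hedged sketch of that computation is below (the exact variant, e.g., the number of coefficients, whether the energy coefficient c0 is excluded, and any time alignment between sequences, varies between studies and is an assumption here):

```python
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """Frame-averaged MCD in dB between two equal-length mel-cepstral
    sequences of shape (frames, coeffs); c0 is assumed already excluded.

    Per frame: MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)
    """
    diff = ref - syn
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return const * np.mean(np.sqrt(np.sum(diff ** 2, axis=1)))
```

Lower MCD means the synthesized spectra are closer to the reference, which is why SA-VC-P (mean 6.52 dB) approaching SD (mean 6.59 dB) supports the paper's main claim.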
Figure 5. Examples of original and ATS-predicted mel-spectrograms and the corresponding synthetic waveforms. (a) Mel-spectrogram from SI ATS. (b) Speech waveform from SI ATS. (c) Mel-spectrogram from SI-P ATS. (d) Speech waveform from SI-P ATS. (e) Mel-spectrogram from SI-VC-P ATS. (f) Speech waveform from SI-VC-P ATS. (g) Mel-spectrogram from SA-VC-P ATS. (h) Speech waveform from SA-VC-P ATS. (i) Mel-spectrogram from SD ATS. (j) Speech waveform from SD ATS. (k) Original mel-spectrogram. (l) Original speech waveform.
MCDs (dB) of voice conversion during the speaker adaptation on acoustics. The diagonal cells are empty because voice conversion is not applicable within the same speaker.
| Source \ Target | F01 | F02 | F03 | F04 | M01 | M02 | M03 | M04 |
|---|---|---|---|---|---|---|---|---|
| F01 | | 6.32 | 7.23 | 6.71 | 7.08 | 6.86 | 7.91 | 8.69 |
| F02 | 9.26 | | 6.75 | 7.60 | 7.66 | 7.55 | 8.46 | 9.15 |
| F03 | 7.43 | 7.01 | | 7.02 | 6.85 | 7.04 | 7.83 | 8.77 |
| F04 | 7.15 | 6.24 | 7.45 | | 7.24 | 7.66 | 8.02 | 9.25 |
| M01 | 6.64 | 6.40 | 6.32 | 7.38 | | 7.03 | 7.68 | 8.52 |
| M02 | 6.47 | 6.64 | 6.50 | 7.56 | 6.78 | | 7.77 | 8.70 |
| M03 | 6.97 | 6.63 | 6.65 | 7.51 | 6.70 | 7.05 | | 8.50 |
| M04 | 7.01 | 6.32 | 7.05 | 7.42 | 7.96 | 7.17 | 7.71 | |
| Average | 7.30 | 6.51 | 6.85 | 7.31 | 7.18 | 7.19 | 7.91 | 8.80 |