Abstract
Several researchers have considered deep learning-based post-filters to increase the quality of statistical parametric speech synthesis. These post-filters map synthetic speech to natural speech, treating the different parameters separately and trying to reduce the gap between them. Long Short-term Memory (LSTM) neural networks have been applied successfully for this purpose, but many aspects of the results and of the process itself can still be improved. In this paper, we introduce a new pre-training approach for the LSTM, with the objective of enhancing the quality of the synthesized speech, particularly in the spectrum, in a more efficient manner. Our approach begins with an auto-associative training of one LSTM network, which is then used as an initialization for the post-filters. We show the advantages of this initialization for enhancing the Mel-Frequency Cepstral parameters of synthetic speech. In most cases, the initialization achieves better results in enhancing the statistical parametric speech spectrum than the common random initialization of the networks.
Keywords: LSTM; deep learning; machine learning; post-filtering; signal processing; speech synthesis
Year: 2019 PMID: 31141924 PMCID: PMC6630405 DOI: 10.3390/biomimetics4020039
Source DB: PubMed Journal: Biomimetics (Basel) ISSN: 2313-7673
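The two-stage scheme described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: a small linear layer stands in for the LSTM, the data are random toy vectors, and all names (`train`, `sse`, `natural`, `synthetic`) are hypothetical. It only shows the order of operations: first train the model to reproduce its own input (auto-associative training), then reuse those weights as the starting point for the synthetic-to-natural mapping.

```python
import random

def forward(weights, x):
    """Apply the linear map y = W x (a stand-in for the LSTM forward pass)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weights]

def sse(weights, inputs, targets):
    """Sum of squared errors of the model over a data set."""
    return sum((y_i - t_i) ** 2
               for x, t in zip(inputs, targets)
               for y_i, t_i in zip(forward(weights, x), t))

def train(weights, inputs, targets, epochs=200, lr=0.05):
    """Return a trained copy of `weights` (per-sample gradient descent on squared error)."""
    w = [row[:] for row in weights]
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            y = forward(w, x)
            for i, row in enumerate(w):
                err = y[i] - t[i]
                for j in range(len(row)):
                    row[j] -= lr * err * x[j]
    return w

# Toy data (assumption): "synthetic" frames are attenuated, noisy copies of "natural" ones.
rng = random.Random(0)
DIM = 3
natural = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(50)]
synthetic = [[0.8 * v + rng.uniform(-0.05, 0.05) for v in frame] for frame in natural]
random_init = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(DIM)]

# Stage 1: auto-associative pre-training, the target equals the input.
pretrained = train(random_init, natural, natural)
# Stage 2: post-filter training, initialized from the pre-trained weights.
post_filter = train(pretrained, synthetic, natural)
```

In this toy setting the pre-trained weights end up close to the identity map, so the second stage starts from a model that already passes parameters through unchanged and only has to learn the correction from synthetic to natural values.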
Amount of data (vectors) available for each voice in the databases.
| Voice | Gender/Accent | Total | Train | Validation | Test |
|---|---|---|---|---|---|
| BDL | (M) US-English | 676,554 | 473,588 | 135,311 | 67,655 |
| SLT | (F) US-English | 677,970 | 474,579 | 135,594 | 67,797 |
| CLB | (F) US-English | 769,161 | 538,413 | 153,832 | 76,916 |
| RMS | (M) US-English | 793,067 | 555,147 | 158,613 | 79,307 |
| JMK | (M) US-English (a) | 635,503 | 541,856 | 62,135 | 31,512 |
(a) JMK voice in US-English was produced from a Canadian English speaker.
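The partition sizes in the table can be checked with a few lines of arithmetic. The sketch below copies the counts from the table; the helper name `split_fractions` is hypothetical. It verifies that the three partitions add up to the total and reports each share in percent.

```python
# Vector counts copied from the table above: (total, train, validation, test).
COUNTS = {
    "BDL": (676_554, 473_588, 135_311, 67_655),
    "SLT": (677_970, 474_579, 135_594, 67_797),
    "CLB": (769_161, 538_413, 153_832, 76_916),
    "RMS": (793_067, 555_147, 158_613, 79_307),
    "JMK": (635_503, 541_856, 62_135, 31_512),
}

def split_fractions(total, train, validation, test):
    """Return the (train, validation, test) shares of the total, in percent."""
    if train + validation + test != total:
        raise ValueError("partitions do not add up to the total")
    return tuple(round(100.0 * part / total, 1) for part in (train, validation, test))

if __name__ == "__main__":
    for voice, counts in COUNTS.items():
        print(voice, split_fractions(*counts))
```

With these counts, BDL, SLT, CLB and RMS follow a 70/20/10 split, while JMK is split roughly 85/10/5.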
Comparison of the results for the test set during training the Long Short-term Memory (LSTM) networks to enhance MFCC.
| Voice | Epochs | sse |
|---|---|---|
| SLT | 327 | 290.00 |
| BDL | 236 | 378.58 |
| CLB | 198 | 362.56 |
| RMS | 196 | 382.80 |
| JMK | 232 | 352.36 |

| Voice | Epochs | sse |
|---|---|---|
| SLT | 232 | 276.33 |
| BDL | 134 | 364.28 |
| CLB | 135 | 350.92 |
| RMS | 147 | 368.07 |
| JMK | 200 | 341.72 |
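To make the comparison concrete, the sketch below assumes that sse denotes the sum of squared errors between enhanced and natural parameter frames (the record does not define it), and computes the relative change in epochs and sse between the two result blocks above. The names `FIRST_BLOCK`, `SECOND_BLOCK` and `relative_change` are hypothetical, and the blocks are identified only by their order of appearance.

```python
def sse(predicted, target):
    """Sum of squared errors between two equally long sequences of frames."""
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same number of frames")
    return sum((p - t) ** 2
               for p_frame, t_frame in zip(predicted, target)
               for p, t in zip(p_frame, t_frame))

def relative_change(before, after):
    """Signed change from `before` to `after`, in percent of `before`."""
    return 100.0 * (after - before) / before

# (epochs, sse) per voice, copied from the two result blocks above.
FIRST_BLOCK = {"SLT": (327, 290.00), "BDL": (236, 378.58), "CLB": (198, 362.56),
               "RMS": (196, 382.80), "JMK": (232, 352.36)}
SECOND_BLOCK = {"SLT": (232, 276.33), "BDL": (134, 364.28), "CLB": (135, 350.92),
                "RMS": (147, 368.07), "JMK": (200, 341.72)}

if __name__ == "__main__":
    for voice, (epochs_1, sse_1) in FIRST_BLOCK.items():
        epochs_2, sse_2 = SECOND_BLOCK[voice]
        print(f"{voice}: epochs {relative_change(epochs_1, epochs_2):+.1f}%, "
              f"sse {relative_change(sse_1, sse_2):+.1f}%")
```

For every voice, the second block reports both fewer epochs and a lower sse than the first.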
Figure 1. Evolution of the sse value for the validation set during the training process of the BDL voice.
Figure 2. Evolution of the sse value for the validation set during the training process of the CLB voice.
Figure 3. Evolution of the sse value for the validation set during the training process of the JMK voice.
Figure 4. Evolution of the sse value for the validation set during the training process of the RMS voice.
Figure 5. Evolution of the sse value for the validation set during the training process of the SLT voice.
Mean PESQ Results for BDL voice. Higher values represent better results. * is the best result. The superscript 1 means that the LSTM post-filter was pre-trained as ANN using natural parameters, while the superscript 2 means the same procedure applied with synthetic parameters.
| Random-Worst | Random-Best | BDL¹ | CLB¹ | JMK¹ | RMS¹ | SLT¹ |
|---|---|---|---|---|---|---|
| 1.45 | 1.46 | 1.45 | 1.46 | 1.45 | 1.44 | 1.46 |

| BDL² | CLB² | JMK² | RMS² | SLT² |
|---|---|---|---|---|
| 1.43 | 1.45 | 1.47 | 1.49 * | 1.47 |
Mean PESQ Results for CLB voice. Higher values represent better results. * is the best result. The superscript 1 means that the LSTM post-filter was pre-trained as ANN using natural parameters, while the superscript 2 means the same procedure applied with synthetic parameters.
| Random-Worst | Random-Best | BDL¹ | CLB¹ | JMK¹ | RMS¹ | SLT¹ |
|---|---|---|---|---|---|---|
| 1.16 | 1.19 | 1.18 | 1.20 | 1.20 | 1.23 * | 1.23 * |

| BDL² | CLB² | JMK² | RMS² | SLT² |
|---|---|---|---|---|
| 1.20 | 1.20 | 1.22 | 1.23 * | 1.18 |
Mean PESQ Results for JMK voice. Higher values represent better results.
| Random-Worst | Random-Best | BDL¹ | CLB¹ | JMK¹ | RMS¹ | SLT¹ |
|---|---|---|---|---|---|---|
| 1.51 | 1.56 | 1.59 * | 1.53 | 1.56 | 1.56 | 1.54 |

| BDL² | CLB² | JMK² | RMS² | SLT² |
|---|---|---|---|---|
| 1.55 | 1.55 | 1.58 | 1.55 | 1.55 |
Mean PESQ results for RMS voice. Higher values represent better results.
| Random-Worst | Random-Best | BDL¹ | CLB¹ | JMK¹ | RMS¹ | SLT¹ |
|---|---|---|---|---|---|---|
| 1.70 | 1.72 * | 1.71 | 1.71 | 1.71 | 1.68 | 1.71 |

| BDL² | CLB² | JMK² | RMS² | SLT² |
|---|---|---|---|---|
| 1.71 | 1.69 | 1.70 | 1.70 | 1.70 |
Mean PESQ Results for SLT voice. Higher values represent better results.
| Random-Worst | Random-Best | BDL¹ | CLB¹ | JMK¹ | RMS¹ | SLT¹ |
|---|---|---|---|---|---|---|
| 0.95 | 0.97 | 0.95 | 0.97 | 0.98 * | 0.95 | 0.96 |

| BDL² | CLB² | JMK² | RMS² | SLT² |
|---|---|---|---|---|
| 0.97 | 0.97 | 0.98 * | 0.97 | 0.94 |
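The abstract's statement that the initialization helps "in most cases" can be checked against the five PESQ tables above. The sketch below pools, for each target voice, all values reported for pre-trained post-filters (both rows of each table, assumed from the captions to be the variants pre-trained with natural and with synthetic parameters) and compares the best of them with Random-Best. The names `PESQ` and `voices_improved` are hypothetical.

```python
# Mean PESQ per target voice, copied from the tables above.
# "pretrained" pools both rows of each table, excluding the two Random columns.
PESQ = {
    "BDL": {"random_best": 1.46,
            "pretrained": [1.45, 1.46, 1.45, 1.44, 1.46, 1.43, 1.45, 1.47, 1.49, 1.47]},
    "CLB": {"random_best": 1.19,
            "pretrained": [1.18, 1.20, 1.20, 1.23, 1.23, 1.20, 1.20, 1.22, 1.23, 1.18]},
    "JMK": {"random_best": 1.56,
            "pretrained": [1.59, 1.53, 1.56, 1.56, 1.54, 1.55, 1.55, 1.58, 1.55, 1.55]},
    "RMS": {"random_best": 1.72,
            "pretrained": [1.71, 1.71, 1.71, 1.68, 1.71, 1.71, 1.69, 1.70, 1.70, 1.70]},
    "SLT": {"random_best": 0.97,
            "pretrained": [0.95, 0.97, 0.98, 0.95, 0.96, 0.97, 0.97, 0.98, 0.97, 0.94]},
}

def voices_improved(table):
    """Voices where at least one pre-trained post-filter scores above Random-Best."""
    return [voice for voice, row in table.items()
            if max(row["pretrained"]) > row["random_best"]]

if __name__ == "__main__":
    print(voices_improved(PESQ))
```

By this criterion, four of the five voices (BDL, CLB, JMK and SLT) have a pre-trained post-filter that scores above Random-Best; for RMS, Random-Best remains the highest value.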
Figure 6. First MFCC for the BDL voice.
Figure 7. Sample contour of the fifth MFCC for the SLT voice.