Michelle Gutiérrez-Muñoz, Astryd González-Salazar, Marvin Coto-Jiménez.
Abstract
Speech signals are degraded in real-life environments by background noise and other factors, and processing such signals for voice recognition and voice analysis systems presents important challenges. One condition that is particularly difficult for those systems to handle is reverberation, produced by reflections of the sound wave that reach the microphone along multiple paths from the source. To enhance signals in such adverse conditions, several deep learning-based methods have been proposed and proven to be effective. Recently, recurrent neural networks, especially those with long short-term memory (LSTM), have shown remarkable results in tasks involving time-dependent signals such as speech. One of the most challenging aspects of LSTM networks is the high computational cost of the training procedure, which has limited extensive experimentation in several cases. In this work, we evaluate hybrid neural network models that learn different reverberation conditions without any prior information. The results show that some combinations of LSTM and perceptron layers produce good results compared with those of pure LSTM networks, given a fixed number of layers. The evaluation was based on quality measurements of the signal's spectrum, the training time of the networks, and statistical validation of the results. In total, 120 artificial neural networks of eight different types were trained and compared. The results support the view that hybrid networks are an important option for speech signal enhancement: training time is reduced on the order of 30% in processes that can normally take several days or weeks, depending on the amount of data, without a significant drop in quality.
Keywords: LSTM; artificial neural network; deep learning; speech processing
Year: 2019 PMID: 31861828 PMCID: PMC7148527 DOI: 10.3390/biomimetics5010001
Source DB: PubMed Journal: Biomimetics (Basel) ISSN: 2313-7673
Figure 1. Bidirectional long short-term memory (BLSTM) network structure. Adapted from [32].
Figure 2. Sample of three networks compared in this work: the purely multi-layer perceptron (MLP) network, a mixed network, and the purely BLSTM network.
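The eight network types mentioned in the abstract correspond to every ordering of BLSTM and MLP layers across three hidden layers. The following is a minimal sketch that enumerates them; the names `LAYER_TYPES` and `hidden_layer_combinations` are illustrative and do not come from the paper.

```python
from itertools import product

# The two hidden-layer types combined in the hybrid networks.
LAYER_TYPES = ("BLSTM", "MLP")


def hidden_layer_combinations(n_layers=3):
    """Return every ordering of BLSTM/MLP layers for a fixed depth."""
    return ["–".join(combo) for combo in product(LAYER_TYPES, repeat=n_layers)]


combinations = hidden_layer_combinations()
for name in combinations:
    print(name)
```

With two layer types and three hidden layers there are 2^3 = 8 combinations, ranging from the purely BLSTM network to the purely MLP network.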
Efficiency of the different combinations of hidden layers for each reverberation condition. * marks the best (lowest) SSE value in each condition.
| Condition | Network (Hidden Layers) | SSE | Time per Epoch (s) |
|---|---|---|---|
| MARDY | BLSTM–BLSTM–BLSTM | 201.34 * | 50.6 |
| | BLSTM–BLSTM–MLP | 204.39 | 33.3 |
| | BLSTM–MLP–BLSTM | 210.81 | 33.5 |
| | BLSTM–MLP–MLP | 218.91 | 15.9 |
| | MLP–BLSTM–BLSTM | 204.82 | 36.1 |
| | MLP–BLSTM–MLP | 256.32 | 18.6 |
| | MLP–MLP–BLSTM | 216.46 | 18.8 |
| | MLP–MLP–MLP | 400.34 | 1.2 |
| Lecture Room | BLSTM–BLSTM–BLSTM | 213.12 | 74.9 |
| | BLSTM–BLSTM–MLP | 214.35 | 48.8 |
| | BLSTM–MLP–BLSTM | 221.88 | 49.3 |
| | BLSTM–MLP–MLP | 229.22 | 23.2 |
| | MLP–BLSTM–BLSTM | 212.34 * | 52.8 |
| | MLP–BLSTM–MLP | 226.39 | 27.7 |
| | MLP–MLP–BLSTM | 230.85 | 27.6 |
| | MLP–MLP–MLP | 360.41 | 1.8 |
| Artificial Room | BLSTM–BLSTM–BLSTM | 88.47 * | 55.5 |
| | BLSTM–BLSTM–MLP | 90.37 | 36.5 |
| | BLSTM–MLP–BLSTM | 93.61 | 36.6 |
| | BLSTM–MLP–MLP | 104.23 | 17.4 |
| | MLP–BLSTM–BLSTM | 92.18 | 39.5 |
| | MLP–BLSTM–MLP | 108.56 | 20.6 |
| | MLP–MLP–BLSTM | 111.13 | 20.5 |
| | MLP–MLP–MLP | 170.61 | 1.3 |
| ACE Building | BLSTM–BLSTM–BLSTM | 207.32 * | 73.8 |
| | BLSTM–BLSTM–MLP | 210.17 | 45.8 |
| | BLSTM–MLP–BLSTM | 214.29 | 46.1 |
| | BLSTM–MLP–MLP | 212.54 | 21.6 |
| | MLP–BLSTM–BLSTM | 208.04 | 49.2 |
| | MLP–BLSTM–MLP | 221.28 | 25.6 |
| | MLP–MLP–BLSTM | 220.13 | 25.8 |
| | MLP–MLP–MLP | 333.60 | 1.7 |
| Meeting Room | BLSTM–BLSTM–BLSTM | 197.37 | 69.9 |
| | BLSTM–BLSTM–MLP | 199.03 | 45.7 |
| | BLSTM–MLP–BLSTM | 204.68 | 45.8 |
| | BLSTM–MLP–MLP | 217.52 | 21.6 |
| | MLP–BLSTM–BLSTM | 196.90 * | 49.6 |
| | MLP–BLSTM–MLP | 206.03 | 25.7 |
| | MLP–MLP–BLSTM | 214.28 | 25.9 |
| | MLP–MLP–MLP | 363.19 | 1.7 |
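The training-time savings can be read directly from the time-per-epoch column. The sketch below computes, for the MARDY condition only, the percentage reduction of each network relative to the purely BLSTM network. The values are copied from the table; the function name `time_reduction` and the choice of the purely BLSTM network as baseline are assumptions made for illustration.

```python
# Time per epoch (s) for the MARDY condition, copied from the efficiency table.
MARDY_TIME_PER_EPOCH = {
    "BLSTM–BLSTM–BLSTM": 50.6,
    "BLSTM–BLSTM–MLP": 33.3,
    "BLSTM–MLP–BLSTM": 33.5,
    "BLSTM–MLP–MLP": 15.9,
    "MLP–BLSTM–BLSTM": 36.1,
    "MLP–BLSTM–MLP": 18.6,
    "MLP–MLP–BLSTM": 18.8,
    "MLP–MLP–MLP": 1.2,
}


def time_reduction(times, baseline="BLSTM–BLSTM–BLSTM"):
    """Percentage reduction in time per epoch relative to the baseline network."""
    base = times[baseline]
    return {
        name: 100.0 * (base - t) / base
        for name, t in times.items()
        if name != baseline
    }


reductions = time_reduction(MARDY_TIME_PER_EPOCH)
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% less time per epoch")
```

For the networks that replace a single BLSTM layer with an MLP layer, the reduction in this condition falls roughly between 29% and 34%, which is in line with the figure of about 30% quoted in the abstract.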
Objective evaluations for the different combinations of hidden layers for each reverberation condition. * marks the best PESQ value in each condition. The p-values were obtained with the Friedman test at a significance level of 0.05.
| Condition | Network (Hidden Layers) | PESQ | Significant Difference | p-value |
|---|---|---|---|---|
| MARDY | BLSTM–BLSTM–BLSTM | 2.30 | - | - |
| | BLSTM–BLSTM–MLP | 2.31 * | no | 0.715 |
| | BLSTM–MLP–BLSTM | 2.27 | yes | 0.003 |
| | BLSTM–MLP–MLP | 2.19 | yes | 6.648 × 10 |
| | MLP–BLSTM–BLSTM | 2.28 | no | 0.147 |
| | MLP–BLSTM–MLP | 2.08 | yes | 1.965 × 10 |
| | MLP–MLP–BLSTM | 2.24 | yes | 0.000 |
| | MLP–MLP–MLP | 1.94 | yes | 0.000 |
| Lecture Room | BLSTM–BLSTM–BLSTM | 2.28 * | - | - |
| | BLSTM–BLSTM–MLP | 2.21 | no | 0.095 |
| | BLSTM–MLP–BLSTM | 2.22 | yes | 0.0034 |
| | BLSTM–MLP–MLP | 2.20 | yes | 1.729 × 10 |
| | MLP–BLSTM–BLSTM | 2.27 | no | 0.199 |
| | MLP–BLSTM–MLP | 2.21 | yes | 9.635 × 10 |
| | MLP–MLP–BLSTM | 2.20 | yes | 9.617 |
| | MLP–MLP–MLP | 2.00 | yes | 0.000 |
| Artificial Room | BLSTM–BLSTM–BLSTM | 3.18 * | - | - |
| | BLSTM–BLSTM–MLP | 3.17 | no | 1.000 |
| | BLSTM–MLP–BLSTM | 3.14 | yes | 0.002 |
| | BLSTM–MLP–MLP | 3.12 | yes | 6.650 × 10 |
| | MLP–BLSTM–BLSTM | 3.17 | no | 1.000 |
| | MLP–BLSTM–MLP | 3.06 | yes | 1.965 × 10 |
| | MLP–MLP–BLSTM | 3.08 | yes | 2.695 × 10 |
| | MLP–MLP–MLP | 2.90 | yes | 0.000 |
| ACE Building | BLSTM–BLSTM–BLSTM | 2.37 * | - | - |
| | BLSTM–BLSTM–MLP | 2.35 | no | 0.068 |
| | BLSTM–MLP–BLSTM | 2.35 | no | 0.147 |
| | BLSTM–MLP–MLP | 2.32 | yes | 4.22 × 10 |
| | MLP–BLSTM–BLSTM | 2.36 | no | 0.474 |
| | MLP–BLSTM–MLP | 2.33 | yes | 0.026 |
| | MLP–MLP–BLSTM | 2.33 | yes | 0.008 |
| | MLP–MLP–MLP | 2.08 | yes | 0.000 |
| Meeting Room | BLSTM–BLSTM–BLSTM | 2.28 | - | - |
| | BLSTM–BLSTM–MLP | 2.29 * | no | 0.147 |
| | BLSTM–MLP–BLSTM | 2.24 | no | 0.060 |
| | BLSTM–MLP–MLP | 2.23 | yes | 0.002 |
| | MLP–BLSTM–BLSTM | 2.28 | no | 0.474 |
| | MLP–BLSTM–MLP | 2.25 | no | 0.715 |
| | MLP–MLP–BLSTM | 2.20 | yes | 0.001 |
| | MLP–MLP–MLP | 2.0 | yes | 1.960 × 10 |
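To illustrate the trade-off between quality and training time, the sketch below combines the PESQ values and significance flags from this table with the time-per-epoch values reported for the ACE Building condition. It lists the networks whose PESQ shows no significant difference and reports how much time per epoch each one saves. Treating the purely BLSTM network as the baseline is an assumption (it is the row without a significance entry), and the name `faster_without_significant_loss` is illustrative only.

```python
# ACE Building rows: (hidden layers, PESQ, time per epoch in s, significant difference).
# Values are copied from the two tables; None marks the assumed baseline row.
ACE_BUILDING = [
    ("BLSTM–BLSTM–BLSTM", 2.37, 73.8, None),
    ("BLSTM–BLSTM–MLP", 2.35, 45.8, False),
    ("BLSTM–MLP–BLSTM", 2.35, 46.1, False),
    ("BLSTM–MLP–MLP", 2.32, 21.6, True),
    ("MLP–BLSTM–BLSTM", 2.36, 49.2, False),
    ("MLP–BLSTM–MLP", 2.33, 25.6, True),
    ("MLP–MLP–BLSTM", 2.33, 25.8, True),
    ("MLP–MLP–MLP", 2.08, 1.7, True),
]


def faster_without_significant_loss(rows):
    """Return (name, PESQ drop, % time saved) for networks with no significant difference."""
    _, baseline_pesq, baseline_time, _ = rows[0]
    selected = []
    for name, pesq, time_s, significant in rows[1:]:
        if significant is False:
            saving = 100.0 * (baseline_time - time_s) / baseline_time
            selected.append((name, round(baseline_pesq - pesq, 2), saving))
    return selected


candidates = faster_without_significant_loss(ACE_BUILDING)
for name, pesq_drop, saving in candidates:
    print(f"{name}: PESQ drop {pesq_drop:.2f}, {saving:.1f}% less time per epoch")
```

In this condition, three hybrid networks qualify, each saving roughly a third or more of the time per epoch while lowering PESQ by at most 0.02.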
Figure 3. Spectrograms of a phrase in the database: (a) clean speech; (b) speech with reverberation (ACE Building Lobby); (c) enhancement result with the BLSTM network; and (d) enhancement result with the mixed MLP–BLSTM–BLSTM network.