Tomasz Grzywalski, Szymon Drgas.
Abstract
Monaural speech enhancement aims to remove background noise from an audio recording containing speech in order to improve its clarity and intelligibility. Currently, the most successful solutions for speech enhancement use deep neural networks. In a typical setting, such neural networks process the noisy input signal once and produce a single enhanced signal. However, it was recently shown that a U-Net-based network can be trained in a way that allows it to process the same input signal multiple times in order to enhance the speech even further. Unfortunately, this was tested only for two-iteration enhancement. In the current research, we extend previous efforts and demonstrate how multi-forward-pass speech enhancement can be successfully applied to other architectures, namely the ResBLSTM and Transformer-Net. Moreover, we test the three architectures with up to five iterations, thus identifying the method's limit in terms of performance gain. In our experiments, we used audio samples from the WSJ0, Noisex-92, and DCASE datasets and measured speech enhancement quality using SI-SDR, STOI, and PESQ. The results show that performing speech enhancement up to five times still improves speech intelligibility, but the gain becomes smaller with each iteration. Nevertheless, performing five iterations instead of two gives an additional 0.6 dB SI-SDR gain and a four-percentage-point STOI gain. However, these increments are not equal across architectures: the U-Net and Transformer-Net benefit more from multiple forward passes than the ResBLSTM.
Keywords: ResBLSTM; Transformer-Net; U-Net; enhancement; multi-pass; speech
Year: 2022 PMID: 35408056 PMCID: PMC9003084 DOI: 10.3390/s22072440
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The general architecture of multi-pass speech enhancement. The ⊕ symbol represents the element-wise addition operation.
Figure 2. Input, Base, and Output layers in the three neural network architectures tested with multi-pass speech enhancement.
Figure 3. Dependence of SI-SDR on the number of passes.
Figure 4. Dependence of STOI on the number of passes.
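SI-SDR, the metric plotted in Figure 3 and reported in the result tables, has a compact standard definition: the estimate is projected onto the clean reference, and the energy of that projection is compared with the energy of the residual. The sketch below follows this common definition. It is not the authors' evaluation code, and the function name `si_sdr`, the `eps` guard, and the synthetic signals are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (common definition)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove any DC offset so that the projection is not biased by the mean.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference      # part of the estimate aligned with the reference
    residual = estimate - target    # everything else counts as distortion
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )


# Synthetic example (hypothetical signals, not data from the paper).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 220.0 * t)
noise = rng.standard_normal(clean.shape)
noisy = clean + 0.5 * noise
print(f"SI-SDR of the noisy input: {si_sdr(noisy, clean):.2f} dB")
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is why the metric suits comparing outputs whose gain may differ between passes.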
Babble noise (TIMIT). Best performances in each category are marked with bold font.
| Model | Passes (total) | Pass | SI-SDR (dB), SNR −5 dB | SI-SDR (dB), SNR 0 dB | SI-SDR (dB), SNR 5 dB | SI-SDR (dB), SNR 10 dB | STOI, SNR −5 dB | STOI, SNR 0 dB | STOI, SNR 5 dB | STOI, SNR 10 dB | PESQ, SNR −5 dB | PESQ, SNR 0 dB | PESQ, SNR 5 dB | PESQ, SNR 10 dB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| original | - | - | −5.04 | −0.04 | 4.96 | 9.97 | 0.375 | 0.555 | 0.716 | 0.833 | 1.67 | 1.92 | 2.23 | 2.55 |
| Dilated U-Net | 1 | 1 | 1.82 | 7.20 | 11.21 | 13.70 | 0.575 | 0.775 | 0.894 | 0.948 | 1.90 | 2.41 | 2.80 | 3.10 |
| Dilated U-Net | 5 | 1 | 1.41 | 6.85 | 10.84 | 12.89 | 0.559 | 0.759 | 0.882 | 0.940 | 1.85 | 2.35 | 2.75 | 3.05 |
| | | 2 | 2.51 | 7.82 | 11.75 | 14.44 | 0.612 | 0.812 | 0.917 | 0.959 | 1.97 | 2.48 | 2.86 | 3.17 |
| | | 3 | 3.01 | 8.21 | 11.99 | 14.80 | 0.644 | 0.840 | 0.930 | 0.964 | 2.03 | 2.54 | 2.92 | 3.21 |
| | | 4 | 3.26 | 8.35 | 12.06 | 14.91 | 0.658 | 0.852 | 0.934 | 0.966 | 2.06 | 2.57 | 2.94 | 3.22 |
| | | 5 | 3.37 | 8.40 | 12.06 | 14.95 | 0.667 | 0.857 | 0.935 | 0.966 | 2.07 | 2.58 | 2.95 | 3.22 |
| ResBLSTM | 1 | 1 | 2.16 | 6.85 | 10.53 | 13.30 | 0.634 | 0.830 | 0.919 | 0.956 | 1.98 | 2.50 | 2.90 | 3.21 |
| ResBLSTM | 5 | 1 | 2.12 | 7.06 | 10.81 | 13.60 | 0.623 | 0.818 | 0.919 | 0.959 | 1.93 | 2.45 | 2.88 | 3.21 |
| | | 2 | 2.92 | 7.65 | 11.28 | 14.09 | 0.668 | 0.858 | 0.934 | 0.963 | 2.05 | 2.58 | 2.98 | 3.28 |
| | | 3 | 3.15 | 7.78 | 11.37 | 14.14 | 0.686 | 0.867 | 0.936 | 0.964 | 2.07 | 2.60 | 3.00 | 3.29 |
| | | 4 | 3.22 | 7.82 | 11.39 | 14.13 | 0.692 | 0.870 | 0.937 | 0.964 | 2.08 | 2.61 | 3.00 | 3.29 |
| | | 5 | 3.25 | 7.83 | 11.39 | 14.12 | | | 0.937 | 0.964 | 2.08 | 2.60 | 3.00 | 3.28 |
| Transformer-Net | 1 | 1 | 2.06 | 7.27 | 11.35 | 14.37 | 0.598 | 0.798 | 0.909 | 0.957 | 1.87 | 2.38 | 2.79 | 3.12 |
| Transformer-Net | 5 | 1 | 1.63 | 7.02 | 11.13 | 13.87 | 0.567 | 0.768 | 0.888 | 0.946 | 1.87 | 2.37 | 2.78 | 3.10 |
| | | 2 | 2.82 | 8.09 | 12.08 | 15.24 | 0.625 | 0.824 | 0.922 | 0.962 | 2.01 | 2.52 | 2.92 | 3.23 |
| | | 3 | 3.49 | 8.45 | 12.34 | 15.55 | 0.656 | 0.844 | 0.932 | 0.966 | 2.09 | 2.60 | 2.99 | 3.28 |
| | | 4 | 3.71 | 8.58 | 12.41 | 15.62 | 0.666 | 0.855 | 0.937 | 0.968 | | | | |
| | | 5 | | | | | 0.676 | 0.861 | | | | | | |
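The diminishing return per pass can be checked directly from the babble-noise results. The short sketch below takes the SI-SDR values listed for the Dilated U-Net rows with passes 1 to 5 at −5 dB SNR, exactly as they appear in the table, and computes the gain contributed by each additional pass. The variable names are illustrative only.

```python
# SI-SDR (dB) after passes 1..5, Dilated U-Net, babble noise at -5 dB SNR
# (values copied from the table above).
si_sdr_by_pass = [1.41, 2.51, 3.01, 3.26, 3.37]

# Gain contributed by each additional pass (pass k+1 minus pass k).
gains = [round(b - a, 2) for a, b in zip(si_sdr_by_pass, si_sdr_by_pass[1:])]
print(gains)  # [1.1, 0.5, 0.25, 0.11]

# Extra gain from running five passes instead of two, for this one row.
extra_five_vs_two = round(si_sdr_by_pass[4] - si_sdr_by_pass[1], 2)
print(extra_five_vs_two)  # 0.86
```

Each increment is smaller than the previous one, which matches the abstract's statement that the gain shrinks with every iteration. The 0.86 dB figure applies to this single row only; the 0.6 dB quoted in the abstract is an aggregate.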
Shopping mall noise (DCASE). Best performances in each category are marked with bold font.
| Model | Passes (total) | Pass | SI-SDR (dB), SNR −5 dB | SI-SDR (dB), SNR 0 dB | SI-SDR (dB), SNR 5 dB | SI-SDR (dB), SNR 10 dB | STOI, SNR −5 dB | STOI, SNR 0 dB | STOI, SNR 5 dB | STOI, SNR 10 dB | PESQ, SNR −5 dB | PESQ, SNR 0 dB | PESQ, SNR 5 dB | PESQ, SNR 10 dB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| original | - | - | −5.31 | −0.31 | 4.68 | 9.53 | 0.471 | 0.638 | 0.776 | 0.879 | 1.70 | 2.02 | 2.36 | 2.74 |
| Dilated U-Net | 1 | 1 | 5.65 | 9.62 | 13.00 | 15.94 | 0.749 | 0.882 | 0.944 | 0.970 | 2.32 | 2.70 | 3.01 | 3.28 |
| Dilated U-Net | 5 | 1 | 5.31 | 9.34 | 12.72 | 15.52 | 0.735 | 0.871 | 0.939 | 0.968 | 2.20 | 2.58 | 2.91 | 3.21 |
| | | 2 | 6.12 | 9.95 | 13.28 | 16.32 | 0.788 | 0.899 | 0.950 | 0.974 | 2.29 | 2.66 | 2.98 | 3.26 |
| | | 3 | 6.41 | 10.13 | 13.44 | 16.51 | 0.812 | 0.908 | 0.953 | 0.975 | 2.34 | 2.70 | 3.01 | 3.28 |
| | | 4 | 6.53 | 10.19 | 13.48 | 16.55 | 0.820 | 0.910 | | | 2.36 | 2.71 | 3.01 | 3.28 |
| | | 5 | 6.57 | 10.21 | 13.48 | 16.56 | | | | | 2.36 | 2.71 | 3.01 | 3.28 |
| ResBLSTM | 1 | 1 | 5.36 | 8.99 | 12.18 | 15.02 | 0.778 | 0.890 | 0.942 | 0.968 | 2.27 | 2.63 | 2.95 | 3.23 |
| ResBLSTM | 5 | 1 | 5.32 | 9.08 | 12.31 | 15.11 | 0.761 | 0.885 | 0.942 | 0.969 | 2.29 | 2.62 | 2.92 | 3.20 |
| | | 2 | 5.86 | 9.48 | 12.65 | 15.49 | 0.802 | 0.902 | 0.948 | 0.971 | 2.39 | 2.71 | 2.99 | 3.24 |
| | | 3 | 6.01 | 9.57 | 12.71 | 15.56 | 0.812 | 0.905 | 0.949 | 0.971 | | 2.72 | 2.99 | 3.24 |
| | | 4 | 6.03 | 9.59 | 12.72 | 15.56 | 0.815 | 0.906 | 0.949 | 0.971 | | 2.72 | 2.99 | 3.24 |
| | | 5 | 6.03 | 9.59 | 12.72 | 15.57 | 0.816 | 0.906 | 0.949 | 0.971 | | 2.71 | 2.98 | 3.23 |
| Transformer-Net | 1 | 1 | 5.42 | 9.46 | 12.94 | 16.05 | 0.750 | 0.881 | 0.943 | 0.971 | 2.16 | 2.56 | 2.92 | 3.24 |
| Transformer-Net | 5 | 1 | 4.96 | 9.24 | 12.78 | 15.75 | 0.714 | 0.861 | 0.936 | 0.969 | 2.16 | 2.55 | 2.92 | 3.25 |
| | | 2 | 6.06 | 9.98 | 13.42 | 16.59 | 0.777 | 0.895 | 0.950 | 0.975 | 2.30 | 2.67 | 3.01 | 3.31 |
| | | 3 | 6.42 | 10.19 | 13.55 | 16.72 | 0.798 | 0.904 | 0.953 | | 2.36 | 2.72 | 3.04 | 3.32 |
| | | 4 | 6.55 | 10.25 | 13.58 | 16.74 | 0.804 | 0.907 | 0.953 | | 2.37 | | | |
| | | 5 | | | | | 0.806 | 0.907 | | | 2.38 | | | |
Figure 5. Example of multi-forward speech enhancement (Transformer-Net, babble noise at 0 dB SNR). Warmer colors indicate higher energy in a given time frame and frequency band.