Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti.
Abstract
Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that, independently of the back-end, removes the environmental perturbations from the target speech signal. However, although the enhancement front-end typically increases speech quality from an intelligibility perspective, it tends to introduce distortions that deteriorate the performance of subsequent processing modules. In this paper, we investigate strategies for jointly training neural models for both speech enhancement and the back-end, optimizing a combined loss function. In this way, the enhancement front-end is guided by the back-end to provide more effective enhancement. Differently from typical state-of-the-art approaches operating on spectral features or neural embeddings, we work in the time domain, processing raw waveforms in both components. As an application scenario, we consider intent classification in noisy environments. In particular, the front-end speech enhancement module is based on Wave-U-Net, while the intent classifier is implemented as a temporal convolutional network. Exhaustive experiments are reported on versions of the Fluent Speech Commands corpus contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, shedding light on the most promising training approaches.
Keywords: intent classification; joint training; speech enhancement
Year: 2022 PMID: 35009917 PMCID: PMC8749591 DOI: 10.3390/s22010374
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
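As described in the abstract, the front-end and back-end are trained jointly by optimizing a combined loss. Below is a minimal PyTorch sketch of one such training step, assuming an MSE reconstruction loss for enhancement and cross-entropy for intent classification; the module names, the `loss_weight` parameter, and the specific loss choices are illustrative assumptions, not the authors' exact implementation.

```python
import torch.nn.functional as F

def joint_training_step(enhancer, classifier, optimizer,
                        noisy_wav, clean_wav, intent_label,
                        loss_weight=0.5):
    # One joint optimization step: gradients from the intent
    # classifier flow back into the enhancement front-end, so the
    # front-end is guided by the back-end (all names are assumptions).
    optimizer.zero_grad()
    enhanced = enhancer(noisy_wav)              # time-domain enhancement (Wave-U-Net-style)
    logits = classifier(enhanced)               # intent prediction from the enhanced waveform
    se_loss = F.mse_loss(enhanced, clean_wav)   # waveform reconstruction loss
    ic_loss = F.cross_entropy(logits, intent_label)
    loss = loss_weight * se_loss + (1.0 - loss_weight) * ic_loss
    loss.backward()                             # back-propagate through both modules
    optimizer.step()
    return loss.item()
```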
Figure 1. The basic diagram of a speech enhancement system.
Figure 2. Schematic diagram of conventional joint training of speech enhancement with different back-ends.
Figure 3. The full pipeline of our intent classification scheme, including the speech enhancement module and the intent classifier.
Figure 4. The general architecture of (a) the temporal convolutional network (TCN) for intent classification and (b) the residual block.
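The residual block in Figure 4b follows the standard pattern of dilated 1-D convolutions wrapped in a skip connection. A minimal PyTorch sketch; the channel count, kernel size, dilation, and activation are illustrative assumptions rather than the paper's exact hyper-parameters.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Dilated 1-D convolutional block with a skip connection, in the
    # usual TCN style (hyper-parameters here are assumptions).
    def __init__(self, channels=128, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2  # keep the time-axis length unchanged
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
        )

    def forward(self, x):
        # x: (batch, channels, time)
        return x + self.net(x)  # residual (skip) connection
```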
Figure 5. The three proposed joint training strategies: (a) based on the mixture signal (JT); (b) based on the bottleneck representation (BN); (c) based on the concatenation of the mixture signal and the bottleneck representation (BN-Mix).
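The three strategies in Figure 5 differ only in which representation is fed to the intent classifier. A schematic sketch of that routing follows; it assumes the enhancer returns both the enhanced waveform and its innermost (bottleneck) encoder activations, and it glosses over the time alignment needed before concatenation.

```python
import torch

def classifier_input(enhancer, noisy_wav, strategy="JT"):
    # Select the intent-classifier input for the three joint-training
    # variants (the two-output enhancer interface is an assumption).
    enhanced, bottleneck = enhancer(noisy_wav)
    if strategy == "JT":      # (a) enhanced waveform reconstructed from the mixture
        return enhanced
    if strategy == "BN":      # (b) bottleneck representation of the enhancer
        return bottleneck
    if strategy == "BN-Mix":  # (c) bottleneck concatenated with the mixture signal
        return torch.cat([bottleneck, noisy_wav], dim=-1)  # schematic; shapes must be aligned
    raise ValueError(f"unknown strategy: {strategy}")
```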
FSC dataset description.
| Data | No. of Speakers | No. of Utterances | Total Hours |
|---|---|---|---|
| Train Data | 77 | 23,132 | 14.7 |
| Validation Data | 10 | 3119 | 1.9 |
| Evaluation Data | 10 | 3793 | 2.4 |
Classification accuracy for the different architectures at different values of the loss weight (one row per weight value; the Noisy baseline is a single, weight-independent value).
| Noisy | JT-Clean | JT | BN | BN-Mix |
|---|---|---|---|---|
| 53.2% | 73.37% | 72.80% | 72.39% | - |
| | 91.53% | 80.50% | 77.80% | 58.02% |
| | 92.77% | 86.02% | 77.53% | 54.99% |
| | - | 82.52% | 77.90% | 66.67% |
The PESQ metric for the different architectures at different values of the loss weight (one row per weight value).
| Noisy | JT | BN | BN-Mix |
|---|---|---|---|
| 1.28 | 1.14 | 1.16 | - |
| | 1.18 | 1.71 | 1.81 |
| | 1.15 | 1.76 | 1.67 |
| | 1.14 | 1.79 | 1.83 |
The STOI metric for the different architectures at different values of the loss weight (one row per weight value).
| Noisy | JT | BN | BN-Mix |
|---|---|---|---|
| 0.84 | 0.46 | 0.60 | - |
| | 0.48 | 0.83 | 0.85 |
| | 0.47 | 0.84 | 0.85 |
| | 0.58 | 0.85 | 0.86 |
The MSE metric for the different architectures at different values of the loss weight. (The numeric entries of this table, for the Noisy, JT, BN, and BN-Mix configurations, did not survive extraction; see Figure 6d for the corresponding plot.)
Figure 6. Graphical representation of (a) the classification accuracy, (b) the PESQ metric, (c) the STOI metric, and (d) the MSE metric against the loss-weight values for all experiments.
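The PESQ, STOI, and MSE values in the tables above can be reproduced with standard open-source metric implementations. A minimal sketch using the `pesq` and `pystoi` packages; the 16 kHz sampling rate and wide-band PESQ mode are assumptions about the evaluation setup.

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def enhancement_metrics(clean, enhanced, fs=16000):
    # clean, enhanced: 1-D float arrays of equal length, sampled at fs Hz.
    return {
        "PESQ": pesq(fs, clean, enhanced, "wb"),            # wide-band PESQ, roughly [-0.5, 4.5]
        "STOI": stoi(clean, enhanced, fs, extended=False),  # intelligibility score in [0, 1]
        "MSE": float(np.mean((clean - enhanced) ** 2)),     # waveform mean squared error
    }
```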
List of Abbreviations.
| Abbreviation | Meaning |
|---|---|
| ASR | Automatic Speech Recognition |
| SNR | Signal-to-Noise Ratio |
| SE | Speech Enhancement |
| MVDR | Minimum Variance Distortionless Response |
| SDR | Signal-to-Distortion Ratio |
| WER | Word Error Rate |
| IBM | Ideal Binary Mask |
| IRM | Ideal Ratio Mask |
| MSE | Mean Square Error |
| TCN | Temporal Convolutional Network |
| IC | Intent Classification |
| VAD | Voice Activity Detection |
| VAE | Variational Auto Encoder |
| DNN | Deep Neural Network |
| GAN | Generative Adversarial Network |
| GRF | Gated Recurrent Fusion |
| NLP | Natural Language Processing |
| SLU | Spoken Language Understanding |
| E2E | End-to-End |
| FSC | Fluent Speech Commands |
| PESQ | Perceptual Evaluation of Speech Quality |
| STOI | Short Time Objective Intelligibility |