Geon Woo Lee, Hong Kook Kim
Abstract
In this paper, a new two-step joint optimization approach based on the asynchronous subregion optimization method is proposed for training a pipeline model composed of two different models. The first-step processing of the proposed approach trains the front-end model only, and the second-step processing trains all the parameters of the combined model together. In the conventional asynchronous subregion optimization method, the first-step processing serves only the goal of the front-end model; in the proposed approach, however, the first step uses a new loss function that makes the front-end model also support the goal of the back-end model. The proposed optimization approach was applied here to a pipeline composed of a deep complex convolutional recurrent network (DCCRN)-based speech enhancement model and a conformer-transducer-based automatic speech recognition (ASR) model as the front-end and back-end, respectively. The performance of the proposed two-step joint optimization approach was then evaluated on the LibriSpeech ASR corpus in noisy environments by measuring the character error rate (CER) and word error rate (WER). In addition, an ablation study was carried out to examine the effectiveness of the proposed optimization approach on each of the processing blocks in the conformer-transducer ASR model. The ablation study showed that the conformer-transducer-based ASR model in which only the joint network was trained by the proposed optimization approach achieved the lowest average CER and WER. Moreover, the proposed optimization approach reduced the average CER and WER on the Test-Noisy dataset under matched noise conditions by 0.30% and 0.48%, respectively, compared to the approach of separately optimizing speech enhancement and ASR. Compared to the conventional two-step joint optimization approach, the proposed optimization approach provided average CER and WER reductions of 0.22% and 0.31%, respectively.
Moreover, the proposed optimization approach achieved a lower average CER and WER, by 0.32% and 0.43%, respectively, than the conventional optimization approach under mismatched noise conditions.
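The two-step procedure described in the abstract can be sketched with a toy numeric example. The scalar "models" below are purely illustrative stand-ins (the paper uses a DCCRN front-end and a conformer-transducer back-end); the point is the training schedule: step one updates only the front-end parameter with an auxiliary loss that adds the frozen back-end's objective, and step two updates all parameters with the back-end loss. All names, values, and the weighting factor `lam` are assumptions for illustration, not from the paper.

```python
# Toy sketch of two-step joint optimization on scalar stand-ins for the
# front-end (enhancement) and back-end (recognition) models.
# Hypothetical setup: front-end y = a*x, back-end z = b*y, clean target
# y_clean for the front-end, transcription target t for the back-end.

def run_two_step(x=1.0, y_clean=2.0, t=6.0, lam=0.5, lr=0.05,
                 iters1=100, iters2=100):
    a, b = 0.1, 0.1  # front-end / back-end parameters (illustrative init)

    # Step 1: train the front-end only. The auxiliary loss
    # L1 = (y - y_clean)^2 + lam * (z - t)^2 adds the back-end objective
    # (with b frozen), so the front-end also supports the back-end goal.
    for _ in range(iters1):
        y = a * x
        z = b * y
        grad_a = 2 * (y - y_clean) * x + lam * 2 * (z - t) * b * x
        a -= lr * grad_a

    # Step 2: train all parameters together on the back-end loss
    # L2 = (z - t)^2.
    for _ in range(iters2):
        y = a * x
        z = b * y
        grad_a = 2 * (z - t) * b * x
        grad_b = 2 * (z - t) * y
        a -= lr * grad_a
        b -= lr * grad_b

    return a, b, (a * b * x - t) ** 2  # final back-end loss

a, b, final_loss = run_two_step()
```

In this sketch the first step moves `a` toward a compromise between the enhancement target and the (frozen) recognition objective, and the second step then drives the end-to-end loss toward zero, mirroring the coarse-then-joint schedule of the proposed approach.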
Keywords: auxiliary loss function; joint optimization; noise-robust speech recognition; speech enhancement
Year: 2022 PMID: 35891070 PMCID: PMC9324918 DOI: 10.3390/s22145381
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Architecture of the deep complex convolutional recurrent network (DCCRN)-based speech enhancement model; the diagram uses a component-wise vector multiplication operator [30].
Figure 2. Architecture of the conformer-transducer-based speech recognition model; the diagram uses a vector-addition operator [20].
Figure 3. Block diagram of the noise-robust ASR used in this experiment.
Figure 4. Block diagram of computing the first-step loss of the proposed two-step joint optimization approach.
Figure 5. Block diagram of computing the second-step loss of the proposed two-step joint optimization approach.
Figure 6. Block diagram comparison of the joint optimization approaches between (a) the conventional joint optimization [38] and (b) the proposed two-step joint optimization, applied to the noise-robust ASR pipeline composed of DCCRN-based speech enhancement and conformer-transducer-based ASR models.
Comparison of the average character error rate (CER) and word error rate (WER) of automatic speech recognition (ASR) models trained by various optimization approaches under clean and matched noise conditions.
| Model | Training Approach | Dev-Clean CER (%) | Dev-Clean WER (%) | Dev-Noisy CER (%) | Dev-Noisy WER (%) | Test-Clean CER (%) | Test-Clean WER (%) | Test-Noisy CER (%) | Test-Noisy WER (%) |
|---|---|---|---|---|---|---|---|---|---|
| ASR-only | Clean-Condition Training | 3.50 | 8.00 | 28.08 | 39.96 | 3.53 | 8.79 | 25.68 | 37.39 |
| ASR-only | Multi-Condition Training | 3.50 | 8.00 | 9.68 | 16.61 | 3.53 | 8.54 | 9.26 | 16.43 |
| SE-ASR | Separate Optimization | 3.50 | 7.91 | 9.57 | 16.04 | 3.51 | 8.42 | 9.20 | 15.94 |
| SE+ASR | Joint Optimization | 3.48 | 7.98 | 9.61 | 16.23 | 3.51 | 8.41 | 9.21 | 15.98 |
| SE+ASR | Conventional Two-Step Joint Optimization | 3.49 | 7.91 | 9.38 | 15.81 | 3.50 | 8.37 | 9.12 | 15.77 |
| SE+ASR | Proposed Two-Step Joint Optimization | 3.48 | 7.92 | 9.31 | 15.77 | 3.50 | 8.37 | 9.07 | 15.54 |
Comparison of the average character error rate (CER) and word error rate (WER) of automatic speech recognition (ASR) models trained by various optimization approaches under mismatched noise conditions.
| Model | Training Approach | Dev-Noisy-Mismatched CER (%) | Dev-Noisy-Mismatched WER (%) | Test-Noisy-Mismatched CER (%) | Test-Noisy-Mismatched WER (%) |
|---|---|---|---|---|---|
| ASR-only | Clean-Condition Training | 27.67 | 37.20 | 24.46 | 34.37 |
| ASR-only | Multi-Condition Training | 12.26 | 20.01 | 10.46 | 18.28 |
| SE-ASR | Separate Optimization | 16.42 | 28.67 | 16.39 | 28.44 |
| SE+ASR | Joint Optimization | 12.30 | 20.81 | 10.19 | 17.80 |
| SE+ASR | Conventional Two-Step Joint Optimization | 11.88 | 19.71 | 10.00 | 17.61 |
| SE+ASR | Proposed Two-Step Joint Optimization | 11.57 | 19.68 | 9.87 | 17.37 |
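The CER and WER figures reported in the tables above are standard edit-distance metrics: the Levenshtein distance between the hypothesis and the reference, normalized by the reference length, computed at the character level (CER) or word level (WER). A minimal sketch of that computation (a generic implementation, not the paper's scoring code):

```python
# Generic CER/WER computation via Levenshtein edit distance.

def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over two sequences.
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def wer(ref, hyp):
    # Word error rate (%): edit distance over word tokens.
    r, h = ref.split(), hyp.split()
    return 100.0 * edit_distance(r, h) / len(r)

def cer(ref, hyp):
    # Character error rate (%): edit distance over characters,
    # here ignoring spaces (conventions vary on this point).
    r, h = ref.replace(" ", ""), hyp.replace(" ", "")
    return 100.0 * edit_distance(r, h) / len(r)
```

For example, `wer("this is a test", "this is the test")` yields 25.0, since one of the four reference words is substituted.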
Ablation study of the proposed two-step joint optimization approach applied to various combinations of the processing blocks in the conformer-transducer speech recognition model, measured by average character error rate (CER) and word error rate (WER), where a check mark (√) indicates that the corresponding block is trained by the proposed two-step joint optimization approach.
| Speech Enhancement | Encoder | Prediction Network | Joint Network | Test-Clean CER (%) | Test-Clean WER (%) | Test-Noisy CER (%) | Test-Noisy WER (%) | Test-Noisy-Mismatched CER (%) | Test-Noisy-Mismatched WER (%) |
|---|---|---|---|---|---|---|---|---|---|
| √ |  |  |  | 3.51 | 8.34 | 9.09 | 16.27 | 9.92 | 17.73 |
| √ | √ |  |  | 3.53 | 8.60 | 10.85 | 18.14 | 10.27 | 17.92 |
| √ |  |  | √ | 3.50 | 8.29 | 8.90 | 15.46 | 9.42 | 16.57 |
| √ | √ | √ |  | 3.53 | 8.61 | 10.79 | 17.99 | 10.18 | 17.90 |
| √ | √ | √ | √ | 3.50 | 8.37 | 9.07 | 15.54 | 9.87 | 17.49 |