Wentao Yu, Steffen Zeiler, Dorothea Kolossa.
Abstract
Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR, whether hybrid or end-to-end (E2E), still does not appear to make optimal use of this secondary information stream, as the performance is still clearly diminished in noisy conditions for large-vocabulary systems. We therefore propose a new fusion architecture, the decision fusion net (DFN). A broad range of time-variant reliability measures is used as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream-weighting, which is considered to be the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline system, achieves a relative word-error-rate reduction of 43%, both showing the efficacy of our proposed fusion architecture.
Keywords: audio-visual speech recognition; decision fusion net; end-to-end recognition; hybrid models; reliability measures
Year: 2022 PMID: 35898005 PMCID: PMC9370936 DOI: 10.3390/s22155501
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Audio-visual fusion based on the DFN, applied to one stream of audio and two streams of video features.
Figure 2. Audio encoder (left), video encoder (middle) and reliability measure encoder (right) for both modalities. The blue blocks are used to align video features with audio features; the turquoise block shows the transformer encoder.
Figure 3. Transformer encoder for both modalities. The blue block shows the sub-sampling, whereas the turquoise blocks comprise the transformer encoder.
Figure 4. Transformer decoder (left) and CTC decoder (right) for both modalities.
Figure 5. DFN fusion topology for the E2E model.
Overview of reliability measures.
| Model-Based | Signal-Based: Audio-Based | Signal-Based: Video-Based |
|---|---|---|
| Entropy | MFCC | Confidence |
Characteristics of the utilized datasets.
| Subset | Utterances | Vocabulary | Duration [hh:mm] |
|---|---|---|---|
| LRS2 pre-train | 96,318 | 41,427 | 196:25 |
| LRS2 train | 45,839 | 17,660 | 28:33 |
| LRS2 validation | 1082 | 1984 | 00:40 |
| LRS2 test | 1243 | 1698 | 00:35 |
| LRS3 pre-train | 118,516 | 51 k | 409:10 |
Figure 6. Decision fusion net structure for the hybrid model. The turquoise block indicates the successively repeated layers.
Figure 7. (left) and (right). The turquoise blocks indicate the successively repeated layers.
Decoding results for three exemplary sentences S1, S2 and S3. RT represents the reference transcription; AO is the audio-only model; EI is early integration; CE and MSE represent dynamic stream weighting with CE and MSE as loss functions; OW is the oracle stream-weighting; and LSTM-DFN and BLSTM-DFN are variants of our proposed integration model.
| Sentence | Type | Result |
|---|---|---|
| S1 | RT | However, what a surprise when you come in |
| S1 | AO | However, what a surprising coming |
| S1 | EI | However, what a surprising coming |
| S1 | CE | However, what a surprising coming |
| S1 | MSE | However, what a surprising coming |
| S1 | OW | However, what a surprising coming |
| S1 | LSTM-DFN | However, what a surprising coming |
| S1 | BLSTM-DFN | However, what a surprise when you come in |
| S2 | RT | I’m not massively happy |
| S2 | AO | I’m not mass of the to |
| S2 | EI | Some more massive happy |
| S2 | CE | I’m not massive into |
| S2 | MSE | I’m not massive into |
| S2 | OW | I’m not mass of the happiest |
| S2 | LSTM-DFN | I’m not massive it happened |
| S2 | BLSTM-DFN | I’m not massively happy |
| S3 | RT | Better street lighting can help |
| S3 | AO | Benefit lighting hope |
| S3 | EI | However, the street lighting and hope |
| S3 | CE | Benefit lighting hope |
| S3 | MSE | Benefit lighting hope |
| S3 | OW | In the street lighting hope |
| S3 | LSTM-DFN | However, the street lighting and hope |
| S3 | BLSTM-DFN | Better street lighting can help |
Figure 8. Estimated log-posteriors of sentence S2 for the target state , with additive noise at dB. All abbreviations are the same as in Table 3. The whiskers show the maximum and minimum values; the lower and upper bounds of the green blocks represent the 25th and 75th percentiles, respectively; the yellow line in the center of the green block indicates the median.
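The box-plot elements named in the Figure 8 caption (whiskers at the minimum and maximum, box bounds at the 25th and 75th percentiles, center line at the median) can be computed from a list of per-frame log-posteriors. The following is a minimal sketch in Python. The function name `box_plot_summary` and the sample values are hypothetical and not taken from the paper, and the inclusive quantile method is an assumption, since the caption does not say which percentile definition was used.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the five statistics that the box plot caption describes."""
    # n=4 yields the three quartile cut points: 25th, 50th and 75th percentile.
    # method="inclusive" treats the data as the full population (assumption).
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),   # lower whisker
        "q1": q1,             # lower bound of the box
        "median": median,     # line inside the box
        "q3": q3,             # upper bound of the box
        "max": max(values),   # upper whisker
    }


# Hypothetical log-posterior values, for illustration only.
example_log_posteriors = [-5.0, -4.0, -3.0, -2.0, -1.0]
summary = box_plot_summary(example_log_posteriors)
print(summary)
```

One such summary per fusion method, computed over the frames of a single utterance, would give the values needed to draw a plot of this kind.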
Figure 9. WER (%) on the test set of the LRS2 corpus in different noise conditions.
Word error rate (%) on the LRS2 test set under additive noise.
| Model | −9 dB | −6 dB | −3 dB | 0 dB | 3 dB | 6 dB | 9 dB | Clean | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| AO | 48.96 | 41.44 | 33.07 | 30.81 | 22.85 | 18.89 | 16.49 | 10.12 | 27.83 |
| VA | 85.83 | 87.00 | 85.26 | 88.10 | 87.03 | 88.44 | 88.25 | 88.10 | 87.25 |
| VS | 88.11 | 90.27 | 87.29 | 88.88 | 85.88 | 85.33 | 88.58 | 87.10 | 87.68 |
| EI | 40.14 | 32.47 | 23.96 | 26.59 | 20.67 | 16.68 | 14.76 | 10.02 | 23.16 |
| MSE | 46.48 | 37.79 | 27.45 | 27.47 | 19.52 | 16.58 | 15.09 | 9.42 | 24.98 |
| CE | 45.79 | 37.14 | 26.32 | 28.03 | 19.40 | 16.68 | 14.76 | 9.42 | 24.65 |
| OW | 30.33 | 26.47 | | 21.25 | | 11.66 | | | 17.10 |
| LSTM-DFN | 33.30 | 27.22 | 21.26 | 21.25 | 19.17 | 13.97 | 15.84 | 10.32 | 20.29 |
| BLSTM-DFN | | | 17.89 | | 14.93 | | 10.78 | 7.84 | |
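The average column and the relative word-error-rate reductions quoted in the abstract follow from simple arithmetic on such rows. The sketch below uses the AO and LSTM-DFN rows of the table above. It assumes that "Avg." is the unweighted mean over the eight listed conditions (which reproduces the reported 27.83 and 20.29) and that relative reduction is defined as (baseline − system) / baseline; the function name `relative_wer_reduction` is hypothetical.

```python
from statistics import mean

# WER (%) at -9, -6, -3, 0, 3, 6, 9 dB and clean, copied from the table above.
WER_AO = [48.96, 41.44, 33.07, 30.81, 22.85, 18.89, 16.49, 10.12]
WER_LSTM_DFN = [33.30, 27.22, 21.26, 21.25, 19.17, 13.97, 15.84, 10.32]


def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent (assumed standard definition)."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Unweighted mean over all conditions (assumption about how Avg. is formed).
avg_ao = mean(WER_AO)
avg_lstm_dfn = mean(WER_LSTM_DFN)
reduction = relative_wer_reduction(avg_ao, avg_lstm_dfn)

print(f"AO average WER:       {avg_ao:.2f} %")
print(f"LSTM-DFN average WER: {avg_lstm_dfn:.2f} %")
print(f"Relative reduction:   {reduction:.1f} %")
```

Under these assumptions, the LSTM-DFN variant reduces the average WER of the audio-only model by roughly 27% relative on this test set.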
Asterisks indicate a statistically significant difference compared with the audio-only model (AO). *** denotes p ⩽ 0.001, ** shows 0.001 < p ⩽ 0.01, * corresponds to 0.01 < p ⩽ 0.05, and ns indicates results where p > 0.05.
| Model | −9 dB | −6 dB | −3 dB | 0 dB | 3 dB | 6 dB | 9 dB | Clean | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| EI | *** | *** | *** | * | ns | ns | ns | ns | *** |
| MSE | * | *** | *** | ns | * | ** | ** | ns | *** |
| CE | ns | *** | *** | ns | * | ** | ** | ns | *** |
| OW | *** | *** | *** | *** | *** | *** | *** | *** | *** |
| LSTM-DFN | *** | *** | *** | *** | * | *** | ns | ns | *** |
| BLSTM-DFN | *** | *** | *** | *** | *** | *** | *** | * | *** |
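The significance markers used in these tables map p-values to symbols via the thresholds stated in the caption. A minimal Python sketch of that mapping follows; the function name `significance_marker` is hypothetical, and the sketch covers only the labeling step, not the statistical test that produces the p-values.

```python
def significance_marker(p_value):
    """Map a p-value to the marker defined in the table caption."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value <= 0.001:
        return "***"   # p <= 0.001
    if p_value <= 0.01:
        return "**"    # 0.001 < p <= 0.01
    if p_value <= 0.05:
        return "*"     # 0.01 < p <= 0.05
    return "ns"        # p > 0.05, not significant


# Illustrative p-values only; they are not results from the paper.
for p in (0.0005, 0.005, 0.03, 0.2):
    print(p, significance_marker(p))
```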
Far-field AVSR WER (%) and statistical significance compared with the AO model on the LRS2 dataset. *** denotes p ⩽ 0.001, ** shows 0.001 < p ⩽ 0.01.
| AO | EI | MSE | CE | OW | LSTM-DFN | BLSTM-DFN |
|---|---|---|---|---|---|---|
| 23.61 | 19.15 (**) | 19.54 (***) | 19.44 (***) | 15.67 (***) | 15.28 (***) |
BLSTM-DFN word error rates (%) on the LRS2 test set under additive noise. All: apply all reliability indicators as shown in Table 1; : all audio-based reliability indicators; : all video-based reliability indicators; : using the video-based reliability indicators, excluding the image distortion estimates; : using all reliability indicators except for image distortion estimates; None: proposed model without reliabilities. Avg: Average performance, together with the significance of improvements (compared with None). ns: not significant and ***: p ⩽ 0.001.
| Model | −9 dB | −6 dB | −3 dB | 0 dB | 3 dB | 6 dB | 9 dB | Clean | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| | 27.55 | 23.11 | 17.89 | 16.35 | 14.93 | 10.25 | 10.78 | 7.84 | 16.09 |
| | 23.39 | | 14.51 | 15.68 | | 8.44 | 10.67 | | 13.82 *** |
| | 98.12 | 98.50 | 98.76 | 98.22 | 99.43 | 98.79 | 99.46 | 98.81 | 98.76 |
| | 25.97 | 21.23 | 17.66 | 17.58 | 14.24 | 10.85 | | 7.54 | 15.60 |
| | 24.48 | 21.70 | 17.55 | 18.35 | 16.07 | 9.35 | 12.07 | 8.43 | 16.00 |
| | | 18.52 | | | 13.66 | | 9.91 | 7.84 | |
Performance of the audio-visual and uni-modal speech recognition (WER [%]). AO: audio only. VO: video only. AV: AV baseline [9]. DFN: proposed DFN fusion. m: music noise. a: ambient noise. vc: clean visual data. gb: visual Gaussian blur. sp: visual salt-and-pepper noise.
| Model | −12 dB | −9 dB | −6 dB | −3 dB | 0 dB | 3 dB | 6 dB | 9 dB | 12 dB | Clean | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AO (m) | 18.9 | 13.7 | 11.2 | 8.4 | 6.3 | 6.8 | 4.5 | 4.1 | 4.3 | 4.2 | 8.2 |
| AO (a) | 25.7 | 23.4 | 18.5 | 11.6 | 8.2 | 9.0 | 5.9 | 3.8 | 4.4 | 4.2 | 11.5 |
| VO (vc) | 58.7 | 61.0 | 61.7 | 69.6 | 69.6 | 63.5 | 64.6 | 63.6 | 66.6 | 61.9 | 64.1 |
| VO (gb) | 66.6 | 69.2 | 71.0 | 68.5 | 68.5 | 71.1 | 62.7 | 69.4 | 67.6 | 66.9 | 68.2 |
| VO (sp) | 68.5 | 72.5 | 73.7 | 70.1 | 70.1 | 70.6 | 68.3 | 69.1 | 73.1 | 67.9 | 70.4 |
| AV (m.vc) | 14.6 | 11.8 | 6.4 | 7.9 | 7.9 | 6.3 | 5.2 | 4.4 | 3.4 | 4.0 | 7.2 |
| DFN (m.vc) | | | | | | | | | | | |
| AV (a.vc) | 19.1 | 19.0 | 14.3 | 7.3 | 6.3 | 6.0 | 5.7 | 4.5 | 4.9 | 4.0 | 9.1 |
| DFN (a.vc) | | | | | | | | | | | |
| AV (a.gb) | 20.6 | 18.9 | 15.0 | 7.7 | 6.8 | 7.5 | 5.9 | 3.9 | 4.8 | 4.0 | 9.5 |
| DFN (a.gb) | | | | | | | | | | | |
| AV (a.sp) | 19.5 | 19.9 | 15.3 | 7.7 | 7.2 | 6.3 | 5.6 | 4.4 | 4.6 | 4.3 | 9.5 |
| DFN (a.sp) | | | | | | | | | | | |
Statistical significance tests, comparing the results of different model setups. *** denotes p ⩽ 0.001, ** shows 0.001 < p ⩽ 0.01, * corresponds to 0.01 < p ⩽ 0.05, and ns indicates results where p > 0.05; the other abbreviations are described in Table 8.
| Model | −12 dB | −9 dB | −6 dB | −3 dB | 0 dB | 3 dB | 6 dB | 9 dB | 12 dB | Clean | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AO-AV (m.vc) | * | ns | *** | ns | ns | ns | ns | ns | ns | ns | *** |
| AO-DFN (m.vc) | *** | *** | *** | ** | ns | ** | ns | ns | * | *** | *** |
| AV-DFN (m.vc) | ** | ** | ns | ** | *** | * | ns | * | ns | ** | *** |
| AO-AV (a.vc) | *** | ** | ** | ** | ns | ** | ns | ns | ns | ns | *** |
| AO-DFN (a.vc) | *** | *** | *** | *** | *** | *** | ** | ns | ns | *** | *** |
| AV-DFN (a.vc) | ** | *** | *** | * | * | ns | ** | ns | ns | ** | *** |
| AO-DFN (a.gb) | *** | *** | *** | *** | *** | *** | * | ns | ns | ** | *** |
| AV-DFN (a.gb) | *** | *** | *** | * | ** | * | ** | ns | ns | * | *** |
| AO-DFN (a.sp) | *** | *** | *** | *** | ** | ** | *** | ns | ns | ** | *** |
| AV-DFN (a.sp) | * | *** | *** | * | * | ns | ** | * | ns | ** | *** |
Performance of the proposed E2E DFN fusion (WER [%]), based on the different E2E reliability indicator configurations. Among these, applies only audio-based reliability indicators and applies only video-based reliability indicators. None: proposed model without reliability information; All: use all reliability indicators. Other abbreviations as defined in Table 8. Avg: Average performance, together with the significance of improvements (compared with None). ns: not significant, ***: p ⩽ 0.001, **: 0.001 < p ⩽ 0.01 and *: 0.01 < p ⩽ 0.05.
| Model | −12 dB | −9 dB | −6 dB | −3 dB | 0 dB | 3 dB | 6 dB | 9 dB | 12 dB | Clean | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 11.2 | 9.4 | 6.5 | | 5.4 | 5.5 | | | 2.3 | | 5.4 * |
| (a.vc) | 14.9 | 14.5 | 10.0 | 6.6 | 4.2 | 5.8 | 4.3 | | | | 6.8 |
| (a.gb) | 16.4 | 14.3 | 10.7 | 6.3 | 4.8 | 6.0 | 4.6 | | | | 7.1 ** |
| (a.sp) | 17.1 | 15.7 | 11.3 | 6.6 | | 6.1 | 4.5 | 2.8 | | | 7.4 |
| | | | 6.2 | 5.3 | 5.3 | 5.6 | 3.7 | | 2.6 | 2.7 | 5.3 * |
| (a.vc) | | 14.9 | 11.0 | 6.4 | 5.6 | 6.6 | 5.2 | 3.3 | 3.6 | 2.7 | 7.4 |
| (a.gb) | 16.4 | 15.2 | 11.3 | 6.9 | 4.9 | 6.4 | 4.7 | 3.6 | 3.4 | 2.6 | 7.5 |
| (a.sp) | 16.1 | 15.0 | 11.4 | 6.6 | 5.3 | 6.1 | 5.1 | 3.1 | 3.4 | | 7.5 |
| | 11.8 | 8.8 | 6.7 | 7.5 | 6.0 | 5.6 | | 3.6 | 3.0 | 3.7 | 6.0 |
| (a.vc) | 14.9 | 15.2 | 11.3 | 6.0 | 5.2 | 5.9 | 5.6 | 3.8 | 3.3 | 3.7 | 7.5 |
| (a.gb) | 17.2 | 15.1 | 12.6 | 6.8 | 5.7 | 6.3 | 6.6 | 4.4 | 3.6 | 3.6 | 8.2 |
| (a.sp) | 16.7 | 16.6 | 12.4 | 6.1 | 6.0 | 5.9 | 5.7 | 3.4 | 3.4 | 3.5 | 8.0 |
| | 11.1 | 8.7 | | 4.8 | | | | 3.3 | | | |
| (a.vc) | | | | | | | | | 3.6 | | |
| (a.gb) | | | | | | | | | 4.1 | 2.6 | |
| (a.sp) | | | | | 4.7 | | | | 4.0 | | |