Prashanth Gurunath Shivakumar, Panayiotis Georgiou, Shrikanth Narayanan.
Abstract
Word vector representations enable machines to encode human language for spoken language understanding and processing. Confusion2vec, motivated by human speech production and perception, is a word vector representation that encodes the ambiguities present in human spoken language in addition to semantic and syntactic information, and thereby provides a robust spoken language representation. In this paper, we propose a novel word vector space estimated by unsupervised learning on lattices output by an automatic speech recognition (ASR) system. We encode each word in the Confusion2vec vector space by its constituent subword character n-grams. We show that the subword encoding better represents the acoustic perceptual ambiguities in human spoken language via information modeled on the lattice-structured ASR output. The usefulness of the proposed Confusion2vec representation is evaluated using analogy and word-similarity tasks designed to assess semantic, syntactic and acoustic word relations. We also show the benefits of subword modeling for acoustic ambiguity representation on the task of spoken language intent detection. The results significantly outperform existing word vector representations when evaluated on erroneous ASR outputs, with relative improvements of up to 13.12% over the previous state-of-the-art in intent detection on the ATIS benchmark dataset. We demonstrate that Confusion2vec subword modeling eliminates the need for retraining or adapting natural language understanding models on ASR transcripts.
Year: 2022 PMID: 35245327 PMCID: PMC8896703 DOI: 10.1371/journal.pone.0264488
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
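As the abstract describes, each word vector is composed from its constituent character n-grams, in the style of fastText. Below is a minimal sketch of that encoding, assuming a hypothetical `ngram_table` lookup and illustrative n-gram sizes; it is not the authors' code.

```python
import numpy as np

def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Extract character n-grams, with word-boundary markers as in fastText."""
    token = f"<{word}>"
    grams = [token[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(token) - n + 1)]
    return grams + [token]  # keep the full word as one additional unit

def word_vector(word: str, ngram_table: dict, dim: int = 300) -> np.ndarray:
    """Sum the embeddings of all constituent n-grams (hypothetical lookup)."""
    vec = np.zeros(dim)
    for gram in char_ngrams(word):
        if gram in ngram_table:
            vec += ngram_table[gram]
    return vec
```

Because an out-of-vocabulary word (for example, an ASR error) shares character n-grams with in-vocabulary words, it still receives a usable vector; this is what allows the out-of-vocabulary word "prinz" in Fig 2 to be represented.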
Fig 1. Example confusion network output by ASR for the ground-truth phrase "I want to sit".
Figure adapted from P. G. Shivakumar and P. Georgiou, “Confusion2vec: Towards enriching vector space word representations with representational ambiguities,” PeerJ Computer Science, vol. 5, p. e195, 2019 [10].
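A confusion network such as the one in Fig 1 can be held as a list of time bins, each containing competing word hypotheses with posterior probabilities. In the sketch below the alternative words and probabilities are assumptions made for illustration, not the figure's actual arcs; the greedy best path loosely corresponds to the "Top-Confusion" (C2V-1) view named in the tables that follow.

```python
# Each bin holds (word, posterior) hypotheses; the values are illustrative.
ConfusionNetwork = list[list[tuple[str, float]]]

network: ConfusionNetwork = [
    [("i", 0.90), ("eye", 0.10)],
    [("want", 0.80), ("won't", 0.20)],
    [("to", 0.70), ("two", 0.20), ("too", 0.10)],
    [("sit", 0.60), ("seat", 0.40)],
]

def top_confusion_path(net: ConfusionNetwork) -> list[str]:
    """Greedy best path: the top hypothesis in every bin."""
    return [max(bin_, key=lambda h: h[1])[0] for bin_ in net]

print(top_confusion_path(network))  # ['i', 'want', 'to', 'sit']
```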
Table 1. Results: Different proposed models.
| Model | S&S Analogy | Acoustic Analogy | S&S-Acoustic Analogy | Average Analogy Accuracy | Word Similarity | Acoustic Similarity |
|---|---|---|---|---|---|---|
| Google W2V [ ] | 61.42% | 0.9% | 16.99% | 26.44% | | -0.3489 |
| In-domain W2V | 59.17% | 0.6% | 8.15% | 22.64% | 0.4417 | -0.4377 |
| fastText [ ] | 75.93% | 0.46% | 17.40% | 31.26% | 0.7361 | -0.3659 |
| In-domain fastText | 46.45% | 0.75% | 17.05% | 21.42% | 0.3584 | 0.2610 |
| C2V-1 | 70.56% | 1.46% | 23.86% | 31.96% | 0.6036 | -0.4327 |
| C2V-a | 63.97% | 16.92% | 43.34% | 41.41% | 0.5228 | 0.6200 |
| C2V-c | | 27.33% | 38.29% | 43.69% | 0.5798 | 0.5825 |
| C2V-1 | 56.83% | 1.46% | 20.99% | 26.43% | 0.3720 | 0.3022 |
| C2V-a | 56.74% | | | | 0.2929 | |
| C2V-c | 56.87% | | | | 0.2893 | |
C2V-a: Intra-Confusion; C2V-c: Inter-Confusion; C2V-1: Top-Confusion; S&S: Semantic & Syntactic Analogy.
The results of the analogy tasks represent percentage accuracy, and the results of the similarity tasks represent Spearman correlation.
For the analogy tasks: the accuracies of the baseline word2vec and fastText models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in [10]). For the similarity tasks: all Spearman correlations are statistically significant with p < 0.001. See Tables 6 and 8 in S1 Appendix for more detailed results.
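The analogy evaluation referenced in this footnote answers queries of the form a : b :: c : ? by nearest-neighbour search around v(b) - v(a) + v(c), counting a query as correct when the expected word appears among the top-k candidates (top-1 for the baselines, top-2 for the Confusion2vec models). A minimal sketch, assuming `vocab` is a word list and `vectors` a row-aligned embedding matrix:

```python
import numpy as np

def topk_analogy(a, b, c, vocab, vectors, k=2):
    """Return the k nearest words to v(b) - v(a) + v(c) by cosine similarity."""
    norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    idx = {w: i for i, w in enumerate(vocab)}
    query = norm[idx[b]] - norm[idx[a]] + norm[idx[c]]
    sims = norm @ (query / np.linalg.norm(query))
    for w in (a, b, c):  # the query words themselves are excluded
        sims[idx[w]] = -np.inf
    return [vocab[i] for i in np.argsort(-sims)[:k]]
```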
Table 2. Results: Concatenated models.
| Model | S&S Analogy | Acoustic Analogy | S&S-Acoustic Analogy | Average Analogy Accuracy | Word Similarity | Acoustic Similarity |
|---|---|---|---|---|---|---|
| Google W2V [ ] | 61.42% | 0.9% | 16.99% | 26.44% | | -0.3489 |
| In-domain W2V | 59.17% | 0.6% | 8.15% | 22.64% | 0.4417 | -0.4377 |
| fastText [ ] | 75.93% | 0.46% | 17.40% | 31.26% | 0.7361 | -0.3659 |
| C2V-1 + C2V-a | 67.03% | 25.43% | 40.36% | 44.27% | 0.5102 | 0.7231 |
| C2V-1 + C2V-c | 70.84% | 35.25% | 35.18% | 47.09% | 0.5609 | 0.6345 |
| C2V-1 + C2V-c (UJO) | 65.88% | 41.51% | | | 0.5379 | |
| fastText + C2V-a | | 22.67% | | | 0.5744 | |
| fastText + C2V-c | | 22.56% | | | 0.5732 | |
C2V-a: Intra-Confusion; C2V-c: Inter-Confusion; C2V-1: Top-Confusion; S&S: Semantic & Syntactic Analogy; UJO: Unrestricted Joint Optimization (see [10]). The results of the analogy tasks represent percentage accuracy, and the results of the similarity tasks represent Spearman correlation. For the analogy tasks: the accuracies of the baseline word2vec and fastText models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in [10]). For the similarity tasks: all Spearman correlations are statistically significant with p < 0.001. See Tables 7 and 9 in S1 Appendix for more detailed results.
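The concatenated models in Table 2 join two embedding spaces along the feature axis, and the similarity columns are Spearman correlations between human ratings and the model's cosine similarities. A sketch under those assumptions (variable names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def concatenate_spaces(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Row-aligned vocabularies assumed; result dim = dim_a + dim_b."""
    return np.hstack([emb_a, emb_b])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_score(pairs, human_ratings, vectors, idx):
    """pairs: list of (w1, w2); idx: word -> row. Returns Spearman's rho."""
    model_scores = [cosine(vectors[idx[w1]], vectors[idx[w2]])
                    for w1, w2 in pairs]
    return spearmanr(human_ratings, model_scores).correlation
```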
Fig 2. 2-D plots of selected word vectors portraying semantic, syntactic and acoustic relationships after dimension reduction using PCA.
The blue lines indicate semantic relationships, blue ellipses indicate syntactic relationships, orange lines indicate acoustic-semantic/syntactic relations and orange ellipses indicate acoustic ambiguity word relations. Plots with identical word sets corresponding to Confusion2Vec 1.0 and Google W2V can be found in [11]. Please note that the out-of-vocabulary word “prinz” cannot be represented in Google W2V and Confusion2Vec 1.0 spaces.
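Fig 2-style plots can be reproduced by projecting the selected word vectors to two dimensions with PCA and labeling each point, as in this sketch (the word list and embedding matrix are assumed inputs):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_words_2d(words, vectors):
    """Project word embeddings to 2-D with PCA and annotate each point."""
    coords = PCA(n_components=2).fit_transform(vectors)
    fig, ax = plt.subplots()
    ax.scatter(coords[:, 0], coords[:, 1])
    for (x, y), word in zip(coords, words):
        ax.annotate(word, (x, y))
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    plt.show()
```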
Table 3. Intent Classification Error Rates (CER): Trained on clean reference transcripts, evaluated on clean reference and noisy ASR transcripts.
| Embeddings | Model | Reference CER % | ASR CER % | Δdiff |
|---|---|---|---|---|
| Context-Free Embeddings | Random | 2.69 | 10.75 | 8.06 |
| | GloVe [ ] | 1.90 | 8.17 | 6.27 |
| | Word2Vec [ ] | 2.69 | 8.06 | 5.37 |
| | fastText [ ] | 1.90 | 8.40 | 6.50 |
| | Joint SLU-LM [ ] | 1.90 | 9.41 | 7.51 |
| | Attn. RNN Joint SLU [ ] | 1.79 | 8.06 | 6.27 |
| | Slot-Gated Attn. [ ] | 3.92 | 10.64 | 6.72 |
| | Self Attn. SLU [ ] | 2.02 | 9.18 | 7.16 |
| | SF-ID Network [ ] | 3.14 | 10.53 | 7.39 |
| | C2V 1.0 [ ] | 2.46 | 6.38 | 3.92 |
| Contextual Embeddings | ELMo [ ] | 1.79 | 6.83 | 5.04 |
| | ELMo [ ] | 1.46 | 7.05 | 5.59 |
| | BERT [ ] | 1.79 | 7.05 | 5.26 |
| | BERT [ ] | 1.12 | 6.16 | 5.04 |
| | Joint BERT [ ] | 2.46 | 7.73 | 5.27 |
| | ASR Robust ELMo (unsup.) [ ] | 3.24 | 5.26 | 2.02 |
| | ASR Robust ELMo (sup.) [ ] | 3.46 | 5.03 | 1.57 |
| Proposed Context-Free Embeddings | C2V-c 2.0 | 3.36 | 5.82 | 2.46 |
| | C2V-a 2.0 | 2.46 | | |
| | fastText + C2V-c 2.0 | 1.79 | 4.70 | 2.91 |
| | fastText + C2V-a 2.0 | 1.90 | 5.04 | 3.14 |
Δdiff is the absolute degradation of a model from clean reference transcripts to ASR transcripts. C2V 1.0 corresponds to C2V-1 + C2V-c (UJO) in Tables 1 and 2.
† indicates joint modeling of intent and slot-filling.
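For reference, the two metrics in the table above written out in code: CER is the percentage of utterances with a wrongly predicted intent, and Δdiff is the clean-to-ASR degradation.

```python
def cer(predictions: list[str], labels: list[str]) -> float:
    """Classification error rate: % of utterances with the wrong intent."""
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return 100.0 * wrong / len(labels)

def delta_diff(cer_reference: float, cer_asr: float) -> float:
    """Absolute degradation from clean reference to ASR transcripts."""
    return cer_asr - cer_reference

# Consistency check against the Random row above: 10.75 - 2.69 = 8.06.
assert abs(delta_diff(2.69, 10.75) - 8.06) < 1e-9
```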
Table 4. Examples of intent detection: Trained on clean reference text, evaluated on ASR transcripts.
| System | Text | True Intent | Predicted Intent (fastText) | Predicted Intent (concat C2V-c 2.0) |
|---|---|---|---|---|
| Manual | "what is the seating capacity of a DC9" | Capacity | Meal | Capacity |
| ASR | "what is […]" | | | |
| Manual | "what is the lowest fare for a flight from washington dc to boston" | Airfare | Flight | Airfare |
| ASR | "what is the lowest […]" | | | |
| Manual | "list fares from washington dc to boston" | Airfare | Flight | Flight |
| ASR | "[…]" | | | |
| Manual | "what does fare code bh mean" | Abbreviation | Ground Service | Abbreviation |
| ASR | "[…]" | | | |
Manual refers to clean, human-annotated transcripts. ASR refers to the automatic speech transcription by the ASR. The bold text highlights the errors made by the ASR. "concat C2V-c 2.0" refers to the proposed model: the concatenated fastText + inter-confusion (C2V-c 2.0) model.
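As an illustration of how such context-free embeddings feed an intent detector, here is a minimal sketch (not the paper's architecture): mean-pool the word vectors of an utterance and train a linear classifier on top. The `embed` lookup is an assumed token-to-vector function, e.g. a fastText or Confusion2vec table.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def utterance_vector(tokens, embed, dim=300):
    """Mean-pool the word vectors of an utterance."""
    vecs = [embed(t) for t in tokens]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_intent_classifier(utterances, intents, embed, dim=300):
    """utterances: list of strings; intents: list of intent labels."""
    X = np.stack([utterance_vector(u.split(), embed, dim) for u in utterances])
    return LogisticRegression(max_iter=1000).fit(X, intents)
```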
Table 5. Intent Classification Error Rates (CER): Trained and evaluated on noisy ASR transcripts.
| Model | WER % | CER % |
|---|---|---|
| Random | 18.54 | 5.15 |
| GloVe [ ] | 18.54 | 6.94 |
| Word2Vec [ ] | 18.54 | 5.49 |
| Schumann et al., 2018 [ ] | 10.55 | 5.04 |
| C2V 1.0 | 18.54 | 4.70 |
| C2V-c 2.0 | 18.54 | 4.82 |
| C2V-a 2.0 | 18.54 | |
| fastText + C2V-c 2.0 | 18.54 | |
| fastText + C2V-a 2.0 | 18.54 | |
C2V 1.0 corresponds to C2V-1 + C2V-c (UJO) in Tables 1 and 2. Note that we do not domain-constrain, optimize, or re-score our ASR, as in [44].
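The WER column above is the standard word error rate, i.e. the Levenshtein distance between hypothesis and reference word sequences normalized by the reference length; a self-contained sketch:

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate: edit distance / reference length, in percent."""
    if not reference:
        raise ValueError("reference must be non-empty")
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 100.0 * d[m][n] / m
```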