Heting Gao, Junrui Ni, Yang Zhang, Kaizhi Qian, Shiyu Chang, Mark Hasegawa-Johnson.
Abstract
A language-independent automatic speech recognizer (ASR) is one that can be used for phonetic transcription in languages other than those on which it was trained. Language-independent ASR is difficult to train, because different languages implement phones differently: even when phonemes in two different languages are written using the same symbols in the International Phonetic Alphabet, they are differentiated by different distributions of language-dependent redundant articulatory features. This article demonstrates that the goal of language-independence may be approximated in different ways, depending on the size of the training set, the presence vs. absence of familial relationships between the training and test languages, and the method used to implement phone recognition or classification. When the training set contains many languages, and when every language in the test set is related to (shares a language family with) a language in the training set, then language-independent ASR may be trained using an empirical risk minimization strategy (e.g., using connectionist temporal classification without extra regularizers). When the training set is limited to a small number of languages from one language family, however, and the test languages are not from the same language family, then the best performance is achieved by using domain-invariant representation learning strategies. Two different representation learning strategies are tested in this article: invariant risk minimization and regret minimization. We find that invariant risk minimization is better at the task of phone token classification (given known segment boundary times), while regret minimization is better at the task of phone token recognition.
Keywords: automatic speech recognition; distributionally robust optimization; domain generalization; invariant risk minimization; regret minimization; under-resourced languages
Year: 2022 PMID: 35647534 PMCID: PMC9133481 DOI: 10.3389/frai.2022.806274
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Figure 1. The phonemic transcript, Y, captures a limited set of information about the speech signal, X. The limits of the transcription process depend on the language environment. Language-independent ASR finds a feature embedding, Z = ϕ(X), such that the relationship between Z and Y is independent of that environment.
Figure 2. How to compute the risk for regret minimization in a two-environment setting. In addition to the ERM risk calculated on inputs from both environments under the shared feature extractor and classifier, regret minimization adds a regret term for each environment to the total risk. The regret for environment 1, for example, is calculated by feeding the inputs from environment 1 into the feature extractor, and then into both the classifier trained on environment 1 and the classifier trained on all environments except environment 1. The regret is the difference between the loss of the out-of-environment classifier and the loss of the within-environment classifier. The regret for environment 2 is calculated in the same way.
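The calculation described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal toy illustration, not the authors' implementation: the one-parameter `logistic` classifier, the scalar "features", the binary cross-entropy loss, and the unweighted sum of regret terms are all assumptions made for the example, since the caption does not specify the loss or any weighting.

```python
import math
from typing import Callable, Dict, List, Tuple

# Toy stand-ins (assumptions): a "feature" is a single float, as if it were the
# output of the shared feature extractor, and a classifier maps it to P(label = 1).
Example = Tuple[float, int]
Classifier = Callable[[float], float]


def logistic(weight: float) -> Classifier:
    """Hypothetical one-parameter classifier used only for this sketch."""
    return lambda z: 1.0 / (1.0 + math.exp(-weight * z))


def mean_loss(clf: Classifier, data: List[Example]) -> float:
    """Average binary cross-entropy of clf on data."""
    eps = 1e-12
    total = 0.0
    for z, y in data:
        p = min(max(clf(z), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def regret(env: int,
           data_by_env: Dict[int, List[Example]],
           within: Dict[int, Classifier],
           held_out: Dict[int, Classifier]) -> float:
    """Out-of-environment loss minus within-environment loss, both on env's data."""
    data = data_by_env[env]
    return mean_loss(held_out[env], data) - mean_loss(within[env], data)


def rgm_risk(data_by_env: Dict[int, List[Example]],
             shared: Classifier,
             within: Dict[int, Classifier],
             held_out: Dict[int, Classifier]) -> float:
    """ERM risk of the shared classifier plus one regret term per environment."""
    pooled = [ex for data in data_by_env.values() for ex in data]
    erm_risk = mean_loss(shared, pooled)
    regrets = sum(regret(e, data_by_env, within, held_out) for e in data_by_env)
    return erm_risk + regrets


# Two environments with made-up data and made-up classifier weights.
data_by_env = {1: [(2.0, 1), (-2.0, 0)], 2: [(1.0, 1), (-1.0, 0)]}
shared = logistic(1.0)
within = {1: logistic(2.0), 2: logistic(2.0)}      # trained on env e only
held_out = {1: logistic(0.5), 2: logistic(0.5)}    # trained on all envs except e
total_risk = rgm_risk(data_by_env, shared, within, held_out)
```

In this toy setup the held-out classifiers fit each environment worse than the within-environment ones, so both regret terms are positive and `total_risk` exceeds the plain ERM risk.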
Figure 3. The modified architecture for regret minimization (RGM) vs. the original architecture for empirical risk minimization (ERM). Both methods train a language-agnostic phone token classifier; RGM also trains language-specific phone token classifiers.
Sources of data used in our cross-lingual experiment.
| Language | ISO 639-3 code | Corpus | Type | Family | Len |
|---|---|---|---|---|---|
| Portuguese | por | GP | Read | Romance | 26 |
| Turkish | tur | GP | Read | Turkic | 17 |
| German | deu | GP | Read | Germanic | 18 |
| Bulgarian | bul | GP | Read | South Slavic | 21 |
| Thai | tha | GP | Read | Tai | 22 |
| Mandarin | cmn | GP | Read | Sinitic | 31 |
| French | fra | GP | Read | Romance | 25 |
| Czech | ces | GP | Read | West Slavic | 29 |
| Dutch | nld | CGN | Read | Germanic | 64 |
| Georgian | kat | Babel | Sp. | Kartvelian | 190 |
| Javanese | jav | Babel | Sp. | Austronesian | 204 |
| Amharic | amh | Babel | Sp. | Ethiopic | 204 |
| Zulu | zul | Babel | Sp. | Bantu | 211 |
| Vietnamese | vie | Babel | Sp. | Vietic | 215 |
| Bengali | ben | Babel | Sp. | Indo-Aryan | 215 |
| Croatian | hrv | GP | Read | South Slavic | 16 |
| Polish | pol | GP | Read | West Slavic | 24 |
| Spanish | spa | GP | Read | Romance | 22 |
| Lao | lao | Babel | Sp. | Tai | 207 |
| Cantonese | yue | Babel | Sp. | Sinitic | 215 |
The upper part (Portuguese through Bengali) is the multilingual set and the lower part (Croatian through Cantonese) is the cross-lingual set. The “Corpus” column is GlobalPhone (GP), the Corpus of Spoken Dutch (CGN), or Babel. The “Type” column denotes whether the corpus contains spontaneous (Sp.) or read speech. The “Family” column shows the language family. The “Len” column shows the total duration of all utterances in hours.
Phone token error rates (PTER, %) of an ASR trained on 15 languages, tested on 5 additional languages.
| Training language | | | | | Test language | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Portuguese | | 22.6 | 20.5 | 22.1 | Croatian | | 48.9 | 49.3 | 50.9 |
| Turkish | | 23.0 | 24.0 | 25.0 | Polish | 62.5 | | 63.7 | 65.5 |
| German | | 28.4 | 27.2 | 29.4 | Spanish | | 39.8 | 39.6 | 40.6 |
| Bulgarian | | 30.0 | 30.1 | 30.2 | Lao | | | 79.0 | 78.8 |
| Thai | | 30.0 | 31.3 | 34.5 | Cantonese | | 78.0 | 78.4 | 77.7 |
| Mandarin | | 38.5 | 33.8 | 46.3 | - | - | - | - | - |
| French | | 19.1 | 16.3 | 16.8 | - | - | - | - | - |
| Czech | | 15.6 | 12.8 | 13.7 | - | - | - | - | - |
| Dutch | | 28.7 | 28.3 | 27.6 | - | - | - | - | - |
| Georgian | | 43.9 | 46.6 | 41.5 | - | - | - | - | - |
| Javanese | | 54.4 | 55.6 | 49.6 | - | - | - | - | - |
| Amharic | | 52.2 | 53.0 | 49.7 | - | - | - | - | - |
| Zulu | | 48.9 | 48.9 | 46.3 | - | - | - | - | - |
| Vietnamese | | 59.1 | 63.1 | 58.5 | - | - | - | - | - |
| Bengali | | 47.0 | 47.4 | 43.4 | - | - | - | - | - |
| Average | | 36.1 | 35.9 | 35.6 | Average | | 61.4 | 62.0 | 62.7 |
Early-stopping epoch and other hyperparameters of each algorithm were selected based on development test data in the training languages. Numbers reported are from the evaluation test data in each language. Bold denotes lowest error in each row.
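For readers unfamiliar with the metric, a token error rate of this kind is conventionally the edit distance (substitutions, deletions, and insertions) between the recognized and reference phone sequences, divided by the number of reference tokens. The sketch below assumes that conventional definition, which this section does not spell out; the function names and the two example utterances are illustrative only.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between a reference and a hypothesis phone sequence."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference token missing from hyp
                curr[j - 1] + 1,         # insertion: extra token in hyp
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def phone_token_error_rate(refs: List[Sequence[str]],
                           hyps: List[Sequence[str]]) -> float:
    """Total edit errors over total reference tokens, as a percentage."""
    total_tokens = sum(len(r) for r in refs)
    if total_tokens == 0:
        raise ValueError("reference transcripts contain no tokens")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * errors / total_tokens


# Made-up example: one substitution in the first utterance and one insertion
# in the second, over five reference tokens in total.
refs = [["k", "a", "t"], ["s", "o"]]
hyps = [["k", "a", "d"], ["s", "o", "u"]]
pter = phone_token_error_rate(refs, hyps)
```

Because insertions count as errors, a rate computed this way can exceed 100% when the hypothesis is much longer than the reference.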
Phone token error rates (PTER, %) of an ASR trained on three Slavic languages (Czech, Bulgarian and Polish).
| Algorithm | | | | Average (training languages) | | | | Average (test languages) |
|---|---|---|---|---|---|---|---|---|
| ERM | | | | | | 71.5 | 65.4 | 64.4 |
| DRO | 37.2 | 49.7 | 51.1 | 46.0 | 60.6 | 75.0 | 67.8 | 67.8 |
| IRM | 34.0 | 47.3 | 50.2 | 43.8 | 57.7 | 70.3 | 66.7 | 64.9 |
| RGM | 32.3 | 46.0 | 48.2 | 42.2 | 57.1 | | | |
Early-stopping and other hyperparameters of each algorithm were selected based on development test data in the three training languages. Numbers reported are from the evaluation test data in each of the three training languages, and in each of three previously unseen test languages. Bold denotes lowest error in each column.
Phone token classification error rates (PTCER, %) of an ASR trained on three Slavic languages (Czech, Bulgarian and Polish).
| Algorithm | | | | Average (training languages) | | | | Average (test languages) |
|---|---|---|---|---|---|---|---|---|
| ERM | | 46.2 | 42.9 | 39.6 | 48.3 | 56.6 | 59.3 | 54.7 |
| DRO | 40.7 | 51.5 | 46.1 | 46.1 | 50.7 | | 60.1 | 55.5 |
| IRM, λ = 0.001 | 34.6 | 49.9 | 43.5 | 42.7 | 49.0 | 57.6 | 59.8 | 55.5 |
| IRM, λ = 0.01 | 34.7 | 49.6 | 43.3 | 42.5 | 48.2 | 57.4 | 59.3 | 55.0 |
| IRM, λ = 0.1 | 34.8 | 49.8 | 43.3 | 42.6 | 48.2 | 57.3 | 59.1 | 54.9 |
| IRM, λ = 1 | 35.6 | 50.8 | 43.5 | 43.3 | 49.1 | 57.1 | 60.1 | 55.4 |
| IRM, λ = 10 | 30.7 | | | | | 55.8 | 59.1 | |
| IRM, λ = 100 | 41.1 | 51.5 | 48.6 | 47.1 | 47.6 | 55.9 | | 54.1 |
| RGM | 32.0 | 49.0 | 45.7 | 42.2 | 48.8 | 57.3 | 64.3 | 56.8 |
Early-stopping and other hyperparameters of each algorithm were selected based on development test data in the three training languages. Numbers reported are from the evaluation test data in each of the three training languages, and in each of three previously unseen test languages. Bold denotes lowest error in each column.
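The λ attached to the IRM rows is not defined in this section. Assuming it plays its usual role in the widely used IRMv1 formulation, it weights a per-environment invariance penalty: the squared gradient of each environment's risk with respect to a dummy classifier scale fixed at 1. The sketch below illustrates that objective with a squared loss on scalar features, chosen only because its gradient has a closed form; it is a stand-in for, not a copy of, the phone classification loss used in the experiments, and all names and data are hypothetical.

```python
from typing import List, Sequence, Tuple

Environment = Tuple[Sequence[float], Sequence[float]]  # (features z, targets y)


def env_risk_and_penalty(z: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Risk and IRMv1-style penalty for one environment under squared loss.

    With a dummy scale w, the risk is R(w) = mean((w * z_i - y_i) ** 2).
    The penalty is (dR/dw) ** 2 evaluated at w = 1.
    """
    n = len(z)
    risk = sum((zi - yi) ** 2 for zi, yi in zip(z, y)) / n
    grad_at_one = sum(2.0 * (zi - yi) * zi for zi, yi in zip(z, y)) / n
    return risk, grad_at_one ** 2


def irm_objective(envs: List[Environment], lam: float) -> float:
    """Sum of per-environment risks plus lam times the sum of penalties."""
    total_risk = 0.0
    total_penalty = 0.0
    for z, y in envs:
        risk, penalty = env_risk_and_penalty(z, y)
        total_risk += risk
        total_penalty += penalty
    return total_risk + lam * total_penalty


# Made-up data: the first environment is fit perfectly (zero risk, zero penalty);
# the second has risk 1.0 and gradient 4.0, so its penalty is 16.0.
envs = [([1.0, -1.0], [1.0, -1.0]), ([2.0], [1.0])]
objective = irm_objective(envs, lam=0.1)
```

With λ = 0 the objective reduces to the summed empirical risk; increasing λ pushes the feature extractor toward representations for which the same classifier scale is simultaneously optimal in every environment.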
Phone token classification error rates (PTCER, %) of an ASR trained on three Slavic languages (Czech, Bulgarian and Polish) and tested on one Slavic language (Croatian) and two other Indo-European languages (French and German).
| Algorithm | Early-stopping language | Croatian | French | German |
|---|---|---|---|---|
| ERM | Croatian | 46.6 | 59.4 | 63.4 |
| RGM | Croatian | 44.9 | | |
| ERM | French | 46.8 | 58.4 | 60.6 |
| RGM | French | 47.5 | 56.0 | 60.5 |
| ERM | German | 48.9 | 62.2 | 59.3 |
| RGM | German | | | 57.2 |
In this table, the epoch for early stopping was chosen using development-test data from one of the three test languages: Croatian in rows 1–2, French in rows 3–4, German in rows 5–6. PTCER was then measured using evaluation-test data from each test language. Numbers reported using early-stopping on the test language are considered oracle; boldface shows the lowest non-oracle error rate.