Puneet Bawa, Virender Kadyan, Abinash Tripathy, Thipendra P Singh.
Abstract
Developing a robust automatic speech recognition (ASR) framework for a native language is challenging and remains an active area of research. Effective front-end and back-end approaches are needed to tackle environmental differences, high training complexity, and inter-speaker variability. In this paper, four front-end approaches, mel-frequency cepstral coefficients (MFCC), Gammatone frequency cepstral coefficients (GFCC), relative spectral-perceptual linear prediction (RASTA-PLP), and power-normalized cepstral coefficients (PNCC), are investigated to generate robust feature vectors at different SNR values. To handle the complexity of the large training data, parameters are optimized with sequence-discriminative training techniques: maximum mutual information (MMI), minimum phone error (MPE), boosted MMI (bMMI), and state-level minimum Bayes risk (sMBR). Optimal parameter values are selected through lattice generation and adjustment of learning rates. In the proposed framework, four systems are tested by analyzing various feature extraction approaches (with or without speaker normalization of the test set through vocal tract length normalization (VTLN)) and classification strategies, with or without artificial extension of the training dataset. To compare system performance, true matched (adult train and test, S1; child train and test, S2) and mismatched (adult train and child test, S3; adult + child train and child test, S4) systems are evaluated on a large adult and a very small children's Punjabi clean speech corpus. Gender-based in-domain data augmentation is then used to moderate the acoustic and phonetic variations between adult and children's speech under mismatched conditions.
The experimental results show that a framework built on the PNCC + VTLN front-end with a TDNN-sMBR-based model and parameter optimization yields a relative improvement (RI) of 40.18%, 47.51%, and 49.87% in the matched, mismatched, and gender-based in-domain augmented systems, respectively, under clean and noisy conditions.
Keywords: Children speech recognition; Data augmentation; Mismatched conditions; Sequence discriminative training
Year: 2022 PMID: 35668730 PMCID: PMC9160864 DOI: 10.1007/s40747-022-00651-7
Source DB: PubMed Journal: Complex Intell Systems ISSN: 2199-4536
Fig. 1 A comparative block diagram of heterogeneous front-end feature extraction approaches
Detailed information of Punjabi adult and children corpora
| Characteristics | Adult dataset | Child dataset |
|---|---|---|
| No. of speakers | 21 | 39 |
| Speech data type | Isolated words and phonetically rich sentences | Continuous speech sentences |
| Recording environment | Closed room using dictaphone and microphone | Open and closed environment using microphone |
| No. of utterances | 3953 | 2159 |
| Age | 17–26 years | 7–12 years |
| Duration | 10 h 12 min | 4 h 10 min |
| No. of unique words | 6567 | 4863 |
| Gender | 9 male/12 female | 20 male/19 female |
Matched and mismatched systems employed for training and testing
| Type of ASR | Training | Testing |
|---|---|---|
| Adult ASR-S1 system | Adult dataset | Adult dataset |
| Children ASR-S2 system | Children dataset | Children dataset |
| Mismatched ASR-S3 system | Adult dataset | Children dataset |
| Semi-mismatched-S4 system | Adult and children mixed dataset | Children dataset |
Fig. 2 a Adult original signal and noisy signal. b Child dataset original signal and noisy signal
Fig. 3 Comparative illustration of original children and spectral warped adult audio signal
Fig. 4 Basic block diagram of heterogeneous feature extraction-based ASR framework on true matched and mismatched systems
Fig. 5 Basic block diagram of robust ASR framework on vocal tract length normalization-induced front-end approach using varying discriminative sequence training on mismatched systems
Fig. 6 Lattice network for word lattice in the speech utterance
WER obtained on different system types using the conventional front-end (MFCC) and DNN acoustic model in clean environment conditions
| Training set | Testing set | System type | DNN WER (%) |
|---|---|---|---|
| Adult | Adult | S1 | 6.52 |
| Child | Child | S2 | 15.43 |
| Adult | Child | S3 | 41.28 |
| Adult–child | Child | S4 | 14.27 |
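The tables in this record all report word error rate (WER). As a minimal sketch, assuming the conventional definition (substitutions + deletions + insertions, divided by the number of reference words), WER can be computed with a word-level edit distance. The function name `word_error_rate` and the example sentences are illustrative only and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance.

    Assumes the conventional definition (S + D + I) / N * 100, where N is
    the number of words in the reference transcript.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Toolkits usually report WER aggregated over a whole test set (total errors divided by total reference words) rather than averaged per utterance, so the per-utterance function above is only the building block.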
Fig. 7 WER obtained on both matched and mismatched systems using a the MFCC, b the RASTA-PLP, c the GFCC, and d the PNCC feature extraction technique
WER obtained on noise augmented train set using varying front-end approaches
| Training set | MFCC, clean test set | MFCC, noisy test set | RASTA-PLP, clean test set | RASTA-PLP, noisy test set | GFCC, clean test set | GFCC, noisy test set | PNCC, clean test set | PNCC, noisy test set |
|---|---|---|---|---|---|---|---|---|
| S1 + random noise | 7.32 | 9.42 | 7.01 | 8.25 | 6.50 | 7.12 | 5.99 | 6.04 |
| S2 + random noise | 15.61 | 18.55 | 15.07 | 17.42 | 14.61 | 15.66 | 13.24 | 13.31 |
| S3 + random noise | 42.21 | 49.62 | 41.93 | 47.26 | 40.15 | 44.13 | 37.21 | 39.23 |
| S4 + random noise | 14.18 | 17.96 | 13.86 | 16.51 | 13.16 | 14.53 | 12.67 | 12.69 |
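The training sets above are extended with random noise, and the abstract refers to features generated at different SNR values. The record does not say how the noise was mixed, so the following is only a sketch of one common approach: scale the noise so that the ratio of signal power to noise power matches a target SNR in dB, then add it to the clean signal. The names `mix_at_snr` and `measured_snr_db` and the synthetic sine-plus-Gaussian data are assumptions for illustration.

```python
import math
import random


def mean_power(samples):
    """Average power (mean of squared samples)."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Add `noise` to `signal` after scaling it to the requested SNR in dB."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_signal == 0 or p_noise == 0:
        raise ValueError("signal and noise must have non-zero power")
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  solve for the noise power.
    target_noise_power = p_signal / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR of `noisy` relative to `clean`, treating the difference as noise."""
    residual = [y - x for x, y in zip(clean, noisy)]
    return 10 * math.log10(mean_power(clean) / mean_power(residual))


# Synthetic stand-ins for a speech signal and a noise recording.
random.seed(0)
clean = [math.sin(0.05 * i) for i in range(2000)]
noise = [random.gauss(0.0, 0.3) for _ in range(2000)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)
print(round(measured_snr_db(clean, noisy), 3))
```

Real pipelines would read waveforms from audio files and often draw the SNR per utterance from a range; both are left out here to keep the sketch self-contained.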
WER (%) obtained with a varying number of MMI iterations in matched and mismatched systems using clean and noisy test sets
| No. of iterations (MMI) | S1, clean test set | S4, clean test set | S1, noisy test set | S4, noisy test set |
|---|---|---|---|---|
| 1 | 6.97 | 14.25 | 6.89 | 13.89 |
| 2 | 6.25 | 13.12 | 5.97 | 12.27 |
| 3 | | 12.65 | | 12.14 |
| 4 | 5.68 | | 5.61 | |
| 5 | 5.59 | 12.19 | 5.48 | 12.12 |
| 6 | 5.58 | 12.17 | 5.51 | 12.09 |
| 7 | 5.59 | 12.18 | 5.51 | 12.11 |
| 8 | 5.59 | 12.17 | 5.5 | 12.11 |
Bold values imply a reduced word error rate (WER) that will be carried through
WER (%) obtained with varying n-gram language model (LM) order under the MPE training criterion in matched and mismatched systems using clean and noisy test sets
| LM | S1, clean test set | S4, clean test set | S1, noisy test set | S4, noisy test set |
|---|---|---|---|---|
| 1-g | 7.56 | 14.21 | 7.52 | 14.04 |
| 2-g | 6.61 | 12.27 | 6.47 | 12.02 |
| 3-g | | | | |
| 4-g | 5.59 | 11.81 | 5.4 | 11.66 |
Bold values imply a reduced word error rate (WER) that will be carried through
WER (%) obtained with a varying boost factor in the MMI approach in matched and mismatched systems using clean and noisy test sets
| Boost factor | S1, clean test set | S4, clean test set | S1, noisy test set | S4, noisy test set |
|---|---|---|---|---|
| 0 (MMI) | 5.63 | 12.13 | 5.5 | 12.07 |
| 0.05 | 5.6 | 12.04 | 5.47 | 12.01 |
| 0.1 | 5.52 | 11.93 | 5.43 | 11.87 |
| 0.15 | | 11.89 | | 11.73 |
| 0.2 | 5.51 | | 5.41 | |
| 0.25 | 5.53 | 11.76 | 5.44 | 11.66 |
Bold values imply a reduced word error rate (WER) that will be carried through
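For orientation, the boost factor varied above enters the boosted MMI criterion as follows. This is the commonly cited form of the objective, not an equation taken from this record, and all symbols are assumptions introduced here: $\mathbf{X}_r$ is the feature sequence of utterance $r$, $s_r$ its reference transcription, $\mathcal{M}_s$ the HMM sequence for hypothesis $s$, $\kappa$ the acoustic scale, $P(s)$ the language model probability, $A(s, s_r)$ the accuracy of hypothesis $s$ against the reference, and $b$ the boost factor.

```latex
\mathcal{F}_{\mathrm{bMMI}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_{\lambda}(\mathbf{X}_r \mid \mathcal{M}_{s_r})^{\kappa}\, P(s_r)}
         {\sum_{s} p_{\lambda}(\mathbf{X}_r \mid \mathcal{M}_{s})^{\kappa}\, P(s)\,
          \exp\!\bigl(-b\, A(s, s_r)\bigr)}
```

Setting $b = 0$ removes the exponential term and recovers the plain MMI objective, which is consistent with the table labelling the boost factor 0 row as MMI. In practice the sum over competing hypotheses $s$ in the denominator is approximated with a lattice.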
An overview of WER (%) obtained with discriminative training approaches in matched and mismatched systems using clean and noisy test sets
| System type | S1, clean test set | S4, clean test set | S1, noisy test set | S4, noisy test set |
|---|---|---|---|---|
| DNN-MMI | 5.63 | 12.13 | 5.5 | 12.07 |
| DNN-MPE | 5.57 | 11.92 | 5.46 | 11.76 |
| DNN-bMMI | 5.49 | 11.74 | 5.39 | 11.64 |
| DNN-sMBR | 4.97 | 10.17 | 4.82 | 9.97 |
An overview of WER (%) obtained with discriminative training approaches employing gender-based selection on the mismatched system using clean and noisy test sets
| System type | Female adult + child, clean test set | Male adult + child, clean test set | Female adult + child, noisy test set | Male adult + child, noisy test set |
|---|---|---|---|---|
| DNN-MMI | 11.81 | 12.34 | 11.69 | 12.26 |
| DNN-MPE | 11.85 | 12.32 | 11.65 | 11.82 |
| DNN-bMMI | 11.57 | 11.80 | 11.44 | 11.85 |
| DNN-sMBR | 10.05 | 11.01 | 9.85 | 10.34 |
An overview of WER (%) obtained from perturbation training using PNCC and VTLN approaches in matched and mismatched systems using clean and noisy test sets
| Training set | Classifier type | PNCC, clean test set | PNCC, noisy test set | PNCC + VTLN, clean test set | PNCC + VTLN, noisy test set |
|---|---|---|---|---|---|
| S1 + noise + 3-way | DNN | 4.64 | 4.68 | 4.37 | 4.48 |
| S4 + noise + 3-way | DNN | 9.38 | 9.24 | 8.82 | 8.64 |
| Female adult + noise + 3-way | DNN | 9.31 | 9.18 | 8.71 | 8.62 |
| S1 + noise + 3-way | TDNN | 4.18 | 4.27 | 3.90 | 4.02 |
| S4 + noise + 3-way | TDNN | 8.89 | 8.65 | 8.26 | 8.10 |
| Female adult + noise + 3-way | TDNN | 8.85 | 8.59 | 8.20 | 8.08 |
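The abstract summarizes these results as relative improvement (RI) without stating the formula. A minimal sketch, assuming the usual definition RI = (baseline WER − new WER) / baseline WER × 100, is given below. Under that assumption, the S1 MFCC + DNN baseline of 6.52% and the S1 TDNN PNCC + VTLN clean-test result of 3.90% reproduce the reported 40.18%. Which baseline and result pairs the authors used for the other RI figures is not stated in this record, so they are not recomputed here. The function name `relative_improvement` is illustrative.

```python
def relative_improvement(baseline_wer: float, improved_wer: float) -> float:
    """Relative WER reduction in percent, assuming RI = (base - new) / base * 100."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


# S1 baseline (MFCC + DNN, clean) versus S1 TDNN with PNCC + VTLN (clean).
ri_s1 = relative_improvement(6.52, 3.90)
print(f"{ri_s1:.2f}%")  # 40.18%
```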
Fig. 8 WER obtained using the spectral warped adult female dataset with the PNCC + VTLN-based feature extraction technique on mismatched systems
An overview of WER (%) obtained after combining the spectral warping technique on mismatched systems using clean and noisy test sets
| Noise-augmented training set | Warping factor | Perturbation type | Classifier type | PNCC + VTLN, clean test set | PNCC + VTLN, noisy test set |
|---|---|---|---|---|---|
| Female adult + noise | – | Three-way | TDNN | 8.20 | 8.08 |
| Female adult + noise | − 0.1 + 0.05 | Three-way | TDNN | 7.75 | 7.06 |
| Female adult + noise | − 0.1 ± 0.05 | Three-way | TDNN | 7.78 | 7.14 |
| Female adult + noise | 0.05 ± 0.05 | Three-way | TDNN | 7.86 | 7.34 |
Comparative analysis and summary of earlier approaches in contrast to the proposed system architecture
| Author details | Dataset details | Methodologies | Summary |
|---|---|---|---|
| Kadyan et al. [ | Punjabi adult corpora constituting continuous and phonetically rich sentences | MFCC; GFCC-based hybrid DNN–HMM and GMM–HMM modeling | The reduction in size, vector knowledge de-correlation and speaker heterogeneity are being discussed by the researcher employing LDA, transition probability, speaker adaptive tri-phones, highest probability, linear regression adaptation models. In two hybrid classifiers, the accuracy of the interconnected and ongoing Punjabi voice corpus is studied. GMM–HMM and DNN–HMM with the experimental configuration detailing significant RI of 4–5% and 1–3%, respectively |
| Shivakumar et al. [ | English language children dataset employing transfer learning | MFCC-based GMM–HMM and DNN–HMM-based modeling | The paper presents a systematic and an extensive analysis of the proposed transfer learning technique considering the key factors affecting children’s speech recognition from prior literature. Evaluations are presented by making the comparisons of earlier GMM–HMM and the newer DNN Models such that the author had experimented for the detailed effectiveness of standard adaptation techniques versus transfer learning |
| Kumar et al. [ | Adult data comprising of 13,218 Punjabi words with over 200 min of recorded speech | MFCC feature extraction technique | In this paper, the author has experimented for auto-denoising method employing the novel Corpus Optimization Algorithm on the Punjabi language corpus. At the same time, for 13,218 Punjabi words, the WER was lowered to 5.8%. Likewise, some other important factors such as the total probability per frame and the convergence ratio spanning different iterations for obtainable Gaussian mixtures has also been evaluated and consequently the improved performance of the system has been relatively being suggested |
| Gretter et al. [ | TLT-school corpora containing Italian children recorded English dataset | Metrics for collection of adequate children data based upon good pronunciation vs bad pronunciation | The researchers have maintained for the collection of corpuses corresponding to students between 9 and 16 years of age, students from elementary, secondary and secondary schools, was registered in 2017 and 2018. Both statements have been obtained by human experts with regard to certain predefined ability measures |
| Kadyan et al. [ | Punjabi children speech corpora | MFCC; MFCC + Pitch; MFCC + Pitch + VTLN-based DNN–HMM modeling | The researchers obtained substantially lower error rates by adding off-domain data produced through prosody modifications. Furthermore, the authors analyzed the impact of changing the number of senones, the number of hidden nodes and layers, and the early stagnation, which resulted in a relative improvement (RI) of 32.1% over the baseline structure with different senones |
| Dua et al. [ | Hindi speech corpora | Discriminative training based on MPE through variations among the quantity of Gaussian mixtures | The researcher has trained speech recognition through interpolation of language model and discriminative approaches. They achieved a relative improvement of 85.45 under clean and 82.95 under noisy conditions |
| Kadyan et al. [ | Punjabi adult corpora comprising of isolated and phonetically rich sentences | MFCC coupled bottleneck features based on Tandem-NN acoustic modeling | In this paper, the authors have processed context-independent input speech signal information through utilization of bottleneck characteristics. Further noisy data have been handled and experimental results revealed that under clean and noisy settings a Tandem-NN system achieved a RI of 13.53% as compared to the Baseline system |
| Dua et al. [ | Hindi continuous sentences speech corpora and noise augmented dataset | Use of noise-resistant integrated features and an improved HMM model for the development of discriminatively trained speech recognition system | The suggested study has examined that with MF-PLP and MF-GFCC alone or integrated feature vectors results into large performance improvement |
| Kumar and Aggarwal [ | Two low-resource Indo-Aryan family languages including Hindi and Marathi | Integrated feature vector with RNN employed on a Hindi ASR system utilizing MLLR and constrained MLLR | The researchers experimented with 256 Gaussian mixtures per HMM state using the discriminative training methods MMI and MPE. The experiments showed that discriminative training improved on the baseline system by 3% |
| Bawa et al. [ | Gender-based selection under mismatched conditions | MFCC; GFCC-based DNN–HMM modeling | The study attempts to create Punjabi Children ASR in mismatched parameters via noise-robust techniques such as the MFCC or GFCC. Accordingly, acoustic and phonetic differences between adults and children are managed by gender-based selection of adult data and subsequent acoustic variability across speakers in training and test conditions are normalized by means of the VTLN with 30.94% of RI in comparison to the baseline system |
| Proposed approach | Punjabi adult and children under mismatched conditions | PNCC; PNCC + VTLN-based DNN-sMBR and TDNN-sMBR modeling; gender-based selection; spectral augmentation | (i) The results demonstrate that ASR frameworks built on PNCC + VTLN techniques are only successful when tested with sMBR-optimized acoustic models. These experiments show that overall RIs of 40.18%, 47.51%, and 47.64% are achieved with the S1 and S4 ASR systems and the female adult-selected ASR system, respectively (ii) The gender-based spectral augmentation led to a further performance improvement of 49.87% in comparison to the baseline system |