Abstract
Recently, the accuracy of voice authentication systems has increased significantly owing to the successful application of the identity vector (i-vector) model. This paper proposes a new method for i-vector extraction. In this method, a perceptual wavelet packet transform (PWPT) is designed to convert speech utterances into wavelet entropy feature vectors, and a convolutional neural network (CNN) is designed to estimate the frame posteriors of those feature vectors. The i-vector is then extracted from these frame posteriors. The TIMIT and VoxCeleb speech corpora are used for the experiments, and the results show that the proposed method extracts an appropriate i-vector that reduces the equal error rate (EER) and improves the accuracy of the voice authentication system in both clean and noisy environments.
Keywords: CNN; i-vector; speaker authentication; wavelet entropy
Year: 2018 PMID: 33265689 PMCID: PMC7513125 DOI: 10.3390/e20080600
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1. The decomposition structure of PWPT.
Figure 2. Comparison of PWPT, WT and WPT.
Figure 3. The i-vector extraction framework.
Figure 4. The structure of DNN.
Figure 5. CNN structure.
The Comparison of the DNN and CNN.
| Layer | Shape (DNN) | Shape (CNN) | Node Size (DNN) | Node Size (CNN) | Parameter Size (DNN) | Parameter Size (CNN) |
|---|---|---|---|---|---|---|
| Input Layer | 256 × 1, 1 | 16 × 16, 1 | 256 | 256 | 226,144 | 272 |
| Hidden Layer 1~7 | 1024 × 1, 1 | 8 × 8, 16 | 1024 | 1024 | 1,048,576 | 160 |
| Output Layer | 2048 × 1, 1 | 2048 × 1, 1 | 2048 | 2048 | 131,072 | 131,072 |
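The node and parameter counts in this table can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, biases are ignored for the fully connected layer, and the 3 × 3 kernel with a single input channel is an assumption, since the kernel size is not stated in this excerpt.

```python
from math import prod


def node_count(spatial_shape, channels):
    """Units in a layer: product of the spatial dimensions times the channel count."""
    return prod(spatial_shape) * channels


def dense_weights(n_in, n_out):
    """Weights of a fully connected layer, ignoring biases."""
    return n_in * n_out


def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter for a 2-D convolutional layer."""
    return out_channels * (kernel[0] * kernel[1] * in_channels + 1)


# Shapes taken from the table: 16 x 16 x 1 input map, 8 x 8 x 16 hidden maps.
cnn_input_nodes = node_count((16, 16), 1)
cnn_hidden_nodes = node_count((8, 8), 16)

# A 1024-to-1024 fully connected hidden layer.
dnn_hidden_weights = dense_weights(1024, 1024)

# ASSUMPTION: 3 x 3 kernel, one input channel, 16 filters (not given in the source).
cnn_hidden_params = conv_params((3, 3), 1, 16)

print(cnn_input_nodes, cnn_hidden_nodes, dnn_hidden_weights, cnn_hidden_params)
```

Under these assumptions the node counts match the table (256 and 1024), the dense layer has 1,048,576 weights, and the assumed convolution has 160 parameters. The point is that weight sharing keeps the convolutional layers several orders of magnitude smaller than fully connected layers of the same node size.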
Figure 6. I-vectors for two speakers.
Figure 7. The flow chart of voice authentication process.
ESER of PWPT with different mother wavelets.
| Wavelet | ESER | Wavelet | ESER | Wavelet | ESER | Wavelet | ESER |
|---|---|---|---|---|---|---|---|
| Db 1 | 888.37 | Db 6 | 896.53 | Sym 1 | 888.35 | Sym 6 | 908.39 |
| Db 2 | 890.32 | Db 7 | 891.69 | Sym 2 | 890.36 | Sym 7 | 902.44 |
| Db 3 | 897.44 | Db 8 | 890.84 | Sym 3 | 894.93 | Sym 8 | 898.37 |
| Db 4 | 907.45 | Db 9 | 888.21 | Sym 4 | 899.75 | Sym 9 | 896.35 |
| Db 5 | 901.41 | Db 10 | 884.50 | Sym 5 | 903.82 | Sym 10 | 891.34 |
EER (%) of recognition system with different wavelet entropy features.
| Entropy | WT | WPT | PWPT |
|---|---|---|---|
| ShE | 8.51 | 5.46 | 5.49 |
| NE | 8.57 | 5.53 | 5.51 |
| LE | 9.03 | 6.67 | 6.78 |
| SE | 8.91 | 6.23 | 6.27 |
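To make the idea of a wavelet entropy feature concrete, here is a minimal sketch that splits a frame into wavelet packet sub-bands and computes one Shannon entropy value per sub-band. Several things here are assumptions, not the paper's method: the Haar wavelet, the full (non-perceptual) packet tree, the depth, and the entropy definition (entropy of the normalized coefficient energies). The function names are hypothetical.

```python
import math


def haar_step(signal):
    """One Haar analysis step: return (approximation, detail) bands of half length."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return approx, detail


def wavelet_packet_leaves(signal, depth):
    """Full wavelet packet tree: every band is split again at every level."""
    bands = [list(signal)]
    for _ in range(depth):
        next_bands = []
        for band in bands:
            next_bands.extend(haar_step(band))
        bands = next_bands
    return bands


def shannon_entropy(band):
    """Shannon entropy of the normalized coefficient energies of one sub-band."""
    energy = sum(c * c for c in band)
    if energy == 0.0:
        return 0.0  # an all-zero band carries no information
    entropy = 0.0
    for c in band:
        p = c * c / energy
        if p > 0.0:
            entropy -= p * math.log(p)
    return entropy


def wavelet_entropy_features(frame, depth=3):
    """One entropy value per leaf sub-band; the frame length should be a multiple of 2**depth."""
    return [shannon_entropy(band) for band in wavelet_packet_leaves(frame, depth)]


# Toy frame: a short sine burst (hypothetical data, not from the corpora).
frame = [math.sin(2.0 * math.pi * 3.0 * n / 64.0) for n in range(64)]
features = wavelet_entropy_features(frame, depth=3)
print(len(features))
```

A perceptual decomposition such as PWPT would split only selected bands rather than all of them, so the number and width of the leaf bands would differ from this full tree.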
Figure 8. Comparison of WPT and PWPT in feature extraction. (a) EERs of WPT and PWPT. (b) Time cost of WPT and PWPT.
EER and accuracy of spectral features.
| Spectral Features | EER (%), Noisy | EER (%), Clean | Accuracy (%), Noisy | Accuracy (%), Clean |
|---|---|---|---|---|
| PWPT-NE | 6.24 | 5.53 | 90.13 | 92.14 |
| WPT-NE | 7.11 | 5.51 | 89.47 | 92.48 |
| WT-NE | 10.27 | 8.43 | 86.39 | 90.12 |
| MFCC | 11.43 | 9.23 | 83.10 | 89.31 |
| LPCC | 11.77 | 9.31 | 83.24 | 88.97 |
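The EER reported in these tables is the operating point where the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be estimated from verification scores. It assumes that higher scores mean "same speaker" and uses a simple threshold sweep over the observed scores; the function name and the toy scores are hypothetical, and the paper's own scoring back end is not described in this excerpt.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping the decision threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. Returns (eer, threshold)
    at the threshold where the false acceptance and false rejection rates are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False acceptance rate: fraction of impostor trials accepted.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: fraction of genuine trials rejected.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores (hypothetical): four genuine trials and four impostor trials.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {100.0 * eer:.2f}% at threshold {threshold}")
```

With only a handful of trials the two error rates may never be exactly equal, which is why the sketch reports their average at the closest threshold; with large trial lists the difference becomes negligible.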
The comparison of three UBMs.
| UBM | EER (%), Noisy | EER (%), Clean | Accuracy (%), Noisy | Accuracy (%), Clean |
|---|---|---|---|---|
| GMM (1024) | 13.42 | 11.96 | 82.75 | 86.19 |
| GMM (2048) | 11.19 | 9.23 | 86.17 | 89.94 |
| GMM (3072) | 9.78 | 7.54 | 88.73 | 91.97 |
| DNN | 7.11 | 5.51 | 89.47 | 92.48 |
| CNN | 6.24 | 5.53 | 90.13 | 92.14 |
Figure 9. The accuracy and computational speed of CNN and DNN. (a) Accuracy. (b) Computational speed.
The performance of i-vector extraction methods.
| Strategies | EER (%), Noisy | EER (%), Clean | Accuracy (%), Noisy | Accuracy (%), Clean |
|---|---|---|---|---|
| MFCC + GMM | 13.02 | 9.15 | 80.74 | 89.59 |
| WPE + GMM | 13.17 | 10.97 | 85.97 | 87.49 |
| MFCC + DNN | 10.15 | 5.68 | 85.6 | 91.91 |
| WPE + DNN | 8.76 | 6.87 | 90.17 | 92.87 |
| MFCC + CNN | 8.02 | 5.97 | 86.43 | 91.48 |
| WPE + CNN | 6.24 | 5.53 | 90.13 | 92.14 |
Figure 10. DEERs of the three i-vector extraction methods.