Meng Wang, Dazheng Feng, Tingting Su, Mohan Chen.
Abstract
Convolutional neural networks (CNNs) have significantly advanced speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is an important component: it compresses the frame-level features generated by the CNN frontend into an utterance-level representation. However, most existing aggregation methods aggregate the extracted features across time only and cannot capture the speaker-dependent information contained in the frequency domain. To handle this problem, this paper proposes a novel attention-based frequency aggregation method, which focuses on the key frequency bands that provide more information for the utterance-level representation. In addition, two more effective temporal-frequency aggregation methods are proposed by combining it with existing temporal aggregation methods. The two proposed methods capture the speaker-dependent information contained in both the time domain and the frequency domain of the frame-level features, thus improving the discriminability of the speaker embedding. Furthermore, a powerful CNN-based SV system is developed and evaluated on the TIMIT and Voxceleb datasets. The experimental results indicate that the CNN-based SV system using the temporal-frequency aggregation method achieves an equal error rate of 5.96% on Voxceleb, outperforming the state-of-the-art baseline models.
Keywords: convolutional neural networks; self-attention; speaker verification; temporal-frequency aggregation
Year: 2022 PMID: 35336315 PMCID: PMC8953125 DOI: 10.3390/s22062147
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
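The attention-based aggregation idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' SGFSAP implementation: the feature-map layout `feature_map[channel][freq][time]`, the scoring function `v . tanh(W x + b)`, the random parameters from the hypothetical helper `init_params`, and the concatenation of the temporal and frequency vectors are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(dim, hidden, seed=0):
    """Random attention parameters (hypothetical; a real system would learn them)."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    b = [0.0] * hidden
    v = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    return w, b, v


def attentive_pool(vectors, w, b, v):
    """Weighted mean of `vectors` with weights softmax(v . tanh(W x + b))."""
    scores = []
    for x in vectors:
        hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bias)
                  for row, bias in zip(w, b)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(a * x[d] for a, x in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def frequency_attentive_pool(feature_map, w, b, v):
    """Attend over the frequency bins of feature_map[channel][freq][time]."""
    n_ch, n_freq = len(feature_map), len(feature_map[0])
    n_time = len(feature_map[0][0])
    # Average each bin over time so every frequency bin becomes one channel vector.
    bins = [[sum(feature_map[c][f]) / n_time for c in range(n_ch)]
            for f in range(n_freq)]
    return attentive_pool(bins, w, b, v)


def temporal_attentive_pool(feature_map, w, b, v):
    """Attend over frames after averaging each frame across frequency."""
    n_ch, n_freq = len(feature_map), len(feature_map[0])
    n_time = len(feature_map[0][0])
    frames = [[sum(feature_map[c][f][t] for f in range(n_freq)) / n_freq
               for c in range(n_ch)] for t in range(n_time)]
    return attentive_pool(frames, w, b, v)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy feature map: 3 channels, 4 frequency bins, 5 frames.
    fmap = [[[rng.random() for _ in range(5)] for _ in range(4)] for _ in range(3)]
    w, b, v = init_params(dim=3, hidden=8)
    freq_vec, freq_weights = frequency_attentive_pool(fmap, w, b, v)
    time_vec, time_weights = temporal_attentive_pool(fmap, w, b, v)
    embedding = time_vec + freq_vec  # assumed combination: plain concatenation
    print(len(embedding), [round(a, 3) for a in freq_weights])
```

The frequency weights play the role of an attention map over frequency bands: bins with larger weights contribute more to the pooled vector. Reusing one parameter set for both axes here only keeps the demo short.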
The architecture of Thin ResNet-34. ReLU and batch normalization layers are not shown.
| Log Fbank Feature ( | Output Size ( |
|---|---|
| Conv2d, |
Figure 1. Diagram of the attention-based temporal-frequency aggregation method.
Figure 2. The CNN-based SV system using SAP-SGFSAP.
The experimental results on TIMIT under different SNR levels and typical distortions.
| Systems | Clean | 30 dB | 20 dB | 10 dB | Distortions |
|---|---|---|---|---|---|
| TAP | 5.22 | 5.47 | 7.01 | 8.78 | 8.09 |
| SAP | 4.79 | 5.49 | 6.84 | 9.00 | 8.36 |
| ASP | 4.99 | 5.41 | 6.69 | 9.54 | 8.09 |
| SAP-SGFSAP-10 | 5.46 | 5.21 | 6.50 | 8.14 | 7.48 |
| ASP-SGFSAP-2 | 5.05 | 5.00 | 6.62 | 7.95 | 9.25 |
The experimental results of various SV systems on Voxceleb.
| Categories | Systems | EER (%) |
|---|---|---|
| GMM-based systems | GMM-UBM [ | 15.0 |
| GMM-based systems | i-vector/PLDA [ | 8.8 |
| TDNN-based systems | x-vector (Cosine) [ | 11.3 |
| TDNN-based systems | x-vector (PLDA) [ | 7.1 |
| CNN-based systems | CNN-embedding [ | 7.8 |
| CNN-based systems | SGFSAP-19 | 6.26 |
| CNN-based systems | SAP-SGFSAP-19 | 6.11 |
| CNN-based systems | ASP-SGFSAP-1 | 5.96 |
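The equal error rate (EER) reported in the table is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be estimated from trial scores. The function name `equal_error_rate`, the "accept if score >= threshold" convention, and the toy scores are illustrative assumptions, not the evaluation code used in the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for th in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, th)
    return best[1], best[2]


if __name__ == "__main__":
    targets = [0.9, 0.6, 0.4, 0.8]      # toy same-speaker scores
    nontargets = [0.1, 0.5, 0.7, 0.3]   # toy different-speaker scores
    eer, th = equal_error_rate(targets, nontargets)
    print(f"EER = {100 * eer:.1f}% at threshold {th}")
```

With a finite trial list the two error rates may never be exactly equal, so this sketch reports their average at the closest crossing; production toolkits usually interpolate along the ROC curve instead.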
The experimental results of CNN-based SV systems with different aggregation methods.
| Categories | Systems | EER (%) |
|---|---|---|
| Temporal aggregation | TAP | 6.60 |
| Temporal aggregation | SAP | 6.58 |
| Temporal aggregation | ASP | 6.54 |
| Temporal aggregation | NetVLAD | 7.00 |
| Temporal aggregation | GhostVLAD | 7.14 |
| Frequency aggregation | SGFSAP-19 | 6.26 |
| Temporal-frequency aggregation | SAP-SGFSAP-19 | 6.11 |
| Temporal-frequency aggregation | ASP-SGFSAP-1 | 5.96 |
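To make the gains in the table easier to compare, the relative EER reduction of each temporal-frequency system over its temporal-only counterpart can be computed directly from the listed values. This is a small arithmetic sketch; the helper `relative_reduction` and the pairing of systems are choices made here for illustration, and the EER values are copied from the table.

```python
def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# EER (%) values copied from the table above.
EER = {
    "SAP": 6.58,
    "ASP": 6.54,
    "SAP-SGFSAP-19": 6.11,
    "ASP-SGFSAP-1": 5.96,
}

# Each temporal-frequency system paired with its temporal-only counterpart.
PAIRS = [("SAP", "SAP-SGFSAP-19"), ("ASP", "ASP-SGFSAP-1")]

if __name__ == "__main__":
    for base, combined in PAIRS:
        gain = relative_reduction(EER[base], EER[combined])
        print(f"{base} -> {combined}: {gain:.2f}% relative EER reduction")
```

By this measure, adding frequency aggregation lowers the EER by roughly 7.1% relative for SAP and roughly 8.9% relative for ASP.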
Figure 3. Two-dimensional representation of the speaker embeddings generated by various CNN-based SV systems: (a) SAP, (b) SAP-SGFSAP-19, (c) ASP, (d) ASP-SGFSAP-1.
Effectiveness of parameter sharing.
| Systems | EER (%) |
|---|---|
| GFSAP-19 | 6.40 |
| SGFSAP-19 | 6.26 |
| SAP-GFSAP-19 | 6.30 |
| SAP-SGFSAP-19 | 6.11 |
Figure 4. The effectiveness of the grouping method (EER versus R).
Figure 5. Visualization of the intermediate results of CNN-based SV systems: (a) log Fbank coefficients, (b) the mean of frame-level features in the ASP-SGFSAP-1 system, (c) the attention map generated by SAP-SGFSAP-19, (d) the attention map generated by ASP-SGFSAP-1.