Xin Fang1,2, Tian Gao1, Liang Zou3,4, Zhenhua Ling1.
Abstract
Automatic speaker verification provides a flexible and effective means of biometric authentication. Previous deep learning-based methods have demonstrated promising results, but several problems still require better solutions. In prior works examining speaker-discriminative neural networks, the representation of the target speaker is treated as fixed when compared with utterances from different speakers, and the joint information between enrollment and evaluation utterances is ignored. In this paper, we propose to combine CNN-based feature learning with a bidirectional attention mechanism to achieve better performance with only one enrollment utterance. The evaluation-enrollment joint information is exploited to provide interactive features through bidirectional attention. In addition, we introduce a separate cost function to identify the phonetic content, which helps compute the attention scores more precisely. These interactive features are complementary to the constant ones, which are extracted from individual speakers separately and do not vary with the evaluation utterances. The proposed method achieved a competitive equal error rate of 6.26% on the internal "DAN DAN NI HAO" benchmark dataset with 1250 utterances and outperformed various baseline methods, including the traditional i-vector/PLDA, d-vector, self-attention, and sequence-to-sequence attention models.
Keywords: CNN; bidirectional attention; interactive representation; text-dependent speaker verification
Year: 2020 PMID: 33261046 PMCID: PMC7730222 DOI: 10.3390/s20236784
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The architecture of the convolutional neural network (CNN)-based d-vector extraction model. Cross-entropy (CE) loss and triplet loss are used in this study.
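As an illustration of the two losses named in the Figure 1 caption, the following is a minimal sketch in plain Python. It assumes a cosine-similarity triplet loss with a hinge; the margin value `0.2`, the function names, and the choice of cosine similarity are assumptions made here for illustration and are not taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine similarity.

    The anchor and positive come from the same speaker, the negative
    from a different speaker. `margin` is a hypothetical value.
    """
    sim_pos = cosine_similarity(anchor, positive)
    sim_neg = cosine_similarity(anchor, negative)
    # Zero loss once the positive pair beats the negative pair by the margin.
    return max(0.0, margin - sim_pos + sim_neg)

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]
```

In a setup like this, the CE term would push embeddings toward speaker classes while the triplet term directly shapes the distances used at verification time; how the two are weighted in the paper is not recoverable from this page.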
Figure 2. Demonstration of the sequence-to-sequence (Seq2Seq) [7] attention-based text-dependent speaker verification (TDSV) model.
Figure 3. The architecture of the proposed bidirectional attention-based TDSV model, including NET1 for frame-level hidden feature extraction, NET2 for feature combination, NET3 for the bidirectional attention, and NET4 for the metric learning.
Figure 4. The structure of the bidirectional attention mechanism. For either branch, the frame-level hidden features of one utterance and the utterance-level hidden features of the other utterance are adopted as the inputs.
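The two-branch structure described in the Figure 4 caption can be sketched roughly as follows. This is only an illustration under stated assumptions: utterance-level features are obtained here by mean pooling, and attention scores are plain dot products followed by a softmax. The paper's actual scoring function and pooling are not given on this page, and all function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(frames):
    """Utterance-level vector as the average of frame-level vectors (assumed pooling)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def attend(frames, query):
    """Weight each frame by its dot-product similarity to `query`.

    Returns the attention-pooled vector and the per-frame weights.
    """
    scores = [sum(f_d * q_d for f_d, q_d in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(len(query))]
    return pooled, weights

def bidirectional_attention(enroll_frames, eval_frames):
    """Each branch attends over one utterance's frames using the other's summary."""
    enroll_utt = mean_pool(enroll_frames)
    eval_utt = mean_pool(eval_frames)
    enroll_inter, w_enroll = attend(enroll_frames, eval_utt)   # branch 1
    eval_inter, w_eval = attend(eval_frames, enroll_utt)       # branch 2
    return enroll_inter, eval_inter, w_enroll, w_eval
```

Because the query in each branch comes from the other utterance, the pooled enrollment vector changes with the evaluation utterance, which is what distinguishes these interactive features from a fixed speaker representation.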
Figure 5. The equal error rate (EER) corresponding to different weights of losses. (a) The EER on the development set when ; (b) the EER on the development set when .
The EER (%) on the development and test sets with and without .
| Losses | Development Set | Test Set |
|---|---|---|
| | 6.60 | 6.62 |
| | 6.27 | 6.26 |
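For readers unfamiliar with the metric reported in these tables, the sketch below shows one common way to approximate the EER from trial scores: sweep a decision threshold and take the point where the false acceptance rate (FAR) and false rejection rate (FRR) are closest. The function name and the threshold sweep over observed scores are illustrative choices, not the evaluation code used in the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold. Returns the
    mean of FAR and FRR at the threshold where the two are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy example: one target trial scores low and one non-target scores high.
print(equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1]))  # 0.25
```

With few trials the estimate is coarse, since FAR and FRR move in steps of one trial; evaluation toolkits typically interpolate between neighbouring thresholds.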
The EER (%) on the development and test sets with different losses.
| Architecture | Losses | Development Set | Test Set |
|---|---|---|---|
| d-vector | | 8.03 | 7.99 |
| d-vector | | 7.43 | 7.18 |
| BaCNN | | 6.60 | 6.51 |
| BaCNN | | 6.27 | 6.26 |
The EER (%) on the development and test sets with different inputs.
| Inputs of NET4 | Development Set | Test Set |
|---|---|---|
| | 6.55 | 6.62 |
| | 7.55 | 7.18 |
| | 6.25 | 6.41 |
| | 6.33 | 6.58 |
| | 6.27 | 6.26 |
Comparison with state-of-the-art methods on the development set and the test set.
| Method | Development Set | | | Test Set | | |
|---|---|---|---|---|---|---|
| | EER (%) | | | EER (%) | | |
| i-vector/PLDA [ | 11.61 | 77.51 | 0.5578 | 11.80 | 76.83 | 0.5499 |
| d-vector and cosine [ | 7.43 | 87.67 | 0.4033 | 7.18 | 89.42 | 0.4017 |
| Self-attention [ | 6.96 | 90.40 | 0.3795 | 6.87 | 89.98 | 0.4235 |
| Seq2Seq attention [ | 6.88 | 89.57 | 0.4059 | 6.83 | 89.73 | 0.4236 |
| BaCNN-1step | 7.60 | 88.10 | 0.4373 | 6.91 | 89.18 | 0.4606 |
| BaCNN | 6.27 | 92.00 | 0.3709 | 6.26 | 91.42 | 0.3996 |
Figure 6. DET curves of different methods.
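A detection error tradeoff (DET) curve plots the false rejection rate against the false acceptance rate, with both axes warped by the inverse of the standard normal CDF. The sketch below computes such points from trial scores using only the Python standard library; the function name and the simple threshold sweep are assumptions for illustration and say nothing about how the curves in Figure 6 were produced.

```python
from statistics import NormalDist

def det_points(target_scores, nontarget_scores):
    """Return (probit(FAR), probit(FRR)) pairs for plotting a DET curve.

    Thresholds are swept over all observed scores. Operating points with
    a rate of exactly 0 or 1 are skipped, because the probit transform
    is undefined there.
    """
    probit = NormalDist().inv_cdf  # inverse CDF of the standard normal
    points = []
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if 0.0 < far < 1.0 and 0.0 < frr < 1.0:
            points.append((probit(far), probit(frr)))
    return points
```

On this warped scale, systems whose target and non-target scores are roughly Gaussian give near-straight lines, which makes differences between methods easier to read than on a linear ROC plot.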
Figure 7. An illustrative example of the attention weight. The physical length of the enrollment is 120 frames.