Hongwei Mao, Yan Shi, Yue Liu, Linqiang Wei, Yijie Li, Yanhua Long.
Abstract
In recent years, great progress has been made in the technical aspects of automatic speaker verification (ASV). However, deploying ASV technology more widely remains very challenging, because most systems are still highly sensitive to new, unknown, and spoofing conditions. Most previous studies focused on extracting target-speaker information from natural speech. This paper aims to design a new ASV corpus with multiple speaking styles and to investigate ASV robustness to these different speaking styles. We first release this corpus on the Zenodo website for public research; for each speaker it contains several text-dependent and text-independent singing, humming, and normal reading utterances. Then, we investigate the speaker discrimination of each speaking style in the feature space. Furthermore, the intra- and inter-speaker variabilities within each speaking style and across speaking styles are investigated in both text-dependent and text-independent ASV tasks. The conventional Gaussian Mixture Model (GMM) and the state-of-the-art x-vector are used to build ASV systems. Experimental results show that the voiceprint information in humming and singing speech is more distinguishable than that in normal reading speech for conventional ASV systems. Furthermore, we find that combining the three speaking styles can significantly improve the x-vector based ASV system, whereas only limited gains are obtained by the conventional GMM-based systems.
Year: 2020 PMID: 33175898 PMCID: PMC7657545 DOI: 10.1371/journal.pone.0241809
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
RSH corpus description.
| Item | RSH Details |
|---|---|
| Recording software | Audition |
| Language | Mandarin |
| Text | 10 daily-used phrases or sentences shared by all speakers; 1 personalized unique text per speaker |
| Environment | quiet lab environment |
| Format | 16,000 Hz, 16-bit, 1 channel |
| Speakers | 46 undergraduate students (20 male, 26 female) |
| Microphone | common laptop built-in microphone |
| Biometric signal | singing, humming, and reading speech |
Details of text-dependent and text-independent short-time ASV tasks.
| Task | Target Speakers | Test Segments | Target Trials | Nontarget Trials |
|---|---|---|---|---|
| TD-GD | 460 | 460 | 200 male, 260 female | 3800 male, 6500 female |
| TD-GI | 460 | 460 | 460 | 20700 |
| TI-GD | 460 | 460 | 2000 male, 2600 female | 38000 male, 65000 female |
| TI-GI | 460 | 460 | 4600 | 207000 |
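The trial counts above follow from simple combinatorics. A minimal sketch, assuming each of the 46 speakers (20 male, 26 female) contributes 10 test segments per task (inferred from 460 segments / 46 speakers); the TI counts are simply 10 times the TD counts, reflecting 10 text-independent utterances per speaker:

```python
# Sanity-check the TD trial counts in the table above.
male, female = 20, 26
segs_per_spk = 10  # assumed: 460 test segments / 46 speakers

# Gender-dependent (GD): segments are scored only against same-gender speakers.
target_m = male * segs_per_spk            # 200 male target trials
target_f = female * segs_per_spk          # 260 female target trials
nontarget_m = target_m * (male - 1)       # each segment vs the 19 other males -> 3800
nontarget_f = target_f * (female - 1)     # each segment vs the 25 other females -> 6500

# Gender-independent (GI): every segment is scored against all 46 speakers.
target_gi = (male + female) * segs_per_spk        # 460
nontarget_gi = target_gi * (male + female - 1)    # 460 * 45 = 20700

print(target_m, nontarget_m, target_f, nontarget_f, target_gi, nontarget_gi)
```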
Fig 1. t-SNE visualization of two speakers’ features under the text-dependent condition.
All the texts are the same for the two speakers.
Fig 2. Two speakers’ feature-space discrimination using t-SNE for each speaking style under the text-dependent condition.
Fig 3. t-SNE visualization of two speakers’ features under the text-independent condition.
For each speaker, the texts are the same across the three speaking styles.
Fig 4. Two speakers’ feature-space discrimination using t-SNE for each speaking style under the text-independent condition.
For each speaker, the texts are the same across the three speaking styles.
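The t-SNE projections in Figs 1–4 can be sketched as follows on synthetic data standing in for two speakers' frame-level features (the actual acoustic features and t-SNE settings are not specified in this entry; the dimensions and perplexity below are assumptions):

```python
# Minimal t-SNE visualization sketch: project high-dimensional features
# from two hypothetical speakers down to 2-D for plotting.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
spk_a = rng.normal(0.0, 1.0, size=(50, 20))  # hypothetical speaker A features
spk_b = rng.normal(3.0, 1.0, size=(50, 20))  # hypothetical speaker B features
feats = np.vstack([spk_a, spk_b])

# Reduce the 20-dim features to 2-D; each row of `emb` is one plotted point.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
print(emb.shape)  # (100, 2)
```

The two clusters in the resulting scatter plot correspond to the two speakers; well-separated clusters indicate more discriminative features for that speaking style.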
EER% on the text-dependent ASV tasks, using the three types of single speaking-style recordings.
| System | Task | Gender | Reading | Singing | Humming |
|---|---|---|---|---|---|
| GMM-based | TD-GD | Male | 24.0 | 22.5 | 18.0 |
| | | Female | 30.0 | 22.6 | 17.6 |
| | TD-GI | All | 25.0 | 19.1 | 16.9 |
| x-vector based | TD-GD | Male | 3.0 | 3.5 | 6.5 |
| | | Female | 6.1 | 5.3 | 8.0 |
| | TD-GI | All | 3.9 | 4.1 | 5.2 |
EER% on the TI-GI ASV task, using the three types of single speaking-style recordings.
| System | Reading | Singing | Humming |
|---|---|---|---|
| GMM-based | 34.9 | 31.7 | 21.5 |
| x-vector based | 16.8 | 17.4 | 9.5 |
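All results in these tables are reported as equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. A minimal numpy sketch on hypothetical trial scores (the scoring backends themselves are GMM log-likelihoods or x-vector similarities, not shown here):

```python
# Compute the EER from target and nontarget trial scores by sweeping a
# threshold over the sorted scores: FRR rises while FAR falls.
import numpy as np

def eer(target_scores, nontarget_scores):
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores),
                             np.zeros_like(nontarget_scores)])
    order = np.argsort(scores)
    labels = labels[order]
    frr = np.cumsum(labels) / labels.sum()                  # targets rejected
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()  # nontargets accepted
    idx = np.argmin(np.abs(far - frr))                      # FAR == FRR point
    return (far[idx] + frr[idx]) / 2

tgt = np.array([2.0, 1.8, 1.5, 0.4])    # hypothetical target-trial scores
non = np.array([0.5, 0.3, -0.2, -1.0])  # hypothetical nontarget-trial scores
print(eer(tgt, non))  # 0.25: one error on each side of the threshold
```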
EER% on the GMM-based cross-speaking ASV tasks.
| Task | Enroll \ Test | Reading | Singing | Humming |
|---|---|---|---|---|
| TD-GI | Reading | – | 30.4 | 38.0 |
| | Singing | 31.7 | – | 38.0 |
| | Humming | 38.4 | 37.6 | – |
| TI-GI | Reading | – | 37.2 | 40.0 |
| | Singing | 36.4 | – | 38.4 |
| | Humming | 39.2 | 38.8 | – |
EER% on the x-vector based cross-speaking ASV tasks.
| Task | Enroll \ Test | Reading | Singing | Humming |
|---|---|---|---|---|
| TD-GI | Reading | – | 14.1 | 32.8 |
| | Singing | 20.7 | – | 32.8 |
| | Humming | 34.3 | 32.1 | – |
| TI-GI | Reading | – | 20.2 | 33.0 |
| | Singing | 20.6 | – | 33.1 |
| | Humming | 34.4 | 33.3 | – |
EER% on both the GMM-based and x-vector based ASV tasks using multi-speaking style training data combination.
| System | Task | Reading | Singing | Humming |
|---|---|---|---|---|
| GMM-based | TD-GI | 26.3 | 21.7 | 22.0 |
| | TI-GI | 33.8 | 30.7 | 26.7 |
| x-vector based | TD-GI | 1.3 | 1.7 | 7.8 |
| | TI-GI | 11.2 | 11.8 | 11.3 |
EER% on the single speaking-style TD-GI and TI-GI ASV tasks, using the whole RSH corpus to train the x-vector extractor.
| System | Task | Reading | Singing | Humming |
|---|---|---|---|---|
| x-vector based | TD-GI | 22.6 | 17.8 | 15.2 |
| | TI-GI | 26.5 | 23.7 | 16.7 |
EER% on the x-vector based cross-speaking ASV tasks, using the whole RSH to train the x-vector extractor.
| Task | Enroll \ Test | Reading | Singing | Humming |
|---|---|---|---|---|
| TD-GI | Reading | – | 25.6 | 32.0 |
| | Singing | 26.7 | – | 33.7 |
| | Humming | 32.2 | 33.0 | – |
| TI-GI | Reading | – | 29.2 | 33.0 |
| | Singing | 29.0 | – | 34.2 |
| | Humming | 33.3 | 33.7 | – |
EER% on the x-vector based ASV tasks using the whole RSH corpus to train the x-vector extractor, with multi-speaking-style data combination for target-speaker enrollment.
| System | Task | Reading | Singing | Humming |
|---|---|---|---|---|
| x-vector based | TD-GI | 19.8 | 16.7 | 17.3 |
| | TI-GI | 23.4 | 19.4 | 14.3 |
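The exact enrollment-combination method behind the multi-style results above is not given in this entry; one common approach, sketched here as an assumption, is to average the per-style x-vectors into a single enrollment embedding and score test x-vectors by cosine similarity. All vectors below are hypothetical stand-ins:

```python
# Hypothetical multi-style enrollment: average one speaker's reading,
# singing, and humming x-vectors, then score trials by cosine similarity.
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
# Hypothetical 512-dim x-vectors for one speaker's three speaking styles.
reading, singing, humming = rng.normal(size=(3, 512))
enroll = (reading + singing + humming) / 3  # multi-style enrollment model

test_vec = reading + 0.1 * rng.normal(size=512)  # a reading-style test trial
print(cosine(enroll, test_vec), cosine(humming, test_vec))
```

A single averaged enrollment embedding covers all three styles at test time, which is consistent with the gains the combined-enrollment rows show for the x-vector system.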