Luis Miguel Mazaira-Fernandez, Agustín Álvarez-Marquina, Pedro Gómez-Vilda.
Abstract
Person identification, especially in critical environments, has always been a subject of great interest. However, it has gained a new dimension in a world threatened by a new kind of terrorism that uses social networks (e.g., YouTube) to broadcast its message. In this new scenario, classical identification methods (such as fingerprints or face recognition) have necessarily been replaced by alternative biometric characteristics such as voice, as sometimes this is the only feature available. The present study benefits from the advances achieved in recent years in understanding and modeling voice production. The paper hypothesizes that a gender-dependent characterization of speakers, combined with a set of features derived from the components that result from the deconstruction of the voice into its glottal source and vocal tract estimates, will enhance recognition rates when compared to classical approaches. A general description of the main hypothesis and of the methodology followed to extract the gender-dependent extended biometric parameters is given. Experimental validation is carried out both on a database recorded under highly controlled acoustic conditions and on a mobile phone database recorded under non-controlled acoustic conditions.
Keywords: GMM–UBM; source-tract separation; speaker characterization; speaker recognition; voice biometry; voice processing
Year: 2015 PMID: 26442245 PMCID: PMC4585141 DOI: 10.3389/fbioe.2015.00126
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Figure 1. Separation algorithm using first-order prediction lattice and including a lip-radiation compensation stage.
Figure 2. Vocal tract (middle) and glottal source (lower) estimates for a female sustained vowel /a/ utterance (upper).
Figure 3. Power spectral density of the glottal source evaluated over a temporal window which includes multiple glottal cycles. The relative maxima of the distribution are marked by the harmonics present in the signal. The interconnection of these maxima is known as the harmonic envelope or power spectral density profile.
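The harmonic envelope described in the Figure 3 caption can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the toy signal (three harmonics of a 200 Hz fundamental), the naive DFT, the `floor_ratio` noise floor, and all function names are assumptions chosen only to show how the relative maxima of a power spectrum line up with the harmonics when the window spans several cycles.

```python
import cmath
import math


def power_spectrum(x):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(x)
    psd = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        psd.append(abs(acc) ** 2 / n)
    return psd


def harmonic_envelope(psd, fs, n, floor_ratio=1e-6):
    """Return (frequency_hz, power) pairs at the relative maxima of the PSD."""
    floor = max(psd) * floor_ratio  # discard maxima that are only numerical noise
    peaks = []
    for k in range(1, len(psd) - 1):
        if psd[k] > floor and psd[k] > psd[k - 1] and psd[k] >= psd[k + 1]:
            peaks.append((k * fs / n, psd[k]))
    return peaks


# Hypothetical stand-in for a glottal source estimate: 50 ms at 8 kHz,
# i.e. ten cycles of a 200 Hz fundamental with decaying harmonics.
FS, N, F0 = 8000, 400, 200.0
signal = [
    sum((1.0 / h) * math.sin(2 * math.pi * h * F0 * t / FS) for h in (1, 2, 3))
    for t in range(N)
]
envelope = harmonic_envelope(power_spectrum(signal), FS, N)
print([round(f) for f, _ in envelope])  # -> [200, 400, 600]
```

Joining the returned points gives the profile the caption calls the harmonic envelope. A real signal would need windowing and a more robust peak picker.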
Figure 4. General parameterization scheme used for both female and male speakers.
Figure 5. GMM–UBM speaker verification system. (A) UBM training. (B) GMM speaker model building. (C) Speaker verification. (D) Score normalization.
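Stages (C) and (D) of the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the system used in the paper: it assumes diagonal-covariance GMMs, a log-likelihood-ratio score, and Z-norm as one possible score normalization. UBM training (A) and speaker model building (B) are omitted, and the one-dimensional model parameters below are hypothetical.

```python
import math


def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comp = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xi, m, v in zip(x, mu, var):
                ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            comp.append(ll)
        peak = max(comp)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comp))
    return total / len(frames)


def llr_score(frames, speaker, ubm):
    """Verification score: speaker-model log-likelihood minus UBM log-likelihood."""
    return gmm_loglik(frames, *speaker) - gmm_loglik(frames, *ubm)


def znorm(score, impostor_scores):
    """Z-norm: standardize a score with impostor statistics of the same model."""
    mu = sum(impostor_scores) / len(impostor_scores)
    var = sum((s - mu) ** 2 for s in impostor_scores) / len(impostor_scores)
    return (score - mu) / math.sqrt(var)


# Hypothetical models as (weights, means, variances); the speaker model is the
# UBM with shifted means, as a mean-only adaptation would produce.
ubm = ([0.5, 0.5], [[0.0], [4.0]], [[1.0], [1.0]])
speaker = ([0.5, 0.5], [[0.5], [4.5]], [[1.0], [1.0]])

target_score = llr_score([[0.6], [4.4], [0.4]], speaker, ubm)
impostor_score = llr_score([[-0.5], [3.5]], speaker, ubm)
print(target_score > 0 > impostor_score)  # -> True
```

A trial is accepted when the (optionally normalized) score exceeds a decision threshold.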
Description of the contents of the different subsets for the different scenarios.
| MOBIO | ALBAYZIN | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Background | Background | |||||||||
| Enrollment | Enrollment | |||||||||
| Speakers | #Files | Speakers | #Files | |||||||
| 37 | 7104 | 25 | 75 | |||||||
| 13 | 2496 | 25 | 75 | |||||||
| 50 | 9600 | 50 | 150 | |||||||
| 24 | 120 | 24 | 2520 | 60480 | 25 | 75 | 25 | 550 | 13750 | |
| 18 | 90 | 18 | 1890 | 34020 | 25 | 75 | 25 | 550 | 13750 | |
| 42 | 210 | 42 | 4410 | 94500 | 50 | 150 | 50 | 1100 | 27500 | |
| 38 | 190 | 38 | 3990 | 151620 | 88 | 264 | 88 | 4136 | 363968 | |
| 20 | 100 | 20 | 2100 | 42000 | 88 | 264 | 88 | 4136 | 363968 | |
| 58 | 290 | 58 | 6090 | 193620 | 176 | 528 | 176 | 8272 | 727936 | |
Configurations providing most successful results in terms of EER for GDC and GIC for the ALBAYZIN development set scenario [RR .
| Parameters | Genre | EERM [ | EERM RR | EERF [ | EERF RR | HEER [RR] | ||||
|---|---|---|---|---|---|---|---|---|---|---|
| Gender-independent configuration (GIC MFCCs + Δ) | M/F | 256 | 5 | 34 | 26 | 2.534% [−0.178] | – | 2.170% [−0.169] | – | 2.352% [–] |
| Gender-independent configuration (GIC MFCCs + Δ + ΔΔ) | M/F | 256 | 5 | 50 | 26 | 3.042% [−0.401] (2.04 × 10⁻¹) | −20.05% | 2.409% [−0.375] (3.06 × 10⁻¹) | −11.01% | 2.725% [−15.85%] |
| Gender-dependent configuration (GDC MFCCs + Δ) | M | 256 | 16 | 34 | 26 | 2.390% [−0.001] (5.57 × 10⁻¹) | 5.68% | 2.193% [6.76%] | ||
| F | 256 | 5 | 44 | 26 | 1.996% [−0.166] (4.07 × 10⁻¹) | 8.02% |
The results highlighted in green are the ones achieving higher recognition rates.
EER.
| Parameters | Genre | GSE | Extra parameters | EERM [ | EERM RR | EERF [ | EERF RR | HEER [RR] |
|---|---|---|---|---|---|---|---|---|
| Gender-independent configuration (GIC MFCCs + Δ) | M/F | – | – | 2.534% [−0.178] | – | 2.170% [−0.169] | – | 2.352% [–] |
| Gender-dependent configuration (GDC MFCCs + Δ) | M | – | – | 2.390% [−0.001] (5.57 × 10⁻¹) | 5.68% | | | 2.193% [6.76%] |
| | F | – | – | | | 1.996% [−0.166] (4.07 × 10⁻¹) | 8.02% | |
| Gender-dependent configuration (GDC MFCCs + Δ + Extra) | M | – | E + ΔE | 2.163% [−0.035] (2.09 × 10⁻¹) | 14.64% | | | 1.991% [15.37%] |
| | F | – | E + ΔE + F0 + F3 | | | 1.818% [−0.113] (2.12 × 10⁻¹) | 16.23% | |
| Gender-dependent configuration (GDC MFCCs + Δ + Extra + GSE) | M | | E + ΔE | 1.504% [−0.131] (5.41 × 10⁻⁴) | 40.65% | | | 1.477% [37.19%] |
| | F | | E + ΔE + F0 + F3 | | | 1.451% [−0.145] (1.67 × 10⁻³) | 33.15% | |
The results highlighted in green are the ones achieving higher recognition rates.
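The RR and HEER columns can be cross-checked with a few lines of arithmetic. The definitions below are inferred from the tabulated values rather than stated in this excerpt: RR is assumed to be the relative reduction of the EER with respect to the gender-independent baseline, and HEER is assumed to be the mean of the male and female EERs. Function names are illustrative.

```python
def relative_reduction(baseline_eer, new_eer):
    """Relative reduction (in percent) of new_eer with respect to baseline_eer."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


def heer(eer_male, eer_female):
    """Assumed definition: mean of the male and female EERs."""
    return (eer_male + eer_female) / 2.0


# Values taken from the ALBAYZIN development-set table above (in percent).
BASE_M, BASE_F = 2.534, 2.170

print(round(relative_reduction(BASE_M, 2.390), 2))  # GDC MFCCs + Δ, male
print(round(relative_reduction(BASE_M, 1.504), 2))  # GDC + Extra + GSE, male
print(round(heer(2.390, 1.996), 3))                 # GDC MFCCs + Δ, HEER
```

Under these assumptions the male entries (5.68% and 40.65%) and the HEER of 2.193% are reproduced. Some other entries differ in the last digit, presumably because the table was computed from unrounded error rates.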
Figure 6. DET curves comparing classical parameters in a gender-independent setup with the GDEB parameterization on ALBAYZIN development set for male (left) and female (right) speakers.
Figure 7. Influence of the GSE configuration on the results achieved in terms of EER for both male and female speakers on the development set.
HTER.
| Score norm | Parameters | EERM [ | HTERM ( | HTERM RR | EERF [ | HTERF ( | HTERF RR |
|---|---|---|---|---|---|---|---|
| | Gender-independent configuration (GIC MFCCs + Δ) | 2.534% [−0.178] | 3.347% | – | 2.170% [−0.169] | 3.250% | – |
| | Gender-dependent configuration (GDC MFCCs + Δ + Extra) | 2.163% [−0.035] (2.09 × 10⁻¹) | 3.089% (3.86 × 10⁻¹) | 7.70% | 1.818% [−0.113] (2.11 × 10⁻¹) | 3.094% (8.51 × 10⁻¹) | 4.79% |
| | Gender-dependent configuration (GDC MFCCs + Δ + Extra + GSE) | 1.504% [−0.131] (5.40 × 10⁻⁴) | 2.189% (6.37 × 10⁻¹³) | 34.61% | 1.451% [−0.145] (1.67 × 10⁻³) | 2.673% (2.97 × 10⁻²) | 17.74% |
| | Gender-independent configuration (GIC MFCCs + Δ) | 2.000% [1.847] | 2.783% | – | 1.655% [2.233] | 3.081% | – |
| | Gender-dependent configuration (GDC MFCCs + Δ + Extra) | 1.636% [1.995] (6.32 × 10⁻²) | 2.432% (8.60 × 10⁻⁷) | 12.59% | 1.455% [2.305] (1.73 × 10⁻¹) | 2.870% (3.97 × 10⁻¹) | 6.85% |
| | Gender-dependent configuration (GDC MFCCs + Δ + Extra + GSE) | 1.273% [2.031] (1.30 × 10⁻³) | 1.977% (0.00) | 28.94% | 1.273% [2.304] (2.41 × 10⁻²) | 2.709% (4.84 × 10⁻¹) | 12.07% |
| | Gender-dependent configuration (GDC MFCCs + Δ + Extra + GSE + VTE) | 1.114% [2.092] (3.21 × 10⁻⁴) | 1.917% (1.11 × 10⁻¹⁵) | 31.12% | – | – | – |
| | Gender-independent configuration (GIC MFCCs + Δ) | 2.000% [1.004] | 2.806% | – | 1.455% [1.118] | 2.835% | – |
| | Gender-dependent configuration (GDC MFCCs + Δ + Extra) | 1.807% [1.199] (3.96 × 10⁻¹) | 2.555% (3.14 × 10⁻⁴) | 8.93% | 1.424% [1.238] (8.20 × 10⁻¹) | 2.598% (1.61 × 10⁻¹) | 8.36% |
| | Gender-dependent configuration (GDC MFCCs + Δ + Extra + GSE) | 1.288% [1.252] (8.70 × 10⁻⁴) | 1.812% (2.22 × 10⁻¹⁶) | 35.42% | 1.133% [1.151] (1.84 × 10⁻⁴) | 2.289% (1.15 × 10⁻¹) | 19.28% |
| | Gender-dependent configuration (GDC MFCCs + Δ + Extra + GSE + VTE) | – | – | – | 1.091% [1.270] (7.48 × 10⁻²) | 2.262% (2.29 × 10⁻¹) | 20.21% |
| | Gender-independent configuration (GIC MFCCs + Δ) | 2.045% [2.477] | 3.388% | – | 1.818% [2.886] | 3.075% | – |
| | Gender-dependent configuration (GDC MFCCs + Δ + Extra) | 1.848% [2.794] (6.30 × 10⁻²) | 3.040% (2.49 × 10⁻³) | 10.25% | 1.655% [3.228] (4.50 × 10⁻¹) | 3.203% (4.54 × 10⁻²) | −4.16% |
| | Gender-dependent configuration (GDC MFCCs + Δ + Extra + GSE) | 1.496% [2.777] (1.25 × 10⁻³) | 1.980% (0.00) | 41.55% | 1.231% [3.030] (1.88 × 10⁻³) | 2.635% (2.95 × 10⁻²) | 14.31% |
The results highlighted in green are the ones achieving higher recognition rates.
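The table above pairs each EER with a bracketed threshold and an HTER. The sketch below shows one common way such figures are related. It assumes the threshold is fixed at the equal-error point of the development scores and then applied unchanged to the evaluation scores, with HTER taken as the mean of the false acceptance and false rejection rates. The score lists and function names are hypothetical.

```python
def far_frr(genuine, impostor, threshold):
    """False acceptance and false rejection rates at a given threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def eer_threshold(genuine, impostor):
    """Candidate threshold at which FAR and FRR are closest to each other."""
    def gap(t):
        far, frr = far_frr(genuine, impostor, t)
        return abs(far - frr)
    return min(sorted(set(genuine) | set(impostor)), key=gap)


def hter(genuine, impostor, threshold):
    """Half total error rate: mean of FAR and FRR at a fixed threshold."""
    far, frr = far_frr(genuine, impostor, threshold)
    return (far + frr) / 2.0


# Hypothetical development and evaluation scores.
dev_genuine, dev_impostor = [0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]
eval_genuine, eval_impostor = [0.95, 0.5, 0.55, 0.85], [0.7, 0.2, 0.1, 0.3]

threshold = eer_threshold(dev_genuine, dev_impostor)
print(threshold)                                    # -> 0.6
print(hter(dev_genuine, dev_impostor, threshold))   # -> 0.25 (the development EER)
print(hter(eval_genuine, eval_impostor, threshold)) # -> 0.375
```

Because the threshold is not re-tuned on the evaluation data, the HTER is usually higher than the development EER, which matches the pattern in the table.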
Figure 8. DET curves comparing classical parameters in a gender-independent setup with the GDEB parameterization on ALBAYZIN evaluation set for male (left) and female (right) speakers.
Configurations providing most successful results in terms of EER for GDC and GIC for the MOBIO development set scenario [RR → Relative Reduction/[threshold]/(.
| Parameters | Genre | MFCC | EERM [ | EERM RR | EERF [ | EERF RR | HEER [RR] | |||
|---|---|---|---|---|---|---|---|---|---|---|
| Gender-independent configuration (GIC MFCCs + Δ) | M/F | 256 | 24 | 30 | 27 | 11.567% [−0.007] | – | 11.693% [−0.009] | – | 11.630% [–] |
| Gender-dependent configuration (GDC MFCCs + Δ) | M | 1024 | 24 | 44 | 25 | 10.654% [0.014] (9.81 × 10⁻²) | 7.89% | 11.150% [4.12%] | ||
| F | 256 | 24 | 34 | 24 | 11.646% [0.011] (8.49 × 10⁻¹) | 0.40% |
The results highlighted in green are the ones achieving higher recognition rates.
Figure 9. EER achieved for male and female speakers, when ZTNorm in the case of male (left) and NoNorm in the case of female (right) speakers are applied, in a gender-dependent setup which incorporates different combinations of extra parameters.
EER obtained on development set (ZTNorm – male and NoNorm – female), comparing classical parameters in a gender-independent setup with a gender-dependent setup in which extra parameters and extended biometric parameters are incorporated [RR .
| Parameters | Genre | GSE | Extra parameters | EERM [ | EERM RR | EERF [ | EERF RR | HEER [RR] |
|---|---|---|---|---|---|---|---|---|
| Gender-independent configuration (GIC MFCCs + Δ) | M/F | – | – | 10.594% [1.556] | – | 11.693% [−0.009] | – | 11.143% [–] |
| Gender-dependent configuration (GDC MFCCs + Δ + Extra) | M | – | ΔE + F0 + F3 | 9.165% [1.597] (2.02 × 10⁻⁵) | 13.5% | | | 10.183% [8.61%] |
| | F | – | F0 + F3 | | | 11.201% [0.016] (2.36 × 10⁻¹) | 6.37% | |
| Gender-dependent configuration (GDC MFCCs + Δ + Extra + GSE) | M | | ΔE + F0 + F3 | 8.332% [1.619] (1.38 × 10⁻⁸) | 21.3% | | | 9.48% [14.92%] |
| | F | | F0 + F3 | | | 10.643% [0.010] (4.84 × 10⁻⁴) | 8.98% | |
| Gender-dependent configuration (GDC MFCCs + Δ + Extra + GSE + VTE) | M | 14-channel filter bank, 2 MFCC | ΔE + F0 + F3 | 8.496% [1.506] (2.09 × 10⁻⁵) | 19.8% | | | 9.75% [12.50%] |
| | F | 25-channel filter bank, 2 MFCC | F0 + F3 | | | 11.016% [0.023] (2.02 × 10⁻¹) | 5.79% | |
The results highlighted in green are the ones achieving higher recognition rates.
Figure 10. DET curves comparing classical parameters in a gender-independent setup with the GDEB parameterization on MOBIO development set for male (left) and female (right) speakers.
EER on the development set and HTER on the evaluation set for the systems participating in .
| System name | MALE | System name | FEMALE | |||
|---|---|---|---|---|---|---|
| Development set EER ( | Evaluation set HTER ( | Development set EER ( | Evaluation set HTER ( | |||
| 2.897% | 4.767% | 3.556% | 6.986% | |||
| 5.040% | 7.076% | 7.982% | 10.678% | |||
| 7.889% | 8.191% | 8.364% | 14.181% | |||
| 8.332% (1.39 × 10⁻⁸) | 8.382% (5.86 × 10⁻⁹) | 10.643% (4.85 × 10⁻⁴) | 13.107% (1.40 × 10⁻¹⁰) | |||
| 8.496% (2.09 × 10⁻⁵) | 8.631% (1.70 × 10⁻³) | 11.005% | 17.266% | |||
| 9.601% | 10.779% | 11.016% (2.02 × 10⁻¹) | 13.150% (5.68 × 10⁻¹⁸) | |||
| 9.960% | 10.032% | 11.429% | 11.633% | |||
| 10.198% | 9.109% | 12.011% | 14.269% | |||
| 10.599% | 11.129% | 13.484% | 22.140% | |||
| 11.310% | 10.058% | 14.348% | 15.987% | |||
| 11.824% | 10.214% | 16.836% | 17.858% | |||
| 12.738% | 19.404% | 17.937% | 19.511% | |||
| 14.881% | 15.429% | 3.556% | 6.986% | |||
| 24.643% | 22.524% | 7.982% | 10.678% | |||
First row corresponds to results achieved by the fusion of all systems. Red, systems performing fusion of different systems. Orange, systems incorporating external/additional training data. Green, results achieved by the proposal presented in this article.