Xingyu Chen1,2, Zhengxiong Li1,2, Srirangaraj Setlur2, Wenyao Xu3.
Abstract
Systemic inequity in biometric systems arising from racial and gender disparities has received considerable attention recently. These disparities have been explored in existing biometric systems such as facial biometrics (identifying individuals based on facial attributes). However, such ethical issues remain largely unexplored in voice biometric systems, which are very popular and extensively used globally. Using a corpus of non-speech voice records featuring a diverse group of 300 speakers by race (75 each from White, Black, Asian, and Latinx subgroups) and gender (150 each from female and male subgroups), we show that racial subgroups have similar voice characteristics, whereas gender subgroups have significantly different voice characteristics. Moreover, by analyzing the performance of one commercial product and five research products, we find non-negligible racial and gender disparities in speaker identification accuracy. The average accuracy for Latinx speakers can be 12% lower than for White speakers (p < 0.05, 95% CI 1.58%, 14.15%), and accuracy can be significantly higher for female speakers than for male speakers (3.67% higher, p < 0.05, 95% CI 1.23%, 11.57%). We further find that racial disparities result primarily from the neural network-based feature extraction within the voice biometric product, while gender disparities result primarily from both inherent differences in voice characteristics and neural network-based feature extraction. Finally, we point out strategies (e.g., feature extraction optimization) for incorporating fairness and inclusion considerations into biometric technology.
Year: 2022 PMID: 35260572 PMCID: PMC8904636 DOI: 10.1038/s41598-022-06673-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. These voice biometric services could produce significantly different identification results towards speakers with diverse demographic backgrounds.
Details of the open-source research products for voice biometrics.
| Model name | Network block | Feature |
|---|---|---|
| 1d-CNN | Convolutional layer | log Mel-spec-32 + Conv1D*8 |
| TDNN (x-vector) | Time-delay neural network | log Mel-spec-32 + x-vector |
| ResNet-18 | Residual block | Spec-257 + Featuremap [2, 2, 2, 2] |
| ResNet-34 | Residual block | Spec-257 + Featuremap [3, 4, 6, 3] |
| AutoSpeech | Normal and reduction cells | Spec-257 + Searched Architecture |
log Mel-spec: log Mel-spectrogram; Spec: spectrogram.
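As a rough illustration of the "Spec-257" base feature, the sketch below computes a magnitude spectrogram with 257 frequency bins. It assumes that "257" refers to the number of frequency bins, which is what a 512-point real FFT produces (512 / 2 + 1 = 257). The FFT size, hop length, Hann window, and 16 kHz sample rate are assumptions for illustration and are not stated in the table above; `magnitude_spectrogram` is a hypothetical helper name.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=512, hop=160):
    """Return a (frames, n_fft // 2 + 1) magnitude spectrogram.

    n_fft=512 and hop=160 are assumed values, not taken from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + n_fft] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000                      # assumed sample rate in Hz
t = np.arange(sr) / sr          # one second of samples
tone = np.sin(2 * np.pi * 440.0 * t)
spec = magnitude_spectrogram(tone)
print(spec.shape)               # second dimension is 257 for n_fft=512
```

With these assumed settings each bin spans 31.25 Hz, so a 440 Hz tone peaks in bin 14.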
Figure 2. Health condition and age distribution of subgroups. The age distribution is positively skewed, with its peak at about 20 years.
List of critical voice fundamental metrics.
| Feature name | Voice property | Voice measurement |
|---|---|---|
| Spectral entropy | Voice signal irregularity | The power spectral density |
| PDF entropy | Voice uniqueness and stability | The mutual information |
| Permutation entropy | Voice complexity | The comparison to the ordinal probability distribution |
| SVD entropy | The complexity of voice | The dimensionality of the signal |
| MFCC | The short-term power spectrum of voice | The shape of a spectral envelope |
| Formants | Acoustic resonance of the vocal tract | The spectral peaks of sound spectrum |
| RMS | Continuous power of voice | The root mean square of signal amplitude |
| Pitch onsets | Increases in spectral energy | The number of peaks from onset strength envelope |
| Centroid | Brightness of the voice | The centroid of spectrum |
| Roll-Off | Approximate low bass and high treble | The center frequency for a spectrogram bin that contains |
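Three of the metrics listed above (RMS, spectral centroid, and spectral entropy) have simple closed-form definitions, sketched below with NumPy. This is a minimal whole-signal illustration under stated assumptions, not the toolchain used in the paper: the function names are hypothetical, the metrics are computed over the full signal rather than per frame, and the base-2 logarithm in the entropy is an assumed choice of unit.

```python
import numpy as np

def rms(x):
    """Root mean square of the signal amplitude."""
    return float(np.sqrt(np.mean(np.square(x))))

def spectral_centroid(x, sr):
    """Magnitude-weighted mean frequency of the spectrum, in Hz."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))

def spectral_entropy(x):
    """Shannon entropy (bits, assumed unit) of the normalized power spectrum."""
    psd = np.abs(np.fft.rfft(x)) ** 2
    p = psd / np.sum(psd)
    p = p[p > 0]                 # skip empty bins to avoid log(0)
    return float(-np.sum(p * np.log2(p)))

sr = 16000                                   # assumed sample rate in Hz
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)         # regular signal
noise = np.random.default_rng(0).standard_normal(sr)  # irregular signal

print(rms(tone), spectral_centroid(tone, sr))
print(spectral_entropy(tone), spectral_entropy(noise))
```

A pure tone concentrates its power in one bin, so its spectral entropy is near zero, while white noise spreads power across all bins and scores much higher. This matches the table's reading of spectral entropy as a measure of signal irregularity.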
Figure 3. Selected voice fundamental metrics for measuring the natural properties of the human voice on both the racial and gender datasets. The y-axis is the metric value. The x-axis is the subgroup (W White, B Black, A Asian, L Latinx, M Male, F Female). There are significant differences in the racial group (in F0, F1, F2, PDF entropy, and Perm entropy) and the gender group (in all metrics except two MFCC-based metrics).
Voice fundamental metrics results of racial subgroups.
| Metric | White Mean | White STD | Black Mean | Black STD | Latinx Mean | Latinx STD | Asian Mean | Asian STD |
|---|---|---|---|---|---|---|---|---|
| Onsets | 41.2012 | 28.2322 | 39.1875 | 28.4768 | 42.998 | 27.6175 | 42.0039 | 27.1027 |
| RMS | 0.1807 | 0.0619 | 0.1772 | 0.0716 | 0.1816 | 0.0689 | 0.1914 | 0.0721 |
| Centroid | 628.3296 | 133.7869 | 642.0621 | 165.6585 | 624.9091 | 134.0623 | 597.3267 | 118.6543 |
| Roll-off | 2261.4896 | 881.9737 | 2361.1009 | 968.6721 | 2210.6749 | 836.0810 | 2105.1559 | 731.9750 |
| MFCC | 39.9458 | 43.6219 | 38.5572 | 40.7739 | ||||
| | 0.7752 | 1.5033 | 0.6574 | 1.3368 | 0.6960 | 1.4778 | 0.7030 | 1.4450 |
| | 0.1147 | 0.3910 | 0.1235 | 0.3761 | 0.1023 | 0.3910 | 0.1150 | 0.3585 |
| F0 | 134.0372 | 42.5300 | 131.9899 | 54.2746 | 132.4334 | 40.2225 | 142.1182 | 43.8327 |
| F1 | 495.3698 | 158.2555 | 496.7421 | 145.8382 | 485.3269 | 165.9858 | 454.0798 | 170.0188 |
| F2 | 1025.3560 | 218.2198 | 1074.4825 | 213.0972 | 1029.9507 | 250.5701 | 1008.1242 | 231.5954 |
| PDF entropy | 6.4428 | 0.0113 | 6.4360 | 0.0153 | 6.4395 | 0.0136 | 6.4395 | 0.0141 |
| Perm entropy | 1.9686 | 0.1622 | 2.0170 | 0.1773 | 1.9714 | 0.1701 | 1.9411 | 0.1472 |
| Spectral entropy | 9.1265 | 0.8457 | 9.2374 | 1.0474 | 9.4270 | 1.0227 | 9.2293 | 0.9024 |
| SVD entropy | 0.8531 | 0.1010 | 0.8577 | 0.1093 | 0.8548 | 0.1072 | 0.8254 | 0.1029 |
Voice fundamental metrics results of gender subgroups.
| Metric | Male Mean | Male STD | Female Mean | Female STD | p value | 95% CI | |
|---|---|---|---|---|---|---|---|
| Onsets | 43.91 | 27.73 | 39.78 | 27.70 | -193.05 | ||
| RMS | 0.1839 | 0.061 | 0.2130 | 0.076 | 382.01 | ||
| Centroid | 599.8 | 120.9 | 667.7 | 150.8 | 361.97 | 483.61 | |
| Roll-off | 2066 | 781.5 | 2410 | 905.8 | 290.35 | 411.99 | |
| MFCC | 38.60 | 39.98 | |||||
| | 0.76 | 1.488 | 0.71 | 1.391 | 0.4766 | 38.73 | |
| | 0.11 | 0.38 | 0.19 | 0.38 | 210.53 | 332.18 | |
| F0 | 117.9 | 24.77 | 210.5 | 51.48 | 0 | 1.27 | 1.39 |
| F1 | 522 | 147.12 | 361.4 | 158.8 | |||
| F2 | 1048 | 224.4 | 906 | 184.51 | |||
| PDF Entropy | 6.440 | 0.011 | 6.448 | 0.014 | 38.99 | 78.25 | |
| Perm Entropy | 1.966 | 0.157 | 1.972 | 0.171 | 0.8158 | 21.96 | |
| Spectral Entropy | 9.347 | 0.824 | 8.52 | 1.02 | |||
| SVD Entropy | 0.853 | 0.1 | 0.116 | 0.116 | 0.628 | 14.77 | |
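One way to read the means and standard deviations in the table above is as a standardized effect size. The sketch below computes Cohen's d for two rows, F0 and Perm entropy, using only the values reported in the table. It is an illustrative reading aid, not the statistical test used in the paper: it assumes that the two gender groups contribute equally many samples, so the pooled standard deviation is the root mean of the two variances. `cohens_d` is a hypothetical helper name.

```python
import math

def cohens_d(mean_a, std_a, mean_b, std_b):
    """Cohen's d with an equal-group-size pooled standard deviation (assumed)."""
    pooled = math.sqrt((std_a ** 2 + std_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled

# Values copied from the gender table (female first, then male).
d_f0 = cohens_d(210.5, 51.48, 117.9, 24.77)      # F0
d_perm = cohens_d(1.972, 0.171, 1.966, 0.157)    # Perm entropy

print(round(d_f0, 2), round(d_perm, 3))
```

Under these assumptions F0 shows a very large gender effect (d above 2), while Perm entropy shows almost none, which agrees with its reported p value of 0.8158.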
Figure 4. Voice identification performance among the matched racial dataset. The x-axis is racial subgroups (W White, B Black, A Asian, L Latinx). The y-axis is the percentage accuracy. ResNet-34 has the best overall accuracy. Significant differences exist among all data or between subgroups in Microsoft Azure, 1d-CNN, ResNet-18, and AutoSpeech. No significant differences in TDNN and ResNet-34.
Figure 5. Voice identification performance among the matched gender dataset. The x-axis is gender subgroups. The y-axis is the percentage accuracy. ResNet-34 has the best overall accuracy. Significant differences between females and males are discovered in Microsoft Azure, ResNet-18, ResNet-34, and AutoSpeech. No significant differences in TDNN and 1d-CNN.
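The subgroup comparisons in Figures 4 and 5 amount to computing identification accuracy per subgroup and testing whether the gap is larger than chance would explain. The sketch below shows one way to do this with a two-sided permutation test. The permutation test is a stand-in chosen for illustration, since this section does not state which test the authors used, and the per-trial outcomes are hypothetical placeholders, not data from the paper.

```python
import random

def accuracy(correct_flags):
    """Fraction of trials where the speaker was identified correctly."""
    return sum(correct_flags) / len(correct_flags)

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the accuracy gap between two groups."""
    rng = random.Random(seed)
    observed = abs(accuracy(group_a) - accuracy(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the gap.
        rng.shuffle(pooled)
        gap = abs(accuracy(pooled[:n_a]) - accuracy(pooled[n_a:]))
        if gap >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical outcomes (1 = correct identification), for illustration only.
group_a = [1] * 90 + [0] * 10
group_b = [1] * 70 + [0] * 30

gap = accuracy(group_a) - accuracy(group_b)
p = permutation_p_value(group_a, group_b)
print(gap, p)
```

A small p value indicates that an accuracy gap of this size rarely arises when subgroup labels are shuffled. Comparing a group with itself gives a gap of zero and a p value of 1.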
Figure 6. A taxonomy of the vocal biological structure, voice properties, and computational voice features for voice biometrics.
Figure 7. The voice biometric system is divided into two main parts: feature extraction and classification. The feature extraction part includes the base feature and the neural network-based feature maps. ResNet-18, ResNet-34, and AutoSpeech use the same classification stage and the same base feature (spectrogram). Racial disparity is found in ResNet-18 and AutoSpeech, and gender disparity is found in all three. These disparities are primarily caused by the neural network-based feature extraction. F1 denotes the first formant here. The average F1 is 495.3 Hz for White speakers and 485.3 Hz for Latinx speakers.