Philip Weber1,2, Ewald Enzinger1,2,3, Beltrán Labrador4, Alicia Lozano-Díez4, Daniel Ramos4, Joaquín González-Rodríguez4, Geoffrey Stewart Morrison1,2.
Abstract
This paper reports on validations of an alpha version of the E3 Forensic Speech Science System (E3FS3) core software tools. This is an open-code human-supervised-automatic forensic-voice-comparison system based on x-vectors extracted using a type of Deep Neural Network (DNN) known as a Residual Network (ResNet). A benchmark validation was conducted using training and test data (forensic_eval_01) that have previously been used to assess the performance of multiple other forensic-voice-comparison systems. Performance equalled that of the best-performing system with previously published results for the forensic_eval_01 test set. The system was then validated using two different populations (male speakers of Australian English and female speakers of Australian English) under conditions reflecting those of a particular case to which it was to be applied. The conditions included three different sets of codecs applied to the questioned-speaker recordings (two mismatched with the set of codecs applied to the known-speaker recordings), and multiple different durations of questioned-speaker recordings. Validations were conducted and reported in accordance with the "Consensus on validation of forensic voice comparison".
Keywords: Forensic voice comparison; Likelihood ratio; Validation; x-vector
Year: 2022 PMID: 35281657 PMCID: PMC8908042 DOI: 10.1016/j.fsisyn.2022.100223
Source DB: PubMed Journal: Forensic Sci Int Synerg ISSN: 2589-871X
Fig. 1. High-level architecture of E3FS3α.
Fig. 2. Simplified schematic of the architecture of the ResNet used for extracting x-vectors.
Sizes of the dimensions of the structures and substructures of the ResNet used for extracting x-vectors.
| Structure | Substructure | Time | Frequency | Channels |
|---|---|---|---|---|
| Feature vectors | – | 400 | 40 | 1 |
| Input layer | – | 400 | 20 | 16 |
| Group 1 | 3 blocks | 400 | 20 | 16 |
| Group 2 | 4 blocks | 200 | 10 | 32 |
| Group 3 | 6 blocks | 100 | 5 | 64 |
| Group 4 | 3 blocks | 100 | 5 | 128 |
| Statistics-pooling block | Layer 1 | 100 | 1 | 128 |
| | Channel-attention layer | 1 | 1 | 128 |
| | Layer 2 | 100 | 1 | 1 |
| | Layer 3 | 1 | 1 | 128 |
| x-vector layer | – | 1 | 1 | 512 |
| Output layer | – | 1 | 1 | Number of training speakers |
Fig. 3. Simplified schematic of the feature vectors and the input layer of the ResNet used for extracting x-vectors. Only one channel is shown.
Fig. 4. Simplified schematic of the architecture of a block within the ResNet used for extracting x-vectors. Only one channel is shown.
Fig. 5. Simplified schematic of the final stages of the ResNet used for extracting x-vectors. In this figure “×” indicates matrix multiplication.
Fig. 6. Tippett plot of the results of validating E3FS3α using the forensic_eval_01 data.
Cllr values from the best-performing version of each system validated in the Speech Communication virtual special issue, plus the Cllr result from E3FS3α, each validated using the forensic_eval_01 data. Alternating background shading groups types of systems.
Numbers of feature vectors and corresponding net-speech durations of the questioned-speaker-condition recordings used for the case-specific validations reported in the present paper.
| Number of feature vectors extracted | Net-speech duration (s) |
|---|---|
| 500 | 5 |
| 1,000 | 10 |
| 1,500 | 15 |
| 2,000 | 20 |
| 3,000 | 30 |
| 4,500 | 45 |
| 6,000 | 60 |
| 9,000 | 90 |
| 12,000 | 120 |
| 18,000 | 180 |
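The rows above are consistent with a fixed rate of 100 feature vectors per second of net speech (for example, 500 vectors for 5 s). The record does not state the frame shift, so the rate in the minimal sketch below is an assumption inferred from the table, and the names `FEATURE_VECTORS_PER_SECOND` and `net_speech_duration_s` are hypothetical.

```python
# Assumption inferred from the table: 100 feature vectors per second of net speech.
FEATURE_VECTORS_PER_SECOND = 100


def net_speech_duration_s(n_feature_vectors: int) -> float:
    """Convert a feature-vector count to a net-speech duration in seconds."""
    if n_feature_vectors < 0:
        raise ValueError("feature-vector count must be non-negative")
    return n_feature_vectors / FEATURE_VECTORS_PER_SECOND


# Feature-vector counts listed in the table above.
TABLE_COUNTS = [500, 1000, 1500, 2000, 3000, 4500, 6000, 9000, 12000, 18000]

if __name__ == "__main__":
    for count in TABLE_COUNTS:
        print(f"{count:>6} feature vectors -> {net_speech_duration_s(count):g} s")
```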
Cllr values for case-specific conditions – male speakers.
| Questioned-speaker net-speech duration (s) | GSM 06.10 | μ-law + G.729a | μ-law + G.723.1 |
|---|---|---|---|
| 5 | 0.330 | 0.341 | 0.376 |
| 10 | 0.241 | 0.156 | 0.253 |
| 15 | 0.208 | 0.133 | 0.189 |
| 20 | 0.129 | 0.100 | 0.136 |
| 30 | 0.090 | 0.097 | 0.085 |
| 45 | 0.118 | 0.094 | 0.057 |
| 60 | 0.090 | 0.069 | 0.075 |
| 90 | 0.084 | 0.079 | 0.054 |
| 120 | 0.083 | 0.064 | 0.067 |
| 180 | 0.077 | 0.059 | 0.057 |
Cllr values for case-specific conditions – female speakers.
| Questioned-speaker net-speech duration (s) | GSM 06.10 | μ-law + G.729a | μ-law + G.723.1 |
|---|---|---|---|
| 5 | 0.374 | 0.451 | 0.456 |
| 10 | 0.251 | 0.349 | 0.238 |
| 15 | 0.276 | 0.311 | 0.285 |
| 20 | 0.206 | 0.214 | 0.184 |
| 30 | 0.150 | 0.122 | 0.144 |
| 45 | 0.138 | 0.143 | 0.167 |
| 60 | 0.117 | 0.120 | 0.106 |
| 90 | 0.095 | 0.081 | 0.088 |
| 120 | 0.112 | 0.074 | 0.104 |
| 180 | 0.096 | 0.074 | 0.101 |
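The Cllr (log-likelihood-ratio cost) values in the tables above are not defined in this record. As commonly defined in the forensic-voice-comparison literature, Cllr averages a logarithmic penalty over same-speaker and different-speaker comparisons; lower is better, and a system that always outputs a likelihood ratio of 1 scores exactly 1. The sketch below follows that common definition and is not taken from the E3FS3 code; the function name `cllr` and its arguments are hypothetical.

```python
import math


def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost from likelihood ratios (not log LRs).

    Hypothetical helper following the commonly used definition:
    Cllr = 0.5 * (mean(log2(1 + 1/LR_ss)) + mean(log2(1 + LR_ds)))
    """
    if not same_speaker_lrs or not different_speaker_lrs:
        raise ValueError("both lists of likelihood ratios must be non-empty")
    # Same-speaker comparisons are penalised for small likelihood ratios.
    ss_penalty = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs)
    ss_penalty /= len(same_speaker_lrs)
    # Different-speaker comparisons are penalised for large likelihood ratios.
    ds_penalty = sum(math.log2(1 + lr) for lr in different_speaker_lrs)
    ds_penalty /= len(different_speaker_lrs)
    return 0.5 * (ss_penalty + ds_penalty)


if __name__ == "__main__":
    # Illustrative values only, not data from the validations.
    print(cllr([1.0], [1.0]))      # uninformative system
    print(cllr([100.0], [0.01]))   # strongly correct likelihood ratios
```

Read against the tables, a value such as 0.059 (male speakers, 180 s, μ-law + G.729a) is much closer to 0 than to the uninformative reference of 1.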
Fig. 7. Cllr values for case-specific conditions. (a) Male speakers. (b) Female speakers.
Fig. 8. Tippett plots of the results of validating E3FS3α using case-specific data (female Australian-English speakers and questioned-speaker condition μ-law + G.729a) at different questioned-speaker net-speech durations.
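A Tippett plot shows cumulative proportions of log likelihood ratios for same-speaker and different-speaker comparisons. Plotting conventions differ between authors; the minimal sketch below assumes one common convention, in which the same-speaker curve gives the proportion of log10 LRs less than or equal to each x-axis value and the different-speaker curve gives the proportion greater than or equal to it. The function name `tippett_curves` and the example values are hypothetical and are not results from the paper.

```python
def tippett_curves(ss_log10_lrs, ds_log10_lrs, grid):
    """Return (same-speaker curve, different-speaker curve) evaluated on grid.

    Assumed convention: same-speaker proportion <= x, different-speaker
    proportion >= x, with all values expressed as log10 likelihood ratios.
    """
    if not ss_log10_lrs or not ds_log10_lrs:
        raise ValueError("both lists of log10 likelihood ratios must be non-empty")
    ss_curve = [
        sum(value <= x for value in ss_log10_lrs) / len(ss_log10_lrs) for x in grid
    ]
    ds_curve = [
        sum(value >= x for value in ds_log10_lrs) / len(ds_log10_lrs) for x in grid
    ]
    return ss_curve, ds_curve


if __name__ == "__main__":
    # Illustrative log10 LRs only.
    ss, ds = tippett_curves([1.0, 2.0], [-1.0, -2.0], [-2, -1, 0, 1, 2])
    print(ss)
    print(ds)
```

Under this convention, better performance appears as curves that are further apart, with the two curves crossing near a log10 LR of 0 for a well-calibrated system.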