Eugenia San Segundo, Athanasios Tsanas, Pedro Gómez-Vilda.
Abstract
There is a growing consensus that hybrid approaches are necessary for successful speaker characterization in Forensic Speaker Comparison (FSC); hence this study explores the forensic potential of voice features combining source and filter characteristics. The former relate to the action of the vocal folds while the latter reflect the geometry of the speaker's vocal tract. This set of features has been extracted from pause fillers, which are long enough for robust feature estimation while spontaneous enough to be extracted from voice samples in real forensic casework. Speaker similarity was measured using standardized Euclidean Distances (ED) between pairs of speakers: 54 different-speaker (DS) comparisons, 54 same-speaker (SS) comparisons and 12 comparisons between monozygotic twins (MZ). Results revealed that the differences between DS and SS comparisons were significant in both high-quality and telephone-filtered recordings, with no false rejections and limited false acceptances; this finding suggests that this set of voice features is highly speaker-dependent and therefore forensically useful. Mean ED for MZ pairs lies between the average ED for SS comparisons and DS comparisons, as expected according to the literature on twin voices. Specific cases of MZ speakers with very high ED (i.e. strong dissimilarity) are discussed in the context of sociophonetic and twin studies. A preliminary simplification of the Vocal Profile Analysis (VPA) Scheme is proposed, which enables the quantification of voice quality features in the perceptual assessment of speaker similarity, and allows for the calculation of perceptual-acoustic correlations. The adequacy of z-score normalization for this study is also discussed, as well as the relevance of heat maps for detecting the so-called phantoms in recent approaches to the biometric menagerie.
Keywords: Acoustic analysis; Forensic phonetics; Pause fillers; Perceptual assessment; Twins; Voice quality
Year: 2016 PMID: 27912151 PMCID: PMC5698260 DOI: 10.1016/j.forsciint.2016.11.020
Source DB: PubMed Journal: Forensic Sci Int ISSN: 0379-0738 Impact factor: 2.395
Feature set names, description and number of features per category (total: 309 different features).
| Feature | Description | Number of features |
|---|---|---|
| Jitter variants | Fundamental frequency perturbations | 30 |
| Shimmer variants | Amplitude perturbations | 21 |
| Harmonics to noise ratio | Signal to noise ratio using autocorrelation | 4 |
| Glottal quotient | Quantifying vocal fold cycle variability | 3 |
| Recurrence period density entropy (RPDE) | Uncertainty in estimation of fundamental frequency | 1 |
| Detrended fluctuation analysis (DFA) | Stochastic self-similarity of turbulent noise | 1 |
| Pitch period entropy (PPE) | Quantifying variability in F0 over and above normal variability in healthy controls | 1 |
| Glottal to noise excitation (GNE) | Noise synchronization in different frequency bands | 6 |
| Vocal Fold Excitation Ratio (VFER) | Noise synchronization in different frequency bands | 9 |
| Empirical Mode Decomposition Excitation Ratio (EMD-ER) | Decomposing the signal in multiple time series using EMD and quantifying energy and entropy | 6 |
| Mel Frequency Cepstral Coefficients (MFCC) | Amplitude and spectral fluctuations | 42 |
| F0-related measures | f0 statistical characterization, differences compared to age- and gender-matched healthy controls | 3 |
| Wavelet-based measures | Characterizing f0 using wavelet decomposition methods | 182 |
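Speaker similarity in the study is measured with a standardized Euclidean distance over feature vectors such as the 309 features listed above. The sketch below shows one common form of that distance, in which each feature difference is divided by that feature's standard deviation across speakers. The exact standardization used by the authors is not given here, so the scaling choice, the function names and the toy two-feature matrix are all assumptions for illustration only.

```python
import math
from statistics import stdev


def feature_scales(feature_matrix):
    """Per-feature sample standard deviation across speakers (rows)."""
    columns = list(zip(*feature_matrix))
    return [stdev(col) for col in columns]


def standardized_euclidean(x, y, scales):
    """Euclidean distance after dividing each feature difference by its scale.

    Assumes every scale is non-zero; a constant feature would need to be
    dropped or handled separately before calling this function.
    """
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, scales)))


# Hypothetical toy data: 3 speakers x 2 features (not values from the study).
toy_features = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]

scales = feature_scales(toy_features)
d01 = standardized_euclidean(toy_features[0], toy_features[1], scales)
d02 = standardized_euclidean(toy_features[0], toy_features[2], scales)
```

Dividing by the standard deviation keeps features measured on large numeric scales from dominating the distance, which matters when perturbation measures, entropies and cepstral coefficients are combined in one vector.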
Simplified Vocal Profile Analysis Scheme (SVPAS). Columns after Key are the major setting groups. Full names of the abbreviations used in the table: Mandib.: Mandibular; Ling.: Lingual; Pharyng.: Pharyngeal; Velo-pharyng.: Velo-pharyngeal; VT: Vocal Tract; L: Laryngeal; Phon.: Phonation; Labiodent.: Labiodentalization; Protr.: Protruded; Creak.: Creakiness; Whisp.: Whisperiness; Harsh.: Harshness.
| Key | Labial | Mandib. | Ling. tip | Ling. body | Pharyng. | Velo-pharyng. | Larynx height | VT tension | L tension | Phon. types |
|---|---|---|---|---|---|---|---|---|---|---|
| 1a | Lip rounding | Close | Advanced | Front & raised | Constricted | Audible nasal escape | Raised larynx | Tense | Tense | Falsetto |
| 1b | Lip spreading | Open | Retracted | Back & lowered | Expanded | Nasal | Lowered larynx | Lax | Lax | Creak. |
| 1c | Labiodent. | Protr. | | | | Denasal | | | | Whisp. |
| 1d | | | | | | | | | | Harsh. |
| 1e | | | | | | | | | | Tremor |
Example of calculation of Simple Matching Coefficients (SMC) for MZ twin pair 41–42.
| Speaker | Labial | Mandib. | Ling. tip | Ling. body | Pharyng. | Velo-pharyng. | Larynx height | VT tension | L tension | Phon. types | SMC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 41 | 0 | 1a | 1a | 0 | 0 | 0 | 0 | 1b | 1b | 1c | |
| 42 | 0 | 1a | 0 | 0 | 0 | 0 | 1b | 1b | 1b | 1c | |
| Matches | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0.8 |
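The worked example for twin pair 41–42 can be reproduced in a few lines. This is a minimal sketch that assumes the SMC is simply the proportion of setting groups on which the two speakers receive the same label; the function and variable names are illustrative and do not come from the paper.

```python
SETTING_GROUPS = [
    "Labial", "Mandib.", "Ling. tip", "Ling. body", "Pharyng.",
    "Velo-pharyng.", "Larynx height", "VT tension", "L tension", "Phon. types",
]


def simple_matching_coefficient(labels_a, labels_b):
    """Return (per-group matches, proportion of matching groups)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both speakers need one label per setting group")
    matches = [int(a == b) for a, b in zip(labels_a, labels_b)]
    return matches, sum(matches) / len(matches)


# Perceptual labels for MZ twin pair 41-42, copied from the table above.
speaker_41 = ["0", "1a", "1a", "0", "0", "0", "0", "1b", "1b", "1c"]
speaker_42 = ["0", "1a", "0", "0", "0", "0", "1b", "1b", "1b", "1c"]

matches, smc = simple_matching_coefficient(speaker_41, speaker_42)
```

Here the two speakers differ only on lingual tip and larynx height, so 8 of the 10 setting groups match and the coefficient is 0.8, as in the table.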
Fig. 1. Fundamental frequency (f0) contour of 10 randomly selected tokens to visually assess f0 variability. Each token corresponds to a different subject (S).
ED values in high-quality (HQ) and telephone-filtered (TF) condition for the 12 MZ pairs. The values considered outliers are shown in italics, corresponding to the strongest dissimilarity (pair 11–12 for both conditions; pair 35–36 for TF condition).
| SP_1 | 1 | 3 | 5 | 7 | 9 | 11 | 33 | 35 | 37 | 39 | 41 | 43 |
| SP_2 | 2 | 4 | 6 | 8 | 10 | 12 | 34 | 36 | 38 | 40 | 42 | 44 |
| HQ | 6.11 | 6.86 | 6.54 | 8.18 | 6.27 | *16.19* | 6.04 | 6.16 | 5.95 | 7.26 | 5.44 | 6.43 |
| TF | 6.37 | 7.22 | 6.91 | 8.64 | 10.70 | | 6.00 | | 5.48 | 10.49 | 6.11 | 6.96 |
ED values in high-quality (HQ) and telephone-filtered (TF) condition for the 54 same-speaker comparisons.
| SP_1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| SP_2 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| HQ | 5.32 | 5.34 | 5.38 | 5.34 | 5.41 | 5.33 | 5.39 | 5.35 | 5.34 | 5.32 | 5.33 | 5.31 | 5.36 | 5.33 | 5.40 | 5.41 | 5.26 | 5.37 |
| TF | 5.31 | 5.30 | 5.30 | 5.35 | 5.36 | 5.19 | 5.35 | 5.31 | 5.27 | 5.30 | 5.31 | 5.26 | 5.28 | 5.31 | 5.36 | 5.31 | 5.23 | 5.32 |
| SP_1 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 |
| SP_2 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 |
| HQ | 5.35 | 5.39 | 5.35 | 5.30 | 5.38 | 5.35 | 5.43 | 5.35 | 5.33 | 5.34 | 5.35 | 5.38 | 5.37 | 5.33 | 5.39 | 5.36 | 5.41 | 5.33 |
| TF | 5.32 | 5.34 | 5.28 | 5.16 | 5.30 | 5.23 | 5.36 | 5.24 | 5.27 | 5.28 | 5.36 | 5.26 | 5.30 | 5.32 | 5.37 | 5.33 | 5.37 | 5.33 |
| SP_1 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 |
| SP_2 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 |
| HQ | 5.31 | 5.36 | 5.35 | 5.41 | 5.41 | 5.40 | 5.35 | 5.35 | 5.34 | 5.28 | 5.27 | 5.36 | 5.34 | 5.34 | 5.36 | 5.38 | 5.39 | 5.37 |
| TF | 5.33 | 5.34 | 5.34 | 5.40 | 5.36 | 5.33 | 5.28 | 5.34 | 5.25 | 5.20 | 5.22 | 5.29 | 5.34 | 5.33 | 5.34 | 5.33 | 5.31 | 5.36 |
ED values in high-quality (HQ) and telephone-filtered (TF) condition for the 54 different-speaker (DS) comparisons. The values considered outliers are shown in italics, corresponding to the strongest between-speaker dissimilarity. The values in bold are the lowest ED for DS comparisons, and they overlap with the average ED for SS comparisons (False acceptances).
| SP_1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| SP_2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| HQ | 11.44 | 6.85 | 9.06 | 7.99 | 7.16 | 10.01 | 13.36 | 15.69 | 8.55 | 8.64 | *138.6* | 9.77 | 8.16 | 7.15 | 10.41 | 11.49 | 9.26 | 9.14 |
| TF | 6.87 | 7.13 | 6.71 | 8.11 | 7.50 | 11.78 | 18.17 | 8.76 | 8.67 | 7.30 | 6.97 | 7.01 | 8.52 | 7.24 | 32.87 | 7.68 | ||
| SP_1 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 |
| SP_2 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 |
| HQ | 7.64 | *282.2* | 8.53 | 8.93 | 7.61 | 6.94 | 7.86 | 10.53 | 19.41 | 9.07 | 8.20 | 7.90 | 7.68 | 7.93 | 6.88 | 7.14 | 8.15 | 6.06 |
| TF | 6.61 | 17.83 | 6.89 | 11.56 | 7.27 | 7.68 | 8.02 | 6.40 | 13.97 | 9.77 | 7.72 | 6.49 | 9.89 | 7.19 | 6.42 | |||
| SP_1 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 |
| SP_2 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 1 | 2 |
| HQ | 10.73 | 15.03 | 8.00 | 15.66 | 6.62 | 6.70 | 7.88 | 31.37 | *222.9* | 6.09 | 6.17 | 9.05 | 6.90 | 7.49 | 6.53 | 13.82 | 7.31 | 8.80 |
| TF | 9.35 | 19.19 | 9.02 | 11.92 | 6.21 | 6.41 | 10.36 | 21.14 | 20.36 | 4.59 | 6.10 | 9.17 | 5.90 | 7.29 | 8.91 | 8.39 | 6.37 | |
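To make the notion of false acceptances and false rejections concrete, the sketch below applies a simple threshold to a subset of the telephone-filtered (TF) values listed above: the last block of same-speaker ED values and the TF values that survive in the last block of the different-speaker table. The decision rule, which places the threshold at the largest same-speaker ED, is an assumption made for illustration and is not stated in the source as the authors' procedure.

```python
# TF same-speaker ED, speakers 37-54 (from the same-speaker table).
ss_tf = [5.33, 5.34, 5.34, 5.40, 5.36, 5.33, 5.28, 5.34, 5.25,
         5.20, 5.22, 5.29, 5.34, 5.33, 5.34, 5.33, 5.31, 5.36]

# TF different-speaker ED values present in the last block of the table above.
ds_tf = [9.35, 19.19, 9.02, 11.92, 6.21, 6.41, 10.36, 21.14, 20.36,
         4.59, 6.10, 9.17, 5.90, 7.29, 8.91, 8.39, 6.37]


def count_errors(same_speaker, different_speaker, threshold):
    """Count decisions that disagree with the truth under an ED threshold.

    A comparison is accepted as 'same speaker' when its ED <= threshold.
    """
    false_rejections = sum(1 for d in same_speaker if d > threshold)
    false_acceptances = sum(1 for d in different_speaker if d <= threshold)
    return false_rejections, false_acceptances


# Assumed rule: accept anything no farther apart than the worst SS case.
threshold = max(ss_tf)
false_rejections, false_acceptances = count_errors(ss_tf, ds_tf, threshold)
```

With the threshold tied to the largest same-speaker distance there are no false rejections by construction, and any different-speaker pair that falls at or below it counts as a false acceptance.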
Fig. 2. ED distribution per type of speaker comparison (SS: same speaker; DS: different speakers; MZ: monozygotic pairs) in the high-quality (HQ) condition.
Fig. 3. ED distribution per type of speaker comparison (SS: same speaker; DS: different speakers; MZ: monozygotic pairs) in the high-quality (HQ) condition: zoom view after removing the three outliers in Fig. 2, i.e. speaker pairs 11–13, 20–22 and 45–47.
Fig. 4. ED distribution per type of speaker comparison (SS: same speaker; DS: different speakers; MZ: monozygotic pairs) in the telephone-filtered (TF) condition.
Fig. 5. ED distribution per type of speaker comparison (SS: same speaker; DS: different speakers; MZ: monozygotic pairs) in the high-quality (HQ) condition with z-score normalization.
Fig. 6. ED distribution per type of speaker comparison (SS: same speaker; DS: different speakers; MZ: monozygotic pairs) in the telephone-filtered (TF) condition with z-score normalization.
Fig. 7. Heat map for all 54 DS comparisons in HQ condition using standardized ED (color bar on the right).
Fig. 8. Heat map for all 54 DS comparisons in HQ condition using z-score normalization.
Fig. 9. Cumulative proportion of Euclidean Distances (log10) for different-speaker (DS) and same-speaker (SS) comparisons. Red lines represent DS pairs while blue lines are used for SS comparisons. Continuous lines depict high-quality (HQ) conditions whereas dotted lines are used for telephone-filtered (TF) conditions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Euclidean Distances (ED) between pairs of speakers: monozygotic (MZ) pairs and different-speaker (DS) pairs. Both acoustic ED and perceptual ED are based on high-quality recordings. Perceptual ED are calculated as Simple Matching Coefficients (SMCs). Higher values in acoustic ED mean greater dissimilarity, while higher values in perceptual ED mean greater similarity.
| Pair type | MZ | MZ | MZ | MZ | MZ | MZ | MZ | MZ | MZ | MZ | MZ | MZ | DS | DS | DS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Speaker_1 | 1 | 3 | 5 | 7 | 9 | 11 | 33 | 35 | 37 | 39 | 41 | 43 | 11 | 20 | 45 |
| Speaker_2 | 2 | 4 | 6 | 8 | 10 | 12 | 34 | 36 | 38 | 40 | 42 | 44 | 13 | 22 | 47 |
| Acoustic ED | 6.11 | 6.86 | 6.54 | 8.18 | 6.27 | 16.19 | 6.04 | 6.16 | 5.95 | 7.26 | 5.44 | 6.43 | 138.6 | 282.2 | 222.9 |
| Perceptual ED | 0.4 | 0.7 | 0.6 | 0.6 | 0.5 | 0.3 | 0.6 | 0.9 | 0.3 | 0.6 | 0.8 | 0.6 | 0.3 | 0.1 | 0 |
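The perceptual-acoustic relationship in this table can be summarized with a rank correlation. The sketch below computes a Spearman coefficient over the 15 pairs listed above; the choice of Spearman (rather than whatever statistic the authors used, which is not stated here) is an assumption, made because the three DS pairs have acoustic ED values far larger than the rest. All function names are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Values copied from the table above (12 MZ pairs followed by 3 DS pairs).
acoustic_ed = [6.11, 6.86, 6.54, 8.18, 6.27, 16.19, 6.04, 6.16, 5.95,
               7.26, 5.44, 6.43, 138.6, 282.2, 222.9]
perceptual_smc = [0.4, 0.7, 0.6, 0.6, 0.5, 0.3, 0.6, 0.9, 0.3,
                  0.6, 0.8, 0.6, 0.3, 0.1, 0.0]

rho = spearman(acoustic_ed, perceptual_smc)
```

Because acoustic ED measures dissimilarity and the perceptual score measures similarity, agreement between the two shows up as a negative coefficient.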