Jeremy J Donai, Saeid Motiian, Gianfranco Doretto.
Abstract
The high-frequency region of vowel signals (above the third formant or F3) has received little research attention. Recent evidence, however, has documented the perceptual utility of high-frequency information in the speech signal above the traditional frequency bandwidth known to contain important cues for speech and speaker recognition. The purpose of this study was to determine if high-pass filtered vowels could be separated by vowel category and speaker type in a supervised learning framework. Mel frequency cepstral coefficients (MFCCs) were extracted from productions of six vowel categories produced by two male, two female, and two child speakers. Results revealed that the filtered vowels were well separated by vowel category and speaker type using MFCCs from the high-frequency spectrum. This demonstrates the presence of useful information for automated classification from the high-frequency region and is the first study to report findings of this nature in a supervised learning framework.
Keywords: Classification; formants; high-frequency; mel frequency cepstral coefficients; vowels
Year: 2016 PMID: 27588160 PMCID: PMC4988094 DOI: 10.4081/audiores.2016.137
Source DB: PubMed Journal: Audiol Res ISSN: 2039-4330
Figure 1. Visual illustration of the processing steps used in the classification experiments.
Vowel category classification accuracy (%) for the first and second experiments. Full set refers to the full hVd signals, while segmented refers to the steady-state portion of the vowel extracted from the full hVd signal.
| Vowels | First experiment: Full set 96 kHz | First experiment: Full set 48 kHz | First experiment: Segmented set 48 kHz | Second experiment, Full set 96 kHz: Child | Second experiment, Full set 96 kHz: Female | Second experiment, Full set 96 kHz: Male | Second experiment, Full set 48 kHz: Child | Second experiment, Full set 48 kHz: Female | Second experiment, Full set 48 kHz: Male | Second experiment, Segmented set 48 kHz: Child | Second experiment, Segmented set 48 kHz: Female | Second experiment, Segmented set 48 kHz: Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| had | 96.67 | 93.33 | 76.67 | 100 | 100 | 100 | 90 | 90 | 100 | 80 | 80 | 100 |
| heed | 96.67 | 90 | 90 | 90 | 90 | 100 | 90 | 100 | 100 | 90 | 100 | 100 |
| herd | 93.33 | 93.33 | 93.33 | 80 | 90 | 100 | 80 | 90 | 100 | 90 | 90 | 100 |
| hawd | 93.33 | 96.66 | 93.33 | 90 | 90 | 100 | 90 | 90 | 100 | 80 | 90 | 100 |
| hayed | 86.67 | 86.67 | 86.67 | 90 | 80 | 100 | 80 | 90 | 90 | 80 | 80 | 100 |
| who’d | 93.33 | 90 | 100 | 90 | 90 | 100 | 80 | 90 | 100 | 80 | 80 | 100 |
| Cumulative | 93.33 | 91.67 | 90 | 90 | 90 | 100 | 85 | 91.66 | 98.33 | 83.33 | 86.66 | 100 |
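The Cumulative row in the table above appears to be the unweighted mean of the six per-vowel accuracies in each column. The following is a minimal check of that reading, assuming an unweighted mean (the source does not state the formula) and using the second-experiment, full-set 48 kHz values copied from the table. The function name `cumulative` is hypothetical.

```python
# Per-vowel accuracies (%) copied from the table above: second experiment,
# full set at 48 kHz, in the order had, heed, herd, hawd, hayed, who'd.
child_48k = [90, 90, 80, 90, 80, 80]
female_48k = [90, 100, 90, 90, 90, 90]
male_48k = [100, 100, 100, 100, 90, 100]


def cumulative(per_vowel):
    """Unweighted mean of per-vowel accuracies (assumed definition)."""
    return sum(per_vowel) / len(per_vowel)


for name, column in [("child", child_48k),
                     ("female", female_48k),
                     ("male", male_48k)]:
    print(f"{name}: {cumulative(column):.2f}")
```

Under this assumption the three columns give 85.00, 91.67, and 98.33. The table reports 85, 91.66, and 98.33, so the small difference in the second value looks like truncation rather than rounding.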
Figure 2. A) Confusion matrices for the first experiment: full set with 96 kHz rate (left), full set with 48 kHz rate (middle), segmented set with 48 kHz rate (right). B-D) Confusion matrices for the second experiment: B) full set with 96 kHz rate; C) full set with 48 kHz rate; D) segmented set with 48 kHz rate. Note: the actual vowel category is on the vertical axis and the classified vowel category is on the horizontal axis.
Speaker type classification accuracy (%) for the third and fourth experiments. Full set refers to the full hVd signals, while segmented refers to the steady-state portion of the vowel extracted from the full hVd signal.
| Experiment | Vowels | Full set 96 kHz: Child | Full set 96 kHz: Female | Full set 96 kHz: Male | Full set 96 kHz: Cumulative | Full set 48 kHz: Child | Full set 48 kHz: Female | Full set 48 kHz: Male | Full set 48 kHz: Cumulative | Segmented set 48 kHz: Child | Segmented set 48 kHz: Female | Segmented set 48 kHz: Male | Segmented set 48 kHz: Cumulative |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Third | Combined | 90 | 93.33 | 91.66 | 91.66 | 83.3 | 91.66 | 90 | 90 | 83.33 | 86.66 | 86.66 | 85.55 |
| Fourth | had | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Fourth | heed | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Fourth | herd | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Fourth | hawd | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Fourth | hayed | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Fourth | who’d | 90 | 100 | 100 | 96.66 | 80 | 100 | 100 | 93.33 | 80 | 100 | 100 | 93.33 |
Figure 3. Confusion matrices for the third experiment: full set with 96 kHz rate (left), full set with 48 kHz rate (middle), segmented set with 48 kHz rate (right). Note: the actual speaker type is on the vertical axis and the classified speaker type is on the horizontal axis.
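To make the link between a confusion matrix and the accuracy figures concrete, here is a minimal sketch of how per-class and overall accuracy can be read from a matrix laid out as in the figure notes (actual class on rows, classified class on columns). The counts in `toy_matrix` are invented for illustration and are not taken from the study; the function names are likewise hypothetical.

```python
def per_class_accuracy(matrix):
    """Percent of each actual class (row) that was classified correctly."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Percent of all items that fall on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Illustrative counts only (NOT data from the study).
# Rows: actual speaker type; columns: classified speaker type.
labels = ["child", "female", "male"]
toy_matrix = [
    [9, 1, 0],   # actual child: 9 correct, 1 classified as female
    [0, 10, 0],  # actual female: all 10 correct
    [0, 1, 9],   # actual male: 9 correct, 1 classified as female
]

for label, accuracy in zip(labels, per_class_accuracy(toy_matrix)):
    print(f"{label}: {accuracy:.2f}")
print(f"overall: {overall_accuracy(toy_matrix):.2f}")
```

Each row is normalized by its own total, so a class with few items still gets its own accuracy; the overall figure pools all items and equals the mean of the per-class values only when every class has the same number of items.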