Yue Chen, Yingming Gao, Yi Xu
Abstract
It has been widely assumed that in speech perception it is imperative to first detect a set of distinctive properties or features and then use them to recognize phonetic units like consonants, vowels, and tones. Those features can be auditory cues, articulatory gestures, or a combination of both. There have been no clear demonstrations of how exactly such a two-phase process would work in the perception of continuous speech, however. Here we used computational modelling to explore whether it is possible to recognize phonetic categories from syllable-sized continuous acoustic signals of connected speech without intermediate featural representations. We used a Support Vector Machine (SVM) and a Self-Organizing Map (SOM) to simulate tone perception in Mandarin, either by directly processing f0 trajectories or by first extracting various tonal features. The results show that direct tone recognition not only yields better performance than any of the feature extraction schemes, but also requires less computational power. These results suggest that prior extraction of features is unlikely to be the operational mechanism of speech perception.
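The direct-recognition idea, feeding whole syllable-sized f0 contours to a classifier with no intermediate featural step, can be sketched as follows. The stylized tone shapes, 30-point time normalization, noise level, and use of scikit-learn's `SVC` are illustrative assumptions, not the paper's actual data or setup.

```python
# Sketch: classify whole f0 contours directly with an SVM,
# with no intermediate feature extraction.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 30)  # 30 time-normalized samples per syllable

# Stylized semitone trajectories for the four Mandarin tones:
# T1 high level, T2 rising, T3 dipping, T4 falling.
shapes = {
    0: np.full_like(t, 4.0),           # T1
    1: -2.0 + 6.0 * t,                 # T2
    2: 2.0 - 24.0 * t * (1.0 - t),     # T3 (dip at mid-syllable)
    3: 4.0 - 8.0 * t,                  # T4
}

def make_data(n_per_tone):
    X, y = [], []
    for tone, shape in shapes.items():
        for _ in range(n_per_tone):
            X.append(shape + rng.normal(0.0, 0.5, size=t.size))
            y.append(tone)
    return np.array(X), np.array(y)

X_train, y_train = make_data(50)
X_test, y_test = make_data(20)

# The whole contour is the input vector; the SVM does the rest.
clf = SVC(kernel="linear").fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```

With clearly separated synthetic shapes the classifier recovers the four categories almost perfectly, which is the scenario the paper's Experiment 1 tests on real f0 data.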
Keywords: Mandarin tones; speech perception; tone features; tone recognition
Year: 2022 PMID: 35326294 PMCID: PMC8946547 DOI: 10.3390/brainsci12030337
Source DB: PubMed Journal: Brain Sci ISSN: 2076-3425
Figure 1. Hypothetical model of brain functions in speech perception and production (adapted from Fant, 1967 [53]).
Figure 2. Mean time-normalized syllable-sized f0 contours of the four Mandarin tones, averaged separately for female and male speakers.
Figure 3. Mean time-normalized syllable-sized semitone contours of the four Mandarin tones, averaged separately for female and male speakers.
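The Hertz versus semitone contrast in Figures 2 and 3 (and in the tables below) rests on a simple logarithmic conversion. A minimal sketch, where the choice of reference frequency (e.g., each speaker's mean f0) is an assumption on my part:

```python
import math

def hz_to_semitone(f0_hz, ref_hz):
    """Convert f0 from Hertz to semitones relative to a reference.

    Using each speaker's mean f0 as ref_hz largely removes
    between-speaker register differences (such as the female/male
    gap), one plausible reason semitone input helps mixed-gender
    models.
    """
    return 12.0 * math.log2(f0_hz / ref_hz)
```

For example, a doubling of f0 is one octave, i.e., 12 semitones: `hz_to_semitone(200.0, 100.0)` returns `12.0`.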
Tone recognition rates using raw f0 contours based on SVM and SOM models.

| | SVM | | SOM | |
|---|---|---|---|---|
| | Hertz | Semitone | Hertz | Semitone |
| Female | 96.0% | 99.1% | 89.7% | 90.8% |
| Male | 76.7% | 96.6% | 76.7% | 77.6% |
| Mixed | 86.3% | 97.4% | 72.8% | 72.0% |

SVM: Support Vector Machine; SOM: Self-Organizing Map.
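For the SOM side of the comparison, a minimal self-organizing map over contour vectors can be sketched as below. The map size, learning rate, neighborhood schedule, and the two synthetic contour clusters are illustrative assumptions, not the paper's configuration.

```python
# Minimal 1-D Self-Organizing Map: each unit holds a prototype
# contour; training pulls the best-matching unit (BMU) and its
# neighbors toward each input.
import numpy as np

def train_som(data, n_units=8, n_epochs=20, lr=0.5, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(n_units, data.shape[1]))
    units = np.arange(n_units)
    for epoch in range(n_epochs):
        decay = 1.0 - epoch / n_epochs  # shrink lr and neighborhood
        for x in rng.permutation(data):
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Gaussian neighborhood around the best-matching unit.
            h = np.exp(-((units - bmu) ** 2)
                       / (2 * (sigma * decay + 1e-3) ** 2))
            weights += (lr * decay) * h[:, None] * (x - weights)
    return weights

def bmu_of(weights, x):
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

# Toy usage: rising vs falling contours map to different units.
t = np.linspace(0.0, 1.0, 30)
rng = np.random.default_rng(1)
rising = np.array([6 * t - 3 + rng.normal(0, 0.3, 30) for _ in range(20)])
falling = np.array([3 - 6 * t + rng.normal(0, 0.3, 30) for _ in range(20)])
som = train_som(np.vstack([rising, falling]))
b_rise = bmu_of(som, rising.mean(axis=0))
b_fall = bmu_of(som, falling.mean(axis=0))
```

After training, distinct contour families activate distinct map units, which is the basis for reading tone categories off the map.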
Tone confusion matrix for mixed-gender semitone input based on SVM.

| | T1 | T2 | T3 | T4 |
|---|---|---|---|---|
| T1 | 98.2% | 0.2% | 0.5% | 1.1% |
| T2 | 0.9% | 96.6% | 0.9% | 1.6% |
| T3 | 0.3% | 1.0% | 98.4% | 0.3% |
| T4 | 2.7% | 0.0% | 0.7% | 96.6% |

T: Tone. Rows are stimulus tones; columns are recognized tones.
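As a sanity check on the matrix above: assuming roughly balanced tone classes (an assumption here, not stated in this excerpt), overall accuracy is the mean of the diagonal per-tone recall values, which lands close to the 97.4% mixed-gender semitone SVM rate reported earlier.

```python
# Per-tone recall values (%) from the diagonal of the confusion matrix.
diag = [98.2, 96.6, 98.4, 96.6]

# With equal class sizes, overall accuracy is the unweighted mean.
overall = sum(diag) / len(diag)  # 97.45
```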
Tone confusion matrix for context-free words in AWGN-corrupted spoken sentences [108].

| | T1 | T2 | T3 | T4 |
|---|---|---|---|---|
| T1 | 95.68% | 2.03% | 2.08% | 0.21% |
| T2 | 1.24% | 97.92% | 0.08% | 0.76% |
| T3 | 0.35% | 0.42% | 99.02% | 0.21% |
| T4 | 2.63% | 1.72% | 1.38% | 94.27% |

T: Tone. Rows are stimulus tones; columns are recognized tones.
Tone recognition rates using two-level abstraction based on SVM and SOM models.

| | Separate | | Together | |
|---|---|---|---|---|
| | Hertz | Semitone | Hertz | Semitone |
| SVM | 93.5% | 93.7% | 91.9% | 92.9% |
| SOM | 80.8% | 80.6% | 81.0% | 82.8% |

SVM: Support Vector Machine; SOM: Self-Organizing Map.
Figure 4. Five-level representation of Mandarin tones.
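A five-level (Chao-style) abstraction can be sketched as quantizing a contour onto integer pitch levels 1-5. The per-contour min/max rescaling below is an illustrative assumption, not necessarily the paper's exact procedure.

```python
def to_five_levels(contour):
    """Quantize a contour (e.g., in semitones) to integer levels 1-5.

    The contour's own range is linearly mapped onto [1, 5] and each
    sample is rounded to the nearest level.
    """
    lo, hi = min(contour), max(contour)
    span = (hi - lo) or 1.0  # avoid division by zero for level tones
    return [round(1 + 4 * (f - lo) / span) for f in contour]
```

For example, a steadily rising contour such as `[-3.0, -1.5, 0.0, 1.5, 3.0]` quantizes to `[1, 2, 3, 4, 5]`, i.e., a low-to-high rise in Chao-style digits.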
Tone recognition rates using five-level abstraction based on SVM and SOM models.

| | Separate | | Together | |
|---|---|---|---|---|
| | Hertz | Semitone | Hertz | Semitone |
| SVM | 88.8% | 90.3% | 80.5% | 84.1% |
| SOM | 56.7% | 58.9% | 43.5% | 43.7% |

SVM: Support Vector Machine; SOM: Self-Organizing Map.
Tone recognition rates using f0 profile features based on the SVM model.

| Hertz | Semitone |
|---|---|
| 89.3% | 92.3% |
| 85.9% | 90.4% |
| 68.3% | 66.7% |
| 71.8% | 75.7% |
| 70.4% | 75.1% |
Tone recognition rates using qTA (quantitative target approximation) features.

| Features | Accuracy |
|---|---|
| qTA | 90.7% |
| qTA + f0 onset | 97.1% |
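In the qTA model, f0 asymptotically approaches a linear pitch target x(t) = mt + b via a third-order critically damped system with rate λ; the three target parameters (m, b, λ) correspond to the three qTA features in the complexity table below. A minimal sketch of the standard qTA response, with coefficients set from the initial f0, velocity, and acceleration (the fitting procedure that extracts the parameters from data is omitted here):

```python
import math

def qta_f0(t, m, b, lam, f0_0, v0=0.0, a0=0.0):
    """f0 at time t under qTA: linear target x(t) = m*t + b,
    approached by a third-order critically damped system with
    rate lam, starting from f0_0 with velocity v0 and
    acceleration a0."""
    c1 = f0_0 - b
    c2 = v0 + c1 * lam - m
    c3 = (a0 + 2 * c2 * lam - c1 * lam ** 2) / 2.0
    return (m * t + b) + (c1 + c2 * t + c3 * t ** 2) * math.exp(-lam * t)
```

At t = 0 the output equals the initial f0 (transfer across syllable boundaries), and as t grows the exponential term vanishes so f0 converges to the target line mt + b.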
Figure 5. Summary of tone recognition rates based on the SVM model for the full f0 contour (Experiment 1), two-level features (Experiment 2), five-level features (Experiment 3), f0 profile features (Experiment 4), and qTA and qTA + f0 onset features (Experiment 5). SVM: Support Vector Machine.
Time complexity of different tone recognition schemes.

| Scheme | Method | Size of Input | No. Steps | Time Complexity at Testing Phase |
|---|---|---|---|---|
| Full f0 contour | SVM | 30 points | 1 | O(30 × 4) |
| Two-level | SVM × 2; matching | 15 points; 2 features | 3 | O(15 × 2) × 2 + 1 |
| Five-level | SVM × 2; matching | 15 points; 2 features | 3 | O(15 × 5) × 2 |
| f0 profile | Parabola/broken line; SVM | 30 points; 2 features | 2 | O(30³) + O(2 × 4) |
| qTA | qTA extraction; SVM | 30 points; 3 features | 2 | O(30³) + O(3 × 4) |

SVM: Support Vector Machine. qTA: quantitative target approximation.