Jialu Li1,2, Mark Hasegawa-Johnson1,2, Nancy L McElwain1,3.
Abstract
Classification of infant and parent vocalizations, particularly emotional vocalizations, is critical to understanding how infants learn to regulate emotions in social dyadic processes. This work is an experimental study of classifiers, features, and data augmentation strategies applied to the task of classifying infant and parent vocalization types. Our data were recorded both in the home and in the laboratory. Infant vocalizations were manually labeled as cry, fus (fuss), lau (laugh), bab (babble) or scr (screech), while parent (mostly mother) vocalizations were labeled as ids (infant-directed speech), ads (adult-directed speech), pla (playful), rhy (rhythmic speech or singing), lau (laugh) or whi (whisper). Linear discriminant analysis (LDA) was selected as a baseline classifier because it gave the highest accuracy in a previously published study covering part of this corpus. LDA was compared to two neural network architectures: a two-layer fully-connected network (FCN) and a convolutional neural network with self-attention (CNSA). Baseline features extracted using the OpenSMILE toolkit were augmented by extra voice quality, phonetic, and prosodic features, each targeting perceptual features of one or more of the vocalization types. Three web data augmentation and transfer learning methods were tested: pre-training of network weights for a related task (adult emotion classification), augmentation of under-represented classes using data uniformly sampled from other corpora, and augmentation of under-represented classes using data selected by a minimum cross-corpus information difference criterion. Feature selection using Fisher scores and training with weighted and unweighted samplers were also tested. Two datasets were evaluated: a benchmark dataset (CRIED) and our own corpus. On the CRIED dataset, the CNSA achieved the best unweighted-average recall (UAR) compared with previous studies. In terms of classification accuracy, weighted F1, and macro F1 on our own dataset, both neural networks significantly outperformed LDA; the FCN slightly (but not significantly) outperformed the CNSA. Cross-examining features selected by different feature selection algorithms permits a type of post-hoc feature analysis, in which the most important acoustic features for each binary type discrimination are listed. For the overlapped features, examples of each vocalization type were selected; their spectrograms are presented and discussed with respect to the type-discriminative acoustic features selected by the various algorithms. MFCC, log Mel Frequency Band Energy, LSP frequency, and F1 are found to be the most important spectral envelope features; F0 is found to be the most important prosodic feature.
Keywords: Convolutional neural networks; Emotion classifier; Feature selection; Global feature; Infant vocalizations; Infant-directed speech; Self-attention
Year: 2021 PMID: 36062214 PMCID: PMC9435967 DOI: 10.1016/j.specom.2021.07.010
Source DB: PubMed Journal: Speech Commun ISSN: 0167-6393 Impact factor: 2.723
Data sources, infant ages, models, features, target classes and best accuracy recorded in published infant vocalization experiments.
| Studies | Data sources | Infant ages | Models | Features | Target classes | Best accuracy |
|---|---|---|---|---|---|---|
| | 230 crying episodes recorded in hospital | 1–2 m | FF, FT, RNN TDNN, CC | Mel-cepstrum Mel filterbank | | 79.4% |
| | 20-min sessions of infant/caregiver joint play | 3, 6, 9 m | SOM + single-layer perceptron | Spectrogram | | 46.6% |
| | In-home recordings of one infant | 1.5 m | ASR | ASR+F0 PCA of FFT | | 67.4% |
| | BabySounds: 12445 infant vocalizations, in-home recordings | 2–36 m | SVM | ComParE + BoAW + auto-encoder based features | BabySounds: | 58.7% |
| | BabySounds | 2–36 m | SVM | ComParE + Fisher vector | BabySounds | 59.5% |
| | BabySounds | 2–36 m | SVM | ComParE + BoAW + auto-encoder based | BabySounds | 62.39% |
| | Vocalizations from online sound libraries | <9 m | CNN | Log mel-scaled spectrograms | | 72.0% |
| | Infant/parent vocalizations, in-home recordings | 3, 6, 9, 18 m | CNN | Audio waveform | | 97.01% |
| | CRIED: 5587 infant vocalizations | 1–4 m | CNN | Spectrogram | | 86.1% |
| | Baby Chillanto (2267 vocalizations)/Baby2020 (5540 vocalizations) | 0–9 m/0–3 m | CNN+GCN | Spectrogram | Chillanto: | 94.39%/74.37% |
| | Dunstan Baby Language: 315 crying sounds | 0–3 m | CNN-RNN | Spectrogram | Dunstan: | 86.03% |
| | Dunstan Baby Language/Baby Chillanto | 0–3 m/0–9 m | Multistage CNN | Hybrid feature | Dunstan/Chillanto | 88.22%/95.10% |
Note: except as noted, each study uses a different dataset, therefore the percentage accuracies listed in the last column are generally not comparable across studies. Though not comparable, accuracies are reported here to give the reader a sense for the state of the art.
Number of families participating in the TDP and IDP recordings, grouped by infant’s age.
| Age | TDP LENA | IDP LENA | IDP LAB | Total |
|---|---|---|---|---|
| 3 months | – | 7 | 79 | 86 |
| 6 months | – | 6 | 70 | 76 |
| 9 months | – | 5 | 53 | 58 |
| 12 months | – | 6 | 0 | 6 |
| 13 to 24 months | 10 | – | – | 10 |
Descriptions of the adult facial expression codes used in this study.
| Class | Descriptions |
|---|---|
| | Neutral flat: displaying a very plain or uninterested facial expression, can be regardless of the vocalization or body movement. Could also be bored, tired, disengaged, uninterested, (almost like a depressed facial expression, uninterested, tired, do not want to be there). |
| | Neutral interested: looks attentively at infant. Bright eyes. There is a hint of positivity in some cases but with no visible smile. During free play, this is usually the “default” if nothing else is obvious. |
| | Mild/low positive: simple smile; bright; animated, on face. |
| | Strong/high positive: broad full smile (might be accompanied by laughter); exaggerated play face; mock face. |
Number of infant vocalizations of each class in the training and testing set for each fold of validation test.
| Fold | Dataset split | cry | fus | lau | bab | scr |
|---|---|---|---|---|---|---|
| 1 | Training | 1145 | 4642 | 899 | 16 954 | 145 |
| | Testing | 621 | 2304 | 417 | 8493 | 58 |
| 2 | Training | 1192 | 4612 | 859 | 16 991 | 131 |
| | Testing | 574 | 2334 | 457 | 8456 | 72 |
| 3 | Training | 1195 | 4638 | 874 | 16 949 | 130 |
| | Testing | 571 | 2308 | 442 | 8498 | 73 |
Descriptions of infant vocalization codes.
| Class | Descriptions |
|---|---|
| cry | Crying/screaming; crying is higher intensity than fussing and often includes gasps for air in between cries |
| fus | Fussing/whining; lower intensity than crying; fussing sounds will often be less broken up and more extended than crying |
| lau | Laughter, chuckles, giggling |
| bab | Babbling/cooing; babbling is non-intelligible speech, such as babababa, dada, oaahh, rrrr, that typically involves a combination of consonant and vowel sounds. Cooing involves vowel sounds only. |
| scr | High-pitched screeching |
Number of adult female vocalizations of each class in the training and testing set for each fold of validation test.
| Fold | Dataset split | ids | ads | pla | rhy | lau | whi |
|---|---|---|---|---|---|---|---|
| 1 | Training | 4078 | 201 | 2253 | 533 | 1087 | 506 |
| | Testing | 2015 | 107 | 1148 | 262 | 541 | 256 |
| 2 | Training | 3993 | 221 | 2255 | 531 | 1134 | 524 |
| | Testing | 2100 | 87 | 1146 | 264 | 494 | 238 |
| 3 | Training | 4115 | 194 | 2294 | 526 | 1035 | 494 |
| | Testing | 1978 | 114 | 1107 | 269 | 593 | 268 |
Descriptions of the mother vocalizations.
| Class | Descriptions |
|---|---|
| ids | Infant-directed speech (or motherese), with much change of modulation and bigger jumps in pitch. Elongated vowels. Could be delivered with a “cooing” pattern of intonation different from that of normal adult speech, with many glissando variations that are more pronounced than those of normal speech. |
| ads | Adult-directed speech: speaking normally as if mother is talking to an adult (matter of fact) with “regular” rhythm and intonation. |
| pla | Non-speech playful noises, including “raspberries,” kissing sounds, and animal sounds that occur alone (i.e., not as a part of a sentence); includes using a word to represent a playful noise (e.g., “bounce-bounce-bounce”, “shake-shake-shake”). |
| rhy | Rhythmic sounds (e.g., singing, sing-songy or cartoon-like voice). |
| lau | Showing positive emotion with laughter, chuckle or explosive vocal sound (e.g., high-pitch positive vocalization sounds that do not contain speech). |
| whi | Whispering; speech is low and gaspy, including “shhhh.” |
Functionals used in all experiments.
| Name | Description | D | 9 | C |
|---|---|---|---|---|
| max | The maximum value of the contour | * | * | |
| min | The minimum value of the contour | * | * | |
| maxPos | The absolute temporal offset of the maximum value | * | * | |
| minPos | The absolute temporal offset of the minimum value | * | * | |
| range | max-min | * | * | |
| mean | Arithmetic mean | * | * | * |
| stddev | Standard deviation | * | * | |
| skewness | 3rd order central moment | * | * | |
| kurtosis | 4th order central moment | * | * | |
| quartile1 | The first quartile | * | * | |
| quartile2 | The second quartile | * | * | |
| quartile3 | The third quartile | * | * | |
| iqr1–2 | The inter-quartile range: quartile2 − quartile1 | * | * | |
| iqr2–3 | The inter-quartile range: quartile3 − quartile2 | * | * | |
| iqr1–3 | The inter-quartile range: quartile3 − quartile1 | * | * | |
| percentile1.0 | Outlier-robust minimum value (1st percentile) | * | * | |
| percentile99.0 | Outlier-robust maximum value (99th percentile) | * | * | |
| pctlrange0–1 | percentile99.0 − percentile1.0 | * | * | |
| upleveltime75 | The percentage of time the signal is above (0.75 × range + min) | * | * | |
| upleveltime90 | The percentage of time the signal is above (0.90 × range + min) | * | * | |
| linregc1 | The slope of a linear approximation of the contour | * | * | * |
| linregc2 | The offset of a linear approximation of the contour | * | * | * |
| linregerrA | The linear error (difference between the actual feature contour, as a function of time, and its linear approximation) | * | * | |
| linregerrQ | The quadratic error (difference between the actual feature contour, as a function of time, and its quadratic polynomial approximation) | * | * | * |
Column 1: name, column 2: description. Columns 3–5 contain * if the functional was computed for, respectively, the default feature set (emobase2010.conf)(D), the IS09_emotion.conf portion of the complementary set (9), or the rest of the complementary feature set (C).
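To make the idea of a functional concrete, the sketch below summarizes one frame-level contour (for example, F0 per frame) with a handful of the statistics listed in the table. It is a minimal illustration, not the OpenSMILE implementation: the helper names `quantile` and `functionals`, the linear-interpolation quantile rule, the use of frame indices as the time axis, and the population standard deviation are all assumptions made here.

```python
import math
import statistics


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def functionals(contour):
    """Map a contour of at least two frames to a dict of scalar functionals."""
    n = len(contour)
    s = sorted(contour)
    lo, hi = s[0], s[-1]
    rng = hi - lo
    q1, q2, q3 = (quantile(s, q) for q in (0.25, 0.50, 0.75))

    # Least-squares line y = c1 * t + c2 over frame indices t = 0..n-1.
    t_mean = (n - 1) / 2
    y_mean = statistics.fmean(contour)
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(contour))
    c1 = sxy / sxx
    c2 = y_mean - c1 * t_mean

    return {
        "max": hi,
        "min": lo,
        "maxPos": contour.index(hi),  # frame index of the first maximum
        "minPos": contour.index(lo),  # frame index of the first minimum
        "range": rng,
        "mean": y_mean,
        "stddev": statistics.pstdev(contour),
        "quartile1": q1,
        "quartile2": q2,
        "quartile3": q3,
        "iqr1-2": q2 - q1,
        "iqr2-3": q3 - q2,
        "iqr1-3": q3 - q1,
        # Percentage of frames strictly above 0.75 * range + min.
        "upleveltime75": 100.0 * sum(y > 0.75 * rng + lo for y in contour) / n,
        "linregc1": c1,
        "linregc2": c2,
    }
```

Applying every functional to every low-level contour is what turns a variable-length segment into one fixed-length feature vector.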
Number of infant vocalizations for each class from external corpora selected in three-fold cross validation tests of different models for default and default+complementary features.
| Model | Features | Fold | | | | |
|---|---|---|---|---|---|---|
| FCN | Default | 1 | 34 | 23 | 139 | 905 |
| | | 2 | 40 | 21 | 90 | 950 |
| | | 3 | 41 | 15 | 224 | 846 |
| | Default+complementary | 1 | 23 | 22 | 149 | 1359 |
| | | 2 | 27 | 32 | 97 | 1394 |
| | | 3 | 17 | 26 | 180 | 1374 |
| CNSA | Default | 1 | 32 | 4 | 56 | 2441 |
| | | 2 | 10 | 4 | 56 | 2499 |
| | | 3 | 4 | 6 | 119 | 3309 |
| | Default+complementary | 1 | 11 | 5 | 33 | 3418 |
| | | 2 | 9 | 25 | 33 | 3006 |
| | | 3 | 6 | 19 | 57 | 2735 |
FCN = Fully-connected network, CNSA = convolutional neural net with self-attention.
Number of adult female vocalizations of each emotion class (neutral vs. happy) in the training and testing set for each fold of validation test.
| Fold | Dataset split | Neutral | Happy |
|---|---|---|---|
| 1 | Training | 4188 | 3714 |
| | Testing | 1881 | 2382 |
| 2 | Training | 3772 | 4753 |
| | Testing | 1823 | 2440 |
| 3 | Training | 3704 | 4822 |
| | Testing | 1891 | 2371 |
Fig. 1. Overview of the CNSA.
CNSA Hyper-parameters.
| Name | Settings |
|---|---|
| Input feature dimension | 40 × T |
| CNN kernel sizes | 40 × 8, 40 × 16, 40 × 32, and 40 × 64 |
| Number of filters | 384 |
| Max pooling kernel size | 7 |
| Max pooling stride | 7 |
| Attention hidden dimension | 1024 |
| Attention hops | 20 |
| Dense layer hidden dimension | 1024 |
| Dropout | 0.2 |
T is the number of rows of the feature matrix, thus for default features, T = 40; for default+complementary, T = 47; for raw Mel spectrogram, T = the number of 10 ms audio frames in the segment.
Vocalization type classification scores for infant and mother vocalizations trained on the default feature set using the CNN and CNSA (CNN with self-attention).
| Model | Settings | Accuracy | Weighted F1 | Macro F1 |
|---|---|---|---|---|
| CNN | Infant | 80.97% ± 0.07% | 79.24% ± 0.39% | 45.85% ± 1.69% |
| | Mother | 67.08% ± 0.56% | 64.78% ± 1.33% | 47.68% ± 3.86% |
| CNSA | Infant | | | |
| | Mother | | | |
The table reports the mean and standard deviation of accuracy, weighted F1 score, and macro F1 score over three-fold cross-validation.
Optimization hyperparameters for FCN and CNSA training.
| Name | Settings |
|---|---|
| Number of epochs | 60 |
| Optimizer | RMSprop |
| Learning rate | 10⁻⁴ |
| Batch size | 128 |
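
As a rough illustration of what these settings control, the sketch below collects them as constants and applies a single RMSprop update in plain Python. The decay factor and epsilon are not given in the table; the values used here are common defaults and are assumptions, as is the helper name `rmsprop_step`.

```python
import math

# Values taken from the hyperparameter table.
NUM_EPOCHS = 60
LEARNING_RATE = 1e-4
BATCH_SIZE = 128

# Assumed values (not stated in the table): commonly used RMSprop defaults.
DECAY = 0.99
EPS = 1e-8


def rmsprop_step(weights, grads, sq_avg, lr=LEARNING_RATE, decay=DECAY, eps=EPS):
    """Apply one RMSprop update and return (new_weights, new_sq_avg)."""
    new_weights, new_sq_avg = [], []
    for w, g, v in zip(weights, grads, sq_avg):
        # Running average of squared gradients for this parameter.
        v = decay * v + (1.0 - decay) * g * g
        new_sq_avg.append(v)
        # Scale the step by the root of the running average.
        new_weights.append(w - lr * g / (math.sqrt(v) + eps))
    return new_weights, new_sq_avg
```

Because each step is divided by the running root-mean-square gradient, the effective step size per parameter stays on the order of the learning rate regardless of the raw gradient scale.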
Vocalization type classification scores for infant vocalizations of CRIED dataset: accuracy, macro F1 score, and UAR for LOSO training/testing.
| Authors | Model | Features | Settings | Accuracy | Macro F1 | UAR |
|---|---|---|---|---|---|---|
| Ours | LDA | Default features | U | 80.61% | 63.02% | 64.47% |
| | | Default+comp | U | 80.17% | 62.29% | 64.27% |
| Ours | FCN | Default features | U | 85.38% | 68.16% | 68.29% |
| | | Default features | W | 81.22% | 65.72% | 72.44% |
| | | Default features | U+Fisher | 85.70% | 68.28% | 69.82% |
| | | Default features | W+Fisher | 80.83% | 66.22% | 74.17% |
| | | Default+comp | U | 85.49% | 67.56% | 66.35% |
| | | Default+comp | W+Fisher | 81.78% | 67.91% | 76.15% |
| | | | W+Fisher | 80.94% | 66.63% | 75.84% |
| Ours | CNSA | Default features | U | 86.25% | 67.09% | |
| | | Default features | W | 79.99% | 67.49% | 75.90% |
| | | Default features | U+Fisher | 73.52% | 60.54% | 74.35% |
| | | Default features | W+Fisher | 78.25% | 65.28% | 75.04% |
| | | Default+comp | U | 68.63% | 64.92% | |
| | | Default+comp | W+Fisher | 82.47% | 66.90% | 74.48% |
| | | | W+Fisher | 77.19% | 67.79% | |
| | END2YOU | CNN-based | CNN+LSTM | 70.8% | – | – |
| | Baseline | | SVM | 82.6% | – | 75.6% |
| | | | BoAW+SVM | 84.2% | – | 76.9% |
| | | AE-based | RNN+SVM | 83.5% | – | 74.4% |
| | CapsNet | Spectrogram | CNN | 86.1% | – | 71.6% |
| | Combined NN | | LSTM+dense | – | – | 68.72% |
“U/W” indicates training with an unweighted/weighted sampler, respectively. “Fisher” means that the top 1000 features were selected based on Fisher scores. The best result for each metric is bolded.
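Accuracy and UAR can diverge sharply on imbalanced data, which is why both are reported. The sketch below computes both from label lists; UAR is the plain mean of per-class recalls, so every class counts equally regardless of its size. The function name and the toy labels are illustrative assumptions, not taken from the CRIED evaluation.

```python
from collections import Counter


def accuracy_and_uar(y_true, y_pred):
    """Return (accuracy, UAR, per-class recall) for two equal-length label lists."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    accuracy = sum(correct.values()) / len(y_true)
    # Recall of class c = correctly predicted c / all true c.
    recalls = {c: correct[c] / n for c, n in total.items()}
    uar = sum(recalls.values()) / len(recalls)
    return accuracy, uar, recalls


# Toy example: a classifier that always predicts the majority class.
y_true = ["bab"] * 8 + ["cry"] * 2
y_pred = ["bab"] * 10
acc, uar, per_class = accuracy_and_uar(y_true, y_pred)
```

In this toy case accuracy is 0.8 while UAR is only 0.5, because the minority class is never recognized.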
Vocalization type classification scores for infant vocalizations in our TDP and IDP datasets.
| Model | Features | Settings | Accuracy | Weighted F1 | Macro F1 |
|---|---|---|---|---|---|
| LDA | Default | U | 78.84% ± 0.19% | 78.20% ± 0.15% | 53.70% ± 0.31% |
| | Default+comp | U | 79.55% ± 0.29% | 78.61% ± 0.31% | 54.80% ± 0.60% |
| FCN | Default | U | 81.17% ± 0.02% | 80.48% ± 0.004% | 55.69% ± 0.32% |
| | Default | U+augmented | 81.11% ± 0.12% | 80.36% ± 0.11% | 56.71% ± 0.52% |
| | Default+comp | U | 81.48% ± 0.11% | 80.70% ± 0.19% | 56.54% ± 0.53% |
| | Default+comp | U+Fisher | 81.52% ± 0.17% | 80.91% ± 0.05% | |
| | Default+comp | W+Fisher | 76.71% ± 1.39% | 77.95% ± 1.02% | 56.69% ± 0.28% |
| | Default+comp | U+augmented | 81.38% ± 0.25% | 56.56% ± 0.42% | |
| CNSA | Default | U | 81.07% ± 0.13% | 79.69% ± 0.28% | 52.16% ± 0.78% |
| | Default | U+augmented | 80.96% ± 0.04% | 79.23% ± 0.13% | 50.51% ± 1.43% |
| | Default+comp | U | 81.43% ± 0.13% | 80.22% ± 0.43% | 55.23% ± 0.74% |
| | Default+comp | U+Fisher | 80.74% ± 0.51% | 80.41% ± 0.38% | 56.87% ± 1.29% |
| | Default+comp | W+Fisher | 81.06% ± 0.27% | 80.40% ± 0.25% | 56.63% ± 0.85% |
| | Default+comp | U+augmented | 80.19% ± 0.31% | 55.09% ± 1.21% | |
| | Pretrained | U | 81.41% ± 0.11% | 80.00% ± 0.40% | 54.95% ± 1.38% |
The table reports the mean and standard deviation of accuracy, weighted F1 score, and macro F1 score over three-fold cross-validation. “U/W” indicates training with an unweighted/weighted sampler, respectively. “Fisher” means that the top 1000 features were selected based on Fisher scores. The best result for each metric is bolded.
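The difference between the two samplers can be sketched as follows. The exact weighting scheme is not described in this record, so the example assumes the common choice of weighting each sample by the inverse of its class frequency, which makes every class equally likely to be drawn. The function names and toy labels are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """One weight per sample: 1 / (number of samples sharing its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def weighted_batch(labels, batch_size, rng):
    """Draw sample indices with replacement so classes are balanced in expectation."""
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)


def unweighted_batch(labels, batch_size, rng):
    """Draw sample indices uniformly, preserving the natural class imbalance."""
    return rng.choices(range(len(labels)), k=batch_size)


# Toy data: a 9:1 imbalance between a frequent and a rare class.
labels = ["bab"] * 900 + ["scr"] * 100
rng = random.Random(0)
batch = weighted_batch(labels, 10000, rng)
rare_fraction = sum(labels[i] == "scr" for i in batch) / len(batch)
```

With the weighted sampler the rare class makes up roughly half of the drawn samples instead of a tenth, which tends to trade overall accuracy for recall on rare classes.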
Statistical significance test results (mean and standard deviation over the three validation sets) for infant vocalization classifiers trained with the unweighted sampler under various settings are shown below.
| Model | Feature set | p-value |
|---|---|---|
| FCN | Default vs. default+comp | 1.17 × 10⁻¹ ± 0.14 |
| LDA vs. FCN | Default+comp | 1.14 × 10⁻¹³ ± 8.07 × 10⁻¹⁴ |
| LDA vs. CNSA | Default+comp | 8.59 × 10⁻¹⁰ ± 1.09 × 10⁻⁹ |
| CNSA vs. FCN | Default+comp | 5.31 × 10⁻¹ ± 6.93 × 10⁻¹ |
Fig. 2. Number of top selected features vs. average accuracy and macro F1 score for three-fold cross validation tests for FCN (left column) and LDA (right column). Four feature selection methods (Fisher, Extratree, MRMR, and Chi square) are included. Dashed lines indicate accuracy and F1 of a 46-feature overlap set, composed of features selected by all four algorithms.
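Of the four selection methods, the Fisher score is the simplest to write down. The sketch below uses one common definition, the ratio of between-class scatter to within-class scatter computed per feature; whether the study used exactly this variant is an assumption, and the function names are illustrative.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """X: list of feature vectors, y: list of labels. Returns one score per feature."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    scores = []
    for j in range(len(X[0])):
        overall_mean = fmean([row[j] for row in X])
        between = 0.0
        within = 0.0
        for rows in by_class.values():
            col = [row[j] for row in rows]
            mu = fmean(col)
            between += len(col) * (mu - overall_mean) ** 2  # class mean vs. global mean
            within += len(col) * pvariance(col, mu)          # spread inside the class
        if within > 0:
            scores.append(between / within)
        else:
            # Degenerate case: no within-class spread at all.
            scores.append(0.0 if between == 0 else float("inf"))
    return scores


def top_k_features(X, y, k):
    """Indices of the k features with the highest Fisher score, best first."""
    scores = fisher_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
```

A feature scores highly when class means are far apart relative to the spread inside each class, so keeping the top-ranked features favors dimensions that separate vocalization types on their own.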
Overlap feature set. Features shown in each square were among the top 30 features for that square, according to at least three of the four feature selection algorithms.
|
|
|
|
|
| ||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Feature | Functional | Feature | Functional | Feature | Functional | Feature | Functional | Feature | Functional | |
|
| – | – | logMelFreqBand[4] | quartile1 | logMelFreqBand[4] | quartile1 | ||||
| logMelFreqBand[5] | quartile1 | logMelFreqBand[5] | quartile1 | |||||||
| logMelFreqBand[5] | quartile2 | logMelFreqBand[5] | quartile2 | |||||||
| signal energy | – | |||||||||
| F0finEnv | Mean | |||||||||
|
| – | – | F0final | percentile99.0 | ||||||
| F0final | linregerrQ | |||||||||
| F0final | linregerrA | |||||||||
| F0final | stddev | |||||||||
| F0finEnv | percentile99.0 | |||||||||
| mfcc[11] | percentile99.0 | |||||||||
| mfcc[12] | percentile99.0 | |||||||||
| pitch | percentile99.0 | |||||||||
| pitch | max | |||||||||
| pitch | range | |||||||||
| pitch | pctlrange0–1 | |||||||||
| pitch | quartile3 | |||||||||
|
| logMelFreqBand[0] | linregerrA | logMelFreqBand[0] | iqr2–3 | – | – | F0final | linregerrA | ||
| logMelFreqBand[1] | iqr2–3 | logMelFreqBand[0] | linregerrA | F0final | iqr2–3 | |||||
| logMelFreqBand[2] | iqr1–3 | logMelFreqBand[1] | iqr1–3 | F0final | iqr1–3 | |||||
| logMelFreqBand[1] | iqr2–3 | F0final | linregerrQ | |||||||
| logMelFreqBand[1] | linregerrA | F0final | quartile3 | |||||||
| logMelFreqBand[2] | iqr2–3 | F0final | iqr1–2 | |||||||
| F0final | stddev | |||||||||
| logMelFreqBand[0] | iqr2–3 | |||||||||
| logMelFreqBand[1] | iqr1–3 | |||||||||
| logMelFreqBand[1] | iqr2–3 | |||||||||
|
| pitch | percentile99.0 | F0final | percentile99.0 | F0final | linregerrQ | – | – | F0final(d) | linregerrQ |
| mfcc[10] | percentile99.0 | F0final | linregerrQ | F0final | linregerrA | |||||
| F0final | stddev | F0final(d) | linregerrQ | |||||||
| F0final | linregerrA | |||||||||
| F0final(d) | linregerrQ | |||||||||
| F0final(d) | percentile99.0 | |||||||||
| F0finEnv | percentile99.0 | |||||||||
| pitch | percentile99.0 | |||||||||
| pitch | max | |||||||||
| pitch | range | |||||||||
| pitch | quartile3 | |||||||||
| mfcc[10] | percentile99.0 | |||||||||
| mfcc[11] | percentile99.0 | |||||||||
| mfcc[12] | pctlrange0–1 | |||||||||
|
| lspFreq[0] | quartile2 | lspFreq[0] | quartile2 | lspFreq[0] | quartile2 | lspFreq[0] | quartile2 | – | – |
| lspFreq[0] | quartile3 | lspFreq[0] | quartile3 | lspFreq[0] | quartile3 | lspFreq[0] | quartile3 | |||
| lspFreq[0] | mean | lspFreq[0] | mean | lspFreq[0] | mean | lspFreq[0] | mean | |||
| lspFreq[0] | percentile99.0 | lspFreq[0] | percentile99.0 | lspFreq[0] | percentile99.0 | lspFreq[0] | percentile99.0 | |||
| lspFreq[1] | quartile2 | lspFreq[0] | quartile1 | lspFreq[0] | quartile1 | lspFreq[0] | quartile1 | |||
| lspFreq[1] | quartile3 | lspFreq[0](d) | linregerrQ | lspFreq[1] | quartile1 | lspFreq[1] | quartile1 | |||
| lspFreq[1] | mean | lspFreq[1] | quartile1 | lspFreq[1] | quartile2 | lspFreq[1] | quartile2 | |||
| F1 | quartile1 | lspFreq[1] | quartile2 | lspFreq[1] | quartile3 | lspFreq[1] | quartile3 | |||
| F1 | quartile2 | lspFreq[1] | quartile3 | lspFreq[1] | mean | lspFreq[1] | mean | |||
| mfcc[1] | quartile2 | lspFreq[1] | mean | F1 | quartile1 | F1 | quartile1 | |||
| F1 | quartile1 | F1 | quartile2 | F1 | quartile2 | |||||
| F1 | quartile2 | mfcc[1] | quartile1 | |||||||
A square represents the overlap between the top 30 features selected for a one-vs-one classification problem (row-vs-column) and those selected for the row’s corresponding one-vs-all classification problem. For example, the first row shows the overlapped features selected by cry-vs-all and each of cry-vs-fus, cry-vs-lau, cry-vs-bab, and cry-vs-scr. An empty cell indicates that no overlapped features were found.
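The overlap rule can be sketched with plain set operations. The example assumes that a feature first has to be among the top-ranked features for at least three of the four algorithms in each problem, and that the square then holds the intersection of the one-vs-one and one-vs-all results; the exact order in which the study combined these conditions is an assumption, and the function and feature names are hypothetical.

```python
from collections import Counter


def voted_features(rankings, top_n=30, min_votes=3):
    """rankings: dict mapping algorithm name -> feature names ordered best first."""
    votes = Counter()
    for ranked in rankings.values():
        votes.update(set(ranked[:top_n]))  # one vote per algorithm per feature
    return {feat for feat, n in votes.items() if n >= min_votes}


def square_overlap(one_vs_one, one_vs_all, top_n=30, min_votes=3):
    """Features agreed on in both the one-vs-one and the one-vs-all problem."""
    return (voted_features(one_vs_one, top_n, min_votes)
            & voted_features(one_vs_all, top_n, min_votes))


# Toy rankings with three features and a top-2 cut-off.
ovo = {
    "fisher": ["F0_stddev", "mfcc1_mean", "energy"],
    "extratree": ["F0_stddev", "energy", "mfcc1_mean"],
    "mrmr": ["mfcc1_mean", "F0_stddev", "energy"],
    "chi2": ["energy", "F0_stddev", "mfcc1_mean"],
}
ova = {name: ["F0_stddev", "energy", "mfcc1_mean"] for name in ovo}
shared = square_overlap(ovo, ova, top_n=2, min_votes=3)
```

Requiring agreement across algorithms filters out features that rank highly only because of one method's bias.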
Fig. 3. Clusters computed by selecting top 5 features from each one-vs-one classifier, merging the resulting feature vector, then projecting using LDA followed by t-SNE.
Vocalization type classification results for mother vocalizations.
| Model | Setting | Accuracy | Weighted F1 | Macro F1 |
|---|---|---|---|---|
| LDA | Default | 62.58% ± 1.08% | 62.20% ± 0.13% | 50.21% ± 0.74% |
| FCN | Default | 54.29% ± 0.58% | | |
| | Augmented data | 66.19% ± 0.42% | 66.20% ± 0.48% | |
| CNSA | Default | 67.12% ± 0.53% | 65.93% ± 0.68% | 51.34% ± 0.70% |
| | Augmented data | 66.11% ± 0.55% | 65.11% ± 0.56% | 51.87% ± 0.32% |
| | Pretrained embeddings | 67.06% ± 0.05% | 65.71% ± 0.07% | 51.04% ± 1.46% |
The table reports the mean and standard deviation of accuracy, weighted F1 score, and macro F1 score over three-fold cross-validation. The best result for each metric is bolded.
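The gap between weighted and macro F1 in these results reflects class imbalance. The sketch below computes both averages from label lists: macro F1 averages per-class F1 equally, while weighted F1 weights each class by its number of true samples. The function name and the toy labels are illustrative assumptions; classes are taken from the true labels only.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro F1, weighted F1, per-class F1) for two equal-length label lists."""
    classes = sorted(set(y_true))
    support = Counter(y_true)
    per_class = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return macro, weighted, per_class


# Toy example: the frequent class is recognized well, the rare class poorly.
y_true = ["ids"] * 8 + ["whi"] * 2
y_pred = ["ids"] * 9 + ["whi"] * 1
macro, weighted, per_class = f1_scores(y_true, y_pred)
```

Here the rare class pulls macro F1 well below weighted F1, the same pattern seen when a classifier handles frequent vocalization types better than rare ones.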
Significance test results for the mean and standard deviation of three validation sets, comparing accuracy of three classifiers of mother vocalizations.
| Model | Feature set | p-value |
|---|---|---|
| LDA vs. FCN | Default | 7.40 × 10⁻¹⁵ ± 5.39 × 10⁻¹⁵ |
| LDA vs. CNSA | Default | 8.08 × 10⁻⁹ ± 6.87 × 10⁻⁹ |
| CNSA vs. FCN | Default | 1.04 × 10⁻¹ ± 3.49 × 10⁻² |
Emotion classification results for infant-directed speech data.
| Model | Setting | Accuracy | Weighted F1 | Macro F1 |
|---|---|---|---|---|
| LDA | Default | 60.93% ± 0.23% | 60.82% ± 0.25% | 60.10% ± 0.22% |
| FCN | Default | 64.68% ± 0.06% | | |
| CNSA | Default | 64.16% ± 0.40% | 61.34% ± 1.43% | 59.64% ± 2.06% |
The table reports the mean and standard deviation of accuracy, weighted F1 score, and macro F1 score over three-fold cross-validation. The best result is bolded.
Significance test results for the mean and standard deviation of three validation sets, comparing three emotion classifiers for infant-directed speech.
| Model | Feature set | p-value |
|---|---|---|
| LDA vs. FCN | Default | 1.76 × 10⁻⁶ ± 2.10 × 10⁻⁶ |
| LDA vs. CNSA | Default | 1.42 × 10⁻³ ± 1.99 × 10⁻³ |
| CNSA vs. FCN | Default | 2.46 × 10⁻¹ ± 3.20 × 10⁻¹ |
Accuracy, weighted F1 score and macro F1 score for three-fold cross validation tests on both training and testing data for 2-layer FCN trained with augmented data for mother vocalization.
| Dataset split | Fold | Accuracy | Weighted F1 | Macro F1 |
|---|---|---|---|---|
| Training | 1 | 84.37% | 84.09% | 81.39% |
| | 2 | 82.35% | 81.95% | 79.39% |
| | 3 | 85.81% | 85.60% | 83.34% |
| Testing | 1 | 65.74% | 65.87% | 55.27% |
| | 2 | 66.76% | 66.88% | 55.64% |
| | 3 | 66.07% | 65.85% | 54.72% |
Fig. 4. Fundamental frequency contour overlaid on the spectrogram for two selected audio samples in each class. Blue dots with white outline are detected F0 values.
Fig. 5. Linear-frequency spectrogram and logMelFreqBand(d)[0–1] for cry, fus, lau and bab.
Fig. 6. Linear-frequency spectrogram and logMelFreqBand[4–5] for cry, fus, and bab.
Fig. 7. Linear-frequency spectrogram overlaid with lspFreq[0–1].
Fig. 8. Linear-frequency spectrogram overlaid with first formant frequency F1.