Felix Weninger, Florian Eyben, Björn W Schuller, Marcello Mortillaro, Klaus R Scherer.
Abstract
Without doubt, there is emotional information in almost any kind of sound received by humans every day: be it the affective state of a person transmitted by means of speech; the emotion intended by a composer while writing a musical piece, or conveyed by a musician while performing it; or the affective state connected to an acoustic event occurring in the environment, in the soundtrack of a movie, or in a radio play. In the field of affective computing, there is currently some loosely connected research concerning each of these phenomena, but a holistic computational model of affect in sound is still lacking. In turn, for tomorrow's pervasive technical systems, including affective companions and robots, it is expected to be highly beneficial to understand the affective dimensions of "the sound that something makes," in order to evaluate the system's auditory environment and its own audio output. This article aims at a first step toward a holistic computational model: starting from standard acoustic feature extraction schemes in the domains of speech, music, and sound analysis, we interpret the worth of individual features across these three domains, considering four audio databases with observer annotations in the arousal and valence dimensions. In the results, we find that by selection of appropriate descriptors, cross-domain arousal and valence regression is feasible, achieving significant correlations with the observer annotations of up to 0.78 for arousal (training on sound and testing on enacted speech) and 0.60 for valence (training on enacted speech and testing on music). The high degree of cross-domain consistency in encoding the two main dimensions of affect may be attributable to the co-evolution of speech and music from multimodal affect bursts, including the integration of nature sounds for expressive effects.
Keywords: audio signal processing; emotion recognition; feature selection; music perception; sound perception; speech perception; transfer learning
Year: 2013 PMID: 23750144 PMCID: PMC3664314 DOI: 10.3389/fpsyg.2013.00292
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Database statistics.
| Database | Domain | Agreement (arousal) | Agreement (valence) | #Annot. | #Inst. | Length [h:m] |
|---|---|---|---|---|---|---|
| VAM | Speech (spontaneous) | 0.81 | 0.56 | 6–17 | 947 | 0:50 |
| GEMEP | Speech (enacted) | 0.64 | 0.68 | 20 | 154 | 0:06 |
| NTWICM | Music | 0.70 | 0.69 | 4 | 2648 | 168:03 |
| ESD | Sound | 0.58 | 0.80 | 4 | 390 | 0:25 |
Figure 1. Distribution of valence/arousal EWE on the VAM (A), GEMEP (B), emotional sound (C), and NTWICM (D) databases: number of instances per valence/arousal bin.
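EWE here presumably denotes the evaluator weighted estimator, i.e., a per-instance gold standard obtained by averaging annotator ratings weighted by inter-rater agreement. As a hedged, minimal sketch of one common EWE recipe (the function name, weighting scheme, and toy data are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def evaluator_weighted_estimate(ratings):
    """Illustrative EWE: ratings is an (n_annotators, n_instances) array.

    Each annotator is weighted by the Pearson correlation between their
    ratings and the mean of the remaining annotators; the EWE is the
    weighted average per instance. This mirrors a common EWE recipe,
    not necessarily the paper's exact formulation.
    """
    ratings = np.asarray(ratings, dtype=float)
    n_annot = ratings.shape[0]
    weights = np.empty(n_annot)
    for k in range(n_annot):
        others = np.delete(ratings, k, axis=0).mean(axis=0)
        r = np.corrcoef(ratings[k], others)[0, 1]
        weights[k] = max(r, 0.0)        # clip negative agreement to zero weight
    weights /= weights.sum()
    return weights @ ratings            # weighted mean per instance

# Toy usage: three annotators rating five clips on an arousal scale
ewe = evaluator_weighted_estimate([[0.1, 0.5, 0.9, 0.2, 0.7],
                                   [0.2, 0.4, 0.8, 0.1, 0.6],
                                   [0.0, 0.6, 1.0, 0.3, 0.8]])
print(ewe)
```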
ComParE acoustic feature set: 65 provided low-level descriptors (LLD).
| LLD | Group |
|---|---|
| Sum of auditory spectrum (loudness) | Prosodic |
| Sum of RASTA-style filtered auditory spectrum | Prosodic |
| RMS energy, zero-crossing rate | Prosodic |
| RASTA-style auditory spectrum, bands 1–26 (0–8 kHz) | Spectral |
| MFCC 1–14 | Cepstral |
| Spectral energy 250–650 Hz, 1 k–4 kHz | Spectral |
| Spectral roll off point 0.25, 0.50, 0.75, 0.90 | Spectral |
| Spectral flux, centroid, entropy, slope | Spectral |
| Psychoacoustic sharpness, harmonicity | Spectral |
| Spectral variance, skewness, kurtosis | Spectral |
| F0 | Prosodic |
| Prob. of voice | Sound quality |
| Log. HNR, Jitter (local, delta), Shimmer (local) | Sound quality |
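These LLDs belong to the ComParE acoustic feature set, which is commonly extracted with the openSMILE toolkit. Purely as an illustrative stand-in (not the paper's configuration; frame settings, MFCC indexing, and scaling will differ from ComParE), the sketch below computes rough frame-wise counterparts of a few of the listed descriptors with librosa:

```python
import numpy as np
import librosa

def extract_lld(y, sr):
    """Frame-wise low-level descriptors loosely mirroring a few table entries."""
    return {
        "rms_energy": librosa.feature.rms(y=y)[0],                         # RMS energy
        "zcr": librosa.feature.zero_crossing_rate(y)[0],                   # zero-crossing rate
        "mfcc_1_14": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=14),          # MFCC (librosa indexing)
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr)[0],
        "rolloff_75": librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.75)[0],
        "spectral_flatness": librosa.feature.spectral_flatness(y=y)[0],
    }

# Toy usage: one second of a synthetic 440 Hz tone instead of a real recording
sr = 16000
y = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
lld = extract_lld(y, sr)
print({name: contour.shape for name, contour in lld.items()})
```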
ComParE acoustic feature set: functionals applied to the LLD contours of Table 2.
| Functionals | Group |
|---|---|
| Quartiles 1–3, 3 inter-quartile ranges | Percentiles |
| 1% Percentile (≈min), 99% percentile (≈max) | Percentiles |
| Percentile range 1–99% | Percentiles |
| Position of min/max, range (max − min) | Temporal |
| Arithmetic mean¹, root quadratic mean | Moments |
| Contour centroid, flatness | Temporal |
| Standard deviation, skewness, kurtosis | Moments |
| Rel. duration LLD is above 25/50/75/90% range | Temporal |
| Rel. duration LLD is rising | Temporal |
| Rel. duration LLD has positive curvature | Temporal |
| Gain of linear prediction (LP), LP coefficients 1–5 | Modulation |
| Mean, max, min, SD of segment length² | Temporal |
| Mean value of peaks | Peaks |
| Mean value of peaks – arithmetic mean | Peaks |
| Mean/SD of inter peak distances | Peaks |
| Amplitude mean of peaks, of minima | Peaks |
| Amplitude range of peaks | Peaks |
| Mean/SD of rising/falling slopes | Peaks |
| Linear regression slope, offset, quadratic error | Regression |
| Quadratic regression a, b, offset, quadratic error | Regression |
| Percentage of non-zero frames³ | Temporal |
¹Arithmetic mean of LLD / positive Δ LLD. ²Not applied to voice-related LLD except F0. ³Only applied to F0.
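Applying such functionals to every LLD contour maps a variable-length recording to a fixed-length feature vector. The following minimal sketch illustrates a handful of the listed functionals on a single contour; it is a simplified illustration with assumed conventions, not the openSMILE implementation:

```python
import numpy as np

def functionals(contour):
    """A few of the listed functionals applied to one LLD contour."""
    x = np.asarray(contour, dtype=float)
    t = np.arange(len(x))
    slope, offset = np.polyfit(t, x, 1)                  # linear regression over time
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return {
        "arith_mean": x.mean(),
        "root_quadratic_mean": np.sqrt(np.mean(x ** 2)),
        "std": x.std(),
        "skewness": np.mean(((x - x.mean()) / (x.std() + 1e-12)) ** 3),
        "quartile_1": q1, "quartile_2": q2, "quartile_3": q3,
        "iqr_1_3": q3 - q1,
        "percentile_1": np.percentile(x, 1), "percentile_99": np.percentile(x, 99),
        "lin_reg_slope": slope, "lin_reg_offset": offset,
        "lin_reg_quadratic_error": np.mean((x - (slope * t + offset)) ** 2),
        "rel_duration_rising": np.mean(np.diff(x) > 0),  # "rel. duration LLD is rising"
    }

# Example: functionals of a synthetic loudness-like contour
print(functionals(np.abs(np.sin(np.linspace(0, 6, 200)))))
```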
Cross-domain feature relevance for arousal: top features ranked by absolute correlation with the gold standard; the four blocks list, from top to bottom, the top-ranked features within the sound, speech, and music domains and by CDCC3.
| Rank | LLD | Functional | Sound | Music | Speech | CDCC3 |
|---|---|---|---|---|---|---|
| 1 | Loudness | R.q. mean | 0.59** | 0.16** | 0.75** | 0.31 |
| 4 | Loudness | Lin. regr. offset | 0.54** | 0.27** | 0.56** | 0.36 |
| 6 | Loudness | 99-Percentile | 0.53** | 0.09 ° | 0.67** | 0.23 |
| 8 | Energy | R.q. mean | 0.50** | 0.07− | 0.64** | 0.21 |
| 1 | Spectral flux | R.q. mean | 0.38** | 0.13** | 0.76** | 0.21 |
| 9 | Δ Spectral flux | Arith. mean | 0.25* | 0.28** | 0.68** | 0.26 |
| 63 | Δ MFCC 14 | R.q. mean | 0.14− | 0.32** | 0.58** | 0.20 |
| 97 | F0 | R.q. mean | 0.17− | 0.09° | 0.55** | 0.12 |
| 1 | Loudness | Mean peak dist. | 0.02− | –0.58** | –0.08− | 0.01 |
| 2 | Spectral ent. | Mean peak dist. | 0.04− | –0.54** | –0.16** | 0.03 |
| 3 | Loudness | Peak dist. SD | 0.02− | –0.53** | –0.10− | 0.02 |
| 5 | MFCC 1 | R.q. mean | –0.11− | –0.53** | –0.47** | 0.23 |
| 1 | Loudness | Quad. reg. offset | 0.41** | 0.37** | 0.37** | 0.37 |
| 4 | Loudness | Arith. mean | 0.57** | 0.18** | 0.73** | 0.31 |
| 5 | Spectral flux | Quad. reg. offset | 0.32** | 0.30** | 0.45** | 0.31 |
| 6 | Δ Energy 1–4 kHz | Quartile 1 | –0.32** | –0.30** | –0.59** | 0.31 |
Significance denoted by **p < 0.001, *p < 0.01, °p < 0.05, −p ≥ 0.05; Bonferroni corrected p-values from two-sided paired sample t-tests.
Cross-domain feature relevance for valence: top features ranked by absolute correlation with the gold standard; the four blocks list, from top to bottom, the top-ranked features within the sound, speech, and music domains and by CDCC3.
| Rank | LLD | Functional | Sound | Music | Speech | CDCC3 |
|---|---|---|---|---|---|---|
| 1 | Loudness | Quartile 3 | –0.31** | 0.27** | –0.21** | –0.09 |
| 2 | Loudness | Rise time | –0.30** | –0.21** | –0.04− | 0.10 |
| 3 | Loudness | R.q. mean | –0.29** | 0.29** | –0.23** | –0.10 |
| 10 | Spectral flux | Skewness | 0.27** | –0.13** | 0.11− | –0.04 |
| 1 | F0 | Quartile 2 | 0.05− | –0.07− | –0.31** | –0.01 |
| 2 | Energy 1–4 kHz | Arith. mean | –0.17− | 0.23** | –0.31** | –0.07 |
| 4 | Δ Energy 1–4 kHz | Arith. mean | –0.08− | 0.26** | –0.30** | –0.09 |
| 10 | F0 | Quartile 1 | 0.07− | –0.14** | –0.29** | 0.00 |
| 1 | Δ Loudness | Mean peak dist. | –0.02− | –0.65** | –0.03− | 0.02 |
| 2 | Loudness | Mean peak dist. | –0.12− | –0.65** | –0.04− | 0.06 |
| 3 | MFCC 1 | Quartile 2 | –0.04− | –0.61** | 0.24** | –0.08 |
| 9 | Spectral ent. | Mean peak dist. | 0.05− | –0.57** | 0.04− | –0.02 |
| 1 | Spect. centroid | Rise time | –0.13− | –0.16** | –0.12− | 0.12 |
| 2 | Psy. sharpness | Rise time | –0.13− | –0.16** | –0.12− | 0.12 |
| 5 | Energy 250–650 Hz | IQR 1–3 | –0.14− | –0.11** | –0.15* | 0.12 |
| 6 | MFCC 13 | IQR 1–3 | –0.08− | –0.20** | –0.18** | 0.12 |
Significance denoted by **p < 0.001, *p < 0.01, °p < 0.05, −p ≥ 0.05; Bonferroni corrected p-values from two-sided paired sample t-tests.
Figure 2. Feature relevance by LLD group (A: arousal, B: valence) and by functional group (C: arousal, D: valence): number of features among the top 200 ranked by absolute correlation with the gold standard (single domain) and by CDCC [equation (3)] (cross-domain).
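The single-domain rankings behind Tables 4/5 and Figure 2 use the absolute Pearson correlation of each feature with the gold standard; the cross-domain criterion CDCC is defined in equation (3) of the paper and is not reproduced here. A minimal sketch of the single-domain ranking, assuming a feature matrix X (instances × features) and a gold-standard vector y, with an illustrative Bonferroni correction (the exact significance test may differ from the one named in the table footnotes):

```python
import numpy as np
from scipy.stats import pearsonr

def rank_features_by_correlation(X, y, top_k=200):
    """Rank columns of X by |Pearson r| with the gold standard y.

    Returns (column indices, correlations, Bonferroni-corrected p-values),
    mirroring the single-domain ranking used for the feature relevance tables.
    """
    n_features = X.shape[1]
    r = np.empty(n_features)
    p = np.empty(n_features)
    for j in range(n_features):
        r[j], p[j] = pearsonr(X[:, j], y)
    p = np.minimum(p * n_features, 1.0)        # Bonferroni correction (illustrative)
    order = np.argsort(-np.abs(r))[:top_k]
    return order, r[order], p[order]

# Toy usage with random data standing in for one domain
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = X[:, 3] * 0.8 + rng.normal(scale=0.5, size=300)   # feature 3 is informative
idx, corr, pvals = rank_features_by_correlation(X, y, top_k=5)
print(idx, np.round(corr, 2))
```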
Results of within-domain and pair-wise cross-domain support vector regression on arousal observer ratings for sound (emotional sound database), music (NTWICM database), and spontaneous and enacted speech (VAM/GEMEP databases).
| Train on \ Test on | Sound | Music | Speech (sp.) | Speech (en.) | Mean |
|---|---|---|---|---|---|
| Full ComParE feature set | | | | | |
| Sound | 0.54** | 0.14** | 0.70** | 0.64** | 0.51 |
| Music | 0.11− | 0.65** | 0.46** | 0.39** | 0.40 |
| Speech/Sp. | 0.38** | 0.37** | 0.81** | 0.80** | 0.59 |
| Speech/En. | 0.20* | 0.32** | 0.60** | 0.85** | 0.49 |
| Mean | 0.30 | 0.37 | 0.64 | 0.67 | 0.50 |
| 200 top features selected by CDCC2 (task-specific) | | | | | |
| Sound | 0.59** | 0.46** | 0.76** | 0.79** | 0.65 |
| Music | 0.46** | 0.67** | 0.73** | 0.75** | 0.65 |
| Speech/Sp. | 0.54** | 0.47** | 0.83** | 0.78** | 0.66 |
| Speech/En. | 0.56** | 0.46** | 0.77** | 0.85** | 0.66 |
| Mean | 0.54 | 0.52 | 0.77 | 0.79 | 0.65 |
| Generic features: 200 top features selected by CDCC3 | | | | | |
| Sound | 0.56** | 0.35** | 0.78** | 0.56** | 0.56 |
| Music | 0.38** | 0.66** | 0.74** | 0.63** | 0.60 |
| Speech/Sp. | 0.44** | 0.43** | 0.82** | 0.69** | 0.59 |
| Speech/En. | 0.31** | 0.45** | 0.77** | 0.78** | 0.58 |
| Mean | 0.42 | 0.47 | 0.78 | 0.67 | 0.58 |
Significance denoted by **p < 0.001, *p < 0.01, −p ≥ 0.05; Bonferroni corrected p-values from two-sided paired sample t-tests. Full ComParE feature set (cf. Tables 2 and 3); 200 top features selected by CDCC2 for specific within-domain or cross-domain regression tasks; Generic features: 200 features selected by CDCC3 across sound, music, and speech domains (cf. Table 4).
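The cross-domain figures above follow a train-on-one-domain, test-on-another protocol with support vector regression, scored by the correlation between predictions and the target domain's observer ratings. A minimal scikit-learn sketch of that protocol (feature selection, kernel, and hyper-parameters are placeholders, not the paper's setup):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def cross_domain_correlation(X_train, y_train, X_test, y_test):
    """Train SVR on one domain; report Pearson r of its predictions on another."""
    model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0))
    model.fit(X_train, y_train)
    r, p = pearsonr(model.predict(X_test), y_test)
    return r, p

# Toy usage: synthetic 'sound' and 'music' domains sharing one informative feature
rng = np.random.default_rng(1)
X_sound = rng.normal(size=(400, 20)); y_sound = X_sound[:, 0] + 0.3 * rng.normal(size=400)
X_music = rng.normal(size=(400, 20)); y_music = X_music[:, 0] + 0.3 * rng.normal(size=400)
print(cross_domain_correlation(X_sound, y_sound, X_music, y_music))
```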
Results of within-domain and pair-wise cross-domain support vector regression on valence observer ratings for sound (emotional sound database), music (NTWICM database), and spontaneous and enacted speech (VAM/GEMEP databases).
| Train on \ Test on | Sound | Music | Speech (sp.) | Speech (en.) | Mean |
|---|---|---|---|---|---|
| Full ComParE feature set | | | | | |
| Sound | 0.40** | −0.11** | 0.21** | −0.02− | 0.12 |
| Music | −0.17° | 0.80** | −0.13* | 0.08− | 0.15 |
| Speech/Sp. | 0.11− | −0.15** | 0.46** | 0.21− | 0.16 |
| Speech/En. | −0.06− | −0.18** | 0.12* | 0.26° | 0.03 |
| Mean | 0.07 | 0.09 | 0.17 | 0.13 | 0.12 |
| 200 top features selected by CDCC2 (task-specific) | | | | | |
| Sound | 0.51** | 0.36** | 0.27** | 0.48** | 0.41 |
| Music | 0.40** | 0.82** | 0.33** | 0.52** | 0.52 |
| Speech/Sp. | 0.30** | 0.45** | 0.44** | 0.26° | 0.36 |
| Speech/En. | 0.45** | 0.60** | 0.36** | 0.50** | 0.48 |
| Mean | 0.41 | 0.56 | 0.35 | 0.44 | 0.44 |
| Generic features: 200 top features selected by CDCC3 | | | | | |
| Sound | 0.26** | 0.41** | 0.27** | 0.12− | 0.27 |
| Music | 0.27** | 0.75** | 0.33** | 0.25° | 0.40 |
| Speech/Sp. | 0.20* | 0.45** | 0.35** | 0.19− | 0.30 |
| Speech/En. | 0.20** | 0.44** | 0.32** | 0.23− | 0.30 |
| Mean | 0.23 | 0.52 | 0.32 | 0.20 | 0.32 |
Significance denoted by **p < 0.001, *p < 0.01, °p < 0.05, −p ≥ 0.05; Bonferroni corrected p-values from two-sided paired sample t-tests. Full ComParE feature set (cf. Tables 2 and 3); 200 top features selected by CDCC2 for specific within-domain or cross-domain regression tasks; Generic features: 200 features selected by CDCC3 across sound, music, and speech domains (cf. Table 5).