Stephen McAdams, Chelsea Douglas, Naresh N Vempala.
Abstract
Composers often pick specific instruments to convey a given emotional tone in their music, partly due to their expressive possibilities, but also due to their timbres in specific registers and at given dynamic markings. Of interest to both music psychology and music informatics from a computational point of view is the relation between the acoustic properties that give rise to the timbre at a given pitch and the perceived emotional quality of the tone. Musician and nonmusician listeners were presented with 137 tones played at a fixed dynamic marking (forte) at pitch class D# across each instrument's entire pitch range, with different playing techniques, on standard orchestral instruments drawn from the brass, woodwind, string, and pitched percussion families. They rated each tone on six analogical-categorical scales in terms of emotional valence (positive/negative and pleasant/unpleasant), energy arousal (awake/tired), tension arousal (excited/calm), preference (like/dislike), and familiarity. Linear mixed models revealed interactive effects of musical training, instrument family, and pitch register, with non-linear relations between pitch register and several dependent variables. Twenty-three audio descriptors from the Timbre Toolbox were computed for each sound and analyzed in two ways: linear partial least squares regression (PLSR) and nonlinear artificial neural net modeling. These two analyses converged in terms of the importance of various spectral, temporal, and spectrotemporal audio descriptors in explaining the emotion ratings, but some differences also emerged. Different combinations of audio descriptors make major contributions to the three emotion dimensions, suggesting that they are carried by distinct acoustic properties. Valence is more positive with lower spectral slopes, a greater emergence of strong partials, and an amplitude envelope with a sharper attack and earlier decay.
Higher tension arousal is carried by brighter sounds, more spectral variation, and gentler attacks. Greater energy arousal is associated with brighter sounds, with higher spectral centroids and slower decrease of the spectral slope, as well as with greater spectral emergence. The divergences between linear and nonlinear approaches are discussed.
Keywords: emotion; energy arousal; musical instruments; musical timbre; pitch register; preference; tension arousal; valence
Year: 2017 PMID: 28228741 PMCID: PMC5296353 DOI: 10.3389/fpsyg.2017.00153
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Figure 1. Screenshot of the experimental interface on the iPad.
Pearson's correlation coefficients among ratings of perceived valence, tension arousal, energy arousal, preference, and familiarity for sounds common to both the main and control experiments.
| Experiment | Variable | Main Valence | Main Tension | Main Energy | Main Preference | Main Familiarity | Control Valence | Control Tension | Control Energy | Control Preference |
| Main | Tension | 0.46 | | | | | | | | |
| | Energy | 0.68 | −0.29 | | | | | | | |
| | Preference | 0.72 | 0.75 | 0.19 | | | | | | |
| | Familiarity | 0.56 | 0.38 | 0.31 | 0.66 | | | | | |
| Control | Valence | 0.89 | −0.65 | 0.44 | 0.81 | 0.60 | | | | |
| | Tension | −0.28 | 0.89 | 0.47 | −0.67 | −0.31 | 0.51 | | | |
| | Energy | 0.56 | 0.37 | 0.92 | 0.01 | 0.18 | 0.33 | −0.57 | | |
| | Preference | 0.57 | −0.72 | 0.02 | 0.89 | 0.67 | 0.80 | 0.69 | −0.06 | |
| | Familiarity | 0.46 | −0.37 | 0.20 | 0.57 | 0.90 | 0.56 | 0.31 | 0.14 | 0.65 |
df = 135 for comparisons among Main variables and df = 38 between Main and Control and among Control variables. Bonferroni-corrected significance levels: p < 0.05, p < 0.01.
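The degrees of freedom in the note above follow the usual rule df = n − 2 for a Pearson correlation, so df = 135 corresponds to 137 sounds and df = 38 to 40 sounds. The following is a minimal sketch of that computation; the function names and the five ratings are hypothetical and are not the study data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t statistic for testing r against zero, with df = n - 2."""
    df = n - 2
    return r * math.sqrt(df / (1.0 - r * r)), df

# Hypothetical mean ratings for five sounds (illustrative only).
valence = [0.2, 0.4, 0.5, 0.7, 0.9]
preference = [0.1, 0.5, 0.4, 0.8, 0.9]

r = pearson_r(valence, preference)
t, df = t_statistic(r, len(valence))
print(f"r = {r:.2f}, t({df}) = {t:.2f}")
```

A Bonferroni correction would then compare each p value against the chosen alpha divided by the number of correlations tested; that count is not stated here, so it is left out of the sketch.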
Linear mixed-effects model Type III Wald F-tests.
| Intercept | 1, 124.4 | 145.63 | < 0.001 | 1, 121.70 | 121.96 | < 0.001 |
| Training (T) | 1, 136.88 | 0.43 | 0.512 | 1, 132.04 | 1.56 | 0.214 |
| Family (F) | 3, 120.66 | 7.30 | < 0.001 | 3, 121.11 | 4.04 | 0.009 |
| Register (R) | 6, 122.01 | 7.98 | < 0.001 | 6, 120.75 | 4.15 | < 0.001 |
| T × F | 3, 104.62 | 0.24 | 0.871 | 3, 95.27 | 1.38 | 0.253 |
| T × R | 6, 69.38 | 3.71 | 0.003 | 6, 56.34 | 2.27 | 0.050 |
| F × R | 16, 111.00 | 2.41 | 0.004 | 16, 111.00 | 2.41 | 0.004 |
| T × F × R | 16, 111.00 | 1.04 | 0.417 | 16, 111.00 | 2.13 | 0.011 |
| Intercept | 1, 124.02 | 381.45 | < 0.001 | 1, 135.08 | 120.15 | < 0.001 |
| T | 1, 137.48 | 0.05 | 0.819 | 1, 96.02 | 1.97 | 0.164 |
| F | 3, 118.78 | 1.67 | 0.178 | 3, 125.81 | 10.19 | < 0.001 |
| R | 6, 112.83 | 13.27 | < 0.001 | 6, 120.90 | 1.65 | 0.139 |
| T × F | 3, 91.45 | 1.44 | 0.239 | 3, 94.77 | 1.10 | 0.353 |
| T × R | 6, 53.83 | 2.10 | 0.068 | 6, 58.39 | 3.89 | 0.002 |
| F × R | 16, 111.00 | 3.09 | < 0.001 | 16, 111.00 | 1.44 | 0.135 |
| T × F × R | 16, 111.00 | 2.30 | 0.006 | 16, 111.00 | 1.99 | 0.020 |
| Intercept | 1, 149.78 | 112.47 | < 0.001 | |||
| T | 1, 70.46 | 5.89 | 0.018 | |||
| F | 3, 131.80 | 6.33 | < 0.001 | |||
| R | 5, 120.36 | 0.82 | 0.540 | |||
| T × F | 3, 81.11 | 2.09 | 0.108 | |||
| T × R | 5, 69.51 | 0.58 | 0.716 | |||
| F × R | 16, 111.00 | 1.85 | 0.039 | |||
| T × F × R | 16, 111.00 | 1.62 | 0.084 | |||
N = 5480. All predictors are sum-coded factor variables. The following random effects were included: (a) random intercepts for Participant and Sounds, (b) random slopes for Family and Register (within Participants) and Training (within Sounds).
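With sum coding, each coefficient of a factor expresses a deviation from the grand mean, which keeps main-effect tests interpretable when interactions are in the model. The sketch below builds such a contrast table for the four instrument families; the helper name, the level labels, and the choice of the last level as the one coded −1 are assumptions for illustration, not details taken from the study.

```python
def sum_contrasts(levels):
    """Return {level: contrast row} using sum (deviation) coding.

    A factor with k levels gets k - 1 columns. Level i (for i < k - 1)
    has a 1 in column i and 0 elsewhere; the last level is -1 in every column.
    """
    k = len(levels)
    coding = {}
    for i, level in enumerate(levels):
        if i < k - 1:
            coding[level] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            coding[level] = [-1] * (k - 1)
    return coding

# Level order is an assumption; it determines which level is coded -1.
families = ["brass", "woodwind", "string", "percussion"]
coding = sum_contrasts(families)

for level, row in coding.items():
    print(f"{level:12s} {row}")
```

Each column sums to zero across levels, which is the property that makes the intercept correspond to the grand mean in a balanced design.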
Figure 2. Means of perceived valence ratings across pitch register for each instrument family and each musical training group.
Figure 3. Predicted means of perceived tension-arousal ratings across pitch register for each instrument family and training group.
Figure 4. Predicted means of perceived energy-arousal ratings across pitch register for each instrument family and training group.
Figure 5. Predicted means of preference ratings across pitch register for each instrument family and training group.
Figure 6. Predicted means of familiarity ratings across pitch register for each instrument family and training group.
Definition of acoustic descriptors from the Timbre Toolbox.
| Type | Descriptor | Definition | Statistics |
| Spectral | Centroid (log) | Center of gravity of the spectrum | Med |
| | Spread | Standard deviation of the spectrum around the mean | Med, IQR |
| | Skewness | Asymmetry of the spectrum around the mean | Med |
| | Kurtosis | Flatness of spectrum around the mean | Med, IQR |
| | Slope | Linear regression over the spectral amplitude values | Med, IQR |
| | Decrease | Average of slopes between F0 and 2nd to | Med |
| | Rolloff | Frequency below which 95% of the signal energy is contained | Med, IQR |
| | Variation | Variation of the spectrum over time | Med |
| | Flatness | Ratio of the geometric and arithmetic means of the spectrum | Med |
| | Crest | Ratio of the spectral maximum to the arithmetic spectral mean | Med |
| Temporal | Attack time (log) | Duration of the attack portion of the sound | |
| | Attack slope | Rate of change of energy over time in the attack portion | |
| | Centroid | Center of gravity of the energy envelope | |
For time-varying spectral descriptors, both the median (Med) and interquartile ranges (IQR) are computed over the duration of the sound, so each of these descriptors produces two measures.
Indicates the 17 descriptors included in partial least-squares regression and neural network analyses (see text).
Figure 7. Dendrogram of the hierarchical cluster analysis of descriptor values across the entire stimulus set.
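PLSR model fitness, predictive power, and prediction error (RMSE) for each emotion dimension.
| | Valence | Tension arousal | Energy arousal |
| Model fitness | 0.6617 | 0.5605 | 0.7535 |
| Predictive power | 0.6032 | 0.4406 | 0.6867 |
| RMSE | 0.0827 | 0.0841 | 0.0772 |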
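For readers who want to compute comparable fit and error measures, the sketch below gives the standard definitions of root-mean-square error and the coefficient of determination. Treating model fitness as R² on the fitted data, and predictive power as the same statistic on held-out predictions, is an assumption here; the ratings and predictions are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical mean ratings (scaled 0 to 1) and model predictions.
observed = [0.2, 0.4, 0.6, 0.8]
predicted = [0.3, 0.4, 0.6, 0.7]

print(f"R^2 = {r_squared(observed, predicted):.3f}")
print(f"RMSE = {rmse(observed, predicted):.4f}")
```

Computing `r_squared` on predictions from cross-validation rather than from the fitted data gives a predictive counterpart that is typically lower than the fitted value.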
Loadings of each audio descriptor on the Principal Components (PC) for each emotion dimension in the PLSR analysis.
| Spec centroid Med | 6.18 | 1.55 | −3.95 | 1.81 | −0.73 | −5.05 | 0.89 | |||
| Spec centroid IQR | 3.25 | −4.45 | 5.35 | 2.43 | −1.67 | −2.07 | 2.83 | 3.51 | −6.72 | 3.90 |
| Spec spread IQR | 1.46 | −2.92 | 4.27 | −0.43 | 1.38 | −2.36 | −0.30 | 0.97 | −3.63 | 5.55 |
| Spec skewness Med | −5.88 | −1.89 | − | 3.98 | −1.07 | 0.89 | − | 5.92 | −0.64 | |
| Spec skewness IQR | 0.67 | −0.41 | 4.96 | −2.29 | 0.41 | −2.28 | 3.76 | −0.49 | −3.50 | 5.20 |
| Spec decrease Med | − | 4.88 | −1.12 | − | 5.88 | −3.77 | 1.60 | − | 0.86 | −0.73 |
| Spec decrease IQR | −4.38 | 3.96 | −1.83 | −6.82 | 4.41 | −2.93 | − | −6.21 | 3.24 | −1.61 |
| Spec rolloff IQR | 1.20 | −1.39 | 4.61 | −1.59 | 0.50 | −2.34 | 2.26 | 0.30 | −3.21 | 5.87 |
| Spec variation Med | −2.18 | −4.58 | − | 0.99 | 4.41 | −3.74 | −1.40 | 2.32 | 2.38 | |
| Spec variation IQR | −4.81 | −5.62 | −3.20 | 1.68 | −0.30 | −2.94 | −3.53 | −1.65 | 2.26 | |
| Spec flatness Med | −2.10 | −6.56 | 2.56 | 4.67 | 0.23 | −5.97 | −1.65 | −0.07 | − | −1.43 |
| Spec flatness IQR | 1.87 | −3.27 | 3.87 | 0.92 | 1.77 | −1.54 | −2.20 | 1.87 | −1.55 | |
| Spec crest Med | −0.22 | −0.05 | 3.89 | −5.53 | 4.97 | −3.79 | 5.09 | 0.75 | ||
| Spec crest IQR | 6.81 | −1.46 | 1.50 | 2.38 | −3.54 | 1.85 | −5.53 | 6.87 | 2.47 | 1.77 |
| Log attack time | −7.47 | −3.19 | 6.05 | 4.49 | 4.06 | −7.34 | 2.65 | −4.85 | −3.15 | |
| Attack slope | 7.99 | 3.02 | −5.67 | −4.36 | −3.72 | −3.36 | 5.35 | 3.93 | −5.19 | |
| Temporal centroid | − | −1.59 | 2.56 | 2.65 | 3.56 | −6.34 | 4.28 | −6.29 | −3.41 | 1.54 |
| Partial | 0.58 | 0.06 | 0.03 | 0.39 | 0.09 | 0.05 | 0.03 | 0.72 | 0.02 | 0.02 |
Loadings greater than 8.0 (or the highest value for a given PC if none are above 8) are shown in bold to highlight the primary contributors discussed in the text.
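To make the notions of weights, scores, and loadings concrete, here is a minimal sketch of partial least squares regression with a single response and a single component. It is a textbook-style illustration under stated assumptions, not the analysis pipeline used for the table above: the function name, the data, and the restriction to one component are all hypothetical, and the loadings it produces are not on the same scale as the tabulated values.

```python
import math

def pls1_one_component(X, y):
    """Fit a one-component PLS regression of y on the columns of X."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [v - y_mean for v in y]

    # Weights: direction in predictor space with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    # Scores: projection of each observation onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(m)) for i in range(n)]
    tt = sum(v * v for v in t)

    # Loadings: regression of each predictor (p) and of y (q) on the scores.
    p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
    q = sum(yc[i] * t[i] for i in range(n)) / tt

    def predict(x):
        score = sum((x[j] - x_mean[j]) * w[j] for j in range(m))
        return y_mean + q * score

    return w, p, q, predict

# Hypothetical data: two descriptors for four sounds and one rating each.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y = [1.0, 2.0, 3.0, 4.0]
w, p, q, predict = pls1_one_component(X, y)
print(w, p, q, predict([2.5, 2.5]))
```

Further components would be obtained by deflating `Xc` and `yc` with the fitted scores and repeating the same steps.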
Performance of neural networks modeling valence, tension arousal, and energy arousal.
| | Valence | Tension arousal | Energy arousal |
| 1 | 0.0800 | 0.0814 | 0.0703 |
| 2 | 0.0801 | 0.0813 | 0.0672 |
| 3 | 0.0809 | 0.0803 | 0.0672 |
| 4 | 0.0816 | 0.0853 | 0.0687 |
| 5 | 0.0825 | 0.0822 | 0.0680 |
| Mean | 0.0810 | 0.0821 | 0.0683 |
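The Mean row can be checked directly from the five rows above it. The short sketch below does so, assuming the three value columns are valence, tension arousal, and energy arousal in the order given by the caption; the dictionary keys are illustrative labels.

```python
from statistics import mean

# Values copied from rows 1 to 5 of the table above.
values_by_dimension = {
    "valence": [0.0800, 0.0801, 0.0809, 0.0816, 0.0825],
    "tension": [0.0814, 0.0813, 0.0803, 0.0853, 0.0822],
    "energy": [0.0703, 0.0672, 0.0672, 0.0687, 0.0680],
}

# Average each column and round to four decimals, as in the table.
mean_by_dimension = {
    name: round(mean(values), 4) for name, values in values_by_dimension.items()
}
print(mean_by_dimension)
# {'valence': 0.081, 'tension': 0.0821, 'energy': 0.0683}
```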
Primary timbre descriptor contributions to each type of neural net output unit.
| Descriptor | Valence | Tension arousal | Energy arousal |
| Spectral centroid median | − | +8.6 | +8.3 |
| Spectral spread IQR | − | −7.0 | − |
| Spectral decrease median | −9.5 | − | −9.6 |
| Spectral rolloff IQR | − | +6.7 | − |
| Spectral variation median | −10.6 | +15.9 | − |
| Spectral variation IQR | −11.0 | − | −12.1 |
| Spectral flatness median | − | −7.2 | − |
| Spectral crest median | +10.9 | − | +7.7 |
| Spectral crest IQR | − | − | +15.2 |
| Attack slope | +7.9 | −7.1 | − |
| Temporal centroid | −15.8 | − | −13.9 |
The values indicate the mean % contribution and the sign of the relation between the model prediction and the audio descriptor.
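Comparison of model fitness, predictive power, and prediction error for PLSR and neural network models.
| Model | Fitness: Valence | Fitness: Tension | Fitness: Energy | Predictive power: Valence | Predictive power: Tension | Predictive power: Energy | RMSE: Valence | RMSE: Tension | RMSE: Energy |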
| PLSR | 0.6617 | 0.5605 | 0.7535 | 0.6032 | 0.4406 | 0.6867 | 0.0827 | 0.0841 | 0.0772 |
| NN-MLP | 0.9971 | 0.9963 | 0.9983 | 0.6117 | 0.4870 | 0.7658 | 0.0810 | 0.0821 | 0.0683 |
| Percentage of improvement (%) | 51 | 78 | 32 | 1 | 11 | 12 | −2 | −2 | −12 |
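The last row of the table is consistent with a relative change of the neural network value over the PLSR value, 100 × (NN − PLSR) / PLSR, rounded to the nearest integer. The sketch below reproduces it from the two rows above; the variable names are illustrative, and the formula is inferred from the numbers rather than stated in the text.

```python
# Values copied from the PLSR and NN-MLP rows of the table above.
plsr = [0.6617, 0.5605, 0.7535, 0.6032, 0.4406, 0.6867, 0.0827, 0.0841, 0.0772]
nn_mlp = [0.9971, 0.9963, 0.9983, 0.6117, 0.4870, 0.7658, 0.0810, 0.0821, 0.0683]

# Relative change of NN over PLSR, in percent, rounded to whole numbers.
improvement = [round(100 * (n - p) / p) for p, n in zip(plsr, nn_mlp)]
print(improvement)
# [51, 78, 32, 1, 11, 12, -2, -2, -12]
```

For the three error columns a negative percentage is the improvement, since a lower RMSE means smaller prediction error.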
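Ranks of primary audio descriptors contributing to PLSR and NN models.
| Descriptor | Type | Valence PLSR | Valence NN | Tension PLSR | Tension NN | Energy PLSR | Energy NN |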
| Spectral centroid median | Spectral | 2 | – | 1 | 2 | 3 | 5 |
| Spectral centroid IQR | Spectrotemporal | – | – | – | – | 6 | – |
| Spectral spread IQR | Spectrotemporal | – | – | – | 5 | – | – |
| Spectral skewness median | Spectral | 1 | – | 2 | – | 4 | – |
| Spectral skewness IQR | Spectrotemporal | – | – | – | – | – | – |
| Spectral decrease median | Spectral | 5 | 5 | 5 | – | 1 | 4 |
| Spectral decrease IQR | Spectrotemporal | – | – | – | – | – | – |
| Spectral rolloff IQR | Spectrotemporal | – | – | – | 6 | – | – |
| Spectral variation median | Spectrotemporal | – | 4 | 4 | 1 | – | – |
| Spectral variation IQR | Spectrotemporal | – | 2 | 3 | – | – | 3 |
| Spectral flatness median | Spectral | – | – | – | 3 | 2 | – |
| Spectral flatness IQR | Spectrotemporal | – | – | – | – | – | – |
| Spectral crest median | Spectral | 4 | 3 | – | – | 5 | 6 |
| Spectral crest IQR | Spectrotemporal | – | – | – | – | – | 1 |
| Log attack time | Temporal | – | – | – | – | – | – |
| Attack slope | Temporal | 6 | 6 | 6 | 4 | – | – |
| Temporal centroid | Temporal | 3 | 1 | – | – | – | 2 |
PLSR is in green and NN in orange. Descriptors that make major contributions to both models are highlighted with darker colors.
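The NN ranks in this table can be recovered by ordering the descriptors by the absolute size of their percentage contributions. The sketch below does this for one output unit, using the first value column of the neural net contribution table; treating that column as valence, and ranking by absolute value, are assumptions that happen to reproduce the ranks shown. The function name is illustrative.

```python
def rank_by_magnitude(contributions):
    """Rank descriptors from largest to smallest absolute contribution."""
    ordered = sorted(
        contributions, key=lambda name: abs(contributions[name]), reverse=True
    )
    return {name: position + 1 for position, name in enumerate(ordered)}

# Mean % contributions copied from the first value column of the
# neural net contribution table (assumed here to be valence).
nn_valence = {
    "Spectral decrease median": -9.5,
    "Spectral variation median": -10.6,
    "Spectral variation IQR": -11.0,
    "Spectral crest median": 10.9,
    "Attack slope": 7.9,
    "Temporal centroid": -15.8,
}

ranks = rank_by_magnitude(nn_valence)
for name, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(rank, name)
```

The sign of each contribution is dropped for ranking purposes because it encodes the direction of the relation, not its strength.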