Kaitong Zheng, Ruijie Meng, Chengshi Zheng, Xiaodong Li, Jinqiu Sang, Juanjuan Cai, Jie Wang, Xiao Wang.
Abstract
With the development of deep neural networks, automatic music composition has made great progress. Although emotional music can evoke different auditory perceptions in listeners, only a few studies have focused on generating emotional music. This paper presents EmotionBox, a music-element-driven emotional music generator based on music psychology that is capable of composing music given a specific emotion; unlike previous methods, this model does not require a music dataset labeled with emotions. In this work, pitch histogram and note density are extracted as features that represent mode and tempo, respectively, to control music emotions. The specific emotions are mapped from these features through Russell's psychology model. Subjective listening tests show that EmotionBox has competitive performance in generating different emotional music and significantly better performance in generating music with low-arousal emotions, especially the peaceful emotion, compared with the emotion-label-based method.
Keywords: auditory perceptions; deep neural networks; emotional music generation; music element; music psychology
Year: 2022 PMID: 36106044 PMCID: PMC9465382 DOI: 10.3389/fpsyg.2022.841926
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
An example of a pitch histogram in a C major scale.
| | C | C♯ | D | D♯ | E | F | F♯ | G | G♯ | A | A♯ | B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pitch histogram | 2 | 0 | 1 | 0 | 1 | 2 | 0 | 2 | 0 | 1 | 0 | 1 |
| Probability distribution | 0.2 | 0 | 0.1 | 0 | 0.1 | 0.2 | 0 | 0.2 | 0 | 0.1 | 0 | 0.1 |
♯ means higher in pitch by one semitone.
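As a hedged illustration of the two conditioning features (this is not the authors' released code), the sketch below computes a 12-bin pitch-class histogram and a note density from a plain note list. The (pitch, start, end) note format, the per-second density definition, and the function names are assumptions made for this example; the example melody reproduces the C major histogram in the table above.

```python
# Sketch: computing the two conditioning features described in the paper.
# A note is assumed to be a (midi_pitch, start_sec, end_sec) tuple.
from collections import Counter

def pitch_histogram(notes):
    """12-bin pitch-class histogram, normalized to a probability distribution."""
    counts = Counter(pitch % 12 for pitch, _, _ in notes)
    total = sum(counts.values()) or 1
    return [counts.get(pc, 0) / total for pc in range(12)]

def note_density(notes):
    """Average number of note onsets per second over the clip (an assumed definition)."""
    if not notes:
        return 0.0
    duration = max(end for _, _, end in notes) - min(start for _, start, _ in notes)
    return len(notes) / duration if duration > 0 else 0.0

# Ten notes from the C major scale, matching the table above:
# pitch classes C, C, D, E, F, F, G, G, A, B.
melody = [(60, 0.0, 0.5), (60, 0.5, 1.0), (62, 1.0, 1.5), (64, 1.5, 2.0),
          (65, 2.0, 2.5), (65, 2.5, 3.0), (67, 3.0, 3.5), (67, 3.5, 4.0),
          (69, 4.0, 4.5), (71, 4.5, 5.0)]
print(pitch_histogram(melody))  # [0.2, 0.0, 0.1, 0.0, 0.1, 0.2, 0.0, 0.2, 0.0, 0.1, 0.0, 0.1]
print(note_density(melody))     # 2.0 onsets per second
```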
Figure 1. Simplified Russell's two-dimensional valence-arousal emotion space. The x-axis denotes valence while the y-axis denotes arousal.
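Read against this space, each of the four target emotions picks one quadrant, which EmotionBox realizes through mode (the valence axis, via the pitch histogram) and note density (the arousal axis, via tempo). The sketch below encodes that mapping; the scale templates and the density values are illustrative placeholders, not numbers from the paper.

```python
# Sketch of the emotion-to-feature mapping implied by Russell's space:
# high valence -> major mode, low valence -> minor mode,
# high arousal -> dense notes, low arousal -> sparse notes.

MAJOR = [2, 0, 1, 0, 1, 1, 0, 2, 0, 1, 0, 1]   # C major template (tonic/dominant emphasized)
MINOR = [2, 0, 1, 1, 0, 1, 0, 2, 1, 0, 1, 0]   # C natural minor template

def emotion_to_condition(emotion):
    """Map one of the four emotions to a 13-d condition: histogram + density."""
    high_valence = emotion in ("happy", "peaceful")   # major mode
    high_arousal = emotion in ("happy", "tensional")  # dense notes
    template = MAJOR if high_valence else MINOR
    total = sum(template)
    histogram = [w / total for w in template]
    density = 18.0 if high_arousal else 6.0  # illustrative values, notes/s
    return histogram + [density]

print(emotion_to_condition("peaceful"))  # major-mode histogram, low density
```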
Figure 2. The illustration of gated recurrent units (GRU). x and y denote the current input and output of the GRU; h_{t-1} and h_t are the previous and current hidden states; r and z are the reset and update gates. A GRU network is formed from a series of GRUs.
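For concreteness, here is a minimal NumPy sketch of the step in Figure 2, using one common GRU formulation; the weights are random placeholders, not trained parameters.

```python
# One GRU step with the gates named in the caption: reset gate r, update gate z.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, W, U, b):
    """Compute the new hidden state h_t from input x and previous state h_{t-1}."""
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])       # reset gate
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])       # update gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h_prev) + b["h"])  # candidate state
    return (1 - z) * h_prev + z * h_tilde                    # gated blend

rng = np.random.default_rng(0)
dim_x, dim_h = 4, 8
W = {k: rng.normal(size=(dim_h, dim_x)) * 0.1 for k in "rzh"}
U = {k: rng.normal(size=(dim_h, dim_h)) * 0.1 for k in "rzh"}
b = {k: np.zeros(dim_h) for k in "rzh"}
h = gru_step(rng.normal(size=dim_x), np.zeros(dim_h), W, U, b)
```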
Figure 3. Diagram of the EmotionBox model architecture. "Input X" denotes a sequence of events and "Input Z" denotes the pitch histogram and note density.
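A hedged PyTorch sketch of this architecture (the layer sizes, the event vocabulary size, and the class name `EmotionBoxSketch` are assumptions, not the authors' code): the condition vector Z is tiled across time and concatenated with the embedded event sequence X before the GRU predicts the next event.

```python
# Sketch: a GRU event-sequence model conditioned on a 13-d vector
# (12-bin pitch histogram + note density).
import torch
import torch.nn as nn

class EmotionBoxSketch(nn.Module):
    def __init__(self, n_events=388, emb=128, cond=13, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_events, emb)
        self.gru = nn.GRU(emb + cond, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_events)

    def forward(self, x, z, h=None):
        # x: (batch, time) event ids; z: (batch, 13) condition vector
        e = self.embed(x)                                 # (B, T, emb)
        zc = z.unsqueeze(1).expand(-1, x.size(1), -1)     # tile Z along time
        out, h = self.gru(torch.cat([e, zc], dim=-1), h)  # condition every step
        return self.head(out), h                          # next-event logits
```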
The average note density of the experimental stimuli.
| Emotion | EmotionBox | Emotion-label-based model |
|---|---|---|
| Happy | 18.03 | 20.54 |
| Tensional | 17.23 | 32.39 |
| Sad | 6.24 | 12.06 |
| Peaceful | 6.29 | 14.41 |
Figure 4. The mean accuracy and SD of the subjective evaluation test for classifying generated music samples into emotion categories.
A post-hoc Bonferroni-adjusted pairwise comparison of each emotion pair between the two methods.
| EmotionBox | Emotion-label-based model | p-value |
|---|---|---|
| Happy | Happy | 0.606 |
| Tensional | Tensional | |
| Sad | Sad | 0.240 |
| Peaceful | Peaceful | |
A p-value less than 0.05 means a statistically significant difference at a confidence level of 5% and is presented in bold type. (The Tensional and Peaceful p-values were lost in extraction and are left blank.)
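The paper does not publish raw per-subject scores here, but the adjustment itself is simple. Below is a sketch under the assumption of an unpaired two-sample test; the paper's exact test family may differ, and the score arrays are placeholders.

```python
# Sketch: Bonferroni-adjusted pairwise comparison between two methods.
from scipy import stats

def bonferroni_pairwise(scores_a, scores_b, n_comparisons):
    """Two-sample t-test with Bonferroni-adjusted p-value (capped at 1)."""
    t, p = stats.ttest_ind(scores_a, scores_b)
    return min(p * n_comparisons, 1.0)

# e.g., per-subject accuracies for one emotion under each method (made-up data):
p_adj = bonferroni_pairwise([0.70, 0.80, 0.60, 0.75], [0.72, 0.65, 0.80, 0.70], 4)
```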
Human classification results for each combination of the target emotion at the generation stage and the emotion perceived by subjects.
(A)

| Target emotion | Happy | Tensional | Sad | Peaceful |
|---|---|---|---|---|
| Happy | 71% | 28% | 0% | 1% |
| Tensional | 17% | 74% | 5% | 4% |
| Sad | 1% | 8% | 56% | 35% |
| Peaceful | 8% | 4% | 26% | 63% |
(B)

| Target emotion | Happy | Tensional | Sad | Peaceful |
|---|---|---|---|---|
| Happy | 74% | 23% | 0% | 3% |
| Tensional | 10% | 90% | 0% | 0% |
| Sad | 4% | 18% | 47% | 31% |
| Peaceful | 26% | 28% | 5% | 41% |
(A) The results of the EmotionBox. (B) The results of the emotion-label-based model.
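Figure 5's arousal/valence view can be read as a folding of these confusion matrices onto Russell's two axes. A small sketch of that folding for panel (A), assuming the row percentages above and that this matches how the paper collapses the four-way data:

```python
# Fold the four-way confusion matrix into binary arousal/valence decisions.
EMOTIONS = ["Happy", "Tensional", "Sad", "Peaceful"]
HIGH_AROUSAL = {"Happy", "Tensional"}   # dense-note emotions
HIGH_VALENCE = {"Happy", "Peaceful"}    # major-mode emotions

# Panel (A), EmotionBox: rows are target emotions, columns follow EMOTIONS.
CONFUSION = {
    "Happy":     [71, 28, 0, 1],
    "Tensional": [17, 74, 5, 4],
    "Sad":       [1, 8, 56, 35],
    "Peaceful":  [8, 4, 26, 63],
}

def axis_accuracy(target, high_set):
    """Share of responses landing on the target's side of one axis."""
    row = CONFUSION[target]
    same_side = sum(p for emo, p in zip(EMOTIONS, row)
                    if (emo in high_set) == (target in high_set))
    return same_side / sum(row)

print(axis_accuracy("Sad", HIGH_AROUSAL))  # 0.91: "Sad" or "Peaceful" responses
print(axis_accuracy("Sad", HIGH_VALENCE))  # 0.64: "Sad" or "Tensional" responses
```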
Figure 5. The mean accuracy and SD of the subjective evaluation test for classifying generated music samples into arousal and valence categories.
A post-hoc Bonferroni-adjusted pairwise comparison of each arousal and valence condition between the two methods.
| EmotionBox | Emotion-label-based model | p-value |
|---|---|---|
| High arousal | High arousal | 0.325 |
| Low arousal | Low arousal | **<0.01** |
| High valence | High valence | 0.891 |
| Low valence | Low valence | 0.220 |
A p-value less than 0.05 means a statistically significant difference at a confidence level of 5% and is presented in bold type.