Multi-Modal Song Mood Detection with Deep Learning
Konstantinos Pyrovolakis, Paraskevi Tzouveli, Giorgos Stamou.
Abstract
The production and consumption of music in the contemporary era results in big data generation and creates new needs for automated and more effective management of these data. Automated music mood detection constitutes an active task in the field of MIR (Music Information Retrieval). The first approach to correlating music and mood was made in 1990 by Gordon Bruner, who researched the way that musical emotion affects marketing. In 2016, Lidy and Schindler trained a CNN for the task of genre and mood classification based on audio. In 2018, Delbouys et al. developed a multi-modal Deep Learning system combining CNN and LSTM architectures and concluded that multi-modal approaches outperform single-channel models. This work examines and compares single-channel and multi-modal approaches to music mood detection using Deep Learning architectures. Our first approach utilizes the audio signal and the lyrics of a musical track separately, while the second applies a uniform multi-modal analysis to classify the given data into mood classes. The data we use to train and evaluate our models come from the MoodyLyrics dataset, which includes 2000 song titles labeled with one of four mood classes: {happy, angry, sad, relaxed}. The result of this work is a uniform prediction of the mood that represents a music track, which can serve many applications.
Keywords: BERT; convolutional neural networks; deep learning; digital signal processing; mood classification; natural language processing; transfer learning
Year: 2022 PMID: 35161804 PMCID: PMC8838547 DOI: 10.3390/s22031065
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Association between structural features of music and emotion [23].
| Structural Feature | Definition | Associated Emotion |
|---|---|---|
| Tempo | The speed or pace of a musical piece | Fast tempo: happiness, excitement, anger. Slow tempo: sadness, serenity. |
| Mode | The type of scale | Major tonality: happiness, joy. Minor tonality: sadness. |
| Loudness | The physical strength and amplitude of a sound | Intensity, power, or anger |
| Melody | The linear succession of musical tones that the listener perceives as a single entity | Complementing harmonies: happiness, relaxation, serenity. Clashing harmonies: excitement, anger, unpleasantness. |
| Rhythm | The regularly recurring pattern or beat of a song | Smooth/consistent rhythm: happiness, peace. Rough/irregular rhythm: amusement, uneasiness. Varied rhythm: joy. |
Figure 1. Mel Spectrogram, Log-Mel Spectrogram, MFCC.
Figure 2. Chroma, Tonal Centroids, Spectral Contrast.
Figure 3. Audio signal of a musical track.
Figure 4. Spectrogram of a musical track.
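The record includes no code, but the six audio representations named in Figures 1 and 2 are all standard librosa features; the sketch below shows one way to compute them. The file name, clip duration, and parameter values (`n_mels`, `n_mfcc`) are illustrative assumptions, not values taken from the paper.

```python
# Sketch: computing the audio representations from Figures 1-2 with librosa.
# File name, duration, and feature parameters are assumptions for illustration.
import librosa
import numpy as np

y, sr = librosa.load("track.mp3", sr=22050, duration=30.0)  # hypothetical input

# Mel spectrogram and its log-scaled (dB) version
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)

# Mel-frequency cepstral coefficients
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# Chroma, tonal centroids (Tonnetz), and spectral contrast
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)

print(mel.shape, log_mel.shape, mfcc.shape, chroma.shape, tonnetz.shape, contrast.shape)
```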
Figure 5. Discrimination of the four mood classes in the Circumplex model [18].
Song Classification.
| Valence (V) and Arousal (A) Values | Mood |
|---|---|
| V > 0, A > 0 | Happy |
| V < 0, A > 0 | Angry |
| V < 0, A < 0 | Sad |
| V > 0, A < 0 | Relaxed |
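Given the quadrant assignments above (inferred from the Circumplex model of Figure 5), mapping a track's valence/arousal pair to a MoodyLyrics class reduces to a two-branch sign test; a minimal sketch:

```python
def mood_from_va(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair to one of the four MoodyLyrics
    classes using the Circumplex quadrants of Figure 5."""
    if valence > 0:
        return "happy" if arousal > 0 else "relaxed"
    return "angry" if arousal > 0 else "sad"

assert mood_from_va(0.7, 0.6) == "happy"
assert mood_from_va(-0.5, 0.8) == "angry"
assert mood_from_va(-0.4, -0.3) == "sad"
assert mood_from_va(0.6, -0.5) == "relaxed"
```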
Figure 6. Architecture of Models.
Figure 7. Architecture of Models.
Figure 8. Train-Test split of Audio Data.
Figure 9. Train-Test split of Text Data.
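Figures 8 and 9 depict the train/test partition of the audio and text data. A comparable stratified split can be produced with scikit-learn; the 80/20 ratio below is an assumption for illustration, since this record does not state the exact proportions.

```python
# Sketch: a stratified train/test split of the 2000 MoodyLyrics tracks.
# The 80/20 ratio and the dummy data are assumptions for illustration.
from sklearn.model_selection import train_test_split

tracks = [f"track_{i}" for i in range(2000)]          # stand-ins for feature inputs
mood_labels = ["happy", "angry", "sad", "relaxed"] * 500  # one label per track

X_train, X_test, y_train, y_test = train_test_split(
    tracks, mood_labels,
    test_size=0.2,          # assumed 80/20 ratio
    stratify=mood_labels,   # preserve class balance across the split
    random_state=42,
)
print(len(X_train), len(X_test))  # 1600 400
```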
Figure 10. Accuracy of the lyrics models.
Figure 11. Loss of the lyrics models.
Evaluation values of the lyrics models with different embedding methods.
| Model | Embedding Method | Loss | Accuracy (%) |
|---|---|---|---|
| | BoW | 1.287 | 65.49 |
| | TF-IDF | 1.381 | 67.98 |
| | Word2Vec | 1.262 | 41.66 |
| | GloVe | 1.064 | 53.33 |
| | BERT | 1.353 | 69.11 |
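Each embedding method feeds the same downstream classifier, so swapping methods means swapping the featurizer. As a hedged illustration, the two strongest entries above (TF-IDF and BERT) could be produced as follows; `bert-base-uncased`, the mean-pooling step, and all parameters are common defaults assumed here, not configuration confirmed by the paper.

```python
# Sketch: producing TF-IDF and BERT representations of lyrics.
# Model name and parameters are common defaults, assumed for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer, AutoModel
import torch

lyrics = ["I feel so alive tonight", "Tears fall like rain"]  # toy examples

# TF-IDF: sparse bag-of-words weighted by inverse document frequency
tfidf = TfidfVectorizer(max_features=5000).fit_transform(lyrics)

# BERT: mean-pooled contextual token embeddings (transfer learning)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
batch = tok(lyrics, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    emb = bert(**batch).last_hidden_state.mean(dim=1)  # shape (2, 768)
```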
Evaluation values of feature matrices.
| Feature Combination | Accuracy (%) |
|---|---|
| Mel | 64.97 |
| Mel, Log-Mel | 68.38 |
| Mel, Chroma, Tonnetz, Spectral Contrast | 60.86 |
| Log-Mel, Chroma, Tonnetz, Spectral Contrast | 58.96 |
| MFCC, Chroma, Tonnetz, Spectral Contrast | 65.36 |
| Mel, Log-Mel, MFCC, Chroma, Tonnetz | 69.77 |
| Mel, Log-Mel, MFCC, Chroma, Tonnetz, Spectral Contrast | 70.34 |
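The combinations above suggest the per-feature matrices are stacked along the feature (row) axis into a single 2-D input for the audio CNN, which is plausible because librosa computes all of them over the same time frames when a common hop length is used. A minimal sketch with dummy matrices of the standard librosa row counts:

```python
# Sketch: stacking feature matrices into one CNN input plane.
# Dummy arrays with the default librosa row counts; frame count is assumed.
import numpy as np

n_frames = 1292
mel      = np.random.rand(128, n_frames)
log_mel  = np.random.rand(128, n_frames)
mfcc     = np.random.rand(20,  n_frames)
chroma   = np.random.rand(12,  n_frames)
tonnetz  = np.random.rand(6,   n_frames)
contrast = np.random.rand(7,   n_frames)

# Each row of the table corresponds to stacking a subset of these matrices.
full = np.vstack([mel, log_mel, mfcc, chroma, tonnetz, contrast])
print(full.shape)  # (301, 1292)
```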
Figure 12. Accuracy of the audio model.
Figure 13. Loss of the audio model.
Evaluation values of the two best lyrics models and the audio model.
| Model | Loss | Accuracy (%) |
|---|---|---|
| Lyrics (TF-IDF) | 1.381 | 67.98 |
| Lyrics (BERT) | 1.353 | 69.11 |
| Audio | 0.743 | 70.51 |
Figure 14. Accuracy of the multi-modal model.
Figure 15. Loss of the multi-modal model.
Evaluation values of the lyrics, audio, and multi-modal models.
| Model | Loss | Accuracy (%) | Computational Time |
|---|---|---|---|
| Lyrics (TF-IDF) | 1.381 | 67.98 | 0 m 25.391 s |
| Lyrics (BERT) | 1.353 | 69.11 | 18 m 12.444 s |
| Audio | 0.743 | 70.51 | 80 m 13.064 s |
| Multi-modal | 0.156 | 94.58 | 3 m 38.551 s |
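The jump from roughly 70% for the best single-channel model to 94.58% comes from fusing the two channels. A hedged Keras sketch of one common late-fusion pattern follows; the layer sizes, the mean-pooled 768-d BERT input, and the fusion strategy are illustrative assumptions, not the authors' exact architecture.

```python
# Sketch: a generic late-fusion model for the four mood classes.
# All layer sizes and the fusion design are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, Model

# Audio branch: CNN over the stacked feature matrix from the earlier sketch
audio_in = layers.Input(shape=(301, 1292, 1), name="audio_features")
x = layers.Conv2D(32, 3, activation="relu")(audio_in)
x = layers.MaxPooling2D(4)(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

# Lyrics branch: precomputed 768-d BERT sentence embedding
text_in = layers.Input(shape=(768,), name="bert_embedding")
t = layers.Dense(128, activation="relu")(text_in)

# Late fusion: concatenate both channels, then classify into 4 moods
fused = layers.Concatenate()([x, t])
fused = layers.Dense(64, activation="relu")(fused)
out = layers.Dense(4, activation="softmax", name="mood")(fused)

model = Model([audio_in, text_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The short training time reported for the multi-modal model (3 m 38.551 s versus 80 m for audio alone) is consistent with transfer learning, i.e., reusing pre-trained single-channel branches and training mainly the fusion layers, though this record does not state that explicitly.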
Comparison of our model with other research papers that use MoodyLyrics [18].
| Model | Accuracy (%) |
|---|---|
| | 94.78 |
| | 91.08 |
| | 75.60 |
| | 89.91 |
| | 72.24 |
| Our multi-modal model | 94.58 |