Baijun Xie, Jonathan C. Kim, Chung Hyuk Park
Abstract
This paper presents a method for extracting novel spectral features based on a sinusoidal model. The method is focused on characterizing the spectral shapes of audio signals using spectral peaks in frequency sub-bands. The extracted features are evaluated for predicting the levels of emotional dimensions, namely arousal and valence. Principal component regression, partial least squares regression, and deep convolutional neural network (CNN) models are used as prediction models for the levels of the emotional dimensions. The experimental results indicate that the proposed features include additional spectral information that common baseline features may not include. Since the quality of audio signals, especially timbre, plays a major role in affecting the perception of emotional valence in music, the inclusion of the presented features will contribute to decreasing the prediction error rate.
Keywords: Musical emotion recognition; deep learning; machine learning; principal component regression; sinusoidal model; spectral feature extraction
Year: 2020 PMID: 35582331 PMCID: PMC9109831 DOI: 10.3390/app10030902
Source DB: PubMed Journal: Appl Sci (Basel) ISSN: 2076-3417 Impact factor: 2.838
Figure 1. (a) Harmonic peaks selected by the SEEVOC peak-picking routine for a song in the database, (b) short-time Fourier transforms of a C major chord played by a piano with and without a tremolo effect, and (c) their corresponding autocorrelations, R.
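The figure illustrates peak picking on short-time spectra. The sketch below is not the SEEVOC routine itself, only a simplified illustration of the general idea: split the magnitude spectrum of a windowed frame into equal-width frequency sub-bands and keep the strongest local maximum in each band. The function name, band count, and windowing choice are assumptions for illustration.

```python
import numpy as np

def subband_peaks(frame, sr, n_bands=4):
    """Pick the strongest spectral peak in each frequency sub-band.

    Simplified stand-in for a peak-picking routine such as SEEVOC:
    the magnitude spectrum of a Hann-windowed frame is split into
    equal-width sub-bands, and the largest local maximum in each
    band is returned as a (frequency_hz, magnitude) pair.
    """
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # local maxima: bins strictly larger than both neighbours
    interior = (spec[1:-1] > spec[:-2]) & (spec[1:-1] > spec[2:])
    peak_bins = np.where(interior)[0] + 1
    peaks = []
    edges = np.linspace(0.0, freqs[-1], n_bands + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_band = peak_bins[(freqs[peak_bins] >= lo) & (freqs[peak_bins] < hi)]
        if in_band.size:
            best = in_band[np.argmax(spec[in_band])]
            peaks.append((freqs[best], spec[best]))
    return peaks
```

Feeding a pure 440 Hz tone through this returns a peak close to 440 Hz in the lowest sub-band; real music frames yield one dominant peak per band, which is the raw material for the sub-band spectral-shape features.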
List of statistical and regression measures applied to low-level descriptors.
| num. | description |
|---|---|
| 1 | maximum |
| 2 | minimum |
| 3 | mean |
| 4 | standard deviation |
| 5 | kurtosis |
| 6 | skewness |
| 7~9 | 1 |
| 10 | interquartile range |
| 11~12 | 1 |
| 13 | RMS value |
| 14 | slope of linear regression |
| 15 | approximation error of linear regression |
applied to A, f, ΔA, Δf, and the peak locations and magnitudes of R.
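The legible rows of the table are standard statistical functionals computed over each low-level descriptor contour (A, f, ΔA, Δf, and the autocorrelation peak tracks). A minimal numpy sketch of those measures, assuming excess kurtosis and an RMS-style approximation error for the linear fit (the garbled rows 7–9 and 11–12 are omitted):

```python
import numpy as np

def functionals(x):
    """Statistical and regression measures (legible table rows)
    applied to a low-level descriptor contour x, e.g. a peak
    amplitude track A or frequency track f."""
    x = np.asarray(x, dtype=float)
    mu, sd = x.mean(), x.std()
    z = (x - mu) / sd
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)      # linear regression fit
    resid = x - (slope * t + intercept)
    return {
        "maximum": x.max(),
        "minimum": x.min(),
        "mean": mu,
        "std": sd,
        "kurtosis": (z**4).mean() - 3.0,        # excess kurtosis
        "skewness": (z**3).mean(),
        "iqr": np.percentile(x, 75) - np.percentile(x, 25),
        "rms": np.sqrt((x**2).mean()),
        "lin_slope": slope,
        "lin_err": np.sqrt((resid**2).mean()),  # approximation error
    }
```

Applying this dictionary of functionals to every descriptor contour yields the fixed-length feature vector that the regression models consume.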
Figure 2. Overview of the proposed feature extraction method.
Figure 3. RMSEs for the arousal dimension using (a) principal component regression models and (b) partial least squares models. RMSEs for the valence dimension using (c) principal component regression models and (d) partial least squares models.
Regression prediction results using the baseline features, STC-based features, and the combined features. The three regression models used are principal component regression (PCR), partial least squares (PLS), and feedforward neural network (FF) models.
RMSE

| Features | Arousal (PLS) | Arousal (PCR) | Arousal (FF) | Valence (PLS) | Valence (PCR) | Valence (FF) |
|---|---|---|---|---|---|---|
| base. | 0.146 | 0.147 | 0.172 | 0.156 | 0.156 | 0.169 |
| STC | 0.149 | 0.150 | 0.179 | 0.156 | 0.157 | 0.172 |
| base+STC | 0.144 | 0.144 | 0.165 | 0.151 | 0.150 | 0.158 |

Pearson's coefficient

| Features | Arousal (PLS) | Arousal (PCR) | Arousal (FF) | Valence (PLS) | Valence (PCR) | Valence (FF) |
|---|---|---|---|---|---|---|
| base. | 0.785 | 0.781 | 0.732 | 0.601 | 0.612 | 0.570 |
| STC | 0.775 | 0.770 | 0.730 | 0.600 | 0.585 | 0.551 |
| base+STC | 0.793 | 0.793 | 0.754 | 0.630 | 0.634 | 0.609 |
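Principal component regression, one of the two linear models in the table, can be sketched in a few lines of numpy: standardize the features, project them onto the top principal components via SVD, and fit ordinary least squares in the reduced space. This is a generic minimal sketch, not the paper's exact configuration (component count, standardization details, and the partial least squares variant are not specified here).

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_test, n_components):
    """Minimal principal component regression: project standardized
    features onto the top principal components, then fit ordinary
    least squares (with intercept) in that reduced space."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    sd[sd == 0] = 1.0                            # guard constant columns
    Z = (X_train - mu) / sd
    # principal directions from the SVD of the standardized data
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:n_components].T                      # (n_features, k)
    T = Z @ V                                    # training scores
    design = np.column_stack([np.ones(len(T)), T])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    T_test = ((X_test - mu) / sd) @ V
    return np.column_stack([np.ones(len(T_test)), T_test]) @ coef
```

Sweeping `n_components` and tracking the validation RMSE reproduces the kind of curves shown in Figure 3 for the arousal and valence targets.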
Figure 4. The flow chart of the overall deep learning approach.
Figure 5. (a) Sample images from each class, and (b) the scatter plot of arousal vs. valence values.
Classification results using the transfer learning scheme. The pre-trained deep neural networks used are VGG, AlexNet, Inception, ResNet, DenseNet, and ResNext.
| Model Name | Top-1 Acc. | Top-5 Acc. | Best Performance | Trainable Params. |
|---|---|---|---|---|
| VGG-11 | 64.79±1.51% | 95.52±2.17% | 66.56% | 32,776 |
| VGG-13 | 65.05±1.32% | 96.62±1.83% | 65.61% | 32,776 |
| VGG-16 | 64.77±1.44% | 96.36±2.08% | 66.93% | 32,776 |
| VGG-19 | 64.58±1.49% | 95.86±2.06% | 66.77% | 32,776 |
| AlexNet | 65.00±1.01% | 96.07±1.74% | 65.61% | 32,776 |
| Inception-V3 | 64.53±1.44% | 95.36±1.64% | 66.40% | 16,392 |
| ResNet-18 | 64.86±1.16% | 95.45±2.05% | 66.78% | 4,104 |
| ResNet-34 | 65.04±1.34% | 95.85±1.91% | 66.72% | 4,104 |
| ResNet-50 | 65.24±1.35% | 95.63±1.66% | 67.42% | 16,392 |
| ResNet-101 | 65.11±1.35% | 96.23±1.84% | 67.10% | 16,392 |
| ResNet-152 | 65.31±1.02% | 95.92±1.69% | 66.61% | 16,392 |
| DenseNet-121 | 64.79±1.37% | 95.64±2.36% | 67.26% | 8,200 |
| DenseNet-169 | 64.94±1.45% | 96.68±2.13% | 67.10% | 13,320 |
| DenseNet-201 | 64.67±1.32% | 96.91±2.05% | 66.45% | 15,368 |
| ResNext-50 | 65.45±1.29% | 96.17±2.38% | 67.10% | 16,392 |
| ResNext-101 | 65.31±1.02% | 96.26±1.91% | 66.61% | 16,392 |
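The trainable-parameter counts in the table are consistent with a transfer-learning setup that freezes the pretrained backbone and trains only a single fully connected output layer over 8 classes: each count equals `feature_dim × 8 + 8` for the standard feature dimension of the corresponding backbone (4096 for VGG/AlexNet, 512 for ResNet-18/34, 2048 for ResNet-50+/Inception-V3/ResNext, 1024/1664/1920 for DenseNet-121/169/201). The 8-class figure is inferred from the counts, not stated in this record; the check below verifies the arithmetic only.

```python
def final_layer_params(feat_dim, n_classes=8):
    """Parameter count of one fully connected classification layer:
    a (feat_dim x n_classes) weight matrix plus n_classes biases."""
    return feat_dim * n_classes + n_classes
```

For example, `final_layer_params(512)` gives 4,104, matching the ResNet-18/34 rows, and `final_layer_params(4096)` gives 32,776, matching all VGG variants and AlexNet.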