| Literature DB >> 35459018 |
Maciej Blaszke, Bożena Kostek.
Abstract
The work aims to propose a novel approach for automatically identifying all instruments present in an audio excerpt using sets of individual convolutional neural networks (CNNs) per tested instrument. The paper starts with a review of tasks related to musical instrument identification. It focuses on tasks performed, input type, algorithms employed, and metrics used. The paper starts with the background presentation, i.e., metadata description and a review of related works. This is followed by showing the dataset prepared for the experiment and its division into subsets: training, validation, and evaluation. Then, the analyzed architecture of the neural network model is presented. Based on the described model, training is performed, and several quality metrics are determined for the training and validation sets. The results of the evaluation of the trained network on a separate set are shown. Detailed values for precision, recall, and the number of true and false positive and negative detections are presented. The model efficiency is high, with the metric values ranging from 0.86 for the guitar to 0.99 for drums. Finally, a discussion and a summary of the results obtained follows.Entities:
Keywords: deep learning; musical information retrieval; musical instrument identification
Year: 2022 PMID: 35459018 PMCID: PMC9025072 DOI: 10.3390/s22083033
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
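The approach described in the abstract, a set of individual binary CNN detectors, one per instrument, combined into a multi-label decision, can be sketched as below. The 0.5 threshold and the example scores are illustrative assumptions, not values from the paper:

```python
# Sketch of the multi-label decision stage implied by the abstract: one
# binary detector per instrument, each thresholded independently.
# The 0.5 threshold and the example scores are hypothetical.
THRESHOLD = 0.5

def detect(scores, threshold=THRESHOLD):
    """scores maps instrument name -> detector output in [0, 1]."""
    return [name for name, s in scores.items() if s >= threshold]

print(detect({"bass": 0.91, "drums": 0.97, "guitar": 0.30, "piano": 0.62}))
# ['bass', 'drums', 'piano']
```

Because each detector is thresholded on its own, any subset of instruments (including none) can be reported for a mix, which is what distinguishes this from single-label classification.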
Related work.
| Authors | Year | Task | Input Type | Algorithm | Metrics |
|---|---|---|---|---|---|
| Avramidis K., Kratimenos A., Garoufis C., Zlatintsi A., Maragos P. | 2021 | Predominant instrument recognition | Raw audio | RNN (recurrent neural network), CNN (convolutional neural network), and CRNN (convolutional recurrent neural network) | LRAP (label ranking average precision)—0.747 |
| Kratimenos A., Avramidis K., Garoufis C., Zlatintsi A., Maragos P. | 2021 | Instrument identification | CQT (constant-Q transform) | CNN | LRAP—0.805 |
| Zhang F. | 2021 | Genre detection | MIDI music | RNN | Accuracy—89.91% |
| Shreevathsa P. K., Harshith M., A. R. M., Ashwini | 2020 | Single instrument classification | MFCC (mel-frequency cepstral coefficients) | ANN (artificial neural network) and CNN | ANN accuracy—72.08% |
| Blaszke M., Koszewski D., Zaporowski S. | 2019 | Single instrument classification | MFCC | CNN | Precision—0.99 |
| Das O. | 2019 | Single instrument classification | MFCC and WLPC (warped linear predictive coding) | Logistic regression and SVM (support vector machine) | Accuracy—100% |
| Gururani S., Summers C., Lerch A. | 2018 | Instrument identification | MFCC | CNN and CRNN | AUC ROC (area under the receiver operating characteristic curve)—0.81 |
| Rosner A., Kostek B. | 2018 | Genre detection | FV (feature vector) | SVM | Accuracy—72% |
| Choi K., Fazekas G., Sandler M., Cho K. | 2017 | Audio tagging | MFCC | CRNN | AUC ROC—0.65–0.98 |
| Han Y., Kim J., Lee K. | 2017 | Predominant instrument recognition | MFCC | CNN | F1 score macro—0.503 |
| Pons J., Slizovskaia O., Gong R., Gómez E., Serra X. | 2017 | Predominant instrument recognition | MFCC | CNN | F1 score micro—0.503 |
| Bhojane S. B., Labhshetwar O. G., Anand K., Gulhane S. R. | 2017 | Single instrument classification | FV | k-NN (k-nearest neighbors) | A system that listens to a musical instrument tone and recognizes it (no metrics reported) |
| Lee J., Kim T., Park J., Nam J. | 2017 | Instrument identification | Raw audio | CNN | AUC ROC—0.91 |
| Li P., Qian J., Wang T. | 2015 | Instrument identification | Raw audio, MFCC, and CQT | CNN | Accuracy—82.74% |
| Giannoulis D., Benetos E., Klapuri A., Plumbley M. D. | 2014 | Instrument identification | CQT | Missing-feature approach with AMT (automatic music transcription) | F1—0.52 |
| Giannoulis D., Klapuri A. | 2013 | Instrument recognition in polyphonic audio | A variety of acoustic features | Local spectral features and missing-feature techniques, mask probability estimation | Accuracy—67.54% |
| Bosch J. J., Janer J., Fuhrmann F., Herrera P. | 2012 | Predominant instrument recognition | Raw audio | SVM | F1 score micro—0.503 |
| Heittola T., Klapuri A., Virtanen T. | 2009 | Instrument recognition in polyphonic audio | MFCC | NMF (non-negative matrix factorization) and GMM (Gaussian mixture model) | F1 score—0.62 |
| Essid S., Richard G., David B. | 2006 | Single instrument classification | MFCC and FV | GMM | Accuracy—93% |
| Kostek B. | 2004 | Single instrument classification (12 instruments) | Combined MPEG-7 and wavelet-based FVs | ANN | Accuracy—72.24% |
| Eronen A. | 2003 | Single instrument classification | MFCC | ICA (independent component analysis), ML, and HMM (hidden Markov model) | Accuracy—62–85% |
| Kitahara T., Goto M., Okuno H. | 2003 | Single instrument classification | FV | Discriminant function | Recognition rate—79.73% |
| Tzanetakis G., Cook P. | 2002 | Genre detection | FV and MFCC | SPR (statistical pattern recognition) | Accuracy—61% |
| Kostek B., Czyżewski A. | 2001 | Single instrument classification | FV | ANN | Accuracy—94.5% |
| Eronen A., Klapuri A. | 2000 | Single instrument classification | FV | k-NN | Accuracy—80% |
| Marques J., Moreno P. J. | 1999 | Single instrument classification | MFCC | GMM and SVM | Error rate—17% |
Figure 1. Example of spectrograms of selected instruments and the prepared mix.
Figure 2. Histogram of the instrument classes in the mixes.
Figure 3. Histogram of instruments in the mixes.
Figure 4. Model architecture.
Figure 5. Example of the receiver operating characteristic and the area under the curve.
Figure 6. Metrics achieved by the algorithm on the training set.
Figure 7. Metrics achieved by the algorithm on the validation set.
Figure 8. ROC curves for each instrument tested.
Results per instrument.
| Metric | Bass | Drums | Guitar | Piano |
|---|---|---|---|---|
| Precision | 0.94 | 0.99 | 0.82 | 0.87 |
| Recall | 0.94 | 0.99 | 0.82 | 0.91 |
| F1 score | 0.95 | 0.99 | 0.82 | 0.89 |
| True positive | 5139 | 6126 | 2683 | 3811 |
| True negative | 1072 | 578 | 2921 | 2039 |
| False positive | 288 | 38 | 597 | 589 |
| False negative | 301 | 58 | 599 | 361 |
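The tabulated precision, recall, and F1 values can be re-derived from the raw detection counts; a minimal sketch is below. Small deviations from the table come from rounding (e.g., the bass precision computes to 0.947, while the table lists 0.94, which suggests truncation rather than rounding there):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 score from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# (tp, fp, fn) per instrument, copied from the table above
counts = {
    "bass":   (5139, 288, 301),
    "drums":  (6126, 38, 58),
    "guitar": (2683, 597, 599),
    "piano":  (3811, 589, 361),
}
for name, (tp, fp, fn) in counts.items():
    p, r, f = prf(tp, fp, fn)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```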
Confusion matrix (in percentage points).
| Predicted \ Ground Truth [%] | Bass | Guitar | Piano | Drums |
|---|---|---|---|---|
| Bass | 81 | 8 | 7 | 0 |
| Guitar | 4 | 69 | 13 | 0 |
| Piano | 5 | 12 | 77 | 0 |
| Drums | 3 | 7 | 6 | 82 |
Architecture comparison between the unified submodel and the modified guitar and drums submodels.
| Block | Unified Submodel | Guitar Submodel | Drums Submodel |
|---|---|---|---|
| 1 | 2D convolution: kernel 3 × 3, filters 128; pooling kernel 2 × 2 | 2D convolution: kernel 3 × 3, filters 256; pooling kernel 2 × 2 | 2D convolution: kernel 3 × 3, filters 64; pooling kernel 2 × 2 |
| 2 | 2D convolution: kernel 3 × 3, filters 62; pooling kernel 2 × 2 | 2D convolution: kernel 3 × 3, filters 128; pooling kernel 2 × 2 | 2D convolution: kernel 3 × 3, filters 32; pooling kernel 2 × 2 |
| 3 | 2D convolution: kernel 3 × 3, filters 32; pooling kernel 2 × 2 | 2D convolution: kernel 3 × 3, filters 64; pooling kernel 2 × 2 | 2D convolution: kernel 3 × 3, filters 16; pooling kernel 2 × 2 |
| 4 | Dense layer: units 64 | Dense layer | Dense layer |
| 5 | Dense layer | Dense layer | Dense layer |
| 6 | Dense layer | Dense layer | Dense layer |
| 7 | Dense layer | Dense layer | Dense layer |
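Assuming each convolutional block keeps spatial size through its 3 × 3 convolution ('same' padding) and then applies 2 × 2 pooling, the feature-map shapes of the unified submodel can be traced with simple shape arithmetic. Both the padding scheme and the 128 × 128 input spectrogram size are assumptions; the record states neither:

```python
# Traces feature-map shapes through the unified submodel's three
# convolutional blocks. Assumptions (not stated in this record):
# 'same'-padded 3x3 convolutions and 2x2 pooling after each block.
def block(h, w, pool=2):
    # 'same'-padded conv keeps h x w; 2x2 pooling halves each dimension
    return h // pool, w // pool

h, w = 128, 128          # hypothetical input spectrogram size
filters = [128, 62, 32]  # per-block filter counts from the table
for i, f in enumerate(filters, 1):
    h, w = block(h, w)
    print(f"block {i}: {h} x {w} x {f}")
```

Under these assumptions the third block yields a 16 × 16 × 32 tensor, which is then flattened and passed through the dense layers listed in rows 4–7.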
Figure 9. AUC ROC achieved by the algorithms on the training set.
Figure 10. AUC ROC achieved by the algorithms on the validation set.
Overall results for the unified model and the model after modifications.
| Metric | Unified Model | Modified Model |
|---|---|---|
| Precision | 0.92 | 0.93 |
| Recall | 0.93 | 0.93 |
| AUC ROC | 0.96 | 0.96 |
| F1 score | 0.93 | 0.93 |
| True positive | 17,759 | 17,989 |
| True negative | 6610 | 6851 |
| False positive | 1512 | 1380 |
| False negative | 1319 | 1380 |
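The unified-model counts in this table are exactly the sums of the counts reported in the earlier per-instrument table, so the overall (micro-averaged) precision and recall can be reproduced directly:

```python
# Per-instrument (tp, fp, fn) counts from the per-instrument results table.
per_instrument = {
    "bass":   (5139, 288, 301),
    "drums":  (6126, 38, 58),
    "guitar": (2683, 597, 599),
    "piano":  (3811, 589, 361),
}
tp = sum(c[0] for c in per_instrument.values())
fp = sum(c[1] for c in per_instrument.values())
fn = sum(c[2] for c in per_instrument.values())
print(tp, fp, fn)  # 17759 1512 1319 -- matches the unified-model row
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(round(precision, 2), round(recall, 2))  # 0.92 0.93
```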
Per-instrument results for the unified and modified submodels.
| Metric | Drums (Unified) | Drums (Modified) | Guitar (Unified) | Guitar (Modified) |
|---|---|---|---|---|
| Precision | 0.99 | 0.99 | 0.82 | 0.86 |
| Recall | 0.99 | 0.99 | 0.82 | 0.80 |
| F1 score | 0.99 | 0.99 | 0.82 | 0.83 |
| True positive | 6126 | 6232 | 2683 | 2647 |
| True negative | 578 | 570 | 2921 | 3150 |
| False positive | 38 | 47 | 597 | 444 |
| False negative | 58 | 51 | 599 | 659 |
Figure 11. ROC curves for each instrument tested on the unified and modified models.
Figure 12. Reduced heatmaps for the last convolutional layers per identified instrument.