Arthur Flexer, Thomas Grill.
Abstract
One of the central goals of Music Information Retrieval (MIR) is the quantification of similarity between or within pieces of music. These quantitative relations should mirror the human perception of music similarity, which is, however, highly subjective and shows low inter-rater agreement. Unfortunately, this principal problem has received little attention in MIR so far. Since it is not meaningful to have computational models that go beyond the level of human agreement, these levels of inter-rater agreement present a natural upper bound for any algorithmic approach. We illustrate this fundamental problem in the evaluation of MIR systems using results from two typical application scenarios: (i) modelling of music similarity between pieces of music; (ii) music structure analysis within pieces of music. For both applications, we derive upper bounds of performance which are due to the limited inter-rater agreement. We compare these upper bounds to the performance of state-of-the-art MIR systems and show how the upper bounds prevent further progress in developing better MIR systems.
Keywords: evaluation; information retrieval; perception
Year: 2016 PMID: 28190932 PMCID: PMC5256035 DOI: 10.1080/09298215.2016.1200631
Source DB: PubMed Journal: J New Music Res ISSN: 0929-8215 Impact factor: 1.143
Figure 1. Average FINE score inter-rater agreement for different intervals of FINE scores (solid line) ± one standard deviation (dash-dot lines). Dashed line indicates theoretical perfect agreement.
Figure 2. Average FINE score of the best performing system (y-axis) versus year (x-axis), plotted as circles connected by a thick solid line. The three upper bounds are plotted as horizontal lines (solid, dashed and dash-dot).
Comparison of the best system per year versus the three upper bounds due to low inter-rater agreement. Mean FINE scores plus standard deviations and t-test statistics are shown. Differences that are not statistically significant are given in bold.
| year | system | mean FINE | t( | t( | t( |
|---|---|---|---|---|---|
| 2006 | EP | 43.01 | |||
| 2007 | PS | 56.75 | |||
| 2009 | PS2 | 64.58 | |||
| 2010 | SSPK2 | 56.64 | |||
| 2011 | SSPK2 | 58.64 | |||
| 2012 | SSKS2 | 53.19 | |||
| 2013 | SS2 | 55.21 | |||
| 2014 | SS2 | 49.35 |
Figure 3. Inter-rater scores plotted as a histogram over all double-annotated pieces contained in the SALAMI data set for a tolerance of 0.5 s. Mean value plotted as a dashed line.
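The role of the tolerance in such inter-rater scores can be illustrated with a minimal sketch. It assumes the boundary score is a hit-rate F-measure, which is common in structure analysis but is not spelled out in the caption above. The function name, the greedy matching and the example annotations are hypothetical and are not taken from the paper.

```python
def boundary_f_measure(reference, estimate, tolerance=0.5):
    """Hit-rate F-measure between two lists of boundary times in seconds.

    A boundary in one list counts as a hit if it lies within `tolerance`
    seconds of a not yet matched boundary in the other list.
    """
    ref = sorted(reference)
    est = sorted(estimate)
    if not ref or not est:
        return 0.0

    hits = 0
    i = j = 0
    # Two-pointer greedy matching over the sorted boundary lists.
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif ref[i] < est[j]:
            i += 1
        else:
            j += 1

    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical boundary annotations (seconds) by two raters of one piece.
rater_a = [0.0, 12.3, 45.1, 80.0]
rater_b = [0.2, 12.9, 44.8, 60.0, 80.3]

strict = boundary_f_measure(rater_a, rater_b, tolerance=0.5)
lenient = boundary_f_measure(rater_a, rater_b, tolerance=3.0)
```

Swapping the two raters exchanges precision and recall but leaves the F-measure unchanged, so a score of this kind can serve as a symmetric agreement value for a double-annotated piece. The wider tolerance accepts more boundary pairs as matching and therefore yields a higher agreement.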
Measures (mean and standard deviation) for lower and upper bounds within the SALAMI data set.
| tolerance | ||
|---|---|---|
Measures (mean and standard deviation) for lower and upper bounds within different genre classes of the SALAMI data set (tolerance is 0.5 s).
| genre class | ||
|---|---|---|
| popular | ||
| jazz | ||
| classical | ||
| world |
Comparison of best algorithm per MIREX edition on the SALAMI data set versus upper bound for a tolerance of 0.5 s. Boundary recognition mean values and standard deviations, and paired t-test statistics are shown.
| year | algorithm | ||
|---|---|---|---|
| 2012 | KSP2 | ||
| 2013 | MP2 | ||
| 2014 | SUG1 | ||
| 2015 | GS3 |
Comparison of best algorithm per MIREX edition on the SALAMI data set versus upper bound for a tolerance of 3 s. Boundary recognition mean values and standard deviations, and paired t-test statistics are shown.
| year | algorithm | ||
|---|---|---|---|
| 2012 | KSP3 | ||
| 2013 | MP2 | ||
| 2014 | SUG2 | ||
| 2015 | GS3 |
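A paired t-test of the kind named in the two captions above can be sketched as follows. This is a generic textbook formulation using only the Python standard library, not the authors' exact procedure. The function name and the per-piece scores are hypothetical, since the table values are not available here.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for a paired t-test on two score lists.

    Both lists must hold one score per piece, in the same piece order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-piece scores: algorithm output versus inter-rater agreement.
algorithm_scores = [0.50, 0.40, 0.60, 0.50]
inter_rater_scores = [0.70, 0.60, 0.70, 0.80]

t_value, dof = paired_t_statistic(algorithm_scores, inter_rater_scores)
```

A negative t value indicates that the algorithm scores lie below the inter-rater scores on average. Whether the difference is significant is then read from the t distribution with the returned degrees of freedom.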