Arijit Ghosal, Rudrasis Chakraborty, Bibhas Chandra Dhara, Sanjoy Kumar Saha.
Abstract
Audio classification is a fundamental step for many applications, such as content-based audio retrieval and audio indexing. In this work, we present a novel scheme for classifying audio signals into three categories: speech, music without voice (instrumental), and music with voice (song). A hierarchical approach is adopted to classify the signals. At the first stage, signals are categorized as speech or music using an audio texture derived from simple features, namely zero crossing rate (ZCR) and short time energy (STE). The proposed audio texture captures contextual information and summarizes the frame-level features. At the second stage, music is further classified as instrumental or song based on Mel frequency cepstral co-efficients (MFCC). A classifier based on Random Sample and Consensus (RANSAC), capable of handling a wide variety of data, is used. Experimental results indicate the effectiveness of the proposed scheme.
Keywords: Audio texture; Instrumental/Song classification; Mel frequency cepstral co-efficient; Random sample and consensus; Speech/Music classification
Year: 2013 PMID: 25694856 PMCID: PMC4322669 DOI: 10.1186/2193-1801-2-526
Source DB: PubMed Journal: Springerplus ISSN: 2193-1801
Figure 1: 2-D contour plots of co-occurrence matrices of speech and music signals: (a) ZCR based co-occurrence matrices; (b) STE based co-occurrence matrices. The X and Y axes show ZCR (STE) bins, and the values of the matrix elements are denoted by different colours.
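To make the idea of a ZCR or STE based co-occurrence matrix concrete, the following is a minimal sketch rather than the authors' implementation. It computes frame-level ZCR and STE, quantizes one feature into bins, and counts how often each pair of bins occurs in neighbouring frames. The function names, the frame length, the hop size, the number of bins, and the frame lag are all illustrative assumptions that do not come from the paper.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zcr(frames):
    """Zero crossing rate: fraction of adjacent sample pairs whose sign differs."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def ste(frames):
    """Short time energy: mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)

def cooccurrence(values, n_bins=8, lag=1):
    """Count how often bin i (frame t) is followed by bin j (frame t + lag).

    n_bins and lag are assumed values chosen only for illustration.
    """
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Interior edges map each value to a bin index in 0 .. n_bins - 1.
    bins = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)
    C = np.zeros((n_bins, n_bins), dtype=int)
    np.add.at(C, (bins[:-lag], bins[lag:]), 1)
    return C

# Synthetic demo: one second of a 440 Hz tone followed by one second of noise.
rng = np.random.default_rng(0)
sr = 8000  # assumed sampling rate
t = np.arange(sr) / sr
x = np.concatenate([np.sin(2 * np.pi * 440 * t), rng.standard_normal(sr)])
frames = frame_signal(x, frame_len=400, hop=200)
zcr_texture = cooccurrence(zcr(frames))
ste_texture = cooccurrence(ste(frames))
print(zcr_texture)
print(ste_texture)
```

Summary statistics of such a matrix could then serve as texture features for a classifier; which statistics the paper actually uses is not stated in this record.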
Figure 2: Different types of audio signals and their MFCC plots.
Overall accuracy (in %) for direct classification into speech, instrumental and song
| Classification scheme | T1: Audio texture (A) | T1: MFCC (B) | T1: A+B | T2: Audio texture (A) | T2: MFCC (B) | T2: A+B |
|---|---|---|---|---|---|---|
| K-means | 54.00 | 52.67 | 54.88 | 51.19 | 50.00 | 52.67 |
| MLP | 62.00 | 71.78 | 60.00 | 60.68 | 69.14 | 59.20 |
| RANSAC | 79.77 | 72.22 | 79.33 | 78.88 | 70.18 | 79.08 |
Accuracy (in %) of speech, music classification
| Classification scheme | Experiment setup | Type of signals | Feature set: ZCR, STE based features | Feature set: | Feature set: Proposed audio texture |
|---|---|---|---|---|---|
| K-means | T1 | Speech | 50.55 | 60.00 | 73.89 |
| | | Music | 74.07 | 74.07 | 85.93 |
| | | Overall | 64.67 | 68.08 | 81.11 |
| | T2 | Speech | 48.15 | 57.52 | 71.85 |
| | | Music | 71.04 | 72.10 | 84.40 |
| | | Overall | 61.87 | 65.91 | 79.38 |
| MLP | T1 | Speech | 71.11 | 78.50 | 78.33 |
| | | Music | 90.37 | 75.92 | 88.15 |
| | | Overall | 82.67 | 77.02 | 84.22 |
| | T2 | Speech | 68.52 | 74.92 | 74.07 |
| | | Music | 86.63 | 72.84 | 84.40 |
| | | Overall | 79.38 | 73.72 | 80.27 |
| SVM | T1 | Speech | 73.89 | 86.00 | 78.33 |
| | | Music | 90.74 | 81.48 | 89.26 |
| | | Overall | 84.00 | 83.40 | 84.89 |
| | T2 | Speech | 69.63 | 83.28 | 76.67 |
| | | Music | 88.86 | 79.75 | 85.40 |
| | | Overall | 81.16 | 81.25 | 81.90 |
| RANSAC | T1 | Speech | 75.00 | 88.00 | 96.11 |
| | | Music | 92.96 | 85.93 | 97.78 |
| | | Overall | 85.78 | 86.80 | 97.11 |
| | T2 | Speech | 74.07 | 85.62 | 93.70 |
| | | Music | 90.59 | 82.47 | 93.81 |
| | | Overall | 83.98 | 83.81 | 93.77 |
Comparison of performance for speech, music classification
| Methodology | Speech T1 (%) | Speech T2 (%) | Music T1 (%) | Music T2 (%) | Overall T1 (%) | Overall T2 (%) |
|---|---|---|---|---|---|---|
| Sadjadi’s method | 88.89 | 86.67 | 84.44 | 82.67 | 86.22 | 84.27 |
| Proposed method | 95.55 | 93.70 | 97.78 | 93.81 | 96.89 | 93.77 |
Accuracy (in %) of instrumental, song classification
| Classification scheme | Instrumental T1 | Instrumental T2 | Song T1 | Song T2 | Overall T1 | Overall T2 |
|---|---|---|---|---|---|---|
| K-means | 40.07 | 39.60 | 92.60 | 87.13 | 66.67 | 63.37 |
| MLP | 62.22 | 59.41 | 94.07 | 89.60 | 78.15 | 74.50 |
| SVM | 77.04 | 74.75 | 82.96 | 81.68 | 80.00 | 78.21 |
| RANSAC | 94.81 | 90.59 | 91.11 | 88.61 | 92.96 | 89.60 |
Comparison of performance (in %) for instrumental, song classification
| Methodology | Instrumental T1 | Instrumental T2 | Song T1 | Song T2 | Overall T1 | Overall T2 |
|---|---|---|---|---|---|---|
| Zhang’s feature and SVM | 64.44 | 62.87 | 82.22 | 80.69 | 73.33 | 71.78 |
| Zhang’s feature and RANSAC | 74.81 | 73.27 | 77.78 | 76.73 | 76.30 | 75.00 |
| Proposed method | 94.81 | 90.59 | 91.11 | 88.61 | 92.96 | 89.60 |