Abstract
Audio information plays an important role in the growing volume of digital content available today, resulting in a need for methodologies that automatically analyze such content: audio event recognition for home automation and surveillance systems, speech recognition, music information retrieval, multimodal analysis (e.g. audio-visual analysis of online videos for content-based recommendation), etc. This paper presents pyAudioAnalysis, an open-source Python library that provides a wide range of audio analysis procedures including: feature extraction, classification of audio signals, supervised and unsupervised segmentation and content visualization. pyAudioAnalysis is licensed under the Apache License and is available at GitHub (https://github.com/tyiannak/pyAudioAnalysis/). Here we present the theoretical background behind the wide range of the implemented methodologies, along with evaluation metrics for some of the methods. pyAudioAnalysis has already been used in several audio analysis research applications: smart-home functionalities through audio event detection, speech emotion recognition, depression classification based on audio-visual features, music segmentation, multimodal content-based movie recommendation and health applications (e.g. monitoring eating habits). The feedback from these applications has led to practical enhancements of the library.
Year: 2015 PMID: 26656189 PMCID: PMC4676707 DOI: 10.1371/journal.pone.0144610
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Related Work.
| Name | Description |
|---|---|
| Yaafe | A Python library for audio feature extraction and basic audio I/O. |
| Essentia | An open-source C++ library for audio analysis and music information retrieval. It mostly focuses on audio feature extraction and basic I/O, while also providing some basic classification functionality. |
| aubio | A C library for basic audio analysis: pitch tracking, onset detection, extraction of MFCCs, beat and meter tracking, etc. Provides wrappers for Python. |
| CLAM (C++ Library for Audio and Music) | A framework for research / development in the audio and music domain. Provides the means to perform complex audio signal analysis, transformations and synthesis. Also provides a graphical tool. |
| Matlab Audio Analysis Library | A Matlab library for audio feature extraction, classification, segmentation and music information retrieval. |
| librosa | A Python library that implements some audio features (MFCCs, chroma and beat-related features), sound decomposition into harmonic and percussive components, audio effects (e.g. pitch shifting) and some basic communication with machine learning components (e.g. clustering). |
| PyCASP | This Python library focuses on providing a collection of specializers towards automatic mapping of computations onto parallel processing units (either GPUs or multicore CPUs). These computations are presented through a couple of audio-related examples. |
| seewave | An R package for basic sound analysis and synthesis, mostly focusing on feature extraction and basic I/O. |
| bob | An open-source general signal processing and machine learning library (C++ and Python). |
A list of related libraries and packages focusing on audio analysis.
Fig 1. Library General Diagram.
Fig 2. pyAudioAnalysis provides easy-to-use and high-level Python wrappers for several audio analysis tasks.
Audio Features.
| Index | Name | Description |
|---|---|---|
| 1 | Zero Crossing Rate | The rate of sign-changes of the signal during the duration of a particular frame. |
| 2 | Energy | The sum of squares of the signal values, normalized by the respective frame length. |
| 3 | Entropy of Energy | The entropy of sub-frames’ normalized energies. It can be interpreted as a measure of abrupt changes. |
| 4 | Spectral Centroid | The center of gravity of the spectrum. |
| 5 | Spectral Spread | The second central moment of the spectrum. |
| 6 | Spectral Entropy | Entropy of the normalized spectral energies for a set of sub-frames. |
| 7 | Spectral Flux | The squared difference between the normalized magnitudes of the spectra of two successive frames. |
| 8 | Spectral Rolloff | The frequency below which 90% of the magnitude distribution of the spectrum is concentrated. |
| 9–21 | MFCCs | Mel Frequency Cepstral Coefficients form a cepstral representation where the frequency bands are not linear but distributed according to the mel-scale. |
| 22–33 | Chroma Vector | A 12-element representation of the spectral energy where the bins represent the 12 equal-tempered pitch classes of western-type music (semitone spacing). |
| 34 | Chroma Deviation | The standard deviation of the 12 chroma coefficients. |
Complete list of implemented audio features. Each short-term window is represented by a vector of the 34 features listed in the table.
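To make the definitions above concrete, a few of the listed features can be computed for a single frame in plain NumPy. This is a minimal sketch for illustration; pyAudioAnalysis's own implementation differs in details such as windowing and normalization.

```python
import numpy as np

def frame_features(frame, fs):
    # Zero Crossing Rate (feature 1): rate of sign changes within the frame
    zcr = np.sum(np.abs(np.diff(np.sign(frame)))) / (2.0 * (len(frame) - 1))
    # Energy (feature 2): sum of squares, normalized by the frame length
    energy = np.sum(frame ** 2) / len(frame)
    # Spectral Centroid (feature 4): center of gravity of the magnitude spectrum
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    centroid = np.sum(freqs * mag) / (np.sum(mag) + np.finfo(float).eps)
    return zcr, energy, centroid
```

For a pure 1 kHz tone sampled at 16 kHz, the spectral centroid lands near 1000 Hz and the ZCR near 2·1000/16000 = 0.125, matching the definitions in the table.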
Fig 3. Local maxima detection for beat extraction.
An example of local maxima detection on each of the adopted short-term features. The time distances between successive local maxima are used in the beat extraction process.
Fig 4. Beat histogram example.
An aggregated histogram of time distances between successive feature local maxima. The histogram’s maximum position is used to estimate the BPM rate.
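The beat estimation described by the two figures above can be sketched as follows; `st_step` (the short-term step in seconds) and the 40-frame lag cap are illustrative assumptions, and SciPy's `find_peaks` stands in for the local-maxima detection.

```python
import numpy as np
from scipy.signal import find_peaks

def estimate_bpm(feature_sequences, st_step=0.05, max_lag=40):
    # Histogram of distances (in frames) between successive local maxima,
    # aggregated over all short-term feature sequences (cf. Figs 3 and 4).
    hist = np.zeros(max_lag + 1)
    for seq in feature_sequences:
        peaks, _ = find_peaks(seq)      # indices of local maxima
        for d in np.diff(peaks):        # distances between successive maxima
            if d <= max_lag:
                hist[d] += 1
    best = hist.argmax()                # histogram maximum: dominant lag
    if best == 0:
        return 0.0                      # no periodicity detected
    return 60.0 / (best * st_step)      # lag (frames) -> seconds -> BPM
```

A feature sequence peaking every 10 frames with a 50 ms step corresponds to a 0.5 s beat period, i.e. 120 BPM.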
Fig 5. Segmentation Example.
Supervised segmentation results and statistics for a radio recording. A binary speech vs music classifier is used to classify each fixed-size segment.
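Fixed-window supervised segmentation amounts to classifying successive groups of short-term feature vectors and merging runs of equal labels. A minimal sketch, where `classify` is any per-segment classifier (the library uses a trained model such as an SVM) and the window size `win` is an illustrative choice:

```python
import numpy as np

def fixed_size_segmentation(features, classify, win=5):
    # Classify each fixed-size group of short-term feature vectors
    # by its mean (mid-term) feature vector.
    labels = []
    for start in range(0, len(features) - win + 1, win):
        labels.append(classify(features[start:start + win].mean(axis=0)))
    # Merge consecutive equal labels into (start, end, label) segments.
    segments, seg_start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[seg_start]:
            segments.append((seg_start * win, i * win, labels[seg_start]))
            seg_start = i
    return segments
```

Segment boundaries are expressed in short-term frame indices here; multiplying by the short-term step gives times in seconds.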
HMM joint segmentation-classification performance.
| Method | Accuracy |
|---|---|
| Fixed-size window kNN | 93.1% |
| Fixed-size window SVM | 94.6% |
| HMM | 95.1% |
Average accuracy of each segmentation-classification method on a radio broadcasting dataset.
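The HMM row corresponds to decoding the most likely class sequence jointly over all windows instead of deciding each window independently. A generic Viterbi sketch (the transition, start and emission probabilities below are illustrative, not the library's estimates):

```python
import numpy as np

def viterbi_smooth(log_emissions, log_trans, log_start):
    # log_emissions: (n_windows, n_classes) per-window class log-probabilities
    # log_trans: (n_classes, n_classes) transition log-probabilities
    # Returns the jointly most likely class sequence.
    n, k = log_emissions.shape
    delta = log_start + log_emissions[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        scores = delta[:, None] + log_trans      # scores[prev, cur]
        back[t] = scores.argmax(axis=0)          # best predecessor per class
        delta = scores.max(axis=0) + log_emissions[t]
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):                # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With "sticky" transitions, a single noisy window inside a homogeneous run is smoothed away, which is why the HMM outperforms the fixed-window classifiers in the table.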
Fig 6. Silence Removal Example.
An example of applying the silence removal method to an audio recording. The upper subfigure shows the audio signal, while the lower subfigure shows the SVM probabilistic sequence.
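The final step, turning the SVM probabilistic sequence into non-silent segments, can be sketched by thresholding the probabilities. The threshold placement below (between the average low and high probabilities) is a simplification of the library's dynamic thresholding, and `step` is the assumed short-term step in seconds.

```python
import numpy as np

def nonsilent_segments(probs, step=0.05, weight=0.5):
    # Place the threshold between the mean of the lowest and highest
    # 10% of the per-window "active" probabilities.
    probs = np.asarray(probs, dtype=float)
    sorted_p = np.sort(probs)
    n = max(1, len(probs) // 10)
    thr = (1 - weight) * sorted_p[:n].mean() + weight * sorted_p[-n:].mean()
    # Group above-threshold windows into (start, end) times in seconds.
    active = probs > thr
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * step, i * step))
            start = None
    if start is not None:
        segments.append((start * step, len(active) * step))
    return segments
```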
Speaker diarization performance.
| FLsD | Feature set | ACP (%) | ASP (%) | F1 (%) |
|---|---|---|---|---|
| No | MFCCs (mean) | 77 | 72.5 | 74.5 |
| Yes | MFCCs (mean) | 83 | 80 | 81.5 |
| No | MFCCs (mean), Gender | 73 | 70 | 71.5 |
| Yes | MFCCs (mean), Gender | 83 | 81 | 82 |
| No | MFCCs (mean-std) | 76 | 72 | 74 |
| Yes | MFCCs (mean-std) | 84 | 82 | 83 |
| No | MFCCs (mean-std), Gender | 78 | 73 | 75.5 |
| Yes | MFCCs (mean-std), Gender | 83 | 81 | 82 |
| No | MFCCs (mean-std), Spectral | 70 | 63 | 66.5 |
| Yes | MFCCs (mean-std), Spectral | 85 | 81 | 83 |
| No | MFCCs (mean-std), Spectral, Gender | 68 | 61 | 64.5 |
| Yes | MFCCs (mean-std), Spectral, Gender | 85 | 81 | 83 |
Performance measures of the implemented speaker diarization method for different initial feature sets. The FLsD method provides more robust behavior regardless of the initial feature set, since it helps to discover a speaker-discriminant subspace.
Fig 7. Audio thumbnailing example.
Example of a self-similarity matrix for the song “Charmless Man” by Blur. The detected diagonal segment defines the two thumbnails, i.e. the segments 115.0–135.0 s and 156.0–176.0 s.
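The self-similarity matrix behind the figure, and the search for its strongest off-diagonal stripe, can be sketched as follows (cosine similarity over short-term feature vectors; an unoptimized illustration of the idea, not the library's exact method):

```python
import numpy as np

def self_similarity(features):
    # Cosine self-similarity matrix of a (frames x dims) feature matrix.
    # Off-diagonal stripes of high similarity indicate repeated parts.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.maximum(norms, 1e-12)
    return normed @ normed.T

def best_repeat(sim, length, min_lag=1):
    # Find the off-diagonal stripe of the given length with the highest
    # average similarity; returns the start frames (i, j) of the two repeats.
    n = sim.shape[0]
    best, best_ij = -np.inf, (0, 0)
    for lag in range(min_lag, n - length):
        diag = np.diagonal(sim, offset=lag)
        for s in range(len(diag) - length + 1):
            score = diag[s:s + length].mean()
            if score > best:
                best, best_ij = score, (s, s + lag)
    return best_ij
```

Mapping the returned frame indices through the short-term step yields thumbnail times in seconds, as in the figure caption.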
Fig 8. Chordial Content Visualization Example.
Different colors of the edges and nodes (recordings) represent different categories (artists in our case).
Realtime ratios for some basic functionalities and different devices.
| Procedure | Realtime ratio (device 1) | Realtime ratio (device 2) | Realtime ratio (device 3) |
|---|---|---|---|
| Short-term feature extraction | 115 | 16 | 2.2 |
| Mid-term segment classification | 100 | 14 | 2 |
| Fixed-size segmentation-classification | 100 | 13 | 2 |
| HMM-based segmentation-classification | 100 | 13 | 2 |
| Silence Removal | 105 | 13 | 2 |
| Audio Thumbnailing | 450 | 16 | 7 |
| Diarization—no FLsD | 28 | 5 | 0.6 |
| Diarization—FLsD | 7 | 2 | 0.3 |
Realtime ratios express how many times faster than the signal’s actual duration the respective computation runs. The reported values were calculated for mono, 16 kHz signals with a 50 ms short-term window. Both of these parameters (sampling rate and short-term window step) have a linear impact on the computational complexity of all functionalities. All functionalities are independent of the input signal’s duration except for the audio thumbnailing and diarization methods; for these, the ratios were measured using a 5-minute signal as input.
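A realtime ratio of this kind can be measured with simple wall-clock timing; in this sketch, `func` stands for any of the listed procedures:

```python
import time

def realtime_ratio(func, signal, fs):
    # Ratio of the signal's duration to the wall-clock time the analysis
    # takes; a value of 100 means the procedure runs 100x faster than
    # real time, as in the table above.
    t0 = time.perf_counter()
    func(signal, fs)
    elapsed = time.perf_counter() - t0
    return (len(signal) / fs) / elapsed
```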