Roneel V Sharan, Hao Xiong, Shlomo Berkovsky.
Abstract
Audio signal classification finds various applications in detecting and monitoring health conditions in healthcare. Convolutional neural networks (CNN) have produced state-of-the-art results in image classification and are being increasingly used in other tasks, including signal classification. However, audio signal classification using CNN presents various challenges. In image classification tasks, raw images of equal dimensions can be used as a direct input to CNN. Raw time-domain signals, on the other hand, can be of varying dimensions. In addition, the temporal signal often has to be transformed to frequency-domain to reveal unique spectral characteristics, therefore requiring signal transformation. In this work, we overview and benchmark various audio signal representation techniques for classification using CNN, including approaches that deal with signals of different lengths and combine multiple representations to improve the classification accuracy. Hence, this work surfaces important empirical evidence that may guide future works deploying CNN for audio signal classification purposes.
Keywords: convolutional neural networks; fusion; interpolation; machine learning; spectrogram; time-frequency image
Year: 2021 PMID: 34069189 PMCID: PMC8156023 DOI: 10.3390/s21103434
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Overview of same-sized time-frequency (TF) image formation techniques.
Figure 2. Overview of fusion techniques for learning from multiple representations of the same signal: (1) early-fusion, (2) mid-fusion, and (3) late-fusion.
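Two of the three fusion strategies in the figure can be sketched outside any deep-learning framework. The snippet below is a minimal numpy illustration, not the paper's implementation: early-fusion stacks two TF representations as input channels of a single CNN, while late-fusion combines per-representation classifier outputs (here by simple averaging of made-up softmax scores). Mid-fusion, which merges feature maps inside the network, is sketched separately after the fusion results table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical time-frequency representations of the same signal
# (e.g. a mel-spectrogram and a cochleagram), resized to a common shape.
rep_a = rng.random((64, 64))
rep_b = rng.random((64, 64))

# (1) Early-fusion: stack the representations as input channels so a
# single CNN sees both at once.
early_input = np.stack([rep_a, rep_b], axis=0)  # shape (2, 64, 64)

# (3) Late-fusion: run one classifier per representation and combine
# their class-probability outputs, here by simple averaging.
probs_a = np.array([0.7, 0.2, 0.1])  # illustrative softmax outputs
probs_b = np.array([0.5, 0.3, 0.2])
late_probs = (probs_a + probs_b) / 2
predicted = int(np.argmax(late_probs))
```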
CNN architecture used for the sound event and speech command datasets.
| Layer | Sound Event | Speech Command |
|---|---|---|
| Image input layer | 32 × 15 | 64 × 64 |
| Middle layers | Conv. 1: 16@3 × 3, Stride 1 × 1, Pad 1 × 1 | Conv. 1: 48@3 × 3, Stride 1 × 1, Pad ‘same’ |
| Final layers | Fully connected layer: 50 | Fully connected layer: 36 |
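The 3 × 3 kernels with stride 1 × 1 and padding 1 × 1 in Conv. 1 preserve the spatial size of the input, which the standard convolution output formula confirms. A quick sanity check (not from the paper, just the usual arithmetic):

```python
# Spatial size of a convolution output: floor((n + 2p - k) / s) + 1.
def conv_out(n, k=3, s=1, p=1):
    return (n + 2 * p - k) // s + 1

# Conv. 1 in the table uses 3x3 kernels, stride 1x1, pad 1x1 ('same'),
# so the 32 x 15 sound-event input keeps its spatial size.
out_h, out_w = conv_out(32), conv_out(15)
```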
Optimization algorithm and hyperparameter settings for training the CNN.
| Hyperparameter | Sound Event | Speech Command |
|---|---|---|
| Optimization algorithm | Adam | Adam |
| Initial learn rate | 0.001 | 0.0003 |
| Mini batch size | 50 | 128 |
| Max epochs | 30 | 25 |
| Learn rate drop factor | 0.5 | 0.1 |
| Learn rate drop period | 6 | 20 |
| L2 regularization | 0.05 | 0.05 |
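The drop factor and drop period in the table imply a piecewise-constant learning-rate schedule: the rate is multiplied by the drop factor once every drop-period epochs. A small sketch of that schedule, using the table's settings:

```python
# Piecewise-constant learning-rate schedule implied by the table:
# multiply the rate by the drop factor every drop-period epochs.
def lr_at_epoch(epoch, lr0, drop_factor, drop_period):
    return lr0 * drop_factor ** (epoch // drop_period)

# Sound-event settings: initial rate 0.001, factor 0.5, period 6.
lr_start = lr_at_epoch(0, 0.001, 0.5, 6)   # 0.001
lr_mid = lr_at_epoch(12, 0.001, 0.5, 6)    # dropped twice: 0.00025

# Speech-command settings: initial rate 0.0003, factor 0.1, period 20.
lr_sc = lr_at_epoch(20, 0.0003, 0.1, 20)   # dropped once: 0.00003
```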
Figure 3. (a) Plot of the speech command signal "backward" and its time-frequency representations: (b) spectrogram, (c) smoothed-spectrogram, (d) mel-spectrogram, and (e) cochleagram.
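The first two representations in the figure can be computed with standard tools. Below is an illustrative scipy sketch, with a synthetic 440 Hz tone standing in for the recorded speech command: the spectrogram is the short-time Fourier transform's power, and a smoothed-spectrogram is approximated here by local averaging of the TF image (the paper's exact smoothing may differ). The mel-spectrogram and cochleagram additionally require auditory filterbanks (e.g. a mel filterbank or gammatone filters) and are omitted.

```python
import numpy as np
from scipy import signal, ndimage

# A 1 s, 16 kHz test tone standing in for a recorded speech command.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

# Spectrogram: power of the short-time Fourier transform.
f, frames, sxx = signal.spectrogram(x, fs=fs, nperseg=512, noverlap=256)

# Smoothed-spectrogram (illustrative): local averaging of the TF image.
sxx_smooth = ndimage.uniform_filter(sxx, size=3)

# The dominant frequency bin should sit near the 440 Hz tone.
peak_hz = f[np.argmax(sxx.mean(axis=1))]
```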
Average classification accuracy (in %) of time-frequency image representations.
| Signal Representation | Sound Event (Validation) | Sound Event (Test) | Speech Command (Validation) | Speech Command (Test) |
|---|---|---|---|---|
| Spectrogram | 92.70 | 93.77 | 92.33 | 91.90 |
| Smoothed-Spectrogram | 96.48 | 97.32 | 93.79 | 93.41 |
| Mel-Spectrogram | 96.45 | 96.31 | 93.64 | 93.64 |
| Cochleagram | | | | |
Average classification accuracy (in %) of resized time-frequency image representations.
| Signal Representation | Sound Event (Validation) | Sound Event (Test) | Speech Command (Validation) | Speech Command (Test) |
|---|---|---|---|---|
| Resized spectrogram (nearest-neighbour) | 93.51 | 94.19 | 93.20 | 93.10 |
| Resized spectrogram (bilinear) | 95.71 | 96.31 | 93.81 | |
| Resized spectrogram (bicubic) | 96.02 | 96.59 | 94.03 | |
| Resized spectrogram (Lanczos-2) | 95.75 | 96.42 | 93.75 | 93.77 |
| Resized spectrogram (Lanczos-3) | | | 94.02 | 93.75 |
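The resizing step that produces equal-dimension TF images can be sketched with scipy's spline-based `zoom`, where the spline order selects the interpolation kernel (0 = nearest-neighbour, 1 = bilinear, 3 = bicubic). This is an illustration, not the paper's code; the 64 × 37 input shape is made up, and Lanczos kernels are not available in `scipy.ndimage` (Pillow's `Image.resize` offers them).

```python
import numpy as np
from scipy import ndimage

# A hypothetical TF image whose time axis (columns) depends on signal length.
tf_image = np.random.default_rng(1).random((64, 37))

target_h, target_w = 64, 64
zoom = (target_h / tf_image.shape[0], target_w / tf_image.shape[1])

# Spline order 0 = nearest-neighbour, 1 = bilinear, 3 = bicubic.
nearest = ndimage.zoom(tf_image, zoom, order=0)
bilinear = ndimage.zoom(tf_image, zoom, order=1)
bicubic = ndimage.zoom(tf_image, zoom, order=3)
```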
Average classification accuracy (in %) of signal representation fusion techniques.
| Signal Representation Fusion Technique | Sound Event (Validation) | Sound Event (Test) | Speech Command (Validation) | Speech Command (Test) |
|---|---|---|---|---|
| Early-fusion | 98.42 | 98.63 | 94.47 | 94.29 |
| Mid-fusion | 98.48 | 98.82 | 94.65 | 94.49 |
| Late-fusion | | | | |
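Mid-fusion, the strongest technique in the table, merges the branches inside the network rather than at the input or output. A minimal numpy sketch of the idea, with made-up feature sizes and a toy linear classifier head standing in for the CNN's final layers: per-representation branch features are concatenated and passed through one shared classifier.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical penultimate-layer feature vectors from two CNN branches,
# one branch per signal representation.
feat_a = rng.random(50)
feat_b = rng.random(50)

# Mid-fusion: concatenate the branch features and feed them to a shared
# classifier head (illustrative linear layer + softmax, 3 classes).
fused = np.concatenate([feat_a, feat_b])
w = rng.random((3, fused.size))
logits = w @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```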