Yu Su, Ke Zhang, Jingyu Wang, Kurosh Madani.
Abstract
Deep learning-based models have become popular across many categorization problems and have proven more robust than conventional methods, so a growing number of researchers have applied them to environment sound classification (ESC) tasks in recent years. However, the performance of existing models that train deep neural networks on auditory features such as the log-mel spectrogram (LM) and mel frequency cepstral coefficients (MFCC), or on the raw waveform, is unsatisfactory. In this paper, we first propose two combined features to give a more comprehensive representation of environment sounds. Then, a four-layer convolutional neural network (CNN) is presented to improve the performance of ESC with the proposed aggregated features. Finally, the CNNs trained with different features are fused using the Dempster-Shafer evidence theory to compose the TSCNN-DS model. The experimental results indicate that our combined features with the four-layer CNN are appropriate for environment sound taxonomic problems and dramatically outperform other conventional methods. The proposed TSCNN-DS model achieves a classification accuracy of 97.2%, which is the highest taxonomic accuracy on the UrbanSound8K dataset compared to existing models.
Keywords: Auditory Cognition; Convolutional Neural Network; Dempster-Shafer evidence theory; Environment Sound Classification; Fusion Model
Year: 2019 PMID: 30978974 PMCID: PMC6479959 DOI: 10.3390/s19071733
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
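The abstract describes fusing CNNs trained on different features with Dempster-Shafer evidence theory. As a rough illustration only, the sketch below applies Dempster's rule of combination to two classifiers whose outputs are treated as mass functions over singleton classes. Treating softmax-style probabilities as masses, the function name `dempster_combine`, and the example numbers are all assumptions made here for illustration; the source does not specify these details.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions defined on the same singleton classes.

    Assumption: every focal element is a single class, so Dempster's rule
    reduces to a normalized element-wise product.
    """
    if len(m1) != len(m2):
        raise ValueError("mass functions must cover the same classes")
    # Mass on which both sources agree, per class.
    agreement = [a * b for a, b in zip(m1, m2)]
    # Sum of agreeing mass equals 1 - K, where K is the conflict.
    norm = sum(agreement)
    if norm == 0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return [x / norm for x in agreement]


# Hypothetical outputs of two networks for one clip over three classes.
net_a = [0.6, 0.3, 0.1]
net_b = [0.5, 0.1, 0.4]

fused = dempster_combine(net_a, net_b)
predicted_class = fused.index(max(fused))
```

In this toy case both sources favor the first class, so the fused mass concentrates on it more strongly than either input does on its own.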
Figure 1. The spectrogram of LMC and MC feature sets.
Figure 2. The architecture of proposed four-layer CNN.
Figure 3. The overall framework of the DS theory based ISR system.
Figure 4. The architecture and size of feature maps in each convolutional layer.
Parameters and memory of CNNs with different numbers of convolutional layers.
| Layer | Four-layer param | Four-layer memory | 6-layer param | 6-layer memory | 8-layer param | 8-layer memory |
|---|---|---|---|---|---|---|
| Input | 0 | 3.5 K | 0 | 3.5 K | 0 | 3.5 K |
| Conv1 | 288 | 111.5 K | 288 | 111.5 K | 288 | 111.5 K |
| Conv2 | 9.2 K | 111.5 K | 9.2 K | 111.5 K | 9.2 K | 111.5 K |
| Conv3 | 18.4 K | 57.8 K | 18.4 K | 57.8 K | 18.4 K | 57.8 K |
| Conv4 | 36.8 K | 57.8 K | 36.8 K | 57.8 K | 36.8 K | 57.8 K |
| Conv5 | 0 | 0 | 73.7 K | 31 K | 73.7 K | 31 K |
| Conv6 | 0 | 0 | 147.5 K | 31 K | 147.5 K | 31 K |
| Conv7 | 0 | 0 | 0 | 0 | 294.9 K | 4.6 K |
| Conv8 | 0 | 0 | 0 | 0 | 589.8 K | 4.6 K |
| FC1 | 15.9 M | 1024 | 8.7 M | 1024 | 4.7 M | 1024 |
| FC2 | 10.2 K | 10 | 10.2 K | 10 | 10.2 K | 10 |
| Total | 15.9 M | 339.6 K | 8.9 M | 401.6 K | 5.9 M | 413.4 K |
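The eight nonzero per-layer parameter counts in the table above (288 through 589.8 K) can be reproduced with simple arithmetic. The sketch below assumes 3×3 kernels, no bias terms, a single input channel, and output widths of 32, 32, 64, 64, 128, 128, 256 and 256 channels. These settings are inferred here from the numbers themselves, not stated in this record, and the helper name `conv_params` is hypothetical.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weight count of a 2-D convolution without bias (assumed setup)."""
    return kernel * kernel * in_channels * out_channels


# Assumed channel widths for up to eight stacked convolutional layers.
widths = [32, 32, 64, 64, 128, 128, 256, 256]

counts = []
in_channels = 1  # assumed single-channel feature map at the input
for out_channels in widths:
    counts.append(conv_params(in_channels, out_channels))
    in_channels = out_channels

# Final dense layer: 1024 hidden units to 10 classes, bias ignored.
output_layer_params = 1024 * 10

for index, count in enumerate(counts, start=1):
    print(f"conv layer {index}: {count} parameters")
print(f"output layer: {output_layer_params} parameters")
```

Under these assumptions the counts agree with the table entries to the precision shown (for example 9216 against 9.2 K, and 10240 against 10.2 K for the last layer).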
Figure 5. The spectrogram of MLMC feature sets.
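Class-wise accuracy of four models with four-layer CNN evaluated on UrbanSound8K. Class abbreviations: ac = air conditioner, ch = car horn, cp = children playing, db = dog bark, dr = drilling, ei = engine idling, gs = gun shot, jh = jackhammer, si = siren, sm = street music.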
| Class | LMC (LMCNet) | MC (MCNet) | MLMC | TSCNN-DS |
|---|---|---|---|---|
| ac | 98.6% | 99.9% | 99.2% | 99.9% |
| ch | 93.9% | 91.4% | 93.2% | 94.2% |
| cp | 97.3% | 93.9% | 96.1% | 97.5% |
| db | 92.6% | 90.4% | 94.2% | 95.3% |
| dr | 94.8% | 95.0% | 95.7% | 97.2% |
| ei | 98.9% | 99.6% | 98.5% | 99.6% |
| gs | 88.6% | 91.1% | 85.9% | 95.4% |
| jh | 93.2% | 95.9% | 91.1% | 97.1% |
| si | 98.6% | 98.3% | 98.5% | 98.9% |
| sm | 95.0% | 97.4% | 94.1% | 96.9% |
| Avg. | 95.2% | 95.3% | 94.6% | 97.2% |
Statistical analysis and time cost of four-layer CNN based models and TSCNN-DS model.
| Model | Mean | N | Std Deviation | Time Cost |
|---|---|---|---|---|
| LMCNet | 0.9515 | 10 | 0.03121 | 0.023 |
| MCNet | 0.9529 | 10 | 0.03352 | 0.024 |
| MLMC | 0.9465 | 10 | 0.03812 | 0.028 |
| TSCNN-DS | 0.9720 | 10 | 0.01788 | 0.077 |
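The Mean and Std Deviation columns above appear consistent with the mean and population standard deviation of the ten class-wise accuracies reported for the four-layer CNN models, with N = 10 being the number of classes. This reading is an observation from the numbers, not something the record states. The following sketch recomputes both statistics from the class-wise values; the variable names are chosen here for illustration.

```python
from statistics import fmean, pstdev

# Class-wise accuracies (%) for the four-layer CNN models, in the order
# ac, ch, cp, db, dr, ei, gs, jh, si, sm.
class_accuracy = {
    "LMCNet": [98.6, 93.9, 97.3, 92.6, 94.8, 98.9, 88.6, 93.2, 98.6, 95.0],
    "MCNet": [99.9, 91.4, 93.9, 90.4, 95.0, 99.6, 91.1, 95.9, 98.3, 97.4],
    "MLMC": [99.2, 93.2, 96.1, 94.2, 95.7, 98.5, 85.9, 91.1, 98.5, 94.1],
    "TSCNN-DS": [99.9, 94.2, 97.5, 95.3, 97.2, 99.6, 95.4, 97.1, 98.9, 96.9],
}

# Mean and population standard deviation, converted from percent to fractions.
summary = {
    model: (fmean(values) / 100, pstdev(values) / 100)
    for model, values in class_accuracy.items()
}

for model, (mean, std) in summary.items():
    print(f"{model}: mean={mean:.4f} std={std:.5f}")
```

The smaller spread for TSCNN-DS reflects that fusion mainly lifts the weakest classes, such as gs, rather than the classes that were already near 99%.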
Class-wise accuracy of four models based on 6-layer CNN evaluated on UrbanSound8K.
| Class | LMC (LMCNet) | MC (MCNet) | MLMC | TSCNN-DS |
|---|---|---|---|---|
| ac | 98.9% | 98.9% | 97.5% | 99.9% |
| ch | 90.2% | 69.4% | 87.9% | 89.2% |
| cp | 94.8% | 91.1% | 93.6% | 96.4% |
| db | 91.3% | 88.0% | 91.6% | 93.1% |
| dr | 93.8% | 90.9% | 91.5% | 95.5% |
| ei | 98.2% | 97.7% | 98.1% | 99.1% |
| gs | 77.2% | 77.2% | 81.7% | 85.1% |
| jh | 92.6% | 91.6% | 93.4% | 97.1% |
| si | 99.0% | 96.1% | 99.0% | 98.9% |
| sm | 94.3% | 92.1% | 92.9% | 94.7% |
| Avg. | 93.0% | 89.3% | 92.7% | 94.9% |
Class-wise accuracy of four models based on 8-layer CNN evaluated on UrbanSound8K.
| Class | LMC (LMCNet) | MC (MCNet) | MLMC | TSCNN-DS |
|---|---|---|---|---|
| ac | 94.8% | 91.5% | 93.2% | 98.2% |
| ch | 76.1% | 47.3% | 88.1% | 69.9% |
| cp | 84.0% | 80.9% | 87.9% | 88.0% |
| db | 79.9% | 73.3% | 86.8% | 80.8% |
| dr | 87.8% | 87.4% | 87.0% | 91.6% |
| ei | 96.8% | 94.8% | 95.3% | 97.4% |
| gs | 57.2% | 63.4% | 45.4% | 67.8% |
| jh | 89.8% | 74.7% | 85.9% | 87.6% |
| si | 97.8% | 88.3% | 96.5% | 96.3% |
| sm | 85.3% | 71.8% | 90.3% | 80.3% |
| Avg. | 84.9% | 77.3% | 85.7% | 85.8% |
The ESC results of stacked CNNs with 4, 6 and 8 convolution layers.
| Model | Accuracy |
|---|---|
| Stacked four-layer CNN | 86.4% |
| Stacked 6-layer CNN | 79.8% |
| Stacked 8-layer CNN | 80.1% |
Comparison of classification accuracy with other models on the UrbanSound8K dataset. Our result is shown in bold.
| Model | Feature | Accuracy |
|---|---|---|
| Piczak | LM | 72.7% |
| Tokozume | Raw Data | 78.3% |
| Zhang X. | Mel | 81.9% |
| Zhang Z. | LM-GS | 83.7% |
| Li | Raw Data-LM | 92.2% |
| Boddapati | Spec-MFCC-CRP | 93% |
| LMCNet | LM-C | 95.2% |
| MCNet | M-C | 95.3% |