Nam Kyun Kim, Kwang Myung Jeon, Hong Kook Kim.
Abstract
This paper proposes a sound event detection (SED) method for tunnels to prevent further uncontrollable accidents. Tunnel accidents are accompanied by crashes and tire skids, which usually produce abnormal sounds. Since the tunnel environment always has a severe level of noise, the detection accuracy of existing methods can be greatly reduced. To deal with the noise in the tunnel environment, the proposed method combines preprocessing of tunnel acoustic signals with a classifier for detecting acoustic events in tunnels. For preprocessing, a non-negative tensor factorization (NTF) technique is used to separate the acoustic event signal from the noisy tunnel signal. In particular, the NTF technique developed in this paper consists of source separation and online noise learning: the noise basis is adapted by an online noise learning technique for enhancement under adverse noise conditions. Next, a convolutional recurrent neural network (CRNN) is extended to accommodate the contributions of both the separated event signal and the noise to event detection; thus, the proposed CRNN is composed of event convolution layers and noise convolution layers in parallel, followed by recurrent layers and the output layer. A set of mel-filterbank feature parameters is used as the input features. Evaluations of the proposed method are conducted on two datasets: a publicly available road audio events dataset and a tunnel audio dataset recorded in a real traffic tunnel for six months. In the first evaluation, where the background noise is low, the proposed CRNN-based SED method with online noise learning reduces the recognition error rate by a relative 56.25% compared to the conventional CRNN-based method applied to the noisy signal. In the second evaluation, where the tunnel background noise is more severe than in the first, the proposed CRNN-based SED method yields superior performance compared to the conventional methods.
In particular, it is shown that among all of the compared methods, the proposed method with online noise learning provides the best recognition rate of 91.07%, reducing the recognition error rate by 47.40% and 28.56% relative to the Gaussian mixture model (GMM)-hidden Markov model (HMM)-based and conventional CRNN-based SED methods, respectively. The computational complexity measurements also show that, for a one-second noisy tunnel signal, the proposed CRNN-based SED method requires a processing time of 599 ms for the NTF-based source separation with online noise learning and CRNN classification combined, which implies that the proposed method detects events in real time.
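The separation stage described in the abstract can be illustrated with a simplified two-dimensional sketch. The paper factorizes acoustic tensors with NTF, but the core idea, fixed pretrained event bases combined with noise bases that keep adapting to the incoming signal (online noise learning), survives in a plain NMF form. The function name, variable names, and the KL-divergence multiplicative updates below are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def separate_event(V, W_event, W_noise, n_iter=50, eps=1e-10):
    """Sketch: V is a (freq x time) noisy magnitude spectrogram; W_event
    holds pretrained event bases (kept fixed); W_noise holds noise bases
    that are re-estimated on the current input (online noise learning)."""
    W = np.concatenate([W_event, W_noise], axis=1)
    k_e = W_event.shape[1]
    H = np.abs(np.random.rand(W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        # multiplicative update of the activations (KL divergence)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        # online noise learning: only the noise bases are updated
        WH = W @ H + eps
        Hn = H[k_e:]
        W[:, k_e:] *= ((V / WH) @ Hn.T) / (np.ones_like(V) @ Hn.T + eps)
    # Wiener-style masks split V into an event part and a noise part
    event = W[:, :k_e] @ H[:k_e]
    noise = W[:, k_e:] @ H[k_e:]
    return V * event / (event + noise + eps), V * noise / (event + noise + eps)
```

In the proposed pipeline, the separated event estimate would then be converted to mel-filterbank features for the event branch of the classifier, with the noise estimate feeding the noise branch.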
Keywords: convolutional recurrent neural network (CRNN); non-negative tensor factorization (NTF); online noise learning; sound event detection (SED); tunnel accident detection
Year: 2019 PMID: 31208007 PMCID: PMC6631336 DOI: 10.3390/s19122695
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Block diagram of a typical SED method.
Figure 2. Architecture of an accident management system using acoustic sensors in a tunnel.
Figure 3. Block diagram of a conventional GMM–HMM-based SED method.
Figure 4. The network architecture of sound event classification based on GMM–HMM.
Figure 5. Block diagram of the proposed SED method based on NTF source separation with online noise learning and a CRNN-based classifier with event sound and noise CNNs.
Figure 6. Architecture of the CRNN in the proposed SED method.
Table 1. Distribution of the MIVIA road audio events dataset [4].
| Class | # Events | Duration |
|---|---|---|
| Tire skid (TS) | 200 | 326.38 s |
| Car crash (CC) | 200 | 522.5 s |
| Background noise (BN) | - | 2732.0 s |
Table 2. Distribution of the audio dataset for the development of SED in a tunnel environment.
| Class | Training set (recorded): # events / duration | Training set (generated): # events / duration | Evaluation set (recorded): # events / duration |
|---|---|---|---|
| Tire skid (TS) | 54 / 120.55 s | 311 / 383.45 s | 39 / 109.88 s |
| Car crash (CC) | 30 / 68.27 s | 93 / 84.07 s | 9 / 19.31 s |
| Background noise (BN) | - / 5423.66 s | - / - | - / ~48 h |
Table 3. Configuration of the network architectures of the three deep neural networks used for performance comparison.
| Layer | CNN | CRNN | Proposed CRNN |
|---|---|---|---|
| No. of convolution layers | 3 | 3 | 3, 3 |
| No. of kernels | (8, 16, 32) | (8, 16, 32) | (8, 16, 32), (8, 16, 32) |
| Kernel size | (3, 3) | (3, 3) | (3, 3) |
| Pool size | (2, 2, 4) | (2, 2, 4) | (2, 2, 4), (2, 2, 4) |
| RNN layer | - | 16 bi-directional GRUs | 16 bi-directional GRUs |
| FC layer | Yes | Yes | Yes |
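The proposed parallel structure can be sketched as a minimal PyTorch module: two identical three-layer CNN branches (8/16/32 kernels, 3×3 kernel size, pooling of 2/2/4) for the separated event signal and the noise signal, merged and fed to 16 bidirectional GRUs and a fully connected output layer. The input size (64 mel bands), pooling along the frequency axis only, and the three-class output are my assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ParallelCRNN(nn.Module):
    """Sketch of the proposed CRNN: event and noise CNN branches in
    parallel, followed by recurrent layers and the output layer."""
    def __init__(self, n_mels=64, n_classes=3):
        super().__init__()
        def branch():
            layers, ch = [], 1
            for k, p in zip((8, 16, 32), (2, 2, 4)):
                layers += [nn.Conv2d(ch, k, kernel_size=3, padding=1),
                           nn.ReLU(),
                           nn.MaxPool2d((p, 1))]  # pool along frequency only
                ch = k
            return nn.Sequential(*layers)
        self.event_cnn = branch()   # event convolution layers
        self.noise_cnn = branch()   # noise convolution layers
        freq_out = n_mels // (2 * 2 * 4)            # 64 -> 4 after pooling
        self.gru = nn.GRU(input_size=2 * 32 * freq_out, hidden_size=16,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 16, n_classes)

    def forward(self, event_feat, noise_feat):
        # inputs: (batch, 1, n_mels, n_frames) mel-filterbank features
        e = self.event_cnn(event_feat)
        n = self.noise_cnn(noise_feat)
        x = torch.cat([e, n], dim=1)                # merge the two branches
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        x, _ = self.gru(x)                          # 16 bidirectional GRUs
        return self.fc(x[:, -1])                    # last frame -> class scores
```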
Table 4. Performance comparison of the proposed and other SED methods evaluated on the MIVIA road audio events dataset.
| Features | Classifier | RR (%) | MDR (%) | FPR (%) | AUC (%) |
|---|---|---|---|---|---|
| MFCC features with BoW * | SVM | 78.20 | 21.00 | 10.96 | 86.00 |
| Temporal and spectral features * | SVM | 82.65 | 19.00 | 5.48 | 90.00 |
| Selected time and frequency features * | SVM | 95.00 | 2.75 | 5.00 | 98.32 |
| Mel-filterbanks from noisy signal | GMM–HMM | 67.75 | 32.00 | 29.76 | 82.90 |
|  | CNN | 96.25 | 2.00 | 4.38 | 97.59 |
|  | CRNN | 96.00 | 3.25 | 3.06 | 97.01 |
| Mel-filterbanks from NTF w/o online noise learning | GMM–HMM | 79.50 | 20.50 | 17.94 | 94.20 |
|  | CNN | 94.00 | 2.75 | 3.94 | 96.56 |
|  | CRNN | 96.50 | 2.00 | 7.22 | 96.36 |
| Mel-filterbanks from NTF with online noise learning | GMM–HMM | 84.75 | 15.00 | 13.35 | 95.20 |
|  | CNN | 96.00 | 2.50 | 2.40 | 97.45 |
|  | CRNN | 96.00 | 2.75 | 3.28 | 97.33 |
| Mel-filterbanks from NTF with online noise learning | Proposed CRNN | 98.25 | 1.00 | 3.06 | 98.39 |
* Since the experimental setup using the MIVIA road audio events dataset was identical to that of the previous work in [23], the results for the methods marked with * were taken from [23].
Figure 7. Comparison of the receiver operating characteristic (ROC) curves between the proposed CRNN-based SED method and the other SED methods.
Table 5. Performance comparison of the proposed and other SED methods evaluated on the real tunnel event dataset.
| Features | Classifier | RR (%) | MDR (%) | FPR (%) | AUC (%) |
|---|---|---|---|---|---|
| Mel-filterbanks from noisy signal | GMM–HMM | 69.81 | 30.19 | 88.68 | 69.11 |
|  | CNN | 71.70 | 28.30 | 7.55 | 80.75 |
|  | CRNN | 81.13 | 18.87 | 11.32 | 82.66 |
| Mel-filterbanks from NTF w/o online noise learning | GMM–HMM | 69.81 | 30.19 | 7.55 | 77.22 |
|  | CNN | 79.25 | 20.75 | 41.51 | 64.68 |
|  | CRNN | 83.02 | 16.98 | 18.67 | 84.56 |
| Mel-filterbanks from NTF with online noise learning | GMM–HMM | 83.02 | 16.98 | 15.09 | 87.83 |
|  | CNN | 83.92 | 16.07 | 17.57 | 85.87 |
|  | CRNN | 87.50 | 12.50 | 10.71 | 89.92 |
| Mel-filterbanks from NTF with online noise learning | Proposed CRNN | 91.07 | 8.93 | 7.14 | 92.08 |
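The headline figures in the abstract can be reproduced directly from the reported recognition rates, taking the error rate as 100 − RR; the helper name below is mine, used only to make the arithmetic explicit.

```python
# Relative error-rate reduction between a baseline and a new method,
# where the recognition error rate is 100 - RR.
def err_reduction(rr_base, rr_new):
    base_err, new_err = 100.0 - rr_base, 100.0 - rr_new
    return 100.0 * (base_err - new_err) / base_err

# Tunnel dataset: proposed CRNN (91.07) vs GMM-HMM (83.02) and CRNN (87.50)
print(round(err_reduction(83.02, 91.07), 2))  # 47.41 (reported as 47.40)
print(round(err_reduction(87.50, 91.07), 2))  # 28.56
# MIVIA dataset: proposed CRNN (98.25) vs conventional CRNN on noisy input (96.00)
print(round(err_reduction(96.00, 98.25), 2))  # 56.25
```

The 0.01-point difference on the first figure is a rounding artifact of the two-decimal recognition rates.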
Figure 8. Architecture of the CNN-based feature extractor.
Table 6. Performance comparison of the CNN-based and CRNN-based SED methods with the CNN-based feature parameters and mel-filterbanks, evaluated on the MIVIA road audio events dataset.
| Features | Classifier | RR (%) | MDR (%) | FPR (%) | AUC (%) |
|---|---|---|---|---|---|
| CNN-based features from noisy signal | CNN | 96.50 | 3.50 | 4.38 | 96.90 |
|  | CRNN | 95.25 | 4.75 | 5.03 | 96.09 |
| Mel-filterbanks from noisy signal | CNN | 96.25 | 2.00 | 4.38 | 97.59 |
|  | CRNN | 96.00 | 3.25 | 3.06 | 97.01 |
| CNN-based features from NTF w/o online noise learning | CNN | 96.50 | 3.25 | 5.69 | 96.97 |
|  | CRNN | 96.00 | 3.50 | 5.03 | 96.09 |
| Mel-filterbanks from NTF w/o online noise learning | CNN | 94.00 | 2.75 | 3.94 | 96.56 |
|  | CRNN | 96.50 | 2.00 | 7.22 | 96.36 |
Table 7. Comparison of the number of parameters and the processing times for training and testing the SED methods.
| Item | GMM–HMM | CNN | CRNN | Proposed CRNN |
|---|---|---|---|---|
| No. of parameters | 9.6K | 21K | 34K | 64K |
| Processing time for model training per epoch | 4 s | 5 s | 8 s | 12 s |
| Processing time per second of test signal + | 117 ms | 2 ms | 10 ms | 11 ms |
+ The NTF source separation with online noise learning required an additional 588 ms, which is not included in the processing times listed in this table.
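The real-time claim in the abstract follows from one line of arithmetic over the figures above: the total per-second processing time is the 588 ms of NTF separation plus the 11 ms of CRNN classification.

```python
# Real-time factor: processing time divided by signal duration.
ntf_ms = 588    # NTF source separation with online noise learning (footnote)
crnn_ms = 11    # proposed CRNN classification (table)
total_ms = ntf_ms + crnn_ms
rtf = total_ms / 1000.0   # the test signal is one second long
print(total_ms, rtf)      # 599 ms total, RTF 0.599 < 1 -> real-time capable
```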