Van-Thuan Tran, Wei-Ho Tsai, Yury Furletov, Mikhail Gorodnichev.
Abstract
The train horn is an active audible warning signal that alerts commuters and railway employees to oncoming trains, supporting smooth operation and traffic safety, especially at barrier-free crossings. This work studies deep learning-based approaches for a system that provides early detection of train arrival by recognizing train horn sounds in the traffic soundscape. A custom dataset of train horn sounds, car horn sounds, and traffic noises is developed for experiments and analysis. We propose a novel two-stream end-to-end CNN model, THD-RawNet, which combines two approaches to feature extraction from raw audio waveforms for audio classification in train horn detection (THD). Besides a stream with a sequential one-dimensional CNN (1D-CNN), as in existing sound classification works, we propose multiple 1D-CNN branches that process raw waves at different temporal resolutions to extract an image-like representation for the 2D-CNN classification part. Our experimental results and comparative analysis demonstrate the effectiveness of the proposed two-stream network and of combining features extracted at multiple temporal resolutions. THD-RawNet achieved better accuracy and robustness than baseline models trained on either raw audio or handcrafted features: with a one-second input, the network yielded an accuracy of 95.11% on test data in normal traffic conditions and remained above 93% accuracy under the considerably noisy condition of −10 dB SNR. The proposed THD system can be integrated into smart railway crossing systems, private cars, and self-driving cars to improve railway transit safety.
Keywords: audio classification; convolutional neural networks; end-to-end models; railway audible warning signal; railway transit safety; raw waveforms; train horn detection
Year: 2022 PMID: 35746234 PMCID: PMC9227093 DOI: 10.3390/s22124453
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. The general structure of the TH-TAD system.
Figure 2. The general structure of the THD-RawNet in the TH-TAD system. The symbols in the figure denote the sampling rate, the input length (in seconds), and the number of filters in a 1D-Conv layer; “Concat” stands for the concatenation operation, and FC denotes a fully connected layer.
The summary of our data preparation.
| Data Class | Our Collection (#Samples) | ESC-50 (#Samples) | Total (#Samples) |
|---|---|---|---|
| Train Horn | 5289 | - | 5289 |
| Car Horn | 5808 | 40 | 5848 |
| Noise | 4302 | 1960 | 6262 |
| Total (#samples) | 15,399 | 2000 | 17,399 |
| Total duration | 8.55 h | 2.77 h | 11.32 h |
| Clip length | 2 s | 5 s | - |
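The durations in the data preparation table follow from the clip counts and clip lengths. The short check below is only a sketch of that arithmetic; the dictionary layout and the function name `duration_hours` are illustrative choices, not part of the original work.

```python
# Clip counts and clip lengths are taken from the data preparation table.
# The structure and names below are illustrative assumptions.
SOURCES = {
    "Our Collection": {"clips": 15_399, "clip_seconds": 2},
    "ESC-50": {"clips": 2_000, "clip_seconds": 5},
}


def duration_hours(clips: int, clip_seconds: float) -> float:
    """Total audio duration in hours for fixed-length clips."""
    return clips * clip_seconds / 3600


hours = {name: duration_hours(**src) for name, src in SOURCES.items()}
total_hours = sum(hours.values())

for name, value in hours.items():
    print(f"{name}: {value:.3f} h")
print(f"Total: {total_hours:.3f} h")
```

The computed values agree with the table to within about 0.01 h; the small differences are consistent with the table's figures having been truncated rather than rounded, although the source does not say so.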
Data separation for TH-TAD experiments.
| Subset | Train Horn | Car Horn | Noise | Total |
|---|---|---|---|---|
| Train | 3211 | 3624 | 3871 | 10,706 |
| Validation | 985 | 1125 | 1226 | 3336 |
| Test | 1093 | 1099 | 1165 | 3357 |
| Total | 5289 | 5848 | 6262 | 17,399 |
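As a quick illustration of the split proportions implied by the subset totals in the table above, the following sketch computes each subset's share of the 17,399 samples. The names `SPLIT_TOTALS` and `split_fractions` are hypothetical and used only for this example.

```python
# Subset totals are taken from the "Total" column of the data separation table.
SPLIT_TOTALS = {"Train": 10_706, "Validation": 3_336, "Test": 3_357}


def split_fractions(totals: dict) -> dict:
    """Return each subset's share of the whole dataset."""
    n = sum(totals.values())
    return {name: count / n for name, count in totals.items()}


fractions = split_fractions(SPLIT_TOTALS)
for name, frac in fractions.items():
    print(f"{name}: {100 * frac:.1f}%")
```

The shares come out to roughly 61.5%, 19.2%, and 19.3% for the training, validation, and test subsets.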
Figure 3. The procedure for the augmentation of training data.
Performance of the proposed THD-RawNet and baseline models on the THD dataset.
| Model | Input/Features | Inference Time (ms/Sample) | Accuracy (%) |
|---|---|---|---|
| THD-RawNet (this work) | Raw wave | 5 ms | 95.11 |
| SoundNet (5 Conv layers) | Raw wave | 1 ms | 90.17 |
| SoundNet (8 Conv layers) | Raw wave | 2 ms | 92.17 |
| EnvNet | Raw wave | 2 ms | 88.23 |
| 2D-CNN (K. J. Piczak) | Mel-scale spectrogram | 3 ms | 89.04 |
| 2D-CNN (J. Salamon et al.) | Mel-scale spectrogram | 1 ms | 89.90 |
| 2D-CNN (AlexNet) | Mel-scale spectrogram | 3 ms | 90.05 |
| RNN (I. Lezhenin et al.) | Mel-scale spectrogram | 8 ms | 80.22 |
| CRNN | Mel-scale spectrogram | 2 ms | 87.99 |
Figure 4. Confusion matrices for THD-RawNet, SoundNet (8 Conv layers), and AlexNet. CH, NS, and TH denote the car horn, noise, and train horn classes, respectively. (a) THD-RawNet. (b) SoundNet (8 Conv layers). (c) AlexNet.
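For readers unfamiliar with confusion matrices, the sketch below shows how overall accuracy and per-class recall are read off a three-class matrix over CH, NS, and TH. The counts in `toy_cm` are made-up placeholders, not values from the figure, and the rows-as-true-classes layout is an assumed convention.

```python
LABELS = ["CH", "NS", "TH"]  # car horn, noise, train horn


def accuracy(cm):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_recall(cm, labels):
    """Recall per class, assuming rows are true classes and columns are predictions."""
    return {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}


# Toy placeholder counts for illustration only (NOT taken from the paper).
toy_cm = [
    [8, 1, 1],  # true CH
    [0, 9, 1],  # true NS
    [1, 0, 9],  # true TH
]

toy_accuracy = accuracy(toy_cm)
toy_recall = per_class_recall(toy_cm, LABELS)
print(f"accuracy = {toy_accuracy:.3f}")
print(toy_recall)
```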
Performance of the first stream of THD-RawNet with different configurations.
| Model | #Branches | 1D Filter Size in Each Branch | Output of 1D-CNN part | Accuracy (%) |
|---|---|---|---|---|
| 1st Stream of THD-RawNet | 1 | Large (128) | | 92.25 |
| 1st Stream of THD-RawNet | 1 | Medium (32) | | 89.66 |
| 1st Stream of THD-RawNet | 1 | Small (8) | | 87.85 |
| 1st Stream of THD-RawNet | 2 | Large (128), Medium (32) | | 92.40 |
| 1st Stream of THD-RawNet | 2 | Large (128), Small (8) | | 92.85 |
| 1st Stream of THD-RawNet | 2 | Medium (32), Small (8) | | 90.65 |
| 1st Stream of THD-RawNet | 3 | Large (128), Medium (32), Small (8) | | 93.68 |
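The idea of processing the raw wave in parallel branches with large, medium, and small 1D filters and stacking the results into an image-like map can be illustrated with a toy, dependency-free sketch. This is not the THD-RawNet implementation: each branch here uses a single fixed averaging filter as a stand-in for a learned filter bank, and the stride rule, the pooling to 16 frames, the 8 kHz sampling rate, and all function names are assumptions made for illustration. Only the filter sizes (128, 32, 8) come from the table above.

```python
import math


def conv1d_valid(x, kernel, stride):
    """Plain 1D convolution (correlation form) without padding."""
    k = len(kernel)
    return [
        sum(x[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(x) - k + 1, stride)
    ]


def pool_to_frames(seq, n_frames):
    """Max-pool a sequence down to a fixed number of frames."""
    if len(seq) < n_frames:
        raise ValueError("sequence is shorter than the requested frame count")
    frames = []
    for f in range(n_frames):
        start = f * len(seq) // n_frames
        end = (f + 1) * len(seq) // n_frames
        frames.append(max(seq[start:end]))
    return frames


def multi_resolution_map(wave, kernel_sizes=(128, 32, 8), n_frames=16):
    """Stack one row per branch into a 2D, image-like feature map."""
    rows = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k  # placeholder for a learned filter (assumption)
        stride = max(1, k // 2)  # assumed stride; not given in the source
        response = [abs(v) for v in conv1d_valid(wave, kernel, stride)]
        rows.append(pool_to_frames(response, n_frames))
    return rows


# Hypothetical input: 0.25 s of a 440 Hz tone at an assumed 8 kHz sampling rate.
SAMPLE_RATE = 8000
wave = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(2000)]

feature_map = multi_resolution_map(wave)
print(len(feature_map), "rows x", len(feature_map[0]), "frames")
```

Pooling every branch to the same number of frames is what lets branches with different temporal resolutions be concatenated into one 2D array that a 2D-CNN could consume; in a trained network each branch would contribute many learned filters rather than one row.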
Performances of the proposed THD-RawNet and its two streams.
| Model | Features | Inference Time (ms/Sample) | Accuracy (%) |
|---|---|---|---|
| THD-RawNet | Raw wave | 5 ms | 95.11 |
| 1st Stream of THD-RawNet | Raw wave | 4 ms | 93.68 |
| 2nd Stream of THD-RawNet | Raw wave | 2 ms | 92.52 |
Results of the proposed THD-RawNet and baseline models across various levels of noise (accuracy in % at each SNR).
| Models | Input/Features | −15 dB | −10 dB | −5 dB | 0 dB | +5 dB | +10 dB | +15 dB | Original Data |
|---|---|---|---|---|---|---|---|---|---|
| THD-RawNet (this work) | Raw wave | 82.90 | 93.08 | 93.74 | 94.31 | 94.51 | 94.66 | 94.70 | 95.11 |
| 1st Stream of THD-RawNet (this work) | Raw wave | 80.16 | 89.51 | 92.25 | 92.52 | 92.76 | 93.39 | 93.68 | 93.68 |
| 2nd Stream of THD-RawNet (this work) | Raw wave | 79.56 | 88.35 | 91.45 | 91.71 | 92.01 | 92.04 | 92.37 | 92.52 |
| SoundNet (5 Conv layers) | Raw wave | 71.02 | 80.87 | 86.53 | 88.44 | 89.18 | 89.24 | 89.78 | 90.17 |
| SoundNet (8 Conv layers) | Raw wave | 75.93 | 84.27 | 88.53 | 90.11 | 90.49 | 90.55 | 91.39 | 92.17 |
| EnvNet | Raw wave | 72.08 | 77.06 | 83.37 | 85.25 | 85.79 | 86.71 | 87.01 | 88.23 |
| 2D-CNN (K. J. Piczak) | Spectrogram | 77.45 | 84.59 | 85.23 | 86.92 | 87.42 | 88.47 | 88.62 | 89.04 |
| 2D-CNN (J. Salamon et al.) | Spectrogram | 77.86 | 84.62 | 85.43 | 86.62 | 88.17 | 88.88 | 89.87 | 89.90 |
| 2D-CNN (AlexNet) | Spectrogram | 77.21 | 83.26 | 85.79 | 86.38 | 87.60 | 88.44 | 89.06 | 90.05 |
| RNN (I. Lezhenin et al.) | Spectrogram | 55.46 | 58.26 | 65.65 | 70.74 | 72.00 | 75.96 | 78.37 | 80.22 |
| CRNN | Spectrogram | 75.66 | 81.14 | 84.18 | 84.77 | 85.56 | 86.77 | 87.75 | 87.99 |
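This excerpt does not describe how the noisy test conditions were generated. A common way to obtain a clip at a chosen SNR is to rescale the noise so that the ratio of signal power to noise power matches the target before adding the two. The sketch below shows that general technique only; the function names, the synthetic tone and Gaussian noise, and the 16 kHz sampling rate are assumptions, not the authors' procedure.

```python
import math
import random


def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal/noise power equals `snr_db`, then add it.

    Returns the mixture and the scaled noise.
    """
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    target_noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(signal, scaled_noise)]
    return mixture, scaled_noise


def snr_db(signal, noise):
    """SNR in dB between a clean signal and a noise sequence."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))


# Hypothetical one-second example at an assumed 16 kHz sampling rate.
SAMPLE_RATE = 16000
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 400 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
noise = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_RATE)]

mixture, scaled_noise = mix_at_snr(clean, noise, snr_db=-10)
achieved = snr_db(clean, scaled_noise)
print(f"achieved SNR: {achieved:.2f} dB")
```

At −10 dB the noise carries ten times the power of the signal, which gives a sense of how demanding the second column of the table is.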
Performances of THD-RawNet with different input sizes.
| Input Size (s) | 0.25 s | 0.5 s | 0.75 s | 1 s | 2 s |
|---|---|---|---|---|---|
| Accuracy | 92.11% | 93.71% | 94.81% | 95.11% | 95.53% |
| Inference time (ms/sample) | 1 ms | 2 ms | 4 ms | 5 ms | 10 ms |
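One way to read the table above is through the real-time factor, meaning the inference time divided by the duration of audio processed. The sketch below computes it from the tabulated values; the term and the function `real_time_factor` are introduced here for illustration, and the hardware behind the timings is not stated in this excerpt.

```python
# Values copied from the input-size table.
INPUT_SECONDS = [0.25, 0.5, 0.75, 1.0, 2.0]
INFERENCE_MS = [1, 2, 4, 5, 10]
ACCURACY_PCT = [92.11, 93.71, 94.81, 95.11, 95.53]


def real_time_factor(inference_ms: float, input_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 means faster than real time."""
    return (inference_ms / 1000) / input_seconds


rtf = [real_time_factor(ms, s) for ms, s in zip(INFERENCE_MS, INPUT_SECONDS)]
for s, acc, r in zip(INPUT_SECONDS, ACCURACY_PCT, rtf):
    print(f"{s:>4} s input: accuracy {acc:.2f}%, real-time factor {r:.4f}")
```

Every configuration stays below a real-time factor of 0.01, so the choice of input size is mainly a trade-off between accuracy and how much audio must be buffered before a decision, not compute cost.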