Shengzhao Tian1, Duanbing Chen1,2,3, Hang Wang1, Jingfa Liu4,5.
Abstract
In underwater acoustic target recognition, deep learning methods have proven effective at recognizing targets directly from the original signal waveform. Previous methods often use large convolutional kernels to extract features at the beginning of the network, which leads to shallow, structurally imbalanced networks, so the power of nonlinear transformation brought by network depth is not fully exploited. A deep convolution stack is a network frame with a flexible and balanced structure, but it has not been well explored in underwater acoustic target recognition, even though such frames have proven effective in other deep learning fields. In this paper, a multiscale residual unit (MSRU) is proposed to construct a deep convolution stack network. Based on MSRU, a multiscale residual deep neural network (MSRDN) is presented to classify underwater acoustic targets. A dataset acquired in a real-world scenario is used to verify the proposed unit and model. By adding MSRU into generative adversarial networks, the validity of MSRU is demonstrated. Finally, MSRDN achieves the best recognition accuracy of 83.15%, an improvement of 6.99% over structurally related networks that take the original signal waveform as input and 4.48% over networks that take the time-frequency representation as input.
Year: 2021 PMID: 33953232 PMCID: PMC8099869 DOI: 10.1038/s41598-021-88799-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1 The basic unit of DRSN [47]. In each convolution layer, parameter c is the number of output channels, k is the kernel size, and s is the stride.
Figure 2 The structure of the Multiscale Residual Unit (MSRU). Two hyper-parameters C and S determine the output shape. The shape of the input data is [B, C_in, L], in which B is the batch size, C_in the channel number, and L the data length. The shape of the output data is [B, C, L/S]. The hyper-parameter S is usually set to 1 or 2. The parallel multiscale convolution module consists of four convolutional layers with different kernel sizes and a channel concat operation. The soft threshold learning module consists of a global average pooling layer and two fully connected nonlinear transformation layers. In every convolution layer, parameter c is the number of output channels, k is the kernel size, and s is the stride. All convolution layers use "same" padding.
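The soft threshold learning module described above applies a channel-wise soft-thresholding (shrinkage) operation, with thresholds produced from global average pooling followed by two fully connected layers. The sketch below illustrates the shrinkage operation itself in NumPy; `soft_threshold` and `channel_thresholds` are our own names, and the fixed gating value `alpha` is a stand-in for the sigmoid-gated output of the two learned fully connected layers, not the paper's implementation.

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise soft thresholding: shrink |x| toward zero by tau,
    zeroing any value whose magnitude is below tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def channel_thresholds(x, alpha=0.5):
    """Sketch of the learned per-channel threshold: scale the global
    average of |x| along the length axis by a gating value in (0, 1).
    In the real module, alpha would come from GAP -> FC -> FC -> sigmoid."""
    # x has shape (B, C, L); the result has shape (B, C, 1) for broadcasting.
    return alpha * np.abs(x).mean(axis=-1, keepdims=True)
```

Because the threshold is learned per channel, each channel can suppress a different noise floor, which is the denoising intuition inherited from DRSN.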
Figure 3 The structure of the multiscale residual deep neural network (MSRDN). The shape of the input data is [B, 1, L], in which B is the batch size and L the data length. The shape of the output is [B, N], in which B is the batch size and N the number of predicted categories. The head of the network consists of four parallel convolutions with different kernel sizes. The main body of MSRDN is stacked from MSRUs. According to the value of the hyper-parameter C, the MSRUs are divided into four convolution stacks. The stacks can be connected directly to each other because of the independence and flexibility of MSRU.
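Since every MSRU with "same" padding and stride S maps a length-L input to length ceil(L/S), the temporal resolution through the whole stack can be tracked with simple bookkeeping. The sketch below is illustrative only: the function names and the example stride schedule (one stride-2 unit per stack) are our assumptions, not the paper's configuration.

```python
import math

def msru_out_len(L, S):
    """Output length of one stride-S convolution with 'same' padding."""
    return math.ceil(L / S)

def stack_out_len(L, strides):
    """Propagate the data length through a sequence of MSRUs,
    one stride per unit."""
    for S in strides:
        L = msru_out_len(L, S)
    return L
```

For example, a hypothetical 48000-sample input passed through four stride-2 units would come out at length 3000; stride-1 units leave the length unchanged.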
The number of recordings.
| Class label | A | B | C | D |
|---|---|---|---|---|
| Total number of recordings | 750 | 750 | 750 | 750 |
| Number of training recordings | 530 | 538 | 530 | 530 |
| Number of testing recordings | 220 | 212 | 220 | 220 |
Each recording is a 5 min audio file in WAV format. Each recording is sliced into 100 segments of 3 s duration, which serve as the input samples for the neural networks.
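The slicing step above can be sketched as reshaping each waveform into fixed-length, non-overlapping windows. The function name and the 16 kHz sample rate in the usage note are our assumptions (the paper's sample rate is not stated here); the 5 min / 3 s arithmetic giving 100 segments follows directly from the text.

```python
import numpy as np

def slice_recording(wave, sr, seg_sec=3.0):
    """Slice a 1-D waveform into non-overlapping segments of seg_sec seconds,
    dropping any trailing remainder shorter than one segment."""
    seg_len = int(sr * seg_sec)
    n = len(wave) // seg_len
    return wave[:n * seg_len].reshape(n, seg_len)
```

Under an assumed 16 kHz rate, a 300 s recording yields 100 segments of 48000 samples each, matching the 100-samples-per-recording figure in the text.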
Formal description of the confusion matrix. The entry $n_{XY}$ denotes the number of samples whose true class is X and whose predicted class is Y.
| Truth | Prediction | | | |
|---|---|---|---|---|
| | Class A | Class B | Class C | Class D |
| Class A | $n_{AA}$ | $n_{AB}$ | $n_{AC}$ | $n_{AD}$ |
| Class B | $n_{BA}$ | $n_{BB}$ | $n_{BC}$ | $n_{BD}$ |
| Class C | $n_{CA}$ | $n_{CB}$ | $n_{CC}$ | $n_{CD}$ |
| Class D | $n_{DA}$ | $n_{DB}$ | $n_{DC}$ | $n_{DD}$ |
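With rows as true classes and columns as predictions, per-class precision and recall and the overall accuracy follow from column sums, row sums, and the trace. A minimal NumPy sketch (our own helper name):

```python
import numpy as np

def metrics(cm):
    """Per-class precision/recall and overall accuracy from a confusion
    matrix whose rows are true classes and columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted-as-class
    recall = np.diag(cm) / cm.sum(axis=1)     # correct / true-class total
    accuracy = np.trace(cm) / cm.sum()        # correct / all samples
    return precision, recall, accuracy
```

Applied to the sample confusion matrix reported later in the paper, this reproduces the listed 83.15% accuracy and the per-class precision/recall values.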
Figure 4Time-Frequency Representation of each category.
Classification experimental results.
| Input | Models | Accuracy (%) | Avg precision (%) | Avg recall (%) | Macro F1 (%) | AUC | Parameter size |
|---|---|---|---|---|---|---|---|
| Wave | ResNet | 76.16 | 76.18 | 76.06 | 76.08 | 0.9147 | 35.95M |
| Wave | DRSN | 80.30 | 80.20 | 80.20 | 80.15 | 0.9143 | 111.97M |
| Wave | DNN | 81.19 | 80.76 | 81.05 | 80.82 | 0.9278 | 61.61M |
| Wave | LSDN | 81.99 | 82.02 | 81.91 | 81.94 | | 137.63M |
| Wave | MSRDN | 83.15 | | | | 0.9307 | 54.38M |
| T-F | Inception-Res | 82.38 | 82.26 | 82.32 | 82.17 | 0.9165 | 29.82M |
| T-F | DenseNet | 81.86 | 81.66 | 81.78 | 81.64 | 0.9293 | 18.09M |
| T-F | ResNet | 80.10 | 80.13 | 80.04 | 79.96 | 0.9111 | 42.49M |
| T-F | MLENET | 78.67 | 78.46 | 78.59 | 78.45 | 0.9297 | 3.19M |
Confusion matrix of samples.
| Truth | Prediction | ||||
|---|---|---|---|---|---|
| Class A | Class B | Class C | Class D | Recall (%) | |
| Class A | 16944 | 3040 | 1368 | 648 | 77.02 |
| Class B | 3393 | 15027 | 574 | 2206 | 70.88 |
| Class C | 571 | 737 | 20554 | 138 | 93.43 |
| Class D | 352 | 1509 | 154 | 19985 | 90.84 |
| Precision | 79.70% | 73.98% | 90.75% | 86.98% | 83.15 |
The accuracy is listed at the bottom-right corner.
Generative experimental results.
| Model | FID(TEG) | FID(TRG) |
|---|---|---|
| WaveGAN | 1.9131 | 2.3598 |
| MSRWaveGAN | 1.0589 | 2.1155 |
| BigGAN | 9.7091 | 9.4453 |
| MSBigGAN | 1.6397 | 2.6340 |
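The FID scores above compare Gaussian fits to real and generated feature embeddings: FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^(1/2)). A NumPy-only sketch of this formula (our helper names; it uses the symmetric form (S1^(1/2) S2 S1^(1/2))^(1/2) so a plain eigendecomposition suffices):

```python
import numpy as np

def psd_sqrt(A):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def frechet_distance(mu1, S1, mu2, S2):
    """Frechet distance between Gaussians (mu1, S1) and (mu2, S2):
    ||mu1 - mu2||^2 + Tr(S1) + Tr(S2) - 2 Tr((S1 S2)^(1/2))."""
    S1h = psd_sqrt(S1)
    # Tr((S1 S2)^(1/2)) via the symmetric matrix S1^(1/2) S2 S1^(1/2)
    covmean_tr = np.sqrt(np.clip(np.linalg.eigvalsh(S1h @ S2 @ S1h), 0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(S1) + np.trace(S2) - 2.0 * covmean_tr)
```

Identical distributions score 0, and lower is better, which is why the MSRU-augmented generators in the table improve on their baselines.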
Figure 5 The loss curves of the discriminators and generators. (a) Generator loss of BigGAN and MSBigGAN. (b) Discriminator loss of BigGAN and MSBigGAN. (c) Generator loss of WaveGAN and MSRWaveGAN. (d) Discriminator loss of WaveGAN and MSRWaveGAN. A smoothing function is applied, with the smoothing factor set to 0.8 in all subfigures.