| Literature DB >> 31035673 |
Meihan Wu, Qi Wang, Eric Rigall, Kaige Li, Wenbo Zhu, Bo He, Tianhong Yan.
Abstract
This paper presents a novel and practical convolutional neural network architecture for semantic segmentation of side scan sonar (SSS) images. As a widely used sensor for marine surveys, SSS provides high-resolution images of the seafloor and underwater targets. However, because background pixels dominate SSS images, class imbalance remains an issue. Moreover, SSS images contain undesirable speckle noise and intensity inhomogeneity. We define and detail a network and training strategy that tackle these three important issues in SSS image segmentation. Our proposed method performs image-to-image prediction by leveraging fully convolutional neural networks and deeply-supervised nets. The architecture consists of an encoder network to capture context, a corresponding decoder network that restores full input-size resolution feature maps from low-resolution ones for pixel-wise classification, and a single-stream deep neural network with multiple side-outputs to optimize edge segmentation. We measured the prediction time of our network on our dataset, implemented on an NVIDIA Jetson AGX Xavier, and compared it to other similar semantic segmentation networks. The experimental results show that the presented method brings obvious advantages for SSS image segmentation and is applicable to real-time processing tasks.
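The deeply-supervised training described in the abstract attaches a loss to every side-output in addition to the final fused prediction. A minimal NumPy sketch of such an objective, assuming a binary cross-entropy loss and a uniform side-output weight (the function names and `side_weight` parameter are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a
    binary ground-truth mask."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def deep_supervision_loss(side_logits, fused_logits, target, side_weight=1.0):
    """Total loss = loss on the fused (final) prediction plus a weighted
    loss on every side-output, so intermediate encoder layers receive
    gradient signal directly (deep supervision)."""
    loss = bce(sigmoid(fused_logits), target)
    for logits in side_logits:
        loss += side_weight * bce(sigmoid(logits), target)
    return loss
```

With zero logits (probability 0.5 everywhere) and an all-ones target, the fused loss alone is ln 2, and each side-output adds another ln 2.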
Keywords: deeply-supervised nets; fully convolutional neural networks; image-to-image prediction; imbalance classification; semantic segmentation; side scan sonar (SSS)
Year: 2019 PMID: 31035673 PMCID: PMC6540294 DOI: 10.3390/s19092009
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. An illustration of the ECNet architecture. Three side-output layers are placed behind encoders 1–3. Deep supervision is performed on each side-output layer, guiding the side-outputs toward the predictions we expect. The encoder–decoder module has no fully connected layers. Apart from decoder4, each decoder's input is the sum of the corresponding encoder output and the preceding decoder output. A decoder performs an up-sampling operation on its input based on the pooling indices the corresponding encoder provides. The feature maps of the final decoder output are given as inputs to a sigmoid classifier to complete the pixel-level classification.
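The index-based up-sampling in this caption (as in SegNet) means the encoder records where each max came from, and the decoder places values back at those positions. A minimal NumPy sketch assuming 2×2 non-overlapping windows; `max_pool_with_indices` and `max_unpool` are hypothetical helpers, not the paper's code:

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """2x2 max pooling that also records the flat index of each maximum,
    as the encoder side of an index-based encoder-decoder would."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k), dtype=x.dtype)
    indices = np.zeros((h // k, w // k), dtype=np.int64)
    for i in range(h // k):
        for j in range(w // k):
            window = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            li, lj = divmod(int(np.argmax(window)), k)
            pooled[i, j] = window[li, lj]
            indices[i, j] = (i * k + li) * w + (j * k + lj)
    return pooled, indices

def max_unpool(pooled, indices, out_shape):
    """Decoder-side up-sampling: place each pooled value back at the
    position its encoder maximum came from; all other positions stay zero."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    np.put(out, indices.ravel(), pooled.ravel())
    return out
```

Unlike learned transposed convolutions, this up-sampling needs no extra parameters, which keeps the decoder light; the sparse result is then densified by the decoder's convolutions.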
Figure 2. The architecture of an encoder unit (a) and a decoder unit (b).
The number of input and output feature maps in each block.
| Block | Encoder In | Encoder Out | Decoder Out |
|---|---|---|---|
| 1 | 3 | 64 | 32 |
| 2 | 64 | 128 | 64 |
| 3 | 128 | 256 | 128 |
| 4 | 256 | 512 | 256 |
The receptive field (rf) and stride size (s_size) of each encoder block in ECNet.
| Block | Encoder1 | Encoder2 | Encoder3 | Encoder4 |
|---|---|---|---|---|
| rf | 8 | 22 | 50 | 106 |
| s_size | 2 | 4 | 8 | 16 |
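The table's values follow from the standard recurrence rf_out = rf_in + (k − 1) × jump, where jump is the cumulative stride. The record does not state the per-block layer layout, but assuming each encoder block is three 3×3 convolutions (stride 1) followed by a 2×2 max pool (stride 2) reproduces the table exactly:

```python
def encoder_stats(num_blocks=4):
    """Receptive field and cumulative stride after each encoder block,
    assuming three 3x3 convs (stride 1) + one 2x2 max pool (stride 2)
    per block. Recurrence: rf += (kernel - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    stats = []
    for _ in range(num_blocks):
        for _ in range(3):        # three 3x3 convolutions, stride 1
            rf += (3 - 1) * jump
        rf += (2 - 1) * jump      # 2x2 max pooling
        jump *= 2                 # pooling doubles the cumulative stride
        stats.append((rf, jump))
    return stats

print(encoder_stats())  # [(8, 2), (22, 4), (50, 8), (106, 16)]
```

The output matches the tabulated (rf, s_size) pairs for encoders 1–4, lending support to the assumed block layout.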
Figure 3. The experimental scheme of our proposed SSS image segmentation method.
Figure 4. Examples of the raw SSS images.
Figure 5. Examples of our data: (a) the results of the interpolation and removal operations used to form the dataset; (b) the corresponding ground truth.
Figure 6. Examples of SSS images: (a) clear and standard; (b) some target pixels obvious, others weak; (c) dark; (d) discontinuous; (e) weak and unclear; (f) noisy.
Figure 7. The number of pixels in the dataset.
Comparison of different decoding methods in our architecture. (The best results are bold)
| Decoding Method | Pixel acc. (%) | Mean acc. (%) | Mean IU (%) | f.w. IU (%) |
|---|---|---|---|---|
| U-Net | 91.09 | 75.75 | 64.71 | 85.38 |
| SegNet | 90.79 | 73.57 | 63.21 | 84.86 |
| LinkNet | 91.01 | 74.40 | 63.94 | 85.17 |
| Ours | | | | |
Figure 8. The test results on different data in the test set. From left to right: the original SSS images, the corresponding ground truth, and the prediction results. Row 1 (a1 and a2): clear and standard. Row 2 (b1 and b2): some target pixels obvious, others weak. Row 3 (c1 and c2): dark. Row 4 (d1 and d2): discontinuous. Row 5 (e1 and e2): weak and unclear. Row 6 (f1 and f2): strong noise.
Comparison of the performance of different architectures on our dataset. (The best results are bold)
| Model | Pixel acc. (%) | Mean acc. (%) | Mean IU (%) | f.w. IU (%) |
|---|---|---|---|---|
| U-Net | | 75.02 | | |
| SegNet | 91.99 | 73.69 | 65.22 | 86.29 |
| LinkNet | 91.45 | 73.63 | 64.28 | 85.64 |
| Ours | 91.62 | | 66.18 | 86.10 |
Comparison of different architectures implemented on the NVIDIA Jetson AGX Xavier. (The best results are bold)
| Model | Parameters | Model Size | Time |
|---|---|---|---|
| U-Net | 31.0 M | 124.1 MB | 87.6 ms |
| SegNet | 28.4 M | 113.8 MB | 93.9 ms |
| LinkNet | 21.6 M | 86.7 MB | 29.4 ms |
| Ours | | | |