Xi Xu, Yi Qin, Dejun Xi, Ruotong Ming, Jie Xia.
Abstract
Image segmentation plays an important role in the sensing systems of autonomous underwater vehicles for fishing. By accurately perceiving marine organisms and the surrounding environment, the automatic catching of marine products can be implemented. However, existing segmentation methods cannot precisely segment marine animals due to the low quality and complex shapes of marine images collected in underwater conditions. A novel multi-scale transformer network (MulTNet) is proposed to improve the segmentation accuracy of marine animals; it simultaneously possesses the merits of a convolutional neural network (CNN) and a transformer. To alleviate the computational burden of the proposed network, a dimensionality reduction CNN module (DRCM) based on progressive downsampling is first designed to fully extract the low-level features, which are then fed into a proposed multi-scale transformer module (MTM). To capture rich contextual information from different subregions and scales, four parallel small-scale encoder layers with different head counts are constructed and then combined with a large-scale transformer layer to form the multi-scale transformer module. The comparative results demonstrate that MulTNet outperforms existing advanced image segmentation networks, with MIoU improvements of 0.76% on the marine animal dataset and 0.29% on the ISIC 2018 dataset. Consequently, the proposed method has important application value for segmenting underwater images.
Keywords: contextual information; marine animal; multi-scale transformer; semantic segmentation; subregion
Year: 2022 PMID: 36236322 PMCID: PMC9571946 DOI: 10.3390/s22197224
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
A summary of the segmentation methods mentioned in the literature.
| Methodology | Year | Highlights | Limitations |
|---|---|---|---|
| FCN | 2015 | It was the first model to use a CNN to extract features for semantic segmentation. | The model has poor robustness to image detail and does not consider relationships between pixels. |
| U-Net | 2015 | It exploits skip connections between the encoder and decoder to reduce the loss of context information. | The model generally performs poorly at capturing fine details and at explicitly building long-range dependencies. |
| Attention U-Net | 2018 | It suppresses irrelevant areas in the input image and highlights the salient features of specific local areas. | Attention gates (AGs) mainly extract spatial information from the region of interest; their ability to represent less relevant local regions is poor. |
| RefineNet | 2016 | It effectively exploits features from the downsampling process to achieve high-resolution prediction via long-range residual connections. | The model occupies large computing resources, resulting in low training speed, and additionally requires pre-trained weights for its backbone. |
| ResUNet | 2017 | Residual units simplify the training of deep networks, and skip connections facilitate information propagation. | The model cannot establish dependencies between pixels, so its segmentation performance on blurred images is poor. |
| DeepLabv3+ | 2018 | The model can capture sharper object boundaries and extract multi-scale contextual information. Moreover, the resolution of the encoder's feature maps can be arbitrarily controlled via atrous convolution. | Its atrous convolutions cause a loss of spatially continuous information. |
| PSPNet | 2017 | It aggregates context information from different regions to improve global information extraction. | It takes a long time to train and handles fine details relatively poorly. |
| CKDNet | 2020 | The model adopts knowledge transfer and diffusion strategies to aggregate semantic information from different tasks and boost segmentation performance. | The model consumes substantial computing resources, resulting in low training speed. |
| U-Net-FS | 2020 | Experiments show that U-Net-FS can capture numerous types of free surfaces with a Dice accuracy above 0.94. | The model only attends to local information and is trained at a single scale, so it handles changes in image size poorly. |
| Fish image segmentation combining K-means clustering with mathematical morphology | 2013 | The traditional K-means algorithm is improved by choosing the number of clusters from the number of peaks in the image's gray-level histogram. | It is sensitive to the choice of the initial value of K and to outliers, which can lead to poor segmentation performance. |
| TransUNet | 2021 | The model extracts abundant global context by converting image features into sequences and exploits low-level CNN spatial information via a U-shaped architecture. | It uses transposed convolution layers to restore feature maps, which often produces a checkerboard effect, i.e., discontinuous predictions among adjacent pixels. |
| Swin Transformer | 2021 | The shifted-window scheme improves efficiency by limiting self-attention computation to non-overlapping local windows. Experiments show state-of-the-art performance on classification, detection, and segmentation tasks. | Constrained by the shifted-window operation, the model must be modified and retrained for different input sizes, which is time-consuming. |
| CoTr | 2021 | It exploits a deformable self-attention mechanism to reduce the spatial and computational complexity of building long-range relationships on multi-scale feature maps. | Because both the transformer and 3D volumetric data demand a large amount of GPU memory, the method splits the data into small patches processed one at a time, losing features from the other patches. |
| TrSeg | 2021 | Unlike existing networks for multi-scale feature extraction, the model incorporates a transformer to generate dependencies from the original context information, which adaptively extracts multi-scale information well. | The model still has limitations in extracting low-level semantic information. |
| FAT-Net | 2021 | The model exploits both CNN and transformer encoder branches to extract rich local features and capture important global context information. | The network still struggles when the color variation in the image is too complex or the image contrast is too low. |
Figure 1. The detailed structure diagram of MulTNet.
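The abstract and Figure 1 describe the architecture as a dimensionality reduction CNN module (DRCM) that progressively downsamples the input, followed by a multi-scale transformer module (MTM) built from four parallel small-scale encoder layers with different head counts plus one large-scale transformer layer. The PyTorch sketch below illustrates that structure; the channel widths, head counts, branch-fusion rule (averaging), and class count are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRCM(nn.Module):
    """Progressive-downsampling CNN stem (sketch of the DRCM idea).

    Three stride-2 convolution stages reduce a 256x256 image to a 32x32
    feature map, shortening the token sequence the transformer must process.
    Channel widths are assumptions.
    """
    def __init__(self, in_ch=3, out_ch=128):
        super().__init__()
        chs = [in_ch, 32, 64, out_ch]
        self.stages = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1),
                nn.BatchNorm2d(chs[i + 1]),
                nn.GELU())
            for i in range(3)])

    def forward(self, x):
        return self.stages(x)  # (B, out_ch, H/8, W/8)

class MTM(nn.Module):
    """Multi-scale transformer module (sketch).

    Four parallel small-scale encoder layers with different head counts
    attend over the same tokens; their outputs are averaged and passed to
    one large-scale encoder layer. The fusion rule is an assumption.
    """
    def __init__(self, dim=128, heads=(1, 2, 4, 8), large_heads=8):
        super().__init__()
        def layer(h):
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=h, dim_feedforward=4 * dim,
                dropout=0.3, activation="gelu", batch_first=True)
        self.branches = nn.ModuleList(layer(h) for h in heads)
        self.large = layer(large_heads)

    def forward(self, tokens):  # tokens: (B, N, dim)
        fused = torch.stack([b(tokens) for b in self.branches]).mean(dim=0)
        return self.large(fused)

class MulTNetSketch(nn.Module):
    def __init__(self, num_classes=5, dim=128):  # class count is an assumption
        super().__init__()
        self.drcm, self.mtm = DRCM(out_ch=dim), MTM(dim=dim)
        self.head = nn.Conv2d(dim, num_classes, 1)

    def forward(self, x):
        f = self.drcm(x)                          # (B, C, h, w)
        B, C, h, w = f.shape
        tokens = self.mtm(f.flatten(2).transpose(1, 2))
        f = tokens.transpose(1, 2).reshape(B, C, h, w)
        # predict per-pixel classes and upsample back to the input size
        return F.interpolate(self.head(f), size=x.shape[2:],
                             mode="bilinear", align_corners=False)

if __name__ == "__main__":
    out = MulTNetSketch()(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 5, 256, 256])
```

Averaging the branch outputs keeps the token dimensionality fixed so the large-scale layer can consume them directly; the paper may combine the branches differently.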
Hyper-parameters of all the models used in this study.
| Parameter | Configuration |
|---|---|
| Optimizer | SGD |
| Learning rate | 0.03 |
| Weight decay | 0.01 |
| Momentum | 0.9 |
| Batch size | 6 |
| Image size | 256 × 256 |
| Activation function | GELU |
| Dropout | 0.3 |
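These hyper-parameters map directly onto a standard PyTorch training setup. A minimal sketch follows, reusing the MulTNetSketch module from above; the dummy data and the per-pixel cross-entropy loss are assumptions (the table does not specify a loss), while GELU and the 0.3 dropout already live inside the model's transformer layers.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors stand in for 256 x 256 images and masks (assumption).
images = torch.randn(12, 3, 256, 256)
masks = torch.randint(0, 5, (12, 256, 256))
loader = DataLoader(TensorDataset(images, masks), batch_size=6, shuffle=True)

model = MulTNetSketch()  # sketch model defined above, not the authors' code
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.03,            # learning rate
                            momentum=0.9,       # momentum
                            weight_decay=0.01)  # weight decay
criterion = torch.nn.CrossEntropyLoss()  # assumed loss; not stated in the table

model.train()
for imgs, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(imgs), targets)  # per-pixel cross-entropy
    loss.backward()
    optimizer.step()
```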
Comparison of various segmentation networks for the marine animal dataset.
| Model |  |  |  |  | MIoU |  |
|---|---|---|---|---|---|---|
| FCN-8s | 36.87% | 27.41% | 1.68% | 15.99% | 35.48% | 97.05% |
| U-Net | 40.91% | 29.14% | 3.45% | 26.47% | 39.01% | 96.97% |
| Attention U-Net | 41.54% | 30.87% | 3.87% | 27.28% | 39.76% | 96.95% |
| RefineNet | 43.14% | 32.63% | 4.02% | 31.32% | 41.21% | 96.89% |
| ResUNet | 43.79% | 33.02% | 4.56% | 33.29% | 41.96% | 96.93% |
| DeepLabv3+ | 44.24% | 34.21% | 5.22% | 33.23% | 42.40% | 96.96% |
| CKDNet | 46.05% | 39.26% | 4.31% | 33.58% | 43.54% | 97.09% |
| PSPNet | 45.38% | 37.67% | 5.62% | 34.13% | 43.62% | 97.07% |
| TrSeg | 46.32% | 40.67% | 4.71% | 33.47% | 43.98% | 97.11% |
| SegFormer | 46.67% | 39.73% | 5.91% | 33.51% | 44.06% | 97.10% |
| FAT-Net | 47.17% | 42.97% | 5.06% | 34.35% | 44.87% | 97.16% |
| MulTNet | 47.69% | 44.14% | 5.21% | 35.93% | 45.63% | 97.27% |
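MIoU in these tables is the standard mean intersection-over-union, i.e., the per-class IoU averaged over all classes. A minimal NumPy sketch computing MIoU and overall pixel accuracy from integer label maps (the class count and inputs are illustrative):

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Pixel-level confusion matrix; rows are ground truth, columns predictions."""
    valid = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)

def miou_and_pixel_acc(pred, gt, num_classes):
    cm = confusion_matrix(pred, gt, num_classes)
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)  # classes absent from both count as 0
    return iou.mean(), inter.sum() / cm.sum()

# Toy 3-class example.
gt = np.random.randint(0, 3, (64, 64))
pred = np.random.randint(0, 3, (64, 64))
miou, pa = miou_and_pixel_acc(pred, gt, 3)
print(f"MIoU = {miou:.4f}, pixel accuracy = {pa:.4f}")
```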
Figure 2. Examples of the marine animal segmentation obtained by FCN-8s, U-Net, Attention U-Net, RefineNet, ResUNet, DeepLabv3+, PSPNet, TrSeg, CKDNet, SegFormer, FAT-Net, and the proposed MulTNet.
Figure 3. Examples of different levels of blur obtained by the proposed MulTNet.
Comparison of various segmentation networks for the public ISIC 2018 dataset.
| Model |  |  | MIoU |  |
|---|---|---|---|---|
| FCN-8s | 89.98% | 67.47% | 79.03% | 92.12% |
| U-Net | 92.19% | 76.05% | 84.59% | 94.36% |
| Attention U-Net | 93.05% | 76.35% | 84.76% | 94.41% |
| RefineNet | 92.85% | 76.64% | 84.83% | 94.32% |
| ResUNet | 93.17% | 77.27% | 85.37% | 94.64% |
| DeepLabv3+ | 93.31% | 78.21% | 85.99% | 94.92% |
| PSPNet | 93.08% | 79.52% | 86.77% | 95.15% |
| CKDNet | 93.13% | 79.90% | 86.99% | 95.21% |
| SegFormer | 93.40% | 80.19% | 87.21% | 95.32% |
| TrSeg | 93.22% | 80.37% | 87.31% | 95.33% |
| FAT-Net | 93.65% | 81.09% | 87.80% | 95.56% |
| MulTNet | 94.04% | 81.48% | 88.09% | 95.71% |
Figure 4. Examples of the skin lesion segmentation on the ISIC 2018 dataset obtained by FCN-8s, U-Net, Attention U-Net, RefineNet, ResUNet, DeepLabv3+, PSPNet, TrSeg, SegFormer, CKDNet, FAT-Net, and the proposed MulTNet.
Figure 5. The training losses of various segmentation networks for the two datasets: (a) the marine animal dataset; (b) the ISIC 2018 dataset.
Comparison of computational costs.
| Model | Total Params | Training Speed (s/Iteration) | Testing Speed (s/Image) |
|---|---|---|---|
| FCN-8s | 134 M | 0.725 | 0.291 |
| RefineNet | 85 M | 0.381 | 0.148 |
| CKDNet | 52 M | 0.342 | 0.135 |
| PSPNet | 71 M | 0.317 | 0.124 |
| SegFormer | 64 M | 0.312 | 0.127 |
| TrSeg | 74 M | 0.293 | 0.113 |
| ResUNet | 67 M | 0.285 | 0.110 |
| DeepLabv3+ | 55 M | 0.238 | 0.096 |
| Attention U-Net | 42 M | 0.203 | 0.092 |
| FAT-Net | 30 M | 0.194 | 0.089 |
| U-Net | 32 M | 0.167 | 0.075 |
| MulTNet | 59 M | 0.183 | 0.087 |
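Parameter counts and per-image testing speeds like those above can be reproduced with a short PyTorch routine. The sketch below is generic (CPU timing, warm-up count, and input size are assumptions, not the paper's measurement protocol):

```python
import time
import torch

def count_params_m(model):
    """Total trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def seconds_per_image(model, n_runs=50, shape=(1, 3, 256, 256)):
    model.eval()
    x = torch.randn(shape)
    for _ in range(5):    # warm-up runs excluded from timing
        model(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)          # on GPU, torch.cuda.synchronize() would be needed
    return (time.perf_counter() - start) / n_runs

model = MulTNetSketch()  # sketch model from above, not the authors' network
print(f"{count_params_m(model):.1f} M params, "
      f"{seconds_per_image(model):.3f} s/image")
```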
Ablation experiments. The first three metric columns are for the ISIC 2018 dataset and the last three for the marine animal dataset.

| Model | MIoU |  |  | MIoU |  |  |
|---|---|---|---|---|---|---|
| Transformer | 73.89% | 89.80% | 90.34% | 23.99% | 25.00% | 95.97% |
| DRCM-Transformer | 85.36% | 92.29% | 94.53% | 40.31% | 42.33% | 96.97% |
| MulTNet | 88.09% | 94.04% | 95.71% | 45.63% | 47.69% | 97.27% |