| Literature DB >> 34884076 |
Yu-Bang Chang, Chieh Tsai, Chang-Hong Lin, Poki Chen.
Abstract
As autonomous driving techniques become increasingly valued and widespread, real-time semantic segmentation has become a popular and challenging topic in deep learning and computer vision in recent years. However, to deploy deep learning models on the edge devices that accompany sensors on vehicles, the network must offer the best possible trade-off between accuracy and inference time. Previous works either sacrificed accuracy to obtain faster inference or sought the best accuracy under a real-time constraint; nevertheless, the accuracy of real-time semantic segmentation methods still lags far behind that of general semantic segmentation methods. We therefore propose a network architecture based on a dual encoder and a self-attention mechanism. Compared with preceding works, it achieves 78.6% mIoU at 39.4 FPS at 1024 × 2048 resolution on the Cityscapes test submission.
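The abstract describes the design only at a high level. The following is a minimal, hypothetical sketch of the general pattern it names, a dual-encoder segmentation network whose fused features pass through spatial self-attention; the placeholder encoders, channel widths, class count, and fusion-by-concatenation scheme are our assumptions for illustration, not the authors' exact DESANet definition.

```python
# Hedged sketch of "dual encoder + self-attention" for semantic segmentation.
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Non-local-style self-attention over spatial positions (assumed design)."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual scale

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c/8)
        k = self.key(x).flatten(2)                     # (b, c/8, hw)
        attn = torch.softmax(q @ k, dim=-1)            # (b, hw, hw)
        v = self.value(x).flatten(2)                   # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # residual connection

class DualEncoderSeg(nn.Module):
    """Two parallel encoders (e.g., a detail branch and a context branch),
    fused by concatenation, refined with self-attention, then upsampled."""
    def __init__(self, num_classes: int = 19):  # 19 = Cityscapes classes
        super().__init__()
        def encoder():  # placeholder stub; real backbones would go here
            return nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
        self.detail, self.context = encoder(), encoder()
        self.attn = SelfAttention2d(256)
        self.head = nn.Sequential(
            nn.Conv2d(256, num_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        fused = torch.cat([self.detail(x), self.context(x)], dim=1)
        return self.head(self.attn(fused))

# logits = DualEncoderSeg()(torch.randn(1, 3, 128, 256))  # -> (1, 19, 128, 256)
```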
Keywords: autonomous driving; convolutional neural network; deep learning; edge devices; image recognition; real-time semantic segmentation
Year: 2021 PMID: 34884076 PMCID: PMC8659896 DOI: 10.3390/s21238072
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Speed–accuracy comparison on the Cityscapes dataset. Our method achieves a state-of-the-art trade-off between accuracy and inference time.
Figure 2. The proposed overall network architecture.
Figure 3. Refinement module (a plausible sketch follows).
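The record shows the refinement module only as a figure and gives no internals, so the following is merely a plausible sketch of a common refinement design: upsample the deeper feature, concatenate it with the higher-resolution skip feature, and refine with a 3×3 convolution. Every name and choice here is assumed, not taken from the paper.

```python
# Hypothetical refinement module: deep/skip fusion + 3x3 conv refinement.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementModule(nn.Module):
    def __init__(self, deep_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(deep_ch + skip_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep, skip):
        # Match the spatial size of the higher-resolution skip feature.
        deep = F.interpolate(deep, size=skip.shape[-2:], mode="bilinear",
                             align_corners=False)
        return self.refine(torch.cat([deep, skip], dim=1))
```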
Figure 4. Factorized atrous spatial pyramid pooling module.
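A hedged sketch of what a factorized atrous spatial pyramid pooling (ASPP) module like the one in Figure 4 typically looks like: parallel atrous branches whose 3×3 atrous convolutions are factorized into depthwise + pointwise pairs to cut computation. The dilation rates and channel counts below are assumptions, not the paper's values.

```python
# Assumed factorized-ASPP design: depthwise atrous 3x3 + pointwise 1x1 branches.
import torch
import torch.nn as nn

class FactorizedASPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        for r in rates:
            self.branches.append(nn.Sequential(
                # depthwise atrous 3x3, then pointwise 1x1 (the factorization)
                nn.Conv2d(in_ch, in_ch, 3, padding=r, dilation=r,
                          groups=in_ch, bias=False),
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ))
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1, bias=False)

    def forward(self, x):
        # Concatenate the multi-rate context branches, then project back down.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```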
Quantitative results on the Cityscapes test set.
| Method | Resolution | GPU | mIoU (%) | FPS | Parameters |
|---|---|---|---|---|---|
| SegNet | 360 × 640 | Titan | 56.1 | 14.6 | 29.5 M |
| ESPNet | 512 × 1024 | Titan X | 60.3 | 112.9 | 0.4 M |
| ESPNetv2 | 512 × 1024 | Titan X | 62.1 | 80.0 | 0.8 M |
| ERFNet | 512 × 1024 | Titan X (M) | 68.0 | 41.7 | 2.1 M |
| ICNet | 1024 × 2048 | Titan X | 69.5 | 30.3 | 26.5 M |
| BiSeNet (Xception39) | 768 × 1536 | Titan XP | 68.4 | 105.8 | 5.8 M |
| BiSeNet (ResNet-18) | 768 × 1536 | Titan XP | 74.7 | 65.5 | 49.0 M |
| DFANet | 1024 × 1024 | Titan X | 71.3 | 100.0 | 7.8 M |
| SwiftNet | 1024 × 2048 | GTX 1080Ti | 75.5 | 39.9 | 11.8 M |
| FANet-34 | 1024 × 2048 | Titan X | 75.5 | 58.0 | - |
| GAS | 769 × 1537 | Titan XP | 71.8 | 108.4 | - |
| STDC2-Seg75 | 768 × 1536 | GTX 1080Ti | 76.8 | 97.0 | - |
| HyperSeg-M | 512 × 1024 | GTX 1080Ti | 75.8 | 36.9 | 10.1 M |
| DESANet-RN (ours) | 1024 × 2048 | GTX 1080Ti | 77.7 | 58.3 | 15.3 M |
| DESANet-HN (ours) | 1024 × 2048 | GTX 1080Ti | 78.6 | 39.4 | 6.2 M |
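The FPS column above depends heavily on the measurement protocol (GPU, resolution, batch size, GPU synchronization). A minimal sketch of a typical benchmarking loop follows, assuming PyTorch and single-image inference; it illustrates how numbers like these are usually obtained and is not the authors' script.

```python
# Typical segmentation FPS measurement: warm-up, synchronized timed loop.
import time
import torch

@torch.no_grad()
def measure_fps(model, resolution=(1024, 2048), iters=100, warmup=10):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(1, 3, *resolution, device=device)
    for _ in range(warmup):          # warm-up passes (cuDNN autotuning, caches)
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()     # ensure queued kernels have finished
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```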
Figure 5. (a) Input image and the results of the (b) ResNet-18 backbone [28] and the (c) HarDNet-68ds [37] backbone.
Figure 6. (a) Input image and the results of the (b) ResNet-18 backbone [28] and the (c) HarDNet-68ds [37] backbone.