| Literature DB >> 36176679 |
Guichao Lin, Chenglin Wang, Yao Xu, Minglong Wang, Zhihao Zhang, Lixue Zhu.
Abstract
There is an urgent need to develop intelligent harvesting robots to relieve the rising cost of manual picking. A key problem in robotic harvesting is how to recognize tree parts efficiently without losing accuracy, thereby helping the robots plan collision-free paths. This study introduces a real-time tree-part segmentation network that improves a fully convolutional network with channel and spatial attention. A lightweight backbone is first deployed to extract low-level and high-level features. Because these features may contain redundant information in their channel and spatial dimensions, a channel and spatial attention module is proposed to enhance informative channels and spatial locations. On this basis, a feature aggregation module is investigated that fuses the low-level details and high-level semantics to improve segmentation accuracy. A tree-part dataset with 891 RGB images was collected, and each image was manually annotated in a per-pixel fashion. Experimental results show that with MobileNetV3-Large as the backbone, the proposed network obtained intersection-over-union (IoU) values of 63.33 and 66.25% for the branches and fruits, respectively, while requiring only 2.36 billion floating point operations (FLOPs); with MobileNetV3-Small as the backbone, it achieved IoU values of 60.62 and 61.05% for the branches and fruits, respectively, at a cost of 1.18 billion FLOPs. These results demonstrate that the proposed network can segment tree parts efficiently without loss of accuracy, and can therefore be applied to harvesting robots for collision-free path planning.
Keywords: MobileNetV3; attention mechanism; harvesting robot; neural network; tree-part segmentation
Year: 2022 PMID: 36176679 PMCID: PMC9513386 DOI: 10.3389/fpls.2022.991487
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 6.627
FIGURE 1Image example. (A) A guava tree. (B) Different parts of the guava tree, where the red, green, and black regions represent the fruit, branch, and background, respectively.
FIGURE 2Design details of CSAM. Note that Conv is convolutional operation, and BN is batch normalization; 1 × 1 represents the kernel size, H × W × C and H × W × C/r denote the tensor shape (height, width, and depth); the first ⊗ refers to channel-wise multiplication, and the second ⊗ is element-wise multiplication.
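The caption above pins down the dataflow of CSAM: channel attention first (the first ⊗, channel-wise multiplication), then a spatial mask (the second ⊗, element-wise multiplication). As a rough illustration only, here is a minimal NumPy sketch of that channel-then-spatial pattern; the dense matrices standing in for the 1 × 1 convolutions, the ReLU in the squeeze step, and the omission of batch normalization are all assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def csam(x, w1, w2, w_sp):
    """Channel-and-spatial attention sketch (hypothetical weights).
    x: feature map (H, W, C); w1: (C, C//r) and w2: (C//r, C) emulate the two
    1x1 convolutions of the channel branch; w_sp: (C, 1) emulates a 1x1
    convolution producing a one-channel spatial mask."""
    # channel attention: global average pool -> squeeze/excite -> sigmoid gate
    pooled = x.mean(axis=(0, 1))                     # (C,)
    ch = sigmoid(np.maximum(pooled @ w1, 0.0) @ w2)  # (C,) channel weights
    x = x * ch                                       # first ⊗: channel-wise
    # spatial attention: 1x1 conv over channels -> sigmoid mask over locations
    sp = sigmoid(x @ w_sp)                           # (H, W, 1) spatial weights
    return x * sp                                    # second ⊗: element-wise
```

Both gates are sigmoids, so each channel and each location is rescaled by a factor in (0, 1), which matches the caption's two-stage multiplication.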
FIGURE 3Design details of FAM. Note that up refers to up-sampling by bilinear interpolation.
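FAM brings coarse feature maps to a common resolution by bilinear up-sampling before fusing them. A self-contained NumPy sketch of that up-sampling is below; the half-pixel sampling convention and element-wise sum as the fusion operator are assumptions (the caption specifies only bilinear interpolation, not the exact fusion op).

```python
import numpy as np

def bilinear_upsample(x, scale):
    """Bilinear up-sampling of a (H, W, C) feature map by an integer scale."""
    H, W, C = x.shape
    Ho, Wo = H * scale, W * scale
    # source coordinates (half-pixel convention, an assumption)
    ys = (np.arange(Ho) + 0.5) / scale - 0.5
    xs = (np.arange(Wo) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None, None]   # row interpolation weights
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :, None]   # column weights
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def fam_fuse(feats):
    """Fuse multi-scale maps {G2..G5}: up-sample all to the finest resolution,
    then sum element-wise (sum-fusion and matching channel counts assumed)."""
    H = max(f.shape[0] for f in feats)
    return sum(bilinear_upsample(f, H // f.shape[0]) for f in feats)
```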
FIGURE 4Illustration of the segmentation head. Note that S is the scale ratio of up-sampling, and N is the number of classes.
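As the caption notes, the segmentation head maps features to N class scores and up-samples by a scale ratio S. A hypothetical NumPy sketch of that pipeline (a dense matrix stands in for a 1 × 1 classifier convolution, and nearest-neighbour up-sampling is a simplification of what the network actually uses):

```python
import numpy as np

def segmentation_head(feat, w, s):
    """feat: (H, W, C) feature map; w: (C, N) weights of a 1x1 conv mapping
    C channels to N class scores; s: up-sampling scale ratio S."""
    scores = feat @ w                                 # (H, W, N) class scores
    # nearest-neighbour up-sampling by S (simplification of interpolation)
    scores = scores.repeat(s, axis=0).repeat(s, axis=1)
    return scores.argmax(axis=-1)                     # (S*H, S*W) label map
```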
FIGURE 5Overview of the tree-part segmentation network, where three-dimensional blocks represent feature maps and two-dimensional blocks refer to convolutional modules.
Ablations on the backbone and feature aggregation module.
| Row | AC | R | NF | Branch IoU (%) | Fruit IoU (%) | Background IoU (%) | mIoU (%) | PA (%) | FPS | #Params | FLOPs |
| 1 | ✓ | x | 4 | 63.37 | 66.67 | 93.05 | 74.03 | 93.76 | 32.84 | 6.9M | 3.48B |
| 2 | ✓ | ✓ | 4 | 62.51 | 67.05 | 93.18 | 74.25 | 93.87 | 33.85 | 5.7M | 3.08B |
| 3 | x | x | 4 | 63.40 | 66.03 | 93.26 | 74.23 | 93.95 | 33.80 | 6.9M | 2.44B |
| 4 | x | ✓ | 4 | 63.33 | 66.25 | 93.12 | 74.23 | 93.84 | 36.00 | 5.7M | 2.36B |
| 5 | x | ✓ | 3 | 58.72 | 63.14 | 92.20 | 71.35 | 92.96 | 34.67 | 5.7M | 1.66B |
| 6 | x | ✓ | 2 | 49.74 | 61.16 | 90.72 | 67.21 | 91.49 | 34.36 | 5.7M | 1.46B |
AC, Apply atrous convolution in the last block of the backbone; R, Remove the last layer in stage 5 of the backbone; NF, Number of feature maps fused in FAM. When NF = 4, {G2, G3, G4, G5} are fused. When NF = 3, {G3, G4, G5} are fused. When NF = 2, {G4, G5} are fused. M and B represent million and billion, respectively.
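The AC column toggles atrous (dilated) convolution in the last backbone block. As a reminder of the operation itself, here is a minimal single-channel NumPy sketch showing how dilation enlarges the receptive field without adding parameters; valid padding is chosen for brevity and is not what the network necessarily uses.

```python
import numpy as np

def atrous_conv2d(x, k, dilation=2):
    """Single-channel atrous (dilated) convolution with 'valid' padding.
    x: (H, W) input; k: (kh, kw) kernel. Dilation inserts gaps between kernel
    taps, so a kh x kw kernel covers an effective window of
    ((kh-1)*dilation+1) x ((kw-1)*dilation+1) pixels."""
    kh, kw = k.shape
    eh = (kh - 1) * dilation + 1                  # effective kernel height
    ew = (kw - 1) * dilation + 1                  # effective kernel width
    H, W = x.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(kh):
        for j in range(kw):
            # each kernel tap reads the input at a dilated offset
            out += k[i, j] * x[i * dilation: i * dilation + out.shape[0],
                               j * dilation: j * dilation + out.shape[1]]
    return out
```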
Ablations on the auxiliary segmentation head, which is inserted after the output of different stages in the backbone.
| Stage | Branch IoU (%) | Fruit IoU (%) | Background IoU (%) | mIoU (%) | PA (%) |
| 2 | 62.45 | 65.32 | 92.95 | 73.58 | 93.68 |
| 3 | 63.33 | 66.25 | 93.12 | 74.23 | 93.84 |
| 4 | 64.04 | 61.96 | 93.33 | 73.11 | 94.02 |
| 5 | 63.07 | 61.98 | 93.23 | 72.76 | 93.92 |
Accuracy and real-time performance of the proposed network and comparison methods on the test set.
| Methods | Backbone | Branch IoU (%) | Fruit IoU (%) | Background IoU (%) | mIoU (%) | PA (%) | FPS | #Params | FLOPs |
| Ours | MobileNetV3-Large | 63.33 | 66.25 | 93.12 | 74.23 | 93.84 | 36.00 | 5.7M | 2.36B |
| Ours | MobileNetV3-Small | 60.62 | 61.05 | 92.82 | 71.50 | 93.52 | 37.91 | 2.7M | 1.18B |
| LR-ASPP | MobileNetV3-Large | 60.05 | 58.60 | 92.85 | 70.50 | 93.52 | 36.67 | 5.7M | 2.37B |
| DeepLabV3 | MobileNetV3-Large | 56.34 | 58.82 | 92.14 | 69.11 | 92.85 | 35.78 | 13.5M | 11.58B |
| DeepLabV3+ | MobileNetV3-Large | 62.59 | 61.05 | 93.36 | 72.33 | 94.00 | 31.52 | 14.2M | 35.73B |
| FANet | ResNet18 | 54.71 | 57.57 | 92.25 | 68.17 | 92.97 | 36.65 | 13.8M | 6.93B |
FIGURE 6Visual examples illustrating results of our network and comparison networks. (A) RGB image. (B) Ground truth. (C) Ours (MobileNetV3-Large). (D) LR-ASPP. (E) DeepLabV3. (F) DeepLabV3+. (G) FANet.