Chenglin Wang, Suchun Liu, Yawei Wang, Juntao Xiong, Zhaoguo Zhang, Bo Zhao, Lufeng Luo, Guichao Lin, Peng He.
Abstract
As one of the representative algorithms of deep learning, the convolutional neural network (CNN), with its advantages of local perception and parameter sharing, has developed rapidly. CNN-based detection technology has been widely used in computer vision, natural language processing, and other fields. Fresh fruit production is an important socioeconomic activity, and CNN-based deep learning detection technology has been successfully applied to its key links. To the best of our knowledge, this review is the first to cover the whole production process of fresh fruit. We first introduce the network architecture and implementation principle of CNN and describe the training process of a CNN-based deep learning model in detail. We then survey a large number of articles that use CNN-based deep learning detection technology to address challenges in the key links of fresh fruit production, including fruit flower detection, fruit detection, fruit harvesting, and fruit grading. CNN-based object detection is elaborated from data acquisition to model training, and different CNN-based detection methods are compared within each link of fresh fruit production. The results of this investigation show that improved CNN deep learning models can realize their full detection potential when tailored to the characteristics of each link of fruit production. They also imply that CNN-based detection may, in the future, overcome the challenges posed by environmental issues, the exploration of new areas, and the execution of multiple tasks in fresh fruit production.
Keywords: computer vision; convolutional neural network; deep learning; fruit detection; fruit production
Year: 2022 PMID: 35651761 PMCID: PMC9149381 DOI: 10.3389/fpls.2022.868745
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 6.627
FIGURE 1. Convolutional neural network (CNN)-based detection application in main links of fresh fruit production.
TABLE 1. Structure and performance of common convolutional neural network (CNN) models for image detection.
| CNN model | Weight layers | Convolution layers | Kernel sizes | Activation function | Dropout | LRN | BN | Top-5 error (on ImageNet) |
| AlexNet | 8 | 5 | 3×3, 5×5, 11×11 | ReLU | √ | √ | – | 16.4% |
| VGG | 19 | 16 | 3×3 | ReLU | √ | – | – | 7.3% |
| GoogLeNet | 22 | 21 | 1×1, 3×3, 5×5, 7×7 | ReLU | √ | √ | – | 6.7% |
| ResNet | 152 | 151 | 1×1, 3×3, 7×7 | ReLU | √ | – | √ | 3.57% |
| DenseNet | 265 | 264 | 1×1, 3×3, 7×7 | ReLU | √ | – | √ | 5.29% |
| MobileNet | 28 | 27 | 1×1, 3×3 | ReLU | – | – | √ | –* |
*No relevant data for MobileNet were found in the public references.
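The Dropout, LRN, and BN columns above map directly to standard layers in modern frameworks. A minimal PyTorch sketch (illustrative building blocks, not any of the reviewed models) showing an AlexNet-style convolution with LRN, a VGG/ResNet-style convolution with BN, and a dropout classifier head:

```python
import torch
import torch.nn as nn

# AlexNet-style block: large 11x11 kernel, ReLU, local response normalization.
block_lrn = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5),          # "LRN" column
)

# VGG/ResNet-style block: small 3x3 kernel with batch normalization.
block_bn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),                    # "BN" column
    nn.ReLU(inplace=True),
)

# Classifier head with dropout, as used by AlexNet/VGG/GoogLeNet.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(p=0.5),                     # "Dropout" column
    nn.LazyLinear(1000),                   # 1,000 ImageNet classes
)

x = torch.randn(1, 3, 224, 224)            # dummy image batch
print(block_lrn(x).shape, block_bn(x).shape)
```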
FIGURE 2. Structure of PointNet.
FIGURE 3. Faster-R-CNN structure. A convolutional neural network extracts the feature map, and the RPN (region proposal network) then generates accurate region proposals from it. The region proposals are mapped onto the feature map, the ROI (region of interest) pooling layer collects the proposal boxes and computes proposal feature maps, and the FC (fully connected) layer finally predicts the category of each proposal.
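The whole pipeline in Figure 3 is available off the shelf. A minimal inference sketch with torchvision's Faster-R-CNN (ResNet-50-FPN backbone), in which the RPN and ROI pooling/align stages run inside a single forward call:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()  # COCO-pretrained weights
image = torch.rand(3, 600, 800)        # dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])       # one dict per input image
boxes = predictions[0]["boxes"]        # proposals refined by the FC head
labels = predictions[0]["labels"]      # predicted categories
scores = predictions[0]["scores"]      # per-box confidence
print(boxes.shape, scores[:5])
```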
TABLE 2. Summary of common CNN-based object detection models.
| Type | Name | Backbone | Bounding box generation | Additional blocks | FPS | mAP/% (VOC2012) | mAP/% (COCO) | References |
| Two-stage | R-CNN | AlexNet | SS | – | 0.03 | 59.2 | – | |
| | Fast-R-CNN | VGG-16 | SS + ROI pooling | – | 7.00 | 68.4 | 19.7 | |
| | Faster-R-CNN | VGG-16/ResNet-101 | RPN + ROI pooling | – | 7.00/5.00 | 70.4/73.8 | 21.9/34.9 | |
| | Mask-R-CNN | ResNeXt-101-FPN | RPN + ROI align | FCN | 11.00 | 73.9 | 39.8 | |
| One-stage | SSD | VGG-16 | Anchor | – | 19.3 | 78.5 | 28.8 | |
| | YOLOv1 | GoogLeNet | – | – | 45.0 | 57.9 | – | |
| | YOLOv2 | DarkNet-19 | Anchor | – | 40.0 | 73.5 | 21.6 | |
| | YOLOv3 | DarkNet-53 | Anchor | FPN, SPP | 51.0 | – | 33.0 | |
| | YOLOv4 | CSPDarkNet53 | Anchor | FPN+PAN, SPP | 23.0 | – | 43.5 | |
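Whether one-stage or two-stage, the detectors above emit many overlapping candidate boxes per object; non-maximum suppression (NMS) keeps only the highest-scoring box among duplicates. A minimal sketch with torchvision's implementation:

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],    # near-duplicate of the first box
                      [80., 80., 120., 120.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)   # indices of boxes to keep
print(keep)                                    # tensor([0, 2]): duplicate suppressed
```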
FIGURE 4. Different CNN-based algorithms for pear flower detection. (A) Original image, (B) object detection, (C) semantic segmentation, and (D) instance segmentation.
FIGURE 5. Example images with different image processing operations. (A) Original image, (B) vertically flipped image, (C) noise-injected image, (D) sharpened image, (E) Gaussian-blurred image, (F) randomly erased image, (G) image with brightness adjustment, (H) RGB2GRB image, and (I) grayscale image.
FIGURE 6. Apple inflorescence: (A) the central flower and the side flowers have a bud shape, (B) the central flower has a semi-open shape and the side flowers have a bud shape, (C) the central flower has a fully open shape and the side flowers have bud and semi-open shapes, and (D) the central flower and the side flowers have a fully open shape.
FIGURE 7. Procedure of image generation in Tian et al. (2020).
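Most of the augmentations in Figure 5 correspond to standard torchvision transforms; noise injection and the RGB-to-GRB channel swap are not built in, so the sketch below adds them as small lambdas (an assumption about how such images can be produced, not the authors' exact pipeline):

```python
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomVerticalFlip(p=1.0),                  # (B) vertical flip
    transforms.ColorJitter(brightness=0.5),                # (G) brightness adjustment
    transforms.RandomAdjustSharpness(sharpness_factor=2),  # (D) sharpening
    transforms.GaussianBlur(kernel_size=5),                # (E) Gaussian blur
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.05 * torch.randn_like(t)).clamp(0, 1)),  # (C) noise injection
    transforms.Lambda(lambda t: t[[1, 0, 2], ...]),        # (H) RGB -> GRB channel swap
    transforms.RandomErasing(p=1.0),                       # (F) random erasing
    # (I) grayscale would use transforms.Grayscale(num_output_channels=3)
])
# Usage: augmented = augment(pil_image) for a PIL input image.
```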
TABLE 3. Comparison of Caffe, TensorFlow, Keras, and PyTorch.
| Name | Caffe | TensorFlow | Keras | PyTorch |
| Supported languages | C++/Python/MATLAB | C++/Python | Python | Python |
| Supported hardware | CPU/GPU | CPU/GPU/Mobile | CPU/GPU/Mobile | CPU/GPU |
| Supported systems | Linux/Windows/macOS | Linux/Windows/macOS/Android/iOS | Linux/Windows/macOS/Android/iOS | Linux/Windows/macOS |
| Traits | Strong readability and expansibility; stable and superior performance | Comprehensive functionality, good visualization, and an active user community | Highly modular, with each module kept short and simple; easy to extend | Intuitive design, ease of use, and an active user community |
TABLE 4. Definitions of the first convolutional layer of LeNet-5 in different frameworks.
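As a hedged illustration of the comparison this table draws, the first convolutional layer of LeNet-5 (C1: six feature maps, 5×5 kernels, over a 32×32 single-channel input) can be declared in two of the frameworks above as follows; the tanh activation is the classical choice and is shown for illustration only:

```python
import torch.nn as nn
from tensorflow.keras import layers

# PyTorch: LeNet-5 layer C1 -- 6 feature maps, 5x5 kernels, 1 input channel.
conv1_torch = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)

# Keras: the same layer, declared with the input shape of a 32x32 grayscale image.
conv1_keras = layers.Conv2D(filters=6, kernel_size=(5, 5),
                            activation="tanh", input_shape=(32, 32, 1))
```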
FIGURE 8. Basic confusion matrix.
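The precision, recall, and F1 values reported throughout this review derive directly from the confusion-matrix entries (TP, FP, FN, TN) in Figure 8. A minimal sketch with hypothetical counts:

```python
tp, fp, fn, tn = 90, 10, 15, 885           # hypothetical confusion-matrix counts

precision = tp / (tp + fp)                 # fraction of detections that are correct
recall = tp / (tp + fn)                    # fraction of true objects that are detected
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```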
FIGURE 9. Instance matching and tracking by 3-D assignment. (Left) Key frames extracted from a video sequence captured with a 1080p camera. (Right) Graph-based tracking: each column represents instances found by a neural network, and each color represents an individual grape cluster in a video frame.
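The authors' exact graph-based formulation is not reproduced here; a minimal sketch of the underlying idea, matching instances between two frames by maximizing total bounding-box IoU with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # a, b: boxes as [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

frame_t  = np.array([[10, 10, 50, 50], [60, 20, 90, 70]])   # boxes at time t
frame_t1 = np.array([[12, 11, 52, 49], [58, 22, 88, 69]])   # boxes at time t+1
cost = np.array([[1 - iou(a, b) for b in frame_t1] for a in frame_t])
rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
print(list(zip(rows, cols)))               # [(0, 0), (1, 1)]: same clusters tracked
```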
FIGURE 10. Different maturity levels of passion fruit in Tu et al. (2018). (A) Near-young passion fruit. (B) Young passion fruit. (C) Near-mature passion fruit. (D) Mature passion fruit. (E) After-mature passion fruit.
FIGURE 11. Detection examples in Ni et al. (2020). The black rectangle contains the ID number and three traits (number, maturity, and compactness) of the corresponding sample.
FIGURE 12. Workflow of field tests (Apolo-Apolo et al., 2020).
TABLE 5. Summary of related studies on the application of CNN-based detection models in growing fruits.
| Platform | Purpose | Detected object | CNN-based detection model | Follow-up work | Remarks | References |
| Terrestrial platform | Yield estimation | Apple | Mask-R-CNN (2D detection) | SfM photogrammetry generates a 3D point cloud, and an SVM removes false positives | Detection accuracy: 76.2% (2D image detections) and 85.7% (3D detections). Prediction precision: | |
| | | Apple | YOLOv3 | Counting detected fruits for yield estimation | Detection accuracy: 84% | |
| | | Mango | Faster R-CNN | GPS/INS, color cameras with strobes, and LiDAR are used for fruit locating, tracking, and counting | Prediction accuracy: | |
| | | | MangoYOLO | Correction factors are used for estimating yield load | Detection precision: 98.3%. Estimation precision: 4.6–15.2% of packhouse fruit counts | |
| | | Tomato | Faster R-CNN | Stitching detected images into a tomato location map of a greenhouse; estimating tomato size from bounding box size | Model performance: average precision: 87% | |
| | | Cherry tomato | YOLOv3 | ResNet-50 classifies fruit clusters and counts the total fruit number | Prediction precision: RMSE = 6.37, MAPE = 13.9% | |
| | | Passion fruit | Faster R-CNN | Counting detected fruits for yield estimation | Model performance: | |
| | | Olive | Inception-ResNetV2 | Counting detected fruits for yield estimation | Model performance: F1 = 0.84 | |
| | | Grape clusters | MobileNet-V2 | DeepLabV3 segments each berry for counting | Berry detection accuracy of 94.0% in the VSP and 85.6% in the SMPH | |
| | | Kiwifruit | SSD (with MobileNetV2, quantized MobileNetV2, InceptionV3, and quantized InceptionV3) | Running on Android mobile devices and counting detected fruits for yield estimation | True detected rates (TDR) of MobileNetV2, quantized MobileNetV2, InceptionV3, and quantized InceptionV3 are 90.8%, 89.7%, 87.6%, and 72.8%, respectively | |
| | | Blueberry | Mask-R-CNN | Applying different backbones (ResNet101, ResNet50, and MobileNetV1) to Mask-R-CNN and adding a step that outputs each blueberry instance to quantify the total number of blueberries in an image | The best result, a mIoU score of 0.595, was obtained with the ResNet50 backbone | |
| | | Blueberry | Mask-R-CNN | A 3D minimum bounding box calculates fruit cluster compactness after 3D reconstruction; a proposed trait extraction algorithm segments individual 3D blueberries, counts berries, calculates maturity, and estimates berry size | The average counting accuracy for the 40 samples is 97.3%; fruit clusters with few fruits generally have higher accuracy, approaching 100% | |
| | | Multi-fruit | Faster-R-CNN with MIoU | Counting detected fruits for yield estimation | Model performance: | |
| | Maturity detection | Apple ("Young apple," "Expanding apple," "Ripe apple") | YOLOv3 | Comparing different data augmentation methods and dataset sizes; detection under occluded, overlapping, and no-apple conditions | Model performance: F1 = 0.817. Average detection time: 0.304 s | |
| | | Tomato ("Flower," "Green tomato," "Red tomato") | Faster-R-CNN | Comparing YOLOv2, YOLOv3, the original Faster-R-CNN, R-FCN, and the proposed model | Model performance: mean average precision: 90.7%. Average test time: 0.073 s. Model memory: 115.9 MB | |
| | | Tomato ("Breakers," "Turning," "Pink," "Light red," "Red") | Own model | A custom-designed CNN classifies the images | Classification accuracy: 91.9% | |
| | | Tomato ("Immature," "Breaker," "Preharvest," "Harvest") | Fuzzy Mask-R-CNN | Locating the stalk points of ripe tomatoes for harvesting from the tomato contours output by Mask-R-CNN | Model performance: | |
| | | Four blueberry cultivars ("Immature" and "Mature") | Mask-R-CNN | Defining and calculating blueberry maturity and compactness; assessing the extracted traits and delineating trait differences among four blueberry cultivars | Model performance: mean average precision: 78% | |
| | | Coconut ("Coconut" and "Mature coconut") | Faster-R-CNN | Comparing the performance of Faster-R-CNN with different backbones, and comparing the improved model with other object detection models | Model performance: mean average precision: 89.4%. Detection speed: 3.124 s | |
| | | Passion fruit ("After-mature," "Mature," "Near-mature," "Near-young," "Young") | Faster R-CNN | DSIFT and LLC algorithms extract fruit features from the R, G, and B channels, and the representative features are sent to an SVM classifier for maturity identification | Detection accuracy: 92.71%; maturity classification accuracy: 91.52% | |
| | | Litchi ("Ripe litchi," "Expanding litchi," "Young litchi") | YOLOv3-Litchi | Comparing the proposed model with YOLOv2, YOLOv3, and Faster-R-CNN | Model performance: average detection time: 0.029 s; mean average precision: 75.6%; average precision of young, expanding, and ripe litchi: 67.3%, 71.9%, and 73.8%, respectively | |
| | | Olive ("ZIG," "RIG," "ZVS," "RVS," "ZBS," "RBS," "ZOR," "ROR") | Own model | Evaluating the efficiency of six optimizers: Adagrad, SGD, SGDM, RMSProp, Adam, and Nadam | Overall accuracy: 91.91%; detection speed: 12.64 ms/frame (CPU) | |
| | | Strawberries ("Flower," "Flower-Fruit," "Green-Fruit," "Green-White-Fruit," "White-Red-Fruit," "Red-Fruit," and "Rotted-Fruit") | YOLOv3 | Identifying the ripeness of the detected fruit | The mAP of strawberry maturity classification was 0.89, and the highest classification AP was 0.94 for fully matured fruit | |
| | | Cherry ("Cherry," "Cherry_1," "Cherry_2") | YOLOv4 | DenseNet replaces CSPDarkNet53 in YOLOv4; different models are compared for detecting ripe cherries | The mAP increased by 0.15 compared with the YOLOv4 model; the F1 score and IoU are 0.947 and 0.856 | |
| Aerial platform | Yield estimation | Apple, orange | FCN | A second neural network and a linear regression count the number of fruits | Mean IU of 0.813 on oranges and 0.838 on apples; best L2 error of 13.8 on oranges and 10.5 on apples | |
| | | Green mango | YOLOv2 | Counting detected fruits for yield estimation | The mAP was 86.4%, precision was 96.1%, and recall was 89.0% | |
| | | Citrus | Faster-R-CNN | Counting detected fruits and estimating weight for yield estimation | Mean error is 7.22% | |
| | | Citrus | YOLOv5 | Comparing the proposed model with different models and under different occlusion degrees | Accuracy: 93.32%; speed: 180 ms/frame; FPS: 83 (on an RTX 2080 Ti); recall: 88.78% | |
| | | Melon | RetinaNet | Estimating the weight of the detected fruit | Overall average precision score: 0.92; F1 above 0.9 | |
| | Maturity detection | Strawberries ("Flower," "Immature Fruit," "Mature Fruit") | YOLOv3 | Identifying the ripeness of the detected fruit | For flower, immature fruit, and mature fruit detection on the test data set at 2 m, the APs were 0.83, 0.87, and 0.93, and the mAP was 0.88 | |
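A common final step shared by the yield-estimation pipelines above is simply counting detector outputs above a confidence threshold. A minimal sketch, assuming the torchvision detection output format:

```python
import torch

def count_fruit(prediction, score_threshold=0.5):
    """Count detections above a confidence threshold toward the yield estimate."""
    return int((prediction["scores"] > score_threshold).sum())

prediction = {"scores": torch.tensor([0.97, 0.88, 0.62, 0.31])}  # dummy detector output
print(count_fruit(prediction))  # -> 3 fruits counted
```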
TABLE 6. Summary of related studies on the application of CNN-based detection models in fruit harvesting.
| Crop | Basic model | Data augmentation | Dataset | Transfer learning | Detection rate (%) | Inference speed (s/image) | References |
| Apple | SSD | √ | 589 RGB images | √ | 89.2 | – | |
| | R-CNN | – | 270 RGB-D images | √ | 86.0 | – | |
| | Faster-R-CNN | √ | 967 three-modality images (RGB, range-corrected intensity, and depth) | √ | 94.8 | 0.074 @548×373 px | |
| | SSD | – | 250 RGB-D images | – | 92.3 | 2.000 @3840×1080 px | |
| | LedNet (FPN+ASPP) | √ | 1,100 RGB images | √ | 85.3 | 0.028 @320×320 px | |
| | Faster-R-CNN | √ | 12,800 RGB images | – | 87.6 | 0.241 @1920×1080 px | |
| | Faster-R-CNN | √ | 800 RGB-D images | √ | 87.1 | 0.124 @1920×1080 px | |
| | Faster R-CNN | √ | 820 RGB images | – | 92.5 | 0.058 @100×100 px | |
| | Faster-R-CNN | √ | 675 RGB-D images | √ | 82.4 | 0.450 @360×640 px | |
| | Mask-R-CNN | – | 1,140 RGB images | √ | 97.3 | – | |
| | Mask-R-CNN | √ | 24,005 RGB images | √ | 58.1 | – | |
| | Mask-R-CNN | – | 19,528 RGB images | – | 88.0 | 0.250 @1280×720 px | |
| | DenseNet+FPN | √ | 953 RGB images | √ | 93.2 | 0.023 @200×308 px | |
| Citrus | SSD | √ | 1,660 RGB images | √ | 91.1 | – | |
| | Mask-R-CNN | – | 300 RGB images | √ | 85.1 | 0.045 @1024×768 px | |
| | Mask-R-CNN | √ | RGB and RGB-HSV images | √ | 97.5 | 0.011 @256×256 px | |
| | Mask-R-CNN | – | 5,195 RGB images | √ | – | – | |
| | Mask-R-CNN | – | 750 RGB images | – | 98.2 | 0.700 @1024×768 px | |
| | Mask R-CNN | – | 5,195 RGB images | – | 92.2 | 9.230 @1920×1080 px | |
| | Faster R-CNN | √ | 799 RGB images | – | 90.7 | 0.058 @100×100 px | |
| Kiwifruit | Faster-R-CNN | – | 20,160 images | √ | 92.3 | 0.274 @2352×1568 px | |
| | Faster-R-CNN | √ | 20,160 images | √ | 87.6 | 0.347 @2352×1568 px | |
| | Faster-R-CNN | √ | 21,147 RGB images | √ | 96.0 | 1.070 @1920×1080 px | |
| | Faster-R-CNN | – | 1,000 NIR images + 1,000 RGB images + 1,000 RGB-D images | – | 91.7 | 0.134 @512×424 px | |
| | YOLOv3 | √ | 20,160 RGB images | √ | 90.1 | 0.034 @2352×1568 px | |
| Strawberry | SSD | √ | 4,550 RGB images | √ | 87.7 | 0.230 @360×640 px | |
| | Mask R-CNN | – | 2,000 RGB images | √ | 95.8 | 0.125 @640×480 px | |
| | Mask R-CNN | √ | – | – | 81.0 | 0.620 @640×480 px | |
| | Mask R-CNN | – | – | – | – | – | |
| | Mask R-CNN | √ | 3,000 RGB images | – | 78.3 | 0.010 @1008×756 px | |
| | FCN | | 3,100 RGB images | – | 93.4 | 0.030 @1008×756 px | |
| Grape | Mask R-CNN | √ | 1,050 RGB-D images | – | 89.5 | 1.100 @1920×1080 px | |
| Litchi | SSD | √ | 636 RGB images | √ | 86.7 | – | |
| Mango | Faster R-CNN | | 822 RGB images | – | 88.9 | 0.058 @100×100 px | |
| | DenseNet+FPN | √ | 1,694 RGB images | √ | 93.6 | 0.023 @500×500 px | |
| | Faster R-CNN | √ | 8,475 RGB images | – | 92.0 | 0.200 @500×500 px | |
| Guava | Mask R-CNN | √ | 304 RGB-D images | √ | 53.7 | 0.250 @512×424 px | |
| Sweet pepper | Faster-R-CNN | √ | 122 RGB-NIR images | √ | 83.8 | 0.393 @– | |
| | Deep CNN | √ | 960 RGB images | – | 82.9 | – | |
| Cherry tomato | YOLOv3 | √ | 1,825 RGB images | – | 96.8 | 0.058 @1292×964 px | |
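The "Transfer learning" column above typically means initializing from weights pretrained on a large generic dataset (e.g., COCO) and fine-tuning on a small fruit dataset. A minimal torchvision sketch: swap the box-prediction head for the new number of classes, then train as usual:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 2  # background + one fruit class (hypothetical)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Replace the COCO classification head with one sized for the fruit dataset.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
# The model is now ready for fine-tuning on annotated fruit images.
```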
FIGURE 13. Diagram of fusion methods in Sa et al. (2016). (A) Early fusion: first, the channels of the input image are augmented from three to four; second, the augmented image is detected by Faster-R-CNN; third, NMS (non-maximum suppression) removes duplicate predictions; finally, the classifier and regressor calculate the category and coordinates of each bounding box. (B) Late fusion: first, the RGB image and the NIR image are each detected by Faster-R-CNN; second, the outputs of the two Faster-R-CNN networks are fused; third, NMS removes duplicate predictions; finally, the classifier and regressor calculate the category and coordinates of each bounding box.
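A minimal sketch of the early-fusion idea in Figure 13A: stack the NIR channel onto the RGB image to form a four-channel input and widen the network's first convolution from three to four input channels:

```python
import torch
import torch.nn as nn

rgb = torch.rand(1, 3, 480, 640)                 # dummy RGB frame
nir = torch.rand(1, 1, 480, 640)                 # dummy NIR frame
fused = torch.cat([rgb, nir], dim=1)             # 4-channel early-fused input

# First convolution accepts 4 input channels instead of the usual 3.
first_conv = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3)
print(first_conv(fused).shape)
```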
FIGURE 14. Feature-fusion model in Liu et al. (2019). The RGB and NIR images are input separately into two VGG16 networks and combined at the feature-map level; the fused feature map is then detected by Faster-R-CNN.
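A minimal sketch of the feature-level fusion in Figure 14, concatenating the feature maps of two VGG16 extractors along the channel axis (replicating NIR to three channels is an assumption made for this sketch, not the authors' exact adaptation):

```python
import torch
from torchvision.models import vgg16

rgb_branch = vgg16(pretrained=True).features          # VGG16 feature extractor for RGB
nir_branch = vgg16(pretrained=True).features          # second extractor for NIR

rgb = torch.rand(1, 3, 224, 224)
nir = torch.rand(1, 1, 224, 224).repeat(1, 3, 1, 1)   # NIR replicated to 3 channels

fused = torch.cat([rgb_branch(rgb), nir_branch(nir)], dim=1)  # channel concatenation
print(fused.shape)  # doubled channel depth feeds the Faster-R-CNN detection stage
```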
FIGURE 15. Possible types of fruit in one scene formulated by Rehman and Miura (2021). (A) Center, (B) left, (C) right, and (D) occluded.
FIGURE 16. Automatic apple harvesting mode in Onishi et al. (2019). (A) Detection of the two-dimensional position, (B) detection of the three-dimensional position, and (C) approaching the target apple.
FIGURE 17. Platform setup and computer vision system (Chen et al., 2021). (A) The citrus processing line assembled in the laboratory, with a webcam mounted above the conveyor. (B) Diagram of an automated citrus sorting system using a camera and robot arms; the robot arms will be implemented in future studies.