Xuguang Feng, Chunjiang Zhao, Chunshan Wang, Huarui Wu, Yisheng Miao, Jingjian Zhang.
Abstract
Crop diseases vary widely in appearance and are usually photographed against complex backgrounds, which makes automatic identification of field diseases an extremely challenging problem in smart agriculture. A popular approach is to design a Deep Convolutional Neural Network (DCNN) that extracts visual disease features from the images and then identifies the disease from those features. This approach performs well against simple backgrounds but has low accuracy and poor robustness against complex ones. In this paper, an end-to-end disease identification model composed of a disease-spot region detector and a disease classifier (YOLOv5s + BiCMT) is proposed. The YOLOv5s network detects the disease-spot regions, providing a regional attention mechanism that supports the classifier's identification task. The classifier is a Bidirectional Cross-Modal Transformer (BiCMT) that combines image and text information and exploits the correlation and complementarity between the features of the two modalities to fuse and recognize disease features. The model also handles the inconsistent sequence lengths of the two modalities. On a small dataset, the YOLOv5s + BiCMT model achieved the best results among the compared models, with Accuracy, Precision, Sensitivity, and Specificity of 99.23%, 97.37%, 97.54%, and 99.54%, respectively. These results indicate that bidirectional cross-modal feature fusion of disease images and texts is an effective method for identifying vegetable diseases in field environments.
Keywords: complex background; cross-modal fusion; disease identification; few-shot; transformer
Year: 2022 PMID: 35812910 PMCID: PMC9263697 DOI: 10.3389/fpls.2022.918940
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 6.627
The number of samples in each set.
| Disease class | Number of training images | Number of validation images | Number of testing images |
|---|---|---|---|
| Tomato powdery mildew | 112 | 31 | 15 |
| Tomato early blight | 200 | 57 | 28 |
| Tomato virus disease | 169 | 48 | 24 |
| Cucumber powdery mildew | 120 | 34 | 17 |
| Cucumber virus disease | 169 | 48 | 24 |
| Cucumber downy mildew | 160 | 45 | 22 |
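
The split proportions implied by the table above can be checked with a few lines of arithmetic. The following is a minimal sketch; the counts are copied from the table, while the names `SPLITS`, `split_fractions`, and `subset_totals` are illustrative and do not come from the paper.

```python
# Per-class image counts copied from the table above: (train, validation, test).
SPLITS = {
    "Tomato powdery mildew": (112, 31, 15),
    "Tomato early blight": (200, 57, 28),
    "Tomato virus disease": (169, 48, 24),
    "Cucumber powdery mildew": (120, 34, 17),
    "Cucumber virus disease": (169, 48, 24),
    "Cucumber downy mildew": (160, 45, 22),
}


def split_fractions(counts):
    """Return each subset's share of the class total (fractions summing to 1)."""
    total = sum(counts)
    return tuple(n / total for n in counts)


def subset_totals(splits):
    """Sum the train, validation, and test counts over all classes."""
    return tuple(sum(column) for column in zip(*splits.values()))


if __name__ == "__main__":
    for name, counts in SPLITS.items():
        train, val, test = split_fractions(counts)
        print(f"{name:<24} train={train:.1%} val={val:.1%} test={test:.1%}")
    print("totals (train, val, test):", subset_totals(SPLITS))
```

For these counts the subsets total 930, 263, and 130 images, which is roughly a 7:2:1 split of the 1323 original images.
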
Examples of the image-text pair.
| Disease category | Text description |
|---|---|
| Tomato powdery mildew | Some white powdery spots scatter on the front of tomato leaves. |
| Tomato early blight | Some yellow-brown sunken ring spots are observed on the front of tomato leaves. |
| Tomato virus disease | Numerous water-soaked chlorotic spots are observed on tomato leaves. |
| Cucumber powdery mildew | Some white powdery spots scatter on cucumber leaves. |
| Cucumber virus disease | Numerous chlorotic folded areas are observed on cucumber leaves. |
| Cucumber downy mildew | Yellow-brown irregular shaped spots scatter on the front of cucumber leaves. |
The specific number of images in the dataset.
| Disease class | Number of original images | Number of enhanced images |
|---|---|---|
| Tomato powdery mildew | 158 | 474 |
| Tomato early blight | 285 | 444 |
| Tomato virus disease | 241 | 482 |
| Cucumber powdery mildew | 171 | 513 |
| Cucumber virus disease | 241 | 482 |
| Cucumber downy mildew | 227 | 420 |
Figure 1. The overall network structure. The input of the model is the “image-text pair.” The detector is YOLOv5s, which is used to detect the disease-spot regions in the images. For text descriptions, Token Embeddings are obtained by loading the BERT Chinese pre-trained model. BiCMT is a Bidirectional Cross-Modal Transformer; the two MLP heads of the last BiCMT layer are extracted for classification.
Figure 2. The working principle of fine-grained disease-spot region detection.
Figure 3. The overall structure of Cross-Modal Transformer.
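
To illustrate how cross-modal attention can cope with image and text sequences of different lengths, here is a minimal sketch of plain scaled dot-product cross-attention applied in both directions. It is not the authors' BiCMT layer: learned projections, multiple heads, residual connections, and layer normalization are all omitted, and every name in it (`cross_attention`, `bidirectional_cross_attention`, the toy token vectors) is an assumption made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality and
    keys/values from the other. The output has one vector per query, so
    its length follows the query sequence, not the key sequence."""
    scale = math.sqrt(len(queries[0]))
    output = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attended = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        output.append(attended)
    return output


def bidirectional_cross_attention(image_tokens, text_tokens):
    """Run cross-attention in both directions (illustrative only)."""
    image_attends_text = cross_attention(image_tokens, text_tokens, text_tokens)
    text_attends_image = cross_attention(text_tokens, image_tokens, image_tokens)
    return image_attends_text, text_attends_image


if __name__ == "__main__":
    # Toy inputs: 4 image tokens and 2 text tokens sharing one feature size.
    image_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
    text_tokens = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    img_out, txt_out = bidirectional_cross_attention(image_tokens, text_tokens)
    print(len(img_out), len(txt_out))  # 4 2
```

Because each output sequence takes the length of its query sequence, the two modalities never need to be padded or truncated to a common length; they only need a shared feature dimension.
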
Comparison of different YOLOv5 models.
| Models | Precision/% | Recall/% | F1/% | mAP/% | Model size/MB | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|
| Faster RCNN | 66.7 | 52.8 | 59.0 | 59.4 | 113.6 | 473.3 | 23 |
| SSD | 56.8 | 84.4 | 67.9 | 68.3 | 97.7 | 137.8 | 51 |
| YOLOv5l | 91.6 | 90.0 | 90.8 | 93.4 | 92.9 | 108.0 | 38 |
| YOLOv5m | 91.7 | 90.3 | 91.0 | 94.2 | 42.2 | 48.1 | 47 |
| YOLOv5n | 91.4 | 90.2 | 90.8 | 93.1 | 3.9 | 4.2 | 54 |
| YOLOv5s | 92.3 | 91.4 | 91.9 | 94.3 | 14.4 | 15.9 | 53 |
| YOLOv5x | 89.7 | 91.4 | 90.5 | 93.6 | 173.2 | 204.3 | 29 |
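
The F1 column is the harmonic mean of the Precision and Recall columns, and the table can be checked against that definition. The sketch below uses the values copied from the table; the names `f1_percent` and `ROWS` are illustrative.

```python
def f1_percent(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the table above.
ROWS = {
    "Faster RCNN": (66.7, 52.8, 59.0),
    "SSD": (56.8, 84.4, 67.9),
    "YOLOv5l": (91.6, 90.0, 90.8),
    "YOLOv5m": (91.7, 90.3, 91.0),
    "YOLOv5n": (91.4, 90.2, 90.8),
    "YOLOv5s": (92.3, 91.4, 91.9),
    "YOLOv5x": (89.7, 91.4, 90.5),
}

if __name__ == "__main__":
    for name, (p, r, reported) in ROWS.items():
        print(f"{name:<12} computed={f1_percent(p, r):.2f} reported={reported}")
```

Every recomputed value agrees with the reported F1 to within 0.1 percentage points, which is what rounding of the precision and recall inputs would allow.
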
Figure 4. Comparison of accuracy and loss for original image classification.
Figure 5. Comparison of accuracy and loss for detection plus classification.
Comparison of image classification results.
| Models | Accuracy/% | Precision/% | Sensitivity/% | Specificity/% |
|---|---|---|---|---|
| VGG16 | 92.56 | 77.18 | 75.17 | 95.56 |
| AlexNet | 95.38 | 85.63 | 82.17 | 97.15 |
| ResNet50 | 95.38 | 84.01 | 83.09 | 97.24 |
| ResNet101 | 95.64 | 85.73 | 80.04 | 97.38 |
| DenseNet121 | 96.67 | 90.40 | 87.94 | 97.96 |
| ViT | 95.90 | 86.90 | 85.95 | 97.53 |
| YOLOv5s + VGG16 | 92.82 | 84.94 | 75.67 | 95.64 |
| YOLOv5s + AlexNet | 96.67 | 89.27 | 88.81 | 98.03 |
| YOLOv5s + ResNet50 | 96.15 | 87.85 | 86.62 | 97.69 |
| YOLOv5s + ResNet101 | 96.15 | 87.73 | 86.59 | 97.66 |
| YOLOv5s + DenseNet121 | 97.44 | 91.82 | 91.24 | 98.46 |
| YOLOv5s + ViT | 96.92 | 91.44 | 88.86 | 98.13 |
Comparison of the results in the text control group.
| Models | Accuracy/% | Precision/% | Sensitivity/% | Specificity/% |
|---|---|---|---|---|
| BOW+Transformer | 95.38 | 88.20 | 85.28 | 97.19 |
| BERT+Transformer | 96.46 | 91.44 | 91.26 | 98.07 |
Figure 6. Comparison of accuracy and loss for cross-modal training.
Comparison of cross-modal training results.
| Models | Accuracy/% | Precision/% | Sensitivity/% | Specificity/% |
|---|---|---|---|---|
| P2TCMT | 97.66 | 92.07 | 91.82 | 98.57 |
| T2PCMT | 98.21 | 94.32 | 94.34 | 98.93 |
| BiCMT | 99.23 | 97.37 | 97.54 | 99.54 |
Figure 7. Confusion matrices of the identification results of each model. Note: “0” refers to “tomato powdery mildew”; “1” refers to “tomato early blight”; “2” refers to “tomato virus disease”; “3” refers to “cucumber powdery mildew”; “4” refers to “cucumber virus disease”; “5” refers to “cucumber downy mildew.”
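
Accuracy, Precision, Sensitivity, and Specificity for a multi-class problem can all be derived from a confusion matrix. The sketch below assumes a one-vs-rest, macro-averaged convention, which this section does not state explicitly. The confusion matrix in it is hypothetical: its row totals match the test-set sizes listed earlier (15, 28, 24, 17, 24, 22), but the placement of the errors is invented and is not taken from Figure 7.

```python
def one_vs_rest_metrics(cm):
    """Macro-averaged one-vs-rest metrics from a square confusion matrix,
    where cm[i][j] counts samples of true class i predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    sums = {"accuracy": 0.0, "precision": 0.0, "sensitivity": 0.0, "specificity": 0.0}
    for k in range(n_classes):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[i][k] for i in range(n_classes)) - tp
        tn = total - tp - fn - fp
        sums["accuracy"] += (tp + tn) / total
        sums["precision"] += tp / (tp + fp) if tp + fp else 0.0
        sums["sensitivity"] += tp / (tp + fn) if tp + fn else 0.0
        sums["specificity"] += tn / (tn + fp) if tn + fp else 0.0
    return {name: value / n_classes for name, value in sums.items()}


# Hypothetical matrix with three misclassified images out of 130.
# Class order follows the Figure 7 note (0 = tomato powdery mildew, ...).
HYPOTHETICAL_CM = [
    [14, 0, 0, 1, 0, 0],
    [0, 28, 0, 0, 0, 0],
    [0, 0, 23, 0, 1, 0],
    [0, 0, 0, 17, 0, 0],
    [0, 0, 0, 0, 24, 0],
    [0, 1, 0, 0, 0, 21],
]

if __name__ == "__main__":
    for name, value in one_vs_rest_metrics(HYPOTHETICAL_CM).items():
        print(f"{name}: {value:.2%}")
```

With three hypothetical errors among 130 test images, the macro-averaged one-vs-rest accuracy is 129/130, about 99.23%, whereas the plain fraction of correct predictions would be 127/130, about 97.69%. The averaging convention therefore matters when comparing accuracy figures across papers.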