Daisy Salifu, Eric Ali Ibrahim, Henri E Z Tonnang.
Abstract
Analysis of landmark-based morphometric measurements taken on insect body parts has been a useful taxonomic approach alongside DNA barcoding in insect identification. Statistical analysis of morphometrics has largely been dominated by traditional methods such as principal component analysis (PCA), canonical variate analysis (CVA) and discriminant analysis (DA). However, advances in computing power have made it practical to apply modern tools such as machine learning. Herein, we assess the predictive performance of four machine learning classifiers, K-nearest neighbor (KNN), random forest (RF), support vector machine (SVM; linear, polynomial and radial kernels) and artificial neural network (ANN), on fruit fly morphometrics that were previously analysed using PCA and CVA. KNN and RF performed poorly, with overall model accuracy lower than the "no-information rate" (NIR) (p value > 0.1). The SVM models had a predictive accuracy of > 95%, significantly higher than NIR (p < 0.001), Kappa > 0.78 and an area under the curve (AUC) of the receiver operating characteristic of > 0.91, while the ANN model had a predictive accuracy of 96%, significantly higher than NIR, a Kappa of 0.83 and an AUC of 0.98. Wing veins 2, 3, 8, 10, 14 and tibia length were of higher importance than other variables based on both SVM and ANN models. We conclude that SVM and ANN models could be used to discriminate fruit fly species, or any other morphologically similar pest taxa, based on wing vein and tibia length measurements. These algorithms are candidates for developing integrated, smart application software for insect discrimination and identification. The variable importance results of this study would help future studies decide what must be measured.
Year: 2022 PMID: 35505067 PMCID: PMC9065030 DOI: 10.1038/s41598-022-11258-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Values of the tuning parameter k and the corresponding accuracy and kappa statistics for the kNN model on the training dataset.
| k | Accuracy | Kappa |
|---|---|---|
| 5 | 0.927 | 0.639 |
| 7 | 0.924 | 0.615 |
| 9 | 0.915 | 0.564 |
| 11 | 0.908 | 0.510 |
| 13 | 0.904 | 0.483 |
| 15 | 0.897 | 0.430 |
| 17 | 0.893 | 0.399 |
| 19 | 0.889 | 0.367 |
| 21 | 0.887 | 0.348 |
| 23 | 0.884 | 0.324 |
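The tuning step summarised in this table can be sketched as follows. This is a minimal illustration, not the authors' code: `knn_predict` is a hypothetical helper, the two-feature toy specimens are invented for the example, and only the `cv_results` values are taken from the table above.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point, nearest first.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two measurements per specimen, two classes.
train_X = [(0.60, 1.00), (0.62, 1.02), (0.87, 1.38), (0.86, 1.40), (0.61, 0.99)]
train_y = ["A", "A", "B", "B", "A"]
predicted = knn_predict(train_X, train_y, (0.85, 1.35), k=3)

# (accuracy, kappa) per k, copied from the table above.
cv_results = {
    5: (0.927, 0.639), 7: (0.924, 0.615), 9: (0.915, 0.564),
    11: (0.908, 0.510), 13: (0.904, 0.483), 15: (0.897, 0.430),
    17: (0.893, 0.399), 19: (0.889, 0.367), 21: (0.887, 0.348),
    23: (0.884, 0.324),
}
# Select the k with the highest resampled accuracy.
best_k = max(cv_results, key=lambda k: cv_results[k][0])
print(predicted, best_k)
```

In the table, accuracy falls only from 0.927 to 0.884 as k grows while kappa roughly halves, so selecting on kappa instead of accuracy would pick the same k here but would penalise large k far more strongly.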
Figure 1. Variation in accuracy with the number of randomly selected predictor variables (mtry) for the random forest classifier. Model accuracy is highest for mtry = 7.
Figure 2. Linear SVM model accuracy (y-axis) against values of the cost parameter (x-axis), obtained from repeated cross-validation on the training data. A cost of C = 5.75 gives the optimal model.
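Tuning a parameter by repeated cross-validation, as in the two figures above, might look like the sketch below. The fold count, repeat count, grid values and the `score_fn` callback are all assumptions made for illustration; the source does not state them. The placeholder scorer stands in for "fit the model on the training indices, return accuracy on the held-out indices".

```python
import random

def repeated_kfold(n_samples, n_folds, n_repeats, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random partition for every repeat
        for f in range(n_folds):
            test = idx[f::n_folds]  # every n_folds-th shuffled index
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test

def tune(grid, score_fn, n_samples, n_folds=10, n_repeats=3):
    """Return the grid value with the highest mean score over all resamples."""
    splits = list(repeated_kfold(n_samples, n_folds, n_repeats))
    mean_score = {
        value: sum(score_fn(value, tr, te) for tr, te in splits) / len(splits)
        for value in grid
    }
    return max(mean_score, key=mean_score.get), mean_score

# Placeholder scorer (hypothetical): ignores the data and peaks at 1.0.
def dummy_score(value, train_idx, test_idx):
    return 1.0 - abs(value - 1.0)

best_value, scores = tune([0.5, 1.0, 2.0], dummy_score, n_samples=50)
print(best_value)
```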
Classification results for the SVM classifiers on test dataset of morphometric measurements of Bactrocera spp., with observed species affiliation in the rows and predicted species allocation in the columns. Correct classification rate appears along the diagonal in bold.
| Classifier | Observed | Predicted (%) | | | | | | | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|
| | | Bco | Bcu | Bdo | BI | Bka | Bol | Bzo | | |
| SVM-L | Bco | | 0 | 0 | 20.0 | 0 | 0 | 0 | 1.000 | 0.997 |
| | Bcu | 0 | | 0 | 0 | 0 | 0 | 0 | 0.818 | 1.000 |
| | Bdo | 0 | 0 | | 75.0 | 0 | 0 | 0 | 0.500 | 0.981 |
| | BI | 0 | 0.7 | 0.7 | | 0 | 0 | 0 | 0.965 | 0.892 |
| | Bka | 0 | 0 | 0 | 37.5 | | 0 | 0 | 1.000 | 0.991 |
| | Bol | 0 | 0 | 0 | 0 | 0 | | 0 | 1.000 | 1.000 |
| | Bzo | 0 | 0 | 0 | 0 | 0 | 0 | | 1.000 | 1.000 |
| SVM-R | Bco | | 0 | 0 | 20.0 | 0 | 0 | 0 | 1.000 | 0.997 |
| | Bcu | 0 | | 0 | 11.1 | 0 | 0 | 0 | 1.000 | 0.997 |
| | Bdo | 0 | 0 | | 62.5 | 0 | 0 | 0 | 1.000 | 0.984 |
| | BI | 0 | 0 | 0 | | 0 | 0 | 0 | 0.956 | 1.000 |
| | Bka | 0 | 0 | 0 | 75.0 | | 0 | 0 | 1.000 | 0.981 |
| | Bol | 0 | 0 | 0 | 0 | 0 | | 0 | 1.000 | 1.000 |
| | Bzo | 0 | 0 | 0 | 0 | 0 | 0 | | 1.000 | 1.000 |
| SVM-P | Bco | | 0 | 0 | 20.0 | 0 | 0 | 0 | 0.800 | 0.997 |
| | Bcu | 0 | | 0 | 11.1 | 0 | 0 | 0 | 0.889 | 0.997 |
| | Bdo | 0 | 0 | | 50.0 | 0 | 0 | 0 | 1.000 | 0.988 |
| | BI | 0.35 | 0.35 | 0 | | 1.1 | 0 | 0 | 0.962 | 0.865 |
| | Bka | 0 | 0 | 0 | 62.5 | | 0 | 0 | 0.500 | 0.984 |
| | Bol | 0 | 0 | 0 | 0 | 0 | | 0 | 1.000 | 1.000 |
| | Bzo | 0 | 0 | 0 | 0 | 0 | 0 | | 1.000 | 1.000 |
Bco: B. correcta, Bcu: B. cucurbitae, Bdo: B. dorsalis, BI: B. invadens, Bka: B. kandiensis, Bol: B. oleae, Bzo: B. zonata; SVM-L: linear kernel SVM, SVM-R: radial kernel SVM, SVM-P: polynomial kernel SVM.
Classification results for the ANN classifier on test dataset of morphometric measurements of Bactrocera spp., with observed species affiliation in the rows and predicted species allocation in the columns. Correct classification rate appears along the diagonal in bold.
| Observed | Predicted (%) | | | | | | | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|
| | Bco | Bcu | Bdo | BI | Bka | Bol | Bzo | | |
| Bco | | 0 | 0 | 0 | 0 | 0 | 0 | 1.000 | 1.000 |
| Bcu | 0 | | 0 | 11.1 | 0 | 0 | 0 | 0.889 | 0.997 |
| Bdo | 0 | 0 | | 50.0 | 0 | 0 | 0 | 0.667 | 0.987 |
| BI | 0 | 0.35 | 0.35 | | 1.1 | 0 | 0 | 0.975 | 0.878 |
| Bka | 0 | 0 | 12.5 | 25.0 | | 0 | 0 | 0.625 | 0.991 |
| Bol | 0 | 0 | 0 | 0 | 0 | | 0 | 1.000 | 1.000 |
| Bzo | 0 | 0 | 0 | 0 | 0 | 0 | | 1.000 | 1.000 |
Bco: B. correcta, Bcu: B. cucurbitae, Bdo: B. dorsalis, BI: B. invadens, Bka: B. kandiensis, Bol: B. oleae, Bzo: B. zonata.
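Per-species sensitivity and specificity like those in the classification tables are usually computed one-vs-rest from a confusion matrix of counts. The sketch below assumes that convention; the three-class matrix and the labels are hypothetical and are not the study's data.

```python
def per_class_rates(cm, labels):
    """One-vs-rest sensitivity and specificity from a square confusion matrix.

    cm[i][j] is the number of specimens observed as labels[i]
    and predicted as labels[j].
    """
    total = sum(sum(row) for row in cm)
    rates = {}
    for i, name in enumerate(labels):
        tp = cm[i][i]                         # observed i, predicted i
        fn = sum(cm[i]) - tp                  # observed i, predicted other
        fp = sum(row[i] for row in cm) - tp   # observed other, predicted i
        tn = total - tp - fn - fp             # everything else
        rates[name] = {
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        }
    return rates

# Hypothetical counts: two rare classes and one dominant class.
labels = ["A", "B", "C"]
cm = [
    [4, 0, 1],
    [0, 8, 2],
    [1, 1, 90],
]
rates = per_class_rates(cm, labels)
for name in labels:
    print(name, rates[name])
```

In this toy matrix the dominant class C has the lowest specificity (12 of 15 non-C specimens are correctly rejected), because a handful of rare-class specimens absorbed by the majority class weighs heavily against a small denominator. A similar effect may explain why BI tends to show the lowest specificity in the tables above.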
Summary of performance metrics for all the machine learning classifiers under study.
| Classifier model | Accuracy [95% CI] | Kappa | NIR | P value (Acc > NIR) | AUC |
|---|---|---|---|---|---|
| k-Nearest Neighbor | 0.932 [0.899, 0.957] | 0.648 | 0.929 | 0.469 | |
| Random Forest | 0.912 [0.874, 0.939] | 0.536 | 0.929 | 0.916 | |
| SVM, linear kernel | 0.957 [0.929, 0.976] | 0.811 | 0.886 | < 0.0001 | 0.911 |
| SVM, radial kernel | 0.960 [0.933, 0.979] | 0.810 | 0.908 | 0.0002 | 0.933 |
| SVM, polynomial kernel | 0.951 [0.921, 0.972] | 0.784 | 0.886 | < 0.0001 | 0.959 |
| ANN | 0.960 [0.933, 0.979] | 0.827 | 0.883 | < 0.0001 | 0.986 |
NIR: no-information rate, Acc: accuracy, AUC: area under the curve of the receiver operating characteristic, SVM: support vector machine, ANN: artificial neural network.
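The quantities in this summary table can be derived from a confusion matrix of counts, as sketched below. The sketch assumes that NIR is the share of the largest observed class and that the (Acc > NIR) comparison is a one-sided exact binomial test, which is a common convention; the source does not state which test was used. The matrix is hypothetical.

```python
from math import comb

def summary_metrics(cm):
    """Accuracy, NIR, Cohen's kappa and one-sided p-value for Acc > NIR.

    cm[i][j] is the count observed as class i and predicted as class j.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    observed = [sum(row) for row in cm]                          # row totals
    predicted = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # column totals

    acc = correct / n
    nir = max(observed) / n  # accuracy of always guessing the largest class

    # Cohen's kappa: agreement beyond what the marginal totals give by chance.
    expected = sum(o * p for o, p in zip(observed, predicted)) / n**2
    kappa = (acc - expected) / (1 - expected)

    # Exact binomial upper tail: P(X >= correct) for X ~ Binomial(n, nir).
    p_value = sum(
        comb(n, x) * nir**x * (1 - nir) ** (n - x) for x in range(correct, n + 1)
    )
    return acc, nir, kappa, p_value

# Hypothetical counts: two rare classes and one dominant class.
cm = [
    [4, 0, 1],
    [0, 8, 2],
    [1, 1, 90],
]
acc, nir, kappa, p_value = summary_metrics(cm)
print(f"acc={acc:.3f} nir={nir:.3f} kappa={kappa:.3f} p={p_value:.4f}")
```

With one dominant class the NIR is already high, so accuracy alone says little; kappa and the comparison against NIR show whether a model does better than always predicting the majority species.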
Figure 3. Analysis of variable importance (VI) for the radial kernel SVM model. Veins 3, 2, 8, and 10 are identified as predictors of higher importance than the others in all species except Bdo (B. dorsalis) and BI (B. invadens).
Figure 4. Analysis of variable importance (VI) for the artificial neural network model. Veins 3, 8, 2, 14 and tibia length are identified as more important than the other variables in predicting the Bactrocera spp.
Mean measurements of wing vein distances and tibia length (mm) of fruit fly (Bactrocera spp.) specimens collected from African countries and Asia.
| Variable | Bco | Bcu | Bdo | BI | Bka | Bol | Bzo |
|---|---|---|---|---|---|---|---|
| Vein 1 | 4.086 | 5.115 | 4.211 | 4.748 | 4.947 | 3.585 | 4.334 |
| Vein 2 | 0.631 | 0.871 | 0.719 | 0.746 | 0.749 | 0.612 | 0.641 |
| Vein 3 | 1.022 | 1.382 | 1.175 | 1.284 | 1.343 | 0.876 | 1.195 |
| Vein 4 | 0.503 | 0.548 | 0.517 | 0.545 | 0.605 | 0.316 | 0.616 |
| Vein 5 | 1.265 | 1.584 | 1.351 | 1.497 | 1.591 | 1.018 | 1.510 |
| Vein 6 | 0.384 | 0.504 | 0.412 | 0.444 | 0.488 | 0.291 | 0.399 |
| Vein 7 | 1.761 | 2.150 | 1.891 | 2.067 | 2.156 | 1.549 | 1.943 |
| Vein 8 | 0.621 | 0.865 | 0.706 | 0.772 | 0.789 | 0.544 | 0.679 |
| Vein 9 | 0.701 | 0.913 | 0.770 | 0.878 | 0.907 | 0.653 | 0.727 |
| Vein 10 | 0.962 | 1.332 | 1.094 | 1.191 | 1.263 | 0.844 | 0.981 |
| Vein 11 | 2.160 | 2.726 | 2.291 | 2.489 | 2.641 | 1.940 | 2.306 |
| Vein 12 | 1.120 | 1.356 | 1.151 | 1.229 | 1.270 | 1.114 | 1.116 |
| Vein 13 | 1.078 | 1.340 | 1.051 | 1.150 | 1.251 | 0.938 | 1.186 |
| Vein 14 | 2.054 | 2.500 | 2.156 | 2.362 | 2.409 | 1.689 | 2.165 |
| Tibia length | 1.471 | 1.728 | 1.522 | 1.679 | 1.721 | 1.153 | 1.506 |
Bco: B. correcta (n = 18), Bcu: B. cucurbitae (n = 31), Bdo: B. dorsalis (n = 28), BI: B. invadens (n = 940), Bka: B. kandiensis (n = 28), Bol: B. oleae (n = 28), Bzo: B. zonata (n = 18).
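The sample sizes above imply a strongly imbalanced dataset, which a short calculation makes concrete. The counts are taken from the footnote; the variable names are illustrative only.

```python
# Specimens per species, from the table footnote.
n_per_species = {
    "Bco": 18, "Bcu": 31, "Bdo": 28, "BI": 940,
    "Bka": 28, "Bol": 28, "Bzo": 18,
}

total = sum(n_per_species.values())
majority, n_major = max(n_per_species.items(), key=lambda kv: kv[1])
share = n_major / total  # accuracy of always predicting the majority species

print(f"{majority}: {n_major}/{total} = {share:.3f}")  # BI: 940/1091 = 0.862
```

This is the majority share of the full sample. The NIR values reported for the classifiers are computed on the test subset, so they need not equal this figure.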
Figure 5. A schematic diagram illustrating the structure of a simple multilayer neural network. Arrows represent the direction in which values are passed. At the end of the network, the output layer provides the probability that the specimen in question belongs to a given species.
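The forward pass described in the caption can be sketched with one hidden layer and a softmax output. The layer sizes, the tanh activation and every weight below are assumptions chosen for illustration; the source does not specify the network's architecture.

```python
import math

def softmax(z):
    """Convert raw output scores into probabilities that sum to 1."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def dense(x, W, b):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(x, W1, b1, W2, b2):
    """Input -> hidden layer (tanh) -> output layer (softmax probabilities)."""
    hidden = [math.tanh(h) for h in dense(x, W1, b1)]
    return softmax(dense(hidden, W2, b2))

# Hypothetical network: 2 inputs, 3 hidden units, 3 output classes.
W1 = [[0.5, -0.2], [0.1, 0.8], [-0.6, 0.3]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -0.5, 0.2], [-0.3, 0.9, 0.4], [0.2, 0.1, -0.7]]
b2 = [0.0, 0.0, 0.0]

probs = forward([0.72, 1.18], W1, b1, W2, b2)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

The predicted species is the output unit with the highest probability; in a trained network the weights would be learned from the morphometric measurements instead of being set by hand.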