Md Matiur Rahaman, Md Asif Ahsan, Ming Chen.
Abstract
Statistical data mining (DM) and machine learning (ML) are promising tools for analyzing complex datasets. In recent decades, plant phenomics has become crucial to precision agriculture because it enables high-throughput phenotyping of local crop cultivars. An integrated or new analytical approach is therefore needed to handle phenomics data. We propose a statistical framework for analyzing phenomics data that integrates DM and ML methods. To validate the approach, the most popular supervised ML methods, Linear Discriminant Analysis (LDA), Random Forest (RF), and Support Vector Machine with linear (SVM-l) and radial basis (SVM-r) kernels, are used to classify plant status (stress/non-stress). Several simulated and real plant phenotype datasets were analyzed. The results show that the features selected by the proposed approach contribute significantly throughout the analysis. In this study, the proposed approach reduced the complexity of phenotype data analysis, reduced the computational time of the ML algorithms, and increased prediction accuracy.
Year: 2019 PMID: 31862925 PMCID: PMC6925301 DOI: 10.1038/s41598-019-55609-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Framework for selecting plant phenotype image-based traits (features).
Average classification accuracy (%) on the simulated data (p = 25) over 100 repeats of 10-fold cross-validation, by percentage of top-ranked features used.
| ML Methods | 10% | 20% | 30% | 40% | 50% | All features (100%) |
|---|---|---|---|---|---|---|
| LDA | 98.21 | 98.87 | 99.41 | 99.63 | 99.86 | 100.00 |
| RF | 97.30 | 97.56 | 97.70 | 97.78 | 97.81 | 97.90 |
| SVM-l | 98.08 | 98.65 | 99.07 | 99.22 | 99.39 | 99.53 |
| SVM-r | 97.88 | 98.34 | 98.55 | 98.61 | 98.67 | 98.53 |
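
The evaluation protocol named in the table caption (100 repeats of 10-fold cross-validation, averaged) can be sketched in plain Python. This is a minimal illustration, not the authors' code: `repeated_kfold_accuracy`, the `fit_predict` callback, and the toy `nearest_mean` classifier are hypothetical names introduced here, and the folds are unstratified, which is an assumption because the source does not say how folds were drawn.

```python
import random
from statistics import mean


def nearest_mean(X_train, y_train, X_test):
    """Toy stand-in classifier (not LDA, RF, or an SVM): assign each test
    sample to the class whose mean of feature 0 is closest."""
    labels = sorted(set(y_train))
    centers = {
        c: mean(x[0] for x, t in zip(X_train, y_train) if t == c)
        for c in labels
    }
    return [min(labels, key=lambda c: abs(x[0] - centers[c])) for x in X_test]


def repeated_kfold_accuracy(X, y, fit_predict, k=10, repeats=100, seed=0):
    """Average accuracy (%) over `repeats` runs of k-fold cross-validation.

    fit_predict(X_train, y_train, X_test) must return predicted labels.
    Folds are drawn without stratification (an assumption of this sketch).
    """
    n = len(X)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of samples")
    rng = random.Random(seed)
    fold_scores = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                        # new random partition per repeat
        folds = [idx[i::k] for i in range(k)]   # k folds of near-equal size
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            preds = fit_predict(
                [X[i] for i in train_idx],
                [y[i] for i in train_idx],
                [X[i] for i in test_idx],
            )
            correct = sum(p == y[i] for p, i in zip(preds, test_idx))
            fold_scores.append(correct / len(test_idx))
    return 100.0 * mean(fold_scores)
```

Calling `repeated_kfold_accuracy(X, y, nearest_mean)` with a list of feature rows `X` and labels `y` returns one averaged percentage, comparable in form to a single cell of the table. Any real classifier can be plugged in through the callback.
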
Figure 2. Performance of different percentages of ranked features in terms of computational time.
Average classification accuracy (%) on the simulated data (p = 50) over 100 repeats of 10-fold cross-validation, by percentage of top-ranked features used.
| ML Methods | 10% | 20% | 30% | 40% | 50% | All features (100%) |
|---|---|---|---|---|---|---|
| LDA | 91.97 | 93.29 | 94.32 | 95.19 | 95.78 | 100.00 |
| RF | 91.15 | 91.48 | 91.66 | 91.73 | 91.79 | 91.91 |
| SVM-l | 91.82 | 92.90 | 93.62 | 94.23 | 94.73 | 95.07 |
| SVM-r | 91.48 | 92.62 | 93.33 | 93.67 | 93.75 | 93.13 |
Average classification accuracy (%) on the simulated data (p = 100) over 100 repeats of 10-fold cross-validation, by percentage of top-ranked features used.
| ML Methods | 10% | 20% | 30% | 40% | 50% | All features (100%) |
|---|---|---|---|---|---|---|
| LDA | 84.97 | 87.23 | 89.30 | 91.10 | 92.43 | 94.88 |
| RF | 83.55 | 83.80 | 83.89 | 83.92 | 83.87 | 83.87 |
| SVM-l | 84.49 | 86.53 | 88.21 | 89.50 | 90.39 | 91.45 |
| SVM-r | 84.42 | 86.78 | 88.03 | 88.62 | 89.04 | 87.21 |
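
The table columns correspond to keeping only the leading 10% to 50% of a ranked feature list, or all of it. A small sketch of that subsetting step follows. The helper name `top_percent` is hypothetical, and rounding the count up (so that 10% of 25 features gives 3) is an assumption, since the source does not state its rounding rule.

```python
def top_percent(ranked_features, percent):
    """Return the leading `percent` of a best-first ranked feature list.

    `percent` is an integer from 1 to 100. The count is rounded up and is
    never less than one feature (both choices are assumptions of this sketch).
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    n = len(ranked_features)
    count = max(1, -(-n * percent // 100))   # integer ceiling of n * percent / 100
    return ranked_features[:count]


# Subsets matching the table columns for a ranked list of p = 25 features.
ranked = list(range(25))                     # placeholder ranking, best first
subsets = {pct: top_percent(ranked, pct) for pct in (10, 20, 30, 40, 50, 100)}
```

Integer arithmetic is used for the ceiling so that the count does not depend on floating-point rounding.
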
Figure 3. Plant phenotype dataset. Dataset preparation based on feature categories for two plant growth periods.
Figure 4. Performance of ranked features for the stress-period dataset. 'Geo' is geometrical, 'Phy' is physiological, and 'Geo + Phy' is combined geometrical and physiological features.
Figure 5. Performance of ranked features for the recovery-period dataset. 'Geo' is geometrical, 'Phy' is physiological, and 'Geo + Phy' is combined geometrical and physiological features.
Figure 6. Comparison of the classification accuracy of ML methods based on ranked features for the stress-period dataset. The number of ranked features is shown on the left of the panels and the feature categories on the right. Each column of panels shows the results for a different ML method. Every ML method was subjected to 100 repeats of 10-fold cross-validation, and the values shown are average classification accuracies. The value in each cell is color coded on a scale from 0 to 1, ranging from red to blue.
Figure 7. Comparison of the classification accuracy of ML methods based on ranked features for the recovery-period dataset. The number of ranked features is shown on the left of the panels and the feature categories on the right. Each column of panels shows the results for a different ML method. Every ML method was subjected to 100 repeats of 10-fold cross-validation, and the values shown are average classification accuracies. The value in each cell is color coded on a scale from 0 to 1, ranging from red to blue.
| 1. Procedure: Process ( |
| Where Ω is phenotypic traits space, |
| 2. Ψs ← Trait Selection ( |
| 3. Inputs: Training sample (Processed phenotypic image dataset) |
| 4. Group labels |
| 5. Initialize: Ψs = [1, 2,…, |
| 6. Trait ranked list, |
| 7. α ← |
| 8. w ← |
| 9. |
| 10. g ← |
| 11. |
| 12. Ψs ← Ψs (1:g-1, g + 1:length(Ψs)); eliminate the trait with the lowest ranking. |
| 13. return () |
| 14. End procedure. |
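
The steps of the procedure that survive in this record (initialize the trait set Ψs, obtain a weight vector w, find the index g of the weakest trait, remove it from Ψs, and build a ranked list) follow a recursive elimination pattern. The sketch below illustrates that loop in plain Python under stated assumptions. The right-hand sides of steps 7 to 11 are missing here, so the squared-weight ranking criterion is an assumption, and `mean_difference_weights` is a hypothetical toy replacement for whatever model the authors trained. It is not an SVM.

```python
from statistics import mean, pvariance


def mean_difference_weights(X, y):
    """Toy stand-in for a trained linear model (NOT an SVM): one weight per
    column, the difference of class means divided by the column variance.
    Labels are assumed to be 0 and 1."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    weights = []
    for j in range(len(X[0])):
        spread = pvariance([row[j] for row in X]) or 1.0   # guard constant columns
        diff = mean(r[j] for r in pos) - mean(r[j] for r in neg)
        weights.append(diff / spread)
    return weights


def recursive_trait_elimination(X, y, fit_weights=mean_difference_weights):
    """Return trait indices ordered from most to least important."""
    surviving = list(range(len(X[0])))   # Ψs: traits still in play
    ranked = []                          # ranked list, best trait first
    while surviving:
        # Refit on the surviving traits only.
        X_sub = [[row[j] for j in surviving] for row in X]
        w = fit_weights(X_sub, y)
        scores = [wi * wi for wi in w]   # assumed criterion: squared weight
        g = scores.index(min(scores))    # position of the weakest trait
        ranked.insert(0, surviving[g])   # traits removed later rank higher
        surviving = surviving[:g] + surviving[g + 1:]   # drop trait g (step 12)
    return ranked
```

Passing a different `fit_weights` callback, for example one that returns the coefficients of a linear SVM, changes the model without changing the elimination loop.
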