Artur van Bemmelen van der Plaat, Rob van Treuren, Theo J L van Hintum.
Abstract
BACKGROUND: To address the need for easy and reliable species classification in plant genetic resources collections, we assessed the potential of five classifiers (Random Forest, Neighbour-Joining, 1-Nearest Neighbour, a conservative variety of 3-Nearest Neighbours and Naive Bayes). We investigated the effects of the number of accessions per species and the misclassification rate on classification success, and validated the generic value of the results with three complete datasets.
Keywords: Crop wild relatives; Gene bank documentation; Genomics; Machine learning; Plant genetic resources; Species classification
Year: 2021 PMID: 33789577 PMCID: PMC8011391 DOI: 10.1186/s12859-021-04018-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Median prediction accuracy in the 15 × 5,000 curated Helianthus datasets
| Accessions per species | Misclassification rate (%) | RF | NB | NJ | 1-NN | 3-NN |
|---|---|---|---|---|---|---|
| 6.25 | 0.63 | 0.81 | 0.81 | 0.75 | ||
| 2 | 12.50 | 0.56 | 0.63 | |||
| 18.75 | 0.50 | 0.63 | ||||
| 6.25 | 0.84 | 0.91 | 0.78 | |||
| 4 | 12.50 | 0.81 | 0.88 | 0.84 | 0.75 | |
| 18.75 | 0.78 | 0.81 | 0.78 | 0.69 | ||
| 6.25 | 0.94 | |||||
| 6 | 12.50 | 0.85 | 0.92 | 0.88 | ||
| 18.75 | 0.77 | 0.88 | 0.81 | 0.88 | ||
| 6.25 | 0.95 | 0.88 | 0.94 | 0.83 | ||
| 8 | 12.50 | 0.94 | 0.81 | 0.89 | 0.78 | |
| 18.75 | 0.70 | 0.84 | 0.72 | 0.91 | ||
| 6.25 | 0.88 | 0.93 | 0.91 | |||
| 10 | 12.50 | 0.75 | 0.89 | 0.85 | 0.94 | |
| 18.75 | 0.65 | 0.85 | 0.79 | 0.89 | ||
| Median | 0.92 | 0.81 | 0.88 | 0.79 | 0.94 |
Classifiers are Random Forest (RF), Naive Bayes (NB), Neighbour-Joining (NJ), 1-Nearest Neighbour (1-NN), and 3-Nearest Neighbours (3-NN). For each parameter combination, the highest median score is presented in bold.
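The two parameters varied in the table (accessions per species and misclassification rate) can be illustrated with a small sketch. This is an assumption about how such test sets could be drawn, not the procedure used in the study: the function names `subsample_per_species` and `inject_misclassification`, the toy species labels, and the rule of reassigning a rounded share of labels to a different species are all hypothetical.

```python
import random

def subsample_per_species(labels, n_per_species, rng):
    """Return indices of n_per_species randomly chosen accessions for each species."""
    by_species = {}
    for idx, species in enumerate(labels):
        by_species.setdefault(species, []).append(idx)
    chosen = []
    for species in sorted(by_species):
        members = by_species[species]
        if len(members) < n_per_species:
            raise ValueError(f"{species} has fewer than {n_per_species} accessions")
        chosen.extend(rng.sample(members, n_per_species))
    return chosen

def inject_misclassification(labels, rate, rng):
    """Return a copy of labels with round(rate * n) entries moved to another species.

    Assumption: a 'misclassified' accession is one whose a priori label is
    replaced by a randomly chosen different species.
    """
    noisy = list(labels)
    species = sorted(set(labels))
    n_wrong = round(rate * len(labels))
    for idx in rng.sample(range(len(labels)), n_wrong):
        alternatives = [s for s in species if s != labels[idx]]
        noisy[idx] = rng.choice(alternatives)
    return noisy

# Toy data (made up): four species with ten accessions each.
rng = random.Random(0)
labels = ["A"] * 10 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10

chosen = subsample_per_species(labels, 4, rng)           # 4 accessions per species
sub_labels = [labels[i] for i in chosen]
noisy = inject_misclassification(sub_labels, 0.125, rng)  # 12.5 % mislabelled
n_changed = sum(a != b for a, b in zip(sub_labels, noisy))
print(len(chosen), n_changed)
```

Repeating such a draw many times, classifying each draw, and taking the median accuracy would give one cell of a table like the one above.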
Confusion matrix of H. petiolaris and H. neglectus in the sunflower dataset
| RF | NB | NJ | |||||||
|---|---|---|---|---|---|---|---|---|---|
| A priori classification | Other | Other | Other | ||||||
| 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.94 | 0.06 | 0.00 | |
| 0.11 | 0.84 | 0.05 | 0.00 | 0.00 | 1.00 | 0.05 | 0.84 | 0.11 | |
Confusion matrix showing fractionally how often H. petiolaris and H. neglectus are classified as themselves, as each other, and as other species by Random Forest (RF), Naive Bayes (NB), Neighbour-Joining (NJ), 1-Nearest Neighbour (1-NN), and 3-Nearest Neighbours (3-NN).
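The fractions described in this caption can be computed as row-normalised counts, with every non-focal prediction pooled into an "Other" column. The sketch below is illustrative only: `confusion_fractions` is a hypothetical helper, and the labels and predictions are made-up toy values, not data from the study.

```python
from collections import Counter

def confusion_fractions(true_labels, predicted_labels, focal):
    """For each focal species, return the fraction of its accessions that were
    predicted as each focal species or as 'Other' (any non-focal species)."""
    counts = {sp: Counter() for sp in focal}
    for truth, pred in zip(true_labels, predicted_labels):
        if truth in counts:
            counts[truth][pred if pred in focal else "Other"] += 1
    table = {}
    for sp in focal:
        total = sum(counts[sp].values())
        table[sp] = {
            col: (counts[sp][col] / total if total else 0.0)
            for col in list(focal) + ["Other"]
        }
    return table

# Toy a priori labels and predictions (made up for illustration).
truth = ["petiolaris"] * 4 + ["neglectus"] * 4 + ["annuus"] * 2
pred = [
    "petiolaris", "petiolaris", "petiolaris", "neglectus",
    "neglectus", "neglectus", "petiolaris", "annuus",
    "annuus", "annuus",
]

fractions = confusion_fractions(truth, pred, ["petiolaris", "neglectus"])
for species, row in fractions.items():
    print(species, row)
```

Each row sums to 1, so a row reads directly as "classified as itself, as the other focal species, or as something else".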
Confusion matrix of S. peruvianum and S. corneliomulleri in the AFLP dataset
| RF | NB | NJ | |||||||
|---|---|---|---|---|---|---|---|---|---|
| A priori classification | Other | Other | Other | ||||||
| 0.92 | 0.08 | 0.00 | 0.00 | 0.00 | 1.00 | 0.75 | 0.25 | 0.00 | |
| 1.00 | 0.00 | 0.00 | 0.00 | 0.25 | 0.75 | 0.50 | 0.50 | 0.00 | |
Confusion matrix of the AFLP tomato dataset, showing fractionally how often S. peruvianum and S. corneliomulleri are classified as themselves, as each other, and as other species by Random Forest (RF), Naive Bayes (NB), Neighbour-Joining (NJ), 1-Nearest Neighbour (1-NN), and 3-Nearest Neighbours (3-NN)
Prediction accuracy per complete dataset.
| RF OOB | NB LOO | NJ | 1-NN | 3-NN | |
|---|---|---|---|---|---|
| Resequenced sunflower | 0.86 | 0.07 | 0.91 | 0.81 | |
| AFLP tomato | 0.92 | 0.14 | 0.84 | 0.86 | |
| Resequenced tomato | 0.85 | 0.75 | 0.85 | 0.88 |
The supervised machine learners (RF and NB) have been cross-validated as described in the Methods section. The best performance for each dataset is presented in bold.
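Reading the "LOO" column label as leave-one-out cross-validation, the idea can be sketched as follows: each accession is held out in turn, the classifier is fitted on the rest, and the held-out accession is predicted. The sketch uses a simple nearest-neighbour rule on Hamming distance as a stand-in classifier (it is not Naive Bayes), and the marker profiles are made-up toy data; all function names are hypothetical.

```python
def hamming(a, b):
    """Number of marker positions at which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbour(train_profiles, train_labels, query):
    """Stand-in classifier: label of the training profile closest to the query."""
    best = min(
        range(len(train_profiles)),
        key=lambda i: hamming(train_profiles[i], query),
    )
    return train_labels[best]

def leave_one_out_accuracy(profiles, labels, classify):
    """Hold out each accession once, predict it from all others, return accuracy."""
    correct = 0
    for i in range(len(profiles)):
        train_profiles = profiles[:i] + profiles[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if classify(train_profiles, train_labels, profiles[i]) == labels[i]:
            correct += 1
    return correct / len(profiles)

# Toy binary marker profiles (made up for illustration).
profiles = [
    (1, 1, 0, 0), (1, 1, 1, 0),   # species A
    (0, 0, 1, 1), (0, 1, 1, 1),   # species B
    (1, 0, 0, 1),                 # species C, a single accession
]
labels = ["A", "A", "B", "B", "C"]

accuracy = leave_one_out_accuracy(profiles, labels, nearest_neighbour)
print(accuracy)  # 0.8
```

The single accession of species C can never be predicted correctly here, because holding it out removes its species from the training set; this is one reason the number of accessions per species matters for such estimates.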
Characteristics of the datasets used to compare the classifiers.
| Crop | Type | Accessions | Species | Markers | Reference |
|---|---|---|---|---|---|
| Sunflower | Resequenced | 287 | 21 | 15,285 | Baute et al. |
| Tomato | AFLP | 210 | 16 | 219 | Zuriaga et al. |
| Tomato | Resequenced | 80 | 13 | 100,000 | 100 Tomato Genome Sequencing Consortium et al. |
In the dataset of the 100 Tomato Genome Sequencing Consortium et al., we included only accessions of unadmixed ancestry. The number of markers listed is the number remaining after filtering.