Teresita M Porter, Mehrdad Hajibabaei.
Abstract
We introduce a method for assigning names to CO1 metabarcode sequences with confidence scores in a rapid, high-throughput manner. We compiled nearly 1 million CO1 barcode sequences appropriate for classifying arthropods and chordates. Compared to our previous Insecta classifier, the current classifier has more than three times the taxonomic coverage, including outgroups, and is based on almost five times as many reference sequences. Unlike other popular rDNA metabarcoding markers, we show that classification performance is similar across the length of the CO1 barcoding region. We show that the RDP classifier can make taxonomic assignments about 19 times faster than the popular top BLAST hit method and reduce the false positive rate from nearly 100% to 34%. This is especially important in large-scale biodiversity and biomonitoring studies where datasets can become very large and the taxonomic assignment problem is not trivial. We also show that reference databases are becoming more representative of current species diversity but that gaps still exist. We suggest that it would benefit the field as a whole if all investigators involved in metabarcoding studies, through collaborations with taxonomic experts, also planned to barcode representatives of their local biota as a part of their projects.
Year: 2018 PMID: 29523803 PMCID: PMC5844909 DOI: 10.1038/s41598-018-22505-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
CO1 Eukaryote v1 training set summary.
| Training set | Number of taxa (all ranks) | Number of sequences |
|---|---|---|
| Whole training set | 29,998 | 912,253 |
| Arthropoda | 21,267 | 685,651 |
| Chordata | 7,344 | 215,530 |
| Outgroup taxa | 1,385 | 11,072 |
Figure 1. The proportion of correct taxonomic assignments increases with more inclusive taxonomic ranks and longer CO1 sequences. Results are from leave-one-out testing of the CO1 Eukaryote v1 training set.
Bootstrap support cutoff values that produced at least 99% correct assignments during CO1 Eukaryote v1 leave-one-out testing.
| Rank | 500 bp+ | 400 bp | 200 bp | 100 bp | 50 bp |
|---|---|---|---|---|---|
| Bootstrap support cutoff (%) | | | | | |
| Superkingdom | 0 | 0 | 0 | 0 | 0 |
| Kingdom | 0 | 0 | 0 | 0 | 0 |
| Phylum | 0 | 0 | 0 | 0 | 0 |
| Class | 0 | 0 | 0 | 0 | 60 |
| Order | 0 | 0 | 10 | 40 | 80 |
| Family | 20 | 20 | 30 | 40 | 80 |
| Genus | 70 | 60 | 60 | 60 | N/A |
| | | | | | |
| Superkingdom | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Kingdom | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Phylum | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Class | 0.0 | 0.0 | 0.0 | 0.0 | 13.2 |
| Order | 0.0 | 0.0 | 0.2 | 5.7 | 60.7 |
| Family | 0.7 | 1.2 | 4.1 | 17.1 | 78.7 |
| Genus | 3.4 | 4.7 | 10.8 | 31.0 | N/A |
‘N/A’, not applicable, refers to the inability to observe 99% correct taxonomic assignments.
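The cutoff values in the first block of the table above can be applied as a simple lookup. The following is a minimal sketch, not the authors' implementation: the names `LENGTH_BINS`, `CUTOFFS`, `cutoff_for`, and `passes` are hypothetical, the rule of using the longest tabulated length that does not exceed the query length is an assumption, and bootstrap support is assumed to be expressed as a percentage and compared with `>=`.

```python
from typing import Optional

# Tabulated query lengths in bp, longest first ("500 bp+" is stored as 500).
LENGTH_BINS = (500, 400, 200, 100, 50)

# Bootstrap support cutoffs (%) transcribed from the first block of the table.
# None stands for "N/A": no cutoff gave at least 99% correct assignments.
CUTOFFS = {
    "superkingdom": (0, 0, 0, 0, 0),
    "kingdom": (0, 0, 0, 0, 0),
    "phylum": (0, 0, 0, 0, 0),
    "class": (0, 0, 0, 0, 60),
    "order": (0, 0, 10, 40, 80),
    "family": (20, 20, 30, 40, 80),
    "genus": (70, 60, 60, 60, None),
}


def cutoff_for(rank: str, query_length: int) -> Optional[int]:
    """Return the cutoff (%) for a rank and query length, or None for N/A.

    Assumption: a query uses the longest tabulated length that does not
    exceed its own length. Queries under 50 bp are outside the table.
    """
    row = CUTOFFS[rank.lower()]
    for length, cutoff in zip(LENGTH_BINS, row):
        if query_length >= length:
            return cutoff
    raise ValueError("query shorter than the shortest tabulated length (50 bp)")


def passes(rank: str, query_length: int, bootstrap: float) -> bool:
    """True if an assignment meets the tabulated cutoff for its rank and length."""
    cutoff = cutoff_for(rank, query_length)
    return cutoff is not None and bootstrap >= cutoff


if __name__ == "__main__":
    print(cutoff_for("genus", 658))   # full-length barcode, genus rank
    print(passes("family", 220, 30))  # 200 bp bin, family rank
    print(passes("genus", 60, 100))   # 50 bp bin, genus rank is N/A
```

Under these assumptions, a genus-rank assignment from a 50 bp query is never accepted, because the table reports no cutoff that reaches 99% correct assignments for that combination.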
Representation of freshwater biomonitoring taxa in the CO1 Eukaryote v1 training set.
| Class | Order | No. reference sequences | % Incorrect (No cutoff) | % Incorrect (Cutoff) |
|---|---|---|---|---|
| Bivalvia | — | 667 | 3.7 | 0.3 |
| Clitellata | — | N/A | N/A | N/A |
| Gastropoda | — | 1,896 | 3.7 | 0.4 |
| Insecta | Coleoptera | 89,484 | 7.5 | 1.1 |
| Insecta | Diptera | 118,896 | 3.8 | 0.8 |
| Insecta | Ephemeroptera | 6,722 | 2.8 | 0.3 |
| Insecta | Megaloptera | 469 | 3.6 | 1.7 |
| Insecta | Odonata | 3,553 | 6.9 | 1.2 |
| Insecta | Plecoptera | 2,679 | 2.7 | 0.1 |
| Insecta | Trichoptera | 17,277 | 3.1 | 0.3 |
| Malacostraca | Amphipoda | 8,483 | 3.4 | 1.3 |
| Malacostraca | Isopoda | 3,659 | 2.9 | 0.1 |
| Polychaeta | — | 888 | 2.8 | 0.2 |
| Turbellaria | — | N/A | N/A | N/A |
N/A, not applicable: as of October 2016, there were no full-length CO1 sequences identified to the species rank in the GenBank nucleotide database. We used a 70% bootstrap support cutoff value at the genus rank.
Figure 2. CO1 primers included in this study. Primer map of the CO1 barcoding region showing the relative position and direction of the primer-anchored 200 bp fragments analyzed in this study. The CO1 helix regions that are embedded in the mitochondrial inner membrane are also shown for reference.
Figure 3. The proportion of correctly assigned primer-anchored 200 bp sequences can vary across the CO1 barcoding region before applying a bootstrap support cutoff. Primer names are prefixed with the outermost alignment position along the CO1 barcoding region and are arranged along the x-axis in the order that they would be encountered from the 5′ to 3′ end. Top panel: Coverage of primer-anchored 200 bp sequences in the CO1 Eukaryote v1 training set. Middle panel: Proportion of correct taxonomic assignments. Bottom panel: Proportion of correct assignments after filtering by a 60% bootstrap support cutoff at the genus rank. Note the differing scale on the y-axes.
Taxonomic assignment outcomes at the genus rank from primer-anchored 200 bp sequences using the top BLAST hit method compared with the RDP classifier and the CO1 Eukaryote v1 training set.
| Method | N* | No results returned** | TP | FN | TN | FP | Accuracy | TPR | FPR |
|---|---|---|---|---|---|---|---|---|---|
| Top BLAST hit approach | 17,960,965 | 1,642 | 17,559,411 | 3,350 | 384 | 397,820 | 98% | ~100% | ~100% |
| RDP Classifier CO1 Eukaryote v1 | 17,962,607 | N/A | 16,887,619 | 727,269 | 230,262 | 117,457 | 95% | 96% | 34% |
TP = true positive, FN = false negative, TN = true negative, FP = false positive, TPR = true positive rate, FPR = false positive rate.
~ indicates that the value was rounded up and is nearly 100%.
*N = Total number of primer-anchored 200 bp CO1 sequences used as queries
**BLAST results were not returned because the expect value was greater than 10.
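The accuracy, TPR, and FPR columns follow from the four outcome counts using the standard definitions: accuracy = (TP + TN) / total, TPR = TP / (TP + FN), and FPR = FP / (FP + TN). The sketch below recomputes them from the counts in the table. The `Outcome` class and its method names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Counts of genus-rank assignment outcomes for one method."""
    tp: int  # true positives
    fn: int  # false negatives
    tn: int  # true negatives
    fp: int  # false positives

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.tn + self.fp

    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    def tpr(self) -> float:
        return self.tp / (self.tp + self.fn)

    def fpr(self) -> float:
        return self.fp / (self.fp + self.tn)


# Counts transcribed from the table above.
blast = Outcome(tp=17_559_411, fn=3_350, tn=384, fp=397_820)
rdp = Outcome(tp=16_887_619, fn=727_269, tn=230_262, fp=117_457)

if __name__ == "__main__":
    for name, o in (("Top BLAST hit", blast), ("RDP classifier", rdp)):
        print(
            f"{name}: N={o.total:,} "
            f"accuracy={o.accuracy():.1%} TPR={o.tpr():.1%} FPR={o.fpr():.1%}"
        )
```

The four counts sum to the N reported for each method, and the 1,642 queries for which BLAST returned no results account for the difference between the two N values.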
Figure 4. The RDP classifier taxonomically assigns more queries per minute than the top BLAST hit method. The number of primer-anchored 200 bp query sequences taxonomically assigned per minute is compared using the top BLAST hit method against a locally installed copy of the nucleotide database and the RDP classifier 2.12 with the CO1 Eukaryote v1 training set.