Frederic Austerlitz, Olivier David, Brigitte Schaeffer, Kevin Bleakley, Madalina Olteanu, Raphael Leblois, Michel Veuille, Catherine Laredo.
Abstract
BACKGROUND: DNA barcoding aims to assign individuals to given species according to their sequence at a small locus, generally part of the CO1 mitochondrial gene. Amongst other issues, this raises the question of how to deal with within-species genetic variability and potential trans-specific polymorphism. In this context, we examine several assignment methods belonging to two main categories: (i) phylogenetic methods (neighbour-joining and PhyML) that attempt to account for the genealogical framework of DNA evolution and (ii) supervised classification methods (k-nearest neighbour, CART, random forest and kernel methods). These methods range from basic to elaborate. We investigated the ability of each method to correctly classify query sequences drawn from samples of related species using both simulated and real data. Simulated data sets were generated using coalescent simulations in which we varied the genealogical history, mutation parameter, sample size and number of species.
Year: 2009 PMID: 19900297 PMCID: PMC2775147 DOI: 10.1186/1471-2105-10-S14-S10
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
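Of the supervised classification methods named in the abstract, k-nearest neighbour is the simplest to illustrate. The sketch below is not the authors' implementation: it assumes aligned sequences of equal length, uses the plain Hamming distance (number of differing sites) as the dissimilarity, and breaks ties by reference order. The function names `hamming` and `knn_assign` and the toy sequences are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def knn_assign(query, references, k=1):
    """Assign `query` to the most common species among its k nearest references.

    `references` is a list of (species, sequence) pairs. Ties in distance or
    in vote count are resolved by reference order (an arbitrary choice made
    for this sketch).
    """
    ranked = sorted(references, key=lambda ref: hamming(query, ref[1]))
    votes = Counter(species for species, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy reference sample: two species, two sequences each (invented data).
references = [
    ("sp1", "ACGTAC"),
    ("sp1", "ACGTAA"),
    ("sp2", "TCGTTC"),
    ("sp2", "TCGATC"),
]
print(knn_assign("ACGTAT", references, k=1))
```

With `k=1` the query simply inherits the species of its closest reference sequence; larger `k` turns the decision into a vote among the closest references.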
Figure 1. Hypothetical representations of gene genealogies between two species and of some hypothetical mutation patterns between them. Individual A is the global MRCA of all individuals; individuals B and C are the MRCAs of the two derived species 1 and 2, respectively. Cases a, b and c correspond to reciprocal monophyly and case d to reciprocal paraphyly. In some cases of reciprocal monophyly, one mutation is diagnostic (a), while no single mutation is diagnostic in other cases (b and c). A combination of mutations can also be sufficient to perform barcoding (c). Barcoding is also possible in the case of reciprocal paraphyly, again by using combinations of mutations that are specific to a given species (d).
Figure 2. Schematic representation of simulations for two species. It is assumed that the species split T generations ago. The thin lines represent the coalescent lineages and stars indicate the mutations that occurred along these lineages. For each species, we simulated n reference individuals and one additional individual, which was used to test the methods.
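The role of the split time T can be illustrated with a small coalescent sketch. It is not the simulation code used in the study: it assumes a haploid Wright-Fisher population of constant size N within each species, ignores mutations entirely, and only tracks how many ancestral lineages of a sample remain when the split is reached going backwards in time. If both samples have already coalesced to a single lineage before the split, the genealogy is necessarily reciprocally monophyletic (a sufficient condition, not a necessary one). The function names and parameter values are hypothetical.

```python
import random


def lineages_remaining(n, T, N, rng):
    """Number of ancestral lineages of n sampled sequences left T generations back.

    With k lineages in a haploid population of size N, the waiting time to the
    next coalescence is approximately exponential with rate k(k-1)/(2N).
    """
    k = n
    elapsed = 0.0
    while k > 1:
        elapsed += rng.expovariate(k * (k - 1) / (2.0 * N))
        if elapsed > T:
            break  # the split is reached before the next coalescence
        k -= 1
    return k


def prob_both_coalesced(n, T, N, replicates=2000, seed=1):
    """Monte Carlo estimate of the probability that both species' samples
    have each coalesced to a single lineage before the split."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        species_1 = lineages_remaining(n, T, N, rng)
        species_2 = lineages_remaining(n, T, N, rng)
        if species_1 == 1 and species_2 == 1:
            hits += 1
    return hits / replicates


# n = 10 reference individuals plus one query per species; N = 1000 is an
# assumed value chosen so that T = 500 corresponds to N/2.
for T in (100, 500, 1000, 5000, 10000):
    print(T, prob_both_coalesced(n=11, T=T, N=1000))
```

The estimate can only grow with T, which matches the intuition that older splits leave more time for each species to sort into its own clade.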
Characteristics of the data sets used in this study.
| Data set | Locus | Individuals | Species or subspecies | θ | Species delimitation |
| | COI | 466 | 12 | 15.32 | Phylogenetical |
| Cowries species | COI | 2036 | 180 | 36.36 | Genetical and morphological |
| Cowries species and subspecies | COI | 2036 | 249 | 36.36 | |
| Amazonian Butterflies | COI | 424 | 61 | 46.13 | Morphological |
| Amazonian Butterflies | mt DNA | 424 | 61 | 146.09 | |
| Amazonian Butterflies | nuclear (Ef1α) | 191 | 52 | 57.32 | |
Number of species or subspecies with at least two individuals.
Average θ value per species estimated with Watterson's [37] estimate.
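For reference, Watterson's estimate is θ_W = S / a_n, where S is the number of segregating sites in a sample of n sequences and a_n = 1 + 1/2 + ... + 1/(n-1). A minimal sketch follows, assuming aligned sequences of equal length and treating every character (including gaps or ambiguity codes) as an ordinary state; the function name `watterson_theta` is hypothetical and the result is a per-locus value. How missing data were handled for the table above is not stated here.

```python
def watterson_theta(sequences):
    """Watterson's estimate of theta (per locus) from aligned sequences."""
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    # S: number of segregating (polymorphic) sites in the alignment.
    segregating = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    # a_n: sum of 1/i for i = 1 .. n-1.
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / a_n


# Toy alignment with n = 3 and S = 2, so theta_W = 2 / (1 + 1/2).
print(watterson_theta(["AAAA", "AAAT", "AACT"]))
```

An average per species, as described in the table note, would apply this function to each species' sequences separately and take the mean of the results.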
Figure 3. Illustration of our assignment technique for the phylogeny-based methods. X denotes the query sequence to assign, and individuals 1_x and 2_x belong to species 1 and 2, respectively. In case A, the sister group (1_1, 1_3, 2_1) of X contains a majority of individuals of species 1, so X is assigned to species 1. In case B, the sister group (1_1, 1_3, 2_1, 2_4) of X contains an equal number of individuals of species 1 and 2, so we have to consider the sister group at the upper level (one node above). This group is (1_1, 1_3, 2_1, 2_4, 1_2) and contains a majority of species 1 individuals. X is thus assigned to species 1.
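The rule described in this caption can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the tree is given as nested tuples with string leaves, species are read from the label prefix before the underscore, and "majority" is taken to mean a strictly larger count than any other species (with more than two species this becomes a plurality rule, which is an assumption). The names `leaves`, `path_to` and `assign` and the two toy topologies are hypothetical.

```python
from collections import Counter


def leaves(node):
    """Leaf labels under a node (a leaf is a string, a clade is a tuple)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def path_to(node, query):
    """Clades from the root down to the parent of `query`, or None if absent."""
    if isinstance(node, str):
        return [] if node == query else None
    for child in node:
        sub = path_to(child, query)
        if sub is not None:
            return [node] + sub
    return None


def assign(tree, query="X", species_of=lambda label: label.split("_")[0]):
    """Assign `query` by majority in its sister group, moving one node up on ties."""
    ancestors = path_to(tree, query)
    if ancestors is None:
        raise ValueError("query not found in tree")
    # Start at the parent of the query and walk towards the root.
    for clade in reversed(ancestors):
        group = [label for label in leaves(clade) if label != query]
        ranked = Counter(species_of(label) for label in group).most_common(2)
        if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
            return ranked[0][0]
    return None  # the tie persists up to the root


# Toy topologies mimicking cases A and B of the caption.
case_a = (("X", ("1_1", "1_3", "2_1")), ("2_2", "2_3"))
case_b = ((("X", (("1_1", "2_1"), ("1_3", "2_4"))), "1_2"), ("2_2", "2_3"))
print(assign(case_a), assign(case_b))
```

In `case_b` the first sister group is split two against two, so the function moves one node up, where 1_2 tips the count towards species 1.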
Success rate (%) of data analysis methods with varying speciation time; mtDNA sequences were simulated for two species and a sample size n = 10; the mutation parameter θ was either 3 (A) or 30 (B).
| Speciation time T | Neighbour-joining | PhyML | k-nearest neighbour | CART | Random forest | Kernel | p < 0.05 | p < 0.01 |
| (A) θ = 3 | | | | | | | | |
| 100 | 62.90 | 62.25 ¶ | 65.45 * | 65.40 | 64.30 | 64.95 | 1 | |
| 500 † | 87.25 * | 86.30 | 87.20 | 87.15 | 86.40 | 87.15 | | |
| 1000 | 95.90 | 96.00 | 96.75 * | 96.55 | 96.00 | 95.75 | | |
| 5000 | 100.00 * | 100.00 * | 100.00 * | 99.85 | 100.00 * | 99.80 | | |
| 10000 | 100.00 * | 100.00 * | 100.00 * | 100.00 * | 100.00 * | 99.90 | | |
| (B) θ = 30 | | | | | | | | |
| 100 | 75.60 | 75.30 | 76.25 * | 75.50 | 77.75 | 73.45 | | |
| 500 † | 96.10 | 96.20 * | 95.55 | 93.50 ¶ | 95.25 | 94.00 ¶ | 2 | 2 |
| 1000 | 99.15 * | 99.15 * | 98.55 | 97.10 ¶ | 98.35 ¶ | 96.90 ¶ | 3 | 3 |
| 5000 | 99.95 | 100.00 * | 100.00 * | 99.40 ¶ | 100.00 * | 99.45 ¶ | 1 | 2 |
| 10000 | 100.00 * | 100.00 * | 100.00 * | 99.40 ¶ | 100.00 * | 99.55 ¶ | 2 | 2 |
* Best score
¶ Significantly below the best score. Columns 8 and 9 indicate the number of methods with p-values below 0.05 and 0.01, respectively.
† Focal set of parameters, for comparison across tables.
Success rate (%) of data analysis methods for a number of additional nuclear loci; DNA sequences were simulated for two species, for a sample size n = 10 and a speciation time T = 500 (N/2); the mutation parameter θ was either 3 (A) or 30 (B).
| Additional nuclear loci | Neighbour-joining | PhyML | k-nearest neighbour | CART | Random forest | Kernel | p < 0.05 | p < 0.01 |
| (A) θ = 3 | | | | | | | | |
| 0 † | 87.25 | 86.30 | 87.30 * | 87.15 | 86.40 | 87.15 | | |
| 1 | 88.05 ¶ | 87.10 ¶ | 91.40 * | 89.70 | 83.55 ¶ | 83.70 ¶ | 4 | 4 |
| 2 | 90.60 ¶ | 90.35 ¶ | 95.00 * | 93.20 ¶ | 86.25 ¶ | 86.80 ¶ | 5 | 4 |
| 3 | 92.80 ¶ | 92.40 ¶ | 96.20 * | 95.05 | 88.95 ¶ | 89.75 ¶ | 4 | 4 |
| 4 | 94.70 ¶ | 94.60 ¶ | 97.70 * | 96.55 ¶ | 91.30 ¶ | 91.90 ¶ | 5 | 4 |
| (B) θ = 30 | | | | | | | | |
| 0 † | 96.10 | 96.20 * | 95.55 | 93.50 ¶ | 95.25 | 94.00 ¶ | 1 | 1 |
| 1 | 96.00 | 96.20 | 96.80 | 95.95 | 97.00 * | 96.10 | | |
| 2 | 98.55 * | 98.55 * | 98.35 | 98.05 | 98.15 | 97.60 | | |
| 3 | 99.40 | 99.30 | 99.50 * | 98.95 ¶ | 99.40 | 98.75 | 1 | 1 |
| 4 | 99.75 * | 99.70 | 99.75 * | 99.50 | 99.75 * | 99.50 | | |
* Best score
¶ Significantly below the best score. Columns 8 and 9 indicate the number of methods with p-values below 0.05 and 0.01, respectively.
† Focal set of parameters, for comparison across tables.
Success rate (%) of data analysis methods when varying the reference sample size; mtDNA sequences were simulated for two species and a speciation time of T = 500 (N/2); the mutation parameter θ was either 3 (A) or 30 (B).
| Reference sample size n | Neighbour-joining | PhyML | k-nearest neighbour | CART | Random forest | Kernel | p < 0.05 | p < 0.01 |
| (A) θ = 3 | | | | | | | | |
| 3 | 77.45 | 77.50 | 78.05 * | 77.15 | 77.35 | 76.15 | | |
| 5 | 84.20 * | 83.85 | 83.30 | 82.95 | 82.40 | 82.10 | | |
| 10 † | 87.25 * | 86.30 | 87.20 | 87.15 | 86.40 | 87.15 | | |
| 25 | 92.00 * | 91.70 | 91.10 | 90.80 | 89.40 ¶ | 90.75 | 1 | 1 |
| (B) θ = 30 | | | | | | | | |
| 3 | 82.80 | 82.95 | 83.55 * | 79.40 ¶ | 81.95 | 80.45 ¶ | 2 | 2 |
| 5 | 89.50 | 89.45 | 90.25 * | 86.20 ¶ | 89.30 | 86.85 ¶ | 2 | 2 |
| 10 † | 96.10 | 96.20 * | 95.55 | 93.50 ¶ | 95.25 | 94.00 ¶ | 1 | 1 |
| 25 | 98.95 | 98.95 | 99.15 * | 98.30 ¶ | 99.00 | 98.20 ¶ | 2 | 1 |
* Best score
¶ Significantly below the best score. Columns 8 and 9 indicate the number of methods with p-values below 0.05 and 0.01, respectively.
† Focal set of parameters, for comparison across tables.
Success rate (%) of data analysis methods for a number of species ranging from two to five; mtDNA sequences were simulated for a reference sample size n = 10 and a separation time T = 500 (N/2); the mutation parameter θ was either 3 (A) or 30 (B).
| Number of species | Neighbour-joining | PhyML | k-nearest neighbour | CART | Random forest | Kernel | p < 0.05 | p < 0.01 |
| (A) θ = 3 | | | | | | | | |
| 2 † | 87.25 | 86.30 | 87.30 * | 87.15 | 87.20 | 87.15 | | |
| 3 | 81.73 * | 80.77 | 80.67 | 80.40 | 80.97 | 81.10 | | |
| 4 | 75.80 | 75.00 | 75.40 | 75.68 | 75.95 * | 74.78 ¶ | 1 | |
| 5 | 73.26 * | 72.36 | 72.58 | 72.84 | 73.22 | 70.74 ¶ | 1 | 1 |
| (B) θ = 30 | | | | | | | | |
| 2 † | 96.10 | 96.20 * | 95.55 | 93.50 ¶ | 95.25 | 94.00 ¶ | 2 | 2 |
| 3 | 94.40 * | 94.23 | 94.00 | 90.93 ¶ | 93.50 | 92.10 ¶ | 2 | 2 |
| 4 | 93.78 * | 93.73 | 92.90 | 90.10 ¶ | 92.53 ¶ | 91.40 ¶ | 3 | 2 |
| 5 | 92.46 | 92.38 | 92.70 * | 88.98 ¶ | 92.08 | 90.46 ¶ | 2 | 2 |
* Best score
¶ Significantly below the best score. Columns 8 and 9 indicate the number of methods with p-values below 0.05 and 0.01, respectively.
† Focal set of parameters, for comparison across tables.
Success rate (%) of the methods on real data sets.
| Data set | Neighbour-joining | PhyML | k-nearest neighbour | CART | Random forest | Kernel |
| | 99.36 | 99.36 * | 99.36 | 98.22 | 99.36 | 99.57 * |
| Cowries (species) | 95.45 * | 93.40 | 95.45 * | 78.40 | 94.65 | 94.45 |
| Cowries (species and subspecies) | 91.10 | 86.37 | 91.31 | 72.38 | 91.41 * | 89.20 |
| Amazonian Butterflies (barcode) | 91.73 | 91.20 | 90.40 | 75.47 | 92.00 | 92.80 * |
| Amazonian Butterflies (mtDNA) | 91.47 | 91.20 | 91.73 | 71.20 | 92.00 | 93.60 * |
| Amazonian Butterflies (nuclear gene) | 87.74 | 90.32 * | 90.32 * | 52.90 | 80.64 | 89.03 |
* Best score