Arong Luo, Haiqiang Lan, Cheng Ling, Aibing Zhang, Lei Shi, Simon Y W Ho, Chaodong Zhu.
Abstract
For some groups of organisms, DNA barcoding can provide a useful tool in taxonomy, evolutionary biology, and biodiversity assessment. However, the efficacy of DNA barcoding depends on the degree of sampling per species, because a large enough sample size is needed to provide a reliable estimate of genetic polymorphism and for delimiting species. We used a simulation approach to examine the effects of sample size on four estimators of genetic polymorphism related to DNA barcoding: mismatch distribution, nucleotide diversity, the number of haplotypes, and maximum pairwise distance. Our results showed that mismatch distributions derived from subsamples of ≥20 individuals usually bore a close resemblance to that of the full dataset. Estimates of nucleotide diversity from subsamples of ≥20 individuals tended to be bell-shaped around that of the full dataset, whereas estimates from smaller subsamples were not. As expected, greater sampling generally led to an increase in the number of haplotypes. We also found that subsamples of ≥20 individuals allowed a good estimate of the maximum pairwise distance of the full dataset, while smaller ones were associated with a high probability of underestimation. Overall, our study confirms the expectation that larger samples are beneficial for the efficacy of DNA barcoding and suggests that a minimum sample size of 20 individuals is needed in practice for each population.
Keywords: Coalescence; haplotype; maximum pairwise distance; mismatch distribution; nucleotide diversity
Year: 2015 PMID: 26811761 PMCID: PMC4717336 DOI: 10.1002/ece3.1846
Source DB: PubMed Journal: Ecol Evol ISSN: 2045-7758 Impact factor: 2.912
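As a rough illustration of the four estimators named in the abstract, the sketch below computes each of them for a small set of aligned sequences. It is not the authors' code: the function names `pairwise_differences` and `summarize`, the toy sequences, and the choice of uncorrected p-distance (proportion of differing sites) are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def pairwise_differences(seqs):
    """Count differing sites for every pair of equal-length sequences."""
    return [sum(a != b for a, b in zip(s, t)) for s, t in combinations(seqs, 2)]


def summarize(seqs):
    """Return the four estimators for a list of aligned sequences."""
    length = len(seqs[0])
    diffs = pairwise_differences(seqs)
    return {
        # Mismatch distribution: number of pairs that differ at k sites.
        "mismatch": dict(sorted(Counter(diffs).items())),
        # Nucleotide diversity: mean pairwise differences per site.
        "pi": sum(diffs) / len(diffs) / length,
        # Number of distinct sequences in the sample.
        "haplotypes": len(set(seqs)),
        # Maximum pairwise distance (assumed here to be a p-distance).
        "max_distance": max(diffs) / length,
    }


# Hypothetical toy alignment: four sequences of eight sites.
toy = ["ACGTACGT", "ACGTACGA", "ACGTACGA", "TCGTACGA"]
stats = summarize(toy)
print(stats)
```

Applying `summarize` to random subsamples of different sizes and comparing the results with those of the full dataset is the kind of comparison the abstract describes.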
Figure 1. Split information around internal nodes of four chosen genealogies. The x‐axis represents seven internal nodes beginning at the root, while the y‐axis represents the size of the larger daughter clade. Among the ten trees consisting of 500 tips, data are shown here for tree_A (blue solid circles), tree_B (green solid circles), tree_F (red solid circles), and tree_I (yellow solid circles). Empty black circles represent data from a balanced tree topology.
Figure 2. Mismatch distributions together with kernel density estimates of dataset seq_I and its subsamples. Only the result from one randomly chosen subsample of each size is shown here.
Figure 3. Histograms showing distributions of nucleotide diversity values of subsamples from dataset seq_J. The blue curves are from kernel density estimates, while the red vertical lines indicate the nucleotide diversity of the full dataset.
Descriptive statistics of nucleotide diversities. Each of the ten datasets (from seq_A to seq_J) contains 500 simulated sequences, while seq_K and seq_L contain 300 and 1000 sequences, respectively.
| Dataset | Statistic | | 2 | 5 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Seq_A | Mean value | 0.014 | 0.014 | 0.014 | 0.014 | 0.014 | 0.014 | 0.014 | 0.014 | 0.014 | 0.014 | 0.014 | 0.014 | 0.014 |
| | Percent of values in range | | 0.00% | 36.73% | 87.80% | 61.97% | 89.11% | 94.99% | 97.27% | 98.17% | 98.61% | 99.37% | 99.62% | 99.75% |
| Seq_B | Mean value | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 |
| | Percent of values in range | | 8.17% | 32.13% | 46.32% | 71.65% | 83.74% | 89.64% | 93.58% | 95.84% | 97.44% | 98.10% | 98.78% | 99.05% |
| Seq_C | Mean value | 0.011 | 0.011 | 0.011 | 0.011 | 0.011 | 0.011 | 0.011 | 0.011 | 0.011 | 0.011 | 0.011 | 0.011 | 0.011 |
| | Percent of values in range | | 0.64% | 27.28% | 37.39% | 55.44% | 65.86% | 74.53% | 79.13% | 82.52% | 85.99% | 88.96% | 90.57% | 91.98% |
| Seq_D | Mean value | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 | 0.009 |
| | Percent of values in range | | 4.40% | 58.52% | 81.59% | 92.18% | 96.90% | 98.70% | 99.46% | 99.81% | 99.96% | 99.96% | 100.00% | 100.00% |
| Seq_E | Mean value | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 |
| | Percent of values in range | | 29.19% | 40.57% | 61.32% | 80.06% | 88.82% | 93.06% | 96.20% | 97.77% | 98.94% | 99.20% | 99.48% | 99.76% |
| Seq_F | Mean value | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 |
| | Percent of values in range | | 19.79% | 50.49% | 75.99% | 90.58% | 96.14% | 97.71% | 98.68% | 99.29% | 99.67% | 99.85% | 99.89% | 99.94% |
| Seq_G | Mean value | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 | 0.012 |
| | Percent of values in range | | 6.23% | 35.48% | 56.93% | 80.71% | 90.35% | 95.30% | 97.96% | 99.06% | 99.45% | 99.76% | 99.92% | 99.92% |
| Seq_H | Mean value | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 |
| | Percent of values in range | | 20.19% | 8.03% | 32.76% | 60.49% | 72.78% | 77.77% | 83.80% | 88.68% | 92.31% | 93.81% | 95.75% | 96.62% |
| Seq_I | Mean value | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 |
| | Percent of values in range | | 0.00% | 30.79% | 78.59% | 90.98% | 94.87% | 97.22% | 98.70% | 98.95% | 99.41% | 99.59% | 99.71% | 99.85% |
| Seq_J | Mean value | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 | 0.013 |
| | Percent of values in range | | 0.05% | 26.03% | 35.84% | 52.19% | 63.23% | 71.47% | 77.83% | 83.00% | 86.59% | 89.20% | 91.91% | 93.84% |
| Seq_K | Mean value | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | NA | NA | NA | NA |
| | Percent of values in range | | 4.77% | 56.27% | 78.72% | 87.34% | 91.99% | 94.79% | 96.85% | 98.01% | NA | NA | NA | NA |
| Seq_L | Mean value | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | NA | NA | NA | NA |
| | Percent of values in range | | 22.75% | 24.00% | 43.79% | 64.91% | 75.38% | 82.92% | 87.62% | 90.91% | NA | NA | NA | NA |
The column headings 2 to 100 give the size of the subsamples that were drawn randomly from the full dataset.
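The "percent of values in range" rows can be read as the outcome of a repeated-subsampling experiment. The following is a minimal sketch of that idea under stated assumptions: the range used in the table is not recoverable from this text, so the ±10% band (`tolerance`), the number of repeats, and the simulated toy data are all hypothetical.

```python
import random
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site for aligned sequences."""
    length = len(seqs[0])
    diffs = [sum(a != b for a, b in zip(s, t)) for s, t in combinations(seqs, 2)]
    return sum(diffs) / len(diffs) / length


def percent_in_range(seqs, size, repeats=200, tolerance=0.1, seed=0):
    """Percent of subsamples whose diversity lies within full value * (1 ± tolerance)."""
    rng = random.Random(seed)
    full_pi = nucleotide_diversity(seqs)
    low, high = full_pi * (1 - tolerance), full_pi * (1 + tolerance)
    hits = 0
    for _ in range(repeats):
        sub = rng.sample(seqs, size)  # draw without replacement
        if low <= nucleotide_diversity(sub) <= high:
            hits += 1
    return 100.0 * hits / repeats


# Hypothetical toy data: 40 sequences, each a lightly mutated copy of one base sequence.
data_rng = random.Random(1)
base = "ACGT" * 10
data = []
for _ in range(40):
    sites = list(base)
    for i in data_rng.sample(range(len(sites)), 2):
        sites[i] = data_rng.choice("ACGT")
    data.append("".join(sites))

for size in (2, 5, 10, 20):
    print(size, percent_in_range(data, size))
```

Drawing the whole dataset as a "subsample" always falls inside the band, which gives a simple sanity check on the function.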
Figure 4. (A) Boxplots showing the numbers of haplotypes across the 100 repeated subsamples of each size from dataset seq_C. The x‐axis denotes the sample size, while the y‐axis represents the number of haplotypes. (B) Ten asymptotic‐logarithm curves corresponding to the ten Michaelis‐Menten equations, which were estimated from the median values in boxplots of datasets from seq_A to seq_J.
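A Michaelis‐Menten curve of the form H(n) = H_max · n / (K + n) can be fitted to median haplotype counts in several ways. The sketch below uses the Lineweaver-Burk linearization (ordinary least squares on 1/H against 1/n), a swapped-in technique chosen for brevity; the fitting procedure used for Figure 4B is not stated in this text. The function name `fit_michaelis_menten` and the synthetic medians are hypothetical.

```python
def fit_michaelis_menten(sizes, medians):
    """Fit H(n) = h_max * n / (k + n) by least squares on 1/H versus 1/n."""
    xs = [1.0 / n for n in sizes]
    ys = [1.0 / h for h in medians]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope and intercept of the line 1/H = (k / h_max) * (1/n) + 1 / h_max.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    h_max = 1.0 / intercept
    k = slope * h_max
    return h_max, k


# Synthetic medians generated from known parameters (not data from the paper).
sizes = [2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
medians = [60.0 * n / (25.0 + n) for n in sizes]
h_max_hat, k_hat = fit_michaelis_menten(sizes, medians)
print(h_max_hat, k_hat)
```

The linearization gives the smallest sample sizes the most influence on the fit, so with noisy medians a nonlinear least-squares fit on the original scale may be preferable.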
Figure 5. Histograms showing distributions of maximum pairwise distances of subsamples from dataset seq_E. The red vertical lines indicate the maximum pairwise distance of the full dataset.