Jiajie Zhang, Paschalia Kapli, Pavlos Pavlidis, Alexandros Stamatakis.
Abstract
MOTIVATION: Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches either rely on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. The OTU-picking methods scale well on large datasets, but the results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets.
Year: 2013 PMID: 23990417 PMCID: PMC3810850 DOI: 10.1093/bioinformatics/btt499
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Illustration of the PTP. The example tree contains 6 speciation events (R, A, B, D, E, F) and 4 species (C, D, E, F). Species C consists of one individual; species D, E and F have two individuals each. The thick lines represent among-species PTP, and the thin lines represent within-species PTPs. The Newick representation of this tree is ((C:0.14, (d1:0.01, d2:0.02)D:0.1)A:0.15, ((e1:0.015, e2:0.014)E:0.1, (f1:0.03, f2:0.02)F:0.12)B:0.11)R. The tree has a total of 16 different possible species delimitations. The maximum likelihood search returned the depicted species delimitation with a log-likelihood score of 24.77, and estimated rate parameters of 8.33 for the among-species branches and 55.05 for the within-species branches.
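The three numbers at the end of the Fig. 1 caption can be checked by hand. The sketch below assumes that branch lengths within each class (among-species and within-species) are exponentially distributed with a single rate per class, and that each rate is estimated as the number of branches divided by their total length. The branch lengths are copied from the Newick string in the caption; the function and variable names are illustrative only and are not taken from the PTP software.

```python
from math import log

# Branch lengths read off the Newick string in the Fig. 1 caption.
# Among-species (thick) branches lead to C, D, A, E, F and B.
among_species = [0.14, 0.10, 0.15, 0.10, 0.12, 0.11]
# Within-species (thin) branches lead to d1, d2, e1, e2, f1 and f2.
within_species = [0.01, 0.02, 0.015, 0.014, 0.03, 0.02]


def mle_rate(lengths):
    """Maximum likelihood rate of an exponential distribution: n / sum(x)."""
    return len(lengths) / sum(lengths)


def exp_log_likelihood(lengths, rate):
    """Log-likelihood of the lengths under an exponential density rate * exp(-rate * x)."""
    return sum(log(rate) - rate * x for x in lengths)


rate_among = mle_rate(among_species)
rate_within = mle_rate(within_species)
total_log_likelihood = (
    exp_log_likelihood(among_species, rate_among)
    + exp_log_likelihood(within_species, rate_within)
)

print(f"among-species rate:  {rate_among:.2f}")
print(f"within-species rate: {rate_within:.2f}")
print(f"log-likelihood:      {total_log_likelihood:.2f}")
```

Under these assumptions the script reproduces the caption's values (8.33, 55.05 and 24.77), which suggests that the log-likelihood is the sum of the two per-class exponential log-likelihoods.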
Arthropod dataset: number of estimated MOTUs and species for the complete reference data and tree
| No. reads | OTU-picking: number of clusters | OTU-picking: drop-out (%) | OTU-picking: no-match (%) | EPA-PTP: number of clusters | EPA-PTP: drop-out (%) | EPA-PTP: no-match (%) |
|---|---|---|---|---|---|---|
| | 973 | 19 | 42.8 | 587 | 7.3 | 13.6 |
| | 602 | 24 | 25.4 | 516 | 11.5 | 6.2 |
| | — | 36 | — | 441 | 21.9 | 3.2 |
Note: Sanger data (the reference dataset) has a total of 547 MOTUs. The ‘—’ indicates that the number is not available in the original publication.
Number of species delimitated on real data
| Taxon | Morphological | GMYC | PTP | CROP | UCLUST |
|---|---|---|---|---|---|
| | 24 | 48 | 27/44 | 6 | 82 |
| | 7 | 10 | 9/11 | 7 | 11 |
| | 1 | 10 | 9/15 | 7 | 10 |
| | 1 | 2 | 2/2 | 1 | 3 |
| | 7 | 10 | 10/10 | 9 | 15 |

a. Using the ultrametric tree as an input for PTP.
Species delimitation accuracy (measured in NMI) on simulated evenly sampled data
| NMI | b′ = 5 | b′ = 10 | b′ = 20 | b′ = 40 | b′ = 80 | b′ = 160 | Mean (variance) |
|---|---|---|---|---|---|---|---|
| 1000 bp | | | | | | | |
| UCLUST | 0.969 | 0.959 | 0.938 | 0.892 | 0.782 | 0.575 | 0.852 (0.023) |
| CROP | 0.964 | 0.930 | 0.848 | 0.646 | 0.232 | 0.038 | 0.609 (0.151) |
| GMYC | 0.924 | 0.914 | 0.907 | 0.886 | 0.834 | 0.697 | 0.860 (0.007) |
| PTP | 0.944 | 0.935 | 0.922 | 0.905 | 0.882 | 0.857 | 0.907 (0.001) |
| 250 bp | | | | | | | |
| UCLUST | 0.967 | 0.954 | 0.930 | 0.871 | 0.735 | 0.522 | 0.829 (0.029) |
| CROP | 0.961 | 0.917 | 0.800 | 0.545 | 0.152 | 0.024 | 0.566 (0.159) |
| GMYC | 0.892 | 0.620 | 0.484 | 0.464 | 0.550 | 0.503 | 0.585 (0.025) |
| PTP | 0.946 | 0.927 | 0.907 | 0.881 | 0.833 | 0.780 | 0.879 (0.003) |
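For readers unfamiliar with the accuracy measure used in these tables, the following is a minimal sketch of normalized mutual information (NMI) between a true and an inferred assignment of sequences to species. This record does not say which normalization the authors used; the sketch assumes normalization by the arithmetic mean of the two entropies, and the function names are illustrative only.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a cluster labelling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labellings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_xy / n) * log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )


def nmi(a, b):
    """NMI, ASSUMING normalization by the arithmetic mean of the entropies."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        return 1.0  # both labellings put everything in a single cluster
    return 2 * mutual_information(a, b) / (h_a + h_b)


true_species = ["A", "A", "B", "B"]
same_up_to_names = ["x", "x", "y", "y"]   # identical partition, different labels
over_split = [1, 2, 3, 3]                 # species A split into two clusters

print(nmi(true_species, same_up_to_names))
print(nmi(true_species, over_split))
```

NMI depends only on how the items are grouped, not on the cluster names, so a perfect delimitation scores 1 and over-splitting or over-lumping lowers the score.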
Species delimitation accuracy (measured in NMI) on simulated evenly sampled data using the EPA-PTP pipeline
| NMI | b′ = 5 | b′ = 10 | b′ = 20 | b′ = 40 | b′ = 80 | b′ = 160 | Mean (variance) |
|---|---|---|---|---|---|---|---|
| 1000 bp | | | | | | | |
| Full ref. | 0.989 | 0.978 | 0.962 | 0.933 | 0.884 | 0.836 | 0.930 (0.003) |
| 90% ref. | 0.984 | 0.972 | 0.955 | 0.925 | 0.876 | 0.830 | 0.923 (0.003) |
| 80% ref. | 0.976 | 0.966 | 0.949 | 0.921 | 0.872 | 0.823 | 0.917 (0.003) |
| 70% ref. | 0.971 | 0.959 | 0.943 | 0.912 | 0.868 | 0.816 | 0.911 (0.003) |
| 60% ref. | 0.966 | 0.956 | 0.939 | 0.908 | 0.860 | 0.805 | 0.905 (0.003) |
| 50% ref. | 0.962 | 0.950 | 0.934 | 0.904 | 0.853 | 0.787 | 0.898 (0.004) |
| 250 bp | | | | | | | |
| Full ref. | 0.978 | 0.968 | 0.949 | 0.918 | 0.863 | 0.811 | 0.914 (0.004) |
| 90% ref. | 0.967 | 0.955 | 0.935 | 0.907 | 0.854 | 0.800 | 0.903 (0.004) |
| 80% ref. | 0.956 | 0.944 | 0.926 | 0.895 | 0.846 | 0.786 | 0.892 (0.004) |
| 70% ref. | 0.942 | 0.926 | 0.912 | 0.880 | 0.830 | 0.773 | 0.877 (0.004) |
| 60% ref. | 0.927 | 0.911 | 0.893 | 0.861 | 0.813 | 0.755 | 0.860 (0.004) |
| 50% ref. | 0.909 | 0.891 | 0.871 | 0.838 | 0.784 | 0.732 | 0.837 (0.004) |
Note: ref. indicates reference sequences