Scott W Olesen, Claire Duvallet, Eric J Alm.
Abstract
Distribution-based operational taxonomic unit-calling (dbOTU) improves on other approaches by incorporating information about the input sequences' distribution across samples. Previous implementations of dbOTU presented challenges for users. Here we introduce and evaluate a new implementation of dbOTU that is faster and more user-friendly. We show that this new implementation has theoretical and practical improvements over previous implementations of dbOTU, making the algorithm more accessible to microbial ecology and biomedical researchers.
Year: 2017 PMID: 28472072 PMCID: PMC5417438 DOI: 10.1371/journal.pone.0176335
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 2. Comparison of genetic dissimilarities.
The dbOTU2 metric (blue crosses) and dbOTU3’s Levenshtein metric (pink circles) predict true pairwise dissimilarities. Each point represents a comparison between a pair of sequences that were subjected to the genetic criterion while running dbOTU3 on the mock community data.
Comparison of the dbOTU implementations.
| Implementation | Programming languages | Required input | Genetic criterion | Distribution criterion |
|---|---|---|---|---|
| dbOTU1 | Perl, R | Matrices of genetic distances, sequence count table | Values from input genetic distance matrix | Simulated |
| dbOTU2 | Python 2, R | Unaligned and aligned sequences, sequence count table | Proportion of mismatched sites | Simulated |
| dbOTU3 | Python 2/3 | Unaligned sequences, sequence count table | Levenshtein edit distance | Likelihood-ratio test |
*The first two implementations recommended inputting information about dissimilarity of aligned and unaligned sequences, and the dissimilarity used in the genetic criterion was the minimum of those two dissimilarities for each pair of sequences.
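The Levenshtein-based genetic criterion listed for dbOTU3 can be illustrated with a minimal sketch. The function names, the normalization by the longer sequence's length, and the 10% default threshold are assumptions made for illustration; they are not taken from the dbOTU3 source code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def passes_genetic_criterion(seq_a: str, seq_b: str,
                             max_dissimilarity: float = 0.10) -> bool:
    """Hypothetical helper: True if the pair is similar enough to be
    passed on to the distribution criterion."""
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return True  # two empty sequences are trivially identical
    # Assumed normalization: edit distance divided by the longer length.
    return levenshtein(seq_a, seq_b) / longest <= max_dissimilarity
```

Because the edit distance works directly on unaligned sequences, a criterion of this kind needs no alignment step, which matches the "Required input" column above.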
Benchmarks for the speed of the entire OTU calling process.
| Step | Time (sec) |
|---|---|
| mothur alignment | 3.92 ± 0.34 |
| dbOTU1 | 13.17 ± 0.73 |
| dbOTU2 | 13.16 ± 0.17 |
| dbOTU3 | 1.01 ± 0.01 |
“mothur alignment” refers to using mothur to align the input sequences, which was required before running dbOTU1 and dbOTU2. The time required to compute the FastTree distance matrix, which is required for dbOTU1, was small (≲0.1 seconds) and is not shown. Errors show the standard deviations over 10 runs.
Fig 1. Comparisons of communities analyzed by different methods.
dbOTU3 produces results nearly identical to those of dbOTU2 when visualized in a principal coordinate analysis ordination plot. Each point represents a community resulting from analysis of the mock community data by one of the OTU callers. (The two triangles representing dbOTU2 and dbOTU3 always appear on top of one another, making a six-pointed star.) The “true composition” is the community composition expected based on how the communities were constructed. The principal coordinates were computed using a matrix of the square roots of the Jensen-Shannon divergence between each pair of computed community compositions.
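The distance underlying this ordination, the square root of the Jensen-Shannon divergence, can be sketched as follows. The function name and the choice of base-2 logarithms (which bound the result between 0 and 1) are assumptions for illustration; the caption does not state which base was used.

```python
import math


def js_distance(counts_p, counts_q):
    """Square root of the Jensen-Shannon divergence between two
    community compositions given as count (or abundance) vectors."""
    total_p, total_q = sum(counts_p), sum(counts_q)
    p = [c / total_p for c in counts_p]  # relative abundances
    q = [c / total_q for c in counts_q]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution

    def kl(a, b):
        # Kullback-Leibler divergence; zero-abundance terms contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding
```

Applying this function to every pair of communities yields the symmetric distance matrix that a principal coordinate analysis takes as input.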
Benchmarks for the speed of the dissimilarity metric.
| Metric | Time (sec) | Relative time |
|---|---|---|
| Biopython | 171.3 | 1119 |
| Clustal Omega | 40.3 | 263 |
| Levenshtein | 0.2 | 1 |
Times are relative to dbOTU3’s Levenshtein metric. Each method was used to align the 3,688 pairs of sequences that were compared during the evaluation of dbOTU3 on the mock community data.
Accuracy of the genetic classification.
| Metric | Level (%) | True similars | False similars | False dissimilars | True dissimilars | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|---|
| dbOTU2 | 5 | 34 | 0 | 0 | 3,654 | 100.0 | 100.0 |
| dbOTU2 | 10 | 166 | 11 | 1 | 3,510 | 99.4 | 99.7 |
| dbOTU2 | 20 | 435 | 3 | 52 | 3,198 | 89.3 | 99.9 |
| dbOTU2 | 30 | 1,623 | 17 | 474 | 1,574 | 77.4 | 98.9 |
| dbOTU3 | 5 | 34 | 0 | 0 | 3,654 | 100.0 | 100.0 |
| dbOTU3 | 10 | 167 | 30 | 0 | 3,491 | 100.0 | 99.1 |
| dbOTU3 | 20 | 484 | 139 | 3 | 3,062 | 99.4 | 95.7 |
| dbOTU3 | 30 | 2,097 | 854 | 0 | 737 | 100.0 | 46.3 |
The genetic similarity metrics used in dbOTU2 and dbOTU3 were used to classify pairs of sequences as sufficiently genetically similar to be subjected to the distribution criterion. To compute the accuracies, dissimilarities for the 3,688 pairs of sequences compared while running dbOTU3 on the mock community data (i.e., the same ones shown in Fig 2) were computed by the gold standard, dbOTU2, and dbOTU3 (Levenshtein) metrics. Here a “similar” result means that a metric concludes that a pair of sequences are sufficiently genetically similar to be considered for merging into an OTU; a “dissimilar” result means that the sequences are too dissimilar to be merged. The “level” is the genetic dissimilarity threshold used for the test. For example, a dbOTU2 true similar at the 5% level means that the dbOTU2 metric and the gold standard both concluded that the two sequences were at least 95% similar and thus candidates for merging. A false dissimilar, by contrast, means that the sequences were genetically similar but the metric concluded they were not, thus erroneously excluding them from a distribution criterion check.
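The sensitivity and specificity columns follow from the four counts in each row, treating “similar” as the positive class. A minimal sketch, with function names chosen here for illustration:

```python
def sensitivity(true_similars: int, false_dissimilars: int) -> float:
    """Fraction of truly similar pairs that the metric calls similar."""
    return true_similars / (true_similars + false_dissimilars)


def specificity(true_dissimilars: int, false_similars: int) -> float:
    """Fraction of truly dissimilar pairs that the metric calls dissimilar."""
    return true_dissimilars / (true_dissimilars + false_similars)


# Counts copied from the first 20%-level row of the table above.
row = {"true_similars": 435, "false_similars": 3,
       "false_dissimilars": 52, "true_dissimilars": 3198}

sens = 100 * sensitivity(row["true_similars"], row["false_dissimilars"])
spec = 100 * specificity(row["true_dissimilars"], row["false_similars"])
print(f"sensitivity = {sens:.1f}%, specificity = {spec:.1f}%")
```

The four counts in every row sum to 3,688, the number of sequence pairs compared.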
Correlation coefficients between the genetic dissimilarities.
| Metric | Correlation coefficient (%) |
|---|---|
| dbOTU2 | 94.3 (93.9–94.6) |
| dbOTU3 | 97.1 (96.9–97.2) |
The correlations are between the values computed by the gold standard (pairwise alignment with Clustal Omega) and by the metrics used in dbOTU2 and dbOTU3. The correlations were computed for the 167 pairs of sequences that were compared while running dbOTU3 on the mock community data and for which the true genetic dissimilarity is at most 10% (i.e., those points in Fig 2 for which the true dissimilarity is at most 10%). Ranges are 95% confidence intervals.
Benchmarks for the speed of the distribution criteria.
| Distribution criterion | Time (sec) | Relative time |
|---|---|---|
| dbOTU1 | 7.112 | 963 |
| dbOTU2 | 0.146 | 20 |
| dbOTU3 | 0.007 | 1 |
The time reported is the time required to evaluate the distribution criterion for the 47 OTU/sequence pairs that were evaluated while running dbOTU3 on the mock community data.
Sequence count information for the case in which the simulated χ² test and likelihood-ratio test disagree.
| | Even1 | Even2 | Even3 | Uneven1 | Uneven2 | Uneven3 |
|---|---|---|---|---|---|---|
| seq15 | 138 | 129 | 163 | 92 | 258 | 14 |
| seq45 | 15 | 11 | 28 | 1 | 13 | 1 |
In this case, the simulated χ² test determines that the OTU (top row) and candidate sequence (bottom row) are identically distributed but the likelihood-ratio test determines that they are differently distributed (both with respect to the threshold p = 0.001).
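A likelihood-ratio test of this kind can be sketched on the counts above. The formulation is an assumption, not the dbOTU3 source: the null model says the two count vectors share one distribution across samples up to a scale factor, the alternative lets each vector vary freely, and the statistic is compared with a χ² distribution with (number of samples − 1) degrees of freedom. The function names are hypothetical.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper tail probability of a chi-squared distribution with integer df,
    built from the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1)."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def lrt_pvalue(otu_counts, seq_counts) -> float:
    """p-value for the hypothesis that both count vectors follow the same
    distribution across samples (assumed G-statistic formulation)."""
    total_otu, total_seq = sum(otu_counts), sum(seq_counts)
    grand_total = total_otu + total_seq
    g = 0.0
    for a, b in zip(otu_counts, seq_counts):
        column_total = a + b
        for observed, row_total in ((a, total_otu), (b, total_seq)):
            if observed > 0:  # zero counts contribute nothing
                expected = column_total * row_total / grand_total
                g += 2.0 * observed * math.log(observed / expected)
    return chi2_sf(max(g, 0.0), df=len(otu_counts) - 1)


seq15 = [138, 129, 163, 92, 258, 14]  # OTU counts from the table above
seq45 = [15, 11, 28, 1, 13, 1]        # candidate sequence counts
print(lrt_pvalue(seq15, seq45))
```

Under these assumptions the counts give a p-value below the 0.001 threshold, in line with the caption. Because the p-value comes from a closed-form tail probability rather than from repeated simulation, a test of this kind is cheap to evaluate.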
Likelihood-ratio test at varying p-value thresholds.
| Threshold | True dissimilars | False dissimilars | False similars | Accuracy (F1 score, %) |
|---|---|---|---|---|
| 0.0000004 | 14 | 0 | 2 | 93.3 |
| 0.0000020 | 15 | 0 | 1 | 96.8 |
| 0.0000045 | 16 | 0 | 0 | 100.0 |
| 0.0002374 | 16 | 1 | 0 | 97.0 |
| 0.0021540 | 16 | 2 | 0 | 94.1 |
| 0.0024196 | 16 | 3 | 0 | 91.4 |
| 0.0028419 | 16 | 4 | 0 | 88.9 |
| 0.0059160 | 16 | 5 | 0 | 86.5 |
| 0.0241790 | 16 | 6 | 0 | 84.2 |
| 0.0308700 | 16 | 7 | 0 | 82.1 |
| 0.0417458 | 16 | 8 | 0 | 80.0 |
| 0.0436377 | 16 | 9 | 0 | 78.0 |
| 0.0575800 | 16 | 10 | 0 | 76.2 |
The likelihood-ratio test perfectly reproduces the results of the χ² test for a smaller p-value threshold (∼5 × 10⁻⁶). In these comparisons, the χ² test’s p-value threshold was fixed at 0.001, and the likelihood-ratio test’s threshold was adjusted to intermediate values appearing in the list of p-values the test computed. Here “dissimilar” means that the test returned a p-value below the threshold, supporting the conclusion that the OTU and sequence are distributed differently; a “similar” result means that the test returned a p-value above the threshold, not contradicting the conclusion that the OTU and sequence are distributed similarly. The χ² test delivered 16 “dissimilar” results.
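The values in the accuracy column are consistent with an F1 score that treats the χ² test’s “dissimilar” calls as the positive class. The sketch below recomputes a few rows under that assumption; the function name is chosen here for illustration.

```python
def f1_score(true_dissimilars: int, false_dissimilars: int,
             false_similars: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), with 'dissimilar' as the positive class."""
    tp, fp, fn = true_dissimilars, false_dissimilars, false_similars
    return 2 * tp / (2 * tp + fp + fn)


# (threshold, true dissimilars, false dissimilars, false similars)
rows = [
    (0.0000004, 14, 0, 2),
    (0.0000045, 16, 0, 0),
    (0.0002374, 16, 1, 0),
    (0.0575800, 16, 10, 0),
]
for threshold, tp, fp, fn in rows:
    print(f"{threshold:.7f}  F1 = {100 * f1_score(tp, fp, fn):.1f}%")
```

The F1 score ignores the pairs that both tests call “similar”, so it reflects only agreement on the 16 pairs the χ² test flagged plus any extra pairs flagged by the likelihood-ratio test.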