Wei Chen, Clarence K Zhang, Yongmei Cheng, Shaowu Zhang, Hongyu Zhao.
Abstract
Recent studies of 16S rRNA sequences through next-generation sequencing have revolutionized our understanding of microbial community composition and structure. One common approach to exploring the genetic diversity of a microbial community with these data is to cluster the 16S rRNA sequences into Operational Taxonomic Units (OTUs) based on sequence similarity. The inferred OTUs can then be used to estimate species diversity, composition, and richness. Although a number of methods have been developed and are commonly used to cluster sequences into OTUs, relatively little guidance is available on their relative performance or on the choice of key parameters for each method. In this study, we conducted a comprehensive evaluation of ten existing OTU inference methods. We found that the appropriate dissimilarity value for defining distinct OTUs depends not only on the specific method but also on the sample complexity. For data sets with low complexity, all the algorithms need a higher dissimilarity threshold to define OTUs. Some methods, such as CROP and SLP, are more robust to the specific choice of threshold than others, especially for shorter reads. For high-complexity data sets, hierarchical clustering methods need a stricter dissimilarity threshold to define OTUs, because the commonly used dissimilarity threshold of 3% often leads to an underestimate of the number of OTUs. In general, hierarchical clustering methods perform better at lower dissimilarity thresholds. Our results show that sequence abundance plays an important role in OTU inference. We conclude that care is needed in choosing thresholds for both dissimilarity and abundance in OTU inference.
Year: 2013 PMID: 23967117 PMCID: PMC3742672 DOI: 10.1371/journal.pone.0070837
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Details on the simulated datasets.
| Data set | Num. species | Species length | Species similarity range (%) | Total reads | Initial abundance ratio (%) |
| Simclone10_1 | 10 | 226∼252 | 70.00∼93.00 | 69958 | [6.25,3.89,5.06,11.31,7.74,29.77,8.04,12.50,5.95,9.52] |
| Simclone10_2 | 10 | 218∼255 | 70.59∼92.94 | 148374 | [6.13,8.02,4.25,9.91,9.43,14.62,10.38,12.26,11.79,13.21] |
| Simclone15_1 | 15 | 59∼71 | 50.68∼83.04 | 63616 | [3.57,1.78,1.78,5.35,3.57,17.86,3.57,7.14,1.79,1.79,5.36,3.57,10.71,14.29,17.86] |
| Simclone15_2 | 15 | 59∼82 | 50.68∼82.54 | 134092 | [7.58,5.30,12.12,18.94,3.79,8.71,17.05,3.79,8.71,14.02] |
| Simclone20 | 20 | 64∼261 | 25.58∼94.19 | 115654 | [2.87,1.97,2.12,4.54,4.84,9.53,6.05,6.05,3.02,2.42,3.18,5.45,3.18,7.57,8.62,6.81,5.45,4.84,8.62,2.87] |
| Simclone30 | 30 | 212∼241 | 69.72∼94.44 | 128308 | [4.41,5.26,1.07,4.21,2.98,0.96,1.73,3.07,4.56,5.45,1.12,4.92,4.76,2.66,3.93,1.02,2.10,4.90,4.56,4.48,3.46,0.85,4.62,4.77,3.31,4.01,3.55,2.25,3.87,1.14] |
| Simclone50 | 50 | 210∼242 | 69.86∼95.85 | 152373 | [1.22,3.10,2.78,0.68,0.58,1.70,3.50,1.46,2.19,0.90,3.22,1.09,1.87,3.08,3.41,3.72,2.18,0.68,0.81,0.97,3.21,0.99,3.15,1.10,4.06,1.27,0.85,1.19,2.20,1.86,1.66,3.53,2.16,1.94,3.72,1.19,3.21,3.39,1.42,2.07,0.44,0.38,2.38,2.95,3.93,0.62,2.47,1.91,0.21,1.38] |
| Simclone100 | 100 | 212∼276 | 67.03∼95.93 | 248968 | [0.14,0.94,1.55,0.72,1.44,0.89,0.62,0.91,1.29,0.97,0.37,1.76,0.70,0.43,1.02,0.48,0.32,1.12,1.25,0.81,1.04,0.36,0.94,0.42,0.57,0.60,0.32,1.44,0.45,0.89,1.15,0.81,1.02,1.47,1.38,0.70,0.65,1.01,0.46,1.62,0.41,0.40,1.48,1.20,1.68,0.86,0.72,0.80,0.38,1.46,1.80,1.02,1.25,1.77,1.29,1.09,1.54,0.80,1.74,1.62,1.06,1.70,1.28,1.56,1.13,1.50,0.38,1.58,0.57,1.01,1.13,0.68,1.27,1.76,0.99,0.75,0.37,1.39,0.34,0.77,1.25,0.44,0.94,0.48,1.48,1.23,1.24,0.97,1.58,1.17,1.41,1.58,0.51,0.91,0.58,1.47,0.42,1.16,0.91,0.68] |
| Simclone150 | 150 | 211∼276 | 67.03∼96.97 | 359153 | [0.09,0.64,1.03,0.50,1.00,0.61,0.40,0.64,0.89,0.68,0.25,1.17,0.47,0.29,0.76,0.32,0.21,0.75,0.90,0.55,0.73,0.24,0.65,0.30,0.41,0.41,0.23,1.00,0.31,0.62,0.82,0.57,0.71,1.01,0.97,0.52,0.47,0.69,0.33,1.09,0.28,0.27,1.02,0.82,1.11,0.61,0.51,0.56,0.28,0.97,1.20,0.72,0.86,1.18,0.87,0.76,1.09,0.56,1.19,1.11,0.72,1.16,0.91,1.06,0.78,1.04,0.25,1.05,0.40,0.68,0.77,0.48,0.85,1.17,0.71,0.50,0.26,0.92,0.21,0.53,0.83,0.30,0.65,0.33,0.99,0.84,0.86,0.69,1.07,0.77,0.93,1.06,0.34,0.62,0.43,0.99,0.29,0.82,0.64,0.48,0.72,0.34,0.38,1.32,0.62,0.36,0.25,1.20,1.27,0.60,0.27,1.10,0.59,0.34,0.85,0.25,0.22,0.77,1.24,0.38,0.30,0.21,0.88,0.95,1.17,0.46,0.86,0.27,0.32,0.65,0.72,0.32,0.91,0.70,0.75,0.88,0.31,0.84,1.20,0.60,0.29,0.24,0.19,0.54,0.41,0.55,0.86,0.82,0.49,0.65] |
| Simclone200 | 200 | 190∼276 | 64.26∼96.97 | 484404 | [0.06,0.47,0.77,0.37,0.75,0.46,0.31,0.48,0.65,0.50,0.19,0.87,0.35,0.22,0.54,0.24,0.16,0.57,0.65,0.42,0.54,0.19,0.49,0.22,0.29,0.30,0.18,0.74,0.22,0.46,0.61,0.43,0.52,0.75,0.71,0.38,0.35,0.51,0.24,0.80,0.19,0.20,0.76,0.61,0.83,0.45,0.38,0.40,0.19,0.73,0.88,0.52,0.63,0.86,0.64,0.55,0.82,0.40,0.86,0.82,0.53,0.86,0.65,0.78,0.58,0.76,0.19,0.79,0.30,0.51,0.58,0.35,0.63,0.87,0.53,0.38,0.20,0.70,0.16,0.41,0.60,0.23,0.47,0.24,0.75,0.62,0.64,0.50,0.80,0.59,0.69,0.77,0.25,0.46,0.32,0.71,0.21,0.59,0.47,0.35,0.53,0.26,0.28,0.98,0.47,0.26,0.18,0.88,0.93,0.44,0.18,0.81,0.42,0.25,0.61,0.18,0.17,0.59,0.92,0.28,0.23,0.14,0.66,0.70,0.87,0.33,0.63,0.19,0.24,0.49,0.53,0.23,0.66,0.52,0.56,0.63,0.25,0.62,0.90,0.45,0.21,0.16,0.14,0.39,0.31,0.41,0.62,0.60,0.37,0.48,0.45,0.77,0.46,0.43,0.56,0.56,0.16,0.27,0.47,0.83,0.34,0.65,0.16,0.14,0.62,0.68,0.43,0.51,0.46,0.76,0.62,0.31,0.91,0.48,0.40,0.41,0.86,0.56,0.51,0.81,0.28,0.27,0.37,0.14,0.89,0.18,0.15,0.84,0.76,0.82,0.82,0.18,0.33,0.63,0.92,0.60,0.54,0.75,0.62,0.44] |
Numbers of inferred OTUs from different dissimilarity thresholds for different algorithms.
| Method | Clone43: expected OTUs | Clone43: inferred OTUs (2%) | Clone43: inferred OTUs (3%) | Clone43: inferred OTUs (4%) | Simclone15_1: expected OTUs | Simclone15_1: inferred OTUs (2%) | Simclone15_1: inferred OTUs (3%) | Simclone15_1: inferred OTUs (4%) |
| Mothur | 43 | 1882 | 720 | 369 | 15 | 63 | 41 | 20 |
| Muscle+Mothur | 43 | 2478 | 1418 | 784 | 15 | 117 | 89 | 54 |
| ESPRIT | 43 | 4474 | 4397 | 1733 | 15 | 131 | 131 | 55 |
| ESPRIT-Tree | 43 | 2301 | 1096 | 279 | 15 | 96 | 29 | 16 |
| SLP | 43 | 286 | 245 | 227 | 15 | 17 | 17 | 15 |
| Uclust | 43 | 2177 | 1883 | 597 | 15 | 80 | 75 | 51 |
| CD-HIT | 43 | 1473 | 1464 | 481 | 15 | 50 | 49 | 32 |
| DNAClust | 43 | 3768 | 3658 | 1103 | 15 | 239 | 225 | 53 |
| GramCluster | 43 | 2119 | 2071 | 2071 | 15 | 70 | 70 | 70 |
| CROP | 43 | 339 | 133 | 62 | 15 | 21 | 15 | 15 |
Note: all listed OTU numbers are averages over xx simulations.
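The table shows how strongly the OTU count depends on the dissimilarity threshold. The toy sketch below illustrates that dependence with a simple greedy centroid scheme. It is not any of the ten evaluated methods: the function names, the mismatch-fraction dissimilarity (which assumes equal-length, already aligned reads), and the example reads are all assumptions made for illustration.

```python
def dissimilarity(a: str, b: str) -> float:
    """Fraction of mismatched positions between two equal-length sequences.

    Assumption: reads are already aligned and of equal length; real tools
    compute distances from pairwise or multiple alignments instead.
    """
    if len(a) != len(b):
        raise ValueError("this toy measure assumes equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold):
    """Assign each read to the first centroid within `threshold`,
    otherwise start a new OTU with the read as its centroid."""
    centroids = []
    labels = []
    for s in seqs:
        for i, c in enumerate(centroids):
            if dissimilarity(s, c) <= threshold:
                labels.append(i)
                break
        else:
            # No existing centroid is close enough: open a new OTU.
            centroids.append(s)
            labels.append(len(centroids) - 1)
    return labels


# Hypothetical reads, invented for this example.
reads = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGAAT", "TTTTACGTAC"]
for t in (0.0, 0.1, 0.2):
    print(f"threshold={t:.1f}  OTUs={len(set(greedy_otus(reads, t)))}")
```

Loosening the threshold merges more reads into each OTU, so the count can only fall or stay the same in this scheme. Note also that the result depends on read order, which is one reason different heuristic methods can disagree at the same threshold.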
Figure 1. NID scores of ten algorithms based on the data sets Simclone15_1 and Simclone200.
Figure 2. A precision versus recall plot generated from data set Simclone15_1.
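This record does not say how precision and recall were computed for Figure 2. One common convention for clusterings is pair counting, sketched below as an assumption: a pair of reads is a true positive if the two reads share both a true species and an inferred OTU. The function name and the example labels are hypothetical.

```python
from itertools import combinations


def pair_precision_recall(true_labels, pred_labels):
    """Pair-counting precision and recall of a clustering.

    Precision: of all read pairs placed in the same OTU, the fraction that
    truly belong to the same species. Recall: of all same-species pairs,
    the fraction placed in the same OTU.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    # Convention chosen here: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical example: two species with three reads each.
true_species = [0, 0, 0, 1, 1, 1]
over_split = [0, 0, 1, 2, 2, 3]    # too many OTUs
over_merged = [0, 0, 0, 0, 0, 0]   # everything in one OTU

print(pair_precision_recall(true_species, over_split))
print(pair_precision_recall(true_species, over_merged))
```

Under this convention, over-splitting keeps precision high but lowers recall, while over-merging does the opposite, which is the trade-off a precision versus recall plot makes visible.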
Running time of the different algorithms when clustering sequences into OTUs at dissimilarity thresholds ranging from 0.01 to 0.10, based on Simclone20.
| Method | Input | Running time on Simclone20 (minutes, wall time) |
| Mothur | UniqueSeq | 469.00 |
| Muscle+Mothur | UniqueSeq | 6.27 |
| ESPRIT | All | 75.21 |
| ESPRIT-Tree | All | 2.37 |
| SLP | UniqueSeq | 586.55 |
| Uclust | All | 0.87 |
| CD-HIT | All | 3.85 |
| DNAClust | All | 3.01 |
| GramCluster | All | 36.85 |
| CROP | All | 173.40 |
Note: UniqueSeq indicates that only the unique, unaligned sequences were taken as input; All indicates that all sequences, including identical ones, were taken as input.
Figure 3. OTUs estimated with different frequency thresholds at different dissimilarity levels, from the data set Clone43.
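As a rough illustration of how an abundance cutoff changes the OTU count, the sketch below drops OTUs supported by fewer than a minimum number of reads. It assumes that "frequency threshold" means a minimum read count per OTU, which this record does not define; the function name and the example labels are hypothetical.

```python
from collections import Counter


def count_otus(labels, min_reads=1):
    """Count OTUs that contain at least `min_reads` reads.

    `labels` holds one OTU identifier per read, for example the output
    of a clustering step.
    """
    sizes = Counter(labels)
    return sum(1 for n in sizes.values() if n >= min_reads)


# Hypothetical assignment: two abundant OTUs plus three rare ones,
# standing in for low-abundance clusters such as those produced by
# sequencing errors.
labels = [0] * 50 + [1] * 30 + [2, 3, 3, 4]

for cutoff in (1, 2, 5):
    print(f"min_reads={cutoff}  OTUs={count_otus(labels, cutoff)}")
```

Raising the cutoff removes rare OTUs first, so the estimate moves toward the number of abundant clusters, at the risk of discarding genuinely rare species.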