Yijun Sun, Yunpeng Cai, Li Liu, Fahong Yu, Michael L Farrell, William McKendree, William Farmerie.
Abstract
Recent metagenomics studies of environmental samples suggested that microbial communities are much more diverse than previously reported, and deep sequencing will significantly increase the estimate of total species diversity. Massively parallel pyrosequencing technology enables ultra-deep sequencing of complex microbial populations rapidly and inexpensively. However, computational methods for analyzing large collections of 16S ribosomal sequences are limited. We proposed a new algorithm, referred to as ESPRIT, which addresses several computational issues with prior methods. We developed two versions of ESPRIT, one for personal computers (PCs) and one for computer clusters (CCs). The PC version is used for small- and medium-scale data sets and can process several tens of thousands of sequences within a few minutes, while the CC version is for large-scale problems and is able to analyze several hundreds of thousands of reads within one day. Large-scale experiments are presented that clearly demonstrate the effectiveness of the newly proposed algorithm. The source code and user guide are freely available at http://www.biotech.ufl.edu/people/sun/esprit.html.
Year: 2009 PMID: 19417062 PMCID: PMC2691849 DOI: 10.1093/nar/gkp285
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. (a) k-mer distances are highly correlated with genetic distances. (b) Removing sequence pairs with k-mer distances larger than the default threshold has a negligible impact on the estimation accuracy for the distance levels of interest. The experiment was performed on the 53R seawater sample (see the Results section).
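The filtering idea in Figure 1 can be illustrated with a minimal sketch. This record does not give the k-mer distance formula, so the code below assumes one common definition: one minus the number of shared k-mer occurrences divided by the number of k-mers in the shorter sequence. The function names, the choice of `k`, and the `threshold` argument are hypothetical and are not taken from ESPRIT's source code.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k):
    """Assumed definition: 1 - shared k-mer occurrences / k-mers in the shorter read."""
    n = min(len(x), len(y)) - k + 1
    if n <= 0:
        raise ValueError("both sequences must be at least k characters long")
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # A k-mer occurring a times in x and b times in y contributes min(a, b).
    shared = sum(min(count, cy[word]) for word, count in cx.items())
    return 1.0 - shared / n


def candidate_pairs(seqs, k, threshold):
    """Keep only index pairs whose k-mer distance does not exceed threshold.

    Only these pairs would go on to the more expensive alignment step.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(seqs)), 2)
        if kmer_distance(seqs[i], seqs[j], k) <= threshold
    ]
```

For example, `candidate_pairs(reads, k=6, threshold=0.5)` would return the pairs of reads worth aligning; both argument values here are placeholders, not the tool's defaults.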
Figure 2. Lineage-through-time curves generated by using ESPRIT and MUSCLE+DOTUR algorithms performed on simulation data with each read containing up to (a) 3% and (b) 5% sequencing errors.
Figure 3. Lineage-through-time curves generated by MUSCLE+DOTUR using simple or default parameters performed on simulation data with each read containing up to (a) 3% and (b) 5% sequencing errors.
Running time of the PC version of ESPRIT performed on eight seawater samples
| Data set | 112R | 115R | 137 | 138 | 53R | 55R | FS312 | FS396 |
|---|---|---|---|---|---|---|---|---|
| Number of reads | 9282 | 11 005 | 13 097 | 14 374 | 5000 | 13 902 | 4835 | 17 666 |
| CPU time | 2 m 54 s | 4 m 13 s | 4 m 28 s | 6 m 19 s | 59 s | 7 m 31 s | 49 s | 6 m 34 s |
Figure 4. (a) Lineage-through-time curves, (b) ACE and (c) Chao1 estimates generated by using ESPRIT and MUSCLE+DOTUR algorithms performed on the FS396 data. Error bars of Chao1 estimates represent the 95% confidence interval.
Figure 5. Rarefaction curves generated by using (a) MUSCLE+DOTUR and (b) ESPRIT performed on the FS396 data.
Figure 6. Rarefaction analysis of an air sample. Rarefaction curves are shown for OTUs with sequence variations that do not exceed 1, 3, 5 or 10%.
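As a rough illustration of what a rarefaction curve plots, the sketch below computes the expected number of OTUs in a random subsample of `m` reads, given the number of reads assigned to each OTU. It uses the closed-form hypergeometric expectation, which is an assumption made for brevity; the tools named in the captions may instead estimate the curve by repeated random resampling. The function names are hypothetical.

```python
from math import comb


def expected_otus(otu_sizes, m):
    """Expected number of OTUs observed when m reads are drawn without replacement."""
    n_total = sum(otu_sizes)
    if not 0 <= m <= n_total:
        raise ValueError("m must be between 0 and the total number of reads")
    # An OTU with n_i reads is absent from the subsample with probability
    # C(N - n_i, m) / C(N, m); comb returns 0 when m exceeds N - n_i.
    missing = sum(comb(n_total - size, m) for size in otu_sizes)
    return len(otu_sizes) - missing / comb(n_total, m)


def rarefaction_curve(otu_sizes, step=1):
    """Return (subsample size, expected OTU count) points from 0 to all reads."""
    n_total = sum(otu_sizes)
    return [(m, expected_otus(otu_sizes, m)) for m in range(0, n_total + 1, step)]
```

A curve that is still rising steeply at the full read count suggests that further sequencing would uncover more OTUs at that distance level.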
The number of observed OTUs, the ACE and Chao1 estimates of an air sample at four different distance levels
| Pairwise distance | 0.01 | 0.03 | 0.05 | 0.1 |
|---|---|---|---|---|
| OTUs | 80 238 | 18 686 | 8344 | 2109 |
| ACE | 147 266 | 23 894 | 9664 | 2262 |
| Chao1 | 138 376 | 23 921 | 9748 | 2293 |
| Chao1 95% CI, upper | 139 911 | 24 302 | 9932 | 2362 |
| Chao1 95% CI, lower | 136 881 | 23 566 | 9585 | 2242 |
The 95% CIs of the Chao1 estimates are also provided.
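For readers unfamiliar with the Chao1 richness estimator reported in the table, a minimal sketch follows. It assumes the classic form, observed OTUs plus f1²/(2·f2), where f1 and f2 are the numbers of OTUs seen exactly once and exactly twice, and falls back to the bias-corrected form when no OTU is seen twice. Which variant produced the values in the table is not stated in this record, and the function name is hypothetical.

```python
from collections import Counter


def chao1(otu_sizes):
    """Chao1 richness estimate from a list of per-OTU read counts."""
    s_obs = len(otu_sizes)
    freq = Counter(otu_sizes)
    f1, f2 = freq[1], freq[2]  # singletons and doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form avoids division by zero when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2
```

For instance, `chao1([1, 1, 1, 1, 2, 2, 5])` has 7 observed OTUs, 4 singletons and 2 doubletons, giving 7 + 16/4 = 11.0.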