ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.

Yunpeng Cai¹, Wei Zheng², Jin Yao³, Yujie Yang¹, Volker Mai⁴, Qi Mao³, Yijun Sun²,³,⁵.

Abstract

The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, hierarchical clustering of extremely large sequence datasets is currently computationally expensive because of its quadratic time and space complexity. In this paper, we develop a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity while maintaining a clustering accuracy comparable to that of the standard method. The basic idea is to organize sequences into a pseudo-metric-based partitioning tree for sub-linear-time nearest-neighbor search, and then to use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the Human Microbiome Project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiments demonstrate that, with the power of parallel computing, it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html.
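To make the idea of merging several cluster pairs per round concrete, here is a minimal Python sketch. It is not the ESPRIT-Forest implementation, and the abstract does not spell out the paper's merging criterion. As a stand-in, the sketch merges every reciprocal nearest-neighbor pair whose average-linkage distance is within a threshold. The function names, the Hamming-fraction distance, and the assumption of equal-length sequences are all illustrative choices. The nearest-neighbor search here is brute force (quadratic); that is the step the abstract says a partitioning tree would speed up.

```python
def hamming_fraction(a: str, b: str) -> float:
    """Fraction of mismatched positions (assumed distance; needs equal lengths)."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length (pre-aligned) sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(c1, c2, seqs):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(hamming_fraction(seqs[i], seqs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_multi_pair(seqs, threshold):
    """Round-based agglomerative clustering (hypothetical helper).

    Each round finds every cluster's nearest neighbor, then merges all
    reciprocal nearest-neighbor pairs within `threshold` at once.
    """
    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > 1:
        # Brute-force nearest neighbor of each cluster. The searches are
        # independent of one another, so they could be spread over threads;
        # this sketch runs them sequentially.
        nearest = [
            min(
                (average_linkage(clusters[a], clusters[b], seqs), b)
                for b in range(len(clusters))
                if b != a
            )
            for a in range(len(clusters))
        ]
        # Reciprocal pairs are disjoint, so they can all be merged together.
        pairs = [
            (a, b)
            for a, (d, b) in enumerate(nearest)
            if a < b and nearest[b][1] == a and d <= threshold
        ]
        if not pairs:
            break  # no remaining pair is close enough to merge
        merged = set()
        new_clusters = []
        for a, b in pairs:
            new_clusters.append(sorted(clusters[a] + clusters[b]))
            merged.update((a, b))
        new_clusters.extend(c for k, c in enumerate(clusters) if k not in merged)
        clusters = new_clusters
    return sorted(clusters)


if __name__ == "__main__":
    toy = ["AAAA", "AAAT", "CCCC", "CCCG", "GGGG"]
    print(cluster_multi_pair(toy, threshold=0.3))
```

On the toy input, the two close pairs are merged in the same round and the remaining sequence stays a singleton. Merging only reciprocal pairs keeps the merges within a round independent of each other, which is what allows them to run concurrently.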

Substances: RNA, Ribosomal, 16S

Year: 2017 PMID: 28437450 PMCID: PMC5421816 DOI: 10.1371/journal.pcbi.1005518

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

44 in total

1. QuickTree: building huge Neighbour-Joining trees of protein sequences.

Authors: Kevin Howe; Alex Bateman; Richard Durbin
Journal: Bioinformatics Date: 2002-11 Impact factor: 6.937

2. UPARSE: highly accurate OTU sequences from microbial amplicon reads.

Authors: Robert C Edgar
Journal: Nat Methods Date: 2013-08-18 Impact factor: 28.547

3. FastTree 2--approximately maximum-likelihood trees for large alignments.

Authors: Morgan N Price; Paramvir S Dehal; Adam P Arkin
Journal: PLoS One Date: 2010-03-10 Impact factor: 3.240

4. Classification of metagenomic sequences: methods and challenges. (Review)

Authors: Sharmila S Mande; Monzoorul Haque Mohammed; Tarini Shankar Ghosh
Journal: Brief Bioinform Date: 2012-09-08 Impact factor: 11.622

5. Diverse somatic mutation patterns and pathway alterations in human cancers.

Authors: Zhengyan Kan; Bijay S Jaiswal; Jeremy Stinson; Vasantharajan Janakiraman; Deepali Bhatt; Howard M Stern; Peng Yue; Peter M Haverty; Richard Bourgon; Jianbiao Zheng; Martin Moorhead; Subhra Chaudhuri; Lynn P Tomsho; Brock A Peters; Kanan Pujara; Shaun Cordes; David P Davis; Victoria E H Carlton; Wenlin Yuan; Li Li; Weiru Wang; Charles Eigenbrot; Joshua S Kaminker; David A Eberhard; Paul Waring; Stephan C Schuster; Zora Modrusan; Zemin Zhang; David Stokoe; Frederic J de Sauvage; Malek Faham; Somasekar Seshagiri
Journal: Nature Date: 2010-07-28 Impact factor: 49.962

6. Dynamics and associations of microbial community types across the human body.

Authors: Tao Ding; Patrick D Schloss
Journal: Nature Date: 2014-04-16 Impact factor: 49.962

7. Taxonomic binning of metagenome samples generated by next-generation sequencing technologies.

Authors: Johannes Dröge; Alice C McHardy
Journal: Brief Bioinform Date: 2012-07-31 Impact factor: 11.622

8. Structure, function and diversity of the healthy human microbiome.

Authors: Human Microbiome Project Consortium
Journal: Nature Date: 2012-06-13 Impact factor: 49.962

9. ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time.

Authors: Yunpeng Cai; Yijun Sun
Journal: Nucleic Acids Res Date: 2011-05-19 Impact factor: 16.971

10. De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units.

Authors: Sarah L Westcott; Patrick D Schloss
Journal: PeerJ Date: 2015-12-08 Impact factor: 2.984

2 in total

1. MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples.

Authors: Ehsaneddin Asgari; Kiavash Garakani; Alice C McHardy; Mohammad R K Mofrad
Journal: Bioinformatics Date: 2018-07-01 Impact factor: 6.937

2. Comparison of Methods for Picking the Operational Taxonomic Units From Amplicon Sequences.

Authors: Ze-Gang Wei; Xiao-Dan Zhang; Ming Cao; Fei Liu; Yu Qian; Shao-Wu Zhang
Journal: Front Microbiol Date: 2021-03-24 Impact factor: 5.640
