HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences.

João F Matias Rodrigues, Christian von Mering.

Abstract

MOTIVATION: Nucleotide sequence data are being produced at an ever-increasing rate. Clustering such sequences by similarity is often an essential first step in their analysis, intended to reduce redundancy, define gene families or suggest taxonomic units. Exact clustering algorithms, such as hierarchical clustering, scale relatively poorly in terms of run time and memory usage, yet they are desirable because heuristic shortcuts taken during clustering might have unintended consequences in later analysis steps.
RESULTS: Here we present HPC-CLUST, a highly optimized software pipeline that can cluster large numbers of pre-aligned DNA sequences by running on distributed computing hardware. It allocates both memory and computing resources efficiently, and can process more than a million sequences in a few hours on a small cluster.
AVAILABILITY AND IMPLEMENTATION: Source code and binaries are freely available at http://meringlab.org/software/hpc-clust/; the pipeline is implemented in C++ and uses the Message Passing Interface (MPI) standard for distributed computing.

Year: 2013 PMID: 24215029 PMCID: PMC3892691 DOI: 10.1093/bioinformatics/btt657

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The time complexity of hierarchical clustering algorithms (HCAs) is quadratic or worse, depending on the selected cluster linkage method (Day and Edelsbrunner, 1984). However, HCAs have a number of advantages that make them attractive for applications in biology: (i) they are well defined and should be reproducible across implementations, (ii) they require nothing but a pairwise distance matrix as input and (iii) they are agglomerative, meaning that sets of clusters at arbitrary similarity thresholds can be extracted quickly by post-processing once a complete clustering run has been executed. Consequently, HCAs have been widely adopted in biology, in areas ranging from data mining to sequence analysis to evolutionary biology.
Apart from generic implementations, a number of hierarchical clustering implementations focus on biological sequence data, taking advantage of the fact that distances between sequences can be computed relatively cheaply, even in a transient fashion. However, existing implementations such as MOTHUR (Schloss et al.), ESPRIT (Sun et al.) or RDP online clustering (Cole et al.) all struggle with large sets of sequences. In light of these performance limits, heuristic alternatives such as CD-HIT (Li and Godzik, 2006) and UCLUST (Edgar, 2010) have also been implemented.
Hierarchical clustering starts by considering every sequence separately and merging the two closest ones into a cluster. Then, iteratively, larger clusters are formed by joining the closest sequences and/or clusters. The distance between two clusters containing several sequences depends on the chosen linkage. In single linkage, it is the similarity between the two most similar sequences; in complete linkage, between the two most dissimilar sequences; and in average linkage, the average of all pairwise similarities. The latter method is also known as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) and is often used in the construction of phylogenetic guide trees.
In the approach used by CD-HIT and UCLUST, each input sequence is considered sequentially and is either added to an existing cluster (if it meets the clustering threshold) or used as a seed to start a new cluster. Although this approach is extremely efficient, it can lead to some undesired characteristics (Sun et al.): (i) it can create clusters containing sequences that are more dissimilar than the chosen clustering threshold; (ii) a new cluster may be created close to an existing cluster, at a distance just slightly larger than the clustering threshold; from that point on, any new sequences close to both clusters are split between the two, whereas earlier sequences were added only to the first cluster, which effectively results in a local reduction of the clustering threshold; and (iii) different sequence input orders result in different sets of clusters owing to different choices of seed sequences. Point (i) also affects HCA using single linkage and, to a lesser extent, average linkage, but does not occur with complete linkage.
Here we present a distributed implementation of an HCA that can handle large numbers of sequences. It can compute single-, complete- and average-linkage clusters in a single run and produces a merge log from which clusters can subsequently be parsed at any threshold. In contrast to CD-HIT, UCLUST and ESPRIT, which all take unaligned sequence data as input, HPC-CLUST (like MOTHUR) takes a set of pre-aligned sequences. This allows for flexibility in the choice of alignment algorithm; a future version of HPC-CLUST may include the alignment step as well. For further details on implementation and algorithms, see the Supplementary Material.
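The three linkage rules and the idea of a merge log that can be cut at any threshold can be illustrated with a small sketch. This is a naive, cubic-time toy in Python, not the HPC-CLUST implementation (which is written in C++ with MPI); the function names, the simple column-identity measure and the example sequences are all assumptions made for illustration.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical columns between two aligned sequences of equal length."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_similarity(c1, c2, sim, linkage):
    """Similarity between two clusters given as lists of sequence indices."""
    values = [sim[(min(i, j), max(i, j))] for i in c1 for j in c2]
    if linkage == "single":      # the two most similar members
        return max(values)
    if linkage == "complete":    # the two most dissimilar members
        return min(values)
    if linkage == "average":     # mean over all cross-cluster pairs (UPGMA)
        return sum(values) / len(values)
    raise ValueError(f"unknown linkage: {linkage}")


def hierarchical_clustering(seqs, linkage):
    """Naive agglomerative clustering; returns a merge log of (similarity, a, b)."""
    n = len(seqs)
    sim = {(i, j): identity(seqs[i], seqs[j]) for i, j in combinations(range(n), 2)}
    clusters = [[i] for i in range(n)]
    merge_log = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the highest similarity.
        s, a, b = max(
            ((cluster_similarity(clusters[a], clusters[b], sim, linkage), a, b)
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda t: t[0],
        )
        merge_log.append((s, clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merge_log


def clusters_at(merge_log, n, threshold):
    """Replay every merge whose similarity meets the threshold."""
    label = list(range(n))
    for s, a, b in merge_log:
        if s >= threshold:
            for i in b:              # move members of b into a's cluster
                label[i] = label[a[0]]
    groups = {}
    for i in range(n):
        groups.setdefault(label[i], []).append(i)
    return sorted(groups.values())


# Hypothetical toy alignment: sequences 0, 1 and 2 form a chain of near neighbours.
seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "TTTTTTTTTT"]
single = clusters_at(hierarchical_clustering(seqs, "single"), len(seqs), 0.85)
complete = clusters_at(hierarchical_clustering(seqs, "complete"), len(seqs), 0.85)
average = clusters_at(hierarchical_clustering(seqs, "average"), len(seqs), 0.82)
```

In this toy case single linkage chains sequences 0, 1 and 2 into one cluster at the 0.85 cut even though sequences 0 and 2 are only 80% identical, whereas complete linkage keeps sequence 2 separate. Because the merge log is computed once, other thresholds only require another call to `clusters_at`.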

2 METHODS

For all benchmarks, we used one or more dedicated Dell Blade M605 compute nodes, each with two quad-core 2.33 GHz Opteron processors and 24 GB of random access memory. The most recent version of each software pipeline was used: HPC-CLUST (v1.0.0), MOTHUR (v1.29.2), ESPRIT (Feb. 2011), CD-HIT (v4.6.1) and UCLUST (v6.0.307). Detailed information on settings and parameters is available in the Supplementary Material. We compiled a dataset of publicly available full-length 16S bacterial ribosomal RNA sequences from NCBI GenBank. Sequences were aligned using INFERNAL v1.0.2 with a 16S model for bacteria from the ssu-align package (Nawrocki et al.). Importantly, INFERNAL uses a profile alignment strategy whose runtime scales linearly, O(N), with the number of sequences and that can be trivially parallelized. Indels were removed and sequences were trimmed between two well-conserved alignment columns, such that all sequences had the same aligned length. The final dataset consisted of 1 105 195 bacterial sequences (833 013 unique) with an aligned length of 1301.
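The preprocessing described above (trimming to a common column range, collapsing duplicates, then comparing aligned sequences column by column) can be sketched as follows. This is a minimal illustration only: the helper names, the column indices and the toy sequences are hypothetical, and treating a gap as an ordinary character in the identity count is an assumption rather than the documented behaviour of HPC-CLUST.

```python
from collections import Counter


def trim_alignment(aligned, start, end):
    """Keep only alignment columns start..end-1 so all sequences share one length."""
    return [s[start:end] for s in aligned]


def deduplicate(seqs):
    """Collapse identical sequences; return unique sequences and their abundances."""
    counts = Counter(seqs)          # preserves first-seen order
    return list(counts), counts


def identity(a, b):
    """Fraction of identical columns between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical pre-aligned input; columns 2..7 stand in for the conserved range.
aligned = ["--ACGTAC--", "--ACGTAC--", "-AACGTTCG-", "--ACGAAC--"]
trimmed = trim_alignment(aligned, 2, 8)
unique, abundance = deduplicate(trimmed)
pairwise = identity(unique[0], unique[1])
```

Deduplicating before clustering means pairwise comparisons are only needed among the unique sequences (833 013 rather than 1 105 195 in the dataset above).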

3 RESULTS

3.1 Clustering performance on a single computer

HPC-CLUST has been highly optimized for computation speed and memory efficiency. It is by far the fastest of the exact clustering implementations tested here, even when running on a single computer (Fig. 1). Compared with MOTHUR, it produces identical or nearly identical clustering results (see Supplementary Material). Because CD-HIT and UCLUST use a different approach to clustering, they are not directly comparable and are included for reference only.
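As a rough illustration of why greedy, seed-based clustering is not directly comparable with hierarchical clustering, the sketch below implements a simplified sequential seed scheme. It is not the actual CD-HIT or UCLUST algorithm (both add heuristics such as sequence sorting and fast filters); the function names and toy sequences are assumptions.

```python
def identity(a, b):
    """Fraction of identical columns between two aligned sequences of equal length."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first seed it matches, else start a new cluster."""
    seeds, clusters = [], []
    for s in seqs:
        for k, seed in enumerate(seeds):
            if identity(s, seed) >= threshold:
                clusters[k].append(s)
                break
        else:
            # No existing seed is similar enough: s becomes a new seed.
            seeds.append(s)
            clusters.append([s])
    return clusters


a, b, c = "AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT"
order_1 = greedy_cluster([a, b, c], 0.9)   # a is the first seed
order_2 = greedy_cluster([b, a, c], 0.9)   # b is the first seed
```

With `a` as the seed, `c` falls outside the threshold and starts its own cluster; with `b` as the seed, all three sequences end up together even though `a` and `c` are only 80% identical. The same input therefore yields different clusters depending on input order.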

Fig. 1.

Runtime comparisons. For HPC-CLUST and MOTHUR, runtimes are shown both including and excluding sequence alignment runtime. UCLUST and CD-HIT exhibited only negligible decreases in runtime when using multiple threads. The identity threshold for clustering was 98%.

In HPC-CLUST, the largest fraction of computation time is spent calculating the pairwise sequence distances, the second largest is spent sorting the distances, and the final clustering step is the fastest. HPC-CLUST can make use of multithreaded execution on multiple nodes and achieves practically optimal parallelization in the distance calculation step. Additional benchmarks are shown and discussed in the Supplementary Material.
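The three stages named above (distance calculation, sorting, clustering) can be sketched in a few lines. This is a serial Python toy under stated assumptions: the workers are simulated in a loop rather than run through MPI or threads, rows are assigned round-robin, only pairs at or above the threshold are kept, and only single linkage is shown. None of these choices is confirmed by the text as the HPC-CLUST design.

```python
def identity(a, b):
    """Fraction of identical columns between two aligned sequences of equal length."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def distances_for_worker(seqs, worker, n_workers, threshold):
    """Stage 1: one worker's share of pairs, rows assigned round-robin (assumed)."""
    out = []
    for i in range(worker, len(seqs), n_workers):
        for j in range(i + 1, len(seqs)):
            s = identity(seqs[i], seqs[j])
            if s >= threshold:
                out.append((s, i, j))
    return out


def single_linkage(n, sorted_pairs):
    """Stage 3: union-find over pairs already sorted by decreasing similarity."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    merge_log = []
    for s, i, j in sorted_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:                        # the pair joins two distinct clusters
            parent[rj] = ri
            merge_log.append((s, i, j))
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return merge_log, sorted(groups.values())


def pipeline(seqs, threshold, n_workers=2):
    pairs = []
    for w in range(n_workers):              # simulated workers, run serially here
        pairs.extend(distances_for_worker(seqs, w, n_workers, threshold))
    pairs.sort(reverse=True)                # Stage 2: most similar pairs first
    return single_linkage(len(seqs), pairs)


seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "TTTTTTTTTT"]
merge_log, clusters = pipeline(seqs, threshold=0.8)
```

The distance stage parallelizes naturally because each pair is computed independently of all others, which is consistent with the near-optimal scaling reported for that step.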

3.2 Distributed clustering performance

Clustering the full dataset (833 013 unique sequences) at a 97% identity threshold required a total of 2 h and 42 min on a compute cluster of 24 nodes with 8 cores each (192 cores in total). Owing to parallelization, the distance and sorting computation took only 57 min (wall clock time), corresponding to >10 000 min of CPU time. The remaining 1 h and 45 min (wall clock time) were spent collecting and clustering the distances. The combined total memory used for the distance matrix was 59.8 GB, or 2.6 GB per node. The node on which the merging step was performed used a maximum of 4.9 GB of memory when doing single-, complete- and average-linkage clusterings in the same run.
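The figures quoted above can be cross-checked with simple arithmetic. The short calculation below uses only numbers from the text; the variable names are ours, and the upper bound on CPU time assumes every core was busy for the full 57 min.

```python
nodes, cores_per_node = 24, 8
total_cores = nodes * cores_per_node                     # 192 cores

total_wall_min = 2 * 60 + 42                             # 2 h 42 min in minutes
distance_sort_wall_min = 57
remaining_wall_min = total_wall_min - distance_sort_wall_min   # collecting + clustering

# Upper bound on CPU time for the parallel stage if all cores were fully used.
cpu_min_upper_bound = distance_sort_wall_min * total_cores

# Number of distinct sequence pairs among the unique sequences.
n_unique = 833_013
n_pairs = n_unique * (n_unique - 1) // 2

# Bytes that would be available per pair if every pair were held in 59.8 GB.
bytes_per_pair_if_all_stored = 59.8e9 / n_pairs
```

The remaining wall-clock time comes out at 105 min (1 h 45 min) and the CPU bound at 10 944 min, both consistent with the reported values. With roughly 3.5 × 10^11 pairs, 59.8 GB would amount to well under one byte per pair, which suggests (as an inference, not a statement from the text) that only a subset of the pairwise distances is held in memory.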

4 CONCLUSION

Clustering is often among the first steps when dealing with raw sequence data, and therefore needs to be as fast and as memory efficient as possible. The distributed implementation of hierarchical clustering in HPC-CLUST now makes it possible to fully cluster a much larger number of sequences, limited essentially only by the number of available computing nodes.

REFERENCES

1. Search and clustering orders of magnitude faster than BLAST.

Authors: Robert C Edgar
Journal: Bioinformatics Date: 2010-08-12 Impact factor: 6.937

2. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Authors: Weizhong Li; Adam Godzik
Journal: Bioinformatics Date: 2006-05-26 Impact factor: 6.937

3. Infernal 1.0: inference of RNA alignments.

Authors: Eric P Nawrocki; Diana L Kolbe; Sean R Eddy
Journal: Bioinformatics Date: 2009-03-23 Impact factor: 6.937

4. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities.

Authors: Patrick D Schloss; Sarah L Westcott; Thomas Ryabin; Justine R Hall; Martin Hartmann; Emily B Hollister; Ryan A Lesniewski; Brian B Oakley; Donovan H Parks; Courtney J Robinson; Jason W Sahl; Blaz Stres; Gerhard G Thallinger; David J Van Horn; Carolyn F Weber
Journal: Appl Environ Microbiol Date: 2009-10-02 Impact factor: 4.792

5. A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis.

Authors: Yijun Sun; Yunpeng Cai; Susan M Huse; Rob Knight; William G Farmerie; Xiaoyu Wang; Volker Mai
Journal: Brief Bioinform Date: 2011-04-27 Impact factor: 11.622

6. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis.

Authors: J R Cole; Q Wang; E Cardenas; J Fish; B Chai; R J Farris; A S Kulam-Syed-Mohideen; D M McGarrell; T Marsh; G M Garrity; J M Tiedje
Journal: Nucleic Acids Res Date: 2008-11-12 Impact factor: 16.971

7. ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences.

Authors: Yijun Sun; Yunpeng Cai; Li Liu; Fahong Yu; Michael L Farrell; William McKendree; William Farmerie
Journal: Nucleic Acids Res Date: 2009-05-05 Impact factor: 16.971
