CD-HIT: accelerated for clustering the next-generation sequencing data.

Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, Weizhong Li.

Abstract

SUMMARY: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions.

AVAILABILITY: http://cd-hit.org.

CONTACT: liwz@sdsc.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2012 PMID: 23060610 PMCID: PMC3516142 DOI: 10.1093/bioinformatics/bts565

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Sequence analysis has played a crucial role in computational biology. With the advancement of next-generation sequencing technologies, the amount of available sequencing data is growing exponentially. Removing redundancy from such data by clustering can be crucial for reducing storage space, computational time and noise interference in some analysis methods. CD-HIT was originally developed to cluster protein sequences to create reference databases with reduced redundancy (Li et al.) and was then extended to support clustering nucleotide sequences (Li and Godzik, 2006). Since its release, CD-HIT has become very widely used for a large variety of applications, ranging from non-redundant dataset creation (Suzek et al.), protein family classification (Yooseph et al.), artifact identification (Niu et al.), metagenomics annotation (Sun et al.) and RNA analysis (Loong and Mishra, 2007) to various prediction studies (Rubinstein and Fiser, 2008). With sequencing data rapidly growing in public data repositories as well as in individual laboratories, there has been strong demand for an enhanced CD-HIT with greater efficiency. In response to this demand, we have developed an enhanced and parallelized version of CD-HIT that exploits the fact that multi-core machines have become very common in ordinary laboratories. A computer cluster-based parallelization procedure for CD-HIT was proposed in Suzek et al.; although not fully parallelized, this procedure provides good speedup on a computer cluster. Since computer clusters are not as easily available as multi-core machines, here we propose an alternative parallelization technique, which assumes a shared-memory model and works well on multi-core machines.

2 METHODS

CD-HIT is a greedy incremental algorithm that starts with the longest input sequence as the first cluster representative and then processes the remaining sequences from long to short, classifying each sequence as redundant or representative based on its similarity to the existing representatives. Similarities are first estimated by common word counting, using word indexing and counting tables, to filter out unnecessary sequence alignments; the alignments that remain are used to compute exact similarities. In the following sections, we describe the techniques used to accelerate CD-HIT.
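The greedy incremental procedure described above can be illustrated with a minimal sketch. This is not CD-HIT's actual code: the names `words`, `identity` and `greedy_cluster` and the parameter `min_shared_words` are hypothetical, the word filter uses a fixed user-supplied cutoff rather than one derived from the identity threshold, and the exact similarity is replaced by a simple edit-distance stand-in instead of a real alignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: the set of all k-letter words in a sequence.
static std::unordered_set<std::string> words(const std::string& s, std::size_t k) {
    std::unordered_set<std::string> w;
    for (std::size_t i = 0; i + k <= s.size(); ++i) w.insert(s.substr(i, k));
    return w;
}

// Stand-in for the exact similarity (an assumption, not CD-HIT's alignment):
// 1 - edit_distance / length of the longer sequence.
static double identity(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            cur[j] = std::min({sub, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    const std::size_t longer = std::max(a.size(), b.size());
    return longer == 0 ? 1.0 : 1.0 - double(prev[b.size()]) / double(longer);
}

// Returns, for each input sequence, the index of its cluster representative.
std::vector<std::size_t> greedy_cluster(const std::vector<std::string>& seqs,
                                        double threshold, std::size_t k,
                                        std::size_t min_shared_words) {
    assert(threshold >= 0.0 && threshold <= 1.0 && k > 0);
    // Process sequences from long to short.
    std::vector<std::size_t> order(seqs.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return seqs[a].size() > seqs[b].size();
    });

    std::vector<std::size_t> reps;                             // representative indices
    std::vector<std::unordered_set<std::string>> rep_words;    // their word sets
    std::vector<std::size_t> assignment(seqs.size());
    for (std::size_t idx : order) {
        auto w = words(seqs[idx], k);
        bool redundant = false;
        for (std::size_t r = 0; r < reps.size() && !redundant; ++r) {
            std::size_t shared = 0;
            for (const auto& x : w) shared += rep_words[r].count(x);
            if (shared < min_shared_words) continue;           // filter: skip the alignment
            if (identity(seqs[reps[r]], seqs[idx]) >= threshold) {
                assignment[idx] = reps[r];
                redundant = true;
            }
        }
        if (!redundant) {                                      // becomes a new representative
            assignment[idx] = idx;
            reps.push_back(idx);
            rep_words.push_back(std::move(w));
        }
    }
    return assignment;
}
```

In this sketch a fixed `min_shared_words` can reject pairs that would have passed the identity threshold, so the cutoff has to be chosen conservatively; the source states only that CD-HIT estimates a filtering threshold, not how.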

2.1 Simplification

In order to support full parallelization, the core steps of CD-HIT have been simplified into two key procedures: a checking procedure and a clustering procedure. Using these two procedures, the algorithm requires at most two word tables without the need to swap them to disk, which was necessary in the original CD-HIT due to the use of multiple tables for large datasets. Given a word table, the checking procedure checks each of the remaining sequences against the table and determines whether it is a redundant sequence. The clustering procedure will make a final determination of the status of a sequence, and if it is classified as a representative sequence, it is used to update the word table. A more detailed description with illustration is available in Supplementary Material Section 1.2 and Figure S1.

2.2 Parallelization

Given T threads or cores, the basic idea of our novel parallelization technique is to use two word tables: T−1 threads run multiple checking procedures using one word table (an immutable checking table), while the remaining thread runs a single clustering procedure using the other table (a mutable clustering table) in parallel. Due to the sequential nature of CD-HIT, the input sequences must be properly grouped and the word tables switched to guarantee the correctness of the parallelization. This is achieved by dividing each round of computation into two stages. The first stage runs T checking procedures on the sequence group defined for this round of computation, using the word table (checking table) from the previous round. The second stage then uses an additional, empty word table (clustering table) to run a clustering procedure in one thread on the current sequence group, while the remaining T−1 threads run multiple checking procedures on the remaining sequences. Since the clustering procedure may finish before or after the checking procedures, proper scheduling is used to guarantee that all threads are active most of the time. At the end of each round, the clustering table becomes the checking table of the next round, and the checking table of this round is emptied to become the clustering table of the next round. More information, including a detailed description, illustration and pseudocode, is available in the Supplementary Material (Sections 1.3–1.5).
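The two-stage rounds and the table switch can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the Supplementary Material: a "word table" is reduced to a plain list of representative indices, the similarity test is a caller-supplied predicate (`Similar`, hypothetical), groups have a fixed size, and in stage 2 all remaining sequences are checked in every round. The names `check` and `cluster_in_rounds` are illustrative. Only OpenMP pragmas are used, so the code also compiles and runs serially without OpenMP.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using Similar = std::function<bool(const std::string&, const std::string&)>;
constexpr std::size_t kUnassigned = static_cast<std::size_t>(-1);

// Checking procedure: compare one sequence with a table of representatives.
static void check(std::size_t i, const std::vector<std::size_t>& table,
                  const std::vector<std::string>& seqs, const Similar& similar,
                  std::vector<std::size_t>& rep_of) {
    if (rep_of[i] != kUnassigned) return;            // already marked redundant
    for (std::size_t r : table)
        if (similar(seqs[r], seqs[i])) { rep_of[i] = r; return; }
}

// seqs must already be sorted from long to short.
// Returns the representative index of every sequence.
std::vector<std::size_t> cluster_in_rounds(const std::vector<std::string>& seqs,
                                           std::size_t group_size,
                                           const Similar& similar) {
    assert(group_size > 0);
    const long n = static_cast<long>(seqs.size());
    const long g = static_cast<long>(group_size);
    std::vector<std::size_t> rep_of(seqs.size(), kUnassigned);
    std::vector<std::size_t> checking, clustering;   // the two tables

    for (long begin = 0; begin < n; begin += g) {
        const long end = std::min(n, begin + g);

        // Stage 1: all threads check the current group against the
        // checking table built in the previous round.
        #pragma omp parallel for schedule(dynamic)
        for (long i = begin; i < end; ++i)
            check(static_cast<std::size_t>(i), checking, seqs, similar, rep_of);

        // Stage 2: one thread clusters the current group into the empty
        // clustering table; the other threads check the remaining sequences
        // against the (read-only) checking table. The clustering thread
        // joins the checking loop once it is done.
        #pragma omp parallel
        {
            #pragma omp single nowait
            {
                for (long i = begin; i < end; ++i) {
                    const std::size_t s = static_cast<std::size_t>(i);
                    check(s, clustering, seqs, similar, rep_of);
                    if (rep_of[s] == kUnassigned) {  // new representative
                        rep_of[s] = s;
                        clustering.push_back(s);
                    }
                }
            }
            #pragma omp for schedule(dynamic)
            for (long j = end; j < n; ++j)
                check(static_cast<std::size_t>(j), checking, seqs, similar, rep_of);
        }

        // Switch tables: the clustering table becomes the next checking
        // table, and the old checking table is emptied for reuse.
        checking.swap(clustering);
        clustering.clear();
    }
    return rep_of;
}
```

In this sketch no two threads write the same element of `rep_of`: the clustering thread touches only the current group and the checking threads touch only later sequences, which is why no locking is needed.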

2.3 Other enhancements

Besides the parallelization, the new CD-HIT includes other enhancements: faster file reading, better filtering threshold estimation, more efficient word counting and better alignment band estimation. The new filtering threshold estimation is slightly more precise and can filter out more unnecessary alignments. Word counting is now handled more efficiently for input datasets with high redundancy, by maintaining a smaller counting array for hit representatives instead of a full counting array for all representatives. The improved alignment band estimation can find a narrower band for banded alignment.
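One way to read the word-counting enhancement is sketched below: instead of scanning or resetting a counter for every representative on each query, only the representatives that were actually hit are recorded and reset. The class `WordTable` and its methods are hypothetical names, and the data layout is an assumption chosen for brevity; the source does not describe CD-HIT's actual structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class WordTable {
public:
    explicit WordTable(std::size_t k) : k_(k) { assert(k > 0); }

    // Index every distinct k-letter word of a new representative; returns its id.
    std::size_t add_representative(const std::string& seq) {
        const std::size_t id = counts_.size();
        counts_.push_back(0);
        for (std::size_t i = 0; i + k_ <= seq.size(); ++i) {
            auto& ids = index_[seq.substr(i, k_)];
            if (ids.empty() || ids.back() != id) ids.push_back(id);
        }
        return id;
    }

    // For each representative sharing at least one word with the query,
    // return (id, number of query word positions found in it).
    // Work and output size scale with the number of hit representatives.
    std::vector<std::pair<std::size_t, std::uint32_t>> count_hits(const std::string& query) {
        std::vector<std::size_t> hit;                 // ids touched by this query
        for (std::size_t i = 0; i + k_ <= query.size(); ++i) {
            auto it = index_.find(query.substr(i, k_));
            if (it == index_.end()) continue;
            for (std::size_t id : it->second)
                if (counts_[id]++ == 0) hit.push_back(id);
        }
        std::vector<std::pair<std::size_t, std::uint32_t>> result;
        for (std::size_t id : hit) {
            result.emplace_back(id, counts_[id]);
            counts_[id] = 0;                          // reset only what was touched
        }
        return result;
    }

private:
    std::size_t k_;
    std::unordered_map<std::string, std::vector<std::size_t>> index_;
    std::vector<std::uint32_t> counts_;               // scratch counters, all zero between queries
};
```

Because `count_hits` mutates the scratch counters, this sketch is not safe to call from several threads on one table; a multi-threaded variant would keep the counters in per-thread buffers and leave the index read-only.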

2.4 Implementation

This enhanced CD-HIT is implemented in the C++ programming language and uses OpenMP (http://www.openmp.org) for parallelization. The parallel for construct of OpenMP is used for running the checking and clustering procedures in multiple threads with dynamic scheduling. Different computation data buffers are allocated for different threads. The checking word table is immutable and shared by multiple threads.
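The combination of an OpenMP parallel for loop with dynamic scheduling, per-thread buffers and a shared immutable table can be sketched as below. The function `check_all`, the sorted word list standing in for the checking table, and the `min_shared` cutoff are all assumptions made for illustration; the source does not show the actual CD-HIT code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Flags every sequence that shares at least `min_shared` k-letter words
// with a shared, read-only table of words (sorted for binary search).
std::vector<char> check_all(const std::vector<std::string>& seqs,
                            const std::vector<std::string>& table_words,
                            std::size_t k, std::size_t min_shared) {
    assert(k > 0 && std::is_sorted(table_words.begin(), table_words.end()));
    std::vector<char> flagged(seqs.size(), 0);
    const long n = static_cast<long>(seqs.size());

    #pragma omp parallel
    {
        // Declared inside the parallel region, so each thread has its own
        // scratch buffer and reuses it across the iterations it receives.
        std::string word;

        // Dynamic scheduling hands out iterations as threads become free,
        // which helps when sequences differ greatly in length.
        #pragma omp for schedule(dynamic)
        for (long i = 0; i < n; ++i) {
            const std::string& s = seqs[static_cast<std::size_t>(i)];
            std::size_t shared = 0;
            for (std::size_t p = 0; p + k <= s.size(); ++p) {
                word.assign(s, p, k);
                shared += std::binary_search(table_words.begin(),
                                             table_words.end(), word);
            }
            flagged[static_cast<std::size_t>(i)] = shared >= min_shared;
        }
    }
    return flagged;
}
```

The table is only read inside the loop and each iteration writes a distinct element of `flagged`, so no synchronization is required. Built without OpenMP support, the pragmas are ignored and the loop runs on one thread with the same result.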

3 RESULTS

To see how much performance improvement has been achieved, we tested the new CD-HIT (V4.6) and the old CD-HIT (V3.1.2) on two protein sequence datasets, SWISS-PROT (∼0.4 M sequences) and NR (∼12 M sequences), and two DNA sequence datasets, HumanGut (MH0002, ∼23 M reads; Qin et al.) and TwinStudy (SRP000319, ∼8 M reads; Turnbaugh et al.). Both the SWISS-PROT and the NR datasets were downloaded from NCBI (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/) on October 20, 2010. We also compared CD-HIT with a similar program, UCLUST (V5.1.221) from Edgar (2010). These tests were done on a Debian Linux server with four 12-core AMD Opteron 6172 processors. Equivalent parameters were used for different programs whenever possible. Details and additional tests are available in Supplementary Material 2.

Table 1 compares the efficiency of the enhanced CD-HIT to the previous version of CD-HIT and the latest UCLUST. The results demonstrate that the new CD-HIT running on a single core is significantly more efficient than the old one and is comparable to or more efficient than UCLUST as well. When multiple cores are used, the new CD-HIT is much more efficient than either of them. To test the effectiveness of the parallelization in the new CD-HIT, the full datasets were clustered using different numbers of cores. The time speedups are shown in Figure 1, which indicates that the parallelization has good speedup for up to ∼24 cores with a quasi-linear speedup for up to ∼8 cores. Besides speed improvements, the new CD-HIT also has better clustering quality than the old CD-HIT and UCLUST (Supplementary Material and Table S2).

Table 1.

Comparison to the previous CD-HIT and UCLUST

Dataset	CD-HIT3 (min)	CD-HIT4 (min)	CD-HIT4 (8 cores) (min)	UCLUST5 (min)
SWISS-PROT	80	58	12	15
NR	44	22	6	46
TwinStudy	47	19	4	56
HumanGut	494	42	8	214

The free version of UCLUST5 cannot run on the full NR, TwinStudy and HumanGut datasets, so subsets with ∼1 M sequences of NR, 1 M reads of TwinStudy and 4 M reads of HumanGut were used in this comparison.
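The timings in Table 1 can be turned into speedup figures with simple ratios. The helper names `speedup` and `parallel_efficiency` below are hypothetical, and because the table reports whole minutes the resulting ratios are only approximate.

```cpp
#include <cassert>

// Ratio of two running times measured in the same unit (here: minutes).
double speedup(double baseline_min, double improved_min) {
    assert(baseline_min > 0.0 && improved_min > 0.0);
    return baseline_min / improved_min;
}

// Fraction of ideal linear scaling achieved when using `cores` cores.
double parallel_efficiency(double single_core_min, double multi_core_min, int cores) {
    assert(cores > 0);
    return speedup(single_core_min, multi_core_min) / cores;
}
```

For the HumanGut row, `speedup(494, 42)` is about 11.8 (old versus new CD-HIT on one core), `speedup(42, 8)` is 5.25 (one core versus eight cores), and `parallel_efficiency(42, 8, 8)` is about 0.66. These numbers apply to the subsets used in Table 1, not to the full-dataset scaling runs shown in Figure 1.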

Fig. 1.

Evaluation of CD-HIT parallelization: computational time speedup with respect to the number of used CPU cores

4 CONCLUSIONS

In this application note, we presented an enhanced CD-HIT that has been accelerated by a novel parallelization technique and a few other improvements. We tested and demonstrated that this new CD-HIT achieves significant speedup over the previous CD-HIT using a single core, and that its acceleration on multi-core computers scales well to a reasonable number of cores. Clustering of large datasets that normally runs for days can now finish in hours on multi-core machines. We believe this enhanced CD-HIT will find more applications in handling next-generation sequencing data.

Funding: This study was supported by National Institutes of Health award R01RR025030 from the National Center for Research Resources and award R01HG005978 from the National Human Genome Research Institute.

Conflict of Interest: none declared.

REFERENCES

1. Clustering of highly homologous sequences to reduce the size of large protein databases.

Authors: W Li; L Jaroszewski; A Godzik
Journal: Bioinformatics Date: 2001-03 Impact factor: 6.937

2. Search and clustering orders of magnitude faster than BLAST.

Authors: Robert C Edgar
Journal: Bioinformatics Date: 2010-08-12 Impact factor: 6.937

3. Unique folding of precursor microRNAs: quantitative evidence and implications for de novo identification.

Authors: Stanley Ng Kwang Loong; Santosh K Mishra
Journal: RNA Date: 2006-12-28 Impact factor: 4.942

4. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Authors: Weizhong Li; Adam Godzik
Journal: Bioinformatics Date: 2006-05-26 Impact factor: 6.937

5. UniRef: comprehensive and non-redundant UniProt reference clusters.

Authors: Baris E Suzek; Hongzhan Huang; Peter McGarvey; Raja Mazumder; Cathy H Wu
Journal: Bioinformatics Date: 2007-03-22 Impact factor: 6.937

6. A human gut microbial gene catalogue established by metagenomic sequencing.

Authors: Junjie Qin; Ruiqiang Li; Jeroen Raes; Manimozhiyan Arumugam; Kristoffer Solvsten Burgdorf; Chaysavanh Manichanh; Trine Nielsen; Nicolas Pons; Florence Levenez; Takuji Yamada; Daniel R Mende; Junhua Li; Junming Xu; Shaochuan Li; Dongfang Li; Jianjun Cao; Bo Wang; Huiqing Liang; Huisong Zheng; Yinlong Xie; Julien Tap; Patricia Lepage; Marcelo Bertalan; Jean-Michel Batto; Torben Hansen; Denis Le Paslier; Allan Linneberg; H Bjørn Nielsen; Eric Pelletier; Pierre Renault; Thomas Sicheritz-Ponten; Keith Turner; Hongmei Zhu; Chang Yu; Shengting Li; Min Jian; Yan Zhou; Yingrui Li; Xiuqing Zhang; Songgang Li; Nan Qin; Huanming Yang; Jian Wang; Søren Brunak; Joel Doré; Francisco Guarner; Karsten Kristiansen; Oluf Pedersen; Julian Parkhill; Jean Weissenbach; Peer Bork; S Dusko Ehrlich; Jun Wang
Journal: Nature Date: 2010-03-04 Impact factor: 49.962

7. Predicting disulfide bond connectivity in proteins by correlated mutations analysis.

Authors: Rotem Rubinstein; Andras Fiser
Journal: Bioinformatics Date: 2008-01-18 Impact factor: 6.937

8. Artificial and natural duplicates in pyrosequencing reads of metagenomic data.

Authors: Beifang Niu; Limin Fu; Shulei Sun; Weizhong Li
Journal: BMC Bioinformatics Date: 2010-04-13 Impact factor: 3.169

9. Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource.

Authors: Shulei Sun; Jing Chen; Weizhong Li; Ilkay Altintas; Abel Lin; Steve Peltier; Karen Stocks; Eric E Allen; Mark Ellisman; Jeffrey Grethe; John Wooley
Journal: Nucleic Acids Res Date: 2010-11-02 Impact factor: 16.971

10. A core gut microbiome in obese and lean twins.

Authors: Peter J Turnbaugh; Micah Hamady; Tanya Yatsunenko; Brandi L Cantarel; Alexis Duncan; Ruth E Ley; Mitchell L Sogin; William J Jones; Bruce A Roe; Jason P Affourtit; Michael Egholm; Bernard Henrissat; Andrew C Heath; Rob Knight; Jeffrey I Gordon
Journal: Nature Date: 2008-11-30 Impact factor: 49.962
