Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, Weizhong Li.
Abstract
SUMMARY: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT is capable of handling very large datasets in a much shorter time than previous versions.
AVAILABILITY: http://cd-hit.org.
CONTACT: liwz@sdsc.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Year: 2012 PMID: 23060610 PMCID: PMC3516142 DOI: 10.1093/bioinformatics/bts565
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Comparison to the previous CD-HIT and UCLUST
| Dataset | CD-HIT3 (min) | CD-HIT4 (min) | CD-HIT4 (8 cores) (min) | UCLUST5 (min) |
|---|---|---|---|---|
| Swissprot | 80 | 58 | 12 | 15 |
| NR | 44 | 22 | 6 | 46 |
| TwinStudy | 47 | 19 | 4 | 56 |
| HumanGut | 494 | 42 | 8 | 214 |
The free version of UCLUST5 cannot run on the full NR, TwinStudy and HumanGut datasets, so subsets of ∼1 M NR sequences, 1 M TwinStudy reads and 4 M HumanGut reads were used in this comparison.
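
The timings in the table above can be converted into speedup and parallel-efficiency figures with simple arithmetic. The Python sketch below is not part of the original paper: the dictionary keys and function names are illustrative assumptions, and because the published times are rounded to whole minutes, the resulting ratios are only approximate.

```python
# Wall-clock times in minutes, copied from the comparison table above.
# Key names (cdhit3, cdhit4, ...) are illustrative labels, not from the paper.
TIMES = {
    "Swissprot": {"cdhit3": 80, "cdhit4": 58, "cdhit4_8cores": 12, "uclust5": 15},
    "NR": {"cdhit3": 44, "cdhit4": 22, "cdhit4_8cores": 6, "uclust5": 46},
    "TwinStudy": {"cdhit3": 47, "cdhit4": 19, "cdhit4_8cores": 4, "uclust5": 56},
    "HumanGut": {"cdhit3": 494, "cdhit4": 42, "cdhit4_8cores": 8, "uclust5": 214},
}
CORES = 8  # core count of the parallel CD-HIT4 column


def speedup(baseline_min, accelerated_min):
    """Return how many times faster the accelerated run is than the baseline."""
    return baseline_min / accelerated_min


def summarize(times, cores=CORES):
    """Compute per-dataset speedups and parallel efficiency from the table."""
    summary = {}
    for dataset, t in times.items():
        parallel = speedup(t["cdhit4"], t["cdhit4_8cores"])
        summary[dataset] = {
            # Single-core CD-HIT4 relative to CD-HIT3 (algorithmic gain only).
            "cdhit4_vs_cdhit3": speedup(t["cdhit3"], t["cdhit4"]),
            # CD-HIT4 on 8 cores relative to CD-HIT4 on 1 core.
            "speedup_8_cores": parallel,
            # Fraction of ideal (linear) speedup actually achieved.
            "efficiency": parallel / cores,
        }
    return summary


if __name__ == "__main__":
    for dataset, row in summarize(TIMES).items():
        print(
            f"{dataset}: v4/v3 = {row['cdhit4_vs_cdhit3']:.2f}x, "
            f"8-core speedup = {row['speedup_8_cores']:.2f}x, "
            f"efficiency = {row['efficiency']:.0%}"
        )
```

For HumanGut, for example, this gives 42 / 8 = 5.25× on 8 cores, or roughly 66% of ideal linear speedup, with the caveat that an 8-minute figure rounded to the nearest minute carries noticeable relative error.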
Fig. 1. Evaluation of CD-HIT parallelization: computational time speedup with respect to the number of CPU cores used