CD-HIT Suite: a web server for clustering and comparing biological sequences.

Ying Huang, Beifang Niu, Ying Gao, Limin Fu, Weizhong Li.

Abstract

CD-HIT is a widely used program for clustering and comparing large biological sequence datasets. To further assist CD-HIT users, we significantly improved this program with more functions and better accuracy, scalability and flexibility. Most importantly, we developed a new web server, CD-HIT Suite, for clustering a user-uploaded sequence dataset or comparing it to another dataset at different identity levels. Users can now interactively explore the clusters within web browsers. We also provide downloadable clusters for several public databases (NCBI NR, Swissprot and PDB) at different identity levels. AVAILABILITY: Free access at http://cd-hit.org

Year: 2010 PMID: 20053844 PMCID: PMC2828112 DOI: 10.1093/bioinformatics/btq003

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Biological sequence databases are growing rapidly due to large-scale genome projects and the emerging field of metagenomics (Yooseph et al., 2007). New sequencing technologies are now producing sequence data at a very high rate, and this has created a greater need for bioinformatics tools to effectively organize and analyze the data. Fortunately, biological sequences are related and may share homology, and thus clustering these sequences into groups and finding a representative or a consensus for each group are practical ways to solve the sequence analysis problems. Our previous work (Li and Godzik, 2006; Li et al., 2001; Li et al., 2002) introduced CD-HIT, which uses short word filtering and a greedy incremental clustering algorithm to cluster and compare large biological sequence datasets. One advantage of CD-HIT is its ultrahigh speed and its ability to handle large datasets. Since its release, CD-HIT has been widely used by many groups in various fields, including UniRef (Suzek et al., 2007), SMART (Letunic et al., 2009) and metagenome data analyses (Turnbaugh et al., 2009; Yooseph et al., 2008). In the last few years, we have been continuously improving this program with more functions and better accuracy, scalability and flexibility. We also implemented a new web server that allows users to cluster or compare sequences without installing and executing the command-line version of CD-HIT locally. The server provides an interactive interface and additional visualization tools. It also provides precalculated and regularly updated sequence clusters for several widely used databases, including NCBI NR, Swissprot and PDB.

2 METHODS AND IMPLEMENTATION

The detailed algorithms and benchmark results for CD-HIT can be found in our previous work (Li and Godzik, 2006; Li et al., 2001; Li et al., 2002). Here, we highlight the novel features and functions.

2.1 Improved clustering algorithm

The original CD-HIT uses a fast greedy incremental clustering process. Briefly, sequences are first sorted by decreasing length. The longest one becomes the representative of the first cluster. Then, each remaining sequence is compared with the existing representatives. If its identity with any representative is above a given threshold, it is grouped into that cluster without being compared to the other representatives. Otherwise, it becomes the representative of a new cluster. In the updated CD-HIT, we added a refined greedy incremental clustering process that produces more accurate clusters. In this process, a sequence is grouped into the most similar cluster instead of the first similar cluster. The refined process does not change the representative sequences.

CD-HIT uses a short word filter to avoid unnecessary alignments. In short, the minimum number of identical short words (k-mers) shared by two sequences depends on their sequence identity and can be calculated analytically or statistically. By counting short words, we can determine that the identity of two sequences is below a given threshold without performing an actual alignment. A short word filter performs much better with a higher identity threshold. Clustering in the refined process is implemented with a dynamic short word filter. For each sequence to be clustered, the initial filter matches the user-defined identity threshold. During the clustering procedure, if this sequence hits any cluster with a better identity, the filter is reset to match this better identity, which increases the performance of the filter. With the dynamic short word filter, although the refined clustering process needs to evaluate the similarity between a sequence and all the existing representatives, it requires only about 1.5–3× the CPU time of the original process.
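The two clustering modes and the short word filter described above can be illustrated with a minimal Python sketch. It is not the CD-HIT implementation: the function names are hypothetical, the "alignment" is a toy ungapped comparison that lays the shorter sequence on the start of the longer one, and the filter uses a simple analytical bound that holds only for this ungapped comparison (each mismatch can destroy at most k of the L - k + 1 words of a sequence of length L).

```python
import math
from collections import Counter


def kmers(seq, k):
    """Multiset of the overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_words(a, b, k):
    """Number of k-mers the two sequences have in common (multiset intersection)."""
    return sum((kmers(a, k) & kmers(b, k)).values())


def matches(query, rep):
    """Toy stand-in for an alignment: identical positions, ungapped, offset 0."""
    return sum(q == r for q, r in zip(query, rep))


def min_shared_words(length, k, needed_matches):
    """Lower bound on shared k-mers if at least needed_matches positions agree."""
    return (length - k + 1) - k * (length - needed_matches)


def greedy_cluster(seqs, threshold, k=2, refined=False):
    """Return {representative: [members]}; assumes unique input sequences."""
    clusters = {}
    for seq in sorted(seqs, key=len, reverse=True):  # longest first
        # Matches required to reach the identity threshold (relative to seq).
        needed = math.ceil(threshold * len(seq) - 1e-9)
        best, best_m = None, 0
        for rep in clusters:
            # Short word filter: too few shared words means the identity
            # cannot reach the current requirement, so skip the comparison.
            if shared_words(seq, rep, k) < min_shared_words(len(seq), k, needed):
                continue
            m = matches(seq, rep)
            if m >= needed and m > best_m:
                best, best_m = rep, m
                if not refined:
                    break      # original process: first similar cluster wins
                needed = m     # dynamic filter: later hits must do at least as well
        if best is None:
            clusters[seq] = [seq]   # seq becomes a new representative
        else:
            clusters[best].append(seq)
    return clusters
```

With the inputs `"AAAAAAAAAA"`, `"AAAACCCCCC"` and `"AAAAACCCC"` at a threshold of 0.5, the first-hit mode places the third sequence with the first representative (5 of 9 positions identical), while `refined=True` places it with the second (8 of 9 identical). The set of representatives is the same in both modes.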

2.2 Improved clustering control

The original CD-HIT uses global sequence identities. The improved CD-HIT also works with local identities. Users can finely control the clustering behavior by including more criteria besides sequence identity cutoffs. The current CD-HIT adds alignment length, unaligned length and alignment coverage of both aligned sequences as new clustering parameters. For example, users can make clusters of sequences of similar length by specifying that the alignments must cover both sequences at similar coverage.
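As a rough illustration of how such criteria combine, the following Python sketch accepts or rejects a candidate cluster member from summary numbers of one pairwise alignment. The class, the function and all parameter names are hypothetical and do not correspond to CD-HIT options. The sketch also assumes that identity is measured as identical residues divided by the length of the shorter sequence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    identical: int      # identical residue pairs in the alignment
    aligned_long: int   # residues of the longer sequence inside the alignment
    aligned_short: int  # residues of the shorter sequence inside the alignment
    len_long: int       # full length of the longer sequence
    len_short: int      # full length of the shorter sequence


def accept(aln: Alignment,
           min_identity: float,
           min_cov_long: float = 0.0,
           min_cov_short: float = 0.0,
           max_unaligned_short: Optional[int] = None) -> bool:
    """Return True if the alignment passes every requested criterion."""
    identity = aln.identical / aln.len_short          # assumed identity definition
    cov_long = aln.aligned_long / aln.len_long        # coverage of the longer sequence
    cov_short = aln.aligned_short / aln.len_short     # coverage of the shorter sequence
    if identity < min_identity:
        return False
    if cov_long < min_cov_long or cov_short < min_cov_short:
        return False
    unaligned_short = aln.len_short - aln.aligned_short
    if max_unaligned_short is not None and unaligned_short > max_unaligned_short:
        return False
    return True
```

Requiring high coverage of both sequences, for example `min_cov_long=0.8` and `min_cov_short=0.8`, keeps a 100-residue fragment out of the cluster of a 300-residue protein even when the fragment matches almost perfectly.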

2.3 Clustering at low identity thresholds

The performance of the native short word filter drops significantly with a lower identity threshold; therefore, the original CD-HIT does not provide protein clustering under 40% identity. However, clustering at low identities has been frequently requested by CD-HIT users. We implemented a script, called PSI-CD-HIT, to perform protein sequence clustering at a low identity threshold such as 30%. It uses a similar greedy incremental clustering strategy, but it uses BLAST to calculate the similarities, so users can also specify an expect-value cutoff. PSI-CD-HIT runs on a stand-alone computer or a Linux cluster. It can cluster a PDB-sized dataset in ∼20 min.

2.4 Hierarchical clustering

In the hierarchical clustering process, the program first performs clustering on the original input dataset at a high identity threshold. The representatives from each clustering step then become the input of the following clustering run at a lower identity threshold. The whole process iteratively joins similar sequences into families and therefore produces a hierarchical structure. For protein sequences, the last step is performed with PSI-CD-HIT if the final identity threshold is <40%. This strategy can maximize the computational efficiency and the quality of clustering. We have applied this strategy in a protein family analysis of a large metagenomic dataset with 17 million sequences (Li et al., 2008).
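The iteration can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the H-CD-HIT code: `identity` is a toy ungapped comparison, `greedy` is a bare first-hit clusterer without a word filter, all names are hypothetical, and the input sequences are assumed to be unique and non-empty.

```python
def identity(a, b):
    """Toy ungapped identity relative to the shorter sequence."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))


def greedy(seqs, threshold):
    """First-hit greedy incremental clustering: {representative: [members]}."""
    clusters = {}
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]   # no representative was similar enough
    return clusters


def hierarchical(seqs, thresholds):
    """Cluster at each threshold in turn, feeding representatives forward.

    thresholds should be given from high to low identity. The result maps
    each final representative to all original sequences in its family.
    """
    families = {s: [s] for s in seqs}
    current = list(seqs)
    for t in thresholds:
        step = greedy(current, t)
        # Merge the families of all representatives joined in this step.
        families = {rep: [x for member in members for x in families[member]]
                    for rep, members in step.items()}
        current = list(step)    # only representatives go to the next run
    return families
```

Each later run sees only the representatives of the previous run, which is where the saving in computation comes from: the expensive low-identity step works on a much smaller input.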

2.5 Annotation enrichment of sequence clusters

We provide an option for joint analysis of sequence clustering and annotation information. Users can place annotation terms (Gene Ontology, protein family, etc.) in the definition lines of input FASTA files. For each annotation term A and each cluster C, we use the following numbers: $N_{AC}$ = number of sequences with A in C; $N_C$ = number of sequences in C; $N_A$ = number of sequences with A in the input; $N$ = number of input sequences. A P-value is calculated using the one-tailed Fisher's exact test to assess whether $N_{AC}/N_C > N_A/N$, that is, whether annotation term A is enriched in cluster C. This functionality is useful for checking the cluster quality at different identity levels and for assigning functions to proteins with unknown function.
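For illustration, a one-tailed Fisher's exact test for enrichment can be computed from these four counts as the upper tail of a hypergeometric distribution. The Python sketch below uses only the standard library. The function and argument names are hypothetical, and the server's own computation may differ in detail.

```python
from math import comb


def enrichment_p_value(n_with_a_in_c, n_in_c, n_with_a, n_total):
    """One-tailed Fisher's exact test P-value for term A being enriched in C.

    This is the probability of seeing at least n_with_a_in_c annotated
    sequences in a cluster of n_in_c sequences drawn without replacement
    from n_total sequences, of which n_with_a carry the annotation.
    """
    upper = min(n_in_c, n_with_a)
    numerator = sum(
        comb(n_with_a, x) * comb(n_total - n_with_a, n_in_c - x)
        for x in range(n_with_a_in_c, upper + 1)
    )
    return numerator / comb(n_total, n_in_c)
```

For example, if 5 of 10 input sequences carry the term and all 5 fall into one cluster of size 5, the P-value is 1/252, or about 0.004.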

2.6 Web server

All basic functions of CD-HIT are provided through tab-based interfaces in our web server. We provide CD-HIT (CD-HIT-EST) to cluster a protein (DNA/RNA) dataset. Users can upload a FASTA file and select a desired sequence identity level and other parameters. CD-HIT-2D (CD-HIT-EST-2D) can compare two databases uploaded by users. H-CD-HIT and H-CD-HIT-EST in our server perform hierarchical clustering in up to three steps. After a clustering or comparison job is submitted, it is assigned a unique identifier, which the user can use to track the status of the job. After the job is finished, we provide the raw outputs generated by the command-line CD-HIT. Additionally, we provide tools to visualize the clustering results: a cluster explorer and cluster distribution plots. The cluster explorer uses a tree structure to represent the clustering results (Fig. 1a). Each cluster is represented by a clickable text object on the web page, and users can click on a representative sequence to retrieve information about the sequences that belong to the cluster. This option is most useful for investigating the results of hierarchical clustering. In this situation, each sequence could be a representative sequence from the previous clustering step, and users can click it to explore the results of the previous clustering. Cluster distribution plots are scatter plots in which the X-axis is the cluster size (number of sequences in a cluster) and the Y-axis represents the number of clusters of at least this size and the number of corresponding sequences (Fig. 1b). This tool is useful for observing the global structure of a sequence database.
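The data behind a cluster distribution plot of this kind can be derived from a list of cluster sizes alone. The following Python sketch shows one way to compute the points. The function name is hypothetical, and it is an illustration rather than the server's plotting code.

```python
def cluster_distribution(sizes):
    """Return (size, clusters with >= size members, sequences in those clusters).

    One tuple is produced for every distinct cluster size, in increasing order.
    """
    rows = []
    for x in sorted(set(sizes)):
        at_least = [s for s in sizes if s >= x]   # clusters of at least this size
        rows.append((x, len(at_least), sum(at_least)))
    return rows
```

For cluster sizes `[1, 1, 2, 5]`, the function returns `[(1, 4, 9), (2, 2, 7), (5, 1, 5)]`: all four clusters (nine sequences) have at least one member, two clusters (seven sequences) have at least two, and one cluster (five sequences) has at least five.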

Fig. 1.

Screenshots of CD-HIT Suite. (a) Cluster Explorer for investigating clusters. (b) A cluster distribution plot to explore the global structure of a whole dataset.

3 CONCLUSION

CD-HIT has been significantly improved from our previous work. CD-HIT Suite provides users with a friendly web interface to perform biological sequence clustering and comparison with additional visualization tools. It also provides precalculated clusters for several public sequence databases which are regularly updated.

REFERENCES

1. Clustering of highly homologous sequences to reduce the size of large protein databases.

Authors: W Li; L Jaroszewski; A Godzik
Journal: Bioinformatics Date: 2001-03 Impact factor: 6.937

2. Tolerating some redundancy significantly speeds up clustering of large protein databases.

Authors: Weizhong Li; Lukasz Jaroszewski; Adam Godzik
Journal: Bioinformatics Date: 2002-01 Impact factor: 6.937

3. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Authors: Weizhong Li; Adam Godzik
Journal: Bioinformatics Date: 2006-05-26 Impact factor: 6.937

4. UniRef: comprehensive and non-redundant UniProt reference clusters.

Authors: Baris E Suzek; Hongzhan Huang; Peter McGarvey; Raja Mazumder; Cathy H Wu
Journal: Bioinformatics Date: 2007-03-22 Impact factor: 6.937

5. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.

Authors: Shibu Yooseph; Granger Sutton; Douglas B Rusch; Aaron L Halpern; Shannon J Williamson; Karin Remington; Jonathan A Eisen; Karla B Heidelberg; Gerard Manning; Weizhong Li; Lukasz Jaroszewski; Piotr Cieplak; Christopher S Miller; Huiying Li; Susan T Mashiyama; Marcin P Joachimiak; Christopher van Belle; John-Marc Chandonia; David A Soergel; Yufeng Zhai; Kannan Natarajan; Shaun Lee; Benjamin J Raphael; Vineet Bafna; Robert Friedman; Steven E Brenner; Adam Godzik; David Eisenberg; Jack E Dixon; Susan S Taylor; Robert L Strausberg; Marvin Frazier; J Craig Venter
Journal: PLoS Biol Date: 2007-03 Impact factor: 8.029

6. Probing metagenomics by rapid cluster analysis of very large datasets.

Authors: Weizhong Li; John C Wooley; Adam Godzik
Journal: PLoS One Date: 2008-10-10 Impact factor: 3.240

7. A core gut microbiome in obese and lean twins.

Authors: Peter J Turnbaugh; Micah Hamady; Tanya Yatsunenko; Brandi L Cantarel; Alexis Duncan; Ruth E Ley; Mitchell L Sogin; William J Jones; Bruce A Roe; Jason P Affourtit; Michael Egholm; Bernard Henrissat; Andrew C Heath; Rob Knight; Jeffrey I Gordon
Journal: Nature Date: 2008-11-30 Impact factor: 49.962

8. SMART 6: recent updates and new developments.

Authors: Ivica Letunic; Tobias Doerks; Peer Bork
Journal: Nucleic Acids Res Date: 2008-10-31 Impact factor: 16.971

9. Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering.

Authors: Shibu Yooseph; Weizhong Li; Granger Sutton
Journal: BMC Bioinformatics Date: 2008-04-10 Impact factor: 3.169
