Weizhong Li, Limin Fu, Beifang Niu, Sitao Wu, John Wooley.
Abstract
Rapid advances in high-throughput sequencing technologies have dramatically promoted metagenomic studies of microbial communities in various environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations and their functions and interactions. However, the massive quantity and complexity of these sequence data pose tremendous challenges for data analysis. These challenges include, but are not limited to, ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families. Clustering analysis also addresses the challenges in metagenomics. A large redundant data set can be represented by a small non-redundant set, in which each cluster is represented by a single entry or a consensus. Artifacts can be rapidly detected through clustering. Errors can be identified, filtered or corrected using the consensus of sequences within clusters.
Year: 2012 PMID: 22772836 PMCID: PMC3504929 DOI: 10.1093/bib/bbs035
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Clustering speed and results for common data sets
| Data set | Program and parameters | Time | Clusters |
|---|---|---|---|
| NCBI NR, proteins, 4.3 GB: 12 054 819 sequences | cd-hit v4.5.7 ‘-n 5 -M 0 -c 0.9’ | 1405/181 | 7 036 029 |
|  | cd-hit v4.5.7 ‘-n 5 -M 0 -c 0.7’ | 962/152 | 4 933 074 |
| Swissprot, proteins, 222 MB: 437 168 sequences | cd-hit 4.5.7 ‘-n 5 -M 0 -c 0.9’ | 3.7/0.8 | 298 617 |
|  | Uclust v5 ‘-id 0.9’ | 17.3 | 301 076 |
|  | cd-hit 4.5.7 ‘-n 5 -M 0 -c 0.7’ | 4.6/0.8 | 190 695 |
|  | Uclust v5 ‘-id 0.7’ | 7.6 | 192 847 |
| Illumina (SRR061270), 380 MB, 5 million reads | cd-hit v4.5.7 ‘-n 10 -M 0 -c 0.95’ | 56.8/9.2 | 956 734 |
|  | Uclust v5 ‘-id 0.95’ | 164.6 | 958 887 |
|  | cd-hit v4.5.7 ‘-n 10 -M 0 -c 0.9’ | 347.5/46.0 | 751 581 |
|  | Uclust v5 ‘-id 0.9’ | 227.5 | 734 981 |
|  | cd-hit v5.0 beta ‘-c 0.9’ | 23.5/4.0 | 750 276 |
|  | SEED (default parameters) | 7.9 | 1 056 109 |
| 1.1 million 16s rRNAs: 454 reads Ref. [ | cd-hit v4.5.7 ‘-n 10 -M 0 -c 0.97’ | 47.9/7.5 | 24 842 |
|  | Uclust v5 ‘-id 0.97’ | 4.3 | 29 586 |
|  | DNACLUST ‘-s 0.97’ | 15.3 | 31 151 |
(a) NR and Swissprot were downloaded from NCBI at ftp://ftp.ncbi.nih.gov/blast/db/FASTA/. Illumina reads from SRR061270 were downloaded from NCBI at http://www.ncbi.nlm.nih.gov/sra. The 16s rRNAs were kindly provided by the authors of Ref. [44]. (b) ‘-c 0.9’, ‘-id 0.9’ and ‘-s 0.9’ mean 90% identity. However, DNACLUST’s definition is slightly different from that of CD-HIT and Uclust (Ref. [40]). (c) Where two times are given, the second number is the time for eight cores; currently, only CD-HIT has a multi-threading function. (d) The free 32-bit version of Uclust cannot process NR, so only CD-HIT is used.
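The parameter strings in the table can be turned into full command lines. The sketch below only builds an argument list and does not run anything. The `-c`, `-n` and `-M` options come from the table, and `-T` (threads) from the parameter strings quoted elsewhere in this record. `-i` and `-o` are CD-HIT's input and output options and are not shown in the table. The file names and the helper `cdhit_command` are hypothetical.

```python
from typing import List


def cdhit_command(program: str, fasta_in: str, out_prefix: str,
                  identity: float, word_size: int,
                  memory_mb: int = 0, threads: int = 1) -> List[str]:
    """Build (but do not execute) an argv list for a CD-HIT-style run."""
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be a fraction in (0, 1], e.g. 0.9 for 90%")
    return [
        program,
        "-i", fasta_in,         # input FASTA file (hypothetical name below)
        "-o", out_prefix,       # output prefix (hypothetical name below)
        "-c", str(identity),    # identity cutoff, as in the table
        "-n", str(word_size),   # word size, as in the table
        "-M", str(memory_mb),   # memory setting; the table uses 0
        "-T", str(threads),     # number of threads
    ]


# Swissprot at 90% identity on eight cores, mirroring one row of the table.
cmd = cdhit_command("cd-hit", "swissprot.fasta", "swissprot_90",
                    identity=0.9, word_size=5, threads=8)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the job, assuming the program is installed and on the `PATH`.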
Accuracy and speed of OTU identification
| Data | True OTUs | CD-HIT-OTU: predicted OTUs | CD-HIT-OTU: sensitivity (%) | CD-HIT-OTU: specificity (%) | CD-HIT-OTU: CPU time | AmpliconNoise: predicted OTUs | AmpliconNoise: sensitivity (%) | AmpliconNoise: specificity (%) | AmpliconNoise: CPU time | Denoiser: predicted OTUs | Denoiser: sensitivity (%) | Denoiser: specificity (%) | Denoiser: CPU time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Divergent | 23 | 26 | 100 | 88 | 11 s | 28 | 100 | 82 | 32 h | 35 | 100 | 65 | 15 m |
| Artificial | 33 | 32 | 100 | 100 | 13 s | 34 | 96 | 91 | 22 h | 38 | 96 | 81 | 13 m |
| Even1 | 53 | 71 | 100 | 74 | 8 s | 85 | 100 | 62 | 68 h | NA | NA | NA | NA |
| Even2 | 53 | 57 | 96 | 89 | 7 s | 83 | 100 | 63 | 49 h | NA | NA | NA | NA |
| Even3 | 52 | 60 | 100 | 86 | 7 s | 90 | 100 | 57 | 65 h | NA | NA | NA | NA |
| Uneven1 | 49 | 56 | 91 | 80 | 5 s | 76 | 97 | 63 | 39 h | NA | NA | NA | NA |
| Uneven2 | 41 | 45 | 85 | 77 | 7 s | 67 | 95 | 58 | 35 h | NA | NA | NA | NA |
| Uneven3 | 38 | 42 | 100 | 90 | 7 s | 73 | 97 | 50 | 44 h | NA | NA | NA | NA |
| Titanium | 69 | 69 | 98 | 98 | 7 s | 90 | 100 | 76 | 388 h | 146 | 100 | 47 | 6 h |
(a) All data sets were downloaded from http://people.civil.gla.ac.uk/~quince/Data/AmpliconNoise.html, as described in Ref. [56]. (b) Parameters are based on each program’s default settings. (c) True OTUs were calculated by clustering the reference sequences that are covered by the raw reads. (d) Flowgram data are only available in an AmpliconNoise-specific format, so we could run AmpliconNoise but not Denoiser. However, Denoiser’s performance on these data sets is reported in Ref. [56].
OTU analysis for pooled human gut and human body samples
| Data set | Reads | Region | Platform | OTUs | CPU (s) |
|---|---|---|---|---|---|
| Human_gut | 817 942 | V6 | GS 20 | 317 | 37 |
| Human_body | 1 071 335 | V2 | GS FLX | 238 | 295 |
(a) The Human_gut data set was downloaded from http://gordonlab.wustl.edu/NatureTwins_2008/TurnbaughNature_11_30_08.html. The Human_body data set was kindly provided by the authors of Ref. [44].
Figure 1: Distribution of microbial diversity measured by NATs (NAT20, NAT50, NAT80 and NAT99) for 33 human gut samples. The x-axis is the NAT category and the y-axis is the NAT value. Samples are colored by sample type (obese, overweight or lean). The results show that obese samples have a lower average NAT50 than lean samples.
Figure 2: Assembly performance of the filtered reads for metagenomic sample MH0006. The x-axis is the redundancy cutoff N. The length of the longest contig (kb) and the N50 (kb) are plotted against the left y-axis; the accuracy and genome coverage are plotted against the right y-axis. The assembly results for the original reads are at the far right, marked as ‘ALL’ on the x-axis. The accuracy of contigs is the total length of correct contigs divided by the total length of all contigs. The genome coverage is the fraction of the reference genome covered by the correct contigs.
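Two of the quantities plotted in Figure 2 can be illustrated with a short sketch. The accuracy follows the definition given in the caption. For N50 the caption gives no definition, so the code assumes the usual one: the contig length at which the cumulative length of contigs, taken longest first, reaches half of the total. The contig lengths are made-up toy values, not data from sample MH0006.

```python
from typing import Iterable, Sequence


def n50(lengths: Iterable[int]) -> int:
    """Length of the contig at which the running total (longest first)
    first reaches half of the total assembled length; 0 if no contigs."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def contig_accuracy(correct_lengths: Sequence[int],
                    all_lengths: Sequence[int]) -> float:
    """Total length of correct contigs divided by total length of all contigs."""
    total = sum(all_lengths)
    return sum(correct_lengths) / total if total else 0.0


# Toy contig lengths in kb (hypothetical values for illustration only).
all_contigs = [80, 70, 50, 40, 30, 20]
correct_contigs = [80, 70, 50, 30]

print("N50 (kb):", n50(all_contigs))
print("accuracy:", round(contig_accuracy(correct_contigs, all_contigs), 3))
```

Genome coverage is left out because it requires alignment coordinates of the correct contigs on the reference genome, which this toy input does not model.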
Figure 3: Using NR query and NR reference database for metagenome annotation.
Clustering results of reference databases by the CD-HIT package
| Data set | Number of sequences | Total size | Identity cutoff (%) | Clusters | Reduced to (%) | Time (minutes) |
|---|---|---|---|---|---|---|
| NCBI NR | 12 054 819 | 4.3 GB | 90 | 7 036 029 | 58 | 181 |
| 16S (Silva + Greengenes) | 555 530 | 799 MB | 98 | 154 170 | 28 | 90 |
| NCBI microbial genomes | 3 355 | 6.4 GB | 90 | 1 279 | 38 | 389 |
| NCBI virus sequences | 1 042 347 | 1.3 GB | 95 | 288 701 | 28 | 480 |
(a) NCBI NR was downloaded from NCBI at ftp://ftp.ncbi.nih.gov/blast/db/FASTA/. 16S sequences from Silva and Greengenes were downloaded from http://www.arb-silva.de/download/archive/ and http://greengenes.lbl.gov/Download/Sequence_Data/, respectively. NCBI microbial genomes were downloaded from ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ (file: all.fna.tar.gz). NCBI virus sequences were kindly provided by the CAMERA project (Ref. 42). (b) Parameters for NR and 16S rRNA are ‘-c 0.9 -n 5 -g 1 -M 0 -T 0’ and ‘-c 0.98 -n 11 -b 5 -M 0 -T 0 -G 1’, respectively. NCBI microbial genomes and virus sequences were clustered by a beta version of CD-HIT that can process very long sequences, with parameters ‘-c 0.9’ and ‘-c 0.95’, respectively. (c) Time on a computer with eight cores.
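The ‘Reduced to (%)’ column appears to be the number of clusters as a percentage of the number of input sequences. That reading is an assumption, since the table does not define the column, but the short check below reproduces the tabulated values from the sequence and cluster counts in the table. The function name `reduced_to_percent` is hypothetical.

```python
def reduced_to_percent(n_sequences: int, n_clusters: int) -> float:
    """Clusters as a percentage of input sequences (assumed column definition)."""
    return 100.0 * n_clusters / n_sequences


# (number of sequences, number of clusters) copied from the table above.
rows = {
    "NCBI NR": (12_054_819, 7_036_029),
    "16S (Silva + Greengene)": (555_530, 154_170),
    "NCBI microbial genomes": (3_355, 1_279),
    "NCBI virus sequences": (1_042_347, 288_701),
}

for name, (n_sequences, n_clusters) in rows.items():
    print(f"{name}: {reduced_to_percent(n_sequences, n_clusters):.0f}%")
```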
Figure 4: Distribution of GOS and MetaHIT protein clusters. The x-axis is the cluster size X. The y-axis in the left panels is the number of clusters of size at least X; the y-axis in the right panels is the percentage of total sequences included in clusters of size at least X. Panels (A) and (B) are for all GOS and MetaHIT sequences. Panels (C) and (D) are for MetaHIT sequences only, grouped into Known and Novel clusters. In addition, two separate lines are drawn for NR sequences (i.e. the 3 076 514 representative sequences clustered at 90% identity).
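The two cumulative quantities on the y-axes of Figure 4 can be computed from a list of cluster sizes. The sketch below is a minimal illustration with made-up cluster sizes, not the GOS or MetaHIT data. The function name `cumulative_cluster_stats` is hypothetical, and only cluster sizes that actually occur are reported.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def cumulative_cluster_stats(cluster_sizes: Iterable[int]) -> Dict[int, Tuple[int, float]]:
    """Map each observed cluster size X to
    (number of clusters of size >= X, percent of sequences in those clusters)."""
    counts = Counter(cluster_sizes)
    total_sequences = sum(size * n for size, n in counts.items())
    stats: Dict[int, Tuple[int, float]] = {}
    n_clusters = 0
    n_sequences = 0
    # Walk from the largest size down so the running totals cover "size >= X".
    for size in sorted(counts, reverse=True):
        n_clusters += counts[size]
        n_sequences += size * counts[size]
        stats[size] = (n_clusters, 100.0 * n_sequences / total_sequences)
    return stats


# Toy cluster sizes (hypothetical): three singletons, two pairs, one of 5, one of 10.
sizes = [1, 1, 1, 2, 2, 5, 10]
for x, (n, pct) in sorted(cumulative_cluster_stats(sizes).items()):
    print(f"size >= {x}: {n} clusters, {pct:.1f}% of sequences")
```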
A list of clustering tools for metagenomic sequence analysis used in this study
| Tool and reference | Description | Key parameters |
|---|---|---|
| CD-HIT [ | Cluster protein sequences | -c identity cutoff; -n word size |
| CD-HIT-EST [ | Cluster nucleotide sequences | -c identity cutoff; -n word size |
| Uclust [ | Cluster protein or nucleotide sequences | -id identity cutoff; –w word size |
| SEED [ | Cluster highly similar Illumina reads (up to 3 mismatches and overhanging bases) | –mismatch allowed mismatches |
| DNACLUST [ | Cluster highly similar DNA sequences (e.g. 16S rRNAs) | -s similarity cutoff; -k word size |
| CD-HIT-454 [ | Identify duplicates for 454 reads | -c identity cutoff |
| CD-HIT-DUP | Identify duplicates for single or pair-ended Illumina reads | -e allowed mismatches |
| CD-HIT-LAP | Identify overlapping Illumina reads | -m overlapping length; -p overlapping coverage |
| PSI-CD-HIT [ | Cluster proteins at low identity cutoff (20–50%) | -c identity cutoff; -ce expect value cutoff |
| CD-HIT-OTU | Identify operational taxonomic units (OTUs) from rRNAs | Identity cutoff |
| AmpliconNoise [ | Cluster flowgram data to remove noise from reads for OTU clustering | Identity cutoff |
| Denoiser [ | Cluster flowgram data to remove noise from reads for OTU clustering | Identity cutoff |
| Cluster-based filtering | Filter sequence errors for improved sequence assembly | See CD-HIT-EST |
| Protein family clustering [ | Identify protein families from metagenomic sequences | See CD-HIT and PSI-CD-HIT |
(a) CD-HIT-OTU, AmpliconNoise and Denoiser involve multiple steps with many parameters, which usually do not need to be modified.