Ergude Bao, Tao Jiang, Isgouhi Kaloshian, Thomas Girke.
Abstract
MOTIVATION: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.
Year: 2011 PMID: 21810899 PMCID: PMC3167058 DOI: 10.1093/bioinformatics/btr447
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Table 1. Clustering with different methods
| Method | No. of clusters | No. of clusters identical to true ones | Jaccard index | Time (hh:mm:ss) | Memory (GB) |
|---|---|---|---|---|---|
| SRR038848 (4 962 666 reads aligned) | | | | | |
| True | 1 106 780 | | | | |
| SEED | 973 627 | 632 209 | 0.96 | 00:06:12 | 2.3 |
| UCLUST | 977 904 | 618 101 | 0.92 | 01:28:54 | 0.4 |
| UCLUSTo | 976 871 | 622 028 | 0.92 | 01:44:25 | 0.4 |
| SSAKE | 1 431 122 | 650 596 | 0.86 | 00:20:09 | 3.0 |
| SRR038849 (2 435 754 reads aligned) | | | | | |
| True | 973 673 | | | | |
| SEED | 880 920 | 512 270 | 0.97 | 00:04:02 | 2.2 |
| UCLUST | 873 784 | 500 982 | 0.94 | 00:36:23 | 0.4 |
| UCLUSTo | 873 135 | 502 654 | 0.94 | 00:42:43 | 0.4 |
| SSAKE | 1 070 654 | 515 574 | 0.91 | 00:13:56 | 2.3 |
| SRR038850 (5 386 160 reads aligned) | | | | | |
| True | 3 365 685 | | | | |
| SEED | 3 151 149 | 664 359 | 0.95 | 00:09:47 | 2.8 |
| UCLUST | 3 086 836 | 669 243 | 0.88 | 04:13:09 | 1.4 |
| UCLUSTo | 3 084 657 | 674 211 | 0.88 | 07:12:52 | 1.4 |
| SSAKE | 3 814 607 | 599 858 | 0.86 | 00:51:38 | 6.9 |
| SRR038851 (3 148 061 reads aligned) | | | | | |
| True | 2 182 354 | | | | |
| SEED | 2 038 577 | 287 903 | 0.94 | 00:06:28 | 2.5 |
| UCLUST | 2 096 534 | 297 756 | 0.84 | 01:37:47 | 0.9 |
| UCLUSTo | 2 094 080 | 300 539 | 0.85 | 01:45:00 | 0.9 |
| SSAKE | 2 540 359 | 214 013 | 0.77 | 00:34:10 | 4.6 |
Clustering results for four ChIP-Seq samples are shown for the true clusters (alignment-based method), SEED, SSAKE, and UCLUST without and with its optimal mode (UCLUST and UCLUSTo, respectively). The ‘true’ clusters served as the reference for computing the Jaccard index in the fourth column.
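The legend does not spell out how the Jaccard index is computed. A minimal sketch, assuming the common pair-counting definition for comparing two clusterings (pairs of reads co-clustered in both, divided by pairs co-clustered in either), could look as follows. The function names and the toy read labels are illustrative and not taken from the paper.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Return the set of unordered read pairs that share a cluster label.

    `labels` maps a read ID to its cluster ID (hypothetical input format).
    """
    clusters = {}
    for read_id, cluster_id in labels.items():
        clusters.setdefault(cluster_id, []).append(read_id)
    pairs = set()
    for members in clusters.values():
        # Sorting makes each pair canonical, so (a, b) and (b, a) coincide.
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard_index(true_labels, test_labels):
    """Pair-counting Jaccard index between a reference and a test clustering."""
    true_pairs = co_clustered_pairs(true_labels)
    test_pairs = co_clustered_pairs(test_labels)
    union = true_pairs | test_pairs
    if not union:
        # Both clusterings consist of singletons only: treat as identical.
        return 1.0
    return len(true_pairs & test_pairs) / len(union)


# Toy example: four reads, reference versus test clustering.
reference = {"r1": "A", "r2": "A", "r3": "A", "r4": "B"}
candidate = {"r1": "X", "r2": "X", "r3": "Y", "r4": "Y"}
score = jaccard_index(reference, candidate)
```

Enumerating all pairs is quadratic in cluster size, so this form is only meant for small illustrations; data sets with millions of reads would need pair counts derived from a contingency table instead.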
Table 2. Assembly tests
| Preprocessing | No. of sequences to assemble (read length) | No. of contigs | N50 (bp) | Mean length of contigs (bp) | Memory for assembly (GB) | Time for assembly (hh:mm:ss) | Memory for clustering (GB) | Time for clustering (hh:mm:ss) |
|---|---|---|---|---|---|---|---|---|
| Genome assembly | | | | | | | | |
| None | 51 448 694 (36 bp) | 2230 | 5143 | 2039 | 9.7 | 07:53:54 | – | – |
| SEED | 10 644 813 (36 bp) | 1918 | 6504 | 2382 | 5.7 | 01:11:59 | 4.1 | 03:41:29 |
| Random sampling | 10 644 813 (36 bp) | 2924 | 3855 | 1531 | 2.5 | 01:12:25 | – | – |
| Transcriptome assembly | | | | | | | | |
| None | 72 295 211 (37 bp) | 21 014 | 452 | 338 | 28 | 15:08:36 | – | – |
| SEED | 29 841 222 (37 bp) | 12 988 | 507 | 391 | 22 | 05:59:33 | 8.7 | 04:09:51 |
| Random sampling | 29 841 222 (37 bp) | 12 868 | 396 | 315 | 12 | 05:57:09 | – | – |
The assembly results with Velvet/Oases are shown for the genome resequencing data set from Rhodobacter sphaeroides (upper panel) and the transcriptome RNA-Seq data set from Arabidopsis thaliana (lower panel). The table compares, row by row, the results for the following preprocessing steps applied to the raw sequences: no preprocessing, preprocessing with SEED, and random sampling of the same number of reads as obtained with SEED. The parameters used for SEED were ≤3 mismatches, ≤3 overhanging ends and QV mode disabled. The corresponding contig size distributions for the genome assembly are given in Figure 1.
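To make the two SEED parameters in this legend more concrete, the sketch below shows one possible reading of them: a read is compared with a cluster's center sequence without gaps, at every offset of up to three bases in either direction, and is accepted if some offset leaves at most three mismatches in the overlapping region. This is an illustrative assumption about what the thresholds mean, not SEED's actual implementation; the function name `is_similar` and its arguments are hypothetical.

```python
def is_similar(center, read, max_mismatches=3, max_overhang=3):
    """Gap-free comparison of `read` against `center` (illustrative only).

    Tries every shift in [-max_overhang, max_overhang]; at a given shift,
    read[i] is compared with center[i + shift]. Bases outside the overlap
    are the overhanging ends and are ignored. Intended for reads much
    longer than `max_overhang`.
    """
    for shift in range(-max_overhang, max_overhang + 1):
        start = max(0, -shift)                      # first read index in overlap
        end = min(len(read), len(center) - shift)   # one past the last index
        if end <= start:
            continue  # no overlap at this shift
        mismatches = 0
        for i in range(start, end):
            if read[i] != center[i + shift]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # this shift cannot succeed
        if mismatches <= max_mismatches:
            return True
    return False


same = is_similar("ACGTACGTAC", "ACGTTCGTAC")         # one substitution
shifted = is_similar("ACGTACGTACGT", "CGTACGTACGTA")  # offset by one base
different = is_similar("AAAAAAAAAA", "CCCCCCCCCC")    # nothing in common
```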
Fig. 1. Cumulative contig sizes of genome assemblies. The plot compares the cumulative contig size distribution of the Velvet assembly results presented in the upper panel of Table 2 (for details see table legend). In this plot, the N50 value is the contig size (Y-axis) at 50% of the assembly coverage (X-axis).
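The N50 reading described in this caption corresponds to the standard definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the total assembly length. A short sketch of that calculation follows; the contig lengths are a made-up toy example, not data from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        # The first contig that brings the running total to half of the
        # assembly length defines N50.
        if running_total >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy assembly: total length 54, so half is 27; 10 + 9 + 8 = 27.
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```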
Table 3. miRNA profiling with SEED
| Samples | No. of sequences | No. of clusters (size >10) | miRNAs identified, % (all samples: 96%) | PCC |
|---|---|---|---|---|
| SRR032112 (Root−Pi) | 5 142 120 | 37 315 | 76.1 | 0.91 |
| SRR032113 (Root+Pi) | 4 919 514 | 38 193 | 83.3 | 0.89 |
| SRR032114 (Shoot−Pi) | 4 862 947 | 46 776 | 89.4 | 0.82 |
| SRR032115 (Shoot+Pi) | 5 003 481 | 43 176 | 86.6 | 0.87 |
For the four small RNA samples from Hsieh et al., the table gives the number of sequences in each data set, the number of clusters with ≥10 members obtained by SEED, the percentage of miRNAs covered by these clusters, and the Pearson correlation coefficients (PCCs) between the published read counts and those obtained by SEED.
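For readers unfamiliar with the PCC column, the sketch below computes a Pearson correlation coefficient between two vectors of per-miRNA read counts. The count values are invented for illustration, and the legend does not say whether the authors transformed the counts (for example, by taking logarithms) before correlating them.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical read counts for the same five miRNAs from two sources.
published_counts = [120, 45, 980, 10, 300]
cluster_counts = [110, 50, 1005, 8, 280]
pcc = pearson(published_counts, cluster_counts)
```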