Xianwen Ren, Tao Liu, Jie Dong, Lilian Sun, Jian Yang, Yafang Zhu, Qi Jin.
Abstract
Next generation sequencing (NGS) technologies have greatly changed the landscape of transcriptomic studies of non-model organisms. Because no reference genome is available for these organisms, de novo assembly methods play key roles in the analysis of such data sets. Given the huge amount of data generated by each NGS run, many assemblers, e.g., ABySS, Velvet and Trinity, have been developed on the basis of the de Bruijn graph, which is efficient in both time and space. However, most of these assemblers were initially developed for the Illumina/Solexa platform, and their performance on 454 transcriptomic data is unknown. In this study, we evaluated and compared the relative performance of these de Bruijn graph based assemblers on both simulated and real 454 transcriptomic data. The results suggest that Trinity, an assembler specialized for Illumina/Solexa transcriptomic data, performs the best among the de Bruijn graph assemblers, and is comparable to or even outperforms Newbler, the standard 454 assembler based on the overlap-layout-consensus algorithm. Our evaluation is expected to provide helpful guidance for researchers choosing an assembler for 454 transcriptomic data.
Year: 2012 PMID: 23236450 PMCID: PMC3517413 DOI: 10.1371/journal.pone.0051188
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The k-mer uniqueness of Saccharomyces cerevisiae transcripts.
For a fixed k-mer length, a cDNA sequence is k-mer-unique if it shares no k-mer with any other cDNA in the data set. The k-mer-uniqueness of a data set is the proportion of k-mer-unique cDNAs among all cDNAs in the set. In principle, k-mer-uniqueness increases with the k-mer length, and the larger the k-mer-uniqueness, the easier the de novo assembly task.
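The definition of k-mer-uniqueness above can be illustrated with a minimal Python sketch. The function name `kmer_uniqueness` and the toy sequences are hypothetical, and the sketch assumes that only the forward strand is considered and that sequences shorter than k (which contain no k-mer) count as unique; the source does not state how either case was handled.

```python
from collections import defaultdict

def kmer_uniqueness(cdnas, k):
    """Return the fraction of sequences that share no k-mer with any other sequence."""
    # Map each k-mer to the indices of the sequences that contain it.
    owners = defaultdict(set)
    for idx, seq in enumerate(cdnas):
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(idx)

    # A sequence is not k-mer-unique if any of its k-mers occurs in another sequence.
    shared = set()
    for ids in owners.values():
        if len(ids) > 1:
            shared.update(ids)

    if not cdnas:
        return 0.0
    return (len(cdnas) - len(shared)) / len(cdnas)

# Toy sequences (illustrative only): the first two overlap at k = 3 but not at k = 5.
toy = ["ACGTAC", "GTACGG", "TTTTTT"]
print(kmer_uniqueness(toy, 3))
print(kmer_uniqueness(toy, 5))
```

On this toy input the uniqueness rises from one third at k = 3 to 1.0 at k = 5, which mirrors the trend described in the caption.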
Figure 2. The number of useless reads in the three Saccharomyces cerevisiae data sets.
Data set 1 had a constant 30-fold coverage. Data sets 2 and 3 had varying coverage. Data sets 1 and 2 were generated with the default read length distribution of ART, whereas data set 3 was generated with a customized, longer read length distribution. A read is defined as "useless" here if it contains repeated k-mers or if all of its k-mers are isolated in the de Bruijn graph.
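One possible reading of the "useless read" definition is sketched below in Python. The source does not spell out what "isolated" means, so this sketch assumes that a k-mer is isolated when no other read in the data set contains it; the helper names `kmers` and `count_useless_reads` are hypothetical, and reads shorter than k are counted as useless because they contribute no k-mer.

```python
from collections import Counter

def kmers(seq, k):
    """List all k-mers of a sequence in order (empty if the sequence is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_useless_reads(reads, k):
    """Count reads with a repeated k-mer, or whose k-mers occur in no other read."""
    # support[kmer] = number of reads containing that k-mer (each read counted once).
    support = Counter()
    for read in reads:
        support.update(set(kmers(read, k)))

    useless = 0
    for read in reads:
        ks = kmers(read, k)
        has_repeat = len(ks) != len(set(ks))
        # Assumption: "isolated" means that no other read shares the k-mer.
        isolated = all(support[km] == 1 for km in ks)
        if has_repeat or isolated:
            useless += 1
    return useless

# Toy reads: the third has a repeated 3-mer, the fourth shares no 3-mer with the others.
print(count_useless_reads(["ACGTT", "CGTTA", "AAAAC", "GGCCA"], 3))
```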
Basic characteristics of various assemblies of Saccharomyces cerevisiae data set 1 with constant 30-fold coverage.
| | ABySS | Euler-sr | SOAPdenovo | SOAPdenovo-Trans | Velvet | Oases | Trinity | Newbler | MIRA |
| Longest Contig (bps) | 9254 | 14606 | 3587 | 11190 | 9312 | 14687 | 11218 | 14707 | 14717 |
| #Contigs ≥ 100 bps | 10445 | 6278 | 21110 | 8134 | 10534 | 6143 | 7262 | 6138 | 6559 |
| #Contigs ≥ 500 bps | 5568 | 4792 | 4795 | 5397 | 5902 | 5130 | 5451 | 4872 | 5008 |
| #Contigs ≥ 1 kbps | 2609 | 3247 | 942 | 3218 | 2805 | 3583 | 3608 | 3316 | 3407 |
| N50 (bps) | 1021 | 1804 | 349 | 1465 | 1125 | 1898 | 1715 | 1821 | 1808 |
| N90 (bps) | 164 | 686 | 45 | 487 | 353 | 798 | 639 | 714 | 691 |
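For readers unfamiliar with the N50 and N90 statistics reported in these tables, the following Python sketch shows the usual definition: the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches 50% or 90% of the total assembly length. The function name `nxx` is hypothetical, and the source does not state which contigs were included in the calculation.

```python
def nxx(lengths, percent):
    """Return the Nxx statistic (e.g. percent=50 for N50) of a list of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until `percent` of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        if running * 100 >= percent * total:
            return length
    return 0

# Toy contig lengths (illustrative only).
contigs = [100, 200, 300, 400]
print(nxx(contigs, 50))  # N50
print(nxx(contigs, 90))  # N90
```

Because N90 requires covering more of the assembly, it is never larger than N50 for the same set of contigs.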
Sensitivity and specificity of various assemblies of Saccharomyces cerevisiae data set 1 with constant 30-fold coverage.
| SeqCov | Index | ABySS | Euler-sr | SOAPdenovo | SOAPdenovo-Trans | Velvet | Oases | Trinity | Newbler | MIRA |
| 95% | Sensitivity | 0.050 | 0.450 | 0.067 | 0.448 | 0.230 | 0.539 | 0.738 | 0.560 | 0.811 |
| 95% | Specificity | 0.010 | 0.430 | 0.006 | 0.288 | 0.140 | 0.647 | 0.682 | 0.640 | 0.818 |
| 90% | Sensitivity | 0.170 | 0.640 | 0.098 | 0.600 | 0.310 | 0.658 | 0.783 | 0.700 | 0.851 |
| 90% | Specificity | 0.040 | 0.620 | 0.009 | 0.385 | 0.180 | 0.784 | 0.724 | 0.800 | 0.859 |
| 85% | Sensitivity | 0.290 | 0.720 | 0.112 | 0.647 | 0.360 | 0.705 | 0.800 | 0.720 | 0.865 |
| 85% | Specificity | 0.080 | 0.700 | 0.010 | 0.414 | 0.210 | 0.839 | 0.740 | 0.820 | 0.87 |
| 80% | Sensitivity | 0.380 | 0.760 | 0.128 | 0.681 | 0.390 | 0.732 | 0.816 | 0.740 | 0.875 |
| 80% | Specificity | 0.100 | 0.730 | 0.011 | 0.436 | 0.240 | 0.871 | 0.756 | 0.850 | 0.879 |
SeqCov: sequence coverage. When at least 95%, 90%, 85% or 80% of both the query sequence (contig) and the subject sequence (true transcript) were aligned by BLAST (version 2.2.22, with parameters '-e 1e-5 -F F'), the transcript was considered to be reconstructed by the respective contig, and the sensitivity and specificity were calculated at each threshold.
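The coverage criterion in this footnote can be sketched in Python as follows. The exact formulas are not given here, so the sketch assumes that sensitivity is the fraction of true transcripts reconstructed by at least one contig and that specificity is the fraction of contigs that reconstruct at least one transcript. It also assumes that each BLAST hit is a single alignment given as a tuple of contig ID, transcript ID and 1-based start/end coordinates on both sequences; the function name and input layout are hypothetical.

```python
def sensitivity_specificity(hits, contig_len, transcript_len, min_cov):
    """Compute (sensitivity, specificity) at a given sequence-coverage threshold.

    hits: iterable of (contig, transcript, qstart, qend, sstart, send), 1-based inclusive.
    contig_len / transcript_len: dicts mapping every ID to its sequence length.
    min_cov: required aligned fraction of BOTH sequences, e.g. 0.95.
    """
    good_contigs, good_transcripts = set(), set()
    for contig, transcript, qstart, qend, sstart, send in hits:
        # abs() handles alignments reported on the reverse strand.
        qcov = (abs(qend - qstart) + 1) / contig_len[contig]
        scov = (abs(send - sstart) + 1) / transcript_len[transcript]
        if qcov >= min_cov and scov >= min_cov:
            good_contigs.add(contig)
            good_transcripts.add(transcript)

    # Assumed definitions (not stated in the footnote):
    sensitivity = len(good_transcripts) / len(transcript_len)
    specificity = len(good_contigs) / len(contig_len)
    return sensitivity, specificity

# Toy data (illustrative only): c1 reconstructs t1; c2 covers t2 but only half of itself aligns.
contig_len = {"c1": 100, "c2": 200}
transcript_len = {"t1": 100, "t2": 150, "t3": 300, "t4": 50}
hits = [("c1", "t1", 1, 98, 3, 100), ("c2", "t2", 1, 100, 1, 150)]
print(sensitivity_specificity(hits, contig_len, transcript_len, 0.95))
```

Requiring coverage of both sequences is what keeps a short contig fragment, or a long contig that merely contains a transcript, from counting as a reconstruction.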
Basic characteristics of various assemblies of Saccharomyces cerevisiae data set 2 with varying coverage and short read length.
| | ABySS | Euler-sr | SOAPdenovo | SOAPdenovo-Trans | Velvet | Oases | Trinity | Newbler | MIRA |
| Longest Contig (bps) | 5836 | 9197 | 5584 | 5144 | 6689 | 9116 | 12251 | 9208 | 12255 |
| #Contigs ≥ 100 bps | 10592 | 8957 | 16641 | 10823 | 8249 | 6808 | 7963 | 6774 | 8380 |
| #Contigs ≥ 500 bps | 4083 | 4343 | 4831 | 4955 | 4130 | 4270 | 4836 | 4262 | 4696 |
| #Contigs ≥ 1 kbps | 1597 | 2324 | 1439 | 2114 | 1893 | 2484 | 2786 | 2419 | 2729 |
| N50 (bps) | 774 | 1310 | 532 | 943 | 1057 | 1521 | 1471 | 1477 | 1474 |
| N90 (bps) | 127 | 323 | 48 | 270 | 310 | 472 | 438 | 442 | 397 |
Sensitivity and specificity of various assemblies of Saccharomyces cerevisiae data set 2 with varying coverage and short read length.
| SeqCov | Index | ABySS | Euler-sr | SOAPdenovo | SOAPdenovo-Trans | Velvet | Oases | Trinity | Newbler | MIRA |
| 95% | Sensitivity | 0.000 | 0.220 | 0.100 | 0.147 | 0.150 | 0.301 | 0.498 | 0.300 | 0.544 |
| 95% | Specificity | 0.000 | 0.140 | 0.000 | 0.092 | 0.090 | 0.300 | 0.397 | 0.260 | 0.424 |
| 90% | Sensitivity | 0.100 | 0.380 | 0.150 | 0.273 | 0.230 | 0.423 | 0.616 | 0.490 | 0.653 |
| 90% | Specificity | 0.000 | 0.240 | 0.020 | 0.170 | 0.150 | 0.425 | 0.488 | 0.430 | 0.508 |
| 85% | Sensitivity | 0.140 | 0.470 | 0.200 | 0.340 | 0.300 | 0.485 | 0.650 | 0.570 | 0.687 |
| 85% | Specificity | 0.000 | 0.300 | 0.030 | 0.211 | 0.180 | 0.487 | 0.515 | 0.500 | 0.533 |
| 80% | Sensitivity | 0.200 | 0.530 | 0.240 | 0.384 | 0.350 | 0.527 | 0.676 | 0.610 | 0.714 |
| 80% | Specificity | 0.060 | 0.340 | 0.040 | 0.238 | 0.220 | 0.529 | 0.534 | 0.530 | 0.552 |
SeqCov: sequence coverage. When at least 95%, 90%, 85% or 80% of both the query sequence (contig) and the subject sequence (true transcript) were aligned by BLAST (version 2.2.22, with parameters '-e 1e-5 -F F'), the transcript was considered to be reconstructed by the respective contig, and the sensitivity and specificity were calculated at each threshold.
Basic characteristics of various assemblies of Saccharomyces cerevisiae data set 3 with varying coverage and long read length.
| | ABySS | Euler-sr | SOAPdenovo | SOAPdenovo-Trans | Velvet | Oases | Trinity | Newbler | MIRA |
| Longest Contig (bps) | 6020 | 12160 | 5629 | 6782 | 9055 | 12242 | 14654 | 14586 | 12240 |
| #Contigs ≥ 100 bps | 10693 | 8075 | 17135 | 10995 | 8411 | 6815 | 8043 | 6824 | 8443 |
| #Contigs ≥ 500 bps | 4160 | 3969 | 4840 | 5021 | 4222 | 4411 | 4852 | 4345 | 4704 |
| #Contigs ≥ 1 kbps | 1631 | 2144 | 1365 | 2118 | 1934 | 2562 | 2815 | 2497 | 2713 |
| N50 (bps) | 780 | 1317 | 508 | 927 | 1059 | 1530 | 1486 | 1476 | 1472 |
| N90 (bps) | 125 | 326 | 47 | 269 | 309 | 502 | 435 | 455 | 390 |
Sensitivity and specificity of various assemblies of Saccharomyces cerevisiae data set 3 with varying coverage and long read length.
| SeqCov | Index | ABySS | Euler-sr | SOAPdenovo | SOAPdenovo-Trans | Velvet | Oases | Trinity | Newbler | MIRA |
| 95% | Sensitivity | 0.020 | 0.218 | 0.071 | 0.152 | 0.151 | 0.307 | 0.508 | 0.342 | 0.531 |
| 95% | Specificity | 0.006 | 0.166 | 0.011 | 0.077 | 0.108 | 0.314 | 0.397 | 0.327 | 0.408 |
| 90% | Sensitivity | 0.075 | 0.356 | 0.142 | 0.273 | 0.229 | 0.438 | 0.618 | 0.526 | 0.641 |
| 90% | Specificity | 0.022 | 0.269 | 0.022 | 0.138 | 0.164 | 0.442 | 0.483 | 0.505 | 0.495 |
| 85% | Sensitivity | 0.145 | 0.445 | 0.188 | 0.338 | 0.283 | 0.498 | 0.653 | 0.592 | 0.683 |
| 85% | Specificity | 0.043 | 0.335 | 0.029 | 0.171 | 0.202 | 0.502 | 0.511 | 0.567 | 0.525 |
| 80% | Sensitivity | 0.212 | 0.501 | 0.222 | 0.382 | 0.326 | 0.536 | 0.677 | 0.624 | 0.706 |
| 80% | Specificity | 0.062 | 0.376 | 0.034 | 0.193 | 0.233 | 0.539 | 0.529 | 0.597 | 0.542 |
SeqCov: sequence coverage. When at least 95%, 90%, 85% or 80% of both the query sequence (contig) and the subject sequence (true transcript) were aligned by BLAST (version 2.2.22, with parameters '-e 1e-5 -F F'), the transcript was considered to be reconstructed by the respective contig, and the sensitivity and specificity were calculated at each threshold.
Chimeras identified in the different assemblies of Saccharomyces cerevisiae data sets 2 and 3.
| | ABySS | Euler-sr | SOAPdenovo | SOAPdenovo-Trans | Velvet | Oases | Trinity | Newbler | MIRA |
| #Contigs in data set 2 | 22106 | 9918 | 41017 | 13194 | 9191 | 6808 | 8373 | 6936 | 8512 |
| #Chimera in data set 2 | 908 | 690 | 1206 | 804 | 645 | 568 | 794 | 557 | 692 |
| #Contigs in data set 3 | 22707 | 8900 | 43845 | 13261 | 9365 | 6815 | 8471 | 6968 | 8568 |
| #Chimera in data set 3 | 1135 | 672 | 1485 | 858 | 741 | 604 | 784 | 583 | 672 |
A contig is considered a chimera if it has two subsequences, each at least 10% of the contig length, that align to two different true transcripts (BLASTN v2.2.22, parameters: -e 1e-5 -F F).
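The chimera criterion can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' pipeline: each BLASTN alignment is assumed to be given as a tuple of contig ID, transcript ID and 1-based start/end coordinates on the contig, the function name `find_chimeras` is hypothetical, and the sketch does not check whether the two aligned subsequences overlap on the contig.

```python
from collections import defaultdict

def find_chimeras(hits, contig_len, min_frac=0.10):
    """Return the set of contigs with qualifying alignments to two or more transcripts.

    hits: iterable of (contig, transcript, qstart, qend), 1-based inclusive on the contig.
    contig_len: dict mapping contig ID to contig length.
    min_frac: minimum aligned span as a fraction of the contig length.
    """
    targets = defaultdict(set)
    for contig, transcript, qstart, qend in hits:
        span = abs(qend - qstart) + 1
        # Keep only alignments covering at least min_frac of the contig.
        if span >= min_frac * contig_len[contig]:
            targets[contig].add(transcript)
    # A contig is flagged when its qualifying alignments hit different transcripts.
    return {c for c, ts in targets.items() if len(ts) >= 2}

# Toy data (illustrative only): c1 joins t1 and t2; the t3 hit on c2 is too short to count.
contig_len = {"c1": 1000, "c2": 1000}
hits = [
    ("c1", "t1", 1, 400),
    ("c1", "t2", 500, 900),
    ("c2", "t1", 1, 900),
    ("c2", "t3", 950, 1000),
]
print(find_chimeras(hits, contig_len))
```

A stricter variant could additionally require the two aligned spans to be disjoint on the contig, so that highly similar transcripts aligning to the same region are not flagged.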
Basic characteristics of various assemblies of real Trichophyton rubrum data.
| | ABySS | Euler-sr | SOAPdenovo | Velvet | Oases | Trinity | Newbler | MIRA |
| Longest Contig (bps) | 3986 | 6326 | 4349 | 2802 | 10856 | 17023 | 22223 | 7950 |
| #Contigs ≥ 100 bps | 34779 | 37100 | 65842 | 67119 | 15839 | 18800 | 13294 | 26674 |
| #Contigs ≥ 500 bps | 10534 | 13479 | 16166 | 9263 | 8215 | 15069 | 9652 | 22063 |
| #Contigs ≥ 1 kbps | 3139 | 3965 | 3699 | 1230 | 3561 | 9645 | 4858 | 8368 |
| N50 (bps) | 531 | 692 | 474 | 324 | 1071 | 2188 | 1233 | 1115 |
| N90 (bps) | 48 | 207 | 84 | 83 | 369 | 690 | 552 | 526 |
Because SOAPdenovo-Trans did not produce results, the corresponding indices are not available.