Marvin Mundry, Erich Bornberg-Bauer, Michael Sammeth, Philine G D Feulner.
Abstract
BACKGROUND: The quantity of transcriptome data is rapidly increasing for non-model organisms. As sequencing technology advances, focus shifts towards solving bioinformatic challenges, of which sequence read assembly is the first task. Recent studies have compared the performance of different software to establish a best practice for transcriptome assembly. Here, we adapted a simulation approach to evaluate specific features of assembly programs on 454 data. The novelty of our study is that the simulation allows us to calculate a model assembly as reference point for comparison.
Year: 2012 PMID: 22384018 PMCID: PMC3288049 DOI: 10.1371/journal.pone.0031410
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Assembler software recently used for de novo assembly of 454 transcriptome data.§
| Assembler | Organism |
*Utilising a wrapper TGICL [33] or est2assembly [35].
§For more studies refer to Table 1 in [1].
Figure 1. Workflow and comparison scheme for assembler evaluation.
Workflows are shown in grey, comparisons between data sets in black. To evaluate the performance of different assemblers three comparisons were performed: 1) Different assemblies of simulated reads were compared to a Model Assembly (MA), which was based on positional information. 2) Different assemblies of simulated reads were compared to a transcriptome annotation. The MA was compared in the same way to provide reference values for the evaluated measurements. 3) Different assemblies of real reads were compared to the transcriptome annotation to compare the simulation approach to values from a real data set.
Basic assembly metrics (simulated 454 reads).
| Metric | CAP3 | MIRA | Newbler | Oases | MA |
| --- | --- | --- | --- | --- | --- |
| Number of contigs | 45'422 | 40'129 | 9'774 | 11'355 | 24'993 |
| Total bases | 19'147'862 | 22'855'498 | 12'764'265 | 7'937'884 | 18'152'459 |
| Number of contigs (≥ 1 kbp) | 606 | 3'683 | 3'938 | 2'138 | 4'337 |
| Total bases (in contigs ≥ 1 kbp) | 779'806 | 6'626'729 | 9'614'255 | 4'686'216 | 9'935'980 |
| Max contig length | 13'981 | 17'958 | 17'915 | 17'906 | 17'958 |
| Mean contig length | 421 | 569 | 1'305 | 699 | 726 |
| Median contig length | 376 | 427 | 797 | 331 | 330 |
| N50 | 425 | 602 | 2'128 | 1'351 | 1'214 |
| Time taken | 341 min | 859 min | 34 min | 10 min | N/A |
*Only contigs >100 bp.
**Summed time for velveth, velvetg, and Oases.
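The length-based rows of this table can be recomputed from a plain list of contig lengths. The following is a minimal sketch, not the authors' script: it assumes the common definition of N50 (the contig length at which the running sum of lengths, taken from longest to shortest, first reaches half of all bases), and the function names `n50` and `basic_metrics` as well as the toy lengths are hypothetical.

```python
from statistics import mean, median


def n50(lengths):
    """Length of the contig at which the running sum of lengths,
    taken from longest to shortest, first reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


def basic_metrics(lengths, min_len=100, long_cutoff=1000):
    """Summarise contig lengths; min_len and long_cutoff mirror the
    >100 bp filter and the >= 1 kbp rows of the table."""
    kept = [n for n in lengths if n > min_len]
    long_contigs = [n for n in kept if n >= long_cutoff]
    return {
        "contigs": len(kept),
        "total_bases": sum(kept),
        "contigs_1kbp": len(long_contigs),
        "bases_1kbp": sum(long_contigs),
        "max": max(kept),
        "mean": mean(kept),
        "median": median(kept),
        "n50": n50(kept),
    }


# Toy lengths, invented for illustration only (not data from the study).
toy = [2000, 1500, 1000, 500, 300, 90]
metrics = basic_metrics(toy)
```

Keeping the >100 bp filter inside `basic_metrics` means every derived value is computed on the same filtered set, which is what makes mean, median, and N50 comparable across assemblers.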
Basic assembly metrics (real 454 reads).
| Metric | CAP3 | MIRA | Newbler | Oases |
| --- | --- | --- | --- | --- |
| Number of contigs | 50'381 | 76'126 | 14'633 | 16'862 |
| Total bases | 22'062'745 | 31'495'153 | 11'728'579 | 9'020'336 |
| Number of contigs (≥ 1 kbp) | 2'106 | 2'964 | 3'365 | 2'261 |
| Total bases (in contigs ≥ 1 kbp) | 2'963'339 | 4'188'919 | 6'007'896 | 3'890'312 |
| Max contig length | 4'859 | 3'958 | 8'611 | 8'461 |
| Mean contig length | 437 | 413 | 801 | 534 |
| Median contig length | 364 | 337 | 565 | 300 |
| N50 | 458 | 456 | 1'025 | 837 |
| Time taken | 1731 min | 816 min | 790 min | 8 min |
*Only contigs >100 bp.
**Summed time for velveth, velvetg, and Oases.
Figure 2. Cumulative contig lengths for different assemblies.
Counts of contigs longer than 200, 400, 800, and 1000 base pairs for the different assemblies. Assemblies of simulated (top) and real 454 reads (bottom) are shown in separate diagrams.
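The counts plotted in this figure amount to a threshold tally over contig lengths. A small sketch of that tally follows, assuming "longer than" means strictly greater than the threshold; the helper name `counts_longer_than` and the example lengths are hypothetical.

```python
def counts_longer_than(lengths, thresholds=(200, 400, 800, 1000)):
    """For each threshold, count the contigs strictly longer than it."""
    return {t: sum(1 for n in lengths if n > t) for t in thresholds}


# Example lengths, invented for illustration only.
example_lengths = [150, 250, 450, 900, 1000, 1200]
counts = counts_longer_than(example_lengths)
```

Because each threshold is counted independently, a contig of 1'200 bp contributes to all four bars, which is why the counts can only decrease as the threshold grows.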
Comparison between simulated 454 read assemblies and transcriptome.
| Metric | CAP3 | MIRA | Newbler | Oases | MA |
| --- | --- | --- | --- | --- | --- |
| Specificity absolute | 42'697/45'422 | 37'587/40'129 | 8'932/9'774 | 10'398/11'355 | 23'737/24'993 |
| Specificity relative | 94.00% | 93.67% | 91.39% | 91.57% | 94.97% |
| Sensitivity absolute | 3'140/146'962 | 14'920/146'962 | 18'723/146'962 | 9'379/146'962 | 23'985/146'962 |
| Sensitivity relative | 2.14% | 10.15% | 12.74% | 6.38% | 16.32% |
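The relative rows are the absolute rows expressed as percentages. The sketch below recomputes them from the counts in the table above; the helper name `relative` is hypothetical, and the code treats each "absolute" entry simply as a numerator/denominator pair without assuming how the matches themselves were determined.

```python
def relative(hits, total):
    """Format hits/total as a percentage with two decimals."""
    return f"{100 * hits / total:.2f}%"


# (numerator, denominator) pairs copied from the table above.
specificity = {
    "CAP3": (42697, 45422),
    "MIRA": (37587, 40129),
    "Newbler": (8932, 9774),
    "Oases": (10398, 11355),
    "MA": (23737, 24993),
}
sensitivity = {
    "CAP3": (3140, 146962),
    "MIRA": (14920, 146962),
    "Newbler": (18723, 146962),
    "Oases": (9379, 146962),
    "MA": (23985, 146962),
}

for name in specificity:
    print(name, relative(*specificity[name]), relative(*sensitivity[name]))
```

Note the different denominators: specificity is relative to each assembly's own contig count, while sensitivity is relative to the same 146'962 reference entries for every assembler, so only the sensitivity values are directly comparable as absolute coverage.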
Comparison between real 454 read assemblies and transcriptome.
| Metric | CAP3 | MIRA | Newbler | Oases |
| --- | --- | --- | --- | --- |
| Specificity absolute | 30'256/50'381 | 37'376/76'126 | 11'505/14'633 | 12'065/16'862 |
| Specificity relative | 60.05% | 49.10% | 78.62% | 71.55% |
| Sensitivity absolute | 10'487/146'962 | 11'209/146'962 | 11'857/146'962 | 9'543/146'962 |
| Sensitivity relative | 7.14% | 7.63% | 8.07% | 6.49% |
Comparison between simulated 454 read assemblies and model assembly.
| Metric | CAP3 | MIRA | Newbler | Oases |
| --- | --- | --- | --- | --- |
| Specificity absolute | 43'312/45'422 | 37'930/40'129 | 8'856/9'774 | 10'582/11'355 |
| Specificity relative | 95.35% | 94.52% | 90.61% | 93.19% |
| Sensitivity absolute | 4'202/24'993 | 10'329/24'993 | 8'530/24'993 | 3'671/24'993 |
| Sensitivity relative | 16.81% | 41.33% | 34.13% | 14.69% |
Alignment ambiguity between simulated reads and assembled contigs.
| Metric | CAP3 | MIRA | Newbler | Oases | MA |
| --- | --- | --- | --- | --- | --- |
| Contigs hit | 45'410/45'422 | 40'108/40'129 | 9'771/9'774 | 11'342/11'355 | 24'983/24'993 |
| Reads mapped (out of 800'000) | 708'344 | 786'490 | 709'680 | 689'079 | 798'768 |
| Reads mapped to multiple contigs | 609'429 | 611'832 | 202'681 | 223'616 | 294'738 |
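To compare ambiguity across assemblers it can help to normalise these read counts. The sketch below derives two percentages per assembly from the table above: the share of the 800'000 simulated reads that mapped, and the share of mapped reads that hit more than one contig. Using mapped reads as the denominator for the second value is an assumption made here for illustration, not something the table states, and the name `mapping_summary` is hypothetical.

```python
TOTAL_READS = 800_000  # number of simulated reads, from the table

# (reads mapped, reads mapped to multiple contigs), copied from the table.
mapping = {
    "CAP3": (708344, 609429),
    "MIRA": (786490, 611832),
    "Newbler": (709680, 202681),
    "Oases": (689079, 223616),
    "MA": (798768, 294738),
}


def mapping_summary(mapped, multi, total=TOTAL_READS):
    """Percent of reads mapped, and percent of mapped reads that are ambiguous."""
    return {
        "mapped_pct": 100 * mapped / total,
        # Assumption: ambiguity is expressed relative to mapped reads.
        "multi_pct_of_mapped": 100 * multi / mapped,
    }


summary = {name: mapping_summary(*counts) for name, counts in mapping.items()}
```

Under this normalisation, CAP3 and MIRA place most of their mapped reads on several contigs, while Newbler, Oases, and the MA do so for well under half of them.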
Evaluation of chimera formation.
| Metric | CAP3 | MIRA | Newbler |
| --- | --- | --- | --- |
| Non-chimeric contigs absolute | 38'429/45'422 | 34'558/40'129 | 6'138/9'957 |
| Non-chimeric contigs relative | 85% | 86% | 62% |
| Average proportion of misplaced reads AS | 5.27% | 4.61% | 11.70% |
| Average proportion of misplaced reads non-AS | 0.88% | 0.91% | 2.82% |
AS: Genes with alternative splicing.
Non-AS: Genes without alternative splicing.