Abstract
BACKGROUND: Roche 454 pyrosequencing has become a method of choice for generating transcriptome data from non-model organisms. Once the tens to hundreds of thousands of short (250-450 base) reads have been produced, it is important to correctly assemble these to estimate the sequence of all the transcripts. Most transcriptome assembly projects use only one program for assembling 454 pyrosequencing reads, but there is no evidence that the programs used to date are optimal. We have carried out a systematic comparison of five assemblers (CAP3, MIRA, Newbler, SeqMan and CLC) to establish best practices for transcriptome assemblies, using a new dataset from the parasitic nematode Litomosoides sigmodontis.
Year: 2010 PMID: 20950480 PMCID: PMC3091720 DOI: 10.1186/1471-2164-11-571
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Assemblers previously used for de novo assembly of 454 pyrosequencing transcriptome projects
| Assembler | Organism |
|---|---|
| Newbler | |
| CAP3 | |
| MIRA | |
| TGICL | |
| SeqMan | |
| stackPACK | |
Features of assembly programmes compared in this study
| Assembler | Type† | Splits reads* | Author | Cost | Source available | URL |
|---|---|---|---|---|---|---|
| CAP3 | OLC† | No | X Huang and A Madan [ | Free for use at non-profit organizations | No | |
| CLC Assembly Cell 3.0 | de Bruijn graph | Yes | CLC | Request quote or trial license | No | |
| MIRA 3.0 | OLC | No | Bastien Chevreux [ | Free | Yes, GPL | |
| Newbler 2.3 and Newbler 2.5 | OLC | Yes | Roche 454 [ | Free for academic use | No | |
| SeqMan NGen 2.1 | OLC | No | DNAStar [ | Request quote or trial license | No | |
* i.e. data from one read can appear in multiple contigs
† OLC: Overlap-Layout-Consensus
The Litomosoides sigmodontis transcriptome dataset read statistics
| Sample | Technology | Number of reads | Number of raw bases | Number of trimmed reads | Number of trimmed bases | Mean length of trimmed reads | Median length of trimmed reads |
|---|---|---|---|---|---|---|---|
| Microfilaria (first stage larvae) | Titanium | 366,813 | 203,227,223 | 351,387 | 118,039,337 | 335.92 | 374 |
| Adult female | Standard | 180,271 | 48,434,306 | 176,454 | 38,352,888 | 217.35 | 236 |
| Adult male | Standard | 216,940 | 59,231,575 | 213,546 | 48,673,441 | 227.93 | 245 |
Basic assembly metrics
| | CAP3 | CLC | MIRA | Newbler 2.3 | Newbler 2.5 | SeqMan |
|---|---|---|---|---|---|---|
| Number of contigs† | 24727 | 22746 | 35827 | 12019 | 21734 | 29969 |
| Total Bases | 16733217 | 14875522 | 21339704 | 14456476 | 20066883 | 21355682 |
| Number of contigs (≥ 1 kbp) | 4403 | 4174 | 4770 | 6320 | 7661 | 6082 |
| Total Bases (in contigs ≥ 1 kbp) | 6461079 | 6255785 | 7027775 | 10810962 | 13691429 | 9296011 |
| Max contig length | 4011 | 4368 | 5784 | 5872 | 6228 | 6263 |
| Mean contig length | 677 | 654 | 596 | 1203 | 923 | 713 |
| N50 | 806 | 850 | 708 | 1487 | 1448 | 880 |
| Number of contigs in N50 | 6533 | 5459 | 9148 | 3406 | 4649 | 7555 |
| Reads used (SSAHA2) | 670425 | 679152 | 672036 | 616672 | 667597 | 681974 |
| Multi-hit reads (SSAHA2) | 271648 | 118334 | 392884 | 249210 | 352887 | 322409 |
| Reads used (CLC) | 690889 | 691818 | 696527 | 600132 | 681831 | 711726 |
| Multi-hit reads (CLC) | 91951 | 24485 | 162365 | 213670 | 262178 | 128631 |
| Time taken | 1 day* | 4 minutes* | 3 days* | 2 hours* | 45 minutes* | 6 hours** |
† only contigs > 100 bases were assessed
* on a dual quad-core 3 GHz Xeon workstation with 32 GB RAM
** on a dual core 2.53 GHz Mac mini server with 4 GB RAM
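The N50 and "Number of contigs in N50" rows above can be reproduced from a list of contig lengths. The following is a minimal sketch, assuming the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together hold at least half of the assembled bases) and the table's rule that only contigs longer than 100 bases are assessed. The function name `n50` and its parameters are illustrative and are not taken from the study.

```python
def n50(contig_lengths, min_length=100):
    """Return (N50, number of contigs in N50) for a list of contig lengths.

    Assumption: contigs of min_length bases or fewer are ignored, mirroring
    the table footnote "only contigs > 100 bases were assessed".
    """
    # Keep only contigs strictly longer than the cutoff, longest first.
    kept = sorted((n for n in contig_lengths if n > min_length), reverse=True)
    if not kept:
        raise ValueError("no contigs longer than min_length")

    half_total = sum(kept) / 2
    running = 0
    # Walk down from the longest contig until half of all bases are covered.
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= half_total:
            return length, count


if __name__ == "__main__":
    # Toy lengths, not data from the paper; the 50-base contig is filtered out.
    print(n50([1000, 800, 600, 400, 200, 50]))
```

On the toy input the retained contigs total 3000 bases, and the two longest (1000 and 800) already cover half of that, so the sketch reports an N50 of 800 reached with 2 contigs.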
Figure 1. Cumulative contig lengths generated by different assembly programs. For each of six assemblies, contigs longer than 100 bases were ordered by length, and the cumulative length of all contigs shorter than or equal to a given contig was plotted. The total length of the assembly and the number of contigs present in the assembly define the end point of each curve, while the initial slope of each curve reflects the proportion of longer contigs.
Figure 2. Novel sequence in pair-wise comparisons between assemblies produced by different assemblers. For each assembly, we calculated the number of bases in the other assemblies that were not present in the focal assembly.
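The curve described in the Figure 1 caption can be sketched as a sort followed by a running sum. This is an illustrative reconstruction from the caption only, not the authors' plotting code. The function name `cumulative_contig_lengths` is hypothetical, and the `longest_first` flag is an added assumption that allows the opposite sort order, since the caption does not spell out the plotting direction.

```python
from itertools import accumulate


def cumulative_contig_lengths(contig_lengths, min_length=100, longest_first=False):
    """Return (contig rank, cumulative bases) points for one assembly.

    By default contigs are sorted shortest first, so each point is the total
    length of all contigs shorter than or equal to the current one, as the
    caption words it. longest_first=True is an assumed alternative ordering.
    """
    # Only contigs strictly longer than min_length are considered.
    kept = sorted((n for n in contig_lengths if n > min_length),
                  reverse=longest_first)
    # Pair each contig's rank (1-based) with the running total of bases.
    return list(zip(range(1, len(kept) + 1), accumulate(kept)))


if __name__ == "__main__":
    # Toy lengths, not data from the paper.
    for rank, total in cumulative_contig_lengths([50, 200, 400, 1000]):
        print(rank, total)
```

Whichever ordering is used, the last point of each curve is (number of contigs, total assembled bases), which matches the caption's statement that these two quantities define the end point.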
BLAT hits to 1602 Litomosoides sigmodontis EST clusters
| CAP3 | CLC | MIRA | Newbler 2.3 | Newbler 2.5 | SeqMan | |
|---|---|---|---|---|---|---|
| 87.8 | 90.3 | 89.6 | 81.6* | 89.6 | 90.8 | |
| (78.2) | (80.0) | (80.1) | (71.3)* | (78.7) | (82.0) | |
| 59.9 | 51.9 | 62.5 | 59.4 | 63.9 | 65.4 | |
| (59.1) | (50.5)* | (61.1) | (59.4) | (63.7) | (64.3) | |
* indicates a value significantly lower than the others, using a Huber M-estimator.
BLASTX hits to 11,472 Brugia malayi proteins
| CAP3 | CLC | MIRA | Newbler 2.3 | Newbler 2.5 | SeqMan | |
|---|---|---|---|---|---|---|
| 76.7 | 78.4 | 77.3 | 68.9* | 77.9 | 78.6 | |
| (60.4) | (62.4) | (59.7) | (51.8)* | (61.5) | (63.0) | |
| 27.0 | 26.0 | 26.5 | 29.2 | 32.4 | 28.7 | |
| (16.8) | (16.1) | (16.4) | (19.9) | (22.3) | (18.0) | |
Note: E-value cutoff 1e-5
* indicates a value significantly lower than the others (p < 0.01), using a Huber M-estimator.
BLASTX hits to 3,681 tribes containing 120,926 conserved nematode proteins
| CAP3 | CLC | MIRA | Newbler 2.3 | Newbler 2.5 | SeqMan | |
|---|---|---|---|---|---|---|
| 91.7 | 92.2 | 91.4 | 87.0* | 92.0 | 92.4 | |
| 81.7 | 81.1 | 78.9 | 77.1* | 82.1 | 81.6 | |
Note: E-value cutoff 1e-5
* indicates a value significantly lower than the others (p < 0.01), using a Huber M-estimator.
BLASTX hits to 3,731 KOGs containing 9,782 C. elegans proteins
| CAP3 | CLC | MIRA | Newbler 2.3 | Newbler 2.5 | SeqMan | |
|---|---|---|---|---|---|---|
| 89.2 | 89.7 | 88.6 | 83.7* | 89.7 | 90.1 | |
| 30.2 | 27.1 | 28.6 | 33.0 | 35.7 | 30.4 | |
Note: E-value cutoff 1e-5
* indicates a value significantly lower than the others (p < 0.01), using a Huber M-estimator.
Secondary assemblies by merging pairs of initial assemblies using CAP3 with default settings
| Assembly 1 | Assembly 2 | Number of "Reads" (contigs) in Assembly 1 | Bases in Assembly 1 | Number of "Reads" (contigs) in Assembly 2 | Bases in Assembly 2 | Number of second-order contigs with "reads" from both assemblies | Bases in second-order contigs with "reads" from both assemblies |
|---|---|---|---|---|---|---|---|
| MIRA | SeqMan | 35827 | 21339704 | 29969 | 21355682 | 18068 | 16293192 |
| MIRA | Newbler 2.5 | 35827 | 21339704 | 21734 | 20066883 | 15951 | 15866051 |
| Newbler 2.5 | SeqMan | 21734 | 20066883 | 29969 | 21355682 | 15783 | 15701053 |
| CLC | Newbler 2.5 | 22746 | 14875522 | 21734 | 20066883 | 15778 | 15825663 |
| CAP3 | MIRA | 24727 | 16733217 | 35827 | 21339704 | 15688 | 14243534 |
| CLC | SeqMan | 22746 | 14875522 | 29969 | 21355682 | 15504 | 14679975 |
| CAP3 | SeqMan | 24727 | 16733217 | 29969 | 21355682 | 15387 | 14824287 |
| CLC | MIRA | 22746 | 14875522 | 35827 | 21339704 | 15334 | 14357031 |
| CAP3 | Newbler 2.5 | 24727 | 16733217 | 21734 | 20066883 | 14275 | 14830304 |
| CAP3 | CLC | 24727 | 16733217 | 22746 | 14875522 | 14149 | 13753398 |
| Newbler 2.3 | Newbler 2.5 | 12019 | 14456476 | 21734 | 20066883 | 9733 | 13252303 |
| MIRA | Newbler 2.3 | 35827 | 21339704 | 12019 | 14456476 | 9380 | 11731374 |
| CLC | Newbler 2.3 | 22746 | 14875522 | 12019 | 14456476 | 8884 | 12318589 |
| CAP3 | Newbler 2.3 | 24727 | 16733217 | 12019 | 14456476 | 8484 | 11426423 |
| Newbler 2.3 | SeqMan | 12019 | 14456476 | 29969 | 21355682 | 8274 | 11452990 |
Alignments to 1602 EST clusters where > 80% of the EST was covered by a match, by pairs of assemblies merged using an OLC assembler
| Assembly 1 | Assembly 2 | % EST clusters hit | % EST bases covered |
|---|---|---|---|
| CLC | MIRA | 65.4 | 65.1 |
| MIRA | Newbler 2.5 | 65.4 | 65.3 |
| MIRA | SeqMan | 64.9 | 64.6 |
| CLC | Newbler 2.5 | 64.7 | 64.6 |
| Newbler 2.5 | SeqMan | 63.2 | 62.9 |
| CAP3 | Newbler 2.5 | 63.0 | 62.6 |
| CAP3 | CLC | 62.9 | 62.4 |
| CLC | SeqMan | 62.9 | 62.5 |
| CLC | Newbler 2.3 | 62.4 | 62.5 |
| CAP3 | MIRA | 62.2 | 61.9 |
| CAP3 | SeqMan | 61.7 | 61.3 |
| MIRA | Newbler 2.3 | 61.7 | 61.9 |
| Newbler 2.3 | Newbler 2.5 | 61.4 | 61.5 |
| Newbler 2.3 | SeqMan | 60.0 | 60.2 |
| CAP3 | Newbler 2.3 | 59.7 | 59.6 |
Alignments to 11,472 B. malayi peptides using BLASTX (E-value < 1e-5) where > 80% of the protein was covered by a match, by pairs of assemblies merged using an OLC assembler
| Assembly 1 | Assembly 2 | % Peptides hit | % Peptide bases covered |
|---|---|---|---|
| CLC | Newbler 2.5 | 38.2 | 28.3 |
| MIRA | Newbler 2.5 | 37.3 | 27.4 |
| CLC | MIRA | 37.1 | 27.1 |
| CLC | Newbler 2.3 | 36.5 | 27.4 |
| Newbler 2.5 | SeqMan | 36.5 | 26.8 |
| CAP3 | Newbler 2.5 | 36.4 | 26.9 |
| CAP3 | CLC | 36.1 | 26.3 |
| CLC | SeqMan | 35.9 | 26.1 |
| MIRA | SeqMan | 35.7 | 25.7 |
| Newbler 2.3 | Newbler 2.5 | 35.3 | 26.3 |
| CAP3 | SeqMan | 35.1 | 25.2 |
| Newbler 2.3 | SeqMan | 34.4 | 25.6 |
| CAP3 | MIRA | 34.0 | 24.2 |
| CAP3 | Newbler 2.3 | 34.0 | 25.1 |
| MIRA | Newbler 2.3 | 33.3 | 24.0 |