Kristoffer Sahlin, Francesco Vezzi, Björn Nystedt, Joakim Lundeberg, Lars Arvestad.
Abstract
BACKGROUND: The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed, but many of them use the number of read pairs supporting a linking of two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features. We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance.
Year: 2014 PMID: 25128196 PMCID: PMC4262078 DOI: 10.1186/1471-2105-15-281
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Notation. a) A read pair with insert size x (unknown distance) aligns to two contigs and thus creates a link between them. The read pair gives rise to two observations, which are used to infer the unknown distance d. The distances for the observations, d and r are illustrated. b) Graph structure and notation of the scaffold graph. Two contigs are connected by an edge e created from alignments of read pairs.
GAGE data
| Assembly | BESST E-size | BESST errors | OPERA E-size | OPERA errors | SOPRA E-size | SOPRA errors | SSPACE E-size | SSPACE errors | Integrated scaffolder E-size | Integrated scaffolder errors | Unscaffolded E-size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ABySS | 263,4 |  |  | 12 | 103,4 | 2 | 126,3 | 5 | 35,3 | 1 | 31,4 |
| Allpaths-LG | 436,4 |  | 607,4 | 12 | 295,5 |  |  | 1 | 1136,2 | 0 | 90,0 |
| Bambus2 |  |  | 560,0 | 4 | 125,2 | 2 | 665,7 | 2 | 1119,5 | 0 | 19,6 |
| MSR-CA | 744,7 | 3 | 302,4 | 11 | 117,4 |  |  | 2 | 999,9 | 3 | 50,3 |
| SGA | 75,1 |  |  | 3 | 239,9 | 6 | 32,6 | 2 | 162,9 | 1 | 4,7 |
| SOAPdenovo |  |  | 333,1 | 7 | 227,2 |  | 286,7 | 5 | 229,3 | 0 | 68,0 |
| Velvet | 204,2 | 4 |  | 5 | 154,4 |  | 162,2 | 12 | 194,6 | 17 | 48,5 |
| SUM | 2898,1 |  |  | 54 | 1263,0 | 11 | 3085,1 | 29 |  |  |  |
The numbers in bold face style indicate the best corrected E-size and number of errors among the stand-alone scaffolders for each assembly.
, GAGE data
| Assembly | BESST E-size | BESST errors | OPERA E-size | OPERA errors | SOPRA E-size | SOPRA errors | SSPACE E-size | SSPACE errors | Integrated scaffolder E-size | Integrated scaffolder errors | Unscaffolded E-size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ABySS |  | 13 | 65,8 | 20 | 44,9 | 17 | 34,7 |  | 73,4 | 3 | 6,9 |
| Allpaths-LG |  |  | 852,1 | 4 | 425,4 | 2 | 1271,9 | 1 | 2401,7 | 0 | 35,9 |
| Bambus2 | 1426,0 | 4 | 1446,0 | 8 |  | 3 | 789,9 |  | 1348,4 | 2 | 16,2 |
| CABOG |  |  | 362,6 | 7 | 293,4 |  | 419,1 | 4 | 211,3 | 5 | 21,5 |
| MSR-CA |  | 3 | 573,5 | 8 | 138,2 |  | 1579,8 | 2 | 2001,1 | 5 | 21,6 |
| SGA | 100,5 | 6 |  |  | 105,7 | 41 | 44,9 | 9 | 48,0 | 1 | 3,2 |
| SOAPdenovo |  |  | 841,5 | 7 | 1477,1 | 3 | 1500,6 | 3 | 687,6 | 0 | 18,6 |
| Velvet | 332,9 |  |  | 10 | 175,6 | 11 | 329,6 | 6 | 348,1 | 19 | 16,7 |
| SUM |  |  | 4626,0 | 69 | 4129,2 | 80 | 5970,5 |  |  |  |  |
The numbers in bold face style indicate the best corrected E-size and number of errors among the stand-alone scaffolders for each assembly.
Hs14, GAGE data
| Assembly | BESST E-size | BESST errors | OPERA E-size | OPERA errors | SOPRA E-size | SOPRA errors | SSPACE E-size | SSPACE errors | Integrated scaffolder E-size | Integrated scaffolder errors | Unscaffolded E-size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ABySS |  |  | 15,8 | 200 | - | - | 15,3 | 47 | 2,8 | 9 | 3,1 |
| Allpaths-LG | 513,6 | 32 | 311,0 | 104 | 194,9 | 17 |  |  | 4652,3 | 45 | 27,1 |
| Bambus2 | 88,2 |  | 61,7 | 331 | - | - |  | 109 | 157,6 | 143 | 6,3 |
| CABOG |  | 31 | 349,1 | 77 | 234,0 |  | 411,0 | 23 | 347,7 | 597 | 30,7 |
| MSR-CA | 51,3 |  | - | - | - | - |  | 146 | 111,9 | 1068 | 5,9 |
| SGA |  | 58 | 3,5 |  | 22,2 | 2253 | 24,8 | 42 | 89,9 | 19 | 3,7 |
| SOAPdenovo |  | 211 | - | - | - | - | 75,3 |  | 99,2 | 268 | 9,8 |
| Velvet | 35,7 |  | - | - |  | 734 | 22,6 | 140 | 26,6 | 9156 | 3,0 |
| SUM |  |  | 741,0 | 751 | 526,5 | 3023 | 1259,0 | 734 |  |  |  |
The numbers in bold face style indicate the best corrected E-size and number of errors among the stand-alone scaffolders for each assembly.
on GAGE contig assemblies using the wide MP library
| Assembly | BESST E-size | BESST errors | OPERA E-size | OPERA errors | SOPRA E-size | SOPRA errors | SSPACE E-size | SSPACE errors | Unscaffolded E-size |
|---|---|---|---|---|---|---|---|---|---|
| ABySS | 17,6 | 5 |  | 51 | 6,7 | 4 | 9,9 |  | 6,9 |
| Allpaths-LG |  |  | 314,5 | 1 | 44,4 |  | 70,3 | 2 | 35,9 |
| Bambus2 |  |  | 267,1 | 0 | 93,9 |  | 137,9 |  | 16,2 |
| CABOG |  | 3 | 97,3 | 4 | 22,3 |  | 33,5 |  | 21,5 |
| MSR-CA | 192,1 | 1 |  | 6 | 24,0 |  | 38,1 | 1 | 21,6 |
| SGA | 4,3 |  |  | 111 | 3,3 | 12 | 6,5 | 3 | 3,2 |
| SOAPdenovo |  |  | 720,5 | 10 | 156,2 |  | 206,4 | 2 | 18,6 |
| Velvet |  |  | 62,3 | 4 | 17,8 | 3 | 30,9 | 3 | 16,7 |
| SUM |  |  | 1698,4 | 187 | 368,7 | 19 | 533,6 | 12 |  |
The numbers in bold face style indicate the best corrected E-size and number of errors among the stand-alone scaffolders for each assembly. Note that the corrected E-size for SOPRA is slightly less than the corrected contig size in the ABySS assembly. This can happen for low-contiguity scaffolded assemblies that contain more bases than the contig assemblies (5,0 Mbp and 4,5 Mbp respectively in this instance). The difference in number of bases is due to the fact that the GAGE evaluation script only computes statistics on contigs and scaffolds that are longer than 200 bp. The GAGE evaluation script returned an error when computing statistics for seven, two and one scaffolds in the SSPACE results for the ABySS, CABOG and MSR-CA assemblies, respectively. In the SOPRA and OPERA results, one and three scaffolds of the SGA assembly, respectively, returned this error. We removed these scaffolds from the evaluation in order to compute the results. In all cases, the removed scaffolds summed to a total length of less than 110 kbp. Thus, this has a negligible (either positive or negative) effect on the E-size computation and a possible positive effect on the number of errors. Results for BESST contained no scaffolds giving this error.
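The effect of the 200 bp cutoff on E-size can be illustrated with a minimal sketch. It assumes E-size is the sum of squared sequence lengths divided by the total number of bases counted; the source does not state the formula, so this normalisation, the function name `e_size` and the toy lengths below are assumptions for illustration only.

```python
def e_size(lengths, min_len=200):
    """Length-weighted mean sequence size over sequences longer than min_len.

    Assumption: normalisation by the number of bases actually counted,
    not by a reference genome size.
    """
    kept = [n for n in lengths if n > min_len]  # "longer than 200 bp"
    total = sum(kept)
    if total == 0:
        return 0.0
    return sum(n * n for n in kept) / total


# Hypothetical toy data: three short contigs fall below the cutoff ...
contigs = [10_000, 8_000, 150, 150, 150]
# ... but once joined into one 500 bp scaffold they are counted.
scaffolds = [10_000, 8_000, 500]

print(e_size(contigs))
print(e_size(scaffolds))
```

Under this assumption the scaffolded set counts more bases than the contig set, and the extra short scaffold pulls the weighted mean down, which is the situation the note describes.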
Types of errors on summed over all assemblies
| Error type | BESST | OPERA | SOPRA | SSPACE |
|---|---|---|---|---|
| Indels | 2 | 17 | 2 | 6 |
| Inversions | 6 | 30 | 6 | 16 |
| Translocations | 0 | 0 | 0 | 1 |
| Relocations | 1 | 7 | 3 | 6 |
Types of errors on summed over all assemblies
| Error type | BESST | OPERA | SOPRA | SSPACE |
|---|---|---|---|---|
| Indels | 16 | 28 | 13 | 7 |
| Inversions | 4 | 9 | 7 | 3 |
| Translocations | 2 | 26 | 21 | 14 |
| Relocations | 8 | 6 | 39 | 6 |
Types of errors on Hs14 summed over all assemblies
| Error type | BESST | OPERA | SOPRA | SSPACE |
|---|---|---|---|---|
| Indels | 398 | 149 | 1062 | 442 |
| Inversions | 163 | 383 | 154 | 289 |
| Translocations | 0 | 0 | 0 | 0 |
| Relocations | 6 | 219 | 1807 | 3 |
Types of errors on with wide MP library summed over all assemblies
| Error type | BESST | OPERA | SOPRA | SSPACE |
|---|---|---|---|---|
| Indels | 5 | 87 | 4 | 3 |
| Inversions | 3 | 5 | 1 | 0 |
| Translocations | 1 | 9 | 1 | 7 |
| Relocations | 2 | 86 | 13 | 2 |
Runtime for scaffolders on
| Assembly | BESST (hh:mm:ss) | OPERA (hh:mm:ss) | SOPRA (hh:mm:ss) | SSPACE (hh:mm:ss) |
|---|---|---|---|---|
| ABySS | 00:00:40 | 00:28:47 | 01:18:24 | 00:00:26 |
| Allpaths | 00:00:25 | 00:00:47 | 00:11:56 | 00:00:20 |
| Bambus2 | 00:00:26 | 00:00:49 | 00:22:11 | 00:00:21 |
| MSR-CA | 00:00:26 | 00:01:05 | 00:19:21 | 00:00:21 |
| SGA | 00:01:03 | 00:05:58 | 04:30:08 | 00:00:51 |
| SOAPdenovo | 00:00:25 | 00:00:50 | 00:26:51 | 00:00:19 |
| Velvet | 00:00:27 | 00:00:53 | 00:39:25 | 00:00:21 |
Runtime for scaffolders on with GAGE data
| Assembly | BESST (hh:mm:ss) | OPERA (hh:mm:ss) | SOPRA (hh:mm:ss) | SSPACE (hh:mm:ss) |
|---|---|---|---|---|
| ABySS | 00:01:22 | 00:12:38 | 01:17:49 | 00:00:40 |
| Allpaths-LG | 00:00:32 | 00:01:13 | 00:10:35 | 00:00:27 |
| Bambus2 | 00:00:33 | 00:01:38 | 00:10:42 | 00:00:25 |
| CABOG | 00:00:35 | 00:00:59 | 00:11:13 | 00:00:27 |
| MSR-CA | 00:00:38 | 00:01:12 | 00:18:10 | 00:00:29 |
| SGA | 00:01:45 | 00:01:35 | 01:30:44 | 00:00:45 |
| SOAPdenovo | 00:00:33 | 00:03:48 | 00:18:08 | 00:00:27 |
| Velvet | 00:00:49 | 00:01:16 | 00:36:54 | 00:00:27 |
Runtime for scaffolders on Hs14 with GAGE data (upper bound time requirement was set to 48 hours)
| Assembly | BESST (hh:mm:ss) | OPERA (hh:mm:ss) | SOPRA (hh:mm:ss) | SSPACE (hh:mm:ss) |
|---|---|---|---|---|
| ABySS | 00:19:37 | 00:58:22 | - | 00:32:55 |
| Allpaths-LG | 00:05:06 | 00:53:24 | 22:02:09 | 00:12:55 |
| Bambus2 | 00:07:43 | 01:18:06 | - | 00:14:26 |
| CABOG | 00:04:16 | 00:16:16 | 11:50:23 | 00:08:33 |
| MSR-CA | 00:11:22 | - | - | 00:15:38 |
| SGA | 00:53:42 | 00:23:18 | 16:15:22 | 00:38:42 |
| SOAPdenovo | 00:07:50 | - | - | 00:10:46 |
| Velvet | 00:10:07 | - | 01:27:16 | 00:15:35 |
Runtime for scaffolders on with wide mate pair library
| Assembly | BESST (hh:mm:ss) | OPERA (hh:mm:ss) | SOPRA (hh:mm:ss) | SSPACE (hh:mm:ss) |
|---|---|---|---|---|
| ABySS | 00:00:52 | 00:04:33 | 00:14:49 | 00:00:50 |
| Allpaths-LG | 00:00:31 | 00:04:03 | 00:08:48 | 00:00:36 |
| Bambus2 | 00:00:25 | 00:03:50 | 00:09:08 | 00:00:33 |
| CABOG | 00:00:31 | 00:03:10 | 00:07:17 | 00:00:37 |
| MSR-CA | 00:00:30 | 00:03:46 | 00:08:29 | 00:00:39 |
| SGA | 00:00:36 | 00:04:06 | 00:16:29 | 00:00:49 |
| SOAPdenovo | 00:00:27 | 00:04:35 | 00:09:54 | 00:00:35 |
| Velvet | 00:00:33 | 00:03:58 | 00:09:38 | 00:00:40 |
Figure 2. Dispersity score. Illustration of the dispersity measurement. Read pairs linking contigs c1 and c2, of lengths n and m respectively, are transformed to data tested with the KS-test. (a) Observations from contig c1 are translated and reflected on the x-axis, while observations from contig c2 are translated. The two-sample KS statistic will indicate high similarity in read distribution. (b) A strange placement of linked reads occurs. Several explanations are possible. One is that contig c2 is misassembled (chimeric); another is that c2 is a correctly assembled contig with small repeated regions resolved at the assembly level. The repeat might not be present in other contigs from the assembly, and therefore the alignments to these regions are reported as unique. Contig c2 is, however, not close to contig c1 on the genome, and linked reads fail to place at the non-repeated regions of c2. The KS test will indicate low similarity.
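The comparison in Figure 2 can be sketched as a two-sample Kolmogorov-Smirnov statistic on transformed read positions. The transformation used here (distance from each read to the contig end facing the gap) is only one possible reading of "translated and reflected"; the function names and example positions are hypothetical and not taken from the BESST implementation.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        fb = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d


def dispersity(pos_c1, n, pos_c2):
    """Compare read placement on both sides of a link.

    pos_c1: read positions on contig c1 (length n), measured from its start.
    pos_c2: read positions on contig c2, measured from its start.
    Assumption: c1's end and c2's start face the gap.
    """
    dist_c1 = [n - p for p in pos_c1]  # reflect and translate
    dist_c2 = list(pos_c2)             # already measured from the gap side
    return ks_statistic(dist_c1, dist_c2)


# (a) reads cluster near the gap on both contigs: statistic near 0
print(dispersity([900, 920, 950, 980], 1000, [20, 50, 80, 100]))
# (b) reads on c2 sit far from the gap: statistic near 1
print(dispersity([900, 920, 950, 980], 1000, [600, 620, 650, 700]))
```

A small statistic corresponds to the "high similarity" case in panel (a), a large one to the "low similarity" case in panel (b). For real data, `scipy.stats.ks_2samp` provides the same statistic together with a p-value.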
Figure 3. Small contigs. An example of a region containing smaller contigs. There are 5 possible paths to connect c1 and c6. The highest-scoring one is [c1,c4,c5,c6], giving π = 34, and it is the selected path between c1 and c6. In this path, good edges are represented as solid lines and bad edges as dotted lines. A lower-scoring alternative is [c1,c2,c4,c5,c6] with (g,b) = (47,18), giving π = 29. c2 is a problematic contig that may be chimeric or consist of repeated sequence(s). The three remaining paths, all of them with negative score, are [c1,c2,c3,c8,c6], [c1,c4,c6] and [c1,c2,c4,c6].
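The path selection in Figure 3 can be sketched as follows, assuming the score is the good-edge support minus the bad-edge support (π = g − b). This relation is inferred from the one fully stated example, (47, 18) giving 29, and is not confirmed by the caption; `path_score` and `select_path` are hypothetical names. Only the two paths whose scores are given are included.

```python
def path_score(good, bad):
    """Assumed scoring: support from good edges minus support from bad edges."""
    return good - bad


def select_path(scored_paths):
    """Return the (path, score) pair with the highest score."""
    return max(scored_paths.items(), key=lambda item: item[1])


scored_paths = {
    ("c1", "c4", "c5", "c6"): 34,                        # score as stated
    ("c1", "c2", "c4", "c5", "c6"): path_score(47, 18),  # from (g, b)
}

best_path, best_score = select_path(scored_paths)
print(best_path, best_score)
```

The three negative-scoring paths are left out because their values are not given; under the assumed scoring they would not change the selection.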