Vladimír Boža, Broňa Brejová, Tomáš Vinař.
Abstract
BACKGROUND: Resolution of repeats and scaffolding of shorter contigs are critical parts of genome assembly. Modern assemblers usually perform these steps using heuristics, often tailored to a particular technology for producing paired or long reads.
Keywords: De Bruijn graphs; Genome assembly; Maximum likelihood; Next generation sequencing; Simulated annealing
Year: 2015 PMID: 26042154 PMCID: PMC4454275 DOI: 10.1186/s13015-015-0052-6
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Examples of proposal moves. (a) Walk extension joining two walks. (b) Local improvement by addition of a new loop. (c) Repeat interchange.
Properties of datasets used
| ID | References | Technology | Insert length (bp) | Read length (bp) | Coverage | Error rate (%) |
|---|---|---|---|---|---|---|
| SA1 | [ | Illumina | 180 | 101 | 90 | 3 |
| SA2 | [ | Illumina | 3,500 | 37 | 90 | 3 |
| | | | | | | |
| EC1 | [ | Illumina | 300 | 151 | 400 | 0.75 |
| EC2 | [ | PacBio | | 4,000 | 30 | 13 |
| EC3 | Simulated | Illumina | 37,000 | 75 | 0.5 | 4 |
| | | | | | | |
| H1 | [ | Illumina | 150 | 101 | 42 | 1 |
| H2 | [ | Illumina | 2,500 | 101 | 26 | 3 |
| H3 | [ | Illumina | 35,000 | 76 | 1.3 | 4.5 |
Comparison of assembly accuracy in the first three scenarios
| Assembler | Number of scaffolds | Longest scaffold (kb) | Longest scaffold corr. (kb) | N50 (kb) | Err. | N50 corr. (kb) | LAP |
|---|---|---|---|---|---|---|---|
| GAML | 28 | 1,191 | 1,191 | 514 | | 514 | |
| Allpaths-LG | | 1,435 | | 1,092 | | | −25.02 |
| SOAPdenovo | 99 | 518 | 518 | 332 | | 332 | −25.03 |
| Velvet | 45 | 958 | 532 | 762 | 17 | 126 | −25.34 |
| Bambus2 | 17 | 1,426 | 1,426 | 1,084 | | 1,084 | −25.73 |
| MSR-CA | 17 | | 1,343 | | 3 | 1,022 | −26.26 |
| ABySS | 246 | 125 | 125 | 34 | 1 | 28 | −29.43 |
| Cons. Velvet* | 219 | 95 | 95 | 31 | | 31 | −30.82 |
| SGA | 456 | 286 | 286 | 208 | 1 | 208 | −31.80 |
| | | | | | | | |
| PacbioToCA | 55 | 1,533 | 1,533 | | | | |
| GAML | 29 | 1,283 | 1,283 | 653 | | 653 | −33.91 |
| Cerulean | | | | 694 | | 694 | −34.18 |
| AHA | 54 | 477 | 477 | 213 | 5 | 194 | −34.52 |
| Cons. Velvet* | 383 | 80 | 80 | 21 | | 21 | −36.02 |
| | | | | | | | |
| GAML | | | | | | | |
| Celera | 19 | 4,635 | 2,085 | 4,635 | 19 | 2,085 | −61.47 |
| Cons. Velvet* | 383 | 80 | 80 | 21 | | 21 | −72.03 |
For all assemblies, N50 values are based on the actual genome size. All misjoins were considered as errors and error-corrected values of N50 and contig sizes were obtained by breaking each contig at each error [24]. All assemblies except for GAML and conservative Velvet were obtained from [24] in the first experiment, and from [6] in the second experiment.
Italic numbers in each column signify the best result.
* Velvet with conservative settings used to create the assembly graph in our method.
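The two N50 variants described in the table notes can be illustrated with a short sketch. It assumes that each scaffold is given as a length plus a list of error (misjoin) positions; the function names and this input representation are hypothetical and chosen only for illustration, and the evaluation procedure of [24] may differ in its details.

```python
from typing import Iterable, List, Sequence, Tuple


def n50(lengths: Iterable[int], genome_size: int) -> int:
    """Largest length L such that pieces of length >= L together cover
    at least half of genome_size (not half of the assembly size).
    Returns 0 if the pieces never reach half of the genome."""
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_size:
            return length
    return 0


def break_at_errors(length: int, error_positions: Sequence[int]) -> List[int]:
    """Split one scaffold at every error position and return piece lengths.
    Positions outside the open interval (0, length) are ignored."""
    cuts = sorted(p for p in set(error_positions) if 0 < p < length)
    bounds = [0] + cuts + [length]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def corrected_n50(scaffolds: Sequence[Tuple[int, Sequence[int]]],
                  genome_size: int) -> int:
    """N50 after breaking each scaffold at each of its errors."""
    pieces: List[int] = []
    for length, errors in scaffolds:
        pieces.extend(break_at_errors(length, errors))
    return n50(pieces, genome_size)


# Toy data (not from the paper): three scaffolds, one with a misjoin at 200.
toy = [(500, [200]), (300, []), (200, [])]
print(n50([length for length, _ in toy], 1000))  # uncorrected N50
print(corrected_n50(toy, 1000))                  # error-corrected N50
```

In this toy input the single misjoin splits the longest scaffold into two shorter pieces, so the corrected N50 is lower than the uncorrected one, which is the same effect seen in rows such as Velvet and Celera above.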
Improving existing assemblies of the human chromosome 14 by GAML
| Assembler | Number of scaffolds | Longest scaffold (kb) | Longest scaffold corr. (kb) | N50 (kb) | Err. | N50 corr. (kb) | LAP |
|---|---|---|---|---|---|---|---|
| Before | 1,081 | 4,628 | 263 | 1,190 | 9,156 | 27 | −138.765779 |
| After | 1,634 | 1,046 | 265 | 347 | 8,049 | 27 | −138.632657 |
| REAPR | 17,727 | 153 | 81 | 36 | 4,607 | 14 | −162.869192 |
| | | | | | | | |
| Before | 129 | 81,640 | 14,918 | 81,640 | 34 | 7,652 | −111.288806 |
| After | 139 | 81,640 | 14,918 | 81,640 | 33 | 7,652 | −111.287938 |
| REAPR | 858 | 977 | 146 | 190 | 4,230 | 17 | −168.024865 |
In both experiments, we use read sets H1, H2, and H3 and compare the original assembly computed by another tool with the assembly found by GAML.