Daniel Paulino, René L Warren, Benjamin P Vandervalk, Anthony Raymond, Shaun D Jackman, Inanç Birol.
Abstract
BACKGROUND: While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment "gaps" - uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes.
Year: 2015 PMID: 26209068 PMCID: PMC4515008 DOI: 10.1186/s12859-015-0663-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
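The abstract describes gaps as stretches of uncharacterized nucleotides (N) in a draft assembly. As a minimal sketch of that definition only, and not of how Sealer or any other tool locates gaps, the hypothetical helper `find_gaps` below reports the coordinates of every run of N in a scaffold string. The function name, the `min_len` parameter and the example sequence are assumptions made for illustration.

```python
import re


def find_gaps(scaffold: str, min_len: int = 1) -> list[tuple[int, int]]:
    """Return half-open (start, end) intervals for each run of N/n.

    `min_len` is a hypothetical filter: runs shorter than this are ignored.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(scaffold)]


# Illustrative scaffold (made up): one 5 bp gap and one 2 bp gap.
scaffold = "ACGTNNNNNACGTTnnGCA"
for start, end in find_gaps(scaffold):
    print(f"gap at [{start}, {end}) length {end - start}")
```

The interval length gives the recorded gap size, which, as the abstract notes, may be only an estimate of the true missing sequence length.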
Sequence read datasets used
| | E. coli | S. cerevisiae | C. elegans | H. sapiens | P. glauca |
|---|---|---|---|---|---|
| Coverage | 615x | 25x | 89x | 71x | 65x |
| Read length (bp) | 98 | 100 | 100 | 250 | 150, 300 and 500 |
| Paired-end read count | 14,572,674 | 1,582,417 | 44,675,422 | 4.68 × 10^8 | 8.7 × 10^9 |
| Genome size (bp) | 4,686,137 | 12,495,682 | 100,258,171 | 3.3 × 10^9 | 20.8 × 10^9 |
| Short Read Archive (SRA) accession | SRR959238 | ERR156523 | ERR294494 | ERR309932 | SRS357050 |
| Gene source | GCF_000019425 | GCF_000146045 | GCA_000002985 | Not used | Not used |
| ABySS version and assembly | v1.5.2 | v1.5.2 | v1.3.6 | v1.5.2 | v1.3.5 |
Fig. 1. Gap-closing success rates. Results of gap closure for the tools tested across a broad range of genome sizes (5 Mbp to 20 Gbp). BaseClear GapFiller could not complete its run on H. sapiens. Neither GapFiller nor GapCloser was attempted on P. glauca because of their high resource requirements.
Gap-closing performance of Sealer, SOAPdenovo GapCloser and GapFiller on five draft genome assemblies ranging from ~5 Mbp to 20 Gbp
| Draft genome species | Total gaps | Software | Gaps completely closed | % Success | Wall clock time (hh:mm) | Memory (GB) |
|---|---|---|---|---|---|---|
| E. coli | 18 | Sealer | 17 | 94.4 | 00:20 | 0.5 |
| | | GapCloser | 2 | 11.1 | 00:05 | 25.7 |
| | | GapFiller | 15 | 83.3 | 00:43 | 0.4 |
| S. cerevisiae | 213 | Sealer | 178 | 83.6 | 00:02 | 0.5 |
| | | GapCloser | 90 | 42.3 | 00:02 | 3.8 |
| | | GapFiller | 168 | 78.9 | 00:20 | 0.7 |
| C. elegans | 4,223 | Sealer | 2,968 | 70.3 | 09:54 | 1.4 |
| | | GapCloser | 2,062 | 48.8 | 01:49 | 101.0 |
| | | GapFiller | 3,186 | 75.4 | 37:42 | 0.7 |
| H. sapiens | 237,406 | Sealer | 120,676 | 50.8 | 29:19 | 22.2 |
| | | GapCloser | 116,297 | 48.9 | 83:15 | 178.1 |
| | | GapFiller | Incomplete. Terminated after 353 h. | | | |
| P. glauca | 2,894,274 | Sealer | 399,476 | 13.8 | 26:12 | 45.3 |
| | | GapCloser | Not attempted | | | |
| | | GapFiller | Not attempted | | | |
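The "% Success" column in the table above is consistent with the number of gaps completely closed divided by the total number of gaps, expressed as a percentage. The sketch below recomputes that figure for the five Sealer rows. The helper name `success_rate` is hypothetical, and the (total, closed) pairs are copied from the table.

```python
def success_rate(closed: int, total: int) -> float:
    """Percentage of gaps completely closed out of all gaps in the draft."""
    if total <= 0:
        raise ValueError("total number of gaps must be positive")
    return 100.0 * closed / total


# (total gaps, gaps completely closed) for the Sealer rows of the table.
sealer_rows = [
    (18, 17),
    (213, 178),
    (4_223, 2_968),
    (237_406, 120_676),
    (2_894_274, 399_476),
]

for total, closed in sealer_rows:
    print(f"{total:>9,} gaps: {success_rate(closed, total):.1f}% closed")
```

Rounded to one decimal place, the results match the Sealer values reported in the table (94.4, 83.6, 70.3, 50.8 and 13.8).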
Fig. 2. Identity of gaps closed by Sealer and two leading gap-filling applications. Venn diagrams depict the overlap of gaps closed by each tool for the (a) E. coli, (b) S. cerevisiae and (c) C. elegans datasets. The size of each circle represents the number of gaps closed relative to the other tools. Overlapping closed gaps were approximated using the assessment pipeline described in Section 2.5 and depicted using the online VennDiagram.tk tool [http://www.cmbi.ru.nl/~timhulse/venn/]