Marten Boetzer, Walter Pirovano.
Abstract
De novo assembly is a commonly used application of next-generation sequencing experiments. The ultimate goal is to puzzle millions of reads into one complete genome, although draft assemblies usually result in a number of gapped scaffold sequences. In this paper we propose an automated strategy, called GapFiller, to reliably close gaps within scaffolds using paired reads. The method shows good results on both bacterial and eukaryotic datasets, allowing only few errors. As a consequence, the amount of additional wetlab work needed to close a genome is drastically reduced. The software is available at http://www.baseclear.com/bioinformatics-tools/.
Year: 2012 PMID: 22731987 PMCID: PMC3446322 DOI: 10.1186/gb-2012-13-6-r56
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Gap closure results obtained on the bacterial datasets
| Method | Original | IMAGE | SOAPdenovo | GapFiller | GapFiller-LC |
|---|---|---|---|---|---|
| Genome size (bp) | 4,478,287 | 4,530,961 | 4,490,973 | 4,490,638 | |
| Scaffolds | 179 | 179 | 179 | 179 | |
| Gap count | 544 | 291 | 16 | 11 | |
| Total gap length (bp) | 12,516 | 2,861 | 16 | 130 | |
| Errors (SNPs) | 12 | 40 | 33 | 22 | |
| Errors (indels) | 4 | 17 | 25 | 9 | |
| Errors (misjoins) | 1 | 1 | 1 | 1 | |
| N50 | 50,557 | 50,558 | 50,558 | 50,558 | |
| Genome size (bp) | 8,558,275 | 8,576,331 | 8,557,720 | 8,558,333 | |
| Scaffolds | 115 | 115 | 115 | 115 | |
| Gap count | 158 | 63 | 60 | 23 | |
| Total gap length (bp) | 9,221 | 4,009 | 1,288 | 806 | |
| Errors (SNPs) | 299 | 423 | 406 | 280 | |
| Errors (indels) | 664 | 677 | 769 | 686 | |
| Errors (misjoins) | 12 | 17 | 18 | 18 | |
| N50 | 173,822 | 173,822 | 173,822 | 173,822 | |
| Genome size (bp) | 2,880,676 | 2,880,926 | 2,881,756 | 2,883,448 | |
| Scaffolds | 19 | 19 | 19 | 19 | |
| Gap count | 48 | 27 | 27 | 22 | |
| Total gap length (bp) | 9,900 | 1,547 | 5,508 | 1,861 | |
| Errors (SNPs) | 79 | 260 | 98 | 173 | |
| Errors (indels) | 16 | 53 | 26 | 37 | |
| Errors (misjoins) | 4 | 13 | 7 | 5 | |
| N50 | 1,091,731 | 1,091,333 | 1,092,281 | 1,092,421 | |
| Genome size (bp) | 4,609,785 | 4,609,466 | 4,609,596 | 4,610,796 | |
| Scaffolds | 38 | 38 | 38 | 38 | |
| Gap count | 170 | 163 | 161 | 139 | |
| Total gap length (bp) | 21,409 | 14,166 | 20,667 | 17,625 | |
| Errors (SNPs) | 218 | 410 | 230 | 300 | |
| Errors (indels) | 187 | 294 | 190 | 199 | |
| Errors (misjoins) | 6 | 10 | 6 | 7 | |
| N50 | 3,192,334 | 3,192,075 | 3,192,215 | 3,192,974 | |
Gap closure results obtained on four bacterial datasets show that the GapFiller strategy yields the most accurate finished genomes and a lower gap count than the other methods. The IMAGE method significantly underperforms on all quality measures and would therefore not be the preferred method. Differences between GapFiller and SOAPdenovo are smaller. Interestingly, whereas the gap count after closure is generally lower for GapFiller, SOAPdenovo yields a shorter total gap length in three cases, which suggests that the latter method is able to close larger gaps. Strikingly, however, the number of errors is significantly higher for SOAPdenovo regardless of the error type (SNPs, indels and misjoins). Even when less strict settings are applied for GapFiller (GapFiller-LC: minimum coverage o = 1, ratio r = 0.5) to shorten the total gap length, our method still yields significantly fewer errors.
Figure 1. Time and memory consumption of gap closure software. Comparative analysis of the runtime and memory usage per dataset based on a single iteration. SOAPdenovo completes the analysis in less time if the amount of data is very small or very large, whereas GapFiller is faster for intermediate data sizes (10 to 20 million reads). With regard to memory usage, GapFiller outperforms SOAPdenovo because intermediate output is written to temporary storage rather than kept in memory. For all datasets analyzed, GapFiller requires only 0.1 GB of memory, most of which is consumed by the Burrows-Wheeler Aligner (BWA). Note that no results are displayed for IMAGE, since the method cannot handle multiple libraries and requires very long computation times to complete the process. M, million.
Gap closure results obtained on the eukaryotic datasets
| Method | Original | SOAPdenovo | GapFiller |
|---|---|---|---|
| Genome size (bp) | 11,388,647 | 11,388,600 | 11,388,609 |
| Scaffolds | 334 | 334 | 334 |
| Gap count | 283 | 67 | 45 |
| Total gap length (bp) | 19,358 | 994 | 2,873 |
| Errors (SNPs) | 890 | 1,033 | 931 |
| Errors (indels) | 565 | 754 | 648 |
| Errors (misjoins) | 23 | 42 | 31 |
| N50 | 84,640 | 84,640 | 84,649 |
| Genome size (bp) | 95,081,274 | 95,059,687 | 95,072,801 |
| Scaffolds | 19,249 | 19,249 | 19,249 |
| Gap count | 2,820 | 1,986 | 1,682 |
| Total gap length (bp) | 949,137 | 423,107 | 699,550 |
| Errors (SNPs) | 76,653 | 79,266 | 76,928 |
| Errors (indels) | 21,261 | 23,144 | 22,338 |
| Errors (misjoins) | 179 | 224 | 187 |
| N50 | 7,748 | 8,262 | 8,469 |
Results of SOAPdenovo and GapFiller obtained for the S. cerevisiae and human genomes show that both methods are also suitable for closing gaps in eukaryotic genomes. Patterns are similar to the observations made for bacteria: overall, GapFiller yields the most reliable results and the lowest gap count, whereas SOAPdenovo yields a significantly shorter total gap length (though at the cost of an increased error rate). In human, the reduced genome size and total gap length obtained by SOAPdenovo (together with the increased indel and misjoin error rates) might indicate that some gaps are closed by collapsing (repeated) elements.
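To make the trade-off between gap count and total gap length concrete, the relative reductions can be computed directly from the values in the table above. The following is only an illustrative sketch: the dataset labels assume that the first block of rows is S. cerevisiae and the second is human (the order in which the text names them), and the helper names `reduction_pct` and `summarize` are ours, not part of any tool.

```python
# Values copied from the eukaryotic table: (Original, SOAPdenovo, GapFiller).
# ASSUMPTION: the first block of rows is S. cerevisiae and the second is human.
TABLE = {
    "S. cerevisiae": {
        "gap_count": (283, 67, 45),
        "gap_length_bp": (19_358, 994, 2_873),
    },
    "human": {
        "gap_count": (2_820, 1_986, 1_682),
        "gap_length_bp": (949_137, 423_107, 699_550),
    },
}


def reduction_pct(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


def summarize(table):
    """Return (dataset, metric, SOAPdenovo %, GapFiller %) tuples."""
    rows = []
    for dataset, metrics in table.items():
        for metric, (original, soap, gapfiller) in metrics.items():
            rows.append(
                (
                    dataset,
                    metric,
                    reduction_pct(original, soap),
                    reduction_pct(original, gapfiller),
                )
            )
    return rows


if __name__ == "__main__":
    for dataset, metric, soap, gapfiller in summarize(TABLE):
        print(f"{dataset:14s} {metric:14s} SOAPdenovo {soap:5.1f}%  GapFiller {gapfiller:5.1f}%")
```

Under these assumptions, GapFiller shows the larger reduction in gap count on both datasets, while SOAPdenovo shows the larger reduction in total gap length, which matches the pattern described in the text.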
Functional properties of gap closed regions
| Annotation type | Total size in NC_000014 (bp) | Nucleotides closed with GapFiller (bp) | Portion of total nucleotides closed (%) |
|---|---|---|---|
| Gene | 38,883,890 | 33,969 | 38.67% |
| mRNA | 34,180,962 | 26,818 | 30.53% |
| CDS | 29,380,348 | 23,617 | 26.89% |
| Other RNA | 3,245,865 | 2,755 | 3.14% |
| V-segment | 78,961 | 655 | 0.75% |
| C-region | 36,984 | 23 | 0.03% |
| ncRNA | 11,447 | 0 | 0% |
| Total | 105,818,457 | 87,837 | 100.00% |
Functionalities associated with the nucleotides incorporated with GapFiller based on the annotation of human chromosome 14. Clearly the method can add valuable sequence information to a number of functionalities present in GenBank:NC_000014. Most nucleotides are incorporated in coding sequence (CDS) and mRNA regions. ncRNA, non-coding RNA.
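The last column of the table can be reproduced from the third column alone. The short sketch below does this, using the counts copied from the table; the function name `portions` is a hypothetical helper introduced only for illustration.

```python
# Nucleotides closed with GapFiller per annotation type (bp), copied from the table.
CLOSED_BP = {
    "Gene": 33_969,
    "mRNA": 26_818,
    "CDS": 23_617,
    "Other RNA": 2_755,
    "V-segment": 655,
    "C-region": 23,
    "ncRNA": 0,
}


def portions(closed):
    """Return the total closed nucleotides and each type's share in percent."""
    total = sum(closed.values())
    shares = {name: 100.0 * bp / total for name, bp in closed.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = portions(CLOSED_BP)
    print(f"Total closed: {total:,} bp")
    for name, pct in shares.items():
        print(f"{name:10s} {pct:6.2f}%")
```

The per-type counts sum to the total row (87,837 bp), and the resulting shares agree with the percentages listed in the table to two decimal places.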
Figure 2. Schematic overview of the GapFiller algorithm. (a) The input data consist of a set of scaffold sequences containing gapped nucleotides and one or more sets of paired-end and/or mate-pair reads. (b) As a pre-processing step, low-quality nucleotides are removed from the sequence edges, which enlarges the gap by ten nucleotides on each side. This trimming matters because the contig ends resulting from a draft assembly often contain misassemblies. (c) Paired reads are aligned to the scaffolds, and a pair is retained if one read aligns to a scaffold sequence (dark grey) and its mate falls in a gapped region (black). (d) All reads that are estimated to fall in the gapped regions are split into k-mers and used for gap filling. (e) The gap is closed from each edge by using k-mers that overlap the current sequence end by k - 1 nucleotides and provide a one-nucleotide overhang. Gaps are closed if the right and left extensions can be merged and correspond to the estimated sequence gap.
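Step (e) can be illustrated with a small sketch of k-mer based extension from one gap edge. This is not the GapFiller implementation: it covers only rightward extension, omits read alignment, edge trimming and the merging of left and right extensions, and the names `build_kmer_index`, `extend_right`, `min_cov` and `min_ratio` are hypothetical. The two thresholds are assumed to play a role similar to the minimum coverage o and ratio r mentioned with the GapFiller-LC settings, and their default values here are illustrative only.

```python
from collections import Counter, defaultdict


def build_kmer_index(reads, k):
    """Map each (k-1)-mer to a Counter of the nucleotides observed right after it."""
    index = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            index[kmer[:-1]][kmer[-1]] += 1
    return index


def extend_right(seed, index, k, max_len, min_cov=2, min_ratio=0.7):
    """Extend `seed` into the gap one nucleotide at a time (assumes k >= 2).

    A base is appended only if the k-mers overlapping the last k-1 nucleotides
    agree well enough: the most frequent next base must be seen at least
    `min_cov` times and account for at least `min_ratio` of all observations.
    Returns only the newly added sequence.
    """
    seq = seed
    while len(seq) - len(seed) < max_len:
        counts = index.get(seq[-(k - 1):])
        if not counts:
            break  # no k-mer overlaps the current edge
        base, support = counts.most_common(1)[0]
        if support < min_cov or support / sum(counts.values()) < min_ratio:
            break  # too little or too ambiguous evidence
        seq += base
    return seq[len(seed):]


if __name__ == "__main__":
    # Toy reads assumed to fall in the gap to the right of the flank "TTGAC".
    reads = ["TTGACCAG", "GACCAGTC", "ACCAGTCGA", "CAGTCGA"]
    index = build_kmer_index(reads, k=4)
    print(extend_right("TTGAC", index, k=4, max_len=10))
```

In this toy setting, lowering `min_cov` or `min_ratio` lets the extension continue where evidence is thin or conflicting, which is the kind of trade-off between closed sequence and error rate that the GapFiller-LC results suggest.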