Aleksandra Swiercz, Wojciech Frohmberg, Michal Kierzynka, Pawel Wojciechowski, Piotr Zurkowski, Jan Badura, Artur Laskowski, Marta Kasprzak, Jacek Blazewicz.
Abstract
Next-generation sequencers produce billions of short DNA sequences in a massively parallel manner, which poses a great computational challenge for accurately reconstructing a genome sequence de novo from these short sequences. Here, we propose the GRASShopPER assembler, which follows the overlap-layout-consensus approach. It uses an efficient GPU implementation for sequence alignment during the graph construction stage and a greedy hyper-heuristic algorithm at the fork detection stage. A two-part fork detection method allows us to identify repeated fragments of a genome and to reconstruct them without misassemblies. The assemblies of data sets of the bacterium Candidatus Microthrix, the nematode Caenorhabditis elegans, and human chromosome 14 were evaluated with the gold standard tool QUAST. In comparison with other assemblers, GRASShopPER provided contigs that covered the largest part of the genomes and, at the same time, kept good values of other metrics, e.g., NG50 and misassembly rate.
Year: 2018 PMID: 30114279 PMCID: PMC6095601 DOI: 10.1371/journal.pone.0202355
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Diagram of the GRASShopPER assembler.
The method has three main steps: construction of the overlap graph, its traversal, and correction of contigs.
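As a rough illustration of the first step, the sketch below builds a toy overlap graph from exact suffix-prefix matches. It is not the GPU-based alignment that GRASShopPER uses; the function names, the `min_len` threshold, and the example reads are assumptions made for this example only.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Return the length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Map each read to a dict {successor read: overlap length}."""
    graph = {r: {} for r in reads}
    for a, b in permutations(reads, 2):
        k = suffix_prefix_overlap(a, b, min_len)
        if k:
            graph[a][b] = k
    return graph


# Hypothetical reads, chosen so that they chain into a single path.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = build_overlap_graph(reads)
```

Traversing the single path in this toy graph would spell one contig. Real data require inexact alignment and far more efficient candidate selection than the all-pairs loop shown here.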
Characteristics of paired-end data sets of C. Microthrix parvicella strain Bio17-1, C. elegans and Homo sapiens after the preprocessing of raw reads with the adapter and low-quality trimming.
* corresponds to the depth of coverage calculated for the length of the chromosome without a large gap of ‘N’.
| genome | C. Microthrix parvicella strain Bio17-1 | C. elegans | Homo sapiens (chromosome 14) |
|---|---|---|---|
| species | bacteria | nematode | mammal |
| sequence length | 4,202,850 bp | 100,267,633 bp | 107,043,718 bp |
| sequencer | Illumina GA II | Illumina GA IIx | Illumina HiSeq 2000 |
| avg. read length | 97 bp | 109 bp | 100 bp |
| no. of read pairs | 2,463,704 | 30,436,661 | 12,015,343 |
| avg. depth of cov. | 113 | 66 | 26* |
| avg. insert size | 312 bp | 232 bp | 159 bp |
| st. dev. of insert size | 36 bp | 56 bp | 18 bp |
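The average depth of coverage in the table can be approximated from the other rows. The sketch below assumes the usual estimate (total sequenced bases divided by genome length) and that both reads of every pair have the average length. The function name is illustrative.

```python
def depth_of_coverage(read_pairs: int, avg_read_len: float, genome_len: int) -> float:
    """Estimate depth as (2 reads per pair * pairs * average read length) / genome length."""
    return 2 * read_pairs * avg_read_len / genome_len


# Values taken from the table above.
microthrix = depth_of_coverage(2_463_704, 97, 4_202_850)
elegans = depth_of_coverage(30_436_661, 109, 100_267_633)

print(f"C. Microthrix: {microthrix:.1f}x")
print(f"C. elegans:    {elegans:.1f}x")
```

Both estimates land close to the tabulated 113 and 66. The human value is not reproduced here because it is calculated against the chromosome length without the large gap of 'N', and that length is not given in the table.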
Assemblies obtained for three data sets: C. Microthrix, C. elegans, and human chromosome 14 (metrics calculated by QUAST).
| Genome statistics | GRASShopPER | Celera | Platanus | SGA | SOAPdenovo2 | Velvet | SPAdes |
|---|---|---|---|---|---|---|---|
| Data set of Candidatus Microthrix parvicella strain Bio17-1 | |||||||
| Genome fraction (%) | 98.73 | 89.21 | 98.38 | 98.82 | 98.52 | 97.86 | 98.96 |
| Duplication ratio | 1.006 | 1.002 | 1.017 | 1.002 | 1.001 | 1.000 | 1.001 |
| Largest alignment | 126,696 | 50,960 | 25,116 | 101,782 | 107,154 | 166,835 | 740,450 |
| Total aligned length | 4,173,839 | 3,755,338 | 4,203,934 | 4,161,932 | 4,145,584 | 4,113,131 | 4,161,980 |
| NG50 | 33,570 | 11,255 | 5,286 | 32,697 | 34,653 | 78,563 | 156,137 |
| NG75 | 16,714 | 5,614 | 2,889 | 18,691 | 17,879 | 39,191 | 104,295 |
| NGA50 | 33,566 | 11,013 | 5,281 | 32,697 | 34,653 | 77,856 | 151,220 |
| NGA75 | 16,712 | 5,528 | 2,886 | 18,691 | 17,879 | 39,191 | 88,932 |
| # misassembled contigs (length) | 4 (11 kb) | 4 (32 kb) | 1 (6 kb) | 3 (37 kb) | 1 (31 kb) | 6 (330 kb) | 5 (370 kb) |
| no. contigs (> 0 bp) | 439 | 493 | 2,107 | 668 | 949 | 103 | 1,297 |
| no. contigs (≥ 250 bp) | 336 | 449 | 1,395 | 257 | 267 | 103 | 808 |
| no. contigs (≥ 1 kb) | 254 | 424 | 966 | 215 | 220 | 103 | 64 |
| no. contigs (≥ 5 kb) | 159 | 256 | 257 | 161 | 157 | 88 | 49 |
| no. contigs (≥ 10 kb) | 112 | 127 | 66 | 118 | 115 | 75 | 44 |
| no. contigs (≥ 25 kb) | 57 | 21 | 1 | 56 | 54 | 51 | 30 |
| no. contigs (≥ 50 kb) | 19 | 1 | 0 | 20 | 21 | 29 | 23 |
| Data set of Caenorhabditis elegans | |||||||
| Genome fraction (%) | 95.47 | 78.81 | 88.39 | 93.92 | 92.58 | 85.61 | 94.81 |
| Duplication ratio | 1.019 | 1.020 | 1.004 | 1.008 | 1.004 | 1.004 | 1.004 |
| Largest alignment | 96,261 | 33,627 | 63,884 | 80,404 | 83,885 | 58,073 | 180,696 |
| Total aligned length | 97,504,793 | 80,514,782 | 88,972,062 | 94,936,888 | 93,192,365 | 85,981,341 | 95,338,850 |
| NG50 | 7,772 | 3,982 | 4,157 | 6,618 | 6,364 | 7,000 | 20,063 |
| NG75 | 2,793 | 1,789 | 1,402 | 2,665 | 2,486 | 3,018 | 8,732 |
| NGA50 | 7,771 | 3,903 | 4,088 | 6,581 | 6,313 | 6,736 | 18,679 |
| NGA75 | 2,783 | 1,700 | 1,277 | 2,557 | 2,325 | 2,576 | 7,495 |
| # misassembled contigs (length) | 142 (177 kb) | 524 (2402 kb) | 5 (47 kb) | 55 (215 kb) | 12 (58 kb) | 337 (2229 kb) | 475 (8519 kb) |
| no. contigs (> 0 bp) | 82,283 | 21,503 | 233,557 | 150,360 | 160,015 | 17,510 | 52,752 |
| no. contigs (≥ 250 bp) | 38,336 | 20,766 | 42,224 | 34,185 | 33,847 | 17,510 | 13,779 |
| no. contigs (≥ 1 kb) | 15,971 | 20,220 | 20,742 | 18,911 | 19,006 | 17,510 | 9,320 |
| no. contigs (≥ 5 kb) | 5,108 | 4,912 | 4,328 | 5,246 | 5,100 | 5,897 | 4,915 |
| no. contigs (≥ 10 kb) | 2,247 | 1,108 | 1,572 | 2,122 | 2,004 | 2,278 | 2,866 |
| no. contigs (≥ 25 kb) | 401 | 17 | 167 | 287 | 307 | 202 | 946 |
| no. contigs (≥ 50 kb) | 39 | 0 | 6 | 14 | 17 | 5 | 244 |
| Data set of human chromosome 14 | |||||||
| Genome fraction (%) | 92.28 | 75.96 | 71.80 | 88.30 | 88.99 | 72.33 | 93.53 |
| Duplication ratio | 1.038 | 1.005 | 1.005 | 1.007 | 1.007 | 1.007 | 1.011 |
| Largest alignment | 38,022 | 39,634 | 13,122 | 30,294 | 28,332 | 41,564 | 58,597 |
| Total aligned length | 86,648,380 | 69,025,599 | 65,344,557 | 80,527,423 | 81,146,875 | 65,606,013 | 85,602,007 |
| NG50 | 2,500 | 2,891 | 782 | 2,909 | 2,418 | 2,628 | 4,755 |
| NG75 | 1,020 | 1,154 | - | 1,207 | 1,000 | - | 2,260 |
| NGA50 | 2,500 | 2,891 | 782 | 2,909 | 2,418 | 2,628 | 4,755 |
| NGA75 | 1,014 | 1,077 | - | 1,202 | 997 | - | 2,204 |
| # misassembled contigs (length) | 123 (207 kb) | 1110 (5039 kb) | 0 (0 kb) | 51 (171 kb) | 17 (92 kb) | 358 (1606 kb) | 559 (3601 kb) |
| no. contigs (> 0 bp) | 81,314 | 21,003 | 574,441 | 97,520 | 239,297 | 21,153 | 62,461 |
| no. contigs (≥ 250 bp) | 64,638 | 20,930 | 72,766 | 40,353 | 48,947 | 21,153 | 29,935 |
| no. contigs (≥ 1 kb) | 24,310 | 20,880 | 21,035 | 22,930 | 24,237 | 21,153 | 19,521 |
| no. contigs (≥ 5 kb) | 2,988 | 3,607 | 259 | 3,460 | 2,830 | 3,398 | 4,796 |
| no. contigs (≥ 10 kb) | 468 | 559 | 1 | 525 | 390 | 651 | 1,354 |
| no. contigs (≥ 25 kb) | 7 | 4 | 0 | 12 | 1 | 10 | 105 |
| no. contigs (≥ 50 kb) | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Fig 2. Values of NG(X) and the genome coverage as a function of the misassembled contigs length.
(A) Values of NG(X) for C. Microthrix data set. (B) Genome coverage and misassemblies for C. Microthrix data set. (C) Values of NG(X) for C. elegans data set. (D) Genome coverage and misassemblies for C. elegans data set. (E) Values of NG(X) for human data set. (F) Genome coverage and misassemblies for human data set.
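For readers unfamiliar with the metric, NG(X) is commonly defined as the length of the contig at which the contigs, taken from longest to shortest, first cover X% of the reference genome length. The following is a minimal sketch of that definition, not QUAST's implementation; the function name and the toy contig lengths are assumptions.

```python
def ng(contig_lengths, genome_len, x=50):
    """Return NG(x): the contig length at which the cumulative length of the
    longest contigs first reaches x% of the reference genome length.

    Returns None when the contigs together are too short to reach x%.
    """
    threshold = genome_len * x / 100
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= threshold:
            return length
    return None


# Toy contig lengths against a hypothetical 100 bp reference.
contigs = [50, 30, 20, 10]
ng50 = ng(contigs, 100, 50)
ng75 = ng(contigs, 100, 75)
```

Under this definition, an assembly whose total length falls short of X% of the genome has no NG(X) value, which is one plausible reading of the "-" entries for NG75 in the table above.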
Scaffolding of the three data sets for the assemblers GRASShopPER, SOAPdenovo2, and SGA with the best combination of scaffolders SSPACE and SOAPdenovo2 (metrics calculated by QUAST).
| Assembler | Scaffolder | Genome fraction (%) | Largest alignment | Total aligned length | NG50 | NG75 | Misassembled scaffolds (length) |
|---|---|---|---|---|---|---|---|
| Data set of Candidatus Microthrix parvicella strain Bio17-1 | |||||||
| GRASShopPER | SSPACE | 98.602 | | | 33,570 | 16,714 | |
| SGA | SSPACE | | 101,782 | 4,161,932 | 32,697 | | 3 (37 kb) |
| SOAPdenovo2 | SOAPdenovo2 | 98.522 | 107,154 | 4,145,584 | | 17,879 | 1 (31 kb) |
| Data set of Caenorhabditis elegans | |||||||
| GRASShopPER | SOAPdenovo2 | | | | 11,352 | 4,306 | 461 (1.2 Mb) |
| SGA | SOAPdenovo2 | 94.037 | 105,248 | 94,990,734 | 10,187 | 4,149 | 443 (1.2 Mb) |
| SOAPdenovo2 | SSPACE | 93.549 | 119,149 | 93,929,080 | | | |
| Data set of human chromosome 14 | |||||||
| GRASShopPER | SSPACE | | | | 2,500 | 1,020 | 123 (207 kb) |
| SGA | SSPACE | 88.586 | 35,224 | 81,213,834 | | | 114 (268 kb) |
| SOAPdenovo2 | SOAPdenovo2 | 89.582 | 33,186 | 81,612,904 | 2,894 | 1,206 | |
Fig 3. Numbers of scaffolds.
(A) Number of scaffolds of a given length for C. Microthrix data set. (B) Number of scaffolds of a given length for C. elegans data set. (C) Number of scaffolds of a given length for human data set.
Fig 4. Graph construction algorithm.
Fig 5. An example of the fork detection made by the algorithm.
The ordered vertices are already in a path (A). Vertices from the state are in the path and they vote for the candidates, which could extend the current path (A, B). When all the vertices from the state are moved toward one branch of the fork, and many candidates from the other branch are lost, the algorithm cuts the current path at the beginning of the fork (C).
Fig 6. Contigs correction.
Visualization of breaks in the continuity of paired-end information (shown as arches) on a real data set, mapped to a contig created by the traversal step.
Fig 7. Contigs correction.
Histogram visualizing the number of reads mapped to a given contig region and having the other read from the pair mapped to a different contig or a distant part of the same contig.
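A histogram of this kind could be computed along the following lines. This is a sketch under stated assumptions, not the authors' procedure: the input format (one tuple of contig names and mapping positions per read pair), the `bin_size`, and the `max_insert` cut-off for "a distant part of the same contig" are all hypothetical.

```python
from collections import Counter


def discordant_histogram(pairs, contig, bin_size=100, max_insert=500):
    """Count, per bin of `contig`, reads whose mate maps to another contig
    or further than `max_insert` away on the same contig.

    `pairs` is an iterable of (contig1, pos1, contig2, pos2) tuples.
    """
    hist = Counter()
    for c1, p1, c2, p2 in pairs:
        # Examine each read of the pair in turn, treating the other as its mate.
        for this_c, this_p, mate_c, mate_p in ((c1, p1, c2, p2), (c2, p2, c1, p1)):
            if this_c != contig:
                continue
            if mate_c != contig or abs(mate_p - this_p) > max_insert:
                hist[this_p // bin_size] += 1
    return hist


# Hypothetical mappings: one concordant pair, one pair split across contigs,
# one pair spanning a distant region, and one pair on another contig.
pairs = [
    ("c1", 120, "c1", 300),
    ("c1", 150, "c2", 40),
    ("c1", 180, "c1", 5000),
    ("c2", 10, "c2", 200),
]
hist = discordant_histogram(pairs, "c1")
```

Bins with unusually high counts would mark candidate positions where the paired-end information breaks and a contig might need to be cut.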
Fig 8. Visualization of the problem caused by closely located forks.