Abstract
Largely driven by huge reductions in per-base costs, sequencing nucleic acids has become a near-ubiquitous technique in laboratories performing biological and biomedical research. Most of the effort goes to re-sequencing, but assembly of de novo generated, raw sequence reads into contigs that span as much of the genome as possible is central to many projects. Although truly complete coverage is not realistically attainable, maximizing the amount of sequence that can be correctly assembled into contigs contributes to coverage. Here we compare three commonly used assembly algorithms (ABySS, Velvet and SOAPdenovo2), and show that empirical optimization of k-mer values has a disproportionate influence on de novo assembly of a eukaryotic genome, the nematode parasite Meloidogyne chitwoodi. Each assembler was challenged with about 40 million Illumina II paired-end reads, and assemblies were performed under a range of k-mer sizes. In each instance, the optimal k-mer was 127, although based on N50 values, ABySS was more efficient than the others. That the assembly was not spurious was established using the "Core Eukaryotic Gene Mapping Approach", which indicated that 98.79% of the M. chitwoodi genome was accounted for by the assembly. Subsequent gene finding and annotation are consistent with this and suggest that k-mer optimization contributes to the robustness of assembly.
Year: 2016 PMID: 28104957 PMCID: PMC5237644 DOI: 10.6026/97320630012036
Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063
Figure 1. Empirical optimization of k-mer sizes enhances genome assembly across three software platforms (for details, see Tables 1-2). The x-axis indicates software, k-mer size, and coverage cut-off. The left y-axis indicates the length of the longest contig (bp), corresponding to the grey bars. The right y-axis indicates the N50 length (bp), corresponding to the red lines. During optimization, de novo assemblies from ABySS, Velvet, and SOAPdenovo were compared by N50 (red lines) across different k-mer sizes and coverage cut-offs. A larger N50 indicates a more contiguous assembly. At the default coverage thresholds, N50 was overall concave as k-mer size increased, peaking at 127-mer. When the coverage threshold was increased at a fixed k-mer size, N50 decreased at 127-mer but increased at 247-mer. The length of the longest contig (grey bars), though not identical, shows a similar pattern to N50. Among the selected k-mers, the largest N50 and the longest contig were achieved by ABySS at 127-mer with a coverage cut-off of 4.6.
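Since the caption ranks assemblies by N50, a minimal sketch of how N50 is commonly computed may help. It assumes the usual definition (the contig length at which the cumulative sum of contigs, sorted longest first, reaches half of the total assembled bases); the source does not state which convention or minimum contig length each assembler's statistics used. The function name `n50` and the toy contig lengths are illustrative only.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Assumed definition: sort contigs from longest to shortest and return
    the length of the contig at which the running total first reaches
    half of the total assembled length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy example (not data from the paper): total length is 300 bp,
# and the two longest contigs (80 + 70) already cover half of it.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Under this definition a few long contigs raise N50 sharply, which is why the caption treats a larger N50 as a more contiguous assembly.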
Comparison of de novo assembly over different k-mer sizes, setting other parameters at default.
We performed assemblies using k-mer values of 63, 99, 161, 197, 233, 247, 259, and 261. The value 247 was predicted by "Velvet Advisor" and 261 by "KmerGenie". At the default k-mer coverage cut-offs (5.6, 4.9, 4.0, 3.5, 3.0, 2.6, 2.2, and 2.2, respectively), ABySS showed a gradual increase in N50 from 63-mer to 161-mer and a gradual decrease from 161-mer to 261-mer. To investigate a narrower range of k-mers, we chose 130-mer, the average of the two k-mer sizes that gave the largest N50 values (161-mer: 68,049; 99-mer: 60,946). Around 130-mer, we varied the k-mer size in small steps (125, 127, 129, 131, and 135), which gave N50 values that first increased and then decreased (69,778; 70,023; 70,751; 69,968; and 69,506, respectively). Although 129-mer gave a slightly higher N50 (70,751), it lost about 20 percent of the reads from one end (unaligned reads: 975,453; singletons: 3,104,888; total one-end reads on contigs: 16,925,193 = 21,005,534 - 975,453 - 3,104,888). Thus, to retain more than 80 percent of the reads, we settled on 127-mer, which achieved the second-largest N50 (70,023) while keeping more read information (unaligned reads: 936,337; singletons: 3,050,181; total one-end reads on contigs: 17,019,016 = 21,005,534 - 936,337 - 3,050,181).
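The arithmetic in this paragraph can be checked with a short sketch. The read counts are copied from the text; the function and variable names are illustrative, and the retained fraction is simply reads on contigs divided by total one-end reads, which is an assumption about how the "80 percent" figure was derived.

```python
TOTAL_ONE_END_READS = 21_005_534  # total reads from one end, as given in the text

# Unaligned and singleton read counts reported for the two candidate k-mers.
counts = {
    127: {"unaligned": 936_337, "singleton": 3_050_181},
    129: {"unaligned": 975_453, "singleton": 3_104_888},
}


def reads_on_contigs(total, unaligned, singleton):
    """One-end reads placed on contigs = total - unaligned - singletons."""
    return total - unaligned - singleton


def midpoint_k(k_a, k_b):
    """Average of two k-mer sizes, used to pick the centre of the finer scan."""
    return (k_a + k_b) // 2


# Centre of the narrowed search: average of the two best coarse k-mers.
print("refined search centre:", midpoint_k(99, 161))

for k, c in counts.items():
    kept = reads_on_contigs(TOTAL_ONE_END_READS, c["unaligned"], c["singleton"])
    fraction = kept / TOTAL_ONE_END_READS
    print(f"k={k}: {kept:,} reads on contigs ({fraction:.1%} retained)")
```

The sketch reproduces the two subtraction results quoted in the text and shows that the 127-mer assembly keeps slightly more reads than the 129-mer assembly.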
| Software | K-mer Size | Cov. Cut-Off | Reads on Contigs | No. of Contigs | Total Length (bp) | Reads/Contig | Avg. Length (bp) | Longest Contig (bp) | N50 (bp) |
| ABySS | 63 | 5.6 | 40,630,414 | 297,634 | 169,228,606 | 137 | 569 | 403,332 | 42,265 |
| ABySS | 99 | 4.9 | 38,878,837 | 185,458 | 170,576,726 | 210 | 920 | 528,011 | 60,946 |
| ABySS | 125 | 4.6 | 37,224,400 | 132,597 | 169,195,886 | 280 | 1,276 | 758,109 | 69,778 |
| ABySS | 127 | 4.6 | 37,088,213 | 128,239 | 168,988,320 | 289 | 1,318 | 758,111 | 70,023 |
| ABySS | 129 | 4.5 | 36,955,274 | 126,133 | 169,001,206 | 292 | 1,339 | 758,113 | 70,751 |
| ABySS | 131 | 4.5 | 36,815,722 | 123,000 | 168,764,165 | 299 | 1,372 | 758,114 | 69,968 |
| ABySS | 135 | 4.4 | 36,534,686 | 118,212 | 168,707,383 | 309 | 1,427 | 733,018 | 69,506 |
| ABySS | 161 | 4.0 | 34,563,084 | 92,400 | 167,205,432 | 374 | 1,809 | 515,211 | 68,049 |
| ABySS | 197 | 3.5 | 31,610,306 | 55,721 | 162,550,195 | 567 | 2,917 | 328,230 | 55,555 |
| ABySS | 233 | 3.0 | 28,370,399 | 42,096 | 159,159,885 | 673 | 3,780 | 243,375 | 30,450 |
| ABySS | 247 | 2.6 | 27,070,127 | 44,302 | 155,691,515 | 611 | 3,514 | 243,375 | 13,486 |
| ABySS | 259 | 2.2 | 25,107,148 | 64,765 | 147,149,476 | 387 | 2,272 | 243,375 | 4,552 |
| ABySS | 261 | 2.2 | 24,517,414 | 70,074 | 143,110,673 | 349 | 2,042 | 243,375 | 3,690 |
With the selected k-mer sizes, different coverage cut-offs were compared across the three software tools.
Empirical optimization of k-mer sizes enhances genome assembly across different software platforms.
| Software | K-mer Size | Cov. Cut-Off | Reads on Contigs | No. of Contigs | Total Length (bp) | Reads/Contig | Avg. Length (bp) | Longest Contig (bp) | N50 (bp) |
| ABySS | 63 | 5.6 | 40,630,414 | 297,634 | 169,228,606 | 137 | 569 | 403,332 | 42,265 |
| ABySS | 99 | 4.9 | 38,878,837 | 185,458 | 170,576,726 | 210 | 920 | 528,011 | 60,946 |
| ABySS | 127 | 4.6 | 37,088,213 | 128,239 | 168,988,320 | 289 | 1,318 | 758,111 | 70,023 |
| ABySS | 127 | 10 | 36,951,014 | 66,039 | 158,776,887 | 560 | 2,404 | 344,946 | 46,837 |
| ABySS | 127 | 15 | 35,995,149 | 64,949 | 145,747,066 | 554 | 2,244 | 344,995 | 9,625 |
| ABySS | 247 | 2.6 | 27,070,127 | 44,302 | 155,691,515 | 611 | 3,514 | 243,375 | 13,486 |
| ABySS | 247 | 10 | 17,222,135 | 10,028 | 50,018,879 | 1,717 | 4,988 | 243,375 | 41,713 |
| ABySS | 247 | 15 | 16,818,494 | 6,379 | 48,832,892 | 2,637 | 7,655 | 241,800 | 46,442 |
| ABySS | 259 | 2.2 | 25,107,148 | 64,765 | 147,149,476 | 387 | 2,272 | 243,375 | 4,552 |
| ABySS | 259 | 10 | 16,436,784 | 7,557 | 49,225,842 | 2,175 | 6,513 | 197,242 | 42,333 |
| ABySS | 259 | 15 | 2,264,807 | 3,573 | 2,068,931 | 633 | 579 | 22,094 | 668 |
| Velvet | 63 | 5.6 | 38,385,172 | 344,938 | 167,922,256 | 111 | 487 | 154,606 | 15,745 |
| Velvet | 99 | 4.9 | 34,464,770 | 344,938 | 180,340,024 | 100 | 523 | 254,652 | 13,703 |
| Velvet | 127 | 4.6 | 31,625,371 | 193,070 | 173,230,698 | 164 | 897 | 238,111 | 27,066 |
| Velvet | 127 | 10 | 31,925,142 | 115,193 | 160,122,474 | 277 | 1,390 | 137,609 | 13,257 |
| Velvet | 127 | 15 | 31,155,644 | 121,249 | 145,123,478 | 257 | 1,197 | 109,145 | 3,417 |
| Velvet | 247 | 2.6 | 24,473,404 | 93,236 | 159,178,884 | 262 | 1,707 | 159,323 | 2,917 |
| Velvet | 247 | 10 | 15,327,828 | 17,643 | 49,107,159 | 869 | 2,783 | 159,323 | 27,978 |
| Velvet | 247 | 15 | 15,088,514 | 11,075 | 45,945,149 | 1,362 | 4,149 | 159,323 | 30,258 |
| SOAPdenovo2 | 63 | 5.6 | 39,536,184 | 92,498 | 150,923,645 | 427 | 1,632 | 206,215 | 18,340 |
| SOAPdenovo2 | 99 | 4.9 | 37,912,050 | 265,563 | 175,094,770 | 143 | 659 | 161,631 | 18,424 |
| SOAPdenovo2 | 127 | 4.6 | 36,453,910 | 83,451 | 157,061,370 | 437 | 1,882 | 144,722 | 18,720 |
| SOAPdenovo2 | 127 | 10 | 36,094,908 | 126,143 | 151,515,830 | 286 | 1,201 | 109,147 | 2,419 |
| SOAPdenovo2 | 127 | 15 | 31,333,799 | 127,446 | 103,225,180 | 246 | 810 | 109,147 | 1,018 |
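As an illustration of how the comparison in this table can be summarized, the sketch below picks, for each assembler, the k-mer size with the largest N50 among the rows run at the default coverage cut-offs. The N50 values are copied from the table; restricting to default cut-offs and ranking by N50 alone are simplifying assumptions of this sketch (the authors also weighed read retention and total assembled length), and the name `best_k_by_n50` is hypothetical.

```python
# (assembler, k-mer size, N50 in bp) for rows at the default coverage cut-offs.
# Rows with raised cut-offs (10 and 15) are deliberately left out.
default_cutoff_rows = [
    ("ABySS", 63, 42_265), ("ABySS", 99, 60_946), ("ABySS", 127, 70_023),
    ("ABySS", 247, 13_486), ("ABySS", 259, 4_552),
    ("Velvet", 63, 15_745), ("Velvet", 99, 13_703), ("Velvet", 127, 27_066),
    ("Velvet", 247, 2_917),
    ("SOAPdenovo2", 63, 18_340), ("SOAPdenovo2", 99, 18_424),
    ("SOAPdenovo2", 127, 18_720),
]


def best_k_by_n50(rows):
    """Map each assembler to the (k-mer size, N50) pair with the largest N50."""
    best = {}
    for assembler, k, n50_bp in rows:
        if assembler not in best or n50_bp > best[assembler][1]:
            best[assembler] = (k, n50_bp)
    return best


for assembler, (k, n50_bp) in best_k_by_n50(default_cutoff_rows).items():
    print(f"{assembler}: best k = {k}, N50 = {n50_bp:,} bp")
```

Among the default cut-off rows, every assembler reaches its largest N50 at 127-mer, and ABySS has the largest N50 of the three, in line with the abstract.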