Weihua Pan, Steve I Wanamaker, Audrey M V Ah-Fong, Howard S Judelson, Stefano Lonardi.
Abstract
Motivation: De novo genome assembly is a challenging computational problem due to the high repetitive content of eukaryotic genomes and the imperfections of sequencing technologies (i.e. sequencing errors, uneven sequencing coverage and chimeric reads). Several assembly tools are currently available, each of which has strengths and weaknesses in dealing with the trade-off between maximizing contiguity and minimizing assembly errors (e.g. mis-joins). To obtain the best possible assembly, it is common practice to generate multiple assemblies from several assemblers and/or parameter settings and try to identify the highest quality assembly. Unfortunately, often there is no assembly that both maximizes contiguity and minimizes assembly errors, so one has to compromise one for the other.
Year: 2018 PMID: 29949964 PMCID: PMC6022655 DOI: 10.1093/bioinformatics/bty255
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (A) Contigs (blue) of eight assemblies mapped to one optical molecule (green); minimum tiling path (MTP) contigs are highlighted in red; (B) final stitched contig, at the end of the iterative stitching process (observe that the final contig spans a genomic region longer than the two MTP contigs)
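The idea of a minimum tiling path in panel (A) can be illustrated with a generic greedy interval cover. This is only a sketch under stated assumptions, not the Novo&Stitch implementation: the function name, the tuple layout and the toy coordinates are hypothetical, and real optical-map alignments carry orientation and confidence scores that are ignored here.

```python
from typing import List, Tuple

# (contig name, start, end) of a contig alignment on one optical molecule.
# The layout and the coordinates below are hypothetical illustration values.
Interval = Tuple[str, int, int]


def minimum_tiling_path(alignments: List[Interval], start: int, end: int) -> List[Interval]:
    """Greedy interval cover: a smallest set of contigs spanning [start, end]."""
    ordered = sorted(alignments, key=lambda a: a[1])
    path: List[Interval] = []
    covered = start  # everything left of this coordinate is already tiled
    i = 0
    while covered < end:
        best = None
        # Among contigs that begin at or before the covered frontier,
        # keep the one that reaches furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"no contig extends the tiling past position {covered}")
        path.append(best)
        covered = best[2]
    return path


toy = [("c1", 0, 40), ("c2", 10, 70), ("c3", 30, 60), ("c4", 65, 100), ("c5", 50, 90)]
print([name for name, _, _ in minimum_tiling_path(toy, 0, 100)])
```

On the toy input, contigs c3 and c5 are redundant because c2 and c4 already cover their span, which mirrors how only some of the mapped contigs are highlighted as MTP contigs in the figure.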
Fig. 2. Pipeline of Novo&Stitch; the input to the pipeline is a set of chimera-free contigs and an optical map
Parameter choices for Canu v1.3
| CANU assembly | cMS | cMEE | cOC | Quiver |
|---|---|---|---|---|
| 1 | Low | Default | Default | |
| 2 | Low | Default | 100 | |
| 3 | High | Default | Default | |
| 4 | High | 0.15 | 100 | |
| 5 | Normal | 0.15 | 100 | |
| 6 | High | Default | 100 | |
Note: Column cMS reports the value of corMhapSensitivity, cMEE reports corMaxEvidenceErate, cOC reports corOutCoverage. Three of these assemblies were polished with Quiver.
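As a rough illustration of how the settings in this table translate into Canu invocations, the sketch below assembles one argument list per row. The three option names are the ones given in the note. Everything else is an assumption: the output prefix, the `GENOME_SIZE` placeholder, the reads file name, and the use of `-pacbio-raw` for the input reads. "Default" is modeled by omitting the option.

```python
from typing import Dict, List, Optional, Tuple

# (cMS, cMEE, cOC) per assembly, transcribed from the table; None means "Default".
CANU_SETTINGS: Dict[int, Tuple[str, Optional[float], Optional[int]]] = {
    1: ("low", None, None),
    2: ("low", None, 100),
    3: ("high", None, None),
    4: ("high", 0.15, 100),
    5: ("normal", 0.15, 100),
    6: ("high", None, 100),
}


def canu_command(assembly: int, genome_size: str, reads: str) -> List[str]:
    """Build a Canu argument list for one row of the table (sketch only)."""
    cms, cmee, coc = CANU_SETTINGS[assembly]
    prefix = f"canu{assembly}"  # hypothetical output prefix and directory
    cmd = ["canu", "-p", prefix, "-d", prefix,
           f"genomeSize={genome_size}", f"corMhapSensitivity={cms}"]
    if cmee is not None:
        cmd.append(f"corMaxEvidenceErate={cmee}")
    if coc is not None:
        cmd.append(f"corOutCoverage={coc}")
    cmd += ["-pacbio-raw", reads]  # assumes raw PacBio reads as input
    return cmd


# GENOME_SIZE and reads.fastq are placeholders, not values from the source.
print(" ".join(canu_command(4, "GENOME_SIZE", "reads.fastq")))
```

The list form can be passed directly to `subprocess.run` once the placeholders are replaced with real values.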
Assembly statistics of eight assemblies for cowpea; all reads/transcripts/BAC assemblies were mapped with BWA, MapQ30; numbers in boldface are the best statistics (min or max) across assemblies; for # contigs 100 kbp and 1 Mbp it is not obvious whether to report min or max
| | Canu1 | Canu2 | ABruijn | Falcon | Canu3 | Canu4 | Canu5 | Canu6 |
|---|---|---|---|---|---|---|---|---|
| Contig N50 (bp) | 4 859 617 | 4 498 063 | 1 896 002 | 2 869 362 | 3 280 469 | 2 797 949 | 2 666 731 | |
| Contig L50 | 30 | 32 | 74 | 49 | 42 | 51 | 55 | |
| Contig NG50 (bp) | 3 417 577 | 1 330 435 | 1 737 012 | 2 431 239 | 1 949 515 | 2 068 575 | 3 451 071 | |
| Contig LG50 | 43 | 45 | 119 | 73 | 63 | 73 | 77 | |
| Total assembled (bp) | 506 154 442 | 478 230 679 | 511 933 729 | 503 187 311 | 516 537 734 | 515 949 175 | 507 773 747 | |
| # contigs | 894 | 928 | 1820 | 1038 | 1110 | 1140 | 897 | |
| # contigs | 220 | 288 | 437 | 404 | 299 | 354 | 334 | 278 |
| # contigs | 104 | 107 | 151 | 118 | 128 | 142 | 145 | 103 |
| # contigs | 7 | 8 | 0 | 1 | 2 | 2 | 0 | |
| Longest contig (bp) | 18 473 372 | 8 846 014 | 10 554 495 | 14 090 735 | 14 331 160 | 9 775 097 | 17 211 165 | |
| WGS contigs | 98.27412% | 88.30652% | 97.84959% | 98.30618% | 98.25853% | 98.23673% | 98.73930% | |
| UCR2014 reads, % properly paired (202M) | 92.59433% | 92.30106% | 91.95107% | 92.52969% | 92.63057% | 92.62330% | 92.59763% | |
| UCR2014 reads, % mapped (202M) | 64.35764% | 63.50279% | 64.21367% | 59.49035% | 63.00587% | 63.22414% | 62.84466% | |
| Assembled transcripts, % mapped (157K) | 92.60644% | 94.83972% | 94.16235% | 92.65416% | 92.52276% | 92.46959% | 94.85657% | |
| Total length with 100% consistent LG (bp) | 331 956 528 | 338 556 993 | 379 029 914 | 312 593 019 | 356 505 616 | 349 534 672 | 347 586 448 | |
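The contiguity rows in this table (N50, L50, NG50, LG50) follow the usual definitions: sort contigs from longest to shortest and accumulate lengths until half of a reference total is reached. N50 and L50 use the total assembled length as the reference, while NG50 and LG50 use an estimated genome size. The sketch below is a minimal illustration. The function name and the toy contig lengths are hypothetical and are not taken from the cowpea assemblies.

```python
from typing import Iterable, Optional, Tuple


def nx_lx(lengths: Iterable[int], fraction: float = 0.5,
          genome_size: Optional[int] = None) -> Tuple[int, int]:
    """Return (Nx, Lx); pass genome_size to get (NGx, LGx) instead."""
    ordered = sorted(lengths, reverse=True)
    reference = genome_size if genome_size is not None else sum(ordered)
    target = fraction * reference
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            # length is the Nx value, count is the number of contigs needed (Lx)
            return length, count
    raise ValueError("contigs do not add up to the requested fraction")


toy_lengths = [80, 70, 50, 40, 30, 20, 10]   # hypothetical contig lengths, total 300
print(nx_lx(toy_lengths))                    # N50 and L50
print(nx_lx(toy_lengths, genome_size=400))   # NG50 and LG50 for an assumed genome size
```

Because the genome-size reference is larger than the assembled total in the toy case, NG50 is smaller and LG50 is larger than their N50 and L50 counterparts, which is the same direction of difference seen between the N50 and NG50 rows above.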
Assembly statistics of Novo&Stitch on the eight cowpea assemblies using either the BspQI or the BssSI optical map; ‘best of 8’ is a copy of the best statistics (boldface) among the eight assemblies in Table 2 (no individual assembly, however, has these statistics); see text about strict and loose parameters; all DNA sequences were mapped with BWA, MapQ30
| | Best of 8 | BspQI (loose) | BspQI (strict) | BssSI (loose) | BssSI (strict) |
|---|---|---|---|---|---|
| Contig N50 (bp) | 4 859 617 | 9 944 851 | 9 944 851 | 9 584 779 | 9 584 779 |
| Contig L50 | 29 | 19 | 19 | 19 | 19 |
| Contig NG50 (bp) | 3 767 556 | 9 944 851 | 8 187 172 | 7 956 155 | 7 826 863 |
| Contig LG50 | 42 | 19 | 24 | 24 | 24 |
| Total assembled (bp) | 516 817 613 | 522 393 141 | 523 526 657 | 520 162 831 | 523 249 509 |
| # contigs | 538 | 791 | 798 | 791 | 798 |
| # contigs | N/A | 211 | 218 | 211 | 218 |
| # contigs | N/A | 72 | 72 | 66 | 69 |
| # contigs | 9 | 18 | 18 | 17 | 17 |
| Longest contig (bp) | 18 498 533 | 21 980 320 | 21 980 320 | 22 385 362 | 22 385 362 |
| WGS contigs | 98.77014% | 97.77496% | 97.79009% | 97.40359% | 97.02018% |
| UCR2014 reads, % properly paired (202M) | 92.64181% | 92.57437% | 92.58778% | 92.47305% | 92.50176% |
| UCR2014 reads, % mapped (202M) | 64.38425% | 62.20807% | 62.11027% | 61.82553% | 61.63417% |
| Assembled transcripts, % mapped (157K) | 94.95582% | 93.93669% | 93.90570% | 94.01125% | 93.46803% |
| Total length with 100% consistent LG (bp) | 425 812 490 | 429 367 225 | 430 234 966 | 423 454 837 | 434 621 644 |
Statistics of six input assemblies for P. infestans and two stitched assemblies (N&S = Novo&Stitch) with strict and loose parameters; all reads were mapped with BWA, except for 1% of miSeq and 0.1% of Dovetail, which were mapped using BLAST (e-value < 1e-30)
| | Falcon | Canu10 | Canu | ABruijn10 | ABruijn | ABruijn | N&S | N&S |
|---|---|---|---|---|---|---|---|---|
| Contig N50 (bp) | 481 068 | 131 313 | 135 263 | 356 459 | 293 280 | 302 893 | 769 322 | 730 890 |
| Contig L50 | 107 | 462 | 473 | 142 | 171 | 169 | 74 | 82 |
| Total assembled (bp) | 215 910 203 | 305 686 040 | 292 352 599 | 195 768 168 | 177 232 870 | 175 149 119 | 240 150 657 | 250 416 680 |
| # contigs | 1364 | 3496 | 2863 | 835 | 888 | 867 | 1304 | 1329 |
| # contigs | 445 | 667 | 725 | 561 | 539 | 522 | 398 | 423 |
| # contigs | 36 | 12 | 7 | 19 | 9 | 10 | 54 | 55 |
| Longest contig (bp) | 4 206 720 | 1 810 393 | 1 813 497 | 2 437 907 | 2 004 950 | 1 638 783 | 4 930 683 | 4 797 067 |
| miSeq reads, % mapped (47M) | 98.4995% | 98.6503% | 98.2370% | 98.0305% | 98.3051% | 98.2923% | 98.0928% | 98.0958% |
| miSeq reads, % properly paired (47M) | 96.3825% | 97.6855% | 95.5383% | 93.5911% | 94.8410% | 94.8218% | 95.6384% | 95.7194% |
| 1% miSeq reads, % mapped (0.47M) | 97.6510% | 97.9313% | 96.7831% | 96.4854% | 97.3612% | 97.3092% | 97.1461% | 97.1521% |
| Dovetail reads, % mapped (202M) | 97.7712% | 97.8934% | 97.6519% | 97.5093% | 97.6161% | 97.5835% | 97.4953% | 97.4989% |
| Dovetail reads, % properly paired (202M) | 38.7274% | 37.5416% | 37.4140% | 38.6057% | 38.5723% | 38.4578% | 37.9324% | 37.7392% |
| 0.1% Dovetail reads, % mapped (0.2M) | 91.6264% | 92.0876% | 91.3643% | 90.8826% | 91.3535% | 91.2612% | 91.1237% | 91.1447% |