Edwin A Solares, Mahul Chakraborty, Danny E Miller, Shannon Kalsow, Kate Hall, Anoja G Perera, J J Emerson, R Scott Hawley.
Abstract
Accurate and comprehensive characterization of genetic variation is essential for deciphering the genetic basis of diseases and other phenotypes. A vast amount of genetic variation stems from large-scale sequence changes arising from the duplication, deletion, inversion, and translocation of sequences. In the past 10 years, high-throughput short reads have greatly expanded our ability to assay sequence variation due to single nucleotide polymorphisms. However, a recent de novo assembly of a second Drosophila melanogaster reference genome has revealed that short-read genotyping methods miss hundreds of structural variants, including those affecting phenotypes. While genomes assembled using high-coverage long reads can achieve high levels of contiguity and completeness, concerns about cost, errors, and low yield have limited widespread adoption of such sequencing approaches. Here we resequenced the reference strain of D. melanogaster (ISO1) on a single Oxford Nanopore MinION flow cell run for 24 hr. Using only reads longer than 1 kb, which together provided approximately 30x coverage, we assembled a highly contiguous de novo genome. The addition of inexpensive paired reads and subsequent scaffolding using an optical map technology achieved an assembly with completeness and contiguity comparable to the D. melanogaster reference assembly. Comparison of our assembly to the reference assembly of ISO1 uncovered a number of structural variants (SVs), including novel LTR transposable element insertions and duplications affecting genes with developmental, behavioral, and metabolic functions. Collectively, these SVs provide a snapshot of the dynamics of genome evolution. Furthermore, our assembly and its comparison to the D. melanogaster reference genome demonstrate that high-quality de novo assembly of reference genomes and comprehensive variant discovery using such assemblies are now possible by a single lab for under $1,000 (USD).
Keywords: Drosophila; genome assembly; nanopore sequencing
Year: 2018 PMID: 30018084 PMCID: PMC6169397 DOI: 10.1534/g3.118.200162
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. Assembly strategy used in this manuscript. A lower-contiguity assembly (Canu) is merged with a higher-contiguity assembly (DBG2OLC). The resulting assembly is again merged with the Canu assembly. The genome is then polished one or more times, here with nanopolish followed by Pilon.
Statistics of reads used for genome assembly*
| Statistic | Value |
|---|---|
| Total reads | 593,354 |
| Average read length | 7,122 bp |
| Total bases sequenced | 4,184,159,334 |
| Genome coverage | 30.2x |
| Reads > 1 kb | 530,466 |
| Genome coverage in reads > 1 kb | 29.9x |
| Reads > 10 kb | 145,634 |
| Genome coverage in reads > 10 kb | 17.5x |
Only reads with quality scores ≥ 7 were used. A genome size of 140 Mb was used for all calculations.
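
The coverage values in the table follow from dividing sequenced bases by the assumed genome size. The sketch below shows that arithmetic using the 140 Mb genome size from the footnote; the function names `coverage` and `mean_read_length` are illustrative and do not come from the source.

```python
GENOME_SIZE_BP = 140_000_000  # genome size stated in the table footnote


def coverage(total_bases: int, genome_size: int = GENOME_SIZE_BP) -> float:
    """Fold coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def mean_read_length(total_bases: int, n_reads: int) -> float:
    """Average read length in bp."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return total_bases / n_reads


if __name__ == "__main__":
    total_bases = 4_184_159_334  # "Total bases sequenced" from the table
    total_reads = 593_354        # "Total reads" from the table
    print(f"coverage: {coverage(total_bases):.1f}x")
    print(f"mean read length: {mean_read_length(total_bases, total_reads):,.0f} bp")
```

With the tabulated base count this gives roughly 29.9x, close to but not identical to the 30.2x listed for all reads. The source does not say how each tabulated figure was derived or rounded, so the sketch should be read as the general formula rather than a reproduction of the authors' calculation.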
Figure 2. Read length distribution for reads with quality scores greater than or equal to 7. Length is the sequence length after basecalling by Albacore, not the length that aligned to the genome. (A) Distribution of read lengths less than 50 kb. (B) Distribution of reads 50 kb or greater. The longest read that passed quality filtering was 380 kb.
Genome assembly statistics*
| Name | Genome size (bp) | Contigs | Largest contig (bp) | N50 (bp) | L50 |
|---|---|---|---|---|---|
| FlyBase r6.16 | 142,573,024 | 2,442 | 27,905,053 | 21,485,538 | 3 |
| MiniMap | 131,856,353 | 208 | 16,991,501 | 3,866,686 | 9 |
| DBG2OLC | 131,359,678 | 339 | 13,129,070 | 9,907,730 | 6 |
| Canu | 139,205,737 | 295 | 14,326,064 | 2,971,262 | 11 |
| QuickMerge 2x | 138,130,519 | 250 | 25,434,901 | 18,616,266 | 4 |
| QM2x Nanopolish | 139,303,903 | 250 | 25,367,201 | 18,818,677 | 4 |
| QM2x NP + Pilon x2 | 140,153,080 | 250 | 25,783,280 | 18,923,871 | 4 |
| QuickMerge 2x Bionano | 142,817,829 | 231 | 28,580,427 | 21,305,147 | 3 |
Values are for scaffolds, not contigs.
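
For readers unfamiliar with the N50 and L50 columns, the following is a minimal sketch of how these statistics are conventionally computed from a list of contig or scaffold lengths. It uses the standard definition and made-up toy lengths; it is not the tool the authors used to produce the table.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig or scaffold lengths.

    Sequences are sorted from longest to shortest and summed until the
    running total reaches half of the assembly size. N50 is the length of
    the sequence that crosses that threshold; L50 is how many sequences
    were needed to reach it.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise AssertionError("unreachable: running total always reaches half")


if __name__ == "__main__":
    toy_lengths = [50, 40, 30, 20, 10]  # hypothetical contig lengths
    print(n50_l50(toy_lengths))  # (40, 2)
```

A higher N50 together with a lower L50 indicates a more contiguous assembly, which is the pattern the merged and scaffolded rows show relative to the individual assemblies.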
BUSCO scores demonstrating genome quality before and after polishing
| Name | Single copy | Duplicate | Fragmented | Missing | Complete |
|---|---|---|---|---|---|
| FlyBase r6.16 | 2,749 (98.2%) | 14 (0.5%) | 22 (0.8%) | 14 (0.5%) | 2,763 |
| MiniMap | 14 (0.5%) | 0 (0.0%) | 31 (1.1%) | 2,754 (98.4%) | 14 |
| DBG2OLC | 1,332 (47.6%) | 3 (0.1%) | 557 (19.9%) | 907 (32.4%) | 1,335 |
| Canu | 1,884 (67.3%) | 11 (0.4%) | 557 (19.9%) | 347 (12.4%) | 1,895 |
| QuickMerge (QM) 2x | 1,623 (58.0%) | 6 (0.2%) | 560 (20.0%) | 610 (21.8%) | 1,629 |
| QM 2x Nanopolish (NP) | 2,189 (78.2%) | 8 (0.3%) | 400 (14.3%) | 202 (7.2%) | 2,197 |
| QM 2x NP + Pilon x2 | 2,726 (97.4%) | 14 (0.5%) | 39 (1.4%) | 20 (0.7%) | 2,740 |
| QM 2x Pilon x2 | 2,718 (97.1%) | 14 (0.5%) | 45 (1.6%) | 22 (0.8%) | 2,732 |
| QM 2x Bionano | 2,715 (97.0%) | 15 (0.5%) | 40 (1.4%) | 29 (1.0%) | 2,730 |
| QM 2x Bionano All | 2,720 (97.2%) | 16 (0.6%) | 40 (1.4%) | 23 (0.8%) | 2,736 |
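
The percentages in the table can be derived from the raw counts, since every row sums to the same number of BUSCO genes searched and the Complete column is the sum of single-copy and duplicated genes. The sketch below illustrates that relationship; `busco_summary` is a hypothetical helper name, not part of the BUSCO software.

```python
from typing import Dict


def busco_summary(single: int, duplicate: int, fragmented: int, missing: int) -> Dict[str, float]:
    """Derive totals and percentages from the four BUSCO count categories."""
    total = single + duplicate + fragmented + missing
    if total <= 0:
        raise ValueError("counts must sum to a positive total")

    def pct(count: int) -> float:
        return round(100 * count / total, 1)

    return {
        "total": total,
        "complete": single + duplicate,  # complete = single copy + duplicated
        "single_pct": pct(single),
        "duplicate_pct": pct(duplicate),
        "fragmented_pct": pct(fragmented),
        "missing_pct": pct(missing),
        "complete_pct": pct(single + duplicate),
    }


if __name__ == "__main__":
    # Counts taken from the FlyBase r6.16 row of the table.
    print(busco_summary(single=2749, duplicate=14, fragmented=22, missing=14))
```

Running this on the FlyBase r6.16 counts gives a total of 2,799 genes and 2,763 complete genes, with 98.2% single copy, matching that row of the table.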
Figure 3. Dot plots showing colinearity of our assembled genomes with the current version of the D. melanogaster reference genome. Red dots represent regions where the assembly and the reference aligned in the same orientation; blue dots represent regions where the genomes are inverted with respect to one another. The vertical grid lines represent boundaries between chromosome scaffolds in the reference assembly. Horizontal grid lines represent boundaries between contigs (A-C) or scaffolds (D) in the assemblies reported here. (A) Plot of the Canu-only assembly against the reference genome. (B) Plot of the hybrid DBG2OLC Nanopore and Illumina assembly against the reference. (C) Plot of merged DBG2OLC and Canu assemblies showing a more contiguous assembly than either of the component assemblies. (D) Bionano scaffolding of the merged assembly resolves additional gaps in the merged assembly.
Figure 4. QUAST was used to compare each assembly to the D. melanogaster reference genome, with selected statistics presented here. (A) Greater than 90% of bases in the reference genome were aligned to each of our four assemblies. (B) The contiguity of assembly blocks aligned to the reference. (C) Total unaligned length includes contigs that did not align to the reference as well as unaligned sequence of partially aligned contigs. (D) The number of contigs that contain misassemblies in which flanking sequences are at least 1 kb apart, flanking sequences overlap by 1 kb or more, or flanking sequences align to different reference scaffolds. (E) Total count of misassemblies as described in (D). (F) Local misassemblies include those positions on the same reference scaffold in which a gap or overlap between flanking sequences is less than 1 kb [(D) and (E) show those of 1 kb or more] and larger than the maximum indel length of 85 bp. (G) Misassemblies can be subdivided into relocations (a single assembled contig aligns to the same reference scaffold but in pieces at least 1 kb apart), inversions (at least one part of a single assembled contig aligns to the reference in an inverted orientation), or translocations (parts of a single assembled contig align to different reference scaffolds). Not all misassemblies are captured in these three categories. (H) Total SNPs per assembly are shown and were not significantly different among assemblies. (I) Indels per 100 kb can be divided into small indels (those <5 bp) and large indels (≥ 5 bp). Indels >85 bp are considered misassemblies and are shown in panels D, E, or F.
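
The size thresholds in this caption (5 bp, 85 bp, and 1 kb) can be summarized as a small decision rule. The sketch below is a simplified illustration of those thresholds as the caption describes them; it is not QUAST's implementation, and the function name, category labels, and handling of the exact 1 kb boundary are assumptions.

```python
SMALL_INDEL_MAX_BP = 5      # indels below this size are "small" per the caption
MAX_INDEL_BP = 85           # indels larger than this are treated as misassemblies
LOCAL_MISASSEMBLY_MAX_BP = 1_000  # gaps/overlaps below this are "local"


def classify_discrepancy(size_bp: int, same_scaffold: bool = True) -> str:
    """Classify a gap or overlap between flanking alignments by its size.

    size_bp       -- absolute size of the gap or overlap, in bp (must be >= 1)
    same_scaffold -- whether both flanks align to the same reference scaffold
    """
    if size_bp < 1:
        raise ValueError("size_bp must be at least 1")
    if not same_scaffold:
        # Flanks on different reference scaffolds: counted in panels D and E.
        return "misassembly"
    if size_bp >= LOCAL_MISASSEMBLY_MAX_BP:
        return "misassembly"
    if size_bp > MAX_INDEL_BP:
        return "local misassembly"
    if size_bp >= SMALL_INDEL_MAX_BP:
        return "large indel"
    return "small indel"


if __name__ == "__main__":
    for size in (3, 5, 85, 86, 999, 1_000):
        print(size, classify_discrepancy(size))
```

Keeping the thresholds as named constants makes it easy to see how a given discrepancy moves between panels F and I as its size changes.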
Figure 5. Copy number increase in a 207-bp tandem array located inside the third exon of Muc26B. (A) Three tracks showing the Bionano assembly (top) with Nanopore long reads (blue) and the reference (FlyBase) assembly (red) aligned to it. The alignment gap in the reference assembly is due to the extra sequence copies in the Bionano assembly. (B) Dot plot of the reference sequence containing the tandem array aligned against itself. (C) Dot plot of the genomic region containing the tandem array in the Bionano assembly aligned against itself. As the dot plot shows, the Bionano assembly has more repeats in this region than the reference assembly in panel B. (D) Alignment dot plot between the reference genomic region (x axis) shown in (B) and the corresponding Bionano genomic region (y axis) shown in (C).
Figure 6. Mitochondrial genome annotations generated by MITOS. (A) Annotation of the reference mitochondrial genome. (B) Annotation of the mitochondrial genome assembled in this project is identical to the reference except that nad4 and nad6 in the reference assembly were both annotated as two genes: nad4 as nad4-a and nad4-b, and nad6 as nad6-a and nad6-b.