Danny E Miller1,2, Cynthia Staber3, Julia Zeitlinger3,4, R Scott Hawley3,5.
Abstract
The Drosophila genus is a unique group containing a wide range of species that occupy diverse ecosystems. In addition to the most widely studied species, Drosophila melanogaster, many other members in this genus also possess a well-developed set of genetic tools. Indeed, high-quality genomes exist for several species within the genus, facilitating studies of the function and evolution of cis-regulatory regions and proteins by allowing comparisons across at least 50 million years of evolution. Yet, the available genomes still fail to capture much of the substantial genetic diversity within the Drosophila genus. We have therefore tested protocols to rapidly and inexpensively sequence and assemble the genome from any Drosophila species using single-molecule sequencing technology from Oxford Nanopore. Here, we use this technology to present highly contiguous genome assemblies of 15 Drosophila species: 10 of the 12 originally sequenced Drosophila species (ananassae, erecta, mojavensis, persimilis, pseudoobscura, sechellia, simulans, virilis, willistoni, and yakuba), four additional species that had previously reported assemblies (biarmipes, bipectinata, eugracilis, and mauritiana), and one novel assembly (triauraria). Genomes were generated from an average of 29x depth-of-coverage data that after assembly resulted in an average contig N50 of 4.4 Mb. Subsequent alignment of contigs from the published reference genomes demonstrates that our assemblies could be used to close over 60% of the gaps present in the currently published reference genomes. Importantly, the materials and reagents cost for each genome was approximately $1,000 (USD). This study demonstrates the power and cost-effectiveness of long-read sequencing for genome assembly in Drosophila and provides a framework for the affordable sequencing and assembly of additional Drosophila genomes.
Keywords: Drosophila; Genome assembly; Nanopore sequencing; third-generation sequencing
Year: 2018 PMID: 30087105 PMCID: PMC6169393 DOI: 10.1534/g3.118.200160
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. Phylogenetic tree of flies sequenced in this report, including two species (D. melanogaster and D. grimshawi) that were not sequenced here but were part of the original 12 genomes project (Drosophila 12 Genomes Consortium). Adapted from Thomas and Hahn (2017).
Stocks sequenced in this study
| Species | Stock # |
|---|---|
| | 14024-0371.13 |
| | 14023-0361.02 |
| | 14024-0381.07 |
| | 14021-0224.01 |
| | 14026-0451.02 |
| | 14021-0241.01 |
| | 15081-1352.22 |
| | 14011-0111.01 |
| | 14011-0121.94 |
| | 14021-0248.01 |
| | 14021-0251.006 |
| | 14028-0691.9 |
| | 15010-1051.87 |
| | 14030-0811.00 |
| | 14021-0261.01 |
Stocks were obtained from the Drosophila Stock Center when it was located at the University of California San Diego. The stock center is now located at Cornell University.
Base-called reads used for genome assembly.
| Species | Depth of coverage: all reads | Depth of coverage: reads >1 kb | Depth of coverage: reads >10 kb | All reads: total number of reads | All reads: average length (bp) | Reads >1 kb: number of reads | Reads >1 kb: average length (bp) | Reads >10 kb: number of reads | Reads >10 kb: average length (bp) | Longest read (bp) |
|---|---|---|---|---|---|---|---|---|---|---|
| | 44.8 | 42.9 | 20.1 | 2,085,829 | 4,227 | 1,409,289 | 5,990 | 252,071 | 15,690 | 110,391 |
| | 28.8 | 27.5 | 12.9 | 1,375,651 | 4,102 | 861,668 | 6,232 | 166,579 | 15,155 | 101,492 |
| | 22.6 | 20.5 | 9.9 | 1,590,181 | 2,909 | 716,348 | 5,861 | 109,688 | 18,473 | 279,705 |
| | 31.7 | 30.6 | 17.5 | 1,022,546 | 4,924 | 658,736 | 7,373 | 151,078 | 18,358 | 176,712 |
| | 22.5 | 21.2 | 8.2 | 1,424,426 | 3,617 | 905,481 | 5,369 | 117,316 | 15,923 | 152,351 |
| | 32.6 | 32.1 | 17.1 | 852,775 | 6,045 | 717,219 | 7,066 | 175,669 | 15,402 | 93,106 |
| | 45.4 | 44.0 | 23.5 | 1,471,959 | 5,129 | 1,063,822 | 6,871 | 189,302 | 20,659 | 245,241 |
| | 34.5 | 33.6 | 15.5 | 1,386,759 | 4,910 | 1,072,290 | 6,179 | 199,330 | 15,304 | 113,218 |
| | 33.3 | 31.7 | 13.8 | 1,417,469 | 3,936 | 905,571 | 5,868 | 154,234 | 15,000 | 104,401 |
| | 23.9 | 23.8 | 16.9 | 390,359 | 10,200 | 366,174 | 10,828 | 110,941 | 25,347 | 254,031 |
| | 30.2 | 30.0 | 25.1 | 389,278 | 12,393 | 311,889 | 15,346 | 140,196 | 28,532 | 309,608 |
| | 18.8 | 18.7 | 13.1 | 409,756 | 9,634 | 379,831 | 10,338 | 120,668 | 22,876 | 238,837 |
| | 22.2 | 21.8 | 11.9 | 1,209,939 | 5,969 | 1,017,016 | 6,976 | 189,572 | 20,402 | 282,795 |
| | 28.5 | 26.2 | 12.9 | 1,868,768 | 3,132 | 923,706 | 5,816 | 158,505 | 16,757 | 100,073 |
| | 21.7 | 21.5 | 12.6 | 509,806 | 7,277 | 462,732 | 7,947 | 103,820 | 20,699 | 195,439 |
All reads listed had quality scores ≥7.
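The depth-of-coverage values above appear consistent with total sequenced bases divided by the estimated genome size. The following is a minimal Python sketch of that arithmetic, not the authors' code. It assumes that the first data row of this table and the first data row of the assembly-statistics table (estimated genome size 196.6 Mb) describe the same species; the function name `depth_of_coverage` is hypothetical.

```python
def depth_of_coverage(n_reads: int, mean_read_length_bp: float, genome_size_mb: float) -> float:
    """Approximate depth as total sequenced bases divided by genome size."""
    total_bases = n_reads * mean_read_length_bp
    return total_bases / (genome_size_mb * 1_000_000)


# First data row of the reads table: 2,085,829 reads averaging 4,227 bp.
# The 196.6 Mb genome size comes from the first row of the assembly table
# (assumption: both rows refer to the same species).
depth = depth_of_coverage(2_085_829, 4_227, 196.6)
print(f"{depth:.1f}x")
```

Under that assumption the result rounds to the 44.8x reported for all reads in the first row. Small differences in other rows are expected because the average read lengths in the table are themselves rounded.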
Assembly statistics for published assemblies using scaffold values and contig statistics of miniasm assemblies after polishing with Racon three times followed by Pilon three times
| Species | Estimated genome size (Mb) | Published assembly: version | Published assembly: assembly size (Mb) | Published assembly: number of scaffolds | Published assembly: scaffold N50 (Mb) | Published assembly: number of scaffolds >100 kb | Published assembly: number of scaffolds >1 Mb | miniasm (“pass” reads only): assembly size (Mb) | miniasm (“pass” reads only): number of contigs | miniasm (“pass” reads only): contig N50 (Mb) | miniasm (“pass” reads only): number of contigs >100 kb | miniasm (“pass” reads only): number of contigs >1 Mb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 196.6 | 1.05 | 231.0 | 13,749 | 4.6 | 100 | 29 | 188.6 | 371 | 2.6 | 147 | 44 |
| | 195.6 | 2.0 | 169.4 | 5,523 | 3.4 | 95 | 32 | 184.6 | 666 | 2.8 | 224 | 32 |
| | 204.6 | 2.0 | 167.3 | 5,500 | 0.7 | 124 | 35 | 160.5 | 570 | 0.6 | 354 | 26 |
| | 158.9 | 1.05 | 152.7 | 5,124 | 18.7 | 38 | 8 | 127.7 | 58 | 16.6 | 29 | 21 |
| | 228.9 | 2.0 | 156.9 | 4,946 | 1.0 | 150 | 38 | 156.4 | 547 | 1.0 | 256 | 42 |
| | 157.9 | 1.0 | 117.7 | 16 | 21.1 | 12 | 7 | 131.9 | 272 | 4.7 | 64 | 28 |
| | 166.3 | 1.04 | 193.8 | 6,841 | 24.8 | 51 | 12 | 162.1 | 122 | 5.0 | 86 | 39 |
| | 197.1 | 1.3 | 188.4 | 12,838 | 1.9 | 142 | 30 | 165.7 | 415 | 3.5 | 146 | 35 |
| | 167.7 | 3.04 | 152.7 | 4,790 | 12.5 | 25 | 13 | 160.5 | 361 | 3.0 | 143 | 33 |
| | 166.7 | 1.3 | 166.6 | 14,730 | 2.1 | 101 | 23 | 133.4 | 109 | 7.4 | 59 | 23 |
| | 159.6 | 2.02 | 125.0 | 7,619 | 23.5 | 8 | 6 | 132.2 | 76 | 7.7 | 61 | 24 |
| | 210.2 | n/a | n/a | n/a | n/a | n/a | n/a | 170.5 | 482 | 0.72 | 339 | 34 |
| | 325.4 | 1.06 | 206.0 | 13,530 | 10.2 | 79 | 22 | 165.9 | 141 | 4.1 | 83 | 40 |
| | 205.4 | 1.05 | 235.5 | 14,838 | 4.5 | 80 | 38 | 197.7 | 490 | 1.5 | 270 | 49 |
| | 170.7 | 1.05 | 165.7 | 8,122 | 21.8 | 60 | 8 | 141.1 | 111 | 5.2 | 68 | 32 |
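The scaffold and contig N50 columns above use the standard N50 statistic: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The sketch below illustrates that definition in Python. It is not taken from the authors' pipeline, and the contig lengths in the example are made-up toy values.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


# Toy contig lengths (hypothetical, in Mb): the total is 54, so half is 27.
# The three longest contigs (10 + 9 + 8 = 27) reach that threshold.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))  # 8
```

Because N50 depends only on the length distribution, a scaffold N50 (which spans gaps) and a contig N50 (which does not) are not directly comparable, which is why the table reports them in separate column groups.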
BUSCO scores reveal assembly quality
| Species | Published assembly | miniasm | miniasm –> Racon x1 | miniasm –> Racon x4 | miniasm –> Pilon x1 | miniasm –> Pilon x6 | miniasm –> Racon x3 –> Pilon x3 |
|---|---|---|---|---|---|---|---|
| | 98.2 | 1.3 | 88.3 | 91.6 | 74.7 | 91.8 | 98.2 |
| | 98.6 | 3.8 | 88.3 | 92 | 77.1 | 94.7 | 98.7 |
| | 98.2 | 1.7 | 74.9 | 80.1 | 67.2 | 87.9 | 93.9 |
| | 98.6 | 2.5 | 89.9 | 90.4 | 79.5 | 95.5 | 98.6 |
| | 98.5 | 0.8 | 82.6 | 86.7 | 70.7 | 92.7 | 97.9 |
| | 98.6 | 1.4 | 91 | 94.6 | 75.9 | 95 | 98.7 |
| | 98.2 | 0.5 | 82.5 | 88.7 | 66.9 | 93.5 | 98 |
| | 96.6 | 0.4 | 80.3 | 85.3 | 62.1 | 90.9 | 98 |
| | 97.0 | 1.7 | 84.1 | 88.8 | 72.3 | 93.9 | 98 |
| | 97.2 | 1.5 | 91.3 | 92.1 | 75.3 | 95.2 | 98.7 |
| | 98.6 | 2.7 | 91.3 | 95.6 | 77.5 | 94.4 | 98.6 |
| | NA | 3.1 | 82.9 | 85.0 | 73.5 | 87.9 | 93.8 |
| | 97.5 | 1.1 | 84.9 | 89.3 | 70.9 | 93.6 | 97.7 |
| | 98.4 | 0.6 | 79.7 | 82.8 | 72.2 | 92.1 | 98.1 |
| | 98.5 | 1.3 | 86.7 | 91.2 | 73.7 | 95.5 | 98.4 |
Only complete BUSCO scores for the miniasm assembly using reads that passed filter (≥7) are shown. All values are percentages. Higher scores suggest better assembly.
Number of single nucleotide and indel polymorphisms after polishing
| Species | Racon x4: indels | Racon x4: homozygous SNPs | Racon x4: heterozygous SNPs | Pilon x6: indels | Pilon x6: homozygous SNPs | Pilon x6: heterozygous SNPs | Racon x3 then Pilon x3: indels | Racon x3 then Pilon x3: homozygous SNPs | Racon x3 then Pilon x3: heterozygous SNPs |
|---|---|---|---|---|---|---|---|---|---|
| | 107,558 | 36,905 | 10,643 | 4,189 | 294 | 45,486 | 1,056 | 445 | 10,215 |
| | 256,004 | 67,310 | 70,947 | 9,646 | 593 | 123,890 | 5,341 | 1,306 | 73,149 |
| | 310,109 | 114,548 | 654,543 | 71,581 | 3,173 | 655,658 | 59,176 | 5,056 | 674,390 |
| | 209,290 | 45,382 | 22,186 | 2,758 | 289 | 37,946 | 1,150 | 435 | 22,179 |
| | 313,426 | 85,150 | 712,410 | 75,480 | 3,359 | 708,483 | 67,204 | 5,024 | 740,846 |
| | 131,501 | 38,401 | 25,587 | 3,325 | 264 | 39,644 | 1,954 | 433 | 24,885 |
| | 213,411 | 71,191 | 36,205 | 9,772 | 311 | 49,328 | 8,516 | 755 | 35,128 |
| | 306,714 | 95,957 | 103,767 | 20,144 | 856 | 130,685 | 15,423 | 1,042 | 76,465 |
| | 266,278 | 66,505 | 20,625 | 5,603 | 298 | 54,530 | 3,265 | 379 | 19,349 |
| | 214,673 | 37,740 | 17,547 | 2,840 | 300 | 34,525 | 1,420 | 538 | 18,117 |
| | 145,544 | 39,508 | 40,197 | 5,008 | 349 | 54,565 | 3,051 | 475 | 40,803 |
| | 373,410 | 171,450 | 843,725 | 79,228 | 3,461 | 812,711 | 68,228 | 5,745 | 865,396 |
| | 211,137 | 71,644 | 19,972 | 5,503 | 265 | 48,459 | 4,207 | 757 | 20,272 |
| | 332,107 | 110,725 | 382,209 | 64,535 | 2,770 | 388,208 | 59,316 | 5,193 | 405,046 |
| | 283,296 | 92,344 | 28,566 | 4,248 | 424 | 59,967 | 1,839 | 687 | 27,336 |
Figure 2. Polishing improves assembly quality. The average BUSCO score for these 15 assembled genomes was 1.6% before polishing. The dotted line in all panels represents the average BUSCO score for all 14 published reference genomes (Table 4). Full polishing results can be found in Tables S5–S8. (A) Complete BUSCO scores for four iterations of Racon alone. (B) Complete BUSCO scores for six iterations of Pilon alone. (C) Complete BUSCO scores for three iterations of Racon followed by three iterations of Pilon.
Number of singleton reference contigs that could be placed on the contigs assembled in this study and the number of gaps closed between reference contigs on scaffolds with one or more gaps
| Species | Number of reference scaffolds with at least one gap | Number of reference contigs aligned to assembly | Number of gaps in the reference assembly | Number of reference assembly gaps potentially closed | Percentage of reference assembly gaps potentially closed |
|---|---|---|---|---|---|
| | 2,305 | 9,088 | 6,783 | 3,264 | 48% |
| | 1,258 | 3,816 | 2,558 | 1,506 | 59% |
| | 1,639 | 4,996 | 3,357 | 1,624 | 48% |
| | 1,064 | 3,550 | 2,486 | 1,190 | 48% |
| | 1,153 | 4,628 | 3,475 | 1,768 | 51% |
| | 15 | 12,459 | 12,444 | 10,726 | 86% |
| | 1,401 | 6,434 | 5,033 | 3,300 | 66% |
| | 1,636 | 15,611 | 13,975 | 11,464 | 82% |
| | 956 | 5,551 | 4,595 | 2,298 | 50% |
| | 1,558 | 8,253 | 6,695 | 5,431 | 81% |
| | 994 | 4,599 | 3,605 | 2,572 | 71% |
| | 1,101 | 5,953 | 4,852 | 3,242 | 67% |
| | 1,728 | 7,248 | 5,520 | 1,862 | 34% |
| | 1,162 | 6,562 | 5,400 | 3,704 | 69% |
| Average | 1,284 | 7,053 | 5,770 | 3,854 | 61% |
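The columns of this table are related by simple arithmetic: in every row the number of gaps equals the number of aligned reference contigs minus the number of gapped scaffolds, and the percentage is gaps potentially closed divided by gaps. The Python sketch below re-derives those values from the table's numbers. It is only a consistency check, not the authors' analysis code, and all variable names are illustrative.

```python
# Each tuple: (reference scaffolds with >= 1 gap,
#              reference contigs aligned to the assembly,
#              reference gaps potentially closed), in table row order.
rows = [
    (2305, 9088, 3264), (1258, 3816, 1506), (1639, 4996, 1624),
    (1064, 3550, 1190), (1153, 4628, 1768), (15, 12459, 10726),
    (1401, 6434, 3300), (1636, 15611, 11464), (956, 5551, 2298),
    (1558, 8253, 5431), (994, 4599, 2572), (1101, 5953, 3242),
    (1728, 7248, 1862), (1162, 6562, 3704),
]

# N contigs placed on S scaffolds leave N - S gaps between them.
gaps = [contigs - scaffolds for scaffolds, contigs, _ in rows]
closed = [c for _, _, c in rows]

# Per-species percentage of gaps potentially closed.
per_species_pct = [100 * c / g for c, g in zip(closed, gaps)]

# Two ways to summarize: mean of per-species percentages vs. pooled counts.
mean_pct = sum(per_species_pct) / len(per_species_pct)
pooled_pct = 100 * sum(closed) / sum(gaps)

print(f"mean of per-species percentages: {mean_pct:.0f}%")
print(f"pooled over all gaps: {pooled_pct:.0f}%")
```

The mean of the per-species percentages reproduces the 61% in the Average row, while pooling all gaps across species gives roughly 67%, because the species with the most gaps also have high closure rates. Either figure is consistent with the abstract's "over 60%".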
Figure 3. Gaps in the reference genome assembly can be closed using long-read data. This example shows that a 17.4-Mb contig (utg000010l) from our assembly (bottom) closes the gaps (top, gray lines) among 38 contigs (top, shaded boxes) from the D. erecta reference scaffold (scaffold_4845), potentially resolving 3.7 Mb of sequence.