Michael J Bronski1, Ciera C Martinez2,3, Holli A Weld2, Michael B Eisen1,4,5.
Abstract
Large groups of species with well-defined phylogenies are excellent systems for testing evolutionary hypotheses. In this paper, we describe the creation of a comparative genomic resource consisting of 23 genomes from the species-rich Drosophila montium species group, 22 of which are presented here for the first time. The montium group is well-positioned for clade genomics. Within the montium clade, evolutionary distances are such that large numbers of sequences can be accurately aligned while also recovering strong signals of divergence; and the distance between the montium group and D. melanogaster is short enough that orthologous sequence can be readily identified. All genomes were assembled from a single, small-insert library using MaSuRCA, before going through an extensive post-assembly pipeline. Estimated genome sizes within the montium group range from 155 Mb to 223 Mb (mean = 196 Mb). The absence of long-distance information during the assembly process resulted in fragmented assemblies, with the scaffold NG50s varying widely based on repeat content and sample heterozygosity (min = 18 kb, max = 390 kb, mean = 74 kb). The total scaffold length for most assemblies is also shorter than the estimated genome size, typically by 5-15%. However, subsequent analysis showed that our assemblies are highly complete. Despite large differences in contiguity, all assemblies contain at least 96% of known single-copy Dipteran genes (BUSCOs, n = 2,799). Similarly, by aligning our assemblies to the D. melanogaster genome and remapping coordinates for a large set of transcriptional enhancers (n = 3,457), we showed that each montium assembly contains orthologs for at least 91% of D. melanogaster enhancers. Importantly, the genic and enhancer contents of our assemblies are comparable to those of far more contiguous Drosophila assemblies. The alignment of our own D. serrata assembly to a previously published PacBio D. serrata assembly also showed that our longest scaffolds (up to 1 Mb) are free of large-scale misassemblies. Our genome assemblies are a valuable resource that can be used to further resolve the montium group phylogeny; study the evolution of protein-coding genes and cis-regulatory sequences; and determine the genetic basis of ecological and behavioral adaptations.
Keywords: Drosophila; assembly; genome; montium
Year: 2020 PMID: 32220952 PMCID: PMC7202002 DOI: 10.1534/g3.119.400959
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Additional sample and genome information
| Species | Strain | NCBI Accession # | Coverage (x) | GC % | Freq. Variant Branches (k = 41) | Freq. Repeat Branches (k = 41) |
|---|---|---|---|---|---|---|
| | E-12502 (TKNK40) | VNJZ00000000 | 28 | 40.74 | 5.988E-04 | 4.636E-04 |
| | 14028-0471.01 | VNJW00000000 | 39 | 40.32 | 4.143E-04 | 4.930E-04 |
| | São Tomé Light | VNJL00000000 | 59 | 41.98 | 2.850E-03 | 4.005E-04 |
| | 14028-0521.00 | VNKA00000000 | 36 | 39.89 | 1.639E-04 | 2.991E-04 |
| | E-12901 (IR2-37) | VNJY00000000 | 41 | 40.32 | 1.961E-04 | 2.103E-04 |
| | 14028-0721.00 | VNKE00000000 | 33 | 40.01 | 5.872E-04 | 2.895E-04 |
| | 14028-0781.00 | VNJT00000000 | 45 | 40.51 | 3.317E-03 | 3.401E-04 |
| | 14028-0671.01 | VNJX00000000 | 32 | 40.61 | 1.230E-03 | 2.458E-04 |
| | 14028-0541.00 | VNJM00000000 | 38 | 39.80 | 1.479E-04 | 1.175E-04 |
| | E-14104 (ISGB1) | VNKF00000000 | 25 | 40.26 | NA | NA |
| | RGN 210-13 | VNKB00000000 | 37 | 40.24 | 3.179E-03 | 2.686E-04 |
| | 14028-0591.01 | VNJN00000000 | 34 | 38.40 | 9.486E-04 | 6.182E-04 |
| | 14028-0601.01 | VNJV00000000 | 40 | 40.36 | 3.686E-03 | 5.446E-04 |
| | 14028-0731.00 | VNKC00000000 | 22 | 38.05 | NA | NA |
| | 14028-0531.01 or MYS-170-D | VNJR00000000 | 47 | 39.39 | 2.047E-03 | 2.967E-04 |
| | E-14802 (EHO91) | VNKH00000000 | 29 | 40.55 | 3.982E-04 | 4.826E-04 |
| | 14028-0671.02 | VNJU00000000 | 38 | 39.11 | 1.145E-03 | 4.656E-04 |
| | 14028-0681.02 | VNKD00000000 | 31 | 38.61 | 1.098E-03 | 3.301E-04 |
| | 14020-0011.00 | VNJO00000000 | 40 | 40.02 | 1.221E-03 | 3.405E-04 |
| | 14028-0691.01 | VNKG00000000 | 28 | 39.95 | 3.047E-03 | 6.126E-04 |
| | RGN23 | VNJQ00000000 | 51 | 37.89 | 2.813E-04 | 3.785E-04 |
| | 14028-0711.00 | VNJP00000000 | 29 | 41.70 | 2.531E-04 | 3.151E-04 |
| | 14028-0531.02 | VNJS00000000 | 38 | 38.99 | 3.715E-03 | 2.936E-04 |
For D. punjabiensis, we sequenced one of two potential strains. Additional sequencing is underway to confirm the strain identification. Coverage is equal to the total amount of sequencing data (after read decontamination) divided by the estimated genome size (from SGA Preqc (Simpson 2014)). The GC % is based on the unassembled reads, not the assembly. See the Materials and Methods for additional information. The frequency of variant and repeat branches in the de Bruijn graph (k = 41) was calculated by SGA Preqc. A k-mer size of 41 was chosen to maximize the number of species that could be compared. Sequence coverage was too low to estimate these parameters at k = 41 for D. lacteicornis and D. pectinifera.
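The coverage definition in this note reduces to a single division. The following is a minimal sketch, assuming the total amount of decontaminated sequencing data is available as a base count; the function name and the example numbers are illustrative and are not taken from the paper.

```python
def sequencing_coverage(total_bases, estimated_genome_size):
    """Coverage (x) = total sequenced bases / estimated genome size (bp)."""
    if estimated_genome_size <= 0:
        raise ValueError("estimated_genome_size must be positive")
    return total_bases / estimated_genome_size


# Hypothetical example: 4 Gb of reads for a 200 Mb genome estimate.
example_coverage = sequencing_coverage(4_000_000_000, 200_000_000)
```
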
Figure 1. For all montium species, the vast majority of the assembly is present in at least gene-sized scaffolds, despite large differences in contiguity. Based on annotations of the previously assembled D. serrata genome (NCBI Drosophila serrata Annotation Release 100; Allen et al.), the average gene length is up to 6.3 kb. For each montium species, the blue bar graph shows the scaffold NG50, and the red line graph shows the percentage of the assembly (total scaffold length) present in scaffolds that are at least 6.3 kb in length. Species are listed in decreasing order of the scaffold NG50.
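The percentage plotted as the red line can be computed directly from a list of scaffold lengths. This is a minimal sketch under the assumption that scaffold lengths are already available as integers; the function name and the default threshold argument are illustrative, with 6,300 bp taken from the gene-length figure quoted in the caption.

```python
def percent_in_long_scaffolds(scaffold_lengths, min_length=6_300):
    """Percentage of total scaffold length held in scaffolds >= min_length bp."""
    total = sum(scaffold_lengths)
    if total == 0:
        return 0.0
    kept = sum(length for length in scaffold_lengths if length >= min_length)
    return 100.0 * kept / total
```
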
Genome size estimates and assembly statistics
| Species | Est. Genome Size (bp) | Total Scaffold Length (bp) | Scaffold NG50 (bp) | Longest Scaffold (bp) | Contig NG50 (bp) | Longest Contig (bp) | Total Gap Length (bp) |
|---|---|---|---|---|---|---|---|
| | 155,490,160 | 152,203,088 | 389,587 | 2,274,126 | 301,459 | 2,274,126 | 205,953 |
| | 169,148,727 | 156,593,892 | 211,718 | 1,501,252 | 164,580 | 1,183,551 | 81,207 |
| | 190,688,284 | 167,897,087 | 95,737 | 830,117 | 73,889 | 827,712 | 375,696 |
| | 155,095,574 | 151,202,254 | 78,068 | 785,450 | 65,314 | 785,450 | 69,231 |
| | 181,250,127 | 151,823,105 | 76,713 | 1,142,480 | 65,403 | 1,127,760 | 152,423 |
| | 197,448,094 | 192,339,030 | 72,420 | 1,226,934 | 64,043 | 1,083,757 | 153,372 |
| | 179,468,675 | 163,991,206 | 71,637 | 873,064 | 61,348 | 756,687 | 61,365 |
| | 209,187,412 | 187,578,810 | 65,096 | 530,507 | 51,774 | 472,464 | 116,296 |
| | 206,814,592 | 178,856,532 | 63,109 | 891,413 | 54,123 | 891,413 | 233,653 |
| | 223,398,425 | 167,807,061 | 62,249 | 2,219,437 | 43,922 | 1,355,909 | 364,662 |
| | 216,977,949 | 189,050,820 | 59,266 | 1,052,132 | 50,528 | 904,342 | 75,577 |
| | 184,673,878 | 159,679,625 | 54,224 | 1,091,401 | 43,626 | 718,797 | 79,019 |
| | 203,475,870 | 182,681,050 | 53,799 | 1,044,495 | 44,105 | 766,914 | 60,537 |
| | 220,219,034 | 149,209,000 | 52,632 | 528,734 | 41,478 | 467,725 | 58,142 |
| | 194,820,185 | 180,972,673 | 48,517 | 921,527 | 44,341 | 780,901 | 135,927 |
| | 210,769,271 | 186,167,886 | 43,287 | 498,065 | 38,201 | 498,065 | 80,447 |
| | 182,199,997 | 196,825,890 | 40,952 | 1,045,963 | 36,818 | 656,929 | 135,921 |
| | 220,036,088 | 197,420,731 | 38,365 | 491,046 | 35,679 | 491,046 | 130,216 |
| | 219,308,053 | 187,248,584 | 28,924 | 1,045,797 | 27,520 | 598,596 | 58,713 |
| | 162,918,854 | 164,601,511 | 26,461 | 331,217 | 23,897 | 301,031 | 91,716 |
| | 198,129,694 | 175,666,184 | 24,417 | 628,960 | 23,536 | 628,960 | 153,793 |
| | 217,706,973 | 190,505,469 | 23,001 | 626,542 | 21,180 | 574,375 | 226,904 |
| | 217,036,792 | 197,369,186 | 17,513 | 590,840 | 16,493 | 576,941 | 156,570 |
Genome size estimates were calculated by SGA Preqc (Simpson 2014) based on the k-mer frequency spectrum of the unassembled reads. To calculate the scaffold NG50 (Earl et al.; Bradnam et al.), scaffold lengths were ordered from longest to shortest and then summed, starting with the longest scaffold. The NG50 was the scaffold length that brought the sum above 50% of the estimated genome size. Contig lengths were estimated by splitting scaffolds on every N, including single Ns. Species are listed in decreasing order of scaffold NG50.
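The NG50 procedure and the contig-splitting rule described in this note can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' script; the function names are hypothetical, and treating lowercase `n` as a gap character is an assumption.

```python
import re


def ng50(lengths, estimated_genome_size):
    """Length of the scaffold that brings the running sum above 50% of the
    estimated genome size, scanning from longest to shortest."""
    running_sum = 0
    for length in sorted(lengths, reverse=True):
        running_sum += length
        if running_sum > 0.5 * estimated_genome_size:
            return length
    return None  # the assembly spans no more than half of the estimate


def split_into_contigs(scaffold_sequence):
    """Split a scaffold on every run of N, including single Ns.
    Assumption: lowercase 'n' is also treated as a gap."""
    return [c for c in re.split(r"[Nn]+", scaffold_sequence) if c]
```

Because the denominator is the estimated genome size rather than the assembly length, `ng50` returns `None` when the scaffolds sum to no more than half of the estimate. The contig NG50 follows by passing the lengths of the split contigs to the same function.
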
Figure 2. All montium assemblies contain high percentages of known genes despite large differences in contiguity. BUSCO (Simão et al. 2015; Waterhouse et al.) assessment results for eight montium genomes representing a diversity of genomes / assemblies. The Dipteran BUSCO set contains 2,799 genes. For each assembly, the bar graph reports the number of BUSCOs that are complete and single-copy, complete and duplicated, fragmented, and missing. The scaffold NG50 for each assembly is shown on the right.
Thousands of orthologous montium enhancers can be identified by remapping D. melanogaster enhancer coordinates onto montium assemblies
| Species | Attempted Remappings | Successful Remappings | % Successful Remappings | Reciprocal Best Hits (RBH) | % Successful Remappings that are RBH |
|---|---|---|---|---|---|
| | 3,457 | 3,450 | 99.8 | 3,361 | 97.4 |
| | 3,457 | 3,448 | 99.7 | 3,347 | 97.1 |
| | 3,457 | 3,451 | 99.8 | 3,275 | 94.9 |
| | 3,457 | 3,449 | 99.8 | 3,385 | 98.1 |
| | 3,457 | 3,449 | 99.8 | 3,377 | 97.9 |
| | 3,457 | 3,447 | 99.7 | 3,359 | 97.4 |
| | 3,457 | 3,450 | 99.8 | 3,272 | 94.8 |
| | 3,457 | 3,450 | 99.8 | 3,327 | 96.4 |
| | 3,457 | 3,449 | 99.8 | 3,406 | 98.8 |
| | 3,457 | 3,449 | 99.8 | 3,377 | 97.9 |
| | 3,457 | 3,451 | 99.8 | 3,375 | 97.8 |
| | 3,457 | 3,444 | 99.6 | 3,247 | 94.3 |
| | 3,457 | 3,449 | 99.8 | 3,384 | 98.1 |
| | 3,457 | 3,451 | 99.8 | 3,221 | 93.3 |
| | 3,457 | 3,449 | 99.8 | 3,383 | 98.1 |
| | 3,457 | 3,449 | 99.8 | 3,266 | 94.7 |
| | 3,457 | 3,452 | 99.9 | 3,368 | 97.6 |
| | 3,457 | 3,444 | 99.6 | 3,334 | 96.8 |
| | 3,457 | 3,447 | 99.7 | 3,350 | 97.2 |
| | 3,457 | 3,451 | 99.8 | 3,301 | 95.7 |
| | 3,457 | 3,448 | 99.7 | 3,258 | 94.5 |
| | 3,457 | 3,449 | 99.8 | 3,366 | 97.6 |
| | 3,457 | 3,448 | 99.7 | 3,358 | 97.4 |
| | 3,457 | 3,449 | 99.8 | 3,147 | 91.2 |
Coordinates for D. melanogaster enhancers from Kvon et al. were remapped onto aligned montium assemblies using liftOver (Hinrichs et al.). Reciprocal best hits (RBH) were identified by aligning montium sequences back to the melanogaster genome, and melanogaster sequences to the montium genomes, both using BLASTn (Camacho et al.). See Materials and Methods for additional details. For comparison, we also included the previously assembled D. kikkawai genome (Chen et al.).
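The reciprocal-best-hit criterion can be illustrated with a generic sketch over named sequences. This is not the authors' pipeline, which aligns against whole genomes and is described in their Materials and Methods. Here each hit is assumed to be a `(query, subject, bitscore)` tuple, and all identifiers and function names are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.
    hits: iterable of (query, subject, bitscore) tuples."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}
```

With whole-genome searches the subject is a genomic interval rather than a named sequence, so a real implementation would compare hit coordinates against the remapped region instead of comparing identifiers.
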
Figure 3. Pairwise BLASTn alignments between D. melanogaster enhancers and D. lacteicornis orthologs show highly similar sequences. 3,457 experimentally verified D. melanogaster enhancers from Kvon et al. were remapped onto the D. lacteicornis assembly using liftOver (Hinrichs et al.). This yielded 3,375 reciprocal best hits between the D. melanogaster and D. lacteicornis genomes. D. lacteicornis was chosen for illustrative purposes because the assembly is close to the median scaffold NG50. The 2D histogram shows query coverage and percent identity for 3,375 pairwise D. melanogaster - D. lacteicornis BLASTn (Camacho et al.) alignments. Query coverage is the percentage of D. melanogaster sequence that aligned to D. lacteicornis sequence; and percent identity is the length-weighted percent identity for hits in the alignment.
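The two quantities plotted in the histogram can be sketched as follows. The caption does not spell out how overlapping hits or the weights are handled, so this sketch makes two assumptions: overlapping hits are merged on the query before counting covered bases, and each hit's percent identity is weighted by its alignment length. Each hit is a hypothetical `(qstart, qend, alignment_length, percent_identity)` tuple with 1-based inclusive query coordinates.

```python
def query_coverage(hsps, query_length):
    """Percentage of query positions covered by at least one hit."""
    intervals = sorted((min(s, e), max(s, e)) for s, e, _, _ in hsps)
    covered = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Close the previous merged interval before starting a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / query_length


def length_weighted_identity(hsps):
    """Mean percent identity across hits, weighted by alignment length."""
    total_length = sum(length for _, _, length, _ in hsps)
    if total_length == 0:
        return 0.0
    weighted = sum(length * pident for _, _, length, pident in hsps)
    return weighted / total_length
```
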
Figure 4. Alignments between the five longest scaffolds from our D. serrata assembly and orthologous contigs from a PacBio D. serrata assembly are highly collinear. Each dotplot shows the alignment of a scaffold from our Illumina D. serrata assembly (strain 14028-0681.02) to the orthologous contig from the previously published PacBio D. serrata assembly (strain Fors4) (Allen et al.). Pairwise alignments were generated by LASTZ (Harris 2007). Parts A) through D) show alignments for different scaffolds. Parts E) and F) show the alignment of the same scaffold to different contigs. Alignments are shown in decreasing order of scaffold length.