Scott L Allen, Emily K Delaney, Artyom Kopp, Stephen F Chenoweth.
Abstract
Long-read sequencing technology promises to greatly enhance de novo assembly of genomes for nonmodel species. Although the error rates of long reads have been a stumbling block, sequencing at high coverage permits the self-correction of many errors. Here, we sequence and de novo assemble the genome of Drosophila serrata, a species from the montium subgroup that has been well studied for latitudinal clines, sexual selection, and gene expression, but which lacks a reference genome. Using 11 PacBio single-molecule real-time (SMRT) cells, we generated 12 Gbp of raw sequence data comprising ∼65× whole-genome coverage. Read lengths averaged 8940 bp (NRead50 12,200) with the longest read at 53 kbp. We self-corrected reads using the PBDagCon algorithm and assembled the genome using the MHAP algorithm within the PBcR assembler. Total genome length was 198 Mbp with an N50 just under 1 Mbp. Contigs displayed a high degree of chromosome arm-level conservation with the D. melanogaster genome and many could be sensibly placed on the D. serrata physical map. We also provide an initial annotation for this genome using in silico gene predictions that were supported by RNA-seq data.
Keywords: Celera; Drosophila; PacBio; genome assembly; long reads; montium
Year: 2017 PMID: 28143951 PMCID: PMC5345708 DOI: 10.1534/g3.116.037598
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
D. serrata genome assembly statistics
| Description | Statistic |
|---|---|
| Number of contigs | 1360 |
| Genome size (bp) | 198,298,763 |
| Longest contig (bp) | 7,300,740 |
| < 1 kbp | 0.0% |
| 1–10 kbp | 3.3% |
| 10–100 kbp | 78.8% |
| 100–1000 kbp | 15.3% |
| > 1 Mbp | 2.6% |
| N50 (bp) | 942,627 |
| GC content | 39.13% |
Contig length percentages refer to percent total length in each size bin.
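The N50 and the size-bin percentages in the table above can be recomputed from a list of contig lengths. The following is a minimal sketch, assuming the common definition of N50 (the contig length at which contigs of that length or longer hold at least half of the assembled bases). The function names `n50` and `length_share_by_bin` and the toy lengths are illustrative only and are not taken from the assembly or from the PBcR pipeline.

```python
from bisect import bisect_right
from typing import List, Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Contigs are sorted from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


def length_share_by_bin(lengths: Sequence[int], edges: Sequence[int]) -> List[float]:
    """Return the percent of total assembled length in each size bin.

    `edges` are ascending bin boundaries in bp; a contig whose length
    equals an edge is counted in the bin that starts at that edge.
    """
    totals = [0] * (len(edges) + 1)
    for length in lengths:
        totals[bisect_right(edges, length)] += length
    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("no assembled bases to bin")
    return [100 * t / grand_total for t in totals]


# Toy contig lengths (hypothetical, not from the D. serrata assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)

# Bins matching the table: <1 kbp, 1-10 kbp, 10-100 kbp, 100-1000 kbp, >1 Mbp.
table_edges = (1_000, 10_000, 100_000, 1_000_000)
```

Applied to the real contig lengths, `length_share_by_bin(lengths, table_edges)` would return five percentages in the same order as the table rows.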
BUSCO gene content assessment for D. serrata and two D. melanogaster assemblies: version r6.05 from www.flybase.org, and the full ISO 1 PacBio assembly of Berlin et al., which consists of 790 contigs and was also constructed with the PBcR pipeline
| Category | ||||
|---|---|---|---|---|
| r6.05 | PacBio | |||
| Complete Single-copy BUSCOs (%) | 94.1 | 97.1 | 98.2 | 97.7 |
| Duplicated (%) | 2.1 | 1.0 | 0.5 | 0.6 |
| Fragmented BUSCOs (%) | 2.5 | 1.2 | 0.8 | 0.8 |
| Missing BUSCOs (%) | 1.3 | 0.8 | 0.5 | 0.9 |
A total of 2799 BUSCOs were searched that form a set of highly conserved Dipteran genes. PacBio, Pacific Biosciences; BUSCO, Benchmarking Universal Single-Copy Ortholog.
Figure 1. Alignment of the six longest contigs from the D. serrata assembly to D. melanogaster genome version 6.05. Red dots indicate a MUMmer alignment that matches to the D. melanogaster genome in the forward orientation; blue dots indicate a MUMmer alignment that matches to the D. melanogaster genome in the reverse orientation. M, million.
Figure 2. Comparison between the draft genome assembly and the physical D. serrata genome map; the image is adapted from Stocker et al. Genes in red were mapped by Drosopoulou and Scouras (1995, 1998), Drosopoulou et al. (1997, 2002), and Pardali et al. Genes in blue are also included in the linkage map produced by Stocker et al. Thin red lines are inversions found by Stocker et al. and thin black lines are inversions found by Mavragani-Tsipidou et al. Contig3208 (shown in red) was split into three parts based on the misassembly; parts 1 and 3 aligned with D. melanogaster 3R and part 2 aligned with 3L (Figure 1). Markers Act88F and hsp70 were not mapped to contigs: the former appears twice, and nomenclature changes meant we could not be certain which gene hsp70 referred to.
Figure 3. Comparison of D. serrata gene locations relative to D. melanogaster. On average, > 95% of tBLASTx hits to D. melanogaster genes (version 6.05) in each contig map to a single D. melanogaster arm.
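The per-contig summary behind Figure 3 can be illustrated with a short tally. This is a sketch under stated assumptions, not the authors' analysis code: it assumes the tBLASTx results have already been reduced to (contig, D. melanogaster arm) pairs, one per hit, and the function name `dominant_arm_fraction`, the contig names, and the toy hits are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def dominant_arm_fraction(
    hits: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[str, float]]:
    """For each contig, return the arm with the most hits and its share.

    `hits` is an iterable of (contig, arm) pairs, one pair per hit.
    The share is the fraction of that contig's hits on its top arm.
    """
    per_contig = defaultdict(Counter)
    for contig, arm in hits:
        per_contig[contig][arm] += 1

    summary = {}
    for contig, arm_counts in per_contig.items():
        top_arm, top_count = arm_counts.most_common(1)[0]
        summary[contig] = (top_arm, top_count / sum(arm_counts.values()))
    return summary


# Toy hits (hypothetical): contigA has 19 hits on 2L and 1 on 3R,
# contigB has 4 hits, all on X.
toy_hits = [("contigA", "2L")] * 19 + [("contigA", "3R")] + [("contigB", "X")] * 4
toy_summary = dominant_arm_fraction(toy_hits)
```

Averaging the second element of each tuple across contigs gives a figure comparable in kind to the > 95% reported in the caption, although the exact filtering of hits used in the paper is not specified here.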