Scott Hotaling, John S Sproul, Jacqueline Heckenhauer, Ashlyn Powell, Amanda M Larracuente, Steffen U Pauls, Joanna L Kelley, Paul B Frandsen.
Abstract
The first insect genome assembly (Drosophila melanogaster) was published two decades ago. Today, nuclear genome assemblies are available for a staggering 601 insect species representing 20 orders. In this study, we analyzed the most-contiguous assembly for each species and provide a "state-of-the-field" perspective, emphasizing taxonomic representation, assembly quality, gene completeness, and sequencing technologies. Relative to species richness, genomic efforts have been biased toward four orders (Diptera, Hymenoptera, Collembola, and Phasmatodea), Coleoptera are underrepresented, and 11 orders still lack a publicly available genome assembly. The average insect genome assembly is 439.2 Mb in length with 87.5% of single-copy benchmarking genes intact. Most notable has been the impact of long-read sequencing; assemblies that incorporate long reads are ∼48× more contiguous than those that do not. We offer four recommendations as we collectively continue building insect genome resources: 1) seek better integration between independent research groups and consortia, 2) balance future sampling between filling taxonomic gaps and generating data for targeted questions, 3) take advantage of long-read sequencing technologies, and 4) expand and improve gene annotations.
Keywords: Arthropoda; Insecta; Oxford Nanopore; Pacific Biosciences; arthropod genomics; long-read sequencing
Year: 2021 PMID: 34152413 PMCID: PMC8358217 DOI: 10.1093/gbe/evab138
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Fig. 1. Taxonomic representation, contiguity, and the timeline of availability for the most-contiguous nuclear genome assembly for 601 insect species in GenBank as of November 2020. Only one assembly per named species or subspecies is included. (a) The taxonomic diversity of available insect genome assemblies. Observed versus expected numbers of genome assemblies represent the total number of available assemblies versus the number that would be expected given the proportion of all described insect diversity that each order comprises. Significance was assessed with Fisher’s exact tests. One order is underrepresented (Coleoptera) whereas four orders are overrepresented (Diptera, Hymenoptera, Collembola, Phasmatodea). Eleven orders (light red silhouettes) have no publicly available genome assembly. A breakdown of sequencing technology by order is shown in supplementary figure S1, Supplementary Material online. (b) Genome contiguity versus total assembly length. Contiguity was assessed with contig N50, the contig length at which 50% of the assembled genome is contained in contigs of that length or longer. The inset plot shows a comparison of contig N50 distributions for short-read (n = 365) versus long-read (n = 126) assemblies. Significance was assessed with a Welch’s t-test. A finer-scale breakdown by sequencing technology is shown in supplementary figure S2, Supplementary Material online. (c) The timeline of genome assembly availability for insects according to the GenBank publication date. A steady increase in contiguity has been driven largely by the rise of long-read sequencing. Labeled in (b) and (c): well-known or outlier genome assemblies in terms of model status, assembly size, or contiguity. Groups of species in the same genus are labeled with black circles. (d) Contig N50 by taxonomic group. Generally, taxa were grouped into orders except when 10 or more assemblies were available for a lower taxonomic level (family or genus).
As in (b) and (c), each point represents a single insect genome assembly.
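The contig N50 statistic used in panel (b) can be illustrated with a minimal Python sketch. This is not the authors' code; the function name `contig_n50` and the example contig lengths are hypothetical, chosen only to show the definition at work.

```python
def contig_n50(lengths):
    """Return the contig N50 for a list of contig lengths (in bp).

    N50 is the length of the contig at which, walking from the longest
    contig to the shortest, the running total first reaches at least
    half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly: six contigs totalling 290 bp.
# The two longest contigs (80 + 70 = 150 bp) already cover half of it.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(contig_n50(toy_contigs))  # 70
```

Because N50 depends only on the sorted lengths, a single very long contig can raise it sharply, which is one reason the caption reports distributions across many assemblies rather than single values.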
Fig. 2. Variation in assembly size and BUSCO gene completeness across Insecta. (a) Assembly size for all insects, grouped by order then family. To improve visualization, the upper display limit was set to 2.8 Gb. Four genome assemblies exceeded this value and are labeled with gray text (in Gb). Taxa silhouettes were either handmade or taken from PhyloPic (http://phylopic.org, last accessed July 15, 2021). (b) BUSCO results for each insect genome assembly. Each horizontal bar represents one assembly (n = 601 species) and corresponds to the same taxon in the assembly size plot to the left in (a). (c–e) Long-read versus short-read genome assembly comparisons of (c) complete BUSCOs (single and duplicated combined), (d) fragmented BUSCOs, and (e) duplicated BUSCOs only. Significance was assessed with Welch’s t-tests. (f) A comparison of BUSCO completeness versus contig N50. Each point represents the best available assembly for one taxon and groups of taxa in the same genus are labeled with black circles. Unsurprisingly, more contiguous genome assemblies also exhibit greater gene completeness. (g) Longer genes are more likely to be fragmented in insect genome assemblies, regardless of the technology used. However, the correlation between gene length and fragmentation is much stronger for short-read assemblies (Spearman’s ρ: 0.24, P < 2.2e-16) than for long-read assemblies (Spearman’s ρ: 0.08, P = 0.002). Unlike in (c–e), each circle in (g) represents the percent of fragmentation for that BUSCO gene across all long- or short-read assemblies. Thus, each gene is included twice (once for each technology). All 1,367 BUSCO genes in the OrthoDB v.10 Insecta gene set (Kriventseva et al. 2019) were used except one 2.02 kb gene that was missing in >70% of assemblies and was therefore removed from analysis and visualization. BUSCO gene lengths varied from 198 bp to 9.01 kb.
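The rank correlation reported in panel (g) relates each BUSCO gene's length to the percent of assemblies in which it is fragmented. The sketch below shows one way such a Spearman coefficient could be computed in plain Python, as the Pearson correlation of average ranks. It is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the data values are invented for the example, and no P value is computed.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented example values (NOT from the study): gene length in bp and
# percent of assemblies in which that gene is fragmented.
gene_length_bp = [198, 850, 2020, 4400, 9010]
percent_fragmented = [1.0, 0.5, 2.0, 3.5, 6.0]
print(spearman_rho(gene_length_bp, percent_fragmented))
```

Following the caption, the coefficient would be computed once over the per-gene values for short-read assemblies and once for long-read assemblies, so each gene contributes one point to each calculation.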