InsectBase 2.0: a comprehensive gene resource for insects.

Yang Mei¹, Dong Jing¹, Shenyang Tang¹, Xi Chen¹, Hao Chen¹, Haonan Duanmu¹, Yuyang Cong¹, Mengyao Chen¹, Xinhai Ye¹, Hang Zhou¹, Kang He¹, Fei Li¹.

Abstract

Insects are the largest group of animals on the planet and have a huge impact on human life by providing resources, transmitting diseases, and damaging agricultural crop production. Recently, a large amount of insect genome and gene data has been generated. A comprehensive database is highly desirable for managing, sharing, and mining these resources. Here, we present an updated database, InsectBase 2.0 (http://v2.insect-genome.com/), covering 815 insect genomes, 25 805 transcriptomes and >16 million genes, including 15 045 111 coding sequences, 3 436 022 3'UTRs, 4 345 664 5'UTRs, 112 162 miRNAs and 1 293 430 lncRNAs. In addition, we used an in-house standard pipeline to annotate 1 434 653 genes belonging to 164 gene families; 215 986 potential horizontally transferred genes; and 419 KEGG pathways. Web services such as BLAST, JBrowse2 and Synteny Viewer are provided for searching and visualization. InsectBase 2.0 serves as a valuable platform for entomologists and researchers in the related communities of animal evolution and invertebrate comparative genomics.

Year: 2022 PMID: 34792158 PMCID: PMC8728184 DOI: 10.1093/nar/gkab1090

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Insects represent one of the largest and most diverse groups of animals on earth and play important roles in ecological stability (1), agriculture (2), the economy (3) and human health (4). With rapid technological developments, a vast amount of insect gene data has been generated, including genomes, transcriptomes, proteomes, metabolomes and chromatin interaction information detected by the Hi-C method (5,6). Although most of these data are available in public databases such as the National Center for Biotechnology Information (NCBI) (7), many are not well organized, and some are available only as raw data without annotation. This hampers the full use of these insect gene resources. Several databases have been constructed to provide well-curated annotations and well-designed data organization in the entomological field, including i5k Workspace@NAL (8), Bioinformatics Platform for Agroecosystem Arthropods (BIPAA) (https://bipaa.genouest.org/is/), VectorBase (9), FlyBase (10), LepBase (http://lepbase.org/), Hymenoptera Genome Database (11), Butterfly Genome Database (12), FireflyBase (13), SilkDB (14), KAIKObase (15), KONAGAbase (16), MonarchBase (17), LocustBase (18), BeetleBase (19), etc. Most of these databases focus on only one species or a group of closely related species, and few provide a well-designed and user-friendly platform for curating, visualizing, and sharing insect gene data. To fill this gap, we built InsectBase in 2016, which collected almost all insect genome data available at that time (20). Owing to the emergence of third-generation sequencing technology, the quantity and quality of insect gene and genome data have greatly increased in recent years. Therefore, to provide a revised and more convenient platform, we have updated InsectBase to version 2.0 with three significant improvements: (i) The quantity and quality of insect gene data are significantly increased. In total, InsectBase 2.0 contains >16 million sequences from 815 species with 207 chromosome-level genomes and 134 full-length transcriptomes. (ii) Multi-level gene and genome data are now provided, including RNA–RNA interactions, gene families, KEGG pathways and HGT genes (21). (iii) The user interface features have been enhanced to improve the web server.

MATERIALS AND METHODS

Data source

We collected insect gene and genome data from several databases (as described below) and developed standardized pipelines for annotation and identification of UTRs, miRNAs, lncRNAs, RNA–RNA interactions, gene families, KEGG pathways and genes likely derived from horizontal gene transfer (referred to as ‘potential HGT genes’).

Genome

We collected and downloaded 815 genomes from NCBI (7), BIPAA (https://bipaa.genouest.org/is/), GigaDB (22), i5k Workspace@NAL (8), InsectBase (20), LepBase (http://lepbase.org/), VectorBase (9), National Genomics Data Center (NGDC) (23), FireflyBase (13), DNA Data Bank of Japan (DDBJ) (24), SilkDB 3.0 (14), Assembled Searchable Giant Arthropod Read Database (ASGARD) (25), DNA Zoo (26), LocustBase (18), DRYAD (https://datadryad.org/stash) and Zenodo (https://zenodo.org/) (Supplementary Tables S1 and S2). Among these, 231 insect genomes were obtained with existing annotated official gene sets. A further 482 genomes were annotated using our in-house genome annotation pipeline. First, we identified and masked repeat sequences with RepeatModeler2 (v.2.0.1) (27) and RepeatMasker (http://www.repeatmasker.org) (v.4.0.7), using both de novo and homology-based methods. Next, three types of gene annotation evidence were generated. BRAKER2 (v.2.1.5) (28–34) was used to generate the de novo gene models. HISAT2 (v.2.1.0) (35) and StringTie2 (v.2.1.5) (36) were used for transcript assembly. Homology-based evidence was generated by GenomeThreader (v.1.7.1) (37). Finally, we integrated the three types of evidence with EVidenceModeler (v.1.1.1) (38) to obtain the official gene sets (OGS).

Transcriptome

25 805 transcriptomes of 439 species were downloaded from the NCBI SRA database (Supplementary Table S3) (7). The raw reads were pre-processed using fastp (v.0.21) (39) and mapped to reference genomes with HISAT2 (v.2.1.0) (35). StringTie2 (v.2.1.4) (36) was used for transcript assembly.
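
The read-processing steps described above can be sketched as an ordered list of command lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_commands`, the output file naming, the thread count, paired-end input, and the intermediate samtools sorting step (StringTie expects a coordinate-sorted BAM file) are all assumptions that do not appear in the text.

```python
from pathlib import Path


def build_commands(sample, reads_1, reads_2, hisat2_index, out_dir, threads=4):
    """Return the command lines for one paired-end RNA-seq run.

    Hypothetical helper: names and file layout are illustrative only.
    The commands are returned, not executed.
    """
    out = Path(out_dir)
    clean_1 = str(out / f"{sample}_1.clean.fq.gz")
    clean_2 = str(out / f"{sample}_2.clean.fq.gz")
    sam = str(out / f"{sample}.sam")
    bam = str(out / f"{sample}.sorted.bam")
    gtf = str(out / f"{sample}.gtf")

    return [
        # 1. Quality filtering and adapter trimming of the raw reads.
        ["fastp", "-i", reads_1, "-I", reads_2, "-o", clean_1, "-O", clean_2],
        # 2. Spliced alignment to the reference genome; --dta tailors the
        #    alignments for downstream transcript assemblers.
        ["hisat2", "--dta", "-p", str(threads), "-x", hisat2_index,
         "-1", clean_1, "-2", clean_2, "-S", sam],
        # 3. Coordinate sorting (assumed step, not mentioned in the text).
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # 4. Reference-guided transcript assembly.
        ["stringtie", bam, "-p", str(threads), "-o", gtf],
    ]


if __name__ == "__main__":
    for cmd in build_commands("run1", "run1_1.fq.gz", "run1_2.fq.gz",
                              "genome_index", "results"):
        print(" ".join(cmd))
```

Each inner list could be passed to `subprocess.run(cmd, check=True)` in order; returning the commands instead of running them keeps the sketch easy to inspect and to adapt to a workflow manager.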

ncRNA

A total of 1674 small RNA libraries from 60 species were downloaded from the NCBI SRA database (Supplementary Table S4) (7). miRNAs were predicted by miRDeep2 (v.0.1.3) (40) and MapMi (v.1.5.0) (41). TargetScan 70 (42), RNAhybrid (v.2.1.2) (43) and miRanda (v.3.3a) (44) were used for miRNA target prediction. LncRNAs and partner genes were predicted with FEELnc (v.0.2) (45) using the default parameters.

Gene family, KEGG pathway and potential HGT gene

One hundred and sixty-four gene families were annotated by BLASTP against the Swiss-Prot protein database using DIAMOND (v.2.0.0.138) (31,46). For KEGG pathways, the reference KOs of each gene were identified by BLASTP against the KEGG database, and the KEGG pathway genes were obtained by extracting the KO information of each gene (21). Potential HGT genes were identified by searching insect genes with BLAST against the NCBI non-redundant protein (nr)/nucleotide (nt) database (7). If at least 15 of the best 20 BLAST hits for a gene were from non-insect species, we treated that gene as a potential HGT gene. It should be noted that this pre-filtering method may have a high false-positive rate, and further analysis of these genes should take this into account.
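
The HGT pre-filter described above (at least 15 of the best 20 BLAST hits from non-insect species) can be illustrated with a short sketch. The function name, the `(bitscore, is_insect)` hit representation, ranking hits by bit score, and the handling of genes with fewer than 20 hits (they still need 15 non-insect hits) are assumptions made for illustration; the text does not specify these details.

```python
from typing import Iterable, Tuple

# A hit is represented as (bitscore, is_insect). In practice is_insect would
# be derived from the taxonomic lineage of the subject sequence (assumption).
Hit = Tuple[float, bool]


def is_potential_hgt(hits: Iterable[Hit],
                     top_n: int = 20,
                     min_non_insect: int = 15) -> bool:
    """Flag a gene when enough of its best hits come from non-insect species."""
    # Keep the top_n hits, ranked by descending bit score.
    best = sorted(hits, key=lambda hit: hit[0], reverse=True)[:top_n]
    non_insect = sum(1 for _, is_insect in best if not is_insect)
    return non_insect >= min_non_insect


if __name__ == "__main__":
    # 16 non-insect hits followed by 4 weaker insect hits.
    example = [(200.0 - i, False) for i in range(16)] + \
              [(100.0 - i, True) for i in range(4)]
    print(is_potential_hgt(example))
```

Because the rule looks only at hit taxonomy, it is a coarse screen, consistent with the caveat above that flagged genes need further analysis.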

Insect virus

Genome information of 1524 insect viruses was obtained and organized from the NCBI genome database (7).

Implementation of database

InsectBase 2.0 runs on an nginx (v.1.16.1) web server (http://nginx.org/) on the CentOS 7.4.1708 platform with a MySQL (v.5.7.17) database (https://dev.mysql.com/). The Django (v.3.1.3) framework (https://www.djangoproject.com/) and the Vue (v.3.0) JavaScript framework (https://v3.vuejs.org/) were used to build the website. JBrowse2 (47), a platform for visualizing and integrating biological data, was used for genome visualization. DIAMOND (v.2.0.0.138) (31), NCBI BLAST (v.2.11.0+) and BLAT (v.36) (48) were installed for sequence alignment of genes, proteins, miRNAs and lncRNAs. SynVisio (49) was hosted for visualization of genome synteny files constructed by MCScanX (50).

UPDATES IN INSECTBASE 2.0

More insect gene data with high assembly quality and standard annotations

Recent advances in third-generation sequencing techniques and chromosome conformation capture (3C) methods have provided a valuable platform for generation of high-quality genomes and full-length transcriptomes (5,6). In InsectBase 2.0, we collected 815 genomes from 457 genera and 25 805 well-assembled transcriptomes from 439 species. Among these, 207 genomes were assembled at the chromosome level and 134 full-length transcriptomes from 31 species were generated by nanopore sequencing. Using an in-house pipeline, we annotated 482 insect genomes, yielding standard official gene sets for these species. In total, we generated 15 045 111 coding sequences of 713 insects, 112 162 miRNAs from 807 insects, 1 293 430 lncRNAs representing 376 insects, 419 KEGG pathways, 7 781 686 UTRs in 374 insects and 164 gene families in 713 insects. Overall, this represents a substantial increase in insect gene and genome data from InsectBase 1.0 (Table 1).

Table 1.

Data summary of InsectBase 1.0 and 2.0

Feature	Units	v1.0	v2.0	Fold Increase
Genomes	Species	138	815	5.9
Transcriptomes	Runs	116	25 805	222.4
Coding sequences	Transcripts	160 905	15 045 111	93.5
UTRs	-	678 881	7 781 686	11.4
miRNAs	-	7544	112 162	14.9
lncRNAs	-	2439	1 293 430	530.3
Pathways	-	78	419	5.4
Gene families	-	54	164	3.0
HGT genes	-	-	215 986	New
Insect viruses	-	-	1524	New
miRNA–mRNA interactions	-	-	197 533	New
lncRNA–mRNA interactions	-	-	5 147 543	New
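
The fold increases in Table 1 are the ratio of the v2.0 count to the v1.0 count. The sketch below recomputes them from the tabulated values as a quick consistency check; the dictionary layout and function name are illustrative. Under this definition the recomputed ratios agree with the reported values to within 0.1, which is consistent with rounding or truncation to one decimal place.

```python
# (v1.0 count, v2.0 count, fold increase as reported in Table 1)
TABLE_1 = {
    "Genomes": (138, 815, 5.9),
    "Transcriptomes": (116, 25_805, 222.4),
    "Coding sequences": (160_905, 15_045_111, 93.5),
    "UTRs": (678_881, 7_781_686, 11.4),
    "miRNAs": (7_544, 112_162, 14.9),
    "lncRNAs": (2_439, 1_293_430, 530.3),
    "Pathways": (78, 419, 5.4),
    "Gene families": (54, 164, 3.0),
}


def fold_increase(v1: int, v2: int) -> float:
    """Ratio of the v2.0 count to the v1.0 count."""
    if v1 <= 0:
        raise ValueError("v1.0 count must be positive")
    return v2 / v1


if __name__ == "__main__":
    for feature, (v1, v2, reported) in TABLE_1.items():
        print(f"{feature}: computed {fold_increase(v1, v2):.2f}, reported {reported}")
```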


ncRNAs, HGT genes and insect viruses

ncRNAs participate in many important biological processes by interacting with RNAs either directly or indirectly through protein intermediates (51). Here, we predicted 197 533 miRNA–mRNA interactions and identified 1 293 737 lncRNA partner genes. HGT is a key evolutionary force that has constantly reshaped genomes throughout evolution (52). We identified 215 986 potential HGT genes from five kingdoms (Bacteria, Fungi, Metazoa [excluding Insecta], Viridiplantae and Virus; Table 1). We also collected 1524 insect viruses, which are important pathogens of many arthropod species and potential microbial control agents. These data will benefit research in the fields of gene networks, evolution and comparative analysis.

Enhanced user interface features

InsectBase 2.0 contains 12 modules, namely ‘organism’, ‘chromosome’, ‘genome’, ‘transcriptome’, ‘gene’, ‘gene family’, ‘HGT gene’, ‘KEGG pathway’, ‘insect virus’, ‘tools’, ‘links’ and ‘service’ (for searching, browsing, and downloading) (Figure 1).

Figure 1.

Main modules of InsectBase 2.0. The database provides information on organisms, genomes, transcriptomes, chromosomes, genes (protein coding genes, miRNAs and lncRNAs), gene families, HGT genes, insect pathways, insect viruses, online tools, links and additional services.

The ‘organism’ module shows a species tree modified from the NCBI Taxonomy common tree (https://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi) (7). Each order, family, genus and species is introduced with pictures from public sources such as Wikipedia (https://www.wikipedia.org/) and iNaturalist (https://www.inaturalist.org/). Users can click on a species name to access the species page, which shows information about multiple aspects of the selected species. This includes a basic introduction, genome statistics, gene information, and related publications in PubMed (https://pubmed.ncbi.nlm.nih.gov/) (Figure 2A).

Figure 2.

Enhanced user interface features of InsectBase 2.0. (A) Species: basic information, file downloads and publications related to each species. (B) Chromosome: information for each chromosome. (C) Protein coding gene: detailed information about each protein coding gene. (D) JBrowse2: genome browser for each annotated genome. (E) Genome synteny: visualization of synteny among 155 chromosome-level genomes.

The advent of high-quality genomes has greatly advanced the study of entomology. To aid the investigation of chromosome evolution, the ‘chromosome’ module displays 207 genomes with information at the chromosome level (Figure 2B). Chromosomes in 155 genomes are displayed for browsing and downloading.

Transcriptomes are an essential data resource for understanding biological processes under different conditions. The ‘transcriptome’ module contains 25 805 assembled transcriptomes with sample information, including species, gender, tissue, stage, and condition, to help researchers conduct genetic investigations under different conditions or treatments.

The ‘gene information’ module allows the user to conduct an advanced search for protein coding genes, miRNAs, and lncRNAs by species, gene name, and gene description. Beyond the basic information of the selected gene, the gene structure, gene sequence and gene interactions such as mRNA–miRNA and mRNA–lncRNA interactions are displayed. By clicking on an interacting gene, users can access the related gene page (Figure 2C).

Gene families often exhibit apparent expansion or contraction in terms of gene numbers or structures. Gene family analysis is not only essential for uncovering gene functions, but is also frequently used to reveal the evolutionary mechanisms of gene gain and loss. Hence, InsectBase 2.0 analysed 164 gene families by annotating them with an in-house pipeline. The ‘gene family’ module allows users to easily search and download gene families of interest in a given species.

In addition to conventional tools such as DIAMOND (31), BLAT (48) and BLAST, we constructed a comprehensive genome browser covering all annotated genomes using JBrowse2 (47) (Figure 2D). Moreover, InsectBase 2.0 provides a genome synteny visualization tool. Genome synteny between 155 chromosome-level genomes is visualized for chromosome evolution analysis (Figure 2E).

DISCUSSION AND FUTURE DEVELOPMENT

At present, insect genome and gene data are stored in multiple databases once they are generated (53). InsectBase 2.0 uses standard pipelines to predict protein coding genes, miRNAs, lncRNAs and UTRs, promoting standardisation of comparative genomics. In addition, gene families, KEGG pathways and genes potentially involved in many crucial biological processes (such as pesticide detoxification metabolism and host-seeking) are annotated. In summary, InsectBase 2.0 is a substantially improved database for insect gene resources and serves as a valuable resource to meet the needs of entomologists and the related research communities of animal evolution and invertebrate comparative genomics. We will continue to add newly available data and new features. For example, the three-dimensional (3D) organization of genomes plays an essential role in gene regulation. With the development of 3C-based techniques such as Hi-C, ChIA-PET, Capture-C and Capture Hi-C, chromosome interaction information has provided an unprecedented opportunity to study spatial organization in a genome-wide fashion (54). We plan to analyse these data and add associated features in the next update. The recently developed AlphaFold2 (55) predicts protein structure with high accuracy, which would be highly valuable for investigating protein–protein binding, enzyme active sites, and the functional implications of genetic mutations. We thus plan to integrate this tool in the next update.

DATA AVAILABILITY

All data in InsectBase 2.0 are available for downloading. The database can be accessed at http://v2.insect-genome.com/. The genome annotation pipeline is available at https://github.com/meiyang12/Genome-annotation-pipeline.

52 in total

1. Increased interactivity and improvements to the GigaScience database, GigaDB.

Authors: Si Zhe Xiao; Chris Armit; Scott Edmunds; Laurie Goodman; Peter Li; Mary Ann Tuli; Christopher Ian Hunter
Journal: Database (Oxford) Date: 2019-01-01 Impact factor: 3.451

2. Comprehensive mapping of long-range interactions reveals folding principles of the human genome.

Authors: Erez Lieberman-Aiden; Nynke L van Berkum; Louise Williams; Maxim Imakaev; Tobias Ragoczy; Agnes Telling; Ido Amit; Bryan R Lajoie; Peter J Sabo; Michael O Dorschner; Richard Sandstrom; Bradley Bernstein; M A Bender; Mark Groudine; Andreas Gnirke; John Stamatoyannopoulos; Leonid A Mirny; Eric S Lander; Job Dekker
Journal: Science Date: 2009-10-09 Impact factor: 47.728

3. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades.

Authors: Marc R Friedländer; Sebastian D Mackowiak; Na Li; Wei Chen; Nikolaus Rajewsky
Journal: Nucleic Acids Res Date: 2011-09-12 Impact factor: 16.971

4. fastp: an ultra-fast all-in-one FASTQ preprocessor.

Authors: Shifu Chen; Yanqing Zhou; Yaru Chen; Jia Gu
Journal: Bioinformatics Date: 2018-09-01 Impact factor: 6.937

5. Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features.

Authors: Hiroaki Iwata; Osamu Gotoh
Journal: Nucleic Acids Res Date: 2012-07-30 Impact factor: 16.971

6. MicroRNA targets in Drosophila.

Authors: Anton J Enright; Bino John; Ulrike Gaul; Thomas Tuschl; Chris Sander; Debora S Marks
Journal: Genome Biol Date: 2003-12-12 Impact factor: 13.583

7. A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence.

Authors: Osamu Gotoh
Journal: Nucleic Acids Res Date: 2008-03-15 Impact factor: 16.971

8. JBrowse: a dynamic web platform for genome visualization and analysis.

Authors: Robert Buels; Eric Yao; Colin M Diesh; Richard D Hayes; Monica Munoz-Torres; Gregg Helt; David M Goodstein; Christine G Elsik; Suzanna E Lewis; Lincoln Stein; Ian H Holmes
Journal: Genome Biol Date: 2016-04-12 Impact factor: 13.583

9. DDBJ update: streamlining submission and access of human data.

Authors: Asami Fukuda; Yuichi Kodama; Jun Mashima; Takatomo Fujisawa; Osamu Ogasawara
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971

10. FlyBase: updates to the Drosophila melanogaster knowledge base.

Authors: Aoife Larkin; Steven J Marygold; Giulia Antonazzo; Helen Attrill; Gilberto Dos Santos; Phani V Garapati; Joshua L Goodman; L Sian Gramates; Gillian Millburn; Victor B Strelets; Christopher J Tabone; Jim Thurmond
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971
