The First Chromosome-level Genome Assembly of Cheumatopsyche charites Malicky and Chantaramongkol, 1997 (Trichoptera: Hydropsychidae) Reveals How It Responds to Pollution.

Xinyu Ge¹, Jianfeng Jin¹, Lang Peng¹, Haoming Zang¹, Beixin Wang¹, Changhai Sun¹.

Abstract

Trichoptera is a highly adapted group of freshwater insects. They are generally more sensitive to dissolved oxygen and water quality than most freshwater organisms, and this sensitivity allows them to be used as reliable biological indicators of water quality. At present, no chromosome-level genome of a hydropsychid species is available. Cheumatopsyche charites Malicky & Chantaramongkol, 1997 can survive and thrive in polluted streams where other caddisflies are infrequent, suggesting that it is tolerant to latent contamination. Here we report a high-quality chromosome-level genome assembly of C. charites generated by combining PacBio long reads and Hi-C reads. We obtained a genome assembly of 223.23 Mb, containing 68 scaffolds with an N50 length of 13.97 Mb, and 155 contigs (99.67%) anchored into 16 pseudochromosomes. We identified 36.12 Mb (16.18%) of the genome as repetitive elements, identified 369 noncoding RNAs, and predicted 8,772 protein-coding genes (96.80% BUSCO completeness). Gene family evolution analyses identified 7,148 gene families, of which 41 experienced rapid evolution. The expanded gene families were shown to be involved in detoxification metabolism, digestive absorption, and resistance to viruses or bacteria. This high-quality genome provides a valuable genomic basis for the study of trichopteran evolution.

Keywords: adaptation; caddisflies; comparative genomics; gene family evolution

Substances: Oxygen

Year: 2022; PMID: 36073551; PMCID: PMC9539401; DOI: 10.1093/gbe/evac136

Source DB: PubMed; Journal: Genome Biol Evol; ISSN: 1759-6653; Impact factor: 4.065

Caddisflies are a significant group of aquatic insects. Their species diversity is greater than that of all other wholly aquatic insect orders combined. To date, 23 trichopteran genomes have been published, but no chromosome-level hydropsychid genome has yet been produced. The lack of such a genome restricts our understanding of trichopteran biology and evolution. Here, we constructed a chromosome-level genome assembly of Cheumatopsyche charites and performed sex chromosome identification and gene family evolution analysis. Our results will improve our understanding of trichopteran evolution and provide a valuable reference genome for future comparative genomic studies.

Introduction

Trichoptera (caddisflies) is an order of Insecta containing more than 16,000 extant species, the seventh-highest number of any insect order (Holzenthal et al. 2011; Jonason et al. 2017; Clair et al. 2018). Caddisflies are among the most successful insect lineages and are often important and ubiquitous members of aquatic communities in both running (lotic) and standing (lentic) water bodies (Wiggins 1996; Holzenthal et al. 2007; Morse et al. 2019). The cosmopolitan family Hydropsychidae (Curtis 1835) is the third largest family of Trichoptera (Geraci et al. 2005; Ge et al. 2020). The hydropsychine genus Cheumatopsyche (Wallengren 1891) is represented on every continent except South America, and includes nearly 390 species (Morse 2022). These species are widely distributed, occurring in all biogeographical regions except the Neotropics and Antarctica (Oláh et al. 2008). Cheumatopsyche larvae tend to be dominant in warmer streams, as they have considerably high stress resistance; they are also often present in polluted streams where other caddisflies are usually absent (Wiggins 1996). By biomass, Cheumatopsyche is the dominant genus in the order Trichoptera, and larval population densities can reach 250,000 individuals per square meter (Gibbs 1973; Botosaneanu et al. 1990). The food of Cheumatopsyche larvae consists mainly of algae with a small detrital component (Coffman et al. 1971). Through their feeding, the larvae remove large amounts of suspended solids from all kinds of running fresh water and may thereby help maintain the purification capacity of these waters (Morse et al. 2019). Despite the importance of Cheumatopsyche, no high-quality chromosome-level assembly is available for the genus. At present, 23 trichopteran genomes have been published, of which two are for hydropsychid species (Heckenhauer et al. 2019, 2022).
In this study, we report the first chromosome-level hydropsychid genome assembly (Cheumatopsyche charites Malicky and Chantaramongkol 1997) and the identification of its Z chromosome. We also annotated all predicted repetitive elements, noncoding RNAs (ncRNAs), and protein-coding genes (PCGs). Using these data, we reconstructed the phylogenetic relationships among C. charites and 11 other insect species spanning eight orders and analyzed the evolution of key gene families. The construction of a chromosome-level genome for C. charites is an important preliminary step for future genomic and applied research on caddisflies, and may also enable further comparative studies of their evolution and biology.

Results and Discussion

Genome Survey and Assembly

In total, we generated 148.57 Gb of clean reads for the genome and transcriptome assemblies. These included 37.42 Gb (168×) of PacBio long reads, 60.55 Gb (male 26.01 Gb, 117×; female 34.54 Gb, 154×) of Illumina short reads, 43.97 Gb (197×) of Hi-C reads, and 6.63 Gb of transcriptome reads. After quality control, we retained 26.01 Gb of Illumina reads from the male adult specimen and 32.37 Gb of Illumina reads from the female adult specimen for genomic mapping, polishing, and subsequent identification of the sex chromosome. Our genome survey suggested that C. charites had a genome size of 218.9–220.2 Mb. Heterozygosity analyses revealed that our C. charites specimens had relatively low heterozygosity (0.83–0.88%). Flye was used to create a preliminary assembly of the long reads, resulting in a 225.08 Mb assembly with an N50 length of 2.85 Mb. After polishing, redundancy removal, Hi-C scaffolding, and contaminant removal, we obtained an assembled C. charites genome of 223.23 Mb. The lengths of the 16 identifiable pseudochromosomes were relatively even, ranging from 10.35 to 16.01 Mb (fig. 1). When the female Illumina reads were mapped to the assembly, the average autosomal sequencing depth (approximately 140×) was markedly higher than the depth of the sex (Z) chromosome (approximately 76.39×), which is roughly half the autosomal value. Our assembly achieved a high BUSCO completeness score of 99.40% (table 1), and we obtained high mapping rates for both Illumina (97.23%) and PacBio (97.61%) reads. The two comparable hydropsychid genomes (Hydropsyche tenuis, 229.66 Mb, and Parapsyche elsis, 282.18 Mb) were both larger than that of C. charites. Moreover, the BUSCO completeness score of C. charites was higher than those of the extant public chromosome-level trichopteran genomes, suggesting that the obtained assembly is of high quality (table 1).
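The depth comparison used to recognize the Z chromosome can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: the function name `flag_hemizygous`, the ratio window of 0.4 to 0.65, and the per-chromosome labels are hypothetical, and only the two depth values (about 140× for autosomes, 76.39× for the Z chromosome) come from the text. The idea is that a chromosome present in a single copy in the sequenced female should be covered at roughly half the depth of the autosomes.

```python
from statistics import median


def flag_hemizygous(mean_depth, low=0.4, high=0.65):
    """Return {chromosome: depth ratio} for chromosomes whose mean depth,
    relative to the median across chromosomes, falls within [low, high].

    The median is used as the autosomal baseline because most chromosomes
    are autosomes. The window [low, high] is an assumed, adjustable cutoff.
    """
    baseline = median(mean_depth.values())
    flagged = {}
    for chrom, depth in mean_depth.items():
        ratio = depth / baseline
        if low <= ratio <= high:
            flagged[chrom] = round(ratio, 3)
    return flagged


# Hypothetical input: 15 autosomes at the ~140x reported in the text and
# one chromosome at the 76.39x reported for the Z chromosome.
depths = {f"chr{i}": 140.0 for i in range(1, 16)}
depths["chrZ"] = 76.39

candidates = flag_hemizygous(depths)
print(candidates)
```

With real data, the per-chromosome mean depths would come from the read alignments, and the window would need checking against the actual spread of autosomal depths.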

Fig. 1

Genome landscape, characteristics, and phylogeny of Cheumatopsyche charites. (a) Genome-wide Hi-C contact map of the 16 chromosomes of C. charites. (b) Characteristics of each C. charites chromosome: protein-coding gene density (Gene), guanine–cytosine content (GC), DNA transposons (DNA), long-interspersed elements (LINE), long-terminal repeat elements (LTR), short-interspersed elements (SINE), and chromosome length (Chr). (c) Phylogeny and gene family analysis of C. charites.

Table 1

Assembly Statistics of Cheumatopsyche charites Compared with Chromosome-level Trichoptera Genomes

	Cheumatopsyche charites [a]	Limnephilus marmoratus	Glyphotaelius pellucidus
Accession no.	JACZEN000000000	GCA_009617725.1	JAGVSN000000000
Assembly size (Mb)	223.23	1629.97	1037.14
Number of scaffolds/contigs	68/207	68/395	57/285
Longest scaffold/contig (Mb)	16.52/6.87	77.26/28.54	51.24/43.27
N50 scaffold/contig length (Mb)	13.97/2.85	56.17/8.02	36.81/8.16
GC (%)	32.97	35.73	35.10
Gaps (%)	0.01	0.007	0.004
BUSCO completeness (%) [b]	C: 99.4% (S: 99.0%, D: 0.4%), F: 0.3%, M: 0.3%	C: 97.8% (S: 96.9%, D: 0.9%), F: 0.3%, M: 1.9%	C: 98.1% (S: 97.2%, D: 0.9%), F: 0.1%, M: 1.8%

C, complete; S, single copy; D, duplicated; F, fragmented; M, missing.

[a] Assembly produced in this study.

[b] n = 1,367 (Insecta).
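For readers unfamiliar with the N50 statistic reported in table 1, the following Python sketch shows how it is commonly computed from a list of sequence lengths. The function name `n50` and the toy lengths are illustrative assumptions; the assembly tools used in this study compute the value themselves.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and their lengths are
    accumulated; N50 is the length of the sequence at which the running
    total first reaches at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example with hypothetical scaffold lengths (arbitrary units).
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))
```

In the toy example the total length is 20, and the two longest sequences (6 and 5) together exceed half of it, so the N50 is 5.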

Genome Annotation

RepeatMasker identified a total of 36.12 Mb (16.18%) of the C. charites genome as repetitive elements. Of these, unclassified elements represented the largest proportion (10.98%), which may reflect how little the repetitive elements of Trichoptera species have been studied. Other, less common repetitive elements included DNA transposons (3.24%), long-terminal repeat elements (LTR, 0.18%), long-interspersed elements (LINE, 0.17%), and simple repeats (1.15%; fig. 1; Supplementary table S1, Supplementary Material online). The proportion of repetitive elements was considerably lower than that in other trichopteran genomes (Heckenhauer et al. 2022). We also identified 369 ncRNA sequences, including 29 ribosomal RNAs, 52 microRNAs, 31 small nuclear RNAs, one long ncRNA, two ribozymes, 32 other ncRNAs, and 222 tRNAs (Supplementary table S2, Supplementary Material online). Next, we used the MAKER3 pipeline to identify 8,772 PCGs in the C. charites genome. The PCGs had a mean length of 5,974.8 bp and an average of 12.5 exons per gene. The BUSCO completeness of the PCGs was 96.8% (n = 1,367). Diamond searches aligned 8,699 (99.17%) of the predicted genes to the UniProtKB database. We then functionally annotated 7,963 Gene Ontology (GO) items, 6,787 KEGG pathway items (combining InterProScan and eggNOG results), 5,943 MetaCyc items, 7,568 Reactome items, and 8,410 COG Functional Categories.
To reveal the evolutionary relationships among gene families, we used OrthoFinder to identify and characterize gene families among 12 species in eight orders. In total, 160,141 genes were assigned to 15,534 orthogroups (gene families), of which 3,968 orthogroups were present in all species. We also found that 3,521 orthogroups and 16,061 genes were species specific, and 1,108 orthogroups were single copy (fig. 1). We used the orthologous single-copy genes to construct a phylogenetic tree.
Of the single-copy genes, 172 were removed by IQ-TREE using the “symtest” function, and the remaining 936 loci were used to construct the phylogenetic tree. The resulting phylogeny was highly consistent with those reported by Misof et al. (2014) and Wipfler et al. (2019), and confirmed the monophyly of Trichoptera. Our results also indicated that Trichoptera originated during the late Carboniferous period (312–314 Ma) as a sister clade of Lepidoptera, and that the Stenopsychidae and the Hydropsychidae diverged in the early Jurassic period (180 Ma). These observations agree with earlier results (Thomas et al. 2020). Based on this phylogeny, we conducted gene family expansion and contraction analyses. First, 10,992 genes from the whole C. charites genome assembly were clustered into 7,148 gene families; of these, 1,287 had expanded and 1,846 had contracted. Of these gene families, 41 had experienced significant expansion (P < 0.05), including cytochrome P450, trypsin, and cuticle protein A3A. Previous reports indicate that these gene families are involved in processes such as detoxification metabolism, digestive absorption, and resistance to infection by foreign viruses or bacteria. We therefore speculate that large expansions of the immune-, digestion-, and detoxification metabolism–related gene families were crucial for the adaptation of caddisflies, especially as larvae, to polluted and harsh environments. With deeper sampling and transcriptomic analyses of different instars in the future, we will better understand which specific genes may be involved in the adaptation of hydropsychid species to relatively contaminated environments. In addition, new high-quality genomes of Hydropsychidae species may provide valuable information for interpreting the phylogeny of Trichoptera.
Finally, the identification of the trichopteran Z chromosome may serve as an important reference for further analysis of the mechanisms involved in Z0 sex determination, both in Trichoptera and in Insecta at large.
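The orthogroup categories reported in this section (present in all species, single copy, species specific) can be illustrated with a small Python sketch. It assumes a simple table of gene counts per orthogroup and species; the function name `classify_orthogroups`, the category definitions as coded here, and the toy data are assumptions for illustration and may differ in detail from how OrthoFinder reports these classes.

```python
def classify_orthogroups(counts, species):
    """Classify orthogroups from per-species gene counts.

    counts:  {orthogroup: {species: number of genes}}
    species: list of all species in the analysis

    Returns a dict with three lists of orthogroup names:
      universal        - at least one gene in every species
      single_copy      - exactly one gene in every species
      species_specific - genes in only one species
    """
    summary = {"universal": [], "single_copy": [], "species_specific": []}
    for orthogroup, per_species in counts.items():
        present = [s for s in species if per_species.get(s, 0) > 0]
        if len(present) == len(species):
            summary["universal"].append(orthogroup)
            if all(per_species[s] == 1 for s in species):
                summary["single_copy"].append(orthogroup)
        elif len(present) == 1:
            summary["species_specific"].append(orthogroup)
    return summary


# Hypothetical toy data for three species.
species = ["sp1", "sp2", "sp3"]
counts = {
    "OG1": {"sp1": 1, "sp2": 1, "sp3": 1},  # single copy in all species
    "OG2": {"sp1": 3, "sp2": 1, "sp3": 2},  # universal, multi-copy
    "OG3": {"sp1": 4},                      # specific to sp1
    "OG4": {"sp1": 1, "sp2": 2},            # shared by two species only
}
result = classify_orthogroups(counts, species)
print(result)
```

Under these definitions every single-copy orthogroup is also universal, which matches the way single-copy genes are drawn from the orthogroups shared by all species before being used for tree building.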

Materials and Methods

Sample Collection and Sequencing

Adult specimens of C. charites were collected using pan traps with 15 W ultraviolet light tubes on May 19, 2020, at the Longchuan River, Yuanmou County (25.961°N, 101.874°E; Alt: 880.1 m), Yunnan Province, China. Samples were flash frozen and stored at −80 °C before extraction. All samples were washed with ddH2O and the intestines were removed (to minimize gut microbial contamination) before the samples were sent for sequencing. Specimens were identified by X-Y Ge and C-H Sun using integrated morphology and COI barcoding. One male and one female specimen were used for Illumina whole-genome sequencing. One male individual was used for Illumina transcriptomic sequencing and Hi-C sequencing. Ten male specimens were used for PacBio long-read sequencing. Genomic DNA and RNA extraction, short-read library construction, and DNA digestion procedures followed those reported by Wang et al. (2022). A long-read library with a 15-kb insert size was constructed and sequenced on the PacBio Sequel II platform. All library construction and sequencing were performed at Berry Genomics.
BBTools v38.29 (Bushnell 2014) was used for quality control of the raw second-generation (Illumina) sequence data. Quality control included removal of duplicates (with “clumpify.sh”), adapter trimming (using “bbduk.sh”), quality trimming (>Q20), polymer trimming (>10 bp for poly-A/G/C tails), length filtering (>15 bp), and the correction of overlapping paired reads. K-mer analysis was performed with khist.sh using a k-mer length of 21. Genome size and heterozygosity were estimated using GenomeScope v1.0.0 (Vurture et al. 2017) with the parameter “-m 1000”. A preliminary genome assembly was completed using Flye v2.8.1 (Kolmogorov et al. 2019) with a minimum overlap between reads of 1,000 bp, followed by two rounds of self-polishing. To improve assembly accuracy, two rounds of sequence polishing and redundant sequence removal were then carried out.
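The principle behind estimating genome size from a k-mer histogram can be sketched as follows. This is a deliberately simplified peak-based approximation, not the model that GenomeScope fits (which also accounts for heterozygosity, repeats, and sequencing errors); the function name `estimate_genome_size`, the error cutoff `min_depth`, and the toy histogram are all hypothetical.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: {depth: number of distinct k-mers observed at that depth}
    min_depth: depths below this are treated as sequencing errors and
               ignored (an assumed cutoff, to be chosen per data set)

    The estimate is the total number of k-mer occurrences divided by the
    depth at the main (homozygous) peak of the histogram.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no histogram bins at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Hypothetical toy histogram: an error peak at low depth and a main peak at 30x.
toy_histogram = {
    1: 5_000_000, 2: 1_000_000,
    28: 200_000, 29: 400_000, 30: 1_000_000, 31: 400_000, 32: 200_000,
}
size = estimate_genome_size(toy_histogram)
print(size)
```

In practice, the histogram produced by the k-mer counting step would be read from a file, and the result would be cross-checked against a model-based estimate.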
Minimap2 v2.17 (Li 2018) was used to map reads to the assembly. NextPolish v1.1.0 (Hu et al. 2020) was then used to polish the preliminary assembly with Illumina short reads in two rounds. Redundant sequences were removed using Purge_Dups v1.0.0 (Guan et al. 2020) with the parameter “-a 60”.
Hi-C quality control and read alignment to the preliminary genome were performed using Juicer v1.6.2 (Durand et al. 2016). The 3D de novo assembly (3D-DNA) v180922 (Dudchenko et al. 2017) pipeline was used to correct misassemblies and to anchor contigs to chromosomes based on the Hi-C data. Juicebox v1.11.08 (Durand et al. 2016) was used to manually correct possible errors from the first round of scaffolding by visualizing Hi-C heatmaps. The resulting assembly was then further refined by a second round of 3D-DNA. To remove potential contaminant sequences, we used BLAST+ (blastn) v2.7.1 (Camacho et al. 2009) to search the assembly against the NCBI nucleotide (nt) and UniVec databases. We used BUSCO v3.0.2 (Waterhouse et al. 2018) to evaluate the completeness of the assembly with the parameters “-l insecta_odb10 -sp fly -m genome”. Finally, Minimap2 v2.17 and SAMtools v1.10 (Li 2009) were used to calculate the mapping rate by aligning PacBio long reads and Illumina short reads to the final chromosome-level assembly. Illumina reads of the female adult were aligned to the chromosome-level assembly to identify the sex chromosome based on average sequencing depth.
We annotated repetitive elements with a custom library built using two strategies: ab initio prediction and homology searching. We generated a de novo repeat library for C. charites using RepeatModeler v2.0.1 (Flynn et al. 2020) with the parameter “-LTRStruct”. We combined this de novo repeat library with Dfam v3.3 (Storer et al. 2021) and the RepBase-20181026 (Bao et al. 2015) database to build the custom library. We then masked the repetitive elements in the assembly with RepeatMasker v4.0.7 (Smit et al. 2013–2015) using the custom library. tRNAscan-SE v2.0.7 (Chan and Lowe 2019) was used to predict transfer RNAs, and Infernal v1.1.2 (Nawrocki and Eddy 2013) was used to identify the remaining ncRNAs.
The PCGs were annotated using three strategies: ab initio, homology-based, and transcriptome-based prediction. BRAKER v2.1.5 (Hoff et al. 2016) was used for ab initio gene prediction; it integrated Augustus v3.3.3 (Stanke et al. 2004) and GeneMark v4.32 (Brůna et al. 2020) together with evidence from the transcriptome and reference proteins to model sequence properties accurately. Genome-guided transcriptome assembly was performed using StringTie v2.1.4 (Kovaka et al. 2019). The transcriptome evidence was provided as BAM alignments produced by HISAT2 v2.2.0 (Kim et al. 2015), and arthropod reference protein data were obtained from the OrthoDB v10.1 database (Kriventseva et al. 2019). The protein sequences of Bombyx mori, Drosophila melanogaster, Spodoptera litura, Helicoverpa armigera, and Tribolium castaneum were downloaded from the NCBI database as the protein homology data set. GeMoMa (Keilwagen et al. 2019) was used for homology-based prediction. We used the MAKER v3.01.03 (Holt and Yandell 2011) pipeline to annotate the genome by integrating the three data sources listed above.
Diamond v2.0.8 (Buchfink et al. 2021) was used to search the UniProtKB database (Morgat et al. 2020) with the parameters “--more-sensitive -e 1e-5”. In parallel, eggNOG-mapper v2.0 (Cantalapiedra et al. 2021) and InterProScan 5.48–83.0 (Jones et al. 2014) were used to annotate protein domains, GO designations, and pathways (KEGG, Reactome). eggNOG-mapper was used to search the eggNOG v5.0 (Huerta-Cepas et al. 2019) database, and InterProScan was used to query six databases: Pfam, Panther, Gene3D, Superfamily, SMART, and CDD (Wilson et al. 2009; Marchler-Bauer et al. 2017; Letunic and Bork 2018; Lewis et al. 2018; El-Gebali et al. 2019; Mi et al. 2019). Finally, a diagram visualizing key genomic characteristics was drawn with TBtools v1.0692 (Chen et al. 2020).
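The BUSCO summary notation reported in table 1 follows two simple identities: complete genes are the sum of single-copy and duplicated genes (C = S + D), and complete, fragmented, and missing genes together account for the whole data set (C + F + M = 100%). The Python sketch below parses a summary string and checks both identities; the function names `parse_busco` and `is_consistent` and the rounding tolerance are assumptions for illustration, not part of BUSCO itself.

```python
import re


def parse_busco(summary):
    """Parse a BUSCO-style summary such as
    'C: 99.4% (S: 99.0%, D: 0.4%), F: 0.3%, M: 0.3%'
    into a dict mapping each category letter to its percentage."""
    pairs = re.findall(r"([CSDFM]):\s*([\d.]+)%", summary)
    return {key: float(value) for key, value in pairs}


def is_consistent(values, tolerance=0.2):
    """Check C = S + D and C + F + M = 100 within a rounding tolerance.

    The tolerance allows for each percentage having been rounded to one
    decimal place (an assumed value, adjustable)."""
    complete_ok = abs(values["C"] - (values["S"] + values["D"])) <= tolerance
    total_ok = abs(values["C"] + values["F"] + values["M"] - 100.0) <= tolerance
    return complete_ok and total_ok


# Summary string for C. charites as given in table 1.
busco = parse_busco("C: 99.4% (S: 99.0%, D: 0.4%), F: 0.3%, M: 0.3%")
print(busco, is_consistent(busco))
```

A check of this kind is a quick way to catch transcription errors when BUSCO values are copied between tables.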

Gene Family Evolution and Phylogeny

OrthoFinder v2.3.8 (Emms and Kelly 2019) was used to infer sequence orthology based on high-quality nonredundant protein sequences of C. charites and 11 other insect species downloaded from NCBI and GigaDB. These species included one trichopteran species (Stenopsyche tienmushanensis), three lepidopteran species (B. mori, Danaus plexippus, H. armigera), one dipteran species (D. melanogaster), one coleopteran species (T. castaneum), one hymenopteran species (Apis mellifera), two paraneopteran species (Rhopalosiphum maidis and Thrips palmi), one isopteran species (Coptotermes formosanus), and one palaeopteran species (Cloeon dipterum; Supplementary table S3, Supplementary Material online). Subsequently, MAFFT v7.450 (Katoh and Standley 2013) was used to align single-copy orthologous sequences via the L-INS-i method. TrimAl v1.4.1 (Capella-Gutierrez et al. 2009) was used to trim the alignments with the parameter “-automated1”. Next, FASconCAT-G v1.1.1 (Kück and Meusemann 2010) was used to concatenate them. IQ-TREE v2.0.7 (Minh et al. 2020) was used to determine which genes violated the assumptions of stationarity, reversibility, and homogeneity using the parameter “--symtest-pval 0.10”; it was then also used to construct a phylogenetic tree using the General Heterogeneous evolution On a Single Topology (GHOST) strategy with “-m LG+FO+H4”. The posterior mean site frequency (PMSF) model was then applied in IQ-TREE with the parameters “-m LG+C60+FO+R”, using the GHOST tree as the initial guide tree. MCMCTree within PAML v4.9j (Yang 2007) was used to estimate divergence times. Six fossils from the PBDB database (https://www.paleobiodb.org/navigator/) were used for node calibration, including the root, Pterygota (<443.4 Ma); Holometabola (Bashkirian–late Devonian), 315.2–382.7 Ma; Paraneoptera (Moscovian), 314.6–323.2 Ma; and Coleoptera + Neuroptera (Moscovian), 307.0–323.2 Ma. CAFE v5.0.0 (Mendes et al. 2021) was used with default parameters to calculate the expansion or contraction of gene families based on the final divergence times and the phylogeny.
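The concatenation step that turns the trimmed single-locus alignments into one supermatrix can be illustrated with a minimal Python sketch. It is not FASconCAT-G's implementation; the function name `concatenate`, the gap-filling of absent taxa, and the toy alignments are assumptions for illustration.

```python
def concatenate(alignments, taxa, gap="-"):
    """Concatenate per-locus alignments into a supermatrix.

    alignments: {locus name: {taxon: aligned sequence}}
    taxa:       list of taxa to include, in output order
    gap:        character used to fill a locus for taxa that lack it

    Returns (supermatrix, partitions), where supermatrix maps each taxon
    to its concatenated sequence and partitions lists
    (locus name, start, end) with 1-based inclusive coordinates.
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for locus, alignment in alignments.items():
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {locus} differ in length")
        length = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(alignment.get(taxon, gap * length))
        partitions.append((locus, start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Hypothetical toy alignments for three taxa and two loci.
toy = {
    "locus1": {"A": "MKT", "B": "MRT"},
    "locus2": {"A": "LL-V", "B": "LLAV", "C": "LIAV"},
}
matrix, parts = concatenate(toy, ["A", "B", "C"])
print(matrix)
print(parts)
```

The partition coordinates are the kind of information a partitioned phylogenetic analysis needs alongside the concatenated alignment; with strictly single-copy loci present in every species, the gap-filling branch is never used.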

References (51 in total)

1. FASconCAT: Convenient handling of data matrices.

Authors: Patrick Kück; Karen Meusemann
Journal: Mol Phylogenet Evol Date: 2010-04-21 Impact factor: 4.286

2. PAML 4: phylogenetic analysis by maximum likelihood.

Authors: Ziheng Yang
Journal: Mol Biol Evol Date: 2007-05-04 Impact factor: 16.240

3. Repbase Update, a database of repetitive elements in eukaryotic genomes.

Authors: Weidong Bao; Kenji K Kojima; Oleksiy Kohany
Journal: Mob DNA Date: 2015-06-02

4. Evolutionary history of Polyneoptera and its implications for our understanding of early winged insects.

Authors: Benjamin Wipfler; Harald Letsch; Paul B Frandsen; Paschalia Kapli; Christoph Mayer; Daniela Bartel; Thomas R Buckley; Alexander Donath; Janice S Edgerly-Rooks; Mari Fujita; Shanlin Liu; Ryuichiro Machida; Yuta Mashimo; Bernhard Misof; Oliver Niehuis; Ralph S Peters; Malte Petersen; Lars Podsiadlowski; Kai Schütte; Shota Shimizu; Toshiki Uchifune; Jeanne Wilbrandt; Evgeny Yan; Xin Zhou; Sabrina Simon
Journal: Proc Natl Acad Sci U S A Date: 2019-01-14 Impact factor: 11.205

5. tRNAscan-SE: Searching for tRNA Genes in Genomic Sequences.

Authors: Patricia P Chan; Todd M Lowe
Journal: Methods Mol Biol Date: 2019

6. The first chromosome-level genome assembly of a green lacewing Chrysopa pallens and its implication for biological control.

Authors: Yuyu Wang; Ruyue Zhang; Mengqing Wang; Lisheng Zhang; Cheng-Min Shi; Jing Li; Fan Fan; Shuo Geng; Xingyue Liu; Ding Yang
Journal: Mol Ecol Resour Date: 2021-09-30 Impact factor: 8.678

7. GenomeScope: fast reference-free genome profiling from short reads.

Authors: Gregory W Vurture; Fritz J Sedlazeck; Maria Nattestad; Charles J Underwood; Han Fang; James Gurtowski; Michael C Schatz
Journal: Bioinformatics Date: 2017-07-15 Impact factor: 6.937

8. Enzyme annotation in UniProtKB using Rhea.

Authors: Anne Morgat; Thierry Lombardot; Elisabeth Coudert; Kristian Axelsen; Teresa Batista Neto; Sebastien Gehant; Parit Bansal; Jerven Bolleman; Elisabeth Gasteiger; Edouard de Castro; Delphine Baratin; Monica Pozzato; Ioannis Xenarios; Sylvain Poux; Nicole Redaschi; Alan Bridge
Journal: Bioinformatics Date: 2020-03-01 Impact factor: 6.937

9. OrthoFinder: phylogenetic orthology inference for comparative genomics.

Authors: David M Emms; Steven Kelly
Journal: Genome Biol Date: 2019-11-14 Impact factor: 13.583

10. Gene3D: Extensive prediction of globular domains in proteins.

Authors: Tony E Lewis; Ian Sillitoe; Natalie Dawson; Su Datt Lam; Tristan Clarke; David Lee; Christine Orengo; Jonathan Lees
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971