
Hybrid Genome Assembly and Evidence-Based Annotation of the Egg Parasitoid and Biological Control Agent Trichogramma brassicae.

Kim B Ferguson¹, Tore Kursch-Metz²,³, Eveline C Verhulst⁴, Bart A Pannebakker⁵.

Abstract

Trichogramma brassicae (Bezdenko) are egg parasitoids that are used throughout the world as biological control agents and in laboratories as model species. Despite this ubiquity, few genetic resources exist beyond COI, ITS2, and RAPD markers. Aided by a Wolbachia infection, a wild-caught strain from Germany was reared for low heterozygosity and sequenced in a hybrid de novo strategy, after which several assembly strategies were evaluated. The best assembly, derived from a DBG2OLC-based pipeline, yielded a genome of 235 Mbp made up of 1,572 contigs with an N50 of 556,663 bp. Following a rigorous ab initio-, homology-, and evidence-based annotation, 16,905 genes were annotated and functionally described. As an example of the utility of the genome, a simple ortholog cluster analysis was performed with sister species T. pretiosum, revealing over 6,000 shared clusters and under 400 clusters unique to each species. The genome and transcriptome presented here provide an essential resource for comparative genomics of the commercially relevant genus Trichogramma, but also for research into molecular evolution, ecology, and breeding of T. brassicae.


Keywords: Hymenoptera; Wolbachia; biocontrol agent; parasitoid

Year: 2020 PMID: 32792343 PMCID: PMC7534424 DOI: 10.1534/g3.120.401344

Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154

The chalcidoid Trichogramma brassicae (Bezdenko) (Hymenoptera: Trichogrammatidae) is a minute parasitoid wasp (∼0.5 mm in length) that develops within the eggs of other insects (Smith 1996). For over 50 years, it has been used worldwide as a biological control agent, as many lepidopteran pests of different crops are suitable hosts (Polaszek 2009). The most common application of T. brassicae in Europe is against Ostrinia nubilalis (Hübner) (Lepidoptera: Pyralidae), the European corn borer. For example, in 2003 alone, over 11,000 hectares (ha) of maize in Germany were treated with T. brassicae (Zimmermann 2004). It is also released against lepidopteran pests in spinach fields as well as in greenhouses (e.g., tomato, pepper, and cucumber) (Klug and Meyhöfer 2009). With its wide application in biological control, T. brassicae is a well-studied species. Field trials have addressed several aspects, such as host location and dispersal behavior (Suverkropp , 2010) and overwintering ability (Babendreier ), while other biological control related studies considered issues related to low temperature storage (Lessard and Boivin 2013), reaction to insecticides (Liu and Zhang 2012; Delpuech and Delahaye 2013; Ghorbani ; Jamshidnia ; Thubru ), or risk assessment (Kuske ). Next to its application as a biological control agent, this tiny parasitoid has been used in other research, both in genetic studies (Wajnberg 1993; Laurent ; Cruaud ) and ecological studies (Huigens ; Fatouros and Huigens 2012; Cusumano ). In addition, several initiatives investigate the infection of T. brassicae with Wolbachia bacteria (Poorjavad ; Ivezić ) and the consequences of such an infection (Farrokhi ; Poorjavad ; Rahimi-Kaldeh ). As T. brassicae is a cryptic species with several other congenerics, misidentification and misclassification are known issues (Polaszek 2009). 
In response, molecular identification of trichogrammatids is well studied and established (Stouthamer ; Sumer ; Rugman-Jones and Stouthamer 2017; Ivezić ). Recently, several restriction-site associated DNA sequencing (RADseq) libraries were constructed from single T. brassicae wasps to aid in resolving the aforementioned phylogenetic issues within Trichogramma (Cruaud ). Otherwise, the genomics of T. brassicae have largely been neglected, even though a well-annotated genome would give researchers and biological control practitioners access to a wealth of information and open new avenues for comparative genomics and transcriptomics in evolutionary, ecological, and applied research. Here, we report the whole-genome sequencing and annotation of a T. brassicae strain infected by Wolbachia that had thelytokous reproduction, in which females arise from unfertilized eggs. A hybrid de novo sequencing strategy was chosen to address two common issues: we used long PacBio Sequel reads to bridge the large segments of repetitive sequences often found in Hymenoptera, while countering the error bias of long read technology with the accuracy of Illumina short reads. A similar strategy was recently applied to improve the Apis mellifera genome, where the long PacBio reads were the backbone that boosted the overall contiguity of the genome, alongside the incorporation of repetitive regions (Wallberg ). In this report, we present the hybrid de novo genome of T. brassicae. Three different assemblers were evaluated, and the most complete genome assembly was used for decontamination and ab initio-, homology-, and evidence-based annotation. The resulting annotation was functionally described using gene ontology analysis. Finally, a heterozygosity comparison and a simple ortholog cluster analysis with the congeneric T. pretiosum were performed, which can be considered a starting point for future comparative genomics of the commercially important genus Trichogramma.

Materials and Methods

Species origin and description

Individuals of Trichogramma brassicae were acquired by AMW Nützlinge GmbH (Pfungstadt, Germany). The strain was baited in May 2013 in an apple orchard near Eberstadt, Germany. The orchard was surrounded by blackberry hedges, forest, and other orchards. For baiting, the eggs of Sitotroga cerealella (Olivier) (Lepidoptera: Gelechiidae) (Mega Corn Ltd., Bulgaria) were glued on paper cards (AMW Nützlinge GmbH, Germany), usually used for releasing Trichogramma sp. in corn fields and households. These cards were placed directly into the trees, approximately two meters above ground. After five days in the field, baiting cards were collected and incubated together at 25°. Following emergence, individuals were kept together, offered S. cerealella eggs, and reared in a climate chamber (27 ± 2°, L:D = 24:0 h for four days, then transferred to 16 ± 2°, L:D = 0:24 h until emergence). In 2016, the offspring of twenty isolated females were transferred to Wageningen University (The Netherlands) to be reared for low heterozygosity. The resulting offspring were reared in a single general population on irradiated Ephestia kuehniella (Zeller) (Lepidoptera: Pyralidae) eggs as factitious hosts under laboratory conditions in a climate chamber (20 ± 5°, RH 50 ± 5%, L:D = 12:12 h). Wolbachia presence was determined following the PCR amplification protocol of Zhou et al. 1998 in a presence/absence assessment with known positive and negative control samples (Zhou ). Natural Wolbachia infections have previously been detected in Iranian populations of T. brassicae (Farrokhi ), but none of the Eurasian populations have been known to support this symbiosis (Stouthamer 1997; Stouthamer and Huigens 2003).

Isofemale line

Following confirmation of Wolbachia infection (Supplementary materials S1.1.1), a single female from the general population was isolated (generation 0, G0), and given eggs ad libitum. In the resulting generation (G1), unmated females were isolated and reared with eggs ad libitum. Offspring of the initial isolations G0 and G1 were confirmed to be entirely female, suggesting thelytokous parthenogenetic reproduction. Combined with isolating single females, this maximizes genetic similarity of the following generation (G2) of these G1 females. One of these G2 strains, S301, was reared in large population sizes for multiple generations over the period of one year. By the time of collection for sequencing, both the S301 and general population no longer harbored Wolbachia at detectable levels (Supplementary materials S1.1.2).

gDNA extraction

Three separate extractions were prepared in 1.5 mL safelock tubes, each containing several hundred Trichogramma brassicae individuals. The tubes were frozen in liquid nitrogen with approximately six 1-mm glass beads and shaken for 30 s in a Silamat S6 shaker (Ivoclar Vivadent, Schaan, Liechtenstein). gDNA was then extracted using the Qiagen MagAttract Kit (Qiagen, Hilden, Germany). Following an overnight lysis step with Buffer ATL and proteinase K at 56°, extraction was performed according to the MagAttract Kit protocol. Elutions were performed in two steps with Buffer AE (Tris-EDTA) each time (first 60 µL, then 40 µL), yielding 100 µL. The two extractions yielding the largest amount of gDNA (5.49 µg and 8.24 µg) were combined for long-read sequencing, while the remaining extraction (1.67 µg) was used for short-read sequencing. gDNA concentration was measured with an Invitrogen Qubit 2.0 fluorometer using the dsDNA HS Assay Kit (Thermo Fisher Scientific, Waltham, USA), while fragment length was confirmed on gel.

Library preparation and sequencing

Sequence coverage was calculated using the previously established genome size estimate for T. brassicae of 246 Mbp (Johnston ). Library preparation and sequencing were performed by Novogene Bioinformatics Technology Co., Ltd., (Beijing, China). For Illumina sequencing, gDNA was used to construct one paired-end (PE) library according to the standard protocol for Illumina with an average insert size of 150 bp and was sequenced using an Illumina HiSeq 2000 (Illumina, San Diego, USA). For Single Molecule Real Time (SMRT) sequencing, gDNA was selected for optimal size using a Blue Pippin size selection system (Sage Science, Beverly, USA) following a standard library preparation. The library was then sequenced on a PacBio Sequel (Pacific Biosciences, Menlo Park, USA) with 16 SMRT cells.
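
The coverage calculation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' script: the function name is hypothetical, the read counts and mean subread length are those reported in the Results, and it assumes that the Illumina read count refers to read pairs of 2 x 150 bp.

```python
def coverage(n_fragments, read_length_bp, genome_size_bp, reads_per_fragment=1):
    """Expected sequencing depth: total sequenced bases divided by genome size."""
    total_bases = n_fragments * reads_per_fragment * read_length_bp
    return total_bases / genome_size_bp


GENOME_SIZE_BP = 246_000_000  # genome size estimate cited in the text

# Assumption: the 80,489,816 Illumina reads are counted as pairs of 2 x 150 bp.
short_read_cov = coverage(80_489_816, 150, GENOME_SIZE_BP, reads_per_fragment=2)

# PacBio Sequel: 2,500,204 subreads with a mean length of 6,377 bp.
long_read_cov = coverage(2_500_204, 6377, GENOME_SIZE_BP)

print(f"short-read: {short_read_cov:.1f}x, long-read: {long_read_cov:.1f}x")
```

Under these assumptions the estimates land close to the 98x and 64x coverage values reported in the Results.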

Assembly and decontamination

Prior to assembly, Illumina reads were assessed for quality using FASTQC (Andrews ), then trimmed for quality in CLC Genomics Workbench 11 using default settings (Qiagen). Trimmed Illumina reads were paired for subsequent analysis. In order to achieve the best possible assembly, three assembly pipelines were evaluated: one for PacBio-only reads and two hybrid assemblers. The PacBio-only reads were assembled with Canu (v1.6) with modifications based on PacBio Sequel reads (correctedErrorRate = 0.085, corMhapSensitivity = normal) (Koren ). This is assembly version v1.0 in the subsequent discussion. The first hybrid assembly pipeline using both long and short sequencing read sets was SPAdes (v3.11.1) (Bankevich ). The SPAdes genome toolkit supports hybrid assemblies with the hybridSPAdes algorithm (Antipov ). Three iterations of the SPAdes pipeline were run with varying k-mer sizes, resulting in three different assembly versions: k-mer sizes 21, 33, 55 (default, v2.1); k-mer sizes 21, 33, 55, 77 (v2.2); and a single k-mer size of 127 (v2.3). The second hybrid assembly pipeline was DBG2OLC (Ye ). The DBG2OLC pipeline can be readily tweaked with other programs depending on the job (Chakraborty ). Following the DBG2OLC pipeline, de Bruijn graph contigs were generated with SparseAssembler using default settings and an expected genome size of 750 Mbp, so that the output genome size would not be restricted (Ye ). Contigs were transformed into read overlaps using DBG2OLC with settings suggested for large genomes and PacBio Sequel data (k = 17; AdaptiveTh = 0.01; KmerCovTh = 2; MinOverlap = 20; RemoveChimera = 1), according to the DBG2OLC manual (https://github.com/yechengxi/DBG2OLC). This creates an assembly backbone of the best overlaps between the short-read de Bruijn contigs and the long reads. minimap2 (v2.9) and Racon (v1.0.2) were used for consensus calling of the remaining overlaps to the assembly backbone (Vaser ; Li 2018). 
The resulting consensus assembly was polished twice using the Illumina reads with Pilon (v1.22) (Walker ). This final assembly is v3.0 in subsequent discussion. The best of the five assemblies generated was determined on the basis of N50, genome size, and completeness (Table 1). Genome statistics such as N50, number of contigs, and genome size were determined using Quast (Gurevich ). Assembly completeness was assessed using BUSCO (v3.0.2) with the insecta_odb9 ortholog set and the fly training parameter (Simão ). Based on these characteristics, the decision was made to move forward with assembly v3.0, which was then decontaminated for microbial sequences using NCBI BLASTn (v2.2.31+) against the NCBI nucleotide collection (nr).
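
For readers unfamiliar with the N50 statistic used to rank the assemblies, the following minimal Python sketch shows how it is conventionally computed from a list of contig lengths. It is not Quast's implementation, and the toy contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Hypothetical toy assembly: 300 bp in total, so half is 150 bp.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because the statistic depends only on the sorted length distribution, a handful of long contigs can raise N50 sharply, which is why it is read alongside assembly size and BUSCO completeness rather than on its own.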

Table 1

Statistics for five assemblies of Trichogramma brassicae. The first strategy was PacBio-only in Canu, while three hybrid assembly strategies were based on SPAdes and modulating k-mer sizes, and an additional hybrid assembly was based on an adapted DBG2OLC+Racon+Pilon protocol. BUSCO score is based on the insecta_odb9 dataset (Simão )

Assembler	Version	Size (bp)	Contigs	Longest contig (bp)	N50 (bp)	BUSCO (Complete %)
Canu	v1.0	69,522,446	3,007	126,800	27,303	18.7
SPAdes (k = 21, 33, 55)	v2.1	227,096,967	282,988	474,998	36,870	96.8
SPAdes (k = 21, 33, 55, 77)	v2.2	226,864,253	189,696	548,753	49,096	97.1
SPAdes (k = 127)	v2.3	211,402,326	73,567	537,817	63,558	96.4
DBG2OLC+Racon+Pilon	v3.0	235,413,774	1,572	2,953,580	556,663	95.5
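
The "BUSCO (Complete %)" column combines single-copy and duplicated complete BUSCOs. As a hedged illustration, the sketch below recomputes the summary percentages for assembly v3.0 from the category counts given in the Results (1,531 single-copy, 53 duplicated, 22 fragmented, 52 missing). The function name is hypothetical, and this is not BUSCO's own reporting code, so rounding of individual categories may differ slightly from the tool's output.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Summarise BUSCO category counts as percentages of all groups searched."""
    total = single + duplicated + fragmented + missing

    def pct(n):
        return round(100 * n / total, 1)

    return {
        "total": total,
        "complete": pct(single + duplicated),  # complete = single-copy + duplicated
        "single_copy": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
    }


# Counts for assembly v3.0 as reported in the Results.
v3_summary = busco_summary(single=1531, duplicated=53, fragmented=22, missing=52)
print(v3_summary)
```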

Wolbachia contamination

Two contigs contained a large amount of Wolbachia content, with over 80% of each contig containing material with 75% or higher homology to Wolbachia. These contigs were assessed for homology against the NCBI nucleotide collection (nr) and removed from the assembly (Supplementary material S1.2). Post-decontamination, the assembly is referred to as v3.5.
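
The removal criterion above (more than 80% of a contig covered by hits with at least 75% identity) can be expressed as a simple interval-merging calculation. The sketch below is one possible way to apply such a rule to parsed BLASTn hits; the function names, thresholds as defaults, contig length, and hit coordinates are all illustrative assumptions, not the authors' procedure or data.

```python
def covered_fraction(contig_len, hits, min_identity=75.0):
    """Fraction of a contig covered by hits at or above min_identity.

    hits: iterable of (start, end, percent_identity), 1-based inclusive.
    Overlapping or adjacent hits are merged so no base is counted twice.
    """
    intervals = sorted(
        (min(s, e), max(s, e)) for s, e, pid in hits if pid >= min_identity
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / contig_len


def is_contaminant(contig_len, hits, min_identity=75.0, min_fraction=0.8):
    """Flag a contig when the covered fraction exceeds min_fraction."""
    return covered_fraction(contig_len, hits, min_identity) > min_fraction


# Hypothetical 10,000 bp contig with four hypothetical hits to Wolbachia.
example_hits = [(1, 4000, 92.0), (3500, 6000, 88.0), (7000, 10000, 81.0), (6100, 6500, 60.0)]
print(covered_fraction(10_000, example_hits), is_contaminant(10_000, example_hits))
```

Merging intervals before summing matters here: the first two hypothetical hits overlap by 501 bp, and counting them separately would overstate the covered fraction.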

RNA extraction, library construction, and sequencing

T. brassicae wasps from the S301 line were collected for RNAseq for evidence-based annotation. Hundreds of adult individuals (male and female) were collected and stored at -80°. For RNA extraction, samples were frozen in liquid nitrogen in a single 1.5 mL safelock tube with approximately six 1-mm glass beads and shaken for 30 s in a Silamat S6 shaker (Ivoclar Vivadent). The RNeasy Blood and Tissue Kit (Qiagen) was used according to manufacturer’s instructions, and final column elution was achieved using 60 µL sterilized water. The sample was measured for quality and RNA quantity using an Invitrogen Qubit 2.0 fluorometer and the RNA BR Assay Kit (Thermo Fisher Scientific). The RNA sample was then processed by Novogene Bioinformatics Technology Co., Ltd., (Beijing, China) using poly(A) selection followed by cDNA synthesis with random hexamers and library construction with an insert size of 300 bp. Paired-end sequencing was performed on an Illumina HiSeq 4000 according to manufacturer’s instruction. Quality filtering was applied to remove adapters, reads with more than 10% undetermined bases, and reads of low quality for more than 50% of the total bases (Qscore less than or equal to 5).

Ab initio gene finding, transcriptome assembly, and annotation

For the ab initio gene finding, a training set was established using the reference genome of Drosophila melanogaster (Meigen) (Diptera: Drosophilidae) (Genbank: GCA_000001215.4; Release 6 plus ISO1 MT) (Hoskins ) and the associated annotation (Adams ; Dos Santos ). The training parameters were used by GlimmerHMM (v3.0.1) for gene finding in the T. brassicae genome assembly v3.5 (Majoros ). For homology-based gene prediction, GeMoMa v1.6 was used with the D. melanogaster reference genome alongside our RNAseq data as evidence for splice site prediction (Keilwagen ). For evidence-based gene finding, the pooled RNAseq data were mapped to the T. brassicae genome separately with TopHat (v2.0.14) with default settings (Trapnell ). After mapping, Cufflinks (v2.2.1) was used to assemble transcripts (Trapnell ). CodingQuarry (v1.2) was used for gene finding in the genome using the assembled transcripts, with the strandness setting set to ‘unstranded’ (Testa ). The tool EVidenceModeler (EVM) (v1.1.1) was used to combine the ab initio, homology-based, and evidence-based information, with evidence-based weighted 1, ab initio weighted 2, and homology-based weighted 3 (Haas et al. 2008). We annotated the predicted proteins with BLASTp (v2.2.31+) on a custom database containing all SwissProt and Refseq genes of D. melanogaster (Boutet ; Camacho ; Acland ), followed by an additional search in the NCBI non-redundant protein database (nr) to obtain additional homology data. The evidence-based annotation (.bam file) was compared to the final annotation (.gff file) for overlap with the BEDtools coverage tool (Quinlan and Hall 2010).

GO term analysis

A list of genes was constructed for Gene Ontology (GO) term classification by deduplicating the annotated proteins and removing the non-annotated proteins. These accession IDs were converted into UniProtKB accession IDs using the UniProt ID mapping feature and deduplicated a final time (Boutet ). These UniProtKB accession IDs were in turn used with the DAVID 6.8 Functional Annotation Tool to assign GO terms to each accession ID with the D. melanogaster background and generate initial functional analyses (Huang , 2009b) (see supplementary S1.3 for DAVID input list).

Heterozygosity estimates

The heterozygosity of the S301 line was assessed using sequence reads and k-mer counting, and compared to the congeneric Trichogramma pretiosum (Riley) (Hymenoptera: Trichogrammatidae), for which sequence data exist for both a thelytokous (asexual) Wolbachia-infected strain and an inbred arrhenotokous (sexual) line (Lindsey ). Using jellyfish (v2.3.0) to count k-mers, the same trimmed and paired Illumina reads used for assembly were assessed using the default k-mer size of 21 (m = 21), with results exported to a histogram (Marçais and Kingsford 2011). This histogram file was then used with GenomeScope (v1.0) to estimate heterozygosity of the reads based on a statistical model, where a Poisson distribution is expected for a homozygous sample while a bimodal distribution is expected for a heterozygous sample (Vurture ). This genome profiling gives a reliable estimate for heterozygosity as well as estimates of repetitive content. The same jellyfish and GenomeScope analyses were performed on T. pretiosum short-read sequence data for the thelytokous strain (NCBI SRA database, SRR1191749) and the arrhenotokous line (SRR6447489), with adaptations for reported insert sizes (Lindsey ).
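
To make the k-mer profiling step concrete, the following toy Python sketch builds the kind of k-mer multiplicity histogram that a k-mer counter exports and that a profiling model then fits. It is an in-memory illustration only, not a substitute for jellyfish; the function names are hypothetical, and counting canonical k-mers (a k-mer and its reverse complement treated as one) is an assumption about the counting mode.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=21):
    """Map k-mer multiplicity -> number of distinct canonical k-mers with that count."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return dict(sorted(Counter(counts.values()).items()))


# Hypothetical toy reads; k is reduced to 3 so the example stays readable.
toy_reads = ["ACGTAC", "ACGTAC"]
print(kmer_histogram(toy_reads, k=3))
```

In a histogram built from real reads of a homozygous sample, a single main peak sits at the sequencing depth, whereas heterozygous sites add a second peak at roughly half that depth, which is the signal the statistical model uses.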

Ortholog cluster analysis

The complete gene set of T. brassicae was compared to that of T. pretiosum (Lindsey ), which was retrieved from the i5K Workspace (Poelchau ). An ortholog cluster analysis was performed on both gene sets via OrthoVenn2 with the default settings of an E-value of 1e-5 and an inflation value of 1.5 (Xu ). For the T. brassicae protein set, see supplementary materials S1.5. For full results from the cluster analysis, see supplementary materials S1.6.

Data availability

All sequence data are available at the EMBL-ENA database under BioProject PRJEB35413, including assembly (CADCXV010000000.1). An overview of supplementary material is available on figshare (https://doi.org/10.6084/m9.figshare.12794771.v1), with additional material found on the DANS EASY Repository (https://doi.org/10.17026/dans-23w-a9tn), such as gel images, the Wolbachia contaminated contigs, input gene list for DAVID, GenomeScope images, and complete protein set. Contained within these supplementary materials is an additional GFF file that can be found on figshare (https://doi.org/10.6084/m9.figshare.12073833.v1) along with the OrthoVenn2 outputs (https://doi.org/10.6084/m9.figshare.12624629.v1).

Results and Discussion

Sequencing, assembly, and decontamination

Sequencing of the Illumina 150 bp paired-end library yielded 80,489,816 reads. After quality filtering and trimming, 80,483,128 paired-end reads were retained. Sequencing the PacBio Sequel library yielded 2,500,204 subreads with an average length of 6,377 bp. The genome size estimate for T. brassicae is 246 Mbp (Johnston ), indicating that short-read coverage was 98x while long-read coverage was 64x, resulting in a total coverage of 162x. Three assembly pipelines were used, resulting in five potential assemblies, of which one, v3.0, was eventually selected for further use. Results of these assemblies are detailed in Table 1. The first draft assembly, generated with Canu using the altered settings for PacBio Sequel data, was approximately 70 Mbp in size, drastically smaller than the 246 Mbp expected, and contained a total of 3,007 contigs with an N50 of 27,303 bp. The longest contig was 126,800 bp in size. The second assembly strategy relied on hybrid assembly pipelines, and SPAdes was used with the default k-mer settings, which resulted in an assembly of approximately 227 Mbp in size with an N50 of 36,870 bp and a BUSCO completeness of 96.8%. Three different assembly runs were done with differing k-mer sizes: the default k-mer sizes of 21, 33, 55 (v2.1); default k-mer sizes plus 77 (v2.2); or the highest possible k-mer size of 127 (v2.3). Increasing the k-mer size improved N50 scores only to a point and decreased the number of contigs, while BUSCO scores remained stable; however, the assembled genome size dropped, with the third attempt shrinking to 211 Mbp. Based on BUSCO scores and N50 alone, the second SPAdes attempt, v2.2, would be the best of the three, though all three are similar in most measures. The third assembly strategy used the DBG2OLC+Racon+Pilon pipeline, which resulted in assembly v3.0. Here, there is a large difference compared to the previous SPAdes assemblies. 
Particularly, the number of contigs is reduced dramatically from the 70,000 to 280,000 range of the SPAdes output down to a mere 1,572. Meanwhile, the assembled genome size is now 235 Mbp, with an N50 of 556,663 bp and a BUSCO score of 95.5%. The full completeness score for this assembly, using the 1,658 BUSCO groups within the insecta_odb9 BUSCO set, returned 1,531 (92.3%) complete and single-copy BUSCOs, 53 (3.2%) complete and duplicated BUSCOs, 22 (1.3%) fragmented BUSCOs, and 52 (3.2%) missing BUSCOs (Simão ). While the PacBio-only assembly in Canu could have been improved using different settings or additional tools, we decided to focus on using the additional sequence information of the Illumina reads in the subsequent hybrid assembly strategies. The SPAdes assemblies (v2.1-3) were already decent but could have been further improved using Pilon, a tool that improves assemblies at the base pair level using high quality Illumina data. However, the v3.0 assembly was by far the best assembly based on assembled genome size, N50, and BUSCO scores, and therefore we chose this strategy for our T. brassicae genome assembly. Decontamination of this assembly (v3.0) resulted in the removal of two contigs, as the homology analysis using BLASTn with the NCBI nr database indicated that both contigs were largely composed of Wolbachia genomic content. Contig “Backbone_1176” is 9,448 bp in length and two areas of the contig, representing over 80% of its length, showed high homology to Wolbachia. Similarly, contig “Backbone_1392” is 17,350 bp and three separate areas representing over 80% showed similar levels of homology to Wolbachia. After decontamination, this final assembly (v3.5) was used for annotation. In our RNA sequencing experiment, we generated 26,479,830 150 bp paired-end cDNA reads. Filtering the reads for quality retained 99.3% of these reads to be used for evidence-based gene finding via transcriptome assembly. 
The annotations from the evidence-based gene finding were used alongside homology-based findings and ab initio annotations in a weighted model, resulting in a complete annotation for the assembly. In 865 mRNA tracks, representing approximately 5.1% of the official gene set, a gene model could not be annotated via the SwissProt database, and these tracks are named “No_blast_hit.” The majority of tracks are annotated with reference to SwissProt or GenBank accession number of the top BLASTp hit. Transcriptome assembly and mapping resulted in 45,876,158 mapped transcripts (48,327,134 total). CodingQuarry predicted 45,454 evidence-based genes from these mapped transcripts, while ab initio gene finding using GlimmerHMM resulted in 16,877 genes and homology-based gene finding with GeMoMa resulted in 6,675 genes. The final complete gene set was created using EVidenceModeler, where a weighted model using all three inputs resulted in a complete gene set of 16,905 genes. 38.96% of the annotation is supported by RNAseq based on coverage comparison to the mapped transcripts. The complete gene set of 16,905 genes was deduplicated and genes with no correlating BLASTp hit were removed from this analysis. The remaining 9,373 genes were subjected to UniProtKB ID mapping, resulting in 8,247 genes with a matching ID after another round of deduplication (828 duplicates found). The remaining 755 accession IDs were not able to be matched, half of which are obsolete proteins within the UniParc database (377). The DAVID Functional Annotation Tool used 6,585 genes for the analysis and showed that 80.8% (5,320) contribute to 530 biological processes, 77.5% (5,104) contribute to 115 different cellular component categories, and 74.2% (4,889) contribute to 93 molecular functions (genes can code to multiple GO terms). The remaining 1,662 genes are uncategorized. 
Using short-read data and k-mer counting, heterozygosity was estimated for our isofemale S301 line and compared to both a parthenogenesis-inducing Wolbachia-infected strain and an arrhenotokous line of T. pretiosum (Lindsey ). The average estimated heterozygosity for our S301 T. brassicae line is 0.0332% with approximately 0.608% repetitive content (for full details, see Table 2). This is similar to the thelytokous T. pretiosum line, which has a slightly lower estimated heterozygosity (0.0289%) and a lower amount of repetitive content (0.482%). Both have a very distinct Poisson distribution, indicating a low heterozygosity (Figure S1.4.1-2). The arrhenotokous T. pretiosum showed a higher estimated heterozygosity (0.863%), a larger amount of repetitive content (2.64%), and a slightly bimodal distribution (Figure S1.4.3). The fact that both thelytokous Trichogramma species have a similarly low level of heterozygosity when compared to the arrhenotokous T. pretiosum suggests that in both cases Wolbachia infection had a severe effect on genetic diversity. As the canonical mechanism of parthenogenesis-induction in other Wolbachia-infected thelytokous Trichogramma species is gamete duplication (Stouthamer and Kazmer 1994; Pannebakker ), in which unfertilized eggs are diploidized, resulting in fully homozygous progeny in a single generation, the low genomic heterozygosity rate suggests a similar mechanism for Wolbachia-induced parthenogenesis in T. brassicae. However, the involvement of Wolbachia in causing all-female offspring in this T. brassicae strain and the presence and mechanisms of Wolbachia in other thelytokous T. brassicae strains (Farrokhi ; Poorjavad , 2018) do require further investigation.

Table 2

Heterozygosity and repetitive content analysis of Trichogramma brassicae (thelytokous), Trichogramma pretiosum (thelytokous), and T. pretiosum (arrhenotokous) lines based on sequence data

	Heterozygosity (%)	Repetitive content (%)	Source of sequence data
T. brassicae, thelytokous S301 line	0.0332	0.608	This publication
T. pretiosum, thelytokous Wolbachia line	0.0289	0.482	Lindsey et al., 2018
T. pretiosum, arrhenotokous inbred line	0.863	2.64	Lindsey et al., 2018

The complete gene set of T. brassicae was compared to that of T. pretiosum using OrthoVenn2 (full output in Table 3). The two species differ in their absolute number of proteins (16,905 in T. brassicae and 13,200 in T. pretiosum), yet these form a similar number of clusters (6,562 in T. brassicae and 6,507 in T. pretiosum). The two species share 6,178 clusters (of 16,858 proteins), while T. brassicae has 384 unique clusters (1,804 proteins) and T. pretiosum has 329 unique clusters (998 proteins), as shown in Figure 1. These unique clusters account for approximately 5% of the entire cluster set for each species, and may either indicate true areas of differentiation or result from differences in the annotation strategies. There is a similar number of singletons (proteins that do not cluster with others) in T. brassicae (5,268) and T. pretiosum (5,177). Finally, between the two species, there are 5,828 single-copy gene clusters.
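
The cluster and protein counts above are internally consistent, which can be checked with a few lines of arithmetic. The Python sketch below is only a bookkeeping check on the reported numbers (variable names are illustrative); it does not reproduce the OrthoVenn2 clustering itself.

```python
# Counts as reported in the text and Table 3.
shared_clusters = 6178
unique_clusters = {"T. brassicae": 384, "T. pretiosum": 329}
total_clusters = {"T. brassicae": 6562, "T. pretiosum": 6507}

proteins_in_shared = 16_858
proteins_in_unique = {"T. brassicae": 1804, "T. pretiosum": 998}
singletons = {"T. brassicae": 5268, "T. pretiosum": 5177}
total_proteins = {"T. brassicae": 16_905, "T. pretiosum": 13_200}

# Share of each species' clusters that is unique to that species, in percent.
unique_pct = {}
for species, n_unique in unique_clusters.items():
    # Shared plus unique clusters should give the per-species cluster total.
    assert shared_clusters + n_unique == total_clusters[species]
    unique_pct[species] = round(100 * n_unique / total_clusters[species], 1)

# Every protein is either in a shared cluster, a unique cluster, or a singleton.
accounted_for = (
    proteins_in_shared + sum(proteins_in_unique.values()) + sum(singletons.values())
)

print(unique_pct, accounted_for, sum(total_proteins.values()))
```

The unique-cluster shares come out between 5% and 6% for both species, in line with the "approximately 5%" stated above, and the clustered proteins plus singletons add up to the combined size of the two gene sets.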

Table 3

Output of OrthoVenn2 ortholog cluster analysis of Trichogramma brassicae and Trichogramma pretiosum.

Species	Proteins	Clusters	Singletons	Source of gene set
T. brassicae	16,905	6,562	5,268	This work (S1.5)
T. pretiosum	13,200	6,507	5,177	Lindsey et al., 2018; Poelchau et al., 2015

Figure 1

Ortholog cluster analysis between Trichogramma brassicae and Trichogramma pretiosum using OrthoVenn2 (Xu ). The number of clusters shared between the two organisms is in bold, with the number of proteins within each cluster grouping underneath in parentheses.

Both the unique clusters and the singleton genes could be novel proteins, regions of contamination, evidence of unique horizontal gene transfer, or pseudogenes. While the total number of genes shows a difference of over 3,000 genes, the cluster analysis shows both a similar number of unique clusters for both species, as well as a similar number of singletons. The BUSCO analysis would indicate that gene inflation due to assembly error is unlikely, as only 3.2% of the BUSCOs are duplicated and 1.3% are fragmented. In addition to the possibility of actual gene duplication in T. brassicae, the difference in the number of genes between the two species could also be due to the different annotation tools and methods used between the two projects. More investigation into these protein clusters in addition to a more comprehensive manual annotation of T. brassicae should shed some light on the differences between these closely related yet geographically distinct parasitoid wasps.

Conclusions and Perspectives

Here, we present the genome of biological control agent Trichogramma brassicae, a chalcidoid wasp used throughout the world for augmentative biological control as well as genetic and ecological research. This unique strain hosted a parthenogenesis-inducing Wolbachia infection and is the first European Trichogramma genome to be published, allowing for comparative analyses with other Trichogramma genomes, as we have shown. Our genomic data also illuminate the possible mechanism of parthenogenesis-induction by Wolbachia in this strain. Furthermore, the variety of genomic and transcriptomic data generated for this genome provides much-needed resources to bring T. brassicae into the -omics era of biological research. A hybrid approach was used, resulting in a highly contiguous assembly of 1,572 contigs and 16,905 genes based on ab initio, homology-based, and evidence-based annotation, for a total assembly size of 235 Mbp. Two contigs were identified that were of Wolbachia origin and removed. Ortholog cluster analysis with a sister species showed 384 unique protein clusters containing 1,804 proteins. Future studies are needed to show whether these clusters are truly unique, in addition to manual annotation that would shed light on possible gene duplication events. This genome and annotation provide the basis for future, more in-depth comparative studies into the genetics, evolution, ecology, and biological control use of Trichogramma species.


1. UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View.

Authors: Emmanuel Boutet; Damien Lieberherr; Michael Tognolli; Michel Schneider; Parit Bansal; Alan J Bridge; Sylvain Poux; Lydie Bougueleret; Ioannis Xenarios
Journal: Methods Mol Biol Date: 2016

2. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors: Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal: Bioinformatics Date: 2015-06-09 Impact factor: 6.937

3. QUAST: quality assessment tool for genome assemblies.

Authors: Alexey Gurevich; Vladislav Saveliev; Nikolay Vyahhi; Glenn Tesler
Journal: Bioinformatics Date: 2013-02-19 Impact factor: 6.937

4. Phylogeny and PCR-based classification of Wolbachia strains using wsp gene sequences.

Authors: W Zhou; F Rousset; S O'Neil
Journal: Proc Biol Sci Date: 1998-03-22 Impact factor: 5.349

5. Attraction of egg-killing parasitoids toward induced plant volatiles in a multi-herbivore context.

Authors: Antonino Cusumano; Berhane T Weldegergis; Stefano Colazza; Marcel Dicke; Nina E Fatouros
Journal: Oecologia Date: 2015-05-08 Impact factor: 3.225

6. FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations.

Authors: Gilberto dos Santos; Andrew J Schroeder; Joshua L Goodman; Victor B Strelets; Madeline A Crosby; Jim Thurmond; David B Emmert; William M Gelbart
Journal: Nucleic Acids Res Date: 2014-11-14 Impact factor: 16.971

7. Fast and accurate de novo genome assembly from long uncorrected reads.

Authors: Robert Vaser; Ivan Sović; Niranjan Nagarajan; Mile Šikić
Journal: Genome Res Date: 2017-01-18 Impact factor: 9.043

8. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

Authors: Sergey Koren; Brian P Walenz; Konstantin Berlin; Jason R Miller; Nicholas H Bergman; Adam M Phillippy
Journal: Genome Res Date: 2017-03-15 Impact factor: 9.043

9. Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage.

Authors: Mahul Chakraborty; James G Baldwin-Brown; Anthony D Long; J J Emerson
Journal: Nucleic Acids Res Date: 2016-07-25 Impact factor: 16.971

10. OrthoVenn2: a web server for whole-genome comparison and annotation of orthologous clusters across multiple species.

Authors: Ling Xu; Zhaobin Dong; Lu Fang; Yongjiang Luo; Zhaoyuan Wei; Hailong Guo; Guoqing Zhang; Yong Q Gu; Devin Coleman-Derr; Qingyou Xia; Yi Wang
Journal: Nucleic Acids Res Date: 2019-07-02 Impact factor: 16.971
