Genome assembly and annotation of the California harvester ant Pogonomyrmex californicus.

Jonas Bohn¹, Reza Halabian¹, Lukas Schrader², Victoria Shabardina¹, Raphael Steffen², Yutaka Suzuki³, Ulrich R Ernst², Jürgen Gadau², Wojciech Makałowski¹.

Abstract

The harvester ant genus Pogonomyrmex is endemic to arid and semiarid habitats and deserts of North and South America. The California harvester ant Pogonomyrmex californicus is the most widely distributed Pogonomyrmex species in North America. Pogonomyrmex californicus colonies are usually monogynous, i.e. a colony has one queen. However, in a few populations in California, primary polygyny evolved, i.e. several queens cooperate in colony founding after their mating flights and continue to coexist in mature colonies. Here, we present a genome assembly and annotation of P. californicus. The size of the assembly is 241 Mb, which is in agreement with the previously estimated genome size. We were able to annotate 17,889 genes in total, including 15,688 protein-coding ones with BUSCO (Benchmarking Universal Single-Copy Orthologs) completeness at a 95% level. The presented P. californicus genome assembly will pave the way for investigations of the genomic underpinnings of social polymorphism in the number of queens, regulation of aggression, and the evolution of adaptations to dry habitats.

Keywords: Hymenoptera; Nanopore sequencing; genome annotation; genome assembly; polygyny; social insect

Year: 2021 PMID: 33561225 PMCID: PMC8022709 DOI: 10.1093/g3journal/jkaa019

Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154

Introduction

Ants (Hymenoptera: Formicidae) are important components of almost all terrestrial ecosystems, and more than 16,000 species have been described so far (AntWeb, version 8.41, California Academy of Science, online at https://www.antweb.org; accessed on August 19, 2020). The largest share of them, over 6900, belong to the highly diverse subfamily Myrmicinae (AntWeb, version 8.41, California Academy of Science, online at https://www.antweb.org; accessed on August 19, 2020). Currently, 40 assembled ant genomes are available at NCBI (Entrez “Genome” accessed on August 19, 2020). The harvester ant genus Pogonomyrmex is endemic to arid and semiarid habitats and deserts of North and South America (Buckley 1867; Cole 1968; Snelling ). This genus thrives in extremely dry habitats, e.g. Death Valley or Anza-Borrego, and evolved seed-harvesting behavior independently from the Old World harvester ant genus Messor. Members of the genus Pogonomyrmex are a very conspicuous element of the deserts in the Southwest of the USA and have been studied extensively (De Vita 1979; Rissing ; Lighton and Turner 2004; Clark and Fewell 2014; Helmkampf ; Overson ). Within this genus, several interesting traits have evolved, such as social parasitism, genetic caste determination, and social polymorphism in queen number (Cole 1968; Rissing ; Julian ). Arguably, the most widely distributed Pogonomyrmex species in North America is P. californicus (Johnson 2002). Pogonomyrmex californicus colonies are usually monogynous, i.e. a colony has one queen. However, in a few populations in California, primary polygyny has evolved, i.e. several queens cooperate in colony founding after their mating flights and continue to coexist in mature colonies (Rissing ; Johnson 2004; Shaffer ). The red imported fire ant, Solenopsis invicta, and several Formica ant species have a similar social polymorphism, which has been shown to be due to a supergene (Wang ; Yan ).
This discovery was only possible through next-generation sequencing and the availability of genomic information for these species. Of approximately 70 described Pogonomyrmex species (AntWeb, version 8.41, California Academy of Science, online at https://www.antweb.org; accessed on August 19, 2020), only Pogonomyrmex barbatus has had its genome sequenced, assembled, and annotated (Smith ). Five other species of this genus (Pogonomyrmex anergismus, Pogonomyrmex colei, Pogonomyrmex imberbiculus, Pogonomyrmex occidentalis, and Pogonomyrmex rugosus) have partially sequenced nuclear genomes, but none has been processed so far and only raw reads are available in NCBI’s Sequence Read Archive (SRA). Sequences of P. rugosus, P. anergismus, and P. colei have been aligned to the P. barbatus genome for a study of gene gains/losses in socially parasitic ants (Smith ). The genome sequencing and annotation of P. californicus will result in a better understanding of genomic sequence and structural variation and evolution in Formicidae in general and in the myrmicine genus Pogonomyrmex in particular. It will also pave the way for investigations of the genomic underpinnings of social polymorphism in queen number, regulation of aggression, and the evolution of adaptations to dry habitats in P. californicus.

Materials and methods

Samples and transcriptome data source

Nuclear DNA was extracted from 13 haploid males from a single polygynous colony collected in 2016 from Pine Valley, CA, USA (32.819761, −116.521512; N32 49 11 – W116 31 17). Previously published transcriptome data, based in part on queens from the same population/area (Helmkampf ), were downloaded from NCBI’s Sequence Read Archive (BioProject accession number PRJDB4319). In addition, Oxford Nanopore sequencing (MinION) was performed on RNA extracted from workers of five polygynous colonies also collected in 2016 from Pine Valley, CA, USA. These reads are accessible at NCBI with BioProject accession number PRJNA622899.

Genome sequencing and assembling

DNA from 13 male ants was isolated using the Qiagen MagAttract HMW DNA Kit, following the protocol for tissue DNA extraction (version of October 2017). DNA was extracted from the whole body and all individuals were pooled together. This resulted in 4575 ng of DNA, of which 1.2 ng was used for a 10x Genomics Chromium sequencing approach. The sequencing library was prepared according to the Chromium™ Genome Library Kit standard protocol (Manual Part Number CG00022) and the Illumina HiSeq 3000 system was used to sequence the library. The quality of the produced reads was checked using FastQC software, version 0.11.5 (Andrews 2010). We performed neither filtering nor trimming on these linked-reads to avoid losing any information. We used the de novo assembler Supernova, version 2.1 (Weisenfeld ), with the following parameters: --maxreads=156111200 and --accept-extreme-coverage. The maximum number of reads was set to a 75× effective coverage of the genome, which was chosen based on a set of Supernova runs with different coverages to obtain optimal parameters as a trade-off between genome size, BUSCO assessment (see below), and N50 (see Figure 2), assuming that P. californicus has a genome size of 244.5 Mb (http://www.genomesize.com). Subsequently, the resulting genome assembly was polished by three rounds of Pilon, version 1.23, processing (Walker ). For this step, 678,988,626 reads of 100 bp from an independent Illumina sequencing run from the same initial DNA extraction were added to the 269,953,173 linked-reads used for the genome assembly, bringing the number of reads used in the polishing step to almost 1 billion. For this additional sequencing, a standard Illumina protocol was used to prepare a sequencing library, which was sequenced using the Illumina HiSeq 3000 system.

Figure 1

Overview of the annotation workflow. The workflow includes the construction of the transcript assembly (upper part) and a pipeline for the genome annotation (lower part). The transcript assembly and annotations from related species provide evidence for the annotation of PCGs.

Transcript sequencing and assembling

For transcriptome analysis, MinION long-read RNA sequencing of the entire bodies of worker ants from a laboratory colony (pleometrotic colony from Pine Valley, CA, USA) was performed. We extracted RNA using the Monarch® Total RNA Miniprep Kit (New England BioLabs GmbH, Frankfurt, D, E2010). Material was ground in a Mixer Mill 200 (Retsch GmbH, Haan, D) in a protection reagent. A quality check was performed with Bioanalyzer, Nanophotometer, and Qubit. The library was prepared from 5 µg of the total RNA using the cDNA-PCR sequencing kit SQK_PCS_9035_v108_revD_26.6.17 (Oxford Nanopore Technologies, Oxford, UK). The library was sequenced using MinION and the flow cell FLO-MIN107 R9 (Oxford Nanopore Technologies, Oxford, UK). ONT’s Albacore software, version 2.3.1 with standard parameters, was used for base calling, and only sequences that passed a standard quality check (placed in the “pass” folder by the basecaller) were used for further analyses. RNA Illumina reads from Helmkampf were aligned employing HISAT2, version 2.1.0 (Kim ), for a genome-guided assembly. For the genome-independent transcript assembly based on Trinity, version 2.8.4 (Grabherr ), and the next generation sequencing (NGS) RNA-Seq data, we used the Trinity assembly provided by Helmkampf . In addition, minimap2, version 2.17 within the FLAIR pipeline, version 1.4, was used for aligning nanopore long reads and the Trinity assemblies to the genome. Finally, StringTie2, version 2.0.1 (Kovaka ), was employed to link the different transcript assemblies, filtered by a minimum FPKM of 0.14, as also performed by Helmkampf .

Repeat annotation

We used two independent pipelines for de novo repeat discovery, namely RepeatModeler, version 1.0.11 (http://repeatmasker.org/RepeatModeler/), and REPET, version 2.5 (Flutre ). The obtained libraries were merged with Hymenoptera-specific repeats from RepBase, version 22.07 (Bao ). TEclass software, version 2.1.3 (Abrusán ), was used for classification of consensus sequences lacking TE-family assignment. Finally, we removed sequences sharing more than 90% identity by employing CD-HIT, version 4.7 (Fu ). This resulted in a library of 2595 consensus sequences, which were used to annotate repeats in the P. californicus genome using RepeatMasker, version 4.0.7 (Smit ).

Protein-coding gene annotation

The identification of protein-coding genes (PCGs) in the newly assembled genome of P. californicus was carried out by GeneModelMapper (GeMoMa), version 1.6.1 (Keilwagen ), followed by MAKER2, version 2.31.10 (Holt and Yandell 2011) (see Figure 1). We used annotations of four insect species (P. barbatus, S. invicta, Camponotus floridanus, and A. mellifera) to run GeMoMa. These annotations were downloaded from NCBI (see Supplementary Table S1). GeMoMa was run for each reference species separately and the results were merged using the GeMoMa annotation filter (GAF). Next, four runs of MAKER2 were used to refine the genome annotation. MAKER2 was used with the following data: GeMoMa predictions, transcript assembly, transcript and protein annotations from related species, and RepeatMasker annotation (see above). AUGUSTUS (Stanke and Morgenstern 2005), which is a part of the MAKER2 pipeline, was trained on the AUGUSTUS reference model from Nasonia for the first run and trained on the created P. californicus reference model by applying BUSCO, version 3.0.2 (Waterhouse ), for the third run. In addition, SNAP (Korf 2004) was run for the last three MAKER2 runs and trained on Hidden Markov Model (HMM) reference models from gene predictions of the previous run with a minimum length of 50 amino acids and a maximum annotation edit distance of 0.25. Redundant identical transcripts and proteins within the final predictions of MAKER2 were filtered with CD-HIT, version 4.7 (Fu ).
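
The selection of gene models for SNAP training described above can be sketched as a simple filter. This is only an illustration: the `GeneModel` record, the function name, and the toy values are hypothetical and not part of the published pipeline, and treating both thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Hypothetical minimal record for one predicted gene model."""
    gene_id: str
    protein_length: int  # length of the predicted protein in amino acids
    aed: float           # annotation edit distance, 0 (best) to 1 (worst)


def select_training_models(models, min_length=50, max_aed=0.25):
    """Keep models with at least min_length residues and AED of at most max_aed.

    Inclusive comparisons are an assumption; the text only states a
    minimum length of 50 amino acids and a maximum AED of 0.25.
    """
    return [
        m for m in models
        if m.protein_length >= min_length and m.aed <= max_aed
    ]


# Toy input, invented for illustration only.
models = [
    GeneModel("g1", 320, 0.05),  # kept
    GeneModel("g2", 42, 0.10),   # too short
    GeneModel("g3", 150, 0.40),  # AED too high
    GeneModel("g4", 50, 0.25),   # kept (both thresholds met exactly)
]
training_ids = [m.gene_id for m in select_training_models(models)]
print(training_ids)
```

In a real run, the retained models would be exported and passed to the SNAP training tools; that step is omitted here.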

Functional classification of PCGs

The functional classification of the unique PCGs was based on sequence similarity. NCBI’s non-redundant (nr) protein database was searched using BLASTP, version 2.2.31 (Altschul ), with default settings except for an e-value of 1e−6 and a coverage threshold as described below. We considered three possibilities for the overlap of query and reference sequences. First, the matches of the BLAST alignment cover more than 70% of both the reference protein and the query protein. In this case, the query protein is similar to the reference protein. Second, if the query protein is just a part of the reference protein, the BLAST matches will cover more than 70% of the query sequence but less than 70% of the reference sequence. Lastly, the reference sequence might be included in the query protein. In this case, the BLAST matches cover more than 70% of the reference protein but less than 70% of the query protein. This allowed the functional description of the reference protein (annotated protein in the nr database) to be transferred to the query protein (P. californicus protein predicted by the MAKER2 annotation). Further downstream analysis was done with Interproscan, version 5.30 (Jones ), for detection of protein domains in classified and non-classified proteins. This analysis includes several resources, namely PANTHER, Pfam, Gene3D, SUPERFAMILY, MobiDBLite, ProSiteProfiles, SMART, CDD, Coils, PRINTS, TIGRFAM, PIRSF, Hamap, ProDom, and SFLD.
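
The coverage rules above can be written down compactly. The following is a minimal sketch, assuming coverage is computed as aligned length divided by sequence length; the function names and category labels are hypothetical, and the handling of a coverage of exactly 70% (treated here as not passing) is an assumption because the text does not specify it.

```python
def coverage(aligned_length, sequence_length):
    """Fraction of a sequence covered by the BLAST alignment."""
    return aligned_length / sequence_length


def classify_hit(query_cov, ref_cov, threshold=0.70):
    """Assign a BLAST hit to one of the overlap cases described in the text.

    query_cov and ref_cov are fractions between 0 and 1.
    """
    query_ok = query_cov > threshold
    ref_ok = ref_cov > threshold
    if query_ok and ref_ok:
        return "similar"                 # both proteins largely covered
    if query_ok:
        return "query_within_reference"  # query is part of the reference
    if ref_ok:
        return "reference_within_query"  # reference is included in the query
    return "below_threshold"             # neither side reaches the threshold


# Toy example: a 200-residue query aligned over 180 residues
# to a 400-residue reference protein.
label = classify_hit(coverage(180, 200), coverage(180, 400))
print(label)
```

The fourth label covers hits for which neither sequence reaches the threshold, a case that is reported separately in the results.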

Odorant receptor annotation

We annotated odorant receptors (ORs) for the genomes of P. californicus and P. barbatus using manually curated OR gene models from three other ant species: Acromyrmex echinatior, Atta cephalotes, and S. invicta (McKenzie ). Initial OR gene models were annotated with exonerate, version 2.4.0, and GeMoMa, version 1.4, and combined with Evidence Modeler, version 1.1.1 (Haas ). All models were screened for the 7tm_6 protein domain typical for insect OR proteins with PfamScan, version 1.5. All genes were further assigned to different OR protein subfamilies by aligning the protein sequence against a set of OR subfamily reference sequences (S. McKenzie, personal communication). Protein alignment was calculated with MAFFT (Katoh ) using the following parameters: −globalpair = T, −keeporder = T, −maxiterate = 16. The resulting alignment was trimmed employing trimal with the parameters: −keepheader = T −strictall = T (Capella-Gutiérrez ). The phylogenetic tree of all predicted OR gene models in both ant species was inferred with FastTreeMP (Price ) with the following settings: −pseudo −lg −gamma.

Annotation of non-PCGs

In addition to the PCGs, non-coding genes were annotated as well. Genes for tRNAs have been annotated with tRNAscan-SE, version 2.0.3 (Chan ). Other types of ncRNAs, including rRNAs, snRNAs, snoRNAs, miRNAs, and lncRNAs, were predicted by Infernal, version 1.1.2 (Nawrocki and Eddy 2013). To this end, we downloaded the Rfam library, release 14.1, of the covariance models along with the Rfam clan file (https://rfam.xfam.org). Afterwards, cmscan, a built-in Infernal program, was used to annotate the RNAs represented in the Rfam library in the genome under study. Eventually, the lower-scoring overlaps were removed and the final results were used to generate the gff file containing the annotation of non-coding RNA genes. In addition, we searched for homologs of lncRNA genes from P. barbatus (based on the annotation of assembly from Supplementary Table S1) in P. californicus using Splign, version 2.1.0 (Kapustin ). Genes where the exons detected by Splign cover more than 90% of P. barbatus lncRNA genes were classified as lncRNA genes in the P. californicus genome assembly.
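
The 90% coverage criterion for lncRNA homologs can be illustrated with a small interval calculation. This is a sketch under stated assumptions: exon alignments are given as half-open (start, end) intervals in the coordinates of the P. barbatus lncRNA, coverage is measured against the reference length, and all function names are hypothetical. It does not reproduce Splign output parsing.

```python
def merged_length(intervals):
    """Total length covered by a set of half-open (start, end) intervals.

    Overlapping or adjacent intervals are merged so that no base is
    counted twice.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def is_lncrna_homolog(aligned_exons, reference_length, min_coverage=0.90):
    """True if the aligned exons cover more than min_coverage of the reference."""
    return merged_length(aligned_exons) / reference_length > min_coverage


# Toy example: two aligned exons covering 980 of 1000 reference bases.
print(is_lncrna_homolog([(0, 500), (520, 1000)], 1000))
```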

Comparative genomic analysis

The LAST aligner, version 909, was used for whole-genome alignments (Kiełbasa ). The P. californicus and P. barbatus genome assemblies were aligned in order to find cognate genes and to search for conserved synteny. We used BEDTools intersect, version 2.27.1 (Quinlan and Hall 2010), to compare the genome annotations and estimate the proportion of shared genes.

Assembly and annotation quality assessment

The nucleotide-level quality of the final assembly was evaluated using Merqury software (Rhie ). We assessed the completeness of our assembly and annotation using BUSCO, version 3.0.2 (Waterhouse ), and the DOGMA web server (Dohmen ; Kemena ). For BUSCO analyses, we used the Hymenoptera-specific single-copy orthologous genes from OrthoDB, version 9 (Zdobnov ). For DOGMA, we employed the insect domain core set from Pfam, version 32.

Data availability

All analyses, including the assembly and the annotation pipeline, are available at http://www.bioinformatics.uni-muenster.de/publication_data/P.californicus_annotation/index.hbi. The raw sequencing data are available at the NCBI Sequence Read Archive under accession number PRJNA622899 (https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA622899). Supplementary Material is available at figshare https://doi.org/10.25387/g3.13259183.

Results and discussion

Sequencing results

We performed two rounds of NGS genomic sequencing and transcriptome sequencing using nanopore long-read technology. We obtained 339,494,313 paired-end reads of 100 bp (about 34 Gb) after standard Illumina sequencing and 269,953,173 linked-reads of 150 bp (about 40.5 Gb) using the 10x Genomics technology. Only the latter reads were used for the genome assembly. Additionally, MinION transcriptome sequencing resulted in 394,085 reads ranging between 49 and 6182 bp. The N50 of this set was 660 bp and its total size was 241.6 Mb, with a median read Phred quality score of 7.8, which translates to about 85% accuracy.

Genome assembly and evaluation

We assembled a draft P. californicus genome using a linked-read 10x Genomics approach and the Supernova assembler. Assuming 244.5 Mb as the genome size of P. californicus (http://www.genomesize.com), our 10x Genomics data coverage was 162×. Supernova was originally designed for de novo assemblies of human genomes (Weisenfeld ). Nevertheless, it has recently been used successfully for non-human genome assemblies (Ozerov ; Wang ; Lu ). For human genomes, a 56× coverage is recommended. However, since there is not much information on the optimal coverage for non-model genomes, we performed a series of assemblies. We resampled our sequencing data to obtain coverages ranging from 47× to 162× (see Figure 2). In order to minimize the number of artificially duplicated and missing BUSCO genes, we decided that a coverage of 75× is optimal for the assembly of the P. californicus genome (see Figure 2). Based on this assembly (75× coverage), the P. californicus draft genome consisted of 6793 contigs totaling 240,287,203 bp with about 13 undetermined nucleotides (Ns) per 1 kb. The Supernova assembly was followed by three rounds of polishing by Pilon. This further improved the assembly, resulting in a final assembly of 241,081,918 bp with a reduced number of N characters (see Supplementary Table S2). The nucleotide-level quality value of the final assembly evaluated by Merqury (Rhie ) was 45.56, which corresponds to 99.99% accuracy (error rate = 2.78e−05).
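
The relation between the reported quality value and the error rate follows the standard Phred scale, Q = −10 log10(p). A minimal sketch of the conversion is given below; the function names are illustrative, and the only input taken from the text is the quality value of 45.56.

```python
import math


def phred_to_error(q):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability back to a Phred-scaled quality value."""
    return -10 * math.log10(p)


# Quality value of the final assembly as reported in the text.
assembly_error = phred_to_error(45.56)
assembly_accuracy = 1 - assembly_error
print(f"error rate = {assembly_error:.2e}, accuracy = {assembly_accuracy:.4%}")
```

The same conversion applies to per-read quality scores such as the median Phred score reported for the MinION reads.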

Figure 2

Raw read coverage effect on assembly size and quality. Please note that assembly size is provided in mega base pairs.

Compared with the genome assemblies of the related species used in the annotation pipeline, our genome assembly has a very low proportion of N characters. This means that the gaps between contigs within scaffolds are shorter and that a smaller portion of the input sequencing reads contains N characters (see Table 1). This effect is most noticeable when comparing the assemblies of the two congeners (P. californicus and P. barbatus) in our set of insects. The scaffold N50 of the P. barbatus assembly is five times higher because of its about six times higher proportion of N characters. Additionally, considering the assembly size difference of 5 Mb between these two ants, we believe that we present a more complete assembly and consequently a better annotation of the P. californicus genome than that available for P. barbatus. Moreover, because our genome assembly is more fragmented, its annotation may include more erroneous transcript models.

Table 1

Comparison of genome assemblies of related insect species

Parameter	P. californicus	P. barbatus	S. invicta	C. floridanus	A. mellifera
Assembly size	241 Mb	236 Mb	399 Mb	284 Mb	250 Mb
Scaffold N50	208,871 bp	819,605 bp	621,039 bp	1,585,631 bp	997,192 bp
Scaffold N90	16,229 bp	117,988 bp	1,950 bp	211,219 bp	147,519 bp
Number of scaffolds	6,793	4,645	66,904	657	5,644
Percent of N characters in the assembly	1.15	6.60	8.31	0.62	8.45
GC content (%)	36.7	36.5	36.2	34.3	32.7
RefSeq assembly ID	n/a	GCF_000187915.1	GCF_000188075.2	GCF_003227725.1	GCF_000002195.4

With the exception of P. californicus, the data were taken from NCBI.
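
For readers unfamiliar with the N50 and N90 statistics in Table 1, the following sketch shows how they are commonly computed from a list of scaffold lengths: the lengths are sorted in descending order, and the statistic is the length of the scaffold at which the running total first reaches the given percentage of the assembly size. The function name and the toy lengths are illustrative only and are not taken from any of the assemblies above.

```python
def nxx(lengths, percent=50):
    """Return the Nxx statistic (e.g. N50, N90) for a list of sequence lengths.

    Integer arithmetic is used for the threshold comparison to avoid
    floating-point rounding at the boundary.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 100 >= percent * total:
            return length


# Toy scaffold lengths (invented for illustration); they sum to 300.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(nxx(toy_lengths, 50), nxx(toy_lengths, 90))
```

Because N50 depends only on scaffold lengths, long runs of N characters that join contigs into scaffolds increase it without adding sequence information.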

Annotation of repetitive sequences

Annotation of repetitive sequences was performed in two stages. First, we built a library of repetitive elements, which was later used to annotate individual repeats and mask the genome for annotation of different gene types. We used two different de novo pipelines to compile consensus sequences of P. californicus repetitive sequences, namely RepeatModeler (http://repeatmasker.org/RepeatModeler/) and REPET (Flutre ). After adding 1240 Hymenoptera-specific repeats from RepBase, our library consisted of 3156 sequences, which were subjected to redundancy filtering using CD-HIT with the cutoff level set at 90%. The final library contained 2595 sequences ranging from 42 to 28,331 bp (median equal to 988 bp). Three hundred forty-five sequences in this dataset were unclassified and TEclass was employed to classify them. We were able to classify most of them, and only 71 sequences in our TE library remained unclassified. This library was used as a TE-reference set for a RepeatMasker run. In total, 20.25% of the genome was occupied by repetitive elements, including simple repeats and low complexity regions, which account for 3.98% and 0.53% of the genome, respectively. Not surprisingly, most of the repeats are of TE origin and all major groups of TEs are represented in the P. californicus genome. DNA elements are most common, followed by LTR retrotransposons and LINEs (see Table 2). Interestingly, SINEs are very rare in the genome. However, it is possible that most of the unclassified interspersed repeats are actually SINEs.

Table 2

Transposable elements present in the P. californicus genome

TE-class	Number of elements	Fraction of the genome
LTR	15,391	4.38%
LINE	9,525	1.42%
SINE	596	0.03%
DNA	72,737	8.69%
Unclassified	13,610	1.37%

Annotation of PCGs

A homology-based GeMoMa annotation followed by four runs of MAKER2 resulted in 15,688 PCGs, which included 170 exact duplicates of potential transcripts. All following downstream analyses were based on a non-redundant set of 15,518 transcripts, and the translated proteins of this set are referred to as unique. Additionally, 2288 unique isoforms were annotated by our pipeline based on RNA-seq data. Detailed information on the number of predicted genes at different stages is provided in Supplementary Figure S1. The missing gene numbers presented in this figure come from a BUSCO assessment of the unique P. californicus transcripts. Isoforms from the MAKER annotation are referred to as alternative transcripts with different intron/exon decomposition (Campbell ). For an annotation with MAKER, it is recommended to run it at least three times. The drastic increase in the number of annotations in the second MAKER run results from training SNAP on the filtered annotations of the first MAKER run. The strong reduction in the number of annotations in the third MAKER run results from training AUGUSTUS on the P. californicus genome using BUSCO and forcing detection of start and stop codons in order to predict complete genes. The functional classification of predicted genes was done employing BLASTP against NCBI’s nr protein database. We distinguish three categories of functional annotation: (1) 8807 predicted genes were similar to a protein present in the nr database with an alignment coverage of at least 70% of the protein length for both query and target; (2) 3129 predicted genes were similar to an nr protein, but the alignment covered less than 70% of either the query or the target; and (3) for 2047 predicted proteins, neither the query nor the target reached the 70% alignment coverage threshold. The predictions in the last category show some similarity to known proteins but may be novel proteins, as they are not clearly classified. The remaining 1535 predicted proteins did not have any cognate protein in the nr database. Therefore, in total, we classified about 90% of all predicted proteins. These also include 54 proteins that consist of multiple domains potentially representing individual proteins. This may be the result of protein fusion or erroneous gene prediction. Interestingly, of the 1535 potential orphan genes from P. californicus, 544 are apparently present in the P. barbatus genome; however, they are missing from the current P. barbatus annotation. The number of orphan genes or TSG/LSG (taxon-specific/lineage-specific genes) in P. californicus is what would be expected for two relatively closely related ant species but is much lower than what has been shown in leaf-cutter ants (Wissler ). In comparison to other related insect genomes, we have annotated more PCGs (see Table 3). This may suggest that our pipeline produced some false-positive predictions. Interestingly but not surprisingly, non-classified proteins are on average significantly shorter than classified proteins: non-classified proteins are on average 108 amino acids long versus an average length of 536 amino acids for classified proteins (see Supplementary Figure S2).
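
The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the dictionary keys are shorthand labels introduced here for illustration.

```python
# Numbers taken from the text; the key names are illustrative shorthand.
predicted_pcgs = 15688
exact_duplicates = 170
categories = {
    "both_covered_70": 8807,      # category (1)
    "one_side_below_70": 3129,    # category (2)
    "neither_covered_70": 2047,   # category (3)
    "no_hit_in_nr": 1535,         # no cognate protein in nr
}

unique_proteins = predicted_pcgs - exact_duplicates
classified = unique_proteins - categories["no_hit_in_nr"]
classified_fraction = classified / unique_proteins

print(unique_proteins, sum(categories.values()))
print(f"classified: {classified_fraction:.1%}")
```

The four categories add up to the non-redundant set of 15,518 proteins, and the proteins with at least one hit in nr make up roughly 90% of that set, in line with the statement above.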

Table 3

Comparison of P. californicus genome annotation with selected Hymenopteran genomes

Species	Assembly size	Protein coding	tRNA	lncRNA	Other RNA	Total	Assembly version	Annotation version
Acromyrmex echinatior	296 Mb	11,219	159	1,210	449	13,037	Aech_3.9	100
Camponotus floridanus	233 Mb	12,512	208	1,243	696	14,659	Cflo_v7.5	102
Dinoponera quadriceps	260 Mb	11,048	212	570	493	12,323	ASM131382v1	100
Harpegnathos saltator	335 Mb	12,654	230	1,385	928	15,197	Hsal_v8.5	102
Linepithema humile	220 Mb	11,610	178	1,411	655	13,854	Lhum_UMD_V04	100
Monomorium pharaonis	326 Mb	14,019	186	3,126	1,318	18,649	ASM1337386v2	102
Ooceraea biroi	224 Mb	11,907	202	1,571	970	14,650	Obir_v5.4	100
Pogonomyrmex barbatus	236 Mb	11,348	201	1,138	406	13,093	Pbar_UMD_V03	101
Pogonomyrmex californicus	241 Mb	15,688	1,180	931	79	17,878	n/a	n/a
Pseudomyrmex gracilis	283 Mb	11,572	193	935	558	13,258	ASM200609v1	100
Solenopsis invicta	399 Mb	14,820	227	1,376	691	17,114	Si_gnH	103
Apis mellifera	225 Mb	9,935	218	3,146	1,295	14,594	Amel_HAv3.1	104

All the data are taken from NCBI’s genome database except P. californicus.

In addition to the sequence similarity classification, we also performed further protein domain analysis with Interproscan. Ninety-one percent of classified proteins include predictions from Interproscan (see Supplementary Figure S3), while only 24% of non-classified proteins show some Interproscan predictions (see Supplementary Figure S4). The Interproscan results for classified predictions come mostly from PANTHER (Protein ANalysis THrough Evolutionary Relationships), which is a protein classification system (Thomas ), and Pfam, which is a large collection of protein domains (El-Gebali ). These predictions provide additional support for the classified proteins. Most predictions for the non-classified proteins come from MobiDBLite, which is included in the Interpro database and is used for detection of long intrinsically disordered regions (Necci ). Because of intrinsic disorder (ID) and domains missing from the Pfam database, at least 10% of the human proteome lacks protein domain detections (Mistry ). This suggests that these proteins remain non-classified because of ID and/or incomplete databases. Based on the length of the non-classified proteins (see Supplementary Figure S2), they seem to include several small proteins, which are of considerable biological importance but are not annotated by most annotation pipelines (Su ).

Odorant receptors

Chemical communication and perception of olfactory cues via ORs is essential for the performance of many tasks in ant colonies (Trible ; Yan ). Given the biological significance of this gene family in ants, we generated in-depth annotations of OR genes in the two closely related Pogonomyrmex species, P. californicus and P. barbatus, for which assembled genomes are available. Our custom pipeline predicted 417 OR gene models in the P. californicus genome and 454 OR gene models in the P. barbatus genome. Of these, 303 gene models were complete in P. californicus and 342 were complete in P. barbatus (see Figure 3, Supplementary Table S3). This nearly doubles the number of originally predicted OR gene models (274) published for P. barbatus (Smith ). Classifying our gene models by known OR gene families showed that most of them fall into the 9-exon (9E) family, with the next biggest families being L, V, E, and U in both P. barbatus and P. californicus (see Supplementary Table S4). This is in line with previous studies about OR genes in ants (Engsontia ; McKenzie and Kronauer 2018). A phylogenetic analysis of Pogonomyrmex ORs showed that most ORs can be considered as single-copy orthologs, as expected when comparing two closely related species. Clusters in the gene phylogeny of multiple genes from the same species would indicate either very recent gene duplications or losses (i.e. after the species split) or could hint at assembly errors in either genome. The lack of extensive same-species clusters (largest cluster: seven genes, no other cluster exceeding four genes) thus suggests that the assemblies are of equally high quality, with few signs of gene duplication through assembly errors.

Figure 3

OR gene repertoires are similar between P. californicus (N = 417 genes) and P. barbatus (N = 453 genes). Most gene models have their closest relative in the other species. The gene tree shows no large clusters containing genes exclusively of one of the two species. This is evidence for a close relatedness between the species and an equally high quality of the two genome assemblies.

Annotation of non-PCGs

There are several categories of functional RNAs, including tRNAs, lncRNAs, rRNAs, snoRNAs, snRNAs, and miRNAs. Annotation of some of these, e.g. tRNA or snRNA genes, is fairly straightforward thanks to their conserved secondary structure, whereas others, e.g. lncRNA genes, are more difficult to annotate. Nevertheless, we were able to annotate more than 2000 such genes in the P. californicus genome (see Table 3). In general, the numbers of non-PCGs detected in the P. californicus genome are similar to those in other insect genomes, with the exception of tRNA genes, whose number is more than five times the usual number in insect genomes. Upon close inspection, it appeared that the excess of tRNA genes is due to an unusual number of tRNAThr genes, in particular of the GGT isotype. Moreover, these genes are identical to each other, including 200-bp flanking regions, suggesting that they might be an artifact of faulty assembly rather than a real biological phenomenon.

Assembly and annotation quality assessment

BUSCO and DOGMA were used for quality assessment. These programs rely on different signatures to estimate the completeness of genome assemblies and of the resulting annotation of transcripts and proteins. Duplicated transcripts and proteins within the annotations of related genomes were detected using CD-HIT, as was done for the P. californicus annotation (see Table 4). In general, results from the two programs are in good agreement. The small differences are consequences of the different methodologies employed by the software: while BUSCO searches for single-copy orthologous hymenopteran genes, DOGMA searches for Conserved Domain Arrangements (CDA) from an insect reference set. Our annotation of P. californicus is comparable to or exceeds the annotations of published ant genomes. The only parameter that seems to be significantly different in our assembly is the level of genome duplication reported by BUSCO, which is over 2% compared with less than 0.4% in other genomes. This is also reflected in the number of duplicated transcripts but, interestingly, not in the number of duplicated proteins (see Table 4). However, at this point, it is difficult to evaluate whether this phenomenon reflects an intrinsic biological feature of the P. californicus genome or results from a less-than-perfect assembly of the genome.

Table 4

Comparison of the completeness and quality of the genomes of hymenopteran insects used for the annotation of P. californicus

Species	BUSCO genome completeness (%)	BUSCO genome duplication (%)	BUSCO transcript completeness (%)	DOGMA transcript completeness (%)
P. californicus	95.80	2.20	91.60	94.80
P. barbatus	94.20	0.10	95.80	97.60
S. invicta	85.70	0.30	94.10	96.50
C. floridanus	85.90	0.30	99.20	98.10
A. mellifera	97.10	0.20	98.60	98.30

BUSCO and DOGMA analyses are based on unique sets of transcripts without duplicated sequences; soft-masked genomes were used in the BUSCO assessments.
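For illustration, the BUSCO genome duplication values from Table 4 can be compared directly. The short Python sketch below identifies the genome with the highest duplication level and relates it to the highest value among the remaining genomes. The variable names are illustrative, and the fold difference is only a descriptive ratio, not a statistical test.

```python
# BUSCO genome duplication (%) as listed in Table 4.
busco_duplication = {
    "P. californicus": 2.20,
    "P. barbatus": 0.10,
    "S. invicta": 0.30,
    "C. floridanus": 0.30,
    "A. mellifera": 0.20,
}

# Genome with the highest reported duplication level.
outlier = max(busco_duplication, key=busco_duplication.get)

# Highest duplication level among all other genomes.
highest_other = max(v for k, v in busco_duplication.items() if k != outlier)

# Descriptive fold difference between the outlier and the next-highest genome.
fold = busco_duplication[outlier] / highest_other

print(
    f"{outlier}: {busco_duplication[outlier]:.2f}% vs. at most "
    f"{highest_other:.2f}% in the other genomes ({fold:.1f}-fold)"
)
```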

Conclusions

With the availability of a genome assembly and annotation for P. californicus, we can now start to analyze the genetic architecture of the intraspecific social polymorphism, differences in aggressive behavior of founding queens, and adaptations to desert life in this widely distributed harvester ant. This will also allow us to test whether a supergene, similar to other cases of intraspecific social polymorphism, is responsible for this trait variation. We should also be able to demonstrate that the evolution of OR genes in both Pogonomyrmex species proceeded at approximately the same rate without any obvious major gene losses or gains.

Funding

This research was partly funded by the German Research Foundation (DFG) as part of the SFB TRR 212 (NC³), project number 316099922, and by an internal fund of the Institute of Bioinformatics.

Conflicts of interest: None declared.
