Literature DB >> 35929795

The genome sequence of the scarce swallowtail, Iphiclides podalirius.

Alexander Mackintosh¹, Dominik R Laetsch¹, Tobias Baril², Sam Ebdon¹, Paul Jay³, Roger Vila⁴, Alex Hayward², Konrad Lohse¹.

Abstract

The scarce swallowtail, Iphiclides podalirius (Linnaeus, 1758), is a species of butterfly in the family Papilionidae. Here, we present a chromosome-level genome assembly for Iphiclides podalirius as well as gene and transposable element annotations. We investigate how the density of genomic features differs between the 30 Iphiclides podalirius chromosomes. We find that shorter chromosomes have higher heterozygosity at four-fold-degenerate sites and a greater density of transposable elements. While the first result is an expected consequence of differences in recombination rate, the second suggests a counter-intuitive relationship between recombination and transposable element evolution. This high-quality genome assembly, the first for any species in the tribe Leptocircini, will be a valuable resource for population genomics in the genus Iphiclides and comparative genomics more generally.

Entities: Chemical

Keywords: zzm321990 Iphiclides podaliriuszzm321990 ; chromosome length; genome annotation; genome assembly; heterozygosity

Mesh：

Substances：

Year: 2022 PMID： 35929795 PMCID： PMC9434224 DOI： 10.1093/g3journal/jkac193

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.542

Introduction

The scarce swallowtail, Iphiclides podalirius (Linnaeus, 1758), is a widespread butterfly species in the family Papilionidae. The species is common in open habitats in the Palearctic, ranging from France to Western China, but is absent from Northern areas (e.g. Scandinavia and the British Isles) and some Mediterranean Islands (e.g. Sardinia, where only occasional records exist). I. podalirius is generally bivoltine, the larvae feed mainly on different species of Prunus, principally P. spinosa, and overwinter in the pupal stage. The genus Iphiclides belongs to the tribe Leptocircini (kite swallowtails), which diverged from other taxa in the subfamily Papilioninae about 55 million years ago (Allio, Scornavacca ), and only includes two other species: I. podalirinus, which is restricted to Central Asia and Iphiclides feisthamelii, the sister taxon of I. podalirius, which is found in Northern Africa and the Iberian Peninsula. A controversy about the taxonomic status of I. feisthamelii, which has been regarded as a subspecies of I. podalirius (Godart and Duponchel 1832; Tolman and Lewington 2009; Wiemers and Gottsberger 2010), has only recently been resolved; although the two species have no known differences in ecology or life history and share mitochondrial haplotypes (Dincă ), Gaunet show that they differ consistently in wing patterns (including UV reflectance of males), genital morphology and nuclear DNA and are separated by a narrow hybrid zone in Southern France (Descimon and Mallet 2009). Although Allio, Scornavacca have previously generated Illumina shotgun data for I. podalirius, the assemblies based on these data (Allio, Scornavacca ; Ellis ) are highly fragmented (with an N50 of 0.6 and 1.7 kb, respectively). More generally, while chromosome-level assemblies are available for several swallowtail butterflies in the genus Papilio (Lu ), similarly contiguous genome assemblies are lacking for other tribes within Papilioninae. Here, we present a chromosome-level genome assembly for I. podalirius, as well as gene and transposable element (TE) annotations. We use this assembly to investigate how heterozygosity in the reference individual varies both between genomic partitions and chromosomes.

Materials and methods

Sampling

Two male individuals (MO_IP_500 and MO_IP_504) were collected with a hand net at Le Moulin de Bertrand, Saint-Martin-de-Londres, Montpellier, France, and flash frozen in liquid nitrogen. High-molecular-weight (HMW) DNA was extracted from the thorax of one of these individuals (MO_IP_504) using a salting out extraction protocol as described in Mackintosh .

Sequencing

A SMRTbell sequencing library was generated from the HMW extraction of MO_IP_504 by the Exeter Sequencing Service. This was sequenced on 3 SMRT cells on a Sequel I instrument to generate 24.0 Gb of Pacbio continuous long-read (CLR) data. A chromium 10× library was prepared from the HMW extraction at the Cancer Research UK Cambridge Institute, UK. This library was sequenced by Edinburgh Genomics (EG) on a single HiSeqX lane, generating 25.3 Gb of paired-end reads after processing with Long Ranger v2.2.2 (Marks ). However, the weighted mean molecule size of these data (12.86 kb) limited its use for scaffolding of the Pacbio assembly and the reads were therefore only used for polishing. Hereafter, these data are simply referred to as Illumina WGS. In addition, tissue from MO_IP_500 was used for chromatin conformation capture (HiC) sequencing. The HiC reaction was performed using an Arima HiC kit, following the manufacturer’s instructions for flash-frozen animal tissue. An NEBNext Ultra II library, prepared by EG, was sequenced on an Illumina MiSeq, generating 4.7 Gb of paired-end reads. Paired-end RNA-seq data were generated for individual MO_IP_504. To obtain tissue for RNA extraction, the adult individual was divided bilaterally (including all parts of the body: head, thorax, and abdomen). For further details on RNA extraction, see Ebdon .

Genome assembly

Pacbio reads 2 kb (40.4× coverage) were assembled using wtdbg2.5 (Ruan and Li 2020) with the options: -L 2000 -x sq -g 400 m. Errors in the consensus sequence were corrected by three rounds of Pacbio CLR polishing and two rounds of Illumina WGS polishing using Racon v1.4.10 (Vaser ) while retaining any unpolished sequences. Contigs belonging to nontarget organisms were identified using blobtools v1.1.1 (Laetsch and Blaxter 2017) and subsequently removed. Finally, duplicated regions and contigs with extremely low (<5×) or high (>200×) coverage were identified and removed with purge_dups v1.2.5 (Guan ). Mapping of long reads and short reads for the above steps was performed with minimap2 v2.17 and bwa-mem v0.7.17, respectively (Li 2013, 2018). The HiC and RNA-seq reads were adapter and quality trimmed with fastp v0.2.1 (Chen ). The trimmed HiC reads were aligned to the contig-level assembly with Juicer v1.6 (Durand ). Scaffolding was performed with 3d-dna v180922 using default parameters (Dudchenko ). The scaffolded assembly and HiC map generated by 3d-dna was visualized and manually curated in Juicebox v1.11.08 (Robinson ). A total of 7 contigs had their orientation changed through manual curation. Gene completeness was evaluated using BUSCO v5.0.0 with the lepidoptera_odb10 dataset (n = 5,286) (Manni ). Kmer QV was calculated using Merqury v1.1 (Rhie ). Genome size and heterozygosity were estimated from the Merqury kmer spectrum using Genomescope 2.0 (Ranallo-Benavidez ). The mitochondrial genome was assembled and annotated using the Mitofinder pipeline v1.4 (Allio, ). Illumina WGS reads were assembled with metaSPAdes v3.14.1 (Nurk ) and tRNAs were annotated with MiTFi (Jühling ).

Genome annotation

TEs were annotated using the Earl Grey TE annotation pipeline (Jurka ; Xu and Wang 2007; Rubino and Creevey 2014; Smit ; Hubley ; Platt ; Ou and Jiang 2019; Wong and Simakov 2019; Flynn ; Baril ) as in Mackintosh . Following gene annotation, 5′ and 3′ gene flanks were defined as those that were 20 kb upstream and downstream of genes. We define regions as intergenic space if they are neither genic (start/stop codons, exons, and introns) nor gene flanks. Bedtools intersect v2.27.1 (Quinlan and Hall 2010) was used to determine overlap (-wao) between TEs and genomic features. Following this, quantification and plotting were performed in R, using the tidyverse package (Wickham ; RStudio Team 2020; R Core Team 2021). The trimmed RNA-seq reads were mapped to the assembly with HISAT2 v2.1.0 (Kim ). The softmasked assembly and RNA-seq alignments were used for gene prediction with braker2.1.5 (Stanke , 2008; Li ; Barnett ; Lomsadze ; Buchfink ; Hoff , 2019). Gene annotation statistics were calculated with GenomeTools v1.6.1 (Gremme ). Functional annotation was carried out using InterProScan v5.50-84.0 (Jones ) and the Pfam-33.1 database (Mistry ).

Estimating heterozygosity

Heterozygosity was estimated within different partitions of the genome by first mapping the Illumina WGS reads to the assembly with bwa-mem, marking duplicated alignments with sambamba 0.8.1 (Tarasov ), and calling variants with freebayes v1.3.2-dirty (Garrison and Marth 2012). Variants were normalized with bcftools v1.8 (Danecek ), decomposed with vcfallelicprimitives (Garrison ), filtered (QUAL), and subset to biallelic SNPs with bcftools. Callable sites were delimited with mosdepth v0.3.2 (Pedersen and Quinlan 2018), by applying a coverage threshold between 8 and 95 (inclusive). Callable sites were further restricted by removing all sites where there were indels or SNPs with QUAL < 1. Four-fold-degenerate (4D) and zero-fold-degenerate (0D) sites were identified using partition_cds.py v0.2 (see Data availability), based on the CDS BED file (where the 4th column contains transcript ID), the genome FASTA, and the VCF file. Bedtools v2.30.0 was used to intersect callable regions of the genome with intergenic, intronic, exonic, 4D, and 0D sites. Heterozygosity—the number of heterozygous bi-allelic SNPs divided by the number of callable sites—and callable sites of each genomic partition are listed in Table 1.

Table 1.

Estimates of heterozygosity in different partitions of the genome.

Partition	Callable sites (Mb)	Heterozygosity
All	420.3	0.00598
Exonic	19.3	0.00295
Intronic	126.1	0.00615
Intergenic	275.1	0.00609
0D	12.4	0.00157
4D	3.1	0.00680

As a comparison, the kmer-based heterozygosity estimated from all Illumina reads is 0.00702.

Estimates of heterozygosity in different partitions of the genome. As a comparison, the kmer-based heterozygosity estimated from all Illumina reads is 0.00702.

Results and discussion

We sequenced the genome of a male individual (Fig. 1. a and b) by generating both Pacbio continuous long reads (55.7× coverage) and Illumina paired-end short reads (58.8× coverage). From these data, we assembled a haploid representation of the genome consisting of 265 contigs with a total span of 430.6 Mb. The contig assembly is slightly larger than a genome size estimate based on the flow cytometry of male individuals (386.6 Mb, Mackintosh ) and a previous assembly for this species based only on Illumina data (390.9 Mb, Allio, Scornavacca ). However, it is smaller than an estimate of the genome size from kmers in the Illumina short reads (471 Mb, Supplementary Fig. 1), suggesting that the flow cytometry may have produced an underestimate and that the Illumina-based assembly contains collapsed repetitive regions. We scaffolded the contigs with Arima HiC data (11.0x coverage) generated from a different male individual collected at the same locality (Fig. 1, c and d). Scaffolding generated 30 chromosome-scale sequences, which together account for 99.5% of the total assembly length and range from 6.8 to 21.1 Mb in size (Supplementary Fig. 2). The assembly has a contig and scaffold N50 of 5.2 and 15.1 Mb, respectively.

Fig. 1.

Fore and hind wings of the two I. podalirius individuals used to generate the genome sequence. a) Dorsal and b) ventral surface view of wings of specimen MO_IP_504, used to generate Pacbio long-read, Illumina WGS short-read, and Illumina RNA-seq short-read data. c) Dorsal and d) ventral surface view of wings of specimen MO_IP_500, used to generate HiC data. The BUSCO analysis shows that the assembly contains the majority of expected single-copy orthologs with little duplication (S: 96.5%, D: 0.2%, F: 0.4%, M: 2.9%). The Phred quality of the consensus sequence, estimated using solid kmers in the short-read data, is 35.8. We assembled a circularized mitochondrial genome of 15,396 bases, containing 13 protein-coding genes, 22 tRNA genes, and 2 rRNA genes. The sequence can be aligned colinearly with the mitochondrial genome of Graphium doson (Kong ), another species in the tribe Leptocircini, demonstrating that these mitochondrial genomes have not undergone any rearrangements. TEs compromise 32.81% of the genome assembly (Supplementary Table 1 and Fig. 2a). The assembly contains representation from all major TE types (Supplementary Table 1): the most abundant TEs are long-interspersed nuclear elements (LINEs), which constitute 11.01% of the assembly and 33.56% of total TE sequence. Recent activity is high in LINEs and there is also evidence for a very recent increase in LTR element abundance (Fig. 2b).

Fig. 2.

TEs within the genome assembly of I. podalirius. a) The proportion of the assembly comprised of the main TE classifications. b) A repeat landscape plot illustrating the proportion of repeats in the genome at different genetic distances (%) to their respective RepeatModeler consensus sequence. Genetic distance is calculated under a Kimura 2 parameter model with correction for CpG site hypermutability. Lower genetic distances suggest more recent activity. c) The abundance of TEs in different partitions of the genome, shown in bases and as a proportion of the partition. Considering all TE classifications, most TEs (70.14%) are found in intergenic regions. As expected, TEs are largely absent from exons with only 0.07% of exonic sequence consisting of TEs, likely due to the deleterious effects of TE insertions in exons (Sultana ; Bourque ). In contrast, a substantial fraction of intronic sequence (31.47%) is comprised of TEs (Fig. 2c). The most abundant TEs in the assembly, LINEs, comprise 14.25% of intergenic space, 12.14% of gene flanks, 8.98% of intronic regions, and 0.69% of exonic regions (Fig. 2c). We annotated the assembly with 17,826 protein-coding genes, coding for 20,222 transcripts (1.13 transcripts per gene). At least 1 Pfam domain was identified along proteins of 9,363 genes (52.5%) and start codons were found in genes coding for 20,163 proteins (99.71%). The BUSCO score of the protein sequences (S: 66.9%, D: 12.8%, F: 5.5%, M: 14.8%) is lower than the score of the genome sequence (see above) with an expected increase in duplicated BUSCOs. The median length of genes is 4.0 kb, with the majority (51.7%) consisting of 4 exons or fewer. The number of gene predictions is higher than in genome annotations for species in the genus Papilio, such as Papilio dardanus (12,795, Timmermans ) and Papilio bianor (15,375, Lu ), but far lower than in the annotation of the Parnassius apollo genome (28,344, Podsiadlowski ). The density of annotated genomic features differs across chromosomes (Fig. 3 and Supplementary Table 2). For example, the proportion of sequence made up of TEs ranges from 24.4% on chromosome 7 to 39.4% on chromosome 29. Similarly, exon density also ranges approximately two-fold across chromosomes, from 3.3% on chromosome 25 to 6.4% on chromosome 30. The density of TEs is negatively correlated with chromosome length (Spearman’s rank correlation, , P = 0.004), whereas exon density and chromosome length are uncorrelated (Fig. 3).

Fig. 3.

The relationship between chromosome length and a) heterozygosity at 4D sites, b) exon density, and c) TE density.

Genome-wide heterozygosity

Across the genome assembly, we identified 2,514,242 heterozygous SNPs in the reference individual MO_IP_504, which is equivalent to a per-base heterozygosity (across all sites) of 0.00598 (Table 1). As expected, given selective constraint, heterozygosity in exons is less than half of that in introns and intergenic regions (Table 1). Likewise, within coding sequence heterozygosity is highest for 4D sites and lowest for 0D sites (Table 1), which is expected given that the latter are under strong evolutionary constraint (Sawyer ). We note that our estimate of 4D site heterozygosity is comparable, but slightly higher, than estimates previously reported for I. podalirius based on transcriptome assemblies and data from two individuals (0.0052, 0.0057) (Mackintosh ; Ebdon ). This difference most likely reflects the fact that transcriptome assemblies are biased toward highly expressed transcripts, which experience greater indirect effects of purifying selection (Marek and Tomala 2018). Heterozygosity at 4D sites () varies by chromosome (Fig. 3 and Supplementary Table 2): it is lowest on chromosome 1 (0.00328, the putative Z chromosome) and highest on chromosome 25 (0.01026, an autosome). We find a significant negative correlation between chromosome length and (Spearman’s rank correlation, ) (Fig. 3a).

Conclusions

We describe a chromosome-level genome assembly for the scarce swallowtail butterfly I. podalirius, with gene and repeat annotations that are similar to previously published Papilio butterfly genomes. By contrast, the number of gene predictions and TEs in the P. apollo genome assembly is far greater (Podsiadlowski ), suggesting gene and repeat expansions in the subfamily Parnassiinae. When comparing heterozygosity in the reference individual both between chromosomes and between genomic partitions, we recover two well-documented patterns of genome-wide sequence variation, which result from the direct and indirect effects of selection, respectively: (1) stark differences in genetic diversity between genomic partitions reflecting differences in selective constraint and (2) a negative correlation between chromosome length and heterozygosity at putatively neutral 4D sites. The latter has been described for several species (including butterflies Martin ) and is an expected consequence of the fact that the indirect effect of selection on nearby neutral sites depends on the rate of recombination (which tends to be greater for short chromosomes, Haenel ). We also find a negative relationship between chromosome length and TE density. This is surprising given that increased recombination rates on shorter chromosomes are expected to result in more efficient selection against TE insertions (Langley ). Despite this, the observation that smaller chromosomes contain a greater density of repetitive elements has also been reported in Heliconius and Melitaea butterflies (Ahola ; Cicconardi ), suggesting that this may be a general feature of Lepidopteran genomes. The I. podalirius genome will be valuable resource—not only for genomic analyses that investigate the forces driving genome evolution in the long term—but will also allow for detailed studies of the population-level processes that lead to the accumulation of barriers between species still experiencing gene flow.

Data availability

The genome assembly, gene annotation, and raw sequence data can be found at the European Nucleotide Archive under project accession PRJEB51340. The script used for calculating the site degeneracy (partition_cds.py) and the script used for visualizing HiC contacts (HiC_view.py) can be found at the following github repository: https://github.com/A-J-F-Mackintosh/Mackintosh_et_al_2022_Ipod. The mitochondrial genome sequence and the TE annotation can be found at the same repository. Supplemental material is available at G3 online. Click here for additional data file.

57 in total

1. BamTools: a C++ API and toolkit for analyzing and managing BAM files.

Authors: Derek W Barnett; Erik K Garrison; Aaron R Quinlan; Michael P Strömberg; Gabor T Marth
Journal: Bioinformatics Date: 2011-04-14 Impact factor: 6.937

2. Sambamba: fast processing of NGS alignment formats.

Authors: Artem Tarasov; Albert J Vilella; Edwin Cuppen; Isaac J Nijman; Pjotr Prins
Journal: Bioinformatics Date: 2015-02-19 Impact factor: 6.937

3. Whole Genome Shotgun Phylogenomics Resolves the Pattern and Timing of Swallowtail Butterfly Evolution.

Authors: Rémi Allio; Céline Scornavacca; Benoit Nabholz; Anne-Laure Clamens; Felix Ah Sperling; Fabien L Condamine
Journal: Syst Biol Date: 2020-01-01 Impact factor: 15.683

4. Confidence interval for the number of selectively neutral amino acid polymorphisms.

Authors: S A Sawyer; D E Dykhuizen; D L Hartl
Journal: Proc Natl Acad Sci U S A Date: 1987-09 Impact factor: 11.205

5. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar.

Authors: Erik Garrison; Zev N Kronenberg; Eric T Dawson; Brent S Pedersen; Pjotr Prins
Journal: PLoS Comput Biol Date: 2022-05-31 Impact factor: 4.779

10. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes.

Authors: Mosè Manni; Matthew R Berkeley; Mathieu Seppey; Felipe A Simão; Evgeny M Zdobnov
Journal: Mol Biol Evol Date: 2021-09-27 Impact factor: 16.240