Literature DB >> 31834364

An Annotated Draft Genome of the Mountain Hare (Lepus timidus).

João P Marques^1,2,3, Fernando A Seixas^1,2,3, Liliana Farelo¹, Colin M Callahan⁴, Jeffrey M Good^4,5, W Ian Montgomery⁶, Neil Reid⁶, Paulo C Alves^1,2,5, Pierre Boursot³, José Melo-Ferreira^1,2.

Abstract

Hares (genus Lepus) provide clear examples of repeated and often massive introgressive hybridization and striking local adaptations. Genomic studies on this group have so far relied on comparisons to the European rabbit (Oryctolagus cuniculus) reference genome. Here, we report the first de novo draft reference genome for a hare species, the mountain hare (Lepus timidus), and evaluate the efficacy of whole-genome re-sequencing analyses using the new reference versus using the rabbit reference genome. The genome was assembled using the ALLPATHS-LG protocol with a combination of overlapping pair and mate-pair Illumina sequencing (77x coverage). The assembly contained 32,294 scaffolds with a total length of 2.7 Gb and a scaffold N50 of 3.4 Mb. Re-scaffolding based on the rabbit reference reduced the total number of scaffolds to 4,205 with a scaffold N50 of 194 Mb. A correspondence was found between 22 of these hare scaffolds and the rabbit chromosomes, based on gene content and direct alignment. We annotated 24,578 protein coding genes by combining ab-initio predictions, homology search, and transcriptome data, of which 683 were solely derived from hare-specific transcriptome data. The hare reference genome is therefore a new resource to discover and investigate hare-specific variation. Similar estimates of heterozygosity and inferred demographic history profiles were obtained when mapping hare whole-genome re-sequencing data to the new hare draft genome or to alternative references based on the rabbit genome. Our results validate previous reference-based strategies and suggest that the chromosome-scale hare draft genome should enable chromosome-wide analyses and genome scans on hares.

Entities: Chemical Species

Keywords: Lagomorpha; Leporids; annotation; de novo assembly; hares; whole-genome sequencing

Mesh：

Year: 2020 PMID： 31834364 PMCID： PMC6951464 DOI： 10.1093/gbe/evz273

Source DB: PubMed Journal: Genome Biol Evol ISSN： 1759-6653 Impact factor: 3.416

Introduction

The ability to sequence whole genomes has revolutionized our power to study the evolution of non-model organisms. Hares (genus Lepus) have recently emerged as useful evolutionary models to understand introgressive hybridization and local adaptation (Alves et al. 2008; Jones et al. 2018; Seixas et al. 2018; Giska et al. 2019). Genomic analyses on this group have primarily relied on comparisons to the high-quality reference genome of another leporid, the more extensively studied European rabbit (Oryctolagus cuniculus) (Carneiro et al. 2014), estimated to share a most recent common ancestor with hares ∼12 million years ago (Matthee et al. 2004). Although these studies have generally used iterative mapping approaches to reduce divergence and increase mapping efficiency (e.g., Jones et al. 2018; Seixas et al. 2018), it remains unclear to what extent reliance on an outgroup reference may have limited genomic inferences. We extend the genomic resources of Leporids by assembling the first draft genome of a hare species, the mountain hare (Lepus timidus). The mountain hare is an arcto-alpine species widely distributed in the northern Palearctic, from western Europe to eastern Asia, with some isolated populations, as in the Alps, Poland, Great Britain, and Ireland. The current distribution of the species reflects the colonization of previously glaciated areas in the north, and the retreat from southernmost regions in the south (Waltari and Cook 2005; Hamill ; Melo-Ferreira et al. 2007; Smith et al. 2018). The species has been implicated in recurrent events of introgressive hybridization with other hare species from Europe (Thulin et al. 1997; Alves et al. 2003; Melo-Ferreira et al. 2009; Seixas et al. 2018; Giska et al. 2019), and displays important locally adapted traits, such as varying ecologies (Caravaggi et al. 2017), size differences among regions, or distinctive coat color (Smith et al. 2018; Giska et al. 2019). Furthermore, genus Lepus is distributed worldwide with more than 30 classified species, which show adaptions to contrasting environments, from arctic to arid regions. Detailed investigation of relevant evolutionary processes in the genus can benefit from the availability of hare-specific genomic resources (Fontanesi et al. 2016).

Materials and Methods

DNA Sampling, Extraction, and Sequencing

One female mountain hare (Lepus timidus hibernicus) specimen (NCBI BioSample ID SAMN12621015) was captured from the wild for scientific research purposes by the Irish Coursing Club (ICC) at Borris-in-Ossory, County Laois under National Parks & Wildlife (NPWS) license no. C 337/2012 issued by the Department of Arts, Heritage and the Gaeltacht (dated October 31, 2012). Genomic DNA was extracted from kidney, muscle, and ear tissue using the JETquick Tissue DNA Spin Kit (GENOMED), with RNAse and proteinase K to remove RNA and protein contamination. Genomic libraries of different insert lengths were generated following the standard ALLPATHS-LG protocol (Gnerre et al. 2011): one Illumina TruSeq DNA library of 180 bp fragments was sequenced with overlapping paired-end (OPE) reads, and three Illumina TruSeq DNA mate-pair (MP) libraries of 2.5, 4.5, and 8.0 kb insert sizes. Whole-genome sequencing was performed at The Genome Analysis Center (TGAC, currently Earlham Institute, Norwich, UK)—seven HiSeq2000 lanes (five OPE and two 4.5 kb MP)—and CIBIO’s New-Gen sequencing platform—three HiSeq1500 lanes (2.5 and 8.0 kb MP). Raw sequencing reads were deposited in the Sequence Read Archive (details in supplementary table S1, Supplementary Material online).

Read Quality Assessment and Filtering

Exact duplicates were removed both from OPE and MP libraries using PRINSEQ v0.20.4 (Schmieder and Edwards 2011b). PhiX sequences were identified using Bowtie2-v2.2.3 (Langmead and Salzberg 2012) and removed. Adapter sequences were removed using Cutadapt v1.4.1 (Martin 2011) for OPE reads and Skewer v1.3.1 (Jiang et al. 2014) for mate-pairs. For the latter, only pairs in the correct orientation determined by the presence of the junction adapter were retained.

Genome Size Estimation

Genome size was estimated using a k-mer-based approach (Marçais and Kingsford 2011). First, the frequency distribution of 17 bp k-mers was obtained using jellyfish v2.2.6 (Marçais and Kingsford 2011) based on the OPE raw reads—supplementary figure S1, Supplementary Material online. The sequencing depth was then calculated following M = N * (L − k + 1)/L, where M is the peak of the k-mer depth frequency distribution, L is the read length, and k is the chosen k-mer length in bp. Finally, the genome size was estimated by dividing the total number of bases sequenced by the sequencing depth.

Genome Assembly and Annotation

De novo assembly was performed using ALLPATHS-LG (Gnerre et al. 2011) with default parameters using OPE and mate-pair reads. The resulting assembly was evaluated with REAPR v1.0.18 (Hunt et al. 2013) to break incorrect scaffolds, by mapping the paired-end and the 4.5 kb mate-pair reads on the assembled genome. Another round of scaffolding was then performed using SSPACE v3.0 (Boetzer et al. 2011), with a minimum overlap of 32 bp and supported by a minimum of 20 reads. Finally, we leveraged the existence of the high-quality assembly of the genome of the European rabbit (Oryctolagus cuniculus—Ensembl OryCun2.0), to improve the contiguity of the assembly using the reference-based scaffolder MeDuSa v.1.6 with five iterations (Bosi et al. 2015). This re-scaffolding orders and re-orientates scaffolds without affecting intra-scaffold sequence. Quality control of the assembly at different stages was assessed based on metrics obtained with QUAST v.3.2 (Mikheenko et al. 2016). The completeness of the L. timidus re-scaffolded genome was evaluated using BUSCO v.3.0.2 (Simão et al. 2015), based on the presence and absence of core single-copy genes (from mammalia_odb9 database). We then checked consistency of gene content in the larger chromosome-like scaffolds and rabbit chromosomes using blastp from NCBI BLAST v2.7.1+ (Camacho et al. 2009), considering the best hit per gene with similarity above 90% over 500 bp. The 22 rabbit chromosomes were aligned against inferred corresponding L. timidus re-scaffolded scaffolds using D-Genies v. 1.2.0 Mashmap (Cabanettes and Klopp 2018). Repetitive regions were identified using RepeatModeler v.1.0.11 (Smit and Hubley 2008) and masked using RepeatMasker v.4.0.7 (Smit et al. 2013). The masked genome was used as input for gene prediction in MAKER v.3.01.02 (Cantarel et al. 2008), using ab-initio predictions, L. timidus transcriptome data, and rabbit protein annotations (O. cuniculus) (supplementary text, Supplementary Material online). Functional inference for genes and transcripts was performed using the translated CDS features of each coding transcript. Each predicted protein sequence was based on blastp searches against the Uniprot-Swissprot database to retrieve gene name and function, and InterProscan v5.30-69 (Jones et al. 2014) to retrieve Interpro, Pfam v31.0 (Finn et al. 2016), GO (Mi et al. 2017), KEGG (Kanehisa et al. 2016), and Reactome (Fabregat et al. 2018) information.

Analyses of Whole-Genome Re-Sequencing Data

To compare the performance of using the L. timidus genome or other strategies based on the rabbit genome for whole-genome analyses, we analysed re-sequencing data from the mountain hare and another hare species, the Iberian hare, L. granatensis, mapping the reads to 1) the new L. timidus re-scaffolded genome, 2) the rabbit reference genome (available from Ensembl—OryCun2.0, release 80), and 3) a hare pseudo-reference genome built through iterative mapping of hare sequence reads on the rabbit genome, followed by reference updating (from Seixas et al. 2018). For the re-sequencing data (NCBI Sequence Read Archive Biosamples SAMN07526960 and SAMN07526963; Seixas et al. 2018), adapters were removed using cutadapt version 1.8 (Martin 2011) and low quality bases (quality < 20 at the end of reads, and 4 consecutive bp with average quality < 30) were trimmed using Trimmomatic v0.33 (Bolger et al. 2014). Mapping was done using BWA-MEM v0.7.15 (Li 2013). Mapped reads were sorted with samtools v1.3.1 (Li et al. 2009) and read duplicates removed using Picard Markduplicates (Picard toolkit 2019). Realignment around INDELs was performed using GATK v3.2-2 (Van der Auwera et al. 2013). Genotype calling was performed for each species independently using bcftools v.1.5 (Li 2011), with minimum site (QUAL) and RMS mapping (MQ) qualities of 20, coverage (FMT/DP) between 6X and twice the average genomic coverage, and minimum genotype quality (FMT/GQ) of 20. Indels and flanking 10 bp were coded as missing data. Only sites covered in the two analysed individuals were retained. Heterozygosity was calculated in sliding windows of 50 kb, using the popgenWindows.py script available at https://github.com/simonhmartin/genomics_general, and 500 windows were randomly sampled per references and species. The Pairwise Sequentially Markovian Coalescent (PSMC) model (Li and Durbin 2011) was then used to compare single-genome demographic inferences of the mountain hare using alternative genome references (L. timidus assembled genome, prior to re-scaffolding, was also included, to control for potential biases arising from the reference-based re-scaffolding process), and to infer the demographic profiles of L. timidus and L. granatensis using the L. timidus re-scaffolded reference (as in Seixas et al. 2018, who used the hare pseudo-reference). Changes in the density of called variants among references should cause important differences in the inferred profiles. Diploid consensus sequences were built using samtools v1.3.1 mpileup and call modules, and only sites with minimum base and mapping qualities of 20, and coverage between 8X and twice the average depth were considered (atomic time intervals were set to 4 + 50*2 + 2 + 4 as in Seixas et al. 2018). Results were scaled using a generation time of 2 years (Marboutin and Peroux 1995) and a mutation rate (μ) of 2.8 × 10−9 substitutions/site/generation (Seixas et al. 2018). The variance of effective population size (Ne) estimates was assessed by 50 bootstraps in randomly sampled segments with replacement.

Results and Discussion

De Novo Reference Genome Assembly and Annotation

Genome assembly and sequencing metrics are in table 1 and supplementary table S1, Supplementary Material online. The assembly length, 2.70 Gb, was consistent with the k-mer estimate (∼2.75 Gb) and the assembled length of the rabbit genome (∼2.74 Gb; Carneiro et al. 2014). The L. timidus re-scaffolded genome contained 4,205 scaffolds, being 99.9% of the total assembly size comprised in 605 scaffolds with a minimum length of 35 kb. Of the 4,104 mammalian core genes, 3,793 (92.4%) were present in our assembly, 3,445 of which (90.8%) were found as complete single copies. The number of predicted and annotated genes (29,238 and 24,578, respectively—table 1 and supplementary fig. S2, Supplementary Material online) are in line with several published mammalian genomes that used similar sequencing approaches (Keane et al. 2015; Li et al. 2017; Koepfli et al. 2019; Ming et al. 2019), and with the extrapolation from the BUSCO completeness assessment, suggesting that the majority of genes present in our draft genome was covered by the annotation process. A total of 683 predicted genes were uniquely annotated based on the hare transcriptome and possibly represent hare-specific genes (supplementary fig. S2, Supplementary Material online).

Table 1

Summary Statistics of the L. timidus Genome Assembly, Curation and Annotation

	Value
Raw reads	3,396,221,990
Clean reads	2,084,055,186 (61%)
Raw de novo assembly (ALL-PATHS)
Number of scaffolds	30,212
Largest (bp)	4,384,173
Total length (bp)	2,741,083,031
Contig N50 (bp)	11,800
Scaffold N50 (bp)	347,000
GC content (%)	43.54
Post assembly curation (REAPR)
Number of scaffolds	70,707
Largest (bp)	1,921,307
Total length (bp)	2,662,248,649
N50 (bp)	156,456
GC content (%)	43.54
Scaffolding (SSPACE >1 kb)—L. timidus assembled genome
Number of scaffolds	32,294
Largest (bp)	3,358,433
Total length (bp)	2,703,715,767
N50 (bp)	3,358,433
GC content (%)	43.54
Reference-based scaffolder—L. timidus re-scaffolded genome
Number of scaffolds	4,205 (83% on the 22 scaffolds corresponding to the 22 rabbit chromosomes)
Largest (bp)	194,083,885
Total length (bp)	2,703,257,129
N50 (bp)	117,222,600
GC content (%)	43.40
L. timidus re-scaffolded genome characteristics and annotation statistics
Haploid chromosome number	22 (constrained by the rabbit reference)
Estimated genome length	2.75 Gb
Sequencing coverage	77x
Total assembly length	2.70 Gb
Gaps	18.8% (supplementary fig. S6, Supplementary Material online)
Repeats	28.0%
BUSCO genes present (estimation of genome completeness)	3,793 (92%)
Ab-initio gene prediction	29,238 genes
Gene space (exons, introns, etc.)	196.6 Mb (7.3% of assembly)
Gene length (median)	5.8 kb
Gene fragmentation	137,913 exons
Exon space	26.6 Mb (1.0% of the assembly)
Exon length (median)	193 bp
Homology-based gene annotation (details in supplementary fig. S2, Supplementary Material online)	24,578 proteins
Uniprot-Swissprot	23,375
O. cuniculus transcriptome	18,678
L. timidus transcriptome	15,418 (683 hare transcriptome specific)
Functional annotation
Pfam domains	28,803 (5,068 unique)
Gene ontology (GO)	14,530 (1,382 unique)
Interpro	25,395 (4,740 unique)
KEGG	748 (391 unique)
Reactome	5,066 (1,050 unique)

Summary Statistics of the L. timidus Genome Assembly, Curation and Annotation Through the characterization of gene content (supplementary fig. S3, Supplementary Material online) and chromosome-scaffold alignment (supplementary Figs. S4 and S5, Supplementary Material online), we were able to establish correspondence between the rabbit chromosomes (2N = 44) and 22 scaffolds of the re-scaffolded version of the L. timidus assembly. The 22 scaffolds correspond to 83% of the total length of the assembly (2.24 Gb). It should be noted that hares have 24 chromosomes, since rabbit’s chromosomes 1 and 2 are each split into two in hares (presumably resulting from two fusions in the rabbit lineage; Robinson et al. 2002). This karyotype difference was naturally not recovered in our re-scaffolded assembly, which highlights the inherent shortcomings of reference-guided scaffolding. While the new genome should be accurate in resolving small-scale structural variation (small insertions–-deletions, repeats, and/or inversions as recovered by the original assembled contigs/scaffolds; Salzberg et al. 2012), larger genomic rearrangements, should be missed due to the assumption of synteny with the reference.

The Impact of Alternative Reference Mapping Strategies on Genomic Analyses

The proportion of mapped reads from whole-genome re-sequencing was higher using the hare pseudo-reference, but the number of uniquely mapped reads was larger using the L. timidus reference genome (supplementary table S2, Supplementary Material online). These statistics suggest that the hare pseudo-reference increases both mapping proportion and efficiency, but the new hare reference allows increased confidence in mapping, measured as the proportion of uniquely mapped reads. The distributions of heterozygosity estimated in 50 kb windows along the genome did not differ significantly across analyses with different references (P > 0.05, Wilcoxon Ranked-sum test; supplementary fig. S7, Supplementary Material online). In agreement, the PSMC demographic profiles also displayed similar shapes across references, with only slight differences of inferred effective population sizes (fig. 1). Also, the demographic profiles of both L. timidus and L. granatensis inferred using the new L. timidus re-scaffolded genome are similar to those inferred by Seixas et al. (2018) using the hare pseudo-reference (fig. 1). These results suggest that the use of the alternative tested references does not impact heterozygosity tract patterns, and thus that approaches based on hare pseudo-references has not limited evolutionary inference and genome scans on hares. It also shows that re-scaffolding our de novo assembly using the rabbit genome enables the use of the new hare genome reference for genomic scale scans, where the ordering along the chromosome is important. Finally, the new hare draft genome can be useful to reveal hare-specific variation, reflected for example in the putative hare-specific genes annotated here, which needs to be evaluated and investigated.

—PSMC inference of demographic profiles. (a) Lepus timidus demographic profile inferred using different reference genomes: L. timidus re-scaffolded genome, L. timidus assembled genome (prior to re-scaffolding), hare pseudo-reference genome, and the (European) rabbit reference genome. (b) PSMC inference of the demographic profiles of two European hare species—L. timidus and L. granatensis, using the L. timidus re-scaffolded genome (replicating the analysis by Seixas et al. 2018, which used a hare pseudo-reference). Each bold line represents a full run for each species and thin lines represent 50 randomly sampled fragments bootstrap. A rate of 2.8 × 10-9 substitutions per site per generation and a generation time of two years were assumed. Inflection points are denoted by the gray vertical bars, as in Seixas et al. (2018).

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online. Click here for additional data file.

42 in total

1. Scaffolding pre-assembled contigs using SSPACE.

Authors: Marten Boetzer; Christiaan V Henkel; Hans J Jansen; Derek Butler; Walter Pirovano
Journal: Bioinformatics Date: 2010-12-12 Impact factor: 6.937

2. High-quality draft assemblies of mammalian genomes from massively parallel sequence data.

Authors: Sante Gnerre; Iain Maccallum; Dariusz Przybylski; Filipe J Ribeiro; Joshua N Burton; Bruce J Walker; Ted Sharpe; Giles Hall; Terrance P Shea; Sean Sykes; Aaron M Berlin; Daniel Aird; Maura Costello; Riza Daza; Louise Williams; Robert Nicol; Andreas Gnirke; Chad Nusbaum; Eric S Lander; David B Jaffe
Journal: Proc Natl Acad Sci U S A Date: 2010-12-27 Impact factor: 11.205

Review 3. LaGomiCs-Lagomorph Genomics Consortium: An International Collaborative Effort for Sequencing the Genomes of an Entire Mammalian Order.

Authors: Luca Fontanesi; Federica Di Palma; Paul Flicek; Andrew T Smith; Carl-Gustaf Thulin; Paulo C Alves
Journal: J Hered Date: 2016-02-26 Impact factor: 2.645

4. Spatial patterns of genetic diversity across European subspecies of the mountain hare, Lepus timidus L.

Authors: R M Hamill; D Doyle; E J Duke
Journal: Heredity (Edinb) Date: 2006-08-09 Impact factor: 3.821

5. Introgression drives repeated evolution of winter coat color polymorphism in hares.

Authors: Iwona Giska; Liliana Farelo; João Pimenta; Fernando A Seixas; Mafalda S Ferreira; João P Marques; Inês Miranda; Jérôme Letty; Hannes Jenny; Klaus Hackländer; Eyðfinn Magnussen; José Melo-Ferreira
Journal: Proc Natl Acad Sci U S A Date: 2019-11-11 Impact factor: 11.205

6. InterProScan 5: genome-scale protein function classification.

Authors: Philip Jones; David Binns; Hsin-Yu Chang; Matthew Fraser; Weizhong Li; Craig McAnulla; Hamish McWilliam; John Maslen; Alex Mitchell; Gift Nuka; Sebastien Pesseat; Antony F Quinn; Amaia Sangrador-Vegas; Maxim Scheremetjew; Siew-Yit Yong; Rodrigo Lopez; Sarah Hunter
Journal: Bioinformatics Date: 2014-01-21 Impact factor: 6.937

7. Draft genome of the reindeer (Rangifer tarandus).

Authors: Zhipeng Li; Zeshan Lin; Hengxing Ba; Lei Chen; Yongzhi Yang; Kun Wang; Qiang Qiu; Wen Wang; Guangyu Li
Journal: Gigascience Date: 2017-12-01 Impact factor: 6.524

8. The genomic impact of historical hybridization with massive mitochondrial DNA introgression.

Authors: Fernando A Seixas; Pierre Boursot; José Melo-Ferreira
Journal: Genome Biol Date: 2018-07-30 Impact factor: 13.583

9. The genome resources for conservation of Indo-Pacific humpback dolphin, Sousa chinensis.

Authors: Yao Ming; Jianbo Jian; Xueying Yu; Jingzhen Wang; Wenhua Liu
Journal: Sci Data Date: 2019-05-22 Impact factor: 6.444

10. REAPR: a universal tool for genome assembly evaluation.

Authors: Martin Hunt; Taisei Kikuchi; Mandy Sanders; Chris Newbold; Matthew Berriman; Thomas D Otto
Journal: Genome Biol Date: 2013-05-27 Impact factor: 13.583

4 in total

1. The Legacy of Recurrent Introgression during the Radiation of Hares.

Authors: Mafalda S Ferreira; Matthew R Jones; Colin M Callahan; Liliana Farelo; Zelalem Tolesa; Franz Suchentrunk; Pierre Boursot; L Scott Mills; Paulo C Alves; Jeffrey M Good; José Melo-Ferreira
Journal: Syst Biol Date: 2021-04-15 Impact factor: 15.683

2. GSER (a Genome Size Estimator using R): a pipeline for quality assessment of sequenced genome libraries through genome size estimation.

Authors: Braulio Valdebenito-Maturana; Gonzalo Riadi
Journal: Interface Focus Date: 2021-06-11 Impact factor: 4.661

3. The evolutionary pathways for local adaptation in mountain hares.

Authors: Iwona Giska; João Pimenta; Liliana Farelo; Pierre Boursot; Klaus Hackländer; Hannes Jenny; Neil Reid; W Ian Montgomery; Paulo A Prodöhl; Paulo C Alves; José Melo-Ferreira
Journal: Mol Ecol Date: 2022-01-17 Impact factor: 6.622

4. Chromosome-Level Reference Genome Assembly for the American Pika (Ochotona princeps).

Authors: Bryson M F Sjodin; Kurt E Galbreath; Hayley C Lanier; Michael A Russello
Journal: J Hered Date: 2021-11-01 Impact factor: 2.645

4 in total