A De Novo Chromosome-Level Genome Assembly of the White-Tailed Deer, Odocoileus virginianus.

Evan W London1,2, Alfred L Roca1,2,3, Jan E Novakofski1,2, Nohra E Mateus-Pinilla1,2.   

Abstract

Cervids are distinguished by the shedding and regrowth of antlers. Furthermore, they provide insights into prion and other diseases. Genomic resources can facilitate studies of the genetic underpinnings of deer phenotypes, behavior, and disease resistance. Widely distributed in North America, the white-tailed deer (Odocoileus virginianus) has recreational, commercial, and food source value for many households. We present a genome generated using DNA from a single Illinois white-tailed deer sequenced on the PacBio Sequel II platform and assembled using Wtdbg2. Omni-C chromatin conformation capture sequencing was used to scaffold the genome contigs. The final assembly was 2.42 Gb, consisting of 508 scaffolds with a contig N50 of 21.7 Mb, a scaffold N50 of 52.4 Mb, and a BUSCO complete score of 93.1%. Thirty-six chromosome pseudomolecules comprised 93% of the entire sequenced genome length. A total of 20 651 genes predicted using the BRAKER pipeline were validated using InterProScan. Chromosome-length assembly sequences were aligned to the genomes of related species to reveal corresponding chromosomes. © The American Genetic Association. 2022.

Keywords:  Illumina; Omni-C; Pacific Biosciences; annotation; haploid; non-model species

Year:  2022        PMID: 35511871      PMCID: PMC9308042          DOI: 10.1093/jhered/esac022

Source DB:  PubMed          Journal:  J Hered        ISSN: 0022-1503            Impact factor:   2.679


The white-tailed deer (Odocoileus virginianus) is 1 of 5 species within the deer family Cervidae that are native to the United States, along with the mule deer (Odocoileus hemionus), moose (Alces americanus), caribou (Rangifer tarandus), and elk (Cervus canadensis). White-tailed deer are the most widespread of all Capreolinae (New World deer), with a range extending from the Arctic Circle in Canada to Peru and Bolivia (Hewitt 2011). In the United States (USA), deer hunting is a growing industry, accounting for $20 billion of value added to the GDP in 2016 (Allen et al. 2018). Additionally, there were 3172 deer farms operating in the United States with an estimated value of $50 million in meat and animal product sales as of 2017 (USDA National Agricultural Statistics Service 2019). Reference genomes are currently available for 3 North American deer species: mule deer (Lamb et al. 2021), Rocky Mountain elk (Masonbrink et al. 2021), and white-tailed deer (Odocoileus virginianus texanus) (Seabury et al. 2011). Using third-generation sequencing (3GS), the Rocky Mountain elk and mule deer genomes have been resolved at the chromosome level (Lamb et al. 2021; Masonbrink et al. 2021). A chromosome-level assembly sequence is a reasonably complete pseudomolecule with some gaps but consisting primarily of sequenced bases (Genome Reference Consortium 2021). The Rocky Mountain elk and mule deer genomes were both generated using Pacific Biosciences (PacBio), Illumina, and Hi-C sequencing, with both assemblies consisting of 35 chromosome-scale scaffolds (Lamb et al. 2021; Masonbrink et al. 2021). However, 3GS was not yet available when the Seabury et al. assembly was generated for white-tailed deer (Seabury et al. 2011). The existing white-tailed deer genome consists of >17 000 small scaffolds generated using second-generation sequencing (2GS).
Third-generation sequencing, such as PacBio, allows for continuous reads of single molecules of DNA ranging in size from 1 to 50 kb (English et al. 2012). Long reads allow for greater overlaps between DNA sequences, resolution of long repeat elements, and the reconstruction of contigs (Pollard et al. 2018). A technique based on chromosome conformation capture, Hi-C (Omni-C) sequencing, is utilized to map associations between sequences originating from the same chromosome (Belton et al. 2012). Applying both 3GS (PacBio) and 2GS sequencing (Illumina, Hi-C) techniques allows for the construction of higher resolution genome assemblies because the high accuracy of 2GS short-reads corrects errors in 3GS long-read sequencing (Mahmoud et al. 2019). Having a high-quality genome assembly can empower further studies and genomic resequencing projects at the population level (Fuentes-Pardo et al. 2017). The ranched/farmed white-tailed deer industry is relatively small but has been expanding within the United States, thus creating a demand for genomic resources that can be used to study heritable traits such as body size, antler rejuvenation (Jamieson et al. 2020), and resistance to pathogens (Seabury et al. 2020). A more complete, 3GS white-tailed deer genome will facilitate future research studies into additional genes that may play a role in diseases of cervids (Masonbrink et al. 2021), including white-tailed deer. For example, all native cervid species in North America are susceptible to chronic wasting disease (CWD), a transmissible spongiform encephalopathy (Rivera et al. 2019), linked to genetic variation in the PRNP gene (Robinson et al. 2012; Brandt et al. 2018; Güere et al. 2020). Additionally, according to recent genome-wide association studies (GWAS), other non-PRNP loci may play a role in CWD (Seabury et al. 2020), as is the case for other prion diseases such as Creutzfeldt-Jakob disease (Jones et al. 2020). 
Furthermore, having a chromosome-level assembly facilitates comparative genomics across species. Extrapolation of linkage is dependent on relative chromosome location, and chromosome arrangements may differ across species (Potter et al. 2017). Knowing the chromosome identities in white-tailed deer will allow for the evaluation of gene relationships within and across chromosomes (Kong et al. 1997). Therefore, there is a need to build on the resolution of the white-tailed deer genome using 3GS and Hi-C scaffolding technologies. The primary aim of this study is to create a de novo chromosome-level deer genome by integrating resources from both 2GS and 3GS platforms as well as Hi-C sequencing for scaffolding. Additionally, chromosome comparisons will be made between white-tailed deer and other mammal species to identify homologous chromosomal regions.

Methods

Biological Sample Collection

A muscle tissue sample was selected from the Illinois tissue research archive used in previous CWD genetic studies (Brandt 2018; Rivera 2019; Ishida 2020). Illinois white-tailed deer were traditionally classified as Odocoileus virginianus borealis, although the population is an admixture of deer relocated from adjacent regions (Pietsch 1954; Perrin-Stowe 2020). The criteria for choosing the sample included storage for fewer than 5 months, cold-weather field conditions during sample collection, and sustained storage at −20 °C. A male was chosen so that both the X and Y chromosomes could be sequenced. The selected male white-tailed deer originated from Jo Daviess County and was sampled in February 2020.

Nucleic Acid Library Preparation

Circulomics Nanobind High-Molecular-Weight DNA Extraction

High-molecular-weight DNA was extracted from 0.5 g of muscle tissue using the Nanobind Tissue Big DNA Kit (Circulomics, Baltimore, MD). Briefly, tissue was disrupted using a tight-fitting 1.0-mL Dounce homogenizer before lysing with proteinase K. Following homogenization, the solution was centrifuged at 3000 × g at 4 °C for 5 min to pellet debris and proteins. The supernatant containing the DNA was transferred to a low-bind microfuge tube. Isopropanol and the magnetic Nanobind disk were then added to the supernatant and gently mixed. The disk, containing bound DNA, was washed 3 times using a magnetic tube rack to prevent DNA shearing. Finally, DNA was eluted from the disk using 75 µL of elution buffer. The resulting DNA concentration was quantified using a Qubit 4 fluorometer (ThermoFisher Scientific, Waltham, MA), and the DNA length was quantified using a Fragment Analyzer™ (Advanced Analytical Technologies, Inc.).

Third- and Second-Generation DNA Sequencing

To generate long-read sequencing libraries, DNA was sheared with a gTube to an average fragment length of 30 kb prior to conversion into a library following the SMRTbell Express Template Prep Kit 2.0 protocol from Pacific Biosciences. The library was sequenced on 2 SMRT cells on a PacBio Sequel II with 24-hr movies. Short-read shotgun genomic libraries were prepared using the Hyper Library construction kit from Kapa Biosystems (Roche, Penzberg, Germany). Libraries were sequenced on the Illumina NovaSeq 6000 equipped with an SP flowcell using 2 × 150 bp paired-end reads. Chromatin conformation capture sequencing libraries were prepared using the Omni-C kit from Dovetail Genomics. Libraries were pooled, quantitated by qPCR, and sequenced on the Illumina NovaSeq 6000 equipped with an SP flowcell using 2 × 150 nt paired-end reads. Library preparation and sequencing were conducted by the Roy J. Carver Biotechnology Center of the University of Illinois at Urbana-Champaign (UIUC).

Pre-processing Reads

Sequencing reads that were much shorter or longer than the expected read length for 3GS were removed from the read data sets before genome assembly to reduce misassemblies and false contigs. Specifically, PacBio long reads >5000 bp (Hufnagel et al. 2020) were retained using Fastp (Chen et al. 2018) to improve the final assembly, and reads greater than 50 kb were removed to reduce the potential for chimeric molecular sequencing templates. All 2GS adaptor sequences were removed using bcl2fastq (https://support.illumina.com/downloads/bcl2fastq-conversion-software-v2-20.html). Thereafter, sequences were filtered for a minimum read length of 50 bp and a minimum PHRED score of 30 using Fastp to ensure short-read accuracy. Omni-C reads were subsampled using Seqtk (https://github.com/lh3/seqtk) to include only 300 million read pairs based on the recommended protocol provided by Dovetail Genomics for the Omni-C kit (Dovetail Genomics, Scotts Valley, CA). The programs and versions used throughout the assembly and analysis pipeline are listed in Table 1.
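The long-read length window described above can be sketched in a few lines of Python. This is only an illustration and not the Fastp invocation used in the study; the function name `filter_by_length` and the decision to treat both bounds as inclusive are assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple

MIN_LEN = 5_000    # lower length bound given in the text (bp)
MAX_LEN = 50_000   # upper length bound given in the text (bp)


def filter_by_length(
    reads: Iterable[Tuple[str, str]],
    min_len: int = MIN_LEN,
    max_len: int = MAX_LEN,
) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs whose length lies in [min_len, max_len].

    Inclusive bounds are an assumption; the text only states that reads
    >5000 bp were retained and reads greater than 50 kb were removed.
    """
    for name, seq in reads:
        if min_len <= len(seq) <= max_len:
            yield name, seq
```

Writing the filter as a generator keeps memory use flat, which matters when the input is tens of millions of long reads.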
Table 1.

Bioinformatics software used for assembly and analysis

Step                              Software                                      Version
Assembly and error correction
 Long-read filtering              Fastp                                         0.20.0
 De novo assembly                 Wtdbg2                                        2.5
 Contig polishing (long reads)    https://github.com/PacificBiosciences/gcpp    8.0.0
 Short-read pre-processing        Bcl2fastq2                                    2.20
 Short-read filtering             Fastp                                         0.20.0
 Contig polishing (short reads)   Pilon                                         1.2.2
 Contig deduplication             Purge-Haplotigs                               1.1.1
 Contamination screen             BLAST+                                        2.10.1
Scaffolding
 Omni-C™ read filtering           https://github.com/lh3/seqtk                  0.3.0
 Arima Genomics mapping pipeline  https://github.com/lh3/bwa                    0.7.17
                                  http://samtools.sourceforge.net/              1.11
                                  https://github.com/broadinstitute/picard      2.10.1
 Omni-C™ scaffolding              SALSA2                                        2.2
 Omni-C™ contact map              Juicebox                                      1.11.08
                                  HiCExplorer                                   2.2.1.1
 Scaffold deduplication           Purge-Haplotigs                               1.1.1
Benchmarking
 Genome completeness              BUSCO                                         4.1.4
 Synteny with other species       https://github.com/JustinChu/JupiterPlot      1.0
Annotation
 Repeat assessment                RepeatMasker                                  4.1.1
 Protein alignments               ProtHint                                      2.5.0
 RNA alignments                   STAR                                          2.7.6a
 Gene prediction                  BRAKER                                        2.1.6
 Prediction filtering             InterProScan                                  5.52-86

Software presented in relative order of use in the pipeline. See citations in-text.


De Novo Genome Assembly and Error Correction

Assembly with Wtdbg2

Filtered PacBio reads were assembled using the Wtdbg2 assembler (Ruan and Li 2020) at successive coverage threshold intervals: 50×, 70×, 90×, 100×, 110×, 135×, and 150× (Supplementary Table S1). Further analysis was conducted using the 90× coverage threshold assembly because the increase in coverage threshold from 70× to 90× produced the most substantial gains in “longest contig length” (Supplementary Table S1), while limiting excess coverage from the higher error-rate long reads. Genome assembly and polishing were conducted on the Biocluster at the Carl R. Woese Institute for Genomic Biology at UIUC.

Error Correction

Two polishing steps were performed using long and short sequencing reads to improve assembly accuracy. Wtdbg2 performed a single round of consensus polishing (Ruan and Li 2020) using the binned sequences. An additional round of long-read consensus polishing was completed by aligning PacBio reads to the 90× Wtdbg2 assembly using the Arrow algorithm from the PacBio SMRTLink software package (Pacific Biosciences). Short-read consensus polishing rounds were conducted with the filtered Illumina reads using Pilon (Walker et al. 2014) in conjunction with UniCycler (Wick et al. 2017) to execute 10 iterative rounds of Pilon polishing. Pilon polishing with a single round was also conducted to address potential overcorrection. The assembly was examined using BLAST+ for contaminant sequences that may have been introduced through the DNA extraction and sequencing process. The core UniVec database (http://ftp.ncbi.nih.gov/pub/UniVec) was downloaded from NCBI on March 10, 2021, and a nucleotide BLAST search was performed against the assembly contigs. All contaminant sequences were excised, and each contig from which a contaminant was excised was split at the removal site into 2 separate contigs (NCBI 2016). Contaminating base pairs were excised in the Emacs text editor using the start and end positions of the alignment output from the nucleotide BLAST search. Furthermore, the genome was assessed for potential contamination during final submission to the GenBank Genome database.
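The excision step was done by hand in a text editor, but the same operation can be expressed programmatically. The following is a minimal sketch, assuming BLAST-style 1-based inclusive coordinates; the function name `excise_and_split` and the `_1`, `_2` naming of the resulting pieces are hypothetical choices for illustration.

```python
from typing import List, Tuple


def excise_and_split(
    name: str, seq: str, hits: List[Tuple[int, int]]
) -> List[Tuple[str, str]]:
    """Remove contaminant intervals and return the flanking pieces.

    `hits` holds 1-based inclusive (start, end) positions, as reported in
    BLAST tabular output. Each excision splits the contig, so one hit in
    the middle of a contig yields two new contigs.
    """
    pieces = []
    cursor = 0  # 0-based index of the first base not yet emitted
    for start, end in sorted(hits):
        left = seq[cursor:start - 1]
        if left:
            pieces.append(left)
        cursor = max(cursor, end)  # skip the contaminant, tolerate overlaps
    tail = seq[cursor:]
    if tail:
        pieces.append(tail)
    # Piece names are a hypothetical convention: <contig>_<index>
    return [(f"{name}_{i}", piece) for i, piece in enumerate(pieces, start=1)]
```

For example, a contig `AAAACCCCGGGG` with a hit at positions 5 to 8 is returned as two contigs, `AAAA` and `GGGG`.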

Genome Deduplication and Scaffolding

Removal of Haplotigs and Artifacts

The software Purge-Haplotigs (Roach et al. 2018) was used to remove haplotig and artifact assembly fragments. Artifacts were defined as contigs with greater than 80% of their sequence being above the high or below the low sequencing coverage thresholds. The threshold of 80% (default setting) was previously shown to be sufficient to purge putative artifacts and organelle contigs from the assembly (Roach et al. 2018). Contig coverage histograms were generated by aligning the filtered PacBio reads to the assembly using Minimap2 (v.) (Li 2018). The histogram (Supplementary Figure S1A) was generated using the purge_haplotigs hist command with the long-reads aligned back to the genome in a BAM file as input. The low coverage threshold was set to 15, and the high coverage threshold was set to 190. The midpoint coverage between the haploid and diploid peaks was set to 55. Coverage thresholds were derived from the histogram peaks in Supplementary Figure S1A and their midpoint.
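The artifact definition above reduces to a simple rule on per-base read depth. The sketch below illustrates that rule only; it is not the Purge-Haplotigs implementation, and the function name `is_artifact` and the list-of-depths input are assumptions. The default thresholds are the contig-level values stated in the text.

```python
from typing import Sequence


def is_artifact(
    depths: Sequence[int],
    low: int = 15,        # low coverage threshold from the text
    high: int = 190,      # high coverage threshold from the text
    cutoff: float = 0.80  # fraction of the contig that must be out of range
) -> bool:
    """Return True if more than `cutoff` of positions fall outside [low, high]."""
    if not depths:
        return False
    outside = sum(1 for depth in depths if depth < low or depth > high)
    return outside / len(depths) > cutoff
```

A contig with 90% of its bases at very low depth would be flagged, whereas one with exactly 80% would not, because the definition requires greater than 80%.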

Arima Genomics Mapping Pipeline

Subsampled Omni-C paired reads were aligned to the deduplicated contig assembly using bwa index and bwa mem (Li and Durbin 2009). Aligned read pairs were sorted by position using SAMtools (Li et al. 2009) and filtered for 5′ ends using the filter_five_end.pl script (https://github.com/ArimaGenomics/mapping_pipeline). Reads were also filtered with SAMtools using a minimum mapping quality of 10. Read groups were added and duplicate reads were marked using Picard (https://github.com/broadinstitute/picard).

Scaffolding, Contig Reassignment, and Haplotig Removal

Mapped Omni-C reads were used as input for Salsa (Ghurye et al. 2019) scaffolding. Salsa was run in correction mode, allowing the use of mapping information to detect misassemblies in the input contigs. The contacts between scaffolds were visualized using Juicebox (Robinson et al. 2018). A second round of deduplication was conducted with Purge-Haplotigs (Roach et al. 2018) using Illumina paired-end reads. In short, Illumina reads were aligned to scaffolds using Minimap2, and a read-depth histogram was created using the purge_haplotigs hist command (Supplementary Figure S1B). Scaffolds were filtered based on a low coverage threshold of 5 and a high coverage threshold of 90. The midpoint threshold was set to 25. Coverage thresholds were derived from the histogram peaks in Supplementary Figure S1B and their midpoint.

Chromosome-Level Pseudomolecule Curation

The scaffold chromatin contact matrix was visualized with HiCExplorer (Ramirez et al. 2018) and specific scaffold–scaffold contact graphs were examined using Juicebox. Based on the contact graphs, scaffolds were joined into pseudomolecules when orientation could be determined. The orientation of the largest scaffold in each pseudomolecule was assumed to be in the forward direction. Smaller scaffolds were reversed as necessary based on the contact information. Final chromosomes were aligned to the chromosome assemblies of 6 other species using Minimap2. The species in order of largest to smallest chromosome number were Cervus canadensis (GCA_019320065.1, Masonbrink et al. 2021), Cervus nippon (GWHANOY00000000, Xiumei et al. 2021), Cervus elaphus (GCA_002197005.1, Bana et al. 2018), Bos taurus (GCA_000003205.1, Mehta et al. 2009), Ovis aries (GCA_011170295.1, Li et al. 2021), and Homo sapiens (GCA_000001405.28, Schneider et al. 2017). The species C. canadensis, C. nippon, and C. elaphus have 68 autosomes, whereas B. taurus and O. aries have 58 and 52 autosomes, respectively. All species had sequences for both X and Y chromosomes except for C. nippon, for which the Y-chromosome sequence was not available at the time of publication. Sex chromosomes were determined based on alignment with the other species.
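Joining oriented scaffolds into a pseudomolecule, as described above, amounts to reverse-complementing the scaffolds placed on the minus strand and concatenating them with gap characters. The sketch below is an illustration under stated assumptions: the function names, the `(name, strand)` ordering format, and the default gap length of 100 N are all hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_pseudomolecule(
    scaffolds: Dict[str, str],
    order: List[Tuple[str, str]],
    gap_len: int = 100,  # hypothetical gap size; not stated in the text
) -> str:
    """Concatenate scaffolds in the given order and orientation.

    `order` lists (scaffold_name, strand) with strand '+' or '-'.
    Scaffolds on '-' are reverse-complemented before joining.
    """
    parts = []
    for name, strand in order:
        seq = scaffolds[name]
        parts.append(seq if strand == "+" else reverse_complement(seq))
    return ("N" * gap_len).join(parts)
```

In this scheme the largest scaffold would be listed with `+`, matching the convention in the text that its orientation is taken as forward.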

Genome Annotation

Genomic annotation used multiple available databases for gene prediction. Gene models were predicted using the BRAKER annotation pipeline with transcript and protein evidence via GeneMark ETP+ (Altschul et al. 1990; Lomsadze et al. 2005; Stanke et al. 2008; Camacho et al. 2009; Barnett et al. 2011; Hoff et al. 2016, 2019). RNA alignments were examined using GeneMark (Lomsadze et al. 2014). Proteins were aligned to the genome using ProtHint (Brůna et al. 2020), which combines the Spaln (Gotoh 2008; Iwata and Gotoh 2012) and DIAMOND (Buchfink et al. 2015) protein aligners. Prior to annotation, the genome was masked with RepeatMasker (Smit et al. 2013) using Cetartiodactyla and ancestral repeat sequences in the RepBase Update repeat database (Bao et al. 2015). Cetartiodactyla includes cetaceans and even-toed ungulates (Price et al. 2005). Soft-masking of repeat sequences using RepeatMasker was used to increase annotation speed and accuracy (Hoff et al. 2019). Available RNA-seq data for white-tailed deer were downloaded from the NCBI Sequence Read Archive. RNA-seq data have been generated in previous studies from multiple tissue types including retropharyngeal lymph node (SRX4604241), liver (SRX2175788, SRX2175791), antler (SRX2175789), bone (SRX2175790), lung (SRX2175792), brain (SRX2175793), muscle (SRX2175794), testis (SRX2175795, SRX2175797), and pedicle (SRX2175796). All RNA reads were trimmed using Trim Galore (Martin et al. 2011) with the default settings to remove adapter sequences and sequences with an average Phred score below 30. Following trimming, RNA reads were aligned to the genome using STAR (Dobin and Gingeras 2015) and sorted into BAM files using SAMtools (Li et al. 2009). All RNA dataset BAM files were then merged into a single input file for BRAKER. Following the guidance of BRAKER pipeline D (https://github.com/Gaius-Augustus/BRAKER), protein sequences from humans (n = 20 396) and artiodactyls (n = 8 931) present in the SwissProt database (Boutet et al. 2007) were used as evidence from “closely related” species. Vertebrate protein sequences present in the orthologous gene database, OrthoDB (n = 4 937 339) (Kriventseva et al. 2019), were used as evidence from more distantly related species. All protein sequences were aligned to the genome using the ProtHint pipeline within GeneMark-EP (Brůna et al. 2020), which provides an output file that BRAKER can use to incorporate protein information. BRAKER merges the external evidence from RNA-seq and protein alignments for use as input to the Augustus gene prediction software (Stanke et al. 2006; Keller et al. 2011), which outputs the final general feature format files containing the locations and features of predicted genes. Only genes supported by RNA and protein sequence data were used for further analysis. The longest coding sequences for each supported gene predicted by BRAKER were translated into amino acids and queried against the InterProScan Gene3D and Pfam protein databases (Jones et al. 2014; Lewis et al. 2018; Blum et al. 2021; Mistry et al. 2021). Sequences with matches were retained within the BRAKER annotation file and predicted genes without corresponding matches were removed. Additionally, retroelements with identified reverse-transcriptase domains were removed from the protein-coding gene annotations.
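The three filtering rules at the end of this pipeline (external evidence, a domain match, no reverse-transcriptase domain) can be summarized as a single pass over the predicted gene models. The sketch below is illustrative only: the `GeneModel` record, its field names, and the keyword match used to recognize reverse-transcriptase domains are all assumptions and do not reflect the file formats or scripts used by the authors.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class GeneModel:
    gene_id: str
    has_external_evidence: bool  # RNA-seq and/or protein support (hypothetical flag)
    domain_hits: List[str] = field(default_factory=list)  # Pfam/Gene3D descriptions


def filter_gene_models(
    genes: Iterable[GeneModel],
    rt_keyword: str = "reverse transcriptase",  # assumed way to spot RT domains
) -> List[GeneModel]:
    """Keep evidence-supported genes with a domain match and no RT domain."""
    kept = []
    for gene in genes:
        if not gene.has_external_evidence:
            continue  # no RNA or protein support
        if not gene.domain_hits:
            continue  # no InterProScan match
        if any(rt_keyword in hit.lower() for hit in gene.domain_hits):
            continue  # likely retroelement
        kept.append(gene)
    return kept
```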

Assessing Completeness and Synteny

To assess the completeness of the assembly, BUSCO (Manni et al. 2021) searches were conducted following successive steps of the assembly and analysis pipeline. All BUSCO searches were conducted using the Cetartiodactyla lineage dataset from OrthoDB (Kriventseva et al. 2019). Synteny between the white-tailed deer pseudomolecule assembly and the Rocky Mountain elk assembly (GCA_019320065) was visualized using JupiterPlot (https://github.com/JustinChu/JupiterPlot). The software aligns query scaffolds to reference chromosomes using Minimap2 and, when run in assembly mode, draws the alignment links in a circular diagram.

Results

Sequencing and Assembly

3GS, 2GS, and Omni-C Sequencing Metrics

Genomic sequencing used 13 µg of DNA with an average fragment length of 54.8 kb. Two single-molecule real-time sequencing cells produced 32 932 198 reads (cell 1: 13 756 097; cell 2: 19 176 101) for a total of 390.6 gigabases (Gb) of DNA sequence (cell 1: 174.6 Gb; cell 2: 215.9 Gb). Long reads used for assembly were between 5 and 50 kb in length and totaled 23 002 345 reads (cell 1: 9 554 592; cell 2: 13 447 753), covering 345.9 Gb of sequence (cell 1: 153.6 Gb; cell 2: 192.3 Gb). A single lane of paired-end Illumina sequencing produced 1 049 534 322 paired-end reads for a total of 157.4 Gb of DNA sequence. Short reads used for error correction and deduplication totaled 1 004 706 776, representing 149.2 Gb of the total sequence. A single lane of paired-end Illumina sequencing of the Omni-C library produced a total of 967 979 604 paired reads for a total of 145.1 Gb of DNA sequence. From the total Omni-C sequencing output, 600 million paired-end reads (300 million read pairs) were subsampled for a total of 90 Gb of sequence.

Contig- and Scaffold-Level Assembly with Deduplication

The 90× coverage Wtdbg2 assembly represented a plateau in assembly quality while limiting the input of “noisy” long reads and was used for further analysis (Supplementary Table S1). Wtdbg2 produced 5 506 contigs from filtered PacBio reads. Deduplication of the contig assembly identified 984 haplotigs and 2103 artifact sequences for a total of 27.9 and 34.2 Mb, respectively. A single contig was found to contain a contaminating vector sequence based on a BLAST search of the UniVec database, and no contamination was found by GenBank submission staff. The final contig assembly consisted of 2420 contigs with a total length of 2 461 348 864 bp. The N50 of the contig assembly was 21.7 Mb with an L50 of 32 contigs (Table 2 and Figure 1A). Scaffolding by Salsa with Omni-C reads was able to join 312 contigs into 156 scaffolds. Additionally, Salsa detected 8 contigs with misassemblies based upon Omni-C mapping information. Misassembled contigs were separated into 16 sequences before being joined into scaffolds. The N50 of the scaffold assembly was 51.4 Mb, with an L50 of 18 sequences (Figure 1A). A strong diagonal “self-associated” signal was observed in the HiCExplorer plot of the Omni-C contact matrix (Figure 1B), with minimal non-self associations. Deduplication of the scaffold assembly with Purge-Haplotigs revealed 637 duplicated haplotype scaffolds and 972 artifact scaffolds for a total of 17.9 and 18.6 Mb, respectively. The final scaffold assembly consisted of 191 contigs joined into 508 scaffolds with a total ungapped length of 2.42 Gb (2 424 791 208 bp). A total of 36 scaffolds were joined into 12 chromosome groups based on Hi-C associations and the remaining 24 chromosomes consisted of single scaffolds (Table 3 and Figure 1C). The 36 chromosome pseudomolecules had an ungapped length of 2 258 487 866 bp (Table 3), representing 93% of the complete genomic sequence assembled in this study.
The number of annotated genes per chromosome and the corresponding chromosomes of other species are shown in Table 3. Chromosomal fissions were inferred if multiple chromosomes in the Odocoileus virginianus assembly aligned to the same chromosome in another organism (gray cells in Table 3). Similarly, fusions were inferred if a single chromosome in the Odocoileus virginianus assembly aligned to multiple chromosomes in another organism (bold cells in Table 3).
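The two inference rules above can be written down directly. The sketch below assumes the homology for one comparison species is available as a mapping from each white-tailed deer chromosome to the chromosome(s) it aligned to; the function name `infer_rearrangements` and this input format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def infer_rearrangements(
    homology: Dict[str, List[str]]
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Apply the fission and fusion rules to one species comparison.

    `homology` maps a deer chromosome to the other-species chromosome(s)
    it aligned to. Returns (fissions, fusions):
      fissions: other-species chromosome -> several deer chromosomes
      fusions:  deer chromosome -> several other-species chromosomes
    """
    by_other = defaultdict(list)
    fusions = {}
    for deer, others in homology.items():
        unique = sorted(set(others))
        if len(unique) > 1:
            fusions[deer] = unique
        for other in unique:
            by_other[other].append(deer)
    fissions = {
        other: sorted(deers) for other, deers in by_other.items() if len(deers) > 1
    }
    return fissions, fusions
```

With an input in which deer chromosomes 12 and 17 both map to chromosome 1 of the comparison species, chromosome 1 is reported as a fission, which is the pattern the synteny comparison with Rocky Mountain elk shows.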
Table 2.

Assembly statistics and BUSCO scores for white-tailed deer

                            O. v. borealis (contig level)   O. v. borealis (scaffold level)
Total length (bp)           2 461 348 864                   2 424 946 708
Number of sequences         2420                            508
Number of “N” gaps          n/a                             311
% “N”                       n/a                             0.006%
Largest sequence (bp)       108 025 303                     108 602 581
Smallest sequence (bp)      1939                            2657
Average length (bp)         1 017 086.3                     4 773 517.1
N50 (bp)                    21 776 300                      52 482 646
L50 (# of sequences)        32                              18
N90 (bp)                    3 308 695                       10 477 849
L90 (# of sequences)        134                             49
BUSCO (n = 13 335)
C: complete                 93.2% (12 433)                  93.2% (12 424)
S: single copy              90.9% (12 128)                  91.0% (12 129)
D: duplicated               2.3% (305)                      2.2% (295)
F: fragmented               0.4% (53)                       0.4% (51)
M: missing                  6.4% (849)                      6.4% (860)

Single-copy orthologous genes from the 22 species in the Cetartiodactyla lineage dataset.
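For readers unfamiliar with the contiguity statistics in Table 2, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least 50% of the assembly, and L50 is the number of sequences in that set. N90 and L90 use 90% instead. A minimal sketch follows; the function name `nx_lx` is an assumption, and this is not the tool used to produce the table.

```python
from typing import Iterable, Tuple


def nx_lx(lengths: Iterable[int], fraction: float = 0.5) -> Tuple[int, int]:
    """Return (Nx, Lx) for the given sequence lengths.

    With fraction=0.5 this gives (N50, L50); with 0.9, (N90, L90).
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= total * fraction:
            return length, count
    raise ValueError("lengths must be non-empty")
```

For a toy assembly with sequence lengths 10, 8, 6, 4, and 2, the two longest sequences already cover 18 of 30 bases, so N50 is 8 and L50 is 2.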

Figure 1.

Contig, scaffold, and chromosome-level assemblies of the white-tailed deer genome. (A) Scaffolds are arranged by size (bottom) and their component contigs are arranged by scaffold (top). The largest scaffolds representing 50% (orange) and 90% (orange + red) of the assembly are indicated with color, leaving the remaining 10% of the assembly (black + gray). Scaffolds below 3 Mb (gray) are not visually separated. The number of contigs per scaffold is presented in Table 3. (B) Scaffold contact map generated from chromatin conformation capture Omni-C sequencing and visualized with HiCExplorer. Scaffold–scaffold contacts are shown increasing from blue, to white, to red, and the strong diagonal signal represents scaffold self-association based on nuclear proximity. (C) Contact map for chromosome-sized pseudomolecule sequences manually curated into chromosomes.

Table 3.

Genome annotations and homology for the 36 chromosome pseudomolecules of white-tailed deer

Chrom. IDUngapped length (bp)No. of gapsNo. of genesNo. of repeats Cervus canadensis a Cervus elaphus Cervus nippon a Bos taurus a Ovis aries Homo sapiens
1108 600 5814721174 6613184447
2102 048 4202929173 10151161119
3100 279 16211 253169 281495755
493 958 8007628158 7357198113
593 570 28316956164 8142203311
689 349 4943813150 017612710715
785 956 6763583141 9688159 26/28 2210
880 668 9302385133 36293010121013
978 136 7897685134 81410231131320
1073 421 4978704130 14911111151511
1172 668 6304589123 58813141316121
1268 288 3792574118 9321162171722
1368 111 8895304109 301153314222
1467 564 2444319117 42116251520165
1566 412 0213332115 2471221121498
1661 986 2496472107 329171316211815
1760 095 37181 077101 312152191117
1857 744 214533693 516142918829
1957 482 545425293 221202826986
2057 216 54051 05998 53918411814 16/19
2156 275 464425491 77419 6/17 17664
2255 840 698224691 928232721242318
2353 708 925254696 4592122205312
2452 991 459146588 100253235312
2551 970 072121088 4312731241121
2647 961 987137676 79022241922193
2745 470 101460374 3522872523206
2844 772 963450677 29429228292111
2943 582 846224981 11726627664
3043 498 002342677 730243322221
3143 483 628432977 828301629829
3241 958 503522166 67931323027268
3340 612 519265977 187321031252416
3435 913 106023857 077332632986
X54 563 0621834097 953XXXXXX
Y2 343 2173114 570YYbXXX
Placed2 258 507 26618 8693 834 577
Unplaced166 333 442-1 782281 882
Total2 424 840 70820 6514 116 459

Gray cells—multiple chromosomes in the Odocoileus virginianus assembly aligned to the same chromosome in another organism. Bold cells—a single chromosome in the Odocoileus virginianus assembly aligned to multiple chromosomes in another organism.

a Chromosomes (chrom.) for this species are not numbered in order of size.

b No Y chromosome sequence available for Cervus nippon.


Genome Analysis

Annotation of Genes and Repetitive Elements

RepeatMasker identified 3 499 765 total interspersed repetitive elements in the Ovbor_1.0 assembly occupying a total of 1 034 014 200 bp. The genome had an average repeat density of 42.69% per scaffold, with the largest 36 scaffolds having a repeat density of 42.09%. Initial analysis using BRAKER predicted 46 152 complete genes, of which 37 684 were supported by external RNA or protein evidence. Validation of gene predictions with InterProScan supported 26 648 predicted genes. Of these supported genes, 5997 contained reverse transcriptase domains and were removed from the annotation set, for a final count of 20 651 protein-coding genes (Table 3).

Assessing Completeness Using BUSCO and Synteny

Initial BUSCO scores following assembly by Wtdbg2 were 89.6% complete genes (88.0% single-copy; 1.6% duplicated), and BUSCO was re-run following each step of analysis (Supplementary Table S2). The final BUSCO scores following scaffold deduplication were 93.2% complete genes (91.0% single-copy, 2.1% duplicated). Synteny comparisons between the white-tailed deer and Rocky Mountain elk assemblies showed a single chromosomal fission of Rocky Mountain elk chromosome 1 into white-tailed deer chromosomes 12 and 17 (Table 3 and Supplementary Figure S2). Despite the greatly enhanced contiguity of the reference genome assembly achieved here via 3GS sequencing, it should also be noted that the BUSCO scores from Seabury et al. 2011 are higher (93.7%) than those achieved in this study (93.2%), reflecting the quality and precision of the previous 2GS assembly.

Discussion

Our chromosome-level assembly of the white-tailed deer genome will serve as a valuable resource for future ruminant and cervid research including molecular phylogeny and comparative evolutionary studies. By employing multiple sequencing technologies, including Illumina short reads, Omni-C reads, and Pacific Biosciences long reads, the contiguity and accuracy of the assembly surpassed those of previously generated Capreolinae (New World deer) genomes: Rangifer tarandus (GCA_014898785) and Odocoileus hemionus (GCA_004115125). This work, resulting in the Ovbor_1.0 assembly, used currently available long-read 3GS and Omni-C technologies to produce a scaffold N50 of 52 Mb, which is 60 times longer than the scaffold N50 of the existing Odocoileus virginianus texanus assembly (GCA_002102435; Seabury et al. 2011) generated before 3GS became available. The current assembly has an average of fewer than 5 gaps per chromosome and will serve as a valuable reference genome for genomic studies in white-tailed deer and other cervids. The initial assembly step with Wtdbg2 produced large contiguous sequences, with an NG50 of 21 Mb. This is comparable to the human Wtdbg2 assembly presented by Ruan and Li (2020), which had an NG50 of 18 Mb. Arrow long-read polishing was able to extend and error-correct contigs produced by Wtdbg2. Pilon polishing was performed iteratively; however, the most accurate error correction was complete after a single iteration (Supplementary Table S2). This may be compared with the study by Nguyen et al. (2020), where 4 rounds of Pilon polishing were required following the use of Oxford Nanopore technology. Oxford Nanopore reads use complex electrical signals, and long-range errors can occur (Rang et al. 2018). By contrast, in PacBio reads, errors are characterized by insertions and deletions, and a single round of Pilon polishing produced an assembly with a higher BUSCO score (Supplementary Table S2).
Furthermore, because each round of Pilon polishing can take 1–2 days to complete depending on the size of the genome, requiring only a single round reduced the time spent on error correction and improved overall pipeline efficiency. BUSCO results indicated that only a small percentage of the sequence was duplicated. During purging by Purge-Haplotigs, it was only necessary to purge 98.6 Mb of sequence, which was less than 5% of the total genome length and consisted of putative artifact and haplotig sequences. By contrast, other genome assemblies have required almost 50% of the genome sequence to be purged (Roach et al. 2018). Thus, the assembly produced by Wtdbg2 contained primarily collapsed haplotype sequences without high levels of duplication. Furthermore, only 8 contigs (i.e., <0.1% of Wtdbg2 contigs) had misassemblies that could be detected by Salsa, indicating that almost all contigs were in the chromosome order implied by Omni-C. Our genome annotation, produced by BRAKER and validated with InterProScan, expanded the set of annotations on chromosome-sized sequences. This annotation will provide further genomic context, allowing for the assessment of chromosomal rearrangements and evolutionary relationships in white-tailed deer. Although annotations were validated using human protein sequences, research has shown that lineage-specific traits such as antler growth have their genetic basis in genes (referred to as headgear genes) that are shared across mammalian lineages; therefore, it is unlikely that InterProScan validation led to a loss of lineage-specific genes. Some genes have been shown to be under positive selection in ruminants with headgear traits (OLIG1 and OTOP3), while others have been shown to be highly expressed in headgear (e.g., SOX10, NGFR, ALX1, VCAN, COL1A1) (Chen et al. 2019; Wang et al. 2019). This annotation information may also facilitate future gene-based synteny comparisons. 
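As a rough check on the statement that the 98.6 Mb purged by Purge-Haplotigs was less than 5% of the total genome length, the fraction can be bounded with a few lines of Python. The pre-purge assembly length is not given in this section, so the sketch assumes two illustrative denominators: the final assembly length of 2.42 Gb, and that length plus the purged sequence. The function name is hypothetical.

```python
# Values taken from the text; the choice of denominators is an assumption,
# because the pre-purge assembly length is not stated in this section.
PURGED_MB = 98.6
FINAL_ASSEMBLY_MB = 2420.0  # 2.42 Gb final assembly


def purged_fraction(purged_mb: float, total_mb: float) -> float:
    """Return the purged sequence as a fraction of the given total length."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return purged_mb / total_mb


# Upper bound: denominator is the final assembly only.
high = purged_fraction(PURGED_MB, FINAL_ASSEMBLY_MB)
# Lower bound: denominator also includes the purged sequence.
low = purged_fraction(PURGED_MB, FINAL_ASSEMBLY_MB + PURGED_MB)

print(f"purged fraction between {low:.1%} and {high:.1%}")
```

Under either assumption the purged fraction comes out at roughly 4%, consistent with the "less than 5%" figure quoted above.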
Chromosome fissions and fusions were detected between the white-tailed deer genome and the other species that were compared (Table 3). Identification of chromosomal arrangements will inform the assumptions made about gene linkage and synteny. Future genome-wide association studies will be able to align to the chromosome-level scaffolds of the 3GS Ovbor_1.0 white-tailed deer reference assembly. Having a chromosome-level assembly with few gaps will empower future population genomic sequencing to characterize genetic diversity within the deer population, which could help identify loci underlying genetic disease resistance and assist with current conservation efforts.
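The contiguity statistics used throughout this Discussion (scaffold N50 and NG50) can be illustrated with a minimal Python sketch. N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly; NG50 uses half of an estimated genome size instead. The function name and the toy lengths below are hypothetical and are not taken from the Ovbor_1.0 assembly.

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], genome_size: Optional[int] = None) -> Optional[int]:
    """Return N50, or NG50 when an estimated genome_size is supplied.

    Returns None if the sequences cover less than half of genome_size,
    in which case NG50 is undefined.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    # N50 uses the assembly length; NG50 uses the estimated genome size.
    total = genome_size if genome_size is not None else sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return None


# Toy scaffold lengths (arbitrary units), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))                   # N50 of the toy assembly: 70
print(n50(toy_lengths, genome_size=400))  # NG50 for an assumed genome size of 400: 50
```

Because NG50 depends on an external genome size estimate rather than on the assembly length, it allows assemblies of different total lengths to be compared on the same footing, which is why it is used for the comparison with the human Wtdbg2 assembly.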
  63 in total

1.  Allele-sharing models: LOD scores and accurate linkage tests.

Authors:  A Kong; N J Cox
Journal:  Am J Hum Genet       Date:  1997-11       Impact factor: 11.025

2.  Genetic basis of ruminant headgear and rapid antler regeneration.

Authors:  Yu Wang; Chenzhou Zhang; Nini Wang; Zhipeng Li; Rasmus Heller; Rong Liu; Yue Zhao; Jiangang Han; Xiangyu Pan; Zhuqing Zheng; Xueqin Dai; Ceshi Chen; Mingle Dou; Shujun Peng; Xianqing Chen; Jing Liu; Ming Li; Kun Wang; Chang Liu; Zeshan Lin; Lei Chen; Fei Hao; Wenbo Zhu; Chengchuang Song; Chen Zhao; Chengli Zheng; Jianming Wang; Shengwei Hu; Cunyuan Li; Hui Yang; Lin Jiang; Guangyu Li; Mingjun Liu; Tad S Sonstegard; Guojie Zhang; Yu Jiang; Wen Wang; Qiang Qiu
Journal:  Science       Date:  2019-06-21       Impact factor: 47.728

3.  The first high-quality reference genome of sika deer provides insights for high-tannin adaptation.

Authors:  Xiumei Xing; Cheng Ai; Tianjiao Wang; Yang Li; Huitao Liu; Pengfei Hu; Guiwu Wang; Huamiao Liu; Hongliang Wang; Ranran Zhang; Junjun Zheng; Xiaobo Wang; Lei Wang; Yuxiao Chang; Qian Qian; Jinghua Yu; Lixin Tang; Shigang Wu; Xiujuan Shao; Alun Li; Peng Cui; Wei Zhan; Sheng Zhao; Zhichao Wu; Xiqun Shao; Yimeng Dong; Min Rong; Yihong Tan; Xuezhe Cui; Shuzhuo Chang; Xingchao Song; Tongao Yang; Limin Sun; Yan Ju; Pei Zhao; Huanhuan Fan; Ying Liu; Xinhui Wang; Wanyun Yang; Min Yang; Tao Wei; Shanshan Song; Jiaping Xu; Zhigang Yue; Qiqi Liang; Chunyi Li; Jue Ruan; Fuhe Yang
Journal:  Genomics Proteomics Bioinformatics       Date:  2022-06-16       Impact factor: 7.691

4.  Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly.

Authors:  Valerie A Schneider; Tina Graves-Lindsay; Kerstin Howe; Nathan Bouk; Hsiu-Chuan Chen; Paul A Kitts; Terence D Murphy; Kim D Pruitt; Françoise Thibaud-Nissen; Derek Albracht; Robert S Fulton; Milinn Kremitzki; Vincent Magrini; Chris Markovic; Sean McGrath; Karyn Meltz Steinberg; Kate Auger; William Chow; Joanna Collins; Glenn Harden; Timothy Hubbard; Sarah Pelan; Jared T Simpson; Glen Threadgold; James Torrance; Jonathan M Wood; Laura Clarke; Sergey Koren; Matthew Boitano; Paul Peluso; Heng Li; Chen-Shan Chin; Adam M Phillippy; Richard Durbin; Richard K Wilson; Paul Flicek; Evan E Eichler; Deanna M Church
Journal:  Genome Res       Date:  2017-04-10       Impact factor: 9.043

5.  OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs.

Authors:  Evgenia V Kriventseva; Dmitry Kuznetsov; Fredrik Tegenfeldt; Mosè Manni; Renata Dias; Felipe A Simão; Evgeny M Zdobnov
Journal:  Nucleic Acids Res       Date:  2019-01-08       Impact factor: 16.971

6.  Accurate Genomic Predictions for Chronic Wasting Disease in U.S. White-Tailed Deer.

Authors:  Christopher M Seabury; David L Oldeschulte; Eric K Bhattarai; Dhruti Legare; Pamela J Ferro; Richard P Metz; Charles D Johnson; Mitchell A Lockwood; Tracy A Nichols
Journal:  G3 (Bethesda)       Date:  2020-04-09       Impact factor: 3.154

7.  Chronic wasting disease associated with prion protein gene (PRNP) variation in Norwegian wild reindeer (Rangifer tarandus).

Authors:  Mariella E Güere; Jørn Våge; Helene Tharaldsen; Sylvie L Benestad; Turid Vikøren; Knut Madslien; Petter Hopp; Christer M Rolandsen; Knut H Røed; Michael A Tranulis
Journal:  Prion       Date:  2020-12       Impact factor: 3.931

8.  A Hu sheep genome with the first ovine Y chromosome reveal introgression history after sheep domestication.

Authors:  Ran Li; Peng Yang; Ming Li; Wenwen Fang; Xiangpeng Yue; Hojjat Asadollahpour Nanaei; Shangquan Gan; Duo Du; Yudong Cai; Xuelei Dai; Qimeng Yang; Chunna Cao; Weidong Deng; Sangang He; Wenrong Li; Runlin Ma; Mingjun Liu; Yu Jiang
Journal:  Sci China Life Sci       Date:  2020-09-24       Impact factor: 6.038

9.  Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources.

Authors:  Mario Stanke; Oliver Schöffmann; Burkhard Morgenstern; Stephan Waack
Journal:  BMC Bioinformatics       Date:  2006-02-09       Impact factor: 3.169

10.  The InterPro protein families and domains database: 20 years on.

Authors:  Matthias Blum; Hsin-Yu Chang; Sara Chuguransky; Tiago Grego; Swaathi Kandasaamy; Alex Mitchell; Gift Nuka; Typhaine Paysan-Lafosse; Matloob Qureshi; Shriya Raj; Lorna Richardson; Gustavo A Salazar; Lowri Williams; Peer Bork; Alan Bridge; Julian Gough; Daniel H Haft; Ivica Letunic; Aron Marchler-Bauer; Huaiyu Mi; Darren A Natale; Marco Necci; Christine A Orengo; Arun P Pandurangan; Catherine Rivoire; Christian J A Sigrist; Ian Sillitoe; Narmada Thanki; Paul D Thomas; Silvio C E Tosatto; Cathy H Wu; Alex Bateman; Robert D Finn
Journal:  Nucleic Acids Res       Date:  2021-01-08       Impact factor: 16.971
