Literature DB >> 31702783

The Rhododendron Genome and Chromosomal Organization Provide Insight into Shared Whole-Genome Duplications across the Heath Family (Ericaceae).

Valerie L Soza¹, Dale Lindsley^1,2, Adam Waalkes^1,3, Elizabeth Ramage¹, Rupali P Patwardhan⁴, Joshua N Burton^4,5, Andrew Adey^4,6, Akash Kumar^4,7, Ruolan Qiu^4,2, Jay Shendure^4,8,9, Benjamin Hall¹.

Abstract

The genus Rhododendron (Ericaceae), which includes horticulturally important plants such as azaleas, is a highly diverse and widely distributed genus of >1,000 species. Here, we report the chromosome-scale de novo assembly and genome annotation of Rhododendron williamsianum as a basis for continued study of this large genus. We created multiple short fragment genomic libraries, which were assembled using ALLPATHS-LG. This was followed by contiguity preserving transposase sequencing (CPT-seq) and fragScaff scaffolding of a large fragment library, which improved the assembly by decreasing the number of scaffolds and increasing scaffold length. Chromosome-scale scaffolding was performed by proximity-guided assembly (LACHESIS) using chromatin conformation capture (Hi-C) data. Chromosome-scale scaffolding was further refined and linkage groups defined by restriction-site associated DNA (RAD) sequencing of the parents and progeny of a genetic cross. The resulting linkage map confirmed the LACHESIS clustering and ordering of scaffolds onto chromosomes and rectified large-scale inversions. Assessments of the R. williamsianum genome assembly and gene annotation estimate them to be 89% and 79% complete, respectively. Predicted coding sequences from genome annotation were used in syntenic analyses and for generating age distributions of synonymous substitutions/site between paralgous gene pairs, which identified whole-genome duplications (WGDs) in R. williamsianum. We then analyzed other publicly available Ericaceae genomes for shared WGDs. Based on our spatial and temporal analyses of paralogous gene pairs, we find evidence for two shared, ancient WGDs in Rhododendron and Vaccinium (cranberry/blueberry) members that predate the Ericaceae family and, in one case, the Ericales order.

Entities: Chemical Disease Species

Keywords: chromatin conformation capture (Hi-C); chromosome-scale scaffolding; de novo genome assembly; linkage map; restriction-site associated DNA (RAD) sequencing; synteny

Mesh：

Substances：
Chromatin
Transposases

Year: 2019 PMID： 31702783 PMCID： PMC6907397 DOI： 10.1093/gbe/evz245

Source DB: PubMed Journal: Genome Biol Evol ISSN： 1759-6653 Impact factor: 3.416

Introduction

Scientific interest in the plant genus Rhododendron L., which includes azaleas, derives from the great morphological diversification that accompanied speciation and geographic dispersal of Rhododendron species across Eurasia, North America, and the Malesian archipelago (Irving and Hebda 1993). Within the northern temperate zone, Rhododendron species have adapted to grow in virtually every montane area (Irving and Hebda 1993). In Southeast Asia, similar dispersal and adaptation have accompanied major tectonic mountain-building events (Hall 1998). In addition, species within this genus are valued for their floral and vegetative diversity and are widely cultivated across Asia, North America, and Europe (Xing et al. 2017), with >35,000 cultivars produced around the world (Leslie 2004, 2017; McDonald S, personal communication). The widely accepted classification scheme for rhododendrons (Sleumer 1980; Chamberlain 1996) names eight subgenera, of which four account for the vast majority of species: subg. Hymenanthes (Blume) K. Koch, Pentanthera (G. Don) Poyarkova, Rhododendron, and Tsutsusi (Sweet) Pojark. The plant species at the center of this project, Rhododendron williamsianum Rehder and E.H. Wilson, belongs to the subgenus Hymenanthes, a large taxon with representatives on all Northern Hemisphere continents (Chamberlain 1982). The rhododendrons of subgenus Hymenanthes are evergreen and lack leaf scales (elepidote). Since rhododendrons in India and China first came to the attention of European plant explorers (ca. AD 1800), elepidote species have been widely planted and hybridized with one another to produce new varieties (Irving and Hebda 1993). Recent developments in genomics make it possible for many organisms to receive intensive genetic study comparable in depth to that achieved in familiar model organisms (e.g., Arabidopsis, Caenorhabditis, Drosophila). The experiments, data, and conclusions we present here are intended to bring Rhododendron to this forefront of genomics by sequencing and chromosome-scale scaffolding of the nuclear genome of R. williamsianum. Knowledge of the content and chromosome location of these sequences enables genomic study of the many species that comprise genus Rhododendron. Rhododendron is a member of the heath or heather family (Ericaceae), which contains >4,400 species and is an economically important family that includes cranberries, blueberries, huckleberries, and wintergreen. An important question raised by the availability of Rhododendron chromosome-level sequences is: how strictly conserved is the gross organization of the genome sequence across the Ericaceae, and how comparable is this genome to those of the other 4,400+ species in the family? One other diploid genome, Vaccinium macrocarpon Aiton (cranberry) (Polashock et al. 2014; Schlautman et al. 2017), is sequenced within the Ericaceae with scaffolds anchored to chromosomes, and two other draft, diploid genomes are available for Rhododendron and Vaccinium (Bian et al. 2014; Gupta et al. 2015; Zhang, Xu, et al. 2017). Both Rhododendron and Vaccinium represent species-rich groups within Ericaceae that diverged from one another ∼77 Ma (Rose et al. 2018), and have undergone multiple speciation shifts (Schwery et al. 2015; Rose et al. 2018). Chromosome evolution between these two genera represents the majority of the age of the family, which has a crown age of ∼90.5 Ma (Rose et al. 2018). The central goal of our project has been to sequence the nuclear genome of R. williamsianum and order its sequences along chromosomes. To that end, we created DNA libraries for de novo assembly of DNA sequences using next generation methodologies. Building upon this resource, we employed LACHESIS assembly (Burton et al. 2013) with R. williamsianum Hi-C data (Lieberman-Aiden et al. 2009) to produce highly detailed physical maps of the R. willliamsianum chromosomes. We then carried out a genetic cross and analyzed the parents and progeny of this cross for the linkage relationships between genes on the 13 chromosomes. Alignment and reconciliation of the physical and genetic maps placed contiguous blocks of the assembled sequence precisely along the R. williamsianum chromosomes. With this chromosome-level genomic resource, we then examined genome evolution within the Ericaceae and uncovered multiple shared, ancient whole-genome duplications (WGDs).

Materials and Methods

Rhododendron Sampling

Rhododendrons of subgenus Hymenanthes were chosen as the focus of this research based upon their importance in horticulture, ready availability, and strict diploidy (2n = 2x = 26) (Janaki Ammal et al. 1950; Jones et al. 2007). Within subgenus Hymenanthes, the species R. williamsianum was chosen for genomic study for three reasons. First, we believe that its small area of wild occurrence might limit within-species genetic variation. Second, R. williamsianum has had widespread successful use as a parent in hybrid crosses, making it a species of interest among plant breeders. Third, we have local availability of a cultivated accession of this species bearing thousands of flower buds annually. Rhododendron williamsianum material used throughout this study was collected from the Rhododendron Species Botanical Garden, Federal Way, WA, USA (accession# 1966-606).

Genomic Libraries

DNA for genomic libraries was extracted from R. williamsianum nuclei. Prior to DNA extraction, the nuclei were isolated from R. williamsianum floral buds using the method of Zhang et al. (1995) with minor variations (supplementary method SM1, Supplementary Material online). Five R. williamsianum genomic libraries were constructed as recommended for use with ALLPATHS-LG (Gnerre et al. 2011; Ribeiro et al. 2012). These included one 180-bp short fragment library with overlapping reads, three 2-kb short jumping libraries, and one 29-kb insert fosmid library (supplementary table S1, Supplementary Material online). In addition, a high molecular weight genomic library was created for subsequent contiguity preserving transposase sequencing (CPT-seq) analysis (Amini et al. 2014). All libraries were sequenced on a HiSeq 2000 (Illumina, Inc.) using the standard paired-end sequencing protocol (Meyer and Kircher 2010).

De Novo Assembly of Genomic Sequence

Next-generation sequencing data resulting from the five R. williamsianum libraries above (supplementary table S1, Supplementary Material online) were assembled using ALLPATHS-LG v40083 (Gnerre et al. 2011; Ribeiro et al. 2012) (supplementary method SM1, Supplementary Material online). Data from CPT-seq were assembled by fragScaff v140324.1 (Adey et al. 2014) in an effort to improve the ALLPATHS-LG assembly prior to chromosome-scale scaffolding using contact probability maps and genetic linkage. We first created a masked assembly by mapping the initial ALLPATHS-LG assembly to itself using BLAST (Altschul et al. 1990), and removing regions of high similarity (≥85%). In this masked assembly, we also removed regions that aligned to known repeat elements in Repbase (Jurka et al. 2005), as well as regions with a level of shotgun coverage >3 SDs above the mean. fragScaff options were tailored to favor the scaffolding of smaller contigs that would typically be missed by the subsequent contact probability map scaffolding process (supplementary method SM1, Supplementary Material online). fragScaff was then run using the masked assembly and CPT-seq data.

Proximity-Guided Chromosome-Scale Assembly

A proximity-based, short read library of R. williamsianum was constructed using the Hi-C protocol of Lieberman-Aiden et al. (2009) with minor modifications (supplementary method SM2, Supplementary Material online). The library was sequenced on a HiSeq 2000. Hi-C reads were aligned to the fragScaff scaffolds and combined using LACHESIS (Burton et al. 2013) (supplementary method SM2, Supplementary Material online).

Linkage Map Analysis

In order to establish a linkage map of the R. williamsianum genome, we generated progeny from a genetic cross between the hybrid R. “Moonstone” and R. campylocarpum. Pollen of R. campylocarpum was applied to the stigmas of the hybrid R. “Moonstone” (R. williamsianum×R. campylocarpum). Seed capsules from this cross were harvested 5 months after pollination and stored at 4 °C in glassine envelopes. After storage in dry and cool conditions, seeds were released from their capsules and germinated on a layer of moss at 21 °C and high humidity for 12-h days under 600 W sodium vapor light. Seedlings were hardened off by gradually increasing the growth time at ambient temperature and humidity. One year after the cross was carried out, seedlings were planted in small pots. At ∼2 years of seedling age, DNA was extracted from progeny leaves using the DNeasy Plant Mini Kit (Qiagen) in a modified protocol (supplementary method SM3, Supplementary Material online). For the linkage analysis, we created reduced complexity restriction-site associated DNA (RAD) sequencing libraries for the parents, R. campylocarpum and R. “Moonstone,” and for each of the 110 progeny. Each library was individually barcoded. The method of Etter et al. (2011) was employed for creating the RAD libraries with some modification (supplementary method SM4, Supplementary Material online) using the PstI restriction enzyme. Pooled libraries were paired-end sequenced on a HiSeq 2000. From the RAD sequencing data, we removed low quality reads, reads with ambiguous barcodes or restriction sites, and their orphaned paired-end reads using process_radtags.pl (supplementary method SM4, Supplementary Material online) from Stacks v0.997 (Catchen et al. 2011, 2013). Data from the paired-end sequences allowed removal of PCR duplicates using samtools v0.1.18 (Li 2011) rmdup. Stacks denovo_map.pl (supplementary method SM4, Supplementary Material online) called the genotypes for parents and progeny and loaded the results into a MySQL database. We chose RAD markers that showed SNP heterozygosity in parental genotypes and that were successfully genotyped by Illumina sequencing in at least 90% of the progeny (Li and He 2014; Mousavi et al. 2016; Wang et al. 2016; Zhao et al. 2017). We used JoinMap v4.1 (van Ooijen 2006) to calculate linkage relationships and to define linkage groups of RAD markers (supplementary method SM4, Supplementary Material online). Linkage groups were determined using maximum likelihood, and map distances were calculated using Haldane’s map function. Scaffolds were anchored to their appropriate linkage groups using the NCBI BLAST utility to blast linkage group RAD markers against the fragScaff assembly.

Reconciliation of LACHESIS Assembly to Linkage Map

We used custom Perl scripts to compare the LACHESIS assembly to the linkage map and to reconcile the LACHESIS clusters with the linkage groups. LACHESIS has been shown to be correct at local scales but has produced some large-scale misassemblies (Burton et al. 2013). The RAD linkage map orders scaffolds correctly at large scales, but its resolution is limited by the number of progeny analyzed and hence does not provide as much scaffold orientation information as does LACHESIS. Therefore, we used the LACHESIS orderings for chromosome-scale scaffolding of the R. williamsianum genome and the linkage map orderings to correct putative large-scale misassemblies in the LACHESIS results.

Assessments of Genome Assemblies

We calculated a variety of statistics for the genome assembly of R. williamsianum using a modified version of assembly-stats v1.0.1 (Pathogen Informatics 2018). Assembly-stats was used to calculate scaffold number, scaffold sizes, and total bp assembled by ALLPATHS-LG, fragScaff, and in our final assembly. We assessed the final genomic assembly of R. williamsianum with BUSCO v2.0 (Simão et al. 2015) using the embryophyta_odb9 lineage data set representing benchmarking universal single-copy orthologs (BUSCOs, n = 1440) from 30 species. BUSCO interacted with AUGUSTUS v3.2.3 (Keller et al. 2011), BLAST+ v2.2.29 (Camacho et al. 2009), and HMMER v3.1b2 (Eddy 2011). Default BUSCO parameters were used under the genome mode, with Arabidopsis as the starting species for AUGUSTUS parameters and the AUGUSTUS optimization mode for self-training.

Repetitive Element Annotation

Repeat elements were annotated and masked in the R. williamsianum final genomic assembly using de novo and homology-based methods. We first created a de novo species-specific repeat database for the R. williamsianum genome using RepeatModeler v1.0.11 (Smit and Hubley 2008). RepeatModeler uses two de novo repeat finding programs, RECON v1.08 (Bao and Eddy 2002) and RepeatScout v1.05 (Price et al. 2005), along with Tandem Repeats Finder v4.07b (Benson 1999) and NSEG v20000620 (Wootton and Federhen 1993). Three-way degenerative International Union of Biochemistry (IUB) codes within the R. williamsianum genomic sequence were replaced with Ns for input into RepeatModeler. Default settings within RepeatModeler were used with the NCBI search engine RMBlast v2.6.0+ (Smit and Hubley 2017). We then performed repeat annotation and masking with MAKER v2.31.9 (Cantarel et al. 2007; Holt and Yandell 2011) using the de novo species-specific repeat database above and homology-based methods within MAKER. MAKER uses RepeatMasker v4.0.7 (Smit et al. 2017), Tandem Repeats Finder, the RepBase RepeatMasker Edition 20170127 (Jurka et al. 2005; Genetic Information Research Institute 2017), and RepeatRunner (Smith et al. 2007; Yandell 2007). First, we used default settings within RepeatMasker and the NCBI search engine RMBlast v2.6.0+ (Smit and Hubley 2017), in combination with all repeats from RepBase and the de novo species-specific repeat database above, to identify repeats in the R. williamsianum genome. Next, we identified transposable elements and viral proteins using the RepeatRunner protein database included in MAKER. Complex repeats were hard-masked, while simple repeats were soft-masked, in the R. williamsianum genome for subsequent gene predictions. We estimated percent repetitive content by first extracting the RepeatMasker and RepeatRunner features from the MAKER annotations and using SOBAcl (Moore 2016) to summarize total base pairs of repetitive elements from MAKER annotations. We then calculated percent repetitive elements from total bp (including runs of 20 Ns or more) assembled in the genome and for each linkage group (LG). Assuming strings of Ns represent repetitive regions as well, we also calculated percent of N-runs in the final genome assembly and for each LG to obtain a grand total estimate of percent repetitive content.

Structural Annotation

Gene annotation was done within MAKER v2.31.9 (Cantarel et al. 2007; Holt and Yandell 2011) under default settings, except as noted below, using ab initio gene predictions from AUGUSTUS v3.2.3 (Keller et al. 2011). First, we aligned transcriptome data from Rhododendron delavayi Franch. (Zhang, Xu, et al. 2017) and proteins from the UniProtKB/Swiss-Prot database 2017_12 release (The UniProt Consortium 2017) and from manually annotated, complete Actinidia chinensis Planch. genes (Pilkington et al. 2018) to the R. williamsianum masked genome using TBlastX and BlastX from BLAST+ v2.6.0 (Camacho et al. 2009), respectively, within MAKER. Exonerate v2.2.0 (Slater and Birney 2005) was used within MAKER to polish these BLAST alignments. Final gene annotations were produced by MAKER, using the BUSCO training parameters from AUGUSTUS produced during the BUSCO analysis above. Additionally, we rescued genes that were omitted from the default MAKER annotation if they encoded a Pfam (Finn et al. 2016) domain to create a standard build, as outlined by Campbell et al. (2014). We assessed the completeness of these structural gene annotations for the R. williamsianum genome with BUSCO v2.0 (Simão et al. 2015). We used software dependencies as above, the embryophyta_odb9 lineage data set, and default BUSCO parameters under the protein mode, with Arabidopsis as the starting species for AUGUSTUS parameters. Additionally, we generated statistics for structural gene annotations using SOBAcl and accessory scripts.

Functional Annotation

Genes predicted by MAKER in the R. williamsianum genome were functionally annotated by Blast2GO (B2G) v4.1.9 and 5.1.12 (Conesa et al. 2005; Götz et al. 2008). Protein sequences from R. williamsianum were queried against the UniProtKB/Swiss-Prot database 2017_12 release (The UniProt Consortium 2017) using local BLAST with default settings in B2G but with an expectation value of 1.0E-5, the number of BLAST hits limited to one, and BLAST description annotator enabled. Rhododendron williamsianum protein sequences were then compared with protein domain annotations with InterProScan (Jones et al. 2014) through the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) InterPro (Finn et al. 2017) option in B2G. All available families, domains, sites, and repeats, as well as all member databases and other sequence features available in InterPro were used. Gene Ontology (GO) terms from BLAST and InterPro hits were retrieved from the GO database (Ashburner et al. 2000; The Gene Ontology Consortium 2017), selected, and assigned to corresponding R. williamsianum query sequences using the B2G default annotation rule (Conesa et al. 2005) and evidence code weights. We also retrieved enzyme code and KEGG (Ogata et al. 1999) pathway map annotations from GO annotations in B2G. Functional GO classes within each of the three GO domains (Biological Process, Cellular Component, Molecular Function) were filtered for classes containing a minimum of 10% of the annotated R. williamsianum sequences for each domain. We used REVIGO TreeMaps (Supek et al. 2011) under default settings to summarize GO terms within GO domains into a subset of nonredundant terms based on semantic similarity.

Comparative Genomics

For comparison of R. williamsianum chromosomes to other diploid Ericaceae, Vaccinium macrocarpon (cranberry) is available as an assembly with scaffolds anchored to chromosomes (Polashock et al. 2014; Schlautman et al. 2017). We used anchored/ordered scaffolds on LGs to generate pseudochromosomal sequences to identify syntenic relationships both within the R. williamsianum genome as well as between R. williamsianium and V. macrocarpon. For both genomes, these sequences were extracted from the masked genome assembly by stitching anchored/ordered scaffolds from each LG together with 100-N spacers between scaffolds. The coding sequence for pseudochromosomal sequences was then extracted from the MAKER gene predictions. We also examined two other genomes available from Ericaceae, that lack chromosome-level scaffolding, for syntenic relationships: R. delavayi (Zhang, Xu, et al. 2017) and V. corymbosum L. (blueberry) (Bian et al. 2014; Gupta et al. 2015). For all syntenic analyses, we used SynMap2 within CoGe (Lyons et al. 2008; Haug-Baltzell et al. 2017) with default settings, except for choosing discontinuous MegaBLAST, e-value of 0.0001, as the BLAST algorithm, and 40 genes as the maximum distance between gene pairs in DAGChainer (Haas et al. 2004). Syntenic regions from the DAGChainer output in SynMap2 within the R. williamsianum genome and between R. williamsianum and V. macrocarpon were also visualized with Circos v0.69-6 and 0.69-9 (Krzywinski et al. 2009). Syntenic blocks identified by DAGChainer were converted to bundles using custom scripts and displayed as ribbons in Circos with the chromosome order optimized by the orderchr tool. To calculate syntenic coverage, the total length of syntenic genes along pseudochromosomal sequences from each genome was divided by the total length of predicted genes on pseudochromosomal sequences. We also used SynFind (Tang et al. 2015) within the CoGe platform to estimate syntenic depth among R. williamsianum chromosomes as a proxy for the number of whole-genome duplication (WGD) events. We used default settings for SynFind, except for selecting LastZ as the comparison algorithm and a minimum number of five anchoring genes to call a region syntenic. For identifying WGD in Ericaceae genomes, we used the distribution of synonymous substitutions/site (Ks) between paralogous gene pairs (Lynch and Conery 2000) across the entire genome for R. delavayi, R. williamsianum, V. corymbosum, and V. macrocarpon. We used the entire genome to extract Ks estimates to better compare Ks distributions across all species examined because chromosome-scale assemblies were not available for R. delavayi and V. corymbosum. Additionally, Ks analyses of R. williamsianum using syntenic genes from the SynMap2 analyses above produced results (supplementary fig. S9, Supplementary Material online) that were not as clear as analyzing genomic paralogs (paranome), potentially due to fewer data points in the syntenic gene data set. Therefore, we used Ks distributions from all genomic paralogs and compared these to Ks distributions generated for Actinidia chinensis by Shi et al. (2010) and in our current study using A. chinensis genome data from Pilkington et al. (2018) to validate our methods. We used WGDdetector (Yang et al. 2019), which estimates Ks distributions across gene families to correct for redundant Ks values among paralogs, with MMseqs2 (Steinegger and Söding 2017) as the cluster engine. We then used SiZer (Chaudhuri and Marron 1999) to identify significant peaks in Ks distributions based on significant zero crossings of derivatives from kernel density estimation as in Barker et al. (2008). We used SiZerSM (Marron 2000) in Matlab R2018a (v9.4, The MathWorks, Inc.) with default settings of α = 0.05 and 401 bins. We then fit normal mixture models to the Ks data using maximum likelihood with EMMIX (McLachlan and Peel 1999) as in Barker et al. (2008). We tested one to five components with an unrestricted covariance matrix, 1,000 random starts, and 100 k-means clustering-based starts. Model selection was done by the Bayesian Information Criterion (BIC) within EMMIX. To estimate timing of WGDs, we generated Ks distributions of orthologous gene pairs between R. williamsianum and the following genomes using the wgd pipeline v1.0.1 (Zwaenepoel and Van de Peer 2019), which used BLAST v2.7.1 (Altschul et al. 1997) with an e-value cut-off of 1e-5, MCL v14.137 (van Dongen 2000), MUSCLE v3.8.31 (Edgar 2004), and PAML v4.9i (Yang 2007): A. chinensis (Pilkington et al. 2018), Camellia sinensis (Wei et al. 2018), V. macrocarpon (Polashock et al. 2014), and Vitis vinifera (The French–Italian Public Consortium for Grapevine Genome Characterization et al. 2007). We then compared these Ks distributions, which represented speciation events, to the Ks distribution of R. williamsianum genomic paralogs, which represented WGDs. Histograms and normal mixture models were plotted in R v3.6.1 (R Core Team 2018).

Results

De Novo Assembly of R. williamsianum Genomic Sequence

In order to produce a de novo assembly of the R. williamsianum genome, we first created five R. williamsianum genomic libraries with different insert sizes (supplementary table S1, Supplementary Material online) for use with the ALLPATHS-LG assembler (Gnerre et al. 2011; Ribeiro et al. 2012). Total genome sequencing coverage was 388×. We used assembly-stats v1.0.1 (Pathogen Informatics 2018) to calculate scaffold number, scaffold sizes, and total bp assembled by ALLPATHS-LG. The initial assembly yielded 18,269 scaffolds for a total of 491.6 Mb (table 1) out of a genome size of 650.8 Mb as estimated by ALLPATHS-LG. Scaffold sizes ranged with an N10, N50, and N90 of 508,357, 132,014, and 10,942 bp, respectively (table 1). The length of the scaffolds generated by ALLPATHS-LG is limited, probably by the occurrence of dispersed repetitive sequences, such as those belonging to the copia and gypsy families. Therefore, to improve the ALLPATHS-LG assembly, we created a high molecular weight genomic library for contiguity preserving transposase sequencing (CPT-seq) (Amini et al. 2014) and fragScaff scaffolding (Adey et al. 2014) prior to chromosome-scale scaffolding using contact probability maps and genetic linkage. fragScaff was used with the CPT-seq data to conjoin many of the smaller ALLPATHS-LG scaffolds. After assembly with the described parameters that favored conjoining of smaller scaffolds, the final scaffold count was reduced to 11,962, for a total of 532.5 Mb, with an N10, N50, and N90 of 931,684, 225,489, and 29,431 bp, respectively (table 1), corresponding to fold improvements of 1.83, 1.71, and 2.69, respectively. The modest improvements to N10 and N50 in the fragScaff assembly were primarily due to tailoring the algorithm to prioritize smaller scaffold incorporation, thus favoring increase in the N90 as well as significantly (35%) reducing the total number of scaffolds. Using fragScaff after assembly of genomic libraries by ALLPATHS-LG resulted in an improved assembly of the R. williamsianum genome through the reduction in scaffold number and increase in scaffold size.

Table 1

Combining ALLPATHS-LG with fragScaff Improved the De Novo Assembly of the Rhododendron williamsianum Genome

Assembly/Stats	ALLPATHS-LG	fragScaff	Final^a
No. scaffolds	18,269	11,962	11,985
Scaffold N10 size (bp)	508,357	931,684	815,789
Scaffold N50 size (bp)	132,014	225,489	218,828
Scaffold N90 size (bp)	10,942	29,431	29,444
Total bp	491,643,723	532,499,445	532,123,622

Combined results from ALLPATHS-LG, fragScaff, and linkage map. Linkage analysis split misassembled scaffolds, resulting in slightly more scaffolds. Scaffolds were also filtered for duplicates and mitochondrial contamination.

Combining ALLPATHS-LG with fragScaff Improved the De Novo Assembly of the Rhododendron williamsianum Genome Combined results from ALLPATHS-LG, fragScaff, and linkage map. Linkage analysis split misassembled scaffolds, resulting in slightly more scaffolds. Scaffolds were also filtered for duplicates and mitochondrial contamination.

Chromosome-Scale Scaffolding of R. williamsianum Genome Using Two Independent Methods

We used a combination of results from a linkage map based on our RAD sequencing data and LACHESIS (Burton et al. 2013) analysis of our Hi-C data to cluster and order the R. williamsianum scaffolds along chromosomes. The ordering of R. williamsianum scaffolds from the fragScaff assembly within linkage groups (LGs) was facilitated by the ordering of RAD markers (Etter et al. 2011) from a genetic cross of R. “Moonstone” (R. williamsianum×R. campylocarpum)×R. campylocarpum. For a scaffold to be assigned to a LG by this method, its sequence must include one of the RAD markers used to analyze this cross. These markers could only be ordered with respect to one another if they were separated by a scoreable crossover during 1 of the 220 meioses contributing to our 110 progeny. A total of 54,432 PstI sites occur in the assembled R. williamsianum genome, offering the potential of identifying over 100,000 RAD markers. RAD sequencing parents and progeny of the cross allowed selection of 11,616 polymorphic RAD markers with sufficient coverage in the progeny (supplementary fig. S1, Supplementary Material online). Progeny genotypes at these markers were analyzed by JoinMap which designated 13 linkage groups and ordered blocks of completely linked RAD markers along the chromosomes. This method produced chromosomal segments, each defined by a group of RAD markers inseparable from one another by crossover frequencies, and ordered those blocks of markers along each chromosome. Scaffolds sharing sequences with these marker blocks could then be assigned to the chromosome segments, and scaffolds with marker sites in two or more contiguous blocks could be oriented on their chromosomes. RAD markers used in this study allowed us to place 1,711 scaffolds along chromosomes, for a total of 363.8 Mb, representing 68% of the assembled genome. The JoinMap linkage analysis uncovered 23 scaffold-assembly errors. We used the Linux program grep to identify scaffolds containing RAD markers that did not all map to the same linkage group. Targeting one of the single or multiple N’s occurring in each of the presumed error-containing regions between incorrectly conjoined groups of RAD markers, we split misassembled scaffolds and reassigned each of the resulting subscaffolds to its appropriate chromosome. Early in the sequencing/assembly project, we discovered we were unable to PCR amplify regions of ALLPATHS-LG scaffolds which contained a single N, suggesting that a lone N signals misassembly. Therefore, we split misassembled scaffolds at single N’s whenever possible. After correcting these assembly errors, the final assembly consisted of 11,985 scaffolds, with a total length of 532.1 Mb, and an updated N10, N50, and N90 of 815,789, 218,828, and 29,444 bp, respectively (table 1). We then used LACHESIS, which has been shown to be effective at local-scale ordering and orienting of scaffolds (Burton et al. 2013), to map R. williamsianum scaffolds to chromosomes. Short-read sequencing of a R. williamsianum Hi-C library (Lieberman-Aiden et al. 2009) cataloged the frequency with which DNA sequences became crosslinked to one another inside formalin-treated nuclei, elucidating their relative proximities. The LACHESIS software combined the fragScaff assembled sequences with the Hi-C data to order and orient R. williamsianum scaffolds within chromosomes (called “clusters” in LACHESIS; supplementary fig. S2, Supplementary Material online). Custom Perl scripts were used to compare the LACHESIS assembly to the linkage map and to identify the JoinMap LG to which each LACHESIS cluster corresponded. Comparison of results obtained by these two methods showed a high degree of concordance, with similar placement of scaffolds in LGs (supplementary fig. S3, Supplementary Material online). For the final chromosomal-scale assembly, the placement and orientation of scaffolds was specified by LACHESIS since LACHESIS provides more orientation data than does the linkage map. However, LACHESIS is known to produce occasional large-scale inversion errors (Burton et al. 2013). Such inversions between the two mapping methods were corrected based on the RAD linkage map (fig. 1 and supplementary fig. S3, Supplementary Material online).

. 1.

—Comparison of chromosome-scale scaffolding for the Rhododendron williamsianum genome by two methods. Comparison of ordering and orienting scaffolds from the R. williamsianum de novo assembly within linkage groups based on two methods, LACHESIS assembly of Hi-C data and linkage map of RAD-seq data. The resulting chromosomal-scale scaffolding has 13 LGs and assigns most of the assembled sequence to LGs (table 2). A total of 3,984 scaffolds with length 394.5 Mb, representing 74% of the assembled genome, were assigned to chromosomes: 1,708 scaffolds (368.4 Mb), representing 69% of the assembled genome, were ordered, and 2,276 scaffolds (26.1 Mb), representing 5% of the assembled genome, were unordered on chromosomes (table 2). The 1,708 clustered and ordered scaffolds included a conservative set of 1,333 scaffolds (327.4 Mb), or 62% of the assembled genome, representing scaffolds ordered by both LACHESIS and the linkage map with no inconsistencies. After our chromosome-scale scaffolding, 8,001 scaffolds (137.7 Mb), representing 26% of the assembled genome, remained unmapped to chromosomes (unclustered in table 2). Combining the local-scale ordering and orientation data provided by the LACHESIS assembly and the large-scale ordering from the linkage map improved the clustering and ordering of scaffolds along chromosomes.

Table 2

Chromosome-Scale Scaffolding Statistics for the Final Rhododendron williamsianum Genome

Statistics	Final Assembly^a
Clustered scaffolds	3,984
Total bp clustered	394,456,490
Clustered and ordered scaffolds	1,708
Total bp clustered and ordered	368,385,547
Clustered and unordered scaffolds	2,276
Total bp clustered and unordered	26,070,943
Unclustered scaffolds	8,001
Total bp unclustered	137,667,132
Total scaffolds in final assembly	11,985

Final assembly used LACHESIS results for clustering and ordering scaffolds and linkage map results for fixing large-scale inversions.

Chromosome-Scale Scaffolding Statistics for the Final Rhododendron williamsianum Genome Final assembly used LACHESIS results for clustering and ordering scaffolds and linkage map results for fixing large-scale inversions.

Assessing Completeness of Final R. williamsianum Assembled Genome

We compared the final R. williamsianum assembled genome to a conserved set of 1,440 benchmarking universal single-copy ortholog (BUSCO) (Simão et al. 2015) groups across embryophytes to estimate the completeness of the R. williamsianum genome. BUSCO analyses estimated the genome to be 89% complete, represented by 89% complete BUSCOs recovered from the genome: 85.2% were single-copy and 3.8% were duplicated. The other 11% BUSCOs not fully recovered from the R. williamsianum genome were fragmented (1.9%) or missing (9.1%). Thus, the final assembled genome recovered ∼89% of the R. williamsianum genome as determined by single-copy ortholog analyses.

Annotation of R. williamsianum Genome

We identified and masked repetitive elements within the R. williamsianum final genome assembly with de novo (RepeatModeler) (Smit and Hubley 2008) and homology-based (RepBase, RepeatRunner) (Jurka et al. 2005; Smith et al. 2007; Yandell 2007) methods using RepeatMasker (Smit et al. 2017) within the MAKER pipeline (Cantarel et al. 2007; Holt and Yandell 2011). From these combined methods, we estimate that 26.4% of the assembled genome consists of identifiable repetitive elements. Additionally, assuming that strings of Ns in our final assembly are repetitive regions adds another 32.4% of repetitive content. Therefore, we estimate that a grand total of 58.8% of the R. williamsianum final assembled genome consists of repetitive sequences. We also used the MAKER pipeline to predict genes within the R. williamsianum assembled genome. We obtained estimates of 23,559 genes in the final assembled genome (table 3) using the standard build in MAKER, which has been shown to balance sensitivity and specificity (Campbell et al. 2014). Of these predicted genes, 91% were on ordered scaffolds assigned to LGs, 2% were on unordered scaffolds assigned to LGs, and 7% were on unclustered scaffolds. The mean Annotation Edit Distance (AED) (Holt and Yandell 2011), which measures how well the gene predictions match the evidence for predicted genes, from the standard build was 0.3. Initially, 94% of predicted genes from the default build had an AED ≤0.5. After rescuing genes with Pfam domains, 88% of predicted genes from the standard build had an AED ≤0.5. The total number of predicted genes encompasses on an average 0.044 genes per kb across the assembled genome (table 3). These genes were on an average 4,628 bp long, with 5.68 exons per gene, an exon length of 212 bp, and an intron length of 645 bp (table 3). Taken together, our gene predictions estimate that at least 20% of the assembled R. williamsianum genome encodes mRNA and only 5% is actual coding sequence.

Table 3

Statistics for the Structural Gene Annotation of the Rhododendron williamsianum Genome

Gene Statistic	Final Assembly
Total predicted genes	23,559
Gene density/kb	0.044
Average gene length (bp)	4,628
Average no. exons/gene	5.68
Average exon length (bp)	212
Average intron length (bp)	645

Statistics for the Structural Gene Annotation of the Rhododendron williamsianum Genome We estimated the R. williamsianum chromosome sizes by summing up ordered and unordered scaffolds assigned to each LG, either excluding or including runs of 20 Ns or more (fig. 2). Estimated chromosome lengths ranged from 17.1 Mb for LG13 to 25.8 Mb for LG01 when runs of Ns were excluded, or from 22.6 Mb for LG13 to 35.2 Mb for LG09 when runs of Ns were included (fig. 2). We then used the lengths (including runs of 20 Ns or more) to get an idea of percent repeats and percent mRNAs in each chromosomal sequence. We estimate that LGs are on an average 50.7% (±3.80) repetitive and 26.1% (±2.87) mRNAs. We do not see large differences in repeat content or gene content among chromosomes.

. 2.

—Estimates of chromosome size for the Rhododendron williamsianum genome. Chromosome sizes were estimated by summing the lengths of ordered and unordered scaffolds within each linkage group (LG). Two chromosome size estimates are provided for each LG, including and excluding runs of 20 Ns or more. We again compared the R. williamsianum set of predicted genes to a conserved set of 1,440 BUSCO groups across embryophytes to estimate completeness of the structural gene annotation done within the MAKER pipeline above. BUSCO analyses estimated gene annotation as 79.1% complete, represented by 79.1% complete BUSCOs recovered from the gene predictions: 75.4% were single-copy and 3.7% were duplicated. The other 20.9% BUSCOs not fully recovered from the gene predictions were fragmented (9.2%) or missing (11.7%). Thus, our structural gene annotation using the MAKER pipeline recovered ∼79% of the potential genes in the R. williamsianum genome. We then used the Blast2GO (B2G) (Conesa et al. 2005; Götz et al. 2008) pipeline to functionally annotate genes predicted by MAKER in the R. williamsianum genome. The B2G pipeline functionally classified 18,538 out of 23,559 (79%) predicted genes to Gene Ontology (GO) terms. Of the functionally annotated genes, 62% (11,445 genes) belonged to biological process, 35% (6,462 genes) to cellular component, and 83% (15,472 genes) to molecular function, with overlap among these GO domains. Distribution of genes within each of these three GO domains are shown in figure 3 among GO classes at various levels. For details on functional classes represented in the R. williamsianum genome, refer to figure 3 and supplementary table S2, Supplementary Material online.

. 3.

—Gene Ontology (GO) classification of functionally annotated, predicted genes in Rhododendron williamsianum. The three panels represent the three main GO domains. Each panel represents level 2 classes from the GO directed acyclic graph (left side) and the top 25 classes across all levels (right side). Left side: top pie chart represents all annotated genes in genome, bottom pie chart represents syntenic genes within genome. Right side: blue bars represent syntenic genes, green bars represent all annotated genes for biological process; orange bars represent syntenic genes, yellow bars represent all annotated genes for cellular component; pink bars represent syntenic genes, blue bars represent all annotated genes for molecular function. Only classes with at least 10% of annotated genes within a domain are listed.

Syntenic Relationships among R. williamsianum Chromosomes Identify Multiple WGDs

Because of increasing evidence of ancient WGDs in plants (Ren et al. 2018), we used SynMap2 within the Comparative Genomics (CoGe) platform (Lyons et al. 2008; Haug-Baltzell et al. 2017) to search for blocks of syntenic gene pairs between R. williamsianum chromosomes, which would indicate evidence of WGD. We found 7,151 syntenic gene pairs across 520 syntenic blocks on ordered scaffolds assigned to chromosomes from the R. williamsianum genome (fig. 4 and supplementary fig. S4 and table S3, Supplementary Material online). Of predicted genes on ordered scaffolds (100.8 Mb), 34% (34.0 Mb) were syntenic to genes on other chromosomes within the R. williamsianum genome, indicating WGD. We also used SynFind (Tang et al. 2015) within the CoGe platform to estimate syntenic depth among chromosomes as a proxy for the number of paleopolyploid events (supplementary table S4, Supplementary Material online). We found evidence of multiple rounds of WGD within the R. williamsianum genome where a given syntenic region has more than one homolog as indicated by syntenic depths ranging from 1:2 to 1:5 (supplementary table S4, Supplementary Material online). For example, the same region on chromosome 5 is homologous to regions on chromosomes 1, 6, 11, 12, and 13, indicating a syntenic depth of 1:5 (fig. 4 and supplementary fig. S4, Supplementary Material online). These syntenic relationships within the R. williamsianum genome suggest multiple rounds of WGD in the history of this genome.

. 4.

—Syntenic blocks within the Rhododendron williamsianum genome indicate multiple whole-genome duplications. The 13 chromosomes of R. williamsianum (RW) are arranged along the circumference of the Circos (Krzywinski et al. 2009) plot to reduce crossing of bundles. Each bundle represents a block of at least five syntenic gene pairs shared between two chromosomes (interior bundles) or within a chromosome (exterior bundles). Colored bundles highlight two syntenic regions with 1:5 syntenic depths. In order to determine what genes may be preferentially retained after WGDs in R. williamsianum, we extracted GO terms for syntenic genes from our whole-genome B2G functional annotations. We were able to retrieve functional annotations for 6,025 out of 7,151 syntenic genes. Of the functionally annotated syntenic genes, 49% (2,979 genes) belonged to biological process, 28% (1,704 genes) to cellular component, and 62% (3,763 genes) to molecular function, with overlap among these GO domains. Figure 3 shows, for each GO domain, how syntenic genes are distributed among GO classes compared with all genes in the genome. Distributions of syntenic gene functions are similar to the distributions of all gene functions for the most part, but syntenic genes are enriched in certain GO classes compared with the whole-genome (fig. 3). In particular, syntenic genes are enriched for an additional class from level 2 in the GO directed acyclic graph for biological process and cellular component. Under biological process, syntenic genes are enriched in single-organism process, which includes genes involved in multicellular organism development, response to oxidative stress, cell redox homeostasis, and protein phosphorylation (supplementary table S5, Supplementary Material online). Under cellular component, syntenic genes are enriched in macromolecular complex, which includes genes primarily involved in ribosomes (supplementary table S6, Supplementary Material online). Thus, we see certain genes with particular functions as being preferentially retained in duplicate after WGDs in R. williamsianum.

Syntenic Relationships between Rhododendron and Cranberry Chromosomes

In order to determine how conserved chromosome organization is within the Ericaceae, we used SynMap2 within the CoGe platform to identify blocks of syntenic gene pairs between the two available diploid Ericaceae genomes with chromosome-scale scaffolding, R. williamsianum and V. macrocarpon (cranberry) (Polashock et al. 2014; Schlautman et al. 2017). We were only able to extract 2,128 predicted genes from anchored scaffolds from the current publicly available V. macrocarpon assembly versus 21,419 predicted genes from R. williamsianum ordered scaffolds. However, we did find 526 syntenic gene pairs across 81 syntenic blocks on anchored/ordered scaffolds assigned to chromosomes between the R. williamsianum and V. macrocarpon genomes (fig. 5 and supplementary fig. and table S7, Supplementary Material online). From these, we can draw certain inferences about the orthology of chromosomes between R. williamsianum and V. macrocarpon. For example, we see a reciprocal one-to-one relationship between V. macrocarpon chromosome 9 and R. williamsianum chromosome 4 (fig. 5 and supplementary fig. S5 and table S7, Supplementary Material online). In some cases, V. macrocarpon chromosomes appear orthologous to more than one R. williamsianum chromosome, suggesting WGD in the R. williamsianum genome. For example, the same region on V. macrocarpon chromosome 1 is orthologous to R. williamsianum chromosomes 8, 12, and 13; on V. macrocarpon chromosome 10, the same region is orthologous to R. williamsianum chromosomes 3 and 6. In other cases, it is difficult to determine whether the orthology of V. macrocarpon chromosomes to more than one R. williamsianum chromosome is due to fission or fusion of chromosomes versus WGD followed by gene/chromosome loss. For example, different parts of V. macrocarpon chromosome 4 are orthologous to R. williamsianum chromosome 8 or 9, similarly with V. macrocarpon chromosome 8 to R. williamsianum chromosome 9 or 10 and V. macrocarpon chromosome 12 to R. williamsianum chromosome 5 or 11. Therefore, we sought evidence of WGD in cranberry and other Ericaceae genomes to aid inferences about chromosome orthology.

. 5.

—Syntenic blocks between Ericaceae genomes, Rhododendron williamsianum and Vaccinium macrocarpon (cranberry). The 12 chromosomes of V. macrocarpon (VM) and 13 chromosomes of R. williamsianum (RW) are arranged along the circumference of the Circos (Krzywinski et al. 2009) plot to reduce crossing of bundles. Each colored bundle represents a block of at least five syntenic gene pairs shared between two chromosomes.

Syntenic Relationships within Other Ericaceae Genomes Identify WGDs

We wanted to determine whether the syntenic evidence of WGD in R. williamsianum was found in other Ericaceae. Again, we used SynMap2 within the CoGe platform to search for blocks of syntenic gene pairs within the cranberry (V. macrocarpon) genome. We did not find any syntenic blocks within the V. macrocarpon genome. This could be due to either the lack of WGD or the low number of predicted genes on anchored scaffolds. Two other diploid genomes are available within Ericaceae but do not have chromosome-level scaffolding: Rhododendron delavayi and Vaccinium corymbosum (blueberry). We used the entire assembly for each of these genomes in SynMap2 to see if there was any signal of WGD in either genome. Syntenic blocks are more dispersed since the scaffolds are not anchored to chromosomes, however, we do find syntenic regions between different scaffolds in both genomes (supplementary figs. S6 and S7, Supplementary Material online), indicating WGD in both R. delavayi and V. corymbosum genomes.

Synonymous Substitutions in Ericaceae Genomes Indicate Multiple WGDs

In order to estimate the number and relative timing of WGDs evident in Ericaceae genomes, we used the distribution of synonymous substitutions/site (Ks) between paralogous gene pairs across entire genomes available for R. delavayi, R. williamsianum, V. corymbosum, and V. macrocarpon, and compared these distributions to Ks distributions generated for Actinidia chinensis by Shi et al. (2010) and in our current study (supplementary fig. S8, Supplementary Material online). SiZer (Chaudhuri and Marron 1999) identified two (R. williamsianum, V. macrocarpon) or three (R. delavayi, V. corymbosum) significant peaks in Ks distributions, two of which appear shared across all four genomes (red and blue curves, fig. 6). An additional, more recent peak was identified in R. delavayi and V. corymbosum (green curves, fig. 6), but these taxa are known to be diploid (Janaki Ammal et al. 1950; Bian et al. 2014). Therefore, this third peak likely represents background gene duplication and loss (Blanc and Wolfe 2004) rather than recent WGD. We tested normal mixture models with one to five components in EMMIX (McLachlan and Peel 1999) to estimate parameters for Ks distributions for each genome. Models selected by Bayesian Information Criterion in EMMIX included four or five components for each of the four genomes, but we only show those components that match SiZer results (fig. 6). The ages of two Ks peaks identified in each Rhododendron genome correspond to one another (fig. 6) and indicate two potential shared WGDs that were likely inherited from a common ancestor of Rhododendron. The mean Ks values for the two peaks in R. delavayi are 0.67 and 1.65; the mean Ks values for the two peaks in R. williamsianum are 0.61 and 1.71 (table 4). Similarly, the ages of two Ks peaks identified in each Vaccinium genome correspond to one another (fig. 6) and indicate two potential shared WGDs that were likely inherited from a common ancestor of Vaccinium. The mean Ks values for the two peaks in V. corymbosum are 0.59 and 1.54; the mean Ks values for the two peaks in V. macrocarpon are 0.60 and 1.60 (table 4). Because lineages are likely to have different rates of molecular evolution, we would expect estimates of Ks from a shared WGD to be slightly different among Ericales genomes. However, these two peaks across all four Ericaceae genomes correspond well to two WGDs identified in Actinidia chinensis (Shi et al. 2010) and in our current study (mean Ks 0.49 and 1.72; table 4); we refer to them as Ad-β and At-γ, respectively, as denoted by Shi et al. (2010). To verify whether the two shared WGDs, we found across Ericaceae genomes represent the Ad-β and At-γ WGD events, we generated Ks distributions of orthologous gene pairs between R. williamsianum and the following genomes: A. chinensis (Pilkington et al. 2018), Camellia sinensis (Wei et al. 2018), V. macrocarpon (Polashock et al. 2014), and Vitis vinifera (The French–Italian Public Consortium for Grapevine Genome Characterization et al. 2007). We then compared these Ks distributions, which represent speciation events, to the two WGDs in Ericaceae as represented in R. williamsianum (supplementary fig. S10, Supplementary Material online). Based on these comparisons, we found that the most recent WGD in R. williamsianum (red star, supplementary fig. S10, Supplementary Material online) likely occurred before the speciation events between R. williamsianum and Ericales relatives Actinidia, Camellia, and Vaccinium, and after the speciation event between R. williamsianum and Vitis, corresponding to the timeline for Ad-β. The oldest WGD in R. williamsianum (blue star, supplementary fig. S10, Supplementary Material online) occurred before all four speciation events above, corresponding to the timeline for At-γ. Therefore, we propose that the evidence of two shared WGDs from Ks data in Rhododendron and Vaccinium genomes represents two ancient, shared WGDs that originated in a common ancestor of Ericaceae and even further back in a common ancestor in the Ericales.

. 6.

—Distributions of synonymous substitutions/site (Ks) for paralogous gene pairs in Ericaceae genomes. For each genome, top panel shows histogram of Ks data overlaid by normal mixture model from EMMIX (McLachlan and Peel 1999). Two or three components of the normal mixture model are shown in green, red, and blue that correspond with SiZer (Chaudhuri and Marron 1999) results below. Bottom panel shows two or three significant peaks identified by SiZer map, where blue indicates significant increases and red indicates significant decreases in curves; purple is not significant, gray indicates sparse data. (A) Rhododendron delavayi. (B) R. williamsianum. (C) Vaccinium corymbosum. (D). V. macrocarpon.

Table 4

Normal Mixture Model Parameters Estimated by EMMIX for Synonymous Substitutions/Site (Ks) Distributions of Ericales Genomes

Genome	Ad-β Mean Ks	At-γ Mean Ks
Actinidia chinensis	0.49	1.72
Rhododendron delavayi	0.67	1.65
Rhododendron williamsianum	0.61	1.71
Vaccinium corymbosum	0.59	1.54
Vaccinium macrocarpon	0.60	1.60

Discussion

The experiments, data, and conclusions we present here provide a detailed view of the genomes of R. williamsianum and other ericaceous species. We used Illumina short sequencing reads to assemble scaffolds which allowed us to ascertain ab initio the sets of contiguous DNA sequences that comprise the R. williamsianum nuclear genome. We then determined the chromosomal distribution and order of these sequences by LACHESIS assembly of Hi-C data and linkage analysis of RAD-seq data. We annotated the R. williamsianum genome and used predicted genes to identify syntenic blocks of genes within this genome, which illuminated multiple syntenic relationships across chromosomes. Using Ks distributions from R. williamsianum and three other publicly available diploid genomes in the heath family, we find shared evidence of at least two ancient WGDs across these members of the Ericaceae. Our ALLPATHS-LG estimate for the genome size of R. williamsianum (651 Mb) is slightly less than the estimate from the other genome sequenced in subg. Hymenanthes, R. delavayi (696 Mb) (Zhang, Xu, et al. 2017), and less than the minimum genome size reported for the subgenus based on flow cytometry (694 Mb) (Jones et al. 2007). This could reflect a slightly smaller genome size for R. williamsianum than those of other subg. Hymenanthes rhododendrons. In any event, knowledge of the clustering and ordering of DNA sequences for this species will be a useful genomic resource for other rhododendrons. Using the same (BUSCO) method of assessment for genome completeness, R. williamsianum is 89% and R. delavayi is 93% complete (Zhang, Xu, et al. 2017). Thus, two largely complete Rhododendron genome assemblies are now publicly available. We identified transposable elements (TEs) representing 26% of the R. williamsianum assembly, which is lower than the 40% identified as TEs in the V. macrocarpon genome (Polashock et al. 2014). Both estimates above were produced by the MAKER pipeline and are lower than estimates in the R. delavayi genome using a custom pipeline (52%) (Zhang, Xu, et al. 2017). However, when we include strings of Ns as potentially repetitive regions, the R. williamsianum total repetitive content (59%) is comparable to R. delavayi. Similarly, gene annotation for the R. delavayi genome is 87% complete using a custom pipeline, but 78% using the MAKER pipeline (Zhang, Xu, et al. 2017), which may indicate that custom pipelines can be more exhaustive in their predictions than the MAKER pipeline. Our MAKER pipeline gene annotation of the R. williamsianum genome is comparable (79% complete) to the R. delavayi MAKER annotation above. More genes (32,938) were predicted for the R. delavayi 695-Mb assembly using the custom pipeline versus the 23,559 genes predicted for the 532-Mb R. williamsianum assembly (table 3). However, gene density is comparable for the two genomes with 0.047/kb predicted for R. delavayi (Zhang, Xu, et al. 2017) and 0.044/kb predicted for R. williamsianum. For R. delavayi, mean statistics for gene length (=4434 bp), exons/gene (=4.62), exon length (=250 bp), and intron length (=785 bp) (Zhang, Xu, et al. 2017) were close to the mean gene length (=4628 bp), exons/gene (=5.68), exon length (=212 bp), and intron length (=645 bp) of R. williamsianum (table 3). Gene-density predictions for these Rhododendron genomes are notably different from those for Vaccinium genomes, which have two to four times more predicted genes per kb of sequence. Vaccinium macrocarpon has a predicted gene density of 0.083/kb (Polashock et al. 2014), and V. corymbosum has a predicted gene density of 0.162/kb (Gupta et al. 2015), suggesting that Vaccinium genomes are more gene-rich than Rhododendron genomes. For functional annotation comparisons between available Rhododendron subg. Hymenanthes genomes, 86% of predicted genes in R. delavayi (Zhang, Xu, et al. 2017) and 79% of predicted genes in R. williamsianum were functionally annotated. These functional annotations are comparable to the functional annotation of a recently developed transcriptome from mixed tissues for R. latoucheae Franch. (Xing et al. 2017) from subg. Choniastrum (Franchet) Drude, with 81% of unigenes functionally annotated. For GO classification, the percentage of genes assigned to molecular function in R. williamsianum is higher than that reported for R. latoucheae (Xing et al. 2017), otherwise, biological process terms are more dominant than cellular component terms in both genomes. Within each of these three GO domains, the top two dominant classes are the same for R. latoucheae and R. williamsianum (fig. 3). The similarity in abundance of GO classes of functional genes across these two different subgenera, Hymenanthes and Choniastrum, implies conservation in functional sequence across the genus. The LACHESIS technique maps sequences based on their intranuclear proximities, while linkage mapping is based upon meiotic recombination frequencies. Because the two methods rely on conceptually divergent data generation techniques and analyses, the congruence of their assignment and ordering of sequences on chromosomes (fig. 1 and supplementary fig. S3, Supplementary Material online) reinforces these conclusions. The few discrepancies between the LACHESIS assembly and linkage map were primarily inversions of multiscaffold blocks (fig. 1 and supplementary fig. S3, Supplementary Material online), which may be due to problematic regions in the R. williamsianum assembly, such as the regions around centromeres or those that may have resulted from segmental duplications or other types of repeats. Similar ordering and orientation errors have been seen in the human LACHESIS assembly (Burton et al. 2013). After corrections to rectify large-scale inversions (fig. 1 and supplementary fig. S3, Supplementary Material online), discrepancies between our LACHESIS and linkage map results mostly occur near the center of chromosomes which may indicate that most of the R. williamsianum chromosomes are metacentric (fig. 1). Metacentric chromosomes have also been observed in other Ericaceae members such as Vaccinium macrocarpon (Schlautman et al. 2017). We have generated a linkage map for R. williamsianum, which used 11,616 markers from RAD-seq data, in combination with the LACHESIS assembly, to cluster and order 1,708 scaffolds (table 2), representing 368.4 Mb (69%) of the 532 Mb assembly and 21,419 predicted genes. Two other linkage map studies have been published for Rhododendron, from subg. Hymenanthes and Tsutsusi, using 332 to 523 combined AFLP, EST, MYB, RAPD, RFLP, and/or SSR markers, with the goal of identifying genes involved in flower color, iron-chlorosis tolerance, leaf color, and leaf morphology for cultivation purposes (Dunemann et al. 1999; De Keyser et al. 2010, 2013). These studies identified 13, 16, or 18 linkage groups, despite the known haploid chromosome number of 13 for the genus (Sax 1930). The R. williamsianum linkage map generated in our study provides a higher density of markers and confirms a haploid chromosome number of 13. Within subgenus Hymenanthes, interspecies crosses are fertile, yielding stable hybrid progeny (Ma et al. 2010). These, in turn, prove to be fertile in further crosses with subg. Hymenanthes species or hybrids. On this basis, we would expect the genome sequence of R. williamsianum, presented here, to share synteny with other Rhododendron species in subg. Hymenanthes, but this remains to be tested with the acquisition of other Rhododendron genomes with chromosome-scale assemblies. In contrast, crosses that involve species in different subgenera give hybrids inefficiently or not at all. One exception is the case of azaleas in subg. Pentanthera crossed with elepidote rhododendrons in subg. Hymenanthes, which can yield sterile hybrids (Kehr 1977). Among the possible reasons for this lack of interfertility is a fundamental difference in gene arrangement between the subgenera. As genome structures are ascertained for other Rhododendron subgenera, molecular karyotyping will enable reconciliation of genome structure with taxonomy. Rhododendron and Vaccinium genomes used in our study are known to be diploid (Janaki Ammal et al. 1950; Bian et al. 2014; Polashock et al. 2014), and had no prior evidence of ancient WGD in these genomes. Despite the availability of scaffolds anchored to chromosomes, it was previously unknown whether V. macrocarpon had experienced any past WGDs (Schlautman et al. 2015). With our chromosome-level assembly of the R. williamsianum genome, we find spatial evidence of WGDs in syntenic analyses within this genome. In our analyses of genome assemblies for other Ericaceae, available without chromosome-level mapping, we also find spatial (syntenic) and temporal (Ks) evidence of WGDs across all Ericaceae genomes, highlighting at least two shared, ancient WGDs. Previous studies in other genomes from the order Ericales have found two shared, ancient WGDs in both Actinidia (Actinidiaceae) and Camellia (Theaceae) (Shi et al. 2010; Huang et al. 2013; Xia et al. 2017; Wei et al. 2018), with Ks estimates comparable to our Ericaceae estimates (table 4). The oldest duplication (At-γ) dates back to an ancient hexaploidy shared by most eudicots ∼120 Ma (Jiao et al. 2012; Vekemans et al. 2012). The second duplication (Ad-β) is shared in a common ancestor of Actinidia+Camellia, estimated somewhere between 64.3 and 101.4 Ma (Shi et al. 2010; Huang et al. 2013; Xia et al. 2017; Wei et al. 2018). Two WGDs in Rhododendron and Vaccinium appear to result from the same events because we find evidence for these two WGDs arising at similar times across both genera and before their divergence (fig. 6 and supplementary fig. S10, Supplementary Material online). These WGDs must have been present in a common ancestor of the two groups, which is dated to ∼77 Ma (Rose et al. 2018). We also show that the most recent WGD arose before the speciation event between Rhododendron and its Ericales relatives (Actinidia and Camellia) and the oldest WGD arose before the speciation event between Rhododendron and Vitis. Therefore, we propose that the two WGDs we find in ericaceous genomes each represent one of the products of WGD as found in Actinidia and Camellia (fig. 7); the most recent of these resulting from a WGD in a common ancestor of Actinidiaceae+Ericaceae+Theaceae (Ad-β), and the oldest representing a hexaploid event in a eudicot ancestor (At-γ), as evidenced by syntenic depths of up to 1:5 in R. williamsianum. Supporting this is a recent study by Landis et al. (2018) who used Ks data from the 1KP project (Matasci et al. 2014) to identify and date WGDs across angiosperms and found evidence for two shared, ancient WGDs in Ericaceae transcriptomes: the oldest dating to 105.7 Ma and the second to 85.6 Ma in a common ancestor in the Ericales.

. 7.

—Whole-genome duplication events detected in Ericaceae genomes. Phylogeny of Ericales genomes sequenced to date and outgroup Vitis vinifera. Whole-genome duplication (WGD) events detected in Ericaceae genomes in this study indicated by blue and red stars. At-γ is named after WGD detected in Arabidopsis thaliana (Bowers et al. 2003); Ad-β is named after WGD detected in Actinidia (Shi et al. 2010). The R. williamsianum genome presented here adds to the growing body of genomic resources available for the genus. One other nuclear genome within the genus has been sequenced (Zhang, Xu, et al. 2017), and at least nine transcriptomes have been recently published (Fang et al. 2017; Xing et al. 2017; Zhang, Zhang, et al. 2017; Cheng et al. 2018; Choudhary et al. 2018; Li et al. 2018; Wang et al. 2018; Xiao et al. 2018; Zhao et al. 2018). We hope that the availability of chromosome-scale scaffolding for the R. williamsianum genome will motivate other efforts in the genus and across the heath family to provide similar resources for examining chromosome evolution across these groups. Our study also highlights that future genomic sequencing endeavors with chromosome-scale scaffolding should include syntenic analyses within the genome of interest before cross-species comparisons, as there may be signals of ancient WGDs in these genomes.

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online. Click here for additional data file.

79 in total

1. Automated de novo identification of repeat sequence families in sequenced genomes.

Authors: Zhirong Bao; Sean R Eddy
Journal: Genome Res Date: 2002-08 Impact factor: 9.043

2. DAGchainer: a tool for mining segmental genome duplications and synteny.

Authors: Brian J Haas; Arthur L Delcher; Jennifer R Wortman; Steven L Salzberg
Journal: Bioinformatics Date: 2004-07-09 Impact factor: 6.937

3. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla.

Authors: Olivier Jaillon; Jean-Marc Aury; Benjamin Noel; Alberto Policriti; Christian Clepet; Alberto Casagrande; Nathalie Choisne; Sébastien Aubourg; Nicola Vitulo; Claire Jubin; Alessandro Vezzi; Fabrice Legeai; Philippe Hugueney; Corinne Dasilva; David Horner; Erica Mica; Delphine Jublot; Julie Poulain; Clémence Bruyère; Alain Billault; Béatrice Segurens; Michel Gouyvenoux; Edgardo Ugarte; Federica Cattonaro; Véronique Anthouard; Virginie Vico; Cristian Del Fabbro; Michaël Alaux; Gabriele Di Gaspero; Vincent Dumas; Nicoletta Felice; Sophie Paillard; Irena Juman; Marco Moroldo; Simone Scalabrin; Aurélie Canaguier; Isabelle Le Clainche; Giorgio Malacrida; Eléonore Durand; Graziano Pesole; Valérie Laucou; Philippe Chatelet; Didier Merdinoglu; Massimo Delledonne; Mario Pezzotti; Alain Lecharny; Claude Scarpelli; François Artiguenave; M Enrico Pè; Giorgio Valle; Michele Morgante; Michel Caboche; Anne-Françoise Adam-Blondon; Jean Weissenbach; Francis Quétier; Patrick Wincker
Journal: Nature Date: 2007-08-26 Impact factor: 49.962

4. Gamma paleohexaploidy in the stem lineage of core eudicots: significance for MADS-box gene and species diversification.

Authors: Dries Vekemans; Sebastian Proost; Kevin Vanneste; Heleen Coenen; Tom Viaene; Philip Ruelens; Steven Maere; Yves Van de Peer; Koen Geuten
Journal: Mol Biol Evol Date: 2012-07-20 Impact factor: 16.240

5. Comprehensive mapping of long-range interactions reveals folding principles of the human genome.

Authors: Erez Lieberman-Aiden; Nynke L van Berkum; Louise Williams; Maxim Imakaev; Tobias Ragoczy; Agnes Telling; Ido Amit; Bryan R Lajoie; Peter J Sabo; Michael O Dorschner; Richard Sandstrom; Bradley Bernstein; M A Bender; Mark Groudine; Andreas Gnirke; John Stamatoyannopoulos; Leonid A Mirny; Eric S Lander; Job Dekker
Journal: Science Date: 2009-10-09 Impact factor: 47.728

6. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations.

Authors: Michael S Campbell; MeiYee Law; Carson Holt; Joshua C Stein; Gaurav D Moghe; David E Hufnagel; Jikai Lei; Rujira Achawanantakun; Dian Jiao; Carolyn J Lawrence; Doreen Ware; Shin-Han Shiu; Kevin L Childs; Yanni Sun; Ning Jiang; Mark Yandell
Journal: Plant Physiol Date: 2013-12-04 Impact factor: 8.340

7. Impact of whole-genome duplication events on diversification rates in angiosperms.

Authors: Jacob B Landis; Douglas E Soltis; Zheng Li; Hannah E Marx; Michael S Barker; David C Tank; Pamela S Soltis
Journal: Am J Bot Date: 2018-05-02 Impact factor: 3.844

8. RNA-Seq analysis and annotation of a draft blueberry genome assembly identifies candidate genes involved in fruit ripening, biosynthesis of bioactive compounds, and stage-specific alternative splicing.

Authors: Vikas Gupta; April D Estrada; Ivory Blakley; Rob Reid; Ketan Patel; Mason D Meyer; Stig Uggerhøj Andersen; Allan F Brown; Mary Ann Lila; Ann E Loraine
Journal: Gigascience Date: 2015-02-13 Impact factor: 6.524

9. De novo SNP discovery and genetic linkage mapping in poplar using restriction site associated DNA and whole-genome sequencing technologies.

Authors: Mohaddeseh Mousavi; Chunfa Tong; Fenxiang Liu; Shentong Tao; Jiyan Wu; Huogen Li; Jisen Shi
Journal: BMC Genomics Date: 2016-08-18 Impact factor: 3.969

10. The draft genome assembly of Rhododendron delavayi Franch. var. delavayi.

Authors: Lu Zhang; Pengwei Xu; Yanfei Cai; Lulin Ma; Shifeng Li; Shufa Li; Weijia Xie; Jie Song; Lvchun Peng; Huijun Yan; Ling Zou; Yongpeng Ma; Chengjun Zhang; Qiang Gao; Jihua Wang
Journal: Gigascience Date: 2017-10-01 Impact factor: 6.524

16 in total

1. Chromosome-scale genome assembly of Rhododendron molle provides insights into its evolution and terpenoid biosynthesis.

Authors: Guo-Lin Zhou; Yong Li; Fei Pei; Ting Gong; Tian-Jiao Chen; Jing-Jing Chen; Jin-Ling Yang; Qi-Han Li; Shi-Shan Yu; Ping Zhu
Journal: BMC Plant Biol Date: 2022-07-15 Impact factor: 5.260

2. The Rhododendron Plant Genome Database (RPGD): a comprehensive online omics database for Rhododendron.

Authors: Ningyawen Liu; Lu Zhang; Yanli Zhou; Mengling Tu; Zhenzhen Wu; Daping Gui; Yongpeng Ma; Jihua Wang; Chengjun Zhang
Journal: BMC Genomics Date: 2021-05-22 Impact factor: 3.969

3. Chromosome-Level Genome Assembly of the American Cranberry (Vaccinium macrocarpon Ait.) and Its Wild Relative Vaccinium microcarpum.

Authors: Luis Diaz-Garcia; Luis Fernando Garcia-Ortega; Maria González-Rodríguez; Luis Delaye; Massimo Iorizzo; Juan Zalapa
Journal: Front Plant Sci Date: 2021-02-10 Impact factor: 5.753

4. Genome survey sequencing and identification of genomic SSR markers for Rhododendron micranthum.

Authors: Xiao-Jun Zhou; Meng-Xue Liu; Xiao-Yu Lu; Shan-Shan Sun; Yan-Wei Cheng; Hui-Yuan Ya
Journal: Biosci Rep Date: 2020-06-26 Impact factor: 3.840

5. Single-molecule long-read sequencing of the full-length transcriptome of Rhododendron lapponicum L.

Authors: Xinping Jia; Ling Tang; Xueying Mei; Huazhou Liu; Hairong Luo; Yanming Deng; Jiale Su
Journal: Sci Rep Date: 2020-04-21 Impact factor: 4.379

6. Genome-wide identification and characterization of COMT gene family during the development of blueberry fruit.

Authors: Yushan Liu; Yizhou Wang; Jiabo Pei; Yadong Li; Haiyue Sun
Journal: BMC Plant Biol Date: 2021-01-06 Impact factor: 4.215

Review 7. Research advances in and prospects of ornamental plant genomics.

Authors: Tangchun Zheng; Ping Li; Lulu Li; Qixiang Zhang
Journal: Hortic Res Date: 2021-04-01 Impact factor: 6.793

8. Functional Annotation of a Full-Length Transcriptome and Identification of Genes Associated with Flower Development in Rhododendronsimsii (Ericaceae).

Authors: Qunlu Liu; Fiza Liaquat; Yefeng He; Muhammad Farooq Hussain Munis; Chunying Zhang
Journal: Plants (Basel) Date: 2021-03-29

9. The Chromosome-Level Genome of Miracle Fruit (Synsepalum dulcificum) Provides New Insights Into the Evolution and Function of Miraculin.

Authors: Zhuang Yang; Zhenhuan Liu; Hang Xu; Yayu Chen; Pengmeng Du; Ping Li; Wenjie Lai; Haiyan Hu; Jie Luo; Yuanhao Ding
Journal: Front Plant Sci Date: 2022-01-03 Impact factor: 5.753

10. An ancient whole-genome duplication event and its contribution to flavor compounds in the tea plant (Camellia sinensis).

Authors: Ya Wang; Fei Chen; Yuanchun Ma; Taikui Zhang; Pengchuan Sun; Meifang Lan; Fang Li; Wanping Fang
Journal: Hortic Res Date: 2021-08-01 Impact factor: 6.793