Literature DB >> 29020749

The draft genome assembly of Rhododendron delavayi Franch. var. delavayi.

Lu Zhang1,2, Pengwei Xu3, Yanfei Cai1,2, Lulin Ma1,2, Shifeng Li1,2, Shufa Li1,2, Weijia Xie1,2, Jie Song1,2, Lvchun Peng1,2, Huijun Yan1,2, Ling Zou1,2, Yongpeng Ma4, Chengjun Zhang5, Qiang Gao3, Jihua Wang1,2.   

Abstract

Rhododendron delavayi Franch. is globally famous as an ornamental plant. Its distribution in southwest China covers several different habitats and environments. However, not much research had been conducted on Rhododendron spp. at the molecular level, which hinders understanding of its evolution, speciation, and synthesis of secondary metabolites, as well as its wide adaptability to different environments. Here, we report the genome assembly and gene annotation of R. delavayi var. delavayi (the second genome sequenced in the Ericaceae), which will facilitate the study of the family. The genome assembly will have further applications in genome-assisted cultivar breeding. The final size of the assembled R. delavayi var. delavayi genome (695.09 Mb) was close to the 697.94 Mb, estimated by k-mer analysis. A total of 336.83 gigabases (Gb) of raw Illumina HiSeq 2000 reads were generated from 9 libraries (with insert sizes ranging from 170 bp to 40 kb), achieving a raw sequencing depth of ×482.6. After quality filtering, 246.06 Gb of clean reads were obtained, giving ×352.55 coverage depth. Assembly using Platanus gave a total scaffold length of 695.09 Mb, with a contig N50 of 61.8 kb and a scaffold N50 of 637.83 kb. Gene prediction resulted in the annotation of 32 938 protein-coding genes. The genome completeness was evaluated by CEGMA and BUSCO and reached 95.97% and 92.8%, respectively. The gene annotation completeness was also evaluated by CEGMA and BUSCO and reached 97.01% and 87.4%, respectively. Genome annotation revealed that 51.77% of the R. delavayi genome is composed of transposable elements, and 37.48% of long terminal repeat elements (LTRs). The de novo assembled genome of R. delavayi var. delavayi (hereinafter referred to as R. delavayi) is the second genomic resource of the family Ericaceae and will provide a valuable resource for research on future comparative genomic studies in Rhododendron species. The availability of the R. delavayi genome sequence will hopefully provide a tool for scientists to tackle open questions regarding molecular mechanisms underlying environmental interactions in the genus Rhododendron, more accurately understand the evolutionary processes and systematics of the genus, facilitate the identification of genes encoding pharmaceutically important compounds, and accelerate molecular breeding to release elite varieties.
© The Authors 2017. Published by Oxford University Press.

Entities:  

Keywords:  Rhododendron delavayi; annotation; genome assembly; genomics

Mesh:

Year:  2017        PMID: 29020749      PMCID: PMC5632301          DOI: 10.1093/gigascience/gix076

Source DB:  PubMed          Journal:  Gigascience        ISSN: 2047-217X            Impact factor:   6.524


Background

Rhododendron L. is a genus in the family Ericaceae. It is 1 of the largest and most diverse genera in the family and is distributed predominantly throughout the Northern hemisphere, but also reaches into the Asian tropics. More than 1000 species of Rhododendron are currently recognized, of which 567 species representing 6 subgenera are known from China. Of these Chinese species, approximately 80% are endemic [1, 2]. Because of the adaptability of this genus to different environments, species such as R. arboreum and R. ferrugineum have been used to investigate the effects of different environmental factors on plant growth, development, and domestication [3-6]. Certain secondary metabolites in Rhododendron have been investigated in connection with antioxidant, anti-inflammatory, anti-carcinogen, and anti-bacterial properties; these compounds have potential in the alleviation of symptoms in conditions including diabetes, arthritis, headache, and hypertension [7-9]. Genome-level sequencing could help investigation into genes responsible for these metabolites and could facilitate the characterization of bio-active compounds and downstream production. Most species of Rhododendron are diploid (2n = 26). The relatively low levels of ploidy and reported introgression of genetic material between species in nature might be important in the evolution and speciation of Rhododendron [10]. Hybrid varieties can be produced with relative ease by using Rhododendron as the parent because of its natural interspecific hybridization [11, 12]. Previous research on morphology, anatomy, and cytology of Rhododendron suggests that the subgenus Hymenanthes represents a basal state of this genus [13], but classification attempts that employed only a small set of gene regions were not able to resolve relationships within the subgenus [14, 15]. R. delavayi Franch. is widely distributed throughout southwest China and grows at a wide altitudinal range, between 1200 and 3200 m. The species belongs to the subgenus Hymenanthes, subsection Arborea [1, 16]. Four varieties have been described for this species. Rhododendron delavayi var. peramoenum has narrow leaves and has been reported in western Yunnan, northeast India, and Myanmar, whereas R. delavayi var. delavayi has broader leaves than the former and mainly dominates in the Chinese range of the species. Another 2 varieties, R. delavayi var. adenostylum and R. delavayi var. pilostylum, were recently shown to fall within the spectrum of morphologies observed in hybrids between R. delavayi and R. irroratum [17]. In this project, material obtained from R. delavayi var. delavayi (see as Fig. 1) was used to generate genome sequences.
Figure 1:

Rhododendron delavayi Franch. var. delavayi on Cang Shan Mountain, Dali.

Due to its very attractive flowers and good resistance to arid and cold climates, R. delavayi has become a highly profitable ornamental flower in the market, especially in China and some Southeastern Asian countries, such as Vietnam, Thailand, Burma, and India. Nevertheless, it was believed that the anthropogenic activities have significantly reduced the diversity of plants of this genus in nature [18]. The aim of this project was to obtain a genome sequence of R. delavayi. With an available genome sequence, several next-generation sequencing approaches requiring a reference will become feasible, which will enable more in-depth research into genome-environment interactions, help with marker development for phylogenetic studies, and open possibilities for genome-assisted cultivar breeding and other downstream applications.

Data Description

Sample collection

Tissue samples were obtained from a 50-year-old tree growing in Jindian National Forest Park (Kunming, Yunnan, Taxonomy ID: 321363). This tree was transplanted from Cang Shan Mountain (Dali, Yunnan) in 1995. For genome library preparation, only leaf tissue was used; for transcriptome sequencing, samples were obtained from 5 different tissues: flowers, flower buds, young leaves, mature leaves, and young stems. After collection, tissues were immediately transferred into liquid nitrogen and stored until DNA and RNA extraction.

Illumina sequencing strategy

Genomic DNA was extracted from the leaf tissue using a standard CTAB extraction [19]. Different methods were used to construct different insert size libraries. For the small insert libraries (170, 250, 500, and 800 bp), Illumina's protocols were used as follows (Illumina, San Diego, CA, USA): (i) genomic DNA was fragmented by nebulization with compressed nitrogen gas; (ii) DNA ends were polished, and an adenine was added to the ends of the fragments; (iii) DNA adaptors (Illumina) with a single “T” overhang at the 3’ end were ligated to the DNA fragments above; (iv) the ligation products were run on 2% agarose gels, and the bands corresponding to each insert size were excised. For the large insert libraries (2, 5, 10, 20, and 40 kb), Illumina's mate pair library protocols were followed: (i) genomic DNA was fragmented by nebulization with compressed nitrogen gas; (ii) DNA ends were polished using dNTPs labeled with biotin and circularized for self-ligation; (iii) circularized DNA was fragmented again by DNA Exonuclease, followed by enrichment of fragments containing biotin/streptavidin with magnetic beads; 4) fragment ends were further polished, followed by addition of an “A” base and adaptors to form the large insert libraries. As shown in Table 1, the read length of the large insert libraries (2, 5, 10, 20, and 40 kb) was 49 bp, and the read length of the small insert libraries (170, 500, and 800 bp) was 100 bp, with the exception of the 250 bp insert library, which had a read length of 150 bp. A total of 336.83 Gb (×482.61) raw reads were generated from all constructed libraries. Before assembly, reads with low-quality polymerase chain reaction duplication and adapter contaminations were filtered by SOAPfilter, as included in SOAPdenovo v. 2.04 (SOAPdenovo2, RRID:SCR_014986) [20], and finally 246.06 Gb (×352.55) high-quality sequences were obtained for genome assembly.
Table 1:

Sequencing libraries and data yields for whole-genome shotgun sequencing

Library typeLaneRead length (bp)Insert size (bp)Raw basesClean bases
Total bases (Gb)Depth (×)Total bases (Gb)Depth (×)
PE101210017080.47115.3074.12106.20
PE151115025059.6985.5247.2067.63
PE101410050047.8968.6243.5862.44
PE101310080042.2260.4936.7952.71
MP50249200030.3643.5019.5628.03
MP50349500023.1133.119.0612.98
MP5034910 00020.1728.906.719.61
MP5024920 00019.0127.244.356.23
MP5014940 00013.9119.934.696.72
Total21336.83482.61246.06352.55

Sequencing depth was calculated based on a genome size of 697.94 Mb. High-quality data were obtained by filtering raw data for low-quality and duplicate reads.

Sequencing libraries and data yields for whole-genome shotgun sequencing Sequencing depth was calculated based on a genome size of 697.94 Mb. High-quality data were obtained by filtering raw data for low-quality and duplicate reads. RNA of each tissue was extracted separately according to the TRIzol protocol (Invitrogen) and then combined in homogenized RNA concentration. Total mRNAs were purified from total RNA by Dynal Oilgo (dT) beads (Invitrogen). Random oligo-nucleotides and M-MuLV Reverse Transcriptase (RNase H) were used to synthesize the first cDNA strand, and then the second cDNA strand was synthesized using DNA Polymerase I and RNase H. The cDNA libraries with insert sizes of 200–500 base pairs (bps) were selected and purified with the AMPure XP beads system (Beckman Coulter), and subsequently sequenced on an Illumina HiSeq 2000 platform. Both cDNA library construction and Illumina sequencing were carried out by BGI-ShenZhen. Paired-end reads were generated with a read length of 90 bp. The raw reads were filtered by SOAPnuke (SOAPnuke, RRID:SCR_015025) [21], with the following criteria for being discarded: (i) reads contained adaptors; (ii) reads with unknown nucleotides larger than 5%; (iii) low-quality reads (the rate of reads at which quality value ≤ 10 is more than 20%). After filtering, 7.13 G clean reads were obtained for genome evaluation and gene annotation. All clean reads were uploaded to NCBI (SRA505613).

Genome size estimates

We characterized the genome sequence (genome size, heterozygosity, and repetitive content) using the distribution of k-mers of 17, 21, 25, and 27 lengths from the clean reads (29 Gb clean reads from 500 and 800 bp insert size libraries). This analysis was performed using KmerFreq (included in SOAPdenovo, v. 2.04). The genome size (G) of R. delavayi was estimated by the following formula: G = k-mer_number/k-mer_depth, where the k-mer_number is the total number of k-mers, and k-mer_depth refers to the most frequent peak. All 4 k-mer distribution curves displayed 4 distinct peaks (Fig. 2A). The first peak at k = 1 was an artifact caused by sequencing errors, each of which created a k-mer that never occurred in the genome. The remaining 3 peak distributions indicated that the genome is a slightly repetitive, heterozygous, diploid genome. The third peak was a “diploid” peak (k-mers shared between homologous chromosomes) and was twice as deep as the second “haploid” peak (k-mers unique to a haplotype due to heterozygosity). The fourth peak was a repetitive peak (k-mers duplicated due to repetition) and was twice as deep as the “diploid” peak. For k = 17, the homozygous peak (the third peak) was found at a depth of ∼×35, with a k-mer_number of 24 427 946 424 and k-mer_depth of 35. The R. delavayi genome size was estimated to be 695.94 Mb, and the data used in 17-mer analysis was about ×41.7 coverage of the genome. All the k-mer sizes yielded similar genome size estimates of ∼697–717 Mb (Table 2).
Figure 2:

k-mer analysis of the R. delavayi genome. (A) Histograms of k-mer frequencies in the clean read data for k = 17 (green), k = 21 (purple), k = 25 (orange), and k = 27 (yellow) by KmerFreq. (B) Histograms of k-mer frequencies in clean data for k = 25 (red) and k = 31 (blue) by jellyfish. The x-axis shows the number of times a k-mer occurred; e.g., the peaks near x = 31 indicate the number of k-mers that occurred 31 times in the data.

Table 2:

Statistics of genome size estimation by KmerFreq with k = 17, 21, 25, and 27

GenomeK-mer length (bp)K-mer numbersK-mer depthsEstimated genome sizeRead numbersGenome coverage
R. delavayi 1724 427 946 42435697 941 326290 808 886×41.8
2123 264 710 88033704 991 238290 808 886×41.25
2522 101 475 33631712 950 817290 808 886×40.79
2721 519 857 56430717 328 585290 808 886×40.54

The genome size was estimated according to the formula Genome size = k-mer_numbers/k-mer_depths.

Rhododendron delavayi Franch. var. delavayi on Cang Shan Mountain, Dali. k-mer analysis of the R. delavayi genome. (A) Histograms of k-mer frequencies in the clean read data for k = 17 (green), k = 21 (purple), k = 25 (orange), and k = 27 (yellow) by KmerFreq. (B) Histograms of k-mer frequencies in clean data for k = 25 (red) and k = 31 (blue) by jellyfish. The x-axis shows the number of times a k-mer occurred; e.g., the peaks near x = 31 indicate the number of k-mers that occurred 31 times in the data. Statistics of genome size estimation by KmerFreq with k = 17, 21, 25, and 27 The genome size was estimated according to the formula Genome size = k-mer_numbers/k-mer_depths. We also used jellyfish v. 2.0 (jellyfish, RRID:SCR_005491) [22] to make k-mer histograms for k-mers 25 and 31 (Fig. 2B), and genome size estimates were 693 and 703 Mb, respectively (Table 3). The k-mer distribution obtained by jellyfish showed a similar trend to KmerFreq. Using the result from jellyfish as input for GenomeScope [23], heterozygosity estimates for the R. delavayi genome were in the range of ∼0.9–1.1%.
Table 3:

Properties of the R. delavayi k-mer distributions for k = 25 and k =31 using jellyfish

k-mer length k = 25 k = 31
Total k-mers22 120 556 92220 373 342 031
Error k-mers615 612 427688 273 368
Haploid coverage depth1614
Diploid coverage depth3128
Diploid genome size693 707 887703 038 167

The genome size was estimated according to the formula Genome size = (Total k-mers—Error k-mers)/Diploid coverage depth.

Properties of the R. delavayi k-mer distributions for k = 25 and k =31 using jellyfish The genome size was estimated according to the formula Genome size = (Total k-mers—Error k-mers)/Diploid coverage depth.

Genome and transcriptome assembly

The Rhododendron delavayi genome was assembled using Platanus v. 1.2.4 (Platanus, RRID:SCR_015531) [20], employing the 3 following steps: contig assembly, scaffolding, and gap closing. For the contig assembly step, the command line parameters “platanus assemble -t 20 -m 300 -u 0.2 -d 0.5 -k 41 -s 10” were specified to construct de Bruijn graphs for small insert size libraries (170, 250, 500, and 800 bp), to modify the graphs, and to display the output sequences. With these options, Platanus increased the k-mer size by the step size kstep (default 10) and iteratively reconstructed the graphs. Assembled contigs and bubbles in the graphs were obtained from this step. In the scaffolding step, the bubbles and reads from the libraries with small insert sizes (170, 250, 500, and 800 bp) and large insert sizes (2, 5, 10, 20, and 40 kb) were mapped onto the assembled contigs for scaffold construction. The command used for this was “platanus scaffold -t 20 -u 0.2 -c contigs.fasta -b bubble.fasta -IP ” -OP In the gap-filling step, the command used was “platanus gap_close –t 20 –IP ” and gaps within scaffolds were filled by reads from small insert size libraries where 1 end could be mapped to 1 contig and the other end extended into a gap. Two more gap-filling steps were performed based on the assembly results, first utilizing KGF (v. 1.06) [24], followed by GapCloser v. 1.12-r6 (GapCloser, RRID:SCR_015026) [24]. To remove a probable redundant sequence in the genome, we used jellyfish v. 2.0 to calculate the 17-mer frequency table from all short insert libraries, then passed the result to trimDup, which comes as part of Rabbit (the software is also archived in the Gigascience repository, GigaDB) [25, 26, 27]. The following command was used “trimDup 17-mer_table 17 1.5*main_peak genome.fa 0.3.” Hence, k-mers were excluded if their frequency was higher than 1.5 times the main peak. Each k-mer was defined as either a “repeat” or a “unique” k-mer, depending on whether its occurrence frequency was greater or less than twice the average frequency. Rabbit uses a Poisson-based k-mer model to establish a 17-mer frequency table from each scaffold of the genome sequences and then determines unique k-mers belonging to each scaffold and common k-mers shared by the scaffolds. The 17-mer frequency table generated in jellyfish is then used to filter the scaffolds so that the ratio of common to unique k-mers reaches 0.3. After the removal of 57.52 Mb of redundant scaffolds, a total scaffold length of 695 Mb was generated (Table 4). The contig N50 was 61.81 Kb, and the scaffold N50 was 637.82 Kb, while scaffolds with lengths of less than 100 bp were excluded. Meanwhile, we also ran another de novo assembler, SOAPdenovo2 (SOAPdenovo2, RRID:SCR_014986), with various modifications of parameters, but the results (Table 5) from SOAPdenovo2 were not better than those generated above.
Table 4:

The genome assembly and completeness of R. delavayi

ContigScaffold
Size (bp)NumberSize (bp)Number
N5061 8012871637 826313
Minimum length1379
Maximum length581 4293 407 404
Total size657 780 215695 092 854
Number (≥100 bp)209 926193 086
Number (≥2 kb)20 1754972
Number (≥100 kb)13151230
Number (≥1 Mb)140
CEGMA completeness95.87% [238]
CEGMA partial98.39% [244]
BUSCO completeness92.8% [1337]
BUSCO fragment1.8%% [26]

Numbers of genes that match CEGMA or BUSO are shown in square brackets.

Table 5:

Statistics of the assembly with different parameters

AssemblerAssembly size (bp)Contig N50 (bp)Scaffold N50 (bp)K-mer (bp)GapcloserRabbit
SOAPdenovo2854 390 781900338063NoNo
SOAPdenovo2543 175 1561118594637NoNo
SOAPdenovo21 231 272 24119 79267 53987YesNo
SOAPdenovo2796 221 79825 301104 91787YesYes
Platanus750 231 56313 232583 08441NoNo
Platanus809 870 2717886383 82647NoNo
Platanus752 607 34654 782584 19041YesNo
Platanus695 092 85461 801637 82641YesYes
The genome assembly and completeness of R. delavayi Numbers of genes that match CEGMA or BUSO are shown in square brackets. Statistics of the assembly with different parameters Transcript assembly was carried out in Trinity release-20130225 (Trinity, RRID:SCR_013048) [28] with the following parameters: minimum contig length 200 bp, min glue 3, group pair distance 280, path reinforcement distance 85, and min kmer covage 3. The TGI Clustering Tool (TGICL) v. 2.1 [29] was used to remove redundancies and merge the Unigenes with overlaps of at least 40 bp. Finally, a total of 83 515 Unigenes were obtained, with a mean length of 1014 bp and an N50 of 1727 bp.

Genome evaluation

We evaluated the completeness of the genome assembly using Core Eukaryotic Genes Mapping Approach (CEGMA) v. 2.5 (CEGMA, RRID:SCR_015055) [30] and Benchmarking Universal Single-Copy Orthologs v. 2.0 (BUSCO, RRID:SCR_015008) [31], which assess genome completeness using the conserved genes from the NCBI Eukaryotic Clusters of Orthologous Groups (KOGs) and BUSCO databases, respectively. CEGMA results indicated that 95.97% of core eukaryotic genes were contained in our assembly (238 out of 248 core eukaryotic genes). BUSCO analysis resulted in 92.8% of plants set (embryophyta_odb9, downloaded from BUSCO) identified as complete (1337 out of 1440 BUSCOs). More detailed information is given in Table 4. The Unigenes were aligned to the R. delavayi genome using BLAT v. 0.36 (BLAT, RRID:SCR_011919) [32] with default parameters. The alignment indicated that the assembled genome of R. delavayi covered 96.98% of the Unigenes, 89.57% of the Unigenes with at least 90% coverage in 1 scaffold, and 98.90% of the Unigenes with at least 50% coverage in 1 scaffold, suggesting a high level of coverage (Table 6).
Table 6:

The unigene coverage of transcriptome data by R. delavayi assembly

DataNumberTotalBase coverage>90% sequence in>50% sequence in
setunigeneslength (bp)by assembly (%)1 scaffold (%)1 scaffold (%)
>200 bp83 51584 701 67496.9889.5798.90
>500 bp46 58273 471 40196.9085.6499.03
>1000 bp29 81661 377 04396.8082.8599.08
The unigene coverage of transcriptome data by R. delavayi assembly

Repeat annotation

To identify tandem repeats, TRF v. 4.07 [33] was used with the following parameters: Match = 2, Mismatch = 7, Delta = 7, PM = 80, PI = 10, Minscore = 50, MaxPerid = 2000. In total 29 073 954 bp of tandem repeat sequences were detected, representing 4.18% of the R. delavayi genome. Transposable elements were identified by using homology and de novo methods. Homology: RepeatMasker v. 4.0.5 (RepeatMasker, RRID:SCR_012954) [34] was employed to identify transposable elements with RepBase library (version 20.04) [35], while RepeatProteinMask (v. 4.05) [36] was used to identify transposable elements against the TE protein database in RepBase. De novo: (i) RepeatModeler v. 1.07 (RepeatModeler, RRID:SCR_015027) [37] and LTR_FINDER v. 1.05 (LTR_Finder, RRID:SCR_015247) [38] were used to identify transposable elements; (ii) the results from RepeatModeler and LTR_FINDER were merged into a de novo repeat library; (iii) RepeatMasker was employed to categorize the genome sequence against the de novo repeat library. Finally, transposable elements identified by homology or de novo library within the same category were merged by overlap. Transposable elements accounted for 51.77% of the R. delavayi genome, while long terminal repeat elements (LTRs) represented the largest fraction (37.48%) of transposable elements (Table 7). The most abundant subtypes were Copia and Gypsy, representing 6.84% and 25.49% of the assembly genome, respectively.
Table 7:

Transposable elements in the R. delavayi genome

Repbase TE lengthProtein TE length De novo TE lengthCombined TEs
LengthPercentage
DNA7 882 5017 328 64569 812 24977 776 55711.19
LINE4 811 97612 454 81331 065 63836 834 0885.30
SINE125 7920.00869 547991 7850.14
LTR34 884 68152 469 776257 040 066260 532 49637.48
Other5520.000.005520.00
Unknown0.000.004 565 7544 565 7540.67
Total47 001 84472 016 848350 372 642359 874 50351.77

Repbase TEs means RepeatMask against Repbase; Protein TEs means RepeatProteinMask result against Repbase protein; De novo TEs means RepeatMask against the de novo library; Combined TEs means the combined result of the 3 steps.

Transposable elements in the R. delavayi genome Repbase TEs means RepeatMask against Repbase; Protein TEs means RepeatProteinMask result against Repbase protein; De novo TEs means RepeatMask against the de novo library; Combined TEs means the combined result of the 3 steps.

Gene prediction

We combined homology-based, de novo, and transcript alignment methods to predict protein-coding genes in the R. delavayi genome. Four major steps were employed, and a detailed pipeline is given in Fig. 3.
Figure 3:

The gene prediction pipeline.

The gene prediction pipeline. For gene prediction based on homology, we obtained gene sets from Arabidopsis thaliana [37], Actinidia chinensis [39], Capsicum annuum [40], Mimulus guttatus [41], Solanum tuberosum [42], and Solanum lycopersicum [43]. For genes with alternative splicing variants, the longest transcript was selected to represent the gene. We aligned these homologous protein sequences to the R. delavayi genome using TBLASTN (v. 2.2.26) [44], employing an E-value threshold of 1e-5. The resulting BLAST hits were linked to candidate gene loci using solar (v. 0.9.6) [45] with options “-a prot2genome2 –z.” Then, we extracted the candidate gene locus sequences including 1 kb of flanking DNA upstream and downstream and used Genewise v. 2.2.0 (GeneWise, RRID:SCR_015054) [46] to define the intron-exon boundary. Genes with lengths of less than 150 bp or with erroneous structure (premature stop codon or frame shifts) were excluded from further analysis. For the de novo prediction step, the repeat masked genome was used as input for 2 programs, AUGUSTUS v. 3.03 (Augustus: Gene Prediction, RRID:SCR_008417) [47] and GENSCAN v. 1.0 (GENSCAN, RRID:SCR_012902) [48]. To obtain a training set for AUGUSTUS, we randomly selected 5919 full-length genes that had been predicted based on homology, while for GENSCAN, Arabidopsis parameters were used. For the final non-redundant gene set, genes predicted based on both homology and de novo methods were combined with GLEAN (v. 1.0) [49], setting options “-gff -minlen 150 -minintron 11 -maxintron 15 000.” Genes with erroneous structure or short length were again excluded based on the same thresholds used for homology prediction. For the transcript alignment prediction step, the short reads from the transcriptome data set generated in the previous step were mapped to the R. delavayi genome using Tophat v. 2.1.1 (TopHat, RRID:SCR_013035) [50] to identify the splice junctions. Cufflinks v. 2.2.1 (Cufflinks, RRID:SCR_014597) [51] was then used to assemble transcripts from the Tophat outputs. The coding potential of these transcripts was identified by using the same gene sets with a fifth-order Hidden Markov Model, which was achieved by the same gene sets used in the training of AUGUSTUS. In the gene set combination step, outputs from GLEAN were combined with transcript assemblies as follows: first, translated sequences of both sets were cross-matched with an all-to-all BLASTP using an E-value cutoff of 1e-10. The matching transcript assemblies were then added to the GLEAN results as either untranslated regions (UTRs) or alternative splice forms, based on whether coverage and identity of the alignment results was larger than 0.9. The transcript assemblies that had no BLAST hit with the GLEAN results were added to the final set as novel genes. As a result of these steps, a total of 32 938 non-redundant genes were predicted in the R. delavayi genome (Table 8). These genes were scattered over 2149 scaffolds, averaging 15.33 genes per scaffold.
Table 8:

Summary of R. delavayi genome annotation

GeneGene numbersAverage geneAverage CDSAverage exonsAverage exonAverage intron
setof predictionlength (bp)length (bp)per genelength (bp)length (bp)
De novo AUGUSTUS42 6722623.41974.424.76204.56438.16
GENSCAN35 85911 242.681186.916.35186.871879.03
Homolog A. chinensis 45 4493501.48846.203.21263.431200.29
A. thaliana 31 9503724.50994.904.07244.30888.41
C. annuum 47 6722558.30805.263.01267.50872.00
M. guttatus 34 6163454.51963.763.95244.21845.35
S. lycopersicum 38 8003324.95917.113.74245.47880.01
S. tuberosum 39 0852958.21850.183.22263.79948.30
GLEAN29 5854126.651150.324.84237.78775.53
RNA-seq38 2732989.97828.783.45240.07881.29
Final set32 9384434.221153.214.62249.70785.08
Summary of R. delavayi genome annotation We also used Maker-P [52] to predict gene model with current homolog, de novo, and transcriptome results by using the parameter “est_gff, protein_gff, pred_gff” according to the Maker-P manual. The CEGMA assessment showed that our current pipeline identified 97.09% (234 of 241) of core eukaryotic genes, while the Maker-P pipeline identified only 86.72% (209 of 241) core eukaryotic genes. The BUSCO evaluation demonstrated that 87.4% and 6.4% of 1440 expected plant genes were identified as completeness and fragment, respectively (Table 9). Both assessment methods suggested that for the R. delavayi genome our current pipeline performed better than the Maker-P pipeline.
Table 9:

BUSCO assessment of gene prediction with different pipelines

Current pipelineMaker-P
BUSCO benchmarkNumberPercentageNumberPercentage
Total BUSCO groups searched14401440
Complete single copy BUSCOs118882.5105673.3
Complete duplicated BUSCOs704.9674.7
Fragmented BUSCOs926.415210.6
Missing BUSCOs906.216511.4
BUSCO assessment of gene prediction with different pipelines

Functional annotation

Gene function annotation was assigned based on sequence and domain conservation. (i) Assignment based on sequence conservation: protein sequences of R. delavayi were aligned to KEGG (v. 76) [53] and SwissProt and TrEMBL (Uniprot release 201406) [54] by BLASTP (v. 2.2.26) using an E-value threshold of 1e-5. Best-hit BLAST results were then used to define the gene functions. (ii) Assignment based on domain conservation: InterProScan-5.11–51.0 (InterProScan, RRID:SCR_005829) [55] was employed to identify motifs and domains by matching against public databases Pfam [56], PRINTS [57], ProDom [58], SMART [59], and PANTHER [60]. Gene ontology identities [61] for each gene were then obtained from the corresponding InterPro entry [62]. Overall, 85.91% of genes were functionally annotated by at least 1 of the 5 databases above, with 22 946 InterPro entries, 16 471 GO entries, 21 210 KEGG entries, 22 693 SwissProt entries, and 27 975 TrEMBL entries (Table 10).
Table 10:

Statistics for functional annotations in corresponding InterPRro entry

Numbers ofPercent of
matching genesannotated genes
InterPro22 94669.66
GO16 47150.00
KEGG21 21064.39
Swissprot22 69368.90
TrEMBL27 97584.93
Annotated28 29685.91
Unannotated464214.09
Statistics for functional annotations in corresponding InterPRro entry

Gene family construction

As references, protein sequences of 10 angiosperms (Actinidia chinensis, Primula veris, Catharanthus roseus, Dendrobium officinale, Phalaenopsis equestris, Tarenaya hassleriana, Solanum tuberosum, Solanum lycopersicum, Arabidopsis thaliana, and Oryza sativa) were downloaded (see the Supplementary Data). For genes with alternative splicing variants, the longest transcript was selected to represent the gene. Similarities between sequence pairs were calculated using BLASTP with an E-value threshold of 1e-5. Additionally, OrthoMCL (OrthoMCL DB: Ortholog Groups of Protein Sequences, RRID:SCR_007839) [63] was used with default parameters to identify gene family membership based on overall gene similarity combined with Markov Chain Clustering (MCL). Of all annotated genes, 77.60% were assigned to a family. A total of 14 836 families were represented, of which 1097 were specific of Rhododendron delavayi (Table 11). Figure 4 showed the number of orthologous gene families shared between 6 flower plant genomes, which have 5312 orthologous gene families in common with ancestral functions.
Table 11:

The statistic results of gene family clusters

Number ofGenes inUnclusteredNumber ofUniqueAverage number of
Speciesgenesfamiliesgenesfamiliesfamiliesgenes per family
R. delavayi 32 93825 560737814 83610971.72
A. chinensis 39 04026 06112 97914 04711001.86
P. veris 18 26915 080318911 4341801.32
C. roseus 28 17215 12213 05010 72512311.41
D. officinale 35 47425 525994914 41610911.77
P. equestris 29 41321 086832713 8347051.52
T. hassleriana 39 88138 100178114 3996232.65
S. tuberosum 34 87928 093678616 1186671.74
S. lycopersicum 33 58525 623796217 1395321.50
A. thaliana 26 63723 007363014 4825391.59
O. sativa 38 94226 64412 29813 63220201.95
Figure 4:

Groups of orthologues shared among the angiosperms Rhododendron delavayi (RHOQ), Actinidia chinensis (KIWI), Primula veris (BAOC), Catharanthus roseus (CHAN), Phalaenopsis equestris (HDLH), and Tarenaya hassleriana (ZDIH). Venn diagram generated by http://www.interactivenn.net/.

Groups of orthologues shared among the angiosperms Rhododendron delavayi (RHOQ), Actinidia chinensis (KIWI), Primula veris (BAOC), Catharanthus roseus (CHAN), Phalaenopsis equestris (HDLH), and Tarenaya hassleriana (ZDIH). Venn diagram generated by http://www.interactivenn.net/. The statistic results of gene family clusters

Phylogenetic analysis

For a phylogenetic analysis, 326 single copy orthologs were selected from the gene family step, and translated protein sequences were aligned in MUSCLE v. 3.8.31 (MUSCLE, RRID:SCR_011812) [64]. Next, the protein alignments were converted to corresponding coding sequences (CDS) using an in-house Perl script. Afterwards, the coding sequences of each single copy family were concatenated to form 1 supergene for each species. The nucleotides at positions 2 (phase 1 site) and 3 (4-fold degenerate site) of each codon were extracted separately and were used to construct 2 separate phylogenetic trees in PhyML3.0 (PhyML, RRID:SCR_014629) [65] specifying a HKY85 substitution model with a gamma distribution across sites. The tree using the phase 1 site was consistent with the tree using the 4 degenerate sites.

Divergence time

A Bayesian relaxed molecular clock approach was used to estimate species divergence time using MCMCTREE in PAML (PAML, RRID:SCR_014932) [66] based on the 4 degenerate sites data set used in phylogenetic analysis. When using previously published calibration times (split of Oryza sativa and Arabidopsis thaliana fixed as 130∼200 Mya) [67], the divergence time between R. delavayi and Actinidia chinensis was estimated to be in the range of 56.1–120.8 million years ago (Fig. 5).
Figure 5:

Estimation of divergence time. The blue numbers on the nodes are the divergence times from present; the red node indicates the calibrated split.

Estimation of divergence time. The blue numbers on the nodes are the divergence times from present; the red node indicates the calibrated split.

Conclusion

Now the order Ericales has 3 draft genome sequences of 3 economically important species (kiwi fruit [Actinidia chinensis], American cranberry [Vaccinium macrocarpon], and R. delavayi), 2 of which (V. macrocarpon and R. delavayi) also belong to the family Ericaceae. The availability of the R. delavayi genome sequence should facilitate de novo genome assembly of other species in this genus and, moreover, allow scientists to investigate interactions between environmental factors and related species at a molecular level. Furthermore, phylogenetic research can now draw on a genome as a resource to identify regions providing suitable resolution in this taxonomically difficult group, and it may become easier to identify the genes involved in metabolite pathways that have potential pharmaceutical importance.

Availability of supporting data

Supporting data and the Rabbit software are available in the GigaDB database [27]. The raw data were deposited in the SRA527514 with project accession PRJNA361437 for the Rhododendron delavayi genome. Actinidia chinensis (ftp://bioinfo.bti.cornell.edu/pub/kiwifruit/) Catharanthus roseus (http://bioinformatics.psb.ugent.be/orcae/overview/Catro) Primula veris (http://datadryad.org/resource/doi:10.5061/dryad.2s200) Dendrobium officinale (ftp://202.203.187.112/genome/dendrobe/) Phalaenopsis equestris (ftp://ftp.genomics.org.cn/from_BGISZ/20130120/) Solanum tuberosum: phytozome12.0 (https://phytozome.jgi.doe.gov/pz/portal.html) Solanum lycopersicum: phytozome12.0 (https://phytozome.jgi.doe.gov/pz/portal.html) Arabidopsis thaliana: phytozome12.0 (https://phytozome.jgi.doe.gov/pz/portal.html) Oryza sativa: phytozome12.0 (https://phytozome.jgi.doe.gov/pz/portal.html)

Abbreviations

Gb: gigabase; GO: gene ontology; PE: paired-end; TE: transposable element.

Competing interests

The authors declare that they have no competing interests.

Funding

This project was supported by the Program of Science and Technology Talents Training in Yunnan province (2016HA005), the Program of Innovative Talents Promotion by the Chinese Ministry of Science and Technology (2014HE002), the Applied Basic Research Project of Yunnan Province (2016FB058), and the National Natural Science Foundation of China (31460217, 31560225).

Author contributions

L.Z., J.W., Y.C., L.M., and Q.G. conceived the project. S.L., F.L., W.X., J.S., L.P., and H.Y. designed sample collection and extracted the genomic DNA. P.X. led the genome analysis, conducted the genome assembling, and predicted gene structure and repeat sequences. All of the authors listed above participated in discussions of the project and data. P.X., L.Z. (Lu Zhang), Q.G., and J.W. co-drafted the manuscript, and L.Z. (Ling Zou), Y.M., and C.Z. helped with manuscript revision. All authors read and approved the final manuscript. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file.
  54 in total

1.  The Pfam protein families database.

Authors:  A Bateman; E Birney; R Durbin; S R Eddy; K L Howe; E L Sonnhammer
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  TEclass--a tool for automated classification of unknown eukaryotic transposable elements.

Authors:  György Abrusán; Norbert Grundmann; Luc DeMester; Wojciech Makalowski
Journal:  Bioinformatics       Date:  2009-04-05       Impact factor: 6.937

3.  KEGG: Kyoto Encyclopedia of Genes and Genomes.

Authors:  H Ogata; S Goto; K Sato; W Fujibuchi; H Bono; M Kanehisa
Journal:  Nucleic Acids Res       Date:  1999-01-01       Impact factor: 16.971

4.  Tandem repeats finder: a program to analyze DNA sequences.

Authors:  G Benson
Journal:  Nucleic Acids Res       Date:  1999-01-15       Impact factor: 16.971

5.  A heterozygous moth genome provides insights into herbivory and detoxification.

Authors:  Minsheng You; Zhen Yue; Weiyi He; Xinhua Yang; Guang Yang; Miao Xie; Dongliang Zhan; Simon W Baxter; Liette Vasseur; Geoff M Gurr; Carl J Douglas; Jianlin Bai; Ping Wang; Kai Cui; Shiguo Huang; Xianchun Li; Qing Zhou; Zhangyan Wu; Qilin Chen; Chunhui Liu; Bo Wang; Xiaojing Li; Xiufeng Xu; Changxin Lu; Min Hu; John W Davey; Sandy M Smith; Mingshun Chen; Xiaofeng Xia; Weiqi Tang; Fushi Ke; Dandan Zheng; Yulan Hu; Fengqin Song; Yanchun You; Xiaoli Ma; Lu Peng; Yunkai Zheng; Yong Liang; Yaqiong Chen; Liying Yu; Younan Zhang; Yuanyuan Liu; Guoqing Li; Lin Fang; Jingxiang Li; Xin Zhou; Yadan Luo; Caiyun Gou; Junyi Wang; Jian Wang; Huanming Yang; Jun Wang
Journal:  Nat Genet       Date:  2013-01-13       Impact factor: 38.330

6.  Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

Authors: 
Journal:  Nature       Date:  2000-12-14       Impact factor: 49.962

Review 7.  Chromatographic and electrophoretic methods for pharmaceutically active compounds in Rhododendron dauricum.

Authors:  Yuhua Cao; Qingcui Chu; Jiannong Ye
Journal:  J Chromatogr B Analyt Technol Biomed Life Sci       Date:  2004-12-05       Impact factor: 3.205

8.  The sequence and de novo assembly of the giant panda genome.

Authors:  Ruiqiang Li; Wei Fan; Geng Tian; Hongmei Zhu; Lin He; Jing Cai; Quanfei Huang; Qingle Cai; Bo Li; Yinqi Bai; Zhihe Zhang; Yaping Zhang; Wen Wang; Jun Li; Fuwen Wei; Heng Li; Min Jian; Jianwen Li; Zhaolei Zhang; Rasmus Nielsen; Dawei Li; Wanjun Gu; Zhentao Yang; Zhaoling Xuan; Oliver A Ryder; Frederick Chi-Ching Leung; Yan Zhou; Jianjun Cao; Xiao Sun; Yonggui Fu; Xiaodong Fang; Xiaosen Guo; Bo Wang; Rong Hou; Fujun Shen; Bo Mu; Peixiang Ni; Runmao Lin; Wubin Qian; Guodong Wang; Chang Yu; Wenhui Nie; Jinhuan Wang; Zhigang Wu; Huiqing Liang; Jiumeng Min; Qi Wu; Shifeng Cheng; Jue Ruan; Mingwei Wang; Zhongbin Shi; Ming Wen; Binghang Liu; Xiaoli Ren; Huisong Zheng; Dong Dong; Kathleen Cook; Gao Shan; Hao Zhang; Carolin Kosiol; Xueying Xie; Zuhong Lu; Hancheng Zheng; Yingrui Li; Cynthia C Steiner; Tommy Tsan-Yuk Lam; Siyuan Lin; Qinghui Zhang; Guoqing Li; Jing Tian; Timing Gong; Hongde Liu; Dejin Zhang; Lin Fang; Chen Ye; Juanbin Zhang; Wenbo Hu; Anlong Xu; Yuanyuan Ren; Guojie Zhang; Michael W Bruford; Qibin Li; Lijia Ma; Yiran Guo; Na An; Yujie Hu; Yang Zheng; Yongyong Shi; Zhiqiang Li; Qing Liu; Yanling Chen; Jing Zhao; Ning Qu; Shancen Zhao; Feng Tian; Xiaoling Wang; Haiyin Wang; Lizhi Xu; Xiao Liu; Tomas Vinar; Yajun Wang; Tak-Wah Lam; Siu-Ming Yiu; Shiping Liu; Hemin Zhang; Desheng Li; Yan Huang; Xia Wang; Guohua Yang; Zhi Jiang; Junyi Wang; Nan Qin; Li Li; Jingxiang Li; Lars Bolund; Karsten Kristiansen; Gane Ka-Shu Wong; Maynard Olson; Xiuqing Zhang; Songgang Li; Huanming Yang; Jian Wang; Jun Wang
Journal:  Nature       Date:  2009-12-13       Impact factor: 49.962

9.  AUGUSTUS: ab initio prediction of alternative transcripts.

Authors:  Mario Stanke; Oliver Keller; Irfan Gunduz; Alec Hayes; Stephan Waack; Burkhard Morgenstern
Journal:  Nucleic Acids Res       Date:  2006-07-01       Impact factor: 16.971

10.  Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads.

Authors:  Rei Kajitani; Kouta Toshimoto; Hideki Noguchi; Atsushi Toyoda; Yoshitoshi Ogura; Miki Okuno; Mitsuru Yabana; Masayuki Harada; Eiji Nagayasu; Haruhiko Maruyama; Yuji Kohara; Asao Fujiyama; Tetsuya Hayashi; Takehiko Itoh
Journal:  Genome Res       Date:  2014-04-22       Impact factor: 9.043

View more
  20 in total

1.  An approach to determining anthocyanin synthesis enzyme gene expression in an evolutionary context: an example from Erica plukenetii.

Authors:  N C Le Maitre; M D Pirie; D U Bellstedt
Journal:  Ann Bot       Date:  2019-08-02       Impact factor: 4.357

2.  Characterization of Two Key Flavonoid 3-O-Glycosyltransferases Involved in the Formation of Flower Color in Rhododendron Delavayi.

Authors:  Wei Sun; Shiyu Sun; Hui Xu; Yuhan Wang; Yiran Chen; Xiaorong Xu; Yin Yi; Zhigang Ju
Journal:  Front Plant Sci       Date:  2022-05-16       Impact factor: 6.627

3.  Chromosome-scale genome assembly of Rhododendron molle provides insights into its evolution and terpenoid biosynthesis.

Authors:  Guo-Lin Zhou; Yong Li; Fei Pei; Ting Gong; Tian-Jiao Chen; Jing-Jing Chen; Jin-Ling Yang; Qi-Han Li; Shi-Shan Yu; Ping Zhu
Journal:  BMC Plant Biol       Date:  2022-07-15       Impact factor: 5.260

4.  Biotin provisioning by horizontally transferred genes from bacteria confers animal fitness benefits.

Authors:  Fei-Rong Ren; Xiang Sun; Tian-Yu Wang; Ya-Lin Yao; Yan-Zhen Huang; Xue Zhang; Jun-Bo Luan
Journal:  ISME J       Date:  2020-06-22       Impact factor: 10.302

5.  Chromosome-Level Genome Assembly of the American Cranberry (Vaccinium macrocarpon Ait.) and Its Wild Relative Vaccinium microcarpum.

Authors:  Luis Diaz-Garcia; Luis Fernando Garcia-Ortega; Maria González-Rodríguez; Luis Delaye; Massimo Iorizzo; Juan Zalapa
Journal:  Front Plant Sci       Date:  2021-02-10       Impact factor: 5.753

6.  Selection and Evaluation of Candidate Reference Genes for Quantitative Real-Time PCR in Aboveground Tissues and Drought Conditions in Rhododendron Delavayi.

Authors:  Lu Zhang; Yanfei Cai; Mingchao Zhang; Guanghui Du; Jihua Wang
Journal:  Front Genet       Date:  2022-04-14       Impact factor: 4.772

7.  Genome survey sequencing and identification of genomic SSR markers for Rhododendron micranthum.

Authors:  Xiao-Jun Zhou; Meng-Xue Liu; Xiao-Yu Lu; Shan-Shan Sun; Yan-Wei Cheng; Hui-Yuan Ya
Journal:  Biosci Rep       Date:  2020-06-26       Impact factor: 3.840

8.  The draft genome assembly of Rhododendron delavayi Franch. var. delavayi.

Authors:  Lu Zhang; Pengwei Xu; Yanfei Cai; Lulin Ma; Shifeng Li; Shufa Li; Weijia Xie; Jie Song; Lvchun Peng; Huijun Yan; Ling Zou; Yongpeng Ma; Chengjun Zhang; Qiang Gao; Jihua Wang
Journal:  Gigascience       Date:  2017-10-01       Impact factor: 6.524

9.  Characterization of Volatile Compounds in Four Different Rhododendron Flowers by GC×GC-QTOFMS.

Authors:  Chen-Yu Qian; Wen-Xuan Quan; Zhang-Min Xiang; Chao-Chan Li
Journal:  Molecules       Date:  2019-09-12       Impact factor: 4.411

10.  Illuminating the Plant Rhabdovirus Landscape through Metatranscriptomics Data.

Authors:  Nicolás Bejerman; Ralf G Dietzgen; Humberto Debat
Journal:  Viruses       Date:  2021-07-05       Impact factor: 5.048

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.