| Literature DB >> 29020749 |
Lu Zhang1,2, Pengwei Xu3, Yanfei Cai1,2, Lulin Ma1,2, Shifeng Li1,2, Shufa Li1,2, Weijia Xie1,2, Jie Song1,2, Lvchun Peng1,2, Huijun Yan1,2, Ling Zou1,2, Yongpeng Ma4, Chengjun Zhang5, Qiang Gao3, Jihua Wang1,2.
Abstract
Rhododendron delavayi Franch. is globally famous as an ornamental plant. Its distribution in southwest China covers several different habitats and environments. However, not much research had been conducted on Rhododendron spp. at the molecular level, which hinders understanding of its evolution, speciation, and synthesis of secondary metabolites, as well as its wide adaptability to different environments. Here, we report the genome assembly and gene annotation of R. delavayi var. delavayi (the second genome sequenced in the Ericaceae), which will facilitate the study of the family. The genome assembly will have further applications in genome-assisted cultivar breeding. The final size of the assembled R. delavayi var. delavayi genome (695.09 Mb) was close to the 697.94 Mb, estimated by k-mer analysis. A total of 336.83 gigabases (Gb) of raw Illumina HiSeq 2000 reads were generated from 9 libraries (with insert sizes ranging from 170 bp to 40 kb), achieving a raw sequencing depth of ×482.6. After quality filtering, 246.06 Gb of clean reads were obtained, giving ×352.55 coverage depth. Assembly using Platanus gave a total scaffold length of 695.09 Mb, with a contig N50 of 61.8 kb and a scaffold N50 of 637.83 kb. Gene prediction resulted in the annotation of 32 938 protein-coding genes. The genome completeness was evaluated by CEGMA and BUSCO and reached 95.97% and 92.8%, respectively. The gene annotation completeness was also evaluated by CEGMA and BUSCO and reached 97.01% and 87.4%, respectively. Genome annotation revealed that 51.77% of the R. delavayi genome is composed of transposable elements, and 37.48% of long terminal repeat elements (LTRs). The de novo assembled genome of R. delavayi var. delavayi (hereinafter referred to as R. delavayi) is the second genomic resource of the family Ericaceae and will provide a valuable resource for research on future comparative genomic studies in Rhododendron species. The availability of the R. delavayi genome sequence will hopefully provide a tool for scientists to tackle open questions regarding molecular mechanisms underlying environmental interactions in the genus Rhododendron, more accurately understand the evolutionary processes and systematics of the genus, facilitate the identification of genes encoding pharmaceutically important compounds, and accelerate molecular breeding to release elite varieties.Entities:
Keywords: Rhododendron delavayi; annotation; genome assembly; genomics
Mesh:
Year: 2017 PMID: 29020749 PMCID: PMC5632301 DOI: 10.1093/gigascience/gix076
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1:Rhododendron delavayi Franch. var. delavayi on Cang Shan Mountain, Dali.
Sequencing libraries and data yields for whole-genome shotgun sequencing
| Library type | Lane | Read length (bp) | Insert size (bp) | Raw bases | Clean bases | ||
|---|---|---|---|---|---|---|---|
| Total bases (Gb) | Depth (×) | Total bases (Gb) | Depth (×) | ||||
| PE101 | 2 | 100 | 170 | 80.47 | 115.30 | 74.12 | 106.20 |
| PE151 | 1 | 150 | 250 | 59.69 | 85.52 | 47.20 | 67.63 |
| PE101 | 4 | 100 | 500 | 47.89 | 68.62 | 43.58 | 62.44 |
| PE101 | 3 | 100 | 800 | 42.22 | 60.49 | 36.79 | 52.71 |
| MP50 | 2 | 49 | 2000 | 30.36 | 43.50 | 19.56 | 28.03 |
| MP50 | 3 | 49 | 5000 | 23.11 | 33.11 | 9.06 | 12.98 |
| MP50 | 3 | 49 | 10 000 | 20.17 | 28.90 | 6.71 | 9.61 |
| MP50 | 2 | 49 | 20 000 | 19.01 | 27.24 | 4.35 | 6.23 |
| MP50 | 1 | 49 | 40 000 | 13.91 | 19.93 | 4.69 | 6.72 |
| Total | 21 | 336.83 | 482.61 | 246.06 | 352.55 | ||
Sequencing depth was calculated based on a genome size of 697.94 Mb. High-quality data were obtained by filtering raw data for low-quality and duplicate reads.
Figure 2:k-mer analysis of the R. delavayi genome. (A) Histograms of k-mer frequencies in the clean read data for k = 17 (green), k = 21 (purple), k = 25 (orange), and k = 27 (yellow) by KmerFreq. (B) Histograms of k-mer frequencies in clean data for k = 25 (red) and k = 31 (blue) by jellyfish. The x-axis shows the number of times a k-mer occurred; e.g., the peaks near x = 31 indicate the number of k-mers that occurred 31 times in the data.
Statistics of genome size estimation by KmerFreq with k = 17, 21, 25, and 27
| Genome | K-mer length (bp) | K-mer numbers | K-mer depths | Estimated genome size | Read numbers | Genome coverage |
|---|---|---|---|---|---|---|
|
| 17 | 24 427 946 424 | 35 | 697 941 326 | 290 808 886 | ×41.8 |
| 21 | 23 264 710 880 | 33 | 704 991 238 | 290 808 886 | ×41.25 | |
| 25 | 22 101 475 336 | 31 | 712 950 817 | 290 808 886 | ×40.79 | |
| 27 | 21 519 857 564 | 30 | 717 328 585 | 290 808 886 | ×40.54 |
The genome size was estimated according to the formula Genome size = k-mer_numbers/k-mer_depths.
Properties of the R. delavayi k-mer distributions for k = 25 and k =31 using jellyfish
| k-mer length |
|
|
|---|---|---|
| Total | 22 120 556 922 | 20 373 342 031 |
| Error | 615 612 427 | 688 273 368 |
| Haploid coverage depth | 16 | 14 |
| Diploid coverage depth | 31 | 28 |
| Diploid genome size | 693 707 887 | 703 038 167 |
The genome size was estimated according to the formula Genome size = (Total k-mers—Error k-mers)/Diploid coverage depth.
The genome assembly and completeness of R. delavayi
| Contig | Scaffold | |||
|---|---|---|---|---|
| Size (bp) | Number | Size (bp) | Number | |
| N50 | 61 801 | 2871 | 637 826 | 313 |
| Minimum length | 13 | 79 | ||
| Maximum length | 581 429 | 3 407 404 | ||
| Total size | 657 780 215 | 695 092 854 | ||
| Number (≥100 bp) | 209 926 | 193 086 | ||
| Number (≥2 kb) | 20 175 | 4972 | ||
| Number (≥100 kb) | 1315 | 1230 | ||
| Number (≥1 Mb) | 140 | |||
| CEGMA completeness | 95.87% [238] | |||
| CEGMA partial | 98.39% [244] | |||
| BUSCO completeness | 92.8% [1337] | |||
| BUSCO fragment | 1.8%% [26] | |||
Numbers of genes that match CEGMA or BUSO are shown in square brackets.
Statistics of the assembly with different parameters
| Assembler | Assembly size (bp) | Contig N50 (bp) | Scaffold N50 (bp) | K-mer (bp) | Gapcloser | Rabbit |
|---|---|---|---|---|---|---|
| SOAP | 854 390 781 | 900 | 3380 | 63 | No | No |
| SOAP | 543 175 156 | 1118 | 5946 | 37 | No | No |
| SOAP | 1 231 272 241 | 19 792 | 67 539 | 87 | Yes | No |
| SOAP | 796 221 798 | 25 301 | 104 917 | 87 | Yes | Yes |
| Platanus | 750 231 563 | 13 232 | 583 084 | 41 | No | No |
| Platanus | 809 870 271 | 7886 | 383 826 | 47 | No | No |
| Platanus | 752 607 346 | 54 782 | 584 190 | 41 | Yes | No |
| Platanus | 695 092 854 | 61 801 | 637 826 | 41 | Yes | Yes |
The unigene coverage of transcriptome data by R. delavayi assembly
| Data | Number | Total | Base coverage | >90% sequence in | >50% sequence in |
|---|---|---|---|---|---|
| set | unigenes | length (bp) | by assembly (%) | 1 scaffold (%) | 1 scaffold (%) |
| >200 bp | 83 515 | 84 701 674 | 96.98 | 89.57 | 98.90 |
| >500 bp | 46 582 | 73 471 401 | 96.90 | 85.64 | 99.03 |
| >1000 bp | 29 816 | 61 377 043 | 96.80 | 82.85 | 99.08 |
Transposable elements in the R. delavayi genome
| Repbase TE length | Protein TE length |
| Combined TEs | ||
|---|---|---|---|---|---|
| Length | Percentage | ||||
| DNA | 7 882 501 | 7 328 645 | 69 812 249 | 77 776 557 | 11.19 |
| LINE | 4 811 976 | 12 454 813 | 31 065 638 | 36 834 088 | 5.30 |
| SINE | 125 792 | 0.00 | 869 547 | 991 785 | 0.14 |
| LTR | 34 884 681 | 52 469 776 | 257 040 066 | 260 532 496 | 37.48 |
| Other | 552 | 0.00 | 0.00 | 552 | 0.00 |
| Unknown | 0.00 | 0.00 | 4 565 754 | 4 565 754 | 0.67 |
| Total | 47 001 844 | 72 016 848 | 350 372 642 | 359 874 503 | 51.77 |
Repbase TEs means RepeatMask against Repbase; Protein TEs means RepeatProteinMask result against Repbase protein; De novo TEs means RepeatMask against the de novo library; Combined TEs means the combined result of the 3 steps.
Figure 3:The gene prediction pipeline.
Summary of R. delavayi genome annotation
| Gene | Gene numbers | Average gene | Average CDS | Average exons | Average exon | Average intron | |
|---|---|---|---|---|---|---|---|
| set | of prediction | length (bp) | length (bp) | per gene | length (bp) | length (bp) | |
|
| AUGUSTUS | 42 672 | 2623.41 | 974.42 | 4.76 | 204.56 | 438.16 |
| GENSCAN | 35 859 | 11 242.68 | 1186.91 | 6.35 | 186.87 | 1879.03 | |
| Homolog |
| 45 449 | 3501.48 | 846.20 | 3.21 | 263.43 | 1200.29 |
|
| 31 950 | 3724.50 | 994.90 | 4.07 | 244.30 | 888.41 | |
|
| 47 672 | 2558.30 | 805.26 | 3.01 | 267.50 | 872.00 | |
|
| 34 616 | 3454.51 | 963.76 | 3.95 | 244.21 | 845.35 | |
|
| 38 800 | 3324.95 | 917.11 | 3.74 | 245.47 | 880.01 | |
|
| 39 085 | 2958.21 | 850.18 | 3.22 | 263.79 | 948.30 | |
| GLEAN | 29 585 | 4126.65 | 1150.32 | 4.84 | 237.78 | 775.53 | |
| RNA-seq | 38 273 | 2989.97 | 828.78 | 3.45 | 240.07 | 881.29 | |
| Final set | 32 938 | 4434.22 | 1153.21 | 4.62 | 249.70 | 785.08 |
BUSCO assessment of gene prediction with different pipelines
| Current pipeline | Maker-P | |||
|---|---|---|---|---|
| BUSCO benchmark | Number | Percentage | Number | Percentage |
| Total BUSCO groups searched | 1440 | 1440 | ||
| Complete single copy BUSCOs | 1188 | 82.5 | 1056 | 73.3 |
| Complete duplicated BUSCOs | 70 | 4.9 | 67 | 4.7 |
| Fragmented BUSCOs | 92 | 6.4 | 152 | 10.6 |
| Missing BUSCOs | 90 | 6.2 | 165 | 11.4 |
Statistics for functional annotations in corresponding InterPRro entry
| Numbers of | Percent of | |
|---|---|---|
| matching genes | annotated genes | |
| InterPro | 22 946 | 69.66 |
| GO | 16 471 | 50.00 |
| KEGG | 21 210 | 64.39 |
| Swissprot | 22 693 | 68.90 |
| TrEMBL | 27 975 | 84.93 |
| Annotated | 28 296 | 85.91 |
| Unannotated | 4642 | 14.09 |
The statistic results of gene family clusters
| Number of | Genes in | Unclustered | Number of | Unique | Average number of | |
|---|---|---|---|---|---|---|
| Species | genes | families | genes | families | families | genes per family |
|
| 32 938 | 25 560 | 7378 | 14 836 | 1097 | 1.72 |
|
| 39 040 | 26 061 | 12 979 | 14 047 | 1100 | 1.86 |
|
| 18 269 | 15 080 | 3189 | 11 434 | 180 | 1.32 |
|
| 28 172 | 15 122 | 13 050 | 10 725 | 1231 | 1.41 |
|
| 35 474 | 25 525 | 9949 | 14 416 | 1091 | 1.77 |
|
| 29 413 | 21 086 | 8327 | 13 834 | 705 | 1.52 |
|
| 39 881 | 38 100 | 1781 | 14 399 | 623 | 2.65 |
|
| 34 879 | 28 093 | 6786 | 16 118 | 667 | 1.74 |
|
| 33 585 | 25 623 | 7962 | 17 139 | 532 | 1.50 |
|
| 26 637 | 23 007 | 3630 | 14 482 | 539 | 1.59 |
|
| 38 942 | 26 644 | 12 298 | 13 632 | 2020 | 1.95 |
Figure 4:Groups of orthologues shared among the angiosperms Rhododendron delavayi (RHOQ), Actinidia chinensis (KIWI), Primula veris (BAOC), Catharanthus roseus (CHAN), Phalaenopsis equestris (HDLH), and Tarenaya hassleriana (ZDIH). Venn diagram generated by http://www.interactivenn.net/.
Figure 5:Estimation of divergence time. The blue numbers on the nodes are the divergence times from present; the red node indicates the calibrated split.