Enhua Xia1, Fangdong Li1, Wei Tong1, Hua Yang1, Songbo Wang2, Jian Zhao1, Chun Liu2, Liping Gao1, Yuling Tai1, Guangbiao She1, Jun Sun1, Haisheng Cao1, Qiang Gao2, Yeyun Li1, Weiwei Deng1, Xiaolan Jiang1, Wenzhao Wang1, Qi Chen1, Shihua Zhang1, Haijing Li1, Junlan Wu1, Ping Wang1, Penghui Li1, Chengying Shi1, Fengya Zheng2, Jianbo Jian2, Bei Huang1, Dai Shan2, Mingming Shi2, Congbing Fang1, Yi Yue1, Qiong Wu1, Ruoheng Ge1, Huijuan Zhao1, Daxiang Li1, Shu Wei1, Bin Han3, Changjun Jiang1, Ye Yin2, Tao Xia1, Zhengzhu Zhang1, Shancen Zhao2, Jeffrey L Bennetzen1,4, Chaoling Wei5, Xiaochun Wan6.
Abstract
Tea is a globally consumed non-alcoholic beverage of great economic importance. However, the lack of a reference genome has largely hampered the use of precious tea plant genetic resources in breeding. To address this issue, we previously generated a high-quality reference genome of the tea plant using Illumina and PacBio sequencing technologies, which produced a total of 2,124 Gb of short-read and 125 Gb of long-read data, respectively. A hybrid strategy was employed to assemble the tea genome, which has been publicly released. Here we describe the data framework used to generate, annotate and validate the genome assembly. In addition, we re-predicted the protein-coding genes and annotated their putative functions using more comprehensive omics datasets with improved training models. We reassessed the assembly and annotation quality using the latest version of BUSCO. These data can be used to develop new methodologies and tools for better assembly of complex genomes, and to aid the discovery of novel genes, variations and evolutionary clues associated with tea quality, thereby helping to breed new varieties with high yield and better quality in the future.
Year: 2019 PMID: 31308375 PMCID: PMC6629666 DOI: 10.1038/s41597-019-0127-1
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Evaluation of the heterozygosity of 18 representative tea plants using RAD-seq to select individuals for genome sequencing. The left panel lists the accession names of the tea plant species/varieties. The F1 individual was a hybrid of “Yunkang #10 × Fudingdabaicha”. The middle panel shows the variation in heterozygosity among the tea plants. Orange bars represent wild tea plants, while blue and green bars represent semi-wild and cultivated tea plants, respectively. The right panel gives the species names. The heterozygosity data for each tea plant were collected from our previous work[6,10].
Summary of genome sequencing data of tea plant using Illumina and PacBio SMRT sequencing platforms.
| Library Type | Insert Size (bp) | Sequencing Platform | Read Length (bp) | Number of Libraries/Cells | Raw Data (Gb) | Raw Sequence Coverage (×) | Clean Data (Gb) | Clean Sequence Coverage (×) |
|---|---|---|---|---|---|---|---|---|
| Paired-End | 170 | Hiseq 2500 | 150 | 2 | 209.12 | 68.79 | 192.18 | 63.22 |
| Paired-End | 250 | Hiseq 2500 | 150 | 2 | 456.74 | 150.24 | 361.31 | 118.85 |
| Paired-End | 500 | Hiseq 2500 | 90 | 3 | 356.08 | 117.13 | 305.03 | 100.34 |
| Paired-End | 800 | Hiseq 2500 | 90 | 3 | 239.81 | 78.88 | 189.52 | 62.34 |
| Mate-Pair | 2000 | Hiseq 2500 | 90 | 2 | 119.71 | 39.38 | 62.22 | 20.47 |
| Mate-Pair | 5000 | Hiseq 2500 | 50 | 1 | 68.29 | 22.46 | 18.73 | 6.16 |
| Mate-Pair | 10000 | Hiseq 2500 | 90 | 3 | 224.10 | 73.72 | 87.26 | 28.70 |
| Mate-Pair | 20000 | Hiseq 2500 | 90 | 2 | 177.70 | 58.45 | 66.01 | 21.71 |
| Mate-Pair | 40000 | Hiseq 2500 | 90 | 2 | 272.21 | 89.54 | 42.57 | 14.00 |
| Total (Illumina) | | | | 20 | 2123.76 | 698.59 | 1324.83 | 435.79 |
| RSII-10 kb | 10000 | RS II sequencer | 6440 | 44 | 33.20 | 10.92 | 22.87 | 7.52 |
| RSII-20 kb | 20000 | RS II sequencer | 12632 | 97 | 92.20 | 30.33 | 63.53 | 20.90 |
| Total (PacBio) | | | | 141 | 125.40 | 41.25 | 86.40 | 28.42 |
The sequencing data summarized here are taken from our previously reported tea plant genome[6]. The estimated genome size of 3.08 Gb was used to calculate the sequence coverage of each library[6].
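As a minimal sketch of the coverage arithmetic described above, sequencing depth is the total amount of sequence divided by the estimated genome size. The function names are illustrative and not taken from the original work. The tabulated coverages are reproduced more closely by a divisor of about 3.04 Gb than by 3.08 Gb, so the genome size is left as a parameter and a helper back-calculates the divisor implied by any row.

```python
def sequence_coverage(total_gb: float, genome_size_gb: float) -> float:
    """Return sequencing depth (x) as total data (Gb) divided by genome size (Gb)."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_gb / genome_size_gb


def implied_genome_size(total_gb: float, coverage: float) -> float:
    """Back-calculate the genome size (Gb) implied by a reported data/coverage pair."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return total_gb / coverage


# Values from the Illumina total row of the table (raw data).
raw_total_gb = 2123.76
reported_coverage = 698.59

print(round(sequence_coverage(raw_total_gb, 3.08), 2))               # depth using 3.08 Gb
print(round(implied_genome_size(raw_total_gb, reported_coverage), 2))  # divisor implied by the table
```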
Fig. 2 The 17-mer distribution used to estimate the genome size of the tea plant. The 17-mer distribution was calculated with jellyfish from the sequencing data of the short-insert-size libraries (insert size = 500 bp). The heterozygous and homozygous peaks of read depth are marked, indicating the high complexity of the tea plant genome.
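A common way to turn a k-mer depth histogram into a genome size estimate is to divide the total number of k-mer occurrences by the depth of the homozygous peak. The sketch below illustrates that general idea only; the exact procedure used for the tea plant genome is not stated here. The histogram is a hypothetical toy example, and the `min_depth` cutoff for discarding low-depth (likely erroneous) k-mers is an assumption.

```python
from typing import Dict


def find_main_peak(hist: Dict[int, int], min_depth: int) -> int:
    """Return the depth with the most distinct k-mers at or above min_depth."""
    candidates = {d: c for d, c in hist.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no k-mers at or above min_depth")
    return max(candidates, key=candidates.get)


def estimate_genome_size(hist: Dict[int, int], peak_depth: int, min_depth: int) -> float:
    """Estimate genome size as total k-mer occurrences / homozygous peak depth.

    k-mers below min_depth are treated as sequencing errors and ignored.
    """
    total_kmers = sum(d * c for d, c in hist.items() if d >= min_depth)
    return total_kmers / peak_depth


# Hypothetical toy histogram: depth -> number of distinct k-mers.
# Depth 10 mimics a smaller heterozygous peak at half the homozygous depth (20).
toy_hist = {1: 500, 2: 100, 10: 30, 19: 40, 20: 60, 21: 40}

peak = find_main_peak(toy_hist, min_depth=5)
print(peak, estimate_genome_size(toy_hist, peak, min_depth=5))
```

In a highly heterozygous genome the heterozygous peak can be taller than the homozygous one, so in practice the peak depth may need to be chosen by inspection rather than by taking the maximum.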
Statistics of the tea plant genome assembly and improved annotation.
| Statistic | Value |
|---|---|
| Assembly | |
| Estimated genome size (Gb) | 3.08 |
| Number of scaffolds | 14,051 |
| Total length of scaffolds (bp) | 3,141,536,798 |
| N50 of scaffolds (bp) | 1,397,810 |
| N90 of scaffolds (bp) | 358,724 |
| Longest scaffold (bp) | 7,310,916 |
| Number of contigs | 94,321 |
| Total length of contigs (bp) | 2,893,782,109 |
| N50 of contigs (bp) | 67,068 |
| N90 of contigs (bp) | 14,057 |
| Longest contig (bp) | 538,748 |
| Gap sequence (bp) | 247,754,689 |
| Predicted coverage of the assembled sequences (%) | 95.07 |
| GC content of the genome (%) | 37.84 |
| Annotation | |
| Number of predicted protein-coding genes | 53,512 |
| Average gene length (bp) | 3,747 |
| Mean exon length (bp) | 284 |
| Average number of exons per gene | 4.5 |
| Mean intron length (bp) | 712 |
| Annotated to Swiss-Prot | 34,694 (64.83%) |
| Annotated to PFAM | 39,889 (74.54%) |
| Annotated to TAIR (version 10) | 38,952 (72.79%) |
| Annotated to GO | 21,961 (41.04%) |
| Annotated to KOG | 14,587 (27.26%) |
| tRNAs | 597 |
| rRNAs | 2,838 |
| snRNAs | 416 |
| miRNAs | 355 |
| Masked repeat sequence length (bp) | 1,861,774,995 |
| Percentage of repeat sequences (%) | 64.42 |
The genome assembly statistics are based on sequences longer than 1 kb. The protein-coding genes were re-predicted using improved ab initio training models and manual filtering. Putative functions of the re-annotated tea plant genes were predicted by aligning them against the Swiss-Prot, InterPro, KEGG and GO databases. The statistics for the genome assembly, noncoding RNAs and repeat content were summarized from our previous work[6].
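The N50 and N90 values reported above follow the standard definition: the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches 50% or 90% of the assembly. The sketch below uses hypothetical toy lengths, and the function name `nx` and its parameters are illustrative. The `min_length` filter mirrors the 1 kb cutoff mentioned above.

```python
from typing import Iterable


def nx(lengths: Iterable[int], percent: int = 50, min_length: int = 0) -> int:
    """Return the Nx statistic (N50 by default) for sequences longer than min_length."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    kept = sorted((n for n in lengths if n > min_length), reverse=True)
    if not kept:
        raise ValueError("no sequences longer than min_length")
    total = sum(kept)
    running = 0
    for n in kept:
        running += n
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return n
    raise AssertionError("unreachable: the running total always reaches the threshold")


# Hypothetical scaffold lengths (bp); the 500 bp sequence is dropped by the 1 kb filter.
toy_lengths = [6000, 5000, 4000, 3000, 2000, 500]

print(nx(toy_lengths, percent=50, min_length=1000))  # N50
print(nx(toy_lengths, percent=90, min_length=1000))  # N90
```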
Fig. 3 Functional annotation of the tea plant protein-coding genes. (a) Venn diagram showing the shared and unique annotations among Swiss-Prot, PFAM, GO and The Arabidopsis Information Resource (TAIR; version 10). (b) Functional classification of tea plant genes using the KOG database. The functional categories of KOG are abbreviated. A: RNA processing and modification; B: chromatin structure and dynamics; C: energy production and conversion; D: cell cycle control, cell division, chromosome partitioning; E: amino acid transport and metabolism; F: nucleotide transport and metabolism; G: carbohydrate transport and metabolism; H: coenzyme transport and metabolism; I: lipid transport and metabolism; J: translation, ribosomal structure and biogenesis; K: transcription; L: replication, recombination and repair; M: cell wall/membrane/envelope biogenesis; N: cell motility; O: posttranslational modification, protein turnover, chaperones; P: inorganic ion transport and metabolism; Q: secondary metabolites biosynthesis, transport and catabolism; R: general function prediction only; S: function unknown; T: signal transduction mechanisms; U: intracellular trafficking, secretion, and vesicular transport; V: defense mechanisms; Y: nuclear structure; and Z: cytoskeleton.
Validation of the assembly quality and improved gene annotation of tea plant genome using three methodologies.
| Validation | Number | Percentage (%) |
|---|---|---|
| Assembly quality: BUSCO | | |
| Total BUSCO groups | 1,440 | 100 |
| Complete single-copy BUSCOs | 1,180 | 81.9 |
| Complete duplicated BUSCOs | 151 | 10.5 |
| Fragmented BUSCOs | 44 | 3.1 |
| Missing BUSCOs | 65 | 4.5 |
| Assembly quality: BAC alignment | | |
| Total BACs (#) | 18 | 100 |
| Total length (bp) | 2,080,846 | 100 |
| Aligned BACs (bp) | 1,182,063 | 98.30 |
| Assembly quality: PCR validation | | |
| Total PCR experiments | 24 | 100 |
| Successful PCR experiments | 22 | 91.67 |
| Gene annotation: BUSCO | | |
| Total BUSCO groups | 1,440 | 100 |
| Complete single-copy BUSCOs | 1,068 | 74.2 |
| Complete duplicated BUSCOs | 173 | 12.0 |
| Fragmented BUSCOs | 118 | 8.2 |
| Missing BUSCOs | 81 | 5.6 |
The completeness of the genome assembly and of the gene re-annotation was evaluated using the latest version of BUSCO (v3.0.2). The results of BAC alignment and PCR validation were summarized from our previously reported tea plant genome[6].
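The BUSCO percentages in the table can be recomputed from the category counts, and the overall completeness is usually read as single-copy plus duplicated. A minimal sketch follows, using the counts from the first BUSCO block of the table. The function name and the dictionary keys are illustrative and are not part of BUSCO itself.

```python
from typing import Dict


def busco_summary(single: int, duplicated: int, fragmented: int, missing: int) -> Dict[str, float]:
    """Return BUSCO category percentages, rounded to one decimal place."""
    total = single + duplicated + fragmented + missing
    if total <= 0:
        raise ValueError("total number of BUSCO groups must be positive")

    def pct(n: int) -> float:
        return round(100 * n / total, 1)

    return {
        "complete": pct(single + duplicated),  # single-copy + duplicated
        "single_copy": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }


# Counts from the first BUSCO block in the table (1,440 groups in total).
summary = busco_summary(single=1180, duplicated=151, fragmented=44, missing=65)
print(summary)
```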
| Design Type(s) | sequence assembly objective • sequence annotation objective |
| Measurement Type(s) | whole genome sequencing assay • BAC • transcription profiling assay |
| Technology Type(s) | DNA sequencing • RNA sequencing |
| Factor Type(s) | developmental stage |
| Sample Characteristic(s) | Camellia sinensis var. sinensis • leaf • apical bud • stem • flower • fruit • root |