Sin-Gi Park1, Seung Il Yoo1, Dong Sung Ryu1, Hyunsung Lee1, Yong Ju Ahn1, Hojin Ryu2, Junsu Ko1, Chang Pyo Hong1.
Abstract
Lentinula edodes is one of the most popular edible mushrooms in the world and contains useful medicinal components such as lentinan. The whole-genome sequence of L. edodes has been determined with the objective of discovering candidate genes associated with agronomic traits, but experimental verification of gene models with correction of gene prediction errors is lacking. To improve the accuracy of gene prediction, we produced 12.6 Gb of long-read transcriptome data of variable lengths using PacBio single-molecule real-time (SMRT) sequencing and generated 36,946 transcript clusters with an average length of 2.2 kb. Evidence-driven gene prediction on the basis of long- and short-read RNA sequencing data was performed; a total of 16,610 protein-coding genes were predicted with error correction. Of the predicted genes, 42.2% were verified to be covered by full-length transcript clusters. The raw reads have been deposited in the NCBI SRA database under accession number PRJNA396788.
Keywords: GFF, general feature format; Gene model; Gene prediction; Lentinula edodes; PacBio Single-molecule real-time (SMRT) transcriptome sequencing; RNA-Seq, whole transcriptome sequencing
Year: 2017 PMID: 29845094 PMCID: PMC5961913 DOI: 10.1016/j.dib.2017.09.052
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Summary of PacBio long-read transcriptome data in L. edodes B17.
| Library size | <2 kb | 2–3 kb | 3–6 kb |
|---|---|---|---|
| No. of subreads | 2,027,562 | 1,404,810 | 1,852,875 |
| Total length of subreads (Gb) | 3.36 | 3.49 | 5.76 |
| No. of reads of inserts | 196,775 | 207,733 | 351,503 |
| No. of full-length reads | 91,513 | 96,258 | 150,541 |
| No. of non-full-length reads | 82,559 | 99,023 | 188,716 |
| No. of filtered short reads | 22,703 | 12,452 | 12,246 |
| Polished consensus isoforms | 12,874 | 11,223 | 12,849 |
| Average length of isoforms (bp) | 1373 | 2236 | 3064 |
Adapters and artefacts were removed.
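The per-library rows above can be reconciled with the totals quoted in the abstract. The following is a minimal sketch in Python, with values copied from the table; it assumes that the 36,946 transcript clusters in the abstract correspond to the polished consensus isoforms pooled across the three libraries, and that the 2.2 kb average is the isoform-count-weighted mean. The variable names are illustrative only.

```python
# Values copied from the table above (columns: <2 kb, 2-3 kb, 3-6 kb).
subread_gb = [3.36, 3.49, 5.76]              # total length of subreads (Gb)
reads_of_inserts = [196_775, 207_733, 351_503]
full_length = [91_513, 96_258, 150_541]
non_full_length = [82_559, 99_023, 188_716]
filtered_short = [22_703, 12_452, 12_246]
isoforms = [12_874, 11_223, 12_849]          # polished consensus isoforms
mean_len_bp = [1373, 2236, 3064]             # average isoform length (bp)

# Total sequencing yield across libraries.
total_gb = sum(subread_gb)

# Pooled isoform count and isoform-weighted mean length.
total_isoforms = sum(isoforms)
weighted_mean_bp = (
    sum(n * length for n, length in zip(isoforms, mean_len_bp)) / total_isoforms
)

# In each library, the three read classes should add up to the reads of inserts.
class_sums = [
    fl + nfl + short
    for fl, nfl, short in zip(full_length, non_full_length, filtered_short)
]

print(f"total subread length: {total_gb:.2f} Gb")
print(f"polished isoforms:    {total_isoforms:,}")
print(f"weighted mean length: {weighted_mean_bp / 1000:.1f} kb")
print(f"read classes add up:  {class_sums == reads_of_inserts}")
```

Under these assumptions the pooled figures agree with the abstract (about 12.6 Gb, 36,946 clusters, about 2.2 kb), and full-length, non-full-length, and filtered short reads account for every read of insert in each library.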
Fig. 1. The length distribution of clustered transcripts.
Summary of gene prediction and annotation updated in L. edodes B17.
| | This study | Previous study [1] |
|---|---|---|
| Protein-coding gene (No.) | 18,663 | 13,426 |
| Unique gene models (No.) | 16,610 | 13,028 |
| Genes with isoforms (No.) | 2053 | 398 |
| Supported by RNA-Seq (No.) | 15,263 | 11,781 |
| Annotated (No.) | 12,662 | 10,700 |
| Average gene length (bp) | 1288 | 1612 |
| Total length of gene models (Mb) | 24.05 | 21.64 |
| Exons | ||
| No. of exons | 91,386 | 77,650 |
| Average no. of exons per gene | 4.89 | 5.78 |
| Average exon length (bp) | 196 | 204 |
| Introns | ||
| No. of introns | 72,723 | 64,224 |
| Average no. of introns per gene | 3.89 | 4.78 |
| Average intron length (bp) | 83 | 90 |
Gene models were annotated with homology-based searches.
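Several rows of the table above are linked by simple arithmetic, which gives a quick way to sanity-check the figures. The sketch below (Python, values copied from the table) assumes that the protein-coding gene count is the sum of unique gene models and genes with isoforms, that each gene model has one fewer intron than exons, and that total length is roughly gene count times average gene length. The dictionary keys are illustrative names, and `comparison` simply denotes the second data column.

```python
# Values copied from the table above; "comparison" is the second data column.
columns = {
    "this_study": {
        "genes": 18_663, "unique": 16_610, "with_isoforms": 2_053,
        "exons": 91_386, "under_introns_heading": 72_723,
        "mean_gene_bp": 1288, "total_mb": 24.05,
    },
    "comparison": {
        "genes": 13_426, "unique": 13_028, "with_isoforms": 398,
        "exons": 77_650, "under_introns_heading": 64_224,
        "mean_gene_bp": 1612, "total_mb": 21.64,
    },
}

results = {}
for name, c in columns.items():
    results[name] = {
        # Unique models plus isoform-bearing genes should give the gene count.
        "gene_count_ok": c["unique"] + c["with_isoforms"] == c["genes"],
        # A gene with k exons has k - 1 introns, so introns = exons - genes.
        "intron_count_ok": c["exons"] - c["genes"] == c["under_introns_heading"],
        # Estimated total length in Mb from count x average length.
        "estimated_mb": c["genes"] * c["mean_gene_bp"] / 1e6,
    }
    print(name, results[name])
```

Both columns satisfy the count relationships exactly, which also indicates that the three rows under the Introns heading report intron counts and lengths rather than exon values.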
Summary of correction of gene models.
| | No. of gene models |
|---|---|
| Exactly overlapped | 7889 |
| Split into ≥ two gene models | 4742 |
| Fused with ≥ two gene models | 343 |
| Structurally re-predicted | 261 |
| Newly found | 1344 |
| Predicted only in the previous study | 2031 |
1 Gene models in the present study were structurally compared with those reported by Shim et al. [1].
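The six categories in the table above appear to partition the 16,610 gene models reported in the abstract. A minimal sketch in Python, with values copied from the table and shortened, illustrative category names, tallies the counts and the share of each category:

```python
# Counts copied from the table above; keys are shortened category names.
categories = {
    "exactly_overlapped": 7889,
    "split": 4742,
    "fused": 343,
    "structurally_re_predicted": 261,
    "newly_found": 1344,
    "last_row_2031": 2031,  # final table row, kept under a neutral name
}

total = sum(categories.values())
shares = {name: 100 * count / total for name, count in categories.items()}

print(f"total gene models: {total:,}")
for name, pct in shares.items():
    print(f"{name:<28}{categories[name]:>7,}{pct:>8.1f}%")
```

The total equals 16,610 only when the final row is included, so that row seems to be counted among the gene models of the present study.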
Fig. 2. The distribution of gene models supported by PacBio long-read and Illumina short-read RNA-Seq data.
| Subject area | Biology |
| More specific subject area | Genomics and Bioinformatics |
| Type of data | Table, Figure, GFF |
| How data was acquired | PacBio single-molecule real-time (SMRT) transcriptome sequencing and evidence-driven gene prediction |
| Data format | Raw, analyzed |
| Experimental factors | RNA isolation, cDNA library construction and PacBio sequencing |
| Experimental features | Long-read transcriptome data with variable lengths were generated, and evidence-driven gene prediction was performed based on the data. |
| Data source location | The monokaryotic B17 strain of |
| Data accessibility | Raw data from this study are available in NCBI's Sequence Read Archive (SRA) database under accession number PRJNA396788 ( |