Shan-Ce Niu1,2, Qing Xu3, Guo-Qiang Zhang3, Yong-Qiang Zhang3, Wen-Chieh Tsai4,5,6, Jui-Ling Hsu3,5, Chieh-Kai Liang4, Yi-Bo Luo1, Zhong-Jian Liu3,7,8,9.
Abstract
Orchids are renowned for their spectacular flowers and ecological adaptations. After the sequencing of the genome of the tropical epiphytic orchid Phalaenopsis equestris, we combined Illumina HiSeq2000 for RNA-Seq and Trinity for de novo assembly to characterize the transcriptomes for 11 diverse P. equestris tissues representing the root, stem, leaf, flower buds, column, lip, petal, sepal and three developmental stages of seeds. Our aims were to contribute to a better understanding of the molecular mechanisms driving the analysed tissue characteristics and to enrich the available data for P. equestris. Here, we present three datasets. The first dataset is the RNA-Seq raw reads, which can be used to execute new experiments with different analysis approaches. The other two datasets allow different types of searches for candidate homologues. The second dataset includes the sets of assembled unigenes and predicted coding sequences and proteins, enabling a sequence-based search. The third dataset consists of the annotation results of the aligned unigenes versus the Nonredundant (Nr) protein database, Kyoto Encyclopaedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) databases with low e-values, enabling a name-based search.
Year: 2016 PMID: 27673730 PMCID: PMC5037975 DOI: 10.1038/sdata.2016.83
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Schematic overview of the study.
We collected one sample for each tissue type, including root, stem, leaf, flower bud, column, lip, petal, sepal and seeds from three developmental stages of P. equestris. Next, we sequenced cDNAs generated from the tissues on an Illumina HiSeq2000 as 90-bp paired-end (PE) reads, except for the leaf tissue, which was sequenced as 75-bp PE reads. The analysis started with assembling the short reads using the de novo assembly program Trinity and continued with functional analysis using BLASTX. Moreover, we performed quality control assessments at each step from the raw reads to the annotation datasets. Finally, we used YABBY and NBS-encoding gene families as examples of the usage of these datasets.
Figure 2. RNA from eleven tissues analysed by agarose gel electrophoresis.
CL, column; Fb, flower bud; L5, root; L6, stem; LP, lip; M, Marker DL2000; PHA, leaf; PT, petal; SP, sepal; 12, 12-day seed; 7, 7-day seed; 4, 4-day seed.
Genome sequences of the P. equestris deposit.
| File name | Format | Description |
|---|---|---|
| Pha_1213.scafSeq.FG2_superscaffold | fasta | Genome assembly results file |
| Pha_1213.scafSeq.FG2_superscaffold.link | txt | File containing the locational relationship between superscaffold and scaffolds or contigs |
| Pha_1213.scafSeq.FG2.Proteinmask.annot.known.trans.fa | fasta | Repeat annotation file by proteinmasker |
| Pha_1213.scafSeq.FG2.Proteinmask.annot.known.trans.gff | gff | gff file of repeat annotation by proteinmasker |
| Pha_1213.scafSeq.FG2.RepeatMasker.out.known.trans.fa | fasta | Repeat annotation file by repeatmasker |
| Pha_1213.scafSeq.FG2.RepeatMasker.out.known.trans.gff | gff | gff file of repeat annotation by repeatmasker |
| Pha_1213.scafSeq.FG2.denovo.trans.gff | gff | |
| Pha_1213.scafSeq.FG2.trf.out.known.tran.fa | fasta | Repeat annotation file by TRF |
| Pha_1213.scafSeq.FG2.trf.out.known.tran.gff | gff | gff file of repeat annotation by TRF |
| repeat_statistics.xlsx | xlsx | Statistics of repeat annotation |
| P.equestis.gene.cds | fasta | Predicted coding sequence |
| P.equestis.gene.gff | gff | Annotated coding sequence, gff format file |
| P.equestis.gene.pep | fasta | Predicted protein sequence |
| Interpro.tar | tar | InterPro database annotation |
| KEGG.tar | tar | KEGG database annotation |
| Swissprot.tar | tar | Swissprot database annotation |
| Trembl.tar | tar | TrEMBL database annotation |
Summary of the construction of the 37 libraries deposited in the NCBI database.
| Run accession | | | Experiment accession | |
|---|---|---|---|---|
| SRR827602 | 3,332 | 2,288 | SRX265492 | 344 |
| SRR827603 | 3,255 | 2,233 | SRX265493 | 335 |
| SRR827604 | 2,635 | 1,814 | SRX265494 | 800 |
| SRR827605 | 2,612 | 1,831 | SRX265495 | 800 |
| SRR827606 | 2,875 | 1,998 | SRX265496 | 800 |
| SRR827607 | 2,903 | 2,007 | SRX265496 | 800 |
| SRR827608 | 2,895 | 1,990 | SRX265496 | 800 |
| SRR827609 | 2,867 | 1,966 | SRX265496 | 800 |
| SRR827610 | 2,586 | 1,812 | SRX265497 | 335 |
| SRR827611 | 2,531 | 1,765 | SRX265497 | 335 |
| SRR827612 | 2,464 | 1,711 | SRX265497 | 335 |
| SRR827613 | 2,941 | 2,025 | SRX265498 | 344 |
| SRR827614 | 2,902 | 2,018 | SRX265498 | 344 |
| SRR827615 | 2,935 | 2,040 | SRX265498 | 344 |
| SRR827616 | 2,648 | 1,861 | SRX265499 | 800 |
| SRR827617 | 2,606 | 1,828 | SRX265499 | 800 |
| SRR827618 | 2,631 | 1,845 | SRX265499 | 800 |
| SRR827619 | 2,600 | 1,825 | SRX265499 | 800 |
| SRR827620 | 5,872 | 2,648 | SRX265500 | 163 |
| SRR827621 | 2,681 | 1,013 | SRX265501 | 5000 |
| SRR827622 | 2,372 | 854 | SRX265502 | 5000 |
| SRR827623 | 2,430 | 881 | SRX265503 | 2000 |
| SRR827624 | 2,535 | 947 | SRX265504 | 2000 |
| SRR827625 | 2,432 | 956 | SRX265505 | 2000 |
| SRR827626 | 2,632 | 1,002 | SRX265506 | 2000 |
| SRR827627 | 2,375 | 847 | SRX265507 | 5000 |
| SRR827628 | 12,673 | 7,935 | SRX265508 | 163 |
| SRR827629 | 15,710 | 8,791 | SRX265509 | 163 |
| SRR827630 | 14,766 | 9,139 | SRX265510 | 163 |
| SRR827631 | 3,089 | 1,669 | SRX265511 | 20000 |
| SRR827632 | 5,125 | 2,829 | SRX265512 | 10000 |
| SRR827633 | 6,567 | 3,440 | SRX265513 | 20000 |
| SRR827634 | 6,260 | 3,239 | SRX265514 | 10000 |
| SRR827635 | 7,762 | 3,960 | SRX265515 | 2000 |
| SRR827636 | 8,168 | 4,209 | SRX265516 | 5000 |
| SRR827637 | 5,656 | 3,580 | SRX265517 | 40000 |
| SRR827638 | 5,708 | 3,615 | SRX265518 | 40000 |
Global genome assembly statistics deposited in the NCBI database.
| Total sequence length | 1,064,051,384 |
| Total assembly gap length | 80,500,320 |
| Number of scaffolds | 89,583 |
| Scaffold N50 | 378,442 |
| Scaffold L50 | 493 |
| Number of contigs | 188,397 |
| Contig N50 | 21,144 |
| Contig L50 | 12,818 |
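
The N50 and L50 values in the table above follow the usual definitions: sort the sequences from longest to shortest and accumulate their lengths until half of the total assembly length is reached; N50 is the length of the sequence at that point and L50 is the number of sequences needed to get there. The following is a minimal sketch of that calculation, not the tool used for the deposit; the function name `n50_l50` and the toy lengths are illustrative assumptions.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths.

    N50: length of the sequence at which the running total, taken from
    the longest sequence downwards, first reaches half the assembly size.
    L50: how many sequences were needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must not be empty")


# Hypothetical toy assembly, not data from the deposit.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)  # 70 2
```

Applied to the scaffold or contig lengths of a real assembly, the same function would be expected to give values comparable to those listed above, although different tools may treat ties and gaps slightly differently.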
Raw data deposit.
| Sample | SRA run accession | BioSample accession | Title |
|---|---|---|---|
| This dataset contains 11 total samples. Sample 1 is from the root of | |||
| 1 | SRR2080194 | SAMN03799292 | Phalaenopsis_equestris_root_RNA_Seq_fastq_files |
| 2 | SRR2080204 | SAMN03799301 | Phalaenopsis_equestris_flower_RNA_Seq_fastq_files |
| 3 | SRR2080202 | SAMN03799299 | Phalaenopsis_equestris_leaf_RNA_Seq_fastq_files |
| 4 | SRR2080200 | SAMN03799297 | Phalaenopsis_equestris_stem_RNA_Seq_fastq_files |
| 5 | SRR3606718 | SAMN05185248 | Phalaenopsis equestris seed 12 days RNA_seq fastq files |
| 6 | SRR3606742 | SAMN05185247 | Phalaenopsis equestris seed 7 days RNA_seq fastq files |
| 7 | SRR3606734 | SAMN05185246 | Phalaenopsis equestris seed 4 days RNA_seq fastq files |
| 8 | SRR3602300 | SAMN05185245 | Phalaenopsis equestris sepal RNA_seq fastq files |
| 9 | SRR3602299 | SAMN05185244 | Phalaenopsis equestris petal RNA_seq fastq files |
| 10 | SRR3602277 | SAMN05185243 | Phalaenopsis equestris lip |
| 11 | SRR3600816 | SAMN05185242 | Phalaenopsis equestris column |
Unigene deposit.
| File name | Format | Description |
|---|---|---|
| The dataset contains the unigenes from the longest contigs per transcripts generated using Trinity. The fb.Unigene.fa file contains unigenes from the flower bud of | ||
| fb.Unigene.fa | fasta | unigene |
| L5.Unigene.fa | fasta | unigene |
| L6.Unigene.fa | fasta | unigene |
| PHA.Unigene.fa | fasta | unigene |
| 12_day.unigene.fasta | fasta | unigene |
| 7_day.unigene.fasta | fasta | unigene |
| 4_day.unigene.fasta | fasta | unigene |
| sepal.unigene.fasta | fasta | unigene |
| petal.unigene.fasta | fasta | unigene |
| lip.unigene.fasta | fasta | unigene |
| colum.unigene.fasta | fasta | unigene |
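
As a starting point for working with the unigene FASTA files, the sketch below reads FASTA-formatted text and reports the number of records and their average length, the same kind of summary given in the assembly statistics. It is only an illustration: the function name `read_fasta_lengths` and the two demo records are assumptions, and a real analysis would open one of the deposited files instead of the in-memory example.

```python
import io


def read_fasta_lengths(handle):
    """Map each FASTA record ID (first word of the header) to its length."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record ID.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Hypothetical records standing in for a unigene file.
demo = io.StringIO(">unigene_1 demo\nATGC\nATG\n>unigene_2\nGGCC\n")
lengths = read_fasta_lengths(demo)
mean_length = sum(lengths.values()) / len(lengths)
print(len(lengths), mean_length)  # 2 5.5
```

To use it on a deposited file, replace the `io.StringIO` object with an open file handle, for example `open("fb.Unigene.fa")`.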
Annotation deposit.
| Annotation set | File name | Format | Description |
|---|---|---|---|
| The dataset contains functional annotations and gene coding sequence annotations for 11 tissues. There are five annotation files per tissue: three functional annotation files and two structural annotation files. The three functional annotation files are the COG, KEGG and Nr database annotation files. The .cds and .pep files are in fasta format; the titles in the files contain the unigene name predicted coding sequence, the locus and the coding direction. The annotation file was deposited in the Dryad Digital Repository (Data Citation 1). | | | |
| fb. annotation | fb.blastx.cog.xls | xls | COG database annotation |
| | fb.blastx.kegg.xls | xls | KEGG database annotation |
| | fb.blastx.nr.xlsx | xlsx | Nr database annotation |
| | fb.cds | fasta | predicted coding sequence |
| | fb.pep | fasta | predicted protein sequence |
| L5. annotation | L5.blastx.cog.xls | xls | COG database annotation |
| | L5.blastx.kegg.xls | xls | KEGG database annotation |
| | L5.blastx.nr.xlsx | xlsx | Nr database annotation |
| | L5.cds | fasta | predicted coding sequence |
| | L5.pep | fasta | predicted protein sequence |
| L6. annotation | L6.blastx.cog.xls | xls | COG database annotation |
| | L6.blastx.kegg.xls | xls | KEGG database annotation |
| | L6.blastx.nr.xlsx | xlsx | Nr database annotation |
| | L6.cds | fasta | predicted coding sequence |
| | L6.pep | fasta | predicted protein sequence |
| PHA. annotation | PHA.blastx.cog.xls | xls | COG database annotation |
| | PHA.blastx.kegg.xls | xls | KEGG database annotation |
| | PHA.blastx.nr.xlsx | xlsx | Nr database annotation |
| | PHA.cds | fasta | predicted coding sequence |
| | PHA.pep | fasta | predicted protein sequence |
| 4_day_seed_annotation | 4_day_seed.blastx.cog.xls | xls | COG database annotation |
| | 4_day_seed.blastx.kegg.xls | xls | KEGG database annotation |
| | 4_day_seed.blastx.nr.xls | xls | Nr database annotation |
| | 4_day_seed.cds | fasta | predicted coding sequence |
| | 4_day_seed.pep | fasta | predicted protein sequence |
| 7_day_seed_annotation | 7_day_seed.blastx.cog.xls | xls | COG database annotation |
| | 7_day_seed.blastx.kegg.xls | xls | KEGG database annotation |
| | 7_day_seed.blastx.nr.xls | xls | Nr database annotation |
| | 7_day_seed.cds | fasta | predicted coding sequence |
| | 7_day_seed.pep | fasta | predicted protein sequence |
| 12_day_seed_annotation | 12_day_seed.blastx.cog.xls | xls | COG database annotation |
| | 12_day_seed.blastx.kegg.xls | xls | KEGG database annotation |
| | 12_day_seed.blastx.nr.xls | xls | Nr database annotation |
| | 12_day_seed.cds | fasta | predicted coding sequence |
| | 12_day_seed.pep | fasta | predicted protein sequence |
| column_annotation | column_day_seed.blastx.cog.xls | xls | COG database annotation |
| | column_day_seed.blastx.kegg.xls | xls | KEGG database annotation |
| | column_day_seed.blastx.nr.xls | xls | Nr database annotation |
| | column_day_seed.cds | fasta | predicted coding sequence |
| | column_day_seed.pep | fasta | predicted protein sequence |
| lip_annotation | lip_day_seed.blastx.cog.xls | xls | COG database annotation |
| | lip_day_seed.blastx.kegg.xls | xls | KEGG database annotation |
| | lip_day_seed.blastx.nr.xls | xls | Nr database annotation |
| | lip_day_seed.cds | fasta | predicted coding sequence |
| | lip_day_seed.pep | fasta | predicted protein sequence |
| sepal_annotation | sepal_day_seed.blastx.cog.xls | xls | COG database annotation |
| | sepal_day_seed.blastx.kegg.xls | xls | KEGG database annotation |
| | sepal_day_seed.blastx.nr.xls | xls | Nr database annotation |
| | sepal_day_seed.cds | fasta | predicted coding sequence |
| | sepal_day_seed.pep | fasta | predicted protein sequence |
| petal_annotation | petal_day_seed.blastx.cog.xls | xls | COG database annotation |
| | petal_day_seed.blastx.kegg.xls | xls | KEGG database annotation |
| | petal_day_seed.blastx.nr.xls | xls | Nr database annotation |
| | petal_day_seed.cds | fasta | predicted coding sequence |
| | petal_day_seed.pep | fasta | predicted protein sequence |
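
The annotation files are intended for name-based searches, that is, looking up unigenes by the name of their best database hit. A minimal sketch of such a lookup is given below. It assumes the spreadsheet has first been exported to tab-delimited text, and the column names `unigene` and `description` are hypothetical; the real column headers in the deposited files may differ and should be checked before use.

```python
import csv
import io


def search_annotations(handle, keyword, column="description"):
    """Return rows whose `column` contains `keyword` (case-insensitive)."""
    reader = csv.DictReader(handle, delimiter="\t")
    needle = keyword.lower()
    return [row for row in reader if needle in (row.get(column) or "").lower()]


# Hypothetical rows; not taken from the deposited annotation files.
demo = io.StringIO(
    "unigene\tdescription\n"
    "Unigene1\theat shock protein 70\n"
    "Unigene2\tYABBY transcription factor\n"
    "Unigene3\thypothetical protein\n"
)
hits = search_annotations(demo, "yabby")
print([row["unigene"] for row in hits])  # ['Unigene2']
```

The unigene identifiers returned this way can then be used to pull the corresponding records from the matching .cds or .pep file.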
HSP gene family deposit.
| File name | Description |
|---|---|
| The HSP gene files were deposited in the Dryad Digital Repository (Data Citation 1). PEQU means | |
| hsp70_fb_PEQU.fas | alignment of the hsp70 genes from fb transcriptome and PEQU genome |
| hsp70_L5_PEQU.fas | alignment of the hsp70 genes from L5 transcriptome and PEQU genome |
| hsp70_L6_PEQU.fas | alignment of the hsp70 genes from L6 transcriptome and PEQU genome |
| hsp70_PHA_PEQU.fas | alignment of the hsp70 genes from PHA transcriptome and PEQU genome |
| hsp70_12_day_seed_pequ.fas | alignment of the hsp70 genes from 12 day seeds transcriptome and PEQU genome |
| hsp70_4_day_seed_pequ.fas | alignment of the hsp70 genes from 4 day seeds transcriptome and PEQU genome |
| hsp70_7_day_seed_pequ.fas | alignment of the hsp70 genes from 7 day seeds transcriptome and PEQU genome |
| hsp70_column_pequ.fas | alignment of the hsp70 genes from column transcriptome and PEQU genome |
| hsp70_lip_pequ.fas | alignment of the hsp70 genes from lip transcriptome and PEQU genome |
| hsp70_petal_pequ.fas | alignment of the hsp70 genes from petal transcriptome and PEQU genome |
| hsp70_sepal_pequ.fas | alignment of the hsp70 genes from sepal transcriptome and PEQU genome |
| hsp90_fb_PEQU.fas | alignment of the hsp90 genes from fb transcriptome and PEQU genome |
| hsp90_L5_PEQU.fas | alignment of the hsp90 genes from L5 transcriptome and PEQU genome |
| hsp90_L6_PEQU.fas | alignment of the hsp90 genes from L6 transcriptome and PEQU genome |
| hsp90_PHA_PEQU.fas | alignment of the hsp90 genes from PHA transcriptome and PEQU genome |
| hsp90_12_day_pequ.fas | alignment of the hsp90 genes from 12 day seeds transcriptome and PEQU genome |
| hsp90_4_day_pequ.fas | alignment of the hsp90 genes from 4 day seeds transcriptome and PEQU genome |
| hsp90_7_day_pequ.fas | alignment of the hsp90 genes from 7 day seeds transcriptome and PEQU genome |
| hsp90_sepal_pequ.fas | alignment of the hsp90 genes from sepal transcriptome and PEQU genome |
| hsp90_column_pequ.fas | alignment of the hsp90 genes from column transcriptome and PEQU genome |
| hsp90_lip_pequ.fas | alignment of the hsp90 genes from lip transcriptome and PEQU genome |
| hsp90_petal_pequ.fas | alignment of the hsp90 genes from petal transcriptome and PEQU genome |
Quality control and data statistics of the raw reads.
| | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Read number | 49,848,468 | 66,141,114 | 15,999,780 | 70,571,268 | 53,861,172 | 53,200,618 | 52,791,758 | 53,212,746 | 51,175,078 | 54,004,470 | 51,191,360 |
| Read length | 90 | 90 | 75 | 90 | 90 | 90 | 90 | 90 | 90 | 90 | 90 |
| Q20 (%) | 95.8 | 94.1 | 88.9 | 94.5 | 99.9 | 99.9 | 99.9 | 99.8 | 99.9 | 99.8 | 99.7 |
| GC percentage (%) | 45 | 46 | 49 | 48 | 48 | 48 | 48 | 48 | 46 | 47 | 49 |
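
The Q20 and GC percentages reported above are per-base summaries: Q20 is the share of bases with a Phred quality score of at least 20, and GC is the share of G and C bases. The sketch below shows how both could be computed from (sequence, quality string) pairs taken from a FASTQ file. It is an illustration only; the function name `q20_and_gc` is an assumption, and so is the Phred+33 quality encoding, because older Illumina pipelines used Phred+64 and the encoding of the deposited reads should be checked first.

```python
def q20_and_gc(records, phred_offset=33):
    """Return (Q20 %, GC %) over all bases in (sequence, quality) pairs.

    phred_offset=33 is an assumption; use 64 for old Illumina encodings.
    """
    total = q20 = gc = 0
    for seq, qual in records:
        total += len(seq)
        gc += sum(base in "GCgc" for base in seq)
        q20 += sum(ord(ch) - phred_offset >= 20 for ch in qual)
    if total == 0:
        raise ValueError("no bases supplied")
    return 100 * q20 / total, 100 * gc / total


# Hypothetical reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
demo_reads = [("ACGT", "IIII"), ("GGCC", "II##")]
q20_pct, gc_pct = q20_and_gc(demo_reads)
print(q20_pct, gc_pct)  # 75.0 75.0
```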
Assembly statistics.
| | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Total unigenes | 107,406 | 106,002 | 26,051 | 49,443 | 35,466 | 30,995 | 29,428 | 47,303 | 53,045 | 36,674 | 32,669 |
| Total transcripts | 152,545 | 159,409 | 28,582 | 69,824 | 49,520 | 41,506 | 40,060 | 68,976 | 73,732 | 51,634 | 43,805 |
| N50 | 787 | 1,298 | 742 | 1,575 | 1,321 | 1,222 | 1,370 | 1,165 | 1,063 | 1,311 | 1,245 |
| Average length | 576 | 764 | 584 | 911 | 849 | 824 | 911 | 762 | 703 | 874 | 844 |
Mapping rates of the reads and transcript assembly completeness.
| | Count | Percentage | Count | Percentage | Count | Percentage | Count | Percentage | Count | Percentage | Count | Percentage | Count | Percentage | Count | Percentage | Count | Percentage | Count | Percentage | Count | Percentage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The mapping rate was tested by mapping the reads back to the unigenes with Bowtie. This table shows only the numbers and percentages of proper pairs. Count indicates the number of reads mapping back to the unigenes, and percentage indicates the read percentage. The transcript assembly completeness was assessed using CEGMA: count indicates the number of the 248 ultra-conserved CEGs present in the transcript assemblies, and percentage indicates the percentage of the 248 ultra-conserved CEGs present. | | | | | | | | | | | | | | | | | | | | | | |
| proper_pairs | 10946586 | 86.99 | 50305282 | 88.42 | 30323300 | 85.32 | 41855338 | 85.83 | 42208656 | 92.89 | 44150392 | 93.78 | 45145406 | 94.75 | 46461164 | 93.72 | 45049842 | 94 | 42289318 | 86.36 | 30765662 | 94.1 |
| CEGs | 140 | 56.45 | 241 | 97.18 | 202 | 81.45 | 222 | 89.52 | 225 | 90.73 | 229 | 92.34 | 228 | 91.94 | 234 | 94.35 | 233 | 93.95 | 219 | 88.31 | 231 | 93.15 |
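
The completeness percentage in the CEGs row is simply the number of ultra-conserved CEGs found divided by the 248 in the CEGMA set. The short sketch below recomputes that percentage from the counts in the table; the function name `ceg_completeness` is an illustrative assumption, and rounding to two decimals is chosen to mirror the precision used in the table.

```python
TOTAL_CEGS = 248  # size of the ultra-conserved CEG set, as stated in the table note


def ceg_completeness(found, total=TOTAL_CEGS):
    """Percentage of the ultra-conserved CEGs present, to two decimals."""
    return round(100 * found / total, 2)


# CEG counts as listed in the CEGs row, in table order.
ceg_counts = [140, 241, 202, 222, 225, 229, 228, 234, 233, 219, 231]
for count in ceg_counts:
    print(count, ceg_completeness(count))
```

Recomputing the percentages in this way is a quick consistency check when the table is transcribed or reused.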
Annotation statistics.
| | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unigene number | 107,406 | 106,002 | 26,051 | 49,443 | 35,466 | 30,995 | 29,428 | 47,303 | 53,045 | 36,674 | 32,669 |
| Nr | 32,996 | 30,203 | 20,923 | 22,558 | 18,787 | 22,694 | 19,851 | 25,005 | 24,614 | 23,488 | 23,097 |
| COG | 8,823 | 8,243 | 6,633 | 8,283 | 8,802 | 9,194 | 8,886 | 9,874 | 9,549 | 9,746 | 9,518 |
| KEGG | 14,596 | 13,001 | 11,330 | 12,144 | 11,857 | 12,642 | 11,910 | 13,473 | 13,092 | 13,091 | 12,946 |
Figure 3. E-value distribution of the blast results for the eleven transcriptome unigenes in the Nr database.
The x-axis shows the eleven tissues, different colours outline the range of E-values, and the y-axis provides the percentages.
Statistical results for the predicted CDSs.
| | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Total | 34,497 | 57,793 | 53,316 | 24,299 | 18,291 | 19,099 | 17,909 | 21,364 | 21,013 | 20,756 | 20,068 |
YABBY gene families in the assembled transcriptomes.
| | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 7 | 7 | 6 | 6 | 6 | 2 | 0 | 0 | 0 | 0 |
NBS-encoding gene families in the assembled transcriptomes.
| | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 17 | 14 | 18 | 24 | 22 | 13 | 21 | 12 | 7 | 17 |