Literature DB >> 25938435

RNA-Seq Based De Novo Transcriptome Assembly and Gene Discovery of Cistanche deserticola Fleshy Stem.

Yuli Li1, Xiliang Wang1, Tingting Chen2, Fuwen Yao3, Cuiping Li2, Qingli Tang4, Min Sun2, Gaoyuan Sun2, Songnian Hu5, Jun Yu5, Shuhui Song6.   

Abstract

BACKGROUNDS: Cistanche deserticola is a completely non-photosynthetic parasitic plant with great medicinal value and mainly distributed in desert of Northwest China. Its dried fleshy stem is a crucial tonic in traditional Chinese medicine with roles of mainly improving male sexual function and strengthening immunity, but few mechanistic studies have been conducted partly due to the lack of genomic and transcriptomic resources.
RESULTS: In this study, we performed deep transcriptome sequencing in fleshy stem of C. deserticola, and about 80 million reads were generated using Illumina pair-end sequencing on HiSeq2000 platform. Using trinity assembler, we obtained 95,787 transcript sequences with transcript lengths ranging from 200 bp to 15,698 bp, having an average length of 950 bases and the N50 length of 1,519 bases. 63,957 transcripts were identified actively expressed with FPKM ≥ 0.5, in which 30,098 transcripts were annotated with gene descriptions or gene ontology terms by sequence similarity analyses against several public databases (Uniprot, NR and Nt at NCBI, and KEGG). Furthermore, we identified key enzyme genes involved in biosynthesis of lignin and phenylethanoid glycosides (PhGs) which are known to be the primary active ingredients. Four phenylalanine ammonia-lyase (PAL) genes, the first key enzyme in lignin and PhG biosynthesis, were identified based on sequences comparison and phylogenetic analysis. Two biosynthesis pathways of PhGs were also proposed for the first time.
CONCLUSIONS: In all, we completed a global analysis of the C. deserticola fleshy stem transcriptome using RNA-seq technology. A collection of enzyme genes related to biosynthesis of lignin and phenylethanoid glysides were identified from the assembled and annotated transcripts, and the gene family of PAL was also predicted. The sequence data from this study will provide a valuable resource for conducting future phenylethanoid glysides biosynthesis researches and functional genomic studies in this important medicinal plant.

Entities:  

Mesh:

Substances:

Year:  2015        PMID: 25938435      PMCID: PMC4418726          DOI: 10.1371/journal.pone.0125722

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

C. deserticola is a worldwide genus of perennial desert plants from the Orobanchaceae family, and is a completely non-photosynthetic species and usually grows underground holoparasitic plant [1]. It is parasitized on the roots of psammophyte Haloxylon ammodendron (Chenopodiaceae) [1, 2], which mainly inhabits deserts and semi-deserts due to its high tolerance to drought and salinity [1, 3]. C. deserticola shows strong resistance to harsh environmental conditions and is mainly distributed in Northwest China [4-6], especially in Inner Mongolia, Gansu and Xinjiang. It is considered to be an endangered wild species in recent years due to increased consumption by humans [5, 6]. C. deserticola which is often called desert ginseng is commonly known as desert-broomrape and the dried fleshy stem has been extensively used as a traditionally important tonic in China and Japan for many years [4, 7–10]. It was initially recorded in Shen Nong Ben Cao Jing (Dictionary of Chinese Materia Medica, 1977) [11] approximately 1800 years ago and was regarded as one of the main sources of the Chinese medicinal herba Cistanche. The extracts of C. deserticola possess a wide range of medicinal functions, especially for use in improving sexual function, tonifying kidney, protecting liver, aperient activity, enhancing memory, immunomodulatory, antioxidative activity, anti-inflammatory, antiviral activity etc [7–10, 12–15]. The major bioactive components of C. deserticola are Phenylethanoid glycosides (PheGs, PhGs) [2, 9, 10, 14, 15]. To date, more than 20 phenylethanoid glycosides have been isolated from the succulent stem of C.deserticola [9, 14, 16]. Among them, acteoside and echinacoside are two main components with significant pharmacological activities [2, 16], and documented as the quality standards of C. deserticola in the Chinese pharmacopeia (2005 and 2010 editions). Three chemical components of PhGs are organic acid, saccharide and phenylethanoid, however, the details concerning phenylethanoid biosynthetic pathways remain poorly understood in C.deserticola. Despite the commercial and medicinal importance of C.deserticola, the genomic and transcriptomic data of this species are very limited. There is no ESTs available in the NCBI database and the complete genome information for this species remains unavailable except for the chloroplast genome sequence [1]. The limited transcriptomic data hinder the study of PhG biosynthetic mechanisms. RNA-seq technology can generate sequences of the expressed parts of targeted genome [17] and identify genes [18] using the NGS technology platforms (such as Applied Biosystems SOLiD, Illumina HiSeq and Roche 454). It is becoming increasingly popular in transcriptome de novo assembly [19-22], since it is a cost-effective and powerful approach with high resolution and broad dynamic range [23-25], especially that it has an advantage to explore low abundance transcripts [26]. Because of the various advantages, RNA-seq is specifically attractive for non-model organisms with limited genetic resources [27-29]. But there is no any detailed research of C. deserticola transcriptome by RNA-seq. In this study, we globally sequenced the stem transcriptome for C. deserticola using Illumina Hiseq2000 platform, and got 7.9G raw data. By assembly and annotation, we mined the genes involved in biosynthesis of PhG and the genes responsible for entire lignin biosynthesis. Our RNA-seq analysis generated the first C. deserticola consensus trancriptome and provided new insights into comprehensive understanding of the medicinal value of C. deserticola. Additionally, the method described here can be widely applied to profile transcriptomes to facilitate the discovery of genes involved in specific medicinal components biosynthesis pathway in other medicinal plant with very limited genomic resources.

Materials and Methods

Plant material collection

The fresh succulent stem for C. deserticola in excavation stage was collected from a plant base in BayanHot City of Alxa League in Inner Mongolia in northwestern China. The collecting permit was obtained from the owner (HongKui CongRong Group) of the plant base. The voucher specimen was deposited in the Core Genomic Facility at Beijing Institute of Genomics, Chinese Academy of Sciences. After cleaning, the succulent stem tissues were cut into small pieces and immediately frozen in liquid nitrogen, and then stored at -80°C until further processing.

RNA extraction, cDNA library construction and Illumina sequencing

Total RNA was extracted from the succulent stem using TRIzol Reagent (Invitrogen Inc., California, USA) according to the manufacturer’s instruction. The resulting samples were treated with DNase I to remove any genomic DNA. Extracted RNAs were quantified using an Agilent 2100 bioanalyzer (Agilent Technologies) and checked for integrity using denaturing agarose gel electrophoresis with ethidium bromide staining. RNA samples with A260/A280 ratios between 1.9 and 2.1, RNA 28S:18S ratios higher than 1.0, and RNA integrity numbers (RINs) ≥ 8.5 were used in subsequent analyses. The RNA-seq libraries were generated using Illumina Truseq RNA Sample Preparation Kits. Poly(A)+ RNA was isolated from total RNA using Dynal ligo(dT)25 beads according to the manufacturer’s instructions. Following purification, fragmentation buffer was added to break the mRNA into short fragments. First-strand cDNA was synthesized using these short fragments as templates, along with SuperScript III reverse transcriptase and N6 random hexamer primer. Second-strand cDNA was then synthesized using buffer, dNTPs, RNaseH and DNA polymerase I. The resulting double-stranded cDNA was subjected to end-repair using T4 DNA polymerase, DNA polymerase I Klenow fragment, and T4 polynucleotide kinase, and ligated to adapters using T4 DNA ligase. Adaptor-ligated fragments were purified using a QiaQuick PCR extraction kit and eluted with EB buffer. After analysis using agarose gel electrophoresis, suitable fragments were selected as templates for PCR amplification. Sequencing of the resulting cDNA library was carried out with an Illumina HiSeq 2000 system.

Transcripts de novo assembly and gene expression quantification

Raw reads generated from sequencing was cleaned by removing the adaptor sequences (ATCTCGTATGCCGTC) using in-house method. We then carried out a stringent low-quality filtering process. Firstly, bases with phred quality score lower than 20 would be trimmed from the 3’end of the sequence, until running into one base with a higher quality (≥20). If reads length was short than 50bp, it would be discarded. Secondly, reads will be further filtered by the criterion that 70% bases in one read have high quality scores (≥20). Thirdly, only paired-end reads were used for further assembly. De novo transcript assembly was conducted using Trinity release_20130216 [30] which consisted of three successive software modules: Inchworm, Chrysalis, and Butterfly. The assembly parameters were set as below:—seqType fq—JM 300G —min_contig_length 200—CPU 20—inchworm_cpu 20—bflyCPU 20. To quantify transcript abundance, the sequenced pair-end reads were re-aligned to the assembled transcripts using a script in Trinity [31]. Mapped reads were used for quantification by RSEM (RNA-Seq by Expectation Maximization) software [32]. Gene or isoform abundance was represented by the fragment per kilobase of transcript per million fragment mapped (FPKM) value, those transcripts with FPKM value equal or larger than 0.05 were defined as expressed.

Functional annotation of expressed transcripts

For there is no any gene annotation sets of C. deserticola except for chloroplast genome [1]. We annotated the expressed transcripts by comparing to Genbank Nt, Genbank Nr, and TAIR10_pep_20101214_updated datasets separately using BLAST program (E< = 1e-20). Meanwhile, all expressed transcripts were translated into potential proteins according to ORF prediction by TransDecoder [30] and predicated for the conserved domains based on Pfam database [33].

Gene Ontology and KEGG pathway annotation

By sequence similarity alignment to Uniprot database (http://www.uniprot.org/), the Gene Ontology (GO) annotation of all assembled transcripts was obtained by using association file downloaded from (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz). GO terms clustering of expressed genes were conducted by using custom scripts, and we annotated genes at the fourth level for the CC, BP and MF categories separately. KEGG pathway information was assigned for all predicted proteins sequences using online tool KAAS (KEGG Automatic Annotation Sever) [34]. Sequences in fasta format were submitted to KAAS request, and the resulting files of all pathways information related to C. deseticola stem transcriptome were downloaded. 13 plant organisms’ gene data sets in KEGG were used for annotation using the BBH (bi-directional best hit) method.

RT-qPCR analysis

After digestion with DNase I, approximately 5μg of total RNA was converted into first-strand cDNA via the reverse-transcription reaction with oligo(dT)15 primers and GoScript Reverse Transcription System (Promega). The cDNA products were then diluted 10-fold with nuclease free deionized water before use as a template in real-time PCR. Specific cDNAs were amplified by GoTaq 2-Step RT-qPCR system (Promega) in a volume of 20 ul. PCR amplification was performed at the annealing temperature of 60°C with the 7500 Real-Time PCR Detection System (Applied Biosystems) according to the manufacturer’s instructions. Relative transcript abundances were calculated by the comparative cycle threshold method with gene “comp10579_c0” as an internal standard, using the 7500 Manager software. Primer pairs for RT-PCR were designed based on online software (http://primer3.ut.ee/) and are listed in S1 Dataset.

Results

RNA sequencing and de novo transcriptome assembly of C. deserticola fleshy stem

Stem of C. deserticola has been extensively used as a traditionally important tonic in China and Japan for many years. To obtain a global overview of gene expression in the C. deserticola fleshy stem, we collected C. deserticola stem samples of the same plant base in 2013 and 2014, respectively. Total RNAs were extracted and polyA+ RNAs were purified for constructing paired-end RNA-seq libraries. 79,433,734 and 86,019,176 pair-end reads corresponding to nearly 8 billion and 8.6 billion bases of sequence were obtained using Illumina HiSeq 2000 sequencing platform in 2013-year and 2014-year samples (Table 1). After removing adaptor sequences and filtering low quality reads (see details in Methods), 64,831,040 high quality pair-end reads in 2013-year sample were used for de novo transcriptome assembly. Using Trinity sequence assembler [30], 51,719 genes and 95,787 transcript sequences were generated with transcript lengths ranging from 200 bp to 15,698 bp. The average length of assembled transcripts is 950 bases and the N50 length is 1,519 bases. The number of transcript in different lengths revealed that 57.32% of the assemble transcript were about 500-bp or longer (Fig 1A). High quality pair-end reads in 2014-year sample were mapped to the assembled transcriptome. Besides, we found that the transcript number for each assembled genes varied and 69% genes with one expressed isoform while 31% genes expressed two or more transcripts (Fig 1B).
Table 1

Description of sequencing data and de novo assembly.

2013-year sample2014-year sample
Read processing
Raw reads# (2×101 bp)79,433,73486,019,176
De_adaptored reads#73,294,30483,580,765
High quality reads#69,482,48680,071,469
Paired reads#64,831,04075,350,594
Bases for assembly (Gb)6.54-
Assembly
Assembled reads#53,212,524
Assembled transcripts#95,787
Assembled components#51,719
Assembled bases#90,989,781
Transcript N50 (bp)1,519
Mapping
Mapped reads#50,588,72253,988,016
Uniquely mapped reads#34,206,71632,488,394
Expression a
Expressed transcripts#63,95752,857
Expressed transcripts with protein-encoding potentials#30,09823,433

aTranscripts with FPKM value equal or larger than 0.5 are defined as expressed.

Fig 1

Distribution of assembled transcript length and statistic of isoform pattern.

(a) Distribution of assembled transcript length binned at 250bp intervals. (b) The percentage of transcript isoform pattern ranging from a single isoform to 6 or more isoforms in a single assembled gene.

aTranscripts with FPKM value equal or larger than 0.5 are defined as expressed.

Distribution of assembled transcript length and statistic of isoform pattern.

(a) Distribution of assembled transcript length binned at 250bp intervals. (b) The percentage of transcript isoform pattern ranging from a single isoform to 6 or more isoforms in a single assembled gene.

Expression quantification and functional annotation of assembled transcripts

Gene or transcript abundance was quantified using RSEM package, in which the sequenced reads were re-aligned to the assembled genes or transcripts sequences using Bowtie and those mapped reads were used for quantification. FPKM value for each gene or transcript was calculated, and finally we identified 63,957 and 52,857 actively expressed transcripts (FPKM value ≥ 0.5) in C. deserticola fleshy stem samples in 2013 and 2014, respectively. 44,776 transcripts (70.01% in 2013-year sample, 84.71% in 2014-year sample) were commonly expressed in the two replicates, and the correlation (Pearson correlation coefficient: 0.91979) of their expression data was showed in S1 Fig. The sequencing raw data had been uploaded to NCBI SRA database (accession numbers: SRX857402 and SRX858938). We used expressed genes identified in 2013-year sample for further analysis. Functional annotation information for all expressed transcripts was obtained using two methods. Firstly, all expressed transcripts were aligned to known nucleotide (GenBank nt) and peptide sequence databases (GenBank nr and Arabidopsis peptide) separately by BLAST algorithm. Out of 63,957 expressed transcripts, 29,220 (45.7%) were annotated and showed homology to sequences in any of the three subject databases with E-value cutoff 1e-20. Meanwhile, the candidate coding regions for all expressed transcript sequences were predicted using TransDecoder software and the longest ORFs for each transcript were used for Pfam domain search. As a result, 21,358 (33.4%) transcripts were annotated based on Pfam database. Overall, 30,098 (47.1%) transcripts were significantly matched to known genes in the public databases by combining the two methods above. The complete expressed transcripts list with function annotation were showed in supplemental data (S2 Dataset). We surveyed the top 20 most highly expressed transcripts (Table 2) corresponding to 18.99% of all sequencing reads, and found that most of them are genes responding to abiotic stress stimulus. Dehydrin (DHNs), a class of hydrophilic and thermostable stress proteins with a high number of charged amino acids that belong to the Group II Late Embryogenesis Abundant (LEA) family, is the most highly expressed genes. Three different Dehyrin transcripts (comp28713_c0_seq1/2/4) were detected highly expressed in fleshy stem which may be involved in protecting cells from damage caused by drought stress. Other stress-related genes such as heat shock protein, pathogen-related protein and metallothionein were also found expressed highly, which may be related to its severe survival environment. Additionally, some constitutive genes including 26S ribosomal RNA gene (comp22329_c2_seq1), auxin-repressed/dormancy-associated protein (comp20999_c0_seq1), ADP-ribosylation factor (comp20499_c0_seq1) were also highly transcribed.
Table 2

The top 20 highly expressed genes in fleshy stem tissue of C. deserticola.

Transcript idFPKMPredicted Function
comp28713_c0_seq217733.25Dehydrin
comp22282_c0_seq117066.35Avicennia marina class I type 2 metallothionein
comp22329_c2_seq116735.83Valeriana officinalis 26S ribosomal RNA gene
comp29412_c2_seq214550.58Oryza sativa Japonica Group clone KCG115D09 17.4 kDa heat shock protein mRNA
comp25013_c0_seq113532.52pathogen-related protein STH-2 [Salvia miltiorrhiza]
comp29412_c1_seq412074.82* a
comp24506_c0_seq911779.81*
comp29412_c1_seq37156.45*
comp10462_c0_seq16667.79*
comp20999_c0_seq16596.5Citrus cv. Shiranuhi putative auxin-repressed/dormancy-associated protein mRNA
comp10453_c0_seq16045.73MLP-like protein 28-like; Pathogenesis-related protein Bet v I family
comp28713_c0_seq45815.95Dehydrin
comp29412_c1_seq14007.95*
comp20237_c0_seq63723.08Coffea canephora dehydrin (DH3) mRNA, complete cds
comp26495_c2_seq53460.12Carthamus tinctorius ubiquitin (UBQ) mRNA
comp28713_c0_seq13381.59Dehydrin
comp22515_c0_seq13313.78Sesamum indicum polyphenol oxidase (PPO) mRNA
comp14988_c0_seq12811.77Extensin-like region
comp20499_c0_seq12748.74Salvia miltiorrhiza ADP-ribosylation factor mRNA
comp30181_c0_seq12710.42*

aThe signal of “*” shows that the transcript was not annotated by public databases.

aThe signal of “*” shows that the transcript was not annotated by public databases.

Functional classification of all expressed transcripts based on Gene Ontology and KEGG databases

Gene Ontology (GO) annotation was obtained from Uniprot annotation and identities association file. In total, 20,907 transcripts, accounting for 32.69% of the total expressed sequences, were assigned to 1,745 functional terms. Of the total functional GO terms, assignments to the biological process made up the majority (1,116, 63.95%) followed by cellular component (329, 18.85%) and molecular function (300, 17.20%). The assigned functions of expressed transcripts covered a broad range of GO categories, and the top 10 GO terms with most annotated transcripts were listed in Table 3. And we provide all expressed transcripts distribution in three Gene Ontology categories (Molecular Function, Cellular component and biological process) in supplemental file (S3 Dataset). GO terms related to binding functions and transferase activity were predominantly represented in the molecular function category. About the binding functions, cation binding (4,394 transcripts) represented the most abundant, followed by nucleotide/nucleoside binding (3,404 transcripts in average) and protein binding (2,422 transcripts). While in transferase activity group, the most is those with transferring phosphorus-containing groups (2,256 transcripts, 65.77%). Among the cellular component category, transcripts were more located to intracellular (10,581 transcripts in average), while among the biological process category, transcripts were more involved in biopolymer metabolic process (6,683 transcripts in average), followed by regulation of cellular process (4,841 transcripts), gene expression (4,678 transcripts) and transport (3,512 transcripts).
Table 3

The top 10 GO terms with the highest transcript number in each of the three ontologies.

Functional termsGO Accession NumberTranscripts Count
Molecular Function
cation bindingGO:00431694,394
purine nucleotide bindingGO:00170764,105
nucleotide bindingGO:00001664,065
ribonucleotide bindingGO:00325533,915
purine nucleoside bindingGO:00018833,680
transferase activityGO:00167403,430
hydrolase activityGO:00167873,052
protein bindingGO:00055152,422
transferase activity, transferring phosphorus-containing groupsGO:00167722,256
DNA bindingGO:00036772,233
Cellular component
intracellularGO:000562213,541
intracellular partGO:004442413,471
intracellular organelleGO:004322911,766
intracellular membrane-bounded organelleGO:004323111,083
cytoplasmGO:000573710,020
cytoplasmic partGO:00444448,271
membraneGO:00160206,738
intracellular organelle partGO:00444465,915
membrane partGO:00444255,079
intrinsic to membraneGO:00312244,073
Biological process
biopolymer metabolic processGO:00432839,927
cellular macromolecule metabolic processGO:00442609,371
cellular biosynthetic processGO:00442497,043
nucleobase, nucleoside, nucleotide and nucleic acid metabolic processGO:00061396,380
protein metabolic processGO:00195385,703
macromolecule biosynthetic processGO:00090595,031
regulation of cellular processGO:00507944,841
gene expressionGO:00104674,678
transportGO:00068103,512
regulation of metabolic processGO:00192223,325
In order to mine genes involved in biosynthesis of lignin and PhG, 21,358 non-redundant potential protein sequences were searched against gene sequences of 13 plant organisms in KEGG database, and they were assigned to 275 KEGG pathways with at least 5 hits. The top 10 pathways with most aligned sequences were listed in Table 4. Most pathways were involved in primary metabolic processes, such as amino acid or protein metabolism (ko01230, ko04141 and ko04120), carbohydrate metabolism (ko01200 and ko00500) and nucleotide or nucleoside metabolism (ko03018, ko00230, and ko00240). Besides, there are 27 secondary metabolism related pathways (Fig 2), such as terpenoid backbone biosynthesis, phenylpropanoid biosynthesis, carotenoid biosynthesis, isoquinoline alkaloid biosynthesis and tropane, piperidine and pyridine alkaloid biosynthesis. These results provide further indication that active metabolic processes were underway in the C. deserticola stem tissue. All expressed transcripts associated with KEGG pathways were listed in supplemental file (S4 Dataset). Although there are some significantly changed pathways between C. deserticola and other plants, such as rice [35] (S5 Dataset), our main goal in this study aims to reveal the whole transcriptome profile of C. deserticola stem, and to picture the related pathways of PhGs biosynthesis which could be useful for guiding the cultivation.
Table 4

The top 10 KEGG pathways with most protein sequences.

KO NO.Protein countKEGG pathway name
ko01230164Biosynthesis of amino acids
ko01200161Carbon metabolism
ko00230116Purine metabolism
ko04141108Protein processing in endoplasmic reticulum
ko03013100RNA transport
ko0412097Ubiquitin mediated proteolysis
ko0050090Starch and sucrose metabolism
ko0024084Pyrimidine metabolism
ko0301880RNA degradation
ko0516979Epstein-Barr virus infection
Fig 2

The KEGG pathways involved in secondary metabolism in C. deserticola.

KEGG classification based on secondary metabolite categories sorted by counts of transcripts in a single pathway.

The KEGG pathways involved in secondary metabolism in C. deserticola.

KEGG classification based on secondary metabolite categories sorted by counts of transcripts in a single pathway.

Candidate genes encoding enzymes involved in the biosynthesis of lignin

Lignin is the second most abundant natural terrestrial polymer in plant kingdom, composing up to one-third of the material found in plant cell walls [36, 37]. As an important component of cell wall, lignins help water transport, provide mechanical support and structural integrity, and defend against pathogens and herbivores [38-40]. Those roles of lignin are much valuable to support underground erective growth of C. deserticola in desert. In this study, we presented the complete picture of lignin biosynthesis pathways in C. deserticola (Fig 3), in which the lignin monomers are biosynthesized from phenylalanine through a series of enzymatic reactions, including hydroxylation, methylation, reduction, and oxidative polymerization process. Lignin biosynthesis-related enzymes were detected for three mainly synthesized forms in vascular tissue (p-hydroxyl-phenyl (H), guaiacyl (G) and syringyl (S) lignin) and 5-hydroxyl-guaiacyl lignin which was only identified in COMT (caffeic acid 3-O-methyltransferase, EC 2.1.1.68) deficient (such as knock-down) plants [41, 42].
Fig 3

Detailed pathways and genes involved in the synthesis of lignin and PhGs in C. deserticola stem.

The top and bottom panels showed the detailed pathways and genes involved in the biosynthesis of four types lignin and PhGs, respectively. Red diamonds represent the common pathway shared by lignin and PhGs. Green dotted lines were used to represent the synthesis pathway of 5-Hydroxyl-Guaiacyl lignin, which was identified in COMT deficient plants. Abbreviation: PAL(Phenylalanine ammonialyas), 4CL(4-coumarate-CoA ligase), CYP73A(trans-cinnamate 4-monooxygenase), CCR(cinnamoyl-CoA reductase), HCT(shikimate o-hydroxycinnamoyltransferase), F5H(ferulate-5-hydroxylase), COMT(3-O-methyltransferase), CCoAOMT(caffeoyl-CoA O-methyltransferase), CAD(cinnamyl-alcohol dehydrogenase), PER(peroxidase), CYP98A3(coumaroylquinate/coumaroylshikimate 3'-monooxygenase), CSE(caffeoylshikimate esterase), GOT1(aspartate aminotransferase), TAT(tyrosine aminotransferase), hisC(histidinol-phosphate aminotransferase) and AOC3(primary-amine oxidase).

Detailed pathways and genes involved in the synthesis of lignin and PhGs in C. deserticola stem.

The top and bottom panels showed the detailed pathways and genes involved in the biosynthesis of four types lignin and PhGs, respectively. Red diamonds represent the common pathway shared by lignin and PhGs. Green dotted lines were used to represent the synthesis pathway of 5-Hydroxyl-Guaiacyl lignin, which was identified in COMT deficient plants. Abbreviation: PAL(Phenylalanine ammonialyas), 4CL(4-coumarate-CoA ligase), CYP73A(trans-cinnamate 4-monooxygenase), CCR(cinnamoyl-CoA reductase), HCT(shikimate o-hydroxycinnamoyltransferase), F5H(ferulate-5-hydroxylase), COMT(3-O-methyltransferase), CCoAOMT(caffeoyl-CoA O-methyltransferase), CAD(cinnamyl-alcohol dehydrogenase), PER(peroxidase), CYP98A3(coumaroylquinate/coumaroylshikimate 3'-monooxygenase), CSE(caffeoylshikimate esterase), GOT1(aspartate aminotransferase), TAT(tyrosine aminotransferase), hisC(histidinol-phosphate aminotransferase) and AOC3(primary-amine oxidase). Phenylalanine ammonialyas (PAL, EC 4.3.1.24) is the first key enzyme in lignin biosynthesis pathway (Fig 3) which transforms phenylalanine into cinnamic acid by non-oxidative deamination [43, 44]. A total of 6,297 PALs reads were sequenced and 7 PAL transcripts were assembled in C. deserticola (Table 5). By sequence similarities comparison, we found that 4 of them (comp28550_c1_seq1/2/3/5) had more than 95% similarity with the known mRNA sequence of C. deserticola (gi|289595227|gb|ADD12041.1|), while comp28550_c1_seq4 and comp25940_c0_seq1 had 77% and 82% similarities, respectively. ORF prediction revealed 5 transcripts had potentials of encoding proteins and carried with aromatic amino acid lyase domain (PF00221.14). Among them, only the comp28550_c1_seq4 transcript could encode a complete protein sequence of 718 amino acid residues. It has been reported that PAL was encoded by a small multigene family in most plant species, such as 4 in Arabidopsis thaliana [39, 45], 5 in Populus trichocarpa [46, 47], 3 in Scutellaria baicalensis [48], and 7 Cucumis sativus [43, 49] etc. Our phylogenetic analysis suggested that there were 4 PAL encoding genes in C. deserticola and we named them CdPAL1, CdPAL2, CdPAL3 and CdPAL4, respectively (S2 Fig). 4-coumarate-CoA ligase (4CL, EC 6.2.1.12) and trans-cinnamate 4-monooxygenase (CYP73A, EC 1.14.13.11) are two enzymes responsible for transforming cinnamic acid to p-coumaroyl-CoA in two reverse orders. They are also in backbones, and their expression FPKM values are 39.57 and 51.93, respectively.
Table 5

Phenylalanine ammonia lyase (PAL) mRNA in de novo assembly transcriptome of C. deserticola.

Transcript IDLengthMatched length to known sequence a Identity to known sequence a (%)Supported Read#Length of predicted proteinProtein typeRangeLyase aromatic Domain
comp28550_c1_seq144723995.4354* b **N
comp28550_c1_seq22076186898.021970606 aa5prime_partial1–1818Y
comp28550_c1_seq32077186898.021971606 aa5prime_partial1–1818Y
comp28550_c1_seq42506172982.732726718 aacomplete286–2439Y
comp28550_c1_seq544823995.4355***N
comp25940_c0_seq11499102177.52460 aa3prime_partial1–1380Y
comp25940_c1_seq1820--240 aa5prime_partial2–721Y

aAll transcript sequences were compared with the C. deserticola known complete CDS sequence (gi|289595227|gb|ADD12041.1|) of phenylalanine ammonia lyase (PAL). The sequence of ‘comp25940_c1_seq1’ had little similarity with the C. deserticola, but with 78.24% similarity with Camellia chekiangoleosa PAL sequence.

bThe signal of “*” represents no results produced.

aAll transcript sequences were compared with the C. deserticola known complete CDS sequence (gi|289595227|gb|ADD12041.1|) of phenylalanine ammonia lyase (PAL). The sequence of ‘comp25940_c1_seq1’ had little similarity with the C. deserticola, but with 78.24% similarity with Camellia chekiangoleosa PAL sequence. bThe signal of “*” represents no results produced. The four type lignins were biosynthesized by different pathways which were controlled by three key enzymes, cinnamoyl-CoA reductase (CCR, EC 1.2.1.44), shikimate o-hydroxycinnamoyltransferase (HCT, EC 2.3.1.133), and ferulate-5-hydroxylase (F5H, EC 1.14.-.-). CCR was reported as a control point of lignins pathway [50, 51] which catalyzed X-CoA (X including p-coumaroyl, caffeoyl, feruloyl, 5-hydroxyl-feruloyl and sinapoyl) into Y-aldehyde (Y including p-coumar, caffeyl, coniferyl, 5-hydroxyl-coniferyl and sinap), while HCT catalyzed p-coumaroyl-CoA to p-coumaroyl shikimic acid/p-coumaroyl quinic acid. The two enzymes, just like a switch, regulated biosynthesis of P-hydroxyl-phenyl lignins or the other three type lignins. F5H was another branch switch that regulated syringyl lignin and 5-hydroxyl-guaiacyl lignin. Other important enzymes including caffeic acid 3-O-methyltransferase (COMT, EC 2.1.1.68), caffeoyl-CoA O-methyltransferase (CCoAOMT, EC 2.1.1.104) and cinnamyl-alcohol dehydrogenase (CAD, EC 1.1.1.195) were also detected expressed. Detailed expression information was listed in Table 6. These enzyme genes identified in this study will provide a valuable resource for functional genomic studies in this important medicinal plant. 10 genes related to lignins biosynthesis pathway in Table 6 were selected for RT-qPCR verification to confirm our RNA-seq results (Fig 4), and their high correlations (Pearson correlation coefficient: 0.90343) indicated high accuracy and reproducibility of our transcriptome analysis. S1 Dataset lists the primer sequences used in this analysis.
Table 6

Expression profile of related genes in biosynthesis of lignin and PhGs.

DescriptionEC NumberORF IDFPKM
phenylalanine ammonia-lyase4.3.1.24comp25940_c03.93
comp28550_c122.54
beta-glucosidase3.2.1.21comp15401_c019.27
comp27218_c01.65
comp10579_c034.08
comp28293_c07.76
comp56_c00.92
4CL;4-coumarate—CoA ligase6.2.1.12comp26895_c07.2
comp29240_c032.37
CCR;cinnamoyl-CoA reductase1.2.1.44comp26393_c23.89
CYP73A;trans-cinnamate 4-monooxygenase1.14.13.11comp10963_c051.93
cinnamyl-alcohol dehydrogenase1.1.1.195comp20951_c031.67
comp27272_c070.95
comp27835_c018.94
comp29267_c01.77
comp30193_c01159.57
comp30398_c0259.8
peroxidase1.11.1.7comp10543_c042.74
comp10650_c041.53
comp119949_c01.7
comp13993_c00.8
comp21153_c01.94
comp21305_c017.16
comp25087_c045.26
comp28355_c243.19
comp28694_c09.8
comp29269_c33.47
comp29883_c213.01
comp30017_c00.78
comp30161_c06.59
comp30272_c0665.8
comp7217_c00.71
HCT; shikimate O-hydroxycinnamoyltransferase2.3.1.133comp73580_c03.4
CYP98A3, C3'H; coumaroylquinate (coumaroylshikimate) 3'-monooxygenase1.14.13.36comp22365_c110.05
comp23009_c01.46
comp26902_c8150.22
CSE; caffeoylshikimate esterase3.1.1.-comp19323_c096.8
COMT; caffeic acid 3-O-methyltransferase2.1.1.68comp29821_c139.63
caffeoyl-CoA O-methyltransferase2.1.1.104comp28020_c1145.37
CYP84A, F5H;ferulate-5-hydroxylase1.14.-.-comp52365_c03.83
GOT1;aspartate aminotransferase2.6.1.1comp14836_c054.81
comp18427_c03.21
comp28117_c013.99
comp29202_c010.29
TAT; tyrosine aminotransferase2.6.1.5comp29792_c06.33
hisC; histidinol-phosphate aminotransferase2.6.1.9comp18748_c021.59
AOC3, AOC2, tynA; primary-amine oxidase1.4.3.21comp18613_c068.63
comp29232_c0100
Fig 4

RT-qPCR analysis. Real-time PCR results for 10 selected genes related to lignin biosynthesis pathway.

Plot in the top panel of the figure presents RNA-seq (line graph) and qPCR (bar graph) quantification results, and the bottom panel shows their high correlations (r>0.9). Corresponding gene names and unique gene identifier numbers in the assembly are bglX (comp10579_c0), 4CL (comp26895_c0), CAD_1 (comp30193_c0), CAD_2 (comp30398_c0), PER_1 (comp21305_c0), PER_2 (comp25087_c0), PER_3 (comp30272_c0), CYP98A3 (comp26902_c8), CSE (comp19323_c0), and cCOMT (comp28020_c1). (Notes: as comp30193_c0 and comp30398_c0 were both annotated as CAD gene, we added subscripts to differentiate them. The same thing was done to comp21305_c0, comp25087_c0 and comp30272_c0, which were all annotated as PER gene).

RT-qPCR analysis. Real-time PCR results for 10 selected genes related to lignin biosynthesis pathway.

Plot in the top panel of the figure presents RNA-seq (line graph) and qPCR (bar graph) quantification results, and the bottom panel shows their high correlations (r>0.9). Corresponding gene names and unique gene identifier numbers in the assembly are bglX (comp10579_c0), 4CL (comp26895_c0), CAD_1 (comp30193_c0), CAD_2 (comp30398_c0), PER_1 (comp21305_c0), PER_2 (comp25087_c0), PER_3 (comp30272_c0), CYP98A3 (comp26902_c8), CSE (comp19323_c0), and cCOMT (comp28020_c1). (Notes: as comp30193_c0 and comp30398_c0 were both annotated as CAD gene, we added subscripts to differentiate them. The same thing was done to comp21305_c0, comp25087_c0 and comp30272_c0, which were all annotated as PER gene).

Candidate genes encoding enzymes involved in the biosynthesis of PhGs

Phenylethanoid glycosides (PhGs) are known to be the primary active ingredients in C. deserticola with activities of improving sexual potency, scavenging free radical and anti-aging [7-10]. Three chemical components of PhGs are organic acid, saccharide and phenylethanol aglycon (Fig 3). The organic acid including caffeic acid, ferulic acid and coumalic acid are products of phenylpropanoid biosynthesis pathway. The components of saccharide including glucose and rhamnose are products of carbohydrate metabolism pathways, such as starch and sucrose metabolism, amino sugar and nucleotide sugar metabolism, fructose and mannose metabolism etc. However, the biosynthesis pathway of phenylethanol part is not clear yet. Here, we proposed two possible phenylethanol biosynthesis pathways based on our sequence data. One is the reported caffeic acid or ferulic acid pathway, also known as cinnamiv acid pathway which is similar to the lignin biosynthesis backbone pathway. Another is based on phenylalanine metabolism pathway (Fig 3), in which the phenylalanine to phenylethanol was achieved by a known ‘Enrlich pathway’ that was first found in yeast one century ago and validated in petunia flowers [52, 53], tomato [54], and rose [55, 56]. Four enzyme genes encoding aspartate/tyrosine aminotransferase, histidinol-phosphate aminotransferase, and primary-amine oxidase which are responsible for the conversion of phenylalanine to phenylethanol were detected expressed in stem of C. deserticola. The product of phenylethanol may be further oxidized by monooxygenase or methylated by methytransferase into its derivates (phenylethanol aglycon) which took part in PhG biosynthesis. In summary, two putative biosynthesis pathways of phenylethanol aglycon were proposed for C. deserticola but still needs more study in further.

Discussions

Recent years, plant genomics have developed rapidly with the application of next-generation sequencing technology, while few researches have been focused on the genomics of desert medicinal plants [57]. It is urgently necessary to perform genomic or transcriptomic research to understand its adaption to drought and salinity environment and the biosynthesis pathway of the major bioactive components. The de novo transcriptome discovery for some medical plants, such as Panax ginseng [58], Ginkgo biloba [59] and Glycyrrhiza uralensis [60] have been first exploited using Roche 454 platform for its long read length. Because of the effectively assembly ability with short reads, especially the advantaged paired-end reads [23, 61], Illumina-based transcriptome sequencing and assembly has also been extensively used for model [20, 22, 62–64] and non-model organisms [17, 19, 21, 65, 66]. In the present study, we generated about 8G of 101 bp paired-end reads and produced longer unigene sequences with 725 bp average length. Large-scale stem-specific transcriptome data could provide useful reference data and be used to mine secondary metabolism of bioactive components of C. deserticola. There is 81.62% of the total raw reads passed stringent quality filters (including adaptor trimming and low-quality reads discarding) before assembly, suggesting the high-quality of our sequencing data, and in which 82.08% of the high-quality reads were useful for assemble. Other reads failed to use for assemble may come from sequencing error [67], assembly parameters [23] et al. Those unused high-quality reads remained helpful to improve de novo assembly combined with longer reads from other platform (such as Roche 454) in the future. A great number of assembled transcripts (30,098) showed high sequences similarities to know genes in public databases, suggesting that our Illumina-based paired-end data covered a substantial fraction of transcripts of C. deserticola. Transcripts with no BLAST hits may due to 3’ or 5’ untranslated regions, non-coding RNA [23] or new gene sequences of C. deserticola. The expressed transcripts were annotated to a wide range of GO categories and KEGG pathways (Tables 3 and 4), in which many transcripts were assigned to secondary metabolism related pathways. As we known, phenylpropanoid can function as inducible antimicrobial compounds with great salutary for an underground lifestyle [1], and also act as signal molecule in plant-microbe interactions besides for its medicinal utility [68, 69]. Terpenoid is used for biosynthesis of bioactive components (such as 6- deoxycatalpol) [70]. We found genes involved in the phenylpropanoid and terpenoid backbone biosynthesis pathway were highly abundant in C. deserticola. More importantly, the discovery of well represented pathways of lignin biosynthesis (Fig 3) indicated the active metabolic process of lignin in C. deserticola stem. All known enzyme genes involved in biosynthesis of lignin (Fig 3) were detected expressed, and four key enzymes including PAL, CCR, HCT and F5H had lower expression abundance (FPKM 26.47, 3.89, 3.4 and 3.83, respectively) compared with other enzyme genes (Table 6). Whether or not the expression change of those three genes could influence lignin production in C. deserticola is worthy of further study. PAL is a key enzyme in lignin biosynthesis and also involving in the biosynthesis of phenylpropanoid, resveratrol, flavonoid and coumarin [71-74]. We detected four distinct PAL genes in C. deserticola genome (S2 Fig) which was coincident with that PAL was encoded by a small multigene family [39, 43, 45–49] and further proved it may play important roles in metabolic carbon flux. PhG is the primary active ingredients in C. deserticola. Genes involved in the biosynthesis of phenylethanol are important for the quality of C. deserticola. We deduced two different biosynthesis pathways of phenylethanol and 17 enzyme genes involved in PhG biosynthesis in C. deserticola stem. The possible post-caffeic/ferulic acid processes (Fig 3) were also deduced for the first time based on structural formula of intermediates and catalytic properties of corresponding enzymes, in which the caffeic/ferulic acid would be first oxidized into phenyl-pyruvate derivate; then, the carboxyl group was deprived by decarboxylases; finally, aldehyde group were converted back into alcohol group by dehydrogenase. This is the first application of Illumina paired-end sequencing technology to investigate the whole transcriptome of C. deserticola and to assemble RNA-seq reads without a reference genome. This study will provide useful resources and gene sequences for functional genomics and proteomics research on C. deserticola in future.

Conclusions

In this study, we profiled the transcriptome of C. deserticola stem based on high-throughput sequencing data, and identified genes involved in biosynthesis pathways of lignin and also inferred potential biosynthesis pathway of PhGs for the first time, which will certainly accelerate the understanding of the ambiguous physiological processes and the great medicinal value in molecular level. Up to now, this is the first attempt to de novo assemble the whole transcriptome of C. deserticola stem and to detect biosynthesis pathway of medicinal components using Illumina-based sequencing datasets. Our study may promote the development of natural medicines and the selection of cultivars with medicinal traits.

Correlation between RNA-seq expression data of two replicated C. deserticola samples.

Normalized read count of two replicates (2013-year sample and 2014-year sample) were plotted, and high correlation showed reproducibility of our RNA-seq data. (TIF) Click here for additional data file.

Phylogenetic analyses of PAL genes in C. deserticola.

Phylogenetic distances between known PAL gene and assembled potential PAL sequences in C. deserticola. (TIF) Click here for additional data file.

All primer sequences for 10 selected genes related to lignins biosynthesis pathway.

(XLSX) Click here for additional data file.

All expressed transcripts with annotation in assembled C. deserticola stem transcriptome.

(XLSX) Click here for additional data file.

GO annotation of all expressed transcripts in C. deserticola stem transcriptome.

(XLSX) Click here for additional data file.

KEGG pathway annotation of all expressed transcripts with predicted ORF in C. deserticola stem transcriptome.

(XLSX) Click here for additional data file.

Significantly changed KEGG pathways between C. deserticola and rice (tissues of callus and leaf).

(XLSX) Click here for additional data file.
  65 in total

1.  The developmental dynamics of the maize leaf transcriptome.

Authors:  Pinghua Li; Lalit Ponnala; Neeru Gandotra; Lin Wang; Yaqing Si; S Lori Tausta; Tesfamichael H Kebrom; Nicholas Provart; Rohan Patel; Christopher R Myers; Edwin J Reidel; Robert Turgeon; Peng Liu; Qi Sun; Timothy Nelson; Thomas P Brutnell
Journal:  Nat Genet       Date:  2010-10-31       Impact factor: 38.330

2.  De novo characterization of the root transcriptome of a traditional Chinese medicinal plant Polygonum cuspidatum.

Authors:  DaCheng Hao; Pei Ma; Jun Mu; ShiLin Chen; PeiGen Xiao; Yong Peng; Li Huo; LiJia Xu; Chao Sun
Journal:  Sci China Life Sci       Date:  2012-05-27       Impact factor: 6.038

3.  Function annotation of the rice transcriptome at single-nucleotide resolution by RNA-seq.

Authors:  Tingting Lu; Guojun Lu; Danlin Fan; Chuanrang Zhu; Wei Li; Qiang Zhao; Qi Feng; Yan Zhao; Yunli Guo; Wenjun Li; Xuehui Huang; Bin Han
Journal:  Genome Res       Date:  2010-07-13       Impact factor: 9.043

4.  Studies on the sedative effect of Cistanche deserticola.

Authors:  M C Lu
Journal:  J Ethnopharmacol       Date:  1998-01       Impact factor: 4.360

5.  [Morphotype diversity of Cistanche deserticola].

Authors:  Rui Cao; Hong Ma; Ying-chun Wang
Journal:  Zhongguo Zhong Yao Za Zhi       Date:  2004-01

6.  Biogenesis of 2-phenylethanol in rose flowers: incorporation of [2H8]L-phenylalanine into 2-phenylethanol and its beta-D-glucopyranoside during the flower opening of Rosa 'Hoh-Jun' and Rosa damascena Mill.

Authors:  Shuzo Watanabe; Kentaro Hayashi; Kensuke Yagi; Tatsuo Asai; Hazel MacTavish; Joanne Picone; Colin Turnbull; Naoharu Watanabe
Journal:  Biosci Biotechnol Biochem       Date:  2002-05       Impact factor: 2.043

7.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors:  Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal:  Genome Biol       Date:  2009-03-04       Impact factor: 13.583

8.  Single nucleotide polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa transcriptome sequencing.

Authors:  Martin Trick; Yan Long; Jinling Meng; Ian Bancroft
Journal:  Plant Biotechnol J       Date:  2009-01-21       Impact factor: 9.803

9.  The phenylalanine ammonia lyase (PAL) gene family shows a gymnosperm-specific lineage.

Authors:  Ujwal R Bagal; James H Leebens-Mack; W Walter Lorenz; Jeffrey F D Dean
Journal:  BMC Genomics       Date:  2012-06-11       Impact factor: 3.969

10.  Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Authors:  Manfred G Grabherr; Brian J Haas; Moran Yassour; Joshua Z Levin; Dawn A Thompson; Ido Amit; Xian Adiconis; Lin Fan; Raktima Raychowdhury; Qiandong Zeng; Zehua Chen; Evan Mauceli; Nir Hacohen; Andreas Gnirke; Nicholas Rhind; Federica di Palma; Bruce W Birren; Chad Nusbaum; Kerstin Lindblad-Toh; Nir Friedman; Aviv Regev
Journal:  Nat Biotechnol       Date:  2011-05-15       Impact factor: 54.908

View more
  7 in total

Review 1.  Renaissance in phytomedicines: promising implications of NGS technologies.

Authors:  Sonal Sharma; Neeta Shrivastava
Journal:  Planta       Date:  2016-03-22       Impact factor: 4.116

2.  Integrated transcriptomic and proteomic analyses reveal ɑ-lipoic acid-regulated cell proliferation via Grb2-mediated signalling in hepatic cancer cells.

Authors:  Lan Yang; Xiliang Wang; Juan Xu; Ya Wen; Manqiao Zhang; Jingxiao Lu; Rongfu Wang; Xiaojuan Sun
Journal:  J Cell Mol Med       Date:  2018-03-25       Impact factor: 5.310

3.  De novo characterization of the Baphicacanthus cusia(Nees) Bremek transcriptome and analysis of candidate genes involved in indican biosynthesis and metabolism.

Authors:  Wenjin Lin; Wei Huang; Shuju Ning; Xiaohua Wang; Qi Ye; Daozhi Wei
Journal:  PLoS One       Date:  2018-07-05       Impact factor: 3.240

4.  De novo full length transcriptome analysis and gene expression profiling to identify genes involved in phenylethanol glycosides biosynthesis in Cistanche tubulosa.

Authors:  Lei Hou; Guanghui Li; Qingliang Chen; JinJin Zhao; Jiaowen Pan; Ruxia Lin; Xiujin Zhu; Pengfei Wang; Xingjun Wang
Journal:  BMC Genomics       Date:  2022-10-08       Impact factor: 4.547

Review 5.  Herba Cistanche (Rou Cong-Rong): One of the Best Pharmaceutical Gifts of Traditional Chinese Medicine.

Authors:  Zhiming Li; Huinuan Lin; Long Gu; Jingwen Gao; Chi-Meng Tzeng
Journal:  Front Pharmacol       Date:  2016-03-01       Impact factor: 5.810

6.  The De Novo Transcriptome and Its Functional Annotation in the Seed Beetle Callosobruchus maculatus.

Authors:  Ahmed Sayadi; Elina Immonen; Helen Bayram; Göran Arnqvist
Journal:  PLoS One       Date:  2016-07-21       Impact factor: 3.240

7.  Wound Stress, an Unheeded Factor for Echinacoside Accumulation in Cistanche deserticola Y. C. Ma.

Authors:  Gaosheng Hu; Tianran Wu; Yue Chang; Xinyi Zhan; Jingming Jia
Journal:  Molecules       Date:  2018-04-12       Impact factor: 4.411

  7 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.