Literature DB >> 24376689

De Novo transcriptome sequencing reveals important molecular networks and metabolic pathways of the plant, Chlorophytum borivilianum.

Shikha Kalra¹, Bhanwar Lal Puniya², Deepika Kulshreshtha², Sunil Kumar¹, Jagdeep Kaur¹, Srinivasan Ramachandran², Kashmir Singh¹.

Abstract

Chlorophytum borivilianum, an endangered medicinal plant species is highly recognized for its aphrodisiac properties provided by saponins present in the plant. The transcriptome information of this species is limited and only few hundred expressed sequence tags (ESTs) are available in the public databases. To gain molecular insight of this plant, high throughput transcriptome sequencing of leaf RNA was carried out using Illumina's HiSeq 2000 sequencing platform. A total of 22,161,444 single end reads were retrieved after quality filtering. Available (e.g., De-Bruijn/Eulerian graph) and in-house developed bioinformatics tools were used for assembly and annotation of transcriptome. A total of 101,141 assembled transcripts were obtained, with coverage size of 22.42 Mb and average length of 221 bp. Guanine-cytosine (GC) content was found to be 44%. Bioinformatics analysis, using non-redundant proteins, gene ontology (GO), enzyme commission (EC) and kyoto encyclopedia of genes and genomes (KEGG) databases, extracted all the known enzymes involved in saponin and flavonoid biosynthesis. Few genes of the alkaloid biosynthesis, along with anticancer and plant defense genes, were also discovered. Additionally, several cytochrome P450 (CYP450) and glycosyltransferase unique sequences were also found. We identified simple sequence repeat motifs in transcripts with an abundance of di-nucleotide simple sequence repeat (SSR; 43.1%) markers. Large scale expression profiling through Reads per Kilobase per Million mapped reads (RPKM) showed major genes involved in different metabolic pathways of the plant. Genes, expressed sequence tags (ESTs) and unique sequences from this study provide an important resource for the scientific community, interested in the molecular genetics and functional genomics of C. borivilianum.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Substances：

Year: 2013 PMID： 24376689 PMCID： PMC3871651 DOI： 10.1371/journal.pone.0083336

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Chlorophytum borivilianum is an important species of liliaceae family due to its exceptional medicinal properties. Overexploitation and extensive harvesting of the wild strands has threatened its status as ‘endangered’ species by International Union for Conservation of Nature and Natural Resources (IUCN) [1]. Due to its high medicinal properties, the species has been recognized as 26th among the top priority medicinal plants to be protected and promoted by the Medicinal Plant Board, Government of India. Chlorophytum borivilianum brag the exuberant references in many Ayurvedic classics like Charka Samhita (2nd century B.C.), Sushrut Samhita (2nd century A.D.), Raja Nighantu (17th century A.D.) etc. (http://www.safedmusli.net). Its tubers are used for aphrodisiac, adaptogen, antiageing, health restorative and health promoting purposes. The amalgamation of C. borivilianum leaves with other herbs such as Withania sominifera, Emblica officinalis etc. makes the body resistant against sex related diseases and also delays menopause [2]. The above attributes have made C. borivilianum an essential ingredient in Ayurvedic, Unani and Allopathic formulations. Major phytochemical components reported from the roots of C. borivilianum include steroidal saponins, fructans and fructoligosaccharides (FOS), acetylated mannans, phenolic compounds and proteins [3], [4]. Steroidal saponins, are considered to be the principal bioactive components responsible for the pharmacological properties [5] and borivilianosides, furostane type steroidal saponins, have been isolated and characterized from this plant [6], [7]. Steroidal saponins are synthesized via the mevalonic acid (MVA) pathway, pervasively operating in cytoplasm [8], or through the newly discovered non-mevalonate pathway (MEP) located in plastids [9], [10]. Cyclization of precursor compound, 2, 3-oxidosqualene, involving oxidosqualene cyclase (OSC) combined with modifications on steroid skeletons like hydroxylations and glycosylations lead to the formation of various saponins. Several OSC genes like cycloartenol synthase (CAS), lupeol synthase (LS), β- amyrin synthase (β- AS) have been cloned from various plant systems [11], [12]. According to the proposed pathway [13], some specific CYP450s and UDP-glycosyltransferases (UGTs) may catalyze the conversion of cycloartenol to various steroidal saponins. Little is known about the molecular mechanism of the biosynthetic pathway downstream of cyclization process involved in saponin biosynthesis. The functional genomics studies in C. borivilianum is still in its infancy and few expressed sequence tags (ESTs) have been generated by our group [13], with an aim to identify the differentially expressed genes in the leaf and root tissues of C. borivilianum. Also, efforts are being made to clone the genes of interest via the candidate gene approach [14] for this plant. ESTs play significant role in accelerating gene discovery, especially in non model organisms where reference genome sequence is unavailable, as it is a rapid and relatively economical method for analyzing the transcribed region of the genome [15]. Besides this, ESTs are also helpful in large scale expression analysis, improving genome annotation, identifying splice variants, molecular markers identification and physical mapping [16]. For large scale transcriptome analysis, next generation sequencing has evolved to be a very useful technique for providing large expression data in much shorter time period, depth and coverage to expedite understanding of metabolic pathway as well as contribute to comparative transcriptomics, evolutionary genomics and gene discovery [17], [18]. Illumina high throughput sequencing technology yields huge amount of parallel sequence short reads with larger coverage. Assembling these short read sequences is a challenging task in the absence of any reference sequence. However, many de novo assembly tools have been developed that can be used to analyze the short read sequences [19], [20]. We have undertaken the first global analysis of C. borivilianum transcriptome. Strategy has been developed for de novo assembly of transcriptome using short-read sequence data generated by Illumina RNA-Seq method in lieu to identify candidate genes involved in saponin biosynthetic pathway. Furthermore, GC content analysis, identification of EST-SSRs and gene expression analysis has also been done. Transcriptome coverage, at 22.42 megabase pairs, was comprehensive enough to discover all known genes of several major metabolic pathways. The data has been submitted to the Sequence Read Archive (SRA) of NCBI database under the accession ID PRJNA 196968 and will serve as a public information platform for further studies in C. borivilianum.

Materials and Methods

2.1 Plant Material, growth conditions and saponins extraction

Plants of C. borivilianum were grown under controlled conditions in the growth chamber with a day/night temperature of 27±1°C, for 16 h photoperiod (flux density of 200 µmol m−2 s−1) with 30% relative humidity at Panjab University, Chandigarh (India). Plants were regenerated from the previous year's roots in the month of April. Young leaves and roots of a two month old plant were harvested during the period of active growth (June, average outside temperature 38°C), snap frozen in liquid nitrogen and later transferred to a −80°C freezer until further processing. Total saponins were isolated from leaf and root tissues by soxhelation method and analyzed by thin layer chromatography (TLC) [21]. Briefly, powdered tissues were defatted with petroleum ether (60–80°C). Ethanolic extracts were prepared by extracting defatted tissues with 95% ethanol in a soxhlet extractor. Ethanolic extract was suspended in water (100 ml) and then extracted with n-butanol (300 ml). The volume of n-butanol soluble portion was reduced to half under reduced pressure and finally saponins were precipitated by addition of diethyl ether. Chromatographic analysis of ethanolic extract was performed on silica gel-GF254 precoated plates using chloroform: glacial acetic acid: methanol: water (16∶8∶3∶2, v/v) as mobile phase. Anisaldehyde (0.5 ml) mixed with glacial acetic acid (10 ml), methanol (85 ml) and concentrated sulphuric acid (5 ml) was used as spraying reagent. TLC plates, after spraying the reagent, were heated at 100°C and saponin spots were visualized.

2.2 RNA isolation, cDNA library construction and Illumina sequencing

Total RNA was extracted by combining the methods as described by Ghawana et al (2011), [22] followed by RNA purification and on column DNaseI digestion using miRNA Easy kit (Qiagen, Germany). The integrity of RNA samples was assessed by Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Sample with RNA integrity number (RIN) value more than 8.0 was selected for further use. Library construction and sequencing was performed at Microarray core facility, Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah, USA. cDNA library was generated using TruSeq™ RNA Sample preparation kit (Illumina, USA) according to the manufacturer's instructions. Briefly, oligo(dT) beads were used to purify poly(A) mRNA from total RNA. mRNA was fragmented using RNA fragmentation kit (Ambion, USA). First strand cDNA was synthesized from the fragmented mRNA using random hexamer primer and reverse transcriptase (Invitrogen, USA). Single-end cDNA library was prepared in accordance with Illumina's protocol with an insert size of 150 bp. 50-cycled single end library sequencing was performed using Illumina Hiseq2000 platform. Raw reads were filtered to obtain high-quality clean reads by removing adaptor sequences, duplication sequences using FastQC software and bases with Phred score <20 were trimmed. Based on the quality check for each base pair in the reads, last two base pairs from each read were removed in order to minimize the sequencing error, which is usually higher in the 3′ end of reads.

2.3 De novo assembly of the sequences and clustering

All the assemblies were performed on Red Hat based SGI workstation with 48 cores, 2.27 GHz Intel Xeon processor and 50 Gb random access memory. We used SOAPdenovo (version 1.04; http://soap.genomics.org.cn/soapdenovo.html) which applies de Bruijn graph algorithm for de novo assembly of high quality (HQ) sequence reads to generate a non-redundant set of transcripts. Clean reads were first split into different ‘k-mers’ for assembly, in order to produce contigs, using the de Bruijn graph. K-mer size of 23 achieved best balance between the number of contigs produced, coverage and average sequence length obtained. In order to reduce sequence redundancy, clustering process was supplemented with TIGR gene indices clustering tools (TGICL), which is based on pairwise sequence similarity and then assembly by individual clusters in order to retrieve better results [23]. Clusters were then passed to contig assembly program (CAP3) assembler for multiple alignments and consensus building. TGICL and CAP3 assembly were run under default parameters. Resulting singletons and consensus contigs were merged for final assembled transcripts.

2.4 Functional annotation and Classification

All the assembled unigenes (consensus and singletons) longer than 100 bp were annotated by assigning putative gene descriptions and Gene Ontology (GO) terms on the basis of sequence similarity with previously identified genes annotated with similar details. Unigenes were subjected to BLASTX search against non-redundant protein database and significant hits with e-value ≤1.0e−05 were extracted. Transcripts that did not show any significant hit were searched using BLASTN tool against the non-redundant database. Functional categorization by GO terms (http://www.geneontology.org) [24] and Enzyme Commission (EC) database was carried out based on Anno8r tool [http://www.nematodes.org/bioinformatics/annot8r/] with E-value threshold of 1.0e−05. GOSlim terms for molecular function, biological process, and cellular component categories associated with the best BLASTX hit with non- redundant database were assigned to the corresponding C. borivilianum transcripts. KOBAS 2.0 (http://kobas.cbi.pku.edu.cn/home.do) software was used for annotation of transcripts with KEGG pathways. For this, we first used KEGG orthology data provided at website and locally BLAST all transcripts with it. The transcripts showing similarity with KO database at e-value cut-off of 1e−05 were selected. The resultant data was used as an input for annotations wherein, we mapped transcript ID's to KO terms which finally annotate the transcripts with KEGG pathways. The ‘Identify’ program in KOBAS 2.0 uses annotation to identify enriched pathways. Oryza sativa, being a monocot, was used as background species. Arabidopsis thaliana, being a model organism was also used as background species. Hyper geometric distribution was used for p-value calculation. The false discovery rate (FDR) was calculated by using q-value and other parameters were used as default.

2.5 Read mapping onto C. borivilianum transcripts

The expression level of each assembled transcript was measured through reads per kilobase per million mapped reads (RPKM) values. All reads were mapped onto the non-redundant set of transcripts to quantify the abundance of assembled transcripts. SeqMap [25] was used for read mapping and rSeq [26] was then applied for RPKM based expression measurement. Assembled sequences were used as reference sequence to map back short reads and to measure RPKM for all assembled transcripts as suggested by Mortazavi et al. [26] and Jiang and Wong [25]. For RPKM measurement, filtered reads were first mapped back to various assembled transcripts and total mapped reads were estimated. Unique mapped reads were assigned to each assembled transcript allowing maximum two mismatches. For each such cluster having similar sequences in database, longest sequence was considered as the representative sequence for the unique gene it represented. The associated GO terms and Id's were parsed for each of such sequence and their corresponding RPKM values for different assembled genes were calculated.

2.6 GC content analysis and Simple Sequence Repeats (SSRs) identification

GC content analysis was done using in-house developed R script. A Perl script known as MIcroSAtellite (MISA, http://pgrc.ipk-gatersleben.de/misa/) was used to identify SSRs in the unigenes. Repeats of mono-nucleotide more than 10 times, di-nucleotides repeats more than 6 times, tri-, tetra-, penta- and hexa-nucleotide repeats more than 5 times were considered as search criteria in MISA script.

2.7 Identification of Transcription Factor families

For identification of transcription factor families represented in C. borivilianum transcriptome, transcripts were probed against all the transcription factor protein sequences at Plant transcription factor database (PlnTFDB; http://plntfdb.bio.uni-potsdam.de/v3.0/downloads.php) using BLASTX with an e-value cut-off of 1.0e−05.

Results and Discussion

3.1 Transcriptome sequencing and de novo assembly

Illumina Hiseq 2000 (Illumina, USA) run representing the cDNA library from leaf tissues produced 22,595,634 single end (SE) reads. Each read was 50 bp in length encompassing nearly 4.0 Gb of sequencing data in fastq format. Since 3′ ends of reads are more prone to sequencing error, so after quality check for every 50 bp read, only 48 bases (excluding 2 bases at 3′ end) were considered for further analysis. Sequence data was filtered for low-quality reads and reads containing primer/adaptor sequences. Reads having the Phred quality score ≥20 (an error probability of 0.01) were selected for further use. After quality filtering, a total of 22,161,444 high-quality SE reads (3.8 Gb, 95% of the raw data) remained. Quality of the clean reads data was assessed using FASTX Toolkit (www.hannonlab.cshl.edu/fastxtoolkit). De novo assembly of C. borivilianum transcriptome was optimized after assessing effect of various k-mer lengths. The high quality trimmed reads (48 bp) were assembled using SOAPdenovo program at k-mer length of 17, 21, 23, 25, 27, 31, and 35 (Table 1). Total number of transcripts decreased linearly with the increment of k-mer size suggesting over-representation at lower k-mer and under-representation at higher k-mers. It was observed that sequences assembled at higher k-mer were enriched for transcripts with higher coverage/higher expression. For assembly process, only those reads were considered that produced high frequency k-mer. Several output parameters were analyzed that included total number of contigs, contigs with length 100 bp and above, N50 length, longest contig length, and average contig length as a function of k-mer. For the above mentioned data set, k-mer size of 23 emerged as the best for assembly with N50 length of 245 bp, largest contig length 3,168 bp and average contig length of 221 bp (Table 1). A total of 101,589 contigs having length of at least 100 bp were generated. These contigs made the final representatives for assembled sequences for this study. Total bases covered by contigs with length greater than 100 bp and above came out to be 22.47 Mb (Table 2). Although the majority of the contigs lie between 1 bp to 100 bp, a total of 6568 contigs with a size of 500 bp and above were are generated (Fig 1a).

Table 1

Effect of k-mer size on assembly of transcriptome data.

k-mer	Total contigs	Contigs >100 bp	Longest contig length (bp)	Average length (bp)	Total bases covered by 100 bp length	N50
17	1079131	99031	1862	178	17930159	180
21	516724	101445	3216	220	22401122	243
23	408567	101589	3168	221	2247281	245
25	337186	101765	3179	216	2207354	240
27	276611	101491	5256	211	2240112	232
31	183200	95717	5256	199	1911429	211

Table 2

Summary of filtered and assembled transcriptome data generated on Illumina Hiseq 2000 platform using RNA isolated from leaf tissue of C. borivilianum.

Total number of single-end reads	22,595,634
Number of reads obtained after quality filtering	22,161,444
Number of assembled transcripts	1,01,589
Average length of transcripts (bp)	221
Average coverage (bp)	22,472,81

Figure 1

Overview of the C. borivilianum transcriptome assembly.

(A) Size distribution of the contigs obtained from de novo assembly of high quality clean reads. (B) Size distribution of the unigenes produced from further assembly of contigs after clustering.

Overview of the C. borivilianum transcriptome assembly.

(A) Size distribution of the contigs obtained from de novo assembly of high quality clean reads. (B) Size distribution of the unigenes produced from further assembly of contigs after clustering.

3.2 Sequence clustering and similarity search

To reduce any redundancy, assembled sequences from the above analysis were clustered by hierarchical clustering with TGICL [27], which generated 1273 clusters. These clusters were further assembled using CAP3 program [28] and using 896 sequences, 448 consensus sequences were generated. This resulted in reduction of uniquely assembled contigs from 101,589 to 101,141. Contigs shorter than 100 bp were removed and total coverage with contig length ≥100 bp came out to be 22.42 Mb. Although majority of the contigs lie between 150 bp to 500 bp, we obtained a total of 6,290 contigs with a size of >500 bp (Fig 1b). For contig annotation, similarity search was performed using BLASTX against the non-redundant (NR) protein database with an E-value cut off 1.0e−05. For read assembled transcript sequences, significant BLASTX hits were found for a total of 50,826 sequences (50.4% of the total unigenes obtained; Table S1). The E-value distribution of the top hits in the NR database showed that 15.9% of the mapped sequences have high similarity (<1.0e−50), while the other 84.1% of sequences, similarity ranged between 1.0e−05 to 1.0e−50 (Fig 2a). The similarity distribution revealed 67.5% of the query sequences have a similarity higher than 80%, while 32.4% of the hits have a similarity ranging from 20% to 80% (Fig 2b). Homologous genes come from several species, with 23.5% of the unigenes having the highest homology to genes from Vitis vinifera followed by Oryza sativa (16.3%), Populus trichocarpa (10.8%), Ricinus communis (9.0%), Glycine max (8.1%) and Sorghum bicolor (7.2%) (Fig 2c).

Figure 2

Similarity search analysis of unigenes against the NR database.

Similarity search analysis of unigenes against the NR database.

(A) E-value distribution of the top BLAST hits for each unigene (E-value of 1.0e−5). (B) Similarity distribution of the best BLAST hits for each unigene. (C) Species distribution is shown as the percentage of the total homologous sequences (with an E-value ≤1.0e−5). We used the NCBI NR proteins database for similarity search and extracted the best hit of each sequence for analysis. Due to lack of C. borivilianum genome information, nearly half of the unigenes did not show a match in the plant protein dataset of NR database. Moreover, 63.4% of the above mentioned 50,826 annotation descriptions were uninformative (e.g., ‘unknown’, ‘unnamed’, ‘putative’, or ‘hypothetical’ protein).

3.3 Validation of assembled sequences against the available ESTs of C. borivilianum

The assembled sequences were validated by sequence alignments against ESTs of C. borivilianum submitted at NCBI dbEST by our group [13]. BLASTN analysis of assembled transcripts was performed with 459 ESTs with an E-value threshold of 1.0e−05 (Table S2). Alignment parameters were manually checked and also analyzed for mis-assemblies. Significant hits were observed for 394 sequences (85.8%), while no hit could be obtained for 65 ESTs from the assembled transcript set. It was observed that most of the assembled transcript sequences aligned correctly and in continuous form, with average identity of 95.67%, showing good assembly quality. The possible reason for unmapped unigenes of C. borivilianum could be due the presence of relatively short and low quality singletons, UTR sequences far from the translation start or stop sites (>1,000 bp), fusion transcripts and those having incomplete coverage by the genome. Similar results were shown in A. thaliana where around 13% of the ESTs could not be aligned to the predicted genes [29] and in human only 64% of the reads could be mapped to the RefSeq database of well annotated human genes [30]. Picorrhiza kurrooa had 14% of the ESTs that could not be aligned to its predicted genes [31].

3.4 Functional annotation and classification of C. borivilianum transcriptome

Chlorophytum borivilianum transcripts were assigned GO terms to describe functions of genes and associated gene products into three major categories namely, biological process, molecular function, and cellular component, and their sub-categories [24]. A total of 30,625 out of 50,826 assembled sequences yielded significant annotation against GO database. These genes were further classified into three major categories namely, biological process, molecular function, and cellular component using plant specific GO slims that broadly provides an overview of the ontology content. Functional classification of C. borivilianum transcripts in biological process category (Fig 3) showed that metabolic process (GO: 0008152), response to stimulus (GO: 0050896), nucleobase containing compound metabolic process (GO: 0006139) and extracellular structural organization (GO: 0009987) were among the highly represented groups indicating that the plant is undergoing rapid growth and extensive metabolic activity. In cellular component group, sequences related to the intracellular regions (GO: 0005622) and membrane regions were well-represented categories (GO: 0016020) (Fig 3). Transcripts belonging to major subgroups of molecular function category included binding functions (GO: 0005488), transferase activity (GO: 0016740), catalytic activity (GO: 0003824) and oxidoreductase activity (GO: 001649) (Fig 3). These GO annotations provide a comprehensive information on C. borivilianum expressed genes that are encoding diverse proteins.

Figure 3

Gene Ontology (GO) classification of the C. borivilianum transcriptome.

Gene Ontology (GO) classification of the C. borivilianum transcriptome.

GO term assignments to C. borivilianum unigenes based on significant plant species hits against the NR database. Results are summarized into three main GO categories (biological process, cellular component, molecular function) and 27 sub-categories. Best EC classification was obtained for 14,998 assembled sequences, which were annotated with 15,866 enzyme codes. Fig 4 lists major 20 abundant enzyme classes observed for C. borivilianum transcriptome. Interestingly, a large amount of assembled transcripts belonged to non-specific protein tyrosine kinases enzyme class alone (14.7%). In arabidopsis, this enzyme controls shoot and floral meristem size [32] and also contributes to signal transduction [33]. Presence of this enzyme in such abundance may suggest that the plant is in active metabolic phase.

Figure 4

Functional characterization and abundance of C. borivilianum transcriptome for enzyme classes. C. borivilianum transcripts were classified in top 20 abundant enzyme classes; area under each pie represents the value in actual number of transcripts.

We also compared KEGG pathways mapping of C. borivilianum with KEGG mapping of O. sativa and A. thaliana. Sample score corresponds to the transcripts of C. borivilianum mapped on KEGG pathways and background score is the total genes in the genome of the background species mapped on KEGG pathways. We found 50,034 transcripts showing similarity with GO database at e-value cut-off of 1e−05. These transcripts were used for annotation of KEGG pathways. These transcripts were annotated with 257 KEGG pathways which were present in O. sativa and A. thaliana. A total of 236 pathways were common in both of background species. 10 unique pathways of O. sativa and 11 of A. thaliana were also mapped. Total 3,805 transcripts were mapped on these KEGG pathways common with O. sativa and 3,806 transcripts were mapped on KEGG pathways common with A. thaliana. Out of these 43 KEGG pathways were significantly enriched (p-Value <0.05) with the genes present in O. sativa. Non-homologous end joining, Lipopolysaccharise biosynthesis, Spliceosome, RIG-I like receptor signaling pathway, mineral absorption, GPI-anchor biosynthesis, RNA transport, Basal transcription factors, N-Glycan biosynthesis and RNA degradation are top enriched pathways. In comparison to A. thaliana, 42 pathways were identified as enriched (p-value <0.05). Spliceosome, RNA degradation, RNA transport, Pyrimidine metabolism and GPI anchor biosynthesis were found to be most enriched pathways in comparison with A. thaliana. A total of 100 transcripts in C. borivilianum were mapped on spliceosome KEGG pathway compared to 106 genes of O. sativa and 117 genes of A. thaliana. Highest number transcripts (106) were mapped to Ribosome pathways in C. borivilianum compared to 257 genes of O. sativa and 311 genes of A. thaliana. Other pathways with highest transcripts mapping include RNA transport (89), Purine metabolism (79) and protein processing in endoplasm (73), thus suggesting that the plant was in actively growing stage at the time of harvesting. Besides all these pathways, flavonoid biosynthesis and steroid biosynthesis were the most enriched pathways in C. borivilianum. The enrichment of these pathways clearly suggested that the plant possess antioxidant properties, anti-arthritic properties, and anticancer properties. Pathways comparison (Fisher's Exact P-value <0.05) with respect to the background species O. sativa and A. thaliana is shown in Fig 5a and 5b. All the data of pathways mapping is provided in Table S3A and Table S3B for O. sativa and A. thaliana respectively.

Figure 5

Overrepresented pathways in KEGG database with respect to A) O. sativa and B) A. thaliana.

3.5 Discovery of the transcripts encoding putative transcription factors

Transcription factors interact with the promoter regions of a gene and modulate its expression. Associative modulation of several plant processes suggests involvement of transcription factors (TFs) for coordinated regulation of gene expression. By sequence comparison with known transcription factor gene families, 8,369 putative C. borivilianum transcription factor genes, distributed in at least 62 families, were identified representing 8.3% of C. borivilianum transcripts (Fig 6). These genes covered transcription factor gene families like MYB, MADS, Orphans, basic Helix-Loop-Helix (bHLH), bZIP, WRKY and many more. Among all these TF gene families, C3H, PHD and MYB were most abundant families (Table S4). MYB family proteins are characterized by DNA-binding domains and it is the largest transcription factor family in Arabidopsis thaliana as well, where it comprises of 163 genes [34]. These TFs have been associated with varied processes. It has been observed that many protein members of the MYB, bZIP and WRKY transcription factor families insinuates the regulation of stress responses [35]. Members of C3H family are involved in embryogenesis [36] and members of PHD family TFs are involved in vernalization processes [37]. bHLH members are involved in controlling cell proliferation and development of specific cell lineages [38] and bZIP regulates processes including pathogen defense light and stress signaling seed maturation and flower development. A complete picture about evolution of various transcription factor families will emerge once the complete genome sequence of C. borivilianum will be available.

Figure 6

Distribution of C. borivilianum transcripts in different transcription factor families.

3.6 GC content and SSRs analysis from the transcriptome of C. borivilianum

Deep sequencing of C. borivilianum transcriptome rendered us with an opportunity to calculate the ratio of GC contents. Knowledge of GC content helps in understanding ecology and evolution of particular taxa. It also plays a vital role in gene and genome regulation and also helps in determining the physical properties of the genome. It is an important indicator of stability of DNA [39]. Average GC content of C. borivilianum transcripts was 44% (Figure S1). This is in range with the GC levels of coding sequences of O. sativa (43.6%) which also belong to monocot family. Assembly of C. borivilianum was further assessed for the molecular markers. Morphological as well as biochemical markers are used in the authentication of herbal drugs. SSRs or microsatellite markers are polymorphic stretches of 1 to 6 nucleotide units repeated in tandem and randomly spread in eukaryotic genomes. SSRs are generally associated with functional and phenotypic variations. Moreover, SSRs allow identification of many alleles at a single locus, are easy to develop, distributed evenly all over the genome and are co-dominant [40]. SSRs are particularly useful when genome information of the crop is lacking. Since C. borivilianum is a cross pollinated species [41], and hence seed raised population will have variability. For identification of SSRs, transcripts of C. borivilianum were analyzed with perl script MISA. 3,487 SSRs were identified in 3,321 (3.28%) transcripts comprehensively out of which, 160 sequences contained more than 1 SSR. With a frequency of over 43.1% (1,503/3,321), di-nucleotides were most abundant of all the SSRs obtained. Tri-nucleotides were more than 32.89% (1,147/3,487), followed by tetra-nucleotides (0.4%, 15/3,487) and penta-nucleotides (0.05%, 2/3487). Most prevalent of mononucleotide was poly-A. The edge of poly-A mono-nucleotide SSRs over others may be due to the presence of poly-A tails in the RNA sequences (Table S5). Di-nucleotide SSRs were the most abundant in the transcripts, which is similar to results obtained from other plants [42]. Among the di-nucleotide repeat classes, AG/GA/CT/TC (90.7%) was the most frequent dimer motif. Other dimer motifs included TG/CA and AT/TA (Table S5). CG repeats were infrequent in the plant (0.6%), which is consistent with previous observations [42], [43,]. Among tri-nucleotide repeats, CTC/CCT/GAG/GAA was the largest repeat class followed by CTT/AAG/GGA/AGA (Table 3). These findings indicated that unique sequences containing SSR markers were indeed abundant in C. borivilianum. Di- and tri-nucleotide SSR motifs and the number of repeats are presented in Table 3. In particular, several SSR motifs were linked with the unique sequences encoding enzymes (e.g. MVK, DXS, SqS) involved in saponin biosynthesis (Table 4). These unique sequence-derived markers generated in this study represent a valuable genetic resource for future studies in this and related species.

Table 3

Simple sequence repeats (SSRs) identified in transcripts of C. borivilianum.

SSR mining	Number
Total number of sequences examined:	10,11,41
Total size of examined sequences (Mb):	22.42
Total number of identified SSRs:	3,487
Number of SSR containing sequences:	3,321 (3.28%)
Number of sequences containing more than one SSR:	160
Number of SSRs present in compound formation:	109

Table 4

The discovery of SSR motifs in putative genes involved in saponin biosynthesis.

Gene name	Unique sequence	SSR motif	Number of repeats	SSR start (bp)	SSR end (bp)	Sequence length (bp)
Mevalonate kinase (MVK)	contig 780100	ag	7	6	19	1326
1-deoxy-D-xylulose 5-phosphate synthase (DXS)	contig 801760	ac	20	1	20	787
DXS	contig 809736	ta	16	64	79	1647
Squalene synthase (Sqs)	contig 676689	cat	16	1	16	355

3.7 Analysis of metabolic pathway genes

Saponins present in C. borivilianum are the primary source of its significant medicinal properties and are synthesized by mevalonate and non-mevalonate pathways in plants. TLC analysis indicated that significant levels of saponins were present in C. borivilianum tissues (Figure S2). Nearly 10 spots were visualized on TLC plate upon derivatization with anisaldehyde - sulfuric acid reagent. Previous studies reported that initial reactions of isoprenoid biosynthetic pathway occur in leaves, while later step modifications and storage of saponins occurs in roots [13], [14], thus the amount of saponins is high in roots. There are also few reports on the presence of flavonoid compounds in C. borivilianum [44]. These reports very well validate the presence of saponins and flavonoids in C. borivilianum, thus, our interest was to identify the genes responsible for biosynthesis of saponins and other metabolites in our dataset. The genes related to following pathways/compounds were identified:

3.7.1. Saponin biosynthesis pathway

Saponins are derived from geranyl pyrophosphate (GPP). GPP is synthesized by sequential head to tail addition of isopentenyl pyrophosphate (IPP) and its allelic isomer dimethylallyl pyrophosphate (DMAPP) [45]. IPP and DMAPP are synthesized via the cross talks between the cytosolic mevalonate (MVA) pathway and plastidial 2-C-methyl-D-erythritol-4-phosphate (MEP) pathway [46]. MVA pathway starts from the condensation of acetyl-CoA [47], whereas MEP pathway needs pyruvate and _lyceraldehydes 3-phosphate [48]. In C. borivilianum transcriptome dataset, multiple transcripts encoding almost all known enzymes involved in the MVA pathway, MEP pathway and saponin biosynthesis pathway were identified (Fig 7a). In almost all the cases, more than one unique sequence was annotated as same enzyme. These unique sequences may either represent different fragments of a single transcript or different members of a gene or both. MVA pathway starts with the condensation of three molecules of acetyl-CoA while in methylerythritol phosphate (non-mevalonate) pathway, isopentenyl diphosphate (IPP) is formed from pyruvate and _lyceraldehydes-3-phosphate. Both biosynthetic routes yield the activated isoprene units, dimethylallyl diphosphate (DMAPP) and IPP [49]. Genes involved in MVA pathway that were found in our transcriptome included acetyl CoA- acetyltransferase (EC 2.3.1.9, 5 unigenes), HMG-CoA reductase (EC 1.1.1.34, 13 unigenes), HMG-CoA synthase (EC 2.3.3.10, 15 unigenes), mevalonate kinase (EC 2.7.1.36, 3 unigenes), phosphomevalonate kinase (EC 2.7.4.2, 2 unigenes), mevalonate diphosphate decarboxylase (EC 4.1.1.33, 2 unigenes). MEP pathway genes are also converted to IPP by reaction catalyzed by series of enzymes. These included 1-deoxy-D-xylulose-5-phosphate synthase (EC 2.2.1.7, 18 unigenes), 1-deoxy-D-xylulose-5-phosphate reductoisomerase (EC 1.1.1.267, 3 unigenes), 2-C-methyl-D-erythritol 4-phosphate cytidylyl transferase (EC 2.7.7.60, 2 unigenes), 4-diphosphocytidyl-2-C-methyl-D-erythritol kinase (EC 2.7.1.148, 1 unigene), 2-C-methyl-D-erythritol, 2,4-cyclodiphosphate synthase (EC 4.6.1.12, 1 unigene), 4-hydroxy-3-methylbut-2-enyl diphosphate synthase (EC 1.17.7.1, 3 unigenes), 4-hydroxy-3-methylbut-2-enyl diphosphate reductase (EC 1.17.1.2, 13 unigenes). IPP was then converted to squalene by a series of enzymes which included isopentenyl diphosphate isomerase (EC 5.3.3.2, 6 unigenes), geranylgeranyl diphosphate synthase (EC 2.5.1.29, 5 unigenes) farnesyl diphosphate synthase (EC 2.5.1.10, 14 unigenes), squalene synthase (EC 2.5.1.21, 9 unigenes). Subsequently, squalene monooxygenase (EC 1.14.13.132, 14 unigenes) catalyzes the conversion of squalene to 2, 3-oxidosqualene. The cyclization event of 2, 3-oxidosqualene is catalyzed by a class of enzymes, OSCs. Cyclization of 2, 3-oxidosqualene is rate limiting step and this event is also the branch point for sterol and triterpenoid biosynthesis in many plants [50]. Three OSC genes of C. borivilianum, cycloartenol synthase (E.C. 5.4.99.8, 16 unigenes) [51], β-amyrin synthase (E.C. 5.4.99.39, 1 unigene) [52] and Dammarenediol II synthase (E.C. 4.2.1.125, 1 unigene) [53] exist in our dataset. Steroidal saponins, borivilianosides, are the major type of saponins present in C. borivilianum [6]. As yet, no reports reveal the presence of triterpenoid type of saponins in C. borivilianum. However, two singleton sequences (Transcript ID: 664097 and 798016) matched with β-amyrin synthase and Dammarenediol II synthase of V. vinifera and R. communis respectively.

Figure 7

Chlorophytum borivilianum unigenes involved in two secondary metabolic pathways.

C. borivilianum unigenes involved in; (A) saponin biosynthesis, (B) flavonoid biosynthesis and (C) alkaloid biosynthesis. Red number in the bracket following each gene name indicates the number of corresponding unigenes.

Chlorophytum borivilianum unigenes involved in two secondary metabolic pathways.

3.7.2 Flavonoid biosynthesis pathway

Genes associated with flavonoid biosynthesis pathway were also detected in the dataset (Fig 7b). Flavonoids are synthesized via the phenylpropanoid pathway and are converted from phenylalanine to chalcone by the enzymes phenylalanine ammonia lyase (EC 4.3.1.24, 25 unigenes), cinnamate 4-hydroxylase (EC 1.14.13.11, 8 unigenes), 4-coumarate CoA ligase (EC 6.2.1.12, 23 unigenes), and chalcone synthase (EC 2.3.1.74, 16 unigenes). Chalcone isomerase (EC 5.5.1.6, 12 unigenes) then catalyzes the isomerisation of chalcones into naringenin. Naringenin can be converted by flavonoid 3′- hydroxylase (EC 1.14.13.21, 18 unigenes, 421.18) and flavonoid 3′, 5′-hydroxylase (EC 1.14.13.88, 3 unigenes) to produce eriodictyol and dihydrotricetin, respectively. Flavone synthase (EC 1.14.11.22, 1 unigene) catalyzes the conversion of flavanones to flavones, and flavanone 3-hydroxylase (EC 1.14.11.9, 24 unigenes) can convert these flavanones to dihydroflavonols. Dihydroflavonols can then lead to production of flavonols and flavan - 3, 4-diols (leucoanthocyanidin), reactions being catalysed by flavonol synthase (EC 1.14.11.23, 9 unigenes) and by dihydroflavonol 4-reductase (EC 1.1.1.219, 6 unigenes) repectively. Leucoanthocyanidins can be converted either to anthocyanidins and subsequently anthocyanins through the subsequent action of anthocyanidin synthase (EC 1.14.11.19, 6 unigene) or reduced to catechins through the action of the enzyme and anthocyanidin reductase (EC 1.3.1.77, 8 unigenes). Genes of the phenylpropanoid pathway have already been reported in several plant species such as Camellia sinensis [54], [55], Petroselinum hortense (Parsely) [56], Zea mays [57], Arabidopsis thaliana [58], Vitis vinifera [59], Citrus unshiu (Marc.) [60] and Fragaria spp. [61]. Above descriptions showed vertical pathway responsible for the formation and conversion of the sub-categories of flavonoids. Another revelation came out with the genes involved in isoflavonoid metabolism. These genes included 2-hydroxyisoflavanone synthase (EC 1.14.13.86, 3 unigenes), isoflavone 4′-O-methyltransferase (EC 2.1.1.46, 1 unigene), isoflavone 7-O-methyltransferase (EC 2.1.1.150, 2 unigenes), isoflavone 2′-hydroxylase (EC 1.14.13.89, 9 unigenes) and isoflavone reductase (EC 1.3.1.45, 12 unigenes). C borivilianum is not known to produce isoflavones. Similar kinds of findings have been reported in other non-isoflavone accumulating plants [62], suggesting homologs of isoflavone reductase and isoflavone O-methyltransferase may have general metabolic functions in many plant species.

3.7.3. Discovery of transcripts encoding CYPs and UGTs involved putatively in saponin and flavonoid biosynthesis

Cytochrome P450 (CYP 450) are versatile biocatalysts. These enzymes constitute largest family of plant proteins (http://drnelson.utmem.edu/CytochromeP450.html). CYP450s are involved in the NADPH-dependent regio- and stereo-specific oxygenations during the biosynthesis of terpenoids, sterols, lignins, hormones, fatty acids, pigments, and phytoalexins in plants [63]. In biosynthesis of steroidal-type of saponins, CYP450 catalyzes the conversion of obtusifoliol to Δ8, 14 – sterol. A total of 490 unique transcripts, were annotated as CYP450s (Table S6). Previous studies reported characterization of obtusifoliol 14-alpha-demethylase (EC 1.14.13.70, 7 unigenes) of CYP51 family from Sorghum bicolor [64] and Triticum aestivum [65] and CYP 710 [66] from Arabidopsis, both of which are involved in steroid saponin biosynthesis. Therefore, the CYP450s belonging to CYP51 and CYP710 family might be involved in borivilianoside biosynthesis in Chlorophytum genus. Amongst reactions for the modification of secondary metabolites, glycosylation plays an important role in plants, contributing to biosynthesis and storage of secondary metabolites [67]. UDP-glycosyltransferases (UGTs) catalyze the transfer of sugar residues from uridine diphosphate sugars to an acceptor. Several unigenes encoding for different type of glycosyltransferases were found in our dataset (Table S7). Out of these, UGTs supposed to be involved in saponin biosynthesis included soyasaponin III rhamnosyltransferase (EC 2.4.1.273, 4 unigenes) and sterol 3- beta- glucosyltransferase (EC 2.4.1.173, 9 unigenes). Soyasaponin III rhamnosyltransferase is involved in the biosynthesis of soyasaponin I in Glycine max [68]. The enzyme has strong sugar donor specificity for UDP-rhamnose. It does not show any activity with UDP-glucose, UDP-galactose or UDP- glucuronic acid. Sterol 3- beta- glucosyltransferase on the other hand is involved in the glycosylation activity of different type of sterols (e.g. sitosterol, stigmasterol etc.) present in the plant. This enzyme is thought to be involved in steroid metabolism. UGTs involved in the glycosylation of flavonoids were also present in C. borivilianum transcriptome dataset, some of which were anthocyanidin 3-O-glucosyltransferase (EC 2.4.1.115, 16 unigenes), anthocyanin 3′-O-beta-glucosyltransferase (EC 2.4.1.238, 14 unigenes), and flavonol-3-O-glucosyltransferase (EC 2.4.1.91, 46 unigenes). Presence of such glycosylation enzymes and cytochrome P450 genes might contribute to extensive modifications of various secondary metabolites in C. borivilianum.

3.7.4. Alkaloid biosynthesis pathway

Alkaloids are heterocyclic nitrogen compounds synthesized from amino acids and are the largest groups of natural plant products [69]. Potent biological activity of some alkaloids has led to their exploitation as pharmaceuticals, stimulants, narcotics and poisons. Plant-derived alkaloids currently in clinical use include, but not limited to, analgesics morphine and codeine, anticancer agents vinblastine and taxol, gout suppressant colchicine, muscle relaxant (C)-tubocurarine, antiarrythmic ajmaline, antibiotic sanguinarine, and sedative scopolamine. Other important alkaloids of plant origin include caffeine, nicotine, cocaine, and synthetic O, O-acetylated morphine derivative heroin. In C. borivilianum transcriptome dataset, in addition to saponins and flavonoids, we also discovered genes involved in the biosynthesis of benzylisoquinoline alkaloids (BIA). Amino acid tyrosine is the precursor of benzylisoquinoline alkaloids. BIA biosynthesis begins with a metabolic lattice of decarboxylations, ortho hydroxylations and deaminations that convert tyrosine to both dopamine and 4- hydroxyphenylacetaldehyde [70]. This reaction is catalyzed by tyrosine decarboxylase (TYDC; EC 4.1.1.25, 8 unigenes), the only enzyme involved in these early steps that has been purified [71], and for which the corresponding cDNA has been cloned [72], [73] (Fig 7c). TYDC cDNAs have also been reported from parsley [74] and Arabidopsis thaliana [75], which do not accumulate tyrosine-derived alkaloids. TYDC mRNAs were shown to be rapidly induced in response to elicitor treatment [74], [75] and pathogen challenge [76] in various plants. Induction of TYDC mRNAs in parsley and Arabidopsis suggests that, in addition to BIAs, tyramine serves as precursor to a ubiquitous class of defense responsive metabolites. Dopamine and 4-hydroxyphenylacetaldehyde are condensed by norcoclaurine synthase (NCS) (EC 4.2.1.78, 13 unigenes) to yield the trihydroxybenzylisoquinoline alkaloid (S)- norcoclaurine, which is the central precursor to all BIAs in plants (Fig 7c) [77]. (S)-Norcoclaurine is converted to (S)-reticuline by a 6-O-methyltransferase (EC 2.1.1.128, 6 unigenes), an N-methyltransferase, a P450 hydroxylase (EC 1.14.13.71, 2 unigenes) and a 4′-O-methyltransferase. SAM-dependent 6-O- and 4′-O-methyltransferases (6-O-MT and 4′-O-MT, respectively) have been purified from cultured Coptis japonica cells [78], and the corresponding cDNAs were isolated and characterized [79]. (S)-reticuline is the central intermediate that can undergo various rearrangements and modifications to yield different structural classes of benzylisoquinolines. Berberine Bridge Enzyme (BBE) (Reticuline oxidase; EC 1.21.3.3, 1 unigene) that catalyses the conversion of reticuline to (S) - scoulerine was also found but we could not discover the enzymes that are involved beyond BBE in this pathway.

3.7.5. Discovery of anticancer and plant defense genes

C. borivilianum possess antitumor and anticancer properties [80]. It contains cytotoxic steroidal glycoside saponinchloromaloside-A and spirostanolpentaglycosides embracing beta-D-apiofuranose which are used for their anticancer properties [81]. In the present study, for the first time, we reported the presence of genes involved in taxol biosynthesis. 3′-N-debenzoyal- 2′deoxytaxol N- benzoyaltransferase, taxane 10 - beta - hydroxylase (5-α-taxadienol-10-β-hydroxylase) and deacetyl baccatin genes were discovered that are involved in taxol biosynthetic pathway. Few genes namely ± neomenthol dehydrogenase and pulegone reductase, involved in menthol biosynthesis pathway, were also discovered. Menthol biosynthetic genes are primarily involved in plant defense, although there are evidences that menthol has a potent anticancer property, effecting cell death through TRPM8 receptor [82]. Further studies on this pathway could be beneficial as the combination of C. borivilianum with other herbs could be beneficial for cancer treatment.

3.8 Quantification of C. borivilianum transcripts

All the reads were mapped onto the non-redundant set of C. borivilianum transcripts which revealed that the number of reads corresponding to each transcript ranged from 2 to 13289 with an average of 94 reads per transcript indicating a very wide range of expression in transcripts of C. borivilianum (Table S8). Minimum coverage (RPKM) of a C. borivilianum transcript was 3.38 and maximum of 4328.27 with an average of 55.78 (Figure S3). Low expressing transcripts were also present in our assembly. We also studied expression of various genes that were involved in the saponin biosynthesis and flavonoid biosynthesis pathways. We found that expression of unigene 807650 annotated as Acetyl Co-acetyl transferase (AACT), unigene662091 annotated as HMG-CoA synthase, unigene 757982 annotated as HMG-CoA reductase and unigene724460 annotated as Squalene monooxygenase was highest among saponin biosynthesis pathway genes (Table 4). AACT participates in 10 different metabolic pathways and HMG-CoA, synthesized by HMG-CoA synthase is an intermediate both in the saponin biosynthesis and in ketogenesis. This may be the reason of its high abundance in C. borivilianum transcriptome dataset. The expression of cycloartenol synthase (EC 5.4.99.8) (100 reads) was 5 fold higher than that of β-amyrin synthase (21 reads) (EC 5.4.99.39) (Table 5) thus, further confirming the diversion of the pathway towards the synthesis of steroidal type of saponins in this plant. Most of the genes of mevalonate pathway had expression greater than the genes of non-mevalonate pathway (Table 5). It might be due to the fact that mevalonate pathway is more active in the production of saponins in C. borivilainum as compared to non-mevalonate pathway as is the case commonly seen in monocot plants. In case of flavonoid biosynthesis, expression of 4-coumarate CoA ligase, dihydroflavonol 4-reductase, flavonol synthase, flavanone 3-hydroxylase was the highest (Table 6).

Table 5

Identification of genes involved in saponin biosynthesis along with their RPKM values.

Gene Name	EC number	Transcript ID	Total ESTs	RPKM value	Number of reads
acetyl Co-acetyl transferase	2.3.1.9	contig 662605, 789282, 807132, 807650, 810900	5	420.22	2162
HMG-CoA synthase	2.3.3.10	contig 617027, 642120, 642122, 642652, 643560, 648291, 662091, 685089, 714060, 719432, 730642, 750334, 752180, 787594, 791328	15	588.8	399
HMG-CoA reductase	1.1.1.34	contig 616693, 642352, 728682, 731828, 739034, 750024, 751864, 757982, 761916, 765000, 777172, 789074, 805890	13	684	1144
mevalonate kinase	2.7.1.36	contig 780100, 790236, 808328	3	116.9	337
phosphomevalonate kinase	2.7.4.2	contig 796674, 802836	2	83.9	351
mevalonate diphosphate decarboxylase	4.1.1.33	contig 731872, 807394	2	183.9	932
1-deoxy-D-xylulose-5-phosphate synthase	2.2.1.7	contig 630630, 639174, 659061, 672215, 692780, 701402, 716282, 758550, 770832, 776522, 785368, 788176, 796714, 799960, 801760, 807536, 809736, 812228	18	33.8	31
1-deoxy-D-xylulose-5-phosphate reductoisomerase	1.1.1.267	contig 767146, 807128, 811584	3	49.4	325
2-C-methyl-D-erythritol 4-phosphate cytidylyl transferase	2.7.7.60	contig 697584, 812958	2	35.05	262
4-diphosphocytidyl-2-C-methyl-D-erythritol kinase	2.7.1.148	contig 815196	1	30.19	302
2-C-methyl-D-erythritol, 2,4-cyclodiphosphate synthase	4.6.1.12	contig 809428	1	35.16	200
4-hydroxy-3-methylbut-2-enyl diphosphate synthase	1.17.7.1	contig 768924, 813016, 815362	3	91.9	179
4-hydroxy-3-methylbut-2-enyl diphosphate reductase	1.17.1.2	contig 663803, 674257, 674259, 697268, 713070, 757110, 770076, 775264, 776012, 776014, 780586, 780710, 813560	13	91.6	217
isopentenyl diphosphate isomerase	5.3.3.2	contig 627116, 640576, 677655, 684005, 771148, 793800	6	239.49	128
geranylgeranyl diphosphate synthase	2.5.1.29	contig 669873, 709788, 771708, 791512, 804002	5	37.97	166
farnesyl diphosphate synthase	2.5.1.10	contig 622530, 663119, 669873, 694410, 709520, 709788, 759242, 764802, 771708, 774914, 782044, 785872, 791512, 804002	14	337.61	174
squalene synthase	2.5.1.21	contig 660543, 669241, 676689, 684225, 720378, 762052, 771110, 771878, 776032	9	178.12	119
squalene monooxygenase	1.14.13.132	contig 625956, 650031, 684205, 699388, 718888, 724460, 735738, 749834, 752160, 753330, 759096, 772472, 773000, 788034	14	789.79	897
cycloartenol synthase	5.4.99.8	contig 634716, 641806, 644535, 660567, 679937, 737170, 761218, 762950, 777784, 785022, 796606, 802302, 805366, 814320, 815892	16	177.58	100
β-amyrin synthase	5.4.99.39	contig 664097	1	30.56	21
dammarenediol II synthase	4.2.1.125	contig 663453	1	125.43	85

Table 6

Genes involved in flavonoid biosynthesis along with their RPKM values.

Gene Name	EC number	Transcript ID	Total ESTs	RPKM value	Number of read
phenylalanine ammonia lyase	4.3.1.24	contig 634698, 642066, 657575, 668909, 672281, 691416, 702626, 703366, 703960, 716568, 719604, 735030, 743228, 748710, 753296, 754648, 755468, 760124, 760936, 769576, 777484, 783272, 788948, 790736, 799910	25	241.09	474
cinnamate 4-hydroxylase	1.14.13.11	contig 626042, 673051, 688226, 704750, 755308, 758396, 791376, 794622	8	207.68	111
4-coumarate CoA ligase	6.2.1.12	contig 658459, 673103, 685642, 712208, 722868, 734074, 750294, 752608, 757946, 758914, 759276, 760860, 761416, 775342, 776318, 790346, 790364, 792818, 793014, 793670, 803146, 811102, 815080	23	1206.79	979
chalcone synthase	2.3.1.74	contig 619401, 623782, 642216, 664433, 694872, 703744, 711022, 729452, 742538, 750790, 752302, 775504, 776662 790158, 803096, 804748	16	158.11	83
chalcone isomerase	5.5.1.6	contig 618735, 654743, 666277, 716164, 716334, 761068, 765552, 772422, 776068, 783048, 785354, 790738	12	343.98	174
flavonoid 3′- hydroxylase	1.14.13.21	contig 645173, 688012, 697330, 716994, 717416, 719790, 732470, 733298, 746820, 756294, 765426, 780410, 781518, 796884, 799424, 805552, 813496	18	421.18	1013
flavonoid 3′, 5′-hydroxylase	1.14.13.88	contig 638078, 679673, 705666	3	31.43	18
flavone synthase	1.14.11.22	contig 682129	1	30.66	24
flavanone 3-hydroxylase	1.14.11.9	contig 617797, 622328, 628840, 631844, 649435, 649459, 656551, 665101, 712318, 715714, 721514, 724718, 728206, 741180, 760132, 763024, 767854, 789424, 790626, 792572, 800076, 801784, 803294, 812892	24	475.11	263
flavonol synthase	1.14.11.23	contig 631844, 656551, 721514, 741180, 763024, 767854, 800076, 801784, 812892	9	475.11	263
dihydroflavonol 4-reductase	1.1.1.219	contig 701984, 702106, 781542, 789140, 789820, 796570	6	729.12	679
anthocyanidin synthase	1.14.11.19	contig 698156, 706394, 733490, 752474, 778182, 794658	6	18.9	61
anthocyanidin reductase	1.3.1.77	contig 660955, 667861, 675719, 718414, 719766, 730968, 756916, 794760	8	22.83	17

Conclusion

In conclusion, identification of genes involved in saponin and flavonoid biosynthesis will contribute to future functional studies in the plant and provide a basis for improving production levels in plants or in microbial hosts by metabolic engineering. In our data set, we also identified transcripts that encode for enzymes involved in anticancer and plant defense properties. Further, we annotated a large number of genes involved in the various pathways. Dataset of assembled C. borivilianum unigenes presented here will provide the foundation for other functional and comparative genomic studies. Guanine-cytosine (GC) content analysis of transcripts. (TIF) (TIF) Click here for additional data file. Analysis of total saponins contents in root and leaf tissue of by soxhelation method and TLC. (TIF) Click here for additional data file. Distribution of RPKM values for different transcripts in transcriptome. (TIF) Click here for additional data file. Assembled unigenes after CAP3 assembly. (XLS) Click here for additional data file. Assembly validation with known leaf nucleotide sequences. (XLS) Click here for additional data file. metabolic pathway analysis using KEGG with; (A) and (B) background databases. (XLS) Click here for additional data file. Details on transcription factor (TF) families identified in (XLS) Click here for additional data file. Simple sequence repeats (SSRs) identified in transcripts of . (XLS) Click here for additional data file. Annotated cytochrome P450s (CYPs) in transcriptome. (XLS) Click here for additional data file. Annotated glycosyltransferases (GTs) in transcriptome. (XLS) Click here for additional data file. Reads per exon kilobase per million (RPKM) based expression of annotated genes of transcriptome. (XLS) (XLS) Click here for additional data file.

64 in total

1. In silico analysis on frequency and distribution of microsatellites in ESTs of some cereal species.

Authors: Rajeev K Varshney; Thomas Thiel; Nils Stein; Peter Langridge; Andreas Graner
Journal: Cell Mol Biol Lett Date: 2002 Impact factor: 5.787

2. Expressed sequence tags: alternative or complement to whole genome sequences?

Authors: Stephen Rudd
Journal: Trends Plant Sci Date: 2003-07 Impact factor: 18.313

3. Action of (2-->1)Fructo-oligopolysaccharide fraction of Chlorophytum borivilianum against Streptozotocin-Induced oxidative stress.

Authors: Sreevidya Narasimhan; Raghavan Govindarajan; Vijayakumar Madhavan; M Thakur; V K Dixit; Shanta Mehrotra; K P Madhusudanan
Journal: Planta Med Date: 2006-10-18 Impact factor: 3.352

4. Isolation and characterization of cytotoxic saponin chloromaloside A from Chlorophytum malayense.

Authors: S X Qiu; X C Li; Y Xiong; Y Dong; H Chai; N R Fransworth; J M Pezzuto; H H Fong
Journal: Planta Med Date: 2000-08 Impact factor: 3.352

5. Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing.

Authors: Andreas P M Weber; Katrin L Weber; Kevin Carr; Curtis Wilkerson; John B Ohlrogge
Journal: Plant Physiol Date: 2007-03-09 Impact factor: 8.340

6. Isolation of a set of ripening-related genes from strawberry: their identification and possible relationship to fruit quality traits.

Authors: K Manning
Journal: Planta Date: 1998-08 Impact factor: 4.116

7. Differential and tissue-specific expression of a gene family for tyrosine/dopa decarboxylase in opium poppy.

Authors: P J Facchini; V De Luca
Journal: J Biol Chem Date: 1994-10-28 Impact factor: 5.157

8. Novel protein kinase of Arabidopsis thaliana (APK1) that phosphorylates tyrosine, serine and threonine.

Authors: T Hirayama; A Oka
Journal: Plant Mol Biol Date: 1992-11 Impact factor: 4.076

9. A pathogen-responsive gene of parsley encodes tyrosine decarboxylase.

Authors: P Kawalleck; H Keller; K Hahlbrock; D Scheel; I E Somssich
Journal: J Biol Chem Date: 1993-01-25 Impact factor: 5.157

10. Characterization of the sesame (Sesamum indicum L.) global transcriptome using Illumina paired-end sequencing and development of EST-SSR markers.

Authors: Wenliang Wei; Xiaoqiong Qi; Linhai Wang; Yanxin Zhang; Wei Hua; Donghua Li; Haixia Lv; Xiurong Zhang
Journal: BMC Genomics Date: 2011-09-19 Impact factor: 3.969

34 in total

1. Pathways associated with lignin biosynthesis in lignomaniac jute fibres.

Authors: Avrajit Chakraborty; Debabrata Sarkar; Pratik Satya; Pran Gobinda Karmakar; Nagendra Kumar Singh
Journal: Mol Genet Genomics Date: 2015-02-28 Impact factor: 3.291

Review 2. Renaissance in phytomedicines: promising implications of NGS technologies.

Authors: Sonal Sharma; Neeta Shrivastava
Journal: Planta Date: 2016-03-22 Impact factor: 4.116

3. Subtractive transcriptome analysis of leaf and rhizome reveals differentially expressed transcripts in Panax sokpayensis.

Authors: Bhusan Gurung; Pardeep K Bhardwaj; Narayan C Talukdar
Journal: Funct Integr Genomics Date: 2016-09-01 Impact factor: 3.410

4. Comparative genomic hybridization and transcriptome sequencing reveal that two genes, OsI_14279 (LOC_Os03g62620) and OsI_10794 (LOC_Os03g14950) regulate the mutation in the γ-rl rice mutant.

Authors: Xulong Wang; Fanhua Wang; Huiqiong Chen; Xiaoyu Liang; Yingmei Huang; Jicai Yi
Journal: Physiol Mol Biol Plants Date: 2017-08-22

5. Transcriptome and metabolome analysis of Ferula gummosa Boiss. to reveal major biosynthetic pathways of galbanum compounds.

Authors: Ahmad Sobhani Najafabadi; Mohammad Reza Naghavi; Hamid Farahmand; Alireza Abbasi
Journal: Funct Integr Genomics Date: 2017-07-07 Impact factor: 3.410

6. RNA-Seq mediated root transcriptome analysis of Chlorophytum borivilianum for identification of genes involved in saponin biosynthesis.

Authors: Sunil Kumar; Shikha Kalra; Baljinder Singh; Avneesh Kumar; Jagdeep Kaur; Kashmir Singh
Journal: Funct Integr Genomics Date: 2015-10-12 Impact factor: 3.410

7. Sequencing, de novo assembly and comparative analysis of Raphanus sativus transcriptome.

Authors: Gang Wu; Libin Zhang; Yongtai Yin; Jiangsheng Wu; Longjiang Yu; Yanhong Zhou; Maoteng Li
Journal: Front Plant Sci Date: 2015-04-01 Impact factor: 5.753

8. Global Gene Expression of Kosteletzkya virginica Seedlings Responding to Salt Stress.

Authors: Xiaoli Tang; Hongyan Wang; Chuyang Shao; Hongbo Shao
Journal: PLoS One Date: 2015-04-22 Impact factor: 3.240

9. Next Generation Sequencing and Transcriptome Analysis Predicts Biosynthetic Pathway of Sennosides from Senna (Cassia angustifolia Vahl.), a Non-Model Plant with Potent Laxative Properties.

Authors: Nagaraja Reddy Rama Reddy; Rucha Harishbhai Mehta; Palak Harendrabhai Soni; Jayanti Makasana; Narendra Athamaram Gajbhiye; Manivel Ponnuchamy; Jitendra Kumar
Journal: PLoS One Date: 2015-06-22 Impact factor: 3.240

10. De novo transcriptome analyses reveals putative pathway genes involved in biosynthesis and regulation of camptothecin in Nothapodytes nimmoniana (Graham) Mabb.

Authors: Gulzar A Rather; Arti Sharma; Shahzad A Pandith; Veenu Kaul; Utpal Nandi; Prashant Misra; Surrinder K Lattoo
Journal: Plant Mol Biol Date: 2017-12-21 Impact factor: 4.076