
De novo assembly and characterization of Oryza officinalis leaf transcriptome by using RNA-seq.

Ying Bao¹, Si Xu¹, Xiang Jing¹, Lu Meng¹, Zongyan Qin¹.

Abstract

Although efforts have been made to identify useful wild rice genes that can be used to improve cultivated rice, the virtual reservoir of genetic variation hidden within the wild relatives of cultivated rice is largely untapped. Here, using next-generation sequencing technology, we investigated the leaf transcriptome of a wild rice, O. officinalis, which has a CC genome. Approximately 23 million reads were produced in the leaf transcriptome analysis of this species, and de novo assembly constructed 68,132 unigenes. Functional annotations for the unigenes were conducted using sequence similarity comparisons against the following databases: the nonredundant nucleotide database, the nonredundant protein database, the SWISS-PROT database, the Clusters of Orthologous Groups of proteins database, the Kyoto Encyclopedia of Genes and Genomes database, the Gene Ontology Consortium database, and the InterPro domains database. In addition, a total of 476 unigenes related to disease resistance were identified in O. officinalis, and these unigenes can serve as important genetic resources for cultivated rice breeding and quality improvement. The present study broadens our understanding of the genetic background of non-AA genomic wild rice species and also provides a bridge to extend studies to other Oryza species with CC genomes.

Substances:
Plant Proteins
RNA, Plant

Year: 2015 PMID: 25713814 PMCID: PMC4332758 DOI: 10.1155/2015/982065

Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411

1. Introduction

The transcriptome is defined as “the complete complement of mRNA molecules generated by a cell or population of cells” [1, 2]. Access to transcriptomic information allows us to answer long-standing questions regarding the genetic basis of environmental adaptation and intraspecies divergence as well as the evolutionary differences associated with gene expression in plant-specific growth and developmental stages. Although genome-wide scanning and high-throughput sequencing studies initially focused on genetic model plants, this situation has changed dramatically with the rapid development of next-generation sequencing technology and the simultaneous maturation of bioinformatic methods [3]. De novo assembly and characterization of the transcriptomes of nonmodel plants by RNA sequencing have become more feasible. Many nonmodel plants such as Flaveria [4], Leymus chinensis [5], and Panicum maximum [6] have been successfully sequenced at the transcriptome level, and some common underlying evolutionary mechanisms have been elucidated through comparative analyses of related species [4, 7, 8].

Oryza is a very important agricultural genus that includes rice (O. sativa L.), a major food crop that supports approximately half of the world's population [9]. The genus is composed of 23 species, including two cultivated and 21 wild rice species with 10 genome types (AA, BB, CC, BBCC, CCDD, EE, FF, GG, HHJJ, and HHKK) [10, 11]. As the direct gene pool of cultivated rice (AA genome), wild Oryza species possess abundant genetic diversity. Traditional hybridization has long occurred between cultivated rice and AA genome wild rice species [12], and the potential value of non-AA genome species in rice breeding has also been recognized. For example, alleles associated with resistance to bacterial blight, brown planthoppers, and white-backed planthoppers from O. latifolia Desv. (CCDD genome), O. officinalis Wall ex Watt (CC genome), and O. australiensis Domin. (EE genome) have been successfully introgressed into cultivated rice O. sativa populations [13–15]. However, the virtual reservoir of genetic variation hidden within the wild relatives of cultivated rice is largely untapped.

O. officinalis is a perennial wild rice that is distributed in South and Southeast Asia, South China, Papua New Guinea, and Australia. The CC genome possessed by O. officinalis is shared by three diploids (O. eichingeri Peter, O. officinalis, and O. rhizomatis Vaughan) and six allotetraploids (three with BBCC genomes: O. malampuzhaensis Krish. and Chand., O. minuta J.S. Presl. ex. C.B. Presl., and O. punctata Kotschy ex Steud.; three with CCDD genomes: O. alta Swallen, O. grandiglumis (Doell.) Prod., and O. latifolia Desv.) [10, 11]. In addition, the CC genome is phylogenetically close to the AA and BB genomes [9–11]. Therefore, further investigation of the genetic basis of O. officinalis will not only provide more opportunities to discover valuable genes that may improve the quality of cultivated rice but will also serve as a bridge to extend further study to other allopolyploids or diploids that contain CC genomes.

2. Materials and Methods

2.1. Plant Materials

Seeds of three biological replicates of O. officinalis (Acc. number 104973) from the International Rice Research Institute (IRRI, Manila, Philippines) were dehulled and heated at 50°C for five days to break dormancy and were subsequently immersed in warm water at 30°C for three days to germinate. The germinated seeds were planted in three small pots at 24°C for two weeks, and the seedlings were then transplanted into three large pots (30 × 30 cm) containing ordinary soil in the Qufu Normal University greenhouse (light period: 12 h; day/night temperature: 28°C/22°C; moisture: 40%). Young flag leaves from each biological replicate were harvested 60 days after germination and were mixed together in equal quantities for RNA extraction.

2.2. RNA Extraction, Library Construction, Clustering, and Sequencing

Total RNA was extracted from leaf tissues using the Trizol method (Invitrogen). RNA concentration and quality were assessed by analyzing 1 μL of the RNA sample on an Agilent Technologies 2100 Bioanalyzer. The RNA library was constructed using a TruSeq RNA Sample Preparation Kit (RS-122-2001, Illumina) according to the manufacturer's protocols. The library was qualified using the Agilent 2100 Bioanalyzer, and quantification was conducted using Qubit and qPCR. Cluster formation and sequencing were performed on the HiSeq2000 platform following the manufacturer's cBot and sequencing protocols. The library was run on a single lane for 100 cycles (CapitalBio Corporation), and the transcriptome reads are available in GenBank under the accession number SRR1582383.

2.3. De Novo Assembly and Functional Annotation

After trimming adaptor sequences and removing low-quality reads (reads with ambiguous bases “N” and reads with bases of Q < 30), the remaining reads were assembled with Trinity software [16] to construct unique transcripts. Nonredundant unique transcripts were defined as unigenes. The trimmed Solexa transcriptome reads were mapped onto the unigenes using Bowtie2-2.2.3 software (Bowtie parameters: -v 3 --all --best --strata) [17] to detect the gene expression profiles. Functional annotations for the unigenes were conducted by sequence similarity comparisons against the nonredundant nucleotide database and the nonredundant protein database of NCBI (http://www.ncbi.nlm.nih.gov/) with BLASTx (E-value cutoff ≤ 1e-5) as well as the SWISS-PROT database (European Bioinformatics Institute, ftp://ftp.ebi.ac.uk/pub/databases/swissprot/), the Clusters of Orthologous Groups of proteins database (COG) [18, 19], and the Kyoto Encyclopedia of Genes and Genomes database (KEGG) [20] with BLASTx (E-value cutoff ≤ 1e-10). Moreover, functional assignments of the unigenes and InterPro domains [21] were further annotated using Gene Ontology (GO) [22] and InterProScan [23], and functional classifications of the unigenes were identified using WEGO software [24]. In addition, the unigenes were compared to the rice genome (Os-Nipponbare-Reference-IRGSP-1.0) by using BLASTn (E-value cutoff ≤ 1e-5).
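The read-filtering step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes Phred+33 quality encoding and interprets the filter as "discard any read containing an N or any base with quality below 30", which is one possible reading of the text. The function names (`parse_fastq`, `passes_filter`, `filter_reads`) and the example records are hypothetical.

```python
from typing import Iterator, List, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = (s.strip() for s in lines[i:i + 4])
        yield header, seq, qual


def passes_filter(seq: str, qual: str, min_q: int = 30, offset: int = 33) -> bool:
    """Assumed rule: reject reads with any 'N' or any base below min_q."""
    if "N" in seq.upper():
        return False
    # Phred+33 encoding is an assumption; the quality of a base is ord(char) - 33.
    return all(ord(ch) - offset >= min_q for ch in qual)


def filter_reads(lines: List[str], min_q: int = 30) -> List[FastqRecord]:
    """Return the records that survive the assumed quality filter."""
    return [rec for rec in parse_fastq(lines) if passes_filter(rec[1], rec[2], min_q)]


# Hypothetical toy reads, not data from the study.
example = [
    "@r1", "ACGT", "+", "IIII",   # all bases Q40: kept
    "@r2", "ACNT", "+", "IIII",   # ambiguous base: dropped
    "@r3", "ACGT", "+", "II5I",   # one base at Q20: dropped
]
kept = [header for header, _seq, _qual in filter_reads(example)]
print(kept)
```

A per-base threshold is the strictest interpretation; a mean-quality threshold would retain more reads, and the text does not say which was used.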

3. Results

3.1. De Novo Assembly of the O. officinalis Transcriptome

In this study, approximately 23 million high-quality reads with nucleotide sequences totaling 2,131,363,516 bp were obtained, and each read was 100 bp in length. Because no O. officinalis reference genome was available, all of the high-quality reads were de novo assembled using Trinity software [16]. The nonredundant assembly resulted in 68,132 unigenes with a total length of 83,266,858 bp and an average length of 1,222 bp. The length of individual assemblies ranged from 201 bp to 13,067 bp. The largest share of the assemblies (36%) was 200–500 bp long, and 20% of the assemblies were longer than 2,000 bp. The remaining assemblies fell into the 500–1,000 bp, 1,000–1,500 bp, and 1,500–2,000 bp ranges and represented 19%, 14%, and 11% of the assemblies, respectively (Figure 1). Ninety percent of the 23 million reads were successfully mapped onto the assemblies.

Figure 1

Length distribution of the assemblies.
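The length summary in Section 3.1 can be reproduced from a list of assembly lengths with a short Python sketch. The bin edges below assume the ranges are contiguous (200–500, 500–1,000, 1,000–1,500, 1,500–2,000, and >2,000 bp) and that each upper edge is inclusive; neither assumption is stated in the text. The lengths in `toy_lengths` are hypothetical, and only the total length and unigene count are taken from the text.

```python
from bisect import bisect_left
from typing import Dict, Iterable

# Upper edges (bp) of the first four bins; anything longer goes to ">2,000".
EDGES = [500, 1000, 1500, 2000]
LABELS = ["200-500", "500-1,000", "1,000-1,500", "1,500-2,000", ">2,000"]


def length_distribution(lengths: Iterable[int]) -> Dict[str, float]:
    """Return the percentage of assemblies falling into each length bin."""
    counts = [0] * len(LABELS)
    n = 0
    for length in lengths:
        # bisect_left makes each upper edge inclusive (e.g. 500 bp -> first bin).
        counts[bisect_left(EDGES, length)] += 1
        n += 1
    if n == 0:
        raise ValueError("no lengths given")
    return {label: 100.0 * c / n for label, c in zip(LABELS, counts)}


# Hypothetical lengths for illustration only.
toy_lengths = [201, 450, 500, 750, 1200, 1800, 2500, 13067]
dist = length_distribution(toy_lengths)

# Average length from the totals reported in the text.
mean_length = 83_266_858 / 68_132
print(dist, round(mean_length))
```

Dividing the reported total length by the number of unigenes gives about 1,222 bp, which agrees with the average stated above.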

3.2. Functional Annotations Using Transcript BLAST Analyses

Functional annotations of the unigenes were performed using BLAST comparisons with different databases (Table 1). Of the 68,132 unigenes, significant hits at the nucleotide level were obtained for 65,303 (96%) unigenes using the annotated sequences deposited in the nonredundant nucleotide database (E-value cutoff ≤ 1e-5) (Table 1). The sequence identities were all greater than 77%. Of the matched sequences, 92% were Oryza homologs, and 94% were orthologous to sequences from Poaceae. When comparing protein-coding sequences only, 77% of the unigenes had significant BLAST results for the nonredundant protein database (E-value cutoff ≤ 1e-5), and 48% met a slightly stricter standard (E-value cutoff ≤ 1e-10) in the SWISS-PROT protein database (Table 1). Moreover, 42,708 (83%) unigenes had high sequence identities (≥80%) with homologous sequences in the nonredundant protein database, while 7,490 (23%) unigenes shared high sequence identities (≥80%) with homologous sequences in the SWISS-PROT database (Figure 2). In addition, 54,190 (80%) O. officinalis unigenes (Table 1) were matched to the rice genome (Os-Nipponbare-Reference-IRGSP-1.0) with high sequence identities (>80%) (Figure 2), and the E-value was 0 for 60% of the matched unigenes.
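The E-value and identity thresholds used in this section can be applied to BLAST results with a small Python sketch. It assumes the hits are stored in the standard 12-column tabular format (BLAST+ `-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the paper does not say which output format was used. The helper names and the four example rows are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float
    evalue: float
    bitscore: float


def parse_hit(line: str) -> Hit:
    """Parse one tab-separated line in the assumed 12-column layout."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], f[1], float(f[2]), float(f[10]), float(f[11]))


def best_hits(lines: Iterable[str], max_evalue: float,
              min_identity: Optional[float] = None) -> Dict[str, Hit]:
    """Keep, per query, the lowest-E-value hit that passes the thresholds."""
    best: Dict[str, Hit] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        hit = parse_hit(line)
        if hit.evalue > max_evalue:
            continue
        if min_identity is not None and hit.identity < min_identity:
            continue
        current = best.get(hit.query)
        # Ties on E-value are broken by the higher bit score.
        if current is None or (hit.evalue, -hit.bitscore) < (current.evalue, -current.bitscore):
            best[hit.query] = hit
    return best


# Hypothetical hits for three unigenes (u1, u2, u3).
rows = [
    "u1\tsA\t95.0\t300\t10\t0\t1\t300\t1\t300\t1e-50\t500",
    "u1\tsB\t85.0\t300\t40\t0\t1\t300\t1\t300\t1e-20\t300",
    "u2\tsC\t78.0\t200\t44\t0\t1\t200\t1\t200\t1e-3\t50",
    "u3\tsD\t79.5\t250\t50\t1\t1\t250\t1\t250\t1e-8\t90",
]
annotated = best_hits(rows, max_evalue=1e-5)
high_identity = best_hits(rows, max_evalue=1e-5, min_identity=80.0)
print(sorted(annotated), sorted(high_identity))
```

Counting the keys of `annotated` gives the number of unigenes with a significant hit, and `high_identity` corresponds to the ≥80% identity subset described in the text.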

Table 1

Functional annotations using transcript BLAST analyses.

Database	Number of hit unigenes	Percent (hits/total)
NT	65303	95.8%
NR	52210	76.6%
SWISS-PROT	32410	47.6%
COG	17643	25.9%
KEGG	47564	69.8%
GO	23568	34.6%
InterPro	27935	41.0%
Rice	54190	79.5%

NT: the nonredundant nucleotide database; NR: the nonredundant protein database; SWISS-PROT: SWISS-PROT database; COG: the Clusters of Orthologous Groups of proteins database; KEGG: the Kyoto Encyclopedia of Genes and Genomes database; GO: the Gene Ontology Consortium database; InterPro: InterPro domains database; Rice: Os-Nipponbare-Reference-IRGSP-1.0.
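As a consistency check, the percentages in Table 1 can be recomputed from the hit counts and the total of 68,132 unigenes. The sketch below uses only numbers from the table; the variable names are illustrative.

```python
TOTAL_UNIGENES = 68_132

# Hit counts copied from Table 1.
hits = {
    "NT": 65_303,
    "NR": 52_210,
    "SWISS-PROT": 32_410,
    "COG": 17_643,
    "KEGG": 47_564,
    "GO": 23_568,
    "InterPro": 27_935,
    "Rice": 54_190,
}

# Percentage of all unigenes with a hit in each database, to one decimal place.
percent = {db: round(100.0 * n / TOTAL_UNIGENES, 1) for db, n in hits.items()}

for db, pct in percent.items():
    print(f"{db}\t{hits[db]}\t{pct}%")
```

The recomputed values agree with the percentages listed in the table.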

Figure 2

Sequence identities between query and subject sequences from comparisons against the nonredundant protein database (NR), the SWISS-PROT database, and the rice genome (Os-Nipponbare-Reference-IRGSP-1.0).

3.3. Functional Annotation Based on Gene Ontology (GO) and InterPro

Gene Ontology (GO) provides a systematic language to describe the attributes of genes and gene products and includes three key biological domains that are shared by all organisms: molecular function, biological process, and cellular component [25]. In this study, of the 68,132 unigenes, 23,568 (Table 1) were assigned to at least one of the three biological domains (18,364 for molecular function, 30,646 for biological process, and 33,608 for cellular component) and 34 GO subcategories (Figure 3). In addition, a relatively high proportion of unigenes were grouped into the following GO annotations: “binding” (14,485 unigenes), “catalytic activity” (12,396 unigenes), “metabolic process” (12,287 unigenes), “cellular process” (10,746 unigenes), “cell” (6,595 unigenes), and “cell part” (6,595 unigenes).

Figure 3

Functional annotation based on Gene Ontology (GO).

Protein functions were also predicted by InterPro [21]. Forty-one percent of the unigenes (27,935) were matched to known protein domains (Table 1), and all of the InterPro entries could be mapped to the 34 GO subcategories described above. In addition, based on the above BLAST results and rice functional annotations, we identified a total of 476 unigenes (GenBank accession GBRJ00000000) that were related to disease resistance in the O. officinalis leaf transcriptome (see Table S1 in Supplementary Material available online at http://dx.doi.org/10.1155/2015/982065). Of these unigenes, 442 had complete open reading frames, and the coding region lengths of the unigenes varied from 102 bp to 3,774 bp. The GC content of the coding regions ranged from 28% to 76%. Expression levels also differed among the unigenes, with an approximately 1,500-fold difference between the most and least highly expressed unigenes. Functional annotations showed that the resistance unigenes were all associated with nucleotide-binding site (NBS) and leucine-rich repeat (LRR) proteins.
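The GC content and expression-divergence statistics reported for the resistance unigenes can be computed as in the following Python sketch. The paper does not describe how either value was calculated, so both functions are assumptions: GC content is taken over unambiguous A/C/G/T bases only, and fold divergence is the simple ratio of the highest to the lowest expression value. The sequence and expression values are hypothetical.

```python
def gc_percent(seq: str) -> float:
    """GC content (%) over unambiguous bases; N and other codes are ignored."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    acgt = gc + sum(seq.count(base) for base in "AT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * gc / acgt


def fold_divergence(values) -> float:
    """Ratio of the highest to the lowest expression value (all must be > 0)."""
    values = list(values)
    if not values or min(values) <= 0:
        raise ValueError("need at least one value, and all values must be positive")
    return max(values) / min(values)


# Hypothetical coding sequence and expression values, for illustration only.
toy_cds = "ATGGCGTAA"
toy_expression = [0.5, 12.0, 750.0]
print(round(gc_percent(toy_cds), 1), fold_divergence(toy_expression))
```

Applied to each of the 442 complete coding regions, `gc_percent` would yield the kind of per-unigene range quoted in the text.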

3.4. COG Classification and KEGG Pathway Mapping

The COG (Clusters of Orthologous Groups) database was designed to classify proteins on the basis of the orthology concept. The database includes 66 unicellular prokaryotic genomes and seven multicellular eukaryotic genomes [18, 19]. Using BLAST searches against these genomes (E-value cutoff ≤ 1e-10), 17,643 (26%) unigenes of the O. officinalis leaf transcriptome that showed significant homology scores (Table 1) were classified into 25 clusters. Among the clusters, the COG “general function prediction only” cluster matched the highest proportion (approximately 23%) of unigenes. Additionally, 7.94%, 7.75%, and 7.47% of the unigenes were classified into the “carbohydrate transport and metabolism,” “translation, ribosomal structure and biogenesis,” and “posttranslational modification, protein turnover, chaperones” clusters, respectively. In contrast, only 0.32% and 0.02% of the unigenes were assigned to the “cell motility” and “nuclear structure” clusters, respectively (Figure 4).

Figure 4

Clusters of Orthologous Group classification.

Approximately 70% (47,564) of the unigenes were matched to homologs in the KEGG database (Table 1), and 22% (10,476) of those could be mapped to at least one biological pathway. These pathways were mainly associated with five categories: “metabolism” (9,420 unigenes), “genetic information processing” (6,884 unigenes), “environmental information processing” (1,227 unigenes), “cellular processes” (1,300 unigenes), and “human diseases” (1,166 unigenes). The specific pathways, including “spliceosome,” “purine metabolism,” “ribosome,” “RNA transport,” “starch and sucrose metabolism,” and the other 20 top mapped pathways, are shown in Figure 5.

Figure 5

The top 25 mapped pathways annotated by the KEGG database.

4. Discussion

The wild relatives of cultivated rice contain abundant genetic diversity that can be used to improve cultivated rice quality and yield. However, the comprehensive genetic backgrounds of different Oryza species are largely unknown. Fortunately, this situation has begun to change, and a series of studies based on genome-wide sequencing have focused on this important agricultural genus. In addition to the greatly improved reference genome of Asian cultivated rice [26, 27], the whole genomes of African cultivated rice (O. glaberrima Steud.) [28] and a distant relative of cultivated rice (O. brachyantha A. Chev. and Roehr., FF genome) [29] have been published. The complete reference assemblies for some AA genomic wild rice species (O. barthii A. Chev., O. glumaepatula Steud., and O. nivara S. D. Sharma and Shastry) and a BB genomic wild rice species (O. punctata Kotschy ex Steud.) are continuously released and upgraded by the Oryza Map Alignment Project (OMAP) and the Oryza Genome Evolution (OGE) Project (http://www.genome.arizona.edu/modules/publisher/item.php?itemid=7) before publication. These projects also include the chromosome 3 short-arm assemblies of an additional eight Oryza species, including one polyploid, and the outgroup species Leersia perrieri (A. Camus) Launert [9]. These works greatly extend our knowledge of the genomes of different Oryza species.

Here, using next-generation sequencing methods, we sequenced the leaf transcriptome of O. officinalis, a CC genomic wild rice, and de novo assembled 68,132 unigenes based on approximately 23 million transcriptome reads. Although our study involved only one vegetative organ (leaves), this primary gene identification has broadened our understanding of the genetic background of the non-AA genomic wild rice species. For example, of the annotated unigenes, 49% were assigned to cellular components, 45% to biological processes, and 27% to molecular functions (Figure 3). Moreover, four GO subcategory annotations (“binding,” “catalytic activity,” “metabolic process,” and “cellular process”) were each assigned to more than 10,000 unigenes. In addition, organ-specific gene expression patterns were revealed in the O. officinalis leaf transcriptome. The top five highly expressed unigenes were related to “ribulose bisphosphate carboxylase small chain,” “chlorophyll A-B binding protein,” “myosin-Vb,” “ubiquitin fusion degradation protein,” and “carbonic anhydrase” (Figure 6). Of these five functions, three (“ribulose bisphosphate carboxylase small chain,” “chlorophyll A-B binding protein,” and “carbonic anhydrase”) are involved in photosynthetic processes. The COG BLAST analyses and the KEGG pathway mapping further indicated that many leaf transcripts were associated with “carbohydrate transport and metabolism” and “translation, ribosomal structure and biogenesis” (Figure 4) and were included in the pathways of “ribosome,” “starch and sucrose metabolism,” “glycolysis/gluconeogenesis,” and “oxidative phosphorylation.”

Figure 6

Biological functions of the top five highly expressed unigenes in the O. officinalis leaf transcriptome.

In this study, we also identified and characterized 476 unigenes associated with disease resistance (Table S1). These unigenes belong to an ancient family of genes encoding proteins with nucleotide-binding site (NBS) and leucine-rich repeat (LRR) domains. NBS-LRR genes, which control resistance to a wide variety of pathogens and pests, are one of the largest classes of plant disease resistance genes. In cultivated rice, 581 potential NBS-encoding sequences have been identified in the Nipponbare rice genome, and 100 of them were predicted to be probable pseudogenes [30]. Given that NBS-LRR genes play very important roles in disease defense, the expressed sequences identified in O. officinalis are valuable genetic resources for cultivated rice breeding and quality improvement.

5. Conclusion

The present transcriptome analysis provides useful data on expressed genes of O. officinalis, with 68,132 unigenes identified. These data are valuable resources for broadening our understanding of the genetic background of non-AA genomic wild rice species and are potential resources for increasing cultivated rice quality and yield. Furthermore, from an evolutionary point of view, the transcriptome of O. officinalis, which has a CC genome, provides a bridge to further study the other two diploid CC genome species (O. eichingeri and O. rhizomatis) and their gene pools, and it also provides the ability to distinguish the subgenome constitution of the six allotetraploids (BBCC genome: O. malampuzhaensis, O. minuta, and O. punctata; CCDD genome: O. alta, O. grandiglumis, and O. latifolia) that contain the CC genome.

Data represent the assembled unigenes of O. officinalis assigned to functions related to disease resistance by comparison against the rice genome (Os-Nipponbare-Reference-IRGSP-1.0) at E ≤ 1e-5.

29 in total

1. The Gene Ontology (GO) database and informatics resource.

Authors: M A Harris; J Clark; A Ireland; J Lomax; M Ashburner; R Foulger; K Eilbeck; S Lewis; B Marshall; C Mungall; J Richter; G M Rubin; J A Blake; C Bult; M Dolan; H Drabkin; J T Eppig; D P Hill; L Ni; M Ringwald; R Balakrishnan; J M Cherry; K R Christie; M C Costanzo; S S Dwight; S Engel; D G Fisk; J E Hirschman; E L Hong; R S Nash; A Sethuraman; C L Theesfeld; D Botstein; K Dolinski; B Feierbach; T Berardini; S Mundodi; S Y Rhee; R Apweiler; D Barrell; E Camon; E Dimmer; V Lee; R Chisholm; P Gaudet; W Kibbe; R Kishore; E M Schwarz; P Sternberg; M Gwinn; L Hannick; J Wortman; M Berriman; V Wood; N de la Cruz; P Tonellato; P Jaiswal; T Seigfried; R White
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

Review 2. Transcriptomics in the RNA-seq era.

Authors: Paul A McGettigan
Journal: Curr Opin Chem Biol Date: 2013-01-02 Impact factor: 8.822

3. Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial.

Authors: Jochen B W Wolf
Journal: Mol Ecol Resour Date: 2013-04-27 Impact factor: 7.090

4. A comparison of transcriptome and epigenetic status between closely related species in the genus Arabidopsis.

Authors: Takahiro Kawanabe; Ryo Fujimoto; Taku Sasaki; Jennifer M Taylor; Elizabeth S Dennis
Journal: Gene Date: 2012-07-13 Impact factor: 3.688

5. Development of monosomic alien addition lines and introgression of genes from Oryza australiensis Domin. to cultivated rice O. sativa L.

Authors: D S Multani; K K Jena; D S Brar; B G de Los Reyes; E R Angeles; G S Khush
Journal: Theor Appl Genet Date: 1994-04 Impact factor: 5.699

6. From genomics to chemical genomics: new developments in KEGG.

Authors: Minoru Kanehisa; Susumu Goto; Masahiro Hattori; Kiyoko F Aoki-Kinoshita; Masumi Itoh; Shuichi Kawashima; Toshiaki Katayama; Michihiro Araki; Mika Hirakawa
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

7. The Gene Ontology project in 2008.

Authors:
Journal: Nucleic Acids Res Date: 2007-11-04 Impact factor: 16.971

8. The COG database: an updated version includes eukaryotes.

Authors: Roman L Tatusov; Natalie D Fedorova; John D Jackson; Aviva R Jacobs; Boris Kiryutin; Eugene V Koonin; Dmitri M Krylov; Raja Mazumder; Sergei L Mekhedov; Anastasia N Nikolskaya; B Sridhar Rao; Sergei Smirnov; Alexander V Sverdlov; Sona Vasudevan; Yuri I Wolf; Jodie J Yin; Darren A Natale
Journal: BMC Bioinformatics Date: 2003-09-11 Impact factor: 3.169

9. Transcriptome analysis in sheepgrass (Leymus chinensis): a dominant perennial grass of the Eurasian Steppe.

Authors: Shuangyan Chen; Xin Huang; Xueqing Yan; Ye Liang; Yuezhu Wang; Xiaofeng Li; Xianjun Peng; Xingyong Ma; Lexin Zhang; Yueyue Cai; Tian Ma; Liqin Cheng; Dongmei Qi; Huajun Zheng; Xiaohan Yang; Xiaoxia Li; Gongshe Liu
Journal: PLoS One Date: 2013-07-04 Impact factor: 3.240

10. A detailed gene expression study of the Miscanthus genus reveals changes in the transcriptome associated with the rejuvenation of spring rhizomes.

Authors: Adam Barling; Kankshita Swaminathan; Therese Mitros; Brandon T James; Juliette Morris; Ornella Ngamboma; Megan C Hall; Jessica Kirkpatrick; Magdy Alabady; Ashley K Spence; Matthew E Hudson; Daniel S Rokhsar; Stephen P Moose
Journal: BMC Genomics Date: 2013-12-09 Impact factor: 3.969

5 in total

1. Transcriptome analysis and molecular marker discovery in Solanum incanum and S. aethiopicum, two close relatives of the common eggplant (Solanum melongena) with interest for breeding.

Authors: P Gramazio; J Blanca; P Ziarsolo; F J Herraiz; M Plazas; J Prohens; S Vilanova
Journal: BMC Genomics Date: 2016-04-23 Impact factor: 3.969

2. De Novo Transcriptome Sequencing of Oryza officinalis Wall ex Watt to Identify Disease-Resistance Genes.

Authors: Bin He; Yinghong Gu; Xiang Tao; Xiaojie Cheng; Changhe Wei; Jian Fu; Zaiquan Cheng; Yizheng Zhang
Journal: Int J Mol Sci Date: 2015-12-10 Impact factor: 5.923

3. OrMKK3 Influences Morphology and Grain Size in Rice.

Authors: Ying Hua Pan; Li Jun Gao; Yun Tao Liang; Yan Zhao; Hai Fu Liang; Wei Wei Chen; Xing Hai Yang; Dong Jin Qing; Ju Gao; Hao Wu; Juan Huang; Wei Yong Zhou; Cheng Cui Huang; Gao Xing Dai; Guo Fu Deng
Journal: J Plant Biol Date: 2021-01-04 Impact factor: 3.150

4. An atlas connecting shared genetic architecture of human diseases and molecular phenotypes provides insight into COVID-19 susceptibility.

Authors: Liuyang Wang; Thomas J Balmat; Alejandro L Antonia; Florica J Constantine; Ricardo Henao; Thomas W Burke; Andy Ingham; Micah T McClain; Ephraim L Tsalik; Emily R Ko; Geoffrey S Ginsburg; Mark R DeLong; Xiling Shen; Christopher W Woods; Elizabeth R Hauser; Dennis C Ko
Journal: medRxiv Date: 2020-12-22

5. Evolutionary insights from de novo transcriptome assembly and SNP discovery in California white oaks.

Authors: Shawn J Cokus; Paul F Gugger; Victoria L Sork
Journal: BMC Genomics Date: 2015-07-28 Impact factor: 3.969
