Rice DB: an Oryza Information Portal linking annotation, subcellular location, function, expression, regulation, and evolutionary information for rice and Arabidopsis.

Reena Narsai, James Devenish, Ian Castleden, Kabir Narsai, Lin Xu, Huixia Shou, James Whelan.

Abstract

Omics research in Oryza sativa (rice) relies on the use of multiple databases to obtain different types of information to define gene function. We present Rice DB, an Oryza information portal that is a functional genomics database, linking gene loci to comprehensive annotations, expression data and the subcellular location of encoded proteins. Rice DB has been designed to integrate the direct comparison of rice with Arabidopsis (Arabidopsis thaliana), based on orthology or 'expressology', thus using and combining available information from two pre-eminent plant models. To establish Rice DB, gene identifiers (more than 40 types) and annotations from a variety of sources were compiled, functional information based on large-scale and individual studies was manually collated, hundreds of microarrays were analysed to generate expression annotations, and the occurrences of potential functional regulatory motifs in promoter regions were calculated. A range of computational subcellular localization predictions were also run for all putative proteins encoded in the rice genome, and experimentally confirmed protein localizations have been collated, curated and linked to functional studies in rice. A single search box allows anything from gene identifiers (for rice and/or Arabidopsis), motif sequences, subcellular location, to keyword searches to be entered, with the capability of Boolean searches (such as AND/OR). To demonstrate the utility of Rice DB, several examples are presented including a rice mitochondrial proteome, which draws on a variety of sources for subcellular location data within Rice DB. Comparisons of subcellular location, functional annotations, as well as transcript expression in parallel with Arabidopsis reveals examples of conservation between rice and Arabidopsis, using Rice DB (http://ricedb.plantenergy.uwa.edu.au).

Keywords: Arabidopsis; Arabidopsis thaliana; Oryza sativa; protein; rice; subcellular location; transcript expression

Year: 2013 PMID: 24147765 PMCID: PMC4253041 DOI: 10.1111/tpj.12357

Source DB: PubMed Journal: Plant J ISSN: 0960-7412 Impact factor: 6.417

Introduction

The sequencing of the Arabidopsis thaliana (Arabidopsis) genome in 2000, followed by that of Oryza sativa (rice) and an increasing number of other species (Swarbreck ; Youens-Clark ) has stimulated efforts to define the functions of all genes in a plant genome, such as the Arabidopsis 2010 and RICE2020 projects, respectively (Chory ; Zhang ). Rice is a major food-producing crop and important monocot plant model (Han ). Thus, the completed rice genome sequence in 2005 enabled genome-wide approaches to be applied to rice research (Rice Genome, 2005). Functional annotation of genes depends to a large degree on various omic(s) approaches, such as transcriptomic, proteomic and/or metabolomic data sets used to provide insight under altered genetic/environmental conditions: for example in Arabidopsis (Borevitz and Ecker, 2004; Nordborg and Weigel, 2008). A good example of integrating traditional and post-genomic research in rice is the transcriptomic analysis of super-hybrid rice and its parents (Gibbs ). The advent of these omic data sets has led to the generation of various web-based public databases that give access to this data in ways that would be useful to scientists (Long ). This flow of research from genomics to web-based public databases in plants is best seen for Arabidopsis, with more recent updates also including rice. A number of expression-based databases [the Bio-Array Resource (BAR; Schroder ) and Genevestigator (Wells )], protein-based databases (ARAMEMNON; Schwacke ) and metabolite databases [Golm Metabolome Database (Kopka ) and Madison-Qingdao Metabolomics Consortium Database (Cui )] are now available. A number of specialized databases are also available for Arabidopsis only, including the Arabidopsis Predicted Interactome (Geisler-Lee ) and the Subcellular Location Database for Arabidopsis (SUBA; Heazlewood ; Tanz ). 
A key element amongst these Arabidopsis databases is their integration via an international combined effort, the Multinational Arabidopsis Steering Committee, which provides an avenue to promote co-operative development and integration of resources. An example of this is the MASCP Gator database, which is a portal that draws on proteome data from a variety of databases for Arabidopsis (Joshi ). Also, the single major Arabidopsis database, The Arabidopsis Information Resource (TAIR), provides a place where data from various sources are combined (2005). The importance of rice as a basic food source for approximately three billion people has led to a variety of independent post-genomic resources that have been applied in various studies (http://www.irri.org). However, unlike TAIR, there is not a single unified database for rice, with both the MSU Rice Genome Annotation Project (RGAP; Ouyang ) and the Rice Annotation Project Database (RAP-DB; Tanaka ) presenting mainly sequence and annotation information for all rice genes. Both of these databases are extremely useful for rice research, and have even incorporated new functions such as simple keyword searches and even gene expression information in RGAP. Also, both of these databases have now facilitated integration, by allowing the conversion of rice identifiers between these databases. Other useful databases for rice include specialized expression and co-expression databases such as the Rice Oligonucleotide Array Database (Jung ), RiceXPro (Sato ), Oryzaexpress (Hamada ), RiceFREND (Sato ) and the Rice BAR-eFP browser (Toufighi ; Patel ), which all provide useful ways to examine transcript expression patterns in rice. Additionally, the Oryzabase database shows extensive rice information, curated by rice researchers (Kurata and Yamazaki, 2006). Another important rice resource is the Gramene database, which now also encompasses several genomes for rice and other species (Youens-Clark ). 
Gramene is very powerful for the comparison of genomes and genes using a variety of tools such as whole-genome alignments, evolutionary relationships of genes, synteny and genetic diversity, as well as linking to various resources for visualization of biochemical pathways (Jaiswal ). Similarly, the SALAD database (Mihara ) uses protein sequence information across several different plant species, including rice, to gain insight into protein function by revealing conserved protein motifs. Despite the various specialized rice databases available, accessing and integrating these information sources is still challenging, especially because of the use of non-systematic identifiers. Furthermore, the outputs of these individual databases are often obtained in a piecemeal manner, requiring manual integration. Many data sets contain few, if any, direct references to subcellular location, which is arguably one of the most crucial pieces of information for determining the function of any protein (Koroleva ). The multicompartmental nature of eukaryotic cells allows specialization of function, which is reflected in the subcellular compartments of organelles. In Arabidopsis, the subcellular proteome is well advanced, with experimentally confirmed proteomes determined for the cell wall, plasma membrane, nucleus, plastid, mitochondria, Golgi, endoplasmic reticulum, peroxisome and cytosol (Agrawal ). In contrast, no comparably complete subcellular proteome map exists for rice to show how the subcellular proteomes of these two important plant models compare with each other. Additionally, there is currently no rice database that provides subcellular location information, in terms of collating the experimentally determined and predicted locations of rice proteins, as shown in SUBA, the subcellular location database for Arabidopsis (Heazlewood ; Tanz ). 
Finally, research to elucidate the function of rice genes, i.e. for predicted proteins, would benefit greatly from direct integration (and parallel comparison) of information on orthologous genes in Arabidopsis. To address these issues, as well as provide completely new resources (including subcellular location information), a database for rice has been developed that integrates all of the above types of information. The rice database presented is a functional genomics database that performs integrated searches for functional annotations, subcellular location, expression levels, and putative or known regulatory elements, as well as showing orthologues between rice and Arabidopsis. Furthermore, the direct comparison with Arabidopsis in relation to a number of features, including function and subcellular location, facilitates greater understanding of the possible functions and locations of rice proteins for which this information is unknown. Thus, Rice DB has been designed as a multilevel network that would be useful for researchers in various areas, from genomics and transcriptomics to proteomics. To present the ease of use and functions available in Rice DB, the rice mitochondrial proteome as well as other examples are shown.

Results

Integrating information at a single website

For Arabidopsis, the TAIR database encompasses sequence information, function and expression annotations, bulk downloads and useful tools for gene/protein analysis (TAIR, 2005). In contrast, although the existing rice databases provide very useful and accurate information, it is not a simple task to move between them, and subcellular location information is almost completely lacking for rice. In Rice DB, a range of data types were collated and networked, combining substantial volumes of curated and computed data generated in-house specifically for Rice DB (Figure 1a; Table 1).

Figure 1

The user-friendly interface of Rice DB. (a) The major data types presented in Rice DB, showing the linked connections between them. The colour coding of each data type is maintained in the headings and side bars. (b) Screenshot showing the front page, where links to information about Rice DB, including the ‘About’, ‘Data’ and ‘Tutorial’ sections are shown on the left, next to a large search box allowing various entries. Note Boolean queries (AND/OR) are allowed. The right column shows examples of recent queries. The output after the search for arginase is shown here as a summary. Although ‘arginase’ returns one gene, note that when multiple genes are entered multiple rows are shown. (c) The summary output when ‘arginase’ was searched. Details of each data type are shown, with the coloured side bars representing the data type.

Table 1

Outline of data presented in Rice DB

Data in Rice DB	Data subtype (brief description)	References
Alternative identifiers	MSU RGAP identifiers	Ouyang et al. (2007)
All identifiers were collated and matched to MSU identifiers; gene symbols were specifically curated and added into Rice DB	RAP-DB identifiers and descriptions	Tanaka et al. (2008)
	Gramene GenesDB symbol namesᵃ	Youens-Clark et al. (2011), and curated for Rice DB
	Probe set IDs – Rice Oligo. Array DB (ROAD)	Jung et al. (2008b)
	Oryzabase (inc. RefSeq, Unigene etc.)	Kurata and Yamazaki (2006)
	Arabidopsis: AGIs were used as identifiers	Swarbreck et al. (2008)
Annotations	MSU putative function annotation	Ouyang et al. (2007)
Annotations were compiled to enable keyword searches across any of the sources.	RAP gene description	Tanaka et al. (2008)
	Genebin (Funcats at ANU)ᵃ	Goffard and Weiller (2007), and curated for Rice DB
	GO slim: ontology, domain (C/F/P)	Ouyang et al. (2007)
	Domain annotations:	
	Coil predictions	http://www.ebi.ac.uk/Tools/pfa/iprscan/
	FingerPRINTScan	Attwood et al. (2012)
	Gene3D	http://gene3d.biochem.ucl.ac.uk/
	InterPro	http://www.ebi.ac.uk/interpro/
	PROSITE	Sigrist et al. (2010)
	Panther	Thomas et al. (2003)
	Pfam	Punta et al. (2012)
	SMART	Letunic et al. (2012)
	SuperFamily	Madera et al. (2004)
	TMHMM	http://www.cbs.dtu.dk/services/TMHMM/
	Arabidopsis: annotations were derived from TAIR	Swarbreck et al. (2008)
Transcript data	Transcript expression: data from multiple sources (see Experimental procedures)	
Microarray data were normalized and expression annotations were generated for Rice DB. The occurrence of all possible hexamers and cis-acting elements known to function in regulating expression has been calculated and shown. Known miRNA targets have also been matched and annotated.	RNA tissue data (no. tissues expressed in)	Analysed and compiled for Rice DB
	Expressed in (‘expression annotation’)	Analysed, compiled and annotated for Rice DB
	Stress expression (‘expression annotation’)	Analysed, compiled and annotated for Rice DB
	Experimentally confirmed DNA binding motifs: matched in 1 kb upstream regions, occurrences calculated	
	Motifs/CAREs known to be functional	AGRIS, Athamap, matched in rice promoters
	miRNAs: known miRNA targets and sequences from the cited study are annotated	
	miRNA target and sequence information	Jeong et al. (2011)
	Arabidopsis: transcript expression data also shown (see Experimental procedures)	
Protein data	Predicted locations: output from each predictor is presented for Rice DB	
Peptide sequences for all the encoded genes in the rice genome were run through each computational predictor, and outputs are presented in Rice DB. In addition, published literature was searched and compiled showing experimentally determined localization. For those with experimentally determined localization, the phenotype is indicated if one was determined upon genetic alteration.	Ambiguous targeting predictor	Mitschke et al. (2009)
	ChloroP location predictor	Emanuelsson et al. (2007)
	MitoProt 2 location predictor	Claros (1995)
	PredictNLS location predictor	https://rostlab.org/owiki/index.php/PredictNLS
	Predotar location predictor	Small et al. (2004)
	ProteinProwler location predictor	Hawkins and Boden (2006)
	PTS1 location predictor	Neuberger et al. (2003)
	SignalP location predictor	Emanuelsson et al. (2007)
	TargetP location predictor	Emanuelsson et al. (2007)
	WoLFPSort location predictor	Horton et al. (2007)
	YLoc location predictor	Briesemeister et al. (2010)
	Experimentally confirmed locations: compiled for Rice DB	
	Experimentally confirmed phenotypes: compiled for Rice DB	
	Arabidopsis: predicted and experimental locations shown from SUBA	Heazlewood et al. (2005)
Orthology with Arabidopsis thaliana		
Orthology data were matched and compiled for Rice DB from Inparanoid, Gramene and Expressologs.	Inparanoid orthology: clusters of orthologous genes with scores are shown	Ostlund et al. (2010)
	Gramene orthology: sequence identity between genes is shown	Youens-Clark et al. (2011)
	Expressologs: closest Expressolog gene is indicated	Patel et al. (2012)

ᵃManual amendment or addition to this category for Rice DB. Details of all these sources, including web links, are also shown on the Data page in Rice DB.

Firstly, it is immediately apparent that a range of identifier types are cited across different publications pertaining to rice, making it difficult to track a specific gene of interest. In Rice DB, different identifier types were collated from the numerous sources listed in Table 1, and any of these, including commonly used names (e.g. arginase) and gene symbols (e.g. OsARG) can be searched, and are converted to the commonly used Michigan State University (MSU) identifier. Given the multiple rice resources, there are also multiple putative function annotations for genes; thus putative function and domain annotations were collated and integrated in-house for Rice DB (Figure 1a; Table 1). Additionally, hundreds of microarrays were analysed in-house and expression information was collated to indicate whether or not a gene(s) is expressed (Figure 1a; Table 1). 
Furthermore, an extensive collection of protein information is presented in Rice DB: protein properties, i.e. amino acid length, molecular weight and isoelectric point (calculated in-house; http://web.expasy.org/compute_pi), predicted subcellular location (computed in-house; Table 1), and manually compiled, experimentally determined subcellular location and/or phenotype information (collated in-house from publications). Unlike other rice databases, Rice DB also makes it possible to view this information for Arabidopsis orthologues in parallel: the Arabidopsis gene identifiers (AGIs), locus annotations (from TAIR), transcript expression (also analysed in-house), and predicted and experimentally determined locations of Arabidopsis proteins (from SUBA; Heazlewood ; Tanz ) are also shown (indicated by the bold green borders of boxes in Figure 1a). Thus, users can begin with any data type in Rice DB, from genetic data (e.g. a specific promoter motif of interest) and transcript data (e.g. a differentially expressed gene list) to protein data (e.g. a protein list following mass spectrometry).

Oryza information portal: ‘Google for rice’

As well as containing a wealth of information (Table 1), Rice DB presents a powerful ‘Google’-style search engine: a wide range of searches can be entered into one search box. These include keywords (e.g. arginase; Figure 1b), a range of identifiers (e.g. LOC_Os04g01590.1, Os04g106300, Q7X7N2, etc.), promoter motifs (e.g. AGATAG), domain annotations (e.g. Ureohydrolase), subcellular localization (e.g. mitochondria) and even AGIs (e.g. At4g08900.1). Note that the latter shows rice genes that are either orthologues or expressologs of Arabidopsis genes, where an expressolog is defined on the basis of both orthology and conserved expression (Patel ). Boolean searches in Rice DB (such as AND/OR) also allow domain specialists to search and refine for specific fields, involving different combinations of information, such as annotation and protein location, e.g. ‘kinase AND mitochondria in experimental location’; annotation and transcript expression, e.g. ‘kinase AND expressed in seed’; and subcellular location and expression, e.g. ‘mitochondria in experimental location AND expressed in seed’. This is a valuable feature that will be useful to researchers at any level. Note that although keyword searches within the MSU RGAP and RAP-DB are possible, and are very useful, these are limited to finding keywords within the putative function annotations of each respective database. Furthermore, one of the intuitive features of Rice DB is that it detects typing errors and makes suggestions, e.g. if ‘kina’ or ‘kinose’ is searched, Rice DB will respond with ‘Did you mean kinase?’ A recent study revealed the crucial role of a mitochondrial arginase (LOC_Os04g01590.1) in panicle development and grain production in rice (Ma ). This rice arginase, named OsARG, is used as an example gene to illustrate the functions and data available in Rice DB (Figure 1b,c; tutorial examples below the Rice DB search box). 
Upon searching for ‘arginase’, a summary for LOC_Os04g01590.1 is returned, using a systematic on-screen overview of the current gene research, with hyperlinks and side-by-side drill-down to scholarly data (Figure 1c). A colour-coded system was also created for Rice DB, to clarify the type of information being presented throughout the website (Figure 1c). There is a multitude of information at each of the individual levels (identifiers, annotations, etc.), thus only a short summary is shown after the search, with the option of viewing extended pop-up information by clicking on the magnifying glasses or opening the flat file by clicking the locus identifier (Figure 1c). Although the sequencing and annotation of the rice genome has resulted in labelling all gene loci with systematic identifiers, many (or even the majority of) studies to date tend not to use any common or consistent identifier(s) in publications. This was particularly apparent upon searching and compiling hundreds of publications for the purpose of adding subcellular location and phenotype data to Rice DB. For example, one study of a luminal binding protein named it BiP and gave AK065743 as the RefSeq accession (Yasuda ), whereas another study named the same protein BiP3 and gave the MSU identifier LOC_Os02g02410.1 (Park ). To discover whether these referred to the same gene/protein, AK065743 was first searched in the National Center for Biotechnology Information (NCBI) database, which showed that it refers to Os02g0115900 [a Rice Annotation Project (RAP) identifier]; this was then converted to an MSU identifier using either the MSU or RAP database, revealing that it also converts to LOC_Os02g02410.1. This was only one of several such examples, and this lack of consistency and the resulting difficulty in identifying genes/proteins can lead, and has led, to different groups characterising the same protein. 
Thus, in Rice DB, alternative gene/protein/microarray identifiers are all automatically converted to the standard MSU identifier (LOC_OsXgXXXXX.X), and all identifiers can be viewed in parallel (orange column; Figure 1c). This conversion is a prerequisite before any integration of data sets can be routinely achieved. Thus, a search using the gene symbol ‘OsARG’ or protein accessions ‘Os04g106300’ or ‘Q7X7N2’ also displays the same summary after conversion to the same MSU identifier ‘LOC_Os04g01590.1’ (as seen in Figure 1c). Details of each data type shown in the summary can also quickly be viewed by clicking on the magnifying glasses (Figure 1c), or in the case of multiple genes, by clicking on the spreadsheet icons to the left of the column headings. For example, the ‘tissues’ column shows that the arginase (LOC_Os04g01590.1) is expressed in 41 out of 41 analysed tissues, and clicking on the magnifying glass shows a pop-up box with the expression intensities indicated as a heat map (yellow sidebar; Figure 1c). Similarly, for the experimentally confirmed motifs, the name, sequence and hyperlinked PubMed identifier linking to the resource presenting this motif are shown (pop-up with yellow sidebar; Figure 1c). Three columns are shown for protein information, and the pop-up windows with the blue side bars show the details of these (Figure 1c). Firstly, the output from the different protein sublocation predictors is shown, e.g. when the arginase protein sequence was analysed by a range of subcellular location predictors, five out of the six predicted a mitochondrial location (e.g. MitoProt, PProwler, etc.; Figure 1c). Secondly, the experimentally demonstrated protein localization is shown, as well as the method used and a link(s) to the publication confirming this location. 
For this arginase, a mitochondrial location was shown in two publications: one used green fluorescent protein (GFP) tagging; the other carried out organelle isolation followed by mass spectrometry to assign a mitochondrial location (Figure 1c). Lastly, a short description of the phenotype is shown for proteins with experimentally confirmed subcellular location(s). For example, it is shown that mutation in this arginase gene results in reduced plant height, and small panicle and grain size (Figure 1c). Although these pop-ups are useful for quick checks, clicking on the hyperlinked model locus (orange column; Figure 1c) opens a flat file, which first repeats the ‘Summary’ at the top of the page and then follows with details of identifiers, annotations, transcript data, protein data, orthology and expressology associated with that gene (details in Table 1, and also on the ‘Data’ page at the Rice DB website). Outputs from Rice DB have been designed for research purposes; therefore, a separate line for each gene is shown when multiple hits occur, and all information can be readily downloaded as a tab-delimited file (‘Generate TSV’ icon above grey bar; Figure 1c). Details on how to use Rice DB are also shown in the interactive ‘Tutorials’ on the Rice DB webpage (left panel; Figure 1b).
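
The identifier conversion and the ‘Did you mean …?’ behaviour described above can be illustrated with a minimal Python sketch. This is not the Rice DB implementation: the alias table, the keyword list and the function names (to_msu, suggest) are hypothetical, the suggestion step uses generic string similarity from the Python standard library as a stand-in for whatever Rice DB uses, and only the identifiers themselves are taken from the text.

```python
from difflib import get_close_matches

# Hypothetical alias table. The identifiers are those quoted in the text;
# the dictionary layout is an assumption made for illustration only.
ALIAS_TO_MSU = {
    "osarg": "LOC_Os04g01590.1",         # gene symbol
    "os04g106300": "LOC_Os04g01590.1",   # accession as quoted in the text
    "q7x7n2": "LOC_Os04g01590.1",        # protein accession
    "ak065743": "LOC_Os02g02410.1",      # accession used for BiP
    "os02g0115900": "LOC_Os02g02410.1",  # RAP identifier for the same gene
}

# Hypothetical keyword vocabulary used for spelling suggestions.
KEYWORDS = ["kinase", "arginase", "mitochondria"]


def to_msu(identifier):
    """Return the MSU identifier for a known alias, or None if unknown."""
    cleaned = identifier.strip()
    if cleaned.lower().startswith("loc_os"):
        return cleaned  # already an MSU identifier
    return ALIAS_TO_MSU.get(cleaned.lower())


def suggest(term):
    """Return the closest known keyword for a possibly mistyped term, or None."""
    matches = get_close_matches(term.lower(), KEYWORDS, n=1, cutoff=0.6)
    return matches[0] if matches else None


if __name__ == "__main__":
    for query in ("OsARG", "Q7X7N2", "AK065743", "Os02g0115900"):
        print(query, "->", to_msu(query))
    print("Did you mean", suggest("kinose") + "?")
```

Normalising every alias to lower case before the lookup keeps the table small; a production system would more likely hold these mappings in an indexed database table.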

Using Arabidopsis orthology to gain insight into rice

Despite the variety of compiled and newly generated data for rice in Rice DB, there are still hundreds, if not thousands, of genes/proteins for which specific information cannot be found. To address this problem, Rice DB allows simple, parallel access to Arabidopsis data. To date, numerous databases allow specific analyses of both Arabidopsis and rice, such as Gramene (Youens-Clark ) and the MIPS PlantsDB (Nussbaumer ), which allow comparisons of genetic synteny, whereas databases such as ATTED-II (Obayashi ) and OryzaExpress (Hamada ) facilitate co-expression analysis. Similarly, the SALAD database (Mihara ) and PRIN database (Gu ) facilitate protein motif and interaction analysis in both rice and Arabidopsis. However, whereas some of these databases do allow a comparison of results between both species, others only facilitate separate analysis of each species, albeit under the same conditions, i.e. within the same database. In Rice DB, multiple levels of data can easily be compared between Arabidopsis and rice, from transcript expression to protein properties and subcellular localization, all within the same database for both species (Figure 2). This is achieved through orthology, based on two different methods: sequence identity (as computed in Gramene; Youens-Clark ) and InParanoid (Ostlund ). Furthermore, a recent study has used co-expression and orthology to generate ‘Expressologs’ between rice and Arabidopsis genes (Patel ), and these are also shown in Rice DB (Figure 2). Thus, the user can choose to employ one or more of these methodologies.

Figure 2

An example of the output after three genes/proteins in rice were searched in Rice DB. After clicking on the down arrow present in the orthology column, it is possible to see parallel information for the orthologous gene(s) in Arabidopsis within the Rice DB output table. Examples demonstrating the usefulness of showing Arabidopsis gene descriptions, expression annotations and subcellular locations in parallel are shown (in the pink, yellow and blue boxes, respectively). Examples of the pop-up windows are also shown for the expression and protein subcellular location(s) data.

In the Rice DB summary output, the last two columns show the Arabidopsis expressologs and orthologues (Figures 1c and 2). Clicking the blue arrow in this column drops one (or more) rows below to reveal parallel information for the Arabidopsis orthologue(s) (rows shaded in green; Figure 2). More information about the orthology can be gained by clicking the green information icon (Figure 2). To demonstrate the ease and biological usefulness of showing Arabidopsis information in parallel, three example rice genes were searched (Figure 2). The first protein (encoded by LOC_Os07g44840.1) is annotated as a bacterial transferase hexapeptide domain-containing protein, which is predicted to be mitochondrial, but there is no experimental confirmation of the subcellular location (Figure 2). By using the Arabidopsis orthology link, it is easily shown that this protein has an Arabidopsis orthologue (At1g47260.1) of similar length, with 71% sequence identity, based on Gramene (Youens-Clark ), and with 100% confidence of orthology, based on InParanoid (Ostlund ) (green row below LOC_Os07g44840.1; Figure 2). However, in contrast to its rice orthologue, the subcellular location of At1g47260.1 has been experimentally confirmed to be mitochondrial, based on evidence from 12 different publications (blue box; Figure 2). 
The experimental method(s) used and links to publications are also shown (PubMed identifiers; green pop-up window; Figure 2). Specifically, this Arabidopsis protein (At1g47260.1) has been shown to be mitochondrial and part of complex I in Arabidopsis (Meyer ; Klodmann ), and clicking on At1g47260.1 (circled red; Figure 2) opens up the TAIR page for this gene directly, where other publications relating to this protein are also shown. Given the high sequence identity and the conserved, essential role of complex I in the electron transport chain, knowledge of the mitochondrial location of the protein encoded by At1g47260.1 sheds light on the likely subcellular location of LOC_Os07g44840.1. In this way, Rice DB provides novel insight into the subcellular location of rice proteins, by presenting and linking to the known subcellular location information of Arabidopsis orthologues (for which this is collated in SUBA; Heazlewood ; Tanz ). Thus, Rice DB allows researchers to easily gain insight into the subcellular location for hundreds of rice proteins with Arabidopsis orthologues, for which this information is already known. Similarly, although many rice genes lack detailed functional annotation, often the Arabidopsis orthologues are more informatively annotated. For example, there are nearly 14 300 rice genes annotated as ‘expressed protein’ in the MSU database, with many having no further (functional) description. An example is shown for LOC_Os03g44810.1, where the MSU annotation calls this an ‘expressed protein’, whereas the Arabidopsis orthologue (At2g38920.1) is annotated as an SPX domain-containing protein in TAIR (pink box; Figure 2). This annotation was reinforced by the compiled domain annotations in Rice DB, which showed that LOC_Os03g44810.1 does in fact contain an SPX domain, according to InterPro, PROSITE and Pfam. 
Thus, having parallel annotation information for Arabidopsis orthologues can be useful for rice proteins where little functional annotation information is shown. Lastly, comparing transcript expression between Arabidopsis and rice can also provide functional insight. For example, LOC_Os08g14400.1 is most highly expressed in seed/germination tissues of rice, with expression shown in four out of 41 tissues (yellow box; Figure 2). Using the orthology information in Rice DB, we can easily show that this expression is conserved in Arabidopsis, where At3g03660.2 is also expressed specifically during germination, in five out of the 73 tissues (yellow box; green pop-up window; Figure 2). This example was taken from a rice germination study (Howell ), in which this gene was shown to be highly expressed during germination, and this was also conserved for the Arabidopsis orthologue using the BAR eFP browser (Toufighi ). Although this was only done for a few genes during germination in that study (Howell ), it demonstrates how any gene or set of genes can be easily searched in Rice DB to examine and reveal conservation in expression patterns between Arabidopsis and rice. The usefulness of combining these data is also demonstrated in the recently updated rice BAR eFP browser, showing expressologs combining expression and orthology as a highly informative way of revealing the conservation or divergence between species (Patel ). As indicated in that study, it is important to point out that there are also significant differences between Arabidopsis and rice, or dicots and monocots in general, that must be considered when interpreting these comparisons (Narsai ; Patel ). Details of these are described on the ‘Combined Orthology Summary’ page of the ‘Data’ pages.
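
As a rough illustration of how evidence from more than one orthology method might be viewed side by side for a rice locus, the following Python sketch combines per-method lookups into one summary. The table layout, field names and function names are hypothetical and are not taken from Rice DB; the single gene pair and its scores are those quoted above for LOC_Os07g44840.1.

```python
# Values are those quoted in the text for one gene pair; the record layout
# and field names are hypothetical.
GRAMENE = {"LOC_Os07g44840.1": ("At1g47260.1", 71.0)}      # % sequence identity
INPARANOID = {"LOC_Os07g44840.1": ("At1g47260.1", 100.0)}  # % orthology confidence

METHODS = (("Gramene", GRAMENE), ("InParanoid", INPARANOID))


def orthology_summary(locus):
    """Collect the Arabidopsis partner reported by each method for one rice locus."""
    summary = {}
    for method, table in METHODS:
        if locus in table:
            agi, score = table[locus]
            summary[method] = {"agi": agi, "score": score}
    return summary


def consensus_partner(locus):
    """Return the AGI if all reporting methods agree on one partner, else None."""
    partners = {hit["agi"] for hit in orthology_summary(locus).values()}
    return partners.pop() if len(partners) == 1 else None


if __name__ == "__main__":
    print(orthology_summary("LOC_Os07g44840.1"))
    print(consensus_partner("LOC_Os07g44840.1"))
```

Keeping each method in its own table mirrors the point made in the text that the user can rely on one method or on several; the consensus helper is just one possible way of combining them.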

Expression analysis for rice genes

In TAIR, when an Arabidopsis gene/protein is searched, the flat file shows annotation and function information as well as expression annotations: e.g. expressed in ‘seed, germination’. These expression annotations show the tissues/developmental stages in which the specific gene of interest is expressed. This is not only useful for transcriptomic studies, but even for proteomic research, where it can be useful to distinguish protein isoforms based on transcript expression patterns. To our knowledge, this is currently not found in other rice databases, and thus hundreds of microarrays for rice (and Arabidopsis) were analysed in parallel to generate expression annotations in Rice DB (Figure3a).

Figure 3

Microarray analysis workflow using Rice DB. (a) The summary output after a list of differentially expressed Affymetrix probe sets is entered into Rice DB. (b) The output after clicking on the expand spreadsheet icon in the ‘Annotations’ column or after typing ‘Show annotations for…’. All columns show the annotations from the various sources (listed in Table1). (c) The output after clicking the expand spreadsheet icon in the ‘Expressed_in’ column, or after typing ‘Show expression profiles for…’. The normalized expression intensities across the 41 different developmental tissues are shown after log transforming and viewing using the custom heat map in MS Excel. (di) Output showing the genes containing the experimentally confirmed motifs after the expand icon was clicked in the ‘Exp_shown_motifs’ column. (dii) The output after ‘View hexamers for…’ was entered for a shortlist of genes. This shows the numerical and percentage occurrence of the 4096 possible hexamers in the input gene list, as well as these occurrences in the genome.

During the course of omics research, a list of genes/proteins is often identified. A transcriptomic analysis workflow using Rice DB is presented in Figure3. For example, following a microarray study a set of differentially expressed genes can be entered into Rice DB (Figure3a). Given that Rice DB accepts Affymetrix microarray probe sets, identifier conversion was not necessary (identifiers retrieved from ROAD; Jung ; Table1). As an example, the 2244 differentially expressed probe sets identified during germination (Howell ) were entered into Rice DB and the summary for the first four probe sets is shown (Figure3a). The first two probe sets do not match to specific MSU gene identifiers and therefore these rows are blank; however, the next two match LOC_Os08g14400.1 and LOC_Os03g44810.1 (Figure3a). 
Given that LOC_Os03g44810.1 represents just one of thousands of genes annotated as ‘expressed/hypothetical/unknown protein’, functional annotations were specifically compiled for Rice DB from different sources (Annotations; Table1). Thus, if a gene has an MSU putative function annotation of ‘expressed protein’, the RAP description, domain annotations (Interpro, Prosite, Pfam) and Genebins functional annotations (Goffard and Weiller, 2007) are also shown (Figure3b). Additionally, the transcription factor databases (Gao ; Riano-Pachon ) and kinase databases (Dardick ) have also been incorporated into the Genebins annotations in Rice DB, and thus a wider net of annotation information can now be found (Figure3b). In this way, it was revealed that despite the annotation of ‘expressed protein’, LOC_Os03g44810.1 encodes a transcription factor with a zinc-finger and SPX domain (Figure3b), which also supports the TAIR annotation for its Arabidopsis orthologue (Figure2). Furthermore, clicking on the small spreadsheet icon in the ‘expressed_in’ column shows the normalized expression levels across development (Figure3c). Note that the MSU RGAP database also presents transcript profiles and co-expression data for rice genes present on microarrays, also allowing users to extract expression profiles in a similar way. For LOC_Os08g14400.1 and LOC_Os03g44810.1, the expression intensities were exported from Rice DB, logarithm-transformed in Microsoft Excel and false-coloured as a heat map within Microsoft Excel, visualising the high expression in seed/germination (Figure3c). Interestingly, the expression annotations for the Arabidopsis orthologues to these also show seed-specific expression (Figure2). It is, however, important to point out that although this does occur for several genes, parallel expression between these species must be interpreted with caution, given the divergence between these two model species. 
In this way, numerous studies have used transcript data to gain an expression ‘context’ of a given set of genes (Howell ; Huang ; Narsai , 2011; Taylor ), and Rice DB can also now facilitate this with ease. Following heat map/cluster generation in this way (Figure3c), or even after identifying lists of co-expressed genes using other rice databases (e.g. ATTEDII; Obayashi ), Rice DB can be used to search for potential elements of co-regulation. It is possible to search the 1 kb upstream regions of a set of rice genes, both for the occurrence of experimentally identified (Bulow ; Yilmaz ) and putative motifs (Figure3d). Figure3(di) shows the output when experimentally demonstrated motifs are searched in the genes encoding transcription factors that showed highest expression in seed/germination. After sorting by EXP_SHOWN_MOTIF, sets of genes containing that motif, the motif sequence and source are shown (Figure3di). Alternatively, a search such as ‘show hexamers for [insert identifiers]’ produces a table showing the occurrence of all 4096 possible hexamers in the upstream regions of these genes (Figure3dii). Using percentage and occurrence values, the number of sequences containing a particular motif in a given subset (i.e. gene list) can be compared with the genome to reveal over-represented putative hexamers (Figure3dii). This function in Rice DB is comparable with the motif analysis tool available in TAIR (2005), where it is possible to simply search the 1 kb upstream regions of the Arabidopsis orthologous genes for hexamer occurrences in a subset compared to the genome. Thus, a common analysis workflow can be supported in Rice DB, from a differentially expressed set of genes to revealing putative regulatory elements of co-expression (Figure3). 
Specifically, it was shown that LOC_Os08g14400.1 and LOC_Os03g44810.1 encode transcription factors (Figure3b), which show seed/germination-specific expression (Figure3c) and contain common RAV1–A, SORLIP3, RY-repeat and NAC motifs in their promoters (Figure3di).
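
The hexamer comparison described above (the number and percentage of sequences in a gene list that contain each of the 4096 possible hexamers, set against the same values for the genome) can be illustrated with a short sketch. This is not the Rice DB implementation: the function names, the toy sequences and the ranking by percentage difference are assumptions made only for illustration, and real input would be the 1 kb upstream regions.

```python
from itertools import product


def hexamer_presence(sequences, k=6):
    """For every possible k-mer, count how many sequences contain it at least once."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for seq in sequences:
        seq = seq.upper()
        # A set, so each k-mer is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for kmer in seen:
            if kmer in counts:  # ignores k-mers with N or other symbols
                counts[kmer] += 1
    return counts


def compare_to_genome(subset, genome, k=6):
    """Return rows of (k-mer, subset count, subset %, genome count, genome %)."""
    sub = hexamer_presence(subset, k)
    gen = hexamer_presence(genome, k)
    rows = []
    for kmer in sub:
        sub_pct = 100.0 * sub[kmer] / len(subset)
        gen_pct = 100.0 * gen[kmer] / len(genome)
        rows.append((kmer, sub[kmer], sub_pct, gen[kmer], gen_pct))
    # Assumed ranking: largest excess of subset % over genome % first.
    rows.sort(key=lambda r: r[2] - r[4], reverse=True)
    return rows


# Toy stand-ins for upstream regions (hypothetical sequences).
genome = ["ACGTACGTAC", "TTGACCAAAA", "GGGGGGGGGG", "CATGCATGCA"]
subset = ["TTGACCAAAA", "CATGCATGCA"]
table = compare_to_genome(subset, genome)
```

A difference in percentages only flags candidate hexamers; whether an excess is statistically meaningful would need a separate test, which this sketch does not attempt.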

Subcellular localization of rice proteins: the rice mitochondrial proteome

One of the most important pieces of information required to define the function of a protein is its subcellular location. Isoforms of proteins can be located in different subcellular locations, and despite identical/conserved enzymatic activities, the functions can differ because of subcellular location. Examples include a variety of proteins located in mitochondria and plastids, which catalyse various metabolic steps in energy metabolism. Furthermore, some proteins can have multiple locations, termed dual targeting, and again, it is essential to know this in order to gain insight into function (Lu ). In Arabidopsis, extensive knowledge of subcellular location has provided useful insight into specific protein functions. The SUBA database shows collated subcellular location information, compiling the outputs from multiple computational location predictors (analysing all protein sequences in the Arabidopsis genome) as well as linking to the collection of publications showing experimental evidence of localization, such as MS/MS and green fluorescent protein (GFP) analysis, which is manually curated (Heazlewood ). To date, no equivalent resource has been generated for rice, with a number of studies in rice only relying on the output from one or more predictors to gain insight into location (Lemberg and Freeman, 2007; Soanes ). Thus, in Rice DB, we not only incorporated the same two lines of evidence (computational prediction and experimentally determined localization), but also incorporated the subcellular location of orthologous proteins in Arabidopsis. Using Rice DB, a putative rice mitochondrial proteome is presented, comparing the three available methods for defining location, using specific examples (Figure4a–d).

Figure 4

Subcellular location of rice proteins. (a) Seven genes are shown, representing combinations of the three ways that Rice DB can give insight into subcellular location: i.e. computational prediction (‘Predicted in rice’); experimentally determined subcellular location of orthologous proteins in Arabidopsis (‘Exp. shown in Arabidopsis’); and experimentally determined subcellular location of rice proteins (‘Exp. shown in rice’). (b) Overlapping numbers of proteins identified as mitochondrial on the basis of these three approaches: i.e. (i) orthologues to AT mito. set; (ii) the Rice mito. set; and (iii) computational prediction based on four or more predictors. All three total sets, as well as the exclusive set of 839 proteins determined by orthology alone, were significantly enriched in the ‘Energy’ Genebins category (P < 0.01, indicated by ∧), and the number of these in each set is indicated in brackets. Gene expression patterns for the genome (defined as all genes on the Affymetrix Rice genome microarray) and each of the three gene sets were examined. For each set, the percentage of genes expressed in none of the microarrays, between one and 36 of the tissues/stages and >90% of all tissues/developmental stages (i.e. 37 or more out of the 41 different developmental tissues/stages) is indicated. *Gene sets enriched in these proportions, compared with the genome. (c) The orthologue summary for the rice proteins identified as mitochondrial on the basis of orthology: orthologues to AT mito. set. (d) The pop-up search box where specific predictors can be selected. The outputs for each of these were presented as percentages (for most, where possible), and the expanded output of these is shown. (e) The pop-up window showing the possible data types that can be searched in Rice DB, when ‘choose data type’ is selected on the homepage. The example output is shown for experimentally determined locations. See references for predictors in the ‘Data’ pages in Rice DB. 
Firstly, all protein sequences encoded in the rice genome were analysed for subcellular location using 11 different computational predictors (listed in Table1). Note that once a list of gene identifiers or the subcellular location is entered into the search box, and the predictors are selected (as in Figure4d), the output shows the identifiers that matched the search criteria, including subcellular location and percentage/score for each predictor used (Figure4d). For example, when ‘mitochondria’ is searched in Rice DB, 32 487 rows are returned and 23 501 of these are predicted by one or more predictors to be in the ‘mitochondrion’. It is estimated that 1000–2000 proteins are present in mitochondria (Loreti ; Meisinger ). Thus, given the advantage of having a variety of predictor choices in Rice DB (Figure4d), only proteins predicted to be mitochondrial by four or more predictors were included for comparison (2998; Figure4b). For example, LOC_Os02g10820.1 was predicted to be mitochondrial by six predictors (predicted locations column; Figure4a). Secondly, a multitude of publications were searched and lists of rice proteins with experimental evidence of location were carefully curated and compiled, including those with GFP analysis, MS/MS, immunogold labelling and immunodetection (usually encompassing organelle isolation and the use of specific antibodies). Collectively, over 500 publications relating to rice and the respective organelles (e.g. ‘mitochondria’) were searched for information about experimentally determined protein localization. In this way, a final list of 497 mitochondrial proteins was generated after compiling experimentally determined location information from different publications. For most publications, this also involved converting different protein identifiers into MSU identifiers. Also, all details of the experimental method used to determine localization had to be extracted and entered into Rice DB (e.g. ‘Exp. determined set’ for mitochondria, as shown in Figure4b; see also Example 8.1 at the bottom of the Rice DB website). 
The proteins encoded by LOC_Os01g54940.1, LOC_Os02g45820.1 and LOC_Os07g31390.1 are examples of experimentally confirmed mitochondrial proteins (Figure4a), and the expanded details of these are shown in Figure4e. Alternatively, the advanced search window allows users to select the data to display: in this case, experimental location (Figure4e). Additionally, for proteins with experimentally confirmed locations, Rice DB also shows a manually extracted brief description of the phenotype (if this is reported) when this gene is mutated (e.g. T–DNA insertion, EMS, TOS17 lines etc.), knocked down (e.g. antisense, RNAi) or overexpressed (hyperlinked PubMed identifiers link directly to the publication). In this way, it is shown that the suppressed expression of several mitochondrial proteins results in developmentally impaired phenotypes: e.g. for OsARG (Figure1; Ma ), DCW11 (Fujii and Toriyama, 2008), MIR (Ishimaru ) and a number of others. Thirdly, although the 497 experimentally confirmed mitochondrial proteins represent a substantially extended list, it is still likely to represent less than half of all mitochondrial proteins in rice. Thus, orthologous Arabidopsis proteins are also shown, as a third method to gain insight into the possible location of rice proteins (Figure4a–c). The orthologue summaries of three rice genes (from Figure4a) show the strength of orthology and subcellular location of the Arabidopsis orthologues (Figure4c). In this way, the 1265 Arabidopsis proteins considered to be mitochondrial were used (Law ), where ∼72% of these have experimental evidence for mitochondrial localization (Heazlewood ), and the remaining 28% were predicted and considered to be mitochondrial on the basis of function (Law ). 
In order to do this without bias, no cut-offs or preferences were given to the method for determining orthology, which was defined based on sequence identity computed in Gramene (Youens-Clark ) and/or Inparanoid (Ostlund ) in Rice DB. Of the 1265 Arabidopsis mitochondrial proteins, 1164 have rice orthologues (making up the 1466 rice proteins, orthologous to the AT mito. set; Figure4b). Additionally, there are numerous Arabidopsis membrane proteins with confirmed subcellular locations that cannot be accurately predicted by computational predictors, thus further supporting the usefulness of incorporating orthology. However, it is important to caution users against assuming direct conservation between Arabidopsis and rice for subcellular location, whereby despite orthology there can be divergence in the subcellular location of orthologous proteins, either because of technical differences in the experimental methods or because of real biological differences between these species (Xu ). Thus, overlapping these lists for mitochondria (Figure4b) allowed comparison of these methods. Note, although publications state that up to 60% of Arabidopsis and rice proteins with experimentally determined location are also predicted to be in that location (Heazlewood ; Huang ), this must not be mistaken to represent the accuracy of predictors. For example, although 302 of the 497 experimentally confirmed mitochondrial proteins (61%) were also predicted by MitoProt (Claros, 1995), this is only 2.6% (302 of 11 745) of all proteins predicted to be mitochondrial by MitoProt in rice, revealing a high false-positive rate for individual predictors (Figure S1). Similar overlaps were seen for PProwler and TargetP (<3%), and <5% for WoLFPSort and YLoc, whereas 7.6% was seen for Predotar (Figure S1; Hawkins and Boden, 2006; Emanuelsson ; Horton ; Briesemeister ). 
Thus, combining predictors (requiring four or more to agree) resulted in a 6.8% overlap with the 497 experimentally confirmed mitochondrial proteins (205 out of 2998; Figure4b). In contrast, ∼20% (291) of the 1466 rice proteins orthologous to Arabidopsis mitochondrial proteins (Law ) overlapped with the experimentally confirmed rice mitochondrial proteins, pointing towards orthology as one of the more accurate ways of determining subcellular location in rice (Figure4b). To examine these mitochondrial lists in light of function and expression, we first took advantage of the compiled functional annotations that can be used in Rice DB. A simple search for ‘Energy in Genebins’ revealed 638 MSU identifiers that were matched by Genebins annotations (Goffard and Weiller, 2007) to encode energy-related functions, which makes up 1.8% of the 35 787 genes annotated by Genebins. Simply matching these to each of the three sets from Figure4(b) revealed that each of these was significantly enriched in ‘Energy’ functions (P < 0.001) (∧denotes significant over-representation, compared with 1.8% in the genome; Figure4b). Given that mitochondria are essential organelles for energy production, the significant over-representation of energy functions was not surprising, and in fact provided independent evidence supporting the accuracy of these sets as representing mitochondrial proteins. Notably, the subset of mitochondrial proteins defined exclusively by orthology was also enriched in energy functions (839 proteins; Figure4b), whereas this was not seen for the mitochondrial subset based exclusively on prediction (2457 proteins; Figure4b). Additionally, gene expression was examined within each of the subsets (from Figure4b) using the in–house generated Rice DB expression annotations. For all genes with probe sets in the rice genome, only 24% are expressed in >90% of all tissues (i.e. 37+ out of the 41 possible tissues in rice; Figure4b). 
In contrast, 68% of the 497 experimentally confirmed rice mitochondrial proteins are expressed in >90% of all tissues, compared with the 24% in the genome (*significant, P < 0.001; over-representation denoted with an asterisk; Figure4b). Similarly, a significant enrichment was also revealed for the 1466 genes orthologous to the AT mito. set, where >65% of these genes were expressed in >90% of all tissues (Figure4b). In contrast, no such significant enrichment was observed for the genes encoding proteins predicted to be mitochondrial (35% of 2998; Figure4b). Notably, when the Arabidopsis RNA expression annotations in Rice DB were also examined, 65% of the AT mito. set were also expressed in >90% (66+ out of the 73) of the tissues analysed, which is also a significant (P < 0.001) enrichment of these genes compared with the 38% of Arabidopsis genes that show this expression in the Arabidopsis genome (data not shown), which is comparable with the observed enrichment in rice (Figure4b). Given that several mitochondrial functions are essential for viability and central metabolism, it is not unexpected that these genes are expressed in most (>90%) tissues throughout plant development, and the ability to examine annotations, expression and subcellular locations like this in parallel, using Rice DB, can then strengthen the knowledge of given rice genes/proteins, especially where little other information is known.
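
The overlap percentages quoted in this section can be re-derived with simple arithmetic, which also makes clear why 61% and 2.6% describe the same 302 proteins from two different denominators. The sketch below uses only counts given in the text; the variable and key names are illustrative, and because the 497 confirmed proteins are themselves described as an incomplete list, the second percentage is best read as an overlap with known proteins rather than a true accuracy.

```python
CONFIRMED = 497  # experimentally confirmed rice mitochondrial proteins


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# name: (overlap with the confirmed set, total proteins called mitochondrial)
methods = {
    "MitoProt alone": (302, 11745),
    "four or more predictors": (205, 2998),
    "orthology to AT mito. set": (291, 1466),
}

summary = {}
for name, (overlap, called) in methods.items():
    summary[name] = {
        # share of the confirmed proteins that the method recovers
        "confirmed_recovered_pct": pct(overlap, CONFIRMED),
        # share of the method's calls that are in the confirmed set
        "calls_confirmed_pct": pct(overlap, called),
    }


def min_tissues_over_90(total):
    """Smallest tissue count that is strictly more than 90% of `total`."""
    return (9 * total) // 10 + 1


rice_cutoff = min_tissues_over_90(41)         # rice expression annotations
arabidopsis_cutoff = min_tissues_over_90(73)  # Arabidopsis expression annotations
```

The last two values reproduce the ‘37+ out of 41’ and ‘66+ out of 73’ cut-offs used for the >90% expression category.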

Maximising insight by linking annotation, expression, regulation, subcellular location and orthology

As demonstrated in Figures4, it is extremely useful to have multilevel knowledge incorporating annotations, transcript and protein data for both Arabidopsis and rice in parallel in Rice DB, as this maximizes insight in a way that is not currently possible using the existing rice databases. Furthermore, in contrast to most other rice databases, Rice DB acts as a portal linking through to several data sources, and users can start from any data type and link to an array of information for their genes/proteins of interest (Figure5a). For example, Figure3 demonstrates how it is possible for researchers analysing microarray or RNA sequencing data to have an assisted workflow by using Rice DB, following a common order of analysis. Also, although it was not presented in Figure3, there is also additional transcript-related data in Rice DB, including annotated miRNA targets (Jeong ; Figure5a; Table1). Furthermore, by having expression data in parallel with protein data in Rice DB, it is possible to gain insight into multiple orthologous proteins, possibly revealing which of the multiple homologues may be functional in specific tissues or developmental stages (Figure5). However, the differences between Arabidopsis and rice must also be taken into account for these comparisons, from the significant difference in genome size to considerable biological differences, which have led to 45% of rice genes not having significant Arabidopsis orthologues (See Combined Orthology ‘Data’ page).

Figure 5

Inter-connections within Rice DB: Oryza information portal. (a) Rice DB creates a network for rice that connects identifiers, annotations, transcript data and protein data, and links these with information for orthologous genes in Arabidopsis. Data subtypes are shown below each heading. By connecting these data types for rice, it is possible to follow these connections and gain insight into function, including for rice genes with very little or no functional information. (b) Tutorial examples, as shown below the search box in Rice DB. These can be used as templates to use the functions in Rice DB. Note that only single examples are shown per data type (the full list is shown below the search box in Rice DB).

The flexibility of Rice DB also facilitates finding information in other workflows. For example, researchers examining specific transcription factor families may be interested in all genes containing a specific binding site: e.g. WRKY transcription factors and the W–box, TTGACC. Using Rice DB, it is possible to just enter that sequence into the search box and retrieve the list of 12 175 genes containing it in their promoters. Upon receiving this, it is possible to refine this (using the Refine tool) and identify any bias or over-representation(s). That is, it is possible to see if this set represents a co-expressed data set (using the transcript data), a set of co-localized proteins (using the subcellular location data) or a set of genes encoding proteins of a specific function (using the annotations) in Rice DB (Figure5a; see examples in Figure5b and Rice DB tutorial). Note, if the focus is co-expression, gene lists can also be easily exported and examined further in other databases, such as RiceFREND (Sato ), Oryzaexpress (Hamada ) and RiceXPro (Sato ), that specialize in detailed co-expression analysis. The networked structure of Rice DB is also very useful to protein researchers. 
For example, protein properties can easily be retrieved using Rice DB, by simply entering ‘Show protein properties for…’ (Figure5b; see Tutorial examples on Rice DB homepage). Following this, the peptide length, projected molecular weight and isoelectric point are shown for all proteins (Figure5a). Thus, after receiving the protein properties for a list of proteins (e.g. those identified following mass spectrometry), it is also possible to view and identify putative functional domains based on Gene3D, Interpro, Prosite and potential transmembrane helices based on TMHMM within Rice DB (Figure5a; Table1). Also, the computation and collation of predicted subcellular location(s) in Rice DB represents a resource not available anywhere else for rice (Figure5a), and it is well known that subcellular location is extremely informative for defining protein function. The usefulness of combining data types, such as expression and subcellular location, is also demonstrated by its incorporation into the BAR eFP browser for Arabidopsis (Toufighi ). Lastly, the data/lists from Rice DB can also easily be exported for further searching in other resources such as the SALAD (Mihara ) or PRIN (Gu ) databases, which can reveal deeper insight into protein function by identifying conserved protein motifs or interactions. Furthermore, the collation of phenotype information for proteins with known subcellular location represents another important resource in Rice DB, where rice phenotypes can simply be searched, revealing new trends and facilitating new hypotheses that would otherwise not have been apparent without Rice DB (Figure5a). For example, a search for ‘growth in experimental phenotypes’ in Rice DB reveals 21 genes from 14 different publications, where genetic perturbation results in altered plant growth phenotypes. 
Viewing these closely revealed two independent publications showing that mutating two different Golgi-localized proteins results in growth alterations in plants (Li ; Zhang ). These were LOC_Os01g51430.1, which was annotated as ‘green ripe-like, putative expressed protein’, and LOC_Os12g36890.1, which was annotated as a ‘cellulose synthase-like protein’. Thus, viewing these in parallel in Rice DB enables common threads to be identified. Likewise, clicking on the AGI of the closest Arabidopsis orthologue in Rice DB quickly allows researchers to see if this has also been shown in Arabidopsis by opening the TAIR page, where links to publications relating to this gene are shown at the bottom of the page. Also, once co-localized proteins are identified, Rice DB can also easily be used to gain insight into co-expression and co-regulation, without needing to manually translate identifiers or reformat searches for disparate specific database resources. In fact, just having the different data types linked, such as the functional and expression annotations for both species in parallel, can yield insight, even for rice genes without detailed functional annotations. For example, At3g62790 is annotated as an NADH-ubiquinone oxidoreductase-related protein, and is expressed in all tissues, whereas its rice orthologue LOC_Os08g44250.1 is annotated as fiber protein Fb14, and is also expressed in all tissues. However, the annotation for At3g62790 and the knowledge that it is mitochondrial in Arabidopsis helps to provide greater insight into what may be the function of the protein encoded by the rice orthologue LOC_Os08g44250.1. Thus, by using Rice DB, it is possible to gain detailed insight into potential organellar proteins, which would not have been otherwise identified.
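
The binding-site search described above (entering the W-box sequence TTGACC and retrieving the genes that contain it in their promoters) amounts to a substring search over promoter sequences. The sketch below is a minimal illustration and not how Rice DB itself is implemented: the function names and the toy promoters are hypothetical, and checking the reverse complement as well as the forward strand is an assumption, because the text does not say which strands are searched.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def genes_with_motif(promoters, motif, both_strands=True):
    """Return the sorted identifiers of genes whose promoter contains the motif."""
    motif = motif.upper()
    targets = {motif}
    if both_strands:
        # Assumption: a site on the opposite strand also counts as a match.
        targets.add(reverse_complement(motif))
    return sorted(
        gene
        for gene, seq in promoters.items()
        if any(t in seq.upper() for t in targets)
    )


# Hypothetical promoter fragments keyed by placeholder gene names.
promoters = {
    "gene_a": "AAATTGACCGG",   # W-box on the forward strand
    "gene_b": "CCGGTCAATT",    # GGTCAA, the reverse complement of TTGACC
    "gene_c": "ACGTACGTACGT",  # no W-box
}

hits = genes_with_motif(promoters, "TTGACC")
forward_only = genes_with_motif(promoters, "TTGACC", both_strands=False)
```

The resulting gene list is the kind of set that could then be compared against expression, location or annotation data, as the text describes for the Refine tool.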

Conclusions

We have demonstrated a network of functional data relating to rice. Furthermore, in the style of search engines such as Google, rice is now ‘searchable’ with little effort, and the retrieval of fundamental connections can now become routine and commonplace. Researchers can quickly gain access to rice knowledge by entering arbitrary identifiers, annotation keywords or even promoter motifs to immediately reveal relevant rice knowledge from previously unlinked data. In the presented examples, we have shown ways of exploring putative organellar proteomes through the use of localization information, expression and orthology for maximum insight into function. This is particularly important for genes and proteins with very little functional information, for which Rice DB can now reveal putative function annotations from a variety of sources (some improved upon and curated in-house), expression annotations (normalized and generated in-house), predicted subcellular localization (pre-computed in-house), experimentally determined location (collated and curated in-house) as well as the phenotypes (if any, upon mutation, knock-down, or overexpression) for proteins with experimentally confirmed subcellular locations (curated and collated in-house), and simply link to the Arabidopsis orthologue(s), where subcellular location and functional knowledge may already be known. Thus, Rice DB presents a simple, centralized data resource that can be used to gain maximum insight into rice gene/protein functions.

Experimental procedures

Database design

The Rice DB software and website have been developed for the Java runtime environment using the Scala programming language and Lift web framework. The software's only operating system requirements are for a Java servlet runtime, disk storage and a network connection. See Appendix S1 for details.

Alternative identifiers

For the range of identifiers that can be searched in Rice DB, a variety of sources (outlined in Table1: Alternative identifiers) were used, and a number were manually added. See Appendix S1 for details.

Annotations

We combined functional annotations with domain and structural annotations (outlined in Table1) to maximize insight into putative function. A number of additional resources were also added and/or updated by manual collation and annotation. See Appendix S1 for details.

Transcript data: microarray analysis for expression annotations

To examine gene expression across development in rice (and Arabidopsis), a range of publicly available microarrays were downloaded from the Gene Expression Omnibus or MIAME ArrayExpress databases (for each species), and these were analysed in-house using similar methods to previous studies (Howell ; Narsai ). See Appendix S1 for details.

Transcript data: microRNAs in rice

Known microRNA target genes, as identified in the Meyers lab Next-Gen Sequence Database (Jeong ), are also annotated as such in Rice DB.

Subcellular localization

To determine subcellular location, three main methods are presented: computational prediction, experimental evidence presented in publications and orthology with Arabidopsis. Details of these (including thresholds used/cut-offs etc.) are given in Appendix S1. Note, although the mitochondrial proteome is shown as the example here, this information is also available for chloroplasts, peroxisomes, nucleus and various other organelles.

Compiling lists of genes with known phenotypes

For proteins with confirmed localizations based on experimental methods, phenotype details were also extracted for all genes that showed a phenotype when expression is altered: for example, by mutation (e.g. T–DNA insertion, EMS, TOS17 lines etc.); knock-down (e.g. antisense, RNAi); or overexpression, from the relevant publication. For these, a simple phenotype is shown in Rice DB: e.g. ‘developmental’ phenotype. Thus, for confirmed organellar proteins, a documented phenotype can also be searched in Rice DB. A description of the phenotype, as described in the publication, is shown in the column called ‘Exp. shown. Pheno’, and the relevant hyperlinked PubMed identifier is also shown.

Data sources

To view all the data sources used in Rice DB, see Table1 and the ‘Data’ page (on the left panel of the Rice DB homepage).
