Literature DB >> 24082820

Integrated analysis of transcriptomic and proteomic data.

Abstract

Until recently, understanding the regulatory behavior of cells has been pursued through independent analysis of the transcriptome or the proteome. Based on the central dogma, it was generally assumed that there exist a direct correspondence between mRNA transcripts and generated protein expressions. However, recent studies have shown that the correlation between mRNA and Protein expressions can be low due to various factors such as different half lives and post transcription machinery. Thus, a joint analysis of the transcriptomic and proteomic data can provide useful insights that may not be deciphered from individual analysis of mRNA or protein expressions. This article reviews the existing major approaches for joint analysis of transcriptomic and proteomic data. We categorize the different approaches into eight main categories based on the initial algorithm and final analysis goal. We further present analogies with other domains and discuss the existing research problems in this area.

Entities: CellLine Chemical Disease Gene Species

Keywords: Combined analysis review.; Data fusion approaches; Integrated omics; Joint modeling; Proteome; Transcriptome

Year: 2013 PMID： 24082820 PMCID： PMC3637682 DOI： 10.2174/1389202911314020003

Source DB: PubMed Journal: Curr Genomics ISSN： 1389-2029 Impact factor: 2.236

INTRODUCTION

One of the significant objectives of Systems Biology is to understand the regulation of cell behavior through interactions of various components in the regulome (regulation components in the cell such as mRNA, proteins, metabolites etc.). Two important observational categories involves (a) measurement of transcriptomic profiles through techniques such as microarray, RNA-seq etc. and (b) measurement of proteomic profiles through techniques such as gel electrophoresis and mass spectrometry. Some of the data measurement techniques may involve destruction of the living cell and thus joint measurement of both transcripts and proteins in a single cell will not be feasible by such methods. Furthermore, some approaches may provide expression data on the average behavior of a collection of cells and not the expression distribution of the cells. Thus, understanding the limitations and assumptions in the data measurement techniques used for measuring the transcriptomic and proteomic profiles is essential before conducting a joint analysis of the two data sources. Section 2 of this article provides a review of the various approaches used for measuring transcriptomic and proteomic profiles. The next step in developing a joint model of the two domains involves comprehending the differences in the expression of the mRNAs and proteins. Studies [1-5] have shown that there can be poor correlation between mRNA and protein expression data from same cells under similar conditions. Section 3 of this article discusses and provides possible reasons for the lack of correlation between mRNA and Protein expressions. Finally, the type of extracted transcriptomic and proteomic data and the ultimate goal of analysis will dictate the manner of the joint analysis of the two domains. Section 4 discusses the various approaches for the integrated analysis of transcriptomic and proteomic profiles. We divide the various available approaches into eight main categories and provide an initial overview of the techniques followed by specific examples in section 5. Section 6 provides analogies between biological scenarios and other physical scenarios so that approaches used for the analysis of one can throw insights and be possibly used for the analysis of the other. We compare the gene-transcriptome-proteome network with an organizational command structure and large scale social network. Section 7 provides conclusions and future research directions. The current review focuses on uncovering the primary categories of approaches that have been proposed for fusion of transcriptomic and proteomic data. In comparison, existing reviews on joint transcriptomic and proteomic profiling focuses on specific aspects of combined analysis. For instance, Catherine Hack [6] focuses on different statistical methods for correlation between transcriptomic and proteomic datasets. Cox et al. [7] reviews different methods for comparison of microarray and proteomic datasets along with clustering and merging options for these datasets. Nie et al. [8] focuses on attempts to develop various statistical tools for improving the chances of capturing a relationship between transcriptomic and proteomic data along with different transformation and normalization techniques for data, effects on measurement errors and challenges of missing values in datasets. A significant part of the paper by Hecker et al. [9] reviews approaches to build dynamic models of transcriptomic and/or proteomic network. Simon Rogers [10] described the available statistical tools for bridging multi-omics data.

TECHNIQUES FOR TRANSCRIPTOMIC AND PROTEOMIC DATASET GENERATION

Methods for Transcriptomic Profiling

Current transcriptomic profiling techniques include DNA microarray, cDNA amplified fragment length polymorphism (cDNA-AFLP), expressed sequence tag (EST) sequencing, serial analysis of gene expression (SAGE), massive parallel signature sequencing (MPSS), RNA-seq etc. Among the above mentioned technologies, DNA microarray [11] is the most widely used one. But, its application is dependent on the availability of complete genome sequence or knowledge of significant amount of transcript sequence. This technique has evolved from Southern blotting [12] and has been widely accepted as an inexpensive analog technique for high-throughput transcriptomic profiling. cDNA-AFLP [13] is a highly sensitive method which allows the detection of low-abundance mRNAs. Recent examples of cDNA-AFLP based transcriptomic studies are documented in [14] and [15]. EST [1] sequencing is another approach for transcriptomic profiling which has been used in a large number of transcriptomic studies (e.g. [16, 17]). SAGE [18] is a RNA-sequencing based transcriptomic profiling method that can be used to analyze large number of transcripts quantitatively and simultaneously (e.g. [19] and [3]). MPSS [20] is another sequenced based approach for profiling transcriptomic data which is somewhat similar to SAGE but with a substantial difference in sequencing approach and with different approach to biochemical manipulation (e.g. [21] and [22]). The most recent technology for transcriptomic profiling is RNA-Seq [23] which is considered as a revolutionary tool for this purpose. Eukaryotic transcriptomic profiles are primarily analyzed with this technique and it has been already applied for transcriptomic analysis of several organisms including Saccharomyces cerevisiae, Schizosaccharomyces pombe, Arabidopsis thaliana, mouse and human cells [24-29]. RNA-Seq technology shows clear advantages over existing profiling technologies in terms of amount of sequence coverage, revealing new transcriptomic insights, accuracy of defining transcription level, etc. However, existing microarray technology still remains reliable to many researchers for various reasons (explained in an article by Nathan Blow [30]). Overall comparison of existing technologies and most recent RNA-Seq technology can be found in recent reviews by Nicole Roy et al. [31] and Schirmer et al. [32]. Use of different transcriptomic technologies and their success on Amyotrophic Lateral Sclerosis study was discussed in recent review [33]. Also, the omics-era technologies for systems-level understanding of Streptomyceshas have been discussed in a recent review [34]. Genome-wide copy number analysis [35] is another area where extensive use of different transcriptomic technologies is exercised.

Methods for Proteomic Profiling

Current state of the art proteomic technologies include: 2-dimensional difference gel electrophoresis (2D DIGE), matrix-assisted laser desorption/ionization (MALDI) imaging mass spectrometry, electron transfer dissociation mass spectrometry and reverse-phase protein array. 2D-DIGE is a form of gel-electrophoresis which can label 3 different samples of proteins with fluorescent dyes. This method overcomes the limitations due to inter-gel variation in traditional 2D gel electrophoresis technique (2D-GE) [36] of proteomic profiling. Despite the limitation in 2D-GE method, it is still a mature proteomic profiling technique backed by 3 decades of research. Examples of proteomic study using 2D-GE can be found in [37] and [38]; whereas [39] and [40] provide examples of using 2D-DIGE technique in proteomic study. A detailed comparison between these 2 techniques can be found in the article by Marouga et al. [41]. MALDI imaging mass spectrometry [42] is a unique technique for identification of biomarkers in different diseases. Studies of proteomics profiling using this technique include [43] and [44]. Mass spectrometry based quantitative proteomic analysis is another form of proteomic profiling which is followed by 2D-GE. Here, intensity of protein stain is measured to find the existence and amount of protein present in a sample. Liquid chromatography mass spectrometry (LC-MS) (example studies [45] and [46]), liquid chromatography-tandem mass spectrometry (LC-MS/MS) (example studies [47] and [48]), in-gel tryptic digestion followed by liquid chromatography-tandem mass spectrometry (geLC-MS/MS) (example studies [49] and [50]) are different versions of mass spectrometry techniques used in proteomic profiling. Electron transfer dissociation (ETD) mass spectrometry [51] is another form of proteomic study which is a method of fragmenting ions in a mass spectrometer. [52] and [53] are examples of proteomic studies that use ETD. Reverse-phase protein array [54] is a protein microarray technology that has use in quantitative analysis of protein expressions in various kinds of cells including cancer cells, body fluids and tissues (example studies [55] and [56]). Use of several technologies stated above on prognosis and outcome of the treatment of breast tumor was discussed in a recent review paper [57].

CORRELATION BETWEEN TRANSCRIPTOMIC AND PROTEOMIC DATA

Until recently, there was an implicit assumption in systems biology literature of the existence of proportional relationship between mRNA and protein expressions measured from a tissue. However, analysis of mRNA and protein expression data from same cells under similar conditions have failed to show a high correlation between the two domains in multiple studies [1-5]. To analyze the differences between mRNA and protein expressions, we should note that factors having an impact on translational efficiency will have an impact on mRNA-protein correlation. Physical properties of the transcript have a great impact on translational efficiency. One example of such physical property can be Shine-Dalgarno (SD) sequence [58, 59] in prokaryotic transcripts. Transcripts that have weak SD sequence are translated less efficiently. The SD sequence may also be changed by mutation resulting in reduced translational efficiency [60]. Reduction in translation due to mutation in galE initiation codon has also been reported [60]. Another physical property influencing translation is the whole structure of the mRNA. Temperature may change the conformation of mRNA and thus influence translation which was reported in a study for E. coli [61]. In numerous organisms, multiple number of codons can be used to translate same amino-acid which is referred to as ‘codon-bias’ [62]. Codon adaptation index [63] is the measure for codon bias. It is reported that the mRNA-protein correlation is influenced more by codon bias than by SD sequence [64]. Number of ribosome in a transcriptional unit is called ribosome-density which has a major influence on efficiency of translation. A ribosomal density-mapping procedure to explore ribosome positions along translating mRNAs is described by Eldad et al. [65]. mRNAs that entered ribosome for translation (ribosome-associated mRNA) shows better correlation with proteins than typical mRNA expression [66]. Occupancy time of those mRNAs in ribosome also has an impact on translational efficiency which was observed in case of Yeast [67]. Variability (normalized standard deviation) of mRNA expression level during the cell cycle can also affect the mRNA-protein correlation. This effect is found in cell cycle data by Cho et al. [68] which was analyzed by Greenbaum et al. [67] and summarized as: "high variability results in high correlation with protein expressions". The average half-life of eukaryotic mRNA is reported to be 10-20h whereas the average half-life is 48-72h in eukaryotic proteins in a study in 1989 [69]. From a recent study on mammalian cells, it is reported that, mRNAs are 5 times less stable and 900 times less abundant than proteins and spanned a higher dynamic range [70]. In vivo half-life of a protein depends on its amino-terminal residue [71]. Phosphorylation, ubiquitination, and localization of proteins are some post-transcriptional factors which creates variety of half-lives in proteins [72]. Also variation in synthesis and degradation of different proteins creates varied half-lives for proteins which may affect the correlation of protein expression with mRNA. It is also reported that the correlation between half-lives of proteins and mRNAs can be low even when the actual expression level correlation can be higher and also mRNA and protein shares functional properties if they have specific combination of stability (‘stable mRNA stable protein’, ‘stable mRNA unstable protein’, ‘unstable mRNA stable protein’ and ‘unstable mRNA stable protein’) in mammalian cells [70]. The above mentioned stability issues have to be analyzed for modeling dynamic mRNA and protein expression inter-network. Finally, the experimental errors in the type of data extraction approach for protein and mRNA expression brings in extrinsic noise that significantly influences the correlation between mRNA and protein expression.

OVERVIEW OF DIFFERENT APPROACHES

In this section, we provide a brief overview of the proposed approaches in literature to jointly analyze transcriptomic and proteomic data. The methods for integrating and modeling transcriptomic and proteomic networks can be categorized primarily into eight different types as discussed next: Type 1: Union of Transcriptomic and Proteomic Data: This can be considered as one of the most obvious integration types. Approaches related to this type generally consider a union of two different data sets (proteomic data and transcriptomic data; not from the same sample) and then create a reference data set. The reference data sets have sometimes shown new insights and revealed previously undetected phenomenon or supported a new phenomenon as compared to the individual data-sets. There are a number of approaches related to this [73, 74]. A work on Bradyrhizobium japonicum bacteroid metabolism in soybean root nodules by Nathanael et al. [49] can be an example of this method. In this study, authors have compiled a reference dataset by combining (union) transcriptomic and proteomic data. Based on the reference dataset, they have discovered significant number of enzymes related to several types of bacterial metabolism that were not present in the dataset from proteomic study alone. Section 5.1 briefly reviews the approach considered by Nathanael et al. Type 2: Extraction of Common Functional Context of Transcriptomic and Proteomic Features: For various reasons [72], transcriptomic and proteomic data may not have direct overlap in features (here feature refers to different genes for transcripts and proteins). But features on transcriptomic and proteomic level might share the same functional context. These functional contexts may refer to different biological processes or pathways in which features from both transcripts and proteins are enriched. In this approach, the common functional contexts are extracted through the analysis of both transcriptomic and proteomic datasets on the level of protein interaction networks. This approach was published in 2010 by Paul et al. [75] which is discussed in section 5.2. Authors of this publication also generated omicsNET for finding dependency between features of proteomics and transcriptomics. A similar kind of approach (functional analysis) was applied for integrating transcriptomic and proteomic evaluation of gentamicin nephrotoxicity in rats by Com et al. in 2011 [39]. But the functional analysis was done by GO-Browser [2] (an in-house Gene Ontology based annotation tool) with help of Ingenuity Pathway Analysis software[3]. Based on the functional analysis, some gene ontology biological processes were selected which were enriched by the features of the transcriptomic and proteomic dataset with Fisher p≤0.05. This integration by functional analysis reveals a putative model of toxicity [39] in the kidney of rats. Type 3: Topological Networks Approach: Topological network methods (over-connection analysis, hidden node analysis, rank aggregation and network analysis) have been used to elucidate the common regulators (transcriptional factors and receptors) from two different types of data sets (transcriptomic and proteomic) by Eleonora Piruzian et al. [37]. This category of approach refers to locating upstream regulators of mRNA and proteins individually and collecting the common regulators in both the networks for a combined signaling pathway. Topological and network analysis was used in finding individual transcription factors (TF) of mRNAs and Proteins. The TFs that were not common in transcriptomic and proteomic profiles were ignored and the common TFs were used to find the most influential receptors that could trigger maximal possible transcriptional response. Among the receptors discovered from joint analysis, some of them were never reported as psoriasis markers in earlier studies while some of them have been reported before. In another recently published study [76], an integrated quantitative proteomic, transcriptomic, and network analysis approach was discussed which also reveals molecular features of tumorigenesis and clinical relapse. Section 5.3 discusses the approach of Eleonora et al.. Type 4: Merging datasets in individual domains: Type 4 integration merges multiple proteomic data sets into a merged-proteomic data set along with joining multiple transcriptomic data sets into a reference transcriptomic data set. The transcriptomic and proteomic datasets that are merged can be created by different transcriptomic and proteomic profiling respectively. After merging the datasets, correlation analysis is conducted between these 2 merged data-sets and it is shown that the coefficient of correlation is better than the one without merging. Furthermore, specific subsets of the merged data sets can have higher coefficient of correlation. Dov Greenbaum et al. [67] used such an approach in their publication in 2003 which is discussed in section 5.4. Type 5: Missing Value Estimation by non-linear optimization: This category of integration uses non-linear or linear optimization to predict missing values of proteomic data. It maximizes an objective function to find out the connections between transcriptomic and proteomic networks. However, they do not result in a dynamic model able to predict the abundance of next time point but rather, they are able to predict the protein expression at the same time point. A good example of non-linear optimization is a method described in Wandaliz Torres-Garcia et al. [77] for a study of Desulfovibrio vulgaris published in 2009. The method is based on stochastic gradient boosting tree (GBT) proposed by Friedman et al. [78]. Stochastic GBT optimization technique was also used in a study of Shewanella oneidensis in 2011 [79]. Artificial neural network approach was applied to find the missing values of the proteins using the relations between transcriptomic and proteomic data in a separate study published in 2011 [80]. In section 5.5, we briefly review the approach made by Garcia et al. [77] in their Desulfovibrio vulgaris study. Type 6: Multiple regression analysis to predict contribution of sequence features in mRNA-protein correlation: Protein abundance is not only related to corresponding mRNA abundance but also depends on other biological and chemical factors (termed as covariates). For this reason, the idea of multiple regression analysis is used to relate characteristics of different covariates of each individual gene with the mRNA-protein correlation. The multiple regression approach can possibly provide a better explanation of protein variability than traditional single regression technique. Effect of multiple sequence feature (one kind of covariate) on mRNA-protein correlation was discussed by Nie et al. in 2006 [81] where they have used multiple regression analysis. Example of another linear regression model can be Poisson’s linear regression model which has been used by Lie Nie et al. [47] to elucidate the relationship model of transcriptomic and proteomic networks. In section 5.6, we briefly explain the multiple regression analysis used in [81]. Type 7: Clustering Approaches: Clustering mRNA and protein abundance datasets individually and locating similarities (and hence correlation) between the individual clusters does not produce promising results (as explained in section 5.7). This failure leads to the assumption that concatenating the proteomic and transcriptomic datasets and then clustering the concatenated dataset may not be a good idea either (details in section 5.7). Based on these observations, a new clustering method called coupled clustering was implemented by Rogers et al. [82]. Couple clustering creates certain number of proteomic and transcriptomic clusters and provides the conditional probability of a gene to be in a protein cluster given that it is in an mRNA cluster. These conditional probabilities can reveal the relational complexity of mRNA and protein data. Rogers et al. used time series transcriptomic and proteomic data extracted under same experimental conditions. Section 5.7 discusses coupled clustering approach. We would want to emphasize that this type of approach is also not a dynamic modeling approach that can provide temporal predictions. Type 8: Dynamic Modeling: A number of studies reported in the literature have inferred dynamic models ( such as Boolean network, linear models, differential equation models, Bayesian networks etc.) of GRNs from time series transcriptomic data alone. For example, Liang et al. [83] used REVEAL algorithm for inference of Boolean network model from time series mRNA expression data. A basic linear modeling has been proposed by D’haeseleer [84]. GRN Models consisting of differential equations was employed by Guthke et al. [85]. Validation of inference procedures of GRN was discussed by Edward R Dougherty [86]. Friedman used Bayesian networks to analyze and model gene expression data [87]. Among the existing network models, Bayesian networks can be applied to combine heterogeneous data and prior biological knowledge. For example, Nairai et al. [88] used protein-protein interaction network data for refining the Bayesian Network model of the GRN produced by mRNA data alone. Yu Zhang et al. [89] used transcriptional factor binding site data and gene expression data (transcriptomic) to model GRN using Bayesian network approach. Werhli et al. [90] integrated multiple sources of prior biological knowledge (TF binding location) with microarray expression data to generate a Bayesian network model. Section 5.8 discusses approach used by Nariai et al. [88]. Similar kind of approach was also used by Segal et al. [91] where they identified pathways from microarray data and p-p interaction data.

SPECIFIC EXAMPLES FOR INTEGRATED ANALYSIS CATEGORIES

In this section, we provide details and specific examples for the eight categories of approaches for joint analysis of transcriptomic and proteomic data discussed in the previous section.

Type 1 Example: Integration of Omics Data Generated Under Symbiotic Conditions

Bradyrhizobium japonicum is a gram negative, rod-shaped, nitrogen-fixing bacterium that communicates with its host plant and develops a symbiotic partnership with its host. The host considered in [49] is the soybean plant Glycine max [4]. The complete genome sequence of Bradyrhizobium japonicum was identified by Kaneko et al. [92] where 8317 potential protein-coding genes were found. 66153 protein-coding loci have been identified in the genome sequence of Glycine max [5]. Nathanael et al. [49] built a database by combining the above mentioned 8317 proteins of Bradyrhizobium japonicum, 62199 of the above mentioned 66153 proteins of Glycine max and 258 contaminating proteins. They searched the combined database to locate the experimental protein extracts of Bradyrhizobium japonicum soybean bacteroids. GeLC-MS/MS experimental data was used for this study. A probability-based protein identification algorithm [93] was employed to identify the proteins from mass spectrometry data by searching the sequence database. The use of combined database was beneficial because of the fact that soybean proteins present in the nodule extracts of Bradyrhizobium japonicum and Glycine max might have symbiotic relations. 2315 proteins in the experimental dataset were also reported to be present in the combined database. The expression of 2780 B. japonicum genes in soybean nodules was reported by Pessi et al. [94] in a transcriptomic study in 2007. The experimental condition of this genomic study was same as the proteomic study made by Nathanael et al. [49]. In both the transcriptomic and proteomic analysis, stringent filtering criteria for normalization procedure were applied. Several statistical analysis tests (Wilcoxon rank-sum and the student t-test with a P-value threshold of 0.01) based on numerous biological replicates were applied in the microarray based transcriptomic study to prevent erroneous conclusions. The use of transcriptomic expression profiling alone has some limitations (such as limitations of the array i.e. not all the genes are present in the array, concealment of true expression levels due to bias of the probe set, etc.) which are also somehow true for using only proteomic expression profiling. Thus, the authors chose to integrate (through a simple union method) both the data sets and use this list of genes as reference data set for bacteroid expression. In total, 3587 transcriptomic (A) and protein (B) expressions in soybean bacteroids were recorded in the union set (AUB). The number of elements in the set A∩B was 1508. 807 proteins were identified to be expressed by only the proteomics approach (B−A) and 1272 genes that have been identified as expressed only in the transcriptomic study (A−B). Among the set A−B, 47 were RNAs (45 tRNAs, rnpB, ssrA2) and the remaining 1225 were protein encoding genes Fig. (). Summarizes the method of integration. A list of 15 gene function categories[6] were observed for the 3 different datasets: (i) the dataset X by Kaneko et al. [92] consisting of 8317 protein encoding genes, (ii) the reference datasets AUB consisting of 3540 (3587−47) protein encoding genes and (iii) the dataset X−(AUB) consisting of 4777 protein encoding genes i.e. the protein encoding genes that are not detected by Nathanael et al. [49]. The number of genes present in the 3 datasets for each category was detected. The number of genes/proteins present in each category was divided by the total number of genes/proteins in that data set and thus relative frequency of each category for the 3 datasets was established. In 4 among the 15 categories, it has been observed that the relative frequency in the reference dataset (AUB) is less than the relative frequency in the Kaneko dataset (X). This means that in only 4 categories, the reference dataset AUB was less capable to represent that category than the dataset X. The reference dataset revealed novel insights regarding some aspects of bacterial metabolism (e.g. Nitrogen metabolism, Carbon metabolism, Nucleic Acid metabolism) and also regarding translation and post-transcriptional regulation. For example (i) some key regulator proteins (e.g. GlnA, GlnB, GlnK and GlnII) of N metabolism was identified in the reference dataset. (ii) The authors reported that all the enzymes related to C4 metabolism were detected for the 1st time as well as almost the entire set of gluconeogenesis related enzymes was identified in the combined reference data set. (iii) In a study of global protein expression pattern of Bradyrhizobium japonicum bacteroids by Sarma et al. [95], nucleic acid metabolism related proteins were reported to be lacking in the total protein expression pattern of the nodule bacteria. But, in this study, authors have found almost all enzymes related to de novo nucleoside and nucleotide biosynthesis either in the gene or in the protein level or in both. (iv) The reference dataset also comprises a large number of proteins related to transcriptional and post-transcriptional regulation. Also, the enzymes related to protective response (to reactive oxygen species) under stress were also discovered in the reference dataset. In this context, it can be mentioned that, glutathione is crucial for biotic and abiotic stress management for plants, thus, detecting all the enzymes required for glutathione synthesis and reduction strengthens the richness of the reference dataset. Very few enzymes in metabolic pathways were discovered in the dataset derived from the proteomic study alone (dataset B). Thus the combination of data can provide novel insights for carbon and nitrogen metabolism.

Type 2 Example: Functional Analysis of Transcriptomic and Proteomic Data

An approach for linking transcriptomic and proteomic data on the level of protein interaction network has been discussed in a recently published paper [75]. Transcriptomic and proteomic datasets characterizing the chronic kidney disease (CKD) have been used to illustrate the procedures for integrating omics profile at the level of protein interaction networks. Three publicly available studies on CKD by Schmid et al. [96], Baelde et al. [97] and Rudnicki et al. [98] are used for identifying deregulated features on the mRNA level. 697 differentially regulated genes were selected from the 3 studies creating the transcriptomic dataset. The proteomic dataset was extracted from the online database HUPDB v2.0 [7]. A total of 192 samples were used and after comparing with the CKD and healthy references, 37 proteins were identified as differentially abundant. HUPDB was selected as the only source to avoid heterogeneity of datasets. Swiss-prot annotation tool was used in this study (as HUPDB uses Swiss-prot names as identifiers) to map the proteins to the gene symbols. Swiss-Prot [99] or UniprotKB is a protein sequence database which provides all known relevant information about a particular protein. The details about Swiss-prot entry annotation can be found in . The following five different analysis procedures have been discussed and compared extensively in this study to elucidate the correspondence between transcriptomic and proteomic data: (i) Direct feature overlap, (ii) Functional overlap; here features (genes) related to different biological processes that are found in transcriptomic or proteomic datasets or both has been analyzed, (iii) Joint pathway analysis, (iv) Protein dependency graph analysis and (v) Direct edges between transcripts and proteins.

Direct feature overlap:

Here ‘feature’ refers to different genes for transcripts and proteins. Features present in both the transcriptomic and proteomic lists were identified. Genes of 4 proteins[8] (out of 37) were also reported to be differentially expressed in the transcriptomic dataset. This overlap was confusing as only 1 of them was upregulated in both the datasets but the other 3 shows upregulation in one dataset and downregulation in another.

Functional Overlap:

PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification system [100, 101] classifies proteins and their genes to facilitate high-throughput analysis. The classification of proteins were done according to ‘family and subfamily’, ‘molecular function’, ‘biological process’ and ‘pathway’. Multiple gene lists can be uploaded in the PANTHER system and jointly compared against a reference dataset to look for under and over represented functional categories based on either ‘chi-square test’ or ‘binomial statistics’ tool. Here, the authors used PANTHER to identify enriched biological process. They have used fully annotated set of human genes as a reference dataset and used chi-square test with a p-value less than 0.05 to identify significantly enriched or depleted biological processes. They had identified 27 biological processes [9] that are relevant to the transcriptomic and proteomic datasets. Four of the processes were found to be enriched by both the transcriptomic and proteomic feature set. Other processes were enriched by either transcriptomic or proteomic features.

Joint Pathway Analysis:

The laboratory of Immunopathogenesis and Bioinformatics (LIB) developed the DAVID (the Database for Annotation, Visualization and Integrated Discovery) tool [102] to provide functional interpretation of large lists of genes derived from different genomic studies. KEGG pathway database is used as a repository for applying DAVID tool in this study. Seven pathways [10] are found to be significantly enriched in deregulated transcripts and/or proteins using Fisher exact test with p-value less than 0.05. Among these 7, 3 pathways are enriched with both the transcriptomic and proteomic features and 4 pathways are enriched with either transcriptomic or proteomic features.

Protein Dependency Graph Analysis:

PANTHER and KEGG do not cover all the features found in transcriptomic and proteomic dataset used in this study. So, the authors had developed undirected protein interaction network omicsNET [103] which includes all protein encoding genes as nodes in the network. It has edges between the nodes with edge weights referring to dependency measures between the pair of nodes. The dependency measures were determined using Gene Expression Omnibus Human Body Map, the MicroCosm database, Gene Ontology data on molecular processes and functions, PANTHER, KEGG and IntAct databases. The research team found 65 strong dependencies in omicsNET between the features of transcriptomic and proteomic dataset of this study. The features that are involved in the dependency graph include 21 features from transcriptomic dataset, 21 features from proteomic dataset and 2 features from both. Also, dependencies between the features related to blood coagulation cascade was analyzed using omicsNET on different edge weight values.

Direct edges between transcripts and proteins:

MAPPER[11] (Multi-genome Analysis of Positions and Patterns of Elements of Regulation) is a platform for identifying transcription factors binding sites (TFBSs) in multiple genomes [104]. Binding sites of 4 transcription factors were identified in ORF (open reading frame) regions of the 37 proteins using MAPPER. 2 of these 4 transcription factors shows upregulation and 2 shows downregulation in mRNA level. At least 1 of these 4 transcription factor binding sites are present in 13 proteins of the protein dataset. This fact reveals some direct edge between transcripts and proteins. Use of direct feature overlap gave only limited and ambivalent results. But this limited overlap increases when enriched biological processes were identified based on transcriptomic and proteomic datasets. Several biological processes were identified significantly enriched with both transcriptomic and proteomic features. Mapping transcriptomic and proteomic features on different KEGG pathways also reveals significant involvement of both transcriptomic and proteomic features; three KEGG pathways are found to be enriched with them. And finally using omicsNET for overcoming the shortcomings of PANTHER and KEGG gives significant outcome for understanding the interactions of the transcriptomic and proteomic network. The main advantage of functional analysis of transcriptomic and proteomic data is that different pathways and processes for the genes under analysis become evident. Although dependency measures can be found between transcripts and proteins from omicsNET, the main shortcoming is that it cannot create a dynamic model involving transcripts and proteins.

Type 3 Example: Comprehensive Meta-Analysis

Topological network analysis was used to reveal the similarities and differences between transcriptomic and proteomic-level perturbations in psoriatic lesions in a recent study [37]. The transcriptomic data related to psoriatic lesions contains 462 over-expressed transcripts and the proteomic data contains 10 abundantly expressed proteins. Unlike most of the studies, this study shows good consistency between the proteomic and transcriptomic dataset as 7 out of the 10 protein encoding genes were also over-expressed in the transcriptomic dataset. But the significant difference in the magnitudes of the 2 dataset hinders direct correlation analysis. Rather than analyzing correlation between them, the authors used topological network approach to discover regulatory transcription factors, receptors and their ligands to reconstruct the network between them. Their approach produced biologically meaningful results and revealed unknown regulatory receptors that may be related to psoriatic lesions. The methods used in the study include (i) Over-connection analysis (ii) Topological analysis: hidden node analysis (iii) Rank aggregation and (iv) Network analysis.

Interactome Overconnectivity Analysis:

It is assumed in this analysis method that the expression values of transcripts and proteins follow hyper geometric distribution. The method for finding overconnected regulators (transcription factors) of a target dataset is described in the supplementary material of publication by Nikolsky et al. [105]. This overconnection analysis mainly ranks transcription factors (assign a score to it). The score or significance of a transcription factor (taken from global gene database or manually curated gene database like MetaCore [12]) is a function of ‘hypergeometeric distribution probability mass function’ [37, 105].

Hidden Node Analysis:

The complete algorithm for hidden node analysis [13] has been discussed by Dezso et al. [106]. Here we’ll try to demonstrate what it actually does by a very simple example. Fig. () demonstrates 4 genes (nodes) x1, x2, x3 and x4 which are over-expressed or abundant in a transcriptomic or proteomic dataset. Hidden node analysis reveals node x5 which was not present in the experimental data but is the key to regulate downstream effects of targets x2, x3 and x4. The members of the hidden nodes may come from a global database or manually curated database like MetaCore.

Rank Aggregation:

When we have multiple ordered lists, rank aggregation approaches can be used to combine them into a single list. Rank aggregation can be formulated as a optimization problem [107] with the objective function being the weighted sum of distances of the original list from the combined list where L is the combined list and L denote the individual lists and d is a distance function. An example of the distance function is the Spearman footrule distance which is the absolute sum of the differences in the ranks of the unique elements of the individual list and the combined list. The optimization can be carried out using approaches such as Cross-Entropy Monte Carlo stochastic search [108] or genetic algorithms.

Network Analysis:

A typical network analysis helps to select biologically connected meaningful sub-networks with relevant objects. For example, 10 over-expressed proteins were taken as relevant objects in this study and published literature was used to come up with regulatory transcription factors, receptors and kinases. For the integrated study, 10 common transcription factors (top ranked) of transcriptomic and proteomic data types were identified using overconnection analysis and hidden node analysis. But before this step, 20 TFs were identified for each data type using topology analysis and 5 TFs were identified using network analysis approach. These 10 common nodes in each data type shows resemblance with the TFs found from literature search. Next, hidden node algorithm was used to find the most influential 44 membrane receptors that are present in the same signaling pathway with one of the 10 common TFs and whose genes or corresponding ligands were 2.5 fold (or greater) over expressed in the experimental data. For the hidden node analysis to find the 44 receptors, the target set consisted of 462 differentially expressed genes. Among the 44 receptors, 22 were previously reported to be related to psoriatic lesions. 14 of them shows possible but not confirmed relation to psoriatic lesions and rest of the 8 receptors were not reported before. Thus, topological analysis can be applied to identify common regulators from two different datasets. The common regulatory machinery can be applied to arrive at a biologically meaningful signaling pathway that can be verified through new experiments or existing reported results in literature.

Type 4 Example: Merging Datasets into Meta-Datasets

In a review paper published in 2003, Dov Greenbaum et al. [67] discussed the results of the comparisons of different approaches on correlating mRNA expression with protein abundance. The authors focused on yeast. At first, they created a ‘reference mRNA dataset’ using iterative combination of different datasets (as shown in Algorithm 1) that had been earlier discussed by Greenbaum et al. [38] in 2002. They used 4 different datasets from 4 different studies: 3 of the datasets [109-111] used Affymetrix chips and the remaining dataset [112] used the SAGE method. The 4 datasets were merged using algorithm 1 to create the ’mRNA reference dataset’. As a following step, four proteomic datasets were from Gygi et al. (2DE-1 dataset [3]), Futcher et al. (2DE-2 dataset [113]), Washburn et al. (MudPit-1 dataset [114]) and Peng et al. (MudPit-2 dataset [115]) were merged to create a reference protein dataset. The merging technique is illustrated in algorithm 2. The correlation coefficient (r) of the reference mRNA dataset and the reference protein dataset was derived to be r=0.66. Correlation for some smaller functional categories of proteins was also calculated and a mix of higher correlation values [14] as well as lower correlation values [15] was reported. The ideas of merging mRNA datasets (Algorithm 1) and merging proteins (Algorithm 2) have some sort of similarity. The mRNA merging technique uses one of the mRNA datasets it is merging as a reference that is used to find the regression parameters while the protein merging technique uses the reference mRNA dataset in this regard. The values of α and β is an important factor in mRNA merging technique while the quality ranking of the protein datasets have important influence on merging the protein datasets. The order 2DE−1>2DE−2>MudPit−2>MudPit−1 is used as quality ranking that depends on the confidence level of the accuracy of the datasets.

Algorithm 1:

Algorithm for generating the mRNA reference dataset Let X1, X2....Xn define different mRNA expression datasets from n different experiments using Gene Chips. Let the last dataset Xn denote the dataset with highest accuracy. Let Xn be denoted by X. The dimension of these datasets may not be equal i.e. each dataset does not contain mRNA expression of all N genes that are present jointly in the n datasets. Here X will denote the mRNA expression of the ith gene in the jth database i∈1,2,...N and j∈1,2,...n. Initialize MergedData=[] Set α=0.15. This can be changed. =1:n−1 Find the common genes present in X and X Find the parameter p1 and p2 while minimizing the expression where k is the member of common gene set in X and X for i=1:N gene i is present in both X and X and gene i is present only in X MergedData(i)=X gene i is present only in X MergedData(i)=X gene i is present neither in X nor in X Do nothing. mRNA expression for ith gene will not be present in the MergedData Make the MergedData as the X for next iteration Let S denote the SAGE data. Set β=16. for i=1:N gene i is present in both MergedData and S and MergedData(i)>β and MergedData(i)

Algorithm 2:

Algorithm for creating Protein reference dataset Let P1, P2.....P Denote protein expression datasets from n different experiments. Let the last dataset P denote the dataset with the highest confidence level in its accuracy. The dimension of these datasets may not be equal i.e. each dataset does not contain protein expression of all N genes that are present jointly in the n datasets. Here P will denote the protein expression of the ith gene in the jth database. i∈1,2,...N and j∈1,2,...n Let M denote the mRNA reference dataset created using algorithm 1 Find the parameters a and b while minimizing the expression where k is the member of the common gene set in P and M and j∈1,2,....n. Find Y for all jϵ1,2,...n. This is the transformation of the protein databases into mRNA reference dataset. Find for all j∈1, 2,...n. This is the inverse transform into the protein space using parameters of the most accurate set P. Combine the set into MergedData. To do this, order of confidence for the original datasets will be used i.e. if expression for gene i is present in multiple datasets, then the value of that gene in MergedData will come from the dataset with the highest confidence. The authors’ argument in support to their proposed protein merging technique is that the resulting protein reference dataset is a better quantitative and a representative one that is easier to compare with the mRNA expression dataset and is supported by the increased correlation coefficient in some functional categories. Thoughtful scaling techniques were applied to avoid biases in the datasets and in case of multiple possibilities of entering values into the reference dataset from individual datasets, a plausible method of quality ranking was used. So, the ultimate success can be viewed as finding higher correlations among different functional categories. The lower correlation in a functional category reflects heterogeneity in that category. We should note that often due to different half-lives of mRNAs and Proteins, we will observe lack of correlation between transcriptomic and proteomic datasets measured at the same time under symbiotic conditions. Merging of multiple datasets normalizes and integrates the expression values which are more likely to have a higher correlation. But the emphasis on correlation can be sometimes misleading as correlation measures the linear dependence and a perfect non-linear dependence between two variables can be ignored by correlation analysis.

Type 5 Example: Non-Linear Optimization Model to Integrate Transcriptomic and Proteomic Data

Garica et al. [77] implemented stochastic Gradient Boosting Tree (GBT) approach to infer non-linear relationships between mRNA and protein expression data and estimate the missing protein expressions using the generated relationship. They had mRNA expression data of around 3500 genes and protein expression data of around 800 genes of Desulfovibrio vulgaris. After locating the non-linear relationship and the missing protein expression values, they validated the result using knowledge from literature. The total procedure is shown in Fig. ().

Gradient Boosting Tree Method:

Gradient boosting approach was proposed by Jerome H. Friedman in 2001 [116]. It is a nonlinear regression technique that produces a prediction model in the form of decision trees. Some modification of this method has been proposed by Friedman in 2002 [78]. The method uses a training set (x1,y1), ..., (x,y) that tries to find an approximation Ĝ(x) to a function G(x) that minimizes the expected value of some specified loss function φ(y,G(x)). Several loss criteria can be used including least squares: least absolute deviation: etc. Garcia et al. used least squares criteria. The total procedure was divided into α iterations. Each iteration is called a regression tree. In each tree, pseudo residuals of the dependent variable (here, proteomic expression) of the training dataset were located. Then the total input space has been divided into β disjoint regions using least square splitting criterion [87]. These regions are called the leaves of the tree. In each region, a multiplier value ηαβ is calculated here α denotes the α tree and β denotes the β leaf of that tree. Thus, the approximation function Ĝ(x) mainly depends on the splitting variables and splitting points for each tree and also on the value of η in each leaf. Using the training dataset, these variables are estimated and used in future prediction. Algorithm 3 demonstrates how Gradient Boosting Tree method works. Here the modified version by Friedman which he called TreeBoost is shown. Friedman introduced another parameter ξ (0<ξ<1) which controls the learning rate of the algorithm. This is called ‘shrinkage parameter’. In each tree of each iteration, the value of η is multiplied by the value of ξ. According to Friedman [116], choice of the value of ξ is important for the performance of the algorithm; small values cause less prediction error.

Stochastic Gradient Boosting Tree:

A small change in the gradient boosting tree method can make it stochastic. For Stochastic GBT, a random subset of training dataset is used in each iteration rather than using the total training dataset. According to Friedman, the incorporation of randomness and the use of training data subset improve the performance of prediction as well as reduce computational complexity.

Algorithm 3

Algorithm for implementation of Gradient Boosting Tree method (modified version ‘TreeBoost’) a=1:α Compute pseudo residuals of the dependent variable of the training dataset: for i=1,...,n Divide the training data space into β different regions R1a, R2a...Rβa using pseudo residuals. Least square splitting criterion is used to split the region. Compute multiplier ηba for each region b (b∈0,1,2...β) by solving the following optimization: Update the model: , where I(.) is the indicator function. endfor Output Fα(x).

Validation process:

The validation of the predicted missing values is important for performance analysis. Garcia et al. used existing biological knowledge to validate their results. The biological knowledge of Desulfovibrio vulgaris includes (i) functional categories of all genes (20 categories found from Comprehensive Microbial Resource [117]), (ii) sequence length, protein length, molecular weight, guanine-cytosine and triple codon counts of all gene, (iii) a total of 609 operons of Desulfovibrio vulgaris consisting of 2 to 13 genes in each operon (iv) Regulons of Desulfovibrio vulgaris and (v) 92 metabolic pathways (KEGG pathways) for microbial genomes. The Coefficient of Variation (CV) for different operons and regulons are calculated using the predicted values of proteins by dividing the standard deviation (SD) by mean expression value of each operon/regulon/pathway group. Let an operon or a regulon or a pathway group has n genes in it; a random set consisting n genes was created and CV (CV) was calculated for that set of genes. This process was repeated for 1000 times for each operon/regulon/pathway and mean for CV was calculated. The mean CV was then compared against the CV of the original operon. The idea is that the variability of genes in operons or regulons will be less than the variation of random set of genes because the genes in an operon or regulon are supposed to be expressed together and are relationship in their expressions. If the CV of an operon is found to be less than the mean CV, then it is concluded that, the predicted values of that operon is somehow close to accurate. The results found by Garcia et al. shows that a large portion of the operons/regulons/pathway groups indeed has less variability than the variability of randomly created groups of gene. Another measure used in this study for understanding the less dispersion of gene expressions in operons/regulons/pathways is ‘percentile score’. It’s the percentage of 1000 set of random genes for each operon/regulon/pathway group which have CV values (CV) less than the CV of original operon/regulon/pathway. Thus, the percentile score is expected to be less to prove that the variability in expression of genes in an operon/regulon/pathway is less dispersed than the random set of genes. Correlation between protein and mRNA expression was measured for each operon/regulon/pathway groups and also for all the genes. It was found that the correlation was stronger in most of the individual operons and pathway groups than the correlation for all genes. Small fraction of the regulon groups showed better correlation. The method applied for the validation process of this study clearly gives an idea about the overall prediction accuracy but does not guarantee a good prediction. A poor prediction for all the genes in an operon might produce CV less than the mean CV if the overall predictions for other operons are also similarly poor. A different option for validation may be combination of this approach and incorporation of a testing dataset for cross-validation. The testing dataset will be a set of mRNA and protein datasets other than the training data whose actual values are also known. However, it will obviously reduce the size of training dataset. The predicted values can be compared with the original values along with the method of validation using biological knowledge.

Type 6 Example: Linear Regression Model to Integrate Transcriptomic and Proteomic Data

In a study on Desulfovibrio vulgaris by Nie et al. [81], the effect of sequence features in different translational stages on the correlation of mRNA expression and protein abundance has been discussed. Multiple regression analysis that has been previously discussed in another paper by Nie et al. [118] was applied to predict the contribution of different sequence features on the correlation of mRNA and protein abundance.

Sequence features in translational stages:

A sequence feature [16] can be defined as an entity or data located in DNA or RNA sequences that are responsible for different biological phenomena. For example, Shine Dalnargo sequence is a sequence feature which is mainly a ribosomal binding site in mRNA and it helps the ribosome to start synthesis of protein. Other examples of sequence features can be start codon, stop codon, codon usage etc. In prokaryotes, translation can be divided into 3 stages: initiation of translation, elongation of translation and termination of translation. Lithwick et al. [64] demonstrates hierarchy of sequence features related to prokaryotic translation. Shine Dalgarno sequences, start codon identity and start codon context are examples of initiation feature; codon usage and amino acid usage are examples of elongation feature; stop codon identity and stop codon context are examples of termination feature.

Multiple Regression Analysis:

A simple regression analysis can be expressed through the following equation: Protein The target is to find A and B that relates the two variables. Here, mRNA and Protein are logarithm of the mRNA and protein value of gene i respectively. Nie et al. [118] reported that only 20−28% (Pearson correlation coefficient R2) of protein variability can be captured by simple regression analysis. This is because, protein abundance is not only related to corresponding mRNA abundance but also depends on other different biological and chemical factors (termed as ‘covariate’). So multiple regression analysis is required which can be expressed as: where Protein and mRNA are the protein abundance data and the mRNA expression level for the ith gene respectively. Covariate refers to the j covariate of the i gene. B represents the slope for the j covariate. Nie et al. [118] found that 52−61% (Pearson correlation coefficient, R2) variability of protein can be captured by this multiple regression analysis. In this study of effect of sequence features, Nie et al. [81] used the sequence features in different translational stages as covariates and performed multiple regression analysis to locate the sequence features that has the highest effect on the mRNA-protein correlation. They have done multiple regression analysis for each type of sequence feature (i.e. sequence features related to initiation stage, elongation stage and termination stage) separately and also for a combination of sequence features. The results showed that the sequence features are significantly responsible for the variation in mRNA-protein correlation. And also mRNA-protein correlation was affected the most by elongation stage features. The method of finding the effect of covariates in mRNA-protein correlation using one of the 3 datasets can be visualized by the flowchart shown in Fig. (). Three different datasets containing transcriptomic and proteomic measurements were used in this sequence feature analysis of Desulfovibrio vulgaris. The three datasets were expression levels under three different growth conditions (lactate, formate and lactate-stationary). Partial correlation coefficient was used to find the contribution of specific sequence features in the variability. The partial correlation coefficient can be interpreted as: where is the Pearson correlation coefficient using only simple regression; is the Pearson correlation coefficient when multiple regression model is used with sequence features included. Standard F-test [119] was used to examine the significance (P-value for the F-test) of each covariate. The for different sequence features varied; for ‘SD sequence’: 1.9%−3.8%, for sum of ‘start codon’: 0.1%−0.7%, for ‘start codon context’: 0.3%−2.6%, for sum of ‘codon usage’: 5.3%−15.7%, for sum of ‘Amino acid usage’: 5.8%−11.9%, for sum of ‘stop codon’: 1.3%−2.3%, for sum of ‘stop codon context’: 3.7%-5.1%. The sum of the individual values for all the sequence features are ranged in 21.8%−39.8% where the for sequence features together in a single multiple regression ranged in 15.2%-26.2%. So, the analysis proved that, among multiple sequence features, ‘amino acid usage’ and ‘codon usage’ are the top factors that affect the mRNA-protein correlation. The results were validated by conducting similar analysis where sequence features were kept same for all the genes but protein values were randomly assigned to the mRNA values. The resulting P-value in this validation stage analysis was found to be less than the original P-values which indicates better statistical significance of the model found. Pointing out the factors affecting the mRNA-protein correlation was the major contribution of this study. These results can be utilized in creating robust model for mRNA and protein expression values. This is another proof of the fact that only mRNA expression does not necessarily have the power of predicting the protein expression. Complex biological factors such as sequence features related to translational stages should have a significant role in their prediction procedure.

Type 7 Example: Correspondence Between Transcriptomic and Proteomic Expression Profiles Using Coupled Cluster Models

The mRNA or protein expression of a random set of genes is likely to show multiple different levels of expression, but genes involved in similar functions or having similar effects on cellular regulation might show close expression levels. A mixture model [120] generally clusters such datasets into a predefined number of sub-sets in an unsupervised manner. For example, Gaussian mixture model is an unsupervised clustering algorithm where it is able to create soft boundaries among the clusters, i.e. points in the space can be present in any cluster defined by a given probability. This is primarily a mixture of a certain number of Gaussian distributions with unknown parameters where each Gaussian distribution fits its corresponding cluster. Estimation maximization (EM) algorithm [121] is used to find the parameters of the Gaussian distribution and the cluster probabilities. Simon Rogers et al. [82] proposed a coupled mixture model to investigate the correspondence between transcriptomic and proteomic expressions. The dataset consisted of transcriptomic and proteomic profiles of 542 human genes from the Human Mammary Epithelial Cell line (HMEC). Measurements were taken between 0 and 24h after the cells were stimulated with Epidermal Growth Factor (EGF). There were a total of 6 transcriptomic (mRNA) measurements (1 hr, 4 hr, 8 hr, 13 hr, 18 hr, 24 hr) and 7 proteomic (proteins) measurements (15', 1 hr, 4 hr, 8 hr, 13 hr, 18 hr, 24 hr). The mRNA and protein datasets were clustered individually using Gaussian mixture model and the similarity between the two sets of clusters were determined by standard Rand index [122]. The standard Rand index is ranged from 0 to 1 where 1 denotes that the two cluster sets are exactly same. It was observed in the study that the two cluster sets showed very little similarity. The large dissimilarity suggested that if the two datasets were clustered after concatenating them into a single dataset, the number of clusters could have been as large as 20*15=300 which was impractical as the total number of genes was 542. Also, by comparing gene ontology (GO) enrichment analysis, it was discovered that the individual clustering produced different biologically meaningful clusters which were lost when clusters were created after concatenation. The failure of individual and concatenated clustering lead to the implementation of coupled mixture models described in this study. The ideas of clustering individually and clustering after concatenation are illustrated in Figs. ( and () respectively. In coupled mixture modeling, the mRNA dataset was clustered into ‘U’ different clusters with p(u)(u∈1,2...U) denoting the probability that mRNA expression of a gene belongs to the u cluster. Similarly protein dataset was clustered into ‘V’ different clusters with p(v)(v∈1,2...V) denoting the probability that protein expression of a gene belongs to the v cluster. The joint probability can be described as p(u,v)=p(u)p(v|u) where p(v|u) is the parameter that provides the relationship between mRNA expression and protein level. The EM algorithm was used to maximize a log-likelihood function (equation 1 in supplementary material of [82]) to infer the desired parameters. The number of clusters (U and V) in each dataset was derived to be U=15 and V=20 using Bayesian Information Criterion (BIC, proposed by Gideon E. Schwarz [123]). Fig. () demonstrates the coupled mixture model. The values of p(v|u) can unravel important information about the complexity of the relationship between mRNA and protein expressions. For example, the protein cluster v=4 had a total of 19 proteins in it, 18 of those were ribosomal proteins. There were 7 mRNA clusters which had positive p(u|v=4) within the protein cluster v=4. The most connected mRNA cluster with this protein cluster was the cluster u=3 because p(u=3|v=4)=0.3653 (which was the highest among all p(u|v=4)). If we look at the other protein clusters which are related to this mRNA cluster u=3, we’ll see that, there are 14 protein clusters which have positive p(v|u=3). This complex set of information suggests that indeed the relationship between mRNA and protein expression is a complex one; this decision could not be made with the results of individual clustering and concatenated clustering technique. The inference of complex relationships from those conditional probabilities remains open; may be use of more biological knowledge about the involvement of different sets of mRNA and proteins in different biological processes will reveal the relationship more clearly. Rogers et al. concentrated on three biological phenomena and others are still open problems. The three biological phenomena they dealt with are: (i) conserved behavior of ribosomes occurring at the protein level, (ii) discovering interesting set of genes involved in cell-adhesion and (iii) The role of TCP-1 as a protein folding machine. The choice of the model in each component of the mixture is not limited to Gaussian. Other models can be used to construct mixture models such as ordinary differential equation model used by Chudova et al. [124] and B-splines used by Luan and Li [125]. We should note that the mRNA and protein expression of 542 genes used in Rogers et al. was a subset of the original dataset that had a lot of missing values for proteomic expression. The genes that had both transcriptomic and proteomic values present were used in the study. An interesting idea may be to use missing value prediction method described in section 5.5 (method by Garcia et al. [77]) to complete the original dataset and use that in this study. Combining the approaches by Garcia et al. [77] and Rogers et al. [82] can possibly improve mRNA and protein correspondence when missing values possess a significant issue.

Type 8 Example: Dynamic Models

Nariai et al. [88] used cell cycle microarray data [126] of Saccharomyces cerevisiae and 9030 protein-protein interaction data derived from MIPS database [127] to construct a Bayesian network model. The authors proved that the use of p-p interaction data had refined the estimated gene network produced by using only microarray data. The algorithm used to construct the network can be simply illustrated by the flowchart showed in Fig. (). The algorithm is designated as the greedy hill-climbing algorithm. In the greedy hill-climbing algorithm, each network was evaluated by Bayesian network and Nonparametric Regression Criterion (BNRC) score [88]. Parents of each gene (genes that regulate a gene are called parents of that gene) were determined using this algorithm; the parents can be protein complex or other genes. Principal component analysis was used to find the protein complexes that were involved in regulating certain genes. Three gene networks were estimated: (i) by using only microarray data (ii) by using only p-p interaction data and (iii) by using the greedy hill-climbing algorithm. The three networks were then compared with the KEGG compiled network for evaluation. The network edges agreed with the KEGG pathway used for comparison. By using 350 chosen genes from the MIPS functional category mitotic cell cycle and cell cycle control, 34 protein complexes were discovered in this study (22 of these 34 complexes are listed in MIPS complex catalog). Also, incorporating phase information of cell cycle (e.g. G1/S, S, M phase) revealed biologically important relationships of several genes that are not included in the KEGG pathway. It is well established that inferring GRNs from mRNA data alone has a huge computational complexity and often lack in accuracy. Thus, a number of studies [88-90] used prior biological knowledge to reduce the complexity and/or incorporated other type(s) of data to increase the accuracy of the inference process. But to our knowledge, no such study has been done that uses protein abundance data along with transcriptomic data to infer a GRN which can reveal a dynamic relationship model between transcriptomic and proteomic network.

DISCUSSION

Studies have shown that there exists a poor correlation between mRNA expression and protein abundance [1-5]. Some possible reasons based on protein half-lives, post transcription machinery etc. has been proposed. It is interesting to locate analogies between biological scenarios and other physical scenarios so that approaches used for the analysis of one can throw insights and be possibly used for the analysis of the other. For instance, Wang et al. [128] compared the gene-mRNA-protein structure with computer’s internal structure and proposed a theory that can be used to verify the reason behind the poor correlations. On a similar note, we propose that a gene-transcriptome-proteome network has a number of similarities with an organizational command structure. We next show the relations between the two using a military command system. The military headquarter can be assumed as the main data processor and command center during war time. Headquarter send commands to the base camp that actually controls the on-field troops. The on-field troops directly take part in the operation. Command and direction of the headquarter can be viewed as the gene sequence in DNA which encodes the proteins. The base camp command can be viewed as the transcriptome and the on-field troops can be viewed as the proteome. The factors that affect any on-field troop in the operation can be viewed as metabolites and other external conditions. A troop may not be always able to exactly perform as commanded and sometimes the base camp cannot pass on the exact command from the head quarter due to on-field scenario. Furthermore, the base camp may have a different strategy to implement the commands relayed from the headquarter resulting in a delay in the implementation of the headquarter command. Thus, reflection of any command may not be instantaneous. Similarly, transcriptomic existence does not guarantee instantaneous or even delayed version of proteomic abundance. Furthermore, a command from the headquarters may not require completion depending on the on-field situation and instantaneous thinking. Thus, delay of the command propagation plays an important role. Similarly, the delay in creating a protein from gene-mRNA-protein (central dogma) system may cause the desired protein ‘not needed’ at all. The above mentioned analogy can be supported by the fact that the command may be adaptively changed based on the needs of the troop or success i.e. there is a feedback from the troop which may alter the war-plan. Similarly, feedback from outside factors and proteins can control the ’on’ and ’off’ mechanism of genes. Thus, to understand the biological mechanism and associated network, we need to have a detailed idea of how the proteins and mRNAs react to outside stimulations and how the commands from the genes are neglected resulting in a poor correlation between transcriptomic and proteomic data. In brief, we need to broaden our view just from transcriptomic abundance and proteomic abundance and consider an integrated transcriptomic-proteomic approach incorporating other factors such as external conditions, metabolites etc. The relation between transcriptomic and proteomic domain can be better understood if a time series of gene and protein expression for single cells are available for a good length of time with high sampling frequency. As mentioned earlier, similarities between different domains allow us to apply techniques developed for one domain in another and also provide unique viewpoints for understanding the system behavior. Since biological networks are extremely complex and large-scale, a natural question arises whether we can relate them to other complex networks such as social networks, communication networks, web graphs etc. [129-131]. The recent development of online social networks offers an analogy between the molecular biology networks and social networks. The social networks can be considered to have two primary domains: one of them consists of the network of physical relations as manifested through verbal communications in workplaces, schools, neighborhood etc. and the other is the network of online relations through social media such as facebook, LinkedIn, twitter, Wikipedia, play station networks etc. We can associate the transcriptomic domain as the physical relation network and the proteomic domain as the social media network. The portion of individuals participating in social media can be considered as the protein coding genes. The social media network and the physical social networks are both connected and can have very similar communities just like related set of mRNAs and proteins are involved in specific functions. The commands of protein generation through mRNAs are similar to individuals expressing their views on social media. There will be links in physical social networks and links in virtual social networks and multitude of cross-links between these two networks. A number of views of an individual may not be instantly reflected in the social media due to surrounding physical situations, delay in reaching the device to post the message or disturbances in the online network. Expressions of emotions in the physical world are quick and can change fast similar to mRNAs that are mostly transient. Due to the vast memory of online interactions, emotion expressed in the social media remains for longer time similar to the scenario of generated proteins being in the system for longer time. Assuming this scenario, we can ask questions such as does the problem of learning the actual emotions of individuals of the social network by asking online questions equivalent to understanding the biological regulatory mechanism by studying protein abundance alone? In a social network, an individual’s bad disposition can affect the moods of closely linked people around him/her (such as members of his/her community in the social network) just like RNAs can affect other surrounding RNAs. However, the manifestation of an individual’s gloomy state of mind in the social media network may not have the same effect as observed in the physical social network whereas some other expressions in the online world can influence more links in the online and physical networks as compared to the expression propagated through the physical social network. In a similar fashion, some proteins can have much more effect on surrounding proteins under specific conditions as compared to the effect of the corresponding mRNA on its surrounding mRNAs. As a single time snapshot is not sufficient to detect how information is spreading through a social network, single time snapshots of mRNAs and Proteins are not suitable for understanding the biological machinery. Time series data of mRNAs and Proteins for individual cells are required to get better understanding of the interactions of the transcriptomic and proteomic domains.

CONCLUSIONS

As compared to existing reviews [8, 7, 6, 10] on joint transcriptomic and proteomic profiling, the current article focuses on uncovering the primary categories of approaches that have been proposed for fusion of transcriptomic and proteomic data. We have divided the existing methods into eight main categories and illustrated each by specific example of studies. For a researcher searching for ways to combine a set of transcriptomic and proteomic profiles, this review provides a concise overview of the existing analysis techniques categorized into eight types and the advantages and limitations of the various approaches. For further insights, we provide analogies of the transcriptomic and proteomic expression scenario with cases in large scale organizational and social networks. This can possibly allow design of methodologies for joint analysis of mRNA and Protein expression data based on fusion techniques applied in other large scale network analysis.

116 in total

1. Integrated proteomic, transcriptomic, and biological network analysis of breast carcinoma reveals molecular features of tumorigenesis and clinical relapse.

Authors: Marcin Imielinski; Sangwon Cha; Tomas Rejtar; Elizabeth A Richardson; Barry L Karger; Dennis C Sgroi
Journal: Mol Cell Proteomics Date: 2012-01-12 Impact factor: 5.911

2. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications.

Authors: P M Sharp; W H Li
Journal: Nucleic Acids Res Date: 1987-02-11 Impact factor: 16.971

Review 3. Transcriptomics in ecotoxicology.

Authors: Kristin Schirmer; Beat B Fischer; Danielle J Madureira; Smitha Pillai
Journal: Anal Bioanal Chem Date: 2010-04-06 Impact factor: 4.142

4. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.

Authors: P T Spellman; G Sherlock; M Q Zhang; V R Iyer; K Anders; M B Eisen; P O Brown; D Botstein; B Futcher
Journal: Mol Biol Cell Date: 1998-12 Impact factor: 4.138

Review 5. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

6. Investigating the correspondence between transcriptomic and proteomic expression profiles using coupled cluster models.

Authors: Simon Rogers; Mark Girolami; Walter Kolch; Katrina M Waters; Tao Liu; Brian Thrall; H Steven Wiley
Journal: Bioinformatics Date: 2008-10-30 Impact factor: 6.937

7. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling.

Authors: Nicholas T Ingolia; Sina Ghaemmaghami; John R S Newman; Jonathan S Weissman
Journal: Science Date: 2009-02-12 Impact factor: 47.728

8. Characterizing the community structure of complex networks.

Authors: Andrea Lancichinetti; Mikko Kivelä; Jari Saramäki; Santo Fortunato
Journal: PLoS One Date: 2010-08-12 Impact factor: 3.240

9. A whole-genome massively parallel sequencing analysis of BRCA1 mutant oestrogen receptor-negative and -positive breast cancers.

Authors: Rachael Natrajan; Alan Mackay; Maryou B Lambros; Britta Weigelt; Paul M Wilkerson; Elodie Manie; Anita Grigoriadis; Roger A'Hern; Petra van der Groep; Iwanka Kozarewa; Tatiana Popova; Odette Mariani; Samra Turaljic; Simon J Furney; Richard Marais; Daniel-Nava Rodruigues; Adriana C Flora; Patty Wai; Vidya Pawar; Simon McDade; Jason Carroll; Dominique Stoppa-Lyonnet; Andrew R Green; Ian O Ellis; Charles Swanton; Paul van Diest; Olivier Delattre; Christopher J Lord; William D Foulkes; Anne Vincent-Salomon; Alan Ashworth; Marc Henri Stern; Jorge S Reis-Filho
Journal: J Pathol Date: 2012-02-23 Impact factor: 7.996

10. Correlation of mRNA and protein levels: cell type-specific gene expression of cluster designation antigens in the prostate.

Authors: Laura E Pascal; Lawrence D True; David S Campbell; Eric W Deutsch; Michael Risk; Ilsa M Coleman; Lillian J Eichner; Peter S Nelson; Alvin Y Liu
Journal: BMC Genomics Date: 2008-05-23 Impact factor: 3.969

122 in total

Review 1. Molecular and Cellular Mechanisms of Axonal Regeneration After Spinal Cord Injury.

Authors: Erna A van Niekerk; Mark H Tuszynski; Paul Lu; Jennifer N Dulin
Journal: Mol Cell Proteomics Date: 2015-12-22 Impact factor: 5.911

2. A Cell Wall Proteome and Targeted Cell Wall Analyses Provide Novel Information on Hemicellulose Metabolism in Flax.

Authors: Malika Chabi; Estelle Goulas; Celine C Leclercq; Isabelle de Waele; Christophe Rihouey; Ugo Cenci; Arnaud Day; Anne-Sophie Blervacq; Godfrey Neutelings; Ludovic Duponchel; Patrice Lerouge; Jean-François Hausman; Jenny Renaut; Simon Hawkins
Journal: Mol Cell Proteomics Date: 2017-07-13 Impact factor: 5.911

3. Reduced changes in protein compared to mRNA levels across non-proliferating tissues.

Authors: Kobi Perl; Kathy Ushakov; Yair Pozniak; Ofer Yizhar-Barnea; Yoni Bhonker; Shaked Shivatzki; Tamar Geiger; Karen B Avraham; Ron Shamir
Journal: BMC Genomics Date: 2017-04-18 Impact factor: 3.969

4. Nitrogen deprivation in Fusarium oxysporum promotes mycotoxin production via intermediates in the Krebs cycle and unreported methylmalonyl-CoA mutase activity.

Authors: A V Karpe; M S Dunn; M C Taylor; T Nguyen; C Ong; T Karla; S Rockman; D J Beale
Journal: Metabolomics Date: 2018-12-11 Impact factor: 4.290

5. Essential role of JunD in cell proliferation is mediated via MYC signaling in prostate cancer cells.

Authors: Bethtrice Elliott; Ana Cecilia Millena; Lilya Matyunina; Mengnan Zhang; Jin Zou; Guangdi Wang; Qiang Zhang; Nathan Bowen; Vanessa Eaton; Gabrielle Webb; Shadyra Thompson; John McDonald; Shafiq Khan
Journal: Cancer Lett Date: 2019-02-11 Impact factor: 8.679

10. Changes in metallothionein transcription levels in the mussel Mytilus galloprovincialis exposed to CdTe quantum dots.

Authors: Thiago Lopes Rocha; Eider Bilbao; Cátia Cardoso; Manu Soto; Maria João Bebianno
Journal: Ecotoxicology Date: 2018-02-02 Impact factor: 2.823