Pathway and network approaches for identification of cancer signature markers from omics data.

Jinlian Wang¹, Yiming Zuo², Yan-Gao Man³, Itzhak Avital³, Alexander Stojadinovic⁴, Meng Liu⁵, Xiaowei Yang⁵, Rency S Varghese⁶, Mahlet G Tadesse⁷, Habtom W Ressom⁶.

Abstract

The advancement of high-throughput omic technologies during the past few years has made it possible to perform many complex assays in a much shorter time than traditional approaches allow. The rapid accumulation and wide availability of omic data generated by these technologies offer great opportunities to unravel disease mechanisms, but also present significant challenges in extracting knowledge from such massive data and in evaluating the findings. To address these challenges, a number of pathway- and network-based approaches have been introduced. This review article evaluates these methods and discusses their application in cancer biomarker discovery using hepatocellular carcinoma (HCC) as an example.

Keywords: Biological pathways; cancer biomarker; high-throughput omics data; systems biology

Year: 2015; PMID: 25553089; PMCID: PMC4278915; DOI: 10.7150/jca.10631

Source DB: PubMed; Journal: J Cancer; ISSN: 1837-9664; Impact factor: 4.207

Introduction

A better understanding of disease-associated biomarkers could open a new avenue for uncovering the mechanisms of cancer development and progression and offer better targets for drug development 1. Studies on single gene/protein/metabolite molecular signatures offer limited insight into the complex interplay among the molecules responsible for the progression of complex diseases such as cancer. Thus, there is a shift toward identifying panels of genes that interact directly or indirectly in the form of pathways or complex networks and evaluating their association with cancer 2,3. This is accomplished using the massive data produced by high-throughput omic technologies such as next-generation sequencing, microarrays, and mass spectrometry. Although thousands of candidate biomarkers have been discovered by these technologies, few of them have been translated into practical application in clinical settings or new drug development. The challenges lie in (1) the high false positive rate of candidate biomarkers identified from omics data; (2) the lack of attention to the context of biomarkers that interact with each other in the form of cancer-associated pathways or networks; (3) the fragmentary and incomplete information provided by biomarkers identified from a single omics platform; and (4) the lack of effective algorithms that integrate diverse omics data sources to model biological pathways and networks. To meet these challenges, a number of pathway- and network-based approaches have been introduced. This review article evaluates the advantages and limitations of these methods. Traditional approaches, in which individual biomarkers or panels of cancer biomarkers are selected by analytic methods such as analysis of variance (ANOVA), Lasso, pairwise methods, information theory, and support vector machines (SVM), do not explicitly consider interactions between genes, proteins, and metabolites.
Compared to traditional methods, pathway- and network-centric methods naturally provide a way to understand the underlying pathways and the interactions between individual signature markers and non-markers. With the large-scale generation and integration of genomic, transcriptomic, proteomic, and metabolomic data, pathway/network-based methods provide a more effective and accurate means for cancer biomarker discovery. Increasingly, pathway- and network-based analyses are applied to omics data to gain more insight into the underlying biological functions and processes, such as cell signaling and metabolic pathways as well as gene regulatory networks 4-6. A number of pathway/network approaches have also been used for improving the prediction of cancer outcome, providing novel hypotheses for pathways involved in tumor progression 7, and exploring cancer-associated biomarkers 8. For example, Taylor et al. 9 combined gene expression data with physical protein-protein interaction data to identify subnetwork markers for the prognosis of breast cancer and lymphoma patients. Torkamani and Schork 10 used gene co-expression networks to infer cancer-initiating genes in breast cancer, colorectal cancer, and glioblastoma. Kim et al. applied the MAPIT (Multi Analyte Pathway Inference Tool) algorithm to identify prognostic multi-analyte network markers that predict glioblastoma (GBM) patient survival time by integrating gene expression profiles, epigenomic profiles, and the protein-protein interactome 11. Goh et al. 12 built a human disease network (HDN) by linking hereditary diseases that share a disease-causing gene recorded in the Online Mendelian Inheritance in Man (OMIM) database. Although the functional connections in the HDN remain to be further demonstrated, the HDN motivates the systematic study of relationships among diseases through network construction.
More detailed descriptions of the relationships between human diseases and the networks essential for understanding them have recently been summarized in reviews 2, 12-14. In this review, we present selected pathway- and network-centric computational approaches and their applications to biomarker discovery.

Summary of pathway- and network-centric approaches for cancer biomarker discovery

The availability of biomedical pathways and networks built from large-scale data gathered across diverse omics sources offers new opportunities to explain causal relationships between biological entities and cancers 15. As shown in Figure 1, the general steps of biomarker discovery include the following: 1) Define precisely a well-framed, relevant clinical problem and focus the experimental design around appropriate study populations and samples; 2) Collect tissue or fluid samples from patients and select suitable assays; 3) Acquire high-throughput data using omics technologies; 4) Analyze the data using signal processing, statistical, and machine learning methods to select relevant features; 5) Integrate pathway/network knowledge from databases such as KEGG, HMDB, and Reactome by mapping candidate biomarkers to the corresponding pathways or networks; 6) Evaluate the biomarkers to estimate their diagnostic or prognostic capability and clinical validity using alternative technologies such as Western blot, ELISA, and RT-PCR. On the computational side, cross-validation and independent validation are the commonly used methods to evaluate the performance of a biomarker, and p-values, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) are used as quantitative indicators of performance 16; 7) Use the biomarkers in clinical applications after reliable pre-clinical tests and validation of the markers in a large population.
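The quantitative indicators named in step 6 can be illustrated with a minimal Python sketch. The labels and scores below are invented toy values, not data from any cited study, and the function names are hypothetical. The AUC is computed here as the fraction of case-control pairs in which the case receives the higher score, which is one standard formulation among several.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when score >= threshold calls a case."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """AUC as the share of case-control pairs ranked correctly (ties count 0.5)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Toy data (assumption): 1 = cancer, 0 = control; scores are marker levels.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
area = auc(labels, scores)
print(sens, spec, area)
```

In practice these quantities would be estimated inside a cross-validation loop or on an independent cohort, as the text notes, rather than on the data used to select the marker.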

Figure 1

The pipeline of the pathway/network-centric approach for cancer biomarker discovery.

A variety of computational tools and algorithms have been proposed for biomarker discovery based on pathway and network methods. The most commonly used methods can be roughly categorized into statistical methods 17, graph theory 18, Bayesian methods 19, text mining 20, machine learning 21-23, and integrative methods, as summarized in Table 1.

Statistical methods

Statistical methods test scientific theories when observations, processes, or boundary conditions are subject to stochasticity. For example, the classical t-test has been extensively used for testing differential gene expression in microarray data 35. However, this kind of procedure relies on reasonable estimates of reproducibility or within-gene error, which requires a large number of replicated arrays. Thus, several methods for improving estimates of variability and statistical tests of differential expression have been proposed. For example, Significance Analysis of Microarrays (SAM) aims to improve the unstable error estimation in the two-sample t-test by adding a variance stabilization factor that minimizes the variability of the variance across different intensity ranges 36, 37. The ANOVA modeling approach is widely used for multiple kinds of omic data. For example, it was used to model microarray data with the effects of array, condition, and condition-array interaction and then to fit the residuals with the effects of gene, gene-condition interaction, and gene-array interaction 38,39. It was also applied to capture separately the effects of controlled groups, batches, condition, alias of experimental equipment, and condition-metabolite interaction on LC-MS data 40. To improve the accuracy and sensitivity of analytic results, the false discovery rate (FDR) 41 and its refinement, the q-value (qvalue package, www.bioconductor.org), have been rapidly adopted for genomic, proteomic, and metabolomic data analysis, including in the widely used SAM, DAVID 42, and other approaches 36. Another statistical method for biomarker discovery is linear discriminant analysis (LDA), a classical classification technique based on the multivariate normal distribution assumption. Despite this distributional assumption, LDA is quite robust and powerful for discovering biomarkers or pathways from omics data in many different applications.
Compared to LDA, quadratic discriminant analysis (QDA) requires more observations because a separate variance-covariance matrix must be estimated for each class 43. In addition, logistic regression analysis has been successfully used to evaluate the performance of prostate cancer biomarkers with mRNA profiling 44. The logistic regression (LR) model, which fits a regression to the odds between the compared conditions, requires no specific distributional assumption (e.g., Gaussian distribution) but is often found to be less sensitive than other approaches 42,43.
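To make the FDR adjustment discussed in this section concrete, the following Python sketch implements the Benjamini-Hochberg step-up procedure, a widely used way to control the FDR. Whether this is the exact procedure of the cited reference is an assumption, the p-values are invented toy values, and the q-value refinement provided by the Bioconductor package is not reproduced here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values (assumption), e.g. from per-gene t-tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(pvals)

# Features declared significant at an FDR of 5%.
significant = [i for i, q in enumerate(adj) if q <= 0.05]
print(adj, significant)
```

Note that the third and fourth features have raw p-values below 0.05 but are not retained after adjustment, which is the kind of false positive control the text refers to.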

Graph theory based network and visualization

The modeling fundamentals of graph theory are often used to describe the global topology, structure, or community organization of a complex system. Graph models emphasize entities (e.g., genes, proteins, diseases, biological processes) and the relations between them. A graphical model can be simple, with only nodes and edges, or more complex, with weighted edges and with nodes and edges of different types. Recent publications have applied graphical modeling in computational biology to study biological networks, to enhance the ability to draw causal inferences from functional MRI experiments, and to support the early detection of disconnection and the modeling of pathology spread in neurodegenerative diseases such as Alzheimer's disease 45-49. For example, in mammalian cells, Bleris et al. have had early success in characterizing the dynamics of key feed-forward modules and motifs, helping to enable the circuit design of adaptive gene expression 50. Using graph-based approaches, Ma'ayan et al. model cellular machinery including genes, proteins, and other subcellular components, in which the interactions between components are drawn as edges connecting the relevant nodes 51. Gene expression data combined with network analysis can yield important information on how expression variation relates to differences between observed states 52. As closely connected genes tend to be involved in similar functions, network annotation can complement clusters obtained via fold change analysis 7. A standard systems-based approach to biomarker and drug target discovery consists of placing putative or known biomarkers in the context of a network of biological interactions, followed by different 'guilt-by-association' analyses 53. The goal of visualization is to find patterns and structures that remain hidden in raw unstructured datasets. Graph visualization is key to displaying directly the various relationships between entities (e.g., genes, proteins).
Challenges of graph visualization lie in 1) the high false positive rate that arises when heterogeneous multi-omic datasets are incorporated; 2) the visual representation of the logical structure derived from the raw data; 3) graph manipulation and layout algorithms for representing the complicated relationships between biological entities; and 4) the need for flexible, layered representations when visualizing heterogeneous omic data from different levels. A number of commercial and free open-source graph visualization tools and platforms have been developed. For example, Cytoscape 31, a free open-source platform for biological network analysis and visualization with more than 172 registered plugins contributed by the community, is very versatile in network applications such as network importing, network integrating, inference customization, literature mining, topological clustering, functional enrichment, network comparison, and programmatic access 54. 3DScapeCS is a Cytoscape plugin that provides three-dimensional, dynamic, parallel network visualization for mass spectrometry (MS) molecular networks 55. IPA 56, a commercial software tool for pathway analysis of omics data, provides powerful graphical visualization of pathways and networks overlaid with diseases, drugs, biological processes, and other annotations. Pathway Studio provides a graphical interface for analyzing gene expression, protein interaction, and metabolic data and for exploring the pathways and networks identified from the data. STRING not only provides graphical visualization of both known and predicted protein interactions but also annotates each pair of proteins by interaction type, such as physical interaction and gene fusion 57.
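As a rough illustration of the graph representation and the 'guilt-by-association' idea described in this section, the Python sketch below builds an undirected interaction graph, computes node degree as a simple hub indicator, and scores each unannotated node by the fraction of its neighbours that are known markers. The node names G1-G6, the edges, the marker set, and the scoring rule are all hypothetical choices made for this example; published guilt-by-association analyses use a range of more elaborate scores.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def guilt_by_association(adj, known_markers):
    """Score each non-marker node by the fraction of neighbours that are markers."""
    scores = {}
    for node, neighbours in adj.items():
        if node in known_markers:
            continue
        hits = sum(1 for n in neighbours if n in known_markers)
        scores[node] = hits / len(neighbours)
    return scores


# Hypothetical interaction edges between placeholder genes.
edges = [
    ("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
    ("G2", "G4"), ("G3", "G5"), ("G5", "G6"),
]
adj = build_adjacency(edges)

# Node degree: a simple indicator of hub status.
degree = {node: len(neighbours) for node, neighbours in adj.items()}

# Assume G1 and G2 are already known markers.
scores = guilt_by_association(adj, known_markers={"G1", "G2"})
ranked = sorted(scores, key=lambda n: -scores[n])
print(degree, scores, ranked)
```

Here G4 ranks first because both of its neighbours are known markers, while G5 and G6 receive a score of zero.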

Bayesian methods and their derivatives

Bayesian methods allow informative priors, so that prior knowledge or the results of a previous model can be used to inform the current model. In cancer bioinformatics and systems biology, the primary applications of Bayesian methods include Bayesian inference, Bayesian networks, the naive Bayes classifier, and Bayesian variable selection. Among these methods, the Bayesian network is one of the most common modeling tools for pathway and network analysis 19. A Bayesian network is a form of directed statistical model designed to capture conditional dependencies between probabilistic events 58. It consists of a dependency structure and local probability models. Bayesian networks belong to the family of probabilistic graphical models, which also includes Hierarchical Bayesian Networks (HBN), Probabilistic Boolean Networks (PBN), Hidden Markov Models (HMM), and Markov Logic Networks (MLN) 59-61. The dependency structure specifies how the variables are related to each other by drawing directed edges between the variables without creating directed cycles. Each variable depends on a possibly empty set of other variables, termed "parents." Compared with other pathway/network-centric methods, the Bayesian network model is capable of handling heterogeneous data, missing values, and dependencies between variables 62. In a Bayesian network model, probabilities define the relationship between the current node and its predecessor or parent in the graph 63. The power of these methods lies in their ability to facilitate the reverse engineering of multiplex networks based on molecular expression, molecular activity, and/or cell behavior data, serving as a precursor to synthetic modifications of existing molecular pathways 64. Bayesian inference is one of the most important Bayesian methods and is widely used in cancer biomarker discovery and in signaling pathway and network inference 65,66.
It has previously been applied to gene expression data to infer gene regulatory networks 67,68 and to infer both protein signaling networks 69,70 and gene regulatory networks 71. To incorporate an explicit time element, dynamic Bayesian inference was proposed to interrogate dynamic signaling responses within a Bayesian framework, with existing signaling biology incorporated through an informative prior distribution on networks 66. In addition, Bayesian variable selection addresses the “large p, small n” problem common in omic data sets by using prior knowledge, such as pathway and protein interaction information, and estimating posterior probabilities by Markov chain Monte Carlo (MCMC). It is also widely used to infer functional interactions in biochemical pathways, to model the interactions between different functional modules of a biological network 72, and for pathway-based cancer biomarker discovery 73,74. For example, Yang et al. 21 used a Bayesian network to construct HCC cell networks and identify functional modules and the interactions between these modules. Stochastic simulation models offer an alternative, but they have so far been associated with a major disadvantage: their likelihood functions cannot be calculated explicitly, and thus it is difficult to couple them to well-established statistical theory such as maximum likelihood and Bayesian statistics. A number of new methods, among them Approximate Bayesian Computation and Pattern-Oriented Modeling, bypass this limitation. Bayesian inference differs from frequentist inference in the following respects: 1) it provides answers conditional on the observed data rather than on the distribution of estimators or test statistics over imaginary, unobserved samples (Rossi et al., 2005, p. 4); 2) it includes uncertainty in the probability model, yielding more realistic predictions; and 3) it safeguards against overfitting by integrating over model parameters.
However, the quality of the prior information directly affects the performance of Bayesian methods. Also, Bayesian networks, whose dependency structures contain no directed cycles, are unable to account for feedback regulation, a hallmark of signaling networks.
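The role of the dependency structure and the local probability models can be sketched with a toy three-node network in Python. The structure (A is the parent of B and C), the variable meanings, and every probability below are invented for illustration and are not taken from any cited model. The joint probability is the product of each node's probability given its parents, and the posterior is obtained by brute-force enumeration, which is only feasible for very small networks.

```python
from itertools import product

# Hypothetical structure: A (pathway active) is the parent of
# B (gene over-expressed) and C (protein phosphorylated).
parents = {"A": (), "B": ("A",), "C": ("A",)}

# cpt[var][parent_values] = P(var = 1 | parents); toy numbers (assumption).
cpt = {
    "A": {(): 0.2},
    "B": {(1,): 0.9, (0,): 0.1},
    "C": {(1,): 0.8, (0,): 0.2},
}


def joint(assignment):
    """Joint probability as the product of local conditional probabilities."""
    prob = 1.0
    for var, pa in parents.items():
        p_one = cpt[var][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob


def posterior(query, evidence):
    """P(query = 1 | evidence), summing the joint over unobserved variables."""
    hidden = [v for v in parents if v != query and v not in evidence]
    totals = {0: 0.0, 1: 0.0}
    for q in (0, 1):
        for values in product((0, 1), repeat=len(hidden)):
            assignment = dict(evidence)
            assignment.update(zip(hidden, values))
            assignment[query] = q
            totals[q] += joint(assignment)
    return totals[1] / (totals[0] + totals[1])


p_both = posterior("A", {"B": 1, "C": 1})
p_one_obs = posterior("A", {"B": 1})
print(p_both, p_one_obs)
```

The second query leaves C unobserved and simply sums over its values, which illustrates in miniature how such a model tolerates missing measurements.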

Text mining

With the growth of information in the literature and in biomedical databases, biological and clinical scientists need efficient means of handling and extracting diagnostic methods, prognostic terms, and related information from the scientific literature. For this purpose, text mining, which comprises the discovery and extraction of knowledge from free text to generate new hypotheses, is particularly relevant and helpful in biomedical research 14. Text mining complements the reading of scientific literature by individual researchers, allows rapid access to information contained in large volumes of documents, and increases the reproducibility of literature searches by enabling users to process all documents for a specific result. The primary applications of text mining in biomedical research fall roughly into three categories: 1) simple text mining, such as transforming textual information into database content and integrating it with existing knowledge resources to suggest novel hypotheses; 2) literature analysis, including clustering and classification of entities or diseases; and 3) integrative biology, for producing or testing hypotheses against knowledge bases. Currently, text mining is being successfully applied to the identification of molecular causes of diseases using facts from databases and the literature 75-77. For example, text mining has been used to suggest disease biomarkers from the scientific literature and to predict protein interactions on the basis of the assumption that two proteins are likely to interact with each other if they share a substantial amount of contextual information 78,79. Given a gene of interest, a network is constructed from all scientific publications related to the query gene. The results can be browsed by navigating through the visualized network. CoPub makes use of lexical resources for genes, proteins, Gene Ontology labels, diseases, pathways, drugs, and tissues to identify and statistically quantify the significance of a specific term for a gene or a set of genes 80.
The results provide users with a set of annotations for their genes of interest. In addition, text mining has been widely used in large-scale industrial knowledge bases to support functional analysis of query genes, proteins, metabolic compounds, and drugs. To visualize knowledge contained in the scientific literature, software tools have been developed that provide improved integration of text-mining results with other data resources. For example, IPA (Ingenuity) 56, KEGG 81, Pathway Studio 82, and HPRD 83 use text mining to integrate gene/protein-phenotype associations, which link genes and protein variants to diseases, toxic effects, and drug response, into their knowledge databases. Depending on the tasks researchers address, text mining can achieve different objectives. These include primarily the following: 1) retrieval of information from relevant documents; 2) identification of entities such as genes and diseases, of complex relationships between entities and diseases, and of interactions between proteins and genes 80; 3) deposition of extracted information into databases, or its use to support manual database curation efforts 15; and 4) generation of hypotheses 79 and testing of novel research questions 78. Text-mining techniques are shifting from the analysis of abstracts only to the full text of papers, and from the analysis of gene- and protein-related information to information about cells, tissues, and whole organisms. The most prominent shift is toward integrating information from the literature with data sets from other domains, such as gene expression profiles 84, genome-wide association studies (GWASs), biochemistry, and phenotype 84,85. Text mining lends itself to integration with machine learning and statistical techniques. In the future, text mining might face several major challenges, such as improving literature analysis, integrating with existing knowledge bases, and visualizing extracted information.
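A very simple form of the entity identification and relationship extraction described above is co-occurrence counting, sketched below in Python. The three "abstracts", the placeholder gene names GENE_A and GENE_B, and the dictionary-matching rule are invented for illustration. Production systems such as those cited rely on curated lexical resources, synonym handling, and statistical significance scoring, none of which is reproduced here.

```python
import re
from collections import Counter
from itertools import product


def tokenize(text):
    """Lower-case a text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def cooccurrence(docs, genes, diseases):
    """Count the documents in which each (gene, disease) pair co-occurs."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        found_genes = [g for g in genes if g.lower() in tokens]
        found_diseases = [d for d in diseases if d.lower() in tokens]
        counts.update(product(found_genes, found_diseases))
    return counts


# Invented toy abstracts with placeholder gene names (assumption).
docs = [
    "GENE_A expression is altered in HCC samples.",
    "We measured GENE_A and GENE_B in HCC tissue.",
    "GENE_B was studied in a glioblastoma cohort.",
]
counts = cooccurrence(
    docs,
    genes=["GENE_A", "GENE_B"],
    diseases=["HCC", "glioblastoma"],
)
print(counts.most_common())
```

The resulting pair counts can be read as weighted edges of a literature-derived gene-disease network, which is the kind of structure the network browsing tools described above visualize.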

Machine learning

Machine learning methods have been used for biomarker discovery from high-throughput omics data, for inferring causal relations between mutations and diseases 21, interactions between genes and proteins 86-88, and relations between environmental features and cancer 89, as well as for pathway and network modeling. There are two basic kinds of machine learning techniques. One is unsupervised machine learning, such as hierarchical clustering and self-organizing maps (SOM) 90. The other is supervised machine learning, which uses known labels in the data to train a model and then applies this model to predict output variables 3. A number of machine learning methods, such as SVM 14, artificial neural networks 91, decision trees, and random forests (RFs), have been widely used for various applications, including the identification of breast cancer biomarkers 92, diagnostic biomarkers of Parkinson disorders 93, the subcellular locations of proteins 94,95, the prediction of protein functions on the basis of protein structures 96,97, and the annotation of mutations 98,99. For example, Han proposed a machine learning based derivative component analysis method that selects implicit features by capturing subtle data behaviors and removing system noise from a proteomic profile, in order to overcome the reproducibility problem for biomarker discovery in proteomics 100. In another interesting study, Hoshida et al. 101 combined eight independent cohorts of gene expression profiles to reveal the subclasses of HCC and their related pathways using unsupervised machine learning methods. They found that three common subclasses (S1-S3) of HCC were significantly correlated with the Wnt pathway, MYC and AKT, and hepatocyte differentiation, respectively. Western blotting, knockout, and immunohistochemical staining were used for experimental validation of their discovery. Another framework, knowledge-driven matrix factorization (KMF), proposed by Yang et al., was used to reconstruct phenotype-specific modular gene networks 21.

Integrative methods

Integration of data from multiple omic studies can not only help unravel the underlying molecular mechanisms of carcinogenesis but also identify signaling pathway/network signatures characteristic of specific cancer types, which can be used for diagnosis, prognosis, and guidance of targeted therapy. The methods described in the preceding sections have proven useful for discovering biomarkers from high-throughput omic data and for analyzing protein-protein, protein-DNA, and kinase-substrate interactions, as well as genetic interactions among genes 102. These efforts have yielded good results in cancer biomarker discovery, protein interaction analysis, and the study of interactions between genotype and disease 103. However, current omic technologies each provide only a limited, fragmented view of biological function within the cell or of cancer mechanisms. Separate analysis of the data generated by each of these technologies reveals only part of the aberrant molecular changes, because the interaction of multiple molecules cannot be modeled by isolated analysis of genes, proteins, or metabolites. Furthermore, limitations such as intrinsically high noise, incomplete data, small sample size, and bias have motivated the use of integrative omic analysis together with prior biological knowledge and information bases, rather than mere collections of single large-scale omic studies 14, 34, 104. However, integration of multiple disparate data types remains a significant challenge in systems biology research. Most recently, attempts at integrating multiple high-throughput omics data have concentrated on capturing regulatory associations between genes and proteins by comparing expression patterns across multiple conditions 105-107, and on combining functional characterization and quantitative evidence extracted from different data sources at all levels of gene products (mRNA, proteins, and metabolites) as well as their interactions 108-110.
Some previous works 81, 111-113 in integrative analysis utilize pathways in the form of connected routes through a graph-based representation of the metabolic network 114. Other approaches focus on the functional module of protein interaction network and analyze experimental data in the context of pathways using multiple source omics data 14,115,116. We and others have developed advanced bioinformatics tools and algorithms to facilitate the integration of diverse data types 34, 110, 117-120. Different biological types of data, such as sequences, protein structures and families, proteomics data, ontologies, gene expression and other experimental data sets show a growing complexity produced by numerous heterogeneous application areas. The integration of heterogeneous data is therefore becoming more and more important. In order to gain insights into the complexity and dynamics of biological systems, the information stored in these data repositories needs to be linked and combined in efficient ways.

Application of biomarker discovery in HCC

Hepatocellular carcinoma (HCC) is the fifth most common malignancy and the third leading cause of cancer death in the world, with a five-year survival rate of approximately 7% 33. Treatments of HCC include surgical resection and transplantation, ablation and transarterial chemoembolization, and systemic chemotherapy. Even so, no existing systemic chemotherapy is effective for advanced HCC 121,122. For example, Llovet et al. 123 reported that targeted therapy with sorafenib, which inhibits multiple tyrosine kinase receptors (RAS/VEGFR), may prolong survival by about three months. However, owing to the redundancy and compensation of the signaling network in HCC, a significant reorganization of the signaling network has been observed, such as down-regulation of tumor suppressors (p53 and CHK1 when XIAP is silenced, or p-RB when CDK6 is silenced) and up-regulation of tumor-promoting proteins (ETS1 when XIAP is silenced, or p-CREB when CDK6 is silenced), which may confer a growth benefit on cancer cells 124. This example suggests that pathway and network information may improve the efficacy of systemic chemotherapy for HCC. Chang et al. partitioned the complex oncogenic signaling networks into basic units, or functional modules, of signaling activity (e.g., a protein phosphorylating another protein to activate its kinase activity) and demonstrated that gene expression signatures based on these modules can predict the effectiveness of pathway-specific therapeutics 125. Except for surgical resection/transplantation of early-stage HCC, none of these treatments significantly prolongs survival time. Combining pathway- and network-centric methods that make use of omics data with systemic chemotherapy may benefit the development of new therapeutic targets for HCC treatment. In recent years, computational modeling methods have played an increasingly important role in HCC investigations 114,126,127. Some computer systems have also been developed. For example, Shannon et al.
128 developed Gaggle, a Java-based tool that integrates diverse databases (e.g., KEGG, BioCyc, STRING) and software (e.g., Cytoscape, R) to simultaneously explore experimental data (e.g., mRNA and protein abundance, protein-protein and protein-DNA interactions), functional associations, metabolic pathways (KEGG), and PubMed abstracts. Recently, Zheng et al. 129 identified the molecular events underlying the development of HCV-induced HCC by integrating gene expression profiles and protein interaction data. To obtain the subnetworks, they refined the network by removing any network component with fewer than five nodes. They found four subnetworks, called the normal-cirrhosis, cirrhosis-dysplasia, dysplasia-early HCC, and early-advanced HCC networks. From each of the subnetworks they identified functional modules and hub genes. By comparing the pathways in each subnetwork, they observed changes in pathway and network activities. Their findings were validated against the literature. Even though the types of omics data they used include only gene expression and protein interactions, their work provides a way to study changes in network activities through the analysis of omic data. Zhang et al. used a systematic approach combining partial least squares, literature mining, and GeneGO MetaCore to discover HCC biomarkers from gene expression and protein data. Based on these marker genes, they constructed up-regulated and down-regulated networks. In the former, they identified 10 up-regulated hub genes (MAPK1, SP1, HDAC1, YY1, ABL1, PTK2, SMAD2, NCOA3, CDC25A, and NCOA2). In the latter, they identified 7 hub genes (FOS, ESR1, JUNB, EGFR, SOCS3, FOLH1, and IGF1). Partial least squares was employed to construct a classifier with these biomarkers. They used five-fold cross-validation and two independent datasets to evaluate the performance of the classifier.
Furthermore, they used immunohistochemistry and western blot measurements to verify experimentally the marker genes predicted by the classifier. Their results show that the network-based approach facilitates biomarker identification and improves classification accuracy 130. Hollywod et al. identified driver genes that are potential diagnostic markers and are relevant to mechanistic studies of HCC, using t-statistic map (TM) and transcriptome correlation map (TCM) approaches that integrate DNA copy number measured by genomic CGH arrays with gene expression. They found 50 driver genes with significant prognostic relevance to key HCC signaling pathways such as mTOR, AMPK, and EGFR. siRNA-mediated knockdown experiments were used to evaluate the functional significance of the 50 driver genes 131. Even though combining diverse omics data to analyze the relationships between the HCC phenotype and biological entities within the cell has proven powerful, such integration is still fragmentary, incomplete, and inadequate to reflect the whole picture of cancer development. The amount of omics data from genomics, proteomics, metabolomics, and interactomics is increasing. In pace with this explosion of omics data, a number of open-access databases containing comprehensive gene, protein interaction, biological pathway, and network information are being developed to provide biologists with valuable tools for analyzing data from complex biological systems. These include IntAct, BioGRID, MINT, KEGG, PID, STRING, and REACTOME, all of which provide very useful qualitative mappings of functional associations between key components in canonical pathways 14. Table 2 summarizes primary data sources and URLs specific to HCC.
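The network refinement step reported for Zheng et al. earlier in this section, in which components with fewer than five nodes are removed, can be sketched in Python as follows. The node names and edges are placeholders, the function names are hypothetical, and this is an illustrative reading of the description rather than the authors' implementation.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of an undirected graph as node sets."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def filter_small_components(edges, min_nodes=5):
    """Keep only the edges that lie in components with at least min_nodes nodes."""
    keep = set()
    for component in connected_components(edges):
        if len(component) >= min_nodes:
            keep |= component
    return [(u, v) for u, v in edges if u in keep]


# Placeholder network: one five-node component and one three-node component.
edges = [
    ("n1", "n2"), ("n2", "n3"), ("n3", "n4"), ("n4", "n5"), ("n5", "n1"),
    ("m1", "m2"), ("m2", "m3"),
]
kept = filter_small_components(edges, min_nodes=5)
print(kept)
```

Only the five-node component survives the filter; hub genes and functional modules would then be sought within the retained components.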

Table 2

Data sources and URLs for HCC databases.

Data sources	URLs
EHCO 132	http://ehco.iis.sinica.edu.tw/
OncoDB.HCC 133	http://oncodb.hcc.ibms.sinica.edu.tw/index.htm
HCVpro 134	http://cbrc.kaust.edu.sa/hcvpro/
HCVdb 135	http://euhcvdb.ibcp.fr/euHCVdb/
Hepatitis Virus Database (HVDB) 136	http://s2as02.genes.nig.ac.jp
Los Alamos National Laboratory in the United States 137	http://hcv.lanl.gov
LiverAtlas 138	http://liveratlas.hupo.org.cn
dbHCCvar 139	http://GenetMed.fudan.edu.cn/dbHCCvar
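
The network refinement step described above for Zheng et al. (discarding any network component with fewer than five nodes) and the subsequent search for hub genes can be illustrated with a minimal Python sketch. This is not the authors' implementation: the gene names and edges are invented toy data, the function names are hypothetical, and ranking hubs by node degree is an assumption, since the review does not state how hubs were defined.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (gene_a, gene_b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def connected_components(adj):
    """Return the connected components of the graph as a list of node sets."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components


def filter_small_components(edges, min_nodes=5):
    """Keep only the edges that lie in components with at least min_nodes nodes."""
    adj = build_adjacency(edges)
    keep = set()
    for comp in connected_components(adj):
        if len(comp) >= min_nodes:
            keep |= comp
    return [(a, b) for a, b in edges if a in keep and b in keep]


def top_hubs(edges, k=3):
    """Rank nodes by degree (an assumed hub criterion); ties break alphabetically."""
    adj = build_adjacency(edges)
    ranked = sorted(adj, key=lambda n: (-len(adj[n]), n))
    return ranked[:k]


# Toy interaction list with hypothetical gene names (not real data).
toy_edges = [
    ("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G1", "G5"), ("G2", "G3"),
    ("H1", "H2"), ("H2", "H3"),  # three-node component, removed by the filter
]
refined = filter_small_components(toy_edges, min_nodes=5)
hubs = top_hubs(refined, k=2)
```

In this toy network the three-node component is discarded and the most connected node of the remaining component is ranked first. Other hub definitions (for example, betweenness centrality) would fit the same structure.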

Limitations of omics based biomarker discovery

With the wide application of omics techniques, more accurate and ubiquitous biomarkers have been identified, but only a few have been brought to the clinical setting and many have proved to be irreproducible 140. One concern is that the biomarkers identified suffer from low diagnostic specificity and sensitivity, so current cancer biomarkers have not yet made a major impact in reducing the cancer burden. For instance, serum alpha-fetoprotein (AFP) is the most widely used biomarker for detecting and monitoring HCC, but the false negative rate with AFP may be as high as 40% for patients with early-stage HCC, and AFP levels remain low in 15%-30% of patients with advanced HCC 141. One important limitation is the possibility of artifacts in conducting biological experiments, such as instrument variability. Others include bias in sample collection and sample handling, which leads to cohort differences. For example, Sreekumar et al. 142 reported sarcosine as a prostate cancer biomarker through metabolomics analysis. However, a subsequent validation study by Jentzmik et al. 143 concluded that the levels of sarcosine measured by GC-MS could not differentiate malignant from nonmalignant tissue. Collestelli et al. reported no statistically significant difference between prostate cancer patients and healthy controls in sarcosine-to-creatinine ratios, and that sarcosine levels were about 11.7% higher in the healthy controls 144. Another important limitation is the lack of computational methods that can extract knowledge from omic data involving substantial noise, high dimensionality, missing values, etc. 
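
To make the diagnostic terms used above concrete, the following minimal Python sketch computes sensitivity, specificity and the false negative rate from a confusion table. The cohort counts are hypothetical: the 40 missed cases out of 100 simply mirror the 40% false negative rate quoted for AFP in early-stage HCC, and the control counts are invented for illustration.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity and false negative rate from counts.

    tp/fn: diseased subjects correctly detected / missed by the marker.
    tn/fp: healthy subjects correctly cleared / wrongly flagged.
    """
    diseased = tp + fn
    healthy = tn + fp
    if diseased == 0 or healthy == 0:
        raise ValueError("need at least one diseased and one healthy subject")
    return {
        "sensitivity": tp / diseased,          # fraction of cases detected
        "specificity": tn / healthy,           # fraction of controls cleared
        "false_negative_rate": fn / diseased,  # equals 1 - sensitivity
    }


# Hypothetical cohort: 100 early-stage patients, 40 of them missed by the
# marker, and 100 controls with an assumed 10 false positives.
metrics = diagnostic_metrics(tp=60, fn=40, tn=90, fp=10)
```

A 40% false negative rate corresponds to a sensitivity of 60%, which is why a marker of this kind misses a large share of early-stage cases even when its specificity is acceptable.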
Although the use of pathway and network-based approaches and the integration of prior biological knowledge with omic data are promising in addressing some of the computational challenges, they too have limitations, as outlined below. mRNA levels and DNA alterations may not accurately reflect the corresponding protein levels and fail to reveal changes in post-translational protein modification (e.g., phosphorylation, acetylation, methylation, ubiquitination, etc.) or protein degradation rates. The correlation of an mRNA with its associated protein expression can be relatively low. The signaling networks constructed using these approaches do not reflect the dynamic signal flow in its spatial context. Also, genomic changes (mRNA level, SNP, CNV, methylation) ultimately affect protein expression, activation and inactivation, which, in turn, control cellular behavior. Current proteomic technologies provide only limited coverage of the proteome, and more sensitive technologies are needed to identify and quantitate low-abundance proteins 145,146. Interpretation of pathway mapping results is complicated by the fact that pathway annotations currently take little account of the tissue specificity of genes or proteins in the pathway. This limits the tissue and/or isoform specificity of pathway annotations. Thus, specific steps of a pathway may not actually be active in the tissues/cells from which the omics data were generated. In some cases, this may occur because protein isoforms or splice variants have been annotated as a protein class or a canonical protein sequence, respectively, in the pathway, while they may be expressed differentially in different tissues/cells. Because biological pathways are inherently complex and dynamic, pathway annotations in different pathway databases vary significantly in their pathway models and in a number of other aspects, e.g., specific protein forms, dynamic complex formation, subcellular locations, and pathway cross talk. 
Current computational methods thus need to address these issues, including revealing patterns within the data, modeling heterogeneity, profiling disease classes and subclasses, and producing predictive classifications of patients. Biomarker discovery research is now shifting away from the identification of individual biomarkers toward the search for perturbed pathways and network activities.
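
The point made in this section that mRNA abundance can be a poor proxy for protein abundance is usually checked with a rank correlation. The sketch below computes Spearman's rank correlation between paired mRNA and protein measurements using only the Python standard library. The values are invented for illustration, and the choice of Spearman's coefficient is an assumption rather than a method prescribed by the studies reviewed here.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired measurements for five genes (not real data).
mrna = [1.0, 2.0, 3.0, 4.0, 5.0]
protein = [2.1, 1.9, 3.5, 3.0, 6.0]
rho = spearman(mrna, protein)
```

A coefficient well below 1 for a gene set would indicate that transcript-level changes alone are an unreliable basis for inferring protein-level pathway activity.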

Conclusion

Early detection of cancer improves survival and enhances quality of life. An ideal marker would be one that can be measured easily and reliably using an assay with high sensitivity and specificity, and that undergoes rigorous validation before it is introduced into routine clinical care. Currently, the treatment of most cancers is based on tissue type and clinical stage. This approach is often ineffective due to the heterogeneity of tumors. Pathway and network-based methods have taken on an increasingly important role in the analysis of high-throughput data, because they provide a global and systematic way to explore the relationships between biomarkers and their interacting partners. Thus, future work is likely to focus on using pathway and network-based methods for biomarker discovery. It is our expectation that the methods discussed above will become a component of a shared infrastructure of biomedical resources that researchers can use to identify and retrieve the most relevant work, to formulate hypotheses, to find supporting and contradicting evidence for hypotheses, to integrate research results into a framework of whole biological systems, and to support the translation of research results across domains and into clinical applications.

Table 1

Computational methods for biomarker discovery categorized by their application, exemplary tools and URLs.

Approaches	Technique & Application Examples	Exemplary Tools & URLs
Statistical analysis	Hypothesis testing, random sampling, ANOVA. Detection of differentially expressed genes/proteins, genotypes, biomarker filtering/selection 24	BRB: http://linus.nci.nih.gov/BRB-ArrayTools.html; PAM: http://www-stat.stanford.edu/~tibs/PAM/; SAM: http://www-stat.stanford.edu/~tibs/SAM/
Pattern recognition	Machine learning, probabilistic, instance-based, kernel classification models. Clustering, multi-source data classification, biomarker selection and associations 25; Bayesian regression models 26, partial least squares 27, and Genetic Algorithm/KNN 28.	Weka: http://weka.wikispaces.com/; LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/; PRTools: http://prtools.org/; R package: http://cran.r-project.org/web/views/Bayesian.html
Graph/network theory	Network topology analysis, network visualization and data integration, clustering. Genetic, regulatory, protein-protein, signaling network analysis, biomarker/target identification 29	BioNet4: http://www.fda.gov/ScienceResearch/BioinformaticsTools/ucm285284.htm; Jung: http://jung.sourceforge.net/; dmGWAS 30: http://bioinfo.mc.vanderbilt.edu/dmGWAS.html
Data visualization and imaging	Sequence and cluster visualization, interactive visualization, statistical analysis graphs. Data exploration, biomarker visualization, model explanation, in vivo/in vitro imaging of molecules and cells 29	Cytoscape 31: http://www.biotapestry.org/; Medusa: http://coot.embl.de/medusa/; Graphviz: http://www.graphviz.org/; Osprey: http://biodata.mshri.on.ca/osprey/servlet/Index; Pajek: http://vlado.fmf.uni-lj.si/pub/networks/pajek/; 3Omics: http://3omics.cmdm.tw
Natural language processing and information retrieval	Ontologies, text mining, information representation standards, information retrieval and extraction. Inference of functional associations from publications, automated annotation and characterization 32,33	iHOP: http://www.ihop-net.org/UniPub/iHOP; CoPub: http://services.nbic.nl/copub/portal; PolySearch: http://wishart.biology.ualberta.ca/polysearch/index.htm; Open Biomedical Annotator: http://bioportal.bioontology.org/annotator; GeneSeeker: http://www.cmbi.ru.nl/GeneSeeker/
Software development, Internet technologies	Data warehouses and distributed information systems, semantic Web tools, information retrieval, extraction and curation. Biomarker discovery and validation platforms, data mining tools, search and reasoning engines 34	IPA: http://www.ingenuity.com/products/pathways_analysis.htm; GO: http://www.geneontology.org/GO.tools.shtml; MiMI: http://mimi.ncibi.org/MimiWeb/main-page.jsp

143 in total

1. Association of genes to genetically inherited diseases using data mining.

Authors: Carolina Perez-Iratxeta; Peer Bork; Miguel A Andrade
Journal: Nat Genet Date: 2002-05-13 Impact factor: 38.330

Review 2. Data merging for integrated microarray and proteomic analysis.

Authors: Katrina M Waters; Joel G Pounds; Brian D Thrall
Journal: Brief Funct Genomic Proteomic Date: 2006-05-10

Review 3. "Omics" data and levels of evidence for biomarker discovery.

Authors: Debashis Ghosh; Laila M Poisson
Journal: Genomics Date: 2008-09-14 Impact factor: 5.736

4. Circulating transcriptome reveals markers of atherosclerosis.

Authors: Willmar D Patino; Omar Y Mian; Ju-Gyeong Kang; Satoaki Matoba; Linda D Bartlett; Brenda Holbrook; Hugh H Trout; Louis Kozloff; Paul M Hwang
Journal: Proc Natl Acad Sci U S A Date: 2005-02-22 Impact factor: 11.205

5. Significance of DNA polymerase delta catalytic subunit p125 induced by mutant p53 in the invasive potential of human hepatocellular carcinoma.

Authors: Kensaku Sanefuji; Akinobu Taketomi; Tomohiro Iguchi; Keishi Sugimachi; Toru Ikegami; Yo-ichi Yamashita; Tomonobu Gion; Yuji Soejima; Ken Shirabe; Yoshihiko Maehara
Journal: Oncology Date: 2011-03-03 Impact factor: 2.935

6. Proteomic identification of CIB1 as a potential diagnostic factor in hepatocellular carcinoma.

Authors: Tong Junrong; Zhou Huancheng; He Feng; Gao Yi; Yang Xiaoqin; Luo Zhengmao; Zhang Hong; Zeng Jianying; Wang Yin; Huang Yuanhang; Zhang Jianlin; Sun Longhua; He Guolin
Journal: J Biosci Date: 2011-09 Impact factor: 1.826

7. Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression.

Authors: Arun Sreekumar; Laila M Poisson; Thekkelnaycke M Rajendiran; Amjad P Khan; Qi Cao; Jindan Yu; Bharathi Laxman; Rohit Mehra; Robert J Lonigro; Yong Li; Mukesh K Nyati; Aarif Ahsan; Shanker Kalyana-Sundaram; Bo Han; Xuhong Cao; Jaeman Byun; Gilbert S Omenn; Debashis Ghosh; Subramaniam Pennathur; Danny C Alexander; Alvin Berger; Jeffrey R Shuster; John T Wei; Sooryanarayana Varambally; Christopher Beecher; Arul M Chinnaiyan
Journal: Nature Date: 2009-02-12 Impact factor: 49.962

8. Cloud-based solution to identify statistically significant MS peaks differentiating sample categories.

Authors: Jun Ji; Jeffrey Ling; Helen Jiang; Qiaojun Wen; John C Whitin; Lu Tian; Harvey J Cohen; Xuefeng B Ling
Journal: BMC Res Notes Date: 2013-03-23

9. OncoDB.HCC: an integrated oncogenomic database of hepatocellular carcinoma revealed aberrant cancer target genes and loci.

Authors: Wen-Hui Su; Chuan-Chuan Chao; Shiou-Hwei Yeh; Ding-Shinn Chen; Pei-Jer Chen; Yuh-Shan Jou
Journal: Nucleic Acids Res Date: 2006-11-10 Impact factor: 16.971

10. Evaluation and integration of existing methods for computational prediction of allergens.

Authors: Jing Wang; Yabin Yu; Yunan Zhao; Dabing Zhang; Jing Li
Journal: BMC Bioinformatics Date: 2013-03-08 Impact factor: 3.169
