Wei Zhang, Jeremy Chien, Jeongsik Yong, Rui Kuang.
Abstract
Network-based analytics plays an increasingly important role in precision oncology. Growing evidence in recent studies suggests that cancer can be better understood through mutated or dysregulated pathways or networks rather than individual mutations and that the efficacy of repositioned drugs can be inferred from disease modules in molecular networks. This article reviews network-based machine learning and graph theory algorithms for integrative analysis of personal genomic data and biomedical knowledge bases to identify tumor-specific molecular mechanisms, candidate targets and repositioned drugs for personalized treatment. The review focuses on the algorithmic design and mathematical formulation of these methods to facilitate applications and implementations of network-based analysis in the practice of precision oncology. We review the methods applied in three scenarios to integrate genomic data and network models in different analysis pipelines, and we examine three categories of network-based approaches for repositioning drugs in drug-disease-gene networks. In addition, we perform a comprehensive subnetwork/pathway analysis of mutations in 31 cancer genome projects in the Cancer Genome Atlas and present a detailed case study on ovarian cancer. Finally, we discuss interesting observations, potential pitfalls and future directions in network-based precision oncology.
Year: 2017 PMID: 29872707 PMCID: PMC5871915 DOI: 10.1038/s41698-017-0029-7
Source DB: PubMed Journal: NPJ Precis Oncol ISSN: 2397-768X
Fig. 1 Overview of the methods for network-based precision oncology. a The methods for integration of patient genomic data and molecular networks, grouped under the three scenarios of data analysis pipelines. b The methods for integration of drug–drug similarities, drug–target relations and target–target relations for drug repositioning, grouped under three algorithmic categories. c Patient genomic profiles describe the genomic landscape of each patient sample. d The patient genomic profiles are integrated with a molecular network, the human protein–protein interaction (PPI) network in the example. e Drug and disease phenotypes are modeled in a network with connections to the target genes in the PPI network. f An example of cancer subnetworks associated with recurrent ovarian cancer.[36] g Resources of biomedical and molecular networks. h List of the TCGA cancer studies.
List of molecular and biomedical networks
| Network (reference) | Description |
|---|---|
| HPRD (7) | “Human Protein Reference Database provides curated human-specific protein interactions; currently >40,000 interactions for >30,000 protein entries. HPRD is used as a browser for interactions, protein annotations, motifs and domains.” |
| BioGRID (8) | BioGRID is a curated database of interactions derived from the literature. It contains 1,412,140 protein and genetic interactions, 27,745 chemical associations and 38,559 post-translational modifications from major organism species. |
| MINT (9) | “A searchable molecular interaction database with total of 125,000 interactions reported in peer-reviewed journals.” Most of the interactions are from yeast, human and mouse. |
| DIP (10) | “The database of interacting proteins (DIP) is a database with catalogs experimentally determined protein–protein interactions.” It contains 81,731 interactions for 28,868 proteins from 834 organisms. |
| STRING (11) | A database of known and predicted protein–protein interactions. “The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other databases.” |
| IntAct (12) | “A molecular interaction database populated by data either curated from the literature or from direct data depositions. It contains approximately 658,000 curated binary interaction evidences from overall 14,451 publications.” |
| HEFalMp (15) | “A human gene functional network was constructed by a regularized Bayesian integration system. The network contains maps of functional activity and interaction networks in over 200 areas of human cellular biology with information from 30,000 genome scale experiments.” |
| Co-expression Network (13, 14) | A gene co-expression network is constructed by connecting pairs of genes that show a similar expression pattern across samples according to some co-expression measure. |
| TRRUST (16) | “A manually curated database of human transcriptional regulatory network. It contains 8015 transcriptional regulatory relationships between 748 human transcription factors (TFs) and 1975 non-TF genes, derived from 6175 PubMed articles.” |
| RegNetwork (17) | A database of transcriptional and post-transcriptional human and mouse regulatory networks. It collects knowledge-based regulatory relationships and certain potential regulatory relationships between the two regulators and targets. |
| HMDB (18) | “A database contains information about small molecule metabolites found in the human body. It contains experimental MS/MS data for over 5700 compounds, experimental NMR data for over 1300 compounds and GC/MS spectral and retention index data for more than 780 compounds.” |
| MetaCyc (19) | “A curated database of experimentally elucidated metabolic pathways from all domains of life. It contains 2491 pathways involved in both primary and secondary metabolism, as well as associated metabolites, reactions, enzymes, and genes from 2816 different organisms.” |
| OMIM (21) | “OMIM is a database of human genes and genetic disorders and traits, with a particular focus on the gene-phenotype relationship.” It contains approximately 8000 phenotypes and 15,000 genes. |
| HPO (24) | “HPO serves as a standardized vocabulary of phenotypic abnormalities that have been seen in human disease.” It currently focuses on monogenic diseases listed in OMIM, Orphanet, DECIPHER and other medical literature. |
| DrugBank (26) | “A bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information. It contains 8250 drug entries including 2016 FDA-approved small molecule drugs, 229 FDA-approved biotech drugs, 94 nutraceuticals and over 6000 experimental drugs.” |
| ChEMBL (27) | “ChEMBL is a bioactivity database containing information manually extracted from the medicinal chemistry literature.” It contains information extracted from >51,000 publications, with >9000 targets, of which 2827 are human protein targets. |
| TTD (28) | “A database contains the known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information and the corresponding drugs directed at each of these targets.” The database currently contains 2025 targets, 17,816 drugs, and 3681 multitarget agents. |
| KEGG DRUG (29) | “A comprehensive drug information resource for approved drugs in Japan, USA, and Europe unified based on the chemical structures and/or the chemical components, and associated with target, metabolizing enzyme, and other molecular interaction network information.” |
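The co-expression entry in the table describes the construction only in words. A minimal sketch of one way to do it follows, assuming absolute Pearson correlation as the co-expression measure and a hard cutoff; the function name `coexpression_network`, the `threshold` value and the toy expression matrix are all hypothetical and are not taken from the cited references.

```python
import numpy as np

def coexpression_network(expr, threshold=0.8):
    """Build an unweighted co-expression adjacency matrix.

    expr: array of shape (n_genes, n_samples).
    threshold: assumed cutoff on |Pearson r| (illustrative value only).
    """
    corr = np.corrcoef(expr)                       # gene-by-gene Pearson correlation
    adj = (np.abs(corr) >= threshold).astype(int)  # keep strongly correlated pairs
    np.fill_diagonal(adj, 0)                       # drop self-loops
    return adj

# Toy expression matrix: 3 genes measured in 4 samples.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],   # gene 0
    [2.0, 4.0, 6.0, 8.1],   # gene 1, nearly proportional to gene 0
    [4.0, 1.0, 3.0, 2.0],   # gene 2, only weakly correlated with the others
])
adj = coexpression_network(expr)
```

Other co-expression measures (for example rank correlation or mutual information) can be substituted for `np.corrcoef` without changing the rest of the sketch.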
Fig. 2 Three scenarios for the integration of genomic data with molecular networks. a Model-based integration formulates one unified learning framework regularized by a graph Laplacian. The output of the model is network modules enriched by the selected genomic features and a prediction of treatment outcome/cancer phenotype. b Preprocessing integration consists of two steps: the first step detects subnetworks that differentiate the contrasted patient groups by the genomic features; in the second step, the subnetwork features are fed into a standard learning model to generate predictions. c Post-analysis integration of oncogenic alterations in the network also consists of two steps. The oncogenic alterations are first detected across the patient profiles, and then the altered genes/loci are mapped to the network as seed genes for the module analysis. For each scenario, the objectives of the approach, the inputs and outputs of the network-based analysis models/methods, and the advantages/limitations of each approach are also provided.
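Panel c of the caption above maps altered genes onto the network as seed genes for module analysis. One common way to score the neighbourhood of a seed set is a random walk with restart; the sketch below uses it purely as an illustration and does not claim it is the method used by any specific tool in the review. The function name, the `restart` value and the toy path network are assumptions.

```python
import numpy as np

def random_walk_with_restart(A, seeds, restart=0.5):
    """Score all nodes by proximity to the seed genes.

    A: symmetric adjacency matrix (assumes no isolated nodes).
    seeds: indices of altered genes used as restart nodes.
    restart: assumed restart probability (illustrative value only).
    """
    n = A.shape[0]
    W = A / A.sum(axis=0, keepdims=True)   # column-normalized transition matrix
    p0 = np.zeros(n)
    p0[seeds] = 1.0 / len(seeds)           # uniform restart distribution on seeds
    # Closed-form fixed point of p = (1 - restart) * W @ p + restart * p0
    return np.linalg.solve(np.eye(n) - (1.0 - restart) * W, restart * p0)

# Toy path network 0-1-2-3 with gene 0 as the only seed.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
scores = random_walk_with_restart(A, seeds=[0])
```

Genes with high scores lie close to the seeds in the network and are candidates for inclusion in a module; where to cut the ranking is a separate choice not shown here.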
Fig. 3 Model-based integration of whole-genomic profiles and a molecular network. a The patient genomic profiles are shown along with the clinical information: the survival time, two patient subgroups for classification, and the treatment response of each individual patient. The network is typically integrated into the genomic profile analysis with a graph Laplacian regularization. The formulas of the graph Laplacian and its regularization are shown below. The graph Laplacian regularization can be rewritten as a summation of pairwise smoothness terms that promote smoothness among the connected genomic features in the network. b The network-based linear regression and Cox regression models are illustrated in the figure with the graph Laplacian regularization term added to the original cost functions. c Network-based classification is illustrated by a network-based SVM to classify the samples. d Network-based semi-supervised learning models classify samples and detect disease markers on a bipartite graph. The edges between samples and genomic features are weighted by the genomic profiles, and semi-supervised learning is based on the bipartite graph Laplacian. e Network-based factorization models factorize the genomic profile into the product of two matrices, which cluster patient samples and learn the latent features in the genomic profiles.
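Panel b of the caption above adds a graph Laplacian term to a regression cost. A minimal sketch for the linear-regression case follows, assuming a squared-error loss, the normalized Laplacian, and a single weight `lam`; the function names, the toy network and the simulated data are hypothetical. The last lines check numerically that the penalty equals the sum of pairwise smoothness terms mentioned in panel a.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2}; assumes every node has at least one edge."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    S = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.eye(A.shape[0]) - S

def network_regularized_regression(X, y, L, lam):
    """Minimize ||y - X w||^2 + lam * w^T L w via its closed-form solution."""
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# Hypothetical toy data: 4 genes on a path network 0-1-2-3, 6 samples.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = X @ np.array([1.0, 1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=6)

L = normalized_laplacian(A)
w = network_regularized_regression(X, y, L, lam=1.0)

# Penalty written two ways: quadratic form vs. pairwise smoothness terms.
scaled = w / np.sqrt(A.sum(axis=1))
pairwise = 0.5 * np.sum(A * (scaled[:, None] - scaled[None, :]) ** 2)
penalty = w @ L @ w
```

With `lam=0` the solution reduces to ordinary least squares; larger values pull the degree-scaled coefficients of connected genes toward each other.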
Notations
| Notation | Definition |
|---|---|
| | number of samples and features (e.g., genes), respectively. |
| | genomic profile matrix. |
| | coefficients of features to be learned by the model. |
| | responses for regression or labels for classification, |
| | symmetric adjacency matrix of an undirected molecular network. |
| | diagonal matrix with vector |
| | normalized symmetric adjacency matrix: |
| | normalized graph Laplacian: |
| | graph Laplacian regularization: |
| | initialization for semi-supervised learning: |
| | predictions by semi-supervised learning: |
| | positive hyper-parameters to weight the cost terms. |
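As a sketch of how the graph quantities described in this table are usually defined in the graph-regularization literature, the block below writes out the degree matrix, the normalized adjacency matrix, the normalized Laplacian and the regularizer. The symbols $A$ (adjacency matrix), $D$, $S$, $L$ and $w$ (coefficient vector) are assumed here and may differ from the article's own notation; the derivation also assumes every node has nonzero degree.

```latex
\begin{aligned}
D &= \operatorname{diag}(A\mathbf{1}), \qquad
S = D^{-1/2} A D^{-1/2}, \qquad
L = I - S, \\
w^{\top} L w &= w^{\top} w - w^{\top} S w \\
&= \tfrac{1}{2}\sum_{i,j} A_{ij}\,\frac{w_i^{2}}{D_{ii}}
 + \tfrac{1}{2}\sum_{i,j} A_{ij}\,\frac{w_j^{2}}{D_{jj}}
 - \sum_{i,j} A_{ij}\,\frac{w_i w_j}{\sqrt{D_{ii} D_{jj}}} \\
&= \tfrac{1}{2}\sum_{i,j} A_{ij}
   \left(\frac{w_i}{\sqrt{D_{ii}}} - \frac{w_j}{\sqrt{D_{jj}}}\right)^{2}.
\end{aligned}
```

The second line uses $\sum_j A_{ij} = D_{ii}$, so each of the first two sums equals $w^{\top} w$. The final form shows that the penalty is small when connected features have similar degree-scaled coefficients.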
Fig. 4 Methods for network-based drug repositioning. a Graph connectivity measures consider the local structures of the networks to predict drug–target interactions. This example shows the shortest path from each target node to the query drug (red node) in the graph. b Link prediction models predict the relations between drugs and targets based on the global structures of the known interactions in the networks with matrix completion or random-walk approaches. The known and predicted drug–target interactions are green and red, respectively, in the drug–target relation matrix. c Network-based classification methods first extract the network topological features for all the targets in the networks. For each drug, a classifier can be trained with the known targets of the drug as positive samples and the others as negative samples. The learned classifiers can then be used to predict the new targets in the test set for each drug. d The advantages and disadvantages of the methods in each category are compared.
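The shortest-path measure in panel a can be sketched with a breadth-first search on an unweighted network. This is only an illustration of the idea: the node names, the toy graph and the ranking rule (closer targets first) are hypothetical, and real methods in this category may use weighted or more elaborate connectivity scores.

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

# Hypothetical drug-target network as an undirected adjacency list.
graph = {
    "drugQ": ["T1"],
    "T1": ["drugQ", "T2", "T3"],
    "T2": ["T1", "T4"],
    "T3": ["T1"],
    "T4": ["T2"],
    "T5": [],            # disconnected target, never reached
}
dist = shortest_path_lengths(graph, "drugQ")

# Rank reachable targets by distance to the query drug (ties broken by name).
candidates = sorted(
    (t for t in graph if t.startswith("T") and t in dist),
    key=lambda t: (dist[t], t),
)
```

Targets that are unreachable from the query drug, such as `T5` here, simply receive no score in this sketch.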
Fig. 5 Network-based analysis of highly mutated pathways of 31 cancer types in TCGA data. The highly mutated pathways detected by a network-based analysis and b standard enrichment analysis. The pathways of interest in the discussion are highlighted in blue, and the pathways only enriched by network-based analysis are highlighted in red.
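The caption above does not say which test the "standard enrichment analysis" uses. A common choice is the one-sided hypergeometric test, sketched here as an assumption; the function name and the toy counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_universe, n_pathway, n_mutated, n_overlap):
    """P(X >= n_overlap) when n_mutated genes are drawn without replacement
    from n_universe genes, of which n_pathway belong to the pathway."""
    total = comb(n_universe, n_mutated)
    upper = min(n_pathway, n_mutated)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_mutated - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy counts: 20 genes in total, 5 in the pathway, 4 mutated, 3 of them in the pathway.
p_value = hypergeom_enrichment_pvalue(20, 5, 4, 3)
```

In practice one such p-value is computed per pathway and then corrected for multiple testing, which this sketch leaves out.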
Fig. 6 Network-based analysis of patient mutation data in TCGA ovarian cancer. The significantly mutated pathways in each patient detected by a network analysis and b the analysis of the original mutation data without the network. c The survival plot of the three groups detected by the network-based pathway analysis of the TCGA ovarian cancer patients. Derived by standard log-rank test, the p-values for comparing group 2 vs. group 3 and group 1 + group 2 vs. group 3 are both significant. d The survival plot of the groups detected by the analysis of the original mutation data of the TCGA ovarian cancer patients.