Limeng Pu, Manali Singha, Hsiao-Chun Wu, Costas Busch, J Ramanujam, Michal Brylinski.
Abstract
Genomic profiles of cancer cells provide valuable information on genetic alterations in cancer. Several recent studies employed these data to predict the response of cancer cell lines to drug treatment. Nonetheless, due to the multifactorial phenotypes and intricate mechanisms of cancer, accurately predicting the effect of pharmacotherapy on a specific cell line from genetic information alone is problematic. Emphasizing the system-level complexity of cancer, we devised a procedure to integrate multiple heterogeneous data, including biological networks, genomics, inhibitor profiling, and gene-disease associations, into a unified graph structure. To construct compact yet information-rich cancer-specific networks, we developed a novel graph reduction algorithm. Driven not only by topological information but also by biological knowledge, the graph reduction increases the feature-only entropy while preserving the valuable graph-feature information. Subsequent comparative benchmarking simulations employing a tissue-level cross-validation protocol demonstrate that the accuracy of a graph-based predictor of drug efficacy is 0.68, which is notably higher than the accuracies measured for more traditional, matrix-based techniques on the same data. Overall, the non-Euclidean representation of the cancer-specific data improves the performance of machine learning in predicting the response of cancer to pharmacotherapy. The generated data are freely available to the academic community at https://osf.io/dzx7b/.
Year: 2022 PMID: 35487924 PMCID: PMC9054771 DOI: 10.1038/s41540-022-00226-9
Source DB: PubMed Journal: NPJ Syst Biol Appl ISSN: 2056-7189
Fig. 1 Schematic of the graph representation of multiple heterogeneous data.
Circles are kinases, whereas rounded squares represent non-kinase proteins. Nodes are connected through confident interactions forming a network. Each node is colored according to the differential gene expression: green for up-regulated, red for down-regulated, and gray for normally regulated. Both types of nodes can have gene-disease association scores (numbers in bold), whereas kinases can also have pIC50 values according to the kinase profiling data (numbers in italics).
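The node and edge annotations described in this caption could be held in a small data structure such as the one below. This is a minimal sketch, not the authors' implementation: the class names `ProteinNode` and `CancerGraph`, the placeholder protein names, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProteinNode:
    """One protein in the network (hypothetical container, not from the paper)."""
    name: str
    is_kinase: bool
    expression: str                        # "up", "down", or "normal"
    disease_score: Optional[float] = None  # gene-disease association, if any
    pic50: Optional[float] = None          # kinase profiling value, kinases only

    def __post_init__(self):
        if self.expression not in {"up", "down", "normal"}:
            raise ValueError(f"unknown expression state: {self.expression}")
        if self.pic50 is not None and not self.is_kinase:
            raise ValueError("pIC50 values are only attached to kinases")


@dataclass
class CancerGraph:
    """Undirected graph whose nodes carry the features listed in the caption."""
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add_node(self, node: ProteinNode) -> None:
        self.nodes[node.name] = node

    def add_edge(self, a: str, b: str) -> None:
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        if a == b:
            raise ValueError("self-loops are not allowed in this sketch")
        self.edges.add(frozenset((a, b)))  # frozenset makes the edge undirected

    def neighbors(self, name: str) -> list:
        return sorted(next(iter(e - {name})) for e in self.edges if name in e)


def build_example_graph() -> CancerGraph:
    """Tiny illustrative graph; names and values are made up."""
    g = CancerGraph()
    g.add_node(ProteinNode("KINASE_A", True, "up", disease_score=0.8, pic50=7.2))
    g.add_node(ProteinNode("PROTEIN_B", False, "down", disease_score=0.3))
    g.add_node(ProteinNode("PROTEIN_C", False, "normal"))
    g.add_edge("KINASE_A", "PROTEIN_B")
    g.add_edge("PROTEIN_B", "PROTEIN_C")
    return g
```

Keeping the kinase-only constraint in `__post_init__` mirrors the caption's statement that only kinases carry pIC50 values, while any node may carry a gene-disease association score.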
Fig. 2 Histogram of the pairwise GOGO similarity scores across the protein-protein interaction network.
GOGO similarities are calculated using the biological process ontology for 1st, 2nd, 3rd, and 4th order neighbors in the network.
Fig. 3 Graph reduction of cancer-specific networks.
(A) A schematic of the initial graph with yellow boxes outlining groups of nodes that can be merged by contracting their edges. (B) A schematic of the reduced graph in which merged nodes are represented by diamonds. (C) The initial (sub)network for glioblastoma (cell line A172) with red nodes representing kinases and green nodes representing other proteins. (D) The reduced network for glioblastoma colored the same as in (C). The network in (C) is a randomly sampled subgraph from the original network with the same number of nodes as (D).
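The merging step shown in panels A and B can be illustrated with a generic edge-contraction routine. The sketch below assumes the groups of nodes to merge are already known; how the authors select those groups from topology and biological knowledge is not reproduced here, and the function name `contract_groups` and the toy labels are hypothetical. The sketch also does not check that each group is connected.

```python
def contract_groups(edges, groups):
    """Collapse each group of nodes into one merged node.

    edges:  iterable of (u, v) pairs describing an undirected graph
    groups: dict mapping a merged-node label to the set of nodes it absorbs
    Returns the set of undirected edges (frozensets) of the reduced graph.
    """
    # Map every grouped node to the label of the merged node that replaces it.
    owner = {}
    for label, members in groups.items():
        for member in members:
            if member in owner:
                raise ValueError(f"node {member!r} appears in two groups")
            owner[member] = label

    reduced = set()
    for u, v in edges:
        cu, cv = owner.get(u, u), owner.get(v, v)
        if cu != cv:  # edges inside a group vanish when it is contracted
            reduced.add(frozenset((cu, cv)))  # the set removes parallel edges
    return reduced


# Toy example: nodes a, b, c form one group and become the merged node "M1".
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
toy_reduced = contract_groups(toy_edges, {"M1": {"a", "b", "c"}})
```

In the toy example the three edges inside the group disappear, the edge from `c` to `d` is re-attached to `M1`, and the edge between `d` and `e` is unchanged.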
Properties of full-size and reduced graphs.
| Property | Full-size graph | Reduced graphs |
|---|---|---|
| Number of nodes | 19,144 | 1,349 ± 80 |
| Number of edges | 685,198 | 12,613 ± 608 |
| Average degree | 71 | 19 ± 0.3 |
| Density | 0.004 | 0.014 ± 0.0009 |
| Diameter | 8 | 4.073 ± 0.26 |
| Clustering coefficient | 0.287 | 0.659 ± 0.006 |
| Maximum betweenness centrality | 0.021 | 0.596 ± 0.011 |
| Average betweenness centrality | 1.11 × 10⁻⁴ | 7.88 × 10⁻⁴ ± 4.49 × 10⁻⁶ |
Statistics are calculated from the graph topology without considering node features. Values for reduced graphs are reported as the average ± standard deviation across the dataset.
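Two of the tabulated properties follow directly from the node and edge counts. The sketch below uses the standard definitions for a simple undirected graph (average degree 2E/N and density 2E/(N(N-1))); whether the authors used exactly these conventions is an assumption.

```python
def average_degree(n_nodes: int, n_edges: int) -> float:
    """Mean number of edges per node in an undirected graph: 2E / N."""
    return 2 * n_edges / n_nodes


def density(n_nodes: int, n_edges: int) -> float:
    """Fraction of possible undirected edges that are present: 2E / (N(N-1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Counts for the full-size graph, taken from the table above.
full_nodes, full_edges = 19_144, 685_198
full_avg_degree = average_degree(full_nodes, full_edges)
full_density = density(full_nodes, full_edges)
```

With the full-size counts, the density rounds to the tabulated 0.004, and the average degree comes out near 71.6 against the tabulated 71. The remaining rows (diameter, clustering coefficient, betweenness centrality) need the full edge list and are not covered by this sketch.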
Fig. 4 Entropy gain/loss for different reduction schemes.
Purple bars represent the Shannon entropy calculated using the feature matrix only, while yellow bars correspond to the graph-feature entropy computed using both feature and topological information of a graph. GO-BP requires that two incident nodes have a common biological process term to be assigned to the same cluster. HCA bars correspond to the clustering using GOGO similarities into 30, 100, and 300 clusters.
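For orientation, the Shannon entropy of a collection of discrete feature values can be computed as below. This is only a generic sketch of the textbook formula H = -Σ p·log2(p); how the authors discretize the feature matrix, and how the graph-feature entropy combines features with topology, is not specified in this record and is not reproduced here. The function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values) -> float:
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("entropy is undefined for an empty collection")
    # Sum p * log2(1/p) over the observed values; each term is non-negative.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Illustrative (made-up) expression states for the nodes of two small graphs.
mixed_states = ["up", "down", "up", "down"]
uniform_states = ["normal", "normal", "normal", "normal"]
```

A graph whose nodes all share one state has zero feature entropy, whereas an even split between two states gives one bit, so a reduction that merges redundant nodes can change this quantity.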
Performance of algorithms to predict the response of cancer cell lines to drugs.
| Data type | Model | Features | ACC | PPV | TPR | MCC | F-score |
|---|---|---|---|---|---|---|---|
| Matrix | MLP | DGE, LE | 0.55 | 0.63 | 0.64 | 0.27 | 0.55 |
| Matrix | MLP | DGE, KIP, DGA | 0.60 | 0.60 | 0.60 | 0.20 | 0.60 |
| Matrix | SVM-PCA | DGE, LE | 0.61 | 0.70 | 0.51 | 0.10 | 0.40 |
| Matrix | SVM-PCA | DGE, KIP, DGA | 0.62 | 0.72 | 0.53 | 0.16 | 0.45 |
| Matrix | RF-PCA | DGE, LE | 0.42 | 0.56 | 0.52 | 0.06 | 0.33 |
| Matrix | RF-PCA | DGE, KIP, DGA | 0.44 | 0.56 | 0.53 | 0.09 | 0.39 |
| Graph | WL Tree | DGE, KIP, DGA | 0.68 | 0.67 | 0.65 | 0.32 | 0.65 |
A graph-based approach is compared to two matrix-based methods. The performance of each algorithm is cross-validated at the tissue level.
MLP, multilayer perceptron; SVM-PCA, Support Vector Machines with Principal Component Analysis; RF-PCA, Random Forest with Principal Component Analysis; WL Tree, Weisfeiler–Lehman graph kernel; DGE, differential gene expression; LE, ligand embeddings; KIP, kinase inhibitor profiling; DGA, disease-gene associations; ACC, accuracy; PPV, precision; TPR, recall; MCC, Matthews correlation coefficient.
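The metrics in the table can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions and assumes that "F-score" refers to the F1 score (the harmonic mean of PPV and TPR), which the table footnote does not state. The function name `binary_metrics` and the example counts are hypothetical, and the sketch says nothing about how values were aggregated across tissue-level folds.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    acc = (tp + tn) / total
    ppv = tp / (tp + fp) if (tp + fp) else 0.0  # precision
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) else 0.0  # assumed F1
    return {"ACC": acc, "PPV": ppv, "TPR": tpr, "MCC": mcc, "F-score": f1}


# Made-up counts, for illustration only.
example_metrics = binary_metrics(tp=40, fp=10, tn=30, fn=20)
```

MCC uses all four cells, which makes it more informative than accuracy when the two response classes are imbalanced.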