Paul Scherer, Maja Trębacz, Nikola Simidjievski, Ramon Viñas, Zohreh Shams, Helena Andres Terre, Mateja Jamnik, Pietro Liò.
Abstract
MOTIVATION: Gene expression data is commonly used at the intersection of cancer research and machine learning to better understand the molecular status of tumour tissue. Deep learning predictive models have been applied to gene expression data because they scale well and remove the need for manual feature engineering. However, gene expression data is often very high dimensional, noisy, and available for only a small number of samples. This poses significant problems for learning algorithms: models often overfit, learn noise, and struggle to capture biologically relevant information. In this article we utilise external biological knowledge embedded within the structure of gene interaction graphs, such as protein-protein interaction (PPI) networks, to guide the construction of predictive models.
Year: 2021 PMID: 34888618 PMCID: PMC8826027 DOI: 10.1093/bioinformatics/btab830
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An overview of our procedure for incorporating PPI network based protein complex discovery and constructing computational graphs for gene expression analysis. GINCCo's procedure for model construction is best described in three stages: (i) induction of the case-study-specific sub-graph common to the input gene expression dataset (for a set of k genes K) and the external PPI network, which is used for (ii) the unsupervised discovery of the protein complexes that act as biologically relevant higher-level modules of the inputs, and (iii) the use of the clusterings to construct a bipartite factor graph between the gene expressions and the protein complexes, extending the use of the graph in the predictive model that transitively maps the gene expressions to phenotypes via the protein complex activities. In the final computational graph model, we can see blue genes, which are excluded as a result of extracting the case-study-specific graph, and red genes, which are excluded as a result of clustering process on
Fig. 2. A visual comparison between the factor graph produced by an FC computational graph, as in a standard neural network, and the one produced by GINCCo, using the toy example introduced in Figure 1
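
The contrast described in the Fig. 2 caption can be illustrated as a dense layer whose weight matrix is masked by gene-to-complex membership. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the gene names, the complex memberships, the `tanh` activation, and the helper names `membership_mask` and `masked_layer` are all hypothetical.

```python
import numpy as np

# Hypothetical toy inputs: five genes and two discovered protein complexes.
genes = ["g1", "g2", "g3", "g4", "g5"]
complexes = [{"g1", "g2", "g3"}, {"g3", "g4"}]  # g5 belongs to no complex


def membership_mask(genes, complexes):
    """Binary (genes x complexes) matrix: 1 where a gene is a complex member."""
    mask = np.zeros((len(genes), len(complexes)))
    for j, members in enumerate(complexes):
        for i, gene in enumerate(genes):
            if gene in members:
                mask[i, j] = 1.0
    return mask


def masked_layer(x, weights, mask):
    """Dense layer restricted to the edges of the bipartite factor graph.

    The tanh activation is an assumption made for this sketch.
    """
    return np.tanh(x @ (weights * mask))


mask = membership_mask(genes, complexes)
rng = np.random.default_rng(0)
weights = rng.normal(size=mask.shape)
x = rng.normal(size=(4, len(genes)))  # 4 samples of gene expression values

activities = masked_layer(x, weights, mask)  # one activity per complex
fc_params = mask.size            # fully connected: every gene to every complex
sparse_params = int(mask.sum())  # masked: one weight per membership edge
```

In this toy setting the fully connected layer would carry 10 weights and the masked layer 5, and the expression of `g5` has no influence on the complex activities because it has no membership edge.
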
Number of parameters used in an equally dimensioned FC MLP network and in the proposed method, using different clustering methods to automatically discover protein complexes and their members on the STRING 9606 PPI network and the 24 368 genes measured in METABRIC
| Method | MCODE (40 clusters) | COACH (4108 clusters) | IPCA (5744 clusters) | DPCLUS (1562 clusters) |
|---|---|---|---|---|
| FC MLP | 974 720 | 100 103 744 | 139 969 792 | 38 062 816 |
| GINCCo | 14 537 | 1 431 338 | 2 800 267 | 19 545 |
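
The FC MLP row can be reproduced with simple arithmetic: each entry equals the number of genes multiplied by the number of clusters. The sketch below checks this, assuming (this is an inference from the numbers, not a statement in the source) that the row counts only the gene-to-complex weight matrix without bias terms. The variable names are illustrative.

```python
n_genes = 24368  # genes measured in METABRIC, from the table caption
n_clusters = {"MCODE": 40, "COACH": 4108, "IPCA": 5744, "DPCLUS": 1562}

# Assumption: the FC MLP row counts only the gene-to-complex weights.
fc_params = {name: n_genes * k for name, k in n_clusters.items()}

# GINCCo parameter counts copied from the table.
gincco_params = {
    "MCODE": 14537,
    "COACH": 1431338,
    "IPCA": 2800267,
    "DPCLUS": 19545,
}

# How many times fewer parameters GINCCo uses than the FC layer.
reduction = {name: fc_params[name] / gincco_params[name] for name in n_clusters}
```

Under this reading, the reduction factor ranges from roughly 50-fold (IPCA) to nearly 2000-fold (DPCLUS).
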
Average percentage balanced accuracy (B-ACC) and W-AUC with SDs over five repeated train and holdout test evaluations using all of the gene expression features of METABRIC and TCGA-HNCS
| Method | Distance relapse B-ACC | Distance relapse W-AUC | PAM50 B-ACC | PAM50 W-AUC | IC10 B-ACC | IC10 W-AUC | Tumour grade B-ACC | Tumour grade W-AUC | 2-year relapse-free survival B-ACC | 2-year relapse-free survival W-AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| MajorityClass | 50.00 ± 0.00 | 0.50 ± 0.00 | 20.00 ± 0.00 | 0.50 ± 0.00 | 9.09 ± 0.00 | 0.50 ± 0.00 | 25.00 ± 0.00 | 0.50 ± 0.00 | 50.00 ± 0.00 | 0.50 ± 0.00 |
| SVM | 54.43 ± 1.85 | 0.54 ± 0.02 | 72.21 ± 3.07 | 0.94 ± 0.01 | 55.72 ± 3.79 | 0.95 ± 0.01 | 39.35 ± 4.28 | 0.67 ± 0.04 | 56.59 ± 4.83 | 0.57 ± 0.05 |
| FC MLP | 56.92 ± 2.65 | 0.57 ± 0.03 | 74.65 ± 3.60 | 0.94 ± 0.01 | 66.32 ± 1.99 | 0.95 ± 0.01 | 34.29 ± 3.53 | 0.66 ± 0.04 | 58.14 ± 4.23 | 0.58 ± 0.05 |
| GraphReg | 49.86 ± 1.05 | 0.50 ± 0.01 | 22.57 ± 2.71 | 0.82 ± 0.01 | 9.09 ± 0.00 | 0.83 ± 0.01 | 27.63 ± 3.25 | 0.64 ± 0.02 | 55.42 ± 2.35 | 0.55 ± 0.02 |
| GINCCo + MCODE | 56.65 ± 1.86 | 0.57 ± 0.02 | 73.52 ± 2.71 | 0.93 ± 0.01 | 57.77 ± 1.73 | 0.93 ± 0.01 | 36.93 ± 10.14 | 0.64 ± 0.03 | 55.43 ± 2.87 | 0.55 ± 0.03 |
| GINCCo + COACH | 56.73 ± 0.98 | 0.57 ± 0.01 | 74.97 ± 3.27 | 0.95 ± 0.01 | 63.04 ± 2.98 | 0.95 ± 0.01 | 39.38 ± 11.48 | 0.65 ± 0.03 | 56.79 ± 3.49 | 0.57 ± 0.03 |
| GINCCo + IPCA | 57.13 ± 1.47 | 0.57 ± 0.01 | 74.62 ± 4.55 | 0.94 ± 0.01 | 62.26 ± 4.51 | 0.94 ± 0.01 | 37.36 ± 9.54 | 0.63 ± 0.03 | 55.56 ± 3.39 | 0.55 ± 0.03 |
| GINCCo + DPCLUS | 57.27 ± 1.80 | 0.57 ± 0.02 | 75.97 ± 4.59 | 0.97 ± 0.01 | 70.43 ± 3.68 | 0.97 ± 0.00 | 39.09 ± 9.96 | 0.67 ± 0.03 | 57.17 ± 4.42 | 0.57 ± 0.04 |
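
The MajorityClass baseline values (50.00, 20.00, 9.09, 25.00, 50.00) are consistent with 100/k for k = 2, 5, 11, 4 and 2 classes, which is what balanced accuracy yields for a predictor that always outputs a single class. The sketch below assumes the standard definition of balanced accuracy as the mean of per-class recall; the source does not spell out the definition, and the labels used here are made up for illustration.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, expressed as a percentage."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return 100.0 * float(np.mean(recalls))


# Hypothetical imbalanced labels with 5 classes; class 0 dominates.
y_true = np.array([0] * 60 + [1] * 20 + [2] * 10 + [3] * 6 + [4] * 4)
y_majority = np.zeros_like(y_true)  # always predict the most frequent class

# Recall is 1 for class 0 and 0 for the other four classes.
score = balanced_accuracy(y_true, y_majority)
```

Because the metric averages recall over classes rather than over samples, the class imbalance in `y_true` does not help the majority-class predictor, which is why it serves as a chance-level reference for the other rows.
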
Descriptive statistics of the protein complexes discovered via the topological clustering of the study PPI network induced from the STRING PPI network and METABRIC
| Statistic | MCODE | COACH | IPCA | DPCLUS |
|---|---|---|---|---|
| Number of protein complexes | 40 | 4108 | 5744 | 1562 |
| Maximum cluster size | 1555 | 2684 | 639 | 359 |
| Minimum cluster size | 3 | 4 | 5 | 2 |
| Average cluster size | 363.43 | 348.43 | 487.51 | 12.51 |
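
These statistics appear to account for the GINCCo parameter counts reported in the parameter table: the number of complexes multiplied by the average cluster size, that is, the approximate number of gene-to-complex membership edges, lands very close to each GINCCo entry. The sketch below checks this, under the assumption (not stated in the source) that GINCCo carries one weight per membership edge; the small residuals may come from rounding of the averages.

```python
# (number of complexes, average cluster size) from the statistics table.
stats = {
    "MCODE": (40, 363.43),
    "COACH": (4108, 348.43),
    "IPCA": (5744, 487.51),
    "DPCLUS": (1562, 12.51),
}

# GINCCo parameter counts copied from the parameter table.
gincco_params = {
    "MCODE": 14537,
    "COACH": 1431338,
    "IPCA": 2800267,
    "DPCLUS": 19545,
}

# Approximate number of gene-to-complex membership edges per method.
estimated_edges = {name: n * avg for name, (n, avg) in stats.items()}

# Relative difference between the estimate and the reported count.
relative_gap = {
    name: abs(estimated_edges[name] - gincco_params[name]) / gincco_params[name]
    for name in stats
}
```

For all four clustering methods the relative gap stays below 0.1%.
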
B-ACC and W-AUC with SDs over five repeated train/test evaluations using all of the gene expression features of METABRIC and TCGA-HNCS
| Method | Distance relapse B-ACC | Distance relapse W-AUC | PAM50 B-ACC | PAM50 W-AUC | IC10 B-ACC | IC10 W-AUC | Tumour grade B-ACC | Tumour grade W-AUC | 2-year relapse-free survival B-ACC | 2-year relapse-free survival W-AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| FC MLP | 56.92 ± 2.65 | 0.57 ± 0.03 | 74.65 ± 3.60 | 0.94 ± 0.01 | 66.32 ± 1.99 | 0.95 ± 0.01 | 34.29 ± 3.53 | 0.66 ± 0.04 | 58.14 ± 4.23 | 0.58 ± 0.05 |
| RC MLP-R | 56.91 ± 0.78 | 0.57 ± 0.01 | 72.06 ± 6.55 | 0.93 ± 0.04 | 57.25 ± 10.03 | 0.92 ± 0.06 | 38.02 ± 3.26 | 0.64 ± 0.05 | 54.86 ± 1.58 | 0.54 ± 0.02 |
| RC MLP-M | 55.25 ± 1.56 | 0.55 ± 0.02 | 64.87 ± 8.79 | 0.92 ± 0.05 | 54.10 ± 6.68 | 0.91 ± 0.04 | 35.45 ± 2.45 | 0.66 ± 0.01 | 54.15 ± 1.87 | 0.54 ± 0.02 |
| GINCCo + DPCLUS | 57.27 ± 1.80 | 0.57 ± 0.02 | 75.97 ± 4.59 | 0.97 ± 0.01 | 70.43 ± 3.68 | 0.97 ± 0.00 | 39.09 ± 9.96 | 0.67 ± 0.03 | 57.17 ± 4.42 | 0.57 ± 0.04 |