Tao Wang, Qidi Peng, Bo Liu, Yongzhuang Liu, Yadong Wang.
Abstract
The study of disease-relevant gene modules is one of the main methods for discovering disease pathways and potential drug targets. Recent studies have found that most disease proteins tend to form many separate connected components scattered across the protein-protein interaction network. However, most research on discovering disease modules is biased toward well-studied seed genes and tends to extend the seed genes into a single connected subnetwork. In this paper, we propose N2V-HC, an algorithmic framework that aims to discover scattered disease modules in an unbiased way, based on deep representation learning of integrated multi-layer biological networks. Our method first predicts disease-associated genes from summary data of genome-wide association studies (GWAS) and expression quantitative trait loci (eQTL) studies, and generates an integrated network on the basis of the human interactome. The features of nodes in the network are then extracted by deep representation learning. Hierarchical clustering with a dynamic tree cut method is applied to discover the modules that are enriched with disease-associated genes. Evaluation on real and simulated networks shows that N2V-HC performs better than existing methods in network module discovery. Case studies on Parkinson's disease and Alzheimer's disease show that N2V-HC can be used to discover biologically meaningful modules related to the pathways underlying complex diseases.
Keywords: GWAS; disease module identification; eQTL; hierarchical clustering; node2vec
Year: 2020 PMID: 32435638 PMCID: PMC7218106 DOI: 10.3389/fbioe.2020.00418
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Figure 1. Framework of the N2V-HC algorithm. The left-most panel shows the input data sources of the integrated network: summary statistics of GWAS and eQTL studies, and a PPI network or other types of networks. Edge width represents edge weight. The representation learning step extracts global connectivity features for the N nodes of the integrated network using a biased random walk technique and the Skip-gram model. Each feature is a d-dimensional numeric vector. An unsupervised hierarchical clustering method and a dynamic tree-cut method are applied in an iterative module convergence process. The circle drawn with a red dashed line represents a disease module that is significantly enriched with egenes.
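The biased random walk mentioned in the caption can be illustrated with a minimal sketch in the style of node2vec (the method named in the keywords). The return parameter `p`, the in-out parameter `q`, the function name `biased_walk`, and the toy gene names are assumptions made for illustration; the source does not give the settings or the implementation used by N2V-HC.

```python
import random


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Second-order biased random walk on a weighted undirected graph.

    adj maps node -> {neighbor: edge weight}. The parameters p and q follow
    the node2vec convention (assumed here, not taken from the source):
    a small p favors returning to the previous node, a small q favors
    moving away from it.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = list(adj[cur])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step: only the edge weights matter.
            weights = [adj[cur][n] for n in neighbors]
        else:
            prev = walk[-2]
            weights = []
            for n in neighbors:
                w = adj[cur][n]
                if n == prev:            # step back to the previous node
                    weights.append(w / p)
                elif n in adj[prev]:     # stay at distance 1 from prev
                    weights.append(w)
                else:                    # move to distance 2 from prev
                    weights.append(w / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical toy network: (node, node, weight) triples.
edges = [
    ("GENE_A", "GENE_B", 1.0),
    ("GENE_B", "GENE_C", 0.5),
    ("GENE_A", "GENE_C", 2.0),
    ("GENE_C", "GENE_D", 1.0),
]
toy = {}
for u, v, w in edges:
    toy.setdefault(u, {})[v] = w
    toy.setdefault(v, {})[u] = w

print(biased_walk(toy, "GENE_A", 8, p=0.5, q=2.0, rng=random.Random(1)))
```

In a full pipeline, many such walks per node would be passed as "sentences" to a Skip-gram implementation to obtain one d-dimensional vector per node.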
Figure 2. Steps of disease module identification.
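Summary of real-world network datasets.
| Dataset | # Nodes | # Edges | Density | # Communities | Type | Reference |
| --- | --- | --- | --- | --- | --- | --- |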
| Karate | 34 | 78 | 1.4E-1 | 2 | w, ud | Zachary, |
| Dolphins | 62 | 159 | 8.4E-2 | 2 | uw, ud | Lusseau et al., |
| UKfaculty | 81 | 817 | 2.5E-1 | 4 | w, ud | Nepusz et al., |
| Polbooks | 105 | 441 | 8.1E-2 | 3 | uw, ud | Krebs, |
| Football | 115 | 613 | 9.4E-2 | 12 | uw, ud | Girvan and Newman, |
| Cora | 2,708 | 5,429 | 1.4E-3 | 7 | uw, ud | Fakhraei et al., |
w, weighted graph; uw, unweighted graph; ud, undirected graph.
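The density values in this table are consistent with the usual definition for a simple undirected graph, edges present divided by edges possible. The sketch below recomputes them from the node and edge counts; the formula is an assumption checked against the table, not a definition quoted from the source.

```python
def graph_density(num_nodes: int, num_edges: int) -> float:
    """Density of a simple undirected graph: m / (n * (n - 1) / 2)."""
    possible_edges = num_nodes * (num_nodes - 1) / 2
    return num_edges / possible_edges


# (nodes, edges) as listed in the table above.
datasets = {
    "Karate": (34, 78),
    "Dolphins": (62, 159),
    "UKfaculty": (81, 817),
    "Polbooks": (105, 441),
    "Football": (115, 613),
    "Cora": (2708, 5429),
}

for name, (n, m) in datasets.items():
    print(f"{name}: {graph_density(n, m):.1e}")
```

For the first five datasets this reproduces the tabulated densities to two significant figures. For Cora the listed counts give about 1.5E-3 rather than 1.4E-3; one possible explanation, which is an assumption, is that the tabulated density was computed after removing duplicate edges.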
Clustering performance on real-world networks.
| Karate | 0.844 | 0.847 | 0.529 | 0.588 | 0.588 | 0.623 | |
| Dolphins | 0.935 | 0.804 | 0.677 | 0.613 | 0.565 | 0.533 | |
| UKfaculty | 0.494 | 0.889 | 0.951 | 0.370 | 0.333 | 0.397 | |
| Polbooks | 0.609 | 0.816 | 0.838 | 0.400 | 0.438 | 0.451 | |
| Football | 0.113 | 0.583 | 0.235 | 0.235 | 0.435 | 0.922 (5, 2, 11) | |
| Cora | 0.356 | 0.512 | 0.294 | 0.298 | 0.287 | 0.295 |
AP, affinity propagation; MCL, Markov cluster; SC, spectral clustering; HC, hierarchical clustering; MMS, minModuleSize; DS, DeepSplit; NPC, number of predicted clusters. Parameter settings: the MCL inflation factor was set to 2.0 for Karate, 2.0 for Dolphins, 2.5 for UKfaculty, 2.1 for Polbooks, 2.0 for Football, and 1.8 for Cora. Parameters of AP, GLay, SC, HC, and mCODE were left at their defaults, except that the cluster number was set to the ground truth if available. Bold values indicate the best micro F1 scores.
Figure 3. Clustering result of N2V-HC on the Dolphins social network (Lusseau et al., 2003). (A) The topology of the original network, with colors representing the ground truth communities. (B) The hierarchical clustering dendrogram constructed by N2V-HC, where each leaf node represents a member of the original network. The two predicted modules are colored in red and blue. Node “40,” which is misclassified, is labeled in yellow.
Summary of LFR simulated networks.
| Network | # Nodes | # Edges | Density | # Communities |
| --- | --- | --- | --- | --- |
| LFR (100, 10, 30) | 100 | 1,047 | 0.212 | 7 |
| LFR (500, 10, 50) | 500 | 5,269 | 0.042 | 36 |
| LFR (1,000, 20, 100) | 1,000 | 19,115 | 0.038 | 39 |
| LFR (2,000, 30, 200) | 2,000 | 60,946 | 0.030 | 34 |
Clustering performance on LFR-benchmark datasets.
| LFR (100, 10, 30) | 0.304 | 0.131 | 0.350 | 0.28 | 0.26 | 0.35 | |
| LFR (500, 10, 50) | 0.090 | 0.127 | 0.120 | 0.128 | 0.14 | 0.138 | |
| LFR (1,000, 20, 100) | 0.097 | 0.075 | 0.103 | 0.109 | 0.145 | 0.620 (6, 3, 40) | |
| LFR (2,000, 30, 200) | 0.092 | 0.033 | 0.651 | 0.080 | 0.082 | 0.135 |
AP, affinity propagation; MCL, Markov cluster; SC, spectral clustering; HC, hierarchical clustering; MMS, minModuleSize; DS, DeepSplit; NPC, number of predicted clusters. Parameter settings: the MCL inflation factor was left at its default (2.5) for all networks. Parameters of AP, GLay, SC, HC, and mCODE were left at their defaults, except that the cluster number was set to the ground truth if available. Bold values indicate the best micro F1 scores.
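Gene set enrichment analysis of PD modules.
| Module | # Gene | # PD egene | P-value | Adjusted P-value | Enriched pathways | Reference |
| --- | --- | --- | --- | --- | --- | --- |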
| PD36 | 39 | 20 | 2.94E-23 | 1.50E-21 | GPCR ligand binding | Martin et al., |
| PD41 | 33 | 17 | 5.62E-20 | 9.55E-19 | Retinoic acid biosynthesis | Jacobs et al., |
| PD42 | 32 | 13 | 7.47E-14 | 9.52E-13 | GPI-anchor biosynthesis, ER/Golgi trafficking, Membrane lipid biosynthesis | Wang et al., |
| PD12 | 126 | 19 | 5.45E-11 | 5.56E-10 | Endocytosis, Immune response | Mosley et al., |
| PD20 | 80 | 13 | 2.57E-08 | 2.18E-07 | Immune response, Integrin cell surface | Wu and Reddy, |
| PD37 | 38 | 9 | 1.28E-07 | 9.35E-07 | Potassium channels, Glycogen metabolism | Chen et al., |
| PD44 | 30 | 7 | 3.75E-06 | 2.12E-05 | Hemoglobin complex | Freed and Chakrabarti, |
| PD10 | 135 | 13 | 1.18E-05 | 6.00E-05 | Oxidoreductase activity | Parker et al., |
| PD34 | 42 | 7 | 3.94E-05 | 1.82E-04 | Glycosaminoglycans biosynthesis | Lehri-Boufala et al., |
| PD45 | 29 | 5 | 4.43E-04 | 1.74E-03 | Immune response, Natural killer cell mediated immunity | Mihara et al., |
| PD35 | 42 | 5 | 2.49E-03 | 9.08E-03 | Lysosome, Sphingolipid metabolism | Dehay et al., |
| PD46 | 29 | 4 | 3.96E-03 | 1.34E-02 | WNT signaling pathway, Dopaminergic neuron differentiation | Arenas, |
# Gene, number of genes in a module; # PD egene, number of egenes regulated by PD susceptibility variants in a module.
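One common way to score whether a module contains more egenes than expected by chance is a one-sided hypergeometric test. The source does not state which test was used for this table, so the sketch below is only an illustration of that general technique. The background size and the total number of egenes are hypothetical placeholders; only the module size (39) and egene count (20) are taken from the PD36 row.

```python
from math import comb


def hypergeom_upper_tail(k: int, population: int, successes: int, draws: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    population: genes in the background network
    successes:  egenes in the background network
    draws:      genes in the module
    k:          egenes observed in the module
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total


# Hypothetical background: 15,000 genes, 600 of them egenes (placeholders).
# Module PD36 from the table: 39 genes, 20 of them PD egenes.
p_value = hypergeom_upper_tail(k=20, population=15000, successes=600, draws=39)
print(f"enrichment P-value under the assumed background: {p_value:.2e}")
```

Because the background numbers are placeholders, the printed value is not expected to match the table.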
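Gene set enrichment analysis of AD modules.
| Module | # Gene | # AD egene | P-value | Adjusted P-value | Enriched pathways | Reference |
| --- | --- | --- | --- | --- | --- | --- |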
| AD1 | 88 | 6 | 6.36E-09 | 5.09E-08 | Immune response | Wang et al., |
| AD2 | 42 | 3 | 5.16E-05 | 2.07E-04 | WNT signaling pathway, Dopaminergic neuron differentiation | dos Santos and Smidt, |
| AD3 | 177 | 4 | 2.28E-04 | 6.08E-04 | Immune response, JAK/STAT signaling pathway | Nicolas et al., |
| AD4 | 52 | 2 | 3.73E-03 | 7.47E-03 | ER/Golgi trafficking, Glycosaminoglycans metabolism | Placido et al., |
Figure 4. Module dendrogram of Parkinson's disease. The dendrogram of modules is built from the module eigen-feature, i.e., the eigenvector corresponding to the first principal component of the node features in a module. Distance is measured as one minus Pearson's correlation coefficient. Modules covered by the shaded rectangle share similar functions, as illustrated.
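The distance described in this caption can be sketched as follows, using only the Python standard library. Several details are assumptions, since the source does not spell them out: each node's feature vector is centered before the decomposition, the leading eigenvector is found by power iteration, and its sign is aligned with the module's mean profile. The function names and the toy feature values are hypothetical.

```python
import math
import random


def eigen_feature(rows, iters=200, seed=0):
    """Leading eigenvector of X^T X for a module's row-centered feature matrix X.

    rows: list of node feature vectors (one d-dimensional list per node).
    """
    centered = [[x - sum(r) / len(r) for x in r] for r in rows]
    d = len(centered[0])
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iters):
        # Power iteration step: v <- X^T (X v), then normalize.
        xv = [sum(a * b for a, b in zip(r, v)) for r in centered]
        w = [sum(r[j] * s for r, s in zip(centered, xv)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Eigenvectors are defined up to sign; align with the mean profile (assumption).
    mean_profile = [sum(r[j] for r in centered) / len(centered) for j in range(d)]
    if sum(a * b for a, b in zip(v, mean_profile)) < 0:
        v = [-x for x in v]
    return v


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)


def module_distance(rows_a, rows_b):
    """One minus the Pearson correlation of the two module eigen-features."""
    return 1.0 - pearson(eigen_feature(rows_a), eigen_feature(rows_b))


# Hypothetical 4-dimensional node features for three toy modules.
module_a = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
module_b = [[0.0, 1.0, 2.0, 3.0], [5.0, 7.0, 9.0, 11.0]]
module_c = [[4.0, 3.0, 2.0, 1.0], [8.0, 6.0, 4.0, 2.0]]

print(module_distance(module_a, module_b))  # similar profiles: close to 0
print(module_distance(module_a, module_c))  # opposite profiles: close to 2
```

A matrix of such pairwise distances could then be passed to any hierarchical clustering routine to draw a module dendrogram.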