Arno van Hilten, Steven A Kushner, Manfred Kayser, M Arfan Ikram, Hieab H H Adams, Caroline C W Klaver, Wiro J Niessen, Gennady V Roshchupkin.
Abstract
Applying deep learning in population genomics is challenging because of computational issues and a lack of interpretable models. Here, we propose GenNet, a novel open-source deep learning framework for predicting phenotypes from genetic variants. In this framework, interpretable and memory-efficient neural network architectures are constructed by embedding biological knowledge from public databases, resulting in neural networks that contain only biologically plausible connections. We applied the framework to seventeen phenotypes and found well-replicated genes such as HERC2 and OCA2 for hair and eye color, and novel genes such as ZNF773 and PCNT for schizophrenia. Additionally, the framework identified ubiquitin-mediated proteolysis, endocrine system and viral infectious diseases as the most predictive biological pathways for schizophrenia. GenNet is a freely available, end-to-end deep learning framework that allows researchers to develop and use interpretable neural networks to obtain novel insights into the genetic architecture of complex traits and diseases.
Year: 2021 PMID: 34535759 PMCID: PMC8448759 DOI: 10.1038/s42003-021-02622-z
Source DB: PubMed Journal: Commun Biol ISSN: 2399-3642
Fig. 1 Overview of the GenNet framework.
Neural networks are created by using prior biological knowledge to define connections between layers (i.e., SNPs are connected to their corresponding genes by using gene annotations, and genes are connected to their corresponding pathway by using pathway annotations). Prior knowledge is thus used to define each connection, creating interpretable networks.
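The idea of annotation-defined connections can be illustrated with a minimal NumPy sketch. This is not the GenNet implementation: the SNP, gene, and pathway names, the `build_mask` helper, the random weights, and the `tanh`/sigmoid activations are all hypothetical choices made for illustration. Dense masked matrices are used only for readability; a memory-efficient version would store just the nonzero connections.

```python
import numpy as np

# Hypothetical toy annotations (illustrative only, not taken from the paper).
snp_to_gene = {"rs1": "GENE_A", "rs2": "GENE_A", "rs3": "GENE_B", "rs4": "GENE_C"}
gene_to_pathway = {"GENE_A": "PATH_X", "GENE_B": "PATH_X", "GENE_C": "PATH_Y"}


def build_mask(child_to_parent, children):
    """Binary mask with mask[i, j] = 1 only if child i is annotated to parent j."""
    parents = sorted({child_to_parent[c] for c in children})
    column = {p: j for j, p in enumerate(parents)}
    mask = np.zeros((len(children), len(parents)))
    for i, child in enumerate(children):
        mask[i, column[child_to_parent[child]]] = 1.0
    return mask, parents


snps = sorted(snp_to_gene)
snp_gene_mask, genes = build_mask(snp_to_gene, snps)
gene_path_mask, pathways = build_mask(gene_to_pathway, genes)

# Random weights stand in for learned weights (assumption).
rng = np.random.default_rng(0)
w_snp_gene = rng.normal(size=snp_gene_mask.shape)
w_gene_path = rng.normal(size=gene_path_mask.shape)
w_out = rng.normal(size=(len(pathways), 1))


def forward(genotypes):
    """Forward pass in which only annotated connections carry signal."""
    gene_act = np.tanh(genotypes @ (w_snp_gene * snp_gene_mask))
    path_act = np.tanh(gene_act @ (w_gene_path * gene_path_mask))
    return 1.0 / (1.0 + np.exp(-(path_act @ w_out)))


# Two hypothetical subjects, genotypes coded as allele counts (0, 1, 2).
genotypes = np.array([[0.0, 1.0, 2.0, 0.0], [2.0, 0.0, 0.0, 1.0]])
predictions = forward(genotypes)

# Number of trainable connections with and without the annotation masks.
n_sparse = int(snp_gene_mask.sum() + gene_path_mask.sum())
n_dense = snp_gene_mask.size + gene_path_mask.size
```

Because every nonzero weight corresponds to one annotated SNP-gene or gene-pathway link, each weight can be traced back to a named biological entity, which is what makes such a network inspectable.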
Fig. 2 Simulation results and the importance of each gene for predicting schizophrenia.
a A simple, non-linear proof of concept. In this simulation, each gene in the causal region has two causal SNPs that cause the simulated disease. The magnitude of the learned weight is represented by the line thickness (contributing causal connections in red, non-contributing control connections in gray). b A second set of simulations shows the performance of GenNet, expressed as area under the curve (AUC), for increasing levels of heritability and training set size (c). The black curve presents the theoretical maximum of the AUC versus heritability. d Manhattan plot showing genes and their relative importance according to the network. The results shown are for distinguishing between schizophrenia cases and controls in the Sweden exome study. This Manhattan plot is a cross-section between the gene layer (21,390 nodes) and the outcome of a trained network with 1,288,701 input variants.
Overview of the main results for the experiments using the Rotterdam Study, UK Biobank, and Swedish Schizophrenia Exome Sequencing study data.
| Dataset (type) | Trait | Subjects & phenotype: Class I | Subjects & phenotype: Class II | AUC gene test (val) | Top three predictive genes | AUC pathway test (val) | Top predictive pathway: global level | Top predictive pathway: middle level | Top predictive pathway: local level |
|---|---|---|---|---|---|---|---|---|---|
| Rotterdam Study (genotype array) | Eye color | 4041 Blue | 2250 Other | | | 0.50 (0.52) | Organismal Systems (78.4%) | Digestive system (72.6%) | Pancreatic secretion (59.1%) |
| UK Biobank (exome) | Hair color | 1734 Red | 1727 Other | | | 0.77 (0.77) | Genetic Information Processing (87.4%) | Replication and repair (83.4%) | Fanconi anemia pathway (79.7%) |
| UK Biobank (exome) | Hair color | 3762 Black | 3753 Other | 0.80 (0.82) | | 0.76 (0.78) | Organismal Systems (46.9%) | Endocrine system (18.6%) | Axon guidance (5.0%) |
| UK Biobank (exome) | Hair color | 4501 Blond | 4518 Other | | | 0.58 (0.57) | Organismal Systems (70.2%) | Endocrine system (30.1%) | Adrenergic signaling in cardiomyocytes (4.1%) |
| UK Biobank (exome) | Bipolar disorder | 343 Cases | 347 Controls | | | 0.47 (0.55) | Organismal Systems (76.9%) | Endocrine system (46.5%) | Melanogenesis (32.9%) |
| UK Biobank (exome) | Atrial fibrillation | 192 Cases | 194 Controls | | | 0.57 (0.63) | Organismal Systems (39.6%) | Signal transduction (11.2%) | Cytokine-cytokine receptor interaction (4.4%) |
| UK Biobank (exome) | Coronary Artery Disease | 1563 Cases | 1600 Controls | | | 0.54 (0.56) | Environmental Information Processing (29.5%) | Signal transduction (27.7%) | PI3K-Akt signaling pathway (4.6%) |
| UK Biobank (exome) | Dementia | 139 Cases | 142 Controls | | | 0.55 (0.58) | Human Diseases (39.8%) | Signal transduction (22.2%) | Pathways in cancer (5.6%) |
| UK Biobank (exome) | Male balding pattern | 3454 Balding pattern 1 | 3454 Balding pattern 4 | | | 0.54 (0.55) | Organismal Systems (34.6%) | Nervous system (9.7%) | Metabolic pathways (8.7%) |
| UK Biobank (exome) | Asthma | 4229 Cases | 4214 Controls | | | 0.51 (0.54) | Genetic Information Processing (52.3%) | Folding, sorting and degradation (41.5%) | Ubiquitin mediated proteolysis (22.0%) |
| UK Biobank (exome) | Diabetes | 2557 Cases | 2555 Controls | | | 0.54 (0.54) | Environmental Information Processing (43.5%) | Signal transduction (40.5%) | Ras signaling pathway (7.9%) |
| UK Biobank (exome) | Breast cancer | 1070 Cases | 1082 Controls | | | 0.51 (0.56) | Human Diseases (57.1%) | Infectious diseases: Viral (16.6%) | Pathways in cancer (6.5%) |
| Sweden (exome) | Schizophrenia | 4969 Cases | 6245 Controls | | | 0.68 (0.67) | Human Diseases (30.8%) | Infectious diseases: Viral (27.3%) | Human papillomavirus infection (11.7%) |
The performance in AUC for the network with gene annotations is bold if the network outperformed or matched LASSO regression. Manhattan plots for the genes can be found in Supplementary Notes 3–5. *MC1R was not annotated but was identified by linkage disequilibrium. **Many genes contributed to the prediction without clear separation between genes (see Supplementary Note 4).
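The AUC values in the table can be read as the probability that a randomly chosen subject from one class receives a higher predicted score than a randomly chosen subject from the other class. The sketch below computes this directly from pairwise comparisons. The `auc_score` helper and the toy labels and scores are hypothetical and are not taken from the study.

```python
import numpy as np


def auc_score(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    case_scores = scores[labels]
    control_scores = scores[~labels]
    # Compare every case score with every control score.
    greater = (case_scores[:, None] > control_scores[None, :]).sum()
    ties = (case_scores[:, None] == control_scores[None, :]).sum()
    return (greater + 0.5 * ties) / (case_scores.size * control_scores.size)


# Hypothetical predictions for three cases (1) and three controls (0).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = auc_score(labels, scores)  # 8 of the 9 case-control pairs are ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking, which is why table entries close to 0.50 indicate little predictive signal for that trait and annotation.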
Fig. 3 Sunburst plot of the important KEGG pathways for predicting schizophrenia.
A neural network with layers based on KEGG pathway information was trained to predict schizophrenia. The relative importance was calculated for each pathway using the learned weights of this neural network. The sunburst plot is read from the center, which shows the output node with the prediction of the neural network. The inner ring represents the last layer with the highest-level pathways, followed by the mid-level pathways, and finally the lowest-level pathways. The gene and SNP layers are omitted for clarity.
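One way to turn learned weights into nested percentages of the kind shown in a sunburst plot is sketched below. The caption does not give the exact formula, so this rests on an assumption: each node's share is its absolute weight divided by the sum of its siblings' absolute weights, multiplied by its parent's share. The pathway names and weight values are hypothetical.

```python
# Hypothetical learned weights (illustrative values, not results from the paper).
# Top-level pathway groups connect to the output node.
top_weights = {"Group A": 0.6, "Group B": -0.3, "Group C": 0.1}
# Lower-level pathways connect to their parent group.
child_weights = {
    "Group A": {"A1": 0.5, "A2": -0.5},
    "Group B": {"B1": 0.2},
    "Group C": {"C1": 0.3, "C2": 0.1},
}


def normalize(weights):
    """Share of each node, taken as |weight| / sum of |weights| (assumed rule)."""
    total = sum(abs(w) for w in weights.values())
    return {name: abs(w) / total for name, w in weights.items()}


# Inner ring: shares of the top-level groups.
top_share = normalize(top_weights)

# Outer ring: each child inherits a fraction of its parent's share.
nested_share = {}
for parent, share in top_share.items():
    for child, fraction in normalize(child_weights[parent]).items():
        nested_share[child] = share * fraction

for name, share in nested_share.items():
    print(f"{name}: {100 * share:.1f}%")
```

Under this rule the shares within each ring sum to 100%, and a child's percentage can never exceed its parent's, matching the way the global, middle, and local percentages decrease from left to right in the results table.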