Fabio Fabris, Daniel Palmer, Khalid M Salama, João Pedro de Magalhães, Alex A Freitas.
Abstract
MOTIVATION: One way to identify genes possibly associated with ageing is to build a machine learning classification model capable of classifying genes as associated with multiple age-related diseases. To build this model, we use a pre-compiled list of human genes associated with age-related diseases and apply a novel Deep Neural Network (DNN) method to find associations between gene descriptors (e.g. Gene Ontology terms, protein-protein interaction data and biological pathway information) and age-related diseases.
Year: 2020 PMID: 31845988 PMCID: PMC7141856 DOI: 10.1093/bioinformatics/btz887
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Architecture. Gray nodes represent the inputs coming from several biological databases. They are followed by nodes representing the supervised feature extraction modules (encoders). The Combiner joins the higher-level features coming from the feature extraction modules to make a final prediction (rightmost node). Each encoder node, as well as the combiner node, is a deep multi-layer neural network (DNN).
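The data flow described in Fig. 1 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The toy input sizes, the layer widths, the ReLU and sigmoid activations, the random untrained weights and the helper names `make_mlp`, `forward` and `modular_predict` are all assumptions. Only the overall shape comes from the caption and the table of network sizes: one encoder per feature type, a combiner fed with the concatenated encoder outputs, and 27 output neurons.

```python
import math
import random

def make_mlp(sizes, rng):
    """Build a fully connected network as a list of (weights, biases) layers."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(layers, x, output_activation):
    """Apply ReLU after every hidden layer and output_activation after the last one."""
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        is_last = i == len(layers) - 1
        x = [output_activation(v) if is_last else relu(v) for v in z]
    return x

def modular_predict(encoders, combiner, inputs):
    """Encode each feature type separately, concatenate the results, then combine."""
    features = []
    for name, encoder in encoders.items():
        features.extend(forward(encoder, inputs[name], relu))
    return forward(combiner, features, sigmoid)

rng = random.Random(0)
# Toy input sizes (assumption); the real feature counts are far larger.
toy_input_sizes = {"GO": 12, "PPI": 10, "PathDIP": 8, "GTex": 6}
encoders = {name: make_mlp([n, 8, 4], rng) for name, n in toy_input_sizes.items()}
# The combiner sees the four 4-dimensional encoder outputs and has 27 output neurons.
combiner = make_mlp([4 * len(encoders), 8, 27], rng)

# One hypothetical gene with random binary features for each feature type.
gene = {name: [rng.randint(0, 1) for _ in range(n)] for name, n in toy_input_sizes.items()}
probabilities = modular_predict(encoders, combiner, gene)
```

Because the weights are random, the 27 values in `probabilities` carry no biological meaning; the sketch only shows how the modules connect.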
Number of neurons in the DNN for each type of input feature
| Module(s) | No. of trainable weights | No. of input neurons (number of features) | No. of hidden neurons (in all hidden layers) | No. of output neurons |
|---|---|---|---|---|
| GO | 874 491 | 13 615 | 112 | 27 |
| PPI | 891 899 | 13 887 | 112 | 27 |
| PathDIP | 309 691 | 4790 | 112 | 27 |
| GTex | 8507 | 84 | 112 | 27 |
| All (concat.) | 2 075 195 | 32 376 | 112 | 27 |
| Modular DNN | 2 091 963 (9211 for the Combiner module, 2 082 752 for all four Encoder modules) | 32 376 | 160 | 27 |
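The weight counts in the table can be checked with a short calculation. The sketch below counts weights plus biases for a fully connected network. The split of the 112 hidden neurons into three layers of 64, 32 and 16 is not stated in this section; it is an assumption, chosen because it sums to 112 and reproduces the counts for the five single-network rows. The function name `count_weights` is likewise hypothetical.

```python
def count_weights(n_inputs, hidden_sizes, n_outputs):
    """Trainable parameters (weights plus biases) of a fully connected network."""
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

HIDDEN = [64, 32, 16]  # assumed split of the 112 hidden neurons (not given in the text)
input_sizes = {"GO": 13615, "PPI": 13887, "PathDIP": 4790, "GTex": 84, "All (concat.)": 32376}
counts = {name: count_weights(n, HIDDEN, 27) for name, n in input_sizes.items()}
```

Under that assumption `counts` holds 874 491, 891 899, 309 691, 8507 and 2 075 195, the values listed for GO, PPI, PathDIP, GTex and the concatenated dataset. The Modular DNN total is simply the sum given in its row: 9211 + 2 082 752 = 2 091 963.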
Comparing AUROC results of our Modular DNN approach (in bold) with the individual feature types and the full dataset concatenating all features (Concat. all feats.) using the DNN, BT and LR algorithms
| Feature type or classification method/approach | % of unknown genes in the feature type | DNN AUROC (pred. all genes) | DNN AUROC (pred. known genes only) | BT AUROC (pred. all genes) | BT AUROC (pred. known genes only) | LR AUROC (pred. all genes) | LR AUROC (pred. known genes only) | Naive AUROC (pred. all genes) |
|---|---|---|---|---|---|---|---|---|
| GO | 4.0% | 0.8498 | 0.8583 | 0.8520 | 0.8468 | 0.7900 | 0.7981 | 0.7995 |
| PPI | 19.4% | 0.6381 | 0.6897 | 0.6585 | 0.6782 | 0.6844 | 0.6889 | 0.6080 |
| PathDIP | 16.8% | 0.7535 | 0.8051 | 0.8392 | 0.8314 | 0.7600 | 0.7607 | 0.7817 |
| GTex | 4.1% | 0.7507 | 0.7476 | 0.7156 | 0.7114 | 0.7173 | 0.7233 | 0.5062 |
| Concat. all feats. | 0.0% | 0.7548 | 0.7548 | 0.8794 | 0.8794 | 0.8059 | 0.8059 | 0.8105 |
| **Modular DNN** | 0.0% | **0.8795** | **0.8795** | NA | NA | NA | NA | NA |
| FSS Mod. approach | 0.0% | 0.7525 | 0.7525 | NA | NA | NA | NA | NA |
| Stacking | 0.0% | NA | NA | 0.7301 | 0.7301 | 0.8711 | 0.8711 | NA |
Note: This table also shows the results when varying the strategy to deal with ‘unknown genes’ (i.e. genes for which the values of all features of a given feature type are unknown, e.g. a gene where all GO term features have missing values). Columns 3, 5 and 7 show the results when classifying all genes (including unknown genes), while columns 4, 6 and 8 show the results when ignoring unknown genes. The last column shows the results for the ‘Naive’ approach. For the AUROC results of our approach for the individual diseases, please consult the Supplementary File ‘auroc_per_disease.xlsx’.
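The two evaluation strategies in the note can be illustrated with a small sketch. It computes AUROC as the fraction of positive/negative pairs ranked correctly, once over all genes and once after dropping unknown genes. The toy genes, scores and the helper names `auroc` and `is_unknown` are assumptions made for illustration; the source does not show how the authors computed AUROC.

```python
def auroc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def is_unknown(features):
    """A gene is 'unknown' when every feature of the feature type is missing."""
    return all(v is None for v in features)

# Toy data: (feature values, true label, predicted score); the third gene is unknown.
genes = [
    ([1, 0, 1], 1, 0.9),
    ([0, 0, 1], 0, 0.2),
    ([None, None, None], 1, 0.5),
    ([0, 1, 0], 0, 0.5),
]

# Strategy 1: evaluate on all genes, including unknown ones.
auroc_all = auroc([y for _, y, _ in genes], [s for _, _, s in genes])

# Strategy 2: ignore unknown genes before evaluating.
known = [(y, s) for f, y, s in genes if not is_unknown(f)]
auroc_known = auroc([y for y, _ in known], [s for _, s in known])
```

In this toy case the unknown gene receives an uninformative score, so including it lowers the AUROC from 1.0 to 0.875. The table shows that on the real data the effect can go in either direction.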
Comparing the Modular DNN approach (AUROC = 0.8795) with the DNN, BT and LR approaches using Bayesian hypothesis testing and the traditional NHST
| Base. | Feature type or approach | AUROC | Bayesian | NHST |
|---|---|---|---|---|
| DNN | GO | 0.8498 | 0.9995, 0.0005, 0.0000 | 0.0051 |
| DNN | PPI | 0.6381 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| DNN | PathDIP | 0.7535 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| DNN | GTex | 0.7507 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| DNN | Con. all feats. | 0.7548 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| DNN | FSS Mod. app. | 0.7525 | 0.9989, 0.0011, 0.0000 | 0.0069 |
| BT | GO | 0.8520 | 0.9980, 0.0020, 0.0000 | 0.0093 |
| BT | PPI | 0.6585 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| BT | PathDIP | 0.8392 | 0.9999, 0.0001, 0.0000 | 0.0051 |
| BT | GTex | 0.7156 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| BT | Con. all feats. | 0.8794 | 0.1086, 0.7685, 0.1229 | 0.8785 |
| BT | Stacking | 0.7301 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| LR | GO | 0.7900 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| LR | PPI | 0.6844 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| LR | PathDIP | 0.7600 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| LR | GTex | 0.7173 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| LR | Con. all feats. | 0.8059 | 1.0000, 0.0000, 0.0000 | 0.0051 |
| LR | Stacking | 0.8711 | 0.3920, 0.5588, 0.0492 | 0.3863 |
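This section does not name the NHST used, so the following is only a plausibility check under stated assumptions. It assumes a two-sided Wilcoxon signed-rank test on 10 paired AUROC values, evaluated with the normal approximation and no continuity or tie correction. The function name `wilcoxon_p_normal_approx` is hypothetical.

```python
import math

def wilcoxon_p_normal_approx(w, n):
    """Two-sided p-value for signed-rank statistic w with n non-zero paired differences.

    Uses the normal approximation without continuity or tie correction.
    """
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

# Smallest possible statistics for n = 10 pairs (assumed sample size).
p_values = {w: wilcoxon_p_normal_approx(w, n=10) for w in (0, 1, 2)}
```

Under these assumptions, statistics of 0, 1 and 2 give p-values of roughly 0.0051, 0.0069 and 0.0093. That would explain why 0.0051 recurs in the table: it is the smallest value such a test can return when all 10 differences favour the same method. The agreement is suggestive, not a confirmation of the authors' procedure.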
List of negatively labelled candidate genes with at least nine positive neighbours annotated with the label ‘Associated with Type 2 Diabetes’ appearing in all 30 runs of the Modular DNN
| Candidate gene: | |
|---|---|
| Found in 30 out of 30 randomized runs. | |
| Avg. prob.: 0.4211/min. prob.: 0.1257/max. prob.: 0.7044 | |
| Times in NN list | Positive neighbouring genes |
| 30 | |
| 30 | |
| 30 | |
| 30 | |
| 30 | |
| 30 | |
| 30 | |
| 30 | |
| 28 | |
| 2 | |
| Times in NN list | Negative neighbouring genes |
| 30 | |
| Candidate gene: | |
| Found in 30 out of 30 randomized runs. | |
| Avg. prob.: 0.4954/min. prob.: 0.1451/max. prob.: 0.8449 | |
| Times in NN list | Positive neighbouring genes |
| 30 | |
| 30 | |
| 30 | |
| 30 | |
| 30 | |
| 30 | |
| 29 | |
| 27 | |
| 26 | |
| 8 | |
| Times in NN list | Negative neighbouring genes |
| 30 | |
Note: The heading of each sub-table gives the name of the candidate gene and its average, minimum and maximum positive class label probabilities across the 30 randomized runs. Below the heading are the candidate gene's positive neighbours and its negative neighbours (if any), together with the number of runs in which each neighbour appeared in the candidate gene's NN list.
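The ‘Times in NN list’ column is a count over runs, which can be sketched as follows. The gene names, the three toy runs and the function name `neighbour_counts` are hypothetical, and how the neighbour lists themselves are obtained is outside the scope of this sketch.

```python
from collections import Counter

def neighbour_counts(nn_lists, labels):
    """Count in how many runs each gene appears in the candidate's NN list.

    nn_lists: one list of neighbour gene names per run.
    labels: gene name -> 1 (positive) or 0 (negative).
    """
    counts = Counter(gene for run in nn_lists for gene in set(run))
    positives = {g: c for g, c in counts.items() if labels[g] == 1}
    negatives = {g: c for g, c in counts.items() if labels[g] == 0}
    return positives, negatives

# Three toy runs for one candidate gene (placeholder names).
runs = [
    ["geneA", "geneB", "geneC"],
    ["geneA", "geneB"],
    ["geneA", "geneC"],
]
labels = {"geneA": 1, "geneB": 1, "geneC": 0}
positive_counts, negative_counts = neighbour_counts(runs, labels)
```

With 30 runs instead of three, a neighbour present in every run would get a count of 30, as in most rows of the table.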
List of the top five negative genes in terms of average positive class label probability across 30 randomized runs of the modular DNN and three disease types
| Candidate gene | Avg. prob. | Min. prob. | Max. prob. |
|---|---|---|---|
| Brain disease (99th percentile of avg. prob.: 0.0942) | | | |
| | 0.5764 | 0.2659 | 0.8631 |
| | 0.4762 | 0.2315 | 0.7186 |
| | 0.4737 | 0.2214 | 0.7014 |
| | 0.4393 | 0.2052 | 0.6542 |
| | 0.4306 | 0.1977 | 0.6367 |
| Neoplasm (99th percentile of avg. prob.: 0.0784) | | | |
| | 0.5128 | 0.1728 | 0.8357 |
| | 0.4228 | 0.1439 | 0.6891 |
| | 0.3610 | 0.1338 | 0.5780 |
| | 0.3507 | 0.1243 | 0.5686 |
| | 0.2856 | 0.1036 | 0.4588 |
| Myocardial infarction (99th percentile of avg. prob.: 0.0808) | | | |
| | 0.5283 | 0.1279 | 0.8312 |
| | 0.4138 | 0.1050 | 0.6374 |
| | 0.3532 | 0.0888 | 0.5545 |
| | 0.3050 | 0.0734 | 0.4603 |
| | 0.2844 | 0.0736 | 0.4284 |
Note: For each class label (disease type), the table shows the candidate genes and their average, minimum and maximum positive class label probabilities across the 30 runs.
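The ranking behind this table can be sketched as a per-gene summary followed by a sort. The gene names, the probabilities and the function name `top_negative_genes` are hypothetical; the sketch assumes one predicted positive-class probability per gene per run for a single disease.

```python
from statistics import mean

def top_negative_genes(probs_by_gene, labels, k=5):
    """Return (gene, avg, min, max) for the k negative genes with the highest average."""
    rows = [
        (gene, mean(probs), min(probs), max(probs))
        for gene, probs in probs_by_gene.items()
        if labels[gene] == 0  # keep only negatively labelled genes
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:k]

# Toy predictions over two runs (placeholder names and values).
probs_by_gene = {
    "geneA": [0.1, 0.3],
    "geneB": [0.6, 0.8],
    "geneC": [0.9, 0.9],
    "geneD": [0.4, 0.5],
}
labels = {"geneA": 0, "geneB": 0, "geneC": 1, "geneD": 0}
top = top_negative_genes(probs_by_gene, labels, k=2)
```

Here `geneC` is excluded because it is already labelled positive, and `geneB` ranks first among the negatives. Negative genes with a high average probability are the candidates for a new disease association.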