Junhua Ye, Shunfang Wang, Xin Yang, Xianjun Tang.
Abstract
BACKGROUND: At present, bioinformatics research on the relationship between aging-related diseases and genes relies mainly on machine learning multi-label models that classify each gene. Most existing methods for predicting pathogenic genes either rely on a specific type of gene feature or directly concatenate multiple features of different dimensions and pass them through a single encoder to predict the final result, which limits the applicability of the algorithm. Possible shortcomings include incomplete coverage of gene features by a single type of omics data, overfitting of a single encoder on low-dimensional datasets, and underfitting on higher-dimensional datasets.
Keywords: Deep neural networks; Gene ontology; Gene prediction; Mashup; Protein–protein interaction
Year: 2021 PMID: 34920719 PMCID: PMC8680025 DOI: 10.1186/s12859-021-04518-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Algorithm frameworks. The binary feature vector consists of four main parts (PPI, GO, KEGG, PathDIP), each corresponding to one data source. Every entry from these sources is a Boolean value: 1 if the gene has the corresponding annotation, and 0 otherwise. The remaining parts (Mashup, GTEx) hold non-Boolean values
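The Boolean encoding described in the caption can be illustrated with a minimal sketch. The function name `encode_binary_features`, the gene IDs, and the term names are all hypothetical placeholders, not identifiers from the paper; the sketch only assumes that each binary data source assigns a set of annotations to each gene.

```python
from typing import Dict, List, Set


def encode_binary_features(
    gene_annotations: Dict[str, Set[str]],
    vocabulary: List[str],
) -> Dict[str, List[int]]:
    """Map each gene to a 0/1 vector over a fixed annotation vocabulary."""
    return {
        gene: [1 if term in terms else 0 for term in vocabulary]
        for gene, terms in gene_annotations.items()
    }


# Hypothetical toy data: gene IDs and term names are placeholders.
annotations = {
    "geneA": {"term_a", "term_c"},
    "geneB": {"term_b"},
}
vocabulary = ["term_a", "term_b", "term_c"]

vectors = encode_binary_features(annotations, vocabulary)
print(vectors["geneA"])  # [1, 0, 1]
print(vectors["geneB"])  # [0, 1, 0]
```

Under this assumption, the vectors from several binary sources would simply be laid side by side, while real-valued sources such as Mashup and GTEx would be appended without binarization.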
Learning hyperparameters for each network model
| Models | Batch-size | Optimizer | Learning-rate | Epoch |
|---|---|---|---|---|
| Encoder1 | 2048 | Adam | 0.0001 | 150 |
| Encoder2 | 3072 | Adam | 0.0001 | 150 |
Data information of each feature set
| Datasets | GO | PPI | PathDIP | GTEx | Mashup | KEGG | Concat. M | Concat. P&G |
|---|---|---|---|---|---|---|---|---|
| Unknown proportion | 4% | 19.4% | 16.8% | 4.1% | 5.3% | 3.7% | 0 | 0 |
| Data dimension | 13,615 | 13,887 | 4790 | 84 | 800 | 318 | 5992 | 32,694 |
| Sample size | 18,418 | 15,469 | 15,956 | 18,597 | 15,709 | 7040 | 19,188 | 19,188 |
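As a quick arithmetic check, the reported dimensions of the two concatenated datasets are consistent with sums of the individual feature-set dimensions. The compositions below are an assumption inferred only from the numbers in the table; the text here does not state which feature sets make up each concatenation.

```python
# Data dimensions as reported in the table.
dims = {
    "GO": 13615,
    "PPI": 13887,
    "PathDIP": 4790,
    "GTEx": 84,
    "Mashup": 800,
    "KEGG": 318,
}
reported = {"Concat. M": 5992, "Concat. P&G": 32694}

# Assumed compositions, inferred from the arithmetic alone.
assumed_parts = {
    "Concat. M": ["PathDIP", "GTEx", "Mashup", "KEGG"],
    "Concat. P&G": ["PPI", "GO", "PathDIP", "GTEx", "KEGG"],
}

totals = {name: sum(dims[p] for p in parts) for name, parts in assumed_parts.items()}
for name, total in totals.items():
    print(name, total, total == reported[name])
```

Both sums match the reported values, which suggests (but does not prove) that Concat. M uses the Mashup embedding in place of the raw PPI and GO vectors.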
AUROC values obtained by 10-fold cross-validation in datasets with unknown genes
| Algorithms/datasets | Encoder1 | GBT | LR | DNN | Naive |
|---|---|---|---|---|---|
| PPI | 0.7451 | 0.6585 | 0.6844 | 0.6381 | |
| GO | 0.8326 | 0.8520 | 0.7900 | 0.6080 | |
| GTEx | 0.7156 | 0.7173 | 0.7535 | 0.5062 | |
| pathDIP | 0.8392 | 0.7600 | 0.7507 | 0.7817 | |
| Mashup | NA | NA | NA | NA | |
| KEGG | NA | NA | NA | NA | |
AUROC values obtained by 10-fold cross-validation in datasets without unknown genes
| Algorithms/datasets | Encoder1 | GBT | LR | DNN | Encoder2 |
|---|---|---|---|---|---|
| PPI | 0.6782 | 0.6889 | 0.6897 | NA | |
| GO | 0.8321 | 0.8268 | 0.7981 | NA | |
| GTEx | 0.7114 | 0.7233 | 0.7476 | NA | |
| pathDIP | 0.8314 | 0.7607 | 0.8051 | NA | |
| Mashup | NA | NA | NA | NA | |
| KEGG | NA | NA | NA | NA | |
| Concat. M | 0.8881 | 0.8688 | 0.8760 | NA | |
| Concat. P&G | 0.7995 | 0.8503 | 0.8777 | | |
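For readers unfamiliar with how a cross-validated AUROC is obtained, the following is a minimal single-label sketch in plain Python. It is not the authors' pipeline: the stratified splitting, the synthetic labels, and the synthetic scores are all assumptions made for illustration, and the scores stand in for held-out predictions that a real run would get from a model trained on the other nine folds.

```python
import random
from statistics import mean
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels: Sequence[int], k: int, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % k].append(i)
    return folds


# Small hand-checkable case: 8 of the 9 positive/negative pairs are ordered correctly.
toy_auc = auroc([1, 1, 0, 0, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.6, 0.2])

# Synthetic labels and scores (placeholders for real held-out predictions).
rng = random.Random(42)
labels = [i % 2 for i in range(200)]
scores = [rng.random() + 0.5 * y for y in labels]

folds = stratified_folds(labels, k=10)
fold_aucs = [
    auroc([labels[i] for i in fold], [scores[i] for i in fold]) for fold in folds
]
print(f"mean AUROC over {len(folds)} folds: {mean(fold_aucs):.4f}")
```

In the multi-label setting of the paper, a computation like this would presumably be repeated per disease label and then averaged.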
Fig. 2 Comparison of ROC curves for each method on the Concat. M and Concat. P&G datasets
Comparison of the MDL approach (AUROC = 0.9153) with the other methods on each dataset, using the t-test and the Mann–Whitney U test
| Algorithms | Feature combination | AUROC | T-test | Mann–Whitney U test (greater) | Mann–Whitney U test (two-sided) | Mann–Whitney U test (less) |
|---|---|---|---|---|---|---|
| Encoder1 | Concat. M | 0.8881 | 0.0045 | 0.0021 | 0.0043 | 0.9980 |
| | Concat. P&G | 0.7955 | 1.721e−09 | 4.428e−08 | 8.857e−08 | 1.0000 |
| Encoder2 | Concat. P&G | 0.9095 | 0.1256 | 0.0261 | 0.0522 | 0.9739 |
| GBT | Concat. M | 0.7301 | 6.778e−09 | 4.523e−08 | 9.046e−08 | 1.0000 |
| | Concat. P&G | 0.8503 | 0.0001 | 4.146e−05 | 8.292e−05 | 1.0000 |
| LR | Concat. M | 0.8760 | 1.704e−06 | 1.461e−07 | 2.922e−07 | 1.0000 |
| | Concat. P&G | 0.8777 | 4.071e−07 | 1.094e−07 | 2.189e−07 | 1.0000 |
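The three Mann–Whitney U alternatives reported in the table (greater, two-sided, less) are related in a simple way, which the sketch below illustrates with the large-sample normal approximation. This is an assumption-laden illustration: it omits tie and continuity corrections, the per-fold AUROC values are invented, and the helper `mann_whitney_u` is hypothetical rather than the routine the authors used. In practice a library function such as `scipy.stats.mannwhitneyu` with its `alternative` parameter would normally be preferred.

```python
from math import sqrt
from statistics import NormalDist
from typing import Dict, Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Dict[str, float]:
    """Mann-Whitney U test via the normal approximation (no tie or continuity correction)."""
    n1, n2 = len(x), len(y)
    # U counts the pairs in which x beats y; ties contribute one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_less = NormalDist().cdf(z)   # H1: x tends to be smaller than y
    p_greater = 1.0 - p_less       # H1: x tends to be larger than y
    p_two_sided = min(1.0, 2.0 * min(p_greater, p_less))
    return {"U": u, "greater": p_greater, "two-sided": p_two_sided, "less": p_less}


# Hypothetical per-fold AUROC values, invented for illustration only.
mdl_folds = [0.92, 0.91, 0.93, 0.90, 0.92, 0.91, 0.92, 0.93, 0.90, 0.91]
baseline_folds = [0.88, 0.87, 0.89, 0.86, 0.88, 0.87, 0.89, 0.88, 0.86, 0.87]

result = mann_whitney_u(mdl_folds, baseline_folds)
for name, value in result.items():
    print(f"{name}: {value:.3g}")
```

The same pattern appears in the reported values: the two-sided p-value is about twice the "greater" p-value, and the "less" p-value is close to one minus it, as expected when the first sample is consistently larger.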
Fig. 3 AUROC values for the 27 diseases predicted by each algorithm
Top five candidate genes for each of three diseases, ranked by frequency of occurrence in 30 repeated experiments, with the predicted probabilities of disease association for these candidates
| Disease | Candidate genes | Average P | Highest P | Lowest P | Average P of remaining genes |
|---|---|---|---|---|---|
| Immune | 3553 | 0.119864 | 0.166426 | 0.062936 | 0.00128 |
| | 1544 (… family 1 subfamily A member 2) | 0.115728 | 0.159259 | 0.061422 | |
| | 4023 | 0.113289 | 0.155988 | 0.06022 | |
| | 1543 (… family 1 subfamily A member 1) | 0.111613 | 0.154343 | 0.059018 | |
| | 4846 | 0.110097 | 0.154304 | 0.057448 | |
| Brain | 1544 (… family 1 subfamily A member 2) | 0.204762 | 0.303633 | 0.111395 | 0.0363 |
| | 1559 (… family 2 subfamily C member 9) | 0.190703 | 0.28241 | 0.104468 | |
| | 1586 (… family 17 subfamily A member 1) | 0.175779 | 0.243321 | 0.104167 | |
| | 338 | 0.174123 | 0.256506 | 0.097296 | |
| | 1557 (… family 2 subfamily C member 19) | 0.173544 | 0.258927 | 0.0953 | |
| Nutrition | 3553 | 0.343483 | 0.496965 | 0.195028 | 0.00677 |
| | 1544 (… family 1 subfamily A member 2) | 0.332656 | 0.478875 | 0.190983 | |
| | 1557 (… family 2 subfamily C member 19) | 0.284008 | 0.414314 | 0.163521 | |
| | 4035 (… related protein 1) | 0.280427 | 0.412398 | 0.159798 | |
| | 2688 | 0.279018 | 0.381114 | 0.171254 | |
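One plausible way to produce a ranking like the one in this table is to count how often each gene appears among the top predictions across repeated runs, then summarize its predicted probabilities. The sketch below assumes exactly that procedure; the function `rank_by_frequency`, its parameters, and the toy gene names and probabilities are hypothetical, and the paper's actual selection rule may differ.

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def rank_by_frequency(
    runs: List[Dict[str, float]], per_run_k: int, top_k: int
) -> List[Tuple[str, int, float, float, float]]:
    """Rank genes by how often they reach the per-run top list.

    Returns (gene, frequency, average P, highest P, lowest P) tuples.
    """
    counts: Counter = Counter()
    probs: Dict[str, List[float]] = defaultdict(list)
    for run in runs:
        for gene, p in run.items():
            probs[gene].append(p)
        # Genes with the highest predicted probability in this repeat.
        counts.update(sorted(run, key=run.get, reverse=True)[:per_run_k])
    # Order by frequency, breaking ties by average probability.
    ranked = sorted(counts, key=lambda g: (-counts[g], -mean(probs[g])))
    return [
        (g, counts[g], mean(probs[g]), max(probs[g]), min(probs[g]))
        for g in ranked[:top_k]
    ]


# Hypothetical predictions from three repeats (the table is based on 30).
runs = [
    {"g1": 0.30, "g2": 0.20, "g3": 0.10, "g4": 0.05},
    {"g1": 0.25, "g2": 0.10, "g3": 0.22, "g4": 0.04},
    {"g1": 0.35, "g2": 0.18, "g3": 0.12, "g4": 0.03},
]
top = rank_by_frequency(runs, per_run_k=2, top_k=2)
for gene, freq, avg_p, high_p, low_p in top:
    print(gene, freq, round(avg_p, 3), high_p, low_p)
```

Here `g1` reaches the top two in all three repeats and `g2` in two of them, so they are ranked first and second.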
Hypothesis test results for whether the 30 recommended genes differ significantly from the subsequent 30 genes, for 27 aging-related diseases
| Disease classification | P-value | Disease classification | P-value |
|---|---|---|---|
| Disease.Brain | 8.484e−09 | Disease.Neoplasm | 2.033e−09 |
| Brain.Alzheimer | 3.474e−10 | Neoplasm.Adenocarcinoma | 1.464e−10 |
| Brain.Multiple.Sclerosis | 5.967e−09 | Neoplasm.Breast | 1.205e−10 |
| Brain.Parkinson | 1.070e−09 | Neoplasm.Colorectal | 2.154e−10 |
| Disease.Heart | 5.072e−09 | Neoplasm.Lung | 7.380e−10 |
| Heart.Arteriosclerosis | 1.776e−10 | Neoplasm.Prostatic | 3.196e−09 |
| Heart.Coronary.Disease | 6.695e−11 | Neoplasm.Stomach | 2.609e−10 |
| Heart.Hypertension | 4.504e−11 | Disease.Nutrition | 7.118e−09 |
| Heart.Coronary.Disease | 6.695e−11 | Nutritional.Diabetes.Type1 | 3.689e−11 |
| Disease.Immune | 6.065e−10 | Nutritional.Diabetes.Type2 | 2.227e−09 |
| Immune.Hypersensitivity | 5.494e−11 | Nutritional.Obesity | 4.975e−11 |
| Disease.Muscle | 2.438e−11 | Disease.Respiratory.Asthma | 7.380e−10 |
| Muscle.Arthritis | 1.464e−10 | Class_Disease | 2.831e−08 |
| Muscle.Osteoporosis | 8.484e−09 | | |
Evidence for the 15 genes recommended as disease-associated
| Gene ID | Genes associated with “Disease.Immune” | Relevant evidence |
|---|---|---|
| 3553 | | Roghieh Safari et al. |
| 1544 | … subfamily A member | Klein et al. |
| 4023 | | |
| 1543 | … subfamily A member 1 | Uno et al. |
| 4846 | | Bogdan et al. |
| Gene ID | Genes associated with “Disease.Brain” | Relevant evidence |
| 1544 | … subfamily A member 2 | Siokas et al. |
| 1559 | … subfamily C member 9 | Sun et al. |
| 1586 | … subfamily A member 1 | Emanuelsson et al. |
| 338 | | Bjelik et al. |
| 1557 | … subfamily C member 19 | Ingelman-Sundberg et al. |
| Gene ID | Genes associated with “Disease.Nutrition” | Relevant evidence |
| 3553 | | Norde et al. |
| 1544 | … subfamily A member 2 | Agúndez et al. |
| 1557 | … subfamily C member 19 | None |
| 4035 | | Masson et al. |
| 2688 | | Thissen et al. |
Fig. 4 Distribution of recommended genes across the 27 diseases. Gray dots represent disease-gene associations in the tag set, and red dots represent the disease-gene associations we predict