Masoud Arabfard, Mina Ohadi, Vahid Rezaei Tabar, Ahmad Delbari, Kaveh Kavousi.
Abstract
BACKGROUND: Machine learning can effectively nominate novel genes for various research purposes in the laboratory. On a genome-wide scale, we implemented multiple databases and algorithms to predict and prioritize the human aging genes (PPHAGE).
Keywords: Genome-wide; Human aging genes; Machine learning; Positive unlabeled learning; Prioritization
Year: 2019 PMID: 31706268 PMCID: PMC6842548 DOI: 10.1186/s12864-019-6140-0
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Datasets used to evaluate reliable negative sample extraction algorithms
| Number of instances | Number of attributes | Data set names |
|---|---|---|
| 756 | 754 | Parkinson’s Disease Classification Data Set |
| 345 | 7 | Liver Disorders Data Set |
| 1024 | 10 | Cloud Data Set |
| 351 | 34 | Ionosphere Data Set |
| 19,020 | 11 | MAGIC Gamma Telescope Data Set |
| 961 | 6 | Mammographic Mass Data Set |
| 569 | 32 | Breast Cancer Wisconsin (Diagnostic) Data Set |
| 208 | 60 | Connectionist Bench (Sonar, Mines vs. Rocks) Data Set |
Performance evaluation of the reliable negative sample extraction algorithms
| Data set | Algorithm | FPR % | FNR % | Precision % | Recall % | F_measure % |
|---|---|---|---|---|---|---|
| Parkinson’s Disease | NB | 37.25 | 4.57 | 95.43 | 89.78 | 92.52 |
| | SPY | 8.70 | 16.11 | 97.42 | 83.89 | 90.15 |
| | Roc-SVM | 6.52 | 15.00 | 98.08 | 85.00 | 91.07 |
| Liver Disorders | NB | 17.65 | 5.71 | 73.33 | 94.29 | 82.50 |
| | SPY | 36.14 | 0 | 40.00 | 100 | 57.14 |
| | Roc-SVM | 31.33 | 5.00 | 42.22 | 95.00 | 58.46 |
| Cloud | NB | 18.88 | 7.93 | 84.83 | 92.07 | 88.30 |
| | SPY | 9.52 | 14.92 | 92.77 | 85.08 | 88.76 |
| | Roc-SVM | 6.32 | 16.51 | 96.72 | 83.49 | 89.62 |
| Ionosphere | NB | 47.62 | 8.33 | 88.51 | 91.67 | 90.06 |
| | SPY | 26.32 | 6.98 | 94.12 | 93.02 | 93.57 |
| | Roc-SVM | 33.33 | 8.89 | 94.25 | 91.11 | 92.66 |
| MAGIC Gamma Telescope | NB | 10.49 | 44.44 | 68.18 | 55.56 | 61.22 |
| | SPY | 17.88 | 36.22 | 53.88 | 63.78 | 58.42 |
| | Roc-SVM | 6.68 | 47.18 | 77.65 | 52.82 | 62.87 |
| Mammographic Mass | NB | 7.25 | 33.72 | 85.07 | 66.28 | 74.51 |
| | SPY | 11.96 | 10.00 | 62.07 | 90.00 | 73.47 |
| | Roc-SVM | 1.95 | 28.57 | 94.34 | 71.43 | 81.30 |
| Breast Cancer Wisconsin | NB | 13.85 | 12.26 | 91.18 | 87.74 | 89.42 |
| | SPY | 9.09 | 10.48 | 94.00 | 89.52 | 91.71 |
| | Roc-SVM | 22.50 | 22.14 | 91.89 | 77.86 | 84.30 |
| Connectionist Bench (Sonar, Mines vs. Rocks) | NB | 13.85 | 12.26 | 91.18 | 87.74 | 89.42 |
| | SPY | 16.67 | 7.69 | 80.00 | 92.31 | 85.71 |
| | Roc-SVM | 22.50 | 22.14 | 91.89 | 77.86 | 84.30 |
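The rates in the table above are the usual confusion-matrix quantities. The following is a minimal sketch assuming the standard textbook definitions; the record does not spell out its formulas, and the function names `confusion_metrics` and `f_measure` are illustrative only.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    return 2 * precision * recall / (precision + recall)


def confusion_metrics(tp, fp, tn, fn):
    """Return standard binary-classification rates as percentages.

    tp, fp, tn, fn are the counts of true positives, false positives,
    true negatives and false negatives.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "FPR": 100 * fp / (fp + tn),   # negatives wrongly called positive
        "FNR": 100 * fn / (fn + tp),   # positives wrongly called negative
        "precision": 100 * precision,
        "recall": 100 * recall,
        "F_measure": 100 * f_measure(precision, recall),
    }
```

As a check against the first row, `f_measure(95.43, 89.78)` evaluates to about 92.52, which agrees with the F_measure reported for NB on the Parkinson’s Disease data.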
Model performance evaluation by Naïve Bayes on the aging data
| | Precision % | Recall % | F measure % | Accuracy % | AUC % |
|---|---|---|---|---|---|
| Train | 80.78 | 76.95 | 78.81 | 78.52 | 83.81 |
| Test | 87.09 | 81.82 | 84.37 | 84.13 | 88.99 |
Fig. 1 ROC curves. ROC analysis was performed to evaluate the performance of the Naïve Bayes model at the training and test steps; the two curves were similar
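The AUC values reported alongside the ROC curves can be read as the probability that a randomly chosen positive gene is scored above a randomly chosen negative one. A small sketch of that pairwise definition follows; it is not the authors' evaluation code, and the name `auc_from_scores` and the example scores are hypothetical.

```python
def auc_from_scores(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores for three positives and two negatives.
example_auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3])
```

In this toy case five of the six pairs are ordered correctly, so `example_auc` is 5/6. The quadratic loop is fine for illustration; a rank-based formulation would be preferable for genome-scale score lists.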
Performance comparison of multiple binary classifiers on the aging data
| | TP rate % | FP rate % | Precision % | Recall % | F measure % | AUC % |
|---|---|---|---|---|---|---|
| SVM | 80 | 21.1 | 82 | 80 | 79.6 | 79.5 |
| libD3C | 85.1 | 15.3 | 85.3 | 85.1 | 85 | 91.9 |
| NB | 81.1 | 19.7 | 82.4 | 81.1 | 80.9 | 86 |
Performance comparison of multiple binary classifiers on the aging data after feature selection
| | TP rate % | FP rate % | Precision % | Recall % | F measure % | AUC % |
|---|---|---|---|---|---|---|
| SVM | 83.5 | 17.1 | 84.2 | 83.5 | 83.4 | 83.2 |
| libD3C | 84.6 | 15.7 | 84.8 | 84.6 | 84.6 | 92.3 |
| NB | 81.9 | 18.5 | 82.1 | 81.9 | 81.9 | 86.8 |
Number of detected seed genes in comparison to the output of tools
| Tools | Rank | Fold1 | Fold2 | Fold3 |
|---|---|---|---|---|
| Endeavour | < 10 | 1 | 0 | 1 |
| | < 50 | 2 | 0 | 2 |
| | < 100 | 4 | 1 | 2 |
| | < 500 | 11 | 12 | 17 |
| | < 1000 | 24 | 25 | 25 |
| ToppGene | < 10 | 2 | 0 | 1 |
| | < 50 | 11 | 0 | 2 |
| | < 100 | 16 | 1 | 2 |
| | < 500 | 44 | 12 | 17 |
| | < 1000 | 62 | 25 | 25 |
| PPHAGE | < 10 | 2 | 2 | 0 |
| | < 50 | 7 | 4 | 5 |
| | < 100 | 12 | 12 | 9 |
| | < 500 | 50 | 35 | 38 |
| | < 1000 | 66 | 61 | 67 |
Average rank of the seed genes in comparison to the output of tools
| | Fold1 | Fold2 | Fold3 |
|---|---|---|---|
| Endeavour | 1851 | 1918 | 1877 |
| ToppGene | 926 | 849 | 1024 |
| PPHAGE | 833 | 919 | 930 |
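The two tables above summarize, for each fold, how many held-out seed genes appear below a given rank cutoff and what their average rank is. A sketch of how such figures could be computed from a tool's ranked output is shown here; the function name `seed_recovery` and the gene identifiers are hypothetical, and the strict `rank < cutoff` comparison is an assumption based on the "< 10" style labels.

```python
def seed_recovery(ranked_genes, held_out_seeds, cutoffs=(10, 50, 100, 500, 1000)):
    """Count held-out seeds below each rank cutoff and average their ranks.

    ranked_genes: gene identifiers ordered from best (rank 1) to worst.
    held_out_seeds: known positive genes withheld from training in this fold.
    """
    rank_of = {gene: i + 1 for i, gene in enumerate(ranked_genes)}
    # Seeds missing from the ranked output are ignored in this sketch.
    ranks = [rank_of[g] for g in held_out_seeds if g in rank_of]
    counts = {c: sum(r < c for r in ranks) for c in cutoffs}
    average_rank = sum(ranks) / len(ranks) if ranks else None
    return counts, average_rank


# Toy example: 20 ranked genes, three of which are held-out seeds.
ranked = ["g%d" % i for i in range(1, 21)]
counts, average_rank = seed_recovery(ranked, ["g3", "g12", "g20"], cutoffs=(5, 15))
```

Here one seed falls below rank 5 and two below rank 15, with an average rank of 35/3. A lower average rank and higher counts at small cutoffs both indicate better prioritization.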
The top 25 human candidate aging genes
| Rank | Gene symbol | Relevance | Reference | Database reference |
|---|---|---|---|---|
| 1 | | Nucleosome Assembly | | |
| 2 | | Parkinson Disease | | BEFREE |
| 3 | | Ribosomal Protein | | |
| 4 | | Alzheimer’s Disease | | BEFREE |
| 5 | | Diabetes Mellitus, Non-Insulin-Dependent; Osteoporosis, Postmenopausal; Colorectal Cancer | | BEFREE |
| 6 | | ATPase Phospholipid Transporting | | |
| 7 | | Serine And Arginine Rich Splicing Factor | | |
| 8 | | | | |
| 9 | | Cardiovascular Diseases; Diabetes Mellitus, Non-Insulin-Dependent; Colorectal Cancer; Atherosclerosis; Parkinson Disease; Alzheimer’s Disease; Arthritis; Heart failure | | CTD_human, RGD, LHGDN, BEFREE, HPO |
| 10 | | Cataract, autosomal recessive congenital 2; Cataract | | UNIPROT, GENOMICS_ENGLAND, HPO, CTD_human |
| 11 | | | | |
| 12 | | Parkinson Disease | | GWASDB, GWASCAT, BEFREE |
| 13 | | | | |
| 14 | | | | |
| 15 | | Diabetes Mellitus, Non-Insulin-Dependent | | BEFREE |
| 16 | | Alzheimer’s Disease; Colorectal Cancer; Osteopetrosis | | BEFREE, GWASDB, GWASCAT |
| 17 | | Coronary heart disease; Colorectal Cancer | | BEFREE, UNIPROT |
| 18 | | | | |
| 19 | | | | |
| 20 | | | | |
| 21 | | Colorectal Cancer | | BEFREE |
| 22 | | | | |
| 23 | | Heart failure; Colorectal Cancer | | BEFREE |
| 24 | | Colorectal Cancer; Hereditary Diffuse Gastric Cancer; Coronary heart disease; Increased gastric cancer | | BEFREE, CTD_human, HPO |
| 25 | | Arthritis; Cataract | | HPO, HPO |
Indicative diseases associated with the candidate aging genes
| Index | Name | P-value | Adjusted P-value | Z-score | Combined score |
|---|---|---|---|---|---|
| 1 | Colorectal cancer | 1.43e-08 | 0.000001256 | −1.94 | 35.07 |
| 2 | Leukemia | 6.71e-07 | 0.00002953 | −1.64 | 23.32 |
| 3 | Breast_cancer | 0.000009246 | 0.0002357 | −1.45 | 16.76 |
| 4 | Diabetes | 0.00002362 | 0.0002986 | −0.92 | 9.85 |
| 5 | Anemia | 0.00002185 | 0.0002986 | −0.9 | 9.68 |
| 6 | Cardiomyopathy | 0.00002757 | 0.0002986 | −0.59 | 6.23 |
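The column layout of the enrichment table above resembles the output of tools such as Enrichr, where the combined score is commonly defined as the natural log of the p-value multiplied by the z-score. The record does not name the tool or the formula, so the sketch below rests on that assumption, and on the first numeric column being the unadjusted p-value; `combined_score` is an illustrative name.

```python
import math


def combined_score(p_value, z_score):
    """Combined enrichment score, assumed here to be ln(p) * z.

    Both factors are negative for an enriched term, so the product
    is positive and grows with stronger enrichment.
    """
    return math.log(p_value) * z_score


# Colorectal cancer row: p = 1.43e-08, z = -1.94 (values as printed).
colorectal = combined_score(1.43e-08, -1.94)
```

With the rounded values printed in the table this gives roughly 35.0, close to the reported 35.07; the small gap is consistent with the z-score having been rounded to two decimals.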
Fig. 2 Significant biological processes associated with the candidate aging genes
Indicative biological pathways associated with the candidate aging genes
| Index | Name | P-value | Adjusted P-value | Z-score | Combined score |
|---|---|---|---|---|---|
| 1 | Pathways in cancer_Homo sapiens_hsa05200 | 4.07e-41 | 1.19e-38 | −2.11 | 196.21 |
| 2 | Proteoglycans in cancer_Homo sapiens_hsa05205 | 1.91e-31 | 2.78e-29 | −1.99 | 140.58 |
| 3 | Epstein-Barr virus infection_Homo sapiens_hsa05169 | 3.24e-30 | 3.15e-28 | −1.9 | 128.92 |
| 4 | Endocytosis_Homo sapiens_hsa04144 | 1.19e-28 | 8.70e-27 | −1.89 | 121.38 |
| 5 | Regulation of actin cytoskeleton_Homo sapiens_hsa04810 | 4.30e-26 | 2.51e-24 | −1.82 | 106.42 |
| 6 | HTLV-I infection_Homo sapiens_hsa05166 | 1.01e-25 | 4.21e-24 | −1.79 | 103.2 |
| 7 | Protein processing in endoplasmic reticulum_Homo sapiens_hsa04141 | 7.55e-26 | 3.68e-24 | −1.69 | 98.04 |
| 8 | Herpes simplex infection_Homo sapiens_hsa05168 | 1.24e-25 | 4.54e-24 | −1.61 | 92.36 |
| 9 | PI3K-Akt signaling pathway_Homo sapiens_hsa04151 | 1.79e-22 | 4.96e-21 | −1.83 | 91.82 |
| 10 | Focal adhesion_Homo sapiens_hsa04510 | 1.12e-22 | 3.63e-21 | −1.72 | 86.98 |
Indicative diseases associated with the reliable negative genes
| Index | Name | P-value | Adjusted P-value | Z-score | Combined score |
|---|---|---|---|---|---|
| 1 | Cardiomyopathy,_dilated | 0.01658 | 0.2321 | −1.69 | 6.93 |
| 2 | Cardiomyopathy | 0.03134 | 0.2416 | −1.61 | 5.57 |
| 3 | Zellweger_syndrome | 0.01588 | 0.2321 | −1.06 | 4.41 |
| 4 | Dystonia | 0.03451 | 0.2416 | −0.37 | 1.25 |
Fig. 3 Significant biological processes associated with the reliable negative genes
Fig. 4 The overall learning scheme based on positive and unlabeled samples: extraction of reliable negative samples (step 1), construction of the binary classifier (step 2), and prediction and prioritization of candidate genes (step 3)
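The three steps in the scheme above can be sketched on toy data. This is a minimal illustration, not the PPHAGE implementation: it assumes a Gaussian Naïve Bayes classifier for both the reliable-negative step and the final model, and the names `fit_gaussian_nb`, `positive_posterior`, `pu_prioritize`, `make_toy_data` and the `negative_fraction` parameter are all hypothetical.

```python
import math
import random


def fit_gaussian_nb(X, y):
    """Estimate log-prior, feature means and variances for each class."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # A small constant keeps variances strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / len(col) + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (math.log(len(rows) / len(X)), means, variances)
    return model


def positive_posterior(model, x):
    """P(label = 1 | x) under the fitted Gaussian Naive Bayes model."""
    log_joint = {}
    for label, (log_prior, means, variances) in model.items():
        log_joint[label] = log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    top = max(log_joint.values())  # subtract the max for numerical stability
    total = sum(math.exp(v - top) for v in log_joint.values())
    return math.exp(log_joint[1] - top) / total


def pu_prioritize(positives, unlabeled, negative_fraction=0.3):
    """Two-step positive-unlabeled learning followed by ranking."""
    # Step 1: treat all unlabeled samples as negative, then keep the ones
    # that look least positive as "reliable negatives".
    first = fit_gaussian_nb(positives + unlabeled,
                            [1] * len(positives) + [0] * len(unlabeled))
    by_score = sorted(range(len(unlabeled)),
                      key=lambda i: positive_posterior(first, unlabeled[i]))
    k = max(1, int(negative_fraction * len(unlabeled)))
    reliable_negatives = [unlabeled[i] for i in by_score[:k]]
    # Step 2: train the binary classifier on positives vs reliable negatives.
    second = fit_gaussian_nb(positives + reliable_negatives,
                             [1] * len(positives) + [0] * k)
    # Step 3: score every unlabeled sample and rank from most to least positive.
    scores = [positive_posterior(second, x) for x in unlabeled]
    ranking = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
    return ranking, scores


def make_toy_data(seed=0):
    """Two well-separated 2-D clusters; the first 10 unlabeled are hidden positives."""
    rng = random.Random(seed)

    def draw(center, n):
        return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(n)]

    return draw(3.0, 40), draw(3.0, 10) + draw(-3.0, 40)
```

Calling `pu_prioritize(*make_toy_data())` returns the unlabeled indices ordered by predicted relevance; on this well-separated toy data the hidden positives (indices 0 to 9) should dominate the top of the ranking. The fraction of unlabeled samples kept as reliable negatives is a tuning choice in this sketch, not a value taken from the record.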
Comparison of the evaluation metrics across data sources
| Data source | Recall | Specificity | Precision | Accuracy | F_Measure |
|---|---|---|---|---|---|
| Literature | 0.58098 | 0.61453 | 0.5888 | 0.5981 | 0.58478 |
| Annotation | 0.77685 | 0.78668 | 0.76645 | 0.78165 | 0.77133 |
| Pathways | 0.73268 | 0.74538 | 0.7204 | 0.73893 | 0.72605 |
| Gene Ontology | 0.79303 | 0.78843 | 0.76315 | 0.78958 | 0.77703 |
| Phenotype | 0.7946 | 0.81968 | 0.8158 | 0.80695 | 0.80488 |
| Intrinsic properties | 0.67963 | 0.77035 | 0.78945 | 0.71835 | 0.72965 |
| Sequence | 0.6901 | 0.72828 | 0.71713 | 0.70885 | 0.70305 |
| Interaction | 0.7378 | 0.7724 | 0.76645 | 0.7543 | 0.75135 |
| Gene expression | 0.75635 | 0.82148 | 0.82235 | 0.7864 | 0.78735 |
| Regulatory | 0.77355 | 0.79203 | 0.77633 | 0.78163 | 0.77393 |
Data sources used in Naïve Bayes classifier for candidate aging genes
| Data source name | Dataset name | Feature details | Web address |
|---|---|---|---|
| Literature | OBO, AgeFactDB | Ageing-related information extracted from the scientific literature, both manually and automatically. | |
| Functional annotation | DAVID | The list of all functional annotations. | |
| Biological pathways | Reactome, KEGG | The list of biological pathways. | |
| Gene Ontology | GO | The Biological Process, Molecular Function, and Cellular Component vocabularies. | |
| Phenotype | HPO, OMIM | The list of all ageing-related phenotypes and associated genes. | |
| Intrinsic properties | Pfam, PDB | The chromosome number, location, gene segment, gene type, etc. | |
| Sequence | RefSeq | The list of all known active sites, binding sites, chains, etc. | |
| Protein-Protein Interaction | HPRD, STRING | The list of genes that have a physical interaction with each of the positive genes. | |
| Gene expression | GEO, HAGR | Ageing-related expression data, including tissue type, overexpression and underexpression, etc. | |
| Regulatory | RegNetwork | The list of all regulatory relationships, such as miRNAs, transcription factors, etc. | |
| Orthologues | CDD, HomoloGene, OrthoDB | The catalog of orthologous protein-coding genes across vertebrates and known conserved domains. | |
Fig. 5 The percentage of variance in principal component analysis