Csaba Kerepesi1, Bálint Daróczy2, Ádám Sturm3,4, Tibor Vellai3,4, András Benczúr2.
Abstract
Ageing has a huge impact on human health and the economy, but its molecular basis (regulation and mechanism) is still poorly understood. To date, more than three hundred genes (almost all of which function as protein-coding genes) have been related to human ageing. Although individual ageing-related genes or small subsets of these genes have been intensively studied, their analysis as a whole has been highly limited. To fill this gap, for each human protein we extracted 21000 protein features from various databases, and using these data as input to state-of-the-art machine learning methods, we classified human proteins as ageing-related or non-ageing-related. We found a simple classification model based on only 36 protein features, such as the "number of ageing-related interaction partners", "response to oxidative stress", "damaged DNA binding", "rhythmic process" and "extracellular region". Predicted values of the model quantify the relevance of a given protein in the regulation or mechanisms of the human ageing process. Furthermore, we identified new candidate proteins having strong computational evidence of their important role in ageing. Some of them, like Cytochrome b-245 light chain (CY24A) and Endoribonuclease ZC3H12A (ZC12A), have no previous ageing-associated annotations.
Year: 2018 PMID: 29511309 PMCID: PMC5840292 DOI: 10.1038/s41598-018-22240-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
A simple model, produced by tree boosting (XGBoost), to classify human proteins as ageing-related or non-ageing-related.
| feature ID | description of the feature | category | score | relative frequency in ageing/non-ageing |
|---|---|---|---|---|
| ageing_n_0 | number of ageing-related neighbours = 0 | Net | −2.896 | 38.8/92.1 |
| ageing_n_1 | number of ageing-related neighbours = 1 | Net | −2.275 | 15.8/5.6 |
| ageing_n_2 | number of ageing-related neighbours = 2 | Net | −1.168 | 15.1/1.4 |
| ageing_n_3_4 | number of ageing-related neighbours = 3,4 | Net | −0.744 | 12.8/0.6 |
| GO:0043567 | regulation of insulin-like growth factor receptor signaling pathway | BP | 1.327 | 2.6/0.1 |
| GO:0006979 | response to oxidative stress | BP | 0.9 | 21.7/1.4 |
| GO:0003684 | damaged DNA binding | MF | 0.837 | 8.6/0.2 |
| GO:0009987 | cellular process | BP | 0.805 | 99.3/70.0 |
| GO:0005576 | extracellular region | CC | 0.636 | 21.7/8.8 |
| GO:0065008 | regulation of biological quality | BP | 0.563 | 60.2/14.9 |
| GO:0051276 | chromosome organization | BP | 0.515 | 14.5/1.6 |
| GO:0032502 | developmental process | BP | 0.497 | 69.4/22.5 |
| GO:0043066 | negative regulation of apoptotic process | BP | 0.474 | 32.9/3.5 |
| GO:0009628 | response to abiotic stimulus | BP | 0.441 | 38.2/4.4 |
| GO:0007169 | transmembrane receptor protein tyrosine kinase signaling pathway | BP | 0.413 | 19.1/2.1 |
| GO:0010332 | response to gamma radiation | BP | 0.411 | 8.6/0.1 |
| GO:0019838 | growth factor binding | MF | 0.405 | 5.3/0.4 |
| GO:0040008 | regulation of growth | BP | 0.398 | 22.0/2.8 |
| GO:0044710 | single-organism metabolic process | BP | 0.388 | 42.1/15.4 |
| GO:0031325 | positive regulation of cellular metabolic process | BP | 0.331 | 64.8/12.8 |
| GO:0050896 | response to stimulus | BP | 0.288 | 77.3/22.8 |
| GO:0031667 | response to nutrient levels | BP | 0.285 | 16.8/1.5 |
| GO:0005515 | protein binding | MF | 0.271 | 75.7/24.4 |
| GO:2000377 | regulation of reactive oxygen species metabolic process | BP | 0.259 | 13.8/0.6 |
| GO:0051716 | cellular response to stimulus | BP | 0.257 | 62.2/11.1 |
| GO:0005654 | nucleoplasm | CC | 0.235 | 49.7/14.1 |
| GO:0080135 | regulation of cellular response to stress | BP | 0.225 | 27.3/2.6 |
| GO:0048511 | rhythmic process | BP | 0.224 | 15.1/1.2 |
| GO:0044427 | chromosomal part | CC | 0.197 | 24.0/3.4 |
| ageing_n_5+ | number of ageing-related neighbours ≥ 5 | Net | 0.192 | 17.4/0.2 |
| GO:0003682 | chromatin binding | MF | 0.171 | 17.1/2.1 |
| GO:0006974 | cellular response to DNA damage stimulus | BP | 0.167 | 27.6/3.1 |
| GO:0097159 | organic cyclic compound binding | MF | 0.166 | 62.8/28.8 |
| GO:0005739 | mitochondrion | CC | 0.16 | 20.4/6.1 |
| GO:0019899 | enzyme binding | MF | 0.128 | 39.8/6.8 |
| GO:0009894 | regulation of catabolic process | BP | 0.125 | 25.7/3.4 |
Features are listed by ID and description. The feature category is “Net” (Network), “MF” (Molecular Function), “CC” (Cellular Component), or “BP” (Biological Process). The table contains only binary (true or false) features. For each protein, the predicted ageing relevance is computed as follows: for each row of the table, we check whether the given feature is true for the protein, and we add up the scores of the rows that are true. The larger the final sum, the more important the protein's predicted role in the human ageing process. For example, suppose that a protein has 3 ageing-related neighbours and its UniProt record contains only two GO terms, “response to oxidative stress” and “regulation of growth”. Then the predicted ageing relevance of that protein is −0.744 + 0.9 + 0.398 = 0.554. Predicted scores produced by this summation method are presented in the “Table1_pred” column of Supplementary Table S1. Scores obtained by summation are not necessarily bounded by 1. The actual output of XGBoost, which we used in the rest of the paper, was normalized to take values in [0, 1]. In fact, we use the average of the normalized predicted values produced by several models (see the Methods). The last column displays the relative frequency of each feature in the ageing-related and the non-ageing-related sets of proteins, a value independent of our particular model.
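The summation rule described in this caption can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `neighbour_feature` and `summed_score` are hypothetical, the `SCORES` dictionary copies just the rows needed for the worked example (the full model has 36 rows), and features absent from the dictionary are assumed to contribute zero.

```python
# Scores copied from a few rows of the model table; the full table has 36 rows.
SCORES = {
    "ageing_n_0": -2.896,
    "ageing_n_1": -2.275,
    "ageing_n_2": -1.168,
    "ageing_n_3_4": -0.744,
    "ageing_n_5+": 0.192,
    "GO:0006979": 0.9,    # response to oxidative stress
    "GO:0040008": 0.398,  # regulation of growth
}


def neighbour_feature(n_ageing_neighbours):
    """Map a neighbour count to the single binary network feature that is true."""
    if n_ageing_neighbours < 0:
        raise ValueError("neighbour count must be non-negative")
    if n_ageing_neighbours <= 2:
        return f"ageing_n_{n_ageing_neighbours}"
    if n_ageing_neighbours <= 4:
        return "ageing_n_3_4"
    return "ageing_n_5+"


def summed_score(n_ageing_neighbours, go_terms, scores=SCORES):
    """Add up the scores of all table rows whose feature is true for the protein.

    Assumption: features missing from `scores` contribute nothing.
    """
    true_features = {neighbour_feature(n_ageing_neighbours)} | set(go_terms)
    return sum(scores.get(feature, 0.0) for feature in true_features)


# Worked example from the caption: 3 ageing-related neighbours and two GO terms.
example = summed_score(3, ["GO:0006979", "GO:0040008"])
print(round(example, 3))  # 0.554
```

Because the network features are mutually exclusive bins, exactly one `ageing_n_*` row is true for any protein, so the sum always includes one network term plus the scores of the matching GO rows.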
Human proteins with the highest predicted relevances in ageing.
| Uniprot ID | recommended name in UniProt | ageing neighbours | “aging” GO | GenAge | average predicted value |
|---|---|---|---|---|---|
| BCL2_HUMAN | Apoptosis regulator Bcl-2 | 4 | yes | yes | 0.981 |
| FOXO1_HUMAN | Forkhead box protein O1 | 4 | no | yes | 0.96 |
| ERCC1_HUMAN | DNA excision repair protein ERCC-1 | 3 | yes | yes | 0.944 |
| PCNA_HUMAN | Proliferating cell nuclear antigen | 4 | no | yes | 0.936 |
| FOXO3_HUMAN | Forkhead box protein O3 | 5 | yes | yes | 0.929 |
| SIR2_HUMAN | NAD-dependent protein deacetylase sirtuin-2 | 2 | no | no | 0.909 |
| PTEN_HUMAN | Phosphatidylinositol 3,4,5-trisphosphate 3-phosphatase and dual-specificity protein phosphatase | 5 | yes | yes | 0.882 |
| APEX1_HUMAN | DNA-(apurinic or apyrimidinic site) lyase | 2 | yes | yes | 0.857 |
| HDAC2_HUMAN | Histone deacetylase 2 | 3 | no | yes | 0.849 |
| MTOR_HUMAN | Serine/threonine-protein kinase mTOR | 3 | yes | yes | 0.832 |
| BECN1_HUMAN | Beclin-1 | 3 | yes | no | 0.827 |
| AKT1_HUMAN | RAC-alpha serine/threonine-protein kinase | 10 | yes | yes | 0.827 |
| KPCD_HUMAN | Protein kinase C delta type | 3 | yes | yes | 0.808 |
| CDK1_HUMAN | Cyclin-dependent kinase 1 | 2 | yes | yes | 0.804 |
| SYUA_HUMAN | Alpha-synuclein | 2 | yes | no | 0.801 |
| P73_HUMAN | Tumor protein p73 | 2 | no | yes | 0.8 |
| PARP1_HUMAN | Poly [ADP-ribose] polymerase 1 | 6 | no | yes | 0.798 |
| PRKDC_HUMAN | DNA-dependent protein kinase catalytic subunit | 4 | no | yes | 0.791 |
| ABL1_HUMAN | Tyrosine-protein kinase ABL1 | 6 | no | yes | 0.782 |
| WRN_HUMAN | Werner syndrome ATP-dependent helicase | 9 | yes | yes | 0.782 |
The 20 highest-scoring proteins across the entire set of human proteins (regardless of whether the protein is included in the GenAge database), sorted by decreasing predicted relevance in ageing (average predicted value). Each row gives the ID of the protein (“Uniprot ID”), a description (“recommended name in UniProt”), the number of ageing-related neighbours of the protein in the protein-protein interaction network (“ageing neighbours”), whether it is annotated with the GO term “aging” (“aging GO”), whether it is included in GenAge (“GenAge”), and the average of 20 predictions by each of three machine learning methods (XGBoost, SVM and LR) using the final feature set selected by XGBoost (“average predicted value”). Average predicted values close to one indicate very strong predicted relevance for the human ageing process. Supplementary Table S1 gives a more detailed list covering all human proteins.
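As a rough sketch of how an "average predicted value" could be formed, the snippet below averages every available prediction for a protein across methods and repeated runs. The data layout, the function name `average_prediction`, the protein labels and all numbers are made up for illustration (two runs per method instead of 20); how the original values were normalized is not shown here.

```python
from statistics import mean


def average_prediction(runs_by_method):
    """Average all predictions per protein across methods and repeated runs.

    runs_by_method maps a method name to a list of runs; each run maps a
    protein ID to a predicted value already normalized to [0, 1].
    """
    collected = {}
    for runs in runs_by_method.values():
        for run in runs:
            for protein, value in run.items():
                collected.setdefault(protein, []).append(value)
    return {protein: mean(values) for protein, values in collected.items()}


# Toy input with invented numbers: three methods, two runs each.
toy_runs = {
    "XGBoost": [{"P_A": 0.9, "P_B": 0.2}, {"P_A": 0.8, "P_B": 0.1}],
    "SVM": [{"P_A": 0.7, "P_B": 0.3}, {"P_A": 0.6, "P_B": 0.2}],
    "LR": [{"P_A": 0.5, "P_B": 0.1}, {"P_A": 0.5, "P_B": 0.3}],
}

averaged = average_prediction(toy_runs)
ranking = sorted(averaged, key=averaged.get, reverse=True)
print(ranking)
```

Sorting by the averaged value in decreasing order gives the kind of ranking shown in the table.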
New candidates of ageing-related human proteins predicted by machine learning.
| Uniprot ID | recommended name | ageing neighbours | ageing GO | GenAge | average predicted value |
|---|---|---|---|---|---|
| SIR2_HUMAN | NAD-dependent protein deacetylase sirtuin-2 | 2 | no | no | 0.909 |
| BECN1_HUMAN | Beclin-1 | 3 | yes | no | 0.827 |
| SYUA_HUMAN | Alpha-synuclein | 2 | yes | no | 0.801 |
| CAV1_HUMAN | Caveolin-1 | 4 | no | no | 0.745 |
| LRRK2_HUMAN | Leucine-rich repeat serine/threonine-protein kinase 2 | 6 | no | no | 0.734 |
| BAD_HUMAN | Bcl2-associated agonist of cell death | 3 | no | no | 0.721 |
| PARK7_HUMAN | Protein DJ-1 | 2 | no | no | 0.711 |
| HS90B_HUMAN | Heat shock protein HSP 90-beta | 8 | no | no | 0.709 |
| SMAD3_HUMAN | Mothers against decapentaplegic homolog 3 | 2 | no | no | 0.662 |
| KDM1A_HUMAN | Lysine-specific histone demethylase 1A | 2 | no | no | 0.66 |
| ERBB4_HUMAN | Receptor tyrosine-protein kinase erbB-4 | 3 | no | no | 0.633 |
| HDAC6_HUMAN | Histone deacetylase 6 | 2 | no | no | 0.606 |
| FACD2_HUMAN | Fanconi anemia group D2 protein | 2 | no | no | 0.585 |
| RARA_HUMAN | Retinoic acid receptor alpha | 5 | no | no | 0.567 |
| XRCC1_HUMAN | DNA repair protein XRCC1 | 4 | no | no | 0.567 |
| CY24A_HUMAN | Cytochrome b-245 light chain | 0 | no | no | 0.562 |
| SRC_HUMAN | Proto-oncogene tyrosine-protein kinase Src | 10 | no | no | 0.562 |
| CBL_HUMAN | E3 ubiquitin-protein ligase CBL | 5 | no | no | 0.561 |
| XBP1_HUMAN | X-box-binding protein 1 | 0 | no | no | 0.551 |
| FYN_HUMAN | Tyrosine-protein kinase Fyn | 3 | no | no | 0.543 |
The 20 highest scored proteins with no ageing-related GenAge annotation, sorted by decreasing predicted relevance in ageing (average predicted value). The columns have the same meanings as in Table 2.
Figure 1. The top 20 new candidates of ageing-related proteins and their known and new ageing-related interaction partners. Blue rectangles represent the new candidates of ageing-related proteins (also listed in Table 2). Yellow rectangles represent the known ageing-related proteins of GenAge. Only the edges between yellow rectangles and blue rectangles and the edges between two blue rectangles are displayed. Nodes without edges are not displayed.
Figure 2. (a) Receiver operating characteristic (ROC) curve of our final averaged prediction (see “avg pred” in Supplementary Table S1). (b) Several evaluation functions calculated for different threshold values. (c) The number of overlapping proteins among GenAge, Aging GO (proteins annotated with the GO term “aging”) and ML prediction (proteins that have predicted values above the threshold 0.24).
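The overlap counts in panel (c) amount to thresholding the averaged prediction and counting proteins per combination of set memberships. A minimal sketch follows, assuming a strict "greater than 0.24" cut-off as the caption's wording suggests; the helper `venn_counts`, the protein labels and the predicted values are invented for illustration.

```python
THRESHOLD = 0.24  # threshold quoted in the caption


def venn_counts(sets_by_name):
    """Count items per exact combination of set memberships (Venn regions)."""
    names = list(sets_by_name)
    universe = set().union(*sets_by_name.values())
    counts = {}
    for item in universe:
        region = tuple(name for name in names if item in sets_by_name[name])
        counts[region] = counts.get(region, 0) + 1
    return counts


# Toy data with invented values, not taken from the paper.
predicted = {"A": 0.9, "B": 0.3, "C": 0.1, "D": 0.5}
genage = {"A", "C"}
aging_go = {"A", "B"}
ml_prediction = {p for p, value in predicted.items() if value > THRESHOLD}

regions = venn_counts(
    {"GenAge": genage, "Aging GO": aging_go, "ML prediction": ml_prediction}
)
for region, count in sorted(regions.items()):
    print(" & ".join(region), count)
```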
Figure 3. Overview of the study methods, showing the main components of our classification pipeline. We utilized four databases (UniProt, Gene Ontology, GenAge, GeneFriends) and, after ID mapping and GO ancestor determination, we extracted several feature sets. We then selected the most important features in several steps, which considerably reduced the dimensionality of the final feature space. Finally, we trained three different classification methods (XGBoost, support vector machine, logistic regression) on the selected features and averaged the predicted values of the three methods.
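The "GO ancestor determination" step can be read as taking the transitive closure of each protein's directly annotated terms over the ontology's parent relation. The sketch below illustrates that idea under stated assumptions: the `parents` map uses placeholder term IDs rather than real GO identifiers, the function names are hypothetical, and the real ontology would be loaded from a GO release file.

```python
def ancestors(term, parents, cache=None):
    """Return all transitive ancestors of `term` in a parent-map DAG."""
    if cache is None:
        cache = {}
    if term in cache:
        return cache[term]
    found = set()
    for parent in parents.get(term, ()):
        found.add(parent)
        found |= ancestors(parent, parents, cache)
    cache[term] = found
    return found


def expand_annotations(direct_terms, parents):
    """Add every ancestor term to a protein's directly annotated terms."""
    expanded = set(direct_terms)
    for term in direct_terms:
        expanded |= ancestors(term, parents)
    return expanded


# Placeholder ontology fragment (hypothetical IDs, not real GO terms).
toy_parents = {
    "T:leaf": ["T:mid1", "T:mid2"],
    "T:mid1": ["T:root"],
    "T:mid2": ["T:root"],
}

print(sorted(expand_annotations({"T:leaf"}, toy_parents)))
```

Each term in the expanded set then becomes one binary feature, which is one way the feature count can grow when ancestors are included.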
Feature selection process driven by performance of XGBoost on different feature sets.
| short description of the feature set | number of features | depth of trees | number of trees | number of predictions | AUC (average) | AUC (std dev) |
|---|---|---|---|---|---|---|
| GO w/o ancestors, with ageing GOs | 16820 | 6 | 20 | 20 | 0.8787 | 0.0061 |
| GO w/o ancestors | 16800 | 6 | 20 | 20 | 0.8729 | 0.0050 |
| GO | 21000 | 6 | 20 | 20 | 0.9086 | 0.0049 |
| GO XGBoost one pass filter | 373 | 6 | 20 | 20 | 0.9187 | 0.0042 |
| GO XGBoost two pass filter | 65 | 6 | 20 | 20 | 0.9219 | 0.0033 |
| GO XGBoost two pass filter, UniNet, CoExp | 79 | 6 | 20 | 20 | 0.9294 | 0.0034 |
| GO XGBoost two pass filter, UniNet | 78 | 6 | 20 | 20 | 0.9293 | 0.0036 |
| GO XGBoost two pass filter, degree | 66 | 6 | 20 | 20 | 0.9283 | 0.0027 |
| GO XGBoost two pass filter, ageing_n | 66 | 6 | 20 | 20 | 0.9314 | 0.0029 |
| GO XGBoost three pass filter, ageing_n | 32 | 1 | 50 | 20 | 0.9322 | 0.0011 |
Performance of different feature sets, listed roughly from weakest to strongest, compared by the classification performance of 20 predictions each. The default settings for Gene Ontology (GO) features are “without ageing GOs but with GO ancestors”; deviations from the default are marked in the description. For each feature set description (row), we list the number of features, the depth and number of trees in the model, and the average and standard deviation of the AUC values generated by 20 predictions of 5-fold cross-validation. “UniNet” means the set of network features (including degree, ageing_n, and the remaining network features); “CoExp” means the co-expression feature.
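The "average" and "std dev" columns summarize AUC values over repeated predictions. The following sketch shows one way to compute an AUC from labels and scores (the pairwise rank formulation, with ties counted as one half) and then summarize several runs. Whether the paper used the sample or population standard deviation is not stated here; the sample version is an assumption, and all numbers are toy values.

```python
from statistics import mean, stdev


def auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as one half. Quadratic time, fine for a small illustration.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def summarize(auc_values):
    """Mean and sample standard deviation (assumed) of repeated AUC values."""
    return mean(auc_values), stdev(auc_values)


# Toy runs with invented scores: same labels, two different predictions.
labels = [1, 1, 0, 0]
run_1 = auc(labels, [0.9, 0.4, 0.5, 0.1])  # one positive ranked below a negative
run_2 = auc(labels, [0.9, 0.8, 0.2, 0.1])  # perfect separation
avg, sd = summarize([run_1, run_2])
print(run_1, run_2, avg)
```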
Performance of various machine learning algorithms on two different feature sets.
| short description of the feature set | name of algorithm | number of features | number of predictions | AUC (average) | AUC (std dev) |
|---|---|---|---|---|---|
| GO, UniNet, CoExp | k-nearest neighbour | 21014 | 20 | 0.5614 | 0.0053 |
| GO, UniNet, CoExp | decision tree | 21014 | 20 | 0.6373 | 0.0113 |
| GO, UniNet, CoExp | naïve Bayes | 21014 | 20 | 0.7258 | 0.0056 |
| GO, UniNet, CoExp | logistic regression | 21014 | 20 | 0.7374 | 0.0538 |
| GO, UniNet, CoExp | support-vector machine | 21014 | 20 | 0.9091 | 0.0022 |
| GO, UniNet, CoExp | XGBoost | 21014 | 20 | 0.9201 | 0.0024 |
| Frequent GOs, UniNet, CoExp | k-nearest neighbour | 310 | 20 | 0.5857 | 0.0082 |
| Frequent GOs, UniNet, CoExp | decision tree | 310 | 20 | 0.6191 | 0.0095 |
| Frequent GOs, UniNet, CoExp | naïve Bayes | 310 | 20 | 0.7991 | 0.0025 |
| Frequent GOs, UniNet, CoExp | logistic regression | 310 | 20 | 0.8036 | 0.0343 |
| Frequent GOs, UniNet, CoExp | support-vector machine | 310 | 20 | 0.8739 | 0.0109 |
| Frequent GOs, UniNet, CoExp | XGBoost | 310 | 20 | 0.9088 | 0.0041 |
“GO, UniNet, CoExp” means the feature set containing all GO features without ageing GOs but with GO ancestors, the network features and the co-expression feature. “Frequent GOs, UniNet, CoExp” means the feature set containing only those GO features that occur in at least 100 proteins (selected from the above-mentioned feature set). For each row, we list the feature set description, the name of the algorithm, the number of features, the number of predictions, and the average and standard deviation of the 20 AUC values, each generated by one prediction with 5-fold cross-validation.
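The "Frequent GOs" filter keeps only GO features annotated to at least 100 proteins. A minimal sketch of such a frequency filter is given below; the function name `frequent_features`, the placeholder term IDs and the toy annotations are hypothetical, and the toy call lowers the cut-off to 2 so the effect is visible on three proteins.

```python
from collections import Counter


def frequent_features(annotations, min_proteins=100):
    """Keep the terms that are annotated to at least `min_proteins` proteins.

    annotations maps a protein ID to an iterable of its GO terms.
    """
    counts = Counter()
    for terms in annotations.values():
        counts.update(set(terms))  # count each term once per protein
    return {term for term, count in counts.items() if count >= min_proteins}


# Toy annotations with placeholder IDs (not real proteins or GO terms).
toy_annotations = {
    "P1": {"T:a", "T:b"},
    "P2": {"T:a"},
    "P3": {"T:a", "T:c"},
}

kept = frequent_features(toy_annotations, min_proteins=2)
print(sorted(kept))
```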