Fabio Fabris, Aoife Doherty, Daniel Palmer, João Pedro de Magalhães, Alex A Freitas.
Abstract
Motivation: This work uses the Random Forest (RF) classification algorithm to predict whether a gene is over-expressed, under-expressed or has no change in expression with age in the brain. RFs have high predictive power, and RF models can be interpreted using a feature (variable) importance measure. However, current feature importance measures evaluate a feature as a whole (all feature values). We show that, for a popular type of biological data (Gene Ontology-based), usually only one value of a feature is particularly important for classification and the interpretation of the RF model. Hence, we propose a new algorithm for identifying the most important and most informative feature values in an RF model.
Year: 2018 PMID: 29462247 PMCID: PMC6041990 DOI: 10.1093/bioinformatics/bty087
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Example of a Random Tree used to calculate the statistics. In this tree, leaf nodes (where a prediction is made) are represented by squares containing the predicted class, and edges in bold form the relevant rules (a rule is a path from the root to a leaf node). We also show the OOB Hits and Coverages that are relevant to calculating the statistics.
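To make the notion of a rule as a root-to-leaf path concrete, the following is a minimal sketch that enumerates all rules of a small binary tree. The `Node` class, the `extract_rules` function and the GO identifiers `GO:A` and `GO:B` are hypothetical names introduced only for illustration; they are not taken from the paper, and the sketch assumes binary features (GO term annotated or not).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Condition = Tuple[str, int]          # (feature id, feature value 0 or 1)
Rule = Tuple[List[Condition], str]   # (conditions along the path, predicted class)


@dataclass
class Node:
    """Hypothetical tree node: internal nodes test a feature, leaves carry a class."""
    feature: Optional[str] = None     # feature tested at an internal node
    present: Optional["Node"] = None  # child followed when the feature value is 1
    absent: Optional["Node"] = None   # child followed when the feature value is 0
    label: Optional[str] = None       # predicted class, set only at leaf nodes


def extract_rules(node: Node, path: Tuple[Condition, ...] = ()) -> List[Rule]:
    """Return one rule per leaf, each rule being the path from the root to that leaf."""
    if node.label is not None:
        return [(list(path), node.label)]
    rules: List[Rule] = []
    rules += extract_rules(node.present, path + ((node.feature, 1),))
    rules += extract_rules(node.absent, path + ((node.feature, 0),))
    return rules


# Illustrative tree with made-up feature identifiers.
tree = Node(
    feature="GO:A",
    present=Node(feature="GO:B", present=Node(label="O"), absent=Node(label="N")),
    absent=Node(label="N"),
)
rules = extract_rules(tree)
```

Each rule keeps the feature value on every edge, which is what allows a statistic to be attributed to one specific value of a feature rather than to the feature as a whole.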
Random Forest predictive accuracy results (AUROC) with and without under-sampling for the classes ‘Over-expressed (O)’, ‘Under-expressed (U)’ and with ‘No change in expression (N)’ with age in the brain, and the mean AUROC across classes (All) weighted by their number of instances
| Training type | O | U | N | All |
|---|---|---|---|---|
| With under-sampling | 0.758 | 0.676 | 0.707 | 0.708 |
| Without under-sampling | 0.733 | 0.653 | 0.698 | 0.699 |
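The 'All' column is described as the mean AUROC across classes weighted by the number of instances in each class. A minimal sketch of that calculation is given here. The per-class AUROC values are those of the 'With under-sampling' row, but the class sizes in `counts` are hypothetical placeholders (the real class sizes are not given in this section), so the result is not expected to reproduce the reported value.

```python
from typing import Dict


def weighted_mean_auroc(auroc_by_class: Dict[str, float],
                        n_by_class: Dict[str, int]) -> float:
    """Mean of per-class AUROC values, weighted by the number of instances per class."""
    total = sum(n_by_class[c] for c in auroc_by_class)
    if total == 0:
        raise ValueError("class counts must sum to a positive number")
    return sum(auroc_by_class[c] * n_by_class[c] for c in auroc_by_class) / total


# AUROC values from the 'With under-sampling' row of the table.
auroc = {"O": 0.758, "U": 0.676, "N": 0.707}
# ASSUMPTION: illustrative class sizes only, not the counts used in the paper.
counts = {"O": 100, "U": 100, "N": 800}

mean_auroc = weighted_mean_auroc(auroc, counts)
```

Because the weights are instance counts, the largest class dominates the weighted mean.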
Top-ranked GO terms (ranked by rule-based Precision) used to classify genes as ‘over-expressed’ and with ‘no change in expression’ with age in the brain
| Rank | Feature i.d. | Feature name | Rule prec. | Rule hits |
|---|---|---|---|---|
| Top-ranked GO terms predicting class over-expressed with age | ||||
| 1 | GO:2001198 | Regulation of dendritic cell differentiation | 0.70 | 2.90 |
| 2 | GO:0042605 | Peptide antigen binding | 0.49 | 5.80 |
| 3 | GO:0042611 | MHC protein complex | 0.40 | 6.73 |
| 4 | GO:0050431 | Transforming growth factor beta binding | 0.39 | 2.83 |
| 5 | GO:0071294 | Cellular response to zinc ion | 0.36 | 7.97 |
| 6 | GO:0071556 | Integral component of lumenal side of endoplasmic reticulum membrane | 0.36 | 6.45 |
| 7 | GO:0071276 | Cellular response to cadmium ion | 0.33 | 5.07 |
| 8 | GO:0002479 | Antigen proc. and pres. of exogenous peptide antigen via MHC class I, TAP-dependent | 0.32 | 14.57 |
| 9 | GO:0042590 | Antigen processing and presentation of exogenous peptide antigen via MHC class I | 0.30 | 23.97 |
| 10 | GO:0055038 | Recycling endosome membrane | 0.29 | 3.93 |
| 11 | GO:0046686 | Response to cadmium ion | 0.28 | 4.73 |
| 12 | GO:0060333 | Interferon-gamma-mediated signaling pathway | 0.27 | 35.73 |
| 13 | GO:0044548 | S100 protein binding | 0.27 | 0.95 |
| 14 | GO:0071402 | Cellular response to lipoprotein particle stimulus | 0.27 | 0.93 |
| 15 | GO:0030670 | Phagocytic vesicle membrane | 0.26 | 5.07 |
| 16 | GO:0019865 | Immunoglobulin binding | 0.26 | 1.13 |
| 17 | GO:0012507 | ER to Golgi transport vesicle membrane | 0.23 | 10.27 |
| 18 | GO:0030176 | Integral component of endoplasmic reticulum membrane | 0.23 | 5.20 |
| Top-ranked GO terms predicting class no change in expression with age | ||||
| 1 | GO:0004930 | G-protein coupled receptor activity | 1.00 | 4480.70 |
| 2 | GO:0006396 | RNA processing | 1.00 | 2688.77 |
| 3 | GO:0050906 | Detection of stimulus involved in sensory perception | 1.00 | 2388.67 |
| 4 | GO:0051606 | Detection of stimulus | 1.00 | 2287.60 |
| 5 | GO:0050907 | Detection of chemical stimulus involved in sensory perception | 1.00 | 2237.87 |
| 6 | GO:0009593 | Detection of chemical stimulus | 1.00 | 2079.60 |
| 7 | GO:0004984 | Olfactory receptor activity | 1.00 | 1768.10 |
| 8 | GO:0050911 | Detection of chemical stimulus involved in sensory perception of smell | 1.00 | 1624.60 |
| 9 | GO:0005882 | Intermediate filament | 1.00 | 334.77 |
| 10 | GO:0034470 | ncRNA processing | 1.00 | 302.03 |
| 11 | GO:0006397 | mRNA processing | 1.00 | 301.43 |
| 12 | GO:0031424 | Keratinization | 1.00 | 286.03 |
| 13 | GO:0000151 | Ubiquitin ligase complex | 1.00 | 130.80 |
| 14 | GO:0007608 | Sensory perception of smell | 1.00 | 112.77 |
| 15 | GO:0032259 | Methylation | 1.00 | 110.87 |
| 16 | GO:0016072 | rRNA metabolic process | 1.00 | 108.83 |
| 17 | GO:0045095 | Keratin filament | 1.00 | 107.07 |
| 18 | GO:0000375 | RNA splicing, via transesterification reactions | 1.00 | 99.90 |
Note: The columns contain: (1) the feature rank, (2) the feature identifier, (3) the feature name, (4) the mean rule-based Precision and (5) the mean rule-based Hits. Rule-based scores are based on the RF's predictions on the Out-of-Bag datasets, which are not used for building the models. See the main text for definitions of Precision and Hits.
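The exact definitions of Precision and Hits are given in the main text and are not reproduced here. As a rough sketch only, the code below assumes that a rule's Hits is the number of Out-of-Bag instances it classifies correctly, that its Coverage is the number of Out-of-Bag instances it applies to, and that Precision is Hits divided by Coverage, pooled per feature value. These assumptions, the `rule_scores` function and the identifiers `GO:X` and `GO:Y` are illustrative and may differ from the paper's definitions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def rule_scores(rules: Iterable[Tuple[str, int, int]]) -> Dict[str, Tuple[float, int]]:
    """Pool OOB hits and coverage per feature and return (precision, hits).

    Each input item is (feature id, OOB hits, OOB coverage) for one rule.
    ASSUMPTION: precision = pooled hits / pooled coverage; this is a sketch,
    not the definition used in the paper.
    """
    hits: Dict[str, int] = defaultdict(int)
    coverage: Dict[str, int] = defaultdict(int)
    for feature, rule_hits, rule_coverage in rules:
        hits[feature] += rule_hits
        coverage[feature] += rule_coverage
    return {
        f: ((hits[f] / coverage[f]) if coverage[f] else 0.0, hits[f])
        for f in hits
    }


# Made-up rules: (feature id, OOB hits, OOB coverage).
example_rules = [("GO:X", 3, 4), ("GO:X", 1, 4), ("GO:Y", 0, 0)]
scores = rule_scores(example_rules)
```

Reporting Hits alongside Precision matters when reading the table: a high Precision backed by only a few Hits rests on very few Out-of-Bag instances.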
Top-ranked GO terms [ranked by the Intervention in Prediction score (Epifanio, 2017)] used to classify genes as ‘over-expressed’, ‘under-expressed’ and with ‘no change in expression’ with age in the brain
| Rank | Feature i.d. | Feature name | Interv. score |
|---|---|---|---|
| Top-ranked GO terms predicting class over-expressed with age | |||
| 1 | total | Number of GO annotations | 1.22e–02 |
| 2 | GO:0043005 | Neuron projection | 5.61e–03 |
| 3 | GO:0097458 | Neuron part | 5.55e–03 |
| 4 | GO:1903561 | Extracellular vesicle | 5.36e–03 |
| 5 | GO:0070062 | Extracellular exosome | 5.33e–03 |
| 6 | GO:0043230 | Extracellular organelle | 5.01e–03 |
| 7 | GO:0044456 | Synapse part | 4.70e–03 |
| 8 | GO:0002376 | Immune system process | 4.43e–03 |
| 9 | GO:0042995 | Cell projection | 4.25e–03 |
| 10 | GO:0044421 | Extracellular region part | 4.21e–03 |
| 11 | GO:0031982 | Vesicle | 3.77e–03 |
| 12 | GO:0044444 | Cytoplasmic part | 3.58e–03 |
| 13 | GO:0002252 | Immune effector process | 3.45e–03 |
| 14 | GO:0050896 | Response to stimulus | 3.07e–03 |
| 15 | GO:0002682 | Regulation of immune system process | 2.72e–03 |
| 16 | GO:0048731 | System development | 2.56e–03 |
| Top-ranked GO terms predicting class under-expressed with age | |||
| 1 | total | Number of GO annotations | 1.30e–02 |
| 2 | GO:0043005 | Neuron projection | 6.81e–03 |
| 3 | GO:0097458 | Neuron part | 6.51e–03 |
| 4 | GO:0044456 | Synapse part | 5.73e–03 |
| 5 | GO:1903561 | Extracellular vesicle | 5.15e–03 |
| 6 | GO:0070062 | Extracellular exosome | 5.11e–03 |
| 7 | GO:0042995 | Cell projection | 4.82e–03 |
| 8 | GO:0043230 | Extracellular organelle | 4.80e–03 |
| 9 | GO:0044421 | Extracellular region part | 4.27e–03 |
| 10 | GO:0002376 | Immune system process | 4.10e–03 |
| 11 | GO:0031982 | Vesicle | 3.79e–03 |
| 12 | GO:0044444 | Cytoplasmic part | 3.77e–03 |
| 13 | GO:0050896 | Response to stimulus | 3.04e–03 |
| 14 | GO:0048731 | System development | 2.92e–03 |
| 15 | GO:0002252 | Immune effector process | 2.88e–03 |
| 16 | GO:0007399 | Nervous system development | 2.84e–03 |
| Top-ranked GO terms predicting class no change in expression with age | |||
| 1 | total | Number of GO annotations | 1.43e–02 |
| 2 | GO:0097458 | Neuron part | 5.36e–03 |
| 3 | GO:0043005 | Neuron projection | 5.31e–03 |
| 4 | GO:1903561 | Extracellular vesicle | 4.85e–03 |
| 5 | GO:0070062 | Extracellular exosome | 4.77e–03 |
| 6 | GO:0043230 | Extracellular organelle | 4.55e–03 |
| 7 | GO:0044456 | Synapse part | 4.43e–03 |
| 8 | GO:0044421 | Extracellular region part | 4.05e–03 |
| 9 | GO:0042995 | Cell projection | 4.03e–03 |
| 10 | GO:0044444 | Cytoplasmic part | 3.97e–03 |
| 11 | GO:0002376 | Immune system process | 3.91e–03 |
| 12 | GO:0031982 | Vesicle | 3.66e–03 |
| 13 | GO:0050896 | Response to stimulus | 3.21e–03 |
| 14 | GO:0005515 | Protein binding | 2.98e–03 |
| 15 | GO:0048731 | System development | 2.79e–03 |
| 16 | GO:0008150 | Biological_process | 2.75e–03 |
Note: The columns are: (1) the feature's rank, (2) the feature's identifier, (3) the feature's name and (4) the Intervention score. The 'total' feature is the number of GO terms annotated for each gene.