Joverlyn Gaudillo, Jae Joseph Russell Rodriguez, Allen Nazareno, Lei Rigi Baltazar, Julianne Vilela, Rommel Bulalacao, Mario Domingo, Jason Albia.
Abstract
Machine learning (ML) is poised as a transformational approach for discovering hidden biological interactions and improving the prediction and diagnosis of complex diseases. In this work, we integrated ML-based models for feature selection and classification to quantify individual susceptibility to asthma using single nucleotide polymorphisms (SNPs). Random forest (RF) and recursive feature elimination (RFE) algorithms were implemented to identify the SNPs most strongly implicated in asthma. K-nearest neighbor (kNN) and support vector machine (SVM) algorithms were then trained on the identified SNPs to classify samples as asthmatic or non-asthmatic. The feature selection step showed that RF outperformed RFE and that the feature importance score derived from RF was consistently high for a subset of SNPs, indicating the robustness of RF in selecting relevant features associated with asthma. Model comparison showed that the RF-SVM combination obtained the highest performance, with an accuracy, precision, and sensitivity of 62.5%, 65.3%, and 69%, respectively, when compared to the baseline, RF-kNN, and an external MeanDiff-kNN model. Furthermore, the results show that the occurrence of asthma can be predicted with an area under the curve (AUC) of 0.62 and 0.64 for the RF-SVM and RF-kNN models, respectively. This study demonstrates the integration of ML models to augment traditional methods in predicting genetic predisposition to multifactorial diseases such as asthma.
Year: 2019 PMID: 31800601 PMCID: PMC6892549 DOI: 10.1371/journal.pone.0225574
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
List of hyperparameter values.
| Method | Hyperparameter | Range of values |
|---|---|---|
| Random Forest | optimum number of trees in the forest, n_estimators | {20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200} |
| | maximum number of features considered for splitting a node to achieve least uncertainty when creating a tree, max_features | {400, 500, 600} |
| SVM | kernel | {linear, sigmoid, rbf, poly} |
| | regularization parameter, C | {1, 10, 100, 1000} |
| | tolerance | {0.002, 0.1, 0.001, 0.0001} |
| kNN | number of neighbors, k | {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35} |
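To give a sense of the size of each search space, the following minimal sketch enumerates every combination in the ranges listed above. It is an illustration only, not the authors' tuning code: the helper `expand_grid` and the dictionary `GRIDS` are hypothetical names, and the key `tol` is an assumed label for the tolerance parameter, whose symbol is not given in the table.

```python
from itertools import product

# Search spaces copied from the table of hyperparameter values.
# "tol" is an assumed key name; the source does not give the symbol.
GRIDS = {
    "Random Forest": {
        "n_estimators": list(range(20, 201, 10)),  # 20, 30, ..., 200
        "max_features": [400, 500, 600],
    },
    "SVM": {
        "kernel": ["linear", "sigmoid", "rbf", "poly"],
        "C": [1, 10, 100, 1000],
        "tol": [0.002, 0.1, 0.001, 0.0001],
    },
    "kNN": {
        "k": list(range(1, 36, 2)),  # odd values 1, 3, ..., 35
    },
}


def expand_grid(grid):
    """Return every combination of values in `grid` as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


# Number of candidate settings an exhaustive search would evaluate per method.
grid_sizes = {method: len(expand_grid(grid)) for method, grid in GRIDS.items()}
print(grid_sizes)  # {'Random Forest': 57, 'SVM': 64, 'kNN': 18}
```

Under these ranges an exhaustive search evaluates 57 random forest settings, 64 SVM settings, and 18 kNN settings, before accounting for any cross-validation folds.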
Fig 1. Accuracy vs. number of SNPs for SVM (left) and kNN (right).
Comparison of feature selection methods.
| Model | Accuracy (%) |
|---|---|
| RF-SVM | 62.17 |
| RF-kNN | 61.70 |
| RFE-SVM | 50.90 |
| RFE-kNN | 49.68 |
Fig 2. Feature importance score of top 20 SNPs.
Characteristics of top five SNPs selected by RF based on importance score [21].
| SNP ID | Chromosome Location | Gene | Functional Consequence | GeneCard Summary |
|---|---|---|---|---|
| rs7541950 | 1:147903855 | GJA8 | intron variant | GJA8 (Gap Junction Protein Alpha 8) is a Protein Coding gene. Diseases associated with GJA8 include Cataract 1, Multiple Types and Cataract Microcornea Syndrome. Among its related pathways are Development Slit-Robo signaling and Vesicle-mediated transport. Gene Ontology (GO) annotations related to this gene include channel activity and gap junction channel activity |
| rs7541956 | 1:111366426 | | intron variant | RNA Gene, and is affiliated with the ncRNA class |
| rs7542025 | 1:40643890 | RIMS3 | intron variant | RIMS3 (Regulating Synaptic Membrane Exocytosis 3) is a Protein Coding gene. Gene Ontology (GO) annotations related to this gene include ion channel binding |
| rs7542028 | 1:168718584 | DPT | intron variant | DPT (Dermatopontin) is a Protein Coding gene. Diseases associated with DPT include Commensal Bacterial Infectious Disease and Anisometropia |
| rs7542082 | 1:118617643 | NA | NA | NA |
Comparison of the performance of RF-SVM and RF-kNN with that of the baseline SVM and kNN models.
For the baseline SVM and kNN models, the selected SNPs refer to the 19 asthma-associated SNPs.
| Models | Accuracy (%), Entire Dataset | Accuracy (%), Selected SNPs |
|---|---|---|
| Baseline SVM | 52.83 | 51.81 |
| Baseline kNN | 49.69 | 49.09 |
| RF-SVM | 56.30 | 62.17 |
| RF-kNN | 54.70 | 61.70 |
Fig 3. Performance of the three ML models (RF-SVM, RF-kNN, MeanDiff-kNN).
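The accuracy, precision, and sensitivity used to compare these models all follow from the counts in a confusion matrix. The sketch below shows one way to compute them, assuming asthmatic samples are the positive class (label 1). The function names are hypothetical and the labels are toy values chosen for illustration, not study data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and sensitivity (recall) as fractions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Precision: share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Sensitivity: share of true positives that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels (1 = asthmatic, 0 = non-asthmatic); illustrative only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision and sensitivity can diverge sharply from accuracy when the classes are imbalanced, which is one reason to report all three together.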
Optimal hyperparameters determined for the models.
| Method | Hyperparameter values |
|---|---|
| SVM | kernel = rbf |
| | regularization parameter, C = 100 |
| | tolerance, |
| | number of selected SNPs = 310 |
| kNN | number of nearest neighbors, k = 7 |
| | number of selected SNPs = 400 |
Fig 4. ROC curves of RF-SVM (a), RF-kNN (b), and MeanDiff-kNN (c).
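The area under a ROC curve can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes the AUC directly from that pairwise definition. It is a generic illustration rather than the procedure used in the study: `roc_auc` is a hypothetical helper, and the scores are made-up values.

```python
def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy classifier scores (higher = more likely asthmatic); illustrative only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(y_true, scores)
print(round(auc, 3))  # 0.778
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts values such as 0.62 and 0.64 modestly above chance.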