Muhammad Asif, Hugo F M C M Martiniano, Astrid M Vicente, Francisco M Couto.
Abstract
Identifying disease genes from a vast amount of genetic data is one of the most challenging tasks in the post-genomic era. Complex diseases also present highly heterogeneous genotypes, which makes biological marker identification difficult. Machine learning methods are widely used to identify these markers, but their performance is highly dependent upon the size and quality of the available data. In this study, we demonstrated that machine learning classifiers trained on gene functional similarities, computed using the Gene Ontology (GO), can improve the identification of genes involved in complex diseases. For this purpose, we developed a supervised machine learning methodology to predict complex disease genes. The proposed pipeline was assessed using Autism Spectrum Disorder (ASD) candidate genes. A quantitative measure of gene functional similarity was obtained by employing different semantic similarity measures. To infer the hidden functional similarities between ASD genes, various types of machine learning classifiers were built on quantitative semantic similarity matrices of ASD and non-ASD genes. The classifiers trained and tested on ASD and non-ASD gene functional similarities outperformed previously reported ASD classifiers. For example, a Random Forest (RF) classifier achieved an AUC of 0.80 for predicting new ASD genes, which was higher than that of the previously reported classifier (0.73). Additionally, this classifier was able to predict 73 novel ASD candidate genes that were enriched for core ASD phenotypes, such as autism and obsessive-compulsive behavior. The predicted genes were also enriched for ASD co-occurring conditions, including Attention Deficit Hyperactivity Disorder (ADHD). We also developed a KNIME workflow implementing the proposed methodology, which allows users to configure and execute it without requiring machine learning or programming skills.
Machine learning is an effective and reliable technique for deciphering ASD mechanisms by identifying novel disease genes, and this study further demonstrated that classifier performance can be improved by incorporating a quantitative measure of gene functional similarities. Source code and the workflow of the proposed methodology are available at https://github.com/Muh-Asif/ASD-genes-prediction.
Year: 2018 PMID: 30532199 PMCID: PMC6287949 DOI: 10.1371/journal.pone.0208626
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Main components of the proposed methodology to predict disease genes.
Functional similarities are computed for a given set of genes. Different machine learning methods are applied to the functional similarity matrices to define rules that discriminate disease genes from non-disease genes. Two evaluation approaches are used: stratified five-fold cross-validation and held-out restricted stratified five-fold cross-validation.
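As a rough illustration of the first step in Fig 1, the sketch below builds a gene-by-gene similarity matrix whose rows could serve as feature vectors for a classifier. It is only a minimal sketch under stated assumptions: Jaccard overlap of annotation sets is used as a simple stand-in and is not one of the GO semantic similarity measures used in the study, and the gene names and term identifiers are hypothetical placeholders.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in similarity between two annotation sets.

    Assumption: this is NOT a GO semantic similarity measure from the
    study; it only illustrates the shape of the computation.
    """
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_matrix(
    annotations: Dict[str, Set[str]],
) -> Tuple[List[str], List[List[float]]]:
    """Return the sorted gene list and a symmetric gene-by-gene matrix."""
    genes = sorted(annotations)
    matrix = [
        [jaccard(annotations[g], annotations[h]) for h in genes]
        for g in genes
    ]
    return genes, matrix


# Hypothetical genes annotated with hypothetical term identifiers.
annotations = {
    "GENE_A": {"term_1", "term_2"},
    "GENE_B": {"term_2", "term_3"},
    "GENE_C": {"term_4"},
}

genes, matrix = similarity_matrix(annotations)
for gene, row in zip(genes, matrix):
    print(gene, [round(value, 2) for value in row])
```

Each row of the matrix describes one gene by its similarity to every other gene, which is the kind of input a supervised classifier can be trained on.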
Fig 2. Classifying ASD database genes into HD genes and LD genes using the available evidence strength.
Fig 3. Graphical representation of the methodology to predict ASD genes.
A. ASD genes with different levels of evidence and non-mental genes were used to implement the proposed methodology. B. Three different semantic similarity measures were used to calculate functional similarities for HD + non-mental genes and for HD + LD + non-mental genes. C. Four different machine learning methods were used to analyze the computed gene functional similarities. Machine learning classifiers were tested using stratified and held-out restricted stratified five-fold cross-validation. In held-out restricted validation, as in the method of Krishnan et al., only HD + non-mental genes were used to test the classifier. In stratified five-fold cross-validation, classifiers were evaluated using all genes in the test set.
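The two evaluation schemes described in panel C can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names, the round-robin fold assignment, and the decision to simply drop LD genes from each test fold (rather than move them into training) are assumptions made for illustration, and the gene identifiers are synthetic.

```python
import random
from typing import Dict, List, Optional, Set, Tuple


def stratified_folds(
    labels: Dict[str, str], k: int = 5, seed: int = 0
) -> List[List[str]]:
    """Split genes into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_label: Dict[str, List[str]] = {}
    for gene, label in sorted(labels.items()):
        by_label.setdefault(label, []).append(gene)
    folds: List[List[str]] = [[] for _ in range(k)]
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        # Deal the genes of each class round-robin across the folds.
        for i, gene in enumerate(members):
            folds[i % k].append(gene)
    return folds


def train_test_splits(
    labels: Dict[str, str],
    k: int = 5,
    restrict_test_to: Optional[Set[str]] = None,
    seed: int = 0,
) -> List[Tuple[List[str], List[str]]]:
    """Yield (train, test) gene lists for each fold.

    With restrict_test_to=None every gene in the fold is tested
    (plain stratified validation). Otherwise only genes whose label is
    in restrict_test_to are kept for testing (held-out restricted).
    """
    folds = stratified_folds(labels, k, seed)
    splits = []
    for i, fold in enumerate(folds):
        train = [g for j, f in enumerate(folds) if j != i for g in f]
        test = [
            g for g in fold
            if restrict_test_to is None or labels[g] in restrict_test_to
        ]
        splits.append((train, test))
    return splits


# Synthetic gene set: 10 HD, 20 LD, and 30 non-mental genes.
labels = {f"HD_{i}": "HD" for i in range(10)}
labels.update({f"LD_{i}": "LD" for i in range(20)})
labels.update({f"NM_{i}": "non-mental" for i in range(30)})

plain = train_test_splits(labels)
held_out = train_test_splits(labels, restrict_test_to={"HD", "non-mental"})
print([len(test) for _, test in plain])
print([len(test) for _, test in held_out])
```

In both modes the training folds contain HD, LD, and non-mental genes; only the composition of the test fold differs.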
Performance (AUC) of classifiers trained and tested on different semantic similarity matrices and validated using stratified and held-out restricted stratified validation.
Performance of the method of Krishnan et al. in terms of AUC: unweighted HD-gene-based classifier = 0.73, weighted HD + LD classifier = 0.74, weighted HD + LD classifier evaluated with HD genes = 0.80–0.89.
| Semantic similarity measure | Classifier | HD genes: fivefold validation | HD + LD genes: fivefold validation | HD + LD genes: held-out fivefold validation |
|---|---|---|---|---|
| | RF | 0.80 | 0.75 | 0.82 |
| | NB | 0.73 | 0.66 | 0.70 |
| | Linear-SVM | 0.28 | 0.59 | 0.62 |
| | Radial-SVM | 0.49 | 0.59 | 0.59 |
| | RF | 0.75 | 0.73 | 0.81 |
| | NB | 0.71 | 0.67 | 0.76 |
| | Linear-SVM | 0.26 | 0.59 | 0.59 |
| | Radial-SVM | 0.54 | 0.59 | 0.59 |
| | RF | 0.77 | 0.74 | 0.82 |
| | NB | 0.70 | 0.64 | 0.75 |
| | Linear-SVM | 0.28 | 0.61 | 0.62 |
| | Radial-SVM | 0.51 | 0.62 | 0.57 |
HD: High confidence Disease genes; LD: Low confidence Disease genes.
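For readers unfamiliar with the AUC values reported above, the following minimal sketch computes AUC as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as one half. The function name and the toy scores are assumptions for illustration only and are not taken from the study.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise (rank-based) AUC; labels are 1 for disease genes, else 0."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ordering
    return wins / (len(pos) * len(neg))


# Toy classifier scores for two positive and two negative genes.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(auc(scores, labels))                    # 3 of 4 pairs ordered correctly
print(auc([1.0 - s for s in scores], labels))  # same scores, ranking reversed
```

An AUC of 0.5 corresponds to random ranking, so values below 0.5, such as the 0.26 to 0.28 reported for Linear-SVM on HD genes, indicate a ranking that is worse than chance.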
Fig 4. The architecture of the automated workflow for predicting disease genes.
The functional similarity layer instantiates step 1 of the proposed methodology, while the classifier layer implements steps 2 and 3 (Fig 1).
Top 16 enriched HPO terms for predicted ASD genes.
| HPO term ID | HPO term | Adjusted p-value |
|---|---|---|
| HP:0000076 | Vesicoureteral reflux | 5.21303E-05 |
| HP:0000717 | Autism | 0.000232356 |
| HP:0000582 | Upslanted palpebral fissure | 0.000232356 |
| HP:0002999 | Patellar dislocation | 0.000232356 |
| HP:0001636 | Tetralogy of Fallot | 0.000650645 |
| HP:0001537 | Umbilical hernia | 0.000798236 |
| HP:0007018 | Attention deficit hyperactivity disorder | 0.000798236 |
| HP:0002120 | Cerebral cortical atrophy | 0.001075744 |
| HP:0000113 | Polycystic kidney dysplasia | 0.00210752 |
| HP:0000670 | Carious teeth | 0.002213532 |
| HP:0010669 | Cheekbone underdevelopment | 0.002678489 |
| HP:0005293 | Venous insufficiency | 0.003171264 |
| HP:0003072 | Hypercalcemia | 0.003377945 |
| HP:0000232 | Everted lower lip vermilion | 0.003377945 |
| HP:0002624 | Venous abnormality | 0.005095958 |
| HP:0000722 | Obsessive-compulsive behavior | 0.005670617 |
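The table above does not state which multiple-testing correction produced the adjusted values. As a hedged illustration only, the sketch below applies the Benjamini-Hochberg procedure, a common choice for enrichment analyses; whether the study used it is an assumption, and the raw p-values in the example are invented for demonstration.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is used here only as an example of a correction
    method; the source does not name the procedure it applied.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented raw p-values for five hypothetical terms.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(raw)
print([round(p, 5) for p in adjusted])
```

Step-up procedures of this kind can assign the same adjusted value to several neighbouring terms, because each adjusted value is capped by the next larger one.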