Jinmeng Jia1, Ruiyuan Wang1, Zhongxin An1, Yongli Guo2, Xi Ni2, Tieliu Shi1,3.
Abstract
DNA sequencing has enabled the discovery of the genetic causes of a considerable number of diseases, paving the way for new disease diagnostics. However, because clinical samples and records are scarce, the molecular causes of rare diseases are often hard to identify, which significantly limits the number of rare Mendelian diseases diagnosed through sequencing technologies. Clinical phenotype information therefore becomes a major resource for diagnosing rare diseases. In this article, we adopted both a phenotypic similarity method and a machine learning method to build four diagnostic models to support rare disease diagnosis. All the diagnostic models were validated using real medical records from RAMEDIS. Each model provides a list of the top 10 candidate diseases as its prediction outcome. The results showed that all models had high diagnostic precision (≥98%), with the highest recall reaching 95%, and that the models based on machine learning performed best. To promote effective rare disease diagnosis in clinical practice, we developed the phenotype-based Rare Disease Auxiliary Diagnosis system (RDAD), which assists clinicians in diagnosing rare diseases with these four diagnostic models. The system is freely accessible at http://www.unimd.org/RDAD/.
Keywords: diagnostic model; machine learning; phenotype; rare disease; web-based tools
Year: 2018 PMID: 30564269 PMCID: PMC6288202 DOI: 10.3389/fgene.2018.00587
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. The workflow of RDAD. HPO, Human Phenotype Ontology; OMIM, Online Mendelian Inheritance in Man; PGAS, Phenotype-Gene Association based rare disease similarity model; PICS, Phenotypic TF-IDF-Hierarchy information content based rare disease similarity model; CPML, Curated feature Phenotype spatial vector based rare disease Machine Learning prediction model; APML, Curated and text mined feature phenotype spatial vector based rare disease Machine Learning prediction model.
The Four Diagnostic Models Contained in the RDAD System.
| Data Source / Classifier | PICS | PGAS | CPML | APML |
| HPO Phenotypes | √ | √ | √ | √ |
| eRAM Curated Genes | | √ | | |
| eRAM Text Mined Phenotypes | | | | √ |
| Disease Similarity Classifiers | √ | √ | | |
| Machine Learning Classifiers | | | √ | √ |
The Test Data Set for the Four Diagnostic Models.
| Disease (MIM number) | Number of records |
| PHENYLKETONURIA (MIM 261600) | 157 |
| CONGENITAL DISORDER OF GLYCOSYLATION, TYPE Ia (MIM 212065) | 27 |
| MAPLE SYRUP URINE DISEASE (MIM 248600) | 21 |
| PROPIONIC ACIDEMIA (MIM 606054) | 16 |
| CANAVAN DISEASE (MIM 271900) | 15 |
| SUCCINIC SEMIALDEHYDE DEHYDROGENASE DEFICIENCY (MIM 271980) | 10 |
| ALKAPTONURIA (MIM 203500) | 10 |
| ARGININOSUCCINIC ACIDURIA (MIM 207900) | 9 |
| ISOVALERIC ACIDEMIA (MIM 243500) | 7 |
| CYSTINURIA (MIM 220100) | 5 |
| CITRULLINEMIA, TYPE II, NEONATAL-ONSET (MIM 605814) | 5 |
| WILSON DISEASE (MIM 277900) | 4 |
| HOLOCARBOXYLASE SYNTHETASE DEFICIENCY (MIM 253270) | 4 |
| FANCONI-BICKEL SYNDROME (MIM 227810) | 2 |
| ALPHA-METHYLACETOACETIC ACIDURIA (MIM 203750) | 2 |
| TYROSINE TRANSAMINASE DEFICIENCY (MIM 276600) | 2 |
| HYPERINSULINEMIC HYPOGLYCEMIA, FAMILIAL, 2 (MIM 601820) | 2 |
| HAWKINSINURIA (MIM 140350) | 2 |
| OSTEOGENESIS IMPERFECTA, TYPE I (MIM 166200) | 1 |
| GLYCOGEN STORAGE DISEASE VI (MIM 232700) | 1 |
| N-ACETYLGLUTAMATE SYNTHASE DEFICIENCY (MIM 237310) | 1 |
| REFSUM DISEASE (MIM 266500) | 1 |
| KRABBE DISEASE (MIM 245200) | 1 |
| LEIGH SYNDROME (MIM 256000) | 1 |
| GLYCOGEN STORAGE DISEASE Ib (MIM 232220) | 1 |
| PYRUVATE CARBOXYLASE DEFICIENCY (MIM 266150) | 1 |
| PEARSON MARROW-PANCREAS SYNDROME (MIM 557000) | 1 |
The Training Data Sets for the Four Diagnostic Models.
| Data Set | Model | Content | Count |
| Data Set I | PICS | Rare Diseases | 4,498 |
| | | Curated Phenotypes | 5,990 |
| | | D-P Associations | 57,346 |
| Data Set II | PGAS | Rare Diseases | 4,498 |
| | | Curated Phenotypes | 5,990 |
| | | D-P Associations | 57,346 |
| | | Curated Genes | 3,682 |
| | | P-G Associations | 419,597 |
| Data Set III | CPML | Rare Diseases | 4,498 |
| | | Curated Phenotypes | 5,990 |
| | | D-P Associations | 57,346 |
| | | Synthetic Patients | 44,980 |
| Data Set IV | APML | Rare Diseases | 4,498 |
| | | All Phenotypes | 6,453 |
| | | D-P Associations | 72,404 |
| | | Synthetic Patients | 44,980 |
D-P Associations, Disease-Phenotype association pairs; P-G Associations, Phenotype-Gene association pairs.
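The table lists 44,980 synthetic patients for 4,498 rare diseases, which works out to 10 per disease. How those patients were generated is not described here, so the following is only a minimal sketch under stated assumptions: each synthetic patient is a random subset of a disease's annotated phenotypes, encoded as a binary vector over the phenotype vocabulary. The function `make_synthetic_patients`, the `keep_fraction` parameter, and the toy labels (`P1`, `disease_A`, and so on) are hypothetical and are not taken from RDAD.

```python
import random


def make_synthetic_patients(disease_phenotypes, per_disease=10,
                            keep_fraction=0.7, seed=0):
    """Build binary phenotype vectors for simulated patients.

    ASSUMPTION: sampling a fixed fraction of each disease's phenotypes is an
    illustrative choice, not the procedure used to build Data Sets III and IV.
    """
    rng = random.Random(seed)
    # Fixed, sorted vocabulary so that vector positions are reproducible.
    vocabulary = sorted({p for phenos in disease_phenotypes.values()
                         for p in phenos})
    index = {p: i for i, p in enumerate(vocabulary)}

    vectors, labels = [], []
    for disease in sorted(disease_phenotypes):
        phenos = sorted(disease_phenotypes[disease])
        # Keep at least one phenotype, but never more than are available.
        n_keep = min(len(phenos), max(1, round(keep_fraction * len(phenos))))
        for _ in range(per_disease):
            vector = [0] * len(vocabulary)
            for p in rng.sample(phenos, n_keep):
                vector[index[p]] = 1
            vectors.append(vector)
            labels.append(disease)
    return vocabulary, vectors, labels


# Toy input: two hypothetical diseases with hypothetical phenotype labels.
toy = {"disease_A": {"P1", "P2", "P3"}, "disease_B": {"P2", "P4"}}
vocabulary, vectors, labels = make_synthetic_patients(toy)
print(len(vectors), "synthetic patients over", len(vocabulary), "phenotypes")
```

Vectors and labels in this form could be passed to any of the machine learning classifiers listed for CPML and APML; the fixed seed only makes the toy example reproducible.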
The Four Diagnostic Models with Their Corresponding Classifiers.
| Model | Training Data | Classifier | Integration Method |
| PICS | Data Set I | Cosine Similarity | Bayesian Averaging Algorithm |
| | | Tanimoto Coefficient | |
| | | Ψi Score | |
| | | MICA | |
| PGAS | Data Set II | Cosine Similarity | Bayesian Averaging Algorithm |
| | | Tanimoto Coefficient | |
| CPML | Data Set III | Logistic Regression | Bayesian Averaging Algorithm |
| | | K-Nearest Neighbor | |
| | | Random Forest | |
| | | Extra Trees | |
| | | Naive Bayes | |
| | | Deep Neural Network | |
| APML | Data Set IV | Logistic Regression | Bayesian Averaging Algorithm |
| | | K-Nearest Neighbor | |
| | | Random Forest | |
| | | Extra Trees | |
| | | Naive Bayes | |
| | | Deep Neural Network | |
MICA, Most Informative Common Ancestor.
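Two of the similarity measures named in the table, cosine similarity and the Tanimoto coefficient, can be illustrated on toy data. The following is a minimal sketch, not the RDAD implementation: it treats a patient and each disease as a set of phenotype terms (equivalently, a binary vector), scores every disease, and returns the top-ranked candidates. The helper `rank_diseases`, the phenotype labels, and the disease names are hypothetical. The Ψi score, MICA, and the Bayesian averaging step are omitted because their definitions are not given here.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two phenotype sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def tanimoto_coefficient(a, b):
    """Tanimoto coefficient: shared terms divided by the union of terms."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_diseases(patient, disease_phenotypes, similarity, top_k=10):
    """Return up to top_k (disease, score) pairs, best score first."""
    scored = [(name, similarity(patient, phenos))
              for name, phenos in disease_phenotypes.items()]
    # Sort by descending score; ties are broken alphabetically by name.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Toy data with hypothetical labels (not real HPO or OMIM entries).
toy_diseases = {
    "disease_A": {"P1", "P2", "P3"},
    "disease_B": {"P2", "P4"},
    "disease_C": {"P5"},
}
patient = {"P1", "P2"}

for name, score in rank_diseases(patient, toy_diseases, cosine_similarity):
    print(f"{name}: {score:.3f}")
```

Passing `tanimoto_coefficient` instead of `cosine_similarity` changes only the scoring function, which is why the ranking step takes the similarity measure as a parameter.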
Figure 2. The precision, recall, and F1-score of the different models. (A) The top 1 diagnostic performance. (B) The top 10 diagnostic performance. APML, the curated and text mined feature phenotype spatial vector based rare disease machine learning prediction model; CPML, the curated feature phenotype spatial vector based rare disease machine learning prediction model; PGAS, the phenotype-gene association based rare disease similarity model; PICS, the phenotypic TF-IDF-Hierarchy information content based rare disease similarity model.
Figure 3. The precision, recall, and F1-score of the models with different numbers of phenotypes submitted. (A) The top 1 diagnostic performance of the PICS model. (B) The top 10 diagnostic performance of the CPML model.
Figure 4. Confusion matrix of the top 10 candidate rare diseases for the CPML model. The y-axis shows the disease names in the records; the x-axis shows the candidate disease names provided by the diagnostic model.
Figure 5. The ranking distribution of the models. The y-axis shows the percentage of disease rankings; the x-axis shows the diagnostic models.
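The top 1 and top 10 evaluations and the ranking distribution all depend on where the recorded diagnosis falls in a model's candidate list. The sketch below shows one common way to compute that rank and a top-k hit rate. It is an assumption made for illustration: the exact definitions of precision, recall, and F1-score behind the figures are not given here, and the names `true_disease_rank` and `top_k_hit_rate` are hypothetical.

```python
def true_disease_rank(candidates, true_disease):
    """Return the 1-based rank of the recorded diagnosis, or None if absent."""
    for position, name in enumerate(candidates, start=1):
        if name == true_disease:
            return position
    return None


def top_k_hit_rate(records, k):
    """Fraction of records whose diagnosis is among the first k candidates.

    `records` is a list of (ranked_candidates, true_disease) pairs.
    """
    if not records:
        return 0.0
    hits = sum(1 for candidates, truth in records if truth in candidates[:k])
    return hits / len(records)


# Toy predictions with hypothetical disease labels.
records = [
    (["A", "B", "C"], "A"),  # recorded diagnosis ranked first
    (["B", "A", "C"], "A"),  # recorded diagnosis ranked second
    (["B", "C", "D"], "A"),  # recorded diagnosis missing from the list
]

print("top-1 hit rate:", top_k_hit_rate(records, 1))
print("top-2 hit rate:", top_k_hit_rate(records, 2))
print("ranks:", [true_disease_rank(c, t) for c, t in records])
```

Tallying the ranks per model would give a distribution of the kind summarized in Figure 5.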