Josephine Yates, Alba Gutiérrez-Sacristán, Vianney Jouhet, Kimberly LeBlanc, Cecilia Esteves, Thomas N DeSain, Nick Benik, Jason Stedman, Nathan Palmer, Guillaume Mellon, Isaac Kohane, Paul Avillach.
Abstract
OBJECTIVE: When studying any specific rare disease, the heterogeneity and scarcity of affected individuals have historically hindered investigators from discerning what to focus on to understand and diagnose the disease. New nongenomic methodologies must be developed that identify similarities in seemingly dissimilar conditions.
Keywords: cluster analysis; rare diseases; supervised machine learning; undiagnosed diseases; unsupervised machine learning
Year: 2021 PMID: 34009343 PMCID: PMC8324228 DOI: 10.1093/jamia/ocab050
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. (A) Workflow for the clustering and per-cluster phenotype enrichment analysis of the Undiagnosed Diseases Network. (B) Workflow for training the support vector classifier (SVC) and assigning new patients. (A) (1) Patients were represented as binary vectors (1 if the phenotype was present, 0 if absent). (2) All patient vectors were aggregated into one matrix representing the whole network. (3) The Jaccard similarity was computed for every pair of patients and stored in a similarity matrix; similarities ranged from 0 to 1. (4) A network of patients was created in which nodes represented patients of the Undiagnosed Diseases Network and edge weights were proportional to the similarity between the 2 nodes they linked. Patients with a similarity score of 0 were not linked. (5) The Louvain community detection algorithm was run on the network to detect clusters. (6) For each cluster, the 5 phenotypes presented by the highest proportion of patients were extracted and referred to as the list of best phenotypes. (7) The proportion of patients presenting each of the best phenotypes was represented as a heatmap. (B) (1) An SVC was trained on the patients labeled with their cluster number. (2) New patients, represented by their Human Phenotype Ontology annotations, were assigned to clusters with the trained SVC.
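Steps (1) to (4) of panel A can be illustrated with a minimal sketch. It assumes each patient is stored as a set of phenotype labels, which is equivalent to the binary vector described in the caption; the patient identifiers, the phenotype labels, and the helper names `jaccard` and `similarity_edges` are hypothetical and are not taken from the study's code.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two phenotype sets: |A & B| / |A | B|."""
    union = a | b
    if not union:
        # Assumption: two patients without any annotation are treated as dissimilar.
        return 0.0
    return len(a & b) / len(union)


def similarity_edges(patients: dict) -> dict:
    """Weighted edges between patients; pairs with similarity 0 stay unlinked."""
    edges = {}
    for p, q in combinations(sorted(patients), 2):
        s = jaccard(patients[p], patients[q])
        if s > 0:
            edges[(p, q)] = s
    return edges


# Hypothetical patients and phenotype labels, for illustration only.
patients = {
    "P1": {"seizures", "hypotonia", "ataxia"},
    "P2": {"seizures", "hypotonia"},
    "P3": {"joint pain"},
}
edges = similarity_edges(patients)
print(edges)
```

The resulting weighted edge list is the kind of input a community detection routine such as Louvain would take in step (5); that step and the SVC in panel B would normally rely on a graph or machine learning library and are not sketched here.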
Analysis of the UDN database as of May 6, 2019, depicting age, race, ethnicity, primary symptoms, and clinical sites of evaluation
| Attribute | Subcategory | Adult Diagnosed | Adult Undiagnosed | Pediatric Diagnosed | Pediatric Undiagnosed | All | P value (Mann-Whitney) |
|---|---|---|---|---|---|---|---|
| Female-to-male ratio | — | 22:23 | 102:85 | 113:81 | 295:321 | 532:510 | (Fisher) Adult: .51; Pediatric: .013 |
| Age | At symptom onset, y | 36 (26-44) | 38 (26-50) | 2 (0-1) | 3 (0-3) | 10 (0-14) | Adult: .19; Pediatric: |
| | At UDN evaluation, y | 45 (33-57) | 47 (36-58) | 11 (4-14) | 12 (4-17) | 20 (5-31) | Adult: .24; Pediatric: <.001 |
| Race | White | 36 (16) | 156 (67) | 149 (18) | 498 (61) | 839 (81) | / |
| | Asian | 5 (2) | 7 (3) | 17 (2) | 34 (4) | 63 (6) | |
| | American Indian or Alaska Native | 0 (0) | 0 (0) | 0 (0) | 3 (<1) | 3 (<1) | |
| | Black or African American | 2 (1) | 14 | 11 (1) | 21 (3) | 48 (5) | |
| | Native Hawaiian Pacific Islander | 0 (0) | 0 (0) | 0 (0) | 2 (<1) | 2 (<1) | |
| | Other | 2 (1) | 10 (4) | 17 (2) | 58 (7) | 87 (8) | |
| Ethnicity | Not Hispanic or Latino | 33 (14) | 152 (66) | 132 (16) | 452 (56) | 769 (74) | / |
| | Hispanic or Latino | 1 (<1) | 12 (5) | 40 (5) | 103 (13) | 156 (15) | |
| | Unknown/not reported ethnicity | 11 (5) | 23 (10) | 22 (3) | 61 (8) | 117 (11) | |
| Primary symptom | Neurology | 22 (9) | 92 (40) | 100 (12) | 283 (35) | 497 (49) | Adult: .08; Pediatric: .33 |
| | Musculoskeletal | 7 (3) | 13 (6) | 27 (3) | 81 (10) | 128 (12) | |
| | Allergies and disorders of the immune system | 2 (1) | 17 (7) | 6 (1) | 31 (4) | 56 (5) | |
| | Cardiology and vascular conditions | 5 (2) | 13 (6) | 7 (1) | 20 (2) | 45 (4) | |
| | Gastroenterology | 0 (0) | 2 (1) | 6 (1) | 29 (4) | 37 (4) | |
| | Rheumatology | 2 (1) | 10 (4) | 0 (0) | 22 (3) | 34 (3) | |
| | Endocrinology | 0 (0) | 4 (2) | 6 (1) | 12 (1) | 22 (2) | |
| | Pulmonology | 0 (0) | 4 (2) | 4 (<1) | 11 (1) | 19 (2) | |
| | Hematology | 1 (<1) | 4 (2) | 1 (<1) | 9 (1) | 15 (1) | |
| | Nephrology | 2 (1) | 4 (2) | 3 (<1) | 6 (1) | 15 (1) | |
| | Ophthalmology | 0 (0) | 1 (<1) | 1 (<1) | 10 (1) | 12 (1) | |
| | Dermatology | 3 (1) | 3 (1) | 2 (<1) | 1 (<1) | 9 (1) | |
| | Dentistry and craniofacial | 0 (0) | 0 (0) | 1 (<1) | 6 (1) | 7 (1) | |
| | Psychiatry | 0 (0) | 1 (<1) | 2 (<1) | 3 (<1) | 6 (1) | |
| | Gynecology and reproductive medicine | 0 (0) | 0 (0) | 1 (<1) | 1 (<1) | 2 (<1) | |
| | Infectious diseases | 0 (0) | 1 (<1) | 0 (0) | 0 (0) | 1 (<1) | |
| | Urology | 0 (0) | 0 (0) | 0 (0) | 1 (<1) | 1 (<1) | |
| | Oncology | 0 (0) | 1 (<1) | 0 (0) | 0 (0) | 1 (<1) | |
| | Other | 1 (<1) | 11 (5) | 22 (3) | 71 (9) | 105 (10) | |
| Clinical site of evaluation | Baylor | 5 (2) | 21 (9) | 36 (4) | 103 (13) | 165 (16) | Adult: .5; Pediatric: .26 |
| | Duke | 3 (1) | 5 (2) | 43 (5) | 68 (8) | 119 (11) | |
| | Harvard affiliate | 11 (5) | 18 (8) | 24 (3) | 65 (8) | 118 (11) | |
| | NIH | 6 (3) | 88 (38) | 8 (1) | 135 (17) | 237 (23) | |
| | Stanford | 5 (2) | 32 (14) | 17 (2) | 96 (12) | 150 (14) | |
| | UCLA | 3 (1) | 12 (5) | 38 (5) | 69 (9) | 122 (12) | |
| | Vanderbilt | 12 (5) | 11 (5) | 28 (3) | 79 (10) | 130 (13) | |
| | WUSTL | 0 (0) | 0 (0) | 0 (0) | 1 (<1) | 1 (<1) | |
Values are n, mean (interquartile range), or n (%). Statistical significance was computed using the Mann-Whitney U test (Fisher exact test for female-to-male ratio).
NIH, National Institutes of Health; UCLA, University of California, Los Angeles; UDN, Undiagnosed Diseases Network; WUSTL, Washington University in St Louis.
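The table notes state that the female-to-male ratio was compared with the Fisher exact test. As a rough illustration, the sketch below computes a two-sided Fisher exact p-value for a 2x2 table using only the standard library. The function name `fisher_exact_two_sided` is hypothetical, and the two-sided rule used here (summing all tables no more probable than the observed one) is an assumption, since the source does not say which variant was applied. The counts are the adult female:male values from the table.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Assumed two-sided rule: sum every table no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Adult female:male counts from the table: diagnosed 22:23, undiagnosed 102:85.
p_adult = fisher_exact_two_sided(22, 23, 102, 85)
print(round(p_adult, 2))
```

A statistics library would normally be used for this in practice; the explicit version only makes the hypergeometric reasoning behind the test visible.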
Figure 2. Percentage of patients in the Undiagnosed Diseases Network PIC-SURE database presenting at least 1 symptom from each top-level phenotypic category of the Human Phenotype Ontology. Human Phenotype Ontology terms can be classified into 23 types of top-level phenotypic abnormalities, and a single phenotype may be classified within several categories. Each patient was counted within a category if they presented at least 1 symptom classified in that category (as of May 6, 2019).
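The counting rule in this caption (a patient counts once per category, and one phenotype may fall into several categories) can be sketched as follows. The term-to-category mapping and the patients are hypothetical placeholders, not the real Human Phenotype Ontology hierarchy, and `category_percentages` is an illustrative helper name.

```python
from collections import Counter


def category_percentages(patient_terms: dict, term_categories: dict) -> dict:
    """Percentage of patients with at least 1 term in each top-level category."""
    counts = Counter()
    for terms in patient_terms.values():
        # Collect categories in a set so a patient counts once per category.
        categories = set()
        for term in terms:
            categories.update(term_categories.get(term, ()))
        counts.update(categories)
    n = len(patient_terms)
    return {category: 100 * k / n for category, k in counts.items()}


# Hypothetical mapping: one term may belong to several top-level categories.
term_categories = {
    "term_a": {"category_x"},
    "term_b": {"category_x", "category_y"},
    "term_c": {"category_z"},
}
# Hypothetical patients, for illustration only.
patient_terms = {
    "P1": {"term_a", "term_b"},
    "P2": {"term_c"},
    "P3": {"term_b"},
    "P4": {"term_a"},
}
pct = category_percentages(patient_terms, term_categories)
print(pct)
```

Because categories overlap, the percentages across categories are not expected to sum to 100.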
Analysis of clusters according to the number of included patients, their female-to-male ratio, the average number of HPO terms per patient in the cluster, the odds ratio of being diagnosed, the average age at onset of the disease (years), and the average age at UDN evaluation (years) for pediatric patients (as of May 6, 2019)
| Pediatric | Cluster C1P | Cluster C2P | Cluster C3P | Cluster C4P | Kruskal-Wallis |
|---|---|---|---|---|---|
| Patients per cluster | 218 | 279 | 198 | 103 | — |
| Female-to-male ratio | 12:10 | 8:10 | 10:10 | 12:10 | — |
| Average number of HPO terms per patient | 36.1 (33.3-38.9) | 20.6 (19.2-22.0) | 21.1 (18.9-23.3) | 17.9 (16.0-19.7) | <.001 |
| Odds ratio diagnosed (95% CI) | 1.7 (1.2-2.4) | 1.9 (1.4-2.7) | 0.7 (0.4-1.0) | 1.4 (0.9-2.2) | — |
| Average age at onset (95% CI), y | 0.7 (0.4-1.0) | 0.8 (0.5-1.1) | 5.2 (4.4-6.0) | 4.8 (3.8-5.8) | <.001 |
| Average age at UDN evaluation (95% CI), y | 9.0 (7.7-10.3) | 8.4 (7.5-9.3) | 18.3 (16.4-20.1) | 16.0 (13.5-18.5) | <.001 |
CI: confidence interval; HPO: Human Phenotype Ontology; UDN: Undiagnosed Diseases Network.
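The table reports an odds ratio of being diagnosed with a 95% CI for each cluster, but the source does not state how these were derived. One common approach, shown here purely as an assumption, compares a cluster against all remaining patients and builds a Wald (Woolf) interval on the log odds ratio. The counts below are hypothetical and do not come from the study.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and Wald (Woolf) confidence interval for a 2x2 table.

    a: diagnosed in cluster      b: undiagnosed in cluster
    c: diagnosed outside cluster d: undiagnosed outside cluster
    All four counts must be nonzero for this formula.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
odds_ratio, lower, upper = odds_ratio_ci(20, 30, 40, 110)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

An interval that excludes 1 would suggest that patients in the cluster are more (or less) likely to be diagnosed than the others, which is how rows such as 1.7 (1.2-2.4) are usually read.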
Figure 3. Heatmap of the most representative phenotypes for each cluster in the Undiagnosed Diseases Network for the adult and pediatric networks. The 5 most representative phenotypes for every cluster were extracted, and all of them were concatenated in a list referred to as “best phenotypes.” The heatmap shows the proportion of patients in every cluster presenting these phenotypes: the darker the shade, the higher the proportion of patients presenting the cluster-specific phenotype, ranging from 0% to 50% for adult onset and 0% to 75% for pediatric onset. Cluster sizes are shown next to their names.
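The construction of the heatmap values can be sketched in a few lines. This is an illustrative reading of the caption, assuming each cluster is a list of per-patient phenotype sets; dropping duplicates when concatenating the per-cluster lists and breaking ties alphabetically are both assumptions, and the cluster names, phenotype labels, and helper names are hypothetical.

```python
from collections import Counter


def top_phenotypes(cluster_patients: list, k: int = 5) -> list:
    """The k phenotypes presented by the most patients in one cluster."""
    counts = Counter(term for terms in cluster_patients for term in terms)
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts, key=lambda term: (-counts[term], term))[:k]


def heatmap_matrix(clusters: dict, k: int = 5):
    """Best-phenotype list and, per cluster, the proportion presenting each."""
    best = []
    for name in sorted(clusters):
        for term in top_phenotypes(clusters[name], k):
            if term not in best:  # assumption: keep each phenotype once
                best.append(term)
    matrix = {
        name: [sum(term in terms for terms in patients) / len(patients) for term in best]
        for name, patients in clusters.items()
    }
    return best, matrix


# Hypothetical clusters and phenotype labels, for illustration only.
clusters = {
    "C1": [{"a", "b"}, {"a"}, {"a", "c"}],
    "C2": [{"c"}, {"c", "d"}],
}
best, matrix = heatmap_matrix(clusters, k=1)
print(best, matrix)
```

Each row of `matrix` corresponds to one row of shades in the heatmap, with columns ordered as in `best`.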