Djordje Gligorijevic, Jelena Stojanovic, Nemanja Djuric, Vladan Radosavljevic, Mihajlo Grbovic, Rob J Kulathinal, Zoran Obradovic.
Abstract
Data-driven phenotype analyses of Electronic Health Record (EHR) data have recently benefited many areas of clinical practice, uncovering new links in the medical sciences that can potentially affect the well-being of millions of patients. In this paper, EHR data are used to discover novel relationships between diseases by studying their comorbidities (co-occurrences in patients). A novel embedding model is designed to extract knowledge from disease comorbidities by learning from a large-scale EHR database comprising more than 35 million inpatient cases spanning nearly a decade, yielding significant improvements in disease phenotyping over current computational approaches. In addition, the proposed methodology is extended to discover novel disease-gene associations by including valuable domain knowledge from genome-wide association studies. The approach is evaluated against a held-out set, where it again produced compelling results. For selected diseases, we further identify candidate gene lists for which disease-gene associations were not studied previously. Our approach thus provides biomedical researchers with new tools to filter genes of interest, reducing costly lab studies.
Year: 2016 PMID: 27578529 PMCID: PMC5006166 DOI: 10.1038/srep32404
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Graphical summary of the approach proposed in this study.
Heterogeneous data obtained from large-scale discharge records and hand-curated disease-gene associations are used to jointly learn meaningful vector representations of disease and gene concepts in a latent vector space, where interactions of diseases and genes are retrieved and discovered.
Figure 2. Graphical representations of the D2D and DAG2D models illustrated on projecting Acute Myocardial Infarction (AMI) diagnoses and AMI-related genes to AMI-associated diagnoses.
Figure 3. Precision@K for D2D model with different dimension D of the embedding space.
Precision and gene overlap for various competing models for the task of phenotype discovery, evaluated using disease-gene associations.
| Data | Avg. precision @1 | Avg. precision @2 | Avg. precision @5 | Avg. precision @10 | Avg. perc. overlapping genes @1 | Avg. perc. overlapping genes @2 | Avg. perc. overlapping genes @5 | Avg. perc. overlapping genes @10 |
|---|---|---|---|---|---|---|---|---|
| D2D | | | | | | | | |
| Modularity (Adjacency) | 0.8575 | 0.8457 | 0.8284 | 0.8130 | 0.5198 | 0.4893 | 0.4508 | 0.4145 |
| Spectral (Adjacency) | 0.7181 | 0.7007 | 0.6795 | 0.6640 | 0.3052 | 0.2779 | 0.2311 | 0.2006 |
| Modularity (Comorbidity) | 0.8493 | 0.8412 | 0.8110 | 0.7865 | 0.5586 | 0.5315 | 0.4681 | 0.4204 |
| Spectral (Comorbidity) | 0.8375 | 0.8316 | 0.8190 | 0.7974 | 0.5288 | 0.4989 | 0.4520 | 0.3964 |
| Comorbidity graph | 0.7268 | 0.7134 | 0.6915 | 0.7068 | 0.1582 | 0.1496 | 0.1465 | 0.1554 |
| Disease co-occurrence | 0.5616 | 0.5516 | 0.5439 | 0.5668 | 0.1448 | 0.1329 | 0.1264 | 0.1261 |
| LDA | 0.5260 | 0.5094 | 0.4913 | 0.4217 | 0.1040 | 0.0989 | 0.0864 | 0.0853 |
| DAG2D | | | | | | | | |
| Modularity (Adjacency) | 0.8711 | 0.8604 | 0.8503 | 0.8389 | 0.5303 | 0.5082 | 0.4706 | 0.4340 |
| Spectral (Adjacency) | 0.9165 | 0.9102 | 0.9020 | 0.8926 | 0.7524 | 0.7430 | 0.7277 | 0.7110 |
| Disease and genes co-occurrence | 0.6978 | 0.6985 | 0.7093 | 0.7071 | 0.1058 | 0.1042 | 0.1018 | 0.0935 |
| LDA | 0.5795 | 0.3874 | 0.3253 | 0.2831 | 0.1136 | 0.0781 | 0.0760 | 0.0652 |
Four nearest disease neighbors for Chronic Kidney Disease Stage I.
| Phenotype disease | Cosine Similarity |
|---|---|
| Chronic kidney disease Stage II (mild) | 0.9361 |
| Chronic kidney disease Stage III (moderate) | 0.8652 |
| Chronic kidney disease Stage IV (severe) | 0.7647 |
| Chronic kidney disease unspecified | 0.6923 |
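
Nearest-neighbor lists like the one above come from ranking all other disease vectors by cosine similarity to a query vector. The following is a minimal sketch of that ranking step only. The three-dimensional vectors, the dictionary `toy_embeddings`, and the helper names are illustrative assumptions and are not taken from the study's learned embeddings or code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, embeddings, k):
    """Return the k concepts most similar to `query`, best first."""
    query_vec = embeddings[query]
    scores = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query  # a concept is not its own neighbor
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]

# Hypothetical toy vectors, chosen by hand for illustration only.
toy_embeddings = {
    "CKD stage I": [1.0, 0.0, 0.0],
    "CKD stage II": [0.9, 0.1, 0.0],
    "CKD stage IV": [0.6, 0.8, 0.0],
    "Sepsis": [0.0, 0.0, 1.0],
}

neighbors = nearest_neighbors("CKD stage I", toy_embeddings, k=2)
for name, score in neighbors:
    print(f"{name}: {score:.4f}")
```

With real embeddings the same ranking would be applied to every diagnosis code in the vocabulary; at that scale a vectorized or approximate nearest-neighbor search would likely replace the plain loop used here.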
Ten nearest disease neighbors of the Multiple Sclerosis phenotype retrieved by the D2D model.
| Multiple Sclerosis |
|---|
| Late effect of spinal cord injury |
| Other causes of myelitis |
| Neuromyelitis optica |
| Acute infective polyneuritis |
| Late effects of intracranial abscess or pyogenic infection |
| Late effects of viral encephalitis |
| Acute (transverse) myelitis NOS |
| Amyotrophic lateral sclerosis |
| Spina bifida without mention of hydrocephalus unspec. region |
| Primary lateral sclerosis |
Ten nearest disease neighbors for the Sepsis and Congestive heart failure phenotypes retrieved by the D2D model.
| Sepsis | Congestive heart failure unspecified |
|---|---|
| Severe sepsis | Other primary cardiomyopathies |
| Septic shock | Atrial fibrillation |
| Intestinal infection due to Clostridium difficile | Other specified forms of chronic ischemic heart disease |
| Candidiasis of other urogenital sites | Atrial flutter |
| Other and unspecified mycoses | Other chronic pulmonary heart diseases |
| Systemic inflammatory response syndrome | Paroxysmal ventricular tachycardia |
| Hyperosmolality and/or hypernatremia | Cardiac pacemaker |
| Pressure ulcer stage III | Aortic valve disorders |
| Proteus infection | Other left bundle branch block |
| Other shock without mention of trauma | Old myocardial infarction |
Precision@K results for gene discovery.
| Model | K=1 | K=2 | K=5 | K=10 |
|---|---|---|---|---|
| DAG2V | 0.6978 | |||
| Modularity | 0.4760 | 0.4874 | 0.4902 | 0.4689 |
| Spectral | 0.7551 | 0.6705 | 0.6387 | |
| LDA | 0.2300 | 0.2570 | 0.3560 | 0.4180 |
| Co-occurrence | 0.4691 | 0.3867 | 0.2998 | 0.2416 |
| Most Frequent | 0.2000 | 0.3467 | 0.4324 | 0.3887 |
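
For readers unfamiliar with the metric, precision@K in its standard information-retrieval form is the fraction of the top K ranked items that are relevant. The sketch below shows that standard definition on made-up data. The gene names, the ranking, and the relevant set are hypothetical, and the study's exact evaluation protocol (for example, how scores are averaged over diseases) may differ.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked items that appear in `relevant`."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Divide by k (not len(top_k)), so a short ranking is penalized.
    return hits / k

# Hypothetical ranked candidate genes for one disease, best first.
ranked_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
# Hypothetical held-out set of known associated genes.
known_genes = {"GENE_A", "GENE_C", "GENE_X"}

for k in (1, 2, 5):
    print(k, precision_at_k(ranked_genes, known_genes, k))
```

Averaging this value over all evaluated diseases gives one number per K, which is the usual way a column of a results table is filled in.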