Juan Zhao1, QiPing Feng2, Patrick Wu1,3, Jeremy L Warner1,4, Joshua C Denny1,4, Wei-Qi Wei1.
Abstract
Genome-wide and phenome-wide association studies are commonly used to identify important relationships between genetic variants and phenotypes. Most studies have treated diseases as independent variables and have suffered from the burden of multiple-testing adjustment caused by the large number of genetic variants and disease phenotypes. In this study, we used topic modeling via non-negative matrix factorization (NMF) to identify associations between disease phenotypes and genetic variants. Topic modeling is an unsupervised machine learning approach that can be used to learn patterns from electronic health record (EHR) data. We chose the single nucleotide polymorphism (SNP) rs10455872 in LPA as the predictor because it has been shown to be associated with increased risk of hyperlipidemia and cardiovascular diseases (CVD). Using data from 12,759 individuals with EHRs and linked DNA samples at Vanderbilt University Medical Center, we trained a topic model using NMF on 1,853 distinct phenotypes and identified six topics. We tested their associations with rs10455872 in LPA. Topics enriched for CVD and hyperlipidemia had positive correlations with rs10455872 (P < 0.001), replicating a previous finding. We also identified a negative correlation between rs10455872 in LPA and a topic enriched for lung cancer (P < 0.001), which was not previously identified via phenome-wide scanning. We were able to replicate the top finding in a separate dataset. Our results demonstrate the applicability of topic modeling in exploring the relationship between genetic variants and clinical diseases.
Year: 2019 PMID: 30759150 PMCID: PMC6374022 DOI: 10.1371/journal.pone.0212112
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Illustration of topic modeling on EHRs using NMF.
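The factorization behind this kind of topic model can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it uses the classic multiplicative-update rule for NMF written directly in NumPy, and the function name `nmf`, the toy matrix `V`, and all parameter values are assumptions made for the example. A patient-by-phenotype matrix `V` is approximated as `W @ H`, where each row of `W` holds one patient's topic scores and each row of `H` holds one topic's phenotype weights.

```python
import numpy as np

def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (patients x phenotypes) as W @ H.

    Uses multiplicative updates that minimise the Frobenius error.
    W: patients x topics (topic scores per patient).
    H: topics x phenotypes (phenotype weights per topic).
    """
    rng = np.random.default_rng(seed)
    n_patients, n_phenotypes = V.shape
    W = rng.random((n_patients, n_topics)) + eps
    H = rng.random((n_topics, n_phenotypes)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: 5 patients x 4 phenotype counts (not from the study).
V = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
    [2, 2, 1, 1],
], dtype=float)

W, H = nmf(V, n_topics=2)
error = np.linalg.norm(V - W @ H)
print("patient-topic scores:\n", W.round(2))
print("topic-phenotype weights:\n", H.round(2))
print("reconstruction error:", round(float(error), 3))
```

In the study's setting the matrix would have 12,759 rows and 1,853 phecode columns, with six topics. A library implementation would normally be preferred over this hand-written loop at that scale.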
Fig 2. Word clouds for six topics.
The size of the words (phecode) in each cloud indicates the weights of the phenotypes on the topic. Phenotypes with larger-sized words have greater influence on the topic compared to phenotypes with smaller-sized words. For each word cloud, we listed the top 60 words.
Fig 3. Topic distribution in the cohort.
To visualize the prevalence of each topic in the cohort, we assigned an individual to the topic with the maximum score.
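The assignment rule described above amounts to an argmax over each individual's topic scores. The sketch below shows it on made-up scores. The helper name `dominant_topic` and the values in `topic_scores` are hypothetical, and the tie-breaking rule (first index wins) is an assumption, since the caption does not say how ties were handled.

```python
from collections import Counter

def dominant_topic(scores):
    """Return the index of the topic with the largest score (first index wins ties)."""
    return max(range(len(scores)), key=lambda k: scores[k])

# Hypothetical topic scores for four individuals across six topics.
topic_scores = [
    [0.02, 0.10, 0.75, 0.01, 0.05, 0.00],
    [0.40, 0.05, 0.10, 0.02, 0.01, 0.03],
    [0.01, 0.03, 0.20, 0.60, 0.02, 0.00],
    [0.00, 0.08, 0.55, 0.05, 0.30, 0.01],
]

assignments = [dominant_topic(row) for row in topic_scores]
prevalence = Counter(assignments)  # number of individuals per dominant topic
print(assignments)
print(dict(prevalence))
```

Counting the assignments gives the per-topic prevalence that a plot such as Fig 3 would display.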
Fig 4. t-SNE plot visualizing the patient clusters in a projected 2D map (perplexity set to 30).
Pearson correlation tests between the LPA variant and each topic.
| Topic | Top phenotypes in this topic | Pearson correlation coefficient | P value |
|---|---|---|---|
| #0 | Respiratory failure, Pneumonia, Pleurisy, Pulmonary collapse; interstitial/compensatory emphysema, Hypotension NOS, Tachycardia NOS, Other dyspnea, Hypopotassemia, Sepsis, Septicemia | 0.011 | 0.199 |
| #1 | Pain in joint, Other tests, Back pain, Pain in limb, Malaise and fatigue, Cough, Nonspecific chest pain, Essential hypertension, Osteoarthrosis NOS, Abdominal pain | -0.008 | 0.358 |
| #2 | Coronary atherosclerosis, Essential hypertension, Hyperlipidemia, Congestive heart failure NOS, Nonspecific chest pain, Atrial fibrillation, Chronic ischemic heart disease, Shortness of breath, Nonrheumatic mitral valve disorders, Cardiomegaly | 0.072 | 5.8e-16 |
| #3 | Chemotherapy, Tobacco use disorder, Lung cancer, Other diseases of lung, Malaise and fatigue, Secondary malignancy of lymph nodes, Secondary malignancy of lung, Nausea and vomiting, Nonspecific chest pain, Shortness of breath | -0.039 | 8.5e-6 |
| #4 | Type 2 diabetes, Hypertensive chronic kidney disease, Chronic renal failure, Insulin pump user, Type 2 diabetic neuropathy, Chronic Kidney Disease, Stage III, Type 2 diabetic nephropathy, Type 1 diabetes, Polyneuropathy in diabetes, Acute renal failure | 0.002 | 0.783 |
| #5 | Ascites (nonmalignant), Abdominal pain, Cirrhosis of liver without mention of alcohol, Thrombocytopenia, Liver abscess and sequelae of chronic liver disease, Portal hypertension, Chronic nonalcoholic liver disease, Disorders of liver, Esophageal bleeding, Nausea and vomiting | -0.02 | 0.021 |
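As a rough consistency check on the table above, a Pearson coefficient can be converted to an approximate two-sided P value. The sketch below assumes the correlations were computed over all 12,759 individuals and uses a normal approximation to the t distribution, which is reasonable only because the sample is large. The function names are illustrative and this is not necessarily the test the authors ran.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def approx_two_sided_p(r, n):
    """Approximate two-sided P value for r, using t ~ N(0, 1) for large n."""
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))

n = 12759                                 # cohort size from the abstract (assumed n)
p_topic2 = approx_two_sided_p(0.072, n)   # r reported for topic #2
p_topic3 = approx_two_sided_p(-0.039, n)  # r reported for topic #3
print(f"topic #2: p ~ {p_topic2:.1e}")
print(f"topic #3: p ~ {p_topic3:.1e}")
```

Because the tabulated coefficients are rounded to three decimals, the recomputed values should only be expected to agree with the reported P values in order of magnitude.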
Logistic regression analysis of the LPA variant against each topic.
| Predictor | Coefficient | P value |
|---|---|---|
| Age | -0.003 | 0.079 |
| Sex | 0.145 | 0.005 |
| topic_0 | 0.542 | 0.166 |
| topic_1 | -0.07 | 0.820 |
| topic_2 | 2.789 | 3.42E-13 |
| topic_3 | -1.101 | 0.009 |
| topic_4 | -0.446 | 0.275 |
| topic_5 | -0.695 | 0.131 |
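Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below applies that conversion to three coefficients from the table above. It assumes the coefficients are natural-log odds per one-unit increase in the predictor. The scale of the topic scores is not given here, so the magnitudes should be read with caution.

```python
import math

# Coefficients copied from the logistic regression table above.
coefficients = {
    "Sex": 0.145,
    "topic_2": 2.789,
    "topic_3": -1.101,
}

# exp(coefficient) = multiplicative change in odds per one-unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "higher" if ratio > 1 else "lower"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} odds per unit increase)")
```

An odds ratio above 1 corresponds to a positive coefficient (as for topic_2), and one below 1 to a negative coefficient (as for topic_3).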
Fig 5. PheWAS results for rs10455872 in 12,759 individuals, adjusted for sex and age.