Raina M Merchant, David A Asch, Patrick Crutchley, Lyle H Ungar, Sharath C Guntuku, Johannes C Eichstaedt, Shawndra Hill, Kevin Padrez, Robert J Smith, H Andrew Schwartz.
Abstract
We studied whether medical conditions across 21 broad categories were predictable from social media content across approximately 20 million words written by 999 consenting patients. Facebook language significantly improved upon the prediction accuracy of demographic variables for 18 of the 21 disease categories; it was particularly effective at predicting diabetes and mental health conditions including anxiety, depression and psychoses. Social media data are a quantifiable link into the otherwise elusive daily lives of patients, providing an avenue for study and assessment of behavioral and environmental disease risk factors. Analogous to the genome, social media data linked to medical diagnoses can be banked with patients' consent, and an encoding of social media language can be used as markers of disease risk, serve as a screening tool, and elucidate disease epidemiology. In what we believe to be the first report linking electronic medical record data with social media data from consenting patients, we identified that patients' Facebook status updates can predict many health conditions, suggesting opportunities to use social media data to determine disease onset or exacerbation and to conduct social media-based health interventions.
Year: 2019 PMID: 31206534 PMCID: PMC6576767 DOI: 10.1371/journal.pone.0215476
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. General study design.
We extract a patient language encoding from the words and phrases within an individual’s Facebook status updates. The three word clouds shown represent the words most prevalent in three example dimensions of the encoding. We then learn predictive models and identify predictive markers for the medical condition categories in the medical records.
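The encoding step described above can be illustrated with a minimal sketch. It assumes that each dimension of the encoding is a topic (a weighted set of words) and that a patient's score on a topic is the weighted sum of the relative frequencies of that topic's words in their status updates. The tokenizer, the function names, and the two toy topics are hypothetical and are not taken from the study.

```python
import re
from collections import Counter


def relative_frequencies(status_updates):
    """Return word -> relative frequency over all of a patient's status updates."""
    tokens = []
    for update in status_updates:
        # Simple assumed tokenizer: lowercase runs of letters and apostrophes.
        tokens.extend(re.findall(r"[a-z']+", update.lower()))
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def topic_usage(status_updates, topic_weights):
    """Score each topic as the weighted sum of its words' relative frequencies.

    topic_weights maps topic name -> {word: weight}; the weights are an
    assumption standing in for whatever a fitted topic model would supply.
    """
    freqs = relative_frequencies(status_updates)
    return {
        topic: sum(weight * freqs.get(word, 0.0) for word, weight in words.items())
        for topic, words in topic_weights.items()
    }


# Hypothetical toy topics and posts, for illustration only.
TOPICS = {
    "family": {"mom": 1.0, "kids": 1.0},
    "pain": {"hurt": 1.0, "pain": 1.0},
}
updates = ["My kids are great", "Mom called, my back hurt all day"]
encoding = topic_usage(updates, TOPICS)
print(encoding)
```

The resulting dictionary is one patient's feature vector; a predictive model for a medical condition category would then be fit on such vectors across patients.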
Medical condition prevalence and participant characteristics.
| Medical condition categories | N |
|---|---|
| Digestive Abdominal Symptoms | 641 |
| Genitourinary Disorders | 562 |
| Injury and Poisoning | 543 |
| Respiratory Symptoms | 433 |
| Pregnancy | 323 |
| Skin Disorders | 364 |
| Chronic Pulmonary Disease | 204 |
| Deficiency Anemia | 194 |
| Depression | 149 |
| Fluid and Electrolyte Disorders | 135 |
| Hypertension | 132 |
| Obesity | 132 |
| Anxiety | 122 |
| Psychoses | 73 |
| Drug Abuse | 64 |
| Sexually Transmitted Disease | 57 |
| Diabetes | 49 |
| Blood Loss Anemia | 45 |
| Coagulopathy | 38 |
| Alcohol Abuse | 34 |
| Collagen Vascular Diseases | 32 |
| Participant characteristics | % |
| Female sex | 76% |
| Race: Black | 71% |
| Race: White | 23% |
| Race: Asian | 2% |
| Race: Other | 4% |
| Age 18–23 | 37% |
| Age 24–30 | 33% |
| Age 31–65 | 30% |
Fig 2. A. Diagnoses prediction strength of demographics and Facebook. This figure represents overall accuracies of Facebook and demographic models at predicting diagnoses. Accuracies were measured using the area under the receiver operating characteristic curve (AUC), a measure of discrimination. The category “Facebook alone” represents predictions based only on Facebook language. “Demographics alone” represents predictions from age, sex, and race. “Demographics & Facebook” represents predictions based on a combination of demographics and Facebook posts. Diagnoses are ordered by the difference in AUC between Facebook alone and demographics alone. For the top 10 categories, Facebook predictions are significantly more accurate than those from demographics (p < .05), and for the top 17 plus iron deficiency anemia, Facebook & demographics are significantly more accurate than demographics alone (p < .05). * Pregnancy analyses only included females. B. Markers (most predictive topics) per diagnosis. This figure illustrates top markers (clusters of similar words from social media language) most predictive of selected diagnosis categories. Word size within a topic represents rank-order prevalence in the topic. Expletives were edited and represented by stars (i.e. *). All topics shown, except for those with digestive abdominal symptoms, were individually predictive beyond the demographics (multi-test corrected p < .05). (Full results in supplement [S2 Table]).
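As a reminder of what the AUC measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen patient with the diagnosis receives a higher model score than a randomly chosen patient without it, with ties counted as one half. The labels and scores are made-up toy values, not study data, and the names `demographics_scores` and `facebook_scores` are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 1 = has the diagnosis, 0 = does not.
labels = [1, 0, 1, 0, 0, 1]
demographics_scores = [0.6, 0.5, 0.4, 0.45, 0.3, 0.55]
facebook_scores = [0.8, 0.3, 0.7, 0.6, 0.2, 0.65]

print(auc(labels, demographics_scores))
print(auc(labels, facebook_scores))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation. The quadratic loop is fine for a sketch; a rank-based formulation would scale better for large samples.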
Fig 3. Differential expression of topics across medical conditions within the social mediome.
Analogous to studying the differential expression of a genome, topics of the social mediome can be explored differentially across diagnoses. The 21 rows represent all medical condition categories of the study, ordered using hierarchical clustering, while the 200 columns indicate the predictive strength[24] (measured by area under the ROC curve) of each potential language marker (topic). Blue topics were more likely to be used by patients with the given medical condition and orange topics were less likely to be mentioned. Medical condition categories each have unique patterns of markers. These encodings allow for the prediction of diagnoses and identification of diagnoses with similar patterns of markers.
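The row ordering can be illustrated with a small sketch of agglomerative hierarchical clustering. The caption does not state the linkage or distance used, so average linkage on Euclidean distance is an assumption here, and the condition-by-topic AUC values are invented toy numbers with only three topics instead of 200.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_order(rows):
    """Return row indices ordered by repeated average-linkage merging.

    Rows that merge early (similar marker patterns) end up adjacent.
    """
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(euclidean(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0] if clusters else []


# Hypothetical per-topic AUCs (rows: conditions, columns: topics).
names = ["Depression", "Diabetes", "Anxiety", "Obesity"]
auc_matrix = [
    [0.60, 0.42, 0.55],
    [0.45, 0.61, 0.50],
    [0.59, 0.43, 0.56],
    [0.47, 0.59, 0.48],
]
order = cluster_order(auc_matrix)
print([names[i] for i in order])
```

In this toy input the two rows with the most similar marker patterns are merged first and therefore appear next to each other in the ordering, which is the property the heat map relies on to expose diagnoses with similar patterns of markers.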