Christopher Weeg, H Andrew Schwartz, Shawndra Hill, Raina M Merchant, Catalina Arango, Lyle Ungar.
Abstract
BACKGROUND: Twitter is increasingly used to estimate disease prevalence, but such measurements can be biased because of both biased sampling and the inherent ambiguity of natural language.
Keywords: bias; data mining; demographics; disease; epidemiology; prevalence; public health; social media
Year: 2015 PMID: 26925459 PMCID: PMC4763717 DOI: 10.2196/publichealth.3953
Source DB: PubMed Journal: JMIR Public Health Surveill ISSN: 2369-2960
Example of rating whether each tweet does or does not refer to a medical meaning of the selected term. Here the term is “heart attacks.”
| Rater 1 | Rater 2 | Tweet |
| Yes | Yes | Visited a man who has had 2 heart attacks who feels privileged to be in circumstances that allow him to share his trust in God. #realdeal |
| Yes | No | Got room for 1 more? RT @pjones59: Sausage balls, heart attacks on a stick, dip, chips, wings and cheese, cream cheese/pickle/ham wraps |
| No | No | I still can't believe I saw Kris at work the other day. Talk about mini heart attacks. U_U |
Figure 1. Equation for deriving a disease lexicon's correction factor.
Figure 2. Disease terms from the diabetes lexicon that were subjected to manual appraisal. Each term is appraised on up to 30 instances. The term-level appraisals are then summed to reach the final lexicon-level validated tweet count for diabetes (8896).
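The aggregation described in the Figure 2 caption can be sketched as follows. This is only an illustration under stated assumptions: the equation in Figure 1 is not reproduced here, so the sketch assumes that each term's validated count is its raw count scaled by the fraction of appraised instances rated valid. The class `TermAppraisal`, the function names, and the example terms and counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TermAppraisal:
    """Manual appraisal summary for one lexicon term (hypothetical structure)."""
    term: str
    raw_count: int    # tweets containing the term
    n_appraised: int  # instances rated by hand (up to 30 per term)
    n_valid: int      # appraised instances judged to carry the medical meaning


def validated_count(appraisals):
    """Sum term-level estimates into a lexicon-level validated tweet count.

    Assumption: a term's validated count is its raw count multiplied by the
    fraction of its appraised instances that were rated valid.
    """
    total = 0.0
    for a in appraisals:
        if a.n_appraised <= 0:
            raise ValueError(f"term {a.term!r} has no appraised instances")
        total += a.raw_count * a.n_valid / a.n_appraised
    return total


def correction_factor(appraisals):
    """Percentage of the lexicon's raw tweets estimated to be valid."""
    raw_total = sum(a.raw_count for a in appraisals)
    if raw_total <= 0:
        raise ValueError("lexicon has no raw tweets")
    return 100.0 * validated_count(appraisals) / raw_total


if __name__ == "__main__":
    # Made-up terms and counts, not taken from the study.
    lexicon = [
        TermAppraisal("term_a", raw_count=1000, n_appraised=30, n_valid=27),
        TermAppraisal("term_b", raw_count=200, n_appraised=20, n_valid=5),
    ]
    print(validated_count(lexicon))
    print(round(correction_factor(lexicon), 2))
```

Raising an error for a term with no appraised instances is a design choice of this sketch; it avoids silently counting such a term as entirely valid or entirely invalid.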
Raw and validated tweet counts, correction factor, and US and Twitter disease prevalence for each disease.
| Disease | Raw tweet count | Validated tweet count | Correction factor (%) [a] | Prev US (millions) [b,d] | Prev US Twitter (millions) [c,d] |
| Acid reflux disease/gastroesophageal reflux disease | 743 | 631 | 84.98 | 32.4 | 2.40 |
| Acne | 6936 | 6027 | 86.89 | 11.2 | 2.00 |
| Attention deficit disorder/attention deficit hyperactivity disorder | 2794 | 2660 | 95.19 | 4.9 | 0.90 |
| Arthritis | 2524 | 2522 | 99.92 | 34.4 | 1.30 |
| Asthma | 3952 | 3754 | 95.00 | 12.4 | 1.00 |
| Backache | 3035 | 3028 | 99.77 | 42.0 | 2.60 |
| Cancer | 110,760 | 63,647 | 57.46 | 5.0 | 0.46 |
| Congestive heart failure | 928 | 313 | 33.76 | — | — |
| Heart disease | 2741 | 2410 | 87.91 | — | — |
| Congestive heart failure/heart disease [e] | 3669 | 2723 | 74.21 | 5.9 | 0.46 |
| Chronic obstructive pulmonary disease | 226 | 188 | 83.37 | 5.5 | 0.86 |
| Depression | 14,294 | 10,459 | 73.17 | 18.7 | 2.20 |
| Diabetes | 9202 | 8896 | 96.67 | 20.8 | 1.20 |
| Flu | 10,139 | 8810 | 86.90 | 17.2 | 1.80 |
| Genital herpes | 76 | 66 | 86.84 | 1.8 | 0.33 |
| Heart attack | 15,027 | 2311 | 15.38 | — | — |
| Stroke | 12,852 | 1914 | 14.89 | — | — |
| Heart attack/stroke [f] | 27,879 | 4225 | 15.15 | 3.0 | 0.11 |
| High cholesterol | 225 | 218 | 96.67 | 37.9 | 1.70 |
| Human papilloma virus | 636 | 545 | 85.73 | 1.5 | 0.12 |
| Hypertension/high blood pressure | 1630 | 1491 | 91.49 | 43.5 | 1.50 |
| Migraine headache | 5958 | 5615 | 94.24 | 16.4 | 1.80 |
| Nasal allergies/hay fever | 481 | 473 | 98.27 | 18.2 | 1.30 |
| Osteoporosis | 316 | 306 | 96.68 | 6.0 | 0.13 |
| Stomach ulcers | 80 | 73 | 91.25 | 3.3 | 0.03 |
| Urinary tract infection | 880 | 479 | 54.40 | 10.0 | 1.00 |
[a] Correction factor is the percentage of tweets that were appraised as valid.
[b] Prev US (millions) represents a disease's prevalence in the US.
[c] Prev US Twitter (millions) represents a disease's prevalence among US Twitter users.
[d] The source for both Prev US (millions) and Prev US Twitter (millions) is the Experian Simmons National Consumer Study.
[e] In the Experian dataset, congestive heart failure and heart disease are collapsed into a single data point. We mined Twitter for these diseases separately and applied our evaluation method to the tweets containing disease terms for each one separately. However, because Experian was our source for prevalence statistics, we can report the prevalence of these two diseases only in combined form.
[f] Note e also applies to heart attack and stroke.
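As a quick check on how the correction factor relates to the two count columns, the sketch below recomputes it for heart attack and stroke and for their combined row, using the counts given in the table. The function name `percent_valid` is an assumption made for illustration.

```python
def percent_valid(raw_count, validated_count):
    """Correction factor: validated tweets as a percentage of raw tweets."""
    if raw_count <= 0:
        raise ValueError("raw_count must be positive")
    return 100.0 * validated_count / raw_count


# (raw tweet count, validated tweet count) as listed in the table
heart_attack = (15027, 2311)
stroke = (12852, 1914)

# The combined row sums the raw and validated counts of its two components.
combined = (heart_attack[0] + stroke[0], heart_attack[1] + stroke[1])

if __name__ == "__main__":
    for label, (raw, valid) in [
        ("Heart attack", heart_attack),
        ("Stroke", stroke),
        ("Heart attack/stroke", combined),
    ]:
        print(f"{label}: {percent_valid(raw, valid):.2f}")
```

The combined correction factor is a ratio of summed counts, so it is weighted by tweet volume and is not the simple average of the two component percentages.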
Spearman correlation coefficients between both raw and validated tweet counts and US population and Twitter-user disease prevalence (all P <.001).
| | Prevalence: US population | Prevalence: US Twitter users |
| Raw tweet count | .113 | .258 |
| Validated tweet count | .208 | .366 |
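For readers who want to see how a Spearman coefficient of this kind is obtained, here is a minimal pure-Python sketch: rank both variables (ties receive their average rank), then take the Pearson correlation of the ranks. The function names are assumptions, and the example uses only five rows from the table above, so it is not expected to reproduce the reported coefficients.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Subset of table rows: acne, asthma, cancer, diabetes, flu.
    validated = [6027, 3754, 63647, 8896, 8810]
    prev_us_millions = [11.2, 12.4, 5.0, 20.8, 17.2]
    print(spearman(validated, prev_us_millions))
```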
Figure 3. Projected prevalence (as a function of validated tweet count) versus actual US prevalence for 22 diseases, in millions (sorted by projected prevalence). Some diseases are “over-tweeted” (in particular, cancer), whereas others are “under-tweeted” (eg, backache and arthritis).
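The caption does not state which function maps validated tweet counts to projected prevalence. The sketch below assumes the simplest possibility, proportional allocation: total prevalence is distributed across diseases in proportion to their validated tweet counts. This is an assumption for illustration and may differ from the authors' projection. The function names are hypothetical, and only four rows from the table are used.

```python
def project_prevalence(validated, actual):
    """Distribute total actual prevalence in proportion to validated tweet counts.

    Assumption: projection is a pure proportional share; the source does not
    specify the function actually used.
    """
    total_tweets = sum(validated.values())
    total_prevalence = sum(actual.values())
    return {
        disease: total_prevalence * count / total_tweets
        for disease, count in validated.items()
    }


def tweet_bias(projected, actual):
    """Label a disease over-tweeted when its projection exceeds its actual prevalence."""
    return {
        disease: "over-tweeted" if projected[disease] > actual[disease] else "under-tweeted"
        for disease in actual
    }


# Validated tweet counts and US prevalence (millions) taken from the table.
validated = {"Cancer": 63647, "Backache": 3028, "Arthritis": 2522, "Diabetes": 8896}
actual = {"Cancer": 5.0, "Backache": 42.0, "Arthritis": 34.4, "Diabetes": 20.8}

if __name__ == "__main__":
    projected = project_prevalence(validated, actual)
    labels = tweet_bias(projected, actual)
    # Sort by projected prevalence, as in the figure.
    for disease in sorted(projected, key=projected.get, reverse=True):
        print(f"{disease}: projected {projected[disease]:.1f}, "
              f"actual {actual[disease]:.1f}, {labels[disease]}")
```

Under this assumption, cancer comes out as over-tweeted and backache and arthritis as under-tweeted on these four rows, which matches the direction described in the caption.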