Elad Yom-Tov, Diana Borsa, Andrew C Hayward, Rachel A McKendry, Ingemar J Cox.
Abstract
BACKGROUND: The escalating cost of global health care is driving the development of new technologies to identify early indicators of an individual's risk of disease. Traditionally, epidemiologists have identified such risk factors using medical databases and lengthy clinical studies, but these are often limited by size and cost and can fail to take full account of diseases that carry social stigma or to identify transient acute risk factors.
Keywords: Information retrieval query processing; Machine Learning; Web search engines; epidemiology; self-controlled case series
Year: 2015 PMID: 25626480 PMCID: PMC4327439 DOI: 10.2196/jmir.4082
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Disease incidence [a] in the United States and in the self-identified user (SIU) population for 29 diseases.
| Disease | Percentage of SIUs | Incidence in the United States | Rank of SIU | Rank of incidence in the United States |
| HIV | 6.17E+00 | 1.53E-04 | 1 | 18 |
| Cancer | 5.19E+00 | 5.37E-03 | 2 | 3 |
| Diabetes mellitus | 4.42E+00 | 6.13E-03 | 3 | 2 |
| Herpes simplex | 2.34E+00 | 2.50E-03 | 4 | 4 |
| Arthritis | 1.30E+00 | 4.10E-04 | 5 | 11 |
| Atrial fibrillation | 1.08E+00 | 6.45E-02 | 6 | 1 |
| Gastroparesis | 1.05E+00 | 2.50E-05 | 7 | 28 |
| Heart failure | 9.70E-01 | 1.77E-04 | 8 | 15 |
| Schizophrenia | 9.44E-01 | 7.00E-04 | 9 | 8 |
| Crohn’s disease | 8.38E-01 | 7.90E-05 | 10 | 20 |
| Dementia | 7.69E-01 | 1.51E-03 | 11 | 5 |
| Alzheimer's disease | 7.53E-01 | 1.26E-03 | 12 | 6 |
| Amyotrophic lateral sclerosis | 6.04E-01 | 1.81E-05 | 13 | 29 |
| Parkinson's disease | 5.62E-01 | 1.77E-04 | 14 | 15 |
| Breast cancer | 5.04E-01 | 7.51E-04 | 15 | 7 |
| Colitis | 4.40E-01 | 8.80E-05 | 16 | 19 |
| Asthma | 4.19E-01 | 7.00E-04 | 17 | 8 |
| Epilepsy | 3.92E-01 | 4.70E-04 | 18 | 10 |
| Lyme disease | 3.45E-01 | 7.00E-05 | 19 | 23 |
| Hepatitis C | 3.08E-01 | 5.48E-05 | 20 | 25 |
| Spina bifida | 2.65E-01 | 3.49E-04 | 21 | 12 |
| Leukemia | 2.55E-01 | 1.69E-04 | 22 | 17 |
| Hypothyroidism | 2.33E-01 | 2.86E-04 | 23 | 13 |
| Chronic pancreatitis | 2.28E-01 | 4.35E-05 | 24 | 27 |
| Celiac disease | 2.17E-01 | 6.50E-05 | 25 | 24 |
| Cardiomyopathy | 1.96E-01 | 5.00E-05 | 26 | 26 |
| Multiple myeloma | 1.59E-01 | 7.76E-05 | 27 | 21 |
| Lymphoma | 1.48E-01 | 2.58E-04 | 28 | 14 |
| Brain tumor | 1.22E-01 | 7.54E-05 | 29 | 22 |
[a] Incidence is provided as a fraction of the population.
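The two rank columns in the table above can be reproduced from the value columns. The following is a minimal sketch in plain Python, not the authors' code: it assigns descending competition ranks (tied values share the lowest rank, which matches the tied entries in the table) and then computes a Spearman rank correlation between the two value columns using average ranks for ties. The function names are illustrative assumptions, and no correlation value is asserted here.

```python
# (disease, percentage of SIUs, incidence in the United States), copied from the table.
ROWS = [
    ("HIV", 6.17e+00, 1.53e-04),
    ("Cancer", 5.19e+00, 5.37e-03),
    ("Diabetes mellitus", 4.42e+00, 6.13e-03),
    ("Herpes simplex", 2.34e+00, 2.50e-03),
    ("Arthritis", 1.30e+00, 4.10e-04),
    ("Atrial fibrillation", 1.08e+00, 6.45e-02),
    ("Gastroparesis", 1.05e+00, 2.50e-05),
    ("Heart failure", 9.70e-01, 1.77e-04),
    ("Schizophrenia", 9.44e-01, 7.00e-04),
    ("Crohn's disease", 8.38e-01, 7.90e-05),
    ("Dementia", 7.69e-01, 1.51e-03),
    ("Alzheimer's disease", 7.53e-01, 1.26e-03),
    ("Amyotrophic lateral sclerosis", 6.04e-01, 1.81e-05),
    ("Parkinson's disease", 5.62e-01, 1.77e-04),
    ("Breast cancer", 5.04e-01, 7.51e-04),
    ("Colitis", 4.40e-01, 8.80e-05),
    ("Asthma", 4.19e-01, 7.00e-04),
    ("Epilepsy", 3.92e-01, 4.70e-04),
    ("Lyme disease", 3.45e-01, 7.00e-05),
    ("Hepatitis C", 3.08e-01, 5.48e-05),
    ("Spina bifida", 2.65e-01, 3.49e-04),
    ("Leukemia", 2.55e-01, 1.69e-04),
    ("Hypothyroidism", 2.33e-01, 2.86e-04),
    ("Chronic pancreatitis", 2.28e-01, 4.35e-05),
    ("Celiac disease", 2.17e-01, 6.50e-05),
    ("Cardiomyopathy", 1.96e-01, 5.00e-05),
    ("Multiple myeloma", 1.59e-01, 7.76e-05),
    ("Lymphoma", 1.48e-01, 2.58e-04),
    ("Brain tumor", 1.22e-01, 7.54e-05),
]


def competition_rank(values):
    """Rank in descending order; tied values share the best (lowest) rank."""
    return [1 + sum(other > v for other in values) for v in values]


def average_rank(values):
    """Rank in descending order; tied values share the mean of their ranks."""
    ranks = []
    for v in values:
        greater = sum(other > v for other in values)
        equal = sum(other == v for other in values)  # count includes v itself
        ranks.append(greater + (equal + 1) / 2)
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of average ranks."""
    rx, ry = average_rank(xs), average_rank(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


siu = [row[1] for row in ROWS]
incidence = [row[2] for row in ROWS]
siu_rank = competition_rank(siu)              # "Rank of SIU" column
incidence_rank = competition_rank(incidence)  # "Rank of incidence" column
rho = spearman_rho(siu, incidence)
```

Competition ranking is used for the table columns because heart failure and Parkinson's disease both appear with rank 15 and leukemia follows with rank 17; average ranks are used only inside the correlation, which is the usual way to handle ties for Spearman's rho.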
Percentage of correctly classified self-identified users (SIUs) given users’ queries for drugs, diseases, symptoms, and their combinations (n=18,859).
| Attribute | Correctly classified, n (%) |
| Drugs | 6751 (35.80) |
| Diseases | 16,652 (88.30) |
| Symptoms | 6695 (35.50) |
| Drugs and diseases | 16,671 (88.40) |
| Drugs and symptoms | 6714 (35.60) |
| Diseases and symptoms | 16,675 (88.42) |
| All three attributes | 16,659 (88.33) |
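As a quick arithmetic check, the percentages in the table above follow from the counts and the stated total of n=18,859 users. The short sketch below recomputes them; the names `CORRECT` and `percent_correct` are illustrative and not from the paper.

```python
N_USERS = 18_859  # total number of SIUs, from the table caption

# Attribute set -> number of correctly classified users, copied from the table.
CORRECT = {
    "Drugs": 6751,
    "Diseases": 16_652,
    "Symptoms": 6695,
    "Drugs and diseases": 16_671,
    "Drugs and symptoms": 6714,
    "Diseases and symptoms": 16_675,
    "All three attributes": 16_659,
}


def percent_correct(count, total=N_USERS):
    """Share of correctly classified users as a percentage, rounded to two decimals."""
    return round(100 * count / total, 2)


PERCENT = {name: percent_correct(count) for name, count in CORRECT.items()}
```

The recomputed values agree with the percentages in parentheses. They also make the pattern in the table explicit: disease queries alone reach 88.30%, and adding drugs or symptoms changes the result by about a tenth of a percentage point.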
Five most common errors of the classifier, which predicts SIUs given the disease profile.
| Self-identified disease | Predicted disease | Percentage of errors |
| AIDS | HIV | 9.6 |
| AIDS | CHILD syndrome | 7.1 |
| AIDS | Pregnancy | 6.5 |
| HIV | AIDS | 3.6 |
| AIDS | Cancer | 3.2 |
Attributes used to predict whether the most frequently queried disease is the one afflicting the user.
| Index | Attribute |
| 1 | The number of times the user queried for their most frequently queried disease. |
| 2 | The number of times the user queried for their second most frequently queried disease. |
| 3 | The ratio between the above two. |
| 4 | The fraction of users who asked about the most common disease queried by the user. |
| 5 | The number of diseases that were asked about more than once. |
| 6 | The number of diseases that were asked about more than five times. |
| 7 | The number of diseases that were asked about more than 10 times. |
| 8 | Indication of queries for drugs related to the most common disease queried by the user: Let |
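Attributes 1 to 7 in the table above can be derived from a per-user count of disease queries. The sketch below is one possible reading, not the authors' implementation. The function name, the dictionary keys, and the `user_fraction` input are hypothetical; the direction of the ratio in attribute 3 (most common over second most common) is an assumption; and attribute 8 is omitted because its definition is truncated in this record.

```python
from collections import Counter


def disease_query_features(disease_counts, user_fraction):
    """Compute attributes 1-7 for a single user.

    disease_counts: Counter mapping disease name -> number of this user's queries.
    user_fraction:  dict mapping disease name -> fraction of all users who asked
                    about that disease (hypothetical input, supplied by the caller).
    """
    if not disease_counts:
        raise ValueError("user has no disease queries")

    ranked = disease_counts.most_common()  # sorted by count, descending
    top_disease, top_count = ranked[0]
    second_count = ranked[1][1] if len(ranked) > 1 else 0

    # Attribute 3: assumed to be top / second; infinite when there is no second disease.
    ratio = top_count / second_count if second_count else float("inf")

    counts = list(disease_counts.values())
    return {
        "top_count": top_count,                                     # attribute 1
        "second_count": second_count,                               # attribute 2
        "top_to_second_ratio": ratio,                               # attribute 3
        "top_disease_user_fraction": user_fraction.get(top_disease, 0.0),  # attribute 4
        "n_diseases_over_1": sum(c > 1 for c in counts),            # attribute 5
        "n_diseases_over_5": sum(c > 5 for c in counts),            # attribute 6
        "n_diseases_over_10": sum(c > 10 for c in counts),          # attribute 7
    }


# Usage with made-up counts for one user.
example = disease_query_features(
    Counter({"HIV": 12, "Cancer": 3, "Asthma": 1}),
    {"HIV": 0.06},
)
```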
Figure 1. Spearman correlation between known disease incidence (n=29) and the size of the population identified by the classifier, as a function of the classifier threshold.
Figure 2. Self-controlled case series (SCCS) density equation.
Figure 3. Self-controlled case series (SCCS) maximum likelihood equation.
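The equations shown in Figures 2 and 3 are not reproduced in this record. For orientation only, the conditional likelihood commonly given for the SCCS method in the epidemiology literature is sketched below. The notation is an assumption and may differ from the paper's: individual $i$ has an observation period split into intervals $k$ of length $e_{ik}$, with $n_{ik}$ events and exposure covariates $x_{ik}$ in each interval, and an individual baseline effect $\phi_i$.

```latex
% Event rate for individual i in interval k (Poisson model):
\lambda_{ik} = \exp\!\left(\phi_i + \beta^{\top} x_{ik}\right)

% Conditioning on the total number of events of individual i gives a
% multinomial contribution in which \phi_i cancels:
L_i(\beta) = \prod_{k}
  \left(
    \frac{e_{ik}\,\exp\!\left(\beta^{\top} x_{ik}\right)}
         {\sum_{m} e_{im}\,\exp\!\left(\beta^{\top} x_{im}\right)}
  \right)^{n_{ik}}

% Maximum likelihood estimate over all individuals:
\hat{\beta} = \arg\max_{\beta} \sum_{i} \sum_{k} n_{ik}
  \log \frac{e_{ik}\,\exp\!\left(\beta^{\top} x_{ik}\right)}
            {\sum_{m} e_{im}\,\exp\!\left(\beta^{\top} x_{im}\right)}
```

Because $\phi_i$ cancels, each individual serves as their own control, and $\exp(\beta)$ is read as a relative incidence for the exposed intervals compared with the unexposed ones.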
Medical conditions analyzed for precursor behaviors.
| Condition | Number of users | Number of queries | Number of categories |
| Pregnancy | 56,062 | 3,154,273 | 1263 |
| Allergy | 3739 | 217,395 | 1455 |
| HIV | 1522 | 80,537 | 1008 |
| Herpes simplex | 709 | 45,669 | 1102 |
| Myocardial infarction | 701 | 36,552 | 1340 |
| Post-traumatic stress disorder | 657 | 36,986 | 925 |
| Eating disorder | 615 | 37,948 | 1671 |
Precursor search categories and queries associated with the analyzed medical conditions (at a false discovery rate [FDR] of 0.05). Queries related to interest in specific people afflicted with the medical condition were manually removed from the list.
| Condition | Precursor behaviors | Category or query | Relative hazard |
| | | | |
| | Pregnancy symptoms | Query | 3.33 |
| | Birth control | Category | 2.74 |
| | Fertility | Category | 2.58 |
| | Pregnancy with abortive outcome | Category | 2.07 |
| | Medical emergencies | Category | 1.84 |
| | Teen pregnancy in film and television | Category | 1.59 |
| | | | |
| | Petco | Query | 3.88 |
| | Pet stores | Category | 3.34 |
| | Crops originating from the Americas | Category | 2.88 |
| | PetSmart | Query | 2.07 |
| | | | |
| | Image search | Category | 8.14 |
| | Bipolar spectrum | Category | 8.01 |
| | Depression | Category | 6.66 |
| | Barnes and Noble (Web-based book store) | Query | 4.54 |
| | English child actors | Category | 3.85 |
| | | | |
| | WorldStarHipHop (multimedia website) | Query | 6.12 |
| | Web-based real estate companies | Category | 3.50 |
| | Real estate valuation | Category | 3.50 |
| | Military brats (children of parents serving full time in the US Armed forces) | Category | 2.52 |
| | Plenty of Fish (dating website) | Query | 2.34 |
| | Yahoo | Query | 2.13 |
| | Zillow (real estate website) | Query | 2.06 |
| | | Query | 2.03 |
| | Walmart | Query | 1.94 |
| | Redtube | Query | 1.49 |
| | | | |
| | Xtube | Query | 5.50 |
| | Same sex online dating | Category | 3.54 |
| | Adam4Adam | Query | 3.42 |
| | Video game franchises | Category | 3.14 |
| | | | |
| | Fast-food hamburger restaurants | Category | 5.28 |
| | Theme restaurants | Category | 4.22 |
| | | | |
| | Homelessness | Query | 14.52 |
| | Rape | Query | 14.52 |
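The caption of the precursor table states that entries were selected at an FDR of 0.05, but this record does not say which procedure was used. As one common choice, the Benjamini-Hochberg step-up procedure can be sketched as follows. The function name and the example P values are illustrative assumptions and are not taken from the paper.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level q.

    Step-up rule: sort the P values in ascending order, find the largest
    rank k with p_(k) <= (k / m) * q, and reject the k smallest P values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first

    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank  # keep the largest rank that satisfies the bound

    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Usage with made-up P values for ten candidate precursors.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p, q=0.05)
```

With these made-up values only the two smallest P values pass, because the third (0.039) already exceeds its bound of 3/10 × 0.05 = 0.015 and no later value falls under its own bound.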
Figure 4. Self-identified user (SIU) rate for users who reported having HIV, compared to the HIV incidence rate by state. The correlation between the two variables is Spearman rho=.452 (P=9×10^-4).