Manabu Torii, Sameer S Tilak, Son Doan, Daniel S Zisook, Jung-Wei Fan.
Abstract
In an era when most of our life activities are digitized and recorded, opportunities abound to gain insights about population health. Online product reviews present a unique data source that is currently underexplored. Health-related information, although scarce, can be systematically mined in online product reviews. Leveraging natural language processing and machine learning tools, we were able to mine 1.3 million grocery product reviews for health-related information. The objectives of the study were as follows: (1) conduct quantitative and qualitative analysis on the types of health issues found in consumer product reviews; (2) develop a machine learning classifier to detect reviews that contain health-related issues; and (3) gain insights about the task characteristics and challenges for text analytics to guide future research.
Keywords: big data; consumer health informatics; natural language processing; online product reviews; syndromic surveillance; text mining
Year: 2016 PMID: 27375358 PMCID: PMC4915789 DOI: 10.4137/BII.S37791
Source DB: PubMed Journal: Biomed Inform Insights ISSN: 1178-2226
Figure 1. An overview of the data processing workflow.
Note: A clinical NLP pipeline, nQuiry, was used to extract phrases potentially relevant to health-related issues in consumer product reviews. Detected phrases were further filtered in the subsequent steps to narrow down the target concepts.
Figure 2. Classifier performance for different training data sizes.
Notes: Logistic regression models were trained on different amounts of data to observe the effect of training set size on performance. Specifically, mean F-scores were calculated through resampling evaluation, in which four different fractions of the annotated data (0.25, 0.50, 0.75, and 1.00) were used for the evaluation tests. The error bars represent one standard deviation from the mean.
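The resampling evaluation described in the notes for Figure 2 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the number of repeats, the use of the sample standard deviation, and the `evaluate` callback (which would train a model on a subset and return its F-score) are all assumptions.

```python
import random
import statistics


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def learning_curve(examples, evaluate,
                   fractions=(0.25, 0.50, 0.75, 1.00),
                   n_repeats=10, seed=0):
    """Return {fraction: (mean score, standard deviation)}.

    `evaluate` is a hypothetical caller-supplied function: it takes a list
    of annotated examples, trains and tests a classifier on them, and
    returns an F-score. `n_repeats` (assumed value) must be at least 2.
    """
    rng = random.Random(seed)
    curve = {}
    for fraction in fractions:
        size = max(1, round(len(examples) * fraction))
        # Draw a fresh random subset of the annotated data for each repeat.
        scores = [evaluate(rng.sample(examples, size))
                  for _ in range(n_repeats)]
        curve[fraction] = (statistics.mean(scores), statistics.stdev(scores))
    return curve
```

Each entry of the returned dictionary corresponds to one point on the curve, with the second value giving the height of its error bar.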
Figure 3. Classifier performance for different ratios of irrelevant to relevant instances.
Notes: Logistic regression models were trained on different ratios of positive and negative data to determine an appropriate ratio for the final model. In the original dataset, there were twice as many irrelevant instances as relevant instances (Double). In addition to the original ratio, two different ratios, Half and Equal, were tested. Mean precision and recall were calculated through repeated resampling evaluation tests.
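One way to build training sets with the ratios named in the notes for Figure 3 is to keep all relevant instances and subsample the irrelevant ones. The sketch below assumes that Half, Equal, and Double mean 0.5, 1, and 2 irrelevant instances per relevant instance; the `RATIOS` table and the function name are hypothetical, and the source does not state how the resampling was actually done.

```python
import random

# Assumed meaning of the ratio names: irrelevant instances per relevant one.
RATIOS = {"Half": 0.5, "Equal": 1.0, "Double": 2.0}


def resample_negatives(relevant, irrelevant, ratio_name, seed=0):
    """Keep every relevant instance and subsample the irrelevant ones.

    Returns (relevant, sampled_irrelevant) so that the number of sampled
    irrelevant instances is about RATIOS[ratio_name] times len(relevant).
    """
    rng = random.Random(seed)
    target = round(len(relevant) * RATIOS[ratio_name])
    if target > len(irrelevant):
        raise ValueError("not enough irrelevant instances for this ratio")
    return list(relevant), rng.sample(irrelevant, target)
```

Repeating this with different seeds and averaging precision and recall over the runs would give the kind of mean values the figure reports.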
Figure 4. A fraction of relevant instances per bin (score range) on an unseen dataset.
Notes: A logistic regression model was applied to a large collection of phrases in an unseen dataset, and the phrases were sorted into bins according to the prediction scores assigned by the classifier. One hundred phrases were sampled from each bin and manually reviewed to estimate the fraction of relevant instances per bin.
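The binning and sampling procedure in the notes for Figure 4 might look like the following sketch. The use of ten equal-width bins over scores in [0, 1] is an assumption (the source does not give the bin boundaries), and all function names are hypothetical; only the sample size of one hundred phrases per bin comes from the notes.

```python
import random
from collections import defaultdict


def bin_index(score, n_bins=10):
    """Map a score in [0, 1] to a bin index; 1.0 falls in the last bin."""
    return min(int(score * n_bins), n_bins - 1)


def sample_per_bin(scored_phrases, n_bins=10, per_bin=100, seed=0):
    """Group (phrase, score) pairs by score bin and sample from each bin.

    Bins holding fewer than `per_bin` phrases are returned in full.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for phrase, score in scored_phrases:
        bins[bin_index(score, n_bins)].append(phrase)
    return {b: rng.sample(items, min(per_bin, len(items)))
            for b, items in sorted(bins.items())}


def relevant_fraction(judgments):
    """Fraction of manually reviewed phrases judged relevant (booleans)."""
    return sum(judgments) / len(judgments)
```

After manual review of each sampled bin, `relevant_fraction` gives the per-bin estimate plotted in the figure.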
Frequent health issues in the customer reviews.
| HEALTH ISSUES | UMLS CUI | PHRASE VARIANTS | FREQUENCY |
|---|---|---|---|
| Pain | C0030193 | Pain(s), painful, hurt(s), hurting, ache | 146 |
| Diabetes | C0011849, C0375113 | Diabetic(s), diabetes | 122 |
| Nausea | C0027497 | Nausea, nauseous, nauseated, feeling sick | 103 |
| Headache | C0018681 | Headache(s), head-ache, head pains | 79 |
| Morning sickness | C0240352, C0312416 | Morning sickness | 66 |
| Upset stomach | C0235309 | Upset stomach(s), upset tummy, stomach discomfort | 42 |
| Allergy | C0685900, C0700625, C0851444 | Allergic, allergy, allergies | 35 |
| Diarrhea | C0011991 | Diarrhea, loose stools | 33 |
| Acid reflux | C0017168 | Acid reflux, gerd | 23 |
| Stomach ache | C0221512 | Stomach ache(s), tummy ache(s), pain in the stomach | 20 |
Note: The frequencies in the rightmost column were counted in the manually annotated data.
Examples of product types per health issue.
| PROBLEM | PRODUCT TYPE | EXAMPLES |
|---|---|---|
| Pain | Purified water | |
| | Cherry tart juice | |
| Nausea | Vitamin supplements | |
| | Ginger candy | |
| Headache | Caffeinated water | |
| | Energy drink | |
| Diabetes | Natural soda | |
| | Baking mix | |
| Upset stomach | Ginger candy | |
| | Sweetener | |
Categories of health issues in the grocery reviews.
| CATEGORY | EXAMPLES |
|---|---|
| Adverse effect | • |
| Benefit of product | • |
| First-person health problem | • |
| Third-person health problem | • |
| Risk to health | • |
| Unclassified/unrelated to product | • |
Figure 5. Proportions of the health issue categories.
Notes: One hundred concept phrases were sampled among those that were manually confirmed as relevant (health-related), and they were manually reviewed and categorized.
False positives (FPs) of NLP extraction.
| FP TYPE | EXPLANATION | EXAMPLE |
|---|---|---|
| Ambiguity | Idiomatic expression for complaint of trouble | |
| | “Worms” is used as a synonym of C0018889 Helminthiasis, a type of parasite infection | |
| | Patient is “dehydrated” versus the dehydration process in food industry | |
| | C0030587 Paroxysmal atrial tachycardia can be abbreviated as “pat” | |
| Typo correction error | NLP considered “nostalgia” a typo of “notalgia”, which is a synonym of C0004604 Back pain | |
| | NLP considered “stoke” a typo of “stroke”, a synonym of C0038454 Cerebrovascular accident | |
| Not on human | On cat | |
| | On dog | |
| Semantic modifier | Analogy that refers to taste | |
| | Part of organization name, not explicitly referring to problem | |