Jorge Carrillo-de-Albornoz, Javier Rodríguez Vidal, Laura Plaza.
Abstract
INTRODUCTION: Exploiting the information in health-related social media services is of great interest to patients, researchers and medical companies. The challenge, however, is to provide easy, quick and relevant access to the vast amount of information available. One step towards facilitating access to online health data is opinion mining. Although the classification of patient opinions into positive and negative has been tackled before, most works rely on machine learning methods and bag-of-words representations. Our first contribution is an extensive evaluation of different features, including lexical, syntactic, semantic, network-based, sentiment-based and word-embedding features, for representing patient-authored texts for polarity classification. The second contribution of this work is the study of polar facts (i.e. objective information with polar connotations). Traditionally, polar facts have been neglected and research in polarity classification has been limited to opinionated texts. We demonstrate the existence and importance of polar facts for the polarity classification of health information.
Year: 2018 PMID: 30496232 PMCID: PMC6264154 DOI: 10.1371/journal.pone.0207996
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Examples of sentences from the eDiseases dataset according to their factuality and polarity.
Distribution of sentences into information types (“Facts”, “Experiences” and “Opinions”).
| | Facts | Experiences | Opinions |
|---|---|---|---|
| | 267 | 348 | 271 |
| | 273 | 931 | 389 |
| | 225 | 278 | 310 |
| Total | 765 | 1,557 | 970 |
Distribution of sentences into polarity classes (“Positive”, “Negative” and “Neutral”).
| | Positive | Negative | Neutral |
|---|---|---|---|
| | 162 | 294 | 499 |
| | 302 | 614 | 712 |
| | 171 | 216 | 475 |
| Total | 635 | 1,079 | 1,686 |
Fig 1. Distribution of sentences into polarity classes (“Positive”, “Negative” and “Neutral”) and factuality classes (“Fact”, “Opinion” and “Experience”).
Percent inter-annotator agreement for the three polarity labels and the three diseases.
| | Positive | Negative | Neutral |
|---|---|---|---|
| | 79% | 76% | 71% |
| | 68% | 68% | 79% |
| | 77% | 70% | 79% |
| Average | 75% | 71% | 76% |
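The table above reports percent agreement per label, but the exact formula is not given here. As a rough illustration only, the following sketch computes an overall percent agreement between two annotators and one possible per-label variant (items both annotators gave a label, divided by items at least one of them gave it). Both function names and the toy annotations are hypothetical, and the per-label definition is an assumption rather than the authors' stated method.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("no items to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def per_label_agreement(labels_a, labels_b):
    """Assumed per-label variant: for each label, the number of items both
    annotators gave that label divided by the number of items at least one
    annotator gave that label."""
    both = Counter()
    either = Counter()
    for a, b in zip(labels_a, labels_b):
        if a == b:
            both[a] += 1
            either[a] += 1
        else:
            either[a] += 1
            either[b] += 1
    return {label: both[label] / either[label] for label in either}


# Hypothetical toy annotations, not taken from the eDiseases dataset.
annotator_1 = ["Positive", "Negative", "Neutral", "Neutral", "Negative"]
annotator_2 = ["Positive", "Neutral", "Neutral", "Neutral", "Negative"]

overall = percent_agreement(annotator_1, annotator_2)
by_label = per_label_agreement(annotator_1, annotator_2)
```

Raw percent agreement does not correct for chance agreement, so it is usually read alongside the class distribution.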
Percent inter-annotator agreement for the three factuality labels and the three diseases.
| Experience | Opinion | Fact | |
|---|---|---|---|
| 86% | 69% | 65% | |
| 88% | 65% | 72% | |
| 77% | 70% | 79% | |
| 84% | 68% | 72% |
Feature comparison for the allergies domain.
Results are reported in Accuracy and F-measure. Best results are indicated in bold. For each group of experiments, significance of the best combination of features with respect to the baseline (BoW, W2V and C2V, respectively) is calculated (FDR<0.001***, FDR<0.01**, FDR<0.05*).
| Feature | SMO Acc | SMO F-1 | RF Acc | RF F-1 | NB Acc | NB F-1 | Vote Acc | Vote F-1 |
|---|---|---|---|---|---|---|---|---|
| BOW | 56,9 | 56,3 | 62,1 | 57,6 | 64 | 62,9 | 63,9 | 62,6 |
| BOW + ST | 59,6 | 59,2*** | 63,5 | 59,6 | 64,1 | 63 | 63,4 | 62,1 |
| BOW + ST + Pst | 59,6 | 59,7*** | 62,9 | 58,4 | 64,3 | 63,1 | 63,6 | 62,3 |
| BOW + ST + Pst + Net | 59,5 | 59,7*** | 61,5 | 57,2 | 64,2 | 63 | 63,4 | 62 |
| BOW + ST + Pst + Net + SA | 60,1 | 59,5*** | 61,8 | 57,6 | 65,1 | 64,2 | 65 | 64 |
| BOW + ST + Pst + Net + SA + Gramm | 61,2 | 60,8*** | 64,4 | 61,1 | 65,2 | 64,4 | 65,9 | 64,9 |
| BOW + ST + Pst + Net + SA + Gramm + Fact | 64,7 | 64,8*** | 62,2 | 59,6 | 65 | 64,1 | ||
| W2V | 66,2 | 65 | 63,5 | 59,1 | 53,9 | 44,2 | 63,9 | 61,7 |
| W2V + ST | 66,8 | 66 | 63,8 | 59,4 | 59,9 | 54,3 | 64,4 | 62 |
| W2V + ST + Pst | 64,5 | 63,6 | 61,1 | 57,1 | 58,1 | 53,4 | 63,4 | 61 |
| W2V + ST + Pst + Net | 64,6 | 63,7 | 62,9 | 59 | 57,9 | 53,3 | 63,7 | 61,2 |
| W2V + ST + Pst + Net + SA | 67 | 66,2 | 63,2 | 59,5 | 59,3 | 55,6 | 64,6 | 62,6 |
| W2V + ST + Pst + Net + SA + Gramm | 66,6 | 65,8 | 61,8 | 57,3 | 58,6 | 55,4 | 64,6 | 62,6 |
| W2V + ST + Pst + Net + SA + Gramm + Fact | 63,1 | 59,4 | 61,8 | 59,6 | 65 | 63,3 | ||
| C2V | 62,3 | 60,8 | 61,7 | 57,3 | 54,3 | 45,3 | 61,5 | 58,5 |
| C2V + ST | 65,9 | 65,2** | 61,8 | 57,4 | 57,9 | 53,3 | 64,1 | 61,7 |
| C2V + ST + Pst | 66,2 | 65,6** | 61,2 | 57,1 | 58,1 | 53,4 | 63,4 | 61 |
| C2V + ST + Pst + Net | 66,1 | 65,4** | 62,9 | 59 | 57,9 | 53,3 | 63,7 | 61,2 |
| C2V + ST + Pst + Net + SA | 63,2 | 59,5 | 59,3 | 55,6 | 64,6 | 62,6 | ||
| C2V + ST + Pst + Net + SA + Gramm | 62,4 | 61,6 | 61 | 56,1 | 59,2 | 56,2 | 62,8 | 60,8 |
| C2V + ST + Pst + Net + SA + Gramm + Fact | 63,2 | 62,4 | 63,4 | 59,4 | 61,8 | 59,8 | 63,4 | 61,5 |
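The caption of the table above expresses significance as FDR thresholds but does not name the correction procedure. A common choice is the Benjamini-Hochberg adjustment, so the sketch below assumes it; the p-values are invented for illustration, and the helper names `benjamini_hochberg` and `stars` are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def stars(q):
    """Map an adjusted p-value to the star notation used in the captions."""
    if q < 0.001:
        return "***"
    if q < 0.01:
        return "**"
    if q < 0.05:
        return "*"
    return ""


# Invented raw p-values, one per hypothetical comparison against a baseline.
raw_p = [0.0001, 0.004, 0.03, 0.2]
marks = [stars(q) for q in benjamini_hochberg(raw_p)]
```

Adjusting all comparisons within a group together, rather than one at a time, is what makes the thresholds control the false discovery rate.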
Feature comparison for the breast cancer domain.
Results are reported in Accuracy and F-measure. Best results are indicated in bold. For each group of experiments, significance of the best combination of features with respect to the baseline (BoW, W2V and C2V, respectively) is calculated (FDR<0.001***, FDR<0.01**, FDR<0.05*).
| Feature | SMO Acc | SMO F-1 | RF Acc | RF F-1 | NB Acc | NB F-1 | Vote Acc | Vote F-1 |
|---|---|---|---|---|---|---|---|---|
| BOW | 57,1 | 47,7 | 57,6 | 46,1 | 58,1 | 49,4 | 58,4 | 48,9 |
| BOW + ST | 60,9 | 60,1*** | 64,4 | 59,4 | 63,9 | 62,9 | 63,4 | 61,5 |
| BOW + ST + Pst | 61,8 | 59,7*** | 64,8 | 59,9 | 63,6 | 62,6 | 63,9 | 62,4 |
| BOW + ST + Pst + Net | 59,8 | 59,3*** | 64,7 | 59,9 | 63,5 | 62,7 | 62,9 | 61,3 |
| BOW + ST + Pst + Net + SA | 62,1 | 61,6*** | 64,2 | 59 | 65,3 | 64,5 | ||
| BOW + ST + Pst + Net + SA + Gramm | 56,6 | 47,1 | 57,2 | 55,5 | 57,6 | 49,2 | 59,3 | 52,4 |
| BOW + ST + Pst + Net + SA + Gramm + Fact | 57,1 | 49,2 | 56,7 | 55,3 | 57,5 | 49,7 | 59,7 | 53,6 |
| W2V | 65,9 | 65,1 | 62,8 | 56,5 | 57,5 | 46,4 | 66,5 | 63,5 |
| W2V + ST | 66,2 | 65,2 | 63,1 | 56,4 | 60,3 | 62,9 | 67 | 63,4 |
| W2V + ST + Pst | 66,5 | 65,9 | 61,3 | 53,8 | 60,6 | 62,6 | 67,3 | 63 |
| W2V + ST + Pst + Net | 66,1 | 65,4 | 62,3 | 55,2 | 59,9 | 62,7 | 67,4 | 63,4 |
| W2V + ST + Pst + Net + SA | 68,4 | 66,5 | 62,6 | 55,7 | 61,9 | 64,5 | ||
| W2V + ST + Pst + Net + SA + Gramm | 66 | 65,8 | 62,9 | 56,5 | 62,6 | 49,2 | 68,2 | 65,9 |
| W2V + ST + Pst + Net + SA + Gramm + Fact | 64,7 | 64,5 | 63,3 | 56,9 | 62,6 | 49,7 | 68,5 | 66,4 |
| C2V | 62,6 | 60,8 | 61,2 | 55 | 57,4 | 45,6 | 63,8 | 59,6 |
| C2V + ST | 65 | 65,2** | 63,1 | 56,4 | 63,9 | 53,8 | 65 | 61,4 |
| C2V + ST + Pst | 66,5 | 65,9** | 61,3 | 53,8 | 63,7 | 54,4 | 65,3 | 62 |
| C2V + ST + Pst + Net | 66,1 | 65,4** | 62,3 | 55,2 | 63,5 | 54,1 | 66,4 | 63,4 |
| C2V + ST + Pst + Net + SA | 62,6 | 55,7 | 65,3 | 57,6 | 67,7 | 65,3 | ||
| C2V + ST + Pst + Net + SA + Gramm | 62,4 | 61,6* | 63,1 | 56,5 | 57,6 | 58,8 | 64,8 | 62 |
| C2V + ST + Pst + Net + SA + Gramm + Fact | 63,2 | 62,4* | 63,3 | 57 | 57,5 | 59,5 | 65,3 | 62,5 |
Feature comparison for the Crohn domain.
Results are reported in Accuracy and F-measure. Best results are indicated in bold. For each group of experiments, significance of the best combination of features with respect to the baseline (BoW, W2V and C2V, respectively) is calculated (FDR<0.001***, FDR<0.01**, FDR<0.05*).
| Feature | SMO Acc | SMO F-1 | RF Acc | RF F-1 | NB Acc | NB F-1 | Vote Acc | Vote F-1 |
|---|---|---|---|---|---|---|---|---|
| BOW | 57,9 | 56,8 | 61,7 | 60,7 | 63,1 | 62,4 | 64 | 63,3 |
| BOW + ST | 61,2 | 59,8 | 62 | 60,8 | 63,1 | 62,3 | 64,5 | 63,7 |
| BOW + ST + Pst | 61,2 | 59,8 | 63,1 | 61,5 | 62,8 | 62 | 64,1 | 63,3 |
| BOW + ST + Pst + Net | 61,2 | 59,8 | 62,7 | 61 | 63 | 62,2 | 64,1 | 63,3 |
| BOW + ST + Pst + Net + SA | 63,8 | 62,8 | 64,3 | 61,8 | 64 | 63,1 | 64,4 | 63,6 |
| BOW + ST + Pst + Net + SA + Gramm | 64,4 | 63,4* | 63,6 | 61,9 | 64,6 | 63,7 | 65,4 | 64,5 |
| BOW + ST + Pst + Net + SA + Gramm + Fact | 64,6 | 62,8 | 63,8 | 62,7 | 66 | 65,2 | ||
| W2V | 65,8 | 65,4 | 61,9 | 59,3 | 52,2 | 46 | 64,4 | 62,9 |
| W2V + ST | 66,2 | 65,9 | 62 | 59,4 | 54,8 | 50,2 | 64,9 | 63,4 |
| W2V + ST + Pst | 65,7 | 65,4 | 63,3 | 60,5 | 55 | 50,5 | 64,7 | 63,4 |
| W2V + ST + Pst + Net | 65,2 | 64,9 | 61,8 | 59,2 | 55 | 50,5 | 66 | 64,7 |
| W2V + ST + Pst + Net + SA | 67 | 66,3 | 61,8 | 59,2 | 56,1 | 56,1 | 66,2 | 65,2 |
| W2V + ST + Pst + Net + SA + Gramm | 66,4 | 66,3 | 63,1 | 60,3 | 59 | 59 | 65,8 | 64,5 |
| W2V + ST + Pst + Net + SA + Gramm + Fact | 62,8 | 60,3 | 56,9 | 56,9 | 65,2 | 64,1 | ||
| C2V | 61,2 | 60,6 | 59,1 | 56,9 | 50,3 | 50,3 | 60,5 | 58,8 |
| C2V + ST | 65,2 | 64,9*** | 62 | 59,4 | 54,8 | 54,8 | 64,9 | 63,4 |
| C2V + ST + Pst | 65,7 | 65,4*** | 63,3 | 60,5 | 55 | 55 | 64,7 | 63,4 |
| C2V + ST + Pst + Net | 65,2 | 64,9*** | 61,8 | 59,2 | 55 | 55 | 65,9 | 64,7 |
| C2V + ST + Pst + Net + SA | 61,8 | 59,2 | 56,1 | 56,1 | 66,2 | 65,2 | ||
| C2V + ST + Pst + Net + SA + Gramm | 62,9 | 62,5** | 60,1 | 57,5 | 55,9 | 55,9 | 62,7 | 61,4 |
| C2V + ST + Pst + Net + SA + Gramm + Fact | 62,9 | 62,6** | 60,1 | 57,7 | 58,1 | 58,1 | 64 | 62,7 |
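The feature-comparison tables use decimal commas and attach significance stars directly to some F-1 values, and a few rows contain empty cells. For readers who want to reuse the numbers, here is a minimal parsing sketch; `parse_cell` and `parse_row` are hypothetical helpers, and empty cells are returned as `None` rather than guessed.

```python
import re

# A number with an optional decimal comma, followed by up to three stars.
CELL = re.compile(r"^(\d+(?:,\d+)?)\s*(\*{0,3})$")


def parse_cell(cell):
    """Return (value, stars); an empty cell yields (None, "")."""
    text = cell.strip()
    if not text:
        return None, ""
    match = CELL.match(text)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    number, marks = match.groups()
    return float(number.replace(",", ".")), marks


def parse_row(line):
    """Split one pipe-delimited table row into its label and parsed cells."""
    text = line.strip()
    # Drop exactly one outer pipe on each side so empty cells are preserved.
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    cells = text.split("|")
    return cells[0].strip(), [parse_cell(c) for c in cells[1:]]


row = "| BOW + ST + Pst + Net + SA + Gramm | 64,4 | 63,4* | 63,6 | 61,9 | 64,6 | 63,7 | 65,4 | 64,5 |"
feature, values = parse_row(row)

sparse = "| W2V + ST + Pst + Net + SA + Gramm + Fact | 62,8 | 60,3 | 56,9 | 56,9 | 65,2 | 64,1 | ||"
sparse_feature, sparse_values = parse_row(sparse)
```

In rows with empty cells the parser cannot tell which column a value originally belonged to, so such rows should be checked by hand before use.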
Comparison between diseases (allergies, Crohn and breast cancer) in average accuracy for the SMO classifier.
| Feature | Allergies | Crohn | Breast cancer |
|---|---|---|---|
| BoW | 56,9 | 57,9 | 57,1 |
| W2V | 66,2 | 65,8 | 65,9 |
| C2V | 62,3 | 61,2 | 62,6 |
| Best combination of features | 68,1 | 67,2 | 68,4 |
| Majority baseline | 53,8 | 43,6 | 56,7 |
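A majority baseline is conventionally the accuracy obtained by always predicting the most frequent class. The sketch below shows that calculation on the counts from the first row of the polarity distribution table; the disease label for that row is not available here and the evaluation setup is not described, so the result is illustrative and is not expected to reproduce any particular entry of the table above. The function name `majority_baseline` is hypothetical.

```python
def majority_baseline(class_counts):
    """Return the most frequent class and its share of all items, in percent."""
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("no items to count")
    label, count = max(class_counts.items(), key=lambda item: item[1])
    return label, 100.0 * count / total


# Counts copied from the first row of the polarity distribution table.
first_row = {"Positive": 162, "Negative": 294, "Neutral": 499}
label, share = majority_baseline(first_row)
```

Comparing a classifier's accuracy against this share shows how much it gains over ignoring the text entirely.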
Average F-1 by class for the SMO-W2V classifier (Resample).
Best results are indicated in bold.
| Class | Allergies | Crohn | Breast cancer |
|---|---|---|---|
| Neutral | 72,5 | 74,6 | |
| Positive | 71,8 | ||
| Negative | 74,9 | 72,18 | 80,9 |
| Average | 76,6 | 75,1 | 79,0 |
Classification results for the SMO classifier when data for the three diseases are combined.
Best results are indicated in bold.
| Feature | Acc | F-1 |
|---|---|---|
| W2V | 67,2 | 66,3 |
| W2V + ST + Pst + Net + SA + Gramm + Fact | 67,3 | 66,7 |
| W2V - | 70,9 | 71,0 |
| W2V + ST + Pst + Net + SA + Gramm + Fact - | 73,1 | 73,2 |
| C2V | 64,2 | 62,8 |
| C2V + ST + Pst + Net + SA + Gramm + Fact | 64,9 | 64,2 |
| C2V - | 67,9 | 67,9 |
| C2V + ST + Pst + Net + SA + Gramm + Fact - | 70,7 | 70,8 |
F-1 by factuality class and disease for the W2V classifier.
Best results are indicated in bold.
| Disease | Facts | Experiences | Opinions |
|---|---|---|---|
| 79,7 | 62,0 | ||
| 76,1 | 78,1 | ||
| 74,5 | 67,5 |
Average F-1 by class for the SMO-W2V classifier (Resample).
Best results are indicated in bold.
| Class | Allergies | Crohn | Breast cancer |
|---|---|---|---|
| Neutral | 72,7 | 71,6 | 73,2 |
| Positive | 79,7 | ||
| Negative | 77,3 | 71,7 | |
| Total | 77,5 | 73,4 | 77,8 |