Sattam Almatarneh, Pablo Gamallo.
Abstract
Studies in sentiment analysis and opinion mining have focused on many aspects of opinions, notably polarity classification into positive, negative, or neutral values. However, most studies have overlooked the identification of extreme opinions (the most negative and the most positive ones), despite their significance in many applications. We use an unsupervised approach to search for extreme opinions, based on the automatic construction of a new lexicon containing the most negative and most positive words.
Year: 2018 PMID: 29799867 PMCID: PMC5969751 DOI: 10.1371/journal.pone.0197816
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Hypothetical continuous distribution of negative, neutral and positive views on a scale from 1 to 5, according to the borderline between stars.
Fig 2. Algorithm to assign the most negative classification to an input document.
Fig 3. Algorithm to assign the most positive classification to an input document.
A sample of the collection format for the word (“bad”, a) in each category.
| Word | Tag | Category | Freq | Total | Corpus |
|---|---|---|---|---|---|
| bad | a | 1 | 122232 | 25395214 | IMDB |
| bad | a | 2 | 40491 | 11755132 | IMDB |
| bad | a | 3 | 37787 | 13995838 | IMDB |
| bad | a | 4 | 33070 | 14963866 | IMDB |
| bad | a | 5 | 39205 | 20390515 | IMDB |
| bad | a | 6 | 43101 | 27420036 | IMDB |
| bad | a | 7 | 46696 | 40192077 | IMDB |
| bad | a | 8 | 42228 | 48723444 | IMDB |
| bad | a | 9 | 29588 | 40277743 | IMDB |
| bad | a | 10 | 51778 | 73948447 | IMDB |
| bad | a | 1 | 2100 | 3419923 | Goodreads |
| bad | a | 2 | 1956 | 3912625 | Goodreads |
| bad | a | 3 | 2780 | 6011388 | Goodreads |
| bad | a | 4 | 2298 | 10187257 | Goodreads |
| bad | a | 5 | 2119 | 16202230 | Goodreads |
| bad | a | 1 | 1127 | 699695 | OpenTable |
| bad | a | 2 | 2595 | 2507147 | OpenTable |
| bad | a | 3 | 2859 | 4207700 | OpenTable |
| bad | a | 4 | 2544 | 7789649 | OpenTable |
| bad | a | 5 | 1905 | 8266564 | OpenTable |
| bad | a | 1 | 1241 | 3419923 | Amazon/Tripadvisor |
| bad | a | 2 | 791 | 3912625 | Amazon/Tripadvisor |
| bad | a | 3 | 870 | 6011388 | Amazon/Tripadvisor |
| bad | a | 4 | 1301 | 10187257 | Amazon/Tripadvisor |
| bad | a | 5 | 2025 | 16202230 | Amazon/Tripadvisor |
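The raw counts in this table are only comparable across categories once each frequency is divided by the category total. The following is a minimal sketch of that normalization for the IMDB rows, assuming `Total` is the number of tokens in the category; the names `imdb_rows` and `relative_frequency` are hypothetical, and this illustrates the normalization only, not the authors' full lexicon-construction procedure.

```python
# (category, freq, total) for the word "bad" tagged "a" in the IMDB corpus,
# copied from the table above.
imdb_rows = [
    (1, 122232, 25395214),
    (2, 40491, 11755132),
    (3, 37787, 13995838),
    (4, 33070, 14963866),
    (5, 39205, 20390515),
    (6, 43101, 27420036),
    (7, 46696, 40192077),
    (8, 42228, 48723444),
    (9, 29588, 40277743),
    (10, 51778, 73948447),
]


def relative_frequency(freq, total):
    """Share of a category's tokens accounted for by the word (assumed definition)."""
    return freq / total


# Relative frequency of "bad" per rating category.
rel = {cat: relative_frequency(freq, total) for cat, freq, total in imdb_rows}

for cat in sorted(rel):
    print(f"category {cat:2d}: {rel[cat] * 1e6:8.1f} per million tokens")
```

Under this normalization, "bad" is relatively most frequent in category 1 and least frequent in category 10, which the raw counts alone do not show.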
Negative lexicons: Total number of words (adjectives and adverbs) for each lexicon, and number of words for each class (MN and NMN) in each lexicon.
| Lexicon | All: ADJ | All: ADV | All: Total | MN: ADJ | MN: ADV | MN: Total | NMN: ADJ | NMN: ADV | NMN: Total |
|---|---|---|---|---|---|---|---|---|---|
| VERY-NEG B = 1 | 11670 | 2790 | 14460 | 4178 | 1092 | 5270 | 7492 | 1698 | 9190 |
| VERY-NEG B = 2 | 11557 | 2771 | 14328 | 4966 | 1266 | 6232 | 6591 | 1505 | 8096 |
| SO-CAL NP1 | 2826 | 876 | 3702 | 189 | 62 | 251 | 2637 | 814 | 3451 |
| SO-CAL NP2 | 2826 | 876 | 3702 | 536 | 135 | 671 | 2290 | 741 | 3031 |
| SO-CAL NP3 | 2826 | 876 | 3702 | 1080 | 289 | 1369 | 1746 | 587 | 2333 |
| SO-CAL NP4 | 2826 | 876 | 3702 | 1576 | 429 | 2005 | 1250 | 447 | 1697 |
| SentiWords NP1 | 13425 | 2811 | 16236 | 156 | 4 | 160 | 13269 | 2807 | 16076 |
| SentiWords NP2 | 13425 | 2811 | 16236 | 1132 | 24 | 1156 | 12293 | 2787 | 15080 |
| SentiWords NP3 | 13425 | 2811 | 16236 | 4016 | 189 | 4205 | 9409 | 2622 | 12031 |
| SentiWords NP4 | 13425 | 2811 | 16236 | 7612 | 540 | 8152 | 5813 | 2271 | 8084 |
Positive lexicons: Total number of words (adjectives and adverbs) for each lexicon, and number of words for each class (MP and NMP) in each lexicon.
| Lexicon | All: ADJ | All: ADV | All: Total | MP: ADJ | MP: ADV | MP: Total | NMP: ADJ | NMP: ADV | NMP: Total |
|---|---|---|---|---|---|---|---|---|---|
| VERY-POS B = 1 | 11402 | 2769 | 14171 | 4721 | 1163 | 5884 | 6681 | 1606 | 8287 |
| VERY-POS B = 2 | 11472 | 2772 | 14244 | 5753 | 1339 | 7092 | 5719 | 1433 | 7152 |
| SO-CAL PP1 | 2826 | 876 | 3702 | 239 | 75 | 314 | 2587 | 801 | 3388 |
| SO-CAL PP2 | 2826 | 876 | 3702 | 512 | 167 | 679 | 2314 | 709 | 3023 |
| SO-CAL PP3 | 2826 | 876 | 3702 | 835 | 292 | 1127 | 2155 | 628 | 2783 |
| SO-CAL PP4 | 2826 | 876 | 3702 | 1250 | 447 | 1697 | 1576 | 429 | 2005 |
| SentiWords PP1 | 13425 | 2811 | 16236 | 130 | 13 | 143 | 13295 | 2798 | 16093 |
| SentiWords PP2 | 13425 | 2811 | 16236 | 581 | 34 | 615 | 12844 | 2777 | 15621 |
| SentiWords PP3 | 13425 | 2811 | 16236 | 2418 | 250 | 2668 | 11007 | 2561 | 13568 |
| SentiWords PP4 | 13425 | 2811 | 16236 | 5813 | 2271 | 8084 | 7612 | 540 | 8152 |
Size of the five test datasets and the total number of reviews in each class (MN vs. NMN) and (MP vs. NMP).
| Datasets | # of Reviews | MN | NMN | MP | NMP |
|---|---|---|---|---|---|
| | 2000 | 522 | 1478 | 731 | 1269 |
| | 2000 | 530 | 1470 | 714 | 1286 |
| | 2000 | 666 | 1334 | 680 | 1320 |
| | 2000 | 687 | 1313 | 754 | 1246 |
| | 50000 | 14708 | 35292 | 14338 | 35662 |
Polarity classification results for all collections with the SO-CAL lexicon, in terms of precision (P), recall (R) and F1 scores for most negative (MN) and other (NMN) documents.
The best F1 for the most negative class in each dataset is highlighted (in bold).
| Dataset | NP1 P | NP1 R | NP1 F1 | NP2 P | NP2 R | NP2 F1 | NP3 P | NP3 R | NP3 F1 | NP4 P | NP4 R | NP4 F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.36 | 0.06 | 0.10 | 0.47 | 0.13 | 0.20 | 0.50 | 0.26 | 0.34 | 0.46 | 0.50 | |
| | 0.60 | 0.10 | 0.17 | 0.58 | 0.18 | 0.28 | 0.56 | 0.31 | 0.40 | 0.48 | 0.51 | |
| | 0.57 | 0.13 | 0.21 | 0.62 | 0.20 | 0.31 | 0.62 | 0.29 | 0.39 | 0.55 | 0.49 | |
| | 0.59 | 0.10 | 0.17 | 0.64 | 0.19 | 0.29 | 0.66 | 0.29 | 0.40 | 0.57 | 0.48 | |
| | 0.13 | 0.03 | 0.05 | 0.30 | 0.14 | 0.19 | 0.40 | 0.30 | 0.34 | 0.42 | 0.55 | |
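F1 is the harmonic mean of precision and recall, so any F1 value in these result tables can be cross-checked from its P and R. A minimal sketch follows; the helper name `f1_score` is hypothetical, and the triples are the first three (P, R, F1) groups of the first data row of the table above. Because the published P and R are already rounded to two decimals, a recomputed F1 may differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) triples copied from the first data row of the table above.
reported = [
    (0.36, 0.06, 0.10),
    (0.47, 0.13, 0.20),
    (0.50, 0.26, 0.34),
]

for p, r, f1 in reported:
    recomputed = f1_score(p, r)
    print(f"P={p:.2f} R={r:.2f} reported F1={f1:.2f} recomputed F1={recomputed:.3f}")
```

The same check applies to the most positive (MP) tables, since the metric is computed identically for either target class.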
Polarity classification results for all collections with the SentiWords lexicon, in terms of precision (P), recall (R) and F1 scores for most negative (MN) and other (NMN) documents.
The best F1 for the most negative class in each dataset is highlighted (in bold).
| Dataset | NP1 P | NP1 R | NP1 F1 | NP2 P | NP2 R | NP2 F1 | NP3 P | NP3 R | NP3 F1 | NP4 P | NP4 R | NP4 F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.42 | 0.01 | 0.02 | 0.35 | 0.01 | 0.03 | 0.28 | 0.04 | 0.07 | 0.24 | 0.43 | |
| | 0.33 | 0.01 | 0.01 | 0.53 | 0.03 | 0.06 | 0.58 | 0.13 | 0.22 | 0.49 | 0.41 | |
| | 0.26 | 0.01 | 0.01 | 0.37 | 0.02 | 0.03 | 0.63 | 0.18 | 0.28 | 0.57 | 0.49 | |
| | 0.36 | 0.01 | 0.01 | 0.56 | 0.01 | 0.03 | 0.71 | 0.17 | 0.27 | 0.62 | 0.45 | |
| | 0.09 | 0.00 | 0.00 | 0.31 | 0.01 | 0.01 | 0.32 | 0.05 | 0.08 | 0.44 | 0.25 | |
Polarity classification results for all collections with the VERY-NEG lexicon, in terms of precision (P), recall (R) and F1 scores for most negative (MN) and other (NMN) documents.
The best F1 for the most negative class in each dataset is highlighted (in bold).
| Dataset | B = 1: P | B = 1: R | B = 1: F1 | B = 2: P | B = 2: R | B = 2: F1 |
|---|---|---|---|---|---|---|
| | 0.42 | 0.64 | 0.51 | 0.40 | 0.80 | |
| | 0.43 | 0.76 | | 0.88 | 0.88 | 0.53 |
| | 0.50 | 0.80 | | 0.45 | 0.86 | 0.59 |
| | 0.52 | 0.70 | | 0.47 | 0.80 | 0.59 |
| | 0.42 | 0.77 | | 0.39 | 0.89 | |
Fig 4. The best performance (F1) obtained by all lexicons on all datasets for identifying most negative documents (MN vs NMN).
Polarity classification results for all collections with the SO-CAL lexicon, in terms of precision (P), recall (R) and F1 scores for most positive (MP) and other (NMP) documents.
The best F1 for the most positive class in each dataset is highlighted (in bold).
| Dataset | PP1 P | PP1 R | PP1 F1 | PP2 P | PP2 R | PP2 F1 | PP3 P | PP3 R | PP3 F1 | PP4 P | PP4 R | PP4 F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.61 | 0.17 | 0.27 | 0.54 | 0.34 | 0.42 | 0.52 | 0.55 | 0.53 | 0.41 | 0.94 | |
| | 0.66 | 0.21 | 0.32 | 0.58 | 0.38 | 0.46 | 0.54 | 0.56 | 0.55 | 0.41 | 0.95 | |
| | 0.54 | 0.26 | 0.35 | 0.51 | 0.40 | 0.45 | 0.49 | 0.60 | 0.54 | 0.38 | 0.94 | |
| | 0.53 | 0.23 | 0.32 | 0.53 | 0.36 | 0.43 | 0.50 | 0.55 | 0.52 | 0.42 | 0.97 | |
| | 0.75 | 0.11 | 0.20 | 0.60 | 0.29 | 0.39 | 0.52 | 0.49 | 0.50 | 0.35 | 0.94 | |
Polarity classification results for all collections with the SentiWords lexicon, in terms of precision (P), recall (R) and F1 scores for most positive (MP) and other (NMP) documents.
The best F1 for the most positive class in each dataset is highlighted (in bold).
| Dataset | PP1 P | PP1 R | PP1 F1 | PP2 P | PP2 R | PP2 F1 | PP3 P | PP3 R | PP3 F1 | PP4 P | PP4 R | PP4 F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.76 | 0.06 | 0.12 | 0.66 | 0.13 | 0.22 | 0.60 | 0.38 | 0.46 | 0.40 | 0.93 | |
| | 0.65 | 0.07 | 0.21 | 0.64 | 0.13 | 0.22 | 0.59 | 0.38 | 0.46 | 0.39 | 0.92 | |
| | 0.70 | 0.11 | 0.19 | 0.71 | 0.19 | 0.30 | 0.63 | 0.41 | 0.50 | 0.40 | 0.93 | |
| | 0.61 | 0.07 | 0.13 | 0.63 | 0.17 | 0.27 | 0.65 | 0.37 | 0.47 | 0.43 | 0.94 | |
| | 0.64 | 0.01 | 0.03 | 0.63 | 0.05 | 0.09 | 0.55 | 0.27 | 0.36 | 0.31 | 0.95 | |
Polarity classification results for all collections with the VERY-POS lexicon, in terms of precision (P), recall (R) and F1 scores for most positive (MP) and other (NMP) documents.
The best F1 for the most positive class in each dataset is highlighted (in bold).
| Dataset | B = 1: P | B = 1: R | B = 1: F1 | B = 2: P | B = 2: R | B = 2: F1 |
|---|---|---|---|---|---|---|
| | 0.67 | 0.55 | 0.61 | 0.61 | 0.67 | |
| | 0.68 | 0.49 | 0.57 | 0.63 | 0.61 | |
| | 0.63 | 0.42 | 0.50 | 0.57 | 0.52 | |
| | 0.63 | 0.43 | 0.51 | 0.60 | 0.60 | |
| | 0.63 | 0.41 | 0.50 | 0.55 | 0.58 | |
Fig 5. The best performance (F1) obtained by all lexicons on all datasets for identifying the most positive documents.