Qufei Chen, Marina Sokolova.
Abstract
We analyze the performance of unsupervised embedding algorithms in sentiment analysis of knowledge-rich data sets. We apply the state-of-the-art embedding algorithms Word2Vec and Doc2Vec as the learning techniques. The algorithms build word and document embeddings in an unsupervised manner. To assess the algorithms' performance, we define sentiment metrics and use the semantic lexicon SentiWordNet (SWN) to establish the benchmark measures. Our empirical results are obtained on the Obesity data set from i2b2 clinical discharge summaries and on the Reuters Science data set. We use Welch's test to analyze the obtained sentiment evaluation. On the Obesity data, Welch's test found a significant difference between the SWN evaluations of the most positive and most negative texts. On the same data, the Word2Vec results support the SWN results, whereas the Doc2Vec results correspond only partially to the Word2Vec and SWN results. On the Reuters data, Welch's test did not find a significant difference between the SWN evaluations of the most positive and most negative texts. On the same data, the Word2Vec and Doc2Vec results correspond only in part to the SWN results. In unsupervised sentiment analysis of medical and scientific texts, the Word2Vec sentiment analysis has been more consistent with the SentiWordNet sentiment assessment than the Doc2Vec sentiment analysis. Welch's test of the SentiWordNet results has been a strong indicator of future correspondence between the Word2Vec and SentiWordNet results.
Keywords: Clinical discharge summaries; Doc2Vec; Reuters science data; Unsupervised sentiment analysis; Word2Vec
Year: 2021 PMID: 34414378 PMCID: PMC8364743 DOI: 10.1007/s42979-021-00807-1
Source DB: PubMed Journal: SN Comput Sci ISSN: 2661-8907
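The abstract relies on Welch's test to compare the SWN evaluations of the most positive and most negative texts. As a minimal sketch of that statistic (not the authors' code), the function below computes Welch's t value and the Welch-Satterthwaite degrees of freedom for two samples with unequal variances. The function name `welch_t` and the two score lists are hypothetical illustrations; the record does not give per-text scores.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors, using sample variances (n - 1 denominator).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical per-text overall scores for two groups of texts (made-up values).
most_negative = [1.0, 2.0, 3.0, 4.0, 5.0]
most_positive = [2.0, 4.0, 6.0, 8.0, 10.0]
t_value, dof = welch_t(most_negative, most_positive)
```

The p-value then follows from a Student t distribution with `dof` degrees of freedom; in practice a library routine such as SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test.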
The data sets’ stylometrics parameters (K = 1000)
| Data | # of texts | # of subsets | Tokens | Types | Rare words: hapax legomena | Rare words: dis legomena | Tokens matching SWN vocabulary | Tokens matching SWN vocabulary (%) |
|---|---|---|---|---|---|---|---|---|
| Obesity | 1237 | 12 | 804 K | 32.1 K | 19.1 K | 2.8 K | 260.5 K | 32.04% |
| Science | 3949 | 4 | 460 K | 36.8 K | 17.3 K | 5.4 K | 304.0 K | 66.10% |
The i2b2 data subsets
| Annotated disease | Number of summaries | Digital ID |
|---|---|---|
| Obesity | 443 | 1 |
| Hypertension | 816 | 2 |
| Diabetes | 737 | 3 |
| Obesity and hypertension | 332 | 4 |
| Obesity and not hypertension | 111 | 5 |
| Hypertension and not obesity | 484 | 6 |
| Obesity and diabetes | 258 | 7 |
| Obesity and not diabetes | 185 | 8 |
| Diabetes and not obesity | 479 | 9 |
| Hypertension and diabetes | 595 | 10 |
| Hypertension and not diabetes | 221 | 11 |
| Diabetes and not hypertension | 142 | 12 |
The Reuters20 data subsets
| Categories | Number of texts |
|---|---|
| Crypto | 1000 |
| Space | 1000 |
| Electronics | 1000 |
| Med | 1000 |
SWN sentiment scores of the obesity datasets
| ID | Positive | Negative | Objective | Overall (negative–positive) |
|---|---|---|---|---|
| 6 | 0.052 | 0.07 | 0.878 | 0.018 |
| 9 | 0.051 | 0.071 | 0.879 | 0.02 |
| 12 | 0.052 | 0.073 | 0.876 | 0.021 |
| 3 | 0.051 | 0.072 | 0.877 | 0.021 |
| 10 | 0.05 | 0.071 | 0.879 | 0.021 |
| 2 | 0.051 | 0.072 | 0.878 | 0.021 |
| 5 | 0.051 | 0.073 | 0.876 | 0.022 |
| 7 | 0.051 | 0.072 | 0.877 | 0.022 |
| 1 | 0.048 | 0.073 | 0.879 | 0.022 |
| 8 | 0.049 | 0.072 | 0.879 | 0.023 |
| 11 | 0.05 | 0.074 | 0.876 | 0.023 |
| 4 | 0.049 | 0.075 | 0.877 | 0.026 |
| Average | 0.05 | 0.072 | 0.877 | 0.022 |
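The columns of the table above can be read as averaged lexicon scores, with the overall score taken as negative minus positive. The sketch below illustrates one way such scores could be aggregated; it is an assumption, not the metric definition from the paper. `TOY_LEXICON` and its values are made up, and real SentiWordNet scores are attached to synsets rather than to bare words. The only property borrowed from SentiWordNet is that the objective score equals one minus the positive and negative scores.

```python
# Hypothetical toy lexicon: word -> (positive score, negative score).
# The values are invented for illustration and are not SentiWordNet entries.
TOY_LEXICON = {
    "pain": (0.0, 0.75),
    "stable": (0.5, 0.0),
    "patient": (0.125, 0.0),
    "discharged": (0.0, 0.125),
}


def sentiment_scores(tokens, lexicon):
    """Average positive, negative, and objective scores over tokens found in the lexicon."""
    matched = [lexicon[token] for token in tokens if token in lexicon]
    if not matched:
        return None  # no token of the text matches the lexicon vocabulary
    positive = sum(pos for pos, _ in matched) / len(matched)
    negative = sum(neg for _, neg in matched) / len(matched)
    return {
        "positive": positive,
        "negative": negative,
        "objective": 1.0 - positive - negative,
        "overall": negative - positive,  # same sign convention as the table
    }


scores = sentiment_scores(
    "the patient was discharged in stable condition".split(), TOY_LEXICON
)
```

Under this convention a positive overall value, as in every obesity subset, means the negative score outweighs the positive one.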
SWN sentiment ratios of the obesity datasets
| ID | Num positive | Num negative | Proportion positive | Proportion negative |
|---|---|---|---|---|
| 5 | 2636 | 2708 | 0.493 | 0.507 |
| 6 | 4322 | 4447 | 0.493 | 0.507 |
| 12 | 2931 | 3016 | 0.493 | 0.507 |
| 9 | 4321 | 4467 | 0.492 | 0.508 |
| 3 | 4941 | 5114 | 0.491 | 0.509 |
| 10 | 4597 | 4768 | 0.491 | 0.509 |
| 8 | 3006 | 3122 | 0.491 | 0.509 |
| 2 | 4971 | 5163 | 0.491 | 0.509 |
| 7 | 3503 | 3657 | 0.489 | 0.511 |
| 11 | 3186 | 3338 | 0.488 | 0.512 |
| 1 | 4142 | 4341 | 0.488 | 0.512 |
| 4 | 3744 | 3948 | 0.487 | 0.513 |
| Average | 3858.33 | 4007.42 | 0.491 | 0.509 |
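The last two columns of the table above follow directly from the counts: each is the share of positive or negative items in their combined total. A minimal check of that arithmetic, using the counts reported for subsets 5 and 4 (the function name `sentiment_ratios` is a hypothetical helper):

```python
def sentiment_ratios(num_positive, num_negative):
    """Return the positive and negative shares of the combined count."""
    total = num_positive + num_negative
    return num_positive / total, num_negative / total


# Counts taken from the table rows for subsets 5 and 4.
pos_5, neg_5 = sentiment_ratios(2636, 2708)
pos_4, neg_4 = sentiment_ratios(3744, 3948)
print(f"subset 5: {pos_5:.3f} positive, {neg_5:.3f} negative")
print(f"subset 4: {pos_4:.3f} positive, {neg_4:.3f} negative")
```

Rounded to three decimals, the results reproduce the tabulated 0.493 / 0.507 and 0.487 / 0.513.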
SWN sentiment scores for the Reuters-Science data
| Dataset | Positive | Negative | Objective | Overall (negative–positive) |
|---|---|---|---|---|
| Crypto | 0.060 | 0.049 | 0.890 | −0.011 |
| Space | 0.049 | 0.039 | 0.912 | −0.010 |
| Electronics | 0.056 | 0.046 | 0.898 | −0.010 |
| Med | 0.066 | 0.061 | 0.872 | −0.005 |
| Average | 0.058 | 0.049 | 0.893 | −0.009 |
SWN sentiment ratios for the Reuters-Science datasets
| Datasets | Num positive | Num negative | % positive | % negative |
|---|---|---|---|---|
| Med | 71,019 | 69,175 | 50.7 | 49.3 |
| Electronics | 45,278 | 43,947 | 50.7 | 49.3 |
| Space | 74,604 | 71,826 | 50.9 | 49.1 |
| Crypto | 78,341 | 75,301 | 51.0 | 49.0 |
| Average | 67,311 | 65,062 | 50.8 | 49.2 |
Word2Vec cosine similarity scores (× 100) on the obesity subsets
| ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 100.0 | 99.87 | 99.77 | 99.86 | 98.31 | 99.55 | 99.89 | 99.66 | 99.54 | 99.80 | 99.47 | 98.92 |
| 2 | 99.87 | 100.0 | 99.94 | 99.72 | 98.20 | 99.87 | 99.81 | 99.45 | 99.86 | 99.96 | 99.53 | 99.16 |
| 3 | 99.77 | 99.94 | 100.0 | 99.57 | 98.28 | 99.87 | 99.82 | 99.13 | 99.95 | 99.97 | 99.22 | 99.42 |
| 4 | 99.86 | 99.72 | 99.57 | 100.0 | 97.22 | 99.22 | 99.85 | 99.36 | 99.27 | 99.72 | 99.12 | 98.21 |
| 5 | 98.31 | 98.20 | 98.28 | 97.22 | 100.0 | 98.55 | 97.88 | 98.55 | 98.36 | 97.91 | 98.55 | 99.25 |
| 6 | 99.55 | 99.87 | 99.87 | 99.22 | 98.55 | 100.0 | 99.46 | 99.19 | 99.95 | 99.80 | 99.48 | 99.49 |
| 7 | 99.89 | 99.81 | 99.82 | 99.85 | 97.88 | 99.46 | 100.0 | 99.17 | 99.58 | 99.87 | 98.99 | 98.93 |
| 8 | 99.66 | 99.45 | 99.13 | 99.36 | 98.55 | 99.19 | 99.17 | 100.0 | 98.96 | 99.15 | 99.82 | 98.38 |
| 9 | 99.54 | 99.86 | 99.95 | 99.27 | 98.36 | 99.95 | 99.58 | 98.96 | 100.0 | 99.88 | 99.20 | 99.55 |
| 10 | 99.80 | 99.96 | 99.97 | 99.72 | 97.91 | 99.80 | 99.87 | 99.15 | 99.88 | 100.0 | 99.20 | 99.13 |
| 11 | 99.47 | 99.53 | 99.22 | 99.12 | 98.55 | 99.48 | 98.99 | 99.82 | 99.20 | 99.20 | 100.0 | 98.62 |
| 12 | 98.92 | 99.16 | 99.42 | 98.21 | 99.25 | 99.49 | 98.93 | 98.38 | 99.55 | 99.13 | 98.62 | 100.0 |
Doc2Vec cosine similarity scores (× 100) on the obesity subsets
| ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 100.0 | 79.43 | 81.16 | 92.29 | 68.42 | 56.38 | 89.76 | 79.73 | 66.70 | 77.50 | 66.81 | 55.48 |
| 2 | 79.43 | 100.0 | 89.21 | 76.97 | 73.49 | 79.27 | 77.81 | 75.38 | 81.22 | 89.50 | 63.08 | 66.43 |
| 3 | 81.16 | 89.21 | 100.0 | 82.80 | 62.41 | 72.33 | 82.48 | 62.92 | 78.66 | 96.23 | 51.21 | 68.06 |
| 4 | 92.29 | 76.97 | 82.80 | 100.0 | 61.46 | 58.68 | 94.30 | 78.33 | 70.31 | 80.26 | 60.45 | 53.41 |
| 5 | 68.42 | 73.49 | 62.41 | 61.46 | 100.0 | 61.80 | 67.74 | 65.85 | 66.41 | 62.13 | 70.58 | 72.88 |
| 6 | 56.38 | 79.27 | 72.33 | 58.68 | 61.80 | 100.0 | 63.35 | 57.87 | 90.20 | 72.81 | 65.02 | 64.70 |
| 7 | 89.76 | 77.81 | 82.48 | 94.30 | 67.74 | 63.35 | 100.0 | 74.84 | 71.91 | 79.04 | 66.60 | 60.74 |
| 8 | 79.73 | 75.38 | 62.92 | 78.33 | 65.85 | 57.87 | 74.84 | 100.0 | 60.62 | 67.24 | 77.22 | 35.46 |
| 9 | 66.70 | 81.22 | 78.66 | 70.31 | 66.41 | 90.20 | 71.91 | 60.62 | 100.0 | 78.42 | 61.92 | 72.92 |
| 10 | 77.50 | 89.50 | 96.23 | 80.26 | 62.13 | 72.81 | 79.04 | 67.24 | 78.42 | 100.0 | 55.20 | 67.08 |
| 11 | 66.81 | 63.08 | 51.21 | 60.45 | 70.58 | 65.02 | 66.60 | 77.22 | 61.92 | 55.20 | 100.0 | 48.14 |
| 12 | 55.48 | 66.43 | 68.06 | 53.41 | 72.88 | 64.70 | 60.74 | 35.46 | 72.92 | 67.08 | 48.14 | 100.0 |
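The two matrices above report pairwise cosine similarity between subset representations, scaled by 100. The sketch below shows the cosine computation itself. How each subset is reduced to a single vector is not stated in this record, so the component-wise averaging in `mean_vector` is an assumption, and the three-dimensional vectors are made-up stand-ins for embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise (assumed pooling step)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical embeddings for the texts of two subsets (made-up values).
subset_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
subset_b = [[2.0, 1.0, 2.0], [2.0, -1.0, 2.0]]
similarity_x100 = 100 * cosine_similarity(mean_vector(subset_a), mean_vector(subset_b))
```

Because cosine similarity is symmetric, each matrix mirrors across its diagonal, and every diagonal entry is 100.0.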
Reuters-Science Word2Vec cosine similarity scores
| | Crypto | Electronics | Med | Space |
|---|---|---|---|---|
| Crypto | 1.000 | 0.690 | 0.611 | 0.542 |
| Electronics | 0.690 | 1.000 | 0.727 | 0.606 |
| Med | 0.611 | 0.727 | 1.000 | 0.638 |
| Space | 0.542 | 0.606 | 0.638 | 1.000 |
Reuters-Science Doc2Vec cosine similarity scores
| | Crypto | Electronics | Med | Space |
|---|---|---|---|---|
| Crypto | 1.000 | 0.128 | −0.074 | −0.320 |
| Electronics | 0.128 | 1.000 | 0.379 | 0.070 |
| Med | −0.074 | 0.379 | 1.000 | 0.325 |
| Space | −0.320 | 0.070 | 0.325 | 1.000 |
Correspondence between Word2Vec, Doc2Vec, and SWN results
| Data sets | Word2Vec: subsets with highest SWN sentiment similarity | Word2Vec: subsets with lowest SWN sentiment similarity | Doc2Vec: subsets with highest SWN sentiment similarity | Doc2Vec: subsets with lowest SWN sentiment similarity |
|---|---|---|---|---|
| Obesity (a) | ✔ | ✔ | ✔ | × |
| Reuters | × | × | × | × |

✔ supported by SWN results, × not supported by SWN results
(a) Indicates the data set with a significant difference between the most positive and most negative SWN evaluations.