Xinmiao Li, Jing Li, Yukeng Wu.
Abstract
With the rapid development of social media, sentiment analysis has become an important social media mining technique. The performance of automatic sentiment analysis depends primarily on feature selection and sentiment classification. Although information gain (IG) and support vector machines (SVM) are two important techniques, few studies have optimized both in sentiment analysis, and the effectiveness of applying a global optimization approach to sentiment analysis remains unclear. We propose a global optimization-based sentiment analysis (PSOGO-Senti) approach that uses IG for feature selection and SVM as the learning engine. PSOGO-Senti uses a particle swarm optimization (PSO) algorithm to obtain a globally optimal combination of feature dimensions and SVM parameters. We evaluated the PSOGO-Senti model on two datasets from different fields. The experimental results showed that PSOGO-Senti can improve binary and multi-polarity Chinese sentiment analysis. We compared the optimal feature subset selected by PSOGO-Senti with the features in the sentiment dictionary. The comparison indicated that PSOGO-Senti can effectively remove redundant and noisy features and select a domain-specific feature subset with higher explanatory power for a particular sentiment analysis task. The experimental results also showed that the approach is effective and robust for sentiment analysis tasks in different domains. Comparing the improvements in the two-polarity, three-polarity and five-polarity results, we found that five-polarity sentiment analysis showed the largest improvement and two-polarity sentiment analysis the smallest. We conclude that PSOGO-Senti achieves a greater improvement on more complicated sentiment analysis tasks. We also compared the results of PSOGO-Senti with those of the genetic algorithm (GA) and the grid search method (GSM). From this comparison, we found that PSOGO-Senti is more suitable for improving a difficult multi-polarity sentiment analysis problem.
Year: 2015 PMID: 25909740 PMCID: PMC4409395 DOI: 10.1371/journal.pone.0124672
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
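The abstract names information gain (IG) as the feature selection criterion. A minimal sketch of IG-based feature ranking follows, assuming each document is reduced to a set of tokens and a feature is scored by whether it is present or absent. The function names and the four-document toy corpus are hypothetical illustrations, not the authors' implementation or data.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of a term: H(class) minus H(class | term present or absent).

    docs is a list of token sets, labels the matching list of class labels.
    """
    classes = sorted(set(labels))
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy([labels.count(c) for c in classes])
    h_cond = ((n_with / n) * entropy([with_term[c] for c in classes])
              + (n_without / n) * entropy([without_term[c] for c in classes]))
    return h_class - h_cond


def top_k_features(docs, labels, k):
    """Return the k terms with the highest IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy corpus: two positive and two negative reviews.
docs = [{"good", "room"}, {"good", "clean"}, {"bad", "room"}, {"bad", "dirty"}]
labels = ["pos", "pos", "neg", "neg"]
selected = top_k_features(docs, labels, k=2)
```

In this toy corpus the terms that split the classes perfectly rank first, while a term that appears equally often in both classes has zero gain. The cut-off `k` plays the role of the feature dimension that the abstract says is tuned together with the SVM parameters.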
SVM-based Sentiment Analysis Research Summary.
| Research | Feature Subset Selection | SVM Parameters Optimization | Global Optimization |
|---|---|---|---|
| Desmet (2013) | No | No | No |
| Moraes (2013) | No | No | No |
| Basari (2013) | No | Yes, PSO | No |
| Lane (2012) | Yes, Chi-square approach, taking the top 250 features | Yes, manual setting | No |
| Abbasi et al. (2010) | No | No | No |
| Li and Wu (2010) | No | No | No |
| Abbasi et al. (2008) | No | No | No |
| Tan and Zhang (2008) | Yes, predefined size of feature set | No | No |
| Coussement and Poel (2008) | No | Yes, grid search approach | No |
| Xia (2008) | Yes, Chi-square approach, retaining the top 60% of sentiment words | No | No |
| Pang (2002) | No | No | No |
Fig 1. PSO-based Global Optimization Approach for Multi-Polarity Sentiment Analysis (PSOGO-Senti).
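The abstract describes PSOGO-Senti as using particle swarm optimization to search jointly over the feature dimension and the SVM parameters. A minimal sketch of that kind of search follows, assuming each particle encodes (feature dimension k, log2 C, log2 gamma). The encoding, the bounds, the swarm settings and `surrogate_fitness` are all hypothetical stand-ins, not the authors' configuration. In a real run the fitness would be something like the cross-validated F-measure of an SVM trained on the top-k IG-ranked features.

```python
import random


def pso_maximize(fitness, bounds, n_particles=20, n_iters=60,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO that maximizes `fitness` inside box `bounds`."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen by each particle
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


def surrogate_fitness(particle):
    """Hypothetical smooth stand-in for a cross-validated SVM score.

    It peaks at k=1200, log2 C=3, log2 gamma=-5; these values are arbitrary.
    """
    k, log2_c, log2_gamma = particle
    return -(((k - 1200.0) / 1000.0) ** 2
             + (log2_c - 3.0) ** 2
             + (log2_gamma + 5.0) ** 2)


# Assumed search box: feature dimension, log2 C, log2 gamma.
bounds = [(100.0, 5000.0), (-5.0, 15.0), (-15.0, 3.0)]
best, best_fit = pso_maximize(surrogate_fitness, bounds)
best_k = int(round(best[0]))  # the feature dimension must be an integer
```

Searching the feature dimension and the SVM parameters in one particle is what makes the optimization global in the sense the abstract uses: the best kernel parameters can differ for each feature subset size, so tuning them separately may miss the best combination.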
Performance Comparison of PSOGO-Senti and Benchmarks in the Two-Polarity Sentiment Analysis (Ctrip Dataset).
| Precision | Recall | F-measure |
|---|---|---|
| 0.903 | 0.903 | 0.903 |
| 0.788 | 0.784 | 0.784 |
| 0.905 | 0.905 | 0.905 |
| 0.89 | 0.89 | 0.89 |
| 0.891 | 0.891 | 0.891 |
| 0.838 | 0.832 | 0.829 |
| 0.87 | 0.87 | 0.87 |
Performance Comparison of PSOGO-Senti and Benchmarks in the Three-Polarity Sentiment Analysis (Ctrip Dataset).
| Precision | Recall | F-measure |
|---|---|---|
| 0.692 | 0.692 | 0.691 |
| 0.614 | 0.611 | 0.603 |
| 0.693 | 0.694 | 0.693 |
| 0.683 | 0.687 | 0.684 |
| 0.685 | 0.689 | 0.686 |
| 0.623 | 0.616 | 0.604 |
| 0.654 | 0.662 | 0.652 |
Performance Comparison of PSOGO-Senti and Benchmarks in the Five-Polarity Sentiment Analysis (Ctrip Dataset).
| Precision | Recall | F-measure |
|---|---|---|
| 0.516 | 0.517 | 0.513 |
| 0.473 | 0.482 | 0.465 |
| 0.521 | 0.522 | 0.518 |
| 0.509 | 0.509 | 0.500 |
| 0.517 | 0.519 | 0.516 |
| 0.479 | 0.400 | 0.321 |
| 0.474 | 0.480 | 0.455 |
Comparison between the Optimal Feature Subset Selected by PSOGO-Senti and the HowNet Dictionary.
HowNet Dictionary dimension: 8742.
| Measure | Ctrip, two-polarity | Ctrip, three-polarity | Ctrip, five-polarity | Guahao, two-polarity | Guahao, three-polarity | Guahao, five-polarity |
|---|---|---|---|---|---|---|
| Optimal feature subset dimension | 1120 | 1074 | 5000 | 2537 | 762 | 1686 |
| Features in PSOGO-Senti but not in HowNet | 927 | 888 | 4363 | 2166 | 624 | 1403 |
| Percentage of features in PSOGO-Senti but not in HowNet | 82.77% | 82.68% | 87.26% | 85.38% | 81.89% | 83.21% |
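The percentage row in the table above is the count of features outside HowNet divided by the optimal feature subset dimension. The short check below recomputes it from the table's own numbers; the dictionary keys and variable names are illustrative only.

```python
# (optimal feature subset dimension, features not in HowNet), copied from the table.
subsets = {
    ("Ctrip", 2): (1120, 927),
    ("Ctrip", 3): (1074, 888),
    ("Ctrip", 5): (5000, 4363),
    ("Guahao", 2): (2537, 2166),
    ("Guahao", 3): (762, 624),
    ("Guahao", 5): (1686, 1403),
}

# Share of the selected features that the HowNet dictionary does not contain.
pct_not_in_hownet = {
    key: round(100.0 * outside / size, 2)
    for key, (size, outside) in subsets.items()
}

# Number of selected features that do overlap with HowNet.
overlap = {key: size - outside for key, (size, outside) in subsets.items()}

for key in subsets:
    print(key, pct_not_in_hownet[key], overlap[key])
```

In every setting more than 80% of the selected features fall outside the dictionary, which is the basis for the abstract's claim that the selected subsets are domain-specific.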
Some of the Features in the Optimal Feature Subset of PSOGO-Senti but not in the HowNet Dictionary (Ctrip Dataset).
| Category | Features |
|---|---|
| Verb | 忽悠 (hoodwink), 恐吓 (threaten), 上当 (be fooled), 受骗 (be cheated), 宰 (swindle money out of customers), 坑人 (harm), 拉客 (soliciting), 享受 (enjoy) |
| Adjective | 般般 (so so), 凑合 (make do in a bad situation), 标志性 (landmark) |
| Noun | 流氓 (rogue), 假货 (fake goods), 黑店 (gangster inn), 垃圾 (shit), 商业化 (commercialization), 天堂 (paradise) |
| Phrase | 豁然开朗 (be suddenly enlightened), 流连忘返 (linger on and forget to return), 不虚此行 (the trip has been well worthwhile), 世外桃源 (wonderland) |
Performance Comparison of PSOGO-Senti and Benchmarks in the Two-Polarity Sentiment Analysis (Guahao Dataset).
| Precision | Recall | F-measure |
|---|---|---|
| 0.878 | 0.878 | 0.877 |
| 0.901 | 0.899 | 0.899 |
| 0.921 | 0.921 | 0.921 |
| 0.920 | 0.920 | 0.920 |
| 0.921 | 0.921 | 0.921 |
| 0.856 | 0.842 | 0.841 |
| 0.899 | 0.898 | 0.898 |
| 0.922 | 0.922 | 0.922 |
Performance Comparison of PSOGO-Senti and Benchmarks in the Three-Polarity Sentiment Analysis (Guahao Dataset).
| Precision | Recall | F-measure |
|---|---|---|
| 0.739 | 0.718 | 0.721 |
| 0.752 | 0.743 | 0.745 |
| 0.753 | 0.747 | 0.748 |
| 0.752 | 0.745 | 0.746 |
| 0.743 | 0.737 | 0.739 |
| 0.743 | 0.737 | 0.739 |
| 0.729 | 0.714 | 0.716 |
| 0.759 | 0.753 | 0.755 |
Performance Comparison of PSOGO-Senti and Benchmarks in the Five-Polarity Sentiment Analysis (Guahao Dataset).
| Precision | Recall | F-measure |
|---|---|---|
| 0.637 | 0.587 | 0.590 |
| 0.680 | 0.672 | 0.673 |
| 0.690 | 0.688 | 0.688 |
| 0.690 | 0.690 | 0.690 |
| 0.690 | 0.687 | 0.688 |
| 0.674 | 0.661 | 0.663 |
| 0.647 | 0.624 | 0.627 |
| 0.694 | 0.690 | 0.691 |
Some of the Features in the Optimal Feature Subset of PSOGO-Senti but not in the HowNet Dictionary (Guahao Dataset).
| Category | Features |
|---|---|
| Verb | 浪费 (waste), 打发 (send away), 折腾 (cause physical or mental suffering), 不理 (ignore), 坑人 (harm), 无奈 (feel helpless) |
| Adjective | 不耐烦 (impatient), 久 (for a long time), 耐心 (patient), 丰富 (experienced), 体贴 (considerate) |
| Noun | 黄牛 (scalper), 庸医 (quack), 复查 (reexamination) |
| Phrase | 敷衍了事 (do things carelessly), 不怎么样 (not very good), 草草了事 (go through a thing carelessly), 莫名其妙 (without rhyme or reason), 答非所问 (give an irrelevant answer), 名不符实 (undeserved reputation), 救死扶伤 (heal the wounded and rescue the dying) |
Performance Comparison of PSOGO-Senti and the GSM-, GA-based Approaches (1).
Dataset: Ctrip Dataset.
| Measure | Two-polarity, PSOGO-Senti | Two-polarity, GA | Two-polarity, GSM | Three-polarity, PSOGO-Senti | Three-polarity, GA | Three-polarity, GSM | Five-polarity, PSOGO-Senti | Five-polarity, GA | Five-polarity, GSM |
|---|---|---|---|---|---|---|---|---|---|
| Precision | 0.906 | 0.915 | 0.905 | 0.695 | 0.737 | 0.690 | 0.521 | 0.500 | 0.467 |
| Recall | 0.906 | 0.915 | 0.905 | 0.696 | 0.747 | 0.690 | 0.522 | 0.485 | 0.467 |
| F-measure | 0.906 | 0.915 | 0.905 | 0.695 | 0.741 | 0.690 | 0.519 | 0.484 | 0.465 |
| Optimal feature subset dimension | 1120 | 1930 | 1001 | 1074 | 3157 | 1261 | 5000 | 2882 | 501 |
Performance Comparison of PSOGO-Senti and the GSM-, GA-based Approaches (2).
Dataset: Guahao Dataset.
| Measure | Two-polarity, PSOGO-Senti | Two-polarity, GA | Two-polarity, GSM | Three-polarity, PSOGO-Senti | Three-polarity, GA | Three-polarity, GSM | Five-polarity, PSOGO-Senti | Five-polarity, GA | Five-polarity, GSM |
|---|---|---|---|---|---|---|---|---|---|
| Precision | 0.922 | 0.925 | 0.922 | 0.759 | 0.753 | 0.743 | 0.694 | 0.689 | 0.670 |
| Recall | 0.922 | 0.925 | 0.922 | 0.753 | 0.750 | 0.742 | 0.690 | 0.686 | 0.666 |
| F-measure | 0.922 | 0.925 | 0.922 | 0.755 | 0.751 | 0.742 | 0.691 | 0.686 | 0.666 |
| Optimal feature subset dimension | 2537 | 3323 | 4161 | 762 | 3239 | 1261 | 1686 | 4119 | 461 |
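The two comparison tables can be summarized as F-measure gaps between PSOGO-Senti and each alternative. The sketch below computes those gaps from the reported F-measures only. The variable names are illustrative, and these gaps are not necessarily the same "improvement" measure the abstract refers to, which may be defined against other benchmarks.

```python
# F-measures copied from the two comparison tables:
# (dataset, number of polarities) -> {approach: F-measure}
f_measure = {
    ("Ctrip", 2): {"PSOGO-Senti": 0.906, "GA": 0.915, "GSM": 0.905},
    ("Ctrip", 3): {"PSOGO-Senti": 0.695, "GA": 0.741, "GSM": 0.690},
    ("Ctrip", 5): {"PSOGO-Senti": 0.519, "GA": 0.484, "GSM": 0.465},
    ("Guahao", 2): {"PSOGO-Senti": 0.922, "GA": 0.925, "GSM": 0.922},
    ("Guahao", 3): {"PSOGO-Senti": 0.755, "GA": 0.751, "GSM": 0.742},
    ("Guahao", 5): {"PSOGO-Senti": 0.691, "GA": 0.686, "GSM": 0.666},
}


def gaps(baseline):
    """F-measure of PSOGO-Senti minus that of `baseline`, per setting."""
    return {key: scores["PSOGO-Senti"] - scores[baseline]
            for key, scores in f_measure.items()}


def largest_gap_polarity(baseline, dataset):
    """Polarity count at which PSOGO-Senti's lead over `baseline` is largest."""
    per_polarity = {k[1]: v for k, v in gaps(baseline).items() if k[0] == dataset}
    return max(per_polarity, key=per_polarity.get)


for baseline in ("GA", "GSM"):
    for key, gap in gaps(baseline).items():
        print(baseline, key, f"{gap:+.3f}")
```

On these numbers the lead of PSOGO-Senti over both GA and GSM is widest in the five-polarity setting on both datasets, while GA scores slightly higher in the two-polarity settings. This is in line with the abstract's statement that PSOGO-Senti is better suited to difficult multi-polarity problems.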