Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit P Sheth.
Abstract
This paper presents our solution for the i2b2 sentiment classification challenge. Our hybrid system consists of machine learning and rule-based classifiers. For the machine learning classifier, we investigate a variety of lexical, syntactic and knowledge-based features, and show through experiments how much these features contribute to the performance of the classifier. For the rule-based classifier, we propose an algorithm to automatically extract effective syntactic and lexical patterns from training examples. The experimental results show that the rule-based classifier outperforms the baseline machine learning classifier using unigram features. By combining the machine learning classifier and the rule-based classifier, the hybrid system gains a better trade-off between precision and recall, and yields the highest micro-averaged F-measure (0.5038), which is better than the mean (0.4875) and median (0.5027) micro-averaged F-measures among all participating teams.
Keywords: emotion identification; sentiment analysis; suicide note
Year: 2012 PMID: 22879770 PMCID: PMC3409482 DOI: 10.4137/BII.S8963
Source DB: PubMed Journal: Biomed Inform Insights ISSN: 1178-2226
Features used by the SVM classifier.
| N-gram features | Notation-N |
|---|---|
| Unigram: count | |
| Bigram: count | |
| Trigram: count | |
| The numbers of strongsubj, weaksubj, positive, negative and neutral words regarding MPQA: count | |
| feature vector generated by LIWC software | |
| Collapsed dependency relations by Stanford Parser: count | |
| The numbers of adjectives, adverbs, nouns, pronouns, present verbs, past verbs and modals: count | |
| The numbers of different verb tenses: count | |
| The numbers of two types of location phrases: count | |
| Whether POSs of the first two words are VB/VBZ respectively: binary | |
| The numbers of subjects that are the writer, other people and anything else respectively: count | |
| The numbers of direct objects that are the writer, other people and anything else respectively: count | |
| The numbers of indirect objects that are the writer, other people and anything else respectively: count | |
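The n-gram count features in the first rows of the table can be illustrated with a minimal sketch. It assumes lowercasing and whitespace tokenization, neither of which the table specifies, and the helper name `ngram_counts` is hypothetical rather than taken from the paper.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams in a sentence.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; the original system may tokenize differently.
    """
    tokens = text.lower().split()
    # Slide a window of width n over the token list.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


sentence = "Thank you for all you did for me"
unigrams = ngram_counts(sentence, 1)
bigrams = ngram_counts(sentence, 2)
trigrams = ngram_counts(sentence, 3)
```

Each `Counter` maps an n-gram string to its count, which is the kind of sparse count vector a classifier such as an SVM could consume after mapping n-grams to feature indices.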
Performance of the SVM classifier with different feature combinations on the testing data.
| Feature type | Feature set | Micro-averaged F-measure | Precision | Recall |
|---|---|---|---|---|
| N-gram feature | | 0.4492 | 0.5971 | 0.3601 |
| | | | 0.6505 | 0.3687 |
| | | 0.4542 | 0.6128 | 0.3609 |
| Knowledge-based features | | 0.4623 | 0.5946 | 0.3781 |
| | | 0.4650 | 0.6161 | 0.3734 |
| | | | 0.6525 | 0.3734 |
| Syntactic features | | 0.4781 | 0.6667 | 0.3726 |
| | | 0.4783 | 0.6553 | 0.3766 |
| | | 0.4798 | 0.6584 | 0.3774 |
| | | | 0.6612 | 0.3789 |
| | | 0.4804 | 0.6657 | 0.3758 |
| Context features | | 0.4697 | 0.6218 | 0.3774 |
| | | 0.4758 | 0.6508 | 0.3750 |
| | | 0.4787 | 0.6593 | 0.3758 |
| Class-specific features | | | 0.6667 | |
| | | | | 0.3837 |
| All | All features | 0.4720 | 0.6279 | 0.3781 |
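The F-measure column appears consistent with the harmonic mean of the precision and recall columns (F1). That relationship is an assumption checked against fully populated rows, not a formula stated in the table, and the helper `f_measure` below is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed F1, i.e. beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from fully populated rows.
rows = [
    (0.5971, 0.3601, 0.4492),  # N-gram feature, first row
    (0.5946, 0.3781, 0.4623),  # Knowledge-based features, first row
    (0.6279, 0.3781, 0.4720),  # All features
]

# Largest absolute difference between recomputed and reported values.
max_gap = max(abs(f_measure(p, r) - f) for p, r, f in rows)
```

For these three rows the recomputed values differ from the reported ones by less than 0.001, which is within the rounding of the four-decimal inputs.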
Figure 1. Precision-recall curve of the rule-based classifier with varying
Figure 2. F-measure of the rule-based classifier with varying threshold τ on the testing data.
Figure 3. F-measure of the combined classifier on the test data.