James A McCart, Dezon K Finch, Jay Jarman, Edward Hickling, Jason D Lind, Matthew R Richardson, Donald J Berndt, Stephen L Luther.
Abstract
In 2007, suicide was the tenth leading cause of death in the U.S. Given the significance of this problem, suicide was the focus of the 2011 Informatics for Integrating Biology and the Bedside (i2b2) Natural Language Processing (NLP) shared task competition (track two). Specifically, the challenge concentrated on sentiment analysis, predicting the presence or absence of 15 emotions (labels) simultaneously in a collection of suicide notes spanning over 70 years. Our team explored multiple approaches combining regular expression-based rules, statistical text mining (STM), and an approach that applies weights to text while accounting for multiple labels. Our best submission used an ensemble of both rules and STM models to achieve a micro-averaged F1 score of 0.5023, slightly above the mean from the 26 teams that competed (0.4875).
Keywords: i2b2 competition; machine learning; sentiment analysis; text analysis
Year: 2012 PMID: 22879763 PMCID: PMC3409473 DOI: 10.4137/BII.S8931
Source DB: PubMed Journal: Biomed Inform Insights ISSN: 1178-2226
Number of rules by label.
| Label | Number of rules |
| Abuse | 24 |
| Anger | 227 |
| Blame | 261 |
| Fear | 33 |
| Forgiveness | 18 |
| Guilt | 287 |
| Happiness | 51 |
| Hopefulness | 49 |
| Hopelessness | 37 |
| Information | 590 |
| Instructions | 2,093 |
| Love | 507 |
| Pride | 29 |
| Sorrow | 80 |
| Thankfulness | 158 |
| All | 4,444 |
Statistical text mining modeling parameters.
| Parameter | Values |
| Decision trees | |
| Term weighting | GR, LOR, χ² |
| Top n terms | 10, 25, 50, 100, 250, 500, 1000, All |
| Split criterion | GI, GR |
| k-Nearest neighbor | |
| Term weighting | GR, LOR, χ² |
| Top n terms | 25, 50, 100, 250, 500, 1000, All |
| k | 1, 2, 5, 10 |
| Support vector machines | |
| Term weighting | GR, LOR, χ² |
| Top n terms | 0, 25, 50, 100, 250, 500, 1000, All |
| SVD dimensions | 0, 25, 50, 100, 250 |
Abbreviations: GI, Gini index; GR, gain ratio; LOR, log odds ratio; χ², chi-square.
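To give a sense of the size of the search space these parameters imply, the following is a minimal sketch that enumerates every combination per model. It assumes the parameters were crossed as a full grid, that chi-square is the third term weighting (as the abbreviation list suggests), and that the unlabeled k-nearest neighbor row holds values of k. The names `GRIDS` and `expand` are illustrative and do not come from the paper.

```python
from itertools import product

# Values copied from the parameter table. "chi2" as the third term weighting
# is an assumption based on the abbreviation list (GI, GR, LOR, X2).
TERM_WEIGHTING = ["GR", "LOR", "chi2"]

GRIDS = {
    "decision_tree": {
        "term_weighting": TERM_WEIGHTING,
        "top_n_terms": [10, 25, 50, 100, 250, 500, 1000, "All"],
        "split_criterion": ["GI", "GR"],
    },
    "knn": {
        "term_weighting": TERM_WEIGHTING,
        "top_n_terms": [25, 50, 100, 250, 500, 1000, "All"],
        # Assumed to be the number of neighbors k (the row name is missing).
        "k": [1, 2, 5, 10],
    },
    "svm": {
        "term_weighting": TERM_WEIGHTING,
        "top_n_terms": [0, 25, 50, 100, 250, 500, 1000, "All"],
        "svd_dimensions": [0, 25, 50, 100, 250],
    },
}


def expand(grid):
    """Return every combination of parameter values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


configs = {model: expand(grid) for model, grid in GRIDS.items()}
for model, combos in configs.items():
    print(model, len(combos))
```

Under the full-grid assumption this yields 48 decision tree, 84 k-nearest neighbor, and 120 support vector machine configurations per label.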
Weight-based modeling parameters.
| Parameter | Values |
| Decision trees | |
| Split criterion | ACC, GI, GR, IG |
| Feature sets | {± |
| Logistic regression | |
| Feature sets | {± |
| Support vector machines | |
| Kernel | Linear, Poly, Sigmoid, RBF |
| Feature sets | {± |
Abbreviations: ACC, Accuracy; GI, Gini index; GR, gain ratio; IG, information gain.
Training set F1 score by label and method.
| Abuse | 0.8235 | 0.0000 | 0.0000 | 0.0000 | 0.5882 | 0.5714 | 0.5714 |
| Anger | 0.9466 | 0.0000 | 0.1622 | 0.1980 | 0.5758 | 0.6486 | 0.4138 |
| Blame | 0.9353 | 0.0484 | 0.1635 | 0.1569 | 0.3837 | 0.4487 | 0.2835 |
| Fear | 0.8889 | 0.1333 | 0.1923 | 0.1429 | 0.4681 | 0.2500 | 0.0000 |
| Forgiveness | 0.9091 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Guilt | 0.8125 | 0.3158 | 0.4278 | 0.3601 | 0.4091 | 0.3368 | 0.2335 |
| Happiness | 0.8636 | 0.0741 | 0.0588 | 0.1395 | 0.0000 | 0.0000 | 0.0000 |
| Hopefulness | 0.7949 | 0.0000 | 0.0702 | 0.0345 | 0.0732 | 0.0000 | 0.0000 |
| Hopelessness | 0.7665 | 0.3280 | 0.3782 | 0.4149 | 0.3494 | 0.3569 | 0.3117 |
| Information | 0.8500 | 0.2481 | 0.3028 | 0.3410 | 0.2857 | 0.2310 | 0.0946 |
| Instructions | 0.8739 | 0.4554 | 0.4379 | 0.5449 | 0.4796 | 0.4435 | 0.4309 |
| Love | 0.8056 | 0.6497 | 0.6022 | 0.6497 | 0.6188 | 0.6122 | 0.5055 |
| Pride | 0.7500 | 0.0000 | 0.1481 | 0.1111 | 0.0000 | 0.0000 | 0.0000 |
| Sorrow | 0.8182 | 0.0000 | 0.0513 | 0.0339 | 0.0345 | 0.0000 | 0.0385 |
| Thankfulness | 0.8101 | 0.6869 | 0.6696 | 0.6559 | 0.0885 | 0.0000 | 0.0000 |
| All | 0.8396 | 0.4110 | 0.3972 | 0.4630 | 0.4206 | 0.4010 | 0.3326 |
Training and testing performance by submission.
| Training | ||||||||||||
| 1 | ✓ | 20% | 0.8396 | 0.9908 | 0.7284 | 1,854 | ||||||
| 2 | ✓ | ✓ | ✓ | 20% | 0.7420 | 0.6795 | 0.8172 | 3,033 | ||||
| 3 | ✓ | ✓ | ✓ | 20% | 0.7228 | 0.6457 | 0.8208 | 3,206 | ||||
| Testing | ||||||||||||
| 1 | ✓ | 20% | 0.3408 | 0.5667 | 0.2437 | 547 | ||||||
| 2 | ✓ | ✓ | ✓ | 20% | 0.4770 | 0.4865 | 0.4678 | 1,223 | ||||
| 3 | ✓ | ✓ | ✓ | 20% | 0.5023 | 0.4992 | 0.5055 | 1,288 | ||||
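The column headers of this table did not survive, but the three decimal columns are consistent with F1, precision, and recall, in that order. The sketch below checks that reading by recomputing F1 as the harmonic mean of the other two values. The column interpretation is an assumption, and `f1_score` and `ROWS` are illustrative names.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (reported F1, assumed precision, assumed recall), copied from the table.
ROWS = {
    ("training", 1): (0.8396, 0.9908, 0.7284),
    ("training", 2): (0.7420, 0.6795, 0.8172),
    ("training", 3): (0.7228, 0.6457, 0.8208),
    ("testing", 1): (0.3408, 0.5667, 0.2437),
    ("testing", 2): (0.4770, 0.4865, 0.4678),
    ("testing", 3): (0.5023, 0.4992, 0.5055),
}

# The table is rounded to four decimals, so allow a small tolerance.
TOLERANCE = 5e-4

checks = {}
for key, (reported, precision, recall) in ROWS.items():
    computed = f1_score(precision, recall)
    checks[key] = abs(computed - reported) < TOLERANCE
    print(key, round(computed, 4), checks[key])
```

Every row agrees with the reported value to within rounding, which supports the assumed column order.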
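F1 score by submission and label.
| Label | Submission 1 training | Submission 1 testing | Submission 2 training | Submission 2 testing | Submission 3 training | Submission 3 testing |
| Abuse | 0.8235 | 0.0000 | 0.7368 | 0.0000 | 0.8235 | 0.0000 |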
| Anger | 0.9466 | 0.1290 | 0.7950 | 0.1111 | 0.9466 | 0.1290 |
| Blame | 0.9353 | 0.0000 | 0.7854 | 0.1842 | 0.9353 | 0.0000 |
| Fear | 0.8889 | 0.0000 | 0.7419 | 0.2222 | 0.8889 | 0.0000 |
| Forgiveness | 0.9091 | 0.0000 | 0.9091 | 0.0000 | 0.9091 | 0.0000 |
| Guilt | 0.8125 | 0.1791 | 0.7494 | 0.4233 | 0.6856 | 0.4677 |
| Happiness | 0.8636 | 0.0000 | 0.8444 | 0.0000 | 0.8636 | 0.0000 |
| Hopefulness | 0.7949 | 0.0000 | 0.7949 | 0.0000 | 0.7949 | 0.0000 |
| Hopelessness | 0.7665 | 0.1931 | 0.7157 | 0.4531 | 0.6680 | 0.5081 |
| Information | 0.8500 | 0.2119 | 0.7176 | 0.3519 | 0.6723 | 0.3793 |
| Instructions | 0.8739 | 0.4808 | 0.7335 | 0.5664 | 0.7313 | 0.5562 |
| Love | 0.8056 | 0.4952 | 0.7518 | 0.6437 | 0.6841 | 0.6541 |
| Pride | 0.7500 | 0.0000 | 0.7500 | 0.0000 | 0.7500 | 0.0000 |
| Sorrow | 0.8182 | 0.0000 | 0.8182 | 0.0000 | 0.8182 | 0.0000 |
| Thankfulness | 0.8101 | 0.4286 | 0.8101 | 0.4286 | 0.8018 | 0.6500 |
| All | 0.8396 | 0.3408 | 0.7420 | 0.4770 | 0.7228 | 0.5023 |
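The "All" row reports the micro-averaged score, which pools true positives, false positives, and false negatives across labels before computing F1, so frequent labels such as Instructions and Love dominate and labels scored at 0.0000 pull the total down less than they would in a plain average. The sketch below contrasts the two averages. The per-label counts are made up for illustration and are not from the paper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; equivalent to 2PR / (P + R)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts):
    """Pool counts over all labels, then compute a single F1."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return f1_from_counts(tp, fp, fn)


def macro_f1(counts):
    """Compute F1 per label, then take the unweighted mean."""
    scores = [f1_from_counts(c["tp"], c["fp"], c["fn"]) for c in counts.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical counts: one frequent label and one rare label that is never
# predicted correctly.
toy_counts = {
    "Love": {"tp": 60, "fp": 30, "fn": 35},
    "Pride": {"tp": 0, "fp": 2, "fn": 9},
}

micro = micro_f1(toy_counts)
macro = macro_f1(toy_counts)
print(f"micro F1 = {micro:.4f}, macro F1 = {macro:.4f}")
```

With these toy counts the micro average stays close to the frequent label's score, while the macro average is halved by the rare label.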
Weight formulas.
| Sum of chi-square values for each term in a sentence. | |
| ± | Sum of chi-square values for each term in a sentence, with the sign of a term’s weight determined by whether it exists predominately in the |
| Sum of modified Gini index ( | |
| Sum of modified Gini index multiplied by chi-square for each term in a sentence. |
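The first two weights can be sketched as follows for a single label. This is a minimal illustration under stated assumptions, not the authors' implementation: chi-square is computed from a 2x2 table of term presence against label presence, each distinct term in a sentence is counted once, and the sign is positive when a term is relatively more common in sentences carrying the label (the sign rule is cut off in the table above, so this is a guess). The modified Gini index is not defined in this excerpt and is omitted. All function names and the toy sentences are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square for a 2x2 table.

    a: labeled sentences containing the term
    b: unlabeled sentences containing the term
    c: labeled sentences without the term
    d: unlabeled sentences without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def term_weights(sentences, labels, signed=False):
    """Map each term to its chi-square value, optionally signed."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    vocabulary = {term for tokens in sentences for term in tokens}
    weights = {}
    for term in vocabulary:
        a = sum(1 for tokens, y in zip(sentences, labels) if y and term in tokens)
        b = sum(1 for tokens, y in zip(sentences, labels) if not y and term in tokens)
        c, d = n_pos - a, n_neg - b
        value = chi_square(a, b, c, d)
        # Assumed sign rule: negative when the term leans toward unlabeled sentences.
        if signed and a * d < b * c:
            value = -value
        weights[term] = value
    return weights


def sentence_weight(tokens, weights):
    """Sum the weights of the distinct terms in a sentence."""
    return sum(weights.get(term, 0.0) for term in set(tokens))


# Toy data: True marks sentences that carry the label of interest.
toy_sentences = [
    ["i", "love", "you"],
    ["love", "always"],
    ["pay", "the", "bill"],
    ["the", "bill", "is", "due"],
]
toy_labels = [True, True, False, False]

unsigned = term_weights(toy_sentences, toy_labels)
signed = term_weights(toy_sentences, toy_labels, signed=True)
print(sentence_weight(["love", "bill"], unsigned))
print(sentence_weight(["love", "bill"], signed))
```

The signed variant lets terms that point away from a label cancel terms that point toward it, which the unsigned sum cannot do.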