John P Pestian, Pawel Matykiewicz, Michelle Linn-Gust, Brett South, Ozlem Uzuner, Jan Wiebe, K Bretonnel Cohen, John Hurdle, Christopher Brew.
Abstract
This paper reports on a shared task involving the assignment of emotions to suicide notes. Two features distinguished this task from previous shared tasks in the biomedical domain. One is that it resulted in a corpus of fully anonymized clinical text and annotated suicide notes. This resource is permanently available and will (we hope) facilitate future research. The other key feature of the task is that it required categorization with respect to a large set of labels. The number of participants was larger than in any previous biomedical challenge task. We describe the data production process and the evaluation measures, and give a preliminary analysis of the results. Many systems performed at levels approaching the inter-coder agreement, suggesting that human-like performance on this task is within the reach of currently available technologies.
Year: 2012 PMID: 22419877 PMCID: PMC3299408 DOI: 10.4137/bii.s9042
Source DB: PubMed Journal: Biomed Inform Insights ISSN: 1178-2226
Example of a note annotation for different spans with the corresponding Krippendorff’s α and the majority rule.
| Token | hate | love | ||||||
| anger, hate | anger, hate | love | love | ≈0.570 | ||||
| anger, blame | anger, blame | anger, blame | love | love | love | |||
| hate | love | |||||||
| Sentence | anger, hate | love | ≈0.577 | |||||
| anger, blame | love | |||||||
| Majority | anger, hate | love |
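The majority rule in the example above can be sketched as a simple vote over the label sets that annotators assigned to one span. This is a minimal illustration, assuming that a label is kept when more than half of the annotators chose it; the function name `majority_labels` and the three example label sets are hypothetical and not taken from the task's tooling.

```python
from collections import Counter
from typing import List, Set


def majority_labels(annotations: List[Set[str]]) -> Set[str]:
    """Keep each label that more than half of the annotators assigned.

    `annotations` holds one set of emotion labels per annotator for a
    single span (token or sentence). The strict-majority threshold is an
    assumption made for this sketch.
    """
    counts = Counter(label for labels in annotations for label in labels)
    threshold = len(annotations) / 2
    return {label for label, n in counts.items() if n > threshold}


# Hypothetical label sets from three annotators for one sentence.
sentence_votes = [{"anger", "hate"}, {"anger", "blame"}, {"hate"}]
print(sorted(majority_labels(sentence_votes)))
```

With three annotators, a label needs at least two votes to survive, so a label chosen by only one annotator is dropped.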
Annotator characteristics.
| Response to call | |
| Annotators | |
| Direct contact | 1500 |
| Indirect contact | Unknown |
| Not eligible | 10 |
| Completed training | 169 |
| Withdrew | 17 |
| Respondents who fully completed the task | 64 |
| Gender and age | |
| Males | 10% |
| Females | 90% |
| Average age (SD) | 47.3 (11.2) |
| Age range | 23–70 |
| Education level | |
| High school degree | 26 |
| Associates degree | 13 |
| Bachelors | 23 |
| Masters | 34 |
| Professional (PhD/MD/JD) | 4 |
| Connection to suicide | |
| Survivor of a loss to suicide | 70 |
| Mental health professional | 18 |
| Other | 12 |
| Time since loss | |
| 0–2 years | 27 |
| 3–5 years | 25 |
| 6–10 years | 14 |
| 11–15 years | 13 |
| 16 years or more | 12 |
| Relationship to the lost | |
| Child | 31 |
| Sibling | 23 |
| Spouse or partner | 15 |
| Other relative | 9 |
| Parent | 8 |
| Friend | 5 |
| Performance | |
| Number of notes annotated at least once | 1278 |
| Number of notes annotated at least twice | 1225 |
| Number of notes annotated at least three times | 1004 |
| Mean (SD) annotation time per note | 4.4 min (1.3 min) |
| Token inter-annotation agreement | 0.535 |
| Sentence inter-annotation agreement | 0.546 |
Figure 1. Geographic location of participants.
Characteristics of the data.
| Category | Sum | Mean | SD | Min | Max |
| Word count | 146739 | 102.399 | 112.178 | 3 | 888.000 |
| Swear | 105 | 0.073 | 0.48 | 0 | 7.690 |
| Family | 2029 | 1.416 | 2.24 | 0 | 17.650 |
| Friend | 305 | 0.213 | 0.794 | 0 | 12.500 |
| Positive emotion | 7869 | 5.491 | 5.096 | 0 | 42.860 |
| Negative emotion | 3017 | 2.105 | 2.834 | 0 | 33.330 |
| Anxiety | 356 | 0.248 | 0.788 | 0 | 9.090 |
| Anger | 650 | 0.453 | 1.132 | 0 | 10.000 |
| Sad | 814.4 | 0.568 | 1.309 | 0 | 16.670 |
| Cognitive process | 19512.39 | 13.616 | 6.380 | 0 | 66.670 |
| Biology | 4267 | 2.977 | 3.324 | 0 | 25.000 |
| Sexual | 1453 | 1.01 | 2.044 | 0 | 25.000 |
| Ingestion | 172 | 0.12 | 0.496 | 0 | 5.560 |
| Religion | 917 | 0.64 | 1.845 | 0 | 27.270 |
| Death | 971 | 0.677 | 1.858 | 0 | 33.330 |
Team ranking using micro-averaged F1, precision, and recall.
| Team | F1 | Precision | Recall |
| Open university | 0.61390 | 0.58210 | 0.64937 |
| MSRA | 0.58990 | 0.55915 | 0.62421 |
| Mayo | 0.56404 | 0.57085 | 0.55739 |
| Nrciit | 0.55216 | 0.55725 | 0.54717 |
| Oslo | 0.54356 | 0.60580 | 0.49292 |
| Limsi | 0.53831 | 0.53810 | 0.53852 |
| Swatmrc | 0.53429 | 0.57890 | 0.49607 |
| UMAN | 0.53367 | 0.56614 | 0.50472 |
| Cardiff | 0.53339 | 0.54962 | 0.51808 |
| LT3 | 0.53307 | 0.54374 | 0.52280 |
| UTD | 0.51589 | 0.55089 | 0.48506 |
| OHSU | 0.50985 | 0.53351 | 0.48821 |
| Wolverine | 0.50315 | 0.45334 | 0.56525 |
| TPAVACOE | 0.50234 | 0.49922 | 0.50550 |
| CLiPS | 0.50183 | 0.51889 | 0.48585 |
| SIP | 0.49727 | 0.67429 | 0.39387 |
| SRI & UC Davis | 0.48003 | 0.49831 | 0.46305 |
| DIEGO-ASU | 0.47506 | 0.41791 | 0.55031 |
| Ebi | 0.45636 | 0.60077 | 0.36792 |
| Duluth | 0.45269 | 0.45985 | 0.44575 |
| Columbia | 0.43017 | 0.42125 | 0.43947 |
| Pxs697 | 0.40288 | 0.37192 | 0.43947 |
| Lassa | 0.38194 | 0.35089 | 0.41903 |
| Saeed | 0.37927 | 0.37059 | 0.38836 |
| SNAPS | 0.35294 | 0.58684 | 0.25236 |
| Senti6 | 0.29669 | 0.30532 | 0.28852 |
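The micro-averaged scores in the ranking pool every label decision across all sentences before computing precision and recall. The sketch below shows that calculation under the assumption that each decision is represented as a (sentence ID, emotion) pair; the names `micro_prf` and `harmonic_f1` and the toy data are hypothetical and are not the organizers' evaluation script.

```python
from typing import Set, Tuple

# Hypothetical representation: one (sentence_id, emotion) pair per assigned label.
Label = Tuple[str, str]


def harmonic_f1(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall (0.0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_prf(gold: Set[Label], predicted: Set[Label]) -> Tuple[float, float, float]:
    """Return micro-averaged (precision, recall, F1) over pooled label decisions."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, harmonic_f1(precision, recall)


# Toy data: four gold labels, three predicted labels, two in common.
gold = {("s1", "love"), ("s1", "anger"), ("s2", "guilt"), ("s3", "love")}
predicted = {("s1", "love"), ("s2", "guilt"), ("s2", "anger")}
print(micro_prf(gold, predicted))

# Consistency check against the top row of the ranking table.
print(round(harmonic_f1(0.58210, 0.64937), 4))
```

Applying `harmonic_f1` to the reported precision and recall of a team reproduces its reported F1 up to rounding, which is a quick way to confirm the column order of the table.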
Figure 2. Comparison of different systems’ outputs using distance d = 1 − F1 and hierarchical clustering with minimum variance condition.
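A pairwise comparison of system outputs of the kind shown in Figure 2 can be sketched as follows, reading the caption's distance as d = 1 − F1 between two systems' label sets (an assumption about the garbled caption). The function names and the toy outputs are hypothetical; only the distance matrix is computed here, and the clustering step is left to a separate hierarchical clustering routine.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Label = Tuple[str, str]  # hypothetical (sentence_id, emotion) pair


def f1_distance(a: Set[Label], b: Set[Label]) -> float:
    """Distance 1 - F1 between two label sets, treating either one as the reference.

    F1 between two sets is symmetric: 2|A ∩ B| / (|A| + |B|).
    Two empty outputs are treated as identical (distance 0.0).
    """
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))


def distance_matrix(outputs: Dict[str, Set[Label]]) -> Dict[Tuple[str, str], float]:
    """Compute the distance for every unordered pair of systems."""
    return {
        (x, y): f1_distance(outputs[x], outputs[y])
        for x, y in combinations(sorted(outputs), 2)
    }


# Toy outputs for three hypothetical systems.
outputs = {
    "A": {("s1", "love"), ("s2", "guilt"), ("s3", "anger")},
    "B": {("s2", "guilt"), ("s3", "anger"), ("s4", "love")},
    "C": {("s5", "hopelessness")},
}
for pair, d in distance_matrix(outputs).items():
    print(pair, round(d, 3))
```

Systems that agree on most labels end up close together, so a minimum variance clustering of this matrix would group them in the same branch of the dendrogram.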
Examples of sentence/label combinations that were misclassified by all systems.
| Error type | ID | Sentence | Annotated label | System label |
| False negative | 200909031138 4664 | “Goodbye my dear wife Jane.” | love | none |
| False negative | 200809091809 2119 | “I ask God alone to judge my action.” | guilt | none |
| False negative | 200812181837 2227 | “I hope something is done to John Johnson, for I do not wish to die in vain.” | anger | none |
| False positive | 200908201415 0445 | “respectfully Mary P.S. I love you BABY.” | none | love |
| False positive | 200812181838 1506 | “Dearest Jane I am about to commit suicide.” | none | instructions |
| False positive | 200809091735 1923 | “John: I can’t take your cruel unkind treatment any longer.” | none | hopelessness |
System description.
| Team | System | Features | Feature selection | Number of features | Sparsity | Feature weighting | Machine learning | Rules | Validation | Micro-averaged F1 |
| Cardiff | TopClass | Stanford POS tagger, WordNet lexical domains, emotive lexicons, internally assembled lexicons, manually identified patterns | frequency, mutual information, principal component analysis | 245 | N/A | None | naive Bayes | Java regular expressions | cross validation | 0.533 |
| CLiPS Research Center | GoldDigger | Multi-label training sentences re-annotated into single-label instances. | None | 6,941 (# of tokens in training) | N/A | None | One-vs.-all SVMs trained on emotion-labeled and unlabeled instances, returning probability estimates per instance, per class. Two experimentally determined probability thresholds: one for emotion labels and one for the no-emotion class | None | 10-fold CV | 0.5018 |
| Columbia | Columbia | Lexical, syntactic, and machine-learned features | No | 30 | Very sparse | Using ridge estimator | Logistic regression with ridge estimator | No | MLE | 0.43 |
| DIEGO-ASU | Emotion Finder | Clause-level polarity features, unigrams and WordNet Affect emotion categories, syntactic features (e.g., sentence offset in the note) | Semi-automated: the clause-level and syntactic features were manually selected, and a greedy algorithm was developed for selecting the rest of the features for each category | 14,300 | 0.0025 | TF-IDF for unigrams | SVM with polynomial kernel | Intuitive lexical and emotional clues were manually translated to rules using regular expressions and sentiment analysis of the clauses | 2-fold cross validation | 0.47 |
| Duluth | Duluth-1 | Manual inspection combined with use of Ngram Statistics Package | Manual selection, looking for features uniquely associated with a particular emotion (based on intuition and Ngram Statistics Package output) | Approximately 1–30 rules per emotion, mainly consisting of unigram and bigram expressions | N/A | Rules for each emotion checked in order of frequency of emotion in training data, at most 2 emotions assigned | Human intuition | Perl regular expressions | N/A | 0.45 |
| European Bioinformatics Institute | ebi | Word unigrams and bigrams, POS, negation, grammatical relations (subject, verb, and object) | Using frequency as threshold | Unigram (1,379), Bigram (8,391), POS (6), GR (775), verb (550) | None | SVM, CRF, SVM + CRF | Yes | 9-fold cross validation | 0.456 | |
| LIMSI | LIMSI | SVM classifiers and manually-defined transducers | None | 160,272 | N/A | Combination of Binary and frequency weighting | LIBLINEAR SVM classifiers (one per emotion class) using following features: POS tags, General Inquirer, Heuristics, Unigrams, Bi-grams, Dependency Graphs, Affective Norms of English Words (ANEW) | Cascade of UNITEX transducers (one per emotion class) | 10-fold cross validation | 0.5383 |
| LT3, University College Ghent, Belgium | LT3 | MBSP shallow parser (lemma, POS), token tri-grams (highly frequent in positive instances), SentiWordNet and Wiebe Subjectivity clues scores | Experimental: manual compilation of 17 feature sets, experiments to determine best feature set per label | 5975 average (min 1747, max 6699) | 0.00270 average (min 0.00189, max 0.00426) | None | Binary SVM, one classifier per label | None | 50 bootstrap resampling rounds (3000 train, 1633 test) | 0.5331 |
| Microsoft Research Asia | eHuatuo | Spanning 1–4 grams and general 1–4 grams | Positive frequency is divided by negative frequency by leveraging Live-journal weblog information | 14428 selected features from spanning 1–4 grams | N/A | The confidence score from SVM | SVM classifier and pattern matching | No | 10-fold cross validation | 0.5899 |
| National Research Council Canada | NRC | Word unigrams and bigrams, thesaurus matches, character 4-grams, document length, various sentence-level patterns | None | 71061 | 608448/(71061 * 4633) = 0.00185 | Feature vectors normalized to unit length | Binary SVM; one-classifier-per-label | None | 10-fold cross validation | 0.5522 |
| Oslo | Oslo | Stems and bigrams from PorterStemmer; part-of-speech from TreeTagger; dependency patterns from MaltParser; first synsets from WordNet | No constraints | Mean = 28289.3; std. dev. = 18924.7 | Mean = 0.0017; std. dev. = 0.0008 | N/A | Six binary linear one-vs.-all cost-sensitive SVM classifiers | None | 10-fold cross-validated grid search over all permutations of feature types and cost factors | 0.54356 |
| SRI, UC Davis | Stanford Core-NLP generated POS tags, addressing features, unigrams & bigrams, LIWC (original and customized), emotion sequence and sentence position | Regularization in Log-Linear Model | On the order of thousands (comparable to text classification problems) | Very sparse (comparable to text classification problems) | Frequency counts | Log-linear model, tuned with L-BFGS, followed by single step self training | None | 5-fold cross validation | 0.49 | |
| UMAN | NLTK for significant uni-, bi- and tri-grams (likelihood measure), Stanford CoreNLP for NLP and NER, hand-crafted semantic lexicons, Flesh tool (for readability scores), Lingua-EN-Gender-1.013 (for gender feature) and manually written rules for sentence tense and some NER classes | genetic algorithm, Fast Correlation-Based Filter method and top 500 uni-, bi- and tri-grams | 1690 | 0.013 | None | Naive Bayes with kernel density estimation | Frozen/common layman expressions lexico-syntactic patterns using GATE/JAPE grammar | 5-fold cross validation | 0.5336 |