John Pestian, Henry Nasrallah, Pawel Matykiewicz, Aurora Bennett, Antoon Leenaars.
Abstract
Suicide is the second leading cause of death among 25-34 year olds and the third leading cause of death among 15-25 year olds in the United States. In the Emergency Department, where suicidal patients often present, estimating the risk of repeated attempts is generally left to clinical judgment. This paper presents our second attempt to determine the role of computational algorithms in understanding a suicidal patient's thoughts, as represented by suicide notes. We focus on developing natural language processing methods that distinguish between genuine and elicited suicide notes. We hypothesize that machine learning algorithms can categorize suicide notes as well as mental health professionals and psychiatric physician trainees do. The data comprise suicide notes from 33 suicide completers, matched to 33 elicited notes from healthy control group members. Eleven mental health professionals and 31 psychiatric trainees were asked to decide whether each note was genuine or elicited. Their decisions were compared to those of nine different machine learning algorithms. The results indicate that trainees accurately classified notes 49% of the time, mental health professionals accurately classified notes 63% of the time, and the best machine learning algorithm accurately classified the notes 78% of the time. This is an important step in developing an evidence-based predictor of repeated suicide attempts because it shows that natural language processing can aid in distinguishing between classes of suicide notes.
Year: 2010 PMID: 21643548 PMCID: PMC3107011 DOI: 10.4137/bii.s4706
Source DB: PubMed Journal: Biomed Inform Insights ISSN: 1178-2226
Feature selection process and results.
| Feature type | Initial features | Removal of high correlations | Removal of rare features | Removal of low information gain features |
|---|---|---|---|---|
| Words | 993 | 993 | 153 | 42.7 (0.13) |
| P.O.S. | 36 | 32 | 24 | 18.47 (0.11) |
| Concepts | 32 | 32 | 8 | 2.82 (0.10) |
| Read. Score | 2 | 2 | 2 | 2 (0) |
| Total | 1063 | 1053 | 186 | 66 (0) |
Note: Feature selection was performed on the training set only.
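The three filtering stages named in the table columns can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the thresholds (`max_corr`, `min_docs`, `min_gain`), the use of Pearson correlation, and the present/absent binarization used for information gain are all assumptions, since the source does not state them.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting labels on a discrete feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def pearson(a, b):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def select_features(columns, labels, max_corr=0.95, min_docs=2, min_gain=0.05):
    """columns maps a feature name to its values, one per training note.

    All three thresholds are hypothetical defaults chosen for illustration.
    """
    kept = []
    # Stage 1: drop a feature that is highly correlated with one already kept.
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < max_corr for k in kept):
            kept.append(name)
    # Stage 2: drop rare features (non-zero in fewer than min_docs notes).
    kept = [k for k in kept if sum(1 for v in columns[k] if v != 0) >= min_docs]
    # Stage 3: drop features whose presence/absence carries little information.
    return [k for k in kept
            if information_gain([v != 0 for v in columns[k]], labels) >= min_gain]
```

Consistent with the table note, such a filter would be fitted on the training notes only, so that the held-out notes do not influence which features survive.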
Figure 1. Partial suicide ontology.
Genuine and elicited notes descriptive statistics.
| Description | Genuine notes mean (SD) | Elicited notes mean (SD) | P-value |
|---|---|---|---|
| Number of sentences | 9.242 (8.058) | 4.848 (3.308) | 0.001 |
| Maximal frequency of a word | 7.909 (6.090) | 4.212 (3.525) | 0.002 |
| Number of non-word characters | 13.182 (11.078) | 7.212 (7.039) | 0.004 |
| Standard deviation of word | 1.102 (0.809) | 0.632 (0.519) | 0.005 |
| Flesch-Kincaid grade level | 4.719 (2.142) | 6.517 (2.994) | 0.008 |
| Cardinal numbers per note | 1.121 (1.867) | 0.182 (0.465) | 0.015 |
| Verb, past tense per note | 4.848 (9.398) | 1.152 (1.787) | 0.018 |
| Personal pronoun per note | 16.030 (16.318) | 8.879 (8.112) | 0.019 |
| Flesch reading ease per note | 81.776 (7.566) | 75.949 (11.201) | 0.024 |
| Number of paragraphs per note | 1.909 (1.508) | 1.333 (0.777) | 0.027 |
| Number of tokens per note | 122.000 (108.096) | 73.364 (63.465) | 0.031 |
| Number of words per note | 108.818 (97.944) | 66.152 (56.801) | 0.041 |
| Adjective, superlative per note | 0.485 (0.795) | 0.182 (0.584) | 0.042 |
| Verb, non-3rd person singular present per note | 4.727 (3.476) | 3.394 (3.297) | 0.072 |
Abbreviation: SD, standard deviation.
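The two readability rows in the table (Flesch-Kincaid grade level and Flesch reading ease) follow standard published formulas based on word, sentence, and syllable counts. The sketch below shows those formulas; the tokenization and the vowel-group syllable counter are crude assumptions made here for illustration, and the source does not say which tool the authors used to compute their scores.

```python
import re

def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch reading ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid grade level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def count_syllables(word):
    """Rough heuristic (assumption): one syllable per group of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    """Return (reading ease, grade level) using naive counting rules."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (flesch_reading_ease(len(words), len(sentences), syllables),
            flesch_kincaid_grade(len(words), len(sentences), syllables))
```

Both formulas move in opposite directions by construction, which matches the table: the genuine notes have the lower mean grade level and the higher mean reading ease.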
Human and machine rater accuracy after 25 bootstraps.
| Rater | Mean accuracy (SD) |
|---|---|
| Human Raters | |
| Psychiatric Physician Trainees | 0.510 (0.002) |
| Mental Health Professionals | 0.609 |
| Machine Model | |
| LMT | 0.744 |
| LinSMO | 0.705 |
| Decision | 0.667 |
| JRip | 0.661 |
| NB | 0.645 |
| PART | 0.645 |
| J48 | 0.640 |
| Logistic | 0.633 |
| IB3 | 0.623 |
| OneR | 0.605 (0.014) |
Note: The difference between trainees and professionals is significant (P ≤ 0.001).
Note: The difference between professionals and machine is significant (P ≤ 0.001).
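One plausible way to obtain a "mean accuracy (SD)" figure over 25 bootstraps is sketched below. It assumes that each bootstrap resamples the per-note right/wrong decisions with replacement; the source does not describe the exact resampling scheme, so the function name, the seed, and the procedure itself are hypothetical.

```python
import random
import statistics

def bootstrap_accuracy(correct, n_boot=25, seed=0):
    """Mean and SD of accuracy over n_boot resamples of per-note outcomes.

    correct: list of 1 (note classified correctly) or 0 (incorrectly).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(correct)
    accuracies = []
    for _ in range(n_boot):
        # Draw n outcomes with replacement and score the resample.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        accuracies.append(sum(sample) / n)
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

For a machine model, the bootstrap would more likely resample the training and test split itself and retrain on each draw; the summary step (mean and standard deviation of the 25 accuracies) is the same in either case.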
Logistic Model Tree when all features and all suicide notes are used for training.
| Genuine note equation | Elicited note equation |
|---|---|
| f = 0.04 + | f = -0.04 + |
| “Flesch-Kincaid grade level” | “Flesch-Kincaid grade level” |
| “maximal depth of a sentence” | “maximal length of a sentence” |
| “mean depth of a sentence” | “mean depth of a sentence” |
| “mean length of a sentence” | “mean length of a sentence” |
| “standard deviation of a length of a sentence” | “standard deviation of a length of a sentence” |
| “comma frequency” | “comma frequency” |
| “other punctuation frequency” | “other punctuation frequency” |
| “cardinal number frequency” | “cardinal number frequency” |
| “proper noun, singular frequency” | “proper noun, singular frequency” |
| “symbol frequency” | “symbol frequency” |
| “verb, gerund or present participle frequency” | “verb, gerund or present participle frequency” |
| “verb, past participle frequency” | “verb, past participle frequency” |
| “wh-pronoun frequency” | “wh-pronoun frequency” |
| “maximal frequency of a word” | “maximal frequency of a word” |
Note: The absolute value of a coefficient represents the strength of the feature.
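As a reading aid, the two equations in the table can be interpreted in the usual logistic model tree form, where each class has a linear function of the features and the class probabilities come from a softmax over those functions. This is a sketch under that assumption: the symbols $\beta_{c,i}$ and $x_i$ are notation introduced here, only the intercepts $0.04$ and $-0.04$ come from the table, and the feature coefficients are left symbolic.

```latex
\begin{align*}
f_{\text{genuine}}(x)  &= 0.04 + \sum_{i} \beta_{\text{genuine},i}\, x_i \\
f_{\text{elicited}}(x) &= -0.04 + \sum_{i} \beta_{\text{elicited},i}\, x_i \\
P(c \mid x) &= \frac{e^{f_c(x)}}{e^{f_{\text{genuine}}(x)} + e^{f_{\text{elicited}}(x)}},
\qquad c \in \{\text{genuine}, \text{elicited}\}
\end{align*}
```

Under this reading, a note is labeled genuine when $f_{\text{genuine}}(x) > f_{\text{elicited}}(x)$, and a feature with a larger $|\beta_{c,i}|$ shifts the decision more per unit change in $x_i$.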
Figure 2. Hyperspace example.