Roberto Wellington Acuña Caicedo1,2, José Manuel Gómez Soriano3, Héctor Andrés Melgar Sasieta2.
Abstract
The suicide of a person is a tragedy that deeply affects families, communities, and countries. Based on the standardized worldwide suicide rate per number of inhabitants, approximately 903,450 suicides and 18,069,000 unconsummated suicides are expected in 2022, affecting people of all ages, countries, races, beliefs, social and economic status, and sexes. The publication of suicidal intentions by users of social networks has prompted research in this field, aimed at detecting these users and encouraging them not to commit suicide. This study focused on determining a semi-supervised method to populate the Life Corpus, using a bootstrapping technique, to automatically detect and classify texts extracted from social networks and forums related to suicide and depression, starting from initial supervised samples. To carry out the experiments, we used two classifiers: a Support Vector Machine (SVM) with Bag of Words (BoW) features, with and without Term-Frequency/Inverse Document Frequency (Tf/Idf) term weighting and with or without stopwords, and Rasa with its default feature extraction system. In addition, we ran the experiments on five data collections: Life, Reddit, Life+Reddit, Life_en, and Life_en + Reddit. Using the semi-supervised method, we increased the size of the Life Corpus from 102 to 273 samples with texts from the social network Reddit; the Life+Reddit+BoW_Embeddings combination with the SVM classifier achieved a macro f1 of 0.80. These texts were in turn evaluated manually by annotators, with a Cohen's Kappa level of agreement of 0.86.
Keywords: Natural language processing; Social networks; Suicidal behavior; Suicidal ideation; Suicide prevention
Year: 2022 PMID: 35281704 PMCID: PMC8913319 DOI: 10.1016/j.invent.2022.100519
Source DB: PubMed Journal: Internet Interv ISSN: 2214-7829
Number of samples for each “Alert Level” type.
| Alert level | Quantity | EN | ES |
|---|---|---|---|
| No risk | 70 (68.6%) | 45 (63.4%) | 25 (80.6%) |
| Urgent | 19 (18.6%) | 15 (21.1%) | 4 (12.9%) |
| Possible | 8 (7.8%) | 6 (8.5%) | 2 (6.5%) |
| Immediate | 5 (4.9%) | 5 (7.0%) | 0 (0.0%) |
Number of samples for each “Alert Level” type.
| Alert Level | Quantity | EN | ES |
|---|---|---|---|
| No risk | 70 (68.63%) | 45 | 25 |
| Risk | 32 (31.37%) | 26 | 6 |
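The two-class table above appears to follow from the four-level one by merging every level other than "No risk" into a single "Risk" class (19 + 8 + 5 = 32). The Python sketch below reproduces that grouping. The merge rule is an assumption inferred from the totals, and the names `ALERT_COUNTS` and `to_binary` are illustrative rather than taken from the source.

```python
from collections import Counter

# Sample counts per alert level, copied from the four-level table.
ALERT_COUNTS = {"No risk": 70, "Urgent": 19, "Possible": 8, "Immediate": 5}


def to_binary(level: str) -> str:
    """Map a four-level alert label to Risk / No risk (assumed merge rule)."""
    return "No risk" if level == "No risk" else "Risk"


# Accumulate the four-level counts into the two binary classes.
binary_counts = Counter()
for level, count in ALERT_COUNTS.items():
    binary_counts[to_binary(level)] += count

total = sum(binary_counts.values())
for label, count in binary_counts.items():
    print(f"{label}: {count} ({100 * count / total:.2f}%)")
```

Under this assumed rule the sketch yields 70 "No risk" and 32 "Risk" samples out of 102, matching the quantities in the two-class table.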
Fig. 1. System workflow scheme. The system was evaluated using the original Life Corpus and the translated Life Corpus. The system is composed of three processes: i) translation, ii) bootstrapping corpus expansion, and iii) reviewing, building, and evaluating the final supervised corpus.
Number of samples and threshold by iteration.
| Iteration | Num samples | Threshold |
|---|---|---|
| 1 | 102 | 0.8 |
| 2 | 225 | 0.96 |
| 3 | 302 | 0.992 |
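The iteration table suggests a self-training loop in which the acceptance threshold is raised on every pass. Below is a minimal sketch of such a loop, assuming the threshold is a minimum classifier confidence for accepting an automatically labeled text (the record does not spell this out). The helper `train_keyword_model` is a hypothetical toy stand-in for the SVM, used only to keep the example self-contained. The thresholds in the usage lines are toy values, not the ones from the table.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Sample = Tuple[str, str]                    # (text, label)
Model = Callable[[str], Tuple[str, float]]  # text -> (label, confidence)


def bootstrap(seed: List[Sample], unlabeled: Sequence[str],
              train: Callable[[List[Sample]], Model],
              thresholds: Sequence[float]) -> List[Sample]:
    """Grow a labeled corpus by self-training, one pass per threshold."""
    corpus = list(seed)
    pool = list(unlabeled)
    for threshold in thresholds:
        model = train(corpus)  # retrain on everything accepted so far
        remaining = []
        for text in pool:
            label, confidence = model(text)
            if confidence >= threshold:
                corpus.append((text, label))  # accept the automatic label
            else:
                remaining.append(text)        # keep for a later pass
        pool = remaining
    return corpus


def train_keyword_model(samples: List[Sample]) -> Model:
    """Hypothetical toy classifier: remembers the words seen under each label."""
    vocab: Dict[str, Set[str]] = {}
    for text, label in samples:
        vocab.setdefault(label, set()).update(text.lower().split())

    def predict(text: str) -> Tuple[str, float]:
        words = set(text.lower().split())
        # Confidence = fraction of the text's words already seen under a label.
        scores = {label: len(words & seen) / max(len(words), 1)
                  for label, seen in vocab.items()}
        best = max(scores, key=scores.get)
        return best, scores[best]

    return predict


seed = [("i want to die", "risk"), ("lovely sunny day", "no_risk")]
unlabeled = ["i want to disappear", "sunny day outside",
             "want to disappear forever", "random unrelated words"]
expanded = bootstrap(seed, unlabeled, train_keyword_model, thresholds=[0.6, 0.7])
```

In this toy run, "want to disappear forever" is rejected in the first pass and accepted in the second, once the retrained model has seen "disappear". "random unrelated words" is never accepted. In the study, the automatically added texts were additionally reviewed by human annotators.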
Agreements between reviewers.
| Group | Reviewer | Suicide text | Mutual agreement | Cohen's Kappa |
|---|---|---|---|---|
| 1 | Reviewer GC | 38/12 | 38/3 | 0.82 |
| | Reviewer RA | 47/3 | | |
| 2 | Reviewer CC | 46/4 | 46/4 | 1.00 |
| | Reviewer RA | 46/4 | | |
| 3 | Reviewer KM | 45/5 | 41/3 | 0.88 |
| | Reviewer RA | 47/3 | | |
| 4 | Reviewer AR | 32/18 | 26/10 | 0.72 |
| | Reviewer JG | 31/19 | | |
| | Reviewer RA | 35/15 | | |
| Total | | 364/87 | 151/20 | 0.86 |
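For readers unfamiliar with the agreement statistic in the table, the following is a minimal sketch of the standard two-rater Cohen's kappa, computed from per-text labels. The function name `cohen_kappa` and the label strings are illustrative. The record does not say how the authors computed the group-4 and total values, which involve three reviewers, so this sketch covers only the pairwise case.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c]
                   for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Assumed reading of group 2: "46/4" is 46 suicide vs. 4 other texts,
# with both reviewers agreeing on every one of the 50 texts.
reviewer_cc = ["suicide"] * 46 + ["other"] * 4
reviewer_ra = ["suicide"] * 46 + ["other"] * 4
kappa_group_2 = cohen_kappa(reviewer_cc, reviewer_ra)
```

Under that reading, full agreement gives a kappa of 1.0, consistent with the group 2 row. The other rows cannot be recomputed from the table alone, because the per-text labels are not given.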
Corpus used in experiments.
| Corpus | Risk | Not risk | Total |
|---|---|---|---|
| Corpora in English only | | | |
| Life | 30 | 72 | 102 |
| Reddit | 153 | 18 | 171 |
| Life+Reddit | 183 | 90 | 273 |
| Corpora in Spanish and English | | | |
| Life_es_en | 30 | 72 | 102 |
| Life_es_en + Reddit | 183 | 90 | 273 |
Fig. 2. The vertical dotted line marks the original f-measure result obtained with the Life Corpus.
Macro f1, macro precision, and macro recall. Corpus in the English language combined with the training features. The confidence interval was calculated with p < .01.
| Features | Macro f1 | Macro precision | Macro recall |
|---|---|---|---|
| Life | | | |
| Rasa | 0.49 ± 0.02 | 0.52 ± 0.03 | 0.53 ± 0.02 |
| BoW | 0.43 ± 0.02 | 0.40 ± 0.03 | 0.50 ± 0.01 |
| BoW + Embeddings | 0.42 ± 0.02 | 0.39 ± 0.03 | 0.51 ± 0.02 |
| Tf/Idf | 0.41 ± 0.01 | 0.35 ± 0.01 | 0.51 ± 0.01 |
| Tf/Idf + Embeddings | 0.41 ± 0.01 | 0.35 ± 0.01 | 0.51 ± 0.01 |
| Reddit | | | |
| Rasa | 0.55 ± 0.03 | 0.54 ± 0.04 | 0.57 ± 0.03 |
| BoW | 0.51 ± 0.02 | 0.48 ± 0.03 | 0.55 ± 0.02 |
| BoW + Embeddings | 0.50 ± 0.02 | 0.47 ± 0.02 | 0.54 ± 0.02 |
| Tf/Idf | 0.52 ± 0.03 | 0.50 ± 0.03 | 0.55 ± 0.03 |
| Tf/Idf + Embeddings | 0.52 ± 0.03 | 0.49 ± 0.03 | 0.55 ± 0.03 |
| Life + Reddit | | | |
| Rasa | 0.65 ± 0.01 | 0.76 ± 0.02 | 0.66 ± 0.01 |
| BoW | 0.77 ± 0.01 | 0.77 ± 0.01 | 0.79 ± 0.01 |
| BoW + Embeddings | 0.79 ± 0.01 | 0.80 ± 0.01 | 0.79 ± 0.01 |
| Tf/Idf | 0.53 ± 0.04 | 0.72 ± 0.06 | 0.58 ± 0.05 |
| Tf/Idf + Embeddings | 0.51 ± 0.02 | 0.68 ± 0.04 | 0.56 ± 0.01 |
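The macro-averaged scores in the table give every class equal weight regardless of its size. The sketch below shows how macro f1, precision, and recall are conventionally computed from gold and predicted labels. The function name `macro_scores` and the label strings are illustrative, and the record does not state which implementation the authors used.

```python
from typing import Hashable, Sequence, Tuple


def macro_scores(y_true: Sequence[Hashable],
                 y_pred: Sequence[Hashable]) -> Tuple[float, float, float]:
    """Return (macro f1, macro precision, macro recall) over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (no predictions or no gold items) count as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(f1s) / n, sum(precisions) / n, sum(recalls) / n


# Arithmetic reference point: a predictor that always answers "no_risk"
# on a 72 / 30 split (the Life row of the corpus table).
gold = ["no_risk"] * 72 + ["risk"] * 30
always_majority = ["no_risk"] * 102
baseline_f1, baseline_precision, baseline_recall = macro_scores(gold, always_majority)
```

On that split, the always-majority predictor scores roughly 0.41 macro f1, 0.35 macro precision, and 0.50 macro recall. This is only a reference point for reading the table, not a claim about how the classifiers in the study behaved.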
Fig. 3. The vertical dotted line marks the original f-measure result obtained with the Life Corpus.
Macro f1, macro precision, and macro recall. Corpus in English and Spanish language combined with training features. The confidence interval was calculated with p < .01.
| Features | Macro f1 | Macro precision | Macro recall |
|---|---|---|---|
| Life_es_en | | | |
| Rasa | 0.48 ± 0.03 | 0.51 ± 0.03 | 0.50 ± 0.02 |
| BoW | 0.40 ± 0.01 | 0.35 ± 0.02 | 0.49 ± 0.01 |
| BoW + Embeddings | 0.43 ± 0.02 | 0.40 ± 0.04 | 0.51 ± 0.01 |
| Tf/Idf | 0.42 ± 0.02 | 0.37 ± 0.02 | 0.51 ± 0.01 |
| Tf/Idf + Embeddings | 0.43 ± 0.02 | 0.37 ± 0.02 | 0.52 ± 0.01 |
| Life_es_en + Reddit | | | |
| Rasa | 0.67 ± 0.02 | 0.78 ± 0.02 | 0.67 ± 0.01 |
| BoW | 0.78 ± 0.01 | 0.78 ± 0.01 | 0.79 ± 0.01 |
| BoW + Embeddings | 0.80 ± 0.01 | 0.81 ± 0.01 | 0.81 ± 0.01 |
| Tf/Idf | 0.63 ± 0.01 | 0.78 ± 0.02 | 0.64 ± 0.01 |
| Tf/Idf + Embeddings | 0.62 ± 0.01 | 0.76 ± 0.02 | 0.63 ± 0.01 |