Jingfang Liu, Mengshi Shi, Huihong Jiang.
Abstract
Suicide has become a serious problem, and its prevention is an important research topic. Social media provides an ideal platform for monitoring suicidal ideation. This paper presents an integrated model for multidimensional information fusion. By integrating the best classification models determined for single and multiple features, different feature information is combined to better identify suicidal posts in online social media. The approach was assessed on a dataset of 40,222 annotated Weibo posts. By integrating the best classification models of single and multidimensional features, the proposed model ((BSC + RFS)-fs, WEC-fs) achieved 80.61% accuracy and a 79.20% F1-score. Other representative text representation methods and demographic factors related to suicide may also be important predictors of suicide, but they were not considered in this study. To the best of our knowledge, this is a good attempt at fusing feature combination and ensemble algorithms to detect user-generated content with suicidal ideation. The findings suggest that feature combinations do not always work well, and that an appropriate combination strategy can make classification models work better. Different functional carriers contain different information, and a targeted choice of classification model may improve the detection rate of suicidal ideation.
Keywords: China; Weibo; ensemble method; machine learning; multi-feature fusion; social media; suicidal ideation detection
Year: 2022 PMID: 35805856 PMCID: PMC9266694 DOI: 10.3390/ijerph19138197
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 4.614
A critical review of different suicide ideation detection studies on social media.
| References | Features Extracted | Methodology Used | Social Media | Performance and Drawback(s) |
|---|---|---|---|---|
| [ | Vocabulary | Naive Bayes | | Performance: Accuracy is 0.6315 in Leave One |
| [ | Word bags, Polarity dictionary, LSA topic model, Named entities | LIBSVM | Dutch-language forum | Performance: F1 is 0.93 for relevant messages, 0.70 for severe messages. |
| [ | Simplified Chinese-Linguistic Inquiry and Word count | Logistic regression, SVM | | Performance: The overall classification performance was not satisfactory and could only be classified among those with high probability of suicide (AUC = 0.61, |
| [ | TFIDF, Word frequencies, Information retrieval | SVM | | Performance: Accuracy is 0.76 when sets A and B were combined. |
| [ | Demographic features, Emotion labels, | Logistic regression | | Performance: F1 is 0.53. |
| [ | N-gram, Word vectors, Document vectors | Random Forest, SVM, CEM, deep learning, ensemble models | Microblogging and movie reviews domain | Performance: The F1-scores are 0.7302, 0.6379, 0.7532, 0.7181, 0.8120 and 77.41 in the six datasets, respectively. |
| [ | TFIDF, Linguistic inquiry and Word count, and Sentiment analysis | Logistic regression, random forest, and SVM | | Performance: Logistic regression (F1: 0.78–0.92, Accuracy: 0.76–0.92); random forest (F1: 0.75–0.92, Accuracy: 0.71–0.89); SVM (F1: 0.73–0.92, Accuracy: 0.76–0.92). |
Examples of suicidal ideation and non-suicidal ideation posts.
| Category | Example |
|---|---|
| Suicidal ideation | Anybody here? What if I swallowed 6 Escitalopram Oxalate tablets and 2 Zopiclone tablets? |
| | I hide in the wardrobe with a knife in my hand that can cut off the carotid artery at any time. I don’t want to work with you or see you. Don’t talk to me or save me. |
| Non-suicidal ideation | Today, my throat hurts more and more. I’m afraid my body is getting worse and worse. If I die, no one will care. |
| | I still find it extremely painful to be alive. |
Figure 1. Suicide ideation detection framework. Note: BSC = basic statistical characteristics, RFS = risk factors for suicide, WEC = word embedding clustering.
Optimal model performance of a single feature.
| Feature | Accuracy | F1-Score | Precision | Recall | Optimum Classifier |
|---|---|---|---|---|---|
| BSC | 74.49% | 72.41% | 77.72% | 67.78% | ET-g |
| BSC-fs | 74.65% | 72.34% | 78.28% | 67.24% | ET-e |
| RFS | 75.86% | 72.47% | 83.29% | 64.14% | SVM-l |
| RFS-fs | 76.19% | 72.77% | 84.04% | 64.17% | SVM-l |
| WEC | 69.81% | 62.98% | 80.99% | 51.52% | Log-l2 |
| WEC-fs | 71.32% | 65.48% | 81.89% | 54.55% | NB |
-fs = feature selection.
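The F1-score column can be cross-checked against the precision and recall columns, since F1 is the harmonic mean of the two. The snippet below is a minimal sketch of that check; the function name `f1_from_precision_recall` is illustrative and not taken from the paper, and the inputs are the percentages copied from the BSC and RFS rows.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (result is in the same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) copied from the table rows.
bsc_f1 = round(f1_from_precision_recall(77.72, 67.78), 2)  # BSC row reports 72.41%
rfs_f1 = round(f1_from_precision_recall(83.29, 64.14), 2)  # RFS row reports 72.47%
print(bsc_f1, rfs_f1)
```

Both recomputed values agree with the reported F1-scores to two decimal places.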
Optimal model performance of multidimensional features.
| Features | Feature Combination | Accuracy | F1-Score | Precision | Recall | Optimum Classifier |
|---|---|---|---|---|---|---|
| BSC RFS | BSC + RFS | 78.06% | 76.07% | 82.76% | 70.38% | Log-l2 |
| | (BSC-fs) + (RFS-fs) | 78.56% | 76.84% | 82.44% | 71.95% | ET-g |
| | (BSC + RFS)-fs | 78.87% | 76.56% | 85.10% | 69.58% | SVM-l |
| BSC WEC | BSC + WEC | 77.24% | 74.84% | 82.60% | 68.41% | Log-l1 |
| | (BSC-fs) + (WEC-fs) | 77.60% | 75.28% | 82.97% | 68.89% | Log-l1 |
| | (BSC + WEC)-fs | 77.70% | 75.61% | 81.93% | 70.20% | ET-e |
| RFS WEC | RFS + WEC | 78.17% | 76.05% | 83.57% | 69.77% | SVM-l |
| | (RFS-fs) + (WEC-fs) | 78.89% | 76.94% | 84.05% | 70.94% | SVM-l |
| | (RFS + WEC)-fs | 78.83% | 76.90% | 83.78% | 71.06% | SVM-l |
| BSC RFS WEC | BSC + RFS + WEC | 79.20% | 77.03% | 84.71% | 70.63% | ET-g |
| | (BSC-fs) + (RFS-fs) + (WEC-fs) | 79.86% | 78.32% | 84.05% | 73.32% | SVM-l |
| | (BSC + RFS + WEC)-fs | 80.15% | 78.60% | 84.44% | 73.52% | SVM-l |
-fs = feature selection.
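The three notations in the Feature Combination column can be read as different orderings of concatenation and feature selection. The sketch below illustrates that reading under stated assumptions: the selector (keep the `k` highest-variance columns), the helper names `concat` and `select`, and the toy values are all hypothetical, and the paper's actual feature-selection method is not specified here.

```python
from itertools import chain
from statistics import pvariance


def concat(*blocks):
    """Join feature blocks column-wise: one row per post, features side by side."""
    return [list(chain.from_iterable(rows)) for rows in zip(*blocks)]


def select(block, k):
    """Hypothetical selector: keep the k columns with the highest variance."""
    cols = list(zip(*block))
    ranked = sorted(range(len(cols)), key=lambda j: pvariance(cols[j]), reverse=True)
    keep = sorted(ranked[:k])  # preserve the original column order
    return [[row[j] for j in keep] for row in block]


# Toy feature blocks for three posts (made-up values).
bsc = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
rfs = [[0.1], [0.2], [0.4]]

plain = concat(bsc, rfs)                                   # BSC + RFS
select_then_join = concat(select(bsc, 1), select(rfs, 1))  # (BSC-fs) + (RFS-fs)
join_then_select = select(concat(bsc, rfs), 2)             # (BSC + RFS)-fs
```

With these toy values, selecting per block keeps one column from each feature type, while selecting after concatenation keeps both BSC columns and drops RFS entirely, which is one way the two strategies can lead to different models.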
Suggested model performance on different feature sets.
| Feature Set | | Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| (BSC + RFS) + WEC | BSC + RFS, WEC | 79.40% | 77.70% | 83.70% | 72.50% |
| | (BSC-fs) + (RFS-fs), WEC-fs | 79.12% | 77.19% | 83.97% | 71.42% |
| | (BSC + RFS)-fs, WEC-fs | 80.61% | 79.20% | 84.58% | 74.46% |
| (BSC + WEC) + RFS | BSC + WEC, RFS | 79.80% | 78.27% | 83.64% | 73.55% |
| | (BSC-fs) + (WEC-fs), RFS-fs | 80.11% | 78.65% | 84.01% | 73.93% |
| | (BSC + WEC)-fs, RFS-fs | 79.93% | 77.85% | 85.68% | 71.33% |
| (RFS + WEC) + BSC | RFS + WEC, BSC | 79.55% | 77.43% | 85.15% | 70.99% |
| | (RFS-fs) + (WEC-fs), BSC-fs | 79.78% | 77.81% | 85.12% | 71.66% |
| | (RFS + WEC)-fs, BSC-fs | 79.67% | 77.64% | 84.58% | 71.75% |
| BSC + RFS + WEC | BSC, RFS, WEC | 78.92% | 76.54% | 85.06% | 69.57% |
| | BSC-fs, RFS-fs, WEC-fs | 80.15% | 78.17% | 85.73% | 71.84% |
-fs = feature selection.
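Each row of this table pairs models trained on different feature sets, for example one model on (BSC + RFS)-fs and another on WEC-fs. How the outputs of the paired models are fused is not stated in this excerpt, so the sketch below assumes simple probability averaging (soft voting) purely for illustration. The function name `fuse_probabilities` and all probability values are hypothetical.

```python
def fuse_probabilities(prob_lists, threshold=0.5):
    """Average per-model probabilities post by post, then threshold to a 0/1 label.

    Assumption: soft voting. The paper's actual fusion rule may differ.
    """
    fused = [sum(ps) / len(ps) for ps in zip(*prob_lists)]
    return [int(p >= threshold) for p in fused]


# Hypothetical predicted probabilities of suicidal ideation for four posts.
probs_bsc_rfs_fs = [0.9, 0.4, 0.2, 0.7]  # model trained on (BSC + RFS)-fs
probs_wec_fs = [0.7, 0.8, 0.1, 0.2]      # model trained on WEC-fs

labels = fuse_probabilities([probs_bsc_rfs_fs, probs_wec_fs])
print(labels)
```

In this toy case the second post is flagged only because the WEC-based model is confident, which shows how a second feature carrier can change the decision of the first.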
Performance results of multiple ensemble methods.
| Model | Parameter | Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| Random forest | criterion = ‘entropy’ | 76.78% | 74.03% | 82.23% | 67.32% |
| | criterion = ‘gini’ | 76.43% | 73.63% | 81.85% | 66.91% |
| XGBoost | booster = ‘gbtree’ | 77.24% | 76.01% | 79.60% | 72.73% |
| | booster = ‘gblinear’ | 73.26% | 71.65% | 75.83% | 67.91% |
| AdaBoost | base_estimator = tree | 77.82% | 76.78% | 79.81% | 73.97% |
| Bagging | base_estimator = tree | 77.07% | 75.01% | 80.96% | 69.87% |
| Gradient Boosting | / | 78.76% | 77.14% | 82.38% | 72.53% |
| Stacking | base_estimator = SVM-l, Log-l1, Log-l2, NB, ET-g, ET-e | 79.77% | 77.92% | 84.76% | 72.10% |
| Suggested model | / | 80.61% | 79.20% | 84.58% | 74.46% |