Eldar Yeskuatov, Sook-Ling Chua, Lee Kien Foo.
Abstract
Suicide is a major public-health problem that exists in virtually every part of the world. Hundreds of thousands of people commit suicide every year. The early detection of suicidal ideation is critical for suicide prevention. However, there are challenges associated with conventional suicide-risk screening methods. At the same time, individuals contemplating suicide are increasingly turning to social media and online forums, such as Reddit, to express their feelings and share their struggles with suicidal thoughts. This has prompted research that applies machine learning and natural language processing techniques to detect suicidality among social media and forum users. The objective of this paper is to investigate the methods employed to detect suicidal ideation on Reddit. To achieve this objective, we conducted a literature review of recent articles detailing machine learning and natural language processing techniques applied to Reddit data to detect the presence of suicidal ideation. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, we selected 26 recent studies published between 2018 and 2022. The findings of the review outline the prevalent methods of data collection, data annotation, data preprocessing, feature engineering, model development, and evaluation. Furthermore, we present several Reddit-based datasets utilized to construct suicidal ideation detection models. Finally, we conclude by discussing the current limitations and future directions in the research of suicidal ideation detection.
Keywords: machine learning; natural language processing; suicidal ideation detection; text mining
Year: 2022 PMID: 36011981 PMCID: PMC9407719 DOI: 10.3390/ijerph191610347
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 4.614
Figure 1. Study selection flowchart based on PRISMA guidelines.
Data source: (a) post level and (b) user level.

(a) Post level

| Dataset | Collected data | Manually annotated data | Annotation method | Classes |
|---|---|---|---|---|
| Aladağ et al., 2018 | 508,398 posts | 785 posts | Experts | Suicidal, non-suicidal |
| Ji et al., 2018 | 3549 suicidal posts | NA | Community affiliation | Suicidal, non-suicidal |
| Yao et al., 2020 | NA | 500 r/Opiates posts | Crowdsourcing | Opioid addiction, no opioid addiction, suicide risk, no suicide risk |
| Reddit SuicideWatch and Mental Health Collection by Ji et al., 2021 | 54,412 posts | NA | Community affiliation | r/Depression, r/SuicideWatch, r/Anxiety, r/Offmychest, r/Bipolar |
| Nikhileswar et al., 2021 | 116,037 suicidal posts | NA | Community affiliation | Suicidal, non-suicidal |

(b) User level

| Dataset | Collected data | Manually annotated data | Annotation method | Classes |
|---|---|---|---|---|
| UMD Reddit Suicidality Dataset v2 by Shing et al., 2018 | 11,129 r/SuicideWatch users | 866 r/SuicideWatch users | Crowdsourcing, experts | No risk, low risk, moderate risk, severe risk |
| Reddit C-SSRS Suicide Dataset by Gaur et al., 2019 | NA | 500 users (15,755 posts) | Experts | Indicator, ideation, behavior, attempt, supportive |
| Reddit C-SSRS Suicide Dataset v2 by Gaur et al., 2021 | NA | 448 users (7327 posts) | Experts | Ideation, behavior, attempt, supportive |
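
Several of the datasets above list "community affiliation" as the annotation method, meaning a post inherits its label from the subreddit it was submitted to rather than from a human annotator. The following is a minimal sketch of that idea, not the procedure of any specific study. The `SUICIDAL_SUBREDDITS` set, the `label_by_community` function, and the sample posts are all hypothetical and exist only for illustration.

```python
# Assumption: posts from r/SuicideWatch count as "suicidal" and posts from
# any other subreddit count as "non-suicidal". Real studies may use a
# different mapping; this set is purely illustrative.
SUICIDAL_SUBREDDITS = {"suicidewatch"}


def label_by_community(post):
    """Assign a label from the subreddit a post was submitted to."""
    name = post["subreddit"].lower()
    if name.startswith("r/"):
        name = name[2:]  # drop the "r/" prefix before the lookup
    return "suicidal" if name in SUICIDAL_SUBREDDITS else "non-suicidal"


# Hypothetical posts; the text field is a placeholder, not real data.
posts = [
    {"id": "a1", "subreddit": "r/SuicideWatch", "text": "placeholder"},
    {"id": "b2", "subreddit": "r/books", "text": "placeholder"},
    {"id": "c3", "subreddit": "SuicideWatch", "text": "placeholder"},
]

labeled = [(post["id"], label_by_community(post)) for post in posts]
```

Labels produced this way are cheap to obtain at scale, which is consistent with the large post counts in the table, but they describe where a post appeared rather than what it says, so they are noisier than expert or crowdsourced annotations.
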
Summary of machine learning and natural language processing techniques.
| Study | Feature Extraction Techniques | Machine Learning Algorithms | Embedding Techniques | Deep Learning Algorithms | Best Performing Model | Metric and Result |
|---|---|---|---|---|---|---|
| Shing et al., 2018 | BOW, Empath, Readability Index, Syntactic features, LDA, LIWC, NRC Lexicon, mentalDisLex (Mental Disease Lexicon) | SVM, LR, XGBoost | SkipGram | CNN | SVM | Macro |
| Aladağ et al., 2018 | TF–IDF, LIWC, Sentiment | ZeroR, LR, RF, SVM | NA | NA | LR, | Accuracy = 0.920 |
| Ji et al., 2018 | Statistics, Part of Speech Tags, LIWC, TF–IDF, LDA | SVM, RF, Gradient Boost Decision Tree, XGBoost, | Word2Vec | MLFFNN, LSTM | XGBoost | Accuracy = 0.957 |
| Allen et al., 2019 | LIWC | NA | GloVe | CNN | CNN used with LIWC | Macro |
| Ambalavanan et al., 2019 | NA | NA | BERT | LSTM | BERT-Softmax | Macro |
| Bitew et al., 2019 | TF–IDF, DeepMoji pre-trained model, | LR, SVM, | NA | NA | LR | Macro |
| Chen et al., 2019 | Sentiments, LIWC, | SVM | NA | NA | SVM | Macro |
| Gaur et al., 2019 | Sentiments with AFINN, TF–IDF, Statistics, Syntactic | SVM, RF | ConceptNet | FFNN, CNN | CNN | Graded Recall = 0.600 |
| González Hevia et al., 2019 | TF–IDF, NRC VAD Lexicon | SVM, LR | Multilingual Word Embedding, Doc2Vec | RNN | SVM-LR | Macro |
| Iserman et al., 2019 | Sentiments with AFINN, Hu & Liu, General Inquirer, labMT, LIWC, Lusi, Moral Foundations, Netspeak, NRC Affect Intensity Lexicon, Senticnet, | LR, RF, DT | NA | NA | DT | Macro |
| Matero et al., 2019 | Affect & Intensity Lexicon, NRC VAD Lexicon, Age & Gender Lexicon, Big-5 Personality Lexicon, Anxiety, Anger & Depression Lexicon, LDA, Statistics | LR | BERT | LSTM | LSTM-Attention | Macro |
| Mohammadi et al., 2019 | NA | SVM | GloVe, Embeddings from Language Model | CNN, RNN, LSTM, GRU, | Ensemble model consisting of CNN, Bi-RNN, Bi-LSTM, Bi-GRU and SVM | Macro |
| Morales et al., 2019 | BOW, TF–IDF, LDA, POS, Named-Entity Recognition, IBM Watson Personality Insights API, IBM Watson Tone Analyzer | RF, NB, KNN, SVM | SkipGram, FastText | CNN, LSTM, | CNN | Macro |
| Ríssola et al., 2019 | TF–IDF, LIWC, Statistics | LR, SVM, RF | GloVe | NA | LR | Macro |
| Ruiz et al., 2019 | Clinical Text Analysis and Knowledge Extraction System, Social Determinant of Health, NRC Word-Emotion Association Lexicon, Readability Index, Semantic Role Labeling, Sentiments, LDA, Empathy | NB, GB, RF, SVM, | Doc2Vec | CNN, LSTM, | Ensemble model consisting of NB, SVM, GB | Macro |
| Jones et al., 2019 | Suicide Risk Factor Lexicon, LDA | RF, LR, SVM | FLAIR, GloVe | NA | RF | |
| Tadesse et al., 2019 | Statistics, TF–IDF, BOW, | RF, SVM, NB, XGBoost | Word2Vec | LSTM, CNN | LSTM-CNN | Accuracy = 0.938 |
| Shah et al., 2020 | TF–IDF, N-Gram, LIWC | NB, SVM, KNN, RF | NA | NA | NB | Accuracy = 0.736 |
| Yao et al., 2020 | TF–IDF | LR, RF, SVM, | GloVe, FastText | RNN, CNN | CNN | |
| Haque et al., 2020 | NA | NA | GloVe, BERT | LSTM | BERT with Softmax Layer | Accuracy = 0.952 |
| Kumar et al., 2021 | NA | NB, LR, SVM | GloVe | LSTM, GRU | Bi-GRU with Multiplicative Attention | Micro |
| Rabani et al., 2021 | TF–IDF, BOW, Statistics, LDA, POS | NB, DT, LR, SVM, | NA | NA | DT | |
| Gaur et al., 2021 | NA | NA | ConceptNet | CNN, LSTM | CNN-LSTM | AUC = 0.640 |
| Ji et al., 2021 | Sentiments, LIWC, LDA | NA | GloVe, FastText | CNN, LSTM, Structured Self-Attentive Sentence Embedding, Relation Network | Relation Network | |
| Nikhileswar et al., 2021 | TF–IDF, BOW | XGBoost, SVM | Universal Sentence Encoder, Word2Vec, | LSTM, CNN, FCNN | FCNN used with Universal Sentence Encoder | Accuracy = 0.942 |
| Renjith et al., 2021 | TF–IDF | SVM | Word2Vec | LSTM, CNN | LSTM-Attention-CNN | Accuracy = 0.903 |
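
The "Metric and Result" column mixes accuracy with macro-averaged scores, and the two are not directly comparable when classes are imbalanced. The sketch below illustrates the difference in plain Python, assuming the standard definitions of accuracy and macro-averaged F1. The function names and the toy labels are hypothetical and are not taken from any reviewed study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 score of one class, treating that class as the positive one."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    # Convention chosen here: a class with no true or predicted
    # instances scores 0.0 instead of raising a division error.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy imbalanced example: 8 non-suicidal posts and 2 suicidal posts,
# with a classifier that always predicts the majority class.
y_true = ["non-suicidal"] * 8 + ["suicidal"] * 2
y_pred = ["non-suicidal"] * 10

acc = accuracy(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
```

In this toy case the majority-class predictor reaches an accuracy of 0.8 while its macro F1 is only about 0.44, because the minority class contributes an F1 of zero. This is one reason to be cautious when ranking the accuracy-based and macro-based results in the table against each other.
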