Gema Castillo-Sánchez1, Gonçalo Marques2,3, Enrique Dorronzoro4, Octavio Rivera-Romero4, Manuel Franco-Martín5, Isabel De la Torre-Díez2.
Abstract
According to the World Health Organization (WHO) report in 2016, around 800,000 individuals died by suicide. Moreover, suicide is the second leading cause of unnatural death among people between 15 and 29 years of age. This paper reviews the state of the art in the literature on the use of machine learning methods for suicide detection on social networks. Accordingly, the objectives, data collection techniques, development process, and validation metrics used for suicide detection on social networks are analyzed. The authors conducted a scoping review using the methodology proposed by Arksey and O'Malley, and the PRISMA protocol was adopted to select the relevant studies. This scoping review aims to identify the machine learning techniques used to predict suicide risk based on information posted on social networks. The databases used are PubMed, Science Direct, IEEE Xplore, and Web of Science. In total, 50% of the included studies (8/16) explicitly report the use of data mining techniques for feature extraction, feature detection, or entity identification. The most commonly reported method was Linguistic Inquiry and Word Count (4/8, 50%), followed by Latent Dirichlet Allocation, Latent Semantic Analysis, and Word2vec (2/8, 25% each). Non-negative Matrix Factorization and Principal Component Analysis were each used in only one of the included studies (1/8, 12.5%). In total, 3 out of 8 research papers (37.5%) combined more than one of those techniques. Support Vector Machine was implemented in 10 out of the 16 included studies (62.5%). Finally, 75% of the analyzed studies implement machine learning-based models using Python.
Keywords: Algorithm; Data mining; Machine learning; Natural language processing; Sentiment analysis; Social networks; Suicide
Year: 2020 PMID: 33165729 PMCID: PMC7649702 DOI: 10.1007/s10916-020-01669-5
Source DB: PubMed Journal: J Med Syst ISSN: 0148-5598 Impact factor: 4.460
Fig. 1 Flow diagram of the scoping review process
Fig. 2 The distribution of the included studies according to the year of publication
Data sources and ethics of the included studies
| References | Objectives | Data Sources | Ethics | Inclusion / Exclusion criteria | Time span | Num. posts | Num. participants | Data description |
|---|---|---|---|---|---|---|---|---|
| [ | Text Classification | OHC (subreddit) | Yes | N/A | ND | 874 | ND | ND |
| [ | Text Classification | GSN (Twitter) | No | Keywords | ND | 892 | ND | Based on the classifier outputs |
| [ | Text Classification | GSN (Twitter) | No | Keywords | 1 February 2014 to 15 March 2014 | 816 | ND | Class 1: 13%; Class 2: 5%; Class 3: 30%; Class 4: 6%; Class 5: 5%; Class 6: 15%; Class 7: 26% |
| [ | Text Classification | GSN (Twitter) | No | Keywords | ND | 1060 | ND | Binary: Suicide 156, Flippant 133; Three classes: Suicide 156, Flippant 133, Non-Suicide 771; Seven classes: Suicide 156, Campaign 158, Flippant 133, Support 178, Memorial 142, Reports 165, Other 128 |
| [ | Text Classification | GSN (Forums and Netlogs) | No | Keywords | ND | 10,040 | ND | Training: ( 189 irrelevant |
| [ | Text Classification+ Entity Recognition | GSN (Twitter) | No | Keywords | 26 June 2017 to 19 October 2017 | 3263 (Classification) 3000 (Recognition) | ND | Classification: 50% in training dataset; The same distribution of original data collected in the evaluation dataset |
| [ | Feature Extraction + Text Classification | GSN (Twitter) | No | Keywords | 1 January 2015 to 8 June 2016 | 12,066 | 3873 | Dataset 1: 280 HighRisk users, 1614 AtRisk users; Dataset 2: 280 HighRisk, 280 AtRisk (randomly selected); Dataset 3: 280 HighRisk, 285 AtRisk |
| [ | Theme Identification | OHC (r/SuicideWatch) | No | N/A | 2008–2016 | 131,728 | 63,252 | N/A |
| [ | Features Selection | GSN (Twitter and websites) | No | Keywords | 1 January 2012 to 31 December 2014 | 1241 | ND | Positive: 506; Negative: 735 |
| [ | Text Classification | OHC^a (Zoufan’s Sina microblog) | Yes | N/A | Training: 1 to 28 April 2017; Testing: 3 July 2017 to 3 July 2018 | Training: 27,007; Testing: 387,823 | ND | Training: Positive: 2786 (10%) |
| [ | Text Classification | GSN (Twitter) | Yes | Keywords | 18 February 2014 to 23 April 2014 | 1820 | ND | Set A: 27% safe to ignore; 55% possibly concerning; 18% strongly concerning. Set B: 31% safe to ignore; 58% possibly concerning; 11% strongly concerning |
| [ | Text Classification | GSN (Twitter, Facebook, Instagram, and forums) | No | ND | N/A | 102 | ND | No risk: 70 Urgent: 19 Possible: 8 Immediate: 5 |
| [ | Features Selection + Text Classification | GSN (Twitter, Tumblr, Reddit) + Forums | No | Keywords | 13 January 2018 to 26 March 2018 | 4314 | ND | SCO = 2726 pos + 18,290 neg (training); UNI = 1576 pos + 18,290 neg (training); H = 666 pos + 4130 neg (test) |
| [ | Text Classification | GSN (Twitter) | No | Keywords | ND | ND | ND | ND |
| [ | Emotion Recognition | GSN (Weibo) | No | N/A | ND | 1,100,000 | N/A | ND |
| [ | Features Extraction + Score Estimation | GSN (Weibo Sina) | No | Direct selection | ND | 2000 per user | 697 | N/A |
ND Not defined, N/A Not applicable, Data Sources: GSN Generic Social Network, OHC Online Health Community
^a A microblog focused on suicide
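The class counts reported for the 1060-tweet Twitter dataset in the table above can be turned into percentages to make the class balance easier to compare across its three-class and seven-class variants. The following is a minimal sketch of that arithmetic; the counts are copied from the table, while the variable names and the helper `class_shares` are illustrative and do not come from any of the included studies.

```python
# Class counts copied from the 1060-tweet Twitter row of the table above.
seven_class = {
    "Suicide": 156, "Campaign": 158, "Flippant": 133, "Support": 178,
    "Memorial": 142, "Reports": 165, "Other": 128,
}
three_class = {"Suicide": 156, "Flippant": 133, "Non-Suicide": 771}


def class_shares(counts):
    """Map each label to its percentage of all posts, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


seven_shares = class_shares(seven_class)
three_shares = class_shares(three_class)

for label, pct in three_shares.items():
    print(f"{label}: {pct}%")
```

Both variants sum to the 1060 posts listed for that study, so the two labelings describe the same set of tweets at different granularity.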
Techniques for feature extraction or selection used in the included studies
| Techniques | LIWC | LDA | LSA | NMF | Word2vec | PCA | Total |
|---|---|---|---|---|---|---|---|
| References | |||||||
| [ | • | • | 2 | ||||
| [ | • | 1 | |||||
| [ | • | • | • | 3 | |||
| [ | • | 1 | |||||
| [ | • | 1 | |||||
| [ | • | 1 | |||||
| [ | • | 1 | |||||
| [ | • | • | 2 | ||||
| Total Number of studies | 4 | 2 | 2 | 1 | 2 | 1 |
LIWC Linguistic Inquiry and Word Count, LDA Latent Dirichlet Allocation, LSA Latent Semantic Analysis, NMF Non-negative Matrix Factorization, PCA Principal Component Analysis
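LIWC, the most frequently reported technique in the table above, is based on counting how many words of a text fall into predefined categories. The following is a minimal sketch of that dictionary-counting idea in Python, not the LIWC software itself: the category names and word lists in `TOY_LEXICON` and the function `category_rates` are hypothetical placeholders chosen for illustration, and the real LIWC dictionaries are proprietary and far larger.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: category -> set of words (NOT the real LIWC dictionary).
TOY_LEXICON = {
    "negative_emotion": {"sad", "alone", "hopeless", "tired"},
    "first_person": {"i", "me", "my", "myself"},
    "social": {"friend", "family", "talk", "help"},
}


def category_rates(text):
    """Return, per category, the share of tokens in `text` that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {category: 0.0 for category in TOY_LEXICON}
    counts = Counter()
    for token in tokens:
        for category, words in TOY_LEXICON.items():
            if token in words:
                counts[category] += 1
    return {category: counts[category] / len(tokens) for category in TOY_LEXICON}


rates = category_rates("I feel tired and alone, but my friend said to talk")
for category, rate in rates.items():
    print(f"{category}: {rate:.2f}")
```

Rates of this kind could then serve as a fixed-length feature vector for a downstream classifier; how each included study actually combined such features is not specified here.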
Machine learning techniques used in the included studies
| Techniques | SVM | LR | LiR | DT | NB | KNN | RF | GBM | RoF | Km | PAM | HCA | AR | NN | DL | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| References | ||||||||||||||||
| [ | • | • | • | 3 | ||||||||||||
| [ | • | • | • | • | 4 | |||||||||||
| [ | • | • | • | • | 4 | |||||||||||
| [ | • | • | • | • | 4 | |||||||||||
| [ | • | 1 | ||||||||||||||
| [ | • | • | • | • | 4 | |||||||||||
| [ | • | • | 2 | |||||||||||||
| [ | • | 1 | ||||||||||||||
| [ | • | • | • | 3 | ||||||||||||
| [ | • | • | • | • | 4 | |||||||||||
| [ | • | • | 2 | |||||||||||||
| [ | • | • | • | 3 | ||||||||||||
| [ | • | • | • | • | • | 5 | ||||||||||
| [ | • | • | 2 | |||||||||||||
| [ | • | 1 | ||||||||||||||
| [ | • | 1 | ||||||||||||||
| Total Number of studies | 10 | 5 | 1 | 7 | 3 | 1 | 4 | 1 | 1 | 3 | 1 | 1 | 1 | 2 | 3 |
SVM Support Vector Machine, LR Logistic Regression, LiR Linear Regression, DT Decision Tree, NB Naïve Bayes, KNN K-Nearest Neighbor, RF Random Forest, GBM Gradient Boosting Machines, RoF Rotation Forest, Km K-means, PAM Partitioning Around Medoids, HCA Hierarchical Clustering Algorithm, AR Association Rules, NN Artificial Neural Network, DL Deep Learning
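The percentages quoted for each technique follow directly from the "Total Number of studies" row divided by the 16 included studies. As a small worked example, the sketch below recomputes and ranks those shares; the counts are copied from the table, and the names `technique_counts` and `share` are illustrative only.

```python
# Study counts per technique, copied from the "Total Number of studies" row above.
TOTAL_STUDIES = 16
technique_counts = {
    "SVM": 10, "LR": 5, "LiR": 1, "DT": 7, "NB": 3, "KNN": 1, "RF": 4, "GBM": 1,
    "RoF": 1, "Km": 3, "PAM": 1, "HCA": 1, "AR": 1, "NN": 2, "DL": 3,
}


def share(count, total=TOTAL_STUDIES):
    """Percentage of the included studies that used a technique."""
    return 100.0 * count / total


# Rank techniques from most to least frequently used.
ranked = sorted(technique_counts.items(), key=lambda item: item[1], reverse=True)
for name, count in ranked:
    print(f"{name:>4}: {count:2d}/{TOTAL_STUDIES} = {share(count):.1f}%")
```

The counts add up to 44 technique uses across the 16 studies, which matches the sum of the per-study totals in the last column of the table.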