Literature DB >> 35874865

Analysis of depression in social media texts through the Patient Health Questionnaire-9 and natural language processing.

Nam Hyeok Kim1, Ji Min Kim2, Da Mi Park2, Su Ryeon Ji1, Jong Woo Kim3.   

Abstract

Objective: Although depression in modern people is emerging as a major social problem, it shows a low rate of use of mental health services. The purpose of this study was to classify sentences written by social media users according to the nine symptoms of depression in the Patient Health Questionnaire-9 using natural language processing, and to assess users' depression based on the classification results.
Methods: First, two sentence classifiers were trained: the Y/N sentence classifier, which categorizes whether a user's sentence is related to depression, and the 0-9 sentence classifier, which further categorizes the sentence according to the depression symptomology of the Patient Health Questionnaire-9. Then a depression classifier, a logistic regression model, was generated to classify the sentence writer's depression. These trained sentence classifiers and the depression classifier were used to analyze users' social media text data and determine whether they were depressed.
Results: Our experimental results showed that the proposed depression classifier achieved 68.3% average accuracy, better than the baseline depression classifier, which used only the Y/N sentence classifier and had 53.3% average accuracy.
Conclusions: This study is significant in that it demonstrates the possibility of determining depression from social media users' textual data alone.
© The Author(s) 2022.

Keywords:  Depression; Patient Health Questionnaire-9; deep learning; machine learning; natural language processing; social media

Year:  2022        PMID: 35874865      PMCID: PMC9297458          DOI: 10.1177/20552076221114204

Source DB:  PubMed          Journal:  Digit Health        ISSN: 2055-2076


Introduction

Depression is a disease that threatens the mental health of modern people and is recognized as a problem that needs to be solved, but there is a lack of understanding and agreement on the proper treatment for depression. Depression leads to a decline in the functions of daily life, with its main symptoms being lack of motivation and feeling sad or unhappy. The World Health Organization (WHO) expects depression to be the most burdensome disease for humans by 2030, with more than 264 million people worldwide suffering from it. WHO also notes that, globally, depression is a major cause of disability and contributes to the burden of disease. According to the (US) National Institute of Mental Health (NIMH), 17.3 million people, accounting for 7.1% of the adult population in the United States, have had at least one major depressive episode. Additionally, 3.2 million American teenagers, accounting for 13.3% of the population between the ages of 12 and 17 years, suffer from the same symptoms. What is more concerning is that 35% of adults and 60% of adolescents in the United States who suffer from depression are not receiving proper treatment, even though depression occurs across age groups. As such, the mental health status of modern people has emerged as a major social problem and has begun to be perceived as one that can no longer be ignored. Despite these concerns, treatment of mental illness remains inadequate in many countries. In 2016, mental health service utilization rates for those diagnosed with mental illness were 43.1% in the United States, 46.5% in Canada, 34.9% in Australia, 35.5% in Spain, and 22.2% in Korea, indicating that mental health service utilization is well below 50% worldwide. According to Andrade et al., the three main reasons for the low utilization of mental health services are low perceived need, structural barriers, and attitudinal barriers. 
Low perceived need is the lack of awareness of mental health issues and means the patient himself/herself thinks no help is needed. Structural barriers represent concerns about money, lack of time, accessibility, insurance coverage, etc. Attitudinal barriers include the idea that the mental disorder will improve itself, prejudice against mental health services, and distrust of treatment effects, the result of which is that patients fail to use the service. Unlike with physical diseases, patients suffering from mental illness often do not understand the extent of the disease, often do not receive treatment due to low motivation, and often do not know that problems can be improved by using mental health services. In general, mental illness is fully treatable by early intervention, but the later the treatment, the more serious the disorder can be, so proper awareness and early detection of mental illness are important and necessary steps in treating the disease. Furthermore, recognizing the disease and knowing its exact name increases the probability of early detection and enhances positive treatment effects. One of the important steps in treating depression is correct self-awareness of this condition. Self-diagnosis of depression allows people to check their degree of depression themselves, and there are many self-diagnosis tests for depression. Examples of self-diagnosis instruments for depression include the Beck Depression Inventory (BDI), the Center for Epidemiologic Studies Depression Scale (CES-D), the Patient Health Questionnaire-9 (PHQ-9), and the Geriatric Depression Scale (GDS). There are various self-diagnosis tables for depression, but it is difficult for people with mental disabilities to identify their condition through this method for the same reasons as the low utilization rate of mental health services. Thus, an alternative could be a system that automatically (without specific patient involvement) identifies depression levels in such patients. 
In fact, there have been many attempts to predict or detect depression through various techniques.[12-18] In recent years, the expansion of social media such as Twitter and Facebook has raised interest in automatic depression detection techniques. As social media has become an integral part of modern life, much data is produced, suggesting that considerable textual data are available for mental health analysis. This is a valuable resource for assessing depression and mental disability through text, the direction our research aims to pursue. Existing studies on text analysis for depression or mental disorders[12-16] established classifiers to determine whether a text is related to symptoms of depression, and to further assess the degree of concern for depression. Sentence classification has been carried out through techniques such as naïve Bayes classification (NBC), latent Dirichlet allocation (LDA), support vector machines (SVMs), and logistic regression, with vocabularies built by relevant experts. In particular, in a study conducted by Yazdavar et al., by whose work we were most inspired, the PHQ-9 was used as a text classification criterion. In our study, sentence classification performance was improved by applying a better natural language processing (NLP) method than that of the preceding study, and the resulting sentence classification was expanded into a model that can judge depression based on the sentence classification results. Therefore, the purpose of our study was to associate textual data with the nine symptoms of depression in the PHQ-9 through NLP techniques, and to identify users' depression based on the results. The remaining sections are organized as follows. The Related research section introduces the diagnosis of depression and various depression self-diagnosis scales. We also analyze the applications of NLP in health care and reference prior studies on online depression detection. 
Depression classification model section introduces our depression classification model and describes the model details and mechanisms. Experiments section describes the experiments to evaluate our model and analyze its results. The academic and practical significance of this study are described in Discussion section, and Conclusion section concludes with a summary of key findings and future research proposals.

Related research

Diagnosis of depression

The diagnosis of depression usually occurs as follows: psychiatrists identify symptoms through consultation with the patient, conduct the necessary tests, and comprehensively analyze and diagnose the results; diagnostic assessment tools are used in this process. In real-world clinical situations, the use of appropriate diagnostic assessment tools can significantly aid diagnosis because depression symptoms can be overlooked when they are not clear. The most widely used diagnostic evaluation tools are the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the 10th revision of the International Classification of Diseases (ICD-10).[21,22] DSM-5 is published by the American Psychiatric Association (APA) and has 20 categories of mental illness classification, providing comprehensive diagnostic criteria for mental illness, including major depressive disorders. According to DSM-5, the symptomatic criteria for diagnosis of major depressive disorder include (1) depressed mood, (2) reduction of interest, (3) eating disorders, (4) sleep disorders, (5) psychomotor agitation or retardation, (6) fatigue, (7) self-blame, (8) decreased concentration, and (9) thoughts of self-harm or suicide. If five or more of these symptoms (including either (1) or (2)) persist for 2 weeks, depression is diagnosed. ICD-10 is an international disease classification system developed by WHO in 1994 and updated regularly, which deals with mental and behavioral disorders in Chapter 5. ICD-10 is similar to DSM-5 in that it assesses the severity of depression according to the number of symptoms. In general, depression screening tools are used prior to precise clinical diagnosis. For basic epidemiological studies in psychiatry, simple depression screening tools are used to classify depression symptoms into positive and negative groups before precise diagnostic testing is performed, which reduces the human and economic burden. Examples of depression screening tools are the BDI, the Zung Self-Rating Depression Scale (SDS), the CES-D, the PHQ-9, and the GDS. 
The PHQ-9 is a self-reported test developed in the 1990s by Robert L. Spitzer et al., which focuses on major depression and evaluates its severity. The questionnaire consists of nine questions asking about various symptoms of depression, such as depressed emotions, appetite, and suicidal thoughts, and calculates a score by evaluating how often the symptoms have occurred in the past 2 weeks. The score analysis criteria are as follows: 0–4 points indicate no depression, 5–9 points mild depression, 10–14 points moderate depression, 15–19 points depression requiring treatment, and 20–27 points severe depression requiring active treatment. In a study comparing widely used depression screening tools in primary care, the SDS and BDI were compared with the PHQ-9. The PHQ-9 has proven to be a stable and reasonable tool for measuring depression, with 88% sensitivity and 88% specificity relative to other scales. It is also considered easy to apply, taking less time to complete and being easier to score than conventional depression screening tools. In this study, a version of the PHQ-9 translated into Korean and standardized was used; the reliability and validity of this version have been verified by previous studies.[28,29]
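The scoring bands above can be expressed as a small lookup. This is a minimal illustrative sketch (the function names are ours, not from the study); `is_depressed` treats any score of 5 or more as indicating at least mild depression, which matches the bands above:

```python
def phq9_severity(score: int) -> str:
    """Map a PHQ-9 total score (0-27) to the severity bands described above."""
    if not 0 <= score <= 27:
        raise ValueError("PHQ-9 total score must be between 0 and 27")
    if score <= 4:
        return "no depression"
    if score <= 9:
        return "mild depression"
    if score <= 14:
        return "moderate depression"
    if score <= 19:
        return "treatment-needed depression"
    return "severe depression"


def is_depressed(score: int) -> bool:
    """Binary cut-off: a PHQ-9 score of 5 or more falls in a depression band."""
    return score >= 5
```

For example, `phq9_severity(12)` returns "moderate depression", and `is_depressed(4)` returns False.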

NLP

In pre-deep learning NLP studies, sentence classification through machine learning was mainly based on NBC, SVMs, and the random forest classification algorithm. With deep learning architectures and algorithms making significant contributions in the fields of computer vision and pattern recognition, studies on deep learning-based NLP began to emerge. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are representative deep learning techniques applied to various fields such as image recognition, object tracking, and autonomous driving, as well as NLP. Although CNNs were originally used as a key technology in image analysis, studies applying CNNs to NLP began with Collobert and Weston's 2008 research and the research by Collobert et al. in 2011, and many published results have shown that CNNs also excel in text processing.[34-41] Unlike traditional artificial neural networks, in which the signal flows only toward the output, RNNs are used for sequential data because their signal flow has a circular structure and can retain information about the past. RNNs have proven to be an effective method for statistical language modeling and have also been shown to be suitable for many NLP tasks, such as language modeling, machine translation, and speech recognition. Long short-term memory (LSTM) is a technique that addresses the gradient vanishing limitation of RNNs and is actively used in NLP research alongside RNNs.[44-47] Recently, in addition to RNNs and LSTM, pretrained language models such as bidirectional encoder representations from transformers (BERT), generative pre-trained transformer (GPT)-2, and GPT-3 have been used in NLP research.

Application of artificial intelligence in the medical field

With the development of artificial intelligence (AI), AI-based medical technologies using information and communication technology (ICT) convergence and medical “big data” are actively being developed. AI-based medical technology reduces uncertainty in clinical decision making and can analyze and process vast amounts of medical data to provide customized medical services for patients and medical staff. Currently, there are three main areas of AI-based medical technology: electronic health records (EHRs) and medical data, medical and pathological imaging, and physiological signal monitoring. Unlike conventional paper records, EHRs store all medical information concerning a patient’s past and present health status and health care electronically. With the introduction of EHRs in hospitals and clinics, the role of AI as a tool for extensive clinical care and research has become important. EHRs are characteristically vast, heterogeneous, incomplete, noisy, and generated for purposes other than research, so studies that apply NLP to extract relevant medical information from these vast data troves are actively underway.[54-59] EHR studies based on NLP have shown that NLP can be applied to extract meaningful medical information from unstructured data such as clinical notes. With the development of social media, NLP will become even more widely used to extract meaningful information from increasingly diverse textual data. The second application is AI in medical and pathological imaging. In medical imaging, AI technology can be used for classification, image segmentation, automatic detection of lesions in computer-assisted detection methods, and simulated imaging. 
Examples of these studies include AI-based detection of cancer spread to lymph nodes in women with breast cancer, brain hemorrhage verification, skin cancer classification, and the use of FDA-approved AI medical devices for diagnosis of breast cancer, stroke, and brain hemorrhage, as well as cardiac ultrasound diagnosis and MRI heart analysis. AI increasingly supports various medical image analyses and diagnoses based on its consistency, certainty, and accuracy. AI is also applied in physiological signal monitoring, which refers to obtaining signal information from a user through sensors attached to or worn on the body for medical purposes such as real-time response and 24-hour monitoring. Physiological signals are meaningful data sources that help detect, treat, and rehabilitate diseases because they reflect the electrical activity of certain body parts; AI, such as deep learning, can be designed to use these signals to make health care decisions. For example, vital sign monitoring has already been commercialized and is used in hospital environments such as intensive care units, emergency rooms, and hospital wards. Physiological signal monitoring is also applied to predict and prevent the worsening of disease courses, an area that has been actively pursued for years, and its use is expanding. With the introduction of various AI technologies into clinical practice, early, thorough, and systematic clinical verification is needed, as is the comparison of imprecise medical data against standard data to create conditions suitable for training AI.

Detection of depression from social media texts

Communication between doctors and patients is important in the diagnosis of depression, but patients are often reluctant to visit a hospital or clinic because they resist accepting their depression. Due to these limitations, efforts are also being made to identify barriers to the use of mobile apps for depression treatment. Recently, as technology has advanced, attempts have been made to detect depression based on electroencephalograms (EEGs), deep learning, and machine learning in various medical applications. For example, there have been attempts to predict depression using machine learning techniques such as random forest, the k-nearest neighbor algorithm, or SVMs. In addition, the expansion of social networks such as Facebook, Twitter, and Imgur has increased interest in detecting depression through them.[69,71] As social networks have become an integral part of modern life, tremendous numbers of posts are now generated in real time, which has led to attempts to analyze people’s mental health conditions, such as emotions and depression, based on these textual data. Arguably, language reflects human thought, emotion, belief, behavior, and personality, so the generation of this abundant textual data has triggered research on the possibility of depression detection as well as emotional analysis through NLP. Examples include studies that parse sentences written on social media to identify word-to-word similarities, or that classify sentences as depression-related and then categorize them according to the criteria of a self-diagnosis scale. Assuming that people with depression will tweet on their Twitter accounts and show symptoms of depression, studies have also been conducted to build related vocabulary dictionaries to detect those symptoms. Furthermore, there have been attempts to use social networks to distinguish between suicidal ideation and depression. 
While there is ample opportunity to detect depression in social media, to date there have been no significant, clear criteria established for emotional analysis. Therefore, this study created a model to determine whether social media-indicated depression conforms with actual depression self-diagnosis tables, which constitute a clear classification standard. In addition, the model was created with logistic regression, which can determine the relative effects of each variable. In addition to depression self-diagnosis tables, a variety of other diagnostic tables can be used to create a model specific to the symptoms if the model is trained based on the process in this study.

Depression classification model

The development process of a new depression classification model

We herein propose a three-step process for detecting and analyzing social media users’ depression. As shown in Figure 1, the process comprises two parts: training the models used for depression detection, and then predicting depression using the trained models. The “sentence classifier training phase” (SCT) and the “depression classifier training phase” (DCT) make up the training part, and the “user's depression classification phase” (UDC) is the prediction part. First, we train two sentence classifiers in the SCT phase: the Y/N classifier and the 0–9 classifier, which classify sentences based on the symptoms of depression in the PHQ-9. The Y/N classifier determines whether or not a sentence is related to depression, and the 0–9 classifier determines, based on the symptoms given in the PHQ-9, which question(s) of the PHQ-9 the sentence relates to. The 0–9 classifier assigns a sentence to one of 10 categories corresponding to the PHQ-9 symptoms, with 0 being the class for depression-related sentences that match none of the PHQ-9 symptoms.
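The two-stage routing described above can be sketched as follows. This is an illustration only: the two classifier arguments are stand-ins for the trained models (any function mapping a sentence to "Y"/"N" or to 0–9), and the toy stubs at the bottom are ours, not the study's:

```python
from typing import Callable, Optional


def classify_sentence(sentence: str,
                      yn_classifier: Callable[[str], str],
                      zero_nine_classifier: Callable[[str], int]) -> Optional[int]:
    """Two-stage routing: the Y/N classifier filters sentences; only
    depression-related ('Y') sentences reach the 0-9 classifier, which maps
    them to a PHQ-9 symptom (1-9), or to 0 when no specific symptom applies."""
    if yn_classifier(sentence) == "N":
        return None  # not related to depression; excluded from symptom labels
    return zero_nine_classifier(sentence)


# Toy stand-ins for the trained models (illustration only):
toy_yn = lambda s: "Y" if "sad" in s else "N"
toy_zero_nine = lambda s: 1  # pretend every Y sentence maps to symptom 1
```

With these stubs, `classify_sentence("I feel so sad today", toy_yn, toy_zero_nine)` returns 1, while an unrelated sentence returns None.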
Figure 1.

The entire process of training and using the depression classifier.

Next, in the DCT phase, a logistic regression classifier is trained to determine whether or not a user is depressed. To train the logistic regression classifier, a set of user-generated social media text data and each sample user’s PHQ-9 score are necessary. The target variable of the logistic regression classifier is the likelihood that the user is depressed: 1 if the user’s PHQ-9 score is ≥5, and 0 otherwise. Finally, the UDC phase uses the previously trained sentence classifiers and the depression classifier to predict an individual user’s depression.

Sentence classifier training phase

To train the sentence classifiers, social media texts describing daily life were collected from the Internet. The collected text data were separated into sentences, which were then preprocessed by removing stop words and spell checking. Then three people who had received data-labeling training read each sentence individually and assigned a Y/N label: "Y" when the sentence was related to depression and "N" when it was not. For each sentence related to depression ("Y"), one or more labels from "0" to "9" were assigned according to the PHQ-9 symptom(s) it reflected. If the assigned labels differed among the three people, the final labels were decided through discussion among them. To train the sentence classifiers, BERT, Word2Vec, and Unicode word embedding methods were used, and the applied classifiers were NBC, SVM, RNN, LSTM, CNN, and BERT. Finally, the model with the highest accuracy was selected as the final model.

Depression classifier training phase

To train the depression classifier, users' PHQ-9 scores and 2 weeks of their social media text data were necessary. We collected social media text data from 30 adults who, based on their PHQ-9 scores, were judged to have depression and 30 adults who were not. The collected textual data were preprocessed in the same way as in the SCT phase and were then classified using the trained Y/N classifier and 0–9 classifier. The ratio of each user’s classified sentences was calculated from the number of sentences classified Y/N and 0–9 and the total number of sentences the user had written. Then a logistic regression classifier was trained with depression as the dependent variable and the ratio of each label (Y/N and 0–9) as the independent variables. To determine the final logistic classifier, statistically significant coefficients were selected stepwise based on the variance inflation factor (VIF) and p-value of each variable. Due to the small number of users (60 adults), we performed fivefold cross-validation to improve the reliability of the results.
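The per-user ratio features that feed the logistic regression can be sketched as follows, assuming each sentence has already been assigned None (for "N") or a 0–9 label by the sentence classifiers; the function name and the None encoding are ours:

```python
from collections import Counter
from typing import Dict, List, Optional


def ratio_features(labels: List[Optional[int]]) -> Dict[str, float]:
    """Build the independent variables of the logistic regression from one
    user's classified sentences. `labels` holds one entry per sentence:
    None for 'N' (not depression-related), or 0-9 for the PHQ-9 class.
    Each ratio is that label's share of the user's total sentences."""
    total = len(labels)
    counts = Counter(labels)
    # Ratio_D: share of depression-related ('Y') sentences among all sentences
    features = {"Ratio_D": (total - counts[None]) / total}
    for k in range(10):
        features[f"Ratio_{k}"] = counts[k] / total
    return features
```

For a user whose five sentences were classified [N, N, 1, 1, 6], this yields Ratio_D = 0.6, Ratio_1 = 0.4, and Ratio_6 = 0.2.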

User’s depression classification phase

Each user’s social media text data for 2 weeks were used as input. These data were preprocessed in the same way as in the SCT phase and were then classified using the trained Y/N classifier and 0–9 classifier. Based on those classification results, the input variables of the logistic regression classifier were calculated, and through the logistic classifier the user was categorized as depressed or not.
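Put together, the final step of this phase amounts to evaluating the fitted logistic function on the user's ratio variables and thresholding the probability at 0.5. A minimal sketch, with placeholder coefficient values rather than the fitted model:

```python
import math
from typing import Dict


def predict_depression(features: Dict[str, float],
                       coefficients: Dict[str, float],
                       intercept: float) -> bool:
    """Evaluate a fitted logistic regression on a user's ratio features and
    classify the user as depressed when the probability reaches 0.5.
    Missing features default to 0 (the user wrote no such sentences)."""
    z = intercept + sum(coefficients[name] * features.get(name, 0.0)
                        for name in coefficients)
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= 0.5


# Placeholder coefficients for illustration only (not the fitted values):
demo_coefficients = {"Ratio_1": 20.0, "Ratio_2": -60.0}
```

With these placeholder values, a user whose sentences include 10% symptom-1 content would be classified as depressed, while one with only symptom-2 content would not.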

Experiments

Experimental designs

The experiment was approved by the Hanyang University Institutional Review Board (IRB); the approval number was HYUIRB-202008-001. Participants were informed of the detailed experimental purpose and procedure, and written consent was obtained.

Training the sentence classifier based on depression symptoms: The sentences to be labeled were collected from the representative Korean blog sites Naver Blog, Naver Cafe, and Daum Cafe. Sentences not related to depression were collected from everyday Naver blogs, and sentences deemed related to depression were collected from Naver’s and Daum’s depression-related cafes. For each user posting, we collected the user ID, URL, upload time, title, and content. We collected 23,115 documents and separated them into sentences using the Python library Korean Sentence Splitter (KSS). The total number of collected sentences was 249,103. The collected data were labeled in two steps. First, the sentences were labeled according to whether they were related to depression (Y) or not (N). When a sentence was labeled Y, the second labeling indicated which of the nine symptoms of the PHQ-9 corresponded to the sentence. We added a 0 label for those Y sentences that did not correspond to any of the PHQ-9’s 1–9 symptoms. According to these rules, the sentences were labeled independently by three workers who had basic knowledge of NLP and had learned the PHQ-9. Each worker read the given sentence, labeled it Y/N according to whether it was related to depression, and, if Y, assigned it a category between 0 and 9 according to the criteria of the PHQ-9. When the labeled results were inconsistent, the ground truth was determined by majority vote among the three workers for the same sentence. If all three workers assigned different labels, the ground truth was determined through discussion between them. 
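The majority-vote resolution described above can be sketched as follows; the unresolved all-different case is returned as None here, since the discussion step among the workers cannot be automated:

```python
from collections import Counter
from typing import List, Optional


def resolve_label(annotations: List[str]) -> Optional[str]:
    """Majority vote among three workers' labels for one sentence. Returns
    the winning label when at least two workers agree, or None when all
    three disagree (that case goes to discussion in the study)."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= 2 else None
```

For example, `resolve_label(["Y", "Y", "N"])` returns "Y", while three different symptom labels return None and would be escalated.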
Once the collected data were labeled, we found there was a significant data imbalance: there were significantly fewer sentences reflecting depression than those that did not. A more severe imbalance was found in sentences pertaining to the PHQ-9 symptoms: among the sentences tagged Y for depression, very few were labeled with symptoms other than 0, 1, or 9. To resolve these data imbalances, under-sampling was performed on the sentences that were not related to depression (tagged N). The details of the final dataset after under-sampling are shown in Table 1.
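Random under-sampling of the majority class, as applied here to the N-tagged sentences, can be sketched as follows; the function name and fixed seed are illustrative choices of ours:

```python
import random
from typing import List, Tuple


def undersample(majority: List[str], minority: List[str],
                seed: int = 0) -> Tuple[List[str], List[str]]:
    """Random under-sampling: shrink the majority class (here, sentences
    tagged N) to the size of the minority class so that the classifier is
    not dominated by non-depression sentences."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    sampled = rng.sample(majority, k=min(len(majority), len(minority)))
    return sampled, minority
```

After this step the two classes are balanced, which matches the near 50/50 split of N and Y sentences in Table 1.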
Table 1.

Number and proportion of sentences according to depression and the Patient Health Questionnaire-9 (PHQ-9) symptoms.

Y/N label   PHQ-9 label   Count     Percentage
Y           0              2,156    15.07%
Y           1              2,132    14.90%
Y           2                534     3.73%
Y           3              1,358     9.49%
Y           4                685     4.79%
Y           5                809     5.65%
Y           6              1,986    13.88%
Y           7              2,027    14.17%
Y           8                591     4.13%
Y           9              2,030    14.19%
Y           Subtotal      14,308   100.00%  (49.99% of all sentences)
N           -             14,313    50.01% (of all sentences)
Total                     28,621   100.00%
Training the depression classifier: For this experiment, we recruited blog users who had written daily online posts during the 2-week selection period. A total of 60 adults over the age of 19 were selected: 30 who were considered depressed and 30 who were not. Whether a person was currently experiencing depression was determined by the PHQ-9 test results (≥5 = depressed) acquired in the application stage; since the PHQ-9 classifies a score of 5 as mild depression, we used 5 points as the cut-off. To ensure the reliability of the PHQ-9 results, we administered a second PHQ-9 test three days after the first. Because the PHQ-9 diagnoses depression based on symptoms over the preceding 2 weeks, we collected all textual data from the users’ blogs over the 2 weeks prior to the PHQ-9 test date. The collected data were divided into sentence units, preprocessed, and organized by the experimenters. These data were used to train and evaluate the logistic regression classifier with fivefold cross-validation. First, a baseline depression classifier using only the Y/N classifier, without the 0–9 classifier, was created for comparison with the proposed logistic regression classifier. Then, after training each logistic regression classifier, the accuracy of the two models was compared. We used cross-validation because the number of samples was small, to ensure the reliability of the model’s performance; and we used fivefold rather than the more common 10-fold cross-validation to secure a sufficiently large test set. With 10-fold cross-validation each test set would have contained only 6 users, whereas fivefold cross-validation yields a more adequate size of 12. 
The main experiment was conducted with fivefold cross-validation, and experiments with 10-fold and 3-fold cross-validation were also conducted. The results of those experiments are presented in the Appendix.
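The fold-size arithmetic behind the choice above (60 users split into test folds of 12 rather than 6) can be checked with a simple helper; the function is ours, for illustration:

```python
from typing import List


def kfold_sizes(n_samples: int, k: int) -> List[int]:
    """Sizes of the k test folds when n_samples are split into k folds,
    distributing any remainder one sample at a time over the first folds."""
    base, extra = divmod(n_samples, k)
    return [base + 1 if i < extra else base for i in range(k)]
```

For the 60 users here, `kfold_sizes(60, 5)` gives five test folds of 12, while `kfold_sizes(60, 10)` gives ten folds of only 6.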

Experimental results

Performance of the sentence classifier: To find the best-performing sentence classifiers, we conducted experiments with various embeddings and classification algorithms. NBC, SVM, RNN, LSTM, BiRNN, and BiLSTM classifiers were trained with Word2Vec embedding, and CNN 1D and CNN 2D were trained with Unicode embedding. In the case of BERT, the classifier was implemented by adding a linear layer to the last layer of KoBERT, a Korean BERT released by SKT. As shown in Figure 2, the experimental results showed that BERT classifiers were the best for both Y/N and 0–9 sentence classification.
Figure 2.

Performance comparison of the Y/N and 0–9 sentence classifiers according to algorithm.

As shown in Table 2, the accuracy of the BERT-based Y/N sentence classifier was 93.68%. The precision for N was 96%, greater than the precision for Y (92%), while the recall for Y was 96%, greater than that for N (91%).
Table 2.

Performance of bidirectional encoder representations from transformers (BERT)-based Y/N sentence classifier.

Y/N sentence classifier
Class      Precision   Recall   F1-score
N          0.96        0.91     0.93
Y          0.92        0.96     0.94
Accuracy: 0.9368
As shown in Table 3, the accuracy of the BERT-based 0–9 sentence classifier was 83.29%. However, because the accuracy of the Y/N sentence classifier was 93.68%, the effective end-to-end accuracy was 93.68% × 83.29% = 78.02%.
Table 3.

Performance of the 0–9 sentence classifier.

0–9 sentence classifier
Class      Precision   Recall   F1-score
0          0.73        0.69     0.71
1          0.77        0.79     0.78
2          0.67        0.73     0.70
3          0.90        0.96     0.93
4          0.94        0.97     0.95
5          0.80        0.82     0.81
6          0.89        0.81     0.85
7          0.86        0.85     0.86
8          0.82        0.89     0.85
9          0.90        0.92     0.91
Accuracy: 0.8329
In addition, as shown in Figure 3, the F1-scores differed among the symptoms because the score is influenced by how distinct each symptom's features are. In the case of class 0, various symptoms that did not correspond to items 1 through 9 were mixed together, so its characteristics were not clear, resulting in the lowest F1-score. Symptom 1 (depressed mood), symptom 2 (reduction of interest), and symptom 5 (psychomotor agitation or retardation) also had low F1-scores because they were difficult to extract from textual data. On the other hand, F1-scores were high for symptom 3 (significant weight loss), symptom 4 (insomnia), and symptom 9 (thoughts of or attempts at suicide or death), because they were relatively easy to identify in textual data.
Figure 3.

F1-scores of the 0–9 sentence classifier.

Performance of the depression classifier: The baseline depression classifier had two variables: S (the number of sentences) and Ratio_D (the ratio of depression-related sentences to the total number of sentences). As shown in Table 4, Ratio_D was selected in three of the five folds after variable selection. Although Ratio_D was not selected as a significant variable in the other two folds, it might well have been selected had more data been available.
Table 4. Logistic regression results of the baseline and proposed depression classifiers by fivefold cross-validation.

Fold  Classifier  Variable   Estimate  Std. Error  Pr(>|t|)
1     Baseline    Intercept  0.3800    0.0963      2.71e-4
1     Baseline    Ratio_D    3.0651    1.6601      0.0712
1     Proposed    Intercept  0.449     0.079       1.07e-6
1     Proposed    Ratio_1    15.219    6.328       0.0204
1     Proposed    Ratio_2    -61.892   37.617      0.1069
2     Baseline    Intercept  0.3593    0.1034      0.0011
2     Baseline    Ratio_D    3.0429    1.6253      0.0675
2     Proposed    Intercept  0.413     0.096       9.6e-5
2     Proposed    Ratio_1    20.172    7.287       0.0082
2     Proposed    Ratio_2    -56.379   29.164      0.0598
2     Proposed    Ratio_3    -30.373   16.583      0.0732
2     Proposed    Ratio_6    20.586    11.849      0.0894
3     Baseline    Intercept  0.5000    0.0729      1.35e-8
3     Proposed    Intercept  0.444     0.090       1.26e-5
3     Proposed    Ratio_1    26.240    9.821       0.0106
3     Proposed    Ratio_2    -78.131   29.953      0.0125
3     Proposed    Ratio_3    -34.988   17.794      0.0557
3     Proposed    Ratio_6    24.628    11.798      0.0428
4     Baseline    Intercept  0.5000    0.0729      1.35e-8
4     Proposed    Intercept  0.362     0.098       6.5e-4
4     Proposed    Ratio_1    21.768    7.997       0.0093
4     Proposed    Ratio_2    -63.253   32.416      0.0575
4     Proposed    Ratio_3    -33.660   16.825      0.0517
4     Proposed    Ratio_6    27.675    12.397      0.0308
5     Baseline    Intercept  0.3775    0.1021      5.78e-4
5     Baseline    Ratio_D    2.7240    1.6182      0.0990
5     Proposed    Intercept  0.412     0.094       8.19e-5
5     Proposed    Ratio_1    21.261    7.994       0.0109
5     Proposed    Ratio_2    -67.880   34.380      0.0548
5     Proposed    Ratio_3    -36.262   18.687      0.0589
5     Proposed    Ratio_6    25.842    14.083      0.0734
On the other hand, the proposed depression classifier had three types of variables: S (the number of sentences), Y (the number of depression-related sentences), and Ratio_n (the number of sentences classified as the nth symptom divided by the total number of sentences). Variable selection was performed for each of the five folds. The results show that Ratio_1, Ratio_2, Ratio_3, and Ratio_6 are significant variables. The coefficients of Ratio_1 and Ratio_6 are positive, and those of Ratio_2 and Ratio_3 are negative. That is, a higher proportion of sentences on symptoms 1 and 6 among all sentences increases the predicted probability of depression, whereas a higher proportion of sentences on symptoms 2 and 3 reduces it. The absolute values of Ratio_2's coefficients are considerably larger than those of the other variables, meaning the ratio of sentences on symptom 2 influences the prediction more strongly than the other symptoms. The average accuracy of the proposed depression classifier was 68.3%, 15 percentage points higher than that of the baseline depression classifier (53.3%). As Figure 4 shows, the proposed depression classifier's accuracy was higher than the baseline's in every fold of the fivefold cross-validation. Therefore, a user's depression could be classified more accurately when the label-specific ratios obtained from the 0–9 sentence classifier were added, rather than using the Y/N sentence classifier alone.
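The Ratio_n features described above can be computed directly from the 0–9 classifier's per-sentence labels. A minimal sketch, with illustrative names (label 0 standing for "no clear symptom", as in the paper's class 0):

```python
from collections import Counter

# Hypothetical sketch of the proposed features: Ratio_n is the share of a
# user's sentences that the 0-9 classifier assigned to PHQ-9 symptom n.
def symptom_ratios(labels):
    """labels: one predicted class in 0..9 per sentence of a user."""
    total = len(labels)
    counts = Counter(labels)
    return {f"Ratio_{n}": counts.get(n, 0) / total for n in range(1, 10)}

ratios = symptom_ratios([1, 1, 4, 0, 9])
print(ratios["Ratio_1"], ratios["Ratio_2"])  # 0.4 0.0
```

These nine ratios (together with S and Y) are then fed to the logistic regression stage, whose fitted coefficients per fold appear in Table 4.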
Figure 4. Comparison of the accuracy of the baseline and the proposed depression classifier by fivefold cross-validation.

Discussion

This study aimed to determine whether a user's depression can be predicted from text written on social media. Our results indicate that this is possible using NLP and machine learning techniques. The study contributes to early identification of depression, a significant step in its treatment, and the methodology described here can be applied without the user's conscious participation. Many mental health online applications ("apps") already analyze users' emotions automatically and screen for mental disorders, and the model proposed here could be incorporated into them. For mental illness, continuous monitoring and timely professional help are important to prevent deterioration, so our model could support mental health care services and apps for people suffering from mental illness. Although some users must provide their PHQ-9 scores and social media text for training, once trained, the classifiers can determine whether other users are depressed based solely on their social media text. Furthermore, given systematic symptom indicators for other diagnoses, more diverse mental disorders could be analyzed online in similar ways. For example, this study used the PHQ-9, but comparable models could be built on the Beck Depression Inventory (BDI), the Zung Self-Rating Depression Scale (SDS), the Center for Epidemiologic Studies Depression Scale (CES-D), or the Geriatric Depression Scale (GDS). Beyond depression, panic disorder, anxiety disorder, stress, bipolar disorder, and other conditions could be screened from social media texts. This study only classifies whether a user has depression or not; in future research, the model can be extended by combining various technologies, which would further help detect depression early and prevent it from worsening.
Another possible future avenue is "explainable artificial intelligence" (XAI), a set of processes and methods that allows human users to comprehend and trust the results produced by machine learning algorithms. If advances in XAI make it possible to identify the factors driving a depression prediction, this would contribute to improving mental health through customized treatment and emotional management.

Conclusion

In this study, we created a model to determine whether or not social media users are depressed by analyzing their past social media texts. The proposed model consists of three classifiers: the Y/N sentence classifier, which determines whether a sentence is related to depression; the 0–9 sentence classifier, which classifies a sentence according to the depression symptomology of the PHQ-9; and the depression classifier, which ultimately establishes whether a social media user is potentially depressed. To improve sentence classification accuracy, we tried various text classification algorithms; among them, BERT-based classifiers performed best for both the Y/N and 0–9 sentence classifiers. In particular, the sentence classifier of Yazdavar et al., on which this paper builds, reached 68% accuracy, whereas our sentence classifier reached 83.29%, approximately 15 percentage points higher. Of course, since the two were not evaluated on the same dataset, the performances cannot be compared directly, and the proposed approach still needs to be verified on other datasets; however, no open depression datasets are currently available for such verification. Lastly, the depression classifier, a logistic regression model, showed that sentence classification based on the PHQ-9 helps improve prediction accuracy. The most significant limitation of this study is that the social media textual data of only 60 users were used to train the depression classifier. To mitigate this, fivefold cross-validation was performed, but with more data the model could have been trained more stably and evaluated without k-fold cross-validation. A further limitation is that the proposed model performs only a binary classification of whether or not a user is depressed.
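The three-classifier pipeline summarized above can be sketched end to end. The keyword stubs and the linear score coefficients below are invented for illustration; the real Y/N and 0–9 sentence classifiers are fine-tuned BERT models, and the real weights are the fitted values in Table 4.

```python
# Runnable toy sketch of the three-stage pipeline; all models are stubs.
def classify_yn(sentence):                 # stub for the BERT Y/N model
    return "depressed" in sentence.lower()

def classify_symptom(sentence):            # stub for the BERT 0-9 model
    return 9 if "suicide" in sentence.lower() else 1

def classify_user(sentences, threshold=0.5):
    related = [s for s in sentences if classify_yn(s)]
    # Ratio_n: share of all sentences assigned to symptom n.
    ratios = {n: sum(classify_symptom(s) == n for s in related) / len(sentences)
              for n in range(1, 10)}
    # Illustrative linear score in the spirit of the regression stage;
    # these coefficients are made up, not the fitted values from the paper.
    score = 0.45 + 15.0 * ratios[1] + 20.0 * ratios[9]
    return score >= threshold

print(classify_user(["I feel so depressed today", "Lunch was fine"]))  # True
```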
Finally, the methodological contribution of this paper is limited, because its main purpose was to improve the performance of the depression classifier on users' social media text data by applying state-of-the-art NLP techniques. Although the proposed approach significantly improved the depression classifier's performance, future studies can improve it further using newly emerging techniques.
Table A1. Logistic regression results of the baseline and proposed depression classifiers by threefold cross-validation.

Fold  Classifier  Variable   Estimate  Std. Error  Pr(>|t|)
1     Baseline    Intercept  0.500     0.080       2.37e-7
1     Proposed    Intercept  0.447     0.085       6.53e-6
1     Proposed    Ratio_1    15.759    6.064       0.013
1     Proposed    Ratio_2    -66.203   37.013      0.081
2     Baseline    Intercept  0.326     0.111       0.005
2     Baseline    Ratio_D    3.714     1.706       0.035
2     Proposed    Intercept  0.406     0.105       4.88e-4
2     Proposed    Ratio_1    20.623    7.953       0.013
2     Proposed    Ratio_2    -65.304   29.789      0.035
2     Proposed    Ratio_3    -29.131   16.752      0.091
2     Proposed    Ratio_6    26.269    12.593      0.044
3     Baseline    Intercept  0.500     0.080       2.37e-7
3     Proposed    Intercept  0.363     0.111       0.0023
3     Proposed    Ratio_1    20.257    11.702      0.0922
3     Proposed    Ratio_2    -81.680   43.931      0.0714
3     Proposed    Ratio_3    -43.838   21.637      0.0504
3     Proposed    Ratio_6    37.081    15.827      0.0249
Table A2. Accuracy of the baseline and proposed depression classifiers by threefold cross-validation.

Classifier                       Fold 1  Fold 2  Fold 3  Average
Baseline depression classifier   0.50    0.50    0.50    0.50
Proposed depression classifier   0.70    0.65    0.70    0.683
Table A3. Logistic regression results of the baseline and proposed depression classifiers by 10-fold cross-validation.

Fold  Classifier  Variable   Estimate  Std. Error  Pr(>|t|)
1     Baseline    Intercept  0.371     0.094       2.64e-4
1     Baseline    Ratio_D    3.045     1.573       0.058
1     Proposed    Intercept  0.408     0.088       2.65e-5
1     Proposed    Ratio_1    19.611    7.194       0.008
1     Proposed    Ratio_2    -56.506   28.866      0.055
1     Proposed    Ratio_3    -26.151   16.321      0.115
1     Proposed    Ratio_6    20.434    11.728      0.087
2     Baseline    Intercept  0.355     0.095       4.64e-4
2     Baseline    Ratio_D    3.381     1.599       0.039
2     Proposed    Intercept  0.405     0.087       2.45e-5
2     Proposed    Ratio_1    20.959    7.055       0.004
2     Proposed    Ratio_2    -69.264   27.726      0.015
2     Proposed    Ratio_3    -36.069   15.941      0.028
2     Proposed    Ratio_6    28.988    11.463      0.014
3     Baseline    Intercept  0.388     0.095       1.59e-4
3     Baseline    Ratio_D    2.648     1.600       0.103
3     Proposed    Intercept  0.411     0.088       2.47e-5
3     Proposed    Ratio_1    20.968    7.303       0.006
3     Proposed    Ratio_2    -66.677   32.394      0.044
3     Proposed    Ratio_3    -31.352   16.221      0.059
3     Proposed    Ratio_6    24.542    12.778      0.061
4     Baseline    Intercept  0.500     0.068       1.6e-9
4     Proposed    Intercept  0.386     0.089       6.98e-5
4     Proposed    Ratio_1    22.385    7.441       0.004
4     Proposed    Ratio_2    -64.583   31.222      0.043
4     Proposed    Ratio_3    -35.105   16.774      0.041
4     Proposed    Ratio_6    26.204    12.151      0.035
5     Baseline    Intercept  0.381     0.095       2.04e-4
5     Baseline    Ratio_D    2.881     1.629       0.082
5     Proposed    Intercept  0.3696    0.084       5.87e-5
5     Proposed    Ratio_1    11.627    6.531       0.081
5     Proposed    Ratio_2    -58.225   27.590      0.039
5     Proposed    Ratio_6    27.298    12.053      0.026
6     Baseline    Intercept  0.366     0.097       4.26e-4
6     Baseline    Ratio_D    3.024     1.593       0.063
6     Proposed    Intercept  0.395     0.088       4.44e-5
6     Proposed    Ratio_1    20.764    7.025       0.004
6     Proposed    Ratio_2    -63.907   27.663      0.025
6     Proposed    Ratio_3    -31.458   16.053      0.055
6     Proposed    Ratio_6    25.317    11.328      0.030
7     Baseline    Intercept  0.500     0.068       1.6e-9
7     Proposed    Intercept  0.441     0.089       9.04e-6
7     Proposed    Ratio_1    19.753    7.636       0.012
7     Proposed    Ratio_2    -70.562   28.502      0.016
7     Proposed    Ratio_3    -36.055   15.904      0.027
7     Proposed    Ratio_6    24.206    11.409      0.038
8     Baseline    Intercept  0.500     0.068       1.6e-9
8     Proposed    Intercept  0.434     0.075       4.87e-7
8     Proposed    Ratio_1    11.592    6.231       0.068
9     Baseline    Intercept  0.500     0.068       1.6e-9
9     Proposed    Intercept  0.449     0.075       2.3e-7
9     Proposed    Ratio_1    15.401    5.892       0.011
9     Proposed    Ratio_2    -65.636   29.208      0.029
10    Baseline    Intercept  0.500     0.068       1.6e-9
10    Proposed    Intercept  0.375     0.088       9.2e-5
10    Proposed    Ratio_1    14.537    6.835       0.038
10    Proposed    Ratio_2    -52.649   27.772      0.063
10    Proposed    Ratio_6    19.591    11.213      0.086
Table A4. Accuracy of the baseline and proposed depression classifiers by 10-fold cross-validation.

Classifier                       1     2     3     4     5     6     7     8     9     10    Average
Baseline depression classifier   0.50  0.50  0.50  0.50  0.50  0.33  0.50  0.83  0.33  0.50  0.50
Proposed depression classifier   0.66  0.50  0.66  0.50  0.66  0.66  0.66  0.83  0.66  0.83  0.66
References (24 in total)

1.  A self-rating depression scale.

Authors:  W W Zung
Journal:  Arch Gen Psychiatry       Date:  1965-01

2.  An inventory for measuring depression.

Authors:  A T Beck; C H Ward; M Mendelson; J Mock; J Erbaugh
Journal:  Arch Gen Psychiatry       Date:  1961-06

3.  Long short-term memory.

Authors:  S Hochreiter; J Schmidhuber
Journal:  Neural Comput       Date:  1997-11-15       Impact factor: 2.026

4.  Developing a random forest classifier for predicting the depression and managing the health of caregivers supporting patients with Alzheimer's Disease.

Authors:  Haewon Byeon
Journal:  Technol Health Care       Date:  2019       Impact factor: 1.285

5.  A model for continuous monitoring of patients with major depression in short and long term periods.

Authors:  Francisco Mugica; Àngela Nebot; Solmaz Bagherpour; Luisa Baladón; Antonio Serrano-Blanco
Journal:  Technol Health Care       Date:  2017       Impact factor: 1.285

6.  Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer.

Authors:  Babak Ehteshami Bejnordi; Mitko Veta; Paul Johannes van Diest; Bram van Ginneken; Nico Karssemeijer; Geert Litjens; Jeroen A W M van der Laak; Meyke Hermsen; Quirine F Manson; Maschenka Balkenhol; Oscar Geessink; Nikolaos Stathonikos; Marcory Crf van Dijk; Peter Bult; Francisco Beca; Andrew H Beck; Dayong Wang; Aditya Khosla; Rishab Gargeya; Humayun Irshad; Aoxiao Zhong; Qi Dou; Quanzheng Li; Hao Chen; Huang-Jing Lin; Pheng-Ann Heng; Christian Haß; Elia Bruni; Quincy Wong; Ugur Halici; Mustafa Ümit Öner; Rengul Cetin-Atalay; Matt Berseth; Vitali Khvatkov; Alexei Vylegzhanin; Oren Kraus; Muhammad Shaban; Nasir Rajpoot; Ruqayya Awan; Korsuk Sirinukunwattana; Talha Qaiser; Yee-Wah Tsang; David Tellez; Jonas Annuscheit; Peter Hufnagl; Mira Valkonen; Kimmo Kartasalo; Leena Latonen; Pekka Ruusuvuori; Kaisa Liimatainen; Shadi Albarqouni; Bharti Mungal; Ami George; Stefanie Demirci; Nassir Navab; Seiryo Watanabe; Shigeto Seno; Yoichi Takenaka; Hideo Matsuda; Hady Ahmady Phoulady; Vassili Kovalev; Alexander Kalinovsky; Vitali Liauchuk; Gloria Bueno; M Milagro Fernandez-Carrobles; Ismael Serrano; Oscar Deniz; Daniel Racoceanu; Rui Venâncio
Journal:  JAMA       Date:  2017-12-12       Impact factor: 56.272

7.  Machine Learning Methods to Extract Documentation of Breast Cancer Symptoms From Electronic Health Records.

Authors:  Alexander W Forsyth; Regina Barzilay; Kevin S Hughes; Dickson Lui; Karl A Lorenz; Andrea Enzinger; James A Tulsky; Charlotta Lindvall
Journal:  J Pain Symptom Manage       Date:  2018-02-27       Impact factor: 3.612

8.  DeepPhe: A Natural Language Processing System for Extracting Cancer Phenotypes from Clinical Records.

Authors:  Guergana K Savova; Eugene Tseytlin; Sean Finan; Melissa Castine; Timothy Miller; Olga Medvedeva; David Harris; Harry Hochheiser; Chen Lin; Girish Chavan; Rebecca S Jacobson
Journal:  Cancer Res       Date:  2017-11-01       Impact factor: 12.701

9.  Validation and utility of a self-report version of PRIME-MD: the PHQ primary care study. Primary Care Evaluation of Mental Disorders. Patient Health Questionnaire.

Authors:  R L Spitzer; K Kroenke; J B Williams
Journal:  JAMA       Date:  1999-11-10       Impact factor: 56.272

10.  Validation of the Patient Health Questionnaire-9 Korean version in the elderly population: the Ansan Geriatric study.

Authors:  Changsu Han; Sangmee Ahn Jo; Ji-Hyun Kwak; Chi-Un Pae; David Steffens; Inho Jo; Moon Ho Park
Journal:  Compr Psychiatry       Date:  2007-10-24       Impact factor: 3.735

