Jeff Sawalha, Muhammad Yousefnezhad, Zehra Shah, Matthew R. G. Brown, Andrew J. Greenshaw, Russell Greiner.
Abstract
Rates of post-traumatic stress disorder (PTSD) have risen significantly during the COVID-19 pandemic. Telehealth has emerged as a means of monitoring symptoms for such disorders, partly because the pandemic has made in-person therapeutic intervention isolated or inaccessible for many. Additional screening tools may be needed to augment the identification and diagnosis of PTSD through a virtual medium. Sentiment analysis refers to the use of natural language processing (NLP) to extract emotional content from text. In our study, we train a machine learning (ML) model on text data from the Audio/Visual Emotion Challenge and Workshop (AVEC-19) corpus to identify individuals with PTSD, using sentiment analysis of semi-structured interviews. Our sample included 188 individuals without PTSD and 87 with PTSD. The interview was conducted by an animated character (Ellie) over a video-conference call. Our model achieved a balanced accuracy of 80.4% on a held-out dataset from the AVEC-19 challenge. Additionally, we implemented various partitioning techniques to assess the generalizability of our model. This shows that learned models can use sentiment analysis of speech to identify the presence of PTSD, even through a virtual medium, and can serve as an important, accessible, and inexpensive tool for detecting mental health abnormalities during the COVID-19 pandemic.
Keywords: emotion; language; machine learning; natural language processing; post-traumatic stress disorder (PTSD); sentiment analysis (SA); telepsychiatry
Year: 2022 PMID: 35178000 PMCID: PMC8844448 DOI: 10.3389/fpsyt.2021.811392
Source DB: PubMed Journal: Front Psychiatry ISSN: 1664-0640 Impact factor: 4.157
Figure 1. Interview process with Ellie. Participants were placed in a room in front of a large computer screen, showing the animated character Ellie in the Wizard-of-Oz interview.
Table 1. Demographics and outcome measures.
| Measure | Non-PTSD (n = 188) | PTSD (n = 87) | Significance |
|---|---|---|---|
| Sex (Male / Female) | 122 / 66 | 48 / 39 | N/S |
| PTSD mean score (PCL-C) | 26.54 (± 8.77) | 57.98 (± 10.70) | |
| Depression mean score (PHQ-8) | 4.177 (± 3.65) | 15.69 (± 3.48) | |
Clinical and demographic descriptions of both groups (non-PTSD and PTSD individuals). Sex differences across groups were tested with a chi-squared test; the group means of the PCL-C and PHQ-8 scores were compared with t-tests to determine statistical significance. Standard deviations are shown in parentheses for both assessment tests. Statistical significance was set at p < 0.05; N/S denotes not statistically significant.
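The chi-squared test on the sex distribution can be reproduced directly from the counts in the table above. A minimal pure-Python sketch (the helper `chi_squared` is ours, not from the paper's code; `scipy.stats.chi2_contingency` computes the same Pearson statistic):

```python
# Chi-squared test of independence for the sex-by-group table above
# (Non-PTSD: 122 M / 66 F; PTSD: 48 M / 39 F).

def chi_squared(table):
    """Pearson chi-squared statistic, no continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

sex_table = [[122, 66],  # Non-PTSD: male, female
             [48, 39]]   # PTSD: male, female
stat = chi_squared(sex_table)
print(f"chi2(1) = {stat:.2f}")  # prints chi2(1) = 2.38
```

The statistic falls below the 3.84 critical value for one degree of freedom at p = 0.05, consistent with the N/S entry in the table.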
Figure 2. Sentiment analysis pipeline. A simplified version of our pipeline. Raw text is given to a sentiment analyzer (i.e., VADER/TextBlob/Flair) that outputs a compound scalar score in [−1, 1] for each utterance. Each participant provides many such utterances in the session; our SuperLearner (SL) then bins that participant's set of scores into k bins, using internal cross-validation (on the training set) to identify the optimal number of bins and tune hyperparameters. Here, it found that 23 bins was optimal. We repeat this process with different partitioning methods for our dataset, such as five-fold CV and the original train-test folds. We also consider four base learners: Linear Discriminant Analysis (LDA), Support Vector Classifier (SVC), Random Forest (RF), and Gradient Boosting (GB); the figure shows only the RF learner.
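The binning step described above can be sketched in a few lines: a participant's utterance-level compound scores are counted into k equal-width bins over [−1, 1], and the resulting count vector is the feature vector passed to a base learner such as the RF. The scores below are illustrative, not from the corpus:

```python
# Sketch of the binning step in the pipeline: utterance-level compound
# sentiment scores (each in [-1, 1]) are counted into k equal-width bins,
# and the per-bin counts form one participant's feature vector.

def bin_sentiments(scores, k):
    """Count sentiment scores in k equal-width bins spanning [-1, 1]."""
    counts = [0] * k
    for s in scores:
        # Map s in [-1, 1] to a bin index in [0, k - 1]; s = 1.0 goes
        # into the last bin rather than overflowing.
        idx = min(int((s + 1.0) / 2.0 * k), k - 1)
        counts[idx] += 1
    return counts

# One hypothetical participant: one compound score per utterance
# (e.g., VADER's 'compound' output).
utterance_scores = [-0.62, -0.05, 0.0, 0.0, 0.13, 0.44, 0.72, -0.31, 0.05, 0.91]
features = bin_sentiments(utterance_scores, k=23)  # 23 bins was optimal here
print(sum(features))  # every utterance falls in exactly one bin -> prints 10
```

The SuperLearner's internal cross-validation then simply repeats this featurization for different k and keeps the value that performs best on the training folds.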
Table 2. Demographics and outcome measures for the original partitioning folds.
| Measure | Training: Non-PTSD | Training: PTSD | Testing: Non-PTSD | Testing: PTSD | Statistic |
|---|---|---|---|---|---|
| Sex (Male / Female) | 93 / 60 | 34 / 32 | 29 / 6 | 14 / 7 | χ²(1) = 10.19 |
| PTSD mean score (PCL-C) | 26.23 (± 8.47) | 56.44 (± 10.29) | 27.94 (± 10.06) | 63.05 (± 10.70) | F(3, 271) = 227.04 |
Clinical and demographic descriptions of both groups (non-PTSD and PTSD individuals) in the original train-test partition from the AVEC-19 challenge. Sex differences across the training and testing sets were tested with a chi-squared test; group means of the PCL-C scores were compared to determine statistical significance. Standard deviations are shown in parentheses. Statistical significance was set at p < 0.05; N/S denotes not statistically significant.
Table 3. Demographics and outcome measures for the 2017-to-2019 partitioning folds.
| Measure | Training: Non-PTSD | Training: PTSD | Testing: Non-PTSD | Testing: PTSD | Statistic |
|---|---|---|---|---|---|
| Sex (Male / Female) | 77 / 56 | 25 / 31 | 45 / 10 | 23 / 8 | χ²(1) = 19.20 |
| PTSD mean score (PCL-C) | 26.42 (± 8.74) | 55.82 (± 10.66) | 26.83 (± 8.93) | 62.00 (± 9.68) | F(3, 271) = 230.04 |
Clinical and demographic descriptions of participants (non-PTSD and PTSD individuals) from the 2017 competition (training set) and 2019 competition (testing set). Sex differences across the training and testing sets were tested with a chi-squared test; group means of the PCL-C scores were compared to determine statistical significance. Standard deviations are shown in parentheses. Statistical significance was set at p < 0.05; N/S denotes not statistically significant.
Figure 3. Machine learning results (F1 score) for the three partitioning types: (A) five-fold CV, (B) original train-test split, and (C) 2017-to-2019. High-watermark results from each model, sentiment analyzer, and bin size are shown for all three partitioning methods. The RF (23 bins) with VADER on the original train-test-split folds achieved the highest accuracy (80.4%) and an F1 score of 0.79. Panel (B) displays benchmark F1 scores from two other studies. In the agnostic five-fold CV, the RF (18 bins) with VADER achieved the best accuracy (75.6%, SD = ± 4.5%, F1 score = 0.71). The dashed lines represent the results from Stratou et al. (27) and DeVault et al. (26), who tried to predict PTSD on a smaller version of the current dataset.
Table 4. Performance of models across sentiment analyzers and partitioning schemes.
| Model | Partition | Analyzer | Bins | Accuracy (%) | F1 score | | |
|---|---|---|---|---|---|---|---|
| Random forest | Train-test-split | **Vader** | **23** | **80.4** | **0.79** | | |
| | | Flair | 23 | 75.0 | 0.80 | 0.82 | 0.53 |
| | | Textblob | 12 | 71.4 | 0.70 | 0.79 | 0.56 |
| | Five-fold-CV | Vader | 18 | 75.6 ± 4.5 | 0.72 | 0.83 | 0.58 |
| | | Flair | 30 | 74.0 ± 2.9 | 0.70 | 0.82 | 0.49 |
| | | Textblob | 21 | 70.2 ± 6.3 | 0.65 | 0.79 | 0.45 |
| | 2017-to-2019 | Vader | 23 | 80.2 | 0.82 | 0.86 | 0.67 |
| | | Flair | 23 | 80.2 | 0.81 | 0.86 | 0.67 |
| | | Textblob | 21 | 72.1 | 0.71 | 0.80 | 0.52 |
| Support Vector Machine (SVM) | Train-test-split | Vader | 18 | 75.0 | 0.73 | 0.80 | 0.67 |
| | | Flair | 23 | 70.0 | 0.68 | 0.78 | 0.51 |
| | | Textblob | 23 | 78.6 | 0.78 | 0.81 | 0.75 |
| | Five-fold-CV | Vader | 8 | 70.0 ± 4.9 | 0.66 | 0.77 | 0.51 |
| | | Flair | 9 | 66.2 ± 2.9 | 0.62 | 0.75 | 0.47 |
| | | Textblob | 7 | 67.3 ± 4.6 | 0.62 | 0.76 | 0.46 |
| | 2017-to-2019 | Vader | 18 | 70.1 | 0.68 | 0.77 | 0.59 |
| | | Flair | 29 | 68.6 | 0.65 | 0.78 | 0.47 |
| | | Textblob | 18 | 70.0 | 0.70 | 0.74 | 0.65 |
| Linear Discriminant Analysis (LDA) | Train-test-split | Vader | 3 | 73.0 | 0.77 | 0.74 | 0.72 |
| | | Flair | 18 | 75.0 | 0.75 | 0.82 | 0.61 |
| | | Textblob | 29 | 75.0 | 0.75 | 0.78 | 0.71 |
| | Five-fold-CV | Vader | 6 | 70.2 ± 2.9 | 0.67 | 0.77 | 0.58 |
| | | Flair | 30 | 64.3 ± 2.5 | 0.60 | 0.73 | 0.46 |
| | | Textblob | 9 | 65.5 ± 6.6 | 0.61 | 0.74 | 0.48 |
| | 2017-to-2019 | Vader | 18 | 67.4 | 0.65 | 0.75 | 0.55 |
| | | Flair | 6 | 63.9 | 0.62 | 0.71 | 0.52 |
| | | Textblob | 26 | 67.4 | 0.65 | 0.74 | 0.58 |
| Gradient Boosting (GB) | Train-test-split | Vader | 15 | 75.0 | 0.74 | 0.81 | 0.63 |
| | | Flair | 6 | 73.2 | 0.73 | 0.81 | 0.57 |
| | | Textblob | 29 | 66.1 | 0.63 | 0.74 | 0.52 |
| | Five-fold-CV | Vader | 23 | 70.5 ± 3.2 | 0.66 | 0.79 | 0.49 |
| | | Flair | 29 | 67.3 ± 2.5 | 0.60 | 0.77 | 0.41 |
| | | Textblob | 21 | 67.6 ± 6.2 | 0.62 | 0.77 | 0.45 |
| | 2017-to-2019 | Vader | 29 | 70.9 | 0.68 | 0.78 | 0.56 |
| | | Flair | 23 | 66.2 | 0.62 | 0.76 | 0.40 |
| | | Textblob | 12 | 65.1 | 0.61 | 0.75 | 0.42 |
A list of the best-performing models, based on the highest accuracy across the various bin sizes. The high-watermark model was the RF using the VADER analyzer, with 23 bins, on the traditional train-test-split partition; this is denoted by the bolded values in the table.
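Balanced accuracy, the metric used to rank the models above, is the mean of sensitivity (recall on the PTSD class) and specificity (recall on the non-PTSD class), which keeps the score informative despite the class imbalance in this dataset (188 non-PTSD vs. 87 PTSD). A minimal sketch with an illustrative confusion matrix (the counts are not the paper's):

```python
# Balanced accuracy from raw confusion-matrix counts. A classifier that
# always predicts the majority (non-PTSD) class would score only 0.5 here,
# whereas plain accuracy would reward it.

def balanced_accuracy(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)  # recall on the PTSD class
    specificity = tn / (tn + fp)  # recall on the non-PTSD class
    return (sensitivity + specificity) / 2

# Hypothetical test-set confusion matrix: 21 PTSD and 35 non-PTSD cases.
score = balanced_accuracy(tp=15, fn=6, tn=30, fp=5)
print(f"{score:.3f}")  # prints 0.786
```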
Figure 4. Binned sentiment scores per group. These are the mean values of the chosen binned sentiment scores (from the RF model in the five-fold-CV partition), in each bin, for both groups. The x-axis represents the bins and their ranges; the y-axis represents the average number of utterances whose sentiment scores fall within each bin, for the PTSD and non-PTSD groups. Table 5 shows the means and standard deviations; a one-way ANOVA compared these values in each bin between the two groups, with a Benjamini-Hochberg correction for 18 comparisons. The adjusted p-value threshold was set to p < 0.00284. Error bars represent standard deviation. Significant differences denoted by *p < 0.00284, **p < 0.0001.
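The Benjamini-Hochberg correction mentioned above can be sketched as a step-up procedure over the sorted p-values. The p-values below are illustrative, not the paper's; the paper's own adjusted threshold for its 18 comparisons was p < 0.00284:

```python
# Benjamini-Hochberg step-up procedure: reject the r smallest p-values,
# where r is the largest 1-based rank with p_(r) <= (r / m) * alpha.

def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected at false-discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    return sorted(order[:cutoff])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))  # prints [0, 1]
```

Unlike a Bonferroni correction, the threshold grows with the rank, so BH controls the false-discovery rate rather than the family-wise error rate and rejects more hypotheses when many p-values are small.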
Table 5. Average binned sentiments per group (bins = 18).
| Bin lower bound | Bin upper bound | Non-PTSD (mean ± SD) | PTSD (mean ± SD) | p-value |
|---|---|---|---|---|
| -1.000 | -0.888 | 0.266 ± 0.587 | 0.471 ± 0.828 | 0.18 |
| -0.888 | -0.777 | 0.734 ± 0.895 | 1.057 ± 1.178 | 0.145 |
| -0.777 | -0.666 | 0.814 ± 0.974 | 1.563 ± 1.499 | |
| -0.666 | -0.555 | 1.473 ± 1.435 | 1.897 ± 1.583 | 0.234 |
| -0.555 | -0.444 | 1.681 ± 1.146 | 2.667 ± 2.164 | |
| -0.444 | -0.333 | 2.633 ± 2.233 | 4.425 ± 2.970 | |
| -0.333 | -0.222 | 4.255 ± 2.324 | 5.000 ± 2.512 | 0.173 |
| -0.222 | -0.111 | 1.362 ± 1.494 | 1.632 ± 1.399 | 0.641 |
| -0.111 | 0.000 | 32.287 ± 15.669 | 43.966 ± 19.236 | |
| 0.000 | 0.111 | 1.213 ± 1.125 | 1.827 ± 1.341 | |
| 0.111 | 0.222 | 1.803 ± 1.417 | 2.000 ± 1.742 | 0.693 |
| 0.222 | 0.333 | 7.351 ± 4.176 | 8.023 ± 3.971 | 0.693 |
| 0.333 | 0.444 | 8.638 ± 3.986 | 10.506 ± 4.503 | |
| 0.444 | 0.555 | 4.218 ± 2.522 | 4.483 ± 2.832 | 0.693 |
| 0.555 | 0.666 | 4.968 ± 3.103 | 5.667 ± 3.194 | 0.474 |
| 0.666 | 0.777 | 4.160 ± 2.477 | 3.920 ± 2.161 | 0.693 |
| 0.777 | 0.888 | 4.261 ± 2.808 | 3.805 ± 3.013 | 0.693 |
| 0.888 | 1.000 | 3.441 ± 3.696 | 2.529 ± 3.066 | 0.317 |
This shows the average number of utterances in each bin per group. We conducted a one-way ANOVA to determine statistical differences across all 18 bins and used a Benjamini-Hochberg correction to reduce type I errors. The corrected statistical significance threshold was set to p < 0.00284. Statistical significance denoted by *p < 0.00284, **p < 0.0001. The bolded row represents the overall high-watermark result from our entire analysis; it simply reflects the highest accuracy across all processes.