Andrew G Reece, Andrew J Reagan, Katharina L M Lix, Peter Sheridan Dodds, Christopher M Danforth, Ellen J Langer.
Abstract
We developed computational models to predict the emergence of depression and Post-Traumatic Stress Disorder in Twitter users. Twitter data and details of depression history were collected from 204 individuals (105 depressed, 99 healthy). We extracted predictive features measuring affect, linguistic style, and context from participant tweets (N = 279,951) and built models using these features with supervised learning algorithms. Resulting models successfully discriminated between depressed and healthy content, and compared favorably to general practitioners' average success rates in diagnosing depression, albeit in a separate population. Results held even when the analysis was restricted to content posted before first depression diagnosis. State-space temporal analysis suggests that onset of depression may be detectable from Twitter data several months prior to diagnosis. Predictive results were replicated with a separate sample of individuals diagnosed with PTSD (N_users = 174, N_tweets = 243,775). A state-space time series model revealed indicators of PTSD almost immediately post-trauma, often many months prior to clinical diagnosis. These methods suggest a data-driven, predictive approach for early screening and detection of mental illness.
Year: 2017 PMID: 29021528 PMCID: PMC5636873 DOI: 10.1038/s41598-017-12961-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary statistics for depression and PTSD tweet collection (N_depr = 279,951, N_ptsd = 243,775).

| Depression | Users | Posts (total) | Posts, mean (SD) | Posts (median) |
|---|---|---|---|---|
| Total | 204 | 279,951 | 1373 (1282) | 862 |
| Depressed | 105 | 164,218 | 1564 (1332) | 1127 |
| Healthy | 99 | 115,733 | 1169 (1200) | 574 |

| PTSD | Users | Posts (total) | Posts, mean (SD) | Posts (median) |
|---|---|---|---|---|
| Total | 174 | 243,775 | 1401 (1284) | 946.5 |
| Has PTSD | 63 | 91,589 | 1564 (1332) | 1058 |
| Healthy | 111 | 152,186 | 1371 (1268) | 893 |
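
As a quick arithmetic check on the summary statistics above, the per-user mean can be recomputed as total posts divided by users. The sketch below is only a sanity check on the printed figures, not part of the authors' analysis; the `rows` layout, the function names, and the one-post tolerance are assumptions made here for illustration.

```python
# Rows copied from the summary table: (users, total posts, reported mean).
rows = {
    "Depression / Total": (204, 279_951, 1373),
    "Depression / Depressed": (105, 164_218, 1564),
    "Depression / Healthy": (99, 115_733, 1169),
    "PTSD / Total": (174, 243_775, 1401),
    "PTSD / Has PTSD": (63, 91_589, 1564),
    "PTSD / Healthy": (111, 152_186, 1371),
}


def recomputed_mean(users, total_posts):
    """Mean posts per user implied by the totals."""
    return total_posts / users


def mismatches(table, tolerance=1.0):
    """Return rows whose reported mean differs from total/users by more than `tolerance`."""
    flagged = {}
    for name, (users, total_posts, reported) in table.items():
        implied = recomputed_mean(users, total_posts)
        if abs(implied - reported) > tolerance:
            flagged[name] = round(implied, 1)
    return flagged


print(mismatches(rows))  # {'PTSD / Has PTSD': 1453.8}
```

With the values as printed, every row agrees to within one post except Has PTSD, where 91,589 / 63 is about 1,454 rather than the listed 1564. That cell repeats the Depressed row's entry, which may indicate a transcription slip.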
Classification accuracy metrics for daily and weekly models (N_depr = 74,990, N_ptsd = 54,197).

| Depression | MVR | DC | Daily | Weekly |
|---|---|---|---|---|
| Recall | 0.510 | 0.614 | 0.518 (0.000) | 0.521 (0.000) |
| Specificity | 0.813 | N/A | 0.958 (0.000) | 0.969 (0.000) |
| Precision | 0.42 | 0.742 | 0.852 (0.000) | 0.866 (0.000) |
| NPV | 0.858 | N/A | 0.812 (0.000) | 0.841 (0.000) |
| F1 | 0.461 | 0.672 | 0.644 (0.000) | 0.651 (0.000) |

| PTSD | TBA | NHC | Daily | Weekly |
|---|---|---|---|---|
| Recall | 0.249 | 0.82 | 0.683 (0.000) | 0.658 (0.000) |
| Specificity | 0.979 | N/A | 0.988 (0.000) | 0.994 (0.000) |
| Precision | 0.429 | 0.86 | 0.882 (0.000) | 0.934 (0.000) |
| NPV | 0.602 | N/A | 0.959 (0.000) | 0.954 (0.000) |
| F1 | 0.315 | 0.84 | 0.769 (0.000) | 0.772 (0.000) |
Accuracy scores from Mitchell et al.[28] (MVR), De Choudhury et al.[8] (DC), Taubman-Ben-Ari et al.[22] (TBA), and Nadeem, Horn, & Coppersmith[13] (NHC) are included for comparison to depression (MVR, DC) and PTSD (TBA, NHC) results. Table cells marked N/A indicate unavailable metrics from previous studies.
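
For readers less familiar with the five metrics in the table, the sketch below shows their standard definitions in terms of confusion-matrix counts, plus F1 as the harmonic mean of precision and recall. The function names and the example counts are hypothetical; only the precision and recall passed to `f1_from` are taken from the table (Depression, Daily column).

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)          # sensitivity: affected cases correctly flagged
    specificity = tn / (tn + fp)     # healthy cases correctly cleared
    precision = tp / (tp + fp)       # flagged cases that are truly affected
    npv = tn / (tn + fn)             # cleared cases that are truly healthy
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f1": f1,
    }


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to exercise the function.
example = classification_metrics(tp=52, fp=10, tn=190, fn=48)

# Precision and recall from the Depression / Daily column of the table.
print(round(f1_from(0.852, 0.518), 3))  # 0.644
```

Applying `f1_from` to the other precision and recall pairs in the table gives values that agree with the listed F1 scores to within rounding.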
Figure 1. ROC curve and top predictors for the Random Forests algorithm, for depression and PTSD samples (N_depr = 74,990, N_ptsd = 54,197). Predictor names ending in “_happy” are happiness measures; LIWC predictors[37] refer to the occurrence of semantic categories (e.g., LIWC_ingest refers to food and eating words, LIWC_swear refers to profanity).
Figure 2. Hidden Markov Model showing probability of depression (N = 74,990). X-axis represents days from diagnosis. Healthy data are plotted from a consecutive time span of equivalent length. Trend lines represent cubic polynomial regression fits with 95% CI bands; points are aggregations of 14-day periods, with error bars indicating 95% CI on central tendency of daily values.
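
As a rough illustration of how a Hidden Markov Model can yield a per-day probability of a latent condition, the sketch below runs the generic forward (filtering) recursion for a two-state HMM with discrete observations. It is not the authors' model: the state meanings, the binary daily observation, and every probability in `start`, `trans`, and `emit` are assumptions invented for this example.

```python
def forward_filter(observations, start, trans, emit):
    """Return P(state_t | obs_1..t) for each time step of a discrete HMM.

    start[i]    : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[j][o]  : probability of observing symbol o while in state j
    """
    n_states = len(start)
    filtered = []
    previous = None
    for t, obs in enumerate(observations):
        if t == 0:
            prior = list(start)
        else:
            # Predict: push yesterday's belief through the transition matrix.
            prior = [
                sum(previous[i] * trans[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        # Update: weight by the likelihood of today's observation, then normalise.
        unnormalised = [prior[j] * emit[j][obs] for j in range(n_states)]
        total = sum(unnormalised)
        previous = [u / total for u in unnormalised]
        filtered.append(previous)
    return filtered


# Hypothetical parameters: state 0 = healthy, state 1 = affected;
# observation 1 means a day's content was flagged by some classifier.
start = [0.5, 0.5]
trans = [[0.95, 0.05], [0.05, 0.95]]
emit = [[0.8, 0.2], [0.3, 0.7]]

beliefs = forward_filter([0, 0, 1, 1, 1], start, trans, emit)
for day, belief in enumerate(beliefs):
    print(day, round(belief[1], 3))
```

The second component of each belief is the filtered probability of the affected state on that day, which is the kind of quantity plotted on the y-axis of a figure like this one.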
Figure 3. Hidden Markov Model showing probability of PTSD (N = 54,197). X-axis represents days from trauma event. Healthy data are plotted from a consecutive time span of equivalent length. The purple vertical line indicates the mean number of days to PTSD diagnosis, post-trauma, and the purple shaded region shows the average period between trauma and diagnosis. Trend lines represent cubic polynomial regression fits with 95% CI bands; points are aggregations of 30-day periods, with error bars indicating 95% CI on central tendency of daily values.
Figure 4. Depression word-shift graph revealing contributions to the difference in Twitter happiness observed between depressed (5.98) and healthy (6.11) participants. In column 3, (−) indicates a relatively negative word, and (+) indicates a relatively positive word, both with respect to the average happiness of all healthy tweets. An up (down) arrow indicates that the word was used more (less) by the depressed class. Words on the left (right) contribute to a decrease (increase) in happiness in the depressed class. See Appendix III for PTSD word-shift graph.
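
The sign conventions in this caption follow from the usual word-shift decomposition, in which each word's contribution to the happiness difference is the product of two terms: how far its happiness score sits from the reference average, and how much its relative frequency changes between the two groups. The sketch below assumes that standard decomposition; the words, happiness scores, and frequencies are invented for illustration and are not the study's data.

```python
def word_shift(happiness, p_ref, p_comp):
    """Per-word contributions to (comparison average - reference average).

    happiness[w] : happiness score of word w
    p_ref[w]     : normalised frequency of w in the reference (healthy) text
    p_comp[w]    : normalised frequency of w in the comparison (depressed) text
    """
    h_ref = sum(happiness[w] * p_ref[w] for w in happiness)
    contributions = {}
    for w in happiness:
        relative_valence = happiness[w] - h_ref   # (+) or (-) in the caption
        frequency_change = p_comp[w] - p_ref[w]   # up or down arrow
        contributions[w] = relative_valence * frequency_change
    return h_ref, contributions


# Hypothetical scores and frequencies (each frequency dict sums to 1).
happiness = {"happy": 8.0, "sad": 2.4, "the": 5.0, "love": 8.4}
p_ref = {"happy": 0.20, "sad": 0.10, "the": 0.60, "love": 0.10}
p_comp = {"happy": 0.15, "sad": 0.20, "the": 0.60, "love": 0.05}

h_ref, contributions = word_shift(happiness, p_ref, p_comp)
h_comp = sum(happiness[w] * p_comp[w] for w in happiness)

for word, delta in sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(word, round(delta, 3))
```

Because both frequency distributions sum to one, the contributions add up exactly to the difference in average happiness. A negative product (a positive word used less, or a negative word used more) places the word on the left of the graph, and a positive product places it on the right.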