Burkhardt Funk, Shiri Sadeh-Sharvit, Ellen E Fitzsimmons-Craft, Mickey Todd Trockel, Grace E Monterubio, Neha J Goel, Katherine N Balantekin, Dawn M Eichen, Rachael E Flatt, Marie-Laure Firebaugh, Corinna Jacobi, Andrea K Graham, Mark Hoogendoorn, Denise E Wilfley, C Barr Taylor.
Abstract
BACKGROUND: Digital health interventions (DHIs) are poised to reduce target symptoms in a scalable, affordable, and empirically supported way. DHIs that involve coaching or clinical support often collect text data from 2 sources: (1) open correspondence between users and the trained practitioners supporting them through a messaging system and (2) text data recorded during the intervention by users, such as diary entries. Natural language processing (NLP) offers methods for analyzing text, augmenting the understanding of intervention effects, and informing therapeutic decision making.
Keywords: Digital Health Interventions Text Analytics (DHITA); digital health interventions; eating disorders; guided self-help; natural language processing; text mining
Year: 2020 PMID: 32130118 PMCID: PMC7059510 DOI: 10.2196/13855
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Figure 1. Text fragments along an exemplified user journey of a specific user i (vertical dots refer to other users). Open circles refer to other nontext touchpoints and the interaction of the user with the digital health intervention; upward-pointing triangles refer to fragments from diaries; red squares refer to the messages sent by coaches; black squares refer to the messages sent by users; and downward-pointing triangles refer to the data collected within specific exercises (eg, deep breathing).
Figure 2. Framework for the analysis of textual data in DHIs (symbols are explained in the caption of Figure 1).
Figure 3. Example of an intervention snippet. Raw features are derived as demonstrated by some selected features in each category (features describing the user-coach communication are not shown because they are defined only on communication threads, not on individual snippets).
Derived features to represent text snippets (we provide the full set of features to interested readers upon request).
| Feature type | Number of features^a | Comment | Examples (for message snippets) |
| --- | --- | --- | --- |
| Metadata | 2 / 2 | Number of words and characters | —^b |
| Word usage | 79 / 189 | For messages: MINOCC^c=0.05 and MAXOCC^d=0.5; for intervention snippets: MINOCC=0.005 and MAXOCC=0.5 | Most common words in approximately one-fourth of all messages: think, feel, eat, just, and like |
| Word embeddings | 50 / 50 | We used the pretrained GloVe with 50 dimensions and an average over each dimension as suggested by De Boom et al [ | — |
| POS^e | 44 / 44 | Note that for the intervention snippets it took approximately 10 hours to generate the POS features on 1 core of an Intel i7 | Most common POS tags: personal pronouns, nouns, prepositions, particles, and determiners |
| Topic models | 10 / 10 | Probabilities for 8 topics + SD of these numbers + log likelihood | — |
| Sentiments | 15 / 15 | We used 3 different lexica: National Research Council Canada (NRC) (11), AFINN^f (1), and Bing^g (3), where numbers in parentheses indicate the number of dimensions | NRC sentiment types: anticipation, trust, joy, sadness, and fear |
| Communication | 2 / 0 | Only available for message snippets (response rate and mean response time) and only aggregated on the user level | — |
^a The first number in this column refers to the number of features for the message snippets and the second refers to the intervention snippets.
^b Not applicable.
^c A specific term occurs in at least MINOCC of all messages (minimum occurrence).
^d A specific term occurs in not more than MAXOCC of all messages (maximum occurrence).
^e POS: part-of-speech.
^f AFINN is an English word list developed by Finn Årup Nielsen. Word scores range from −5 (negative) to +5 (positive).
^g The Bing lexicon is an opinion word list compiled by Bing Liu and collaborators.
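As a rough illustration of three of the feature types in the table (metadata, word usage, and averaged word embeddings), the following Python sketch may help. It is not the authors' code: the function names, the whitespace tokenizer, and the toy embedding dictionary are assumptions made for illustration, and a real pipeline would load pretrained GloVe vectors rather than a hand-written dictionary.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the source does not
    # describe the tokenizer that was actually used.
    return text.lower().split()


def select_vocabulary(messages, min_occ=0.05, max_occ=0.5):
    """Keep terms that occur in at least min_occ and at most max_occ
    of all messages (the MINOCC and MAXOCC thresholds from the table)."""
    n = len(messages)
    doc_freq = Counter()
    for text in messages:
        # Count each term once per message (document frequency).
        doc_freq.update(set(tokenize(text)))
    return sorted(t for t, c in doc_freq.items() if min_occ <= c / n <= max_occ)


def metadata_features(text):
    """Number of words and number of characters of one snippet."""
    return {"n_words": len(text.split()), "n_chars": len(text)}


def word_usage_features(text, vocabulary):
    """Term counts for one snippet, restricted to the selected vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[t] for t in vocabulary]


def mean_embedding(text, embeddings, dim):
    """Average the word vectors of a snippet over each dimension.
    `embeddings` is a hypothetical dict mapping a word to a list of floats."""
    vectors = [embeddings[t] for t in tokenize(text) if t in embeddings]
    if not vectors:
        # No known word: fall back to a zero vector (an illustrative choice).
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

With the thresholds from the table, `select_vocabulary(messages)` would drop both very rare terms and terms that appear in more than half of all messages; the remaining terms define the word usage columns.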
Figure 4. Correlation between the 200 features for all user messages. The blue lines indicate the different feature types. The red dots on the diagonal refer to the correlation of each feature with itself, ie, correlation=1.
Figure 5. Receiver operating characteristic (ROC) curves for logistic regression (LR) and random forest (RF). The line color indicates whether the model was learned within or across users.
Figure 6. Cross-validation curve as a function of the regularization constant λ. Error bars indicate the SD across the 100 folds of cross-validation. The blue numbers indicate the number of nonzero parameters from the LASSO regression.
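A feature correlation matrix of the kind shown in Figure 4 could be computed along the following lines. This is a minimal sketch that assumes Pearson correlation, which the caption does not state explicitly; the function names and the column-wise data layout are illustrative choices.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        # A constant feature has no defined correlation.
        return float("nan")
    return cov / (sd_x * sd_y)


def correlation_matrix(columns):
    """columns[j] holds the values of feature j across all messages.
    The diagonal is the correlation of each feature with itself."""
    return [[pearson(a, b) for b in columns] for a in columns]
```

For non-constant features the diagonal entries equal 1 up to floating-point error, which corresponds to the red dots described in the caption.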
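For readers less familiar with ROC curves, the sketch below shows how the points of such a curve and the area under it can be derived from predicted scores and binary labels. It is independent of the models in the study: the function names are hypothetical, and any classifier that outputs a score (such as LR or RF) could supply the inputs.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points obtained by
    lowering the decision threshold over all distinct scores.
    labels are 0 or 1; a higher score means 'more likely positive'."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process all examples that share the same score together (ties).
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A curve that hugs the upper left corner yields an area close to 1, whereas a classifier that ranks at random yields an area near 0.5.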
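The relationship between λ and the number of nonzero parameters reported in Figure 6 can be illustrated with a small coordinate-descent LASSO. This sketch uses a squared-error loss as a stand-in, since the caption does not say which loss was used, and it omits the cross-validation loop; all names and the toy data are assumptions for illustration only.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic
    coordinate descent. X is a list of rows, y a list of targets."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return beta


def count_nonzero(beta):
    return sum(1 for b in beta if b != 0.0)


# Toy data (hypothetical): y depends strongly on feature 0, weakly on feature 1.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [3.2, 2.8, -2.8, -3.2]
for lam in (0.0, 0.5, 5.0):
    beta = lasso_coordinate_descent(X, y, lam)
    print(lam, beta, count_nonzero(beta))
```

As λ grows, the weaker coefficient is set to zero first and eventually all coefficients vanish, which is why the count of nonzero parameters along such a curve tends to fall with increasing λ.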