Digital health tools for the passive monitoring of depression: a systematic review of methods.

Valeria De Angel1,2, Serena Lewis3,4, Katie White3, Carolin Oetzmann3, Daniel Leightley3, Emanuela Oprea3, Grace Lavelle3, Faith Matcham3, Alice Pace5, David C Mohr6,7, Richard Dobson8,9, Matthew Hotopf3,8.   

Abstract

The use of digital tools to measure physiological and behavioural variables of potential relevance to mental health is a growing field sitting at the intersection between computer science, engineering, and clinical science. We summarised the literature on remote measuring technologies, mapping methodological challenges and threats to reproducibility, and identified leading digital signals for depression. Medical and computer science databases were searched between January 2007 and November 2019. Published studies linking depression and objective behavioural data obtained from smartphone and wearable device sensors in adults with unipolar depression and healthy subjects were included. A descriptive approach was taken to synthesise study methodologies. We included 51 studies and found threats to reproducibility and transparency arising from failure to provide comprehensive descriptions of recruitment strategies, sample information, feature construction and the determination and handling of missing data. The literature is characterised by small sample sizes, short follow-up duration and great variability in the quality of reporting, limiting the interpretability of pooled results. Bivariate analyses show consistency in statistically significant associations between depression and digital features from sleep, physical activity, location, and phone use data. Machine learning models found the predictive value of aggregated features. Given the pitfalls in the combined literature, these results should be taken purely as a starting point for hypothesis generation. Since this research is ultimately aimed at informing clinical practice, we recommend improvements in reporting standards including consideration of generalisability and reproducibility, such as wider diversity of samples, thorough reporting methodology and the reporting of potential bias in studies with numerous features.
© 2022. The Author(s).

Year:  2022        PMID: 35017634      PMCID: PMC8752685          DOI: 10.1038/s41746-021-00548-8

Source DB:  PubMed          Journal:  NPJ Digit Med        ISSN: 2398-6352


Introduction

Depression remains the leading cause of disability worldwide[1], with a largely chronic course and poor prognosis[2]. Early recognition and access to treatment, as well as a better trial methodology, have been linked to improved treatment outcomes and prognosis[3]. The use of digital technology to track mood and behaviour brings enormous potential for clinical management and the improvement of research in depression. By passively sensing motion, heart rate and other physiological variables, smartphone and wearable sensors provide continuous data on behaviours that are central to psychiatric assessment, such as sociability[4], sleep/wake cycles[5], cognition, activity[6] and movement[7]. With the global trend toward increased smartphone ownership (44.9% worldwide, 83.3% in the UK) and wearable device usage forecast to reach one billion by 2022[8], this new science of “remote sensing”, sometimes referred to as digital phenotyping or personal sensing[9], presents a realistic avenue for the management and treatment of depression. When combined with the completion of questionnaires, remote sensing may generate more objective and frequent measures of mood and other core dimensions of mental disorders, instead of relying on retrospective accounts of patients or participants. The first step in generating meaningful clinical information from data derived from digital sensors is to generate features, which are the smallest constructed building blocks, designed to explain the behaviours of interest (see Mohr et al. [10] for a detailed analytical framework). These low-level features are often aggregated to define high-level behavioural markers, which can be understood as symptoms. For example, GPS data (sensor) can be translated into ‘location type’ (low-level feature); ‘increased time at home location’ (high-level behaviour), derived from location data, may indicate social withdrawal or lack of energy (symptom), and may therefore be associated with depression severity.
One of the main challenges that arise from this emerging field is that it sits at the intersection between computer science, engineering, and clinical science. The advantages of a multidisciplinary approach are evident, but these domains are yet to be brought together efficiently[11,12], giving rise to large differences in reporting standards with the risk that reproducibility may be threatened[13]. Previous reviews in affective disorders cite the level of heterogeneity across studies as a barrier to carrying out meta-analytic syntheses of the results. Additionally, these reviews have included non-validated measures of depression, and a mix of bipolar and unipolar samples, characteristics which not only show divergent results[11,12,14], but add study diversity. There is therefore a need for a comprehensive review of methodologies, with more specific inclusion criteria, to highlight the sources of heterogeneity and methodological shortcomings in the field. Given the difficulty in extracting a clear message from the available literature, the current work aims to review studies linking passive data from smartphone and wearable devices with depression and summarise key methodological aspects, to: (a) identify sources of heterogeneity and threats to reproducibility, and (b) identify leading digital signals for depression. We will also assess the quality of the included studies and evaluate their reporting of the feasibility of passive data collection methods, participant retention and missing data.

Results

Fifty-one studies were included in the review (see Fig. 1). The majority of articles (n = 45) were published in medical journals, and 33 (65%) were from North America. A summary of included studies is presented in Table 1.
Fig. 1

Study selection flowchart.

Medical and computer science databases were searched to ensure relevant fields were covered. The current flowchart lists reasons for excluding the study from the data extraction and quality assessment.

Table 1

Summary characteristics of included studies.

First author | Year | Country | Field | N (RMT)a | % female | Mean age (range/SD) | RMTa follow-up (days) | Sample type | Depression measure | Passive feature type (Sleep, Physical activity, Circadian rhythm, Sociability, Location, Phone use, Physiological, Environmental) | Total feature types
Avila-Moraes[36]2013BrazilM30100.044 (18–60)7ClinicalBDI, HAMD, MADRSxxx3
Ben-Zeev[46]2015USAM3721.022.5 (19–0)70StudentPHQ-9xxxx4
Boukhechba[34]2018USAM7251.419.8 (2.4)14StudentDASS-21xxx3
Burns[19]2011USAM787.537.4 (19–51)56CommunityPHQ-9xxxxx5
Byrne[25]2019AustraliaM420.0(18–29)7CommunitySCRAM - depxxx3
Caldwell[76]2019USAM115100.027.5 (6.1)3CommunityBDI-IIx1
Cho[4]2016South KoreaM53256.057720CommunityBDI-IIx1
David[47]2018USAM13260.020.68 (18–21)7StudentPHQ-4xx2
Difrancesco[20]2019NetherlandsM35962.450.1 (11.1)7CommunityBDI-IIxxx3
Dillon[21]2018IrelandM39650.8nr7ClinicalCES-Dx1
Doane[77]2015USAM7676.018.1 (0.4)3StudentCES-Dx1
Doryab[44]2014USAM633.3nr120StudentCES-Dxxx3
Ghandeharioun[6]2017USACS1275.037 (20–73)56ClinicalHAM-Dxxxxxx6
Haeffel[78]2017USAM4755.320.97StudentBDI-IIx1
Hori[79]2016JapanM4052.539.87ClinicalHAM-Dx1
Jacobson[80]2019BrazilM1587.047.6 (10.5)7ClinicalBDI, HAMDxx2
Kawada[32]2007JapanM10529.524.1 (1.8)4StudentCES-Dxxx3
Knight[81]2018AustraliaM2377.020.7 (3.2)3CommunityDASS-21x1
Li[82]2018AustraliaM37553.959.5 (5.5)7CommunityCES-Dx1
Lu[5]2018USACS10376.7(18–25)nrStudentQIDSxxx3
Luik[83]2013NetherlandsM173453.462.3 (9.4)7CommunityCES-Dx1
Luik[30]2015NetherlandsM171453.662.2 (9.4)7CommunityCES-Dx1
McCall[84]2015USAM5867.042.1 (12.4)56ClinicalHAM-Dx1
Mendoza-Vasconez[85]2019USAM266nr40.6 (9.9)7CommunityHAM-Dx1
Moukaddam[27]2019USAM2276.050.3 (10.1)56ClinicalPHQ-9xx2
Naismith[86]2011AustraliaM444362.314ClinicalHAM-Dx1
Park[87]2007USAM5457.443 (21–76)14CommunityCES-Dxx2
Pillai[88]2014USAM3973.855 (3.2)7StudentBDI-IIx1
Pratap[89]2019USAM27177.833.4 (10.7)90CommunityPHQ-2xx2
Robillard[33]2013AustraliaM6662.721.57Clinicalclinician assessmentx1
Robillard[41]2014AustraliaM23864.340.410ClinicalHAM-Dxx2
Robillard[38]2015AustraliaM34255.122.314Clinicalclinician assessmentxx2
Robillard[90]2016AustraliaM2548.020.9 (4.6)14Clinicalclinician assessmentxx2
Robillard[91]2018USAM1258.020.1 (18–31)13Clinicalclinician assessmentx1
Saeb[7]2015USAM2171.428.9 (19– 58)14StudentPHQ-9xxx3
Saeb[42]2016USAM3820.8nr70CommunityPHQ-9xx2
Sano[22]2018USAM4772.0(18– 25)30StudentMCSF-12xxxxxxx7
Slyepchenko[37]2019CanadaM7057.9(18– 65)15ClinicalMINIxxx3
Smagula (a)[39]2018aUSAM14567.060 (36-82)9CommunityHAM-Dx1
Smagula (b)[92]2018USAM4538.838.0810CommunityHAM-Dx1
Stremler[93]2017CanadaM10162.734.15CommunityCES-Dx1
Tao[35]2019ChinaM22052.320.3 (2.4)7StudentPROMIS - depx1
Vallance[94]2013CanadaM3850.065.3 (7.5)3CommunityCES-Dx1
Vanderlind[95]2014USAM3542.319.8 (18–23)21StudentCES-Dxx2
Wahle[23]2016SwitzerlandM3664.3(20–57)14CommunityPHQ-9xxxx4
Wang[26]2014USACS4820.8nr7StudentPHQ-9xxx3
Wang[96]2018USACS8351.820.1 (2.3)126StudentPHQ-8xxxxxx6
White[40]2017USAM41860.357 (35–85)7CommunityCES-Dxx2
Yang[45]2017ChinaCS48nrnr70StudentPHQ-9x1
Yaugher[97]2015USAM10058.318.6 (18– 27)7StudentPAI-depx1
Yue[18]2018USACS54nr(18–25)nrStudentPHQ-9xx2
N = 52MedianMedianMedianMedianTotal NTotal
58.057.937.29163124141414741

RMT remote measurement technologies, SD standard deviation, M medical field, CS computer science field, BDI Beck’s Depression Inventory, HAM-D Hamilton Depression Rating Scale, MADRS Montgomery–Åsberg Depression Rating Scale, PHQ Patient Health Questionnaire, PAI-dep Personality Assessment Inventory-depression subscale, CES-D Center for Epidemiologic Studies Depression Scale, MINI Mini International Neuropsychiatric Interview, PROMIS Patient-Reported Outcomes Measurement Information System, MCSF-12 Mental Component of the Short Form Health Survey, QIDS Quick Inventory of Depressive Symptomatology, DASS Depression Anxiety Stress Scales, SCRAM sleep, circadian rhythms, and mood questionnaire.

aNumber of participants/length of follow-up included in passive data collection samples; these may be lower than overall study sample sizes.

Studies were evenly divided between community samples (n = 19), student samples (n = 18) and clinical populations (n = 14). The median sample size was 58, the median age of participants was 38 years, and the median percentage of females was 58%. However, there was a striking lack of information on some key data, with 12% and 8% of studies failing to give data on age or gender, respectively, and 63% failing to include information on ethnicity. Computer science journals were less likely to report age and gender but more likely to report ethnicity (33% of studies failing to report each demographic). Fifteen different measures of depression were used, the most commonly used scales being the Center for Epidemiologic Studies Depression Scale (CES-D[15]; n = 12 studies), Hamilton Rating Scale for Depression (HAM-D[16]; n = 12 studies), and Patient Health Questionnaire-9 (PHQ-9[17]; n = 9).
There were 14 types of devices used across all studies: 12 of them were actigraphy-based wrist-worn devices, including one Fitbit and a Microsoft Band, as well as one pedometer and smartphones (both Android and iPhone). For a breakdown of devices, models and sensors used to measure behaviour see Supplementary Table 1. The cohort design was the most common, meaning that depression was measured at least at two different time points (see Table 2). However, follow-up periods tended to be shorter than 2 weeks (Fig. 2). Two studies provided no information on the length of follow-up, instead only mentioning that data were obtained from participants providing at least 72 h of consecutive data[5,18].
Table 2

The breakdown of study designs within each sample type.

Study design | Total | Student | Community | Clinical
Cross-sectional | 19 | 4 | 10 | 5
Case-control | 6 | 0 | 1 | 5
Cohort | 25 | 14 | 6 | 3
RCT | 3 | 0 | 2 | 1
Total | 51 | 18 | 19 | 14
Fig. 2

Sample sizes and follow-up times for all included studies.

The number of studies by the length of time participants were followed up for in each study, differentiated by sample size.

To understand the relationships between depression and objective features, studies either looked at group differences (including classification analyses) or correlation and regression. Most studies presented direct bivariate relationships (n = 45), allowing for a closer evaluation of which features are promising markers of depressive symptomatology. Ten studies presented the result of a combination of features and their association with the depressive state (n = 7), or depression severity (n = 8), using machine learning methods. Bivariate Pearson correlation coefficients were the most used analytical method (n = 32).

Quality assessment and feasibility

Figure 3 shows a breakdown of quality scores for each item (see Supplementary Fig. 1 and Supplementary Table 2 for quality assessment scores per study). Justification of sample size was rarely given, and sample representativeness was poor, possibly reflecting that many reports were pilot or feasibility studies. Recruitment strategies and non-participation rates were not reported in the majority of cases. Missing data and strategies for handling missing data were infrequently described. Only four studies referred to a previously published protocol[19-22].
Fig. 3

Quality of the literature by each domain.

The figure shows the number of studies scoring on each study quality item. 2 points are given for fully addressing quality criteria, 1 point for partially addressing quality criteria, and 0 points for failing to address quality criteria.

Only five studies reported engagement rates at follow-up, and they all measured engagement at different time points, making comparisons difficult. Additionally, sensor data was sometimes obtained for a subsample, whereas acceptability measures were reported for the wider sample. Eighteen studies (35%) reported, or provided enough information to calculate, how many participants completed the study; completion ranged from 22%[23] at 4 weeks to 100%[24], with a median of 86.6% completers. Reasons for dropouts were provided in four studies and were due to equipment malfunction and technical problems using devices[19,25-27]. Six additional studies reported issues including: lack of data for consecutive days, software error, participants forgetting to charge phones or devices, server and network connectivity problems, sensors breaking, missing clinical data which impeded comparisons with sensor data, and mobile software updates, which can interfere with data integrity[7,22,28-31].

Associations between objective features and depression

The association between groups of features and depression is given in Fig. 4, broken down by feature type. We give the number of studies that have reported the feature and the number of feature–depression associations that reached statistical significance as a proportion of the total such associations reported. See Supplementary Tables 3–10 for a list of tables with terms and feature definitions.
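The per-feature tallies described above come down to simple bookkeeping. The following is a minimal sketch of that count, not the procedure used in the review: the record layout, the outcome labels and the function name `summarise_associations` are illustrative assumptions, as is the choice to take the proportion over every reported association.

```python
from collections import defaultdict

OUTCOMES = ("significant", "non_significant", "non_p_value")  # hypothetical labels


def summarise_associations(records):
    """Tally feature-depression associations per feature.

    records: iterable of (study_id, feature, outcome) tuples, where outcome
    is one of OUTCOMES. This layout is an assumption made for the sketch.
    """
    studies = defaultdict(set)
    counts = defaultdict(lambda: dict.fromkeys(OUTCOMES, 0))
    for study_id, feature, outcome in records:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        studies[feature].add(study_id)
        counts[feature][outcome] += 1

    summary = {}
    for feature, tally in counts.items():
        total = sum(tally.values())
        summary[feature] = {
            "n_studies": len(studies[feature]),
            "n_associations": total,
            # Assumption: the denominator is every reported association,
            # including those analysed without a p-value.
            "prop_significant": tally["significant"] / total,
        }
    return summary


# Made-up records for illustration only; not data from the review.
example = [
    ("study_a", "homestay", "significant"),
    ("study_a", "homestay", "significant"),
    ("study_b", "homestay", "non_p_value"),
    ("study_b", "entropy", "non_significant"),
]
result = summarise_associations(example)
```

Counting studies separately from associations matters because a single study can contribute several associations for the same feature.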
Fig. 4

Feature associations with depression by behaviour type.

The number of times each feature (a sleep, b physical activity, c circadian rhythm, d sociability, e location and f phone use) has been reported in all included studies and their association with depression, where these associations are defined as having a below-threshold p-value (“Significant Association”), above-threshold p-value (“Non-Significant Association”), and where statistical methods have been used that do not yield p-values (“Non-p-value”). The graphs also show the number of studies assessing each feature.

Twenty-nine studies collected data on sleep, typically ascertained using accelerometer, light and heart rate sensors. Nine different features of sleep are reported in Fig. 4a. Sleep quality, encompassing features relating to sleep fragmentation (number of awakenings and wake after sleep onset [WASO]), was the most commonly reported feature. Sleep efficiency is presented as a separate feature given its prevalence in studies. For all significant results, lower sleep efficiency or quality was associated with higher depression scores. Features with higher proportions of significant findings were sleep stability, sleep offset and time in bed; longer time in bed and later sleep offset were associated with higher depression scores. Across studies finding significant results, sleep variability was higher for those with depression compared to controls[27], and for those with more severe symptoms[28]. The average length of follow-up for studies showing significant associations between sleep stability and depression was 24.7 days (range = 4–63), whereas that for studies showing no significant associations was 8.6 days (range = 3–21). Total sleep time showed mixed directionality, with some studies finding negative correlations between total sleep time and depression severity[26,32], and others finding that the depressed group had longer sleep time than controls[33].
Measures of physical activity were collected in 19 studies using a mixture of smartphones (n = 8) and wearable devices (n = 11). Activity levels were predominantly measured as gross motor activity within a day, and depression was negatively correlated with physical activity[20,34]. Of the seven studies extracting ‘activity levels’ as a feature within physical activity, both studies using smartphones found a significant association with depression severity, compared to one of the five that used wrist actigraphy. Higher depressive symptoms were associated with less time spent engaging in physical activity[5], lower movement speed[18] and lower step count[27]. Two of the three studies looking at intensity found lower depression in those with more instances of intense activity and fewer sedentary behaviours[5,20], with the third study[35] finding no significant associations. The authors of that study reported very little variability in activity intensity, which could account for the null finding.
A total of 13 studies assessed movement patterns within a 24-h period. All used accelerometry data, except for Saeb et al.[7], who used GPS data for circadian movement. All significant associations indicated that disturbed rest-activity patterns were associated with depressive symptoms; however, in the majority of instances where circadian rhythm was reported, no significant association with mood was detected. Depression has been associated with lower daytime activity and higher night-time activity (hour-based activity levels[36,37]); low intra-daily stability, more fragmented intra-daily movement, e.g., leaving for work and coming back at less regular times[7]; later acrophase, or later activity peaks[38-40]; and lower amplitude, or less difference between the average levels of activity during the peaks vs. the troughs of activity[20,39].
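Amplitude and acrophase are commonly derived by fitting a single 24-h cosine to the activity series. The sketch below shows one way to do this; it is an illustration under stated assumptions, not the method of any reviewed study. It assumes evenly spaced samples covering whole days with no gaps, in which case the least-squares fit has a closed form. The function name `cosinor_fit` and the synthetic signal are hypothetical.

```python
import math


def cosinor_fit(activity, samples_per_day=24):
    """Fit y(t) = mesor + amplitude * cos(omega * t - phase) to activity.

    Assumes evenly spaced samples spanning whole days without gaps, so the
    cosine and sine regressors are orthogonal and the least-squares
    coefficients reduce to simple sums. Real data with gaps would need a
    general regression instead.
    """
    n = len(activity)
    if samples_per_day < 3 or n == 0 or n % samples_per_day != 0:
        raise ValueError("expects whole days of evenly spaced samples")
    omega = 2 * math.pi / samples_per_day  # radians per sample
    mesor = sum(activity) / n              # rhythm-adjusted mean
    beta = 2 / n * sum(y * math.cos(omega * i) for i, y in enumerate(activity))
    gamma = 2 / n * sum(y * math.sin(omega * i) for i, y in enumerate(activity))
    amplitude = math.hypot(beta, gamma)    # peak height above the mesor
    peak_in_samples = math.atan2(gamma, beta) / omega
    acrophase_hours = (peak_in_samples * 24 / samples_per_day) % 24
    return mesor, amplitude, acrophase_hours


# Synthetic two-day hourly signal peaking at 15:00; illustration only.
signal = [10 + 4 * math.cos(2 * math.pi * (h - 15) / 24) for h in range(48)]
mesor, amplitude, acrophase = cosinor_fit(signal)
```

A later acrophase corresponds to a later fitted activity peak, and a lower amplitude to a flatter rest-activity profile.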
Four studies calculated circadian rhythmicity as a measure of the extent to which a participant’s pattern follows an expected cosinor model, finding that lower circadian rhythmicity was more likely to be associated with being depressed[37,41-43].
Eleven studies assessed sociability. The average number of incoming and outgoing calls was found to be negatively correlated with depressive symptoms in one small study (n = 6), and only in men[44]. Yang et al.[45], with a combination of microphone, GPS and Bluetooth sensing as a proxy for social proximity, found that an interaction between environmental noise and proximity to others was informative of depressive state, e.g. being in a quiet place with few people around, compared to either spending time outside alone or in a noisy environment with more than 3 people. Other studies found that a higher frequency of conversations in the day and at night correlated with lower depression[26], as did being around human speech for longer[46].
Location was assessed in 11 studies, measured via GPS. In addition to traditional statistical analyses, Saeb et al.[7] estimated accuracy and mean normalised root-mean-square deviation (NRMSD) to assess the performance of prediction models. We therefore do not have levels of significance as expressed via p-values for all features. Entropy was reported in 26 cases in four different studies. Low entropy, or spending more time in fewer, more consistent locations, was associated with depression, as compared to higher entropy, where people spend more time in a greater number of more varied locations. Features of location variance (how varied a participant’s locations are) show a negative correlation with depression: the more varied the locations, the lower the likelihood of being depressed. Homestay (the amount of time spent at home) shows one of the most consistent patterns across the field, with all included studies reporting a significant association with depression.
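To make the location features concrete, the sketch below computes entropy and homestay from time spent at labelled places. Entropy here is the Shannon entropy of the share of time per place, so it is low when time is concentrated in a few places and high when time is spread evenly across many. The input layout, the `home` label and the function name `location_features` are assumptions for illustration; the reviewed studies derive places from raw GPS in ways that differ between papers.

```python
import math
from collections import defaultdict


def location_features(visits, home_label="home"):
    """Compute simple location features for one participant.

    visits: list of (place_label, minutes) pairs. Place labels are assumed
    to come from an earlier clustering step that is not shown here.
    """
    minutes = defaultdict(float)
    for label, duration in visits:
        minutes[label] += duration
    total = sum(minutes.values())
    if total <= 0:
        raise ValueError("no location time recorded")

    shares = [m / total for m in minutes.values() if m > 0]
    # Shannon entropy of the time distribution across places.
    entropy = -sum(p * math.log(p) for p in shares)
    # Dividing by log(number of places) rescales entropy to the 0-1 range.
    n_places = len(shares)
    normalised = entropy / math.log(n_places) if n_places > 1 else 0.0
    homestay = minutes.get(home_label, 0.0) / total  # share of time at home
    return {
        "entropy": entropy,
        "normalised_entropy": normalised,
        "homestay": homestay,
    }


# Made-up visit logs for illustration only.
concentrated = location_features([("home", 900), ("work", 100)])
spread = location_features(
    [("home", 250), ("work", 250), ("gym", 250), ("cafe", 250)]
)
```

In this toy example the participant who spends 90% of the time at home has both the lower entropy and the higher homestay value.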
Three studies associated individual phone use features with depression. All three found that increased unlock duration and unlock frequency were associated with depression; non-p-value tests reported a mean NRMSD of 0.268 and 0.249, and 74.2% and 68.6% accuracy in classifying depressed vs. non-depressed participants, respectively. Increased use of specific apps, such as Instagram, iOS Maps, and photo and video apps, was associated with greater depression, whereas book apps were associated with milder symptoms[47].
Skin temperature was measured by Ávila-Moraes et al.[36], who extracted more than five skin temperature features from a wrist-worn device and found depressed people to have a longer time of elevated temperature compared to controls. One study[48] reported no association between heart rate and depression scores. Ávila-Moraes et al.[36] also used a wrist-worn actigraphy device to measure light exposure and extracted four features. They found depressed groups to have a lower variance of light intensity than controls. Another study found humidity to have a significant positive correlation with depressive symptoms (r = 0.4) in women, but a negative correlation in men, suggesting that women, but not men, might feel a worsening of their condition during rainy weeks[44].

Sensitivity analysis

We carried out a sensitivity analysis to evaluate whether including only high-quality studies had any effect on our overall findings, the results of which can be found in Supplementary Fig. 3. After excluding studies with a score of eight out of 15 or lower, 20 papers remained. Overall, we found that excluding poor-quality studies did not change the patterns of association or significance ratios for sleep, physical activity, sociability, and location, beyond reducing the number of studies and therefore features that were analysed. Many of the studies on circadian rhythms are excluded, making existing associations even more tenuous; all studies showing a significant association between mood and intradaily stability or acrophase are lost, as are those finding no association between hour-based activity levels and depression. No studies looking at bivariate associations between phone use and depression remained.

Combined features

Tables 3 and 4 show the ten studies combining digital features to predict symptom severity (regression models) or depressive state (classification models). Twenty-four models in total were presented across these studies, the majority of which (n = 18) included features of physical activity, followed by location (n = 14), phone use (n = 11) and sleep (n = 9). Both classification and regression models showed predictive value; however, many studies lacked information regarding the handling of missing sensor data and calibration. Those that did provide this information reported simple imputation methods such as mean imputation, with two studies using multiple imputation methods[6,18].
Table 3

Details for studies analysing combined features using classification models.

Study ID | Quality rating | First author, year | Device | Groups | N | No. of features | Feature type | Algorithm/model | Performance measure | Discrimination value | Missing data handling | Validation method | Comparison models
1 | 12 | Sano, 2018 | Q sensor, smartphone | MCS SF-12 Low vs. High | 47 | 204 | PA, SC, Li | SVM RBF | Accuracy | 85.1 | Interpolation | 10-fold cross-validation | LASSO, SVM Linear
 |  |  |  |  |  | 441 | PA, L, PU, SC, ST | SVM RBF | Accuracy | 86.1 |  |  |
 |  |  |  |  |  | 700 | S, PA, PU, SC, ST, HR, Cl | SVM RBF | Accuracy | 77.2 |  |  |
 |  |  |  |  |  | 296 | S, PA, PU | SVM RBF | Accuracy | 78.7 |  |  |
 |  |  |  |  |  | 25 | PU | SVM RBF | Accuracy | 71.1 |  |  |
 |  |  |  |  |  | 25 | S | SVM RBF | Accuracy | 65 |  |  |
2 | 8 | Yue, 2018 | Android | Clinician MDD vs. HC | 25 | 8 | PA, L | SVM RBF | F1 | 0.66 | Multiple imputation | LOOCV | l2-regularised (ridge) regression
 |  |  | iPhone |  | 54 | 8 | PA, L | SVM RBF | F1 | 0.76 |  |  |
3 | 8 | Wahle, 2016 | Smartphone | PHQ-9 Dep vs. HC | 36 | 120 | PA, So, L, PU | Random Forest | Accuracy | 60.1 | Unclear | LOOCV | SVM
4 | 10 | Pratap, 2019 | Smartphone | PHQ-2 Dep vs. HC | 93 | 10 | So, L | Random Forest | Median AUC | >0.50 (for 80.6% sample) | Mean imputation | None |
5 | 7 | Saeb, 2015 | Android | PHQ-9 Dep vs. HC | 18 | 8 | CR, L | Elastic Net Logistic Regression | Accuracy | 78.8 | Unclear | LOOCV |
6 | 7 | Wang, 2018 | Smartphone | PHQ 4 Dep vs. HC | 83 | 9 | S, PA, L, PU, HR | Lasso Logistic Regression | AUC | 0.809 | Unclear | 10-fold cross-validation |
7 | 9 | Lu, 2018 | Smartphone and Fitbit | QIDS | 69 | 36 | S, PA, So | Multi-Task Deep Learning | F1 | 0.77 | Exclusion | LO(W)OCV | STL (Lasso), STL (Ridge), MTL Lasso and Ridge

MCS SF mental component survey short form, PHQ Patient Health Questionnaire, MDD major depressive disorder, HC healthy control, S sleep, PA physical activity, CR circadian rhythm, So Sociability, L Location, PU phone use, SC skin conductance, ST skin temperature, HR heart rate, Li light, Cl clinical data, SVM RBF Support Vector Machine - Radial Basis Function, AUC Area Under the Curve, LOOCV Leave One Out Cross Validation, STL Single Task Learning, MTL = Multi-Task Learning

Table 4

Details for studies analysing combined features using regression models.

Study ID | Quality rating | First author, year | Device | Outcome | N | No. of features | Feature type | Algorithm | Performance measure | Exact statistic | Missing data handling | Validation method | Comparison
2 | 8 | Yue, 2018 | Android | PHQ9 | 25 | 8 | PA, L | SVM RBF | r | 0.46 | Multiple imputation | LOOCV | Support Vector Multivariate Linear Regression
 |  |  | iPhone | PHQ9 | 54 | 8 | PA, L | SVM RBF | r | 0.41 |  |  | Support Vector Multivariate Linear Regression
4 | 10 | Pratap, 2019 | Smartphone | PHQ2 | 93 | 10 | PA, So, L | Random Forests | R2 | ≈ 0 | Mean imputation | None reported |
5 | 7 | Saeb, 2015 | Smartphone | PHQ9 | 18 | 8 | CR, L | Elastic net linear regression | Mean NRMSD | 0.251 | Unclear | LOOCV |
 |  |  |  |  | 21 | 2 | PU | Elastic net linear regression | Mean NRMSD | 0.273 |  |  |
6 | 7 | Wang, 2018 | Smartphone | pre PHQ 8 | 83 | 10 | S, PA, L, PU, HR | Lasso linear regression | MAE | 2.4 | Unclear | 10-fold cross-validation |
 |  |  |  | post PHQ 8 | 83 | 5 | S, PA, So, L, PU | Lasso linear regression | MAE | 3.6 |  |  |
7 | 9 | Lu, 2018 | Smartphone, Fitbit | QIDS | 69 | 36 | S, PA, So | Multi-task deep learning | R2 | 0.44 | Exclusion | LO(W)OCV | STL (Lasso), STL (Ridge), MTL Lasso and Ridge
8 | 7 | Burns, 2011 | Smartphone | PHQ9 | 7 | 38 | PA, So, L, PU, Li | Regression trees | Accuracy | nr | Unclear | 10-fold cross-validation |
9 | 8 | Jacobson, 2019 | Actiwatch | BDI-II | 15 | nr | PA, Li | Xgboost | r | 0.86 | Unclear | LOOCV |
10 | 7 | Ghandeharioun, 2017 | Empatica, smartphone | HRDS | 12 | 700 | S, PA, PU | Combination of regularised regression, robust-to-outlier, boosting, Random Forest and Gaussian Process | RMSE | 4.5 | Multiple imputation | 10-fold cross-validation |

PHQ Patient Health Questionnaire, QIDS Quick Inventory of Depressive Symptomatology, nr not reported, S sleep, PA physical activity, CR circadian rhythm, So sociability, L location, PU phone use, SC skin conductance, ST skin temperature, HR heart rate, Li light, Cl clinical data, SVM RBF support vector machine-radial basis function, NRMSD normalised root-mean-square deviation, RMSE root-mean-square error, MAE mean absolute error, STL single-task learning, MTL multi-task learning.
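The error measures and the leave-one-out validation listed in Tables 3 and 4 can be illustrated with a small sketch. This is not a reimplementation of any reviewed model: the one-feature least-squares line stands in for the studies' algorithms, all function names are hypothetical, and NRMSD is normalised here by the observed range of the outcome, which is one common convention and an assumption about how individual papers defined it.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def nrmsd(y_true, y_pred):
    """RMSE divided by the observed outcome range (assumed normalisation)."""
    spread = max(y_true) - min(y_true)
    if spread == 0:
        raise ValueError("outcome has no variation")
    return rmse(y_true, y_pred) / spread


def loocv_predictions(x, y):
    """Leave-one-out predictions from a one-feature least-squares line.

    Each participant is predicted by a line fitted to all the others, so
    no participant's own score informs their prediction.
    """
    predictions = []
    for i in range(len(x)):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((a - mean_x) ** 2 for a in xs)
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
        slope = sxy / sxx if sxx else 0.0
        predictions.append(mean_y + slope * (x[i] - mean_x))
    return predictions


# Made-up feature values and scores for illustration only.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [2 * v + 1 for v in feature]
held_out = loocv_predictions(feature, score)
```

Because MAE and RMSE are expressed in the units of the depression scale while NRMSD is unitless, values from studies using different scales or normalisations are not directly comparable.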


Discussion

We sought to summarise the literature on passive sensing for depression, in order to map the methodological challenges and threats to reproducibility, in an effort to generate standards in the literature that allow for quantitative synthesis of results. We also assessed the available evidence for a relationship between sensor data and mood to identify leading digital signals for depression.
The first methodological shortcoming stems from the recency of this field. Studies have mostly employed opportunistic study designs, with small sample sizes, short follow-up windows and many being conducted on students, which limits generalisability. Different features may reach peak predictability of mood with different sampling timeframes, so shorter follow-ups may harm the prediction abilities for some behaviours[22]. This is presumably more likely in feature types such as sleep and circadian rhythm, which benefit from having more aggregated baseline data[49]. There is no consensus on the timeframe window for optimal phenotyping; different windows therefore need to be evaluated.
A critical source of heterogeneity comes from the multitude of methods to create any individual feature, often without providing reasonable details of the process. A feature of sleep quality, for instance, defined in different studies as “Nocturnal Awakenings”, may have been constructed by measuring counts of awakenings, total number of minutes awake, or a proportion of awake vs. asleep in a sleep session. Additionally, there may be differences in how raw sensor data is used to classify an event as sleep or awake. This heterogeneity challenges the ability of investigators to reproduce findings and hampered our ability to summarise results in a meta-analysis. The exploratory nature of many of these studies means that many different versions of the same feature may have been generated, but studies do not transparently describe and justify feature selection and its association with depression.
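The three constructions of “Nocturnal Awakenings” mentioned above can give different pictures of the same night. The sketch below derives all three from an epoch-by-epoch sleep/wake sequence; the input format, the trimming of the session to the span between the first and last sleep epoch, and the function name `awakening_features` are assumptions for illustration rather than the definitions used in any reviewed study.

```python
def awakening_features(epochs, epoch_minutes=1.0):
    """Derive three awakening measures from one sleep session.

    epochs: list of booleans, True = scored asleep, False = scored awake.
    The session is trimmed to the span between the first and last sleep
    epoch (an assumed convention) before anything is counted.
    """
    asleep_positions = [i for i, asleep in enumerate(epochs) if asleep]
    if not asleep_positions:
        raise ValueError("no sleep epochs in session")
    period = epochs[asleep_positions[0]: asleep_positions[-1] + 1]

    # Version 1: number of sleep-to-wake transitions.
    count = sum(1 for prev, cur in zip(period, period[1:]) if prev and not cur)
    # Version 2: total minutes scored awake within the sleep period.
    minutes_awake = sum(1 for asleep in period if not asleep) * epoch_minutes
    # Version 3: proportion of the sleep period scored awake.
    proportion_awake = minutes_awake / (len(period) * epoch_minutes)
    return {
        "awakening_count": count,
        "minutes_awake": minutes_awake,
        "proportion_awake": proportion_awake,
    }


# Two made-up nights with the same number of awakenings.
night_a = awakening_features([True, False, True, False, True])
night_b = awakening_features(
    [True, False, False, False, False, True, False, True]
)
```

Both nights contain two awakenings, yet the second has more minutes awake and a larger proportion of the session awake, so the direction and size of an association with depression can depend on which version a study reports.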
Researchers should provide a description of the feature, in the paper or supplement materials, that is sufficiently clear to allow for appropriate reproducibility. Additionally, due to the large number of variables obtained in sensing studies, it is likely that published papers are selective in their reporting, and typically emphasise “positive” findings over “negative” ones. Preregistering studies and analyses would be one way of handling this. As the field matures and more studies are published, issues of rigour and reproducibility become more salient, and preregistration becomes more important to reduce reporting bias and cherry-picking in the field. The sources of heterogeneity arise from varying data collection timespans, depression assessment measures, feature construction, and analytical methods. Whilst differences in these areas represent a healthy heterogeneity in an evolving field, it means that nuance is required in interpreting the presence or absence of a relationship between any specific signal and depressed mood. For example, many studies recruited students, who have different socialisation patterns and smartphone usage to older adults[12,50]. Prediction models based on younger populations have been found not to transfer to older age groups[51]. Further, a signal detected in a clinical sample consisting of people with relatively severe depression may not be reproduced in a population sample where the majority of the sample have few or no depressive symptoms and there may be less variability in key sensor data (e.g. sleep or activity data). For any broad concept (e.g. sleep or circadian rhythm) different sensor types or operating systems were used, and component features were derived using different approaches. For example, both iPhone and Android smartphone operating systems were included, and sometimes showed differences in significance levels for the same variables[5,18]. 
This could be due to differences in sampling and data collections for both operating systems, or differences in the user profiles of these products[52]. We found significant shortcomings in the literature in terms of fundamentals of reporting, including the most basic descriptors of sample characteristics, recruitment, attrition, and missing data. Whilst many of these shortcomings would be resolved by authors and journals following established reporting conventions (e.g. STROBE guidelines), there are a number of issues that are specific to this field. One of those issues is missing data. Our quality assessments reflect poor reporting of missing data at both the sample level (e.g. attrition and study non-completion) and individual level (e.g. missing sensor data from participants). Missing data can arise from issues with technology, such as device and system failures, or from user-related issues which may be associated with depressed mood. For missing data to be used informatively, these two types need to be identified and dealt with in different ways in terms of their exclusion or analysis. Additionally, researchers set different thresholds as to what counts as missing data. This varies between studies and generates an important threat to reproducibility, making it crucial that these thresholds are reported. Our recommendation is that papers should clearly state how much data were missing and how it was managed in the analysis. Remote sensing is a relatively new technology that potentially places a considerable burden on study participants—it was therefore surprising that few studies reported on the acceptability of the study protocol to participants. Where this did happen the emphasis was more on evaluating active questionnaire data rather than passive data and device use, where arguably greater issues over privacy and acceptability arise[6,41]. 
Finally, there is a general lack of discussion around the extent to which the devices used in these research studies are valid or reliable tools to detect the behaviours of interest. While some behaviours may appear relatively simple to infer from single sensors, such as GPS sensors to infer location and accelerometry as a measure of movement and physical activity, there are validity and reliability concerns surrounding them. For example, although GPS receivers are generally good at detecting location and movement[53], smartphone-based GPS receivers may differ in their measures of distance travelled[54]. Accelerometers are also generally accepted as reliable but can vary in their output and validity in measuring physical activity across devices[55]. More complex behaviours such as sociability and sleep require multisensory data and a larger inferential leap. The evidence for actigraphy for the detection of sleep is uncertain, as several studies have found strong correlations between actigraphy and the gold standard of polysomnography (PSG)[56,57], but a scoping review of 43 studies finding only moderate to poor agreement[58]. A more recent systematic review, however, found that while actigraphy tended to overestimate sleep and underestimate wake, this inaccuracy was consistent, thereby maintaining its usefulness as a potential marker of sleep–wake patterns[59]. There is a clear gap in the definition of validity and reliability of these devices, however, whether or not these sensors measure the exact ground truth may be less concerning than whether the features we do extract are consistent against each other and serve the purpose of detecting changes in health status. So even though we would expect less reliable technologies to increase the noise to signal ratio, the extent to which any inaccuracies in the devices reduce the strength of association in depression is unknown.
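The “Nocturnal Awakenings” example above can be made concrete with a minimal sketch. It assumes a hypothetical encoding in which each sleep session is a list of epochs (1 = asleep, 0 = awake); the function name, the encoding and the one-minute epoch length are illustrative assumptions and are not taken from any of the reviewed studies. The three constructions described in the text are computed side by side to show that they need not rank two nights in the same order.

```python
from typing import Dict, List


def awakening_features(epochs: List[int], epoch_minutes: float = 1.0) -> Dict[str, float]:
    """Three possible constructions of a 'nocturnal awakenings' feature.

    epochs: hypothetical per-epoch labels for one sleep session,
            1 = asleep, 0 = awake.
    epoch_minutes: assumed epoch length in minutes.
    """
    if not epochs:
        raise ValueError("epochs must not be empty")
    awake_epochs = epochs.count(0)
    # An awakening is counted at each transition from asleep (1) to awake (0).
    awakening_count = sum(
        1 for prev, cur in zip(epochs, epochs[1:]) if prev == 1 and cur == 0
    )
    return {
        "awakening_count": awakening_count,
        "minutes_awake": awake_epochs * epoch_minutes,
        "proportion_awake": awake_epochs / len(epochs),
    }


# Night A: three brief awakenings. Night B: one long awakening.
night_a = awakening_features([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
night_b = awakening_features([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
print(night_a)
print(night_b)
```

In this toy example night A scores higher on the count of awakenings while night B scores higher on minutes and proportion awake, which is why the construction used needs to be stated explicitly.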

Association between mood and digital features

Given the heterogeneity in research quality and reporting standards across studies, making inferences from aggregated associations between digital features and mood may be misleading. It would, however, be a missed opportunity to ignore growing consensus between studies in detecting associations between mood and digital features. We therefore report a synthesis of the findings but urge the reader to interpret this summary with caution. Features that consistently appear to be associated with depression are location-based features, with homestay and entropy associated with mood in 4 and 5 studies, respectively. However, these studies do not determine the direction of causality, i.e. whether changes in sensed features such as homestay are merely a reflection of behaviours that appear in depression, such as reduced physical activity and social withdrawal[60,61], or whether they are, in themselves, predictors of deterioration in mood. Several sleep features also appear to be consistently associated with depressed mood, with sleep stability showing the highest proportion of significant associations. When measuring socialisation, proximity-related features using Bluetooth and microphone sensors seem more sensitive to mood than call and message frequency counts. However, many of these studies have small sample sizes (median = 58), student samples with a low mean age[34] or report a high degree of intra- and interindividual variance in daily phone usage[62]. Recent studies with larger and more diverse samples using classification machine learning techniques have found that a low average number and duration of calls made daily predicted depression state[63]. Even though disruptions in circadian rhythms have been thought to affect depression[64], the majority of studied features did not have a significant association with mood. As previously mentioned, this may be due to short follow-up, since the median follow-up time for circadian features was 9 days.
The findings of this review highlight the array of potential predictors that sensor data generate. As such, machine learning methods have been the analytic approach of choice for the digital phenotyping of depression from multiple features. In addition to helping account for important interactions between the objective features, for example how the effect of being alone is mediated by location (being indoors vs outdoors)[45], analysing multimodal data in this way may help compensate for data missing from one source with data from another. However, machine learning methods have been criticised for lacking transparency in how the model is built and how individual variables contribute to the overall prediction[65]. Some studies in the current review do report their top predictors and bivariate associations with depression, but the question of how well these models can be replicated remains, highlighting the importance of thorough reporting.
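To illustrate the two location features named above, the following is a minimal sketch of one possible construction of homestay and location entropy. It assumes that time has already been attributed to labelled places (for example by clustering GPS samples, which is not shown) and uses Shannon entropy with the natural logarithm. The function names, the input format and the "home" label are illustrative assumptions; as the review stresses, individual studies may define these features differently.

```python
import math
from typing import Dict


def location_entropy(minutes_by_place: Dict[str, float]) -> float:
    """Shannon entropy (natural log) of the share of time spent at each place."""
    total = sum(minutes_by_place.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for minutes in minutes_by_place.values():
        if minutes > 0:
            share = minutes / total
            entropy -= share * math.log(share)
    return entropy


def homestay(minutes_by_place: Dict[str, float], home_label: str = "home") -> float:
    """Proportion of recorded time spent at the place labelled as home."""
    total = sum(minutes_by_place.values())
    if total <= 0:
        return 0.0
    return minutes_by_place.get(home_label, 0.0) / total


# Hypothetical days: one split evenly between two places, one spent at home.
varied_day = {"home": 720.0, "work": 720.0}
home_day = {"home": 1440.0}
print(location_entropy(varied_day), homestay(varied_day))
print(location_entropy(home_day), homestay(home_day))
```

Under these definitions a day spent entirely at home has entropy 0 and homestay 1, whereas an even split between two places gives entropy ln 2 and homestay 0.5.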

Strengths and limitations

Our attempt to summarise the literature is necessarily crude because the reporting of feature–depression associations was too opaque and diverse to allow any credible attempt at meta-analysis. We have therefore had to rely on simple counts of the associations reported, with the caveat that reports are not weighted by sample size, follow-up duration or study quality. It is possible that the associations we have reported are due to reporting bias, as mentioned in the previous section, where investigators emphasise “significant” findings over “non-significant” ones. To present low-level features in a clear and meaningful way in this review, we combined them into broader low-level features and therefore some of the nuances between them were lost. For example, if one study extracted two features such as the total number of minutes spent in phone calls and the average length of a phone call, they would both load into Call Duration, within the “Sociability” Feature Type (Supplementary Tables 3–10). Several studies included in this review have overlapping samples as they come from existing datasets. For example, four papers[26,42,45,48] use the StudentLife open dataset, where there is some similarity in the analysis, meaning that some of the feature associations may be duplicated.

Recommendations and conclusions

Whilst there have been attempts at standardising reporting for actively collected questionnaire data on mood[66], and guidelines exist for the reporting of observational data (STROBE[67]) and multivariable prediction models (TRIPOD statement[68]), there is a need to develop consensus over the manner in which such mobile health studies are conducted and reported. This should not come at the expense of stifling innovation and should acknowledge that a new field of study takes time to develop. The literature we identified derives from both clinical and computer science disciplines and some of the heterogeneity we report results from these disciplines having distinct conventions, with medical outputs putting more weight on sample and clinical outcome characteristics but often overlooking feature extraction and analysis description. The importance of recruiting and reporting the diversity of study samples, however, is highlighted by the difference in validity of these devices in detecting the behaviours of interest. For example, some wearable devices may be more accurate on lighter skin tones[69], and in men[70]. There is a need for experts across the disciplines to build upon and generate a consensus on a set of established guidelines, but based on this work, the following recommendations emerge as a first step towards improving the generalisability of research and generating a more standardised approach to passive sensing in depression.

Sample recommendations:
Report recruitment strategies, sampling frames and participation rates.
Increase the diversity of study populations by recruiting participants of different ages and ethnicities.
Report basic demographic and clinical data such as age, gender, ethnicity and comorbidities.
Measure and report participant engagement and acceptability in the form of attrition rates, missing data, and/or qualitative data.

Data collection and analysis:
Use established and validated scales for depression assessment.
Present the available evidence, if any, on the validity and reliability of the sensor or device used.
Register the study protocol, including pre-specification of analytical plans and hypotheses.
Describe data processing and feature construction in sufficient detail to allow replication.
Provide a definition and description of missing data management.
In machine learning models, describe the model selection strategy, performance metrics and parameter estimates in the model with confidence intervals, or nonparametric equivalents (for a full guideline on reporting machine learning models see Luo[71]).

Data sharing considerations:
Make the code used for feature extraction available within an open science framework.
Share anonymised datasets on data repositories.

The above points cover aspects of transparency, validity and generalisability. Data sharing considerations become critical in this respect, especially with the use of big data and machine learning models, where validation of the model and data is an integral part of the process. It is therefore important to work towards the creation of open datasets or the widespread sharing of data and to work with community groups to standardise the description, exchange and use of mobile health data. Our most pressing recommendation, however, is that there is a need for consistency in reporting in this field. The failure to report basic demographic information found in many studies, particularly from the computer science field, and the limited description of feature extraction and analysis in medical papers, have important implications for the interpretation of findings. A common framework, with standardised assessment and analytical tools, robust feature extraction and missing data descriptions, tested in more representative populations, would be an important step towards improving the ability of researchers to evaluate the strength of the evidence.
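The recommendation to define and report missing data can be illustrated with a minimal sketch. It assumes a hypothetical record of hours of sensor data captured per participant per day, and treats a day as missing when coverage falls below a stated threshold. The function name, the input format and the threshold values are illustrative assumptions only; no particular threshold is endorsed by this review.

```python
from typing import Dict, List


def missing_day_proportion(
    hours_by_participant: Dict[str, List[float]], min_valid_hours: float
) -> Dict[str, float]:
    """Proportion of days per participant with coverage below the threshold.

    hours_by_participant: hypothetical hours of sensor data recorded each day.
    min_valid_hours: assumed minimum coverage for a day to count as observed.
    """
    report = {}
    for participant, daily_hours in hours_by_participant.items():
        if not daily_hours:
            # No days recorded at all is treated as fully missing.
            report[participant] = 1.0
            continue
        missing_days = sum(1 for hours in daily_hours if hours < min_valid_hours)
        report[participant] = missing_days / len(daily_hours)
    return report


# Hypothetical coverage data for two participants.
coverage = {"p1": [24.0, 10.0, 2.0, 0.0], "p2": [24.0, 24.0]}
lenient = missing_day_proportion(coverage, min_valid_hours=8.0)
strict = missing_day_proportion(coverage, min_valid_hours=12.0)
print(lenient)
print(strict)
```

The same raw data yield different missingness for participant p1 under the two thresholds, which is why the threshold itself needs to be reported alongside the amount of missing data.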

Methods

Search strategy and selection criteria

We searched PubMed, IEEE Xplore, ACM Digital Library, Web of Science, and Embase and PsycINFO via OVID, for studies published between January 2007 and November 2019, and used a combination of terms related to the key concepts of (1) depression and (2) digital sensors and remote measurement technologies (RMTs) (full search in Supplementary Note 1). We also conducted searches based on bibliographies of reviews and meta-analyses on the topic. The protocol was registered on PROSPERO 2019 CRD42019159929. Studies had to have measured depressive symptoms in either clinical or epidemiological samples and to consist of samples with mean ages between 18 and 65 years, due to the differences in behavioural patterns for older adults and children. We limited studies to those which had extracted data for at least 3 consecutive days (to allow for intraday mood fluctuations) from smartphones and wrist-worn devices. Data from devices not worn on the wrist, e.g. on the chest, upper arm or hip, were excluded due to measurement discrepancies between devices worn on different body parts[72]. Studies had to link data between validated scales of depression severity or status (case/non-case) and digital sensor-based variables including measures of behaviour, e.g. activity, sleep, etc., gathered passively. Studies had to be written in English, German or Spanish (the languages spoken by the reviewers), and be published, peer-reviewed and accessible in full text. Studies were excluded if their primary focus related to a condition other than depression or if they were conducted in inpatient settings. Studies focusing specifically on bipolar depression were excluded; however, studies with mixed unipolar and bipolar samples were included provided unipolar cases comprised a substantial majority (at least 80%) of the sample. We excluded studies published before 2007 as this was when the first smartphones became available.

Procedure

Studies were checked for eligibility by two researchers independently screening titles and abstracts. The full texts of potentially eligible studies were reviewed by one researcher, with a second researcher evaluating a random sample of 10% of all texts for validation purposes. Disagreements at any stage of eligibility and data extraction were resolved by discussion with an additional reviewer. Agreement of >90% was reached for all reviewer pairs. The eligibility process was documented according to PRISMA guidelines[73].

Data extraction

Data extraction included the following variables: sample characteristics (N, mean age, gender, ethnicity), comorbidities, study design, study setting (clinical, community, student), depression outcome measures, length of follow-up, device type, features measured, sensors used, statistical analyses and significance levels.

Study quality assessment

No single quality assessment tool was suitable because of the diversity of study types. We therefore combined the Appraisal Tool for Cross-Sectional Studies (AXIS tool)[74] and the Newcastle–Ottawa Scale (NOS) for longitudinal studies[75]. Items were scored with two points for fully fulfilled items, one point for partially fulfilled items, and zero for a non-fulfilled item (see Supplementary Table 11 for a description of each criterion). We added an item regarding having a published protocol prior to publishing results (1 point for a published protocol). Data extraction was carried out on all studies, regardless of their quality assessment score.
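The scoring rule described above can be sketched as follows. The rating labels and function name are illustrative assumptions; the actual criteria are those listed in Supplementary Table 11 and are not reproduced here.

```python
from typing import List

# Points per item as described in the text: full = 2, partial = 1, none = 0.
POINTS = {"full": 2, "partial": 1, "none": 0}


def quality_score(item_ratings: List[str], has_published_protocol: bool) -> int:
    """Sum item points and add 1 point for a protocol published before results."""
    item_total = sum(POINTS[rating] for rating in item_ratings)
    return item_total + (1 if has_published_protocol else 0)


# Hypothetical study rated on four items, with a published protocol.
example_score = quality_score(["full", "partial", "none", "full"], True)
print(example_score)
```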

Feasibility

We collected information on five measures of the feasibility of using digital health tools, with the aim of identifying potential obstacles to their implementation: engagement with study devices, reasons for study dropout, reported problems with technology, percentage of study tasks completed, attrition and missing data.

Data synthesis

Eight categories of behavioural features were identified: sleep, physical activity, circadian rhythm (rest-activity patterns through a 24-h period), sociability, location, physiological parameters, phone use and environmental features. Supplementary Tables 3–10 provide descriptions for each feature. Within each behavioural category, there are lower-level features, which group together several individual features as reported by each study. It was therefore possible for a single study to present multiple associations for the same feature. Associations that were significant at the p < 0.05 threshold are presented. Due to the heterogeneity of feature types, study designs and data reporting, we did not conduct a meta-analysis.
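The count-based synthesis described above can be sketched as a simple tally. The input format, the function name and the example p-values are hypothetical and are not data from the review; the sketch only shows how a single feature can accumulate several reported associations, each classified against the 0.05 threshold, without any weighting by sample size or study quality.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tally_associations(
    associations: List[Tuple[str, float]], alpha: float = 0.05
) -> Dict[str, Dict[str, int]]:
    """Count reported and significant associations per feature.

    associations: hypothetical (feature name, reported p-value) pairs;
                  one study may contribute several pairs for the same feature.
    """
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"significant": 0, "total": 0}
    )
    for feature, p_value in associations:
        counts[feature]["total"] += 1
        if p_value < alpha:
            counts[feature]["significant"] += 1
    return dict(counts)


# Illustrative values only.
reported = [
    ("homestay", 0.01),
    ("homestay", 0.20),
    ("entropy", 0.03),
    ("entropy", 0.04),
]
summary = tally_associations(reported)
print(summary)
```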
  65 in total

1.  A rating scale for depression.

Authors:  M HAMILTON
Journal:  J Neurol Neurosurg Psychiatry       Date:  1960-02       Impact factor: 10.154

2.  The PHQ-9: validity of a brief depression severity measure.

Authors:  K Kroenke; R L Spitzer; J B Williams
Journal:  J Gen Intern Med       Date:  2001-09       Impact factor: 5.128

3.  [Complete recovery from depression is the exception rather than the rule: prognosis of depression beyond diagnostic boundaries].

Authors:  Josine E Verhoeven; Judith Verduijn; Robert A Schoevers; Albert M van Hemert; Aartjan T F Beekman; Brenda W J H Penninx
Journal:  Ned Tijdschr Geneeskd       Date:  2018-09-06

4.  Differences in psychomotor activity in patients suffering from unipolar and bipolar affective disorder in the remitted or mild/moderate depressive state.

Authors:  Maria Faurholt-Jepsen; Søren Brage; Maj Vinberg; Ellen Margrethe Christensen; Ulla Knorr; Hans Mørch Jensen; Lars Vedel Kessing
Journal:  J Affect Disord       Date:  2012-03-03       Impact factor: 4.839

5.  Harnessing context sensing to develop a mobile intervention for depression.

Authors:  Michelle Nicole Burns; Mark Begale; Jennifer Duffecy; Darren Gergle; Chris J Karr; Emily Giangrande; David C Mohr
Journal:  J Med Internet Res       Date:  2011-08-12       Impact factor: 5.428

6.  Mobile Phone Sensor Correlates of Depressive Symptom Severity in Daily-Life Behavior: An Exploratory Study.

Authors:  Sohrab Saeb; Mi Zhang; Christopher J Karr; Stephen M Schueller; Marya E Corden; Konrad P Kording; David C Mohr
Journal:  J Med Internet Res       Date:  2015-07-15       Impact factor: 5.428

7.  A cross-sectional study of the association between mobile phone use and symptoms of ill health.

Authors:  Yong Min Cho; Hee Jin Lim; Hoon Jang; Kyunghee Kim; Jae Wook Choi; Chol Shin; Seung Ku Lee; Jong Hwa Kwon; Nam Kim
Journal:  Environ Health Toxicol       Date:  2016-10-26

Review 8.  Smartphone-Based Monitoring of Objective and Subjective Data in Affective Disorders: Where Are We and Where Are We Going? Systematic Review.

Authors:  Ezgi Dogan; Christian Sander; Xenija Wagner; Ulrich Hegerl; Elisabeth Kohls
Journal:  J Med Internet Res       Date:  2017-07-24       Impact factor: 5.428

9.  Sleep, circadian rhythm, and physical activity patterns in depressive and anxiety disorders: A 2-week ambulatory assessment study.

Authors:  Sonia Difrancesco; Femke Lamers; Harriëtte Riese; Kathleen R Merikangas; Aartjan T F Beekman; Albert M van Hemert; Robert A Schoevers; Brenda W J H Penninx
Journal:  Depress Anxiety       Date:  2019-07-26       Impact factor: 6.505

Review 10.  Correlations Between Objective Behavioral Features Collected From Mobile and Wearable Devices and Depressive Mood Symptoms in Patients With Affective Disorders: Systematic Review.

Authors:  Darius A Rohani; Maria Faurholt-Jepsen; Lars Vedel Kessing; Jakob E Bardram
Journal:  JMIR Mhealth Uhealth       Date:  2018-08-13       Impact factor: 4.773
