Ravi Prasad Thati1, Abhishek Singh Dhadwal1, Praveen Kumar1, Sainaba P2.
Abstract
Depression has become a global concern, and the COVID-19 pandemic has caused a large surge in its incidence. Broadly, there are two primary methods of detecting depression: task-based and Mobile Crowd Sensing (MCS) based methods. These two approaches, when integrated, can complement each other. This paper proposes a novel approach for depression detection that combines real-time MCS and task-based mechanisms. We aim to design an end-to-end machine learning pipeline, which involves multimodal data collection, feature extraction, feature selection, fusion, and classification to distinguish between depressed and non-depressed subjects. For this purpose, we created a real-world dataset of depressed and non-depressed subjects. We experimented with various features from multiple modalities, feature selection techniques, fused features, and machine learning classifiers such as Logistic Regression and Support Vector Machines (SVM) for classification. Our findings suggest that combining features from multiple modalities performs better than any single data modality, and the best classification accuracy is achieved when features from all three data modalities are fused. The feature selection method based on Pearson's correlation coefficients improved accuracy compared with the other methods, and SVM yielded the best accuracy of 86%. Our proposed approach was also applied to a benchmark dataset, and the results demonstrated that the multimodal approach performs favourably compared with state-of-the-art depression recognition techniques.
Keywords: Depression detection; Emotion elicitation; Machine learning; Mobile crowd sensing; Multi-modal; Speech elicitation
Year: 2022 PMID: 35431608 PMCID: PMC9000000 DOI: 10.1007/s11042-022-12315-2
Source DB: PubMed Journal: Multimed Tools Appl ISSN: 1380-7501 Impact factor: 2.757
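The end-to-end pipeline the abstract describes (per-modality feature extraction, early fusion by concatenation, then classification) can be sketched as below. This is a minimal illustration, not the authors' code: the cohort size, random feature values, and RBF kernel are assumptions; only the per-modality feature counts (53 smartphone, 415 visual, 82 audio) follow the paper's tables.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_subjects = 40  # hypothetical cohort size

# Hypothetical per-modality feature matrices; the feature counts
# (53 smartphone, 415 visual, 82 audio) follow the paper's tables.
X_phone = rng.normal(size=(n_subjects, 53))
X_visual = rng.normal(size=(n_subjects, 415))
X_audio = rng.normal(size=(n_subjects, 82))
y = rng.integers(0, 2, size=n_subjects)  # 1 = depressed, 0 = non-depressed

# Early fusion: concatenate the modality feature vectors per subject.
X_fused = np.hstack([X_phone, X_visual, X_audio])
print(X_fused.shape)  # (40, 550), matching the "all modalities" feature width

# Normalise, then classify; SVM gave the best accuracy in the paper.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X_fused, y, cv=5)
```

The same scaffold accepts any of the classifiers the paper compares (LR, DT, NB, RF, SVM) by swapping the final pipeline step.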
Few standard self-reports

| Self-report questionnaire | No. of questions | Sample contents of the questionnaire | Categories of depression |
|---|---|---|---|
| Patient Health Questionnaire (PHQ-9) | 9 | Sleep difficulties, excessive guilt, fatigue, suicidal ideation | Mild, moderate, moderately severe, and severe |
| Beck Depression Inventory (BDI-II) | 21 | Mood, self-hate, social withdrawal, fatigability | Minimal, mild, moderate, and severe depression |
| Hamilton Rating Scale for Depression (HRS-D) | 17 | Loss of interest, agitation, mood, loss of weight | Normal, mild, moderate, and severe depression |
| Quick Inventory of Depressive Symptomatology (QIDS) | 16 | Concentration, suicidal ideation, sleep disturbance, self-criticism | Normal, mild, moderate, and severe depression |
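For illustration, PHQ-9 scoring sums the nine items (each rated 0-3) and maps the total to a severity band using the standard cutoffs of 5, 10, 15, and 20; the function name is illustrative.

```python
def phq9_severity(item_scores):
    """Map nine PHQ-9 item scores (each 0-3) to a severity category.

    Uses the standard PHQ-9 cutoffs: 0-4 minimal, 5-9 mild, 10-14 moderate,
    15-19 moderately severe, 20-27 severe.
    """
    assert len(item_scores) == 9 and all(0 <= s <= 3 for s in item_scores)
    total = sum(item_scores)
    for cutoff, label in [(20, "severe"), (15, "moderately severe"),
                          (10, "moderate"), (5, "mild")]:
        if total >= cutoff:
            return label
    return "minimal"

print(phq9_severity([1, 1, 2, 1, 1, 1, 2, 1, 1]))  # total 11 -> "moderate"
```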
Fig. 1 The overall architecture of the proposed approach
Experimental procedure with time duration to conduct emotion and speech elicitation
| Experimental task | Procedure | Description (source) | Duration |
|---|---|---|---|
| Emotion Elicitation | Blank screen | NA | 1 minute |
| | Positive video | The Circus (1928) / Charlie Chaplin, a well-known comedian, performs hilarious acts when he enters a lion cage | 3:32 minutes |
| | Blank screen | NA | 1 minute |
| | Neutral video | Abstract shapes / colour bars | 3 minutes |
| | Blank screen | NA | 1 minute |
| | Negative video | The Champ (1979) / A little boy cries while his father is on his deathbed | 3 minutes |
| Break | Blank screen | NA | 10 minutes |
| Speech Elicitation | Passage reading | Short tale called “The North and the South Wind” | 1 minute |
| | Free-form speech | Participant’s choice from a list that appears on the monitor | 2 minutes |
A subset of the data items collected from the participant’s smartphone for the study conducted
| Data collected | Probe used | Listening/polling | Interval |
|---|---|---|---|
| Acceleration | Accelerometer | Listening | 1 reading / second |
| Application usage statistics | ApplicationUsageStats | Polling | 1 reading / 15 min |
| Brightness | LightDatum | Listening | 1 reading / second |
| Bluetooth encounters | BluetoothDeviceProximityDatum | Polling | Scans and reads performed for 10 seconds each, at 30-second intervals |
| Gyroscope values | GyroscopeDatum | Listening | 1 reading / second |
| GPS/location | LocationDatum | Polling | 1 reading / 15 min |
| Screen unlocks | ScreenDatum | Polling | 1 reading / 30 seconds |
Fig. 2 Samples during elicitation methods: (a) Emotion elicitation: the participant’s facial cues are recorded while watching the neutral video; (b) Speech elicitation: the participant’s speech is recorded while reading the phonetically balanced paragraph
Smart phone usage features
| Parent feature | Description | Statistical features extracted | No. of features |
|---|---|---|---|
| Accelerometer probe | Measures acceleration (the rate of change of velocity); accelerometer magnitudes were approximated from the raw axis readings | Mean of the accelerometer magnitude | 1 |
| Gyroscope probe | Measures the orientation of the phone | Axis-wise variance of the entries | 3 |
| Application usage probe | App categories were extracted using their package references from the Play Store | Average number of hours per day spent on each application category by the user | 36 |
| Location probe | Raw readings were used to calculate location variance | Location variance | 6 |
| Bluetooth probe | Entries are grouped day-wise and the number of unique encounters is calculated using the Address entry | Day-wise mean, variance, and standard deviation of the entries | 3 |
| Light (brightness) probe | Measures the illumination at the device (user); brightness probe readings are used | Mean, variance, and standard deviation of the readings | 3 |
| Screen unlock probe | Entries were divided on the basis of the binary readings provided by the probe | Percentage of entries where the screen_on entry is True, relative to the total number of entries | 1 |
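The accelerometer feature above (mean of the magnitude) can be sketched as follows. Since the exact equation is truncated in the source, this assumes the conventional Euclidean magnitude over the three axes; the function name is illustrative.

```python
import numpy as np

def accel_magnitude_mean(readings):
    """Mean accelerometer magnitude over a window of (x, y, z) readings."""
    xyz = np.asarray(readings, dtype=float)   # shape (n_samples, 3)
    magnitudes = np.linalg.norm(xyz, axis=1)  # sqrt(x^2 + y^2 + z^2) per sample
    return float(magnitudes.mean())

# Three hypothetical 1 Hz readings (m/s^2); a (3, 4, 0) reading has magnitude 5
print(accel_magnitude_mean([(0, 0, 9.8), (0, 0, 9.8), (3, 4, 0)]))
```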
Fig. 3 Visual feature extraction: (a) visualization of the 68 facial landmark location coordinates; (b) examples of a few action units extracted from the Cohn-Kanade database [25]
Fig. 4 Representation of geometrical features using facial landmark locations
Summary of the facial features
| Feature category | Feature name | Description | Statistical features extracted | No. of features |
|---|---|---|---|---|
| Geometrical features | Displacement features | Displacements computed from facial landmark locations | Mean, median, minimum, maximum, kurtosis, mode, standard deviation, root mean square, and skewness for each of the 6 displacement points (dp1 to dp6) | 54 |
| | Distance features | Distances computed from facial landmark locations | The same nine statistics for each of the 8 distances (d0 to d7) | 72 |
| | Region units | Areas of the mouth, the left eye, and the right eye | The same nine statistics for each of the three areas (A0, A1, and A2) | 27 |
| Facial Action Unit features | Action unit features | The Facial Action Coding System is used to quantify muscle movements on the face; AU occurrence is coded present (1) or absent (0) for 18 AUs | Mean, median, standard deviation, kurtosis, mode, root mean square, and skewness for each of the 18 AU occurrences | 126 |
| | | If present, AU intensities for 17 AUs | Mean, median, standard deviation, maximum, kurtosis, mode, root mean square, and skewness for each of the 17 AU intensities | 136 |
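Each geometric signal in the table is summarised by nine statistics (mean, median, minimum, maximum, kurtosis, mode, standard deviation, root mean square, skewness), which matches the feature counts (e.g. 9 statistics × 8 distances = 72). A minimal NumPy sketch using the standard population formulas for sample skewness and excess kurtosis:

```python
from collections import Counter
import numpy as np

def nine_stats(signal):
    """Mean, median, min, max, excess kurtosis, mode, standard deviation,
    root mean square, and skewness of a per-frame signal."""
    x = np.asarray(signal, dtype=float)
    mu, sd = x.mean(), x.std()
    z3 = ((x - mu) ** 3).mean()
    z4 = ((x - mu) ** 4).mean()
    return {
        "mean": mu,
        "median": float(np.median(x)),
        "min": x.min(),
        "max": x.max(),
        "kurtosis": z4 / sd**4 - 3.0,  # excess kurtosis
        "mode": Counter(x.tolist()).most_common(1)[0][0],
        "std": sd,
        "rms": float(np.sqrt((x**2).mean())),
        "skewness": z3 / sd**3,
    }

# One hypothetical distance signal (d0) sampled over five frames
feats = nine_stats([1.0, 2.0, 2.0, 3.0, 4.0])
print(len(feats))  # 9 statistics per signal; 8 distances -> 72 features
```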
Audio features
| Feature name | Description | Statistical features extracted | No. of features |
|---|---|---|---|
| Pitch | It is an approximation of the quasi-periodic rate of vibrations per speech cycle. | mean, median, standard deviation, minimum, mode maximum, kurtosis, Root mean square, skewness | 9 |
| Intensity | It is the measure of the perceived loudness. | mean, median, standard deviation, minimum, mode maximum, kurtosis, Root mean square, skewness | 9 |
| Formants [F1, F2, F3, F4] | They indicate the resonant frequencies of the vocal tract. The formant with the lowest frequency band is F1, followed by F2, with formants spaced at roughly 1000 Hz intervals. | mean, median, standard deviation, minimum, maximum, kurtosis, mode, Root mean square, skewness | 36 |
| Pulses | A fundamental, audible, and steady beat in the voice. | Count, Mean, standard deviation, variance | 4 |
| Amplitude | It is the size of the oscillations of the vocal folds due to vibrations caused by speech biosignal. | minimum, maximum, mean, Root mean square | 4 |
| Mean Absolute jitter | It is the absolute difference between consecutive vocal periods, divided by the mean vocal period. | Mean | 1 |
| Jitter (local, absolute) | The absolute difference between consecutive periods, in seconds. | Mean | 1 |
| Relative average perturbation jitter | It measures the effects of long-term pitch changes like slow rise/fall in pitch. It is calculated as the average absolute difference between a period and its average and its 2 neighbours, divided by the mean period. | Mean | 1 |
| 5-point period perturbation Jitter | It is calculated using the average absolute difference between a period and the average of it and its 4 closest neighbours, divided by the mean period. | Mean | 1 |
| Mean absolute differences Jitter | It is the absolute difference between consecutive differences between consecutive periods, divided by the mean period | Mean | 1 |
| Shimmer | It defines the short-term (cycle-to-cycle) tiny fluctuations in the amplitude of the waveform which reflects inherent resistance/noise in the voice biosignal. | Mean | 1 |
| Mean Shimmer | Average absolute difference between the amplitudes of consecutive periods, divided by the average amplitude. | Mean | 1 |
| Mean Shimmer dB | average absolute base-10 logarithm of the difference between the amplitudes of consecutive periods, multiplied by 20. | Mean | 1 |
| 3-point Amplitude Perturbation Quotient Shimmer | It is calculated as the average absolute difference between the amplitude of a vocal period and the average of the amplitudes of its neighbours, divided by the average amplitude. | Mean | 1 |
| 5-point Amplitude Perturbation Quotient Shimmer | It is the average absolute difference between the amplitude of a vocal period and the average of the amplitudes of it and its 4 closest neighbours, divided by the average amplitude. | Mean | 1 |
| 11-point Amplitude Perturbation Quotient Shimmer | It is the average absolute difference between the amplitude of a vocal period and the average of the amplitudes of it and its 10 closest neighbours, divided by the average amplitude | Mean | 1 |
| Mean absolute differences shimmer | Average absolute difference between consecutive differences between the amplitudes of consecutive periods. | Mean | 1 |
| Harmonicity of the voiced parts only | It is used for measuring the repeating patterns in voiced speech signals. | Mean | 1 |
| Mean autocorrelation | It is used for measuring the repeating patterns in the speech signal. | Mean | 1 |
| Mean harmonics-to-noise ratio | It is a measure which gives the relationship between the periodic and additive noise components of the speech signal. | Mean | 1 |
| Mean noise-to-harmonics ratio | It is a measure which gives the relationship between the periodic and additive noise components of the speech signal. | Mean | 1 |
| Fraction of locally unvoiced frames | It is a fraction of pitch frames analysed as unvoiced pitch (75Hz) frames in a speech biosignal of a specified length. | Mean | 1 |
| Number of voice breaks | The number of distances between consecutive vocal pulses that are longer than 1.25 divided by the pitch floor. Hence, if the pitch floor is 75 Hz, all inter-pulse intervals longer than 16.6667 ms are counted as voice breaks. | Count | 1 |
| Degree of voice breaks | This measure is the total duration of breaks between the voiced parts of the speech signal. | Mean | 1 |
| Total energy | Total energy of a vocal signal in air. | Mean | 1 |
| Mean power | The mean power of a speech signal in air. | Mean | 1 |
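Several of the perturbation measures above follow simple formulas. For instance, local jitter and local shimmer as defined in the table can be sketched as follows; the function names and example values are illustrative.

```python
import numpy as np

def local_jitter(periods):
    """Relative local jitter: mean absolute difference between
    consecutive glottal periods, divided by the mean period."""
    p = np.asarray(periods, dtype=float)
    return float(np.abs(np.diff(p)).mean() / p.mean())

def local_shimmer(amplitudes):
    """Relative local shimmer: mean absolute difference between the
    amplitudes of consecutive periods, divided by the mean amplitude."""
    a = np.asarray(amplitudes, dtype=float)
    return float(np.abs(np.diff(a)).mean() / a.mean())

# Hypothetical glottal periods (seconds) around a 100 Hz fundamental
print(local_jitter([0.0100, 0.0102, 0.0099, 0.0101]))
```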
Fig. 5 Visualization of three kinds of correlation: (a) positive correlation, (b) negative correlation, and (c) no correlation
Fig. 12 Summary of the investigated system configuration: (a) feature preparation steps for the smartphone usage and audio-visual modalities; (b) individual and feature-fusion techniques investigated using the normalised feature vectors of the different modalities
The top 10 positively correlated features with their description and r value (strength and direction of correlation)

| S.no | Feature name | Description | r value |
|---|---|---|---|
| 1 | AU 12 standard deviation (A_12_S) | Lip corner puller intensity Standard deviation | 0.64277 |
| 2 | AU 12 root mean square (A_12_R) | Lip corner puller intensity root mean square | 0.62708 |
| 3 | AU 12 maximum (A_12_M) | Lip corner puller intensity maximum | 0.562378 |
| 4 | AU 12 mean (A_12_MN) | Lip corner puller intensity mean | 0.51846 |
| 5 | AU 10 standard deviation (A_10_S) | Upper lip raiser standard deviation | 0.51244 |
| 6 | AU 06 maximum (A_6_M) | Cheek Raiser maximum | 0.49279 |
| 7 | AU 25 root mean square(A_25_R) | Lips part root mean square | 0.48731 |
| 8 | AU 25 count (A_25_C) | Lips part count | 0.48315 |
| 9 | AU 25 mean (A_25_M) | Lips part mean | 0.47884 |
| 10 | AU 06 standard deviation (A_6_S) | Cheek raiser standard deviation | 0.473363 |
The top 10 negatively correlated features with their description and r value (strength and direction of correlation)

| S.no | Feature name | Description | r value |
|---|---|---|---|
| 1 | AU 25 skewness (A_25_S) | Lips part count skewness | -0.44741 |
| 2 | Fraction of locally unvoiced frames (F_L_U) | It is a fraction of pitch frames analyzed as unvoiced pitch (pitch is 75Hz) frames in a voice. | -0.3891 |
| 3 | Degree of voice breaks (D_V_B) | This measure is the total duration of breaks between the voiced parts of the speech signal | -0.3784 |
| 4 | AU 10 skewness (A_10_SK) | Upper lip raiser skewness | -0.36990 |
| 5 | AU 09 skewness (A_9_S) | Nose wrinkle skewness | -0.34867 |
| 6 | AU 25 kurtosis (A_25_K) | Lips part kurtosis | -0.34069 |
| 7 | Pitch Skewness (P_SK) | It is pitch’s skewness | -0.3217 |
| 8 | AU 12 skewness (A_12_SK) | Lip corner puller skewness | -0.3195 |
| 9 | Shimmer APQ3 (SAQ) | It is the average absolute difference between the amplitude of a vocal period and the average of the amplitudes of it and its 2 closest neighbours, divided by the average amplitude. | -0.312 |
| 10 | Mean absolute differences shimmer(MDS) | Average absolute difference between consecutive differences between the amplitudes of consecutive periods. | -0.3129 |
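The Pearson-based selection behind these rankings can be sketched as computing r between each feature column and the binary depressed/non-depressed label and keeping the features with the largest |r|. The toy data below is purely illustrative:

```python
import numpy as np

def pearson_select(X, y, top_k):
    """Rank features by |Pearson r| between each feature column and the
    binary depression label; keep the top_k feature indices."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.argsort(-np.abs(r))[:top_k]
    return keep, r[keep]

# Toy data: the third column tracks the label exactly, the others do not.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([
    np.array([1, 2, 1, 2, 1, 2, 1, 2], dtype=float),
    np.array([5, 5, 4, 4, 5, 5, 4, 4], dtype=float),
    y.astype(float),
])
keep, r = pearson_select(X, y, top_k=1)
print(int(keep[0]), round(float(r[0]), 2))  # 2 1.0
```

With a binary label, this r is the point-biserial correlation, so ranking by |r| favours features whose distributions separate the two groups.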
Fig. 6 Variations of the top 5 positively correlated features
Fig. 7 Variations of the top 5 negatively correlated features
Fig. 8 Participant-wise variations in the top 10 positively correlated features. Red and blue lines indicate the depressed and non-depressed participants, respectively
Fig. 9 Participant-wise variations in the top 10 negatively correlated features. Red and blue lines indicate the depressed and non-depressed participants, respectively
Fig. 10 Variation of a positively correlated single feature across all participants
Fig. 11 Variation of a negatively correlated single feature across all participants
Average accuracy classification results for individual modalities
| S.no. | Method | Classifier | Smartphone # | Smartphone Acc | Visual # | Visual Acc | Audio # | Audio Acc | Average Acc |
|---|---|---|---|---|---|---|---|---|---|
| 1 | All features | LR | 53 | 61 | 415 | 70 | 82 | 60 | 64 |
| 2 | | DT | | 68 | | 77 | | 55 | 67 |
| 3 | | NB | | 62 | | 78 | | 68 | 69 |
| 4 | | RF | | 69 | | 79 | | 67 | 72 |
| 5 | | SVM | | 65 | | 80 | | 60 | 68 |
| 6 | Pearson correlation reduced feature vector | LR | 45 | 60 | 166 | 79 | 57 | 67 | 69 |
| 7 | | DT | | 50 | | 80 | | 60 | 63 |
| 8 | | NB | | 58 | | 66 | | 61 | 62 |
| 9 | | RF | | 50 | | 80 | | 68 | 66 |
| 10 | | SVM | | 55 | | 80 | | 72 | 69 |
| 11 | PCA | LR | 28-30 | 66 | 40-42 | 80 | 20-22 | 69 | 72 |
| 12 | | DT | | 68 | | 69 | | 52 | 63 |
| 13 | | NB | | 66 | | 72 | | 50 | 63 |
| 14 | | RF | | 66 | | 79 | | 50 | 65 |
| 15 | | SVM | | 69 | | 80 | | 69 | 73 |
| | Individual modality average | | | 62 | | 77 | | 62 | |

# = number of features in the resultant feature vector; Acc = accuracy. The average accuracy is the row average across modalities for each ML classifier within a method; the individual modality average is the column average for each modality.
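The PCA-based reduction used in these experiments (the fused-modality results name a 95%-of-variance criterion) can be sketched with scikit-learn, where a fractional n_components keeps the smallest number of components reaching that variance threshold. The data below is a random placeholder, not the study's features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 415))  # placeholder matrix: 40 subjects, 415 features

# Standardise, then keep the smallest number of principal components
# whose cumulative explained variance reaches 95%.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)  # far fewer than 415 columns remain
```

With few subjects and many features, the component count is bounded by the sample count, which is consistent with the small reduced dimensionalities (e.g. 40-42) reported in the tables.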
Average Accuracy classification results for fused modalities
| S.no. | Method | Classifier | Smartphone+Audio # | Smartphone+Audio Acc | Smartphone+Video # | Smartphone+Video Acc | Video+Audio # | Video+Audio Acc | Method average Acc | All modalities # | All modalities Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Concatenate all features | LR | 135 | 82 | 468 | 78 | 497 | 81 | 80 | 550 | |
| 2 | | DT | | 81 | | 78 | | 80 | 80 | | 80 |
| 3 | | NB | | 82 | | 72 | | 75 | 76 | | 80 |
| 4 | | RF | | 79 | | 83 | | 83 | 82 | | 83 |
| 5 | | SVM | | 81 | | 79 | | 83 | 81 | | |
| 6 | Concatenate Pearson correlation reduced feature vectors | LR | 101 | 80 | 209 | 81 | 220 | 79 | 80 | 265 | 79 |
| 7 | | DT | | 81 | | 79 | | 82 | 81 | | 80 |
| 8 | | NB | | 80 | | 81 | | 82 | 81 | | |
| 9 | | RF | | 85 | | 80 | | 80 | 82 | | |
| 10 | | SVM | | 83 | | 83 | | 84 | 83 | | |
| 11 | 95% of variance of PCA over concatenated feature vectors | LR | 40-42 | 75 | 30-32 | 79 | 40-42 | 79 | 78 | 50-55 | 78 |
| 12 | | DT | | 70 | | 62 | | 60 | 78 | | 65 |
| 13 | | NB | | 79 | | 72 | | 79 | 77 | | 74 |
| 14 | | RF | | 83 | | 73 | | 65 | 74 | | |
| 15 | | SVM | | 81 | | 79 | | 80 | 80 | | |
| | Fused modalities average | | | 80 | | 77 | | 78 | | | |

# = number of features in the resultant feature vector; Acc = accuracy. The method average is the row average across modality combinations for each ML classifier; the fused modalities average is the column average for each modality combination. Bold: fused modalities that performed well compared with the method average.
Results of proposed approach on DAIC Dataset
| S.no. | ML classifier | Fused modalities | All features # | All features Acc | Pearson reduced # | Pearson reduced Acc | PCA # | PCA Acc |
|---|---|---|---|---|---|---|---|---|
| 1 | Logistic Regression | Audio | 82 | 70 | 50 | 81 | 20-25 | 68 |
| 2 | | Video | 230 | 81 | 72 | 83 | 20-25 | 80 |
| 3 | | Video + Audio | 312 | 83 | 122 | | 30-35 | 83 |
| 4 | Decision Tree | Audio | 82 | 62 | 50 | 71 | 20-25 | 62 |
| 5 | | Video | 230 | 80 | 72 | 80 | 20-25 | 82 |
| 6 | | Video + Audio | 312 | 80 | 122 | 82 | 30-35 | 82 |
| 7 | Naive Bayes | Audio | 82 | 55 | 50 | 70 | 20-25 | 70 |
| 8 | | Video | 230 | 80 | 72 | 75 | 20-25 | 80 |
| 9 | | Video + Audio | 312 | 80 | 122 | 82 | 30-35 | 80 |
| 10 | Random Forest | Audio | 82 | 74 | 50 | 66 | 20-25 | 64 |
| 11 | | Video | 230 | 85 | 72 | 80 | 20-25 | 81 |
| 12 | | Video + Audio | 312 | 85 | 122 | 85 | 30-35 | 81 |
| 13 | Support Vector Machines | Audio | 82 | 74 | 50 | 68 | 20-25 | 67 |
| 14 | | Video | 230 | 85 | 72 | 83 | 20-25 | 82 |
| 15 | | Video + Audio | 312 | 85 | 122 | | 30-35 | 83 |

# = number of features in the feature vector; Acc = accuracy. Bold: best accuracies obtained. Note: the DAIC dataset does not contain all the low-level OpenFace feature sets; hence, the statistical feature vectors were extracted from the available low-level features of the DAIC dataset.
Fig. 13 Comparison of accuracies across ML classifiers with feature selection methods (all features, Pearson’s correlation analysis, and PCA) using both audio and video
Fig. 14 ROC curves of the ML classifiers
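ROC curves like those in Fig. 14 can be produced for any classifier that outputs decision scores; the following is a minimal scikit-learn sketch with hypothetical labels and scores, not the study's data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and decision scores on held-out subjects
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under the curve
print(round(float(auc), 4))
```

Plotting tpr against fpr yields the curve; the AUC summarises how well the scores rank depressed above non-depressed subjects across all thresholds.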