Sandeep Kumar Pandey1, Hanumant Singh Shekhawat1, S R M Prasanna2, Shalendar Bhasin3, Ravi Jasuja3,4. 1. Electronics and Electrical Engineering Dept, Indian Institute of Technology Guwahati, Assam, India. 2. Electrical Engineering Dept, Indian Institute of Technology Dharwad, Dharwad, Karnataka, India. 3. Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, United States of America. 4. Function Promoting Therapies, Waltham, MA, United States of America.
Abstract
Depression is one of the most significant mental health issues, affecting all age groups globally. While it has been widely recognized as one of the major disease burdens in populations, complexities in definitive diagnosis present a major challenge. Usually, trained psychologists use conventional methods, including individualized interview assessment and manually administered PHQ-8 scoring. However, heterogeneity in symptomatic presentations, which span somatic to affective complaints, imparts substantial subjectivity to the diagnosis. Diagnostic accuracy is further limited by the cross-sectional nature of sporadic assessments during physician-office visits, especially since depressive symptoms and their severity may evolve over time. With the widespread acceptance of smart wearable devices and smartphones, passive monitoring of depression traits using behavioral signals such as speech presents a unique opportunity for companion diagnostics that assist trained clinicians in objective assessment over time. Therefore, we propose a framework for automated depression classification that leverages alterations in speech patterns in the well-documented and extensively studied DAIC-WOZ depression dataset. This novel tensor-based approach requires a substantially simpler implementation architecture and extracts discriminative features for depression recognition with high F1 score and accuracy. We posit that such algorithms, which impose a significantly lower compute load, would allow effective onboard deployment in wearables to improve diagnostic accuracy and enable real-time monitoring of depressive disorders.
1 Introduction
Depression is a mental health issue often characterized by low mood, sadness, negative thoughts, and loss of interest in day-to-day activities, and it is often associated with an individual’s inability to cope with stressful events [1]. According to a report by the World Health Organization (WHO), clinical depression is one of the primary causes of disability [2]. Also termed Major Depressive Disorder (MDD), depression increases an individual’s risk of suicidal ideation [3]. Several studies in recent years have shown that people who commit suicide often meet the criteria for a clinical diagnosis of depressive illness [4, 5]. Depression is among the most treatable of mental disorders: between 80% and 90% of people with depression eventually respond well to treatment, and almost all patients gain some relief from their symptoms. However, definitive diagnosis during sporadic visits to the treating psychologist presents a challenge, since MDD presentation evolves over time and a cross-sectional assessment alone has limited diagnostic accuracy. Accordingly, diagnostic frameworks that could passively assist in the diagnosis and management of clinical depression address a substantial unmet need. The current standard of care for the diagnosis of clinical depression involves clinical interviews by psychologists and administration of standard instruments such as the Hamilton Rating Scale and the PHQ-8 rating system, which classify symptomatic presentation through a depression score for the individual [6, 7]. However, this method is subjective and time-consuming. These extant methods rely primarily on self-report measures during interviews, when the depressive behavior has already been manifested. By design, the prevailing methods are not amenable to proactive, unobtrusive monitoring to prevent an individual’s progression into a depressed state. Additionally, the reliance on a psychologist’s ability to deem someone depressed or not is susceptible to the individual clinician’s appraisal bias. 
Non-intrusive monitoring through wearables and embedded classification algorithms presents an exciting opportunity to mitigate clinicians’ subjective bias and provide a proactive, companion diagnostic framework. These longitudinal assessments can also be effectively integrated with various serum biomarkers such as lower serotonin levels [8], impaired functioning of the neurotransmitter gamma-aminobutyric acid (GABA), etc., which have been shown to be strong correlates of mental health-related indications [9]. However, these invasive biomarkers are not frequently monitored prior to explicit evidence of depressive disorder. We posit that recognition of depression and of its progression in individuals has to be proactive and multifactorial, such that subjectivity in the physician’s assessment can be reduced through high-fidelity, data-driven algorithmic insights. Several research groups have begun to make headway in studies involving depression recognition based on speech signals [1], eye movements [10], facial activity [11], gesturing [12], slumped posture [13], etc. These markers help in the automatic diagnosis of alterations in depressive states without intruding into the patient’s activities of daily living. They can be employed in smart devices such as smartwatches and smartphones to continuously monitor the individual’s mental state.
Depression recognition from behavioural signals such as speech, facial expressions, etc., has fostered interdisciplinary effort from research teams due to its challenging and complex physiological presentation. Several studies have pursued feature extraction and learning strategies for depression recognition from speech. For instance, Alghowinem et al. [14] investigated the effect of segment-level as well as prosodic features on the classification of depressed speech from normal controls. 
The authors pointed out that statistical functionals computed from low-level features lose information, resulting in inferior performance compared with segment-level features. Alghowinem et al. [15] also explored speech style as an aspect of depressed vs. normal speech, with gender classification as a precursor to improving the recognition performance. Several speech features such as MFCC, intensity, and energy features were found to be significant when both male and female participants’ speech was considered. However, shimmer and RMS energy features were prominent for female-only depression classification, and voice quality was a stratifying marker for the male participants only. An investigation of temporal features revealed that the response time and average syllable duration were longer in depressed subjects. In contrast, the interaction involvement and articulation rate were higher in healthy controls. Another study, reported by Long et al. [16], examined several speech types (read speech, interviews, and picture description) and emotion types (positive, negative, and neutral) for their discriminative power in depressed versus normal speech classification. Experiments on a dataset of 74 subjects using an SVM classifier demonstrated that interview speech and neutral emotion contribute more towards recognition of depression from speech than other speech and emotion types. The study in [17] introduced a new dataset, PRIORI, collected from everyday smartphone conversation recordings, and used it to study the change of emotional activation and valence in the depressed and manic phases of Bipolar Disorder. Furthermore, in an independent study, Cummins et al. [18] investigated the effect of speaker normalization on depression classification performance, since mental-health disorders are highly speaker-specific and the speakers in the depressed and healthy control groups were different. 
Feature normalization for reducing speaker variability was shown to improve recognition performance when MFCC and formant-based features were used. All these techniques relied on hand-crafted features and traditional classifiers such as Gaussian Mixture Models (GMM), Support Vector Machines (SVM), etc., focusing on identifying a relevant feature set for robust classification of depressed speech from healthy controls.
Multimodal approaches using audio, text, and facial geometry features have also been investigated [19-23]. Alghowinem et al. investigated the fusion of information from speech, head pose, and eye gaze behaviors for depression/normal classification on a dataset of 30 depressed subjects and 30 healthy controls collected by the Black Dog Institute [19, 24]. The authors leveraged different feature selection and fusion techniques, and found that t-test based feature selection performed well for binary depression/normal classification. Moreover, the performance of each individual modality was also reported, with speech showing the maximum recognition accuracy of 83%, further strengthening the idea that speech alone contains sufficient information for robust depression recognition. Also, in [20], new video and text features were proposed, and a hybrid of deep and shallow networks was used for depression classification using audio, video, and text modalities. Individual modalities such as audio and video were modelled using a DCNN-DNN based system, while the text modality was modelled using a Paragraph Vector (PV) based SVM system. Moreover, in [22], an LSTM-based system was explored to simultaneously model depression from audio and text sequences without performing explicit topic modelling of the content of the interviews. 
Also addressing the AVEC 2016 depression sub-challenge, the work in [23] used an i-vector framework with MFCC features for audio data modelling, while geometrical features along with polynomial parametrization of facial landmarks were used in a late-fusion fashion for depression classification. From the recent literature on depression classification, it is evident that different combinations of modalities have been explored to demonstrate a robust system. However, another major observation that can be derived from such studies is the higher performance of the audio modality, which serves as a motivating factor to further explore audio-based depression recognition.
With progress in the deep learning field and increased computational efficiency, the dependence on hand-crafted features has reduced. Deep learning has facilitated efficient end-to-end modelling of complex paralinguistic phenomena that are difficult to assess using traditional techniques. Deep learning has been successfully applied to the automated diagnosis and modelling of conditions such as Bipolar Disorder [17], anxiety [25], Alzheimer’s dementia [26], clinical depression [27], etc. Much of the recent work has explored the use of time-frequency speech representations such as spectrograms and log-mel spectrograms as input to deep learning architectures for classifying depression from audio. Srimadhur et al. [28] investigated spectrograms as well as raw waveforms as input to a CNN-based network on a subset of the DAIC-WOZ dataset in a speaker-dependent fashion. Moreover, in the study by Ma et al. [29], a CNN-LSTM based architecture was explored that extracted discriminative features from mel-spectrograms using 1D convolution in the first layer. A random sampling strategy was also proposed to mitigate the data imbalance issue associated with the DAIC-WOZ dataset. Majority voting over the labels of the speech segments coming from an individual is used to predict depression for that individual. 
In a recent study by Vazquez-Romero et al. [30], an ensemble of 1D-CNN networks is used with mel-spectrograms as input features. The label for an individual is generated from the mean of the segment-level probabilities for each constituent network in the ensemble, and the ensemble labels are averaged to yield a final label for the individual. This ensemble technique demonstrated appreciable improvements in recognition performance over SVM classification of hand-crafted features and over other single deep learning-based networks.
Multiple instance learning (MIL) is an apt choice when a single label is available for a group of utterances, as in the depression classification problem [31]. The majority of the approaches in the literature exploiting the MIL architecture work by generating labels for individual segments and averaging them to yield a final label for the whole utterance. This is done using a network that shares parameters across all the segments of an utterance [32, 33]. However, the inherent problem with the MIL framework for depression classification is that not all the segments of the utterance exhibit depression-related characteristics, with the majority of the segments being in a neutral emotional state. As such, false labels are predicted quite often because neutral-state segments form the majority.
Motivated by these limitations of the extant modelling methodologies, we developed a tensor-based approach to extract shared and discriminative features from multiple segments of an utterance. Tensor factorizations provide a natural method for analyzing common information spread across the modes of a tensor [34]. Utilizing this aspect, we use tensor factorization in conjunction with neural network-based learning to address multiple-instance learning in a novel framework. Furthermore, the utterance level tensor core generated by the feature extraction block is passed on to an attention mechanism to generate the utterance level attentive feature. 
Statistics pooling of the attentive representations is performed to extract bag-level features, which are classified using fully connected layers. This mitigates the dependence on average/max pooling of the output labels of individual segments for utterance level prediction, thus countering the inherent issue of traditional MIL frameworks. The proposed tensor-based MIL approach for depression classification outperforms several state-of-the-art methodologies and provides a promising avenue for robust depression classification from speech signals.
2 Materials and methods
2.1 Tensor preliminaries
We review the introductory multilinear algebra necessary to understand Tucker decomposition. A detailed, comprehensive review of tensor algebra can be found in [34, 35]. Following the notation used in the tensor literature, a vector is denoted by a lowercase letter (e.g. $a$), a matrix by an uppercase letter (e.g. $A$), and tensors of order three or more by calligraphic letters (e.g. $\mathcal{A}$).
Tensors are multidimensional arrays, e.g. $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, where $N$ is the number of modes in the tensor, also referred to as the order of the tensor. The modes may correspond to space, time, frequency, trials, utterances, etc., and $I_n$ specifies the dimensionality of the $n$th mode of the tensor $\mathcal{X}$. Tensor manipulation often requires reshaping to matrix form, and one particular reshaping is called mode-$n$ matricization or unfolding. For a third order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, mode-$n$ matricization is achieved by fixing one index and varying the other two. It is denoted by $X_{(n)}$, where the column vectors of $X_{(n)}$ are the mode-$n$ vectors of $\mathcal{X}$. For $N$ matrices, one corresponding to each mode, we denote each using a superscript in parentheses, for example $U^{(n)}$.
Mode-$n$ multiplication of a tensor $\mathcal{X}$ with a matrix $U$ is obtained by multiplying all the mode-$n$ vector fibers of the tensor with the matrix $U$. It is denoted as $\mathcal{Y} = \mathcal{X} \times_n U$, and in matrix form it can be written as
$$Y_{(n)} = U X_{(n)}.$$
A multilinear subspace requires the understanding of multilinear projections, as a tensor subspace is defined as a mapping from a high-dimensional space to a low-dimensional space [36]. Considering the general case of higher order tensors, an $N$th order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ resides in the tensor space $\mathbb{R}^{I_1} \otimes \mathbb{R}^{I_2} \otimes \cdots \otimes \mathbb{R}^{I_N}$, where $\mathbb{R}^{I_1}, \mathbb{R}^{I_2}, \ldots, \mathbb{R}^{I_N}$ denote real vector spaces and ⊗ represents the tensor outer product (for details see [34]). As such, the tensor space for $N$th order tensors consists of the outer product of $N$ vector spaces $\mathbb{R}^{I_n}$, $n \in \{1, 2, \cdots, N\}$. A tensor $\mathcal{X}$ can be projected onto a lower dimensional tensor $\mathcal{Y} \in \mathbb{R}^{P_1 \times P_2 \times \cdots \times P_N}$, where $P_n \le I_n$, using $N$ projection matrices $U^{(1)}, \ldots, U^{(N)}$, one corresponding to each mode of the tensor.
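The two operations reviewed above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the helper names `unfold` and `mode_n_product` are hypothetical, the toy sizes are arbitrary, and the column ordering of the unfolding follows NumPy's row-major reshape, which may differ from the ordering convention used in [34]. The matrix-form identity for the mode-n product holds regardless of that ordering.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization: the mode-n fibers become the columns of a matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_n_product(tensor, matrix, mode):
    """Multiply every mode-n fiber of `tensor` by `matrix` (shape P x I_n)."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))  # contracted mode comes out first
    return np.moveaxis(out, 0, mode)                    # move it back to position `mode`

# Toy third order tensor with I1 = 2, I2 = 3, I3 = 4 (arbitrary sizes).
X = np.arange(24, dtype=float).reshape(2, 3, 4)
U = np.ones((5, 3))                 # maps the second mode from size 3 to size 5
Y = mode_n_product(X, U, mode=1)

print(unfold(X, 1).shape)           # (3, 8)
print(Y.shape)                      # (2, 5, 4)
# Matrix form of the product: the unfolding of Y equals U times the unfolding of X.
print(np.allclose(unfold(Y, 1), U @ unfold(X, 1)))
```

Note that `mode` is zero-based in the code, so `mode=1` corresponds to the second mode of the tensor.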
2.1.1 Tucker decomposition
Tucker decomposition of a third order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ is defined as a multilinear transformation of a core tensor $\mathcal{G}$, generally small and dense, by the factor matrices corresponding to each mode of the tensor [34, 37]:
$$\mathcal{X} = \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)}.$$
Here, $U^{(1)}$, $U^{(2)}$ and $U^{(3)}$ correspond to the subspaces along mode-1, mode-2 and mode-3, respectively. The subspaces consist of the basis vectors obtained from matrix unfolding along each mode of the tensor. Tucker decomposition has the constraint of orthogonality and ordering on the core tensor and factor matrices, while other constraints such as non-negativity, sparsity, etc. can also be imposed.
A matrix representation of the Tucker decomposition, in the general case, can be achieved by matricizing $\mathcal{X}$ and $\mathcal{G}$ as [38]
$$X_{(n)} = U^{(n)} G_{(n)} \left( U^{(N)} \otimes \cdots \otimes U^{(n+1)} \otimes U^{(n-1)} \otimes \cdots \otimes U^{(1)} \right)^{\top},$$
where ⊗ denotes the Kronecker product. The decomposition can also be written as a linear combination of rank one tensors.
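As an illustration of the decomposition, the following sketch computes a Tucker decomposition with the truncated higher-order SVD (HOSVD). This is one standard way to obtain orthogonal factor matrices and a core tensor; it is an assumption made for illustration and not necessarily the procedure used in this work, where the factors are learned inside a neural network. The helper names and tensor sizes are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization (row-major column ordering)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_n_product(tensor, matrix, mode):
    """Mode-n product of a tensor with a matrix of shape (P, I_n)."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

def hosvd(tensor, ranks):
    """Truncated HOSVD: leading left singular vectors of each unfolding are the factors."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_n_product(core, u.T, mode)   # project each mode onto its subspace
    return core, factors

def reconstruct(core, factors):
    """Multiply the core by every factor matrix along its own mode."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_n_product(out, u, mode)
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))

# With full multilinear rank the decomposition reproduces X up to rounding error.
core, factors = hosvd(X, ranks=(4, 5, 6))
full_error = np.linalg.norm(X - reconstruct(core, factors))

# With smaller ranks the core is a compressed representation of X.
core_small, factors_small = hosvd(X, ranks=(2, 3, 3))
print(full_error < 1e-8, core_small.shape)
```

The small core plays the role of the compact feature tensor: each factor matrix spans a subspace of one mode, and the core holds the coordinates of the data in those subspaces.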
2.2 Dataset and preprocessing
For the task of depression classification from speech signals, we use the audio modality from the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ), which is a subset of the larger DAIC corpus [39] and was introduced in the Audio/Visual Emotion Challenge (AVEC) 2016 [40]. The dataset consists of clinical interviews conducted between a participant and a virtual interviewer, Ellie, which was controlled remotely by a human interviewer. The dataset was collected with the aim of augmenting the diagnosis of psychological conditions such as stress, anxiety, depression, etc., through automatic computer applications based on verbal and non-verbal indicators. It consists of audio, facial geometry features, and text transcriptions of the interviews. Table 1 shows the distribution of participants according to gender for the train, validation and test partitions. The dataset is recorded in English from a population of 189 subjects comprising 146 depressed subjects and 43 healthy controls. The duration of the audio ranges from 7 to 33 min (average 16 min). Each participant’s audio file has been given a PHQ-8 score by the psychologist, which denotes the severity of depression, from 0 (no depression) to 22 (severely depressed). A binary PHQ-8 score is also provided, which classifies participants as depressed or not depressed. Furthermore, the train-development-test split provided by the AVEC 2016 challenge divides the dataset into partitions comprising 118, 24, and 47 participants in the train, development, and test sets, respectively.
Table 1
Distribution of male and female participants across train, validation and test partitions of the DAIC-WOZ depression dataset.
Partition     Male    Female
Train         63      44
Validation    16      19
Test          23      24
Since the virtual interviewer’s speech is not a part of the analysis, a silence region-based segmentation technique from the Python library pyAudioAnalysis [41] is employed to segment out the participant’s speech and discard the speech segments from the virtual interviewer, as they do not contain any emotion information. Also, the speech segments produced are of different durations, and deep learning techniques such as CNN and TFNN [42] require equal-length input, so the speech segments are either zero-padded or truncated to a duration of 7 s. The sampling rate of the speech signal is 16 kHz.
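The length normalization described above amounts to simple arithmetic: 7 s at 16 kHz is 112000 samples. The sketch below is a minimal illustration in plain Python; the function names are hypothetical, and padding at the end of the segment (rather than the start or both sides) is an assumption, since the text only states that segments are zero-padded. The frame count additionally assumes the 512-sample shift reported in the experimental settings and a centred STFT framing convention, under which a 112000-sample segment yields 219 frames, in line with the 128 × 219 input size quoted later.

```python
SAMPLE_RATE = 16_000                          # Hz, as stated in the text
SEGMENT_SECONDS = 7                           # target duration, as stated in the text
TARGET_LEN = SAMPLE_RATE * SEGMENT_SECONDS    # 112000 samples
HOP_LENGTH = 512                              # shift reported in the experimental settings

def fix_length(samples, target_len=TARGET_LEN):
    """Truncate, or right-pad with zeros (assumed), to exactly target_len samples."""
    samples = list(samples[:target_len])
    return samples + [0.0] * (target_len - len(samples))

def n_frames_centered(n_samples, hop=HOP_LENGTH):
    """Frame count when frames are centred on multiples of `hop` (assumed convention)."""
    return 1 + n_samples // hop

short_segment = fix_length([0.1] * 50_000)    # shorter than 7 s: zero-padded
long_segment = fix_length([0.1] * 200_000)    # longer than 7 s: truncated
n_frames = n_frames_centered(TARGET_LEN)

print(len(short_segment), len(long_segment), n_frames)
```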
2.3 Methodology
This section discusses the Tensor Factorization-based Multiple-Instance Learning Technique, which is used for the classification of depression versus normal speech from multiple utterances of a single speaker. Furthermore, an utterance level attention followed by a statistics pooling layer [43] is employed to extract temporal features in the subsequent layers of the network. Moreover, a standard Multiple-Instance Learning (MIL) network based on Convolution layers is also discussed, which serves as a baseline for comparing results.
2.3.1 CNN and 2D TFNN based MIL framework
Multiple Instance Learning with CNN as a base architecture has been explored in many previous works [44, 45]. As such, we have used this architecture as a baseline in our work. The base CNN architecture comprises 3 feature learning blocks followed by vectorization of the deep features and classification using a sigmoid layer. Each feature learning block comprises a 2D convolution layer, a batch normalization layer, an activation layer, and a max-pooling layer. The convolution layer extracts local features with the help of trainable kernels. Batch normalization normalizes the features over the batch to zero mean and unit variance. The normalized features are passed through an activation function (ELU in our work). Finally, a max-pooling layer is employed to reduce the size of the feature maps obtained, keeping only the relevant information. Given a bag of utterances belonging to a speaker, the base CNN architecture is applied to each of the utterances to yield a label for each utterance. A global max-pooling of the labels yields the final label for the bag of utterances.
A 2D TFNN architecture [46] is also employed as a base network for MIL, similar to the CNN architecture. The 2D TFNN base receives mel spectrograms extracted from speech utterances as input. The factor matrices corresponding to the time and frequency modes extract the core feature tensor from the input tensors. Four consecutive Tensor FF layers yield the final feature tensor, which is then used to generate a class probability by taking an inner product with a weight matrix of the same dimensions as the feature tensor. Fig 1 shows the end-to-end system for the 2D CNN and 2D TFNN based MIL architectures.
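The bag-level decision in these baselines reduces to pooling the per-utterance outputs. The text mentions both global max-pooling and global average pooling, so the sketch below shows both. It is a minimal illustration with hypothetical probabilities, not the trained networks: `bag_probability` is an assumed helper name, and the instance probabilities would in practice come from the base CNN or 2D TFNN applied to each utterance.

```python
def bag_probability(instance_probs, pooling="max"):
    """Pool per-utterance probabilities into a single bag-level probability."""
    if not instance_probs:
        raise ValueError("a bag must contain at least one instance")
    if pooling == "max":
        return max(instance_probs)
    if pooling == "mean":
        return sum(instance_probs) / len(instance_probs)
    raise ValueError(f"unknown pooling: {pooling}")

# Hypothetical outputs for one bag: three neutral-sounding utterances and
# one utterance that the base network scores as strongly depressed.
probs = [0.10, 0.20, 0.15, 0.90]

p_max = bag_probability(probs, "max")     # dominated by the single confident instance
p_mean = bag_probability(probs, "mean")   # pulled down by the neutral majority
print(p_max, p_mean)
```

The example makes the trade-off concrete: max-pooling lets a single instance decide the bag label, while averaging can dilute a few informative instances among many neutral ones.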
Fig 1
MIL technique using CNN (top) and TFNN (bottom) as base architectures.
The input to the architectures is a 2D mel-spectrogram tensor generated from the speech utterances of a speaker. For the CNN architecture, a stack of three feature learning blocks followed by vectorization and dense layers is depicted. For TFNN, a stack of three 2D Tensor FF layers followed by a Tensor Sigmoid layer is depicted. Finally, global average pooling of utterance labels is shown.
2.3.2 3D TFNN architecture as feature extractor for MIL
The 3D TFNN architecture was introduced in [46] for emotion recognition from speech. The 3D TFNN serves as a natural framework for Multiple Instance Learning, as the core idea of tensor factorization is capturing the shared information across different modes of a tensor. As such, given a bag of utterances belonging to a speaker, the utterances are first converted to 2D speech representations such as mel-spectrograms of dimensions $I_1 \times I_2$. The mel-spectrograms of the utterances are stacked along the third dimension to form a 3D tensor of dimensions $I_1 \times I_2 \times I_3$ representing the bag of utterances. The 3D tensor is passed through successive Tensor Factorization layers to obtain the deep feature tensors. Finally, a tensor sigmoid layer, comprising a weight tensor of the same size as the deep feature tensor, is used to obtain the probability for the bag of utterances.
The 3D TFNN architecture for Multiple Instance Learning benefits from not repeating the same architecture individually on each utterance, as is done in conventional CNN-based MIL systems. Moreover, the probability generated by the 3D TFNN represents the entire bag, as opposed to CNN-based MIL, where a global max-pooling of the labels generates a bag-level label. This comes from the inherent capability of tensor factorization-based feature extraction: the shared information across the mel-spectrograms of an individual’s utterances is used to determine the label for that particular speaker. In contrast, the utterance level information is independent in conventional MIL systems, and no shared information across utterances is used. Fig 2 shows the proposed end-to-end tensor factorization based approach for MIL.
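A forward pass through this kind of network can be sketched in NumPy. The sketch is illustrative only and rests on several assumptions not confirmed by the text: each Tensor FF layer is modelled as mode products with factor matrices along the mel and time modes followed by ReLU, with no bias term and with the utterance mode left unchanged; the weights are random rather than trained; the utterance axis is placed first to match the num × mel × frames layout given in the experimental settings; and all sizes are toy values. The function names are hypothetical.

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """Mode-n product of a tensor with a matrix of shape (P, I_n)."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

def tensor_ff_layer(x, u_freq, u_time):
    """Assumed layer form: project the mel and time modes, then apply ReLU."""
    h = mode_n_product(x, u_freq, mode=1)
    h = mode_n_product(h, u_time, mode=2)
    return np.maximum(h, 0.0)

def tensor_sigmoid(h, weight):
    """Inner product with a same-sized weight tensor, followed by a sigmoid."""
    logit = np.sum(h * weight)
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)

# Toy bag: 5 utterances, each an 8 x 12 "mel-spectrogram" (the paper uses 128 x 219).
spectrograms = [rng.standard_normal((8, 12)) for _ in range(5)]
bag = np.stack(spectrograms, axis=0)               # shape (5, 8, 12)

u_freq = rng.standard_normal((6, 8)) * 0.1         # mel mode: 8 -> 6
u_time = rng.standard_normal((10, 12)) * 0.1       # time mode: 12 -> 10
features = tensor_ff_layer(bag, u_freq, u_time)    # shape (5, 6, 10)

weight = rng.standard_normal(features.shape) * 0.1
bag_prob = tensor_sigmoid(features, weight)        # one probability for the whole bag
print(features.shape, 0.0 < bag_prob < 1.0)
```

Unlike the per-utterance baselines, a single probability is produced for the whole stacked tensor, so no pooling of instance labels is needed.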
Fig 2
MIL technique using 3D TFNN and 3D TFNN + Utterance level attention as base architectures.
The input to the architectures is a mel-spectrogram tensor generated by stacking the mel-spectrograms of a speaker’s utterances along the third dimension. For the 3D TFNN, a stack of 3D Tensor FF layers followed by a Tensor Sigmoid layer is depicted. For the 3D TFNN + Utterance level attention, the stack of 3D Tensor FF layers is followed by an utterance level attention layer. A statistics pooling layer aggregates information from the attentive feature vectors of the utterances, followed by a dense layer for classification.
2.3.3 3D TFNN with utterance level attention
In this technique, the 3D TFNN described in Section 2.3.2 is used to extract deep tensor features from 3D tensor representations of bags of utterances. The feature tensor now comprises utterance level representations stacked along the third dimension of the feature core tensor. For each 2D slice of the 3D feature tensor, an utterance level attentive feature representation is generated using the following attention mechanism.
2.3.3.1 Attention layer. The attention layer used in our work is based on the attention proposed in [47]. The attention layer takes in a sequence of high-level feature vectors, focuses on the depression-related parts by means of attention weights, and generates an utterance level attention feature vector representing the depression-related frames of the input sequence. Given a 2D slice of the 3D feature tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$, $I_3$ represent the number of utterances, number of mel filter bands and number of frames, respectively, normalized attention weights $\alpha_t$ are first computed by applying a softmax function to the frame-level attention scores $e_t$, each computed from the feature vector $h_t$:
$$\alpha_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)},$$
where $t \in \{1, 2, \cdots, T\}$, $T$ being the total number of frames in the feature tensor slice and $h_t$ being the feature vector belonging to the $t$th frame. The utterance level feature vector $c$ is obtained as the sum of the feature vectors $h_t$ weighted by the attention weights $\alpha_t$:
$$c = \sum_{t=1}^{T} \alpha_t h_t.$$
2.3.3.2 Statistics pooling. Statistics pooling was first introduced in [43] for extracting utterance level statistics from frame-level feature embeddings generated using a Time Delay Neural Network for speaker verification tasks. In our proposed architecture, statistics pooling is employed to extract bag level statistics (mean and standard deviation) from the utterance level attentive feature vectors. As such, the output of the statistics pooling layer aggregates the relevant discriminative information obtained from several speaker utterances and provides a unified feature for further classification objectives. Given a set of attention feature vectors $c_i \in \mathbb{R}^{I_2}$, $i \in \{1, 2, \cdots, I_1\}$, obtained as described in Section 2.3.3.1, where $I_1$ represents the number of utterances in the bag, the statistics pooling is calculated using mean, which is the average, and var, which is the variance:
$$\mu = \operatorname{mean}(c_1, \ldots, c_{I_1}), \qquad \sigma = \operatorname{var}(c_1, \ldots, c_{I_1}).$$
This results in a pooled feature vector in $\mathbb{R}^{2 I_2}$, with $\mu$ and $\sigma$ concatenated for each entry of $c$.
2.3.3.3 Fully connected layer. The output from the statistics pooling layer contains the aggregation of information across several utterances of a speaker. The pooled feature vector is passed to a fully connected network with two layers to reduce the dimensionality and extract additional high-level features. Finally, the output of the fully connected layers is passed on to the last layer with sigmoid activation to generate the classification probability of being depressed or normal.
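The attention and statistics pooling steps can be sketched together in NumPy. Two points are assumptions made for illustration rather than details taken from the text: the per-frame score is taken to be a dot product between a learnable vector `w` and the frame feature vector, and the second pooled statistic is the standard deviation (the text mentions both variance and standard deviation). Function names and the random inputs are hypothetical; the sizes follow the num × 100 × 180 feature tensor given in the experimental settings.

```python
import numpy as np

def attention_pool(slice_2d, w):
    """slice_2d: (n_features, T), frames as columns; w: (n_features,) assumed scoring vector."""
    scores = w @ slice_2d                        # one scalar score per frame, shape (T,)
    scores = scores - scores.max()               # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax over frames
    return slice_2d @ alpha, alpha               # weighted sum of the frame vectors

def statistics_pooling(vectors):
    """vectors: (n_utterances, n_features) -> concatenated mean and standard deviation."""
    return np.concatenate([vectors.mean(axis=0), vectors.std(axis=0)])

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 100, 180))    # 4 utterances x 100 bands x 180 frames
w = rng.standard_normal(100) * 0.1               # would be learned in a real network

attended = np.stack([attention_pool(s, w)[0] for s in features])   # shape (4, 100)
pooled = statistics_pooling(attended)                              # shape (200,)
_, alpha_first = attention_pool(features[0], w)
print(attended.shape, pooled.shape)
```

The pooled vector has twice the feature dimension regardless of how many utterances the bag contains, which is what allows bags of different sizes to feed the same fully connected layers.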
2.3.4 Experimental setting
The four architectures (baseline CNN-MIL, TFNN-MIL, 3D TFNN, and 3D TFNN+Attention) are evaluated on the DAIC-WOZ dataset for depression classification. For tensor formation, bags of utterances with bag sizes in the range [10, 60] are selected from each speaker. Thus, multiple tensors are formed for each speaker, since the chosen bag size yields multiple bags without repetition of utterances. For the training scenario, each individual bag of utterances is treated as coming from a new speaker bearing the same label as all the other child bags of the parent speaker, thereby generating a large number of tensors for training. For the testing scenario, however, the label for the parent speaker is calculated by averaging the predicted probabilities of all the child bags and comparing the averaged probability against a threshold. The threshold is calculated from the ROC curve generated using the validation data.
Mel spectrograms are computed from the speech segments to be used as input for the Tensor Factorized Neural Network and the baseline CNN architecture. For the computation of mel spectrograms, the speech segments are first windowed using a Hamming window of 2048 samples with a shift of 512 samples. The windowed signal is used to compute the Short-Time Fourier Transform (STFT). The magnitude spectrogram obtained from the STFT is then passed through a mel-scale filterbank to obtain the filterbank energies. A log operator is finally applied to get the log-mel spectrogram.
For the baseline CNN architecture, the number of filters in the first and second feature learning blocks is 64, with a kernel size of 3 × 3 and a stride of 1. For the third feature learning block, the number of filters is 128 with kernel size 2 × 2. The activation function used in all feature learning blocks is ELU, and max-pooling with a kernel size of 2 × 2 is used. 
The feature maps generated after the third feature learning block are vectorized and passed through a fully connected network with a sigmoid non-linearity in its last layer to generate probabilities for the depressed versus non-depressed categories.
For the TFNN-MIL system, the base architecture consists of four consecutive 2D Tensor Feed Forward layers. The feature dimensions produced by the Tensor FF layers are 120 × 210, 110 × 200, 100 × 180 and 80 × 160, respectively. The output from the fourth Tensor FF layer is used to calculate logits using an inner product with a weight tensor of dimensions 80 × 160. Finally, the logits are passed through the activation function to yield utterance segment-level probabilities. This base architecture is repeated for all the instances in the bag, and a final global average pooling of the probabilities generates the bag-level probability.
For the 3D TFNN architecture, the input tensor is of size num × 128 × 219, where the dimensions refer to the number of utterances, the number of mel filters, and the number of time frames, respectively. The input mel-spectrogram tensor is passed through two 3D tensor feed-forward layers whose core tensors are of size num × 120 × 200 and num × 100 × 180, respectively. The activation function used in both Tensor FF layers is ReLU. The feature tensor obtained after the second Tensor FF layer is fed to a Tensor sigmoid layer: the inner product of the feature tensor with a trainable weight tensor of the same size is passed through a sigmoid non-linearity to generate the class probability.
In the case of the 3D TFNN+Attention architecture, two 3D tensor FF layers, as used in the 3D TFNN architecture above, extract a discriminative feature tensor of size num × 100 × 180. The utterance level attention mechanism generates utterance level feature vectors of dimensions num × 100. 
This feature sequence is passed to a statistics pooling layer generating a feature vector of dimensions R200, which is passed through two fully connected layers of dimensions 256, 256 and a last layer having sigmoid non-linearity to generate class probability for the bag of utterances.
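The statistics pooling step can be illustrated with a minimal sketch. It assumes the common definition of statistics pooling, namely the per-dimension mean and standard deviation concatenated into one vector, which is consistent with num × 100 utterance vectors producing a 200-dimensional output; the text does not state which statistics are used, and `statistics_pooling` is a hypothetical helper name rather than the authors' code.

```python
import math
from typing import List


def statistics_pooling(features: List[List[float]]) -> List[float]:
    """Pool a (num x dim) sequence of utterance vectors into one 2*dim vector.

    Assumption: the pooled statistics are the per-dimension mean and the
    population standard deviation, concatenated as [means, stds].
    """
    if not features:
        raise ValueError("need at least one utterance vector")
    num = len(features)
    dim = len(features[0])
    means = [sum(row[d] for row in features) / num for d in range(dim)]
    stds = [
        math.sqrt(sum((row[d] - means[d]) ** 2 for row in features) / num)
        for d in range(dim)
    ]
    return means + stds


# Toy input: 30 utterances, each with a 100-dimensional attentive feature vector.
toy = [[float(u + d) for d in range(100)] for u in range(30)]
pooled = statistics_pooling(toy)
print(len(pooled))  # 200: 100 means followed by 100 standard deviations
```

Because the output length depends only on the feature dimension, the same pooled size is obtained for any number of utterances in the bag.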
3 Results
The four architectures (baseline CNN-MIL, TFNN-MIL, 3D TFNN, and 3D TFNN+Attention) are trained and evaluated on the DAIC-WOZ dataset using three metrics: weighted accuracy (WA), unweighted accuracy (UA), and F1-score. Since the dataset is highly imbalanced, unweighted accuracy and F1-score are the more appropriate choices for highlighting the true predictive capability of the models. Class imbalance also shifts the optimal decision threshold away from the default of 0.5 used for binary classification problems (threshold moving). We therefore use the optimal threshold calculated from the ROC curve on the validation data, which is the development partition of the dataset. This threshold is then used to generate labels from the probabilities predicted for the test set.
As seen from Table 2, the 3D TFNN and 3D TFNN+Attention architectures outperform the baseline CNN-MIL system at the speaker level by considerable absolute margins of 16.67% and 17.2% in UA, respectively. This suggests that Tensor Factorized Neural Networks are better suited to MIL-based systems because they capture information shared across the modes of the tensor input. Moreover, the 3D TFNN+Attention system provides a balance between overall accuracy and the average of class accuracies. This is important for imbalanced datasets, where the chances of the model fitting towards the majority class are always high. In terms of F1-score, 3D TFNN outperforms the other techniques and reaches state-of-the-art performance.
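The thresholding procedure described above can be sketched as follows. The text states only that the threshold comes from the ROC curve on the validation data; the criterion used here (maximizing Youden's J, i.e. TPR minus FPR) is an assumption, as are the function names, the toy probabilities, and the convention that a mean probability at or above the threshold maps to the depressed class.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], probs: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, TPR, FPR) for every distinct predicted probability."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in the validation data")
    points = []
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(labels, probs) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(labels, probs) if p >= t and y == 0)
        points.append((t, tp / pos, fp / neg))
    return points


def youden_threshold(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Pick the ROC threshold maximizing TPR - FPR (assumed criterion)."""
    return max(roc_points(labels, probs), key=lambda pt: pt[1] - pt[2])[0]


def speaker_label(bag_probs: Sequence[float], threshold: float) -> int:
    """Average the child-bag probabilities of a speaker and apply the threshold."""
    mean_prob = sum(bag_probs) / len(bag_probs)
    return int(mean_prob >= threshold)


# Hypothetical validation speakers (1 = depressed) and their averaged probabilities.
val_labels = [0, 0, 0, 0, 1, 1]
val_probs = [0.10, 0.20, 0.35, 0.30, 0.40, 0.80]
threshold = youden_threshold(val_labels, val_probs)

# Hypothetical test speakers, each with three child-bag probabilities.
print(threshold, speaker_label([0.55, 0.30, 0.45], threshold), speaker_label([0.20, 0.30, 0.50], threshold))
```

In this toy case the selected threshold falls below 0.5, which is the kind of shift the text attributes to class imbalance.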
Table 2
Recognition accuracies in terms of Weighted Accuracy (WA) and Unweighted Accuracy (UA), and F1-scores, for the different tensor-based techniques on the test set of the DAIC-WOZ dataset.
| Method | Single utterances WA (%) | Single utterances UA (%) | Speaker level WA (%) | Speaker level UA (%) | F1-score (Normal, Depressed) |
|---|---|---|---|---|---|
| CNN MIL | 54.40 | 55.65 | 51.06 | 54.87 | 0.56, 0.43 |
| TFNN MIL | 60.00 | 62.52 | 65.95 | 71.64 | 0.70, 0.60 |
| 3D TFNN | 59.20 | 65.17 | 74.47 | 71.54 | 0.81, 0.60 |
| 3D TFNN + Att | 60.40 | 61.06 | 72.34 | 72.07 | 0.78, 0.60 |
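The metrics in Table 2 can be computed from a binary confusion matrix as in the sketch below. It assumes the convention suggested by the text, where WA is the overall accuracy and UA is the average of the per-class accuracies (recalls). The confusion counts are hypothetical: they are not taken from the paper, but were chosen so that the outputs agree with the speaker-level 3D TFNN row after rounding.

```python
from typing import Dict


def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Compute WA, UA and per-class F1 from binary confusion counts.

    The positive class is 'depressed'; the negative class is 'normal'.
    """
    total = tn + fp + fn + tp
    recall_normal = tn / (tn + fp)
    recall_depressed = tp / (tp + fn)
    return {
        "wa": (tn + tp) / total,                      # overall accuracy
        "ua": (recall_normal + recall_depressed) / 2,  # mean of class recalls
        "f1_normal": 2 * tn / (2 * tn + fn + fp),
        "f1_depressed": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts: 33 normal and 14 depressed speakers.
metrics = binary_metrics(tn=26, fp=7, fn=5, tp=9)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With an imbalanced test set, WA is dominated by the majority class, whereas UA weights both classes equally, which is why the two values can diverge.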
Fig 3 presents the confusion matrices for the four architectures on the test set of the DAIC-WOZ dataset, using 30 utterances per tensor. The confusion matrix in Fig 3d shows that the 3D TFNN+Attention architecture is the most balanced between the depressed and non-depressed categories, followed by the 3D TFNN architecture. This supports our proposal of using utterance-level attention to generate attentive feature vectors per utterance segment. The impact of the number of utterances per tensor on the recognition performance of the model is assessed in Fig 4, where the number of utterances per tensor is varied over the interval [10, 60]. The curve is plotted using B-spline interpolation [48] to obtain a smooth curve from the small number of data points. As is evident from the graph, the model performs best when 30 utterances are chosen per tensor, and the accuracy declines gradually as the number of utterances per tensor is increased further. This may be because, with more utterances, redundant information unrelated to the desired objective is also captured, which increases confusion and decreases accuracy.
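The bag sizes compared in this analysis correspond to different ways of splitting a speaker's utterances into tensors. A minimal sketch of bag formation without repetition is given below; the helper name `make_bags`, the decision to drop a trailing incomplete bag, and the toy utterance identifiers are assumptions made for illustration, not details stated in the text.

```python
from typing import List, Sequence, Tuple


def make_bags(utterances: Sequence[str], bag_size: int) -> List[List[str]]:
    """Split one speaker's utterances into disjoint bags of `bag_size`.

    No utterance appears in more than one bag. Assumption: leftover
    utterances that cannot fill a complete bag are dropped.
    """
    if bag_size <= 0:
        raise ValueError("bag_size must be positive")
    last_start = len(utterances) - bag_size
    return [list(utterances[i:i + bag_size]) for i in range(0, last_start + 1, bag_size)]


def training_examples(utterances: Sequence[str], label: int, bag_size: int) -> List[Tuple[List[str], int]]:
    """Every child bag inherits the label of its parent speaker."""
    return [(bag, label) for bag in make_bags(utterances, bag_size)]


# Toy speaker with 100 utterance identifiers and a 'depressed' label of 1.
speaker_utts = [f"utt_{i:03d}" for i in range(100)]
for size in (10, 30, 60):
    print(size, len(training_examples(speaker_utts, label=1, bag_size=size)))
```

Larger bags give each tensor more context but leave fewer tensors per speaker, so the bag size trades training set size against the amount of information per example.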
Fig 3
Normalized confusion matrices for the test set of the DAIC-WOZ depression dataset for the four architectures: baseline CNN-MIL, TFNN-MIL, 3D TFNN, and 3D TFNN+Attention.
Fig 4
Comparison of unweighted accuracy for varying number of utterances per tensor for the architectures CNN-MIL, TFNN-MIL, 3D TFNN and 3D TFNN+Attention.
3.1 Comparison with State-of-the-Art
Several studies have utilized the DAIC-WOZ depression dataset for unimodal as well as multi-modal depression recognition [20, 49]. Since this investigation considers only the audio modality, the performance is compared with other studies that use the audio modality only. Moreover, a few studies report final results only on the development partition of the dataset. Because our work uses the test set as the unseen data, we compare with works that likewise report results on the test partition. To give a fair comparison, the published studies are also filtered by the metrics used, and only those reporting accuracy and F1-score are included.
Table 3 presents the state-of-the-art techniques for depression recognition from speech utterances using the DAIC-WOZ dataset. Valstar et al. [40] provided the baseline results for the DAIC-WOZ dataset using both the audio and video modalities. Our implementation outperforms this baseline by 0.21 in mean F1-score for the audio modality scenario. Ma et al. [29] used a combination of CNN and LSTM networks to extract high-level features from raw speech representations, together with a random sampling strategy to balance the examples between the depressed and normal classes. In contrast, our investigation uses a weighted loss function to alleviate the class imbalance and thereby incorporates all the training speakers during model training. As such, our proposed architecture achieves an overall performance gain of around 9% in terms of accuracy (WA of 0.745 versus 0.65 in Table 3).
Table 3
Comparison with the state-of-the-art techniques on the test partition of the DAIC-WOZ dataset in terms of Weighted Accuracy (WA), Unweighted Accuracy (UA) and F1-scores.
| Sl. no. | Method | Year of publication | Accuracy (WA) | Accuracy (UA) | F1 score (Depressed) | F1 score (Normal) | F1 score (Mean) |
|---|---|---|---|---|---|---|---|
| 1 | Valstar et al. (AVEC base) | 2016 | - | - | 0.41 | 0.58 | 0.495 |
| 2 | Ma et al. (DepAudioNet) | 2016 | 0.65 | - | 0.52 | 0.70 | 0.610 |
| 3 | Romero et al. (Ensemble) | 2020 | 0.72 | - | 0.63 | 0.78 | 0.705 |
| 4 | 3D TFNN (proposed) | - | 0.745 | 0.715 | 0.60 | 0.81 | 0.705 |
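The margins quoted in the comparison can be re-derived from the Table 3 entries. The short sketch below does this arithmetic; the variable names are illustrative, and the values are copied from the table rather than recomputed from any model output.

```python
def mean_f1(f1_depressed: float, f1_normal: float) -> float:
    """Mean of the two per-class F1 scores, as in the 'Mean' column."""
    return (f1_depressed + f1_normal) / 2


# Values copied from Table 3.
proposed = {"wa": 0.745, "f1_depressed": 0.60, "f1_normal": 0.81}
avec_baseline_mean_f1 = 0.495
depaudionet_wa = 0.65

proposed_mean_f1 = mean_f1(proposed["f1_depressed"], proposed["f1_normal"])
f1_gain = proposed_mean_f1 - avec_baseline_mean_f1
wa_gain_points = (proposed["wa"] - depaudionet_wa) * 100

print(round(proposed_mean_f1, 3), round(f1_gain, 2), round(wa_gain_points, 1))
```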
3.2 Discussion
Several features have been investigated in the literature for depression diagnosis from speech utterances. This study focuses on mel-spectrograms for two reasons. First, mel-spectrograms have been shown to contain the para-linguistic information present in speech utterances, such as emotional states [50] and cough [51]. Second, spectrograms provide a natural 2D tensor form for speech utterances. The proposed tensor-based MIL techniques try to exploit the time-frequency information spread across several utterances of a speaker. The 3D TFNN extracts shared information across the mel-spectrograms of a speaker, thus trying to model the temporal information spread across multiple utterances in an interview setting. The 3D core tensor, which is the feature tensor, comprises the coefficients of interaction across the subspaces corresponding to each of the modes: the time subspace, the frequency subspace, and the utterance subspace. Moreover, when utterance-level attention is used, the model tries to extract more depression-relevant information from each utterance by means of self-attention. This in turn refines the feature extraction process by producing attentive feature vectors for each utterance in the tensor. To aggregate the information extracted by the attention layers, statistics pooling is used, which generates a combined feature vector for all the utterances in the tensor. The proposed techniques are computationally efficient, as the tensor-factorization-based architecture significantly lowers the number of trainable parameters [46].
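A rough parameter count illustrates the efficiency argument. The sketch assumes that each Tensor FF layer applies one factor matrix per mode (a Tucker-style mode product), so its weight count is the sum of the per-mode matrix sizes, whereas a fully connected layer on the vectorized input needs the product of the input and output sizes. Bias terms and the utterance mode are ignored, the layer sizes are the 2D slices quoted in the experimental setup, and the function names are hypothetical.

```python
from math import prod
from typing import Sequence


def tensor_ff_params(in_shape: Sequence[int], out_shape: Sequence[int]) -> int:
    """Weights of a layer with one (in_k x out_k) factor matrix per mode."""
    return sum(i * o for i, o in zip(in_shape, out_shape))


def dense_params(in_shape: Sequence[int], out_shape: Sequence[int]) -> int:
    """Weights of a fully connected layer between the vectorized tensors."""
    return prod(in_shape) * prod(out_shape)


# Mel filters x time frames for the two layers of the 3D TFNN.
layers = [((128, 219), (120, 200)), ((120, 200), (100, 180))]
for in_shape, out_shape in layers:
    factored = tensor_ff_params(in_shape, out_shape)
    dense = dense_params(in_shape, out_shape)
    print(in_shape, out_shape, factored, dense, dense // factored)
```

Under these assumptions the factorized layers need tens of thousands of weights where an equivalent dense layer would need hundreds of millions.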
4 Conclusion
In this work, we present a tensor-based architecture for Multiple Instance Learning in the setting where a collection of utterances is available for a speaker and the speaker label has to be inferred from the features of those utterances. Conventional MIL architectures, such as the baseline CNN-MIL system described in Fig 1, suffer from the inherent drawback of not considering relative and shared information across the utterances in a bag. These techniques infer labels for individual utterances and then average or max-pool them to infer the speaker-level label. The tensor-based architectures address this problem by treating the utterances as a third mode in addition to the time and frequency modes of speech spectrograms. TFNNs, through their rich mathematical framework, try to capture the shared information across the utterances of a bag by tensor factorization, in which the input tensor is projected onto three subspaces: the time subspace, the frequency subspace, and the utterance subspace. This helps to leverage the shared information and generate a single speaker-level (bag-level) probability for the specified task. To this end, we have implemented two tensor MIL architectures: 3D TFNN and 3D TFNN+Attention. Comparison with the state of the art shows that both techniques are effective in capturing depression-related information across bags of utterances. Moreover, an additional analysis of the number of utterances per bag is presented to shed light on model performance under varying bag sizes.
31 Mar 2022
PONE-D-22-04732
A deep tensor-based approach for automatic depression recognition from speech utterances
PLOS ONE
Dear Dr. Pandey,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
Please submit your revised manuscript by May 15 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.
Please include the following items when submitting your revised manuscript:
- A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
- A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
- An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.
We look forward to receiving your revised manuscript.
Kind regards,
Dhananjay Singh, Ph.D.
Academic Editor
PLOS ONE
Journal Requirements:
When submitting your revision, we need you to address these additional requirements.
1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.
2. Acknowledgments Section: Move New Information to the Financial Disclosure:
Thank you for stating the following in the Acknowledgments Section of your manuscript:
[This research was supported under the India-Korea joint program cooperation of science and technology by the National Research Foundation (NRF) Korea (2020K1A3A1A68093469), the Ministry of Science and ICT (MSIT) Korea and by the Department of Biotechnology (India) (DBT/IC-12031(22)-ICD-DBT).]
We note that you have provided funding information that is currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.
Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:
[This research was supported under the India-Korea joint program cooperation of science and technology by the National Research Foundation (NRF) Korea (2020K1A3A1A68093469), the Ministry of Science and ICT (MSIT) Korea and by the Department of Biotechnology (India) (DBT/IC-12031(22)-ICD-DBT).]
Please include your amended statements within your cover letter; we will change the online submission form on your behalf.
3. Please ensure that you refer to Figure 1 in your text as, if accepted, production will need this reference to link the reader to the figure.
Additional Editor Comments:
- In the related work, need to include more recently published work with explanation of what is the role of the deep learning research in depression analysis?
- I suggest to more detailed explanation for Evaluation based on framework for automated depression classification and the compression with others recent work.
- Please explain in briefly and move their explanation in a paragraph explaining what each table and figures refer to.
- It will be helpful to have a few lines explaining what is inside the features chosen and why a particular fusion method is suitable for the particular depression analysis.
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #2: Yes
2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes
3. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: No
Reviewer #2: Yes
4. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #2: Yes
5. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: The paper is strong and innovative. However, these areas should be improved:
   1- Please discuss the work of U of Michigan team on PRIORI which uses pitch to predict mood of patients with bipolar disorder. Mccinis, Khorram, and others are among the authors. Please compare your work with theirs and let us know how your work has enhanced their work.
   2- Please add the n to all your tables.
   3- Quality of figures are not acceptable. These figures are hard to follow.
   4- Too many abbreviations. Please reduce the use of abbreviations to those very necessary.
   5- PHQ8 and PHQ 9 differ, and their results should not be combined. Why not limiting the results to 8 items.
   6- If PHQ is used, why do you discuss and cite Hamilton measure too?
   7- We need tables that describe the participants.
   8- Show us correlation between all variables. (Pearson r)
   9- Please use MDD for major depression. Does not need the letter "S".
Reviewer #2: The author's article on “A deep tensor-based approach for Automatic Depression Recognition (ADR) from Speech utterances” is interesting for looking into the dimension of human mental health which is currently an important topic in every age groups of mankind and its analysis through speech processing presently societal need. The authors are appreciated for their intuitive knowledge in speech analysis and current technology exploration.
Article is acceptable for minor corrections
Minor Suggestions:
With minor correction the article is very much acceptable for the journal.
   1. The title of the article says depression recognition so the authors are only specific parameters based patterns generated to Recognition or authors may look the contextually fit then instead recognition detection is more suitable.? Justify with your answer.
   2. The Word GIVE must be justified or reference adds for abbreviation if taken from research sources. Give the significant reasons for capitalization of the word
   3. In page 1 line no. 6 of Introduction section Major Depression Disorder(MSDD) and Multi system depressive disorder must be given proper relation and both abbreviations cannot be interchangeable hence look into these point and justify your comments.
   4. In Line No.10 page 2 the numbers added with the symbols of percentage and again word used so redundancy in information may be corrected. Ex: 80% percentage is not correct 80 percent or 80% along correct. Kindly justify your action on this modification
   5. Line 18 of page 2 “P.H.Q” abbreviated word usability in the entire document must be consistent.
   6. Differentiation of the depression detection and depression recognition must be distinguished while using the work in the article. The usability of the word must be maintained in a consistent manner.
   7. Appreciate the authors with more grammatically correct and with simple sentence breaks made throughout the article in a consistent manner.
   8. Authors can justify properly the use of Abbreviations separated full stops and some places without them. Entire articles use a uniform process.
   9. Page No. 8 Section 3.3.2 given equations need to be numbered.
   10. Page no. 9 and content in Table 1. Captions of the Row must be added to the percentage symbols as the accuracy is measured with ratios.
   11. Authors suggested rewriting the Abstract and Conclusions with simple sentences by appropriate breaks and conveying the authors view properly.
   12. The technical concepts are good and explanation flow is appreciable.
With this minor correction the article is acceptable.
6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Shervin Assari
Reviewer #2: No
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
25 May 2022
I. COMMENTS FROM REVIEWER 1
A. Comment
Please discuss the work of U of Michigan team on PRIORI which uses pitch to predict mood of patients with bipolar disorder. Mccinis, Khorram, and others are among the authors.
Please compare your work with theirs and let us know how your work has enhanced their work.
Reply: We like to thank the reviewer for the suggestion. We have cited the paper and have added discussion in the introduction section. The work in “The PRIORI Emotion Dataset: Linking Mood to Emotion Detected In-the-Wild” has investigated emotion in speech as an intermediary feature for monitoring Bipolar Disorder over depressed and manic states of an individual. The authors have proposed a new dataset collected from smartphone speech recordings and assessed several parameters such as PCC, CCC, RMSE etc. to prove the robustness of the feature and deep learning methods in identifying emotional activation and valence.
Compared to this suggested work, we have explored depression recognition in an individual based on his speech utterances. The architecture proposed tries to capture shared information across several utterances of a speaker spread temporally across the interview conversation. The emotion aspect as suggested by the reviewer is very interesting and is part of our future work in extending this study.
B. Comment
Please add the n to all your tables.
Reply: Thank you for the suggestion. However, we are not clear regarding the expectation of the reviewer in this comment. We have tried to stick to PLOS ONE Journal formatting guidelines to the best.
C. Comment
Quality of figures are not acceptable. These figures are hard to follow.
Reply: We deeply regret the inconvenience in understanding the figures in the current form. To incorporate the reviewer’s suggestion, we have added a brief description of the components of the figure in the caption to understand the information flow along various components. We have updated the figures to represent the end-to-end process starting from speech signal to label generation.
D. Comment
Too many abbreviations. Please reduce the use of abbreviations to those very necessary.
Reply: Thank you for pointing out this. As per the suggestion, to enhance the readability we have reduced the use of abbreviations in the paper.
E. Comment
PHQ8 and PHQ 9 differ, and their results should not be combined. Why not limiting the results to 8 items.
Reply: We agree with the observation of the reviewer. The DAIC-WOZ Dataset is annotated for depression using the PHQ-8 scale. Hence we have omitted any mention of PHQ-9 present in the paper as suggested.
F. Comment
If PHQ is used, why do you discuss and cite Hamilton measure too?
Reply: Thank you reviewer for this observation. We have mentioned Hamilton measure in the literature survey of the paper to make the paper inclusive of readers belonging to non-medical backgrounds too. Hence, the sole intention of mentioning Hamilton scale was to inform the user about several techniques used by the clinicians to label mental health data.
G. Comment
We need tables that describe the participants.
Reply: Thank you for the suggestion. We have added a table that discusses the composition of the dataset with respect to participants according to the gender. Moreover, additional details are not available as the DAIC-WOZ Dataset is provided by the AVEC 2016 Challenge and the released baseline paper has limited information about the participants.
H. Comment
Show us correlation between all variables. (Pearson r)
Reply: Thank you for the suggestion. As we have used a deep learning model based on tensor factorization, the input to the method is 3D Tensors, which are stack of 2D Mel-spectrogram tensors along the third dimension. As such Pearson r coefficient of the input tensor is not defined. Moreover, since the stack of deep learning layers learn abstract information in the hidden layers, which are unknown in general, it is infeasible to provide correlation of variables in this particular proposed method.
I. Comment
Please use MDD for major depression. Does not need the letter “S”.
Reply: Thank you for the suggestion. We have made the suggested changes.
II. COMMENTS FROM REVIEWER 2
A. Comment
The title of the article says depression recognition so the authors are only specific parameters based patterns generated to Recognition or authors may look the contextually fit then instead recognition detection is more suitable.? Justify with your answer.
Reply: We thank the reviewer for the question. As suggested, we have replaced “detection” with “recognition” in line with the standards used in similar research papers such as:
1. L. Yang, D. Jiang, W. Han and H. Sahli, “DCNN and DNN based multi-modal depression recognition,” 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), 2017, pp. 484-489, doi: 10.1109/ACII.2017.8273643.
2. Hongying Meng, Di Huang, Heng Wang, Hongyu Yang, Mohammed AI-Shuraifi, and Yunhong Wang. 2013. Depression recognition based on dynamic facial and vocal expression features using partial least square regression. In Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge (AVEC ’13). Association for Computing Machinery, New York, NY, USA, 21–30. https://doi.org/10.1145/2512530.2512532
B. Comment
The Word GIVE must be justified or reference adds for abbreviation if taken from research sources. Give the significant reasons for capitalization of the word
Reply: Thank you for pointing out this. We have removed the word “GIVE” as it was not adding any extra information.
C. Comment
In page 1 line no. 6 of Introduction section Major Depression Disorder(MSDD) and Multi system depressive disorder must be given proper relation and both abbreviations cannot be interchangeable hence look into these point and justify your comments.
Reply: We thank the reviewer for pointing out this mistake. We have made the necessary correction and removed “Multisystem Depressive Disorder” as this was out of context here.
D. Comment
In Line No.10 page 2 the numbers added with the symbols of percentage and again word used so redundancy in information may be corrected. Ex: 80% percentage is not correct 80 percent or 80% along correct. Kindly justify your action on this modification
Reply: We thank the reviewer for pointing out this mistake. We have corrected as per the suggestion and only kept “%” symbol and removed the word “percent” to reduce redundant information.
E. Comment
Line 18 of page 2 “P.H.Q” abbreviated word usability in the entire document must be consistent.
Reply: We have modified the abbreviation according to the suggestion of the reviewer.
F. Comment
Differentiation of the depression detection and depression recognition must be distinguished while using the work in the article. The usability of the word must be maintained in a consistent manner.
Reply: We thank the reviewer for this valuable comment. As per the suggestion of the reviewer, we have maintained uniformness across the paper by replacing “detection” with “recognition” as per the standards in the existing literature which utilizes deep learning in depression recognition.
G. Comment
Appreciate the authors with more grammatically correct and with simple sentence breaks made throughout the article in a consistent manner.
Reply: We thank the reviewer for raising this concern. We have tried to simplify sentences and improve grammar wherever possible in the paper.
H. Comment
Authors can justify properly the use of Abbreviations separated full stops and some places without them. Entire articles use a uniform process.
Reply: We thank the reviewer for pointing out this mistake. As per the suggestion, we have made the abbreviations uniform across the article.
I. Comment
Page No. 8 Section 3.3.2 given equations need to be numbered.
Reply: Thank you for the suggestion. We have added equation numbers for the same.
J. Comment
Page no. 9 and content in Table 1. Captions of the Row must be added to the percentage symbols as the accuracy is measured with ratios.
Reply: We have made the necessary changes as per the reviewer’s suggestion.
K. Comment
Authors suggested rewriting the Abstract and Conclusions with simple sentences by appropriate breaks and conveying the authors view properly.
Reply: Thank you for the suggestion. We have tried to incorporate the reviewer’s suggestion.
L. Comment
The technical concepts are good and explanation flow is appreciable.
Reply: We thank the reviewer for his kind appreciation.
III. COMMENTS FROM THE EDITOR
A. Comment
In the related work, need to include more recently published work with explanation of what is the role of the deep learning research in depression analysis?
Reply: We thank the editor for this suggestion. We have added reference to the more recent work as well as emphasized on the importance of deep learning for mental health diagnosis.
B. Comment
I suggest to more detailed explanation for Evaluation based on framework for automated depression classification and the compression with others recent work.
Reply: We thank the editor for the suggestion. We have explained the evaluation procedure in detail in section 3.4 along with details on the hyperparameter values used. Also, the comparison with other recent work is limited to only those studies which have used accuracy and F1 score as metric and evaluated on test partition. Most of the other recent study has used regression metric and validation partition for evaluation, which is not included in our table of comparison.
C. Comment
Please explain in briefly and move their explanation in a paragraph explaining what each table and figures refer to.
Reply: We thank the editor for the suggestion. We have added details about the information flow across various components in the diagram in the captions.
D. Comment
It will be helpful to have a few lines explaining what is inside the features chosen and why a particular fusion method is suitable for the particular depression analysis.
Reply: We thank the editor for this suggestion. We have added a subsection titled “Discussion” which briefly explains the choice of mel-spectrograms as speech feature and what information 3D TFNN based MIL architecture tries to capture in its various layers.
Submitted filename: Response to Reviewers.pdf
25 Jul 2022
A deep tensor-based approach for automatic depression recognition from speech utterances
PONE-D-22-04732R1
Dear Dr. Pandey,
We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.
Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.
An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.
If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
Kind regards,
Dhananjay Singh, Ph.D.
Academic Editor
PLOS ONE
Additional Editor Comments (optional):
The overall quality of communication is good although proofreading will be needed.
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.
Reviewer #3: All comments have been addressed
**********
2. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #3: Yes
**********
3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #3: Yes
**********
4. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.
Reviewer #3: Yes
**********
5. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #3: Yes
**********
6. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #3: The authors have addressed the reviewer's concerns and the revised version of the manuscript appears to be good.
**********
7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.
If you choose "no", your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #3: No
**********