| Literature DB >> 28934302 |
Predicting couple therapy outcomes based on speech acoustic features
Md Nasir, Brian Robert Baucom, Panayiotis Georgiou, Shrikanth Narayanan
Abstract
Automated assessment and prediction of marital outcome in couples therapy is a challenging task but promises to be a potentially useful tool for clinical psychologists. Computational approaches for inferring therapy outcomes from observable behavioral information in conversations between spouses offer an objective means of understanding relationship dynamics. In this work, we explore whether the acoustics of the spoken interactions of clinically distressed spouses provide information for assessing therapy outcomes. The outcome prediction task includes both detecting whether there was a relationship improvement (posed as binary classification) and discerning varying levels of improvement or decline in the relationship status (posed as a multiclass recognition task). We use each interlocutor's acoustic speech characteristics, such as vocal intonation and intensity, both independently and in relation to one another, as cues for predicting the therapy outcome. We also compare prediction performance with that obtained using standardized behavioral codes, provided by human experts to characterize the relationship dynamics, as features for automated classification. Our experiments, using data from a longitudinal clinical study of couples in distressed relationships, showed that predictions of relationship outcome obtained directly from vocal acoustics are comparable or superior to those obtained using human-rated behavioral codes as prediction features. In addition, combining signal-derived features with manually coded behavioral features improved the prediction performance in most cases, indicating the complementarity of the relevant information captured by humans and by machine algorithms. Moreover, considering the vocal properties of the interlocutors in relation to one another, rather than in isolation, proved important for improving the automatic prediction. This finding supports the notion that behavioral outcome, like many other behavioral aspects, is closely related to the dynamics and mutual influence of the interlocutors during their interaction and their resulting behavioral patterns.
Year: 2017 PMID: 28934302 PMCID: PMC5608311 DOI: 10.1371/journal.pone.0185123
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Overview of the work described in this paper.
We use 2 out of 3 interactions (shown on the left). We employ automated feature extraction from the acoustics and/or human behavioral coding (center) and machine learning (right) to derive outcome predictions.
Behavioral coding systems used in the dataset: SSIRS (Social Support Interaction Rating System) and CIRS (Couple Interaction Rating System).
| Coding System | Codes |
|---|---|
| SSIRS | Global positive affect, global negative affect, use of humor, influence of humor by the other, sadness, anger/frustration, belligerence/domineering, contempt/disgust, tension/anxiety, defensiveness, affection, satisfaction, solicits partner’s suggestions, instrumental support offered, emotional support offered, submissive or dominant, topic being a relationship issue, topic being a personal issue, discussion about husband, discussion about wife |
| CIRS | Acceptance of the other, blame, responsibility for self, solicits partner’s perspective, states external origins, discussion, clearly defines problem, offers solutions, negotiates, makes agreements, pressures for change, withdraws, avoidance |
Number of data samples with different outcome ratings.
| Outcome | Decline | No Change | Partial Recovery | Recovery |
|---|---|---|---|---|
| Rating | 1 | 2 | 3 | 4 |
| Count | 12 | 26 | 34 | 67 |
Basic acoustic features used in the study.
| Feature Type | Feature Names |
|---|---|
| Spectral | 15 Mel-frequency cepstral coefficients (MFCCs) and their derivatives, 8 Mel filterbank (MFB) energies |
| Prosody | Intensity, Pitch and their derivatives |
| Voice quality | Jitter, Shimmer, Harmonics-to-Noise Ratio and their derivatives |
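For concreteness, below is a minimal sketch of how such frame-level descriptors can be extracted over 25 ms windows. The paper's actual toolchain is not specified in this record, so this assumes a librosa-based pipeline; jitter, shimmer, and HNR typically require Praat-style analysis (e.g., via the parselmouth package) and are omitted here.

```python
import librosa
import numpy as np

def extract_raw_features(wav_path, sr=16000, win=0.025, hop=0.010):
    """Frame-level spectral and prosodic descriptors over 25 ms windows."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft, hop_length = int(win * sr), int(hop * sr)

    # Spectral: 15 MFCCs and 8 Mel filterbank (MFB) energies
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=15,
                                n_fft=n_fft, hop_length=hop_length)
    mfb = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=8,
                                       n_fft=n_fft, hop_length=hop_length))

    # Prosody: intensity (RMS energy) and pitch (F0 via the YIN algorithm)
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop_length)
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr,
                     frame_length=4 * n_fft, hop_length=hop_length)

    # Align frame counts across streams, stack, and append the derivatives
    n = min(mfcc.shape[1], mfb.shape[1], rms.shape[1], len(f0))
    feats = np.vstack([mfcc[:, :n], mfb[:, :n], rms[:, :n], f0[None, :n]])
    return np.vstack([feats, librosa.feature.delta(feats)])
```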
Different feature representations used in the study.
| Representation | Input | Scope | Definition |
|---|---|---|---|
| Raw features | Audio | 25 ms window | Frame-level acoustic features listed in the table above |
| Static functionals | Raw features | 1 session (10 minutes) | Statistics over entire session |
| Short-term dynamic | Turns | 1 session (10 minutes) | Statistics over all turns |
| Long-term dynamic | Segments | Duration of therapy | Delta between two sessions |
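A static functional collapses each frame-level stream from the table above into session-level statistics. A minimal sketch follows; the paper's exact functional set is not listed in this record, so mean/std/median/quartiles are used here as typical choices.

```python
import numpy as np

def static_functionals(feats):
    """feats: (n_features, n_frames) frame-level features for one session."""
    stats = [
        np.nanmean(feats, axis=1),
        np.nanstd(feats, axis=1),
        np.nanmedian(feats, axis=1),
        np.nanpercentile(feats, 25, axis=1),
        np.nanpercentile(feats, 75, axis=1),
    ]
    # One fixed-length vector per session: 5 statistics per feature stream
    return np.concatenate(stats)
```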
Fig 2. Short-term dynamic functionals capture the statistics of differences between the means of features of adjacent turns in the interaction, both within an interlocutor (e.g., wife-to-wife turn changes) and across interlocutors (e.g., wife-to-husband turn changes).
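A hedged sketch of the computation Fig 2 describes: per-turn feature means, deltas between turns within and across interlocutors, and session-level statistics of those deltas. Turn boundaries (speaker label plus start/end frame) are assumed to come from a diarization or transcript-alignment step not shown here.

```python
import numpy as np

def short_term_dynamic_functionals(feats, turns):
    """feats: (n_features, n_frames); turns: list of (speaker, start, end)."""
    means = [(spk, np.nanmean(feats[:, s:e], axis=1)) for spk, s, e in turns]

    # Across-interlocutor changes: adjacent turns by different speakers
    # (e.g., wife-to-husband)
    across = [mb - ma for (sa, ma), (sb, mb) in zip(means, means[1:]) if sa != sb]

    # Within-interlocutor changes: consecutive turns by the same speaker
    # (e.g., wife-to-wife)
    within, last = [], {}
    for spk, m in means:
        if spk in last:
            within.append(m - last[spk])
        last[spk] = m

    # Statistics (here: mean and std) of the deltas in each group
    out = []
    for deltas in (within, across):
        d = np.asarray(deltas)
        out += [d.mean(axis=0), d.std(axis=0)]
    return np.concatenate(out)
```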
Pearson’s correlation coefficients of the top 5 features and their corresponding functionals (all correlations are statistically significant, i.e., p < 0.05).
| Rank | Feature | Category | Functional | Coefficient | p-value |
|---|---|---|---|---|---|
| 1 | MFCC | spectral | mean | −0.2997 | 0.0003 |
| 2 | Loudness | prosodic | std. dev. | 0.2983 | 0.0003 |
| 3 | MFB | spectral | median | 0.2859 | 0.0005 |
| 4 | Jitter | voice-quality | mean | −0.2791 | 0.0006 |
| 5 | Pitch delta | prosodic | mean | 0.2772 | 0.0008 |
Fig 3. Scatter plot of the two prosodic features (normalized) with the highest correlations: loudness (r = 0.2983) and pitch delta (r = 0.2772).
The corresponding static functionals are standard deviation and mean, respectively. Class 0 and class 1 represent no-recovery and recovery cases, respectively.
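Per-feature correlations like those in the table above can be reproduced with a standard Pearson test. The sketch below assumes a hypothetical sessions-by-functionals matrix `X`, outcome labels `y`, and a parallel list of functional `names`.

```python
from scipy.stats import pearsonr

def top_correlated(X, y, names, k=5):
    """Rank functionals by absolute Pearson correlation with the outcome."""
    results = []
    for j, name in enumerate(names):
        r, p = pearsonr(X[:, j], y)   # Pearson's r and its p-value
        results.append((abs(r), r, p, name))
    results.sort(reverse=True)        # strongest correlations first
    return [(name, round(r, 4), p) for _, r, p, name in results[:k]]
```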
Classification accuracy (mean and standard deviation, in %, over all cross-validation folds) for the different experiments (columns) and feature sets (rows).
| Feature set | Dim. | Expt. 1 mean | Expt. 1 s.d. | Expt. 2 mean | Expt. 2 s.d. | Expt. 3 mean | Expt. 3 s.d. |
|---|---|---|---|---|---|---|---|
| Chance | - | 51.8 | - | 47.2 | - | 48.2 | - |
| Behavioral codes | 264 | 75.6 | 13.5 | 65.4 | 14.7 | 61.8 | 11.2 |
| Static functionals | 3552 | 76.4 | 10.0 | 70.9 | 13.8 | 63.2 | 11.4 |
| Dynamic functionals | 6696 | 78.9 | 7.6 | 71.1 | 12.8 | 61.5 | 12.3 |
| Acoustic (all functionals) | 10248 | 79.3 | 10.2 | 72.6 | 13.0 | | 12.8 |
| All features | 9144 | | 7.4 | | 12.6 | 64.1 | 13.2 |
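The record does not name the classifier or the fold scheme behind these numbers; the sketch below uses a linear SVM with stratified 10-fold cross-validation as one plausible setup, reporting accuracy as mean/s.d. in percent to match the table's format.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cv_accuracy(X, y, n_splits=10, seed=0):
    """X: (n_sessions, n_features); y: outcome labels (binary or multiclass)."""
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    return 100 * scores.mean(), 100 * scores.std()
```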
F-scores (mean and standard deviation over all cross-validation folds) for the different experiments (columns) and feature sets (rows).
| Feature set | Expt. 1 mean | Expt. 1 s.d. | Expt. 2 mean | Expt. 2 s.d. | Expt. 3 mean | Expt. 3 s.d. |
|---|---|---|---|---|---|---|
| Behavioral codes | 0.68 | 0.12 | 0.49 | 0.11 | 0.48 | 0.11 |
| Static functionals | 0.56 | 0.10 | 0.60 | 0.07 | 0.52 | 0.09 |
| Dynamic functionals | 0.63 | 0.05 | 0.59 | 0.07 | 0.50 | 0.09 |
| Acoustic (all functionals) | 0.70 | 0.09 | 0.64 | 0.08 | | 0.11 |
| All features | | 0.07 | | 0.09 | 0.56 | 0.10 |
p-values of the statistical significance test against the null hypothesis that there is no difference in performance between the two feature sets compared.
Entries in bold indicate a statistically significant difference (p < 0.05).
| Comparison | Expt. 1 | Expt. 2 | Expt. 3 |
|---|---|---|---|
| Acoustic (all) | | | |
| Acoustic (all) | | | |
| All features | | | |
| All features | 0.079 | | |
95% confidence intervals of the test statistic for the significance test comparing different feature sets.
| Comparison | Expt. 1 | Expt. 2 | Expt. 3 |
|---|---|---|---|
| Acoustic (all) | (0.019, 0.243) | (0.284, 0.395) | (0.159, 0.271) |
| Acoustic (all) | (0.276, 0.294) | (0.221, 0.258) | (0.376, 0.457) |
| All features | (0.009, 0.133) | (0.156, 0.237) | (0.184, 0.208) |
| All features | (0.240, 0.303) | (0.298, 0.334) | (−0.029, 0.311) |
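The test statistic behind these intervals is not specified in this record. One common construction is a percentile bootstrap over per-fold score differences between two feature sets, sketched below; all names are hypothetical.

```python
import numpy as np

def bootstrap_diff_ci(scores_a, scores_b, n_boot=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(scores_a - scores_b).

    scores_a, scores_b: per-fold scores of the two feature sets compared.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    # Resample fold-level differences with replacement and average each draw
    boots = rng.choice(diffs, size=(n_boot, diffs.size), replace=True).mean(axis=1)
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```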