Maciej Dzieżyc, Martin Gjoreski, Przemysław Kazienko, Stanisław Saganowski, Matjaž Gams.
To further extend the applicability of wearable sensors in various domains such as mobile health systems and the automotive industry, new methods for accurately extracting subtle physiological information from these wearable sensors are required. However, the extraction of valuable information from physiological signals is still challenging-smartphones can count steps and compute heart rate, but they cannot recognize emotions and related affective states. This study analyzes the possibility of using end-to-end multimodal deep learning (DL) methods for affect recognition. Ten end-to-end DL architectures are compared on four different datasets with diverse raw physiological signals used for affect recognition, including emotional and stress states. The DL architectures specialized for time-series classification were enhanced to simultaneously facilitate learning from multiple sensors, each having their own sampling frequency. To enable fair comparison among the different DL architectures, Bayesian optimization was used for hyperparameter tuning. The experimental results showed that the performance of the models depends on the intensity of the physiological response induced by the affective stimuli, i.e., the DL models recognize stress induced by the Trier Social Stress Test more successfully than they recognize emotional changes induced by watching affective content, e.g., funny videos. Additionally, the results showed that the CNN-based architectures might be more suitable than LSTM-based architectures for affect recognition from physiological sensors.Entities:
Keywords: affect recognition; deep learning; emotion recognition; end-to-end machine learning; multimodal deep learning; personal sensors; physiological signals; stress detection; wearables
Year: 2020 PMID: 33207564 PMCID: PMC7697590 DOI: 10.3390/s20226535
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
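The abstract notes that Bayesian optimization was used to tune the hyperparameters of every architecture so the comparison stays fair. A minimal sketch of that idea, assuming scikit-optimize's Gaussian-process search (the paper does not name a library), with a hypothetical search space and a synthetic objective standing in for model training:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Hypothetical search space; the paper does not publish its exact ranges.
space = [
    Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
    Integer(32, 256, name="n_filters"),
]

def objective(params):
    learning_rate, n_filters = params
    # In the real workflow this would train a DL model and return a loss
    # such as 1 - validation F1-score; a synthetic bowl-shaped function
    # stands in here so the sketch stays runnable.
    return (learning_rate - 1e-3) ** 2 + ((n_filters - 128) / 256) ** 2

# Each call proposes new hyperparameters from a Gaussian-process surrogate.
result = gp_minimize(objective, space, n_calls=25, random_state=0)
print("best params:", result.x, "best loss:", result.fun)
```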
Figure 1. The comparison between the classical feature-based approach to multimodal physiological signal processing (red) and the end-to-end deep learning workflow (green), both applicable to the affect recognition problem.
Figure 2. The three best-performing end-to-end DL architectures: the spectro-temporal residual network (Stresnet), the fully convolutional neural network (FCN), and the residual network (Resnet). All architectures consist of layers stacked vertically as well as horizontal branches dedicated to each signal. The branches are finally concatenated (Concat) and fed into a dense layer (Dense) with softmax activation (SM). CL: convolutional layer, ReLU: rectifier layer, BN: batch normalization layer, GAP: global average pooling layer, Add: addition layer, Drop: dropout layer, AvgP: average pooling layer, Spect: spectrogram layer.
Table 1. Summary of the deep learning architectures used in the experiments with respect to different DL layers (N: number of signals, FC: fully connected (dense) layer, CL: convolutional layer, LSTM: Long Short-Term Memory, ResBlock: residual block with three CLs, Att: attention mechanism). FCN, Resnet, and Stresnet are shown in Figure 2 in greater depth.
| Architecture | Description |
|---|---|
| FCN | N x [CL - CL - CL] - FC |
| Resnet | N x [ResBlock - … - ResBlock] - FC |
| Stresnet | N x [ResBlock-Time + ResBlock-Freq] - FC |
| MLP | N x [FC - … - FC] - FC |
| Encoder | N x [CL - CL - CL - Att] - FC |
| Time-CNN | N x [CL - CL] - FC |
| CNN-LSTM | N x [CL - CL - LSTM] - FC |
| MLP-LSTM | N x [FC - … - FC - LSTM] - FC |
| MCDCNN | N x [CL - CL] - FC - FC |
| Inception | N x [InceptionBlock - … - InceptionBlock] - FC |
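A minimal sketch of the shared multi-branch pattern from Figure 2 and Table 1, assuming Keras; the filter counts, kernel sizes, modalities, and window lengths below are illustrative, not the paper's tuned values. Each modality gets its own FCN branch (three CL-BN-ReLU blocks followed by global average pooling), and the pooled branch outputs are concatenated before the final softmax dense layer:

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def fcn_branch(n_timesteps: int, n_channels: int, name: str):
    """One branch per modality: three CL-BN-ReLU blocks, then GAP."""
    inp = layers.Input(shape=(n_timesteps, n_channels), name=name)
    x = inp
    for filters, kernel in [(128, 8), (256, 5), (128, 3)]:
        x = layers.Conv1D(filters, kernel, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return inp, layers.GlobalAveragePooling1D()(x)

# Each branch keeps its modality's own window length in samples
# (seconds x sampling rate), so no common resampling grid is needed;
# e.g., for a 10 s window: ECG at 70 Hz -> 700 samples, EDA at 4 Hz -> 40.
modalities = {"ecg": (700, 1), "eda": (40, 1), "acc": (100, 3)}
inputs, pooled = [], []
for name, (steps, channels) in modalities.items():
    inp, out = fcn_branch(steps, channels, name)
    inputs.append(inp)
    pooled.append(out)

merged = layers.Concatenate()(pooled)                     # Concat in Figure 2
outputs = layers.Dense(4, activation="softmax")(merged)   # Dense + SM
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```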
Table 2. Sampling frequencies for each modality before (original sampling) and after downsampling. The total number of distinct signals is provided for each dataset. For example, the ACC (accelerometer) modality collected from Empatica and from RespiBAN in the WESAD dataset consists of three signals: the x, y, and z axes. Therefore, WESAD contains eight one-signal modalities and two modalities with three signals each.
| Dataset | Modality (Distinct Signals/DL Branches) | Original Sampling | Downsampled to |
|---|---|---|---|
| WESAD (14 signals) | ECG RespiBAN | 700 Hz | 70 Hz |
| | ACC RespiBAN (3) | 700 Hz | 10 Hz |
| | EMG RespiBAN | 700 Hz | 10 Hz |
| | EDA RespiBAN | 700 Hz | 3.5 Hz |
| | TEMP RespiBAN | 700 Hz | 3.5 Hz |
| | Respiration RespiBAN | 700 Hz | 3.5 Hz |
| | BVP Empatica | 64 Hz | 64 Hz |
| | ACC Empatica (3) | 32 Hz | 8 Hz |
| | EDA Empatica | 4 Hz | 4 Hz |
| | TEMP Empatica | 4 Hz | 4 Hz |
| AMIGOS (17 signals) | ECG (2) | 128 Hz | 64 Hz |
| | EEG (14) | 128 Hz | 64 Hz |
| | EDA | 128 Hz | 8 Hz |
| ASCERTAIN (33 signals) | ECG (2) | 128 Hz | 64 Hz |
| | EEG (8) | 32 Hz | 32 Hz |
| | EDA | 128 Hz | 8 Hz |
| | EMO (22) | 128 Hz | 4 Hz |
| DECAF (25 signals) | ECG | 1000 Hz | 64 Hz |
| | EMG | 1000 Hz | 64 Hz |
| | EOG | 1000 Hz | 64 Hz |
| | EMO (22) | 20 Hz | 4 Hz |
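A hedged sketch of the downsampling in Table 2, assuming SciPy's polyphase resampler (the paper does not state which tool it used); the rational up/down factors are derived from the source and target rates:

```python
from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly

def downsample(signal: np.ndarray, fs_in: float, fs_out: float) -> np.ndarray:
    """Resample a signal from fs_in to fs_out with anti-alias filtering."""
    ratio = Fraction(fs_out) / Fraction(fs_in)  # e.g., 3.5/700 -> 1/200
    return resample_poly(signal, up=ratio.numerator, down=ratio.denominator)

# Example: RespiBAN ECG from the WESAD row of Table 2, 700 Hz -> 70 Hz.
ecg = np.random.randn(700 * 60)        # one minute of raw ECG
ecg_70hz = downsample(ecg, 700, 70)    # 4200 samples
```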
Table 3. Thresholds for the discretization of Russell’s arousal-valence model. If the self-reported value was equal to or higher than the given threshold, it was considered high (see the last two columns); otherwise, it was low.
| Dataset | High Arousal | High Valence |
|---|---|---|
| ASCERTAIN | ≥3 | ≥0 |
| AMIGOS | ≥5 | ≥5 |
| DECAF | ≥2 | ≥0 |
| WESAD | – | – |
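The thresholds map directly to a quadrant-labeling rule. An illustrative implementation (the function and constant names are ours, not the paper's):

```python
# (high-arousal cutoff, high-valence cutoff) per dataset, from Table 3.
THRESHOLDS = {
    "ASCERTAIN": (3, 0),
    "AMIGOS": (5, 5),
    "DECAF": (2, 0),
}

def quadrant(dataset: str, arousal: float, valence: float) -> str:
    """Return LALV, LAHV, HALV, or HAHV; 'high' means value >= threshold."""
    a_thr, v_thr = THRESHOLDS[dataset]
    a = "HA" if arousal >= a_thr else "LA"
    v = "HV" if valence >= v_thr else "LV"
    return a + v

assert quadrant("AMIGOS", 6.0, 2.5) == "HALV"  # high arousal, low valence
```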
Table 4. Representation of classes in the datasets. For AMIGOS, DECAF, and ASCERTAIN, the discretization of self-reported levels of arousal and valence was done according to the rules and thresholds in Table 3. For WESAD, the classes were assigned based on the stimuli.
| Dataset | LALV | LAHV | HALV | HAHV | Total |
|---|---|---|---|---|---|
| AMIGOS | 80 (13%) | 155 (25%) | 194 (31%) | 200 (32%) | 629 |
| DECAF | 148 (6%) | 714 (31%) | 574 (25%) | 843 (37%) | 2279 |
| ASCERTAIN | 73 (4%) | 221 (11%) | 665 (34%) | 982 (51%) | 1941 |
| | | Amusement | Stress | Baseline | |
| WESAD | – | 186 (17%) | 332 (30%) | 587 (53%) | 1105 |
Table 5. Results for AMIGOS averaged over 5 iterations and 5 cross-validation folds, ordered by F1-score.
| Architecture | Accuracy | ± | F1-Score | ± | ROC AUC | ± |
|---|---|---|---|---|---|---|
| Random guess | 0.25 | – | 0.24 | – | – | – |
| Resnet | 0.31 | 0.04 | 0.23 | 0.04 | 0.55 | 0.02 |
| FCN | 0.31 | 0.04 | 0.22 | 0.04 | 0.55 | 0.03 |
| Stresnet | 0.27 | 0.04 | 0.22 | 0.04 | 0.52 | 0.03 |
| Encoder | 0.30 | 0.03 | 0.20 | 0.04 | 0.52 | 0.03 |
| Inception | 0.29 | 0.05 | 0.20 | 0.05 | 0.54 | 0.02 |
| MLP-LSTM | 0.29 | 0.02 | 0.16 | 0.02 | 0.50 | 0.03 |
| Time-CNN | 0.29 | 0.06 | 0.14 | 0.05 | 0.51 | 0.01 |
| Majority class | 0.35 | – | 0.13 | – | – | – |
| MLP | 0.31 | 0.03 | 0.13 | 0.03 | 0.50 | 0.01 |
| MCDCNN | 0.26 | 0.07 | 0.12 | 0.04 | 0.50 | 0.00 |
| CNN-LSTM | 0.13 | 0.00 | 0.06 | 0.00 | – | – |
Table 6. Results for DECAF averaged over 5 iterations and 5 cross-validation folds, ordered by F1-score.
| Architecture | Accuracy | ± | F1-Score | ± | ROC AUC | ± |
|---|---|---|---|---|---|---|
| FCN | 0.36 | 0.02 | 0.26 | 0.02 | 0.53 | 0.01 |
| Stresnet | 0.35 | 0.02 | 0.25 | 0.02 | 0.55 | 0.02 |
| Encoder | 0.37 | 0.02 | 0.25 | 0.02 | 0.54 | 0.02 |
| Inception | 0.35 | 0.02 | 0.25 | 0.02 | 0.53 | 0.02 |
| Resnet | 0.35 | 0.01 | 0.25 | 0.02 | 0.54 | 0.01 |
| MLP-LSTM | 0.37 | 0.02 | 0.24 | 0.03 | 0.55 | 0.01 |
| Random guess | 0.25 | – | 0.23 | – | – | – |
| CNN-LSTM | 0.37 | 0.01 | 0.23 | 0.02 | 0.54 | 0.01 |
| MCDCNN | 0.31 | 0.09 | 0.15 | 0.06 | 0.52 | 0.02 |
| Majority class | 0.38 | – | 0.14 | – | – | – |
| MLP | 0.34 | 0.04 | 0.13 | 0.01 | 0.50 | 0.00 |
| Time-CNN | 0.29 | 0.09 | 0.13 | 0.04 | 0.50 | 0.01 |
Table 7. Results for ASCERTAIN averaged over 5 iterations and 5 cross-validation folds, ordered by F1-score.
| Architecture | Accuracy | ± | F1-Score | ± | ROC AUC | ± |
|---|---|---|---|---|---|---|
| Inception | 0.47 | 0.02 | 0.24 | 0.01 | 0.51 | 0.02 |
| Resnet | 0.46 | 0.03 | 0.24 | 0.02 | 0.52 | 0.01 |
| FCN | 0.48 | 0.01 | 0.22 | 0.01 | 0.52 | 0.01 |
| Stresnet | 0.47 | 0.02 | 0.22 | 0.02 | 0.52 | 0.02 |
| Encoder | 0.50 | 0.01 | 0.22 | 0.02 | 0.52 | 0.02 |
| Random guess | 0.25 | – | 0.21 | – | – | – |
| MLP-LSTM | 0.50 | 0.01 | 0.20 | 0.02 | 0.52 | 0.01 |
| Majority class | 0.50 | – | 0.17 | – | – | – |
| Time-CNN | 0.42 | 0.11 | 0.16 | 0.03 | 0.50 | 0.01 |
| MLP | 0.46 | 0.06 | 0.16 | 0.01 | 0.50 | 0.00 |
| MCDCNN | 0.42 | 0.13 | 0.16 | 0.03 | 0.50 | 0.01 |
| CNN-LSTM | 0.04 | 0.00 | 0.02 | 0.00 | – | – |
Table 8. Results for WESAD averaged over 5 iterations and 5 cross-validation folds, ordered by F1-score.
| Architecture | Accuracy | ± | F1-Score | ± | ROC AUC | ± |
|---|---|---|---|---|---|---|
| FCN | 0.79 | 0.05 | 0.73 | 0.07 | 0.91 | 0.02 |
| Resnet | 0.74 | 0.07 | 0.69 | 0.07 | 0.89 | 0.04 |
| Time-CNN | 0.75 | 0.03 | 0.66 | 0.05 | 0.86 | 0.02 |
| MCDCNN | 0.74 | 0.04 | 0.62 | 0.05 | 0.84 | 0.03 |
| Stresnet | 0.69 | 0.11 | 0.62 | 0.10 | 0.82 | 0.05 |
| MLP-LSTM | 0.73 | 0.01 | 0.61 | 0.03 | 0.82 | 0.01 |
| Inception | 0.71 | 0.06 | 0.58 | 0.07 | 0.81 | 0.07 |
| Encoder | 0.71 | 0.03 | 0.57 | 0.05 | 0.83 | 0.02 |
| MLP | 0.72 | 0.01 | 0.57 | 0.01 | 0.78 | 0.02 |
| CNN-LSTM | 0.70 | 0.02 | 0.56 | 0.03 | 0.79 | 0.02 |
| Random guess | 0.33 | – | 0.32 | – | – | – |
| Majority class | 0.53 | – | 0.23 | – | – | – |
Table 9. F1-score, precision, recall, and support (number of samples) of the models with the highest F1-score, averaged over all iterations and folds for each class and dataset separately. The F1-scores in Table 5, Table 6, Table 7, and Table 8 are the arithmetic means of the F1-scores calculated for each class in a given dataset. Please note that the F1-score in this table is an arithmetic mean (over folds) of harmonic means of precisions and recalls; hence, e.g., for the baseline class in WESAD, the F1-score is not simply the harmonic mean of the precision and recall shown. The support is also averaged over all folds; multiplying it by 5 yields the values from Table 4.
| Dataset | Class | F1 | Precision | Recall | Support |
|---|---|---|---|---|---|
| WESAD | Baseline | 0.80 | 0.87 | 0.79 | 117.4 |
| WESAD | Stress | 0.92 | 0.92 | 0.93 | 66.4 |
| WESAD | Amusement | 0.48 | 0.55 | 0.54 | 37.2 |
| DECAF | LALV | 0.00 | 0.00 | 0.00 | 29.6 |
| DECAF | LAHV | 0.33 | 0.35 | 0.34 | 142.8 |
| DECAF | HALV | 0.24 | 0.33 | 0.20 | 114.8 |
| DECAF | HAHV | 0.45 | 0.40 | 0.56 | 168.6 |
| ASCERTAIN | LALV | 0.01 | 0.02 | 0.00 | 14.6 |
| ASCERTAIN | LAHV | 0.03 | 0.08 | 0.02 | 44.2 |
| ASCERTAIN | HALV | 0.32 | 0.39 | 0.30 | 133.0 |
| ASCERTAIN | HAHV | 0.60 | 0.52 | 0.74 | 196.4 |
| AMIGOS | LALV | 0.01 | 0.01 | 0.02 | 16.0 |
| AMIGOS | LAHV | 0.26 | 0.27 | 0.31 | 31.0 |
| AMIGOS | HALV | 0.30 | 0.32 | 0.37 | 38.8 |
| AMIGOS | HAHV | 0.33 | 0.35 | 0.39 | 40.0 |
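To make the averaging in the caption explicit, a small runnable example (scikit-learn assumed, toy labels): macro-F1 is the arithmetic mean of the per-class F1-scores, each of which is the harmonic mean of that class's precision and recall, so it cannot be recovered from averaged precision and recall alone:

```python
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Toy predictions for the three WESAD classes.
y_true = ["base", "base", "stress", "amuse", "stress", "base"]
y_pred = ["base", "stress", "stress", "base", "stress", "base"]

prec, rec, f1_per_class, support = precision_recall_fscore_support(
    y_true, y_pred, labels=["amuse", "base", "stress"], zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f1_per_class)   # per-class harmonic means of precision and recall
print(macro_f1)       # arithmetic mean of the values above
assert abs(macro_f1 - f1_per_class.mean()) < 1e-9
```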
Table 10. Per-dataset ranks and the average rank for each architecture. Although Friedman’s rank test indicated statistically significant differences among the architectures, the post-hoc Wilcoxon-Holm method did not reveal any statistically significant difference between any pair of architectures.
| Architecture | AMIGOS | ASCERTAIN | DECAF | WESAD | Average |
|---|---|---|---|---|---|
| FCN | 2 | 3 | 1 | 1 | 1.75 |
| Resnet | 1 | 2 | 5 | 2 | 2.5 |
| Stresnet | 3 | 4 | 2 | 5 | 3.5 |
| Inception | 5 | 1 | 4 | 7 | 4.25 |
| Encoder | 4 | 5 | 3 | 8 | 5 |
| MLP-LSTM | 6 | 6 | 6 | 6 | 6 |
| Time-CNN | 7 | 7 | 10 | 3 | 6.75 |
| MCDCNN | 9 | 9 | 8 | 4 | 7.5 |
| MLP | 8 | 8 | 9 | 9 | 8.5 |
| CNN-LSTM | 10 | 10 | 7 | 10 | 9.25 |
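A hedged sketch of the statistical procedure behind this table, assuming SciPy and statsmodels (the paper does not name its tooling): Friedman's rank test over the per-dataset scores, followed by pairwise Wilcoxon signed-rank tests with Holm correction. The scores below are the F1 values from Table 5, Table 6, Table 7, and Table 8 for four of the ten architectures:

```python
from itertools import combinations

from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# F1-scores per architecture on AMIGOS, ASCERTAIN, DECAF, WESAD.
scores = {
    "FCN":      [0.22, 0.22, 0.26, 0.73],
    "Resnet":   [0.23, 0.24, 0.25, 0.69],
    "Stresnet": [0.22, 0.22, 0.25, 0.62],
    "MLP":      [0.13, 0.16, 0.13, 0.57],
}

# Omnibus test: do the architectures' ranks differ across datasets?
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman: chi2={stat:.2f}, p={p:.3f}")

# Post hoc: pairwise Wilcoxon tests, Holm-corrected for multiplicity.
pairs = list(combinations(scores, 2))
raw_p = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
reject, holm_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for (a, b), pv, rej in zip(pairs, holm_p, reject):
    print(f"{a} vs {b}: Holm-adjusted p={pv:.3f}, significant={rej}")
```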