Jing Zhang1, Hang Yin1, Jiayu Zhang1, Gang Yang1, Jing Qin2, Ling He1.
Abstract
Mental stress is increasingly widespread and severe in modern society, threatening people's physical and mental health. To limit its adverse effects, stress must be detected in time. Many studies have demonstrated that objective indicators are effective for detecting stress, and over the past few years a growing number of researchers have applied deep learning to the task. However, these works usually rely on a single modality and rarely combine stress-related information from multiple modalities. In this paper, a real-time deep learning framework is proposed that fuses ECG, voice, and facial expressions for acute stress detection. The framework extracts stress-related information from each input using ResNet50 and I3D with a temporal attention module (TAM), where the TAM highlights the distinguishing temporal representations of facial expressions related to stress. A matrix eigenvector-based approach is then used to fuse the stress-related information from the modalities. To validate the framework, a well-established psychological experiment, the Montreal imaging stress task (MIST), was used in this work. We collected multimodality data from 20 participants during the MIST. The results demonstrate that the framework can combine stress-related information from multiple modalities to achieve 85.1% accuracy in distinguishing acute stress. It can serve as a tool for computer-aided stress detection.
Keywords: deep learning; matrix eigenvector; multimodality fusion; objective indicators; stress detection
Year: 2022 PMID: 35992909 PMCID: PMC9389269 DOI: 10.3389/fnins.2022.947168
Source DB: PubMed Journal: Front Neurosci ISSN: 1662-453X Impact factor: 5.152
FIGURE 1. The Montreal imaging stress task process.
FIGURE 2. Data collection platform.
FIGURE 3. Multimodality stress detection.
FIGURE 4. Electrocardiogram preprocessing.
FIGURE 5. 2D convolution strategy for electrocardiogram and Mel spectrogram input in this work.
FIGURE 6. The overall architecture of the ResNet50.
FIGURE 7. 3D convolution strategy for facial expressions input in this work.
FIGURE 8. The overall architecture of the I3D with temporal attention module.
FIGURE 9. The fusion method based on matrix eigenvector.
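This record names the matrix eigenvector-based fusion but does not give its formula. As a hedged illustration of how a matrix eigenvector can yield fusion weights, the sketch below builds a hypothetical agreement matrix between the per-modality stress probabilities and uses its principal eigenvector as modality weights. The function name `eigenvector_fusion`, the agreement measure, and the sample probabilities are assumptions made for illustration, not the authors' method.

```python
import numpy as np

def eigenvector_fusion(probs):
    """Fuse per-modality stress probabilities (illustrative assumption only).

    probs: array-like of shape (n_modalities, n_samples), values in [0, 1].
    Returns (weights, fused), where weights are non-negative and sum to 1.
    """
    probs = np.asarray(probs, dtype=float)
    # Hypothetical agreement matrix: 1 minus the mean absolute difference
    # between the outputs of each pair of modalities (symmetric, ones on diagonal).
    diff = np.abs(probs[:, None, :] - probs[None, :, :]).mean(axis=2)
    agreement = 1.0 - diff
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order,
    # so the last column is the eigenvector of the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(agreement)
    principal = np.abs(eigvecs[:, -1])
    weights = principal / principal.sum()
    fused = weights @ probs  # weighted average of the modality outputs
    return weights, fused

# Made-up probabilities for three modalities (rows) and three samples (columns).
probs = [[0.9, 0.2, 0.7],
         [0.8, 0.1, 0.6],
         [0.4, 0.6, 0.5]]
weights, fused = eigenvector_fusion(probs)
print(weights, fused)
```

Under this assumed scheme, a modality that disagrees with the others (the third row here) receives a smaller weight than the two that agree with each other.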
Stress detection accuracy, precision, recall and F1-score using single- and multimodality data.
| Modal | Accuracy | Precision | Recall | F1-Score |
| ECG | 0.741 | 0.737 | 0.743 | 0.731 |
| Voice | 0.830 | 0.825 | 0.829 | 0.827 |
| Facial expressions | 0.792 | 0.795 | 0.803 | 0.799 |
The bold values are the result of the multimodality stress detection.
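The four metrics in the table above follow standard definitions. A minimal sketch, assuming "stress" is the positive class and using hypothetical counts (this record does not say whether the reported values are per-class or averaged over classes):

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall, f1_from(precision, recall)

# Hypothetical counts, for illustration only (not taken from the study).
accuracy, precision, recall, f1 = binary_metrics(tp=40, fp=10, fn=8, tn=42)

# Consistency check on the reported Voice row: precision 0.825, recall 0.829.
voice_f1 = round(f1_from(0.825, 0.829), 3)
print(voice_f1)
```

The harmonic mean of the Voice row's precision and recall rounds to the F1-score reported for that row.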
Stress detection confusion matrix of the single-modal and multimodality methods.
| Actual labels | ECG: predicted Calm | ECG: predicted Stress | Voice: predicted Calm | Voice: predicted Stress | Facial expressions: predicted Calm | Facial expressions: predicted Stress | Fusion: predicted Calm | Fusion: predicted Stress |
| Calm | 0.751 | 0.266 | 0.821 | 0.164 | 0.868 | 0.262 | 0.913 | 0.193 |
| Stress | 0.249 | 0.734 | 0.179 | 0.836 | 0.132 | 0.738 | 0.087 | 0.807 |
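In the table above, the two entries in each column of a method's 2x2 block sum to 1, which is what dividing raw counts by their column totals would produce. The sketch below shows that normalization on hypothetical raw counts and checks the column sums of the reported Fusion block; the function name `normalize_columns` and the raw counts are assumptions.

```python
import numpy as np

def normalize_columns(counts):
    """Divide each column of a count matrix by its column total."""
    counts = np.asarray(counts, dtype=float)
    col_sums = counts.sum(axis=0, keepdims=True)
    # Columns with a zero total are left as zeros instead of dividing by zero.
    return np.divide(counts, col_sums,
                     out=np.zeros_like(counts), where=col_sums != 0)

# Hypothetical raw counts, for illustration only.
raw = [[75, 27],
       [25, 73]]
norm = normalize_columns(raw)

# Fusion block as reported in the table above.
fusion_block = np.array([[0.913, 0.193],
                         [0.087, 0.807]])
fusion_col_sums = fusion_block.sum(axis=0)
print(norm, fusion_col_sums)
```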
Stress detection confusion matrix of I3D without temporal attention module.
| Actual labels | Predicted: Calm | Predicted: Stress |
| Calm | 0.824 | 0.261 |
| Stress | 0.176 | 0.739 |
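For context on what a temporal attention module can add to a video backbone, here is a minimal sketch of generic softmax attention pooling over time steps. The TAM architecture is not specified in this record, so the function name `temporal_attention_pool`, the scoring vector `w`, and the random features are all assumptions, not the authors' design.

```python
import numpy as np

def temporal_attention_pool(features, w):
    """Weight per-time-step features by softmax attention scores.

    features: (T, D) array, one D-dimensional feature per time step.
    w: (D,) hypothetical scoring vector (it would be learned in a real model).
    Returns (alpha, pooled): attention weights (T,) and pooled feature (D,).
    """
    features = np.asarray(features, dtype=float)
    scores = features @ np.asarray(w, dtype=float)  # one score per time step
    scores = scores - scores.max()                  # for numerical stability
    exp_scores = np.exp(scores)
    alpha = exp_scores / exp_scores.sum()           # softmax over time
    pooled = alpha @ features                       # attention-weighted average
    return alpha, pooled

# Random stand-in features: 8 time steps, 4 feature dimensions.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))
w = rng.normal(size=4)
alpha, pooled = temporal_attention_pool(feats, w)
print(alpha.sum(), pooled.shape)
```

Time steps with higher scores contribute more to the pooled feature, which is the general sense in which attention can highlight distinguishing temporal segments.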
FIGURE 10. Performance comparison of I3D without temporal attention module (TAM) and I3D with TAM.
FIGURE 11. Time assessment for real-time applications.
Stress detection accuracy, precision, recall and F1-score of several widely used convolutional neural networks.
| Modal | Model | Accuracy | Precision | Recall | F1-score |
| ECG | ResNet101 [98] | 0.706 | 0.712 | 0.717 | 0.714 |
| ECG | GoogLeNet [97] | 0.723 | 0.735 | 0.739 | 0.737 |
| ECG | EfficientNet [99] | 0.769 | 0.773 | 0.781 | 0.777 |
| ECG | ResNet50 [98] | 0.741 | 0.737 | 0.743 | 0.740 |
| Voice | ResNet101 [98] | 0.802 | 0.797 | 0.801 | 0.799 |
| Voice | GoogLeNet [97] | 0.763 | 0.756 | 0.752 | 0.754 |
| Voice | EfficientNet [99] | 0.809 | 0.804 | 0.812 | 0.808 |
| Voice | ResNet50 [98] | 0.830 | 0.825 | 0.829 | 0.827 |
| Facial expressions | C3D [100] | 0.582 | 0.582 | 0.500 | 0.538 |
| Facial expressions | I3D [101] | 0.775 | 0.775 | 0.782 | 0.778 |
| Facial expressions | I3D with TAM | 0.792 | 0.795 | 0.803 | 0.799 |
Stress detection accuracy, precision, recall and F1-score by multimodality using different convolutional neural networks.
| Fusion | Accuracy | Precision | Recall | F1-score |
| EfficientNet + I3D with TAM | 0.839 | 0.850 | 0.858 | 0.854 |
| ResNet50 + I3D with TAM | 0.851 | 0.857 | 0.866 | 0.861 |