Kwan Woo Lee, Hyo Sik Yoon, Jong Min Song, Kang Ryoung Park.
Abstract
Because aggressive driving often causes large-scale loss of life and property, techniques for the advance detection of adverse driver emotional states have become important for preventing aggressive driving behaviors. Previous studies have primarily focused on detecting aggressive driving via smartphone accelerometers and gyro-sensors, or on measuring physiological signals using electroencephalography (EEG) or electrocardiogram (ECG) sensors. Because EEG and ECG sensors cause discomfort to drivers and can become detached from the driver's body, it is difficult to rely on bio-signals to determine the driver's emotional state. Gyro-sensors and accelerometers depend on the performance of GPS receivers and cannot be used in areas where GPS signals are blocked. Moreover, when driving on a mountain road with many sharp turns, a driver can easily be misrecognized as an aggressive driver. To resolve these problems, we propose a convolutional neural network (CNN)-based method of detecting emotion to identify aggressive driving using input images of the driver's face, obtained using near-infrared (NIR) light and thermal camera sensors. In experiments on our own database, the proposed method achieves high classification accuracy in detecting driver emotion leading to either aggressive or smooth (i.e., relaxed) driving, and it demonstrates better performance than existing methods.
Keywords: aggressive driving emotion; convolutional neural network; near-infrared light camera sensor; thermal camera sensor
Year: 2018 PMID: 29570678 PMCID: PMC5948584 DOI: 10.3390/s18040957
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Comparison of proposed and previous works.
| Category | Methods | Advantage | Disadvantage |
|---|---|---|---|
| Gyro-sensor and accelerometer-based method | Aggressive driving detection [ | An accelerometer and a gyro-sensor in a smartphone are used. Accordingly, no device needs to be purchased or installed. Highly portable. | Depending on the performance of a GPS receiver, data values can be inaccurate. This method is not applicable in GPS-unavailable areas. |
| Voice-based method | Detection of a driver’s emotion based on voice or car–voice interaction [ | Data acquisition using an inexpensive sensor. | Surrounding noise influences performance. |
| Bio-signal-based method | Various bio-signals, including ECG, EEG, and pulse, are measured to recognize a driver’s emotion or fatigue [ | Bio-signals that cannot be detected by the naked eye are detected at high speed. | Sensors can be detached by a driver’s active motion. |
| Camera-based method, single camera (visible light camera) | Yawning detection [ | An inexpensive camera is used. | Physical characteristics that cannot be observed by the naked eye are not detected. |
| Camera-based method, single camera (thermal camera) | Driver’s emotion recognition [ | Night photography is possible without a separate light source. | The camera is expensive compared to visible light or NIR cameras. |
| Camera-based method, multiple cameras (dual NIR cameras) | Percentage of eye closure over time (PERCLOS)- and average eye-closure speed (AECS)-based detection of driver fatigue [ | In cases where more than two cameras are used, driver fatigue is detected over a wide range. | Physical characteristics that cannot be observed by the naked eye are not detected. |
| Camera-based method, multiple cameras (NIR and thermal cameras) | Convolutional neural network (CNN)-based detection of aggressive driving emotion (proposed method) | Thermal cameras can measure temperature changes in a driver’s body, which cannot be checked by the naked eye, whereas an NIR camera can detect facial feature points and measure their changes. | The use of two cameras increases algorithm complexity and processing time. |
Figure 1. Flowchart of proposed method.
Figure 2. Experimental setup for classifying aggressive and smooth driving emotion using multimodal cameras.
Figure 3. Examples of images captured by NIR (left images) and thermal (right images) cameras from (a) person 1 and (b) person 2.
Figure 4. Examples of (a) detected facial feature points; and (b) the index numbers of facial feature points.
Figure 5. ROI regions in (a) NIR image; and (b) thermal image.
CNN architecture used in our research (Conv, ReLU, and Pool denote convolutional layer, rectified linear unit, and max pooling layer, respectively; FCL denotes fully connected layer).
| Group | Layer Type | Number of Filters | Size of Feature Map | Size of Kernel | Stride | Padding |
|---|---|---|---|---|---|---|
| | Image input layer | | 224 (height) × 224 | | | |
| Group 1 | Conv1_1 (1st convolutional layer) | 64 | 224 × 224 × 64 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 1 | ReLU1_1 | | 224 × 224 × 64 | | | |
| Group 1 | Conv1_2 (2nd convolutional layer) | 64 | 224 × 224 × 64 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 1 | ReLU1_2 | | 224 × 224 × 64 | | | |
| Group 1 | Pool1 | 1 | 112 × 112 × 64 | 2 × 2 | 2 × 2 | 0 × 0 |
| Group 2 | Conv2_1 (3rd convolutional layer) | 128 | 112 × 112 × 128 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 2 | ReLU2_1 | | 112 × 112 × 128 | | | |
| Group 2 | Conv2_2 (4th convolutional layer) | 128 | 112 × 112 × 128 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 2 | ReLU2_2 | | 112 × 112 × 128 | | | |
| Group 2 | Pool2 | 1 | 56 × 56 × 128 | 2 × 2 | 2 × 2 | 0 × 0 |
| Group 3 | Conv3_1 (5th convolutional layer) | 256 | 56 × 56 × 256 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 3 | ReLU3_1 | | 56 × 56 × 256 | | | |
| Group 3 | Conv3_2 (6th convolutional layer) | 256 | 56 × 56 × 256 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 3 | ReLU3_2 | | 56 × 56 × 256 | | | |
| Group 3 | Conv3_3 (7th convolutional layer) | 256 | 56 × 56 × 256 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 3 | ReLU3_3 | | 56 × 56 × 256 | | | |
| Group 3 | Pool3 | 1 | 28 × 28 × 256 | 2 × 2 | 2 × 2 | 0 × 0 |
| Group 4 | Conv4_1 (8th convolutional layer) | 512 | 28 × 28 × 512 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 4 | ReLU4_1 | | 28 × 28 × 512 | | | |
| Group 4 | Conv4_2 (9th convolutional layer) | 512 | 28 × 28 × 512 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 4 | ReLU4_2 | | 28 × 28 × 512 | | | |
| Group 4 | Conv4_3 (10th convolutional layer) | 512 | 28 × 28 × 512 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 4 | ReLU4_3 | | 28 × 28 × 512 | | | |
| Group 4 | Pool4 | 1 | 14 × 14 × 512 | 2 × 2 | 2 × 2 | 0 × 0 |
| Group 5 | Conv5_1 (11th convolutional layer) | 512 | 14 × 14 × 512 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 5 | ReLU5_1 | | 14 × 14 × 512 | | | |
| Group 5 | Conv5_2 (12th convolutional layer) | 512 | 14 × 14 × 512 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 5 | ReLU5_2 | | 14 × 14 × 512 | | | |
| Group 5 | Conv5_3 (13th convolutional layer) | 512 | 14 × 14 × 512 | 3 × 3 | 1 × 1 | 1 × 1 |
| Group 5 | ReLU5_3 | | 14 × 14 × 512 | | | |
| Group 5 | Pool5 | 1 | 7 × 7 × 512 | 2 × 2 | 2 × 2 | 0 × 0 |
| | Fc6 (1st FCL) | | 4096 × 1 | | | |
| | ReLU6 | | 4096 × 1 | | | |
| | Dropout6 | | 4096 × 1 | | | |
| | Fc7 (2nd FCL) | | 4096 × 1 | | | |
| | ReLU7 | | 4096 × 1 | | | |
| | Dropout7 | | 4096 × 1 | | | |
| | Fc8 (3rd FCL) | | 2 × 1 | | | |
| | Softmax layer | | 2 × 1 | | | |
| | Output layer | | 2 × 1 | | | |
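The feature-map sizes in the architecture table follow from the standard output-size relation for convolution and pooling layers, out = floor((in + 2 × padding − kernel) / stride) + 1. The following is a minimal sketch that traces those sizes through Groups 1 to 5 using the kernel, stride, and padding values listed in the table. The names `conv_out`, `GROUPS`, and `trace_shapes` are illustrative and not taken from the paper.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# (number of 3x3 convolutional layers, number of filters) for Groups 1-5
GROUPS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def trace_shapes(input_size=224):
    """Return (layer name, height, width, channels) for every Conv and Pool layer."""
    shapes = []
    size = input_size
    for group_idx, (n_convs, n_filters) in enumerate(GROUPS, start=1):
        for conv_idx in range(1, n_convs + 1):
            # 3x3 kernel, stride 1, padding 1: spatial size is preserved
            size = conv_out(size, kernel=3, stride=1, padding=1)
            shapes.append((f"Conv{group_idx}_{conv_idx}", size, size, n_filters))
        # 2x2 max pooling, stride 2, no padding: spatial size is halved
        size = conv_out(size, kernel=2, stride=2, padding=0)
        shapes.append((f"Pool{group_idx}", size, size, n_filters))
    return shapes

shapes = trace_shapes()
for name, h, w, c in shapes:
    print(f"{name}: {h} x {w} x {c}")

# Number of values flattened from Pool5 and passed to Fc6
flattened = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
print("Inputs to Fc6:", flattened)
```

ReLU layers are omitted from the trace because they do not change the feature-map size.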
Figure 6. Experimental procedure. Smooth and aggressive driving images refer to images acquired while the participant is operating the smooth and aggressive driving simulators, respectively.
Image database.
| | NIR Images: Smooth Driving | NIR Images: Aggressive Driving | Thermal Images: Smooth Driving | Thermal Images: Aggressive Driving |
|---|---|---|---|---|
| Number of images | 29,463 | 29,463 | 29,463 | 29,463 |
p-Value, Cohen’s d value, and effect size of five features between smooth and aggressive driving.
| p-value | 0.2582 | 0.1441 | 0.7308 |
| Cohen’s d | 0.4233 | 0.5487 | 0.1325 |
| Effect size | Medium | Medium | Small |
| p-value | 0.9490 | 0.6715 |
| Cohen’s d | 0.0236 | 0.1565 |
| Effect size | Small | Small |
Figure 7. Graphs of mean and standard deviation for five features between smooth and aggressive driving: (a) Euclidean distance change between left and right lip corners (DLRL); (b) Euclidean distance change between upper and lower lips (DULL); (c) facial temperature-based heart rate (HR); (d) eye-blinking rate (EBR); (e) eyebrow movement (EM).
p-Value, Cohen’s d value and effect size for pixel values of NIR and thermal ROIs between smooth and aggressive driving.
| p-value | 0.0046 | 0.0123 | 0.0021 |
| Cohen’s d | 1.1234 | 0.9842 | 1.2355 |
| Effect size | Large | Large | Large |
| p-value | 0.0139 | 0.0450 | 0.0476 |
| Cohen’s d | 0.9770 | 0.7662 | 0.7565 |
| Effect size | Large | Large | Large |
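As a rough illustration of how such effect sizes can be obtained, the sketch below computes Cohen's d with a pooled standard deviation and assigns a label by the nearest conventional anchor (0.2 small, 0.5 medium, 0.8 large). Both choices are assumptions: the record does not state which variant of Cohen's d or which labeling rule the authors used, although the nearest-anchor rule does agree with the labels reported in the two effect-size tables. The function names and the sample values are hypothetical.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return abs(mean(a) - mean(b)) / math.sqrt(pooled_var)

# Conventional anchors for Cohen's d (assumed labeling rule: nearest anchor)
ANCHORS = {"Small": 0.2, "Medium": 0.5, "Large": 0.8}

def effect_size_label(d):
    """Return the label whose anchor value is closest to d."""
    return min(ANCHORS, key=lambda name: abs(ANCHORS[name] - d))

# Hypothetical feature values for smooth and aggressive driving
smooth = [1.0, 2.0, 3.0]
aggressive = [2.0, 3.0, 4.0]
d = cohens_d(smooth, aggressive)
print(f"d = {d:.4f} ({effect_size_label(d)})")
```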
Figure 8. Graphs of means and standard deviations for pixel values of NIR and thermal ROIs between smooth and aggressive driving. ROIs of (a) left eye; (b) right eye; (c) mouth; (d) middle of forehead; (e) left cheek; and (f) right cheek.
Figure 9. Accuracy and loss during CNN training in two-fold cross validation: (a,c) accuracy and loss for the NIR image training and validation datasets, respectively; and (b,d) accuracy and loss for the thermal image training and validation datasets, respectively. In (a–d), “loss 1” and “accuracy 1” are from the first fold, and “loss 2” and “accuracy 2” are from the second fold.
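For readers unfamiliar with the protocol, a k-fold split (k = 2 here) can be sketched as follows. This is a generic sample-level split under assumed settings. The record does not say how the authors partitioned their images (for example, by participant), so the function name `k_fold_indices`, the shuffling, and the seed are illustrative only.

```python
import random

def k_fold_indices(n_samples, k=2, seed=0):
    """Split sample indices into k folds; each fold serves once as the validation set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at position i
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits

splits = k_fold_indices(10, k=2)
for fold_no, (train, val) in enumerate(splits, start=1):
    print(f"fold {fold_no}: {len(train)} training samples, {len(val)} validation samples")
```

With face images, splitting by participant rather than by image avoids placing the same person in both the training and validation sets; the sketch above does not enforce that.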
CNN models for comparison. ConvN is the filter of N × N size (e.g., Conv3 represents a 3 × 3 filter).
| Net Configuration | VGG face-16 (Fine Tuning) (Proposed Method) | AlexNet |
|---|---|---|
| # of layers | 16 | 8 |
| Filter size (# of filters) | Conv3 (64) | Conv11 (96) |
| Pooling type | MAX | MAX |
| Filter size (# of filters) | Conv3 (128) | Conv5 (256) |
| Pooling type | MAX | MAX |
| Filter size (# of filters) | Conv3 (256) | Conv3 (384) |
| Pooling type | MAX | - |
| Filter size (# of filters) | Conv3 (512) | Conv3 (384) |
| Pooling type | MAX | - |
| Filter size (# of filters) | Conv3 (512) | Conv3 (256) |
| Pooling type | MAX | MAX |
| Fc6 (1st FCL) | 409 | 409 |
Classification accuracies by proposed VGG face-16 model (%).
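| Input | Actual | First fold: predicted Aggressive | First fold: predicted Smooth | Second fold: predicted Aggressive | Second fold: predicted Smooth | Average: predicted Aggressive | Average: predicted Smooth |
|---|---|---|---|---|---|---|---|
| NIR | Aggressive | 95.913 | 4.087 | 95.941 | 4.059 | 95.927 | 4.073 |
| NIR | Smooth | 4.06 | 95.94 | 4.057 | 95.943 | 4.0585 | 95.9415 |
| Thermal | Aggressive | 95.859 | 4.141 | 94.773 | 5.227 | 95.316 | 4.684 |
| Thermal | Smooth | 5.143 | 94.857 | 5.217 | 94.783 | 5.18 | 94.82 |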
Classification accuracies by AlexNet model (unit: %).
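| Input | Actual | First fold: predicted Aggressive | First fold: predicted Smooth | Second fold: predicted Aggressive | Second fold: predicted Smooth | Average: predicted Aggressive | Average: predicted Smooth |
|---|---|---|---|---|---|---|---|
| NIR | Aggressive | 94.885 | 5.115 | 94.931 | 5.069 | 94.908 | 5.092 |
| NIR | Smooth | 5.057 | 94.943 | 5.080 | 94.920 | 5.0685 | 94.9315 |
| Thermal | Aggressive | 94.076 | 5.924 | 94.008 | 5.992 | 94.042 | 5.958 |
| Thermal | Smooth | 5.964 | 94.036 | 5.884 | 94.116 | 5.924 | 94.076 |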
Classification accuracies by score-level fusion, based on weighted SUM rule of the proposed method (%).
| Actual | First fold: predicted Aggressive | First fold: predicted Smooth | Second fold: predicted Aggressive | Second fold: predicted Smooth | Average: predicted Aggressive | Average: predicted Smooth |
|---|---|---|---|---|---|---|
| Aggressive | 99.955 | 0.045 | 99.972 | 0.028 | 99.9635 | 0.0365 |
| Smooth | 0.053 | 99.947 | 0.027 | 99.973 | 0.04 | 99.96 |
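Score-level fusion by a weighted SUM rule combines the output scores of the two CNNs (one for NIR images, one for thermal images) into a single score before the final decision. A minimal sketch is given below. The weight, the decision threshold, the example scores, and the function names are all hypothetical, since the values the authors used are not given in this record.

```python
def weighted_sum_fusion(nir_score, thermal_score, nir_weight=0.5):
    """Fuse two scores with a weighted SUM rule; the weights sum to 1."""
    if not 0.0 <= nir_weight <= 1.0:
        raise ValueError("nir_weight must lie in [0, 1]")
    return nir_weight * nir_score + (1.0 - nir_weight) * thermal_score

def classify(nir_score, thermal_score, nir_weight=0.5, threshold=0.5):
    """Label a sample from the fused score (scores assumed to be P(aggressive))."""
    fused = weighted_sum_fusion(nir_score, thermal_score, nir_weight)
    return "aggressive" if fused >= threshold else "smooth"

# Hypothetical softmax outputs of the NIR and thermal CNNs for one sample
print(classify(nir_score=0.9, thermal_score=0.3))
print(classify(nir_score=0.2, thermal_score=0.4))
```

In practice the weight would be chosen on training data; a weighted PRODUCT rule replaces the sum with a product of weighted scores.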
Figure 10. Comparisons of ROC curves of proposed and previous methods. “VGG” represents the VGG-face 16 (i.e., fine tuning).
Comparisons of positive predictive value (PPV), true positive rate (TPR), accuracy (ACC), and F_score of the proposed and previous methods. “VGG” represents VGG-face 16 (i.e., fine tuning) (%).
| Method | PPV | TPR | ACC | F_Score |
|---|---|---|---|---|
| Proposed | 99.96 | 99.97 | 99.96 | 99.97 |
| VGG (NIR) | 95.94 | 95.92 | 95.93 | 95.93 |
| VGG (Thermal) | 95.08 | 95.07 | 95.07 | 95.08 |
| AlexNet (NIR) | 94.92 | 94.91 | 94.92 | 94.91 |
| AlexNet (Thermal) | 94.07 | 94.06 | 94.06 | 94.07 |
| Method using whole face (weight SUM rule) | 87.31 | 87.28 | 87.29 | 87.29 |
| Method using whole face (weight PRODUCT rule) | 85.48 | 85.45 | 85.48 | 85.46 |
| Multi-channel-based method [ | 83.24 | 83.27 | 83.26 | 83.25 |
| [ | 54.01 | 54.1 | 54.05 | 54.05 |
| [ | 58.39 | 58.29 | 58.34 | 58.33 |
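The four measures in the table are standard functions of the confusion-matrix counts: PPV = TP / (TP + FP), TPR = TP / (TP + FN), ACC = (TP + TN) / (TP + FN + FP + TN), and F_score = 2 × PPV × TPR / (PPV + TPR). The sketch below computes them under the assumption that aggressive driving is the positive class. The counts are hypothetical and do not come from the paper.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute PPV, TPR, ACC, and F-score from confusion-matrix counts."""
    ppv = tp / (tp + fp)                    # positive predictive value (precision)
    tpr = tp / (tp + fn)                    # true positive rate (recall)
    acc = (tp + tn) / (tp + fn + fp + tn)   # overall accuracy
    f_score = 2 * ppv * tpr / (ppv + tpr)   # harmonic mean of PPV and TPR
    return {"PPV": ppv, "TPR": tpr, "ACC": acc, "F_score": f_score}

# Hypothetical counts: 100 aggressive and 100 smooth samples
metrics = classification_metrics(tp=90, fn=10, fp=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```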
Comparisons of PPV, TPR, ACC, and F_score of proposed and previous methods based on 10-fold cross validation. “VGG” represents VGG-face 16 (i.e., fine tuning) (%).
| Method | PPV | TPR | ACC | F_Score |
|---|---|---|---|---|
| Proposed | 99.94 | 99.95 | 99.95 | 99.94 |
| VGG (NIR) | 95.87 | 95.85 | 95.85 | 95.86 |
| VGG (Thermal) | 95.11 | 95.1 | 95.1 | 95.1 |
| AlexNet (NIR) | 94.85 | 94.87 | 94.86 | 94.86 |
| AlexNet (Thermal) | 94.12 | 94.1 | 94.11 | 94.11 |
| Method using whole face (weight SUM rule) | 86.11 | 86.09 | 87.1 | 86.1 |
| Method using whole face (weight PRODUCT rule) | 85.28 | 85.25 | 87.27 | 85.26 |
| Multi-channel-based method [ | 82.19 | 82.17 | 82.17 | 82.18 |
| [ | 55.21 | 55.24 | 54.22 | 55.22 |
| [ | 59.28 | 59.25 | 59.27 | 59.26 |