Aleksej Logacjov, Kerstin Bach, Atle Kongsvold, Hilde Bremseth Bårdstu, Paul Jarle Mork.
Abstract
Existing accelerometer-based human activity recognition (HAR) benchmark datasets that were recorded during free living suffer from non-fixed sensor placement, the usage of only one sensor, and unreliable annotations. We make two contributions in this work. First, we present the publicly available Human Activity Recognition Trondheim dataset (HARTH). Twenty-two participants were recorded for 90 to 120 min during their regular working hours using two three-axial accelerometers, attached to the thigh and lower back, and a chest-mounted camera. Experts annotated the data independently using the camera’s video signal and achieved high inter-rater agreement (Fleiss’ Kappa = 0.96). They labeled twelve activities. The second contribution of this paper is the training of seven different baseline machine learning models for HAR on our dataset. We used a support vector machine, k-nearest neighbor, random forest, extreme gradient boost, convolutional neural network, bidirectional long short-term memory, and convolutional neural network with multi-resolution blocks. The support vector machine achieved the best results with an F1-score of 0.81 (standard deviation: ±0.18), recall of 0.85 ± 0.13, and precision of 0.79 ± 0.22 in a leave-one-subject-out cross-validation. Our highly professional recordings and annotations provide a promising benchmark dataset for researchers to develop innovative machine learning approaches for precise HAR in free living.
Keywords: accelerometer; benchmark; deep learning; human activity recognition; machine learning; physical activity behavior; public dataset
Year: 2021 PMID: 34883863 PMCID: PMC8659926 DOI: 10.3390/s21237853
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
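The abstract above reports a leave-one-subject-out (LOSO) cross-validation with an SVM among the baselines. As a rough orientation, the sketch below shows how such an evaluation can be set up with scikit-learn; the arrays `X`, `y`, and `subject_ids` and the SVM hyperparameters are placeholders, not the authors' exact configuration.

```python
# Minimal LOSO cross-validation sketch; X, y, and subject_ids are assumed to be
# prepared elsewhere (per-window features, activity labels, subject IDs).
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC
from sklearn.metrics import f1_score, precision_score, recall_score

def loso_svm(X, y, subject_ids):
    f1s, recalls, precisions = [], [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # illustrative hyperparameters
        clf.fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        # One macro-averaged score per held-out subject
        f1s.append(f1_score(y[test_idx], y_pred, average="macro", zero_division=0))
        recalls.append(recall_score(y[test_idx], y_pred, average="macro", zero_division=0))
        precisions.append(precision_score(y[test_idx], y_pred, average="macro", zero_division=0))
    return {name: (np.mean(v), np.std(v))
            for name, v in [("f1", f1s), ("recall", recalls), ("precision", precisions)]}
```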
This table shows the main characteristics of eight publicly available accelerometer-based HAR datasets and of our HARTH dataset. The symbol “#” abbreviates “number of”, “PAs” stands for “physical activities”, and “accelero.” for “accelerometers”.
| Name | #Labels | #PAs | #Subjects | #Accelero. | Sensor Type | Annotation |
|---|---|---|---|---|---|---|
| Real-life-HAR | 4 | 2 | 19 | 1 | Smartphone | User |
| SHL | 8 | 5 | 3 | 4 | Smartphone | User and expert |
| HASC-PAC2016 | 6 | 6 | 81 | 1 | Smartphone | User |
| WISDMv2.0 | 6 | 6 | 225 | 1 | Smartphone | User |
| DailyLog | 19 | 7 | 7 | 2 | Smartphone & Smartwatch | User |
| ExtraSensory | 51 | 8 | 60 | 2 | Smartphone & Smartwatch | User |
| TMD | 5 | 3 | 13 | 1 | Smartphone | User |
| SDL | 10 | 4 | 8 | 1 | Smartwatch | User |
| HARTH (ours) | 12 | 9 | 22 | 2 | Axivity AX3 | Human experts |
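For readers who want to work with the released recordings, the following is a minimal loading sketch with pandas. The file name and the column names (`timestamp`, `back_x/y/z`, `thigh_x/y/z`, `label`) are assumptions about the published CSV layout and may need to be adapted.

```python
# Sketch of loading one HARTH recording; file and column names are assumptions.
import pandas as pd

def load_recording(csv_path):
    df = pd.read_csv(csv_path, parse_dates=["timestamp"])
    back = df[["back_x", "back_y", "back_z"]].to_numpy()       # lower-back accelerometer
    thigh = df[["thigh_x", "thigh_y", "thigh_z"]].to_numpy()   # thigh accelerometer
    labels = df["label"].to_numpy()                            # expert annotation per sample
    return back, thigh, labels

# back, thigh, labels = load_recording("subject_01.csv")  # hypothetical file name
```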
Figure 1. This figure shows the two sensor positions (highlighted with orange lines) used for our dataset. (a) The lower back sensor is positioned at approximately the 3rd lumbar vertebra. The z-axis of the coordinate system points forward. (b) The thigh sensor is positioned approximately 10 cm above the upper kneecap. The z-axis points backward.
The definitions of all twelve activities used during annotation.
| Activity | Definition |
|---|---|
| Sitting | When the person’s buttocks are on the seat of the chair, bed, or floor. Sitting can include some movement in the upper body and legs; this should not be tagged as a separate transition. Adjustment of sitting position is allowed. |
| Standing | Upright, feet supporting the person’s body weight, with no feet movement, otherwise this could be shuffling/walking. Movement of upper body and arms is allowed. If feet position is equal before and after upper body movement, standing can be inferred. Without being able to see the feet, if upper body and surroundings indicate no feet movement, standing can be inferred. |
| Lying | The person lies either on the stomach, on the back, or on the right/left shoulder. Movement of arms, feet, and head is allowed. |
| Walking | Locomotion towards a destination with one stride or more (one step with each foot, where one foot is placed past the other). Walking can occur in all directions. Walking along a curved line is allowed. |
| Running | Locomotion towards a destination, with at least two steps where both feet leave the ground during each stride. Running can be inferred when the trunk moves forward in a constant upward-downward motion over at least two steps. Running along a curved line is allowed. |
| Stairs (asc./desc.) | Start: Heel-off of the foot that will land on the first step of the stairs. End: When the heel of the last foot strikes flat ground. If both feet rest on the same step with no feet movement, standing should be inferred. |
| Shuffling | Stepping in place by non-cyclical and non-directional movement of the feet. Includes turning on the spot with feet movement that is not part of a walking bout. Without being able to see the feet, if movement of the upper body and surroundings indicates non-directional feet movement, shuffling can be inferred. |
| Cycling (sitting) | Pedaling while the buttocks are on the seat. Cycling starts at the first pedaling, or when the bike is moving while one/both feet are on the pedal(s). Cycling ends when the first foot is in contact with the ground. If one/both feet are placed on the pedal(s), the buttocks are on the seat, there is no pedaling, and the bike is standing still, this should be tagged as sitting. |
| Cycling (standing) | Standing with both feet on the pedals while riding a bike. Cycling (standing) starts when the buttocks leave the seat and ends when the buttocks return to the seat. |
| Transport (sitting) | When sitting in a vehicle, e.g., a bus, car, or train. |
| Transport (standing) | When standing in a vehicle, e.g., a bus or train. Movement of the feet while standing is allowed and should not be tagged separately. |
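For training, the twelve annotation classes above need a fixed numeric encoding. A minimal sketch of such a mapping is given below; the integer codes are arbitrary here and do not reproduce the coding scheme of the published dataset.

```python
# Illustrative class-name-to-ID mapping for the twelve annotated activities.
# The integer codes are arbitrary; the public dataset defines its own coding.
ACTIVITIES = [
    "walking", "running", "shuffling",
    "stairs_ascending", "stairs_descending",
    "standing", "sitting", "lying",
    "cycling_sit", "cycling_stand",
    "transport_sit", "transport_stand",
]
LABEL_TO_ID = {name: i for i, name in enumerate(ACTIVITIES)}
ID_TO_LABEL = {i: name for name, i in LABEL_TO_ID.items()}
```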
Figure 2. This bar plot shows the total amount of recorded minutes for each activity in the dataset.
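The per-activity durations shown in Figure 2 follow directly from the per-sample labels. A sketch, assuming a 50 Hz sampling rate (the actual rate should be taken from the recordings):

```python
# Sketch of converting per-sample labels into recorded minutes per activity.
# The 50 Hz sampling rate is an assumption; use the rate of the actual recordings.
import pandas as pd

def minutes_per_activity(labels, sampling_rate_hz=50):
    counts = pd.Series(labels).value_counts()
    return counts / sampling_rate_hz / 60.0  # samples -> seconds -> minutes
```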
Figure 3. This figure shows ten seconds (x-axis) of the acceleration signals (y-axis) of all three axes of the back and thigh accelerometers, focusing on the subject with subject ID 28. The background is shaded according to the activity label, in this case walking (green), shuffling (yellow), and standing (gray).
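A plot in the spirit of Figure 3 can be reproduced from the raw signals and labels. The sketch below shades contiguous label runs behind both sensor traces; the sampling rate and the color choices are assumptions.

```python
# Sketch of plotting a ten-second excerpt of the back and thigh signals with the
# background shaded by activity label. Sampling rate and colors are assumptions.
import numpy as np
import matplotlib.pyplot as plt

def plot_excerpt(back, thigh, labels, start, fs=50, colors=None):
    colors = colors or {"walking": "green", "shuffling": "gold", "standing": "gray"}
    sl = slice(start, start + 10 * fs)                  # ten seconds of samples
    t = np.arange(sl.stop - sl.start) / fs
    fig, axes = plt.subplots(2, 1, sharex=True)
    axes[0].plot(t, back[sl]); axes[0].set_ylabel("back")
    axes[1].plot(t, thigh[sl]); axes[1].set_ylabel("thigh")
    excerpt = list(labels[sl])
    run_start = 0
    for i in range(1, len(excerpt) + 1):                # shade contiguous label runs
        if i == len(excerpt) or excerpt[i] != excerpt[run_start]:
            for ax in axes:
                ax.axvspan(run_start / fs, i / fs, alpha=0.2, lw=0,
                           color=colors.get(excerpt[run_start], "white"))
            run_start = i
    axes[1].set_xlabel("time (s)")
    plt.show()
```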
Figure 4. This figure illustrates a single layer in a standard CNN (a) and a multi-resolution CNN (b).
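As a rough illustration of the difference shown in Figure 4, the sketch below contrasts a standard 1-D convolution layer with a multi-resolution block that runs several kernel sizes in parallel and concatenates their outputs. The kernel sizes, filter counts, and input shape (one-second windows of six accelerometer channels at an assumed 50 Hz) are illustrative, not the paper's configuration.

```python
# Sketch of a standard CNN layer vs. a multi-resolution block (parallel 1-D
# convolutions with different kernel sizes, concatenated). All sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def standard_block(x, filters=32, kernel_size=5):
    return layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)

def multi_resolution_block(x, filters=32, kernel_sizes=(3, 5, 7)):
    branches = [layers.Conv1D(filters, k, padding="same", activation="relu")(x)
                for k in kernel_sizes]
    return layers.Concatenate()(branches)

inputs = tf.keras.Input(shape=(50, 6))   # one-second window, 6 accelerometer channels
outputs = layers.GlobalAveragePooling1D()(multi_resolution_block(inputs))
model = tf.keras.Model(inputs, outputs)
```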
Figure 5. This figure illustrates the five preprocessing steps we performed. First, the two accelerometer signals and the annotated (denoted as annot.) video are time-synchronized. Second, a 20 Hz low-pass filter is applied to the annotated acceleration signals. Third, each signal is segmented into one-second windows, and majority label voting is used. These windows are fed into the deep learning models for training. Fourth, 161 features (denoted as F) are computed for each window. Fifth, min–max feature scaling is applied. The resulting feature vectors are used to train the traditional machine learning models.
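A condensed sketch of this pipeline is given below, assuming a 50 Hz sampling rate and a fourth-order Butterworth low-pass filter; only two illustrative features per channel are computed instead of the 161 features used in the paper, and the scaler is fit on all windows for brevity (in practice it should be fit on training data only).

```python
# Condensed preprocessing sketch: 20 Hz low-pass filter, one-second windows with
# majority-vote labels, per-window features, min-max scaling. The 50 Hz sampling
# rate and the tiny feature set are assumptions (the paper uses 161 features).
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.preprocessing import MinMaxScaler

FS = 50  # assumed sampling rate in Hz

def lowpass(signal, cutoff_hz=20, fs=FS, order=4):
    b, a = butter(order, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, signal, axis=0)           # zero-phase Butterworth filter

def majority_label(window_labels):
    values, counts = np.unique(window_labels, return_counts=True)
    return values[np.argmax(counts)]

def make_windows(signal, labels, win_len=FS):
    n = len(signal) // win_len                      # non-overlapping one-second windows
    X = signal[: n * win_len].reshape(n, win_len, -1)
    y = np.array([majority_label(labels[i * win_len:(i + 1) * win_len]) for i in range(n)])
    return X, y

def simple_features(windows):
    # Per-channel mean and standard deviation only, for illustration.
    return np.concatenate([windows.mean(axis=1), windows.std(axis=1)], axis=1)

def preprocess(signal, labels):
    X_win, y = make_windows(lowpass(signal), labels)
    feats = MinMaxScaler().fit_transform(simple_features(X_win))  # min-max scaling
    return X_win, feats, y
```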
This table shows the recall, precision, and F1-score of the leave-one-subject-out cross-validation, averaged across all twelve labels, with the corresponding standard deviations. The best results are shown as gray cells. The term “mCNN” is an abbreviation for “multi-resolution CNN”.
| | k-NN | SVM | RF | XGB | BiLSTM | CNN | mCNN |
|---|---|---|---|---|---|---|---|
| Recall | | | | | | | |
| Precision | | | | | | | |
| F1-score | | | | | | | |
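One way to obtain label-averaged scores with standard deviations, as reported in this table, is to compute per-class recall, precision, and F1 from a confusion matrix and then average across classes. A minimal sketch of that reading (the paper's exact aggregation may differ):

```python
# Sketch: per-class recall/precision/F1 from a confusion matrix, then the mean
# and standard deviation across classes. Zero-count classes are not handled.
import numpy as np

def per_class_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    recall = tp / cm.sum(axis=1)      # row sums = ground-truth counts per class
    precision = tp / cm.sum(axis=0)   # column sums = predicted counts per class
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

def label_averaged(cm):
    r, p, f = per_class_metrics(cm)
    return {n: (v.mean(), v.std()) for n, v in (("recall", r), ("precision", p), ("f1", f))}
```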
This table shows the average recall, precision, and F1-score of the leave-one-subject-out cross-validation. Twelve labels are merged into nine physical activities by summing up the corresponding rows/columns of the summed confusion matrix. The best results are shown as gray cells. The term “mCNN” is an abbreviation for “multi-resolution CNN”.
| | k-NN | SVM | RF | XGB | BiLSTM | CNN | mCNN |
|---|---|---|---|---|---|---|---|
| Recall | | | | | | | |
| Precision | | | | | | | |
| F1-score | | | | | | | |
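Merging twelve labels into nine physical activities, as described above, amounts to summing the corresponding rows and columns of the confusion matrix. Below is a minimal sketch of such a merge; the grouping is passed in by the caller, and the example mapping in the comment is only illustrative, not necessarily the paper's grouping.

```python
# Sketch of merging confusion-matrix classes by summing rows and columns.
import numpy as np

def merge_confusion_matrix(cm, groups):
    """groups: list of lists of original class indices, one list per merged class."""
    cm = np.asarray(cm)
    merged = np.zeros((len(groups), len(groups)), dtype=cm.dtype)
    for i, rows in enumerate(groups):
        for j, cols in enumerate(groups):
            merged[i, j] = cm[np.ix_(rows, cols)].sum()
    return merged

# Illustrative example: merge classes 5 and 6 (e.g., stairs ascending/descending).
# groups = [[0], [1], [2], [3], [4], [5, 6], [7], [8], [9], [10], [11]]
```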
Figure 6. This figure shows four summed confusion matrices of the leave-one-subject-out cross-validation. The four models shown are, from left to right, the two best traditional machine learning approaches, SVM and XGB, and the two best deep learning models, CNN and multi-resolution CNN. The rows show the ground truth labels and the columns the predictions. The matrices are normalized such that each row sums up to one, so the diagonal represents the proportion of correctly classified samples. The leading zero of each entry is omitted.
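The row normalization described for these matrices is a small, self-contained operation; a sketch, assuming the summed confusion matrix is available as a square NumPy array:

```python
# Sketch of row-normalizing a summed confusion matrix so each row sums to one;
# the diagonal of the result then gives per-class recall.
import numpy as np

def row_normalize(cm):
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)
```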