Laura Fiorini, Federica Gabriella Cornacchia Loizzo, Alessandra Sorrentino, Erika Rovini, Alessandro Di Nuovo, Filippo Cavallo.
Abstract
This paper makes the VISTA database, composed of inertial and visual data, publicly available for gesture and activity recognition. The inertial data were acquired with the SensHand, which captures the movement of the wrist, thumb, index and middle fingers, while the RGB-D visual data were acquired simultaneously from two points of view, front and side. The VISTA database was acquired in two experimental phases: in the first, the participants were asked to perform 10 different actions; in the second, they executed five scenes of daily living, each corresponding to a combination of the selected actions. In both phases, Pepper interacted with the participants. The two camera points of view mimic the different points of view of Pepper. Overall, the dataset includes 7682 action instances for the training phase and 3361 action instances for the testing phase. It can be seen as a framework for future studies on artificial intelligence techniques for activity recognition, including inertial-only data, visual-only data, or a sensor fusion approach.
Year: 2022 PMID: 35585077 PMCID: PMC9117293 DOI: 10.1038/s41597-022-01324-3
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
A comparison of the VISTA dataset with existing benchmarks.
| Dataset | Sensors (a) | Activities (b) |
|---|---|---|
| FallAllD1, 2 | ||
| UniMiB dataset3 | ||
| DU-MD Dataset4 | ||
| RGBDHuDaAct6 | ||
| Cornell Activity Dataset7, 8 (CAD-60) | ||
| MSR Daily Activity 3D9 | ||
| MEX dataset10 | ||
| Up-Fall Detection Dataset11 | ||
| MoCA Dataset5 | ||
| VISTA Datasets | ||
(a) ‘Wide’ and ‘Fine’ indicate the sensors used to recognize wide and fine movements, respectively.
(b) The actions were clustered into ‘Basic Gesture’ (e.g. walking, sitting, lying), ‘ADL’ (related to a specific activity of daily living), and ‘Scene’, which includes all activities composed of two or more activities, with no restrictions on the passage from one to the other.
Description of actions and of the associated scenes included in the dataset.
| PHASE 1 (Training): each action for one minute | | | PHASE 2 (Testing): each scene for one minute |
|---|---|---|---|
| Action | Description | Position | Scene marks (HC, HL, PH, WO, RE) |
| EF | Take the fork, eat and put it back | Sitting on the chair | X |
| DG | Take the glass, drink and put it back | Sitting on the chair | X, X |
| BT | Take the toothbrush, brush teeth and put it back | Sitting on the chair | X |
| UL | Type on the keyboard with both hands | Sitting on the chair | X |
| WP | Take a pen and write on a paper | Sitting on the chair | X |
| TP | Take the phone, talk on it and put it back | Sitting on the chair | X, X |
| WK | Walk forward and backward repeatedly | Standing | X |
| SB | Take the broom, sweep and put it back at the end | Standing | X |
| RC | Sit comfortably on the couch and relax | Sitting on the couch | X |
| RB | Take the book, read it and turn pages repeatedly | Sitting on the couch | X |
For the actions, ‘EF’ stands for ‘eat with the fork’, ‘DG’ for ‘drink from a glass’, ‘BT’ for ‘brush teeth’, ‘UL’ for ‘use laptop’, ‘WP’ for ‘write on a paper’, ‘TP’ for ‘talk on the phone’, ‘WK’ for ‘walk’, ‘SB’ for ‘sweep with the broom’, ‘RC’ for ‘relax on the couch’ and ‘RB’ for ‘read a book’. For the scenes, ‘HC’ stands for ‘house cleaning’, ‘HL’ for ‘having lunch’, ‘PH’ for ‘personal hygiene’, ‘WO’ for ‘working’ and ‘RE’ for ‘relax’.
Fig. 1 SensHand glove.
Fig. 2 Setup for the experimental session.
Fig. 3 GIFs explaining the movement for the ten activities. The participant has granted permission to publish these photos.
Fig. 4 Data collection interface when Training (left) and Testing (right) are selected, respectively.
Distribution of activity instances in the training and testing sequences.
| Activity | Training | Testing |
|---|---|---|
| Eat with the fork (EF) | 10% | 12% |
| Drink from a glass (DG) | 10% | 15% |
| Brush teeth (BT) | 10% | 12% |
| Use laptop (UL) | 10% | 7% |
| Write on a paper (WP) | 10% | 9% |
| Talk on the phone (TP) | 10% | 10% |
| Walk (WK) | 10% | 9% |
| Sweep with the broom (SB) | 10% | 11% |
| Relax on the couch (RC) | 10% | 6% |
| Read a book (RB) | 10% | 9% |
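The percentages above can be turned into approximate instance counts using the totals reported in the abstract (7682 training and 3361 testing instances). The following is a minimal sketch; because the published shares are rounded to whole percentages, the resulting counts are estimates only, and the function name `approx_counts` is a hypothetical helper rather than part of the dataset tooling.

```python
# Totals reported in the abstract.
TRAIN_TOTAL = 7682
TEST_TOTAL = 3361

# Rounded percentage shares from the table: code -> (training %, testing %).
SHARES = {
    "EF": (10, 12), "DG": (10, 15), "BT": (10, 12), "UL": (10, 7),
    "WP": (10, 9), "TP": (10, 10), "WK": (10, 9), "SB": (10, 11),
    "RC": (10, 6), "RB": (10, 9),
}

def approx_counts(shares, train_total, test_total):
    """Estimate per-activity instance counts from rounded percentages."""
    return {
        code: (round(train_total * tr / 100), round(test_total * te / 100))
        for code, (tr, te) in shares.items()
    }

counts = approx_counts(SHARES, TRAIN_TOTAL, TEST_TOTAL)
for code, (n_train, n_test) in counts.items():
    print(f"{code}: ~{n_train} training, ~{n_test} testing")
```

The training set is balanced by construction (each action was performed for one minute), whereas the testing shares differ between activities.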
Fig. 5 Organization of the VISTA database.
Joint numbers and associated joint types.
| Joint number | Joint type |
|---|---|
| 0 | Nose |
| 1 | Neck |
| 2 | Right Shoulder |
| 3 | Right Elbow |
| 4 | Right Wrist |
| 5 | Left Shoulder |
| 6 | Left Elbow |
| 7 | Left Wrist |
| 8 | Mid Hip |
| 9 | Right Hip |
| 10 | Right Knee |
| 11 | Right Ankle |
| 12 | Left Hip |
| 13 | Left Knee |
| 14 | Left Ankle |
| 15 | Right Eye |
| 16 | Left Eye |
| 17 | Right Ear |
| 18 | Left Ear |
| 19 | Left Big Toe |
| 20 | Left Small Toe |
| 21 | Left Heel |
| 22 | Right Big Toe |
| 23 | Right Small Toe |
| 24 | Right Heel |
| 25 | Background |
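When working with the skeleton data, the joint numbering in the table can be encoded as a simple lookup. The sketch below assumes only the numbering listed above; the helper names and the choice of an upper-limb subset are hypothetical illustrations, not part of the published dataset tools.

```python
# Joint names indexed by joint number, copied from the table above.
JOINT_NAMES = [
    "Nose", "Neck", "Right Shoulder", "Right Elbow", "Right Wrist",
    "Left Shoulder", "Left Elbow", "Left Wrist", "Mid Hip", "Right Hip",
    "Right Knee", "Right Ankle", "Left Hip", "Left Knee", "Left Ankle",
    "Right Eye", "Left Eye", "Right Ear", "Left Ear", "Left Big Toe",
    "Left Small Toe", "Left Heel", "Right Big Toe", "Right Small Toe",
    "Right Heel", "Background",
]

def joint_name(number):
    """Return the joint type for a joint number (0 to 25)."""
    return JOINT_NAMES[number]

def joints_matching(keyword):
    """Return the joint numbers whose type contains the given keyword."""
    return [i for i, name in enumerate(JOINT_NAMES) if keyword in name]

# Hypothetical subset: joints of the arms, which carry most of the
# information for the hand-centred actions in this dataset.
UPPER_LIMB = sorted(
    joints_matching("Shoulder") + joints_matching("Elbow") + joints_matching("Wrist")
)

print(joint_name(4))   # Right Wrist
print(UPPER_LIMB)      # [2, 3, 4, 5, 6, 7]
```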
Fig. 6 Skeleton tracking.
The first two columns show the features selected after the correlation analysis from the combined Index + Wrist dataset, while the third shows those selected from the Index and Wrist datasets when analysed on their own.
| Index + Wrist | | Index/Wrist |
|---|---|---|
| Wrist acc. mean | Index acc. mean | Acc. mean |
| Wrist acc. stdev | Index acc. stdev | Acc. stdev |
| Wrist acc. RMS | Index acc. RMS | Acc. RMS |
| Wrist acc. skewness | Index acc. skewness | Acc. skewness |
| Wrist acc. kurtosis | Index acc. kurtosis | Acc. kurtosis |
| Wrist acc. SMA | Index acc. SMA | Acc. SMA |
| Wrist acc. power | Index acc. power | Acc. power |
| Wrist ang. vel. mean | Index vel. power | Ang. vel. mean |
| Wrist ang. vel. power | | Ang. vel. stdev |
| | | Ang. vel. power |
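The statistical features named in the table can be computed from a window of sensor samples as sketched below. The formulas are common textbook definitions (population standard deviation, non-excess kurtosis, power as the mean squared value, and SMA as the mean of the summed absolute values of the three axes); they are assumptions for illustration, and the exact definitions and window length used by the authors are not stated here.

```python
import math

def mean(x):
    return sum(x) / len(x)

def central_moment(x, order):
    """Central moment of the given order (population form)."""
    m = mean(x)
    return sum((v - m) ** order for v in x) / len(x)

def stdev(x):
    return math.sqrt(central_moment(x, 2))

def power(x):
    """Mean squared value of the signal."""
    return sum(v * v for v in x) / len(x)

def rms(x):
    return math.sqrt(power(x))

def skewness(x):
    s = stdev(x)
    return central_moment(x, 3) / s ** 3 if s > 0 else 0.0

def kurtosis(x):
    s = stdev(x)
    return central_moment(x, 4) / s ** 4 if s > 0 else 0.0

def sma(ax, ay, az):
    """Signal magnitude area over three axes of equal length."""
    return sum(abs(a) + abs(b) + abs(c) for a, b, c in zip(ax, ay, az)) / len(ax)

def window_features(signal):
    """Single-axis features for one window of samples."""
    return {
        "mean": mean(signal), "stdev": stdev(signal), "rms": rms(signal),
        "skewness": skewness(signal), "kurtosis": kurtosis(signal),
        "power": power(signal),
    }

# Toy window of samples, not taken from the dataset.
features = window_features([1.0, 2.0, 3.0, 4.0, 5.0])
toy_sma = sma([1.0, -1.0], [2.0, -2.0], [3.0, -3.0])
```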
Fig. 7 Scheme of the feature-level fusion.
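Feature-level fusion is commonly implemented by concatenating, for each action instance, the inertial feature vector with the visual feature vector before classification. The sketch below illustrates that idea under this assumption; the per-column z-score normalization is an added, optional step that the source does not describe, and both function names are hypothetical.

```python
def fuse_features(inertial_rows, visual_rows):
    """Concatenate inertial and visual feature vectors instance by instance."""
    if len(inertial_rows) != len(visual_rows):
        raise ValueError("both modalities must describe the same instances")
    return [list(i) + list(v) for i, v in zip(inertial_rows, visual_rows)]

def zscore_columns(rows):
    """Normalize each feature column to zero mean and unit variance.

    Assumption: normalization is applied after fusion so that features
    with large numeric ranges do not dominate distance-based classifiers.
    """
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(n_cols)]
    stds = [
        (sum((r[c] - means[c]) ** 2 for r in rows) / n) ** 0.5
        for c in range(n_cols)
    ]
    return [
        [(r[c] - means[c]) / stds[c] if stds[c] > 0 else 0.0 for c in range(n_cols)]
        for r in rows
    ]

# Toy example: two instances, two inertial features and one visual feature.
inertial = [[1.0, 2.0], [3.0, 4.0]]
visual = [[10.0], [30.0]]
fused = fuse_features(inertial, visual)
normalized = zscore_columns(fused)
```

The fused matrix has one row per instance and as many columns as the two modalities combined, so any classifier that accepts a fixed-length feature vector can be trained on it.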
Results obtained by stand-alone systems. ‘A’ stands for Accuracy, ‘R’ for Recall, ‘F’ for F-measure and ‘P’ for Precision.
| | Index (I) | | | | Wrist (W) | | | | I + W | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | A | R | F | P | A | R | F | P | A | R | F | P |
| SVM | 0.72 | 0.72 | 0.72 | 0.73 | 0.72 | 0.72 | 0.71 | 0.73 | 0.56 | 0.56 | 0.55 | 0.63 |
| RF | 0.56 | 0.57 | 0.57 | 0.59 | 0.58 | 0.57 | 0.57 | 0.60 | 0.66 | 0.66 | 0.66 | 0.67 |
| KNN | 0.73 | 0.73 | 0.73 | 0.73 | 0.73 | 0.73 | 0.73 | 0.73 | 0.81 | 0.81 | 0.81 | 0.82 |
| SVM | 0.84 | 0.84 | 0.84 | 0.85 | 0.60 | 0.60 | 0.61 | 0.66 | ||||
| RF | 0.71 | 0.72 | 0.72 | 0.72 | 0.64 | 0.65 | 0.64 | 0.65 | ||||
| KNN | 0.90 | 0.90 | 0.90 | 0.90 | 0.81 | 0.81 | 0.82 | 0.81 | ||||
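The four scores reported in the table can be computed from true and predicted labels as sketched below. The sketch assumes macro averaging over classes (each class weighted equally), which is one common convention for multi-class problems; the averaging scheme actually used by the authors is not stated here, and the function name is hypothetical.

```python
def classification_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F-measure."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f_measures = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(f_measure)

    n = len(labels)
    return {
        "A": accuracy,
        "R": sum(recalls) / n,
        "F": sum(f_measures) / n,
        "P": sum(precisions) / n,
    }

# Toy labels using two of the action codes; not taken from the dataset.
scores = classification_scores(
    ["EF", "EF", "DG", "DG"],
    ["EF", "DG", "DG", "DG"],
)
```

With macro averaging, the F-measure is the mean of the per-class F-measures, so it need not equal the harmonic mean of the averaged precision and recall.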
Results of the feature-level fusion. ‘A’ stands for Accuracy, ‘R’ for Recall, ‘F’ for F-measure and ‘P’ for Precision.
| | I + FC | | | | W + FC | | | | IW + FC | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | A | R | F | P | A | R | F | P | A | R | F | P |
| SVM | 0.76 | 0.76 | 0.76 | 0.78 | 0.81 | 0.80 | 0.79 | 0.80 | 0.67 | 0.66 | 0.64 | 0.72 |
| RF | 0.75 | 0.75 | 0.76 | 0.77 | 0.76 | 0.77 | 0.77 | 0.77 | 0.77 | 0.78 | 0.78 | 0.79 |
| KNN | 0.89 | 0.89 | 0.89 | 0.89 | 0.88 | 0.88 | 0.89 | 0.88 | 0.89 | 0.89 | 0.89 | 0.90 |
| SVM | 0.63 | 0.63 | 0.64 | 0.67 | 0.56 | 0.56 | 0.55 | 0.58 | 0.67 | 0.66 | 0.64 | 0.72 |
| RF | 0.74 | 0.75 | 0.75 | 0.76 | 0.73 | 0.73 | 0.73 | 0.74 | 0.77 | 0.78 | 0.78 | 0.79 |
| KNN | 0.77 | 0.77 | 0.78 | 0.82 | 0.77 | 0.77 | 0.79 | 0.82 | 0.89 | 0.89 | 0.89 | 0.90 |
Fig. 8 Spider plot of the F-measure values of the best classifiers on all datasets, divided into those related to the frontal camera (left) and those related to the lateral camera (right). All F-measure values refer to individual actions.
| Measurement(s) | movement of upper and lower limbs |
| Technology Type(s) | IMU • camera |