Hao Quan¹, Yu Hu², Andrea Bonarini¹.
Abstract
Human activity recognition is attracting increasing research attention, and many activity recognition datasets have been created to support the development and evaluation of new algorithms. Given the lack of datasets collected in real environments (In The Wild) to support human activity recognition in public spaces, we introduce a large-scale video dataset for activity recognition In The Wild: POLIMI-ITW-S. The fully labeled dataset consists of 22,161 RGB video clips (about 46 h) covering 37 activity classes performed by 50 K+ subjects in real shopping malls. We evaluated state-of-the-art models on this dataset and obtained relatively low accuracy. We release the dataset, including annotations composed of person tracking bounding boxes, 2-D skeletons, and activity labels, for research use at: https://airlab.deib.polimi.it/polimi-itw-s-a-shopping-mall-dataset-in-the-wild.
Keywords: Computer vision; Human activity recognition; In the wild; Mobile robot
Year: 2022 PMID: 35864879 PMCID: PMC9294482 DOI: 10.1016/j.dib.2022.108420
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Datasets used by recent human activity recognition models (the per-dataset tables below report which model was evaluated on which benchmark; the bottom row counts the models per dataset).
| Model | Publisher | NTU 60 | NTU 120 | Kinetics-Skeleton | NUCLA | SYSU |
|---|---|---|---|---|---|---|
| Efficient GCN | TPAMI 22 | ✓ | ✓ | | | |
| CTR-GCN | ICCV 21 | ✓ | ✓ | | ✓ | |
| SGN | CVPR 20 | ✓ | ✓ | | | ✓ |
| MS-G3D | CVPR 20 | ✓ | ✓ | ✓ | | |
| 4S-Shift-GCN | CVPR 20 | ✓ | ✓ | | ✓ | |
| NAS-GCN | AAAI 20 | ✓ | | ✓ | | |
| 2S-AGCN | CVPR 19 | ✓ | | ✓ | | |
| Total | | 7 | 5 | 3 | 2 | 1 |
The state-of-the-art methods on the NTU 60 dataset in accuracy (%).
| NTU 60: collected in a laboratory | | | |
|---|---|---|---|
| Model | Publisher | X-Sub | X-View |
| Efficient GCN | TPAMI 22 | 92.1 | 96.1 |
| CTR-GCN | ICCV 21 | 92.4 | 96.8 |
| MS-G3D | CVPR 20 | 91.5 | 96.2 |
| 4S-Shift-GCN | CVPR 20 | 90.7 | 96.5 |
| SGN | CVPR 20 | 89.0 | 94.5 |
| NAS-GCN | AAAI 20 | 89.4 | 95.7 |
| 2S-AGCN | CVPR 19 | 88.5 | 95.1 |
| Average | | 90.5 | 95.8 |
X-Sub: Cross-Subject evaluation [8].
X-View: Cross-View evaluation [8].
The state-of-the-art methods on the NTU 120 dataset in accuracy (%).
| NTU 120: collected in a laboratory | | | |
|---|---|---|---|
| Model | Publisher | X-Sub120 | X-Set120 |
| Efficient GCN | TPAMI 22 | 88.7 | 88.9 |
| CTR-GCN | ICCV 21 | 88.9 | 90.6 |
| MS-G3D | CVPR 20 | 86.9 | 88.4 |
| 4S-Shift-GCN | CVPR 20 | 85.9 | 87.6 |
| SGN | CVPR 20 | 79.2 | 81.5 |
| Average | | 85.9 | 87.4 |
X-Sub120: Cross-Subject evaluation [9].
X-Set120: Cross-Setup evaluation [9].
The state-of-the-art methods on the NUCLA dataset in accuracy (%).
| NUCLA: collected in a laboratory | | |
|---|---|---|
| Model | Publisher | NUCLA (%) |
| CTR-GCN | ICCV 21 | 96.5 |
| 4S-Shift-GCN | CVPR 20 | 94.6 |
| Average | | 95.6 |
The state-of-the-art method on the SYSU dataset in accuracy (%).
| SYSU: collected in a laboratory | | | |
|---|---|---|---|
| Model | Publisher | X-Sub | Same-Sub |
| SGN | CVPR 20 | 90.6 | 89.3 |
X-Sub: Cross-Subject evaluation [12].
Same-Sub: Same-Subject evaluation [12].
The state-of-the-art methods on the Kinetics-Skeleton dataset in accuracy (%).
| Kinetics-Skeleton: collected by crowd-sourcing | | |
|---|---|---|
| Model | Publisher | Kinetics-Skeleton (%) |
| MS-G3D | CVPR 20 | 38.0 |
| NAS-GCN | AAAI 20 | 37.1 |
| 2S-AGCN | CVPR 19 | 36.1 |
| Average | | 37.1 |
Comparison between different datasets and the ITW dataset requirements: 1. a viewpoint similar to that of the robot, 2. video taken from a moving camera, 3. representative actions, 4. different people performing the same action, 5. different genders and ages, 6. real-life background, 7. crowded scenes with occlusions, 8. no “actors” and unscripted actions, 9. presence of sequences of actions, 10. presence of human-object and multi-agent interactive actions, 11. large-scale dataset.
| Datasets | Year | Classes | Subjects | Samples | Scenes | Views | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SYSU | 2015 | 12 | 40 | 480 | 1 | 1 | Y | N | Y | Y | N | N | N | N | N | N | N |
| ActivityNet | 2015 | 203 | – | 849 h | – | 1 | N | Y | N | Y | Y | Y | Y | Y | Y | Y | Y |
| NTU | 2016 | 60 | 40 | 56,880 | 1 | 3 | Y | N | N | Y | N | N | N | N | N | Y | Y |
| Kinetics | 2017 | 400 | – | 300,000 | – | 1 | N | Y | N | Y | Y | Y | Y | Y | Y | Y | Y |
| AVA | 2018 | 80 | – | 437 | – | 1 | N | Y | N | Y | Y | Y | Y | N | Y | Y | Y |
| NTU 120 | 2019 | 120 | 106 | 114,480 | 1 | 3 | Y | N | N | Y | N | N | N | N | N | Y | Y |
| Toyota S.H. | 2019 | 31 | 18 | 16,115 | 1 | 7 | N | N | Y | Y | N | Y | N | Y | Y | N | Y |
| ETRI | 2020 | 55 | 100 | 112,620 | 1 | 4 | Y | N | Y | Y | N | Y | N | Y | Y | Y | Y |
| FineGym | 2020 | 530 | – | 708 h | – | 1 | N | Y | N | Y | N | Y | N | Y | Y | Y | Y |
| BABEL | 2021 | 256 | – | 43.5 h | 1 | 1 | Y | N | Y | N | N | N | N | N | Y | Y | Y |
| UAV-Human | 2021 | 155 | Multiple | 67,428 | – | 1 | N | Y | Y | Y | Y | Y | N | Y | Y | Y | Y |
| HOMAGE | 2021 | 75 | 27 | 25.4 h | – | 2–5 | N | N | Y | Y | Y | Y | N | N | Y | Y | Y |
| POLIMI-ITW-S | 2022 | 37 | 50 K+ | 233,446 | malls | robotic | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
Activity labels.
| cleaning, crouching, jumping, laying, riding, running, scooter, sitting, standing, walking |
| sittingTogether, standingTogether, walkingTogether |
| sittingWhileCalling, sittingWhileDrinking, sittingWhileEating, sittingWhileHoldingBabyInArms, |
| sittingWhileTalkingTogether, sittingWhileWatchingPhone, standingWhileCalling, standingWhileDrinking, |
| standingWhileEating, standingWhileHoldingBabyInArms, standingWhileHoldingCart, standingWhileHoldingStroller, |
| standingWhileLookingAtShops, standingWhileTalkingTogether, standingWhileWatchingPhone, walkingWhileCalling, |
| walkingWhileDrinking, walkingWhileEating, walkingWhileHoldingBabyInArms, |
| walkingWhileHoldingCart, walkingWhileHoldingStroller, walkingWhileLookingAtShops, |
| walkingWhileTalkingTogether, walkingWhileWatchingPhone |
COCO body keypoints.
| # | Keypoint |
|---|---|
| 1 | nose |
| 2 | left eye |
| 3 | right eye |
| 4 | left ear |
| 5 | right ear |
| 6 | left shoulder |
| 7 | right shoulder |
| 8 | left elbow |
| 9 | right elbow |
| 10 | left wrist |
| 11 | right wrist |
| 12 | left hip |
| 13 | right hip |
| 14 | left knee |
| 15 | right knee |
| 16 | left ankle |
| 17 | right ankle |
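For convenience, the keypoint order above can be expressed as a lookup table. This is a minimal sketch (the list and helper are illustrative, not part of the released toolkit); note that indices here are 0-based, while the table uses 1-based numbering:

```python
# The 17 COCO body keypoints in the order listed above (0-based indices).
COCO_KEYPOINTS = [
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

def keypoint_index(name: str) -> int:
    """Map a keypoint name to its 0-based index in the COCO ordering."""
    return COCO_KEYPOINTS.index(name)
```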
Fig. 1. Original pose (left); pose reconstructed by “interpolation” (right).
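The “interpolation” reconstruction shown in Fig. 1 can, in principle, be done by linearly interpolating a keypoint that is missing in an intermediate frame from its nearest observed positions before and after it. A minimal sketch, assuming a per-keypoint track of (x, y) coordinates with `None` for undetected frames (the function name and data layout are illustrative, not the authors' implementation):

```python
def interpolate_keypoint(track, missing_idx):
    """Linearly interpolate a missing (x, y) keypoint at frame `missing_idx`.

    `track` is a list of (x, y) tuples, with None for frames where the
    keypoint was not detected. The nearest observed frames before and
    after `missing_idx` are used as interpolation endpoints.
    """
    # Nearest observed neighbours on either side of the missing frame.
    prev = next(i for i in range(missing_idx - 1, -1, -1) if track[i] is not None)
    nxt = next(i for i in range(missing_idx + 1, len(track)) if track[i] is not None)
    # Interpolation weight in [0, 1] based on temporal distance.
    t = (missing_idx - prev) / (nxt - prev)
    (x0, y0), (x1, y1) = track[prev], track[nxt]
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
```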
Number of sequences and frames in the original and the cropped versions of datasets.
| Label ID | Activity class | #Seq. orig. | #Frame orig. | #Seq. crop. | #Frame crop. |
|---|---|---|---|---|---|
| 0 | cleaning | 665 | 60,814 | 665 | 60,814 |
| 1 | crouching | 2735 | 195,537 | 2735 | 195,537 |
| 2 | jumping | 260 | 12,564 | 260 | 12,564 |
| 3 | laying | 92 | 7071 | 92 | 7071 |
| 4 | riding | 301 | 13,555 | 301 | 13,555 |
| 5 | running | 1457 | 63,498 | 1457 | 63,498 |
| 6 | scooter | 208 | 11,919 | 208 | 11,919 |
| 7 | sitting | 8070 | 376,905 | 3000 | 182,026 |
| 8 | sittingTogether | 3044 | 195,566 | 3000 | 193,984 |
| 9 | sittingWhileCalling | 334 | 42,693 | 334 | 42,693 |
| 10 | sittingWhileDrinking | 325 | 32,778 | 325 | 32,778 |
| 11 | sittingWhileEating | 776 | 81,780 | 776 | 81,780 |
| 12 | sittingWhileHoldingBabyInArms | 467 | 34,998 | 467 | 34,998 |
| 13 | sittingWhileTalkingTogether | 766 | 82,371 | 766 | 82,371 |
| 14 | sittingWhileWatchingPhone | 5602 | 546,327 | 3000 | 357,020 |
| 15 | standing | 34,399 | 1,785,446 | 3000 | 158,669 |
| 16 | standingTogether | 13,367 | 902,424 | 3000 | 202,879 |
| 17 | standingWhileCalling | 2303 | 307,560 | 2303 | 307,560 |
| 18 | standingWhileDrinking | 439 | 46,009 | 439 | 46,009 |
| 19 | standingWhileEating | 1148 | 125,342 | 1148 | 125,342 |
| 20 | standingWhileHoldingBabyInArms | 2059 | 144,727 | 2059 | 144,727 |
| 21 | standingWhileHoldingCart | 576 | 44,719 | 576 | 44,719 |
| 22 | standingWhileHoldingStroller | 1216 | 121,875 | 1216 | 121,875 |
| 23 | standingWhileLookingAtShops | 15,524 | 1,193,938 | 3000 | 220,057 |
| 24 | standingWhileTalkingTogether | 10,310 | 1,032,687 | 3000 | 362,779 |
| 25 | standingWhileWatchingPhone | 9727 | 990,380 | 3000 | 268,797 |
| 26 | walking | 67,615 | 3,384,638 | 3000 | 142,640 |
| 27 | walkingTogether | 26,276 | 1,621,113 | 3000 | 179,645 |
| 28 | walkingWhileCalling | 3338 | 401,582 | 3338 | 401,582 |
| 29 | walkingWhileDrinking | 896 | 86,520 | 896 | 86,520 |
| 30 | walkingWhileEating | 1256 | 128,535 | 1256 | 128,535 |
| 31 | walkingWhileHoldingBabyInArms | 3373 | 198,582 | 3373 | 198,582 |
| 32 | walkingWhileHoldingCart | 2381 | 206,911 | 2381 | 206,911 |
| 33 | walkingWhileHoldingStroller | 2806 | 268,606 | 2806 | 268,606 |
| 34 | walkingWhileLookingAtShops | 1674 | 94,657 | 1674 | 94,657 |
| 35 | walkingWhileTalkingTogether | 479 | 38,259 | 479 | 38,259 |
| 36 | walkingWhileWatchingPhone | 7182 | 581,222 | 3000 | 195,973 |
| Total | | 233,446 | 15,464,108 | 65,330 | 5,317,931 |
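As the table shows, the cropped version caps over-represented classes (e.g. walking, standing) at 3000 sequences while leaving smaller classes unchanged. That balancing step can be sketched as follows (a hypothetical helper, assuming sequences are grouped by class label; not the authors' actual cropping code):

```python
import random

def balance_dataset(sequences_by_class, cap=3000, seed=0):
    """Randomly subsample each activity class down to at most `cap`
    sequences; classes already at or below the cap are kept whole."""
    rng = random.Random(seed)  # fixed seed for a reproducible subsample
    balanced = {}
    for label, seqs in sequences_by_class.items():
        if len(seqs) > cap:
            balanced[label] = rng.sample(seqs, cap)
        else:
            balanced[label] = list(seqs)
    return balanced
```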
Fig. 2. Number of sequences by activity.
Fig. 3. Annotation samples.
Fig. 4. The structure of the annotated JSON-format file.
Fig. 5. An annotation example.
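Since the annotations are released as JSON files (Fig. 4), a loader might look like the sketch below. All field names here ("frames", "bbox", "skeleton", "label") are assumptions made for illustration; consult the released files for the actual schema.

```python
import json

def load_annotation(path):
    """Load one annotation file and yield per-frame records.

    Field names ("frames", "bbox", "skeleton", "label") are assumed for
    illustration; the released JSON schema may differ.
    """
    with open(path) as f:
        ann = json.load(f)
    for frame in ann["frames"]:
        yield {
            "bbox": frame["bbox"],          # person tracking bounding box
            "skeleton": frame["skeleton"],  # 17 COCO (x, y) keypoints
            "label": ann["label"],          # activity class label
        }
```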
The results of skeleton-based activity recognition.
| Model | Publisher | Accuracy orig. (%) | Accuracy crop. (%) |
|---|---|---|---|
| Efficient GCN | TPAMI 22 | 48.3 | 38.5 |
| CTR-GCN | ICCV 21 | 44.93 | 34.97 |
| MS-G3D | CVPR 20 | 43.37 | 34.05 |
| 2S-AGCN | CVPR 19 | 44.46 | 34.13 |
| Subject | Computer Vision and Pattern Recognition |
|---|---|
| Specific subject area | Human activity recognition |
| Type of data | 2-D RGB video |
| | Annotations composed of person tracking bounding boxes, 2-D skeleton, and activity labels in JSON format |
| How the data were acquired | The dataset was recorded with the RGB cameras of two smartphones with resolution |
| | The models of the two smartphones are |
| Data format | Raw |
| Description of data collection | We collected the dataset in shopping malls in the Hubei province of China. The malls have multiple floors providing different services, e.g., a main hall with grocery stores on the ground floor; clothes shops on the second and third floors; restaurants and drink bars on the fourth floor; a cinema with a waiting room on the fifth floor; and a supermarket on the underground floor. These diverse settings guarantee the desired variety of subjects and situations. The subjects are clients and staff of the shopping malls, of different genders and ages, including men and women, babies, children, teenagers, adults, and elderly people. Thus, almost every recorded video clip features different subjects. |
| | The cameras were held by hand at about 90 cm from the floor. The recorders imitated a mobile robot, moving around or staying still while looking around, to capture people performing actions. |
| | We did not mount the cameras on a robot in order to avoid the uncommon situations that the presence of a robot may trigger. We did not use 3-D stereo/depth cameras as recording tools, since this type of camera suffers from motion issues, making it unsuitable for data collection from mobile robots. Moreover, such cameras may not perform well when subjects are far from the camera. |
| Data source location | Shopping malls in the Hubei province of China |
| Data accessibility | Repository name: Science Data Bank |
| | Data identification number: sciencedb.01694 |
| | Direct URL to data: |