
POLIMI-ITW-S: A large-scale dataset for human activity recognition in the wild.

Hao Quan, Yu Hu, Andrea Bonarini.

Abstract

Human activity recognition is attracting increasing research attention, and many activity recognition datasets have been created to support the development and evaluation of new algorithms. Given the lack of datasets collected in real environments (In The Wild) to support human activity recognition in public spaces, we introduce a large-scale video dataset for activity recognition In The Wild: POLIMI-ITW-S. The fully labeled dataset consists of 22,161 RGB video clips (about 46 h) covering 37 activity classes performed by 50 K+ subjects in real shopping malls. We evaluated state-of-the-art models on this dataset and obtained relatively low accuracy. We release the dataset, including annotations composed of person tracking bounding boxes, 2-D skeletons, and activity labels, for research use at: https://airlab.deib.polimi.it/polimi-itw-s-a-shopping-mall-dataset-in-the-wild.
© 2022 Published by Elsevier Inc.


Keywords:  Computer vision; Human activity recognition; In the wild; Mobile robot

Year:  2022        PMID: 35864879      PMCID: PMC9294482          DOI: 10.1016/j.dib.2022.108420

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications Table

Value of the Data

- The data are useful for those working on human activity recognition from skeletal data or from RGB videos.
- New algorithms can be developed to classify a more detailed, semantically meaningful set of activities, improving the performance of downstream applications.
- The data can support the development and evaluation of human activity recognition models for mobile robots operating in public environments.
- Based on these data, it will be possible to classify actions performed In The Wild, opening a wide range of applications for social robotics and other disciplines.
- Beyond human activity recognition, the dataset can be useful for other tasks involving human subjects. Since the clips include many crowded scenes, it is interesting for open problems such as person tracking, pose tracking, person re-identification, and body/head orientation estimation. The data may also contribute to autonomous robotics research on path planning and obstacle avoidance, and may draw other researchers to the problems and algorithms arising when operating in the wild.

Data Description

Human activity recognition (HAR) increasingly relies on skeleton representations of human bodies instead of raw RGB videos. Thanks to their strong adaptability and highly abstract nature, skeletal data underpin many significant models [1], [2], [3], [4], [5], [6], [7]. Compared to the RGB video representation, the greatest benefits of skeletal data are that they are free of dynamic environment noise and robust against complicated backgrounds (lighting conditions, color of clothing, object occlusion, etc.). Recognizing the actions of people in the real world is important for service robots to further enhance the services they can offer. We analyzed relevant skeleton-based HAR models from the last three years to check how public datasets are used to train and evaluate models in the community. As shown in Table 1, the most commonly used datasets are (in descending order): NTU RGB+D 60 [8], NTU RGB+D 120 [9], Kinetics [10], Northwestern-UCLA Multiview Action 3D [11], and SYSU 3D Human-Object Interaction [12]. Among those, only Kinetics was not collected in a constrained environment: it was gathered from online streaming resources via crowd-sourcing, while all the other datasets were collected in the respective laboratories.
Table 1

Datasets used for recent human recognition models.

Model               Publisher  NTU 60 [8]  NTU 120 [9]  Kinetics-Skeleton [10], [13]  NUCLA [11]  SYSU [12]
Efficient GCN [14]  TPAMI 22   Y           Y            –                             –           –
CTR-GCN [7]         ICCV 21    Y           Y            –                             Y           –
SGN [5]             CVPR 20    Y           Y            –                             –           Y
MS-G3D [3]          CVPR 20    Y           Y            Y                             –           –
4S-Shift-GCN [15]   CVPR 20    Y           Y            –                             Y           –
NAS-GCN [4]         AAAI 20    Y           –            Y                             –           –
2S-AGCN [1]         CVPR 19    Y           –            Y                             –           –

Total                          7           5            3                             2           1
The state-of-the-art models achieve about 90% accuracy on the datasets collected in laboratory environments, as shown in Tables 2, 3, 5 and 6. Nevertheless, they achieve less than 40% accuracy on the Kinetics dataset, collected from online streaming resources by crowd-sourcing, as shown in Table 4. This suggests that state-of-the-art models can perform well on datasets collected in constrained environments, but meet challenges when recognizing actions in unconstrained, natural environments.
Table 2

The state-of-the-art methods on NTU 60 dataset in accuracy (%).

NTU 60: collected from laboratory
Model               Publisher  X-Sub (%)  X-View (%)
Efficient GCN [14]  TPAMI 22   92.1       96.1
CTR-GCN [7]         ICCV 21    92.4       96.8
MS-G3D [3]          CVPR 20    91.5       96.2
4S-Shift-GCN [15]   CVPR 20    90.7       96.5
SGN [5]             CVPR 20    89.0       94.5
NAS-GCN [4]         AAAI 20    89.4       95.7
2S-AGCN [1]         CVPR 19    88.5       95.1

Average value                  90.5       95.8

X-Sub: Cross-Subject evaluation [8].

X-View: Cross-View evaluation [8].

Table 3

The state-of-the-art methods on NTU 120 dataset in accuracy (%).

NTU 120: collected from laboratory
Model               Publisher  X-Sub120 (%)  X-Set120 (%)
Efficient GCN [14]  TPAMI 22   88.7          88.9
CTR-GCN [7]         ICCV 21    88.9          90.6
MS-G3D [3]          CVPR 20    86.9          88.4
4S-Shift-GCN [15]   CVPR 20    85.9          87.6
SGN [5]             CVPR 20    79.2          81.5

Average value                  85.9          87.4

X-Sub120: Cross-Subject evaluation [9].

X-Set120: Cross-Setup evaluation [9].

Table 5

The state-of-the-art methods on NUCLA dataset in accuracy (%).

NUCLA: collected from laboratory
Model              Publisher  NUCLA (%)
CTR-GCN [7]        ICCV 21    96.5
4S-Shift-GCN [15]  CVPR 20    94.6

Average value                 95.6
Table 6

The state-of-the-art method on SYSU dataset in accuracy (%).

SYSU: collected from laboratory
Model    Publisher  X-Sub (%)  Same-Sub (%)
SGN [5]  CVPR 20    90.6       89.3

X-Sub: Cross-Subject evaluation [12].

Same-Sub: Same-Subject evaluation [12].

Table 4

The state-of-the-art methods on Kinetics-Skeleton dataset in accuracy (%).

Kinetics-Skeleton: collected by crowd-sourcing method
ModelPublisherKinetics-Skeleton (%)
MS-G3D [3]ICCV 2138
NAS [4]CVPR 2037.1
2S-AGCN [1]CVPR 1936.1
Average Value37.1
Because the main datasets for evaluating new HAR models are collected in specific laboratories, where accuracy already reaches about 90%, we believe there is little room left to optimize models trained on this type of dataset. Meanwhile, we argue that reliable HAR models supporting production mobile service robots should be evaluated not only on datasets collected in controlled environments, but also on datasets collected in the final, public environments, a situation defined in the community as "In The Wild" (ITW). Due to the well-known issues of crowd-sourcing methods (unbalanced taxonomies, unnatural scenes, label noise, and invalid website links), datasets like Kinetics collected from online streaming resources may not satisfy the need for robust models able to perform well in the real world. To fill this gap, we propose the POLIMI-ITW-S dataset to support the development of reliable skeleton-based human activity recognition models that could be deployed on mobile service robots to recognize actions happening in the real world.
We propose that a reliable ITW dataset of clips useful for robotic applications should have the following characteristics:

- a viewpoint similar to that of the robot, including subjects viewed in full figure when the robot is far away and only in part when it is close, taken with a camera typical of those mounted on commercial mobile service robots;
- video clips recorded from freely moving viewpoints, like those of mobile robots;
- clips representative of the common actions in the selected environment, evenly distributed among the classes;
- a large number of different subjects performing the same action;
- subjects of different genders and a large range of ages, from babies to elderly people;
- real-life backgrounds, possibly including people and objects typical of the context;
- crowded scenes with large numbers of persons, including subjects occluded by other person(s) or object(s);
- unscripted, natural actions: there are no "actors"; people are recorded without knowing it in advance, so they are supposed to behave naturally;
- real-life sequencing of actions, so that typical sequences of actions can be studied from realistic clips;
- human-object and multi-person interactive actions (such as "calling", "talking", "drinking", "eating", "holding baby in arms", etc.);
- large scale.

As shown in Table 7, the available datasets are usually collected in controlled contexts, such as a laboratory or home, or conveniently extracted from streaming sources produced for other purposes.
Table 7

Comparison between different datasets and ITW dataset requirements: 1. viewpoint similar to the one of the robot, 2. video taken from moving camera, 3. representative actions, 4. different people performing the same action, 5. different genders and ages, 6. real life background, 7. crowded scenes with occlusions, 8. no “actors” and unscripted actions, 9. presence of sequences of actions, 10. presence of human-object and multi-agent interactive actions, 11. large-scale dataset.

Dataset           Year  Classes  Subjects  Samples          Scenes  Views    1 2 3 4 5 6 7 8 9 10 11
SYSU [12]         2015  12       40        480              1       1        Y N Y Y N N N N N N  N
ActivityNet [16]  2015  203      –         849 h            –       1        N Y N Y Y Y Y Y Y Y  Y
NTU [8]           2016  60       40        56,880           1       3        Y N N Y N N N N N Y  Y
Kinetics [10]     2017  400      –         300,000          –       1        N Y N Y Y Y Y Y Y Y  Y
AVA [17]          2018  80       –         437              –       1        N Y N Y Y Y Y N Y Y  Y
NTU 120 [9]       2019  120      106       114,480          1       3        Y N N Y N N N N N Y  Y
Toyota S.H. [18]  2019  31       18        16,115           1       7        N N Y Y N Y N Y Y N  Y
ETRI [19]         2020  55       100       112,620          1       4        Y N Y Y N Y N Y Y Y  Y
FineGym [20]      2020  530      –         708 h            –       1        N Y N Y N Y N Y Y Y  Y
BABEL [21]        2021  256      –         43.5 h           1       1        Y N Y N N N N N Y Y  Y
UAV-Human [22]    2021  155      Multiple  67,428           –       1        N Y Y Y Y Y N Y Y Y  Y
HOMAGE [23]       2021  75       27        25.4 h           –       2–5      N N Y Y Y Y N N Y Y  Y
POLIMI-ITW-S      2022  37       50 K+     233,446 / 46 h   malls   robotic  Y Y Y Y Y Y Y Y Y Y  Y
The main advantages of these datasets are that they are easy to obtain, at relatively low labor cost compared to datasets collected in public spaces. In addition, dataset creators could exploit the limited environments to deploy RGB-D cameras and obtain 3-D human skeletal data. Since the actors knew in advance what to do and crowded scenes were rarely considered, creators did not need to dedicate many resources to annotation. However, most datasets do not consider the viewpoint of mobile robots in public spaces. Furthermore, there is no large-scale visual dataset that deals with the real, daily behavior of people: in most cases, actions are performed upon request, often by actors, usually separated from each other. Results obtained under such constrained conditions may not fully hold in a real-world scenario, as we have verified on our dataset for state-of-the-art models. There is a lack of adequate datasets to train models that robots could use to recognize common human activities in public spaces, and this absence of In-The-Wild datasets for human activity recognition is a serious impediment to computer vision and robot intelligence research. Different from state-of-the-art datasets, our dataset satisfies all the requirements listed above for a reliable ITW dataset. We collected 22,161 video clips with more than 15.4 million frames; the average clip lasts about 7 s, and the total length of the dataset is about 45.97 h.
The dataset was collected in the Hubei province of China. According to the population statistics published by the local government [24], the gender distribution of the city is 51.48% male and 48.52% female; the age distribution is 18.7% (0–14), 62.28% (15–59), and 19.03% (over 60). We believe the gender and age distribution in the dataset matches that of the city's population. Individuals were anonymized by blurring faces using RetinaFace [25]. We used OpenPifPaf [26] to extract person tracking bounding boxes and 2-D skeleton data. Before starting the annotation work, we analyzed a subset of the collected video clips and used the proposed detailed labeling mode to define 37 activity classes. Besides the defined classes, other activities occurred in the videos, such as "falling down", "fighting", "kicking", and "throwing trash"; we did not add them to the dataset because their relatively small number of clips would have dramatically affected learning performance. The defined classes are distributed over three levels. General-level labels describe single actions: "standing", "walking", "sitting", "crouching", "cleaning", "jumping", "laying", "riding", "running" and "scooter". Modifier-level labels such as "walkingTogether", "sittingTogether" and "standingTogether" refer to multiple persons or a group of people walking, sitting, or standing together. Aggregate-level labels describe multiple actions in a single label, such as "standingWhileCalling", "standingWhileLookingAtShop", "walkingWhileWatchingPhone", and "sittingWhileHoldingBabyInArms". The complete list of defined labels is shown in Table 8.
Table 8

Activity labels.

General Level (10):
cleaning, crouching, jumping, laying, riding, running, scooter, sitting, standing, walking
Modifier Level (3):
sittingTogether, standingTogether, walkingTogether
Aggregate Level (24):
sittingWhileCalling, sittingWhileDrinking, sittingWhileEating, sittingWhileHoldingBabyInArms,
sittingWhileTalkingTogether, sittingWhileWatchingPhone, standingWhileCalling, standingWhileDrinking,
standingWhileEating, standingWhileHoldingBabyInArms, standingWhileHoldingCart, standingWhileHoldingStroller,
standingWhileLookingAtShops, standingWhileTalkingTogether, standingWhileWatchingPhone, walkingWhileCalling,
walkingWhileDrinking, walkingWhileEating, walkingWhileHoldingBabyInArms,
walkingWhileHoldingCart, walkingWhileHoldingStroller, walkingWhileLookingAtShops,
walkingWhileTalkingTogether, walkingWhileWatchingPhone
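The three label levels above are mechanically distinguishable from the label strings themselves. The following sketch is a hypothetical helper (not part of the dataset's tooling) showing one way to split a label into its level, base action, and secondary activity:

```python
# Hypothetical helper: split a POLIMI-ITW-S label into its taxonomy level.
# Aggregate labels contain "While"; modifier labels end in "Together";
# everything else is a general-level single action.
def parse_label(label):
    if "While" in label:
        base, detail = label.split("While", 1)
        return {"level": "aggregate", "base": base, "detail": detail}
    if label.endswith("Together"):
        return {"level": "modifier", "base": label[: -len("Together")]}
    return {"level": "general", "base": label}
```

Note that a label like "sittingWhileTalkingTogether" is checked for "While" first, so it is correctly classified as aggregate-level rather than modifier-level.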
We also defined a rule for labels containing the keyword "together": it is used only for groups of persons performing social activities. For example, if two or more persons are standing close to each other but are not involved in any social activity, their actions are not labeled as done "together". The dataset was fully labeled with HAVPTAT [27]. We provide RGB videos, person tracking bounding boxes, 2-D skeleton data (17 body key joints) and activity labels in JSON format. To build a high-quality dataset with correct annotations, we adopted a series of measures: training annotators with tutorial slides and video demos, rigorously pre-testing annotators before formal annotation, and cross-validating across annotators. For the reader's convenience, Table 9 shows the COCO 17-keypoint body arrangement [28], [29] adopted in our dataset. Note that, across the entire dataset, joints 4 (left ear) and 5 (right ear) of COCO's 17 keypoints are missing.
Table 9

COCO body keypoints.

1   nose
2   left eye
3   right eye
4   left ear
5   right ear
6   left shoulder
7   right shoulder
8   left elbow
9   right elbow
10  left wrist
11  right wrist
12  left hip
13  right hip
14  left knee
15  right knee
16  left ankle
17  right ankle
Skeletons collected from the real world are often incomplete. A frame containing few joints can be ambiguous for the learning model: for instance, a frame with only two or three joints reduces the chance that a system identifies the corresponding activity class. To reduce this type of learning error, we temporally linearly interpolated the missing joints and kept a pose as valid only if it contains more than a given number of joints (nine). The threshold was set to nine because, in some cases, only part of the body can be captured by the camera. For example, as shown in Fig. 1, when the camera is close to a person, only the upper part of the body is visible, and the keypoints of the lower part (14 left knee, 15 right knee, 16 left ankle, 17 right ankle) are missing. The left picture of Fig. 1 shows the original pose extracted by OpenPifPaf [26]: the right elbow (9) and the left (10) and right (11) wrists were not detected. After processing with the interpolation operation provided by Python's Pandas library [30], the three missing keypoints were reconstructed, as shown in the right picture. This approach was applied to the entire dataset, which we call the "original version" in the experimental phase.
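The interpolation and validity check described above can be sketched as follows. This is a minimal illustration, not the authors' exact code: the function names and the nine-joint constant placement are ours, and the pandas `interpolate` call mirrors the temporal-linear filling described in the text.

```python
import numpy as np
import pandas as pd

MIN_JOINTS = 9  # a pose is valid only if more than nine joints are present

def interpolate_sequence(seq):
    """Fill missing joints along time. seq: (T, 17, 2) with NaN for gaps."""
    t, j, c = seq.shape
    flat = pd.DataFrame(seq.reshape(t, j * c))
    # Temporal-linear interpolation over frames, filling in both directions.
    flat = flat.interpolate(method="linear", limit_direction="both")
    return flat.to_numpy().reshape(t, j, c)

def is_valid_pose(pose):
    """pose: (17, 2); count joints whose (X, Y) are both detected."""
    detected = ~np.isnan(pose).any(axis=1)
    return int(detected.sum()) > MIN_JOINTS
```

For example, a joint missing in frame 1 but seen at (0, 0) in frame 0 and (2, 2) in frame 2 is reconstructed at (1, 1).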
Fig. 1

Original pose (left); Pose reconstructed by “interpolation” (right).

From Table 10 and Fig. 2a, we can see that the original dataset suffers from class imbalance. The most frequent activities are "walking", "standing", and "walkingTogether", which together account for about 55% of the dataset.
Table 10

Number of sequences and frames in the original and the cropped versions of datasets.

ID  Activity class                  #Seq. orig.  #Frame orig.  #Seq. crop.  #Frame crop.
0   cleaning                        665          60,814        665          60,814
1   crouching                       2735         195,537       2735         195,537
2   jumping                         260          12,564        260          12,564
3   laying                          92           7071          92           7071
4   riding                          301          13,555        301          13,555
5   running                         1457         63,498        1457         63,498
6   scooter                         208          11,919        208          11,919
7   sitting                         8070         376,905       3000         182,026
8   sittingTogether                 3044         195,566       3000         193,984
9   sittingWhileCalling             334          42,693        334          42,693
10  sittingWhileDrinking            325          32,778        325          32,778
11  sittingWhileEating              776          81,780        776          81,780
12  sittingWhileHoldingBabyInArms   467          34,998        467          34,998
13  sittingWhileTalkingTogether     766          82,371        766          82,371
14  sittingWhileWatchingPhone       5602         546,327       3000         357,020
15  standing                        34,399       1,785,446     3000         158,669
16  standingTogether                13,367       902,424       3000         202,879
17  standingWhileCalling            2303         307,560       2303         307,560
18  standingWhileDrinking           439          46,009        439          46,009
19  standingWhileEating             1148         125,342       1148         125,342
20  standingWhileHoldingBabyInArms  2059         144,727       2059         144,727
21  standingWhileHoldingCart        576          44,719        576          44,719
22  standingWhileHoldingStroller    1216         121,875       1216         121,875
23  standingWhileLookingAtShops     15,524       1,193,938     3000         220,057
24  standingWhileTalkingTogether    10,310       1,032,687     3000         362,779
25  standingWhileWatchingPhone      9727         990,380       3000         268,797
26  walking                         67,615       3,384,638     3000         142,640
27  walkingTogether                 26,276       1,621,113     3000         179,645
28  walkingWhileCalling             3338         401,582       3338         401,582
29  walkingWhileDrinking            896          86,520        896          86,520
30  walkingWhileEating              1256         128,535       1256         128,535
31  walkingWhileHoldingBabyInArms   3373         198,582       3373         198,582
32  walkingWhileHoldingCart         2381         206,911       2381         206,911
33  walkingWhileHoldingStroller     2806         268,606       2806         268,606
34  walkingWhileLookingAtShops      1674         94,657        1674         94,657
35  walkingWhileTalkingTogether     479          38,259        479          38,259
36  walkingWhileWatchingPhone       7182         581,222       3000         195,973
    Total                           233,446      15,464,108    65,330       5,317,931
Fig. 2

Number of sequences by activity.

We show two annotated snapshots in Fig. 3: Fig. 3a shows bounding boxes with tracking IDs and annotated action labels; Fig. 3b shows the same information plus skeletons.
Fig. 3

Annotation samples.


Experimental Design, Materials and Methods

The structure of the annotated JSON file and its data fields are shown in Fig. 4. An annotated file is composed of the information of all frames, ordered in temporal sequence. Every "frame" contains the main entry "prediction", which includes the detailed data: "keypoints" is composed of 17 tuples (X, Y, confidence_score), with the X, Y coordinates and a confidence score for each joint (in total [] dimensional data shape); "bbox" contains the upper-left X, Y coordinates, width, and height of the bounding box ([4] dimensional data shape); "score" is the confidence score of the bounding box ([1] dimensional data shape); "category_id" is the constant 1, denoting a person following the COCO annotation convention ([1] dimensional data shape); "id_" is the ID of a tracked person ([1] dimensional data shape); "action" is the ground-truth label. A piece of an annotation file is shown in Fig. 5: the example is composed of two frames of a clip, and each frame includes two persons with the "sittingWhileWatchingPhone" and "standing" actions.
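To make the field layout concrete, the sketch below reads one such frame. The sample JSON string is invented for illustration (the exact file schema should be checked against Fig. 4 and the released files); the field names follow the description above.

```python
import json

# Invented single-frame example following the described fields:
# "keypoints" = 17 x (X, Y, confidence), "bbox" = (x, y, w, h),
# "category_id" = 1 (person, COCO convention), "id_" = tracking ID.
frame_json = json.dumps({
    "prediction": [{
        "keypoints": [0.0, 0.0, 0.9] * 17,
        "bbox": [10.0, 20.0, 50.0, 120.0],
        "score": 0.87,
        "category_id": 1,
        "id_": 3,
        "action": "standing",
    }]
})

def read_frame(raw):
    """Return (track_id, action, joints) for every person in one frame."""
    frame = json.loads(raw)
    people = []
    for pred in frame["prediction"]:
        kps = pred["keypoints"]
        # Regroup the flat list into 17 (X, Y, confidence) tuples.
        joints = [tuple(kps[i:i + 3]) for i in range(0, len(kps), 3)]
        people.append((pred["id_"], pred["action"], joints))
    return people
```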
Fig. 4

The structure of annotated JSON format file.

Fig. 5

An annotation example.

We used PyHAPT to pre-process the data [31]. After the annotation work done with HAVPTAT [27], we obtained annotated files like the example shown in Fig. 5. Each joint is represented as an (X, Y) coordinate pair, so a skeleton frame is recorded as an array of 17 pairs with data shape (17, 2). Based on the "id_" field, the person tracking ID, we can compose the keypoints of the same person across T temporal frames to obtain the (T, 17, 2) data shape. For multi-person cases, we take all detected persons in each clip into account. We consider each person performing the same action in a single video clip as one valid action sequence; if the same person performs multiple actions in a single clip, we treat them as different action sequences performed by the same person. Since every action sequence includes the data of a single person, we extend the previous data shape to (T, 17, 2, 1) for convenience of implementation. For the whole dataset, the script reshapes and concatenates the single action sequences into an array of (N, 2, T, 17, 1) dimensions, with N action samples. In summary: N samples of action sequences in total; 2-dimensional (X, Y) skeletal data; an action sequence lasting T frames; 17 keypoints of a human body; 1 person per tuple. All action skeleton sequences are padded to a fixed number of frames by replaying the actions, as also done by other skeletal datasets. The training/test split ratio is 70%/30%; both the number of padded frames and the split ratio can be customized by users. Based on Table 1, we evaluated the four most relevant state-of-the-art human activity recognition models of the last three years (Efficient GCN [14], CTR-GCN [7], MS-G3D [3], 2S-AGCN [1]) on the new POLIMI-ITW-S dataset.
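The shape transformations above can be sketched with NumPy. This is our own minimal illustration of the described (T, 17, 2) → (N, 2, T, 17, 1) pipeline, not PyHAPT's actual API; function names and the replay strategy (tiling the sequence until the target length is reached) are assumptions consistent with the text.

```python
import numpy as np

def pad_by_replay(seq, target_len):
    """Replay a (T, 17, 2) sequence along time until it has target_len frames."""
    reps = -(-target_len // seq.shape[0])     # ceiling division
    return np.tile(seq, (reps, 1, 1))[:target_len]

def to_dataset(sequences, target_len):
    """Stack per-person (T_i, 17, 2) sequences into an (N, 2, T, 17, 1) array."""
    padded = [pad_by_replay(s, target_len)[..., np.newaxis]   # (T, 17, 2, 1)
              for s in sequences]
    stacked = np.stack(padded)                 # (N, T, 17, 2, 1)
    # Move the (X, Y) channel axis forward: (N, 2, T, 17, 1).
    return stacked.transpose(0, 3, 1, 2, 4)
```

For example, two sequences of 5 and 12 frames padded to T = 10 yield an array of shape (2, 2, 10, 17, 1).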
We believe the results are representative of current mainstream human activity recognition algorithms. For a fair comparison, we used joint data for training and testing. As shown in Table 11, the accuracy is below 50%; we infer that state-of-the-art activity recognition models do not perform well on real-life data.
Table 11

The results of skeleton-based activity recognition.

Model               Publisher  Original (%)  Cropped (%)
Efficient GCN [14]  TPAMI 22   48.3          38.5
CTR-GCN [7]         ICCV 21    44.93         34.97
MS-G3D [3]          CVPR 20    43.37         34.05
2S-AGCN [1]         CVPR 19    44.46         34.13
Already during the data collection phase, we noticed that most of the actions occurring in the shopping malls are "walking" and "standing" without any additional action. This leads to an imbalance issue: some highly frequent actions appear far more often than others. For instance, "walking" and "standing" have 80 K+ and 50 K+ sequences, while "walkingWhileEating" and "sittingWhileDrinking" have only 1256 and 325. To evaluate whether the imbalanced classes caused the poor performance, we kept only 3000 sequences for each of the over-represented classes, building a "cropped" version of the dataset, as shown in Fig. 2b, and evaluated the models on it as well. The accuracy was even lower than on the original version, so we conclude that the imbalanced classes are probably not the direct cause of the models' poor performance.
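The "cropped" construction described above (capping over-represented classes at 3000 sequences) can be sketched as follows. The function and its random-subsampling choice are ours; the authors' exact selection procedure is not specified.

```python
import random

CAP = 3000  # per-class cap used for the "cropped" version

def crop_dataset(sequences_by_class, cap=CAP, seed=0):
    """sequences_by_class: dict label -> list of sequences.
    Classes above the cap are randomly subsampled; the rest are kept whole."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    cropped = {}
    for label, seqs in sequences_by_class.items():
        if len(seqs) > cap:
            cropped[label] = rng.sample(seqs, cap)
        else:
            cropped[label] = list(seqs)
    return cropped
```

Applied to Table 10's counts, "walking" (67,615 sequences) would shrink to 3000 while "laying" (92 sequences) stays intact, matching the cropped-version column.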

Ethics Statements

According to the GDPR Art. 89, individuals were properly de-identified by blurring their faces. The dataset can be used for research purposes.

CRediT authorship contribution statement

Hao Quan: Data curation, Software, Writing – original draft, Investigation, Formal analysis. Yu Hu: Data curation, Investigation. Andrea Bonarini: Conceptualization, Methodology, Validation, Writing – review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Subject: Computer Vision and Pattern Recognition
Specific subject area: Human activity recognition
Type of data: 2-D RGB video; annotations composed of person tracking bounding boxes, 2-D skeletons, and activity labels in JSON format
How the data were acquired: The dataset was recorded with the RGB cameras of two smartphones (VIVO S7 5G and Honor 30 S) at 1920×1080 pixels, 30 fps.
Data format: Raw
Description of data collection: We collected the dataset in shopping malls in the Hubei province of China. The malls have multiple floors providing different services, e.g., a main hall with grocery stores on the ground floor; clothes shops on the second and third floors; restaurants and drink bars on the fourth floor; a cinema with a waiting room on the fifth floor; a supermarket on the underground floor. The diverse settings guarantee the desired variety of subjects and situations. The subjects are clients and staff of the shopping malls, of different genders and ages: men and women, babies, children, teenagers, adults and elderly people. As a result, almost every recorded video clip features different subjects.
The cameras were hand-held at about 90 cm from the floor. The recorders imitated a mobile robot, moving or staying still while looking around to capture people performing actions.
We did not mount the cameras on a robot, to avoid the uncommon situations that the presence of a robot may trigger. We did not use 3-D stereo/depth cameras, since this type of camera suffers from motion issues, making it unsuitable for data collection from mobile robots; moreover, such cameras may not perform well when subjects are far from the camera.
Data source location: Public space: shopping malls; City: Shiyan; Province: Hubei; Country: China
Data accessibility: Repository name: Science Data Bank; Data identification number: sciencedb.01694; Direct URL to data: https://doi.org/10.57760/sciencedb.01694
  5 in total

1.  Jointly Learning Heterogeneous Features for RGB-D Activity Recognition.

Authors:  Jian-Fang Hu; Wei-Shi Zheng; Jianhuang Lai; Jianguo Zhang
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2016-12-15       Impact factor: 6.226

2.  View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition.

Authors:  Pengfei Zhang; Cuiling Lan; Junliang Xing; Wenjun Zeng; Jianru Xue; Nanning Zheng
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2019-01-31       Impact factor: 6.226

3.  NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding.

Authors:  Jun Liu; Amir Shahroudy; Mauricio Perez; Gang Wang; Ling-Yu Duan; Alex C Kot
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2019-05-14       Impact factor: 6.226

4.  Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition.

Authors:  Yi-Fan Song; Zhang Zhang; Caifeng Shan; Liang Wang
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2022-03-07       Impact factor: 6.226

5.  Symbiotic Graph Neural Networks for 3D Skeleton-Based Human Action Recognition and Motion Prediction.

Authors:  Maosen Li; Siheng Chen; Xu Chen; Ya Zhang; Yanfeng Wang; Qi Tian
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2022-05-05       Impact factor: 6.226

