Abstract
Human action recognition has been applied in many fields, such as video surveillance and human–computer interaction, where it helps to improve performance. Numerous reviews of the literature have been done, but rarely have these reviews concentrated on skeleton-graph-based approaches. Connecting the skeleton joints according to the body's physical structure naturally generates a graph. This paper provides an up-to-date review for readers on skeleton graph-neural-network-based human action recognition. After analyzing previous related studies, a new taxonomy for skeleton-GNN-based methods is proposed according to their designs, and their merits and demerits are analyzed. In addition, the datasets and codes are discussed. Finally, future research directions are suggested.
Keywords: graph neural networks; human action recognition; skeleton graphs; survey
Year: 2022 PMID: 35336262 PMCID: PMC8952863 DOI: 10.3390/s22062091
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
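The graph the abstract refers to can be made concrete: connecting joints along the body's bones yields an adjacency matrix that GNNs operate on. A minimal sketch, assuming the 25-joint Kinect v2 layout used by NTU RGB+D (the bone list and normalization below are illustrative assumptions, not taken from this paper):

```python
# Sketch: build a normalized skeleton graph from (child, parent) bone pairs.
# The 25-joint layout below is assumed (Kinect v2 / NTU RGB+D style), 0-indexed.
import numpy as np

NUM_JOINTS = 25
BONES = [(0, 1), (1, 20), (2, 20), (3, 2), (4, 20), (5, 4), (6, 5), (7, 6),
         (8, 20), (9, 8), (10, 9), (11, 10), (12, 0), (13, 12), (14, 13),
         (15, 14), (16, 0), (17, 16), (18, 17), (19, 18), (21, 22), (22, 7),
         (23, 24), (24, 11)]

def skeleton_adjacency(num_joints=NUM_JOINTS, bones=BONES):
    """Symmetric adjacency with self-loops, normalized as D^-1/2 (A+I) D^-1/2."""
    A = np.eye(num_joints)                  # self-loops keep each joint's own feature
    for i, j in bones:
        A[i, j] = A[j, i] = 1.0             # undirected physical connection
    d_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
    return d_inv_sqrt @ A @ d_inv_sqrt

A_hat = skeleton_adjacency()                # (25, 25), usable as X' = A_hat @ X @ W
```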
Figure 1 The method ST-GCN [13].
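For orientation, a condensed sketch of the block Figure 1 depicts: a spatial graph convolution over the normalized skeleton adjacency, followed by a temporal convolution over frames. This is a simplification; the actual ST-GCN [13] additionally partitions neighbors into several adjacency matrices with learnable edge weights.

```python
# Sketch of an ST-GCN-style block (simplified, single adjacency matrix).
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, A, t_kernel=9):
        super().__init__()
        self.register_buffer("A", A)            # normalized (V, V) skeleton graph
        self.gcn = nn.Conv2d(in_ch, out_ch, 1)  # per-joint feature transform
        self.tcn = nn.Sequential(               # temporal convolution over T
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, (t_kernel, 1), padding=(t_kernel // 2, 0)),
            nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU()

    def forward(self, x):                       # x: (N, C, T, V)
        y = torch.einsum("nctv,vw->nctw", self.gcn(x), self.A)  # spatial aggregation
        return self.relu(self.tcn(y))
```

The adjacency from the earlier snippet can be passed in as `torch.tensor(A_hat, dtype=torch.float32)`.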
Figure 2 The method 2s-AGCN [14]; B-Stream and J-Stream stand for bone stream and joint stream, respectively.
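The bone stream in Figure 2 is commonly derived from the joint stream by taking, for each bone, the coordinate difference between a joint and its parent; a sketch, reusing the assumed (child, parent) pairs from the first snippet:

```python
# Sketch: derive B-Stream input from J-Stream input (two-stream setup as in 2s-AGCN).
import torch

def joints_to_bones(x, bones=BONES):
    """x: (N, C, T, V) joint coordinates -> (N, C, T, V) bone vectors."""
    b = torch.zeros_like(x)                  # root joints keep zero vectors
    for child, parent in bones:
        b[..., child] = x[..., child] - x[..., parent]
    return b
```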
Figure 3 Illustration of (a) the spatial-based approach, (b) the spatiotemporal-based approach and (c) the generated approach. Among the figures, green trapezoid blocks denote the GNN-HAR model, with dashed lines marking the unfixed model structure; pink trapezoid blocks denote action classifiers; orange trapezoid blocks denote hidden states; and purple blocks denote tasks other than HAR. The multiple skeletons together in (b,c) stand for skeleton graphs of video clips. The purple arrow in (c) denotes supervision from tasks such as adversarial learning and knowledge distillation.
The summary of datasets. 'RGB' is RGB videos, 'IR' is infrared sequences, 'D' is depth maps, 'S' is skeletons, 'PD' is pose direction, 'OP' is optical flow, 'WIS' is wearable inertial sensor, 'TOF' is time-of-flight sensor, 'RRM' is retro-reflective markers, 'H' is head movement, 'Script' is dialog transcriptions, and 'M' is mesh data.
| Name | Sensors | Subjects | Views | Actions | Data | Year | Types |
|---|---|---|---|---|---|---|---|
| CMU Mocap | Vicon | 144 | - | 23 | RGB+S | 2003 | Indoor simulation, including interaction |
| HDM05 | RRM | 5 | 6 | >70 | RGB+S | 2007 | Indoor simulation |
| IEMOCAP | Vicon | - | 8 | 9 | RGB+H+Script | 2008 | Emotion and speech dataset |
| CA | Hand-held camera | - | - | 5 | RGB+PD | 2009 | Group activities |
| TUM | - | - | 4 | 9 (l), 9 (r), 2 (t) | RGB+S | 2009 | Activities in kitchen |
| MSR Action3D | - | 10 | 1 | 20 | D+S | 2010 | Interaction with game consoles |
| CAD 60 | Kinect v1 | 4 | - | 12 | RGB+D+S | 2011 | Human–object interaction |
| MSR DailyActivity3D | Kinect | 10 | 1 | 16 | RGB+D+S | 2012 | Daily activities in living room |
| UT-Kinect | Kinect v1 | 10 | 4 | 10 | RGB+D+S | 2012 | Indoor simulation |
| Florence3D | Kinect v1 | 10 | - | 9 | RGB+S | 2012 | Indoor simulation |
| SBU Kinect Interaction | Kinect | 7 | - | 8 | RGB+D+S | 2012 | Human–human interaction simulation |
| J-HMDB | HMDB51 | - | 16 | 21 | RGB+S | 2013 | Annotated subset of HMDB51, 2D skeletons, real-life |
| 3D Action Pairs | Kinect v1 | 10 | 1 | 12 | RGB+D+S | 2013 | Each pair has similarity in motion and shape |
| CAD 120 | Kinect v1 | 4 | - | 10+10 | RGB+D+S | 2013 | Human–object interaction |
| ORGBD | - | 36 | - | 7 | RGB+D+S | 2014 | Human–object interaction |
| Human3.6M | Laser scanner, TOF | 11 | 4 | 17 | RGB+S+M | 2014 | Indoor simulation, meshes |
| N-UCLA | Kinect v1 | 10 | 3 | 10 | RGB+D+S | 2014 | Daily action simulation |
| UWA3D Multiview | Kinect v1 | 10 | 1 | 30 | RGB+D+S | 2014 | Different scales, including self-occlusions and human–object interaction |
| UWA3D Multiview Activity II | Kinect v1 | 10 | 4 | 30 | RGB+D+S | 2015 | Different views and scales, including self-occlusions and human–object interaction |
| UTD-MHAD | Kinect v1 + WIS | 8 | 1 | 27 | RGB+D+S | 2015 | Indoor single-subject simulation |
| NTU RGB+D | Kinect v2 | 40 | 80 | 50+10 | RGB+IR+D+S | 2016 | Simulation, including human–human interaction |
| Charades | - | 267 | - | 157 | RGB+OP | 2016 | Real-life daily indoor activities |
| UOW LSC | Kinect | - | - | 94 | RGB+D+S | 2016 | Combined dataset |
| StateFarm | Kaggle competition | - | - | 10 | RGB | 2016 | Real driving videos |
| DHG-14/28 | Intel RealSense | 20 | - | 14/28 | RGB+D+S | 2016 | Hand gestures |
| Volleyball | YouTube volleyball | - | - | 9 | RGB | 2016 | Group activities, volleyball |
| SYSU | Kinect v1 | 40 | 1 | 12 | RGB+D+S | 2017 | Human–object interaction |
| SHREC’17 | Intel RealSense | 28 | - | - | RGB+D+S | 2017 | Hand gestures |
| Kinetics | YouTube | - | - | 400 | RGB | 2017 | Real life, including human–object and human–human interaction |
| PKU-MMD | Kinect | 66 | 3 | 51 | RGB+D+IR+S | 2017 | Daily action simulation, including interactions |
| ICVL-4 | - | - | - | 13 | RGB | 2018 | Human–object actions in real life, a subset of ICVL |
| UW-IOM | Kinect | 20 | - | 17 | RGB+D+S | 2019 | Indoor object manipulation |
| NTU RGB+D 120 | Kinect v2 | 106 | 155 | 94+26 | RGB+IR+D+S | 2019 | Simulation, including human–human interaction |
| IRD | CCTV | - | - | 2 | RGB | 2019 | Illegal rubbish dumping in real life |
| HiEve | - | - | - | 14 | RGB+S | 2020 | Multi-person events in complex scenes |
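The CV and CS columns in the method table that follows are NTU RGB+D's cross-view and cross-subject evaluation protocols. A sketch of how a sample is routed under each protocol from its file name (e.g., `S001C002P003R002A013`); the subject and camera ID lists follow the NTU RGB+D paper but should be verified against the official release:

```python
# Sketch: NTU RGB+D cross-subject (CS) / cross-view (CV) split from a sample name.
CS_TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19,
                     25, 27, 28, 31, 34, 35, 38}   # assumed from the NTU paper
CV_TRAIN_CAMERAS = {2, 3}

def split_of(name, protocol="CS"):
    camera = int(name[5:8])     # "CxxX" field of the file name
    subject = int(name[9:12])   # "PxxX" field of the file name
    if protocol == "CS":
        return "train" if subject in CS_TRAIN_SUBJECTS else "test"
    return "train" if camera in CV_TRAIN_CAMERAS else "test"
```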
The summary of methods and their accuracy. The accuracy on each dataset is top-1. AE is autoencoder, SVM is support vector machine, LSTM is long short-term memory network, TCN is temporal convolution network, RL is reinforcement learning, CAM is class activation map, GFT is graph Fourier transform, SE is squeeze-excitation block, conv. is convolution, PGNs is pyramidal GCNs, and FV is Fisher vector.
| Name | Code | Year | Details | Kinetics | NTU RGB+D CV | NTU RGB+D CS | NTU RGB+D 120 CV | NTU RGB+D 120 CS |
|---|---|---|---|---|---|---|---|---|
| GCN [ | - | 2017 | GCN+SVM | - | - | - | - | - |
| STGR [ | - | 2018 | Concatenate spatial router and temporal router | 33.6 | 92.3 | 86.9 | - | - |
| AS-GCN [ | | 2018 | AE, learn edges | 34.8 | 94.2 | 86.8 | - | - |
| [ | - | 2018 | GCN+LSTM, adaptively weighting skeletal joints | - | 82.8 | 72.74 | - | - |
| GR-GCN [ | - | 2018 | LSTM | - | 94.3 | 87.5 | - | - |
| SR-TSL [ | - | 2018 | GCN+clip LSTM | - | 92.4 | 84.8 | - | - |
| DPRL [ | - | 2018 | RL | - | 89.8 | 83.5 | - | - |
| BPLHM [ | - | 2018 | Edge aggregation | 33.4 | 91.1 | 85.4 | - | - |
| ST-GCN [ | | 2018 | ST-GCN | 30.7 | 88.3 | 81.5 | - | - |
| PB-GCN [ | | 2018 | Skip connection, subgraphs, graphs are overlapped | - | 93.2 | 87.5 | - | - |
| 3s RA-GCN [ | | 2019 | ST-GCN backbone, softmax for CAM | - | 93.5 | 85.9 | - | - |
| AR-GCN [ | - | 2019 | Skip connection, BRNN + attentioned ST-GCN, spatial and temporal attention | 33.5 | 93.2 | 85.1 | - | - |
| GVFE+AS-GCN with DH-TCN [ | - | 2019 | ST-GCN-based, dilated temporal CNN, skip connection | - | 92.8 | 85.3 | - | 78.3 |
| BAGCN [ | - | 2019 | LSTM | - | 96.3 | 90.3 | - | - |
| [ | - | 2019 | ST-GCN | - | 89.6 | 82.6 | - | - |
| [ | - | 2019 | GFT | - | - | - | - | - |
| OHA-GCN [ | - | 2019 | Human–object, frame selection + GCN | - | - | - | - | - |
| AM-STGCN [ | - | 2019 | Attention | 32.9 | 91.4 | 83.4 | - | - |
| [ | - | 2019 | GCN | 30.59 | 88.87 | 80.66 | - | - |
| [ | - | 2019 | Add edges, hand gestures | - | - | - | - | - |
| GCN-HCRF [ | - | 2019 | HCRF, directed message passing | - | 91.7 | 84.3 | - | - |
| Si-GCN [ | - | 2019 | Structure induced part-graphs | - | 89.05 | 84.15 | - | - |
| 4s DGNN [ | | 2019 | Directed graph | 36.9 | 96.1 | 89.9 | - | - |
| 2s-AGCN [ | | 2019 | Two stream, attention | 36.1 | 95.1 | 88.5 | - | - |
| 2s AGC-LSTM [ | - | 2019 | Attention | - | 95.0 | 89.2 | - | - |
| SDGCN [ | - | 2019 | Skip connection | - | 95.74 | 89.58 | - | - |
| JRIN-SGCN [ | - | 2019 | Adjacent inference | 35.2 | 91.9 | 86.2 | - | - |
| JRR-GCN [ | - | 2019 | RL for joint-relation-reasoning | 34.8 | 91.2 | 85.89 | - | - |
| 3heads-MA-GCN [ | - | 2019 | Multi-heads attention | - | 91.5 | 86.9 | - | - |
| GC-LSTM [ | - | 2019 | LSTM | - | 92.3 | 83.9 | - | - |
| Bayesian GC-LSTM [ | - | 2019 | Bayesian for the parameters of GC-LSTM | - | 89 | 81.8 | - | - |
| RGB + skeleton [ | - | 2020 | Cross attention (joints + scenario context information), ST-GCN backbone | 39.9 | 89.27 | 84.23 | - | - |
| ST-GCN-jpd [ | - | 2020 | ST-GCN backbone | - | 88.84 | 83.36 | - | - |
| [ | - | 2020 | Skip connection, attention to select joints | - | - | 90.7 | - | - |
| 2s-FGCN [ | - | 2020 | Fully connected graph | 36.3 | 95.6 | 88.7 | - | - |
| 2s-GAS-GCN [ | - | 2020 | Gated CNN, channel attention | 37.8 | 96.5 | 90.4 | - | 86.4 |
| SGP-JCA-GCN [ | - | 2020 | Structure-based graph pooling, learn edges between human parts | - | 93.1 | 86.1 | - | - |
| 4s Shift-GCN [ | | 2020 | Cheap computation, shift graph convolution | - | 96.5 | 90.7 | 85.9 | 87.6 |
| STG-INs [ | - | 2020 | LSTM | - | 88.7 | 85.8 | - | - |
| MS-AGCN [ | - | 2020 | Multistream, AGCN backbone | - | 95.8 | 90.5 | - | - |
| MSGCN [ | - | 2020 | Attention+SE block, scaled by part division | - | 95.7 | 88.8 | - | - |
| Res-split GCN [ | - | 2020 | Directed graph, skip connection | 37.2 | 96.2 | 90.2 | - | - |
| Stacked-STGCN [ | - | 2020 | Hourglass, human–object-scene nodes | - | - | - | - | - |
| 2s-PST-GCN [ | - | 2020 | Find new topology | 35.53 | 95.1 | 88.68 | - | - |
| 2s-ST-BLN [ | - | 2020 | Symmetric spatial attention, symmetric of relative positions of joints | - | 95.1 | 87.8 | - | - |
| 4s-TA-GCN [ | - | 2020 | Skip connection, temporal attention | 36.9 | 95.8 | 89.91 | - | - |
| DAG-GCN [ | - | 2020 | Joint and channel attention, build dependence relations for bone nodes | - | 95.76 | 90.01 | 82.44 | 79.03 |
| LSGM+GTSC [ | - | 2020 | LSTM, feature calibration, temporal attention | - | 91.74 | 84.71 | - | - |
| VT+GARN (Joint&Part) [ | - | 2020 | View-invariant, RNN | - | - | - | - | - |
| [ | - | 2020 | Skeleton fusion, ST-GCN backbone | - | - | 82.9 | - | - |
| 4s-EE-GCN [ | - | 2020 | One-shot aggregation, CNN | 39.1 | 96.8 | 91.6 | - | 87.4 |
| MS-ESTGCN [ | - | 2020 | Spatial conv. + temporal conv. | 39.4 | 96.8 | 91.4 | - | - |
| EN-GCN [ | - | 2020 | Fuse edge and node | - | 91.6 | 83.2 | - | - |
| MS TE-GCN [ | - | 2020 | GCN+1DCNN as TCN | - | 96.2 | 90.8 | - | 84.4 |
| ST-GCN+channel augmentation [ | - | 2020 | ST-GCN+new features from parameterized curve | - | 91.3 | 83.4 | - | 77.3 |
| RHCN+ACSC + STUFE [ | - | 2020 | CNN, skeleton alignment | - | 92.5 | 86.9 | - | - |
| [ | - | 2020 | Use MS-G3D to extract features, multiple person | - | - | - | - | - |
| MS-TGN [ | - | 2020 | Multi-scale graph | 37.3 | 95.9 | 89.5 | - | - |
| MM-IGCN [ | - | 2020 | Attention, skip connection, TCN | - | 96.7 | 91.3 | - | 88.8 |
| SlowFast-GCN [ | - | 2020 | Two stream with 2 temporal resolution, ST-GCN backbone | - | 90.0 | 83.8 | - | - |
| VE-GCN [ | - | 2020 | CRF as loss, distance-based partition, learn edges | - | 95.2 | 90.1 | - | 84.5 |
| RV-HS-GCNs [ | - | 2020 | GCN, interaction representation | - | 96.61 | 93.79 | - | 88.2 |
| MS-G3D [ | | 2020 | Dilated window, GCN+TCN | 38 | 96.2 | 91.5 | 86.9 | 88.4 |
| WST-GCN [ | - | 2020 | Multi ST-GCN, ranking loss | - | 89.8 | 79.9 | - | - |
| MS-AAGCN+TEM [ | - | 2020 | Extended TCN as TEM | 38.6 | 96.5 | 91 | - | - |
| ST-PGN [ | - | 2020 | GRU, PGNs+LSTM | - | - | - | - | - |
| GCN-NAS [ | | 2020 | GRU, PGNs, LSTM | 37.1 | 95.7 | 89.4 | - | - |
| Poincare-GCN [ | - | 2020 | ST-GCN on Poincare space | - | 96 | 89.7 | - | 80.5 |
| ST-TR [ | | 2020 | Spatial self-attention + temporal self-attention | - | 96.1 | 89.9 | - | 81.9 |
| S-STGCN [ | - | 2020 | Skip connection, self-attention | - | - | - | - | - |
| MS-AAGCN [ | | 2020 | Attention | 37.8 | 96.2 | 90 | - | - |
| HSR-TSL [ | - | 2020 | Skip connection, skip-clip LSTM | - | 92.4 | 84.8 | - | - |
| PA-ResGCN [ | | 2020 | Part attention | - | 96.0 | 90.9 | - | 87.3 |
| 3s RA-GCN [ | | 2020 | Occlusion, select joints, ST-GCN backbone | - | 93.6 | 87.3 | - | 81.1 |
| IE-GCN [ | - | 2020 | | 35.0 | 95.0 | 89.2 | - | - |
| FV-GNN [ | - | 2020 | FV encoding, ST-GCN as feature extractor | 31.9 | 89.8 | 81.6 | - | - |
| GINs [ | - | 2020 | Two skeletons, transfer learning (teacher–student) | - | - | - | - | - |
| MV-IGNet [ | | 2020 | Two graphs, multi-scale graph | - | 96.1 | 88.8 | - | 83.9 |
| GCLS [ | - | 2020 | Spatial attention, channel attention | 37.5 | 96.1 | 89.5 | - | - |
| AMCGC-LSTM [ | - | 2020 | LSTM, point, joint and scene level transformation | - | 87.6 | 80.1 | - | 71.7 |
| GGCN+FSN [ | - | 2020 | RL, TCN, feature fusion based on LSTM | 36.7 | 95.7 | 90.1 | - | 85.1 |
| ST-GCN-PAM [ | | 2020 | Pairwise adjacency, ST-GCN backbone, interaction | 41.68 | - | - | 76.85 | 73.87 |
| CGCN [ | - | 2020 | ST-GCN backbone | 37.5 | 96.4 | 90.3 | - | - |
| FGCN [ | - | 2020 | Dense connection of GCN layers | - | 96.3 | 90.2 | - | 85.4 |
| PGCN-TCA [ | - | 2020 | Learn graph connections, spatial+channel attention | - | 93.6 | 88.0 | - | - |
| Dynamic GCN [ | - | 2020 | Separable CNN as CeN to regress adjacency matrix | 37.9 | 96.0 | 91.5 | - | 87.3 |
| PeGCN [ | | 2020 | AGCN backbone | 34.8 | 93.4 | 85.6 | - | - |
| CA-GCN [ | - | 2020 | Directed graph, vertex information aggregation | 34.1 | 91.4 | 83.5 | - | - |
| SFAGCN [ | - | 2020 | Gated TCN | 38.3 | 96.7 | 91.2 | - | 87.3 |
| 2s-AGCN+PM-STFGCN [ | - | 2020 | Attention, AGCN/ST-GCN backbone | 38.1 | 96.5 | 91.9 | - | - |
| 2s-TL-GCN [ | - | 2020 | | 36.2 | 95.4 | 89.2 | - | - |
| 2s-WPGCN [ | - | 2020 | 5-parts directed subgraph, GCN backbone | 39.1 | 96.5 | 91.1 | - | 87.0 |
| SAGP [ | - | 2020 | Attention, spectral sparse graph | 36.6 | 96.9 | 91.3 | - | 67.5 |
| JOLO-GCN (2s-AGCN) [ | - | 2020 | Descriptor of motion, ST-GCN/AGCN backbone | 38.3 | 98.1 | 93.8 | - | 87.6 |
| Sem-GCN [ | - | 2020 | Attention, skip connection | 34.3 | 94.2 | 86.2 | - | - |
| SGN [ | | 2020 | Use semantics (frame + joint index) | - | 94.5 | 89.0 | - | 79.2 |
| Hyper-GNN [ | - | 2021 | Add hyperedges, attention, skip connection | 37.1 | 95.7 | 89.5 | - | - |
| SEFN [ | - | 2021 | Multi-perspective Attention, AGC+TGC block | 39.3 | 96.4 | 90.7 | - | 86.2 |
| Sym-GNN [ | - | 2021 | Multiple graph, one-hop, GRU | 37.2 | 96.4 | 90.1 | - | - |
| PR-GCN [ | | 2021 | MCNN, attention, pose refinement | 33.7 | 91.7 | 85.2 | - | - |
| GCN-HCRF [ | - | 2021 | HCRF | - | 95.5 | 90.0 | - | - |
| 2s-ST-GDN [ | - | 2021 | GDN, part-wise attention | 37.3 | 95.9 | 89.7 | - | 80.8 |
| STV-GCN [ | - | 2021 | ST-GCN to obtain emotional state, KNN | - | - | - | - | - |
| MMDGCN [ | - | 2021 | Dense GCN, ST-attention | 37.6 | 96.5 | 90.8 | - | 86.8 |
| CC-GCN [ | - | 2021 | CNN, generate new graph | 36.7 | 95.33 | 88.87 | - | - |
| SGCN-CAMM [ | - | 2021 | GCN, redundancies, merge nodes by weighted summation of original nodes | 37.1 | 96.2 | 90.1 | - | - |
| DCGCN [ | | 2021 | ST-GCN backbone, attentioned graph dropout | - | 96.6 | 90.8 | - | 86.5 |
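A recurring design in the table above is a learned or data-dependent topology on top of the fixed skeleton ("learn edges", "adaptive", "regress adjacency matrix"). A generic sketch in the spirit of 2s-AGCN [14], with illustrative names of our own: the effective graph is the fixed adjacency A plus a learned global offset B plus a per-sample similarity term C.

```python
# Sketch: graph convolution with fixed + learned + data-dependent adjacency.
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    def __init__(self, in_ch, out_ch, A, embed_ch=16):
        super().__init__()
        self.register_buffer("A", A)                 # fixed skeleton graph (V, V)
        self.B = nn.Parameter(torch.zeros_like(A))   # learned global offset
        self.theta = nn.Conv2d(in_ch, embed_ch, 1)   # query embedding
        self.phi = nn.Conv2d(in_ch, embed_ch, 1)     # key embedding
        self.out = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):                            # x: (N, C, T, V)
        q = self.theta(x).mean(2).permute(0, 2, 1)   # (N, V, E), pooled over time
        k = self.phi(x).mean(2)                      # (N, E, V)
        C = torch.softmax(q @ k, dim=-1)             # (N, V, V) sample-specific graph
        graph = self.A + self.B + C                  # combine the three terms
        return torch.einsum("nctv,nvw->nctw", self.out(x), graph)
```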
Figure 4 The framework of this paper, where RNN means recurrent neural networks, CNN means convolutional neural networks and DL means deep learning.
Figure 5 The input graphs: (a) green denotes the temporal dimension and blue the spatial dimension; (b) demonstrates shaking hands [47].
Figure 6 The generated graph [49], where (a) shows the original skeleton for the action phone call and (b,c) illustrate the inferred action-specific skeletons for that action, with new edges in green.
Figure 7 Examples of spatial methods: (a) the CRF approach [87]; the others are RNN methods, namely (b) a separated approach [88], (c) a bidirectional approach [53] and (d) an aggregated approach [89].
Figure 8 Examples of spatiotemporal-based methods: (a) an approach that expands the timescale [99], (b) an approach that modifies the GCN [100] and (c) an RNN-based approach [83].
Figure 9 Examples of generated methods: (a–c) are self-supervised approaches, namely (a) an autoencoder (AE) approach [77], (b) an adversarial approach [69] and (c) a teacher–student approach [115]; (d) is a neural architecture search (NAS) approach [116].
Figure 10 Examples of (a) self-attention [118], (b) skip connections [119] and (c) the effective squeeze-excitation (eSE) block [120], where (b,c) are dense blocks.
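A hedged reconstruction of the eSE idea in Figure 10c, adapted to (N, C, T, V) skeleton features: channel attention with a single FC layer instead of SE's two-layer bottleneck (the cited eSE uses a hard sigmoid; a plain sigmoid is used here for brevity):

```python
# Sketch: effective squeeze-excitation (eSE) channel gating for skeleton features.
import torch
import torch.nn as nn

class ESE(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, 1)   # single FC, no channel reduction

    def forward(self, x):                            # x: (N, C, T, V)
        w = x.mean(dim=(2, 3), keepdim=True)         # squeeze over time and joints
        return x * torch.sigmoid(self.fc(w))         # excite: per-channel gating
```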
Figure 11 The candidates for a multi-modality framework.
Figure 12 Examples of frameworks, where (a) [106] and (b) [140] are examples that change the space, and (c,d) [141] are examples of neighbor convolution.
The summary of datasets and methods.
| Name | Papers | Action List |
|---|---|---|
| CMU Mocap | [ | Human Interaction, Interaction With Environment, Locomotion, Physical Activities + Sports, Situations + Scenarios |
| HDM05 | [ | Walk, Run, Jump, Grab and Deposit, Sports, Sit and Lie Down, Miscellaneous Motions |
| IEMOCAP | [ | Anger, Happiness, Excitement, Sadness, Frustration, Fear, Surprise, Other and Neutral State |
| CA | [ | Null, Crossing, Wait, Queueing, Walk, Talk |
| TUM | [ | Set the Table, Transport Each Object Separately as Done by an Inefficient Robot, Take Several Objects at Once as Humans Usually Do, Iteratively Pick Up and Put Down Objects From and to Different Places |
| MSR Action3D | [ | High Arm Wave, Horizontal Arm Wave, Hammer, Hand Catch, Forward Punch, High Throw, Draw X, Draw Tick, Draw Circle, Hand Clap, Two Hand Wave, Side Boxing, Bend, Forward Kick, Side Kick, Jogging, Tennis Swing, Tennis Serve, Golf Swing, Pick Up + Throw |
| CAD 60 | [ | Still, Rinse Mouth, Brush Teeth, Wear Contact Lenses, Talk on Phone, Drink Water, Open Pill Container, Cook (Chop), Cook (Stir), Talk on Couch, Relax on Couch, Write on Whiteboard, Work on Computer |
| MSR DailyActivity3D | [ | Drink, Eat, Read Book, Call Cellphone, Write on a Paper, Use Laptop, Use Vacuum Cleaner, Cheer Up, Sit Still, Toss Paper, Play Game, Lay Down on Sofa, Walk, Play Guitar, Stand Up, Sit Down |
| UT-Kinect | [ | Walk, Sit Down, Stand Up, Pick Up, Carry, Throw, Push, Pull, Wave Hands, Clap Hands |
| Florence3D | [ | Wave, Drink From a Bottle, Answer Phone, Clap, Tight Lace, Sit Down, Stand Up, Read Watch, Bow |
| SBU Kinect Interaction | [ | Approach, Depart, Push, Kick, Punch, Exchange Objects, Hug, and Shake Hands |
| J-HMDB | [ | Brush Hair, Catch, Clap, Climb Stairs, Golf, Jump, Kick Ball, Pick, Pour, Pull-Up, Push, Run, Shoot Ball, Shoot Bow, Shoot Gun, Sit, Stand, Swing Baseball, Throw, Walk, Wave |
| 3D Action Pairs | [ | Pick Up a Box/Put Down a Box, Lift a Box/Place a Box, Push a Chair/Pull a Chair, Wear a Hat/Take Off a Hat, Put on a Backpack/Take Off a Backpack, Stick a Poster/Remove a Poster |
| CAD 120 | [ | Make Cereal, Take Medicine, Stack Objects, Unstack Objects, Microwave Food, Pick Objects, Clean Objects, Take Food, Arrange Objects, Have a Meal |
| ORGBD | [ | Drink, Eat, Use Laptop, Read Cellphone, Make Phone Call, Read Book, Use Remote |
| Human3.6M | [ | Conversations, Eat, Greet, Talk on the Phone, Pose, Sit, Smoke, Take Photos, Wait, Walk in Various Non-Typical Scenarios (With a Hand in the Pocket, Talk on the Phone, Walk a Dog, or Buy an Item) |
| N-UCLA | [ | Pick Up With One Hand, Pick Up With Two Hands, Drop Trash, Walk Around, Sit Down, Stand Up, Donning, Doffing, Throw, Carry |
| UWA3D Multiview | [ | One Hand Wave, One Hand Punch, Sit Down, Stand Up, Hold Chest, Hold Head, Hold Back, Walk, Turn Around, Drink, Bend, Run, Kick, Jump, Mop Floor, Sneeze, Sit Down (Chair), Squat, Two Hand Wave, Two Hand Punch, Vibrate, Fall Down, Irregular Walk, Lie Down, Phone Answer, Jump Jack, Pick Up, Put Down, Dance, Cough |
| UWA3D Multiview Activity II | [ | One Hand Wave, One Hand Punch, Two Hand Wave, Two Hand Punch, Sit Down, Stand Up, Vibrate, Fall Down, Hold Chest, Hold Head, Hold Back, Walk, Irregular Walk, Lie Down, Turn Around, Drink, Phone Answer, Bend, Jump Jack, Run, Pick Up, Put Down, Kick, Jump, Dance, Mop Floor, Sneeze, Sit Down (Chair), Squat, Cough |
| UTD-MHAD | [ | Indoor Daily Activities. Check |
| NTU RGB+D | [ | |
| Charades | [ | |
| UOW LSC | [ | Large Motions of All Body Parts, E.g., Spinal Stretch, Raising Hands and Jumping, and Small Movements of One Part, E.g., Head Anticlockwise Circle. Check |
| StateFarm | [ | Safe Drive, Text-Right, Talk on the Phone-Right, Text-Left, Talk on the Phone-Left, Operate the Radio, Drink, Reach Behind, Hair and Makeup, Talk to Passenger |
| DHG-14/28 | [ | Grab, Tap, Expand, Pinch, Rotate Clockwise, Rotate Counter-Clockwise, Swipe Right, Swipe Left, Swipe Up, Swipe Down, Swipe X, Swipe V, Swipe +, Shake |
| Volleyball | [ | Wait, Set, Dig, Fall, Spike, Block, Jump, Move, Stand |
| SYSU | [ | Drink, Pour, Call Phone, Play Phone, Wear Backpacks, Pack Backpacks, Sit Chair, Move Chair, Take Out Wallet, Take From Wallet, Mop, Sweep |
| SHREC’17 | [ | Grab, Tap, Expand, Pinch, Rotate Clockwise, Rotate Counter-Clockwise, Swipe Right, Swipe Left, Swipe Up, Swipe Down, Swipe X, Swipe V, Swipe + and Shake |
| Kinetics | [ | |
| PKU-MMD | [ | 41 Daily + 10 Interactions. Details are shown in |
| ICVL-4 | [ | Sit, Stand, Stationary, Walk, Run, Nothing, Text, and Smoke, Others |
| UW-IOM | [ | 17 Actions as a Hierarchy Combination of four Tiers: Whether the Box or the Rod Is Manipulated, Human Motion (Walk, Stand and Bend), Captures the Type of Object Manipulation if Applicable (Reach, Pick-Up, Place and Hold) and the Relative Height of the Surface Where Manipulation Is Taking Place (Low, Medium and High) |
| NTU RGB+D 120 | [ | 82 Daily Actions (Eating, Writing, Sitting Down etc.), 12 Health-Related Actions (Blowing Nose, Vomiting etc.) and 26 Mutual Actions (Handshaking, Pushing etc.). |
| IRD | [ | Garbage Dump, Normal |
| HiEve | [ | Walk-Alone, Walk-Together, Run-Alone, Run-Together, Ride, Sit-Talk, Sit-Alone, Queuing, Stand-Alone, Gather, Fight, Fall-Over, Walk-Up-Down-Stairs and Crouch-Bow |
Figure A1 Examples of datasets. (a) CMU Mocap [167]; (b) HDM05 [155]; (c) IEMOCAP [156,197]; (d) MSR Action3D [147]; (e) TUM [157]; (f) Florence3D [159]; (g) CAD 60 and 120 [148,151]; (h) UT-Kinect [158]; (i) Human3.6M [168]; (j) N-UCLA [160]; (k) 3D Action Pairs [149]; (l) SBU Kinect Interaction [166]; (m) CA [171] and (n) ORGBD [152].
Figure A2 Examples of datasets. (a) MSR DailyActivity3D [150]; (b) UW-IOM [154]; (c) UWA3D Multiview [161,162]; (d) UTD-MHAD [163]; (e) SYSU [153]; (f) NTU RGB+D [47]; (g) SHREC’17 [165]; (h) UOW LSC [146]; (i) Charades [173]; (j) StateFarm [175]; (k) J-HMDB [172] and (l) Volleyball [174].
Figure A3 Examples of datasets. (a) HiEve [179]; (b) Kinetics [176]; (c) ICVL-4 [177]; (d) PKU-MMD [169]; (e) DHG-14/28 [164] and (f) IRD [177].
Figure 13 The properties of the cited datasets: (a) Action categories of every dataset, with single-subject actions (with or without objects), pure human–object actions, group activities, hybrid for single-subject and interaction actions, and interaction for pure human–human interactions. (b) The number of methods developed on each dataset. The methods are those listed in Table A3. MSR stands for MSR Action3D, and MSRDA for MSR DailyActivity3D.
Figure 14 The performances of models (colored dots) on commonly used datasets: (a) The accuracy, in logarithmic scale, of cited methods (Table A3) on NTU RGB+D, NTU RGB+D 120 and Kinetics. (b) The performance (logarithmic) on SYSU, N-UCLA, MSR Action3D, HDM05 and SBU. Each method is denoted by its index in Table A3, marked as RefIndex in the figure. The colors identify each dataset. In (b), the numbers around dots denote [12,13,29,56,59,61,69,72,74,75,83,85,87,88,89,90,92,97,110,115,135,136,138,141,182], respectively, in ascending order.
Figure 15 The logarithmic complexity and model size of the most popular methods (denoted as dots) evaluated on NTU cross-subject. The text around each dot indicates the reference index of the method. Green text marks methods measured in 'Params (M)', and blue text marks 'FLOPs (G)/action'. The numbers around dots denote [13,14,50,54,55,57,59,62,73,77,80,82,85,86,89,110,116,117,121,127,130,135,141,183], respectively, in ascending order. Method 94 [54] stands out, with both relatively low complexity and a small model size.
Figure 16 The common recipe for skeleton-GNN-based HAR: (a) attention plus skip connections [14]; (b) a multi-stream way to use multiple modalities.
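The multi-stream recipe in Figure 16b usually ends with score-level fusion: each stream (joints, bones, their motion differences, or other modalities) yields class scores that are combined by a weighted sum. A minimal sketch; the equal weights are illustrative, not from any cited method:

```python
# Sketch: late (score-level) fusion of multiple skeleton streams.
import torch

def fuse_streams(logits_per_stream, weights=None):
    """logits_per_stream: list of (N, num_classes) tensors, one per stream."""
    weights = weights or [1.0] * len(logits_per_stream)
    scores = [w * torch.softmax(z, dim=1)
              for w, z in zip(weights, logits_per_stream)]
    return torch.stack(scores).sum(0).argmax(dim=1)  # fused top-1 prediction
```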