Weiwei Li, Rong Du, Shudong Chen.
Abstract
Despite great progress in 3D pose estimation from videos, there is still a lack of effective means to extract spatio-temporal features of different granularities from complex, dynamic skeleton sequences. To tackle this problem, we propose a novel skeleton-based spatio-temporal U-Net (STUNet) scheme that handles spatio-temporal features at multiple scales for 3D human pose estimation in video. The proposed STUNet architecture consists of a cascade structure of semantic graph convolution layers and structural temporal dilated convolution layers, progressively extracting and fusing spatio-temporal semantic features from fine-grained to coarse-grained. This U-shaped network achieves scale compression and feature squeezing by downscaling and upscaling, while abstracting multi-resolution spatio-temporal dependencies through skip connections. Experiments demonstrate that our model effectively captures comprehensive spatio-temporal features at multiple scales and achieves substantial improvements over mainstream methods on real-world datasets.
Keywords: 3D pose estimation; graph convolutional networks; non-local mechanics; temporal convolutional networks
Year: 2022 PMID: 35408188 PMCID: PMC9003032 DOI: 10.3390/s22072573
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
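The record stops at the abstract, but the pipeline it describes (graph convolutions over the skeleton, strided temporal convolutions for downscaling, and upsampling with skip connections) is concrete enough to sketch. Below is a minimal single-level sketch in PyTorch; the layer widths, joint count, stride-3 downscaling, and fusion by concatenation are illustrative assumptions, not the authors' implementation.

```python
# A minimal one-level U-shaped spatio-temporal network in the spirit of the
# abstract. All sizes and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConv(nn.Module):
    """Mix joint features along a fixed, row-normalized adjacency matrix."""
    def __init__(self, in_ch, out_ch, adj):
        super().__init__()
        self.register_buffer("adj", adj)               # (J, J)
        self.lin = nn.Linear(in_ch, out_ch)

    def forward(self, x):                              # x: (B, T, J, C)
        return F.relu(self.lin(torch.einsum("ij,btjc->btic", self.adj, x)))

class STUNetSketch(nn.Module):
    """One downscale/upscale level with a skip connection."""
    def __init__(self, adj, in_ch=2, ch=64):
        super().__init__()
        self.enc1 = GraphConv(in_ch, ch, adj)
        self.down = nn.Conv1d(ch, ch, kernel_size=3, stride=3)  # T -> T/3
        self.enc2 = GraphConv(ch, ch, adj)
        self.dec = GraphConv(2 * ch, ch, adj)          # fuse skip + upsampled
        self.head = nn.Linear(ch, 3)                   # 3D output per joint

    def forward(self, x):                              # x: (B, T, J, 2)
        s1 = self.enc1(x)                              # fine-grained features
        B, T, J, C = s1.shape
        z = self.down(s1.permute(0, 2, 3, 1).reshape(B * J, C, T))
        z = self.enc2(z.reshape(B, J, C, -1).permute(0, 3, 1, 2))
        z = F.interpolate(z.permute(0, 2, 3, 1).reshape(B * J, C, -1),
                          size=T, mode="linear")       # upsample back to T
        z = z.reshape(B, J, C, T).permute(0, 3, 1, 2)
        return self.head(self.dec(torch.cat([s1, z], dim=-1)))

J = 17                                                 # joints (assumption)
adj = torch.eye(J)                                     # stand-in adjacency
out = STUNetSketch(adj)(torch.randn(1, 27, J, 2))      # -> (1, 27, 17, 3)
```

A full STUNet, per Figures 1 and 2, would stack several such levels and interleave graph pooling with the temporal compression.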
Figure 1. Illustration of the proposed skeleton-based spatio-temporal U-Net. In the semantic pooling stage, spatio-temporal semantic features are gradually compressed and fused into different granularities. In the semantic upsampling stage, spatio-temporal features are decoded and multi-resolution spatio-temporal dependencies are abstracted by skip connections in the U-Net structure.
Figure 2. Overview of the proposed spatio-temporal U-Net scheme. The STUNet architecture consists of a cascade structure of semantic-structural graph convolution network (S-GCN) layers and structural temporal convolution network (S-TCN) layers that progressively integrate semantic features in local time and space. Taking a 27-frame input as an example, the model contains two layers of graph pooling in the spatial dimension and three layers of TCN compression in the temporal dimension.
Figure 3. Data flow in the proposed S-TCN model, from bottom input to top output.
Figure 4. The hierarchical graph pooling strategy defined for the human body.
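Figure 4 itself is not reproduced in this record, but hierarchical graph pooling of a skeleton is commonly implemented by averaging joint features within body-part groups, shrinking the graph from joints to parts to a single node. The 17-joint layout and the groups below are hypothetical placeholders, not the paper's actual strategy.

```python
# Sketch of hierarchical graph pooling: joints -> parts -> whole body.
# The grouping of a 17-joint skeleton into five parts is an assumption.
import torch

GROUPS = [
    [0, 7, 8, 9, 10],      # torso + head (hypothetical indices)
    [11, 12, 13],          # left arm
    [14, 15, 16],          # right arm
    [4, 5, 6],             # left leg
    [1, 2, 3],             # right leg
]

def graph_pool(x, groups=GROUPS):
    """Average-pool joint features into part features. x: (B, T, J, C)."""
    return torch.stack([x[:, :, g, :].mean(dim=2) for g in groups], dim=2)

x = torch.randn(2, 27, 17, 64)          # batch, frames, joints, channels
parts = graph_pool(x)                   # (2, 27, 5, 64)  part level
body = parts.mean(dim=2, keepdim=True)  # (2, 27, 1, 64)  coarsest level
```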
Figure 5. Visualization of the learned weighting matrices M of S-GCN in the network.
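The weighting matrices M in Figure 5 suggest that S-GCN learns per-edge importance rather than using a fixed adjacency, in the spirit of semantic graph convolutions. A minimal sketch, assuming M is a learnable reweighting of skeleton edges (the paper's exact parameterization may differ):

```python
# Graph convolution with a learned edge-weighting matrix M, restricted to
# the skeleton's edge pattern. This form is an assumption, not the paper's.
import torch
import torch.nn as nn

class WeightedGraphConv(nn.Module):
    def __init__(self, in_ch, out_ch, adj):
        super().__init__()
        # adj: (J, J) float adjacency, assumed to include self-loops.
        self.register_buffer("mask", adj > 0)
        self.M = nn.Parameter(torch.ones_like(adj))   # learned edge weights
        self.lin = nn.Linear(in_ch, out_ch)

    def forward(self, x):                             # x: (B, T, J, C)
        # Normalize the learned weights over each joint's neighbors only.
        w = self.M.masked_fill(~self.mask, float("-inf")).softmax(dim=-1)
        return self.lin(torch.einsum("ij,btjc->btic", w, x))
```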
Reconstruction error on Human3.6M under protocol 1. Results are in millimeters.
| Method | Dir. | Disc. | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pavlakos et al. [ | 67.4 | 71.9 | 66.7 | 69.1 | 72.0 | 77.0 | 65.0 | 68.3 | 83.7 | 96.5 | 71.7 | 65.8 | 74.9 | 59.1 | 63.2 | 71.9 |
| Fang et al. [ | 50.1 | 54.3 | 57.0 | 57.1 | 66.6 | 73.3 | 53.4 | 55.7 | 72.8 | 88.6 | 60.3 | 57.7 | 62.7 | 47.5 | 50.6 | 60.4 |
| Pavlakos et al. [ | 48.5 | 54.4 | 54.4 | 52.0 | 59.4 | 65.3 | 49.9 | 52.9 | 65.8 | 71.1 | 56.6 | 52.9 | 60.9 | 44.7 | 47.8 | 56.2 |
| Yang et al. [ | 51.5 | 58.9 | 50.4 | 57.0 | 62.1 | 65.4 | 49.8 | 52.7 | 69.2 | 85.2 | 57.4 | 58.4 | 43.6 | 60.1 | 47.7 | 58.6 |
| Luvizon et al. [ | 49.2 | 51.6 | 47.6 | 50.5 | 51.8 | 60.3 | 48.5 | 51.7 | 61.5 | 70.9 | 53.7 | 48.9 | 57.9 | 44.4 | 48.9 | 53.2 |
| Hossain et al. [ | 48.4 | 50.7 | 57.2 | 55.2 | 63.1 | 72.6 | 53.0 | 51.7 | 66.1 | 80.9 | 59.0 | 57.3 | 62.4 | 46.6 | 49.6 | 58.3 |
| Lee et al. [ | 40.2 | 49.2 | 47.8 | 52.6 | 50.1 | 75.0 | 50.2 | 43.0 | 55.8 | 73.9 | 54.1 | 55.6 | 58.2 | 43.3 | 43.3 | 52.8 |
| Pavllo et al. [ | 45.9 | 48.5 | 44.3 | 47.8 | 51.9 | 57.8 | 46.2 | 45.6 | 59.9 | 68.5 | 50.6 | 46.4 | 51.0 | 34.5 | 35.4 | 49.0 |
| Cai et al. [ | 44.6 | 47.4 | 45.6 | 48.8 | 50.8 | 59.0 | 47.2 | 43.9 | 57.9 | 61.9 | 49.7 | 46.6 | 51.3 | 37.1 | 39.4 | 48.8 |
| Yeh et al. [ | 44.8 | 46.1 | 43.3 | 46.4 | 49.0 | 55.2 | 44.6 | 44.0 | 58.3 | 62.7 | 47.1 | 43.9 | 48.6 | 32.7 | 33.3 | 46.7 |
| Xu et al. [ | 37.4 | 43.5 | 42.7 | 42.7 | 46.6 | 59.7 | 41.3 | 45.1 | 52.7 | 60.2 | 45.8 | 43.1 | 47.7 | 33.7 | 37.1 | 45.6 |
| Liu et al. [ | 41.8 | 44.8 | 41.1 | 44.9 | 47.4 | 54.1 | 43.4 | 42.2 | 56.2 | 63.6 | 45.3 | 43.5 | 45.3 | 31.3 | 32.2 | 45.1 |
| Ours (27 frames) | 43.5 | 44.8 | 43.9 | 44.1 | 47.7 | 56.5 | 44.0 | 44.2 | 55.8 | 67.9 | 47.3 | 46.5 | 45.7 | 33.4 | 33.6 | 46.6 |
| Ours (81 frames) | 42.6 | 43.6 | 42.8 | 43.1 | | 54.6 | 43.3 | 42.4 | 53.5 | 63.2 | 45.8 | 44.2 | 44.9 | 31.9 | 32.0 | 45.0 |
| Ours (243 frames) | 41.9 | | 42.3 | 42.9 | 46.3 | 54.2 | 42.9 | | 53.1 | 62.8 | | 43.9 | | | | 44.5 |
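For reference, protocol 1 on Human3.6M conventionally denotes MPJPE, the mean per-joint position error in millimeters after aligning the root joint, consistent with the MPJPE column in the complexity table below. A minimal NumPy version, assuming root-relative coordinates:

```python
# MPJPE: mean Euclidean distance between predicted and ground-truth joints.
import numpy as np

def mpjpe(pred, gt, root=0):
    """pred, gt: (N, J, 3) joint positions in mm; root-align, then average."""
    pred = pred - pred[:, root:root + 1]   # root-relative coordinates
    gt = gt - gt[:, root:root + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()
```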
Reconstruction error on Human3.6M under protocol 2. Results are in millimeters.
| Method | Dir. | Disc. | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Martinez et al. [ | 39.5 | 43.2 | 46.4 | 47.0 | 51.0 | 56.0 | 41.4 | 40.6 | 56.5 | 69.4 | 49.2 | 45.0 | 49.5 | 38.0 | 43.1 | 47.7 |
| Sun et al. [ | 42.1 | 44.3 | 45.0 | 45.4 | 51.5 | 53.0 | 43.2 | 41.3 | 59.3 | 73.3 | 51.0 | 44.0 | 48.0 | 38.3 | 44.8 | 48.3 |
| Fang et al. [ | 38.2 | 41.7 | 43.7 | 44.9 | 48.5 | 55.3 | 40.2 | 38.2 | 54.5 | 64.4 | 47.2 | 44.3 | 47.3 | 36.7 | 41.7 | 45.7 |
| Pavlakos et al. [ | 34.7 | 39.8 | 41.8 | 38.6 | 42.5 | 47.5 | 38.0 | 36.6 | 50.7 | 56.8 | 42.6 | 39.6 | 43.9 | 32.1 | 36.5 | 41.8 |
| Yang et al. [ | 26.9 | 30.9 | 36.3 | 39.9 | 43.9 | 47.4 | 28.8 | 29.4 | 36.9 | 58.4 | 41.5 | 30.5 | 29.5 | 42.5 | 32.2 | 37.7 |
| Hossain et al. [ | 35.7 | 39.3 | 44.6 | 43.0 | 47.2 | 54.0 | 38.3 | 37.5 | 51.6 | 61.3 | 46.5 | 41.4 | 47.3 | 34.2 | 39.4 | 44.1 |
| Pavllo et al. [ | 34.2 | 36.8 | 33.9 | 37.5 | 37.1 | 43.2 | 34.4 | 33.5 | 45.3 | 52.7 | 37.7 | 34.1 | 38.0 | 25.8 | 27.7 | 36.8 |
| Cai et al. [ | 35.7 | 37.8 | 36.9 | 40.7 | 39.6 | 45.2 | 37.4 | 34.5 | 46.9 | 50.1 | 40.5 | 36.1 | 41.0 | 29.6 | 33.2 | 39.0 |
| Xu et al. [ | 31.0 | 34.8 | 34.7 | 34.4 | 36.2 | 43.9 | 31.6 | 33.5 | 42.3 | 49.0 | 37.1 | 33.0 | 39.1 | 26.9 | 31.9 | 36.2 |
| Liu et al. [ | 32.3 | 35.2 | 33.3 | 35.8 | 35.9 | 41.5 | 33.2 | 32.7 | 44.6 | 50.9 | 37.0 | 32.4 | 37.0 | 25.2 | 27.2 | 35.6 |
| Ours (27 frames) | 34.3 | 35.7 | 34.9 | 36.6 | 37.5 | 42.7 | 33.1 | 36.0 | 44.4 | 53.7 | 38.5 | 33.5 | 38.4 | 26.0 | 28.4 | 36.9 |
| Ours (81 frames) | 33.5 | 35.1 | 33.9 | 36.0 | 36.9 | | 32.3 | 34.5 | 42.9 | 50.1 | 37.7 | 33.0 | 37.8 | 25.6 | 27.6 | 36.0 |
| Ours (243 frames) | 33.3 | 34.8 | | 35.2 | 36.3 | 42.2 | 32.1 | 33.7 | 42.6 | 49.4 | | 32.8 | 37.4 | | | |
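Protocol 2 conventionally denotes P-MPJPE, where each predicted pose is first aligned to the ground truth by an optimal rotation, translation, and scale (orthogonal Procrustes) before the error is averaged. A per-frame sketch:

```python
# P-MPJPE: error after a rigid + scale (Procrustes) alignment per frame.
import numpy as np

def p_mpjpe_frame(pred, gt):
    """pred, gt: (J, 3). Error in mm after similarity alignment."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(p.T @ g)      # SVD of the cross-covariance
    if np.linalg.det(U @ Vt) < 0:          # fix an improper rotation
        Vt[-1] *= -1
        s[-1] *= -1
    R = U @ Vt                             # optimal rotation (row vectors)
    scale = s.sum() / (p ** 2).sum()       # optimal isotropic scale
    aligned = scale * p @ R + mu_g
    return np.linalg.norm(aligned - gt, axis=-1).mean()
```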
Reconstruction error on HumanEva-I dataset under protocol 2. Results are in millimeters.
| Method | Walk S1 | Walk S2 | Walk S3 | Jog S1 | Jog S2 | Jog S3 | Box S1 | Box S2 | Box S3 |
|---|---|---|---|---|---|---|---|---|---|
| Pavlakos et al. [ | 22.3 | 19.5 | 29.7 | 28.9 | 21.9 | 23.8 | - | - | - |
| Lee et al. [ | 18.6 | 19.9 | 30.5 | 25.7 | 16.8 | 17.7 | 42.8 | 48.1 | 53.4 |
| Pavllo et al. [ | 13.9 | 10.2 | 46.6 | 20.9 | 13.1 | 13.8 | 23.8 | 33.7 | 32.0 |
| Yeh et al. [ | 15.2 | 10.3 | 47.0 | 21.8 | 13.1 | 13.7 | 22.8 | 31.8 | 31.0 |
| Xu et al. [ | 13.2 | 10.2 | 29.9 | | 12.3 | 13.0 | | 18.1 | 20.4 |
| Liu et al. [ | 13.1 | 9.8 | 26.8 | 16.9 | 12.8 | 13.3 | - | - | - |
| Ours (243 frames) | | | | 16.0 | | | 14.6 | | |
Figure 6. Visualized qualitative results of 3D pose estimation in video.
Computational complexity of various models under protocol 1.
| Model | Parameters | FLOPs | MPJPE (mm) |
|---|---|---|---|
| Hossain et al. [ | 16.96M | 33.88M | 58.3 |
| Pavllo et al. (81 frames) [ | 12.75M | 25.48M | 47.7 |
| Pavllo et al. (243 frames) [ | 16.95M | 33.87M | 46.8 |
| Ours (27 frames) | 14.80M | 29.03M | 45.8 |
| Ours (81 frames) | 19.67M | 38.45M | 45.0 |
| Ours (243 frames) | 29.58M | 64.84M | 44.5 |
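The parameter counts above can be reproduced for any PyTorch model with a one-liner; FLOPs are typically measured with an external profiler and are omitted here. This assumes the models are standard nn.Module instances:

```python
# Trainable parameter count (the "Parameters" column above).
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```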
Ablation study of our 81-frame and 243-frame models under protocol 1 on Human3.6M.
| Frames | Graph Pooling Layers | TCN Compression Layers | MPJPE (mm) |
|---|---|---|---|
| 81 | 1 | 2 | 45.8 |
| 81 | 1 | 3 | 45.0 |
| 81 | 2 | 3 | 45.3 |
| 243 | 1 | 3 | 45.5 |
| 243 | 1 | 4 | 45.3 |
| 243 | 2 | 3 | 44.9 |
| 243 | 2 | 4 | 44.5 |
| 243 | 3 | 4 | 44.7 |
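A plausible reading of the frame counts: 27, 81, and 243 are powers of 3, so each temporal compression layer can shrink a clip by a factor of 3 (27 to 9 to 3 to 1 with three layers, matching Figure 2's 27-frame example). Assuming that factor-of-3 scheme, longer clips leave headroom for more compression layers, which is consistent with the deeper settings winning in the ablation:

```python
# Sequence length after k factor-of-3 temporal compressions, assuming the
# stride-3 scheme suggested by Figure 2's 27-frame example.
for frames in (27, 81, 243):
    t, k = frames, 0
    while t % 3 == 0:
        t //= 3
        k += 1
        print(f"T={frames}: after {k} compression layers -> {t} frames")
```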