Lujing Chen1, Rui Liu2, Xin Yang3, Dongsheng Zhou1,3, Qiang Zhang1,3, Xiaopeng Wei3.
Abstract
In recent years, human motion prediction has become an active research topic in computer vision. However, owing to the complexity and stochastic nature of human motion, it remains a challenging problem. In previous works, human motion prediction has typically been treated as a sequence-to-sequence problem, and most works have aimed to capture the temporal dependence between successive frames. Although these approaches focused on the temporal dimension, they rarely considered the correlation between different joints in space. Thus, the spatio-temporal coupling of human joints is considered, to propose a novel spatio-temporal network based on a transformer and a graph convolutional network (GCN), termed STTG-Net. The temporal transformer is used to capture global temporal dependencies, and the spatial GCN module is used to establish local spatial correlations between the joints for each frame. To overcome the problems of error accumulation and discontinuity in motion prediction, a revision method based on a fusion strategy is also proposed, in which the current prediction frame is fused with the previous frame. The experimental results show that the proposed method yields lower prediction error and smoother predicted motion than previous methods. Its effectiveness is also demonstrated by comparison with state-of-the-art methods on the Human3.6M dataset.
Keywords: Graph convolutional network; Human motion prediction; Transformer
Year: 2022 PMID: 35904666 PMCID: PMC9338210 DOI: 10.1186/s42492-022-00112-5
Source DB: PubMed Journal: Vis Comput Ind Biomed Art ISSN: 2524-4442
Fig. 1 Overview of the STTG-Net network structure
Fig. 2 Temporal transformer (T-transformer) module. The module adds the temporal positional encoding (TPE) to the concatenated human pose vector Z of the input sequence and obtains the output through a T-transformer composed of 6 identical T-transformer layers. Specifically, each T-transformer layer first applies layer normalization, then performs multi-head attention via the dot-product attention of the Q, K, and V of multiple heads, and finally concatenates the attention results and passes them through an MLP composed of two FC layers
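The caption above largely determines the layer topology. Below is a minimal PyTorch sketch of one such pre-norm T-transformer layer and the stack of six identical layers; the head count H = 8 and depth LT = 6 follow the defaults in the parameter ablation, while the model width `d_model`, the MLP expansion factor, the GELU activation, a learned (rather than fixed) TPE, and the class names themselves are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TTransformerLayer(nn.Module):
    """One T-transformer layer: LayerNorm -> multi-head attention -> MLP of two FC layers."""
    def __init__(self, d_model=256, n_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        # Multi-head scaled dot-product attention over Q, K, and V.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # MLP composed of two FC layers (hidden width is an assumption).
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, z):
        # z: (batch, frames, d_model) -- one token per frame (pose vector).
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # residual attention
        z = z + self.mlp(self.norm2(z))                    # residual MLP
        return z

class TTransformer(nn.Module):
    """Stack of LT = 6 identical layers; the TPE is modeled here as a learned encoding."""
    def __init__(self, n_frames=10, d_model=256, n_layers=6, n_heads=8):
        super().__init__()
        self.tpe = nn.Parameter(torch.zeros(1, n_frames, d_model))  # TPE
        self.layers = nn.ModuleList(
            TTransformerLayer(d_model, n_heads) for _ in range(n_layers)
        )

    def forward(self, z):
        z = z + self.tpe  # add the temporal positional encoding to the pose tokens
        for layer in self.layers:
            z = layer(z)
        return z
```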
Fig. 3 The architecture of the S-GCN module
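The exact S-GCN formulation is not given in this record; the sketch below assumes a design common to GCN-based motion models: a learnable joint-adjacency matrix that mixes joint features within each frame, stacked LS = 14 times (the default in the parameter ablation) with residual connections. The joint count (22), feature sizes, activation, and residual stacking are all assumptions.

```python
import torch
import torch.nn as nn

class SGCNLayer(nn.Module):
    """One spatial GCN layer applied independently to each frame (assumed form)."""
    def __init__(self, n_joints=22, in_feats=3, out_feats=3):
        super().__init__()
        # Learnable adjacency capturing local correlations between joints.
        self.adj = nn.Parameter(
            torch.eye(n_joints) + 0.01 * torch.randn(n_joints, n_joints)
        )
        self.weight = nn.Linear(in_feats, out_feats, bias=False)
        self.act = nn.Tanh()

    def forward(self, x):
        # x: (batch, frames, joints, feats) -- spatial mixing is per frame.
        x = torch.einsum("jk,btkf->btjf", self.adj, x)  # mix features across joints
        return self.act(self.weight(x))

class SGCN(nn.Module):
    """Residual stack of LS = 14 S-GCN layers (residual connection is an assumption)."""
    def __init__(self, n_joints=22, feats=3, n_layers=14):
        super().__init__()
        self.layers = nn.ModuleList(
            SGCNLayer(n_joints, feats, feats) for _ in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)
        return x
```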
Fig. 4 The prediction revision module. The first line, 'prediction', is the result directly predicted by the network, while the second line is the 'final prediction'. From the second frame onward, the current 'prediction' result is fused with the 'final prediction' result of the previous frame and then taken as the final prediction for the current frame
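A minimal sketch of how such a prediction revision (PR) step could look, assuming the fusion is the convex combination final_t = α · final_{t-1} + β · pred_t with α + β = 1. This form is consistent with the coefficient ablation further below, where α = 0, β = 1 reproduces the no-revision setting and α = 0.125, β = 0.875 performs best; the function name and tensor layout are illustrative assumptions.

```python
import torch

def revise_predictions(pred: torch.Tensor,
                       alpha: float = 0.125,
                       beta: float = 0.875) -> torch.Tensor:
    """pred: (batch, frames, pose_dim) raw network predictions."""
    final = torch.empty_like(pred)
    final[:, 0] = pred[:, 0]  # the first frame is kept as predicted
    for t in range(1, pred.shape[1]):
        # From the second frame on, fuse the current prediction with the
        # final prediction of the previous frame.
        final[:, t] = alpha * final[:, t - 1] + beta * pred[:, t]
    return final
```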
Joint angle errors and the average angle error for all actions, compared with baselines on Human3.6M
| Milliseconds | 80 | 160 | 320 | 400 | 80 | 160 | 320 | 400 | 80 | 160 | 320 | 400 | 80 | 160 | 320 | 400 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Walking | | | | Eating | | | | Smoking | | | | Discussion | | | |
| Res. sup. | 0.28 | 0.49 | 0.72 | 0.81 | 0.23 | 0.39 | 0.62 | 0.76 | 0.23 | – | – | – | 0.31 | 0.68 | 1.01 | 1.09 |
| convSeq2Seq | 0.33 | 0.54 | 0.68 | 0.73 | 0.22 | 0.36 | 0.58 | 0.71 | 0.26 | 0.49 | 0.96 | 0.92 | 0.32 | 0.67 | 0.94 | 1.01 |
| Multi-Gan | 0.23 | 0.51 | 0.62 | 0.66 | 0.20 | 0.31 | – | 0.66 | 0.25 | 0.46 | 0.88 | 0.88 | 0.28 | 0.55 | 0.81 | 0.92 |
| OoD | 0.23 | 0.37 | 0.58 | 0.63 | 0.21 | 0.37 | 0.59 | 0.72 | 0.27 | 0.54 | 1.03 | 1.03 | 0.30 | 0.66 | 0.94 | 1.02 |
| DMGNN | – | – | – | 0.58 | 0.17 | – | – | – | – | – | – | – | 0.26 | 0.65 | 0.92 | 0.99 |
| ST-Conv | 0.19 | 0.34 | 0.57 | 0.63 | – | – | – | – | 0.22 | 0.41 | 0.85 | 0.81 | 0.22 | 0.57 | 0.84 | 0.98 |
| POTR-GCN | – | 0.40 | 0.62 | 0.73 | – | – | 0.53 | 0.68 | – | – | 0.84 | 0.82 | – | – | 0.79 | 0.88 |
| ST-Transformer | 0.21 | 0.36 | 0.58 | 0.63 | 0.17 | – | – | – | 0.22 | 0.43 | 0.88 | 0.82 | – | – | 0.79 | 0.88 |
| HRI [27] | – | – | – | – | – | – | – | – | 0.22 | – | 0.86 | 0.80 | 0.20 | – | – | – |
| Ours | 0.19 | 0.33 | – | – | – | – | 0.51 | 0.62 | 0.33 | 0.61 | 1.05 | 1.15 | 0.21 | – | – | – |
| | Direction | | | | Greeting | | | | Phoning | | | | Posing | | | |
| Res. sup. | 0.26 | 0.47 | 0.72 | 0.84 | 0.75 | 1.17 | 1.74 | 1.83 | – | – | – | – | 0.36 | 0.71 | 1.22 | 1.48 |
| convSeq2Seq | 0.39 | 0.60 | 0.80 | 0.91 | 0.51 | 0.82 | 1.21 | 1.38 | 0.59 | 1.13 | 1.51 | 1.65 | 0.29 | 0.60 | 1.12 | 1.37 |
| Multi-Gan | 0.36 | 0.57 | – | 0.89 | 0.51 | 0.86 | – | 1.36 | 0.54 | 1.05 | – | 1.58 | 0.22 | 0.51 | – | 1.41 |
| OoD | 0.38 | 0.58 | 0.79 | 0.90 | 0.49 | 0.81 | 1.24 | 1.43 | 0.57 | 1.10 | 1.48 | 1.61 | 0.26 | 0.56 | 1.26 | 1.55 |
| DMGNN | 0.32 | 0.65 | 0.93 | 1.05 | 0.36 | 0.61 | – | – | 0.52 | 0.97 | 1.29 | 1.43 | – | – | 1.06 | 1.34 |
| ST-Conv | – | – | 0.77 | 0.81 | 0.35 | 0.61 | 1.01 | 1.20 | 0.53 | 1.00 | – | 1.40 | 0.26 | 0.51 | 1.08 | 1.32 |
| POTR-GCN | – | 0.45 | 0.79 | 0.91 | – | 0.69 | 1.17 | 1.30 | 0.50 | 1.04 | 1.41 | 1.54 | 0.61 | 0.68 | 1.05 | – |
| ST-Transformer | 0.25 | – | 0.75 | 0.86 | 0.35 | 0.61 | 1.10 | 1.32 | 0.53 | 1.04 | 1.41 | 1.54 | 0.61 | 0.68 | – | – |
| HRI [27] | 0.25 | – | – | – | 0.35 | – | 0.95 | 1.14 | 0.53 | 1.01 | 1.31 | 1.43 | – | – | 1.09 | 1.35 |
| Ours | 0.27 | – | – | – | – | – | – | – | – | – | – | 1.29 | 0.25 | – | – | – |
| | Purchases | | | | Sitting | | | | Sittingdown | | | | Takingphoto | | | |
| Res. sup. | 0.51 | 0.97 | 1.07 | 1.16 | 0.41 | 1.05 | 1.49 | 1.63 | 0.39 | 0.81 | 1.40 | 1.62 | 0.24 | 0.51 | 0.90 | 1.05 |
| convSeq2Seq | 0.63 | 0.91 | 1.19 | 1.29 | 0.39 | 0.61 | 1.02 | 1.18 | 0.41 | 0.78 | 1.16 | 1.31 | 0.23 | 0.49 | 0.88 | 1.06 |
| Multi-Gan | 0.55 | 0.85 | – | 1.23 | 0.35 | 0.60 | – | 1.13 | 0.36 | 0.72 | – | 1.20 | 0.23 | 0.41 | – | 0.99 |
| OoD | 0.61 | 0.89 | 1.27 | 1.37 | 0.38 | 0.62 | 1.06 | 1.22 | 0.41 | 0.83 | 1.28 | 1.41 | 0.25 | 0.51 | 0.81 | 0.95 |
| DMGNN | – | – | 1.05 | 1.14 | – | – | – | – | 0.32 | 0.65 | 0.93 | 1.05 | – | – | – | 0.71 |
| ST-Conv | – | – | 1.08 | 1.15 | 0.30 | 0.49 | 0.90 | 1.09 | – | 0.65 | 0.97 | 1.08 | – | – | – | 0.71 |
| POTR-GCN | 0.33 | 0.63 | 1.04 | 1.09 | – | 0.47 | 0.92 | 1.09 | – | – | 1.00 | 1.12 | – | 0.41 | 0.71 | 0.86 |
| ST-Transformer | 0.43 | 0.77 | 1.30 | 1.37 | 0.29 | 0.46 | 0.84 | 1.01 | 0.32 | 0.66 | 0.98 | 1.10 | – | 0.38 | 0.64 | 0.75 |
| HRI [27] | – | 0.65 | – | – | 0.29 | 0.47 | 0.83 | 1.01 | 0.30 | – | – | – | 0.16 | 0.36 | – | – |
| Ours | 0.43 | – | – | – | 0.30 | – | – | – | 0.40 | – | – | – | 0.16 | – | – | – |
| | Waiting | | | | Walkingdog | | | | Walkingtogether | | | | Average | | | |
| Res. sup. | 0.28 | 0.53 | 1.02 | 1.14 | 0.56 | 0.91 | 1.26 | 1.40 | 0.31 | 0.58 | 0.87 | 0.91 | 0.36 | 0.67 | 1.02 | 1.15 |
| convSeq2Seq | 0.30 | 0.62 | 1.09 | 1.30 | 0.59 | 1.00 | 1.32 | 1.44 | 0.27 | 0.52 | 0.71 | 0.74 | 0.38 | 0.68 | 1.01 | 1.13 |
| Multi-Gan | 0.23 | 0.56 | – | 1.29 | 0.53 | 0.85 | – | 1.33 | 0.22 | 0.45 | – | 0.73 | 0.37 | 0.67 | – | 1.43 |
| OoD | 0.29 | 0.58 | 1.06 | 1.29 | 0.52 | 0.88 | 1.17 | 1.34 | 0.21 | 0.44 | 0.66 | 0.74 | 0.37 | 0.63 | 1.08 | 1.18 |
| DMGNN | 0.22 | – | – | – | – | – | 1.16 | 1.34 | – | – | – | 0.57 | – | – | 0.83 | 0.95 |
| ST-Conv | – | 0.51 | 0.97 | 1.17 | 0.43 | 0.78 | 1.10 | 1.24 | – | – | – | – | – | – | 0.87 | 0.98 |
| POTR-GCN | – | 0.56 | 1.14 | 1.37 | – | 0.79 | 1.21 | 1.33 | – | 0.44 | 0.63 | 0.70 | – | 0.56 | 0.94 | 1.01 |
| ST-Transformer | 0.22 | 0.51 | 0.98 | 1.22 | 0.43 | 0.78 | 1.15 | 1.30 | 0.17 | 0.37 | 0.58 | 0.62 | 0.30 | 0.55 | 0.90 | 1.02 |
| HRI [27] | 0.22 | – | 0.92 | 1.14 | 0.46 | 0.78 | – | – | – | – | – | – | – | – | – | – |
| Ours | 0.24 | – | – | – | 0.43 | – | – | – | 0.17 | 0.36 | 0.57 | – | 0.28 | – | – | – |
In the original table, the best results are presented in bold and the sub-optimal results in italics; entries marked '–' are unavailable here
Fig. 5 Visualization of predictions for the six actions of (a) walking, (b) smoking, (c) walkingdog, (d) greeting, (e) eating, and (f) phoning on Human3.6M. The ground truth, LTD [4], HRI [27], and the proposed method are shown from top to bottom. The changes in actions from the first to the last frame of the prediction can be clearly seen in the grey dashed boxes, while the blue round boxes compare the actions predicted by the proposed method and the other methods with the ground truth. The visualization results show that the proposed method produces predictions closer to the ground truth than HRI and LTD.
Ablation studies of different modules in STTG-Net, reporting joint angle errors on Human3.6M. "√" indicates that the module is used in the experiment; "×" indicates that it is removed
| | T-Transformer | S-GCN | TPE | PR | MAE 80 ms | MAE 160 ms | MAE 320 ms | MAE 400 ms |
|---|---|---|---|---|---|---|---|---|
| (1) | √ | × | × | × | 0.59 | 0.83 | 1.16 | 1.28 |
| (2) | × | √ | × | × | 0.49 | 0.69 | 0.98 | 1.09 |
| (3) | √ | √ | × | × | 0.28 | 0.54 | 0.88 | 1.00 |
| (4) | √ | √ | √ | × | – | 0.52 | 0.86 | 0.98 |
| (5) | √ | √ | √ | √ | 0.28 | – | – | – |
The best results are presented in bold
Ablation experiments for different module parameters in STTG-Net, with joint angle errors reported on Human3.6M (H: number of attention heads; LT: number of T-transformer layers; LS: number of S-GCN layers)
| H | 8 | 6 | 12 | 8 | 8 | 8 | 8 |
|---|---|---|---|---|---|---|---|
| LT | 6 | 6 | 6 | 4 | 8 | 6 | 6 |
| LS | 14 | 14 | 14 | 14 | 14 | 12 | 16 |
| 80 ms | – | 0.28 | 0.28 | 0.28 | 0.28 | 0.29 | 0.27 |
| 160 ms | – | 0.55 | 0.53 | 0.53 | 0.55 | 0.54 | 0.53 |
| 320 ms | – | 0.87 | 0.86 | 0.86 | 0.89 | 0.87 | 0.87 |
| 400 ms | – | 0.99 | 0.99 | 0.98 | 1.01 | 1.00 | 0.99 |
The best results are presented in bold
Ablation studies with different coefficients in the prediction revision module, reporting the mean joint angle errors at 80, 160, 320, and 400 ms for different α and β on Human3.6M
| | 80 ms | 160 ms | 320 ms | 400 ms |
|---|---|---|---|---|
| α=0, β=1 | – | 0.52 | 0.86 | 0.98 |
| α=0.1, β=0.9 | 0.30 | 0.54 | 0.88 | 0.99 |
| α=0.125, β=0.875 | 0.28 | – | – | – |
| α=0.175, β=0.825 | 0.30 | 0.54 | 0.85 | 0.97 |
| α=0.225, β=0.775 | 0.29 | 0.54 | 0.86 | 0.98 |
| α=0.25, β=0.75 | 0.28 | 0.53 | 0.86 | 0.99 |
| α=0.5, β=0.5 | 0.29 | 0.55 | 0.88 | 1.00 |
The best results are presented in bold