Amal El Kaid, Denis Brazey, Vincent Barra, Karim Baïna.
Abstract
Two-dimensional (2D) multi-person pose estimation and three-dimensional (3D) root-relative pose estimation from a monocular RGB camera have made significant progress recently. Yet, real-world applications require depth estimates and the ability to determine the distances between people in a scene; it is therefore necessary to recover the 3D absolute poses of several people, which remains challenging from a single camera viewpoint. Furthermore, previously proposed systems typically require a significant amount of resources and memory. To overcome these restrictions, we propose a real-time framework for multi-person 3D absolute pose estimation from a monocular camera, which integrates a human detector, a 2D pose estimator, a 3D root-relative pose reconstructor, and a root depth estimator in a top-down manner. The proposed system, called Root-GAST-Net, is based on modified versions of the GAST-Net and RootNet networks. Its efficiency is demonstrated through quantitative and qualitative evaluations on two benchmark datasets, Human3.6M and MuPoTS-3D. On the MuPoTS-3D dataset, our system outperforms the current state-of-the-art by a significant margin on all evaluated metrics, and it runs in real time at 15 fps on an Nvidia GeForce GTX 1080.
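In outline, the top-down pipeline chains four models: a person detector, a per-person 2D pose estimator, a 2D-to-3D root-relative lifter, and a root depth estimator whose output shifts each root-relative pose into camera-centric coordinates. The following minimal Python sketch illustrates only that composition; every function is a stub standing in for the real network (Yolo-v3, HrNet, GAST-Net, RootNet), and all names and stub outputs here are illustrative assumptions, not the authors' code.

```python
from typing import List, Tuple

def detect_people(frame) -> List[Tuple[int, int, int, int]]:
    """Stand-in for the Yolo-v3 detector: returns (x, y, w, h) boxes.
    Stubbed with two fixed boxes for illustration."""
    return [(0, 0, 100, 200), (150, 0, 100, 200)]

def estimate_2d_pose(frame, box):
    """Stand-in for the HrNet 2D pose estimator: 17 joints as (u, v) pixels.
    Stubbed by placing every joint at the box centre."""
    x, y, w, h = box
    return [(x + w / 2, y + h / 2)] * 17

def lift_root_relative_3d(pose_2d):
    """Stand-in for GAST-Net: root-relative (x, y, z) joints, root depth 0."""
    return [(u, v, 0.0) for (u, v) in pose_2d]

def estimate_root_depth(frame, box) -> float:
    """Stand-in for RootNet: absolute depth of the root (pelvis) joint.
    Stubbed with an area heuristic (farther people appear smaller)."""
    x, y, w, h = box
    return 1.0e6 / (w * h)

def absolute_poses(frame):
    """Top-down composition: detect people, estimate each 2D pose, lift it
    to a root-relative 3D pose, then shift every joint by the root depth."""
    poses = []
    for box in detect_people(frame):
        pose_2d = estimate_2d_pose(frame, box)
        rel_3d = lift_root_relative_3d(pose_2d)
        z_root = estimate_root_depth(frame, box)
        poses.append([(x, y, z + z_root) for (x, y, z) in rel_3d])
    return poses
```

The key point of the top-down design is the last step: the root-relative pose and the root depth are predicted by separate networks and only combined at the end, which is what allows the framework to report camera-centric (absolute) coordinates.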
Keywords: 3D multi-person pose estimation; absolute poses; artificial intelligence; camera-centric coordinates; computer vision; deep-learning
Year: 2022 PMID: 35684728 PMCID: PMC9185275 DOI: 10.3390/s22114109
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1: 3D skeleton model in the MuPoTS-3D format, with joint names.
Figure 2: Examples of 3D absolute poses produced by our full framework.
Figure 3: The pipeline of the Root-GAST-Net framework.
Figure 4: Structural diagrams of the various networks used in the framework.
Person-centric and camera-centric evaluations on the MuPoTS-3D dataset. The best is in bold, the second best is underlined.
| Method | Year | PCK | AUC | 3D-PCK | AP |
|---|---|---|---|---|---|
| 3D MPPE PoseNet | 2019 | 81.8 | 39.8 | 31.5 | 31.0 |
| HDNet | 2020 | 83.7 | - | 35.2 | 39.4 |
| SMAP | 2020 | 80.5 | 45.5 | 38.7 | 45.5 |
| HMOR | 2020 | 82.0 | 43.5 | 43.8 | - |
| GnTCN | 2021 | | | 45.7 | 45.2 |
| TDBU_Net | 2021 | | | 48.0 | 46.3 |
| DAS | 2022 | 82.7 | - | 39.2 | - |
| Root-GAST with GR | - | 63.8 | 30.6 | 54.7 | |
| Root-GAST with GA | - | 82.5 | 45.3 | | 56.8 |
| Root-GAST with GAR | - | 82.5 | 45.3 | | |
Sequence-wise 3D-PCK comparison with the state-of-the-art on the MuPoTS-3D dataset. (*) The accuracies of these methods are measured on matched ground truths. The best is in bold, the second best is underlined.
| Method | TS1 | TS2 | TS3 | TS4 | TS5 | TS6 | TS7 |
|---|---|---|---|---|---|---|---|
| 3D MPPE PoseNet (*) | 59.5 | 45.3 | 51.4 | 46.2 | 53.0 | 27.4 | 23.7 |
| HDNet | 21.4 | 22.7 | 58.3 | 27.5 | 37.3 | 12.2 | 49.2 |
| SMAP (*) | 42.1 | 41.4 | 46.5 | 16.3 | 53.0 | 26.4 | 47.5 |
| GnTCN (*) | 64.7 | | 63.1 | 52.6 | | | 31.9 |
| TDBU_Net | | 57.1 | 49.3 | | 36.1 | | |
| Root-GAST with GAR (*) | | | | | | | |

| Method | TS8 | TS9 | TS10 | TS11 | TS12 | TS13 | TS14 |
|---|---|---|---|---|---|---|---|
| 3D MPPE PoseNet (*) | 26.4 | 39.1 | 23.6 | 8.3 | 14.9 | 38.2 | 29.5 |
| HDNet | | 43.9 | 43.2 | | | 39.7 | 28.3 |
| SMAP (*) | 18.7 | 36.7 | | | 22.7 | 24.3 | 38.9 |
| GnTCN (*) | 35.2 | 53.0 | 28.3 | 37.6 | 26.7 | 46.3 | |
| TDBU_Net | 33.0 | 43.5 | 52.8 | | | 37.1 | |
| Root-GAST with GAR (*) | | | 33.5 | 26.1 | | | |

| Method | TS15 | TS16 | TS17 | TS18 | TS19 | TS20 | Avg |
|---|---|---|---|---|---|---|---|
| 3D MPPE PoseNet (*) | 36.8 | 23.6 | 14.4 | 20.0 | 18.8 | 25.4 | 31.8 |
| HDNet | 49.5 | 23.8 | 18.0 | 26.9 | 25.0 | 38.8 | 35.2 |
| SMAP (*) | 47.5 | 34.2 | 35.0 | 20.0 | 38.7 | 64.8 | 38.7 |
| GnTCN (*) | | | 23.5 | | | | 46.3 |
| TDBU_Net | 47.3 | 20.3 | | | | | |
| Root-GAST with GAR (*) | 35.7 | | 26.0 | 35.3 | | | |
Average precision of the root keypoint evaluation by different distances on the MuPoTS-3D dataset.
| Method | AP | AP | AP | AP |
|---|---|---|---|---|
| 3D MPPE PoseNet [ | 31.0 | 21.5 | 10.2 | 2.3 |
| HDNet [ | 39.4 | 28.0 | 14.6 | 4.1 |
| Root-GAST with GA | 56.8 | 47.1 | 36.8 | 22.4 |
MPJPE of the relative poses on the MuPoTS-3D dataset. The best is in bold, the second best is underlined.
| Method | Year | MPJPE (mm) |
|---|---|---|
| Temporal smoothing | 2020 | 107 |
| Temporal smoothing + Pose refinement | 2020 | |
| Depth Prediction Network | 2019 | 120 |
| LCR-Net | 2017 | 146 |
| Mehta et al. | 2018 | 132 |
| GAST-Net | - | |
MRPE results compared with RootNet [31] on the Human3.6M dataset. MRPE_x, MRPE_y, and MRPE_z are the average MRPE errors along the x, y, and z axes, respectively.
| Method | MRPE (mm) | MRPE_x | MRPE_y | MRPE_z |
|---|---|---|---|---|
| 3D MPPE PoseNet | 289.28 | 35.95 | 58.65 | 268.49 |
| Root-GAST with GA | 178 | 33 | 41.9 | 158 |
Response time per model.
| Model | Min Response Time (ms) | Max Response Time (ms) | Average Response Time (ms) |
|---|---|---|---|
| Yolo-v3 | 24 | 30 | 28 |
| HrNet | 9 | 12 | 10 |
| GAST-Net | 27 | 33 | 29 |
| GAST-Net | 23 | 29 | 26 |
| RootNet | 4 | 8 | 5 |
Frame rate per strategy.
| Strategy | Average Frame Rate (fps) |
|---|---|
| Root-GAST with GR | 13 |
| Root-GAST with GA | 16 |
| Root-GAST with GAR | 15 |
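As a rough consistency check on the 15 fps reported for the full GAR strategy, summing the average per-model response times from the table above (assuming the four stages execute sequentially on each frame, and taking the 26 ms GAST-Net variant) yields a frame rate in the same range:

```python
# Average per-model response times (ms) from the response-time table above;
# two GAST-Net variants are reported, and using the 26 ms one here is an
# assumption about which variant the GAR strategy runs.
latencies_ms = {"Yolo-v3": 28, "HrNet": 10, "GAST-Net": 26, "RootNet": 5}

# With the four stages running back-to-back per frame, the frame rate is
# the reciprocal of the summed latency.
total_ms = sum(latencies_ms.values())  # 69 ms per frame
fps = 1000.0 / total_ms                # about 14.5 fps, close to the reported 15
```

The small gap between ~14.5 fps and the reported 15 fps would be consistent with minor pipelining or measurement variance rather than a fully serial execution.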
Figure 5: Erroneous 3D multi-person pose estimations. In the left two images, two different people are assigned nearly identical poses because one is completely occluded. In the right two images, one pose is incorrect because the body parts extend partially outside the bounding box.