Yiqi Wu 1,2, Shichao Ma 1, Dejun Zhang 1,2, Weilun Huang 3, Yilin Chen 2,4.
Abstract
Estimating accurate 3D human poses from 2D images remains challenging because 2D data carry no explicit depth information. This paper proposes an improved mixture density network for 3D human pose estimation, called the Locally Connected Mixture Density Network (LCMDN). Instead of regressing coordinates directly or providing a single unimodal estimate per joint, our approach predicts multiple possible hypotheses with a Mixture Density Network (MDN). The network operates in two steps: the 2D joint positions are first estimated from the input images; then a feature extractor captures the correlations between human joints. From the extracted pose feature, multiple pose hypotheses are generated by the hypotheses generator. In addition, to make better use of the relationships between human joints, we introduce the Locally Connected Network (LCN) as a generic formulation that replaces the traditional Fully Connected Network (FCN) in the feature extraction module. Finally, to select the most appropriate 3D pose, a 3D pose selector based on the ordinal ranking of joints scores the predicted poses. The LCMDN notably improves the representation capability and robustness of the original MDN method. Experiments are conducted on the Human3.6M and MPII datasets. The average Mean Per Joint Position Error (MPJPE) of our proposed LCMDN reaches 50 mm on the Human3.6M dataset, which is on par with or better than state-of-the-art work. Qualitative results on the MPII dataset show that our network has strong generalization ability.
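The MDN hypotheses generator described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weight matrices are random placeholders, and the feature size (128), joint count (16), and component count M = 5 are assumptions. Each Gaussian component's mean serves as one 3D pose hypothesis.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mdn_hypotheses(features, W_pi, W_mu, W_sigma, n_joints=16, n_components=5):
    """Map a pose feature vector to the parameters of a Gaussian mixture over
    3D joint coordinates. The M component means are the M pose hypotheses."""
    pi = softmax(features @ W_pi)                              # (M,) mixture weights
    mu = (features @ W_mu).reshape(n_components, n_joints, 3)  # component means
    sigma = np.exp(features @ W_sigma)                         # (M,) positive spreads
    return pi, mu, sigma

# Toy usage with random, untrained weights (shapes only).
rng = np.random.default_rng(0)
f = rng.normal(size=128)
M, J = 5, 16
pi, mu, sigma = mdn_hypotheses(
    f,
    rng.normal(size=(128, M)) * 0.1,
    rng.normal(size=(128, M * J * 3)) * 0.1,
    rng.normal(size=(128, M)) * 0.1,
)
```

The softmax over `pi` and the exponential on `sigma` are the standard MDN constraints that keep the mixture weights normalized and the variances positive.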
Keywords: 3D human pose estimation; Gaussian mixture model; graph convolutional network; mixture density network; ordinal ranking
Year: 2022 PMID: 35808480 PMCID: PMC9269848 DOI: 10.3390/s22134987
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Pipeline of the proposed Locally Connected Mixture Density Network (LCMDN). Our network first takes RGB images as input and outputs multiple 3D pose predictions; a 3D pose selector is then trained to select the best estimate.
Figure 2. Illustration of the proposed Locally Connected Mixture Density Network. The network consists of a 2D pose estimator, a feature extractor, a 3D pose hypotheses generator, and a 3D pose selector. The hypotheses generator outputs multiple hypothetical 3D poses from the detected 2D joints. The ordinal matrix is used to generate the ordinal ranking of joints to select the correct human pose estimation. M denotes the number of Gaussian components.
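The ordinal-ranking selection step can be illustrated as follows. This is a hedged sketch, not the paper's exact scoring rule: each hypothesis is scored by how closely its pairwise depth ordering agrees with a predicted ordinal matrix, and the joint count and random poses here are illustrative.

```python
import numpy as np

def ordinal_matrix(pose):
    """Pairwise depth ordering: entry (i, j) is +1 if joint i lies deeper than
    joint j, -1 if shallower, and 0 if they share the same depth."""
    z = pose[:, 2]
    return np.sign(z[:, None] - z[None, :])

def select_pose(hypotheses, predicted_ordinal):
    """Score each hypothetical 3D pose by the fraction of pairwise depth
    orderings matching the predicted ordinal matrix; return the best pose."""
    scores = [(ordinal_matrix(h) == predicted_ordinal).mean() for h in hypotheses]
    return hypotheses[int(np.argmax(scores))], scores

# Toy usage: the depth-reversed hypothesis scores lower than the consistent one.
rng = np.random.default_rng(1)
gt = rng.normal(size=(16, 3))
flipped = gt.copy()
flipped[:, 2] *= -1                      # depth-reversed hypothesis
best, scores = select_pose([flipped, gt], ordinal_matrix(gt))
```

Using only relative depth orderings, rather than absolute depths, makes the selector robust to the global depth ambiguity of monocular estimation.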
Figure 3. The difference in the structure matrix between the LCN and the FCN. For Joint 2, the LCN considers only the relationships with the orange joints, while the FCN considers the relationships with all other joints.
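The LCN structure matrix can be sketched as a K-hop mask over the skeleton graph. The 6-joint chain below is a hypothetical toy skeleton; a locally connected layer would multiply its dense weight matrix elementwise by this mask, whereas an FCN corresponds to an all-ones mask.

```python
import numpy as np

# Skeleton edges for a simplified 6-joint chain (hypothetical indices).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

def khop_mask(edges, n_joints, k):
    """Structure matrix of an LCN layer: joint i may connect to joint j only
    if j is within K graph hops of i (self-connections included)."""
    adj = np.eye(n_joints, dtype=bool)
    for i, j in edges:
        adj[i, j] = adj[j, i] = True
    reach = adj.copy()
    for _ in range(k - 1):
        # Expand reachability by one hop per iteration.
        reach = (reach.astype(int) @ adj.astype(int)) > 0
    return reach.astype(float)

mask = khop_mask(edges, 6, 2)
# A locally connected layer then uses W * mask in place of a dense W.
```

With K = 2, joint 0 of the chain reaches joints 1 and 2 but not joint 3, which is exactly the locality the figure depicts.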
The numerical results under Protocol #1 and comparisons on the Human3.6M dataset (mm). (Best result in bold.)
| Method | Direct. | Discuss. | Eating | Greet | Phone | Photo | Pose | Purch. | Sitting | SittingD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lin et al. [ | 132.7 | 183.6 | 132.3 | 164.4 | 162.1 | 205.9 | 150.6 | 171.3 | 151.6 | 243.0 | 162.1 | 170.7 | 177.1 | 96.6 | 127.9 | 162.1 |
| Du et al. [ | 85.1 | 112.7 | 104.9 | 122.1 | 139.1 | 135.9 | 105.9 | 166.2 | 117.5 | 226.9 | 120.0 | 117.7 | 137.4 | 99.3 | 106.5 | 126.5 |
| Zhou et al. [ | 87.4 | 109.3 | 87.1 | 103.2 | 116.2 | 143.3 | 106.9 | 99.8 | 124.5 | 199.2 | 107.4 | 118.1 | 114.2 | 79.4 | 97.7 | 113.0 |
| Pavlakos et al. [ | 67.4 | 71.9 | 66.7 | 69.1 | 72.0 | 77.0 | 65.0 | 68.3 | 83.7 | 96.5 | 71.7 | 65.8 | 74.9 | 59.1 | 63.2 | 71.9 |
| Jahangiri et al. [ | 63.1 | 55.9 | 58.1 | 64.5 | 68.7 | 61.3 | 55.6 | 86.1 | 117.6 | 71.0 | 71.2 | 66.3 | 57.1 | 62.5 | 61.0 | 68.0 |
| Zhou et al. [ | 54.8 | 60.7 | 58.2 | 71.4 | 62.0 | 65.5 | 53.8 | 55.6 | 75.2 | 111.6 | 64.1 | 66.0 | 51.4 | 63.2 | 55.3 | 64.9 |
| Martinez et al. [ | 51.8 | 56.2 | 58.1 | 59.0 | 69.5 | 78.4 | 55.2 | 58.1 | 74.0 | 94.6 | 62.3 | 59.1 | 65.1 | 49.5 | 52.4 | 62.9 |
| Lee et al. [ | 43.8 | 51.7 | 48.8 | 53.1 | — | 74.9 | 52.7 | — | — | 74.3 | 56.7 | 66.4 | — | 68.4 | 45.6 | 55.8 |
| Ci et al. [ | 46.8 | 52.3 | 44.7 | 50.4 | 52.9 | 68.9 | 49.6 | 46.4 | 60.2 | 78.9 | — | 50.0 | 54.8 | — | 43.3 | 52.7 |

(Cells marked — and the remaining rows of this table, including those for the proposed LCMDN, are not recoverable from the extracted source.)
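The error reported throughout these tables is MPJPE. A minimal sketch of the metric, where the 16-joint pose and the constant offset are illustrative:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance (here in mm)
    between predicted and ground-truth 3D joint positions."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.zeros((16, 3))
pred = gt + np.array([3.0, 0.0, 4.0])  # every joint off by a 3-4-5 triangle
print(mpjpe(pred, gt))                 # → 5.0
```

In the multi-hypothesis setting, MPJPE is computed on the single pose chosen by the selector, so the metric remains directly comparable to unimodal methods.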
Results with missing joints and comparisons. The first three rows show the results with one missing joint, and the last three rows show the results with two missing joints. (Best result in bold.)
| Method | Direct. | Discuss. | Eating | Greet | Phone | Photo | Pose | Purch. | Sitting | SittingD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jahangiri et al. [ | 108.6 | 105.9 | 105.6 | 109.0 | 105.5 | 109.9 | 102.0 | 111.3 | 119.6 | 107.8 | 107.1 | 111.3 | 108.4 | 107.0 | 110.3 | 108.6 |
| Martinez et al. [ | 57.4 | 64.6 | 64.3 | 65.6 | 73.3 | 85.5 | 61.0 | 62.1 | 84.0 | 101.1 | 68.2 | 66.7 | 70.8 | 55.6 | 59.6 | 69.1 |
| Li et al. [ | 48.9 | 53.9 | 54.5 | 55.5 | 62.6 | 70.4 | 51.3 | 52.0 | 69.7 | 83.9 | 60.7 | 57.2 | 62.4 | 48.3 | 50.8 | 58.8 |
| Jahangiri et al. [ | 125.0 | 121.8 | 115.1 | 124.1 | 116.9 | 123.8 | 116.4 | 119.6 | 130.8 | 120.6 | 118.4 | 127.1 | 125.9 | 121.6 | 127.6 | 122.3 |
| Martinez et al. [ | 62.9 | 66.9 | 69.9 | 71.4 | 80.2 | 93.8 | 66.3 | 65.9 | 90.6 | 109.7 | 74.2 | 72.1 | 75.5 | 61.7 | 65.7 | 75.1 |
| Li et al. [ | 54.0 | 58.5 | 60.6 | 61.4 | 68.6 | 77.9 | 56.6 | 57.0 | 77.8 | 92.4 | 66.2 | 62.6 | 67.5 | 52.5 | 55.0 | 64.6 |
Figure 4. Visualization results on the Human3.6M dataset. The first column shows the 2D poses estimated by the stacked hourglass network. The second column shows the ground truth offered by the dataset. The following five columns show the hypothetical results generated by the hypotheses generator of the LCMDN, and the poses outlined in orange are the final poses selected by the LCMDN.
Comparison between different neighborhood steps (MPJPE in mm). The numbers in the first row represent the value of K.
| Method | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| LCN [ | 58.77 | 57.73 | 57.56 | 58.37 |
| LCMDN | 51.28 | 50.02 | 50.42 | 50.45 |
Comparison between different numbers of kernels (Gaussian mixture components; MPJPE in mm). The numbers in the first row represent the number of kernels.
| Method | 1 | 3 | 5 | 8 |
|---|---|---|---|---|
| Li et al. [ | 62.9 | 55.2 | 52.7 | 52.6 |
| LCMDN | 58.8 | 52.4 | 50.0 | 49.8 |
Figure 5. Qualitative results on the MPII dataset. Columns 1, 3, and 5 show the input images, and columns 2, 4, and 6 show the corresponding 3D poses generated by our network.