Onorina Kovalenko, Vladislav Golyanik, Jameel Malik, Ahmed Elhayek, Didier Stricker.
Abstract
Recovery of articulated 3D structure from 2D observations is a challenging computer vision problem with many applications. Current learning-based approaches achieve state-of-the-art accuracy on public benchmarks but are restricted to the specific types of objects and motions covered by their training datasets. Model-based approaches do not rely on training data but show lower accuracy on these benchmarks. In this paper, we introduce a model-based method called Structure from Articulated Motion (SfAM), which can recover multiple object and motion types without training on extensive data collections. At the same time, it performs on par with state-of-the-art learning-based approaches on public benchmarks and outperforms previous non-rigid structure from motion (NRSfM) methods. SfAM is built upon a general-purpose NRSfM technique and integrates a soft spatio-temporal constraint on the bone lengths. We use an alternating optimization strategy to recover the optimal geometry (i.e., bone proportions) together with 3D joint positions by enforcing bone-length consistency over a series of frames. SfAM is highly robust to noisy 2D annotations, generalizes to arbitrary objects and does not rely on training data, as shown in extensive experiments on public benchmarks and real video sequences. We believe that it brings a new perspective to the domain of monocular 3D recovery of articulated structures, including human motion capture.
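The soft bone-length constraint described in the abstract can be pictured as a quadratic penalty that ties each bone's per-frame length to a single length shared across the whole sequence. The following is a minimal illustrative sketch, assuming a (frames, joints, 3) joint array; the function name, weighting, and exact form of the penalty are our assumptions, not the paper's implementation:

```python
import numpy as np

def bone_length_energy(joints_3d, bones, shared_lengths, weight=1.0):
    """Soft spatio-temporal bone-length penalty (illustrative sketch).

    joints_3d      : (T, J, 3) per-frame 3D joint positions
    bones          : list of (parent, child) joint-index pairs
    shared_lengths : (B,) bone lengths shared across all T frames
    """
    energy = 0.0
    for b, (i, j) in enumerate(bones):
        # length of bone b in every frame
        per_frame = np.linalg.norm(joints_3d[:, i] - joints_3d[:, j], axis=1)
        # quadratic penalty on deviation from the shared length
        energy += weight * np.sum((per_frame - shared_lengths[b]) ** 2)
    return energy
```

Because the constraint is soft, it is added to the data term of the NRSfM objective rather than enforced exactly, which keeps the optimization robust to noisy 2D annotations.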
Keywords: articulated structure recovery; human pose estimation; structure from motion
Year: 2019 PMID: 31652665 PMCID: PMC6833108 DOI: 10.3390/s19204603
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. We recover different articulated structures from real-world videos with high accuracy and no need for training data. Our Structure from Articulated Motion (SfAM) approach is not restricted to a single object class and only requires a rough articulated structure prior. The reconstructions are provided under different view angles.
Figure 2. Side-by-side comparison of the non-rigid structure from motion (NRSfM) method [11] and our SfAM. Reconstruction results of [11] violate anthropometric properties of the human skeleton due to changing bone lengths from frame to frame.
Figure 3. The pipeline of the proposed SfAM approach. Following factorization-based NRSfM, we first recover the camera pose using 2D position observations. Then, we recover the 3D articulated structure by optimizing our new energy functional accounting for articulated priors.
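The alternating optimization mentioned in the abstract has a cheap half-step worth noting: with the 3D joints held fixed, the shared length that minimizes a quadratic consistency penalty for each bone is simply the mean of that bone's per-frame lengths. A minimal sketch under that assumption (names and array layout are illustrative, not the authors' code):

```python
import numpy as np

def update_shared_lengths(joints_3d, bones):
    """One half-step of the alternation: with the (T, J, 3) joints fixed,
    the quadratic-optimal shared length of each bone is the mean of its
    per-frame lengths."""
    return np.array([
        np.linalg.norm(joints_3d[:, i] - joints_3d[:, j], axis=1).mean()
        for (i, j) in bones
    ])
```

The other half-step, re-estimating the 3D joints against the 2D observations and priors with the bone lengths fixed, is the expensive part handled by the energy minimization in the pipeline.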
The reconstruction error (mm) of SfAM and previous methods on the Human 3.6m dataset [12] under protocols P1–P3. “*” indicates learning-based methods trained on Human 3.6m. We outperform all model-based approaches and come very close to the tuned supervised learning techniques.
| Method | P1 | P2 | P3 |
|---|---|---|---|
| Zhou et al. | 106.7 | - | - |
| Akhter et al. | - | 181.1 | - |
| Ramakrishna et al. | - | 157.3 | - |
| Bogo et al. | - | 82.3 | - |
| Kanazawa et al. | 67.5 | 66.5 | - |
| Moreno-Noguer | 62.2 | - | - |
| Yasin et al. | - | - | 110.2 |
| Rogez et al. | - | - | 88.1 |
| Chen, Ramanan | - | - | 82.7 |
| Nie et al. | - | - | 79.5 |
| Sun et al. | - | - | 48.3 |
| Omran et al. | 59.9 | - | - |
| Zhou et al. | 54.7 | - | - |
| Mehta et al. | 54.6 | - | - |
| Pavlakos et al. | 51.9 | - | - |
| Kinauer et al. | 50.3 | - | - |
| Tekin et al. | 50.1 | - | - |
| Rogez et al. | 49.2 | 51.1 | 42.7 |
| Habibie et al. | 49.2 | - | - |
| Martinez et al. | 45.6 | - | - |
| Zhao et al. | 43.8 | - | - |
| Pavlakos et al. | 41.8 | - | - |
| Arnab, Doersch et al. | 41.6 | - | - |
| Chen, Lin et al. | 41.6 | - | - |
| Sun et al. | 40.6 | - | - |
| Wandt, Rosenhahn | 38.2 | - | - |
| Pavllo et al. | 36.5 | - | - |
| Dabral et al. | 36.3 | - | - |
| SMSR | 106.6 | 105.2 | 102.9 |
| SMSR | 145.2 | 124.0 | 139.9 |
| Our SfAM | 51.2 | 51.7 | 53.9 |
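For reference, the per-protocol numbers above are mean per-joint position errors (MPJPE). A minimal sketch of the metric, assuming the predictions are already aligned to the ground truth (the exact alignment step differs per protocol):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: pred and gt are (T, J, 3) arrays
    in the same units (mm for Human 3.6m)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
```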
The normalized mean 3D error of previous NRSfM methods and our SfAM on the synthetic sequences of [20]. “-” marks values lost in the source record.
| Method | Drink | PickUp | Stretch | Yoga |
|---|---|---|---|---|
| MP | 0.4604 | 0.4332 | 0.8549 | 0.8039 |
| PTA | 0.0250 | 0.2369 | 0.1088 | 0.1625 |
| CSF1 | 0.0223 | 0.2301 | 0.0710 | 0.1467 |
| CSF2 | 0.0223 | 0.2277 | - | 0.1465 |
| BMM | 0.0266 | - | 0.1034 | - |
| Lee | 0.8754 | 1.0689 | 0.9005 | 1.2276 |
| PPTA | - | 0.235 | 0.084 | 0.158 |
| SMSR | 0.0287 | - | 0.0783 | 0.1493 |
| SMSR | 0.4348 | 0.4965 | 0.3721 | 0.4471 |
| Our SfAM | 0.0226 | - | - | - |
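The normalized mean 3D error divides the raw per-joint error by a per-frame scale of the ground-truth shape, which makes the metric invariant to the global scale ambiguity of monocular reconstruction. One common formulation from the NRSfM literature is sketched below; the exact normalization used by [20] may differ in detail:

```python
import numpy as np

def normalized_mean_3d_error(pred, gt):
    """pred, gt: (T, P, 3) reconstructed and ground-truth points.
    Each frame's error is normalized by the mean per-axis standard
    deviation of that frame's ground-truth shape."""
    err = np.linalg.norm(pred - gt, axis=-1)  # (T, P) per-point errors
    sigma = gt.std(axis=1).mean(axis=1)       # (T,) per-frame scale
    return float(np.mean(err / sigma[:, None]))
```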
Figure 4. Comparison of our SfAM and NRSfM [11] on Human 3.6m [12]. NRSfM treats humans as general non-rigid objects, so bone lengths change from frame to frame.
Figure A1. Additional visualizations of our results and reconstructions with NRSfM of Ansari et al. [11] on several sequences from [12]. (a)–(c): our results on sitting, photo and discussion. These sequences and poses are among the most challenging in the dataset. (d): comparison of our SfAM and NRSfM [11].
Figure 5. (a): the reconstruction error under 2D noise; (b): under incorrect bone-length initializations; (c): average bone-length error for increasing levels of Gaussian noise before (red) and after (green) the optimization; (d): standard deviation of bone lengths for SMSR [11] and our SfAM.
Figure 6. Comparison of our SfAM, NRSfM [11], and the learning-based method of Martinez et al. [9] on challenging real-world videos.
Figure 7. Comparison of our SfAM to NRSfM [11] on the NYU hand pose dataset [14].