Elena Nicora, Gaurvi Goyal, Nicoletta Noceti, Alessia Vignolo, Alessandra Sciutti, Francesca Odone.
Abstract
MoCA is a bi-modal dataset comprising motion capture data and video sequences, acquired from multiple views including an ego-like viewpoint, of upper-body actions in a cooking scenario. It was collected specifically to investigate view-invariant action properties in both biological and artificial systems. Beyond that, it is a test bed for research in a number of fields, including cognitive science and artificial vision, and for application domains such as motor control and robotics. Compared with other available benchmarks, MoCA offers a compromise between research communities that take very different approaches to data gathering: from action recognition in the wild, the standard practice nowadays in Computer Vision and Machine Learning, to motion analysis in very controlled scenarios, as for motor control in biomedical applications. In this work we introduce the dataset and its peculiarities, and discuss a baseline analysis as well as examples of applications for which the dataset is well suited.
Year: 2020 PMID: 33319816 PMCID: PMC7738546 DOI: 10.1038/s41597-020-00776-9
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
A comparison of the MoCA dataset with existing benchmarks: HMDB[3], ActivityNet[1], HACS[4], Kinetics-700[2], UCF 101[5], MPII Cooking 2[6], EPIC-Kitchens[11], You Cook 2[35], Arbitrary view[7], IXMAS[8], NUCLA[9], NTU[10], Schreiber & Moissenet[12], Fukuchi et al.[13], UE-HRI[36], CMU-MMAC[37], TUM Kitchen[38], Ego Yale[39].
| Dataset | Visual sensors | View(s) Setup | Body part | Environment | Acquisition Conditions | Annotated Task |
|---|---|---|---|---|---|---|
| HMDB [3] | RGB | FV (a) | Full/Upper | Clutt. | Web | Action Rec. |
| ActivityNet [1] | RGB | FV | Full/Upper | Clutt. | Web | Activity Rec. |
| HACS [4] | RGB | FV | Full/Upper | Clutt. | Web | Action Rec., Action det. |
| Kinetics-700 [2] | RGB | FV | Full/Upper | Clutt. | Web | Action Rec. |
| UCF 101 [5] | RGB | FV | Full/Upper | Clutt. | Web | Action Rec. |
| MPII Cooking 2 [6] | RGB | 1 V (b) | Full/Upper | Clutt. | LabFM | Activity Rec. |
| EPIC-Kitchens [11] | RGB | Ego | Arms | Clutt. | LabFM | Activity Rec., Object Rec. |
| You Cook 2 [35] | RGB | FV | (Mostly) upper | Clutt. | Web | Activity Rec., Object Rec. |
| Arbitrary view [7] | RGB-D*, Skeleton* | 6 V, CVV | Upper | Clean | LabPA | Action Rec. |
| IXMAS [8] | RGB* | 5 V | Full | Clean | LabPA | Action Rec. |
| NUCLA [9] | RGB-D*, Skeleton* | 3 V | Full | Both | LabPA | Action Rec. |
| NTU [10] | RGB-D*, Skeleton* | 3 V (c) | Full | Clean | LabPA | Action Rec. |
| Schreiber & Moissenet [12] | Skeleton | — | Full | — | LabGait | Gait analysis |
| Fukuchi et al. [13] | Skeleton | — | Legs | — | LabGait | Gait analysis |
| UE-HRI [36] | RGB-D | FV | Upper | Clean | LabFM | Human engag. |
| CMU-MMAC [37] | RGB*, Skeleton* | 5 V | Full/Upper | Clutt. | LabFM | Activity Rec. |
| TUM Kitchen [38] | RGB*, Skeleton* | 4 V | Full | Clutt. | LabFM | Markerless Motion Analysis, Activity Rec., Activity det. |
| Ego Yale [39] | RGB | Ego | Arms | Clutt. | LabFM | Grasp Analysis |
| MoCA | RGB*, Skeleton* | 3 V | Upper | Clean | LabPA | Motion Analysis, Action Primitive Det., Action Rec. |
The column View(s) Setup describes the camera setup, which may include several fixed cameras (nV, where n is the number of cameras), may impose no specific constraint on the mutual position of camera and subject (Free Viewpoint, FV), or may use a wearable ego camera (Ego). In addition, one of the benchmarks includes a continuously varying view (CVV). With multiple views and/or visual modalities, an asterisk (*) indicates that the streams were acquired synchronously, meaning that all the visual sensors observed the very same dynamic event. The column Environment indicates whether the acquisitions were performed in a cluttered scene or collected from the web (in both cases referred to as Clutt.) or acquired against a clean background to focus on specific aspects of the analysis (Clean). The column Acquisition Conditions reports whether the videos were collected online (Web) or in a laboratory, considering predefined actions (LabPA), free movements (LabFM), or, more specifically, gaits (LabGait).
Notes: (a) Actions annotated according to 4 coarse views (front, back, left, right). (b) Among the 7 views considered, only one is fully available. (c) Height and distance of the 3 cameras have been varied to collect acquisitions from a richer set of viewpoints.
List of actions included in the MoCA, with associated main characteristics (see the text for details on the categorizations).
| Action | Structure | Objects | Arms | |||||
|---|---|---|---|---|---|---|---|---|
| P1 | P2 | P3 | S | M | ||||
| 1 | Shred a carrot | x | x | n | 1 | |||
| 2 | Cut the bread | x | x | y | 1 | |||
| 3 | Clean a dish | x | x | y | 1 | |||
| 4 | Eat | x | x | n | 1 | |||
| 5 | Beat eggs | x | x | y | 1 | |||
| 6 | Squeeze a lemon | x | x | y | 1 | |||
| 7 | Mince with a crescent | x | x | y | 2 | |||
| 8 | Mix in a bowl | x | x | y | 1 | |||
| 9 | Open a bottle | x | x | y | 2 | |||
| 10 | Turn the pancake | x | x | y | 1 | |||
| 11 | Pestle | x | x | y | 1 | |||
| 12 | Pour water in containers | x | x | y | 1 | |||
| 13 | Pour water in a mug | x | x | y | 1 | |||
| 14 | Reach an object | x | x | n | 1 | |||
| 15 | Roll the dough | x | x | y | 2 | |||
| 16 | Wash the salad | x | x | y | 1 | |||
| 17 | Salt | x | x | y | 1 | |||
| 18 | Spread cheese on bread | x | x | y | 2 | |||
| 19 | Clean the table | x | x | y | 1 | |||
| 20 | Transport an object | x | x | y | 1 | |||
Fig. 1. The acquisition setup (d) is composed of a table on which the subject performs a selection of cooking activities. The scene is observed by 3 cameras placed at 3 different viewpoints (see a–c) and by a motion capture system that collects the trajectories of joint positions over time (see e–g for examples of 3 different actions). The volunteer A. Vignolo gave her consent to include her photographs in the publication.
Sequences of actions for the 5 available scenes.
Fig. 2. A sketch of the strategy followed for the annotation. The plots report the evolution of the 3 coordinates for different actions (rolling the dough, mincing with the mezzaluna, and cleaning a dish) and, marked with red circles, the time locations that have been manually annotated as action instance delimiters. Below, sample frames from View 0 show which moment in the action each instant corresponds to.
Distribution of action instances in the training and test sequences, after manual annotation (see Section Data Annotation).
| # | Action | Training | Test |
|---|---|---|---|
| 1 | Shred a carrot | 12% | 15% |
| 2 | Cut the bread | 4% | 3% |
| 3 | Clean a dish | 4% | 3% |
| 4 | Eat | 3% | 3% |
| 5 | Beat eggs | 15% | 16% |
| 6 | Squeeze a lemon | 4% | 3% |
| 7 | Mince with a crescent | 5% | 7% |
| 8 | Mix in a bowl | 3% | 4% |
| 9 | Open a bottle | 4% | 3% |
| 10 | Turn the pancake | 4% | 3% |
| 11 | Pestle | 5% | 4% |
| 12 | Pour water in containers | 2% | 2% |
| 13 | Pour water in a mug | 4% | 4% |
| 14 | Reach an object | 4% | 4% |
| 15 | Roll the dough | 5% | 4% |
| 16 | Wash the salad | 4% | 4% |
| 17 | Salt | 3% | 3% |
| 18 | Spread cheese on bread | 4% | 4% |
| 19 | Clean the table | 4% | 4% |
| 20 | Transport an object | 7% | 7% |
Fig. 3. Lengths of action instances in training (left) and test (right) streams.
Some statistics and features on the annotated action instances.
| # | Action | Rep. | Std.Dev. 3D Pos. X | Std.Dev. 3D Pos. Y | Std.Dev. 3D Pos. Z | Std.Dev. 3D Vel. X | Std.Dev. 3D Vel. Y | Std.Dev. 3D Vel. Z | Vel. norm |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Shred a carrot | 139 | 2.85 | 2.73 | 36.77 | 0.66 | 0.48 | 5.14 | 5.00 |
| 2 | Cut the bread | 35 | 16.68 | 24.38 | 12.69 | 1.34 | 2.43 | 0.76 | 2.54 |
| 3 | Clean a dish | 35 | 12.23 | 16.11 | 11.59 | 1.65 | 2.13 | 1.56 | 2.66 |
| 4 | Eat | 28 | 20.49 | 113.11 | 92.89 | 0.45 | 1.97 | 1.66 | 0.51 |
| 5 | Beat eggs | 157 | 3.79 | 6.07 | 6.33 | 0.98 | 1.60 | 1.51 | 2.41 |
| 6 | Squeeze a lemon | 40 | 9.31 | 10.36 | 6.63 | 0.89 | 1.06 | 0.65 | 1.49 |
| 7 | Mince with a crescent | 61 | 32.76 | 7.14 | 39.91 | 3.20 | 0.74 | 3.82 | 4.63 |
| 8 | Mix in a bowl | 37 | 33.68 | 28.12 | 4.37 | 3.00 | 2.56 | 0.43 | 3.90 |
| 9 | Open a bottle | 34 | 19.35 | 8.78 | 19.97 | 0.83 | 0.52 | 0.71 | 0.94 |
| 10 | Turn the pancake | 37 | 9.97 | 14.53 | 19.97 | 0.75 | 1.32 | 1.80 | 1.51 |
| 11 | Pestle | 43 | 21.65 | 15.99 | 7.58 | 2.04 | 1.66 | 0.89 | 2.87 |
| 12 | Pour water in containers | 22 | 18.04 | 9.25 | 13.22 | 0.49 | 0.26 | 0.43 | 0.42 |
| 13 | Pour water in a mug | 40 | 52.76 | 28.16 | 86.22 | 0.99 | 0.52 | 1.49 | 0.81 |
| 14 | Reach an object | 39 | 108.80 | 123.43 | 26.22 | 2.78 | 3.25 | 1.51 | 0.56 |
| 15 | Roll the dough | 47 | 6.66 | 78.90 | 5.78 | 0.55 | 4.58 | 0.64 | 5.27 |
| 16 | Wash the salad | 41 | 19.81 | 20.88 | 1.29 | 1.22 | 1.24 | 0.10 | 1.91 |
| 17 | Salt | 35 | 36.89 | 9.44 | 37.25 | 3.24 | 0.76 | 3.26 | 2.72 |
| 18 | Spread cheese on bread | 43 | 13.55 | 16.28 | 6.73 | 0.51 | 0.54 | 0.43 | 0.67 |
| 19 | Clean the table | 40 | 8.93 | 73.42 | 2.42 | 0.91 | 6.15 | 0.28 | 6.82 |
| 20 | Transport an object | 73 | 117.59 | 105.58 | 26.27 | 2.17 | 1.96 | 1.34 | 0.86 |
From left to right. Col. 1: action identification number. Col. 2: action name or description. Col. 3: number of manually annotated action instances. Cols. 4-6: standard deviation of the 3D positions of the motion capture palm marker along the three main directions (X, Y, Z). Cols. 7-9: standard deviation of the 3D velocities, computed from the palm marker positions, along the three main directions (X, Y, Z). Col. 10: 3D velocity norm.
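The per-action statistics described in this legend can be illustrated with a short sketch. It is not the authors' code: the function name `palm_statistics`, the use of finite differences for velocity, the population standard deviation, and the choice of the mean as the summary of the velocity norm are all assumptions, since the legend does not specify them. The sampling rate `fps` is also a hypothetical parameter.

```python
import numpy as np

def palm_statistics(positions, fps):
    """Summarise one palm-marker trajectory.

    positions: array of shape (T, 3) with X, Y, Z coordinates over time.
    fps: sampling rate in Hz (hypothetical parameter, not given in the text).
    """
    positions = np.asarray(positions, dtype=float)
    # Assumption: velocity approximated by first-order finite differences.
    velocities = np.diff(positions, axis=0) * fps
    pos_std = positions.std(axis=0)    # per-axis std of position
    vel_std = velocities.std(axis=0)   # per-axis std of velocity
    # Assumption: "velocity norm" summarised as the mean Euclidean norm.
    vel_norm = np.linalg.norm(velocities, axis=1).mean()
    return pos_std, vel_std, vel_norm

# Toy trajectory: constant-speed motion along X only.
toy = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]]
pos_std, vel_std, vel_norm = palm_statistics(toy, fps=2)
```

In the toy trajectory the marker moves only along X, so the Y and Z position deviations are zero and the velocity is constant.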
Fig. 4. Normalized standard deviation of the palm 3D position for each action, referring to columns 4, 5 and 6 of Table 2. This visualisation emphasizes the presence of one or more peaks in the standard deviation of the 3 coordinates, suggesting a possible categorization of the actions according to the number of dimensions in which the movement mainly evolves, and providing a guide for manual annotation.
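As a rough illustration of the categorization suggested by this caption, the sketch below counts the dimensions in which an action mainly evolves. The position standard deviations are copied from the statistics table; normalizing by the per-action maximum and the 0.5 threshold are assumptions made only for this example, because the caption does not state how the normalization was done.

```python
import numpy as np

# Position standard deviations (X, Y, Z) copied from the statistics table.
POS_STD = {
    "Roll the dough": (6.66, 78.90, 5.78),
    "Reach an object": (108.80, 123.43, 26.22),
    "Beat eggs": (3.79, 6.07, 6.33),
}

def dominant_dimensions(std_xyz, threshold=0.5):
    """Count the axes whose normalized std reaches the threshold.

    Assumptions: normalization by the largest of the three values,
    and an arbitrary threshold of 0.5.
    """
    std = np.asarray(std_xyz, dtype=float)
    normalized = std / std.max()
    return int((normalized >= threshold).sum()), normalized

counts = {name: dominant_dimensions(std)[0] for name, std in POS_STD.items()}
```

Under these assumptions, rolling the dough is dominated by a single axis, reaching an object by two, and beating eggs spreads over all three.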
Fig. 5. Example of 3D + t histograms for 3 different actions. Above: sample frames to show the evolution of actions. Middle: histograms of action positions. Below: histograms of instantaneous velocities. All histograms refer to the palm joint.
Action recognition benchmark, see text. (*) The final 2 layers of the I3D model were finetuned on the training data.
| Modality | Method | Accuracy |
|---|---|---|
| MoCap | Space 3D + t histograms + linSVM | 0.92 ± 0.19 |
| MoCap | Vel 3D + t histograms + linSVM | 0.82 ± 0.27 |
| MoCap | Full 3D + t histograms + linSVM | 0.95 ± 0.11 |
| MoCap | Haskel | 0.98 ± 0.01 |
| Videos | | 0.94 ± 0.22 |
| Videos | Full | 0.92 ± 0.23 |
Fig. 6. A visual comparison, for a sequence of mixing actions, between the time locations corresponding to dynamic instants (i.e. local minima of a velocity profile obtained from optical flow maps as in [25]) and the annotation we provide for MoCA. While the latter identifies action instances, the former delimit motion primitives.
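The notion of dynamic instants as local minima of a velocity profile can be sketched as follows. This is a minimal illustration, not the method of [25]: the optical-flow step is not reproduced, the input is assumed to be any 1D speed signal, no smoothing is applied, and the function name `dynamic_instants` is hypothetical.

```python
import numpy as np

def dynamic_instants(speed):
    """Return the indices of local minima of a 1D speed profile.

    A sample counts as a minimum when it is strictly lower than its left
    neighbour and not higher than its right neighbour, so a flat valley
    is reported once, at its first sample.
    """
    s = np.asarray(speed, dtype=float)
    if s.size < 3:
        return np.array([], dtype=int)
    interior = s[1:-1]
    is_min = (interior < s[:-2]) & (interior <= s[2:])
    return np.flatnonzero(is_min) + 1  # shift back to indices of `speed`

# Toy profile with two valleys, the second one flat over two samples.
minima = dynamic_instants([3.0, 1.0, 2.0, 4.0, 0.5, 0.5, 3.0])
```

On real data the profile would typically be smoothed first, since noise produces many spurious minima.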
Fig. 7. A visual representation of the experiment designed to evaluate the ability of humans to judge action similarities in the absence of contextual information.
Performance evaluation (in %) on the MoCA dataset considering various training and test subsets. Views - 0: Lateral, 1: Egocentric, 2: Frontal.
| Source → Target | 0,1 → 2 | 0,2 → 1 | 1,2 → 0 | 0 → 1 | 0 → 2 | 1 → 0 | 1 → 2 | 2 → 0 | 2 → 1 |
|---|---|---|---|---|---|---|---|---|---|
| SLP | 67.46 | 46.03 | 68.10 | 47.38 | 68.33 | 47.38 | 32.86 | 66.27 | 34.84 |
| Inception | 62.30 | 61.67 | 62.70 | 50.63 | 64.84 | 33.10 | 36.35 | 61.67 | 54.92 |
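Each column of this table pairs one or two source views used for training with a different target view used for testing. The sketch below enumerates those combinations in the same order as the columns. The view names come from the caption; the function name `cross_view_splits` and the tuple representation are illustrative assumptions, and no training or evaluation is performed.

```python
from itertools import combinations

# View identifiers as given in the table caption.
VIEWS = {0: "Lateral", 1: "Egocentric", 2: "Frontal"}

def cross_view_splits(views=(0, 1, 2)):
    """List (source_views, target_views) pairs for cross-view evaluation."""
    splits = []
    # Two source views, with the remaining view as target.
    for source in combinations(views, 2):
        target = tuple(v for v in views if v not in source)
        splits.append((source, target))
    # One source view and one different target view.
    for s in views:
        for t in views:
            if s != t:
                splits.append(((s,), (t,)))
    return splits

splits = cross_view_splits()
for source, target in splits:
    src = ", ".join(VIEWS[v] for v in source)
    tgt = ", ".join(VIEWS[v] for v in target)
    print(f"train on {src}; test on {tgt}")
```

Enumerating the splits this way gives three two-source settings and six single-source settings, nine in total.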
Fig. 8. Sample frames showing the potential of a marker-less analysis for feature detection. The localized points (highlighted with different colors in the images) overlap closely with the markers placed on the arm, which the method has been trained to detect.
| Measurement(s) | body movement coordination trait • food cooking process |
| Technology Type(s) | motion capture • digital camera |
| Factor Type(s) | viewpoint |