Wim Pouw, James P Trujillo, James A Dixon.
Abstract
There is increasing evidence that hand gestures and speech synchronize their activity on multiple dimensions and timescales. For example, gesture's kinematic peaks (e.g., maximum speed) are coupled with prosodic markers in speech. Such coupling operates on very short timescales at the level of syllables (200 ms), and therefore requires high-resolution measurement of gesture kinematics and speech acoustics. High-resolution speech analysis is common for gesture studies, given that field's classic ties with (psycho)linguistics. However, the field has lagged behind in the objective study of gesture kinematics (e.g., as compared to research on instrumental action). Often kinematic peaks in gesture are measured by eye, where a "moment of maximum effort" is determined by several raters. In the present article, we provide a tutorial on more efficient methods to quantify the temporal properties of gesture kinematics, in which we focus on common challenges and possible solutions that come with the complexities of studying multimodal language. We further introduce and compare, using an actual gesture dataset (392 gesture events), the performance of two video-based motion-tracking methods (deep learning vs. pixel change) against a high-performance wired motion-tracking system (Polhemus Liberty). We show that the videography methods perform well in the temporal estimation of kinematic peaks, and thus provide a cheap alternative to expensive motion-tracking systems. We hope that the present article incites gesture researchers to embark on the widespread objective study of gesture kinematics and their relation to speech.
Keywords: Deep learning; Gesture and speech analysis; Motion tracking; Multimodal language; Video recording
Year: 2020 PMID: 31659689 PMCID: PMC7148275 DOI: 10.3758/s13428-019-01271-9
Source DB: PubMed Journal: Behav Res Methods ISSN: 1554-351X
Overview of motion-tracking methods
| Method | Key features | Cost level | Application |
|---|---|---|---|
| Video-based | | | |
| Pixel differentiation | Simple to compute; Requires very stable background | Low | Calculation of overall movement and velocity in relatively constrained data |
| Computer vision | Can track very specific parts of the scene (e.g., hands, face); Computationally costly | Low | Tracking specific body parts and/or movements of multiple people |
| Device-based | | | |
| Wired | High precision and robust against occlusion; Limited by number of wired sensors that can easily be attached | High | Focus on a small number of articulators, for which precision is needed and occlusion may be a problem for other methods |
| Optical (markered) | Gold-standard precision; Requires calibration and for participants to wear visible markers | High | High precision tracking of multiple body parts on one or multiple participants |
| Markerless (single-camera) | Non-invasive, 3-D tracking; Lower precision and tracking stability | Moderate | Mobile setup for whole-body tracking when fine-grained precision is less necessary |
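
The pixel differentiation row in the table above can be illustrated with a minimal sketch. It assumes grayscale frames given as nested lists of intensities; the function names and the noise `threshold` parameter are hypothetical and are not taken from the article. In practice, frames would typically be read from video with an image-processing library rather than built by hand.

```python
from typing import List, Sequence

# Assumption: a frame is a grid of grayscale intensities (0-255), row by row.
Frame = Sequence[Sequence[int]]


def pixel_change(prev: Frame, curr: Frame, threshold: int = 10) -> int:
    """Sum absolute intensity changes between two frames.

    Changes at or below `threshold` are ignored as sensor noise
    (the threshold is an illustrative assumption, not a published value).
    """
    total = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            diff = abs(b - a)
            if diff > threshold:
                total += diff
    return total


def pixel_change_series(
    frames: Sequence[Frame], fps: float, threshold: int = 10
) -> List[float]:
    """Summed pixel change per second for each pair of consecutive frames."""
    return [
        pixel_change(f0, f1, threshold) * fps
        for f0, f1 in zip(frames, frames[1:])
    ]
```

Because every changed pixel contributes to the sum, any background motion is counted as movement, which is one way to see why the method requires a very stable background.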
Fig. 1 Schematic overview of post-processing steps
Fig. 2 Example gesture event peak speed per method. Example of a gesture event lasting 800 ms (see the video here: https://osf.io/aj2uk/) from the dataset (Event 10 from Participant 2). Red dots indicate the positive maxima peaks in the respective data streams: fundamental frequency in hertz (F0), Polhemus speed in centimeters per second, DNN speed in pixel position change per second, and pixel-method speed in summed pixel change per second. Note that velocity is directional, whereas speed is the magnitude of velocity irrespective of direction.
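
The quantities named in this caption (speed from tracked positions, the peak of each data stream, and the timing difference between peaks) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: positions are 2-D points sampled at a constant rate, both streams start at the same moment, and the sign convention (positive when peak speed follows peak F0) is chosen here for illustration only. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def speed_series(positions: Sequence[Point], fs: float) -> List[float]:
    """Speed (magnitude of velocity) between consecutive samples.

    Units are position units per second; direction is discarded.
    """
    return [math.dist(p0, p1) * fs for p0, p1 in zip(positions, positions[1:])]


def peak_time_ms(values: Sequence[float], fs: float) -> float:
    """Time in ms, relative to the first sample, of the global maximum."""
    idx = max(range(len(values)), key=values.__getitem__)
    return 1000.0 * idx / fs


def synchrony_ms(
    speed: Sequence[float], fs_motion: float,
    f0: Sequence[float], fs_audio: float,
) -> float:
    """Peak-speed time minus peak-F0 time (assumed sign convention)."""
    return peak_time_ms(speed, fs_motion) - peak_time_ms(f0, fs_audio)
```

For example, with `positions = [(0, 0), (0, 0), (3, 4), (3, 4)]` sampled at 100 Hz, the speed series peaks at the second interval, since the hand moves 5 units within one 10 ms step.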
Results and comparisons: Estimated peak speed versus peak F0 in gestures
| | Polhemus | DNN | Pixel |
|---|---|---|---|
| Estimated mean | −10 ms | 39 ms | −14 ms |
| Correlation with Polhemus | | | |
| r | | .756 | .797 |
| 95% CI | | [.700–.803] | [.750–.837] |
| p | | < .00001 | < .00001 |
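
A correlation with a 95% confidence interval, as reported in the table above, could be computed along the following lines. The record does not state how the intervals were obtained or how many events entered the correlation, so the Fisher z-transformation used here is an assumption, and the function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r: float, n: int, level: float = 0.95) -> Tuple[float, float]:
    """Confidence interval for r via the Fisher z-transformation.

    Assumes approximately bivariate-normal data and n > 3.
    """
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)
```

Here `x` would hold one method's synchrony estimate per gesture event and `y` the Polhemus estimate for the same events.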
Fig. 3 Results and comparisons: Estimated peak speed versus peak F0 in gestures. Upper panel: Videography estimates of gesture–speech synchrony (vertical axis) are compared to Polhemus estimates of synchrony. Purple dots indicate pixel change method performance relative to Polhemus, and red dots indicate deep neural network (DNN) performance relative to Polhemus. A 1:1 slope (as indicated by the black dashed line of identity) would indicate identical performance of videography and Polhemus. Dots along the region of the identity line indicate comparable approximations of gesture–speech synchrony for the different methods. Note that some points that fell far from the point cloud are excluded (for the full graph, go to https://osf.io/u9yc2/). Lower panel: Smoothed density distributions of the gesture–speech synchrony estimates per method, with means (dashed vertical lines) indicating average gesture–speech synchrony.
Fig. 4 Example trajectory as measured by the Polhemus versus the deep neural network (DNN): Example of an iconic gesture with a circling motion (axis z-scaled), as registered by the Polhemus and the DNN. This type of positional information is not available when using the pixel change method. For our comparison, we looked at the moment at which the negative velocity was largest, that is, where the gesture reached its highest speed when moving downward.
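
Locating the moment of peak downward velocity, as described in this caption, might look like the sketch below. It assumes a vertical position series in which larger values mean higher positions (in image coordinates the axis usually points down, so the sign would need flipping) and a constant sampling rate. The function names are hypothetical.

```python
from typing import List, Optional, Sequence


def vertical_velocity(z: Sequence[float], fs: float) -> List[float]:
    """Signed vertical velocity between consecutive samples.

    Negative values mean downward movement under the assumed axis direction.
    """
    return [(b - a) * fs for a, b in zip(z, z[1:])]


def peak_downward_time_ms(z: Sequence[float], fs: float) -> Optional[float]:
    """Time in ms of the most negative velocity, or None if never moving down."""
    v = vertical_velocity(z, fs)
    if not v:
        return None
    idx = min(range(len(v)), key=v.__getitem__)
    if v[idx] >= 0:
        return None
    return 1000.0 * idx / fs
```

Unlike the speed series, this keeps the sign of the movement, which is only possible with positional data such as that from the Polhemus or the DNN.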