Joris Willems, Alessandro Corbetta, Vlado Menkovski, Federico Toschi.
Abstract
We investigate, in real-life conditions and with very high accuracy, the dynamics of body rotation, or yawing, of walking pedestrians, a highly complex task due to the wide variety of shapes, postures and walking gestures. We propose a novel measurement method based on a deep neural architecture that we train on the basis of generic physical properties of pedestrian motion. Specifically, we leverage the strong statistical correlation between individual velocity and body orientation: the velocity direction is typically orthogonal to the shoulder line. We make the reasonable assumption that this approximation, although instantaneously slightly imperfect, is correct on average. This enables us to use velocity data as training labels for a highly accurate point-estimator of individual orientation, which we can train with no dedicated annotation labor. We discuss the measurement accuracy and show the error scaling on both synthetic and real-life data: our method is capable of estimating orientation with an error as low as [Formula: see text]. This tool opens up new possibilities in studies of human crowd dynamics where orientation is key. By analyzing the dynamics of body rotation in real-life conditions, we show that the instantaneous velocity direction can be described by the combination of orientation and a random delay, where randomness is provided by an Ornstein-Uhlenbeck process centered on an average delay of 80 ms. Quantifying these dynamics could have only been possible thanks to a tool as precise as the one proposed.
Year: 2020 PMID: 32669652 PMCID: PMC7363920 DOI: 10.1038/s41598-020-68287-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
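The abstract's central trick, using the velocity direction as a free training label, can be sketched in a few lines. Below is a minimal illustration, assuming trajectories sampled at a fixed rate; the function name and parameter choices are ours, not the paper's:

```python
import numpy as np

def velocity_direction_labels(positions, dt):
    """Turn a pedestrian trajectory into axial orientation labels.

    positions: (T, 2) array of planar positions sampled every dt seconds.
    Returns angles in [0, pi): the velocity direction folded onto the
    real projective line, since a shoulder line has no front/back.
    """
    velocity = np.gradient(positions, dt, axis=0)        # finite-difference velocity
    theta = np.arctan2(velocity[:, 1], velocity[:, 0])   # direction of motion in (-pi, pi]
    return np.mod(theta, np.pi)                          # fold: theta and theta + pi coincide

# Example: a pedestrian walking along a gentle arc at ~1.2 m/s, sampled at 30 Hz.
t = np.linspace(0.0, 5.0, 150)
track = np.stack([1.2 * t, 0.3 * np.sin(0.5 * t)], axis=1)
labels = velocity_direction_labels(track, dt=t[1] - t[0])
print(labels[:5])  # axial angles usable as training targets
```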
Figure 2. (a, c) Pedestrian trajectories (purple) superimposed on depth snapshots (gray). Orientation estimates and local velocities (directions of motion) are reported in red and yellow, respectively. We estimate shoulder orientation on a snapshot-by-snapshot basis, considering depth "imagelets" centered on a pedestrian. Sub-panel (b) reports an example of such an imagelet with the coordinate system considered. We employ the instantaneous direction of motion, extracted from preexisting trajectory data, as training labels for a neural network. This yields a reliable estimator for the orientation, accurate even in cases challenging for humans, as in (c). Due to clothing, arms and body posture, the presence of backpacks, or errors in depth reconstruction, the overhead pedestrian shape may differ substantially from an ellipse elongated in the direction of the shoulders.
Figure 1. We measure and investigate the dynamics of shoulder orientation for walking pedestrians in real-life scenarios. Our measurements are based on raw data acquired via grids of overhead depth sensors, such as the Microsoft Kinect™ [5]. In (a, b) we report, respectively, a front and an aerial view of a data acquisition setup (similar to that in Ref. [6]). The sensors, whose typical view cone is reported in (a), are represented in (b) as thick segments. In overhead depth images (c), the pixel value, here colorized in gray, represents the distance between each pixel and the camera plane: the brighter the shade, the farther from the sensor; the darker the pixel, the closer it is to the sensor. Heads therefore appear in a darker shade than the floor. Through localization and tracking algorithms from Refs. [7, 8], we extract imagelets centered on individual pedestrians (cf. the imagelets annotated with ground truth in Fig. 2), for which we estimate orientations via the method introduced here.
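A minimal crop routine in the spirit of Figure 1 might look as follows, assuming head positions are already provided by the tracking pipeline of Refs. [7, 8]; the crop size and background padding are our illustrative choices:

```python
import numpy as np

def extract_imagelet(depth_frame, center_xy, size=64):
    """Crop a size x size depth 'imagelet' centered on a tracked pedestrian.

    depth_frame: (H, W) array of camera-to-surface distances.
    center_xy:   (col, row) pixel position of the pedestrian's head.
    Pixels falling outside the frame are padded with the frame maximum,
    i.e. treated as background (far from the sensor).
    """
    h, w = depth_frame.shape
    half = size // 2
    col, row = int(center_xy[0]), int(center_xy[1])
    out = np.full((size, size), depth_frame.max(), dtype=depth_frame.dtype)
    r0, r1 = max(row - half, 0), min(row + half, h)
    c0, c1 = max(col - half, 0), min(col + half, w)
    out[r0 - (row - half):r0 - (row - half) + (r1 - r0),
        c0 - (col - half):c0 - (col - half) + (c1 - c0)] = depth_frame[r0:r1, c0:c1]
    return out
```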
List of symbols
| Symbol | Meaning |
|---|---|
|  | Overhead imagelet, centered on a pedestrian |
| f | Orientation estimator, modelled via a neural network |
| RP¹ | Real projective line, the set on which we consider pedestrian orientations |
|  | Average value of … |
|  | Network estimate of … |
|  | "Two-hot" discrete probability distribution encoding for … (see the sketch after this table) |
|  | Cross-entropy loss |
|  | Instantaneous velocity |
|  | Ground-truth shoulder-line orientation angle |
|  | Velocity direction angle |
|  | Orientation point estimate, as predicted by the network |
|  | Low-pass filtered continuous-time signals of the orientation and velocity direction angles, respectively |
|  | Symmetric, zero-centered residual relating … |
| O(2) | Orthogonal group, containing all rotations and mirrorings that can be applied to an imagelet |
|  | Orientation estimator strictly respecting the O(2) symmetry |
|  | Average prediction bias, quantifying the network's systematic error |
| ARMSE | Average root mean square error, quantifying the total network error |
|  | Ground-truth orientation annotation, available only for synthetic data |
|  | Reference annotation for real-life data, obtained by subsampling smoothed orientation signals |
|  | Average walking velocity |
| d(t) | Simulated delay |
|  | Simulation parameter relating the amplitudes of … |
|  | Simulation parameter: average delay of 80 ms between the orientation and velocity direction signals |
|  | Simulation parameter: noise intensity of the OU process |
|  | Simulation parameter: relaxation time of the OU process |
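The "two-hot" encoding listed above is not spelled out in this excerpt. A common realization, which we assume here, discretizes [0, π) into K bins and splits the unit probability mass between the two bin centers nearest to the angle; the bin count and function names below are ours:

```python
import numpy as np

def two_hot(angle, n_bins=36):
    """Encode an axial angle in [0, pi) as a discrete distribution.

    The unit of probability mass is split between the two nearest bin
    centers, proportionally to proximity, so the encoding varies
    smoothly with the angle and wraps around at pi (RP^1 topology).
    """
    pos = (angle / np.pi) * n_bins          # fractional bin position
    lo = int(np.floor(pos)) % n_bins        # nearest bin below
    hi = (lo + 1) % n_bins                  # nearest bin above (wraps around)
    frac = pos - np.floor(pos)
    p = np.zeros(n_bins)
    p[lo], p[hi] = 1.0 - frac, frac
    return p

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a two-hot target and a predicted distribution."""
    return -np.sum(target * np.log(predicted + eps))

print(two_hot(0.1).nonzero()[0])  # two adjacent active bins
```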
Figure 3. Examples of the synthetic imagelets that we employ to analyze the performance of our neural network. Contrary to the real-life data, ground-truth orientation is available for synthetic data, enabling accurate validation of the estimates. The neural network is trained against labeled target data with a predefined noise level, to imitate training with real-life imagelets and velocity target data. The target data with predefined noise level and the ground-truth orientation used for validation are superimposed on the imagelets as blue and red bars, respectively.
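One plausible way to generate such synthetic imagelets, following the ellipse remark in the Figure 2 caption, is to render a noisy ellipse elongated along a known shoulder line. The sketch below is our illustrative generator, not the paper's; all shape and noise parameters are assumptions:

```python
import numpy as np

def synthetic_imagelet(alpha, size=64, rng=None):
    """Render a noisy elliptical 'pedestrian' with shoulder-line angle alpha.

    alpha: ground-truth orientation in [0, pi), known by construction,
    which is exactly what real-life data lacks. Returns a (size, size)
    depth-like image: small values = close to the sensor (the body).
    """
    rng = rng or np.random.default_rng()
    ys, xs = np.mgrid[0:size, 0:size]
    x, y = xs - size / 2, ys - size / 2
    # Rotate coordinates so the ellipse's long axis follows alpha.
    u = x * np.cos(alpha) + y * np.sin(alpha)
    v = -x * np.sin(alpha) + y * np.cos(alpha)
    body = (u / 22.0) ** 2 + (v / 9.0) ** 2 <= 1.0   # elongated along alpha
    img = np.where(body, 0.3, 1.0)                    # body closer than the floor
    return img + rng.normal(0.0, 0.05, img.shape)     # depth-reconstruction noise

alpha_gt = np.pi / 3
img = synthetic_imagelet(alpha_gt)  # (imagelet, exact label) pair for validation
```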
Fig. 4. (a–c) Velocity direction and shoulder orientation signals for three trajectories collected in real life (depth-map sequences similar to Fig. 2 are on the right of each panel). We report the instantaneous values of the velocity direction (obtained from tracking) and of the predicted orientation, together with the continuous orientation signal (a low-pass filter of the latter). Orientation has been computed via our CNN trained on 30 h of real-life velocity data. (a) Reports a typical pedestrian behavior, where the two signals oscillate "in sync" following the stepping frequency. Our tool also correctly resolves rare side-stepping events and the orientation of standing individuals, in which the signals are out of sync. (b) Shows a pedestrian rotating their body, possibly to observe their surroundings, while maintaining the walking direction. (c) Shows an individual initially standing, then performing a body rotation, and finally walking away. In this case, the velocity direction is undefined while the pedestrian stands, as there is no position variation (hence the high noise intensity). Note that the depth-map sequences in (c) are in space-time coordinates, in a photo-finish-like fashion: in this reference, a standing pedestrian traces a horizontal line. (d–f) Prediction performance of the network f (Eq. 1) in the case of artificial imagelets (d) and real-life data (e). We train with datasets of increasing size (N, x-axis). We report the root mean square error of the predictions, averaged over independent trainings of the network [ARMSE, Eq. (11)], and, in the inset, the average bias [Eq. (10)]. The test sets used to compute the indicators include, for (d), 25k unseen synthetic images with error-free annotation and, for (e), 25k unseen real-life imagelets, annotated via low-pass filtered high-resolution orientation estimates obtained with our neural network trained on the largest dataset with O(2)-group averaging. In both cases, the bias decreases rapidly as N grows, and the ARMSE of the networks trained with the largest datasets approaches its asymptotic value. For both ARMSE and bias, we report the fitted exponents characterizing the error convergence. We complement the evaluation of the ARMSE by considering noisy labels. In case (d) the ARMSE saturates consistently with the level of noise in the labels (cf. SI). In case (e) the ARMSE approaches a saturation point; this reflects the random disagreement between velocity and orientation. (f) Performance can be further increased by enforcing the O(2) symmetry of the orientation estimator [Eq. (8)]. In (f) we consider maps built from the networks trained with the largest training datasets from (d, e), both for the synthetic and the real-life case, vs. the number of samples used for the group average, k. We consider both uniform and random sampling of O(2) (superscripts U and R, respectively). The group average further reduces the ARMSE in the case of synthetic imagelets (with no observable difference between uniform and random sampling) and in the case of real-life imagelets, with higher performance for random sampling at large k.
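The group-averaged map of Eq. (8) is not reproduced in this excerpt. The sketch below illustrates the general idea under our own conventions: evaluate the estimator on rotated and mirrored copies of the imagelet, map each prediction back through the inverse group element, and average the axial angles via angle doubling; O(2) can be sampled uniformly or randomly, as in panel (f):

```python
import numpy as np
from scipy.ndimage import rotate

def o2_average(f, imagelet, k=8, random_sampling=False, rng=None):
    """Symmetrize an orientation estimator f over a k-sample of O(2).

    f maps a (H, W) imagelet to an axial angle in [0, pi). For each
    sampled rotation phi, and for its mirrored copy, we predict on the
    transformed imagelet, transform the prediction back, and average on
    the doubled circle (angles are only defined modulo pi). Sign
    conventions for the back-transform depend on the image coordinate
    handedness and are an assumption of this sketch.
    """
    rng = rng or np.random.default_rng()
    phis = (rng.uniform(0.0, np.pi, k) if random_sampling
            else np.linspace(0.0, np.pi, k, endpoint=False))
    z = 0.0 + 0.0j
    for phi in phis:
        for mirror in (False, True):
            img = imagelet[:, ::-1] if mirror else imagelet
            img = rotate(img, np.degrees(phi), reshape=False, mode="nearest")
            a = f(img) - phi                 # undo the rotation
            if mirror:
                a = -a                       # undo the mirroring
            z += np.exp(2j * a)              # angle doubling: pi-periodic mean
    return np.mod(np.angle(z) / 2.0, np.pi)
```

With a perfectly O(2)-equivariant estimator, the average would change nothing; in practice, it cancels residual asymmetries of the trained network, consistent with the ARMSE reduction reported in (f).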
Fig. 5. Probability distribution function of the delay time between the shoulder orientation and the velocity direction signals, for different average velocities. As the average velocity grows, both the average delay and the delay fluctuations decrease. The inset reports the ratio between the standard deviation and the average of the delay as a function of the average velocity. The measurements considered are 78k trajectories (all not exceeding a maximum orientation threshold), acquired during the GLOW event [6].
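The per-trajectory delay of Fig. 5 can be estimated, for instance, as the lag that best aligns the two smoothed signals. The scan over integer lags below is our illustrative choice, not necessarily the paper's exact procedure:

```python
import numpy as np

def estimate_delay(theta, alpha, dt, max_lag_s=0.5):
    """Lag (in seconds) by which the velocity direction theta trails
    the shoulder orientation alpha, both sampled every dt seconds.

    Scans integer lags and returns the one minimizing the mean squared
    angular residual on the doubled circle (angles modulo pi).
    """
    max_lag = int(max_lag_s / dt)
    best_lag, best_cost = 0, np.inf
    for lag in range(0, max_lag + 1):
        a = alpha[: len(alpha) - lag]
        th = theta[lag:]
        diff = np.angle(np.exp(2j * (th - a))) / 2.0  # wrapped residual
        cost = np.mean(diff ** 2)
        if cost < best_cost:
            best_lag, best_cost = lag, cost
    return best_lag * dt
```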
Fig. 6. Comparison between simulations (red dots) and real-life measurements (blue solid line). We build velocity direction signals on top of delayed orientation measurements, where the delay d(t) is modelled by an OU random process [cf. Eqs. (12), (13)]. Measurements were acquired during the GLOW event (36k trajectories, restricted to people keeping a normal average velocity). In (a) we report the probability distribution function (PDF) of the difference between the velocity direction and the orientation shifted in time by the average delay. In (b), analogously to Fig. 5, we report the PDF of the delay time between the two signals. The insets in (a, b) report the data on a semi-logarithmic scale. For both quantities, we observe excellent agreement between simulations and measurements (a, b). Panel (c) shows the grand-average power spectral density (PSD) of the measured and simulated velocity direction signals. Our model modifies the PSD only at high frequencies; as an effect, the most energetic components of the velocity direction remain, respectively, slightly under- and slightly over-represented.
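Eqs. (12) and (13) are not included in this excerpt. As a generic stand-in, the sketch below integrates an OU delay with the 80 ms mean quoted in the symbol list via Euler-Maruyama, and uses it to delay an orientation signal into a synthetic velocity direction signal; the relaxation time and noise intensity are placeholder values:

```python
import numpy as np

def ou_delay(n, dt, d_mean=0.08, tau=0.5, sigma=0.05, rng=None):
    """Euler-Maruyama sample path of an Ornstein-Uhlenbeck delay d(t).

    Mean-reverting around d_mean (the 80 ms average delay quoted in the
    paper) with relaxation time tau and noise intensity sigma; tau and
    sigma here are placeholders, not the fitted values.
    """
    rng = rng or np.random.default_rng()
    d = np.empty(n)
    d[0] = d_mean
    for i in range(1, n):
        d[i] = d[i - 1] + (d_mean - d[i - 1]) * dt / tau \
               + sigma * np.sqrt(dt) * rng.standard_normal()
    return d

def delayed_velocity_direction(alpha, t, rng=None):
    """Model theta(t) = alpha(t - d(t)) with an OU-fluctuating delay d(t).

    Linear interpolation of the angle signal ignores wrap-around; fine
    for a smooth, slowly varying orientation signal as in this sketch.
    """
    dt = t[1] - t[0]
    d = ou_delay(len(t), dt, rng=rng)
    return np.interp(t - d, t, alpha)  # evaluate alpha at the delayed times
```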