Joris Willems, Alessandro Corbetta, Vlado Menkovski, Federico Toschi.
Abstract
We investigate, in real-life conditions and with very high accuracy, the dynamics of body rotation, or yawing, of walking pedestrians, a highly complex task due to the wide variety of shapes, postures and walking gestures. We propose a novel measurement method based on a deep neural architecture that we train on the basis of generic physical properties of pedestrian motion. Specifically, we leverage the strong statistical correlation between individual velocity and body orientation: the velocity direction is typically orthogonal to the shoulder line. We make the reasonable assumption that this approximation, although instantaneously slightly imperfect, is correct on average. This enables us to use velocity data as training labels for a highly accurate point-estimator of individual orientation, which we can train with no dedicated annotation labor. We discuss the measurement accuracy and show the error scaling on both synthetic and real-life data: our method is capable of estimating orientation with an error as low as [Formula: see text]. This tool opens up new possibilities in studies of human crowd dynamics where orientation is key. By analyzing the dynamics of body rotation in real-life conditions, we show that the instantaneous velocity direction can be described by the combination of orientation and a random delay, where randomness is provided by an Ornstein-Uhlenbeck process centered on an average delay of 80 ms. Quantifying these dynamics could have only been possible thanks to a tool as precise as the one proposed.
Year: 2020 PMID: 32669652 PMCID: PMC7363920 DOI: 10.1038/s41598-020-68287-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
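The abstract's central trick, using the velocity direction as a free training label, can be sketched in a few lines. Below is a minimal illustration, assuming trajectories sampled at a fixed rate; the function name and parameter choices are ours, not the paper's:

```python
import numpy as np

def velocity_direction_labels(positions, dt):
    """Turn a pedestrian trajectory into axial orientation labels.

    positions: (T, 2) array of planar positions sampled every dt seconds.
    Returns angles in [0, pi): the velocity direction folded onto the
    real projective line, since a shoulder line has no front/back.
    """
    velocity = np.gradient(positions, dt, axis=0)        # finite-difference velocity
    theta = np.arctan2(velocity[:, 1], velocity[:, 0])   # direction of motion in (-pi, pi]
    return np.mod(theta, np.pi)                          # fold: theta and theta + pi coincide

# Example: a pedestrian walking along a gentle arc at ~1.2 m/s, sampled at 30 Hz.
t = np.linspace(0.0, 5.0, 150)
track = np.stack([1.2 * t, 0.3 * np.sin(0.5 * t)], axis=1)
labels = velocity_direction_labels(track, dt=t[1] - t[0])
print(labels[:5])  # axial angles usable as training targets
```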
Figure 2. (a, c) Pedestrian trajectories (purple) superimposed on depth snapshots (gray). Orientation estimates and local velocities (directions of motion) are reported in red and yellow, respectively. We estimate shoulder orientation on a snapshot-by-snapshot basis, considering depth "imagelets" centered on a pedestrian. Sub-panel (b) reports an example of such an imagelet with the coordinate system considered. We employ the instantaneous direction of motion, extracted from preexisting trajectory data, as training labels for a neural network. This yields a reliable estimator for the orientation, accurate even in cases challenging for humans, as in (c). Due to clothing, arms and body posture, the presence of backpacks, or errors in depth reconstruction, the overhead pedestrian shape may differ substantially from an ellipse elongated in the direction of the shoulders.
Figure 1. We measure and investigate the dynamics of shoulder orientation for walking pedestrians in real-life scenarios. Our measurements are based on raw data acquired via grids of overhead depth sensors, such as the Microsoft Kinect™ [5]. In (a, b) we report, respectively, a front and an aerial view of a data acquisition setup (similar to that in Ref. [6]). The sensors, whose typical view cone is reported in (a), are represented in (b) as thick segments. In overhead depth images (c), the pixel value, here colorized in gray, represents the distance between each pixel and the camera plane: the brighter the shade, the farther from the sensor; the darker the pixel, the closer it is to the sensor. Heads therefore appear in a darker shade than the floor. Through localization and tracking algorithms from Refs. [7, 8], we extract imagelets centered on individual pedestrians (cf. the imagelets annotated with ground truth in Fig. 2), for which we estimate orientations via the method introduced here.
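A minimal crop routine in the spirit of Figure 1 might look as follows, assuming head positions are already provided by the tracking pipeline of Refs. [7, 8]; the crop size and background padding are our illustrative choices:

```python
import numpy as np

def extract_imagelet(depth_frame, center_xy, size=64):
    """Crop a size x size depth 'imagelet' centered on a tracked pedestrian.

    depth_frame: (H, W) array of camera-to-surface distances.
    center_xy:   (col, row) pixel position of the pedestrian's head.
    Pixels falling outside the frame are padded with the frame maximum,
    i.e. treated as background (far from the sensor).
    """
    h, w = depth_frame.shape
    half = size // 2
    col, row = int(center_xy[0]), int(center_xy[1])
    out = np.full((size, size), depth_frame.max(), dtype=depth_frame.dtype)
    r0, r1 = max(row - half, 0), min(row + half, h)
    c0, c1 = max(col - half, 0), min(col + half, w)
    out[r0 - (row - half):r0 - (row - half) + (r1 - r0),
        c0 - (col - half):c0 - (col - half) + (c1 - c0)] = depth_frame[r0:r1, c0:c1]
    return out
```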
List of symbols
| Symbol | Meaning |
|---|---|
|  | Overhead imagelet, centered on a pedestrian |
| f | Orientation estimator, modelled via a neural network |
| RP¹ | Real projective line, the set on which we consider pedestrian orientations |
|  | Average value of … |
|  | Network estimate of … |
|  | "Two-hot" discrete probability distribution encoding for … (see the sketch after this table) |
|  | Cross-entropy loss |
|  | Instantaneous velocity |
|  | Ground-truth shoulder-line orientation angle |
|  | Velocity direction angle |
|  | Orientation point estimate, as predicted by the network |
|  | Low-pass filtered continuous-time signals of the orientation and velocity direction angles, respectively |
|  | Symmetric, zero-centered residual relating … |
| O(2) | Orthogonal group, containing all rotations and mirrorings that can be applied to an imagelet |
|  | Orientation estimator strictly respecting the O(2) symmetry |
|  | Average prediction bias, quantifying the network's systematic error |
| ARMSE | Average root mean square error, quantifying the total network error |
|  | Ground-truth orientation annotation, available only for synthetic data |
|  | Reference annotation for real-life data, obtained by subsampling smoothed orientation signals |
|  | Average walking velocity |
| d(t) | Simulated delay |
|  | Simulation parameter relating the amplitudes of … |
|  | Simulation parameter: average delay of 80 ms between the orientation and velocity direction signals |
|  | Simulation parameter: noise intensity of the OU process |
|  | Simulation parameter: relaxation time of the OU process |
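The "two-hot" encoding listed above is not spelled out in this excerpt. A common realization, which we assume here, discretizes [0, π) into K bins and splits the unit probability mass between the two bin centers nearest to the angle; the bin count and function names below are ours:

```python
import numpy as np

def two_hot(angle, n_bins=36):
    """Encode an axial angle in [0, pi) as a discrete distribution.

    The unit of probability mass is split between the two nearest bin
    centers, proportionally to proximity, so the encoding varies
    smoothly with the angle and wraps around at pi (RP^1 topology).
    """
    pos = (angle / np.pi) * n_bins          # fractional bin position
    lo = int(np.floor(pos)) % n_bins        # nearest bin below
    hi = (lo + 1) % n_bins                  # nearest bin above (wraps around)
    frac = pos - np.floor(pos)
    p = np.zeros(n_bins)
    p[lo], p[hi] = 1.0 - frac, frac
    return p

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a two-hot target and a predicted distribution."""
    return -np.sum(target * np.log(predicted + eps))

print(two_hot(0.1).nonzero()[0])  # two adjacent active bins
```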
Figure 3. Examples of the synthetic imagelets that we employ to analyze the performance of our neural network. Contrary to the real-life data, ground-truth orientation is available for synthetic data, enabling accurate validation of the estimates. The neural network is trained against labeled target data with a predefined noise level, to imitate training with real-life imagelets and velocity target data. The target data with predefined noise level and the ground-truth orientation used for validation are superimposed on the imagelets as blue and red bars, respectively.
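One plausible way to generate such synthetic imagelets, following the ellipse remark in the Figure 2 caption, is to render a noisy ellipse elongated along a known shoulder line. The sketch below is our illustrative generator, not the paper's; all shape and noise parameters are assumptions:

```python
import numpy as np

def synthetic_imagelet(alpha, size=64, rng=None):
    """Render a noisy elliptical 'pedestrian' with shoulder-line angle alpha.

    alpha: ground-truth orientation in [0, pi), known by construction,
    which is exactly what real-life data lacks. Returns a (size, size)
    depth-like image: small values = close to the sensor (the body).
    """
    rng = rng or np.random.default_rng()
    ys, xs = np.mgrid[0:size, 0:size]
    x, y = xs - size / 2, ys - size / 2
    # Rotate coordinates so the ellipse's long axis follows alpha.
    u = x * np.cos(alpha) + y * np.sin(alpha)
    v = -x * np.sin(alpha) + y * np.cos(alpha)
    body = (u / 22.0) ** 2 + (v / 9.0) ** 2 <= 1.0   # elongated along alpha
    img = np.where(body, 0.3, 1.0)                    # body closer than the floor
    return img + rng.normal(0.0, 0.05, img.shape)     # depth-reconstruction noise

alpha_gt = np.pi / 3
img = synthetic_imagelet(alpha_gt)  # (imagelet, exact label) pair for validation
```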
Fig. 4. (a–c) Velocity direction and shoulder orientation signals for three trajectories collected in real life (depth-map sequences similar to Fig. 2 are on the right of each panel). We report the instantaneous values of the velocity direction (obtained from tracking) and of the predicted orientation, together with the continuous orientation signal (a low-pass filter of the latter). Orientation has been computed via our CNN trained on 30 h of real-life velocity data. (a) Reports a typical pedestrian behavior, where the two signals oscillate "in sync" following the stepping frequency. Our tool also correctly resolves rare side-stepping events and the orientation of standing individuals, in which the signals are out of sync. (b) Shows a pedestrian rotating their body, possibly to observe their surroundings, while maintaining the walking direction. (c) Shows an individual initially standing, then performing a body rotation, and finally walking away. In this case, the velocity direction is undefined while the pedestrian stands, as there is no position variation (hence the high noise intensity). Note that the depth-map sequences in (c) are in space-time coordinates, in a photo-finish-like fashion: in this reference, a standing pedestrian traces a horizontal line. (d–f) Prediction performance of the network f (Eq. 1) in the case of artificial imagelets (d) and real-life data (e). We train with datasets of increasing size (N, x-axis). We report the root mean square error of the predictions, averaged over independent trainings of the network [ARMSE, Eq. (11)], and, in the inset, the average bias [Eq. (10)]. The test sets used to compute the indicators include, for (d), 25k unseen synthetic images with error-free annotation and, for (e), 25k unseen real-life imagelets, annotated via low-pass filtered high-resolution orientation estimates obtained with our neural network trained on the largest dataset with O(2)-group averaging. In both cases, the bias decreases rapidly as N grows, and the ARMSE of the networks trained with the largest datasets approaches its asymptotic value. For both ARMSE and bias, we report the fitted exponents characterizing the error convergence. We complement the evaluation of the ARMSE by considering noisy labels. In case (d) the ARMSE saturates consistently with the level of noise in the labels (cf. SI). In case (e) the ARMSE approaches a saturation point; this reflects the random disagreement between velocity and orientation. (f) Performance can be further increased by enforcing the O(2) symmetry of the orientation estimator [Eq. (8)]. In (f) we consider maps built from the networks trained with the largest training datasets from (d, e), both for the synthetic and the real-life case, vs. the number of samples used for the group average, k. We consider both uniform and random sampling of O(2) (superscripts U and R, respectively). The group average further reduces the ARMSE in the case of synthetic imagelets (with no observable difference between uniform and random sampling) and in the case of real-life imagelets, with higher performance for random sampling at large k.
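The group-averaged map of Eq. (8) is not reproduced in this excerpt. The sketch below illustrates the general idea under our own conventions: evaluate the estimator on rotated and mirrored copies of the imagelet, map each prediction back through the inverse group element, and average the axial angles via angle doubling; O(2) can be sampled uniformly or randomly, as in panel (f):

```python
import numpy as np
from scipy.ndimage import rotate

def o2_average(f, imagelet, k=8, random_sampling=False, rng=None):
    """Symmetrize an orientation estimator f over a k-sample of O(2).

    f maps a (H, W) imagelet to an axial angle in [0, pi). For each
    sampled rotation phi, and for its mirrored copy, we predict on the
    transformed imagelet, transform the prediction back, and average on
    the doubled circle (angles are only defined modulo pi). Sign
    conventions for the back-transform depend on the image coordinate
    handedness and are an assumption of this sketch.
    """
    rng = rng or np.random.default_rng()
    phis = (rng.uniform(0.0, np.pi, k) if random_sampling
            else np.linspace(0.0, np.pi, k, endpoint=False))
    z = 0.0 + 0.0j
    for phi in phis:
        for mirror in (False, True):
            img = imagelet[:, ::-1] if mirror else imagelet
            img = rotate(img, np.degrees(phi), reshape=False, mode="nearest")
            a = f(img) - phi                 # undo the rotation
            if mirror:
                a = -a                       # undo the mirroring
            z += np.exp(2j * a)              # angle doubling: pi-periodic mean
    return np.mod(np.angle(z) / 2.0, np.pi)
```

With a perfectly O(2)-equivariant estimator, the average would change nothing; in practice, it cancels residual asymmetries of the trained network, consistent with the ARMSE reduction reported in (f).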
Fig. 5. Probability distribution function of the delay time between the shoulder orientation and the velocity direction signals, for different average velocities. As the average velocity grows, both the average delay and the delay fluctuations decrease. The inset reports the ratio between the standard deviation and the average of the delay as a function of the average velocity. The measurements considered are 78k trajectories (all not exceeding a maximum orientation threshold), acquired during the GLOW event [6].
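The per-trajectory delay of Fig. 5 can be estimated, for instance, as the lag that best aligns the two smoothed signals. The scan over integer lags below is our illustrative choice, not necessarily the paper's exact procedure:

```python
import numpy as np

def estimate_delay(theta, alpha, dt, max_lag_s=0.5):
    """Lag (in seconds) by which the velocity direction theta trails
    the shoulder orientation alpha, both sampled every dt seconds.

    Scans integer lags and returns the one minimizing the mean squared
    angular residual on the doubled circle (angles modulo pi).
    """
    max_lag = int(max_lag_s / dt)
    best_lag, best_cost = 0, np.inf
    for lag in range(0, max_lag + 1):
        a = alpha[: len(alpha) - lag]
        th = theta[lag:]
        diff = np.angle(np.exp(2j * (th - a))) / 2.0  # wrapped residual
        cost = np.mean(diff ** 2)
        if cost < best_cost:
            best_lag, best_cost = lag, cost
    return best_lag * dt
```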
Fig. 6. Comparison between simulations (red dots) and real-life measurements (blue solid line). We build velocity direction signals on top of delayed orientation measurements, where the delay d(t) is modelled by an OU random process [cf. Eqs. (12), (13)]. Measurements were acquired during the GLOW event (36k trajectories, restricted to people keeping a normal average velocity). In (a) we report the probability distribution function (PDF) of the difference between the velocity direction and the orientation shifted in time by the average delay. In (b), analogously to Fig. 5, we report the PDF of the delay time between the two signals. The insets in (a, b) report the data on a semi-logarithmic scale. For both quantities, we observe excellent agreement between simulations and measurements (a, b). Panel (c) shows the grand-average power spectral density (PSD) of the measured and simulated velocity direction signals. Our model modifies the PSD only at high frequencies; as an effect, the most energetic components of the velocity direction remain, respectively, slightly under- and slightly over-represented.
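Eqs. (12) and (13) are not included in this excerpt. As a generic stand-in, the sketch below integrates an OU delay with the 80 ms mean quoted in the symbol list via Euler-Maruyama, and uses it to delay an orientation signal into a synthetic velocity direction signal; the relaxation time and noise intensity are placeholder values:

```python
import numpy as np

def ou_delay(n, dt, d_mean=0.08, tau=0.5, sigma=0.05, rng=None):
    """Euler-Maruyama sample path of an Ornstein-Uhlenbeck delay d(t).

    Mean-reverting around d_mean (the 80 ms average delay quoted in the
    paper) with relaxation time tau and noise intensity sigma; tau and
    sigma here are placeholders, not the fitted values.
    """
    rng = rng or np.random.default_rng()
    d = np.empty(n)
    d[0] = d_mean
    for i in range(1, n):
        d[i] = d[i - 1] + (d_mean - d[i - 1]) * dt / tau \
               + sigma * np.sqrt(dt) * rng.standard_normal()
    return d

def delayed_velocity_direction(alpha, t, rng=None):
    """Model theta(t) = alpha(t - d(t)) with an OU-fluctuating delay d(t).

    Linear interpolation of the angle signal ignores wrap-around; fine
    for a smooth, slowly varying orientation signal as in this sketch.
    """
    dt = t[1] - t[0]
    d = ou_delay(len(t), dt, rng=rng)
    return np.interp(t - d, t, alpha)  # evaluate alpha at the delayed times
```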