This paper presents a flexible framework for building a target-specific, part-based representation of arbitrary articulated or rigid objects. The aim is to successfully track the target object in 2D, through multiple scales and occlusions. This is realized by a hierarchical, iterative optimization process on the proposed representation of structure and appearance. Each rigid part of an object is described by a hierarchical spring system represented by an attributed graph pyramid. Hierarchical spring systems encode the spatial relationships of the features (attributes of the graph pyramid) describing the parts and enforce them by spring-like behavior during tracking. Articulation points connecting the parts of the object allow position information to be transferred from reliable to ambiguous parts. Tracking is done in an iterative process by combining the hypotheses of simple trackers with the hypotheses extracted from the hierarchical spring systems.
The task of monocular tracking of articulated objects is a
challenging one. Complex articulations can significantly change the appearance
of the object and distant parts can perform very different motions. These
aspects affect popular trackers [1] that consider the appearance of simple shapes (e.g. rectangles; certain poses might not be very compact and thus cover only a small portion of the bounding box) and trackers that assume a simple global motion model for the whole part.

The most promising approaches to articulated tracking are quite
complex and depend to a large extent on strong motion and subject specific
priors. While they do deliver excellent results for the object class they have
been designed for (e.g. humans), most of them do not generalize very well and
would need extensive adaptation to work for other object classes. Recent
examples of such well performing specialized methods are Lee and Elgammal
[2], who introduce a model that
ties together the human body configuration manifold and visual manifold in one
representation, which is then used for tracking within a Bayesian framework, and
Brubaker et al. [3] who present a
physics-based model with a bio-mechanical characterization of lower-body
dynamics, where tracking is accomplished with a form of sequential Monte Carlo
inference.

In contrast, the presented approach requires only basic
information on the structure of the target object and no motion prior, which
makes it less object-class specific and more general. Objects are represented as
features in arbitrary configurations. Tracking a whole object builds on simple,
single hypothesis feature trackers, and deals with partial occlusion, scaling,
and limited non-rigid deformation. The output consists of the 2D positions and
bounding boxes of the object parts in every frame of the video.

At the heart of the method is a representation which describes
the appearance and kinematics of articulated objects. It consists of multiple
object parts modeled by rectangular regions of interest and features extracted
out of these regions. Kinematics are realized by connecting object parts through
articulation points, which limit the movement of each part to a circle (see
Fig. 3).
Fig. 3. Left: distance constraints imposed by articulation points. Right: articulation point a in the local coordinate system defined by an ordered pair of points p1, p2.
Multiple feature trackers, called
sub-trackers, are used for each part: one attempting
to track the whole part and the rest considering small fixed-size windows
centered around detected interest points (see Fig.
1).
Fig. 1. Example representation for a part. (a) Feature for the top level sub-tracker. (b) Features for the bottom level sub-trackers. The white edges are the edges of G0. (c) Corresponding graph pyramid P = {G0, G1} (not all bottom level vertices and edges are shown).
To deal with occlusion and to avoid
drifting of the sub-trackers we model the parts as a
graph hierarchy with two levels: one top-level vertex for the sub-tracker
tracking the whole part and multiple bottom-level vertices for the
interest-point sub-trackers. The edges of the graph are weighted with the
pairwise distances between the features, and act like springs pushing and
pulling the vertices to reduce the deformation of the graph-structure of the
parts, thus giving the name hierarchical spring system
(HSS).

The final position of each feature (top and bottom level) is
obtained through a mediation between the corresponding tracker, pulling towards
what it considers to be the target region, and the HSS trying to enforce the
initial structure (reduce deformation). The weight of each of these two factors
is dynamically adjusted depending on the similarity of the region at the current
position with the known appearance of the part. Thus, during occlusion (by a
different looking object) the HSS has more weight allowing for badly tracked
features to be placed at known relative positions, while at times of successful
tracking the very confident sub-trackers are given more weight, allowing for a
certain amount of non-rigid deformation. A global scaling factor is maintained and used to adjust the “relaxed” (no deformation) lengths of the springs, making it possible to cope with global changes in scale.

Articulated objects are modeled as
multiple HSS corresponding to each part connected by vertices representing the
articulation points. Articulation points have no corresponding sub-trackers and
move solely under the “forces” of the adjacent parts. Thus
movement of one adjacent part is transmitted to the other enforcing articulated
motion.

All computation (position of sub-trackers, scaling, and
articulation) is done using local confidence measures to balance between
trusting the sub-trackers i.e. the visual feedback, and the object structure,
i.e. the prior knowledge.
Related work
First introduced by Fischler et al. in 1973 [4], pictorial structures represent an
object by its parts (e.g. head, torso, arms, legs) arranged in a deformable
spatial configuration. This deformable configuration is represented by
spring-like connections between pairs of parts. Object recognition or
tracking can be done by minimizing the energy in this deformable
configuration to find the most likely configuration of the object parts in
an image. Felzenszwalb et al. employed this idea in [5] to do part-based object recognition for
faces and articulated objects (humans). Their approach is a statistical
framework minimizing the energy of the spring system learned from training
examples using maximum likelihood estimation. Ramanan et al. apply in
[6] the ideas from
[5] in tracking
people.

Besides Computer Vision, the proposed representation is also
related to representations used in Computer Graphics called
mass–spring systems
[7]. Mass–spring systems
are a physically based technique that is used to effectively model
deformable objects for animations in Computer Graphics (e.g. a flag moving
in the wind). An object is modeled by a collection of point masses connected
by springs in a lattice structure.

Different from the mentioned approaches, we stress solutions that emerge from the underlying structure, as opposed to using structure to verify sampled hypotheses. The proposed representation not only connects
parts in a deformable way like in [5], but introduces a bottom level consisting of
“small” region descriptors described by a spring system. In
comparison to pictorial structures the presented approach does not need
training, because the spring-like behavior is modeled via a combination of
structural and appearance offsets (provided by the sub-trackers).

Even though the bottom level of the proposed hierarchical spring system is similar to a mass–spring system [7], there are significant differences. The presented spring system is used to supply structural feedback for tracking algorithms, which is an entirely different purpose, and it does not consider any external forces (e.g. gravity). In the proposed approach a vertex does not have a mass; instead, the force of the spring is calculated from its confidence in the current frame.
Contributions
Our main contribution is the flexible framework for
representing and tracking articulated objects of arbitrary complexity with
each (rigid) part of an object represented by a hierarchical spring system
(HSS), connected to other parts by articulation points. Articulation points
are used to transfer information between the HSS of the adjacent object
parts. All decisions balance between “seeing” and
“knowing” using maintained confidence measures. We pose tracking as a hierarchical optimization process on structure and appearance.

A preliminary version of our approach has been presented in
[8]. Possible applications are
action recognition, human computer interfaces, motion based diagnosis and
identification, etc.
Overview
This paper is organized as follows: Section 2 describes how to represent the
appearance and structure of a rigid object in a HSS. It is explained how our
approach combines the hypotheses of the sub-trackers and the HSS. In
Section 3 the introduced
concepts of Section 2 are used to
model articulated objects consisting of several rigid object parts.
Additionally, articulation points and the information transfer between the
object parts are described. Section
4 presents the tracking algorithm with the help of
pseudo-code. In Section 5
experiments qualitatively and quantitatively analyze the results of the
presented approach. Section 6
gives a conclusion, and the Appendix introduces the employed region
descriptor (Sigma Sets).
Representation and tracking of a rigid object
Background clutter, similar objects in the scene and occlusions
are the main reasons for tracking failure, because they can be good matches to
the model of the target object and thus distract the tracker.

If the appearance of an object is uniform (no texture, mainly
one color), it is advisable to describe and track it by one feature (e.g. region
descriptor). Tracking whole rigid objects or parts can deliver robust positions
even during motion blur due to the large image region considered. Nevertheless,
in cases of partial occlusion or scaling such a description is not able to aid
the tracker in overcoming the difficult distractions by providing useful
information.

On the other hand, if the target object is textured (e.g. face
of a human), it is possible to extract several discriminative features out of
the region covering the object and track them successfully when there are no
distractions. By additionally encoding the spatial relationships of the features
in the representation of the object, it is possible to deal with occlusions and
estimate scaling. Unfortunately, these “small” features are more
sensitive to noise and fast motion of the object (big distances between frames,
motion blur).

As we cannot generally decide which representation is more
suitable for an object and to get the best of both worlds, we describe and track
objects using multiple features and sub-trackers, where the spatial
relationships of the features are described and enforced by a
hierarchical spring system (HSS).
The sub-tracker
The purpose of each sub-tracker is to track a fixed-size
region independently of the other sub-trackers, based solely on the content
of the image. At any frame, given as input an initial estimate of the
position of a tracked region and a description of its appearance, the
corresponding sub-tracker will return an offset to what it considers to be
the correct position of the target region.
The hierarchical spring system (HSS)
We represent the HSS of an object as a graph pyramid with two levels P = {G0, G1}, where the top level G1(V1, E1) contains one single vertex, V1 = {v1}, and the bottom level graph G0(V0, E0) multiple vertices connected by edges. There is a one-to-one mapping between the vertices in the graph pyramid and the features with their corresponding sub-trackers. Edges are weighted with the known distance in the image plane between the features corresponding to the incident vertices. The vertex in the top level is connected with all vertices in the base level to allow communication between the two levels. Fig. 1 shows an example representation for an object and the corresponding regions for the sub-trackers. (Options for inserting the edges are discussed in Section 5.3.1.)
Tracking with sub-trackers and HSS
For each frame the first hypotheses of the sub-trackers are
refined using an iterative alternation and combination of the offsets from
the sub-trackers and the offsets from the HSS.
Energies in the HSS
The HSS encodes the spatial relationships of the
features of the object considering their spatial distances and
arrangement. Its task is to keep the structure of the features as
similar as possible to the initial state in the first frame. This is
realized by providing the tracker with structural
offsets (see Section
2.3.4).

To calculate a structural offset for a feature it is necessary to determine the extent of the spatial deformation in the HSS. The extent of the deformation in a vertex v at time i = 1…n is represented and calculated by the energy in v:

E_i(v) = Σ_{e ∈ Ē_{i−1}(v)} c_i(v_e) · (d_{i−1}(e) − x · d_1(e))²,   (1)

where Ē_{i−1}(v) are all edges e of the levels E0 and E1 at time i−1 incident to vertex v. c_i(v_e) is the confidence (see Section 2.3.2) of the neighboring vertex v_e at time i connected by e, which weights the influence of v_e on E_i(v). The motivation behind the weighting with c_i(v_e) is that occluded neighboring vertices should have a lower impact on E_i(v) than reliably tracked neighbors. d_{i−1}(e) and d_1(e) denote the deformed and initial edge lengths between v and v_e, and x is the current scaling factor of the object (for rigid objects x = x⁎(p), for articulated objects with several parts x = x⁎(O)). x is used to apply a global scaling to the initial edge lengths to be able to track an object changing its distance to the camera (see Section 2.3.3).
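As a concrete illustration, the energy computation described above can be sketched in a few lines of Python. The quadratic penalty and the data layout are our assumptions for illustration; the paper's exact functional form may differ:

```python
import math

def vertex_energy(pos, neighbors, scale):
    """Spring-like energy of a vertex: deviations of the current edge
    lengths from the scaled initial lengths, weighted by the confidence
    of the neighboring vertex (sketch; quadratic penalty assumed).

    neighbors -- iterable of (neighbor_pos, confidence, initial_length)
    scale     -- current global scaling factor x of the object
    """
    energy = 0.0
    for npos, conf, d1 in neighbors:
        d = math.dist(pos, npos)               # deformed edge length
        energy += conf * (d - scale * d1) ** 2
    return energy

# An undeformed edge contributes nothing; a stretched one is penalized:
print(vertex_energy((0, 0), [((3, 4), 0.8, 5.0)], 1.0))  # 0.0
print(vertex_energy((0, 0), [((6, 8), 0.8, 5.0)], 1.0))  # 20.0
```

Weighting each edge by the neighbor's confidence means an occluded neighbor contributes little to the measured deformation, as intended by the design.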
The confidence of a vertex
The confidence is used to dynamically weight the influences of vertices in different calculations and situations, e.g. the calculation of E_i(v) (see Section 2.3.1).

The confidence of a vertex v at time i depends on its degree I(v) (number of incident edges), its energy and the dissimilarity D(v) between its feature S(v) at time i−1 and its descriptor S_1(v) in the initial iteration:

c_i(v) = (Î(v) + Ê_{i−1}(v) + D̂(v)) / 3,   (2)

where Î(v), Ê_{i−1}(v) and D̂(v) are normalized so that c_i(v) ∈ [0, 1].

Î(v) = |Ē(v)| / |Ē|,   (3)

where Ē(v) are the edges incident to vertex v and Ē are all edges in the HSS.

Ê_{i−1}(v) = 1 − E_{i−1}(v) / E_max,   (4)

where E_{i−1}(v) is the energy in vertex v in iteration i−1 (see Eq. (1)), σ_E is the standard deviation of the energies in the local neighborhood (vertex v and its connected neighboring vertices), and E_max is the maximum energy in the local neighborhood that does not exceed the mean energy plus σ_E. The standard deviation is considered to penalize outliers and to normalize E_{i−1}(v) with a suitable E_max.

D̂(v) = 1 − h(S(v), S_1(v)) / h_max,   (5)

where h(S(v), S_1(v)) is the distance between the feature S(v) in the iteration i−1 and S_1(v) in the initial iteration. s is the standard deviation in the local neighborhood (vertex v and its connected neighboring vertices) and h_max is the highest distance value in the neighborhood of v that does not exceed the mean distance plus s. As with E_max, the idea behind considering the standard deviation is to successfully deal with outliers and employ a suitable normalization factor h_max.
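Putting the three ingredients together, one plausible form of the vertex confidence can be sketched as follows. The normalizations below are our illustrative assumptions, not necessarily the paper's exact ones; the intent they capture is that high connectivity, low energy and low appearance dissimilarity all raise confidence:

```python
def vertex_confidence(degree, n_edges, energy, e_max, dissim, h_max):
    """Confidence in [0, 1] from degree, energy and appearance dissimilarity.
    Each term is mapped to [0, 1] (illustrative normalization); low energy
    and low dissimilarity raise the confidence.
    """
    deg_term = degree / n_edges                              # relative connectivity
    en_term = 1.0 - min(energy / e_max, 1.0) if e_max > 0 else 1.0
    di_term = 1.0 - min(dissim / h_max, 1.0) if h_max > 0 else 1.0
    return (deg_term + en_term + di_term) / 3.0

# A well-connected, undeformed, similar-looking vertex is trusted most:
print(vertex_confidence(4, 4, 0.0, 1.0, 0.0, 1.0))  # 1.0
```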
Estimation of the scaling factor
To make the representation invariant to scaling, a scaling factor x⁎ is estimated once in each frame after the sub-trackers have provided their first hypotheses for the positions of the features:

x⁎(v) = (1 / Σ_{v_e ∈ N(v)} c_i(v_e)) · Σ_{v_e ∈ N(v)} c_i(v_e) · (d_i(e) / d_1(e)),   (6)

where x⁎(v) is the estimated scaling factor in the local neighborhood of vertex v, N(v) is the neighborhood of v (all vertices v_e connected to v by an edge e), and c_i(v_e) is the confidence of the neighboring vertices in the current iteration. x⁎(v) is determined by a weighted sum to boost the influence of the most reliable vertices and the associated edges.

The scaling factor x⁎(v) of each vertex is used to calculate a scaling factor for the rigid object (part of an articulated object):

x⁎(p) = (1 / |V0|) · Σ_{v ∈ V0} x⁎(v),   (7)

where V0 are all vertices v of the bottom level of the HSS.
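A minimal sketch of the local estimate, assuming the confidence-weighted mean of current-to-initial edge-length ratios described above:

```python
import math

def local_scale(pos, neighbors):
    """Confidence-weighted mean of current/initial edge-length ratios
    around one vertex (sketch of the local scaling estimate).

    neighbors -- iterable of (neighbor_pos, confidence, initial_length)
    """
    num = den = 0.0
    for npos, conf, d1 in neighbors:
        num += conf * math.dist(pos, npos) / d1  # per-edge length ratio
        den += conf
    return num / den

# Every edge grown by 30% -> estimated local scale of about 1.3:
print(local_scale((0, 0), [((1.3, 0), 0.5, 1.0), ((0, 2.6), 0.9, 2.0)]))
```

The confidence weighting means an occluded neighbor barely influences the scale estimate, which keeps the estimate stable during partial occlusion.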
Offsets of the HSS
To compute the offsets of the HSS we employ graph relaxation, which models the spring-like behavior of the edges with the purpose of minimizing the energies in the HSS, i.e. to bring all edges E to have the same length ratio as in the model (e.g. initial frame).

A structural offset vector Δs_i(v) for vertex v is calculated so that it is pointing to a spatial position in which the energy E_i(v) is minimized:

Δs_i(v) = (1 / Σ_{e ∈ Ē_i(v)} c_i(v_e)) · Σ_{e ∈ Ē_i(v)} c_i(v_e) · (x · d_1(e) − d_i(e)) · u(v_e, v),   (8)

where u(v_e, v) is the unitary vector pointing from a neighboring vertex v_e toward v and x is the scaling factor of the object (for rigid objects x = x⁎(p), for articulated objects with several parts x = x⁎(O)). Fig. 2 shows the concept of producing structural offsets with graph relaxation.
Fig. 2. Graph relaxation examples: the initial and the deformed state of the vertex are shown. The arrows visualize the structural offset vectors.
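The relaxation step can be sketched as a confidence-weighted average of per-edge restoring displacements; the averaging over the confidences is our assumption for illustration:

```python
import math

def structural_offset(pos, neighbors, scale):
    """Offset moving a vertex toward the position that reduces its spring
    energy: each edge pushes or pulls along its direction by the length error.

    neighbors -- iterable of (neighbor_pos, confidence, initial_length)
    scale     -- current scaling factor x
    """
    ox = oy = wsum = 0.0
    for (nx, ny), conf, d1 in neighbors:
        d = math.dist(pos, (nx, ny))
        ux, uy = (pos[0] - nx) / d, (pos[1] - ny) / d  # unit: neighbor -> vertex
        err = scale * d1 - d           # > 0: edge compressed, push outward
        ox += conf * err * ux
        oy += conf * err * uy
        wsum += conf
    return ox / wsum, oy / wsum

# One compressed edge: the offset restores the relaxed length exactly.
print(structural_offset((3, 0), [((0, 0), 1.0, 5.0)], 1.0))  # (2.0, 0.0)
```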
Combining the hypotheses
For each feature (vertex) and in each iteration i the corresponding sub-tracker and HSS propose a “new” position with the knowledge of the position of the previous iteration i−1 and their offsets.

Both hypotheses are combined to determine the position pos_i(v) of each vertex as follows:

pos_i(v) = c_i(v) · t_i(v) + (1 − c_i(v)) · s_i(v),   (9)

where c_i(v) is the confidence of vertex v at time i, t_i(v) is a vector representing the hypothesis of the sub-tracker and s_i(v) is the proposed position of v of the HSS.
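Assuming the confidence-weighted blend described above is a convex combination (high confidence trusts the sub-tracker, low confidence trusts the structure), the mediation step can be sketched as:

```python
def combine(tracker_pos, hss_pos, confidence):
    """Convex combination of the sub-tracker hypothesis and the HSS
    hypothesis (illustrative sketch of the mediation step)."""
    cx = confidence * tracker_pos[0] + (1.0 - confidence) * hss_pos[0]
    cy = confidence * tracker_pos[1] + (1.0 - confidence) * hss_pos[1]
    return cx, cy

# Low confidence (e.g. occlusion): the structural hypothesis dominates.
print(combine((10.0, 0.0), (0.0, 0.0), 0.25))  # (2.5, 0.0)
```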
Assembling parts to form articulated objects
Articulated objects are modeled as
multiple object parts represented by hierarchical spring
systems (HSSs) and connected by vertices representing
articulation points. To exchange information between the parts of the object,
articulation points are connected to the corresponding HSSs. Articulation points
have no corresponding sub-trackers and move solely under the
“forces” of the adjacent parts.
The confidence of a part
The confidence of object parts c_i(p) becomes meaningful when the target object is an articulated object consisting of several parts connected by articulation points. It is computed out of the size I(p), the energy E(p), and the dissimilarity D(p) of the feature S(p) in comparison to S_1(p) of the initial frame:

c_i(p) = (Î(p) + Ê_{i−1}(p) + D̂(p)) / 3,   (10)

where Î(p), Ê_{i−1}(p) and D̂(p) are normalized to satisfy c_i(p) ∈ [0, 1].

Î(p) = F(p) / F̄,   (11)

where F(p) is the number of features of part p and F̄ is the number of all features in the object.

The sum of all local energies in an object part E_{i−1}(p) is normalized by the number of features (vertices):

E_{i−1}(p) = (1 / F(p)) · Σ_{v ∈ V0(p)} E_{i−1}(v),   (12)

D̂(p) = 1 − h(S(p), S_1(p)) / h_max,   (13)

where h(S(p), S_1(p)) is the distance between the feature S(p) in the current iteration and S_1(p) in the initial frame. s is the standard deviation of the distances for all parts in the target object and h_max is the highest distance value that does not exceed the mean distance plus s.
Scaling of the whole object
The estimation of the global scaling of the whole articulated object is based on the scaling factors of the object parts x⁎(p) (see Section 2.3.3), which are combined by a weighted sum:

x⁎(O) = (1 / Σ_{p ∈ O} c_i(p)) · Σ_{p ∈ O} c_i(p) · x⁎(p).   (14)
Articulation points: agents of the information transfer
An articulation point connects
several rigid parts. It allows them to move independently from each other,
while keeping the same distance to it. The movement of a point of a rigid
part in the image plane is constrained to a circle centered at the
articulation point. The radius is equal to the distance between the point of
the rigid part and the articulation point. Fig.
3 illustrates this
concept.

If the articulation point moves it “pulls”
the connected rigid part to keep the distance constraint, and vice versa. In
this way position information is transferred from one rigid part to an
adjacent one over the articulation point.
Modeling articulation points
Planar articulated motion from frame f to frame f+1 can be decomposed into: an independent rotation of the rigid parts around the articulation point, followed by a common translation of the parts (and the articulation point). Given two pairs of points corresponding to two rigid parts performing articulated motion, each at frame f and f+1, the rotation of each part, the common translation (O_x, O_y) as well as the position of the articulation point at frame f are obtained by solving the resulting system of eight equations with eight unknowns.

During the initialization of the representation a local
coordinate system of each pair of features of an object part is created
(see Fig. 3). The coordinates
of the articulation point in this coordinate system are stored. Having
the position of any two features is then enough to build the local
coordinate system and reconstruct the position of the articulation point
in every frame.
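The storage and reconstruction step can be sketched as follows (illustrative function names). The frame is spanned by p2 − p1 and its perpendicular, so the reconstruction follows translation, rotation and scale of the point pair:

```python
def to_local(p1, p2, a):
    """Coordinates of point a in the frame of the ordered pair (p1, p2):
    origin p1, first axis p2 - p1, second axis its perpendicular."""
    bx = (p2[0] - p1[0], p2[1] - p1[1])
    by = (-bx[1], bx[0])                       # 90-degree rotated axis
    n = bx[0] ** 2 + bx[1] ** 2                # squared axis length
    dx, dy = a[0] - p1[0], a[1] - p1[1]
    return ((dx * bx[0] + dy * bx[1]) / n, (dx * by[0] + dy * by[1]) / n)

def from_local(p1, p2, loc):
    """Reconstruct the image position of a stored local point."""
    bx = (p2[0] - p1[0], p2[1] - p1[1])
    by = (-bx[1], bx[0])
    return (p1[0] + loc[0] * bx[0] + loc[1] * by[0],
            p1[1] + loc[0] * bx[1] + loc[1] * by[1])

# Store the articulation point once, then reconstruct it after the pair
# has been translated and rotated by 90 degrees:
loc = to_local((0, 0), (2, 0), (1, 1))
print(from_local((5, 5), (5, 7), loc))  # (4.0, 6.0)
```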
Tracking articulation points
At any time during tracking, knowing the positions of
two vertices of a part and the current scaling factor is sufficient to
generate a hypothesis for the positions of all adjacent articulation
points. These hypotheses are produced with the local coordinate system
defined by the two most confident features (see Section 2.3.2) – further on
named reference vertices – of each
part.

The hypotheses of all parts connected to an articulation point are combined with a weighted sum to calculate the current position a_i of the articulation point a. The weight for each hypothesis depends on the confidence of the corresponding part (see Section 3.1):

a_i = (1 / Σ_{p ∈ P(a)} c_i(p)) · Σ_{p ∈ P(a)} c_i(p) · y_i(p),   (15)

where P(a) is the set of parts connected to the articulation point a, y_i(p) is the hypothesis determined with the local coordinate system (which considers the current scaling factor x) of part p, and c_i(p) is the confidence of part p. With this weighted sum, the influence of ambiguous parts on the position of the articulation point is low (e.g. if a part is occluded) and that of reliably tracked parts high.
Information transfer
For each rigid part, the distance constraint to the
articulation point is enforced by connecting all vertices from the bottom
level and the vertex from the top level with the corresponding articulation
point. The articulation point “transfers” position
information from reliably to ambiguously tracked parts through its distance
constraints (circles).

The information transfer is realized with graph relaxation by calculating a structural offset vector. Therefore, Eq. (8) is adapted as follows:

Δs_i(a) = (1 / Σ_{v ∈ V(a)} c_i(v)) · Σ_{v ∈ V(a)} c_i(v) · (x · d_1(e) − d_i(e)) · u(v, a),   (16)

where V(a) are the vertices connected to the articulation point a, c_i(v) is the confidence of vertex v, d_i(e) is the length of edge e connecting v with a and d_1(e) represents the length of the same edge in the initial frame. u(v, a) is the unitary vector pointing from a vertex v toward the articulation point a.
Tracking as a hierarchical optimization process—the algorithm
The algorithm to track articulated objects using HSSs is
summarized in Algorithm
1.

Tracking is done in a top to bottom or
bottom to top process, depending on the confidence
values (see Algorithm 1, Line 8).
In frames when the tracking is reliable, the springs connecting the top vertex
with the bottom level are used to generate additional structural offsets for the
vertices in the bottom level (top to bottom processing).
During occlusions this flow of structural feedback is inverted such that structural
offsets are determined for the top vertex (bottom to top
processing). The decision for top to bottom or
bottom to top processing is taken by a comparison of
the confidence values of the top and bottom vertices. In cases of ambiguity
bottom to top processing is preferred (confidence
value of the top vertex is smaller than the confidence of the bottom vertices).

Algorithm 1. Algorithm for tracking articulated objects.
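The top-to-bottom versus bottom-to-top decision described above reduces to a confidence comparison; a minimal sketch follows, where aggregating the bottom-level confidences by their mean is our assumption:

```python
def feedback_direction(conf_top, conf_bottom):
    """Choose the direction of the structural feedback flow: if the top
    vertex is less confident than the bottom level (ambiguity), the
    bottom-level vertices determine the offset for the top vertex.

    conf_bottom -- list of confidences of the bottom-level vertices
    """
    mean_bottom = sum(conf_bottom) / len(conf_bottom)
    return "bottom-to-top" if conf_top < mean_bottom else "top-to-bottom"

print(feedback_direction(0.9, [0.2, 0.4]))  # top-to-bottom: reliable frame
print(feedback_direction(0.3, [0.8, 0.6]))  # bottom-to-top: occluded top
```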
Experiments
The following experiments show the application of the presented
framework on concrete tracking tasks with different complexities and
difficulties.
The sub-trackers
We use the mean shift algorithm for the sub-trackers. It is
a simple, single hypothesis tracker, which on its own is not able to track
complex, articulated objects successfully.

Mean shift efficiently searches for local extrema in
a probability distribution with a search window, and generates an offset
vector pointing to the corresponding position. The value of the distribution
at a certain point depends on the similarity between features extracted
within a window centered at that point and features extracted in an
initialization phase from the region to be tracked.
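For intuition, a minimal discrete mean-shift step on such a similarity map can be sketched as follows. The `score` callable stands in for the appearance similarity; the square window and the stopping rule are simplifications:

```python
import math

def mean_shift(score, start, radius=1, max_iter=20):
    """Hill-climb on a similarity map: repeatedly move the window center
    to the score-weighted centroid of its neighborhood (simplified)."""
    x, y = start
    for _ in range(max_iter):
        pts = [(x + dx, y + dy)
               for dx in range(-radius, radius + 1)
               for dy in range(-radius, radius + 1)]
        w = [score(px, py) for px, py in pts]
        total = sum(w)
        nx = round(sum(wi * px for wi, (px, py) in zip(w, pts)) / total)
        ny = round(sum(wi * py for wi, (px, py) in zip(w, pts)) / total)
        if (nx, ny) == (x, y):
            break                     # converged to a local maximum
        x, y = nx, ny
    return x, y

# Climbs from (2, 2) to the peak of a Gaussian-shaped similarity map:
peak = lambda px, py: math.exp(-((px - 5) ** 2 + (py - 5) ** 2))
print(mean_shift(peak, (2, 2)))  # (5, 5)
```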
The region descriptors
Sigma Sets are used in the experiments as the region
descriptors (features) describing the appearance of the corresponding
regions of interest covering the target object. Appendix A gives a brief recall of Sigma Sets.

The extraction of the features in every frame is computationally very expensive. In a frame with a resolution of 480×640 pixels the calculation of the features consumes between 60 and 70 s of the overall computing time of at most 75 s per frame.
Initializing the hierarchical spring systems
Features/vertices. Before a HSS can
be built, a target object needs to be defined and suitable features
describing the object have to be selected. This can be done automatically by
methods like in [9-12] or semi-manually as for the experiments
in this paper.

The top level is described by one region descriptor
S1(p),
extracted out of a region of interest (ROI) covering the whole object part
(Fig. 1(a)). The bottom level
consists of several smaller region descriptors, extracted from the same ROI
(see Fig. 1(b)). A Harris corner
detector is applied on the ROI to find promising positions for the smaller
region descriptors
S1(v).
Around each corner point a small ROI is built to extract a Sigma Set (e.g.
9×9 pixels).

Edges. The edges can be inserted with
a Delaunay triangulation (see Fig.
4(b)) or a fully
connected graph can be built (see Fig.
4(c)). For more details on inserting the edges refer to
Section 5.3.1.
Fig. 4
Building a HSS. Target object: head of jumping jack. (a)
Selected features: region descriptors (boxes). (b) Inserted edges: triangulated
graph. (c) Inserted edges: fully connected graph.
Articulation points. They can be
initialized manually (as in the following experiments) or automatically by
observing the articulated motion of the target object [13,9].
Connectivity issues
This section deals with the impact of the connectivity
of the vertices in the HSS on the quality of the structural feedback,
i.e. on the structural offset vector.

Given the features represented as vertices, there are different possibilities for adding the edges connecting them, e.g. a Delaunay triangulation or a fully connected graph (see Fig. 4).

If a vertex v is of degree 1 – only connected to one neighbor – the structural feedback determined by graph relaxation is ambiguous. The local energy in the current vertex v is minimized (E_i(v) = 0) by moving v to any point on the circle centered on its neighbor with the radius equal to the “original” length of the edge connecting them. Therefore, there is no unique global minimum or structural offset vector for v. For a vertex v with degree 2, the ambiguity is reduced to two possible positions, both with E_i(v) = 0. Above degree 2, there is only one position in the image which minimizes E_i(v). Fig. 5 visualizes these three cases.
Fig. 5
Ambiguity of structural offset vectors. (a) Vertex
degree 1, all positions on circle are minima. (b) Vertex degree 2, two minima.
(c) Vertex degree 3, one unique minimum.
In our experiments both a Delaunay triangulation and a
fully connected graph are used as representation. Table 1
lists important facts of both representations.
Table 1
Comparison of a triangulated and a fully connected graph.

Representation | Connectivity | Quality of structural feedback | Propagation of information
Planar triangulation | Low, some vertices have only degree 2 | Robust without occlusion, can be ambiguous in cases of occlusion | Slow for graphs with many vertices
Fully connected | High, all vertices have degree |V0| − 1 | Robust with and without occlusions | Fast, independent of the number of vertices
As Table 1
lists, a fully connected graph may produce superior results. When
determining the structural offset vector (see Eq. (8)) each vertex gets structural input
from every other vertex in the graph. Especially in cases of occlusion,
this leads to a faster propagation of “correct” position
information (see Fig.
6 in Section 5). The only drawback we
identified for the fully connected graph is the increase in processing time when calculating the structural offset vector, which was insignificant in our experiments.
Fig. 6
Experiment 1: tracking an occluded face with mean shift
(top), with our approach in a triangulation (middle) and our approach with a
fully connected graph (bottom). The images show the features of the bottom level
connected by edges to illustrate the deformations and the qualitative
results.
Experimental setup
The videos employed for the following experiments are
self-produced (800×600 pixel), from the Motion of Body (MoBo)
database [15] (486×640
pixel) and from Amit et al. [16]
(352×288 pixel).

The videos are selected considering the current status of
the presented approach. Even though the proposed framework is able to
successfully track objects through articulated motion and scaling, it can
only deal with affine or perspective changes up to a certain degree. The
reason for this lies in the current state of the HSS as it does not consider
the 3D space when generating structural offset vectors. Therefore, videos
with objects moving in the 3D space are not suitable for our experiments and
will lead to significant errors in tracking.

In all experiments presented in this section, the target
object is initialized manually by selecting the parts of the object and
defining the positions of the articulation points. Except for the video in experiment 1, the ground truth was determined by us by manually selecting the center positions of the object parts.
Experiment 1: occlusion
This experiment focuses on occlusions and compares the
tracking results of mean shift alone and our combined approach. The video
used in this experiment is from the work of Amit et al. [16]. It shows the face of a woman being
partially occluded several times.

In Fig. 6 one can
see the results of tracking with mean shift alone, with a HSS with
triangulated graphs and with a HSS using fully connected graphs. As already
mentioned in Section 5.3.1, the
fully connected graph is superior to the triangulated graph in challenging
cases of occlusion, which occur in this video sequence. The face is occluded
several times by a highly textured object (magazine) moving in different
directions and occluding different parts of the face. This leads to big
confusions and errors in the tracking with mean shift alone (see
Fig. 6 (top)).

Fig. 7 shows the quantitative
results of this experiment. These results confirm the qualitative results.
The ground truth is provided by [16]. When comparing the results of Fig. 7 with the results in [16], one can see that the methods have a
similar error rate. The approach of Amit et al. [16] has problems in frames 500–600, whereas our approach performed better in this period. Both methods are challenged in
frames 700–800, but this time the method of Amit et al. is slightly
better.
Fig. 7
Experiment 1: deviation from ground truth. (full) Using
HSS with a fully connected graph, (planar) using HSS with a triangulated graph,
and (without) using only tracking with mean shift.
Experiment 2: articulated motion with
self-occlusion
This experiment uses a video from [15] of subject 04011 in view vr16_7, where
the aim was to track the hand, torso, and upper and lower arms. The challenges
are self-occlusions and the similar appearance of several object parts. (We do
not show images of subject 04011, as this is not permitted by [15].) Fig. 8
shows that the presented representation significantly improves the quality of
the results of tracking with mean shift. The left lower arm is the most
challenging object part to track, but our approach is able to recover well
from wrong hypotheses.
Fig. 8
Experiment 2: deviation from ground truth: (top)
tracking with mean shift, (bottom) tracking with our approach with fully
connected graphs.
Experiment 3: articulated motion under
scaling
In experiment 3 the aim is to successfully track an articulated object
consisting of eight parts connected via six articulation points (a jumping
jack). The challenges are the scaling (approximately from 100% to 130% and
down to 80%) and the two types of motion: articulated motion and camera
motion. In Fig. 9 one can see three frames of the video. Fig. 10 shows the
deviation from the manually labeled ground truth for tracking with mean shift
alone and for our approach with HSSs represented by planar triangulated graphs
or fully connected graphs. As expected, there is no remarkable difference
between the results for the planar and the fully connected graph.
Fig. 9
Experiment 3: some frames of the video showing the
scaling.
Fig. 10
Experiment 3: deviation from ground truth. The position
error in pixels is a sum over the error of all object parts.
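The deviation metric reported in the plots (a per-frame sum of position errors over all object parts) can be sketched as follows. The use of the Euclidean distance between estimated and ground-truth part centers, and the function name, are our illustrative assumptions, not necessarily the exact formula used in the paper.

```python
import math

def summed_position_error(estimated, ground_truth):
    """Sum of per-part Euclidean distances (in pixels) for one frame.

    `estimated` and `ground_truth` map part names to (x, y) center
    positions; the distance measure is an assumption for illustration.
    """
    return sum(
        math.hypot(ex - gx, ey - gy)
        for part, (ex, ey) in estimated.items()
        for (gx, gy) in [ground_truth[part]]
    )

# Example: two parts, off by 3 and 4 pixels respectively
est = {"torso": (100.0, 50.0), "head": (100.0, 20.0)}
gt  = {"torso": (103.0, 50.0), "head": (100.0, 16.0)}
print(summed_position_error(est, gt))  # → 7.0
```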
Experiment 4: fast movements
In this experiment the robustness and recovery potential of the HSS are
tested. The employed video shows a woman waving a hand very fast, which leads
to heavy motion blur. Fig. 11 shows some frames of the video sequence,
including qualitative results for tracking with mean shift alone and for our
approach with fully connected graphs. Frames 155 and 170 show the superior
results of our approach in comparison to mean shift on its own. Fig. 12
evaluates the results in concrete numbers.
Fig. 11
Experiment 4: tracking an articulated object through
motion blur. (top) Tracking with mean shift and (bottom) our approach with HSS
and fully connected graphs.
Fig. 12
Experiment 4: deviation from ground truth. (without)
Tracking the object parts with mean shift, (full) our approach with fully
connected graphs.
Experiment 5: tracking a whole human
In experiment 5, representations with 10 object parts and nine articulation
points are built to track walking humans in sequences 04002 and 04006 in view
vr7_7 of [15]. Fig. 13 shows images of 04002 and 04006; in (d) one can see
that for some parts (in this case the left upper arm) it is not possible to
extract enough local features. In such cases tracking is also more difficult
and depends mainly on the top level of the HSS. As expected, tracking with our
approach, combining mean shift and HSSs, delivers the better result (see
Fig. 14).
Fig. 13
Experiment 5: (a) frame of subject 04002 with the top
level of the HSSs and the articulation points, (b) subject 04002 and
corresponding bottom level of HSSs, (c) frame of subject 04006 and its top level
with the articulation points, and (d) showing the bottom level of the HSSs of
04006.
Fig. 14
Experiment 5: deviation from ground truth. (top) Video
with subject 04002 in view vr7_7, (bottom) subject 04006 in the same view. For
both videos results with mean shift (without) and with our approach (full) are
shown. The position error in pixels is a sum over the error of all object
parts.
Discussion and future work
The presented experiments showed the application of the proposed framework to
tracking objects of different complexity under “simple” motion, articulated
motion, camera motion, scaling, occlusion, and motion blur.

Even though mean shift tracking and Sigma Sets are employed as basic building
blocks, both the tracker and the region descriptor are exchangeable. The focus
of our work lies in the hierarchical representation.

The experiments in this section showed that a fully connected graph as the
representation of a HSS is equal or superior to a triangulated graph
(especially during occlusions). Therefore, we intend to employ this
representation in the future. The increase in processing time is
insignificant, as most of the processing time (approximately 95%) is spent on
calculating region descriptors and building distributions.

Besides its advantages during occlusion, the fully connected graph is also a
good basis for future research on updating the elements of the HSS. When an
object moves in 3D space (e.g. turning around), some regions of the object
become invisible and new regions appear. Therefore, it is necessary to develop
an update process for the elements of the HSS, which allows the removal of
“old” vertices and the addition of “new” ones. This process requires changes
in the graph representing the HSS, and here a fully connected graph is easier
to handle than a triangulation.

Furthermore, we plan to extend our HSS to handle 3D position information. One
possibility to realize this could be to stick with mean shift tracking in 2D,
but optimize the spring system in 3D coordinates.
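As a rough illustration of why the denser spring structure costs little: a fully connected graph on n vertices has n(n−1)/2 edges, while a planar triangulation has at most 3n − 6, yet the dominant cost (region descriptors) is incurred per vertex, not per edge. The sketch below only counts edges under these standard graph-theoretic bounds; the vertex count is a hypothetical example, not taken from the paper.

```python
from itertools import combinations

def fully_connected_edges(vertices):
    """Edge set of the complete graph on the given vertices."""
    return set(combinations(vertices, 2))

def max_triangulation_edges(n):
    """Upper bound on edge count of a planar triangulated graph (3n - 6, n >= 3)."""
    return 3 * n - 6

# Hypothetical part with 20 feature vertices (illustrative choice)
n = 20
full = len(fully_connected_edges(range(n)))  # n*(n-1)/2 = 190 spring edges
planar = max_triangulation_edges(n)          # at most 54 spring edges
print(full, planar)
```

Since the per-edge spring computations are cheap relative to descriptor extraction, the roughly 3.5× edge increase in this example has little impact on total runtime.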
Conclusion
This paper presented a flexible framework to represent and track
articulated objects consisting of several rigid parts connected with
articulation points. The parts of the object are described by a hierarchical
spring system which is represented by an attributed graph pyramid. The
attributes of the pyramid are region descriptors and the edges encode the
spatial relationships between the vertices/attributes. This spatial structure is
enforced during tracking by the spring-like behavior of the edges in the
hierarchical spring systems. The “springs” make it possible to determine
structural offset vectors, which are combined with the offset vectors provided
by the employed mean shift tracker. Position information can be transferred
between the parts over the corresponding articulation points, depending on the
confidence of the parts and their features.
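The combination of tracker and structural hypotheses can be sketched as follows. The linear, confidence-weighted blend and the function name are our illustrative assumptions; the paper's actual combination rule may differ.

```python
def combine_offsets(tracker_offset, structural_offset, confidence):
    """Blend the mean shift tracker's offset with the spring system's
    structural offset, weighted by the tracker's confidence in [0, 1].

    A confident tracker dominates the result; an ambiguous one defers
    to the structural hypothesis. The linear blend is an assumption
    made for illustration only.
    """
    tx, ty = tracker_offset
    sx, sy = structural_offset
    return (confidence * tx + (1.0 - confidence) * sx,
            confidence * ty + (1.0 - confidence) * sy)

# A fairly confident tracker mostly follows its own hypothesis
print(combine_offsets((4.0, 0.0), (0.0, 2.0), 0.75))  # → (3.0, 0.5)
```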
1: processFrame(Ti) ▹ Ti: threshold, the maximum number of iterations
2: i ← 1 ▹ iteration counter
3: while (i < Ti) do
4:   for every rigid part do
5:     calculate confidences δi(v) and δi(p)
6:     estimate positions with sub-trackers top and bottom
7:     if i > 1 then
8:       decide between top-to-bottom or bottom-to-top processing
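The truncated processFrame listing above can be sketched as a Python skeleton. All helper functions are placeholders we assume for illustration (the paper does not specify their interfaces), and the loop-counter increment is added only so the sketch terminates; it is not shown in the truncated listing.

```python
def vertex_confidences(part):
    """Placeholder for the per-vertex confidences δi(v) (assumed interface)."""
    return {v: 1.0 for v in part["vertices"]}

def part_confidence(part):
    """Placeholder for the per-part confidence δi(p) (assumed interface)."""
    return 1.0

def estimate_positions(part):
    """Placeholder: run sub-trackers on the top and bottom pyramid levels."""
    pass

def choose_direction(delta_v, delta_p):
    """Placeholder: pick top-to-bottom or bottom-to-top processing.

    The threshold rule here is an assumption for illustration.
    """
    return "top_to_bottom" if delta_p >= 0.5 else "bottom_to_top"

def process_frame(parts, T_i):
    i = 1                                  # line 2: iteration counter
    while i < T_i:                         # line 3
        for part in parts:                 # line 4: every rigid part
            dv = vertex_confidences(part)  # line 5: δi(v)
            dp = part_confidence(part)     # line 5: δi(p)
            estimate_positions(part)       # line 6: top and bottom sub-trackers
            if i > 1:                      # line 7
                choose_direction(dv, dp)   # line 8
        i += 1  # increment assumed; not shown in the truncated listing
```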