This paper presents a flexible framework to build a target-specific, part-based representation for arbitrary articulated or rigid objects. The aim is to successfully track the target object in 2D, through multiple scales and occlusions. This is realized by employing a hierarchical, iterative optimization process on the proposed representation of structure and appearance. Therefore, each rigid part of an object is described by a hierarchical spring system represented by an attributed graph pyramid. Hierarchical spring systems encode the spatial relationships of the features (attributes of the graph pyramid) describing the parts and enforce them by spring-like behavior during tracking. Articulation points connecting the parts of the object allow to transfer position information from reliable to ambiguous parts. Tracking is done in an iterative process by combining the hypotheses of simple trackers with the hypotheses extracted from the hierarchical spring systems.
This paper presents a flexible framework to build a target-specific, part-based representation for arbitrary articulated or rigid objects. The aim is to successfully track the target object in 2D, through multiple scales and occlusions. This is realized by employing a hierarchical, iterative optimization process on the proposed representation of structure and appearance. Therefore, each rigid part of an object is described by a hierarchical spring system represented by an attributed graph pyramid. Hierarchical spring systems encode the spatial relationships of the features (attributes of the graph pyramid) describing the parts and enforce them by spring-like behavior during tracking. Articulation points connecting the parts of the object allow to transfer position information from reliable to ambiguous parts. Tracking is done in an iterative process by combining the hypotheses of simple trackers with the hypotheses extracted from the hierarchical spring systems.
The task of monocular tracking of articulated objects is a
challenging one. Complex articulations can significantly change the appearance
of the object and distant parts can perform very different motions. These
aspects affect popular trackers [1]
that consider the appearance of simple shapes (e.g. rectangles, as certain poses
might not be very compact and cover only a small portion of the bounding box)
and trackers that assume a simple global motion model for the whole
part.The most promising approaches of articulated tracking are quite
complex and depend to a large extent on strong motion and subject specific
priors. While they do deliver excellent results for the object class they have
been designed for (e.g. humans), most of them do not generalize very well and
would need extensive adaptation to work for other object classes. Recent
examples of such well performing specialized methods are Lee and Elgammal
[2], who introduce a model that
ties together the human body configuration manifold and visual manifold in one
representation, which is then used for tracking within a Bayesian framework, and
Brubaker et al. [3] who present a
physics-based model with a bio-mechanical characterization of lower-body
dynamics, where tracking is accomplished with a form of sequential Monte Carlo
inference.In contrast, the presented approach requires only basic
information on the structure of the target object and no motion prior, which
makes it less object-class specific and more general. Objects are represented as
features in arbitrary configurations. Tracking a whole object builds on simple,
single hypothesis feature trackers, and deals with partial occlusion, scaling,
and limited non-rigid deformation. The output consists of the 2D positions and
bounding box of the object parts in every frame of the video.At the heart of the method is a representation which describes
the appearance and kinematics of articulated objects. It consists of multiple
object parts modeled by rectangular regions of interest and features extracted
out of these regions. Kinematics are realized by connecting object parts through
articulation points, which limit the movement of each part to a circle (see
Fig. 3).
Fig. 3
Left: distance constraints imposed by articulation
points. Right: articulation point a in the local
coordinate system defined by an ordered pair of points
p1,
p2.
Multiple feature trackers, called
sub-trackers, are used for each part: one attempting
to track the whole part and the rest considering small fixed-size windows
centered around detected interest points (see Fig.
1).
Fig. 1
Example representation for a part. (a) Feature for the
top level sub-tracker. (b) Features for the bottom level sub-trackers. The white
edges are the edges of
G. (c) Corresponding
graph pyramid
P={G,
G} (not all
bottom level vertices and edges are shown).
To deal with occlusion and to avoid
drifting of the sub-trackers we model the parts as a
graph hierarchy with two levels: one top-level vertex for the sub-tracker
tracking the whole part and multiple bottom-level vertices for the
interest-point sub-trackers. The edges of the graph are weighted with the
pairwise distances between the features, and act like springs pushing and
pulling the vertices to reduce the deformation of the graph-structure of the
parts, thus giving the name hierarchical spring system
(HSS).The final position of each feature (top and bottom level) is
obtained through a mediation between the corresponding tracker, pulling towards
what it considers to be the target region, and the HSS trying to enforce the
initial structure (reduce deformation). The weight of each of these two factors
is dynamically adjusted depending on the similarity of the region at the current
position with the known appearance of the part. Thus, during occlusion (by a
different looking object) the HSS has more weight allowing for badly tracked
features to be placed at known relative positions, while at times of successful
tracking the very confident sub-trackers are given more weight, allowing for a
certain amount of non-rigid deformation. A global scaling
factor is maintained and used to adjust the “relaxed” (no
deformation) lengths of the springs, allowing to cope with global changes in
scale.Articulated objects are modeled as
multiple HSS corresponding to each part connected by vertices representing the
articulation points. Articulation points have no corresponding sub-trackers and
move solely under the “forces” of the adjacent parts. Thus
movement of one adjacent part is transmitted to the other enforcing articulated
motion.All computation (position of sub-trackers, scaling, and
articulation) is done using local confidence measures to balance between
trusting the sub-trackers i.e. the visual feedback, and the object structure,
i.e. the prior knowledge.
Related work
First introduced by Fischler et al. in 1973 [4], pictorial structures represent an
object by its parts (e.g. head, torso, arms, legs) arranged in a deformable
spatial configuration. This deformable configuration is represented by
spring-like connections between pairs of parts. Object recognition or
tracking can be done by minimizing the energy in this deformable
configuration to find the most likely configuration of the object parts in
an image. Felzenszwalb et al. employed this idea in [5] to do part-based object recognition for
faces and articulated objects (humans). Their approach is a statistical
framework minimizing the energy of the spring system learned from training
examples using maximum likelihood estimation. Ramanan et al. apply in
[6] the ideas from
[5] in tracking
people.Besides Computer Vision, the proposed representation is also
related to representations used in Computer Graphics called
mass–spring systems
[7]. Mass–spring systems
are a physically based technique that is used to effectively model
deformable objects for animations in Computer Graphics (e.g. a flag moving
in the wind). An object is modeled by a collection of point masses connected
by springs in a lattice structure.Different from the mentioned approaches, we stress solutions
that emerge from the underlying structure, as opposed to using structure to
verify sampled hypothesis. The proposed representation not only connects
parts in a deformable way like in [5], but introduces a bottom level consisting of
“small” region descriptors described by a spring system. In
comparison to pictorial structures the presented approach does not need
training, because the spring-like behavior is modeled via a combination of
structural and appearance offsets (provided by the sub-trackers).Even though the bottom level of the proposed hierarchical
spring system is similar to a mass–spring system [7], there are significant differences. The
presented Spring System is used to supply structural feedback for tracking
algorithms, which is a totally different purpose and it does not consider
any external forces (e.g. gravity). In the proposed approach a vertex does
not have a mass, but the force of the spring is calculated by its confidence
in the current frame.
Contributions
Our main contribution is the flexible framework for
representing and tracking articulated objects of arbitrary complexity with
each (rigid) part of an object represented by a hierarchical spring system
(HSS), connected to other parts by articulation points. Articulation points
are used to transfer information between the HSS of the adjacent object
parts. All decisions balance between “seeing” and
“knowing” using maintained confidence measures. We pose
tracking as a hierarchical optimizations process on structure and
appearance.A preliminary version of our approach has been presented in
[8]. Possible applications are
action recognition, human computer interfaces, motion based diagnosis and
identification, etc.
Overview
This paper is organized as follows: Section 2 describes how to represent the
appearance and structure of a rigid object in a HSS. It is explained how our
approach combines the hypotheses of the sub-trackers and the HSS. In
Section 3 the introduced
concepts of Section 2 are used to
model articulated objects consisting of several rigid object parts.
Additionally, articulation points and the information transfer between the
object parts are described. Section
4 presents the algorithm of the tracking with the help of
pseudo-code. In Section 5
experiments qualitatively and quantitatively analyze the results of the
presented approach. Section 6
gives a conclusion, and the Appendix introduces the employed region
descriptor (Sigma Sets).
Representation and tracking of a rigid object
Background clutter, similar objects in the scene and occlusions
are the main reasons for tracking failure, because they can be good matches to
the model of the target object and thus distract the tracker.If the appearance of an object is uniform (no texture, mainly
one color), it is advisable to describe and track it by one feature (e.g. region
descriptor). Tracking whole rigid objects or parts can deliver robust positions
even during motion blur due to the large image region considered. Nevertheless,
in cases of partial occlusion or scaling such a description is not able to aid
the tracker in overcoming the difficult distractions by providing useful
information.On the other hand, if the target object is textured (e.g. face
of a human), it is possible to extract several discriminative features out of
the region covering the object and track them successfully when there are no
distractions. By additionally encoding the spatial relationships of the features
in the representation of the object, it is possible to deal with occlusions and
estimate scaling. Unfortunately, these “small” features are more
sensitive to noise and fast motion of the object (big distances between frames,
motion blur).As we cannot generally decide which representation is more
suitable for an object and to get the best of both worlds, we describe and track
objects using multiple features and sub-trackers, where the spatial
relationships of the features are described and enforced by a
hierarchical spring system (HSS).
The sub-tracker
The purpose of each sub-tracker is to track a fixed-size
region independently of the other sub-trackers, based solely on the content
of the image. At any frame, given as input an initial estimate of the
position of a tracked region and a description of its appearance, the
corresponding sub-tracker will return an offset to what it considers to be
the correct position of the target region.
The hierarchical spring system (HSS)
We represent the HSS of an object as a graph pyramid with
two levels
P={G,
G}, where
the top level
G(V,
E) contains one
single vertex
V={v},
and the bottom level graph
G(V,
E) multiple
vertices connected by edges. There is an one-to-one mapping between the
vertices in the graph pyramid and the features with their corresponding
sub-trackers. Edges are weighted with the known distance in the image plane
between the features corresponding to the incident vertices. The vertex in
the top level is connected with all vertices in the base level to allow
communication between the two levels. Fig.
1 shows an example
representation for an object and the corresponding regions for the
sub-trackers. (Options for inserting the edges are discussed in
Section 5.3.1.)
Tracking with sub-trackers and HSS
For each frame the first hypotheses of the sub-trackers are
refined using an iterative alternation and combination of the offsets from
the sub-trackers and the offsets from the HSS.
Energies in the HSS
The HSS encodes the spatial relationships of the
features of the object considering their spatial distances and
arrangement. Its task is to keep the structure of the features as
similar as possible to the initial state in the first frame. This is
realized by providing the tracker with structural
offsets (see Section
2.3.4).To calculate a structural offset for a feature it is
necessary to determine the extent of the spatial deformation in the HSS.
The extent of the deformation in a vertex v at
time i=1…n is
represented and calculated by the energy in v:where
E
(v) are all edges e of the
levels E and
E at time
i−1 incident to vertex
v. is the confidence (see Section 2.3.2) of the neighboring
vertex v
at time i connected by e,
which weights the influence of
v
on . The motivation behind the weighting with is that occluded neighboring vertices should have a lower
impact on than reliably tracked neighbors. and denote the deformed and initial edge lengths between
v and
v,
and x is the current scaling factor of the object
(for rigid objects
x=x⁎(p),
for articulated objects with several parts
x=x⁎(O)).
x is used to apply a global scaling to the
initial edge lengths to be able to track an object changing its distance to the
camera (see Section
2.3.3).
The confidence of a vertex
The confidence is used to dynamically weight influences
of vertices in different calculations and situations e.g. calculation of
(see Section
2.3.1).The confidence of a vertex v at time
i depends on its degree
I(v) (number of
incident edges), its energy and the dissimilarity
D(v)
between its feature
S(v)
at time i−1 to its descriptor
S1(v)
in the initial iteration:, and are normalized so that .where E(v) are the
edges incident to vertex v and
E are all edges in the HSS.where is the energy in vertex v in
iteration i−1 (see Eq. (1)), is the standard deviation of the energies in the local
neighborhood (vertex v and its connected
neighboring vertices), and is the maximum energy smaller or equal to . The standard deviation is considered to penalize outliers and to normalize with a
suitable .where
h(S(v),
S1(v))
is the distance between the feature
S(v)
in the iteration i−1 and
S1(v)
in the initial iteration.
s
is the standard deviation in the local neighborhood (vertex
v and its connected neighboring vertices) and
hmax
is the highest distance value in the neighborhood of
v, where . As with the idea behind considering the standard deviation is to
successfully deal with outliers and employ a suitable normalization
factor
hmax.
Estimation of the scaling factor
To make the representation invariant to scaling, a
scaling factor x⁎ is
estimated once in each frame after the sub-trackers have provided their
first hypotheses for the positions of the features.where
x⁎(v)
is the estimated scaling factor in the local neighborhood of vertex
v. N(v)
is the neighborhood of v (all vertices
v
connected to v by e).
is the confidence of the neighboring vertices in the current
iteration.
x⁎(v)
is determined by a weighted sum to boost the influence of the most
reliable vertices and the associated edges.The scaling factor
x⁎(v)
of each vertex is used to calculate a scaling factor for the rigid
object (part of an articulated object):where
V are all
vertices v of the bottom level of the
HSS.
Offsets of the HSS
To compute the offsets of the HSS we employ
graph relaxation, which models the
spring-like behavior of the edges with the purpose to minimize the
energies in the HSS, i.e. to bring all edges E to
have the same length ratio as in the model (e.g. initial
frame).A structural offset vector
for vertex v is calculated so that it
is pointing to a spatial position in which the is minimized:where is the unitary vector pointing from a neighboring vertex
v
toward v and x is the
scaling factor of the object (for rigid objects
x=x⁎(p),
for articulated objects with several parts
x=x⁎(O)).
Fig. 2 shows the concept of producing structural offsets
with graph relaxation.
Fig. 2
Graph relaxation examples. B is
the initial state of the vertex and the deformed one. The arrows visualize the structural offset vectors
.
Combining the hypotheses
For each feature (vertex) and in each iteration
i the corresponding sub-tracker and HSS
propose a “new” position with the knowledge of the
position of the previous iteration i−1
and their offsets.Both hypotheses are combined to determine the position
c of
each vertex as follows:where is the confidence of vertex v at time
i,
t
is a vector representing the hypothesis of the sub-tracker and
s is
the proposed position of v of the HSS.
Assembling parts to form articulated objects
Articulated objects are modeled as
multiple object parts represented by hierarchical spring
systems (HSSs) and connected by vertices representing
articulation points. To exchange information between the parts of the object,
articulation points are connected to the corresponding HSSs. Articulation points
have no corresponding sub-trackers and move solely under the
“forces” of the adjacent parts.
The confidence of a part
The confidence of object parts becomes meaningful when the target object is an articulated
object consisting of several parts connected by articulation points. It is
computed out of the size I(p),
the energy
E(p),
and the dissimilarity
D(p)
of the feature
S(p)
in comparison to
S1(p) of
the initial frame:, and are normalized to satisfy .where F(p) is the
number of features of part p and
F is the number of all features in the
object.The sum of all local energies in an object part
Ei−1(p)
is normalized by the number of features (vertices):where
h(S(p),
S1(p)) is
the distance between the feature
S(p)
in the current iteration and
S1(p) in
the initial frame.
s
is the standard deviation of the distances for all parts in the target
object and
hmax
is the highest distance value, where .
Scaling of the whole object
The estimation of the global scaling of the whole
articulated object is based on the scaling factors of the object parts
x⁎(p)
(see Section 2.3.3), which are
combined by a weighted sum:
Articulation points: agents of the information
transfer
An articulation point connects
several rigid parts. It allows them to move independently from each other,
while keeping the same distance to it. The movement of a point of a rigid
part in the image plane is constrained to a circle centered at the
articulation point. The radius is equal to the distance between the point of
the rigid part and the articulation point. Fig.
3 illustrates this
concept.If the articulation point moves it “pulls”
the connected rigid part to keep the distance constraint, and vice versa. In
this way position information is transferred from one rigid part to an
adjacent one over the articulation point.
Modeling articulation points
Planar articulated motion from frame
f to frame can be decomposed into: an independent rotation of the rigid
parts around the articulation point, followed by a common translation of
the parts (and the articulation point). Given two pairs of points
corresponding to two rigid parts performing articulated motion, each at
frame f and , the rotation () of each part, the common translation
(O,
O)
as well as the position of the articulated point at frame
f are obtained by solving the resulting
system of eight equations with eight unknowns.During the initialization of the representation a local
coordinate system of each pair of features of an object part is created
(see Fig. 3). The coordinates
of the articulation point in this coordinate systems are stored. Having
the position of any two features is then enough to build the local
coordinate system and reconstruct the position of the articulation point
in every frame.
Tracking articulation points
At any time during tracking, knowing the positions of
two vertices of a part and the current scaling factor is sufficient to
generate a hypothesis for the positions of all adjacent articulation
points. These hypotheses are produced with the local coordinate system
defined by the two most confident features (see Section 2.3.2) – further on
named reference vertices – of each
part.The hypotheses of all parts connected to an articulation
point are combined with a weighted sum to calculate the current position
a of
the articulation point a. The weight for each
hypothesis depends on the confidence of the corresponding part (see
Section 3.1):where P(a) is the set
of parts connected to the articulation point a.
y is the
hypothesis determined with the local coordinate system (which considers
the current scaling factor x) of part
p. is the confidence of part p. With
this weighted sum, the influence of ambiguous parts on the position of
the articulation point is low (e.g. if a part is occluded) and of
reliably tracked parts high.
Information transfer
For each rigid part, the distance constraint to the
articulation point is enforced by connecting all vertices from the bottom
level and the vertex from the top level with the corresponding articulation
point. The articulation point “transfers” position
information from reliably to ambiguously tracked parts through its distance
constraints (circles).The information transfer is realized with graph relaxation
by calculating a structural offset vector. Therefore, Eq. (8) is adapted as follows:where is the confidence of vertex v, is the length of edge e connecting
v with a and represents the length of the same edge in the initial frame.
is the unitary vector pointing from a vertex
v toward the articulation point
a.
Tracking as a hierarchical optimization process—the
algorithm
The algorithm to track articulated objects using HSSs is
summarized in Algorithm
1.Tracking is done in a top to bottom or
bottom to top process, depending on the confidence
values (see Algorithm 1, Line 8).
In frames when the tracking is reliable, the springs connecting the top vertex
with the bottom level are used to generate additional structural offsets for the
vertices in the bottom level (top to bottom processing).
During occlusions this flow of structural feedback is inverted s.t. structural
offsets are determined for the top vertex (bottom to top
processing). The decision for top to bottom or
bottom to top processing is taken by a comparison of
the confidence values of the top and bottom vertices. In cases of ambiguity
bottom to top processing is preferred (confidence
value of top vertex is smaller than confidence of bottom vertex).Algorithm for tracking articulated objects.
Experiments
The following experiments show the application of the presented
framework on concrete tracking tasks with different complexities and
difficulties.
The sub-trackers
We use the mean shift algorithm for the sub-trackers. It is
a simple, single hypothesis tracker, which on its own is not able to track
complex, articulated objects successfully.Mean shift efficiently searches for local extremal values in
a probability distribution with a search window, and generates an offset
vector pointing to the corresponding position. The value of the distribution
at a certain point depends on the similarity between features extracted
within a window centered at that point and features extracted in an
initialization phase from the region to be tracked.
The region descriptors
Sigma Sets are used in the experiments as the region
descriptors (features) describing the appearance of the corresponding
regions of interests covering the target object. Appendix A gives a brief recall of Sigma Sets.The extraction of the features in every frame is very
expensive with regard to computation time. In a frame with a resolution of
480×640 pixels the calculation of the features consumes between 60
and 70 s of the overall computing time of maximum
75 s per frame.
Initializing the hierarchical spring
systems
Features/vertices. Before a HSS can
be built, a target object needs to be defined and suitable features
describing the object have to be selected. This can be done automatically by
methods like in [9-12] or semi-manually as for the experiments
in this paper.The top level is described by one region descriptor
S1(p),
extracted out of a region of interest (ROI) covering the whole object part
(Fig. 1(a)). The bottom level
consists of several smaller region descriptors, which are from the same ROI
(see Fig. 1(b)). A Harris corner
detector is applied on the ROI to find promising positions for the smaller
region descriptors
S1(v).
Around each corner point a small ROI is built to extract a Sigma Set (e.g.
9×9 pixels).Edges. The edges can be inserted with
a Delaunay triangulation (see Fig.
4(b)) or a fully
connected graph can be built (see Fig.
4(c)). For more details on inserting the edges refer to
Section 5.3.1.
Fig. 4
Building a HSS. Target object: head of jumping jack. (a)
Selected features: region descriptors (boxes). (b) Inserted edges: triangulated
graph. (c) Inserted edges: fully connected graph.
Articulation points. They can be
initialized manually (as in the following experiments) or automatically by
observing the articulated motion of the target object [13,9].
Connectivity issues
This section deals with the impact of the connectivity
of the vertices in the HSS on the quality of the structural feedback,
i.e. on the structural offset vector.Given the features represented as vertices, there are
different possibilities for adding the edges connecting them e.g.: a
Delaunay triangulation or a fully connected graph (see Fig. 4).If a vertex v is of degree 1
– only connected to one neighbor – the structural
feedback determined by graph relaxation is ambiguous. The local energy
in the current vertex v is minimized
() by moving v to any point on the
circle centered on its neighbor with the radius equal to the
“original” length of the edge connecting them. Therefore, there is no unique
global minimum or structural offset vector for v.
For a vertex v with degree 2, the ambiguity is
reduced to two possible positions, both with . Above degree 2, there is only one position in the image,
which minimizes . Fig.
5 visualizes these
three cases.
Fig. 5
Ambiguity of structural offset vectors. (a) Vertex
degree 1, all positions on circle are minima. (b) Vertex degree 2, two minima.
(c) Vertex degree 3, one unique minimum.
In our experiments both a Delaunay triangulation and a
fully connected graph are used as representation. Table 1
lists important facts of both representations.
Table 1
Comparison of facts of a triangulated and a fully
connected graph.
Representation
Connectivity
Quality of struc. feedback
Propagation of information
Planar triangulation
Low, some vertices have only degree 2
Robust without occlusion, can be ambiguous in cases of
occlusion
Slow for graphs with many vertices
Fully connected
High, all vertices of E
Robust with and without occlusions
Fast, independent on the number of vertices
As Table 1
lists, a fully connected graph may produce superior results. When
determining the structural offset vector (see Eq. (8)) each vertex gets structural input
from every other vertex in the graph. Especially in cases of occlusion,
this leads to a faster propagation of “correct” position
information (see Fig.
6 in Section 5). The only drawback we
identified for the fully connected graph is the, in our experiments
insignificant, increase in processing time when calculating the
structural offset vector.
Fig. 6
Experiment 1: tracking an occluded face with mean shift
(top), with our approach in a triangulation (middle) and our approach with a
fully connected graph (bottom). The images show the features of the bottom level
connected by edges to illustrate the deformations and the qualitative
results.
Experimental setup
The videos employed for the following experiments are
self-produced (800×600 pixel), from the Motion of Body (MoBo)
database [15] (486×640
pixel) and from Amit et al. [16]
(352 ×288 pixel).The videos are selected considering the current status of
the presented approach. Even though the proposed framework is able to
successfully track objects through articulated motion and scaling, it can
only deal with affine or perspective changes up to a certain degree. The
reason for this lies in the current state of the HSS as it does not consider
the 3D space when generating structural offset vectors. Therefore, videos
with objects moving in the 3D space are not suitable for our experiments and
will lead to significant errors in tracking.In all experiments presented in this section, the target
object is initialized manually by selecting the parts of the object and
defining the positions of the articulation points. Except of the video in
experiment 1, the ground truth was determined by us and is a result of
manually selecting the center positions of the object parts.
Experiment 1: occlusion
This experiment focuses on occlusions and compares the
tracking results of mean shift alone and our combined approach. The video
used in this experiment is from the work of Amit et al. [16]. It shows the face of a woman being
partially occluded several times.In Fig. 6 one can
see the results of tracking with mean shift alone, with a HSS with
triangulated graphs and with a HSS using fully connected graphs. As already
mentioned in Section 5.3.1, the
fully connected graph is superior to the triangulated graph in challenging
cases of occlusion, which occur in this video sequence. The face is occluded
several times by a highly textured object (magazine) moving in different
directions and occluding different parts of the face. This leads to big
confusions and errors in the tracking with mean shift alone (see
Fig. 6 (top)).Fig.
7 shows the quantitative
results of this experiment. These results confirm the qualitative results.
The ground truth is provided by [16]. When comparing the results of Fig. 7 with the results in [16], one can see that the methods have a
similar error rate. The approach of Amit et al. [16] has problems in frames 500–600, where as
our approach performed better in this period. Both methods are challenged in
frames 700–800, but this time the method of Amit et al. is slightly
better.
Fig. 7
Experiment 1: deviation from ground truth. (full) Using
HSS with a fully connected graph, (planar) using HSS with a triangulated graph,
and (without) using only tracking with mean shift.
Experiment 2: articulated motion with
self-occlusion
This experiment uses a video of [15] of subject 04011 in view vr16_7, where the aim was
to track hand, torso, and upper and lower arms. The challenges are
self-occlusions and similar appearance in several object parts. (We do not
show images of subject 04011 as it is not allowed by [15].)Fig.
8 shows that the
presented representation significantly improves the quality of the results
of tracking with mean shift. The left lower arm is the most challenging
object part to track, but our approach is able to recover well from wrong
hypotheses.
Fig. 8
Experiment 2: deviation from ground truth: (top)
tracking with mean shift, (bottom) tracking with our approach with fully
connected graphs.
Experiment 3: articulated motion under
scaling
In experiment 3 the aim is to successfully track an
articulated object consisting of eight parts connected via six articulation
points (jumping jack). The challenges are the scaling (approximately from
100% to 130% and to 80%) and the two types of motion: articulated and
camera.In Fig.
9 one can see three
frames of the video. Fig.
10 shows the deviation
from the manually labeled ground truth of tracking with mean shift alone, of
our approach with HSSs represented by planar triangulated graphs or fully
connected graphs. As expected there is no remarkable difference in the
results for planar and fully connected graph.
Fig. 9
Experiment 3: some frames of the video showing the
scaling.
Fig. 10
Experiment 3: deviation from ground truth. The position
error in pixels is a sum over the error of all object parts.
Experiment 4: fast movements
In this experiment the robustness and recovery potential of
the HSS is tested. The employed video shows a woman waving a hand very fast,
which leads to heavy motion blur.Fig.
11 shows some frames of
the video sequence including qualitative results for tracking with mean
shift alone and our approach with fully connected graphs. Frames 155 and 170
show the superior results of our approach in comparison to mean shift on its
own. Fig. 12 evaluates the results in concrete numbers.
Fig. 11
Experiment 4: tracking an articulated object through
motion blur. (top) Tracking with mean shift and (bottom) our approach with HSS
and fully connected graphs.
Fig. 12
Experiment 4: deviation from ground truth. (without)
Tracking the object parts with mean shift, (full) our approach with fully
connected graphs.
Experiment 5: tracking a whole human
In experiment 5 representations with 10 object parts and
nine articulation points are built and track walking humans in 04002 and
04006 in view vr7_7 of [15].
Fig. 13 shows images of 04002 and 04006, where in (d) one can
see that for some parts (in this case left upper arm) it is not possible to
extract enough local features. In such cases also tracking is more difficult
and depends mainly on the top level of the HSS. As expected tracking with
our approach by combining mean shift and HSSs delivers the better result
(see Fig. 14).
Fig. 13
Experiment 5: (a) frame of subject 04002 with the top
level of the HSSs and the articulation points, (b) subject 04002 and
corresponding bottom level of HSSs, (c) frame of subject 04006 and its top level
with the articulation points, and (d) showing the bottom level of the HSSs of
04006.
Fig. 14
Experiment 5: Deviation from ground truth. (top) Video
with subject 04002 in view vr7_7, (bottom) subject 04006 in the same view. For
both videos results with mean shift (without) and with our approach (full) are
shown. The position error in pixels is a sum over the error of all object
parts.
Discussion and future work
The presented experiments showed the application of the
proposed framework in tracking objects of different complexity under
“simple” motion, articulated motion, camera motion, scaling,
occlusion, and motion blur.Even though tracking with mean shift and Sigma Sets are
employed as basic building blocks, both the tracker and the region
descriptor are exchangeable. The focus of our work lies in the hierarchical
representation.The experiments in this section showed that a fully
connected graph as representation for a HSS is equal or superior to a
triangulated graph (especially during occlusions). Therefore, we intend to
employ this representation in future. The increase in processing time is
insignificant, as most of the processing time (approximately 95%) is spent
in calculating region descriptors and building distributions.Besides its advantages during occlusion, the fully connected
graph is also a good basis to start future research on updating the elements
of the HSS. When an object moves in the 3D space (e.g. turning around) it
happens that some regions of the object become invisible and new regions
appear. Therefore, it is necessary to develop an update process for the
elements of the HSS, which allows the removal of “old”
vertices and the addition of “new” ones. This process
requires changes in the graph representing the HSS and here a fully
connected graph is easier to handle than a triangulation.Furthermore, we plan to extend our HSS to be able to handle
3D position information. One possibility to realize this, could be to stick
with mean shift tracking in 2D, but optimize the Spring System in 3D
coordinates.
Conclusion
This paper presented a flexible framework to represent and track
articulated objects consisting of several rigid parts connected with
articulation points. The parts of the object are described by a hierarchical
spring system which is represented by an attributed graph pyramid. The
attributes of the pyramid are region descriptors and the edges encode the
spatial relationships between the vertices/attributes. This spatial structure is
enforced during tracking by the spring-like behavior of the edges in the
hierarchical spring systems. The “springs” allow to determine
structural offset vectors, which are combined with the offset vectors provided
by the employed mean shift tracker. Position information can be transferred
between the parts over the corresponding articulation points depending on the
confidence of the parts and their features.
1:
processFrame
Ti
threshold maximum number of iterations
2:
i←1▹ iteration counter
3:
while (i<Ti) do
4:
for every rigid part
do
5:
calculate confidences δi(v) and δi(p)
6:
estimate positions with sub-trackers top
and bottom
7:
ifi>1then
8:
decide between top to
bottom or bottom to
top processing