Detecting elements such as planes in 3D is essential to describe objects for applications such as robotics and augmented reality. While plane estimation is well studied, table-top scenes exhibit a large number of planes and methods often lock onto a dominant plane or do not estimate 3D object structure but only homographies of individual planes. In this paper we introduce MDL to the problem of incrementally detecting multiple planar patches in a scene using tracked interest points in image sequences. Planar patches are reconstructed and stored in a keyframe-based graph structure. In case different motions occur, separate object hypotheses are modelled from currently visible patches and patches seen in previous frames. We evaluate our approach on a standard data set published by the Visual Geometry Group at the University of Oxford [24] and on our own data set containing table-top scenes. Results indicate that our approach significantly improves over the state-of-the-art algorithms.
Keywords:
Multiple structure from motion; Object modelling; Plane detection; Reconstruction
Today object modelling is dominated by approaches based on interest points and local descriptors, e.g., [1,2]. While successful for recognition, representations in terms of sparse sets of points are not suitable to accurately convey object shape, which is required by certain types of applications. For example, robots make use of the relation between surfaces to calculate grasp points on an object; horizontal patches indicate where it is possible to put down an object; and relationships between surfaces allow alignment of objects for complex tasks such as object stacking. Similarly, in augmented reality (AR) applications interactions with objects require virtual contact points with the surfaces of the workspace.
In these cases it is beneficial to model the scene from planar patches rather than individual points. In computer vision planes are used for various tasks such as camera calibration [3,4], feature matching [5,6], and scene understanding [7,8]. In robotics and AR planes are used for obstacle detection [9], localisation [10] and 3D scene reconstruction [11-15].
The difficulty is to model many different surface patches and patches of different sizes. Larger patches such as table surfaces are dominant and consume adjacent points at the expense of smaller patches. For example, this is the case in sequential dominant plane detection with RANSAC (cf. [16,17]). Moreover, single views typically do not provide full 3D shape information. So in order to model complete objects several views around the object are required. To this end structure from motion approaches provide a good framework to start. For example, Schindler et al. [18] and Ozden et al. [19] compute the motion model of multiple consistently moving interest points but they do not estimate the surface structure.
In this paper we propose an efficient approach to incrementally model the surface of multiple objects in a scene.
Concretely, our contributions are:
The introduction of the MDL approach (inspired by the model selection framework of Leonardis et al. [20]) to the problem of incrementally fitting planar surface patches.
The introduction of an incremental scheme for detecting multiple planes in a scene. Accordingly, the approach allows us to guide the hypotheses generation and to restrict the search space, while at the same time it benefits from model selection to explain the interest points with the best subset of hypotheses in terms of the MDL criterion and it avoids a bias towards the dominant structure (typical for sequential approaches).
The exploitation of the 3D reconstruction in a merge and split scheme to segment objects that move consistently. This is not possible from moving interest points in the projective image space alone.
The motivation for using planar patches as basic parts is twofold: (1) planes can directly be detected as homographies in image pairs and (2) planar patches are a more suitable representation than sets of points for applications such as robot interaction, augmented reality, and scene understanding. The proposed method starts from detecting planes as homographies in image pairs. Hence, our model is simpler than the fundamental matrix used by SfM approaches, allowing it to scale well with increasing scene complexity. Next, the planes are reconstructed in 3D and merged and split depending on common motion to arrive at individual object models composed of planar patches. We introduced the core algorithms for incremental model selection and for the merge and split scheme in [21,22]. Here we extend this work in several directions: We added new strategies for generating plane hypotheses. We propose efficient data sampling algorithms and methods for early pruning of hypotheses to speed up model selection.
Finally, we present a detailed validation of the individual algorithms and data cues for the assembly of the final object hypotheses and a comparison to the approach by Chin et al. [23].
Fig. 1 shows a typical test scenario where a camera moves around objects, planes are detected and reconstructed and finally the planes are clustered into separate object hypotheses if different motions occur. We evaluate our approach on a standard data set published by the Visual Geometry Group at the University of Oxford [24] and on our own data set containing table-top scenes. Results indicate that our approach significantly improves over the state-of-the-art algorithms.
Fig. 1
Planes detected from tracked interest points in an image sequence are reconstructed and clustered to separate objects in case different motions occur.
Note that while the work presented in this paper is concerned with planar patches, we will be using the terms planes and planar patches interchangeably for the sake of brevity, when there is no risk of confusion.
The paper proceeds with a discussion of related work in Section 2. After that our framework for interactive object modelling based on piecewise planar surface patches and the implementation details are described in Section 3, followed by an evaluation of the proposed algorithms in Section 4. Finally, we conclude the paper with Section 5.
Related work
The motivation for this work is to reconstruct objects with planar patches for applications such as robotics, augmented reality and scene understanding. Therefore, we need an active vision system that is able to learn several independently moving foreground objects in unknown environments. Hence, in the following we review the state of the art in the detection of planar surfaces and in the reconstruction of multiple objects.
Detection of planar surfaces
The basic parts of our object model are planar patches. Detecting planes in uncalibrated image sequences is well studied. Most approaches use a hypothesise-and-test framework. A popular method for detecting multiple models is to use the robust estimation method RANSAC [25], to sequentially fit the model to a data set and then to remove inliers. To generate plane hypotheses Vincent et al. [16] use groups of four points which are likely to be coplanar to compute the homography. To increase the likelihood that the points belong to the same plane they select points lying on two different lines in an image. In contrast Kanazawa et al. [17] define a probability for feature points to belong to the same plane using the Euclidean distance between the points. Both approaches use a RANSAC scheme, iteratively detect the dominant plane, remove the inliers and proceed with the remaining interest points. A valid plane hypothesis requires selecting a sample of four coplanar points. In [26,27] different strategies are proposed to sequentially reduce the set of points/lines to three pairs. More recent approaches, such as proposed by Toldo et al. [28] and Chin et al. [29], concentrate on robust estimation of multiple structures to treat hypotheses equally and do not favour planes detected first over subsequent planes by greedily consuming features. These approaches have to create plane hypotheses independently of each other and thus it is not possible to restrict the search space, which leads to higher computational complexity. In Chin et al. [23] the issue of efficient hypotheses generation is addressed by guiding sampling with information derived from residual sorting. They show that residual sorting innately encodes the probability of two points to have arisen from the same model. 
Instead we propose incremental model selection based on the MDL principle in order to accomplish efficient hypotheses generation, to explain the interest points with the best subset of hypotheses and to avoid a bias to the dominant structure (typical for sequential approaches).
Reconstruction of multiple objects
The planes, represented by homographies, are the basic entities for 3D reconstruction and for merging/splitting to create the final object model. Classical Structure-from-Motion in a static scene is essentially solved in a coherent theory [30] and several robust systems exist. In addition to the SfM approaches based on point features, Faugeras et al. [31] show how to benefit from planar structures, how the unknown camera motion and plane equation can be recovered from an estimate of the matrix of this collineation and how the motion ambiguity of the camera, i.e., the multiple solutions for camera pose from a single observed plane, can be removed by looking at a second plane or by taking a third view. A SfM framework entirely based on lines is proposed by Bartoli et al. [32]. They consider the triangulation problem based on a maximum likelihood algorithm and Plücker coordinates and they propose the orthonormal representation of 3D lines, which allows for a convenient formulation of the nonlinear optimization of camera poses and lines from multiple views. In [7] Bartoli introduces a random sampling strategy to segment piecewise planar surfaces in order to create 3D models of man-made environments. While the above concentrate on reconstruction and SfM in general, Klein et al. [33] developed a parallel tracking and mapping approach (PTAM) for real-time augmented reality applications. To improve the robustness of the keyframe-based SLAM approach they introduce edge features to the map in addition to points and exploit the resilience of the edge features to motion blur.
In recent years, researchers focused on dynamic scenes composed of rigidly moving objects.
The solutions available so far can be broadly classified into algebraic methods [34,35], which exploit algebraic constraints satisfied by all scene objects, even though they move relative to each other, and non-algebraic methods [36,37], which essentially combine rigid SfM with segmentation.
Most related to our system are the methods proposed by Schindler [18] and by Ozden [19]. They use interleaved segmentation and 3D reconstruction of tracked features into independent objects. Instead of directly sampling features and generating 3D object hypotheses, we incrementally cluster features to planes in images using homographies, i.e., a simpler model providing robustness in complex scenes, and then reconstruct and merge/split planes into independently moving objects in 3D. Finally, instead of a sparse point cloud we get a dense representation in terms of planar patches, and thus a more accurate description of object shape.
Approach
The proposed vision system consists of two main components (see Fig. 2). Firstly, consistent planar surface patches are detected as homographies in image sequences. Detection of planes is based on interest points (IPs) which are tracked in image pairs. We developed an incremental model selection scheme, where planes once detected are tracked and serve as priors in subsequent images. The incremental approach adds new planes if new viewpoints are visited. Secondly, planes are reconstructed in 3D and clusters with common motion build initial object hypotheses. Whenever independent motion occurs in the scene and planes start moving separately, a split event is triggered and the accumulated information, which is stored in a keyframe based graph structure, is evaluated. The graph is traced back and depending on colour and structure information, the planes are assigned to the most likely object hypothesis.
Fig. 2
System overview.
The following sections describe the detection of planes and the incremental reconstruction of objects in detail.
Consistent planes in image sequences
The first step is the detection of multiple planes in image sequences. Typically planes are detected in image pairs with a hypothesise-and-test framework by tracking interest points, sequentially detecting dominant planes and removing the inliers (cf. [16,17,26]). Given a fixed threshold to detect inliers, sequential methods favour planes detected first over subsequent planes by greedily consuming features. If all hypotheses are created simultaneously first and have to compete for data points, this drawback is overcome. This however means that the sequential pruning of the search space is lost and in complex environments the number of random hypotheses required to guarantee that all planes are detected (with a given probability) grows prohibitively. Therefore, we propose to embed Minimum Description Length (MDL) based model selection in an iterative scheme. Existing plane hypotheses compete with newly created hypotheses to ensure that interest points are assigned to the best currently available hypothesis. Additionally hypothesis generation can be guided to unexplained regions (see Section 3.1.2). This method avoids the bias towards dominant planes that is typical for sequential methods while at the same time limiting the search space, which leads to faster convergence. In the following we review the core of the algorithm which we first proposed in [21], extend the work with new considerations about hypotheses generation and develop new strategies for efficient data sampling and early pruning of hypotheses to speed up model selection.
Algorithm 1
Plane detection using model selection
Algorithm 1 shows our proposed method for plane detection. In each iteration, a small number Z of new plane hypotheses P′ are computed which have to compete with the selected hypotheses P of the last iteration. The termination criterion is based on the true inlier ratio ∊ and the minimum sample set size M, which in our case of homography estimation is 4.
The true inlier rate is of course not known in advance and so we estimate it as the ratio of the number of inliers Imax of the current set of plane hypotheses and the number of data points N of the current frame. Furthermore, k is the number of iterations, η stands for the probability that no correct set of hypotheses is found and the parameter η0 is the desired failure rate. Due to the incremental scheme, it is possible to guide the computation of new hypotheses to unexplained regions.
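The termination test can be illustrated with a short sketch. It assumes the standard RANSAC-style bound η = (1 − ∊^M)^k, which fits the quantities named above but is an assumption here, not a transcription of Algorithm 1. The function names and the default η0 = 0.01 are illustrative only.

```python
def failure_probability(inlier_ratio, sample_size, iterations):
    """Assumed bound: probability that k samples of size M all contain an outlier."""
    return (1.0 - inlier_ratio ** sample_size) ** iterations


def should_terminate(num_inliers, num_points, iterations, eta0=0.01, sample_size=4):
    """Stop once the estimated failure probability drops below eta0.

    The inlier ratio is estimated as I_max / N, as described in the text.
    """
    if num_points == 0:
        return False
    inlier_ratio = num_inliers / num_points
    return failure_probability(inlier_ratio, sample_size, iterations) < eta0
```

Because the inlier ratio is re-estimated from the currently selected planes, the number of iterations needed shrinks as more of the scene is explained.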
Hypotheses representation and model selection
In each iteration, selected homographies of the last iteration have to compete with newly sampled hypotheses. For the selection, the idea is that the same feature cannot belong to more than one planar patch. Thus an over-complete set of homographies is generated and the best subset in terms of a Minimum Description Length criterion is chosen. The core of the problem is to find a general mechanism to optimally describe the data with respect to an objective function and to reduce the number of redundant models. The basic mathematical tool we use was introduced by Leonardis et al. [20] for the purpose of range image segmentation and adapted in [1]. In the following, we briefly describe the basic ideas, which are then reformulated for plane detection.
According to Leonardis et al. [20], model selection can be formalised as
$$S(\mathbf{n}) = K_1 S_{data} - K_2 S_{model} - K_3 S_{error} \qquad (1)$$
with the goal to maximise the savings (or merits) $S(\mathbf{n})$ of the set of hypotheses, where the indicator vector $\mathbf{n} = [n_1, n_2, \ldots, n_N]$ stands for a set of models, with $n_i = 1$ if a model is selected and $n_i = 0$ otherwise. $S_{data}$ describes the merit in explaining the data in terms of a set of models $\mathbf{n}$. $S_{model}$ models the costs of coding the models, i.e. essentially the number of parameters needed to describe each model. The error costs $S_{error}$ describe the remaining error in fitting the set of models to the data. The constants $K_1$, $K_2$ and $K_3$ are weights which can be determined on a purely information-theoretical basis (in terms of bits), or they can be adjusted in order to express the preference for a particular type of description, which is the approach we take.
Intuitively, this formulation shows that an encoding is efficient if the number of data points described by a model is large, the contributed error is low, and the number of parameters is small. In practice, the weights $K_1$, $K_2$ and $K_3$ of Eq. (1) are related to the average cost of the data points, the model and the error, and we only need to consider the relative savings between different combinations of hypotheses.
Hence, to select the best model, the savings for each individual hypothesis $H$ can be expressed as
$$S_H = S_{data} - \kappa_1 S_{model} - \kappa_2 S_{error} \qquad (2)$$
where $\kappa_1 = K_2/K_1$ and $\kappa_2 = K_3/K_1$. In our case $S_{data}$ is the number of points $N$ explained by $H$. Since we use one model (the homography of a plane), $S_{model}$ can be set to 1. $S_{error}$ describes the cost for the error added, which we express with the log-likelihood over all points $f_k$ of the plane hypothesis $H$. Experiments indicate that, with ∊ the Euclidean distance of inliers to the estimated homography, a Gaussian error model in conjunction with an approximation of the log-likelihood, which has also been proposed by other authors, works best for us. Thus the cost of the error is given as
$$S_{error} = -\sum_{k=1}^{N} \log p(f_k \mid H) \approx \sum_{k=1}^{N} \bigl(1 - p(f_k \mid H)\bigr) \qquad (7)$$
where $\log(p(f_k \mid H))$ is the log-likelihood that a point belongs to the plane, and $N$ the number of points explained by the hypothesis. Substitution of Eq. (7) into Eq. (2) yields the merit of a model
$$S_H = -\kappa_1 + \sum_{k=1}^{N} \bigl((1 - \kappa_2) + \kappa_2\, p(f_k \mid H)\bigr) \qquad (8)$$
A point can only be assigned to one model. Hence, overlapping models compete for points, which can be represented by interaction costs
$$s_{ij} = -\frac{1}{2} \sum_{f_k \in H_i \cap H_j} \bigl((1 - \kappa_2) + \kappa_2\, p_{min}\bigr)$$
where $p_{min} = \min\{p(f_k \mid H_i), p(f_k \mid H_j)\}$ refers to the plane where the point contributes the smaller error.
Finding the optimal possible set of homographies for the current iteration leads to a Quadratic Boolean Problem (QBP)
$$\max_{\mathbf{n}} \; \mathbf{n}^{T} Q\, \mathbf{n}, \qquad Q = [s_{ij}]$$
where $s_{ii} = S_{H_i}$ is the merit term of a single hypothesis. The time to solve the QBP grows exponentially with the number of hypotheses. Several methods have been proposed to solve the problem with an approximate solution. Our results indicate that for our specific problem a greedy approximation gives good results (cf. Section 4.1.3). But still, what is most important is to keep the number of hypotheses tractable. We addressed this by embedding model selection in an iterative algorithm and hence, the solution can be found very fast.
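A greedy approximation of the QBP can be sketched as follows. The text does not spell out the greedy strategy, so this is one plausible variant. It assumes a symmetric matrix whose diagonal holds the merit of each hypothesis and whose off-diagonal entries hold the (non-positive) interaction costs, and it keeps adding the hypothesis with the largest positive gain. The names `total_savings` and `greedy_select` are hypothetical.

```python
def total_savings(q, selected):
    """Evaluate n^T Q n for the hypotheses whose indices are in `selected`."""
    return sum(q[i][j] for i in selected for j in selected)


def greedy_select(q):
    """Greedily pick hypotheses while the objective n^T Q n still increases."""
    selected = []
    remaining = set(range(len(q)))
    while remaining:
        # Gain of adding i: its own merit plus twice its interaction
        # with every hypothesis that is already selected (Q is symmetric).
        gains = {
            i: q[i][i] + 2.0 * sum(q[i][j] for j in selected)
            for i in sorted(remaining)
        }
        best = max(gains, key=gains.get)
        if gains[best] <= 0.0:
            break  # no remaining hypothesis improves the savings
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


# Two strongly overlapping hypotheses (0 and 1) and one with negative merit (2).
example_q = [
    [10.0, -4.0, 0.0],
    [-4.0, 6.0, 0.0],
    [0.0, 0.0, -1.0],
]
chosen = greedy_select(example_q)
```

In the example, hypothesis 1 is rejected because its overlap with hypothesis 0 costs more than its own merit, which is the competition for points described above.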
Hypotheses generation and efficiency
One of the key issues of approaches that use random samples is to select good features. Our method addresses this issue in different ways. Following Myatt et al. [38], sampling is biased to features that are most likely located on the same plane. The second strategy is to sequentially guide sampling towards unexplained regions. Furthermore, we use a pre-filter which selects good hypotheses and adds them to the iterative model selection. In the following paragraphs, we describe the different sampling strategies and the proposed filter to select good plane hypotheses.
Uniformly distributed sampling. One possibility to compute plane hypotheses is to sample features uniformly. This method is often used for robust object detection or pose estimation, where the percentage of outliers is known to be lower than 50%. The number of attempts to select outlier-free samples is
$$k = \frac{\log(1 - p)}{\log(1 - ∊^{m})}$$
where ∊ is the fraction of points that belong to the plane. In our typical test scenarios we marked about 10 ground truth planes. To compute the plane homography, we need m = 4 point pairs. If we assume that all 10 planes have equal size, that we want a confidence of p = 99% and that the data consists of only 20% noise, each plane covers 0.8/10 = 8% of the points and 112,429 trials are necessary to compute one plane. Hence, uniformly distributed sampling does not lead to satisfying results within a given timeframe.
Sampling biased towards adjacent points. Instead of uniformly distributed sampling, Myatt et al. [38] propose to bias random selection depending on the Euclidean distance of points. If a selected point A is an inlier, then there will be an increased probability of a point adjacent to A also being an inlier. Following this approach we first select a point A randomly. Then all other points are ordered by increasing Euclidean distance from A and three additional nearby points are randomly selected, with a sampling probability depending on their position in the sorted list using a Gaussian distribution.
Sampling biased to features with a similar motion vector.
Another heuristic, which significantly improves the performance, is to sort the points depending on the motion vector. The motion vector describes the shift of a specific point between two images. The method proposed in the last paragraph sorts points depending on the Euclidean distance from an initial selected point. Here we propose to select a point A and sort all other points depending on the similarity of the motion vector to the first point. The selection is also biased to similar motion with a Gaussian distribution.
Sampling biased to unexplained regions. The above methods focus on increasing the probability of selecting three points which lie on the same plane as a point A, selected first. The overall approach is concerned with describing the whole scene with planes. Thus, if a plane is found it seems plausible to bias sampling towards unexplained regions. Our iterative model selection scheme perfectly supports this. In each iteration interest point pairs are ranked in decreasing order depending on the smallest residual to any of the existing plane models, i.e. to homographies selected in the previous iteration. Note that we limit the residual to d. So all current outliers are treated equally, as there is no basis on which one should be preferred over the other. Inliers are ranked higher if they do not fit their respective model well. The sampling probability now depends on the position in the sorted list using a Gaussian distribution. In contrast to sequential RANSAC, where inliers are removed from the current set of points, we only decrease the probability of re-selecting an inlier. As described in the beginning of Section 3.1, newly generated plane hypotheses then have to compete with previously selected ones. The different approaches to sampling will be evaluated in Section 4.1.2.
Hypothesis validity check.
While the above sampling strategies significantly increase the chance to draw samples from “good” sets of points, that is still no guarantee that a sample constitutes a valid hypothesis. So it is important to prune hypotheses as early as possible. To this end we propose a connected components analysis of points supporting a hypothesis. First neighbouring points are connected using a 2D Delaunay triangulation (see Fig. 3), forming a neighbourhood graph. This graph is constructed once per image and used by all hypotheses. For a given hypothesis we then traverse the graph starting from one of the sampled points and collect all points supporting the hypothesis, subject to a given threshold, stopping when no more neighbouring points can be added. Using the neighbourhood graph means we only have to check a fraction of the points rather than all the points in the image. We end up with a cluster of connected points supporting the hypothesis (see Fig. 3b). We can reject a hypothesis if one of the original sample points does not lie in the cluster, because this means it lies on a different physical surface separated from the surface on which the clustered points are lying. This hypothesis validity check might cut off points which belong to a valid surface, but in general each point is connected with multiple points and thus outliers are bypassed through other edges. Experiments have shown that for our scenarios, where the outlier ratio of the interest point matches is below 50%, single outliers do not block this strategy and this method out-performs the other methods in terms of the number of hypotheses necessary to explain the scene (see Fig. 7). Note that this hypothesis validity check constitutes a “sparse” variant of CC-RANSAC [39]. In our test images, only about 3% of the initial plane hypotheses pass the validity check.
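The trial count quoted for uniformly distributed sampling can be reproduced with a few lines. The sketch assumes the usual RANSAC bound k = log(1 − p) / log(1 − w^m), with w the fraction of points lying on the plane of interest; the function name is illustrative.

```python
import math


def required_trials(confidence, inlier_fraction, sample_size):
    """Number of random samples needed to draw one outlier-free sample."""
    p_good_sample = inlier_fraction ** sample_size
    # log1p(-x) evaluates log(1 - x) accurately for very small x.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_good_sample))


# 10 equally sized planes, 20% noise: each plane holds 0.8 / 10 = 8% of the points.
trials = required_trials(confidence=0.99, inlier_fraction=0.8 / 10, sample_size=4)
print(trials)  # 112429
```

The same function shows why biased sampling matters: the count depends on the fourth power of the inlier fraction, so even a modest increase in the chance of picking coplanar points reduces the number of trials by orders of magnitude.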
Fig. 3
Connected components filter for validity check of plane hypotheses. A 2D Delaunay triangulation is used to connect points (white edges). Three points are sampled (red) and an affine homography is computed. The graph is traced and points supporting the affine mapping are clustered (green). A hypothesis is accepted if all initially sampled points are connected within the cluster (right image). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
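The connected components filter can be sketched as a breadth-first traversal of the neighbourhood graph. The sketch assumes the graph is given as an adjacency dictionary (for instance built from the Delaunay edges) and that `supports` is a hypothetical predicate reporting whether a point fits the hypothesis within the threshold. Neither name comes from the paper.

```python
from collections import deque


def grow_cluster(graph, seed, supports):
    """Collect all supporting points reachable from `seed` via supporting neighbours."""
    cluster = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in cluster and supports(neighbour):
                cluster.add(neighbour)
                queue.append(neighbour)
    return cluster


def passes_validity_check(graph, sample, supports):
    """Accept a hypothesis only if every sampled point ends up in the cluster."""
    cluster = grow_cluster(graph, sample[0], supports)
    return all(point in cluster for point in sample), cluster


# Chain 0-1-2-3-4 where point 2 does not support the hypothesis.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
on_plane = {0, 1, 3, 4}
valid, cluster = passes_validity_check(chain, (0, 1, 4), lambda p: p in on_plane)
```

Only points adjacent to the growing cluster are tested, which is why the check touches a fraction of the image points. In a real triangulation each point has several neighbours, so a single non-supporting point rarely disconnects a surface as it does in this deliberately minimal chain.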
Fig. 7
Sampling strategies and pre-filtering of hypotheses.
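The rank-based sampling shared by these strategies can be sketched as follows, here for the bias towards unexplained regions. Points are ranked by their capped residual and a rank is drawn from a half-Gaussian, so that poorly explained points are selected more often. The standard deviation `sigma` (in rank positions) and both function names are assumptions for illustration.

```python
import random


def rank_by_unexplained(residuals, d):
    """Return point indices sorted by decreasing residual, with residuals capped at d."""
    capped = [min(r, d) for r in residuals]
    # sorted() is stable, so points with equal capped residual keep their order.
    return sorted(range(len(capped)), key=lambda i: -capped[i])


def sample_biased(ranking, sigma, rng):
    """Draw one index, favouring entries near the front of `ranking`."""
    if not ranking:
        raise ValueError("ranking must not be empty")
    while True:
        position = int(abs(rng.gauss(0.0, sigma)))
        if position < len(ranking):
            return ranking[position]


# Smallest residual of each point to any currently selected plane.
ranking = rank_by_unexplained([0.1, 5.0, 0.9, 3.0], d=1.0)
rng = random.Random(0)
draws = [sample_biased(ranking, sigma=1.0, rng=rng) for _ in range(200)]
```

The same sampler covers the adjacency and motion-vector strategies if the ranking is instead built from the distance, or the motion-vector difference, to the first selected point A.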
Two-step fitting. As a further step to speed up detection of good hypotheses, we adopt a two-step approach, where we first search for a simpler model, namely an affine homography with six parameters, and only if this passes the above validity check move to the full model, the fully projective homography with 8 parameters. Given that the validity check is concerned with points within a local neighbourhood, an affine homography is a sufficient model. If an affine homography hypothesis passes the validity check, the full homography is computed using the Direct Linear Transform (DLT) algorithm proposed by Hartley [30], using all the points in the cluster. The cluster is then further expanded with points supporting the full homography and this plane hypothesis is finally considered for the subsequent model selection.
Algorithm 2 summarises the proposed plane hypothesis generation. In Fig. 3a a typical bad hypothesis is shown, which could e.g. result from a bias towards similar motion. Three initial interest points are shown in red. The clustering of interest points (green dots) immediately stops because one of the three points lies on a totally different surface, so the hypothesis does not correspond to a physical surface. In contrast, in Fig. 3b the cluster of interest points (coloured green) also contains the initial sampled points (red).
Algorithm 2
Connected component filter for early pruning of plane hypotheses
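The first step of the two-step fitting can be illustrated with a minimal sketch that estimates the affine mapping x' = a x + b y + c, y' = d x + e y + f from three point correspondences. Solving with Cramer's rule is a choice made here for self-containedness, not necessarily what the implementation does, and the function names are hypothetical.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(a, b):
    """Solve the 3x3 linear system a * x = b with Cramer's rule."""
    det = _det3(a)
    if abs(det) < 1e-12:
        raise ValueError("sample points are (nearly) collinear")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in a]
        for r in range(3):
            replaced[r][col] = b[r]
        solution.append(_det3(replaced) / det)
    return solution


def affine_from_three(src, dst):
    """Estimate ((a, b, c), (d, e, f)) from three point pairs src[i] -> dst[i]."""
    a = [[x, y, 1.0] for x, y in src]
    row_x = _solve3(a, [u for u, _ in dst])
    row_y = _solve3(a, [v for _, v in dst])
    return tuple(row_x), tuple(row_y)


def apply_affine(model, point):
    """Map a point from the first image into the second image."""
    (a, b, c), (d, e, f) = model
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)


model = affine_from_three([(0, 0), (1, 0), (0, 1)], [(1, 2), (3, 2), (1, 5)])
mapped = apply_affine(model, (1, 1))
```

The distance between `apply_affine(model, p)` and the tracked position of `p` in the second image is the kind of residual a support test for the validity check could threshold. The full projective homography would only be fitted, e.g. with DLT, for hypotheses that survive this cheaper step.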
Tracking of planar patches
One of the key benefits of our algorithm is that prior knowledge can be introduced easily. We exploit this in image sequences where detected planes are propagated to subsequent frames. For this, the interest points of planes detected in the previous image pair are matched with interest points of the current frame, followed by a robust homography estimation using least median of squares (LMedS). Thus Algorithm 1 is extended with tracked planes, which are used to initialise P. Given the tracked planes, the initialisation value of the inlier ratio ∊ is given as the number of interest points supporting the tracked planes divided by the total number of matched interest points. Hence, plane detection already starts with an initial guess of planes which have to survive the following model selection stage.
From planes to objects
In the previous sections we developed a method to detect planar surface patches represented by homographies. Our goal is to represent 3D object shapes, so we still need to reconstruct the Euclidean 3D structure and segment the scene into individual objects. Reconstruction of sparse point clouds and tracking of the camera pose is typically done with dynamic SfM frameworks [18,19]. Schindler et al. [18] use interleaved segmentation and 3D reconstruction of tracked features into independent objects. Instead of directly sampling features and generating 3D object hypotheses, we rely on the interest points clustered to homographies and then reconstruct and merge/split groups of planes into independently moving objects. Thus in the first step we use a simpler model to more robustly cluster tracked features into planes, followed by a second step to reconstruct and merge/split groups of planes and create the final object model. Instead of a sparse point cloud we thus get a dense representation in terms of planar patches. Furthermore we store tracked patches in a graph structure and in case a split event is triggered we assign visible and currently occluded planar patches to the most likely split object hypothesis (see Fig. 4). In the following we review the splitting and merging approach, which we first proposed in [22]. Here, we further concentrate on a detailed evaluation and extend the preliminary results presented in [22] with a comparison of the individual data cues used to assign occluded planes to the most likely object hypotheses.
Fig. 4
The upper row indicates selected keyframes with detected planes (first number in the white boxes) grouped to an initial object (second number) because of similar motion in Euclidean space. After a push event (keyframe #X4) the keyframe-graph is traced back and planes are reassigned to the most plausible object (red/blue) depending on colour and interest point adjacency. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Algorithm 3
Piecewise planar object modelling pipeline
Algorithm 3 gives a detailed outline of the piecewise planar object modelling pipeline. The following sections describe the methods to reconstruct, merge and split planes in order to create individual object models.
Structure from Motion (SfM)
To reconstruct planes and track the camera pose in 3D Euclidean space we use a standard SfM pipeline similar to Nister et al. [41] and Klein et al. [33]. If new clusters of planes and their homographies detected with the incremental model selection algorithm (Section 3.1) are available, which are not supported by existing 3D motion models, the first step is to reconstruct the planes in Euclidean space and estimate their camera location. In order to acquire an accurate initial camera pose we select the largest cluster of interest points and decompose the nonlinear optimised homography (cf. [42, pp. 136]). In the following frames, the relative motion from the camera pose C_{t−1} to C_t is computed using the robust estimation scheme RANSAC [25] and a direct least squares solution between the two point sets (cf. [43]). To handle the drift of the camera poses accumulated while tracking and to refine the 3D point locations, a sparse bundle adjustment over the last t frames is performed. If new planes are detected our algorithm tries to assign them to existing motion models. In case there is no supporting model available a new camera pose is reconstructed.
Merging and splitting of groups of planar patches
Motivated by Palmer [45] (who stated that, although the vast majority of objects in ordinary environments are stationary for the vast majority of the time, objects that move are important), we first group planar patches depending on common motion and, if patches start moving separately, we split the group into individual object hypotheses. Planar patches that are no longer visible (e.g. because of occlusion or because they left the field of view) are assigned to object models based on patch appearance computed from colour and on patch adjacency computed from a 3D interest point adjacency graph. Hence, the framework is able to create individual object models from planar patches visible in the current image and from currently occluded patches seen in previous frames.
Grouping planar patches with consistent motion. Grouping planar patches is based on a check whether the motion of a new planar patch is consistent with the motion of an existing object. We developed a strategy to greedily assign homographies to a motion model. Analogous to Eq. (8) in Section 3.1.1, a formulation is developed which results in a confidence p that a planar patch i with N points moves consistently with the existing motion model of an object j. It is built from the probability that an interest point k of patch i belongs to the 3D object j, which is modelled using a Gaussian error model. The camera pose of object j is used to compute 3D points for patch i. Then the projections are compared to the corresponding tracked image points. ν1 and ν2 are constants to weight the different factors, where ν1 is an offset which must be reached to be considered as moving together. Homographies are assigned to the motion model according to the highest probability p.
Separating groups of planar patches in case of different motions. While tracking the camera motion relative to an object hypothesis (cluster of planar patches), each individual patch is continuously tested to check whether it still supports the motion.
For this, the formalism outlined in the previous paragraph is used; if p is low, i.e. planar patches start moving separately, a split event is triggered. The planar patches that are currently visible form the initial split object hypotheses, from which we compute the appearance model A. Then the frame of the original object (before splitting), which is stored in a keyframe-based graph structure, is traced back and the occluded patches are assigned to the most plausible split object. For this, a pseudo-likelihood is computed which combines the 3D adjacency of interest points and the colour in a probabilistic manner. The interest point adjacency term is based on a probabilistic voting scheme. For this purpose, a neighbourhood graph of all currently available 3D points is constructed, and the mean μ and the standard deviation σ of the length of the edges which connect points of the same patch are computed. Then μ and σ are used to compute Gaussian votes, where each 3D point k of a target patch i votes for the nearest object; thus the object which is close to the patch accumulates more votes and gets a higher probability that the patch belongs to it. The second term models the colour distribution of the objects. For this we build the 8 × 8 × 8 bin colour histogram c of the target patch i and the histogram C of the object j to which the patch should be assigned. We use normalised rgb colours to be insensitive to brightness differences between object planes. The border of the patch is approximated by the convex hull of the interest points. For comparison of colour models, we use the Bhattacharyya coefficient. Hence, the probability that a planar patch i is assigned to an object j consists of a probabilistic vote of each interest point for the nearest object and a probability describing the colour similarity.
While we are aware of the fact that assigning occluded planes based on colour and 3D interest point adjacency is a heuristic which could fail in certain cases, our experiments indicate that for our scenarios, where only a relatively small number of objects are modelled simultaneously, this is an adequate criterion.
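The two cues described above can be illustrated with a minimal sketch. It assumes NumPy, builds an 8 × 8 × 8 histogram over normalised rgb values, compares histograms with the standard Bhattacharyya coefficient, and lets each 3D point of a patch cast a Gaussian-weighted vote for the nearest object. The function names and the exact form of the vote weight (full weight up to μ, Gaussian fall-off with σ beyond it) are assumptions made for illustration; how the paper combines the two terms in its pseudo-likelihood is not reproduced here.

```python
import numpy as np

def normalised_rgb_histogram(pixels, bins=8):
    """Flattened bins^3 histogram of normalised rgb values, summing to 1."""
    pixels = np.asarray(pixels, dtype=float)
    total = pixels.sum(axis=1, keepdims=True)
    total[total == 0] = 1.0                      # avoid division by zero for black pixels
    chroma = pixels / total                      # each channel now lies in [0, 1]
    idx = np.minimum((chroma * bins).astype(int), bins - 1)
    flat = (idx[:, 0] * bins + idx[:, 1]) * bins + idx[:, 2]
    hist = np.bincount(flat, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def bhattacharyya_coefficient(c, C):
    """Standard Bhattacharyya coefficient of two normalised histograms (1 = identical)."""
    return float(np.sum(np.sqrt(np.asarray(c) * np.asarray(C))))

def gaussian_votes(patch_points, object_points, mu, sigma):
    """Each patch point votes for its nearest object; returns normalised votes.

    Assumed weight: 1 for distances up to mu, Gaussian fall-off beyond that.
    """
    votes = np.zeros(len(object_points))
    for p in np.asarray(patch_points, dtype=float):
        dists = [np.min(np.linalg.norm(np.asarray(pts, dtype=float) - p, axis=1))
                 for pts in object_points]
        j = int(np.argmin(dists))
        excess = max(dists[j] - mu, 0.0)
        votes[j] += np.exp(-0.5 * (excess / sigma) ** 2)
    return votes / max(votes.sum(), 1e-12)
```

Because the histogram is built from normalised rgb values, scaling all pixel intensities by a common factor leaves it unchanged, which is the brightness insensitivity the text refers to.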
Experimental results
The method is evaluated on a standard dataset for plane fitting (houses data set published by the Visual Geometry Group at the University of Oxford [24]) and an evaluation where a robot moves in a scene and attempts to model objects. The robot is supposed to move around the structure in order to detect and reconstruct planar patches. Then it pushes a surface patch and models the objects from what is seen in the current image and from patches detected in previous images. Fig. 5a shows the test setup. In the following sections specific aspects of our algorithms are tested and finally results from the overall system are shown.
Fig. 5
Overall test scenario (a) and test image for plane detection with ground truth overlay (b).
Plane detection
To test plane detection, we use two completely different sets of images. The first set shows table-top scenes with objects typically found in a supermarket (see Fig. 5b). We placed each object in front of a weakly textured background as well as in a highly cluttered scene. For comparison, we additionally test the system with the “Houses” data sets published by the Visual Geometry Group at the University of Oxford [24] (see Fig. 12). To obtain ground truth, we manually marked all planes in the foreground and the dominant ones in the background, resulting in 231 planes in 56 images. To test the tracking capability of our method, the packaging data set consists of 8 sequences of 4 consecutive images each, whereas the 6 sets we use from Oxford also contain 4 images each, but these are not ordered as a sequence.
Fig. 12
Examples of the Oxford Visual Geometry data set (left) and indoor scenes (right) using ItMoS.
For these experiments we use SIFT, the well-known interest point detector/descriptor proposed by Lowe [46]. SIFT features are matched in image pairs using the Euclidean distance of the descriptors, and matches are accepted if the NNDR (nearest/next distance ratio) d is below 0.8. To compute the homography, we follow Hartley [30], i.e., points are normalised to zero mean and scaled to have an average distance of $\sqrt{2}$ from the origin. Then the homography is estimated using the Direct Linear Transform (DLT) algorithm.

Three measures are computed to compare the methods. The first is the feature-based precision $p_{pr}$, the ratio of the number of inliers correctly located on a ground truth plane to the total number of features per detected plane (correctly and incorrectly located inliers). The second is the over-segmentation rate $p_{ov}$ per plane, which indicates whether a plane is erroneously split into several parts; it is based on the number of false positives, i.e. the number of detected planes minus the number of correctly detected planes. Furthermore, we compute the plane-based true-positive rate $p_{tp}$ (tp-rate, or recall), the ratio of the number of correctly detected planes to the total number of ground truth planes (correctly detected plus missed planes).
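The homography estimation step described above can be sketched as follows. This is a minimal NumPy illustration of the normalised DLT (zero-mean points, mean distance $\sqrt{2}$ from the origin, null vector via SVD), not the implementation used in the experiments; the function names are chosen for this example only.

```python
import numpy as np

def normalise_points(pts):
    """Translate points to zero mean and scale to mean distance sqrt(2).

    Returns the normalised homogeneous points and the 3x3 similarity T.
    """
    pts = np.asarray(pts, dtype=float)
    centroid = pts.mean(axis=0)
    mean_dist = np.mean(np.linalg.norm(pts - centroid, axis=1))
    s = np.sqrt(2.0) / mean_dist
    T = np.array([[s, 0.0, -s * centroid[0]],
                  [0.0, s, -s * centroid[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def homography_dlt(src, dst):
    """Estimate H with dst ~ H * src from at least four point pairs."""
    src_n, T_src = normalise_points(src)
    dst_n, T_dst = normalise_points(dst)
    rows = []
    for (x, y, w), (u, v, t) in zip(src_n, dst_n):
        # Two linear constraints per correspondence, from dst x (H src) = 0.
        rows.append([0, 0, 0, -t * x, -t * y, -t * w, v * x, v * y, v * w])
        rows.append([t * x, t * y, t * w, 0, 0, 0, -u * x, -u * y, -u * w])
    _, _, Vt = np.linalg.svd(np.array(rows))
    H_n = Vt[-1].reshape(3, 3)                   # null vector of the design matrix
    H = np.linalg.inv(T_dst) @ H_n @ T_src       # undo the normalisation
    return H / H[2, 2]
```

The normalisation mainly improves the conditioning of the linear system; the estimate can still be refined non-linearly afterwards.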
Parameter optimisation
To analyse the influence of the parameters of the proposed plane detection method, we test it with the first half of the packaging data set. We vary the number of random hypotheses Z = [1 … 35], κ1 = [1 … 15] and κ2 = [0 … 1] and plot the performance measures. Fig. 6a–c show that our algorithm, in particular with respect to precision, is robust against variation of the parameters. While the over-segmentation rate in Fig. 6a is almost constant, the precision slightly increases at the beginning and the tp-rate reaches a peak at Z = 3. The parameter κ1 mostly influences the over-segmentation rate and the tp-rate. We set κ1 = 6, at the maximum of $p_{tp}$, where $p_{ov}$ is already low. In Fig. 6c it can be seen that the parameter κ2 is stable over a wide range. We set κ2 = 0.4, where the tp-rate has a maximum.
Fig. 6
Parameter optimisation.
Comparison of sampling strategies
Fig. 7 shows the improvement when employing the different sampling strategies described in Section 3.1.2. In these experiments we manually set the number of samples rather than letting Algorithm 1 decide based on η, and run until the curves flatten out, up to a maximum of 5000 iterations. While in the test scenario shown in Fig. 5b uniform sampling does not exceed a tp-rate of 0.4, a bias towards near adjacent points improves the tp-rate to 0.6. It is interesting to note that if a bias towards a similar motion vector is used, the tp-rate is slightly higher for a lower number of iterations. The reason for this might be that for big planes, which are detected first, the interest points are distributed more uniformly over the plane, whereas the results are less stable if near adjacent points are selected. As expected, incrementally adding hypotheses in unexplained regions drastically improves performance. In combination with the connected component analysis, this method reaches a tp-rate higher than 0.99 with only 120 filtered hypotheses. As can be seen in Fig. 7, our method also outperforms the sampling strategy proposed by Chin et al. [23], who guide sampling with information from residual sorting. Their method is superior to uniform sampling and to sampling with a bias depending on the Euclidean distance, especially if planes are rather large and of equal size, but samples for small planes are underrepresented. If the images contain large dominant planes and small planar parts, our approach outperforms this method. This is because, once dominant planes are detected, our method steers sampling towards unselected interest points, so that successively smaller planes get a higher hit rate after the larger ones have been detected.
Comparison of the greedy and the exact brute force solution of the QBP
To evaluate the performance of a greedy approximate solution of the Quadratic Boolean Problem (QBP) from Section 3.1.1, we compute the feature-based precision $p_{pr}$, the over-segmentation rate $p_{ov}$, the true-positive rate $p_{tp}$ and the total savings S (see Eq. (10)) for our algorithm. Table 1 shows that there is a very small decrease in performance for the approximate solution.
Table 1
Comparison of a greedy solution and the exact brute force computed solution of the QBP.
| Method | $p_{pr}$ | $p_{ov}$ | $p_{tp}$ | Savings |
|---|---|---|---|---|
| Greedy | 0.978 | 0.021 | 0.944 | 323.6 |
| Brute force | 0.978 | 0.004 | 0.966 | 323.8 |
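To make the comparison concrete, the sketch below contrasts an exhaustive and a greedy maximisation of a quadratic Boolean objective. It assumes the objective has the form S(x) = xᵀQx with x ∈ {0, 1}ⁿ, where the diagonal of Q holds the savings of individual hypotheses and the off-diagonal entries hold (negative) interaction costs of overlapping hypotheses; the matrices are invented toy values, and the greedy rule shown (add the hypothesis with the largest positive gain until none remains) is one plausible variant, not necessarily the one used in Section 3.1.1.

```python
from itertools import product

def savings(Q, x):
    """Quadratic Boolean objective S(x) = x^T Q x for a 0/1 vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def brute_force(Q):
    """Exact solution by enumerating all 2^n selections (small n only)."""
    best = max(product([0, 1], repeat=len(Q)), key=lambda x: savings(Q, x))
    return list(best), savings(Q, best)

def greedy(Q):
    """Repeatedly add the hypothesis that increases S the most; stop when none does."""
    n = len(Q)
    x, current = [0] * n, 0.0
    while True:
        best_gain, best_i = 0.0, None
        for i in range(n):
            if x[i]:
                continue
            x[i] = 1
            gain = savings(Q, x) - current
            x[i] = 0
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            return x, current
        x[best_i] = 1
        current += best_gain

# Toy example: hypotheses 0 and 1 overlap strongly, hypothesis 2 is nearly independent.
Q_easy = [[10, -6, 0], [-6, 9, -1], [0, -1, 4]]
# Toy example in which the greedy choice of hypothesis 0 blocks the better pair (1, 2).
Q_hard = [[10, -5, -5], [-5, 9, 0], [-5, 0, 9]]
```

On `Q_easy` both strategies select hypotheses 0 and 2; on `Q_hard` the greedy strategy stops after selecting hypothesis 0 and misses the better pair. The exhaustive search costs 2ⁿ evaluations, which is why an approximation is attractive once many hypotheses are generated.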
Comparison of plane detection methods
For the evaluation of the proposed plane detection method, all images of our packaging data set and the Oxford houses data set [24] are used. We compare the proposed methods with a sequential RANSAC and with J-LINKAGE. The RANSAC-based method detects a dominant plane and marks supporting features, which are then excluded in the following iterations. J-LINKAGE is an implementation according to Fouhey et al. [47]. ItMoS refers to the proposed iterative model selection algorithm (see Section 3.1). For the tests, sampling of interest points is biased towards near adjacent points and towards unexplained regions (see Section 3.1.2). Additionally, ItMoS (f) stands for our method including the connected-components-based validity filter, and ItMoS (f,t) refers to our method including tracking of planes in image sequences.

The experimental evaluation shows that our model selection method outperforms the other methods in terms of higher tp-rate and lower over-segmentation, especially for complex scenes. Although it is not optimised for the outdoor environments of the Oxford houses, it is better than the other methods. The incremental RANSAC approach has a slightly higher tp-rate at the cost of over-segmentation. If one compares Fig. 8a with Fig. 8b and Fig. 9b, an interesting detail can be seen. For highly cluttered images, over-segmentation increases, especially with the incremental RANSAC method, while it remains low for the ItMoS methods. Comparing Fig. 8a and b, it seems that all methods have a higher tp-rate in the case of a cluttered background. For foreground objects, some of the marked ground truth planes are very small and thus easily missed, while the background planes of the cluttered scenes are generally rather large and thus more easily detected, which explains the higher overall tp-rate for these scenes.
Fig. 8
Comparison of plane detection methods. Left graph shows the plane detection result for images with no background texture. The test images of the right graph have a highly textured background.
Table 2 shows the numerical results depicted in Figs. 8 and 9. $t_{processing}$ stands for the mean processing time per image, excluding the time needed to compute the interest points. The experiments were performed on a laptop with an Intel Core2 Duo CPU T7500 (2.20 GHz). Furthermore, $n_{samples}$ is the mean number of samples generated per image. For the methods with a pre-filter (ItMoS (f), ItMoS (f,t)), the first number is the number of samples after filtering and the number in parentheses is the total number of generated samples. It can be seen that many more hypotheses can be analysed within a shorter time and that fewer than 4% of them are passed on to the model selection stage. Comparing ItMoS and sequential RANSAC, it can be seen that, although the mean number of random samples per image is lower for ItMoS, its tp-rate is higher. One reason for this is that the incremental removal of inliers by the RANSAC method leads to a decreasing inlier ratio and thus to an increasing number of samples for planes detected later. In contrast, ItMoS treats all planes simultaneously and thus the number of samples is correspondingly lower.
Table 2
Results for the packaging data set.
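| Method | $p_{pr}$ | $p_{ov}$ | $p_{tp}$ | $t_{processing}$ (s) | $n_{samples}$ (1/image) |
|---|---|---|---|---|---|
| ItMoS | 0.973 | 0.139 | 0.742 | 8.7 | 8950 |
| ItMoS (f) | 0.970 | 0.126 | 0.761 | 2.5 | 1017 (26,528) |
| ItMoS (f,t) | 0.978 | 0.080 | 0.756 | 1.7 | 639 (17,897) |
| RANSAC | 0.979 | 0.259 | 0.722 | 3.0 | 9320 |
| J-LINKAGE | 0.980 | 0.110 | 0.620 | 16.0 | 5000 |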
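The link between inlier ratio and sample count mentioned above follows from the usual RANSAC stopping criterion, which also appears in Algorithm 1: sampling stops once η = (1 − ε^M)^k falls below η0, where ε is the inlier ratio, M the minimal sample size and k the number of samples. The following sketch solves this for k. The value η0 = 0.01, the sample size M = 4 (four point pairs for a homography) and the point counts in the example are illustrative assumptions, not values from the experiments.

```python
import math

def required_samples(inlier_ratio, sample_size=4, eta0=0.01):
    """Smallest k with (1 - inlier_ratio**sample_size)**k <= eta0."""
    p_good = inlier_ratio ** sample_size      # probability of an all-inlier sample
    if p_good <= 0.0:
        return math.inf
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(eta0) / math.log(1.0 - p_good))

def sequential_inlier_ratios(plane_sizes, n_outliers):
    """Inlier ratio seen by each plane when planes are removed one after another."""
    remaining = sum(plane_sizes) + n_outliers
    ratios = []
    for size in plane_sizes:
        ratios.append(size / remaining)
        remaining -= size                     # inliers of the detected plane are removed
    return ratios

# Hypothetical scene: planes with 150, 60 and 30 points plus 60 outliers.
ratios = sequential_inlier_ratios([150, 60, 30], n_outliers=60)
samples = [required_samples(r) for r in ratios]
```

In this toy scene the inlier ratio drops from 0.5 for the dominant plane to about 0.33 for the smallest one, and the required number of samples grows accordingly with each plane detected later.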
In Figs. 10–12, the detected planes are depicted in different colours. A critical point in images with a highly cluttered background is the inlier threshold. In particular, interest points of the CDs on the table are often clustered with parts of the CD cover. In Figs. 10 and 11, they are correctly separated, but interest points on the CDs lying on the magazine are clustered together with the magazine. The inlier threshold is also responsible for the approximation of the cylinder with piecewise planar surfaces in Fig. 11c. If the inlier threshold were lower and if the cylindrical object were less textured, the approximation would be less accurate. For all our experiments we set the inlier threshold to 1.
Fig. 11
Results for the packaging data set 2 using ItMoS, the proposed Algorithm 1.
Fig. 12a and b show results for Merton College and Wadham College from the Oxford data set. In these images, the camera motion between the frames is much larger than in our data set. Furthermore, the images are larger (1024 × 768). In general this leads to a higher processing time. While our methods converge after 15–20 s, RANSAC and J-LINKAGE need more than one minute. Fig. 12c shows a rather crowded living room where the planes of the pillow break into several pieces. In contrast, Fig. 12d shows a sparsely textured room where very small planes on the tile stove are merged into one bigger plane and, because of low texture, the couch is hardly visible to the system.
Object modelling
In [48] it has been shown that a sub-pixel refinement substantially improves pose estimation. Hence, we tested the overall system with Shi-Tomasi interest points [49] and a KLT tracker [50]. We use the affine-refined location of the interest points with sub-pixel accuracy and finally compute a non-linearly optimised homography using homest [40].

To test our system, we use five videos with about 800 frames each. Motivated by our robotic scenarios, the sequences show packaging of arbitrary shapes typically found in a supermarket (see Fig. 15). We placed two different objects on a table and manually moved the camera and gripper around them in such a way that one half of the objects is already occluded before the gripper pushes one object.
Fig. 15
The upper row shows three keyframes of sequence 1 (897 frames) with detected planes in green which are merged because of common 3D motion. The brightness of the interest points indicates the assignment to different planes. After the gripper fingers (seen only as two black dots on the left image border) pushed the plane 43, the keyframe-graph is traced back and the object model (44) and the background object (43) are created (lower image row). Changing plane ids of the top surface of the hexagonal object indicate that planes represented by homographies are substituted with better explanations. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Evaluation of plane assignment using colour and interest point proximity
First, we evaluate the cues described in Section 3.2.2, which we use to assign occluded planes to individual object hypotheses. For this, we select about 200 keyframes of the first three videos and mark the objects. After reconstruction of the planes visible in the first frames, we build the colour model and the interest point proximity model (see Fig. 13). In the following frames, these models are used to assign planes to separate objects according to Eq. (15). In all tests we set ν3 = 1 and use the mean Euclidean distance d of the interest points to parameterise the Gaussian.
Fig. 13
Initialisation to evaluate the plane assignment using colour and point proximity. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 13 and Table 3 show the results of the evaluation. We compare the individual cues colour (Eq. (16)) and proximity (first part of Eq. (15)) with the combination of them. Fig. 14d shows a typical ground truth image. We marked the object with red and green and the background with blue. Planes are assigned to the most likely objects using the pseudo-likelihood (15). In case the pseudo-likelihood drops below 0 (note that the colour term uses a log-likelihood and is thus negative) we set the label to “undefined”. The numbers in Table 3 indicate that our method provides an appropriate heuristic for this setup. Although colour is a rather weak cue, in combination with the interest point proximity it helps to avoid a hard threshold. Fig. 14 shows an example where colour proposes a wrong plane assignment but the combination of colour and interest point proximity finally leads to a correct decision shown in Fig. 14c.
Table 3
Comparison of plane assignment.
| | Colour (%) | Proximity (%) | Combined (%) |
|---|---|---|---|
| True | 78.8 | 93.2 | 96.9 |
| False | 21.2 | 6.8 | 0.0 |
| Undefined | 0.0 | 0.0 | 3.1 |
Fig. 14
Comparison of plane assignment using colour (a), interest point proximity (b), the combination of colour and proximity (c) and the according ground truth image (d) for sequence 1 frame 280. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Experiments with the integrated system
The goal of the experiments is for our system to detect the planes, to reconstruct, track and merge them depending on common motion and finally, after one object has been pushed, to create two separate piecewise planar object models. Figs. 15–19 show the qualitative results of our system. Planes merged into one object are drawn in the same colour, whereas the brightness of the interest points indicates the assignment to different planes. In each figure, the third image of each row shows the perspective of the camera shortly before/after the object is pushed and the last one depicts the reconstructed objects. Fig. 15 shows the whole event chain, that is, detection, reconstruction and merging of planes with a common motion (coloured green) and separation of planes as they start moving independently (indicated in red and blue). In Sequences 1, 2, 4 and 5, shown in Figs. 15, 16, 18 and 19, object modelling was successful and accurate as expected. The 3D reconstruction (rightmost image of each row) shows that parts of an object which we would intuitively mark as one plane are sometimes split. On the one hand, this is because these surfaces are in fact not flat but slightly curved; on the other hand, model selection within our plane detection algorithm replaces a plane in the following keyframes if a better, more complete or more accurate plane is found. Fig. 18 shows one of the failures that might occur. These two objects have approximately the same height and are sufficiently close that one joint explanation was favoured over two separate ones. In the case shown in Fig. 18, this results in a far too large top surface of the red object, which covers a part of the heart-shaped box. Figs. 19 and 20 show the limits of our system. Our reconstruction relies on planes modelled by homographies, and thus a minimum of five interest points per plane is necessary (4 + 1 that supports the homography). For reasons of reliability, we used a threshold of 10 points. Hence, in Fig. 19, even though a small plane is detected (shown at the top of the middle image), the top of the cleaner bottle is lost. Another example is shown in Fig. 20, where the object in the middle and the red cylindrical object do not have enough texture to yield a sufficient number of interest points, whereas the other three objects are nicely recovered.
Fig. 18
Sequence 4 (543 frames).
Fig. 19
Sequence 5 (811 frames).
Fig. 20
Example image and reconstruction of a small, more complex sequence which shows the limits of our system. Planes of the three dominant objects at the front are reconstructed, while the object in the centre of the image and the objects in the background are not detected because of low texture and too few features.
Conclusion
We presented a system for visual perception that allows interactive modelling of objects in terms of planar patches. For this, we developed a new method to consistently detect planes in image sequences. Interest points are tracked in image pairs and randomly selected points are used to generate plane hypotheses represented by a 2D projective transformation (homography). To select the subset of planes that best explains the images, we reformulated model selection based on Minimum Description Length (MDL) as proposed by Leonardis [20]. Planes are merged greedily and split into independently moving objects whenever inconsistent motion triggers a split event. Occluded planes seen in previous frames are assigned to objects based on a pseudo-likelihood computed from similarity in appearance and adjacency. Using this approach, it is possible to autonomously acquire object models by interacting with the scene reconstructed so far, e.g. by choosing a prominent planar patch for pushing or for superimposing augmented content.

Experiments have shown that the proposed plane detection outperforms state-of-the-art approaches in complex image sequences. Furthermore, we have shown that the combination of colour and interest point proximity, formalised with the proposed pseudo-likelihood, is a good heuristic to complete object models by assigning occluded surface patches to the most likely item.

Limitations of the system are shown in Fig. 20, where an object is missed because only a few interest points are detected due to low texture. Interest-point-based approaches require texture. This is a weakness which all such approaches have in common (e.g. [16,29,47]). One possibility to overcome this drawback would be to benefit from additional depth information. Recently, new active sensors have been proposed which provide RGB images and depth information for each pixel (e.g. the RGB-D sensors Kinect and Xtion manufactured by Microsoft and Asus). These active sensors allow plane hypotheses to be generated directly from the depth information. The problem of selecting the best subset of hypotheses still remains, and the proposed iterative model selection scheme can easily be adapted by substituting the homography model for images with a rigid transformation model for point clouds. Despite the drawback of requiring texture, monocular systems are still attractive to the user because of their low cost and small size. For example, in our scenario, where the camera is mounted on an arm, active sensors are too bulky. For monocular systems, one possibility to overcome the texture constraint is to follow the approach of Newcombe et al. [51], who use the camera pose from Structure and Motion for an optical-flow-based dense reconstruction, or to introduce a multi-label segmentation using Markov Random Field (MRF) optimisation and graph cuts, as proposed for example by Sudipta et al. [52] and Micusik et al. [14].
P ← ∅, P′ ← ∅
k ← 0, ε ← M/N, Imax ← 0
while η = (1 − ε^M)^k ⩾ η0 do
    P′ ← P
    Add Z random plane hypotheses to P′
    Select plane hypotheses from P′ following MDL principle and store in P
    Count number of explained IPs (inliers) I for P
    if I > Imax then
        Imax ← I
        ε ← Imax/N
    end if
    k ← k + 1
end while

Create 2D neighbourhood graph using the Delaunay triangulation
while No good plane hypothesis found do
    Sample 3 interest point pairs
    Compute affine transformation A (5 parameters)
    Trace graph and cluster interest points which support A
    if Cluster contains initial 3 sampled points then
        Good hypothesis found
        break
    else
        No hypothesis found
        continue
    end if
end while
Compute least-squares homography (8 parameters) using the DLT algorithm
Continue clustering and store plane hypothesis for iterative model selection
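
The validity check in the listing above can be sketched in a few lines. The example assumes NumPy, takes the neighbourhood graph as a ready-made adjacency dictionary (in the listing it comes from a Delaunay triangulation, which is not recomputed here), and fits a general six-parameter affine map to the three sampled point pairs, which may differ from the parameterisation used in the listing. Function names and the threshold of 1 pixel are illustrative.

```python
from collections import deque
import numpy as np

def affine_from_pairs(src, dst):
    """2x3 affine map taking three (non-collinear) source points to the destination points."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    M = np.column_stack([src, np.ones(len(src))])
    X, *_ = np.linalg.lstsq(M, dst, rcond=None)   # solves M @ X = dst
    return X.T

def grow_cluster(A, pts0, pts1, neighbours, seed, threshold=1.0):
    """Trace the graph from `seed`, collecting connected points that support A."""
    def supports(i):
        predicted = A[:, :2] @ pts0[i] + A[:, 2]
        return np.linalg.norm(predicted - pts1[i]) < threshold

    cluster, visited, queue = set(), {seed}, deque([seed])
    while queue:
        i = queue.popleft()
        if not supports(i):
            continue                              # do not expand through non-supporting points
        cluster.add(i)
        for j in neighbours[i]:
            if j not in visited:
                visited.add(j)
                queue.append(j)
    return cluster

def is_good_hypothesis(sample, pts0, pts1, neighbours, threshold=1.0):
    """A hypothesis is kept only if the grown cluster contains all three sampled points."""
    A = affine_from_pairs(pts0[list(sample)], pts1[list(sample)])
    cluster = grow_cluster(A, pts0, pts1, neighbours, sample[0], threshold)
    return set(sample) <= cluster, cluster
```

Because the cluster is grown only through supporting points, samples whose three points are not connected by consistently moving neighbours are rejected cheaply, before any homography is estimated.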

1. Instantiate new interest points (IPs)
2. Track interest points
3. Track planar patches modelled by homographies and try to estimate 3D motion for existing objects
    if planar patch does not support 3D motion then
        • trigger split event and create new objects from current and past keyframes
    end if
    if average displacement of the IPs < d pixels then
        • goto step 1
    else
        • initialise a new keyframe and continue
    end if
4. Detect and renew planar patch (Algorithm 1)
5. Merge and reconstruct groups of planar patches greedily
    if new planar patch supports active object motion model then
        • insert planar patch
    else
        • create new 3D object and motion model (SfM)
    end if
6. Refine objects using incremental bundle adjustment