Zhanlong Yang1,2, Dinggang Shen2,3, Pew-Thian Yap2. 1. School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, Shaanxi, China. 2. Department of Radiology and Biomedical Research Imaging Center (BRIC), University of North Carolina, Chapel Hill, NC, United States of America. 3. Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea.
Abstract
In this paper, we present a novel image mosaicking method based on Speeded-Up Robust Features (SURF) of line segments, aiming to achieve robustness to scaling, rotation, changes in illumination, and significant affine distortion between images in a panoramic series. Our method involves 1) using the SURF detection operator to locate feature points; 2) rough matching using SURF features of directed line segments constructed from the feature points; and 3) eliminating incorrectly matched pairs using RANSAC (RANdom SAmple Consensus). Experimental results confirm that our method produces high-quality panoramic mosaics that are superior to those of state-of-the-art methods.
1 Introduction
The automatic construction of large, high-resolution image mosaics is an active area of research in the fields of photogrammetry, computer vision, image processing, and computer graphics [1]. It is considered as important as other image processing tasks such as image fusion [2], image denoising [3], image segmentation [4] and depth estimation [5]. Image mosaicking finds applications in a wide variety of areas. A typical application is the construction of large aerial and satellite images from collections of smaller photographs [1, 6]. More applications include scene stabilization and change detection [7], video compression [8], video indexing [9] and so on [1]. Some widely used commercial software packages for image mosaicking are available, such as AutoStitch [10], Microsoft ICE [11], and Panorama Maker [12].

The key problem in image mosaicking is to combine two or more images by stitching them seamlessly together into a new one that distorts the original images as little as possible [13]. Image mosaicking techniques can be mainly divided into two categories: grayscale-based methods and feature-based methods. Grayscale-based methods are easy to implement, but they are relatively sensitive to grayscale changes, especially under variable lighting. Feature-based methods extract features from image pixel values. Because these features are partially invariant to lighting changes, matching ambiguity can be better resolved during image matching. Matching robustness can be further improved by using feature points that can be detected reliably. Many methods have been shown to be effective for the extraction of image feature points, for example, the Harris method [14], the Susan method [15], and the Shi-Tomasi method [16].
Feature-based image mosaicking methods afford two main advantages: (1) the computational complexity of image matching is significantly reduced, since the number of feature points is far smaller than the number of pixels; (2) the feature points are very robust to unbalanced lighting and noise, resulting in better image mosaicking results.

A wide variety of feature detectors and descriptors have been proposed in the literature (e.g. [17-21]). Detailed comparisons and evaluations of these detectors and descriptors on benchmark datasets were performed in [22, 23]. Among various methods, SIFT [18] has been shown to give the best performance [22]. Recent efforts (e.g. SURF [24], BRISK [25], FREAK [26], NESTED [27], and Ozuysal's method [28]) have focused on improving SIFT-based matching accuracy and reducing computation time. Arguably SURF [24] is among the best methods. Fei Lei et al. proposed a fast method for image mosaicking based on a simple application of SURF [29].

Jun Zhu et al. proposed an image mosaicking method that uses the Harris detector and SIFT features of line segments [30]. For performance and efficiency, this method uses the Harris corner detection operator to detect key points. Features of line segments are then used to match feature points, owing to their effective representation of local image information such as textures and gradients. However, the Harris corner detector is very sensitive to changes in image scale, so it does not provide a good basis for matching images of different sizes. Motivated by this observation, we propose an image mosaicking method that is based on SURF features [24] of line segments. First, the method uses the SURF detection operator to locate feature points and then constructs a directed graph of the extracted points. Second, it describes directed line segments with SURF features and matches them to obtain a rough matching of points. Finally, it adjusts matching points and eliminates incorrectly matched pairs through the RANSAC algorithm [31]. The framework of our method is summarized in Fig 1.
Fig 1
An overview of our method.
2 SURF
SURF, like the SIFT operator, is a robust feature detection method that is invariant to image scaling, rotation, illumination changes, and even substantial affine distortion. Both of these descriptors encode the distribution of pixel intensities in the neighborhoods of the detected points. SURF is computationally more efficient than SIFT owing to the use of integral images [32] and the box filters [33] that approximate second order partial derivatives of Gaussian convolutions. Similarly to many other approaches, SURF consists of two consecutive parts, including feature point detection and feature point description.
2.1 SURF feature-point detector
Similarly to the SIFT method, the detection of features in SURF relies on a scale-space representation combined with first and second order differential operators. The key feature of the SURF method is that these operations are approximated using box filters computed via integral images. The procedure of SURF feature detection therefore involves first computing an integral image, then establishing an image scale space with box filters, and finally locating feature points in the scale space.

The SURF detector is based on the determinant of the Hessian matrix, which is defined at point $\mathbf{x} = (x, y)$ and scale $\sigma$ as

$$H(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix}$$

where $L_{xx}(\mathbf{x}, \sigma)$ is the convolution of the Gaussian second order derivative with the image $I$ at point $\mathbf{x}$, and similarly for $L_{xy}(\mathbf{x}, \sigma)$ and $L_{yy}(\mathbf{x}, \sigma)$. As mentioned before, in order to reduce computation, SURF approximates $L_{xx}$, $L_{xy}$, $L_{yy}$ with box filtering using sums of Haar wavelet responses, resulting respectively in $D_{xx}$, $D_{xy}$, $D_{yy}$. This can be performed very efficiently using an integral image $I_\Sigma$, which given an input image $I$ is calculated as

$$I_\Sigma(x, y) = \sum_{i=0}^{x} \sum_{j=0}^{y} I(i, j)$$

The determinant of the Hessian approximated in this way is

$$\det(H_{\text{approx}}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2$$

where 0.9 is the relative weight used in SURF [24] to balance the box-filter responses. Thus, the interest points, including their scales and locations, are detected in an approximate Gaussian scale space. The size of the box filter is varied with octaves and intervals [34]:

$$\text{filter size} = 3 \times (2^{\text{octave}} \times \text{interval} + 1)$$

The filter sizes for various octaves and intervals are illustrated in Fig 2. Only pixels with greater responses than their surrounding pixels are classified as interest points. The maximal responses are then interpolated in scale and space to locate interest points with sub-pixel accuracy.
Fig 2
Filter sizes for four different octaves and intervals (marked by arcs).
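The constant-time box sums that make the SURF detector fast can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, the integral image is stored with an extra zero row and column (a common convention that simplifies the border cases), and the 0.9 weight and the filter-size rule 3 × (2^octave × interval + 1) are taken from standard descriptions of SURF and are assumed to be the ones the text refers to.

```python
def integral_image(img):
    """Summed-area table with one extra zero row and column, so that
    ii[y][x] holds the sum of all img[r][c] with r < y and c < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using only four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def filter_size(octave, interval):
    """Box-filter side length for a given octave and interval (assumed rule)."""
    return 3 * (2 ** octave * interval + 1)


def hessian_response(dxx, dyy, dxy, weight=0.9):
    """Approximate determinant of the Hessian from box-filter responses."""
    return dxx * dyy - (weight * dxy) ** 2


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    table = integral_image(image)
    print(box_sum(table, 1, 1, 2, 2))                  # sum of 5, 6, 8, 9
    print([filter_size(1, k) for k in (1, 2, 3, 4)])   # first octave
```

Because every box sum costs four lookups regardless of the box size, the scale space can be built by enlarging the filter rather than repeatedly shrinking the image.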
2.2 SURF descriptor
The goal of a descriptor is to provide a unique and robust description of the intensity distribution within the neighborhood of the point of interest. In order to achieve rotational invariance, the orientation of the point of interest needs to be determined. Orientation is calculated in a circular area of radius 6s centered at the interest point, where s is the scale at which the interest point is detected. In this area, Haar wavelet responses in the x and y directions are calculated and weighted with a Gaussian centered at the point of interest. By computing the sums of the horizontal and vertical responses within a sliding orientation window of size π/3, and moving the window around the entire circle in steps of 5 degrees, 72 orientations are obtained. The two summed responses in each window yield a local orientation vector. The longest such vector over all windows defines the main orientation.

Once position, scale and orientation are determined, a feature descriptor is computed. The first step consists of constructing a square region centered on the feature point and oriented along the orientation determined previously. The region is divided uniformly into 4 × 4 smaller sub-regions. For each sub-region, Haar wavelet responses are computed at 5 × 5 regularly spaced sample points. The x and y wavelet responses, denoted by dx and dy respectively, are weighted with a Gaussian centered at the interest point and summed over each sub-region to form a first set of entries of the feature vector. In order to obtain information on the polarity of the intensity changes, the sums of the absolute values of the responses, |dx| and |dy|, are also extracted. Each sub-region is therefore associated with a four-dimensional vector

$$v = \left(\sum d_x, \; \sum d_y, \; \sum |d_x|, \; \sum |d_y|\right)$$

Combining the vectors v from all 16 sub-regions yields a single 64-dimensional descriptor, which is normalized to unit norm for contrast invariance.
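The assembly of the 64-dimensional descriptor from the sub-region sums can be sketched in a few lines of Python. This is a simplified illustration, not the authors' code: the function name is hypothetical, the inputs are assumed to be 20 × 20 grids of Haar responses that have already been rotated to the main orientation and Gaussian weighted, and the computation of the responses themselves is omitted.

```python
import math


def surf_like_descriptor(dx, dy):
    """Build a 64-d descriptor from 20x20 grids of Haar responses.

    dx, dy: lists of 20 rows with 20 values each (assumed to be already
    oriented and Gaussian weighted). The grid is split into 4x4 sub-regions
    of 5x5 samples; each contributes (sum dx, sum dy, sum |dx|, sum |dy|).
    """
    descriptor = []
    for block_row in range(4):
        for block_col in range(4):
            s_dx = s_dy = s_abs_dx = s_abs_dy = 0.0
            for y in range(block_row * 5, block_row * 5 + 5):
                for x in range(block_col * 5, block_col * 5 + 5):
                    s_dx += dx[y][x]
                    s_dy += dy[y][x]
                    s_abs_dx += abs(dx[y][x])
                    s_abs_dy += abs(dy[y][x])
            descriptor.extend([s_dx, s_dy, s_abs_dx, s_abs_dy])
    # Normalize to unit length for invariance to contrast changes.
    norm = math.sqrt(sum(v * v for v in descriptor))
    if norm == 0.0:
        return descriptor
    return [v / norm for v in descriptor]
```

Keeping the signed and absolute sums separately is what lets the descriptor distinguish, for example, a uniform gradient from an oscillating texture with the same total gradient energy.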
3 Matching of directed line segments
3.1 Rough matching
The best candidate match for each keypoint is found by identifying its nearest neighbor in the set of keypoints generated from a reference image. The nearest neighbor is defined as the keypoint with the minimal Euclidean distance, computed on the invariant descriptor vectors described above.

However, many features from an image do not have any matching counterparts in the reference image because they arise from background clutter or cannot be detected in the reference image. Therefore, we use a global threshold on the distance to discard keypoints without good matches. Fig 3 shows the Euclidean distances of 10000 keypoints with correct matches for real image data. This figure was generated by matching images with different scales, rotation angles, changes in illumination, and affine distortions. As shown in Fig 3, most of the matched pairs have small Euclidean distances ranging from 0 to 0.15. We set the global threshold to 0.1 in our experiments, eliminating more than 90% of the false matches while discarding less than 5% of the correct matches.
Fig 3
Euclidean distances of 10000 matched keypoints.
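The rough matching rule described in this subsection, nearest neighbor in descriptor space followed by a global distance threshold, can be sketched as follows. The function names are hypothetical, the search is a plain exhaustive loop chosen for clarity rather than speed, and the default threshold of 0.1 is the value quoted in the text.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rough_match(descriptors, reference_descriptors, threshold=0.1):
    """Return (index, reference_index, distance) for every keypoint whose
    nearest neighbor in the reference set lies within the global threshold."""
    matches = []
    for i, d in enumerate(descriptors):
        best_j, best_dist = None, float("inf")
        for j, r in enumerate(reference_descriptors):
            dist = euclidean(d, r)
            if dist < best_dist:
                best_j, best_dist = j, dist
        # Keypoints without a sufficiently close neighbor are discarded.
        if best_j is not None and best_dist <= threshold:
            matches.append((i, best_j, best_dist))
    return matches
```

A keypoint whose closest reference descriptor is farther away than the threshold is treated as having no counterpart, which is how clutter and undetected points are filtered out.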
3.2 Line segment features
Features of line segments are an effective representation of local image information, such as textures and gradients. Given two images I and I′ to be matched, feature points are detected in each image using SURF to construct two directed graphs, G = (V, E) and G′ = (V′, E′), where $V = \{a_1, a_2, \cdots, a_M\}$ and $V' = \{b_1, b_2, \cdots, b_N\}$ are the key points extracted from I and I′ (M and N denote the numbers of key points), and $E = \{(a_i, a_j), i \neq j\}$ and $E' = \{(b_i, b_j), i \neq j\}$ are the edge sets of the directed graphs G and G′, respectively. Features are generated for each line segment between two key points. For each edge of graph G, $e_{ij} \in E$, with starting point $a_i$ and end point $a_j$, we equidistantly sample three points $\{p_1, p_2, p_3\}$, with $p_k = p_{a_i} + \frac{k-1}{2}\,(p_{a_j} - p_{a_i})$, k = 1, 2, 3, where $p_{a_i}$ denotes the coordinates of point $a_i$. The SURF features are extracted for each of these points, giving a feature matrix $S = [s_1, s_2, s_3]$. Each $s_k$ is a 64-dimensional vector, so each line segment has a 192-dimensional feature vector.
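The sampling of a directed line segment and the assembly of its feature matrix can be sketched as below. The names are hypothetical, and `describe` stands in for a SURF descriptor extractor that returns a 64-dimensional vector at a given location; it is an assumption of this sketch rather than a call into any particular library.

```python
def sample_points(start, end):
    """Three equidistant points on the directed segment start -> end:
    the starting point, the midpoint, and the end point."""
    return [
        (start[0] + (k - 1) / 2 * (end[0] - start[0]),
         start[1] + (k - 1) / 2 * (end[1] - start[1]))
        for k in (1, 2, 3)
    ]


def segment_feature(start, end, describe):
    """Feature matrix S = [s1, s2, s3] of a directed segment.

    describe: hypothetical callable (x, y) -> 64-d descriptor.
    """
    return [describe(x, y) for (x, y) in sample_points(start, end)]


def flatten(feature_matrix):
    """Concatenate the three 64-d columns into one 192-d vector."""
    return [value for column in feature_matrix for value in column]
```

Because the samples are ordered from the starting point to the end point, the segment from a_i to a_j and the reversed segment from a_j to a_i receive different feature matrices, which is what makes the segments directed.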
3.3 Nearest neighbor matching
We use the nearest-neighbor matching criterion proposed in [30] for rough matching of line segments. Assuming image I has n1 directed line segments, $L = [l_1, l_2, \cdots, l_{n_1}]$, and image I′ has n2 directed line segments, $L' = [l'_1, l'_2, \cdots, l'_{n_2}]$, the nearest-neighbor pairs can be encoded using an adjacency matrix $K \in \{0, 1\}^{n_1 \times n_2}$:

$$K(i, j) = \begin{cases} 1, & \text{if } l'_j \text{ is the nearest neighbor of } l_i \\ 0, & \text{otherwise} \end{cases}$$

The distance between a pair of line segments $l_i$ and $l'_j$, with feature matrices $S_i$ and $S'_j$ respectively, is defined using the F-norm of the feature matrices:

$$d(l_i, l'_j) = \| S_i - S'_j \|_F$$

The matching is further refined as follows. With the sets of key points in two given images, $V = \{a_1, a_2, \cdots, a_M\}$ and $V' = \{b_1, b_2, \cdots, b_N\}$, where M and N are the numbers of key points, we use the statistical voting method reported in [30] to obtain the matching frequency of each point. A matrix $G \in \mathbb{R}^{M \times N}$ is initialized as a null matrix. If, based on K, two line segments match each other, we vote once for the pair of starting points and once for the pair of end points of the two segments. This is carried out by incrementing the corresponding elements in G by 1. A larger element in matrix G indicates a higher probability that the two points match. The procedure for the computation of matrix G is detailed in Algorithm 1.

Algorithm 1 Computation of G
Input: Matrix K
Output: Matrix G
procedure ComputeMatrix(K, G)
    Initialize G ∈ R^(M×N) as a null matrix
    for i = 1, 2,…,n1, j = 1, 2,…,n2 do
        if K(i, j) = 1 then
            Find directed line segments l_i[a_l → a_m] and l′_j[b_p → b_q]
            G(l, p) = G(l, p) + 1, G(m, q) = G(m, q) + 1
        end if
    end for
    Output matrix G
end procedure

To avoid matching too many points to a single point, the criteria for selecting matching points are as follows:
1. Discard pairs with G(i, j) ≤ σ, where $\sigma = 0.5 \max_{i,j} G(i, j)$.
2. Select pairs giving maximal values in all rows and columns as matched pairs.
3. If the maximal element in row i and the maximal element in column j are not the same, select the larger one. For example, assuming G(i, p) is the maximal element in row i and G(q, j) is the maximal element in column j, if G(i, p) > G(q, j), then $a_i$ and $b_p$ match each other.

Incorrectly matched pairs are further removed by using RANSAC (RANdom SAmple Consensus) [31], and a homography matrix M is then estimated for image alignment.
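The segment matching, voting, and selection steps of this subsection can be put together in a compact Python sketch. It is a minimal illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, each segment is represented by the indices of its starting and end points, and the three selection criteria are interpreted as a greedy one-to-one assignment in order of decreasing votes, which is one possible reading of the rule that the larger element wins. The RANSAC step is not included.

```python
import math


def frobenius_distance(S, T):
    """Frobenius norm of the difference of two feature matrices,
    each given as a list of equally long vectors."""
    return math.sqrt(sum((a - b) ** 2
                         for s, t in zip(S, T) for a, b in zip(s, t)))


def nearest_neighbor_matrix(feats, feats_prime):
    """K[i][j] = 1 if segment j of image I' is the nearest neighbor of
    segment i of image I, and 0 otherwise."""
    K = [[0] * len(feats_prime) for _ in feats]
    if not feats_prime:
        return K
    for i, S in enumerate(feats):
        j_best = min(range(len(feats_prime)),
                     key=lambda j: frobenius_distance(S, feats_prime[j]))
        K[i][j_best] = 1
    return K


def vote(K, segments, segments_prime, n_points, n_points_prime):
    """Voting matrix G. segments[i] = (l, m) holds the indices of the
    starting and end points of segment i; likewise (p, q) for image I'."""
    G = [[0] * n_points_prime for _ in range(n_points)]
    for i, (l, m) in enumerate(segments):
        for j, (p, q) in enumerate(segments_prime):
            if K[i][j] == 1:
                G[l][p] += 1   # vote for the pair of starting points
                G[m][q] += 1   # vote for the pair of end points
    return G


def select_matches(G):
    """Keep entries above half of the maximal vote, then accept pairs
    greedily by decreasing votes so that each point is matched once."""
    if not G or not G[0]:
        return []
    sigma = 0.5 * max(max(row) for row in G)
    candidates = [(G[i][j], i, j)
                  for i in range(len(G)) for j in range(len(G[0]))
                  if G[i][j] > sigma]
    candidates.sort(key=lambda c: -c[0])
    used_rows, used_cols, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(matches)
```

A point that takes part in many correctly matched segments accumulates many votes for its true counterpart, so isolated segment mismatches have little influence on the final point pairs.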
4 Experimental results
In this section, the experimental results of the proposed method are presented. Evaluation was performed using gray level images with different rotation angles, scales, illumination, and affine distortions. Representative results are shown here.

In order to compare our proposed method with a recent state-of-the-art method presented in [30], images downloaded from the website [35] were used. Representative image pairs are shown in Fig 4. The lighting conditions of the two images in Fig 4(A) are largely different: the left image has a longer exposure time than the right one. The two images in Fig 4(B) were taken by an ordinary camera in different orientations. The two images in Fig 4(C) have different resolutions: the left one is a blurred low-resolution image and the right one has higher resolution. In Fig 4(D), the left image was taken with the lens of the camera zoomed in relative to the right one. Therefore, the buildings in the left image appear larger than the ones in the right.
Fig 4
Image pairs with photometric or geometric variations.
(A) lighting, (B) rotation, (C) blur, (D) scaling. Reprinted from [30] under a CC BY license, with permission from [Computational and Mathematical Methods in Medicine], original copyright [2014].
Results of matching by different methods are shown in Figs 5–7. Fig 5 indicates that SURF cannot even stitch the images correctly due to incorrectly matched points. Figs 6 and 7 demonstrate that both SIFT and our method obtain good results. However, Fig 6(B) indicates that SIFT still produces some wrongly matched points. Our method incorporates robust statistical voting and rough matching strategies that can eliminate incorrectly matched pairs.
Fig 5
Matching results based on SURF method.
Fig 6
Matching results based on SIFT method.
Fig 7
Matching results based on proposed method.
Figs 8 and 9 show the panoramic images stitched by the algorithm presented in [30] and by our method, respectively. As shown in Fig 8(A) and 8(D) (in regions marked with red circles), the comparison method results in ghosting due to inaccurate matching.
Fig 8
Mosaicking results given by the method proposed in [30].
Reprinted from [30] under a CC BY license, with permission from [Computational and Mathematical Methods in Medicine], original copyright [2014].
Fig 9
Mosaicking results given by our method.
Comparing Figs 8(C) and 9(C), we can see that Fig 9(C) is not as clear as Fig 8(C). The reason is that the original image downloaded from the website is of low quality.

Fig 10 shows an image pair with significant affine distortion. Results of matching by different methods are shown in Fig 11(A)–11(C). Fig 12 shows the panoramic images stitched by SIFT and our method. We can see that the panoramic image stitched by our method is cleaner than the one given by SIFT.
Fig 10
Images with affine distortion.
Fig 11
Matching by different methods.
(A) SURF, (B) SIFT, (C) Our method.
Fig 12
Mosaicking results by different methods.
(A) SIFT, (B) Our method.
To evaluate the proposed method quantitatively, we used representative test image pairs of textured and structured scenes from the website [36], as shown in Fig 13. The following metric is used:

$$1 - \text{precision} = \frac{\#\text{false matches}}{\#\text{correct matches} + \#\text{false matches}}$$
Fig 13
Test image pairs taken from textured and structured scenes under photometric or geometric transformations.
(A) Bikes (blur), (B) tree (blur), (C) Leuven (lighting), (D) bark (scaling and rotation), (E) wall brick (viewpoint), (F) boat (rotation), (G) graffiti (viewpoint), (H) UBC (JPEG).
Note that a correct match is a match where two keypoints correspond to the same physical location, and a false match is one where two keypoints come from different physical locations.

Table 1 presents the comparison of the matching results, including the number of correct matches over the number of total matches and 1-precision. The results in the table indicate that the proposed algorithm achieves the lowest 1-precision on every image pair, tied with SIFT on pair F.
Table 1
Performance comparison with state-of-the-art methods.
| Image | #CM/#TM (SURF) | #CM/#TM (SIFT) | #CM/#TM (Proposed) | 1-Precision (SURF) | 1-Precision (SIFT) | 1-Precision (Proposed) |
|---|---|---|---|---|---|---|
| A | 122/156 | 234/353 | 133/156 | 0.22 | 0.34 | 0.15 |
| B | 67/95 | 140/592 | 81/96 | 0.29 | 0.73 | 0.16 |
| C | 63/88 | 97/192 | 74/88 | 0.28 | 0.49 | 0.16 |
| D | * | 76/156 | 28/56 | * | 0.51 | 0.50 |
| E | 60/133 | 176/445 | 73/133 | 0.55 | 0.60 | 0.45 |
| F | 29/62 | 158/186 | 53/62 | 0.53 | 0.15 | 0.15 |
| G | 8/23 | 9/57 | 13/23 | 0.65 | 0.84 | 0.43 |
| H | 368/422 | 484/723 | 388/422 | 0.13 | 0.33 | 0.08 |

CM: correct matches; TM: total matches; *: matching failed.
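The relationship between the match counts and the 1-precision values in Table 1 can be checked with a short Python sketch. The function name is hypothetical, and the false matches are assumed to be the total matches minus the correct matches, as implied by the definitions given in the text; the counts used in the example are those of image pair A.

```python
def one_minus_precision(correct_matches, total_matches):
    """Fraction of false matches among all matches."""
    if total_matches <= 0:
        raise ValueError("total_matches must be positive")
    false_matches = total_matches - correct_matches
    return false_matches / total_matches


if __name__ == "__main__":
    # #CM/#TM entries of image pair A in Table 1.
    pair_a = {"SURF": (122, 156), "SIFT": (234, 353), "Proposed": (133, 156)}
    for method, (cm, tm) in pair_a.items():
        print(f"{method}: {one_minus_precision(cm, tm):.2f}")
    # prints SURF: 0.22, SIFT: 0.34, Proposed: 0.15
```

A lower value means that a smaller share of the reported matches is wrong, so the metric rewards methods that return fewer but more reliable correspondences.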
5 Conclusion
In this paper, we have introduced a novel image mosaicking method based on SURF features of line segments. The method first uses the SURF detection operator to detect feature points. Second, it constructs directed line segments, describes them with SURF features, and matches those directed segments to acquire a rough point matching. Finally, the RANSAC (RANdom SAmple Consensus) algorithm is used to eliminate incorrect pairs for robust image mosaicking. Experimental results demonstrate that the proposed algorithm is robust to scaling, rotation, lighting, resolution and a substantial range of affine distortion.

Recently, Ji et al. [37] proposed a novel compact bag-of-patterns (CBoP) descriptor with an application to low bit rate mobile landmark search. The CBoP descriptor offers a compact yet discriminative visual representation, which significantly improves search efficiency. In the future, we will try these new methods [37-39], proposed in the fields of mobile visual location recognition and mobile visual search, to further improve the performance of our algorithm.