Literature DB >> 26007729

Building keypoint mappings on multispectral images by a cascade of classifiers with a resurrection mechanism.

Yong Li¹, Jing Jing², Hongbin Jin³, Wei Qiao⁴.

Abstract

Inspired by the boosting technique for detecting objects, this paper proposes a cascade structure with a resurrection mechanism to establish keypoint mappings on multispectral images. The cascade structure is composed of four steps by utilizing best bin first (BBF), color and intensity distribution of segment (CIDS), global information and the RANSAC process to remove outlier keypoint matchings. Initial keypoint mappings are built with the descriptors associated with keypoints; then, at each step, only a small number of keypoint mappings of a high confidence are classified to be incorrect. The unclassified keypoint mappings will be passed on to subsequent steps for determining whether they are correct. Due to the drawback of a classification rule, some correct keypoint mappings may be misclassified as incorrect at a step. Observing this, we design a resurrection mechanism, so that they will be reconsidered and evaluated by the rules utilized in subsequent steps. Experimental results show that the proposed cascade structure combined with the resurrection mechanism can effectively build more reliable keypoint mappings on multispectral images than existing methods.

Entities: Chemical Disease Species

Keywords: cascade structure; multispectral; resurrection mechanism

Year: 2015 PMID： 26007729 PMCID： PMC4481929 DOI： 10.3390/s150511769

Source DB: PubMed Journal: Sensors (Basel) ISSN： 1424-8220 Impact factor: 3.576

Introduction

Multispectral imaging techniques have been applied in a variety of fields, including civilian surveillance, intelligent navigation, automatic target recognition, etc. For example, researchers wish to analyze with images geographic and geological change before and after a devastating earthquake, to help conduct a better reconstruction. These images are often taken at different times, from different views and/or with different spectral light. They provide complementary information to the visible spectrum [1] and make image analysis more reliable. To effectively utilize them, an accurate registration is necessitated to account for the misalignment between images, so that all images of a scene can be aligned to a common coordinate. Then, a fine-fused image can be generated as if it were taken by one camera. During the last few decades, many methods have been proposed to align multispectral images. A category of commonly-used methods addresses the image registration by building keypoint mappings. These methods generally include three steps: detecting keypoints, calculating descriptors and building keypoint mappings with the descriptors associated with the detected keypoints. Typical keypoints and descriptors include the Harris corner [2], SURF [3], SIFT [4], MSER [5], WLD [6], etc. These descriptors are often devised to be invariant to small scaling, rotation and even affine transformation on single-mode images. Morel and Yu [7] propose ASIFT, which is fully affine invariant. It simulates all image views obtainable by varying the two camera axis parameters and covers the other four parameters using SIFT. Cai et al. [8] then further proposed a perspective scale invariant feature transform (PSIFT) using homographic transformation to simulate perspective distortion. To increase the number of keypoints, Park et al. [9] proposed using higher-order scale space derivatives, ∂2L(x,y,σ)/∂(σ2, ∂3L(x, y, σ)/∂σ3, ∂4L(x, y, σ)/∂σ4, and then extracted the extrema in the high-order scale space. However, if the two images to be registered are taken by different spectral light, e.g., thermal infrared and visible light, the ratio of correct keypoint mappings established with SURF and SIFT will dramatically decrease, resulting in a significant decrease of registration performance. To adapt the descriptors designed for single-mode images (e.g., SIFT) to multispectral images, many techniques have been proposed, including NG_SIFT (NG, normalized gradient) [10], SAR_SIFT [11] and MIND [12]. NG_SIFT utilizes the normalized gradients around keypoints for describing the local pattern to achieve the invariance against non-linear intensity changes between multispectral images. It outperforms the original SIFT on the multispectral images of a structured scene. SAR_SIFT proposes a new local gradient pattern around keypoints, in which the orientation and magnitude are robust against the speckle noise. SAR_SIFT gives a higher ratio of correct keypoint mappings than the original SIFT on multispectral images. Mainali et al. [13] proposed the D-SIFER scale-invariant feature detection algorithm using the 10th order scale-space optimal Gaussian derivative filter. D-SIFER was validated on hyperspectral images and was shown to perform better than SIFT and SURF. Aguilera et al. [1] proposed EOH (edge of histogram) by borrowing the idea from MPEG-7 descriptors [14]. EOH considers only the edge pixels for computing descriptors (also, see [15] ) and does not utilize the gradient magnitude of a pixel weighing its contribution. Unlike other descriptors, EOH generates 4 × 4 × 5 descriptors, with 4 × 4 blocks for the neighboring window around keypoints, five orientation bins in each block, including 0°, 45°, 90° and 135°, and a non-direction bin. Similar to EOH, Bingjian et al. [16] computed descriptors with edge points only, but the local window around a keypoint is partitioned into 17 bins in a log-polar coordinate. Existing descriptors for multispectral images perform well when the spectral distance is small. Spectral distance here refers to the distance/difference of two imaging wavebands. For example, the spectral distance between Band 1 Visible (0.43–0.45 μm) and Band 10 TIRS 1 (10.6–11.19 μm) in Landsat 8 is approximately 10.6 – 0.45 = 10.15 μm. Since the common information between multispectral images decreases with the increase of spectral distance, the keypoint mappings built solely with the matching ability of the associated descriptors often contain a high ratio of incorrect mappings. Effectively representing common information between multispectral images is not an easy task. Among commonly-used information is the gradient pattern (e.g., SIFT), edge points, etc. Some similarity metrics can be treated as a variant of common information, e.g., linear correlation and mutual information (MI) [17]. Correlation assumes the linear relationship between image intensities, which often does not hold for multispectral images [18]. In fact, gradient computation is a linear operation of image intensities, so the linear relationship is the foundation of SIFT descriptors. MI utilizes the statistical dependence between intensities. However, it is sensitive to keypoint positions and excels at finer registration, e.g., [19]. Given the matching ability of descriptors, an essential problem is now to remove the incorrect keypoint mappings. Techniques, such as RANSAC [20] and CIDS (color and intensity distribution of segment) [21], have been proposed to identify the correct mappings from the initial set built with the associated descriptors. Gong et al. [19] proposed a coarse-to-fine registration scheme on multispectral images by combining SIFT with RANSAC and the mutual information (MI) [17]. However, most of the existing approaches to removing incorrect keypoint mappings apply only a one-pass process, e.g., RANSAC [20], CIDS [21], BBF (best bin first) [22] and FSC (fast sample consensus) [23]. The problem with these approaches is that when the ratio of correct keypoint mappings is low, the rate of correct mappings being misidentified as incorrect (and the reverse) will be very high. Consequently, the set of the keypoint mappings preserved by these methods still contains a large percent of incorrect mappings, and also, some correct mappings are removed by mistake. Observing this and motivated by the boosting technique [24] in the field of object detection, we abandon the idea of deciding on keypoint mappings in only a one-pass process and propose a cascade structure with a resurrection mechanism. The cascade structure is composed of multiple steps (different approaches) to construct keypoint matchings. The resurrection mechanism will evaluate the quality of each keypoint matching and assign a score to it. The score is to be updated in the next step according to its matching quality at that step. Each step of the cascade structure employs a “loose” rule, so that only the keypoint mappings that score very low for this rule will be removed, and the other mappings remain to be undetermined or temporarily considered as correct mappings. The remaining mappings are to be evaluated in the following steps of the cascade structure by the resurrection mechanism. Since it is almost certain that some correct keypoint mappings are mistakenly removed at every step, the resurrection mechanism allows for a fraction of wrongly-removed keypoint mappings to be reconsidered and evaluated in the subsequent steps. Alternatively, the resurrection mechanism can be thought of as a compensation for the drawback of the rules utilized in the cascade structure, as this “gets back” some of correct mappings that are removed according to the rules. The rest of this paper is organized as follows. Section 2 discusses the proposed resurrection mechanism. Section 3 discusses the proposed cascade structure that is comprised of four steps. Section 4 presents the experimental results, and Section 5 concludes this paper.

The Resurrection Mechanism

The repeatability of keypoints under modality change is low, and to obtain more correct keypoint mappings, we propose a resurrection mechanism for reducing the number of mis-removed correct mappings due to repeatability insufficiency. This idea arises from the observation that some correct keypoint mappings are often misjudged by any keypoint matching method. In the resurrection mechanism, a score is assigned to every keypoint mapping based on the matching quality assessed with the method applied in one step of the cascade structure. The score is to be updated in the next step according to the matching quality assessed with the method applied at that step. Through the resurrection mechanism, each keypoint mapping will be evaluated by different matching criteria. This gives the chance for the misjudged correct keypoint mappings to be reconsidered and evaluated in subsequent steps; otherwise, they would have been removed permanently. To understand the resurrection mechanism, a confidence grade cg is defined to indicate the matching quality of each keypoint mapping at one step in the cascade structure. According to the matching quality, cg is set to 0, 1, 2 and 3, respectively. cg = 0 means that a keypoint mapping is of the worst quality, and it will be removed and never be considered. cg = 1 means that the quality of a keypoint mapping is not high enough at the current step, but may possibly be high enough at the next step. If cg = 2 or 3, the keypoint mapping will successfully go through the current step to the next step. At the next step, cg is updated for each keypoint mapping, and the keypoint mappings with cg = 0 will be removed. The keypoint mappings with cg = 1, 2 or 3 will be passed to the next step. In the last step, the keypoint mappings with cg = 2 or 3 are defined to be the matched keypoints from the proposed cascade structure. Let cg denote the cg value of a keypoint mapping at the previous step. Let cg denote the cg value determined only by the method applied at the current step. Then, the score cg at the current step can be assigned (updated) as follows, Note that the keypoint mappings with cg = 0 have already been removed at the previous step. Figure 1 gives an example showing the functionality of the resurrection mechanism. We utilize the BBF in Step 1 and the CIDS in Step 2 in the proposed cascade structure. In Step 1, keypoint mappings are divided into three levels according to their scores. Figure 1a shows 350 keypoint mappings whose cg is three, including 270 correct mappings with a correct ratio of 71%. Figure 1b shows 166 mappings whose cg is two, including 68 correct mappings with a correct ratio of 41%. Figure 1c shows 119 mappings whose cg is one, including 22 correct mappings with a correct ratio of 19%. This shows that a higher cg stands for a greater probability that a keypoint mapping is correct. However, there are correct mappings in the pending keypoint mappings, as shown in Figure 1c, which would have been determined to be incorrect if only one step were utilized. In Step 2, 18 mappings are resurrected from those pending mappings, as shown in Figure 1d, and there are 13 pairs of correct keypoint mappings “recovered” with the resurrection mechanism.

Figure 1

The example of resurrection. (a) The mappings whose cg is three in Step 1; (b) the mappings whose cg is two in Step 1; (c) the mappings whose cg is one in Step 1; (d) the mappings resurrected from (c) in Step 2.

The Proposed Cascade Structure with the Resurrection Mechanism

To apply the proposed resurrection mechanism to build more accurate keypoint mappings on multispectral images, we designed a cascade structure including four steps to remove outlier keypoint mappings. Firstly, the SURF keypoints and descriptors are extracted from both the reference and the sensed image. Note, other techniques, such as SIFT, can be applied here. The keypoints and descriptors have a great impact on the final registration accuracy, but this paper is focused on the cascade structure. Then, the cascade is applied to remove incorrect mappings; its workflow is shown in Figure 2.

Figure 2

The proposed cascade structure with the resurrection mechanism.

Initial Keypoint Mapping (Step 1)

This step aims to establish an initial set of keypoint mappings with BBF. For every keypoint , in the reference image I(x, y), the keypoint in the sensed image I(x, y) is defined to be the matched point of if the descriptor of has the smallest distance to the descriptor of , namely, Let denote the set of the keypoint mappings built with Equation (2). contains matches that are based on reference keypoints, i.e., for every reference keypoint , search the best matched test keypoint with Equation (2). Traditional BBF defines and to be matched if the following and Equation (2) hold simultaneously, Equations (2) and (3) say that has the smallest descriptor distance to and vice versa. However, the matches satisfying Equations (2) and (3) may be incorrect. To understand this, suppose is the truly matched keypoint to . Due to the matching ability decrease of descriptors on multispectral images, may be just marginally more close to than . This discards correct matches and probably preserves incorrect matches. To cope with this issue, we wish to loosen Equation (3). For every sensed keypoint , we keep n ≥ 1 mapping reference keypoint according to the descriptor distance. Specifically, we sort in an ascending order and then keep the first n reference keypoint as the mapping keypoint for , resulting in a matching set . The intersection ∩ is the initial set of keypoint mappings to which Step 2 is applied. The value of n is typically set to three, to ensure that only the keypoint mappings meeting the BBF rule will be preserved for consideration in the following steps. The mappings in ∩ are divided into three levels. Let and denote the set of reference and test keypoints. For a mapping , by the rule of Step 1, is the closest keypoint for , and is one of the n ≥ 1 closest keypoints for . The score for the mapping is determined as follows. If is the first closest keypoint for , then cg is set to three; if is the second closest keypoint for , cg is set to two; if is the third closest keypoint for , cg is set to one.

Color and Intensity Distribution of Segment (Step 2)

This step follows the idea of CIDS (color and intensity distribution of segment) [21]. The correlation of pixel color/gray values on two segments characterizes the quality of the two pairs of keypoint mappings (A1 ∼ B1) and (A5 ∼ B5) in Figure 3. To compute the correlation, a sampling process is conducted along the segment A1A5 and B1B5, giving two arrays of the same number of entries. The color values of the sampled points are generated with a bilinear interpolation.

Figure 3

Illustration of sampling the two segments A1A5 and B1B5.

Next, the two arrays of sampled points (A1, A2, A3, A4, A5) and (B1, B2, B3, B4, B5) are normalized to obtain x = {x1, x2, x3, x4, x5} and y = {y1, y2, y3, y4, y5}, such that ‖x‖ = ‖y‖ = 1. Then, a L2-norm d is computed in Equation (4) to measure the similarity between the segment A1A5 and B1B5. Formally, where x and y represent the two normalized arrays and n represents the number of the sampled points. Ideally, d = 0 when the keypoint mappings (A1 ∼ B1) and (A5 ∼ B5) are infinitely accurate and I(x, y) and I(x, y) are completely identical under a certain transformation. Since ‖x‖ = ‖y‖ = 1, by triangle inequality ‖x − y‖ ≤ ‖x‖ + ‖y‖ = 2. If d is greater than a predetermined threshold t, the two keypoint mappings will be classified as incorrect. A loose (i.e., big) threshold t is set here, so that only the keypoint mappings between which the two segments differ significantly will be removed. The detailed implementation applies a voting scheme as follows. Choose two mappings randomly; Calculate d with Equation (4) for the two mappings. If d is smaller than a preset threshold t, vote for both mappings, otherwise vote against them; Repeat 1 and 2 until all such two keypoint mappings are considered; Count the votes for every keypoint mapping N, and score it with: where N is the number of all keypoint mappings input to Step 2. Utilize Equations (1) and (5); the value of cg is updated by Equation (6), In our experiments, t is set to 0.6, since the difference of x and y can be large on multispectral images. After this step, we obtain a set of keypoint mappings .

Evaluate Keypoint Mappings with Global Information (Step 3)

In this step, we utilize global information to search “good” mappings in . Consider three keypoint mappings in , , and . They determine an affine transformation T. If all three mappings are correct, T is close to the ground truth, so the similarity metric between reference I(x, y) and the transformed test is high. This work employs the number of overlapped edge pixels (NOEP) [25] as the similarity metric. where E(x, y) and are the binary edge maps of I(x, y) and . Figure 4 illustrates an example of utilizing the global information. Three keypoint mappings determine a transformation T shown in Figure 4a,b. If T is close to the ground truth, the similarity metric between I(x, y) and is high, i.e., the number of overlapped edge pixels shown in Figure 4c is large. Consequently, the quality of three keypoint mappings can be evaluated with their determined T through the similarity . Note, this step can be easily adapted to four or more keypoint mappings to account for projective or polynomial transformations. Additionally, the similarity metric in Equation (7) can be substituted for other metrics, such as mutual information (MI) [17]. In practice, Equation (7) performs as well as or slightly better than MI and requires less computational cost.

Figure 4

(a) I(x, y); (b) , , illustrating three keypoint mappings that are to be assessed. The three mappings determine an affine transformation T, and then, we compute the similarity metric between I(x, y) and with Equation (7); (c) Overlapped edge maps, showing the overlapped edge maps where green pixels are from visible image I(x, y) and red pixels are from IR image . When T is close to the ground truth, a majority of edge pixels is expected to be overlapped, resulting in a high similarity metric.

There are multiple combinations of three keypoint mappings, and a keypoint mapping appears in multiple triplets, which yield different similarity metrics. Therefore, an iterative process is utilized to find the maximum similarity metric for every keypoint mapping. The iterative process outputs the maximum similarity for a keypoint mapping over all triplets in which it appears. Formally, the maximum similarity metric by is defined as: where T is determined by , , . The iterative process outputs the maximum similarity metric for each keypoint mapping. measures the quantity of image content that can be brought into alignment by . A larger Smax represents that the mapping is better from the view of the entire image content. We preserve the top mappings of ranked according to the Smax, resulting in a set of mappings to be processed in the next step. The detailed scoring rules are shown in Equation (9), where MAXS is the maximum value in all S. can also serve as the keypoint mapping of good quality for fast sample consensus [23]. Utilize Equations (1) and (9); the value of cg is updated by Equation (10),

RANSAC (Step 4)

At the last step, we apply the RANSAC algorithm, because there are still some outlier keypoint mappings after Step 3, although most of the incorrect keypoint mappings are expected to have been removed. Random sample consensus (RANSAC) is an iterative approach to estimating the parameters of a mathematical model from a set of observed data containing outliers. RANSAC performs well in removing outliers of keypoint mappings if the correct ratio is high. However, the performance of RANSAC decreases dramatically especially when the correct ratio is low, e.g., 20% or less. Due to this, not all of the keypoint mappings built with the first three steps are used as the input of Step 4; rather, only the keypoint matches whose cg are two or three after Step 3 are fed into Step 4, since these keypoint matches have a greater probability of being correct. Affine or projective transformations are utilized with RANSAC to remove outliers. When the distance of real scene content to the camera is the same, an affine transformation would be enough to account for the misalignment. When the distance varies from point to point, a projective transformation or polynomial transformation is necessitated. Polynomial transformations require at least six keypoint mappings, which significantly increases the possibility that a sample composed of six mappings contains incorrect ones. Consequently, projective transformations are utilized to address images of scene depth, and the proposed method can build correct mappings (see Figure 8 and its associated text).

Figure 8

Matching results on two image pairs from the OSUColor and Thermal Database. (a) Proposed method; and (c) proposed method results; (b) SIFT + RANSAC; and (d) SIFT + RANSAC results. The top two images were taken at the same time; the bottom were taken at different times. Although this image pair is multispectral, its property in some local area is close to being single-spectrum. For example, the house ceiling is brighter than other areas on both visible and IR images, which means the similarity of the local pattern. SIFT + RANSAC contain some correct matches.

Table 1 gives the experimental results. The datasets ‘Country’ to ‘Water’ are multispectral images, but the spectral distance is small (about 300 nm). On these datasets, the SIFT descriptor yields a high correct ratio of keypoint mappings, and therefore, RANSAC can further effectively remove incorrect keypoint mappings. The dataset ‘EOIR’ has more multimodality, and the correct ratio of initial keypoint mappings on it is markedly lower. Consequently, it becomes a challenging problem for RANSAC to preserve the correct mappings. See Section 4 for a detailed analysis.

Table 1

The error distribution histogram.

		Country			Field			Forest			Indoor			Mountain			Oldbuilding			Street			Urban			Water			EOIR			VS-LWIR
ERROR	[0, 2]	[2, 5]	> 5	[0, 2]	[2, 5]	> 5	[0, 2]	[2, 5]	> 5	[0, 2]	[2, 5]	> 5	[0, 2]	[2, 5]	> 5	[0, 2]	[2, 5]	> 5	[0, 2]	[2, 5]	> 5	[0, 2]	[2, 5]	> 5	[0, 2]	[2, 5]	> 5	[0, 2]	[2, 5]	> 5	[0, 2]	[2, 5]	> 5
CASCADE	898	99	9	2540	145	51	4749	18	0	5641	96	2	5641	96	2	6987	75	0	3502	124	0	12,717	19	0	1777	84	17	162	130	257	78	117	320
CASCADE	0.89	0.10	0.01	0.93	0.05	0.02	0.99	0.01	0.00	0.98	0.02	0.00	0.98	0.02	0.00	0.99	0.01	0.00	0.97	0.03	0.00	0.998	0.002	0.00	0.95	0.04	0.01	0.30	0.23	0.47	0.15	0.23	0.62
SIFT + RANSAC	847	182	195	2431	274	164	3635	561	31	5470	614	29	6474	600	106	8004	590	3	4055	527	16	14,412	825	1	1971	230	170	75	54	700	15	29	744
SIFT + RANSAC	0.69	0.15	0.16	0.85	0.10	0.05	0.86	0.13	0.08	0.89	0.10	0.01	0.90	0.08	0.02	0.93	0.07	0.00	0.88	0.11	0.01	0.95	0.05	0.00	0.83	0.10	0.07	0.09	0.07	0.84	0.02	0.04	0.94
SIFT	379	87	285	691	106	226	1244	276	140	464	23	128	742	258	110	696	81	28	356	94	43	735	8	23	425	44	129	22	7	202	2	11	3948
SIFT	0.50	0.12	0.38	0.68	0.10	0.22	0.75	0.17	0.08	0.75	0.04	0.21	0.67	0.23	0.10	0.86	0.10	0.03	0.72	0.19	0.09	0.96	0.01	0.03	0.71	0.07	0.22	0.10	0.03	0.87	0.00	0.01	0.99
GS_SIFT	381	88	290	694	121	227	1254	299	144	466	24	127	753	258	111	700	81	32	362	99	43	746	8	23	431	46	132	22	7	203	2	10	4939
GS_SIFT	0.50	0.12	0.38	0.67	0.12	0.22	0.74	0.18	0.08	0.76	0.04	0.21	0.67	0.23	0.10	0.86	0.10	0.04	0.72	0.20	0.09	0.96	0.01	0.03	0.71	0.08	0.22	0.09	0.03	0.88	0.00	0.01	0.99
ISS	213	54	884	246	73	794	639	167	1000	299	26	178	179	98	200	281	32	58	179	54	300	389	14	74	263	35	542	25	14	313	2	16	4578
ISS	0.19	0.05	0.77	0.22	0.07	0.71	0.35	0.09	0.55	0.59	0.05	0.35	0.38	0.21	0.42	0.76	0.09	0.16	0.34	0.10	0.56	0.82	0.03	0.16	0.31	0.04	0.65	0.07	0.04	0.89	0.00	0.01	0.99
ORB	195	95	6860	295	168	2919	481	165	869	399	57	634	269	186	379	304	103	58	209	144	790	366	17	18	300	122	2501	328	87	4765	2	14	10,249
ORB	0.03	0.01	0.96	0.09	0.05	0.86	0.32	0.11	0.57	0.37	0.05	0.58	0.32	0.22	0.45	0.65	0.22	0.12	0.18	0.13	0.69	0.91	0.04	0.04	0.10	0.04	0.86	0.06	0.02	0.92	0.00	0.00	1.00
FREAK	12	10	9567	11	9	3982	84	34	14,647	69	34	5490	97	13	5798	43	24	6864	37	6	6295	111	34	9779	32	20	6932	11	11	371	1	3	575
FREAK	0.00	0.00	1.00	0.00	0.00	1.00	0.01	0.00	0.99	0.01	0.01	0.98	0.02	0.00	0.98	0.01	0.00	0.99	0.01	0.00	0.99	0.01	0.00	0.99	0.00	0.00	0.99	0.03	0.03	0.94	0.00	0.01	0.99

Experimental Results

To investigate the performance of the proposed method, many experiments were conducted. Images include the data from the reference [26], ‘EOIR’ data and the dataset ‘VS_LWIR’from [1]. The dataset [26] includes 477 image pairs in nine categories taken with RGB and NIR (near-infrared). Dataset ‘EOIR’ includes 87 image pairs captured by ourselves, 12 Landsat image pairs from NASA, four remote sensing image pairs of the 2008 Sichuan earthquake and two image pairs from the OSUColor and Thermal Database. The 87 image pairs include indoor scenes and outdoor scenes, with one image taken with visible light and the other taken with middle-wave infrared (MWIR) light. In addition to the spectral distance, they are taken at different times, so the content of one image may be slightly different from that of the other. The 12 Landsat image pairs are downloaded from [27] with one taken with the visible band, e.g., Landsat 8 Band 3 Visible (0.53–0.59 μm), and the other taken with middle-wave light or the Thermal Infrared Sensor (TIRS), e.g., Landsat 8 Band 10 TIRS 1 (10.6–11.19 μm). The four remote sensing image pairs were taken over Wenchuan county (Sichuan Province, China) during the 2008 Sichuan earthquake. They were acquired by the Formosat-2 satellite. One image is a multispectral image (1960 × 1683) before the earthquake, and the other is a panchromatic image (1968 × 1705) after the earthquake of the same area. In order to further verify the performance of the proposed method for multispectral images taken at different times, we take two image pairs from the OSU Color and Thermal Database (Data 03, [28]). The two image pairs are captured by a thermal sensor (Raytheon PalmIR 250D, 25-mm lens) and a color sensor (Sony TRV87 Handycam). Dataset ‘VS_LWIR’ [1] contains 100 image pairs, one image taken with the visible bandwidth (0.4–0.7 μm) and the other taken with the long-wave infrared bandwidth (LWIR, 8–14 μm). Dataset ‘EOIR’ is much more challenging than the dataset from [26], since a visible image has less common information with a middle-wave infrared image than a near-infrared image. Dataset ‘VS-LWIR’ is even more challenging due to a larger spectral difference. Note that not all visible and TIRS (MWIR) Landsat image pairs available at [27] are so hard as ‘EOIR’ to be registered. Some such image pairs are very close to single-spectrum pairs in terms of intensity/gradient pattern, although they are taken by different spectral light. This work aims to investigate the performance of different methods for which the multimodality is relatively strong, i.e., the local gradient pattern does not match well, and the 12 Landsat image pairs have this property.

The Effectiveness of the Resurrection Mechanism

Firstly, we conduct experiments to evaluate the performance of the resurrection mechanism. The number of pending keypoint mappings and the number of resurrected mappings are counted. As above, the keypoint mappings are divided into three levels based on the calculated cg. At the beginning (first) step, pending keypoint mappings are not available, and so, the resurrection mechanism is not used. At the end (fourth) step, it is not used either, as only the keypoint mappings with cg = 2 and cg = 3 are input, while the keypoint mappings with cg = 1 are completely discarded. Consequently, we consider the number of pending keypoint mappings in Steps 1 and 2 and the number of resurrected keypoint mappings in Steps 2 and 3. Figure 5 shows the performance of the resurrection mechanism. From left to right are the total number of pending mappings in Steps 1 and 2 (blue bar), the total number of resurrected mappings in Steps 2 and 3 (red bar) and the number of resurrected correct mappings (green bar). From Figure 5, it can be seen that the resurrection mechanism successfully recovers some wrongly-discarded correct mappings (green bar) on all datasets. In particular, the dataset ‘EOIR’ is more challenging than other datasets, as the image pairs contain much fewer correct mappings, and hence, the number of recovered correct ones by the resurrection mechanism is much smaller, as well.

Figure 5

On each dataset, from left to right: the number of pending mappings, the number of resurrected mappings and the number of resurrected correct mappings.

Visual Results

Figure 6 gives the visual result of keypoint mappings built with the proposed method and SIFT + RANSAC on an image pair from ‘EOIR’. Visually, the proposed method builds six correct keypoint matches, while SIFT + RANSAC gives two correct mappings. The incorrect mappings by SIFT + RANSAC are difficult to remove, since they are caused by the repeating structure and the distinctiveness decrease of descriptors on multispectral images.

Figure 6

Matching result on an image pair from ‘EOIR’ taken by ourselves. The infrared image is transformed to investigate the matching performance under rotation. (a) The proposed method performs well under rotation; and (b) SIFT + RANSAC yields some incorrect matches due to the repeating windows of buildings. Such incorrect mappings are very difficult to remove, since the local gradient patterns (descriptors) of the windows lying in different position are similar to each other.

Figure 7 gives another visual result of keypoint mappings on an image pair taken during the 2008 Sichuan earthquake from the dataset ‘EOIR’, with all five mappings built with SIFT + RANSAC incorrect in Figure 7b. In the proposed method, more than one mapping reference keypoint is assigned to a test keypoint in the first step, which helps preserve more keypoint mappings. Furthermore, clouds appear in the IR image, which cause occlusion and mismatches. The proposed method utilizes the descriptors, as well as complementary information, providing six correct keypoint matches in this image pair in Figure 7a.

Figure 7

Matching result on a remote sensing image pair taken during the 2008 Sichuan earthquake (on a scale of 1:10,000) from the dataset ‘EOIR’. (a) The proposed method; (b) SIFT + RANSAC. The clouds appearing in the IR image do not generate incorrect matches for the proposed method, since they have been removed step by step in the cascade structure. SIFT + RANSAC barely generates a keypoint mapping due to the lack of texture in the cloud area (hence, fewer keypoints). The repeating structure of image content causes mismatches for SIFT + RANSAC.

Figure 8 shows the matching results on image pairs from the OSU Color and Thermal Database. One image pair is taken at the same time, while the other is taken at different times. The proposed method established six correct mappings on Figure 8a,c, indicating that the effect of imaging time on the accuracy of keypoint mapping is small. It is interesting to see that SIFT + RANSAC performs better in Figure 8d, taken at different times than in Figure 8b. The reason may be that although the image pairs shown in Figure 8d,c are taken at different times, their scene does not change much, and they are still of the same resolution. Again, these mappings are difficult to remove, since the local gradients encoded by descriptors are very similar to each other. The proposed method also outperforms SIFT [4] GS_SIFT [29], ISS [30], ORB[31] and FREAK [32]; however, the visual results are not listed here, and only the results of the proposed and SIFT + RANSAC are given. See the following quantitative analysis in Table 1.

Quantitative Results

This section presents the quantitative results. We compare the proposed method with the descriptors SIFT, GS_SIFT, ISS, ORB, FREAK and SIFT + RANSAC. The distances between keypoints are calculated for every keypoint mapping, and we count the number of keypoint mappings with distance falling in a range. Then, the histogram of the calculated distances is generated for every dataset with three bins [0, 2], [2, 5] and > 5. Table 1 gives the number histograms for which each bin counts the number of keypoint mappings with the distance falling in the bin. For example, on the dataset ‘Country’, 898 keypoint mappings built with the proposed method have a distance less than two; 99 keypoint mappings have a distance lying in [2, 5]; and nine keypoint mappings have a distance greater than five. On ‘Country’, SIFT + RANSAC gives 847 mappings of a distance less than two, 182 mappings of a distance less than 0.15 and 195 mappings of a distance larger than 0.16. ISS is not only an approach to extracting descriptors, but one to building keypoint mappings by finding the rough orientation difference between two images. SIFT, GS_SIFT, ORB and FREAK are mapped with BBF in our experiment. These keypoints and descriptors are detected with the source code in OpenCV, and the mapping keypoints are implemented with OpenCV. For all of these methods, OpenCV basically follows the parameter setups of the original papers. The total number of keypoint mappings built by different methods varies from one to another, so we assess the performance in terms of the percent of correct and incorrect mappings. Table 1 also gives the percent histogram of the distance between mapped keypoints. For example, on dataset ‘Country’, the proposed method has 89% (0.89) of keypoint mappings with a distance falling in [0, 2] and 10% (0.10) falling in [2, 5], while SIFT + RANSAC has 69% (0.69) in [0, 2] and 15% (0.15) in [2, 5] and ISS has 19% and 5% falling in [0, 2] and [2, 5]. The datasets ‘Country’ to ‘Water’ include multispectral image pairs, but the spectral distance is small (300 nm–400 nm); therefore, their characteristics are very close to monomodal image pairs. SIFT or SURF can build keypoint mappings of a sufficiently high correct ratio, and accordingly, SIFT + RANSAC performs relatively well. On datasets ‘Country’, ‘Field’, ‘Forest’, ‘Indoor’, ‘Mountain’, ‘Oldbuilding’, ‘Street’, ‘Urban’ and ‘Water’, SIFT provides correct ratios of 50% (0.50), 68% (0.68), 75% (0.75), 75% (0.75), 67% (0.67), 86% (0.86), 72% (0.72), 96% (0.96) and 71% (0.71). Dataset ‘EOIR’ includes image pairs taken by visible light and middle/long-wave infrared light and is much more challenging, since the spectral distance is larger. On this dataset, the presented method has 30% (0.30) and 53% (0.53) keypoint mappings with a distance falling in [0, 2] and [0, 5], while SIFT + RANSAC has 9% (0.09) and 16% (0.16) in [0,2] and [0, 5], SIFT has 10% (0.10) and 13% (0.13), GS_SIFT has 9% (0.09) and 12% (0.12), ISS has 7% (0.07) and 11% (0.11), ORB has 6% (0.06) and 8% (0.02) and FREAK has 3% (0.03) and 6% (0.06). Dataset ‘VS-LWIR’ is even more challenging than ‘EOIR’, as it can be seen that the correct rates for all methods are lower than on ‘EOIR’. For example, the presented method has 15% (0.15) and 48% (0.48) on ‘VS-LWIR’, and SIFT + RANSAC has only 2% (0.02) and 6% (0.06). RANSAC does not improve much over the original SIFT, since the keypoint mappings built with the original SIFT contain too many incorrect ones, which makes RANSAC fail to identify correct mappings. Compared with other methods, the presented method increases the correct ratio. Since the dataset is relatively large, this result shows that the proposed method is more suitable for multispectral images than other methods.

Conclusions

With the insight of the boosting technique, we proposed a cascade structure of four steps with a resurrection mechanism for registering multispectral images. The cascade structure classifies keypoint mappings at a step to be correct, incorrect and pending. Unlike the traditional approaches that typically determine keypoint mappings to be correct or not with a one-pass process, the “pending” keypoint mappings will be determined by the rules at subsequent steps. This effectively overcomes the drawback of applying a single algorithm for registering images. Furthermore, the resurrection mechanism gets back some of the correct mappings that were misclassified at previous steps. The experimental results demonstrate that the cascade structure can provide more robust registration than the state-of-the-art. Several directions are possible in the future. Improving the keypoint repeatability and the repeatability and distinctiveness of descriptors are the building blocks for keypoint mappings on multispectral images. At each step, different approaches can be applied to investigate the performance of the cascade, and the number of steps can be adjusted for applications. Last, but not least, other image information, such as CIDS, can be adapted to multispectral images to supplement the matching ability of the descriptors.

8 in total

1. MIND: modality independent neighbourhood descriptor for multi-modal deformable registration.

Authors: Mattias P Heinrich; Mark Jenkinson; Manav Bhushan; Tahreema Matin; Fergus V Gleeson; Sir Michael Brady; Julia A Schnabel
Journal: Med Image Anal Date: 2012-05-31 Impact factor: 8.545

2. WLD: a robust local image descriptor.

Authors: Jie Chen; Shiguang Shan; Chu He; Guoying Zhao; Matti Pietikäinen; Xilin Chen; Wen Gao
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2010-09 Impact factor: 6.226

3. A partial intensity invariant feature descriptor for multimodal retinal image registration.

Authors: Jian Chen; Jie Tian; Noah Lee; Jian Zheng; R Theodore Smith; Andrew F Laine
Journal: IEEE Trans Biomed Eng Date: 2010-02-18 Impact factor: 4.538

4. Multisensor image registration via implicit similarity.

Authors: Yosi Keller; Amir Averbuch
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2006-05 Impact factor: 6.226

5. A coarse-to-fine subpixel registration method to recover local perspective deformation in the application of image super-resolution.

Authors: Fei Zhou; Wenming Yang; Qingmin Liao
Journal: IEEE Trans Image Process Date: 2011-06-16 Impact factor: 10.856