Cristhian A. Aguilera, Angel D. Sappa, Cristhian Aguilera, Ricardo Toledo.
Abstract
This paper presents a novel CNN-based architecture, referred to as Q-Net, to learn local feature descriptors that are useful for matching image patches from two different spectral bands. Given correctly matched and non-matching cross-spectral image pairs, a quadruplet network is trained to map input image patches to a common Euclidean space, regardless of the input spectral band. Our approach is inspired by the recent success of triplet networks in the visible spectrum, but adapted for cross-spectral scenarios, where, for each matching pair, there are always two possible non-matching patches: one for each spectrum. Experimental evaluations on a public cross-spectral VIS-NIR dataset show that the proposed approach improves on the state of the art. Moreover, the proposed technique can also be used in mono-spectral settings, achieving performance similar to triplet network descriptors while requiring less training data.
Keywords: CNN; cross-spectral; descriptor; infrared
Year: 2017 PMID: 28420142 PMCID: PMC5424750 DOI: 10.3390/s17040873
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The proposed network architecture. It consists of four copies of the same CNN, which accept as input two different correctly matched cross-spectral image pairs (MP1 and MP2). The network computes the loss from multiple distance comparisons between the outputs of each CNN, selecting the matching pair with the largest distance and the non-matching pair with the smallest distance. Both cases are then used in the network's backpropagation. This can be seen as positive and negative mining.
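The mining step described above can be sketched as follows. This is an illustrative hinge-style formulation, not necessarily the paper's exact loss: given the descriptors of the two matched cross-spectral pairs, it keeps the hardest positive (largest matching distance) and the hardest negative (smallest non-matching distance). The function name and margin are our own, for illustration only.

```python
import numpy as np

def quadruplet_mining_loss(v1, n1, v2, n2, margin=1.0):
    """Illustrative sketch of quadruplet positive/negative mining.
    (v1, n1) and (v2, n2) are descriptors of two correctly matched
    visible/NIR pairs produced by the four CNN copies."""
    d = lambda a, b: np.linalg.norm(a - b)
    # Matching distances: each descriptor against its cross-spectral counterpart.
    pos = max(d(v1, n1), d(v2, n2))   # matching pair with the largest distance
    # Non-matching distances: descriptors taken from different pairs.
    neg = min(d(v1, n2), d(v2, n1))   # non-matching pair with the smallest distance
    # Hinge: push the hardest positive below the hardest negative by a margin.
    return max(0.0, margin + pos - neg)
```

Only the two mined distances contribute to the gradient, which is what the caption refers to as positive and negative mining.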
Figure 2. VIS-NIR cross-spectral image pairs; top images are from the visible spectrum and bottom images from the near-infrared spectrum.
Number of cross-spectral image pairs per category in the VIS-NIR patch dataset used to train and evaluate our work.
| Category | # Cross-Spectral Pairs |
|---|---|
| country | 277,504 |
| field | 240,896 |
| forest | 376,832 |
| indoor | 60,672 |
| mountain | 151,296 |
| old building | 101,376 |
| street | 164,608 |
| urban | 147,712 |
| water | 143,104 |
Figure 3. Image patches from the VIS-NIR training set. The first row corresponds to grayscale images from the visible spectrum, and the second row to NIR images. (a,b): non-matching pairs; (c,d): correctly matched pairs.
Figure 4. PN-Net training triplet architecture.
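For context, PN-Net trains on triplets with a SoftPN-style loss; the sketch below is our hedged reading of that loss (see Balntas et al. for the exact formulation). Both descriptors of a matching pair are compared against the negative, and the harder of the two negative distances is used.

```python
import numpy as np

def softpn_loss(p1, p2, n):
    """Hedged sketch of a SoftPN-style triplet loss, as used to train PN-Net.
    p1 and p2 are descriptors of a matching pair; n is a non-matching patch."""
    d = lambda a, b: np.linalg.norm(a - b)
    d_pos = d(p1, p2)                  # matching distance
    d_neg = min(d(p1, n), d(p2, n))    # hardest of the two negative distances
    e_pos, e_neg = np.exp(d_pos), np.exp(d_neg)
    s = e_pos + e_neg
    # Drive the softmax of the positive distance toward 0 and of the negative toward 1.
    return (e_pos / s) ** 2 + (e_neg / s - 1.0) ** 2
```

The quadruplet setup of Q-Net extends this idea to two matched pairs, which yields the two extra negative combinations mentioned in the abstract.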
Average FPR95 for each category.
| Train seq. | PN-Net Gray | PN-Net NIR | PN-Net Random |
|---|---|---|---|
| country | 11.79 | 11.63 | |
| field | 17.84 | 16.56 | |
| forest | 36.00 | 32.47 | |
| indoor | 48.21 | 47.26 | |
| mountain | 29.35 | 26.29 | |
| old building | 29.22 | 27.69 | |
| street | 18.23 | 16.73 | |
| urban | 36.61 | 33.35 | |
| water | 18.16 | 17.76 | |
| average | 26.84 | 25.84 | |
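FPR95, the metric reported throughout, is the false positive rate at 95% true positive rate. A common way to compute it from descriptor distances is sketched below (the percentile interpolation and the use of a non-strict threshold are our assumptions):

```python
import numpy as np

def fpr95(dist_match, dist_nonmatch):
    """False positive rate at 95% recall. Matching is by descriptor distance,
    so the threshold is set where 95% of the matching pairs are accepted,
    and we count the non-matching pairs that fall below it."""
    thr = np.percentile(dist_match, 95)              # accepts 95% of true matches
    return float(np.mean(np.asarray(dist_nonmatch) <= thr))
```

Lower FPR95 is better: fewer non-matching pairs are confused for matches at the operating point where almost all true matches are recovered.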
Figure 5. Q-Net training quadruplet architecture.
Q-Net layer descriptions.
| Layer | Description | Kernel | Output Dim |
|---|---|---|---|
| 1 | Convolution | 7 × 7 | 32 × 26 × 26 |
| 2 | Tanh | - | 32 × 26 × 26 |
| 3 | MaxPooling | 2 × 2 | 32 × 13 × 13 |
| 4 | Convolution | 6 × 6 | 64 × 8 × 8 |
| 5 | Tanh | - | 64 × 8 × 8 |
| 6 | Linear | - | 256 |
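The spatial dimensions in the table can be traced from the input patch size, which the 26 × 26 output of the first layer implies is 32 × 32 (32 − 7 + 1 = 26), assuming valid convolutions with stride 1:

```python
def qnet_output_shapes(in_size=32):
    """Trace the (channels, height, width) shapes through the Q-Net layers,
    assuming 32x32 input patches, valid convolutions, and stride 1."""
    shapes = []
    s = in_size - 7 + 1           # 7x7 valid convolution: 32 -> 26
    shapes.append((32, s, s))     # conv1 + tanh
    s = s // 2                    # 2x2 max pooling: 26 -> 13
    shapes.append((32, s, s))
    s = s - 6 + 1                 # 6x6 valid convolution: 13 -> 8
    shapes.append((64, s, s))     # conv2 + tanh
    shapes.append((64 * s * s,))  # flatten: 4096 inputs to the linear layer
    shapes.append((256,))         # final 256-D descriptor
    return shapes
```

The flattened 64 × 8 × 8 = 4096 activations feed the final linear layer, giving the 256-dimensional output descriptor.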
Figure 6. FPR95 performance on the VIS-NIR scene dataset for Q-Net 2P-4N using different descriptor sizes ((a) 64; (b) 128; (c) 256 and (d) 512). Shorter bars indicate better performance. Standard deviations are shown as segments on top of the bars.
FPR95 performance on the VIS-NIR scene dataset. Each network, i.e., siamese-L2, PN-Net and Q-Net, was trained on the country sequence and tested on the other eight sequences, as in [9]. Lower values indicate better performance. Standard deviations are given in brackets.
| Descriptor/Network | Field | Forest | Indoor | Mountain | Old Building | Street | Urban | Water | Mean |
|---|---|---|---|---|---|---|---|---|---|
| EHD | |||||||||
| LGHD | |||||||||
| siamese-L2 | |||||||||
| PN-Net RGB | |||||||||
| PN-Net NIR | |||||||||
| PN-Net Random | |||||||||
| Q-Net 2P-4N (ours) | |||||||||
| PN-Net Random DA | |||||||||
| Q-Net 2P-4N DA (ours) |
Figure 7. ROC curves for the different descriptors evaluated on the VIS-NIR dataset. For Q-Net and PN-Net, we selected the network with the best performance. Each subfigure shows the result in one of the eight tested categories of the dataset.
Matching results on the multi-view stereo correspondence dataset. Evaluations were made on the 100 K image-pair ground truth recommended by the authors. Results correspond to FPR95; lower values indicate better performance. Standard deviations are given in brackets.
| Training | Notredame | Liberty | Notredame | Yosemite | Yosemite | Liberty | |
|---|---|---|---|---|---|---|---|
| Testing | Yosemite | Yosemite | Liberty | Liberty | Notredame | Notredame | |
| Descriptor | | | | | | | Mean |
| siamese-L2 | | | | | | | |