
Relative Distribution Entropy Loss Function in CNN Image Retrieval.

Pingping Liu1,2,3, Lida Shi4, Zhuang Miao1, Baixin Jin1, Qiuzhan Zhou5.   

Abstract

Convolutional neural networks (CNNs) are the mainstream solution in the field of image retrieval. Deep metric learning has been introduced into image retrieval, with a focus on the construction of pair-based loss functions. However, most pair-based loss functions in metric learning consider only a common vector similarity (such as the Euclidean distance) between the final image descriptors, while neglecting other distribution characteristics of these descriptors. In this work, we propose relative distribution entropy (RDE) to describe the internal distribution attributes of image descriptors. We combine relative distribution entropy with the Euclidean distance to obtain the relative distribution entropy weighted distance (RDE-distance). Moreover, the RDE-distance is fused with the contrastive loss and triplet loss to build the relative distribution entropy loss functions. The experimental results demonstrate that our method attains state-of-the-art performance on most image retrieval benchmarks.


Keywords:  Euclidean distance; deep metric learning; image retrieval; relative entropy

Year:  2020        PMID: 33286094      PMCID: PMC7516778          DOI: 10.3390/e22030321

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

In recent years, newly proposed image retrieval algorithms based on convolutional neural networks (CNN) [1,2,3,4,5] have greatly improved retrieval accuracy and efficiency, and they have become the mainstream direction of academic research on image retrieval. In the beginning, CNNs were applied only to image classification tasks [6,7,8]. However, image classification is different from image retrieval. Krizhevsky [9] flexibly applied a convolutional neural network to image retrieval; AlexNet [9] is designed for image classification and retrieval. Subsequently, Noh [10] proposed a local feature descriptor, called DELF (Deep Local Feature), which is suitable for large-scale image retrieval. A large number of studies have shown that the output features of the convolutional layers of a neural network have excellent discrimination and scalability. More recently, image retrieval algorithms based on convolutional neural networks have emerged one after another. These methods fall into three main categories: fine-tuned networks, pre-trained networks, and hybrid networks. Among them, hybrid networks are less efficient in image retrieval tasks, and pre-trained networks are widely used. A fine-tuned network initializes the network architecture from a pre-trained image classification model and then adjusts the parameters according to the retrieval task. It usually optimizes the network parameters by training with a metric learning objective. Metric learning aims to learn an embedding space in which the embedded vectors of positive samples are encouraged to be closer, while negative samples are pushed apart from each other [11,12,13].
Recently, many deep metric learning methods have been based on pairs of samples, such as contrastive loss [14], triplet loss [15], quadruplet loss [16], lifted structured loss [17], N-pairs loss [18], binomial deviance loss [19], histogram loss [20], angular loss [21], distance weighted margin-based loss [22], and hierarchical triplet loss (HTL) [23]. Most of the above-mentioned loss functions consider only a common vector similarity (such as the Euclidean distance) between the final image descriptors. However, the Euclidean distance alone is not accurate enough to measure the similarity between features, because it does not capture the difference in the internal spatial distribution [24] of the image pair. As illustrated in Figure 1, each rectangle represents a feature descriptor obtained after the convolution of the neural network. The Euclidean distances between the descriptors of different images may be equal or close, while the spatial distributions of the descriptors differ greatly. In the information processing field, entropy [25] is an effective measurement to reflect the distribution information. Relative entropy [26,27,28] is a measure of the distance between two random distributions, which is equivalent to the difference of information entropy between two distributions. Inspired by this, we introduce the idea of relative entropy into image retrieval.
Figure 1

Each rectangle represents a feature descriptor obtained after the convolution of the neural network. Different colors of small squares represent different feature intensities in the descriptor. D1 denotes the Euclidean distance between descriptor n and descriptor q, and D2 denotes the Euclidean distance between descriptor p and descriptor q. D1 is approximately equal to D2, but the internal spatial distributions of p and n are obviously different.

To solve the key problem mentioned above, we propose relative distribution entropy (RDE) to describe the distribution attributes of image descriptors. We combine the relative distribution entropy with the Euclidean distance [29] to build the relative distribution entropy weighted distance (RDE-distance). We fuse the RDE-distance into the contrastive loss and triplet loss to obtain the relative distribution entropy contrastive loss and the relative distribution entropy triplet loss. We call them the relative distribution entropy loss functions. To be more specific, we make three contributions, as follows: Firstly, we propose a loss function modified by relative distribution entropy, called the relative distribution entropy loss function. The relative distribution entropy loss function accounts for two aspects: (1) the Euclidean distance between the descriptors, and (2) the difference in internal distribution state between the descriptors. The core idea of the algorithm is illustrated in Figure 2. We combine the Euclidean distance with the relative distribution entropy to obtain the relative distribution entropy weighted distance (RDE-distance), which increases the discrimination between the image pair. We replace the Euclidean distance in the original contrastive loss and triplet loss with the RDE-distance to obtain the relative distribution entropy loss function.
Figure 2

The core idea of relative distribution entropy (RDE)-loss. D represents the Euclidean distance between two image descriptors. RDE represents the relative distribution entropy between two image descriptors. RDE-distance is a new metric that combines the Euclidean distance with the relative distribution entropy. We call it the relative distribution entropy weighted distance, which can enhance the discrimination of image descriptors.

Secondly, during the experiment, we use GeM pooling [30] and whitening post-processing [31]. Thirdly, we employ the fine-tuned network [30] to perform our experiments on different datasets to verify the effectiveness of our proposed method. The organization of this work is as follows. Section 2 introduces the related work. Section 3 mainly introduces our proposed relative distribution entropy loss function. Specific experimental results and analyses are presented in Section 4. Section 5 is a summary of the contributions and methods of this paper.

2. Related Work

2.1. Deep Metric Learning

Deep metric learning (DML) has become one of the most interesting research areas in machine learning. Metric learning aims to learn an embedding space in which images are converted into embedding vectors. The distance between embedding vectors of positive samples is small, and the distance between embedding vectors of negative samples is large [11,12,13]. Deep metric learning now plays a vital role in many areas, such as face recognition [32], pedestrian recognition [33], fine-grained retrieval [34], image retrieval [35], target tracking [36], and multimedia retrieval [37]. We summarize recent metric learning methods in the following subsections. Here, we introduce deep metric learning into image retrieval. Deep metric learning aims to learn a discriminative feature embedding $f(x)$ for input image $x$. In other words, $f(x)$ is the descriptor of the image. Formally, we define the Euclidean distance between two descriptors as $D(x_i, x_j) = \|f(x_i) - f(x_j)\|_2$.

2.1.1. Contrastive Loss

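The siamese network [14] is a typical pair-based method. Its embedding is obtained through contrastive loss. The purpose of contrastive loss is to reduce the Euclidean distance between positive samples and increase the Euclidean distance between negative samples. The contrastive loss function is given in (1):

$$L(x_i, x_j) = \frac{1}{2}\Big(Y\,D(x_i, x_j)^2 + (1 - Y)\,\max\big(0,\ \tau - D(x_i, x_j)\big)^2\Big) \qquad (1)$$

Here, $D(x_i, x_j)$ is the Euclidean distance between the two descriptors, $Y = 1$ if the sample is a positive sample, and $Y = 0$ if the sample is a negative sample. $\tau$ is the margin: a negative pair contributes to the loss only while its distance is below $\tau$, which keeps negative distances above this threshold.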
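The pairwise behaviour described above can be sketched in a few lines of plain Python. This is a minimal illustration of a standard contrastive loss, not the authors' implementation; the function names and the default margin of 0.7 are assumptions made for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(query, sample, is_positive, margin=0.7):
    """Contrastive loss for a single (query, sample) pair.

    `margin` is a hypothetical default chosen for illustration only.
    """
    d = euclidean(query, sample)
    if is_positive:
        # Positive pair: penalize any remaining distance.
        return 0.5 * d ** 2
    # Negative pair: penalize only if the pair is closer than the margin.
    return 0.5 * max(0.0, margin - d) ** 2
```

A negative pair that is already farther apart than the margin yields zero loss, so only "hard" negatives influence training.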

2.1.2. Triplet Loss

However, the above model focuses only on the intra-class similarity of the samples. To solve this problem, the triplet loss function [15] was proposed. Each triplet comprises a positive sample and a negative sample sharing the same query. Triplet loss aims to learn an embedding space in which the distance between the query and the negative sample is greater than the distance between the query and the positive sample. The triplet loss function is given in (2):

$$L(q, p, n) = \max\big(0,\ D(q, p) - D(q, n) + m\big) \qquad (2)$$

Here, $D(q, p)$ represents the Euclidean distance between the descriptors of the positive sample and the query. Similarly, $D(q, n)$ represents the Euclidean distance between the descriptors of the negative sample and the query. $m$ is the margin by which the negative distance is required to exceed the positive distance.
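As a small worked example of the triplet constraint, the sketch below computes a hinge-style triplet loss on plain Python lists. It assumes unsquared Euclidean distances and a default margin of 0.5; both choices, as well as the function names, are illustrative assumptions rather than details confirmed by the text.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(query, positive, negative, margin=0.5):
    """Hinge-style triplet loss for one (query, positive, negative) triplet."""
    d_pos = euclidean(query, positive)
    d_neg = euclidean(query, negative)
    # Zero once the negative is farther than the positive by at least `margin`.
    return max(0.0, d_pos - d_neg + margin)
```

When the positive and negative are equally far from the query, the loss equals the margin, which is the gap the optimizer still has to open.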

2.1.3. N-pair Loss

The triplet loss function compares the query with only one negative sample and ignores the negative samples of other classes during the learning stage. As a result, the embedding vector of the query is only pushed away from the selected negative samples, and there is no guarantee that it keeps a large distance from the other, non-selected negative samples. N-pair loss [18] addresses this problem. Unlike the triplet loss function, N-pair loss considers the relationship between the query sample and negative samples of different classes within a mini-batch. The N-pair loss function is given in (3):

$$L\big(x, x^{+}, \{x_i\}_{i=1}^{N-1}\big) = \log\Big(1 + \sum_{i=1}^{N-1} \exp\big(f(x)^{\top} f(x_i) - f(x)^{\top} f(x^{+})\big)\Big) \qquad (3)$$

Each training tuple of N-pair loss is composed of $N + 1$ samples: $\{x, x^{+}, x_1, \ldots, x_{N-1}\}$. $x^{+}$ is the positive sample. $x_1, \ldots, x_{N-1}$ are the negative samples.
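To make the difference from the triplet loss concrete, the following sketch evaluates an N-pair style loss for one query against one positive and several negatives at once. It follows the commonly used log-sum-exp form over inner-product similarities; the function names are hypothetical and the code is an illustration, not the authors' implementation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length descriptors."""
    return sum(a * b for a, b in zip(u, v))


def n_pair_loss(query, positive, negatives):
    """N-pair style loss: one positive compared against all negatives jointly."""
    s_pos = dot(query, positive)
    # Each negative contributes exp(similarity to negative - similarity to positive).
    total = sum(math.exp(dot(query, neg) - s_pos) for neg in negatives)
    return math.log(1.0 + total)
```

Because every negative in the tuple contributes to the same sum, a single update pushes the query away from all of them rather than from one selected negative.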

2.1.4. Lifted Structured Loss

The existing triplet loss methods cannot take full advantage of mini-batch SGD training. Lifted structured loss [17] calculates the loss based on all positive and negative sample pairs in the training set (mini-batch). Following [17], the lifted structured loss function can be written as Equation (4):

$$L = \frac{1}{2|\hat{P}|} \sum_{(i,j) \in \hat{P}} \max\big(0,\ L_{i,j}\big)^2, \qquad L_{i,j} = \max\Big(\max_{(i,k) \in \hat{N}} \big(m - D_{i,k}\big),\ \max_{(j,l) \in \hat{N}} \big(m - D_{j,l}\big)\Big) + D_{i,j} \qquad (4)$$

where $\hat{P}$ is the set of positive sample pairs in the training set, and $\hat{N}$ is the set of negative sample pairs in the training set. $m$ is the margin. Although these loss functions calculate the distance of descriptors between images, they neglect the difference in internal distribution between the image pair. In this work, we propose a new loss function, which is called the relative distribution entropy loss function. We use the relative distribution entropy to reflect the difference in the descriptor distribution, and add this difference to the loss function. In this way, the new loss function combines the Euclidean distance with the internal relative difference in the distribution state between the descriptors of the image pair.

2.2. Application of Spatial Information in Image Retrieval

The performance of image retrieval has been greatly improved in recent years through the use of deep feature representations. However, most existing methods aim to retrieve images that are visually similar or semantically relevant to the query, without considering the spatial information. Earlier, some researchers attempted to add spatial information to image retrieval algorithms to improve retrieval performance. Mehmood [38] proposed adding a local region and a global histogram to the BoW algorithm and combining them as the final descriptor. Krapac [39] used the BoW descriptor to encode the spatial part of the image, which improved the performance of image classification. Koniusz [40] used spatial coordinate coding to represent and simplify the spatial pyramid to provide more compact image features. Sanchez et al. [41] added the spatial position information of features into the descriptor, which made it robust to changes in the scale of retrieved objects and in the local area of the image. Liu [42] introduced the concept of spatial distribution entropy and combined it with the VLAD algorithm. These methods introduced spatial information into the image retrieval algorithm and obtained good results. However, they all aimed at improving traditional image descriptors. In this work, we attempt to add spatial information to the descriptor by optimizing the loss function of a deep convolutional network.

2.3. Pooling and Normalization

The feature maps generated by the deep retrieval framework reflect the color, texture, shape, and other characteristics of the image. Since the output of the convolutional neural network needs to be aggregated into a multi-dimensional feature descriptor before retrieval, the feature maps of the convolutional layer need to be further processed. The processed result should retain the main features of the image while reducing the parameters and calculations of the next layer. Babenko [43] proposed the mean-pooling method, which sums the values of each feature map to obtain an N-dimensional feature vector. Razavian [44] used max-pooling or mean-pooling on the feature maps, and the resulting MAC [44] descriptors were reduced in dimension with a series of normalization and PCA [45] whitening operations. Finally, the region feature vectors are summed to obtain a single image representation. In this work, we use the generalized mean-pooling [30]. We use $X$ to represent the input to the pooling layer and $f$ to represent the pooling layer output. The mentioned pooling methods can be expressed as follows:

Max pooling (MAC vector): $f^{(m)} = [f_1^{(m)} \ldots f_k^{(m)} \ldots f_K^{(m)}]^{\top}, \quad f_k^{(m)} = \max_{x \in X_k} x$

Average pooling (SPoC vector [43]): $f^{(a)} = [f_1^{(a)} \ldots f_k^{(a)} \ldots f_K^{(a)}]^{\top}, \quad f_k^{(a)} = \frac{1}{|X_k|} \sum_{x \in X_k} x$

Generalized mean pooling (GeM [30]): $f^{(g)} = [f_1^{(g)} \ldots f_k^{(g)} \ldots f_K^{(g)}]^{\top}, \quad f_k^{(g)} = \Big(\frac{1}{|X_k|} \sum_{x \in X_k} x^{p_k}\Big)^{1/p_k}$

where $K$ is the number of feature maps, $k \in \{1, \ldots, K\}$ is the channel index, and $|X_k|$ is the number of feature values in the $k$-th channel feature map. The descriptor finally consists of a single value per feature map. $p_k$ is the pooling parameter, which can be manually set or learned. The superscript of $f$ denotes the pooling method: $m$, $a$, and $g$ represent max pooling, average pooling, and generalized mean pooling, respectively. Max-pooling and mean-pooling are special cases of generalized mean-pooling. When $p_k = 1$, it is mean-pooling. When $p_k \to \infty$, it is max-pooling.
Generalized mean-pooling with learnable parameters can better adapt to the network and improve retrieval performance. In this work, we use $\ell_2$ normalization to balance the effect of the range of values as follows:

$$\hat{x}_i = \frac{x_i}{\|x\|_2}$$

where $x$ represents a vector, $\|x\|_2$ represents the $\ell_2$ norm of the vector, and $x_i$ represents the value of the $i$-th dimension of the vector.
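The relationship between the three pooling methods can be checked numerically. The sketch below pools one channel of activations with a generalized mean and then normalizes a descriptor to unit length; the function names, the default exponent of 3, and the small clamping constant are assumptions introduced for this example.

```python
import math


def gem_pool(channel, p=3.0, eps=1e-6):
    """Generalized mean of the activations of a single feature map.

    `channel` is a flat list of activations. Values are clamped to `eps`
    (an assumption of this sketch) so that fractional powers stay defined.
    """
    clamped = [max(x, eps) for x in channel]
    return (sum(x ** p for x in clamped) / len(clamped)) ** (1.0 / p)


def l2_normalize(vec, eps=1e-12):
    """Scale a descriptor to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


# One pooled value per channel gives the descriptor, which is then normalized.
feature_maps = [[1.0, 2.0, 3.0], [0.5, 0.5, 2.0]]
descriptor = l2_normalize([gem_pool(ch) for ch in feature_maps])
```

With `p=1.0` the function reduces to the arithmetic mean, and for large `p` the result approaches the channel maximum, matching the two special cases named in the text.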

2.4. Whitening

In large-scale image retrieval applications, high-dimensional global image descriptors typically require PCA to reduce the dimensions of the features for the next step. Jegou and Chum [45] studied the influence of PCA on the BoW and VLAD descriptor representations, and highlighted the use of multiple visual dictionaries for dimensionality reduction, thereby reducing the information loss of the dimensionality reduction process. In this work, whitening is used as a post-processing step. The whitening method is the linear discriminant projection proposed by Mikolajczyk and Matas [31]. The processing steps are divided into two parts. In the first part, the intra-class image feature vectors are whitened. The whitening part is the inverse of the square root of the covariance matrix of the intra-class image pairs (matched image pairs), $C_S^{-1/2}$:

$$C_S = \sum_{Y(i,j)=1} \big(f(i) - f(j)\big)\big(f(i) - f(j)\big)^{\top}$$

where $f(i)$ and $f(j)$ are the descriptors of the images after pooling, $Y(i,j) = 1$ represents a matched image pair, and $C_S$ represents the covariance matrix of the matched image pairs. In the second part, the inter-class image features are rotated. The rotating part is given by the eigenvectors of the covariance matrix of the inter-class image pairs (non-matched image pairs) in the whitened space, $\mathrm{eig}\big(C_S^{-1/2} C_D C_S^{-1/2}\big)$:

$$C_D = \sum_{Y(i,j)=0} \big(f(i) - f(j)\big)\big(f(i) - f(j)\big)^{\top}$$

where $f(i)$ and $f(j)$ are the descriptors of the images after pooling, $Y(i,j) = 0$ represents a non-matched image pair, and $C_D$ represents the covariance matrix of the non-matched image pairs. Then, we apply the projection $P = C_S^{-1/2}\,\mathrm{eig}\big(C_S^{-1/2} C_D C_S^{-1/2}\big)$ as $P^{\top}\big(f(i) - \mu\big)$, where $\mu$ is the mean GeM [30] vector used to perform centering.

3. Method Overview

3.1. Calculation of Relative Distribution Entropy

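We first introduce the concept of relative distribution entropy. Relative distribution entropy can better represent the distribution difference between two descriptors of image samples. The relative distribution entropy is derived from the relative entropy. Relative entropy can be computed as follows:

$$D_{KL}(P \,\|\, Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$

$P$ and $Q$ are two probability distributions on the random variable $X$. From this, we can get the equation of relative distribution entropy (RDE):

$$RDE(x_i, x_j) = \sum_{k=1}^{\beta} p_k\big(f(x_i)\big) \log \frac{p_k\big(f(x_i)\big)}{p_k\big(f(x_j)\big)}, \qquad p_k\big(f(x_i)\big) = \frac{n_k\big(f(x_i)\big)}{N}$$

where $x_i, x_j$ represent the images. $f(x_i)$ represents the descriptor of image $x_i$. It is a normalized vector. $RDE(x_i, x_j)$ is the relative distribution entropy of the two images. In this work, we use histograms to describe the distribution of image descriptors. $\beta$ is the number of bins and is an adjustable parameter. $p_k\big(f(x_i)\big)$ is the probability of the $k$-th bin, that is, the fraction of the descriptor components $n_k\big(f(x_i)\big)$ that fall into that bin. The equation for $p_k\big(f(x_j)\big)$ is the same as for $p_k\big(f(x_i)\big)$. $N$ is the dimension of the descriptors.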
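The histogram-based computation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the histogram range of [-1, 1] (reasonable for unit-length descriptors), the additive smoothing constant that avoids division by zero for empty bins, and all function names are choices made for this example.

```python
import math


def histogram_distribution(descriptor, bins=30, lo=-1.0, hi=1.0, eps=1e-6):
    """Fraction of descriptor components falling into each of `bins` bins.

    `lo`, `hi`, and the smoothing constant `eps` are assumptions of this sketch.
    """
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in descriptor:
        k = int((x - lo) / width)
        k = min(max(k, 0), bins - 1)  # clamp so that x == hi lands in the last bin
        counts[k] += 1
    total = len(descriptor) + eps * bins
    return [(c + eps) / total for c in counts]


def relative_distribution_entropy(desc_a, desc_b, bins=30):
    """Relative entropy between the value histograms of two descriptors."""
    p = histogram_distribution(desc_a, bins)
    q = histogram_distribution(desc_b, bins)
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))


# Two unit-length descriptors with very different value distributions.
a = [0.5, 0.5, 0.5, 0.5]
b = [0.9, 0.1, -0.3, 0.3]
rde_ab = relative_distribution_entropy(a, b, bins=4)
```

The measure is zero when both descriptors have the same value histogram and grows as the histograms diverge. Note that it is not symmetric in its two arguments, a property it inherits from relative entropy.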

3.2. Relative Distribution Entropy Loss Function

Above, we introduced the calculation of relative distribution entropy. Next, we show how to add the relative distribution entropy into the loss function. We add the relative distribution entropy to contrastive loss [14] and triplet loss [15] to build relative distribution entropy loss functions. Firstly, we introduce the fusion process of contrastive loss and relative distribution entropy. The contrastive loss function is shown in (14):

$$L(q, i) = \frac{1}{2}\Big(Y\,D(q, i)^2 + (1 - Y)\,\max\big(0,\ \tau - D(q, i)\big)^2\Big) \qquad (14)$$

where $D(q, i)$ represents the Euclidean distance between the descriptors of the query $q$ and the sample $i$. $Y = 1$ if the sample $i$ is a positive sample, and $Y = 0$ if the sample $i$ is a negative sample. $\tau$ is the margin. Then, we add the relative distribution entropy to $D(q, i)$ to get the weighted distance $D_{RDE}(q, i)$, where $\alpha$ is the weighting parameter of the relative distribution entropy. As illustrated in Equation (16), we get the new distance metric. We call it the relative distribution entropy weighted distance (RDE-distance). We substitute $D_{RDE}(q, i)$ into (14) to get the new contrastive loss function, as shown in Equation (17):

$$L_{RDE}(q, i) = \frac{1}{2}\Big(Y\,D_{RDE}(q, i)^2 + (1 - Y)\,\max\big(0,\ \tau - D_{RDE}(q, i)\big)^2\Big) \qquad (17)$$

Similarly, we introduce the fusion process of triplet loss and relative distribution entropy. The triplet loss function is shown in (18):

$$L(q, p, n) = \max\big(0,\ D(q, p) - D(q, n) + m\big) \qquad (18)$$

$D(q, p)$ is the Euclidean distance between the descriptors of the positive sample and the query. Similarly, $D(q, n)$ is the Euclidean distance between the descriptors of the negative sample and the query, and $m$ is the margin. Then, we add the relative distribution entropy to $D(q, p)$ and $D(q, n)$ to get $D_{RDE}(q, p)$ and $D_{RDE}(q, n)$. We substitute these new distances for the Euclidean distances in Equation (18) and get the new triplet loss.
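The substitution step can be illustrated with a short sketch. The exact fusion formula is not recoverable from the text here, so the code assumes a simple linear form, distance plus alpha times RDE; this form, the function names, and the hinge-style triplet are assumptions of the example and may differ from the authors' definition. The default values (alpha 0.9, margins 0.9 and 0.5) are taken from the settings reported for VGG in Section 4.

```python
def rde_distance(d, rde, alpha=0.9):
    """Hypothetical linear fusion of Euclidean distance `d` and entropy `rde`.

    ASSUMPTION: d + alpha * rde is used for illustration only.
    """
    return d + alpha * rde


def rde_contrastive_loss(d, rde, is_positive, margin=0.9, alpha=0.9):
    """Contrastive loss with the Euclidean distance replaced by the fused distance."""
    d_w = rde_distance(d, rde, alpha)
    if is_positive:
        return 0.5 * d_w ** 2
    return 0.5 * max(0.0, margin - d_w) ** 2


def rde_triplet_loss(d_pos, rde_pos, d_neg, rde_neg, margin=0.5, alpha=0.9):
    """Triplet loss with both distances replaced by their fused counterparts."""
    gap = rde_distance(d_pos, rde_pos, alpha) - rde_distance(d_neg, rde_neg, alpha)
    return max(0.0, gap + margin)
```

Under this assumed fusion, a negative that has the same Euclidean distance as the positive but a different internal value distribution produces a smaller triplet loss than it would under the Euclidean distance alone, which is the extra discrimination the section argues for.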

3.3. Network Architecture for Relative Distribution Entropy Loss Function

3.3.1. CNN Network Architecture

We construct a CNN to obtain the descriptor of the image. We use only the convolutional layers and discard the fully connected layers. The convolutional layers extract the features of images. The feature maps obtained by the convolutional layers are vectorized by the GeM pooling operation [30]. If whitening is performed, it is applied after the pooling layer. Whitening reduces the correlation between features and gives the features the same variance (the covariance matrix becomes the identity), which can greatly improve image retrieval performance. Here, we use the whitening method of [31]. The last step is the normalization operation. The purpose of normalization is to limit the processed data to a certain range (such as [0,1] or [−1,1]), thereby eliminating the adverse effects caused by singular sample data. The network is initialized from a model pre-trained on ImageNet [46]. Furthermore, we adopt network architectures such as ResNet [47] and AlexNet [9], and these two networks are also pre-trained on ImageNet [46]. The CNN network architecture is shown in Figure 3.
Figure 3

Convolutional neural network (CNN) network architecture.

3.3.2. Architecture of Training

The training procedure uses multiple network branches that share the same weights. The architecture of the CNN was introduced in the previous part. We add our newly proposed relative distribution entropy to the previous loss functions, as shown in Figure 4 and Figure 5:
Figure 4

Training process using relative distribution entropy contrastive loss.

Figure 5

Training process using relative distribution entropy triplet loss.

In Figure 4 and Figure 5, D represents the Euclidean distance between the two descriptors, and RDE represents the relative distribution entropy between the two descriptors. RDE-distance is the relative distribution entropy weighted distance obtained after the fusion of D and RDE.

4. Experiments and Evaluation

In this section, we discuss the implementation details of training and testing. Also, we analyze the experimental results and compare them with previous work.

4.1. Training Datasets

In this work, experimental training data are distilled from Retrieval-SFM-120K [48], which contains 7.4 million images. After clustering [49], we get about 20,000 images as the query seed. The structure-from-motion (SfM) algorithm constructs 1474 3D models from the training datasets. We removed the duplicates and retained 713 of them, which contained more than 163,000 different images. There are 91,642 training images in the dataset, and 98 cluster images that are the same or almost the same as the test dataset. Through the minimum hash and spatial verification methods mentioned in the clustering process, about 20,000 images are selected as query images, 181,697 pairs of positive images and 551 training clusters, including more than 163,000 clusters [50] from the original dataset. The dataset contains all images from the Oxford 5k [51] and Paris 6k [52] datasets.

4.2. Training Configurations

In the experiments, we use the PyTorch deep learning framework to train the deep network model. We use ResNet [47], VGG [53] and AlexNet [9], which are all pre-trained on ImageNet [46]. In the experiment of relative distribution entropy contrastive loss, ResNet [47] and VGG [53] are trained using the Adam learning strategy [54], while AlexNet [9] is trained using SGD. Our initial learning rate for Adam is , and the margins for ResNet and VGG are 0.95 and 0.9. We use an initial learning rate equal to for SGD, and the margin for AlexNet is 0.75. In the experiment of triplet loss, we also use ResNet [47], VGG [53], and AlexNet [9] to initialize the network. They are trained using the Adam learning strategy [54]. Our initial learning rate for Adam is . The margin for ResNet and VGG is 0.5, and the margin for AlexNet is 0.3. The size of the training image is not more than 362 × 362 while maintaining the aspect ratio of the original image. The experimental environment is an Intel(R) i7-8700 processor, GPU with 12GB of memory, NVIDIA(R) 2080Ti graphics card, driver version 419.**. The operating system is Ubuntu 18.04 LTS, with PyTorch version v1.0.0, CUDA version 10.0, and CUDNN version 7.5. The time spent in each training cycle with our method on VGG, ResNet, and AlexNet is 0.48, 0.72, and 0.22 hours, respectively. During the testing phase, testing the VGG, ResNet, and AlexNet networks takes 620, 990, and 277 seconds, respectively. There are subtle differences between different test sets. With the same computing power, the training time of our method is almost the same as that of other methods [30].

4.3. Datasets and Evaluation of Image Retrieval

We conduct our testing experiments on the following frequently used benchmark datasets. Herein, we give the details of these datasets. Oxford5k [51] is a widely used landmark dataset consisting of 5062 building images from the Flickr dataset. It contains 11 famous landmarks in the Oxford area, with 5 query images per landmark, giving 55 query images in total. Paris6k [52] contains 6392 images and is also one of the widely used datasets in the field of image retrieval. It collects many landmark buildings in Paris, and most of these images are from tourists. Similar to the Oxford5k dataset, it also has 55 query images. In addition, we add 100k distractor images to the Oxford5k and Paris6k datasets to obtain Oxford105k [51] and Paris106k [52]. In the experiments, we use mean average precision (mAP) to measure the performance of image retrieval.
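For readers unfamiliar with the metric, the sketch below shows a simplified way to compute mean average precision from ranked result lists. It assumes each query's ranking is given as a list of booleans (relevant or not) that contains every relevant image, and it ignores the "junk" image handling of the official Oxford and Paris evaluation protocol; the function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two toy queries: relevant images at ranks 1 and 3, and at rank 2.
score = mean_average_precision([[True, False, True], [False, True]])
```

The score rewards rankings that place relevant images early, so moving a relevant image from rank 3 to rank 2 raises the average precision of that query.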

4.4. Results and Analysis

4.4.1. The Adjustment Process of Hyperparameter

In this experiment, two hyperparameters, α and β, are adjusted to obtain the best performance. α is the weight of the relative distribution entropy. As mentioned in Section 3.2, when fusing the Euclidean distance with RDE, the weight α affects the ratio of the Euclidean distance and the relative distribution entropy in the resulting RDE-distance. As mentioned in Section 3.1, β is the total number of bins in the histogram used to calculate RDE. β affects how finely the differences in internal spatial distribution between two descriptors are resolved. However, a large β also increases the computational burden. The ability of RDE-distance to distinguish between two descriptors is therefore determined by the two factors α and β. The values of these two hyperparameters have a great impact on our experimental results. During our experiments, we adjust them to get the best performance. We use the AlexNet and VGG networks with GeM pooling [30] for tuning. Representative results are shown in Table 1.
Table 1

Experimental results of hyperparameter comparison in the relative distribution entropy contrastive loss function. The best results would be highlighted in bold.

Network   α      β     Oxford5k   Oxford5k(W)   Paris6k   Paris6k(W)
AlexNet   0.50   10    58.10      67.60         71.64     79.60
AlexNet   0.75   20    60.87      67.19         75.33     79.43
AlexNet   0.85   25    60.79      67.93         75.60     79.59
AlexNet   0.90   30    61.32      68.22         75.29     80.07
AlexNet   1.00   50    60.27      67.72         74.88     80.10
VGG       0.85   30    84.62      87.83         82.40     88.01
VGG       0.85   100   76.18      83.04         81.71     87.11
VGG       0.90   30    85.09      88.00         82.69     88.12
Here, we take some representative results. In the relative distribution entropy contrastive loss, we take the value of α within 0.5–1, and the results show that the performance is the best when α = 0.9. Additionally, we take the value of β within 10–100. When the value of β is large, the effect of adding relative distribution entropy is not obvious. After a large number of experiments, we can draw the following conclusions. When α = 0.9 and β = 30, our method achieves the best results on Oxford5k and Paris6k. With whitening, the performance reaches 88.00% and 88.12% on VGG. On AlexNet, the performance reaches 68.22% and 80.07%. Therefore, we set β to 30 and α to 0.9 in ResNet and VGG as our final hyperparameters.

4.4.2. Comparison of MAC, SPoC, and GeM

In this section, we combine the relative distribution entropy contrastive loss function with the current most advanced pooling methods, GeM [30], MAC [44], and SPoC [43], for end-to-end training. In this experiment, we use AlexNet for training. The experimental results are shown in Table 2.
Table 2

Comparative results of different pooling methods on AlexNet. The best results would be highlighted in bold.

Pooling     Oxford5k   Oxford5k(W)   Paris6k   Paris6k(W)
SPoC [43]   41.83      55.34         55.49     68.61
MAC [44]    47.50      55.95         62.16     71.30
GeM [30]    60.79      68.22         75.29     80.07
The conclusions drawn from Table 2 are as follows. The results in the table indicate that GeM pooling [30] on AlexNet is superior to the other two pooling methods. We get the results of 60.79%, 68.22%, 75.29% and 80.07%, which are the maximum values on the different datasets. In the following experiments, we use the GeM pooling [30] method for training.

4.4.3. Comparison of Relative Distribution Entropy Triplet Loss and Triplet Loss

In this section, we present the experimental results of our method in triplet loss and compare them with the previous method [30]. We perform comparison tests on VGG and ResNet. Using the same pooling method, experimental steps, and network model, we compare the relative distribution entropy triplet loss with the traditional triplet loss. The comparison results are shown in Table 3.
Table 3

Performance comparison of relative distribution entropy triplet loss and triplet loss. The best results would be highlighted in bold.

Loss                Network   Oxford5k   Oxford5k(W)   Paris6k   Paris6k(W)
Triplet loss [30]   VGG       81.48      82.80         82.79     84.78
Ours                VGG       82.39      83.07         83.61     85.45
Triplet loss [30]   ResNet    81.49      85.33         87.70     91.11
Ours                ResNet    82.88      86.54         89.33     91.97
Table 3 indicates that on VGG our proposed method obtains the best performance on all these datasets, with 82.39%, 83.07%, 83.61%, and 85.45%. The same conclusion is obtained when we perform the experiments on ResNet, where we get 82.88%, 86.54%, 89.33%, and 91.97%, which is the best performance on each dataset. In this experiment, we combine the relative distribution entropy with the Euclidean distance into the relative distribution entropy weighted distance, which is a new metric. We put this new metric into the triplet loss function, and the experiments show that our method is effective, improving on the triplet loss baseline [30] in every setting of Table 3.

4.4.4. Comparison with State-of-Art

In this section, we compare the relative distribution entropy contrastive loss with the latest methods. The performance comparison is shown in Table 4. The results of other methods are taken from their papers. Table 4 shows that our proposed method attains better performance on multiple datasets. As shown in the table, we divide the existing methods into two categories: (1) those using fine-tuned networks (yes) and (2) those not using fine-tuned networks (no). When using the VGG network, compared with R-MAC [55], relative distribution entropy contrastive loss provides a significant improvement of +4.9% and +1.0% on the Oxford5k and Paris6k datasets, respectively. Compared to the latest release, our method also has performance improvements. When using ResNet, our experimental results achieve +0.6% growth compared to GeM [30] on Oxford5k. Our method also improves results on the large-scale Oxford105k dataset. When using the VGG network, our experimental results achieve +0.2% growth compared to GeM [30] on Oxford105k. When using the ResNet network, our experimental results achieve +0.3% growth compared to GeM [30] on Oxford105k. Our method shows more obvious performance improvements after adding re-ranking and query expansion. Under VGG, the gain over GeM+αQE [30] is +0.1% and +0.5% on the Oxford5k and Paris6k datasets, respectively. Under ResNet, our method achieves mAP of 91.7%, 89.7%, 96.0%, and 92.1% on the Oxford5k, Oxford105k, Paris6k, and Paris106k datasets, respectively, compared with 91.0%, 89.5%, 95.5%, and 91.9% for GeM+QE [30] (Table 4).
Table 4

Comparison of our method with the state-of-art image retrieval methods. The best results would be highlighted in bold.

Net      Method              F-tuned   Oxford5k   Oxford105k   Paris6k   Paris106k
VGG      MAC [44]            no        56.4       47.8         72.3      58.0
VGG      SPoC [43]           no        68.1       61.1         78.2      68.4
VGG      Crow [56]           no        70.8       65.3         79.7      72.2
VGG      R-MAC [52]          no        66.9       61.6         83.0      75.7
VGG      BoW-CNN [57]        yes       73.9       59.3         82.0      64.8
VGG      NetVLAD [58]        yes       71.6       -            79.7      -
VGG      Fisher [59]         yes       81.5       76.6         82.4      -
VGG      R-MAC [55]          yes       83.1       78.6         87.1      79.7
VGG      GeM [30]            yes       87.9       83.3         87.7      81.3
VGG      ours                yes       88.0       83.5         88.1      79.9
ResNet   R-MAC [52]          no        69.4       63.7         85.2      77.8
ResNet   GeM [30]            yes       87.8       84.6         92.7      86.9
ResNet   ours                yes       88.4       84.9         92.7      86.3

Re-ranking (R) and Query Expansion (QE)
VGG      Crow + QE [56]      no        74.9       70.6         84.8      79.4
VGG      R-MAC+R+QE [52]     no        77.3       73.2         86.5      79.8
VGG      BoW-CNN+R+QE [57]   no        78.8       65.1         84.8      64.1
VGG      R-MAC+QE [55]       yes       89.1       87.3         91.2      86.8
VGG      GeM+αQE [30]        yes       91.9       89.6         91.9      87.6
VGG      ours                yes       92.0       89.6         92.4      87.3
ResNet   R-MAC+QE [52]       no        78.9       75.5         89.7      85.3
ResNet   R-MAC+QE [60]       yes       90.6       89.4         96.0      93.2
ResNet   GeM+QE [30]         yes       91.0       89.5         95.5      91.9
ResNet   ours                yes       91.7       89.7         96.0      92.1

5. Conclusions

In this paper, we discuss the deficiency of traditional loss functions with respect to spatial distribution differences. To make up for the lack of spatial distribution differences in the descriptors of an image pair, the concept of relative distribution entropy (RDE) is presented. The calculation process of the relative distribution entropy is introduced in Section 3. Next, we combine the Euclidean distance and the relative distribution entropy to obtain a new similarity measurement, called the relative distribution entropy weighted distance (RDE-distance). We combine RDE-distance with contrastive loss and triplet loss to obtain the relative distribution entropy contrastive loss and the relative distribution entropy triplet loss. We train the entire framework in an end-to-end manner, and the results of extensive experiments show that our new method achieves state-of-the-art performance on most of the benchmarks considered. Our method mainly focuses on how to fuse the Euclidean distance and spatial information. Here, we introduce relative distribution entropy to describe spatial information. In future work, we would like to explore other fusion methods instead of the existing linear fusion. In addition, we plan to add our relative distribution entropy to other loss functions.