Mingyong Li, Qiqi Li, Lirong Tang, Shuang Peng, Yan Ma, Degang Yang.
Abstract
Cross-modal hashing encodes heterogeneous multimedia data into compact binary codes to enable fast and flexible retrieval across modalities. Owing to its low storage cost and high retrieval efficiency, it has received widespread attention. Supervised deep hashing significantly improves search performance and usually yields more accurate results, but it requires extensive manual annotation of the data. In contrast, unsupervised deep hashing struggles to achieve satisfactory performance because reliable supervisory information is lacking. To address this problem, and inspired by knowledge distillation, we propose a novel unsupervised knowledge-distillation cross-modal hashing method based on semantic alignment (SAKDH), which reconstructs a similarity matrix from the latent correlation information of a pretrained unsupervised teacher model; the reconstructed similarity matrix is then used to guide a supervised student model. Specifically, the teacher model adopts an unsupervised semantic alignment hashing method that constructs a modal-fusion similarity matrix, and, under the supervision of the teacher's distilled information, the student model generates more discriminative hash codes. Experimental results on two widely used benchmark datasets (MIRFLICKR-25K and NUS-WIDE) show that, compared with several representative unsupervised cross-modal hashing methods, the mean average precision (MAP) of the proposed method improves significantly, demonstrating its effectiveness for large-scale cross-modal retrieval.
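The modal-fusion similarity matrix mentioned above can be sketched roughly as follows. This is an illustrative assumption, not the paper's exact formulation: the function names, the use of cosine similarity per modality, and the fusion weight `mu` are all hypothetical.

```python
import numpy as np

def cosine_sim(features):
    """Row-normalize features and return the pairwise cosine-similarity matrix."""
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    return features @ features.T

def fused_similarity(img_feat, txt_feat, mu=0.5):
    """Fuse intra-modal cosine similarities into one soft similarity matrix.

    mu weights the image modality against the text modality; the paper's
    actual fusion rule may differ from this simple convex combination.
    """
    s_img = cosine_sim(img_feat)
    s_txt = cosine_sim(txt_feat)
    return mu * s_img + (1.0 - mu) * s_txt
```

The resulting matrix is symmetric, with entries near 1 for instances whose image and text features both agree, which is the kind of graded (soft) signal the teacher distills to the student.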
Year: 2021 PMID: 34326867 PMCID: PMC8310450 DOI: 10.1155/2021/5107034
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. The proposed SAKDH framework consists of two modules: an unsupervised teacher model (a) and a supervised student model (b). The teacher model is trained in an unsupervised way; by distilling its knowledge, a similarity matrix S (soft similarity) is established and used to supervise the student model. The teacher model adopts an unsupervised semantic alignment hashing method, and the student model adopts a joint pairwise and triplet loss; these loss functions apply not only across modalities (intermodal) but also within each modality (intramodal).
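A minimal sketch of the joint pairwise-plus-triplet objective described in the caption, assuming relaxed (real-valued) codes; the function names, the scaling by code length, and the `gamma`/`margin` values are hypothetical placeholders, not the paper's exact losses.

```python
import numpy as np

def pairwise_loss(codes_a, codes_b, soft_sim, gamma=1.0):
    """MSE between scaled code inner products and the teacher's soft similarity.

    For +/-1 codes of length k, the inner product divided by k lies in [-1, 1],
    so it is directly comparable to a (scaled) similarity target.
    """
    k = codes_a.shape[1]
    inner = (codes_a @ codes_b.T) / k
    return np.mean((inner - gamma * soft_sim) ** 2)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge triplet loss: pull the positive closer than the negative by a margin."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.mean(np.maximum(0.0, d_pos - d_neg + margin))
```

Applying both losses with (image, image), (text, text), and (image, text) code pairs would cover the intramodal and intermodal cases the caption refers to.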
Figure 2. Examples of soft similarity and hard similarity.
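A toy illustration of the soft/hard distinction: soft similarity keeps graded affinities, while hard similarity thresholds them to binary labels (the threshold 0.5 and the example values here are arbitrary, chosen only to show what the binarization discards).

```python
import numpy as np

# Soft similarity distilled from the teacher: graded affinities in [0, 1].
S_soft = np.array([[1.0, 0.8, 0.1],
                   [0.8, 1.0, 0.2],
                   [0.1, 0.2, 1.0]])

# Hard similarity: thresholding collapses the grading to 0/1, discarding the
# "somewhat similar" information that a student model could otherwise exploit.
S_hard = (S_soft > 0.5).astype(float)
```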
Figure 3. Precision@top-K curves with 128-bit code length. (a) Image-to-text on MIRFLICKR. (b) Text-to-image on MIRFLICKR. (c) Image-to-text on NUS-WIDE. (d) Text-to-image on NUS-WIDE.
Setup of the two cross-modal datasets.
| Dataset | Total | Training set | Test set | Labels | Image feature | Text feature |
|---|---|---|---|---|---|---|
| MIRFLICKR-25K | 20,015 | 18,015 | 2,000 | 24 | 4,096-d VGGNet | 1,386-d BoW |
| NUS-WIDE | 186,577 | 15,000 | 1,865 | 10 | 4,096-d VGGNet | 1,000-d BoW |
The MAP@50 results of two retrieval tasks on MIRFLICKR with various code lengths.
| Methods | Image-query-text | | | | Text-query-image | | | |
|---|---|---|---|---|---|---|---|---|
| | 16 bits | 32 bits | 64 bits | 128 bits | 16 bits | 32 bits | 64 bits | 128 bits |
| CVH | 0.606 | 0.599 | 0.596 | 0.598 | 0.591 | 0.583 | 0.576 | 0.576 |
| IMH | 0.612 | 0.601 | 0.592 | 0.579 | 0.603 | 0.595 | 0.589 | 0.580 |
| CMFH | 0.621 | 0.624 | 0.625 | 0.627 | 0.642 | 0.662 | 0.676 | 0.685 |
| LSSH | 0.584 | 0.599 | 0.602 | 0.614 | 0.618 | 0.626 | 0.626 | 0.628 |
| DBRC | 0.617 | 0.619 | 0.620 | 0.621 | 0.618 | 0.626 | 0.626 | 0.628 |
| UDCMH | 0.689 | 0.698 | 0.714 | 0.717 | 0.692 | 0.704 | 0.718 | 0.733 |
| DJSRH | 0.810 | 0.843 | 0.862 | 0.876 | 0.786 | 0.822 | 0.835 | 0.847 |
| JDSH | 0.832 | 0.853 | 0.882 | 0.892 | 0.825 | 0.864 | 0.878 | 0.880 |
| Ours | | | | | | | | |
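The MAP@50 scores reported above can be computed along these general lines. This is a generic sketch (Hamming-distance ranking with label-overlap relevance), not the authors' evaluation script; the function name and the relevance-matrix convention are assumptions.

```python
import numpy as np

def map_at_k(query_codes, db_codes, relevance, k=50):
    """Mean average precision over the top-k Hamming neighbours.

    query_codes, db_codes: binary code matrices (one code per row).
    relevance[i, j] = 1 if database item j is relevant to query i
    (e.g. the two items share at least one label), else 0.
    """
    aps = []
    for i, q in enumerate(query_codes):
        dist = np.sum(q != db_codes, axis=1)          # Hamming distance
        order = np.argsort(dist, kind="stable")[:k]   # top-k ranking
        rel = relevance[i, order]
        if rel.sum() == 0:
            aps.append(0.0)                           # no relevant item retrieved
            continue
        cum = np.cumsum(rel)
        prec = cum / (np.arange(len(rel)) + 1)        # precision at each rank
        aps.append(np.sum(prec * rel) / rel.sum())    # average precision
    return float(np.mean(aps))
```

Image-query-text uses image hash codes as queries against the text-code database, and vice versa for text-query-image.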
The MAP@50 results of two retrieval tasks on NUS-WIDE with various code lengths.
| Methods | Image-query-text | | | | Text-query-image | | | |
|---|---|---|---|---|---|---|---|---|
| | 16 bits | 32 bits | 64 bits | 128 bits | 16 bits | 32 bits | 64 bits | 128 bits |
| CVH | 0.372 | 0.362 | 0.406 | 0.390 | 0.401 | 0.384 | 0.442 | 0.432 |
| IMH | 0.470 | 0.473 | 0.476 | 0.459 | 0.478 | 0.483 | 0.472 | 0.462 |
| CMFH | 0.455 | 0.459 | 0.465 | 0.467 | 0.529 | 0.577 | 0.614 | 0.645 |
| LSSH | 0.481 | 0.489 | 0.507 | 0.507 | 0.455 | 0.459 | 0.416 | 0.473 |
| DBRC | 0.424 | 0.459 | 0.447 | 0.447 | 0.455 | 0.459 | 0.416 | 0.473 |
| UDCMH | 0.511 | 0.519 | 0.524 | 0.558 | 0.637 | 0.653 | 0.695 | 0.716 |
| DJSRH | 0.724 | 0.773 | 0.798 | 0.817 | 0.712 | 0.744 | 0.771 | 0.789 |
| JDSH | 0.736 | 0.793 | 0.832 | 0.835 | 0.721 | 0.785 | 0.794 | 0.804 |
| Ours | | | | | | | | |
The MAP@50 results at 128 bits for ablation analysis on MIRFLICKR.
| Method | Configuration | I2T | T2I |
|---|---|---|---|
| SAKDH | Teacher ( | 0.905 | 0.884 |
| SAKDH-1 | Teacher ( | 0.893 | 0.879 |
| SAKDH-2 | Teacher ( | 0.876 | 0.861 |
| SAKDH-3 | Teacher ( | 0.851 | 0.842 |
Figure 4. Parameter sensitivity analysis on MIRFLICKR. (a) The parameter μ. (b) The parameter α. (c) The parameter β.