Ge Shi, Feng Li, Lifang Wu, Yukun Chen.
Abstract
The core of cross-modal hashing methods is to map high-dimensional features into binary hash codes, which can then be compared under the Hamming distance metric for efficient retrieval. Recent developments emphasize the advantages of unsupervised cross-modal hashing techniques, since they rely only on the relevance information of paired data, making them more applicable to real-world applications. However, two problems, namely intra-modality correlation and inter-modality correlation, have not yet been fully considered. Intra-modality correlation describes the complex overall concept of a single modality and provides semantic relevance for retrieval tasks, while inter-modality correlation refers to the relationship between different modalities. From our observation and hypothesis, the dependency relationships within a modality and between different modalities can be constructed at the object level, which can further improve cross-modal hashing retrieval accuracy. To this end, we propose a Visual-Textual Correlation Graph Hashing (VCGH) approach to mine the fine-grained object-level similarity in cross-modal data while suppressing noise interference. Specifically, a novel intra-modality correlation graph is designed to learn graph-level representations of different modalities, obtaining the dependency relationships of image region to image region and tag to tag in an unsupervised manner. Then, we design a visual-text dependency building module that captures correlated semantic information between different modalities by modeling the dependency relationship between image object regions and text tags. Extensive experiments on two widely used datasets verify the effectiveness of our proposed approach.
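To make the retrieval mechanism concrete, below is a minimal sketch of the generic binarize-and-rank step that hashing methods of this kind share. The sign thresholding and random features are illustrative assumptions only; VCGH learns its codes with a deep model rather than this toy binarization.

```python
import numpy as np

def binarize(features: np.ndarray) -> np.ndarray:
    # Toy sign-threshold binarization (VCGH learns its codes with a deep model).
    return (features > 0).astype(np.uint8)

def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    # Hamming distance = number of differing bits; smaller means more similar.
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)

# Toy usage: a 64-bit text query ranked against 1000 image codes.
rng = np.random.default_rng(0)
db_codes = binarize(rng.standard_normal((1000, 64)))
query = binarize(rng.standard_normal(64))
top10 = hamming_rank(query, db_codes)[:10]
```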
Keywords: cross-modal hash learning; deep model; hashing retrieval
Year: 2022 PMID: 35458906 PMCID: PMC9029824 DOI: 10.3390/s22082921
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. A pair of similar image-text samples from the MIRFlickr dataset.
Figure 2. The overall framework of our proposed visual-textual correlation graph hashing (VCGH) approach.
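As a rough illustration of the object-level graph construction sketched in Figure 2, the snippet below builds intra-modality edges (region-to-region or tag-to-tag) and cross-modal region-to-tag edges from feature similarity. The cosine measure, the threshold `tau`, and the shared embedding dimension are all assumptions made for this sketch, not the paper's exact formulation.

```python
import numpy as np

def cosine_graph(node_feats: np.ndarray, tau: float = 0.5) -> np.ndarray:
    # Intra-modality adjacency: connect nodes (image regions, or text tags)
    # whose feature vectors are sufficiently similar under cosine similarity.
    normed = node_feats / np.linalg.norm(node_feats, axis=1, keepdims=True)
    return (normed @ normed.T > tau).astype(np.float32)

def cross_modal_edges(region_feats: np.ndarray, tag_feats: np.ndarray,
                      tau: float = 0.5) -> np.ndarray:
    # Visual-text dependencies: link an image object region to a text tag
    # when their embeddings (assumed projected to a shared dimension) are close.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = tag_feats / np.linalg.norm(tag_feats, axis=1, keepdims=True)
    return (r @ t.T > tau).astype(np.float32)
```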
The MAP scores of two retrieval tasks on the MIRFlickr dataset with different lengths of hash codes. The best results are highlighted in bold.
| Method | Image→Text | | | | Text→Image | | | |
|---|---|---|---|---|---|---|---|---|
| | 16 | 32 | 64 | 128 | 16 | 32 | 64 | 128 |
| CVH | 0.602 | 0.587 | 0.578 | 0.572 | 0.607 | 0.591 | 0.581 | 0.574 |
| PDH | 0.623 | 0.624 | 0.621 | 0.626 | 0.627 | 0.628 | 0.628 | 0.629 |
| CMFH | 0.659 | 0.660 | 0.663 | 0.653 | 0.611 | 0.606 | 0.575 | 0.563 |
| CCQ | 0.637 | 0.639 | 0.639 | 0.638 | 0.628 | 0.628 | 0.622 | 0.618 |
| CMSSH | 0.611 | 0.602 | 0.599 | 0.591 | 0.612 | 0.604 | 0.592 | 0.585 |
| SCM | 0.636 | 0.640 | 0.641 | 0.643 | 0.661 | 0.664 | 0.668 | 0.670 |
| DJSRH | 0.659 | 0.661 | 0.675 | 0.684 | 0.655 | 0.671 | 0.673 | 0.685 |
| MGAH | 0.685 | 0.693 | 0.704 | 0.702 | 0.673 | 0.676 | 0.686 | 0.690 |
| VCGH | | | | | | | | |
The MAP scores of two retrieval tasks on the NUS-WIDE dataset with different lengths of hash codes. The best results are highlighted in bold.
| Method | Image→Text | | | | Text→Image | | | |
|---|---|---|---|---|---|---|---|---|
| | 16 | 32 | 64 | 128 | 16 | 32 | 64 | 128 |
| CVH | 0.458 | 0.432 | 0.410 | 0.392 | 0.474 | 0.445 | 0.419 | 0.398 |
| PDH | 0.475 | 0.484 | 0.480 | 0.490 | 0.489 | 0.512 | 0.507 | 0.517 |
| CMFH | 0.517 | 0.550 | 0.547 | 0.520 | 0.439 | 0.416 | 0.377 | 0.349 |
| CCQ | 0.504 | 0.505 | 0.506 | 0.505 | 0.499 | 0.496 | 0.492 | 0.488 |
| CMSSH | 0.512 | 0.470 | 0.479 | 0.466 | 0.519 | 0.498 | 0.456 | 0.488 |
| SCM | 0.517 | 0.514 | 0.518 | 0.518 | 0.518 | 0.510 | 0.517 | 0.518 |
| DJSRH | 0.503 | 0.517 | 0.528 | 0.554 | 0.526 | 0.541 | 0.539 | 0.570 |
| MGAH | 0.613 | 0.623 | 0.628 | 0.631 | 0.603 | 0.614 | 0.640 | 0.641 |
| VCGH | | | | | | | | |
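For reference, MAP in these tables follows the usual retrieval protocol: rank the database by Hamming distance for each query and average the precision at the ranks of true neighbors. A minimal sketch follows; the definition of relevance (e.g., shared semantic labels) is an assumption about the evaluation setup.

```python
import numpy as np

def average_precision(ranked_relevance: np.ndarray) -> float:
    # AP for one query: mean of precision@k taken at each rank k that
    # holds a relevant item. ranked_relevance is a 0/1 vector in rank order.
    hits = np.cumsum(ranked_relevance)
    ranks = np.arange(1, len(ranked_relevance) + 1)
    if hits[-1] == 0:
        return 0.0
    return float(np.sum(ranked_relevance * hits / ranks) / hits[-1])

def mean_average_precision(dists: np.ndarray, relevance: np.ndarray) -> float:
    # dists[i]: Hamming distances from query i to every database item;
    # relevance[i]: the matching 0/1 ground-truth vector.
    aps = []
    for d, r in zip(dists, relevance):
        order = np.argsort(d)  # nearest codes first
        aps.append(average_precision(r[order]))
    return float(np.mean(aps))
```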
Figure 3. The PR curves on the MIRFlickr and NUS-WIDE datasets with 16-bit hash codes.
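PR curves for hashing are commonly traced by sweeping the Hamming radius and measuring precision and recall over everything retrieved within it; below is a short sketch under that assumed protocol.

```python
import numpy as np

def pr_curve(dists: np.ndarray, relevant: np.ndarray, n_bits: int = 16):
    # Precision/recall for one query as the Hamming radius grows from 0
    # to n_bits. dists: Hamming distances to all items; relevant: 0/1 labels.
    precisions, recalls = [], []
    total_rel = relevant.sum()
    for radius in range(n_bits + 1):
        retrieved = dists <= radius
        n_ret = retrieved.sum()
        tp = (retrieved & (relevant == 1)).sum()
        precisions.append(tp / n_ret if n_ret else 1.0)
        recalls.append(tp / total_rel if total_rel else 0.0)
    return np.array(precisions), np.array(recalls)
```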
The MAP scores by using different variants of the proposed VCGH on MIRFlickr and NUS-WIDE. The best results are highlighted in bold.
| Task | Method | MIRFlickr | | | | NUS-WIDE | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | 16 | 32 | 64 | 128 | 16 | 32 | 64 | 128 |
| image→text | Baseline | 0.656 | 0.682 | 0.689 | 0.698 | 0.507 | 0.559 | 0.569 | 0.581 |
| | Baseline-global | 0.681 | 0.688 | 0.692 | 0.703 | 0.589 | 0.593 | 0.611 | 0.615 |
| | Baseline-object | 0.686 | 0.691 | 0.699 | 0.710 | 0.594 | 0.611 | 0.619 | 0.623 |
| | VCGH | | | | | | | | |
| text→image | Baseline | 0.664 | 0.690 | 0.691 | 0.700 | 0.498 | 0.583 | 0.576 | 0.619 |
| | Baseline-global | 0.680 | 0.683 | 0.687 | 0.694 | 0.581 | 0.588 | 0.610 | 0.614 |
| | Baseline-object | 0.691 | 0.696 | 0.710 | 0.718 | 0.597 | 0.616 | 0.621 | 0.628 |
| | VCGH | | | | | | | | |