Zongzong Wu, Xiangchun Yu, Donglin Zhu, Qingwei Pang, Shitao Shen, Teng Ma, Jian Zheng.
Abstract
In real-world scenarios, the accuracy of person re-identification (Re-ID) is limited by camera hardware and by changes in image resolution caused by factors such as camera focusing errors. This problem is known as cross-resolution person Re-ID. In this paper, we increase the recognition accuracy of cross-resolution person Re-ID by improving both the image enhancement network and the feature extraction network. Specifically, we treat cross-resolution person Re-ID as a two-stage task. The first stage is image enhancement, for which we propose a Super-Resolution Dual-Stream Feature Fusion sub-network, named SR-DSFF, consisting of an SR module and a DSFF module. The SR module recovers the resolution of the low-resolution (LR) images; the DSFF module then extracts feature maps from the LR and super-resolution (SR) images through two streams and fuses them with learned weights. At the end of SR-DSFF, a transposed convolution converts the fused feature maps back into images. The second stage is feature acquisition, for which we design a global-local feature extraction network guided by human pose estimation, named FENet-ReID. FENet-ReID obtains the final features for the Re-ID task through multistage feature extraction and multiscale feature fusion. The two stages complement each other, giving the final pedestrian feature representation an advantage in identification accuracy. Experimental results show that our method improves significantly over several state-of-the-art methods.
Year: 2022 PMID: 35837221 PMCID: PMC9276474 DOI: 10.1155/2022/4398727
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1 The network consists of the SR-DSFF sub-network and FENet-ReID. The query images first enter the SR-DSFF, and the SR images are output through the feature extractor and the upscale module in the SR module. The feature maps of the query images and the SR images are then jointly learned and fused by the DSFF module, and the resulting images are passed to FENet-ReID through a transposed convolution. FENet-ReID extracts the global and local features of the images obtained from the SR-DSFF and fuses them to obtain the final feature maps. Finally, a fully connected (FC) layer is applied to the final feature maps to predict the pedestrian ID labels. Our network is trained in two stages: (1) the SR module is updated with the SR loss ℒrec (equation (6)), and (2) the DSFF and FENet-ReID are jointly trained with the total loss ℒTOTAL (equation (12)). These two stages are indicated by yellow and black arrows in the figure, respectively.
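As a rough illustration of this two-stage pipeline, the PyTorch sketch below wires an SR module, a weighted dual-stream fusion, a transposed convolution, and an ID classifier together. All module names, layer shapes, the 2x upscale factor, and the stand-in backbone (which replaces the full FENet-ReID) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SRDSFF(nn.Module):
    """Stage 1 sketch: super-resolve the LR query, then fuse LR/SR streams."""
    def __init__(self, channels=64):
        super().__init__()
        # SR module: feature extractor + upscale (pixel shuffle, 2x assumed).
        self.sr = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3 * 4, 3, padding=1), nn.PixelShuffle(2),
        )
        self.enc_lr = nn.Conv2d(3, channels, 3, stride=2, padding=1)  # LR stream
        self.enc_sr = nn.Conv2d(3, channels, 3, stride=4, padding=1)  # SR stream
        self.w = nn.Parameter(torch.tensor([0.5, 0.5]))  # learned fusion weights
        # Transposed convolution that turns fused feature maps back into images.
        self.to_img = nn.ConvTranspose2d(channels, 3, 4, stride=2, padding=1)

    def forward(self, lr):
        sr = self.sr(lr)  # super-resolved image
        fused = self.w[0] * self.enc_lr(lr) + self.w[1] * self.enc_sr(sr)
        return self.to_img(fused)

class Pipeline(nn.Module):
    """Stage 2 sketch: a stand-in backbone plus the FC identity classifier."""
    def __init__(self, num_ids=751):  # 751 = Market-1501 training identities
        super().__init__()
        self.sr_dsff = SRDSFF()
        self.backbone = nn.Sequential(  # placeholder for FENet-ReID
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64, num_ids)  # predicts pedestrian ID labels

    def forward(self, lr_query):
        return self.fc(self.backbone(self.sr_dsff(lr_query)))

logits = Pipeline()(torch.randn(2, 3, 64, 32))  # batch of LR query images
```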
Figure 2 Performance of our SR module on the Market1501 dataset. The improvement is evident when the SR outputs are compared with the LR images.
Figure 3 We add spatial attention and channel attention to the last ResNet101 block. The lower-right corner of the figure shows a detailed diagram of the attention structure, using the FES branch as an example; the other branch has the same structure.
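The caption does not spell out the attention design, so the following is a CBAM-style sketch of channel attention followed by spatial attention applied to a ResNet block's output; the reduction ratio and the 7x7 kernel are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze channel statistics, then re-weight channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        w = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * w

class SpatialAttention(nn.Module):
    """Pool across channels, then re-weight spatial positions."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max map
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

# Applied to the output of the last ResNet101 block (2048 channels):
attn = nn.Sequential(ChannelAttention(2048), SpatialAttention())
refined = attn(torch.randn(1, 2048, 8, 4))
```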
Figure 4 Flowchart of FENet-ReID. The full image and the three human key-region images are processed by FE-C1 and FE-C2, respectively, and the resulting 256-dimensional features are fused by three fusion units in the FFM.
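A minimal sketch of this fusion step, assuming each fusion unit concatenates the running global feature with one 256-dimensional key-region feature and projects back to 256 dimensions; the paper's actual fusion units may differ.

```python
import torch
import torch.nn as nn

class FusionUnit(nn.Module):
    """Hypothetical fusion unit: concatenate two 256-d features, project back."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(inplace=True))

    def forward(self, g, l):
        return self.proj(torch.cat([g, l], dim=-1))

class FFM(nn.Module):
    """Fuse the global feature with the three key-region features in turn."""
    def __init__(self, dim=256):
        super().__init__()
        self.units = nn.ModuleList([FusionUnit(dim) for _ in range(3)])

    def forward(self, global_feat, local_feats):
        f = global_feat                      # 256-d full-image feature (FE-C1)
        for unit, local in zip(self.units, local_feats):
            f = unit(f, local)               # mix in one key-region feature (FE-C2)
        return f                             # final 256-d representation

ffm = FFM()
out = ffm(torch.randn(4, 256), [torch.randn(4, 256) for _ in range(3)])
```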
The proposed method is compared with current state-of-the-art methods on the MLR-Market1501 dataset (Rank-k accuracy, %).
| Methods | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|
| SING | 74.4 | 87.8 | 91.6 |
| SPreID | 77.4 | 89.0 | 93.9 |
| CamStyle | 74.5 | 88.5 | 92.2 |
| CAD-net | 83.7 | 92.7 | 95.8 |
| FFSR + RIFE | 66.9 | 84.7 | — |
| CRGAN | 83.7 | 92.7 | 95.8 |
| INTACT | | | 96.9 |
| PRI | 84.9 | 93.5 | 96.1 |
| LA-transformer | 86.7 | | |
| Ours | | | |
The best and second-best results are in bold and italics, respectively.
The proposed method is compared with current state-of-the-art methods on the CAVIAR dataset (Rank-k accuracy, %).
| Methods | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|
| SING | 33.5 | 72.7 | 89.0 |
| SPreID | 36.2 | 71.9 | 88.7 |
| CamStyle | 32.1 | 72.3 | 85.9 |
| CAD-net | 42.8 | 76.2 | 91.5 |
| FFSR + RIFE | 36.4 | 72.0 | — |
| CRGAN | 42.8 | 76.2 | 91.5 |
| INTACT | | | |
| PRI | 43.2 | 78.5 | 91.9 |
| LA-transformer | 42.1 | 80.7 | 92.4 |
| Ours | | | |
The best and second-best results are in bold and italics, respectively.
The proposed method is compared with current state-of-the-art methods on the MLR-CUHK03 dataset (Rank-k accuracy, %).
| Methods | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|
| SING | 67.7 | 90.7 | 94.7 |
| SPreID | 76.5 | 92.5 | 98.3 |
| CamStyle | 69.1 | 89.6 | 93.9 |
| CAD-net | 82.1 | | |
| FFSR + RIFE | 73.3 | 92.6 | — |
| INTACT | | | |
| PRI | 85.2 | | |
| LA-transformer | | 97.1 | |
| Ours | | | |
The best and second-best results are in bold and italics, respectively.
Performance of different feature extractors on MLR-Market1501.
| Structure | Weight learning | Rank-1 | Rank-5 |
|---|---|---|---|
| ResNet101 | — | 76.9 | 82.4 |
| Two ResNet101 | — | 80.4 | 90.9 |
| Two ResNet101 | √ | 86.6 | 95.7 |
| SR-DSFF (ours) | √ | 89.2 | 95.9 |
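The "weight learning" rows above suggest that fusing the two streams with learned weights, rather than a fixed average, drives most of the gain. A minimal sketch of such a fusion, assuming softmax-normalized scalar weights per stream (the paper's exact weighting scheme is not given in this record):

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse LR-stream and SR-stream feature maps with learned scalar weights."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))  # one logit per stream

    def forward(self, feat_lr, feat_sr):
        w = torch.softmax(self.logits, dim=0)       # weights are positive, sum to 1
        return w[0] * feat_lr + w[1] * feat_sr

fuse = WeightedFusion()
fused = fuse(torch.randn(1, 64, 32, 16), torch.randn(1, 64, 32, 16))
```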
The influence of different loss functions on recognition accuracy.
| Loss functions | Rank-1 | Rank-5 | Dataset |
|---|---|---|---|
| Circle loss | 88.4 | 95.7 | MLR-Market1501 |
| Triplet loss | 88.7 | 94.9 | MLR-Market1501 |
| Sphere loss | 89.3 | 96.1 | MLR-Market1501 |
| Ours | 90.9 | 96.4 | MLR-Market1501 |
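The composition of the paper's own objective ("Ours", ℒTOTAL in Figure 1) is not detailed in this record; a common Re-ID combination of the kind compared above pairs an ID cross-entropy term with a triplet term, sketched here under that assumption.

```python
import torch
import torch.nn as nn

id_loss = nn.CrossEntropyLoss()                    # identity classification term
tri_loss = nn.TripletMarginLoss(margin=0.3)        # metric-learning term

def total_loss(logits, labels, anchor, positive, negative, lam=1.0):
    """Illustrative L_TOTAL = L_id + lam * L_triplet; lam is a made-up weight."""
    return id_loss(logits, labels) + lam * tri_loss(anchor, positive, negative)
```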
Comparison of SR-DSFF with other image enhancement models on MLR-Market1501.
| Models | DS | Weight | Rank-1 | Rank-5 |
|---|---|---|---|---|
| CycleGAN | — | — | 62.6 | 76.2 |
| SING | — | — | 74.4 | 87.8 |
| CSR-GAN | — | √ | 74.3 | 87.7 |
| FFSR + RIFE | √ | √ | 66.9 | 84.7 |
| CAD-net | — | — | 83.7 | 92.7 |
| SR-DSFF (ours) | √ | — | 86.1 | 92.6 |
| SR-DSFF (ours) | √ | √ | 90.3 | 96.4 |
“DS” represents whether dual-stream feature fusion is performed and “Weight” indicates whether weighting loss was added during feature extraction.