Dasol Jeong, Hasil Park, Joongchol Shin, Donggoo Kang, Joonki Paik.
Abstract
Person re-identification (Re-ID) suffers from problems that make learning difficult, such as misalignment and occlusion. To solve these problems, it is important to focus on features that are robust to intra-class variation. Existing attention-based Re-ID methods focus only on common features without considering distinctive features. In this paper, we present a novel attentive learning-based Siamese network for person Re-ID. Unlike existing methods, we design an attention module and an attention loss that use the properties of the Siamese network to concentrate attention on both common and distinctive features. The attention module consists of channel attention, which selects important channels, and encoder-decoder attention, which observes the whole body shape. We modify the triplet loss into an attention loss, called the uniformity loss, which generates a unique attention map focusing on both common and discriminative features. Extensive experiments show that the proposed network compares favorably to state-of-the-art methods on three large-scale benchmarks: the Market-1501, CUHK03, and DukeMTMC-ReID datasets.
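The abstract describes the uniformity loss as a triplet loss modified to act on attention maps, pulling the attention of a positive (same-identity) pair together while pushing a negative pair apart. The exact formulation is not given in this extract; the following is a minimal sketch assuming an L2 distance between flattened attention maps and a hinge with margin (both assumptions).

```python
import numpy as np

def uniformity_loss(att_anchor, att_pos, att_neg, margin=0.3):
    """Triplet-style loss on attention maps (hypothetical form):
    attention maps of a same-identity pair should be closer than those
    of a different-identity pair, by at least `margin`."""
    d_pos = np.linalg.norm(att_anchor - att_pos)  # distance to positive map
    d_neg = np.linalg.norm(att_anchor - att_neg)  # distance to negative map
    return max(0.0, d_pos - d_neg + margin)       # hinge with margin

# Toy example: identical positive maps and a distant negative map
# satisfy the margin, so the loss is zero.
a, p, n = np.zeros(16), np.zeros(16), np.ones(16)
print(uniformity_loss(a, p, n))  # 0.0 (d_pos = 0, d_neg = 4, margin = 0.3)
```

Swapping the positive and negative maps violates the margin and yields a positive loss, which is the gradient signal that shapes the attention map during training.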
Keywords: Siamese network; attention mechanism; person re-identification
Year: 2020 PMID: 32604850 PMCID: PMC7349100 DOI: 10.3390/s20123603
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Visualization of activation maps on the Market1501 dataset. Each image triplet shows, from left to right, an input image, the activation map of a pre-trained ResNet50, and the activation map of our method, for two persons (a,b).
Figure 2. The overall architecture of the proposed method.
Figure 3. The encoder-decoder attention and channel attention modules.
Figure 4. The pipeline of the proposed method: the feature map produced by ResNet50 is reweighted by the attention maps of the channel attention and encoder-decoder attention modules, and the resulting feature vector is used for ID prediction and verification. Attention scores of positive and negative pairs are compared by the Siamese network. GAP and FC denote global average pooling and the fully connected layer, respectively.
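The channel attention branch in Figure 4 pools the feature map to one value per channel and produces channel weights multiplied back onto the Layer3 output (the "ChannelAtt × Layer3" row in the architecture table). The sketch below assumes a common squeeze-and-excitation style gate (bottleneck FC layers with ReLU and sigmoid); the paper's exact layer sizes are not given in this extract, so `w1`/`w2` and the reduction ratio are illustrative.

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """Channel attention sketch: squeeze spatially, gate per channel.
    feat: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r)
    are assumed learned bottleneck weights."""
    squeeze = feat.mean(axis=(1, 2))             # global average pool -> (C,)
    hidden = np.maximum(0.0, w1 @ squeeze)       # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid weights in (0, 1)
    return feat * gate[:, None, None]            # reweight each channel

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 8))           # C=8 at the 16 x 8 Layer3 size
w1 = rng.standard_normal((2, 8)) * 0.1
w2 = rng.standard_normal((8, 2)) * 0.1
out = channel_attention(feat, w1, w2)
print(out.shape)  # (8, 16, 8)
```

Because the sigmoid gate lies in (0, 1), the module can only attenuate channels, never amplify them; "important" channels are those left closest to their original magnitude.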
Network architecture of the proposed method.
| Name | Size | Backbone | Attention Module |
|---|---|---|---|
| Input | 128 × 256 | | |
| Conv1 | 64 × 32 | 7 × 7, 64, stride 2 | |
| Layer1 | 64 × 32 | | |
| Layer2 | 32 × 16 | | |
| Layer3 | 16 × 8 | | |
| Channel | 1 × 1 | | global average pool |
| Encoder | 16 × 8 | | |
| Decoder | 16 × 8 | | |
| Multiple1 | 16 × 8 | | ChannelAtt × Layer3 |
| Layer4 | 8 × 4 | | |
| Spatial | 1 × 1 | | 1 × 1, 2048 |
| GAP | 1 × 1 | global average pool | |
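The spatial sizes in the table follow the standard ResNet50 stride schedule applied to the 128 × 256 input: Conv1 plus max pooling reduces each dimension by 4, Layer1 keeps the resolution, and each later stage halves it. The quick check below assumes the table lists the input as width × height and later rows as height × width, which is how the numbers line up.

```python
# Reproduce the spatial sizes in the architecture table from the standard
# ResNet50 stride schedule (assumed here; the table itself omits strides
# for Layer1-Layer4).

def downsample(size, stride):
    h, w = size
    return (h // stride, w // stride)

size = (256, 128)                        # input, height x width
size = downsample(size, 4)               # Conv1 (stride 2) + max pool (stride 2)
sizes = {"Conv1": size, "Layer1": size}  # Layer1 keeps the resolution
for name in ("Layer2", "Layer3", "Layer4"):
    size = downsample(size, 2)           # each later stage halves H and W
    sizes[name] = size
print(sizes)
# {'Conv1': (64, 32), 'Layer1': (64, 32), 'Layer2': (32, 16),
#  'Layer3': (16, 8), 'Layer4': (8, 4)}
```

This also confirms that the encoder-decoder and channel attention operate at the 16 × 8 Layer3 resolution before Layer4 reduces it to 8 × 4.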
Comparison with state-of-the-art person ReID methods on the Market1501 dataset.
| Methods | Single Query | | Multiple Query | |
|---|---|---|---|---|
| | R-1 | mAP | R-1 | mAP |
| XQDA [ | 43.8 | 22.2 | 54.1 | 28.4 |
| SCS [ | 51.9 | 26.3 | - | - |
| DNS [ | 61.0 | 35.6 | 71.5 | 46.0 |
| CRAFT [ | 68.7 | 42.3 | 77.0 | 50.3 |
| CAN [ | 60.3 | 35.9 | 72.1 | 47.9 |
| S-LSTM [ | - | - | 61.6 | 35.3 |
| G-SCNN [ | 65.8 | 39.5 | 76.0 | 48.4 |
| SVDNet [ | 82.3 | 62.1 | - | - |
| MSCAN [ | 80.3 | 57.5 | 86.8 | 66.7 |
| HA-CNN [ | 91.2 | 75.7 | 93.8 | 82.8 |
| Ours (ResNet50) | 91.3 | 79.2 | 94.1 | 85.3 |
| Ours (VGG16) | 89.3 | 73.3 | 92.8 | 81.0 |
Comparison with state-of-the-art person ReID methods on the CUHK03 dataset.
| Methods | Detected | | Labeled | |
|---|---|---|---|---|
| | R-1 | mAP | R-1 | mAP |
| BoW + XQDA [ | 6.4 | 6.4 | 7.9 | 7.3 |
| LOMO + XQDA [ | 12.8 | 11.5 | 14.8 | 13.6 |
| IDE-R [ | 21.3 | 19.7 | 22.2 | 21.0 |
| IDE-R + XQDA [ | 31.1 | 28.2 | 32.0 | 29.6 |
| PAN [ | 36.3 | 34.0 | 36.9 | 35.0 |
| DPFL [ | 40.7 | 37.0 | 43.0 | 40.5 |
| HA-CNN [ | 41.7 | 38.6 | 44.4 | 41.0 |
| MLFN [ | 52.8 | 47.8 | 54.7 | 49.2 |
| CASN [ | 57.4 | 50.7 | 58.9 | 52.2 |
| Ours (ResNet50) | 58.9 | 52.6 | 62.6 | 57.7 |
| Ours (VGG16) | 52.7 | 48.4 | 46.9 | 42.2 |
Comparison with state-of-the-art person ReID methods on the DukeMTMC-ReID dataset.
| Methods | R-1 | mAP |
|---|---|---|
| BoW + KISSME [ | 25.1 | 12.2 |
| LOMO + XQDA [ | 30.8 | 17.0 |
| ResNet50 [ | 65.2 | 45.0 |
| JLML [ | 73.3 | 56.4 |
| SVDNet [ | 76.7 | 56.8 |
| HA-CNN [ | 80.5 | 63.8 |
| Ours (ResNet50) | 80.7 | 65.5 |
| Ours (VGG16) | 78.0 | 61.4 |
Efficiency of the proposed method on the Market1501 dataset. The attention module and uniformity loss are denoted by AM and UL, respectively.
| Methods | R-1 | R-5 | R-10 | mAP |
|---|---|---|---|---|
| ResNet50-Baseline [ | 88.1 | 95.0 | 96.8 | 71.2 |
| ResNet50 + AM | 89.7 | 96.2 | 97.4 | 76.7 |
| Ours (ResNet50 + AM + UL) | 91.3 | 96.9 | 98.2 | 79.2 |
| VGG16-Baseline [ | 85.3 | 94.5 | 96.3 | 68.2 |
| VGG16 + AM | 87.2 | 95.5 | 97.4 | 69.2 |
| Ours (VGG16 + AM + UL) | 89.2 | 96.1 | 97.5 | 73.3 |
Comparison of network architectural changes on the Market1501 dataset. Each notation means that the attention module (AM) is placed after the corresponding layer. Bold indicates the best performance.
| Methods | R-1 | R-5 | R-10 | mAP |
|---|---|---|---|---|
| layer1-AM | 89.3 | 96.3 | 97.4 | 74.6 |
| layer2-AM | 90.7 | 96.7 | 98.0 | 78.2 |
| layer3-AM | **91.3** | **96.9** | **98.2** | **79.2** |
| layer4-AM | 88.7 | 96.1 | 97.4 | 75.4 |
Figure 5. Visualized examples comparing the proposed method with the others on the Market1501 dataset. The first column shows input images; the second, ResNet50; the third, only the attention module applied; and the fourth, the proposed method.