Yang Li (1,2), Huahu Xu (1,3), Minjie Bian (3), Junsheng Xiao (1).
Abstract
Owing to its important role in video surveillance, pedestrian attribute recognition has become an attractive facet of computer vision research. Changes in viewpoint, illumination, resolution and occlusion make the task very challenging. Existing pedestrian attribute recognition methods perform unsatisfactorily because they ignore the correlation between pedestrian attributes and spatial information; in this paper, the task is therefore treated as a spatiotemporal, sequential, multi-label image classification problem. An attention-based neural network consisting of convolutional neural networks (CNN), channel attention (CAtt) and convolutional long short-term memory (ConvLSTM) is proposed (CNN-CAtt-ConvLSTM). First, the salient and correlated visual features of pedestrian attributes are extracted by a pre-trained CNN and CAtt. Then, ConvLSTM further extracts spatial information and correlations between pedestrian attributes. Finally, pedestrian attributes are predicted in a sequence optimized by attribute image-area size and importance. Extensive experiments on two common pedestrian attribute datasets, the PEdesTrian Attribute (PETA) dataset and the Richly Annotated Pedestrian (RAP) dataset, show higher performance than other state-of-the-art (SOTA) methods, demonstrating the superiority and validity of our method.
Keywords: channel attention (CAtt); convolutional long short-term memory (ConvLSTM); convolutional neural networks (CNN); multi-label classification; pedestrian attribute recognition
Year: 2020 PMID: 32028568 PMCID: PMC7038686 DOI: 10.3390/s20030811
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1The architecture of proposed convolutional neural networks (CNN), channel attention (CAtt) and convolutional long short-term memory (ConvLSTM) model (CNN-CAtt-ConvLSTM) for pedestrian attribute recognition. MLCNN: multi-label classification CNN.
Figure 2The internal structure of ConvLSTM.
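For reference, the ConvLSTM cell in Figure 2 can be written in the standard formulation, where * denotes convolution, ∘ the Hadamard product and σ the sigmoid function; peephole connections are omitted here, and whether the paper's variant includes them is not stated in this record:

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```

Unlike a standard LSTM, all inputs, states and gates are 3D tensors, so spatial structure in the feature maps is preserved across prediction steps.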
Figure 3The channel attention mechanism for the proposed CNN-CAtt-ConvLSTM model.
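A channel attention mechanism like the one in Figure 3 can be sketched in squeeze-and-excitation style: globally pool each channel, pass the pooled vector through a small bottleneck, and use sigmoid gates to re-weight the channels. This is a minimal illustrative sketch, not the paper's exact CAtt layer; `w1` and `w2` are assumed bottleneck weights with reduction ratio r.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """SE-style channel attention sketch.

    x  : (C, H, W) feature map
    w1 : (C//r, C) bottleneck weights (reduction ratio r)
    w2 : (C, C//r) expansion weights
    """
    z = x.mean(axis=(1, 2))              # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)          # excite: ReLU bottleneck -> (C//r,)
    a = 1.0 / (1.0 + np.exp(-(w2 @ s)))  # sigmoid channel gates in (0, 1) -> (C,)
    return x * a[:, None, None]          # re-weight each channel of x

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))           # toy 8-channel feature map
w1 = rng.normal(size=(2, 8))             # reduction ratio r = 4
w2 = rng.normal(size=(8, 2))
y = channel_attention(x, w1, w2)
assert y.shape == x.shape
```

Because the gates lie in (0, 1), the layer can only attenuate channels, letting the network emphasize the channels most correlated with the attribute currently being predicted.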
Comparison with state-of-the-art (SOTA) methods on the PETA and RAP datasets.
| Method | PETA mA | PETA mP | PETA mR | PETA F1 | RAP mA | RAP mP | RAP mR | RAP F1 |
|---|---|---|---|---|---|---|---|---|
| ACN | 81.15 | 84.06 | 81.26 | 82.64 | 69.66 | 80.12 | 72.26 | 75.98 |
| DeepMAR | 81.50 | 89.70 | 81.90 | 85.68 | 76.10 | 82.20 | 74.80 | 78.30 |
| HP-net | 81.77 | 84.92 | 83.24 | 84.07 | 76.12 | 77.33 | 78.79 | 78.05 |
| CTX | 80.13 | 79.68 | 80.24 | 79.68 | 70.13 | 71.03 | 71.20 | 70.23 |
| SR | 82.83 | 82.54 | 82.76 | 82.65 | 74.21 | 75.11 | 76.52 | 75.83 |
| JRL | 85.67 | 86.03 | 85.34 | 85.42 | 77.81 | 78.11 | 78.98 | 78.58 |
| RA | 86.11 | 84.69 | 88.51 | 86.56 | 81.16 | 79.45 | 79.23 | 79.34 |
| Ours | 88.56 | 88.32 | 89.62 | 88.97 | 83.72 | 81.85 | 79.96 | 80.89 |
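The F1 columns in these tables are the harmonic mean of the reported mean precision (mP) and mean recall (mR), which can be checked directly against the table values:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as used in the F1 columns."""
    return 2 * precision * recall / (precision + recall)

# Our method's PETA row: mP = 88.32, mR = 89.62 -> F1 = 88.97
assert round(f1_score(88.32, 89.62), 2) == 88.97
# Our method's RAP row: mP = 81.85, mR = 79.96 -> F1 = 80.89
assert round(f1_score(81.85, 79.96), 2) == 80.89
```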
Experimental results on the effect of ConvLSTM and CAtt.
| Method | PETA mA | PETA mP | PETA mR | PETA F1 | RAP mA | RAP mP | RAP mR | RAP F1 |
|---|---|---|---|---|---|---|---|---|
| MLCNN | 79.86 | 81.73 | 79.92 | 80.81 | 68.22 | 72.46 | 71.34 | 71.90 |
| CNN-LSTM | 81.63 | 83.25 | 82.54 | 82.89 | 74.63 | 75.97 | 76.62 | 76.29 |
| CNN-SAtt-LSTM | 85.13 | 85.75 | 84.95 | 85.35 | 77.49 | 77.85 | 78.32 | 78.08 |
| CNN-ConvLSTM | 85.92 | 85.21 | 86.12 | 85.66 | 79.35 | 78.73 | 78.65 | 78.69 |
| CNN-SAtt-ConvLSTM | 86.08 | 85.34 | 86.22 | 85.78 | 79.48 | 78.83 | 78.77 | 78.80 |
| CNN-CAtt-ConvLSTM (Ours) | 88.56 | 88.32 | 89.62 | 88.97 | 83.72 | 81.85 | 79.96 | 80.89 |
Figure 4The attention heat maps of the CNN-CAtt-ConvLSTM model when predicting pedestrian attributes in different regions. (a) is the original pedestrian image; (b–f) are the attention heat maps for the head-shoulder, upper-body, lower-body, footwear and accessory regions.
Experimental results on the effect of the optimized prediction sequence.
| Method | PETA mA | PETA mP | PETA mR | PETA F1 | RAP mA | RAP mP | RAP mR | RAP F1 |
|---|---|---|---|---|---|---|---|---|
| Random sequence | 88.01 | 87.81 | 89.13 | 88.47 | 83.13 | 81.32 | 79.46 | 80.38 |
| Optimized sequence | 88.56 | 88.32 | 89.62 | 88.97 | 83.72 | 81.85 | 79.96 | 80.89 |