HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting
Naveed Ilyas, Boreom Lee, Kiseon Kim.
Abstract
Crowd counting is a challenging task due to large variations in perspective, density, and scale. CNN-based crowd counting techniques have achieved strong performance in sparse to dense environments. However, counting in scenes with high perspective variation is harder, because the same number of pixels can cover very different density levels; the resulting large variation among objects in the same spatial area makes accurate counting difficult. Furthermore, although existing CNN-based methods extract rich deep features, these features are used only locally and are dissipated while propagating through intermediate layers. This results in high counting errors, especially in dense scenes with strong perspective variation. In addition, class-specific responses along the channel dimension are underexploited. To address these issues, we propose a CNN-based dense feature extraction network for accurate crowd counting. The proposed model comprises three main modules: (1) a backbone network, (2) dense feature extraction modules (DFEMs), and (3) a channel attention module (CAM). The backbone network obtains general features with strong transfer-learning ability. Each DFEM is composed of multiple sub-modules called dense stacked convolution modules (DSCMs), densely connected with each other, so that features extracted in lower and middle layers are propagated to higher layers through dense connections. Combining the task-independent general features obtained by the earlier modules with the task-specific features obtained by the later ones yields high counting accuracy in scenes with large perspective variation. Finally, to exploit the class-specific response that separates background from foreground, the CAM is placed at the end of the network to refine high-level features along the channel dimension for better counting accuracy.
We evaluated the proposed method on three well-known datasets: ShanghaiTech (Part-A), ShanghaiTech (Part-B), and Venice. The results demonstrate its effectiveness on the selected performance metrics relative to state-of-the-art techniques.
Keywords: CNNs; crowd analysis; crowd counting; deep learning
Year: 2021 PMID: 34067707 PMCID: PMC8156381 DOI: 10.3390/s21103483
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The overview of HADF-Crowd, a hierarchical attention-based dense feature extraction network for single-image crowd counting (top). The dense feature extraction module (DFEM) with four deep DSCMs densely connected with each other (middle). The channel attention module (CAM) (bottom).
Figure 2. The overview of HADF-Crowd (top). The detailed architecture of the backbone network (bottom); it is a single-column network with four blocks, starting from block 1 and ending with block 4.
Figure 3. The expansion of the DFEM with multiple DSCMs densely connected with each other (top). The internal architecture of each DSCM, with dense connections within each DFEM (bottom).
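The figures describe the CAM only at block level. As a rough illustration of channel attention in general, not necessarily the paper's exact design, a squeeze-and-excitation-style channel re-weighting can be sketched in NumPy; the weight shapes, reduction ratio, and random initialization here are illustrative assumptions:

```python
import numpy as np

def channel_attention(x, reduction=8, seed=0):
    """Squeeze-and-excitation-style channel attention (illustrative sketch).

    Re-weights each feature channel by a learned gate in (0, 1), so that
    foreground-related channel responses can be emphasised over background.
    """
    rng = np.random.default_rng(seed)
    c = x.shape[0]
    squeeze = x.mean(axis=(1, 2))                  # global average pool -> (C,)
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    z = np.maximum(w1 @ squeeze, 0.0)              # FC + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ z)))         # FC + sigmoid -> (C,)
    return x * gate[:, None, None]                 # scale each channel

x = np.random.default_rng(1).standard_normal((64, 16, 16))
y = channel_attention(x)                           # same shape, re-weighted channels
```

Because the gate lies in (0, 1), attention can only attenuate channels here; a real module learns which channels to keep near full strength.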
The architecture of HADF-Crowd.
| Modules | Sub-Modules | Channels | Filter | Padding | Dilation | HADF-Crowd |
|---|---|---|---|---|---|---|
| Backbone | Sub-M1 | 64 | 3 × 3 | 1 | 1 | Conv3-64 |
| | Sub-M2 | 128 | 3 × 3 | 1 | 1 | Conv3-128 |
| | Sub-M3 | 256 | 3 × 3 | 1 | 1 | Conv3-256 |
| | Sub-M4 | 512 | 3 × 3 | 1 | 1 | Conv3-512 |
| DFEM | Sub-M5 | 512, 256, 128, 64 | 3 × 3 | 1 | 1 | Conv3-512-1 |
| | Sub-M6 | 576, 256, 128, 64 | 3 × 3 | 1 | 1 | Conv3-576-2 |
| | Sub-M7 | 640, 256, 128, 64 | 3 × 3 | 1 | 1 | Conv3-640-2 |
| | Sub-M8 | 640, 512 | 3 × 3 | 1 | 1 | Conv3-640-3 |
| Output | — | 512, 128, 64, 1 | 3 × 3 | 1 | 1 | Conv3-512-1 |
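The 576- and 640-channel inputs of Sub-M6 and Sub-M7 are consistent with each DSCM emitting 64 channels that are concatenated with all earlier features (512, 512 + 64, 512 + 2 × 64). A minimal NumPy sketch of this channel bookkeeping, in which a random 1 × 1 projection stands in for the real 3 × 3 convolution stack (an illustrative assumption, as is the exact wiring of Sub-M8):

```python
import numpy as np

def dscm(x, out_channels=64, seed=0):
    """Stand-in for one dense stacked convolution module (DSCM).

    A random 1x1 projection models only the channel bookkeeping; the real
    module stacks 3x3 convolutions as listed in the architecture table.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((out_channels, x.shape[0])) * 0.01
    # (C_out, C_in) contracted with (C_in, H, W) -> (C_out, H, W)
    return np.tensordot(w, x, axes=([1], [0]))

backbone_out = np.zeros((512, 8, 8))           # 512-channel backbone feature map
dense_inputs = [backbone_out]
input_widths = []
for _ in range(3):                             # Sub-M5 .. Sub-M7
    x = np.concatenate(dense_inputs, axis=0)   # dense connection: concat all prior features
    input_widths.append(x.shape[0])
    dense_inputs.append(dscm(x))

# input_widths is now [512, 576, 640], matching the Channels column above
```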
Estimation errors for ShanghaiTech (Part-A), (Part-B), and Venice datasets.
| Technique | Part-A MAE | Part-A MSE | Part-B MAE | Part-B MSE | Venice MAE | Venice MSE |
|---|---|---|---|---|---|---|
| Marsden et al. | 126.5 | 173.5 | 23.8 | 33.1 | - | - |
| MCNN | 110.2 | 173.2 | 26.4 | 41.3 | 145.4 | 147.3 |
| C-MTL | 101.3 | 152.4 | 20.0 | 31.1 | - | - |
| SwitchCNN | 90.4 | 135.0 | 21.6 | 33.4 | 52.8 | 59.5 |
| SaCNN | 86.8 | 139.2 | 16.2 | 25.8 | - | - |
| Mult-S-CNN | 83.7 | 124.5 | 17.9 | 32.4 | - | - |
| CP-CNN | 73.6 | 106.4 | 20.1 | 30.1 | - | - |
| ACSCP | 75.7 | 102.7 | 17.2 | 27.4 | - | - |
| Deep-NCL | 73.5 | 112.3 | 18.7 | 26.0 | - | - |
| IG-CNN | 72.5 | 118.2 | 13.6 | 21.1 | - | - |
| CLPNet | 71.5 | 108.7 | 12.2 | 20.0 | - | - |
| SCNet | 71.9 | 117.9 | 9.3 | 14.4 | - | - |
| ic-CNN | 68.5 | 116.2 | 10.7 | 12.2 | - | - |
| CSRNet | 68.2 | 115.0 | 10.0 | 16.0 | 35.8 | 50.0 |
| DecideNet | - | - | - | - | 21.5 | 31.9 |
| DRASAN | 69.3 | 96.4 | 11.1 | 18.2 | - | - |
| DFE-Crowd | 71.6 | 110.9 | 9.7 | 16.0 | 23.8 | 34.5 |
| IA-DCCN | 66.9 | 108.4 | 10.2 | 16.0 | - | - |
| DsNet | 61.2 | 102.6 | 6.7 | 10.5 | - | - |
| RANet | 59.4 | 102.0 | 7.9 | 12.9 | - | - |
| ECAN | 62.3 | 100.0 | 7.8 | 12.2 | 20.5 | 29.9 |
| HADF-Crowd | 71.1 | 111.6 | 9.7 | 15.7 | 14.1 | 20.1 |
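For reference, the MAE and MSE columns follow the usual crowd-counting convention of comparing predicted and ground-truth per-image counts, where "MSE" conventionally denotes the root of the mean squared error. A minimal sketch; the helper name and example counts are illustrative:

```python
import numpy as np

def counting_errors(pred_counts, gt_counts):
    """MAE and MSE as reported in crowd-counting benchmarks.

    Note: the quantity labelled MSE in this literature is the root
    mean squared error of the per-image counts.
    """
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.mean(np.abs(pred - gt))
    mse = np.sqrt(np.mean((pred - gt) ** 2))
    return mae, mse

# Two images, each off by 10 heads: MAE = 10.0 and MSE = 10.0
mae, mse = counting_errors([100, 250], [110, 240])
```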
Figure 4. Visualization on the ShanghaiTech (Part-A), ShanghaiTech (Part-B), and Venice datasets: input images, ground-truth density maps, and estimated density maps.
An ablation study on the Venice dataset.
| Modules | MAE | MSE |
|---|---|---|
| Backbone | 43.0 | 60.2 |
| Backbone + DFEM | 23.8 | 34.5 |
| Backbone + DFEM + CAM | 14.1 | 20.1 |