Minglei Tong, Lyuyuan Fan, Hao Nan, Yan Zhao.
Abstract
Estimating the number of people in highly clustered crowd scenes is an extremely challenging task because of severe occlusion and the non-uniform distribution of people within a single crowd image. Traditional crowd counting methods use various CNN-like networks to regress a crowd density map and then predict the count from it. In contrast, we investigate a simple but effective deep learning model that concentrates on accurately predicting the density map while simultaneously training a density level classifier to relax the parameters of the network, with the aim of preventing dangerous stampedes using a smart camera. First, a combination of atrous and fractional stride convolutional neural network (CAFN) is proposed to deliver larger receptive fields and to reduce the loss of detail during down-sampling by using dilated kernels. Second, an expanded architecture (MTCAFN, multiple-task CAFN for both regression and classification) is offered that not only precisely regresses the density map but also classifies the density level of the crowd at the same time. Third, experimental results on four datasets (Shanghai Tech A (MAE = 88.1) and B (MAE = 18.8), WorldExpo'10 (average MAE = 8.2), and UCF_CC_50 (MAE = 303.2)) show that the proposed method delivers effective performance.
Keywords: crowd counting; fractional stride network; multiple task learning; smart camera
Year: 2019 PMID: 30889874 PMCID: PMC6471139 DOI: 10.3390/s19061346
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Flowchart of the Edge Computing.
Figure 2. The structure of the proposed double-column convolutional neural network for crowd density map estimation.
A configuration of CAFN.
| Front-End (Column 1) | Front-End (Column 2) | Back-End |
|---|---|---|
| Dilation rate = 2 | Dilation rate = 3 | No dilation |
| Conv9 × 9 @ 16 | Conv7 × 7 @ 20 | Conv3 × 3 @ 24 |
| Max-pooling | Max-pooling | Conv3 × 3 @ 32 |
| Conv7 × 7 @ 32 | Conv5 × 5 @ 40 | ConvTranspose4 × 4 @ 16 |
| Max-pooling | Max-pooling | PReLU |
| Conv7 × 7 @ 16 | Conv5 × 5 @ 20 | Conv1 × 1 @ 1 |
| Conv7 × 7 @ 8 | Conv5 × 5 @ 10 | Max-pooling |
Figure 3. Different convolution methods [32]. Blue maps are inputs, and cyan maps are outputs. (a) Half padding, no strides. (b) No padding, no strides, transposed. (c) No padding, no strides, dilation.
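The size arithmetic behind these three convolution types can be sketched with the standard output-size formulas. This is a minimal illustration, not code from the paper; the function names are hypothetical, and the stride and padding values in the usage note are assumptions because the configuration table lists only kernel sizes and channel counts.

```python
def effective_kernel(k: int, dilation: int = 1) -> int:
    """Span in pixels covered by a k-wide kernel whose taps are `dilation` apart."""
    return k + (k - 1) * (dilation - 1)


def conv_out(size: int, k: int, stride: int = 1, padding: int = 0,
             dilation: int = 1) -> int:
    """Output width of an ordinary or dilated (atrous) convolution."""
    return (size + 2 * padding - effective_kernel(k, dilation)) // stride + 1


def conv_transpose_out(size: int, k: int, stride: int = 1, padding: int = 0) -> int:
    """Output width of a transposed (fractionally strided) convolution."""
    return (size - 1) * stride - 2 * padding + k


# (a) half padding, no strides: a 3-wide kernel keeps a 5-wide input at 5
same = conv_out(5, 3, padding=1)
# (b) no padding, no strides, transposed: a 2-wide input grows to 4
grown = conv_transpose_out(2, 3)
# (c) no padding, no strides, dilation 2: a 7-wide input shrinks to 3
shrunk = conv_out(7, 3, dilation=2)
```

Under these formulas, a 9 × 9 kernel with dilation rate 2 spans 17 pixels and a 7 × 7 kernel with dilation rate 3 spans 19 pixels, which is how dilation enlarges the receptive field without adding weights. If the 4 × 4 transposed convolution used stride 2 and padding 1 (an assumption), `conv_transpose_out(n, 4, stride=2, padding=1)` would return `2 * n`, undoing one max-pooling step.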
Figure 4. The two most widely used methods in multi-task learning. (a) Hard weight-sharing is generally applied by sharing the hidden layers between all tasks while keeping several task-specific output layers. (b) In soft weight-sharing, each task has its own model with its own parameters. The distance between the weights of the models is then regularized to encourage the parameters to be similar.
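The two sharing schemes can be contrasted with a toy sketch. Everything here is hypothetical and chosen for illustration: a single shared ReLU layer stands in for the shared hidden layers, and a squared Euclidean distance with a made-up `strength` parameter stands in for the regularizer on the distance between weights.

```python
def hard_sharing_forward(x, shared_w, head_a_w, head_b_w):
    """Hard sharing: one shared hidden layer feeds two task-specific heads."""
    hidden = [max(0.0, w * x) for w in shared_w]  # shared ReLU layer
    out_a = sum(h * w for h, w in zip(hidden, head_a_w))  # task A output
    out_b = sum(h * w for h, w in zip(hidden, head_b_w))  # task B output
    return out_a, out_b


def squared_distance(weights_a, weights_b):
    """Squared Euclidean distance between two equally long weight lists."""
    return sum((a - b) ** 2 for a, b in zip(weights_a, weights_b))


def soft_sharing_loss(loss_a, loss_b, weights_a, weights_b, strength=0.1):
    """Soft sharing: separate models, plus a penalty that pulls weights together."""
    return loss_a + loss_b + strength * squared_distance(weights_a, weights_b)
```

In the hard scheme a gradient from either task updates `shared_w`; in the soft scheme each task keeps its own weights and only the penalty term couples them.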
Figure 5. The structure of MTCAFN.
Specific parameters of MTCAFN with dilation rate = 2.
| Layer Name | Layer Type | Channel | Output Size | Last Layer Name | Dilation |
|---|---|---|---|---|---|
| Input | | 1 | H × W × C | | |
| Conv1_1 | Conv2d | 16/32/32/32 | H × W × C | Input | True, =2 |
| Conv2_1 | Maxpool + Conv2d | 32/64/64/64 | (H/2) × (W/2) × C | Conv1_1 | True, =2 |
| Conv3_1 | Maxpool + Conv2d | 64/128/128 | (H/4) × (W/4) × C | Conv2_1 | True, =2 |
| Conv4_1 | Maxpool + Conv2d | 128/256 | (H/8) × (W/8) × C | Conv3_1 | True, =2 |
| Conv4_2 | Upsampling + Conv2d | 256/128 | (H/4) × (W/4) × C | Conv4_1 | True, =2 |
| Merge3 | Concatenate | 256 = (128 + 128) | (H/4) × (W/4) × C | Conv3_1, Conv4_2 | |
| Conv3_2 | Conv2d + Upsample | 128/128/64 | (H/2) × (W/2) × C | Merge3 | False |
| Merge2 | Concatenate | 128 = (64 + 64) | (H/2) × (W/2) × C | Conv2_1, Conv3_2 | |
| Conv2_2 | Conv2d + Upsample | 64/64/32 | H × W × C | Merge2 | False |
| Merge1 | Concatenate | 64 = (32 + 32) | H × W × C | Conv1_1, Conv2_2 | |
| Output1 | Conv2d + Maxpool | 16/16/1 | (H/2) × (W/2) × 1 | Merge1 | |
| Conv_b | Conv2d + Maxpool | 512/512/1024 | (H/8) × (W/8) × C | Conv4_1 | False |
| Avgpool | GlobalAveragePool | 1024 | 1024 | Conv_b | |
| Output2 | Dense + Softmax | 256/5 | 5 | Avgpool | |
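The table lists two outputs: `Output1`, a one-channel density map, and `Output2`, a five-way softmax over density levels. One common way to train such a pair jointly is a weighted sum of a pixel-wise regression loss and a cross-entropy loss. The sketch below assumes that form; the loss functions, the `weight` value, and all function names are illustrative assumptions rather than the loss stated by the authors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def density_loss(pred_map, true_map):
    """Mean squared pixel error between two equally sized 2-D maps."""
    diffs = [(p - t) ** 2
             for pred_row, true_row in zip(pred_map, true_map)
             for p, t in zip(pred_row, true_row)]
    return sum(diffs) / len(diffs)


def level_loss(logits, true_level):
    """Cross-entropy of the softmax output against the true density level."""
    return -math.log(softmax(logits)[true_level])


def multitask_loss(pred_map, true_map, logits, true_level, weight=0.1):
    """Regression loss plus weighted classification loss (assumed combination)."""
    return density_loss(pred_map, true_map) + weight * level_loss(logits, true_level)
```

With a perfect density map and uniform logits over the five levels, the total reduces to `weight * log(5)`, so the classifier term alone still drives the shared layers.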
Figure 6. Four types of combinations.
Comparison of results on the Shanghai Tech dataset.
| Type | Part_A MAE | Part_A MSE | Part_B MAE | Part_B MSE |
|---|---|---|---|---|
| Type1 | 100.8 | 152.3 | 21.5 | 38.0 |
| Type2 | 103.0 | 161.9 | 24.8 | 45.8 |
| Type3 | 99.6 | 155.0 | 28.3 | 48.7 |
| Type4 | 101.1 | 160.5 | 24.1 | 45.7 |
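The MAE and MSE columns in the result tables are computed over per-image counts. The definitions are not given in this text, so the sketch below assumes the convention common in crowd counting work, where MAE is the mean absolute count error and the column labelled MSE is the square root of the mean squared count error. The function names and sample counts are made up for illustration.

```python
import math


def mae(true_counts, pred_counts):
    """Mean absolute error between true and predicted per-image counts."""
    errors = [abs(t - p) for t, p in zip(true_counts, pred_counts)]
    return sum(errors) / len(errors)


def mse(true_counts, pred_counts):
    """Root of the mean squared count error (assumed convention for 'MSE')."""
    squares = [(t - p) ** 2 for t, p in zip(true_counts, pred_counts)]
    return math.sqrt(sum(squares) / len(squares))


# Two hypothetical test images with 100 and 200 people
example_mae = mae([100, 200], [110, 170])
example_mse = mse([100, 200], [110, 170])
```

Because the squared term weights large misses more heavily, MSE under this convention is never smaller than MAE, which matches every row of the tables.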
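Parameters of the benchmark datasets. Num is the number of images; Max, Min, and Ave are the maximal, minimal, and average crowd counts per image; Total is the total number of labeled people.
| Dataset | Num | Max | Min | Ave | Total |
|---|---|---|---|---|---|
| UCF_CC_50 | 50 | 4543 | 94 | 1280 | 63974 |
| WorldExpo'10 | 3980 | 253 | 1 | 50 | 199923 |
| Shanghai Tech Part_A | 482 | 3139 | 33 | 501 | 241677 |
| Shanghai Tech Part_B | 716 | 578 | 9 | 124 | 88488 |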
Estimation errors on the Shanghai Tech dataset.
| Method | Part_A MAE | Part_A MSE | Part_B MAE | Part_B MSE |
|---|---|---|---|---|
| Zhang et al. | 181.8 | 277.7 | 32.0 | 49.8 |
| Marsden et al. | 126.5 | 173.5 | 23.8 | 33.1 |
| MCNN | 110.2 | 173.2 | 26.4 | 41.3 |
| Cascaded-MTL | 101.3 | 152.4 | 20.0 | 31.1 |
| Switching-CNN | 90.4 | 135.0 | 21.6 | 33.4 |
| CAFN (ours) | 100.8 | 152.3 | 21.5 | 33.4 |
| MTCAFN (ours) | 88.1 | 137.2 | 18.8 | 31.3 |
Estimation errors on the UCF_CC_50 dataset.
| Method | MAE | MSE |
|---|---|---|
| Zhang et al. | 467.0 | 498.5 |
| MCNN | 377.6 | 509.1 |
| Marsden et al. | 338.6 | 424.5 |
| Cascaded-MTL | 322.8 | 397.9 |
| Switching-CNN | 318.1 | 439.2 |
| CAFN (ours) | 305.3 | 429.4 |
| MTCAFN (ours) | 303.2 | 417.6 |
Estimation errors on the WorldExpo’10 dataset.
| Method | Sce1 | Sce2 | Sce3 | Sce4 | Sce5 | Avg. |
|---|---|---|---|---|---|---|
| Zhang et al. | 9.8 | 14.1 | 14.3 | 22.2 | | 12.9 |
| Shang et al. | 7.8 | 15.4 | 14.9 | 11.8 | 5.8 | 11.7 |
| MCNN | 3.4 | 20.6 | 12.9 | 13.0 | 8.1 | 11.6 |
| Switching-CNN | 4.4 | 15.7 | 10.0 | 11.0 | 5.9 | 9.4 |
| CP-CNN | 2.9 | 14.7 | 10.5 | 10.4 | 5.8 | 8.9 |
| CAFN (ours) | 3.3 | 25.3 | 27.4 | 26.3 | 4.2 | 17.3 |
| MTCAFN (ours) | 3.4 | 13.8 | 11.2 | 9.7 | 4.8 | 8.2 |
Figure 7. Visualization of crowd density. The corresponding test results and standard metrics are given for each row of images. The three columns on the right show: (a) original images from the Shanghai Tech dataset [16]; (b) the corresponding ground-truth maps; (c) the estimated density maps.
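The link between a density map and a count can be sketched as follows: if each annotated head contributes a kernel that sums to one, the sum over the whole map equals the number of people. The fixed-width Gaussian kernel, the `radius` and `sigma` values, and the function names are assumptions made for illustration; how the ground-truth maps in the figure were actually generated is not stated here.

```python
import math


def gaussian_kernel(radius, sigma):
    """Normalized 2-D Gaussian of side 2 * radius + 1 whose entries sum to 1."""
    offsets = range(-radius, radius + 1)
    kernel = [[math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
               for dx in offsets] for dy in offsets]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def density_map(points, height, width, radius=2, sigma=1.0):
    """Place one normalized Gaussian at each (row, col) head annotation."""
    kernel = gaussian_kernel(radius, sigma)
    dmap = [[0.0] * width for _ in range(height)]
    for py, px in points:
        for ky, row in enumerate(kernel):
            for kx, value in enumerate(row):
                y, x = py + ky - radius, px + kx - radius
                if 0 <= y < height and 0 <= x < width:  # clip at the border
                    dmap[y][x] += value
    return dmap


def count(dmap):
    """Estimated number of people: the sum over the density map."""
    return sum(map(sum, dmap))


# Three hypothetical head positions, all far enough from the border
heads = [(5, 5), (10, 12), (7, 3)]
estimated = count(density_map(heads, height=20, width=20))
```

Heads closer to the border than `radius` lose part of their kernel to clipping, so the summed count falls slightly below the number of annotations in that case.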