Khouloud Ben Ali Hassen, José J M Machado, João Manuel R S Tavares.
Abstract
Crowd counting has become a pillar of crowd control because it provides information about the number of people in a scene. It is helpful in many scenarios, such as video surveillance, public safety, and future event planning. Researchers have proposed different solutions to this task. Early work relied on more traditional solutions, whereas recent work focuses on deep learning methods and, more specifically, on Convolutional Neural Networks (CNNs), because of their efficiency. This review explores these methods by focusing on their key differences, advantages, and disadvantages. We have systematically analyzed algorithms and works based on the different models suggested and the problems they are trying to solve. The main focus is on the shift in the history of crowd counting methods from heuristic models to CNN models, identifying each category and discussing its methods and architectures. After a deep study of the literature on crowd counting, the survey partitions current datasets into sparse and crowded ones and discusses the reviewed methods by comparing their results on the different datasets. The findings suggest that heuristic models could be even more effective than CNN models in sparse scenarios.
Keywords: computer vision; crowded datasets; deep learning; people counting; sparse datasets
Year: 2022 PMID: 35890966 PMCID: PMC9315600 DOI: 10.3390/s22145286
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Overall structure of the current review study.
Figure 2. PRISMA diagram showing the results of the executed literature search.
Summary of regression-based methods.
| Method | Global Features | Regression Model | Dataset(s) |
|---|---|---|---|
| [ | Segment, internal edge, texture | Gaussian | Peds1, Peds2 |
| [ | Segment, motion | Linear regression | PETS 2009 |
| [ | Segment, edge, gradient | Gaussian | UCSD pedestrian, PETS 2009 |
| [ | Segment, edge, texture | Kernel ridge regression | UCSD, Mall |
| [ | Edge | Linear regression | Internal data (2000 images, number of people per image: from 3 to 27 people) |
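The regression-based methods in the table above map global image features (segment area, edge counts, texture, and so on) directly to a people count. As a minimal sketch of that idea, assuming a single hypothetical feature (the number of edge pixels per image) and ordinary least squares rather than any specific method listed in the table, the mapping could look as follows. The function names and the toy data are illustrative assumptions and do not come from the reviewed works.

```python
def fit_linear(features, counts):
    """Ordinary least squares fit of: count = slope * feature + intercept."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in features)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_count(model, feature):
    """Estimate the number of people from one global feature value."""
    slope, intercept = model
    return slope * feature + intercept


# Toy training data (made up): edge-pixel count per image -> people count.
edge_pixels = [300.0, 600.0, 900.0, 1200.0]
people = [3.0, 6.0, 9.0, 12.0]

model = fit_linear(edge_pixels, people)
estimate = predict_count(model, 750.0)
print(f"Estimated count: {estimate:.1f}")
```

Because the count is predicted from a whole-image feature, the estimate carries no information about where the people are, which is the loss of spatial information usually attributed to regression methods.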
Figure 3. Usual CNN architecture (adapted from [56]).
Details of the three CNN layers.
| Layer | Actions | Parameters | Input | Output |
|---|---|---|---|---|
| Convolutional layer | Apply filters to extract features; filters are composed of learned kernels. Apply the activation function to every value of the feature map. | Number of kernels; size of kernels; activation function; stride; padding; regularization type and value | 3D cube (previous set of feature maps) | 3D cube (one 2D map per filter) |
| Pooling layer | Reduce dimensionality. Extract the maximum or the average of a region using a sliding window. | Stride; size of the window | 3D cube (previous set of feature maps) | 3D cube (one 2D map per filter, with reduced spatial dimension) |
| Fully connected layer | Aggregate information from the final feature maps. Generate the final classification. | Number of nodes; activation function | Flattened 3D cube (previous set of feature maps) | 3D cube (one 2D map per filter) |
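The three layer types summarized in the table above can be illustrated with a small, dependency-free sketch. It assumes a single-channel input, no padding, a ReLU activation, and max pooling; the kernels, weights, and image values are made up for illustration, and in this sketch the fully connected layer simply returns one value per node. It is meant to show how the shapes change from layer to layer, not to reproduce any architecture from the reviewed works.

```python
def conv2d(image, kernel, stride=1):
    """Slide one kernel over the image (no padding), then apply ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(max(0.0, acc))  # ReLU on every feature-map value
        out.append(row)
    return out


def max_pool(fmap, size=2, stride=2):
    """Keep the maximum of each size x size window."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(
                fmap[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def fully_connected(fmaps, weights, bias):
    """Flatten all feature maps and compute one weighted sum per node."""
    flat = [v for fmap in fmaps for row in fmap for v in row]
    return [sum(w * v for w, v in zip(ws, flat)) + b for ws, b in zip(weights, bias)]


# Toy 6x6 single-channel image and two hand-written 3x3 kernels (assumptions).
image = [[float((r + c) % 3) for c in range(6)] for r in range(6)]
kernels = [
    [[1.0, 0.0, -1.0], [1.0, 0.0, -1.0], [1.0, 0.0, -1.0]],
    [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [-1.0, -1.0, -1.0]],
]

feature_maps = [conv2d(image, k) for k in kernels]  # two 4x4 maps
pooled = [max_pool(f) for f in feature_maps]        # two 2x2 maps
weights = [[0.1] * 8]                               # one node, 2 * 2 * 2 inputs
output = fully_connected(pooled, weights, [0.0])    # one value
```

Each kernel yields one 2D feature map, pooling halves the spatial size of every map, and the fully connected layer consumes the flattened result.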
Figure 4. General structure of the basic CNN architecture.
Figure 5. Overall architecture of the multi-column CNN.
Figure 6. General structure of the single-column CNN.
Disadvantages and advantages of different methods that have been proposed for crowd counting.
| Method | Disadvantages | Advantages |
|---|---|---|
| Detection methods | They do not perform well on crowded images, as most of the target objects are not clearly visible. | They work well for detecting faces, especially for sparse datasets. |
| Regression methods | They always ignore spatial information. | These methods are successful in dealing with problems of occlusion and background clutter. |
| Clustering methods | An individual who is camouflaged is ignored by the process. They do not work for estimating crowds from individual still images. | Joint evaluation of different hypotheses is unnecessary because the trajectories of tracked features are unique. |
| Basic CNN | Trained using perspective maps of images, which are not always available. | A light network that can automatically learn the effective features for training. |
| Multi-column CNN | Hard to train, and training takes a long time. It introduces redundant structure. The different columns seem to behave similarly without significant differences. | Addresses the scale variation problem for crowd counting through multiple branches with different receptive field sizes. |
| Single-column CNN | Complex architecture for methods using encoding-decoding blocks, such as TedNet. | Instead of the bloated structure of the multi-column architecture, deploys a single, deeper CNN without increasing the complexity of the network. |
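Comparison of the performance of different methods on the crowd counting datasets used (sparse: UCSD, Mall, ShanghaiTech Part B*; crowded: ShanghaiTech Part A, UCF_CC_50, WorldExpo'10, UCF-QNRF).
| Methods | Year | UCSD MAE | UCSD MSE | Mall MAE | Mall MSE | ShanghaiTech Part B* MAE | ShanghaiTech Part B* MSE | ShanghaiTech Part A MAE | ShanghaiTech Part A MSE | UCF_CC_50 MAE | UCF_CC_50 MSE | WorldExpo'10 MAE | WorldExpo'10 MSE | UCF-QNRF MAE | UCF-QNRF MSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|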
| MORR [ | 2012 | 2.29 | 8.08 | 3.15 | 15.7 | - | - | - | - | - | - | - | - | - | - |
| Clustering motion cues [ | 2014 | 2.97 | - | - | - | - | - | - | - | - | - | - | - | - | - |
| MCNN [ | 2016 | 1.07 | 1.35 | - | - | 26.4 | 41.3 | 110.2 | 173.2 | 377.6 | 509.1 | 11.6 | - | 277 | 426 |
| CrowdNet [ | 2016 | - | - | - | - | - | - | - | - | 452.5 | - | - | - | - | - |
| MSCNN [ | 2017 | - | - | - | - | 17.7 | 30.2 | 83.8 | 127.4 | 363.7 | 468.4 | 11.7 | - | - | - |
| ConvLSTM-nt [ | 2017 | 1.73 | 3.52 | 2.53 | 11.2 | - | - | - | - | 284.5 | 297.1 | 11.9 | - | - | - |
| CSRNet [ | 2018 | 1.16 | 1.47 | - | - | 10.6 | 16.0 | 68.2 | 115.0 | 266.1 | 397.5 | 8.6 | - | - | - |
| D-ConvNet [ | 2018 | - | - | - | - | 18.7 | 26.0 | 73.5 | 112.3 | 288.4 | 404.7 | 9.1 | - | - | - |
| SaCNN [ | 2018 | - | - | - | - | 16.2 | 25.8 | 86.8 | 139.2 | 314.9 | 424.8 | 8.5 | - | - | - |
| CNN with pixel-wise [ | 2018 | - | - | - | - | 10.0 | 16.5 | 72.3 | 116.2 | - | - | 8.8 | - | - | - |
| DecideNet [ | 2018 | - | - | 1.52 | 1.90 | 21.53 | 31.98 | - | - | - | - | 9.23 | - | - | - |
| RANet [ | 2019 | - | - | - | - | 7.9 | 12.9 | 59.4 | 102.0 | 239.8 | 319.4 | - | - | 111 | 190 |
| TedNet [ | 2019 | - | - | - | - | 8.2 | 12.8 | 64.2 | 109.1 | 249.4 | 354.5 | 8.0 | - | 113 | 188 |
| PaCNN [ | 2019 | 0.89 | 1.18 | - | - | 8.9 | 13.5 | 66.3 | 106.4 | 267.9 | 357.8 | 7.8 | - | - | - |
| SAAN [ | 2019 | - | - | 1.28 | 1.68 | - | - | - | - | - | - | - | - | - | - |
| PGCNet [ | 2019 | - | - | - | - | 8.8 | 13.7 | 57.0 | 86.0 | - | - | 8.1 | - | - | - |
| ADSCNet [ | 2020 | - | - | - | - | 6.4 | 11.3 | 55.4 | 97.7 | - | - | - | - | 71.3 | 132.5 |
| SASNet [ | 2021 | - | - | - | - | 6.35 | 9.9 | 53.59 | 88.38 | 161.4 | 234.46 | 5.71 | - | 85.2 | 147.3 |
* ShanghaiTech Part B is a sparse dataset, which is why it is listed before Part A in the table.
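Crowd counting results such as those in the comparison table are commonly reported as the mean absolute error (MAE) and the root of the mean squared error, which the crowd counting literature usually labels MSE. Assuming those definitions (they are not stated in the table itself), a minimal sketch of both metrics is given below; the predicted and true counts are made-up values used only for illustration.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and true counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error (often reported as 'MSE' in crowd counting)."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Toy per-image counts (made up for illustration).
predicted_counts = [10.0, 22.0, 35.0, 48.0]
true_counts = [12.0, 20.0, 35.0, 52.0]

print(f"MAE: {mae(predicted_counts, true_counts):.2f}")
print(f"RMSE: {rmse(predicted_counts, true_counts):.2f}")
```

MAE reflects the average counting accuracy, while the squared-error metric penalizes large per-image errors more heavily, so lower values are better for both.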