Siqi Tang1, Zhisong Pan1, Guyu Hu1, Yang Wu2, Yunbo Li1.
Abstract
In this paper, we propose a multi-scene adaptive crowd counting method based on meta-knowledge and multi-task learning. In practice, surveillance cameras are deployed at fixed positions in various scenes. Considering the extensibility of a surveillance system, an ideal crowd counting method should have strong generalization capability so that it can be deployed in unknown scenes. On the other hand, given the diversity of scenes, it should also fit each scene effectively for better performance. These two objectives are contradictory, so we propose a coarse-to-fine pipeline comprising a meta-knowledge network and multi-task learning. Specifically, at the coarse-grained stage, we propose a generic two-stream network, trained on all existing scenes, that encodes meta-knowledge, especially inter-frame temporal knowledge. At the fine-grained stage, the regression from the crowd density map to the overall number of people in each scene is treated as a homogeneous subtask in a multi-task framework. A robust multi-task learning algorithm is applied to learn scene-specific regression parameters for both existing and new scenes, which further improves the accuracy in each specific scene. Taking advantage of multi-task learning, the proposed method can be deployed to multiple new scenes without duplicated model training. Compared with two representative methods, namely AMSNet and MAML-counting, the proposed method reduces the MAE by 10.29% and 13.48%, respectively.
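The fine-grained stage described above can be illustrated with a minimal numpy sketch. Assumptions (ours, not the paper's): each scene learns a single scalar weight mapping the integrated density map to the final count, and the robust multi-task learner is stood in for by mean-regularized multi-task least squares, in which every scene's weight is pulled toward the cross-scene mean. The scene names and scales below are hypothetical toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: for each scene, x is the integral of the estimated density
# map and y is the true count; each scene has its own hidden scale.
scene_scales = {"S1": 0.9, "S2": 1.2, "S3": 1.05}
data = {}
for name, scale in scene_scales.items():
    x = rng.uniform(20, 200, size=50)            # integrated density maps
    y = scale * x + rng.normal(0, 2.0, size=50)  # scene-specific counts
    data[name] = (x, y)

def fit_mtl_weights(data, lam=5.0, iters=200, lr=1e-5):
    """Mean-regularized multi-task least squares: each scene's scalar
    weight w_s fits that scene's counts while being pulled toward the
    mean weight across scenes. This is one common robust-MTL
    formulation; the paper's exact algorithm may differ."""
    w = {s: 1.0 for s in data}
    for _ in range(iters):
        w_bar = np.mean(list(w.values()))
        for s, (x, y) in data.items():
            # Gradient of  mean((w_s*x - y)^2) + lam*(w_s - w_bar)^2
            grad = 2 * np.mean((w[s] * x - y) * x) + 2 * lam * (w[s] - w_bar)
            w[s] -= lr * grad
    return w

w = fit_mtl_weights(data)   # w["S2"] recovers roughly the hidden 1.2
```

Under this reading, a new scene would initialize its weight at the cross-scene mean and refine it from a handful of labeled frames, which is what allows deployment to new scenes without retraining the full density-estimation model.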
Keywords: crowd counting; meta-knowledge; multi-scene adaptive; multi-task learning
Year: 2022 PMID: 35591010 PMCID: PMC9104539 DOI: 10.3390/s22093320
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1The pipeline of proposed coarse-to-fine multi-scene adaptive crowd counting. At the coarse-grained stage, the frame pairs of multiple known scenes are used to train a generic model with meta-knowledge. At the fine-grained stage, overall counting regression from estimated density maps of each scene is regarded as a specific task. Multi-task learning is used to learn the regression weight of each specific scene.
Figure 2Two-stream network to capture meta-knowledge.
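The two-stream idea in Figure 2 can be sketched at shape level. Assumptions (ours; the figure's exact architecture is not reproduced here): the spatial branch sees the current frame, the temporal branch sees the inter-frame difference of a frame pair, and the two branches are fused by averaging; block averages stand in for CNN feature maps, and summing the fused map gives the coarse count.

```python
import numpy as np

def local_mean(img, k=4):
    """k x k block average, a crude stand-in for a conv feature map."""
    h, w = img.shape
    return img[: h - h % k, : w - w % k].reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def two_stream_density(frame_t, frame_t1):
    """Sketch of a two-stream estimator: a spatial branch on the current
    frame, a temporal branch on the inter-frame difference (our reading
    of 'inter-frame temporal knowledge'), fused by averaging."""
    spatial = local_mean(frame_t)
    temporal = local_mean(np.abs(frame_t1 - frame_t))
    return 0.5 * (spatial + temporal)

rng = np.random.default_rng(1)
f0 = rng.random((32, 32))            # hypothetical frame pair
f1 = f0 + 0.1 * rng.random((32, 32))
density = two_stream_density(f0, f1)
count = density.sum()                # count = integral of the density map
```

The frame pair is the only input the sketch needs, mirroring the pipeline in Figure 1 where frame pairs from known scenes train the generic model.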
The MAE comparison on WorldExpo’10. The best performance is colored red and second best is colored blue.
| Category | Methods | S1 | S2 | S3 | S4 | S5 | Ave. |
|---|---|---|---|---|---|---|---|
| Fully Supervised Methods | Cross Scene Net | 9.8 | 14.1 | 14.3 | 22.2 | 3.7 | 12.9 |
| | CSRNet | 2.9 | 11.5 | 8.6 | 16.6 | 3.4 | 8.6 |
| | CAN | 2.9 | 12.0 | 10.0 | – | 4.3 | 7.4 |
| | AMSNet | – | – | 10.8 | 10.4 | – | – |
| Domain Adaptation Methods | SE Cycle GAN | 2.7 | 15.4 | 12.1 | 11.9 | 3.6 | 9.1 |
| | MAML-counting | 3.05 | 10.37 | 8.18 | 9.41 | 3.91 | 7.05 |
| Proposed Methods | TSN | 2.8 | 9.2 | 8.9 | 11.7 | 3.2 | 7.2 |
| | TSN with fine-tune | 2.5 | 9.0 | – | 11.3 | – | 6.8 |
| | Backbone (CSRNet) with MTL | 2.5 | 10.3 | 8.4 | 13.4 | 3.0 | 7.5 |
| | Proposed TSN with MTL | – | – | – | – | – | – |
Figure 3 Comparison of density maps estimated by CSRNet and our proposed TSN.
Figure 4 Robustness of the proposed TSN-MTL under different lighting conditions and crowd densities, shown as estimated crowd density maps for frames collected by cameras 100,400 and 100,730. The first column is the original frame, the second column is the density map estimated by TSN, and the third column is the density map estimated by TSN with MTL. The first two rows are frames collected by camera 100,400, while the next two rows are frames collected by camera 100,730.
The performance comparison on camera 100,400 and camera 100,730.
| Methods | 100,400 | | | 100,730 | | |
|---|---|---|---|---|---|---|
| TSN | 11.56 | 16.35 | 34.96 | 7.44 | 11.57 | 10.57 |
| TSN with MTL | 7.98 | 12.06 | 20.49 | 5.01 | 8.83 | 6.17 |
Figure 5Similarity relationship of parameters in multiple scenes.