Yamin Sun, Yue Zhao, Sirui Wang.
Abstract
Traffic target tracking is a core task in intelligent transportation systems because it supports scene understanding and autonomous driving. Most state-of-the-art (SOTA) multiple object tracking (MOT) methods adopt a two-step procedure: object detection followed by data association. Object detection has made great progress with the development of deep learning. However, data association still depends heavily on handcrafted constraints, such as appearance, shape, and motion, which must be elaborately tuned for a specific type of object. In this study, a spatial-temporal encoder-decoder affinity network is proposed for multiple traffic target tracking. It uses deep learning to learn a robust spatial-temporal affinity feature of the detections and tracklets for data association. The network contains a two-stage transformer encoder module that encodes the features of the detections and the tracked targets at the image level and the tracklet level, capturing spatial correlation and temporal history information. A spatial transformer decoder module then computes the association affinity; the outputs of the two-stage transformer encoder module are fed into it so that the spatial and temporal information from the detections and the tracklets of the tracked targets is fully captured and encoded. This allows efficient affinity computation for data association in online tracking. The method is evaluated on three popular multiple traffic target tracking datasets: KITTI, UA-DETRAC, and VisDrone. On the KITTI dataset, it is compared with 15 SOTA methods and achieves 86.9% multiple object tracking accuracy (MOTA) and 85.71% multiple object tracking precision (MOTP). On the UA-DETRAC dataset, it is compared with 12 SOTA methods and achieves 20.82% MOTA and 35.65% MOTP. On the VisDrone dataset, it is compared with 10 SOTA trackers and achieves 40.5% MOTA and 74.1% MOTP. These experimental results show that the proposed method is competitive with state-of-the-art methods.
Year: 2022 PMID: 35655505 PMCID: PMC9152393 DOI: 10.1155/2022/9693767
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. Framework of the proposed method.
Figure 2. Spatial transformer encoder module for image-level feature extraction.
Figure 3. Spatial transformer decoder module for computing association affinity matrix.
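
The abstract describes computing an association affinity between tracked targets and new detections, then using it for online data association. As an illustration only, the sketch below shows how an affinity matrix can drive that association step. It is not the authors' network: cosine similarity stands in for the learned transformer affinity, greedy matching stands in for whatever assignment step the paper uses (which this record does not state), and the names `cosine_affinity`, `greedy_associate`, and the threshold `min_affinity` are hypothetical.

```python
import math


def cosine_affinity(track_feats, det_feats):
    """Affinity A[i][j] = cosine similarity of track i and detection j.

    Assumption: a stand-in for the learned affinity described in the abstract.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

    return [[cos(t, d) for d in det_feats] for t in track_feats]


def greedy_associate(affinity, num_dets, min_affinity=0.5):
    """Match tracks to detections one-to-one, highest affinity first.

    Returns (matches, unmatched_tracks, unmatched_dets). Pairs whose
    affinity is below min_affinity (a hypothetical threshold) are rejected.
    """
    candidates = sorted(
        ((a, i, j) for i, row in enumerate(affinity) for j, a in enumerate(row)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for a, i, j in candidates:
        if a < min_affinity:
            break  # all remaining candidates are weaker
        if i in used_tracks or j in used_dets:
            continue  # each track and detection is matched at most once
        matches.append((i, j))
        used_tracks.add(i)
        used_dets.add(j)
    unmatched_tracks = [i for i in range(len(affinity)) if i not in used_tracks]
    unmatched_dets = [j for j in range(num_dets) if j not in used_dets]
    return sorted(matches), unmatched_tracks, unmatched_dets


# Toy example with 2-D features (illustrative values, not from the paper).
tracks = [[1.0, 0.0], [0.0, 1.0]]
dets = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
A = cosine_affinity(tracks, dets)
matches, lost_tracks, new_dets = greedy_associate(A, num_dets=len(dets))
```

In an online tracker, unmatched detections would typically start new tracklets and unmatched tracks would be kept for a few frames before being terminated. An optimal one-to-one assignment (for example, the Hungarian algorithm) could replace the greedy loop without changing the interface.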
Different tracklet lengths for the temporal transformer encoder model on the KITTI validation set.
| Method | FP↓ (%) | FN↓ (%) | MOTA↑ (%) |
|---|---|---|---|
| | 8.9 | 10.1 | 83.2 |
| | 5.3 | 5.8 | 87.1 |
| | 5.5 | 6.1 | 86.9 |
| Ours | 5.2 | 5.9 | 88.4 |
Different input queries for spatial transformer decoder model on KITTI validation set.
| Method | FP↓ (%) | FN↓ (%) | MOTA↑ (%) |
|---|---|---|---|
| P1-tracker | 7.9 | 9.4 | 84.3 |
| P2-tracker | 6.4 | 7.1 | 86.5 |
| Ours | 5.2 | 5.9 | 88.4 |
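
The tables in this record report MOTA and MOTP. For readers unfamiliar with these metrics, the sketch below follows the standard CLEAR MOT definition of MOTA (one minus the sum of false negatives, false positives, and identity switches, divided by the number of ground-truth objects) and the overlap-based form of MOTP (mean overlap of matched pairs, higher is better). The function names and example counts are hypothetical, and each benchmark's official evaluation script may differ in matching details.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple object tracking accuracy over a whole sequence.

    All counts are summed over every frame; num_gt is the total number
    of ground-truth objects across frames.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def motp(matched_overlaps):
    """Overlap-based tracking precision: mean overlap (e.g. IoU) of matched pairs."""
    if not matched_overlaps:
        return 0.0
    return sum(matched_overlaps) / len(matched_overlaps)


# Illustrative counts only, not taken from the paper.
accuracy = mota(false_negatives=10, false_positives=5, id_switches=1, num_gt=100)
precision = motp([0.8, 0.9])
```

MOTA can be negative when the number of errors exceeds the number of ground-truth objects, which is why it is not a simple percentage of correct detections.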
KITTI dataset evaluation results.
| Dataset | Method | Setting | MT↑ (%) | ML↓ (%) | IDS↓ | FG↓ | MOTA↑ (%) | MOTP↑ (%) |
|---|---|---|---|---|---|---|---|---|
| KITTI Car | MCMOT_CPD | Offline | 52.31 | 11.69 | 228 | 536 | 78.90 | 82.13 |
| | DSK | Offline | 60 | 8.31 | 296 | 868 | 76.15 | 83.42 |
| | Complexer-YOLO | Online | 58 | 5.08 | 1186 | 2092 | 75.7 | 78.46 |
| | NOMT | Offline | 41.08 | 25.23 | 31 | 207 | 66.6 | 78.17 |
| | LP_SSVM | Offline | 35.54 | 21.26 | 62 | 539 | 61.77 | 76.93 |
| | CEM | Offline | 20 | 31.54 | 125 | 396 | 51.94 | 77.11 |
| | RMOT | Online | 21.69 | 31.859 | 209 | 727 | 52.42 | 75.18 |
| | ODAMOT | Online | 27.08 | 15.54 | 389 | 1274 | 59.23 | 75.45 |
| | SCEA | Online | 26.92 | 26.62 | 104 | 448 | 57.03 | 78.84 |
| | CIWT | Online | 13.75 | 34.71 | 112 | 901 | 43.37 | 71.44 |
| | FAMNet | Online | 51.38 | 8.92 | 123 | 713 | 77.08 | 78.79 |
| | SASN-MCF | Online | 58 | 7.85 | 443 | 975 | 70.06 | 82.65 |
| | MASS | Online | 74 | 2.92 | 353 | 516 | 84.64 | 85.36 |
| | SAMT | Online | 62.77 | 6.00 | 198 | 294 | 83.64 | 85.89 |
| | CenterTrack | Online | 82.15 | 2.46 | 254 | 227 | 88.83 | 84.97 |
| | Ours | Online | 83.1 | 2.9 | 271 | 254 | 86.90 | 85.71 |
Figure 4. Tracking examples of the proposed method on the KITTI dataset. (a) Sequence 0014. (b) Sequence 0015. (c) Sequence 0017.
UA-DETRAC dataset evaluation results.
| Dataset | Method | Setting | Detector | MT↑ (%) | ML↓ (%) | IDS↓ | FG↓ | MOTA↑ (%) | MOTP↑ (%) |
|---|---|---|---|---|---|---|---|---|---|
| UA-DETRAC | GOG | Offline | CompACT | 13.90 | 19.90 | 3334.6 | 3172.4 | 14.20 | 37.00 |
| | H2T | Offline | CompACT | 14.8 | 19.4 | 852.2 | 1117.2 | 12.40 | 35.7 |
| | IHTLS | Offline | CompACT | 13.8 | 19.9 | 953.6 | 3556.9 | 11.10 | 36.8 |
| | DCT | Offline | CompACT | 6.7 | 29.3 | 141.4 | 132.4 | 10.80 | 37.1 |
| | DCT | Offline | R-CNN | 10.1 | 22.8 | 758.7 | 742.9 | 11.7 | 38.0 |
| | CEM | Offline | CompACT | 3 | 35.3 | 267.9 | 352.3 | 5.10 | 35.2 |
| | CMOT | Online | CompACT | 16.1 | 18.6 | 285.3 | 1516.8 | 12.60 | 36.1 |
| | IOU | Online | CompACT | 14.8 | 19.7 | 2308.1 | 3250.4 | 16.10 | 37.0 |
| | IOU | Online | R-CNN | 13.8 | 20.7 | 5029.4 | 5795.7 | 16.00 | 38.3 |
| | V-IOUT | Online | CompACT | 17.4 | 18.8 | 363.8 | 1123.5 | 17.7 | 36.4 |
| | FAMNet | Online | CompACT | 17.1 | 18.2 | 617 | 970.2 | 19.80 | 36.7 |
| | Ours | Online | CompACT | 17.6 | 18.1 | 518.2 | 1546.8 | 20.14 | 34.37 |
| | Ours | Online | R-CNN | 18.9 | 17.6 | 463.4 | 1450.6 | 20.82 | 35.65 |
Figure 5. Tracking examples of the proposed method on the UA-DETRAC dataset. (a) Sequence MVI_40853. (b) Sequence MVI_40763.
VisDrone2018 dataset evaluation results.
| Dataset | Method | Setting | MT↑ | ML↓ | IDS↓ | FG↓ | MOTA↑ (%) | MOTP↑ (%) |
|---|---|---|---|---|---|---|---|---|
| VisDrone2018 | H2T | Offline | 214 | 494 | 1269 | 2035 | 32.2 | 73.3 |
| | IHTLS | Offline | 245 | 446 | 1435 | 2662 | 36.5 | 74.8 |
| | GOG | Offline | 244 | 496 | 1114 | 2012 | 38.4 | 75.1 |
| | CEM | Offline | 105 | 752 | 1002 | 1858 | 5.1 | 72.3 |
| | CMOT | Online | 282 | 435 | 789 | 2257 | 31.5 | 73.3 |
| | SCTrack | Online | 211 | 550 | 798 | 2042 | 35.8 | 75.6 |
| | TBD | Online | 302 | 419 | 1834 | 2307 | 35.6 | 74.1 |
| | V-IOUT | Online | 297 | 514 | 265 | 1380 | 40.2 | 74.9 |
| | Ctrack | Online | 369 | 375 | 1376 | 2190 | 30.8 | 73.3 |
| | FRMOT | Online | 254 | 463 | 1043 | 2534 | 33.1 | 73.0 |
| | Ours | Online | 319 | 451 | 779 | 2090 | 40.5 | 74.1 |
Run-time performance (FPS) with different object detectors on the UA-DETRAC dataset.
| Trackers | | GOG | H2T | DCT | CEM | IOU | FAMNet | CMOT | TBD | IHTLS | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FPS | CompACT | 389.51 | 3.02 | 2.19 | 4.62 | 100.84 | 0.6 | 3.79 | 4.88 | 19.79 | 7.5 |
| | R-CNN | 352.8 | 2.78 | 0.71 | 5.4 | - | - | 3.59 | 3.17 | 11.96 | 8.3 |
Run-time performance (FPS) on the VisDrone dataset.
| Trackers | H2T | IHTLS | GOG | CEM | CMOT | SCTrack | TBD | V-IOUT | Ctrack | FRMOT | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FPS | 1.56 | 16.3 | 564.8 | 7.74 | 1.39 | 2.9 | 0.7 | 20 | 15 | 5 | 8.3 |