Bin Yu1,2, Ming Tang2, Guibo Zhu1,2, Jinqiao Wang1,2,3, Hanqing Lu1,2.
Abstract
Bounding box estimation by overlap maximization has significantly improved the state of the art in visual tracking, yet the gains in robustness and accuracy are restricted by the limited reference information, i.e., the initial target. In this paper, we present DCOM, a novel bounding box estimation method for visual tracking based on distribution calibration and overlap maximization. We assume that every dimension of the modulation vector follows a Gaussian distribution, so that its mean and variance can be borrowed from those of similar targets in large-scale training datasets. As such, sufficient and reliable reference information can be obtained from the calibrated distribution, leading to more robust and accurate target estimation. Additionally, an updating strategy for the modulation vector is proposed to adapt to variations of the target object. Our method can be built on top of off-the-shelf networks without fine-tuning or extra parameters. It yields state-of-the-art performance on three popular benchmarks, GOT-10k, LaSOT, and NfS, while running at around 40 FPS, confirming its effectiveness and efficiency.
Keywords: bounding box estimation; distribution calibration; overlap maximization; visual tracking
Year: 2021 PMID: 34884103 PMCID: PMC8662439 DOI: 10.3390/s21238100
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
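The calibration idea in the abstract can be illustrated with a minimal sketch. It is not the paper's exact procedure: the abstract does not give the formulas, so the nearest-neighbour selection by Euclidean distance, the simple averaging, and the names `calibrate`, `sample`, `base_stats`, `k`, and `alpha` are all assumptions made for illustration.

```python
import math
import random

def calibrate(init_vec, base_stats, k=2, alpha=0.0):
    """Borrow Gaussian statistics from the k base targets closest to init_vec.

    init_vec   : modulation vector from the initial frame (list of floats)
    base_stats : list of (mean_vec, var_vec) pairs from training targets
    alpha      : hypothetical extra spread added to the calibrated variance
    """
    # Rank base targets by squared Euclidean distance to init_vec (an assumption).
    def dist(stat):
        mean_vec, _ = stat
        return sum((m - x) ** 2 for m, x in zip(mean_vec, init_vec))

    nearest = sorted(base_stats, key=dist)[:k]
    dim = len(init_vec)

    # Calibrated mean: average of the borrowed means and the initial vector.
    mean = [
        (sum(s[0][d] for s in nearest) + init_vec[d]) / (len(nearest) + 1)
        for d in range(dim)
    ]
    # Calibrated variance: average of the borrowed per-dimension variances.
    var = [
        sum(s[1][d] for s in nearest) / len(nearest) + alpha
        for d in range(dim)
    ]
    return mean, var

def sample(mean, var, n, seed=0):
    """Draw n vectors from the per-dimension Gaussian N(mean, var)."""
    rng = random.Random(seed)
    return [[rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)]
            for _ in range(n)]

# Toy statistics: two similar targets and one dissimilar target.
base_stats = [
    ([1.0, 0.0], [0.04, 0.01]),
    ([0.9, 0.1], [0.02, 0.03]),
    ([-1.0, 2.0], [0.50, 0.50]),
]
mean, var = calibrate([1.1, 0.2], base_stats, k=2)
extra_refs = sample(mean, var, n=5)  # additional reference vectors
```

In this toy setting the dissimilar third target is ignored, and the sampled vectors stand in for the extra reference information that a single initial frame cannot provide.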
The mean similarity (mSim) and variance similarity (vSim) between the modulation vector of car-1 the size of and those of other target objects from the LaSOT dataset.
| Video Name | Target Size | mSim | vSim |
|---|---|---|---|
|  |  | 97% | 95% |
|  |  | 82% | 71% |
|  |  | 92% | 88% |
|  |  | 69% | 57% |
|  |  | 48% | 32% |
|  |  | 36% | 14% |
|  |  | 40% | 18% |
Figure 1. Example targets of similar classes and close sizes, selected from car-1, car-6, and car-20 in the training set of LaSOT.
Modern trackers and the bounding box estimation methods they use (MSS: multi-scale search; BBR: bounding box regression; OM: overlap maximization).
| Tracker | Venue | MSS | BBR | OM | Other |
|---|---|---|---|---|---|
| KCF | TPAMI2015 |  |  |  |  |
| SAMF | ECCV2014 | √ |  |  |  |
| DSST | TPAMI2017 | √ |  |  |  |
| MDNet | CVPR2016 | √ |  |  |  |
| SiamFC | ECCV2016 | √ |  |  |  |
| ECO | CVPR2017 | √ |  |  |  |
| EAST | ICCV2017 | √ |  |  |  |
| SiamRPN | CVPR2018 | √ |  |  |  |
| SiamRPN++ | CVPR2019 | √ |  |  |  |
| SiamMASK | CVPR2019 | √ |  |  |  |
| ATOM | CVPR2019 | √ |  |  |  |
| DiMP | ICCV2019 | √ |  |  |  |
| DCFST | ECCV2020 | √ |  |  |  |
| KYS | ECCV2020 | √ |  |  |  |
| SiamCAR | CVPR2020 | √ |  |  |  |
| SiamRCNN | CVPR2020 | √ |  |  |  |
| AlphaRefine | CVPR2021 | √ |  |  |  |
Figure 2. An overview of the proposed DCOM. The CNN module is composed of the backbone network and an extra convolution layer, and the MLP (multi-layer perceptron) consists of three fully connected layers.
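Overlap maximization, as used in ATOM-style trackers, refines a candidate box by gradient ascent on a predicted overlap score. The toy sketch below shows only that refinement loop. As stand-ins that are not part of the paper, the learned overlap predictor is replaced by the analytic IoU against a known box, and gradients are taken by finite differences; `refine`, `lr`, `eps`, and the box values are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def refine(box, score, steps=200, lr=20.0, eps=1e-3):
    """Gradient ascent on score(box); returns the best-scoring box visited."""
    box = list(box)
    best, best_score = box[:], score(box)
    for _ in range(steps):
        # Central finite differences stand in for backpropagation
        # through a learned overlap predictor.
        grad = []
        for i in range(4):
            plus, minus = box[:], box[:]
            plus[i] += eps
            minus[i] -= eps
            grad.append((score(plus) - score(minus)) / (2 * eps))
        box = [b + lr * g for b, g in zip(box, grad)]
        # Keep width and height positive.
        box[2], box[3] = max(box[2], 1.0), max(box[3], 1.0)
        current = score(box)
        if current > best_score:
            best, best_score = box[:], current
    return best

target = (50.0, 40.0, 40.0, 20.0)   # stand-in for the true target box
init = (56.0, 44.0, 36.0, 24.0)     # coarse initial estimate
refined = refine(init, lambda b: iou(b, target))
```

In a real tracker the score would come from the network, conditioned on the modulation vector, which is why the quality of that reference vector matters for the final box.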
Ablation study of the sub-modules on LaSOT.
| Method | ATOM-DCOM AUC | ATOM-DCOM Prec. | DiMP-DCOM AUC | DiMP-DCOM Prec. |
|---|---|---|---|---|
| Baseline | 0.515 | 0.479 | 0.568 | 0.535 |
| Baseline+Noise | 0.498 | 0.463 | 0.542 | 0.511 |
| Baseline+Up | 0.513 | 0.477 | 0.568 | 0.536 |
| Baseline+DC | 0.526 | 0.490 | 0.574 | 0.542 |
| Baseline+DC+Up | 0.536 | 0.501 | 0.583 | 0.549 |
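The contribution of each sub-module is easier to read as a difference from the baseline. The short script below only restates the AUC values from the ablation table and subtracts the baseline; the helper name `auc_gains` is illustrative.

```python
# AUC values copied from the ablation table on LaSOT.
ablation_auc = {
    "ATOM-DCOM": {
        "Baseline": 0.515,
        "Baseline+Noise": 0.498,
        "Baseline+Up": 0.513,
        "Baseline+DC": 0.526,
        "Baseline+DC+Up": 0.536,
    },
    "DiMP-DCOM": {
        "Baseline": 0.568,
        "Baseline+Noise": 0.542,
        "Baseline+Up": 0.568,
        "Baseline+DC": 0.574,
        "Baseline+DC+Up": 0.583,
    },
}

def auc_gains(rows):
    """AUC change of every variant relative to the baseline row."""
    base = rows["Baseline"]
    return {name: round(auc - base, 3)
            for name, auc in rows.items() if name != "Baseline"}

gains = {tracker: auc_gains(rows) for tracker, rows in ablation_auc.items()}

for tracker, deltas in gains.items():
    for variant, delta in deltas.items():
        print(f"{tracker:10s} {variant:16s} {delta:+.3f}")
```

Read this way, updating alone leaves AUC essentially unchanged (-0.002 and 0.000), distribution calibration alone adds 0.011 and 0.006, and the two combined add 0.021 and 0.015 for the ATOM and DiMP baselines respectively.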
Comparisons with the state-of-the-art trackers on LaSOT.
| Tracker | Backbone | AUC | Prec. |
|---|---|---|---|
| ECO | VGG-m | 0.324 | 0.302 |
| MDNet | VGG-m | 0.397 | 0.370 |
| SiamRPN++ | ResNet-50 | 0.496 | 0.467 |
| MAML | ResNet-18 | 0.523 | - |
| SiamCAR | ResNet-50 | 0.516 | 0.493 |
| SiamBAN | ResNet-50 | 0.514 | 0.491 |
| ATOM | ResNet-18 | 0.515 | 0.479 |
| DiMP | ResNet-50 | 0.568 | 0.535 |
| ATOM-DCOM | ResNet-18 | 0.536 | 0.501 |
| DiMP-DCOM | ResNet-50 | 0.583 | 0.549 |
Figure 3. Comparisons with state-of-the-art trackers on LaSOT [22] in terms of precision plots and success plots. All figures are drawn with the official toolkit.
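For readers unfamiliar with the two plots, the sketch below shows how the summary numbers are commonly derived from per-frame results. It follows the usual OTB-style convention (21 overlap thresholds from 0 to 1, a 20-pixel center-error threshold) as an assumption; the official LaSOT toolkit may differ in details, and the function names and sample values are illustrative.

```python
def success_auc(ious, num_thresholds=21):
    """Area under the success plot: mean success rate over overlap thresholds."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    # Success rate at threshold t: fraction of frames whose IoU exceeds t.
    rates = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)

def precision_at(center_errors, pixels=20.0):
    """Fraction of frames whose center location error is within `pixels`."""
    return sum(e <= pixels for e in center_errors) / len(center_errors)

# Made-up per-frame values, for illustration only.
ious = [0.0, 0.5, 1.0]
errors = [35.0, 12.0, 3.0]
auc = success_auc(ious)
prec = precision_at(errors)
```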
Comparisons with the state-of-the-art trackers on GOT-10k.
| Metric | SiamRPN++ | SiamCAR | ATOM | DiMP | ATOM-DCOM | DiMP-DCOM |
|---|---|---|---|---|---|---|
| AO | 0.517 | 0.579 | 0.556 | 0.611 | 0.572 |  |
| SR0.50 | 0.616 | 0.677 | 0.634 | 0.717 | 0.651 |  |
| SR0.75 | 0.325 | 0.437 | 0.402 | 0.492 | 0.407 |  |
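The GOT-10k metrics in this table are simple statistics of the per-frame overlap: AO is the average overlap, and SR at a threshold is the fraction of frames whose overlap exceeds that threshold. A minimal sketch, with hypothetical function names and made-up overlaps:

```python
def average_overlap(ious):
    """AO: mean IoU between predicted and ground-truth boxes over all frames."""
    return sum(ious) / len(ious)

def success_rate(ious, threshold):
    """SR: fraction of frames whose IoU exceeds the given threshold."""
    return sum(v > threshold for v in ious) / len(ious)

# Made-up per-frame overlaps, for illustration only.
ious = [0.2, 0.6, 0.8, 0.9]
ao = average_overlap(ious)
sr_050 = success_rate(ious, 0.50)
sr_075 = success_rate(ious, 0.75)
```

SR0.75 is the stricter of the two rates, so it is the one most sensitive to how tightly the estimated box fits the target.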
Comparisons with the state-of-the-art trackers on NfS.
| Metric | ECO | SiamRPN++ | ATOM | DiMP | ATOM-DCOM | DiMP-DCOM |
|---|---|---|---|---|---|---|
| AUC | 0.466 | 0.620 | 0.590 | 0.620 | 0.616 |  |
Figure 4. Comparisons with state-of-the-art trackers on NfS [23] in terms of success plots.
Mean FPS (mFPS) of our DiMP-DCOM and ATOM-DCOM and of other state-of-the-art trackers on LaSOT.
| Tracker | DiMP-DCOM | DiMP | SiamBAN | SiamCAR | ECO | SiamFC |
|---|---|---|---|---|---|---|
| mFPS | 38 | 43 | 40 | 52 | 6 | 26 |
|  |  |  |  |  |  |  |
| mFPS | 54 | 58 | 42 | 35 | 1 | 36 |
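To put the speed figures in per-frame terms, the snippet below converts the two mFPS values reported for DiMP-DCOM and DiMP into milliseconds per frame. Only the numbers from the table are used; the variable and function names are illustrative.

```python
def frame_ms(fps):
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / fps

dimp_dcom_fps, dimp_fps = 38, 43  # mFPS values from the table

# Extra time per frame when DCOM is added on top of DiMP.
overhead_ms = frame_ms(dimp_dcom_fps) - frame_ms(dimp_fps)
# Relative drop in throughput.
slowdown = 1.0 - dimp_dcom_fps / dimp_fps

print(f"overhead: {overhead_ms:.2f} ms/frame, slowdown: {slowdown:.1%}")
```

Going from 43 to 38 FPS corresponds to roughly 3 ms of extra work per frame, a throughput drop of about 12%.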
Figure 5. Tracking results of DiMP-DCOM (green), DiMP (blue), and ATOM (red) on challenging sequences from GOT-10k [21]. DiMP-DCOM estimates bounding boxes more reliably and with better accuracy throughout tracking. Best viewed zoomed in.