Haoyi Ma, Scott T Acton, Zongli Lin.
Abstract
Accurate and robust scale estimation in visual object tracking is a challenging task. To estimate the scale of the target object, most methods rely either on a multi-scale searching scheme or on refining a set of predefined anchor boxes. These methods require heuristically selected parameters, such as the scale factors of the multi-scale searching scheme or the sizes and aspect ratios of the predefined candidate anchor boxes. In contrast, this work designs a centerness-aware anchor-free tracker (CAT). First, the location and scale of the target object are predicted in an anchor-free fashion by decomposing tracking into parallel classification and regression problems. The proposed anchor-free design obviates the need for hyperparameters related to the anchor boxes, making CAT more generic and flexible. Second, the proposed centerness-aware classification branch separates the foreground from the background while predicting, for each location within the foreground, its normalized distance to the target center, i.e., the centerness. This branch improves the tracking accuracy and robustness significantly by suppressing low-quality state estimates. The experiments show that CAT achieves strong performance in a wide variety of tracking scenarios.
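As a rough illustration of the anchor-free formulation described in the abstract, the sketch below encodes a bounding box as four distances from a single location and assigns that location a centerness score. The function names and the (x1, y1, x2, y2) box convention are hypothetical. The abstract does not give the centerness formula, so the FCOS-style definition is used here as a stand-in and should not be read as the authors' exact method.

```python
import math

def encode_ltrb(px, py, box):
    """Distances from point (px, py) to the left, top, right, and bottom
    sides of box = (x1, y1, x2, y2). Assumed box convention."""
    x1, y1, x2, y2 = box
    return (px - x1, py - y1, x2 - px, y2 - py)

def decode_ltrb(px, py, ltrb):
    """Invert encode_ltrb: recover the box from a location and its distances."""
    l, t, r, b = ltrb
    return (px - l, py - t, px + r, py + b)

def centerness(ltrb):
    """FCOS-style centerness (an assumption, not taken from the abstract):
    1.0 at the box center, decaying toward 0 near the box sides,
    and 0.0 for locations outside the box (background)."""
    l, t, r, b = ltrb
    if min(l, t, r, b) <= 0:
        return 0.0
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

if __name__ == "__main__":
    box = (30, 20, 90, 70)
    ltrb = encode_ltrb(50, 40, box)
    print(ltrb, decode_ltrb(50, 40, ltrb), round(centerness(ltrb), 3))
```

Because each location regresses its own distances, no anchor sizes or aspect ratios need to be chosen in advance. A low centerness score can then be used to down-weight boxes predicted from locations near the target boundary.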
Keywords: anchor-free; centerness; convolutional neural network; visual object tracking
Year: 2022 PMID: 35009905 PMCID: PMC8749605 DOI: 10.3390/s22010354
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Flowchart of the proposed CAT tracker. The backbone Siamese network takes the exemplar image Z and the search image X as input and outputs the corresponding feature maps. To embed the features from the two branches, a depth-wise cross-correlation operation is employed to obtain the multi-channel response map, denoted as P. Then, to reduce the computation, a convolution layer is employed to fuse the response map. The fused response map with reduced dimension is denoted as R and serves as the input to the centerness-aware anchor-free network. For every spatial location on the regression map D, the regression branch learns to estimate the distance from that location to each side of the ground truth bounding box. For the classification map C, motivated by the observation that many low-quality predictions come from locations far from the target center, the centerness-aware classification branch learns to output 0 for the background and, for each spatial location within the foreground, a value between 0 and 1 that indicates its normalized distance to the target center, which suppresses low-quality predictions.
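The depth-wise cross-correlation mentioned in the caption can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: feature maps are nested lists of shape channels x height x width, the function name is hypothetical, and no padding or stride is applied. It is not the authors' implementation.

```python
def depthwise_xcorr(search, exemplar):
    """Slide each exemplar channel over the matching search channel.

    search:   C x Hs x Ws nested lists (search-image features)
    exemplar: C x Hk x Wk nested lists (exemplar features, used as kernels)
    returns:  C x (Hs-Hk+1) x (Ws-Wk+1) response map, one channel per input channel
    """
    if len(search) != len(exemplar):
        raise ValueError("search and exemplar must have the same channel count")
    response = []
    for s_ch, k_ch in zip(search, exemplar):
        hk, wk = len(k_ch), len(k_ch[0])
        ho, wo = len(s_ch) - hk + 1, len(s_ch[0]) - wk + 1
        channel = [
            [
                sum(s_ch[i + u][j + v] * k_ch[u][v]
                    for u in range(hk) for v in range(wk))
                for j in range(wo)
            ]
            for i in range(ho)
        ]
        response.append(channel)
    return response

if __name__ == "__main__":
    X = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
         [[1, 0, 1], [0, 1, 0], [1, 0, 1]]]
    Z = [[[1, 0], [0, 1]],
         [[1, 1], [1, 1]]]
    print(depthwise_xcorr(X, Z))
```

Unlike a plain cross-correlation that sums over channels into a single map, the depth-wise variant keeps one response channel per feature channel, which is why a subsequent fusing layer is used to reduce the dimension.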
Figure 2. Qualitative comparisons between the proposed tracker CAT and representative trackers SiamRPN [8], SiamRPN++ [17], and SiamCAR [11] on boat3 (first row), truck1 (second row), bike1 (third row), wakeboard5 (fourth row), and car8 (bottom row) sequences that involve large scale variations and aspect ratio variations. Compared to other trackers, even facing challenging scenarios including occlusion, large scale variations, and aspect ratio variations, CAT provides accurate state estimations that significantly improve the robustness and accuracy in tracking.
Comparisons of CAT and five representative state-of-the-art methods on UAV123. The best and the second best values are in bold and underlined, respectively. CAT obtains the best performance in terms of the area-under-curve (AUC) and the precision measures. ↑ means that a higher score is better, and ↓ denotes that a lower value is better.
| | SiamFC | SiamRPN | SiamMask | SiamRPN++ | SiamCAR | CAT |
|---|---|---|---|---|---|---|
| AUC ↑ | 0.485 | 0.557 | 0.603 | 0.610 | | |
| Prec. ↑ | 0.693 | 0.768 | 0.795 | 0.803 | | |
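For readers unfamiliar with the two UAV123 measures, the sketch below shows how they are commonly computed in OTB-style evaluation. The conventions are assumptions and are not stated in this record: the success rate counts frames whose IoU exceeds a threshold, AUC averages it over 21 evenly spaced thresholds in [0, 1], and precision counts frames whose center error is at most 20 pixels. The function names are hypothetical.

```python
def success_auc(ious, num_thresholds=21):
    """Area under the success plot: mean success rate over IoU thresholds.

    Assumes thresholds evenly spaced in [0, 1] and a strict 'IoU > threshold'
    success test (a common OTB-style convention)."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(1 for v in ious if v > t) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)

def precision_at(center_errors, pixel_threshold=20.0):
    """Fraction of frames whose predicted center lies within pixel_threshold
    pixels of the ground truth center (20 px is an assumed default)."""
    hits = sum(1 for e in center_errors if e <= pixel_threshold)
    return hits / len(center_errors)

if __name__ == "__main__":
    print(round(success_auc([0.0, 0.5, 1.0]), 4))
    print(round(precision_at([5.0, 25.0, 20.0]), 4))
```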
Comparisons of CAT with five representative state-of-the-art methods on VOT-2019. The best and the second best values are in bold and underlined, respectively. The EAO score measures the expected no-reset IOU between the estimated bounding box and the ground truth bounding box. The accuracy (A) denotes the mean IOU between the predicted bounding box and the ground truth bounding box in successful tracking intervals. The robustness (R) denotes the number of times that the target is lost per video sequence. ↑ means that a higher score is better, and ↓ denotes that a lower value is better.
| | SiamFC | SiamRPN | SiamMask | SiamRPN++ | SiamCAR | CAT |
|---|---|---|---|---|---|---|
| EAO ↑ | 0.189 | 0.272 | 0.287 | 0.285 | | |
| A ↑ | 0.510 | 0.582 | 0.592 | | | 0.583 |
| R ↓ | 0.958 | 0.527 | 0.461 | 0.482 | | |
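The accuracy measure above is a mean IoU, which can be sketched as follows. This is an illustrative helper, not the VOT toolkit: the (x1, y1, x2, y2) box convention and the function names are assumptions, and the caller is expected to pass only frames from successful tracking intervals.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(predictions, ground_truths):
    """Average IoU over paired predicted and ground truth boxes."""
    scores = [iou(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (5, 5, 15, 15)]
    print(round(mean_iou(preds, gts), 4))
```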
Comparisons between CAT and SiamCAR with respect to the number of parameters and speed. ↑ means that a higher score is better, and ↓ denotes that a lower value is better.
| | SiamCAR | CAT |
|---|---|---|
| Number of Parameters ↓ | 51,384,903 | 51,380,293 |
| Speed (Frames Per Second) ↑ | 54.62 | 57.83 |
Comparisons between CAT and three variants of CAT on UAV123. The best and the second best scores are in boldface and underlined, respectively. ↑ means that a higher score is better, and ↓ denotes that a lower value is better.
| | CAT_wo_cen | CAT_w_cen_div | CAT_wo_mod | CAT |
|---|---|---|---|---|
| AUC ↑ | 0.480 | | 0.595 | |
| Prec. ↑ | 0.646 | | 0.788 | |
Comparisons between CAT and three variants of CAT on VOT-2019. The best and the second best scores are in boldface and underlined, respectively. ↑ means that a higher score is better, and ↓ denotes that a lower value is better.
| | CAT_wo_cen | CAT_w_cen_div | CAT_wo_mod | CAT |
|---|---|---|---|---|
| EAO ↑ | 0.224 | | 0.266 | |
| A ↑ | 0.475 | | 0.580 | |
| R ↓ | 0.482 | | 0.547 | |