Zheng Xu, Haibo Luo, Bin Hui, Zheng Chang.
Abstract
Recently, we have been concerned with locating and tracking vehicles in aerial videos. Vehicles in aerial videos are usually small because the camera records from a long distance. However, most current methods use a fixed bounding-box region as the input to tracking. For target locating and tracking in our system, we instead detect the contour of the target, which helps improve tracking accuracy: a shape-adaptive template segmented by the object contour contains the most useful information and the least background for object tracking. In this paper, we propose a new way to start tracking by clicking on the target, and we implement the whole tracking process by modifying and combining a contour detection network and a fully convolutional Siamese tracking network. The experimental results show that our algorithm significantly improves tracking accuracy compared with the state of the art on vehicle images in both the OTB100 and DARPA datasets. We propose utilizing our method in real-time tracking and guidance systems.
Keywords: Siamese network; contour detection; deep learning; object tracking
Year: 2019 PMID: 30691156 PMCID: PMC6387134 DOI: 10.3390/s19030514
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
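The abstract describes a four-step pipeline: a user click initializes tracking, a contour detection network segments the target, a shape-adaptive template is cut out along the contour, and a fully convolutional Siamese network tracks it. The following is a minimal sketch of that flow; the function names and the square-mask stand-in for the contour network are illustrative assumptions, not the authors' code.

```python
import numpy as np

def detect_contour(frame, click_xy, half=8):
    """Stand-in for the contour detection network: return a boolean mask of
    the clicked object (a dummy square here; the paper predicts a real contour)."""
    mask = np.zeros(frame.shape[:2], dtype=bool)
    x, y = click_xy
    mask[max(y - half, 0):y + half, max(x - half, 0):x + half] = True
    return mask

def track_from_click(frames, click_xy, tracker_update):
    """Click-initialized tracking: contour -> shape-adaptive template -> track."""
    first = frames[0]
    mask = detect_contour(first, click_xy)
    # Shape-adaptive template: zero out everything outside the contour, so the
    # template keeps the target's shape and carries almost no background.
    template = np.where(mask[..., None], first, 0)
    # Each later frame is matched against the template by the Siamese tracker.
    return [tracker_update(template, frame) for frame in frames[1:]]
```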
Figure 1. Block diagram of the main steps.
Figure 2. Results of the bounding box proposal method and the object contour proposal method: (a) bounding box proposal; (b) object contour proposal.
Figure 3. An example of false tracking caused by target and background mixing: the green box indicates the ground truth and the purple box indicates the tracking result.
Table 1. Receptive field and stride sizes in our contour detection network (RF is short for receptive field, C for convolution, and P for pooling).
| Layer | C1_2 | P1 | C2_2 | P2 | C3_3 | P3 | C4_3 | P4 | C5_3 |
|---|---|---|---|---|---|---|---|---|---|
| RF size | 5 | 6 | 14 | 16 | 40 | 44 | 92 | 100 | 196 |
| Stride | 1 | 2 | 2 | 4 | 4 | 8 | 8 | 16 | 16 |
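These RF and stride values follow the standard recurrence for stacked convolution and pooling layers: each layer with kernel k grows the receptive field by (k - 1) times the accumulated stride, and multiplies the accumulated stride by its own stride. The sketch below reproduces the table under the assumption (inferred from the numbers, not stated in this record) of a VGG-16-style stack of 3x3 convolutions and 2x2 max pooling:

```python
def receptive_fields(layers):
    """Accumulate receptive field (RF) and stride through a conv/pool stack.
    Each layer is (name, kernel, stride); the standard recurrence is
        rf     <- rf + (kernel - 1) * accumulated_stride
        stride <- accumulated_stride * layer_stride
    """
    rf, stride, rows = 1, 1, []
    for name, kernel, layer_stride in layers:
        rf += (kernel - 1) * stride
        stride *= layer_stride
        rows.append((name, rf, stride))
    return rows

# Kernel sizes assumed from the table (VGG-16-style); "C1_2" means the second
# 3x3 convolution of block 1, so each block is expanded into single layers.
blocks = [("1", 2), ("2", 2), ("3", 3), ("4", 3), ("5", 3)]
layers = []
for block, n_convs in blocks:
    layers += [(f"C{block}_{i + 1}", 3, 1) for i in range(n_convs)]
    if block != "5":                      # no pooling after the last block
        layers.append((f"P{block}", 2, 2))

for name, rf, stride in receptive_fields(layers):
    print(f"{name}: RF={rf}, stride={stride}")  # C5_3 -> RF=196, stride=16
```

Fed with the AlexNet-style kernels implied by the Siamese table below (an 11x11 stride-4 convolution, then 2x2 pools and 3x3 convolutions), the same function also reproduces that table's RF and stride rows.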
Figure 4. Architecture of the proposed contour detection network.
Figure 5. Results of different methods on an image from the DARPA dataset.
Figure 6. Results of our contour detection method: (a) original image; (b) contour.
Figure 7. Sketch of our template extraction module: (a) original image; (b) contour; (c) mask; (d) target.
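The Figure 7 pipeline (image, contour, mask, target) can be sketched with OpenCV: binarize the contour network's output, fill the closed outline into a mask, and keep only the pixels inside it. The threshold and mask-filling details below are assumptions for illustration; the paper's exact procedure is not given in this record.

```python
import cv2
import numpy as np

def extract_template(image, contour_map, thresh=0.5):
    """Shape-adaptive template extraction in the spirit of Figure 7:
    contour probability map -> filled binary mask -> masked target patch."""
    # Binarize the per-pixel contour probabilities.
    edges = (contour_map >= thresh).astype(np.uint8)
    # Trace the closed outline(s) and fill the interior to get the object mask.
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    cv2.drawContours(mask, contours, -1, color=255, thickness=cv2.FILLED)
    # Keep only pixels inside the contour; background pixels become zero.
    target = cv2.bitwise_and(image, image, mask=mask)
    return mask, target
```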
Figure 8. Schematic diagram of our feature extraction network's structure: black boxes denote convolutional layers and red boxes denote max pooling layers.
Table 2. Receptive field and stride sizes in our Siamese network (RF is short for receptive field, C for convolution, and P for pooling).
| Layer | C1 | P1 | C2 | P2 | C3 | C4 | C5 |
|---|---|---|---|---|---|---|---|
| RF size | 11 | 15 | 31 | 39 | 71 | 103 | 135 |
| Stride | 4 | 8 | 8 | 16 | 16 | 16 | 16 |
Figure 9. Main framework of our fully convolutional Siamese network.
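At the core of a fully convolutional Siamese framework (as in SiamFC, which the paper modifies), both the template and the search region pass through the same embedding network, and the template embedding is cross-correlated over the search embedding to yield a similarity score map. A minimal PyTorch sketch, with a toy one-layer backbone standing in for the real feature extractor:

```python
import torch
import torch.nn.functional as F

def siamese_score_map(embed, template_img, search_img):
    """SiamFC-style scoring: embed both inputs with the same shared network,
    then cross-correlate the template features over the search features."""
    z = embed(template_img)   # template features, e.g. (1, C, 6, 6)
    x = embed(search_img)     # search features,   e.g. (1, C, 22, 22)
    # Using the template features as a convolution kernel performs dense
    # cross-correlation in a single conv2d call.
    return F.conv2d(x, z)     # similarity score map, e.g. (1, 1, 17, 17)

# Toy usage: a single conv layer stands in for the real backbone, which is the
# modified AlexNet-style network summarized in the RF/stride table above.
embed = torch.nn.Conv2d(3, 64, kernel_size=3)
score = siamese_score_map(embed, torch.rand(1, 3, 8, 8), torch.rand(1, 3, 24, 24))
print(score.shape)  # torch.Size([1, 1, 17, 17])
```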
Figure 10. Results of the top 10 trackers on OTB100 vehicle videos: (a) distance precision based on one-pass evaluation (OPE); (b) success rate based on OPE.
Figure 11. Results on images from the DARPA dataset: (a–e) are five groups of comparisons between the tracking results of SiamFC and our method. In each group, the left images are results from SiamFC and the right images are results from our method. For each method, the image at the bottom is the tracking result in the original frame, the image at the top left corner is a partially enlarged detail of the tracking result in the current frame, and the image at the top right corner is the corresponding score map denoting similarity. All green boxes denote the ground truth and all red boxes denote the tracking results.
Figure 12. Statistical results of SiamFC and our method on the DARPA VIVID dataset: (a–e) are center errors in five different video sequences; (f) is the average center error.
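The metrics behind Figures 10 and 12 are the standard OTB-style ones: the center error is the Euclidean distance between the predicted and ground-truth box centers, and distance precision is the fraction of frames whose center error falls within a pixel threshold (conventionally 20 px). A small sketch, assuming boxes are given as (x, y, w, h):

```python
import numpy as np

def center_errors(pred_boxes, gt_boxes):
    """Per-frame center location error between predicted and ground-truth
    boxes, each an (N, 4) array of (x, y, w, h) rows."""
    pred = np.asarray(pred_boxes, dtype=float)
    gt = np.asarray(gt_boxes, dtype=float)
    pred_centers = pred[:, :2] + pred[:, 2:] / 2.0
    gt_centers = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_centers - gt_centers, axis=1)

def distance_precision(errors, threshold=20.0):
    """OPE distance precision: fraction of frames with center error within
    the threshold (20 px is the conventional OTB setting)."""
    return float((np.asarray(errors) <= threshold).mean())
```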