| Literature DB >> 30875917 |
Yi Zou, Weiwei Zhang, Wendi Weng, Zhengyun Meng.
Abstract
Online multi-object tracking (MOT) has broad applications in time-critical video analysis scenarios such as advanced driver-assistance systems (ADASs) and autonomous driving. In this paper, the proposed system aims at tracking multiple vehicles in the front view of an onboard monocular camera. The vehicle detection probes are customized to generate high precision detection, which plays a basic role in the following tracking-by-detection method. A novel Siamese network with a spatial pyramid pooling (SPP) layer is applied to calculate pairwise appearance similarity. The motion model captured from the refined bounding box provides the relative movements and aspects. The online-learned policy treats each tracking period as a Markov decision process (MDP) to maintain long-term, robust tracking. The proposed method is validated in a moving vehicle with an onboard NVIDIA Jetson TX2 and returns real-time speeds. Compared with other methods on KITTI and self-collected datasets, our method achieves significant performance in terms of the "Mostly-tracked", "Fragmentation", and "ID switch" variables.Entities:
Keywords: Markov decision process; Siamese network; data association; multi-vehicle tracking; tracking-by-detection
Year: 2019 PMID: 30875917 PMCID: PMC6471168 DOI: 10.3390/s19061309
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1The overview of the proposed multiple vehicle tracking system. Discriminative appearance similarity and motion model are implemented to perform pairwise associations and Markov decision processes (MDPs) to define the real-time state.
Figure 2Central-surround two-channel spatial pyramid pooling network (CSTCSPP). This network uses the Siamese-type architecture to extract shallow features with different resolutions and then calculates pairwise similarity. A spatial pyramid pooling layer embedded before the top decision network allows patches to be free of size limitations. All convolution layers are followed by Rectified Linear Units (ReLU), which could increase the nonlinear relation between each layer of the neural network.
Details of each branch network.
| Layer | Type | Kernel Size | Stride |
|---|---|---|---|
|
| Raw data | ||
|
| Convolution | 7 × 7 | 2 |
|
| Max pooling | 3 × 3 | 2 |
|
| Convolution | 5 × 5 | 2 |
|
| Max pooling | 3 × 3 | 2 |
|
| Convolution | 3 × 3 | 1 |
|
| FC |
Figure 3Online multi-vehicle tracking problem formulated as decision-making in MDP. The upper-right framework represents the transition map of four categorized states at each time step. Each target is initialized with a unique MDP to manage their lifetimes, depicted in different colors.
Figure 4(a) NVIDIA Jetson TX2 with 256 GPU cores; (b) Comprehensive tests are validated in the moving vehicle in different scenes (e.g., highway).
Comparative results under different traffic scenes.
| Detector | Evaluation of Detection (AP) | Tracker | Evaluation of Tracking (MOTA) | ||||
|---|---|---|---|---|---|---|---|
| Campus | Urban | Highway | Campus | Urban | Highway | ||
| SSD | 65.25% | 60.16% | 68.84% | Proposed | 70.64% | 72.62% | 74.32% |
| YOLOv3 | 63.55% | 62.99% | 70.19% | Proposed | 74.65% |
| 77.98% |
| Detection probes |
|
|
| Proposed |
| 76.06% |
|
Figure 5Comprehensive analyses of the proposed framework. (a) The contribution of each components in two typical scenes respectively; (b) The tracking accuracy in different distance and the threshold selection depends on the image size.
Comparison of our proposed methods with five state-of-the-art methods on KITTI.
| Method | MOTA ↑ | MOTP ↑ | FRAG ↓ | IDS ↓ | MT ↑ | ML ↓ |
|---|---|---|---|---|---|---|
| Proposed | 76.53% | 81.19% |
| 11 |
| 9.92% |
| SSP [ | 57.85% | 77.65% | 704 |
| 29.38% | 24.31% |
| RMOT [ | 65.83% | 75.42% | 727 | 209 | 40.15% | 9.69% |
| MDP [ | 69.35% | 82.10% | 387 | 130 | 52.15% | 13.38% |
| ExtraCK [ | 79.99% | 82.46% | 938 | 342 | 62.15% | 5.54% |
| MOTBeyondPixels [ |
|
| 944 | 468 | 73.23% |
|
Figure 6Exemplary output under four typical traffic scenes.