Abstract
Multiple-object tracking is affected by various sources of distortion, such as occlusion, illumination variations and motion changes. Tracking on RGB frames alone has limited ability to overcome failures such as shifting, because the RGB frames themselves are subject to these material distortions. To overcome them, we propose a multiple-object fusion tracker (MOFT), which uses a combination of 3D point clouds and the corresponding RGB frames. MOFT uses a matching function, initialized on large-scale external sequences, to determine which candidates in the current frame match the target object from the previous frame. After tracking over a few frames, the initialized matching function is fine-tuned according to the appearance models of the target objects; this fine-tuning is organized in a structured form with diverse matching-function branches. In general multiple-object tracking situations, the scale of a target varies with its distance from the sensors. If target objects at different scales are all represented with the same strategy, information is lost in the representation of some of them. In this paper, the output map of a convolutional layer from a pre-trained convolutional neural network is therefore used to represent instances adaptively, without such information loss. In addition, MOFT fuses the tracking results of each modality at the decision level using basic belief assignment, rather than fusing modalities by selectively using the features of each modality, so that the tracking failures of one modality are compensated by the other. Experimental results indicate that the proposed tracker provides state-of-the-art performance on the multiple object tracking (MOT) and KITTI benchmarks.
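The decision-level fusion described above relies on basic belief assignment. A minimal sketch of that idea, using Dempster's rule of combination over a two-hypothesis frame ({match, no_match}), is shown below; the frame of discernment, the numeric masses and the function names are illustrative assumptions, not details taken from the paper.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions given as
    {frozenset_of_hypotheses: mass} over a common frame of discernment."""
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # mass assigned to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("Total conflict: sources are contradictory")
    norm = 1.0 - conflict
    return {h: m / norm for h, m in combined.items()}

# Example: the RGB tracker is fairly sure the candidate matches the target,
# the point-cloud tracker weakly disagrees; all values are made up.
FRAME = frozenset({"match", "no_match"})  # full frame = ignorance
m_rgb = {frozenset({"match"}): 0.7, frozenset({"no_match"}): 0.1, FRAME: 0.2}
m_pc  = {frozenset({"match"}): 0.3, frozenset({"no_match"}): 0.4, FRAME: 0.3}

fused = dempster_combine(m_rgb, m_pc)
for hyp, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(set(hyp), round(mass, 3))
```

Here a failure in one modality (low, uncommitted masses) leaves the fused decision dominated by the other modality, which is the compensation effect the abstract describes.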
Keywords: CCD; LIDAR; deep learning; multiple objects tracking; multiple sensor fusion
Year: 2017 PMID: 28420194 PMCID: PMC5424760 DOI: 10.3390/s17040883
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The overall architecture of MOFT. Blue box: representation (Section 4.1 and Section 4.2). Green box: matching between target objects and candidates (Section 4.3). Yellow box: structured fine-tuning (Section 6). Red box: fusion of tracking results (Section 5).
Figure 2. Architecture for representing instances.
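Figure 2's scale-adaptive representation can be sketched as follows: small instances are read from an earlier, higher-resolution VGG-16 conv map, large instances from a deeper map, and ROI alignment produces a fixed-size descriptor either way. The block cut points, size thresholds and 7×7 output size below are assumptions for illustration; the paper defines its own selection rule.

```python
import torch
import torchvision

# Pre-trained weights can be requested with weights="IMAGENET1K_V1".
vgg = torchvision.models.vgg16(weights=None).features.eval()

# ReLU indices that close each conv block in torchvision's vgg16().features.
BLOCK_ENDS = {"conv3": 15, "conv4": 22, "conv5": 29}

def pick_block(box):
    """Choose a conv block from the bounding-box (x1, y1, x2, y2) area.
    Thresholds are illustrative assumptions."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 * 32:
        return "conv3"   # small instance: keep spatial detail
    if area < 96 * 96:
        return "conv4"
    return "conv5"       # large instance: deeper semantics

@torch.no_grad()
def represent_instance(frame, box):
    """frame: (1, 3, H, W) float tensor; box: (x1, y1, x2, y2) in pixels."""
    cut = BLOCK_ENDS[pick_block(box)]
    fmap = vgg[: cut + 1](frame)
    stride = frame.shape[-1] / fmap.shape[-1]   # feature-map stride
    rois = torch.tensor([[0.0, *box]], dtype=torch.float32)
    # Crop a fixed 7x7 descriptor so instances at all scales are comparable.
    return torchvision.ops.roi_align(fmap, rois, output_size=7,
                                     spatial_scale=1.0 / stride)

frame = torch.randn(1, 3, 224, 224)
desc = represent_instance(frame, (40, 50, 90, 120))  # mid-sized box -> conv4
print(desc.shape)  # torch.Size([1, 512, 7, 7])
```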
Figure 3. Proposed matching network to learn the matching function. The numbers in brackets on the convolutional layers are the kernel size, number of outputs and stride size, from the top. The two inputs are the representations of the i-th target object in frame k and the j-th candidate in frame k+1.
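A minimal Siamese sketch of the matching function in Figure 3: both instance descriptors pass through a shared embedding, and a small head scores the pair. The layer sizes and the concatenation head are assumptions; the bracketed kernel/output/stride settings from the figure are not reproduced here.

```python
import torch
import torch.nn as nn

class MatchingNet(nn.Module):
    def __init__(self, in_ch=512):
        super().__init__()
        # Shared embedding branch applied to both 7x7 instance descriptors.
        self.embed = nn.Sequential(
            nn.Conv2d(in_ch, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Head scoring how likely the pair is the same object.
        self.head = nn.Sequential(
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, target, candidate):
        z = torch.cat([self.embed(target), self.embed(candidate)], dim=1)
        return self.head(z)  # in (0, 1): probability of a match

net = MatchingNet()
t = torch.randn(1, 512, 7, 7)  # i-th target descriptor from frame k
c = torch.randn(1, 512, 7, 7)  # j-th candidate descriptor from frame k+1
print(float(net(t, c)))
```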
Figure 4. The concept of the structured fine-tuning of target appearance models.
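The structured fine-tuning of Figure 4 can be sketched as a shared trunk with one lightweight branch per target, where online updates touch only the branch of the target being adapted. The branch layout, the frozen trunk and the BCE objective are assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn as nn

class StructuredMatcher(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.branches = nn.ModuleDict()  # one appearance head per target id

    def add_target(self, tid):
        self.branches[str(tid)] = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, tid, pair_feat):
        return self.branches[str(tid)](self.trunk(pair_feat))

def finetune_branch(model, tid, pairs, labels, steps=20):
    """Online update: optimize only target tid's branch on recent pairs."""
    for p in model.trunk.parameters():
        p.requires_grad_(False)  # keep the shared trunk fixed online
    opt = torch.optim.SGD(model.branches[str(tid)].parameters(), lr=1e-2)
    loss_fn = nn.BCELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(tid, pairs).squeeze(1), labels)
        loss.backward()
        opt.step()

m = StructuredMatcher()
m.add_target(0)
pairs = torch.randn(8, 256)             # pooled (target, candidate) pair features
labels = torch.randint(0, 2, (8,)).float()
finetune_branch(m, 0, pairs, labels)    # adapts target 0 without touching others
```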
Comparison models used to evaluate the proposed MOFT. Depth (PC) and Depth (stereo) denote depth frames extracted from 3D point clouds and from stereo vision, respectively. “Init.” indicates the initialized target. In the “Update” column, w and w/o indicate that the proposed fine-tuning method was or was not used, respectively.
| Tracker | Matching Target | Representation | Representation Usage | Modality | Update |
|---|---|---|---|---|---|
|  |  | VGG-16 | Adaptively | RGB + Depth (PC) | w |
|  | Init. | VGG-16 | Adaptively | RGB + Depth (PC) | w |
|  |  | AlexNet | Adaptively | RGB + Depth (PC) | w |
|  |  | VGG-16 |  | RGB + Depth (PC) | w |
|  |  | VGG-16 |  | RGB + Depth (PC) | w |
|  |  | VGG-16 |  | RGB + Depth (PC) | w |
|  |  | VGG-16 |  | RGB + Depth (PC) | w |
|  |  | VGG-16 | Adaptively | RGB | w |
|  |  | VGG-16 | Adaptively | Depth (PC) | w |
|  |  | VGG-16 | Adaptively | Depth (stereo) | w |
|  |  | VGG-16 | Adaptively | RGB + Depth (PC) | w/o |
Comparison of the proposed MOFT with design-varied trackers on the testing sequences. The best and second-best scores are boldfaced and underlined, respectively. Arrows indicate the direction of better performance: multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), mostly tracked targets (MT), mostly lost targets (ML) and number of ID switches (IDS).
| Tracker | MOTA↑ | MOTP↑ | MT↑ | ML↓ | IDS↓ |
|---|---|---|---|---|---|
|  | 60.22% |  | 23.24% | 30.44% | 27 |
|  | 62.42% | 69.25% | 24.58% | 31.57% | 20 |
|  | 59.14% | 74.22% | 24.97% | 29.12% | 22 |
|  | 59.46% | 73.39% | 26.14% | 27.41% | 29 |
|  | 48.49% | 61.44% | 18.62% | 32.01% |  |
|  |  | 77.61% | 27.55% | 24.88% | 29 |
|  | 61.51% | 63.20% | 26.04% | 33.63% | 31 |
|  | 60.55% | 68.24% | 26.88% | 34.58% | 30 |
|  | 60.48% | 66.91% | 27.43% | 28.29% | 26 |
|  | 63.47% | 77.79% |  |  | 14 |
Comparison of performance according to the training and testing datasets. M and K denote the MOT15/16 and KITTI benchmarks, respectively. A→B indicates that the matching network was trained on dataset A and tested on dataset B. Arrows indicate the direction of better performance: multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), mostly tracked targets (MT), mostly lost targets (ML) and number of ID switches (IDS).
| Tracker | MOTA↑ | MOTP↑ | MT↑ | ML↓ | IDS↓ |
|---|---|---|---|---|---|
|  | 61.51% | 63.20% | 26.04% | 33.63% | 31 |
|  | 46.88% | 77.24% | 18.92% | 46.54% | 41 |
|  | 60.11% | 61.09% | 22.23% | 33.98% | 33 |
|  | 45.92% | 77.16% | 17.99% | 45.98% | 43 |
Comparison of the proposed MOFT with previous trackers on the MOT16 [49] benchmark dataset. Boldfaced and underlined scores indicate the best and second-best results, respectively. Arrows indicate the direction of better performance: multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), mostly tracked targets (MT) and mostly lost targets (ML).
| Tracker | MOTA↑ | MOTP↑ | MT↑ | ML↓ |
|---|---|---|---|---|
| TBD | 33.7% |  | 7.2% | 54.2% |
| LTTSC-CRF | 37.6% | 75.9% | 9.6% | 55.2% |
| OVBT | 38.4% | 75.4% | 7.5% | 47.3% |
| EAMTT-pub | 38.8% | 75.1% | 7.9% | 49.1% |
| LINF1 | 41.0% | 74.8% | 11.6% | 51.3% |
| MHT-DAM | 42.9% | 76.6% | 13.6% | 46.9% |
| oICF | 43.2% | 74.3% | 11.3% | 48.5% |
| JMC | 46.3% | 75.7% | 15.5% |  |
| NOMT |  | 76.6% |  |  |
| ours | 45.34% |  |  |  |
Comparison of the proposed MOFT with previous trackers on the KITTI benchmark dataset [59]. This evaluation was performed on the car category. Boldfaced and underlined scores indicate the best and second-best results, respectively. Arrows indicate the direction of better performance: multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), mostly tracked targets (MT) and mostly lost targets (ML).
| Tracker | MOTA↑ | MOTP↑ | MT↑ | ML↓ |
|---|---|---|---|---|
| SCEA | 51.30% |  | 26.22% | 26.22% |
| TBD | 49.52% | 78.35% | 20.27% | 32.16% |
| NOMT |  | 78.17% |  |  |
| CEM | 44.31% | 77.11% | 19.51% | 31.40% |
| DCO | 28.72% | 74.36% | 15.24% | 30.79% |
| mbodSSP | 48.00% | 77.52% | 22.10% | 27.44% |
| HM | 41.47% | 78.34% | 11.59% | 39.33% |
| DP-MCF | 35.72% | 78.41% | 16.92% | 35.67% |
| MCMOT-CPD | 72.11% | 82.13% | 52.13% | 11.43% |
| ours |  |  |  |  |
Comparison of the proposed MOFT with previous trackers on the KITTI benchmark dataset [59]. This evaluation was performed on the pedestrian category. Boldfaced and underlined scores indicate the best and second-best results, respectively. Arrows indicate the direction of better performance: multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), mostly tracked targets (MT) and mostly lost targets (ML).
| Tracker | MOTA↑ | MOTP↑ | MT↑ | ML↓ |
|---|---|---|---|---|
| SCEA | 26.02% | 68.45% | 9.62% | 47.08% |
| NOMT-HM | 17.26% | 67.99% | 14.09% | 50.52% |
| NOMT | 25.55% | 67.75% | 17.53% | 42.61% |
| CEM | 18.18% | 68.48% | 8.93% | 51.89% |
| RMOT | 25.47% | 68.06% | 13.06% | 47.42% |
| MCMOT-CPD |  |  |  |  |
| ours |  |  |  |  |
Figure 5. Comparison of tracked targets on: (a) RGB frames; (b) depth frames extracted from 3D point clouds; and (c) MOFT. Yellow boxes: correctly tracked objects; red boxes: shifted objects; blue boxes: missed objects.
Figure 6. Failure cases of MOFT. Green boxes: ground truth; red boxes: shifted objects; blue boxes: missed objects.