Literature DB >> 34960574

Efficient Online Object Tracking Scheme for Challenging Scenarios.

Khizer Mehmood¹, Ahmad Ali², Abdul Jalil¹, Baber Khan¹, Khalid Mehmood Cheema³, Maria Murad¹, Ahmad H Milyani⁴.

Abstract

Visual object tracking (VOT) is a vital part of various domains of computer vision applications such as surveillance, unmanned aerial vehicles (UAV), and medical diagnostics. In recent years, substantial improvement has been made to solve various challenges of VOT techniques such as change of scale, occlusions, motion blur, and illumination variations. This paper proposes a tracking algorithm in a spatiotemporal context (STC) framework. To overcome the limitations of STC based on scale variation, a max-pooling-based scale scheme is incorporated by maximizing over posterior probability. To avert target model from drift, an efficient mechanism is proposed for occlusion handling. Occlusion is detected from average peak to correlation energy (APCE)-based mechanism of response map between consecutive frames. On successful occlusion detection, a fractional-gain Kalman filter is incorporated for handling the occlusion. An additional extension to the model includes APCE criteria to adapt the target model in motion blur and other factors. Extensive evaluation indicates that the proposed algorithm achieves significant results against various tracking methods.

Entities: Chemical

Keywords: APCE; fractional-gain Kalman filter; image processing; object tracking

Mesh：

Year: 2021 PMID： 34960574 PMCID： PMC8706150 DOI： 10.3390/s21248481

Source DB: PubMed Journal: Sensors (Basel) ISSN： 1424-8220 Impact factor: 3.576

1. Introduction

Visual object tracking (VOT) is an essential task in a variety of computer vision applications such as video surveillance [1,2,3], automobile [4], human–computer interaction [5], cinematography [6], sensor network [7], motion analysis [8], robotics [9,10,11], anti-aircraft system [12], autonomous vehicles [13], and traffic monitoring [14]. As presented in Figure 1, VOT remains a challenging issue due to motion blur, occlusion, fast motion, among others [15,16,17,18,19].

Figure 1

Challenging scenarios in visual object tracking (VOT). The first row shows motion blur in an image sequence. The second row shows the scale variation of the target. The third row shows heavy occlusion of the target. Pictures in the figure are part of OTB-100 dataset [26].

Tracking methods can be categorized as generative and discriminative. In generative tracking methods, the computation cost is high, and they are adaptable with environmental factors due to which these tracking methods might fail in background clutter situations [20,21,22]. Discriminative tracking methods perform better in clutter background situations since they treat these as a binary classification problem. However, they are slow, making them unsuitable for real-time applications [23,24,25].

1.1. Related Work

The STC tracker [27] has been widely used in recent years due to its computational efficiency. STC integrates spatial context information around the target of interest and considers prior information of previous frames for computing the extreme-of-confidence map by using Fourier transform. Die et al. [28] combined a correlation filter (CF) and STC. They extracted HOG (histogram of oriented gradients), (CN) color naming, and gray features for learning-correlation filters. Then, the response of CF and STC is fused. Yang et al. [29] proposed an improved tracking method by incorporating peak to sidelobe ratio (PSR)-based occlusion detection mechanism and model update scheme in the STC framework. Zhang et al. [30] proposed a tracking method by incorporating HOG, CN features, and an average difference of frames-based adaptive learning rate mechanism in the spatiotemporal context framework. Zhang et al. [31] suggested a tracking method by incorporating a selection update mechanism in the spatiotemporal context framework. Song et al. [32] anticipated an improved STC-based tracking method by combining a scale filter and loss function criteria for better performance in UAV applications. During the past decade, significant progress has been made to develop accurate scale estimation in VOT [33,34,35,36,37,38]. Danelljan et al. [39] proposed a tracking-by-detection framework by learning filters for translation and scale estimation based on pyramid representation. Li et al. [40] incorporated an adaptive scale scheme in a kernelized correlation filter (KCF) tracker using HOG and CN features. Bibi et al. [41] modify the KCF tracker by maximizing posterior distribution over the scales grid and updating the filter by fixed point optimization. Lu et al. [42] combined KCF and Fourier–Mellin transform to deal with rotation and scale variation of the target. Yin et al. [43] modified the scale adaptive with multiple features (SAMF) tracker by using APCE-based rate of change between consecutive frames to control scale size. Ma et al. [44] incorporated APCE in discriminative correlation filters to address fixed template size. A Kalman filter is used in various tracking algorithms for occlusion handling [45,46,47,48,49]. Kaur et al. [50] suggested a real-time tracking approach using a fractional-gain Kalman filter for nonlinear systems. Soleh et al. [51] proposed the Hungarian Kalman filter (HKF) for multiple target tracking. Farahi et al. [52] proposed a probabilistic Kalman filter (PKF) by incorporating an extra stage for estimating target position by applying the Viterbi algorithm to a probabilistic graph. Gunjal et al. [53] proposed a Kalman filter-based tracking algorithm for moving targets under surveillance applications. Ali et al. [54] address issues in VOT such as fast maneuvering of the target, occlusions, and deformation by combining Kalman filter, CF, and adaptive mean shift in the heuristic framework. Kaur et al. [55] proposed a modified fractional-gain-based Kalman filter for vehicle tracking by incorporating a fractional feedback loop and cost function minimization. Zhou et al. [56] address issues in VOT such as occlusions, motion blur, and clutter background by incorporating a Kalman filter in a compressive tracking framework. By summarizing the current methods, it can be perceived that significant work has been done to develop a robust tracking algorithm by incorporating scale update schemes, model update mechanisms, occlusion detection, and handling techniques in different tracking frameworks. The STC algorithm proposed in [27] uses FFT for detection and context information for a model update. However, it cannot effectively deal with occlusions, scale variations, and motion blur.

1.2. Our Contributions

To address the limitations of the STC, this paper proposes a robust tracking algorithm suitable for various image processing applications, such as surveillance and autonomous vehicles. The contributions can be listed concisely as follows. We introduce novel criteria for detecting occlusion by utilizing APCE, model update rules, and previous history of the modified response map to prevent the tracking model from wrong updates. We introduce an effective occlusion handling mechanism by incorporating a modified feedback-based fractional-gain Kalman filter in the spatiotemporal context framework to track an object’s motion. We incorporate a max-pooling-based scale scheme by maximizing over posterior probability in the STC framework’s detection stage. We applied a combination of STC and max-pooling to attain higher accuracy. We introduce an APCE-based adaptive learning rate mechanism that utilizes information of current frame and previous history to reduce error accumulation and correctly updates from the wrong appearance of the target. Extensive performance analysis of the proposed tracker is carried out on standard benchmark videos in comparison with STC [27], KCF_MTSA [41], MACF [57], MOSSECA [58], and Modified KCF [59].

1.3. Organization

The organization of this paper follows: brief explanations of STC and fractional calculus are provided in Section 2. In Section 3, the tracking modules of the proposed tracker are explained. Section 4 includes performance analysis. Discussion is given in Section 5, while Section 6 concludes the paper.

2. Review of STC and Fractional Calculus

2.1. STC Tracking

The STC tracking algorithm formulates the relation between the target of interest and its context in the Bayesian framework. The feature set Ωc(x*)} and spatial relation between target context is presented in Figure 2.

Figure 2

The spatial relation between object and its context. Picture in the figure is part of OTB-100 dataset [26].

The confidence map is given as follows: is the prior context model and is the spatial context model. The confidence map function is given in (2): where v is the normalization constant, Þ is a parameter for shape, and is a parameter for scale. The spatial context uses the intensity of the image and weighted Gaussian function given in (3) and (4): Equation (5) describes the spatial context model: Explaining for the spatial context: Fast Fourier transform (FFT) can be calculated as follows: The solution of (8) follows: As presented in (10), can be obtained by computing the extreme-of-confidence map: The confidence map can be considered from (11): Spatiotemporal context is updated on learning rate , as given in (12):

2.2. Fractional Calculus

In this work, the Grünwald–Letnikov definition [60] is used for calculating fractional difference defined in (13): where is fractional order, is the sampling interval, is the number of samples of given signal , and is obtained using (14):

3. Proposed Solution

In this section, tracking modules are elaborated. First, the max-pooling-based scale mechanism is presented. Second, the APCE-based occlusion detection mechanism is discussed. Third, the fractional-gain Kalman filter-based mechanism for occlusion handling is examined. Fourth, an APCE-based modified learning rate mechanism is explained. The flowchart of the proposed tracker is displayed in Figure 3.

Figure 3

Flowchart of proposed tracking method.

As presented in Figure 3, for each sequence, the ground truth of the target is manually initialized in the first frame. Afterward, the confidence map of the target is calculated. Then, by maximizing the posterior probability, the scale of the target is estimated. Then APCE of the response map is calculated along with the difference of APCE between consecutive frames. Based on occlusion criteria, the fractional-gain Kalman filter activates and predicts the location of the target. Afterward, the learning rate of the tracking model is updated by utilizing the current target position and previous history of APCE values.

3.1. Scale Integration Scheme

One limitation of STC is the inability to rapidly change the scale. During the detection phase of STC, we applied max-pooling over multiple scales by maximizing the posterior probability, as given in (15): where represents ith scale and is the maximum detection likelihood response at ith scale. The prior term is the Gaussian distribution whose standard deviation is set through experimentation. It allows for a smooth transition between frames, given that the target scale does not vary much between frames.

3.2. Occlusion Detection Mechanism

The performance of any tracking algorithm is affected by various factors, of which the most common is occlusion. It is essential to create a mechanism for the detection of occlusion. In the present work, an occlusion feedback mechanism is presented, which detects occlusion and updates the target model by evaluating the tracking status of each frame. Average peak to correlation energy (APCE) [61] determines tracker effectiveness. The value of APCE changes according to the target occlusion state. Small values of APCE specify tracking failure or target occlusion. It is given in (16): where and are maximum and minimum response values, respectively, and gives indices of the response map. The occlusion detection criteria are built as given in (17) and (18): where and are the APCE values at t and (t − 1) frames, respectively, is the difference of the APCE between two sequential frames, and is the threshold value acquired by performing multiple experiments. Rules of occlusion and model update follow: When or, it indicates that the target is coming out of the shelter, and both the tracking and model updates are based on STC. When and, it indicates that the target is in the occlusion state and tracking is based on the fractional-gain Kalman filter. The tracking model is also updated based on the Kalman filter prediction. When or, it indicates that the target occludes, and both the tracking and model update are based on STC. When or, it indicates that the target tracking is good and that both the tracking and model update are based on STC. As seen in Figure 4a, without occlusion, both APCE and are high; therefore, no occlusion occurs. However, when both APCE and give low values, as shown in Figure 4b, case occlusion occurs and the occlusion handling mechanism is activated. By using this mechanism, proposed tracking achieved significant results for the occlusion challenge.

Figure 4

Occlusion detection mechanism. Pictures in the figure are part of OTB-100 dataset [26].

3.3. Fractional-Gain Kalman Filter

The Kalman filter is widely used in the research area of VOT. A modified discrete time linear system can be characterized by Equations (19) and (20): where is the state vector, is system output, is system input, and is output noise. , and are transition, control, and measurement matrices, respectively. The innovation equation is the difference between the estimated output and actual output defined in (21): where is the priori state. The estimation of the next state with a modified gain is given in (22) and (23): where is the fractional derivative of previous Kalman gain. Priori error between actual and estimated state and its covariance can be given in (24) and (25): Posteriori error between actual and estimated state and its covariance can be given, as in (26) and (27): Kalman gain is calculated by minimizing posteriori error covariance as given in (28): Finding the value of in (29): can be written as in (30): The modified Kalman gain consists of two terms. The first term represents the Kalman filter’s gain, and the second represents the mean of the fractional difference of previous gains. The makes the mean value nominal.

3.4. Adaptive Learning Rate

The motion of the target is changed in each frame during tracking. It is, therefore, necessary to update the target model adaptively rather than on a fixed learning rate. We used an APCE-based degree indicator to better cope with environmental changes occurring during tracking to make it adaptive. In the present work, we used maxima of historical APCE values to normalize APCE, since the APCE value is very high. The degree indicator is defined in (31): where is the start index frame. The value of the learning rate is adjusted as in (32): where is the threshold value acquired by performing multiple experiments. Figure 5a shows that, without both motion and blur, APCE and are high; therefore, the learning rate of tracking should be fast. However, when motion blur occurs, both APCE and give low values, as shown in Figure 5b. Thus, in that case, the model should be updated slowly due to the appearance change of the target. By using this mechanism, the proposed tracking achieved significant results for the motion blur challenge. The tracker is given in Algorithm 1.

Figure 5

Learning rate mechanism. Pictures in the figure are part of OTB-100 dataset [26].

4. Performance Analysis

Comprehensive assessments were conducted on videos taken from the OTB 2015 [26] dataset for the proposed tracking method’s quantitative and quantitative evaluation. These sequences include scale variations, motion blur, and fast motion challenges.

4.1. Evaluation Criteria

The proposed algorithm is compared with tracking methods on two evaluation criteria: distance precision rate (DPR) and center location error (CLE). The calculation formula for CLE is mentioned in (33):

4.2. Quantitative Analysis

DPR evaluation is presented in Table 1. In videos Blurcar1, Car2, Human7, Jogging1, and Jogging2, the proposed algorithm outperforms Modified KCF, MOSSECA, MACF, KCF_MTSA, and STC. For the sequences Blurcar3, Blurcar4, Boy, Dancer2, and Suv, the proposed tracker has marginally less precision value. Overall, the proposed algorithm has a higher mean value than the other algorithms.

Table 1

Distance precision rate.

Sequence	Proposed	Modified KCF [59]	STC [27]	MACF [57]	MOSSE_CA [58]	KCF_MTSA [41]
Blurcar1	0.978	0.858	0.024	0.698	0.999	0.999
Blurcar3	0.896	0.829	0.406	1	1	1
Blurcar4	0.876	0.987	0.113	0.944	1	1
Boy	0.973	0.64	0.761	1	1	1
Car2	0.988	1	1	1	0.993	1
Dancer2	0.993	1	1	1	1	1
Human7	0.904	0.76	0.332	0.636	0.824	0.448
Jogging1	0.973	0.993	0.228	0.231	0.231	0.964
Jogging2	0.866	0.945	0.186	0.166	1	0.189
Suv	0.778	0.978	0.805	0.978	0.976	0.98
Mean Precision	0.923	0.899	0.486	0.765	0.902	0.858

Average center location error evaluation is presented in Table 2. In the videos Blurcar1, Car2, Dancer2, Jogging1, and Human7, the proposed algorithm outperforms Modified KCF, MOSSECA, MACF, KCF_MTSA, and STC. For the videos Blurcar3, Blurcar4, Boy, Jogging2, and Suv, the proposed algorithm has marginally high error values. Overall, the proposed algorithm has the lowest mean error compared to the other algorithms.

Table 2

Average center location error.

Sequence	Proposed	Modified KCF [59]	STC [27]	MACF [57]	MOSSE_CA [58]	KCF_MTSA [41]
Blurcar1	4.86	16.05	1.31 × 10⁶	85.16	6.34	6.01
Blurcar3	9.12	14.46	71.37	3.69	2.98	3.7
Blurcar4	15.01	11.19	2.61 × 10³	8.04	10.15	7.15
Boy	8.09	50.34	27.4	2.65	2.31	2.91
Car2	2.68	3.96	12.43	1.55	5.39	2.13
Dancer2	6.82	6.41	15.3	6.48	5.8	6.68
Human7	7.59	16.74	42.98	19.62	12.14	36.63
Jogging1	8.39	3.72	5010	94.97	115.98	4.27
Jogging2	14.2	4.74	104.02	147.77	3.47	136.4
Suv	15.36	3.65	48	3.34	3.73	3.71
Mean Error	9.212	13.126	1.3 × 10⁶	37.327	16.829	20.959

The precision and error plots are presented in Figure 6 and Figure 7, respectively. These plots provide a frame-by-frame comparison in entire image sequences. Since precision and location error gives the mean of the entire sequence, it is possible that the algorithm loses the target for a few frames but correctly tracks again. Therefore, these plots were presented to show the effectiveness of the tracking method. In the videos Blurcar1, Human7, Jogging1, and Jogging2, the proposed algorithm has the highest precision in the entire video. It has slightly low accuracy in the Blurcar3, Blurcar4, Boy, Car2, Dancer2 and Suv videos. The proposed algorithm has the lowest error in the Blurcar1, Human7, Jogging1, and Jogging2 videos. It has marginally high error compared with a few trackers for the Blurcar3, Blurcar4, Boy, Car2, Dancer2, and Suv sequences.

Figure 6

Precision plot comparison for the OTB-100 dataset [26].

Figure 7

Center location error (in pixels) comparison for the OTB-100 dataset [26].

Frames per second (fps) analysis is presented in Table 3. In the Blurcar1, Car2, Dancer2, Human7, and Jogging1 videos, the proposed algorithm outperforms Modified KCF, MOSSECA, MACF, KCF_MTSA, and STC in terms of precision in error at the expense of modest frame rate.

Table 3

Frames per second (fps).

Sequence	Proposed	Modified KCF [59]	STC [27]	MACF [57]	MOSSE_CA [58]	KCF_MTSA [41]
Blurcar1	10.78	66.29	27.75	18.5	53.06	15.35
Blurcar3	18.04	33.62	28.87	32.7	51.74	6.08
Blurcar4	5.7	21.42	20.07	8.64	27.65	5.83
Boy	26.67	85.51	33.48	58.7	157.17	22.02
Car2	57.18	90.7	94.08	55.3	95.38	11.2
Dancer2	29.66	29.65	65.1	29.2	38.87	6.26
Human7	25.17	34.44	59.66	40.5	26.11	11.48
Jogging1	42.71	95.45	61.75	49	36.59	12.55
Jogging2	22.77	33.01	56.92	34.6	33.97	11
Suv	69.61	76.32	98.03	50.9	79.7	8.44

The computational time for the learning rate module is presented in Table 4. It can be seen that the proposed tracker takes less time in motion blur sequences. However, the overall speed of the tracker is slightly slow, given in Table 3. Combining the different tracking modules presented in Section 3, performance of the proposed tracker is significant as each module is specifically designed and incorporated into the STC framework, making it efficient in terms of less error and high precision for different challenging attributes in VOT.

Table 4

Computation time of the proposed tracker’s learning rate module.

Sequence	Frame Size	Number of Frames	Time
Blurcar1	640 × 480	742	0.011
Blurcar3	640 × 480	357	0.008
Blurcar4	640 × 480	380	0.009
Boy	640 × 480	602	0.009
Car2	320 × 240	913	0.018
Dancer2	320 × 262	150	0.006
Human7	320 × 240	250	0.007
Jogging1	352 × 288	307	0.012
Jogging2	352 × 288	307	0.008
Suv	320 × 240	945	0.017

4.3. Qualitative Analysis

Figure 8 depicts the qualitative analysis of the proposed tracking with five state-of-the-art trackers. Modified KCF and KCF_MTSA are extensions of KCF [62] based tracking methods. However, Modified KCF is not robust to motion blur (Blurcar1, Blurcar3, and Human7), whereas the performance of KCF_MTSA is affected in occlusion (Jogging2) and motion blur (Human7). MACF is an improved version of fast discriminative scale space tracking [63] and achieved favorable results in various challenges of VOT. However, it does not perform well in motion blur (Blurcar1) and occlusion (Jogging1 and Jogging 2). MOSSECA is an improved context-aware formulation version of the MOSSE [64] tracker. The results are exceptional except in the Jogging1 and Human7 sequences. STC is the baseline tracker of the proposed method and achieves favorable results. However, it can be seen that it does not address occlusion (Jogging1 and Jogging2) or motion blur (Blurcar1, Blurcar3, Blurcar4, Boy, and Human7).

Figure 8

Qualitative comparison for the OTB-100 dataset [26].

It can be seen that the proposed tracker outperforms other tracking methods in these sequences. This performance is attributed to three factors. First, a max-pooling-based scale scheme is incorporated, making it less sensitive to scale variations (Boy). Second, incorporation of the APCE-based modified occlusion detection mechanism and fractional-gain Kalman filter-based occlusion handling makes it effective toward occlusions (Jogging1, Jogging2, and Suv). Third, the combination of APCE criteria in the learning rate of the proposed algorithm model update effectively, making it efficient towards motion blur (Blurcar1, Blurcar3, Blurcar4, Boy, and Human7) and illumination variations (Car2 and Dancer2).

5. Discussion

We discuss several observations from performance analysis. First, max-pooling-based scale formulation in spatiotemporal context outperforms trackers without this formulation. This can be attributed to estimating maximum likelihood by using target appearance sampled at a different set of scales. Second, trackers which utilize modules for occlusion detection and handling module outperform trackers without these modules. This can be attributed to the fractional-gain Kalman filter and an APCE-based occlusion detection mechanism preventing tracker from drift. Third, trackers with adaptive learning rate perform better than those with fixed learning rate.

6. Conclusions

This paper contributes insight into an STC-based accurate tracking algorithm by incorporating max-pooling, fractional-gain Kalman, and APCE measures for occlusion detection and tracking model update. It can improve the adaptability of the target model and prevent error accumulation. Evaluations specify that the proposed tracker achieves enhanced results in various complicated scenarios. However, there are some problems: (1) tracking performance is severely affected in dense occlusion; (2) the tracker lost the target of interest in deformation and fast motion; and (3) frame rate of the proposed tracking method is slow. These three points will be the focus of follow-up research. Additionally, considering the challenges of VOT, we also plan to perform future in-depth research on the fusion of features and better prediction estimation mechanisms, and carry out Raspberry Pi, FPGA, and DSP-based hardware implementation and practical application for meeting the requirements of society.

9 in total