Chinese Optics Letters, 2019, 17 (3): 031001, Published Online: Mar. 8, 2019  

An improved long-term correlation tracking method with occlusion handling

Junhao Zhao, Gang Xiao, Xingchen Zhang, D. P. Bavirisetti

Author Affiliations
School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai 200240, China
Abstract
By improving the long-term correlation tracking (LCT) algorithm, an effective object tracking method, improved LCT (ILCT), is proposed to address the issue of occlusion. If the object is judged to be occluded by the designed criterion, which is based on the characteristics of the response value curve, an added re-detector performs re-detection, and the tracker is ordered to stop. In addition, a filtering and adoption strategy for the re-detection results is given to choose the most reliable one for re-initialization of the tracker. Extensive experiments are carried out under occlusion conditions, and the results demonstrate that ILCT outperforms some state-of-the-art methods in terms of accuracy and robustness.

Object tracking is currently one of the hot topics in computer vision and has been widely used in many engineering applications, such as satellites[1], inverse synthetic aperture radar[2], and reconnaissance[3]. A typical scenario of object tracking is to track an unknown object, initialized by a bounding box, in subsequent image frames[4,5]. How to realize a tracker that is robust against significant appearance changes remains an open issue.

In recent years, many robust trackers based on correlation filters have been proposed, among which the minimum output sum of squared error (MOSSE) filter[6] is well known for its high speed and novelty. It introduced the correlation operation[7] into object tracking and greatly accelerated the computation by exploiting the fact that convolution in the spatial domain becomes the Hadamard product in the Fourier domain[8]. After that, the circulant structure of tracking-by-detection with kernels (CSK)[9] was the first to employ the circulant matrix to increase the number of training samples, which improved the classifier. Then, histogram of oriented gradients (HOG) features, Gaussian kernels, and ridge regression were used in the kernelized correlation filters (KCFs)[10] based on CSK, which achieved satisfactory tracking results. Danelljan et al. addressed the issue of scale variation (SV) during an object's movement with their discriminative scale space tracking (DSST)[11], which learns correlation filters on a scale pyramid. Ma et al. proposed long-term correlation tracking (LCT)[4], which comprises correlation filters of appearance and motion to estimate the scale and translation of an object; it is an outstanding tracker for long-term tracking. Inspired by the human recognition model, Choi et al. proposed the attentional feature-based correlation filter (AtCF)[12], which can adapt to fast variations of the object.

However, these trackers do not handle occlusion (OCC) well, or they only address partial OCC (50% coverage or less) and temporary full OCC. A robust tracking algorithm requires a detection module to recover the target from potential tracking failures caused by heavy OCC[13]. Because LCT is designed for long-term tracking, we improve it to handle OCC and name the result improved LCT (ILCT).

ILCT uses the motion correlation filter $w_1$ and appearance correlation filter $w_2$ of LCT to estimate the position and scale of an object. An OCC criterion is designed according to the response value curve of the correlation filter to determine whether the object is occluded. If the object is occluded, the added re-detector is activated, and the tracker stops. A reliable re-detection result then re-initializes the tracker. The principles and experimental results are explained below.

LCT decomposes the tracking task into translation estimation (to obtain a new position) and scale estimation (to obtain a new scale)[4]. These are realized by the motion correlation filter $w_1$ and the appearance correlation filter $w_2$, respectively.

The filter $w_1$ is trained on an image patch $x$ of size $M\times N$, using all circular shifts of its pixels $x_{m,n}$, $(m,n)\in\{0,1,\ldots,M-1\}\times\{0,1,\ldots,N-1\}$, as training samples[9]. Ridge regression is used to minimize the squared error between the mapped training samples and the regression targets, and the filter $w_1\in\mathbb{R}^{M\times N}$ is obtained by
$$w_1=\arg\min_{w}\sum_{m,n}\left|\phi(x_{m,n})\cdot w-y(m,n)\right|^2+\lambda\left\|w\right\|^2,$$
where the Gaussian kernel $k(x,x')=\exp\left(-|x-x'|^2/\sigma^2\right)$ is used to define the mapping $\phi$ through $k(x,x')=\phi(x)\cdot\phi(x')$. The label $y(m,n)$ is a Gaussian function of the shift distance, so its value is close to 1 for small shifts, and $\lambda$ is the regularization parameter.

After the kernel mapping and the fast Fourier transform (FFT) $\mathcal{F}$, the solution $w_1$ can be represented as a linear combination of the training samples, $w_1=\sum_{m,n}a(m,n)\,\phi(x_{m,n})$, where the coefficients $a$ are given by
$$A=\mathcal{F}(a)=\frac{\mathcal{F}(y)}{\mathcal{F}\left[\phi(x)\cdot\phi(x)\right]+\lambda}.$$
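As an illustration, below is a minimal Python/NumPy sketch of this training step; the single-channel patch setting and the function names are our own simplifications, not the authors' MATLAB implementation.

```python
import numpy as np

def gaussian_correlation(x, z, sigma=0.1):
    # Gaussian kernel correlation of two patches over all cyclic shifts,
    # evaluated in the Fourier domain (single-channel patches assumed).
    xf, zf = np.fft.fft2(x), np.fft.fft2(z)
    cross = np.real(np.fft.ifft2(xf * np.conj(zf)))
    d2 = np.maximum(np.sum(x**2) + np.sum(z**2) - 2.0 * cross, 0.0)
    return np.exp(-d2 / (sigma**2 * x.size))

def train_filter(x, y, sigma=0.1, lam=1e-4):
    # Closed-form ridge-regression solution in the Fourier domain:
    # A = F(y) / (F(k^{xx}) + lambda).
    kxx = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(kxx) + lam)
```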

When a new frame arrives, the filter performs correlation on a new patch $z$ around the last location. The correlation response map is then calculated by
$$\hat{y}=\mathcal{F}^{-1}\left\{A\odot\mathcal{F}\left[\phi(z)\cdot\phi(\hat{x})\right]\right\},$$
where $\hat{x}$ denotes the learned appearance model, $\mathcal{F}^{-1}$ is the inverse FFT, and $\odot$ denotes element-wise multiplication. The values of $\hat{y}$ lie between 0 and 1, and the location with the highest $\hat{y}$ is taken as the position of the object.
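Building on the previous sketch, the detection step might look as follows; `detect` is a hypothetical helper, and the response is taken as real-valued after the inverse FFT.

```python
def detect(A, xhat, z, sigma=0.1):
    # Response map y_hat = F^{-1}{ A * F(k^{z,xhat}) }; the peak location
    # gives the object translation, and the peak value is the confidence
    # used later by the occlusion criterion.
    kzx = gaussian_correlation(xhat, z, sigma)
    response = np.real(np.fft.ifft2(A * np.fft.fft2(kzx)))
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return (row, col), float(response.max())
```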

The filter $w_2$ follows the same principle as $w_1$. To estimate the scale, the patch obtained after translation estimation is sampled at $K$ scales:
$$S=\left\{a^{k}\;\middle|\;k=-\tfrac{K-1}{2},-\tfrac{K-3}{2},\ldots,\tfrac{K-1}{2}\right\}.$$

Each scale corresponds to a patch of size $sM\times sN$ ($s\in S$), and HOG features are extracted from each layer to build the scale pyramid. The response value of each layer is obtained in the same way as the translation response above, and the scale whose patch yields the highest value is selected as the object scale $\hat{s}$:
$$\hat{s}=\arg\max_{s}\left[\max(\hat{y}_1),\max(\hat{y}_2),\ldots,\max(\hat{y}_{|S|})\right].$$
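A sketch of this scale search, under the definitions above, is given below; `get_patch` and `score_fn` are hypothetical helpers for cropping/resizing a patch and for evaluating the appearance filter's peak response on it, and the defaults mirror the parameter settings reported later ($K=21$, $a=1.08$).

```python
import numpy as np

def estimate_scale(score_fn, get_patch, center, base_size, K=21, a=1.08):
    # Evaluate the appearance filter on K scaled patches s = a^k,
    # k = -(K-1)/2, ..., (K-1)/2, and keep the scale whose peak
    # response is highest.
    ks = np.arange(K) - (K - 1) / 2.0
    best_s, best_val = 1.0, -np.inf
    for s in a ** ks:
        size = (int(round(s * base_size[0])), int(round(s * base_size[1])))
        patch = get_patch(center, size)   # crop around center, resize to model size
        val = score_fn(patch)             # peak response of w2 on this layer
        if val > best_val:
            best_s, best_val = s, val
    return best_s
```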

Therefore, an adopted tracking result of LCT must attain the highest response values of both $w_1$ and $w_2$. The final response map (which refers to the map of $w_2$, here and below) is shown in Fig. 1(a).

Fig. 1. Response maps. (a) Object is intact; (b) object is occluded.


The motion correlation filter $w_1$ and the appearance model $\hat{x}$ are updated frame by frame with learning rate $\alpha$:
$$\hat{x}^{t}=(1-\alpha)\hat{x}^{t-1}+\alpha x^{t},\qquad \hat{A}^{t}=(1-\alpha)\hat{A}^{t-1}+\alpha A^{t}.$$
The appearance model is trained on a feature vector $x$ with 47 channels[14], which comprises HOG features with 31 bins, an eight-bin intensity histogram, and eight bins of a non-parametric local rank transform[15] of the brightness channel. A threshold is set for the appearance correlation filter $w_2$: if $\max(\hat{y}_s)\geq\tau_a$, $w_2$ is updated in the same way.
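For clarity, the update rule can be sketched as below; the dictionary layout is our own convention, not the authors' code.

```python
def linear_update(old, new, alpha=0.01):
    # x_hat^t = (1 - alpha) * x_hat^{t-1} + alpha * x^t; the same rule
    # applies to the filter coefficients A_hat.
    return (1 - alpha) * old + alpha * new

def update_models(models, new, max_response, alpha=0.01, tau_a=0.5):
    # The motion filter and appearance model update every frame, while
    # the appearance filter w2 updates only when the response peak is
    # confident enough (max(y_hat_s) >= tau_a).
    models['xhat'] = linear_update(models['xhat'], new['xhat'], alpha)
    models['Ahat'] = linear_update(models['Ahat'], new['Ahat'], alpha)
    if max_response >= tau_a:
        models['w2'] = linear_update(models['w2'], new['w2'], alpha)
    return models
```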

Adopting the tracking-by-detection framework is another critical factor in the robustness of LCT to SV, illumination variation (IV), background clutters (BCs), fast motion (FM), etc. An online support vector machine (SVM) classifier is used to recover the target, with quantized color channels as features for detector learning[16]. The intersection-over-union (IOU) thresholds for positive and negative training samples are 0.5 and 0.1, respectively. Another threshold $\tau_r$ activates the detector when $\max(\hat{y}_s)<\tau_r$.
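For reference, the IOU of two axis-aligned boxes can be computed as follows (a standard formula, shown here only to make the sampling thresholds concrete):

```python
def iou(box_a, box_b):
    # Boxes are (x, y, w, h). Samples with IOU > 0.5 against the target
    # box are labeled positive for the SVM, those with IOU < 0.1 negative.
    ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(ay2, by2) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```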

In addition, a cosine window is applied in translation estimation to remove the boundary discontinuities of the response map[6].
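A 2D cosine window is simply the outer product of two 1D Hann windows, for example:

```python
import numpy as np

def cosine_window(M, N):
    # Multiplied onto the patch before the FFT so that the implicit
    # cyclic wrap-around does not create spurious edges in the response.
    return np.outer(np.hanning(M), np.hanning(N))
```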

The tracking result is adopted according to the values of the response map. If the object is intact and undisturbed, the response map is sharp and the bright peak is obvious. On the contrary, when OCC occurs, the map becomes dim and the peak obscure, as shown in Fig. 1(b).

When OCC begins, the tracker may still locate the object successfully based on previous training. However, as time goes on, the coverage increases and degrades the correlation filters, so the tracker fails to re-acquire the object after it emerges from the OCC.

To design an OCC criterion, several sequences[17] with different attributes, including full/partial OCC, deformation (DEF), BCs, FM, IV, and SV, were studied. In each sequence, we selected five frames, $f=\{f_{t-4},f_{t-3},f_{t-2},f_{t-1},f_t\}$, that reflect the attribute and plotted their corresponding response values $y=\{y_{t-4},y_{t-3},y_{t-2},y_{t-1},y_t\}$, as shown in Fig. 2.

Fig. 2. Curves of response values with different attributes. (a) Full OCC; (b) DEF; (c) partial OCC; (d) BCs; (e) FM; (f) IV; (g) SV. (Best viewed in PDF.)


In Fig. 2(a), the response values decrease because of the occluder, whereas the last five curves show obvious rises. The first condition of the criterion is therefore that the response values of five consecutive frames decrease monotonically. Unlike partial OCC and DEF, full coverage of the object by the occluder makes the response values decrease drastically; accordingly, the second condition is that $y_{t-4}$ exceeds $y_t$ by at least $\tau_1$. To ensure the accuracy of the judgement, the third condition is that more than two elements of $y$ are less than $\tau_2$. The three conditions of the criterion are summarized as
$$y_{t-4}>y_{t-3}>y_{t-2}>y_{t-1}>y_t,$$
$$y_{t-4}-y_t\geq\tau_1,$$
$$y'=\{y_i\in y\mid y_i<\tau_2\},\qquad |y'|>2.$$
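The three conditions translate directly into code; the sketch below assumes `y` holds the peak response values of the last five frames.

```python
def occlusion_triggered(y, tau1, tau2):
    # y = [y_{t-4}, y_{t-3}, y_{t-2}, y_{t-1}, y_t]
    monotone_drop = all(y[i] > y[i + 1] for i in range(4))  # condition 1
    large_drop = (y[0] - y[-1]) >= tau1                     # condition 2
    low_count = sum(v < tau2 for v in y) > 2                # condition 3
    return monotone_drop and large_drop and low_count
```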

When the OCC criterion is triggered, we set five free frames on which no operation is carried out, in order to (1) let the object become fully occluded, (2) avoid polluting the filters, and (3) improve the real-time performance.

For convenience, we denote the frame at which the OCC criterion is met and the corresponding time as $f_{occ}$ and $t_{occ}$, i.e., $f=\{f_{occ-4},f_{occ-3},f_{occ-2},f_{occ-1},f_{occ}\}$ meets the three conditions. When the object is identified as occluded, the tracker is ordered to stop, the two filters are no longer updated, and the re-detector is activated. The re-detection module employs edge boxes[18]. Unlike other object detection methods[19] that rely on sliding windows and consume large amounts of computation, this method efficiently generates object bounding-box proposals directly from edges (about 1000 proposals in 0.25 s). Its core idea is that the edges of an object correspond to its contours, and the number of contours wholly enclosed by a bounding box indicates the likelihood that the box contains an object. Finally, each proposed bounding box carries a confidence value reflecting this likelihood. More details can be found in Ref. [18].

In an image with a complex background, a huge number of proposal bounding boxes may be generated, which costs computation time, and most of them are not the object boxes we want. Therefore, a threshold is set so that only the $k$ top-ranked proposals are accepted.

Among these $k$ boxes, false results still exist. Considering the possible SV before and after OCC, a constraint is added to filter out unreasonable boxes:
$$1.5^{-1}\times b_w^{f_{occ-4}}<b_w<1.5\times b_w^{f_{occ-4}},$$
$$1.5^{-1}\times b_h^{f_{occ-4}}<b_h<1.5\times b_h^{f_{occ-4}},$$
where $b_w^{f_{occ-4}}$ and $b_h^{f_{occ-4}}$ are the width and height of the bounding box in frame $f_{occ-4}$. An example is shown in Fig. 3, where $k=500$ is set and the number of bounding boxes after filtering is $k'=38$.
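The size constraint amounts to a simple filter over the proposal list, for example:

```python
def filter_proposals(boxes, ref_w, ref_h, ratio=1.5):
    # Keep proposals (x, y, w, h) whose width and height stay within a
    # factor of `ratio` of the box in the pre-occlusion frame f_occ-4.
    return [b for b in boxes
            if ref_w / ratio < b[2] < ref_w * ratio
            and ref_h / ratio < b[3] < ref_h * ratio]
```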

Fig. 3. Detection of proposal boxes. (a) Bounding box of the object; (b) $k$ top-ranked proposals; (c) $k'$ proposals after filtering.


In our algorithm, the $w_1$ of frame $f_{occ-4}$ is applied to the patches of the final proposal bounding boxes to obtain the estimated positions. Then, scale estimation is performed with the $w_2$ of $f_{occ-4}$, and $k'$ response values are obtained. The detection result is adopted if the highest value reaches the confidence threshold $\tau_3$. Finally, the adopted result re-initializes the tracker by giving the new position.
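Putting the pieces together, the acceptance step might be sketched as follows; `score_fn` stands for running the filters stored at $f_{occ-4}$ on a proposal patch and is a hypothetical helper.

```python
def redetect(proposals, score_fn, tau3):
    # Score every surviving proposal with the pre-occlusion filters and
    # accept the best one only if its response clears tau3; otherwise
    # the re-detector keeps searching in subsequent frames.
    best_box, best_val = None, float("-inf")
    for box in proposals:
        val = score_fn(box)
        if val > best_val:
            best_box, best_val = box, val
    return best_box if best_val >= tau3 else None
```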

The whole flowchart of our method is shown in Fig. 4.

Fig. 4. Flowchart of ILCT. (Best viewed in PDF.)


To demonstrate the performance of the improved tracker, experiments are performed on eight sequences with OCC and other attributes. Eight state-of-the-art trackers are compared with ILCT: KCF[10], LCT[4], DSST[11], tracking-learning-detection (TLD)[20], structured output tracking with kernels (Struck)[21], the L1 tracker using the accelerated proximal gradient approach (L1APG)[22], integrated CSK (ICSK)[23], and compressive tracking (CT)[24]. The first three are correlation-based trackers, and the rest are also effective trackers against OCC[17]. In addition, for better comparison, KCF is first augmented with the proposed OCC criterion and recovery mechanism, yielding IKCF. Second, we replace the proposed trigger of ILCT with the triggers used in MOSSE[6] and TLD[20], i.e., the peak-to-sidelobe ratio (PSR) and median flow (MF), yielding LCT-PSR and LCT-MF, respectively. The experimental environment is an Intel i7-6500U 2.5 GHz CPU with 8 GB RAM, running MATLAB 2016b.

The annotated attributes of the eight sequences include OCC, FM, moving camera (MC), SV, BCs, IV, DEF, out-of-plane rotation (OPR), and motion blur (MB). Their information is listed in Table 1. The triumphal arch sequence was captured by us.

Table 1. Information of Eight Sequences

Video Name     | Number of Frames | Attributes
Carchase1[25]  | 71               | OCC, FM, MC
Road[26]       | 52               | OCC, BC, FM, MC, SV
Carchase2[27]  | 150 (1st-150th)  | OCC, FM, IV, MC
Group[26]      | 86               | OCC, DEF, MC
Motorcycle[28] | 156              | OCC, FM, MB, MC, SV
Triumphal arch | 331              | OCC, DEF, OPR, SV
Jogging[17]    | 307              | OCC, DEF, OPR
Wandering[26]  | 285              | OCC, DEF, MC


The parameters of the LCT part are set to their default values[4]: $\lambda=10^{-4}$; the size of the search window for translation estimation is 1.8 times the target size; the Gaussian kernel width $\sigma=0.1$; the learning rate $\alpha=0.01$; the number of scales $|S|=21$; the scale factor $a=1.08$; $\tau_r=0.25$ for the activation of the SVM; $\tau_t=0.5$ for the adoption of the SVM result; and $\tau_a=0.5$ as the threshold for the model update.

In the OCC criterion, $\tau_1$ and $\tau_2$ are not fixed; they are set to a quarter and a half, respectively, of the response value of the second frame (the first frame has no correlation, since the object is selected manually there).

In the re-detector, we use the default parameters of edge boxes[18] and set $k=200$. $\tau_3$ is set to 0.8 times $y_{occ-4}$. The remaining trackers are run with their default parameters.

The tracking results of 12 trackers are shown in Fig. 5.

Fig. 5. Tracking results of the 12 trackers. (a) Carchase1; (b) road; (c) carchase2; (d) group; (e) motorcycle; (f) triumphal arch; (g) jogging; (h) wandering. (Best viewed in PDF.)


From Fig. 5, we can see that the objects undergo obvious full OCC. Most trackers drift to the background after the OCC, while ILCT tracks the objects robustly.

Center location error (CLE) is used for quantitative evaluation; we report the percentage of frames whose Euclidean distance $r$ between the centers of the predicted bounding box and the ground truth is within a pixel threshold (set to 15 pixels). In addition, for a fair comparison, the frames with OCC are discarded, i.e., only the frames with the object in view are compared.
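Under this definition, the reported score can be computed as below (a straightforward reading of the metric, not the authors' evaluation script):

```python
import numpy as np

def cle_precision(pred_centers, gt_centers, threshold=15.0):
    # Fraction of compared frames whose Euclidean distance between the
    # predicted and ground-truth box centers is within the threshold.
    pred = np.asarray(pred_centers, dtype=float)
    gt = np.asarray(gt_centers, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= threshold))
```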

The CLE results are listed in Table 2 (in the original typeset table, the best result is in bold and the second-best one is underlined). ILCT showed its capacity to resist disturbance in Fig. 5, and Table 2 confirms this. Because ILCT does not lose the object in any of the experiments, its CLE results are satisfactory. In terms of the numbers of best and second-best results, ILCT achieves four and three, respectively. From the CLE evaluation, ILCT can be regarded as the best tracker.

Table 2. Comparison Results of CLE in Eight Sequences

Sequence              | ILCT   | CT     | DSST   | KCF    | IKCF   | L1APG  | Struck | TLD    | ICSK   | LCT    | LCT-PSR | LCT-MF
Carchase1             | 0.9800 | 0.5800 | 0.5800 | 0.5800 | 0.9800 | 0.5800 | 0.5800 | 1.0000 | 0.5800 | 0.5800 | 1.0000  | 0.8400
Road                  | 0.8621 | 0.4828 | 0.4828 | 0.4828 | 1.0000 | 0.3793 | 0.4828 | 0.4828 | 0.4828 | 0.4828 | 1.0000  | 0.4828
Carchase2             | 0.9681 | 0.0638 | 0.5106 | 0.5106 | 1.0000 | 0.5106 | 0.4787 | 0.9681 | 0.5426 | 0.5106 | 1.0000  | 0.5426
Group                 | 0.9444 | 0.0833 | 0.1528 | 0.1528 | 0.9444 | 0.1528 | 0.1389 | 0.0417 | 0.1528 | 0.1528 | 0.9444  | 0.1667
Motorcycle            | 0.9858 | 0.1560 | 0.9929 | 0.2128 | 0.9362 | 0.2128 | 0.0780 | 0.3404 | 1.0000 | 1.0000 | 0.8936  | 0.1277
Triumphal arch        | 0.9622 | 0.1351 | 0.1351 | 0.1351 | 0.1351 | 0.1351 | 0.1351 | 0.4757 | 0.1351 | 0.1351 | 0.5297  | 0.1351
Jogging               | 1.0000 | 0.1544 | 0.1544 | 0.1544 | 0.1544 | 0.9930 | 0.0702 | 1.0000 | 1.0000 | 1.0000 | 0.1544  | 0.6947
Wandering             | 0.9956 | 0.3231 | 0.3406 | 0.3406 | 0.3406 | 0.3275 | 0.3406 | 0.8908 | 0.3406 | 0.3362 | 0.1659  | 0.3406
Average               | 0.9575 | 0.2365 | 0.4298 | 0.3183 | 0.6863 | 0.4234 | 0.2805 | 0.6155 | 0.5292 | 0.5516 | 0.7110  | 0.4163
Number of best        | 4      | 0      | 0      | 0      | 3      | 0      | 0      | 2      | 2      | 2      | 4       | 0
Number of second best | 3      | 0      | 1      | 0      | 1      | 1      | 0      | 2      | 0      | 0      | 1       | 1


From Table 2, comparing IKCF with KCF shows that IKCF achieves great progress in the tracking results because of the proposed improvement, which reflects the effectiveness of our method. The proposed trigger outperforms PSR and MF because those two are computed within a single frame and may therefore be triggered by non-OCC factors such as DEF and SV.

In conclusion, an effective tracking method that can handle OCC is proposed. Based on the motion and appearance correlation filters of LCT, ILCT employs a designed OCC criterion and a re-detector to judge the OCC and to perform re-detection, respectively. Once the object is identified as occluded, the tracker stops, and the re-detector is activated. The detection result with high confidence then re-initializes the tracker. Extensive experiments have been performed, and the results of the qualitative and quantitative evaluations indicate that ILCT outperforms several state-of-the-art trackers in terms of accuracy and robustness. In future work, the efficiency and real-time performance need to be addressed to further improve the tracker.

References

[1] G. Wang, F. Xing, M. S. Wei, T. Sun, and Z. You, Chin. Opt. Lett. 15, 081201 (2017).

[2] F. Z. Zhang, Q. S. Guo, Y. Zhang, Y. Yao, P. Zhou, D. Y. Zhu, and S. L. Pan, Chin. Opt. Lett. 15, 112801 (2017).

[3] Q. H. Yu, D. M. Wu, F. C. Chen, and S. L. Sun, Chin. Opt. Lett. 16, 071101 (2018).

[4] C. Ma, X. Yang, C. Zhang, and M. H. Yang, in Proceedings of Computer Vision and Pattern Recognition (2015), p. 5388.

[5] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, in Proceedings of Computer Vision and Pattern Recognition (2012), p. 2042.

[6] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, in Proceedings of Computer Vision and Pattern Recognition (2010), p. 2544.

[7] C. F. Hester and D. Casasent, Appl. Opt. 19, 1758-1761 (1980).

[8] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C (Cambridge University, 1988).

[9] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, in Proceedings of European Conference on Computer Vision (2012), p. 702.

[10] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, IEEE Trans. Pattern Anal. Mach. Intell. 37, 583-596 (2015).

[11] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, in Proceedings of British Machine Vision Conference (2014), p. 1.

[12] J. Choi, H. J. Chang, J. Jeong, Y. Demiris, and J. Y. Choi, in Proceedings of Computer Vision and Pattern Recognition (2016), p. 4321.

[13] C. Ma, J. B. Huang, X. Yang, and M. H. Yang, Int. J. Comput. Vision 126, 771 (2018).

[14] P. Dollár, R. Appel, S. Belongie, and P. Perona, IEEE Trans. Pattern Anal. Mach. Intell. 36, 1532 (2014).

[15] R. Zabih and J. Woodfill, in Proceedings of European Conference on Computer Vision (1994), p. 151.

[16] C. Ma, X. Yang, C. Zhang, and M. H. Yang, "Long-term correlation tracking," 2016, https://sites.google.com/site/chaoma99/cf-lstm.

[17] Y. Wu, J. Lim, and M. H. Yang, in Proceedings of Computer Vision and Pattern Recognition (2013), p. 2411.

[18] C. L. Zitnick and P. Dollár, in Proceedings of European Conference on Computer Vision (2014), p. 391.

[19] P. Dollár and C. L. Zitnick, IEEE Trans. Pattern Anal. Mach. Intell. 37, 1558 (2015).

[20] Z. Kalal, K. Mikolajczyk, and J. Matas, IEEE Trans. Pattern Anal. Mach. Intell. 34, 1409 (2012).

[21] S. Hare, A. Saffari, and P. H. S. Torr, in Proceedings of International Conference on Computer Vision (2011), p. 263.

[22] C. Bao, Y. Wu, H. Ling, and H. Ji, in Proceedings of Computer Vision and Pattern Recognition (2012), p. 1830.

[23] X. Dong, J. Shen, D. Yu, W. Wang, J. Liu, and H. Huang, IEEE Trans. Multimedia 19, 763 (2017).

[24] K. Zhang, L. Zhang, and M. H. Yang, in Proceedings of European Conference on Computer Vision (2012), p. 864.

[25] P. Liang, E. Blasch, and H. Ling, IEEE Trans. Image Process. 24, 5630 (2015).

[26] M. Mueller, N. Smith, and B. Ghanem, in Proceedings of European Conference on Computer Vision (2016), p. 445.

[27] O. Quaker, "Anaheim California Police Chase 06/18/2017—Reckless Driver," 2017, https://www.youtube.com/watch?v=bAZZ3NKNnTg.

[28] M. Kristan, J. Matas, A. Leonardis, T. Vojíř, R. Pflugfelder, G. Fernandez, G. Nebehay, F. Porikli, and L. Čehovin, IEEE Trans. Pattern Anal. Mach. Intell. 38, 2137 (2016).

Junhao Zhao, Gang Xiao, Xingchen Zhang, D. P. Bavirisetti. An improved long-term correlation tracking method with occlusion handling[J]. Chinese Optics Letters, 2019, 17(3): 031001.
