SiamLST: Learning Spatial and Channel-wise Transform for Visual Tracking

Abstract: Siamese network based trackers regard visual tracking as a similarity matching task between the target template and search region patches, and have achieved a good balance between accuracy and speed in recent years. However, existing trackers do not effectively exploit spatial and inter-channel cues, which leads to redundancy in the pre-trained model parameters. In this paper, we design a novel visual tracker based on a Learnable Spatial and Channel-wise Transform in a Siamese network (SiamLST). The SiamLST tracker includes a powerful feature extraction backbone and an efficient cross-correlation method. The proposed algorithm takes full advantage of the CNN and the learnable sparse transform module to represent the template and search patches, effectively exploiting spatial and channel-wise correlations to deal with complicated scenarios such as motion blur, in-plane rotation and partial occlusion. Experimental results on multiple tracking benchmarks including OTB2015, VOT2016, GOT-10k and VOT2018 demonstrate that the proposed SiamLST achieves excellent tracking performance.


INTRODUCTION
Visual tracking aims to predict the trajectory and scale variations of a target in subsequent frames, given the target state in the initial frame. As an important branch of computer vision, visual tracking has a variety of applications, such as intelligent transportation systems, augmented reality and human-computer interaction, to name a few. Although the performance of visual tracking has been greatly improved recently, well-balanced visual tracking remains an enormous challenge due to complicated scenarios, such as low resolution, partial occlusion, illumination variation, motion blur, in-plane and out-of-plane rotations, and so on.
In recent years, visual trackers based on deep learning have achieved a good balance between accuracy and real-time speed. Typical deep learning based tracking methods include two core components: a feature extraction backbone based on a Convolutional Neural Network (CNN) and similarity computation based on cross-correlation. These trackers have a powerful deep feature extraction ability that sustains tracking performance when targets suffer serious appearance variations. As a pioneering work [1], the Siamese network was introduced to visual tracking, and many state-of-the-art algorithms based on Siamese networks have since been proposed, such as SiamBAN [2].
Siamese network based tracking algorithms regard target tracking as a similarity learning problem between the target template and search patches, and achieve real-time tracking performance. First, the convolutional neural network is trained offline to learn a similarity function on a large number of video sequences. Next, the similarity scores between the target template and search patches are computed. Lastly, the target position and scale offset in the next frame are estimated from the score map.
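As a concrete illustration, the similarity-matching step can be sketched in NumPy by sliding the template features over the search-region features (a minimal toy version of the SiamFC-style pipeline, not the paper's actual implementation):

```python
import numpy as np

def xcorr(z, x):
    """Naive cross-correlation of template features z (C, Hz, Wz)
    against search-region features x (C, Hx, Wx).
    Returns a (Hx-Hz+1, Wx-Wz+1) similarity score map."""
    C, Hz, Wz = z.shape
    Ho, Wo = x.shape[1] - Hz + 1, x.shape[2] - Wz + 1
    score = np.empty((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            # inner product between the template and one search window
            score[i, j] = np.sum(z * x[:, i:i + Hz, j:j + Wz])
    return score

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 6, 6))
x = np.zeros((4, 22, 22))
x[:, 5:11, 7:13] = z            # plant the template at offset (5, 7)
score = xcorr(z, x)
# the peak of the score map gives the predicted target displacement
peak = np.unravel_index(score.argmax(), score.shape)   # (5, 7)
```

Real trackers compute the same operation as a convolution of the search feature map with the template feature map on the GPU; the loop form here only makes the window-by-window matching explicit.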
Despite this great success, Siamese network based visual trackers still have the following disadvantages: 1) in target feature extraction, the shallow convolutional layers of a CNN, such as VGG16 [3], tend to lack generalization ability; a visual tracker based on a shallow feature extraction backbone is prone to drift when there is serious noise or local damage in an input image. 2) In deep CNN feature extraction, offline training is often time-consuming; meanwhile, tracking performance declines when the network is very deep, such as ResNet152 [4].
Recently, the attention mechanism has become a hot topic due to its powerful ability to highlight regions of interest (ROI), such as FcaNet [5]. Attention has been widely applied in anomaly detection [6], semantic analysis [7] and face recognition [8], and it has also been introduced to visual tracking. Wang et al. [9] design a deep architecture consisting of residual attention, general attention and channel attention for visual tracking. Zhang et al. [10] design an end-to-end framework to exploit the contextual information in consecutive frames, and the proposed tracking algorithm achieves robust tracking performance. However, the above works only focus on the ROI, and their computational cost is expensive.
In this paper, in order to address the above-mentioned problems, we design a learnable feature extraction backbone that effectively exploits spatial and channel-wise correlations. Our algorithm achieves a good balance between tracking accuracy and average overlap. In addition, SiamLST surpasses some SOTA algorithms in complicated situations, such as out-of-view, partial occlusion and scale variation. The main contributions are threefold:
• We propose a novel end-to-end deep model to extract more discriminative features. Compared with other competing algorithms, our tracker effectively exploits the dependency of channel-wise features and decreases the number of parameters by combining the advantages of the CNN and the learnable sparse module.
• We design a visual tracking algorithm based on a Siamese network. It consists of the proposed deep model and an efficient cross-correlation method. Our algorithm takes full advantage of spatial and channel-wise correlations, which greatly alleviates the influence of appearance variations, such as occlusion and motion blur.
• Extensive experiments demonstrate that the SiamLST algorithm outperforms SOTA works while running at real-time speed on the OTB2015, GOT-10k, VOT2016 and VOT2018 benchmarks.
The rest of this paper is organized as follows. Section II reviews relevant tracking techniques and algorithms. The details of the designed SiamLST are described in Section III. Extensive experiments conducted on multiple benchmarks are presented in Section IV. Section V concludes the paper.

LITERATURE REVIEW
In this section, we review some relevant tracking techniques and algorithms. In particular, end-to-end visual trackers based on Siamese networks and the attention mechanism are reviewed.

Visual Trackers Based on Correlation Filter
According to their appearance models, visual tracking algorithms are roughly divided into generative and discriminative algorithms. Typical generative algorithms include mean shift [11] and sparse representation [12], while representative discriminative algorithms are correlation filter based trackers [16] and trackers based on deep learning [17].
In the past decades, visual trackers based on correlation filters have received extensive attention because of their simple structure and extensibility. In [13], the correlation filter was applied to the visual tracking task for the first time, reaching a speed of 669 frames per second. On the basis of MOSSE, Henriques et al. [14] add a regularization term to avoid overfitting, and introduce the circulant matrix and kernel functions to improve the tracking speed. Henriques et al. further propose the kernelized correlation filter (KCF) based tracker [15].

Siamese Based Tracking Algorithms
Recently, visual trackers based on Siamese networks have received considerable attention due to their good balance between accuracy and real-time speed. Siamese network based tracking algorithms usually learn a similarity matching function by offline training on a large number of labelled sequences, which improves the speed and accuracy of online tracking.
In [1], a fully convolutional neural network for computing the similarity between the template and a search region is designed. Guo et al. [21] propose a dynamic Siamese network based tracker that learns target appearance variation online and suppresses noise from the previous frame; in addition, this tracker uses continuous video sequences instead of image pairs for training. In [18], a category semantic information branch is added to the tracking framework, and this method achieves a robust balance between accuracy and overlap rate.
Li et al. [19] propose a novel framework consisting of a Siamese network and a region proposal network (RPN); the RPN obtains a similarity score map via a classification branch and a regression branch. SiamRPN++ [22] aims to improve tracking accuracy and develops different frameworks to obtain superior tracking performance. Yu et al. [23] develop a deformable Siamese attention network consisting of self-attention to exploit the richness of semantic information and cross-attention to enhance spatial and channel-wise correlations.

Attention Mechanism
The attention mechanism makes a convolutional neural network focus on the ROI to better highlight it. It is widely used in deep learning, for example in object detection, face recognition and semantic segmentation. FcaNet [5] uses GAP from a frequency-domain perspective to compensate for the lack of feature information in existing channel attention methods; it extends GAP to a more general 2-dimensional discrete cosine transform (DCT) form and introduces more frequency components to fully utilize the information. In [7], a residual attention network captures mixed attention to train very deep residual attention networks.
Recently, attention has been applied to visual tracking to enhance feature representation ability. Wang et al. [4] propose a novel deep architecture that consists of channel attention, residual attention and general attention to promote tracking performance. Yu et al. [23] design a deformable Siamese attention network to enhance target discrimination. Zhu et al. [24] develop an end-to-end algorithm for visual tracking that takes full advantage of rich flow information. Although these methods increase the weight of the ROI, they also increase the number of parameters. Inspired by the above works, we propose a novel attention mechanism based Siamese network that effectively reduces the number of pre-trained model parameters and exploits inter-channel feature dependency.

RESEARCH METHOD
In this section, we describe our SiamLST algorithm in detail. Inspired by SiamFC, our algorithm obtains richer and sparser representation features through the designed deep model. As shown in Fig. 1, SiamLST includes two core components: a novel deep model for extracting the template and search region features, and an efficient similarity matching method. Among them, Xcorr is used to measure the similarity between the target template and the search region patches, and the target position in the current frame is estimated from the similarity score map.

Siamese Network Backbone
In recent years, tracking algorithms based on deep learning have attracted great attention due to their good balance between accuracy and real-time speed. Visual trackers based on the Siamese architecture consist of a feature extraction backbone and an efficient cross-correlation method. In [1], the Siamese network is introduced to visual tracking, which regards visual tracking as a similarity learning task and achieves competitive tracking performance. Inspired by SiamFC, some algorithms achieve more robust tracking performance at real-time speed by introducing extra subnetworks or ensembling multiple subnetworks [5].
Most trackers based on Siamese networks adopt shallow convolutional layers to extract target features, such as VGG16 [3]. At present, deep convolutional neural networks play a leading role in visual tracking and have achieved excellent image classification performance, such as ResNet [4]. Although deep learning based visual trackers have made significant progress, the pre-trained models of Siamese network based tracking algorithms have a large number of parameters and hardly take full advantage of spatial and channel-wise correlations. A convolution layer usually needs a large number of convolution kernels to extract the semantic information of the input image and distinguish the foreground from its background. Unfortunately, this easily leads to a vast number of offline model parameters. Meanwhile, all pixels participating in the sliding window operation have the same weight coefficient, so the ROI is not highlighted and the background information is not suppressed. This learning scheme is disadvantageous for locating the target in the next frame.

Learnable Feature Extraction Backbone
To solve the mentioned problem of the feature extraction stage in visual tracking, we propose a novel Learnable Spatial and Channel-wise Transform model based on the attention mechanism. Before describing our model in detail, let us review the operation of the convolution layer in a traditional feature extraction backbone. The output $O_{i,j,k}$ of the convolution is computed as follows:

$$O_{i,j,k} = \sum_{(x,y,z)\in\Omega} W^{(k)}_{x,y,z} \, X_{x,y,z;i,j}$$

where $X_{x,y,z;i,j}$ denotes the tensor extracted from the input $X$ at pixel position $(x, y, z)$ within the window centered at $(i, j)$, $\Omega$ is the sliding window of the convolution layer, and $W^{(k)}_{x,y}$ represents the weight at position $(x, y)$ of the $k$-th kernel $W^{(k)}$. We observe that traditional convolution layers have two disadvantages. First, every pixel in the sliding window participates in the computation at spatial position $(x, y)$. This is helpful for extracting high-frequency features, but causes redundancy for the low-frequency features that make up the majority of the image. Second, all pixel positions share the same weight for each channel, which makes it hard to distinguish the foreground from the surrounding background. Therefore, we discuss the proposed deep model from the spatial and channel-wise perspectives, respectively.
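For reference, the sliding-window convolution described above can be sketched in NumPy with the loops kept explicit to mirror the summation over the window $\Omega$ (a toy illustration, not the paper's implementation):

```python
import numpy as np

def conv2d_valid(X, W):
    """Plain 'valid' convolution. X is (C, H, W); W is (K, C, kH, kW).
    Every pixel in the window Omega contributes, and the same learned
    weight is reused at every spatial position (i, j)."""
    K, C, kH, kW = W.shape
    Ho, Wo = X.shape[1] - kH + 1, X.shape[2] - kW + 1
    out = np.empty((K, Ho, Wo))
    for k in range(K):
        for i in range(Ho):
            for j in range(Wo):
                # O[i, j, k] = sum over Omega of W(k) * X-patch
                out[k, i, j] = np.sum(W[k] * X[:, i:i + kH, j:j + kW])
    return out

X = np.ones((1, 3, 3))
W = np.ones((2, 1, 2, 2))
O = conv2d_valid(X, W)   # shape (2, 2, 2); each entry sums a 2x2 window
```

The two weaknesses noted in the text are visible here: all window pixels enter the sum, and `W[k]` is position-independent.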
The designed SiamLST processes the feature maps from the spatial and channel-wise perspectives, respectively. Our model effectively exploits feature dependency to alleviate local feature redundancy by transforming features to a sparser domain. In order to retain as many details of the input image as possible, similar to a residual network, we use a point-wise convolution operator to adapt the feature map scale. The general process is as follows:

$$\Phi = T_{LST}(X) = T_c(T_s(X)) + T_r(X)$$

where $T_s$ and $T_c$ represent the spatial and channel-wise transforms in the learnable sparse transform module, respectively, $T_r$ denotes a down-sampling operation that ensures the input image features can be combined with the convolutional features, $T_{LST}$ denotes the LST module in Fig. 1, and $\Phi$ corresponds to the output deep feature of the backbone network.
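Reading the description as a residual composition $\Phi = T_c(T_s(X)) + T_r(X)$ (our interpretation), a minimal NumPy sketch is as follows; the concrete weight shapes are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def spatial_transform(x, w_row, w_col):
    """T_s: reweight rows and columns of each channel's feature map."""
    return w_row @ x @ w_col.T      # applied per channel via broadcasting

def channel_transform(x, w_c):
    """T_c: point-wise (1x1) mixing across channels."""
    C, H, W = x.shape
    return (w_c @ x.reshape(C, -1)).reshape(-1, H, W)

def resize_transform(x, w_r):
    """T_r: point-wise projection so the shortcut matches the output."""
    C, H, W = x.shape
    return (w_r @ x.reshape(C, -1)).reshape(-1, H, W)

def lst_module(x, w_row, w_col, w_c, w_r):
    """Phi = T_LST(x) = T_c(T_s(x)) + T_r(x)."""
    return channel_transform(spatial_transform(x, w_row, w_col), w_c) \
        + resize_transform(x, w_r)
```

With identity weights, the module reduces to `x + x`, which makes the residual structure easy to sanity-check.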
The learnable spatial transform. Spatial feature redundancy is a serious problem in training lightweight networks. To alleviate this problem, we develop an efficient spatial transform module $T_s$ built on a learnable weight $W_s$. Specifically, the input image is decomposed into different frequency bands through successive row and column transforms. The learnable weight can be described as follows:

$$W_s = W_{column} \otimes W_{row}$$

where $\otimes$ denotes the Kronecker product, and the row and column transforms correspond to the learnable weights $W_{row}$ and $W_{column}$, respectively.

The learnable channel transform. We exploit channel-dimension dependency through $T_c$, which plays a key role in suppressing inter-channel redundancy. The channel-wise transform module maps input features to a sparser domain, and the scale is adapted by $T_r$. Similar to the spatial transform above, we reweight the feature map to highlight the ROI. Moreover, we use a point-wise convolutional layer to capture inter-channel cues.
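The appeal of a Kronecker-product factorization is parameter efficiency: for an H×W feature map, a full spatial weight needs (HW)² entries, while separate row and column factors need only H² + W². A small NumPy check of the equivalence (our own illustration; the paper's learnable weights may be structured differently):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 5
w_row = rng.standard_normal((H, H))   # learnable row transform
w_col = rng.standard_normal((W, W))   # learnable column transform
X = rng.standard_normal((H, W))

# full spatial weight: Kronecker product of the two small factors
w_s = np.kron(w_row, w_col)                 # (H*W, H*W): 400 entries
dense = (w_s @ X.flatten()).reshape(H, W)

# same result with two small matmuls (H^2 + W^2 = 41 parameters)
separable = w_row @ X @ w_col.T
assert np.allclose(dense, separable)
```

The assertion holds because of the Kronecker mixed-product identity: with row-major flattening, `(A ⊗ B) vec(X)` equals `vec(A X Bᵀ)`.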
The learnable resize transform. In a traditional feature extraction backbone, each pixel in the sliding window has the same weight coefficient, which hardly highlights the ROI or suppresses noisy interference. To alleviate this problem, we develop a learnable weight mapping operator. The proposed model obtains sparser template features through the spatial and channel-wise transforms, and exploits inter-channel cues to highlight the ROI. Meanwhile, the efficient resize transform module improves tracking performance and enhances the robustness of the designed SiamLST.
Activation scheme. After the above steps, the designed model obtains sparser features that help to deal with challenging scenarios in video sequences. In addition, the trained model becomes more lightweight by exploiting inter-channel cues. A nonlinear activation function can reduce negative sample features to zero and keep positive sample features unchanged. However, existing nonlinear activation functions are not robust enough when the input image is noisy or partially occluded. To further exploit nonlinear semantic information, based on the ReLU activation function, a novel activation function is designed as follows:

$$y = \begin{cases} 1, & x > \tau \\ -1, & x < -\tau \\ 0, & \text{otherwise} \end{cases}$$

where the rule follows the sign function $\mathrm{sgn}(x)$ and $\tau$ is a hyper-parameter threshold: when $x$ is greater than $\tau$, $y$ is 1; when $x$ is less than $-\tau$, $y$ is $-1$; in the remaining cases $y$ is 0. With this scheme, our model achieves more robust tracking performance when the input images are partially occluded or motion blurred. Moreover, the developed activation function compresses trivial features into a sparse domain, making the trained model more lightweight.
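The piecewise rule above amounts to a thresholded sign function. A minimal NumPy version (our reconstruction of the described function; `tau` is the hyper-parameter threshold):

```python
import numpy as np

def hard_sign(x, tau=0.5):
    """Thresholded sign activation: +1 above tau, -1 below -tau, 0 otherwise.
    Trivial low-magnitude responses are compressed to exactly zero,
    which sparsifies the feature map."""
    return np.where(x > tau, 1.0, np.where(x < -tau, -1.0, 0.0))

y = hard_sign(np.array([1.2, -0.9, 0.1]))   # -> [1., -1., 0.]
```

Because small responses map to exactly zero rather than to a small value (as ReLU would give for small positive inputs), noisy low-magnitude activations are suppressed outright.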

RESULTS AND DISCUSSION
In this section, we introduce the environment configuration and the evaluation metrics. Next, we compare our algorithm with state-of-the-art tracking algorithms quantitatively and qualitatively. Finally, we analyse the limitations of the proposed algorithm and discuss future work.

Implementation Details
We develop an end-to-end learnable model as a feature extraction backbone by taking full advantage of the CNN and the learnable sparse transform, and we combine various convolution layers to enhance the representation capacity of the deep model. Finally, we select the best model as verified in the ablation study. The target template is cropped to 127 × 127 × 3 and the search region is cropped to 255 × 255 × 3. Training runs for 50 epochs with stochastic gradient descent. The designed SiamLST is compared with some competing visual trackers including SiamFC [1], KCF [15], LDES [28], ACFN [29], LMCF [30], DSST [26], and so on.
Environment configuration. Our algorithm is implemented on an NVIDIA Quadro P4000 GPU and an Intel Xeon E5-2600 v4 CPU (2.00 GHz) with 32 GB RAM, and tested in Python 3.6 with PyTorch 1.4.0 during the offline training stage. We utilize the 9,335 video sequences in the GOT-10k training split to train the model. In addition, we evaluate the best model on multiple benchmarks.
Evaluation metric. The one-pass evaluation (OPE) metrics of precision and success rate are adopted to evaluate tracking performance. The center location error (CLE) between the predicted bounding boxes and the ground truth measures precision, while the intersection over union (IoU) between the predicted bounding boxes and the ground truth measures the success rate. Finally, we analyse the tracking performance through quantitative and qualitative comparisons.
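For clarity, the two underlying measurements can be computed as follows (a straightforward sketch; boxes are assumed to be in (x, y, w, h) format):

```python
import numpy as np

def center_error(box_a, box_b):
    """CLE: Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ca = (box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2)
    cb = (box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2)
    return np.hypot(ca[0] - cb[0], ca[1] - cb[1])

def iou(box_a, box_b):
    """IoU (overlap) between two (x, y, w, h) boxes."""
    ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(ay2, by2) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# a perfectly predicted box gives CLE 0.0 and IoU 1.0
```

The OPE precision plot then reports the fraction of frames with CLE below a pixel threshold, and the success plot reports the fraction with IoU above an overlap threshold.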

Ablation Study
As shown in Tab. 1, we combine different AlexNet layers to design various deep models, and the experimental results demonstrate that the second model performs best on OTB2015. At the feature extraction stage, the best strategy is to start with smaller convolutional kernels and let the receptive field grow as the network deepens. On the one hand, when a feature map carries more detailed information in the shallow layers, smaller convolutional kernels capture this information effectively. On the other hand, when a feature map carries richer semantic information in the deep layers, a larger receptive field takes full advantage of it. By investigating the impact of different convolutional layers, we obtain the best model: the feature extraction backbone consisting of AlexNet convolution layers 1, 3, 4 and 5 together with the learnable spatial and channel-wise modules.

Comparison With State-of-the-art Trackers
OTB2015. OTB2015 is a standard tracking benchmark containing 100 challenging sequences. Each sequence has one or more attributes such as fast motion (FM), deformation (DEF), out-of-view (OV), occlusion (OCC), low resolution (LR), out-of-plane rotation (OPR), in-plane rotation (IPR), background clutter (BC), motion blur (MB), illumination variation (IV) and scale variation (SV). In Fig. 2 and Tab. 2, our algorithm obtains the smallest center location error and the second-best overlap. On the one hand, the spatial and channel-wise transform helps to improve tracking performance; on the other hand, we compute the similarity between the template and search patches with an efficient cross-correlation method.
GOT-10k. GOT-10k is a large-scale challenging tracking benchmark. It contains more than 10,000 videos with more than 1.5 million manually labelled bounding boxes, divided into 563 target categories and 87 motion patterns. In Tab. 3, we compare SiamLST against superior trackers on GOT-10k; our tracker achieves the best results in SR0.5 and SR0.75, and the second-best result in AO. We highlight the best three results in red, blue and green.

Trackers     Precision  Success  FPS
KCF [15]     69.6       47.4     223.8
DSST [26]    68.0       51.3     25.4
MEEM [27]    78.1       53.0     19.5
ACFN [29]    79.9       57.4     15
LCT [31]     76.2       56.2     27
LMCF [30]    78.9       58.0     85
CFNet [17]   74.8       56.8     75
SiamFC [1]   77.1       58.2     86
LDES [28]    78.5       61.5     20
SiamLST      80.6       58.9     71

VOT2016. VOT2016 expands the test set to 60 sequences and provides automatic sample labelling. In Fig. 3, our algorithm SiamLST achieves the best EAO result compared with other SOTA trackers. Since the learnable sparse transform module is added to the CNN, we achieve more robust and accurate performance when the input image is corrupted or noisy.

VOT2018. VOT2018 adds a long-term tracking competition. Compared with short-term tracking, VOT2018 introduces two challenging factors: full occlusion and out-of-view. In these cases, the target completely disappears from the video frames, and the tracker needs to judge whether the target has disappeared and re-detect it when it reappears. In Tab. 4, the designed SiamLST achieves the SOTA result in accuracy. We believe the learnable sparse transform benefits locating the target center position.

Quantitative Evaluation
In this section, we compare SiamLST with seven outstanding algorithms on OTB2015. In Tab. 4, our algorithm achieves SOTA precision and success rates under the one-pass evaluation. The proposed SiamLST is superior to the other competing algorithms in precision, and achieves the second-best result in success rate.
These results demonstrate the effectiveness of the proposed learnable feature extraction backbone. It is worth noting that SiamLST improves the precision by 0.035 compared with SiamFC while adding only a few parameters. In addition, SiamLST effectively alleviates feature redundancy. Fig. 4 and Fig. 5 show the center location error and overlap rate for each attribute. We select eight challenging attributes to evaluate the tracking performance on OTB2015.

Qualitative Evaluation
In Fig. 6, we show some tracking results compared with other outstanding tracking algorithms. Next, we analyse the advantages and disadvantages of the designed SiamLST on specific sequences.
Illumination variation. In Fig. 6, the targets in the Car1, Human2 and Skiing sequences suffer illumination variations, and our algorithm achieves the best tracking performance compared with KCF, LDES and SiamFC. In the Car1 sequence, there are similar targets and low-resolution interference in the scene; our algorithm effectively locates the target, while the other trackers cannot adapt to the scale variation well. In the Skiing sequence, the target undergoes scale variation and in-plane and out-of-plane rotations; only our algorithm and the LDES tracker track the target precisely, thanks to the semantic information SiamLST extracts. In the Human2 sequence, the target undergoes motion blur and illumination changes; owing to sparser and richer template features, the proposed algorithm tracks the target even when the input image is corrupted or noisy. However, lacking an online template updating module, the proposed tracker drifts away when the target suffers long-term occlusion.
Occlusion. As shown in Fig. 6, the Basketball, Biker, Football, Human7 and Tiger2 sequences contain partial or short-time occlusion, a typical challenge in video sequences. In the Basketball sequence, our SiamLST accurately tracks the target and adapts to scale variation despite many similar objects and partial occlusion in the scene. In the Biker sequence, under fast motion and in-plane rotation, KCF and LDES fail to track the target, while SiamLST distinguishes the foreground from the surrounding background well. The same holds for the Football, Human7 and Tiger2 sequences: when the target suffers partial or short-time occlusion, SiamLST performs better than the other three SOTA algorithms. The sources of SiamLST's outstanding performance are as follows. First, SiamLST effectively exploits inter-channel cues to represent the target and search patches, which largely reduces the pre-trained model parameters. Meanwhile, the learnable feature extraction backbone obtains richer and sparser features. Consequently, our tracker achieves leading performance when the target suffers occlusions.

Limitations and Future Works

Limitations
In Fig. 7, we show some tracking failure cases. SiamLST drifts under long-term tracking and serious occlusions. In the Bird1 sequence, when the target completely disappears for a period of time, SiamLST cannot recover the target because it has no re-detection scheme. Around the 378th frame, when specks appear in the background, our tracker loses the target again because it has no online update module.

Future Works
Our algorithm is inspired by SiamFC, takes full advantage of the CNN and LST to represent the target, and achieves competitive tracking performance. Next, we will combine LST with deeper feature extraction backbones, such as ResNet152. Moreover, we will consider online template update schemes to adapt to appearance variations and enhance tracking accuracy.

CONCLUSION
In this paper, we design a simple and efficient visual tracker based on a Siamese network, which effectively reduces the number of pre-trained model parameters and exploits the dependency of inter-channel features. In the designed model, the feature maps become sparser and richer by taking advantage of the CNN and the learnable sparse transform module. The model exploits spatial and channel-wise correlations, which helps to handle challenging scenarios, such as scale variation, in-plane and out-of-plane rotations and partial occlusion. Extensive results demonstrate that the SiamLST algorithm obtains leading performance on multiple benchmarks including OTB2015, GOT-10k, VOT2016 and VOT2018.

Figure 4
Figure 4 Expected averaged overlap performance on VOT2016 benchmark.

Figure 5
Figure 5 Overlap success (OS) rates of OPE.

Figure 7
Figure 7 Two failure cases. (a) Bird1. (b) CarDark. Our SiamLST and ground truth results are shown in red and green boxes, respectively.

Table 1
Different feature extraction backbones by combining various convolution layers and learnable sparse transform model.

Table 2
Comparisons between SiamLST and other SOTA trackers on OTB2015.

Table 3
Comparison with SOTA algorithms on GOT-10k.

Table 4
Comparison with competitive trackers on VOT2018 benchmark.The best three results are highlighted in red, blue and green colors.