Existing tracking-by-detection approaches using deep features have achieved promising results in recent years. However, these methods mainly exploit feature representations learned from individual static frames, paying little attention to the temporal smoothness between frames. This easily causes trackers to drift in the presence of large appearance variations and occlusions. To address this issue, we propose a two-stream network to learn discriminative spatio-temporal feature representations of the target objects. The proposed network consists of a Spatial ConvNet module and a Temporal ConvNet module. Specifically, the Spatial ConvNet adopts 2D convolutions to encode the target-specific appearance in static frames, while the Temporal ConvNet models temporal appearance variations using 3D convolutions and learns consistent temporal patterns within a short video clip. We further propose a proposal refinement module to adjust the predicted bounding box, which makes the target localization outputs more consistent across video sequences. In addition, to improve model adaptation during online updates, we propose a contrastive online hard example mining (OHEM) strategy, which selects hard negative samples and encourages them to be embedded in a more discriminative feature space. Extensive experiments conducted on the OTB, Temple Color, and VOT benchmarks demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
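As a rough illustration of the two-stream design described above, the sketch below pairs a 2D-convolutional spatial branch with a 3D-convolutional temporal branch in PyTorch. The layer widths, kernel sizes, input resolution, and concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class TwoStreamFeatureNet(nn.Module):
    """Minimal sketch of a two-stream spatio-temporal feature extractor.

    The spatial branch applies 2D convolutions to the current frame crop;
    the temporal branch applies 3D convolutions to a short clip. Channel
    widths, kernel sizes, and concatenation-based fusion are assumptions
    for illustration only.
    """

    def __init__(self):
        super().__init__()
        # Spatial ConvNet: encodes target-specific appearance in a static frame.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Temporal ConvNet: models appearance variation across the clip with 3D convolutions.
        self.temporal = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )

    def forward(self, frame: torch.Tensor, clip: torch.Tensor) -> torch.Tensor:
        # frame: (B, 3, H, W) current frame crop; clip: (B, 3, T, H, W) short clip.
        spatial_feat = self.spatial(frame).flatten(1)    # (B, 128)
        temporal_feat = self.temporal(clip).flatten(1)   # (B, 128)
        # Fuse the two streams into a single spatio-temporal descriptor.
        return torch.cat([spatial_feat, temporal_feat], dim=1)  # (B, 256)


if __name__ == "__main__":
    net = TwoStreamFeatureNet()
    frame = torch.randn(2, 3, 107, 107)       # batch of single-frame crops
    clip = torch.randn(2, 3, 8, 107, 107)     # batch of 8-frame clips
    print(net(frame, clip).shape)             # torch.Size([2, 256])
```

The fused descriptor would then feed the classification, proposal refinement, and contrastive OHEM stages described in the abstract; those components are not shown here.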
               