In the semi-supervised video object segmentation (VOS) task, temporally coherent object-level cues play a key role yet are hard to model accurately. To this end, this paper presents an object-aware global-local correspondence architecture that extracts temporally coherent object-level features across frames for accurate VOS. Specifically, we first generate a set of object masks from the ground-truth segmentation and squeeze the current-frame representation inside each mask into a global object embedding. Second, we compute the similarity between each embedding and the feature map, producing an object-aware weight for each pixel. The object-aware feature at each pixel is then constructed by summing the object embeddings weighted by their object-aware weights, which captures rich object-category information. Third, to establish accurate correspondences between temporally coherent inter-frame cues, we design a novel global-local correspondence module that refines the temporal feature representations. Finally, we augment the object-aware features with the globally and locally aligned information to produce a strong spatio-temporal representation, which is essential for reliable pixel-wise segmentation prediction. Extensive evaluations on three popular VOS benchmarks, YouTube-VOS, DAVIS 2017, and DAVIS 2016, demonstrate that the proposed method achieves favourable performance compared to state-of-the-art approaches.
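
To make the pipeline concrete, the following is a minimal PyTorch sketch of the object-aware feature construction (the first two steps above). It assumes masked average pooling for the embedding squeeze, dot-product similarity, and a softmax over objects for the per-pixel weights; the function name object_aware_features and all tensor shapes are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn.functional as F

    def object_aware_features(feat, masks):
        # feat:  (C, H, W) current-frame feature map
        # masks: (K, H, W) one binary mask per object, derived from the
        #        first-frame ground-truth segmentation
        C, H, W = feat.shape
        K = masks.shape[0]
        flat = feat.view(C, H * W)                    # (C, HW)
        m = masks.view(K, H * W).float()              # (K, HW)

        # Squeeze the features inside each mask into a global object
        # embedding (masked average pooling -- an assumed choice).
        emb = (m @ flat.t()) / m.sum(1, keepdim=True).clamp(min=1.0)  # (K, C)

        # Similarity between every embedding and every pixel, normalized
        # over objects to obtain per-pixel object-aware weights.
        weights = F.softmax(emb @ flat, dim=0)        # (K, HW)

        # Object-aware feature at each pixel: the sum of the object
        # embeddings weighted by their object-aware weights.
        return (emb.t() @ weights).view(C, H, W)

The abstract names the global-local correspondence module without detailing its mechanism, so the sketch below assumes a common design in matching-based VOS: full cross-frame attention for the global branch, and the same affinity restricted to a small spatial window for the local branch. The function global_local_correspondence and the radius parameter are hypothetical.

    def global_local_correspondence(cur, ref, ref_val, radius=6):
        # cur, ref: (C, H, W) current / reference frame features
        # ref_val:  (C, H, W) reference-frame values to propagate
        C, H, W = cur.shape
        q = cur.view(C, -1).t()                       # (HW, C) queries
        k = ref.view(C, -1)                           # (C, HW) keys
        v = ref_val.view(C, -1).t()                   # (HW, C) values

        affinity = (q @ k) / C ** 0.5                 # (HW, HW)

        # Global branch: attend over all reference positions.
        global_out = F.softmax(affinity, dim=1) @ v   # (HW, C)

        # Local branch: the same affinity, restricted to a window of
        # Chebyshev radius `radius` around each query position.
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pos = torch.stack([ys.flatten(), xs.flatten()], 1).float()   # (HW, 2)
        dist = (pos[:, None] - pos[None, :]).abs().amax(-1)          # (HW, HW)
        local_aff = affinity.masked_fill(dist > radius, float("-inf"))
        local_out = F.softmax(local_aff, dim=1) @ v   # (HW, C)

        return global_out.t().view(C, H, W), local_out.t().view(C, H, W)

In the full architecture, the two aligned outputs would then be fused with the object-aware features to form the spatio-temporal representation used for the final pixel-wise prediction, per the fourth step of the abstract.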