Semi-supervised learning is well established in image classification but remains underexplored in video-based action recognition. FixMatch is a state-of-the-art semi-supervised method for image classification, but it does not work well when transferred directly to the video domain because it uses only the RGB modality, which contains insufficient motion information. Moreover, it leverages only high-confidence pseudo-labels to enforce consistency between strongly augmented and weakly augmented samples, resulting in limited supervision signals, long training time, and insufficient feature discriminability. To address these issues, we propose neighbor-guided consistent and contrastive learning (NCCL), which takes both RGB and temporal gradient (TG) as input and is based on the teacher-student framework. Because labeled samples are limited, we first incorporate neighbor information as a self-supervised signal to explore the consistency property, which compensates for FixMatch's lack of supervision signals and its long training time. To learn more discriminative feature representations, we further propose a novel neighbor-guided category-level contrastive learning term that minimizes the intra-class distance and enlarges the inter-class distance. We conduct extensive experiments on four datasets to validate the effectiveness of the method. Compared with state-of-the-art methods, the proposed NCCL achieves superior performance at much lower computational cost.
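To make the "limited supervision signals" point concrete, the following is a minimal sketch of the confidence-thresholded consistency loss commonly associated with FixMatch. It is not the NCCL method and not code from the paper. The function names, the 0.95 threshold, and the toy logits are assumptions chosen for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fixmatch_unlabeled_loss(weak_logits, strong_logits, threshold=0.95):
    """Illustrative FixMatch-style loss on an unlabeled batch.

    Returns (mean loss over the batch, fraction of samples kept).
    The threshold value is an assumption, not taken from the abstract.
    """
    total_loss = 0.0
    kept = 0
    for w, s in zip(weak_logits, strong_logits):
        probs_w = softmax(w)
        confidence = max(probs_w)
        if confidence < threshold:
            continue  # low-confidence sample contributes no training signal
        pseudo_label = probs_w.index(confidence)
        # Cross-entropy of the strongly augmented view against the pseudo-label.
        total_loss += -math.log(softmax(s)[pseudo_label])
        kept += 1
    n = len(weak_logits)
    return total_loss / n, kept / n

# Toy logits (hypothetical): the first sample is confident, the second is not.
weak = [[8.0, 0.0, 0.0], [1.0, 0.8, 0.5]]
strong = [[2.0, 0.0, 0.0], [0.2, 0.1, 0.9]]
loss, kept_fraction = fixmatch_unlabeled_loss(weak, strong)
```

In this toy batch only one of the two unlabeled samples passes the threshold, so half of the batch produces no gradient. That is the kind of gap the abstract says neighbor information is meant to fill.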
               