Inspired by Faster R-CNN, current state-of-the-art region-based action detection approaches like R-C3D and TAL-Net creatively proposed Temporal Region Proposal Network (TRPN) to generate proposals, which greatly improved action detection accuracy.… Click to show full abstract
Inspired by Faster R-CNN, current state-of-the-art region-based action detection approaches like R-C3D and TAL-Net creatively proposed Temporal Region Proposal Network (TRPN) to generate proposals, which greatly improved action detection accuracy. However, since smooth L1 loss adopted in TRPN focuses on relative offset to pre-set anchor segments and is not sensitive enough to action boundaries and temporal regions, there is still room for improvement in temporal proposal generation. In this work, we elaborately design a Temporal Locality-Aware Network (TLAN) to learn a binary classifier using frame-level annotations. This allows our framework to effectively distinguish action instance (positive temporal regions) from background (negative temporal regions) by jointly optimizing temporal regions classification and temporal reference boxes regression, thus enabling precise localization. We further introduce a novel pooling method named Contextual Structured Spatial Temporal Pooling (CSSTP) to better exploit context and spatial-temporal information in an end-to-end fashion. Finally, TLAN and CSSTP are incorporated into a unified framework named AFNet. Extensive experiments have been conducted to evaluate the performance of our method. We achieve state-of-the-art performance on THUMOS’14 (20.6% higher than R-C3D, 6.7% higher than TAL-Net mAP @0.5) and competitive performance on Charades and ActivityNet. Besides, our inference speed reaches 1024 FPS, which is 250× faster than TAL-Net (3.5 FPS) and comparable to R-C3D (1030 FPS).
               
Click one of the above tabs to view related content.