"AFNet: Temporal Locality-Aware Network With Dual Structure for Accurate and Fast Action Detection"

Inspired by Faster R-CNN, current state-of-the-art region-based action detection approaches such as R-C3D and TAL-Net use a Temporal Region Proposal Network (TRPN) to generate proposals, which greatly improved action detection accuracy. However, the smooth L1 loss adopted in TRPN focuses on the relative offset to pre-set anchor segments and is not sensitive enough to action boundaries and temporal regions, so there is still room for improvement in temporal proposal generation. In this work, we design a Temporal Locality-Aware Network (TLAN) that learns a binary classifier from frame-level annotations. This allows our framework to effectively distinguish action instances (positive temporal regions) from background (negative temporal regions) by jointly optimizing temporal region classification and temporal reference box regression, thus enabling precise localization. We further introduce a novel pooling method named Contextual Structured Spatial Temporal Pooling (CSSTP) to better exploit context and spatial-temporal information in an end-to-end fashion. Finally, TLAN and CSSTP are incorporated into a unified framework named AFNet. Extensive experiments have been conducted to evaluate the performance of our method. We achieve state-of-the-art performance on THUMOS’14 (mAP@0.5 that is 20.6% higher than R-C3D and 6.7% higher than TAL-Net) and competitive performance on Charades and ActivityNet. In addition, our inference speed reaches 1024 FPS, which is 250× faster than TAL-Net (3.5 FPS) and comparable to R-C3D (1030 FPS).
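
To make the abstract's remark about smooth L1 loss concrete, the following is a minimal sketch of anchor-relative regression for temporal segments. It is not taken from the paper: the center/log-length offset encoding is an assumption borrowed from the usual Faster R-CNN parameterization and adapted to one dimension, and the names `encode_offsets`, `smooth_l1`, `regression_loss`, and the `beta` threshold are hypothetical.

```python
import math


def encode_offsets(anchor, gt):
    """Encode a ground-truth segment relative to an anchor segment.

    Both arguments are (start, end) pairs. The encoding (center shift
    normalized by anchor length, log length ratio) is an assumed,
    Faster R-CNN-style parameterization, not the paper's stated one.
    """
    a_center = 0.5 * (anchor[0] + anchor[1])
    a_length = anchor[1] - anchor[0]
    g_center = 0.5 * (gt[0] + gt[1])
    g_length = gt[1] - gt[0]
    return ((g_center - a_center) / a_length, math.log(g_length / a_length))


def smooth_l1(x, beta=1.0):
    """Smooth L1: quadratic for |x| < beta, linear otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def regression_loss(pred, target):
    """Sum of smooth L1 terms over the (center, length) offsets."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


# Anchor covering frames 10..30, ground-truth action covering 20..40.
target = encode_offsets((10, 30), (20, 40))
# A prediction of "no shift, no rescale" is penalized only through offsets.
loss = regression_loss((0.0, 0.0), target)
```

In this sketch the loss depends only on the difference between predicted and encoded offsets. No frame-level label enters the computation, which is one way to read the abstract's point that such a loss is not directly sensitive to action boundaries.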

Keywords: temporal regions; network; action detection; temporal locality; action

Journal Title: IEEE Transactions on Multimedia
Year Published: 2021
