AFNet: Temporal Locality-Aware Network With Dual Structure for Accurate and Fast Action Detection

Inspired by Faster R-CNN, current state-of-the-art region-based action detection approaches such as R-C3D and TAL-Net introduced the Temporal Region Proposal Network (TRPN) to generate proposals, which greatly improved action detection accuracy. However, because the smooth L1 loss adopted in TRPN focuses on the relative offset to pre-set anchor segments and is not sensitive enough to action boundaries and temporal regions, there is still room for improvement in temporal proposal generation. In this work, we design a Temporal Locality-Aware Network (TLAN) that learns a binary classifier from frame-level annotations. This allows our framework to effectively distinguish action instances (positive temporal regions) from background (negative temporal regions) by jointly optimizing temporal region classification and temporal reference box regression, thus enabling precise localization. We further introduce a novel pooling method, Contextual Structured Spatial Temporal Pooling (CSSTP), to better exploit context and spatial-temporal information in an end-to-end fashion. Finally, TLAN and CSSTP are incorporated into a unified framework named AFNet. Extensive experiments evaluate the performance of our method. We achieve state-of-the-art performance on THUMOS’14 (mAP@0.5 20.6% higher than R-C3D and 6.7% higher than TAL-Net) and competitive performance on Charades and ActivityNet. In addition, our inference speed reaches 1024 FPS, which is 250× faster than TAL-Net (3.5 FPS) and comparable to R-C3D (1030 FPS).
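
To make the abstract's contrast concrete, the following is a minimal sketch of what jointly optimizing a frame-level binary classifier and an anchor-relative smooth L1 regression could look like. It is not AFNet's actual loss: the offset parameterization (a 1-D adaptation of the Faster R-CNN style), the function names, and the reg_weight parameter are all assumptions made for illustration.

```python
import math


def smooth_l1(x, beta=1.0):
    """Smooth L1 penalty for one residual: quadratic near zero, linear beyond beta."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def encode_offsets(anchor, gt):
    """Relative offsets of a ground-truth segment with respect to an anchor segment.

    Segments are (start, end) pairs in frames. The (center shift, log length
    ratio) parameterization is an assumed 1-D analogue of Faster R-CNN.
    """
    a_center = 0.5 * (anchor[0] + anchor[1])
    a_len = anchor[1] - anchor[0]
    g_center = 0.5 * (gt[0] + gt[1])
    g_len = gt[1] - gt[0]
    return ((g_center - a_center) / a_len, math.log(g_len / a_len))


def regression_loss(pred_offsets, anchor, gt):
    """Smooth L1 between predicted and target anchor-relative offsets."""
    target = encode_offsets(anchor, gt)
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target))


def frame_bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over per-frame action probabilities.

    labels[i] is 1 for an action frame and 0 for a background frame.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def joint_loss(probs, labels, pred_offsets, anchor, gt, reg_weight=1.0):
    """Hypothetical joint objective: frame classification plus segment regression."""
    return frame_bce(probs, labels) + reg_weight * regression_loss(pred_offsets, anchor, gt)
```

In this sketch the regression term only measures how far a prediction is from the anchor-relative target, whereas the frame-level term penalizes every misclassified frame inside or outside the action, which is one way to read the abstract's point about sensitivity to boundaries. Calling joint_loss with per-frame probabilities, 0/1 frame labels, a predicted offset pair, an anchor segment, and a ground-truth segment returns a single scalar.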

Keywords: temporal regions; network; action detection; temporal locality; action

Journal Title: IEEE Transactions on Multimedia
Year Published: 2021
