Temporal variations, such as sudden motion, acceleration, and occlusions, occur frequently in real-world videos and force video-modeling networks to account for them. However, they are often not beneficial for recognizing actions at coarse granularity and thus may impede spatio-temporal learning. Prior solutions to this problem usually introduce multiple network branches to process input frames at different sampling rates or employ special components to explore inter-frame relations, both of which are computationally expensive. In this paper, we propose a simple and flexible Dynamic Equilibrium Module (DEM) for video modeling through adaptive Eulerian motion manipulation. The proposed module can be directly inserted into 3D and (2+1)D backbone networks to reduce the impact of temporal variations on video modeling and to learn more robust spatio-temporal representations. We demonstrate performance gains from using DEM in R3D and R(2+1)D models on the Kinetics-400, UCF-101, and HMDB-51 datasets.