Video prediction, including uni-directional prediction of future frames and bi-directional prediction of in-between frames, is a challenging task and a problem worth exploring in the multimedia and computer vision fields. Existing practices usually make predictions by learning global motion information from the whole input frame. However, humans tend to focus on the key objects that carry vital motion information rather than on the entire frame. Moreover, different objects often exhibit different movements and deformations, even within the same scene. Motivated by this, we build a novel object-centric video prediction model in which the motion signals of key objects are learned explicitly. The model predicts new frames by repeatedly transforming key objects and compositing them back into the input frames. To localize these objects automatically, we design an attention module with substitutable strategies. Our method requires no annotated data, and we further apply adversarial training to improve the sharpness of the predictions. We evaluate our model on the Moving MNIST, UCF101, and Penn Action datasets and achieve results that are competitive with existing methods, both quantitatively and qualitatively. The experiments demonstrate that our uni- and bi-directional network predicts the motions of different objects well and generates plausible future and in-between frames.
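The abstract's core idea, predicting a frame by attending to key objects, moving each one by its own motion, and compositing the results back onto the input, can be illustrated with a minimal sketch. The data layout (frames as nested lists, objects as dicts with a `box`, `motion`, and saliency `score`) and the softmax-based attention ordering are illustrative assumptions, not the paper's actual architecture, which learns these quantities with neural networks:

```python
import math

def softmax(scores):
    # Numerically stable softmax over per-object saliency scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next_frame(frame, objects):
    """One object-centric prediction step (illustrative sketch).

    `frame` is a 2-D grid of pixel intensities. Each object is a dict
    with a bounding box (y0, x0, y1, x1), a per-object motion vector
    (dy, dx), and a saliency score; these would be learned in practice.
    Higher-attention objects are composited later, i.e. on top.
    """
    h, w = len(frame), len(frame[0])
    next_frame = [row[:] for row in frame]            # start from the input frame
    weights = softmax([o["score"] for o in objects])  # attention over objects
    order = sorted(range(len(objects)), key=lambda i: weights[i])
    for i in order:
        o = objects[i]
        y0, x0, y1, x1 = o["box"]
        dy, dx = o["motion"]
        # Cut the object patch out of the input frame ...
        patch = [[frame[y][x] for x in range(x0, x1)] for y in range(y0, y1)]
        # ... erase it at its old location ...
        for y in range(y0, y1):
            for x in range(x0, x1):
                next_frame[y][x] = 0
        # ... and paste it back, shifted by the object's own motion.
        for py, row in enumerate(patch):
            for px, v in enumerate(row):
                y, x = y0 + dy + py, x0 + dx + px
                if 0 <= y < h and 0 <= x < w:
                    next_frame[y][x] = v
    return next_frame
```

For example, a single bright pixel at (1, 1) with motion (1, 1) ends up at (2, 2) in the predicted frame; applying the step repeatedly to its own output corresponds to the "repeatedly transforming objects" rollout described above.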
               