We consider predicting the user's head motion in 360° videos using two modalities only: the user's past positions and the video content (without access to other users' traces). We make two main contributions. First, we re-examine existing deep-learning approaches to this problem and identify hidden flaws through a thorough root-cause analysis. Second, building on the results of this analysis, we design a new proposal that establishes state-of-the-art performance. Re-assessing the existing methods that use both modalities, we obtain the surprising result that they all perform worse than baselines using the user's trajectory only. A root-cause analysis shows in particular that (i) the content can inform the prediction only for horizons longer than 2 to 3 s (existing methods consider shorter horizons), and that (ii) to compete with the baselines, a recurrent unit dedicated to processing the positions is necessary, but not sufficient. From a re-examination of the problem supported by the concept of Structural-RNN, we design a new deep neural architecture, named TRACK. TRACK achieves state-of-the-art performance on all considered datasets and prediction horizons, outperforming competitors by up to 20% on focus-type videos for horizons of 2 to 5 seconds. The entire framework is available online and has received an ACM reproducibility badge.
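To illustrate the kind of trajectory-only baseline the abstract refers to, here is a minimal, hypothetical sketch (not the paper's TRACK architecture or its actual baseline code): it predicts future head positions by linearly extrapolating the last observed angular velocity from past (yaw, pitch) samples.

```python
# Hypothetical sketch of a trajectory-only baseline: linear extrapolation
# of the user's head position. Function name and representation are
# illustrative assumptions, not taken from the paper's framework.

def linear_baseline(past_positions, horizon):
    """Predict `horizon` future (yaw, pitch) samples by extrapolating
    the last observed per-step angular velocity.

    `past_positions` is a list of (yaw, pitch) tuples in radians,
    ordered from oldest to most recent (at least two samples).
    """
    (y0, p0), (y1, p1) = past_positions[-2], past_positions[-1]
    vy, vp = y1 - y0, p1 - p0  # per-step angular velocity
    return [(y1 + vy * t, p1 + vp * t) for t in range(1, horizon + 1)]
```

Such baselines ignore the video content entirely, which is what makes the abstract's finding surprising: prior content-aware predictors failed to beat them at short horizons.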