This paper addresses the problem of generating meaningful summaries from unedited user videos. We propose a framework based on spatiotemporal and high-level features that detects key-shots after segmenting the videos into shots based on motion magnitude. To encode the time-varying characteristics of a video, we explore the local phase quantization from three orthogonal planes (LPQ-TOP) feature descriptor. A sparse autoencoder (SAE), a deep learning model, extracts high-level features from the LPQ-TOP descriptors to efficiently represent the shots carrying the key content of the video. The Chebyshev distance between the feature vectors of consecutive shots is calculated and thresholded, using the mean of the distance scores as the threshold. The subset of shots whose distance scores exceed the threshold is used to generate a high-quality video summary. The method is evaluated on the SumMe dataset, and the summaries it generates are of better quality than those produced by other state-of-the-art techniques. The effectiveness of the method is further evaluated by comparing the summaries with the human-created ground-truth summaries.
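The key-shot selection step described above can be illustrated with a minimal sketch, assuming the LPQ-TOP and SAE stages have already produced one feature vector per shot; the function name select_key_shots and the use of NumPy are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def select_key_shots(shot_features):
    """Select key-shots from per-shot feature vectors.

    shot_features: array of shape (num_shots, feature_dim), e.g. SAE outputs.
    Returns the indices of shots whose Chebyshev distance to the previous
    shot exceeds the mean distance score, as described in the abstract.
    """
    # Chebyshev (L-infinity) distance between each pair of consecutive shots
    diffs = np.abs(np.diff(shot_features, axis=0))   # shape: (num_shots - 1, feature_dim)
    distances = diffs.max(axis=1)                    # shape: (num_shots - 1,)

    # Threshold: mean of the distance scores
    threshold = distances.mean()

    # Keep shots whose distance score is greater than the threshold
    return [i + 1 for i, d in enumerate(distances) if d > threshold]

# Usage with random vectors standing in for SAE features of 20 shots
features = np.random.rand(20, 128)
print(select_key_shots(features))
```

In this sketch the first shot is never selected because distances are defined only between consecutive pairs; the paper does not state how boundary shots are handled, so that detail is an assumption.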