Hand-object interaction plays an essential role when humans manipulate objects. While existing methods focus on improving hand-object recognition with fully automatic pipelines, human intention has been largely neglected in the recognition process, leading to interaction descriptions that fail to reflect the manipulator's purpose. To interpret hand-object interactions in a way that aligns with human intention, we argue that a reference specifying that intention should be taken into account. We therefore propose a new approach that represents interactions while reflecting human purpose through three key factors: hand, object, and reference. Specifically, we design a <hand-object, object-reference, hand, object, reference> (HOR) pattern to recognize intention-based atomic hand-object interactions; this pattern models interactions through the states of the hand, object, and reference, and the relationships among them. Furthermore, we design a simple yet effective Spatially Part-based (3+1)D convolutional neural network, SP(3+1)D, which uses 3D convolutions to model visual dynamics and 1D convolutions to model object position changes based on our HOR. With the SP(3+1)D network, the recognition results accurately indicate human purposes. To evaluate the proposed method, we annotate the Something-1.3k dataset, which contains 10 atomic hand-object interactions with about 130 videos per interaction. Experimental results on Something-1.3k demonstrate the effectiveness of our SP(3+1)D network.
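The abstract does not give implementation details, but the two-branch idea (3D convolutions over video frames for visual dynamics, 1D convolutions over per-frame object positions for motion relative to a reference) can be illustrated with a minimal sketch. The following is an assumption-laden PyTorch illustration, not the authors' code: the class name SP3Plus1DBlock, the tensor shapes, the layer widths, and the use of bounding-box coordinates as the position signal are all illustrative choices.

```python
# Minimal sketch (not the paper's implementation) of a (3+1)D two-branch block:
# a 3D-conv branch over RGB clips and a 1D-conv branch over per-frame positions.
import torch
import torch.nn as nn

class SP3Plus1DBlock(nn.Module):
    def __init__(self, num_classes: int = 10, pos_dim: int = 4):
        super().__init__()
        # 3D convolutions over (C, T, H, W) clips: appearance and visual dynamics.
        self.video_branch = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),           # -> (B, 64, 1, 1, 1)
        )
        # 1D convolutions over per-frame position features (e.g., box coordinates
        # of hand / object / reference): position-change dynamics.
        self.position_branch = nn.Sequential(
            nn.Conv1d(pos_dim, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),           # -> (B, 32, 1)
        )
        self.classifier = nn.Linear(64 + 32, num_classes)

    def forward(self, clip: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # clip: (B, 3, T, H, W); positions: (B, pos_dim, T)
        v = self.video_branch(clip).flatten(1)          # (B, 64)
        p = self.position_branch(positions).flatten(1)  # (B, 32)
        return self.classifier(torch.cat([v, p], dim=1))

if __name__ == "__main__":
    # Dummy forward pass with 10 classes, matching the number of atomic
    # interactions reported for Something-1.3k.
    model = SP3Plus1DBlock(num_classes=10)
    logits = model(torch.randn(2, 3, 16, 112, 112), torch.randn(2, 4, 16))
    print(logits.shape)  # torch.Size([2, 10])
```

The sketch only shows the division of labor between the 3D and 1D branches; how the paper splits parts spatially, encodes the reference, and fuses the HOR components is not specified in the abstract.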
               