How do humans localize unintentional action like “ A boy falls down while playing skateboard ”? Cognitive science shows that an 18-month-old baby understands the intention by observing the actions… Click to show full abstract
How do humans localize unintentional action like “ A boy falls down while playing skateboard ”? Cognitive science shows that an 18-month-old baby understands the intention by observing the actions and comparing the feedback. Motivated by this evidence, we propose a causal inference approach that constructs a video pool containing intentional knowledge, conducts the counterfactual intervention to observe intentional action, and compares the unintentional action with intentional action to achieve localization. Specifically, we first build a video pool, where each video contains the same action content as an original unintentional action video. Then we conduct the counterfactual intervention to generate counterfactual examples. We further maximize the difference between the predictions of factual unintentional action and counterfactual intentional action to train the model. By disentangling the effects of different clues on the model prediction, we encourage the model to highlight the intention clue and alleviate the negative effect brought by the training bias of the action content clue. We evaluate our approach on a public unintentional action dataset and achieve consistent improvements on both unintentional action recognition and localization tasks.
               
Click one of the above tabs to view related content.