Pedestrian attribute recognition (PAR), which aims to identify attributes of pedestrians captured in video surveillance, is a challenging task due to the poor quality of images and the diverse spatial distribution of attributes. Existing methods usually model PAR as a multi-label classification problem and manually map attributes to an ordered list corresponding to the outputs of classifiers or sequential models. However, these visual-only methods largely neglect the textual information inherent in the attribute annotations. In this paper, we address this issue by proposing a novel visual-textual baseline (VTB) for PAR, which introduces an additional textual modality and explores the semantic correlations among attribute annotations with pre-trained textual encoders rather than human-defined orderings. VTB encodes pedestrian images and attribute annotations into visual and textual features respectively, exchanges information across modalities, and predicts each attribute independently to remove the influence of attribute order. Furthermore, we adopt a transformer encoder as the cross-modal fusion module in VTB to sufficiently explore intra-modal and cross-modal correlations. Our method outperforms most existing visual-only methods on two widely used datasets, RAP and PA-100K, demonstrating the effectiveness of the textual modality for PAR. We expect our method to serve as a multimodal PAR baseline and to inspire new insights into multimodal fusion in future PAR research. Our code is available at https://github.com/cxh0519/VTB.
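Since the abstract describes the architecture only at a high level, the following is a minimal PyTorch sketch of how such a visual-textual fusion module could look. It is a reconstruction under assumptions, not the authors' implementation (see https://github.com/cxh0519/VTB for the reference code): the class name `VTBFusionSketch`, all dimensions, and the assumption that visual patch tokens and per-attribute text embeddings are produced upstream by pretrained backbones are illustrative choices of mine.

```python
# Sketch of a VTB-style visual-textual fusion head (reconstructed from the
# abstract, NOT the authors' code). Assumes visual patch tokens and
# per-attribute textual embeddings come from pretrained backbones upstream.
import torch
import torch.nn as nn


class VTBFusionSketch(nn.Module):
    def __init__(self, num_attrs: int, vis_dim: int = 768, txt_dim: int = 768,
                 d_model: int = 768, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        # A transformer encoder over the concatenated token sequence models
        # intra-modal and cross-modal correlations jointly.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Each attribute token gets its own binary prediction, so the output
        # does not depend on any manually chosen attribute order.
        self.head = nn.Linear(d_model, 1)
        self.num_attrs = num_attrs

    def forward(self, vis_tokens: torch.Tensor,
                txt_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N_patches, vis_dim) from an image backbone (e.g. ViT)
        # txt_tokens: (B, num_attrs, txt_dim) from a pretrained text encoder
        #             applied to the attribute annotations ("long hair", ...)
        v = self.vis_proj(vis_tokens)
        t = self.txt_proj(txt_tokens)
        fused = self.fusion(torch.cat([t, v], dim=1))
        # Read predictions off the attribute (textual) token positions.
        attr_tokens = fused[:, : self.num_attrs, :]
        return self.head(attr_tokens).squeeze(-1)  # (B, num_attrs) logits


if __name__ == "__main__":
    model = VTBFusionSketch(num_attrs=26)
    vis = torch.randn(2, 196, 768)   # placeholder ViT patch tokens
    txt = torch.randn(2, 26, 768)    # placeholder attribute text embeddings
    logits = model(vis, txt)
    print(logits.shape)              # torch.Size([2, 26])
    # For multi-label training, pair the logits with BCEWithLogitsLoss
    # against multi-hot attribute labels.
```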