A Simple Visual-Textual Baseline for Pedestrian Attribute Recognition

Pedestrian attribute recognition (PAR), which aims to identify attributes of the pedestrians captured in video surveillance, is a challenging task due to the poor quality of images and the diverse spatial distribution of attributes. Existing methods usually model PAR as a multi-label classification problem and manually map attributes to an ordered list corresponding to the outputs of classifiers or sequential models. However, the inherent textual information in attribute annotations is largely neglected by these visual-only methods. In this paper, we address this issue by proposing a novel visual-textual baseline (VTB) for PAR, which introduces an additional textual modality to explore the semantic correlations among attribute annotations using pre-trained textual encoders instead of human definitions. VTB encodes pedestrian images and attribute annotations into visual and textual features respectively, exchanges information across modalities, and predicts recognition results independently to remove the influence of attribute order. Furthermore, we introduce a transformer encoder as the cross-modal fusion module in VTB to explore intra-modal and cross-modal correlations thoroughly. Our method achieves superior performance over most existing visual-only methods on two widely used datasets, RAP and PA-100K, demonstrating the effectiveness of adding a textual modality to PAR. Our method is expected to serve as a multimodal PAR baseline and inspire new insights for multimodal fusion in future PAR research. Our code is available at https://github.com/cxh0519/VTB.
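
The pipeline described in the abstract (encode both modalities, fuse them with self-attention, predict each attribute independently) can be illustrated with a toy sketch. This is not the authors' implementation, which lives in the repository linked above. The sketch assumes random vectors in place of real image and text encoders, a single attention layer with a residual connection in place of a full transformer encoder, and one hypothetical linear head per attribute. Every name, dimension, and attribute label below is an illustrative assumption.

```python
import math
import random

random.seed(0)
DIM = 8  # toy embedding size (assumption, not from the paper)

def rand_matrix(rows, cols):
    """Random matrix standing in for learned weights or encoder outputs."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """(n x k) times (k x m) -> (n x m), using plain lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention with a residual connection."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(DIM)
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    attended = matmul(weights, v)
    return [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(visual_tokens, text_tokens, params):
    """Return one probability per attribute."""
    # Concatenating the two sequences lets attention cover both
    # intra-modal and cross-modal token pairs.
    fused = self_attention(visual_tokens + text_tokens, *params["attn"])
    text_out = fused[len(visual_tokens):]  # one fused token per attribute
    # Each attribute is scored by its own head (an assumption of this sketch).
    # No positional encoding is used, so list order does not affect the scores.
    return [sigmoid(sum(w * x for w, x in zip(head, tok)))
            for head, tok in zip(params["heads"], text_out)]

# Hypothetical attribute names; real datasets define their own lists.
attributes = ["female", "backpack", "long sleeve"]
visual_tokens = rand_matrix(4, DIM)               # stand-in for image patch features
text_tokens = rand_matrix(len(attributes), DIM)   # stand-in for text-encoder embeddings
params = {
    "attn": [rand_matrix(DIM, DIM) for _ in range(3)],  # W_q, W_k, W_v
    "heads": rand_matrix(len(attributes), DIM),
}
probs = predict(visual_tokens, text_tokens, params)
for name, p in zip(attributes, probs):
    print(f"{name}: {p:.3f}")
```

In this sketch, reordering the attribute list (together with the matching heads) leaves each attribute's probability unchanged up to floating-point rounding, which is one simple way to read the abstract's claim about removing the influence of attribute order.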

Keywords: attribute; baseline; attribute recognition; pedestrian attribute; visual textual

Journal Title: IEEE Transactions on Circuits and Systems for Video Technology
Year Published: 2022
