Existing video text detection methods mostly track texts using appearance features alone and are therefore easily affected by changes in perspective and illumination. In this paper, we propose an end-to-end video text detector that tracks texts based on a robust feature representation fusing multiple descriptors. First, we introduce a character center segmentation branch to extract semantic features that encode the category and position information of characters. To capture the topology of each text instance, we propose a relative position awareness branch that encodes the relative position information among texts. An adaptive feature fusion network is then proposed to dynamically fuse these descriptors into a robust representation for more reliable tracking. In addition, to promote research and evaluation in this field, we construct a large Bilingual Road scene Video Text dataset, named BiRViT-1K, which contains 1000 videos of Chinese and English texts. Experimental results show that the proposed semantic and topology features benefit both text detection and tracking, and that the proposed method achieves state-of-the-art performance on four public video text benchmarks (ICDAR 2015 Video, YVT, RT-1K, and BOVText) and two Chinese scene text benchmarks (CASIA10K and MSRA-TD500).
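The abstract does not spell out how the adaptive feature fusion network combines the appearance, semantic, and topology descriptors. As a minimal illustrative sketch only (the function name, the softmax-weighted combination rule, and the toy vectors below are all assumptions, not the paper's actual architecture), one common way to "dynamically fuse" several per-instance descriptors is a learned, softmax-normalized weighted sum:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_fuse(descriptors, scores):
    """Fuse K per-text descriptors with softmax-normalized weights.

    descriptors: list of K arrays of shape (D,), e.g. appearance,
                 semantic, and topology features for one text instance.
    scores:      length-K relevance scores (learned in a real model;
                 fixed here for illustration).
    Returns the fused (D,) representation.
    """
    w = softmax(np.asarray(scores, dtype=float))   # (K,) weights, sum to 1
    stacked = np.stack(descriptors)                # (K, D)
    return (w[:, None] * stacked).sum(axis=0)      # weighted sum over K

# Toy example: three 4-dim descriptors for one text instance
appearance = np.array([1.0, 0.0, 0.0, 0.0])
semantic   = np.array([0.0, 1.0, 0.0, 0.0])
topology   = np.array([0.0, 0.0, 1.0, 0.0])
fused = adaptive_fuse([appearance, semantic, topology], scores=[2.0, 1.0, 0.0])
```

Because the weights are normalized per instance, a descriptor that becomes unreliable (e.g. appearance under an illumination change) can be down-weighted at inference time, which is the intuition behind fusing complementary descriptors for robust tracking.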