Existing video text detection methods mostly track texts using appearance features alone and are therefore easily affected by changes in perspective and illumination. In this paper, we propose an end-to-end video text detector that tracks texts based on a robust feature representation fusing multiple descriptors. First, we introduce a character center segmentation branch to extract semantic features that encode the category and position information of characters. To capture the topology of each text instance, we propose a relative position awareness branch that encodes the relative position information among texts. An adaptive feature fusion network is then proposed to dynamically fuse these descriptors into a robust representation for more reliable tracking. In addition, to promote research and evaluation in this field, we construct a large Bilingual Road scene Video Text dataset, named BiRViT-1K, which contains 1000 videos of Chinese and English texts. Experimental results show that the proposed semantic and topology features benefit both text detection and tracking, and that the proposed method achieves state-of-the-art performance on four public video text benchmarks (ICDAR 2015 Video, YVT, RT-1K, and BOVText) and two Chinese scene text benchmarks (CASIA10K and MSRA-TD500).
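The abstract does not spell out how the adaptive feature fusion network combines the appearance, semantic, and topology descriptors. As a minimal illustrative sketch only (the function name, the softmax-weighted combination rule, and the toy vectors below are all assumptions, not the paper's actual architecture), one common way to "dynamically fuse" several per-instance descriptors is a learned, softmax-normalized weighted sum:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_fuse(descriptors, scores):
    """Fuse K per-text descriptors with softmax-normalized weights.

    descriptors: list of K arrays of shape (D,), e.g. appearance,
                 semantic, and topology features for one text instance.
    scores:      length-K relevance scores (learned in a real model;
                 fixed here for illustration).
    Returns the fused (D,) representation.
    """
    w = softmax(np.asarray(scores, dtype=float))   # (K,) weights, sum to 1
    stacked = np.stack(descriptors)                # (K, D)
    return (w[:, None] * stacked).sum(axis=0)      # weighted sum over K

# Toy example: three 4-dim descriptors for one text instance
appearance = np.array([1.0, 0.0, 0.0, 0.0])
semantic   = np.array([0.0, 1.0, 0.0, 0.0])
topology   = np.array([0.0, 0.0, 1.0, 0.0])
fused = adaptive_fuse([appearance, semantic, topology], scores=[2.0, 1.0, 0.0])
```

Because the weights are normalized per instance, a descriptor that becomes unreliable (e.g. appearance under an illumination change) can be down-weighted at inference time, which is the intuition behind fusing complementary descriptors for robust tracking.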