Text-based Image Captioning has emerged as a novel problem since 2020. The topic remains challenging because it requires a model to comprehend not only the visual context but also the scene text that appears in an image. Consequently, the way images and scene texts are embedded into the main model for training is crucial. Building on the M4C-Captioner model, this paper proposes the simple yet effective EAES embedding module for embedding images and scene texts into the multimodal Transformer layers. In detail, our EAES module contains two significant sub-modules: Objects-augmented and Grid feature augmentation. With the Objects-augmented module, we provide relative geometry features that represent the relations between objects and between OCR tokens. Furthermore, with the Grid feature augmentation module we extract grid features for an image and combine them with the visual object features, which helps the model attend to both salient objects and the general context of the image, leading to better performance. We use the TextCaps dataset as the benchmark to demonstrate the effectiveness of our approach on five standard metrics: BLEU4, METEOR, ROUGE-L, SPICE and CIDEr. Without bells and whistles, our method achieves 20.21% on the BLEU4 metric and 85.78% on the CIDEr metric, 1.31% and 4.78% higher, respectively, than the baseline M4C-Captioner. Moreover, the results are highly competitive with other methods on the METEOR, ROUGE-L and SPICE metrics.
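Since the abstract only sketches the EAES module, the following is a minimal illustrative sketch rather than the authors' released code. It assumes the relative geometry feature follows the common log-space pairwise bounding-box encoding used in relation-aware attention, and that grid and object features are linearly projected to the Transformer's hidden size before being concatenated into one visual token sequence; the class name GridObjectEmbedding and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn


def relative_geometry_features(boxes, eps=1e-3):
    """Pairwise relative geometry features between bounding boxes.

    boxes: (N, 4) tensor of (x_min, y_min, x_max, y_max), normalised to [0, 1].
    Returns an (N, N, 4) tensor encoding relative position and scale in log
    space, a standard formulation; the paper's exact encoding may differ.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2              # centre x
    cy = (boxes[:, 1] + boxes[:, 3]) / 2              # centre y
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=eps)    # width
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=eps)    # height

    # Broadcast over all (i, j) pairs of objects / OCR tokens.
    dx = torch.log((cx[:, None] - cx[None, :]).abs().clamp(min=eps) / w[:, None])
    dy = torch.log((cy[:, None] - cy[None, :]).abs().clamp(min=eps) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)      # (N, N, 4)


class GridObjectEmbedding(nn.Module):
    """Hypothetical sketch: project grid features and object region features
    into a shared space and concatenate them as one visual token sequence
    for the multimodal Transformer layers."""

    def __init__(self, grid_dim=2048, obj_dim=2048, hidden_dim=768):
        super().__init__()
        self.grid_proj = nn.Linear(grid_dim, hidden_dim)
        self.obj_proj = nn.Linear(obj_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, grid_feats, obj_feats):
        # grid_feats: (B, G, grid_dim) pooled CNN grid cells (e.g. 7x7 -> G = 49)
        # obj_feats:  (B, M, obj_dim) detected object region features
        grid_tokens = self.norm(self.grid_proj(grid_feats))
        obj_tokens = self.norm(self.obj_proj(obj_feats))
        # Object tokens carry salient regions; grid tokens carry global context.
        return torch.cat([obj_tokens, grid_tokens], dim=1)   # (B, M + G, hidden_dim)
```

In this sketch, the pairwise geometry tensor would typically be embedded and added as a bias to the attention scores among object and OCR tokens, while the concatenated grid-plus-object sequence is what lets the Transformer attend to both salient regions and the overall scene, as described in the abstract.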