We are pleased to present this special issue of IJCV on combined image and language understanding. It contains some of the latest work in a long line of research into problems at the intersection of computer vision and natural language processing. Research on language and vision has been stimulated by recent advances in object recognition. While multi-layer (or "deep") models have been applied for more than twenty years (Lawrence et al. 1997; LeCun et al. 1989; Nowlan and Platt 1995), recently they have been shown to be extremely effective at large-vocabulary object recognition (Krizhevsky et al. 2012) and at text generation (Mikolov et al. 2010). The next logical step was to combine these two tasks to enable image captioning: generating a short language description based on an image (Kulkarni et al. 2013; Mitchell et al. 2012). In 2015, deep models produced state-of-the-art results in image captioning (Donahue et al. 2015; Fang et al. 2015; Karpathy and Fei-Fei 2015; Vinyals et al. 2015). These results were facilitated by the MSCOCO data set, which provided multiple crowd-sourced labels for thousands of images (Lin et al. 2014). The success of deep image captioning initially seemed promising. Had we finally solved combined image and language understanding? A closer inspection, however, revealed that such understanding was far from solved: Approaches that