Industrial intelligent devices are usually equipped with both microphones and cameras to perceive and understand the physical world. Although visual object detection technology has achieved great success, its combination with other sensing modalities remains largely unexplored. In this article, we establish a novel sound-induced attention framework for visual object detection and develop a two-stream weakly supervised deep learning architecture that combines the visual and audio modalities to localize the sounding object. A dataset constructed from the Audio Set is used to validate the proposed method, and realistic experiments are conducted to demonstrate the effectiveness of the proposed system.
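The core idea of sound-induced attention can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, feature shapes, and similarity-plus-softmax formulation are all assumptions. It shows how an audio embedding can attend over a visual feature map so that the attention peak localizes the sounding object.

```python
import numpy as np

def sound_induced_attention(visual_feats, audio_emb):
    """Hypothetical sketch (shapes and scoring are assumptions).
    visual_feats: (H, W, D) spatial feature map from a visual stream.
    audio_emb:    (D,) clip-level embedding from an audio stream.
    Returns an (H, W) attention map: softmax over spatial positions
    of the dot-product similarity between each location and the audio.
    """
    scores = visual_feats @ audio_emb      # (H, W) audio-visual similarity
    scores = scores - scores.max()         # numerical stability for softmax
    attn = np.exp(scores)
    return attn / attn.sum()

# Toy example: plant an "audio-consistent" feature at one location and
# check that the attention map peaks there.
H, W, D = 4, 4, 8
rng = np.random.default_rng(0)
visual = rng.normal(size=(H, W, D)) * 0.1  # weak background features
audio = rng.normal(size=D)
visual[2, 3] += audio                      # location of the sounding object
attn = sound_induced_attention(visual, audio)
peak = np.unravel_index(attn.argmax(), attn.shape)
print(peak)                                # attention peaks at the planted cell
```

In a weakly supervised setting of this kind, only a clip-level label (e.g., "a dog is barking somewhere in this video") would supervise training, while the spatial localization emerges from the attention map itself.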