Neither a monocular RGB camera nor a small microphone array is capable, on its own, of accurate three-dimensional (3D) speaker localization. By taking advantage of accurate visual object detection and complementary audio-visual sensor fusion, we formulate the 3D speaker localization problem as a visual scaling factor estimation problem. As a result, we reduce traditional audio-only 3D speaker localization from an exhaustive grid search to a one-dimensional (1D) optimization problem. We propose a multi-modal perception system with two optimization approaches. We show that the proposed methods are effective, accurate, and robust against interference and, as corroborated by indicative empirical results on a real dataset, competitive with conventional uni-modal and state-of-the-art audio-visual speaker localization approaches.
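To illustrate how a 3D search can collapse to a 1D one, the following is a minimal sketch, not the method proposed here. The abstract does not specify the cost function or either of the two optimization approaches, so everything below is an assumption made for illustration: the visual detection is taken to give a unit bearing ray from the camera, the candidate speaker position is `scale * ray`, the audio cue is a set of time differences of arrival (TDOAs), the reference microphone sits at the camera origin, and the scale is found by golden-section search. All function names and the microphone geometry are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def tdoas(source, mics):
    """TDOAs (seconds) of each microphone relative to the first one."""
    dists = [math.dist(source, m) for m in mics]
    return [(d - dists[0]) / SPEED_OF_SOUND for d in dists[1:]]


def cost(scale, ray, mics, measured):
    """Squared TDOA mismatch for the candidate point scale * ray (assumed cost)."""
    candidate = [scale * c for c in ray]
    predicted = tdoas(candidate, mics)
    return sum((p - m) ** 2 for p, m in zip(predicted, measured))


def golden_section(f, lo, hi, tol=1e-6):
    """Minimize a unimodal 1D function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            # Minimum lies in [a, d]; reuse c as the new upper probe.
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            # Minimum lies in [c, b]; reuse d as the new lower probe.
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2


def estimate_scale(ray, mics, measured, lo=0.2, hi=10.0):
    """Search only along the visual bearing ray instead of over a 3D grid."""
    return golden_section(lambda s: cost(s, ray, mics, measured), lo, hi)


if __name__ == "__main__":
    # Hypothetical geometry: reference microphone at the camera origin.
    mics = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (-0.1, 0.0, 0.0)]
    norm = math.sqrt(0.3**2 + 0.1**2 + 1.0**2)
    ray = (0.3 / norm, 0.1 / norm, 1.0 / norm)
    true_point = [2.5 * c for c in ray]
    measured = tdoas(true_point, mics)  # noise-free synthetic measurements
    print(estimate_scale(ray, mics, measured))
```

With the reference microphone at the ray origin, each predicted TDOA changes monotonically along the ray, so this assumed cost has a single minimum and golden-section search is applicable. With noisy measurements or a different geometry, that property would need to be checked again.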
               