Acquiring dense and precise depth information in real time is in high demand for robotic perception and autonomous driving. Motivated by the complementary nature of stereo images and LiDAR point clouds, we propose an efficient stereo-LiDAR fusion network (SLFNet) to predict a dense depth map of a scene. Specifically, the LiDAR point cloud is first projected onto each image plane of the stereo pair to generate sparse RGB-D maps. Then, multi-modal feature fusion is performed between the RGB image and the sparse RGB-D map of the same viewpoint, and the resulting features are used to generate a coarse disparity map for stereo fusion. Next, the complementary geometric information in the stereo images and sparse RGB-D maps is incorporated to perform occlusion-aware refinement. Finally, an edge-aware refinement module encourages depth discontinuities to be consistent with edges in the image. Experimental results demonstrate that our network effectively fuses stereo images and point clouds to produce accurate depth estimates at 6 FPS, which is $8\times$ faster than existing methods. Comparative results show that our network achieves state-of-the-art performance on the KITTI and Virtual KITTI2 datasets.
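
As a concrete illustration of the first step described above, the sketch below (not the authors' code; the names project_lidar_to_depth, K, and T_cam_lidar, and the nearest-point-wins policy, are assumptions) shows the standard way a LiDAR point cloud is projected onto a calibrated camera's image plane to form the sparse depth channel of an RGB-D map:

import numpy as np

def project_lidar_to_depth(points, K, T_cam_lidar, height, width):
    """Project Nx3 LiDAR points into an HxW sparse depth map (0 = no return)."""
    # Move points from the LiDAR frame into the camera frame (4x4 rigid transform).
    pts_h = np.hstack([points, np.ones((len(points), 1))])      # Nx4 homogeneous
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                  # Nx3 in camera frame

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0]

    # Perspective projection with the 3x3 intrinsic matrix K.
    uv = (K @ pts_cam.T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    z = pts_cam[:, 2]

    # Discard projections that land outside the image bounds.
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[valid], v[valid], z[valid]

    # Scatter depths into the image; when several points hit one pixel,
    # write far points first so the nearest depth survives.
    depth = np.zeros((height, width), dtype=np.float32)
    order = np.argsort(-z)
    depth[v[order], u[order]] = z[order]
    return depth

The sparse RGB-D map is then the RGB image stacked with this depth channel. In a calibrated stereo rig, depth $Z$ and disparity $d$ are related by $d = fB/Z$ (with focal length $f$ and baseline $B$), which is what lets the projected LiDAR depths supervise or refine a disparity map as the abstract outlines.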
               