3-D scene perception technology enhances the safety and decision-making ability of autonomous vehicles by accurately acquiring and analyzing stereo information from the environment. Compared to traditional object detection methods, semantic occupancy perception offers greater flexibility in describing 3-D scenes with arbitrary shapes and various categories. However, existing methods for semantic occupancy perception face challenges such as poor generalization of depth estimation and inaccurate alignment and fusion of multimodal features. This article proposes a novel multimodal semantic occupancy prediction method, accurate fusion occupancy (AFOcc), which addresses these challenges with a fusion technique based on feature alignment and an attention mechanism. The method extracts multiscale features from image and LiDAR data and encodes LiDAR voxel features using sparse convolution. A feature alignment module then projects the point-cloud features onto the 2-D image features for precise alignment. Finally, a learnable fusion module adaptively adjusts the weights of the different modal features to enhance the fusion effect. Extensive experiments on the nuScenes-Occupancy dataset demonstrate that AFOcc significantly outperforms state-of-the-art methods in terms of the mean intersection over union (mIoU) metric. Notably, in the bicycle and motorcycle categories, an IoU improvement of more than 40% is achieved. These results illustrate the superior perception capability and robustness of AFOcc in complex scenes.
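The two ideas named in the abstract, projecting LiDAR points onto the image plane and adaptively weighting the two modalities, can be illustrated with a minimal sketch. This is not the AFOcc implementation: the pinhole projection, the scalar softmax gate, and all names (`project_point`, `fuse`, `gate_w_img`, `gate_w_lidar`) are assumptions chosen for illustration, and the abstract does not specify how the actual alignment and fusion modules are built.

```python
import math

def project_point(point_xyz, fx, fy, cx, cy):
    """Project a 3-D point (camera coordinates) to pixel coordinates.

    Assumes a simple pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy). Returns None for points behind the camera.
    """
    x, y, z = point_xyz
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(image_feat, lidar_feat, gate_w_img, gate_w_lidar):
    """Fuse two aligned feature vectors with adaptive modality weights.

    Hypothetical gate: each modality gets a scalar score (dot product of
    its feature with a learnable weight vector); a softmax over the two
    scores yields the mixing weights for that location.
    """
    score_img = sum(w * f for w, f in zip(gate_w_img, image_feat))
    score_lidar = sum(w * f for w, f in zip(gate_w_lidar, lidar_feat))
    a_img, a_lidar = softmax([score_img, score_lidar])
    return [a_img * fi + a_lidar * fl for fi, fl in zip(image_feat, lidar_feat)]

# A LiDAR point already transformed into the camera frame.
pixel = project_point((1.0, 2.0, 4.0), fx=100.0, fy=100.0, cx=50.0, cy=50.0)

# With zero gate weights both modalities score equally, so the result
# is the plain average of the two feature vectors.
fused = fuse([2.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

In a trained network the gate weights would be learned parameters and the gate would typically operate on dense feature maps rather than single vectors; the sketch only shows how per-location weights let the model lean on whichever modality is more informative.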
               