Predicting 6-D object pose is an essential task in vision measurement for robotic manipulation. RGB-based methods are at a natural disadvantage due to the lack of 3-D information, which leads to inferior results. Exploiting the geometric information in depth images is therefore crucial for accurate predictions. To this end, we propose a multiscale point cloud transformer (MSPCT) to better learn point cloud feature representations. MSPCT consists of three types of modules: the local transformer (LT), the downsampling (DS) module, and the global transformer (GT). Specifically, the LT dynamically divides a local region centered on each point and extracts point-level features with local context awareness. The DS module decreases the resolution and enlarges the receptive field. The GT captures global-range dependencies among the extracted local features. Based on the proposed transformer blocks, we design a network architecture for object pose estimation, in which multiscale features are obtained by fusing the local features from the LT with the global features from the GT to predict the object's pose. Extensive experiments verify the effectiveness of the LT and GT, and our pose estimation pipeline achieves promising results on three benchmark datasets.
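To make the LT → DS → GT pipeline concrete, the following is a minimal PyTorch sketch of an MSPCT-style architecture. It is an illustration based only on the abstract, not the authors' implementation: the k-NN local attention, the strided subsampling in the DS module, the max-pooled multiscale fusion, the 7-D translation-plus-quaternion output head, and all dimensions and hyperparameters are assumptions.

```python
# Hypothetical sketch of an MSPCT-style pose network (illustrative only;
# module internals, pooling, and the pose parameterization are assumptions).
import torch
import torch.nn as nn


class LocalTransformer(nn.Module):
    """LT: attends over a k-NN neighborhood around each point (local context)."""
    def __init__(self, dim, k=16, heads=4):
        super().__init__()
        self.k = k
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, xyz, feats):
        # xyz: (B, N, 3), feats: (B, N, C)
        B, N, C = feats.shape
        # Dynamically gather the k nearest neighbors of every point.
        idx = torch.cdist(xyz, xyz).topk(self.k, largest=False).indices      # (B, N, k)
        nbr = torch.gather(
            feats.unsqueeze(1).expand(B, N, N, C), 2,
            idx.unsqueeze(-1).expand(B, N, self.k, C))                       # (B, N, k, C)
        q = feats.reshape(B * N, 1, C)
        kv = nbr.reshape(B * N, self.k, C)
        out, _ = self.attn(q, kv, kv)                                        # attend within each neighborhood
        return self.norm(feats + out.reshape(B, N, C))


class DownSample(nn.Module):
    """DS: reduces resolution (simple strided subsampling here) to enlarge the receptive field."""
    def __init__(self, stride=4):
        super().__init__()
        self.stride = stride

    def forward(self, xyz, feats):
        return xyz[:, ::self.stride], feats[:, ::self.stride]


class GlobalTransformer(nn.Module):
    """GT: full self-attention over all remaining points (global-range dependencies)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        out, _ = self.attn(feats, feats, feats)
        return self.norm(feats + out)


class MSPCTPose(nn.Module):
    """LT -> DS -> GT, then fuse local and global features to regress a pose."""
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Linear(3, dim)
        self.lt = LocalTransformer(dim)
        self.ds = DownSample()
        self.gt = GlobalTransformer(dim)
        # 3 translation + 4 quaternion components as one possible pose output.
        self.head = nn.Linear(2 * dim, 7)

    def forward(self, xyz):
        feats = self.embed(xyz)
        local = self.lt(xyz, feats)                    # point-level features, local context
        xyz_ds, feats_ds = self.ds(xyz, local)         # lower resolution, larger receptive field
        global_ = self.gt(feats_ds)                    # global-range dependencies
        # Multiscale fusion: pool both scales and concatenate before the pose head.
        fused = torch.cat([local.max(1).values, global_.max(1).values], dim=-1)
        return self.head(fused)


if __name__ == "__main__":
    cloud = torch.randn(2, 512, 3)                     # batch of 2 point clouds
    print(MSPCTPose()(cloud).shape)                    # torch.Size([2, 7])
```

In this sketch the fusion step simply max-pools and concatenates the two scales; the paper's actual fusion and pose decoding are not specified in the abstract.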
               