Learning to achieve a user-specified objective from a random starting position in unseen environments is challenging for image-guided navigation agents, which still lack long-horizon reasoning and semantic understanding. Inspired by the human memory mechanism, we introduce a neural multi-store memory network into the reinforcement learning framework for target-driven visual navigation. The proposed memory network uses three temporal stages of memory to build time dependencies for better scene understanding. Sensory memory encodes observations and embeds transient information into working memory, which is short-term and realized by a gated recurrent neural network (RNN). The long-term memory then stores the latent state from each RNN step in its own slot. Finally, a self-attention reading mechanism retrieves goal-related information from long-term memory. In addition, to improve the agent's scene generalization capability, we facilitate training of the visual representation with a self-supervised auxiliary task and image augmentation. Our method navigates agents in unknown visually realistic environments using only egocentric observations, without any position sensors or pretrained models. Evaluation on the Matterport3D dataset through the Habitat simulator demonstrates that our method outperforms state-of-the-art approaches.
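The three-stage memory pipeline described above (sensory encoding, a gated-recurrent working memory, per-step long-term slots, and a self-attention read) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all names, shapes, the simplified gate structure, and the toy rollout are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # latent dimension (assumed for illustration)

def sensory_encode(obs, W_enc):
    """Sensory memory: embed a raw observation into a transient code."""
    return np.tanh(W_enc @ obs)

def gru_step(h, x, Wz, Uz, Wh, Uh):
    """Working memory: a simplified gated recurrent update (GRU-like)."""
    z = 1.0 / (1.0 + np.exp(-(Wz @ x + Uz @ h)))  # update gate
    h_cand = np.tanh(Wh @ x + Uh @ (z * h))       # candidate state
    return (1.0 - z) * h + z * h_cand

def attention_read(slots, query):
    """Long-term memory read: scaled dot-product attention over slots."""
    M = np.stack(slots)                  # (T, D), one slot per RNN step
    scores = M @ query / np.sqrt(D)      # similarity to the goal query
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax attention weights
    return w @ M, w                      # goal-related readout

# Toy rollout: encode observations, update working memory, store slots.
W_enc = rng.normal(scale=0.1, size=(D, 32))
Wz, Uz, Wh, Uh = (rng.normal(scale=0.1, size=(D, D)) for _ in range(4))
h = np.zeros(D)
long_term = []
for _ in range(5):                       # five navigation steps
    obs = rng.normal(size=32)            # stand-in egocentric observation
    x = sensory_encode(obs, W_enc)
    h = gru_step(h, x, Wz, Uz, Wh, Uh)
    long_term.append(h.copy())           # each latent state gets a slot

goal = rng.normal(size=D)                # stand-in goal embedding
readout, weights = attention_read(long_term, goal)
```

In a trained agent, the readout would be concatenated with the current working-memory state and fed to the policy; here it simply illustrates how the attention weights select goal-relevant slots from the trajectory history.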