Discovering novel visual categories from a set of unlabeled images is a crucial capability for intelligent vision systems, since it enables them to learn new concepts automatically without human-annotated supervision. To tackle this problem, existing approaches first pretrain a neural network on a set of labeled images and then fine-tune the network to cluster unlabeled images into categorical groups. However, their unified feature representation hits a tradeoff bottleneck between preserving the features learned on labeled data and adapting them to unlabeled data. To circumvent this bottleneck, we propose a residual-tuning approach, which estimates a new residual feature from the pretrained network and adds it to the original basic feature to compute the clustering objective. This disentangled representation makes it easier to adjust visual representations for unlabeled images while avoiding forgetting the knowledge acquired from labeled images, without needing to replay the labeled images. In addition, residual-tuning is efficient: it adds few parameters and consumes modest training time. Our results on three common benchmarks show consistent and considerable gains over other state-of-the-art methods, further narrowing the performance gap to the fully supervised learning setup. Moreover, we explore two extended scenarios, using fewer labeled classes and continually discovering more unlabeled sets, where the results further demonstrate the advantages and effectiveness of our residual-tuning approach over previous approaches. Our code is available at https://github.com/liuyudut/ResTune.
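
The core idea, as described above, is to freeze the pretrained "basic" feature and learn only a small residual feature on top of it. The following is a minimal PyTorch sketch of that idea based solely on the abstract; the `ResidualHead` architecture, feature dimension, and module names are illustrative assumptions, not the authors' actual implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class ResidualTuning(nn.Module):
    """Minimal sketch of residual-tuning (assumptions noted in comments).

    The pretrained backbone produces a "basic" feature that is kept frozen
    to preserve knowledge learned from labeled images. A small residual
    head (hypothetical two-layer MLP) estimates a residual feature that
    adapts the representation to unlabeled images. The two are summed to
    form the feature fed to the clustering objective.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.backbone = backbone
        # Freeze the pretrained backbone so the basic feature is preserved
        # and no replay of labeled images is needed.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Lightweight residual head: adds few trainable parameters.
        self.residual_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        basic = self.backbone(x)              # preserved basic feature
        residual = self.residual_head(basic)  # adapted residual feature
        return basic + residual               # summed feature for clustering


# Example usage with a hypothetical backbone that outputs 512-d features.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
model = ResidualTuning(backbone, feat_dim=512)
feats = model(torch.randn(8, 3, 32, 32))  # shape (8, 512), for clustering
```

Because only the residual head receives gradients, the trainable parameter count and training time stay small, which matches the efficiency claim in the abstract.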