Deep neural networks are capable of learning powerful representations, but they are often limited by heavy network architectures and high computational cost. Knowledge distillation (KD) is an effective way to perform model compression and inference acceleration, yet the resulting student models still contain redundant parameters. To tackle these issues, we propose a novel approach, called Variational Bayesian Group-level Sparsification for Knowledge Distillation (VBGS-KD), to distill a large teacher network into a small, sparse student network while preserving accuracy. We impose a sparsity-inducing prior on groups of parameters in the student model and introduce a variational Bayesian approximation to learn structured sparsity, which effectively prunes most of the weights. The pruning threshold is learned during training without extra fine-tuning. The proposed method learns robust student networks that achieve satisfying accuracy and compact sizes compared with state-of-the-art methods. We validate our approach on the MNIST and CIFAR-10 datasets, observing 90.3% sparsity with a 0.19% accuracy improvement on MNIST. Extensive experiments on CIFAR-10 further demonstrate the efficiency of the proposed approach.
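The abstract does not include implementation details, so the following is only a minimal PyTorch sketch of the general idea of group-level variational sparsification: each group of weights (here, one output neuron's weight row) shares a multiplicative noise scale whose posterior is learned variationally, and groups whose learned noise dominates the signal are pruned at inference. It uses the log-uniform-prior KL approximation from variational dropout (Molchanov et al., 2017) as a stand-in for the paper's prior; all names (`VBGroupSparseLinear`, `LOG_ALPHA_THRESH`) are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: group-level variational sparsification in the style of
# variational dropout, NOT the authors' VBGS-KD implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

LOG_ALPHA_THRESH = 3.0  # groups with log(alpha) above this are treated as pruned


class VBGroupSparseLinear(nn.Module):
    """Linear layer whose output neurons (weight-row groups) share one
    multiplicative Gaussian noise scale, learned variationally."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # One log-variance parameter per group (per output neuron).
        self.log_sigma2 = nn.Parameter(torch.full((out_features,), -10.0))

    def log_alpha(self):
        # alpha = sigma^2 / theta^2; the group scale theta is fixed to 1 here.
        return self.log_sigma2.clamp(-10.0, 10.0)

    def forward(self, x):
        mean = F.linear(x, self.weight, self.bias)
        if self.training:
            # Local reparameterisation: sample group-wise multiplicative noise.
            std = torch.exp(0.5 * self.log_alpha()) * mean.abs()
            return mean + std * torch.randn_like(mean)
        # At test time, drop groups whose noise dominates the signal.
        mask = (self.log_alpha() < LOG_ALPHA_THRESH).float()
        return mean * mask

    def kl(self):
        # Approximate KL to the log-uniform prior (Molchanov et al., 2017).
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha()
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1
        return -neg_kl.sum()
```

In a distillation setting one would presumably add the summed `kl()` terms of such student layers to the usual KD objective (cross-entropy on hard labels plus a softened-logit matching loss against the teacher), so that group sparsity and knowledge transfer are optimised jointly; the exact weighting and pruning criterion used by VBGS-KD are not specified in the abstract.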