Deep convolutional networks (CNNs) reign undisputed as the de facto method for computer vision owing to their success in visual recognition tasks on still images. However, their adaptations to crowd counting have not clearly established their superiority over shallow models. Existing CNNs turn out to be self-limiting in challenging crowd-counting scenarios such as changing camera illumination, partial occlusions, diverse crowd distributions, and perspective distortions because of their shallow structure. In this paper, we introduce a dynamic augmentation technique to train a much deeper CNN for crowd counting. To reduce overfitting caused by the limited number of training samples, multitask learning is further employed to learn generalizable representations across similar domains. We also propose to aggregate multiscale convolutional features extracted from the entire image into a compact single-vector representation amenable to efficient and accurate counting by way of the “Vector of Locally Aggregated Descriptors” (VLAD). A “deeply supervised” strategy is employed to provide an additional supervision signal to the bottom layers for further performance improvement. Experimental results on three benchmark crowd datasets show that our method achieves better performance than existing methods. Our implementation will be released at https://github.com/shizenglin/Multitask-Multiscale-Deep-NetVLAD.
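The VLAD-style aggregation mentioned above can be illustrated with a minimal NumPy sketch: local descriptors are softly assigned to cluster centers, their residuals are accumulated per cluster, and the result is normalized into one compact vector. This is a generic illustration of the idea, not the authors' released implementation; the function name, the distance-based soft assignment, and the `alpha` sharpness parameter are assumptions for the sketch.

```python
import numpy as np

def netvlad_aggregate(descriptors, centers, alpha=10.0):
    """Soft-assignment VLAD aggregation over a set of local descriptors.

    descriptors: (N, D) array of local convolutional features.
    centers:     (K, D) array of cluster centers.
    Returns a single L2-normalized vector of length K * D.
    (Illustrative sketch; `alpha` and the distance-based softmax are assumptions.)
    """
    # Squared distance from each descriptor to each center: (N, K)
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    # Soft assignment via a distance-based softmax; alpha controls sharpness
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)            # rows sum to 1: (N, K)
    # Accumulate assignment-weighted residuals per cluster: (K, D)
    residuals = descriptors[:, None, :] - centers[None, :, :]   # (N, K, D)
    vlad = (a[:, :, None] * residuals).sum(axis=0)              # (K, D)
    # Intra-normalize each cluster row, then L2-normalize the flat vector
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

In the paper's setting the descriptors would come from multiscale convolutional feature maps, so images of any size reduce to a fixed-length vector suitable for the counting head.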