Dear editor, Saliency detection has recently attracted much attention owing to its applicability in several fields of computer vision and machine learning. Convolutional neural networks (CNNs) have been especially successful at generating saliency maps end-to-end for salient object detection. These methods can be grouped into two categories: (1) improving the structure of the networks, and (2) training the networks' parameters better than before. However, as the number of images grows, the side information attached to them becomes increasingly abundant. In fact, side information has been widely used in other applications of neural networks: it has been exploited to improve the performance of graph matching algorithms and object tracking [1, 2]. Zhao et al. [3] deeply exploited the semantic information of the syntactic path based on an RNN. Li et al. [4] took advantage of rich semantic information to improve the exploration of indoor environments. The performance of saliency detection depends on many factors, not only the visual features but also the semantic information of the images. To further improve the accuracy of salient object detection in images, we therefore incorporate side information, such as image-level tags, into our convolutional neural network. Recent work by Zhou et al. [5] has shown that CNNs have remarkable localization ability. Wang et al. [6] proposed a saliency detection method with image-level weak supervision. Nevertheless, the image-level category labels predicted by a network may not be completely accurate. Therefore, we use image-level tags labeled on the objects of the TBS dataset [7]. We make the following contributions. (1) We use image-level tags labeled on the objects in images to improve saliency detection results. (2) We extend global average pooling (GAP) to predict the salient object in complex images and use it as a layer in the deep convolutional network. 
(3) We conduct extensive experiments, and the results show that the proposed method achieves state-of-the-art performance in mean absolute error (MAE), area under the ROC curve (AUC), and F-measure on the TBS dataset. Model and methodology. Our saliency detection model is composed of three main parts: (1) a CNN extracting low-, medium-, and high-level features from a given image; (2) a class activation mapping section; and (3) a fully connected conditional random field (CRF) to further refine the saliency map. The structure of the network is shown in Figure 1. Classification-trained CNN. We obtain the parameters of the network by training it on a classification task, where the classification targets are the object tags of the TBS dataset. During classification training, the CNN learns to extract features of images from the different categories in the dataset. The architecture includes 5 blocks, from conv1 to conv5. We train on the popular VGG16, AlexNet [8], and GoogLeNet architectures, which are well known for their elegance and simplicity while yielding nearly state-of-the-art results in image classification with good generalization properties. Following the method of adjusting the network structure in [1], we delete the following part: the layers behind conv5-3 in
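As a reminder of the building block used in contribution (2), a plain global average pooling layer simply collapses each channel of a convolutional feature map to its spatial mean. The following is a minimal NumPy sketch of that operation, not the authors' implementation; the array shapes are illustrative assumptions.

```python
import numpy as np

def global_average_pooling(features):
    """Collapse an (H, W, C) convolutional feature map into a
    C-dimensional vector by averaging each channel over its
    spatial extent (the GAP operation)."""
    return features.mean(axis=(0, 1))

# Toy 4x4 feature map with 3 channels (hypothetical values).
fmap = np.arange(48, dtype=float).reshape(4, 4, 3)
pooled = global_average_pooling(fmap)
print(pooled.shape)  # one averaged activation per channel: (3,)
```

Because GAP keeps one scalar per channel, the weights of the classifier that follows it directly score how much each feature channel contributes to a class, which is what makes the class activation mapping step possible.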
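The class activation mapping stage described in the model can be sketched as a weighted sum of the final convolutional layer's channels, using the GAP classifier's weights for one class, in the spirit of Zhou et al. [5]. This is a minimal NumPy illustration under assumed shapes, not the paper's actual network code; in the full model the map would also be upsampled to image resolution and refined by the CRF.

```python
import numpy as np

def class_activation_map(conv_features, class_weights):
    """Combine an (H, W, C) feature map into an (H, W) activation
    map using the GAP-classifier weights of a single class, then
    normalise it to [0, 1] as a coarse saliency map."""
    cam = np.tensordot(conv_features, class_weights, axes=([2], [0]))
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam

# Toy example: 7x7 feature map with 8 channels, random class weights.
rng = np.random.default_rng(0)
features = rng.random((7, 7, 8))
weights = rng.random(8)
cam = class_activation_map(features, weights)
```

Regions where highly weighted channels fire strongly receive values near 1, which is why the map localizes the object that drove the classification decision.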