Data Subset Selection With Imperfect Multiple Labels

We study the problem of selecting a subset of weakly labeled data in which the labels of each data instance are redundant and imperfect. In real applications, less-than-expert labels are obtained at low cost so that many labels can be acquired for each instance and then used to estimate the ground truth. However, on the one hand, preparing and processing the data can sometimes be even more expensive than labeling it. On the other hand, noisy labels degrade the performance of supervised learning methods. We therefore introduce a new quality control mechanism on the labels of each instance and use it to select an optimal subset of data. The mechanism estimates the labeling quality of each instance, which indicates which instances already have enough reliable labels and how many labels still need to be collected for a given instance. In this paper, we first consider the data subset selection problem under the probably approximately correct (PAC) model. We then show how to find an $\epsilon$-optimal labeled instance based on expected labeling quality. Furthermore, we propose new algorithms to select the best $k$ instances with high expected labeling quality. Using a reliable subset of data provides substantial benefit over using all data with imperfect multiple labels, and the expected labeling quality is a good indicator of where to allocate labeling effort: it shows how many labels should be acquired for an instance and which instances qualify for selection compared with others. Both the theoretical guarantees and comprehensive experiments demonstrate the effectiveness and efficiency of our algorithms.
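
The abstract does not spell out the quality measure or the selection algorithms, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes (hypothetically) that the labeling quality of an instance can be approximated by the fraction of its labels that agree with the majority label, ranks instances by that score, and keeps the top $k$. The helper `labels_needed` uses a generic Hoeffding bound as a rough guide to how many labels give an estimate within $\epsilon$ with probability at least $1 - \delta$. It treats each label's agreement as an independent bounded sample, which is a simplification. All function names here are illustrative.

```python
import math
from collections import Counter


def labeling_quality(labels):
    """Proxy quality score (an assumption, not the paper's definition):
    fraction of labels that agree with the majority label."""
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)


def select_top_k(label_sets, k):
    """Return the names of the k instances with the highest proxy quality.

    label_sets maps an instance name to its list of noisy labels.
    Ties keep their original insertion order because sorted() is stable.
    """
    ranked = sorted(
        label_sets.items(),
        key=lambda item: labeling_quality(item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


def labels_needed(epsilon, delta):
    """Generic Hoeffding sample size for a mean of values in [0, 1]:
    the smallest n with sqrt(ln(2/delta) / (2n)) <= epsilon."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Hypothetical toy data: three instances, four binary labels each.
    data = {
        "a": [1, 1, 1, 1],
        "b": [1, 0, 1, 0],
        "c": [1, 1, 1, 0],
    }
    print(select_top_k(data, 2))
    print(labels_needed(0.1, 0.05))
```

In this toy setting, an instance whose labels mostly agree is preferred over one whose labels are split, which mirrors the intuition that a reliable subset can be more useful than the full noisy data set.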

Keywords: quality; subset selection; imperfect multiple; instance; labeling quality; data subset

Journal Title: IEEE Transactions on Neural Networks and Learning Systems
Year Published: 2019
