…procedure multiple times. Moreover, the paper shows that relatively large samples are needed to obtain consistent results from sampled metrics: when one third of the whole catalogue is used as the sample, the sampled metrics are consistent with the exact ones, but at that point the speed-up from sampling is limited.

What is the source of the inconsistency and bias in the sampled metrics? As the authors show, they stem from a simple fact: when only a sample of the irrelevant items is considered, the rank of a relevant item underestimates its exact rank, i.e., the rank it would obtain if all the irrelevant items were considered (see the simulation sketch below). Since the error in the estimate can be quantified, it can also be corrected, and another main result of the paper is that even a simple correction resolves most of the mistakes of the uncorrected sampled metrics. Therefore, while sampling-based approaches should, as the authors suggest, be avoided in evaluations whenever possible, they can still be employed together with a properly designed correction.

One of the most important takeaways from the paper is clear: when sampling is used to estimate a quantity, understanding and analyzing the impact of the sampling procedure is crucial. This is a more general message than it may seem at first sight. In most applications one can rarely assume that the data at hand represent the whole system, population, or process under study; far more commonly, the data is only a sample of it. Understanding the impact of sampling procedures on the results of algorithms, and how to properly account for them in the computation, is of paramount importance for drawing reliable and robust answers from data.
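To make the rank-underestimation argument concrete, the following is a minimal simulation sketch in Python; it is not code from the paper, and the catalogue size, sample size, and synthetic uniform ranks are illustrative assumptions. Sampling m irrelevant items makes the observed rank binomially distributed around a compressed value, and rescaling that rank back to the full catalogue yields an unbiased estimate of the exact rank.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes (assumptions, not values from the paper):
N = 10_000   # catalogue size
m = 100      # number of sampled irrelevant items per evaluation
R = rng.integers(1, N + 1, size=50_000)  # synthetic exact ranks of relevant items

# With m irrelevant items sampled uniformly out of the N - 1 available,
# the number that outrank the relevant item is Binomial(m, (R - 1) / (N - 1)),
# so the sampled rank (that count plus one) underestimates the exact rank R.
j = rng.binomial(m, (R - 1) / (N - 1))
sampled_rank = 1 + j

# Simple correction: rescale the sampled rank to the full catalogue.
# Since E[j] = m * (R - 1) / (N - 1), E[corrected_rank] = R: the
# corrected estimator of the rank is unbiased.
corrected_rank = 1 + j * (N - 1) / m

print(f"mean exact rank:     {R.mean():10.1f}")
print(f"mean sampled rank:   {sampled_rank.mean():10.1f}")   # severe underestimate
print(f"mean corrected rank: {corrected_rank.mean():10.1f}")  # matches the exact mean
```

This rescaling is unbiased for the rank itself; for metrics that are nonlinear in the rank, such as reciprocal rank or NDCG, the paper develops corrections designed for the metric values, of which the rescaling above illustrates only the basic principle.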