We enjoyed reading Professor Efron’s (Brad) paper just as much as we enjoyed listening to his June 2019 lecture in Leiden. One of the core values underlying statistical research is how it enables scientific understanding and discovery, and we appreciate that Efron has placed “science” at the center of his article. By science we mean, at the very least, the discovery of associations that are not just valid here and now, in the particular dataset at hand, but are likely to be stable and to replicate under slightly different circumstances.

The successes of purely predictive algorithms (gradient boosting, deep learning) in object recognition, speech recognition, machine translation, and elsewhere are real. Incidentally, when one examines these successes as reported in the media, one observes that they all have in common a nearly infinite sample size (consider that WhatsApp records billions of conversations every year) and a nearly infinite signal-to-noise ratio (SNR) or, put differently, very little uncertainty. If you give me a photograph, there is little debate as to whether or not there is a cow in it. These instances are far from the settings statisticians are accustomed to: moderate sample sizes, a higher degree of uncertainty, and variables or factors that have not been measured (prosaically, things we cannot see). To use Efron’s example, knowing how well you comply with a treatment does not allow me to be certain about how much it will benefit you.

Interestingly, Efron puts the purely predictive algorithms to the test in a scenario where the sample size is very modest by today’s standards (n = 102). In the prostate cancer microarray study, he does the following: split the sample at random so as to create training and test sets with the same numbers of cases and controls, and train a random forest. Perhaps to his surprise, the random forest performs extremely well, correctly classifying all but one of the 51 test patients! Surely, we must have “learned” something valuable. Or have we? His second finding is this: if the patients are instead split according to what is believed to be their time of entry into the study, performance degrades spectacularly; he now records 12 wrong predictions, corresponding to an error rate of about one in four. So what have we really learned? This is all the more discomforting since prediction rules are intended to be deployed in the second scenario, and not in the highly idealized setting where training and test samples are exchangeable or, said differently, where current data are representative of future data. As far as
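To make the contrast between the two splitting schemes concrete, here is a minimal sketch in Python. It is not Efron’s actual prostate microarray analysis: the data are synthetic, the drift mechanism (the class signal moving from one block of “genes” to another over enrollment time) is invented purely for illustration, and the names (time_of_entry, test_error) are ours. It uses scikit-learn’s RandomForestClassifier and compares a stratified random split with an early-versus-late split; the exact error rates will not match Efron’s 1/51 and 12/51, but the qualitative gap is the point.

```python
# Minimal sketch (synthetic data, invented drift; not Efron's prostate analysis).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n, p = 102, 500                       # 102 patients, 500 synthetic "genes"
y = rng.integers(0, 2, size=n)        # case/control labels
time_of_entry = np.arange(n)          # hypothetical enrollment order
early = time_of_entry < n // 2

# Invented drift: the class signal sits in genes 0-19 for early enrollees and
# in genes 20-39 for late enrollees, so the feature-label relationship changes
# over the course of the study and early/late patients are not exchangeable.
X = rng.normal(size=(n, p))
X[early, :20] += 2.0 * y[early][:, None]
X[~early, 20:40] += 2.0 * y[~early][:, None]

def test_error(X_tr, y_tr, X_te, y_te):
    forest = RandomForestClassifier(n_estimators=500, random_state=0)
    forest.fit(X_tr, y_tr)
    return np.mean(forest.predict(X_te) != y_te)

# Scenario 1: random split, stratified so training and test sets have the same
# numbers of cases and controls -- the two halves are exchangeable by design.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=0)
print("random split error:    ", test_error(X_tr, y_tr, X_te, y_te))

# Scenario 2: train on the early half, predict the late half.
h = n // 2
print("early/late split error:", test_error(X[:h], y[:h], X[h:], y[h:]))
```

In this toy setup the gap appears because the random split lets the forest see both regimes during training, whereas the chronological split asks it to extrapolate to patients whose feature-label relationship it has never observed, which is precisely the deployment scenario discussed above.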
               