The main goal of our investigation is to raise awareness among chemometricians of how easily data or parameter leakage can be introduced by inappropriate methods, and to demonstrate that opinions found in the literature on the preference among leave-one-out, leave-many-out, and repeated cross-validation methods must be interpreted with high precision. We show how the Kennard–Stone method and the inappropriate use of repeated measurements cause data leakage in train/test splitting. We demonstrate how cross-validation parameters become overoptimistic if they are used in the hyperparameter selection of models or in variable selection. We call this effect parameter leakage. We extend the leave-one-out/leave-many-out scaling law to repeated cross-validation. Using model calculations, we discuss and justify that the infinite-sample-size inconsistency of leave-one-out cross-validation with respect to leave-many-out cross-validation can be theoretically important, but need not be relevant at practical data sizes in chemometrics.
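The leakage from repeated measurements can be illustrated with a minimal sketch. It is not the authors' procedure; it only assumes that each spectrum carries an identifier of the physical sample it was measured on, and that all replicates of a sample should fall on the same side of the split. The function name `group_train_test_split`, the `group_ids` list, and the toy data (6 samples, 3 replicates each) are hypothetical names chosen for illustration.

```python
import random


def group_train_test_split(group_ids, test_fraction=0.25, seed=0):
    """Split row indices so that all replicates of one sample stay together.

    group_ids: one sample identifier per measurement (hypothetical input).
    Returns (train_idx, test_idx) as lists of row indices.
    """
    unique_groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)

    # Choose whole samples for the test set, never individual replicates.
    n_test = max(1, round(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])

    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx


# Assumed toy data: 6 physical samples, each measured 3 times.
group_ids = [sample for sample in range(6) for _ in range(3)]
train_idx, test_idx = group_train_test_split(group_ids, test_fraction=1 / 3, seed=1)

train_groups = {group_ids[i] for i in train_idx}
test_groups = {group_ids[i] for i in test_idx}

# No physical sample contributes replicates to both sides of the split.
print(train_groups.isdisjoint(test_groups))
```

Splitting at the level of individual measurements instead would typically place replicates of the same sample in both sets, so the test error would partly measure how well the model recognizes samples it has already seen.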
               