"Being Aware of Data Leakage and Cross‐Validation Scaling in Chemometric Model Validation"

The main goal of our investigation is to raise awareness among chemometricians of how easily data or parameter leakage can be introduced by inappropriate methods, and to demonstrate that opinions in the literature on the preference among leave‐one‐out, leave‐many‐out, and repeated cross‐validation must be interpreted with high precision. We show how the Kennard–Stone method and inappropriate use of repeated measurements cause data leakage in train/test splitting. We demonstrate how cross‐validation parameters become overoptimistic if they are used in hyperparameter selection of models or in variable selection; we call this effect parameter leakage. We extend the leave‐one‐out/leave‐many‐out scaling law to repeated cross‐validation. We discuss, and justify with some model calculations, that the infinite‐sample‐size inconsistency of leave‐one‐out cross‐validation relative to leave‐many‐out cross‐validation can be theoretically important but need not be relevant at the data sizes encountered in practice in chemometrics.
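To make the repeated-measurement case concrete, the following is a minimal sketch of a train/test split that keeps all replicate measurements of one physical sample on the same side. It is not the authors' procedure. The function name `grouped_train_test_split`, the `sample_ids` labels, and the 25% test fraction are assumptions chosen for illustration only.

```python
import random
from collections import defaultdict


def grouped_train_test_split(sample_ids, test_fraction=0.25, seed=0):
    """Split row indices so that all replicates of a sample share one side.

    sample_ids[i] is the (hypothetical) identifier of the physical sample
    that row i was measured on; replicate rows repeat the same identifier.
    """
    # Collect the row indices that belong to each physical sample.
    rows_by_sample = defaultdict(list)
    for row, sample in enumerate(sample_ids):
        rows_by_sample[sample].append(row)

    # Shuffle the samples (not the rows), then cut the list once.
    samples = sorted(rows_by_sample)
    random.Random(seed).shuffle(samples)
    n_test = max(1, round(test_fraction * len(samples)))

    test_rows = [r for s in samples[:n_test] for r in rows_by_sample[s]]
    train_rows = [r for s in samples[n_test:] for r in rows_by_sample[s]]
    return sorted(train_rows), sorted(test_rows)


# Eight samples, each measured three times (24 rows in total).
sample_ids = [f"S{i}" for i in range(8) for _ in range(3)]
train_rows, test_rows = grouped_train_test_split(sample_ids)

# No physical sample may contribute rows to both sets.
train_samples = {sample_ids[r] for r in train_rows}
test_samples = {sample_ids[r] for r in test_rows}
shared = train_samples & test_samples
```

Splitting at the row level instead would typically place replicates of the same sample in both sets, so the test set would no longer be independent of the training data.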

Keywords: validation; data leakage; model; cross validation

Journal Title: Journal of Chemometrics
Year Published: 2025
