Dataset dependence affects many real‐life applications of machine learning: the performance of a model trained on a dataset is significantly worse on samples from another dataset than on new, unseen samples from the original one. This issue is particularly acute for small and somewhat specific databases in medical applications; the automated recognition of melanoma from skin lesion images is a prime example. We document dataset dependence in dermoscopic skin lesion image classification using three publicly available medium‐sized datasets. Standard machine learning techniques aimed at improving the predictive power of a model might enhance performance slightly, but the gain is small, the dataset dependence is not reduced, and the best combination depends on model details. We demonstrate that simple differences in image statistics account for only 5% of the dataset dependence. We suggest a solution with two essential ingredients: using an ensemble of heterogeneous models, and training on a heterogeneous dataset. Our ensemble consists of 29 convolutional networks, some of which are trained on features considered important by dermatologists; the networks' outputs are fused by a trained committee machine. The combined International Skin Imaging Collaboration dataset is suitable for training, as it is multi‐source, produced by a collaboration of clinics around the world. Building on the strengths of the ensemble, we also apply it to a related problem: recognizing melanoma based on clinical (non‐dermoscopic) images. This is a harder problem, because the image quality is lower than that of dermoscopic images and the available public datasets are smaller and scarcer. We explored various training strategies and showed that 79% balanced accuracy can be achieved for binary classification averaged over three clinical datasets.
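The reported metric, balanced accuracy for binary classification averaged over several datasets, can be illustrated with a minimal sketch. It assumes the usual definition of balanced accuracy (the mean of sensitivity and specificity) and an unweighted mean across datasets; the abstract does not specify either, and the function names and label encoding (1 = melanoma, 0 = non‐melanoma) are hypothetical.

```python
from typing import List, Sequence, Tuple


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for binary labels.

    Assumed encoding (not from the source): 1 = melanoma, 0 = non-melanoma.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    # True positives and true negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    sensitivity = tp / pos
    specificity = tn / neg
    return 0.5 * (sensitivity + specificity)


def mean_balanced_accuracy(
    datasets: List[Tuple[Sequence[int], Sequence[int]]]
) -> float:
    """Unweighted mean of per-dataset balanced accuracy (an assumption)."""
    scores = [balanced_accuracy(y_true, y_pred) for y_true, y_pred in datasets]
    return sum(scores) / len(scores)
```

Because each class contributes equally regardless of its size, this metric is not inflated by the strong class imbalance typical of melanoma datasets, which is one common reason to prefer it over plain accuracy.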
               