"The Impact of Data Re-Sampling on Learning Performance of Class Imbalanced Bankruptcy Prediction Models"

The aim of this paper is to evaluate the effect of data sampling techniques on the performance of learners using a real, highly imbalanced Spanish bankruptcy dataset. The class imbalance problem refers to a highly uneven distribution of class instances, where one class has far more instances than the others. When the data distribution is highly skewed, classical learners are heavily biased toward recognizing the majority class, which degrades the performance of quantitative classifiers and prediction models. In this paper, six sampling methods (synthetic minority oversampling technique (SMOTE), Borderline-SMOTE, Safe-level-SMOTE, random undersampling, random oversampling, and condensed nearest neighbor) are used with individual learners (SVM, C4.5, and logistic regression) and ensemble learners (AdaBoostM1, DTBagging, and Random Forests). The quantitative prediction models are built by combining data sampling techniques with classical learners. The performance of these models is evaluated using the G-Mean and area under the curve (AUC) measures on the real, highly imbalanced dataset. The results suggest that oversampling (with LR and DTBagging) and undersampling (with C4.5 and RF) outperform the other combinations on this dataset.
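
As an illustration of two of the simpler sampling methods named above (random oversampling and random undersampling) and of the G-Mean measure, here is a minimal Python sketch. It is not the authors' implementation: the function names, the binary labels (1 for the minority "bankrupt" class), and the seed parameter are assumptions made for this example, and G-Mean is taken in its usual form, the square root of sensitivity times specificity.

```python
import math
import random


def _split_indices(y, minority_label):
    """Return (minority_indices, majority_indices) for a label list."""
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    return minority, majority


def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Assumes at least one minority instance exists.
    """
    rng = random.Random(seed)
    minority, majority = _split_indices(y, minority_label)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def random_undersample(X, y, minority_label=1, seed=0):
    """Keep a random subset of majority rows equal in size to the minority class."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y, minority_label)
    kept = rng.sample(majority, len(minority))
    idx = kept + minority
    return [X[i] for i in idx], [y[i] for i in idx]


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (minority recall) and specificity (majority recall)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)
```

A classifier that always predicts the majority class has zero sensitivity and therefore a G-Mean of 0, even though its plain accuracy on a highly imbalanced dataset would look high. This is why measures such as G-Mean and AUC are preferred to accuracy in this setting. Resampling should be applied to the training split only, so that the test split keeps the original class distribution.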

Keywords: data sampling; class; bankruptcy; prediction models; performance

Journal Title: International Journal on Electrical Engineering and Informatics
Year Published: 2018
