"Gaussian Distribution Based Oversampling for Imbalanced Data Classification"

Imbalanced data classification arises in many real-world applications. Data resampling, through either oversampling or undersampling, is a promising technique for dealing with imbalanced data. However, traditional resampling approaches consider only local neighbor information and generate new instances by linear interpolation, which leads to incorrect and unnecessary instances. In this study, we propose a new data resampling technique, Gaussian Distribution based Oversampling (GDO), to handle imbalanced data for classification. In GDO, anchor instances are selected from the minority class in a probabilistic way that takes into account the density and distance information carried by the minority instances. New minority instances are then generated following a Gaussian distribution model. The proposed method is validated experimentally by comparing it with seven imbalanced learning approaches on 40 data sets from the KEEL repository and 10 large data sets from the UCI repository. Experimental results show that our method outperforms the compared methods in terms of AUC, G-mean, and memory usage, at the cost of increased running time. We also apply GDO to two real imbalanced data classification problems: Internet video traffic identification and metastasis detection of esophageal cancer. The experimental results again validate the effectiveness of our approach.
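
The two steps named in the abstract (probabilistic anchor selection, then Gaussian generation) can be illustrated with a minimal Python sketch. It is not the authors' exact GDO formulation, which the abstract does not specify. The function name `gaussian_oversample`, the parameters `k` and `alpha`, and the particular density and distance factors are assumptions chosen for illustration only.

```python
import numpy as np


def gaussian_oversample(X, y, minority_label, n_new, k=5, alpha=1.0, seed=0):
    """Illustrative Gaussian-based oversampling (NOT the paper's exact GDO)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)
    X_min = X[min_idx]

    # Distances from every minority instance to every instance in the data.
    dists = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    # An instance is not its own neighbour.
    dists[np.arange(len(X_min)), min_idx] = np.inf

    k_eff = min(k, len(X) - 1)
    nn = np.argsort(dists, axis=1)[:, :k_eff]
    nn_dist = np.take_along_axis(dists, nn, axis=1)

    # Assumed density factor: share of majority-class instances among
    # the k nearest neighbours of each minority instance.
    density = (y[nn] != minority_label).mean(axis=1)
    # Assumed distance factor: mean neighbour distance, normalised to sum to 1.
    mean_dist = nn_dist.mean(axis=1)
    distance = mean_dist / (mean_dist.sum() + 1e-12)

    # Selection probabilities combine both factors (assumed: plain sum).
    weights = density + distance + 1e-12
    probs = weights / weights.sum()

    # Step 1: choose anchor instances probabilistically.
    anchors = rng.choice(len(X_min), size=n_new, p=probs)

    # Step 2: move each anchor along a random unit direction by a
    # Gaussian-distributed distance whose scale follows local spacing.
    directions = rng.normal(size=(n_new, X.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = np.abs(rng.normal(0.0, alpha * mean_dist[anchors]))
    return X_min[anchors] + radii[:, None] * directions
```

Calling `gaussian_oversample(X, y, minority_label=1, n_new=100)` returns an array of 100 synthetic minority instances that can be appended to the training set. Unlike linear interpolation between two neighbors, the new points are not confined to line segments between existing instances, which is the contrast the abstract draws with traditional resampling.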

Keywords: imbalanced data; distribution based; Gaussian distribution; data classification

Journal Title: IEEE Transactions on Knowledge and Data Engineering
Year Published: 2022
