Imbalanced software fault datasets, which contain fewer faulty modules than nonfaulty modules, make accurate fault prediction difficult, and handling such data during software fault prediction (SFP) is challenging for software practitioners. Several researchers have previously applied oversampling techniques, such as the synthetic minority oversampling technique (SMOTE), for imbalanced learning in SFP. However, most of these techniques resulted in overfitted prediction models. This article presents generative oversampling methods to handle the imbalanced data problem in SFP. Using a generative adversarial network (GAN) based approach, the presented methods generate synthetic samples of faulty modules to balance the proportion of faulty and nonfaulty modules in the fault datasets. SFP models are then built on the processed fault datasets using different machine learning techniques. The presented oversampling methods are validated experimentally on 18 fault datasets gathered from the PROMISE, JIRA, and Eclipse data repositories, with precision, recall, F1-score, and AUC as evaluation measures. We extensively compared the presented oversampling methods with various state-of-the-art class imbalance techniques and baseline models. The experimental results show that the presented methods improved fault prediction performance and outperformed the state-of-the-art class imbalance techniques.
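To make the balancing step and the evaluation measures concrete, the following is a minimal Python sketch, not the article's implementation. It assumes a hypothetical `generate` callable standing in for a trained generator; the `jitter_generator` shown here is a simple noise-based placeholder and is not a GAN. The function names, the label convention (1 = faulty, 0 = nonfaulty), and the noise scale are all assumptions made for illustration.

```python
import random


def balance_with_generator(X, y, generate, seed=0):
    """Append synthetic faulty rows (label 1) until both classes have equal size.

    `generate` is a hypothetical stand-in for a trained generator: it takes the
    real faulty rows and a random.Random instance and returns one new row.
    """
    rng = random.Random(seed)
    faulty = [row for row, label in zip(X, y) if label == 1]
    n_nonfaulty = len(y) - len(faulty)
    n_needed = max(n_nonfaulty - len(faulty), 0)
    if n_needed and not faulty:
        raise ValueError("no faulty rows to learn from")
    synthetic = [generate(faulty, rng) for _ in range(n_needed)]
    return X + synthetic, y + [1] * len(synthetic)


def jitter_generator(faulty, rng, scale=0.05):
    """Placeholder only (NOT a GAN): perturb a random real faulty row with noise."""
    base = rng.choice(faulty)
    return [value + rng.gauss(0.0, scale) for value in base]


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1-score for the faulty class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: 8 nonfaulty modules and 2 faulty modules, two metrics each.
X = [[0.1 * i, 0.2 * i] for i in range(8)] + [[2.0, 3.0], [2.2, 3.1]]
y = [0] * 8 + [1] * 2
X_bal, y_bal = balance_with_generator(X, y, jitter_generator)
```

In this sketch, synthetic rows are added only to the minority (faulty) class, so the original nonfaulty modules are left untouched. In practice, oversampling would be applied to the training split only, so that the evaluation measures are computed on real modules.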
               