There is currently a great need for research on gene expression data to support cancer classification in the field of oncogenomics. This is especially true because the disease occurs sporadically and often shows no symptoms. Gene expression data typically has a disproportionately large number of features relative to a small number of samples. A small sample size is likely to reduce classification accuracy, since the performance of a classifier depends largely on its data. There is therefore a pressing need to generate data that can serve as better input to classifiers. Primitive augmentation techniques such as uniform random generation and the addition of noise do not guarantee a good probability distribution. Moreover, because the application is critical, the augmented data must stay close to the original values. We therefore propose an improved variant of the K-nearest neighbor (KNN) rule. To generate synthetic samples, we use a Counting Quotient Filter, Euclidean distance and the mean best value from the k neighbors of each target sample. We compare three inputs: raw data from the public domain (original data), data generated using the standard K-nearest neighbor rule, and data generated using the improved K-nearest neighbor rule. The data from each approach is then classified using state-of-the-art classifiers: SVM, J48 and DNN. The samples generated by our improved technique yield better recall values than the standard implementation, which preserves the sensitivity of the data. Averaged over the three classifiers, classification accuracy improves by 7.72% compared with the traditional KNN approach and by 16% compared with using the raw data as classifier input. The proposed algorithm thus attains two objectives: first, preserving the sensitivity of the data for critical applications, and second, enhancing classification accuracy.
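The general idea of neighbor-based augmentation can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify how the Counting Quotient Filter is used or how the "mean best value" is computed, so the filter is omitted and the plain mean of the k nearest same-class neighbors (by Euclidean distance) is assumed instead. The function name `knn_mean_augment` and its parameters are hypothetical.

```python
import numpy as np


def knn_mean_augment(X, y, k=3):
    """Create one synthetic sample per original sample.

    Assumption (not from the abstract): each synthetic sample is the
    plain mean of the target sample's k nearest same-class neighbors,
    measured by Euclidean distance. The Counting Quotient Filter step
    is omitted.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    synth_X, synth_y = [], []

    for label in np.unique(y):
        Xc = X[y == label]
        if len(Xc) < 2:
            continue  # no neighbors available in this class
        kk = min(k, len(Xc) - 1)

        # Pairwise Euclidean distances within the class.
        diff = Xc[:, None, :] - Xc[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=2))
        np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

        # Indices of the kk nearest neighbors of every sample.
        nn_idx = np.argsort(dist, axis=1)[:, :kk]

        for i in range(len(Xc)):
            synth_X.append(Xc[nn_idx[i]].mean(axis=0))
            synth_y.append(label)

    return np.array(synth_X), np.array(synth_y)


# Toy data: 6 samples, 2 features, 2 classes (illustrative values only).
X_toy = [[0, 0], [2, 0], [0, 2], [10, 10], [12, 10], [10, 12]]
y_toy = [0, 0, 0, 1, 1, 1]
S, s_labels = knn_mean_augment(X_toy, y_toy, k=2)
```

Averaging neighbors keeps each synthetic sample inside the region spanned by real samples of the same class, which is one way to keep augmented values close to the originals. Real gene expression matrices have far more features than this toy example, and the synthetic samples would typically be appended to the training split only.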