Data defects degrade data quality and the results of data-mining analysis. This paper presents a data-defect inspection method based on the kernel-neighbor-density-change outlier factor (KNDCOF). A kernel neighbor density is defined to represent the density of each object in the database, and the ascending distance series (ADS) of each object is computed from the kernel distances between the object and its neighbors. The average density fluctuation (ADF) of the object is then defined as the weighted sum of the squared density differences between the object and the other objects in its ADS. Finally, the KNDCOF of the object is the ratio of the object's ADF to the average ADF of its neighbors. The KNDCOF value indicates the degree to which the object is an outlier. Experiments on three real data sets evaluate the effectiveness of the proposed method. The results verify that the proposed method achieves higher-quality data-defect inspection without increasing the time complexity.

Note to Practitioners: Data-defect inspection is an important data-preprocessing step for real industrial processes. This paper presents a data-defect inspection method that uses the kernel-neighbor-density-change outlier factor to identify outliers, and it addresses the challenges posed by the strong correlation and nonlinearity of industrial data. The proposed method calculates an outlier factor for each object that quantifies how outlying the object is. The outlier factor is based on the density difference between the object and its neighbors: the larger the outlier factor of an object, the higher its outlierness. The proposed method could be widely used on complex industrial data sets containing regions of different density. In the industrial field, engineers can handle objects with high outlier factor values according to the actual requirements.
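The steps named in the abstract (kernel distance, ADS, kernel neighbor density, ADF, KNDCOF) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so several choices below are assumptions made only for illustration and are not the authors' definitions: a Gaussian kernel with bandwidth `sigma`, a density taken as the inverse mean kernel distance to the `k` nearest neighbors, and ADF weights proportional to 1/rank. The function names `kernel_distance` and `kndcof_sketch` are hypothetical.

```python
import math


def kernel_distance(x, y, sigma=1.0):
    """Distance induced by a Gaussian kernel (assumed kernel choice).

    d(x, y) = sqrt(k(x, x) + k(y, y) - 2 k(x, y)), where k(x, x) = 1.
    """
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    k_xy = math.exp(-sq / (2.0 * sigma ** 2))
    return math.sqrt(max(2.0 - 2.0 * k_xy, 0.0))


def kndcof_sketch(points, k=3, sigma=1.0, eps=1e-12):
    """Return one illustrative outlier score per object in `points`."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    neighbors, density = [], []
    for i in range(n):
        # ADS: kernel distances to the other objects, sorted ascending,
        # truncated to the k nearest (ties broken by index).
        ads = sorted(
            (kernel_distance(points[i], points[j], sigma), j)
            for j in range(n) if j != i
        )[:k]
        neighbors.append([j for _, j in ads])
        # Assumed density: inverse of the mean kernel distance in the ADS.
        density.append(1.0 / (sum(d for d, _ in ads) / k + eps))

    # Assumed weights: nearer ranks in the ADS count more (1/rank, normalized).
    raw = [1.0 / rank for rank in range(1, k + 1)]
    weights = [w / sum(raw) for w in raw]

    # ADF: weighted sum of squared density differences along the ADS.
    adf = [
        sum(w * (density[i] - density[j]) ** 2
            for w, j in zip(weights, neighbors[i]))
        for i in range(n)
    ]

    # KNDCOF: ADF of the object over the average ADF of its neighbors.
    scores = []
    for i in range(n):
        mean_neighbor_adf = sum(adf[j] for j in neighbors[i]) / k
        scores.append(adf[i] / (mean_neighbor_adf + eps))
    return scores
```

Calling `kndcof_sketch(points, k=3)` on a list of coordinate tuples returns one non-negative score per object. As in the abstract, larger values are meant to indicate a stronger density change relative to the neighborhood. How well the scores separate outliers depends on the assumed kernel, bandwidth, and weights, which would need tuning against the paper's actual definitions.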
               