A Novel Oversampling Method for Imbalanced Datasets Based on Density Peaks Clustering

Abstract: Imbalanced data classification is a major challenge in data mining and machine learning, and oversampling algorithms are a widespread technique for re-sampling imbalanced data. To address the tendency of existing oversampling methods to introduce noise points and generate overlapping instances, this paper proposes a novel oversampling method based on density peaks clustering. Firstly, the density peaks clustering algorithm is used to cluster minority instances while screening out outlier points. Secondly, sampling weights are assigned according to the size of the clustered sub-clusters, and new instances are synthesized by interpolating between cluster cores and other instances of the same sub-cluster. Finally, comparative experiments are conducted on both artificial data and KEEL datasets. The experiments validate the feasibility and effectiveness of the algorithm and show that it improves the classification accuracy of imbalanced data.


INTRODUCTION
Imbalanced datasets occur when there are significantly more instances from some classes than others. More precisely, in a binary-class problem, the class that consists of a large number of instances is referred to as the majority class, whereas the class that consists of only a few instances is called the minority class [1]. The imbalanced dataset problem is prevalent in various real-world applications, including medical diagnosis [2], fault detection [3], credit assessment [4], etc. In these applications, the minority class is the focus of identification, and misclassifying it brings significant losses. For example, medical diagnoses in which a patient has a malignant disease fall into the minority class. If a malignant disease is misdiagnosed as benign, the best time to treat it will be missed and, in serious cases, the patient will die. Therefore, effectively improving the classification performance on imbalanced datasets is currently a hot research topic.
Due to its difficulty and prevalence, the imbalanced dataset problem has attracted much attention from researchers in the last two decades, and numerous approaches have been proposed to address it. They are primarily divided into two categories: algorithm-level approaches and data-level approaches [5]. Algorithm-level approaches introduce cost-sensitive factors, ensemble learning and other means to make the classifier more biased towards the minority class. Data-level approaches seek to rebalance the class distribution through resampling techniques, including over-sampling, under-sampling and combinations of the two. All of these techniques achieve a balanced set of instances by increasing or decreasing the number of instances from a particular class. In this paper, we deal with imbalanced data at the data level by increasing the number of instances in the minority class.
Random replication of minority instances, although fast, is not sufficiently effective in practice and tends to lead to over-fitting. The Synthetic Minority Oversampling Technique (SMOTE) is currently the most classical oversampling algorithm: instead of merely duplicating existing instances, it generates new instances by linearly interpolating between a randomly selected minority instance and one of its k nearest minority neighbors, where k is a user-specified variable [6]. However, because of its sensitivity to the value of k, this approach may produce noisy instances that overlap between the minority and majority classes, and the newly generated instances may fall inside the majority class region, which can further degrade the performance of subsequent classifiers. To mitigate these shortcomings, numerous extensions of SMOTE have been proposed, mainly based on distance or clustering. The Adaptive Synthetic sampling Technique (ADASYN) takes the instance distribution into account: minority instances close to the class boundary, with more majority neighbors in their neighborhood, are assigned higher weights and thus greater chances of being oversampled [7]. However, this method still allows synthetic instances to fall within the majority class region. In contrast, clustering-based oversampling methods first cluster the minority instances and then generate a specific number of new instances in each cluster, avoiding the generation of noise across the class boundary. Clustering methods used include k-means clustering [8], hierarchical clustering [9], etc. However, these methods are computationally complex, require clustering parameters to be set manually in advance, and struggle to handle datasets with unknown distributions.
To address these issues, this paper proposes an improved SMOTE algorithm based on density peaks clustering (DP-SMOTE). Firstly, to avoid setting clustering parameters and to handle various types of data efficiently, density peaks clustering is used to cluster the minority instances. Secondly, oversampling weights are assigned according to the inverse of the number of instances in each sub-cluster, giving higher oversampling weights to sub-clusters with few instances. Thirdly, the interpolation formula is improved to perform linear interpolation between cluster cores and other instances in the same sub-cluster, avoiding marginalization of the generated instances. Finally, comparative experiments are conducted on an artificial dataset and the KEEL datasets. The experimental results show the feasibility and superior performance of the proposed method, which improves the classification accuracy on imbalanced datasets.

DENSITY PEAKS CLUSTERING
The density peaks clustering (DPC) algorithm [10] is a relatively new density-based clustering algorithm whose main idea is to find high-density regions separated by low-density regions. DPC is based on two intuitive and simple assumptions: cluster centers should have the highest local densities, and simultaneously a relatively large distance from other cluster centers. Unlike k-means, DPC does not require the number of clusters to be specified in advance and does not impose a convexity assumption on the data, which makes it preferable for arbitrarily shaped unsupervised clustering problems without any a priori cluster-structure information. From these two assumptions, it follows that DPC must introduce two important quantities for each instance: the local density ρ_i, and the distance δ_i to the nearest neighbor with higher local density.
Specifically, we assume that the data set is X = {x_1, x_2, ..., x_n}. With the cut-off kernel, the local density ρ_i of instance x_i is defined by:

ρ_i = Σ_{j≠i} χ(d_ij − d_c), where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, (1)

where d_ij is the distance between two instances x_i and x_j, and d_c is referred to as the cut-off distance. d_c is usually set to the distance within which 2% of the instances are included on average [10].
Further, the distance δ_i between instance x_i and its nearest neighbor of higher local density is defined by:

δ_i = min_{j: ρ_j > ρ_i} d_ij, (2)

and for the instance with the highest local density, δ_i is set to max_j d_ij. Density peaks clustering defines instances with both higher local densities ρ_i and larger distances δ_i as cluster centers. Instances with lower local densities ρ_i but larger distances δ_i are defined as outliers. The remaining instances are then assigned to the clusters of their nearest neighbors with higher local density, obtaining the corresponding cluster labels.
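The two quantities above can be computed directly from the pairwise distance matrix. The following is a minimal sketch; the function name and the quantile-based choice of d_c are our own illustration, not the paper's code:

```python
import numpy as np

def dpc_quantities(X, t=0.02):
    """Compute the two DPC quantities for each instance: the local
    density rho_i (cut-off kernel, Eq. (1)) and the distance delta_i
    to the nearest neighbor of higher density (Eq. (2))."""
    n = len(X)
    # pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # choose d_c so that roughly a fraction t of all pairwise
    # distances fall below it (one way to realize the "t of the
    # instances included on average" rule)
    d_c = np.quantile(d[np.triu_indices(n, k=1)], t)
    # rho_i: number of instances closer than d_c (self excluded)
    rho = (d < d_c).sum(axis=1) - 1
    # delta_i: min distance to a denser point; the global density
    # peak gets the maximum distance instead
    delta = np.zeros(n)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        delta[i] = d[i, denser].min() if len(denser) else d[i].max()
    return rho, delta, d_c

# demo: two well-separated Gaussian clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
rho, delta, d_c = dpc_quantities(X, t=0.05)
```

In a decision diagram, plotting δ_i against ρ_i, cluster centers stand out in the upper-right corner and outliers in the upper-left.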

OVERSAMPLING ALGORITHM BASED ON DENSITY PEAKS CLUSTERING
When traditional oversampling algorithms deal with minority instances, they focus mainly on the inter-class imbalance problem, i.e., the imbalance between minority and majority instances, while ignoring the influence of intra-class imbalance. As a result, the generated instance distribution often suffers from noise and marginalization and does not conform to the original distribution pattern. To address these problems, this paper proposes an oversampling algorithm based on density peaks clustering (DP-SMOTE); the algorithm flowchart is shown in Fig. 1.

Density Peaks Clustering for Minority Class
The process of density peaks clustering for minority instances is divided into several steps. Firstly, the local density of each instance is calculated. Secondly, the distance between each instance and its nearest higher-density point is calculated. Thirdly, a decision diagram is drawn with the local density as the horizontal coordinate and the distance as the vertical coordinate, from which the cluster centers and outliers are determined. Finally, the remaining instances are assigned sub-cluster labels according to their proximity to neighbors with higher local density. The specific process is shown in Algorithm 1.

Algorithm 1 Density peaks clustering algorithm
Input: D_s: the set of minority instances; d_c: the cut-off distance.
Output: D_{s,k}: the set of instances in each sub-cluster; k: the number of sub-clusters; D_center: the set of cluster centers; D_noise: the set of outlier points.
1: Determine the cut-off distance d_c, and calculate ρ_i and δ_i for each instance according to Eq. (1) and Eq. (2).
2: Draw a decision diagram based on ρ_i and δ_i, determine the cluster centers and outliers, and obtain D_center, D_noise, and k.
3: Assign each remaining instance to the cluster of its nearest neighbor with higher local density, dividing the instances into k sub-clusters to obtain D_{s,k}.
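Step 3 of Algorithm 1 can be sketched as follows: labels propagate from the chosen centers by following, for each point, its nearest neighbor of higher local density. The function and its interface are our own illustration; the paper's implementation is not published:

```python
import numpy as np

def dpc_assign(X, rho, center_idx, noise_idx=()):
    """Propagate cluster labels from the centers to the remaining
    instances via each point's nearest higher-density neighbor."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    labels = np.full(n, -1)
    for k, c in enumerate(center_idx):
        labels[c] = k
    # visit points in order of decreasing density, so every strictly
    # denser point is labeled before its followers
    for i in np.argsort(-rho):
        if labels[i] != -1 or i in noise_idx:
            continue
        denser = np.where(rho > rho[i])[0]
        if len(denser):
            labels[i] = labels[denser[np.argmin(d[i, denser])]]
    return labels

# demo: two well-separated Gaussian clusters with a Gaussian-kernel
# density (an assumption for the demo; the paper uses the cut-off kernel)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(4, 0.2, (15, 2))])
d = np.linalg.norm(X[:, None] - X[None], axis=2)
rho = np.exp(-(d / 0.3) ** 2).sum(axis=1)
centers = [int(np.argmax(rho[:15])), 15 + int(np.argmax(rho[15:]))]
labels = dpc_assign(X, rho, centers)
```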

Improved Oversampling Algorithm
After clustering the minority instances, different oversampling weights W_{s,i} are first assigned according to the number of instances in each sub-cluster, defined by:

W_{s,i} = (1/n(i)) / Σ_{j=1}^{k} (1/n(j)), (3)

where k denotes the number of sub-clusters and n(i) denotes the number of instances in the i-th sub-cluster. From Eq. (3), it can be seen that the oversampling weight is inversely proportional to the number of instances in a sub-cluster: the fewer instances a sub-cluster contains, the larger its oversampling weight and the more instances are generated for it, compensating for the intra-class imbalance. Then, the number of instances to be generated for each sub-cluster, N_i, is calculated from the oversampling weight W_{s,i}:

N_i = N × W_{s,i}, (4)

where N denotes the overall number of synthetic minority instances to be generated. Generally, N is set to the difference between the majority and minority sizes, which balances the dataset to a 1:1 ratio so that both classes are of the same size. Finally, to avoid marginalizing the distribution of the generated instances, the concept of cluster cores is introduced in this paper. Cluster cores are the cluster centers of each sub-cluster obtained by density peaks clustering of the minority class. Linear interpolation is performed between the cluster core and the other instances of the same sub-cluster to obtain high-quality generated instances. The interpolation formula is defined as follows:

x_new = c_i + a × (x_ij − c_i), (5)

where a denotes a random number in (0, 1), c_i denotes the cluster core of the i-th sub-cluster, and x_ij denotes the j-th instance in the i-th sub-cluster. In addition, outlier points are not involved in the generation of new instances, preventing the creation of noise. The detailed procedure of the improved oversampling method is shown in Algorithm 2.
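The weighting and interpolation steps above can be sketched together as one sampling routine; the function name and list-based interface are our own illustration:

```python
import numpy as np

def dp_smote_sample(subclusters, cores, N, rng=None):
    """Generate N synthetic minority instances.  `subclusters` is a
    list of arrays (one per sub-cluster) and `cores` the matching
    cluster cores.  Weights are inversely proportional to the
    sub-cluster sizes, as in the text."""
    rng = np.random.default_rng(rng)
    inv = np.array([1.0 / len(c) for c in subclusters])
    w = inv / inv.sum()                      # normalized sampling weights
    counts = np.round(N * w).astype(int)     # instances per sub-cluster
    synthetic = []
    for Xi, core, ni in zip(subclusters, cores, counts):
        for _ in range(ni):
            xj = Xi[rng.integers(len(Xi))]   # random member of the sub-cluster
            a = rng.random()                 # a ~ U(0, 1)
            synthetic.append(core + a * (xj - core))  # core-anchored interpolation
    return np.array(synthetic)

# demo: a large and a small sub-cluster; the smaller one should
# receive the larger share of the 40 synthetic instances
rng = np.random.default_rng(0)
big = rng.normal(0, 0.1, (30, 2))
small = rng.normal(5, 0.1, (10, 2))
synth = dp_smote_sample([big, small], [big.mean(0), small.mean(0)], N=40, rng=0)
```

Because every synthetic point lies on the segment between the cluster core and a member of the same sub-cluster, it cannot leave that sub-cluster's convex hull, which is what keeps the generated instances away from the majority region.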

EXPERIMENTAL RESULTS AND ANALYSIS

Artificial Dataset
In order to verify the effectiveness and superiority of the proposed method, artificial datasets are constructed to compare the distribution of instances generated by different oversampling algorithms. The experiment uses the Sklearn package in Python to randomly generate two sets of instances that follow Gaussian distributions. The minority instances are represented by small dots (50 instances), the majority instances by large dots (80 instances), and the newly generated minority instances by plus signs (30 instances, i.e., the difference between the majority and minority sizes). In terms of implementation, the SMOTE and K-means SMOTE algorithms are called directly from the Python imbalanced-learn package, and the proposed algorithm is written in Python with the cut-off distance d_c set so that 2% of instances are included on average. The distribution of new instances synthesized by the different oversampling algorithms is given in Fig. 2. Fig. 2a shows the instances synthesized by the ADASYN algorithm, some of which are concentrated in the boundary region between the minority and majority classes, introducing a significant amount of noise. Fig. 2b shows the instances synthesized by the SMOTE algorithm, some of which are likewise scattered in the boundary region between the two classes, introducing noise. Fig. 2c shows the instances synthesized by the K-means SMOTE algorithm, which are more densely distributed with more overlapping instances. Fig. 2d shows the new instances synthesized by the proposed algorithm; by comparison, they are more evenly distributed in low-density minority regions, with fewer overlapping instances.
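The artificial setting above can be reproduced approximately with Sklearn's make_blobs; the seed, cluster centers and spread below are assumptions, since the paper does not report them:

```python
from sklearn.datasets import make_blobs

# Two Gaussian classes as described in the text: 80 majority and
# 50 minority instances (centers, spread and seed are our guesses).
X, y = make_blobs(n_samples=[80, 50], centers=[[0, 0], [2.5, 2.5]],
                  cluster_std=0.8, random_state=42)
maj, mino = X[y == 0], X[y == 1]
n_new = len(maj) - len(mino)  # 30 synthetic instances restore a 1:1 ratio
```

The resampled distributions in Fig. 2 can then be obtained by feeding (X, y) to each oversampler and scatter-plotting the added points.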

KEEL Datasets
In order to further verify the effectiveness of the proposed method, SMOTE, K-means SMOTE, ADASYN and the proposed algorithm are used to oversample the minority instances of seven KEEL datasets, and the resampled data are then classified using Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Gradient Boosted Decision Tree (GBDT). Specific information on the KEEL datasets is shown in Tab. 1, where the imbalance ratio is the ratio of the majority class to the minority class. The feature values of all instances are subjected to maximum-minimum (min-max) normalization. 10-fold cross-validation is used and the average value is taken as the experimental result to avoid the influence of randomness. The proposed algorithm is implemented in Python, with the cut-off distance d_c set so that 2% of instances are included on average. SMOTE, K-means SMOTE and ADASYN are called directly from the Python imbalanced-learn package; SVM, KNN and GBDT are called directly from the Python Sklearn module, all with default parameters.
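A minimal sketch of this evaluation protocol, using only sklearn components, is given below. All names are our own; the placeholder oversampler simply duplicates minority instances and stands in for any method with an (X, y) -> (X, y) interface, and oversampling is applied only to the training folds, the usual precaution against leakage:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def evaluate(X, y, clf, oversample, n_splits=10, seed=0):
    """10-fold CV: min-max scale and oversample the training fold
    only, then score F1 on the untouched test fold."""
    scores = []
    cv = StratifiedKFold(n_splits, shuffle=True, random_state=seed)
    for tr, te in cv.split(X, y):
        sc = MinMaxScaler().fit(X[tr])
        Xr, yr = oversample(sc.transform(X[tr]), y[tr])
        clf.fit(Xr, yr)
        scores.append(f1_score(y[te], clf.predict(sc.transform(X[te]))))
    return float(np.mean(scores))

def random_oversample(X, y, seed=0):
    """Placeholder: duplicate minority instances until 1:1 balance.
    DP-SMOTE, SMOTE, etc. can be dropped in via the same interface."""
    rng = np.random.default_rng(seed)
    mino, maj = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    extra = rng.choice(mino, size=len(maj) - len(mino), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

# demo on a synthetic imbalanced dataset (classifier with default parameters)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=300, weights=[0.85, 0.15], random_state=0)
score = evaluate(X, y, KNeighborsClassifier(), random_oversample)
```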

Performance Evaluation
For imbalanced data, the correct classification of the minority class is particularly important, and traditional evaluation based on accuracy alone is no longer applicable. Therefore, in order to quantitatively assess classification performance on imbalanced data, F-Measure, AUC and G-Mean are used in this paper as the performance measures to compare the different methods. These metrics are derived from the confusion matrix, as shown in Tab. 2. G-Mean represents the geometric mean of the accuracies on the minority and majority classes. It reaches its maximum value if and only if the predictive accuracies of the two classes are balanced, and can therefore be used to evaluate the classification performance on imbalanced data. G-Mean is defined by:

G-Mean = sqrt(TPR × TNR), where TPR = TP / (TP + FN) and TNR = TN / (TN + FP). (6)

Area Under Curve (AUC) indicates the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. Its value ranges from 0.5 to 1; the higher the value, the better the discrimination ability of the classifier. AUC is defined by:

AUC = (1 + TPR − FPR) / 2, where FPR = FP / (TN + FP). (7)
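G-Mean is straightforward to compute from the confusion matrix; a small helper (our own naming) using sklearn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls: maximal only when
    both the minority and the majority class are predicted well."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)  # minority (positive) recall
    tnr = tn / (tn + fp)  # majority (negative) recall
    return np.sqrt(tpr * tnr)

# demo: 3 minority (1) and 5 majority (0) instances
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1])
score = g_mean(y_true, y_pred)  # TPR = 2/3, TNR = 4/5
```

For AUC, sklearn's roc_auc_score can be applied directly to the classifier's decision scores.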

Experimental Results and Discussion on KEEL Datasets
Tab. 3 to Tab. 5 list the F-Measure, AUC and G-mean obtained after processing with different oversampling algorithms using SVM, KNN and GBDT classification respectively, with the optimal data marked in bold. For a clearer visual comparison of the various methods, Fig. 3 to Fig. 5 present the average values of each evaluation metric obtained by the four oversampling algorithms on the same classifier over all datasets.
From the results in Tab. 3 to Tab. 5, we can see that DP-SMOTE achieves the optimal F-Measure on all datasets under all three classifiers, indicating that the method effectively improves the classification accuracy of the minority class. On most of the datasets, the best AUC and G-mean are also obtained, indicating that the proposed method effectively improves the overall classification performance on imbalanced datasets. On the three datasets Vehicle1, Ecoli2 and Ecoli4, DP-SMOTE achieves the optimal F-Measure, AUC and G-mean simultaneously. In general, DP-SMOTE accounts for both intra-class imbalance and cross-region noise, and improves the classification performance on imbalanced data. Fig. 3 to Fig. 5 clearly demonstrate that, compared with the other three oversampling algorithms, all three classifiers obtain the best classification performance on the datasets processed by the proposed oversampling algorithm, with the best average values of F-Measure, AUC and G-mean. This indicates that DP-SMOTE has the best average performance and clear advantages in handling the classification of imbalanced data.

Parameters Analysis of the Proposed Method
Since the proposed method adopts density peaks clustering to identify sub-clusters within the minority class, the selection of the corresponding parameters for density peaks clustering is crucial. Because density peaks clustering requires neither initial cluster centers nor the number of clusters as prior knowledge, the cut-off distance d_c is the key to its clustering performance.
Therefore, we let the ratio of instances included on average be t, and vary t from 1% to 5% to investigate the effect of the cut-off distance d_c on the classification performance of the proposed method. Only KNN, the best-performing classifier in the comparative experiments, is used on the KEEL datasets to observe the changes in the evaluation metrics F-Measure, AUC and G-mean. The results are averaged across the datasets after 10-fold cross-validation at each value of t. The effects of different cut-off distances on the evaluation metrics are shown in Fig. 6 to Fig. 8, which show a general trend of increasing and then decreasing performance. In the initial stage, classification performance increases with increasing d_c. After a certain value is reached, the performance remains stable and then decreases with further increases of d_c. When t is set to 2%, all three evaluation metrics achieve good results.
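The mapping from the ratio t to a concrete cut-off distance d_c can be sketched as follows; the quantile-based rule is our assumption about how the "included on average" criterion is implemented:

```python
import numpy as np

def cutoff_from_ratio(X, t):
    """Map the neighbor ratio t to a cut-off distance d_c: the
    t-quantile of all pairwise distances, so that each instance
    counts on average a fraction t of the data inside d_c."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return float(np.quantile(d[np.triu_indices(len(X), k=1)], t))

# the sweep described in the text varies t from 1% to 5%
X = np.random.default_rng(0).normal(size=(100, 2))
d_cs = {t: cutoff_from_ratio(X, t) for t in (0.01, 0.02, 0.03, 0.04, 0.05)}
```

Repeating the 10-fold evaluation at each of these d_c values and averaging the metrics reproduces the curves of Fig. 6 to Fig. 8.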

CONCLUSION
To address the problem of imbalanced data classification, this paper proposes a data processing method based on density peaks clustering. Firstly, density peaks clustering is introduced on the minority class to identify sub-clusters accurately and quickly, avoiding restrictions on spatial shape and parameter settings. Secondly, the oversampling weight is adjusted according to the number of instances in each sub-cluster, ensuring that smaller sub-clusters generate more instances and thereby addressing the intra-class imbalance of the minority class. Thirdly, the SMOTE interpolation formula is improved to interpolate between cluster cores and other instances of the same sub-cluster, preventing the generated instances from falling into the majority class region and reducing the generation of noise points and overlapping instances. Finally, comparative experiments on an artificial dataset and the KEEL datasets verify that the proposed method effectively reduces the introduction of noise and overlapping instances and improves the classification performance on imbalanced data. In addition, the effect of the cut-off distance on the performance of the proposed method is analyzed, and an optimal parameter setting is suggested. This paper studies only oversampling for binary imbalance. In practical applications, multi-class problems are more common, and future work will explore oversampling for multi-class imbalance.