A Linear Fitting Density Peaks Clustering Algorithm for Image Segmentation

Clustering by fast search and finding of density peaks (DPC) is a recently developed method that obtains promising results. However, DPC requires users to determine the number of clusters in advance, so the clustering results are unstable and strongly influenced by that choice. To address this issue, we propose a novel algorithm, LDPC (Linear fitting Density Peaks Clustering). LDPC uses a linear fitting method to choose cluster centres automatically. We use public datasets to assess the effectiveness of LDPC and, in particular, apply LDPC to image segmentation tasks. The experimental results show that LDPC obtains competitive results compared with other clustering algorithms.


INTRODUCTION
In recent years, image segmentation has become an important research topic in computer vision. Image segmentation is the process of dividing an image into regions such that each region is homogeneous and the union of any two adjacent regions is not [1]. It is applied in many fields, such as medical research [2] and person re-identification [3]. Many segmentation methods have been proposed, including edge detection, thresholding, region segmentation, clustering, and neural networks [4]. Among these, clustering has notable advantages: it is unsupervised, so it needs no labels, and it requires less computing overhead. Consequently, many clustering algorithms have been applied to image segmentation, such as K-means [5,6], fuzzy C-means [7], and mean-shift [8].
DPC is a density-based clustering algorithm that requires no iteration [10], and it can identify clusters of different shapes, such as those that occur in images. However, it requires users to estimate the number of cluster centers from the decision graph, and this manual step affects the accuracy of DPC. To address this issue, we propose a novel clustering method based on DPC.
The contributions of LDPC are mainly in two aspects: (1) We propose a novel method to assess, for each point in a decision graph, the possibility of its becoming a cluster center.
(2) The manual selection of cluster centers is replaced by a linear fitting approach.

CLUSTERING BY FAST SEARCH AND FINDING OF DENSITY PEAKS
In this section, we describe the original DPC algorithm. DPC is based on the following hypotheses: (1) The center of a cluster is surrounded by points with lower local density; in other words, the density of the center is a local maximum [12].
(2) The center of a cluster is relatively far from any point with higher density [13,14].
DPC uses two quantities to determine cluster centers: for each point i, the local density ρ_i and the distance δ_i from points of higher density, defined in Eq. (1) and Eq. (2), respectively [15].
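For reference, the standard DPC definitions, which agree with the descriptions in this section, can be written as below. The symbols d_{ij} (the distance between points i and j) and χ (an indicator function) are notation assumed here.

```latex
\[
\rho_i = \sum_{j \neq i} \chi\left(d_{ij} - d_c\right),
\qquad
\chi(x) =
\begin{cases}
1, & x < 0 \\
0, & x \geq 0
\end{cases}
\tag{1}
\]
\[
\delta_i =
\begin{cases}
\min\limits_{j:\, \rho_j > \rho_i} d_{ij}, & \text{if some point has a higher density than point } i \\
\max\limits_{j} d_{ij}, & \text{if point } i \text{ has the highest density}
\end{cases}
\tag{2}
\]
```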
In Eq. (1), d_c is a cutoff distance that must be defined by the user. The local density ρ_i equals the number of points whose distance to point i is smaller than d_c. As a rule of thumb, d_c is often chosen so that the average number of neighbors of each point is about 2% of all points [17,18].
From Eq. (2), only points whose density is a local or global maximum have a δ_i much larger than that of other points. Cluster centers are therefore points with relatively large values of both ρ and δ. In DPC, the ρ−δ decision graph is used to choose centers: as shown in Fig. 1 (Spiral dataset), the points in the top right corner are centers [19]. After the cluster centers are selected, each remaining point is assigned to the same cluster as its nearest neighbor of higher density.
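The procedure described in this section can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `dpc_rho_delta` and `assign_labels` are hypothetical, Euclidean distance is assumed, and ties in density are broken by point index, which the text does not specify.

```python
import math


def dpc_rho_delta(points, d_c):
    """Return (rho, delta, nearest_higher) under the standard DPC definitions."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit points from densest to sparsest; ties broken by index (assumption).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    nearest_higher = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbor: use its largest distance.
            delta[i] = max(dist[i])
            continue
        # Nearest point among those that come earlier in the density order.
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_higher[i] = j
    return rho, delta, nearest_higher


def assign_labels(centers, rho, nearest_higher):
    """Give each non-center point the label of its nearest denser neighbor."""
    order = sorted(range(len(rho)), key=lambda i: (-rho[i], i))
    if order[0] not in centers:
        raise ValueError("the densest point must be one of the centers")
    labels = [-1] * len(rho)
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Denser points are labeled first, so the neighbor's label already exists.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels


# Two small square-shaped groups, each with one dense point in the middle.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
rho, delta, nearest_higher = dpc_rho_delta(points, d_c=1.0)
print(assign_labels([4, 9], rho, nearest_higher))
# -> [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

In this toy example the two middle points (indices 4 and 9) have both the largest ρ and the largest δ, so they would sit in the top right corner of the decision graph.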

LINEAR FITTING DENSITY PEAKS CLUSTERING ALGORITHM
In this section, we describe the proposed LDPC. First, we introduce a method to calculate d_c in subsection 3.1. Second, we calculate ρ and δ. Third, we calculate the variable γ in subsection 3.2 and use linear fitting to obtain the residual sequence. Finally, the points whose residuals are clearly larger than the others are chosen as centers.

Calculate d_c
We use the method of [11] to calculate d_c. Since the standard deviation reflects the degree of dispersion of a data set, the cutoff distance d_c is designed based on it. Assume a data set X ∈ R^{m×n} with n points and m attributes. Then d_c is defined as in Eq. (3).
where σ_j and μ_j are the standard deviation and mean value of attribute j, respectively, and ω ∈ (0, 1] is a trade-off parameter that controls the size of the cutoff distance. Here we set ω ∈ 11.

Automatic Determination of Centres
To determine cluster centers automatically, we first define a variable γ_i for each data point i, as given in Eq. (4). Before γ is calculated, the values of ρ and δ of all data points should be normalized; otherwise, if ρ and δ have different orders of magnitude, the effect of the smaller variable is easily ignored. We then sort γ in descending order and call the sorted sequence γ_s. Since data points with relatively large values of ρ and δ are chosen as cluster centers in DPC, the γ values of centers are larger than those of other points in LDPC.
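The normalization and sorting step can be sketched as follows. Two points are assumptions, not taken from the text: min-max scaling is used as the normalization, and γ is taken to be the product of the normalized ρ and δ, as in the original DPC decision rule. The function names are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; min-max scaling is an assumed choice."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def sorted_gamma(rho, delta):
    """Return (gamma_s, order): gamma in descending order and the point indices."""
    rho_n, delta_n = min_max(rho), min_max(delta)
    # Assumption: gamma is the product of the normalized rho and delta.
    gamma = [r * d for r, d in zip(rho_n, delta_n)]
    order = sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)
    return [gamma[i] for i in order], order


gamma_s, order = sorted_gamma([10, 2, 8, 1], [5.0, 0.5, 4.0, 1.0])
print(order[:2])  # -> [0, 2]
```

Without the normalization, a ρ measured in hundreds of neighbors would dominate a δ measured in fractions of a unit, which is the effect the paragraph above warns about.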
Fig. 2 plots γ_s. The centers in the upper left corner are relatively sparse, whereas the other points are particularly dense. These sparse points are chosen as centers, so the problem of choosing centers is transformed into that of separating sparse points from dense points. In Fig. 2, apart from a few sparse points, the dense points lie approximately on a straight line. We can therefore use linear fitting to separate the sparse points. We obtain a residual sequence c by subtracting the fitted value γ_r from the original γ_s. In Fig. 3, the residuals of the sparse points are significantly larger than those of the dense points.
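A minimal sketch of the residual computation is given below. It assumes an ordinary least-squares line fitted over all points of γ_s against their rank; the text does not state which points enter the fit, and the function name `linear_fit_residuals` is hypothetical.

```python
def linear_fit_residuals(gamma_s):
    """Fit gamma_s against rank 0..n-1 by least squares; return gamma_s - gamma_r."""
    n = len(gamma_s)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(gamma_s) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(gamma_s))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual c = original value minus fitted value.
    return [y - (slope * x + intercept) for x, y in enumerate(gamma_s)]


# Two sparse "center" values followed by 48 dense, nearly collinear values.
gamma_s = [1.0, 0.8] + [0.05 - 0.001 * k for k in range(48)]
residuals = linear_fit_residuals(gamma_s)
print(min(residuals[:2]) > max(residuals[2:]))  # -> True
```

In this synthetic sequence the two leading values pull well away from the fitted line, so their residuals stand out from those of the dense points.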

Figure 3 Residual graph of Spiral dataset
We adopt a special method to separate the sparse points, which are chosen as centers. For discrete data, because the γ values of the centers are significantly greater than the others, a jump point ap can clearly be found in the residual sequence; it is marked with a red circle in Fig. 3. The points before ap are considered center points. The jump point ap is set as in Eq. (5). In this research, we mainly focus on image segmentation. Images do not show such large jumps as discrete data, so we use another method to find centers. Generally, most images do not contain many colors; in other words, the number of clusters is less than 20. Therefore, in LDPC, we choose the average of the residuals of the first 20 points as the threshold, and data points whose residuals are greater than the threshold are chosen as centers. The specific selection process of LDPC is shown in Fig. 4.
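The threshold rule for images can be sketched as follows. One assumption is made beyond the text: candidates are restricted to the first 20 points of the sorted sequence, in line with the statement that the number of clusters is less than 20. The function name `select_center_ranks` and the parameter `window` are hypothetical.

```python
def select_center_ranks(residuals, window=20):
    """Return ranks (positions in the sorted sequence) chosen as centers."""
    head = residuals[:window]
    if not head:
        raise ValueError("residual sequence is empty")
    # Threshold: average residual of the first `window` points.
    threshold = sum(head) / len(head)
    # Assumption: only the first `window` points are candidates.
    return [rank for rank, r in enumerate(head) if r > threshold]


# Two large leading residuals followed by a slowly rising dense part.
residuals = [0.82, 0.62] + [-0.12 + 0.004 * k for k in range(48)]
print(select_center_ranks(residuals))  # -> [0, 1]
```

The returned values are ranks in the sorted sequence γ_s; mapping them back to data points requires the index order produced when γ was sorted.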

EXPERIMENTS AND DISCUSSIONS

Synthetic Datasets
The Spiral, Flame, and Aggregation datasets are used to test the clustering performance of LDPC. The clustering results of LDPC and DPC are shown in Fig. 5. LDPC and DPC obtain the same results, but DPC requires users to determine the number of clusters, whereas the proposed LDPC obtains correct results without manual intervention. The image segmentation results are shown and discussed in subsection 4.2.

Image Segmentation
In this section, we compare the results of LDPC on five typical images with those of the K-means and DPC algorithms.
The Lena image is widely used in image processing. The experimental results on Lena are shown in Fig. 6, where we report the best segmentation result of DPC. Fig. 6a is the original Lena image, Fig. 6b is the result of LDPC, Fig. 6c is the best result of DPC, obtained with the percentage set to 3, and Figs. 6d, 6e, and 6f are the results of K-means for k = 2, 3, and 4, respectively. The most accurate K-means result is obtained with k = 3. The segmentation results of LDPC and DPC are very similar to the K-means result with k = 3. However, the DPC result has more noise points than K-means, whereas LDPC has fewer noise points than K-means.
Figure 6 Clustering results of Lena
The second image is a house, which can be divided into three clusters, as shown in Fig. 7. We therefore set k = 3 and percentage = 27. Fig. 7a is the original image, Fig. 7b is the result of LDPC, Fig. 7c is the result of DPC, and Fig. 7d is the result of K-means. The DPC result contains 2 colors, which does not conform to the original image, whereas the LDPC and K-means results both contain 3 colors. LDPC merges the color of the wall and white into one color, and K-means merges white and blue into one color; both outcomes are reasonable. However, the contour produced by K-means is clearer than that produced by LDPC.
Figure 7 Clustering results of House
The images used for the final experiments are Flower, Peppers, and Fruits, which are often used in image processing.
For the Flower image, the K-means result has the fewest noise points and a clear contour when k = 2, so this is the best K-means result. The results of LDPC and DPC both have two colors, but the DPC result has more noise points than the LDPC result.
Figure 8 Segmentation results of Flower
The Peppers image in Fig. 9 has two colors, so we set percentage = 10 and k = 2. Fig. 9a is the original image, Fig. 9b is the result of LDPC, Fig. 9c is the result of DPC, and Fig. 9d is the result of K-means with k = 2. The clustering results of the three algorithms are consistent.
Figure 9 Segmentation results of Peppers
For the Fruits image in Fig. 10, we set percentage = 2 and k = 2, 3, 4. Fig. 10a is the original image, Fig. 10b is the result of LDPC, Fig. 10c is the result of DPC, and Figs. 10d, 10e, and 10f are the results of K-means for k = 2, 3, and 4, respectively.
The image contour is clearest when k = 3, so k = 3 gives the best K-means result. The LDPC result also has three colors, but the DPC result has only two, which is too ambiguous.
Figure 10 Segmentation results of Fruits
From the figures presented above, we can conclude that LDPC obtains the same result as K-means when K-means is given the correct value of k. DPC, however, segments only a small number of the images correctly, and only when the percentage used for d_c is suitable. In contrast, LDPC can be used for image segmentation automatically: it does not require any parameters to be set and still reaches reasonable accuracy in its segmentation results.

CONCLUSION
For image segmentation, K-means is a very popular tool, but the number of clusters needs to be determined in advance. To contribute a more robust tool, we propose LDPC, which is based on a linear fitting approach. Experimental results demonstrate that the performance of LDPC is as good as that of K-means and DPC, and that on some images LDPC performs better than the other two.
DPC is a powerful tool for finding low-dimensional features of high-dimensional data, so it could be applied to bioinformatics. In addition, a deep auto-encoder can also find reasonable features, which may help DPC obtain more accurate results on large-scale image datasets. These ideas need further study.