SEMI-SUPERVISED AFFINITY PROPAGATION BASED ON DENSITY PEAKS

Original scientific paper

Because the affinity propagation (AP) clustering algorithm performs unsatisfactorily on data sets with complex structure, this paper proposes a semi-supervised affinity propagation clustering algorithm based on density peaks (SAP-DP). The algorithm combines the density peaks (DP) algorithm, which has the advantages of manifold clustering, with the idea of semi-supervised learning: it builds pairwise constraints to adjust the similarity matrix and then executes AP clustering. Simulation experiments show that the proposed algorithm achieves better clustering performance than conventional AP.


Introduction
Affinity Propagation (AP) clustering is a distinctive and efficient clustering algorithm. It simultaneously considers all data points as potential exemplars and, unlike many other clustering algorithms, does not require the number of clusters to be predetermined [1]. The algorithm was published by Brendan J. Frey and Delbert Dueck in Science in 2007.
In recent years, scholars have developed many improved methods, which focus on three issues: application studies, the similarity matrix and complex data processing. Affinity Propagation is now widely used in text segmentation [2], artificial immune systems [3, 4], image recognition [5] and many other fields [6-8]. Hassanabadi et al. [9] present a novel, mobility-based clustering scheme for Vehicular Ad hoc Networks, which forms clusters using the Affinity Propagation algorithm in a distributed manner. Bodenhofer et al. [10] provide an R implementation of the AP algorithm to account for the ubiquity of R in bioinformatics. By introducing the idea of emoticons into AP, Zhang Lumin et al. [11] propose a novel approach to mining online events based on emoticons. Lu Weiming et al. [12] propose a distributed AP clustering algorithm based on MapReduce to handle large-scale data effectively.
Delbert Dueck and Brendan J. Frey [26] use nonmetric similarities to increase the accuracy of image classification. Zhengdong Lu [27] proposes two kinds of affinity information that change the matrix to yield better results with fewer constraints. Zhen Zhang [28] proposes STI-AP, which uses a specially defined manifold similarity and semi-supervised learning to reduce the complexity of marking sampled flows. These works adapt the way similarity is calculated so that it suits a particular problem. Similarity is closely related to the data structure, so different data need different methods.
Jianpeng Zhang [29] uses an improved weighted, hierarchical affinity propagation to rebuild the AP model when a newly emerging class is detected. Streaming data are characterised by a dynamically changing distribution; Zhang X., Furtlehner C. and Germain-Renaud C. [30] apply affinity propagation as an efficient clustering algorithm to VANETs. Like other algorithms, AP has its limitations, and no single method suits all problems; for complex data such as manifolds, it needs to be improved.
AP works on similarities: it considers all data points as potential cluster centres and then obtains the clustering result through iterative competition. Unlike other clustering algorithms, it does not need the number of clusters to be specified, and it deals with large-scale data quickly and efficiently. However, because it lacks prior information, the algorithm can create many local clusters when processing complex data. To address this problem, the paper puts forward the algorithm Semi-supervised Affinity Propagation based on Density Peaks (SAP-DP).

Affinity Propagation
Affinity propagation is an efficient algorithm that is based on similarities between pairs of data points and considers all data points as potential cluster centres. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. Because of its simplicity, general applicability and performance, affinity propagation is expected to prove of broad value in science and engineering [25].
Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity s(i, k) indicates how well the data point with index k is suited to be the exemplar for data point i [1]. Each similarity is set to a negative squared error (Euclidean distance): for points x_i and x_k the similarity is:
$s(i, k) = -\lVert x_i - x_k \rVert^2$
A priori, all data points are taken as potential cluster centres. A data point with a larger value of s(k, k) is more likely to be chosen as an exemplar.
These values are referred to as preference parameters; they play an important role in determining the number of exemplars. If initially all data points are equally suitable as exemplars, the preferences should be set to a common value p; this value can be varied to produce different numbers of clusters. In most cases, this shared value can be the median of the input similarities.
$p = \mathrm{median}(s(:))$
During the iteration, two types of messages are exchanged between data points, and each takes into account a different kind of competition. Messages can be combined at any stage to decide which points are exemplars and, for every other point, which exemplar it belongs to. Fig. 1 shows affinity propagation illustrated for two-dimensional data points, where negative Euclidean distance was used to measure similarity. Each point is coloured according to the current evidence that it is a cluster centre (exemplar). The darkness of the arrow directed from point i to point k corresponds to the strength of the transmitted message that point i belongs to exemplar point k [1].
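The construction of the input similarities and the shared preference can be illustrated with a minimal sketch. It assumes points given as coordinate tuples and takes the median over the off-diagonal similarities only, which is one possible reading of median(s(:)); the function name `similarity_matrix` is hypothetical.

```python
from statistics import median


def similarity_matrix(points):
    """Return (s, p): negative squared Euclidean similarities and preference p."""
    n = len(points)
    s = [[-sum((a - b) ** 2 for a, b in zip(points[i], points[k]))
          for k in range(n)] for i in range(n)]
    # Assumption: the shared preference is the median of the off-diagonal
    # similarities; the diagonal s(k, k) is then set to this common value.
    off_diagonal = [s[i][k] for i in range(n) for k in range(n) if i != k]
    p = median(off_diagonal)
    for k in range(n):
        s[k][k] = p
    return s, p


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
s, p = similarity_matrix(points)
```

A smaller (more negative) p generally yields fewer exemplars, so p can be varied when a particular number of clusters is wanted.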

Figure 1 How affinity propagation works
The core of AP is the mutual transfer of two kinds of messages. The "responsibility" r(i, k), sent from point i to point k, reflects how well-suited point k is to serve as the exemplar for point i. The "availability" a(i, k), sent from point k to point i, reflects how appropriate it would be for point i to choose point k as its exemplar. From the viewpoint of evidence, the larger r(:, k) + a(:, k) is, the more likely point k is to become a final cluster centre.
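The messages are computed as follows:
$r(i, k) = s(i, k) - \max_{k' \neq k} \{ a(i, k') + s(i, k') \}$
$a(i, k) = \min \{ 0, \; r(k, k) + \sum_{i' \notin \{i, k\}} \max \{ 0, r(i', k) \} \}, \quad i \neq k$
$a(k, k) = \sum_{i' \neq k} \max \{ 0, r(i', k) \}$
In order to avoid oscillation, AP introduces a damping factor λ ∈ [0, 1) into the message updates. This paper selects λ = 0,5; t is the iteration index. With $\tilde{r}_{t+1}$ and $\tilde{a}_{t+1}$ denoting the values given by the update rules above, the damped messages are:
$r_{t+1}(i, k) = (1 - \lambda) \, \tilde{r}_{t+1}(i, k) + \lambda \, r_t(i, k)$
$a_{t+1}(i, k) = (1 - \lambda) \, \tilde{a}_{t+1}(i, k) + \lambda \, a_t(i, k)$
Each iteration of affinity propagation consists of (i) updating all responsibilities given the availabilities, (ii) updating all availabilities given the responsibilities, and (iii) combining availabilities and responsibilities to monitor the exemplar decisions, terminating the algorithm when these decisions have not changed for 10 iterations.
A decision matrix E is calculated after each update. The decision matrix E represents whether or not point i chooses point k as its exemplar.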
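The three steps of an AP iteration can be sketched in plain Python as below. This is an illustrative sketch rather than the authors' MATLAB code: it assumes a square similarity matrix `s` (list of lists, at least two points) whose diagonal already holds the preferences, applies the damping in place, and takes the decision for point i as the k that maximises r(i, k) + a(i, k), as in [1]. The names `ap_step` and `affinity_propagation` and the `max_iter` limit are assumptions.

```python
def ap_step(s, r, a, lam=0.5):
    """One damped AP iteration; r and a are updated in place."""
    n = len(s)
    # (i) Responsibilities, computed from the current availabilities.
    for i in range(n):
        for k in range(n):
            rival = max(a[i][j] + s[i][j] for j in range(n) if j != k)
            r[i][k] = (1 - lam) * (s[i][k] - rival) + lam * r[i][k]
    # (ii) Availabilities, computed from the updated responsibilities.
    for k in range(n):
        pos = [max(0.0, r[i][k]) for i in range(n)]
        support = sum(pos) - pos[k]  # sum over all i' != k
        for i in range(n):
            if i == k:
                new = support
            else:
                new = min(0.0, r[k][k] + support - pos[i])
            a[i][k] = (1 - lam) * new + lam * a[i][k]


def affinity_propagation(s, lam=0.5, max_iter=200, stable_iter=10):
    """Return, for every point i, the index of the exemplar it chooses."""
    n = len(s)
    r = [[0.0] * n for _ in range(n)]
    a = [[0.0] * n for _ in range(n)]
    previous, unchanged = None, 0
    for _ in range(max_iter):
        ap_step(s, r, a, lam)
        # (iii) Point i chooses the k that maximises r(i, k) + a(i, k).
        choice = [max(range(n), key=lambda k: r[i][k] + a[i][k])
                  for i in range(n)]
        unchanged = unchanged + 1 if choice == previous else 0
        previous = choice
        if unchanged >= stable_iter:
            break
    return choice
```

The nested loops cost O(n^3) per iteration and are kept only for readability; a practical implementation would vectorise the two updates.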

Clustering by fast search and find of Density Peaks (DP)
The DP algorithm is based only on the distances between data points. It is able to detect nonspherical clusters and to automatically find the correct number of clusters [13]. For each point i, the DP algorithm computes two quantities: its local density ρ_i and its distance δ_i from points of higher density. Both quantities depend only on the distances d_ij between data points, which are assumed to satisfy the triangle inequality. The local density ρ_i of data point i is defined as:
$\rho_i = \sum_{j} \chi(d_{ij} - d_c)$
where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, and d_c is a cut-off distance. Basically, ρ_i is equal to the number of points that are closer than d_c to point i. δ_i is measured by computing the minimum distance between point i and any other point with higher density; for the point with the highest density, δ_i = max_j(d_ij) is taken:
$\delta_i = \min_{j : \rho_j > \rho_i} d_{ij}$, and $\delta_i = \max_{j} d_{ij}$ for the point of highest density.
Generally, one can choose d_c so that the average number of neighbours (τ) is around 1 to 2 % of the total number of points in the data set. DP takes only the points of high δ_i and relatively high ρ_i as cluster centres. After the cluster centres have been found, each remaining point is assigned to the same cluster as its nearest neighbour of higher density.
This observation, which is the core of the algorithm, is illustrated by the simple example in Fig. 2 and Fig. 3. Fig. 2 shows 28 points embedded in a two-dimensional space. The density maxima are at points 1 and 10, so they are identified as cluster centres. Fig. 3 shows the plot of δ_i as a function of ρ_i for each point. The values of δ for points 9 and 10, which have similar values of ρ, are very different: point 9 belongs to the cluster of point 1, and several other points with a higher ρ are very close to it, whereas the nearest neighbour of higher density of point 10 belongs to another cluster. Hence, as anticipated, the only points of high δ and relatively high ρ are the cluster centres. Points 26, 27 and 28 have a relatively high δ and a low ρ because they are isolated; they can be considered as clusters composed of a single point, namely, outliers [13].
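The DP quantities and the assignment rule can be sketched as follows. The sketch makes several assumptions that are not stated in the text: Euclidean distances, ties in density broken by point index, the densest point always kept as a centre, and the remaining centres picked by the largest product γ = ρ·δ. The function name `density_peaks` and the parameter `n_centres` are hypothetical.

```python
from math import dist


def density_peaks(points, d_c, n_centres):
    """Return (rho, delta, centres, labels) for a list of coordinate tuples."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    # rho_i: number of other points closer than the cut-off distance d_c.
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < d_c)
           for i in range(n)]
    # Decreasing density; equal densities are ordered by index (assumption).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    parent = [-1] * n  # nearest neighbour of higher density
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(d[i])  # convention for the densest point
        else:
            parent[i] = min(order[:rank], key=lambda j: d[i][j])
            delta[i] = d[i][parent[i]]
    # Assumption: centres are the densest point plus the largest gamma values.
    gamma = [rho[i] * delta[i] for i in range(n)]
    rest = sorted(order[1:], key=lambda i: -gamma[i])
    centres = [order[0]] + rest[:n_centres - 1]
    labels = [-1] * n
    for cluster_id, c in enumerate(centres):
        labels[c] = cluster_id
    # Each remaining point joins the cluster of its denser nearest neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return rho, delta, centres, labels
```

Because points are processed in order of decreasing density, the parent of every non-centre point is already labelled when that point is reached.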

Semi-supervised clustering
A semi-supervised clustering method based on the AP algorithm has been proposed [14]. The algorithm uses two kinds of pairwise constraints: must-link, where the two data points must belong to the same cluster, i.e. M = {(x_i, x_j)}, and cannot-link, where the two data points should not be in the same cluster, i.e. C = {(x_i, x_j)} [2]. The detailed rules for updating the similarity matrix are as follows.
Step 1: For the data point pairs in the prior information that meet the must-link constraint, and for the pairs that newly satisfy the must-link constraint after logical extension, perform the must-link similarity update.
Step 2: For the data point pairs in the prior information that meet the cannot-link constraint, perform the cannot-link similarity update.
Step 3: Perform a global adjustment of the unknown data point pairs based on the shortest-path principle, using the results of steps 1 and 2. If there is a data point that connects to both data points of a pair pending adjustment, and the sum of the similarities between this data point and the two data points in the pair is greater than the similarity of the pair, update the similarity of the pair to that sum.
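The three update steps can be sketched as below. The concrete update values are an assumption: this sketch sets must-link pairs to similarity 0 and cannot-link pairs to negative infinity, a common convention in semi-supervised AP that may differ from the values used in the paper. The logical extension is approximated by the transitive closure of the must-link pairs, and the function name `apply_constraints` is hypothetical.

```python
def apply_constraints(s, must_link, cannot_link):
    """Return a copy of s adjusted by pairwise constraints (steps 1 to 3)."""
    n = len(s)
    s = [row[:] for row in s]
    # Logical extension: transitive closure of must-link via union-find.
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for i, j in must_link:
        root[find(i)] = find(j)
    fixed = set()
    # Step 1 (assumed value): must-link pairs get the maximum similarity 0.
    for i in range(n):
        for j in range(i + 1, n):
            if find(i) == find(j):
                s[i][j] = s[j][i] = 0.0
                fixed.add((i, j))
    # Step 2 (assumed value): cannot-link pairs get similarity -infinity.
    for i, j in cannot_link:
        s[i][j] = s[j][i] = float("-inf")
        fixed.add((min(i, j), max(i, j)))
    # Step 3: unconstrained pairs take the larger of their own similarity
    # and the sum of similarities through an intermediate point k.
    for k in range(n):
        for i in range(n):
            for j in range(i + 1, n):
                if (i, j) in fixed or k in (i, j):
                    continue
                via = s[i][k] + s[k][j]
                if via > s[i][j]:
                    s[i][j] = s[j][i] = via
    return s
```

The diagonal entries (the preferences) are left untouched, so the adjusted matrix can be passed directly to AP.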

Semi-supervised Affinity Propagation based on Density Peaks
The paper randomly selects 80 % of the data set as the training data and chooses a reasonable average number of neighbours (τ). First, DP clustering is used to obtain the constraint information; then the similarity matrix is updated by semi-supervised clustering; finally, AP is used to calculate the result. Here is the process of the proposed algorithm: step 1 is to initialize r(i, k), a(i, k)=0, λ=0.
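How the DP result might be turned into constraint information can be sketched as follows. The text does not spell out this step, so the rule used here is an assumption: two sampled points placed in the same DP cluster form a must-link pair, and two points in different DP clusters form a cannot-link pair. Both function names and the fixed `seed` are hypothetical.

```python
import random
from itertools import combinations


def sample_training_indices(n, fraction=0.8, seed=0):
    """Randomly pick a fraction of the point indices as training data."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), round(fraction * n)))


def constraints_from_labels(indices, labels):
    """Build (must_link, cannot_link) pairs from cluster labels.

    indices: indices of the sampled points in the full data set
    labels:  cluster label of each sampled point, in the same order
    """
    must_link, cannot_link = [], []
    for (i, label_i), (j, label_j) in combinations(zip(indices, labels), 2):
        if label_i == label_j:
            must_link.append((i, j))    # same DP cluster
        else:
            cannot_link.append((i, j))  # different DP clusters
    return must_link, cannot_link
```

The resulting pair lists are in the form expected by a similarity update of the kind described in the semi-supervised clustering section.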

Experimental results
We present a set of clustering experiments on six datasets, three synthetic datasets and three UCI datasets, as shown in Tab. 1. All experiments were performed with MATLAB 2012b on a computer with an Intel(R) Pentium 2.9 GHz processor, 4 GB RAM and a 500 GB hard drive.

F-measure (FM) index
The F-measure measures the accuracy of the clustering. It considers both the precision P and the recall R of the algorithm: P is the ratio of the number of correct results to the number of all returned results, and R is the ratio of the number of correct results to the number of results that should have been returned. P, R and F-measure (F) are defined as follows.
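As an illustration, precision and recall follow directly from the verbal definitions above. For F, this sketch assumes the balanced harmonic mean F = 2PR / (P + R), the most common form of the F-measure; the paper may use a differently weighted variant. The function name `f_measure` is hypothetical.

```python
def f_measure(n_correct, n_returned, n_expected):
    """Return (P, R, F) from counts of correct, returned and expected results."""
    precision = n_correct / n_returned if n_returned else 0.0
    recall = n_correct / n_expected if n_expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Assumption: balanced F-measure (harmonic mean of P and R).
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Example: 8 correct results out of 10 returned, 16 expected in total.
precision, recall, f = f_measure(8, 10, 16)
```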

Comparison and analysis of the results
We compared the performance of the proposed algorithm with AP and KMeans on three synthetic datasets and three UCI datasets. We tested the Silhouette index and the F-measure index of the three algorithms based on the true number of clusters. The results are shown below.
From Tab. 2 and Fig. 5 we can see that the clustering accuracy of the proposed SAP-DP algorithm, as measured by the FM index, is better than that of the two other algorithms. As for the clustering quality measured by the Silhouette index, SAP-DP obtains better results on the Iris, Heart and Aggregation datasets, while it performs poorly on the spiral, seeds and flame datasets. This indicates that the Silhouette index is sensitive to spherical data. The Silhouette results show that SAP-DP can effectively construct a similarity matrix and improve within-class compactness and inter-class separability, whereas KMeans and AP can only recognize spherical data, so their clustering trend is more obvious. The F-measure results show that the clustering accuracy of SAP-DP is clearly improved.
In order to compare the three algorithms intuitively, we chose three synthetic datasets and plotted their decision graphs and clustering results. Fig. 6 shows the decision graphs of the three datasets. Based on the principle of the DP algorithm, we choose the points with high δ and relatively high ρ as the cluster centres. The better the centres we choose, the more accurate the semi-supervised information we gain. Fig. 7 shows the value of gamma (γ) in decreasing order for the three synthetic datasets, which provides evidence for choosing the number of cluster centres. For example, the graph of γ for the aggregation dataset shows that the quantity starts growing anomalously below rank order number 7. Therefore, we performed the analysis using 7 centres.
Fig. 9 shows that aggregation is a composite spherical dataset with a complex structure. The original AP and KMeans have difficulty finding the right clustering, while SAP-DP obtains a more reasonable one. The F-measure index of AP is 0,5958 and that of KMeans is 0,7742, while that of SAP-DP is 0,9686. The Silhouette index of AP is 0,1623 and that of KMeans is 0,6058, while that of SAP-DP is 0,6345. Therefore, the proposed algorithm can process composite spherical data more effectively.
Figs. 10 and 11 show that the original AP and KMeans have difficulty finding the true number of clusters. When the preference is adjusted to yield the true number of clusters, the F-measure index of SAP-DP is higher than that of AP and KMeans, while the Silhouette index of SAP-DP is lower than that of the other two algorithms. The Silhouette index is sensitive to spherical data and gives poor results on nonspherical data. So Figs. 5 and 6 indicate that the proposed algorithm has the better clustering ability on nonspherical data.

Application of SAP-DP on seismic analysis
We chose seismic data from the China Earthquake Data Centre to test the feasibility of the proposed algorithm. Six measurement indexes are used in the test: Richter magnitude, epicentral intensity, number of earthquake victims, death toll, total casualties and direct economic loss (Tab. 3). Earthquake disasters in China are graded as general, moderate, severe and catastrophic. Because catastrophic earthquakes are rare, we used the first three grades to test the data, encoding "general", "moderate" and "severe" as "3", "2" and "1" respectively. As seen in Tab. 4, the actual grade and the test grade are basically identical; two samples are misestimated. The F-measure index of the result is 0,85. This shows that applying the proposed algorithm to seismic analysis is effective and that it provides a relatively effective research tool for earthquake classification.

Conclusion
Because the affinity propagation clustering algorithm cannot produce ideal clustering results on nonspherical data, a novel semi-supervised affinity propagation clustering algorithm based on density peaks was proposed in this paper. The proposed SAP-DP algorithm makes full use of the manifold clustering features of density peaks, identifies the potential manifold structure of complicated data, introduces the idea of semi-supervised learning, and builds pairwise constraints by clustering. The pairwise constraints are used to update the similarity matrix so that it reflects the similarity relationships more reasonably. AP is then executed on the updated matrix to obtain the result. Taken together, compared with the traditional AP algorithm, Semi-supervised Affinity Propagation based on Density Peaks has better accuracy and performance.

Figure 2 Point distribution in two dimensions

Figure 5 Comparison of clustering quality and accuracy

Table 1
Experimental datasets

Assume a data set with n samples is divided into k clusters C_i (i = 1, 2, …, k). Let a(t) be the average dissimilarity of sample t in C_j to all other samples in C_j, and let d(t, C_i) be the average dissimilarity of sample t in C_j to all samples in another cluster C_i; then b(t) = min{d(t, C_i)}, i = 1, 2, …, k, i ≠ j. The formula to calculate the Silhouette index Sil of sample t is:
$Sil(t) = \frac{b(t) - a(t)}{\max \{ a(t), b(t) \}}$
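The Silhouette index can be computed directly from the quantities a(t) and b(t) defined above, using the standard form Sil(t) = (b(t) - a(t)) / max{a(t), b(t)}. The sketch below assumes Euclidean distance as the dissimilarity, at least two clusters, a score of 0 for single-point clusters, and the mean over all samples as the overall index; the function name `silhouette` is hypothetical.

```python
from math import dist


def silhouette(points, labels):
    """Return the mean Silhouette index over all samples."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    scores = []
    for t, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(t): average dissimilarity to the other members of its own cluster.
        a_t = sum(dist(points[t], points[j]) for j in own if j != t) / (len(own) - 1)
        # b(t): smallest average dissimilarity to any other cluster.
        b_t = min(
            sum(dist(points[t], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b_t - a_t) / max(a_t, b_t))
    return sum(scores) / len(scores)


# Two well-separated pairs on a line give an index close to 1.
score = silhouette([(0.0,), (1.0,), (10.0,), (11.0,)], [0, 0, 1, 1])
```

Values near 1 indicate compact, well-separated clusters, which is why the index favours spherical data.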

Table 2
Comparison of clustering index

Table 3
The seismic data

Table 4
Result of the SAP-DP clustering