IMPROVED FRUIT FLY OPTIMIZATION ALGORITHM-BASED DENSITY PEAK CLUSTERING AND ITS APPLICATIONS

Original scientific paper
As a density-based algorithm, Density Peak Clustering (DPC) has the advantage of clustering by finding density peaks. However, the cut-off distance and the clustering centres have to be set at random, which influences the clustering outcome. Fruit flies find the best food by local and global searching; in the Fruit Fly Optimization Algorithm (FOA), the food found corresponds to the extreme value of the parameters being optimized. Because FOA searches rapidly and converges fast, it can compensate for the arbitrariness of DPC. An improved fruit fly optimization-based density peak clustering algorithm, FOA-DPC, is proposed. The FOA-DPC algorithm is more efficient and effective than the DPC algorithm. The results of seven simulation experiments on UCI data sets validate that the proposed algorithm not only has better clustering performance but also comes closer to the true number of clusters. Furthermore, FOA-DPC was applied to practical financial data analysis, where it also proved effective.


Introduction
Clustering technologies are becoming more and more important in the digital age. They extract and analyze information and knowledge from ever-increasing quantities of data. The clusters identified have high internal similarity and large distances between clusters [1]. Therefore, knowledge, patterns and models can easily be found by analyzing the clusters.
Clustering techniques broadly fall into partition clustering, hierarchical clustering, density-based clustering and grid-based clustering [2]. In density-based algorithms, clustering centres are characterized by a higher density than their neighbours and by a relatively large distance from points with higher densities [3]. Density Peaks Clustering (DPC) is a typical density-based algorithm, based on the distance between data points [4]. Like DBSCAN [5], DPC can detect nonspherical clusters and find the correct number of clusters automatically. DPC adopts the idea of local density maxima from mean-shift [6] and, from K-Medoids [7], the idea of relying only on the distance between data points.
However, the DPC algorithm still has some shortcomings. The cut-off distance and clustering centres have to be set by personal experience or at random, and the cut-off distance influences the clustering results: the larger the cut-off distance, the smaller the number of clusters. Finding the optimal results is a challenge because it depends entirely on subjective judgement. Several scholars have put forward ideas for improving the DPC algorithm [8÷17]. Swarm intelligence algorithms are a good way to optimize the cut-off distance and clustering centres. Meta-heuristic methods of this kind include the Particle Swarm Optimization Algorithm (PSO) [18], the Artificial Bee Colony Algorithm (ABC) [19], the Cuckoo Search Algorithm (CS) [20] and the Fruit Fly Optimization Algorithm (FOA) [21]. Among these optimization algorithms, FOA and its variants have been proved effective and have been applied in engineering and economics, for example to power load forecasting [22], PID controller parameter tuning [23], neural network parameter optimization [24], the multi-dimensional knapsack problem [25], the steelmaking casting problem [26] and financial distress [27].
Considering that the optimization algorithm selected should be efficient and easy to implement [28], this paper selects FOA, which has the advantages of fast convergence and rapid search, to optimize the cut-off distance and clustering centres in the DPC algorithm. The resulting FOA-DPC algorithm finds optimized clustering results quickly.
This paper is organized as follows: Section 2 briefly describes the DPC algorithm and introduces the variables used in the rest of the paper. Section 3 presents the idea and general flow of FOA. Section 4 introduces the improved algorithm FOA-DPC. Section 5 discusses the simulation experiments and analysis for FOA-DPC. Section 6 extends FOA-DPC to stock analysis. Section 7 draws conclusions and gives some remarks.

Clustering by fast search and finding of density peaks
In DPC [4], the distance between data points is the basis of the algorithm. It can identify the centres of nonspherical clusters and determine the number of clusters. The clustering centres are defined as local density maxima that are surrounded by data points of lower density and are far away from any other clustering centre. Two quantities are computed for each data point $i$: the local density $\rho_i$ and the distance $\delta_i$. The distances $d_{ij}$ between data points are assumed to satisfy the triangle inequality. The details of DPC are described as follows.
The DPC algorithm is sensitive only to $\rho_i$. The local density $\rho_i$ of data point $i$ is defined as:
$$\rho_i = \sum_{j} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases}$$
where $d_c$ is the cut-off distance. $\delta_i$ is measured by the minimum distance between point $i$ and any other point with higher density:
$$\delta_i = \min_{j:\, \rho_j > \rho_i} (d_{ij})$$
For the data point with the highest density, $\delta_i = \max_j (d_{ij})$ is taken. $d_c$ has to be chosen so that the average number of neighbours ($\tau$) is around 1 to 2 % of the number of points in the data set. Points with both high $\delta_i$ and high $\rho_i$ are chosen by DPC as clustering centres. Each remaining point is then assigned to the same cluster as its nearest neighbour of higher density.
The core of the DPC algorithm is illustrated by the simple example in Fig. 1 and Fig. 2. As shown in Fig. 1, there are 28 data points in a two-dimensional space. The density maxima are point 1 and point 10, so these two points are identified as clustering centres. Fig. 2 plots $\delta_i$ as a function of $\rho_i$ for each data point. The points with relatively high $\delta$ and high $\rho$ are the clustering centres. The points with relatively high $\delta$ and low $\rho$ are isolated and can be considered outliers.
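The two quantities described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' MATLAB implementation: the function name `dpc_rho_delta`, the use of Euclidean distance and the toy points are assumptions made for the example.

```python
import math

def dpc_rho_delta(points, d_c):
    """Return (rho, delta) lists for `points` using the hard cut-off kernel."""
    n = len(points)
    # Pairwise Euclidean distances d_ij (assumed metric for this sketch).
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho_i: number of other points closer than the cut-off distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distances to all points that are strictly denser than point i.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Points with the highest density take the largest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical toy data: a group of three points and a group of two points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
rho, delta = dpc_rho_delta(points, d_c=1.5)
print(rho)    # [2, 2, 2, 1, 1]
print(delta)
```

Points that combine a large `rho` with a large `delta` are the candidates for clustering centres, which is what the plot of δ against ρ in Fig. 2 makes visible.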

Fruit Fly Optimization Algorithm
The Fruit Fly Optimization Algorithm (FOA) was proposed as a swarm intelligence algorithm [29] based on the food-searching behaviour of fruit flies. Fruit flies have better sensing and perception abilities than other species, so they can pick up all kinds of scents floating in the air. Guided by the fruit fly with the best food flavour, the other flies approach the food selectively and continue a neighbourhood search around it. FOA models a swarm of a given number of fruit flies flying within the search space. Their food-finding behaviour has the advantages of fast convergence and rapid search. If FOA reaches the maximum number of inner iterations and the quality of the food has not improved, the corresponding fruit flies continue searching randomly. As soon as the number of iterations reaches the maximum threshold, the position with the highest validity index gives the cut-off distance value, and the clustering centres are identified. The algorithm is as follows:
Step 1: Initialize the location of the fruit fly swarm.
Step 2: Initialize the direction and distance with which each fruit fly searches for its favoured food by osphresis.
Step 3: Since the food location is not known, the distance to the origin ($Dist_i$) is estimated first, and the smell concentration judgment value ($S_i$) is then calculated as the reciprocal of the distance, where $(X_i, Y_i)$ is the position of fruit fly $i$:
$$Dist_i = \sqrt{X_i^2 + Y_i^2}, \qquad S_i = \frac{1}{Dist_i}$$
Step 4: Substitute the smell concentration judgment value ($S_i$) into the smell concentration judgment function to find the smell concentration ($Smell_i$) at the position of each fruit fly.
$$Smell_i = Function(S_i)$$
Step 5: Find the fruit fly with the maximal smell concentration among the fruit fly swarm.
$$[bestSmell \;\; bestIndex] = \max(Smell) \qquad (10)$$
Step 6: Keep the best smell concentration value and the corresponding x and y coordinates; the fruit fly swarm then flies towards that location by vision.
$$Smellbest = bestSmell$$
Step 7: Repeat Step 2 to Step 5 and judge whether the smell concentration of the current iteration is superior to that of the previous iteration. If it is, go to Step 6.
When the termination criterion is satisfied, FOA stops.
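The steps above can be sketched as a short Python loop. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `foa_maximize`, the initial location range, the step range of ±1 and the quadratic test function are all hypothetical choices made for the example.

```python
import math
import random

def foa_maximize(smell_fn, sizepop=10, max_iter=200, seed=0):
    """Basic FOA loop that maximizes smell_fn(S), where S = 1 / Dist."""
    rng = random.Random(seed)
    # Step 1: random initial swarm location (range is an assumption).
    x_axis, y_axis = rng.uniform(0, 1), rng.uniform(0, 1)
    best_smell, best_s = -math.inf, None
    for _ in range(max_iter):
        candidates = []
        for _ in range(sizepop):
            # Step 2: random direction and distance around the swarm location.
            x = x_axis + rng.uniform(-1, 1)
            y = y_axis + rng.uniform(-1, 1)
            # Step 3: distance to the origin and its reciprocal.
            dist = math.hypot(x, y)
            if dist == 0.0:
                continue  # the reciprocal is undefined at the origin
            s = 1.0 / dist
            # Step 4: smell concentration of this fly.
            candidates.append((smell_fn(s), s, x, y))
        if not candidates:
            continue
        # Step 5: fly with the maximal smell concentration.
        smell, s, x, y = max(candidates)
        # Steps 6 and 7: move the swarm only if the smell improved.
        if smell > best_smell:
            best_smell, best_s = smell, s
            x_axis, y_axis = x, y
    return best_s, best_smell

# Hypothetical smell function with its maximum at S = 2.
best_s, best_smell = foa_maximize(lambda s: -(s - 2.0) ** 2, seed=1)
print(best_s, best_smell)
```

Because the candidate value is the reciprocal of the distance to the origin, the basic scheme only produces positive values of S; searching another range requires rescaling S inside the smell function.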

Density Peak Clustering Algorithm Based on Fruit Fly Optimization
In the DPC algorithm, the cut-off distance and the clustering centres are set by experience or at random. Hence it is hard to find the optimal results, because the clustering centres need to be set manually. FOA is selected to optimize the cut-off distance and clustering centres in the DPC algorithm because it converges fast and searches rapidly. The resulting FOA-DPC algorithm finds optimized clustering results quickly. The proposed algorithm improves self-organization, requires fewer manually controlled parameters, and enhances global optimization ability.
The FOA-DPC algorithm has two parameters: the maximum number of iterations T and the size of the fruit fly swarm sizepop. The flow is as follows:
Step 1: Algorithm initialization. Set the range of $d_c$ to $(D_{\min}, D_{\max})$ and the range of the cluster number to $(0, C_{\max})$. Set the number of fruit flies sizepop = 10.
Step 2: Use formulas (5) and (6) to calculate the values of $d_c$ and the cluster number.
Step 3: Run the DPC algorithm and obtain the n Sil values by (10).
Step 4: Keep the position of the food for the fruit fly swarm in the search space.
Step 5: If the quality of the food has not improved for limit consecutive iterations, the other fruit flies fly towards the best food and search randomly for the next new food.
Step 6: The value of Sil is defined as the objective function. If it no longer changes or the maximum iteration number T is reached, the algorithm stops and outputs the cut-off distance $d_c$, the cluster number according to the best position, and the clustering results. Otherwise, return to Step 4 to continue updating.
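The flow above can be illustrated with a simplified Python sketch. Several points are assumptions and differ from the paper: the swarm searches directly in the ($d_c$, cluster number) space instead of using formulas (5) and (6), centres are ranked by the product ρ·δ, the cluster number is kept at 2 or more so that Sil is defined, and all function names and the toy data are hypothetical.

```python
import math
import random

def dpc_cluster(dist, d_c, k):
    """DPC labels from a distance matrix; centres ranked by rho * delta."""
    n = len(dist)
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Dense to sparse; ties broken by index so each later point has a parent.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta, parent = [0.0] * n, [-1] * n
    delta[order[0]] = max(dist[order[0]])
    for pos in range(1, n):
        i = order[pos]
        parent[i] = min(order[:pos], key=lambda j: dist[i][j])
        delta[i] = dist[i][parent[i]]
    rest = sorted(order[1:], key=lambda i: -rho[i] * delta[i])
    centres = [order[0]] + rest[:k - 1]
    labels = [-1] * n
    for c, i in enumerate(centres):
        labels[i] = c
    for i in order:  # a parent is always labelled before its children
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

def mean_silhouette(dist, labels):
    """Average Sil value over all samples (singletons count as 0)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(sum(dist[i][j] for j in m) / len(m)
                for l, m in clusters.items() if l != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

def foa_dpc(points, d_range, c_max, sizepop=10, T=30, seed=0):
    """Swarm search over (d_c, k) with the mean Sil value as the objective."""
    rng = random.Random(seed)
    dist = [[math.dist(p, q) for q in points] for p in points]
    d_min, d_max = d_range
    pos_d, pos_k = rng.uniform(d_min, d_max), rng.randint(2, c_max)
    best = (-math.inf, pos_d, pos_k, None)
    for _ in range(T):
        for _ in range(sizepop):
            # Each fly perturbs the swarm position (step sizes are assumptions).
            step = rng.uniform(-1, 1) * 0.1 * (d_max - d_min)
            d_c = min(d_max, max(d_min, pos_d + step))
            k = min(c_max, max(2, pos_k + rng.choice((-1, 0, 1))))
            labels = dpc_cluster(dist, d_c, k)
            sil = mean_silhouette(dist, labels)
            if sil > best[0]:
                best = (sil, d_c, k, labels)
        pos_d, pos_k = best[1], best[2]  # the swarm moves to the best food
    return best

# Hypothetical toy data: two well-separated square groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
sil, d_c, k, labels = foa_dpc(points, (1.2, 3.0), c_max=4)
print(k, labels)
```

On this toy data the search settles on two clusters, because splitting either square lowers the mean Sil value.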

Simulation experiments and analysis
The simulation environment was an Intel(R) Pentium 2.7 GHz processor with 4.00 GB memory and a 500 GB hard disk, running Windows 7, with MATLAB R2010a as the programming language.

Experimental data
To verify that the clustering accuracy of FOA-DPC was higher than that of the DPC algorithm, 2 UCI standard data sets and 5 synthetic data sets were selected for comparison. The experimental data are shown in Tab. 1.

Evaluating indicator
The Silhouette and F-measure indicators were used in this experiment to verify the effectiveness of the clustering results of the FOA-DPC algorithm.

Silhouette indicator
The silhouette indicator measures how similar an object is to its own cluster compared to other clusters. A data set D with n sample points is divided into k clusters $C_i$ (i = 1, 2, …, k). Let $a(t)$ be the average dissimilarity of sample $t$ to the other samples in its own cluster $C_j$, and let $d(t, C_i)$ be the average dissimilarity of sample $t$ to all samples in another cluster $C_i$; then $b(t) = \min\{d(t, C_i)\}$, where i = 1, 2, …, k and i ≠ j. The silhouette index Sil of sample $t$ is given by formula (14):
$$Sil(t) = \frac{b(t) - a(t)}{\max\{a(t), b(t)\}} \qquad (14)$$
The $Sil(t)$ value reflects both the compactness within clusters and the separability between clusters. The average of the $Sil(t)$ values of all samples reflects the quality of the clustering result. The greater the average Sil value, the more compact the classes and the better the quality of the clustering.
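A small worked example may help to make a(t), b(t) and Sil(t) concrete. The sketch below is illustrative only: the function name `silhouette_values`, the convention that a singleton cluster gets Sil = 0, and the one-dimensional toy data are assumptions.

```python
def silhouette_values(dist, labels):
    """Return Sil(t) for every sample, given a distance matrix and labels."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        others = [m for l, m in clusters.items() if l != lab]
        if not own or not others:
            values.append(0.0)  # singleton or single-cluster case
            continue
        # a(t): average dissimilarity to the rest of the sample's own cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b(t): smallest average dissimilarity to any other cluster.
        b = min(sum(dist[i][j] for j in m) / len(m) for m in others)
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

# Hypothetical one-dimensional samples in two clusters.
xs = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in xs] for p in xs]
sil = silhouette_values(dist, [0, 0, 1, 1])
print(sil)
print(sum(sil) / len(sil))  # average Sil value of the clustering
```

For the first sample, a = 1 and b = (10 + 11) / 2 = 10.5, so Sil = 9.5 / 10.5, which is close to 1 and indicates a compact, well-separated cluster.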

F-measure indicator
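The F-measure index is an external indicator. It measures clustering accuracy by combining the accuracy rate (precision) $P(i, j)$ with the recall rate $R(i, j)$. Let $P_j$ be a real class and $C_i$ a cluster produced by the algorithm. The accuracy rate and recall rate are calculated as in Eq. (15) and Eq. (16):
$$P(i, j) = \frac{|P_j \cap C_i|}{|C_i|} \qquad (15)$$
$$R(i, j) = \frac{|P_j \cap C_i|}{|P_j|} \qquad (16)$$
The F-measure of class $P_j$ with respect to cluster $C_i$ is
$$F(i, j) = \frac{2\, P(i, j)\, R(i, j)}{P(i, j) + R(i, j)}$$
The F-measure value of the whole clustering result is obtained as the weighted average of the best F-measure value of each class, where $n$ is the number of samples:
$$F = \sum_{j} \frac{|P_j|}{n} \max_{i} F(i, j)$$
The larger the F-measure value, the higher the clustering accuracy.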
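The weighted F-measure can be checked on a small example. The sketch below uses the common class-weighted definition (each real class takes its best-matching cluster); the function name `clustering_f_measure` and the toy labels are assumptions for illustration.

```python
def clustering_f_measure(true_labels, pred_labels):
    """Class-size-weighted average of the best F value of each real class."""
    n = len(true_labels)
    total = 0.0
    for j in set(true_labels):
        members_j = {t for t in range(n) if true_labels[t] == j}
        best = 0.0
        for i in set(pred_labels):
            members_i = {t for t in range(n) if pred_labels[t] == i}
            overlap = len(members_j & members_i)
            if overlap == 0:
                continue
            p = overlap / len(members_i)   # accuracy rate P(i, j)
            r = overlap / len(members_j)   # recall rate R(i, j)
            best = max(best, 2 * p * r / (p + r))
        total += len(members_j) / n * best
    return total

# Hypothetical labels: one sample of class 0 is placed in the wrong cluster.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]
f_value = clustering_f_measure(true_labels, pred_labels)
print(f_value)
```

Here class 0 is best matched by cluster 0 (P = 1, R = 2/3, F = 0.8) and class 1 by cluster 1 (P = 3/4, R = 1, F = 6/7), so the overall value is 0.5 · 0.8 + 0.5 · 6/7.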

Experimental results and analysis
A comparison of the FOA-DPC algorithm with the DPC algorithm through simulation experiments showed several advantages of FOA-DPC. The experiments used the data sets shown in Tab. 1.

Comparison of clustering accuracy among different algorithms for different data sets
Tab. 2 shows that the clustering results of the FOA-DPC algorithm were generally better than those of the DPC algorithm on the two clustering evaluation indexes. The Iris and Wine data sets were selected to test the capability of detecting clustering centres. The FM value of FOA-DPC was more than 10 % higher than that of DPC. On the Iris data set, the Sil value of FOA-DPC was lower than that of DPC; on the other data sets, the Sil values all increased.
This indicates that FOA can be combined with the DPC algorithm effectively. On the Spiral, Flame and Path based data sets, FOA-DPC also performed better than DPC. For Path based in particular, the FM value of FOA-DPC was 30 % higher than that of DPC. Meanwhile, the results on the D31 and R15 data sets demonstrate that FOA-DPC can detect clustering centres in data with a complicated structure.
Hence, FOA significantly improved the DPC algorithm, which requires parameters to be set manually. The algorithm based on the food-searching behaviour of fruit flies was used to enhance the global optimization ability of DPC. The Silhouette index is a method for interpreting and validating consistency within clusters of data. This paper introduced the Silhouette index to prevent the algorithm from falling into local extrema and to make the search for the optimal solution during foraging more reasonable. In this way the improved FOA-DPC algorithm enhanced the accuracy of the DPC algorithm. Different data sets were used to compare the clustering outcomes, and the results for selected data sets are plotted in Figs. 3, 4, 5 and 6.

Comparison of clustering results for different data sets
Tab. 3 shows the comparison of cluster numbers between the FOA-DPC and DPC algorithms. FOA-DPC found the real number of classes for all seven data sets, and DPC obtained the same results. This demonstrates that FOA-DPC can achieve the same clustering results without setting parameters manually, which reduces the complexity of using the algorithm considerably. Especially for problems with large data, the improved FOA-DPC algorithm would be more efficient and save much running time. The advantages of the improved algorithm are also confirmed in Figs. 1, 2, 3, 4 and 5. In most test data sets, FOA-DPC could be applied to calculate the true number of clusters. The swarm information was regarded as the knowledge used to guide the searching process. For the three large data sets Spiral, D31 and R15, FOA-DPC improved only slightly on the original DPC algorithm.

Applications of FOA-DPC on Stock Analysis
In order to test the feasibility of the FOA-DPC algorithm, we chose stock data from the SSE (Shanghai Stock Exchange in China). Four measurement indexes were used in this test: Net Asset Earning, Net Profit Ratio, Earning per Share and Sales Revenue per Share. The potential growth ability of a listed company was divided into three levels. The high level, labelled 3, represented Blue Chip Stocks. The middle level, labelled 2, represented Growth Stocks. The low level, labelled 1, represented Depressed Stocks. The procedure for applying the FOA-DPC algorithm to the analysis of the stock data was as follows:
Step 1: Data pre-processing. The sample stock data were selected from the SSE in 2014 and included 44 data points. The sample data were normalized to the range from 0 to 1 using the following formula:
$$y_i' = \frac{y_i - \min\{y_i\}}{\max\{y_i\} - \min\{y_i\}}, \quad i = 1, 2, 3, \dots$$
As seen in Tab. 4, the actual level and the test level were basically identical. Net asset earning and earning per share were the two most important indexes for measuring earning power. Level 1 represented the growth stocks, which had the ability to make profits except for latent growth. Level 2 meant the stationary stocks, which developed smoothly. Level 3 showed the blue chip stocks. The net asset earnings and earnings per share in level 3 were clearly higher than in the other two categories. The stock analysis showed that applying the proposed algorithm to stock analysis was effective. Hence, it provides a relatively effective research tool for the stock classification field.

Conclusion
On the basis of FOA and the DPC algorithm, an improved clustering algorithm, FOA-DPC, was proposed. In the DPC algorithm, the cut-off distance and clustering centres have to be set by experience or at random. Furthermore, the clustering results of DPC are sensitive to these parameters. In order to alleviate the parameter sensitivity of density peak clustering, the FOA-DPC algorithm computes the cut-off distance and clustering centres by introducing global searching ability into DPC. The Sil index was chosen as the objective function, so the two parameters found by FOA are the extreme values in the solution space.
The simulation results showed that the FOA-DPC algorithm calculated the true number of clusters more accurately than the DPC algorithm. They also demonstrated that the number of clusters found by FOA-DPC was consistent with the real number of clusters without setting parameters manually. The experiments exhibited the advantages of the FOA-DPC algorithm. This advantage reduces the complexity of using the algorithm considerably, especially for problems with large-scale data. The improved FOA-DPC algorithm proved to be more efficient and could save computing time. Both the clustering performance and the evaluation results were significantly improved. The FOA-DPC algorithm was then applied to real data analysis. By analyzing data from the Shanghai Stock Exchange in China, the effectiveness of the algorithm was confirmed again. However, there are still some difficulties in distinguishing boundary nodes from low-density nodes, which will be addressed in future work.

Figure 1 Point distribution in two dimensions
Figure 2

The experiments on the data sets in Tab. 1 tested the accuracy (FM) of the two clustering algorithms (FOA-DPC and DPC) and the clustering evaluation (Sil). The results of the experiment are shown in Tab. 2. Finally, we compared the cluster numbers obtained by the original DPC and by FOA-DPC with the real class number; the results are shown in Tab. 3.

Figure 4 Clustering result of D31: (a) DPC, $d_c$ = 1.9900; (b) FOA-DPC, $d_c$ = 1.4312
Figure 5 Clustering result of Flame: (a) DPC, $d_c$ = 1.8682; (b) FOA-DPC, $d_c$ = 2.1213
Figure 6 Clustering result of Path based: (a) DPC, $d_c$ = 1.1011; (b) FOA-DPC, $d_c$ = 2.2389

Step 2: Train the FOA-DPC algorithm. In the FOA-DPC algorithm, the cut-off distance of the DPC algorithm is tuned dynamically by FOA, which calculates the values of $d_c$ and the cluster number. In the simulation, the $d_c$ value and the cluster number were 0.5809 and 3, respectively. The final clustering results are shown in Tab. 4.

Table 1
Experimental data set

Table 2
Comparison of evaluation indexes of clustering results

Table 3
Comparison of cluster number

Table 4
Results of the FOA-DPC clustering