A COMMON FRAMEWORK OF PARTITION-BASED CLUSTERING FOR LARGE SCALE DATASET USING SAMPLING AND ITS MapReduce IMPLEMENTATION

Original scientific paper

Clustering is one of the significant tasks in data mining, and partition-based clustering algorithms such as k-means are among the popular solutions. However, with the increasing development of cloud computing and big data, large scale datasets have become a big challenge for clustering: the execution of clustering algorithms is too time-consuming, the optimization of parameters is difficult, and the quality of clusters can be poor. To this end, in this paper we propose a common framework for partition-based clustering algorithms such as k-means, and design its MapReduce implementation. Specifically, in order to deal with the representation of large scale datasets, we propose to employ a sampling technique. Then, inspired by the k-means algorithm, we propose a common clustering procedure and provide a k-means based implementation. Furthermore, we implement the proposed framework using the MapReduce programming model. Experiments show that our method is efficient for large scale datasets.


Introduction
Clustering, also called unsupervised learning, is one of the significant tasks in data mining. It is defined as the process of grouping a set of objects into multiple groups such that objects within the same group are similar while objects across different groups are different [1]. The main challenges in clustering tasks are: (1) how to represent the whole dataset with as little data as possible while retaining enough information; and (2) how to measure the similarity between objects and define the cost function.
With the increasing development of cloud computing [2] and big data [3-5], large scale datasets have become a common source for clustering. In the face of large scale datasets, clustering analysis has the following issues: (1) the dataset is complicated, e.g., large scale, high-dimensional, non-linear, and skewed; (2) the execution of clustering algorithms is too time-consuming, and the optimization of parameters is difficult; (3) the quality of clusters is not good. To solve the above challenges, many researchers have proposed parallel and distributed clustering methods. For example, Feng et al. [6] proposed a parallel k-means algorithm based on MPI and applied it to a resume dataset. Kantabutra et al. [7] proposed a distributed version of k-means, which, however, dramatically increases the communication cost between nodes. Yang et al. [8] designed a cloud implementation of the SPRINT algorithm based on Hadoop.
In this paper, inspired by existing efforts, we propose a common framework for partition-based clustering algorithms such as k-means, and design its MapReduce [9] implementation. Specifically, our contributions are as follows: (1) we propose a common framework of partition-based clustering algorithms using sampling, and validate its effectiveness by implementing the k-means and k-medoids algorithms; (2) we modify the basic random sampling method into a partition-based method, to reduce the time cost of sampling on large scale datasets; (3) we provide an implementation with the MapReduce programming paradigm, and design the Map and Reduce procedures for each step; (4) we evaluate the efficiency of the proposed framework with the k-means and k-medoids implementations. Besides, we compare the performance to an MPI based implementation, with different sizes of datasets and different numbers of nodes.
The remainder of this paper is organized as follows. Section 2 provides related work. Section 3 presents the common framework of sampling based clustering, and Section 4 describes the MapReduce implementation. Experiments are conducted in Section 5. Finally, the paper is concluded in Section 6.

Related work
Common clustering algorithms include partition-based clustering, hierarchical clustering, density-based clustering, and others.
Partition-based clustering algorithms typically include k-means [10, 11] and k-medoids [12, 13]. K-means uses the average of the objects within a cluster as its reference point, while k-medoids uses the most centrally located object of a cluster. There are three requirements of partition-based clustering: (1) a distance between data objects as the similarity measurement; (2) a cost function to evaluate the quality of clustering results; and (3) the initial centroids and clusters.
Hierarchical clustering algorithms [14, 15] repeatedly split or aggregate data through a hierarchical structure, in order to form a hierarchical sequence of solutions. Their complexity is O(n^2), so they are applicable to small scale datasets. For example, CURE [16] uses a novel hierarchical strategy that chooses a fixed number of representative points and shrinks them toward the centre of the cluster by a fixed factor. Chameleon [17] is a dynamic hierarchical clustering algorithm: it first splits the data objects into relatively small groups based on a graph partitioning method, and then uses an agglomerative hierarchical clustering method to repeatedly find the real clusters.
Density-based clustering algorithms explore clusters of arbitrary shape based on the data density. For example, DBSCAN [18] can find clusters of any shape and also deal with noise. OPTICS [19] addresses the problem of widely varying local density across different clusters: instead of directly generating clustering results, it produces a hierarchical sequence of density-based clustering structures. Fraley and Hinneburg et al. [20, 21] proposed kernel density estimation methods, which study the data distribution using statistical methods without any prior knowledge.
Besides, there also exist other clustering algorithms. For example, Pileva et al. [22] proposed GCHL, a grid based clustering algorithm for large scale and high dimensional spatial databases. Tsai et al. [23] designed ACODF, a novel data clustering approach for data mining in large databases using ant systems. Andrew et al. [24] provided an analysis of spectral clustering. Kawaji et al. [25] proposed a graph-based clustering method that clusters protein sequences into families, automatically improving on the clusters of the conventional single linkage method. Some researchers proposed clustering analysis based on intelligent algorithms such as genetic algorithms [26] and particle swarm optimization [27]. Fuzzy clustering was also proposed in [28, 29].
In this paper, we focus on one of the most popular families of clustering algorithms, partition-based algorithms, and explore solutions for applying partition-based clustering to large scale datasets. Indeed, there exist some efforts on tailoring partition-based clustering algorithms using parallel and distributed solutions. For example, Tsoumakas and Dhillon et al. [30, 31] developed parallel versions of k-means on distributed memory multiprocessors using data parallelization. Manasi [32] proposed another parallel k-means that passes the centres of clusters between processors.
Forman et al. [33] proposed to pass only statistical variables to improve the efficiency of k-means. Kantabutra et al. [34] designed a distributed k-means called k-Dmeans. Zheng et al. [35] proposed DK-means, which modifies k-Dmeans to solve its massive communication problem. Later, Li et al. [36] proposed a P2P based grid distributed clustering method, called k-DmeansVM, which solves the single point of failure issue of k-Dmeans. Mao [37] introduced the Minimum-Maximum principle to modify the Canopy-k-means algorithm, and implemented it using the MapReduce framework.

Common framework of partition-based clustering using sampling
In this study, we leverage a sampling technique to deal with large scale datasets. As shown in existing works [16], [38-41], sampling can be used to accelerate clustering analysis in large scale dataset scenarios. However, naive random sampling can lead to poor clustering results.

Overview
As one of the most popular partition-based clustering algorithms, k-means uses a centroid to represent each cluster. Suppose the number of clusters is $K$, the number of data objects is $N$, and the number of dimensions is $d$. Given the dataset $D = \{x_1, x_2, \ldots, x_N\}$ and the clusters $C = \{C_1, C_2, \ldots, C_K\}$, where $C_j$ is the set of data objects that belong to cluster $j$ and $\mu_j$ is the centre of that cluster, suppose the Euclidean distance, denoted $\|\cdot\|$, is used to measure the distance between objects. k-means updates the centroids of the clusters and moves their members until the ideal clusters are found. The centroid is defined as the average of the data objects in the cluster:

$$\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x, \quad (1)$$

and the cost function is the sum of squared distances from the objects to their centroids:

$$J(C) = \sum_{j=1}^{K} \sum_{x \in C_j} \|x - \mu_j\|^2. \quad (2)$$

The objective is to minimize the cost function in Eq. (2), and the centroids are updated by Eq. (1) in each iteration.
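To make the procedure concrete, the following is a minimal Python sketch of the k-means loop described by Eqs. (1) and (2). It is written purely for illustration under our own naming; it is not the implementation used in the experiments.

    import numpy as np

    def kmeans(X, K, iters=100, seed=0):
        # X: (N, d) array of data objects; K: number of clusters.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), K, replace=False)]
        for _ in range(iters):
            # Assign each object to its nearest centroid (Euclidean distance).
            dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            # Update each centroid as the mean of its members (Eq. (1)).
            new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centroids[j] for j in range(K)])
            if np.allclose(new, centroids):
                break
            centroids = new
        # Final labels and cost function value (Eq. (2)).
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        cost = float((dist[np.arange(len(X)), labels] ** 2).sum())
        return centroids, labels, cost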
However, the result quality of k-means clustering is unstable, especially in large scale dataset scenarios. To this end, we employ a sampling method to adapt partition-based clustering to large scale dataset applications. The intuitive method is to randomly sample several partitions from the original large scale dataset, so that the clustering algorithm can be applied on each partition and the result is reliable and representative of the whole dataset. For example, suppose the original dataset has $K$ clusters; ideally, each partition should also have $K$ clusters. However, in some partitions the number of clusters can be less than $K$. Therefore, how to cluster each partition independently with an unknown number of clusters is one of the big challenges in this paper. The basic idea of partition-based clustering is: given some initial centroids and clusters, let the data objects approach the centres of the clusters based on some predefined rules, and then adjust until the clustering results remain stable and reasonable. Inspired by k-means, we design a common framework of partition-based clustering with sampling, as shown in Fig. 1. There are four main steps: (1) sampling the large scale dataset; (2) determining initial centroids using the sampled data; (3) updating the centroids; and (4) labelling all data objects with cluster IDs.

Sampling
As mentioned earlier, we want to sample smaller partitions such that all $K$ clusters are included in each partition. Suppose the original dataset $D$ consists of $K$ clusters $C_1, C_2, \ldots, C_K$, and the number of data objects in $C_i$ is $m_i$. We apply sampling $M$ times on $D$, and the number of data objects in each sample $D_i$ is $N_i$. The sampling satisfies the following requirements:

$$D_i \subset D, \quad N_i \ll N, \quad D_i \cap D_j = \emptyset \ (i \neq j), \quad (3)$$

that is, each sample is much smaller than the original dataset, and there exists no overlap between samples.
However, the basic random sampling method needs to traverse the whole dataset once for each of the $M$ samples, where $N$ is the size of the original dataset; therefore the complexity of random sampling is $O(MN)$. To reduce the cost of sampling, we use a partition-based random sampling method. Specifically, we split $D$ equally into $N_i$ partitions, where $N_i$ is the size of each sample, and then randomly select one record from each partition. The time cost of this modified random sampling is therefore $O(MN_i)$. Since $N_i \ll N$, we have $MN_i \ll MN$, so the sampling cost is dramatically reduced.
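A sketch of this partition-based sampling in Python follows. It assumes that $N$ is divisible by $N_i$ and that each partition holds at least $M$ records; the function name is ours.

    import random

    def partitioned_samples(D, Ni, M, seed=None):
        # Split D into Ni equal partitions; from each partition draw M
        # distinct records and route one to each of the M samples, so the
        # samples are pairwise disjoint and each has exactly Ni records.
        rng = random.Random(seed)
        step = len(D) // Ni                      # records per partition
        samples = [[] for _ in range(M)]
        for i in range(Ni):
            for s, offset in enumerate(rng.sample(range(step), M)):
                samples[s].append(D[i * step + offset])
        return samples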
Inspired by [16], the sample size $N_i$ can be estimated using a Chernoff bound: for a sample to contain at least $f \cdot m_i$ members of cluster $C_i$ (for a fraction $0 \le f \le 1$) with probability at least $1 - \delta$, it suffices that

$$N_i \ge fN + \frac{N}{m_i}\log\frac{1}{\delta} + \frac{N}{m_i}\sqrt{\left(\log\frac{1}{\delta}\right)^2 + 2 f m_i \log\frac{1}{\delta}}, \quad (4)$$

and, under uniform sampling, the probability that a sampled data object $d_{ij}$ also belongs to $C_i$ is calculated as $m_i / N$. (5)
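The bound in Eq. (4) can be evaluated directly. The helper below is our reading of the estimate adapted from [16] and should be treated as a reconstruction rather than the paper's exact formula.

    import math

    def min_sample_size(N, m_i, f=0.5, delta=0.05):
        # Smallest sample size that, with probability >= 1 - delta, captures
        # at least f * m_i members of a cluster of size m_i (Eq. (4)).
        log_d = math.log(1.0 / delta)
        return f * N + (N / m_i) * (log_d +
                                    math.sqrt(log_d ** 2 + 2 * f * m_i * log_d))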

Calculating initial centroids
Initial centroids are determined by clustering on the sampled dataset. However, the real centroids typically deviate from these initial ones. This can be corrected by updating the averages, as discussed in Section 3.4.
There are two steps in determining the initial centroids: (1) apply clustering in each sample; and (2) combine the results from all samples, as shown in Fig. 2. Note that if the number of clusters in a sample is actually less than $K$, some clusters are forced to split into several clusters so that there are always $K$ centroids in each sample.
For simplicity, we use k-means to describe the centroid calculation in the first step. Note that any simple clustering algorithm can be applied here, since the data scale has been dramatically reduced. After this step, we obtain $K \times M$ small clusters.
Next we need to combine the clustering results of all samples. Using the local centroid $\mu_{ij}$ to represent cluster $j$ of sample $i$, the global centroid $\mu_j$ is calculated as

$$\mu_j = \frac{\sum_{i=1}^{M} m_{ij}\,\mu_{ij}}{\sum_{i=1}^{M} m_{ij}}, \quad (6)$$

where $m_{ij}$ is the number of objects in cluster $j$ of sample $i$, and $M$ is the number of samples. Theorem: The combined clustering results of each partition are equivalent to the result of a single clustering over all the partitions.
Proof: The centroid of each cluster in the samples is calculated as

$$\mu_{ij} = \frac{1}{m_{ij}} \sum_{x \in C_{ij}} x, \quad (7)$$

where $C_{ij}$ is the $j$th cluster of the $i$th sample and $m_{ij}$ is its size. Since there is no overlap between the samples $D_i$, no data object is labelled twice with different cluster IDs. Substituting Eq. (7) into Eq. (6), we get

$$\mu_j = \frac{\sum_{i=1}^{M} \sum_{x \in C_{ij}} x}{\sum_{i=1}^{M} m_{ij}} = \frac{1}{m_j} \sum_{x \in C_j} x, \quad (8)$$

where $C_j = \bigcup_{i=1}^{M} C_{ij}$ is the set of all data objects in the global $j$th cluster, and $m_j = \sum_{i=1}^{M} m_{ij}$ is its size.
The left side of Eq. (8) is the result of combining the local clustering results of the samples, and the right side is the result of a single clustering on the whole dataset. Therefore, the partition-based method is equivalent to the single clustering algorithm on the whole dataset.
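In code, the combination of Eq. (6) is a size-weighted average of the local centroids. The sketch below assumes that local cluster $j$ of every sample has already been matched to the same global cluster $j$ (e.g., by nearest-centroid matching); the names are ours.

    import numpy as np

    def combine_centroids(local_centroids, local_sizes):
        # local_centroids: M arrays of shape (K, d) -- the mu_ij
        # local_sizes:     M arrays of shape (K,)   -- the m_ij
        weighted = sum(m[:, None] * mu
                       for mu, m in zip(local_centroids, local_sizes))
        total = sum(local_sizes)                 # shape (K,)
        return weighted / total[:, None]         # global mu_j, Eq. (6)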

Updating centroids
In the previous step, we determined the initial centroids based on the sampled dataset. However, since only the sampled data is used, the results cannot fully represent the whole dataset $D$. Therefore, in this step, we add the remaining data objects into the clusters and further update the centroids.
We assign each data object $x$ in the remaining dataset to the current clusters based on the minimum distance principle. That is,

$$c = \arg\min_{j} \|x - \mu_j\|, \quad (9)$$

where $\mu_j$ is the centroid of cluster $C_j$, and $c$ is the assigned cluster.
Once a new data object is assigned a cluster label, the centroid of the affected cluster is updated incrementally, until all data objects are processed:

$$\mu_c \leftarrow \frac{m_c\,\mu_c + x}{m_c + 1}, \qquad m_c \leftarrow m_c + 1. \quad (10)$$
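Eqs. (9) and (10) translate into a simple streaming update. A minimal sketch (names ours; centroids must be a float array):

    import numpy as np

    def assign_and_update(x, centroids, counts):
        # Nearest-centroid assignment (Eq. (9)) ...
        c = int(np.linalg.norm(centroids - x, axis=1).argmin())
        # ... then the incremental mean update (Eq. (10)):
        # mu_c <- (m_c * mu_c + x) / (m_c + 1)
        counts[c] += 1
        centroids[c] += (x - centroids[c]) / counts[c]
        return c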

Labeling data objects
Now the new centroids are computed for all clusters. We re-label every data object with a cluster ID based on the minimum distance rule in Eq. (9). We then consider the quality of the clustering results. Similar to the k-means algorithm, we define a cost function and try to minimize it:

$$\arg\min_{C} \sum_{j=1}^{K} \sum_{x \in C_j} \|x - \mu_j\|^2. \quad (11)$$
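The labelling rule of Eq. (9) and the cost of Eq. (11) can be computed in one pass; a minimal sketch with our own naming:

    import numpy as np

    def label_and_cost(X, centroids):
        # Label every object by the minimum-distance rule and accumulate
        # the sum of squared distances to the assigned centroids.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        cost = float((dist[np.arange(len(X)), labels] ** 2).sum())
        return labels, cost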

MapReduce implementation
In the previous sections, we discussed the common process of partition-based clustering using sampling. Although the proposed sampling based framework can handle large scale datasets to some extent, the computation is still sequential. In order to parallelize and distribute the whole clustering process, we employ MapReduce for the implementation.
MapReduce is a programming model for large scale parallel and distributed processing on clusters. Basically, there are two procedures in MapReduce: Map() and Reduce(). Typically, all data is processed in the form of key/value pairs. As shown in Fig. 3, the input component first reads data from splits. Then, the Map() procedure takes a series of key/value pairs and generates processed key/value pairs, which are allocated to particular reducers by a partition function. After data shuffling, the Reduce() procedure iterates through the values associated with a specific key and produces zero or more outputs.
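This data flow can be mimicked in a few lines of single-process Python. The toy runner below only illustrates the map-shuffle-reduce contract; it has nothing to do with Hadoop's actual machinery, and all names are ours.

    from itertools import groupby
    from operator import itemgetter

    def run_mapreduce(records, mapper, reducer):
        # Map: every record yields zero or more (key, value) pairs.
        pairs = [kv for rec in records for kv in mapper(rec)]
        # Shuffle: bring equal keys together.
        pairs.sort(key=itemgetter(0))
        # Reduce: one call per key over all its values.
        return [out
                for key, grp in groupby(pairs, key=itemgetter(0))
                for out in reducer(key, [v for _, v in grp])]

    # Example usage (word count):
    #   mapper  = lambda line: [(w, 1) for w in line.split()]
    #   reducer = lambda key, vals: [(key, sum(vals))]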
MapReduce model provides convenience to programmers so that only Map and Reduce procedures need to be implemented, while other details are handled by mature platforms such as Hadoop.

Figure 3 Illustration of MapReduce programming model
The MapReduce implementation of sampling based clustering is composed of four steps:
Step 1: perform sampling of the large scale dataset D on M Map nodes, and on each Map node perform k-means clustering.
Step 2: use the Reduce procedure to combine the results from the M nodes and compute the initial centroids.
Step 3: distribute D equally onto n nodes, and on each node (1) label data objects with cluster IDs, and (2) update centroids incrementally.
Step 4: combine the intermediate results from the n nodes, and compute the new centroids.
If the termination condition is not satisfied, Steps 3 and 4 are repeated. The overall MapReduce implementation is illustrated in Fig. 4 and sketched below; details of each step are presented in the following subsections.
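Putting the four steps together, the following single-process Python sketch mirrors the overall control flow (Steps 1-4 plus the convergence loop). It deliberately simplifies Step 2: instead of the weighted combination of Eq. (6), it pools the $K \times M$ local centroids and re-clusters them down to $K$. All names and parameter choices are ours.

    import numpy as np

    def _nearest(X, C):
        # Index of the nearest centroid for each row of X.
        return np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2).argmin(axis=1)

    def _lloyd(X, C, K, rounds=10):
        # A few plain k-means rounds, keeping empty clusters in place.
        for _ in range(rounds):
            lab = _nearest(X, C)
            C = np.array([X[lab == j].mean(axis=0) if (lab == j).any() else C[j]
                          for j in range(K)])
        return C

    def sampling_kmeans(D, K, M, Ni, n, iters=20, tol=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: draw M samples of size Ni and cluster each locally.
        pool = []
        for _ in range(M):
            S = D[rng.choice(len(D), Ni, replace=False)]
            pool.append(_lloyd(S, S[rng.choice(Ni, K, replace=False)], K))
        # Step 2 (simplified): re-cluster the K*M local centroids down to K.
        pool = np.vstack(pool)
        centroids = _lloyd(pool, pool[rng.choice(len(pool), K, replace=False)], K)
        # Steps 3-4: label D in n chunks, merge per-chunk sums/counts, iterate.
        for _ in range(iters):
            sums, cnts = np.zeros_like(centroids), np.zeros(K)
            for P in np.array_split(D, n):
                lab = _nearest(P, centroids)
                for j in range(K):
                    sums[j] += P[lab == j].sum(axis=0)
                    cnts[j] += (lab == j).sum()
            new = np.where(cnts[:, None] > 0,
                           sums / np.maximum(cnts, 1)[:, None], centroids)
            if np.linalg.norm(new - centroids) < tol:
                return new
            centroids = new
        return centroids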

Sampling
In this step, the original large scale dataset D is sampled and processed on each node independently. Since we have M samples, M nodes are used. Specifically, the sampling process is implemented as REDUCE_SAMPLING(), which randomly selects one row_id from each partition and decides its sample_id. Note that here the data is partitioned equally into $N_i$ partitions, as discussed in Section 3.2. The clustering on each sample is implemented as MAP_CLUSTERING(), as shown in Algorithm 1.
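Algorithm 1 itself is not reproduced here; the sketch below shows one plausible shape of the two procedures. The names REDUCE_SAMPLING and MAP_CLUSTERING come from the text above, but their bodies, arguments, and key/value conventions are our assumptions.

    import random
    import numpy as np

    def REDUCE_SAMPLING(partition_id, rows, M, seed=0):
        # Per data partition: pick M distinct rows (assumes len(rows) >= M)
        # and route row s to sample s, keeping the M samples disjoint.
        rng = random.Random(hash((seed, partition_id)))
        rows = list(rows)
        for sample_id, row in enumerate(rng.sample(rows, M)):
            yield sample_id, row

    def MAP_CLUSTERING(sample_id, sample, K):
        # Per sample: run local k-means (e.g., the kmeans() sketch from
        # Section 3.1) and emit local centroids plus cluster sizes.
        centroids, labels, _ = kmeans(np.asarray(sample), K)
        sizes = np.bincount(labels, minlength=K)
        yield sample_id, (centroids, sizes)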

Labelling data objects
Similar to the previous step, we distribute D to n nodes, and each node labels $N/n$ data objects. This step is similar to the MAP_DISTRIBUTE() procedure, but without computing local centroids, because if the termination condition is satisfied, this step is the last one and outputs the clusters as well as the data objects associated with their cluster IDs.
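As a sketch, the per-node labelling pass can be as simple as the following; the name MAP_LABEL and the (cluster_id, object) output convention are ours.

    import numpy as np

    def MAP_LABEL(chunk, centroids):
        # Emit (cluster_id, object) for each of the ~N/n objects on this
        # node; unlike MAP_DISTRIBUTE(), no local centroids are computed.
        for x in np.asarray(chunk):
            c = int(np.linalg.norm(centroids - x, axis=1).argmin())
            yield c, x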

Update centroids
If the results are not satisfactory, the centroids are updated in this step. First, the MAP_CENTROIDS() procedure computes centroids on each local node; then the REDUCE_CENTROIDS() procedure combines the results from all Map nodes and generates the final cluster IDs and centroids.
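This pair of procedures can be sketched as follows. The names come from the text, while the bodies are our assumptions: the map side emits per-cluster sufficient statistics (vector sum and count), and the reduce side turns them into the new global centroids.

    import numpy as np

    def MAP_CENTROIDS(chunk, centroids, K):
        # Per node: vector sum and member count for every cluster.
        X = np.asarray(chunk)
        lab = np.linalg.norm(X[:, None, :] - centroids[None, :, :],
                             axis=2).argmin(axis=1)
        for j in range(K):
            members = X[lab == j]
            if len(members):
                yield j, (members.sum(axis=0), len(members))

    def REDUCE_CENTROIDS(j, stats):
        # Combine per-node statistics into cluster j's new centroid.
        stats = list(stats)
        total = sum(n for _, n in stats)
        yield j, sum(s for s, _ in stats) / total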

Experiment
In this study, we use 4 PCs with 3.00 GHz Intel dual-core processors, 2 GB RAM and 160 GB disk storage for our MapReduce cluster. We assign one as the NameNode and JobTracker, and the remaining three as computing nodes. Each PC can serve as 2 nodes, so we have 8 nodes at maximum. We employ two common clustering algorithms, k-means and k-medoids, to implement our sampling based clustering framework using MapReduce. The dataset is collected from an online application. After pre-processing, each record has 10 dimensions, and the dataset size is represented as the number of records. For comparison, we also provide MPI implementations of both algorithms.

Tab. 1 lists the results of the different methods with different sizes of dataset, when 4 nodes are used. From Tab. 1 we have the following observations: (1) the basic k-means or k-medoids performs worst, because it is more suitable for small datasets on a single node; (2) for relatively small datasets, MPI based clustering is faster than the proposed method, because the processing logic behind Hadoop is complicated and therefore increases the overhead; (3) when the size of the dataset grows, the proposed method has the best performance. Therefore, our MapReduce based solution can efficiently deal with the large scale challenge.
As shown in Figs. 5 and 6, we evaluate the efficiency of the k-means and k-medoids implementations of the proposed method with different numbers of nodes. We can observe that: (1) the execution time generally decreases substantially when more nodes are deployed; and (2) when the size of the dataset is relatively small, the improvement of multi-node execution is unstable, as shown in Fig. 5, while for a large scale dataset the performance improves almost linearly, as shown in Fig. 6.
Besides, we also evaluate the accuracy of our clustering method using the SSE (Sum of Squared Errors) measure, calculated as

$$SSE = \sum_{j=1}^{K} \sum_{x \in C_j} \|x - \mu_j\|^2, \quad (12)$$

where $\|\cdot\|$ denotes the distance. As illustrated in Fig. 7, we plot four lines, for the 50M and 200M datasets with the k-means and k-medoids implementations of the proposed framework, respectively. We can see that when the number of nodes is less than 4, the SSE drops dramatically as more nodes are used; however, the SSE remains relatively stable when further nodes are added. Therefore, we conclude that, due to the overhead of parallel and distributed processing, it is not always the case that deploying more nodes makes the algorithm more effective. For example, the suggested number of nodes in this experimental setting is 4. In addition, we only implement k-means and k-medoids as examples of partition-based clustering; in future work, we will dive deeper and extend our solution to other partition-based algorithms.

Figure 1 Flowchart of partition-based clustering with sampling

Figure 2 Determining initial centroids

Figure 5 Execution time of proposed method with 50M dataset
Figure 6 Execution time of proposed method with 200M dataset

Figure 7 SSE measure of proposed clustering method

Moreover, we investigate the sampling cost. Fig. 8 gives the ratio of the sampling time cost to the total execution time. We have the following observations. (1) The k-means implementation has a smaller sampling cost ratio than the k-medoids implementation, because the total time cost of k-means is larger than that of k-medoids, while the sampling cost for a given data size is fixed. (2) The larger the dataset, the smaller the percentage of the sampling cost, because the total time cost increases with the dataset size. (3) The more nodes are involved, the smaller the ratio of the sampling cost. Since the sampling cost for a given data size remains stable, deploying more nodes introduces more extra cost, which leads to a decrease in the ratio of the sampling cost to the total cost.

Figure 8 Ratio of sampling cost to the total execution time

Table 1 Execution time (ms) of different methods