Technical gazette, Vol. 23 No. 1, 2016.
Original scientific paper
https://doi.org/10.17559/TV-20150126121041
A common framework of partition-based clustering for large scale dataset using sampling and its MapReduce implementation
Ran Jin
; (1) School of Computer Science and Information Technology, Zhejiang Wanli University, No. 8 South QianHu Road, Ningbo, Zhejiang, 315100, China / (2) College of Computer Science and Technology, Zhejiang University, No.38 Zheda Road, Hangzhou, Zhejiang, 310
Chunhai Kou
; School of Science, Donghua University No. 2999 North Renmin Road, Songjiang district, Shanghai, 201620, China
Ruijuan Liu
; School of Information Science and Technology, Donghua University, No. 2999 North Renmin Road, Songjiang district, Shanghai, 201620, China
Tao Guo
; School of Information Science and Technology, Donghua University, No. 2999 North Renmin Road, Songjiang district, Shanghai, 201620, China
Abstract
Clustering is one of the most significant tasks in data mining, and partition-based clustering algorithms such as k-means are among the most popular solutions. However, with the rapid development of cloud computing and big data, large-scale datasets have become a major challenge for clustering: execution of the clustering algorithm is too time-consuming, parameters are difficult to optimize, and the quality of the resulting clusters is poor. To this end, we propose a common framework for partition-based clustering algorithms such as k-means and design its MapReduce implementation. Specifically, to handle the representation of a large-scale dataset, we employ a sampling technique. Then, inspired by the k-means algorithm, we propose a common clustering procedure and provide a k-means-based implementation. Furthermore, we implement the proposed framework using the MapReduce programming model. Experiments show that our method is efficient for large-scale datasets.
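The pipeline outlined in the abstract (sample the dataset, cluster the sample with k-means, express each k-means iteration as a map step and a reduce step) can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans`, `cluster_with_sampling`), the uniform random sampling, the random choice of initial centroids, and the final full-dataset assignment pass are all assumptions made for illustration, and the map and reduce phases are simulated in plain Python rather than run on a MapReduce cluster.

```python
import random
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map step: emit (cluster index, (point, 1)) for every input point."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs, centroids):
    """Reduce step: sum the points of each cluster and return the new means.

    A cluster that received no points keeps its previous centroid.
    """
    sums = {}
    counts = defaultdict(int)
    for idx, (point, n) in pairs:
        if idx in sums:
            sums[idx] = [s + x for s, x in zip(sums[idx], point)]
        else:
            sums[idx] = list(point)
        counts[idx] += n
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means where each iteration is one map pass and one reduce pass."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    for _ in range(iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids


def cluster_with_sampling(points, k, sample_size, seed=0):
    """Cluster a uniform random sample, then label every point of the full dataset."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centroids = kmeans(sample, k, seed=seed)
    labels = [nearest(point, centroids) for point in points]
    return centroids, labels
```

In this sketch only the sample goes through the iterative part of the procedure, so its cost depends on `sample_size` rather than on the size of the full dataset; the full dataset is read once, in the final assignment pass. In a real MapReduce job the pairs emitted by `map_phase` would be grouped by key and distributed across reducers instead of being consumed by a single loop.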
Keywords
large scale dataset; MapReduce; partition-based clustering; sampling
Hrčak ID:
153152
Publication date:
19.2.2016.