
Original scientific paper

https://doi.org/10.17559/TV-20150126121041

A common framework of partition-based clustering for large scale dataset using sampling and its MapReduce implementation

Ran Jin ; (1) School of Computer Science and Information Technology, Zhejiang Wanli University, No. 8 South QianHu Road, Ningbo, Zhejiang, 315100, China / (2) College of Computer Science and Technology, Zhejiang University, No.38 Zheda Road, Hangzhou, Zhejiang, 310
Chunhai Kou ; School of Science, Donghua University No. 2999 North Renmin Road, Songjiang district, Shanghai, 201620, China
Ruijuan Liu ; School of Information Science and Technology, Donghua University, No. 2999 North Renmin Road, Songjiang district, Shanghai, 201620, China
Tao Guo ; School of Information Science and Technology, Donghua University, No. 2999 North Renmin Road, Songjiang district, Shanghai, 201620, China



page 25-33



Abstract

Clustering is one of the significant tasks in data mining, and partition-based clustering algorithms such as k-means are among the most popular solutions. However, with the rapid development of cloud computing and big data, large-scale datasets have become a major challenge for clustering: executing the clustering algorithm is too time-consuming, parameters are difficult to optimize, and cluster quality is poor. To this end, we propose a common framework for partition-based clustering algorithms such as k-means and design its MapReduce implementation. Specifically, to handle the representation of a large-scale dataset, we employ a sampling technique. Then, inspired by the k-means algorithm, we propose a common clustering procedure and provide a k-means-based implementation. Furthermore, we implement the proposed framework using the MapReduce programming model. Experiments show that our method is efficient for large-scale datasets.
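
The abstract gives no algorithmic detail, so the following is only a minimal single-process sketch of the general idea it describes: draw a sample, then iterate a k-means step split into a map phase (assign each point to its nearest centroid) and a reduce phase (average each cluster). Uniform sampling without replacement, squared Euclidean distance, and all function names (`sample`, `map_phase`, `reduce_phase`, `kmeans_on_sample`) are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import defaultdict


def sample(points, n, seed=0):
    """Uniform sample without replacement (an assumed sampling scheme)."""
    rng = random.Random(seed)
    return rng.sample(points, min(n, len(points)))


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (cluster index, point) pairs."""
    for p in points:
        yield nearest(p, centroids), p


def reduce_phase(pairs, centroids):
    """Reduce: average the points of each cluster into a new centroid."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    updated = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        updated[idx] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return updated


def kmeans_on_sample(points, k, sample_size, iterations=20, seed=0):
    """Run map/reduce k-means iterations on a sample of the data."""
    subset = sample(points, sample_size, seed)
    centroids = random.Random(seed).sample(subset, k)  # hypothetical init
    for _ in range(iterations):
        updated = reduce_phase(map_phase(subset, centroids), centroids)
        if updated == centroids:  # assignments are stable, stop early
            break
        centroids = updated
    return centroids
```

In an actual MapReduce job the map and reduce functions would run on separate workers over partitions of the data; here they are plain Python functions so the control flow is easy to follow.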

Keywords

large scale dataset; MapReduce; partition-based clustering; sampling

Hrčak ID:

153152

URI

https://hrcak.srce.hr/153152

Publication date:

19.2.2016.
