BIASED SAMPLING TECHNIQUES FOR CLUSTERING DATA OF DIFFERENT SIZES AND DENSITIES

Kittisak Kerdprasop; Nittaya Kerdprasop

Home

Digital Library

Visit Digital Library

Conference Proceedings

IADIS International Conference Intelligent Systems and Agents - ISA

IADIS International Conference Intelligent Systems and Agents 2010 (part of MCCSIS 2010)

Document Info

Title:	BIASED SAMPLING TECHNIQUES FOR CLUSTERING DATA OF DIFFERENT SIZES AND DENSITIES
Author(s):	Kittisak Kerdprasop, Nittaya Kerdprasop
ISBN:	978-972-8939-23-6
Editors:	António Palma dos Reis and Ajith P. Abraham
Year:	2010
Edition:	Single
Keywords:	Density biased sampling, Reservoir sampling, Zipf distribution, Sliding window algorithm, Data clustering, Erlang.
Type:	Full Paper
First Page:	51
Last Page:	58
Language:	English
Cover:
Full Contents:	click to dowload
Paper Abstract:	Clustering is a task of partitioning data into subgroups based on similarity or closest distance. A standard k-means algorithm groups data by firstly assigning all data points to the closest clusters, then determining the new cluster mean of each cluster based on the average value of its members. The algorithm repeats these two steps until it converges; that is until there is no change in cluster means and cluster assignment among data points. Performance of the algorithm depends on the number of data points, number of data clusters, and number of iterations. To speed up the clustering process, we investigate the density-biased reservoir sampling algorithms and then devise an efficient data reduction technique. Instead of simply randomly selecting data for clustering, we evaluate data density using a sliding window model and selectively draw samples with rejection criteria from the dense area. The proposed algorithm has been implemented with a functional programming paradigm using the Erlang language. The declarative style of Erlang facilitates a rapid prototyping and a concise coding, as shown in the paper. Our experimental results reveal the effectiveness of drawing samples of high density from the Zipf distributed data. The shift of cluster means is minimal, whereas the decrease in memory usage is significant.

	Go Back

Social Media Links

amazon

Search

Login

Top Visited