A DOMAIN INDEPENDENT METHODOLOGY FOR NEAR-DUPLICATE DETECTION

Document Info

Title:	A DOMAIN INDEPENDENT METHODOLOGY FOR NEAR-DUPLICATE DETECTION
Author(s):	Abdelaziz Fellah, Allaoua Maamir
ISBN:	978-989-8533-14-2
Editors:	Hans Weghorn and Pedro Isaías
Year:	2012
Edition:	Single
Keywords:	Duplicate detection, data cleaning, record linkage, edit distance.
Type:	Full Paper
First Page:	139
Last Page:	146
Language:	English
Paper Abstract:	We propose a new methodology for efficiently identifying near-duplicate records within a single data source and across multiple data sources. We describe a family of algorithms based on the well-tuned Monge-Elkan distance function, extended with an affine variant of the Smith-Waterman algorithm. We then present constant and variable thresholding algorithms that conceptually work in a divide-merge tree fashion, detecting near duplicates as hierarchical clusters along with their corresponding representatives. Experiments on several real and generated datasets show that our methodology is highly effective at detecting near duplicates, with a speedup comparable to the seminal work of Monge and Elkan.
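
The abstract names two building blocks: a Monge-Elkan style token-level similarity and a Smith-Waterman local alignment with affine costs. The following is a minimal Python sketch of how the two can be combined, not the authors' implementation. It assumes that "affine variant" refers to affine gap penalties (Gotoh-style recurrences). The function names, the scoring parameters (match, mismatch, gap_open, gap_extend), the whitespace tokenization, and the normalization by the shorter token length are all illustrative assumptions that do not come from the paper.

```python
NEG_INF = float("-inf")


def smith_waterman_affine(a, b, match=2.0, mismatch=-1.0,
                          gap_open=2.0, gap_extend=0.5):
    """Best local alignment score of a and b with affine gap costs.

    gap_open is the cost of the first gap character and gap_extend the
    cost of each further one (assumed values, not taken from the paper).
    """
    if not a or not b:
        return 0.0
    cols = len(b) + 1
    h_prev = [0.0] * cols      # best scores ending at the previous row
    f_prev = [NEG_INF] * cols  # vertical gap scores from the previous row
    best = 0.0
    for i in range(1, len(a) + 1):
        h_cur = [0.0] * cols
        f_cur = [NEG_INF] * cols
        e = NEG_INF            # horizontal gap score, reset on every row
        for j in range(1, cols):
            e = max(h_cur[j - 1] - gap_open, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0.0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def token_similarity(x, y, match=2.0):
    """Alignment score scaled to [0, 1] by the best score the shorter token could reach."""
    if not x or not y:
        return 0.0
    return smith_waterman_affine(x, y, match=match) / (match * min(len(x), len(y)))


def monge_elkan(record_a, record_b):
    """Average, over the tokens of record_a, of the best match found in record_b."""
    tokens_a = record_a.lower().split()
    tokens_b = record_b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    total = sum(max(token_similarity(x, y) for y in tokens_b) for x in tokens_a)
    return total / len(tokens_a)
```

Under these assumptions, two records would be flagged as near duplicates when monge_elkan(a, b) reaches a chosen threshold. Note that this form of the measure is not symmetric in its arguments, so a practical system would need to pick a direction or combine both. The thresholding and divide-merge clustering steps mentioned in the abstract are not sketched here, since the abstract gives no detail on them.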
