Digital Library


 
Title:      A DOMAIN INDEPENDENT METHODOLOGY FOR NEAR-DUPLICATE DETECTION
Author(s):      Abdelaziz Fellah, Allaoua Maamir
ISBN:      978-989-8533-14-2
Editors:      Hans Weghorn and Pedro Isaías
Year:      2012
Edition:      Single
Keywords:      Duplicate detection, data cleaning, record linkage, edit distance.
Type:      Full Paper
First Page:      139
Last Page:      146
Language:      English
Paper Abstract:      We propose a new methodology for efficiently identifying near-duplicate records within a single data source and across multiple data sources. We describe a family of algorithms based on the Monge-Elkan well-tuned distance function, extended with an affine variant of the Smith-Waterman algorithm. We then present constant and variable thresholding algorithms that conceptually work in a divide-merge tree fashion, detecting near duplicates as hierarchical clusters along with their corresponding representatives. Experiments on several real and generated datasets show that our methodology is highly effective at detecting near duplicates and achieves a speedup comparable to the seminal work of Monge and Elkan.
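
To illustrate the kind of string matching the abstract refers to, the following is a minimal sketch of Smith-Waterman local alignment with affine gap penalties, combined with the commonly cited token-level Monge-Elkan scheme (average, over the tokens of one field, of the best match in the other). It is not taken from the paper: the function names, the scoring constants (match, mismatch, gap_open, gap_extend), the lowercasing, and the normalization by the shorter string are all assumptions chosen for illustration.

```python
def smith_waterman_affine(a, b, match=2.0, mismatch=-1.0,
                          gap_open=-2.0, gap_extend=-0.5):
    """Best local alignment score of a and b with affine gap penalties.

    gap_open is the score of the first character in a gap and gap_extend
    the score of each further character (illustrative values only).
    """
    n, m = len(a), len(b)
    neg = float("-inf")
    # H: best local score ending at (i, j)
    # E: best score ending at (i, j) with a gap that skips characters of b
    # F: best score ending at (i, j) with a gap that skips characters of a
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a score never drops below zero.
            H[i][j] = max(0.0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def similarity(a, b, match=2.0):
    """Alignment score scaled to [0, 1] by the best score the shorter string allows."""
    if not a or not b:
        return 0.0
    score = smith_waterman_affine(a.lower(), b.lower(), match=match)
    return score / (match * min(len(a), len(b)))


def monge_elkan(a, b):
    """Average, over the tokens of a, of the best similarity to any token of b."""
    tokens_a, tokens_b = a.split(), b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    return sum(max(similarity(x, y) for y in tokens_b)
               for x in tokens_a) / len(tokens_a)
```

Under these assumptions, two records would be treated as near-duplicate candidates when monge_elkan exceeds a chosen threshold. Note that this token-level score is not symmetric in its arguments, so a real pipeline would need to fix an argument order or combine both directions.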
   
