Title:
|
A DOMAIN INDEPENDENT METHODOLOGY FOR NEAR-DUPLICATE DETECTION |
Author(s):
|
Abdelaziz Fellah, Allaoua Maamir |
ISBN:
|
978-989-8533-14-2 |
Editors:
|
Hans Weghorn and Pedro Isaías |
Year:
|
2012 |
Edition:
|
Single |
Keywords:
|
Duplicate detection, data cleaning, record linkage, edit distance. |
Type:
|
Full Paper |
First Page:
|
139 |
Last Page:
|
146 |
Language:
|
English |
Paper Abstract:
|
We propose a new methodology for efficiently identifying near-duplicate records within a single data source and across multiple data sources. We describe a family of algorithms based on the well-tuned Monge-Elkan distance function, extended with an affine variant of the Smith-Waterman algorithm. We then present constant and variable thresholding algorithms that conceptually work in a divide-merge tree fashion, detecting near duplicates as hierarchical clusters together with their corresponding representatives. Experiments on several real and generated datasets show that our methodology is highly effective at detecting near duplicates, with a speedup comparable to the seminal work of Monge and Elkan. |
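
To make the similarity measure named in the abstract more concrete, here is a minimal Python sketch of a Monge-Elkan style record comparison whose token-level score comes from Smith-Waterman local alignment with affine gap penalties. It is not the authors' implementation. It assumes that "affine variant" refers to affine gap penalties (a gap-opening cost plus a smaller gap-extension cost). The function names, the scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`), the whitespace tokenization, and the normalization by the shorter token are all illustrative assumptions, and the thresholding and clustering steps are not shown.

```python
def smith_waterman_affine(a, b, match=2.0, mismatch=-1.0,
                          gap_open=2.0, gap_extend=0.5):
    """Best local alignment score of a and b with affine gap penalties.

    gap_open is charged for the first character of a gap and gap_extend
    for each further character. All scoring values are illustrative.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0.0] * cols      # best score ending at (i-1, j)
    f_prev = [neg_inf] * cols  # best score ending in a vertical gap
    best = 0.0
    for i in range(1, len(a) + 1):
        h_cur = [0.0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # best score ending in a horizontal gap
        for j in range(1, cols):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open)
            f_cur[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores never drop below zero.
            h_cur[j] = max(0.0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def token_similarity(a, b, match=2.0):
    """Alignment score scaled to [0, 1] by the best score the shorter
    token could reach (an assumed normalization)."""
    if not a or not b:
        return 0.0
    return smith_waterman_affine(a, b, match=match) / (match * min(len(a), len(b)))


def monge_elkan(record_a, record_b):
    """Average, over tokens of record_a, of the best match in record_b."""
    tokens_a = record_a.lower().split()
    tokens_b = record_b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    total = sum(max(token_similarity(x, y) for y in tokens_b) for x in tokens_a)
    return total / len(tokens_a)
```

Under these assumptions, a pair of records could be flagged as near duplicates when `monge_elkan(r1, r2)` exceeds a chosen threshold. Note that the score is asymmetric in its two arguments, because the average runs over the tokens of the first record only.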