Automated Solution for Normalization of Duplicate Records from Multiple Data Sources

K. Jaya Sri, K. Ramachandra Rao


The volume of data in both the public and private domains has grown exponentially over the last decade. The main aim of this project is to identify duplicate records, that is, records that represent the same real-world entity, using a mechanism that requires no training data. The method is unsupervised, so no manual labeling is needed. Detecting records that are approximate duplicates across data sources is an important task, because querying multiple data sources returns duplicates. Duplicates arise when information is retrieved from different data sources with differing format specifications, and unintentional duplication can hardly be avoided when a data source is built from millions of records drawn from other sources. Data sources may also contain duplicate records that represent the same real-world entity because of data entry errors, abbreviations, and the detailed schemas of records from multiple data sources. Current duplicate detection techniques are supervised methods, which require training data. These methods are not applicable to the real-time scenario, in which the records to match are query results generated dynamically online. We present Dynamic Duplicate Detection: for a given query, the algorithm identifies duplicates among the query result records from multiple data sources. The proposed algorithm starts from the non-duplicate set and uses a weighted component similarity summing classifier and an OSVM classifier to iteratively identify duplicates in the query results. In addition to these two classifiers, which are used in the Unsupervised Duplicate Detection algorithm, a third classifier, called the Blocking Classifier, is used to help detect duplicate records. Experiments are conducted on a data set to verify the effectiveness of the algorithm in detecting duplicate records.
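
The weighted component similarity summing step mentioned above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the fixed field weights, the 0.85 threshold, the sample records, and the use of `difflib.SequenceMatcher` as the per-field similarity are all assumptions chosen for illustration, and the iterative weight adjustment, the OSVM classifier, and the Blocking Classifier are omitted.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities (weights assumed to sum to 1)."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())


def find_duplicates(records: list, weights: dict, threshold: float = 0.85) -> list:
    """Return index pairs whose weighted similarity reaches the assumed threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical query results merged from several data sources.
records = [
    {"title": "Data Mining Concepts", "author": "J. Han"},
    {"title": "data mining concepts", "author": "J. Han"},
    {"title": "Operating Systems", "author": "A. Silberschatz"},
]
weights = {"title": 0.6, "author": 0.4}  # assumed fixed weights, not learned

duplicate_pairs = find_duplicates(records, weights)
print(duplicate_pairs)
```

Here the first two records differ only in letter case and are flagged as a pair, while the third record is left alone. In the approach described in the abstract, the weights would be derived iteratively from the non-duplicate set rather than fixed by hand.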
