
IJET 2012 Vol.4(6): 750-754 ISSN: 1793-8236
DOI: 10.7763/IJET.2012.V4.477

Removing Fully and Partially Duplicated Records through K-Means Clustering

Bilal Khan, Azhar Rauf, Huma Javed, and Shah Khusro

Abstract—Record duplication is one of the prominent problems in data warehouses. It arises when multiple databases are integrated. This research focuses on identifying fully as well as partially duplicated records. In this paper we propose a de-duplicator algorithm based on numeric conversion of the entire data. For efficiency, the data mining technique of k-means clustering is applied to the resulting numeric values, which reduces the number of comparisons among records. To identify and remove the duplicated records, a divide and conquer technique is used to match records within a cluster, which further improves the efficiency of the algorithm.
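
As a rough illustration of the pipeline the abstract outlines (numeric conversion, clustering, then matching only within a cluster), the following is a minimal Python sketch. The abstract does not specify the numeric encoding, the number of clusters, the similarity rule, or the details of the divide and conquer matching step. Everything named below (`record_to_number`, `kmeans_1d`, `similarity`, `deduplicate`, the character-code sum, `k=2`, the 0.6 threshold, and the sample records) is therefore an assumption made for illustration, and a plain pairwise comparison stands in for the paper's divide and conquer matching.

```python
from collections import defaultdict


def record_to_number(record):
    """Map a record to one number: the sum of the character codes of its
    normalized fields (an assumed encoding, not the paper's)."""
    return sum(ord(ch) for field in record for ch in field.strip().lower())


def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's k-means on scalars; returns a cluster index per value."""
    ordered = sorted(values)
    # Deterministic start: centroids spread evenly over the sorted values.
    step = max(k - 1, 1)
    centroids = [ordered[(i * (len(ordered) - 1)) // step] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return labels


def similarity(a, b):
    """Fraction of fields that match after normalization."""
    same = sum(x.strip().lower() == y.strip().lower() for x, y in zip(a, b))
    return same / max(len(a), len(b))


def deduplicate(records, k=2, threshold=0.6):
    """Keep the first record of each group of full or partial duplicates."""
    labels = kmeans_1d([record_to_number(r) for r in records], k)
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    kept = []
    for members in clusters.values():
        survivors = []
        for i in members:
            # A record is compared only with survivors of its own cluster.
            if all(similarity(records[i], records[j]) < threshold
                   for j in survivors):
                survivors.append(i)
        kept.extend(survivors)
    return [records[i] for i in sorted(kept)]


records = [
    ("Ali Khan", "Peshawar", "1984"),
    ("ali khan", "peshawar", "1984"),   # full duplicate after normalization
    ("Ali Khan", "Peshawar", "1985"),   # partial duplicate: one field differs
    ("Sara Malik", "Islamabad City", "1990"),
]
unique = deduplicate(records)
```

In this sketch the three "Ali Khan" records receive nearly identical numeric values and fall into the same cluster, so they are compared with each other but never with the unrelated fourth record. That is the source of the reduced comparison count the abstract refers to. A partial duplicate whose numeric value differs widely from its counterpart could land in another cluster and be missed under this assumed encoding.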

Index Terms—Data cleansing, de-duplicator, partial duplication, K-Means clustering.

The authors are with the Department of Computer Science, University of Peshawar, Pakistan (e-mail: smbilal_84@yahoo.com).


Cite: Bilal Khan, Azhar Rauf, Huma Javed, and Shah Khusro, "Removing Fully and Partially Duplicated Records through K-Means Clustering," International Journal of Engineering and Technology, vol. 4, no. 6, pp. 750-754, 2012.
