INDEPENDENT DE-DUPLICATION IN DATA CLEANING

Udechukwu, Ajumobi; Ezeife, Christie; Barker, Ken

Journal of Information and Organizational Sciences, Vol. 29 No. 2, 2005.

Original scientific paper

Ajumobi Udechukwu ; Dept. of Computer Science, University of Calgary, Canada
Christie Ezeife ; School of Computer Science, University of Windsor, Canada
Ken Barker ; Dept. of Computer Science, University of Calgary, Canada

Full text: English, PDF, 141 KB

pp. 53-68

Abstract

Many organizations collect large amounts of data to support their business and decision-making processes. The data originate from a variety of sources that may have inherent data-quality problems. These problems become more pronounced when heterogeneous data sources are integrated (for example, in data warehouses). A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying “equivalent” records within the database. Most published research on de-duplication proposes techniques that rely heavily on domain knowledge. A few others propose solutions that are partially domain-independent. This paper identifies two levels of domain-independence in de-duplication: domain-independence at the attribute level and domain-independence at the record level. The paper then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level, and a technique for field weighting by data profiling, which, when used with the positional algorithm, achieves domain-independence at the record level. Experiments show that the proposed techniques achieve more accurate de-duplication than the existing algorithms.
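
The two levels named in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not specify the positional algorithm or the profiling procedure, so the sketch substitutes a generic string matcher from Python's standard library at the attribute level, and it assumes that a field's weight can be derived from its share of distinct values at the record level. All function names, the weighting rule, and the 0.85 threshold are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def profile_weights(records, fields):
    """Derive field weights from the data itself (assumed profiling rule).

    Assumption: a field with a larger share of distinct non-empty values
    discriminates better between records, so it receives a larger weight.
    """
    raw = {}
    for f in fields:
        values = [r[f].strip().lower() for r in records if r[f].strip()]
        raw[f] = len(set(values)) / len(values) if values else 0.0
    total = sum(raw.values())
    if total == 0:
        # No usable values anywhere: fall back to equal weights.
        return {f: 1.0 / len(fields) for f in fields}
    # Normalize so the weights sum to 1.
    return {f: w / total for f, w in raw.items()}


def field_similarity(a, b):
    """Attribute-level match score in [0, 1].

    Stand-in matcher (difflib), NOT the paper's positional algorithm.
    """
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(r1, r2, weights):
    """Record-level score: weighted sum of the field similarities."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())


def find_duplicates(records, fields, threshold=0.85):
    """Return index pairs whose record similarity reaches the threshold.

    Compares every pair of records, which is quadratic and only suitable
    for small illustrative inputs.
    """
    weights = profile_weights(records, fields)
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((i, j))
    return pairs
```

Neither function relies on knowing what a field means (name, city, and so on), which is the sense in which such a scheme is domain-independent: the attribute matcher works on raw strings, and the weights come from profiling the data rather than from an expert.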

Keywords

Data cleaning; de-duplication; data quality; field-matching; record linkage

Hrčak ID:

78279

URI

https://hrcak.srce.hr/78279

Publication date:

21 December 2005
