
A machine learning approach to create blocking criteria for record linkage.

Phan H Giang

Abstract

Record linkage, a part of data cleaning, is recognized as one of the most expensive steps in data warehousing. Most record linkage (RL) systems employ blocking filters to reduce the number of pairs to be matched. A blocking filter consists of a number of blocking criteria. Until recently, blocking criteria were selected manually by domain experts. This paper proposes a new method to automatically learn efficient blocking criteria for record linkage. Our method addresses the lack of sufficient labeled data for training. Unlike previous work, we do not consider a blocking filter in isolation but in the context of an accompanying matcher that is applied after the blocking filter. We show that, given such a matcher, the labels (assigned to record pairs) that are relevant for learning are those assigned by the matcher (link/nonlink), not those assigned objectively (match/unmatch). This conclusion allows us to generate an unlimited amount of labeled data for training. We formulate the problem of learning a blocking filter as a Disjunctive Normal Form (DNF) learning problem and use Probably Approximately Correct (PAC) learning theory to guide the development of an algorithm that searches for blocking filters. We test the algorithm on a real patient master file of 2.18 million records. The experimental results show that, compared with filters obtained by educated guess, the optimal learned filters have comparable recall but reduce runtime by an order of magnitude.
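
To make the idea of a blocking filter as a DNF over blocking criteria more concrete, here is a minimal Python sketch. It is not the paper's algorithm: the PAC-guided search is not reproduced, and a simple greedy covering heuristic stands in for it. The records, the field names (`last`, `first`, `dob`, `zip`), the toy `matcher` rule, and the helpers `passes` and `learn_filter` are all hypothetical. The one point it is meant to illustrate is that training labels are taken from the matcher (link/nonlink), so any pair of records can be labeled automatically.

```python
from itertools import combinations

FIELDS = ("last", "first", "dob", "zip")  # hypothetical field names

# Hypothetical toy records, invented for illustration only.
RECORDS = [
    {"last": "smith", "first": "john", "dob": "1970-01-01", "zip": "22030"},
    {"last": "smith", "first": "jon",  "dob": "1970-01-01", "zip": "22030"},
    {"last": "smyth", "first": "john", "dob": "1970-01-01", "zip": "22030"},
    {"last": "jones", "first": "mary", "dob": "1982-05-17", "zip": "22030"},
    {"last": "jones", "first": "mary", "dob": "1982-05-17", "zip": "10001"},
    {"last": "brown", "first": "alan", "dob": "1970-01-01", "zip": "22030"},
    {"last": "smith", "first": "anna", "dob": "1990-03-03", "zip": "60601"},
]

def matcher(a, b):
    """Stand-in matcher (an assumption): link when at least 3 of 4 fields agree."""
    return sum(a[f] == b[f] for f in FIELDS) >= 3

def passes(dnf, a, b):
    """A DNF filter admits a pair if any conjunction of field equalities holds."""
    return any(all(a[f] == b[f] for f in conj) for conj in dnf)

def learn_filter(pairs, labels, max_width=2):
    """Greedy cover: repeatedly add the conjunction that admits the fewest
    nonlink pairs per newly covered link, until every link is covered."""
    candidates = [c for w in range(1, max_width + 1)
                  for c in combinations(FIELDS, w)]
    links = {i for i, y in enumerate(labels) if y}
    # Indices of the pairs admitted by each candidate conjunction.
    admitted = {c: {i for i, (a, b) in enumerate(pairs)
                    if all(a[f] == b[f] for f in c)}
                for c in candidates}
    dnf, uncovered = [], set(links)
    while uncovered:
        best = None
        for c in candidates:
            new = len(admitted[c] & uncovered)
            if new == 0:
                continue
            cost = len(admitted[c] - links)  # nonlink pairs let through
            key = (cost / new, -new, len(c), c)
            if best is None or key < best[0]:
                best = (key, c)
        if best is None:
            break  # remaining links cannot be covered by any candidate
        dnf.append(best[1])
        uncovered -= admitted[best[1]]
    return dnf

pairs = list(combinations(RECORDS, 2))
labels = [matcher(a, b) for a, b in pairs]  # labels come from the matcher
dnf = learn_filter(pairs, labels)
kept = [p for p in pairs if passes(dnf, *p)]
recall = sum(matcher(a, b) for a, b in kept) / sum(labels)
print(dnf, len(kept), len(pairs), recall)
```

On this toy data the sketch selects the filter `first OR (last AND dob)`, which passes 3 of the 21 candidate pairs to the matcher while keeping every pair the matcher would link. A real system would evaluate candidate criteria on a sample of pairs rather than on all of them, which is where sample-size guarantees of the PAC kind become relevant.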

Year:  2014        PMID: 24777833     DOI: 10.1007/s10729-014-9276-0

Source DB:  PubMed          Journal:  Health Care Manag Sci        ISSN: 1386-9620


  5 in total

1.  Genotype phenotype mapping in RNA viruses - disjunctive normal form learning.

Authors:  Chuang Wu; Andrew S Walsh; Roni Rosenfeld
Journal:  Pac Symp Biocomput       Date:  2011

2.  A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation.

Authors:  Erel Joffe; Michael J Byrne; Phillip Reeder; Jorge R Herskovic; Craig W Johnson; Allison B McCoy; Dean F Sittig; Elmer V Bernstam
Journal:  J Am Med Inform Assoc       Date:  2013-05-23       Impact factor: 4.497

3.  Accuracy of probabilistic record linkage applied to health databases: systematic review.

Authors:  Daniele Pinto da Silveira; Elizabeth Artmann
Journal:  Rev Saude Publica       Date:  2009-09-25       Impact factor: 2.106

4.  The Department of Veterans Affairs, Department of Defense, and Kaiser Permanente Nationwide Health Information Network exchange in San Diego: patient selection, consent, and identity matching.

Authors:  Omar Bouhaddou; Jamie Bennett; Tim Cromwell; Graham Nixon; Jennifer Teal; Mike Davis; Robert Smith; Linda Fischetti; David Parker; Zachary Gillen; John Mattison
Journal:  AMIA Annu Symp Proc       Date:  2011-10-22

5.  Record linkage in Scotland and its applications to health research.

Authors:  Michael Fleming; Brad Kirby; Kay I Penny
Journal:  J Clin Nurs       Date:  2012-10       Impact factor: 3.036

  2 in total

1.  Foreward to special issue on health analytics.

Authors:  Farrokh Alemi
Journal:  Health Care Manag Sci       Date:  2014-10-09

2.  A proficient cost reduction framework for de-duplication of records in data integration.

Authors:  Asif Sohail; Muhammad Murtaza Yousaf
Journal:  BMC Med Inform Decis Mak       Date:  2016-04-12       Impact factor: 2.796

