Controlling false match rates in record linkage using extreme value theory.

Murat Sariyar, Andreas Borg, Klaus Pommerening.

Abstract

Cleansing data of synonyms and homonyms is an important task in fields where high data quality is crucial, for example in disease registries and medical research networks. Record linkage provides methods for minimizing synonym and homonym errors, thereby improving data quality. We focus on homonym errors (denoted 'false matches' in the following), in which records belonging to different entities are wrongly classified as the same entity. Synonym errors ('false non-matches') occur when a single entity maps to multiple records in the linkage result. They are not considered in this study because in our application domain they are less critical than false matches. False match rates are frequently determined manually through clerical review, that is, without modelling the distribution of the false match rates a priori. An exception is the work of Belin and Rubin (1995) [4], who propose estimating the false match rate by means of a normal mixture model that needs training data for a calibration process. In this paper we present a new approach for estimating the false match rate within the framework of Fellegi and Sunter using methods of Extreme Value Theory (EVT). This approach needs no training data for determining the threshold for matches and therefore leads to a significant cost reduction. After giving two different definitions of the false match rate, we present the EVT tools used in this paper: the generalized Pareto distribution and the mean excess plot. Our experiments with real data show that the model works well, with only slightly lower accuracy than a procedure that has information about the match status and maximizes the accuracy.
Copyright © 2011 Elsevier Inc. All rights reserved.
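
The two EVT tools named in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function names `mean_excess` and `fit_gpd_moments` are hypothetical, the method-of-moments fit is a simple stand-in for whatever estimator the paper uses, and the exponential sample only imitates a column of comparison weights. The empirical mean excess e(u) is the average of (x - u) over all observations x > u; for data with a generalized Pareto tail it is approximately linear in u, which is what a mean excess plot is inspected for.

```python
import random
import statistics


def mean_excess(values, u):
    """Empirical mean excess e(u): the average of (x - u) over all x > u."""
    excesses = [x - u for x in values if x > u]
    if not excesses:
        raise ValueError("no observations above threshold u")
    return statistics.fmean(excesses)


def fit_gpd_moments(values, u):
    """Method-of-moments fit of a generalized Pareto distribution (GPD)
    to the excesses over u. Returns (xi, sigma) = (shape, scale).

    Assumption: the true shape satisfies xi < 1/2, so the variance is finite.
    Uses mean = sigma / (1 - xi) and mean^2 / variance = 1 - 2 * xi.
    """
    excesses = [x - u for x in values if x > u]
    if len(excesses) < 2:
        raise ValueError("need at least two observations above threshold u")
    m = statistics.fmean(excesses)
    s2 = statistics.variance(excesses)
    ratio = m * m / s2
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * m * (ratio + 1.0)
    return xi, sigma


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for comparison weights (assumption, not real data).
    weights = [rng.expovariate(1.0) for _ in range(5000)]
    # Points of a mean excess plot: (threshold, mean excess above it).
    for u in (0.5, 1.0, 1.5):
        print(u, mean_excess(weights, u))
    print(fit_gpd_moments(weights, 1.0))
```

For exponential data the mean excess is roughly constant in u and the fitted shape is close to zero; a clearly rising or falling mean excess curve would point to a heavier or lighter tail.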

Year:  2011        PMID: 21352952     DOI: 10.1016/j.jbi.2011.02.008

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317


  2 in total

1.  Discriminative Structure Learning of Bayesian Network Classifiers from Training Dataset and Testing Instance.

Authors:  Limin Wang; Yang Liu; Musa Mammadov; Minghui Sun; Sikai Qi
Journal:  Entropy (Basel)       Date:  2019-05-13       Impact factor: 2.524

2.  Mainzelliste SecureEpiLinker (MainSEL): privacy-preserving record linkage using secure multi-party computation.

Authors:  Sebastian Stammler; Tobias Kussel; Phillipp Schoppmann; Florian Stampe; Galina Tremper; Stefan Katzenbeisser; Kay Hamacher; Martin Lablans
Journal:  Bioinformatics       Date:  2022-03-04       Impact factor: 6.937

