Theoretical limits of microclustering for record linkage.

J E Johndrow, K Lum, D B Dunson.

Abstract

There has been substantial recent interest in record linkage, where one attempts to group the records pertaining to the same entities from one or more large databases that lack unique identifiers. This can be viewed as a type of microclustering, with few observations per cluster and a very large number of clusters. We show that the problem is fundamentally hard from a theoretical perspective and, even in idealized cases, accurate entity resolution is effectively impossible unless the number of entities is small relative to the number of records and/or the separation between records from different entities is extremely large. These results suggest conservatism in interpretation of the results of record linkage, support collection of additional data to more accurately disambiguate the entities, and motivate a focus on coarser inference. For example, results from a simulation study suggest that sometimes one may obtain accurate results for population size estimation even when fine-scale entity resolution is inaccurate.

Keywords:  Closed population estimation; Clustering; Entity resolution; Microclustering; Record linkage; Small clusters

Year: 2018  PMID: 29880978  PMCID: PMC5963577  DOI: 10.1093/biomet/asy003

Source DB: PubMed  Journal: Biometrika  ISSN: 0006-3444  Impact factor: 2.445


