
A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation.

Erel Joffe, Michael J Byrne, Phillip Reeder, Jorge R Herskovic, Craig W Johnson, Allison B McCoy, Dean F Sittig, Elmer V Bernstam.

Abstract

INTRODUCTION: Clinical databases require accurate entity resolution (ER). One approach is to use algorithms that assign questionable cases to manual review. Few studies have compared the performance of common algorithms for this task, and previous work has been limited by a lack of objective methods for setting algorithm parameters. We compared the performance of common ER algorithms using algorithmic optimization rather than manual parameter tuning, on both two-threshold classification (match/manual review/non-match) and single-threshold classification (match/non-match).
METHODS: We manually reviewed 20,000 randomly selected potential duplicate record-pairs to identify matches (10,000 in the training set, 10,000 in the test set). We evaluated the probabilistic expectation maximization, simple deterministic, and fuzzy inference engine (FIE) algorithms. We used particle swarm optimization to set algorithm parameters for a single threshold and for two thresholds. We ran 10 iterations of optimization using the training set and report averaged performance against the test set.
RESULTS: The overall estimated duplicate rate was 6%. The FIE and simple deterministic algorithms required a smaller manual review set than the probabilistic method (FIE 1.9%, simple deterministic 2.5%, probabilistic 3.6%; p<0.001). For a single threshold, the simple deterministic algorithm performed better than the probabilistic method (positive predictive value 0.956 vs 0.887, sensitivity 0.985 vs 0.887, p<0.001). ER with FIE classifies 98.1% of record-pairs correctly (1/10,000 error rate), assigning the remaining 1.9% to manual review.
CONCLUSIONS: Optimized deterministic algorithms outperform the probabilistic method. There is a strong case for considering optimized deterministic methods for ER.
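
The two-threshold scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`classify`, `evaluate`), the example scores, and the hand-picked thresholds are all hypothetical, and it assumes each record-pair has already been given a numeric similarity score by some algorithm. In the study the thresholds were set by particle swarm optimization, which is not shown here. The sketch also assumes that positive predictive value and sensitivity are computed over automatically classified pairs only, with manual-review pairs counted separately.

```python
from typing import Dict, Iterable, Tuple

MATCH, REVIEW, NON_MATCH = "match", "manual review", "non-match"


def classify(score: float, lower: float, upper: float) -> str:
    """Assign a pair to match, manual review, or non-match (hypothetical rule)."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return MATCH
    if score < lower:
        return NON_MATCH
    # Scores in [lower, upper) are the questionable cases sent to a human.
    return REVIEW


def evaluate(
    scored_pairs: Iterable[Tuple[float, bool]], lower: float, upper: float
) -> Dict[str, float]:
    """Summarize performance on (score, is_true_duplicate) pairs."""
    tp = fp = fn = review = total = 0
    for score, is_duplicate in scored_pairs:
        total += 1
        label = classify(score, lower, upper)
        if label == REVIEW:
            review += 1
        elif label == MATCH:
            if is_duplicate:
                tp += 1
            else:
                fp += 1
        elif is_duplicate:
            fn += 1  # a true duplicate automatically labeled non-match
    nan = float("nan")
    return {
        "review_fraction": review / total if total else nan,
        "ppv": tp / (tp + fp) if (tp + fp) else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
    }


# Made-up scores and gold-standard labels, for illustration only.
pairs = [
    (0.95, True), (0.90, True), (0.80, False), (0.50, True),
    (0.45, False), (0.20, False), (0.10, True), (0.05, False),
]
two_threshold = evaluate(pairs, lower=0.4, upper=0.7)
single_threshold = evaluate(pairs, lower=0.5, upper=0.5)
```

Setting `lower` equal to `upper` collapses the scheme to single-threshold classification (match/non-match), so the manual review fraction becomes zero. Widening the gap between the thresholds trades a larger manual review set for fewer automatic errors, which is the trade-off the reported review-set sizes quantify.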

Keywords:  Medical Record Linkage [N04.452.859.564.550]; Medical Records Systems, Computerized [L01.700.508.300.695]

Year: 2013 | PMID: 23703827 | PMCID: PMC3912727 | DOI: 10.1136/amiajnl-2013-001744

Source DB: PubMed | Journal: J Am Med Inform Assoc | ISSN: 1067-5027 | Impact factor: 4.497


