INTRODUCTION: Clinical databases require accurate entity resolution (ER). One approach is to use algorithms that assign questionable cases to manual review. Few studies have compared the performance of common algorithms on this task. Furthermore, previous work has been limited by a lack of objective methods for setting algorithm parameters. We compared the performance of common ER algorithms using algorithmic optimization rather than manual parameter tuning, on two-threshold classification (match/manual review/non-match) as well as single-threshold classification (match/non-match).
METHODS: We manually reviewed 20,000 randomly selected potential duplicate record-pairs to identify matches (10,000 training set, 10,000 test set). We evaluated the probabilistic expectation maximization, simple deterministic, and fuzzy inference engine (FIE) algorithms. We used particle swarm optimization to set algorithm parameters for a single threshold and for two thresholds. We ran 10 iterations of optimization on the training set and report averaged performance on the test set.
RESULTS: The overall estimated duplicate rate was 6%. The FIE and simple deterministic algorithms required a smaller manual review set than the probabilistic method (FIE 1.9%, simple deterministic 2.5%, probabilistic 3.6%; p<0.001). For a single threshold, the simple deterministic algorithm performed better than the probabilistic method (positive predictive value 0.956 vs 0.887, sensitivity 0.985 vs 0.887, p<0.001). ER with FIE classified 98.1% of record-pairs correctly (1/10,000 error rate), assigning the remainder to manual review.
CONCLUSIONS: Optimized deterministic algorithms outperformed the probabilistic method. There is a strong case for considering optimized deterministic methods for ER.
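The distinction between single-threshold and two-threshold classification can be illustrated with a minimal sketch. This is not the study's implementation: the function names, the match scores, and the threshold values below are hypothetical, and the scoring algorithm itself (probabilistic, deterministic, or FIE) is assumed to have already produced one score per record-pair.

```python
from typing import Dict, List, Tuple

# Hypothetical input: (match score, manually reviewed truth) per record-pair.
ScoredPair = Tuple[float, bool]


def classify(score: float, lower: float, upper: float) -> str:
    """Two-threshold rule: match / manual review / non-match.

    Setting lower == upper reduces this to single-threshold classification.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "match"
    if score < lower:
        return "non-match"
    return "review"


def summarize(pairs: List[ScoredPair], lower: float, upper: float) -> Dict[str, float]:
    """Review rate, error rate, and PPV/sensitivity over auto-classified pairs."""
    tp = fp = fn = review = 0
    for score, is_duplicate in pairs:
        label = classify(score, lower, upper)
        if label == "review":
            review += 1
        elif label == "match":
            if is_duplicate:
                tp += 1
            else:
                fp += 1
        elif is_duplicate:  # labelled non-match, but truly a duplicate
            fn += 1
    n = len(pairs)
    return {
        "review_rate": review / n,
        "error_rate": (fp + fn) / n,
        # Pairs sent to manual review are excluded from both ratios.
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
    }


# Toy data, invented for illustration only.
toy_pairs: List[ScoredPair] = [
    (0.95, True), (0.90, True), (0.85, False), (0.60, True), (0.55, False),
    (0.20, False), (0.10, False), (0.05, False), (0.02, False), (0.01, False),
]

two_threshold = summarize(toy_pairs, lower=0.5, upper=0.8)
single_threshold = summarize(toy_pairs, lower=0.5, upper=0.5)
```

In this sketch, widening the gap between the two thresholds trades a larger manual review set for fewer automatic errors. That trade-off is what an optimizer such as particle swarm would tune against a training set.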
Keywords:
Medical Record Linkage [N04.452.859.564.550]; Medical Records Systems, Computerized [L01.700.508.300.695]