Ying Zhu (1), Yutaka Matsuyama (2), Yasuo Ohashi (3), Soko Setoguchi (4)
1. Department of Biostatistics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan. Electronic address: sophie@epistat.m.u-tokyo.ac.jp.
2. Department of Biostatistics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.
3. Department of Biostatistics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan; Department of Integrated Science and Engineering for Sustainable Society, Chuo University, Tokyo, Japan.
4. Duke Clinical Research Institute, Duke University School of Medicine, Durham, NC, United States; Department of Pharmacoepidemiology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.
Abstract
INTRODUCTION: When unique identifiers are unavailable, successful record linkage depends greatly on data quality and on the types of variables available. While probabilistic linkage theoretically captures more true matches than deterministic linkage by allowing for imperfection in identifiers, studies have shown inconclusive results, likely due to variations in data quality, implementation of the linkage methodology, and validation method. This simulation study aimed to understand the data characteristics that affect the performance of probabilistic vs. deterministic linkage.
METHODS: We created ninety-six scenarios that represent real-life situations using non-unique identifiers. We systematically varied discriminative power, rate of missing and error, and file size to increase the range of linkage patterns and difficulties. We assessed the difference in performance between the linkage methods using standard validity measures and computation time.
RESULTS: Across scenarios, deterministic linkage showed an advantage in PPV, while probabilistic linkage showed an advantage in sensitivity. Probabilistic linkage uniformly outperformed deterministic linkage, as the former generated linkages with a better trade-off between sensitivity and PPV regardless of data quality. However, with a low rate of missing and error in the data, deterministic linkage did not perform significantly worse. The implementation of deterministic linkage in SAS took less than 1 min, and probabilistic linkage took 2 min to 2 h depending on file size.
DISCUSSION: Our simulation study demonstrated that the intrinsic rate of missing and error in the linkage variables was key to choosing between linkage methods. In general, probabilistic linkage was the better choice, but for exceptionally good quality data (<5% error), deterministic linkage was the more resource-efficient choice.
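The trade-off between sensitivity and PPV described above can be illustrated with a small sketch. This is not the authors' SAS implementation, which the abstract does not specify. It assumes exact agreement on every field for the deterministic rule and a Fellegi-Sunter-style summed log-weight with a cutoff for the probabilistic rule, which is one common formulation. The field names, the m- and u-probabilities, the threshold, and the toy records are all hypothetical values chosen for illustration.

```python
import math
from itertools import product

FIELDS = ("surname", "birth_year", "sex", "zip")

# Hypothetical m-probabilities: P(field agrees | records are a true match).
M_PROB = {"surname": 0.95, "birth_year": 0.95, "sex": 0.98, "zip": 0.90}
# Hypothetical u-probabilities: P(field agrees | records are not a match).
U_PROB = {"surname": 0.01, "birth_year": 0.02, "sex": 0.50, "zip": 0.05}
THRESHOLD = 5.0  # assumed cutoff on the summed weight, not taken from the study


def deterministic_match(a, b):
    """Link only if every field is present and agrees exactly."""
    return all(a[f] is not None and a[f] == b[f] for f in FIELDS)


def match_weight(a, b):
    """Sum log2 agreement/disagreement weights; missing fields add nothing."""
    total = 0.0
    for f in FIELDS:
        if a[f] is None or b[f] is None:
            continue
        m, u = M_PROB[f], U_PROB[f]
        if a[f] == b[f]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def probabilistic_match(a, b):
    """Link if the summed weight reaches the assumed threshold."""
    return match_weight(a, b) >= THRESHOLD


def link(file_a, file_b, rule):
    """Compare every pair of records and keep the pairs the rule accepts."""
    return {(a["id"], b["id"]) for a, b in product(file_a, file_b) if rule(a, b)}


def sensitivity_ppv(links, truth):
    """Sensitivity = TP / true matches; PPV = TP / declared links."""
    tp = len(links & truth)
    sens = tp / len(truth) if truth else 0.0
    ppv = tp / len(links) if links else 0.0
    return sens, ppv


# Toy files (hypothetical): b2 has a missing zip, b3 has a transposed birth
# year, and b4 is a different person who resembles a3 on three of four fields.
file_a = [
    {"id": "a1", "surname": "sato", "birth_year": 1950, "sex": "F", "zip": "1130033"},
    {"id": "a2", "surname": "suzuki", "birth_year": 1962, "sex": "M", "zip": "1000001"},
    {"id": "a3", "surname": "tanaka", "birth_year": 1971, "sex": "F", "zip": "1500002"},
]
file_b = [
    {"id": "b1", "surname": "sato", "birth_year": 1950, "sex": "F", "zip": "1130033"},
    {"id": "b2", "surname": "suzuki", "birth_year": 1962, "sex": "M", "zip": None},
    {"id": "b3", "surname": "tanaka", "birth_year": 1917, "sex": "F", "zip": "1500002"},
    {"id": "b4", "surname": "tanaka", "birth_year": 1971, "sex": "F", "zip": "1500099"},
]
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

det_links = link(file_a, file_b, deterministic_match)
prob_links = link(file_a, file_b, probabilistic_match)
det_sens, det_ppv = sensitivity_ppv(det_links, truth)
prob_sens, prob_ppv = sensitivity_ppv(prob_links, truth)

print("deterministic: sensitivity", det_sens, "PPV", det_ppv)
print("probabilistic: sensitivity", prob_sens, "PPV", prob_ppv)
```

On this toy data, exact matching links only the error-free pair, so it misses the records with a missing or mistyped identifier but makes no false links. The weighted rule recovers all three true pairs at the cost of one false link to the look-alike record. Raising the assumed threshold would trade some of that sensitivity back for PPV.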
Authors: Suzanne Mason; Tony Stone; Richard Jacques; Jennifer Lewis; Rebecca Simpson; Maxine Kuczawski; Matthew Franklin Journal: Med Decis Making Date: 2022-05-14 Impact factor: 2.749
Authors: Sangeerthana Rajagopal; Scott J Booth; Terry P Brown; Chen Ji; Claire Hawkes; A Niroshan Siriwardena; Kim Kirby; Sarah Black; Robert Spaight; Imogen Gunson; Samantha J Brace-McDonnell; Gavin D Perkins Journal: BMJ Open Date: 2017-11-20 Impact factor: 2.692
Authors: Daniela Almeida; David Gorender; Maria Yury Ichihara; Samila Sena; Luan Menezes; George C G Barbosa; Rosimeire L Fiaccone; Enny S Paixão; Robespierre Pita; Mauricio L Barreto Journal: BMC Med Inform Decis Mak Date: 2020-07-25 Impact factor: 2.796
Authors: Long Nguyen; Mark Stoové; Douglas Boyle; Denton Callander; Hamish McManus; Jason Asselin; Rebecca Guy; Basil Donovan; Margaret Hellard; Carol El-Hayek Journal: J Med Internet Res Date: 2020-06-24 Impact factor: 5.428
Authors: Alan F Karr; Matthew T Taylor; Suzanne L West; Soko Setoguchi; Tzuyung D Kou; Tobias Gerhard; Daniel B Horton Journal: PLoS One Date: 2019-09-24 Impact factor: 3.240