Khaled El Emam (1), Ann Brown, Philip AbdelMalik.
(1) Children's Hospital of Eastern Ontario Research Institute, Pediatrics, Faculty of Medicine, University of Ottawa, 401 Smyth Road, Ottawa, ON K1H 8L1, Canada. kelemam@uottawa.ca
Abstract
OBJECTIVE: In public health and health services research, the inclusion of geographic information in data sets is critical. Because of concerns over the re-identification of patients, data from small geographic areas are either suppressed or the geographic areas are aggregated into larger ones. Our objective is to estimate the population size cut-off at which a geographic area is sufficiently large so that no data suppression or further aggregation is necessary.
DESIGN: The 2001 Canadian census data were used to conduct a simulation to model the relationship between geographic area population size and uniqueness for some common demographic variables. Cut-offs were computed for geographic area population size, and prediction models were developed to estimate the appropriate cut-offs.
MEASUREMENTS: Re-identification risk was measured using uniqueness. Geographic area population size cut-offs were estimated using the maximum number of possible values in the data set and a traditional entropy measure.
RESULTS: The model that predicted population cut-offs using the maximum number of possible values in the data set had R² values around 0.9, and relative error of prediction less than 0.02 across all regions of Canada. The models were then applied to assess the appropriate geographic area size for the prescription records provided by retail and hospital pharmacies to commercial research and analysis firms.
CONCLUSIONS: To manage re-identification risk, the prediction models can be used by public health professionals, health researchers, and research ethics boards to decide when the geographic area population size is sufficiently large.
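The measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' simulation. It assumes that "uniqueness" means the share of records whose combination of demographic values occurs exactly once in a geographic area, that the "maximum number of possible values" is the product of the number of categories of each variable, and that the "traditional entropy measure" is Shannon entropy. The function names and the toy records are hypothetical.

```python
from collections import Counter
from math import log2


def uniqueness(records):
    """Share of records whose value combination occurs exactly once (assumed definition)."""
    if not records:
        return 0.0
    counts = Counter(records)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(records)


def max_combinations(domain_sizes):
    """Product of per-variable category counts (assumed reading of 'maximum number of possible values')."""
    total = 1
    for size in domain_sizes:
        total *= size
    return total


def shannon_entropy(records):
    """Shannon entropy, in bits, of the observed value combinations (assumed entropy measure)."""
    counts = Counter(records)
    n = len(records)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Hypothetical records for one small geographic area: (sex, year of birth).
area_records = [("M", 1970), ("F", 1970), ("M", 1970), ("F", 1985)]

area_uniqueness = uniqueness(area_records)    # 2 of the 4 records are unique
area_entropy = shannon_entropy(area_records)
possible_values = max_combinations([2, 3])    # e.g. 2 sexes x 3 birth-year bands
```

Under these assumptions, a population size cut-off would be the smallest area population at which `uniqueness` falls below a chosen risk threshold. The abstract does not state the threshold used.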