Romain Bey1, Romain Goussault2, François Grolleau1, Mehdi Benchoufi1, Raphaël Porcher1. 1. Centre of Research in Epidemiology and Statistics (CRESS), Université de Paris, French Institute of Health and Medical Research (INSERM), National Institute of Agricultural Research (INRA), Paris, France. 2. CIC 1413, Center for Research in Cancerology and Immunology Nantes-Angers (CRCINA), Dermatology Department, Centre Hospitalier Universitaire Nantes, Nantes University, Nantes, France.
Abstract
OBJECTIVE: We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs). MATERIALS AND METHODS: Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records. RESULTS: In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome. DISCUSSION: Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates. CONCLUSION: Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.
OBJECTIVE: We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs). MATERIALS AND METHODS: Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records. RESULTS: In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome. DISCUSSION: Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates. CONCLUSION: Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.
Authors: Allison B McCoy; Adam Wright; Michael G Kahn; Jason S Shapiro; Elmer Victor Bernstam; Dean F Sittig Journal: BMJ Qual Saf Date: 2013-01-29 Impact factor: 7.035
Authors: Yves-Alexandre de Montjoye; Sébastien Gambs; Vincent Blondel; Geoffrey Canright; Nicolas de Cordes; Sébastien Deletaille; Kenth Engø-Monsen; Manuel Garcia-Herranz; Jake Kendall; Cameron Kerry; Gautier Krings; Emmanuel Letouzé; Miguel Luengo-Oroz; Nuria Oliver; Luc Rocher; Alex Rutherford; Zbigniew Smoreda; Jessica Steele; Erik Wetter; Alex Sandy Pentland; Linus Bengtsson Journal: Sci Data Date: 2018-12-11 Impact factor: 6.444
Authors: Benoit-Marie Robaglia; Alban Lejeune; Michel Walter; Sofian Berrouiguet; Christophe Lemey Journal: J Med Internet Res Date: 2022-09-06 Impact factor: 7.076