Literature DB >> 33709065

Evaluating the utility of synthetic COVID-19 case data.

Khaled El Emam1,2,3, Lucy Mosquera3, Elizabeth Jonker2, Harpreet Sood4,5.   

Abstract

BACKGROUND: Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy protective manner.
OBJECTIVES: Evaluate the utility of synthetic data by comparing analysis results between real and synthetic data.
METHODS: A gradient boosted classification tree was built to predict death using Ontario's 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was developed on a synthesized dataset and compared to one from the original data.
RESULTS: The AUROC and AUPRC for the real data model were 0.945 [95% confidence interval (CI), 0.941-0.948] and 0.34 (95% CI, 0.313-0.368), respectively. The synthetic data model had AUROC and AUPRC of 0.94 (95% CI, 0.936-0.944) and 0.313 (95% CI, 0.286-0.342) with confidence interval overlap of 45.05% and 52.02% when compared with the real data. The most important predictors of death for the real and synthetic models were in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two data sets. Attribute disclosure risks were 0.0585, and membership disclosure risk was low.
CONCLUSIONS: This synthetic dataset could be used as a proxy for the real dataset.
© The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association.


Keywords:  data access; data sharing; data synthesis; synthetic data

Year:  2021        PMID: 33709065      PMCID: PMC7936723          DOI: 10.1093/jamiaopen/ooab012

Source DB:  PubMed          Journal:  JAMIA Open        ISSN: 2574-2531


LAY SUMMARY

There remains a strong need for sharing COVID-19 data with the research community. This study evaluates whether data synthesis can address that need. We created a synthetic version of the Ontario case database of 90 514 individuals who tested positive for SARS-CoV-2, using sequential decision trees as the synthesis method. A machine learning (gradient boosted trees) mortality prediction model was constructed using the synthetic data, and its accuracy and the relationships it detected were compared to the real data. The results of the real and synthetic data models were similar and the conclusions were the same. A privacy risk assessment on the synthetic data showed that the attribute and membership disclosure risks were low. We conclude that the synthetic version of the COVID-19 testing dataset can be shared more broadly as it has high utility and privacy characteristics.

INTRODUCTION

COVID-19 has created demand for an unprecedented level of data sharing with researchers, health care providers, and public health organizations. Even before the current pandemic, global health and funding agencies had been calling for greater sharing of public health data. To address the needs of global health research, data sharing must include clinical data collected from routine care as well as clinical trials. There is also significant potential value in using artificial intelligence methods to analyze COVID-19 data, but such analytic methods require large volumes of data, further amplifying the need for efficient and scalable data sharing mechanisms. Some organizations have already set up large-scale COVID-19 data sharing programs. For example, South Korea is providing access to 5 years of health insurance benefit claims for COVID-19 patients for research purposes through the Health Insurance Review and Assessment (HIRA) service, the national health insurer. The NIH is providing data through a secure enclave as part of the National COVID Cohort Collaborative (N3C). The Observational Health Data Sciences and Informatics (OHDSI) organization has made large datasets from participating organizations available through a federated analysis model. The Government of Ontario is similarly making population-level administrative and clinical databases available to the research community. However, privacy concerns have historically acted as a barrier to local and global sharing of public health data. These concerns are growing in the context of making COVID-19 data more accessible, and some governments have begun to reduce the amount of information being shared about COVID-19 cases. It is known that privacy concerns make individuals reluctant to seek care and adopt other privacy protective behaviors, and providers can be reluctant to report cases to public health authorities due to concerns about patient privacy, even in the context of a pandemic.
Privacy enhancing technologies can address this risk by creating databases with perturbed data that can be shared with a very small risk of identifying individual patients. Data synthesis is one approach for achieving that. It has long been recognized that synthetic data is a key approach for data dissemination, complementing more traditional disclosure control methods, and it has been highlighted as a key privacy enhancing technology to enable data access for the coming decade. A number of recent efforts have made large COVID-19 datasets available specifically through data synthesis. The Clinical Practice Research Datalink (CPRD) database in the UK has made available a COVID-19 symptoms and risk factors synthetic dataset based on primary care encounters in the UK. The NIH's N3C is also developing synthetic datasets for broader sharing with researchers. Multiple researchers and analyses have noted that synthetic data does not have an elevated identity disclosure (privacy) risk because there is no unique or one-to-one mapping between the records in the synthetic data and the records in the original data. Therefore, a key remaining question is whether a synthetic version of COVID-19 datasets can provide reasonably good data utility and act as a proxy for real data. If that is the case, then synthetic COVID-19 datasets can be shared more broadly for secondary analysis and research. This paper focuses on an assessment of the utility of a synthetic variant of the Ontario COVID-19 case dataset using a commonly applied data synthesis approach: sequential trees. Utility was defined as the ability to replicate, from the synthetic data, the patterns and analysis conclusions that were in the original data. Specifically, we evaluate the extent to which synthetic data can replicate the accuracy and functional relationships of a gradient boosted tree (GBT) classification model predicting death for 90 514 Ontario cases.

MATERIALS AND METHODS

The objective was to construct a prediction model of COVID-19 mortality in Ontario using the real data and compare that to the same model developed on the synthetic data. The outcome was a binary indicator of death over the study period. The predictors were individual and community variables reflecting factors that have been shown in the literature to affect COVID-19 mortality. The Supplementary Material contains a review of factors that have been found to affect COVID-19 mortality and that we consider in our analysis. Our primary analysis uses a machine learning technique. It has been argued, specifically in the context of COVID-19 mortality prediction, that machine learning models are better at fully using the information in clinical datasets compared to traditional regression methods.

Data set

The dataset we used was obtained on November 15, 2020 from Esri Canada's COVID-19 dashboard, which curates data collected from the Public Health Agency of Canada. The last case was reported on that day. The full dataset consisted of 306 816 Canadian cases. Because the values were incomplete for some provinces, our analysis focused only on Ontario, with 100 368 records at the time the data was obtained. The fields in that dataset are shown in Table 1.
Table 1.

Fields in the Canadian COVID-19 case dataset used for our study

Variables and definitions:
  Date reported: Number of days since January 1, 2020
  Health region: 34 unique regions
  Age group: Decades from 20 to 80+ (ordinal)
  Gender
  Exposure: Close contact, outbreak, travel, not reported
  Case status: Recovered, deceased, active
These case data were linked with community information for each of the health regions in Ontario. The variables related to the health region are shown in Table 2. These variables were also considered in our mortality prediction model. Following recommended practices, the selection of these predictors was informed by previous literature, and a literature review is provided in the Supplementary Appendix.
Table 2.

Fields included on the health region community

Variables and definitions:
  Proportion living in rural areas: Rural areas are defined as all territory lying outside population centers (population centers have a population of at least 1000 and a density of 400 or more persons per square kilometer).
  Proportion of immigrants: An immigrant is a person who is, or who has ever been, a landed immigrant or permanent resident. Such a person has been granted the right to live in Canada permanently by immigration authorities. Immigrants who have obtained Canadian citizenship by naturalization are included in this group.
  Proportion of aboriginal population: Aboriginal identity is based on whether the person identified with the Aboriginal peoples of Canada. This includes those who are First Nations, Métis or Inuk (Inuit) and/or those who are registered or treaty Indians (i.e. registered under the Indian Act of Canada) and/or those who have membership in a First Nation or Indian band.
  Prevalence of diabetes: Population age 12 and older who reported having been diagnosed by a health professional as having type 1 or type 2 diabetes; includes females age 15 and older who reported having been diagnosed with gestational diabetes.
  Prevalence of COPD: Population age 35 and older who reported being diagnosed by a health professional with chronic bronchitis, emphysema or chronic obstructive pulmonary disease (COPD).
  Prevalence of high blood pressure: Population age 12 and older who reported that they have been diagnosed by a health professional as having high blood pressure.
  Family medicine physicians per 100 000 population: The number of family medicine physicians per 100 000 population.
  Proportion reporting moderate-to-severe food insecurity: Food security is commonly understood to exist in a household when all people, at all times, have access to sufficient safe and nutritious food for an active and healthy life. Conversely, food insecurity occurs when food quality and/or quantity are compromised and is typically associated with limited financial resources.

The definitions are taken from the source document [51].

There are precedents for using population or community metrics in prediction models in the context of COVID-19. For example, a model of student transmission of the disease was constructed and population values from prior publications were used to instantiate it. Similarly, a mortality model used population values from the China CDC. In both cases, individual-level data were created through simulation, using the population values to define sampling distribution parameters. In another study, a baseline model was developed using individual-level data on a related outcome (hospitalizations with diagnoses of pneumonia and influenza) and then its predictions were adjusted to match COVID-19 case fatality rates. In our study, simulating individual-level data from the community prevalence values would require an independence assumption among the covariates, which would weaken the overall model. Furthermore, the prevalence values we use can be seen as the individual-level likelihoods of a particular characteristic. Cases where the case status was unknown or still active were removed, leaving only recovered and deceased individuals. The final dataset had 90 514 observations. Of these, 3456 were deceased, representing 3.82% of the Ontario dataset.

Synthesis method

The individual level variables in Table 1 were synthesized. The linking of the datasets with the community variables was performed on the synthetic data. There are a number of data synthesis methods that have been used recently in the literature, such as Bayesian networks and Generative Adversarial Networks. In this study, we used another approach that has been applied quite extensively for the synthesis of health and social sciences data, namely sequential classification and regression trees. Classification and regression trees have been proposed for data synthesis when implemented in a sequential manner. Furthermore, existing evaluations have concluded that the privacy risks of sequential tree synthesis are low. With these types of models, a variable is synthesized by using the values earlier in the sequence as predictors. Conceptually, sequential synthesis is similar to modeling multiple outcome variables using classifier chains and regressor chains. The details of the specific method we used are described elsewhere.
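To make the sequential-tree idea concrete, the sketch below synthesizes a small categorical dataset: the first column is drawn from its marginal distribution, and each subsequent column is modeled by a decision tree on the previously synthesized columns, with new values sampled from the real records that fall in the same leaf. This is an illustrative toy, not the authors' implementation; the `min_samples_leaf` setting and dummy-coding are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def synthesize_sequential(df, random_state=0):
    """Toy sequential-tree synthesis for categorical data: each column is
    modeled by a decision tree on the previously synthesized columns, and
    new values are drawn from the real records in the matching leaf."""
    rng = np.random.default_rng(random_state)
    cols = list(df.columns)
    n = len(df)
    # First variable: sample from its marginal distribution.
    synth = pd.DataFrame({cols[0]: rng.choice(df[cols[0]].to_numpy(), size=n)})
    for j in range(1, len(cols)):
        X_real = pd.get_dummies(df[cols[:j]]).astype(float)
        tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0)
        tree.fit(X_real, df[cols[j]])
        # Align the synthetic predictors to the training dummy columns.
        X_syn = (pd.get_dummies(synth[cols[:j]])
                 .reindex(columns=X_real.columns, fill_value=0).astype(float))
        leaves_real = tree.apply(X_real)
        leaves_syn = tree.apply(X_syn)
        values = df[cols[j]].to_numpy()
        out = np.empty(n, dtype=object)
        for leaf in np.unique(leaves_syn):
            mask = leaves_syn == leaf
            # Sample from the real values that fell in the same leaf.
            out[mask] = rng.choice(values[leaves_real == leaf], size=mask.sum())
        synth[cols[j]] = out
    return synth
```

Because sampling is done within leaves rather than copying rows, no synthetic record maps one-to-one onto a real record, which is the property the distinguishability and privacy evaluations later exploit.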

Analysis methods

The same analysis was performed on the real and the synthetic datasets, and the results were compared. The analysis methods were selected to reflect common approaches that are used to model mortality. Our main analytical method uses a machine learning technique and the sensitivity analysis uses a regression technique. Both were operationalized to provide interpretable models, with an emphasis on selecting the most important variables and understanding the functional form of relationships. The primary data analysis method is shown in Figure 1. Gradient boosted classification trees (GBT) were used to build a predictive model of death. Five hundred bootstrap samples were used to compute 95% confidence intervals for all results reported. For each bootstrap sample, the records that were out-of-sample were used as the test dataset for that iteration. Five-fold cross validation was used to determine the optimal number of trees for the GBT model built within each bootstrap iteration. Because the dataset was imbalanced, undersampling of the majority class was used to create a balanced training dataset within each bootstrap iteration.
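A single iteration of this pipeline (bootstrap sample for training, out-of-sample records for testing, majority-class undersampling, and cross-validated selection of the number of trees) can be sketched as follows. The simulated dataset, the candidate `n_estimators` grid, and all hyperparameters are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

# Stand-in for the case data: an imbalanced binary outcome (~4% positive).
X, y = make_classification(n_samples=2000, weights=[0.96], random_state=0)

def bootstrap_iteration(X, y, seed):
    """One iteration of the pipeline: bootstrap sample for training,
    out-of-sample records as the test set, undersampling of the majority
    class, and 5-fold CV to choose the number of trees."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boot = rng.integers(0, n, size=n)                 # bootstrap indices
    oob = np.setdiff1d(np.arange(n), boot)            # out-of-sample test set
    Xb, yb = X[boot], y[boot]
    # Undersample the majority class to balance the training data.
    pos = np.flatnonzero(yb == 1)
    neg = rng.choice(np.flatnonzero(yb == 0), size=len(pos), replace=False)
    keep = np.concatenate([pos, neg])
    search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                          {"n_estimators": [50, 100]}, cv=5, scoring="roc_auc")
    search.fit(Xb[keep], yb[keep])
    # Evaluate on the untouched out-of-sample records.
    return roc_auc_score(y[oob], search.predict_proba(X[oob])[:, 1])

auc = bootstrap_iteration(X, y, seed=1)
```

Repeating this for 500 seeds and taking the 2.5th and 97.5th percentiles of the resulting metrics yields the bootstrap confidence intervals reported in the Results.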
Figure 1.

Process diagram for the analysis method. The diagram shows the steps for each iteration of the bootstrap sampling. The testing data is the out-of-sample subset in each bootstrap iteration. cPDP stands for conditional partial dependency plot.

We compared the bootstrap confidence interval overlap between the two datasets. Confidence interval overlap has been proposed for evaluating the utility of privacy protective data transformations, and is defined as the average percentage of the real and synthetic confidence intervals that overlap. The definition of this overlap is provided in the appendix. To interpret confidence interval overlap, we propose a minimal acceptable overlap of 37.5%, as explained and justified in the appendix.
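The paper's exact definition is in its appendix; a common formulation (in the style of Karr et al.) averages the overlapping length as a fraction of each interval, which can be computed in a few lines:

```python
def ci_overlap(real, synth):
    """Average percentage overlap of two confidence intervals, each given
    as a (lo, hi) tuple: the length of the overlapping region expressed
    as a fraction of each interval, averaged and scaled to percent."""
    lo = max(real[0], synth[0])
    hi = min(real[1], synth[1])
    overlap = max(0.0, hi - lo)          # disjoint intervals give 0
    return 50.0 * (overlap / (real[1] - real[0])
                   + overlap / (synth[1] - synth[0]))
```

Identical intervals give 100%, disjoint intervals give 0%, and the 37.5% threshold proposed above sits between the two.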

Calibration

Probability calibration was performed as the predicted probabilities from boosted decision tree models do not correspond directly with the true probabilities of class membership. This discrepancy is amplified when data are under-sampled. Boosted decision trees can be viewed as additive logistic regression, meaning that the predictions made by boosting fit a logit of the true probabilities, rather than directly fitting the true probabilities. Isotonic regression can be used to calibrate the probabilities of boosted decision trees to ensure that the predicted probabilities correspond with the true probabilities of class membership. To calibrate the predicted probabilities p̂_i using the true class labels y_i, the following regression model is fit:

y_i = m(p̂_i) + ε_i

where m is an isotonic (non-decreasing) function. Calibrated probabilities are used for assessing model performance and conditional partial dependence.
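Fitting the non-decreasing function m is directly supported by scikit-learn's `IsotonicRegression`. The toy scores below, whose true event probability is the square of the raw score, are an assumption made to illustrate the mechanics:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
# Toy raw scores whose true event probability is p_hat**2, i.e. the raw
# scores over-state the probability of the positive class.
p_hat = rng.uniform(0, 1, 500)
y = (rng.uniform(0, 1, 500) < p_hat ** 2).astype(int)

iso = IsotonicRegression(out_of_bounds="clip")   # m: non-decreasing function
p_cal = iso.fit_transform(p_hat, y)              # fits y_i = m(p_hat_i) + e_i
```

The pool-adjacent-violators algorithm behind isotonic regression preserves the overall event rate while enforcing monotonicity, so the calibrated probabilities can be used for the performance and partial-dependence assessments described above.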

Model accuracy

We compared the real and synthetic datasets in terms of the death prediction model accuracy. Model accuracy was assessed using the Area Under the Receiver Operating Characteristic curve (AUROC). The ROC curve plots recall (the true positive rate) against the false positive rate and is commonly used to evaluate the performance of binary classifiers in machine learning. For binary classification tasks an AUROC of 0.5 is the expected performance of a random classifier, whereas an AUROC of 1 is the expected performance of a perfect classifier. Another metric, which focuses on predictions of the positive class, is the Area Under the Precision-Recall Curve (AUPRC). Interpretation of AUPRC depends on the class distribution of the outcome. It is therefore particularly important to evaluate AUPRC on test data with the true class distribution, as the minimal achievable value depends on that distribution; the AUPRC of a random classifier is the rate of the positive class, which in our case is 0.0382.
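Both metrics, and the random-classifier AUPRC baseline equal to the positive rate, can be computed with scikit-learn. The simulated labels and scores below are an illustrative assumption mimicking the ~4% death rate; `average_precision_score` is used as the standard estimator of AUPRC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
# ~4% positives, similar to the 3.82% death rate in the Ontario data.
y = (rng.uniform(size=1000) < 0.04).astype(int)
# Informative but noisy scores: positives shifted upward by 0.3.
scores = 0.3 * y + rng.uniform(size=1000)

auroc = roc_auc_score(y, scores)
auprc = average_precision_score(y, scores)   # average precision approximates AUPRC
baseline = y.mean()                          # AUPRC of a random classifier
```

An AUPRC well above `baseline` (here ~0.04) indicates that the model adds real value on the positive class, even when the AUROC alone looks strong.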

Variable importance

We compared the variable importance in the models built using the real and synthetic datasets. One general purpose method for evaluating variable importance, or determining which predictors are most relevant to predicting the outcome, is to use permutation. If we let the training set of predictors be X, with each row denoted by x_i, and the corresponding outcome variable be y, then we can permute a predictor variable j to get X^(j). A model built from the training dataset is f(X). The importance of a predictor variable can be given by using the model built on the training data to compute

VI_j = (1/N) Σ_{i=1..N} L(y_i, f(x_i^(j)))

where L is a loss function, such as prediction accuracy, and N is the total number of observations. This then gives us the importance of variable j. There is evidence that permuting a variable is biased towards predictors that are correlated with other predictors and that have many categories. The reason is that, if we have two predictors that are positively correlated, say x_1 and x_2, then there will be no training examples where x_1 is large and x_2 is small, which means that the predictions made in that region will be extrapolations, resulting in high importance for these two variables. An alternative is to permute variable j and reconstruct the model from the (undersampled) training data, and then compare the prediction accuracy of the original and permuted models, as follows:

VI_j = (1/N) Σ_{i=1..N} [L(y_i, f(x_i)) - L(y_i, f^(j)(x_i))]

where f^(j) is the model built with permuted variable j. This approach addresses the bias risk, and the average difference in loss (or accuracy) across multiple permutations allows us to prioritize the variables.

Conditional partial dependence plots

To illustrate the relationships between the most important predictor variables and the outcome of interest, conditional partial dependence plots were constructed. Traditionally, partial dependence plots for a variable j plot each value v in V_j against

(1/N) Σ_{i=1..N} f(x_i^(j=v))

where x_i^(j=v) is observation i with variable j set to the value v, and V_j is the set of unique values for the variable. Partial dependence plots have been criticized because not all observations may plausibly be observed with x_j = v, leading to poor predictions due to extrapolation. Conditional partial dependence plots aim to minimize extrapolation by calculating partial dependence within conditional subgroups and then pooling the results across subgroups. This also isolates the effect of a variable so we can view its impact, within the model, on the outcome.
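The plain (unconditional) partial dependence average can be computed directly; the paper's cPDP applies the same computation within conditional subgroups and pools the curves. The model and grid below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

def partial_dependence_curve(model, X, j, grid):
    """Plain partial dependence for feature j: force x_j = v for every
    record and average the predicted positive-class probability."""
    curve = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                     # x_i with variable j set to v
        curve.append(model.predict_proba(Xv)[:, 1].mean())
    return np.array(curve)

grid = np.quantile(X[:, 0], [0.1, 0.3, 0.5, 0.7, 0.9])
pd_curve = partial_dependence_curve(model, X, 0, grid)
```

Restricting the averaging to records within a subgroup (for example, one age band at a time) and pooling the subgroup curves yields the conditional variant used in Figures 4 and 5.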

Sensitivity analysis

The data synthesis method that we used was based on decision trees, as was the primary modeling method for predicting mortality. There is the potential that using similar methods for synthesis and analysis creates a positive data utility bias in that the generative model learns specific patterns in the data that the analysis model is able to also detect in the generated data. The risk is that a different analysis method may not be able to detect the same pattern. To guard against this, we also built logistic regression mortality models using each of the real and synthetic datasets. Logistic regression is a common analytical approach in epidemiology and would not be biased from using data generated by a sequential tree synthesizer. This would allow us to directly test the sensitivity of the results to the analytical method used. The methods and results from this logistic regression model are included in the appendix.

Evaluating distinguishability

We also compared the multivariate distributions of the real and synthetic data using a distinguishability metric. We applied an omnibus comparison of multivariate distributions using a binary classifier. This means that we build a discriminator model that attempts to distinguish between real and synthetic records. If it is not able to tell the difference, that indicates that the real and synthetic data are similar to each other. The distinguishability metric we use is based on propensity scores. Additional details on how we adapted it to our specific context are described in the appendix, but the basic concept is an interpretable mean squared error relative to guessing whether a record is real or synthetic.
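One common normalization of the propensity mean squared error (pMSE) rescales it to a 0-1 range, where 0 means the discriminator cannot do better than guessing and 1 means perfect separation. The discriminator choice (logistic regression) and the normalization are assumptions for this sketch; the paper's exact adaptation is in its appendix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def distinguishability(real, synth):
    """Propensity-score distinguishability on a 0-1 scale: train a
    discriminator to label records real vs synthetic; a score near 0
    means it cannot tell the two datasets apart."""
    X = np.vstack([real, synth])
    y = np.r_[np.zeros(len(real)), np.ones(len(synth))]
    p = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    # Mean squared distance from the "can't tell" propensity of 0.5,
    # rescaled so identical datasets give ~0 and separable ones ~1.
    return np.mean((p - 0.5) ** 2) / 0.25

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
same = rng.normal(size=(500, 3))            # drawn from the same distribution
shifted = rng.normal(loc=3.0, size=(500, 3))

low = distinguishability(real, same)        # near 0: indistinguishable
high = distinguishability(real, shifted)    # near 1: easily separated
```

On this scale, the 0.04 reported in the Results indicates the real and synthetic Ontario datasets are close to indistinguishable.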

Evaluating privacy

To evaluate the privacy risks of the synthetic data we tested for two types of disclosure. The first is attribute disclosure conditional on identity disclosure, which assesses the probability of mapping a synthetic record to a real person and, conditional on that, learning something new about the individual. The second is membership disclosure, which assesses whether an adversary could reliably determine whether a target individual was in the real dataset used for synthesis. The details of the methods used for each of these two evaluations are provided in the appendix.

RESULTS

Descriptive statistics

The summary statistics for the real dataset are shown in Table 3.
Table 3.

Summary statistics on the variables analyzed (n = 90 514 Ontario cases)

Variable: Mean (SD) or proportion
  Date reported (days since January 1, 2020): 214.43 (82.66)
  Gender
    Male: 48.5%
  Age group
    <20: 11.2%
    20–29: 20.8%
    30–39: 15.5%
    40–49: 13.8%
    50–59: 14.7%
    60–69: 9.4%
    70–79: 5.3%
    80+: 9.3%
  Exposure
    Travel related: 3.4%
    Close contact: 40%
    Outbreak: 24.6%
    Not reported: 32%
  % living in rural areas: 6.98 (12)
  % of immigrants: 37.04 (14.59)
  % of aboriginal population: 1.64 (2.04)
  Prevalence of diabetes: 7.73 (1.45)
  Prevalence of COPD: 3.26 (1.38)
  Prevalence of high blood pressure: 17.29 (2.2)
  Family medicine physicians per 100 000 population: 112.57 (102.5)
  Proportion reporting moderate-to-severe food insecurity: 7.99 (1.81)

GBT model results

The AUROC value for the GBT model on the real data was 0.945 and the AUPRC was 0.340, as shown in Table 4. The baseline death rate was 3.82%, so the AUPRC is a considerable improvement over that. The GBT model built on the synthetic data yielded similar model accuracy results, with an AUROC of 0.940 and an AUPRC of 0.313 (CI overlap 45.50% and 52.02%, respectively, both above our threshold).
Table 4.

Mean model accuracy results for the real and synthetic datasets with the 95% bootstrap confidence interval

AUROC: real data 0.945 (0.941–0.948); synthetic data 0.940 (0.936–0.945); CI overlap 45.50%
AUPRC: real data 0.340 (0.314–0.368); synthetic data 0.313 (0.286–0.342); CI overlap 52.02%

The confidence interval overlap between the real and synthetic CIs is also shown in the last column.

The variable importance for the real data is shown in Figures 2 and 3 using each of the prediction accuracy measures. All CI overlap values are above our threshold. The variables with the largest impact on the outcome are the individual characteristics; the community-level characteristics did not have a significant effect on death. The most important variable is age, followed by date reported, exposure, and gender. By far the most important predictor of death is age, with an approximately 6% increase in AUROC with its inclusion. The confidence intervals for the accuracy gain associated with gender and exposure cross zero when quantified using either AUPRC or AUROC, for both the real and synthetic datasets. We therefore focus only on the effects of date and age as the two most important predictor variables.
Figure 2.

Variable importance using the permutation method with AUROC as the accuracy metric and the 95% bootstrap confidence interval. The values on the side are the confidence interval overlap values between the real and synthetic datasets.

Figure 3.

Variable importance using the permutation method with AUPRC as the accuracy metric and the 95% bootstrap confidence interval. The values on the side are the confidence interval overlap values between the real and synthetic datasets.

Figure 4 illustrates the conditional partial dependence observed across the date reported for the two GBT models. The 95% bootstrap confidence intervals for the synthetic data align well with those constructed from the real data. This indicates that models produced using the synthetic data will yield the same conclusions as those produced using the real data. The plot shows that, after factoring out other effects, the model captures a probability of death that increases over time, later decreases and eventually plateaus, with an uptick that started at the tail end of the reporting period.
Figure 4.

Conditional partial dependence plot for date reported with bootstrap confidence intervals on the real and synthetic datasets. The date reported is measured as the number of days since January 1, 2020.

Figure 5 illustrates the conditional partial dependence observed across the age groups in the GBT models built using the real and synthetic datasets. The predicted probability of death increases monotonically with age group, with individuals greater than 80 years old having a mean predicted probability of death of 16.4% and 17.7% in the real and synthetic datasets, respectively. The GBT model built from synthetic data results in similar estimates, with a mean confidence interval overlap of 83.52% across all age groups, and the overlap for each age group exceeding our minimal threshold.
Figure 5.

Conditional partial dependence plot for age with 95% bootstrap confidence intervals on the real and synthetic data. Confidence interval overlap is annotated at the top of the plot for each age group.

The sensitivity analysis results included in the appendix show that very similar logistic regression models would be constructed from the real and synthetic datasets, and the accuracy results are similar between the two and similar to the GBT model results.

Distinguishability

The distinguishability between the real and synthetic datasets was 0.04 (on a scale from zero to one). This is quite low and indicates that the discriminator was not able to tell the difference between the real and synthetic datasets.

Privacy assessment

The probability of attribute disclosure conditional on identity disclosure for this dataset was 0.0585. This value is below the commonly used threshold of 0.09, which has been recommended by the European Medicines Agency (EMA) and Health Canada for datasets to be considered to have a low risk of identification. The risk value for the original data was 0.3284; the synthesis therefore reduced this risk considerably. For membership disclosure, the ability of an adversary to discriminate between a record that was used in synthesis (i.e. in the training dataset) and one that was not was evaluated using the standardized mean difference (SMD). The full dataset was split into a training dataset and a holdout, and the training dataset was synthesized. The distances between the training and synthesized datasets, and between the holdout and synthesized datasets, were computed. The SMD between the two distances was -0.063, meaning that the distance between the training data and the synthetic data was slightly larger than that between the holdout data and the synthetic data. This value is below the commonly used 0.1 threshold that typically signifies a meaningful difference, so the likelihood of a successful membership disclosure is low. Further details about the methodology and justifications are provided in the appendix.
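The membership disclosure test can be sketched as follows: compute each training and holdout record's distance to its nearest synthetic neighbor, then take the standardized mean difference between the two sets of distances. The nearest-neighbor distance and the pooled-SD standardization are assumptions for this illustration; the paper's exact procedure is in its appendix.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def membership_smd(train, holdout, synth):
    """Standardized mean difference between nearest-neighbor distances
    (train -> synthetic vs holdout -> synthetic). Values below ~0.1
    suggest the synthetic data sits no closer to its training records
    than to unseen records, i.e. low membership disclosure risk."""
    nn = NearestNeighbors(n_neighbors=1).fit(synth)
    d_train = nn.kneighbors(train)[0].ravel()
    d_hold = nn.kneighbors(holdout)[0].ravel()
    pooled = np.sqrt((d_train.var(ddof=1) + d_hold.var(ddof=1)) / 2)
    return (d_train.mean() - d_hold.mean()) / pooled

rng = np.random.default_rng(0)
train = rng.normal(size=(300, 4))
holdout = rng.normal(size=(300, 4))
synth = rng.normal(size=(300, 4))    # "synthetic" data drawn independently
smd = membership_smd(train, holdout, synth)
```

When the synthetic data is genuinely independent of any one training record, as in this toy setup, the SMD lands near zero, mirroring the -0.063 reported above.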

DISCUSSION AND CONCLUSIONS

Summary

We found that the analysis results between the real and synthetic datasets for the Ontario cohort of the Canadian COVID-19 case dataset were similar, and the conclusions from that analysis were the same. Gradient boosted classification trees were used to model the relationship between multiple factors and death. We found that age and the date since the start of 2020 were the biggest factors affecting the probability of death. These results are consistent with other reports from the literature.

We did not find a relationship between death and the community characteristics of the public health regions where cases were reported (such as the percentage of immigrants and the percentages of individuals with diabetes, COPD, and high blood pressure). It is likely that such community-level measures over a large geographic region are not sufficiently associated with individual characteristics and therefore are not sufficiently discriminatory with respect to the outcome in our models. This further emphasizes the importance of getting access to individual-level data.

A sensitivity analysis performed to check for potential bias between the generator model and the analytic method did not reveal evidence of bias. Different types of logistic regression models produced results consistent between the real and synthetic data, and with the GBT model. A distinguishability test found that a classifier was not able to effectively tell the difference between the real and synthetic datasets, which further supports the modeling results above. A privacy evaluation of attribute disclosure conditional on identity disclosure, and of membership disclosure, showed that the privacy risks of the synthetic data were low.
Given the increasing pressures to get access to data and the growing concerns about individual patient privacy that this presents, the data synthesis method presented in this paper can address those privacy concerns, and we have presented evidence that the conclusions drawn will be comparable to those drawn from the original data. A recent article also found that a synthesis method similar to the one used in this study produces datasets that have high utility; in that case, utility was defined as prediction accuracy for a number of different machine learning models. Our study goes further by comparing more robust accuracy measures, variable importance, and model interpretability. Furthermore, our study is the first to consider the utility of synthetic COVID-19 data. As the weight of evidence on the utility of synthetic data increases, one would expect broader acceptance of using synthetic data as a proxy for real data.

Limitations

Although our study used sequential classification and regression trees for data synthesis, other methods could also have been used and may have produced comparable results. We did not evaluate the utility of multiple synthesis methods, as the objective of the current study was not such a comparison but rather to determine whether a common synthesis method could produce useful synthetic data; the current study can serve as a baseline (dataset and methods) for future work comparing multiple synthesis methods. Our analysis was performed on the Ontario cohort of 90 514 records within the Canadian COVID-19 case dataset. Further analyses on more complex COVID-19 datasets, such as those including comorbidities and socioeconomic factors at the individual level, should be performed to add weight to our findings and further assess the utility of synthetic data. The current study shows good potential that justifies additional effort to evaluate the utility of complex synthetic datasets.
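The sequential tree-based synthesis approach named above can be illustrated in simplified form: synthesize the first variable from its marginal distribution, then synthesize each subsequent variable by fitting a tree on the preceding variables and sampling observed values from the leaf each synthetic row falls into. This is a hypothetical sketch (synthpop-style CART) assuming a purely numeric `pandas.DataFrame`; column ordering, tree depth, and leaf-sampling details are illustrative simplifications, not the authors' implementation.

```python
# Simplified sequential CART synthesis for numeric data.
# Assumes `real` is an all-numeric DataFrame; categorical variables,
# visit sequences, and smoothing are omitted for brevity.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def synthesize(real: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    cols = list(real.columns)
    n = len(real)
    synth = pd.DataFrame(index=range(n))
    # First variable: bootstrap from its marginal distribution.
    synth[cols[0]] = rng.choice(real[cols[0]].to_numpy(), size=n)
    # Each later variable: fit a tree on the preceding columns, then
    # sample real values from the leaf each synthetic row lands in.
    for j in range(1, len(cols)):
        X_real, y_real = real[cols[:j]], real[cols[j]].to_numpy()
        tree = DecisionTreeRegressor(
            min_samples_leaf=10, random_state=seed).fit(X_real, y_real)
        real_leaf = tree.apply(X_real)
        synth_leaf = tree.apply(synth[cols[:j]])
        values = np.empty(n)
        for leaf in np.unique(synth_leaf):
            pool = y_real[real_leaf == leaf]  # observed values in this leaf
            idx = synth_leaf == leaf
            values[idx] = rng.choice(pool, size=idx.sum())
        synth[cols[j]] = values
    return synth
```

Because each synthetic value is drawn from real values sharing a leaf, marginal distributions and the conditional relationships captured by the trees are approximately preserved, which is what the utility comparisons in this study evaluate.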

AUTHOR CONTRIBUTIONS

KEE contributed to the design of the study, the analysis, and the writing of the paper. LM contributed to the design of the study, the analysis, and the writing of the paper. EJ performed the literature review. HS contributed to the design of the study and the writing of the paper.

ETHICS APPROVAL

This project was reviewed by the Children’s Hospital of Eastern Ontario Research Institute Research Ethics Board as protocol number CHEOREB#20/89X.
