| Literature DB >> 32638010 |
Timothy Bergquist (1), Yao Yan (2), Thomas Schaffter (3), Thomas Yu (3), Vikas Pejaver (1), Noah Hammarlund (1), Justin Prosser (4), Justin Guinney (1,3), Sean Mooney (1).
Abstract
OBJECTIVE: The development of predictive models for clinical application requires the availability of electronic health record (EHR) data, which is complicated by patient privacy concerns. We showcase the "Model to Data" (MTD) approach as a new mechanism to make private clinical data available for the development of predictive models. Under this framework, we eliminate researchers' direct interaction with patient data by delivering containerized models to the EHR data.
Keywords: clinical informatics; data science; data sharing; electronic health records; privacy
MeSH:
Year: 2020 PMID: 32638010 PMCID: PMC7526463 DOI: 10.1093/jamia/ocaa083
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. Defining the evaluation dataset. Any patient with at least 1 visit within the evaluation window was included in the evaluation dataset (gold). All other patient records were added to the training dataset (blue). Visits after the end of the evaluation window were excluded from the evaluation dataset, and from the training dataset for patients who did not have a confirmed death (light/transparent blue). A 9-month evaluation window was chosen because it produced an 80/20 split between the training dataset and the evaluation dataset.
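The split rule described in Figure 1 can be sketched as a small function. This is an illustrative reconstruction only: the record format, field names (`visits`, `confirmed_death`), and the concrete window dates are assumptions, not details from the paper (which states only that the window spans 9 months).

```python
from datetime import date

# Assumed 9-month evaluation window; the actual dates are not given in the record.
EVAL_START = date(2019, 1, 1)
EVAL_END = date(2019, 9, 30)

def split_patients(patients):
    """Assign each patient to the training or evaluation set per Figure 1.

    `patients` maps patient_id -> {"visits": [date, ...], "confirmed_death": bool}.
    Patients with at least one visit inside the window go to the evaluation
    set; everyone else goes to the training set. Visits after the window end
    are dropped for patients without a confirmed death.
    """
    training, evaluation = [], []
    for pid, rec in patients.items():
        visits = rec["visits"]
        if not rec["confirmed_death"]:
            # Exclude post-window visits for patients lacking a confirmed death
            visits = [v for v in visits if v <= EVAL_END]
        if any(EVAL_START <= v <= EVAL_END for v in visits):
            evaluation.append(pid)
        else:
            training.append(pid)
    return training, evaluation
```

With the window above, a patient whose only visit falls after the window end (and who has no confirmed death) contributes no usable visits and lands in the training set.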
Figure 2. Schema showing the Docker container structure for the training stage and inference stage of running the Docker image.
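The two-stage container design in Figure 2 might be mirrored by an entrypoint like the following toy sketch. Everything here is assumed for illustration: the mount paths (`/data`, `/model`, `/output`), the CSV column names (`person_id`, `died`), and the trivial prevalence "model" are not taken from the paper.

```python
import csv
import sys

def train(train_csv, model_path):
    """Toy training stage: persist the observed mortality rate as the 'model'."""
    with open(train_csv) as f:
        rows = list(csv.DictReader(f))
    rate = sum(int(r["died"]) for r in rows) / len(rows)
    with open(model_path, "w") as f:
        f.write(str(rate))

def infer(model_path, eval_csv, predictions_csv):
    """Toy inference stage: emit the stored rate as every patient's score."""
    with open(model_path) as f:
        rate = float(f.read())
    with open(eval_csv) as f:
        ids = [r["person_id"] for r in csv.DictReader(f)]
    with open(predictions_csv, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["person_id", "score"])
        for pid in ids:
            w.writerow([pid, rate])

if __name__ == "__main__" and len(sys.argv) > 1:
    # Stage selected at container runtime, per the two-stage design in Figure 2
    if sys.argv[1] == "train":
        train("/data/train.csv", "/model/model.txt")
    else:
        infer("/model/model.txt", "/data/eval.csv", "/output/predictions.csv")
```

Keeping training and inference behind a single entrypoint lets the host run the same image twice, with the model artifact carried between stages on a shared volume.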
Figure 3. Diagram for submitting and distributing containerized prediction models in a protected environment. Dockerized models were submitted to Synapse by a model developer to an evaluation queue. The Synapse Workflow Hook pulled in the submitted Docker image and built it inside the protected University of Washington (UW) environment. The model trained on the available electronic health record data and then made inferences on the evaluation dataset patients, outputting a prediction file with mortality probability scores for each patient. The prediction file was compared with a gold standard benchmark. The model’s performance, measured by area under the receiver-operating characteristic curve, was returned to the model developer. CWL: Common Workflow Language.
Number of patients in the University of Washington Medicine Observational Medical Outcomes Partnerships repository who have been diagnosed with cancer, heart disease, type 2 diabetes, chronic obstructive pulmonary disease, or stroke
| | Training set (n = 956 212) | Evaluation set (n = 336 548) |
|---|---|---|
| Patients with cancer | 66 203 (6.9) | 42 195 (12.5) |
| Patients with heart disease | 31 352 (3.3) | 23 108 (6.9) |
| Patients with type 2 diabetes | 40 938 (4.3) | 28 234 (8.4) |
| Patients with chronic obstructive pulmonary disease | 13 777 (1.4) | 8302 (2.5) |
| Patients with stroke | 5216 (0.6) | 3927 (1.2) |
| Other patients | 834 591 (87.3) | 257 884 (76.6) |
Values are n (%).
Figure 4. A comparison of the receiver-operating characteristic curves for the 3 mortality prediction models submitted, trained, and evaluated using the “Model to Data” framework. AUC: area under the curve; cdp: condition/procedure/drug.
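The area-under-the-ROC-curve scores compared in Figure 4 can be computed from a prediction file and a gold standard with a small rank-based helper, using the Mann-Whitney U identity (AUC = probability that a random positive outranks a random negative). This is a generic sketch, not the challenge's actual scoring code.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: list of 0/1 outcomes (1 = patient died in the evaluation window)
    scores: list of predicted mortality probabilities, aligned with labels
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both positive and negative examples")
    # Walk runs of tied scores, assigning each run its average rank
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Handling tied scores with average ranks matters for clinical models, which often emit identical probabilities for many patients; without it, the score would depend on the arbitrary sort order of ties.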
Figure 5. Runtime and maximum memory usage for training predictive models in the benchmarking test.