Abstract
One broad goal of biomedical informatics is to generate fully-synthetic, faithfully representative electronic health records (EHRs) to facilitate data sharing between healthcare providers and researchers and promote methodological research. A variety of methods exist for generating synthetic EHRs, but they are not capable of generating unstructured text, like emergency department (ED) chief complaints, history of present illness, or progress notes. Here, we use the encoder-decoder model, a deep learning algorithm that features in many contemporary machine translation systems, to generate synthetic chief complaints from discrete variables in EHRs, like age group, gender, and discharge diagnosis. After being trained end-to-end on authentic records, the model can generate realistic chief complaint text that appears to preserve the epidemiological information encoded in the original record-sentence pairs. As a side effect of the model's optimization goal, these synthetic chief complaints are also free of relatively uncommon abbreviations and misspellings, and they include none of the personally identifiable information (PII) that was in the training data, suggesting that this model may be used to support the de-identification of text in EHRs. When combined with algorithms like generative adversarial networks (GANs), our model could be used to generate fully-synthetic EHRs, allowing healthcare providers to share faithful representations of multimodal medical data without compromising patient privacy. This is an important advance that we hope will facilitate the development of machine-learning methods for clinical decision support, disease surveillance, and other data-hungry applications in biomedical informatics.
Year: 2018 PMID: 30687797 PMCID: PMC6345174
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Samples of synthetic chief complaints using greedy sampling and beam search, along with key variables from the corresponding records
| Age | Gender | CCS code | CCS description | Chief complaint | Greedy sample | Beam samples ( |
|---|---|---|---|---|---|---|
| 70–74 | M | 660 | Alcohol-related disorders | “alcoholic beverage consumption today” | “pt admits to drinking alcohol” | “etoh”, “etoh consumption”, “found on street”, “found on the street”, “fond on street admits to drinking” |
| 50–54 | F | 2607 | E Codes: Motor vehicle traffic | “mv c pt denies complaints” | “pt was rear ended” | “mvc”, “mva”, “pt was rear ended”, “pt was in a car”, “pt was rear ended in mvc” |
| 20–24 | F | 128 | Asthma | “shortness of breath on and off x 3 week” | “shortness of breath” | “wheezing”, “cough”, “chest pain”, “chest tightness”, “shortness of breath” |
| 25–29 | F | 661 | Substance-related disorders | “od narcan given by ems” | “found on street” | “od”, “as per ems”, “as per ems patient was found unresponsive”, “as per ems patient was found on the street”, “as per ems patient was found in the street” |
| 85–89 | F | 205 | Spondylosis; intervertebral disc disorders; other back problems | “m5440 lumbago with sciatica unspecified side i10 needs dialysis” | “lower back pain m549” | “m542 cervicalgia i10”, “back pain m545 low back”, “back pain m545 low back pain”, “back pain m545 low back pain i10”, “back pain m545 low back injury i10” |
| 50–54 | F | 98 | Essential hypertension | “generalized chest pain and blurred vision x few days no travel see above complaint” | “elevated bp” | “elevated bp”, “elevated blood pressure”, “high blood pressure”, “elevated bp no travel”, “elevated blood pressure no travel” |
We include the authentic chief complaint for comparison, but we omit from the table hospital code, mode of arrival, disposition, month, and year to increase patient anonymity
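The difference between the greedy and beam columns above can be illustrated with a small sketch. It is not the authors' decoder: `TOY_MODEL` is a hypothetical lookup table of next-token probabilities standing in for the trained network, and `beam_search` and `greedy` are illustrative names. Greedy decoding is treated here as beam search with a beam width of 1.

```python
import math

# Hypothetical next-token probabilities keyed by the previous token.
# A real decoder would condition on the encoded record and the full prefix.
TOY_MODEL = {
    "<s>": {"pt": 0.4, "etoh": 0.35, "found": 0.25},
    "pt": {"admits": 0.5, "found": 0.3, "</s>": 0.2},
    "admits": {"</s>": 1.0},
    "etoh": {"</s>": 0.9, "consumption": 0.1},
    "consumption": {"</s>": 1.0},
    "found": {"</s>": 1.0},
}


def beam_search(model, k, max_len=10):
    """Return up to k decoded strings, ordered by total log-probability."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, token sequence)
    finished = []
    for _ in range(max_len):
        # Extend every live beam by every possible next token.
        candidates = [
            (score + math.log(p), seq + [tok])
            for score, seq in beams
            for tok, p in model[seq[-1]].items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep the k best; sequences that emitted the end marker are set aside.
        beams = []
        for score, seq in candidates[:k]:
            (finished if seq[-1] == "</s>" else beams).append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # keep any sequences cut off by max_len
    finished.sort(key=lambda c: c[0], reverse=True)
    return [
        " ".join(t for t in seq if t not in ("<s>", "</s>"))
        for _, seq in finished[:k]
    ]


def greedy(model, max_len=10):
    """Greedy decoding: always follow the single most probable next token."""
    return beam_search(model, k=1, max_len=max_len)[0]


print(greedy(TOY_MODEL))          # pt admits
print(beam_search(TOY_MODEL, 3))  # ['etoh', 'found', 'pt admits']
```

In this toy table the greedy path commits to "pt" because it is the most probable first token, while the wider beam recovers the shorter sequence "etoh", which has a higher overall probability.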
Scores for different sampling schemes on our range of text quality metrics
| Method | ppv | sens | f1 | CIDEr | ES |
|---|---|---|---|---|---|
| Beam ( | 0.3323 | 0.1786 | 0.2118 | 0.2013 | 0.6207 |
| Beam ( | 0.3216 | 0.1581 | 0.1922 | 0.1865 | 0.5981 |
| Beam ( | 0.3190 | 0.1410 | 0.1765 | 0.1748 | 0.5733 |
| Prob ( | 0.3148 | 0.2208 | 0.2394 | 0.2186 | 0.6541 |
| Prob ( | 0.1805 | 0.1520 | 0.1492 | 0.1410 | 0.5973 |
| Greedy | | | | | |
Simplified PPV, sens, and F1 scores measure n-gram overlap between the authentic and synthetic chief complaints; CIDEr and ES scores measure their similarity in vector space. Top scores are shown in bold
PPV positive predictive value, sens sensitivity, ES embedding similarity
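As a rough illustration of the n-gram overlap scores described above, the sketch below computes PPV, sensitivity, and F1 for one authentic/synthetic pair. It is one plausible reading of "simplified" overlap, not the authors' exact scoring code: the whitespace tokenization, the clipped n-gram counts, and the function names `ngrams` and `overlap_scores` are all assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(authentic, synthetic, n=1):
    """Return (ppv, sens, f1) for n-gram overlap between two strings.

    PPV is the share of synthetic n-grams found in the authentic text;
    sensitivity is the share of authentic n-grams found in the synthetic text.
    """
    ref = ngrams(authentic.split(), n)
    hyp = ngrams(synthetic.split(), n)
    matched = sum((ref & hyp).values())  # counts clipped to the smaller side
    ppv = matched / sum(hyp.values()) if hyp else 0.0
    sens = matched / sum(ref.values()) if ref else 0.0
    f1 = 2 * ppv * sens / (ppv + sens) if ppv + sens else 0.0
    return ppv, sens, f1


ppv, sens, f1 = overlap_scores(
    "shortness of breath on and off x 3 week", "shortness of breath"
)
print(round(ppv, 4), round(sens, 4), round(f1, 4))  # 1.0 0.3333 0.5
```

The example pair comes from the asthma row of the sample table: every synthetic word appears in the authentic complaint (PPV of 1), but only three of the nine authentic words are reproduced (sensitivity of one third).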
Sensitivity (sens), positive predictive value (PPV), and F1 scores for a chief complaint classifier trained on authentic chief complaints and tested on synthetic chief complaints generated with different sampling schemes
| Method | sens | ppv | F1 |
|---|---|---|---|
| Original | 0.4487 | 0.4609 | 0.4192 |
| Beam ( | 0.4892 | 0.5624 | 0.4436 |
| Beam ( | 0.4687 | 0.5447 | 0.4275 |
| Beam ( | 0.4354 | 0.5319 | 0.4001 |
| Prob ( | 0.4931 | 0.5432 | 0.4481 |
| Prob ( | 0.3859 | 0.4053 | 0.3547 |
| Greedy | | | |
Top scores are shown in bold
Discrete variables in our dataset
| Variable | Original values | Coded values |
|---|---|---|
| Age group | 0 through 110+ in 5-year (inclusive) increments, e.g. 5–9, 10–14, and 20–24 | [0, 22] |
| Gender | M, F, and 4 other categories including non-binary genders | [0, 5] |
| Mode of arrival | “ambulance”, “car”, “helicopter”, “missing”, “on foot”, “public transportation”, “unknown”, and “other” | [0, 7] |
| Hospital code | 44 3-digit alphanumeric codes | [0, 43] |
| Disposition (without transfer) | “outpatient admitted as an inpatient to this hospital”, “routine discharge”, “discharged to home”, “left against medical advice”, “still patient”, “deceased”, “hospice—medical facility”, “hospice—home”, “deceased in medical facility”, “deceased at home”, “deceased place unknown” and “unknown” | [0, 11] |
| Disposition (with transfer) | “transferred to” + {“critical access hospital”, “intermediate care facility”, “long-term care facility”, “nursing facility”, “psychiatric facility”, “rehabilitation center”, “short-term general hospital”, or “other facility”} | [0, 7] |
| Month | January through December | [0, 11] |
| Year | 2016 and 2017 | [0, 1] |
| Diagnosis code | ICD-9/10 diagnosis codes converted to HCUP CCS code | [0, 283] |
The first column shows the variable names, the second column a description of their unique original values, and the third column a bracketed set indicating the range of those values after being recoded during preprocessing
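The recoding summarized in the table can be sketched as follows. This is a hedged illustration rather than the authors' preprocessing code: the field sizes are derived from the coded ranges in the table, but the field names, the `encode_age_group` and `one_hot_record` helpers, the assumption that every record carries a code for all nine variables, and the choice of a concatenated one-hot vector as model input are assumptions.

```python
import re

# Number of categories per variable, derived from the coded ranges above
# (e.g. [0, 22] means 23 categories). The field names are hypothetical.
FIELD_SIZES = {
    "age_group": 23,
    "gender": 6,
    "mode_of_arrival": 8,
    "hospital": 44,
    "disposition": 12,
    "transfer": 8,
    "month": 12,
    "year": 2,
    "diagnosis": 284,
}


def encode_age_group(label):
    """Map a label such as '70–74' or '110+' to its 5-year bin index in [0, 22]."""
    lower = int(re.match(r"\s*(\d+)", label).group(1))
    return min(lower // 5, 22)


def one_hot_record(codes):
    """Concatenate one one-hot block per variable into a single binary vector."""
    vec = []
    for name, size in FIELD_SIZES.items():
        idx = codes[name]
        if not 0 <= idx < size:
            raise ValueError(f"{name} code {idx} outside [0, {size - 1}]")
        block = [0] * size
        block[idx] = 1
        vec.extend(block)
    return vec


# Illustrative record; every integer code except the age group is made up.
record = {
    "age_group": encode_age_group("70–74"),
    "gender": 0,
    "mode_of_arrival": 0,
    "hospital": 12,
    "disposition": 1,
    "transfer": 0,
    "month": 5,
    "year": 1,
    "diagnosis": 200,
}
vector = one_hot_record(record)
print(len(vector), sum(vector))  # 399 9
```

Note that the diagnosis index is the recoded value in [0, 283], not the CCS code itself, so a lookup from CCS code to index would be needed in practice.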
Fig. 1 Model architecture