Douglas Teodoro1,2,3, Erik Sundvall4,5, Mario João Junior1, Patrick Ruch2,3, Sergio Miranda Freire1.
Abstract
The openEHR specifications are designed to support the implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on these specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing information on hospitalisations and high-complexity procedures, and formalised it using a set of openEHR archetypes and templates. We then implemented a tool that enriches the raw relational data and converts it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations, in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the use of ORBDA for evaluating the insertion throughput and query latency of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.
Year: 2018 PMID: 29293556 PMCID: PMC5749730 DOI: 10.1371/journal.pone.0190028
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. A simplified view of the openEHR multilevel model.
From the building blocks of the Reference Model, archetypes are created to express domain concepts, which represent valid data structures in the Reference Model. By combining archetypes, we can generate templates that may be used to generate EHR forms, messages and other artefacts. Image credit: Freire et al. [27].
Core data elements of the AIH and APAC datasets.
| Data element | Group | Data type |
|---|---|---|
| date of discharge | administrative | Date |
| healthcare unit | administrative | code (CNES) |
| issue date | administrative | Date |
| reason for discharge | administrative | code (local) |
| age | demographic | Numeric |
| gender | demographic | code (local) |
| nationality | demographic | code (local) |
| state | demographic | code (local) |
| performed procedure | action | code (SIGTAP) |
| main diagnosis | evaluation | code (ICD 10) |
| secondary diagnosis | evaluation | code (ICD 10) |
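As an illustration only, the core data elements listed above could be carried in a small typed record before being mapped to openEHR. This is a minimal sketch: the class name, field names, the optional secondary diagnosis and the placeholder values are assumptions, not part of the ORBDA source schema. The procedure and diagnosis codes in the example are taken from the statistics reported for the AIH dataset.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CoreRecord:
    """Hypothetical container for the core data elements (names are illustrative)."""
    # administrative group
    issue_date: date
    discharge_date: date
    healthcare_unit: str       # CNES code
    discharge_reason: str      # local code
    # demographic group
    age: float                 # numeric
    gender: str                # local code
    nationality: str           # local code
    state: str                 # local code
    # action group
    performed_procedure: str   # SIGTAP code
    # evaluation group
    main_diagnosis: str        # ICD-10 code
    secondary_diagnosis: Optional[str] = None  # assumed optional for this sketch


# Placeholder values; only the SIGTAP and ICD-10 codes come from the document.
example = CoreRecord(
    issue_date=date(2010, 1, 1),
    discharge_date=date(2010, 1, 3),
    healthcare_unit="<CNES code>",
    discharge_reason="<local code>",
    age=27,
    gender="<local code>",
    nationality="<local code>",
    state="<local code>",
    performed_procedure="310010039",  # Normal delivery
    main_diagnosis="O80.0",           # Spontaneous vertex delivery
)
```

Freezing the dataclass keeps each record immutable, which suits a read-only benchmark source.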
Statistics of the ORBDA source database content at the dataset and patient levels.
| Stats level | Stats item | Attribute | AIH (#) | AIH (%) | APAC (#) | APAC (%) |
|---|---|---|---|---|---|---|
| | | Unique | 5.55×10^7 | 100.00 | 7.75×10^6 | 100.00 |
| | | Unique | 1.21×10^4 | 100.00 | 5.09×10^3 | 100.00 |
| | | Unique | 1.80×10^3 | 100.00 | 7.58×10^2 | 100.00 |
| | | Unique | 4.05×10^3 | 100.00 | 4.61×10^3 | 100.00 |
| | age | <1 | 2.10×10^6 | 3.78 | 2.90×10^3 | 0.04 |
| | age | 1–4 | 3.22×10^6 | 5.80 | 1.03×10^5 | 1.32 |
| | age | 5–9 | 1.80×10^6 | 3.25 | 1.71×10^5 | 2.20 |
| | age | 10–14 | 1.30×10^6 | 2.34 | 1.89×10^5 | 2.44 |
| | age | 15–19 | 2.95×10^6 | 5.32 | 2.72×10^5 | 3.52 |
| | age | 20–29 | 9.95×10^6 | 17.93 | 5.80×10^5 | 7.49 |
| | age | 30–39 | 6.85×10^6 | 12.36 | 7.55×10^5 | 9.75 |
| | age | 40–49 | 5.18×10^6 | 9.34 | 1.02×10^6 | 13.15 |
| | age | 50–59 | 4.92×10^6 | 8.87 | 1.37×10^6 | 17.65 |
| | age | 60–69 | 4.58×10^6 | 8.25 | 1.50×10^6 | 19.34 |
| | age | 70–79 | 3.92×10^6 | 7.06 | 1.27×10^6 | 16.45 |
| | age | ≥80 | 2.76×10^6 | 4.98 | 5.15×10^5 | 6.64 |
| | gender | Female | 3.30×10^7 | 59.43 | 4.32×10^6 | 55.74 |
| | gender | Male | 2.25×10^7 | 40.57 | 3.43×10^6 | 44.26 |
| | nationality | Brazilian | 5.54×10^7 | 99.86 | 7.73×10^6 | 99.84 |
| | nationality | Other | 7.74×10^4 | 0.14 | 1.23×10^4 | 0.16 |
| | diagnosis | Spontaneous vertex delivery (O80.0) | 4.22×10^6 | 7.37 | - | - |
| | diagnosis | Pneumonia, unspecified (J18.9) | 1.33×10^6 | 2.33 | - | - |
| | diagnosis | Single spontaneous delivery, unspecified (O80.9) | 1.16×10^6 | 2.03 | - | - |
| | diagnosis | Pure hypercholesterolemia (E78.0) | - | - | 5.60×10^5 | 7.23 |
| | diagnosis | Sensorineural hearing loss, bilateral (H90.3) | - | - | 3.05×10^5 | 3.94 |
| | diagnosis | Paranoid schizophrenia (F20.0) | - | - | 2.82×10^5 | 3.64 |
| | procedure | Normal delivery (310010039) | 6.11×10^6 | 10.66 | - | - |
| | procedure | Treatment of pneumonia or influenza (303140151) | 4.02×10^6 | 7.02 | - | - |
| | procedure | Caesarean delivery (411010034) | 3.17×10^6 | 5.53 | - | - |
| | procedure | Phacoemulsification with foldable intraocular lens implantation (405050372) | - | - | 5.68×10^5 | 7.33 |
| | procedure | Cardiac catheterization (211020010) | - | - | 4.85×10^5 | 6.26 |
| | procedure | Evaluation for hearing deficiency diagnosis (211070092) | - | - | 4.24×10^5 | 5.47 |
In the hospitalisation table, the number of patients is taken as the number of unique hospitalisation identifiers.
Archetypes and templates used to model the ORBDA dataset.
| Archetype | Composition | Template | Type | Source |
|---|---|---|---|---|
| demographic_data | demographic_data | demographic_data | ADMIN_ENTRY | new |
| procedure-sus | hospitalisation, outpatient_high_complexity_procedures | bariatrics, chemotherapy, hospitalisation, medication, miscellaneous, nephrology, radiotherapy | ACTION | specialised |
| admission | hospitalisation | hospitalisation | ADMIN_ENTRY | CKM |
| hospitalization_authorization | hospitalisation | hospitalisation | ADMIN_ENTRY | new |
| patient_discharge | hospitalisation, outpatient_high_complexity_procedures | bariatrics, chemotherapy, hospitalisation, medication, miscellaneous, nephrology, radiotherapy | ADMIN_ENTRY | new |
| problem_diagnosis-sus | hospitalisation, outpatient_high_complexity_procedures | bariatrics, chemotherapy, hospitalisation, medication, miscellaneous, nephrology, radiotherapy | EVALUATION | specialised |
| high_complexity_procedures_sus | outpatient_high_complexity_procedures | bariatrics, chemotherapy, medication, miscellaneous, nephrology, radiotherapy | ADMIN_ENTRY | new |
| fluid | outpatient_high_complexity_procedures | nephrology | CLUSTER | CKM |
| tnm_staging-sus | outpatient_high_complexity_procedures | chemotherapy, radiotherapy | CLUSTER | specialised |
| bariatric_surgery_evaluation | outpatient_high_complexity_procedures | bariatrics | EVALUATION | new |
| bodily_output-urination | outpatient_high_complexity_procedures | nephrology | OBSERVATION | CKM |
| body_mass_index | outpatient_high_complexity_procedures | bariatrics | OBSERVATION | CKM |
| body_weight | outpatient_high_complexity_procedures | medication, nephrology | OBSERVATION | CKM |
| height | outpatient_high_complexity_procedures | medication, nephrology | OBSERVATION | CKM |
| lab_test-antigen_antibody_sus | outpatient_high_complexity_procedures | nephrology | OBSERVATION | specialised |
| lab_test-blood_glucose | outpatient_high_complexity_procedures | nephrology | OBSERVATION | CKM |
| lab_test-hba1c | outpatient_high_complexity_procedures | nephrology | OBSERVATION | CKM |
| lab_test-liver_function | outpatient_high_complexity_procedures | nephrology | OBSERVATION | CKM |
| lab_test-urea_and_electrolytes-sus | outpatient_high_complexity_procedures | nephrology | OBSERVATION | specialised |
Contents of the ORBDA archetypes.
| Archetype | Concept |
|---|---|
| procedure-sus | |
| admission | |
| demographic_data | |
| high_complexity_procedures_sus | |
| hospitalization_authorization | |
| patient_discharge | |
| fluid | |
| tnm_staging-sus | |
| bariatric_surgery_evaluation | |
| problem_diagnosis-sus | |
| bodily_output-urination | |
| body_mass_index | |
| body_weight | |
| height | |
| lab_test-antigen_antibody_sus | |
| lab_test-blood_glucose | |
| lab_test-hba1c | |
| lab_test-liver_function | |
| lab_test-urea_and_electrolytes-sus | |
Mapped data types found in the archetypes of the ORBDA dataset.
| Data type | Package | Occurrence |
|---|---|---|
| DV_QUANTITY | quantity | 6 |
| DV_BOOLEAN | basic | 7 |
| DV_CODED_TEXT | text | 23 |
| DV_COUNT | quantity | 7 |
| DV_DATE | date time | 7 |
| DV_DATE_TIME | date time | 3 |
| DV_PROPORTION | quantity | 2 |
| DV_TEXT | text | 7 |
Fig 2. UML class diagram of the SUS-openEHR-Builder software.
Statistics for the all and 10k datasets created using the SUS-openEHR-Builder.
| Dataset | Object | Format | Source | #Patient | #File | Size (GB) | Time (sec) |
|---|---|---|---|---|---|---|---|
| all | Composition | XML | AIH | 55×10^6 | 1.1×10^8 | 0.85×10^3 | 2.2×10^5 |
| all | Composition | XML | APAC | 7.7×10^6 | 1.0×10^8 | 1.20×10^3 | 3.2×10^5 |
| 10k | EHR | XML | AIH | 10×10^3 | 9.0×10^4 | 0.36 | 1.3×10^2 |
| 10k | EHR | XML | APAC | 10×10^3 | 3.1×10^5 | 2.10 | 6.7×10^2 |
| 10k | EHR | JSON | AIH | 10×10^3 | 9.0×10^4 | 0.25 | 1.7×10^2 |
| 10k | EHR | JSON | APAC | 10×10^3 | 3.1×10^5 | 1.40 | 8.1×10^2 |
Statistics for SUS-openEHR-Builder running in parallel setup: 10 jobs.
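The file counts, sizes and run times in the table above imply an average generation rate for each run. The sketch below works through that arithmetic for the two "all" composition runs. It is a back-of-the-envelope estimate from the rounded table values, not the throughput measured by the authors, and it assumes 1 GB = 1000 MB. The dictionary layout and the function name are illustrative.

```python
# Rounded values copied from the statistics table (parallel setup, 10 jobs).
runs = {
    ("all", "AIH"): {"files": 1.1e8, "size_gb": 0.85e3, "seconds": 2.2e5},
    ("all", "APAC"): {"files": 1.0e8, "size_gb": 1.20e3, "seconds": 3.2e5},
}


def average_throughput(run):
    """Return (files per second, megabytes per second) averaged over a run."""
    files_per_s = run["files"] / run["seconds"]
    mb_per_s = run["size_gb"] * 1000 / run["seconds"]  # assumes 1 GB = 1000 MB
    return files_per_s, mb_per_s


for (dataset, source), run in runs.items():
    files_per_s, mb_per_s = average_throughput(run)
    print(f"{dataset}/{source}: {files_per_s:.1f} files/s, {mb_per_s:.2f} MB/s")
```

With these inputs the AIH run averages about 500 files per second and the APAC run about 312 files per second, both at a little under 4 MB per second.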
Fig 3. Reading and writing throughput of the SUS-openEHR-Builder tool for generating XML composition objects for the AIH and APAC datasets.
The run covers 55.47 million and 7.75 million patients from the AIH and APAC datasets, respectively (the whole ORBDA source database). Parallel setup: 10 jobs.
Fig 4. Inserting throughput of Couchbase, ElasticSearch and eXist-db databases in different server–client topologies for the AIH and APAC datasets.
1S1C: 1 server, 1 client; 1S8C: 1 server, 8 clients; 3S8C: 3 servers, 8 clients. EHR in JSON format used in Couchbase and ElasticSearch. EHR in XML format used in eXist-db.
Fig 5. Querying latency of Couchbase, ElasticSearch and eXist-db databases in different server–client topologies for the AIH and APAC datasets.
1S1C: 1 server, 1 client; 1S8C: 1 server, 8 clients; 3S8C: 3 servers, 8 clients. FETCH: search and retrieval of compositions using an EHR identifier. SEARCH: search of EHR identifiers containing a diagnostic code.
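The two query types defined in the caption can be illustrated with a toy in-memory store. This is a minimal sketch of what FETCH and SEARCH return, not of how Couchbase, ElasticSearch or eXist-db execute them. The class, its method names and the dictionary-shaped compositions are hypothetical and do not follow the openEHR reference model.

```python
from collections import defaultdict


class ToyEhrStore:
    """Hypothetical in-memory store illustrating FETCH and SEARCH semantics."""

    def __init__(self):
        self._compositions = defaultdict(list)  # EHR id -> list of compositions
        self._by_diagnosis = defaultdict(set)   # diagnostic code -> set of EHR ids

    def insert(self, ehr_id, composition):
        """Store a composition and index its diagnostic codes."""
        self._compositions[ehr_id].append(composition)
        for code in composition.get("diagnoses", []):
            self._by_diagnosis[code].add(ehr_id)

    def fetch(self, ehr_id):
        """FETCH: retrieve all compositions for one EHR identifier."""
        return list(self._compositions.get(ehr_id, []))

    def search(self, code):
        """SEARCH: return the EHR identifiers that contain a diagnostic code."""
        return set(self._by_diagnosis.get(code, set()))


store = ToyEhrStore()
store.insert("ehr-1", {"diagnoses": ["O80.0"]})
store.insert("ehr-1", {"diagnoses": ["J18.9"]})
store.insert("ehr-2", {"diagnoses": ["J18.9"]})

print(len(store.fetch("ehr-1")))      # number of compositions stored for ehr-1
print(sorted(store.search("J18.9")))  # EHR ids with a J18.9 diagnosis
```

FETCH is a key lookup, whereas SEARCH depends on a secondary index over content, which is why the two are reported separately when comparing storage back ends.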