| Literature DB >> 28903729 |
Toan C Ong1, Michael G Kahn2,3, Bethany M Kwan4, Traci Yamashita5, Elias Brandt6, Patrick Hosokawa4, Chris Uhrich7, Lisa M Schilling5.
Abstract
BACKGROUND: Electronic health records (EHRs) contain detailed clinical data stored in proprietary formats with non-standard codes and structures. Participating in multi-site clinical research networks requires EHR data to be restructured and transformed into a common format and standard terminologies, and optimally linked to other data sources. The expertise and scalable solutions needed to transform data to conform to network requirements are beyond the scope of many health care organizations and there is a need for practical tools that lower the barriers of data contribution to clinical research networks.Entities:
Keywords: Data harmonization; Distributed research networks; Electronic health records; Extraction; Rule-based ETL; Transformation and loading
Mesh:
Year: 2017 PMID: 28903729 PMCID: PMC5598056 DOI: 10.1186/s12911-017-0532-3
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1Workflows of D-ETL approach to integration two source datasets
Fig. 2Examples of input data validation rules with loose and strict validation criteria
Attribute Type of a D-ETL rule
| Attribute | Description | Group |
|---|---|---|
| Rule order | Rule identification number. All rows of a rule should have the same rule order | Identification |
| Rule description | Short description with maximum length of 255 characters to describe the purpose of the rule. | Identification |
| Target database | Name of target database | Target |
| Target schema | Name of target schema | Target |
| Target table | Name of target table | Target |
| Target column | Name of target column | Target |
| Map type | Type of row. Possible values: PRIMARY, JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, WHERE, VALUE, CUSTOM. | Source |
| Map order | Identification of row within a rule | Identification |
| Source database | Name of source database | Source |
| Source schema | Name of source schema | Source |
| Source table | Name of source table | Source |
| Source value | VALUE row: The value used to populate target column | Source |
Example of a D-ETL rule that loads data into the Care_site table in OMOP from a claims-based source CSV file
| Rule Order | Rule Description | Target Table | Target Column | Map Type | Map Order | Source Table | Source Value |
|---|---|---|---|---|---|---|---|
| 1 | Medical_claims to Care_site | Care_site | PRIMARY | 1 | Medical_claims | medical_claims.billing_provider_id, medical_claims.place_of_service_code, provider.provider_organization_type | |
| 1 | Medical_claims to Care_site | Care_site | JOIN | 2 | provider | medical_claims.billing_provider_id = provider.provider_id | |
| 1 | Medical_claims to Care_site | Care_site | WHERE | 3 | provider.provider_organization_type in (‘1’, ‘2’) | ||
| 1 | Medical_claims to Care_site | Care_site | care_site_source_value | VALUE | 4 | medical_claims.billing_provider_id || ‘-’ ||medical_claims.place_of_service_code||’-’ ||provider.provider_organization_type | |
| 1 | Medical_claims to Care_site | Care_site | organization_source_value | VALUE | 5 | NULL | |
| 1 | Medical_claims to Care_site | Care_site | place_of_service_source_value | VALUE | 6 | medical_claims.place_of_service_code | |
| 1 | Medical_claims to Care_site | Care_site | care_site_address_1 | VALUE | 7 | provider.provider_address_first_line | |
| 1 | Medical_claims to Care_site | Care_site | care_site_address_2 | VALUE | 8 | provider.provider_street | |
| 1 | Medical_claims to Care_site | Care_site | care_site_city | VALUE | 9 | provider.provider_city | |
| 1 | Medical_claims to Care_site | Care_site | care_site_state | VALUE | 10 | provider.provider_state | |
| 1 | Medical_claims to Care_site | Care_site | care_site_zip | VALUE | 11 | provider.provider_zip | |
| 1 | Medical_claims to Care_site | Care_site | care_site_county | VALUE | 12 | NULL |
Fig. 3Architecture of the ETL engine
Challenges and solutions
| Challenges | Solution by D-ETL Approach |
|---|---|
| Heterogeneity in source data sets | • ETL specifications |
| Data extraction interferes with source EHR | • CSV file format |
| Efficiency | • Integrated D-ETL engine |
| Duplicate and overlapping data | • Automated data de-duplication and incremental data loading |
| Data quality | • Input data: Extracted data validation |
| Human expertise | • Explicit rule structure |
| Resumption (ability to continue from a point where an error previously occurred) | • Modular ETL process |
D-ETL engine performance in ROSITA
| Rule number | Number of source tables | Number of records (in all source tables) | Run-time (in seconds) |
|---|---|---|---|
| 1 | 1 | 21,565 | 1.1 |
| 2 | 2 | 851,706 | 30.3 |
| 3 | 2 | 1,910,513 | 12.0 |
| 4 | 2 | 1,324,860 | 13.1 |
| 5 | 3 | 1,987,582 | 15.3 |
| 6 | 3 | 2,007,661 | 30.1 |
Examples of extracted data validations
| Validation Rule | Type of error | Description |
|---|---|---|
| Data in a date or timestamp column cannot be parsed as a date | Error | Invalid date data will fail date operators and functions |
| Data is missing in a field defined as required by the schema | Error | Missing data in required fields will violate database constraints of target schema |
| A column in the schema has a missing length, precision or scale | Warning | Default length, precision or scale can be used |
| Data in a numeric or decimal column is not a number | Error | Invalid numeric data will fail numeric operators and functions |
| Data is too long for text or varchar field | Error | Data loss will occur if long text values are truncated to meet length requirement |
| INSERT INTO tableName <fieldList> | |
| SELECT <Transformed fieldList> | |
| FROM <tableList> |