| Literature DB >> 31234921 |
Mark D Danese1, Marc Halperin2, Jennifer Duryea2, Ryan Duryea2.
Abstract
BACKGROUND: Most healthcare data sources store information within their own unique schemas, making reliable and reproducible research challenging. Consequently, researchers have adopted various data models to improve the efficiency of research. Transforming and loading data into these models is a labor-intensive process that can alter the semantics of the original data. Therefore, we created a data model with a hierarchical structure that simplifies the transformation process and minimizes data alteration.Entities:
Keywords: Claims data; Data model; Electronic health records
Mesh:
Year: 2019 PMID: 31234921 PMCID: PMC6591926 DOI: 10.1186/s12911-019-0837-5
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1Relationships Among the Collections, Contexts, and Clinical Codes Tables. Note: EHR = electronic health record. HCPCS = Healthcare Common Procedure Coding System. NDC = National Drug Code. ICD = International Classification of Diseases. Figure does not contain specific data, but is intended to show the conceptual relationships among data elements across tables
Fig. 2Hierarchical View of the Generalized Data Model. Note: Table names and key relationships among tables are depicted above. See Additional file 1 for more detail on tables. Tables in green serve as lookup tables across the database. There is a single Addresses table for unique addresses with relationships to Patients, Practitioners, and Facilities, and a single Practitioners table with relationships to Patients and Contexts Practitioners. The Contexts Practitioners table allows multiple practitioners to be associated with a Context record
Fig. 3Visualization of the ETL Process for SynPUF Data. Note: Clinical codes are derived from a single row in the source data set (SynPUF record). Colored arrows indicate how each group of codes is used to create records. Each code from the original record gets its own row in the Clinical Codes table. Codes that are grouped together (e.g., line diagnosis 1 and procedure 1 in yellow) share the same context. In the Contexts table, type concept id ending in “64” indicates a claim level context, and the id ending in “65” indicates a line level context. The three contexts (groups of codes) share the same collection id
Example concepts table for variables in the SEER data
| id | vocabulary_id | concept_code | concept_text |
|---|---|---|---|
| 100010943 | naaccr_grade | 1 | Grade I; grade i; grade 1; well differentiated; differentiated, NOS |
| 100010944 | naaccr_grade | 2 | Grade II; grade ii; grade 2; moderately differentiated; moderately well differentiated; intermediate differentiation |
| 100010945 | naaccr_grade | 3 | Grade III; grade iii; grade 3; poorly differentiated; differentiated |
| 100010946 | naaccr_grade | 4 | Grade IV; grade iv; grade 4; undifferentiated; anaplastic |
| 100010947 | naaccr_grade | 5 | T-cell; T-precursor |
| 100010948 | naaccr_grade | 6 | B-cell; Pre-B; B-Precursor |
| 100010949 | naaccr_grade | 7 | Null cell; Non T-non B; |
| 100010950 | naaccr_grade | 8 | N K cell (natural killer cell) |
| 100010951 | naaccr_grade | 9 | cell type not determined, not stated or not applicable |
Note: SEER = Surveillance, Epidemiology and End Results, a collection of data from cancer registries across the United States. NAACCR = North American Association of Central Cancer Registries
Example mapping table for conversion to the Sentinel model
| concept_id _1 | relationship _id | concept_id _2 | vocabulary_id_1 | concept_code_1 | concept_text_1 | vocabulary_id_2 | concept_code_2 | concept_text_2 |
|---|---|---|---|---|---|---|---|---|
| 100000002 | is_a | 100000092 | synpuf_sex | 2 | Female | sentinel_sex | F | Female |
| 100000001 | is_a | 100000093 | synpuf_sex | 1 | Male | sentinel_sex | M | Male |
| 100000003 | is_a | 100000095 | synpuf_race | 1 | White | sentinel_hispanic | N | Not Applicable |
| 100000004 | is_a | 100000095 | synpuf_race | 2 | Black | sentinel_hispanic | N | Not Applicable |
| 100000005 | is_a | 100000095 | synpuf_race | 3 | Others | sentinel_hispanic | N | Not Applicable |
| 100000006 | is_a | 100000097 | synpuf_race | 5 | Hispanic | sentinel_hispanic | Y | Yes |
| 100000005 | is_a | 100000098 | synpuf_race | 3 | Others | sentinel_race | 0 | Unknown |
| 100000006 | is_a | 100000098 | synpuf_race | 5 | Hispanic | sentinel_race | 0 | Unknown |
| 100000004 | is_a | 100000101 | synpuf_race | 2 | Black | sentinel_race | 3 | Black or African American |
| 100000003 | is_a | 100000103 | synpuf_race | 1 | White | sentinel_race | 5 | White |
Note: This table shows how sex and race/ethnicity would be associated if the GDM were to use the Sentinel definitions as its defaults. Mappings allow users to query the data based on standardized default values for these variables while retaining the values from the original data. The GDM can support more extensive mapping relationships, but these are not required. Note that only the first 3 columns are part of the Mappings table. The remaining columns are created from joining to the Concepts and Vocabularies tables