James R Rogers, Tiffany J Callahan, Tian Kang, Alan Bauck, Ritu Khare, Jeffrey S Brown, Michael G Kahn, Chunhua Weng.
Abstract
INTRODUCTION: Existing data quality (DQ) checks are represented in heterogeneous formats, making it difficult to compare, categorize, and index them. This study contributes a data element-function conceptual model to facilitate the categorization and indexing of DQ checks and explores the feasibility of leveraging natural language processing (NLP) for scalable acquisition of knowledge of common data elements and functions from DQ check narratives.
Keywords: clinical data research networks; data quality; electronic healthcare records; knowledge acquisition; natural language processing
Year: 2019 PMID: 31065558 PMCID: PMC6484368 DOI: 10.5334/egems.289
Source DB: PubMed Journal: EGEMS (Wash DC) ISSN: 2327-9214
Figure 1. High-level overview of workflow with example DQ checks and their corresponding constructs, terms, and suggested domains. A data element is a focus of a DQ check (annotation is represented by “[[ data element ]]” for parsing); a function is the qualitative or quantitative evaluation over the data element (annotation is represented by “{{ function }}” for parsing). Each DQ check is essentially a function of a data element.
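The annotation convention described in the Figure 1 caption can be illustrated with a minimal parsing sketch. This is not the authors' pipeline: the function name `parse_check`, the regular expressions, and the example check text are assumptions made for illustration, using only the “[[ data element ]]” and “{{ function }}” markers named in the caption.

```python
import re

# Patterns for the two annotation markers named in the Figure 1 caption.
# Surrounding whitespace inside the markers is dropped from the captured term.
DATA_ELEMENT = re.compile(r"\[\[\s*(.+?)\s*\]\]")
FUNCTION = re.compile(r"\{\{\s*(.+?)\s*\}\}")


def parse_check(text):
    """Return the annotated functions and data elements found in one DQ check.

    Hypothetical helper: the source does not describe its parser.
    """
    return {
        "functions": FUNCTION.findall(text),
        "data_elements": DATA_ELEMENT.findall(text),
    }


# Hypothetical annotated check, built from terms that appear in the OHDSI table.
example = "{{ Number of }} [[ persons ]] by year of birth"
print(parse_check(example))
# {'functions': ['Number of'], 'data_elements': ['persons']}
```

Under this reading, each parsed check yields a (function, data element) pair, which matches the caption's statement that a DQ check is essentially a function of a data element.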
Allowable terms for domain descriptions of OHDSI DQ check constructs.
| Domain | Domain Definition | Terms | Count of Unique Terms in Domain | Count of Checks in Domain |
|---|---|---|---|---|
| Age | DE related to age-specific variables | age at first observation period; age at death; age | 3 | 10 |
| Care Site | DE related to places of care variables | care sites | 1 | 3 |
| Condition | DE related to condition-specific variables | condition occurrence records; condition occurrence concepts; condition eras; condition era length; condition era concepts | 5 | 15 |
| Death | DE related to death-specific variables | death records; records of death; time from death | 3 | 9 |
| Insurance | DE related to insurance-specific variables | procedure cost records; total paid; total out-of-pocket; paid toward deductible; paid copay; paid coinsurance; paid by payer; paid by coordination of benefit; ingredient_cost; drug cost records; dispensing fee; average wholesale price; payer plan (days) of first payer plan period | 13 | 23 |
| Medication | DE related to medication-specific variables | refills; quantity; drug occurrence records; drug exposure records; drug exposure concepts; drug eras; drug era records; drug era length; drug era concepts; days_supply | 10 | 20 |
| Numeric Values | DE related to unspecified numeric values | numeric values | 1 | 1 |
| Observations | DE related to observation-centric variables | records; observation records; observation occurrence records; observation occurrence concepts; observation (days) of first observation period | 5 | 16 |
| Person | DE that examines only persons | Persons | 1 | 55 |
| Procedure | DE related to procedure-specific variables | procedure occurrence records | 1 | 8 |
| Provider | DE related to provider-specific variables | Providers | 1 | 3 |
| Visit | DE related to visit-record related variables | visits; visit records; visit occurrence records; visit occurrence concepts; length of stay | 5 | 9 |
| Count | Measures the count of DE relative to a certain specification (e.g., number of persons with X) | Number of | 1 | 128 |
| Distribution | Measures the dispersion of DE across a certain specification (e.g., distribution of age by X) | Distribution of | 1 | 39 |
| Time Length | Measures the time frame of DE given a certain specification (e.g., length of observation period (days) of first observation period by X) | Length of | 1 | 5 |
Figure 2. Horizontal bar charts of frequency of DQ check domains specific to OHDSI, overlaid with DQ harmonization categories. DQ Harmonization Categories brief descriptions: Completeness, Atemporal is the data’s presence in a particular context at an individual time point; Conformance, Calculation is the data’s compliance to constraints relating to computationally derived values from existing data; Conformance, Relational is the data’s compliance to structural constraints as it relates to physical database structure specifications (e.g., primary key and foreign key relationships); Plausibility, Atemporal is the data’s feasibility at an individual time point; Plausibility, Temporal is the data’s feasibility across a series of time points in a defined time period.
Allowable terms for domain descriptions of CESR DQ check constructs.
| Domain | Domain Definition | Sample Term Descriptions* | Count of Unique Terms in Domain** | Count of Checks in Domain |
|---|---|---|---|---|
| Birth | DEs related to birth-related variables | birth date, bdate | 4 | 11 |
| Bone Measurement | DEs related to bone measurement variables | bone measured, machine type used, scan date | 10 | 42 |
| Care Site | DEs related to place of care specific variables | facility name | 2 | 20 |
| Condition | DEs related primarily to condition-specific variables | principal diagnosis, diagnosis code type, original diagnosis | 17 | 85 |
| Date | DEs related to unspecified date variables | Date | 2 | 24 |
| Death | DEs related to death-specific variables | death date, age at death | 6 | 24 |
| Enrollment | DEs related to enrollment-specific variables | enrollment start date, enrollment end date, enrollment basis | 11 | 30 |
| Ethnicity | DEs related to ethnicity-specific variables | Hispanic | 2 | 10 |
| Gender | DEs related to gender-specific variables | gender | 2 | 10 |
| Internal ID | DEs defined by internally utilized constructions | protocol ID, row ID, template ID | 53 | 144 |
| Lab | DEs related to lab-specific variables | test type, specimen source, modification measures (e.g., high, low, etc.) | 26 | 121 |
| Language | DEs related to speaking language variables | primary language, need for interpreter, language usage | 8 | 36 |
| Medication | DEs related to medication-specific variables | refills, quantity, dosage form, dosage amount, order date, prescription date, infusion duration | 120 | 764 |
| MRN | DEs related to medical record numbers | general MRNs, table-specific MRNs (such as related to enrollment) | 12 | 153 |
| Observations | DEs related to ambiguous variables | unit of measure, type of activity, message-related characteristics | 43 | 184 |
| Procedure | DEs related to procedure-specific variables | CPT modifiers, procedure date, original procedure | 8 | 43 |
| Provider | DEs related to provider-specific variables | specialty, provider demographics, provider type | 26 | 146 |
| Race | DEs related to race-specific variables | race listed (i.e., “race1, race2, etc.”), race cross-section with ethnicity (e.g., non-Hispanic white) | 24 | 97 |
| Social History | DEs related to social history variables | smoking use, alcohol use, drug use | 53 | 216 |
| Socioeconomic Factors | DEs related to socioeconomic variables | household income, poverty status, education level, insurance-related | 117 | 464 |
| Tumor | DEs related to cancer-specific variables | SSF measures, stages of progression, dates of particular cancer-related therapies (e.g., chemotherapy) | 107 | 534 |
| Visit | DEs related to visit-specific variables | inpatient length of stay, discharge status, admission type | 29 | 190 |
| Vital | DEs related to vital-specific variables | weight measurements, blood pressure measurements, pulse measurements | 20 | 86 |
| Category | Examines whether or not appropriate categories of a DE are correctly entered | Category | 1 | 243 |
| Consistency | Examines if a target DE follows an expected pattern with another DE | Expected order of values; Compare to; Extra check; Consistency | 4 | 10 |
| Count | Measures the count of a DE (either by the DE or categories of the DE) | Frequency; Counts; Number | 3 | 380 |
| Cross tab | Cross-section of a target DE with other DEs | Cross tab | 1 | 19 |
| Distribution | Examines context-specific dispersion of a DE | Distribution | 1 | 1 |
| Existence | Examines if the DE itself is present | Exist; Existence | 2 | 715 |
| Link | Examines if DE is linked correctly | Link | 1 | 41 |
| Missing | Examines if a DE’s entries are present | Missing | 1 | 722 |
| Overlap | Examines multiple locations of DE occurrence | Not overlap; Overlap | 2 | 3 |
| Sum | Measures the sum of DEs (typically used for proportions that must add to 1) | Sum | 1 | 49 |
| Trend | Examines time fluctuation of a DE | Trend | 1 | 29 |
| Uniqueness | Examines if DE duplicates are present | Uniqueness | 1 | 29 |
| Variable Length | Examines the variable length of a DE | Length | 1 | 467 |
| Variable Type | Examines how the DE is stored or defined (e.g., date format, integer, etc.) | Type | 1 | 726 |
* Note that select example data element terms and descriptions are provided because some terms are proprietary and some data elements have many terms.
** Unique terms are counted case-sensitively; for example, “race1” and “RACE1” are counted as two distinct terms.
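The effect of the case-sensitive counting rule in the note above can be shown with a small sketch. The term list and the helper `count_unique` are hypothetical and only illustrate how the "Count of Unique Terms" column would differ if case were folded.

```python
def count_unique(terms, case_sensitive=True):
    """Count distinct terms, optionally folding case before comparing.

    Hypothetical helper for illustration; not taken from the source.
    """
    if case_sensitive:
        return len(set(terms))
    return len({term.casefold() for term in terms})


# Hypothetical terms modeled on the "race1" / "RACE1" example in the note.
terms = ["race1", "RACE1", "race2"]

print(count_unique(terms))                        # 3
print(count_unique(terms, case_sensitive=False))  # 2
```

A case-insensitive count would therefore give lower unique-term totals than those reported in the table for any domain whose terms vary only by case.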
Figure 3. Horizontal bar charts of frequency of DQ check domains specific to CESR, overlaid with DQ harmonization categories. DQ Harmonization Categories brief descriptions: Completeness, Atemporal is the data’s presence in a particular context at an individual time point; Conformance, Calculation is the data’s compliance to constraints relating to computationally derived values from existing data; Conformance, Relational is the data’s compliance to structural constraints as it relates to physical database structure specifications (e.g., primary key and foreign key relationships); Conformance, Value is the data’s compliance to structural constraints as it relates to prespecified formatting constraints (e.g., data element is numeric); Plausibility, Atemporal is the data’s feasibility at an individual time point; Plausibility, Temporal is the data’s feasibility across a series of time points in a defined time period; Plausibility, Uniqueness is the data’s feasibility regarding duplication.
Figure 4. Heat maps of DQ check domains. Domains represented in both networks are indicated with an “*”.