| Literature DB >> 31531384 |
Sandra Henley-Smith1, Douglas Boyle1, Kathleen Gray1.
Abstract
BACKGROUND: Data quality frameworks within information technology and recently within health care have evolved considerably since their inception. When assessing data quality for secondary uses, an area not yet addressed adequately in these frameworks is the context of the intended use of the data.Entities:
Keywords: data quality; data warehouse; electronic health records
Year: 2019 PMID: 31531384 PMCID: PMC6676919 DOI: 10.5334/egems.298
Source DB: PubMed Journal: EGEMS (Wash DC) ISSN: 2327-9214
Level 1 Data Quality Sections and Sub Sections to be addressed in a DQ framework.
| DQ Section and Subsection areas | Explanation |
|---|---|
| This is the name of the application where the secondary use data were extracted from and were being assessed for data quality | |
| This provides context on the data warehouse environment where the secondary use data are held in relation to the following: Number of source systems Data processing type: Raw data or processed data Data extraction and storage type: Extract all data and over-write what is stored each time, or extract complete data first and then only changes after that Source data extraction type: Database, delimited text file, Excel file, other file type | |
| This is the name of the table being assessed for data quality, repeated for each table that is extracted from the source system. The following need to be addressed within the framework: Location of the table context/meaning Location of table fields/variables list Section to list out DQ characteristic assessment requirement Expected result of the characteristic Actual result of the characteristic Result of the characteristic – Pass/Fail If a fail, why did it fail | |
| This is the name of the field being assessed for data quality, within the table documented within the table section, repeated for each field within each table that is extracted from the source system. The following need to be addressed within the framework: Location of the field context/meaning Field variable type and length Field input type i.e. look up, text, date/time, integer/numeric Field allowable characters – if other than a look up field Field available variables – if a look up table: Document the variables and associated meanings Location of the look up table variables and meanings documented Section to list out DQ characteristic assessment requirement Expected result of the characteristic Actual result of the characteristic Result of the characteristic – Pass/Fail If a fail, why did it fail Document the data accuracy of the field held within the data warehouse Data interpretation, integrity and limitations Document any data interpretation issues known through the process or known through experience Document any data issues with the data in the data warehouse Document any known data limitations Repeat 4 for Field Name 2 to x, until all fields within the table have been documented Repeat 3 to 4 for Table | |
Level 1 DQ framework characteristics and validation requirements.
| This is the location of the data warehouse where data are being hosted | |
| This is the number of applications that are being extracted from to output data housed within this DW | |
| This is whether the data that are being assessed are from the raw extraction still in the separate source system tables or whether data have been processed and combined into associated tables and fields between the source systems | |
| This is how the data handling of the extraction is done, either extracting all data each time and overwriting what is stored in the DW, or a complete extract when first requested and then extracting updated records only from each of the required tables in subsequent data extractions | |
| This is the name of the source database software that is being assessed | |
| This is how the data were extracted from the source system, i.e. extracted from a CSV file created by the source provider or extracted from a database through the tables or views | |
| Define where or by whom the CSV/text file was created or what the other source of the data is, e.g. from a pathology laboratory CSV file created by the lab’s IT department based upon defined requirements | |
| This is the name of the database table or CSV/text file that the data has come from as the source that is being assessed | |
| This is either the tables context/meaning written or the location of the file that contains this information | |
| This is either the tables’ fields listed or the location of the file that contains this information | |
| The table within the data warehouse should be structured so that it contains easily identifiable fields/columns that can be used as a foreign key, so data are easily linkable in a usable and meaningful way | |
| The table must have a unique record ID that is not repeated without easily identifiable reasons, i.e. record has been updated or deleted; a new record has been created for the same record ID; the old record is expired and the status of the new row is set to updated or deleted | |
| The table contains gold-standard fields that enable auditors and users of the secondary data to know if the data contained within the record have been updated and by whom; to know if the record was active, inactive or deleted at the time of data extraction | |
| This lists the system field names held within the table for the Created Date, Created By, Updated Date, Updated By and Record Status fields, including the variable code and associated meaning of the record status field or the location of where this information is held | |
| The data warehouse as a gold standard should have in each table and each record when it was exported as a date/time stamp from the source system; imported as a date/time stamp into the data warehouse and the status of each record – to ensure the latest data, or data required at a set date and time, are used as required | |
| This lists the system field names held within the table for the Imported Date, Exported Date and Data Warehouse Import Status fields, including the variable code and associated meaning of the Data Warehouse Import Status field or the location of where this information is held | |
| Data held within the table are not duplicated values, with the exception of updated records and deleted records for a specific record and patient held within a table – i.e. patient’s postcode has changed from 3001 to 3124: a new record with the same record ID but an updated DW import status code exists within the table | |
| Data held within the table are stored within correct timeframes and events expected – i.e. a patient’s appointment start date and time is before the end date and time of the same appointment | |
| Data held within the table that display events that are required to have multiple entries, have them in the expected sequence and associated values based upon external and internal standards or regulations – i.e. date of an initial immunization precedes date of a booster immunization | |
| Data held within the table show expected fluctuations for time-orientated events based upon local and external knowledge – i.e. increase in influenza immunizations during flu season | |
| This is the name of the field within the source database software’s database table that is being assessed | |
| This is either the field’s context/meaning as written or the location of the file that contains this information | |
| This is the type of field and the length of the field – i.e. char 60 | |
| This indicates if the field is a primary key, composite primary key or a foreign key. This can be skipped if it is not identified as a key field | |
| This is the allowable data input that the field will accept | |
| This is the ASCI characters that the field will allow to be entered – i.e. a phone number field will only allow numeric values with no spaces. This can be skipped if it is a look up field | |
| If the field is a look up table, this will list either the table location and joining field of the look up values, if there are greater than ten options, or it will list the variable value and corresponding description. This can be skipped if it is not a look up field | |
| Data contained within the field need to conform to the required expected field type requirements for the system and external standards where the system is being used – i.e. postcode for Australia needs an integer value, of four integers | |
| The data held within the field must contain only the expected values or ranges that the field allows, based upon what the system has been designed to use – i.e. sex can only allow one numeric value that is translatable or one alpha value that is translatable | |
| Data held within the field conform to known calculation requirements and can be validated with manual calculation of formulas – i.e. the body mass index calculated within the system yields the same results as a manual calculation with the same values | |
| Data held within the field are not missing or null/blank based upon expected local and external standard requirements – i.e. sex is expected always to have a value present; work contact phone number can be null/blank as not everyone has one | |
| Data held within the field are null/blank until the action required to generate a value has occurred, within the expected time frames of the local and external standard requirements – i.e. medical discharge time is missing for three consecutive days | |
| The data stored within the field are stored and displayed with expected values that local and external standards would suggest are acceptable – i.e. height and weight values are positive and above 0 | |
| The data stored within the field are in agreement with external standards and knowledge – i.e. the weight of an adult cannot be below 10 | |
| The data stored within the field display expected results based upon local and external knowledge and known facts and common sense – i.e. a patient identified as male does not have a pregnancy documented | |
| The data stored within the field compared to data of a similar or same requirement display acceptable variability between the data – i.e. sitting blood pressure taken is within similar ranges such as 160/85 at Time 1 and 145/80 (rather than 85/160) at Time 2 | |
| This determines if the data held within the field are based upon the assessment of the above characteristics – i.e. if the data have passed with good data quality or failed with bad data quality | |
| The percentage of data held within the field that is accurate based upon local knowledge and standards – i.e. % of patients who have a sex associated to them and with the correct values based upon the context of the system | |
| The percentage of data held within the field, which has a value held within the field based upon local knowledge and standards – i.e. % of patients who have a sex associated to them. | |
| Document the limitations of the data held within the field based upon the context of the system the data were obtained from | |
| Document how the data that are held within the field and table can be misinterpreted – i.e. the doctor associated to a patient from an imported patient record does not have that doctor name within the user table of the application the data were exported from | |
| Document any issues the data can have from local and internal knowledge of the applications – i.e. Medical Director™ allows a user to code a fever as a procedure | |
| Document any other relevant information. | |
| This determines if the data held within the table are based upon the assessment of the above characteristics – i.e. if the data have passed with good data quality or failed with bad data quality | |
| The percentage of data held within the table that is accurate based upon local knowledge and standards – i.e. % of patients who have a sex associated to them and with the correct values based upon the context of the system | |
| The percentage of data held within the table, that has a value held within the table, that is based upon local knowledge and standards – i.e. % of patients who have a sex associated to them | |
| Document the limitations of the data held within the table based upon the context of the system the data was obtained from | |
| Document how the data that are held within the table can be misinterpreted – i.e. the doctor associated to a patient from an imported patient record does not have that doctor name within the user table of the application the data were exported from | |
| Document any issues the data can have from local and internal knowledge of the applications – i.e. Medical DirectorTM allows a user to code a fever as a procedure | |
| Document any other relevant information | |
Level 2 Data Quality Sections and Sub Sections needing to be addressed.
| Level 2 DQ Section and Subsection areas | Explanation |
|---|---|
| What is the area of interest/question to be investigated through the data analysis | |
| This will provide context on the area of interest/question and what is required to ensure the correct context is being used when assessing the quality of the data in the data warehouse and what needs to be addressed within the framework:
Location of the research client’s documented variable/data item list for the data required to be extracted, that contains in generic non-system specific terms: Category/Area i.e. current prescriptions, past prescriptions, past history, reason for visit, patient demographic information Variable/data item within each category/area Number of source systems and the names of the source systems required to be extracted from Location of the client’s documented restrictions on the data to be provided – i.e. age-specific conditions only Data extraction and storage type: extract all data and over-write what is stored each time, or extract complete data first and then only changes after that Source data extraction type: database, delimited text file, Excel file, other file type | |
| This lists out the tables required to answer the area of interest/question and what needs to be addressed within the framework:
Location of the documented mapping between the secondary user’s area/category list and the tables that can be supplied Number of tables that can be supplied for required category/areas Location of the tables that cannot be supplied and justification as to why this is not possible – e.g. not available within the Best Practice™ source system Section to list out DQ characteristic assessment requirements Expected result of the characteristic Actual result of the characteristic Result of the characteristic – Pass/Fail If a fail, why did it fail | |
| This lists out the tables required to answer the area of interest/question and what needs to be addressed within the framework:
Location of the documented mapping between the secondary user’s variable/data item list and the fields that can be supplied Number of fields that can be supplied for required category/areas Location of the documentation containing which fields that cannot be supplied and justification as to why this is not possible – e.g. not available within Best Practice™ Section to list out DQ characteristic assessment requirements Expected result of the characteristic Actual result of the characteristic Result of the characteristic – Pass/Fail If a fail, why did it fail Document the data accuracy of the field held within the data warehouse relating to the secondary user’s area of interest or question only Data interpretation, integrity and limitations Document any data interpretation issues known through the process or known through experience Document any data issues with the data in the data warehouse Document any known data limitations | |
Worked example of Level 2 DQ framework characteristics and validation requirements.
| This is the question to be answered, for which data are needed – i.e. are children prescribed antibiotics when seeing the doctor for a cold? | |
| This is the location of the document that contains the list of the areas/categories and their associated data variables that are required for the data analysis to help answer the questions – i.e. area/category could be patient demographics, visit reason and their associated data variables could be sex, year of birth, age, reason, date of visit | |
| This is the name of the source systems and the number required to have data extracted from them – i.e. two General Practice patient management systems (Best Practice™ and Medical Director™ in this instance) | |
| This is the location of the document that contains any restrictions based upon human research ethics review board approval of what data may be obtained – i.e. only data between the dates of 01 Jan 2009 and 01 Dec 2016 and patients who are between 0 and 18 years old | |
| This is the location of the document that contains the associated table names to the area/category required from the source system to be extracted – i.e. patient demographics would map to the patient table | |
| This is the location of the document that contains the required area/categories that are not available within the source system/s. Even if data are available in one but not the other, this needs to be documented with a reason why – i.e. Best Practice™ and Medical Director™ are unable to provide illicit drug history data as this category is not recorded anywhere within these systems | |
| This is the name of the area that the client has requested to be extracted | |
| This is the name of the source system the table is being extracted from – i.e. Best Practice™ or Medical Director™ | |
| This is the source system’s table name as shown in the database | |
| The table contains gold-standard fields that enable auditors and users of secondary data to know if the data contained within the record have been updated and by whom, and whether the record was active, inactive or deleted at the time of data extraction | |
| This lists the system field names held within the table for the Created Date, Created By, Updated Date, Updated By and Record Status fields, including the variable code and associated meaning of the record status field or the location of where this information is held | |
| The data warehouse as a gold standard should have in each table and each record when they were exported as a date/time stamp from the source system, imported as a date/time stamp into the data warehouse and the status of each record to ensure the latest data, or data required at a set date and time, are used | |
| This lists the system field names held within the table for the Imported Date, Exported Date and Data Warehouse Import Status fields, including the variable code and associated meaning of the Data warehouse Import Status field or the location of where this information is held | |
| The data held within the table are able to meet the required constraints or restrictions to meet the needs of the area of interest or answer the required question/s – i.e. data held within the table contain the details about variant brand names for antibiotics such as amoxycillin | |
| Data held within the table are not duplicated values with the exception of updated records and deleted records for a specific record and patient held within a table – i.e. patient’s postcode has changed from 3001 to 3124: a new record with the same record ID but an updated Data Warehouse Import Status code exists within the table | |
| Data held within the table are stored within correct timeframes and events expected – i.e. a patient’s appointment start date and time are before the end date and time of the same appointment | |
| Data held within the table fall within the expected timeframes that the area of interest/question is requiring of the data | |
| This is the location of the document that contains the associated field names to data variables required from the source system to be extracted – i.e. reason for visit would map to the ‘VisitReason’ table Reason field | |
| This is the location of the document that contains the required data variables that are not available within the source system/s. Even if it is available in one but not the other, this needs to be documented with a reason why – e.g. Best Practice™ is unable to provide the data variable Country of Birth | |
| This is the name of the data variable that the client has requested to be extracted | |
| This is the name of the source system the field is being extracted from -i.e. Best Practice™ or Medical Director™ | |
| This is the source system’s field name as shown in the database | |
| Data contained within the field need to conform to the required expected field type requirements for the system and external standards where the system is being used – i.e. postcode for Australia needs an integer value and must be 4 digits | |
| The data held within the field must contain only the expected values or ranges that the field allows, based upon what the system has been designed to use – i.e. sex can only allow 1 numeric value that is translatable or 1 alpha value that is translatable | |
| Data held within the field conform to known calculation requirements and can be validated with manual calculated formulas – i.e. the body mass index calculated within the system yields the same result as a manual calculation with the same values | |
| Data held within the field are not missing or null/blank based upon expected local and external standard requirements – i.e. sex is expected to have a value present always; work contact telephone number can be null/blank as not everyone has one | |
| Data held within the field are null/blank until an action that generates the value, within the expected time frames of the local and external standard requirements – i.e. medical discharge time is missing for three consecutive days. | |
| The data stored within the field are stored and displayed with expected values that local and external standards would suggest are acceptable – i.e. height and weight values are positive and above 0 | |
| The data stored within the field are in agreement with external standards and knowledge – i.e. the weight of an adult cannot be below 10 | |
| The data stored within the field display expected results based upon local and external knowledge and known facts and common sense – i.e. a patient identified as male does not have a pregnancy documented | |
| The data stored within the field compared to data of a similar or same requirement display acceptable variability between the data – i.e. sitting blood pressure taken is within similar ranges such as 160/85 at Time 1 and 145/80 (rather than 85/160) at Time 2 | |
| This determines if the data held within the field are relevant to the area of interest/question based upon the assessment of the above characteristics; if the data are able to address the required area of interest/question without modification; or if potential modification of the data request, or of the research question, might be required to obtain value out of the data provided | |
| The percentage of data held within the field that can answer the required question/area of interest is accurate based upon local knowledge and standards – i.e. % of patients who have a sex associated to them and with the correct values based upon the context of the system | |
| The percentage of data held within the field that has a value held within the field that can answer the required question/area of interest is complete, based upon local knowledge and standards – i.e. % of patients who have a sex associated to them | |
| Document the limitations of the data relating to the question/area of interest held within the field based upon the context of the system the data were obtained from | |
| Document how the data that are held within the field and table can be misinterpreted relating to the question/area of interest – i.e. the doctor associated to a patient from an imported patient record does not have that doctor name held within the user table of the application the data were exported from | |
| Document any issues the data can have relating to the question/area of interest from local and internal knowledge of the applications – i.e. Medical Director™ allows a user to code a fever as a procedure | |
| Document any other relevant information relating to the question/area of interest | |
| This determines if the data held within the table that are relevant to the area of interest/question based upon the assessment of the above characteristics, are able to yield the answer to the required area of interest/question without modification; or if potential modification/diversion might be required to obtain value out of the data provided | |
| The percentage of data held within the table that can answer the required question/area of interest is accurate based upon local knowledge and standards – i.e. % of patients who have a sex associated to them and with the correct values based upon the context of the system | |
| The percentage of data held within the table that has a value that can answer the required question/area of interest is complete, that is based upon local knowledge and standards – i.e. % of patients who have a sex associated to them | |
| Document the limitations of the data held within the table based upon the context of the system the data was obtained from | |
| Document how the data held within the table relating to the question/area of interest can be misinterpreted – i.e. the doctor associated to a patient from an imported patient record, does not have that doctor name held within the wser table of the application the data was exported from | |
| Document any issues the data relating to the question/area of interest can have, from local and internal knowledge of the applications – i.e. Medical Director™ allows a user to code a fever as a procedure | |
| Document any other relevant information relating to the question/area of interest | |
Cross walk of similarities and differences across Kahn et al. and our two framework levels.
| Khan et al.’s framework categories | Level 1 Revised Khan et al. DQ framework categories comparison | Level 2 New framework categories comparison |
|---|---|---|
| Kept this wording | Changed this from Conformance to Coherence in Level 2, kept most of the wording except changed Formats to Gold Standards | |
– Kept this wording and the two definitions in this section – Changed this to be a field level data quality requirement, rather than the global DQ requirement as in Khan et al. – Did not change anything else apart from adding validation requirements | – Kept this wording and the two definitions in this section – Changed this to be a field level data quality requirement, rather than the global DQ requirement as in Khan et al. – Did not change anything else apart from adding validation requirements | |
– Kept this wording and the two definitions in this section – Changed this to be a table level data quality requirement, rather than the global DQ requirement as in Khan et al. – Removed “c. Changes to the data model or data model versioning.” From the Framework, this is due to the fact that at the table and field levels of the database there is no version control in the same way there is with paper forms. The version numbering is for the front end of the system that accesses the data information and stores the information in the database behind – Added that the source system table should have Created and Updated Dates, which does give version control on the data added to the database or changed in the database from a front end application by a user – Added 4 new definitions to handle the aspects of the extracted source table context that were not captured in Khan et al. | Did not keep this in Level 2 | |
– Kept this wording and the one definition in this section – Changed this to be a field level data quality requirement, rather than the global DQ requirement in Khan et al. – Did not change anything else apart from adding validation requirements | – Kept this wording and the one definition in this section – Changed this to be a field level data quality requirement, rather than the global DQ requirement in Khan et al. – Did not change anything else apart from adding validation requirements | |
– Kept this wording and the two definitions in this section – Changed this to be a field level data quality requirement, rather than the global DQ requirement in Khan et al. | – Kept this wording and the two definitions in this section – Changed this to be a field level data quality requirement, rather than the global DQ requirement in Khan et al. | |
| Kept this wording | Kept this wording | |
– Kept this wording and the one definition in this section Though updated the wording slightly on this characteristic from what it was originally – Changed this to be a table level data quality requirement, rather than the global DQ requirement in Khan et al. | – Kept the wording for Uniqueness but changed from Plausibility to Compatibility. Kept the one definition in this section. Though updated the wording slightly on this characteristic from what it was originally – Changed this to be a table level data quality requirement, rather than the global DQ requirement in Khan et al. | |
– Kept this wording and the four definitions in this section – Changed this to be a field level data quality requirement, rather than the global DQ requirement in Khan et al. | – Kept the wording for Uniqueness but changed from Plausibility to Compatibility. Kept the one definition in this section Though updated the wording slightly on this characteristic from what it was originally – Changed this to be a field level data quality | |
– Kept this wording and the three definitions in this section. – Changed this to be a table level data quality requirement, rather than the global DQ requirement in Khan et al. – Changed this to be a table level data quality requirement, rather than the global DQ requirement in Khan et al. | – Kept the wording for Uniqueness but changed from Plausibility to Compatibility. Kept the one definition in this section. Though updated the wording slightly on this characteristic from what it was originally – Changed this to be a table level data quality requirement, rather than the global DQ requirement in Khan et al. | |
| Data warehouse context characteristics | Research question context framework | |
| Source system name and table name context characteristics | Source system table assessment framework | |
| Table name data quality framework characteristics | Source system table name framework characteristics | |
| Data coherence added to the source table name framework | ||
| Field name context characteristics | Source system field assessment framework characteristics | |
| Field name data quality framework characteristics | Source system field name framework characteristics | |
| Field name data quality framework overall results documented | Field name data quality framework overall results documented | |
| Table data quality framework overall results documented | Table data quality framework overall results documented | |