| Literature DB >> 33244417 |
Ari Ercole, Vibeke Brinck, Pradeep George, Ramona Hicks, Jilske Huijben, Michael Jarrett, Mary Vassar, Lindsay Wilson.
Abstract
BACKGROUND: High-quality data are critical to the entire scientific enterprise, yet the complexity and effort involved in data curation are vastly under-appreciated. This is especially true for large observational, clinical studies because of the amount of multimodal data that is captured and the opportunity for addressing numerous research questions through analysis, either alone or in combination with other data sets. However, a lack of details concerning data curation methods can result in unresolved questions about the robustness of the data, its utility for addressing specific research questions or hypotheses and how to interpret the results. We aimed to develop a framework for the design, documentation and reporting of data curation methods in order to advance the scientific rigour, reproducibility and analysis of the data.
Keywords: Data quality; Delphi process; curation; design; observational studies; reporting
Year: 2020 PMID: 33244417 PMCID: PMC7681114 DOI: 10.1017/cts.2020.24
Source DB: PubMed Journal: J Clin Transl Sci ISSN: 2059-8661
Fig. 1. Flow diagram for the DAQCORD-modified Delphi process.
Key terms and concepts
| Terms | Definition in the context of use for the DAQCORD Guidelines |
|---|---|
| Actionability | The indicator can be acted upon in the data curation process to assure quality. |
| Completeness | The degree to which the data were actually collected, compared with what was expected to be collected. |
| Concordance | The agreement between variables that measure related factors. |
| Correctness | The accuracy of the data and its presentation in a standard and unambiguous manner. |
| Currency | The timeliness of the data collection and representativeness of a particular time point. |
| Curation | The management of data throughout its lifecycle (acquisition to archiving) to enable reliable reuse and retrieval for future research purposes. |
| Data | Information that is collected and stored electronically for primary and secondary analysis of health-related research. |
| Data quality factors | The completeness, correctness, concordance, plausibility and currency of data. |
| Feasibility | Information about the indicator is available or easy to obtain. |
| Indicator | A measurable variable that is used to represent the quality of the data curation methods. |
| Observational study | Any research study involving data collection without a manipulation or intervention. |
| Plausibility | The extent to which data are consistent with general medical knowledge or background information and are therefore believable. |
| Validity | The indicator reflects the quality of the data curation methods used in the research. |
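The data quality factors above can be operationalised as simple computable metrics. The following is a minimal Python sketch of two of them, completeness and plausibility; the field names, plausible ranges and example record are illustrative assumptions, not part of the DAQCORD guidelines:

```python
# Illustrative toy metrics for two data quality factors:
# completeness and plausibility. Field names and ranges are invented.

EXPECTED_FIELDS = ["participant_id", "age", "systolic_bp", "gcs_total"]

# Plausible ranges drawn from general medical knowledge (assumed values)
PLAUSIBLE_RANGES = {"age": (0, 120), "systolic_bp": (40, 300), "gcs_total": (3, 15)}

def completeness(record: dict) -> float:
    """Fraction of expected fields that were actually collected (non-None)."""
    present = sum(1 for f in EXPECTED_FIELDS if record.get(f) is not None)
    return present / len(EXPECTED_FIELDS)

def implausible_fields(record: dict) -> list:
    """Fields whose values fall outside their plausible range."""
    flagged = []
    for field, (lo, hi) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is not None and not (lo <= value <= hi):
            flagged.append(field)
    return flagged

record = {"participant_id": "P001", "age": 34, "systolic_bp": 500, "gcs_total": None}
print(completeness(record))        # 0.75 (gcs_total missing)
print(implausible_fields(record))  # ['systolic_bp']
```

In practice such metrics would be computed continuously during data collection so that completion status and implausible entries can be fed back to sites in real time.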
DAQCORD indicators
| Study phase | Dimension | Indicator |
|---|---|---|
| Design time | Correctness | The case report form (CRF) has been designed by a team with a range of expertise. |
| Design time | Completeness | There is a robust process for choosing and designing the data set to be collected that involves appropriate stakeholders, including a data curation team with an appropriate skill mix. |
| Design time | Concordance | The data ontology is consistent with published standards (common data elements) to the greatest extent possible. |
| Design time | Concordance | Data types are specified for each variable. |
| Design time | Correctness | Variables are named and encoded in a way that is easy to understand. |
| Design time | Representation | Relational databases have been appropriately normalised: steps have been taken to eliminate redundant data and remove potentially inconsistent or overly complex data dependencies. |
| Design time | Representation | Each individual has a unique identifier. |
| Design time | Representation | There is no duplication in the data set: data have not been entered twice for the same participant. |
| Design time | Completeness | Data that are mandatory for the study are enforced by rules at data entry, and user reasons for overriding the error checks (queries) are documented in the database. |
| Design time | Completeness | Missingness is defined and is distinguished from “not available”, “not applicable”, “not collected” or “unknown”. For optional data, “not entered” is differentiated from “not clinically available” depending on the research context. |
| Design time | Plausibility | Range and logic checks are in place for CRF response fields that require free entry of numeric values. Permissible values and units of measurement are specified at data entry. |
| Design time | Correctness | Free text is avoided unless there is a clear scientific justification and a feasible (e.g. qualitative) analysis plan has been specified. |
| Design time | Concordance | Database rule checks are in place to identify conflicts in data entries for related or dependent data collected in different CRFs or sources. |
| Design time | Representation | There are mechanisms in place to enforce/ensure that time-sensitive data are entered within allotted time windows. |
| Design time | Completeness | There is clear documentation of the interdependence of CRF fields, including data entry skip logic. |
| Design time | Correctness | Data collection includes fields for documenting that participants meet inclusion/exclusion criteria. |
| Design time | Representation | The data entry tool does not perform rounding or truncation of entries that might result in precision loss. |
| Design time | Plausibility | Extract/transform/load software for batch upload of data from other sources, such as assay results, flags impossible and implausible values. |
| Design time | Representation | Internationalisation is undertaken in a robust manner, and translation and cultural adaptation of concepts (e.g. assessment tools) follow best practice. |
| Design time | Concordance | Data collection methods are documented in study manuals that are sufficiently detailed to ensure the same procedures are followed each time. |
| Design time | Correctness | All personnel responsible for entering data receive training and testing on how to complete the CRF. |
| Design time | Correctness | The CRF/eCRF is easy to use and includes a detailed description of the data collection guidelines and how to complete each field in the form. It is pilot-tested in a rigorous, pre-specified and documented process until reliability and validity are demonstrated. |
| Design time | Concordance | Data collectors are tested and provided with feedback regarding the accuracy of their performance across all relevant study domains. |
| Design time | Correctness | Data collection that requires specific content expertise is carried out by trained and/or certified investigators. |
| Design time | Correctness | Assessors are blinded to treatment allocation or predictor variables where appropriate, and such blinding is explicitly recorded. |
| Design time | Correctness | There is a clear audit chain for any data processing that takes place after entry, with a mechanism for version control if the processing changes. |
| Design time | Representation | Data are provided in a form that is unambiguous to researchers. |
| Design time | Concordance | For physiological data, the methods of measurement and units are defined for all sites. |
| Design time | Correctness | Imaging acquisition techniques are standardised (e.g. magnetic resonance imaging). |
| Design time | Correctness | Biospecimen preparation techniques are standardised. |
| Design time | Correctness | Biospecimen assay accuracy, precision, repeatability, detection limits, quantitation limits, linearity and range are defined. Normal ranges are determined for each assay. |
| Design time | Correctness | There is automated entry of the results of biospecimen samples. |
| Training and testing | Completeness | A team of data curation experts is involved, with pre-specified initial and ongoing testing for quality assurance. |
| Run time | Completeness | Proxy responses for factual questions (such as employment status) are allowed in order to maximise completeness. |
| Run time | Representation | Automated variable transformations are documented and tested before implementation and again if modified. |
| Run time | Completeness | There is centralised monitoring of the completeness and consistency of information during data collection. |
| Run time | Plausibility | Individual data elements are checked for missingness against pre-specified skip-logic/missingness masks. This is performed throughout the study data acquisition period to give accurate “real-time” feedback on completion status. |
| Run time | Plausibility | Systematic and timely measures are in place to assure ongoing data accuracy. |
| Run time | Correctness | Source data validation procedures are in place to check for agreement between the original data and the information recorded in the database. |
| Run time | Plausibility | Reliability checks have been performed on variables that are critical to the research hypotheses, to ensure that information from multiple sources is consistent. |
| Run time | Correctness | Scoring of tests is checked, and scoring is performed automatically where possible. |
| Run time | Correctness | Data irregularities are reported back to data collectors in a systematic and timely process. There is a standard operating procedure for reporting data irregularities back to the data collectors and for documenting the resolution of each issue. |
| Run time | Representation | Known or emergent issues with the data dictionary are documented and reported in an accessible manner. |
| Post-collection | Representation | The version lock-down of the database for data entry is clearly specified. |
| Post-collection | Correctness | A plan for ongoing curation and version control is specified. |
| Post-collection | Representation | A comprehensive data dictionary is available for end users. |
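Several of the design-time indicators lend themselves to automation at data entry: typed fields with permissible ranges and units, explicit missingness codes distinguished from errors, and duplicate-participant detection. The sketch below illustrates these ideas in Python; every field definition, code value and message format is an assumption chosen for illustration, not a prescription from the guidelines:

```python
# Illustrative sketch of automatable design-time checks: typed fields
# with ranges and units, explicit missingness codes, and detection of
# duplicated participant identifiers. All definitions are assumed.

from enum import Enum

class Missing(Enum):
    """Explicit missingness codes, distinguished from data entry errors."""
    NOT_AVAILABLE = "not available"
    NOT_APPLICABLE = "not applicable"
    NOT_COLLECTED = "not collected"
    UNKNOWN = "unknown"

# Permissible values and units specified at data entry (assumed definitions)
FIELD_SPECS = {
    "age": {"type": int, "range": (0, 120), "unit": "years"},
    "heart_rate": {"type": int, "range": (20, 250), "unit": "beats/min"},
}

def check_entry(field: str, value) -> list:
    """Return a list of query messages for a single data entry."""
    if isinstance(value, Missing):
        return []  # explicitly coded missingness is not an error
    spec = FIELD_SPECS[field]
    errors = []
    if not isinstance(value, spec["type"]):
        errors.append(f"{field}: expected {spec['type'].__name__}")
    else:
        lo, hi = spec["range"]
        if not (lo <= value <= hi):
            errors.append(f"{field}: {value} outside {lo}-{hi} {spec['unit']}")
    return errors

def duplicate_ids(records: list) -> set:
    """Participant identifiers that appear more than once in the data set."""
    seen, dups = set(), set()
    for r in records:
        pid = r["participant_id"]
        (dups if pid in seen else seen).add(pid)
    return dups

print(check_entry("heart_rate", 400))             # flags an out-of-range value
print(check_entry("age", Missing.NOT_COLLECTED))  # [] (coded, not an error)
print(duplicate_ids([{"participant_id": "P1"}, {"participant_id": "P1"}]))
```

In a real study these checks would run inside the electronic data capture system, with overrides of the generated queries documented in the database as the completeness indicators require.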