Snehil Gupta, Connie Zabarovskaya, Brian Romine, Daniel A Vianello, Cynthia Hudson Vitale, Leslie D McIntosh.
Abstract
Research data is a dynamic and evolving entity, and the ability to cite such data depends on recreating the same datasets used in the original research. Despite the availability of several existing technologies, most data repositories lack the setup needed to recreate a point-in-time snapshot of data, let alone to sustain dynamic data over the long term, without restoring an entire database. In this project, we adopted a subset of the Research Data Alliance data citation working group recommendations to establish a robust informatics system that supports dynamic data and its use for reproducible research within our evolving clinical data repository. We implemented key recommendations (data versioning, time-stamping, query storing, query time-stamping, query PID, and data citation) in one data repository, entirely at the database level, and successfully reproduced a previous dataset as it existed at a specific point in time using only the PID provided in a citation.
Year: 2017 PMID: 28815122 PMCID: PMC5543373
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Research Data Alliance (RDA) working group on data citation recommendations for citing data in evolving datasets
| ID | Concept | Description |
|---|---|---|
| **R1** | **Data Versioning** | **Apply versioning to ensure retrieval of earlier states of datasets.** |
| **R2** | **Timestamping** | **Ensure that operations on data are timestamped.** |
| **R3** | **Query Store Facilities** | **Store the queries used to select data and associated metadata.** |
| R4 | Query Uniqueness | Rewrite the query to a normalized form so that identical queries can be detected. |
| R5 | Stable Sorting | Ensure an unambiguous sorting of the records in the data set. |
| R6 | Result Set Verification | Compute a checksum of the query result set to enable verification of the correctness of a result upon re-execution. |
| **R7** | **Query Timestamping** | **Assign a timestamp to the query based on the last update to the entire database (or the last update to the selection of data affected by the query, or the query execution time).** |
| **R8** | **Query PID** | **Assign a new persistent identifier (PID) to the query, or reuse an existing PID.** |
| **R9** | **Store Query** | **Store query and metadata in the query store.** |
| **R10** | **Citation Texts** | **Provide a recommended citation text and the PID to the user.** |
| R11 | Landing Page | Make the PIDs resolve to a human readable landing page. |
Bold items are ones included in this paper.
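The recommendations above can be illustrated with a small sketch. It is not the authors' implementation, which the abstract describes only as being done at the database level; it assumes an in-memory SQLite database, and the table names (`patient_version`, `query_store`), column names, PID format, and citation wording are all hypothetical. Each row version carries a validity interval (R1, R2), the query is stored with its timestamp under a PID (R3, R7, R8, R9), and the PID alone is enough to re-run the query against the data as it existed at that time.

```python
import sqlite3
import uuid

# Hypothetical schema: one row per *version* of a record, plus a query store.
SCHEMA = """
CREATE TABLE patient_version (
    patient_id INTEGER NOT NULL,
    diagnosis  TEXT    NOT NULL,
    valid_from TEXT    NOT NULL,  -- R2: when this version became current
    valid_to   TEXT               -- NULL while this version is still current
);
CREATE TABLE query_store (
    pid         TEXT PRIMARY KEY, -- R8: persistent identifier of the query
    query_text  TEXT NOT NULL,    -- R3/R9: the stored query
    executed_at TEXT NOT NULL     -- R7: timestamp assigned to the query
);
"""

# The query filters on :ts, so it only sees versions valid at that time.
# The fixed ORDER BY keeps repeated runs comparable.
COHORT_QUERY = """
SELECT patient_id, diagnosis FROM patient_version
WHERE valid_from <= :ts AND (valid_to IS NULL OR valid_to > :ts)
ORDER BY patient_id
"""


def record_version(conn, patient_id, diagnosis, ts):
    """R1: close the current version instead of overwriting it, then add a new one."""
    conn.execute(
        "UPDATE patient_version SET valid_to = ? "
        "WHERE patient_id = ? AND valid_to IS NULL",
        (ts, patient_id),
    )
    conn.execute(
        "INSERT INTO patient_version VALUES (?, ?, ?, NULL)",
        (patient_id, diagnosis, ts),
    )


def run_query(conn, query_text, ts):
    """Run a stored-style query against the data as of timestamp ts."""
    return conn.execute(query_text, {"ts": ts}).fetchall()


def store_query(conn, query_text, ts):
    """R3/R7/R8/R9: save the query and its timestamp under a new PID."""
    pid = "query:" + uuid.uuid4().hex  # hypothetical PID format
    conn.execute("INSERT INTO query_store VALUES (?, ?, ?)", (pid, query_text, ts))
    return pid


def reproduce(conn, pid):
    """Recreate the cited dataset using nothing but the PID."""
    query_text, ts = conn.execute(
        "SELECT query_text, executed_at FROM query_store WHERE pid = ?", (pid,)
    ).fetchone()
    return run_query(conn, query_text, ts)


def citation_text(pid, ts):
    """R10: a hypothetical citation string carrying the PID and timestamp."""
    return f"Clinical data repository, query {pid}, data as of {ts}."
```

Timestamps are kept as ISO 8601 strings so that plain string comparison orders them correctly. A production system would more likely rely on the database's own temporal or auditing features than on hand-written interval columns.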
Figure 1: A simplified diagram of the data flow from the clinical data source to researchers. The highlighted portions represent the focus (green solid line) and future expansions (green dotted line) of this project.
Gap Analysis Summary
| Database | Data Versioning (R1) | Data Timestamp (R2) | Query Store (R3/R9) | Query Timestamp (R7) | Query PID (R8) | Citation Text (R10) |
|---|---|---|---|---|---|---|
| | Yes (default) | Yes (default) | No | No | No | No |
| | No | No | Yes (i2b2 default) | Yes (i2b2 default) | Yes (i2b2 default) | No |
| | No | No | Yes (i2b2 default) | Yes (i2b2 default) | Yes (i2b2 default) | No |
| | No | No | No | No | No | No |
Figure 2: Formula for finding changes in data between the historical and current versions. The historical dataset is ‘A’, the current dataset is ‘B’, and ‘C’ is the difference between the two datasets, showing only the changes in data.
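The formula itself appears only in the figure and is not reproduced in the text. As a rough illustration of the idea in the caption, the sketch below treats each dataset as a set of rows and reports the rows that appear in only one of them, which is the symmetric difference of A and B. This choice of set operation, the function name `dataset_changes`, and the row layout are assumptions, not the authors' formula.

```python
def dataset_changes(historical, current):
    """Return the rows that differ between dataset A (historical) and B (current).

    Rows are compared as whole tuples, so a modified record shows up twice:
    its old form under 'only_in_historical' and its new form under
    'only_in_current'. Unchanged rows are omitted.
    """
    a, b = set(historical), set(current)
    return {
        "only_in_historical": a - b,  # removed rows, or old forms of changed rows
        "only_in_current": b - a,     # added rows, or new forms of changed rows
    }


# Hypothetical example rows: (patient_id, diagnosis)
A = [(1, "diabetes"), (2, "asthma")]
B = [(1, "hypertension"), (2, "asthma"), (3, "copd")]
C = dataset_changes(A, B)
```

Here `C` leaves out the unchanged row for patient 2 and keeps both forms of the changed row for patient 1, along with the new row for patient 3.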