| Literature DB >> 32617581 |
Oliver W Butters1,2, Rebecca C Wilson1,2, Paul R Burton3.
Abstract
Good data curation is integral to cohort studies, but it is not always done to a level necessary to ensure the longevity of the data a study holds. In this opinion paper, we introduce the concept of data curation debt: the data curation equivalent of the software engineering principle of technical debt. Using the context of UK cohort studies, we define data curation debt, describing examples and their potential impact. We highlight that accruing this debt can make it more difficult to use the data in the future. Additionally, the long-running nature of cohort studies means that interest is accrued on this debt and compounded over time, increasing the impact a debt could have on a study and its stakeholders. Primary causes of data curation debt are discussed across three categories: longevity of hardware, software and data formats; funding; and skills shortages. Based on cross-domain best practice, strategies to reduce the debt and preventive measures are proposed, with importance given to the recognition and transparent reporting of data curation debt. Describing the debt in this way, we encapsulate a multi-faceted issue in simple terms understandable by all cohort study stakeholders. Data curation debt is not confined to the UK; it is an issue the international community must be aware of and address. This paper aims to stimulate a discussion between cohort studies and their stakeholders on how to address the issue of data curation debt. If data curation debt is left unchecked it could become impossible to use highly valued cohort study data, and it ultimately represents an existential risk to studies themselves.
Keywords: Data curation; cohort studies; data management
Year: 2020 PMID: 32617581 PMCID: PMC7660145 DOI: 10.1093/ije/dyaa087
Source DB: PubMed Journal: Int J Epidemiol ISSN: 0300-5771 Impact factor: 7.196
Examples of data curation debt, the impact these may have on a cohort study or the ability to use the data, and possible solutions and preventive measures
| Data curation debt | Impact | Possible solution and preventive measures |
|---|---|---|
| Data stored on legacy hardware (e.g. old unmaintained servers) | Legacy systems can break at any time and data can be irretrievably lost | Migrate data to maintained hardware and regularly review the hardware lifecycle |
| Raw data stored on decaying physical media (e.g. drawings, consent etc. done on paper; interviews, behavioural tasks etc. stored on CDs/DVDs, MiniDisc, VHS, memory sticks, external hard drives etc.) | This type of information will often have been either completely transcribed or abstracted, but if the physical medium decays then the raw data are lost. This makes it impossible to reproduce the original research or apply new techniques to analyse the original raw data. In the case of paper consent decaying, this can lead to legal issues | Digitize raw data where appropriate. Move already digital data on external devices to maintained storage. Record in the study data asset register the date by which physical media are expected to have decayed |
| Data stored in obsolete/proprietary data formats | Once data are in an obsolete or proprietary format they become difficult to use and process. Format conversion is required before such data can be transferred to researchers. This may not be possible or feasible, and the data may effectively be lost | Migrate data to open formats. Regularly review formats used and assess their continued use. Prioritize the use of open formats in new data collections |
| Data not properly structured (e.g. messy file trees with no indication of completeness) | A data manager may have to make a best guess as to what historical data are and how they should be structured in order to give them to researchers. The risk is that the data may be misinterpreted, be incomplete, not the latest version etc. | Restructure the data as soon as possible. Develop policy to ensure new datasets are properly structured |
| Little or no documentation or provenance information | Retrospectively writing documentation is inherently error prone. A data manager may have to make a best guess as to where the data came from and how they have been edited | Write documentation as soon as possible; the longer it is left, the more organizational memory will disappear. Develop policy to ensure new datasets have documentation when created |
| Changing good data management practices not retrospectively applied (e.g. updated disclosure control methodologies) | Data which have been released with old disclosure control methodologies applied become more of a risk as re-identification techniques and technology evolve, and as other datasets become available | Regularly assess the risk to the study of not applying new data management practices retrospectively |
| Carelessly aggregating data from third parties (e.g. social media) | As third-party data providers change their software (e.g. APIs) and update their terms and conditions, the underlying data model may change. This can lead to extra data being erroneously collected for which explicit consent is not present, and may also affect the ability to reproduce a dataset | Where there is any doubt about the consent status of third-party data, the data must be deleted. When collecting new data from third parties, the relevant metadata (covering, for example, terms and conditions) need to be kept with the data |
This list is not intended to be exhaustive.