Thomas Nind1, James Galloway1, Gordon McAllister1, Donald Scobbie2, Wilfred Bonney1, Christopher Hall1, Leandro Tramma1, Parminder Reel1, Martin Groves1, Philip Appleby1, Alex Doney1, Bruce Guthrie1, Emily Jefferson1.
Abstract
Background: The Health Informatics Centre at the University of Dundee provides a service to securely host clinical datasets and extract relevant data for anonymized cohorts, enabling researchers to answer key research questions. As is common in research using routine healthcare data, the service was historically delivered using ad hoc processes, resulting in the slow provision of data whose provenance was often hidden from the researchers using it. This paper describes the development and evaluation of the Research Data Management Platform (RDMP): an open source tool to load, manage, clean, and curate longitudinal healthcare data for research and to provide reproducible and updateable datasets for defined cohorts to researchers.
Year: 2018 PMID: 29790950 PMCID: PMC6041881 DOI: 10.1093/gigascience/giy060
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1:Integrated data management lifecycle.
Figure 2:High-level architecture of RDMP.
High-level architecture components.
| The Catalogue contains a complete inventory of every dataset held in a given data repository, including: a high-level description of each dataset; column-level descriptions of data items; an inventory of validation rules and data transformations; export rules; outstanding dataset issues; supporting documentation; lookup information; and anonymisation rules. It utilizes the |
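The catalogue inventory described above can be pictured as a simple data model. The following is an illustrative Python sketch only; RDMP is a separate implementation, and every class, field, and example value here is assumed for illustration rather than taken from its API:

```python
from dataclasses import dataclass, field

@dataclass
class ColumnInfo:
    """Column-level catalogue metadata: description plus validation/anonymisation flags."""
    name: str
    description: str
    validation_rules: list[str] = field(default_factory=list)
    is_identifiable: bool = False  # flagged for anonymisation on extract

@dataclass
class CatalogueEntry:
    """Dataset-level catalogue metadata: description, columns, issues, documentation."""
    dataset_name: str
    description: str
    columns: list[ColumnInfo] = field(default_factory=list)
    outstanding_issues: list[str] = field(default_factory=list)
    supporting_docs: list[str] = field(default_factory=list)

# A toy catalogue with one dataset and one open issue.
catalogue = [
    CatalogueEntry(
        "Prescribing",
        "All dispensed prescriptions in primary care",
        columns=[ColumnInfo("bnf_code", "British National Formulary code",
                            validation_rules=["matches BNF lookup"])],
        outstanding_issues=["BNF lookup table needs refresh"],
    )
]

# Simple inventory query: which datasets have unresolved issues?
flagged = [c.dataset_name for c in catalogue if c.outstanding_issues]
```

Keeping the inventory queryable like this is what allows issues and transforms to be tracked per dataset rather than in ad hoc documentation.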
| Establishes a single platform for data loading; manages remote data sources; loads data from structured and unstructured local sources; and includes reference data management for lookup-based validation rules and condition-based searches as stored in the data catalogue. The process has a logging architecture that stores comprehensive data-load details, including row-level inserts and updates, archive locations, message-digest (MD5) checksums of load files, the user who performed the load, any fatal errors, etc. The process also lets users see which datasets have received loads and whether each load succeeded or failed, and translates the structured |
| This process is concerned with keeping the catalogue up to date, monitoring dataset issues, and populating metadata for new datasets. The process is not unique to the Root DMN: researchers are expected to keep their own copy of the catalogue up to date and to provide feedback on new issues and transforms as they discover them. The catalogue management process captures useful contributions from researchers and integrates them into the Root DMN Catalogue, ensuring they are circulated amongst the entire research community. |
| This process is the core quality control function in the RDMP design. The process is focused on the development of data profiling and data quality assessment tools to monitor and report on the quality of the HIC-managed datasets, in terms of accessibility, access security, accuracy, completeness, consistency, relevancy, timeliness, and uniqueness. |
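Two of the quality dimensions named above, completeness and uniqueness, lend themselves to a direct computation per dataset. The sketch below is an assumed, simplified profiler in Python, not the actual RDMP quality tooling:

```python
def profile_quality(rows: list[dict], key_field: str) -> dict:
    """Profile completeness (non-null rate per field) and uniqueness of a key field
    for a batch of records, as a minimal data-quality report."""
    if not rows:
        return {"completeness": {}, "uniqueness": 0.0}
    fields = rows[0].keys()
    completeness = {
        f: sum(1 for r in rows if r.get(f) not in (None, "")) / len(rows)
        for f in fields
    }
    keys = [r.get(key_field) for r in rows]
    uniqueness = len(set(keys)) / len(keys)  # 1.0 means no duplicate keys
    return {"completeness": completeness, "uniqueness": uniqueness}
```

Run routinely after each load, metrics like these give the per-dataset quality trend that the monitoring and reporting described above depends on.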
| This process creates summary layer aggregates for the data repository and data marts. The process creates discovery metadata through automated feature extraction and aggregation, generating what is essentially query optimisation metadata for the repository. It enables dataset discovery, dataset exploration, report generation, and cohort prospecting and generation. |
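A summary-layer aggregate of the kind described can be as simple as category counts over one field, computed once and queried many times. The following Python sketch is illustrative only; the function name and sample records are assumptions:

```python
from collections import Counter

def summary_counts(rows: list[dict], group_field: str) -> Counter:
    """Summary-layer aggregate: record counts per category of one field.
    Aggregates like this support dataset discovery and cohort prospecting
    without exposing row-level data."""
    return Counter(r[group_field] for r in rows)

# Toy example: prospecting for stroke admissions by diagnosis code.
admissions = [
    {"icd10": "I63", "year": 2017},
    {"icd10": "I63", "year": 2018},
    {"icd10": "I21", "year": 2018},
]
by_diagnosis = summary_counts(admissions, "icd10")
```

Because only counts leave the repository, a researcher can gauge whether a planned cohort is feasible before any extract is requested.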
| The data extraction process provides a structured means of versioning and releasing cohort-based datasets to researchers. In HIC's case, the release is often into a secure virtual “Safe Haven” environment where researchers can analyse the data and export only aggregate-level results. However, provided data controllers allow it, the RDMP software can also release data directly to researchers for analysis within other environments. |
| The data release process involves: auditing of the data extraction (e.g., rows created, time started, any crash messages); retrieval and extraction of any global metadata documents specified in the Catalogue; sending dynamic SQL queries, created by the |
Figure 3: Comparisons of efficiency and errors from using the RDMP tool. A data release is a process where relevant data are linked for a specific cohort and an extract of data is provided for a research project. Fig. 3A: Hours spent on different activities per data release. Fig. 3B: Cumulative number of projects, number of data releases for the period in which results were captured, normalized number of data releases estimated for whole years, and the cumulative number of data releases. Fig. 3C: Proportion of data releases of different types. Data releases were categorized as First (first planned release for a new project), Refresh (planned re-release to an existing project with no changes other than including data newly accrued over time), HIC Error (release to fix errors in a previous release caused by HIC making a mistake in interpreting the data specification), Researcher Error (release to fix errors in a previous release caused by the research team making a mistake in the data specification), and Change Request (release including additional data fields requested by the research team after initial analysis of a release that was correctly aligned to the data specification). Fig. 3D: Mean number of data releases per month, mean number of data releases per month per FTE, and mean number of data releases per project.
Figure 4:Efficiency of performing data releases. The dark line in the middle of the boxes is the median. The bottom of the box indicates the 25th percentile. The top of the box represents the 75th percentile. The whiskers extend to 1.5 times the height of the box. The points are outliers, circles being outliers lying between 1.5 and 3 times the height of the box, asterisks being extreme outliers lying >3 times the height of the box.
| Dataset | Description | Type of Data Stored |
|---|---|---|
| Accident and Emergency (A&E) | Accident and emergency data | Structured and noncoded data |
| Echocardiogram (ECHO) | Cardiology echocardiographic data | Structured and noncoded data |
| General Registry Office (GRO) | Official death certification data | Structured and coded data (i.e., ICD-9/10) |
| Laboratory | Laboratory data, comprising biochemistry, haematology, immunology, microbiology, and virology reports | Structured and coded data (i.e., Read codes) |
| Master Community Health Index (CHI) | Demographic data including postcode of residence, General Practice registration, and date of birth/death | Structured and coded data (i.e., CHI numbers, postcodes and health boards) |
| Prescribing | All dispensed prescriptions for prescribed medications in primary care | Structured and coded data (i.e., British National Formulary (BNF)) |
| Renal Register | Dialysis and transplant data | Structured and noncoded data |
| SMR00 | Scottish national hospital data for outpatient clinics | Structured and coded data (i.e., specialty codes with occasional use of ICD-10) |
| SMR01 | Scottish national hospital data for inpatient admissions | Structured and coded data (i.e., ICD-9/10 and OPCS-3/4) |
| SMR02 | Scottish national hospital data for maternity admissions | Structured and noncoded data |
| SMR04 | Scottish national hospital data for psychiatric admissions and day cases | Structured and coded data (i.e., ICD-10) |
| SMR06 | Scottish national hospital data for cancer registration | Structured and coded data (i.e., ICD-10) |
| Stroke | All stroke admissions to the Ninewells Hospital Acute Stroke Unit | Structured and noncoded, but diagnoses are mapped to ICD-10 |
| Vascular Laboratories | Duplex vascular ultrasound of carotids and lower extremities | Structured and noncoded data |
SMR: Scottish Morbidity Records.