Thomas Nind1, James Galloway1, Gordon McAllister1, Donald Scobbie2, Wilfred Bonney1, Christopher Hall1, Leandro Tramma1, Parminder Reel1, Martin Groves1, Philip Appleby1, Alex Doney1, Bruce Guthrie1, Emily Jefferson1.
Abstract
Background: The Health Informatics Centre at the University of Dundee provides a service to securely host clinical datasets and extract relevant data for anonymized cohorts, enabling researchers to answer key research questions. As is common in research using routine healthcare data, the service was historically delivered using ad hoc processes, resulting in the slow provision of data whose provenance was often hidden from the researchers using it. This paper describes the development and evaluation of the Research Data Management Platform (RDMP): an open source tool to load, manage, clean, and curate longitudinal healthcare data for research and to provide reproducible and updateable datasets for defined cohorts to researchers.
Year: 2018 PMID: 29790950 PMCID: PMC6041881 DOI: 10.1093/gigascience/giy060
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1:Integrated data management lifecycle.
Figure 2:High-level architecture of RDMP.
High-level architecture components.
| The Catalogue contains a complete inventory of every dataset held in a given data repository, including: a high-level description of each dataset; column-level descriptions of data items; an inventory of validation rules and data transformations; export rules; outstanding dataset issues; supporting documentation; lookup information; and anonymisation rules. It utilizes the |
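The catalogue inventory described above can be pictured as a simple data model. The following is an illustrative Python sketch only; RDMP is a separate implementation, and every class, field, and example value here is assumed for illustration rather than taken from its API:

```python
from dataclasses import dataclass, field

@dataclass
class ColumnInfo:
    """Column-level catalogue metadata: description plus validation/anonymisation flags."""
    name: str
    description: str
    validation_rules: list[str] = field(default_factory=list)
    is_identifiable: bool = False  # flagged for anonymisation on extract

@dataclass
class CatalogueEntry:
    """Dataset-level catalogue metadata: description, columns, issues, documentation."""
    dataset_name: str
    description: str
    columns: list[ColumnInfo] = field(default_factory=list)
    outstanding_issues: list[str] = field(default_factory=list)
    supporting_docs: list[str] = field(default_factory=list)

# A toy catalogue with one dataset and one open issue.
catalogue = [
    CatalogueEntry(
        "Prescribing",
        "All dispensed prescriptions in primary care",
        columns=[ColumnInfo("bnf_code", "British National Formulary code",
                            validation_rules=["matches BNF lookup"])],
        outstanding_issues=["BNF lookup table needs refresh"],
    )
]

# Simple inventory query: which datasets have unresolved issues?
flagged = [c.dataset_name for c in catalogue if c.outstanding_issues]
```

Keeping the inventory queryable like this is what allows issues and transforms to be tracked per dataset rather than in ad hoc documentation.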
| Establishes a single platform for data loading; manages remote data sources; loads data from structured and unstructured local sources; and includes reference data management for lookup-based validation rules and condition-based searches as stored in the data catalogue. The process has a logging architecture that stores comprehensive data-load details, including row-level inserts and updates, archive locations, message-digest (MD5) checksums of load files, the user who performed the load, any fatal errors, etc. The process also lets users see which datasets have received loads and whether each load succeeded or failed, and translates the structured |
| This process is concerned with keeping the catalogue up to date, monitoring dataset issues, and populating metadata for new datasets. The process is not unique to the Root DMN: researchers are expected to keep their own copy of the catalogue up to date and to provide feedback on new issues and transforms as they discover them. The catalogue management process captures useful contributions from researchers and integrates them into the Root DMN Catalogue, ensuring they are circulated amongst the entire research community. |
| This process is the core quality control function in the RDMP design. The process is focused on the development of data profiling and data quality assessment tools to monitor and report on the quality of the HIC-managed datasets, in terms of accessibility, access security, accuracy, completeness, consistency, relevancy, timeliness, and uniqueness. |
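Two of the quality dimensions named above, completeness and uniqueness, lend themselves to a direct computation per dataset. The sketch below is an assumed, simplified profiler in Python, not the actual RDMP quality tooling:

```python
def profile_quality(rows: list[dict], key_field: str) -> dict:
    """Profile completeness (non-null rate per field) and uniqueness of a key field
    for a batch of records, as a minimal data-quality report."""
    if not rows:
        return {"completeness": {}, "uniqueness": 0.0}
    fields = rows[0].keys()
    completeness = {
        f: sum(1 for r in rows if r.get(f) not in (None, "")) / len(rows)
        for f in fields
    }
    keys = [r.get(key_field) for r in rows]
    uniqueness = len(set(keys)) / len(keys)  # 1.0 means no duplicate keys
    return {"completeness": completeness, "uniqueness": uniqueness}
```

Run routinely after each load, metrics like these give the per-dataset quality trend that the monitoring and reporting described above depends on.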
| This process creates summary layer aggregates for the data repository and data marts. The process creates discovery metadata through automated feature extraction and aggregation, generating what is essentially query optimisation metadata for the repository. It enables dataset discovery, dataset exploration, report generation, and cohort prospecting and generation. |
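A summary-layer aggregate of the kind described can be as simple as category counts over one field, computed once and queried many times. The following Python sketch is illustrative only; the function name and sample records are assumptions:

```python
from collections import Counter

def summary_counts(rows: list[dict], group_field: str) -> Counter:
    """Summary-layer aggregate: record counts per category of one field.
    Aggregates like this support dataset discovery and cohort prospecting
    without exposing row-level data."""
    return Counter(r[group_field] for r in rows)

# Toy example: prospecting for stroke admissions by diagnosis code.
admissions = [
    {"icd10": "I63", "year": 2017},
    {"icd10": "I63", "year": 2018},
    {"icd10": "I21", "year": 2018},
]
by_diagnosis = summary_counts(admissions, "icd10")
```

Because only counts leave the repository, a researcher can gauge whether a planned cohort is feasible before any extract is requested.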
| The data extraction process provides a structured means of versioning and releasing cohort-based datasets to researchers. In HIC's case, the release is often into a secure virtual “Safe Haven” environment where researchers can analyse the data and export only aggregate-level results. However, provided data controllers allow it, the RDMP software can also release data directly to researchers for analysis within other environments. |
| The data release process involves: auditing of the data extraction (e.g., rows created, time started, any crash messages); retrieval and extraction of any global metadata documents specified in the Catalogue; sending dynamic SQL queries, created by the |
Figure 3: Comparisons of efficiency and errors from using the RDMP tool. A data release is a process where relevant data are linked for a specific cohort and an extract of data is provided for a research project. Fig. 3A: Hours spent on different activities per data release. Fig. 3B: Cumulative number of projects, number of data releases for the period in which results were captured, normalized number of data releases estimated for whole years, and the cumulative number of data releases. Fig. 3C: Proportion of data releases of different types. Data releases were categorized as First (first planned release for a new project), Refresh (planned re-release to an existing project with no changes other than including data newly accrued over time), HIC Error (release to fix errors in a previous release caused by HIC making a mistake in interpreting the data specification), Researcher Error (release to fix errors in a previous release caused by the research team making a mistake in the data specification), and Change Request (release including additional data fields requested by the research team after initial analysis of a release that was correctly aligned to the data specification). Fig. 3D: Mean number of data releases per month, mean number of data releases per month per FTE, and mean number of data releases per project.
Figure 4:Efficiency of performing data releases. The dark line in the middle of the boxes is the median. The bottom of the box indicates the 25th percentile. The top of the box represents the 75th percentile. The whiskers extend to 1.5 times the height of the box. The points are outliers, circles being outliers lying between 1.5 and 3 times the height of the box, asterisks being extreme outliers lying >3 times the height of the box.
| Dataset | Description | Type of Data Stored |
|---|---|---|
| Accident and Emergency (A&E) | Accident and emergency data | Structured and noncoded data |
| Echocardiogram (ECHO) | Cardiology echocardiographic data | Structured and noncoded data |
| General Registry Office (GRO) | Official death certification data | Structured and coded data (i.e., ICD-9/10) |
| Laboratory | Laboratory data, comprising biochemistry, haematology, immunology, microbiology, and virology reports | Structured and coded data (i.e., Read codes) |
| Master Community Health Index (CHI) | Demographic data including postcode of residence, General Practice registration, and date of birth/death | Structured and coded data (i.e., CHI numbers, postcodes and health boards) |
| Prescribing | All dispensed prescriptions for prescribed medications in primary care | Structured and coded data (i.e., British National Formulary (BNF)) |
| Renal Register | Dialysis and transplant data | Structured and noncoded data |
| SMR00 | Scottish national hospital data for outpatient clinics | Structured and coded data (i.e., specialty codes with occasional use of ICD-10) |
| SMR01 | Scottish national hospital data for inpatient admissions | Structured and coded data (i.e., ICD-9/10 and OPCS-3/4) |
| SMR02 | Scottish national hospital data for maternity admissions | Structured and noncoded data |
| SMR04 | Scottish national hospital data for psychiatric admissions and day cases | Structured and coded data (i.e., ICD-10) |
| SMR06 | Scottish national hospital data for cancer registration | Structured and coded data (i.e., ICD-10) |
| Stroke | All stroke admissions to the Ninewells Hospital Acute Stroke Unit | Structured and noncoded, but diagnoses are mapped to ICD-10 |
| Vascular Laboratories | Duplex vascular ultrasound of carotids and lower extremities | Structured and noncoded data |
SMR: Scottish Morbidity Records.