A case study in distributed team science in research using electronic health records.

Jiao Song¹, Elizabeth Elliot², Andrew D Morris², Joannes J Kerssens³, Ashley Akbari¹, Simon Ellwood-Thompson¹, Ronan A Lyons¹.

Abstract

INTRODUCTION: Due to various regulatory barriers, it is increasingly difficult to move pseudonymised routine health data across platforms and among jurisdictions. To tackle this challenge, we summarised five approaches considered to support a scientific research project on the risk of the new non-vitamin K Target Specific Oral Anticoagulants (TSOACs), conducted as a collaboration between the Farr Institute centres in Wales and Scotland. APPROACH: In Wales, routinely collected health records held in the Secure Anonymous Information Linkage (SAIL) Databank were used to identify the study cohort. In Scotland, data were extracted from national dataset resources administered by the eData Research & Innovation Service (eDRIS) and stored in the Scottish National Data Safe Haven. We adopted a federated data and multiple analysts approach, but arranged simultaneous access for the Welsh and Scottish analysts so that they could generate the study cohorts separately by implementing the same algorithm. Our study cohort across the two countries was increased to 6,829 patients for risk analysis. The source datasets and data types used to generate the cohorts were reviewed and compared by analysts based at both sites to ensure consistency and a harmonised output. DISCUSSION: This project used a fusion of two of the five approaches considered. The approach we adopted is a simple, yet efficient and cost-effective method to ensure consistency in analysis and coherence with multiple governance systems. It has limitations, but also the potential to be extended and scaled. It can also be considered a first step towards an infrastructure to support a distributed team science approach to research using Electronic Health Records (EHRs) across the UK and more widely.

Keywords: Team science; cross-jurisdictional data linkage; electronic health records

Year: 2018 PMID: 34095524 PMCID: PMC8142956 DOI: 10.23889/ijpds.v3i3.442

Source DB: PubMed Journal: Int J Popul Data Sci ISSN: 2399-4908

Introduction

Health research increasingly relies on data linkage of electronic records. Data linkage techniques enable deeper analyses by merging information about the same individuals that is held in separate datasets (1,2). Health data linkage frameworks are well established in a number of countries and regions, e.g. in Australia (3), Scotland (4), England (5) and Wales (6,7,8). However, there is limited literature on the practicalities of using linked data from different centres, countries and resources for research. Many research projects require data from multiple jurisdictions to obtain sufficient power to answer scientific questions. These projects can benefit from larger sample sizes for greater statistical power, especially where exposures are few or conditions or outcomes are rare (9,10,11); from ascertaining complete patient pathways, care and outcomes; from accurate data for longitudinal studies; and from cross-jurisdictional comparisons (12). However, due to various regulatory barriers, it is increasingly difficult to move pseudonymised routine data across platforms and among jurisdictions. Challenges arising from using data from multiple jurisdictions, such as legal issues, organisational capacity, financial cost, separation of roles and data ownership, have been partially addressed in the US, Australia and European countries (13,14,15,16). An infrastructure was proposed in Australia for cross-jurisdictional health data linkage research across states to improve the quality of population research data (17). It has been implemented in various scientific studies (18,19). In the US, the current environment is characterised by budget and technical challenges, but investments in data infrastructure are arguably cost-effective (20). In 2014, a UK research collaboration, the Farr Institute of Health Informatics Research, was established, comprising four centres distributed across the UK (North of England, Wales, London and Scotland) (21).
Within the Farr Institute, we are motivated to take a further step towards an infrastructure that allows for, and supports, cross-country research within the UK and across the EU using Electronic Health Records (EHRs). A recent scientific research project, run as a collaboration between the centres in Wales and Scotland, focused on the risk of the new non-vitamin K Target Specific Oral Anticoagulants (TSOACs) in a specific group of patients. It was part of an EU project led from the Farr Institute in Scotland at Edinburgh, which the Farr Institute Wales at Swansea agreed to join in order to develop and demonstrate cross-UK data integration and analysis. This project is also in the process of including other European jurisdictions. Several recent randomised controlled trials have provided strong evidence supporting the safety and efficacy of TSOACs compared to the vitamin K antagonist warfarin for patients with atrial fibrillation (AF). TSOACs have a favourable risk–benefit profile, with significant reductions in stroke, intracranial haemorrhage (ICH) and mortality, and with major bleeding similar to that for warfarin, but increased gastrointestinal bleeding (22). However, the cost-effectiveness of TSOACs is debated (23). TSOAC antidotes for reversal of bleeding are not yet available, and the safety and efficacy of TSOACs are unclear in patients who were not included in the randomised controlled trials but who clinicians feel may benefit. The safety of TSOACs in people who have had an ICH is uncertain and requires investigation. Because these circumstances are relatively rare, this safety question needs to be addressed with data from more than one country. We considered multiple approaches to conducting projects that require multi-site analysis, such as the TSOAC study, and summarised their advantages and disadvantages in Table 1.

Table 1: Summary of the approaches considered for a 3-centre analysis

Approach	Description	Advantages	Disadvantages
1. Centralised data and analysis. Data moved from 3 centres – 1 analyst (centralised data model)	All data submitted from each site to a central site.	Analysis is easier because a single researcher is fully in control and has all the data available.	Each site must “trust” the central site; governance approval must be sought from each site and a legal contract may have to be put in place. Restrictions about data sovereignty may prevent this approach.
2. Federated data and single analyst. Data at 3 centres – 1 analyst accessing each platform then combining results	Same researcher accesses each system separately and combines outputs.	Same researcher, so same approach.	Access to separate systems, each with its own learning curve. Separate access contracts and conditions of use. Not workable if the outputs need combined individual-level analysis.
3. Federated data and multiple analysts. Data at 3 centres – 3 separate analyses, combine results	All analysis done separately by the host site, with outputs collated.	Can be done quickly as each site knows its own system.	Consistency can be hard to achieve, so more validation and process documentation are required. Not workable if the outputs need combined individual-level analysis. Dependent on the resources available at each local site.
4. Linked federated data and analysis. Data at 3 centres – 1 analyst (remote real time access model)	The sites have established interconnections. From one site the researcher can access all the required data.	Analysis is easier because the researcher is fully in control and has all the data available.	Each site must “trust” the central site; governance approval must be sought from each site and a legal contract may have to be put in place. Restrictions about data sovereignty may prevent this approach.
5. Federated data and distributed analysis. Data at 3 centres – 1 analyst directing federated queries	Using a distributed query system, the same query is issued to all sites.	Same analysis performed at each site. No data are moved, so it could suit cross-country restrictions.	Common data model required. Data need to be harmonised. More complex from an IT and governance perspective.

In this paper, we report our approach in support of the European scientific research project as a case study of cross-centre data-intensive research.

Approach

The first priority of the TSOAC study was to generate Welsh and Scottish cohorts using the same algorithm and appropriate source data from both sites respectively within a limited period.

Data location and access

In Wales, routinely collected health records held in the Secure Anonymous Information Linkage (SAIL) Databank were used to identify the study cohort and to follow up the patients in this cohort. The SAIL Databank is a safe haven holding billions of records on over 5 million living and deceased people from the population of Wales, with a complete data linkage and analysis toolset (see (2,6,7,24)). SAIL has a governance model that delivers approvals very quickly compared to similar facilities. This has been achieved through the development of the Information Governance Review Panel (IGRP), which implements NHS ethics guidance on the use of de-identified data for research. When an organisation agrees to share data, it may choose to delegate due diligence on governance and the use of its data to the IGRP. When a researcher requires those data, approval is given directly by the IGRP on behalf of the original data producers. Hence, requests for multiple datasets can be handled quickly by a single body with the delegated authority to make decisions. The IGRP consists of representatives from the British Medical Association (BMA), National Research Ethics Service (NRES), Public Health Wales NHS Trust, NHS Wales Informatics Service (NWIS), and members of the public from the Consumer Panel (1). An IGRP application with supporting information was approved for the TSOAC project. This approval gave analysts in Wales and Scotland access to de-identified data from the Patient Episode Database for Wales (PEDW), the Welsh General Practice dataset (WLGP), and the Annual District Death Extract (ADDE), also known as the Office for National Statistics (ONS) mortality dataset, all held in the SAIL Databank.

Scotland does not have a single comprehensive national data warehouse. Instead, data, under the responsibility of different data controllers, are held at both regional and national levels, and subsets (groups of variables) can be brought together when there are clear research questions that have public benefit. The Public Benefit and Privacy Panel for Health and Social Care (PBPP) is a governance structure of NHS Scotland that was established with delegated authority from NHS Scotland (NHSS) Chief Executive Officers and the Registrar General. The PBPP has a formal mandate to scrutinise any request to use NHSS-controlled data and the NHS Central Register data controlled by the Registrar General. The committee balances the benefits of undertaking research against the potential risks to individuals’ privacy. The administrative process that supports decision-making is layered to ensure that decisions are made in a timely manner. A PBPP request, supported by evidence of prior Ethics Committee approval, was granted for this collaborative TSOAC study. Analysts based in Wales then gained access to the TSOAC project in the Scottish National Data Safe Haven. Source data were extracted from national dataset resources administered by the eData Research & Innovation Service (eDRIS) (25).

Cohort generation

Based on the access granted and the existing governance systems of each safe haven, we initially considered approaches 2 and 3 described in Table 1. The challenge was then how to tackle the remaining disadvantages of both approaches, i.e. harmonisation of analysis strategies between multiple analysts and long learning curves. In our experience, multi-site replication of research can be time consuming, as descriptions of how variables were generated are often incomplete and a substantial number of iterations is required. To shorten this step, we arranged for the Swansea and Edinburgh analysts to have simultaneous access to each other’s data (a fusion of approaches 2 and 3), which allowed for real-time viewing, creation of analytical code and live discussion of how to tackle these practical challenges. Real-time communication across multiple screens effectively shortened the learning curves of both analysts.

To construct the study cohort, the first step was to identify patients who had experienced a non-traumatic intracranial haemorrhage, as recorded in hospital admission data between 30/08/2013 and 30/06/2015. In Wales, the source dataset used was PEDW, and in Scotland, the General/Acute Inpatient and Day Case (SMR01) dataset. Analyses of the data in both centres used International Classification of Diseases, 10th Revision (ICD-10) diagnostic codes. Linkage to community prescription data identified those patients who were subsequently prescribed an anticoagulant during the 90 days after an index hospital discharge for non-traumatic intracranial haemorrhage. The datasets used to identify anticoagulant prescriptions were the WLGP dataset in Wales and the Prescribing Information System (PIS) in Scotland. The named anticoagulants studied were the TSOACs (rivaroxaban, dabigatran and apixaban) and warfarin (included as reference). Subsequent mortality was identified through linkage to national mortality records, which provide the date and underlying cause of death within the follow-up period (the Welsh ADDE dataset and the National Records of Scotland (NRS) death records). The primary outcome was the first subsequent hospital admission for a Serious Vascular Event (SVE), including ischaemic stroke, systemic embolism, intracranial haemorrhage or extracranial haemorrhage, within 1 year (the follow-up period) of the index discharge date. All available individual-level Scottish data for the case study are held in the Scottish National Data Safe Haven, with the remaining (Welsh) study data held in the SAIL Databank. The cohort generation algorithm is summarised in Figure 1.

Figure 1: Cohort generation algorithm
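The first two steps of the cohort generation described above (index admission, then anticoagulant exposure within 90 days of discharge) can be sketched in Python. This is an illustrative sketch, not the study's own code: the record classes, the function names, the ICD-10 prefixes I60 to I62, the use of the admission date for the study window, and the choice of the earliest qualifying admission as the index event are all assumptions. Follow-up for SVE outcomes and mortality is omitted.

```python
from dataclasses import dataclass
from datetime import date

# Study window taken from the text; applying it to the admission date is an assumption.
WINDOW_START = date(2013, 8, 30)
WINDOW_END = date(2015, 6, 30)
# Assumed ICD-10 prefixes for non-traumatic intracranial haemorrhage (not listed in the paper).
ICH_PREFIXES = ("I60", "I61", "I62")
# Anticoagulants named in the text.
ANTICOAGULANTS = {"rivaroxaban", "dabigatran", "apixaban", "warfarin"}


@dataclass
class Admission:  # hypothetical record layout
    patient_id: str
    admitted: date
    discharged: date
    icd10: str


@dataclass
class Prescription:  # hypothetical record layout
    patient_id: str
    prescribed: date
    drug: str


def index_admissions(admissions):
    """Return each patient's earliest qualifying ICH admission in the window."""
    index = {}
    for adm in sorted(admissions, key=lambda a: a.admitted):
        in_window = WINDOW_START <= adm.admitted <= WINDOW_END
        if in_window and adm.icd10.startswith(ICH_PREFIXES):
            index.setdefault(adm.patient_id, adm)  # keep the first one only
    return index


def flag_exposure(admissions, prescriptions, exposure_days=90):
    """Flag the first study anticoagulant prescribed within 90 days after index discharge."""
    cohort = {
        pid: {"index_discharge": adm.discharged, "drug": None}
        for pid, adm in index_admissions(admissions).items()
    }
    for rx in sorted(prescriptions, key=lambda p: p.prescribed):
        entry = cohort.get(rx.patient_id)
        if entry is None or entry["drug"] is not None:
            continue  # not in the cohort, or exposure already recorded
        days_after = (rx.prescribed - entry["index_discharge"]).days
        if rx.drug in ANTICOAGULANTS and 0 <= days_after <= exposure_days:
            entry["drug"] = rx.drug
    return cohort
```

Because the same function would be run unchanged at each site, any site-specific differences would have to be confined to the step that loads records into the common layout.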

The Welsh and Scottish cohorts consisted of 2,676 and 4,153 patients respectively (Table 2).

Table 2: Study cohorts

Gender	Welsh cohort	Scottish cohort	Total
Male	1,347	1,938	3,285
Female	1,329	2,215	3,544
Total	2,676	4,153	6,829

By applying the same cohort generation algorithm in both countries, the combined study cohort reached 6,829 patients for risk analysis.
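As a small illustration of combining outputs from federated sites, the sketch below pools the site-level counts reported in Table 2. Only aggregate counts are combined, not patient-level rows. The function name and the dictionary layout are hypothetical.

```python
# Aggregate counts as reported in Table 2 (no individual-level data involved).
site_counts = {
    "Wales": {"Male": 1347, "Female": 1329},
    "Scotland": {"Male": 1938, "Female": 2215},
}


def pool_counts(counts_by_site):
    """Sum per-category counts across sites and append an overall total."""
    pooled = {}
    for counts in counts_by_site.values():
        for category, n in counts.items():
            pooled[category] = pooled.get(category, 0) + n
    pooled["Total"] = sum(pooled.values())
    return pooled


pooled = pool_counts(site_counts)
print(pooled)  # {'Male': 3285, 'Female': 3544, 'Total': 6829}
```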

Learning outcomes

Tables 3 and 4 present the differences in variables and their definitions between the two systems.

Table 3: Variables comparison between Welsh and Scottish data

Field	Welsh variable name	Welsh source data	Scottish variable name	Scottish source data
Patient identity & linkage field	ALF_E	PEDW	UPI_NUMBER	SMR01
Admission date	ADMIS_DT	PEDW	ADMISSION_DATE	SMR01
Admission methods	ADMIS_MTHD_CD	PEDW	ADMISSION_TYPE	SMR01
Discharge types	DISCH_MTHD_CD	PEDW	DISCHARGE_TYPE	SMR01
Drugs prescription	EVENT_CD	WLGP	BNFItemcode	PIS
Date of prescription	EVENT_DT	WLGP	PRES_DATE	PIS
Date of birth	WOB	ADDE	DATE_OF_BIRTH	NRS
Gender	GNDR_CD	ADDE	SEX	NRS
Deprivation quintile	WIMD2011_5TH	PEDW	SIMD_QUINTILE	SMR01
Primary cause of death	DEATHCAUSE_DIAG_UNDERLYING_CD	ADDE	CAUSE_OF_DEATH_CODE	NRS
Date of death	DOD	ADDE	DATE_OF_DEATH	NRS

Table 4: Different data definitions in Wales and Scotland

Item	Code or attribute	Welsh data	Scottish data
Gender	0	N/A	Not known (i.e. indeterminate sex, includes intersex)
Gender	1	Male	Male
Gender	2	Female	Female
Gender	8	Not specified	N/A
Gender	9	N/A	Not specified (includes not stated by patient, or not recorded)
Date	Date format	YYYY-MM-DD	DDMMYY
Drug	Drug information	EVENT_CD: READ codes, e.g. bs74.	British National Formulary (BNF) drug codes, e.g. BNF item code 0601011A0BBADAC

Adopting a fusion of approaches 2 and 3 enabled real-time viewing, editing and communication. The source datasets and data types used to generate the Welsh and Scottish cohorts were reviewed and compared by analysts based at both the Swansea and Edinburgh sites to ensure consistency. Elements that needed to be handled differently in the two safe havens were investigated and solutions identified. Based on the results of these investigations, an R script (see Appendix 1) was written for this study to manipulate the datasets in both the Welsh and Scottish safe havens and produce a harmonised output that, given access and resources, is suitable for combination to answer the research questions.
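The kind of harmonisation implied by Tables 3 and 4 can be sketched as follows. This is not the R script from Appendix 1; it is a minimal Python illustration that uses the variable names, gender codes and date formats listed in the tables. The common field names (`patient_id`, `admission_date`, `sex`), the function name, and the treatment of each site's fields as one flat record are assumptions made for the example.

```python
from datetime import datetime

# Site-specific variable names from Table 3 mapped to assumed common names.
FIELD_MAP = {
    "wales": {"ALF_E": "patient_id", "ADMIS_DT": "admission_date", "GNDR_CD": "sex"},
    "scotland": {"UPI_NUMBER": "patient_id", "ADMISSION_DATE": "admission_date", "SEX": "sex"},
}
# Date formats from Table 4: YYYY-MM-DD in Wales, DDMMYY in Scotland.
# Note: Python reads two-digit years 00-68 as 2000-2068 and 69-99 as 1969-1999.
DATE_FORMAT = {"wales": "%Y-%m-%d", "scotland": "%d%m%y"}
# Gender codes from Table 4, mapped to shared labels.
SEX_LABELS = {
    "wales": {"1": "male", "2": "female", "8": "not specified"},
    "scotland": {"0": "not known", "1": "male", "2": "female", "9": "not specified"},
}


def harmonise(record, site):
    """Rename fields, recode gender and normalise dates to ISO format for one record."""
    out = {}
    for source_name, value in record.items():
        target = FIELD_MAP[site].get(source_name)
        if target is None:
            continue  # field is not part of the common output
        if target == "admission_date":
            value = datetime.strptime(value, DATE_FORMAT[site]).date().isoformat()
        elif target == "sex":
            value = SEX_LABELS[site][str(value)]  # unexpected codes raise KeyError
        out[target] = value
    return out


welsh = harmonise({"ALF_E": "A1", "ADMIS_DT": "2014-01-10", "GNDR_CD": 2}, "wales")
scottish = harmonise({"UPI_NUMBER": "S9", "ADMISSION_DATE": "100114", "SEX": "2"}, "scotland")
```

Raising an error on an unmapped gender code, rather than silently assigning a default, is a deliberate choice in this sketch: it surfaces coding differences between sites during review instead of hiding them in the combined output.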

Discussion

After considering the options, this project adopted a fusion of approaches 2 and 3: there were existing analysts at the two centres, and each analyst accessed each distributed system simultaneously, harmonising variables, co-writing analytical scripts and combining the outputs. This was the quickest and easiest method to ensure that consistency was embedded in our analysis while working within the existing governance systems of each safe haven. Reaching the same final outcomes with centralised data (approach 1) would have required too many additional, non-project-related resources, activities and agreements, as the governance, security, IT and approvals structures in each safe haven would have required wider approvals and changes to existing implementations. This would have added considerable extra work in Scotland, which already had an approved project in place at the start of the process. To adopt a linked federated data and analysis approach (approach 4), each site has to trust the central site, with established security protocols and governance approval. There is no example of this approach being implemented in the UK yet. Had the datasets been completely harmonised (which was not the case) so as to ensure consistency of the analysis, a distributed query approach (approach 5) could have been undertaken and would probably have been acceptable to the governance bodies under which the project was executed. The barrier to this approach is the lack of a generic distributed query engine, as these technologies tend to be bespoke and focused on specific research projects and their objectives. Many research projects require data, and rapid harmonisation of methods, from more than one country or region. Our view is that remote access to data for distributed researchers, data visualisation and real-time co-writing of analytical scripts can significantly improve the efficiency of replication studies.
The approach we adopted is a simple, yet efficient and cost-effective, method of ensuring consistency in analysis and coherence with multiple governance systems. The algorithm developed in this study for manipulating and combining datasets in the Welsh and Scottish safe havens is limited to the datasets and variables used for this study. However, it can readily be extended and scaled to all available data, provided sufficient time and resources are made available. While this project is in the process of including other European jurisdictions to answer the specific scientific questions, our approach can also be considered a first step towards an infrastructure to support a distributed team science approach to research using EHRs across the UK and more widely.

References (15 in total)

1. Research use of linked health data - a best practice protocol.
Authors: C W Kelman; A J Bass; C D J Holman
Journal: Aust N Z J Public Health Date: 2002

2. The structure and organization of local and state public health agencies in the U.S.: a systematic review.
Authors: Justeen K Hyde; Stephen M Shortell
Journal: Am J Prev Med Date: 2012-05

3. Health services research and data linkages: issues, methods, and directions for the future.
Authors: Cathy J Bradley; Lynne Penberthy; Kelly J Devers; Debra J Holden
Journal: Health Serv Res Date: 2010-08-02

4. Assessing cross-sectoral and cross-jurisdictional coordination for public health emergency legal preparedness.
Authors: Rick Hogan; Cheryl H Bullard; Daniel Stier; Matthew S Penn; Teresa Wall; John Cleland; James H Burch; Judith Monroe; Robert E Ragland; Thurbert Baker; John Casciotti
Journal: J Law Med Ethics Date: 2008

5. Cost-effectiveness of new oral anticoagulants compared with warfarin in preventing stroke and other cardiovascular events in patients with atrial fibrillation.
Authors: Doug Coyle; Kathryn Coyle; Chris Cameron; Karen Lee; Shannon Kelly; Sabine Steiner; George A Wells
Journal: Value Health Date: 2013-04-23

6. Data linkage infrastructure for cross-jurisdictional health-related research in Australia.
Authors: James H Boyd; Anna M Ferrante; Christine M O'Keefe; Alfred J Bass; Sean M Randall; James B Semmens
Journal: BMC Health Serv Res Date: 2012-12-29

7. Data cleaning and management protocols for linked perinatal research data: a good practice example from the Smoking MUMS (Maternal Use of Medications and Safety) Study.
Authors: Duong Thuy Tran; Alys Havard; Louisa R Jorm
Journal: BMC Med Res Methodol Date: 2017-07-11

8. The SAIL Databank: building a national architecture for e-health research and evaluation.
Authors: David V Ford; Kerina H Jones; Jean-Philippe Verplancke; Ronan A Lyons; Gareth John; Ginevra Brown; Caroline J Brooks; Simon Thompson; Owen Bodger; Tony Couch; Ken Leake
Journal: BMC Health Serv Res Date: 2009-09-04

9. The SAIL databank: linking multiple health and social care datasets.
Authors: Ronan A Lyons; Kerina H Jones; Gareth John; Caroline J Brooks; Jean-Philippe Verplancke; David V Ford; Ginevra Brown; Ken Leake
Journal: BMC Med Inform Decis Mak Date: 2009-01-16

10. A case study of the Secure Anonymous Information Linkage (SAIL) Gateway: a privacy-protecting remote access system for health-related research and evaluation.
Authors: Kerina H Jones; David V Ford; Chris Jones; Rohan Dsilva; Simon Thompson; Caroline J Brooks; Martin L Heaven; Daniel S Thayer; Cynthia L McNerney; Ronan A Lyons
Journal: J Biomed Inform Date: 2014-01-15