| Literature DB >> 32990744 |
Thomas Nind1, James Sutherland1, Gordon McAllister1, Douglas Hardy1, Ally Hume2, Ruairidh MacLeod2, Jacqueline Caldwell3, Susan Krueger1, Leandro Tramma1, Ross Teviotdale1, Mohammed Abdelatif1, Kenny Gillen1, Joe Ward1, Donald Scobbie2, Ian Baillie3, Andrew Brooks2, Bianca Prodan2, William Kerr2, Dominic Sloan-Murphy2, Juan F R Herrera2, Dan McManus2, Carole Morris3, Carol Sinclair4, Rob Baxter2, Mark Parsons2, Andrew Morris5, Emily Jefferson1.
Abstract
AIM: To enable a world-leading research dataset of routinely collected clinical images linked to other routinely collected data from the whole Scottish national population. This includes more than 30 million different radiological examinations from a population of 5.4 million and >2 PB of data collected since 2010.
Keywords: AI; Big Data; ML; Radiology
Year: 2020 PMID: 32990744 PMCID: PMC7523405 DOI: 10.1093/gigascience/giaa095
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Overview of the architecture. There are 3 zones: the identifiable zone (which holds the raw data), the de-identified zone (where cohort building and linking to other data take place), and the Safe Haven environment (where a researcher carries out their analysis).
Overview of data stores
| Data | Description |
|---|---|
| DICOM files (unmodified historic data) | DICOM files are stored unaltered in a file archive. There are several reasons for keeping the identifiable data and storing the original DICOM images:• Any program developed to strip all identifiable data from the DICOM files and tags risks rendering the whole dataset unusable if it does so incorrectly; linkage to other datasets would then be either incorrect or impossible.• Future de-identification strategies may wish to make use of some identifiable data, and removing those data now would limit future options.• The NHS may wish to use the data as a secondary offline disaster recovery system, or to populate a clinical system from an alternative provider. In that case it must be technically feasible to regenerate the data in identifiable form, in a non-proprietary format as close as possible to the DICOM files as they were originally captured. |
| Identifiable DICOM tag data | All tag metadata from the DICOM files are extracted to a MongoDB database in a searchable format. These tag metadata are stored in an identifiable format because de-identification analysts need to know what the identifiable data are so that they can remove them, e.g.,• If the patient name is Mrs Jones, then a de-identification analyst searching for identifiable data in the clinical report will need to know to look for the text “Jones” in order to remove it.• To check whether an image is identifiable, the de-identification analyst might need to know the CHI number in order to check that it is not burnt into the pixel data. |
| Inventory tables | A subset of the identifiable data above is copied here. This is a relational database containing suitably cleansed and de-identified image metadata (and file paths), i.e., metadata confirmed to be well populated, of high quality, and free of identifiable data. It is used by the research coordinators for cohort creation and extraction to the Safe Haven. The data are indexed using EUPIs. For example, DICOM age strings can express the age in years, months, or days (e.g., 075Y, 006M, or 002D); the cleaned and homogenized metadata store these in a consistent and easily queried numeric format. Other metadata fields may be a single value summarising data stored in multiple different DICOM tags. For example, by analysing the acquisition position of the images it is possible to identify examinations in which the same volume has been acquired repeatedly in a single series; used in conjunction with tags that indicate whether contrast was administered, this can disambiguate contrast bolus imaging from other acquisitions that may also use contrast. |
| Cohort and associated anonymous research extracts | Any research project starts by defining a relevant cohort and obtaining the necessary ethical/administrative approvals in consultation between the researchers and the research coordinators; this is an out-of-band process outwith the iRDMP system, so it is not shown here. The data analysts then assemble a dataset (the anonymous research extract) for that project by querying the inventory tables, possibly linking against other data sources via the EUPI (the pseudonymised patient ID, explained below), and trigger the Extraction Microservices to export the appropriate subset of columns to the research users. For example, a project might request all available brain MRI scans from patients who have been prescribed gabapentin, together with the dosage information and patient age; the researchers would be given a set of image data (the scans themselves, passed through the DICOM file anonymiser described later) and a table of associated metadata, including the dosage information for each de-identified patient. |
| Research Data Management Platform (RDMP) | The RDMP manages and monitors the extraction processes. |
| CHI to EUPI mapping table | Scotland uses the CHI (Community Health Index) number as the unique identifier for health data; this table maps each CHI to its EUPI, the pseudonymised patient identifier used in the de-identified zone. Adhering to the guiding principles of data linkage for research [
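The age-string homogenisation described for the inventory tables can be sketched as below. This is a minimal illustration: the function name and the exact conversion factors are assumptions, not the system's actual implementation.

```python
import re

def age_string_to_years(age: str) -> float:
    """Convert a DICOM Age String (VR 'AS', e.g. '075Y', '006M', '002D')
    into a single numeric value in years, suitable for a cleaned,
    easily queried inventory column."""
    match = re.fullmatch(r"(\d{3})([DWMY])", age.strip().upper())
    if not match:
        raise ValueError(f"Not a valid DICOM age string: {age!r}")
    value, unit = int(match.group(1)), match.group(2)
    # Approximate conversion factors from each DICOM age unit to years.
    factor = {"D": 1 / 365.25, "W": 7 / 365.25, "M": 1 / 12, "Y": 1.0}[unit]
    return value * factor
```

Storing a single numeric column like this lets cohort queries filter on age ranges without parsing the original unit-suffixed strings.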
Overview of Processes
| Processes | Description |
|---|---|
| Promotion of de-identifiable tags/metadata | Promotion is a 2-stage process. The first stage promotes images for which an anonymisation protocol exists (e.g., CT images); anonymised tag data are pushed to the inventory tables, which support the extraction processes and routine practices such as data cleaning. A subset of these data (e.g., only primary/original images) is then pushed to the de-identified zone, indicating that the images can be used for cohort generation/image extraction. This data push can include collapsing data, e.g., to series/study level. It is neither feasible nor desirable to proactively analyse the complete identifiable DICOM tag data in order to promote all tags, partly because of the difficulty of determining whether a tag of a given type contains identifiable information across (i) the whole of the current archive and (ii) future PACS images yet to be taken. A tag can be promoted in 2 circumstances: (1) it is determined not to contain identifiable information, or (2) the identifiable information it does contain can be de-identified. Sophisticated techniques such as natural language processing (NLP) can be used to establish Condition 1 or to find a de-identification solution for Condition 2. The solution for Condition 2 is known as an anonymisation profile and can be saved for reuse. Once a tag is flagged as safe for promotion, it is moved to the inventory tables. This is an iterative process: future studies with unique requirements will inform which data are prioritised for anonymisation and promotion. |
| Promotion of image types that are extractable | This process whitelists images that are extractable in the sense that the pixel data can be completely de-identified. Some images, particularly ultrasounds, may have identifiable information such as the patient name or CHI watermarked on the image. The set of image types that can be de-identified is stored in the metadata catalogue, but the rules governing how images are de-identified are stored in CTP [
| Mapping (CHI-EUPI) | This process is called when metadata are promoted to the de-identified zone, replacing identifiable CHIs with EUPIs. It is fully automated, so no individual can see the mapping. |
| Cohort creation process | A set of software tools (or manual SQL queries if the user prefers) that query the DICOM metadata within the inventory tables to select images relevant for a particular cohort (by applying filters that describe researcher requirements). The resulting cohort forms the basis for both the initial and subsequent releases of data to the Safe Haven for the relevant study, and as such it is critically important that the cohort be identified and managed correctly. |
| Extraction process | This process uses the cohort database and inventory tables to determine which files to extract for a particular research project. It calls the DICOM file anonymiser to de-identify the relevant files used to build the cohort for release to the researcher. After the cohort output and the de-identified DICOM files are curated, the process triggers a release into the researcher Safe Haven environment. |
| DICOM file anonymiser | The DICOM file anonymiser:• Obtains the file(s) from the file archives• Anonymises the pixel data of the file if necessary• Anonymises the metadata in the file (leaving only the whitelisted tags)• Converts the file to an alternative format if required• Returns the final file(s) to the user |
| Researcher VM with tools to view and manipulate images | There are 2 main use cases: small-scale studies, in which a research team may wish to open and mark up each image by eye, and large-scale studies, in which the users of the system develop software and algorithms to analyse the images for their specific project. The different tools available within the Safe Haven meet both sets of requirements. The researcher VM image includes a standard set of tools, which will be expanded over time as requirements grow. Example tools are MicroDICOM (simple DICOM viewer), ClearCanvas (open-source PACS client, cf. Carestream), and XNAT. The VM should allow users to securely add their own tools. The VM provides access to the associated study-specific image metadata and pixel data but does not allow row-level or pixel data to be extracted. Access to the internet is restricted when analysing the data. |
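The automated CHI-to-EUPI substitution can be illustrated with a keyed-hash sketch. This is an assumption for illustration only: the production system uses a managed CHI-to-EUPI mapping table (as described in the data-stores section), not necessarily a keyed hash, and the function names (`chi_to_eupi`, `pseudonymise_rows`) are hypothetical.

```python
import hashlib
import hmac

def chi_to_eupi(chi: str, secret_key: bytes) -> str:
    """Deterministically map a CHI number to a EUPI-style pseudonym.
    Illustrative only: an HMAC keyed with a secret held by the automated
    process stands in for the managed mapping table."""
    digest = hmac.new(secret_key, chi.encode("ascii"), hashlib.sha256)
    return digest.hexdigest()[:16].upper()

def pseudonymise_rows(rows: list, secret_key: bytes) -> list:
    """Replace the identifiable CHI column with a EUPI before metadata
    are promoted to the de-identified zone."""
    return [{**row, "PatientID": chi_to_eupi(row["PatientID"], secret_key)}
            for row in rows]
```

Because the substitution is deterministic for a given key, records for the same patient still link across datasets via the EUPI, which is what makes cohort building in the de-identified zone possible.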
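The whitelist step of the DICOM file anonymiser (keeping only tags promoted as safe) can be sketched as follows. The names `TAG_WHITELIST` and `anonymise_tags` are hypothetical, and the tags shown are an illustrative subset, not the system's actual whitelist.

```python
# Tags that have been promoted as safe (illustrative subset only).
TAG_WHITELIST = {"Modality", "StudyDate", "SeriesDescription",
                 "SliceThickness", "PatientAge"}

def anonymise_tags(dicom_tags: dict) -> dict:
    """Return only the whitelisted subset of a DICOM tag dictionary.
    Anything not explicitly promoted (names, addresses, free-text
    fields) is dropped by default."""
    return {name: value for name, value in dicom_tags.items()
            if name in TAG_WHITELIST}
```

Defaulting to removal and whitelisting known-safe tags, rather than blacklisting known-unsafe ones, is what protects against identifiable data appearing in unexpected tags in future PACS images.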
Scores of the de-identification tools evaluated
| Tool | Core functionality | User friendliness | Support | Total |
|---|---|---|---|---|
| XNAT | 37 | 21 | 22 | 80 |
| CTP | 41 | 24 | 25 | 90 |
| DICOM Confidential | 35 | 24 | 14 | 73 |
Maximum scores: 45, 30, 25 for a total of 100.
Scalability of anonymisation processing
| No. image IDs | Anonymisation run time | Total file size | Mean run time per file |
|---|---|---|---|
| 3 | Negligible | 1.5 MB | Negligible |
| 2,264 | 0.3 hours | 1.2 GB | 0.54 sec |
| 89,415 | 1.5 hours | 50 GB | 0.06 sec |
| ∼1.2 million | 28 hours | 630 GB | 0.06 sec |
| ∼17.2 million | 65 hours | 8.5 TB | 0.01 sec |
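The mean run time per file follows directly from dividing the total run time by the file count. Because the run times in the table are rounded, a back-of-the-envelope recomputation reproduces the tabulated per-file means only approximately:

```python
# (number of files, total run time in hours) for each anonymisation run,
# taken from the scalability table above.
runs = [
    (2_264, 0.3),
    (89_415, 1.5),
    (1_200_000, 28.0),
    (17_200_000, 65.0),
]

for n_files, hours in runs:
    seconds_per_file = hours * 3600 / n_files
    print(f"{n_files:>10} files: {seconds_per_file:.3f} s/file")
```

The downward trend in per-file time as batches grow reflects the amortisation of fixed overheads across larger extractions.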
Roles
| Role | Description |
|---|---|
| Researchers | Carry out the research on a dataset extracted from the SMI DB and other linked data. Any project may have a variety of researchers including clinicians, statisticians, radiographers, image analysis and machine learning experts, and so forth. They view and work on the PACS images within a Safe Haven environment. |
| Research coordinators/cohort builders | Work with the researchers to produce the data extract that allows the research study to be carried out. Research coordinators understand where the data are stored and how to link across datasets and will run software, write scripts, query databases, etc., to produce the final cohort datasets. |
| Data analysts | Work with de-identified PACS data to produce more usable versions of the data for research coordinators to work with. Over time, data analysts (working with domain experts) may produce additional mapping tables and categorization systems that make it easier for researchers and research coordinators to work with the data. |
| De-identification analysts | Are responsible for ensuring that as many data as possible are made available to research coordinators for the creation of cohorts but that no identifiable data reach the coordinators. Much of the de-identification task is automated, but the system needs to be continually monitored and new DICOM tags added to the whitelist (or blacklist) as required. |
| System administrators | Are part of the infrastructure team, responsible for building and maintaining the underpinning infrastructure, security, network separation, and monitoring, and for supporting the automated processes. Supporting the automated processes involves, e.g., checking whether there were errors in the data load or data extraction processes; system administrators have the privileges and expertise to debug and/or restart these processes. |
| Software developers | Produce any new software required within any zones of the environment. The software is developed and tested outwith the production environment. Deployment of software updates will be carried out by system administrators. |