William Bennett, Kirk Smith, Quasar Jarosz, Tracy Nolan, Walter Bosch.
Abstract
Reusable, publicly available data is a pillar of open science and rapid advancement of cancer imaging research. Sharing data from completed research studies not only saves research dollars required to collect data, but also helps ensure that studies are both replicable and reproducible. The Cancer Imaging Archive (TCIA) is a global shared repository for imaging data related to cancer. Ensuring the consistency, scientific utility, and anonymity of data stored in TCIA is of utmost importance. As submissions to TCIA have increased in both volume and complexity of the DICOM objects stored, curation of collections has become a bottleneck in data acquisition. To increase the rate of curation of image sets, improve the quality of the curation, and better track the provenance of changes made to submitted DICOM image sets, a custom set of tools was developed, using novel methods for the analysis of DICOM data sets. These tools are written in the Perl programming language, use the open-source database PostgreSQL, make use of the Perl DICOM routines in the open-source package Posda, and incorporate DICOM diagnostic tools from other open-source packages, such as dicom3tools. These tools are referred to as the "Posda Tools." The Posda Tools are open source and available via git at https://github.com/UAMS-DBMI/PosdaTools.
In this paper, we briefly describe the Posda Tools and discuss the novel methods employed by these tools to facilitate rapid analysis of DICOM data, including the following: (1) use of a database schema which is more permissive than, and differently normalized from, traditional DICOM databases; (2) automatic integrity checks performed on a bulk basis; (3) application of revisions to DICOM datasets on a bulk basis, either through a web-based interface or via command-line executable Perl scripts; (4) tracking of all such edits in a revision tracker, so that they may be rolled back; (5) a UI to inspect the results of such edits, to verify that they are what was intended; (6) identification of DICOM Studies, Series, and SOP instances using "nicknames" which are persistent and have well-defined scope, to make expression of reported DICOM errors easier to manage; and (7) rapid identification of potential duplicate DICOM datasets by pixel data; this can be used, e.g., to identify submission subjects which may relate to the same individual, without identifying the individual.
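As an illustration of method (7), the following Python sketch groups files by a digest of their pixel data and reports pairs of series that share digests. It is not the Posda Tools implementation (which is written in Perl and backed by PostgreSQL); the function names, the in-memory input format, and the choice of SHA-256 as the digest are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def pixel_digest(pixel_data: bytes) -> str:
    """Digest of raw pixel bytes (SHA-256 is an assumption, not the paper's choice)."""
    return hashlib.sha256(pixel_data).hexdigest()


def potential_duplicate_series(files):
    """Count shared pixel digests between every pair of series.

    `files` is a hypothetical iterable of (series_id, pixel_bytes) tuples.
    Returns {(series_a, series_b): number_of_distinct_shared_digests}.
    """
    # Map each pixel digest to the set of series that contain it.
    series_by_digest = defaultdict(set)
    for series_id, pixel_data in files:
        series_by_digest[pixel_digest(pixel_data)].add(series_id)

    # Every digest seen in more than one series links those series pairwise.
    shared = defaultdict(int)
    for series_ids in series_by_digest.values():
        for pair in combinations(sorted(series_ids), 2):
            shared[pair] += 1
    return dict(shared)


if __name__ == "__main__":
    example = [
        ("S1", b"\x00\x01"), ("S1", b"\x02\x03"),
        ("S2", b"\x00\x01"), ("S2", b"\x02\x03"),
        ("S3", b"\xff\xff"),
    ]
    print(potential_duplicate_series(example))
```

Because only digests are compared, candidate duplicates can be flagged without reading any patient-identifying attributes; a pair whose shared count equals the number of files in both series would be a candidate for full duplication.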
Keywords: Curation; DICOM; De-identification; Image archive; Open science; Posda; Scalability; TCIA
Year: 2018 PMID: 29907888 PMCID: PMC6261183 DOI: 10.1007/s10278-018-0097-4
Source DB: PubMed Journal: J Digit Imaging ISSN: 0897-1889 Impact factor: 4.056
Fig. 1. Diagnosis and repair of DICOM errors in TCIA datasets. DICOM data are imported into a Posda database. Queries to this database are used to identify inconsistent Frame of Reference identifiers and generate scripts to correct these inconsistencies. DICOM files are edited before re-loading into TCIA
Fig. 2. An excerpt from a Consistency Check
Fig. 3. Output from “RunDciodvfy.pl”
Fig. 4. Process for finding potential duplicate series based upon finding series which have the same number of files with duplicate pixel data
Fig. 5. Distinguished digests
Fig. 6. List used to identify potential duplicate series
Fig. 7. Determining actual duplicate series and extent of duplication
Fig. 8. Columns for each of the Series 1–3 and “Proto” Series