Kirti Magudia, Christopher P Bridge, Katherine P Andriole, Michael H Rosenthal.
Abstract
With growing interest in machine learning, more investigators are proposing to assemble large datasets for machine learning applications. We aim to delineate the roadblocks to exam retrieval that may arise and lead to significant time delays. This HIPAA-compliant, institutional review board-approved, retrospective clinical study required identification and retrieval of all outpatient and emergency patients undergoing abdominal and pelvic computed tomography (CT) at three affiliated hospitals in the year 2012. If a patient had multiple abdominal CT exams, the first exam was selected for retrieval (n = 23,186). Our attempt to retrieve 23,186 abdominal CT exams yielded 22,852 valid CT abdomen/pelvis exams and identified four major categories of challenges when retrieving large datasets: cohort selection and processing, retrieving DICOM exam files from PACS, data storage, and non-recoverable failures. The retrieval took 3 months of project time and at least 300 person-hours shared among the primary investigator (a radiologist), a data scientist, and a software engineer. Exam selection and retrieval may take significantly longer than planned. We share our experience so that other investigators can anticipate and plan for these challenges. We also hope to help institutions better understand the demands that may be placed on their infrastructure by large-scale medical imaging machine learning projects.
Keywords: Artificial intelligence; Dataset; Exam retrieval; Informatics
Year: 2021 PMID: 34608591 PMCID: PMC8669054 DOI: 10.1007/s10278-021-00505-7
Source DB: PubMed Journal: J Digit Imaging ISSN: 0897-1889 Impact factor: 4.903
Summary of challenges encountered during exam retrieval
| Category | Problem | Solution | Result |
|---|---|---|---|
| Cohort selection and processing | Mislabeled exams | Excluded exam descriptions of “ablation, fna, biopsy, drainage, guidance, drain, drg, bx, interventional, interv, perc, bone” | 22,903 exams remaining |
| Cohort selection and processing | Inconsistent formatting of medical record numbers (MRNs) and accessions (ACCs) | For one hospital, MRNs were padded with leading zeros to 8 digits. All ACCs before a change in EMR had a leading “A” removed. | MRNs and ACCs for 10,089 exams were reformatted |
| Cohort selection and processing | Some ACCs generated solely for billing with no linked images; inconsistent linkage of images to ACCs | Queried copies of the underlying databases of both hospital clinical PACS systems with MRN and date to identify all candidate ACCs: (1) CTs with >20 images; (2) search for exam with body part of abdomen; (3) if none, allow body part of “pelvis” or “chest”; (4) if none, expand date range to ±4 days | 838 exams had a different ACC chosen than what was provided from the research database |
| Exam retrieval | Slow pull method for one hospital consisted of a Web API pull method that preceded a vendor-neutral archive | New method established where radiology IT pushes exams to a DCM4CHEE instance (an open-source DICOM image management system), which is then transferred to our storage | Original time estimate to retrieve exams of >1 year. With new method, exams retrieved in 2 weeks |
| Exam retrieval | Push rate from DCM4CHEE instance exceeding write rate to storage, causing crashes | Slowed down push rate and added memory to the server running the DCM4CHEE instance to buffer images as they came in before they were written to storage | No further system crashes during exam retrieval |
| Data storage | Data storage requirement exceeded available storage in a multi-user system | Transitioned project files to new storage device | Overall delay of 3 weeks |
| Non-recoverable failures | MRN/ACC discrepancies | Excluded from further analysis | 17 exams |
| Non-recoverable failures | Topogram only exam | Excluded from further analysis | 7 exams |
| Non-recoverable failures | Missing exam | Excluded from further analysis | 8 exams |
| Non-recoverable failures | Corrupted CT data | Excluded from further analysis | 3 exams |
| Non-recoverable failures | Non-patient test exam | Excluded from further analysis | 1 exam |
| Non-recoverable failures | DICOM encoding errors | Excluded from further analysis | 15 exams |
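
The cohort selection and processing steps summarized in the table can be illustrated with a short Python sketch. This is not the authors' code: the function and field names (`is_excluded`, `normalize_mrn`, `normalize_acc`, `Candidate`, `pick_exam`) are hypothetical, the exclusion test is assumed to be a case-insensitive substring match, and the order in which body part and date range are relaxed, as well as the tie-breaking rule, are assumptions where the table does not specify them.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence

# Exclusion terms quoted in the table. Matching them as case-insensitive
# substrings is an assumption, not a detail stated in the source.
EXCLUDED_TERMS = ("ablation", "fna", "biopsy", "drainage", "guidance", "drain",
                  "drg", "bx", "interventional", "interv", "perc", "bone")


def is_excluded(description: str) -> bool:
    """Return True if the exam description contains any exclusion term."""
    text = description.lower()
    return any(term in text for term in EXCLUDED_TERMS)


def normalize_mrn(mrn: str) -> str:
    """Pad an MRN with leading zeros to 8 digits (done for one hospital)."""
    return mrn.strip().zfill(8)


def normalize_acc(acc: str, before_emr_change: bool) -> str:
    """Remove a leading 'A' from accessions issued before the EMR change."""
    acc = acc.strip()
    if before_emr_change and acc.startswith("A"):
        return acc[1:]
    return acc


@dataclass
class Candidate:
    """Hypothetical record for one candidate exam found in the PACS database."""
    acc: str
    exam_date: date
    body_part: str
    n_images: int


def pick_exam(candidates: Sequence[Candidate], target: date) -> Optional[Candidate]:
    """Choose an accession using the fallback order described in the table."""
    # Keep only CTs with more than 20 images.
    usable = [c for c in candidates if c.n_images > 20]
    # Try the exact date first, then widen to +/- 4 days (assumed order).
    for window_days in (0, 4):
        in_range = [c for c in usable
                    if abs((c.exam_date - target).days) <= window_days]
        # Prefer abdomen; if none, allow pelvis or chest.
        for allowed in (("abdomen",), ("pelvis", "chest")):
            matches = [c for c in in_range if c.body_part.lower() in allowed]
            if matches:
                # Tie-break (assumption): closest date, then most images.
                return min(matches, key=lambda c: (
                    abs((c.exam_date - target).days), -c.n_images))
    return None
```

In practice, substring matching on short terms such as “perc” or “bx” can also catch unrelated descriptions, so a real pipeline would likely review the excluded descriptions by hand before dropping exams.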