Eduardo P Reis, Joselisa P Q de Paiva, Maria C B da Silva, Guilherme A S Ribeiro, Victor F Paiva, Lucas Bulgarelli, Henrique M H Lee, Paulo V Santos, Vanessa M Brito, Lucas T W Amaral, Gabriel L Beraldo, Jorge N Haidar Filho, Gustavo B S Teles, Gilberto Szarf, Tom Pollard, Alistair E W Johnson, Leo A Celi, Edson Amaro.
Abstract
Chest radiographs allow for the meticulous examination of a patient's chest but demand specialized training for proper interpretation. Automated analysis of medical imaging has become increasingly accessible with the advent of machine learning (ML) algorithms. Large labeled datasets are key elements for training and validation of these ML solutions. In this paper, we describe the Brazilian labeled chest x-ray dataset, BRAX: an automatically labeled dataset designed to assist researchers in the validation of ML models. The dataset contains 24,959 chest radiography studies from patients presenting to a large general Brazilian hospital. A total of 40,967 images are available in the BRAX dataset. All images have been verified by trained radiologists and de-identified to protect patient privacy. Fourteen labels were derived from free-text radiology reports written in Brazilian Portuguese using Natural Language Processing.
Year: 2022 PMID: 35948551 PMCID: PMC9364309 DOI: 10.1038/s41597-022-01608-8
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Fig. 1. BRAX dataset creation flowchart. Data Extraction: Only chest radiographs accompanied by a radiology report were included. Images were anonymized and checked for burned-in sensitive data. Data Preparation: DICOM images were converted to PNG format and rescaled. Fourteen radiological findings were extracted from free-text reports written in Brazilian Portuguese, after adaptation of the NegEx and CheXpert label extraction algorithms. Technical Validation: The labeling was validated by board-certified radiologists. Transfer to Data Repository: The BRAX dataset is available on PhysioNet[21,22] at https://physionet.org/content/brax/1.1.0/.
Fig. 2. Flowchart detailing the BRAX dataset creation process. First, images were retrieved from the institutional PACS database. Next, exclusion criteria were applied, and then a subset was separated as a hidden test dataset.
Fig. 3. Automated labeling of the radiology reports. Example of the original radiology report in Brazilian Portuguese, its translation to English, and the final output of the automated labeling procedure.
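The labeling step can be illustrated with a minimal rule-based sketch. This is not the BRAX labeler: the trigger phrases below are a handful of hypothetical Portuguese patterns chosen for illustration, the function name `label_report` is invented, and the 1 / 0 / -1 encoding follows the CheXpert convention, which is assumed here to carry over. The adapted NegEx and CheXpert rule sets are far larger and are not reproduced.

```python
import re

# Hypothetical trigger phrases, for illustration only. The real labeler relies
# on adapted NegEx and CheXpert rule sets that are not reproduced here.
MENTIONS = {
    "Pleural Effusion": r"derrame pleural",
    "Cardiomegaly": r"cardiomegalia",
    "Pneumothorax": r"pneumot[óo]rax",
}
NEGATION = r"\b(sem|aus[êe]ncia de|n[ãa]o h[áa])\b"
UNCERTAINTY = r"\b(poss[íi]vel|prov[áa]vel|n[ãa]o se pode excluir)\b"


def label_report(text):
    """Return {finding: 1 | 0 | -1} for each finding mentioned in the report.

    1 = positive, 0 = negated, -1 = uncertain (CheXpert-style, assumed here).
    Findings that are not mentioned are left out of the result.
    """
    labels = {}
    # Split into sentences so a trigger only affects mentions in its own sentence.
    for sentence in re.split(r"[.;\n]+", text.lower()):
        for finding, pattern in MENTIONS.items():
            match = re.search(pattern, sentence)
            if not match:
                continue
            # Only text preceding the mention is inspected for triggers.
            prefix = sentence[: match.start()]
            if re.search(UNCERTAINTY, prefix):
                labels[finding] = -1
            elif re.search(NEGATION, prefix):
                labels[finding] = 0
            else:
                labels[finding] = 1
    return labels


report = "Sem derrame pleural. Cardiomegalia. Possível pneumotórax à direita."
print(label_report(report))
```

Uncertainty is checked before negation so that a phrase such as "não se pode excluir" is not misread as a plain negation. Triggers that follow the mention are ignored in this sketch.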
Frequency of the radiological findings.
| Pathology | Positive (%) | Uncertain (%) | Negative (%) |
|---|---|---|---|
| No findings | 29009 (71.0) | 0 | 11958 (29.0) |
| Enlarged Cardiomediastinum | 71 (0.17) | 2 (0.00) | 26212 (63.98) |
| Cardiomegaly | 3984 (9.72) | 0 | 28000 (68.35) |
| Lung Lesion | 1290 (3.15) | 19 (0.05) | 46 (0.11) |
| Lung Opacity | 4065 (9.92) | 17 (0.04) | 52 (0.13) |
| Edema | 50 (0.12) | 0 (0.0) | 0 (0.0) |
| Consolidation | 3157 (7.71) | 0 (0.0) | 19 (0.05) |
| Pneumonia | 774 (1.89) | 0 (0.0) | 46 (0.11) |
| Atelectasis | 3518 (8.59) | 0 (0.0) | 41 (0.10) |
| Pneumothorax | 214 (0.52) | 0 (0.0) | 189 (0.46) |
| Pleural Effusion | 1822 (4.45) | 0 (0.0) | 31422 (76.70) |
| Pleural Other | 117 (0.29) | 0 (0.0) | 1 (0.00) |
| Fracture | 624 (1.52) | 0 (0.0) | 16405 (40.04) |
| Support Devices | 8791 (21.46) | 0 (0.0) | 21 (0.05) |
The BRAX dataset consists of 14 labeled observations. We report the number of images which contain these observations.
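The percentages in the table appear to be shares of the 40,967 images in the dataset. That denominator is an inference from the reported counts, not something the table states, and the helper name `share` is hypothetical. A short sketch reproducing a few of the positive-label percentages under that assumption:

```python
TOTAL_IMAGES = 40_967  # total images reported for BRAX (assumed denominator)


def share(count, total=TOTAL_IMAGES):
    """Percentage of images carrying a label, rounded to two decimals."""
    return round(100 * count / total, 2)


# Positive counts copied from the frequency table.
positives = {"Cardiomegaly": 3984, "Lung Opacity": 4065, "Support Devices": 8791}
for finding, count in positives.items():
    print(f"{finding}: {count} ({share(count)}%)")
```

Under this assumption the sketch yields 9.72, 9.92 and 21.46, matching the table entries for these three findings.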
Fig. 4. Example images included in the BRAX dataset. (a) Lung lesion, consolidation; (b) Cardiomegaly, device; (c) patient in intensive care bed, edema, cardiomegaly, device; (d) Pneumothorax; (e) pneumothorax, pleural effusion, consolidation, atelectasis; (f) No Findings.
Fig. 5. Folder structure of the BRAX dataset. The main repository contains two folders comprising the anonymized DICOM and PNG images respectively, in addition to the master spreadsheet, which contains the labels and the associated metadata for each image (DICOM/PNG).
Fig. 6. Example of the Anonymized_DICOMs folder structure for a single patient. Inside the main anonymized folder, subfolders are organized in the following hierarchy: patients (DICOM tag: PatientID), studies (DICOM tag: StudyInstanceUID), series (DICOM tag: SeriesInstanceUID), and images (DICOM tag: SOPInstanceUID).
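The four-level hierarchy maps directly onto a relative path. The sketch below builds and parses such a path. The function names, the `.dcm` extension and the example identifiers are assumptions made for illustration; only the folder name `Anonymized_DICOMs` and the order of the DICOM tags come from the caption.

```python
from pathlib import PurePosixPath


def dicom_path(root, patient_id, study_uid, series_uid, sop_uid, ext=".dcm"):
    """Build root/PatientID/StudyInstanceUID/SeriesInstanceUID/SOPInstanceUID<ext>.

    The file extension is an assumption; adjust it to the actual files.
    """
    return PurePosixPath(root) / patient_id / study_uid / series_uid / f"{sop_uid}{ext}"


def parse_dicom_path(path, ext=".dcm"):
    """Recover the four DICOM identifiers from the last four path components."""
    patient_id, study_uid, series_uid, filename = PurePosixPath(path).parts[-4:]
    # Strip the extension explicitly: UIDs contain dots, so a generic
    # "drop the suffix" helper could cut into the UID itself.
    sop_uid = filename[: -len(ext)] if ext and filename.endswith(ext) else filename
    return {
        "PatientID": patient_id,
        "StudyInstanceUID": study_uid,
        "SeriesInstanceUID": series_uid,
        "SOPInstanceUID": sop_uid,
    }


# Made-up identifiers, not taken from the dataset.
p = dicom_path("Anonymized_DICOMs", "id_1", "1.2.3", "1.2.3.4", "1.2.3.4.5")
print(p)
print(parse_dicom_path(p))
```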
Performance of the automated labeling of the radiology reports.
| Findings | Mention F1 | Mention Recall | Mention Precision | Negation F1 | Negation Recall | Negation Precision | Uncertainty F1 | Uncertainty Recall | Uncertainty Precision |
|---|---|---|---|---|---|---|---|---|---|
| Atelectasis | 0.931 | 0.900 | 0.964 | 0.667 | 1.000 | 0.500 | 0.333 | 0.500 | 0.250 |
| Cardiomegaly | 0.947 | 0.986 | 0.910 | 0.996 | 0.993 | 1.000 | 0.907 | 0.975 | 0.848 |
| Consolidation | 0.824 | 0.824 | 0.824 | 0.969 | 1.000 | 0.939 | N/A | N/A | N/A |
| Edema | 0.800 | 1.000 | 0.667 | N/A | N/A | N/A | 0.889 | 0.800 | 1.000 |
| Pleural Effusion | 0.925 | 0.977 | 0.878 | 0.992 | 0.986 | 0.997 | 0.308 | 0.200 | 0.667 |
| Pneumonia | 0.762 | 0.667 | 0.889 | N/A | N/A | N/A | 0.800 | 0.889 | 0.727 |
| Pneumothorax | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Enlarged Cardiomediastinum | 0.857 | 0.750 | 1.000 | 0.990 | 0.980 | 1.000 | N/A | N/A | N/A |
| Lung Lesion | 0.795 | 0.861 | 0.738 | 0.667 | 1.000 | 0.500 | 0.800 | 1.000 | 0.667 |
| Lung Opacity | 0.933 | 0.885 | 0.986 | 0.400 | 1.000 | 0.250 | 0.200 | 0.667 | 0.118 |
| Pleural Other | 0.901 | 0.865 | 0.941 | N/A | N/A | N/A | 0.182 | 0.100 | 1.000 |
| Fracture | 0.850 | 0.739 | 1.000 | 0.400 | 0.333 | 0.500 | 0.000 | 0.000 | 0.000 |
| Support Devices | 0.987 | 0.996 | 0.978 | 0.600 | 0.600 | 0.600 | N/A | N/A | N/A |
| No Finding | 0.821 | 0.993 | 0.700 | N/A | N/A | N/A | N/A | N/A | N/A |
Performance of the automated radiology report labeler (pipeline output from NegEx and the BRAX labeler) on a subset of 1,000 reports, compared to the labeling agreement between two board-certified radiologists, on the tasks of mention extraction, negation detection and uncertainty detection, as measured by F1-score, Recall and Precision.
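The F1-score is the harmonic mean of precision and recall, so the F1 columns can be checked against the other two. A small sketch, with the helper name `f1_score` chosen here for illustration:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the "Mention" columns of the table.
print(round(f1_score(0.964, 0.900), 3))  # Atelectasis
print(round(f1_score(0.889, 0.667), 3))  # Pneumonia
```

These two rows reproduce the tabulated values of 0.931 and 0.762. Because the table reports precision and recall already rounded to three decimals, recomputing F1 from them can differ from the published figure in the last digit for some rows.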
| Field | Value |
|---|---|
| Measurement(s) | Chest Radiography |
| Technology Type(s) | natural language processing |
| Factor Type(s) | radiological findings/labels |
| Sample Characteristic - Organism | Homo sapiens |
| Sample Characteristic - Environment | chest organ |
| Sample Characteristic - Location | Brazil |