Literature DB >> 32904141

A dataset of chest X-ray reports annotated with Spatial Role Labeling annotations.

Surabhi Datta1, Kirk Roberts1.   

Abstract

In this paper, we present a dataset consisting of 2000 chest X-ray reports (available as part of the Open-i image search platform) annotated with spatial information. The annotation is based on Spatial Role Labeling. The information includes annotating a radiographic finding, its associated anatomical location, any potential diagnosis described in connection to the spatial relation (between finding and location), and any hedging phrase used to describe the certainty level of a finding/diagnosis. All these annotations are identified with reference to a spatial expression (or Spatial Indicator) that triggers a spatial relation in a sentence. The spatial roles used to encode the spatial information are Trajector, Landmark, Diagnosis, and Hedge. In total, there are 1962 Spatial Indicators (mainly prepositions). There are 2293 Trajectors, 2167 Landmarks, 455 Diagnosis, and 388 Hedges in the dataset. This annotated dataset can be used for developing automatic approaches targeted toward spatial information extraction from radiology reports which then can be applied to numerous clinical applications. We utilize this dataset to develop deep learning-based methods for automatically extracting the Spatial Indicators as well as the associated spatial roles [1].
© 2020 The Authors.

Entities:  

Keywords:  Chest radiology; Information extraction; Natural language processing; Radiology report; Spatial Role Labeling; Spatial relations

Year:  2020        PMID: 32904141      PMCID: PMC7451761          DOI: 10.1016/j.dib.2020.106056

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications Table Value of the Data The spatial information annotated in this dataset captures clinically significant information of chest X-ray imaging results. This annotation schema proposes a way to encode radiological spatial knowledge from report text. The annotated information includes the main radiographic finding detected, the anatomical location where the finding has been described to be present, any diagnosis associated with the finding-location pair, as well as any hedging phrase used to suggest the diagnosis or the finding. The dataset can be used to develop automatic NLP systems for extracting spatial information from radiology reports. These systems have the potential to facilitate various clinical applications. A few of these include easy visualization of contextual information associated with abnormal radiographic findings from a spatial perspective, automatic tracking of findings, and automatic annotation of corresponding radiographic images with spatial and diagnosis information. The models developed on this dataset could be further leveraged by applying them on other types of radiology reports belonging to different imaging modality such as chest Computed Tomography (CT) scans and Magnetic Resonance Imaging (MRI) as the annotated information types are common across different modalities and/or anatomies.

Data description

This 2000 chest X-ray reports dataset is a subset of 3996 reports collected from the Indiana Network for Patient Care [2]. Specifically, the 2000 report subset is composed from the set of 2470 non-normal reports as judged by two human annotators. The annotation schema is based on Spatial Role Labeling (SpRL) [3], [4] and has been extended to encode information in radiology context. This includes identifying a Spatial Indicator in a sentence and consequently annotating the main radiographic finding and anatomical location that are connected by this Spatial Indicator. Additionally, the spatial annotations include any potential diagnosis identified in a sentence with reference to the spatial relation between a finding and a location. The annotations also include any uncertainty phrase or hedge used to describe a finding/diagnosis. These four information types denote the four spatial roles with respect to a Spatial Indicator in a sentence. The schema is referred to as Rad-SpRL. The dataset is included in XML format (available at https://doi.org/10.17632/yhb26hfz8n.1 in the Mendeley data repository and https://github.com/krobertslab/datasets/) and the relevant details are described in Table 1. A few details of the Spatial Indicators in the dataset are included in Table 2. In total, there are 29 unique spatial expressions. The most frequent phrases for each of the four spatial roles annotated are shown in Table 3. We also note the frequent descriptors used in describing roles like Trajector and Diagnosis. Note that ‘XXXX’ is used to denote any de-itentified term in the report text. For each of Diagnosis, Trajector, and Landmark, the most common associated other two spatial roles are demonstrated in Figs. 1–3. We provide a brief statistics on the terms that are annotated as two different spatial roles depending on the context in a sentence in Table 4. We also analyze the terms expressing Hedge role (illustrated in Table 5).
Table 1

Annotated dataset descriptions.

AttributeDescription
DocumentRepresents a chest X-ray report
TextRaw text of the report
AnnotationsContains the processed text and spatial annotations for a report
TokenContains start character and number of characters of a token
SentenceContains start token number and number of included tokens to identify a sentence
RadSpRLRelationIndicates the presence of a spatial relation. Includes the start token number and number of tokens of a spatial expression (Spatial Indicator) in a sentence, also contains all the associated spatial roles with respect to this Spatial Indicator

Spatial roles under RadSpRLRelation

TrajectorRadiological entity (usually a radiographic finding whose position is described
LandmarkAnatomical location of a Trajector
DiagnosisPotential diagnosis associated with a spatial relation
HedgeAny uncertainty phrase used to describe a finding or diagnosis
Table 2

Spatial indicator statistics.

ParameterFrequency
Total number of Spatial Indicators1962
Number of distinct Spatial Indicators29

Most frequent indicators

of765
in526
without176
with141
within102
Table 3

Most frequent terms for each spatial role.

Spatial RoleTermFrequency (descriptors contained in Term)
Trajectoropacity279 (nodular, streaky, interstitial, focal airspace, focal, airspace, vague, patchy, bibasilar, ill-defined, mild streaky, subtle increased, few small nodular, round, scattered, rounded nodular, abnormal, vague nodular, patchy airspace, bilateral, bandlike, vague patchy, dense, minimal, minimal streaky, streaky basilar, alveolar)
degenerative change205 (mild, minimal, diffuse, moderate, severe, multilevel, chronic, advanced)
pneumothorax63 (moderate right-sided, large)
pleural effusion63 (large, small bilateral, large right)
consolidation57 (focal, focal airspace, dense)

Total Distinct861

Landmarklung285 (also includes lungs)
thoracic spine146 (mid, lower)
spine111
left lung base43
thorax40

Total Distinct570

Diagnosisscarring53 (pleural, pleural-parenchymal, chronic)
atelectasis83 (subsegmental, focal, chronic subsegmental, foci of subsegmental, lingular)
infiltrate21 (focal)
granuloma15 (calcified, partially calcified)
emphysema11

Total Distinct224

hedgemay represent40
XXXX39
consistent with38 (focal)
XXXX represent34 (also includes XXXX represents, XXXX representing, XXXX representative of)
compatible with21
Total Distinct80
Fig. 1

Most common associated landmarks and trajectors for three frequent Diagnosis. [n LM] indicates that a particular diagnosis is connected to a total of n landmarks, while [n TR] indicates that a particular diagnosis is connected to a total of n trajectors.

Fig. 3

Most common associated diagnoses and trajectors for three frequent Landmark. [n DG] indicates that a particular landmark is connected to a total of n dagnoses, while [n TR] indicates that a particular landmark is connected to a total of n trajectors.

Table 4

Overlapping terms between two spatial roles.

ParameterFrequency
Distinct overlapping terms (Trajector and Diagnosis)45
Distinct overlapping terms (Trajector and Landmark)73
Same terms with equal frequency (Trajector and Landmark)38

Terms appearing more asTrajectorand less asDiagnosis

TermFrequency

infiltrate/focal infiltrateTrajector:40 Diagnosis:19
calcified granuloma/calcified granulomasTrajector:37 diagnosis:6
focal airspace diseaseTrajector:32 Diagnosis:2
bronchovascular crowdingTrajector:22 Diagnosis:2
fracture/fracturesTrajector:18 Diagnosis:3
noduleTrajector:18 Diagnosis:2

Terms appearing more asDiagnosisand less asTrajector

TermFrequency

scarringDiagnosis:44 Trajector:21
atelectasisDiagnosis:43 Trajector:14
subsegmental atelectasisDiagnosis:23 Trajector:7
emphysemaDiagnosis:9 Trajector:4

Terms appearing more asLandmarkand less asTrajector

TermFrequency

right upper lobeLandmark:35 Trajector:2
rightLandmark:27 Trajector:2
right baseLandmark:15 Trajector:2

Common terms appearing both asTrajectorandLandmarkwith equal frequency

TermFrequency

osseous structures31
region5
peripheral aspect4
Table 5

Analysis of Hedge terms.

DescriptionTerms
Frequent Hedges that appear without Diagnosispossible/possibly, or, probable/probably, appears to be, versus
Frequent Hedges that appear with Diagnosismay represent, consistent with, XXXX, compatible with, XXXX representing
Hedges that only appear when there is no Diagnosisapparent, questionable, and/or, probable, suggestion of, difficult to exclude, or XXXX, or, approximately, apparently, cannot be excluded (11)
Example Hedges that only appear oncemay be partially due to, favored to represent, cannot be excluded, raise concern for, difficult to exclude
Annotated dataset descriptions. Spatial indicator statistics. Most frequent terms for each spatial role. Most common associated landmarks and trajectors for three frequent Diagnosis. [n LM] indicates that a particular diagnosis is connected to a total of n landmarks, while [n TR] indicates that a particular diagnosis is connected to a total of n trajectors. Most common associated diagnoses and landmarks for three frequent Trajector. [n DG] indicates that a particular trajector is connected to a total of n dagnoses, while [n LM] indicates that a particular trajector is connected to a total of n landmarks. Most common associated diagnoses and trajectors for three frequent Landmark. [n DG] indicates that a particular landmark is connected to a total of n dagnoses, while [n TR] indicates that a particular landmark is connected to a total of n trajectors. Overlapping terms between two spatial roles. Analysis of Hedge terms.

Experimental design, materials and methods

In this dataset, we attempt to widen the scope of clinically significant information types to be extracted from chest X-ray reports and additionally aim to relate all the information in context to a spatial relation between a finding and a location. This provides more contextual information about a radiographic finding. Many of the previous works on radiology information extraction mainly focused on extracting radiological entities (findings, diagnoses, etc.) separately without establishing any relation among these entities [5], [6], [8], [7]. We further analyze the variations of Spatial Indicators in the dataset. Besides the five most frequent ones mentioned in Table 2, the other spatial prepositions include – ‘at’, ‘over’, ‘on’, ‘throughout’, ‘under’, ‘along’, ‘near’, ‘to’, ‘through’, ‘between’, ‘adjacent’, ‘beneath’, ‘from’, ‘into’, ‘below’, ‘above’, ‘around’, ‘towards’, ‘about’, ‘behind’. This dataset also includes four more verbal spatial expressions – ‘overlie’, ‘overlies’, ‘overlying’, and ‘involving’. However, these four expressions occur very infrequently and together account for 30 out of 1962 Spatial Indicators. Also, note that the indicator ‘without’ denotes a negated spatial relation and is oftentimes present as part of the common negated phrase used in radiology reports – ‘without evidence of’. We inspect the dataset to analyze the most frequent terms annotated for each spatial role and observe that the top five frequent Trajectors are different from the five most frequent Diagnosis terms (as illustrated in Table 3). There are more distinct Trajectors and Landmarks than Diagnosis and Hedge terms. We also analyze, for each spatial role, the most frequently associated other roles (Figs. 1–3). For this, we consider three terms among the five most frequent terms (shown in Table 3) for each role. It is interesting to observe that no diagnoses are associated with three frequent radiographic findings (Trajectors) – ‘pneumothorax’, ‘pleural effusion’, and ‘consolidation’ (as shown in Fig. 2).
Fig. 2

Most common associated diagnoses and landmarks for three frequent Trajector. [n DG] indicates that a particular trajector is connected to a total of n dagnoses, while [n LM] indicates that a particular trajector is connected to a total of n landmarks.

In the process of annotating the reports, we noticed that some terms take different spatial roles depending on the context. We then inspect this overlap between two spatial roles in our annotated dataset. Specifically, the overlapping characteristics between Trajector and Diagnosis as well as between Trajector and Landmark are shown in Table 4. There are more distinct terms that have overlap between Trajector and Landmark than between Trajector and Diagnosis. Around 52% of the terms that act as both Trajector and Landmark an equal number of times oftentimes have the same text span and are related to anatomical structures or portions. Consider the following example: Visualized Here, ‘osseous structures’ act as both Trajector and Landmark. It takes the role of a Trajector when considered in relation to the indicator ‘of’ and acts as a Landmark when considered in relation to ‘without’. Additionally, we note that the terms that are annotated as both Trajector and Landmark appear more often as a Landmark than a Trajector (as shown in Table 4). There are certain findings like ‘pleural thickening/thickening’ which appear both as Trajector and Diagnosis with the same frequency. Since the hedging terms are used both in context to describing a radiographic finding as well as a diagnosis, we intend to investigate their distribution in both the cases. We find that certain phrases such as ‘probable’ and ‘or’ are more representative of describing the findings rather than diagnoses. We also witness a variety of hedging expressions that occur rarely in the dataset. Besides the ones presented in Table 5, few other rare hedging phrases include – ‘possibly related to’, ‘is a consideration’, ‘favored as’, ‘could be secondary to’, and ‘cannot be ruled out’.

Ethics statement

This work includes chest X-ray reports of patients collected from the Indiana Network for Patient Care in a previous study [2]. The reports are de-identified and do not involve experimentation with human subjects.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.
SubjectHealth Informatics
Specific subject areaSpatial information extraction from chest X-ray reports based on Spatial Role Labeling schema for spatial language understanding in radiology reports
Type of dataTable, Figure, Text, Annotated data in XML format
How data were acquiredA subset of 2000 chest X-ray reports were used from a pool of 3996 de-identified reports collected from the Indiana Network for Patient Care (available as one of the Open-i datasets released by the National Library of Medicine.)
Data formatRaw, Processed
Parameters for data collection2000 chest X-ray reports that are annotated with important spatial information were selected from the set of 2470 non-normal reports in the Open-i chest X-ray report dataset as adjudicated by two annotators.
Description of data collectionThese 2000 reports were annotated with four spatial roles using the Brat toolkit. First, the spatial indicators (usually the spatial prepositions) triggering any spatial relation between a radiographic finding and an anatomical location were annotated for each sentence. Then, four spatial roles–the radiographic finding, its corresponding location, hedging phrase, and any potential diganosis were annotated with respect to a specific spatial indicator.
Data source locationPrimary data source: Open-i chest X-ray dataset (https://openi.nlm.nih.gov/). Associated research paper: “Preparing a collection of radiology examinations for distribution and retrieval”https://doi.org/10.1093/jamia/ocv080
Data accessibilityRepository name: Mendeley data repository Data identification number: 10.17632/yhb26hfz8n.1 Direct URL to data: https://doi.org/10.17632/yhb26hfz8n.1, https://github.com/krobertslab/datasets/tree/master/rad-sprl
Related research articleS. Datta, Y. Si, L. Rodriguez, S. E. Shooshan, D. Demner-Fushman, K. Roberts, Understanding spatial language in radiology: Representation framework, annotation, and spatial relation extraction from chest X-ray reports using deep learning, Journal of Biomedical Informatics 108 (2020) 103473. doi:10.1016/j.jbi.2020.103473.
  4 in total

1.  Characterization of Change and Significance for Clinical Findings in Radiology Reports Through Natural Language Processing.

Authors:  Saeed Hassanpour; Graham Bay; Curtis P Langlotz
Journal:  J Digit Imaging       Date:  2017-06       Impact factor: 4.056

2.  Automated Triaging of Adult Chest Radiographs with Deep Artificial Neural Networks.

Authors:  Mauro Annarumma; Samuel J Withey; Robert J Bakewell; Emanuele Pesce; Vicky Goh; Giovanni Montana
Journal:  Radiology       Date:  2019-01-22       Impact factor: 11.105

3.  Understanding spatial language in radiology: Representation framework, annotation, and spatial relation extraction from chest X-ray reports using deep learning.

Authors:  Surabhi Datta; Yuqi Si; Laritza Rodriguez; Sonya E Shooshan; Dina Demner-Fushman; Kirk Roberts
Journal:  J Biomed Inform       Date:  2020-06-18       Impact factor: 6.317

4.  Preparing a collection of radiology examinations for distribution and retrieval.

Authors:  Dina Demner-Fushman; Marc D Kohli; Marc B Rosenman; Sonya E Shooshan; Laritza Rodriguez; Sameer Antani; George R Thoma; Clement J McDonald
Journal:  J Am Med Inform Assoc       Date:  2015-07-01       Impact factor: 4.497

  4 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.