Emma M Davidson, Michael T C Poon, Beatrice Alex, William Whiteley, Arlene Casey, Andreas Grivas, Daniel Duma, Hang Dong, Víctor Suárez-Paniagua, Claire Grover, Richard Tobin, Heather Whalley, Honghan Wu.
Abstract
BACKGROUND: Automated language analysis of radiology reports using natural language processing (NLP) can provide valuable information on patients' health and disease. With its rapid development, NLP studies should have transparent methodology to allow comparison of approaches and reproducibility. This systematic review aims to summarise the characteristics and reporting quality of studies applying NLP to radiology reports.
Keywords: Natural language processing; Radiology reports; Systematic review
Year: 2021 PMID: 34600486 PMCID: PMC8487512 DOI: 10.1186/s12880-021-00671-8
Source DB: PubMed Journal: BMC Med Imaging ISSN: 1471-2342 Impact factor: 1.930
Clinical application areas and their definitions
| Application area (number of papers) | Description | Subcategory | Description of included papers | Technical task (number of papers) | Anatomical scan region [paper numbers; a full list of the papers by number is included in the Additional file] |
|---|---|---|---|---|---|
| Surveillance (45) | Using imaging reports for surveillance of disease at a population-health or individual level, either longitudinally or by generating alerts | Disease surveillance | Monitoring occurrence of infectious disease; monitoring non-communicable disease patterns, including alerts for conditions | | Thorax [25, 27, 28]; Abdomen [29]; Mixed [26, 30]; Cerebrovascular [19] |
| | | Prioritising reports | Generating alerts for reports requiring more urgent action | | Other [31]; Mixed [32]; Unspecified [33, 34] |
| | | Incidental findings | Generating alerts for incidental findings | | Thorax [20, 21]; Mixed [23]; Cerebrovascular [24] |
| | | Patient surveillance | Pairing measurements and linking reports to investigate/monitor conditions over time, e.g. worsening prognosis or response to treatment | | Thorax [35]; Abdomen [6, 8, 16–18]; Mixed [1, 2, 9, 36, 37, 131]; Breast [3, 10, 11, 13–15]; Unspecified [7, 38, 39] |
| | | | | Classification (5) | Thorax [4, 40]; Breast [12]; Mixed [5, 41] |
| | | Follow-up | Detecting follow-up recommendations, creating alerts, and linking reports to check whether recommendations were carried out | Information Extraction (3) | Abdomen [42]; Unspecified [43]; Mixed [44] |
| | | | | Classification (1) | Unspecified [45] |
| Disease information and classification (46) | Using imaging reports to identify information that may also be aggregated according to classification systems (no specific clinical purpose specified) | N/A | Extracting information about a disease/condition/function (e.g. LVEF) with no additional processing required; staging, e.g. using BI-RADS or Lung-RADS; identifying sub-types of disease; classification of fractures; predicting ICD codes; ICD codes used for ground truth | Information Extraction (14) | Cerebrovascular [47, 54, 55]; Breast [59, 65]; Abdomen [66]; Thorax [63, 67–70]; Mixed [71, 72]; Unspecified [73] |
| | | | | Classification (32) | Cerebrovascular [46, 48–53, 74–76]; Abdomen [77]; Breast [56–58]; Extremities [78–84]; Mixed [85–89]; Spine [22, 90]; Thorax [60–62, 64] |
| Language discovery and knowledge structure (27) | Investigating the structure of language in imaging reports and the ways in which it may be optimised to facilitate knowledge and decision support, communication (both internally between clinicians and outward-facing communication with patients/the public), and improvements to NLP applications | Knowledge support for patients/public | Improving readability of reports and communications for the public/patients | | Mixed [91, 92, 99] |
| | | Knowledge and decision support for clinicians | Providing information for clinician use (including using ontologies and lexicons); finding relevant reports; improving reading efficiency; supporting radiological and clinical decision making; supporting clinician education | | Thorax [93]; Breast [94]; Mixed [100–104]; Cerebrovascular [105]; Unspecified [106, 107] |
| | | Variability, complexity and structure of language for NLP purposes | Investigating the variability and complexity of language, including free-text and structured reports; improving the structure of language for NLP, e.g. normalising phrases to support classification; normalising disease-specific phrases | | Thorax [96, 108, 115–117]; Mixed [97, 109, 110]; Spine [113]; Breast [114]; Unspecified [95, 98, 111, 112] |
| Quality and compliance (20) | Using imaging reports to assess the quality and safety of radiology practice, clinical practice, and the efficiency of healthcare services | Assessing imaging practices | Whether imaging practices adhere to guidance, including indications and protocol selection; impact of guideline changes on imaging practice; assessing imaging utilisation and yield | Information Extraction (4) | Abdomen [121]; Thorax [122]; Mixed [119, 123] |
| | | | | Classification (11) | Mixed [118, 120, 125, 135]; Cerebrovascular [124, 126, 127]; Thorax [128, 129]; Abdomen [130]; Extremities [136] |
| | | Audit | Classification used for quality improvement in radiology and clinical practice; identifying reports for auditing; identifying and fixing errors in reports (e.g. gender/laterality) | Information Extraction (2) | Mixed [134]; Cerebrovascular [133] |
| | | | | Classification (3) | Thorax [132]; Breast [137]; Extremities [138] |
| Research (16) | Using imaging reports to create patient cohorts for research purposes | Cohort | Identifying cohorts for research purposes: patients with specific medical conditions (sometimes in specified anatomical regions), with particular radiological findings, or who have had certain healthcare interactions | Classification (7) | Abdomen [140–142]; Cerebrovascular [144, 152]; Mixed [151]; Spine [150] |
| | | | | Information Extraction (3) | Unspecified [153]; Cerebrovascular [145]; Spine [143] |
| | | Epidemiology | Identifying research cohorts as above, in papers that go on to perform epidemiological analyses and present the results | Information Extraction (4) | Unspecified [154]; Cerebrovascular [148]; Abdomen [139]; Thorax [147] |
| | | | | Classification (2) | Mixed [146]; Thorax [149] |
| Technical NLP (10) | Papers that do not fit a specific category, often with a primarily technical aim | N/A | Studies encompassed a variety of purposes, such as negation detection, spelling correction, fact checking, methods for sample selection, and crowd-sourced annotation | Information Extraction (6) | Mixed [155, 157, 159]; Thorax [160]; Unspecified [161, 162] |
| | | | | Classification (4) | Cerebrovascular [158]; Mixed [163]; Thorax [156]; Unspecified [164] |
Items used to assess the quality of reporting criteria in the current review
| Quality heading | Quality criteria | Definition |
|---|---|---|
| Data source | (1) Sampling | Reported details of the sampling strategy for radiology reports, including whether they were from consecutive patients |
| | (2) Consistent imaging acquisition | Reported whether radiology reports were derived from images taken on one imaging machine or more and, if more, whether those machines were of comparable specification |
| Dataset criteria | (3) Dataset size | Reported a dataset size of > 200 |
| | (4) Training dataset | Reported the training dataset size: the part of the initial dataset used to develop an NLP algorithm |
| | (5) Test dataset | Reported the test dataset size: the part of the initial dataset used to evaluate an NLP algorithm |
| | (6) Validation dataset | Reported the validation dataset size: a separate dataset used to evaluate the performance of an NLP algorithm in a clinical setting (may be internal or external to the initial dataset) |
| Ground truth criteria | (7) Annotated dataset | Reported the annotated dataset size: data marked up by humans to provide the ground truth |
| | (8) Domain expert for annotation | Reported use of a domain expert for annotation, i.e. annotation carried out by a radiologist or specialist clinician |
| | (9) Number of annotators | Reported the number of annotators |
| | (10) Inter-annotator agreement | Reported the agreement between annotators (if more than one annotator was used) |
| Outcome criteria | (11) Precision | Reported precision (positive predictive value) |
| | (12) Recall | Reported recall (sensitivity) |
| Reproducibility criteria | (13) External validation | Reported whether the NLP algorithm was tested on external data from another setting (a separate healthcare system, hospital or institution) |
| | (14) Availability of data | Reported whether the dataset is available for use (preferably with a link provided in the paper) |
| | (15) Availability of NLP code | Reported whether the NLP code is available for use (preferably with a link provided in the paper) |
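The outcome criteria above (precision as positive predictive value, recall as sensitivity) and inter-annotator agreement can be made concrete with a small worked example. The sketch below is illustrative only and uses hypothetical binary labels (1 = finding present in a report, 0 = absent), not data from any of the reviewed studies; the F1 score is the harmonic mean of precision and recall, and Cohen's kappa is one common chance-corrected agreement statistic (the review's criterion (10) does not prescribe a specific measure).

```python
# Illustration of the outcome criteria: precision (PPV), recall (sensitivity),
# the derived F1 score, and inter-annotator agreement via Cohen's kappa.
from collections import Counter

def precision_recall_f1(gold, predicted):
    """Precision, recall and F1 for binary labels against a human ground truth."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def cohens_kappa(ann1, ann2):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(ann1)
    observed = sum(1 for a, b in zip(ann1, ann2) if a == b) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum(c1[k] * c2[k] for k in set(ann1) | set(ann2)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical example: human annotation (gold) vs NLP system output
gold      = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# → precision=0.75 recall=0.75 F1=0.75
```

A study meeting criteria (11) and (12) would report the first two of these numbers on its test set; criterion (10) corresponds to reporting a statistic such as the kappa value between annotators.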
Fig. 1 PRISMA flowchart outlining the study selection process [13]
Fig. 2 Distribution of studies by publication year and (a) clinical application, (b) NLP methods
Fig. 3 Quality of reporting in (a) individual studies and (b) between 2015 and 2019. Legend: (a) Studies are arranged by the total number of qualities reported in the study, from left to right in descending order. (b) Numbers indicate the percentage of studies in each year of publication reporting the corresponding quality
Fig. 4 Precision, recall and F1 score by quality of reporting and clinical application category. Legend: NLP system performance reported as precision, recall and F1 score from included studies. Size of the bubbles represents the relative sizes of corpora in each graph. (a) Studies were categorised into high (> 5 qualities) and low (≤ 5 qualities) reporting quality, using the median number of qualities reported as the cut-off point. Reporting of F1 score was not a quality criterion. (b) Performance stratified by clinical application