Anne-Dominique Pham, Aurélie Névéol, Thomas Lavergne, Daisuke Yasunaga, Olivier Clément, Guy Meyer, Rémy Morello, Anita Burgun.
Abstract
BACKGROUND: Natural Language Processing (NLP) has been shown to be effective for analyzing the content of radiology reports and identifying diagnoses or patient characteristics. We evaluate the combination of NLP and machine learning to detect thromboembolic disease diagnoses and clinically relevant incidental findings from angiography and venography reports written in French. We model thromboembolic diagnoses and incidental findings as a set of concepts, modalities and relations between concepts that can be used as features by a supervised machine learning algorithm. A corpus of 573 radiology reports was de-identified and manually annotated by a physician, with the support of NLP tools, for relevant concepts, modalities and relations. A machine learning classifier was trained on the dataset, which a physician had interpreted for diagnosis of deep-vein thrombosis, pulmonary embolism and clinically relevant incidental findings. Decision models accounted for the imbalanced nature of the data and exploited the structure of the reports.
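To illustrate how concepts and modalities might be used as features alongside plain text, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It is not the authors' implementation. The feature naming scheme (`word=…`, `concept=…`, `modality=…`), the `negation` modality value, the add-one smoothing and the toy French snippets are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def featurize(tokens, annotations):
    """Combine plain-text tokens with annotation-derived features.

    `annotations` is a hypothetical list of (kind, value) pairs, e.g.
    ("concept", "ThromboPat") or ("modality", "negation").
    """
    features = [f"word={t.lower()}" for t in tokens]
    features += [f"{kind}={value}" for kind, value in annotations]
    return features


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(docs, labels):
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in feats:
                if f in self.vocab:  # ignore features never seen in training
                    score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, invented examples (not taken from the corpus).
NEGATED = [("concept", "ThromboPat"), ("modality", "negation")]
AFFIRMED = [("concept", "ThromboPat")]
train = [
    (featurize(["embolie", "pulmonaire", "bilatérale"], AFFIRMED), "positive"),
    (featurize(["pas", "d'embolie", "pulmonaire"], NEGATED), "negative"),
    (featurize(["thrombose", "veineuse", "profonde"], AFFIRMED), "positive"),
    (featurize(["absence", "de", "thrombose"], NEGATED), "negative"),
]
model = NaiveBayes().fit([f for f, _ in train], [y for _, y in train])
prediction = model.predict(featurize(["absence", "d'embolie"], NEGATED))
```

In this toy setting the `modality=negation` feature is what lets the classifier separate negated from affirmed mentions, since both kinds of report share the same pathology vocabulary.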
Year: 2014 PMID: 25099227 PMCID: PMC4133634 DOI: 10.1186/1471-2105-15-266
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Distribution of thromboembolic diagnoses and incidental findings in the 573 radiology reports
| Diagnoses | n (Total N = 573) | (% of total) |
|---|---|---|
| Positive CTA with positive CTV | 74 | (12.9%) |
| Positive CTA with negative CTV | 52 | (9.1%) |
| Negative CTA with positive CTV | 30 | (5.2%) |
| Negative CTA with negative CTV | 417 | (72.8%) |
| Clinically relevant incidental findings | 93 | (16.2%) |
CTA = Computed tomography angiography.
CTV = Computed tomography venography.
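The percentages in the table above follow directly from the counts and the corpus size. The short check below recomputes them; the helper name `percent_of_total` is introduced here for illustration only.

```python
TOTAL_REPORTS = 573

# Counts copied from the table above.
cta_ctv_counts = {
    "Positive CTA with positive CTV": 74,
    "Positive CTA with negative CTV": 52,
    "Negative CTA with positive CTV": 30,
    "Negative CTA with negative CTV": 417,
}


def percent_of_total(count, total=TOTAL_REPORTS):
    """Share of `count` in `total`, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


percentages = {label: percent_of_total(n) for label, n in cta_ctv_counts.items()}

# The four CTA/CTV combinations are mutually exclusive and cover every report.
covers_all_reports = sum(cta_ctv_counts.values()) == TOTAL_REPORTS
```

The four CTA/CTV rows sum to 573, so the final row of the table (93 reports, 16.2%) overlaps with them and is not a fifth exclusive category.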
Figure 1. The annotation scheme used in our project.
Figure 2. Sample annotated text using Brat. Top: Pre-annotated text using the automatic lexicon matcher. Bottom: Final annotations produced by the physician after revising the pre-annotations.
Figure 3. Test and training set for CTA/CTV classification.
Distribution of annotations in the 573 radiology reports
| Concepts | N | Relations | N | Modalities | N |
|---|---|---|---|---|---|
| | 9702 | | 2507 | | 1739 |
| | 3116 | | 42 | | 1653 |
| | 1582 | | 293 | | |
| | 1478 | | 123 | | |
|
| 3 |
| 118 |
*ThromboPat for thrombopathology concepts.
*K for clinically relevant findings.
*PP for post-partum.
Inter-annotator agreement on a sample of 10 radiology reports (F-measure)
| Category | Exact match | Inexact match |
|---|---|---|
| | 77.3 | 87.8 |
| | 73.5 | 90.9 |
| | 95.7 | 99.1 |
| | 89.3 | 89.3 |
| | 78.4 | 78.4 |
| | 62.4 | 71.8 |
| | 80 | 80 |
| | 50 | 66.7 |
IAA = inter-annotator agreement.
*ThromboPat for thrombopathology concepts.
*K for clinically relevant findings.
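The difference between exact and inexact agreement can be sketched as follows. This is an assumption-laden illustration rather than the authors' scoring procedure: annotations are represented as hypothetical `(start, end, label)` character spans, exact match requires identical boundaries and label, and inexact match is taken to mean any overlap between spans carrying the same label.

```python
def span_agreement(reference, candidate, exact=True):
    """F-measure between two annotators' span sets.

    Each annotation is a (start, end, label) tuple with an exclusive end offset.
    """

    def matches(a, b):
        if a[2] != b[2]:
            return False
        if exact:
            return a[0] == b[0] and a[1] == b[1]
        return a[0] < b[1] and b[0] < a[1]  # spans overlap

    matched_candidate = sum(any(matches(c, r) for r in reference) for c in candidate)
    matched_reference = sum(any(matches(r, c) for c in candidate) for r in reference)
    precision = matched_candidate / len(candidate) if candidate else 0.0
    recall = matched_reference / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented spans for two annotators on the same text.
annotator_a = [(0, 7, "ThromboPat"), (20, 30, "K"), (40, 45, "ThromboPat")]
annotator_b = [(0, 7, "ThromboPat"), (22, 30, "K"), (50, 55, "K")]

exact_f = span_agreement(annotator_a, annotator_b, exact=True)
inexact_f = span_agreement(annotator_a, annotator_b, exact=False)
```

Here the second pair of spans differs only in its start offset, so it counts as agreement under inexact matching but not under exact matching. Multiplying the result by 100 gives values on the same scale as the table.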
Performances of the Naïve Bayes versus the maximum entropy classifiers on the CTA/CTV and incidentaloma test sets
| Features | Target | Precision (NB) | Precision (ME) | Recall (NB) | Recall (ME) | F-measure (NB) | F-measure (ME) |
|---|---|---|---|---|---|---|---|
| Baseline (plain text) | PE | 0.78 | 0.88 | 0.96 | 0.95 | 0.86 | 0.91 |
| | DVT | 0.42 | 0.84 | 0.89 | 0.89 | 0.57 | 0.86 |
| | PE and/or DVT | 0.77 | 0.90 | 0.85 | 0.96 | 0.81 | 0.93 |
| | Incidentaloma | 0.32 | 0.43 | 0.29 | 0.32 | 0.30 | 0.37 |
| Baseline + annotations | PE | 0.99 | 1.00 | 0.97 | 0.95 | 0.98 | 0.98 |
| | DVT | 0.73 | 1.00 | 0.89 | 1.00 | 0.80 | 1.00 |
| | PE and/or DVT | 0.92 | 1.00 | 0.85 | 0.96 | 0.89 | 0.98 |
| | Incidentaloma | 0.67 | NC | 0.50 | NC | 0.57 | NC |
| Baseline + annotations + section typing | Incidentaloma | 0.46 | NC | 0.63 | NC | 0.53 | NC |
| Critical sections* + annotations | Incidentaloma | 0.60 | 0.76 | 0.75 | 0.81 | 0.67 | 0.80 |
NB: Naïve Bayes; ME: Maximum Entropy.
PE: Pulmonary Embolism; DVT: Deep Vein Thrombosis.
*results and conclusion sections.
NC: Not Calculated.
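Assuming the F-measure reported above is the balanced F1 score (the harmonic mean of precision and recall), the values can be recomputed from the precision and recall columns. The sketch below does this for the PE and DVT rows of the plain-text baseline; the function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from the baseline (plain text) rows.
baseline_rows = {
    ("PE", "NB"): (0.78, 0.96, 0.86),
    ("PE", "ME"): (0.88, 0.95, 0.91),
    ("DVT", "NB"): (0.42, 0.89, 0.57),
    ("DVT", "ME"): (0.84, 0.89, 0.86),
}

recomputed = {
    key: round(f_measure(p, r), 2) for key, (p, r, _) in baseline_rows.items()
}
reported = {key: f for key, (_, _, f) in baseline_rows.items()}
```

For these four rows the recomputed values agree with the reported ones. Because the published precision and recall are themselves rounded to two decimals, other rows may differ from a recomputation in the last digit.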