A. W. Olthof, P. M. A. van Ooijen, L. J. Cornelissen.
Abstract
In radiology, natural language processing (NLP) allows the extraction of valuable information from radiology reports. It can be used for various downstream tasks such as quality improvement, epidemiological research, and monitoring guideline adherence. Class imbalance, variation in dataset size, variation in report complexity, and algorithm type all influence NLP performance but have not yet been systematically and interrelatedly evaluated. In this study, we investigate the influence of these factors on the performance of four types of deep learning-based NLP models: a fully connected neural network (Dense), a long short-term memory recurrent neural network (LSTM), a convolutional neural network (CNN), and a Bidirectional Encoder Representations from Transformers (BERT) model. Two datasets consisting of radiologist-annotated reports of trauma radiographs (n = 2469) and of chest radiographs and computed tomography (CT) studies (n = 2255) were split into training sets (80%) and test sets (20%). The training data were used to train all four model types in 84 experiments (Fracture-data) and 45 experiments (Chest-data) with variation in size and prevalence. Performance was evaluated on sensitivity, specificity, positive predictive value, negative predictive value, area under the curve, and F score. All four model architectures demonstrated high performance on the NLP of radiology reports, with metrics reaching values above 0.90. The BERT algorithm outperformed CNN, LSTM, and Dense because of its stable results despite variation in training size and prevalence. Awareness of variation in prevalence is warranted because it impacts sensitivity and specificity in opposite directions.
Keywords: Informatics; Machine learning; Natural language processing; Radiology
Year: 2021 PMID: 34480231 PMCID: PMC8416876 DOI: 10.1007/s10916-021-01761-4
Source DB: PubMed Journal: J Med Syst ISSN: 0148-5598 Impact factor: 4.460
Fig. 1 Flowchart of data processing, training, and testing. + and – refer to cases of the positive and negative classes. The input for the variable training sets is all combinations of positive and negative cases with a step size of 100. For the Fracture-data, the positive cases ranged from 100 to 700 and the negative cases from 100 to 1200. For the Chest-data, the positive cases ranged from 100 to 283 and the negative cases from 100 to 1500
Model characteristics
| Architecture | Unique characteristics and motivation |
|---|---|
| ANN / Dense | • Artificial neural network • No specific context awareness • Baseline model for comparative purposes |
| CNN | • Convolutional neural network • Well known from image classification • A sliding window (or filter, or kernel) assesses the context of the data. This window can be 1D (for sequential data like text), 2D (for images), or 3D (for 3D datasets or video) |
| LSTM | • Long short-term memory • Recurrent neural network (RNN) • Designed for sequence data like text • Feedback connections transfer information from the context |
| BERT | • Bidirectional Encoder Representations from Transformers • Pre-trained on massive text datasets • Fine-tuned for specific tasks • Attention mechanism lets words focus on each other |
Model hyperparameters for the ANN/Dense, CNN and LSTM models implemented with sequential layers in Keras (a) and for the BERT model implemented with simple transformers (b), and the hardware used for training (c)

(a) Keras models (ANN/Dense, CNN, LSTM)

| Parameter | Value |
|---|---|
| Vocabulary size | 2500 |
| Embedding dimension | 32 |
| Input length | 150 (Fracture-data), 250 (Chest-data) |
| Batch size | Default (32) |
| Loss function | Binary cross entropy |
| Weights for the loss | None |
| Weight regularization | None |
| Dropout | No dropout layers were applied in the final model |
| Optimizer | Adam (default parameters; learning rate 0.01; no learning rate schedule) |
| Epochs | 12 |

(b) BERT model

| Parameter | Value |
|---|---|
| Learning_rate | 4e-5 |
| Model_type | bert |
| Model_name | wietsedv/bert-base-dutch-cased |
| Num_train_epochs | 4 |
| Sliding_window | False |
| Train_batch_size | 8 |
| Use_cuda | False |
| Use_early_stopping | False |
| Weight | None |
| All other parameters | Left at default values |
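For reference, the BERT settings in (b) map onto a simpletransformers-style argument dictionary. The sketch below only builds that configuration; actually instantiating `ClassificationModel` (left commented out) would download the `wietsedv/bert-base-dutch-cased` checkpoint, and the argument names shown are our reading of the table rather than code from the paper.

```python
# Hedged sketch: table (b) expressed as simpletransformers-style model args.
# Only the configuration dictionary is built here; no model is downloaded.
bert_args = {
    "learning_rate": 4e-5,        # Learning_rate
    "num_train_epochs": 4,        # Num_train_epochs
    "sliding_window": False,      # Sliding_window
    "train_batch_size": 8,        # Train_batch_size
    "use_early_stopping": False,  # Use_early_stopping
    # all other parameters left at their defaults
}

# from simpletransformers.classification import ClassificationModel
# model = ClassificationModel(
#     "bert", "wietsedv/bert-base-dutch-cased",
#     args=bert_args, use_cuda=False,
# )
```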

(c) Hardware

| Component | Specification |
|---|---|
| Processor | Intel Core i7, 2.20 GHz |
| RAM | 16 GB |
| GPU | NVIDIA Geforce GTX 1050, 4 GB |
Fig. 2 Stacked histogram demonstrating report size and binary distribution of (a) Fracture-data (1 = fracture present, 0 = fracture absent) and (b) Chest-data (1 = infiltrate present, 0 = infiltrate absent)
Fig. 3 Scatterplot of model performance metrics (vertical axis) and training dataset size (horizontal axis) for (a) Fracture-data and (b) Chest-data. The size of the dots corresponds to the training dataset prevalence
Fig. 4 Scatterplot of model performance metrics (vertical axis) and prevalence (horizontal axis) for (a) Fracture-data and (b) Chest-data. The size of the dots corresponds to the training dataset size
Pearson correlation coefficients between training set size, prevalence, and model performance metrics for (a) the Fracture-data and (b) the Chest-data

(a) Fracture-data

| Metric | Size | Prevalence |
|---|---|---|
| Sensitivity | 0.04 | 0.74 |
| Specificity | 0.36 | -0.75 |
| PPV | 0.39 | -0.80 |
| NPV | 0.05 | 0.74 |
| AUC | 0.36 | 0.16 |
| F1_score | 0.42 | -0.02 |

(b) Chest-data

| Metric | Size | Prevalence |
|---|---|---|
| Sensitivity | -0.27 | 0.61 |
| Specificity | 0.60 | -0.88 |
| PPV | 0.75 | -0.88 |
| NPV | -0.23 | 0.59 |
| AUC | 0.05 | 0.20 |
| F1_score | 0.28 | -0.11 |
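The coefficients above are plain Pearson correlations computed across the experiments, pairing training-set size or prevalence with a performance metric, one point per experiment. A self-contained sketch of that computation, on made-up illustrative values rather than the study's results:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical experiments: sensitivity rising with training-set prevalence,
# qualitatively mirroring the positive coefficients (0.74 / 0.61) reported above.
prevalence  = [0.10, 0.20, 0.30, 0.40]
sensitivity = [0.70, 0.80, 0.85, 0.95]
r = pearson_r(prevalence, sensitivity)  # strongly positive
```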
Fig. 5 Boxplot of performance metrics per model for (a) Fracture-data and (b) Chest-data
Comparison and t-test statistics of all performance metrics for all combinations of models trained on Fracture-data. Models shown in bold performed significantly better in the particular comparison

| Metric | Model 1 | Model 2 | t | p |
|---|---|---|---|---|
| Sensitivity | **BERT** | Dense | 4.375 | 0.000 |
| Sensitivity | BERT | LSTM | 0.991 | 0.323 |
| Sensitivity | BERT | CNN | 0.553 | 0.581 |
| Sensitivity | Dense | **LSTM** | -3.258 | 0.001 |
| Sensitivity | Dense | **CNN** | -3.511 | 0.001 |
| Sensitivity | LSTM | CNN | -0.358 | 0.721 |
| Specificity | **BERT** | Dense | 4.488 | 0.000 |
| Specificity | **BERT** | LSTM | 4.113 | 0.000 |
| Specificity | **BERT** | CNN | 3.447 | 0.001 |
| Specificity | Dense | **LSTM** | -1.978 | 0.050 |
| Specificity | Dense | CNN | -1.828 | 0.069 |
| Specificity | LSTM | CNN | 0.048 | 0.962 |
| PPV | **BERT** | Dense | 5.155 | 0.000 |
| PPV | **BERT** | LSTM | 4.465 | 0.000 |
| PPV | **BERT** | CNN | 3.552 | 0.000 |
| PPV | Dense | LSTM | -1.936 | 0.055 |
| PPV | Dense | **CNN** | -2.022 | 0.045 |
| PPV | LSTM | CNN | -0.258 | 0.797 |
| NPV | **BERT** | Dense | 4.795 | 0.000 |
| NPV | BERT | LSTM | 1.138 | 0.257 |
| NPV | BERT | CNN | 0.620 | 0.536 |
| NPV | Dense | **LSTM** | -3.513 | 0.001 |
| NPV | Dense | **CNN** | -3.846 | 0.000 |
| NPV | LSTM | CNN | -0.435 | 0.664 |
| AUC | **BERT** | Dense | 10.730 | 0.000 |
| AUC | **BERT** | LSTM | 4.541 | 0.000 |
| AUC | **BERT** | CNN | 3.618 | 0.000 |
| AUC | Dense | **LSTM** | -6.329 | 0.000 |
| AUC | Dense | **CNN** | -6.294 | 0.000 |
| AUC | LSTM | CNN | -0.385 | 0.701 |
| F1_score | **BERT** | Dense | 11.362 | 0.000 |
| F1_score | **BERT** | LSTM | 5.608 | 0.000 |
| F1_score | **BERT** | CNN | 4.387 | 0.000 |
| F1_score | Dense | **LSTM** | -6.205 | 0.000 |
| F1_score | Dense | **CNN** | -6.171 | 0.000 |
| F1_score | LSTM | CNN | -0.427 | 0.670 |
Comparison and t-test statistics of all performance metrics for all combinations of models trained on Chest-data. Models shown in bold performed significantly better in the particular comparison

| Metric | Model 1 | Model 2 | t | p |
|---|---|---|---|---|
| Sensitivity | **BERT** | CNN | 3.559 | 0.001 |
| Sensitivity | **BERT** | LSTM | 4.493 | 0.000 |
| Sensitivity | **BERT** | Dense | 8.416 | 0.000 |
| Sensitivity | CNN | LSTM | 0.901 | 0.370 |
| Sensitivity | **CNN** | Dense | 5.151 | 0.000 |
| Sensitivity | **LSTM** | Dense | 4.333 | 0.000 |
| Specificity | BERT | CNN | 0.054 | 0.957 |
| Specificity | BERT | LSTM | 0.174 | 0.862 |
| Specificity | BERT | Dense | 0.138 | 0.890 |
| Specificity | CNN | LSTM | 0.088 | 0.930 |
| Specificity | CNN | Dense | 0.082 | 0.935 |
| Specificity | LSTM | Dense | 0.015 | 0.988 |
| PPV | BERT | CNN | 0.051 | 0.959 |
| PPV | BERT | LSTM | 1.401 | 0.165 |
| PPV | BERT | Dense | 1.046 | 0.299 |
| PPV | CNN | LSTM | 1.329 | 0.187 |
| PPV | CNN | Dense | 0.990 | 0.325 |
| PPV | LSTM | Dense | -0.156 | 0.876 |
| NPV | **BERT** | CNN | 3.516 | 0.001 |
| NPV | **BERT** | LSTM | 4.821 | 0.000 |
| NPV | **BERT** | Dense | 9.064 | 0.000 |
| NPV | CNN | LSTM | 1.135 | 0.259 |
| NPV | **CNN** | Dense | 5.536 | 0.000 |
| NPV | **LSTM** | Dense | 4.561 | 0.000 |
| AUC | **BERT** | CNN | 4.269 | 0.000 |
| AUC | **BERT** | LSTM | 5.580 | 0.000 |
| AUC | **BERT** | Dense | 11.571 | 0.000 |
| AUC | CNN | LSTM | 1.243 | 0.217 |
| AUC | **CNN** | Dense | 7.349 | 0.000 |
| AUC | **LSTM** | Dense | 6.202 | 0.000 |
| F1_score | **BERT** | CNN | 3.485 | 0.001 |
| F1_score | **BERT** | LSTM | 4.690 | 0.000 |
| F1_score | **BERT** | Dense | 9.497 | 0.000 |
| F1_score | CNN | LSTM | 1.692 | 0.094 |
| F1_score | **CNN** | Dense | 7.140 | 0.000 |
| F1_score | **LSTM** | Dense | 5.334 | 0.000 |
Results summary of model performance

| Model | Summary |
|---|---|
| All | • All models perform better on the shorter radiology reports of the Fracture-data than on the more complex reports of the Chest-data. • The negative predictive value depends less on model type, training dataset size, and prevalence than the positive predictive value does |
| Dense | • The Dense baseline performs well on the Fracture-data but is more sensitive to variation in training dataset size and prevalence than the other models |
| LSTM / CNN | • The LSTM and CNN models demonstrate equal performance |
| BERT | • The BERT model yields stable results despite variation in training dataset size and prevalence. • The BERT model outperforms all other models, especially on the more complex reports of the Chest-data |
| Positive examples | Negative examples |
|---|---|
| Fracture | No fracture |
| Suspicion of a fracture | Fracture unlikely |
| Possible fracture | No traumatic abnormalities |
| Epiphysiolysis | Normal findings |
| Positive fat pad sign elbow | |
| Luxation | |
| Positive examples | Negative examples |
|---|---|
| Consolidation | No infiltrate |
| Infiltrate | No suspicion of an infiltrate |
| Possible infiltrate | Infiltrate unlikely |
| Maybe a small infiltrate | Normal findings |
| Suspicion of infiltrative abnormalities | |
| Layer (type) | Output Shape | Param # |
|---|---|---|
| Embedding (Embedding) | (None, 250, 32) | 80000 |
| flatten (Flatten) | (None, 8000) | 0 |
| Dense-1 (Dense) | (None, 32) | 256032 |
| Dense-2 (Dense) | (None, 16) | 528 |
| Dense-3 (Dense) | (None, 8) | 136 |
| Dense-4 (Dense) | (None, 1) | 9 |
| Total params: 336,705 | ||
| Trainable params: 336,705 | ||
| Non-trainable params: 0 |
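The Dense layer summary above maps onto a minimal Keras sketch; the activations, variable names, and `relu`/`sigmoid` choices are our assumptions, since the text does not state them. The per-layer comments give the parameter arithmetic.

```python
from tensorflow import keras
from tensorflow.keras import layers

dense_model = keras.Sequential([
    keras.Input(shape=(250,)),              # tokenized report, input length 250
    layers.Embedding(2500, 32),             # 2500 * 32 = 80,000 params
    layers.Flatten(),                       # 250 * 32 = 8,000 features
    layers.Dense(32, activation="relu"),    # 8,000 * 32 + 32 = 256,032 params
    layers.Dense(16, activation="relu"),    # 32 * 16 + 16 = 528
    layers.Dense(8, activation="relu"),     # 16 * 8 + 8 = 136
    layers.Dense(1, activation="sigmoid"),  # 8 + 1 = 9
])
dense_model.compile(optimizer="adam", loss="binary_crossentropy")
# Parameter count matches the summary: 336,705 trainable parameters.
```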
| Layer (type) | Output Shape | Param # |
|---|---|---|
| Embedding (Embedding) | (None, 250, 32) | 80000 |
| LSTM-1 (Bidirectional) | (None, 250, 64) | 16640 |
| LSTM-2 (Bidirectional) | (None, 64) | 24832 |
| Dense-1 (Dense) | (None, 24) | 1560 |
| Dense-2 (Dense) | (None, 1) | 25 |
| Total params: 123,057 | ||
| Trainable params: 123,057 | ||
| Non-trainable params: 0 |
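Likewise, the bidirectional LSTM summary maps onto the following sketch (activations assumed). Each direction of an LSTM layer with n units and m inputs has 4(mn + n² + n) weights, which reproduces the parameter column.

```python
from tensorflow import keras
from tensorflow.keras import layers

lstm_model = keras.Sequential([
    keras.Input(shape=(250,)),
    layers.Embedding(2500, 32),            # 80,000 params
    layers.Bidirectional(
        layers.LSTM(32, return_sequences=True)),  # 2 * 4*(32*32 + 32*32 + 32) = 16,640
    layers.Bidirectional(layers.LSTM(32)),        # 2 * 4*(64*32 + 32*32 + 32) = 24,832
    layers.Dense(24, activation="relu"),   # 64 * 24 + 24 = 1,560
    layers.Dense(1, activation="sigmoid"), # 24 + 1 = 25
])
# Total: 123,057 trainable parameters, as in the summary.
```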
| Layer (type) | Output Shape | Param # |
|---|---|---|
| Embedding (Embedding) | (None, 250, 32) | 80000 |
| Conv-1D-1 (Conv1D) | (None, 246, 64) | 10304 |
| Pooling-1 (AveragePooling1D) | (None, 123, 64) | 0 |
| Conv-1D-2 (Conv1D) | (None, 119, 64) | 20544 |
| Pooling-2 (GlobalAveragePool) | (None, 64) | 0 |
| Dense-1 (Dense) | (None, 24) | 1560 |
| Dense-2 (Dense) | (None, 1) | 25 |
| Total params: 112,433 | ||
| Trainable params: 112,433 | ||
| Non-trainable params: 0 | ||
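The 1D-CNN summary maps onto this sketch; the kernel size 5 is inferred from the output lengths (250 to 246, and 123 to 119), and the activations are again our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

cnn_model = keras.Sequential([
    keras.Input(shape=(250,)),
    layers.Embedding(2500, 32),               # 80,000 params
    layers.Conv1D(64, 5, activation="relu"),  # 5 * 32 * 64 + 64 = 10,304; length 250 -> 246
    layers.AveragePooling1D(pool_size=2),     # length 246 -> 123
    layers.Conv1D(64, 5, activation="relu"),  # 5 * 64 * 64 + 64 = 20,544; length 123 -> 119
    layers.GlobalAveragePooling1D(),          # (119, 64) -> 64 features
    layers.Dense(24, activation="relu"),      # 64 * 24 + 24 = 1,560
    layers.Dense(1, activation="sigmoid"),    # 24 + 1 = 25
])
# Total: 112,433 trainable parameters, as in the summary.
```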
| | True + | True - | Total |
|---|---|---|---|
| Predicted + | TP = 80 | FP = 80 | 160 |
| Predicted - | FN = 20 | TN = 320 | 340 |
| Total | 100 | 400 | |

PPV = 0.50, NPV = 0.94, sensitivity = 0.80, specificity = 0.80

| | True + | True - | Total |
|---|---|---|---|
| Predicted + | TP = 99 | FP = 140 | 239 |
| Predicted - | FN = 1 | TN = 260 | 261 |
| Total | 100 | 400 | |

PPV = 0.41, NPV = 1.00, sensitivity = 0.99, specificity = 0.65

| | True + | True - | Total |
|---|---|---|---|
| Predicted + | TP = 65 | FP = 1 | 66 |
| Predicted - | FN = 35 | TN = 399 | 434 |
| Total | 100 | 400 | |

PPV = 0.98, NPV = 0.92, sensitivity = 0.65, specificity = 1.00
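The worked confusion-matrix examples above follow directly from the standard metric definitions; a short sketch reproducing the second example (high sensitivity and low specificity at low PPV):

```python
def metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "ppv":  tp / (tp + fp),   # positive predictive value
        "npv":  tn / (tn + fn),   # negative predictive value
        "sens": tp / (tp + fn),   # sensitivity (recall)
        "spec": tn / (tn + fp),   # specificity
    }

# Second example above: TP = 99, FP = 140, FN = 1, TN = 260
m = metrics(99, 140, 1, 260)
# Rounded to two decimals: PPV 0.41, NPV 1.00, sensitivity 0.99, specificity 0.65
```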