Feyisope R Eweje, Bingting Bao, Jing Wu, Deepa Dalal, Wei-Hua Liao, Yu He, Yongheng Luo, Shaolei Lu, Paul Zhang, Xianjing Peng, Ronnie Sebro, Harrison X Bai, Lisa States.
Abstract
BACKGROUND: Radiologists have difficulty distinguishing benign from malignant bone lesions because these lesions may have similar imaging appearances. The purpose of this study was to develop a deep learning algorithm that can differentiate benign and malignant bone lesions using routine magnetic resonance imaging (MRI) and patient demographics.
Keywords: Bone lesion; Bone tumor; Convolutional neural network; Deep learning; MRI
Year: 2021 PMID: 34098339 PMCID: PMC8190437 DOI: 10.1016/j.ebiom.2021.103402
Source DB: PubMed Journal: EBioMedicine ISSN: 2352-3964 Impact factor: 8.143
Figure 1. Schematic of the bone tumor classification deep learning pipeline. Top: Image segmentation. Raw image volumes were manually segmented to a region of interest focused on the tumor. The largest axial, transverse, and coronal slices of the segmented volume were used as inputs for the imaging models (“2.5D” image representation). Middle: Training and evaluation scheme. Hyperparameters were selected using a 4-fold cross-validation scheme. Final models were trained on the combined training and validation data sets, then evaluated on the internal and external testing sets, where the external testing set came from an independent institution. Bottom: Model architecture. One EfficientNet-B0 took T1-weighted images as input and output a malignancy probability; another EfficientNet-B0 took T2-weighted images as input. A logistic regression model accepted age, binary-encoded sex, and one-hot encoded lesion location as inputs and output a malignancy probability. A voting ensemble model used the classifications from the T1W, T2W, and clinical features models as inputs and output a final classification by a soft, probability-based majority rule vote.
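
The soft, probability-based vote described in the caption can be illustrated with a minimal sketch. It assumes equal weights for the three models and a 0.5 decision threshold, neither of which is stated in the source; the function names and the example probabilities are hypothetical.

```python
from statistics import mean


def soft_vote(probabilities, threshold=0.5):
    """Average per-model malignancy probabilities and threshold the mean.

    Assumption: equal weights and a 0.5 threshold (not confirmed by the source).
    """
    if not probabilities:
        raise ValueError("need at least one model probability")
    avg = mean(probabilities)
    label = "malignant" if avg >= threshold else "benign"
    return avg, label


def hard_vote(probabilities, threshold=0.5):
    """Majority vote over thresholded labels, shown only for contrast."""
    votes = sum(p >= threshold for p in probabilities)
    return "malignant" if votes > len(probabilities) / 2 else "benign"


# Hypothetical outputs of the T1W, T2W, and clinical features models.
model_probs = [0.80, 0.40, 0.45]
avg_prob, soft_label = soft_vote(model_probs)
hard_label = hard_vote(model_probs)
```

With these illustrative inputs the mean probability is about 0.55, so the soft vote favors malignancy even though two of the three models individually fall below the threshold. That is the practical difference between a probability-based vote and a simple count of labels.
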
Characteristics of patients included in the study. “One-vs-rest” tests for statistical significance in location distribution (e.g. Foot vs. Rest) were performed with Bonferroni-corrected p-values used for significance. ***Statistically significant
| Benign (N=582) | Malignant (N=478) | p-value |
|---|---|---|
| 27 ± 20 | 34 ± 25 | <0·001*** |
| | | 0·79 |
| 342 (59%) | 277 (58%) | |
| 240 (41%) | 201 (42%) | |
| | | <0·001*** |
| 3 (0·7%) | 5 (1·7%) | 0·52 |
| 12 (2·6%) | 55 (18·3%) | <0·001*** |
| 74 (16·1%) | 35 (11·6%) | 0·0055 |
| 80 (17·4%) | 57 (18·9%) | 0·43 |
| 89 (19·3%) | 5 (1·7%) | <0·001*** |
| 0 (0%) | 1 (0·3%) | 0·92 |
| 13 (2·8%) | 0 (0%) | 0·0026 |
| 4 (0·9%) | 2 (0·7%) | 0·87 |
| 1 (0·2%) | 0 (0%) | 0·92 |
| 29 (6·3%) | 1 (0·3%) | <0·001*** |
| 41 (8·9%) | 62 (20·6%) | 0·0017 |
| 34 (7·4%) | 33 (11%) | 0·56 |
| 16 (3·5%) | 8 (2·7%) | 0·34 |
| 65 (14·1%) | 37 (12·3%) | 0·075 |
| 11 (2·4%) | 4 (1·3%) | 0·24 |
| 6 (1·3%) | 10 (3·3%) | 0·24 |
| 4 (0·9%) | 1 (0·3%) | 0·5 |
| 3 (0·7%) | 7 (2·3%) | 0·2 |
| 6 (1·3%) | 19 (6·3%) | 0·0033 |
| 10 (2·2%) | 7 (2·3%) | 0·93 |
| 81 (17·6%) | 129 (42·9%) | <0·001*** |
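
A "one-vs-rest" comparison with Bonferroni correction, as named in the table caption, might look like the sketch below. The source does not say which test was used, so a Pearson chi-square test on a 2x2 table (no continuity correction) is an assumption here. The location names and counts are hypothetical and are not taken from the table.

```python
import math


def chi2_2x2_pvalue(a, b, c, d):
    """Pearson chi-square p-value (1 degree of freedom) for [[a, b], [c, d]].

    Assumption: the source does not name the test; chi-square is illustrative.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence of a difference
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def one_vs_rest_bonferroni(benign_counts, malignant_counts):
    """Test each location against all others; multiply p by the number of tests."""
    total_b = sum(benign_counts.values())
    total_m = sum(malignant_counts.values())
    n_tests = len(benign_counts)
    adjusted = {}
    for loc, b_in in benign_counts.items():
        m_in = malignant_counts[loc]
        p = chi2_2x2_pvalue(b_in, total_b - b_in, m_in, total_m - m_in)
        adjusted[loc] = min(1.0, p * n_tests)
    return adjusted


# Hypothetical counts per lesion location (not the study's data).
benign = {"site_a": 10, "site_b": 50, "site_c": 40}
malignant = {"site_a": 40, "site_b": 30, "site_c": 30}
adjusted_p = one_vs_rest_bonferroni(benign, malignant)
```

Multiplying each p-value by the number of comparisons is equivalent to dividing the significance threshold by that number, which is the other common way to state the Bonferroni correction.
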
Performance of the T1W, T2W, clinical features, and ensemble models on the internal test set (n = 93), compared with expert evaluation, and on the external test set (n = 97). The p-value for each expert, calculated with the McNemar test, compares that expert's accuracy with the accuracy of the ensemble model. Abbreviations: ROC AUC, area under ROC curve; PPV, positive predictive value; NPV, negative predictive value; 95% CI, 95% confidence interval.
Internal test set (n = 93)

| Modality | F1 Score | ROC AUC | Accuracy (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | PPV | NPV | p-value |
|---|---|---|---|---|---|---|---|---|
| Clinical | 0·58 | 0·71 | 0·62 (0·52-0·72) | 0·57 (0·42-0·71) | 0·67 (0·53-0·78) | 0·59 | 0·65 | - |
| T1W | 0·59 | 0·64 | 0·66 (0·55-0·74) | 0·55 (0·40-0·69) | 0·75 (0·61-0·85) | 0·64 | 0·67 | - |
| T2W | 0·67 | 0·74 | 0·74 (0·64-0·82) | 0·57 (0·42-0·71) | 0·88 (0·76-0·95) | 0·80 | 0·71 | - |
| Ensemble | 0·75 | 0·82 | 0·76 (0·67-0·84) | 0·79 (0·64-0·89) | 0·66 (0·53-0·78) | 0·72 | 0·81 | - |
| Expert 1 | 0·77 | - | 0·76 (0·66-0·84) | 0·86 (0·72-0·94) | 0·68 (0·54-0·79) | 0·69 | 0·85 | 1·0 |
| Expert 2 | 0·74 | - | 0·73 (0·63-0·81) | 0·83 (0·69-0·92) | 0·64 (0·50-0·76) | 0·66 | 0·82 | 0·66 |
| Expert 3 | 0·52 | - | 0·60 (0·50-0·69) | 0·48 (0·33-0·62) | 0·70 (0·56-0·81) | 0·57 | 0·61 | 0·02 |
| Expert Committee | 0·73 | - | 0·73 (0·63-0·81) | 0·81 (0·67-0·90) | 0·66 (0·52-0·78) | 0·67 | 0·81 | 0·7 |

External test set (n = 97)

| Modality | F1 Score | ROC AUC | Accuracy (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | PPV | NPV |
|---|---|---|---|---|---|---|---|
| Clinical | 0·52 | 0·69 | 0·64 (0·54-0·73) | 0·49 (0·34-0·64) | 0·74 (0·62-0·84) | 0·56 | 0·68 |
| T1W | 0·51 | 0·66 | 0·66 (0·56-0·75) | 0·44 (0·29-0·59) | 0·81 (0·69-0·89) | 0·61 | 0·68 |
| T2W | 0·65 | 0·73 | 0·72 (0·62-0·80) | 0·64 (0·48-0·77) | 0·78 (0·65-0·87) | 0·66 | 0·76 |
| Ensemble | 0·70 | 0·79 | 0·73 (0·64-0·81) | 0·77 (0·61-0·88) | 0·71 (0·58-0·81) | 0·63 | 0·82 |
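
The metrics reported in these tables follow standard definitions, and the McNemar test compares two raters on the same cases using only the cases where they disagree. The sketch below shows both. The exact binomial form of the McNemar test is an assumption, since the source does not say which variant was used, and all counts are hypothetical.

```python
from math import comb


def diagnostic_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases the model classified correctly and the expert did not.
    c: cases the expert classified correctly and the model did not.
    Assumption: exact binomial variant (the source does not specify one).
    """
    n = b + c
    if n == 0:
        return 1.0  # the two raters never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical confusion-matrix counts and discordant-pair counts.
metrics = diagnostic_metrics(tp=30, fp=10, tn=40, fn=10)
p_balanced = mcnemar_exact(5, 5)
p_skewed = mcnemar_exact(10, 1)
```

When the discordant cases split evenly the p-value is 1, which matches the intuition that two raters with the same accuracy on paired cases are indistinguishable by this test.
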
Figure 2. Receiver operating characteristic (ROC) curves for all models on the internal test data set (n = 93), compared to expert performance, and on the external test data set (n = 97).
Figure 3. Cases in the test set that were misclassified by all experts. Model classifications are displayed with the probability of malignancy determined by the model.
Performance of the experts and the ensemble model in classifying high frequency benign and malignant lesions in the internal test set.
| Malignant Tumors | N | Expert 1 accuracy | Expert 2 accuracy | Expert 3 accuracy | Expert committee | Model accuracy |
|---|---|---|---|---|---|---|
| | 11 | 90·9% | 100·0% | 81·8% | 100% | 90·9% |
| | 12 | 83·3% | 91·7% | 41·7% | 83·3% | 91·7% |
| | 8 | 87·5% | 62·5% | 62·5% | 75·0% | 75·0% |
| | 5 | 60·0% | 80·0% | 20·0% | 60·0% | 60·0% |
| | 9 | 44·4% | 44·4% | 77·8% | 44·4% | 77·8% |
| | 7 | 100·0% | 85·7% | 42·9% | 85·7% | 85·7% |
| | 6 | 83·3% | 83·3% | 100·0% | 83·3% | 100% |
| | 6 | 50·0% | 33·3% | 66·7% | 50·0% | 50·0% |
| | 5 | 100·0% | 100·0% | 80·0% | 100·0% | 100% |