Deep Learning for Classification of Bone Lesions on Routine MRI.

Feyisope R Eweje¹, Bingting Bao², Jing Wu², Deepa Dalal³, Wei-Hua Liao⁴, Yu He², Yongheng Luo², Shaolei Lu⁵, Paul Zhang⁶, Xianjing Peng⁷, Ronnie Sebro⁸, Harrison X Bai⁹, Lisa States¹⁰.

Abstract

BACKGROUND: Radiologists have difficulty distinguishing benign from malignant bone lesions because these lesions may have similar imaging appearances. The purpose of this study was to develop a deep learning algorithm that can differentiate benign and malignant bone lesions using routine magnetic resonance imaging (MRI) and patient demographics.
METHODS: 1,060 histologically confirmed bone lesions with T1- and T2-weighted pre-operative MRI were retrospectively identified and included, with lesions from 4 institutions used for model development and internal validation, and data from a fifth institution used for external validation. Image-based models were generated using the EfficientNet-B0 architecture and a logistic regression model was trained using patient age, sex, and lesion location. A voting ensemble was created as the final model. The performance of the model was compared to classification performance by radiology experts.
FINDINGS: The cohort had a mean age of 30±23 years and was 58.3% male, with 582 benign lesions and 478 malignant. Compared to a contrived expert committee result, the ensemble deep learning model achieved (ensemble vs. experts): similar accuracy (0·76 vs. 0·73, p=0·7), sensitivity (0·79 vs. 0·81, p=1·0) and specificity (0·75 vs. 0·66, p=0·48), with a ROC AUC of 0·82. On external testing, the model achieved ROC AUC of 0·79.
INTERPRETATION: Deep learning can be used to distinguish benign and malignant bone lesions on par with experts. These findings could aid in the development of computer-aided diagnostic tools to reduce unnecessary referrals to specialized centers from community clinics and limit unnecessary biopsies.
FUNDING: This work was funded by a Radiological Society of North America Research Medical Student Grant (#RMS2013) and supported by the Amazon Web Services Diagnostic Development Initiative.

Keywords: Bone lesion; Bone tumor; Convolutional neural network; Deep learning; MRI

Year: 2021 PMID: 34098339 PMCID: PMC8190437 DOI: 10.1016/j.ebiom.2021.103402

Source DB: PubMed Journal: EBioMedicine ISSN: 2352-3964 Impact factor: 8.143

Evidence before this study

Literature searches were conducted using the PubMed search engine using the following search terms: (“machine learning” OR “deep learning” OR “convolutional neural network”) AND (“bone tumor” OR “bone lesion”) AND (“diagnosis” OR “benign malignant”) AND (“imaging” OR “MRI”). Our search identified one previous study that utilized neural networks to distinguish benign and malignant bone tumors on radiographs, but this study involved manually rather than automatically encoded imaging features, did not study advanced imaging modalities, and did not include external validation. Another study utilized a Bayesian model for histological diagnosis of bone tumors, but this study was also based on radiographic imaging and did not include external validation. Both of these studies also suffered from small sample sizes. We did not find any studies involving deep learning techniques for characterization of bone lesions on MRI.

Added value of this study

In this study, we demonstrate that convolutional neural networks trained with MRI studies in combination with a logistic regression based upon clinical data are able to discern benign and malignant bone lesions with performance equivalent to that of expert musculoskeletal radiologists. Our study utilizes a multi-institutional dataset and includes external validation to ensure the generalizability of our findings.

Implications of all the available evidence

By providing a validated assessment of bone lesions on MRI, our approach has the potential to aid in diagnostic evaluation of bone lesions, particularly non-expert primary evaluation outside of specialist centers. Moreover, morbidity related to unnecessary biopsy of benign lesions can be reduced by enabling radiologists to rule out malignancy with greater confidence.

Introduction

Cancer of the bones and joints was the 3rd leading cause of cancer-related deaths in people under the age of 20 in the United States in 2016, with approximately 3,500 new bone cancer diagnoses in 2019 [1]. Outside of bone metastases (secondary bone tumors) and plasma cell myeloma, the most common bone malignancies are osteosarcoma, chondrosarcoma, and Ewing sarcoma [2], [3], [4]. While the incidence of benign bone tumors is more difficult to determine because they are rarely fully evaluated or biopsied, osteochondroma, enchondroma and osteoid osteoma are among the most common benign tumors [3]. Radiographs are the recommended first-line imaging modality for the characterization of bone lesions, as they typically provide clear assessment of lesion location, internal matrix, margins, and associated periosteal reaction [3]. These lesion characteristics in combination with patient age are often sufficient to provide differential diagnoses of bone lesions [3], [5]. However, radiographs have limitations. Superimpositions, poorly visualized partial cortex destruction, and difficulties analyzing flat and short bones and soft tissues can make radiographic diagnosis more challenging [6]. In addition, the clinical symptoms and radiographic appearance of infections (osteomyelitis) often mimic those of bone tumors [5,7]. Image diagnosis of primary bone lesions can be further complicated by the presence of pathologic fractures, which can increase the amount of fluid, hemorrhage, or edema in and around a lesion; this is particularly relevant for benign lesions such as non-ossifying fibroma, aneurysmal bone cyst, and fibrous dysplasia [2]. Chondroblastoma, osteoid osteoma, and Langerhans cell histiocytosis are among the benign bone lesions that can present with extra-lesional edema-like signal even in the absence of pathological fracture [8].
In cases with such complicating factors or any case in which a lesion is indeterminate or potentially aggressive, advanced imaging with MRI is warranted. MRI is highly sensitive for the detection of bone abnormalities due to its ability to characterize bone marrow involvement, soft tissue invasion, and fluid content of lesions [3,9]. The excellent tissue contrast provided by MR imaging can occasionally yield sufficient information to allow a specific histologic diagnosis to be made (e.g., intraosseous lipoma, enchondroma, hemangioma, or aneurysmal bone cyst) [3], [9], [10]. However, even when combining plain radiographs with MRI, radiologists were 100% sensitive but only 55% specific and 73% accurate in classifying bone malignancy in a small dataset [11]. Upon considering the limitations of advanced imaging and the rarity with which bone tumors are encountered clinically, a clear need emerges for technologies to aid in the diagnosis of bone tumors.
Artificial intelligence tools have been used to augment the ability of radiologists to assess the malignancy of tumors, including from MR images. Previous studies have showcased the ability of convolutional neural network (CNN) models to classify breast, prostate, kidney and brain lesions on MRI with high sensitivity and specificity [12], [13], [14], [15], [16]. While the body of literature is limited, some studies have employed such techniques for the classification of bone lesions. As early as 1994, rudimentary two-layer, feed-forward neural network models were used to distinguish benign from malignant bone tumors with 85% accuracy, 76% specificity and 89% sensitivity. However, this outcome was based upon a dataset of only 709 lesions, manually encoded radiographic characteristics, and, critically, training rather than validation performance [17]. In a more recent work, Do et al. used a Naïve Bayesian model trained upon 18 demographic and radiographic features to determine a top-3 histological differential diagnosis of 710 bone tumors, capturing the true diagnosis with 60% accuracy [18]. Our research group has recently demonstrated that convolutional neural networks can be used to achieve 3-class discrimination of bone tumors on radiographs according to histopathologic categories with 73% accuracy, comparable to expert radiologists [19]. With current state-of-the-art machine learning methods, it may be possible to achieve better performance in automated bone tumor characterization through analysis of advanced imaging. In this study, we utilized deep learning to develop a malignancy prediction algorithm for bone lesions on routine MRI.

Methods

Study participants

1368 lesions with pre-operative MRI demonstrating single or multiple lesions with apparent bone involvement and histologically confirmed diagnosis following biopsy or surgery were retrospectively identified from 5 hospitals from 2006 to 2020 by consecutive sampling. Lesions were identified according to the World Health Organization (WHO) system for the classification of bone and soft tissue tumors. The WHO classifies bone tumors into histological categories based upon the potential of the tumor to cause local tissue destruction and metastasize to distant sites. In order of disease severity, tumors can be classified as: 1) Benign, 2) Intermediate, locally aggressive (possibility of destructive local recurrence), 3) Intermediate, rarely metastasizing (as above, with the additional possibility of metastasis) or 4) Malignant [4,20,21]. Tumors classified as intermediate according to the WHO classification criteria were grouped as benign, as each of the diagnoses present in this group (osteoblastoma, desmoplastic fibroma, giant cell tumor, epithelioid hemangioma, myofibromatosis, Langerhans cell histiocytosis, and myoepithelioma) is in practice generally considered benign (henceforth collectively referred to as benign). Age at time of imaging, gender, and the skeletal location of the lesion of interest were extracted from patients’ electronic medical records. Exclusion criteria were incomplete imaging protocols lacking a T1- or T2-weighted sequence (T1W or T2W, respectively), inconclusive involvement of osseous structures, and insufficient image quality for analysis. Inclusion and exclusion criteria are described in Supplementary Figure 1.

Image Segmentation and Preprocessing

The MR images were manually segmented by a radiologist with 3 years of experience reading musculoskeletal MRI using 3D Slicer (version 4.10). N4 bias correction and intensity normalization were performed upon each image using SimpleITK [22]. The intensity of each image was normalized relative to a reference image, with all images acquired using a given sequence (i.e. T1W or T2W) normalized with a single, high-resolution reference of the same sequence. Each image was cropped to a rectangular volume of interest delineated by the widest and tallest non-zero valued pixels in the segmentation. The largest sagittal, axial and coronal slices of each processed image volume were selected as inputs for the classification model; this 2.5D approach has been shown to have robust performance relative to 3D image classification approaches but with significantly reduced computational cost [23,24].
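
The cropping and slice-selection steps described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names are hypothetical, "largest" is assumed to mean the slice with the greatest segmented area along each axis, and which array axis corresponds to the sagittal, axial, or coronal plane depends on image orientation. N4 bias correction and intensity normalization are omitted.

```python
import numpy as np

def crop_to_mask(volume, mask):
    """Crop a volume and its mask to the bounding box of non-zero mask voxels."""
    coords = np.argwhere(mask > 0)
    lo = coords.min(axis=0)
    hi = coords.max(axis=0) + 1          # exclusive upper bound
    box = tuple(slice(a, b) for a, b in zip(lo, hi))
    return volume[box], mask[box]

def largest_slices(volume, mask):
    """For each axis, return the 2D slice whose segmented area is largest."""
    views = []
    for axis in range(3):
        other_axes = tuple(a for a in range(3) if a != axis)
        areas = (mask > 0).sum(axis=other_axes)   # segmented area per slice
        index = int(np.argmax(areas))
        views.append(np.take(volume, index, axis=axis))
    return views

# Toy example: a 6 x 7 x 8 volume with a box-shaped segmentation.
volume = np.arange(6 * 7 * 8, dtype=float).reshape(6, 7, 8)
mask = np.zeros(volume.shape, dtype=np.uint8)
mask[1:4, 2:6, 3:5] = 1

cropped, cropped_mask = crop_to_mask(volume, mask)
views = largest_slices(cropped, cropped_mask)
print(cropped.shape, [v.shape for v in views])
```

Each of the three returned slices would then be resized and passed to the image model, which is what keeps the 2.5D representation cheaper than a full 3D input.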

Model design

Imaging data models

Models for image classification were developed by adapting the EfficientNet deep learning architecture. EfficientNet is a state-of-the-art image classification network architecture that is an improvement upon previously developed convolutional neural network designs as it improves accuracy while significantly decreasing the number of network parameters and thereby substantially improving computational efficiency. EfficientNet models initialized with weights pre-trained on the ImageNet database were used for feature extraction from imaging data. The EfficientNet classifying layer was replaced with a series of fully connected layers of size 256, 128, 64, 32 and 16 nodes with interposing dropout layers and batch normalization layers. A final classification layer with a single node and sigmoid activation was used to perform the binary classification task.
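
To give a sense of the size of the replacement classifier, the sketch below counts the dense-layer parameters of the head described above. It assumes the head is fed the 1280-dimensional pooled feature vector of EfficientNet-B0; the text does not state how the features are pooled, so that input width is an assumption, and dropout and batch normalization parameters are not counted.

```python
# Assumed input width: EfficientNet-B0's pooled feature vector (1280 values).
FEATURE_DIM = 1280
# Fully connected layer widths from the text, plus the single sigmoid output node.
HEAD_WIDTHS = [256, 128, 64, 32, 16, 1]

def dense_param_counts(input_dim, widths):
    """Return weights-plus-biases counts for a stack of fully connected layers."""
    counts = []
    fan_in = input_dim
    for width in widths:
        counts.append(fan_in * width + width)  # weight matrix + bias vector
        fan_in = width
    return counts

counts = dense_param_counts(FEATURE_DIM, HEAD_WIDTHS)
total = sum(counts)
print(counts, total)
```

Under these assumptions, most of the head's parameters sit in the first dense layer, which is small relative to the pre-trained convolutional backbone.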

Clinical data model

A logistic regression model using clinical variables was separately developed for the classification task. Inputs were patient age, sex, and lesion location. 21 locations (clavicle, cranium, proximal femur, distal femur, foot, proximal radius, distal radius, proximal ulna, distal ulna, hand, hip, proximal humerus, distal humerus, proximal tibia, distal tibia, proximal fibula, distal fibula, mandible, rib/chest wall, scapula, or spine) were one-hot encoded such that the model received 23 distinct quantified input variables.
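
The 23-variable input can be illustrated with a small encoder. This is a sketch under assumptions: the function name, the ordering of the variables, and the coding of sex as 1 for male are hypothetical choices, and any scaling of age used by the authors is not shown.

```python
# The 21 lesion locations listed in the text.
LOCATIONS = [
    "clavicle", "cranium", "proximal femur", "distal femur", "foot",
    "proximal radius", "distal radius", "proximal ulna", "distal ulna",
    "hand", "hip", "proximal humerus", "distal humerus", "proximal tibia",
    "distal tibia", "proximal fibula", "distal fibula", "mandible",
    "rib/chest wall", "scapula", "spine",
]

def encode_clinical(age_years, sex, location):
    """Return [age, sex flag, 21 one-hot location indicators] as floats."""
    if location not in LOCATIONS:
        raise ValueError(f"unknown location: {location}")
    sex_flag = 1.0 if sex == "male" else 0.0   # assumed binary coding
    one_hot = [1.0 if location == name else 0.0 for name in LOCATIONS]
    return [float(age_years), sex_flag] + one_hot

features = encode_clinical(14, "male", "distal femur")
print(len(features))
```

Age, sex, and the 21 location indicators together give the 23 inputs mentioned above.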

Ensemble model

The imaging and clinical feature models were then combined using a stacked ensemble approach in which a voting ensemble received malignancy probabilities from the imaging and clinical feature models as inputs and created outputs based upon a summation of the predicted probabilities. Each ensemble classification model consisted of the outputs of an EfficientNet trained upon T1W imaging studies, an EfficientNet trained upon T2W imaging studies, and a logistic regression model based upon clinical features.
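
A soft vote over the three constituent models can be sketched as below. This is an illustration rather than the authors' implementation: it averages the three malignancy probabilities (equivalent to summing them and rescaling) and applies a single cut-off, whereas the study also tuned a threshold for each constituent model. The function name and the example probabilities are hypothetical.

```python
import numpy as np

def soft_vote(probabilities, threshold=0.5):
    """Average per-model malignancy probabilities and threshold the result.

    probabilities: array of shape (n_lesions, n_models), e.g. columns for the
    T1W model, the T2W model, and the clinical model.
    """
    probs = np.asarray(probabilities, dtype=float)
    mean_prob = probs.mean(axis=1)
    labels = (mean_prob >= threshold).astype(int)   # 1 = malignant
    return mean_prob, labels

# Two hypothetical lesions scored by the T1W, T2W, and clinical models.
probs = np.array([
    [0.30, 0.80, 0.70],
    [0.20, 0.40, 0.30],
])
mean_prob, labels = soft_vote(probs)
print(mean_prob, labels)
```

In the first row the T1W model alone would call the lesion benign, but the combined probability tips the ensemble towards malignancy, which is the behaviour a probability-based vote is meant to provide.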

Model training and evaluation

Binary classification models were trained to distinguish benign from malignant bone lesions on T1W and T2W images. 4-fold cross validation was used to evaluate the model building pipeline and select hyperparameters for the final trained models. The data from CHOP, HUP, RIH, and SXH were used for cross validation, as well as training, validation, and internal testing for the final models by a 7:2:1 split. Data from XH was reserved for external testing to assess generalizability of the created algorithms to data from separate institutions. Using the EfficientNet-B0 architecture, models were trained with a batch size of 64 for 200 epochs during cross-validation and 200 epochs with early stopping after 100 epochs of no loss improvements on the validation set during final model training. Models were implemented in Python (version 3.8) and trained on a machine with a NVIDIA Tesla V100 GPU. During training, segmented images were scaled up or down to 200 × 200 pixels using bilinear interpolation. The training set was augmented with horizontal flip, vertical flip, shear, and zoom transformations to add variability. A predetermined probability of 0·5 was assigned to the final sigmoid activation neuron as a threshold for classification of malignancy. Loss on the validation set was monitored over each epoch and the model with the minimum validation loss was selected to represent a given training trial. The hyperparameters that produced the best average test performance in cross-validation were selected for training the final image classification models. The logistic regression model for clinical feature-based classification was trained with L2 regularization and a stochastic averaged gradient descent optimizer. Feature ranking with recursive feature elimination and 4-fold cross-validated selection of the best feature set was implemented for the creation of the clinical data model. 
Training involved all features initially then the least predictive feature was removed with each iteration until the desired feature set size was achieved. Feature set sizes from 1 to 23 (all features) were trialed. The feature set with maximum cross validation test performance was selected as the final feature set. To incorporate sensitivity bias in the voting ensemble, malignancy thresholds for each of the constituent models were empirically determined via a grid search algorithm that maximized Youden's index while achieving at least 90% sensitivity on the validation set [25]. Figure 1 illustrates the data processing pipeline and model architecture.
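
The threshold search described above, maximizing Youden's index subject to a sensitivity floor, can be sketched as follows. The function name, the grid of candidate thresholds, and the toy validation data are assumptions for illustration; the authors' grid is not specified in the text.

```python
import numpy as np

def pick_threshold(y_true, y_prob, min_sensitivity=0.90, grid=None):
    """Return (threshold, J, sensitivity, specificity) with the largest Youden's J
    among thresholds whose sensitivity is at least min_sensitivity."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)   # assumed candidate thresholds
    best = None
    for t in grid:
        pred = y_prob >= t
        sens = (pred & y_true).sum() / y_true.sum()
        spec = (~pred & ~y_true).sum() / (~y_true).sum()
        if sens < min_sensitivity:
            continue                          # violates the sensitivity floor
        j = sens + spec - 1                   # Youden's index
        if best is None or j > best[1]:
            best = (float(t), float(j), float(sens), float(spec))
    return best

# Hypothetical validation labels (1 = malignant) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.355, 0.55, 0.325, 0.2, 0.1]
best = pick_threshold(y_true, y_prob)
print(best)
```

The sensitivity floor pushes the chosen cut-off below the default of 0·5, trading some specificity for fewer missed malignancies.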

Figure 1

Schematic of the bone tumor classification deep learning pipeline. Top: Image segmentation. Raw image volumes were manually segmented to a region of interest focused upon the tumor. The largest sagittal, axial, and coronal slices of the segmented volume were used as inputs for the imaging models (“2.5D” image representation). Middle: Training and evaluation scheme. Hyperparameters were selected based upon a 4-fold cross validation scheme. Final models were trained using the training and validation data sets then evaluated using the internal and external testing sets, where the external testing set was from an independent institution. Bottom: Model architecture. An EfficientNet-B0 took T1-weighted images as inputs and output a malignancy probability; another EfficientNet-B0 took T2-weighted images as inputs. A logistic regression model accepted age, binary-encoded sex, and one-hot encoded lesion location as inputs and output a malignancy probability. A voting ensemble model used classifications from the T1W, T2W, and clinical features models as inputs and output a final classification by a soft, probability-based majority rule vote.

Radiologist Interpretation

Three expert radiologists (Y.H., R.S., Y.L.) with 3, 8, and 7 years of experience reading musculoskeletal MRI respectively, blind to histopathologic data, evaluated unsegmented MRI images of the bone lesions in the internal test set for malignancy. T1-weighted and T2-weighted images were made available to the evaluators for each lesion in the internal test set; T1 contrast-enhanced (T1C) images were also provided to the evaluators when available but were not used in the model training and evaluation due to limited availability among samples in the dataset. The evaluators were also given demographic information (age and sex) for each patient. The model's results were compared to these expert evaluations and a contrived “expert committee” (expert decision by majority rule) to assess model performance. Supplementary Table 1 shows information regarding the previous experience of the radiologists who evaluated the lesions in the internal test set in reading musculoskeletal MRI.

Statistics

The demographic and clinicopathologic features of the benign and malignant groups were compared using a chi-squared test for categorical variables (lesion location, sex) and a T-test or single-factor ANOVA for continuous variables (age). Post hoc chi-squared tests pairing lesions of each location against a subset comprised of lesions from all other locations were performed following indication of an overall statistically significant difference in location between the groups. These “one-vs-rest” comparisons were performed with Bonferroni-corrected p-values for significance. The same analyses were performed to compare the combined training and validation set to the internal testing and external testing datasets. Accuracy, sensitivity, specificity, and area under the Receiver-Operator Characteristic curve (ROC AUC) were calculated for the classification task with 95% confidence intervals determined by the Wilson method [26]. ROC AUC on the validation set was used to empirically select final training hyperparameters for the imaging data models during cross validation. Fleiss’ κ was used to evaluate interrater reliability. Model binary classification performance was compared to expert committee performance using the McNemar test. Statistical significance was defined as P < 0·05. Statistical analyses were performed using Python (version 3.8) statistical libraries.
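
Two of the statistics named above can be sketched in plain Python. The Wilson interval follows the standard formula. The McNemar function uses the exact binomial form, which is an assumption because the text does not say whether the exact or chi-squared variant was used. The count of 71 correct out of 93 is also an assumption, chosen to be consistent with the reported ensemble accuracy of 0·76, and the discordant-pair counts are hypothetical.

```python
from math import comb, sqrt

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution

def wilson_interval(successes, n, z=Z_95):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Assumed count: 71 of 93 internal test lesions classified correctly.
low, high = wilson_interval(71, 93)
# Hypothetical discordant pairs: 3 lesions only the model got right, 9 only the expert.
p_value = mcnemar_exact(3, 9)
print(round(low, 2), round(high, 2), p_value)
```

With the assumed count, the interval rounds to 0·67-0·84, the same interval reported for the ensemble's internal test accuracy.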

Code Availability

The image classification models were deployed with an implementation of the EfficientNet architecture using the Python Keras library (https://github.com/qubvel/efficientnet). The clinical feature-based logistic regression model was implemented using the Python scikit-learn library (version 0.24.1). All code for image preprocessing and malignancy prediction is publicly available at https://github.com/sopeeweje/Bone-MRI.

Ethics Statement

Our study received a waiver of informed consent and exempt status from the institutional review boards of the Hospital of the University of Pennsylvania (HUP) (protocol number 831582) and the Children's Hospital of Philadelphia (CHOP) (protocol number 20-017327) in Philadelphia, PA, and Rhode Island Hospital (RIH) (protocol number 1747284) in Providence, Rhode Island. The study was also approved by the institutional review boards of the Xiangya Hospital (XH) and Second Xiangya Hospital (SXH) of Central South University in Hunan, China.

Role of the funding source

The study sponsors did not have any role in the study design; the collection, analysis and interpretation of data; in writing of the report; or in the decision to submit the paper for publication.

Results

Study Participants

The final cohort consisted of 1060 lesions – 185 from HUP, 464 from CHOP, 208 from SXH, 111 from RIH, and 97 from XH. Table 1 summarizes the clinical characteristics of the study participants and Supplementary Table 2 shows the detailed histopathological diagnoses. The sample had a mean age of 30±23 years and comprised 619 males and 441 females. 582 lesions (27±20 years, 342 males) were benign and 478 lesions (34±25 years, 277 males) were malignant. Comparing the benign and malignant lesion groups, there was a balanced gender distribution (p = 0·79) and significant differences in age (p < 0·001) and lesion location (p < 0·001). Upon “one-vs-rest” comparison, there was a statistically significant difference in malignancy distribution for lesions located in the cranium, hand, foot, hip, and spine. 678, 192, 93, and 97 lesions were allocated to the training, validation, internal testing, and external testing sets respectively. Supplementary Table 3 summarizes the characteristics of the training, validation, and internal test and external test sets.
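
The "one-vs-rest" significance calls can be checked against the location p-values in Table 1. The sketch assumes that the tabulated p-values are uncorrected and are compared with a Bonferroni-adjusted threshold of 0·05 divided by the 21 locations; entries reported as "<0·001" are stored as the upper bound 0.001.

```python
ALPHA = 0.05
N_LOCATIONS = 21

# Location p-values from Table 1; "<0.001" entries stored as the upper bound 0.001.
p_values = {
    "Clavicle": 0.52, "Cranium": 0.001, "Proximal femur": 0.0055,
    "Distal femur": 0.43, "Foot": 0.001, "Proximal radius": 0.92,
    "Distal radius": 0.0026, "Proximal ulna": 0.87, "Distal ulna": 0.92,
    "Hand": 0.001, "Hip": 0.0017, "Proximal humerus": 0.56,
    "Distal humerus": 0.34, "Proximal tibia": 0.075, "Distal tibia": 0.24,
    "Proximal fibula": 0.24, "Distal fibula": 0.5, "Mandible": 0.2,
    "Rib/Chest wall": 0.0033, "Scapula": 0.93, "Spine": 0.001,
}

threshold = ALPHA / N_LOCATIONS   # Bonferroni-adjusted significance level
significant = sorted(name for name, p in p_values.items() if p < threshold)
print(round(threshold, 5), significant)
```

Under that assumption the adjusted threshold is about 0·0024, and the locations that pass it are the cranium, foot, hand, hip, and spine, matching the list given in the text.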

Table 1

Characteristic	Benign (N=582)	Malignant (N=478)	p-value
Age (years ± SD)	27 ± 20	34 ± 25	<0·001***
Sex (%)			0·79
Male	342 (59%)	277 (58%)
Female	240 (41%)	201 (42%)
Location (%)			<0·001***
Clavicle	3 (0·7%)	5 (1·7%)	0·52
Cranium	12 (2·6%)	55 (18·3%)	<0·001***
Proximal femur	74 (16·1%)	35 (11·6%)	0·0055
Distal femur	80 (17·4%)	57 (18·9%)	0·43
Foot	89 (19·3%)	5 (1·7%)	<0·001***
Proximal radius	0 (0%)	1 (0·3%)	0·92
Distal radius	13 (2·8%)	0 (0%)	0·0026
Proximal ulna	4 (0·9%)	2 (0·7%)	0·87
Distal ulna	1 (0·2%)	0 (0%)	0·92
Hand	29 (6·3%)	1 (0·3%)	<0·001***
Hip	41 (8·9%)	62 (20·6%)	0·0017
Proximal humerus	34 (7·4%)	33 (11%)	0·56
Distal humerus	16 (3·5%)	8 (2·7%)	0·34
Proximal tibia	65 (14·1%)	37 (12·3%)	0·075
Distal tibia	11 (2·4%)	4 (1·3%)	0·24
Proximal fibula	6 (1·3%)	10 (3·3%)	0·24
Distal fibula	4 (0·9%)	1 (0·3%)	0·5
Mandible	3 (0·7%)	7 (2·3%)	0·2
Rib/Chest wall	6 (1·3%)	19 (6·3%)	0·0033
Scapula	10 (2·2%)	7 (2·3%)	0·93
Spine	81 (17·6%)	129 (42·9%)	<0·001***

Characteristics of patients included in the study. “One-vs-rest” tests for statistical significance in location distribution (e.g. Foot vs. Rest) were performed with Bonferroni-corrected p-values used for significance. ***Statistically significant.

Performance of the imaging data training algorithms in cross-validation is shown in Supplementary Figure 2. Results of the grid search for thresholds in the voting ensemble are shown in Supplementary Figure 3. Performance of the final T1W, T2W, clinical features and ensemble models on the internal test set in comparison to expert evaluations and the external test set is described in Table 2 and performance of the models on the training and validation sets is summarized in Supplementary Table 4.

Table 2

Internal Test Set
Modality	F1 Score	ROC AUC	Accuracy (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)	PPV	NPV	p-value
Clinical	0·58	0·71	0·62 (0·52-0·72)	0·57 (0·42-0·71)	0·67 (0·53-0·78)	0·59	0·65	-
T1W	0·59	0·64	0·66 (0·55-0·74)	0·55 (0·40-0·69)	0·75 (0·61-0·85)	0·64	0·67	-
T2W	0·67	0·74	0·74 (0·64-0·82)	0·57 (0·42-0·71)	0·88 (0·76-0·95)	0·80	0·71	-
Ensemble	0·75	0·82	0·76 (0·67-0·84)	0·79 (0·64-0·89)	0·75 (0·61-0·85)	0·72	0·81	-
Expert 1	0·77	-	0·76 (0·66-0·84)	0·86 (0·72-0·94)	0·68 (0·54-0·79)	0·69	0·85	1·0
Expert 2	0·74	-	0·73 (0·63-0·81)	0·83 (0·69-0·92)	0·64 (0·50-0·76)	0·66	0·82	0·66
Expert 3	0·52	-	0·60 (0·50-0·69)	0·48 (0·33-0·62)	0·70 (0·56-0·81)	0·57	0·61	0·02
Expert Committee	0·73	-	0·73 (0·63-0·81)	0·81 (0·67-0·90)	0·66 (0·52-0·78)	0·67	0·81	0·7

External Testing Set
Modality	F1 Score	ROC AUC	Accuracy (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)	PPV	NPV
Clinical	0·52	0·69	0·64 (0·54-0·73)	0·49 (0·34-0·64)	0·74 (0·62-0·84)	0·56	0·68
T1W	0·51	0·66	0·66 (0·56-0·75)	0·44 (0·29-0·59)	0·81 (0·69-0·89)	0·61	0·68
T2W	0·65	0·73	0·72 (0·62-0·80)	0·64 (0·48-0·77)	0·78 (0·65-0·87)	0·66	0·76
Ensemble	0·70	0·79	0·73 (0·64-0·81)	0·77 (0·61-0·88)	0·71 (0·58-0·81)	0·63	0·82

Performance of T1W, T2W, clinical features and ensemble models on the internal test set (n = 93) compared with expert evaluation, as well as the external test set (n = 97). p-value as calculated by the McNemar test for each expert is for accuracy relative to the performance of the ensemble model. Abbreviations - ROC AUC, area under ROC curve; PPV, positive predictive value; NPV, negative predictive value; 95% CI, 95% confidence interval.

On internal testing, the clinical variable logistic regression achieved an accuracy of 0·62 (95% CI: 0·52-0·72), F1 score of 0·58, sensitivity of 0·57 (95% CI: 0·42-0·71), and specificity of 0·67 (95% CI: 0·53-0·78). On external testing, the logistic regression model based on clinical variables achieved an accuracy of 0·64 (95% CI: 0·54-0·73), F1 score of 0·52, sensitivity of 0·49 (95% CI: 0·34-0·64), and specificity of 0·74 (95% CI: 0·62-0·84). The generated regression equation was:
0·566 × Age + 0·955 × Cranium + 0·705 × Hip + 0·588 × Spine + 0·438 × RibChest + 0·438 × DistalFemur + 0·398 × ProximalTibia
Location in the hand (-0·505) and foot (-0·815) were the most heavily weighted determinants towards benign classification and location in the hip (+0·705) and cranium (+0·955) were the most heavily weighted determinants towards malignant classification.
The final T1W model was trained with stochastic gradient descent optimization with Nesterov momentum, learning rate of 0·001, and dropout rate of 0·2 in the classifier. On the internal test set, the T1W model achieved a test accuracy of 0·66 (95% CI: 0·55-0·74), F1 score of 0·59, sensitivity of 0·55 (95% CI: 0·40-0·69), and specificity of 0·75 (95% CI: 0·61-0·85). On the external test set, the T1W trained model achieved a test accuracy of 0·66 (95% CI: 0·56-0·75), F1 score of 0·51, sensitivity of 0·44 (95% CI: 0·29-0·59), and specificity of 0·81 (95% CI: 0·69-0·89).
The final T2W model was trained with Adam optimization and dropout rate of 0·4 in the classifier. On the internal test set, the T2W model achieved a test accuracy of 0·74 (95% CI: 0·64-0·82), F1 score of 0·67, sensitivity of 0·57 (95% CI: 0·42-0·71), and specificity of 0·88 (95% CI: 0·76-0·95). On the external test set, the T2W model achieved a test accuracy of 0·72 (95% CI: 0·62-0·80), F1 score of 0·65, sensitivity of 0·64 (95% CI: 0·48-0·77), and specificity of 0·78 (95% CI: 0·65-0·87).
On the internal test set, the ensemble model achieved a test accuracy of 0·76 (95% CI: 0·67-0·84), F1 score of 0·75, sensitivity of 0·79 (95% CI: 0·64-0·89), and specificity of 0·75 (95% CI: 0·61-0·85). On the external test set, the ensemble model achieved a test accuracy of 0·73 (95% CI: 0·64-0·81), F1 score of 0·70, sensitivity of 0·77 (95% CI: 0·61-0·88), and specificity of 0·71 (95% CI: 0·58-0·81). Adding an optimized T1C-trained model to the ensemble model neither improved nor worsened internal test set performance (Supplementary Table 5).
In evaluating the internal test set, expert 1 achieved a test accuracy of 0·76 (95% CI: 0·66-0·84), F1 score of 0·77, sensitivity of 0·86 (95% CI: 0·72-0·94), and specificity of 0·68 (95% CI: 0·54-0·79). Expert 2 achieved a test accuracy of 0·73 (95% CI: 0·63-0·81), F1 score of 0·74, sensitivity of 0·83 (95% CI: 0·69-0·92), and specificity of 0·64 (95% CI: 0·50-0·76). Expert 3 achieved a test accuracy of 0·60 (95% CI: 0·50-0·69), F1 score of 0·52, sensitivity of 0·48 (95% CI: 0·33-0·62), and specificity of 0·70 (95% CI: 0·56-0·81). Interrater reliability as calculated with Fleiss’ κ was 0·02. The expert committee assembled by majority rule achieved a test accuracy of 0·73 (95% CI: 0·63-0·81), F1 score of 0·73, sensitivity of 0·81 (95% CI: 0·67-0·90), and specificity of 0·67 (95% CI: 0·53-0·78).
Compared to the expert committee, the ensemble deep learning model achieved similar accuracy (0·76 vs. 0·73, p=0·7 [McNemar test]), sensitivity (0·79 vs. 0·81, p=1·0 [McNemar test]) and specificity (0·75 vs. 0·66, p=0·48 [McNemar test]). Figure 2 shows the ROC curve for each model overlaid with expert performance on the internal test set and the ROC curve for the models’ performance on the external test set.

Figure 2

Receiver-Operator Characteristic (ROC) curves for all models on internal test data set (n = 93) compared to expert performance and on the external test data set (n = 97).

There were 7 tumors out of the 93 cases in the internal test set that were classified incorrectly by all 3 evaluators. These cases are depicted in Figure 3. 4 out of these 7 cases were benign entities incorrectly assessed as malignant by all 3 raters and the remaining 3 were malignant entities incorrectly assessed as benign. The model was able to correctly assess malignancy in 4 out of these 7 cases. Table 3 shows the performance of the expert evaluators and the model in classifying the benign and malignant lesion types that were most frequent in the test set. No statistically significant differences in performance by lesion were observed.

Figure 3

Cases in the test set that were misclassified by all experts. Model classifications are displayed with the probability of malignancy determined by the model.

Table 3

Performance of the experts and the ensemble model in classifying high frequency benign and malignant lesions in the internal test set.

Malignant Tumors	N	Expert 1 accuracy	Expert 2 accuracy	Expert 3 accuracy	Expert committee	Model accuracy
Osteosarcoma	11	90·9%	100·0%	81·8%	100·0%	90·9%
Ewing sarcoma	12	83·3%	91·7%	41·7%	83·3%	91·7%
Multiple Myeloma	8	87·5%	62·5%	62·5%	75·0%	75·0%
Chondrosarcoma	5	60·0%	80·0%	20·0%	60·0%	60·0%

Benign Lesions	N	Expert 1 accuracy	Expert 2 accuracy	Expert 3 accuracy	Expert committee	Model accuracy
Giant cell tumor of the bone	9	44·4%	44·4%	77·8%	44·4%	77·8%
Chondroblastoma	7	100·0%	85·7%	42·9%	85·7%	85·7%
Enchondroma	6	83·3%	83·3%	100·0%	83·3%	100·0%
Aneurysmal Bone Cyst	6	50·0%	33·3%	66·7%	50·0%	50·0%
Osteomyelitis	5	100·0%	100·0%	80·0%	100·0%	100·0%

Discussion

MRI is the go-to advanced imaging modality for the evaluation of potentially suspicious bone lesions prior to biopsy or intervention. The diagnosis of bone lesions on imaging is complicated by the rarity with which they are encountered in clinical practice and the non-specific presentations of various benign and malignant entities. In this study, we utilized a deep learning method combining routine MRI images and clinical characteristics to develop a model to classify the malignancy of bone lesions. The model was a voting ensemble comprised of EfficientNets trained upon T1-weighted and T2-weighted images and a logistic regression trained upon patient age, sex, and tumor location. Generalizability was supported by the lack of a significant decrement in performance on validation with an external data set. As shown in Table 3, the model classified benign entities such as giant cell tumor and malignant entities such as Ewing sarcoma with higher accuracy than the expert committee, and matched the committee for multiple myeloma. The analysis was not sufficiently powered to observe statistically significant differences in classification performance by lesion, but expanding the test set might allow such differences to be determined. There is significant value in a model that can recapitulate the assessment of bone lesions on MRI by expert musculoskeletal radiologists. In one review of patients with equivocal findings on initial MRI that were subsequently referred to an orthopedic oncology clinic, radiologists at the clinic found that one-third of 390 referred patients had images that were clearly characteristic of non-neoplastic entities or benign tumors that did not in fact require follow-up with an orthopedic oncologist [27]. These unnecessary referrals complicate the task of identifying malignancy for specialist radiologists.
In this context, a sensitive, validated model for the characterization of suspicious bone lesions could perhaps reduce the rate of unnecessary referrals to higher levels of care and reduce patient anxiety regarding a potential cancer diagnosis. Both the experts and the model were highly sensitive, the former due to an inherent bias towards avoiding false negative diagnoses and the latter due to an encoded bias designed to mimic the expert approach. However, highly specific assessment of bone lesions would also prove valuable. Unnecessary biopsy of benign lesions falsely considered malignant can create undue patient stress and leave patients at risk of post-operative complications, especially when managed outside of specialist multidisciplinary centers [28]. In addition, biopsy can be non-diagnostic in up to 30% of cases, subjecting patients to repeat biopsy procedures and a higher risk of complication [28,29]. A computer-aided diagnostic tool that can identify benign lesions with high specificity would be valuable in reducing the rate of unnecessary biopsies, by aiding radiologists in ruling in malignancy with greater certainty. By adjusting our thresholding approaches, we could easily create additional models that are biased towards high specificity performance to be used for this purpose.
Explainability is a significant barrier to the utilization of machine learning methods to support clinical practice. Our clinical features model represents a step towards a more explicit understanding of artificial decision-making for clinical diagnostics. The clinical model correctly identified hand and foot locations as negative predictors of malignancy and cranial and spinal locations as positive predictors of malignancy. The majority of tumors affecting the hand, cranium and spine are enchondromas; chordomas and chondrosarcomas; and bone metastases and multiple myeloma, respectively [30,31,32,33]. 
While the majority of bone tumors of the foot are benign, our clinical features model likely associated foot location with benign nature because the majority of foot-located lesions in our cohort were osteomyelitis, a bone tumor-mimicker that commonly affects the lower extremity. The clinical model also predicted increased probability of malignancy with increased age, which is consistent with the epidemiology of benign bone tumors, most of which occur in the first two decades of life. Sex had no predictive value for the model, which is consistent with the observation that most bone lesions show no particular gender predilection [34].
There were select tumors that were misclassified by all expert evaluators but correctly assessed by the model (Figure 3). Ewing sarcoma (Figure 3d) has a heterogeneous appearance on MRI and can be difficult to clinically diagnose in its earliest stages prior to the significant cortical destruction that occurs following spread beyond the bone marrow [35,36]. Osteoid osteoma (Figure 3g) frequently demonstrates an abnormally high peritumoral signal intensity on MRI due to hyperemia and consequent bone marrow edema, resulting in frequent misinterpretation [11,37]. There were also cases that were misclassified by all expert evaluators as well as the model. Osteoblastoma (Figure 3c) is an uncommon benign bone tumor with rib involvement in less than 5% of cases. The expansile growth pattern with well-defined margins exhibited by both of the present tumors is consistent with previously documented observations of rib osteoblastomas; however, these are somewhat aggressive features that are frequently considered to be on the borderline of osteoblastoma and low-grade osteosarcoma [38,39]. Aneurysmal bone cyst (Figure 3f) shares several clinical and imaging features with telangiectatic osteosarcoma, such as young age at presentation, large size, and heterogeneous to high T2 signal intensity corresponding to fluid levels [40]. 
The “black-box” nature of deep neural networks makes it challenging to explain why our model was able to achieve the correct classification in some of these cases but was similarly misled in others.
It is noteworthy that our model was able to achieve performance on par with the experts without the use of data from T1-weighted contrast-enhanced studies, which were available to the experts in 81 out of 93 lesions in the test set. This may have introduced a bias in favor of the experts. There is also a question of the utility of contrast-enhanced MRI in bone tumor diagnosis. Review of the literature showed that in one study, MR scans with gadolinium did not contribute to differential diagnosis or management in 89% of a cohort of 242 patients with musculoskeletal tumors and tumor mimickers [41]. Contrast imaging did, however, aid in guiding biopsy of bulky lesions and evaluating tumor beds for possible recurrence [41]. Another author who reviewed the use of gadolinium in MR imaging of solitary bone tumors found that the role of contrast imaging is limited outside of directing biopsy and planning tumor resection [42]. By maintaining diagnostic performance without the need for contrast imaging, our model is of utility in contexts where contrast imaging is not readily available to the radiologist (e.g., incidentally discovered lesions). Moreover, for intentional evaluation of suspicious bone lesions, protocols that are sufficiently informative without the use of contrast enhancement would be of significant benefit to the pediatric radiology community. Given the pain-related anxiety that can be provoked by IV placement and the unknown risks of gadolinium deposition in children, elimination of contrast imaging for bone lesion assessment could be valuable [43,44]. The present study represents a first step towards a validated computational method for this purpose. 
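To make the voting ensemble described earlier in this Discussion concrete, the following minimal Python sketch combines per-lesion malignancy probabilities from three member models (a T1-weighted image model, a T2-weighted image model, and the clinical logistic regression) by majority vote. It is an illustration only: the text does not state whether hard or soft voting was used, so the hard majority rule, the 0.5 thresholds, the function name `majority_vote`, and the example probabilities are all assumptions rather than the published implementation.

```python
from typing import Sequence


def majority_vote(probabilities: Sequence[float], thresholds: Sequence[float]) -> int:
    """Return 1 (malignant) if most member models vote malignant, else 0 (benign).

    Hypothetical helper: each member votes malignant when its predicted
    probability of malignancy reaches its own threshold.
    """
    if len(probabilities) != len(thresholds):
        raise ValueError("one threshold per member model is required")
    votes = [int(p >= t) for p, t in zip(probabilities, thresholds)]
    # Strict majority; with three members this means at least two votes.
    return int(sum(votes) * 2 > len(votes))


# Made-up probabilities for one lesion, in the order:
# T1-weighted image model, T2-weighted image model, clinical model.
example_probs = [0.81, 0.42, 0.67]
example_thresholds = [0.5, 0.5, 0.5]  # assumed operating points
prediction = majority_vote(example_probs, example_thresholds)
print("malignant" if prediction == 1 else "benign")
```

With an odd number of members a strict majority always exists, which may be one practical reason to pair two image models with a single clinical model.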
The rarity of bone tumors presented a challenge in compiling a dataset that could effectively power the training of a deep neural network for this task. While the size of our dataset is larger than many others that have been used for tumor classification tasks, it is still orders of magnitude smaller than datasets that have been used for other medical image characterization tasks. A larger dataset could also allow for granular classification beyond binary, such as differentiating between types of bone sarcomas or other histopathological diagnoses.
The study was limited by the need to perform manual lesion segmentation prior to analysis using our deep learning method. Manual segmentation precludes the creation of a fully automated lesion characterization pipeline. Given the variability in bone lesion location and the non-uniform shape of bones based on anatomical location, automated bone lesion segmentation would ostensibly be a much more challenging task than, for example, automated segmentation of the breast or brain tumors on MRI, both of which have been previously demonstrated [45,46]. While automated segmentation of specific osseous structures has been demonstrated, such as the proximal and distal femur and the proximal tibia, fully automated segmentation of bone lesions has not been reported [47,48]. One study employed a semiautomatic segmentation technique for bone sarcomas on MRI that involved manual segmentation of slices at the extremes and the middle of the volume, followed by an interpolation to create the final segmented volume [49]. Achieving automated bone lesion segmentation will likely be critical to widespread adoption of deep learning techniques for lesion classification in clinical practice.
In summary, we have developed a deep learning model that can evaluate the malignancy of bone lesions with similar accuracy and improved specificity in comparison to expert evaluators. 
Future studies will seek to combine radiograph and MRI findings in the development of classification models using deep learning and accompanying radiologist interpretation, as this is the clinical standard of care for suspicious bone lesions. In addition, a future study with a larger cohort may allow for classification of bone lesions by specific diagnosis. Finally, the development of a fully automated bone lesion classification tool would be facilitated by establishing an automated segmentation technique and utilizing a tool to automatically query for and extract relevant imaging studies from hospitals’ picture archiving and communication systems (PACS), such as the DICOM Image Analysis and Archive (DIANA) system previously developed by our group [50]. At present, this work demonstrates the promise of deep learning to aid radiologists in characterizing the malignancy of bone lesions with improved certainty.
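The Discussion notes that adjusting the thresholding approach could yield models biased towards high specificity. One way this might be done is sketched below in plain Python: scan candidate thresholds on a held-out validation set and keep the lowest one whose specificity meets a target, which preserves as much sensitivity as possible. The function names, the 0.75 target, and the toy labels and probabilities are hypothetical and are not taken from the study or its source code.

```python
from typing import Sequence, Tuple


def specificity_sensitivity(
    labels: Sequence[int], probs: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Specificity and sensitivity when probs >= threshold is called malignant (1)."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        predicted = int(p >= threshold)
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both benign and malignant cases are required")
    return tn / (tn + fp), tp / (tp + fn)


def lowest_threshold_for_specificity(
    labels: Sequence[int], probs: Sequence[float], target: float
) -> Tuple[float, float, float]:
    """Return (threshold, specificity, sensitivity) for the lowest passing threshold."""
    # Specificity never decreases as the threshold rises, so the first
    # candidate that meets the target also keeps sensitivity highest.
    candidates = sorted(set(probs)) + [float("inf")]
    for t in candidates:
        spec, sens = specificity_sensitivity(labels, probs, t)
        if spec >= target:
            return t, spec, sens
    raise RuntimeError("unreachable: an infinite threshold gives specificity 1.0")


# Toy validation data (0 = benign, 1 = malignant); values are made up.
val_labels = [0, 0, 0, 0, 1, 1, 1, 1]
val_probs = [0.10, 0.30, 0.55, 0.70, 0.40, 0.60, 0.80, 0.90]
threshold, spec, sens = lowest_threshold_for_specificity(val_labels, val_probs, 0.75)
print(threshold, spec, sens)
```

In practice the threshold would be chosen on validation data only and then frozen before evaluation on the test set, so that the reported specificity is not optimistically biased.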

Contributors

HXB, PZ, and RS conceptualized the study. JW, BB, WL, SL and XP contributed to the curation of the data. RS and HXB provided supervision. RS, YH, and YL performed the expert evaluations. DD and JW contributed to project administration. FRE, RS, HXB, and LS developed the study methodology. HXB provided computational resources and contributed to funding acquisition. FRE performed the statistical analyses, software development, data visualization, and wrote the original draft. HXB and FRE have verified the underlying data. All authors contributed to reviewing and editing the final manuscript.

Data sharing statement

MRI imaging volumes and clinical data will not be made publicly available to ensure patient confidentiality, but they are available upon reasonable request to the corresponding author. Source code used to perform this study is available at https://github.com/sopeeweje/Bone-MRI.

Declaration of Competing Interest

The authors have nothing to disclose.
