Minliang He¹, Xuming Wang¹, Yijun Zhao².
Abstract
Musculoskeletal disorders affect the locomotor system and are the leading contributor to disability worldwide. Patients suffer chronic pain and limitations in mobility, dexterity, and functional ability. Musculoskeletal (bone) X-ray is an essential tool for diagnosing these abnormalities. In recent years, deep learning algorithms have increasingly been applied in musculoskeletal radiology and have produced remarkable results. In our study, we introduce a new calibrated ensemble of deep learners for identifying abnormal musculoskeletal radiographs. Our model leverages the strengths of three baseline deep neural networks (ConvNet, ResNet, and DenseNet), which existing deep learning-based approaches in this domain typically employ either directly or as the backbone architecture. Experimental results on the public MURA dataset demonstrate that our proposed model outperforms the three individual models and a traditional ensemble learner, achieving an overall performance of AUC 0.93, accuracy 0.87, precision 0.93, recall 0.81, and Cohen's kappa 0.74. The model also outperforms expert radiologists in three of the seven upper-extremity anatomical regions, with its best performance in the humerus region (AUC: 0.97, accuracy: 0.93, precision: 0.90, recall: 0.97, Cohen's kappa: 0.85). We further apply the class activation map technique to highlight the image areas most important to the model's decisions. Given that radiologist performance ranges from 0.73 to 0.78 in Cohen's kappa, our study provides convincing results supporting the utility of a calibrated ensemble approach for assessing abnormalities in musculoskeletal X-rays.
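The abstract reports AUC, accuracy, precision, recall, and Cohen's kappa. As a minimal sketch of how these standard binary metrics are defined (plain Python, assuming label 1 means abnormal; the function names are hypothetical and not taken from the authors' code), kappa corrects the observed agreement for the agreement expected by chance, and AUC is computed as the probability that a random abnormal image scores higher than a random normal one:

```python
from __future__ import annotations


def confusion_counts(y_true: list[int], y_pred: list[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating 1 as abnormal and 0 as normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Chance agreement: P(pred=1)P(true=1) + P(pred=0)P(true=0).
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}


def roc_auc(y_true: list[int], scores: list[float]) -> float:
    """AUC via pairwise comparison of positive and negative scores (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of images, which is fine for illustration; a rank-based computation gives the same value faster on a full test set.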
Year: 2021 PMID: 33907257 PMCID: PMC8079683 DOI: 10.1038/s41598-021-88578-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Number of images in each anatomical category.
| Region | Train positive* | Train negative* | Validation positive | Validation negative | Test positive | Test negative | Total positive | Total negative |
|---|---|---|---|---|---|---|---|---|
| Elbow | 1734 | 2584 | 272 | 341 | 230 | 235 | 2236 | 3160 |
| Finger | 1710 | 2750 | 258 | 388 | 247 | 214 | 2215 | 3352 |
| Forearm | 583 | 1042 | 78 | 122 | 151 | 150 | 812 | 1314 |
| Hand | 1287 | 3549 | 197 | 510 | 189 | 271 | 1673 | 4330 |
| Humerus | 514 | 593 | 85 | 80 | 140 | 148 | 739 | 821 |
| Shoulder | 3627 | 3673 | 541 | 538 | 278 | 285 | 4446 | 4496 |
| Wrist | 3489 | 4993 | 498 | 772 | 295 | 364 | 4282 | 6129 |
| Total | 12,944 | 19,184 | 1929 | 2751 | 1530 | 1667 | 16,403 | 23,602 |
| Total images per split | 32,128 | | 4680 | | 3197 | | 40,005 | |
*Positive or negative represents abnormal or normal, respectively.
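The row totals, column totals, and split sizes in this table all follow from the per-split counts. A short consistency check, as a sketch with the counts copied from the table (the dictionary and helper names are illustrative only):

```python
# Per-region counts from the table:
# (train+, train-, validation+, validation-, test+, test-), where + = abnormal.
COUNTS = {
    "Elbow": (1734, 2584, 272, 341, 230, 235),
    "Finger": (1710, 2750, 258, 388, 247, 214),
    "Forearm": (583, 1042, 78, 122, 151, 150),
    "Hand": (1287, 3549, 197, 510, 189, 271),
    "Humerus": (514, 593, 85, 80, 140, 148),
    "Shoulder": (3627, 3673, 541, 538, 278, 285),
    "Wrist": (3489, 4993, 498, 772, 295, 364),
}


def region_totals(c):
    """(total positive, total negative) for one region across all three splits."""
    return c[0] + c[2] + c[4], c[1] + c[3] + c[5]


def column_totals(counts):
    """Sum each of the six split columns over all regions."""
    return tuple(sum(col) for col in zip(*counts.values()))


def split_sizes(counts):
    """Number of images (positive + negative) in each split."""
    c = column_totals(counts)
    return {"train": c[0] + c[1], "validation": c[2] + c[3], "test": c[4] + c[5]}
```

For example, the Humerus row sums to 739 abnormal and 821 normal images, and 821 is also the value that makes the negative-image column add up to 23,602.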
Figure 1. Sample training images. The first row shows original images from the MURA dataset. The second row shows images augmented by three sequential operations: a random horizontal flip, a random vertical flip, and a random rotation within a bounded angle range.
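The three augmentation steps in Figure 1 can be sketched in plain Python on a 2D list of grayscale pixels. This is an illustration, not the authors' pipeline: each flip is applied with an assumed probability of 0.5, and the rotation bound `max_degrees` is a hypothetical parameter because the caption does not preserve the exact range. Rotation uses nearest-neighbour inverse mapping about the image centre and fills uncovered pixels with 0.

```python
from __future__ import annotations

import math
import random


def hflip(img: list[list[float]]) -> list[list[float]]:
    """Mirror the image left-right."""
    return [row[::-1] for row in img]


def vflip(img: list[list[float]]) -> list[list[float]]:
    """Mirror the image top-bottom."""
    return [row[:] for row in img[::-1]]


def rotate(img: list[list[float]], degrees: float, fill: float = 0.0) -> list[list[float]]:
    """Rotate about the centre with nearest-neighbour sampling (inverse mapping)."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Map each output pixel back to the source pixel it is sampled from.
            dx, dy = x - cx, y - cy
            sx = round(cos_t * dx + sin_t * dy + cx)
            sy = round(-sin_t * dx + cos_t * dy + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out


def augment(img: list[list[float]], max_degrees: float, rng: random.Random) -> list[list[float]]:
    """Hypothetical pipeline: random h-flip, random v-flip, then a random rotation."""
    if rng.random() < 0.5:  # assumed flip probability
        img = hflip(img)
    if rng.random() < 0.5:
        img = vflip(img)
    return rotate(img, rng.uniform(-max_degrees, max_degrees))
```

In practice a deep learning framework's built-in transforms would replace these loops; the sketch only makes the order of operations explicit.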
Model training statistics.
| Model | Parameters | Batch size | Epochs | Training time** (h) |
|---|---|---|---|---|
| ConvNet | 18,894,578 | 32 | 100 | 12 |
| ResNet | 21,170,050 | 32 | | 24 |
| DenseNet | 12,639,938 | 16 | | 60 |
**Measured on an Intel Core i7-8750H CPU with 16 GB RAM and an NVIDIA GeForce GTX 1070 Max-Q Design GPU.
Figure 2. ROC curves of the five models.
Performance comparison of the five models, overall and for each body part.
| Metric | Model | Overall | Elbow | Finger | Forearm | Hand | Humerus | Shoulder | Wrist |
|---|---|---|---|---|---|---|---|---|---|
| AUC | ConvNet | 0.88 | 0.89 | 0.87 | 0.88 | 0.84 | 0.90 | 0.83 | 0.93 |
| ResNet | 0.92 (5%) | 0.89 | 0.92 | 0.89 | 0.85 | ||||
| DenseNet | 0.91 (4%) | 0.91 | 0.90 | 0.88 | 0.94 | 0.86 | 0.95 | ||
| Res+Dense | 0.90 | ||||||||
| Calibrated | 0.87 | ||||||||
| Accuracy | ConvNet | 0.82 | 0.87 | 0.80 | 0.80 | 0.78 | 0.84 | 0.77 | 0.89 |
| ResNet | 0.86 (4%) | 0.83 | 0.85 | 0.83 | 0.81 | 0.92 | 0.90 | ||
| DenseNet | 0.85 (4%) | 0.83 | 0.84 | 0.81 | 0.87 | 0.81 | 0.89 | ||
| Res+Dense | 0.86 (5%) | 0.89 | 0.84 | 0.83 | 0.81 | 0.90 | 0.83 | ||
| Calibrated | |||||||||
| Precision | ConvNet | 0.86 | 0.83 | 0.87 | 0.85 | 0.89 | 0.88 | 0.78 | 0.92 |
| ResNet | 0.91 (5%) | 0.93 | 0.89 | 0.90 | 0.92 | 0.94 | |||
| DenseNet | 0.90 (4%) | 0.93 | 0.89 | 0.90 | 0.91 | 0.88 | 0.84 | 0.92 | |
| Res+Dense | 0.91 (6%) | 0.95 | 0.90 | 0.95 | 0.89 | 0.89 | 0.85 | 0.95 | |
| Calibrated | |||||||||
| Recall | ConvNet | 0.72 | 0.71 | 0.72 | 0.71 | 0.60 | 0.78 | 0.75 | 0.75 |
| ResNet | 0.78 (8%) | 0.74 | 0.75 | 0.72 | 0.61 | 0.94 | 0.83 | 0.81 | |
| DenseNet | 0.76 (6%) | 0.82 | 0.75 | 0.70 | 0.59 | 0.87 | 0.76 | 0.80 | |
| Res+Dense | 0.77 (8%) | 0.79 | 0.75 | 0.72 | 0.56 | 0.93 | 0.81 | 0.82 | |
| Calibrated | |||||||||
| Cohen’s kappa (κ) | ConvNet | 0.63 | 0.72 | 0.59 | 0.59 | 0.51 | 0.67 | 0.55 | 0.76 |
| ResNet | 0.71 (12%) | 0.64 | 0.70 | 0.65 | 0.59 | 0.84 | 0.80 | ||
| DenseNet | 0.70 (10%) | 0.67 | 0.68 | 0.59 | 0.75 | 0.63 | 0.77 | ||
| Res+Dense | 0.72 (13%) | 0.77 | 0.68 | 0.65 | 0.57 | 0.81 | 0.66 | ||
| Calibrated |
Bold numbers indicate the highest-performing model(s) for each evaluation metric across the eight studies (overall and the seven anatomical regions). Numbers in parentheses give the percentage change in a metric for a model M relative to the baseline ConvNet model. “Res+Dense” is a meta-learner that predicts from the average of the output probabilities of the ResNet and DenseNet models. “Calibrated” is our proposed ensemble model, which trains a designated deep learner for each anatomical region.
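The “Res+Dense” averaging and the percentage-change figures described above can be sketched as follows. The 0.5 decision threshold and the function names are assumptions for illustration; the source does not state the threshold used.

```python
def average_ensemble(prob_a: list[float], prob_b: list[float]) -> list[float]:
    """Average two models' abnormality probabilities, image by image."""
    return [(a + b) / 2 for a, b in zip(prob_a, prob_b)]


def predict(probs: list[float], threshold: float = 0.5) -> list[int]:
    """Label an image abnormal (1) when its probability reaches the assumed threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def percent_change(value: float, baseline: float) -> float:
    """Relative change of a metric over the ConvNet baseline, in percent."""
    return 100.0 * (value - baseline) / baseline
```

For instance, `percent_change(0.92, 0.88)` gives about 4.5, which rounds to the 5% shown for ResNet's overall AUC. Recomputing from the two-decimal values in the table does not always match the printed percentages (DenseNet's AUC gives about 3.4% against the reported 4%), presumably because the reported figures were computed from unrounded scores.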
Figure 3. Sample class activation maps.
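Class activation maps like those in Figure 3 weight the last convolutional feature maps by the classifier weights of the target class. The sketch below assumes the original CAM formulation, M_c(x, y) = Σ_k w_k^c f_k(x, y), where f_k are the K final feature maps and w_k^c the dense-layer weights for class c after global average pooling; the source does not say which CAM variant was used. This formulation applies directly to backbones with a global-average-pooling head, such as standard ResNet and DenseNet. Upsampling the map to the radiograph's resolution is omitted.

```python
from __future__ import annotations


def class_activation_map(feature_maps: list[list[list[float]]],
                         class_weights: list[float]) -> list[list[float]]:
    """CAM for one class: weighted sum of the K final feature maps (each H x W)."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(h):
            for x in range(w):
                cam[y][x] += weight * fmap[y][x]
    return cam


def normalize(cam: list[list[float]]) -> list[list[float]]:
    """Rescale the map to [0, 1] so it can be overlaid as a heat map."""
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    if hi == lo:
        return [[0.0] * len(row) for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]
```

High values mark the regions that contributed most to the chosen class score, which is how such maps highlight suspected abnormalities.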
ConvNet architecture.
| Layer | Filter size | Stride | # of filters / units |
|---|---|---|---|
| Convolutional layer 1 | 3 × 3 | (1,1) | 16 |
| Convolutional layer 2 | 3 × 3 | (1,1) | 16 |
| Max pooling | 2 × 2 | ||
| Convolutional layer 3 | 3 × 3 | (1,1) | 32 |
| Convolutional layer 4 | 3 × 3 | (1,1) | 32 |
| Max pooling | 2 × 2 | ||
| Convolutional layer 5 | 3 × 3 | (1,1) | 64 |
| Convolutional layer 6 | 3 × 3 | (1,1) | 64 |
| Max pooling | 2 × 2 | ||
| Convolutional layer 7 | 3 × 3 | (1,1) | 128 |
| Convolutional layer 8 | 3 × 3 | (1,1) | 128 |
| Max pooling | 2 × 2 | ||
| Convolutional layer 9 | 3 × 3 | (1,1) | 256 |
| Convolutional layer 10 | 3 × 3 | (1,1) | 256 |
| Max pooling | 2 × 2 | ||
| Convolutional layer 11 | 3 × 3 | (1,1) | 512 |
| Convolutional layer 12 | 3 × 3 | (1,1) | 512 |
| Max pooling | 2 × 2 | ||
| Convolutional layer 13 | 3 × 3 | (1,1) | 1024 |
| Convolutional layer 14 | 3 × 3 | (1,1) | 1024 |
| Max pooling | 2 × 2 | ||
| Flatten | |||
| FC 1 | | | 512 |
| FC 2 | | | 256 |
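The layer table fixes the filter counts, but the parameter total also depends on details it does not state. The following sketch counts parameters for the listed layers under explicit assumptions: a 3-channel input, biases on every layer, "same" padding so only the pooling layers shrink the feature maps, 2 × 2 pooling with stride 2, and a hypothetical 2-unit output layer. Under these assumptions it does not reproduce the 18,894,578 parameters reported in the training-statistics table, so the actual configuration evidently differs in some unstated detail.

```python
from __future__ import annotations

# Filter counts of the 14 convolutional layers and units of the FC layers, from the table.
CONV_FILTERS = [16, 16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 512, 1024, 1024]
FC_UNITS = [512, 256]


def conv_params(in_ch: int, out_ch: int, k: int = 3) -> int:
    """Weights plus biases of a k x k convolution."""
    return k * k * in_ch * out_ch + out_ch


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def convnet_summary(input_hw: tuple[int, int], in_channels: int = 3,
                    n_outputs: int = 2) -> tuple[tuple[int, int, int], int]:
    """Return (shape before Flatten, total parameter count) under the stated assumptions."""
    h, w = input_hw
    ch, total = in_channels, 0
    for i, filters in enumerate(CONV_FILTERS):
        total += conv_params(ch, filters)
        ch = filters
        if i % 2 == 1:  # a 2 x 2 max pool follows every second convolution
            h, w = h // 2, w // 2
    n = h * w * ch  # size of the flattened vector
    for units in FC_UNITS + [n_outputs]:
        total += dense_params(n, units)
        n = units
    return (h, w, ch), total
```

With a 224 × 224 input, seven rounds of pooling reduce the maps to 1 × 1 × 1024, and the convolutional layers alone account for 18,876,560 parameters.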
Figure 4. Two types of residual blocks.
Validation-set performance of the baseline models, overall and for each body part.
| Metric | Model | Overall | Elbow | Finger | Forearm | Hand | Humerus | Shoulder | Wrist |
|---|---|---|---|---|---|---|---|---|---|
| AUC | ConvNet | 0.89 | 0.90 | 0.90 | 0.85 | 0.84 | 0.91 | 0.84 | 0.92 |
| ResNet | 0.91 | 0.91 | 0.88 | ||||||
| DenseNet | 0.91 | 0.92 | 0.85 | 0.95 | 0.86 | 0.95 | |||
| Accuracy | ConvNet | 0.84 | 0.85 | 0.83 | 0.84 | 0.78 | 0.83 | 0.79 | 0.87 |
| ResNet | 0.86 | 0.84 | 0.85 | ||||||
| DenseNet | 0.85 | 0.84 | 0.88 | 0.82 | 0.89 | ||||
| Cohen’s kappa (κ) | ConvNet | 0.64 | 0.73 | 0.64 | 0.64 | 0.52 | 0.68 | 0.58 | 0.77 |
| ResNet | 0.72 | 0.67 | 0.67 | ||||||
| DenseNet | 0.71 | 0.65 | 0.59 | 0.75 | 0.63 | 0.77 |
Bold numbers indicate the highest-performing model for each evaluation metric across the eight studies. DenseNet is the designated learner for the Elbow and Forearm studies, and ResNet for the remaining five.
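The calibrated ensemble routes each radiograph to its region's designated learner. A minimal sketch of that routing, using the assignments stated in the note above; the selection helper assumes that a single validation metric decides the designated model for each region, which the source does not specify, and all names are hypothetical:

```python
from __future__ import annotations

from typing import Callable, Mapping

# Designated learner per anatomical region, as stated in the validation-table note.
DESIGNATED = {
    "Elbow": "DenseNet", "Forearm": "DenseNet",
    "Finger": "ResNet", "Hand": "ResNet", "Humerus": "ResNet",
    "Shoulder": "ResNet", "Wrist": "ResNet",
}


def choose_designated(val_scores: Mapping[str, Mapping[str, float]]) -> dict[str, str]:
    """For each region, pick the model with the highest validation score (assumed criterion)."""
    return {region: max(scores, key=scores.get) for region, scores in val_scores.items()}


def calibrated_predict(region: str, image: object,
                       models: Mapping[str, Callable[[object], float]],
                       designated: Mapping[str, str] = DESIGNATED) -> float:
    """Return the abnormality probability from the region's designated model."""
    return models[designated[region]](image)
```

Unlike the “Res+Dense” average, only one model is evaluated per image, so inference cost per radiograph matches a single network.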