Shih-Cheng Huang, Anuj Pareek, Roham Zamanian, Imon Banerjee, Matthew P. Lungren.
Abstract
Recent advancements in deep learning have led to a resurgence of medical imaging and Electronic Medical Record (EMR) models for a variety of applications, including clinical decision support, automated workflow triage, clinical prediction, and more. However, very few models have been developed to integrate both clinical and imaging data, despite the fact that in routine practice clinicians rely on the EMR for context when interpreting medical imaging. In this study, we developed and compared different multimodal fusion model architectures that are capable of utilizing both pixel data from volumetric Computed Tomography Pulmonary Angiography (CTPA) scans and clinical patient data from the EMR to automatically classify Pulmonary Embolism (PE) cases. The best performing multimodal model is a late fusion model that achieves an AUROC of 0.947 [95% CI: 0.946–0.948] on the entire held-out test set, outperforming imaging-only and EMR-only single-modality models.
Year: 2020 PMID: 33335111 PMCID: PMC7746687 DOI: 10.1038/s41598-020-78888-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Overview of the workflow for this study. We extracted a total of 108,991 studies from Stanford University Medical Center (A) and sampled a subset (B) for manual review (C). 1837 studies remained after screening by two radiologists and were used to train and evaluate our models. Single modality models were created (D) both as baselines for comparison and as components for the fusion models (E).
Figure 2. Fusion model architectures. The seven different fusion architectures used in this study, including (A) Early Fusion, (B) Joint All Fusion, (C) Joint Separate Fusion, (D) Late NN Average, (E) Late Elastic Average, (F) Late Separate Average, and (G) Late Meta. Each input feature modality is color coded. Detailed definitions of each model architecture are provided in the Methods.
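As an illustration of the simplest of these strategies, the following is a minimal sketch of late fusion by probability averaging (the idea behind the Late Average variants): each single-modality model produces a per-study PE probability, and the fused prediction is their (optionally weighted) mean. The function name, weights, and probability values below are hypothetical, not the authors' implementation.

```python
import numpy as np

def late_fusion_average(imaging_probs, emr_probs, weights=(0.5, 0.5)):
    """Fuse per-study PE probabilities from two single-modality models
    by weighted averaging -- the simplest late-fusion strategy."""
    imaging_probs = np.asarray(imaging_probs, dtype=float)
    emr_probs = np.asarray(emr_probs, dtype=float)
    w_img, w_emr = weights
    return (w_img * imaging_probs + w_emr * emr_probs) / (w_img + w_emr)

# Hypothetical predictions for three studies:
fused = late_fusion_average([0.9, 0.2, 0.6], [0.7, 0.1, 0.8])
# fused[0] == 0.8, fused[1] == 0.15, fused[2] == 0.7
```

More elaborate variants in the figure replace the fixed weights with a learned combiner (e.g., an elastic-net or neural-network meta-classifier over the two model outputs).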
Data characteristics of the Stanford Medical Center dataset.
| Category | Sub-category | Overall | Train | Validation | Test |
|---|---|---|---|---|---|
| CTPA exams | Number of studies | 1837 | 1454 | 193 | 190 |
| | Number of patients | 1794 | 1414 | 190 | 190 |
| | Median number of slices (IQR) | 386 (134) | 385 (136) | 388 (132) | 388 (139) |
| Patient demographics | Female | 1048 (57.07%) | 823 (56.64%) | 130 (67.36%) | 99 (52.08%) |
| | Median age (IQR) | 66.14 (53.24–82.40) | 66.13 (53.14–82.95) | 64.10 (50.88–78.38) | 67.24 (56.62–82.76) |
| Race | White | 1101 (59.70%) | 872 (59.80%) | 108 (55.96%) | 121 (62.69%) |
| | Black | 140 (7.59%) | 101 (6.93%) | 22 (11.40%) | 17 (8.80%) |
| | Asian | 144 (7.81%) | 122 (8.37%) | 12 (6.21%) | 10 (5.18%) |
| | Pacific Islander | 13 (0.70%) | 10 (0.69%) | 0 (0.00%) | 3 (1.55%) |
| | Other | 210 (11.39%) | 168 (11.52%) | 25 (12.95%) | 17 (8.80%) |
| | Unknown | 233 (12.63%) | 184 (12.62%) | 24 (12.44%) | 25 (12.95%) |
| Pulmonary embolism | Number of negative PE | 1111 (60.48%) | 946 (65.06%) | 85 (44.04%) | 80 (42.10%) |
| | Number of positive PE | 726 (39.50%) | 508 (34.94%) | 108 (55.96%) | 110 (57.89%) |
| | Central | 257 (35.40%) | 202 (39.76%) | 27 (25.00%) | 28 (25.45%) |
| | Segmental | 387 (53.31%) | 281 (55.31%) | 52 (48.15%) | 54 (49.09%) |
| | Subsegmental | 82 (11.29%) | 25 (4.91%) | 29 (26.85%) | 28 (25.45%) |
| Vitals | BMI (mean ± SD) | 28.37 ± 9.65 | 28.36 ± 10.03 | 27.11 ± 6.78 | 29.60 ± 9.22 |
| | Pulse (mean ± SD) | 81.62 ± 14.99 | 81.53 ± 15.64 | 83.05 ± 11.86 | 80.50 ± 13.06 |
| D-dimer | D-dimer test taken | 580 (30.62%) | 461 (30.90%) | 58 (28.71%) | 61 (30.50%) |
| | D-dimer positive | 496 (26.18%) | 389 (26.07%) | 51 (25.25%) | 56 (28.00%) |
The curated Stanford Medical Center dataset was divided into training, validation, and test sets. The training set was used to optimize model parameters, and the validation set was used to select the best model hyperparameters and operating points. The held-out test set was used to evaluate the models' performance.
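The partition above can be sketched as a simple shuffled split over study indices. This is a simplified illustration only: the fractions are approximate, and in practice splits should be made at the patient level so that no patient's studies leak across sets (note the table reports both study and patient counts). All names below are hypothetical.

```python
import numpy as np

def train_val_test_split(n_studies, val_frac=0.10, test_frac=0.10, seed=0):
    """Shuffle study indices and partition them into train / validation /
    test subsets (roughly 80/10/10, as in the dataset table above)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_studies)
    n_test = int(round(n_studies * test_frac))
    n_val = int(round(n_studies * val_frac))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(1837)
```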
Fusion model architecture experimentation.
| Evaluation metrics | Early fusion | Late NN average | Late elastic average | Late separate average | Late meta | Joint all | Joint separate |
|---|---|---|---|---|---|---|---|
| Operating threshold | 0.345 | 0.473 | 0.414 | 0.483 | 0.197 | 0.500 | 0.517 |
| Accuracy | 0.842 [0.840–0.844] | 0.848 [0.846–0.849] | **[0.884–0.886]** | 0.853 [0.851–0.854] | 0.828 [0.826–0.829] | 0.809 [0.808–0.811] | 0.842 [0.841–0.844] |
| AUROC | 0.899 [0.898–0.901] | 0.895 [0.894–0.897] | **0.947 [0.946–0.948]** | 0.908 [0.906–0.909] | 0.896 [0.895–0.898] | 0.796 [0.794–0.798] | 0.893 [0.891–0.894] |
| Specificity | 0.737 [0.733–0.740] | 0.838 [0.835–0.840] | **[0.900–0.904]** | 0.851 [0.849–0.853] | 0.852 [0.849–0.854] | 0.709 [0.706–0.712] | 0.837 [0.835–0.840] |
| Sensitivity | **[0.918–0.921]** | 0.781 [0.778–0.783] | 0.873 [0.871–0.875] | 0.854 [0.851–0.856] | 0.810 [0.808–0.813] | 0.882 [0.880–0.884] | 0.846 [0.844–0.849] |
| PPV | 0.827 [0.825–0.829] | 0.869 [0.867–0.871] | **[0.923–0.926]** | 0.887 [0.886–0.889] | 0.883 [0.881–0.885] | 0.807 [0.805–0.809] | 0.877 [0.875–0.879] |
| NPV | **[0.867–0.873]** | 0.734 [0.731–0.737] | 0.838 [0.835–0.840] | 0.809 [0.806–0.811] | 0.765 [0.762–0.768] | 0.814 [0.811–0.817] | 0.799 [0.796–0.801] |
Comparison between different fusion strategies. Model performance on the held-out test set with 95% confidence intervals, using the probability threshold that maximizes both sensitivity and specificity on the validation set. Best performance metrics in bold.
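The operating thresholds in the table are chosen on the validation set to maximize both sensitivity and specificity; one standard way to formalize this is Youden's J statistic (J = sensitivity + specificity − 1). A minimal NumPy-only sketch, with hypothetical validation labels and model probabilities (the authors' exact selection procedure is not reproduced here):

```python
import numpy as np

def youden_threshold(y_true, y_prob):
    """Return the probability cut-off that maximizes Youden's
    J = sensitivity + specificity - 1 on a validation set."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    best_t, best_j = 0.5, -1.0
    for t in np.unique(y_prob):  # each predicted probability is a candidate cut-off
        pred = y_prob >= t
        sens = (pred & (y_true == 1)).sum() / max((y_true == 1).sum(), 1)
        spec = (~pred & (y_true == 0)).sum() / max((y_true == 0).sum(), 1)
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t

# Hypothetical validation labels and probabilities:
t = youden_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # → 0.35
```

In practice the candidate thresholds are usually taken from an ROC curve (e.g., scikit-learn's `roc_curve`), which is equivalent to the brute-force scan above.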
Comparison between the multimodal model and the best-performing single-modality models.
| Evaluation metrics | Imaging model (incl. subsegmental) | EMR model (incl. subsegmental) | Late elastic average (incl. subsegmental) | Imaging model (excl. subsegmental) | EMR model (excl. subsegmental) | Late elastic average (excl. subsegmental) |
|---|---|---|---|---|---|---|
| Operating threshold | 0.625 | 0.630 | 0.448 | 0.625 | 0.612 | 0.414 |
| Accuracy | 0.687 [0.685–0.689] | 0.834 [0.832–0.835] | **[0.884–0.886]** | 0.756 [0.754–0.758] | 0.873 [0.871–0.874] | **[0.900–0.903]** |
| AUROC | 0.791 [0.788–0.793] | 0.911 [0.910–0.913] | **0.947 [0.946–0.948]** | 0.833 [0.830–0.835] | 0.921 [0.919–0.923] | **[0.961–0.963]** |
| Specificity | 0.862 [0.860–0.865] | 0.875 [0.872–0.877] | **[0.900–0.904]** | 0.863 [0.861–0.866] | **[0.876–0.880]** | 0.849 [0.847–0.852] |
| Sensitivity | 0.559 [0.557–0.562] | 0.804 [0.801–0.806] | **[0.871–0.875]** | 0.651 [0.647–0.654] | 0.867 [0.865–0.870] | **[0.951–0.954]** |
| PPV | 0.848 [0.846–0.851] | 0.898 [0.896–0.899] | **[0.923–0.926]** | 0.830 [0.827–0.833] | **[0.877–0.882]** | 0.866 [0.864–0.869] |
| NPV | 0.588 [0.585–0.590] | 0.765 [0.761–0.767] | **[0.835–0.840]** | 0.707 [0.705–0.710] | 0.866 [0.864–0.868] | **[0.945–0.948]** |
Model performance on the held-out test set with 95% confidence intervals, using the probability threshold that maximizes both sensitivity and specificity on the validation set. Best performance metrics in bold.
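Confidence intervals like those reported in these tables are commonly obtained by percentile bootstrapping over the test set (the authors' exact procedure is not detailed here). A minimal NumPy-only sketch, using the rank-based (Mann–Whitney) formulation of AUROC; all names are illustrative:

```python
import numpy as np

def auroc(y_true, y_prob):
    """AUROC via the Mann-Whitney formulation: the fraction of
    (positive, negative) pairs ranked correctly, with half credit for ties."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_ci(y_true, y_prob, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC: resample test cases with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip degenerate resamples containing a single class
        scores.append(auroc(y_true[idx], y_prob[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

The same resampling loop yields intervals for the threshold-dependent metrics (accuracy, sensitivity, specificity, PPV, NPV) by scoring each resample at the fixed operating threshold.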
Figure 3. Two selected axial CT images of the chest from two separate patients with a positive diagnosis of PE. The left CT scan demonstrates a left lower lobe posterolateral basal segmental artery filling defect consistent with a pulmonary embolism. The CT scan in the right panel demonstrates a small elongated filling defect bridging across the segmental arteries of the right lower lobe, consistent with a segmental pulmonary embolism, in addition to surrounding collapse of the right lower lobe. The vision-only model yielded false-negative predictions for both cases, but the fusion model correctly predicted both as positive.