Christopher P Bridge, Bernardo C Bizzo, James M Hillis, John K Chin, Donnella S Comeau, Romane Gauriau, Fabiola Macruz, Jayashri Pawar, Flavia T C Noro, Elshaimaa Sharaf, Marcelo Straus Takahashi, Bradley Wright, John F Kalafut, Katherine P Andriole, Stuart R Pomerantz, Stefano Pedemonte, R Gilberto González.
Abstract
Stroke is a leading cause of death and disability. The ability to quickly identify the presence of acute infarct and quantify its volume on magnetic resonance imaging (MRI) has important treatment implications. We developed a machine learning model that used the apparent diffusion coefficient (ADC) and diffusion-weighted imaging (DWI) series. It was trained on 6,657 MRI studies from Massachusetts General Hospital (MGH; Boston, USA). All studies were labelled positive or negative for infarct (classification annotation), with 377 having the region of interest outlined (segmentation annotation). Combining the two annotation types allowed training on more studies without the extensive time needed to manually segment every study. We initially validated the model on studies sequestered from the training set. We then tested the model on studies from three clinical scenarios: consecutive stroke team activations for 6 months at MGH, consecutive stroke team activations for 6 months at a hospital that did not provide training data (Brigham and Women's Hospital [BWH]; Boston, USA), and an international site (Diagnósticos da América SA [DASA]; Brazil). The model results were compared to radiologist ground truth interpretations. The model performed better when trained on classification and segmentation annotations (area under the receiver operating characteristic curve [AUROC] 0.995 [95% CI 0.992-0.998] and median Dice coefficient for segmentation overlap of 0.797 [IQR 0.642-0.861]) than when trained on segmentation annotations alone (AUROC 0.982 [95% CI 0.972-0.990] and Dice coefficient 0.776 [IQR 0.584-0.857]). The model accurately identified infarcts for MGH stroke team activations (AUROC 0.964 [95% CI 0.943-0.982], 381 studies), BWH stroke team activations (AUROC 0.981 [95% CI 0.966-0.993], 247 studies), and at DASA (AUROC 0.998 [95% CI 0.993-1.000], 171 studies). The model accurately segmented infarcts, with Pearson correlation between model output and ground truth volumes ranging from 0.968 to 0.986 across the three scenarios. Acute infarct can be accurately detected and segmented on MRI in real-world clinical scenarios using a machine learning model.
Year: 2022 PMID: 35140277 PMCID: PMC8828773 DOI: 10.1038/s41598-022-06021-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Model design and development: (a) The structure for development and inference of a runtime model including incorporation of both DWI and ADC sequences as well as both classification and segmentation annotations. The shading of voxel level probabilities uses an operating point of 0.5. (b) The training process for a single batch of DWI and ADC pairs from 8 studies. A batch consisted of 2 segmented positive studies, 2 non-segmented positive studies and 4 negative studies, which involved oversampling of segmented studies. A Dice segmentation loss was applied for the segmented positive studies and negative studies using the segmentation output masks. In addition to the segmentation output, a classification output was produced by a global max-pooling operation on the output masks. A binary cross-entropy loss was then applied for all examples in the batch using the classification output.
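The training step in panel (b) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the soft Dice formulation, the smoothing constant `eps`, the per-study averaging, the equal weighting of the two losses, and the names `soft_dice_loss`, `binary_cross_entropy`, `batch_loss` and `has_mask` are all assumptions made for illustration.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between a probability mask and a binary mask.
    The eps smoothing is an assumption; it keeps the loss at 0 when
    both the prediction and the target are empty (negative studies)."""
    intersection = np.sum(pred * target)
    denom = np.sum(pred) + np.sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

def binary_cross_entropy(prob, label, eps=1e-7):
    """Element-wise binary cross-entropy, with clipping for stability."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return -(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob))

def batch_loss(pred_masks, true_masks, labels, has_mask):
    """Combined loss for one batch (hypothetical helper).

    pred_masks: (B, ...) voxel-level probabilities from the model
    true_masks: (B, ...) binary masks; ignored where has_mask is False
    labels:     (B,) 1.0 for positive studies, 0.0 for negative studies
    has_mask:   (B,) True for segmented positives and for negatives
                (whose target mask is all zeros), False for
                non-segmented positives
    """
    # Classification output: global max-pooling over each output mask.
    cls_prob = pred_masks.reshape(len(pred_masks), -1).max(axis=1)
    cls_loss = binary_cross_entropy(cls_prob, labels).mean()

    # Dice loss only where a voxel-level target exists.
    seg_losses = [
        soft_dice_loss(p, t)
        for p, t, m in zip(pred_masks, true_masks, has_mask)
        if m
    ]
    seg_loss = np.mean(seg_losses)

    # Equal weighting of the two terms is an assumption.
    return seg_loss + cls_loss
```

For the batch described in the caption, `has_mask` would be `True` for the 2 segmented positive studies and the 4 negative studies and `False` for the 2 non-segmented positive studies, so the latter contribute only through the classification term.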
Dataset details: the properties of the datasets that were used for model training and testing.
| | Training set (classification) | Training set (segmentation) | Validation set (classification) | Validation set (segmentation) | Primary test set (classification) | Primary test set (segmentation) | Stroke code test set (training hospital) | Stroke code test set (non-training hospital) | International test set |
|---|---|---|---|---|---|---|---|---|---|
| Number of studies | 6657 | 377 | 725 | 34 | 792 | 62 | 381 | 247 | 171 |
| Number of positive studies (%) | 3314 (49.8%) | All | 372 (51.3%) | All | 384 (48.5%) | All | 168 (44.1%) | 128 (50.2%) | 70 (40.9%) |
| Time period of studies | 01/2004–05/2018 | 01/2004–05/2018 | 01/2007–05/2018 | 02/2007–05/2018 | 01/2007–05/2018 | 03/2007–05/2018 | 07/2018–01/2019 | 07/2018–12/2018 | 01/2017–07/2019 |
| Number of studies on female patients (%) | 3445 (51.8%) | 176 (46.7%) | 374 (51.6%) | 17 (50.0%) | 404 (51.0%) | 26 (41.9%) | 193 (50.7%) | 129 (52.2%) | 101 (59.1%) |
| Mean age in years ± standard deviation (range) | 60.7 ± 18.0 (18–104) | 68.1 ± 14.6 (18–102) | 60.8 ± 17.7 (18–101) | 67.4 ± 18.4 (26–96) | 60.5 ± 18.4 (18–102) | 68.2 ± 15.8 (26–99) | 65.9 ± 16.5 (19–98) | 67.6 ± 17.2 (22–97) | 46.7 ± 21.1 (18–95) |
| Median infarct volume in mL (interquartile range; range) | – | 6.42 (0.61–33.28; 0.02–333.06) | – | 6.38 (1.42–36.03; 0.06–276.38) | – | 5.57 (0.78–56.88; 0.03–308.78) | 2.73 (0.47–12.98; 0.04–403.16) | 6.12 (1.03–43.47; 0.10–442.80) | 3.01 (0.75–14.54; 0.07–255.20) |
| Number of studies on GE scanners | 5233 | 349 | 568 | 31 | 618 | 57 | 345 | 52 | Unavailable |
| Number of studies on Siemens scanners | 1424 | 28 | 157 | 3 | 174 | 5 | 36 | 196 | Unavailable |
Results summary: The model results obtained during training and testing.
| Model or test set | AUROC (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | Dice comparison | Median Dice coefficient for region correlation (IQR) | Pearson coefficient for volume correlation |
|---|---|---|---|---|---|---|
| Segmentation studies only (no classification studies) | 0.982 (0.972–0.990) | 95.4% (93.2–97.4%) | 93.8% (91.1–96.2%) | | 0.776 (0.584–0.857) | 0.988 |
| ADC series only (no DWI series) | 0.954 (0.939–0.968) | 85.5% (81.8–89.0%) | 95.2% (92.9–97.3%) | | 0.598 (0.444–0.736) | 0.951 |
| DWI series only (no ADC series) | 0.991 (0.985–0.996) | 95.7% (93.5–97.7%) | 96.9% (94.9–98.6%) | | 0.787 (0.650–0.863) | 0.984 |
| Final model | 0.995 (0.992–0.998) | 96.5% (94.5–98.2%) | 97.5% (95.6–98.9%) | | 0.797 (0.642–0.861) | 0.987 |
| Primary test set | 0.998 (0.995–0.999) | 98.4% (97.1–99.5%) | 98.0% (96.6–99.3%) | | 0.813 (0.727–0.863) | 0.987 |
| Training hospital stroke code test set | | | | | | |
| GE | 0.962 (0.938–0.982) | 88.2% (82.7–93.1%) | 95.3% (92.2–98.0%) | R1 vs M | 0.726 (0.563–0.801) | 0.987 |
| | | | | R2 vs M | 0.705 (0.551–0.792) | |
| | | | | R1 vs R2 | 0.727 (0.590–0.811) | |
| Siemens | 0.997 (0.981–1.000) | 100.0% (100.0–100.0%) | 90.0% (75.0–100.0%) | R1 vs M | 0.727 (0.622–0.810) | 0.994 |
| | | | | R2 vs M | 0.742 (0.594–0.802) | |
| | | | | R1 vs R2 | 0.752 (0.634–0.838) | |
| Overall | 0.964 (0.943–0.982) | 89.3% (84.5–93.9%) | 94.8% (91.7–97.6%) | R1 vs M | 0.726 (0.568–0.803) | 0.968 |
| | | | | R2 vs M | 0.709 (0.551–0.793) | |
| | | | | R1 vs R2 | 0.727 (0.598–0.813) | |
| Non-training hospital stroke code test set | | | | | | |
| GE | 0.988 (0.960–1.000) | 100.0% (100.0–100.0%) | 78.3% (60.0–94.4%) | R1 vs M | 0.660 (0.509–0.811) | 0.978 |
| | | | | R2 vs M | 0.667 (0.468–0.791) | |
| | | | | R1 vs R2 | 0.683 (0.587–0.822) | |
| Siemens | 0.979 (0.960–0.993) | 94.9% (90.2–99.0%) | 88.5% (81.8–94.5%) | R1 vs M | 0.649 (0.461–0.732) | 0.989 |
| | | | | R2 vs M | 0.637 (0.488–0.765) | |
| | | | | R1 vs R2 | 0.681 (0.594–0.755) | |
| Overall | 0.981 (0.966–0.993) | 96.1% (92.3–99.2%) | 86.6% (80.2–92.3%) | R1 vs M | 0.658 (0.480–0.750) | 0.986 |
| | | | | R2 vs M | 0.652 (0.473–0.770) | |
| | | | | R1 vs R2 | 0.682 (0.592–0.770) | |
| International test set | 0.998 (0.993–1.000) | 100.0% (100.0–100.0%) | 98.0% (94.9–100.0%) | R1 vs M | 0.686 (0.503–0.776) | 0.980 |
| | | | | R2 vs M | 0.683 (0.519–0.762) | |
| | | | | R1 vs R2 | 0.714 (0.604–0.813) | |
As there were two ground truth readers for the stroke code and international test sets, there are three Dice coefficients (reader 1 vs model [R1 vs M], reader 2 vs model [R2 vs M] and reader 1 vs reader 2 [R1 vs R2]); the Pearson coefficient for these test sets is calculated for averaged reader volume vs model volume.
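The two agreement measures in this note can be sketched as follows. This is an illustrative NumPy example, not the authors' code: the function names `dice_coefficient` and `volume_pearson` are hypothetical, and returning NaN for two empty masks is an assumption about how that edge case might be handled.

```python
import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two binary masks (e.g. R1 vs M, R2 vs M, R1 vs R2)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Overlap is undefined when both masks are empty (assumption: report NaN).
        return float("nan")
    return 2.0 * np.logical_and(a, b).sum() / total

def volume_pearson(r1_volumes, r2_volumes, model_volumes):
    """Pearson correlation between averaged reader volumes and model volumes."""
    averaged = (np.asarray(r1_volumes, dtype=float)
                + np.asarray(r2_volumes, dtype=float)) / 2.0
    model = np.asarray(model_volumes, dtype=float)
    return float(np.corrcoef(averaged, model)[0, 1])
```

Calling `dice_coefficient` once per pair of masks gives the three per-study values (R1 vs M, R2 vs M, R1 vs R2); the table then reports the median and interquartile range of each across studies.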
Figure 2. Model performance on primary test set: (a) Receiver operating characteristic curve for the primary test set including operating point of 0.5. (b) Volume plot comparing true (radiologist annotated) with predicted (model output) volumes for the primary test set.
Figure 3. Model performance on stroke code test sets: (a, b) Training hospital stroke code receiver operating characteristic curve (a) and volume plot comparing averaged reader volume with model output volume (b; with magnified view of 0–70 mL on the right). (c, d) Non-training hospital stroke code receiver operating characteristic curve (c) and volume plot comparing averaged reader volume with model output volume (d; with magnified view of 0–70 mL on the right).
Figure 4. Model performance based on clinical scenario for stroke code test sets: (a–d) Histograms demonstrating the number of true positive (TP) and false negative (FN) studies for different ground truth volumes (a), NIH Stroke Scales (b), time intervals between last seen well and MRI (c), and time intervals between symptom onset and MRI (d). The images from the false negative studies with ground truth volume > 1 mL are included in Supplementary Fig. S9. As an example of the time intervals, a patient who presents at 8 am having gone to sleep without symptoms at 10 pm and woken with symptoms at 6 am will have time from last seen well of 10 h and time from symptom onset of 2 h.
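The worked example of the two time intervals can be reproduced with a short Python sketch. The calendar dates are hypothetical placeholders, the helper name `hours_between` is illustrative, and the MRI time is assumed to coincide with the 8 am presentation, as the caption's example implies.

```python
from datetime import datetime

def hours_between(earlier, later):
    """Elapsed time between two timestamps, in hours."""
    return (later - earlier).total_seconds() / 3600.0

# Hypothetical dates; only the clock times come from the caption.
last_seen_well = datetime(2018, 7, 1, 22, 0)  # went to sleep without symptoms at 10 pm
symptom_onset = datetime(2018, 7, 2, 6, 0)    # woke with symptoms at 6 am
mri_time = datetime(2018, 7, 2, 8, 0)         # presents at 8 am (assumed MRI time)

time_from_last_seen_well = hours_between(last_seen_well, mri_time)  # 10.0 hours
time_from_symptom_onset = hours_between(symptom_onset, mri_time)    # 2.0 hours
```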
Figure 5. Model performance on international test set: (a) Receiver operating characteristic curve for the international test set. (b) Volume plot comparing averaged reader volume with model output volume for the international test set.