Bernardo C Bizzo1,2, Shadi Ebrahimian2, Mark E Walters1, Mark H Michalski1,2, Katherine P Andriole1,3, Keith J Dreyer1,2, Mannudeep K Kalra1,2, Tarik Alkasab1,2, Subba R Digumarthy2.
Abstract
A standardized objective evaluation method is needed to compare machine learning (ML) algorithms as these tools become available for clinical use. Therefore, we designed, built, and tested an evaluation pipeline with the goal of normalizing performance measurement of independently developed algorithms, using a common test dataset of our clinical imaging. Three vendor applications for detecting solid, part-solid, and groundglass lung nodules in chest CT examinations were assessed in this retrospective study using our data-preprocessing and algorithm assessment chain. The pipeline included tools for image cohort creation and de-identification; report and image annotation for ground-truth labeling; server partitioning to receive vendor "black box" algorithms and to enable model testing on our internal clinical data (100 chest CTs with 243 nodules) from within our security firewall; model validation and result visualization; and performance assessment calculating algorithm recall, precision, and receiver operating characteristic (ROC) curves. Algorithm true positives, false positives, false negatives, recall, and precision for detecting lung nodules were as follows: Vendor-1 (194, 23, 49, 0.80, 0.89); Vendor-2 (182, 270, 61, 0.75, 0.40); Vendor-3 (75, 120, 168, 0.32, 0.39). The AUCs for detection of solid (0.61-0.74), groundglass (0.66-0.86) and part-solid (0.52-0.86) nodules varied between the three vendors. Our ML model validation pipeline enabled testing of multi-vendor algorithms within the institutional firewall. Wide variations in algorithm performance for detection as well as classification of lung nodules justify the premise for a standardized objective ML algorithm evaluation process.
Year: 2022 PMID: 35486572 PMCID: PMC9053776 DOI: 10.1371/journal.pone.0267213
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Flowchart of patient population in the study.
It summarizes the ground-truthing of pulmonary nodules by three radiologists.
Fig 2. Infrastructure of validation pipeline for AI algorithms assessed in our study.
Fig 3. Vendor output data schema defined in Extensible Markup Language (XML).
Fig 4. Algorithm evaluation results viewed using image annotation-visualization tool for comparison to ground-truth.
Note that patient protected health information (PHI) has been changed for de-identification purposes.
Fig 5. Image annotation-visualization tool showing zoom into features of vendor-detected nodules.
Lung nodule detection performance by vendor.
| Metric | Vendor 1 | Vendor 2 | Vendor 3 |
|---|---|---|---|
| True positives | 194 | 182 | 75 |
| False positives | 23 | 270 | 120 |
| False negatives | 49 | 61 | 168 |
| Recall | 0.80 | 0.75 | 0.32 |
| Precision | 0.89 | 0.40 | 0.39 |
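The recall and precision rows follow directly from the true-positive, false-positive, and false-negative counts. The snippet below is a minimal sketch of that arithmetic, not the study's pipeline code; the function and variable names are illustrative assumptions, and only the counts come from the table.

```python
# Minimal sketch: recompute recall and precision from the reported counts.
# Function and variable names are illustrative, not taken from the study.

def recall(tp: int, fn: int) -> float:
    """Fraction of ground-truth nodules that the algorithm detected."""
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Fraction of algorithm detections that were real nodules."""
    return tp / (tp + fp)

# (true positives, false positives, false negatives) per vendor, from the table
counts = {
    "Vendor 1": (194, 23, 49),
    "Vendor 2": (182, 270, 61),
    "Vendor 3": (75, 120, 168),
}

metrics = {
    name: (recall(tp, fn), precision(tp, fp))
    for name, (tp, fp, fn) in counts.items()
}

for name, (r, p) in metrics.items():
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
```

Recomputed values can differ from the published figures in the last decimal place, depending on how the original numbers were rounded.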
Distribution of lung nodule features in the cohort.
| Variable | # of nodules (n = 243) | Frequency |
|---|---|---|
| Location | | |
| RUL | 63 | 25.92% |
| RLL | 52 | 21.39% |
| RML | 16 | 6.58% |
| LUL | 45 | 18.51% |
| LLL | 61 | 25.10% |
| Lingula | 6 | 2.46% |
| Nodule type | | |
| Solid | 127 | 52.26% |
| Ground glass | 76 | 31.27% |
| Part-solid | 40 | 16.46% |
| | | |
| Yes | 9 | 3.70% |
| No | 234 | 96.29% |
*RUL: right upper lobe; RLL: right lower lobe; RML: right middle lobe; LUL: left upper lobe; LLL: left lower lobe
Stratified summary statistics of the three AI algorithms for detection of solid (3A, n = 127), ground-glass (3B, n = 76), and part-solid (3C, n = 40) lung nodules.
| Metric | Vendor 1 | Vendor 2 | Vendor 3 |
|---|---|---|---|
| 3A: Solid nodules (n = 127) | | | |
| True positives | 105 | 96 | 53 |
| False positives | 16 | 184 | 82 |
| False negatives | 22 | 31 | 74 |
| True negatives | 29 | 6 | 14 |
| Sensitivity | 0.83 | 0.76 | 0.42 |
| Recall | 0.82 | 0.75 | 0.41 |
| Precision | 0.86 | 0.34 | 0.39 |
| 3B: Ground-glass nodules (n = 76) | | | |
| True positives | 53 | 50 | 1 |
| False positives | 11 | 154 | 53 |
| False negatives | 23 | 26 | 75 |
| True negatives | 34 | 4 | 17 |
| Sensitivity | 0.70 | 0.66 | 0.01 |
| Recall | 0.69 | 0.65 | 0.01 |
| Precision | 0.82 | 0.24 | 0.01 |
| 3C: Part-solid nodules (n = 40) | | | |
| True positives | 36 | 36 | 21 |
| False positives | 12 | 107 | 52 |
| False negatives | 4 | 4 | 19 |
| True negatives | 53 | 8 | 31 |
| Sensitivity | 0.90 | 0.90 | 0.53 |
| Recall | 0.90 | 0.90 | 0.52 |
| Precision | 0.75 | 0.25 | 0.28 |
Summary of areas under the curve with 95% confidence interval (AUC 95% CI) for detection of solid, part-solid and ground-glass nodules with the three AI algorithms assessed in our study.
| AUC (95% CI) | Vendor 1 | Vendor 2 | Vendor 3 |
|---|---|---|---|
| Solid nodules | 0.74 (0.54–0.83) | 0.61 (0.54–0.67) | 0.74 (0.68–0.80) |
| Part solid nodules | 0.86 (0.78–0.94) | 0.52 (0.41–0.62) | 0.55 (0.44–0.66) |
| Groundglass nodules | 0.73 (0.63–0.82) | 0.66 (0.58–0.74) | 0.86 (0.80–0.93) |
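This record does not state how the AUCs or their confidence intervals were computed. As a hedged illustration only, the sketch below computes an AUC with the rank-based (Mann-Whitney) formulation, assuming each candidate detection carries a binary ground-truth label and a confidence score. The function name, labels, and scores are all hypothetical and are not study data.

```python
# Minimal sketch of a rank-based (Mann-Whitney) AUC.
# Assumption: each candidate has a binary label and a confidence score.
# The labels and scores below are hypothetical, not data from the study.

def auc_from_scores(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical example: 3 true nodules and 3 non-nodules
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_auc = auc_from_scores(labels, scores)
print(f"AUC = {example_auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a rough scale for reading the values in the table.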