Yu He1, Ian Pan2, Bingting Bao1, Kasey Halsey2, Marcello Chang3, Hui Liu1, Shuping Peng1, Ronnie A Sebro4, Jing Guan1, Thomas Yi5, Andrew T Delworth6, Feyisope Eweje7, Lisa J States8, Paul J Zhang9, Zishu Zhang1, Jing Wu10, Xianjing Peng11, Harrison X Bai12.
Abstract
BACKGROUND: To develop a deep learning model to classify primary bone tumors from preoperative radiographs and compare performance with radiologists.
Keywords: Convolutional neural network; Deep learning; Plain radiograph; Primary bone tumor
Year: 2020 PMID: 33232868 PMCID: PMC7689511 DOI: 10.1016/j.ebiom.2020.103121
Source DB: PubMed Journal: EBioMedicine ISSN: 2352-3964 Impact factor: 8.143
Demographics for each of the 5 institutions.
| | Institution 1 | Institution 2 | Institution 3 | Institution 4 | Institution 5 |
|---|---|---|---|---|---|
| Hospital type | Adult & Pediatric | Pediatric | Adult & Pediatric | Adult & Pediatric | Pediatric |
| Number of patients | 160 | 333 | 572 | 186 | 105 |
| Age, mean, years (SD) | 40.3 (17.8) | 12.6 (4.8) | 28.4 (17.8) | 31.3 (19.3) | 7.5 (3.9) |
| Sex, male, n (%) | 82 (51.2) | 188 (56.5) | 328 (57.3) | 116 (62.4) | 75 (71.4) |
| Pathology, n (%) | | | | | |
| Benign | 69 (43.1) | 112 (33.6) | 336 (58.7) | 78 (41.9) | 84 (80.0) |
| Intermediate | 35 (21.9) | 78 (23.4) | 143 (25.0) | 45 (24.2) | 16 (15.2) |
| Malignant | 56 (35.0) | 143 (42.9) | 93 (16.3) | 63 (33.9) | 5 (4.8) |
Model performance of 2 formulated binary classification problems: benign vs. not benign and malignant vs. not malignant. 95% confidence intervals for AUCs were obtained via the DeLong method.
| Dataset or subgroup | Task | AUC (95% CI) |
|---|---|---|
| Cross-validation | Not Benign | 0.894 (0.874, 0.912) |
| | Not Malignant | 0.907 (0.886, 0.926) |
| Divided into quartiles by age for subgroup | | |
| Age (<12, n=268) | Not Benign | 0.891 (0.849, 0.928) |
| | Not Malignant | 0.915 (0.870, 0.953) |
| Age (12-18, n=277) | Not Benign | 0.933 (0.903, 0.960) |
| | Not Malignant | 0.933 (0.900, 0.962) |
| Age (19-36, n=263) | Not Benign | 0.897 (0.858, 0.933) |
| | Not Malignant | 0.946 (0.910, 0.975) |
| Age (>36, n=257) | Not Benign | 0.844 (0.849, 0.928) |
| | Not Malignant | 0.819 (0.870, 0.953) |
| External testing | Not Benign | 0.877 (0.833, 0.918) |
| | Not Malignant | 0.916 (0.877, 0.949) |
AUC: area under curve
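
As a rough illustration of how an AUC and a DeLong-style 95% confidence interval can be obtained for one of the binary problems above, here is a minimal sketch. It is not the authors' code. It assumes the three-class output is collapsed one-vs-rest using the predicted probability of the positive class, and it uses the plain normal-approximation interval; the function names (`one_vs_rest`, `delong_auc_ci`) and the toy data are hypothetical.

```python
import math
from statistics import NormalDist, variance


def one_vs_rest(classes, probs, positive):
    """Collapse multi-class labels and probabilities into a binary problem.

    classes: true class names; probs: one dict per case mapping class -> probability.
    Assumption: the score for the binary task is P(positive class).
    """
    labels = [1 if c == positive else 0 for c in classes]
    scores = [p[positive] for p in probs]
    return labels, scores


def _kernel(x, y):
    """Mann-Whitney kernel: 1 if the positive outscores the negative, 0.5 on ties."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def delong_auc_ci(labels, scores, alpha=0.05):
    """AUC with a DeLong variance estimate and a normal-approximation interval."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    m, n = len(pos), len(neg)
    # Structural components: one per positive case and one per negative case.
    v10 = [sum(_kernel(x, y) for y in neg) / n for x in pos]
    v01 = [sum(_kernel(x, y) for x in pos) / m for y in neg]
    auc = sum(v10) / m
    var = variance(v10) / m + variance(v01) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(var)
    # Clip to [0, 1] because the normal approximation can overshoot.
    return auc, max(0.0, auc - half), min(1.0, auc + half)


if __name__ == "__main__":
    # Toy data only; these are not values from the study.
    classes = ["malignant", "malignant", "benign", "intermediate", "benign", "malignant"]
    probs = [
        {"benign": 0.05, "intermediate": 0.05, "malignant": 0.90},
        {"benign": 0.10, "intermediate": 0.10, "malignant": 0.80},
        {"benign": 0.30, "intermediate": 0.20, "malignant": 0.50},
        {"benign": 0.40, "intermediate": 0.30, "malignant": 0.30},
        {"benign": 0.70, "intermediate": 0.10, "malignant": 0.20},
        {"benign": 0.35, "intermediate": 0.25, "malignant": 0.40},
    ]
    y, s = one_vs_rest(classes, probs, "malignant")
    print(delong_auc_ci(y, s))
```

The pairwise loops cost O(mn) time, which is fine for cohorts of this size; faster rank-based formulations of the same estimator exist.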
Fig. 1 Receiver operating characteristic curves for the 2 formulated binary classification problems: benign vs. not-benign (a) and malignant vs. not-malignant (b). Areas under the curve (AUC) for internal cross-validation (CV, red) and external testing (blue) are also included.
Comparison of model performance with subspecialists on cross-validation. For Cohen's kappa scores and categorical accuracy, 95% confidence intervals were generated using 10,000 bootstrap samples. Permutation tests with 10,000 iterations were used to calculate p-values.
| Subgroup | Reader | Accuracy | Cohen's κ (95% CI) | Difference in κ (95% CI) | p-value |
|---|---|---|---|---|---|
| Total | Model | 72.1% | 0.548 (0.504, 0.590) | | |
| | Rater 1 | 74.6% | 0.605 (0.564, 0.644) | 0.057 (0.007, 0.107) | 0.03 |
| | Rater 2 | 72.1% | 0.565 (0.523, 0.607) | 0.017 (-0.034, 0.068) | 0.52 |
| Age (<12, n=268) | Model | 73.9% | 0.557 (0.473, 0.641) | | |
| | Rater 1 | 71.3% | 0.544 (0.464, 0.625) | -0.013 (-0.106, 0.079) | 0.77 |
| | Rater 2 | 73.9% | 0.587 (0.506, 0.666) | 0.030 (-0.069, 0.128) | 0.56 |
| Age (12-18, n=277) | Model | 76.7% | 0.617 (0.537, 0.693) | | |
| | Rater 1 | 77.4% | 0.646 (0.570, 0.721) | 0.029 (-0.065, 0.126) | 0.55 |
| | Rater 2 | 75.6% | 0.615 (0.534, 0.689) | -0.002 (-0.098, 0.094) | 0.96 |
| Age (19-36, n=263) | Model | 75.8% | 0.610 (0.523, 0.692) | | |
| | Rater 1 | 77.8% | 0.653 (0.571, 0.731) | 0.043 (-0.062, 0.148) | 0.43 |
| | Rater 2 | 70.6% | 0.541 (0.451, 0.628) | -0.069 (-0.174, 0.036) | 0.22 |
| Age (>36, n=257) | Model | 62.2% | 0.384 (0.291, 0.473) | | |
| | Rater 1 | 72.1% | 0.558 (0.472, 0.641) | 0.174 (0.065, 0.284) | 0.003 |
| | Rater 2 | 68.3% | 0.499 (0.413, 0.583) | 0.115 (0.004, 0.227) | 0.05 |
Raters 1 and 2 are subspecialists.
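
The statistics named in the caption above (Cohen's κ, bootstrap confidence intervals, and a permutation test on the difference in κ) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes κ is computed between each reader's calls and the pathological reference, that the bootstrap resamples cases with replacement and takes percentile bounds, and that the permutation test randomly swaps the two readers' calls case by case. All function names and the toy labels are hypothetical.

```python
import random
from collections import Counter


def cohen_kappa(truth, pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(truth)
    p_obs = sum(t == p for t, p in zip(truth, pred)) / n
    ct, cp = Counter(truth), Counter(pred)
    p_exp = sum(ct[c] * cp[c] for c in ct) / (n * n)
    if p_exp == 1.0:
        # Degenerate resample with a single class; returning 0.0 is a choice made here.
        return 0.0
    return (p_obs - p_exp) / (1 - p_exp)


def bootstrap_kappa_ci(truth, pred, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(truth)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([truth[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    k = int(alpha / 2 * n_boot)
    return stats[k], stats[n_boot - 1 - k]


def permutation_p_value(truth, pred_a, pred_b, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on kappa(b) - kappa(a)."""
    rng = random.Random(seed)
    observed = cohen_kappa(truth, pred_b) - cohen_kappa(truth, pred_a)
    hits = 0
    for _ in range(n_perm):
        a, b = [], []
        for x, y in zip(pred_a, pred_b):
            if rng.random() < 0.5:  # swap the two readers' calls for this case
                x, y = y, x
            a.append(x)
            b.append(y)
        diff = cohen_kappa(truth, b) - cohen_kappa(truth, a)
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
    return observed, hits / n_perm


if __name__ == "__main__":
    # Toy labels only (B = benign, I = intermediate, M = malignant); not study data.
    truth = list("BBBIIMMMBIMB")
    model = list("BBIIIMMBBIMB")
    rater = list("BBBIMMMMBBMB")
    print(cohen_kappa(truth, model), bootstrap_kappa_ci(truth, model, n_boot=2000))
    print(permutation_p_value(truth, model, rater, n_perm=2000))
```

Swapping calls within a case keeps the pairing between readers intact, which is why the test is described here as paired; other permutation schemes are possible and the source does not say which one was used.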
Comparison of model performance with subspecialists and junior radiologists evaluating uncropped images of the external testing data and stratified by age group. For Cohen's kappa scores and categorical accuracy, 95% confidence intervals were generated using 10,000 bootstrap samples. Permutation tests with 10,000 iterations were used to calculate p-values.
| Subgroup | Reader | Accuracy | Cohen's κ (95% CI) | Difference in κ (95% CI) | p-value |
|---|---|---|---|---|---|
| Total | Model | 73.4% | 0.560 (0.481, 0.639) | | |
| | Rater 1 | 69.3% | 0.483 (0.394, 0.567) | -0.077 (-0.180, 0.021) | 0.14 |
| | Rater 2 | 73.4% | 0.553 (0.468, 0.634) | -0.007 (-0.112, 0.096) | 0.89 |
| | Rater 3 | 73.1% | 0.555 (0.472, 0.633) | -0.005 (-0.115, 0.103) | 0.93 |
| | Rater 4 | 67.9% | 0.430 (0.340, 0.519) | -0.130 (-0.240, -0.020) | 0.02 |
| | Rater 5 | 63.4% | 0.367 (0.285, 0.449) | -0.193 (-0.293, -0.093) | 0.0005 |
| Age (<10, | Model | 74.2% | 0.383 (0.210, 0.542) | | |
| | Rater 1 | 79.4% | 0.478 (0.278, 0.655) | 0.095 (-0.128, 0.314) | 0.41 |
| | Rater 2 | 79.4% | 0.515 (0.334, 0.678) | 0.132 (-0.080, 0.343) | 0.23 |
| | Rater 3 | 79.4% | 0.535 (0.367, 0.695) | 0.152 (-0.080, 0.393) | 0.25 |
| | Rater 4 | 80.4% | 0.448 (0.239, 0.637) | 0.065 (-0.177, 0.314) | 0.61 |
| | Rater 5 | 69.1% | 0.229 (0.064, 0.390) | -0.154 (-0.341, 0.017) | 0.11 |
| Age (10-24, | Model | 77.3% | 0.630 (0.498, 0.755) | | |
| | Rater 1 | 70.1% | 0.496 (0.336, 0.640) | -0.134 (-0.311, 0.038) | 0.13 |
| | Rater 2 | 72.2% | 0.538 (0.392, 0.676) | -0.092 (-0.261, 0.075) | 0.28 |
| | Rater 3 | 77.3% | 0.618 (0.473, 0.749) | -0.012 (-0.183, 0.156) | 0.88 |
| | Rater 4 | 69.1% | 0.450 (0.291, 0.596) | -0.180 (-0.352, -0.011) | 0.045 |
| | Rater 5 | 52.6% | 0.217 (0.085, 0.354) | -0.413 (-0.576, -0.246) | <1.0e-6 |
| Age (>24, | Model | 68.8% | 0.514 (0.366, 0.648) | | |
| | Rater 1 | 58.3% | 0.386 (0.250, 0.521) | -0.128 (-0.304, 0.047) | 0.15 |
| | Rater 2 | 68.8% | 0.526 (0.385, 0.660) | 0.012 (-0.178, 0.200) | 0.89 |
| | Rater 3 | 62.5% | 0.413 (0.263, 0.556) | -0.101 (-0.294, 0.093) | 0.31 |
| | Rater 4 | 54.2% | 0.282 (0.132, 0.429) | -0.232 (-0.426, -0.033) | 0.025 |
| | Rater 5 | 68.8% | 0.479 (0.345, 0.608) | -0.035 (-0.198, 0.137) | 0.71 |
Raters 1 and 2 are subspecialists, while raters 3-5 are junior radiologists.
Fig. 2 Three examples of malignant tumors that were predicted to be not malignant by both the deep learning model and the subspecialists. a, Osteosarcoma in upper left tibia predicted to be benign by the deep learning model (67.1%) and benign by 2 subspecialists. b, Chondrosarcoma in upper right femur predicted to be intermediate by the deep learning model (80.5%) and benign and intermediate by 2 subspecialists. c, Ewing sarcoma in right cuboid bone predicted to be benign by the deep learning model (77.2%) and intermediate by 2 subspecialists.
Fig. 3 Example of a malignant tumor that was predicted to be malignant by the deep learning model and otherwise by the subspecialists. Osteosarcoma in distal right femur, predicted to be malignant by the deep learning model (99.9%) and intermediate by 2 subspecialists.
Fig. 4 Two examples of malignant tumors predicted to be malignant by the subspecialists and otherwise by the deep learning model. a, Ewing sarcoma in left femur diaphysis, predicted to be benign by the deep learning model (95.0%). b, Plasma cell myeloma in T12 vertebral body, predicted to be benign by the deep learning model (81.5%).