Nan Wu, Jason Phang, Jungkyu Park, Yiqiu Shen, Zhe Huang, Masha Zorin, Stanislaw Jastrzebski, Thibault Fevry, Joe Katsnelson, Eric Kim, Stacey Wolfson, Ujas Parikh, Sushma Gaddam, Leng Leng Young Lin, Kara Ho, Joshua D Weinstein, Beatriu Reig, Yiming Gao, Hildegard Toth, Kristine Pysarenko, Alana Lewin, Jiyon Lee, Krystal Airola, Eralda Mema, Stephanie Chung, Esther Hwang, Naziya Samreen, S Gene Kim, Laura Heacock, Linda Moy, Kyunghyun Cho, Krzysztof J Geras.
Abstract
We present a deep convolutional neural network for breast cancer screening exam classification, trained and evaluated on over 200,000 exams (over 1,000,000 images). Our network achieves an AUC of 0.895 in predicting the presence of cancer in the breast when tested on the screening population. We attribute the high accuracy to a few technical advances. 1) Our network's novel two-stage architecture and training procedure, which allow us to use a high-capacity patch-level network to learn from pixel-level labels alongside a network learning from macroscopic breast-level labels. 2) A custom ResNet-based network used as a building block of our model, whose balance of depth and width is optimized for high-resolution medical images. 3) Pretraining the network on screening BI-RADS classification, a related task with noisier labels. 4) Combining multiple input views in an optimal way among a number of possible choices. To validate our model, we conducted a reader study with 14 readers, each reading 720 screening mammogram exams, and show that our model is as accurate as experienced radiologists when presented with the same data. We also show that a hybrid model, averaging the probability of malignancy predicted by a radiologist with a prediction of our neural network, is more accurate than either of the two separately. To further understand our results, we conduct a thorough analysis of our network's performance on different subpopulations of the screening population, the model's design, training procedure, errors, and properties of its internal representations. Our best models are publicly available at https://github.com/nyukat/breast_cancer_classifier.
Year: 2019 PMID: 31603772 PMCID: PMC7427471 DOI: 10.1109/TMI.2019.2945514
Source DB: PubMed Journal: IEEE Trans Med Imaging ISSN: 0278-0062 Impact factor: 10.048
Fig. 1.Examples of breast cancer screening exams. First row: both breasts without any findings; second row: left breast with no findings and right breast with a malignant finding; third row: left breast with a benign finding and right breast with no findings.
Fig. 2.An example of a segmentation performed by a radiologist. Left: the original image. Right: the image with lesions requiring a biopsy highlighted. The malignant finding is highlighted in red and the benign finding in green.
Number of Breasts With Malignant and Benign Findings Based on the Labels Extracted From the Pathology Reports, Broken Down According to Whether the Findings Were Visible or Occult
| | malignant: visible | malignant: occult | benign: visible | benign: occult |
|---|---|---|---|---|
| | 750 | 107 | 2,586 | 2,004 |
| | 51 | 15 | 357 | 253 |
| | 54 | 8 | 215 | 141 |
| total | 855 (86.8%) | 130 (13.2%) | 3,158 (56.84%) | 2,398 (43.16%) |
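As a quick sanity check on the percentages in the total row above, the visible/occult split can be recomputed from the raw counts (a minimal sketch; the function name is ours, not from the paper):

```python
def visibility_breakdown(visible, occult):
    """Percent of findings that were mammographically visible vs. occult."""
    total = visible + occult
    return round(100 * visible / total, 2), round(100 * occult / total, 2)

# Malignant findings: 855 visible, 130 occult
# visibility_breakdown(855, 130) → (86.8, 13.2)
# Benign findings: 3,158 visible, 2,398 occult
# visibility_breakdown(3158, 2398) → (56.84, 43.16)
```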
Fig. 3.A schematic representation of how we formulated breast cancer exam classification as a learning task. The main task that we intend the model to learn is malignant/not malignant classification. The task of benign/not benign classification is used as an auxiliary task regularizing the network.
Fig. 5.Four model variants for incorporating information across the four screening mammography views in an exam. All variants are constrained to have a total of 1,024 hidden activations between fully connected layers. The ‘view-wise’ model, which is the primary model used in our experiments, contains separate model branches for CC and MLO views; we average the predictions across both branches. The ‘image-wise’ model has a model branch for each image, and we similarly average the predictions. The ‘breast-wise’ model has separate branches per breast (left and right). The ‘joint’ model only has a single branch, operating on the concatenated representations of all four images. Average pooling in all models averages globally across the spatial dimensions of each feature map. When heatmaps (cf. Section IV-B) are added as additional channels to corresponding inputs, the first layers of the columns are modified accordingly.
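The final fusion step of the ‘view-wise’ variant can be sketched in a few lines. This is an illustrative reduction (the function name and plain-list probabilities are our assumptions, not the authors' code), showing only how the CC-branch and MLO-branch predictions are averaged into the exam-level prediction:

```python
def view_wise_prediction(p_cc, p_mlo):
    """'View-wise' fusion: the CC branch and the MLO branch each emit
    their own probability estimates (e.g. malignant/benign per breast);
    the final prediction is the element-wise average of the two branches."""
    return [(a + b) / 2 for a, b in zip(p_cc, p_mlo)]

# e.g. CC branch says [0.25, 0.5], MLO branch says [0.75, 0.25]
# view_wise_prediction([0.25, 0.5], [0.75, 0.25]) → [0.5, 0.375]
```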
Fig. 4.Architecture of single-view ResNet-22. The numbers in square brackets indicate the number of output channels, unless otherwise specified. Left: Overview of the single-view ResNet-22, which consists of a set of ResNet layers. Center: ResNet layers consist of a sequence of ResNet blocks with different downsampling and output channels. Right: ResNet blocks consist of two 3 × 3 convolutional layers, with interleaving ReLU and batch normalization operations, and a residual connection between input and output. Where no downsampling factor is specified for a ResNet block, the first 3 × 3 convolution layer has a stride of 1, and the 1 × 1 convolution operation for the residual is omitted.
Dimensions of Feature Maps After Each Layer in ResNet-22, Shown as H × W × D. D Indicates the Number of Feature Maps; H and W Indicate Spatial Dimensions
| layer | CC view | MLO view |
|---|---|---|
| Conv7×7 | 1339×971×16 | 1487×874×16 |
| ResBlock 1 | 670×486×16 | 744×437×16 |
| ResBlock 2 | 335×243×32 | 372×219×32 |
| ResBlock 3 | 168×122×64 | 186×110×64 |
| ResBlock 4 | 84×61×128 | 93×55×128 |
| ResBlock 5 | 42×31×256 | 47×28×256 |
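The spatial dimensions in the table follow a simple pattern: each ResNet block halves height and width (rounding up), and the channel count doubles from the second block onward. A small sketch (function name ours) reproduces the table starting from the Conv7×7 output dimensions:

```python
import math

def resnet22_feature_dims(h, w, c=16, n_blocks=5):
    """Walk the ResBlock stack: spatial dims halve (ceil) per block,
    channels double starting from the second block."""
    dims = []
    for i in range(n_blocks):
        h, w = math.ceil(h / 2), math.ceil(w / 2)
        if i > 0:
            c *= 2
        dims.append((h, w, c))
    return dims

# CC view: Conv7×7 output is 1339×971×16 (first row of the table)
# resnet22_feature_dims(1339, 971) →
#   [(670, 486, 16), (335, 243, 32), (168, 122, 64), (84, 61, 128), (42, 31, 256)]
```

The same function applied to the MLO Conv7×7 output (1487, 874) reproduces the right-hand column of the table.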
Fig. 6.The original image (left), the ‘malignant’ heatmap over the image (middle) and the ‘benign’ heatmap over the image (right).
Fig. 7.BI-RADS classification model architecture. The architecture is largely similar to the ‘view-wise’ cancer classification model variant, except that the output is a set of probability estimates over the three output classes. The model consists of four ResNet-22 columns, with weights shared within CC and MLO branches of the model.
AUCs of Our Models on Screening and Biopsied Populations. All Models, Except the Ones Indicated With *, Were Pretrained on BI-RADS Classification
| model | screening: single, malignant | screening: single, benign | screening: 5x ensemble, malignant | screening: 5x ensemble, benign | biopsied: single, malignant | biopsied: single, benign | biopsied: 5x ensemble, malignant | biopsied: 5x ensemble, benign |
|---|---|---|---|---|---|---|---|---|
| view-wise (image-only) | 0.827±0.008 | 0.731±0.004 | 0.840 | 0.743 | 0.781±0.006 | 0.673±0.003 | 0.791 | 0.682 |
| view-wise* (image-only) | 0.687±0.009 | 0.657±0.006 | 0.703 | 0.669 | 0.693±0.006 | 0.564±0.006 | 0.709 | 0.571 |
| image-wise (image-only) | 0.830±0.006 | 0.759±0.002 | 0.841 | 0.766 | 0.740±0.007 | 0.638±0.001 | 0.749 | 0.642 |
| breast-wise (image-only) | 0.821±0.012 | 0.757±0.002 | 0.836 | 0.768 | 0.726±0.009 | 0.639±0.002 | 0.738 | 0.645 |
| joint (image-only) | 0.822±0.008 | 0.737±0.004 | 0.831 | 0.746 | 0.780±0.006 | 0.682±0.001 | 0.787 | 0.688 |
| view-wise (image-and-heatmaps) | | 0.747±0.002 | | 0.756 | | 0.690±0.002 | | 0.696 |
| view-wise* (image-and-heatmaps) | 0.856±0.007 | 0.701±0.004 | 0.868 | 0.708 | 0.828±0.008 | 0.633±0.006 | 0.841 | 0.640 |
| image-wise (image-and-heatmaps) | 0.875±0.001 | | 0.885 | 0.774 | 0.812±0.001 | 0.653±0.003 | 0.821 | 0.658 |
| breast-wise (image-and-heatmaps) | 0.876±0.004 | 0.764±0.004 | 0.889 | | 0.805±0.004 | 0.652±0.004 | 0.818 | 0.661 |
| joint (image-and-heatmaps) | 0.860±0.008 | 0.745±0.002 | 0.876 | 0.763 | 0.817±0.008 | | 0.830 | |
Fig. 8.ROC curves [(a), (b), and (e)] and precision-recall curves [(c), (d), and (f)] on the subset of the test set used for the reader study. (a) and (c): curves for all 14 readers; their average performance is highlighted in blue. (b) and (d): curves for hybrids of the image-and-heatmaps ensemble with each single reader; the curve highlighted in blue indicates the average performance of all hybrids. (e) and (f): comparison among the image-and-heatmaps ensemble, the average reader, and the average hybrid.
Fig. 9.AUC (left) and PRAUC (right) as a function of λ ∈ [0, 1) for hybrids between each reader and our image-and-heatmaps ensemble. Each hybrid achieves the highest AUC/PRAUC for a different λ (marked with ◇).
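The hybrids swept in Fig. 9 combine each reader's probability of malignancy with the ensemble's via a convex weight λ (the abstract's hybrid corresponds to a plain average, λ = 0.5). A minimal sketch, with function and argument names ours:

```python
def hybrid_prediction(p_reader, p_model, lam):
    """Convex combination of reader and model malignancy probabilities.
    lam = 1 recovers the reader alone; lam = 0 recovers the model alone."""
    return lam * p_reader + (1 - lam) * p_model

# Plain averaging (λ = 0.5) of a reader at 0.75 and a model at 0.25:
# hybrid_prediction(0.75, 0.25, 0.5) → 0.5
```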
Fig. 10.Two-dimensional UMAP projection of the activations computed by the network for the exams in the reader study. We visualize two sets of activations: (left) concatenated activations from the last layer of each of the four image-specific columns, and (right) concatenated activations from the first fully connected layer in both CC and MLO model branches. Each point represents one exam. Color and size of each point reflect the same information: probability of malignancy predicted by the readers (averaged over the two breasts and the 14 readers).