
Classification of Melanocytic Lesions in Selected and Whole-Slide Images via Convolutional Neural Networks.

Steven N Hart¹, William Flotte², Andrew P Norgan², Kabeer K Shah², Zachary R Buchan², Taofic Mounajjed², Thomas J Flotte².

Abstract

Whole-slide images (WSIs) are a rich new source of biomedical imaging data. The use of automated systems to classify and segment WSIs has recently come to the forefront of the pathology research community. While digital slides have obvious educational and clinical uses, their most exciting potential lies in the application of quantitative computational tools to automate search tasks, assist in classic diagnostic classification tasks, and improve prognosis and theranostics. An essential step in enabling these advancements is to apply advances in machine learning and artificial intelligence from other fields to previously inaccessible pathology datasets, thereby enabling the application of new technologies to solve persistent diagnostic challenges in pathology. Here, we applied convolutional neural networks to differentiate between two forms of melanocytic lesions (Spitz and conventional). Classification accuracy was 99.0% at the patch level and 92% when applied to WSI. Importantly, when the model was trained without careful image curation by a pathologist, the training took significantly longer and had lower overall performance. These results highlight the utility of augmented human intelligence in digital pathology applications, and the critical role pathologists will play in the evolution of computational pathology algorithms.


Keywords: Bioinformatics; deep learning; dermatology; image analysis

Year: 2019 PMID: 30972224 PMCID: PMC6415523 DOI: 10.4103/jpi.jpi_32_18

Source DB: PubMed Journal: J Pathol Inform

INTRODUCTION

Melanocytic nevi are common and mostly benign, but certain forms can be difficult to classify. Accurate classification of nevi is nevertheless important in feature evaluation for distinguishing nevi from melanoma. The architecture and cytomorphology of different types of nevi vary significantly, and their overlapping characteristics further confound the accurate diagnosis of malignancy. Features of malignant lesions are also found in benign nevi,[1] which makes diagnosis difficult. Depending on the criteria, accurate diagnoses range from 71% to 82%,[2] leading to 17.6% false diagnoses of melanoma.[3] Recently, whole-slide image (WSI) scanners have made it possible to fully digitize pathology slides. In addition to enabling long-term slide preservation and facilitating slide sharing for collaboration or second opinions, digitization of pathology slides allows for the development and utilization of artificial intelligence (AI)-driven diagnostic tools. During microscopic examination, a pathologist uses salient clinical information, pattern matching, and feature recognition (shape, color, structure, etc.) to render a diagnosis. For example, to diagnose melanoma, relevant features may include asymmetry, poor circumscription, predominance of single melanocytes, mitoses, necrosis, and other features. The major objective of this study was to develop a convolutional neural network (CNN) capable of distinguishing between conventional and Spitz nevi. Classifying a subset of melanocytic nevi as conventional or Spitz-type is a difficult but clinically important task. To accomplish this, curated image patches of conventional nevi, Spitz nevi, or nonnevus skin tissue (other) were manually extracted from WSIs by a board-certified dermatopathologist. The curated patches were used to train a CNN for the classification task.

METHODS

We investigated the utility of a CNN to assist in the classification of selected melanocytic lesions as Spitz or conventional. Histologic sections of pigmented lesions were reviewed by two board-certified dermatopathologists and only cases where there was concurrence of diagnoses of conventional and Spitz nevi were utilized. Slides were digitized using an Aperio AT Turbo scanner from Leica Biosystems, with ×40 power. Large sections of representative tissue were curated by an expert dermatopathologist from 300 hematoxylin and eosin (H and E) slides, each containing conventional (n = 150) or Spitz nevi (n = 150) [Figure 1]. The scans from 100 H and E slides (50 conventional and 50 Spitz nevi) were used for the training and validation set. Smaller image patches of 299 × 299 pixels (px) were then derived for conventional nevi (n = 15,868), Spitz nevi (n = 21,468), and other nonnevus skin features (n = 38,374). From these patches, 30% were used exclusively for validation experiments. These image sets were then used to train and validate the deep CNN (Inception V3[4]) using the TensorFlow framework (version: 1.5.0).[5] Models were trained using pretrained weights (available from the TensorFlow website) or entirely from scratch. Using pretrained weights decreases the time to convergence since it reuses the weights that identify sample-agnostic image characteristics such as edges and curves. Training from scratch means that the weights are initially randomized and then adjusted throughout the training process to converge. Training from scratch typically yields higher accuracy but requires more data and compute time to relearn basic features in addition to sample-dependent features (e.g., nuclei, cells, tissue compartments). In both cases, we used the following hyperparameters: RMSprop optimizer, batch size of 32, learning rate of 0.01, and training for 250k steps. At 250k steps, the model observed each image approximately 150 times (epochs).
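The relationship between training steps and epochs stated above can be checked with a short calculation. The following is a minimal sketch, not the authors' code: the function name `approx_epochs` is hypothetical, and it assumes that an epoch is one pass over the 70% of curated patches that were not held out for validation.

```python
# Curated patch counts as reported in the Methods.
PATCH_COUNTS = {"conventional": 15_868, "spitz": 21_468, "other": 38_374}


def approx_epochs(n_patches: int, steps: int, batch_size: int = 32,
                  validation_fraction: float = 0.30) -> float:
    """Approximate number of passes over the training split after `steps` updates.

    Assumes (our reading of the Methods) that the validation patches are
    excluded from training, so one epoch covers (1 - validation_fraction)
    of the patches.
    """
    n_train = n_patches * (1.0 - validation_fraction)
    return steps * batch_size / n_train


total_patches = sum(PATCH_COUNTS.values())
curated_epochs = approx_epochs(total_patches, steps=250_000)
print(f"{total_patches} curated patches, roughly {curated_epochs:.0f} epochs")
```

Under these assumptions the result lands close to the approximately 150 epochs reported in the text.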

Figure 1

Experimental design. (a) Representative examples of image classes. (b) Sample image selection and modeling. Note the “other” class was only available for the curated informative regions

A second experiment was also performed on noncurated image patches representing the entire slide. In this experiment, tissue segments were automatically extracted from the WSI without pathologist input. Successive nonoverlapping 299 × 299 px tiles representing the entire WSI were evaluated for tissue content by converting the red, green, and blue values to gray scale and applying a mean intensity cut-off of >210. Any 299 × 299 px region with sufficient gray scale intensity was considered to possibly contain tissue and was extracted and analyzed. Regions with insufficient gray scale intensity were not considered for the analysis and treated as missing data. Since no human selection occurred, only two prediction classes were available: Spitz and conventional, with n = 611,485 and n = 612,523 image patches, respectively, from the 100 training slides. To effectively compare the results to the curated patch-level classifications, training was performed for 3.6M steps (~135 epochs). Testing was performed using 200 WSI not used during training or validation. Accuracy was measured at the patch level (from the validation patches) and at the WSI level. WSI were classified as either conventional or Spitz by calculating a prediction for all nonoverlapping 299 × 299 px regions with sufficient tissue. Classifications where the classification probability (i.e., logit) was at least 10% higher than the next likely class were used as votes, with the classification label for the entire slide assigned by simple majority (Spitz or conventional; other was ignored). Accuracy for the WSI-label predictions was then assessed for binary classification accuracy using the Caret package (version: 6.0–71)[6] in R (version: 3.2.3).[7] The gold standard for the correct classification was the diagnosis made by the dermatopathologist. All code used for these data is publicly available on GitHub.[8] This work was conducted under approval from the Institutional Review Board at Mayo Clinic.
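The tiling and voting procedure described above can be sketched as follows. This is an illustrative reconstruction rather than the published code: the function names are hypothetical, the gray-scale conversion is taken to be an unweighted channel mean, the "10% higher" rule is read as an absolute probability margin of 0.10, and slide-level ties are broken in favor of conventional. The text does not state the gray-scale convention behind the >210 cut-off, so the comparison direction is left to the caller.

```python
import numpy as np

TILE = 299     # tile edge in px (from the Methods)
CUTOFF = 210   # mean gray-level cut-off (from the Methods)
MARGIN = 0.10  # required lead over the runner-up class (our reading of "10% higher")


def mean_gray(rgb_tile: np.ndarray) -> float:
    """Mean gray level of an RGB tile, using an unweighted channel mean (assumption)."""
    return float(rgb_tile.astype(np.float64).mean())


def candidate_tiles(slide: np.ndarray, keep):
    """Yield (row, col, tile) for nonoverlapping TILE x TILE regions that pass `keep`.

    `keep` is a predicate on the mean gray level; its direction is the
    caller's choice because the source does not state the convention.
    """
    height, width = slide.shape[:2]
    for row in range(0, height - TILE + 1, TILE):
        for col in range(0, width - TILE + 1, TILE):
            tile = slide[row:row + TILE, col:col + TILE]
            if keep(mean_gray(tile)):
                yield row, col, tile


def slide_label(probs: np.ndarray,
                classes=("conventional", "spitz", "other")) -> str:
    """Majority vote over confident patches; votes for "other" are ignored."""
    ordered = np.sort(probs, axis=1)                      # ascending per patch
    confident = (ordered[:, -1] - ordered[:, -2]) >= MARGIN
    winners = probs.argmax(axis=1)[confident]
    counts = {name: int((winners == i).sum()) for i, name in enumerate(classes)}
    # Tie-breaking toward "conventional" is an arbitrary choice for this sketch.
    return "spitz" if counts["spitz"] > counts["conventional"] else "conventional"


# Synthetic example: a 2 x 2 grid of tiles with one dark (tissue-like) tile,
# assuming bright background so that tissue has a mean gray level below CUTOFF.
slide = np.full((2 * TILE, 2 * TILE, 3), 255, dtype=np.uint8)
slide[:TILE, :TILE] = 100
kept = list(candidate_tiles(slide, keep=lambda g: g < CUTOFF))

patch_probs = np.array([
    [0.10, 0.80, 0.10],   # confident Spitz vote
    [0.50, 0.45, 0.05],   # margin below 0.10, not counted
    [0.20, 0.70, 0.10],   # confident Spitz vote
    [0.90, 0.05, 0.05],   # confident conventional vote
])
label = slide_label(patch_probs)
```

In practice the `probs` array would come from the trained network's softmax output for each retained tile.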

RESULTS

Training using the curated image patches took approximately 50 h to complete 250k iterations with 4 GeForce GTX 1080 GPUs. Training accuracy for curated patches reached maximum accuracy (100%) at around epoch 13, whereas the pretrained model only began to converge around epoch 100 [Figure 2]. Training accuracy for the noncurated patches converged around epoch 50. The validation accuracy, however, revealed stark differences in the generalizability of the models. Both the de novo and pretrained networks had high validation accuracy (99.0% and 95.4%, respectively), but the model trained on noncurated patches was unable to learn transferable features, with a final validation accuracy of only 52.3%.

Figure 2

Training and Validation Accuracy. (Left) Training accuracy for each cohort of images and models. The shaded area is the margin of error. (Right) Accuracy of predictions on the validation images

A single classification was applied to an entire slide denoting whether it contained a Spitz or a conventional nevus. For each patch in a given WSI, a prediction was made as to whether that patch was of type “Spitz,” “conventional,” or “other.” Then, the number of patches that were predicted as Spitz or conventional was tallied, and an overall slide prediction was based on whichever category was more abundant. That WSI-level prediction was then compared to the true label of the slide to determine accuracy. The classification accuracy of the 200 whole slides not seen by the training algorithm was 92.0%. Sensitivity was 85% with a specificity of 99%. On a per-class basis, 99 of 100 conventional nevi were classified correctly (99%), compared to only 85% for Spitz nevi. Of the 16 misclassified WSI, 94% were due to Spitz-type lesions being classified as conventional. When further exploring the misclassified slides, a strong edge effect was observed around the decision boundary [Figure 3], meaning that the incorrect calls were primarily driven by small differences in the expected versus observed classes. Examples of correctly and incorrectly predicted WSI are shown in Figure 4.
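The slide-level figures reported above follow from a two-by-two confusion table. The sketch below recomputes them, assuming Spitz is treated as the positive class (which is consistent with the reported sensitivity and specificity) and that the test set held 100 slides of each type; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity, and accuracy from confusion-table counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Spitz as the positive class (assumption): 85 of 100 Spitz slides and
# 99 of 100 conventional slides were classified correctly.
metrics = binary_metrics(tp=85, fn=15, tn=99, fp=1)

misclassified = 15 + 1
spitz_share_of_errors = 15 / misclassified  # Spitz slides called conventional

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
print(f"share of errors that are Spitz called conventional: {spitz_share_of_errors:.0%}")
```

The counts reproduce the 92% accuracy and show that 15 of the 16 errors are Spitz slides labeled as conventional.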

Figure 3

Count of patch predictions from the whole-slide image. For each whole-slide image, the total number of predictions for Spitz and conventional was aggregated. Squares and crosses signify correct classifications. Circles and triangles are misclassified whole-slide images. Notice that the majority of misclassified images reside near the decision boundary (solid line)

Figure 4

Example classification of the whole-slide image. Each of these images shows an example of correct (left) or incorrect (right) classification for Spitz (top) and conventional (bottom) nevi types. In the heatmaps adjacent to each image, each pixel is colored to represent the prediction for a particular region. Blue indicates a patch-level classification for “Spitz,” red for “conventional,” and green for “other”

CONCLUSIONS

This work highlights an important lesson when developing algorithms for use by pathologists: involve the pathologist in the design of the assay. The manual curation, though tedious for the clinician, proved to be a valuable contribution to optimizing model performance. By preselecting representative examples of Spitz and conventional nevi, along with providing examples for nondiagnostic areas such as hair follicles, sweat glands, and tissue artifacts, the model was able to learn faster and achieved an overall higher accuracy on the training and validation sets with fewer examples. The number of curated images was about 16-fold smaller than in the noncurated approach, but training on them was more focused on learning the salient features for discrimination and took less time: only 50 h versus 800 h. Given a small image patch, the algorithm will predict the correct classification 99% of the time. However, there are several important caveats. At present, the classifier does not achieve high enough accuracy with undirected evaluation of all image patches extracted from WSIs to be reliable for clinical use. Our data show that the accuracy of a single call for a WSI is 92%. This is a major limitation since this would be the expected workflow in clinical practice. These errors are predominantly found around the decision boundary between the number of patches counted as either Spitz or conventional. More sophisticated methods will be needed to improve classification accuracy at the whole-slide level. Alternatively, more work could be done to improve the patch-level accuracy (currently at 99%), which would decrease the number of false calls in a WSI. Given that each WSI generates about 15,000 image patches, a 1% error rate would result in 150 false-positive calls, which on its face does not seem alarming. However, ~75% of those fall into the “other,” noninformative classification, so even a few incorrect assertions can have a moderate influence on the final classification. Additional work on refining that initial classification or on developing a secondary machine learning framework for results interpretation is necessary to decrease the error rate of diagnostic classification of Spitz versus conventional nevi. These data provide strong evidence for the potential utility of AI to enhance diagnosis in digital pathology.
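The error arithmetic in this paragraph can be made concrete with a small calculation. This is only a sketch under stated assumptions: it reads "~75% of those" as referring to the roughly 15,000 patches per slide, and it takes the worst case in which every erroneous patch call lands among the informative (Spitz or conventional) votes. The variable and function names are hypothetical.

```python
def error_budget(patches_per_slide: int, patch_error_rate: float,
                 other_fraction: float) -> dict:
    """Expected erroneous patch calls versus the pool of informative votes."""
    expected_errors = patches_per_slide * patch_error_rate
    informative_votes = patches_per_slide * (1.0 - other_fraction)
    return {
        "expected_errors": expected_errors,
        "informative_votes": informative_votes,
        # Worst case (assumption): all errors fall among the informative votes.
        "worst_case_error_share": expected_errors / informative_votes,
    }


budget = error_budget(patches_per_slide=15_000,
                      patch_error_rate=0.01,
                      other_fraction=0.75)
print(budget)
```

Under this reading, about 150 erroneous calls compete with roughly 3,750 informative votes, so a slide whose Spitz and conventional tallies are already close could be tipped by patch-level errors, which is in line with the edge effect noted in the Results.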

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

3 in total

1. Histological features used in the diagnosis of melanoma are frequently found in benign melanocytic naevi.

Authors: C Urso; F Rongioletti; D Innocenzi; D Batolo; S Chimenti; P L Fanti; R Filotico; R Gianotti; M Lentini; C Tomasini; M Pippione
Journal: J Clin Pathol Date: 2005-04 Impact factor: 3.411

2. Inter-observer variation in the histopathological diagnosis of clinically suspicious pigmented skin lesions.

Authors: Lieve Brochez; Evelien Verhaeghe; Edouard Grosshans; Eckhart Haneke; Gérald Piérard; Dirk Ruiter; Jean-Marie Naeyaert
Journal: J Pathol Date: 2002-04 Impact factor: 7.996

3. Sensitivity, specificity, and diagnostic accuracy of three dermoscopic algorithmic methods in the diagnosis of doubtful melanocytic lesions: the importance of light brown structureless areas in differentiating atypical melanocytic nevi from thin melanomas.

Authors: Giorgio Annessi; Riccardo Bono; Francesca Sampogna; Tullio Faraggiana; Damiano Abeni
Journal: J Am Acad Dermatol Date: 2007-02-20 Impact factor: 11.527

