Catherine P Jayapandian1, Yijiang Chen2, Andrew R Janowczyk3, Matthew B Palmer4, Clarissa A Cassol5, Miroslav Sekulic6, Jeffrey B Hodgin7, Jarcy Zee8, Stephen M Hewitt9, John O'Toole10, Paula Toro11, John R Sedor12, Laura Barisoni13, Anant Madabhushi14.
1. Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio, USA. Electronic address: cpj3@case.edu.
2. Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio, USA.
3. Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio, USA; Precision Oncology Center, Lausanne University Hospital, Vaud, Switzerland.
4. Department of Pathology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
5. Department of Pathology, Ohio State University, Columbus, Ohio, USA.
6. Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio, USA; Department of Pathology, University Hospitals of Cleveland, Cleveland, Ohio, USA.
7. Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA.
8. Department of Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
9. Laboratory of Pathology, National Institutes of Health, National Cancer Institute, Bethesda, Maryland, USA.
10. Lerner Research and Glickman Urology and Kidney Institutes, Cleveland Clinic, Cleveland, Ohio, USA.
11. Department of Pathology, Universidad Nacional de Colombia, Bogotá, Colombia.
12. Lerner Research and Glickman Urology and Kidney Institutes, Cleveland Clinic, Cleveland, Ohio, USA; Department of Physiology and Biophysics, Case Western Reserve University, Cleveland, Ohio, USA.
13. Department of Pathology and Medicine, Division of Nephrology, Duke University, Durham, North Carolina, USA.
14. Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio, USA; Louis Stokes Cleveland Veterans Administration Medical Center, Cleveland, Ohio, USA.
Abstract
The application of deep learning for automated segmentation (delineation of boundaries) of histologic primitives (structures) from whole slide images can facilitate the establishment of novel protocols for kidney biopsy assessment. Here, we developed and validated deep learning networks for the segmentation of histologic structures on kidney biopsies and nephrectomies. For development, we examined 125 biopsies for Minimal Change Disease collected across 29 NEPTUNE enrolling centers along with 459 whole slide images stained with Hematoxylin & Eosin (125), Periodic Acid Schiff (125), Silver (102), and Trichrome (107) divided into training, validation and testing sets (ratio 6:1:3). Histologic structures were manually segmented (30048 total annotations) by five nephropathologists. Twenty deep learning models were trained with optimal digital magnification across the structures and stains. Periodic Acid Schiff-stained whole slide images yielded the best concordance between pathologists and deep learning segmentation across all structures (F-scores: 0.93 for glomerular tufts, 0.94 for glomerular tuft plus Bowman's capsule, 0.91 for proximal tubules, 0.93 for distal tubular segments, 0.81 for peritubular capillaries, and 0.85 for arteries and afferent arterioles). Optimal digital magnifications were 5X for glomerular tuft/tuft plus Bowman's capsule, 10X for proximal/distal tubule, arteries and afferent arterioles, and 40X for peritubular capillaries. Silver stained whole slide images yielded the worst deep learning performance. Thus, this largest study to date adapted deep learning for the segmentation of kidney histologic structures across multiple stains and pathology laboratories. All data used for training and testing and a detailed online tutorial will be publicly available.
Renal biopsy interpretation remains the gold standard for the diagnosis and staging of native and transplant kidney diseases.[1-3] Although visual morphologic assessment of the renal parenchyma may provide useful information for disease categorization, manual assessment and visual quantification by pathologists are time-consuming and limited by poor intra- and interreader reproducibility.[4-7]
The introduction of digital pathology in nephrology clinical trials[8] has provided an unprecedented opportunity to test machine learning approaches for large-scale tissue quantification efforts. Standardization of pathology material acquisition has allowed worldwide consortia to establish digital pathology repositories containing thousands of digital renal biopsies for the evaluation of kidney diseases in adults and children, across diverse populations and pathology laboratories.[4,9,10] This large-scale quantification, however, presents some new challenges. Unlike cancer pathology where hematoxylin and eosin (H&E) is generally the sole stain employed, renal biopsies require routine special stains such as Jones and periodic acid–methenamine silver (SIL), periodic acid–Schiff (PAS), and Masson trichrome (TRI).[3,11,12] Additionally, the multicenter nature of such consortia is reflected in the heterogeneity of preparations (e.g., integrity of tissue sections and quality of the stains).
Deep learning (DL) is a machine learning approach that recognizes patterns in images through a network of connected artificial neurons. DL uses deep convolutional neural networks (CNNs) that are capable of identifying patterns in complex histopathology data prone to such heterogeneity. U-Net is a popular semantic-based DL network validated in the context of biomedical image segmentation that takes spatial context of pixels into consideration as opposed to naive pixel-level DL classifiers.[13] The output of U-Net is a high-resolution image (typically the same size as the input image) with labeled class predictions at the pixel level.[14-16]
In this study, we evaluated the feasibility of DL approaches for automatic segmentation of 6 renal histologic primitives on 4 stains, using the digital renal biopsies from a multicenter Nephrotic Syndrome Study Network (NEPTUNE) dataset.[9] In addition, we describe annotation and training considerations, specifically as they relate to DL algorithms for digital nephropathology. To the best of our knowledge, this is the largest comprehensive study to address applicability of DL approaches employable for kidney pathology images generated in a multicenter setting.
RESULTS
DL performance per histologic primitive
Glomerular tuft.
The classifier performed consistently across the 4 stains with only marginal differences in F-score and Dice similarity coefficient (DSC). A 5× digital magnification on PAS and H&E stains (Table 1, Figures 1 and 2) resulted in optimal detection and segmentation.
Table 1|
Performance metrics: F, DSC, TPR, and PPV for structurally normal histologic primitives at optimal digital magnification
Figure 1|
Optimally digitally magnified regions of interest.
The optimal magnification varied for each histologic primitive using patch size of 256 × 256 px: periodic acid–Schiff glomerular unit and tuft, original magnification ×5; proximal and distal tubular segment, original magnification ×10; peritubular capillary, original magnification ×40; and arteries/arterioles, original magnification ×10 (not shown).
Figure 2|
Deep learning (DL) segmentation of glomerular tuft and unit.
DL segmentation for glomerular unit and tuft on whole slide images of formalin-fixed and paraffin-embedded sections from minimal change disease, stained with hematoxylin and eosin (H&E), periodic acid–Schiff (PAS), trichrome (TRI), and silver (SIL). For each stain, the original image overlaid with ground truth is presented on the left, and the DL segmentation is presented on the right. The positive classes are highlighted in bright pink from green transparent mask overlaid on original image. The DL output is specifically tracing the Bowman capsule for glomerular unit and the profile of the capillary wall for the glomerular tuft. The glomerular units and tufts were correctly identified across all types of stains.
Glomerular unit.
Performance was consistent across all stains, with F-score and DSC above 0.89. Optimal results for detection and segmentation were obtained using 5× digital magnification on PAS and SIL stains (Table 1, Figures 1 and 2).
Proximal tubular segments.
Segmentation results varied little across the stains (F-score from 0.89 to 0.91, and DSC from 0.88 to 0.95), with PAS, SIL, and TRI stains having better performance than the H&E stain. A 10× magnification was optimal for detection and segmentation across all stains (Table 1, Figures 1 and 3).
Figure 3|
Deep learning (DL) segmentation of proximal and distal tubular segments.
DL segmentation for tubular segments on whole slide images of formalin-fixed and paraffin-embedded sections from minimal change disease, stained with hematoxylin and eosin (H&E), periodic acid–Schiff (PAS), trichrome (TRI), and silver (SIL). For each stain, the original image overlaid with ground truth is presented on the left, and the DL segmentation is presented on the right. The positive classes are highlighted in bright pink from green transparent mask overlaid on original image.
Distal tubular segments.
Segmentation results were highly variable across all the stains: F-scores were 0.78 and 0.81 for H&E and TRI, respectively, and 0.91 and 0.93 for SIL and PAS, respectively. DSC scores were 0.78 and 0.82 for H&E and TRI, and 0.92 and 0.93 for SIL and PAS. Optimal results for detection and segmentation were obtained using 10× digital magnification on PAS and SIL stains (Table 1, Figures 1 and 3).
Arteries/arterioles.
Artery/arteriole segmentation was variable across stains, with F-scores ranging from 0.79 to 0.85 across TRI, H&E, and PAS staining and DSC ranging from 0.85 to 0.90. Optimal results for detection and segmentation were obtained using 10× on PAS stain (Table 1, Figures 1 and 4).
Figure 4|
Deep learning (DL) segmentation of arteries/arterioles and peritubular capillaries.
DL segmentation for arteries/arterioles on whole slide images of formalin-fixed and paraffin-embedded sections from minimal change disease, stained with hematoxylin and eosin (H&E), periodic acid–Schiff (PAS), trichrome (TRI), and silver (SIL), and for peritubular capillaries on whole slide images of formalin-fixed and paraffin-embedded sections stained with PAS. In each case, the original image overlaid with ground truth is presented on the left, and the DL segmentation is presented on the right. The positive classes are highlighted in bright pink from green transparent mask overlaid on original image.
Peritubular capillaries.
Optimal results for detection and segmentation were obtained using 40× magnification on PAS stain (Table 1, Figures 1 and 4). Qualitative segmentation results on the testing cohort show that most of the large-sized peritubular capillaries were thin and long as they were cut tangentially from the biopsy. Although the size, shape, and textural presentation of peritubular capillaries varied (Figure 5a), the U-Net model was able to detect and segment peritubular capillaries of varying sizes and shapes (Figure 5). The classifier tends to perform better on thin and long, small- to medium-sized capillaries. However, capillaries with size less than 40 pixels (167 μm2) failed to be identified or were inaccurately segmented.
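The size dependence described above can be examined by measuring the pixel area of each object in a binary mask. The following is a minimal sketch, not the authors' implementation: it labels connected foreground regions with a breadth-first flood fill and flags objects below the 40-pixel threshold quoted in the text. The function name `component_sizes`, the use of 4-connectivity, and the toy mask are assumptions made for illustration.

```python
from collections import deque


def component_sizes(mask):
    """Return the pixel area of each 4-connected foreground object in a 0/1 mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # Breadth-first flood fill over one object.
                seen[r][c] = True
                queue = deque([(r, c)])
                area = 0
                while queue:
                    y, x = queue.popleft()
                    area += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                sizes.append(area)
    return sizes


MIN_AREA_PX = 40  # size threshold quoted in the text

# Toy mask (assumption): one 8 x 6 object and one 2 x 3 object.
mask = [[0] * 12 for _ in range(12)]
for y in range(1, 9):
    for x in range(1, 7):
        mask[y][x] = 1
for y in range(10, 12):
    for x in range(8, 11):
        mask[y][x] = 1

sizes = component_sizes(mask)
small = [s for s in sizes if s < MIN_AREA_PX]
print("object areas (px):", sizes)
print("below threshold:", small)
```

Applied to ground truth masks, such a filter would list the capillaries most likely to be missed; applied to predicted masks, it would list candidate detections too small to trust.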
Figure 5|
Deep learning (DL) Segmentation performance in relation to the morphologic heterogeneity of peritubular capillaries (PTCs).
(a) Most of the peritubular capillaries were small when measured in number of pixels. The size of the peritubular capillaries has an exponential distribution with a long tail from small to large. Each pixel is 0.06 μm2 on tissue, and as observed, most of the PTCs are under 90 μm2. Examples of DL performance on small (c), medium (b), and large (d,e) PTCs.
Validation of DL models using nephrectomies.
An F-score of 0.93 was obtained for 191 glomerular units, 0.90 for 1484 proximal tubules, 0.93 for 1251 distal tubules, 0.71 for 269 arteries/arterioles (Figure 6), and 0.90 for 3784 peritubular capillaries (Figure 7). The rare globally sclerotic glomeruli and atrophic tubules present in the sections were not segmented by the DL network.
Figure 6|
Deep learning (DL) segmentation of normal histologic primitives on periodic acid–Schiff nephrectomies.
(a) Segmentation of normal glomerular units. (b) Segmentation of proximal (yellow) and distal (green) tubules; rare atrophic tubules were detected by the DL algorithms. (c) Segmentation of arteries/arterioles.
Figure 7|
Segmentation outputs of peritubular capillaries (PTCs) on periodic acid–Schiff (PAS) nephrectomies.
(a) Formalin-fixed and paraffin-embedded sections stained with PAS and CD34 (double stain). (b) Deep learning (DL) segmentation of peritubular capillaries on the same section stained with PAS alone. There is overlap between the CD34 positive stain and the DL detection of peritubular capillaries. Overall, the DL performance was similar to the segmentation accuracy on the testing set for minimal change disease.
DL segmentation performance across sites and artifacts.
See Supplementary Figure S4.
DL performance as a function of number of training exemplars
The rate of improvement of the network performance as a function of the number of training exemplars was observed to be different across histologic primitives. The number of exemplars needed to maximize network performance increases substantially from glomerular tufts to distal tubular segments, arteries/arterioles, and finally to peritubular capillaries (Figure 8). For larger structures such as glomerular tufts, it was observed that only 60 training samples were necessary to achieve an F-score of 0.89, with a 0.02 increase using 183 tufts. For smaller and largely represented structures such as distal tubules, a 0.07 increase in F-score was observed by increasing the number of exemplars from 507 to 2789. For structures such as arteries/arterioles with varying sizes, the F-score increased by 0.13, increasing the number of exemplars from 258 to 864. A significant increase in F-score from 0.27 to 0.81 was observed with peritubular capillaries by increasing the number of exemplars 2.5 times (i.e., from 4273 to 10,975).
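The figures quoted above can be put on a common scale with simple arithmetic. The sketch below uses only the exemplar counts and F-score gains stated in the paragraph and derives two quantities that are not reported in the source: the fold increase in exemplars and the F-score gain per 1000 added exemplars. The variable names are illustrative.

```python
# (primitive, exemplars before, exemplars after, F-score gain), as quoted in the text.
reported = [
    ("glomerular tuft", 60, 183, 0.02),
    ("distal tubule", 507, 2789, 0.07),
    ("artery/arteriole", 258, 864, 0.13),
    ("peritubular capillary", 4273, 10975, 0.81 - 0.27),
]

summary = {}
for name, n_before, n_after, gain in reported:
    fold = n_after / n_before                         # fold increase in exemplars
    gain_per_1000 = 1000 * gain / (n_after - n_before)  # derived, not from the source
    summary[name] = (fold, gain_per_1000)
    print(f"{name:22s} x{fold:.2f} exemplars, +{gain:.2f} F "
          f"({gain_per_1000:.3f} per 1000 added)")
```

The peritubular capillary row reproduces the roughly 2.5-fold increase mentioned in the text (10,975 / 4273 ≈ 2.57).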
Figure 8|
Model performance with increasing number of training annotations.
Number of annotations versus deep learning model performance. The model performance was measured as F-score, Dice similarity coefficient (DSC), true positive rate (TPR), and positive predictive value (PPV). For histologic primitives such as glomerular tufts, only a small number of annotations was required to construct a robust classifier, in contrast to peritubular capillaries, where a larger number of annotations was required. The performance metrics for peritubular capillary segmentation increased linearly as more annotations were added. Arteries/arterioles and distal tubules had intermediate rates of convergence with increasing number of annotations.
DISCUSSION
The assessment of renal biopsy is unique compared with other surgical pathology specimens because of the variety of stains routinely used. Morphologic assessment relies on the quality of the preparations, the pathologists’ expertise in detecting the individual structures and associated changes, and quantitative or semiquantitative metrics used to capture the extent of tissue damage. Visual histologic quantitative assessment such as counting, distribution, and morphometry of certain histologic primitives are known to be robust predictors of outcome for various kidney diseases.[10,17-23] However, quantitative analysis remains a challenge for the human eye. Some of these primitives (e.g., peritubular capillaries) cannot be measured visually or manually and warrant the aid of computational algorithms. Recent studies have suggested that computer vision tools can serve as triage and decision support tools for disease diagnosis with digital pathology.[24-27] Thus, automated image analysis tools need to be implemented and integrated into the pathology workflow for efficient and reliable segmentation of histologic primitives across multiple types of stains. DL segmentation tools could greatly facilitate derivation of not only the visual but also subvisual histomorphometric features (e.g., shape, textural, and graph features) for correlation with diagnosis and outcome.[28-30]
This study attempts to address the challenges of computational renal pathology for large-scale tissue interrogation by providing DL algorithms for thorough annotation of 6 histologic primitives on renal parenchyma of minimal change disease (MCD), using whole slide images (WSIs) of 4 stains and generated across 29 NEPTUNE enrolling centers. 
In the past few years, several studies have demonstrated the utility of DL networks for low-level image analyses (i.e., detection, segmentation, and classification of histologic primitives) and high-level complex prognosis and prediction tasks.[31-35] Our study is the largest, comprehensive DL study of kidney biopsies, presenting algorithms that were developed on different stains and using a large number of annotated images, compared with those previously published. The primary conclusions and significant findings from our work are described next.
Comparison with current literature
The differences between previous studies[36-44] and our contributions are summarized in the Supplementary Figure S6. Previously published studies focus on a single histologic primitive and a single stain. For example, Marsh et al. evaluated CNNs for detection of global glomerulosclerosis in transplant kidney frozen sections stained with H&E[36]; Kanna et al. evaluated CNNs to discriminate normal, segmentally and globally sclerosed glomeruli from trichrome stained formalin-fixed and paraffin-embedded kidney sections[37]; Gallego et al. applied DL to detect glomeruli on PAS-stained sections; Bel et al. demonstrated segmentation of normal and pathologic histologic structures using PAS stained WSIs of nephrectomy cortex tissue.[39] Temerinac-Ott et al. demonstrated a DL approach to improve glomerular detection on 1 staining using results from differently stained sections of the same tissue.[38] Our DL networks on all 4 stains represent a first step for future clinical deployment allowing for the detection, segmentation, and ultimately quantification of several normal histologic primitives in all stains routinely used for diagnostic purposes.
Another critical element that needs to be taken into consideration before DL networks are used at large scale is how they can be applied to heterogeneous datasets. Our DL models were trained and tested on a very heterogeneous set of WSIs with preanalytic variations in tissue acquisition, processing, and slide preparation using 4 stains, thus facilitating the rigorous evaluation of the applicability of the DL approach in a multisite setting.
Different DL approaches have been used for the segmentation of histologic primitives, such as Gadermayr et al.’s application of generative adversarial deep networks for stain-independent glomerular segmentation.[45] Bel et al. employed cycle-consistent generative adversarial networks (cycle-GANs) in DL applications for multicenter stain transformation.[40] Hermsen et al. demonstrated U-Net based segmentation of 7 tissue classes using 40 transplant biopsies on PAS stain.[42] Our approach, in this study, was to develop multiple U-Net based DL networks using optimal digital magnification and varying number of annotations across primitives and stains.
All previous works have used a relatively smaller number of WSIs of renal biopsies/nephrectomies compared with our study (Table 2). The use of a large WSI dataset allowed us to provide insights to pathologists for generating well-annotated training exemplars for each primitive and stain, as well as the number of training exemplars required for best network performance using U-Net CNNs (Figure 8).
Table 2|
DL dataset showing the number of training and testing region of interest images extracted from 459 WSIs of 125 MCD patients and the number of manually segmented annotations for 6 structurally normal histologic primitives
Histologic primitive for DL segmentation | Stain | No. of manual segmentations | No. of images (3000 × 3000 px) extracted from the WSIs
Specificity of the segmentation of the individual histologic primitives and their pathologic variation is critical for the deployment of DL models into clinical practice.[42,43] The DL networks generated in this work are specific to structurally normal histologic primitives, such as those seen in MCD or nephrectomies, and can be applied to both adult and pediatric renal biopsies. When the DL networks were tested on patches of renal parenchyma from nephrectomy specimens, the specificity for the structurally normal histologic primitives was maintained. The DL framework presented in this study will also enable architecting of networks in the future that are specifically focused on automated segmentation and assessment of structurally abnormal histologic primitives and their correlation with clinical outcomes.
DL-based ranking of different stains
Our study suggests that the PAS stain is best suited for identification of structurally normal histologic primitives using the U-Net model. This may be because PAS appears to be consistently more homogeneous across pathology laboratories compared with TRI or SIL. PAS-stained WSIs highlight the basement membranes of different structures, which in turn provides superior definition of the boundary of each single primitive to be segmented. For this reason, PAS was the only stain used for segmentation of peritubular capillaries. On the basis of our results, PAS and H&E stains showed better performance for glomerular tuft and unit segmentation, PAS and TRI for arteries/arterioles, PAS and SIL for tubular segments, and PAS for peritubular capillaries.
Optimal digital magnification for DL models
Our results suggest that with a unified patch size of 256 × 256, optimal magnification for the DL models was 5× for glomeruli, 10× for tubules and vessels, and 40× for capillaries (Figure 1). Interestingly, most of the optimal magnifications were concordant with the magnifications that pathologists tend to use when annotating the individual primitives, except for glomeruli where the pathologists used 15× to 20×. Larger structures such as glomeruli, tubules, and vessels were more precisely segmented by the network at 5× to 10× magnification regardless of the stain. For smaller structures such as peritubular capillaries, larger digital magnification (40×) was required for accurate DL segmentation.
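The trade-off between magnification and field of view at a fixed patch size can be made concrete with a short calculation. The sketch below assumes ROIs stored at 40× as 3000 × 3000 px tiles (as stated in the Methods) and a 256 × 256 px patch, and computes how much of the 40× ROI one patch covers at each magnification. The function name `patch_geometry` and the non-overlapping tiling are assumptions for illustration; the source does not state how patches were sampled from the ROIs.

```python
BASE_MAGNIFICATION = 40   # ROIs were stored at 40x
ROI_SIDE_PX = 3000        # ROI tile side length at 40x
PATCH_SIDE_PX = 256       # unified patch size


def patch_geometry(target_magnification):
    """Describe how a 256 px patch at the target magnification maps onto a 40x ROI."""
    factor = BASE_MAGNIFICATION / target_magnification  # downsampling factor
    roi_side = int(ROI_SIDE_PX / factor)                # ROI side after downsampling
    return {
        "downsample_factor": factor,
        "roi_side_at_target": roi_side,
        # Side length, in 40x pixels, of the tissue area one patch covers.
        "footprint_at_base": int(PATCH_SIDE_PX * factor),
        # Whole patches per ROI side under non-overlapping tiling (assumption).
        "patches_per_side": roi_side // PATCH_SIDE_PX,
    }


for primitive, magnification in [
    ("glomerular tuft/unit", 5),
    ("tubules, arteries/arterioles", 10),
    ("peritubular capillaries", 40),
]:
    print(primitive, patch_geometry(magnification))
```

At 5× a single patch spans 2048 px of the 40× ROI along each side, whereas at 40× it spans only 256 px, which is consistent with larger structures being segmented better at lower magnification.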
DL segmentation performance across sites and artifacts
Heterogeneity of tissue preparation and lack of standardization of the analytics is particularly relevant for multicenter studies, where the pathology material is collected from several laboratories. As expected, heterogeneity in tissue presentation and glass, tissue, and scanning artifacts was observed, each with variable contribution to the DL performance. For example, although in general tissue artifacts had limited impact on the DL networks, the thickness of the section appeared to affect performance. The impact of individual artifacts was also relative to the histologic primitive; for example, glass artifacts showed a slight negative impact on DL performance for arteries/arterioles and proximal tubules. Additionally, there was variability in DL performance across sites, and this variability appeared to be histologic primitive dependent (Supplementary Figure S4).
Our quantitative data validated the intuitive assumption that more exemplars are needed for those primitives that are more difficult to identify visually (i.e., tangentially cut arteries/arterioles or primitives at the edge of the region of interest [ROI]) (Figure 8). For those primitives that were too small or ill defined (i.e., peritubular capillaries), curation and iterative annotation were necessary to improve segmentation accuracy. For segmentation of glomerular tufts, the network converged to maximum accuracy with a small number (60–183) of training exemplars; performance did not improve with inclusion of additional exemplars. For tubules and arteries/arterioles segmentation, the corresponding networks showed marginal to intermediate performance improvement with an increasing number of exemplars. In contrast, a significant increase in F-score and DSC (0.27–0.81) was observed with a 2.5-fold increase in the number of peritubular capillary exemplars; the linear slope of this increase suggests that accuracy could improve further with more exemplars.
Interpreting segmentation results
A few false positives were observed in regions of interest with artifacts (i.e., tissue folds, uneven staining), suggesting the need for digital quality assessment of the slide images prior to invocation of the computational models (Supplementary Figure S4). In a few ROIs, the DL appeared to outperform the pathologists, for example, when a small portion of an artery/arteriole was at the edge of the ROI and was not manually annotated as ground truth by the pathologist because it was visually difficult to detect. This can be explained by the protocol used for segmentation of arteries, where pathologists included only arteries where the wall (tunica media and intima) and lumen were visible and segmented the outer boundary of the tunica media. Thus, the models, trained to detect the tunica media and intima of the arteries, correctly identified small fragments of tunica media (arterial/arteriolar wall tangentially cut) as arteries/arterioles despite the lack of a lumen (Figure 9).
Figure 9|
Examples of false positive and false negative deep learning (DL) segmentations on periodic acid–Schiff (PAS).
(a) Glomerular unit: DL failed to detect a tangentially cut glomerular unit that does not have a typical round shape (red thick arrow). (b) Artery: section artifacts generate false positives (red thick arrows). (c) Arteries: black arrows show 2 arterioles missed by the pathologist but detected by DL. (d) Arteries: pathologists were instructed to segment an artery when the lumen was present; however, DL segmentation detected a tangentially cut artery (thick black arrow) where only the media was visible. (e) Peritubular capillaries: a long peritubular capillary reveals only partial DL segmentation at the pixel level. (f) Peritubular capillaries: DL network for peritubular capillaries detects a few glomerular capillaries (false positive; thick red arrow).
Additionally, tubules in renal biopsy sections are more often seen in transverse than longitudinal sections. The initial classifier missed some longitudinally sectioned tubules, mostly on H&E-stained images, because the tubule boundaries were less sharp, and longitudinally sectioned tubules were underrepresented in the initial training set. To facilitate and improve the process of annotation and the network, the false-negative errors associated with the U-Net segmentation of the tubules were visually identified and manually refined by the pathologist, and the updated annotations were returned to the network. A few small arterioles were also incorrectly identified as distal tubules by the DL algorithm (false positives) during the first iteration. These false-positive annotations were removed by the pathologist upon review of the initial classifier output, and the corrected images were returned to the network for retraining, without changing the experimental setup or the network parameters, to eliminate the false-positive and false-negative errors of the DL algorithm.[45]
In line with current sharing guidelines, with this report, we are making all of our data and accompanying ground truth annotations publicly available for the community. Online supplemental material released as part of this work is anticipated to advance the field of computational renal pathology[46] and provide best practices for generating annotations, augmentations,[47] magnifications and recommended stains to perform segmentation tasks optimally.
In conclusion, this study represents a solid foundation toward invoking machine learning classifiers to aid large-scale tissue quantification efforts and the implementation of machine–human interactive protocols in clinical and pathology workflows. DL segmentation of histologic primitives enables computational derivation of histomorphometric features for enabling biopsy interpretation. Additionally, the framework presented in this work will also pave the way for development of new DL networks in the future that are specifically geared toward (i) abnormal or pathologic histologic primitives (i.e., global and segmental sclerosis, glomerular proliferative features, collecting ducts, veins and peripheral nerves, tubular atrophy, interstitial fibrosis, and arteriosclerosis), (ii) renal cortex and medullary compartments, and (iii) a wider spectrum of diseases. Further, these novel approaches could pave the way for the development of machine learning tools that provide disease prognosis or predict treatment response[24] and even facilitate discovery of clinically actionable, nondestructive computational pathology–based imaging diagnostic biomarkers for kidney diseases.[25,27,48]
METHODS
Case and image dataset selection
This study was conducted using digital renal biopsies from the NEPTUNE digital pathology repository. NEPTUNE is a North American multicenter collaborative consortium with more than 650 adults and children enrolled from 29 recruiting sites (38 pathology laboratories). Only cases with a diagnosis of MCD were included in this study because histologically they are the most similar to normal renal parenchyma. A total of 459 curated WSIs (125 H&E, 125 PAS, 102 SIL, 107 TRI) from 125 MCD renal biopsies were used.[49] Not all cases had all stains available in the digital pathology repository. Up to four WSIs were selected for each patient (1 WSI per stain). From each WSI, approximately 3 to 5 ROIs containing the histologic primitives were randomly selected, inspected by a pathologist, and manually extracted as 3000 × 3000 pixel tiles, then stored as 8-bit red-green-blue (RGB) color images in PNG format at 40× digital magnification. Additional details on digitization and curation of biopsy WSIs can be found in Supplementary Figure S1.
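The abstract states that the data were divided into training, validation, and testing sets at a ratio of 6:1:3. A minimal sketch of such a split is shown below. It assumes the split is made at the case level, so that all ROIs from one biopsy fall in the same set; the source does not say at which level the split was made. The case identifiers, the function name `split_cases`, and the fixed seed are hypothetical.

```python
import random


def split_cases(case_ids, ratios=(6, 1, 3), seed=0):
    """Shuffle case IDs and split them into train/validation/test by the given ratio."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)   # seeded for reproducibility
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


cases = [f"case_{i:03d}" for i in range(125)]   # hypothetical IDs for 125 biopsies
train, val, test = split_cases(cases)
print(len(train), len(val), len(test))
```

Splitting by case rather than by ROI avoids having tiles from the same biopsy in both training and testing sets, which would otherwise inflate test performance.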
Independent validation of the DL models.
Six WSIs from 3 formalin-fixed and paraffin-embedded nephrectomy specimens were included to test the DL network performance for the segmentation of all histologic primitives on adult renal parenchyma without significant structural abnormalities. Sections from the nephrectomy specimens were stained with PAS, scanned into WSIs, and subsequently stained with a CD34 antibody, a marker of endothelial cells, and then rescanned into WSIs. One hundred seventy-five random ROIs (3000 × 3000 pixels) were extracted from the PAS-stained WSIs. The PAS-CD34 double-stained WSIs were used as ground truth for validation of the DL segmentation approach for peritubular capillaries.
Histologic primitives and manual segmentation
Five renal pathologists manually segmented the ROIs to establish the ground truth for the histologic primitives (Table 2). Manual segmentations were generated using an open-source software application.[15] The ground truth annotations were saved as binary masks; that is, each pixel was denoted either as part of a histologic primitive (positive class pixels, expressed as binary 1s) or not (negative class pixels, expressed as binary 0s). Through this process, 30,048 annotations were made by pathologists on 1818 ROIs (Figure 10).
Figure 10 | Ground truth annotation for histologic primitives. Examples of manual annotation on histologic primitives on whole slide images of formalin-fixed and paraffin-embedded sections from minimal change disease, stained with hematoxylin and eosin (H&E), periodic acid–Schiff (PAS), trichrome (TRI), and silver (SIL), and corresponding binary masks (black and white pictures) are shown.
Six histologic primitives were used for this study: glomerular tuft, glomerular unit (tuft + Bowman’s capsule), proximal tubular segments, distal tubular segments, arteries and arterioles, and peritubular capillaries. Consistent and detailed ground truth labels across all training samples can greatly facilitate robust DL performance, especially in segmentation tasks.[24,32,36,50-54] To produce consistent annotations across all images, each histologic primitive and its boundaries were carefully defined, and the annotation procedure for each use case was standardized (Supplementary Figure S2). Furthermore, each annotation generated by a pathologist was reviewed by a second pathologist for quality assessment.
DL experimental pipeline and training methods
DL dataset.
Up to four WSIs per biopsy (one each of H&E, PAS, TRI, and SIL) were used for the segmentation of the glomerular tuft and unit, and of the proximal and distal tubular segments. Peritubular capillaries were segmented using only PAS WSIs, and arteries/arterioles were segmented only in H&E, PAS, and TRI WSIs (Table 2). WSIs were divided at the patient level into training, validation, and testing sets (ratio 6:1:3). The networks were developed using WSIs of both adult and pediatric patients (Supplementary Figure S1). For training of the U-Net network, 5 pathologists annotated 1196 glomerular tufts and units, 4669 proximal and 2285 distal tubular segments, 19,280 peritubular capillaries, and 2261 arteries/arterioles (Table 2).
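As an illustration of what a patient-level 6:1:3 division involves, the following minimal Python sketch assigns every patient to exactly one subset and lets each WSI inherit the subset of its patient, so that no patient contributes images to more than one set. It is not the authors' code: the function names, the patient and WSI identifiers, the random seed, and the rounding of subset sizes are all assumptions made for the example.

```python
import random


def split_patients(patient_ids, ratios=(6, 1, 3), seed=0):
    """Assign each patient to exactly one of train / val / test.

    `ratios` follows the 6:1:3 division described in the text; the seed and
    the rounding rule are illustrative assumptions.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    total = sum(ratios)
    n_train = round(len(ids) * ratios[0] / total)
    n_val = round(len(ids) * ratios[1] / total)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder goes to testing
    }


def assign_wsis(wsi_to_patient, split):
    """Give every WSI the subset of the patient it came from."""
    subset_of = {p: name for name, ids in split.items() for p in ids}
    return {wsi: subset_of[p] for wsi, p in wsi_to_patient.items()}


if __name__ == "__main__":
    # Hypothetical identifiers: 100 patients, two stains each.
    split = split_patients(range(100))
    wsis = {f"p{p}_{stain}": p for p in range(100) for stain in ("HE", "PAS")}
    subsets = assign_wsis(wsis, split)
    print({name: len(ids) for name, ids in split.items()})
```

Splitting on patient identifiers rather than on individual WSIs or ROIs is what keeps differently stained sections of the same biopsy from appearing on both sides of the training and testing boundary.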
Network configuration and training.
A standard U-Net architecture with slightly modified parameters was implemented in the PyTorch framework for training of each use case (Figure 11). Details of the U-Net configuration and training methods, including training set balancing and data augmentation, can be found in Supplementary Figure S3.
Figure 11 | Flowchart of the deep learning (DL) experimental pipeline for each stain and use case. (a) Whole slide images (WSIs) were selected for generation of training, validation, and testing data. (b) Regions of interest were cropped from the original WSIs at 40× digital magnification. (c) Ground truth labels were generated by pathologists for training, and overlapping patches of size 256 × 256 px (0.24 μm/px) containing both image data and ground truth annotation information were cropped from the training and validation images (as shown in black boxes). (d) For each patch, a randomized data augmentation method was applied to account for (i) size variation of primitives, (ii) stain variations, and (iii) tissue variations (e.g., thickness). (e) All the training patches were passed to U-Net on PyTorch for training, and validation patches were used to generate loss and accuracy measures for each trained epoch to evaluate model performance. Finally, the epoch that yielded the lowest loss on the validation data was selected for generation of test results.
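Two steps of this pipeline, cropping overlapping 256 × 256 px patches from a 3000 × 3000 px region of interest and selecting the epoch with the lowest validation loss, can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the stride of 128 px (the amount of overlap is not stated here), the handling of the image border, and all function names are assumptions for illustration.

```python
def patch_origins(length, patch=256, stride=128):
    """Top-left offsets along one axis so that patches cover [0, length).

    The 128 px stride is an assumed value, not one taken from the paper.
    """
    if length < patch:
        raise ValueError("image is smaller than the patch size")
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)  # last patch sits flush with the border
    return origins


def patch_grid(height, width, patch=256, stride=128):
    """All (row, col) top-left corners of overlapping patches in an image."""
    return [(y, x)
            for y in patch_origins(height, patch, stride)
            for x in patch_origins(width, patch, stride)]


def best_epoch(validation_losses):
    """Index of the epoch with the lowest validation loss (first one on ties)."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__)


if __name__ == "__main__":
    corners = patch_grid(3000, 3000)
    print(len(corners), corners[0], corners[-1])
    print(best_epoch([0.91, 0.47, 0.52, 0.49]))  # hypothetical loss values
```

The same corners would be applied to the image and to its binary mask so that each patch keeps its pixel-aligned ground truth.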
Detection and segmentation metrics.
Detection and segmentation results were evaluated using F-Score, true positive rate (TPR), positive predictive value (PPV), and DSC.[55-57] Values of 0 and 1 represent the maximal discordance and agreement, respectively, between the pathologist ground truth and the U-Net results. TPR, PPV, and F-Score measure the detection accuracy of the DL networks. These metrics are computed using the number of correct segmentation results (true positives), incorrect segmentations (false positives), and missing segmentations (false negatives). DSC is the pixel-wise spatial overlap index that measures the segmentation accuracy of the classifier, with values ranging from 0 (indicating no spatial overlap between ground truth annotation and corresponding DL output mask) to 1 (indicating complete overlap), and a DSC value >0.5 denoting a correct segmentation (true positive).
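The following Python sketch shows one way these metrics could be computed; it is an illustration under stated assumptions, not the evaluation code used in the study. DSC is computed as twice the pixel overlap divided by the total number of positive pixels in both masks, and TPR, PPV, and F-Score follow from the true-positive, false-positive, and false-negative counts. The one-to-one matching of each ground truth object to a predicted object, the convention that two empty masks give a DSC of 1, and all function names are assumptions.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient of two equally sized binary masks (lists of rows)."""
    a = [p for row in mask_a for p in row]
    b = [p for row in mask_b for p in row]
    overlap = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    # Assumed convention: two empty masks agree completely.
    return 2.0 * overlap / total if total else 1.0


def count_detections(dsc_per_gt_object, n_predicted, threshold=0.5):
    """Turn per-object DSC values into TP / FP / FN counts.

    Assumes each ground truth object has already been matched to at most one
    predicted object; a DSC above the threshold counts as a true positive.
    """
    tp = sum(1 for d in dsc_per_gt_object if d > threshold)
    fn = len(dsc_per_gt_object) - tp
    fp = n_predicted - tp
    return tp, fp, fn


def detection_scores(tp, fp, fn):
    """True positive rate, positive predictive value, and F-Score."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f_score = 2 * tpr * ppv / (tpr + ppv) if tpr + ppv else 0.0
    return {"TPR": tpr, "PPV": ppv, "F": f_score}


if __name__ == "__main__":
    truth = [[1, 1], [0, 0]]
    output = [[1, 0], [0, 0]]
    print(dice(truth, output))
    print(detection_scores(*count_detections([0.9, 0.6, 0.3], n_predicted=4)))
```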
Number of training exemplars for different histologic primitives
To test how the number of manually annotated training exemplars influences network performance, we selected a representative set of histologic primitives based on size, complexity, distribution, and stain: glomerular tufts on H&E, peritubular capillaries on PAS, distal tubular segments on TRI, and arteries/arterioles on SIL. Specifically, we sought to evaluate the minimum number of annotated exemplars for standing up trained U-Net models for each type of histologic primitive. Toward this end, multiple U-Net models were trained for each type of primitive, each time with a greater number of annotated exemplars. Detection and segmentation accuracy were then computed for each such U-Net model for each primitive on the corresponding testing sets (Figure 8). See Supplementary Figure S4.
Figure S1. Digitization and curation of renal biopsy whole slide images.
Figure S2. Histologic primitives and criteria for segmentation.
Figure S3. Network training, data augmentation, balanced sampling and pre- and post-processing.