Literature DB >> 33732986

Glomerular Classification Using Convolutional Neural Networks Based on Defined Annotation Criteria and Concordance Evaluation Among Clinicians.

Ryohei Yamaguchi¹, Yoshimasa Kawazoe¹, Kiminori Shimamoto¹, Emiko Shinohara¹, Tatsuo Tsukamoto², Yukako Shintani-Domoto³, Hajime Nagasu⁴, Hiroshi Uozaki⁵, Tetsuo Ushiku³, Masaomi Nangaku⁶, Naoki Kashihara⁴, Akira Shimizu⁷, Michio Nagata⁸, Kazuhiko Ohe⁹.

Abstract

INTRODUCTION: Diagnosing renal pathologies is important for performing treatments. However, classifying every glomerulus is difficult for clinicians; thus, a support system, such as a computer, is required. This paper describes the automatic classification of glomerular images using a convolutional neural network (CNN).
METHOD: To generate appropriate labeled data, annotation criteria including 12 features (e.g., "fibrous crescent") were defined. The concordance among 5 clinicians was evaluated for 100 images using the kappa (κ) coefficient for each feature. Using the annotation criteria, 1 clinician annotated 10,102 images. We trained the CNNs to classify the features with an average κ ≥0.4 and evaluated their performance using the receiver operating characteristic-area under the curve (ROC-AUC). An error analysis was conducted and the gradient-weighted class activation mapping (Grad-CAM) was also applied; it expresses the CNN's focusing point with a heat map when the CNN classifies the glomerular image for a feature.
RESULTS: The average κ coefficient of the features ranged from 0.28 to 0.50. The ROC-AUC of the CNNs for test data varied from 0.65 to 0.98. Among the features, "capillary collapse" and "fibrous crescent" had high ROC-AUC values of 0.98 and 0.91, respectively. The error analysis and the Grad-CAM visually showed that the CNN could not distinguish between 2 different features that had similar visual structures or that occurred simultaneously.
CONCLUSION: The differences in the texture or frequency of the co-occurrence between the different features affected the CNN performance; thus, to improve the classification accuracy, methods such as segmentation are required.

Entities: Chemical

Keywords: artificial intelligence; convolutional neural network; deep learning; glomerular image; renal pathology

Year: 2020 PMID： 33732986 PMCID： PMC7938073 DOI： 10.1016/j.ekir.2020.11.037

Source DB: PubMed Journal: Kidney Int Rep ISSN： 2468-0249

Understanding renal pathology is important when making diagnoses and decisions for the course(s) of treatment., When interpreting renal pathology, all the glomeruli must be carefully observed individually. This is a time-consuming task for clinicians and requires the support of a computer. In recent years, there have been advancements in technologies related to artificial intelligence known as deep learning. Through deep learning, general image recognition is often more accurate than by humans. In particular, the convolutional neural network (CNN), which is based on the neocognitron, has played a major role. The CNN has demonstrated high performance in image recognition and object extraction even in the medical field. For example, Saha et al. reported that they detected mitosis from pathological glass slides of breast cancer using a CNN with a 92% accuracy and an 88% recall. Research on pathological glomerular images using machine learning, including the CNN, involves the detection of glomeruli from a whole slide image (WSI) and the classification of glomeruli. As an example of glomerular detection from a WSI, Gallero et al. reported that their CNN detected glomeruli from a periodic acid−Schiff (PAS) −stained WSI with an F1 score of 0.937. Kawazoe et al. reported that they detected glomeruli from PAS-, periodic acid-methenamine-silver (PAM)−, Masson trichrome−, and Azan-stained WSIs, with F1 scores ranging from 0.876 to 0.925 using a modified CNN. Bukowy et al. reported that they constructed a CNN to detect glomeruli from trichrome-stained WSIs with an average precision and recall of 96.94% and 96.79%. Hermsen et al. constructed a CNN to segment WSIs into 11 classes such as “glomeruli” or “interstitium” from the PAS-stained WSI, and it detected 92.7% of all glomeruli. Regarding glomerular classification, George et al. reported that they used the k−nearest neighbor algorithm and succeeded in classifying whether the glomerular images contained proliferative lesions with 92.3% precision. In another study, Shruti et al. constructed a CNN to determine whether the image obtained from trichrome-stained WSI is a “non-global sclerosis, global sclerosis, or non-glomerular” with 89.66% to 95.06% accuracy. Similarly, John et al. used deep learning to diagnose whether the glomeruli from hematoxylin and eosin−stained WSIs of the transplanted kidney are sclerotic, with an F1 score of 0.865 to 0.879. Chagas et al. combined a CNN and a support vector machine to classify the hypercellularity for glomerular images from hematoxylin and eosin− and PAS-stained WSIs, with 98.8% to 99.6% accuracy. To automatically classify the glomerular images using CNN, the ground truth for features is essential for all the images. However, in the routine diagnosis of kidney disease, scores are not always recorded for all of the features for all glomeruli, and they are often recorded only for representative glomeruli. Therefore, the pathological reports recorded in daily medical care often do not have complete ground truth for all glomeruli. Furthermore, the existing annotation criterion for a specific disease, such as the Oxford classification, or the annotation criterion for only particular diseases, such as the multicenter Nephrotic Syndrome Study Network (NEPTUNE) scoring system, cannot be applied to various diseases in their current format. Therefore, annotation criteria that can be applied to various kidney diseases should be newly developed. First, to create the ground truth of glomerular images, this study established the annotation criteria for glomerular images and evaluated the degree of concordance among clinicians based on the criteria. Next, CNNs that classified the various features of the glomerular images were developed, and their performance was evaluated. Finally, the concordance between the CNN score and the clinicians’ score was assessed, and an error analysis was performed to determine computer-related issues for the renal pathology images.

Materials and Methods

Dataset Collection

This study collected 293 PAS-stained WSIs from 3 hospitals: 99 WSIs from the University of Tokyo Hospital (UTH) from 2010 to 2017; 88 WSIs from the Tazuke Kofukai Medical Research Institute, Kitano Hospital (KH), from 2014 to 2017; and 106 WSIs from the University of Tsukuba Hospital (UTSH) from 2009 to 2016. From the 293 WSIs, the glomerular image was manually extracted in a rectangular shape so that 1 major glomerulus was included, which yielded 10,202 images. Supplementary Table S1 lists the major diagnoses of the 293 WSIs for the 10,202 images. The extracted glomerular images were converted into the png format using Openslide and were saved as a single image file.

Development of Annotation Criteria

Five co-authored clinicians (1 physician and 4 pathologists) developed the annotation criteria for various kidney diseases based on the Delphi method, because we were to guarantee the quality of the annotation criteria and annotated labels; Krause et al. showed that the quality of the annotated label affected the validity of its performance in the machine learning model for diabetic retinopathy. Consequently, the annotation criteria consisted of the following 3 elements: (i) “feature”: finding the name (e.g., “mesangial hypercellularity”); (ii) “score”: the possible value of the feature (e.g., “normal,” “mild,” “moderate,” “severe”); and (iii) “regulation”: definitions for determining the score (e.g., “count the number of mesangial cells …”). The annotation criteria also included 12 features, which are important to diagnose the PAS-stained images. The 12 features are as follows: (i) capillary collapse (CC); (ii) sclerosis (Scl); (iii) mesangial hypercellularity (MesHyper); (iv) increased mesangial matrix (IMM); (v) mesangiolysis (MLysis); (vi) endocapillary proliferation (EP); (vii) fibrous crescent (F-Cre); (viii) fibrocellular crescent (Fc-Cre); (ix) cellular crescent (C-Cre); (x) adhesion (Adh); (xi) afferent/efferent arteriolar hyalinosis (AAH); and (xii) increased vasculature around the vascular pole (IVVP). The granularity of the score and the regulation were also discussed and agreed upon by the 5 clinicians. In some images, the score of the particular features could not be obtained; these were classified as “impossible to score.” Supplementary Table S2 lists the annotation criteria that were developed for this study. Supplementary Figure S1 depicts the example images of annotation for each feature. In addition to those criteria, an annotation flowchart was developed to clarify the dependency between the features, which sometimes made the annotation complicated. Figure 1a and b show the annotation flowcharts.

Figure 1

(a, b) Flowcharts of the annotation for an image of a glomerulus. These 2 flowcharts represent the order followed to create a feature score. AAH, afferent/efferent arteriolar hyalinosis; Adh, adhesion; CC, capillary collapse; C-Cre, cellular crescent; EP, endocapillary proliferation; F-Cre, fibrous crescent; Fc-Cre, fibrocellular crescent; IMM, increased mesangial matrix; IVVP, increased vasculature around the vascular pole; MesHyper, mesangial hypercellularity; MLysis, mesangiolysis; Scl, sclerosis.

Concordance Evaluation of Annotated Features Between Clinicians

Out of 10,202 images, the first author selected 100 images for concordance evaluation. These 100 images included relatively “easy-to-score” images but also “hard-to-score” images in which different clinicians might assign different scores. Five clinicians annotated all 12 features for the 100 images except for 2 images in which 2 clinicians considered them to be ineligible to be scored. Supplementary Figure S2 shows the 98 images used for concordance evaluation. After the 5 clinicians finished annotating the images, the Cohen kappa (κ) coefficients of the 5 clinicians were calculated.

Development and Evaluation of CNN

One physician annotated all 12 features that were included in the developed annotation criteria for the remaining 10,102 glomerular images. The breakdown list of the 10,102 images for the features and scores is presented in Table 1.

Table 1

Breakdown of the annotated images

Feature	Score
CC	(−)	Segmental	Global		Impossible to score
CC	8837	187	1022		56
Scl	(−)	Segmental	Global		Impossible to score
Scl	8499	448	82		1083
MesHyper	Normal	Mild	Moderate	Severe	Impossible to score
MesHyper	5411	1874	726	878	1213
IMM	(−)	(+)			Impossible to score
IMM	1569	7320			1213
MLysis	(−)	(+)			Impossible to score
MLysis	8777	112			1213
EP	(−)	Segmental	Global		Impossible to score
EP	8409	496	34		1163
F-Cre	(−)	(+)
F-Cre	8611	1491
Fc-Cre	(−)	(+)
Fc-Cre	9413	689
C-Cre	(−)	(+)
C-Cre	9733	369
Adh	(−)	(+)			Impossible to score
Adh	6826	1925			1351
IVVP	(−)	(+)			Impossible to score
IVVP	2160	144			7798
AAH	(−)	(+)			Impossible to score
AAH	798	235			9069

For each of the 12 features that were made, the distribution of the scores for all of the 10,102 images are listed. Each number represents the number of images scored for each score. AAH, afferent/efferent arteriolar hyalinosis; Adh, adhesion; C-Cre, cellular crescent; CC, capillary collapse; EP, endocapillary proliferation; F-Cre, fibrous crescent; Fc-Cre, fibrocellular crescent; IMM, increased mesangial matrix; IVVP, increased vasculature around the vascular pole; MesHyper, mesangial hypercellularity; MLysis, mesangiolysis; Scl, sclerosis.

Breakdown of the annotated images For each of the 12 features that were made, the distribution of the scores for all of the 10,102 images are listed. Each number represents the number of images scored for each score. AAH, afferent/efferent arteriolar hyalinosis; Adh, adhesion; C-Cre, cellular crescent; CC, capillary collapse; EP, endocapillary proliferation; F-Cre, fibrous crescent; Fc-Cre, fibrocellular crescent; IMM, increased mesangial matrix; IVVP, increased vasculature around the vascular pole; MesHyper, mesangial hypercellularity; MLysis, mesangiolysis; Scl, sclerosis. The CNNs were developed for the features for which the κ coefficients were ≥0.40 to classify the features as positive or negative. The binary class breakdown list of the 10,102 images of the 5 features for the CNN is presented in Table 2. The strength of the correlation between the features was calculated using the phi coefficient (Table 3). There was a strong correlation between CC and F-Cre (phi coefficient = 0.67).

Table 2

Breakdown of data used for the convolutional neural network (CNN)

ScoreFeature	Negative	Positive	Ratio of negative to positive
CC	None, Impossible to score	Segmental, Global
CC	8893	1209	7.36
IMM	(−), Impossible to score	(+)
IMM	2782	7320	0.38
MLysis	(−), Impossible to score	(+)
MLysis	9990	112	89.20
F-Cre	(−)	(+)
F-Cre	8611	1491	5.78
IVVP	(−), Impossible to score	(+)
IVVP	9958	144	69.15

Each number shows the number of images contained in the score. CC, capillary collapse; F-Cre, fibrous crescent; IMM, increased mesangial matrix; IVVP, increased vasculature around the vascular pole; MLysis, mesangiolysis.

Table 3

Phi coefficient among the 5 features

CC	1.00
IMM	–0.50	1.00
MLysis	–0.04	0.04	1.00
F-Cre	0.67	–0.36	–0.02	1.00
IVVP	–0.03	0.03	0.00	–0.03	1.00
	CC	IMM	MLysis	F-Cre	IVVP

This table shows the phi coefficient between the 2 features among the 5 features. Each number shows the phi coefficient between the 2 features. The phi coefficient ranges from −1 to 1, where 1 represents a perfect positive correlation, −1 represents a perfect negative correlation, and 0 represents no correlation at all. CC, capillary collapse; F-Cre, fibrous crescent; IMM, increased mesangial matrix; IVVP, increased vasculature around the vascular pole; MLysis, mesangiolysis.

Breakdown of data used for the convolutional neural network (CNN) Each number shows the number of images contained in the score. CC, capillary collapse; F-Cre, fibrous crescent; IMM, increased mesangial matrix; IVVP, increased vasculature around the vascular pole; MLysis, mesangiolysis. Phi coefficient among the 5 features This table shows the phi coefficient between the 2 features among the 5 features. Each number shows the phi coefficient between the 2 features. The phi coefficient ranges from −1 to 1, where 1 represents a perfect positive correlation, −1 represents a perfect negative correlation, and 0 represents no correlation at all. CC, capillary collapse; F-Cre, fibrous crescent; IMM, increased mesangial matrix; IVVP, increased vasculature around the vascular pole; MLysis, mesangiolysis. The performance of the CNN was evaluated on the validation, test, and concordance data, respectively, over the 4-fold cross-validation (Figure 2). The performance was measured using the average ROC-AUC and F1 score via 4-fold stratified cross-validation for the 10,102 images, and also for the 98 images used for the concordance evaluation.

Figure 2

Schematic representation of the cross-validation. This figure shows how all images were divided for training, validation, testing, and concordance. First, 10,202 glomerular images were collected from the whole slide images (WSIs). Next, 100 glomerular images were chosen for the concordance data. Finally, the remaining 10,102 images were divided into 5 groups, 1 for the test data and the other 4 groups as the training data and validation data that were used for convolutional neural network training. ROC−AUC, receiver operating characteristic−area under the curve. This study used a 50-layer residual network (ResNet50) as the structure of the CNN, which demonstrates the high performance in general image recognition and is unlikely to fall into poor local minima because of its structure; hence, it would achieve high classification performance. This investigation also adopted transfer learning, which uses the parameters that were trained by the ImageNet dataset. The weighted softmax cross entropy loss was used as a loss function because the dataset for this study was unbalanced between positive and negative. For the weighted softmax cross entropy, the reciprocal of the ratio of the number of images in the class to the total number of images was set for each class. As the optimizing method, adaptive moment estimation (Adam), which has a fast convergence speed and can easily obtain good accuracy without parameter tuning, was used with an empirically determined hyperparameter (α = 1.0 × 10−7). All original images were resized to 224 × 224 pixels for the CNN input. At the time of CNN training, data augmentation was applied to randomly rotate (0°, 90°, 180°, and 270°) and invert the images. The losses for the training and validation data were calculated for each epoch. The training was stopped at the point at which the loss of the validation data did not decrease for at least 50 continuous epochs. Python3 was used as the programming language, and Chainer (version 4.0.0) was used as the deep learning library. The calculations were performed using NVIDIA DGX-1 (OS: Ubuntu 16.04, main memory: 515 GB, CPU: Intel Xeon CPU E5-2698 v4, GPU: 8 Tesla V100-SXM).

Concordance Evaluation Between the CNN Score and Clinicians’ Score

For the features for which the ROC–AUC for the test data exceeded 0.9, this study compared the CNN’s predictive score with the 5 clinicians’ scores for the concordance data and performed error analysis. Moreover, to perform the error analysis visually and to help understand the CNN’s properties, gradient-weighted class activation mapping (Grad-CAM) was applied for the concordance data; it is a technique that expresses the CNN’s focusing point using a heat map calculated by summing up the intermediate gradient of the CNN for each channel.

Results

Concordance Between Clinicians

Table 4 lists the results of the Cohen κ coefficient of the 5 clinicians for the 98 images. The average κ coefficient ranged from 0.28 to 0.50, and was ≥0.40 (moderate agreement) for the 5 features (CC, IMM, MLysis, F-Cre, and IVVP).

Table 4

Results of Cohen κ coefficient by the 5 clinicians for the concordance images

	A/B	A/C	A/D	A/E	B/C	B/D	B/E	C/D	C/E	D/E	Average
CC	0.33	0.30	0.50	0.35	0.51	0.51	0.40	0.44	0.35	0.41	0.41
Scl	0.18	0.36	0.46	0.44	0.46	0.31	0.26	0.52	0.38	0.44	0.38
MesHyper	0.44	0.37	0.35	0.37	0.42	0.27	0.28	0.29	0.36	0.35	0.35
IMM	0.45	0.52	0.62	0.45	0.43	0.42	0.45	0.53	0.45	0.42	0.47
MLysis	0.54	0.48	0.68	0.53	0.25	0.39	0.41	0.45	0.56	0.56	0.49
EP	0.19	0.38	0.54	0.35	0.18	0.30	0.41	0.34	0.35	0.51	0.36
F-Cre	0.59	0.50	0.20	0.64	0.72	0.04	0.59	0.02	0.50	0.28	0.41
Fc-Cre	0.34	0.20	0.30	0.23	0.72	0.39	0.56	0.12	0.44	0.33	0.36
C-Cre	0.47	0.57	0.24	0.33	0.44	0.26	0.45	0.22	0.53	0.40	0.39
Adh	0.16	0.58	0.26	0.37	0.18	0.29	0.23	0.08	0.28	0.33	0.28
IVVP	0.40	0.40	0.45	0.56	0.41	0.55	0.56	0.56	0.46	0.63	0.50
AAH	0.25	0.16	0.41	0.47	0.35	0.46	0.37	0.39	0.26	0.60	0.37

The rows indicate the features, and the column indicate the pairs of clinicians (e.g., A/B implies clinicians A and B) and their average value. Each number shows the Cohen κ coefficient. The Cohen κ coefficient ranges from −1 to 1, where 1 represents perfect concordance, −1 represents perfect discordance, and 0 represents completely no concordance. Adh, adhesion; AAH, afferent/efferent arteriolar hyalinosis; C-Cre, cellular crescent; CC, capillary collapse; EP, endocapillary proliferation; F-Cre, fibrous crescent; Fc-Cre, fibrocellular crescent; IMM, increased mesangial matrix; IVVP, increased vasculature around the vascular pole; MesHyper, mesangial hypercellularity; MLysis, mesangiolysis; Scl, sclerosis.

Results of Cohen κ coefficient by the 5 clinicians for the concordance images The rows indicate the features, and the column indicate the pairs of clinicians (e.g., A/B implies clinicians A and B) and their average value. Each number shows the Cohen κ coefficient. The Cohen κ coefficient ranges from −1 to 1, where 1 represents perfect concordance, −1 represents perfect discordance, and 0 represents completely no concordance. Adh, adhesion; AAH, afferent/efferent arteriolar hyalinosis; C-Cre, cellular crescent; CC, capillary collapse; EP, endocapillary proliferation; F-Cre, fibrous crescent; Fc-Cre, fibrocellular crescent; IMM, increased mesangial matrix; IVVP, increased vasculature around the vascular pole; MesHyper, mesangial hypercellularity; MLysis, mesangiolysis; Scl, sclerosis.

CNN Performance

Table 5 summarizes the ROC–AUC and the F1 score for the validation data, test data, and concordance data. The ROC–AUC for the test data varied from 0.65 to 0.98. Among the 5 features, CC and F-Cre demonstrated high ROC–AUCs of 0.98 and 0.91, respectively.

Table 5

Overall results of the convolutional neural network (CNN) performance

Feature	Validation data		Test data		Concordance data
Feature	ROC–AUC	F1 score	ROC–AUC	F1 score	ROC–AUC	F1 score
CC	0.97	0.79	0.98	0.79	0.84	0.59
IMM	0.82	0.82	0.79	0.79	0.76	0.72
MLysis	0.87	0.12	0.76	0.08	0.37	0.04
F-Cre	0.90	0.61	0.91	0.63	0.69	0.53
IVVP	0.62	0.04	0.65	0.05	0.47	0.04

Each number shows the average receiver operating characteristic–area under the curve (ROC–AUC) or the F1 score over the 4 folds by each feature. CC, capillary collapse; F-Cre, fibrous crescent; IMM, increased mesangial matrix; IVVP, increased vasculature around the vascular pole; MLysis, mesangiolysis.

Overall results of the convolutional neural network (CNN) performance Each number shows the average receiver operating characteristic–area under the curve (ROC–AUC) or the F1 score over the 4 folds by each feature. CC, capillary collapse; F-Cre, fibrous crescent; IMM, increased mesangial matrix; IVVP, increased vasculature around the vascular pole; MLysis, mesangiolysis.

Concordance Between the CNN Score and Clinicians’ Score

As shown in Table 5, there were 2 features (CC and F-Cre) with ROC–AUCs for the test data that exceeded 0.9 (CC and F-Cre). Error analysis was conducted for these 2 features. Tables 6 and 7 present the comparisons for CC and F-Cre, respectively. As shown in Table 6, for CC, there are 13 completely true-positive (cTP) images that the CNN predicted to be positive, and all 5 clinicians scored them positively. In addition, there were no completely false-positive (cFP) images that the CNN predicted to be negative; however, no clinician scored these images positively. There were 33 completely true-negative (cTN) images that the CNN predicted to be negative, and no clinician scored them as positive. There was 1 completely false-negative (cFN) image that the CNN predicted to be negative; however, all of the clinicians scored it as positive. Table 7 demonstrates that for the F-Cre, there are no cTP images but there are 20 cFP images, 37 cTN images, and 1 cFN image.

Table 6

Comparison between the convolutional neural network (CNN) predictive score and the clinicians’ feature for the capillary collapse (CC)

		No. of clinicians out of 5 who scored positive
		0	1	2	3	4	5
CNN’s predictive score over 4 folds	0.75–1.00	0 a	2	2	5	7	10b
	0.50–0.75	0a	2	2	2	2	3b
	0.25–0.50	9c	5	0	0	1	1d
	0.00–0.25	24d	11	4	2	4	0d

This confusion matrix shows the comparison between the CNN’s predictive score and the clinicians’ feature for the CC. Rows show the CNN’s predictive score for an image, calculated as an average of the CNN’s softmax probability over the 4 folds. Columns show the number of clinicians out of 5 who scored it as positive. Each number represents the number of images.

Images in which the CNN predicted it to be positive but no clinician scored it as positive (the CNN’s completely false-positive result).

Images in which the CNN predicted it to be positive and all of the clinicians scored it as positive (the CNN’s completely true-positive result).

Images in which the CNN predicted it to be negative and no clinician scored it as positive (the CNN’s completely true-negative result).

Images in which the CNN predicted it to be negative but all of the clinicians scored it as positive (the CNN’s completely false-negative result).

Table 7

Comparison between the convolutional neural network (CNN) predictive score and the clinicians’ feature for the fibrous crescent (F-Cre)

		No. of clinicians out of 5 who scored positive
		0	1	2	3	4	5
CNN’s predictive score over 4 folds	0.75–1.00	9a	4	2	7	6	0b
	0.50–0.75	11a	5	1	2	1	0b
	0.25–0.50	20c	2	1	3	0	1d
	0.00–0.25	17c	4	1	0	1	0d

This confusion matrix shows the comparison between the CNN predictive score and the clinicians’ feature for the F-Cre. Rows show the CNN’s predictive score for an image, which was calculated as an average of the CNN’s softmax probability over the 4 folds. Columns show the number of clinicians out of 5 who scored it as positive. Each number shows the number of images.

Images in which the CNN predicted it to be positive, but no clinician scored it as positive (CNN’s completely false-positive result).

Images in which the CNN predicted it to be positive and all of the clinicians scored it as positive (the CNN’s completely true-positive result).

Images in which the CNN predicted it to be negative and no clinician scored it as positive (the CNN’s completely true-negative result).

Images in which the CNN predicted it to be negative, but all clinicians scored it as positive (the CNN’s completely false-negative result).

Comparison between the convolutional neural network (CNN) predictive score and the clinicians’ feature for the capillary collapse (CC) This confusion matrix shows the comparison between the CNN’s predictive score and the clinicians’ feature for the CC. Rows show the CNN’s predictive score for an image, calculated as an average of the CNN’s softmax probability over the 4 folds. Columns show the number of clinicians out of 5 who scored it as positive. Each number represents the number of images. Images in which the CNN predicted it to be positive but no clinician scored it as positive (the CNN’s completely false-positive result). Images in which the CNN predicted it to be positive and all of the clinicians scored it as positive (the CNN’s completely true-positive result). Images in which the CNN predicted it to be negative and no clinician scored it as positive (the CNN’s completely true-negative result). Images in which the CNN predicted it to be negative but all of the clinicians scored it as positive (the CNN’s completely false-negative result). Comparison between the convolutional neural network (CNN) predictive score and the clinicians’ feature for the fibrous crescent (F-Cre) This confusion matrix shows the comparison between the CNN predictive score and the clinicians’ feature for the F-Cre. Rows show the CNN’s predictive score for an image, which was calculated as an average of the CNN’s softmax probability over the 4 folds. Columns show the number of clinicians out of 5 who scored it as positive. Each number shows the number of images. Images in which the CNN predicted it to be positive, but no clinician scored it as positive (CNN’s completely false-positive result). Images in which the CNN predicted it to be positive and all of the clinicians scored it as positive (the CNN’s completely true-positive result). Images in which the CNN predicted it to be negative and no clinician scored it as positive (the CNN’s completely true-negative result). Images in which the CNN predicted it to be negative, but all clinicians scored it as positive (the CNN’s completely false-negative result).

Visualization of the CNN’s Focusing Point

We show the 3 images of Grad-CAM to understand the properties of the CNN qualitatively. Figure 3 depicts a cTP image for CC where the CNN and all 5 clinicians scored it as positive. In this figure, the focus of the CNN was on the collapsed capillary, which is an actual pathological lesion. On the other hand, Figures 4 and 5 show the cFP images for the F-Cre where the CNN scored it as positive but none of the 5 clinicians scored it as positive. In Figure 4, the focus of the CNN was the extracapillary lesion, which is an actual pathological lesion of the Fc-Cre; however, the CNN misclassified it as F-Cre. In Figure 5, the focus of the CNN was not the extracapillary lesion but the collapsed capillary lesion.

Figure 3

Figure 4

Example of the false-positive case of the convolutional neural network (CNN) for the fibrous crescent (F-Cre). (Left) One of the periodic acid−Schiff (PAS)−stained images from the concordance data. The CNN diagnosed this image as “F-Cre is positive,” but none of the 5 clinicians scored it as “F-Cre is positive,” whereas 4 of the 5 clinicians scored it as “Fibrocellular Crescent (Fc-Cre) is positive.” (Center) The yellow line shows the area of Fc-Cre. (Right) Visualization of the CNN’s focusing points using the gradient-weighted class activation mapping when the CNN diagnosed the F-Cre as positive or negative. The red lesions are the CNN’s focusing points. This shows that the CNN focused correctly on the extracapillary area but “misdiagnosed” it as “F-Cre is positive.”

Figure 5

Another example of the false-positive case of the convolutional neural network (CNN) for the fibrous crescent (F-Cre). (Left) One of the periodic acid−Schiff (PAS)−stained images from the concordance data. The CNN diagnosed this image as “F-Cre is positive” but none of the 5 clinicians scored it as “F-Cre is positive.” (Center) The blue line shows the area where the capillary was collapsed and obsoleted, and the yellow line depicts the fibrocellular crescent area. (Right) Visualization of the CNN’s focusing points using the gradient-weighted class activation mapping when the CNN diagnosed the F-Cre as positive or negative. The red lesions are the CNN’s focusing points. This reveals that the CNN accidentally focused on the capillary area although it needs to focus on the extracapillary lesion.

Example of the true positive score of the convolutional neural network (CNN) for the capillary collapse (CC). (Left) One of the periodic acid−Schiff (PAS)−stained images from the concordance data. The CNN diagnosed this image as “capillary collapse is positive,” and all 5 clinicians scored it as “capillary collapse is positive.” (Center) The blue lines depict the collapsed capillary, and the yellow lines show the fibrous crescents. (Right) Visualization of the CNN’s focusing points using the Gradient-weighted Class Activation Mapping when the CNN diagnosed the capillary collapse as positive or negative. Red lesions are the CNN’s focusing points. This shows that the CNN was focused correctly on the capillary area. Example of the false-positive case of the convolutional neural network (CNN) for the fibrous crescent (F-Cre). (Left) One of the periodic acid−Schiff (PAS)−stained images from the concordance data. The CNN diagnosed this image as “F-Cre is positive,” but none of the 5 clinicians scored it as “F-Cre is positive,” whereas 4 of the 5 clinicians scored it as “Fibrocellular Crescent (Fc-Cre) is positive.” (Center) The yellow line shows the area of Fc-Cre. (Right) Visualization of the CNN’s focusing points using the gradient-weighted class activation mapping when the CNN diagnosed the F-Cre as positive or negative. The red lesions are the CNN’s focusing points. This shows that the CNN focused correctly on the extracapillary area but “misdiagnosed” it as “F-Cre is positive.” Another example of the false-positive case of the convolutional neural network (CNN) for the fibrous crescent (F-Cre). (Left) One of the periodic acid−Schiff (PAS)−stained images from the concordance data. The CNN diagnosed this image as “F-Cre is positive” but none of the 5 clinicians scored it as “F-Cre is positive.” (Center) The blue line shows the area where the capillary was collapsed and obsoleted, and the yellow line depicts the fibrocellular crescent area. (Right) Visualization of the CNN’s focusing points using the gradient-weighted class activation mapping when the CNN diagnosed the F-Cre as positive or negative. The red lesions are the CNN’s focusing points. This reveals that the CNN accidentally focused on the capillary area although it needs to focus on the extracapillary lesion.

Discussion

Clinicians’ Concordance

As listed in Table 4, the κ coefficient ranged from 0.28 to 0.50, which indicates a fair-to-moderate concordance. The study about concordance in the Oxford classification for IgA nephropathy measured the concordance by using intraclass correlation coefficients (ICCs), and their ICCs ranged from 0.03 to 0.90. The same variety of ICCs were also reported in the Japanese cohort study using the Oxford classification for IgA nephropathy where the ICCs ranged from 0.26 to 0.89. Because the ICCs are analogous with the κ coefficient, the discordance was also reported in those studies. As for the feature of MesHyper, the κ coefficient is 0.35, whereas the κ coefficient for the mesangial hypercellularity reported in the NEPTUNE scoring system was 0.54 to 0.64. Even though a direct comparison is difficult because of differences in the 2 study designs, the difference in the number of classes (5 for this study and 2 for NEPTUNE) might have led to the difference in the κ score. One possible reason that may have caused discordance is that single-stained WSIs were used for scoring the images. In some cases, clinicians see the WSIs for several stains, and they try to assign a score for an image. Combining other stained WSIs, such as PAM or Azan, might improve the concordance between clinicians. As shown in Table 5, the ROC–AUC for the CNN for the test data exceeded 0.9 for the CC and the F-Cre. The 2 features may have demonstrated a high performance due to the positive-to-negative ratio (described in Table 2), and the width of the pathological lesions were not so narrow. In contrast, features such as IVVP with a low performance demonstrated a high ratio of negative to positive, and a small pathological lesion may have caused the low performance. To improve the accuracy, the data should be collected to improve the class balance with the resampling methods. As displayed in Table 5, there is a difference in the performance between the concordance data and the test data for some features. This seems to be due to the sampling method. Specifically, the test data were selected to have a good representation of the whole dataset by the stratified folding method; however, the concordance data were intentionally selected to contain “hard-to-score” images.

Concordance Between the CNN and Clinicians

As shown in Tables 6 and 7, for the CC, there was no cFP case and only 1 cFN case out of 98 images. Meanwhile, for the F-Cre, there were 20 cFP cases and 1 cFN case. F-Cre has more cFP images compared with CC. The reason why there are so many cFP images for the F-Cre is discussed in the next section.

Visualization of CNN’s Focusing Point

A visualization of the cTP case for the CC is illustrated in Figure 3, and the cFP cases for F-Cre are presented in Figures 4 and 5. In Figure 4, the CNN cannot distinguish between the 2 different features, which are F-Cre and Fc-Cre. It is hypothesized that F-Cre and Fc-Cre have a similar crescentic structure but different textures. On the other hand, in Figure 5, the CNN cannot distinguish between CC and F-Cre, which tended to occur simultaneously in this dataset (as depicted in Table 3). This is because during the training process, the CNN merely focuses on the frequently occurring lesion, which is not always a true pathological lesion. This phenomenon has been reported as an inherent bias derived from the dataset. By correcting the balance of co-occurrence between the 2 different features, this may enable the CNN to focus on the appropriate pathological lesion and to make the correct classification. Except for the CC and F-Cre, there are several other strong correlations between the features of the renal pathology in terms of pathogenesis. For example, in most cases of IgA nephropathy, endocapillary proliferation tends to accompany mesangial hypercellularity. To avoid the inherent bias caused by this correlation, it is suggested that instead of classifying the features directly, the glomerular image should be divided into different lesions according to its structures, such as the capillary lesion, the mesangial lesion, and the extracapillary lesion, and semantic segmentation can be applied to extract the lesions from the glomerular image. In conclusion, this study established the annotation criteria and an annotation flowchart to automatically classify the glomerular images using CNNs. The annotation criteria included 12 features and the degree of concordance. This was indicated by the average κ coefficient among the 5 clinicians for each feature that ranged from 0.28 to 0.50. One clinician annotated 10,102 images and the performance of the CNN for the 5 features that were evaluated. The ROC–AUC values for the test data ranged from 0.65 to 0.98. In particular, CC and F-Cre demonstrated high ROC–AUC values of 0.98 and 0.91, respectively. The error analysis indicated that the CNN cannot distinguish features that have a similar visual structure or are likely to occur simultaneously. To address this problem, labeling every pixel with the structure class such as mesangial lesions or capillary lesions and applying semantic segmentation might improve the performance of the glomerular classification. One of the limitations of this study is that the annotation criteria should be validated by other clinicians, and more images for concordance evaluation are desirable. Another limitation is that the images for the CNN development were annotated only by 1 clinician. Thus, there should be unintended bias in the dataset, and the results of the CNN performance might not be extrapolated to the other datasets. This study has 3 important implications for future research on automatic glomerular classification. The first implication is that the discordance among the clinicians is similar to what has been reported in the literature. The second implication is that the width of the pathological lesion seems to affect the performance of the CNN. The third implication is that providing the predictive score and visualizing the focus of the CNN would support clinicians in diagnosing renal pathology, which would improve the quality of the clinicians’ assessment.

Disclosure

All the authors declared no competing interests.

19 in total

1. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit.

Authors: J Cohen
Journal: Psychol Bull Date: 1968-10 Impact factor: 17.737

Review 2. Deep learning.

Authors: Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal: Nature Date: 2015-05-28 Impact factor: 49.962

3. Deep Learning-Based Histopathologic Assessment of Kidney Tissue.

Authors: Meyke Hermsen; Thomas de Bel; Marjolijn den Boer; Eric J Steenbergen; Jesper Kers; Sandrine Florquin; Joris J T H Roelofs; Mark D Stegall; Mariam P Alexander; Byron H Smith; Bart Smeets; Luuk B Hilbrands; Jeroen A W M van der Laak
Journal: J Am Soc Nephrol Date: 2019-09-05 Impact factor: 10.121

4. The measurement of observer agreement for categorical data.

Authors: J R Landis; G G Koch
Journal: Biometrics Date: 1977-03 Impact factor: 2.571

5. Reproducibility for pathological prognostic parameters of the Oxford classification of IgA nephropathy: a Japanese cohort study of the Ministry of Health, Labor and Welfare.

Authors: Satoshi Hisano; Kensuke Joh; Ritsuko Katafuchi; Akira Shimizu; Noriaki Hashiguchi; Tetsuya Kawamura; Seiichi Matsuo
Journal: Clin Exp Nephrol Date: 2016-03-24 Impact factor: 2.801

6. Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy.

Authors: Jonathan Krause; Varun Gulshan; Ehsan Rahimy; Peter Karth; Kasumi Widner; Greg S Corrado; Lily Peng; Dale R Webster
Journal: Ophthalmology Date: 2018-03-13 Impact factor: 12.079

7. The Oxford classification of IgA nephropathy: pathology definitions, correlations, and reproducibility.

Authors: Ian S D Roberts; H Terence Cook; Stéphan Troyanov; Charles E Alpers; Alessandro Amore; Jonathan Barratt; Francois Berthoux; Stephen Bonsib; Jan A Bruijn; Daniel C Cattran; Rosanna Coppo; Vivette D'Agati; Giuseppe D'Amico; Steven Emancipator; Francesco Emma; John Feehally; Franco Ferrario; Fernando C Fervenza; Sandrine Florquin; Agnes Fogo; Colin C Geddes; Hermann-Josef Groene; Mark Haas; Andrew M Herzenberg; Prue A Hill; Ronald J Hogg; Stephen I Hsu; J Charles Jennette; Kensuke Joh; Bruce A Julian; Tetsuya Kawamura; Fernand M Lai; Lei-Shi Li; Philip K T Li; Zhi-Hong Liu; Bruce Mackinnon; Sergio Mezzano; F Paolo Schena; Yasuhiko Tomino; Patrick D Walker; Haiyan Wang; Jan J Weening; Nori Yoshikawa; Hong Zhang
Journal: Kidney Int Date: 2009-07-01 Impact factor: 10.612

8. OpenSlide: A vendor-neutral software foundation for digital pathology.

Authors: Adam Goode; Benjamin Gilbert; Jan Harkes; Drazen Jukic; Mahadev Satyanarayanan
Journal: J Pathol Inform Date: 2013-09-27

9. Segmentation of Glomeruli Within Trichrome Images Using Deep Learning.

Authors: Shruti Kannan; Laura A Morgan; Benjamin Liang; McKenzie G Cheung; Christopher Q Lin; Dan Mun; Ralph G Nader; Mostafa E Belghasem; Joel M Henderson; Jean M Francis; Vipul C Chitalia; Vijaya B Kolachalama
Journal: Kidney Int Rep Date: 2019-04-15

10. PathoSpotter-K: A computational tool for the automatic identification of glomerular lesions in histological images of kidneys.

Authors: George O Barros; Brenda Navarro; Angelo Duarte; Washington L C Dos-Santos
Journal: Sci Rep Date: 2017-04-24 Impact factor: 4.379

4 in total

Review 1. The potential of artificial intelligence-based applications in kidney pathology.

Authors: Roman D Büllow; Jon N Marsh; S Joshua Swamidass; Joseph P Gaut; Peter Boor
Journal: Curr Opin Nephrol Hypertens Date: 2022-02-14 Impact factor: 3.416

2. Deep learning segmentation of glomeruli on kidney donor frozen sections.

Authors: Xiang Li; Richard C Davis; Yuemei Xu; Zehan Wang; Nao Souma; Gina Sotolongo; Jonathan Bell; Matthew Ellis; David Howell; Xiling Shen; Kyle J Lafata; Laura Barisoni
Journal: J Med Imaging (Bellingham) Date: 2021-12-20

3. MIXTURE of human expertise and deep learning-developing an explainable model for predicting pathological diagnosis and survival in patients with interstitial lung disease.

Authors: Wataru Uegami; Andrey Bychkov; Mutsumi Ozasa; Kazuki Uehara; Kensuke Kataoka; Takeshi Johkoh; Yasuhiro Kondoh; Hidenori Sakanashi; Junya Fukuoka
Journal: Mod Pathol Date: 2022-02-23 Impact factor: 8.209

Review 4. Artificial Intelligence-Assisted Renal Pathology: Advances and Prospects.

Authors: Yiqin Wang; Qiong Wen; Luhua Jin; Wei Chen
Journal: J Clin Med Date: 2022-08-22 Impact factor: 4.964

4 in total