
Development and Validation of a Deep Learning Model to Quantify Interstitial Fibrosis and Tubular Atrophy From Kidney Ultrasonography Images.

Ambarish M Athavale¹, Peter D Hart¹, Mathew Itteera¹, David Cimbaluk², Tushar Patel³, Anas Alabkaa², Jose Arruda⁴, Ashok Singh¹, Avi Rosenberg⁵, Hemant Kulkarni⁶.

Abstract

Importance: Interstitial fibrosis and tubular atrophy (IFTA) is a strong indicator of decline in kidney function and is measured using histopathological assessment of kidney biopsy core. At present, a noninvasive test to assess IFTA is not available.
Objective: To develop and validate a deep learning (DL) algorithm to quantify IFTA from kidney ultrasonography images.
Design, Setting, and Participants: This was a single-center diagnostic study of consecutive patients who underwent native kidney biopsy at John H. Stroger Jr. Hospital of Cook County, Chicago, Illinois, between January 1, 2014, and December 31, 2018. A DL algorithm was trained, validated, and tested to classify IFTA from kidney ultrasonography images. Of 6135 Crimmins-filtered ultrasonography images, 5523 were used for training (5122 images) and validation (401 images), and 612 were used to test the accuracy of the DL system. Kidney segmentation was performed using the UNet architecture, and classification was performed using a convolutional neural network-based feature extractor and extreme gradient boosting. IFTA scored by a nephropathologist on trichrome-stained kidney biopsy slides was used as the reference standard. IFTA was divided into 4 grades (grade 1, 0%-24%; grade 2, 25%-49%; grade 3, 50%-74%; and grade 4, 75%-100%). Data analysis was performed from December 2019 to May 2020.
Main Outcomes and Measures: Prediction of IFTA grade was measured using the metrics precision, recall, accuracy, and F1 score.
Results: This study included 352 patients (mean [SD] age 47.43 [14.37] years), of whom 193 (54.82%) were women. There were 159 patients with IFTA grade 1 (2701 ultrasonography images), 74 patients with IFTA grade 2 (1239 ultrasonography images), 41 patients with IFTA grade 3 (701 ultrasonography images), and 78 patients with IFTA grade 4 (1494 ultrasonography images). Kidney ultrasonography images were segmented with 91% accuracy. In the independent test set, the point estimates for performance metrics showed precision of 0.8927 (95% CI, 0.8682-0.9172), recall of 0.8037 (95% CI, 0.7722-0.8352), accuracy of 0.8675 (95% CI, 0.8406-0.8944), and an F1 score of 0.8389 (95% CI, 0.8098-0.8680) at the image level. Corresponding estimates at the patient level were precision of 0.9003 (95% CI, 0.8644-0.9362), recall of 0.8421 (95% CI, 0.7984-0.8858), accuracy of 0.8955 (95% CI, 0.8589-0.9321), and an F1 score of 0.8639 (95% CI, 0.8228-0.9049). Accuracy at the patient level was highest for IFTA grade 1 and IFTA grade 4. The accuracy (approximately 90%) remained high irrespective of the timing of ultrasonography studies and the biopsy diagnosis. The predictive performance of the DL system did not show significant improvement when combined with baseline clinical characteristics.
Conclusions and Relevance: These findings suggest that a DL algorithm can accurately and independently predict IFTA from kidney ultrasonography images.

Year: 2021 PMID: 34028548 PMCID: PMC8144924 DOI: 10.1001/jamanetworkopen.2021.11176

Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805

Introduction

Chronic kidney disease (CKD) affects 15% of the adult population in the US and has contributed to a 52% increase in cost burden from 2002 to 2016.[1,2] A key pathophysiological indicator of CKD is interstitial fibrosis and tubular atrophy (IFTA), which is associated with estimated glomerular filtration rate (eGFR),[3] future decline in kidney function, and development of kidney failure.[4] Furthermore, IFTA incrementally improves the value of baseline proteinuria and eGFR to predict clinical outcomes of CKD[5] irrespective of the underlying causes.[5,6] Currently, the challenge is to have an accurate, noninvasive method to quantify IFTA because histopathological grading of kidney biopsy core by a nephropathologist is the only accepted method to quantify IFTA. In addition to being invasive, kidney biopsy is associated with bleeding complications, provides only a snapshot of IFTA, and is subject to sampling error.[7,8] Consequently and importantly, most patients with CKD who will eventually need kidney replacement therapy[9] never undergo a kidney biopsy[10,11] and represent a missed clinical opportunity. Kidney ultrasonography is a routinely performed, noninvasive test for evaluation of kidney disease. Certain features on ultrasonography, such as echogenicity, kidney length, and corticomedullary differentiation, have been found to be associated with IFTA; however, these features are unable to provide a quantifiable estimate of IFTA.[12] Unlike FibroScan, which provides a quantification of fibrosis in liver,[13] there is no imaging modality that can provide an accurate estimate of IFTA in the kidney in routine clinical practice. We hypothesized that ingrained within the ultrasonographic features are subtle signs of IFTA that can be quantitatively extracted and analyzed. 
Artificial intelligence and deep learning (DL) are being increasingly used in diagnosis and prognosis of various medical conditions.[14,15,16] Because DL can map complex feature relationships, we trained, validated, and tested a DL system to quantify IFTA using kidney ultrasonography images.

Methods

Patient Selection

This diagnostic study was approved by the institutional review board at Cook County Health with waiver of informed consent because deidentified data and images were used, in accordance with 45 CFR §46. Reporting of the results follows the Standards for Reporting of Diagnostic Accuracy (STARD) reporting guideline. Consecutive patients undergoing native kidney biopsy under real-time ultrasonography guidance between January 1, 2014, and December 31, 2018, at the John H. Stroger Jr. Hospital of Cook County, Chicago, Illinois, were included in the study. Allograft biopsies and patients for whom ultrasonography images were not available or IFTA grades were not available (15 patients) were excluded. Clinical and demographic information of patients included in this study was obtained by medical record review.

Ground Truth Quantification of IFTA

Quantifying IFTA on Masson trichrome–stained kidney biopsy slide is the current standard of care.[17] The percentage of cortex with IFTA was scored as 0% to 24%, 25% to 49%, 50% to 74%, and 75% to 100% of the cortex sampled. One nephropathologist (D.C.) provided IFTA scores from each trichrome-stained histopathological slide of the kidney biopsy core. To validate the methods of the pathologist (D.C.) in grading IFTA, a second nephropathologist (T.P.) also provided IFTA scores for a random sample of 93 whole slide images in a blinded fashion, and agreement between the 2 nephropathologists was evaluated.
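
The four-level grading used as the reference standard can be written as a small lookup. The following is a minimal sketch, not the authors' code; the function name `ifta_grade` is hypothetical, and the handling of fractional percentages (for example, 24.5% falling in grade 1) is an assumption, since the text only gives integer bin edges.

```python
def ifta_grade(percent_cortex: float) -> int:
    """Map the percentage of sampled cortex with IFTA to grade 1-4.

    Bins follow the text: 0%-24%, 25%-49%, 50%-74%, 75%-100%.
    Assumption: fractional values fall into the lower bin.
    """
    if not 0 <= percent_cortex <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percent_cortex < 25:
        return 1
    if percent_cortex < 50:
        return 2
    if percent_cortex < 75:
        return 3
    return 4
```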

Kidney Ultrasonography Images

Longitudinal ultrasonography images from both kidneys obtained between 6 months before and 2 weeks after kidney biopsy (including images obtained during the kidney biopsy) were included in the study. A total of 6602 ultrasonography images were deidentified and stored in the JPEG format.

Development of DL Classification System

Development of the DL model involved 4 independent steps: (1) preprocessing of ultrasonography images, (2) kidney segmentation, (3) feature extraction, and (4) image classification with internal and independent validation (Figure). All scripts were written in Python software version 3.7 (Python) within an Anaconda software version 3.0 (Anaconda, Inc) environment[18] and used the PyTorch platform.[19] Jupyter notebooks with code and outputs are available from the authors upon reasonable request.

Figure.

Overall Analysis Pipeline

The entire process was partitioned into 4 main tasks (green boxes): preprocessing of images, segmentation of kidneys in preprocessed images, feature extraction from masked images, and image classification from feature maps. Subtasks within these main tasks are indicated with italic type. In the feature extraction and image classification phase, a test set of 612 images was generated and was never used in any training. This test set was used for a final independent evaluation of the overall analytical pipeline. US indicates ultrasonography; VGG19, Visual Geometry Group 19; XGBoost, extreme gradient boosting.


Preprocessing of Ultrasonography Images

Ultrasonography images included in the study were resized to 224 × 224 pixels, the input dimension required for many popular DL models. A Crimmins filter[20] (also called the geometric filter), which is well suited to reducing background noise and backscatter in ultrasonography images, was applied to each image (eFigure 1 in the Supplement).

Kidney Segmentation

A kidney ultrasonography image includes structures in addition to the kidney, such as muscle, adipose tissue, liver, spleen, and bowel. For the current study, it was important that the training images focused on the kidney while eliminating other structures. Thus, we first trained and validated a DL model to generate segmented (masked) ultrasonography images. We chose the UNet architecture (eFigure 2 in the Supplement) because it is suited for ultrasonography images.[21,22,23,24] We used a pretrained, publicly available model[25] and retrained it for kidney segmentation. For this, we randomly selected a subset of 600 ultrasonography images (Figure) and manually labeled these images for kidney identification using the labelme software.[26] The selected ultrasonography images were further randomly split into a training set of 500 images and a validation set of 100 images. Using this trained UNet model, which was optimized for mean squared error (L2) loss, we generated masked images (ie, images with everything other than kidneys blacked out) from each of the 6602 preprocessed images. We used the intersection-over-union (IoU) metric to measure the accuracy of segmentation in the validation set only. We used the multiply function of the OpenCV Python library[27] to obtain a masked image from the original image and its UNet-generated mask. Of 6602 images, 6135 (93%) had adequate masks (IoU >90%) and were used for subsequent analyses.
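
The two operations named in this step, scoring a predicted mask with IoU and multiplying an image by its mask, can be illustrated in a few lines. This is a dependency-free sketch on nested lists, assuming binary masks (1 = kidney pixel); the authors used OpenCV on image arrays, and the function names here are hypothetical.

```python
from typing import List

Mask = List[List[int]]   # binary mask, 1 = kidney pixel, 0 = background
Image = List[List[int]]  # grayscale pixel intensities


def intersection_over_union(predicted: Mask, reference: Mask) -> float:
    """IoU = |predicted AND reference| / |predicted OR reference|."""
    intersection = 0
    union = 0
    for pred_row, ref_row in zip(predicted, reference):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                intersection += 1
            if p or r:
                union += 1
    # Convention (an assumption): two empty masks agree perfectly.
    return intersection / union if union else 1.0


def apply_mask(image: Image, mask: Mask) -> Image:
    """Element-wise product: pixels outside the mask become 0 (black)."""
    return [
        [pixel * m for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]
```

With array inputs, the same element-wise product is what an array multiplication routine computes in a single call.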

Feature Extraction

We used transfer learning for this purpose using a pretrained convolutional neural network, Visual Geometry Group 19 (VGG-19) batch normalization (BN). This model (eFigure 3A in the Supplement) comprises an initial feature extractor component followed by a classifier component. For our purposes, we used the feature extractor component only. To prime it for kidney ultrasonography fibrosis, we used IFTA grades as the final output and tuned the VGG-19 BN model using categorical cross-entropy cost function. The final output of the 7 × 7 × 512 features was then flattened into a vector of 25 088 features and further compressed (using a fully connected layer) into a 1024-length feature vector as shown in eFigure 3A in the Supplement. From the trained model, we extracted 1024 features (from the Fc2 layer) using the IntermediateLayerGetter Python library.[28] Training of the feature extractor was done on a randomly selected subset of 90% of the images (5523 images) (Figure). The remaining 612 images were retained as an independent test set for validation. After training, the tuned VGG-19 BN model was used to extract features from all images into a 6135 × 1024 matrix. This matrix, along with the associated class labels, image identifiers, and training or test membership information was used for subsequent image classification.

Image Classification

Image classification was done using extreme gradient boosting (XGBoost, using the xgboost Python library).[29] During this step, we retained the 612 images as an independent test set (Figure). The remaining images were randomly split into a training set of 5122 images and a validation set of 401 images. We used the 1024 features extracted in the feature extraction step as input to the XGBoost algorithm and the ground truth IFTA grades as output for training the classifier. Multiclass log loss was used for optimization. Grid search was used to find the optimum hyperparameters, and the best-predicting model (the one with the least multiclass error) was used to predict the IFTA grades both in the validation set of 401 images and in the independent test set of 612 images.
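
The partition sizes described above add up as 5122 + 401 + 612 = 6135. A minimal sketch of such a split follows, assuming a simple seeded shuffle at the image level; the text does not say whether the split was stratified or grouped by patient, and the function name and seed are hypothetical.

```python
import random


def split_indices(n_images, n_test, n_validation, seed=0):
    """Shuffle image indices once, then carve out test, validation, training."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    test = indices[:n_test]
    validation = indices[n_test:n_test + n_validation]
    training = indices[n_test + n_validation:]
    return training, validation, test


training, validation, test = split_indices(6135, n_test=612, n_validation=401)
```

Holding the test indices out before any training step is what keeps the 612-image evaluation independent of both the feature extractor and the classifier.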

Statistical Analysis

Descriptive statistics included mean (SD) for continuous variables and proportions for categorical variables. Statistical significance for distribution across grades of IFTA was assessed using 1-way analysis of variance for continuous variables and a 2-sided Pearson χ² test for categorical variables. Agreement between the pathologists’ grading of IFTA was assessed using Cohen κ with quadratic (squared) weights. Performance metrics for the image classification task were precision (synonymous with positive predictive value as used in epidemiology), recall (synonymous with sensitivity), accuracy, and F1 score (which was estimated as the harmonic mean of precision and recall). Because clinical characteristics such as age, diabetes, hypertension, and the eGFR (derived using the Modification of Diet in Renal Disease equation) are associated with the IFTA grade, we examined whether the combination of the clinical characteristics with DL predictions improved the prediction. For this, we ran a baseline multinomial logistic model that predicted the ground truth IFTA class using the DL-predicted IFTA class as the independent variable. In the next nested model, we added age, sex, hypertension, diabetes, body mass index, and eGFR as the covariates. Incremental predictive performance of the clinical predictors over that of DL prediction was assessed by comparing likelihood ratio χ², pseudo R², and Brier score. To make the alternative model more robust, we also evaluated whether using powerful machine learning algorithms can further improve the IFTA predictions obtained by combining DL predictions with clinical characteristics. For this, we used the package CMA in R statistical software version 4.0.2 (R Project for Statistical Computing) and evaluated the following machine learning methods: component-wise boosting, linear discriminant analysis, diagonal discriminant analysis, partial least squares combined with linear discriminant analysis, feed forward neural network, random forest, and support vector machines. All statistical analyses were conducted in Stata statistical software version 12.0 (StataCorp). A global type I error rate of .05 was used to test statistical significance. Data analysis was performed from December 2019 to May 2020.
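
For a 4-class problem, precision, recall, and F1 score have to be averaged across classes in some way. The sketch below assumes macro averaging (an unweighted mean of per-class values), which the text does not state explicitly; the 2 × 2 confusion matrix is a made-up toy example, and the function name is hypothetical.

```python
def macro_metrics(confusion):
    """Macro-averaged metrics from a square confusion matrix.

    confusion[i][j] = number of items with true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    precisions, recalls, f1_scores = [], [], []
    for k in range(n_classes):
        true_positive = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n_classes))
        actual_k = sum(confusion[k])
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        # Per-class F1 is the harmonic mean of precision and recall.
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    correct = sum(confusion[k][k] for k in range(n_classes))
    return {
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "accuracy": correct / total,
        "f1": sum(f1_scores) / n_classes,
    }


toy_confusion = [[8, 2], [1, 9]]  # hypothetical counts, not study data
toy_metrics = macro_metrics(toy_confusion)
```

Under macro averaging, the mean of per-class F1 scores generally differs from the harmonic mean of the averaged precision and recall, so the two should not be expected to coincide.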

Results

Study Participants and Ultrasonography Images

A total of 367 kidney biopsies were performed in the study period; information on degree of IFTA and concurrent ultrasonography images were available for 352 biopsies (96%). Of the 352 patients (mean [SD] age, 47.43 [14.37] years), 193 (54.82%) were women. Clinical and demographic characteristics of these patients are shown in Table 1. Numbers of patients assigned to different IFTA grades were as follows: grade 1, 159 patients (45.17%; 2701 ultrasonography images); grade 2, 74 patients (21.02%; 1239 ultrasonography images); grade 3, 41 patients (11.65%; 701 ultrasonography images); and grade 4, 78 patients (22.16%; 1494 ultrasonography images). IFTA grade increased with age, presence of diabetes, hypertension, and increased serum creatinine level. For the 352 biopsies included in the study, a total of 6135 ultrasonography images had adequate masks (Figure) and were used to train and test the DL algorithm.

Table 1.

Characteristics of the Study Participants

Characteristic	Participants, No. (%)				P value
Characteristic	IFTA 0%-24% (n = 159)	IFTA 25%-49% (n = 74)	IFTA 50%-74% (n = 41)	IFTA ≥75% (n = 78)	P value
Age, mean (SD), y	42.6 (13.4)	52.9 (13.5)	51.7 (13.7)	49.8 (14.5)	<.001
Sex
Female	93 (59.5)	43 (57.1)	21 (51.2)	36 (48.2)	.36
Male	66 (40.5)	31 (42.9)	20 (48.8)	42 (51.8)	.36
Race/ethnicity
White	70 (44.2)	21 (31.2)	10 (24.4)	29 (38.6)	.16
Black	56 (35.6)	39 (50.7)	21 (51.2)	37 (47.0)
Asian	11 (6.8)	7 (9.1)	2 (4.9)	5 (6.0)
Other^a	22 (13.5)	7 (9.1)	8 (19.5)	7 (8.4)
Diabetes	39 (23.9)	32 (44.1)	18 (43.9)	40 (50.6)	<.001
Hypertension	101 (63.8)	68 (90.9)	40 (97.6)	71 (91.6)	<.001
Body mass index^b	29.7 (7.0)	29.8 (7.0)	29.2 (7.1)	29.8 (6.3)	.88
Creatinine, mg/dL	1.49 (1.84)	2.21 (1.43)	2.41 (0.85)	4.17 (2.18)	<.001
Estimated glomerular filtration rate, mL/min	87.3 (51.0)	41.0 (24.3)	30.8 (11.2)	19.2 (10.1)	<.001
Proteinuria, g/g creatinine	4.06 (3.9)	4.92 (5.23)	3.65 (2.76)	4.90 (4.76)	.23
Biopsy diagnosis
Lupus nephritis	62 (39.0)	8 (10.8)	3 (7.3)	8 (10.3)	<.001
Diabetic nephropathy	4 (2.5)	19 (25.7)	10 (24.4)	24 (30.8)	<.001
Focal segmental glomerulosclerosis	14 (8.8)	20 (27.0)	12 (29.3)	8 (10.3)	<.001
IgA nephropathy	19 (11.9)	6 (8.1)	4 (9.8)	10 (12.8)	.78
Membranous glomerulonephritis	21 (13.2)	6 (8.1)	2 (4.9)	2 (2.6)	.04
Antineutrophil cytoplasmic antibody vasculitis	4 (2.5)	4 (5.4)	1 (2.4)	9 (11.5)	.02
Hypertensive nephropathy	1 (0.6)	1 (1.4)	2 (4.9)	9 (11.5)	<.001
Minimal change disease	11 (6.9)	2 (2.7)	0	0	.02
Ultrasonography images, No.
Total	2701	1239	701	1494	.20
Per patient	16.99	16.74	17.13	19.15	.29

Abbreviation: IFTA, Interstitial fibrosis and tubular atrophy.

SI conversion factor: To convert creatinine to micromoles per liter, multiply by 88.4.

^a Other includes American Indian or Alaska Native and Native Hawaiian or Pacific Islander or that race/ethnicity was not indicated in the medical record.

^b Body mass index is calculated as weight in kilograms divided by height in meters squared.


Agreement Between Pathologists’ IFTA Scores

Overall, there was excellent agreement between the 2 pathologists for IFTA classification (weighted Cohen κ, 0.84) (Table 2), except for IFTA grade 3 (50%-74% IFTA score). Thus, we proceeded with the ensuing analyses using grades assigned by the first nephropathologist (D.C.), who had graded all the histopathology slides, as the ground truth labels for IFTA grades.

Table 2.

Agreement Among Pathologists’ Independent Evaluation of IFTA Scores on Randomly Selected Subsample of Histopathology Slides

Pathologist 1, No. of slides	Pathologist 2, No. of Slides				Total
Pathologist 1, No. of slides	IFTA 0%-24%	IFTA 25%-49%	IFTA 50%-74%	IFTA ≥75%	Total
IFTA 0%-24%	26	4	0	0	30
IFTA 25%-49%	5	12	3	1	21
IFTA 50%-74%	0	5	2	0	7
IFTA ≥75%	0	6	3	26	35
Total	31	27	8	27	93

Abbreviation: IFTA, Interstitial fibrosis and tubular atrophy.

Weighted Cohen κ = 0.8360, and SE = 0.1026.
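
The weighted κ can be recomputed directly from the counts in Table 2. The sketch below assumes quadratic (squared-distance) disagreement weights; the function name is hypothetical and this is not the authors' Stata code.

```python
def quadratic_weighted_kappa(table):
    """Cohen kappa with weights (i - j)**2 on a square agreement table.

    table[i][j] = slides graded i by pathologist 1 and j by pathologist 2.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2
            observed += weight * table[i][j]
            expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - observed / expected


# Counts from Table 2 (rows: pathologist 1; columns: pathologist 2).
table_2 = [
    [26, 4, 0, 0],
    [5, 12, 3, 1],
    [0, 5, 2, 0],
    [0, 6, 3, 26],
]
kappa = quadratic_weighted_kappa(table_2)
```

On these counts the function returns approximately 0.836, which agrees with the reported weighted κ of 0.8360.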


Preprocessing, Kidney Segmentation, and Feature Extraction

When the Crimmins-filtered, smoothed images (eFigure 1 in the Supplement) were used for training a UNet model for kidney segmentation (eFigure 2A in the Supplement), the network needed only 4 epochs to provide the best estimate of IoU with a rapidly decreasing loss (best IoU = 0.91, or 91% accuracy) (eFigure 2B in the Supplement). We then subjected all the preprocessed images to this tuned UNet model. We inspected the resulting images and their masks (eFigure 2C in the Supplement) manually and found that in poorly segmented images the proportion of the mask to the entire image was less than 0.05. We thus excluded these 256 images and retained a set of 6346 images from the entire set of 367 patients. After further excluding images from the 15 patients for whom IFTA classes were not available, the final set of 6135 ultrasonography images was used for feature extraction.
Of these images, 5523 were used for training the feature extractor. The training of the feature extractor was consistent, gradual, and reasonably smooth, as shown by the decreasing loss function (eFigure 3B in the Supplement). Using the tuned model, we generated the feature map for all the 6135 masked images as shown in eFigure 3C in the Supplement. The distribution of IFTA grades was similar in the training, validation, and test sets (eFigure 4 in the Supplement).
We then trained an XGBoost classifier for image classification. An exhaustive grid search yielded an optimal classification solution with the following set of hyperparameters: learning rate (eta), 0.01; maximum tree depth, 16; subsample fraction, 0.5; and a strong L2 regularization penalty (lambda), 10. The decrement in loss function was monotonic and smooth in both training (5122 images) and validation (401 images) sets, as shown in eFigure 5A in the Supplement. Concordantly, the multiclass labeling accuracy consistently increased in both sets (eFigure 5B in the Supplement), implying acceptable fit to the data. When this model was evaluated in the validation set, we found that the confusion matrix (eFigure 6A in the Supplement) was dense along the diagonal and yielded the following performance metrics (Table 3): precision, 0.8936; recall, 0.7646; accuracy, 0.8429; and F1 score, 0.8054. To further demonstrate the robustness of this approach, the image classifier was evaluated in the independent test set. We observed a very similar performance in this test set (eFigure 5B in the Supplement) with precision of 0.8927 (95% CI, 0.8682-0.9172), recall of 0.8037 (95% CI, 0.7722-0.8352), accuracy of 0.8675 (95% CI, 0.8406-0.8944), and an F1 score of 0.8389 (95% CI, 0.8098-0.8680) (Table 3). A closer look at the confusion matrices (eFigures 6A and 6B in the Supplement) showed that the accuracy of prediction was highest for IFTA grade 1 (almost perfect) and IFTA grade 4 (0.81 and 0.82 in the validation and test sets, respectively).
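
The reported XGBoost hyperparameters map onto the parameter names used by the xgboost library as shown below. This is a configuration sketch only: the four tuned values come from the text, whereas the objective, class count, and evaluation metric entries are assumptions inferred from the 4-grade task and the stated multiclass log loss, and the number of boosting rounds is not reported.

```python
# Parameter dictionary in the form accepted by the xgboost training API.
xgb_params = {
    # Assumptions (not stated in the text):
    "objective": "multi:softprob",  # per-class probabilities for 4 grades
    "num_class": 4,
    "eval_metric": "mlogloss",      # multiclass log loss
    # Values reported from the grid search:
    "eta": 0.01,        # learning rate
    "max_depth": 16,    # maximum tree depth
    "subsample": 0.5,   # fraction of rows sampled per tree
    "lambda": 10,       # L2 regularization penalty
}
```

A small learning rate combined with deep trees, row subsampling, and a large L2 penalty is a common way to let the model capture complex interactions among the 1024 features while limiting overfitting.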

Table 3.

Predictive Performance of the Deep Learning Model to Quantify Interstitial Fibrosis and Tubular Atrophy

Metric	Point estimate (95% CI)
Metric	Validation set (n = 401)	Test set (n = 612)	Patient level (n = 268)
Precision	0.8936 (0.8634-0.9238)	0.8927 (0.8682-0.9172)	0.9003 (0.8644-0.9362)
Recall	0.7646 (0.7231-0.8061)	0.8037 (0.7722-0.8352)	0.8421 (0.7984-0.8858)
Accuracy	0.8429 (0.8073-0.8785)	0.8675 (0.8406-0.8944)	0.8955 (0.8589-0.9321)
F1 score	0.8054 (0.7667-0.8441)	0.8389 (0.8098-0.8680)	0.8639 (0.8228-0.9049)
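
The 95% CIs in Table 3 are numerically consistent with a normal-approximation (Wald) interval for a proportion, although the text does not state how they were computed. The sketch below, with a hypothetical function name, reproduces the test-set precision interval under that assumption.

```python
import math


def wald_interval(estimate, n, z=1.96):
    """Normal-approximation 95% CI: estimate +/- z * sqrt(p * (1 - p) / n)."""
    half_width = z * math.sqrt(estimate * (1.0 - estimate) / n)
    return estimate - half_width, estimate + half_width


# Test-set precision from Table 3 (n = 612 images).
low, high = wald_interval(0.8927, 612)
```

Here `low` and `high` round to 0.8682 and 0.9172, matching the reported interval; the same calculation with n = 268 reproduces the patient-level accuracy interval.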

Sensitivity Analyses

We conducted sensitivity analyses of the image-level predictions in the test set regarding their temporal proximity to the date of kidney biopsy (eTable 1 in the Supplement). The predictive accuracy was better with formal ultrasonography images (0.8986) compared with ultrasonography images obtained during the biopsy (0.8543). However, irrespective of the timing of the ultrasonography studies, the predictive performance of the DL model was consistently high. The comparative performance of the DL model in the subset of patients with 1 of the top 3 most common biopsy diagnoses is shown in eTable 2 in the Supplement. The highest precision of 0.9590 (and lowest recall of 0.7369) was observed for patients with lupus nephritis. In contrast, for the ultrasonography images of patients with diabetic nephropathy, the precision and recall were lowest (0.7673) and highest (0.8385), respectively. In patients with focal segmental glomerulosclerosis, the DL model performance was in between that for patients with lupus nephritis and diabetic nephropathy.

Classification Performance at the Patient Level

For patient-level IFTA prediction, when multiple images for a patient were available, the highest IFTA class assigned by the DL model was considered as the predicted class for that patient. The performance metrics at the patient level (eFigure 6C in the Supplement and Table 3) showed improved precision (0.9003; 95% CI, 0.8644-0.9362), recall (0.8421; 95% CI, 0.7984-0.8858), accuracy (0.8955; 95% CI, 0.8589-0.9321), and F1-score (0.8639; 95% CI, 0.8228-0.9049) compared with corresponding metrics at the image level. Notably, the mean (SE) of eGFR on the day of biopsy was 83.2 (4.6) mL/min for predicted class 1, 39.5 (4.1) mL/min for predicted class 2, 30.3 (2.7) mL/min for predicted class 3, and 20.7 (1.7) mL/min for predicted class 4, demonstrating a significant dose-response relationship (regression coefficient, −21.10; P < .001).
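
The patient-level rule described above, taking the highest grade predicted across a patient's images, amounts to a grouped maximum. A minimal sketch follows; the function name and patient identifiers are hypothetical.

```python
from collections import defaultdict


def patient_level_grades(image_predictions):
    """Collapse image-level grades to one grade per patient.

    image_predictions: iterable of (patient_id, predicted_grade) pairs,
    with grades coded 1-4. The highest grade seen for a patient wins.
    """
    highest = defaultdict(int)  # default 0 is below every valid grade
    for patient_id, grade in image_predictions:
        highest[patient_id] = max(highest[patient_id], grade)
    return dict(highest)


example = [("A", 1), ("A", 3), ("B", 2), ("A", 2)]  # made-up predictions
per_patient = patient_level_grades(example)
```

Taking the maximum favors sensitivity to severe fibrosis: a single image predicted as a high grade determines the patient's class.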

Incremental Predictive Value of the DL Model

We investigated whether the addition of clinical characteristics (those that were significantly associated with IFTA class in Table 1) to DL model predictions could further improve the prediction of IFTA at the level of the patient. A comparison of the baseline (with only DL prediction as the independent variable) and the alternative (with clinical characteristics as additional covariates) multinomial logistic regression models (Table 4 and eTable 3 in the Supplement) showed that although the overall likelihood ratio χ2 improved significantly, from 341.58 to 395.41 with 18 excess df (P < .001), as did the pseudo R2 (improving from 0.5044 to 0.5839), the other prediction metrics (precision, recall, accuracy, and F1 score) remained comparable, with overlapping 95% CIs. Using robust machine learning models also did not provide better prediction (eTable 4 in the Supplement).
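
The nested-model comparison can be checked by hand: the likelihood ratio χ² rises by 395.41 − 341.58 = 53.83 on 21 − 3 = 18 additional df. The sketch below computes the corresponding upper-tail probability using the closed-form χ² survival function that holds for even df; the function name is hypothetical and this is not the authors' Stata output.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2m: P(X > x) = exp(-x/2) * sum_{i=0}^{m-1} (x/2)**i / i!
    """
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = 1.0   # the i = 0 term
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


lr_statistic = 395.41 - 341.58  # difference in likelihood ratio chi-square
p_value = chi2_sf_even_df(lr_statistic, df=21 - 3)
```

The resulting P value is on the order of 10⁻⁵, consistent with the reported P < .001.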

Table 4.

Incremental Value of the DL Model to Predict Interstitial Fibrosis and Tubular Atrophy Class at the Level of Individual Patient

Characteristic	Baseline model	Alternative model
Covariates	DL-based predictions	DL-based predictions, age, sex, diabetes, hypertension, body mass index, estimated glomerular filtration rate
Likelihood ratio χ² (df)	341.58 (3)	395.41 (21)
Pseudo R²	0.5044	0.5839
Brier score	0.0676	0.0644
Point estimates (95% CI)
Precision	0.8798 (0.8409-0.9187)	0.8880 (0.8502-0.9258)
Recall	0.8135 (0.7669-0.8601)	0.8435 (0.8000-0.8870)
Accuracy	0.8843 (0.8460-0.9226)	0.8918 (0.8546-0.9290)
F1 score	0.8354 (0.7910-0.8798)	0.8607 (0.8192-0.9022)

Abbreviation: DL, deep learning.

Results are from multinomial logistic regression analyses with ground truth labels as the dependent variable.


Discussion

We developed, validated, and tested a DL algorithm to predict IFTA (a histopathology-based classification) from ultrasonography images of the kidney. To our knowledge, no such model currently exists, although similar attempts using computerized tomography[30] and magnetic resonance[31] images have been made. Our prediction system capitalizes on ultrasonography imaging, which is done in patients with kidney disease irrespective of need for kidney biopsy. The overall diagnostic accuracy of the DL algorithm alone was approximately 90% at the patient level and was comparable even when combined with baseline clinical characteristics. There have been prior attempts to correlate ultrasonography findings with IFTA on kidney biopsy. Moghazi et al[12] reported that kidney length, echogenicity, and parenchymal thickness were significantly, albeit modestly (correlation coefficient, 0.35 for echogenicity and interstitial fibrosis), correlated with IFTA, and none of the ultrasonographic findings individually or in combination was able to provide a quantitative estimate of IFTA. Other ultrasonographic techniques, such as quantitative echogenicity, shear wave velocity imaging, transient elastography, and ultrasonography corticomedullary strain, have been evaluated.[12] However, unlike FibroScan, which grades liver fibrosis by transient elastography and has obviated the use of biopsy in chronic viral hepatitis, none of the current ultrasonographic methods can provide a clinically useful estimate of IFTA grade.[32,33] Several serum and urinary biomarkers have been evaluated as a noninvasive or semiinvasive measure of IFTA, but no biomarker has been sufficiently accurate to be useful in routine clinical practice. In our study, the DL algorithm was able to predict IFTA grade with 90% accuracy.

Strengths and Limitations

Our study has significant strengths. First, our approach accurately (90% accuracy) predicted the IFTA grade. Second, the biopsy diagnoses represent a spectrum of kidney diseases without exclusions (other than cystic diseases of the kidney for which biopsy is typically not performed). Third, our algorithm was able to segment ultrasonography images to identify the kidney contours with a high degree of accuracy. From a clinical standpoint, it is foreseeable that a DL system such as the one developed in this study has the potential to act as a gatekeeper for rationalizing the decision to conduct a kidney biopsy in patients with CKD. We anticipate that because of the ability of this system to provide a probabilistic estimate of IFTA in real time, the system is likely to be acceptable (because it is unlikely to put any time burden on the technicians) and can also reduce the costs associated with kidney biopsy. For example, the DL algorithm developed in our study was able to identify patients with IFTA less than 25% or greater than 75% with a high degree of accuracy. It is generally accepted that patients with advanced fibrosis (ie, >75%) are not suitable candidates for immunosuppressive therapy in proteinuric glomerular diseases. Thus, a noninvasive method of estimating fibrosis can help in treatment decisions without the need for invasive kidney biopsy. Finally, although the DL-based IFTA predictions correlated with eGFR, future studies need to specifically investigate the potential of this algorithm to predict decline in kidney function prospectively.
The results should be considered in light of some limitations. First, this was a retrospective, observational study because the kidney biopsy and ultrasonography studies were performed before this analysis. A 90% diagnostic accuracy implies that 10% of IFTA grades may potentially be misclassified. Thus, more work is needed to increase the accuracy before such an algorithm can be used in clinical practice. The diagnostic accuracy in IFTA grade 3 was low. This may partly be due to class imbalance because IFTA grade 3 also had the fewest patients (11.65% of the total sample) and the fewest ultrasonography images for training the algorithm compared with the other 3 IFTA grades. It is conceivable that the diagnostic accuracy for IFTA grade 3 would improve with a higher representation of ultrasonography images in this grade. Interestingly, the weakest agreement between pathologists was for IFTA grade 3, which points to the possibility of an inherent difficulty in assigning histopathological images to this grade. Second, many pretrained models are available. Our choice of the VGG-19 BN model was driven by the motivation to use as simple a model as possible, but deeper models, such as those belonging to the ResNet, DenseNet, or Inception families, may improve the accuracy of the IFTA estimate. Third, our UNet segmentation model provided a high average IoU, but it is likely that further enhancing this accuracy would improve the feature extraction and classification ability of the system. Future studies are needed to improve the segmentation component of our system. Fourth, although the DL algorithm was validated on an independent sample of images, all the ultrasonography images used in this study are from a single center. It is therefore not known whether the predictive performance of the model will hold with varying ultrasonography image quality, different equipment used to capture ultrasonography images, different ultrasonography technicians, varying clinical profiles of the patients, and differing prevalence of IFTA grades. Further validation on external data sets is therefore needed. Fifth, DL models are, by nature, adaptive. We anticipate a continually improved performance of the system as more real-time data are provided for its continued learning.
Our work should be viewed as a proof-of-principle that a DL algorithm has the ability to predict IFTA grade from ultrasonography image of the kidney.

Conclusions

In conclusion, we have developed an artificial intelligence–based and DL-driven algorithm that was trained on ultrasonography images to predict IFTA grade with a high degree of accuracy. This article provides proof-of-principle that a DL system can be used to noninvasively, accurately, and independently predict IFTA grade in patients with kidney disease. Although the system in its current form may not be an alternative to kidney biopsy, after robust external validation, a DL-based, noninvasive assessment of IFTA has the potential to significantly enhance clinical decision-making and prognostication in patients with CKD.
