
Prediction of prognosis in immunoglobulin A nephropathy patients with focal crescent by machine learning.

Xuefei Lin^1,2,3, Yongfang Liu^2,3, Yizhen Chen¹, Xiaodan Huang¹, Jundu Li¹, Yuansheng Hou¹, Miaoying Shen¹, Zaoqiang Lin^1,4, Ronglin Zhang^1,5, Haifeng Yang⁶, Songlin Hong⁷, Xusheng Liu⁸, Chuan Zou^8,9.

Abstract

BACKGROUND AND OBJECTIVES: Immunoglobulin A nephropathy (IgAN) is the most common primary glomerular disease worldwide. Its clinical manifestations and the severity of its pathological changes vary, crescent formation is a common finding that occurs in differing proportions of glomeruli, and clinical outcomes are highly heterogeneous between individuals. We therefore aimed to develop a machine learning (ML) based model for predicting the prognosis of IgAN with focal crescent formation and without obvious chronic renal lesions (glomerulosclerosis <25%). MATERIALS: We retrospectively reviewed biopsy-proven IgAN patients in our hospital and a cooperating hospital from 2005 to 2017. Random forest (RF) feature importance was used to identify the variables most closely related to the prognosis of focal crescent IgAN. Multiple ML algorithms were then used to build prediction models. The area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC) were used to evaluate predictive performance via three-fold cross-validation (in each round, two folds were used for training and one for validation).
RESULTS: RF was used to screen for important features; the top three were baseline estimated glomerular filtration rate (eGFR), serum creatinine and serum triglycerides. Ten features were selected as predictors on the basis of both data-driven ranking and medical selection: age, baseline eGFR, serum creatinine, serum triglycerides, complement 3 (C3), proteinuria, mean arterial pressure (MAP), hematuria, crescent proportion of glomeruli and global crescent proportion of glomeruli. Among the ML algorithms tested, the support vector machine (SVM) displayed the best predictive performance, with a precision of 0.77, recall of 0.77, F1-score of 0.73, accuracy of 0.77, AUROC of 0.7957 and AUPRC of 0.765.
CONCLUSIONS: The SVM model is potentially useful for predicting the prognosis of IgAN patients with focal crescent formation and without obvious chronic renal lesions.

Year: 2022 PMID: 35263356 PMCID: PMC8906594 DOI: 10.1371/journal.pone.0265017

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

IgA nephropathy (IgAN) is the most common primary glomerular disease in the world, accounting for more than 40% of primary glomerular diseases. Its incidence in adults exceeds 2.5 per 100,000 per year, and it is a major cause of end-stage renal disease (ESRD) [1-3]; 20–40% of patients with IgAN develop ESRD within 10–20 years [4]. In the updated Oxford classification (MEST-C) of 2017 [5], crescents are an independent predictor of prognosis in patients with IgAN, and prognosis also depends on the proportion of glomeruli with crescents. The prognosis of IgAN therefore differs with the proportion of crescents, and this proportion can be included in prognostic studies. However, crescents are not included in the two prediction models recommended by the 2021 KDIGO guidelines, because crescents are strongly related to race/ethnicity and to the use of immunosuppressants after biopsy [6]. This problem may be addressed with the importance ranking provided by random forests (RFs), in which the importance score of each factor reflects its own contribution. Using these importance scores, the value of clinicopathological parameters for predicting ESRD can be explored to obtain a prediction model better suited to Chinese IgAN patients. There is thus a clear need for a model that includes crescents as an important predictor of disease progression in IgAN. Many previous prediction models for IgAN have used standard modeling with a small number of predefined demographic, clinical and pathological variables to predict progression to ESRD [6-10].
However, previous studies have mostly used standard statistical methods, such as univariate and multivariate Cox regression and proportional hazards models, which evaluate only the relationship between a set of variables and ESRD progression and may miss important interactions between variables and their effects on progression. Compared with conventional statistical methods, machine learning (ML) can be better at identifying variables related to clinical outcomes, at prediction and at modeling complex relationships; it is also robust to noisy data and able to learn from multiple data modalities. ML has many applications in nephrology research and practice [11, 12]. Recently, random forest and artificial neural network (ANN) models have been applied to predict progression to ESRD in IgAN patients [13, 14]. ML algorithms have shown better predictive performance and lower errors. In this study, multiple ML algorithms were applied to predict disease progression in IgAN patients, with the aim of identifying patients at high risk of progression so that early and effective treatment can be given. Since the updated Oxford classification introduced the crescent score, prediction models have included crescents among their predictors. However, they include only the C0, C1 and C2 categories, without subdividing crescents by size, proportion and nature. This study therefore incorporated crescent indices of different size, proportion and nature into the prognostic analysis of IgAN.

Materials and methods

Study population

In this study, 662 biopsy-proven IgAN patients were identified at Guangdong Provincial Hospital of Chinese Medicine and Shanxi Traditional Chinese Medicine Hospital between May 2005 and November 2017. The inclusion criteria were as follows: (1) age > 18 years; (2) biopsy-proven primary IgAN; (3) glomerulosclerosis proportion < 25%; (4) follow-up of more than 12 months, unless ESRD occurred within 12 months. Patients who met any of the following criteria were excluded: (1) insufficient clinical or pathological data; (2) secondary causes of mesangial IgA deposits, such as IgA vasculitis and systemic lupus erythematosus, or comorbid conditions such as diabetes mellitus; (3) atypical IgAN, such as crescentic IgAN; (4) tubulointerstitial fibrosis caused by drugs or ischemia; (5) a biopsy specimen with fewer than 8 glomeruli. This study was approved by the research ethics committee of Guangdong Provincial Hospital of Chinese Medicine (IRB number: B2016-155-01). Because this was a retrospective study and all data were completely anonymized, the ethics committee waived the requirement for informed consent.

Dataset collection and definitions of variables

In this study, baseline demographic, clinical and pathological data were collected for all patients at the time of renal biopsy, including age, gender, mean arterial pressure (MAP, defined as diastolic pressure plus one-third of the pulse pressure), 24-hour urinary protein excretion and estimated glomerular filtration rate (eGFR), calculated with the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation. The type of immunosuppression or renin-angiotensin-aldosterone system (RAAS) blockade that the patient received was recorded, regardless of duration and dose. Immunosuppression was defined as treatment with corticosteroids and/or corticosteroid-sparing agents (including cyclophosphamide, azathioprine, mycophenolate, cyclosporine or tacrolimus). RAAS blockade included any exposure to an angiotensin-converting enzyme inhibitor and/or angiotensin receptor blocker after biopsy. The updated Oxford Classification (MEST-C) for IgAN was applied in this study [5]. Renal biopsy samples from all patients were examined by a pathologist and a nephrologist. Crescents were subdivided according to their volume, composition and proportion. By volume, a large crescent occupies 50% or more of Bowman's capsule and a small crescent occupies less than 50%. By composition, crescents were classified as cellular, fibrocellular or fibrous: a cellular crescent consists of > 75% cells and < 25% fibrous matrix; a fibrocellular crescent consists of 25%-75% cells, with fibrous matrix making up the remainder; and a fibrous crescent consists of > 75% matrix and < 25% cells. The crescent ratio is defined as the number of glomeruli with crescents divided by the total number of glomeruli, and cellular, fibrocellular and fibrous crescents were each evaluated by their relative ratio.
ESRD was defined as eGFR < 15 mL/min/1.73 m² for more than 3 months, or initiation of dialysis or transplantation. The clinical outcome in this study was a combined event after diagnostic kidney biopsy: doubling of serum creatinine, a 50% reduction in eGFR, a 15% reduction in eGFR within 1 year, a 30% reduction in eGFR within 2 years, ESRD or death.
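The variable definitions in this section can be written down as small helper functions. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the handling of the exact 25% and 75% boundaries (assigned here to the fibrocellular class) is an assumption based on the ranges stated in the text.

```python
def mean_arterial_pressure(sbp: float, dbp: float) -> float:
    """MAP = diastolic pressure + one-third of the pulse pressure (SBP - DBP)."""
    return dbp + (sbp - dbp) / 3.0


def crescent_ratio(glomeruli_with_crescents: int, total_glomeruli: int) -> float:
    """Number of glomeruli with crescents divided by the total number of glomeruli."""
    if total_glomeruli < 8:
        # The study excluded biopsy specimens with fewer than 8 glomeruli.
        raise ValueError("biopsy specimen has fewer than 8 glomeruli")
    return glomeruli_with_crescents / total_glomeruli


def crescent_composition(cell_fraction: float) -> str:
    """Classify a crescent by the fraction of it made up of cells (0.0 to 1.0)."""
    if cell_fraction > 0.75:
        return "cellular"
    if cell_fraction < 0.25:
        return "fibrous"
    # 25%-75% cells; the exact boundaries are assigned here by assumption.
    return "fibrocellular"


print(mean_arterial_pressure(120, 80))
print(crescent_ratio(3, 12), crescent_composition(0.5))
```

For a blood pressure of 120/80 mmHg the pulse pressure is 40 mmHg, so the MAP is 80 + 40/3, or about 93.3 mmHg, which is close to the cohort median reported in Table 1.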

ML algorithms

In this study, several representative supervised classification algorithms were used to build prediction models from the selected variables: Support Vector Machine (SVM), Random Forest (RF) and Naïve Bayes (NB). These are “black box” models, in which the function connecting the predictor variables to the response is not transparent to the user. The receiver operating characteristic (ROC) curve, precision-recall curve (PRC) and lift curve were used to assess predictive performance as previously described.

Feature selection and model construction

In this study, 39 clinical, pathological and demographic parameters were considered as candidate predictors of IgAN progression. To find better models, the random forest algorithm, which can assess the importance of every variable, was used to rank the candidate predictors by their relevance to the prognosis of IgAN; all features were entered into the RF to compute their importance. The RF importance ranking alone is not sufficient for feature selection, so clinical considerations were also taken into account. We then built and evaluated two kinds of models, with and without the crescent variables, in a cohort of IgAN patients from China. The ML models (random forest classifier, support vector machine, Naïve Bayes, etc.) were evaluated with three-fold cross-validation (in each round, two folds were used for training and one for validation). The detailed cross-validation process is shown in Fig 1. The ML algorithms were implemented in Python 3.8.5 (https://www.python.org) with scikit-learn (https://scikit-learn.org/stable/).
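The workflow described above (rank all features with a random forest, keep a subset, then compare classifiers with three-fold cross-validation) could look roughly like the following scikit-learn sketch. It is an illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the hyperparameters are library defaults or arbitrary choices, the top ten features are taken purely by rank (the paper also applied clinical judgment), and `average_precision` is used as a summary of the precision-recall curve.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the cohort: 374 samples, 39 candidate features,
# roughly 18% positive class. These are NOT the study data.
X, y = make_classification(
    n_samples=374, n_features=39, n_informative=8,
    weights=[0.82, 0.18], random_state=0,
)

# Step 1: rank all features by random-forest importance and keep the top 10.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
top10 = ranking[:10]

# Step 2: compare candidate models with stratified three-fold cross-validation.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
    "NB": GaussianNB(),
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    results[name] = {
        metric: cross_val_score(model, X[:, top10], y, cv=cv, scoring=metric).mean()
        for metric in ("roc_auc", "average_precision")
    }
    print(name, results[name])
```

In this sketch the feature ranking uses the full data set before cross-validation, which mirrors the order of steps in the text; a stricter variant would repeat the ranking inside each training fold so that the validation fold never influences feature selection.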

Fig 1

All ML models were evaluated with three-fold cross-validation.

Statistical analysis

Continuous variables are presented as means and standard deviations when normally distributed and as medians and interquartile ranges otherwise. Categorical variables are presented as numbers and percentages. The independent-samples t test was used for normally distributed continuous variables, the Mann-Whitney U test for non-normally distributed continuous variables, and the Pearson chi-square test or Fisher's exact test for categorical variables. A P value < 0.05 was considered statistically significant. Statistical analyses were performed using IBM SPSS Statistics software (Version 25.0, IBM Corporation, NY, USA).
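As an illustration of the categorical comparison, the Pearson chi-square test for a 2x2 table can be computed with the standard library alone. This is a sketch rather than the SPSS procedure the authors used: the function name is hypothetical, no continuity correction is applied (an assumption), and the counts are the sex-by-outcome counts from Table 1, with the female counts derived by subtraction from the group sizes.

```python
import math


def pearson_chi2_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Sex by outcome group (Table 1):
# non-endpoint group: 153 male, 155 female; endpoint group: 22 male, 44 female.
chi2, p = pearson_chi2_2x2(153, 155, 22, 44)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

Under these assumptions the result should land close to the P value of 0.016 reported for sex in Table 1.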

Results

Clinical and pathological characteristics of the population

Of the patients with biopsy-proven IgAN seen between May 2005 and November 2017, 374 were finally included (Fig 2); their characteristics are shown in Table 1. In our cohort, 17.6% of the 374 IgAN patients reached the combined event within a median follow-up time of 32.99 (25.86–54.68) months.

Fig 2

Enrollment of IgAN patients in our cohort.

Table 1

Baseline cohort characteristics.

Factors	Overall (N = 374)	Non-endpoint (N = 308)	Endpoint (N = 66)	P value
Male, n (%)	175(46.8)	153(49.68)	22(33.33)	0.016*
Age(years)	31(26–40)	31(26–38.75)	31.5(25–46)	0.502
Follow-up, months	32.99(25.86–54.68)	33.13(26.04–55.71)	31.62(24.73–50.19)	0.288
Disease course, months	7(1–24)	7.5(1–24)	5.5(1–24)	0.821
eGFR, mL/min/1.73m²	108.3±39.49	106.92±35.76	114.75±53.46	0.259
Serum creatinine, μmol/L	80.7±29.74	81.14±28.47	78.66±35.28	0.54
Proteinuria, g/24h	0.85(0.43–1.57)	0.8(0.43–1.49)	1.03(0.48–2.32)	0.055
Hematuria (red blood cells/high-power field)	51(22.75–146.4)	51(23–145.5)	51(19.95–173.75)	0.998
BUN, mmol/L	4.77(4–5.78)	4.8(4.04–5.77)	4.57(3.69–5.8)	0.455
Uric acid, μmol/L	341.5(280–414.25)	343.5(280–416)	335(278–406)	0.752
Cholesterol, mmol/L	4.6(4–5.39)	4.6(4–5.38)	4.69(4.11–5.46)	0.562
Triglyceride, mmol/L	1.2(0.9–1.76)	1.2(0.89–1.7)	1.33(0.88–2.31)	0.224
HDL-C, mmol/L	1.27(1.05–1.53)	1.24(1.03–1.55)	1.35(1.09–1.53)	0.338
LDL-C, mmol/L	2.82(2.32–3.5)	2.89(2.34–3.48)	2.76(2.22–3.51)	0.752
Blood glucose	4.85(4.42–5.1)	4.85(4.4–5.1)	4.79(4.44–5.11)	0.731
TP, g/L	67(62–71.53)	67.3(62.98–71.98)	65.5(59.38–70.35)	0.031*
Serum albumin, g/L	40.9(37.18–43.9)	41.1(37.7–44.15)	38.8(34.28–42.7)	0.005*
Serum IgA, g/L	3.05(2.46–3.5)	3.05(2.45–3.54)	3.05(2.47–3.49)	0.959
Serum C3, g/L	1.02(0.9–1.11)	1.02(0.9–1.1)	1.02(0.93–1.14)	0.585
SBP, mmHg	120(110–130)	120(110–130)	121.5(112.75–134)	0.149
DBP, mmHg	79.5(70–86)	80(70–85)	78.5(70–89.25)	0.419
MAP, mmHg	93.17(84.25–100)	92.84(83.67–100)	93.67(86.5–102.84)	0.239
Hypertension (%)	116(31)	93(30.19)	23(34.85)	0.458
Diabetes (%)	6(1.6)	4(1.3)	2(3.03)	0.287
Hepatitis (%)	28(7.5)	26(8.44)	2(3.03)	0.195
CVD (%)	1(0.3)	1(0.32)	0(0)	1
Smoke (%)	15(4)	12(3.9)	3(4.55)	0.735
Alcohol (%)	10(2.7)	9(2.92)	1(1.52)	1
M1 (%)	308(82.4)	249(80.84)	59(89.39)	0.098
E1 (%)	50(13.4)	39(12.66)	11(16.67)	0.386
S1 (%)	199(53.2)	157(50.97)	42(63.63)	0.061
T1 (%)	46(12.3)	35(11.36)	11(16.67)	0.209
T2 (%)	5(1.3)	3(0.97)	2(3.03)	0.197
C1 (%)	105(28.1)	85(27.6)	20(30.3)	0.556
C2 (%)	14(3.7)	10(3.25)	4(6.06)	0.42
RAAS blockade (%)	247(66)	206(66.9)	41(62.1)	0.458
Immunosuppressant (%)	140(37.4)	108(35.1)	32(48.5)	0.041*

* P < 0.05

The demographic, clinical, laboratory and treatment data of the IgAN patients. C3, complement 3; TP, total protein; MAP, mean arterial pressure; eGFR, estimated glomerular filtration rate; LDL-C, low density lipoprotein cholesterol; HDL-C, high density lipoprotein cholesterol; BUN, blood urea nitrogen; SBP, systolic blood pressure; DBP, diastolic blood pressure; CVD, cardiovascular disease; RAAS, renin-angiotensin-aldosterone system. Immunosuppressants include steroids, cyclophosphamide, ciclosporin, mycophenolate mofetil and others.

The demographic, clinical, pathologic and treatment characteristics at the time of biopsy were retrospectively compared between patients with and without progression to the combined event. The median age of the enrolled patients at IgAN diagnosis was 31(26–40) years, and 175(46.8%) were male. At the time of renal biopsy, the median urinary protein excretion was 0.85(0.43–1.57) g/24 h and the mean eGFR was 108.3±39.49 mL/min/1.73 m². The median MAP was 93.17(84.25–100) mmHg, and 31% (116) of the patients had a history of hypertension. In total, 119 (31.8%) patients had crescents in glomeruli. Of these patients, 105 (28.1%) had crescents in less than 1/4 of glomeruli (C1 group), and 14 (3.7%) had crescents in more than 1/4 of glomeruli (C2 group). Regarding MEST Oxford scores in all patients, 82.4% were M1, 13.4% were E1, 53.2% were S1, and 14.5% were T1/T2. After diagnosis, 247 (66%) patients received RAAS blockade, which included angiotensin-converting enzyme inhibitors (ACEI) and angiotensin receptor blockers (ARB). During the disease course, 140 (37.4%) patients received immunosuppressants, including corticosteroids, cyclophosphamide, ciclosporin, mycophenolate mofetil and tripterygium glycosides, as appropriate. There was no significant difference in the proportion of patients treated with RAAS blockade between the non-endpoint group and the endpoint group (66.9% vs 62.1%, p = 0.458).
The proportion of patients treated with immunosuppressants was higher in the endpoint group than in the non-endpoint group (48.5% vs 35.1%, p = 0.041). Other demographic, clinical and laboratory data of the IgAN patients are shown in Table 1.

Feature importance and selection

To identify crucial predictors of the combined event, we used the Random Forest (RF) method to calculate importance scores for all features (S1 Table lists all features). Features were selected for modeling according to the following principles: (1) Features ranked highly by the feature selection algorithm were preferred. (2) The selected features should cover different aspects as far as possible, such as pathological, clinical and epidemiological characteristics. (3) The selected features should be as independent as possible, to minimize strong correlations between variables. (4) As a rule of thumb, the number of samples should be at least 10 times the number of independent variables in the model, so the number of features should not be too large; a model built on too many variables is also hard to interpret. (5) The principles and practical experience of the medical profession were taken into account. Based on these principles, we selected ten features: baseline estimated GFR, serum creatinine, serum triglycerides, proteinuria, MAP, hematuria, C3, age, crescent proportion of glomeruli and global crescent proportion of glomeruli. All of these features were related to the combined event (Fig 3 shows the feature importance). These 10 prioritized features were used as predictors for modeling on the basis of the importance ranking and medical selection, as shown in Table 2.
The predictors fall into three groups: epidemiological characteristics (age); clinical features (baseline estimated GFR, serum creatinine, serum triglycerides, C3, proteinuria, MAP and hematuria [red blood cells/high-power field]); and pathological findings (crescent proportion of glomeruli and global crescent proportion of glomeruli).

Fig 3

Contribution of the included features of the combined event in IgAN patients.

HDL-C, High density lipoprotein cholesterol, LDL-C, Low density Lipoprotein cholesterol, TP, Total serum protein.

Table 2

Predictors selected using random forest and the corresponding feature importance score.

Features	Importance score
Baseline eGFR, ml/min per 1.73m²	0.066177
Serum creatinine, μmol/L	0.059347
Serum triglycerides, mmol/L	0.054830
Proteinuria, g/d	0.049275
MAP, mm Hg	0.043798
Hematuria (red blood cells/high-power field)	0.043790
Serum C3, g/L	0.043743
Age at biopsy, years	0.036900
Crescent proportion of glomeruli, %	0.013346
Global crescent proportion of glomeruli, %	0.006574

eGFR, estimated glomerular filtration rate; MAP, mean arterial pressure; C3, complement 3.
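The selection principles can be checked against the numbers in Table 2 with a few lines of Python. This is only a sketch: the dictionary copies the importance scores from Table 2 (with the serum marker written as creatinine), and the variable names are hypothetical. It ranks the predictors and computes the samples-per-feature ratio behind principle (4).

```python
# Importance scores copied from Table 2.
importance = {
    "Baseline eGFR": 0.066177,
    "Serum creatinine": 0.059347,
    "Serum triglycerides": 0.054830,
    "Proteinuria": 0.049275,
    "MAP": 0.043798,
    "Hematuria": 0.043790,
    "Serum C3": 0.043743,
    "Age at biopsy": 0.036900,
    "Crescent proportion of glomeruli": 0.013346,
    "Global crescent proportion of glomeruli": 0.006574,
}
n_patients = 374  # cohort size reported in the Results

# Rank the predictors from most to least important.
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)

# Principle (4): aim for at least 10 samples per independent variable.
samples_per_feature = n_patients / len(importance)

for name, score in ranked:
    print(f"{name:42s} {score:.6f}")
print("samples per feature:", samples_per_feature)
```

With 374 patients and 10 predictors the ratio is 37.4, comfortably above the threshold of 10 named in principle (4). The two crescent variables have the lowest scores of the ten, which is consistent with the text's statement that selection combined the ranking with medical judgment.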


ML models establishment and evaluation

In this study, all ten selected features were used in the models with crescent variables, whereas only the first eight features were used in the models without crescent variables. To select the better predictive model, three widely used ML algorithms, support vector machine (SVM), Random Forest (RF) and Naïve Bayes (NB), were compared using the receiver operating characteristic curve and the precision-recall curve. For the models with crescent variables, the AUROCs of the SVM, RF and NB models were 0.7957, 0.6443 and 0.7078, respectively (Fig 4), and the AUPRCs were 0.765, 0.472 and 0.637, respectively (Fig 5). For the models without crescent variables, the AUROCs of the SVM, RF and NB models were 0.831, 0.7041 and 0.5959, respectively (Fig 6), and the AUPRCs were 0.716, 0.567 and 0.567, respectively (Fig 7). In both settings, the SVM model had the highest AUROC and AUPRC.

Fig 4

Receiver operating characteristic (ROC) curves of the three candidate models for the prognosis of IgAN.

AUC, area under the curve.

Fig 5

Precision-recall curves of the three candidate models for the prognosis of IgAN.

Fig 6

Receiver operating characteristic (ROC) curves of the three candidate models for the prognosis of IgAN without ’Crescent proportion’ and ’Global crescent proportion’.

Fig 7

Precision-recall curves of the three candidate models for the prognosis of IgAN without ’Crescent proportion’ and ’Global crescent proportion’.

Table 3 summarizes the metrics of the models with crescent variables, including precision, recall, F1-score, accuracy, AUROC and AUPRC (higher is better for all). The Support Vector Machine model exhibited the highest precision (0.77), recall (0.77), F1-score (0.73), accuracy (0.77), AUROC (0.7957) and AUPRC (0.765). Table 4 summarizes the metrics of the models without crescent variables. Here the Support Vector Machine model exhibited the highest precision (0.78), AUROC (0.831) and AUPRC (0.716), but its recall (0.68), F1-score (0.56) and accuracy (0.68) were lower than those of the Naïve Bayes model (0.72, 0.70 and 0.72, respectively).

Table 3

Summary of the comparison of IgAN with ’Crescent proportion’ and ’Global crescent proportion’ model performance.

Prediction model	Precision	Recall	F1-score	Accuracy	AUROC	AUPRC
Support Vector Machine	0.77	0.77	0.73	0.77	0.7957	0.765
Random Forest	0.69	0.70	0.61	0.70	0.6443	0.472
Naïve Bayes	0.74	0.74	0.69	0.74	0.7078	0.637

Table 4

Summary of the comparison of IgAN without ’Crescent proportion’ and ’Global crescent proportion’ model performance.

Prediction model	Precision	Recall	F1-score	Accuracy	AUROC	AUPRC
Support Vector Machine	0.78	0.68	0.56	0.68	0.831	0.716
Random Forest	0.65	0.68	0.63	0.68	0.7041	0.567
Naïve Bayes	0.70	0.72	0.70	0.72	0.5959	0.567
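A quick way to read Tables 3 and 4 side by side is to ask which model leads on each metric. The sketch below copies the values from the two tables; the helper name `best_per_metric` is hypothetical, and ties are resolved in favor of the first model listed.

```python
METRICS = ("precision", "recall", "f1", "accuracy", "auroc", "auprc")

with_crescent = {      # Table 3
    "SVM": (0.77, 0.77, 0.73, 0.77, 0.7957, 0.765),
    "RF":  (0.69, 0.70, 0.61, 0.70, 0.6443, 0.472),
    "NB":  (0.74, 0.74, 0.69, 0.74, 0.7078, 0.637),
}
without_crescent = {   # Table 4
    "SVM": (0.78, 0.68, 0.56, 0.68, 0.831, 0.716),
    "RF":  (0.65, 0.68, 0.63, 0.68, 0.7041, 0.567),
    "NB":  (0.70, 0.72, 0.70, 0.72, 0.5959, 0.567),
}


def best_per_metric(table):
    """Return {metric: model with the highest value}; ties go to the first model listed."""
    best = {}
    for i, metric in enumerate(METRICS):
        best[metric] = max(table, key=lambda model: table[model][i])
    return best


print("with crescent:   ", best_per_metric(with_crescent))
print("without crescent:", best_per_metric(without_crescent))
```

On these numbers the SVM leads every metric in Table 3, while in Table 4 it leads precision, AUROC and AUPRC and Naïve Bayes leads recall, F1-score and accuracy.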

Performance evaluation of excellent models with Lift

The Lift curve is a commonly used tool for evaluating ML classifiers. Lift reflects how many times more accurate a prediction is than random selection without a prediction model. A good model has a high Lift, and its Lift curve should be as steep as possible. To obtain a more reliable evaluation, this study used both ROC curves and Lift curves to assess the models built with different algorithms. Among the models with crescent variables, the SVM model was examined: the larger its Lift value, the better the model. As shown in Fig 8, the SVM model predicted the combined endpoint event of IgAN with Lift = 3.65, that is, 3.65 times more accurately than random prediction, and the absence of the combined endpoint event with Lift = 1.38, that is, 1.38 times more accurately than random prediction. The Lift curve shows an overall downward trend, which also suggests that the SVM model has good predictive performance.

Fig 8

The Lift curve with Support Vector Machine model.

“Class 0” indicates IgAN patients who did not reach the combined endpoint, and “Class 1” indicates IgAN patients who reached the combined endpoint.
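The paper does not spell out how Lift was computed. One common definition, assumed here, is the precision for a class divided by the prevalence of that class, so that a Lift of 1 means no better than random selection. The sketch below uses that definition on toy labels; the function name and the data are hypothetical.

```python
def lift(y_true, y_pred, target_class):
    """Lift = precision for target_class / prevalence of target_class."""
    # True labels of the samples that the model assigned to target_class.
    predicted = [t for t, p in zip(y_true, y_pred) if p == target_class]
    if not predicted:
        raise ValueError("no samples were predicted as the target class")
    precision = sum(1 for t in predicted if t == target_class) / len(predicted)
    prevalence = sum(1 for t in y_true if t == target_class) / len(y_true)
    return precision / prevalence


# Toy labels (hypothetical): 2 of 10 patients reach the endpoint (class 1).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print("Lift, class 1:", lift(y_true, y_pred, 1))
print("Lift, class 0:", lift(y_true, y_pred, 0))
```

In the toy data, half of the patients flagged as class 1 truly reach the endpoint against a prevalence of 20%, so the Lift for class 1 is 0.5 / 0.2 = 2.5. The Lift for the majority class is necessarily much closer to 1, because its prevalence is already high.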


Models calibration

Calibration is an important index of how accurately a disease risk model predicts the probability of a future outcome event for an individual. It reflects the agreement between the predicted risk and the observed risk, and is therefore also called consistency. Good calibration indicates that the predicted risks are accurate; poor calibration indicates that the model may overestimate or underestimate the risk of disease. As shown in Figs 9 and 10, the calibration curves of the Random Forest model (blue) and the Support Vector Machine model (black) are relatively better.

Fig 9

Calibration plots of the three candidate models for the prognosis of IgAN with ’Crescent proportion’ and ’Global crescent proportion’.

Fig 10

Calibration plots of the three candidate models for the prognosis of IgAN without ’Crescent proportion’ and ’Global crescent proportion’.
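A calibration plot of the kind shown in Figs 9 and 10 is typically built by grouping patients into bins of predicted risk and comparing the mean predicted risk in each bin with the observed event rate. The following is a minimal sketch of that binning with equal-width bins; the function name, the number of bins and the toy data are assumptions, not details taken from the paper.

```python
def calibration_table(y_true, y_prob, n_bins=5):
    """Return (mean predicted risk, observed event rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((t, p))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            observed = sum(t for t, _ in members) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows


# Toy example (hypothetical predictions for six patients).
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.15, 0.30, 0.35, 0.82, 0.90]
rows = calibration_table(y_true, y_prob)
for mean_pred, observed, count in rows:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n = {count}")
```

A well-calibrated model gives rows whose predicted and observed values are close to each other, which corresponds to points near the diagonal of a calibration plot.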

Discussion

Crescents have been implicated as an important marker of poor prognosis in IgAN. However, a validation study in Chinese IgAN patients failed to confirm an increased risk of renal function progression in the C1 or C2 groups compared with the C0 group [15]. The discrepant findings may be due to different definitions of outcomes. In addition, given the nature of crescents, IgAN patients with more crescents may be at an early or acute stage of renal damage. We therefore defined the clinical outcome as a combined event after diagnostic kidney biopsy. It has been reported that glomerulosclerosis >25%, a chronic pathological lesion, may interfere with the effect of crescents, an active lesion, on the prognosis of IgAN; both are associated with decreased renal survival in IgAN patients [16]. Previous studies have shown that renal biopsies in IgAN patients mainly show no obvious chronic lesions (glomerulosclerosis <25%, T score <2) [6, 16, 17]. However, in IgAN patients without obvious chronic lesions, few prediction models have examined the effect of crescent indices of different size, proportion and nature on prognosis. Therefore, 374 patients with IgAN without obvious chronic lesions were retrospectively analyzed, and multiple ML algorithms were applied to explore a useful and practical predictive model. Some baseline characteristics at renal biopsy differed significantly between the non-endpoint and endpoint groups, so it is possible to use baseline characteristics at renal biopsy to predict the combined event in IgAN patients. To date, several clinical and pathological parameters have been associated with a high risk of kidney disease progression in IgAN.
Previously identified risk parameters for IgAN include gender [13, 18, 19], age [6, 13, 18–20], baseline serum creatinine concentration [13, 18–20], eGFR [6, 18], SBP [6, 13, 18–20], DBP [6, 13, 18–20], proteinuria [6, 13, 18–20], hematuria [20], serum UA concentration [20], serum albumin concentration [18], treatment type [6, 13, 18] and histology grading [6, 13, 18–20]. In this study, in line with the risk factors identified above, the clinical and pathological predictors were baseline eGFR, serum creatinine, serum triglycerides, proteinuria, MAP, hematuria, age at biopsy, serum C3, crescent proportion of glomeruli and global crescent proportion of glomeruli. Based on the random forest algorithm, the top two features were baseline eGFR and serum creatinine, which are established strong predictors of IgAN prognosis. Hypertriglyceridaemia at the time of diagnosis, which may have a role in tubulointerstitial lesions, is an important independent risk factor for poor outcome in IgAN [21]. In addition to the well-known risk factors of proteinuria, age and MAP, hematuria has been independently associated with IgAN progression [22]. The formation of crescents has been reported to be related to the degree of mesangial C3 deposition, and a low serum level of complement C3 is often associated with poor renal outcomes in IgAN [23]. Consistent with our previous study, global crescents and serum C3 are independent risk factors for IgAN progression and poor renal outcome in IgAN patients without significant chronic kidney damage [24]. Furthermore, a recent study showed that crescents are independently associated with higher mortality in IgAN [25]. Thus, all ten prioritized predictive features described above have clinically reasonable explanations. In our study, several prevailing algorithms were applied to identify severe progression or poor prognosis of IgAN. Our raw dataset showed an obvious imbalance, with 66 target events (the combined event) and 308 negative samples.
Such imbalance is very common in clinical research. We therefore adopted three-fold cross-validation and oversampling to improve the stability and generalization of the model and to reduce the effect of the imbalance. Of the three ML models compared, the SVM model outperformed the Random Forest and Naïve Bayes models. To evaluate the predictive accuracy of the ML models, we calculated and compared the areas under the receiver operating characteristic curve (AUROC). Although the SVM model without crescent variables has a higher AUROC, the SVM model with crescent variables has higher AUPRC and F1 values. AUROC is a general metric for model selection, but it is not the only reference. In clinical practice, and particularly for our imbalanced dataset, the AUPRC and F1-score are more practical evaluation indicators. On this basis, we preferred the SVM model with crescent variables over the other models. SVM is an algorithm that identifies a high-dimensional boundary that distinctly classifies data points. Notable strengths of our study include the choice to select patients without obvious chronic renal lesions (glomerulosclerosis <25%) at the time of biopsy, which makes it possible to identify the effect of crescents on the prognosis of IgAN more clearly. In our two-center IgAN cohort, we excluded crescentic IgAN in order to predict the progression of IgAN with focal crescent formation. In addition, we identified the important features using a more objective ML approach. Furthermore, in addition to the definitive outcomes of ESRD or death, our combined event incorporated a 50% reduction in eGFR, doubling of serum creatinine, a 15% reduction in eGFR within 1 year and a 30% reduction in eGFR within 2 years, which is more appropriate for evaluating the prognosis of IgAN patients with manifestations of differing severity.
Lastly, and perhaps most importantly, we applied ML algorithms, which can build complex models and make accurate decisions, rather than traditional statistical methods. A further strength of our study is that we selected predictors on the basis of both feature scores and medical selection, to avoid ignoring parameters that are not statistically significant or not clinical. Our study has some limitations. First, only Chinese patients were included in the model, and prediction for other populations was not evaluated. Second, our cohort was not large enough, some data were missing and the classes were imbalanced; predictive ability was therefore impaired by the relatively small number of positive events. Third, because of the retrospective study design, the duration and dosage of immunosuppressive (IS) therapy were not collected. More prospective studies with larger cohorts are needed to support the present findings, as retrospective analyses alone are not sufficient to determine the choice of treatment for patients with IgAN. Furthermore, external validation is required to assess overfitting. Finally, the model was developed and tested in retrospective cohorts, which cannot show its effect in guiding treatment. Nevertheless, the current prediction model is an effective and simple method for predicting the progression of IgAN patients to the combined event.

Conclusion

In conclusion, ML algorithms exhibit excellent predictive performance for IgAN patients with focal crescent formation. Among these algorithms, the support vector machine model showed the highest sensitivity, AUROC and lift, and can be used to predict the prognosis of IgAN patients with focal crescent formation. We also identified that eGFR, serum creatinine, serum triglyceride, proteinuria, MAP, hematuria, age, serum C3, crescent proportion of glomeruli and global crescent proportion of glomeruli had important impacts on the predictive ability of the models. In future work, prospective multicenter studies with multiple datasets are needed to evaluate the validity of these models and to reduce the influence of the imbalance in the target variable.
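For readers less familiar with the ranking metrics named in this article (AUROC, AUPRC and lift), the sketch below computes them in plain Python on made-up scores. It is a minimal illustration: average precision is used as a step-wise estimate of the AUPRC, which is an assumption, since the article does not state how its AUPRC was computed, and the labels and scores are invented.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive scores above a random negative
    (ties count one half); this equals the area under the ROC curve."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at each correctly ranked positive.
    Assumes at least one positive label is present."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def lift_at(y_true: Sequence[int], scores: Sequence[float], fraction: float) -> float:
    """Event rate in the top-scored fraction divided by the overall event rate."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    top = ranked[: max(1, round(len(ranked) * fraction))]
    top_rate = sum(y for _, y in top) / len(top)
    return top_rate / (sum(y_true) / len(y_true))


# Invented example: 3 events among 10 patients, scores sorted for readability.
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.1]
print(round(auroc(y_true, scores), 3),
      round(average_precision(y_true, scores), 3),
      round(lift_at(y_true, scores, 0.3), 3))
```

Because average precision and lift depend on the event rate while AUROC does not, two models can rank differently on these metrics when events are rare, which is the situation discussed for the models with and without the crescent features.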

Features included demographic, clinical, laboratory data and treatment of the IgAN patients.

(DOCX)

30 Dec 2021

PONE-D-21-34774

Prediction of prognosis in immunoglobulin a nephropathy patients with focal crescent by machine learning

PLOS ONE Dear Dr. Zou, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please, resolve carefully all the issues raised from Reviewer 1 including the rationale behind the selection of the variables in the model and the composite outcome. About this, it is not clear the importance of crescent in the model.

Please submit your revised manuscript by Feb 09 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,
Fabio Sallustio, PhD
Academic Editor
PLOS ONE

Journal Requirements:

1. When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Partly
Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No
Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #1: Yes
Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: No
Reviewer #2: Yes

5. Review Comments to the Author

Reviewer #1: In this paper the authors present a new method based on machine learning for the prediction of prognosis in IgA nephropathy patients with focal crescent. Considering the growing interest in machine learning and its applications in medicine, the paper is of sure interest for the scientific community. However, some issues need to be solved before the paper is accepted for publication:

1. The paper is difficult to read. English need to be improved, and in the paragraphs the same concept is often repeated several times. Introduction is too long, as well as conclusions, and too much room is dedicated to provide info on ML. The paragraph “feature selection and model construction” need to be expanded.

2. At page 6, line 137, age > 80 years seems an exclusion criterion. However, this is not reported in the flow-chart (Fig. 2).

3. Page 7: if crescent depends on immunosuppression, why drug duration and dose were not recorded?

4. The combined event is quite confusing. It includes several ways to measure CKD progression, including different percentages of reduction of eGFR. In some cases, the reduction is an absolute number, in other measured at 1 or 2 years. Please explain the rationale behind this choice.

5. The predictive performance of the model was evaluated only by ROC curve. However, other parameters need to be taken into account to evaluate the performance of an algorithm. Please implement this section. Furthermore, in order to understand if crescent really improves the model performance, AUC with and without this variable should be reported.

6. In table 2 the authors report the chosen variables using random forest. However, looking at the random forest, the selection criterion is not clear, as variables scored as more important have been discarded. Please clarify this issue.

Minor issues: Pag.5, line 111: Transfer learning is not properly an algorithm of ML, but rather a method used to train the model.

Reviewer #2: Dear Author Congratulations. Your Manuscript is novel and proper. My comment : Accept. Kind Regards.

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: No

8 Feb 2022

Dear Editor Fabio Sallustio and Reviewers,

On behalf of my co-authors, we greatly appreciate the careful review and comments from both you and the reviewers. We believe that by implementing the suggested changes, we now have a stronger manuscript entitled “Prediction of prognosis in immunoglobulin a nephropathy patients with focal crescent by machine learning” for submission to PLOS ONE. We look forward to your positive response to the revised work submitted here.

We present here point-to-point responses for each of the comments in the attached document and have revised our manuscript accordingly. We do not change our statistics or results. And we hope the revised manuscript could be acceptable for you. Revised sections are identified with red text in the paper.

There are no conflicts of interest regarding this work. All authors have read the revised manuscript and approved its submission to PLOS ONE.
Please do not hesitate to contact us if we can be of any further assistance. Thank you and best regards.

Sincerely yours,
Xusheng Liu, Chuan Zou.
Department of Nephrology, Guangdong Provincial Hospital of Chinese Medicine, Guangzhou, Guangdong, China
Correspondence: Xusheng Liu, liuxu801 @ 126.com; Chuan Zou, doctorzc541888 @ 126.com

Response to the editor’s comments:

Dear Editor Fabio Sallustio,

Thank you very much for reviewing our manuscript. The main corrections in the manuscript text and our responses to your comments are as following:

Comment1: When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Response1: Thanks for your kind suggestion. Our manuscript meets PLOS ONE's style requirements.

Comment2: Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

Response2: Thanks for your kind suggestion. The Python code underlying reported findings were deposited in appropriate public data repository which is Figshare. The DOI is 10.6084/m9.figshare.19127399.

Comment3: In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability. Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized. Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access. We will update your Data Availability statement to reflect the information you provide in your cover letter.

Response3: Thanks for your kind suggestion. Our study’s data set underlying reported findings were deposited in appropriate public data repository which is Figshare. The DOI is 10.6084/m9.figshare.19127342.

Responds to the reviewer’s comments:

Reply to Reviewer #1

Dear Reviewer,

Thank you for your positive comments and valuable suggestions to improve the quality of our manuscript.

Comment1: The paper is difficult to read. English need to be improved, and in the paragraphs the same concept is often repeated several times. Introduction is too long, as well as conclusions, and too much room is dedicated to provide info on ML. The paragraph “feature selection and model construction” need to be expanded.
Response1: Thanks for your kind suggestion. We are very sorry for our poorly written manuscript. We improve our article and we hope the revised manuscript could be acceptable for you. And here we did not list the changes but marked in red in the manuscript.

Comment2: At page 6, line 137, age > 80 years seems an exclusion criterion. However, this is not reported in the flow-chart (Fig. 2).

Response2: Thanks for your suggestion. The number of patients older than 80 was zero. As suggested by the reviewer, we have corrected the sentence into “Age > 18 years old.” page 6, line 137;

Comment3: Page 7: if crescent depends on immunosuppression, why drug duration and dose were not recorded?

Response3: Thank you for the detailed review. This is an especially important issue. According to 2021 KDIGO guidelines, there is insufficient evidence to support the use of the Oxford MEST-C score in determining whether immunosuppression should be commenced in IgAN[1]. However, in clinical practice, crescents are often associated with the use of immunosuppressants. Due to the limitation of the retrospective study design, the duration and dosage of IS therapy were not collected. Therefore, more prospective studies with a larger cohort are needed to support the present findings, retrospective analyses alone are not sufficient to determine the treatment choice for patients with IgAN, see the last paragraph of the discussion section.

Comment4: The combined event is quite confusing. It includes several ways to measure CKD progression, including different percentages of reduction of eGFR. In some cases, the reduction is an absolute number, in other measured at 1 or 2 years. Please explain the rationale behind this choice.

Response4: Thank you for your careful observations. In this study, we defined clinical outcome: the combined event (Doubling of serum creatinine, 50% reduction in eGFR, 15% reduction in eGFR within 1 year, 30% reduction in eGFR within 2 years, ESRD or death). The primary clinical outcome was ESRD or death. The US FDA currently accepts 50% reduction in eGFR, assessed as doubling of serum creatinine level, as a surrogate end point for the development of kidney failure in clinical trials of kidney disease progression. In addition, based on a series of meta-analyses of cohorts and clinical trials and simulations of trial designs and analytic methods, 30% reduction in eGFR within 2 to 3 years may be an acceptable surrogate end point in some circumstances[2]. However, the population in this study is IgAN with focal crescent formation and without obvious chronic renal lesions, which is mainly acute lesions. As a result, we added 15% reduction in eGFR within 1 year in the combined event.

Comment5: The predictive performance of the model was evaluated only by ROC curve. However, other parameters need to be taken into account to evaluate the performance of an algorithm. Please implement this section. Furthermore, in order to understand if crescent really improves the model performance, AUC with and without this variable should be reported.

Response5: Thank you again for your valuable suggestions to improve the quality of our manuscript. In addition to ROC Curve, we reported on Precision Recall Curve and LIFT Curve, as well as detailed Precision Recall Curve metrics for each model, further more, supplemented by a description of the Calibration curve Figure 9 and Figure 10 in the article.

Fig 9. Calibration plots of the three candidate models for the prognosis of IgAN with 'Crescent proportion' and 'Global crescent proportion'.

As suggested by the reviewer, we have added three models (IgAN without 'Crescent proportion' and 'Global crescent proportion' Models) (Table 4). Although SVM of IgAN without crescent model has higher AUROC, SVM of IgAN with crescent model has higher AUPRC and F1 values. In clinical practice, particularly for our imbalanced dataset, the AUPRC and F1-score are more practical evaluation indicators for ML. Based on this requirement, we preferred the SVM of IgAN with crescent model rather than other models.

Table 3. Summary of the comparison of IgAN with 'Crescent proportion' and 'Global crescent proportion' model performance

Prediction model         Precision  Recall  F1-score  Accuracy  AUROC   AUPRC
Support Vector Machine   0.77       0.77    0.73      0.77      0.7957  0.765
Random Forest            0.69       0.70    0.61      0.70      0.6443  0.472
Naïve Bayes              0.74       0.74    0.69      0.74      0.7078  0.637

Table 4. Summary of the comparison of IgAN without 'Crescent proportion' and 'Global crescent proportion' model performance

Prediction model         Precision  Recall  F1-score  Accuracy  AUROC   AUPRC
Support Vector Machine   0.78       0.68    0.56      0.68      0.831   0.716
Random Forest            0.65       0.68    0.63      0.68      0.7041  0.567
Naïve Bayes              0.70       0.72    0.70      0.72      0.5959  0.567

Fig 6. Receiver operating characteristic (ROC) curves of the three candidate models for the prognosis of IgAN without 'Crescent proportion' and 'Global crescent proportion'.

Fig 7. Precision-recall curves of the three candidate models for the prognosis of IgAN without 'Crescent proportion' and 'Global crescent proportion'.

Fig 10. Calibration plots of the three candidate models for the prognosis of IgAN without 'Crescent proportion' and 'Global crescent proportion'.

Comment6: In table 2 the authors report the chosen variables using random forest. However, looking at the random forest, the selection criterion is not clear, as variables scored as more important have been discarded. Please clarify this issue.

Response6: Thank you for your kind suggestion. A total of 10 important features were selected as important predictors for modeling on the basis of data-driven and medical selection, as shown in Table 2. The feature selection method of our modeling considers the following principles to select the features that participate in the modeling:

1.1 The top features found by the feature selection algorithm.

1.2 The selected features cover different aspects as far as possible, such as patient pathological characteristics, clinical characteristics, epidemiological characteristics, and so on.

1.3 The selected features are as independent as possible, that is, to minimize the strong correlation between multiple variables.

1.4 Generally speaking, the amount of data of a mathematical model should be at least 10 times the number of independent variables of the model, on the other hand, the number of features involved in modeling should not be too much. The mathematical model established by too many variables can be poorly explained.

1.5 Focus on the principles and practical experience of the medical profession.

Therefore, based on the above principles, we selected the top ten features including: baseline estimated GFR, serum creatine, serum triglycerides, proteinuria, MAP and Hematuria, C3, age, crescents proportion of glomeruli, Global crescent proportion of glomeruli.

Other comments:

Minor issues: Pag.5, line 111: Transfer learning is not properly an algorithm of ML, but rather a method used to train the model.

Response: Thank you for your careful observations, we have amended it.

Reply to Reviewer #2

Dear Reviewer,

Thank you very much for your time involved in reviewing the manuscript and your very encouraging comments on the merits.

Comments: Dear Author Congratulations. Your Manuscript is novel and proper. My comment : Accept. Kind Regards.

References:

[1] 2021 KDIGO 2021 Clinical Practice Guideline for the Management of Glomerular Diseases KIDNEY INT 100 S1-276

[2] Levey A S, Inker L A, Matsushita K, Greene T, Willis K, Lewis E, de Zeeuw D, Cheung A K and Coresh J 2014 GFR Decline as an End Point for Clinical Trials in CKD: A Scientific Workshop Sponsored by the National Kidney Foundation and the US Food and Drug Administration AM J KIDNEY DIS 64 821-35

Submitted filename: Response to Reviewers.docx
21 Feb 2022

Prediction of prognosis in immunoglobulin a nephropathy patients with focal crescent by machine learning

PONE-D-21-34774R1

Dear Dr. Zou,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

Kind regards,
Fabio Sallustio, PhD
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: No

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

6. Review Comments to the Author

Reviewer #1: (No Response)

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?

Reviewer #1: No

28 Feb 2022

PONE-D-21-34774R1

Prediction of prognosis in immunoglobulin a nephropathy patients with focal crescent by machine learning

Dear Dr. Zou:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Fabio Sallustio
Academic Editor
PLOS ONE

25 in total

Review 1. IgA nephropathy.

Authors: James V Donadio; Joseph P Grande
Journal: N Engl J Med Date: 2002-09-05 Impact factor: 91.245

2. Development and validation of a prediction rule using the Oxford classification in IgA nephropathy.

Authors: Shigeru Tanaka; Toshiharu Ninomiya; Ritsuko Katafuchi; Kosuke Masutani; Akihiro Tsuchimoto; Hideko Noguchi; Hideki Hirakata; Kazuhiko Tsuruya; Takanari Kitazono
Journal: Clin J Am Soc Nephrol Date: 2013-10-31 Impact factor: 8.237

3. Predicting the risk for dialysis or death in IgA nephropathy.

Authors: François Berthoux; Hesham Mohey; Blandine Laurent; Christophe Mariat; Aida Afiani; Lise Thibaudin
Journal: J Am Soc Nephrol Date: 2011-01-21 Impact factor: 10.121

4. Hypertriglyceridaemia and hyperuricaemia are risk factors for progression of IgA nephropathy.

Authors: J Syrjänen; J Mustonen; A Pasternack
Journal: Nephrol Dial Transplant Date: 2000-01 Impact factor: 5.992

Review 5. The incidence of primary glomerulonephritis worldwide: a systematic review of the literature.

Authors: Anita McGrogan; Casper F M Franssen; Corinne S de Vries
Journal: Nephrol Dial Transplant Date: 2010-11-10 Impact factor: 5.992

6. Epidemiologic data of renal diseases from a single unit in China: analysis based on 13,519 renal biopsies.

Authors: Lei-Shi Li; Zhi-Hong Liu
Journal: Kidney Int Date: 2004-09 Impact factor: 10.612

7. Random forest can accurately predict the development of end-stage renal disease in immunoglobulin a nephropathy patients.

Authors: Xin Han; Xiaonan Zheng; Ying Wang; Xiaoru Sun; Yi Xiao; Yi Tang; Wei Qin
Journal: Ann Transl Med Date: 2019-06

8. Predicting progression of IgA nephropathy: new clinical progression risk score.

Authors: Jingyuan Xie; Krzysztof Kiryluk; Weiming Wang; Zhaohui Wang; Shanmai Guo; Pingyan Shen; Hong Ren; Xiaoxia Pan; Xiaonong Chen; Wen Zhang; Xiao Li; Hao Shi; Yifu Li; Ali G Gharavi; Nan Chen
Journal: PLoS One Date: 2012-06-14 Impact factor: 3.240