
Clinical prediction of HBV and HCV related hepatic fibrosis using machine learning.

Runmin Wei¹, Jingye Wang², Xiaoning Wang³, Guoxiang Xie², Yixing Wang³, Hua Zhang³, Cheng-Yuan Peng⁴, Cynthia Rajani², Sandi Kwee², Ping Liu⁵, Wei Jia⁶.

Abstract

Clinical prediction of advanced hepatic fibrosis (HF) and cirrhosis has long been challenging because the gold standard, liver biopsy, is an invasive approach with certain limitations. A less invasive blood test in tandem with a cutting-edge machine learning algorithm shows promising diagnostic potential. In this study, we constructed and compared machine learning methods with the FIB-4 score in a discovery dataset (n = 490) of hepatitis B virus (HBV) patients. Models were validated in an independent HBV dataset (n = 86). We further employed these models on two independent hepatitis C virus (HCV) datasets (n = 254 and 230) to examine their applicability. In the discovery data, gradient boosting (GB) stably outperformed other methods as well as FIB-4 scores (p < .001) in the prediction of advanced HF and cirrhosis. In the HBV validation dataset, for classification between early and advanced HF, the area under the receiver operating characteristic curve (AUROC) of the GB model was 0.918, while that of FIB-4 was 0.841; for classification between non-cirrhosis and cirrhosis, GB showed an AUROC of 0.871, while FIB-4 showed 0.830. Additionally, GB-based prediction demonstrated good classification capacity on two HCV datasets, although higher cutoffs for both GB and FIB-4 scores were required to achieve comparable specificity and sensitivity. Using the same parameters as FIB-4, the GB-based prediction system demonstrated steady improvements relative to FIB-4 in HBV and HCV cohorts with different cutoff values required in different etiological groups. A user-friendly web tool, LiveBoost, makes our prediction models freely accessible for further clinical studies and applications.


Keywords: FIB-4; Gradient boosting; Hepatic fibrosis; Hepatitis B; Hepatitis C; Machine learning


Year: 2018 PMID: 30100397 PMCID: PMC6154783 DOI: 10.1016/j.ebiom.2018.07.041

Source DB: PubMed Journal: EBioMedicine ISSN: 2352-3964 Impact factor: 8.143

Evidence before this study

We searched the PubMed database according to the terms [(“prediction” OR “risk prediction” OR “prediction model” OR “predictive” OR “predictive modeling”) AND (“FIB-4” OR “Fibrosis-4” OR “FIB4”) AND (“machine learning” OR “ensemble learning” OR “gradient boosting”) AND (“liver fibrosis” OR “hepatic fibrosis”)] among English-language articles before March 4th, 2018. We identified two studies using genotype-based decision tree models to predict advanced liver fibrosis in patients with chronic hepatitis C viral infection (HCV) and non-alcoholic fatty liver disease in patients with chronic hepatitis B viral infection (HBV). Neither of these studies attempted to use machine learning methods to augment the diagnostic performance based on commonly used clinical indicators, nor discussed the performance between different viral etiologies. We hypothesized that applying cutting-edge machine learning algorithms to existing blood-test scoring system (FIB-4) can augment the detection of advanced hepatic fibrosis and cirrhosis in chronic liver disease patients.

Added value of this study

Our study constructed and compared machine learning methods based on the same clinical parameters of the FIB-4 scoring system in an HBV cohort for detecting advanced hepatic fibrosis and cirrhosis. We validated our models in three independent cohorts including both HBV and HCV. Our machine learning-based prediction system, a less-invasive approach, demonstrated steady diagnostic improvements, which could overcome certain limitations of the gold standard (i.e., liver biopsy), facilitate medical decision making, and enhance long-term clinical surveillance of chronic liver disease.

Implications of all the available evidence

To fill in gaps between machine learning algorithms and real-world clinical studies, we built a user-friendly web tool (LiveBoost) that makes our prediction models easily accessible for further studies and applications.

Introduction

Every year, chronic liver disease (CLD) and its complications lead to approximately 2 million deaths globally [1]. Hepatitis B and C virus infections, chronic alcohol consumption and immune system abnormalities are leading causes of liver injury, with the wound-healing response from liver injury leading to hepatic fibrosis (HF), cirrhosis, and ultimately organ failure or liver cancer [2]. Evidence suggests that HF is reversible in many cases of CLD, but the clinically-significant regression of cirrhosis is still controversial [3], thus highlighting the necessity of a clinical tool for early detection of HF, differentiation of cirrhosis, and longitudinal surveillance of therapeutic responses. The gold standard for clinical measurement of HF is the liver biopsy, which is associated with both significant complications [4] and limitations [5] (e.g., pain, bleeding, infection, perforation of nearby organs, sampling errors, inter-observer and intra-observer variability). A less invasive and more reproducible approach of assessing HF severity and progression would be of great value in the clinical setting. The development of scoring systems based on simple clinical parameters and blood tests (e.g., FIB-4) is one important step towards non-invasive CLD monitoring and diagnosis [6]. FIB-4 was introduced as a non-invasive method to predict HF stages among Caucasian patients with hepatitis C virus (HCV) and human immunodeficiency virus co-infection [7]. Since then, this method has been independently validated in multiple HCV infected and hepatitis B virus (HBV) infected patient cohorts [[8], [9], [10]]. FIB-4 provides an attractive alternative to biopsy due to its affordable price, objective measurements, and avoidance of complications. The formula of FIB-4 is defined as:

$$\mathrm{FIB\text{-}4} = \frac{\mathrm{Age\ (years)} \times \mathrm{AST\ (U/L)}}{\mathrm{PLT}\ (10^{9}/\mathrm{L}) \times \sqrt{\mathrm{ALT\ (U/L)}}}$$

AST: aspartate transaminase; ALT: alanine transaminase; PLT: platelet count.
This relatively simple formula was originally derived from a multiple logistic regression (LR) model with odds ratios considered [7]. However, this statistical approach ignores more complex non-linear interactions between variables that might play significant roles in determining HF severity, and which could be captured using more sophisticated modeling approaches. In recent years, machine learning along with the explosive growth of biomedical big-data has generated much interest in developing clinical informatics tools for disease diagnosis, staging, and prognosis [[11], [12], [13]]. Machine learning, especially ensemble learning, has been successfully applied for recognizing hidden patterns in complex data, allowing for better predictions of clinical outcomes than traditional statistical models, especially when applied to large-scale datasets [14]. Unlike conventional regression-based approaches, ensemble learning algorithms such as random forest (RF) and gradient boosting (GB), are capable of capturing higher-order, non-linear interactions between predictors [15]. For HBV and HCV patients, pathology and genetic data have been successfully used for the implementation of predictive models [[16], [17], [18], [19]]. Here, for the first time, we propose to reconstruct an existing blood test-based clinical scoring system, FIB-4, with cutting-edge machine learning approaches for improved detection and classification of advanced HF and cirrhosis, validating models in multiple independent datasets from patients with CLD of different viral etiologies.
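As a concrete reference point for the FIB-4 score discussed above, the following minimal sketch computes it from the four clinical indicators using the standard published definition, (Age × AST) / (PLT × √ALT). The function name, argument names, and input values are illustrative assumptions and are not taken from the study's code.

```python
import math


def fib4(age_years, ast_u_per_l, alt_u_per_l, plt_1e9_per_l):
    """Return the FIB-4 score: (age * AST) / (PLT * sqrt(ALT)).

    Units follow the usual convention: age in years, AST and ALT in U/L,
    platelet count in 10^9/L.
    """
    if alt_u_per_l <= 0 or plt_1e9_per_l <= 0:
        raise ValueError("ALT and PLT must be positive")
    return (age_years * ast_u_per_l) / (plt_1e9_per_l * math.sqrt(alt_u_per_l))


# Hypothetical inputs chosen only to make the arithmetic easy to follow:
# (40 * 40) / (200 * sqrt(64)) = 1600 / 1600
example_score = fib4(age_years=40, ast_u_per_l=40, alt_u_per_l=64, plt_1e9_per_l=200)
```

Because the score is a fixed ratio of the four inputs, it cannot adapt to interactions among them, which is the limitation the machine learning models in this study are meant to address.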

Materials and methods

Data and ethics

An HBV discovery dataset included a total of 490 HBV infected subjects recruited from Shuguang Hospital in affiliation with Shanghai University of Traditional Chinese Medicine (Shanghai, China) from April 2013 to June 2015. Patients were included after providing informed consent and meeting inclusion and exclusion criteria as described in the appendix (Text S1). An independent HBV dataset (validation-1) included a total of 86 HBV infected subjects recruited from Xiamen Hospital of Traditional Chinese Medicine (Xiamen, China). Recruitment and eligibility criteria were the same as those established for the discovery dataset. These studies were approved by the institutional review board of the Shanghai University of Traditional Chinese Medicine and Xiamen Hospital of Traditional Chinese Medicine. All participants signed informed consent forms for the study. Two additional retrospective anonymous datasets from existing studies were used to further evaluate the prediction models in HCV infected patients. Validation-2 (HCV), comprised of a total of 254 HCV infected subjects, was recruited from China Medical University Hospital, Taiwan. Detailed information about this cohort was provided in the original study publication [20]. Another independent dataset, validation-3 (HCV), comprised of a total of 230 samples from 115 HCV infected patients, was recruited from Komaki City Hospital (Komaki, Japan). In this cohort, biopsy results, clinical parameters, and blood samples were available from before and after antiviral treatment. Detailed information about this study is provided in the original study publication [21].

Liver biopsy

An ultrasound-guided liver biopsy was performed on all patients in both the discovery and HBV validation datasets. All liver biopsies were performed within one week after study recruitment. Liver specimens were placed in 10% neutral buffered formalin and embedded in paraffin for histologic processing. Tissue sections were stained with Masson's trichrome staining and hematoxylin and eosin (H&E). The histologic staging was based on Scheuer's classification using a 5-point scale for HF severity ranging from S0 (non-fibrosis) to S4 (cirrhosis) [22]. The staging was performed by three independent pathologists from Shanghai Medical College of Fudan University who were blinded to patient clinical information. In cases of discordant staging, specimens were re-examined until consensus was reached.

Serum sample collection and test

Overnight fasting (12 h) blood samples were collected from all discovery and validation-1 subjects within one week after recruitment. Blood specimens were placed on ice, processed by centrifugation, and stored in a −80 °C freezer until analysis. Hematological and standard biochemical tests were performed according to the manufacturers' protocols using an LH750 Hematology Analyzer and Synchron DXC800 Clinical System (Beckman Coulter, USA).

Machine learning and statistics

The original formula for FIB-4 suggested potential interactions between predictors. Thus, we began our machine learning modeling with a decision tree (DT), considering its intrinsic capacity for interaction detection. We applied DT, along with two ensemble learning models, RF and GB, to reconstruct the four individual components (Age, AST, ALT, and PLT) of the FIB-4 score. DT is a flowchart-like prediction model that depicts a complete decision-making process where each internal node represents a decision point on a single attribute, and each leaf represents a single assigned class label [23]. The structure of the DT model is similar to the clinical decision-making process, providing a sound rationale for its application to clinical problems [24]. Unfortunately, it often suffers from over-fitting and is considered a high-variance model [25]. The RF model, on the other hand, is an ensemble method which aggregates a large number of DTs using bootstrap resampling and often yields lower variance and better model generalization than a single DT [26]. The GB model goes one step further: instead of averaging prediction results from all DTs as in RF, it grows each new DT based on the old trees so as to decrease the prediction errors that the old trees made [27]. To optimize the model hyper-parameters, 10-fold cross-validation was performed with different hyper-parameter settings in the discovery set. We optimized the complexity parameter for DT and the mtry parameter for RF. For GB, we tuned parameters including interaction.depth, n.trees, shrinkage, and n.minobsinnode in a grid search manner. Receiver operating characteristic (ROC) curves were used as evaluation metrics. The R package caret was applied for the hyper-parameter optimization [28]. The details of tuned hyper-parameters can be found in https://github.com/elise-is/LiveBoost. To determine the final model and test its robustness, we randomly split the discovery set into training (70%) and testing sets (30%) 100 times. 
Each time, we trained the three different machine learning models on the training set using fixed hyper-parameters based on previous model tuning results and lastly, compared these results with the FIB-4 score on the testing set. Area under ROC curves (AUROC) and area under precision-recall curves (AUPR) were calculated to compare the four methods (Step 1 in Fig. 1). The R packages rpart, randomForest, and gbm were applied for the DT, RF, and GB model training, respectively [27, 29, 30].
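To make the boosting idea described above concrete, here is a deliberately simplified sketch in Python: each round fits a one-split tree (a stump) to the current residuals and adds a shrunken copy of it to the running log-odds score. This is not the gbm implementation used in the study; leaf values here are plain mean residuals, the trees have depth 1, and every function name and data value is an assumption made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the single (feature, threshold) split minimising squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]


def fit_gb(X, y, n_trees=50, shrinkage=0.1):
    """Boost stumps on the gradient of the log-loss (residual = y - p)."""
    p0 = sum(y) / len(y)
    f0 = math.log(p0 / (1.0 - p0))           # initial log-odds
    scores = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, thr, lm, rm = fit_stump(X, residuals)
        stumps.append((j, thr, lm, rm))
        scores = [s + shrinkage * (lm if row[j] <= thr else rm)
                  for s, row in zip(scores, X)]
    return {"f0": f0, "shrinkage": shrinkage, "stumps": stumps}


def predict_log_odds(model, row):
    total = sum(lm if row[j] <= thr else rm
                for j, thr, lm, rm in model["stumps"])
    return model["f0"] + model["shrinkage"] * total


# Made-up rows of [age, AST, ALT, PLT]; labels: 1 = advanced fibrosis.
toy_X = [[30, 40, 90, 200], [35, 50, 110, 190], [28, 45, 100, 210],
         [52, 60, 50, 90], [48, 55, 45, 100], [55, 70, 60, 80]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_model = fit_gb(toy_X, toy_y, n_trees=20, shrinkage=0.1)
```

The `n_trees` and `shrinkage` arguments play the same roles as the n.trees and shrinkage hyper-parameters named in the text; interaction.depth and n.minobsinnode have no counterpart in this stump-only sketch.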

Fig. 1

Flowchart of the study design. In step 1 of model selection, we performed training-testing splitting 100 times on the discovery set and trained DT, RF, GB models on the training sets, then compared these results with FIB-4 on testing sets. In step 2, we constructed final GB models and compared results with FIB-4 on the whole discovery set and then validated on the HBV validation set. In step 3, GB models and FIB-4 were used to predict the risks for two extra HCV cohorts. In step 4, we developed a user-friendly web-tool for clinical practices.

After selecting GB as our preferred reconstruction approach, we trained the final GB models on the full discovery set using previously optimized hyper-parameters. Then, we compared our GB prediction scores with FIB-4 scores on both discovery set and validation set-1 (HBV) using ROC and PR curves. The best GB cutoff points were selected using Youden's index, maximizing the sum of sensitivity and specificity [31]. For the FIB-4 score, we applied two previously reported clinical cutoff points (1.45 and 3.25) [[6], [7], [8]]. We calculated specificity and sensitivity at these cutoff points and their 95% confidence intervals (CIs) using 500 times bootstrap resampling (Step 2 in Fig. 1). ROC and PR calculation were conducted with the R packages pROC and PRROC, respectively [32, 33]. To further assess the classification robustness of the FIB-4 reconstruction models for staging HF related to HCV, we applied our trained GB models on two independent HCV validation cohorts (Step 3 in Fig. 1). We employed t-tests to compare FIB-4 and GB scores in early vs. advanced HF and fibrosis vs. cirrhosis within both HCV cohorts. ROC curves, sensitivity, specificity and 95% CIs for predicting advanced HF were calculated for the FIB-4 and GB scores. 
We additionally included two extra blood test-based clinical indicators (i.e., albumin (ALB) and gamma-glutamyl transpeptidase (GGT)) to check whether classification performances could be further improved. We rebuilt new GB models on six predictors (Age, AST, ALT, PLT, ALB, and GGT) and LR models based on FIB-4, ALB and GGT in the discovery set and compared with the original models in two HBV cohorts using ROC curves. Datasets and R-code related to this study can be found at https://github.com/elise-is/LiveBoost.
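The cutoff selection and interval estimation described above can be sketched as follows. This is an illustrative Python version, not the pROC-based R code used in the study: it searches only the observed score values for the Youden-optimal cutoff, treats scores at or above the cutoff as positive, and uses a simple percentile bootstrap. All names and the toy data are assumptions.

```python
import random


def sens_spec(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def youden_cutoff(scores, labels):
    """Observed score maximising J = sensitivity + specificity - 1."""
    best_cut, best_j = None, -1.0
    for c in sorted(set(scores)):
        se, sp = sens_spec(scores, labels, c)
        j = se + sp - 1.0
        if j > best_j:
            best_cut, best_j = c, j
    return best_cut, best_j


def bootstrap_ci(scores, labels, cutoff, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CIs for sensitivity and specificity at a cutoff."""
    rng = random.Random(seed)
    n = len(scores)
    sens, spec = [], []
    while len(sens) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:                  # resample must contain both classes
            se, sp = sens_spec([scores[i] for i in idx], ys, cutoff)
            sens.append(se)
            spec.append(sp)

    def percentile_interval(values):
        v = sorted(values)
        lo = v[int((alpha / 2) * (len(v) - 1))]
        hi = v[int((1 - alpha / 2) * (len(v) - 1))]
        return lo, hi

    return percentile_interval(sens), percentile_interval(spec)


# Toy scores and labels (1 = advanced HF); not study data.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_cut, toy_j = youden_cutoff(toy_scores, toy_labels)
toy_sens_ci, toy_spec_ci = bootstrap_ci(toy_scores, toy_labels, toy_cut)
```

For a fixed clinical cutoff such as the FIB-4 thresholds of 1.45 or 3.25, only `sens_spec` and `bootstrap_ci` would be needed, since no cutoff search is involved.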

Web-tool development

To develop a tool for HF staging that is amenable to use in clinical practices, we designed a web-based application, LiveBoost, providing a graphical user interface (GUI) to access our final trained GB models (Step 4 in Fig. 1). This application is hosted on our server which is publicly accessible via https://metabolomics.cc.hawaii.edu/software/LiveBoost/. The web-tool development was conducted using the R package shiny.

Results

Machine learning model selection

For our first aim (i.e., finding a machine learning approach that robustly improves the original FIB-4 score), we compared the original FIB-4 scoring system with different models using a 100-times jackknife resampling approach by randomly splitting the discovery set into 70% training set and 30% testing set. We built each model on the training set with optimized hyper-parameters and compared results to FIB-4 on the testing set. For differentiating between early (S0–2) and advanced (S3–4) fibrosis, we found that compared to FIB-4 score, DT was associated with comparable AUPR (0.67 vs. 0.68, p = .15) but significantly lower AUROC (0.79 vs. 0.82, p < .001). The RF approach was associated with significantly higher AUPR (0.73 vs. 0.68, p < .001) and comparable AUROC (0.82 vs. 0.82, p = .59). The GB approach was associated with significantly higher AUPR (0.77 vs. 0.68, p < .001) and AUROC (0.85 vs. 0.82, p < .001) (left side panels of Fig. 2). Similarly, for identifying cirrhosis cases (S4) (right side panels of Fig. 2), we found that compared to FIB-4 scoring, the DT-based approach was associated with significantly lower AUPR (0.60 vs. 0.66, p < .001) and AUROC (0.81 vs. 0.87, p < .001). The RF-based approach was associated with significantly higher AUPR (0.70 vs. 0.66, p = .0047) and AUROC (0.89 vs. 0.87, p = .0023). The GB approach was associated with even greater significant differences in AUPR (0.72 vs. 0.66, p < .001) and AUROC (0.90 vs. 0.87, p < .001). Descriptive statistics of the AUROCs and AUPRs for these approaches are summarized in Table S1. Altogether, the GB approach provided the greatest improvements in classification capacity over the FIB-4 scoring system among the three machine learning methods. Additionally, higher variances associated with the DT approach indicated less robustness than the other approaches. 
In addition to showing significantly better classification performance relative to FIB-4, two ensemble learning approaches were associated with smaller variances than DT.
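The repeated 70%/30% evaluation reported above can be illustrated with a short Python sketch. It computes AUROC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, and evaluates a precomputed score (such as FIB-4) on the held-out 30% of each random split. For a trained model, one would additionally fit on the complementary 70% inside the loop. Names and toy data are assumptions, not the study's code.

```python
import random


def auroc(scores, labels):
    """AUROC as P(positive score > negative score); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_holdout_auroc(scores, labels, n_repeats=100, test_frac=0.3, seed=0):
    """AUROC of a fixed score on the test part of repeated random splits."""
    rng = random.Random(seed)
    n = len(scores)
    n_test = max(2, round(n * test_frac))
    results = []
    while len(results) < n_repeats:
        test_idx = rng.sample(range(n), n_test)
        y_test = [labels[i] for i in test_idx]
        if 0 < sum(y_test) < n_test:         # skip splits with a single class
            results.append(auroc([scores[i] for i in test_idx], y_test))
    return results


# Toy example: a score that separates the two classes perfectly.
toy_scores = [i / 20 for i in range(20)]
toy_labels = [0] * 10 + [1] * 10
toy_aurocs = repeated_holdout_auroc(toy_scores, toy_labels)
toy_mean_auroc = sum(toy_aurocs) / len(toy_aurocs)
```

Collecting one AUROC per split for each method yields the per-method distributions that can then be compared, as in the boxplots of Fig. 2.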

Fig. 2

Boxplots of AUPR and AUROC on testing sets for four different methods. P-values were calculated using Student's t-tests.

Model construction and validation

After we selected the GB model as our reconstruction approach for FIB-4, we finalized our prediction models for detecting advanced HF (discriminating S0–2 from S3–4) and cirrhosis (discriminating S0–3 from S4) by training GB models on the full discovery set with optimized hyper-parameters, to produce GB-based risk scores. To validate this GB-based scoring system, we applied it to our first validation set which was derived from an independent HBV cohort. Table 1 summarizes the four clinical indicators and other demographic information for the discovery and validation-1 datasets. Applying our final trained GB models to generate risk scores (in log-odds scale) for all the samples in both the discovery and validation-1 sets, we again found higher AUROC (Fig. 3A) and AUPR (Fig. S1) values for the GB-based scoring relative to FIB-4 scoring for both datasets. For classification between S0-2 and S3-4, GB showed an AUROC of 0.904 and 0.918, AUPR of 0.836 and 0.925 in the discovery set and validation set-1, respectively while FIB-4 showed an AUROC of 0.817 and 0.841, AUPR of 0.688 and 0.844, respectively. For classification between S0–3 and S4, GB showed an AUROC of 0.961 and 0.871, AUPR of 0.891 and 0.833 in the discovery set and validation set-1, respectively while FIB-4 showed an AUROC of 0.864 and 0.830, AUPR of 0.671 and 0.738, respectively. We then compared the specificity and sensitivity at the best cutoff values for GB scores and two recommended cutoffs for FIB-4 (Fig. 3B), finding that the GB prediction model produced higher specificity (0.86 and 0.85 in the discovery set and validation set-1, respectively) and sensitivity (0.79 and 0.84 in the discovery set and validation set-1, respectively) than FIB-4 (specificity = 0.74 and 0.83, sensitivity = 0.74 and 0.78 in the discovery set and validation set-1, respectively) with cutoff = 1.45 for discriminating stages S0-2 from S3-4. 
While FIB-4 scoring with cutoff = 3.25 resulted in higher specificity (0.94 and 0.96 in the discovery set and validation set-1, respectively), it suffered from lower sensitivity (0.50 in the discovery set and 0.68 in the validation-1 set) (left panels of Fig. 3B). Similarly, for detecting cirrhosis (S4), the GB prediction model showed higher and more stable (with narrower CIs) specificity (0.92) and sensitivity (0.88) in the discovery set than FIB-4 using cutoff = 1.45 (specificity = 0.72 and sensitivity = 0.85) (upper right panel of Fig. 3B). Again, FIB-4 with cutoff = 3.25 showed high specificity (0.92) with much lower sensitivity (0.68) (upper right panel of Fig. 3B). In the validation-1 set, GB still demonstrated more balanced specificity (0.85) and sensitivity (0.78), while FIB-4 with cutoff = 1.45 showed lower specificity (0.78) with a larger CI and FIB-4 with cutoff = 3.25 showed lower sensitivity (0.69) (lower right panel of Fig. 3B).

Table 1

Clinical and demographical characteristics of the HBV cohorts.

Data	HF stage	Total Num	Num of M	Num of F	BMI (kg/m^2)	Age (years)	AST (U/L)	ALT (U/L)	PLT (10^9/L)
Discovery Set (HBV)	0	46	39	7	22.1 (20.3–23.5)	32 (27–40)	49 (35–66)	106 (58–171)	190 (161–215)
	1	169	125	44	21.2 (19.5–24.1)	30 (25–38)	58 (39–99)	114 (65–190)	179 (155–214)
	2	134	93	41	21.6 (20.1–24.0)	31 (27–39)	74 (43–138)	155 (80–267)	176 (150–210)
	3	56	47	9	22.5 (20.9–25.0)	39 (29–47)	62 (44–112)	90 (56–250)	148 (108–182)
	4	85	53	32	22.5 (20.9–24.5)	50 (40–58)	45 (31–77)	45 (28–100)	86 (43–121)
Validation Set (HBV)	0	15	7	8	23.2 (21.2–24.0)	35 (28–40)	40 (23–67)	65 (33–100)	173 (152–193)
	1	21	14	6	22.5 (21.3–24.8)	31 (26–45)	67 (36–128)	98 (77–183)	193 (174–221)
	2	12	7	5	22.4 (21.5–23.3)	39 (34–43)	50 (40–97)	76 (52–357)	161 (145–178)
	3	11	8	3	21.5 (20.5–22.8)	40 (31–49)	35 (33–53)	40 (31–95)	108 (93–118)
	4	27	18	9	22.4 (20.1–23.9)	45 (37–56)	43 (32–70)	35 (28–82)	74 (40–98)

Continuous variables are displayed as median (25%–75% quantiles). Num: number; M: male; F: female.

Fig. 3

Classification performances of GB and FIB-4 on the discovery set and the HBV validation set. (A) ROC curves of GB and FIB-4 in advanced HF detection (left-panel) and cirrhosis detection (right-panel). (B) Specificity, sensitivity and their 95% CIs of GB and FIB-4 scores in advanced HF detection (left-panel) and cirrhosis detection (right-panel). We selected the best GB cutoff based on the Youden index for the discovery set and two commonly applied FIB-4 cutoffs (1.45 and 3.25).

To verify whether the classification performances could be further improved by introducing extra clinical parameters, ALB and GGT, which were reported in previous studies [21, 34], we additionally rebuilt GB models based on six predictors (Age, AST, ALT, PLT, ALB, and GGT) and LR models based on FIB-4, ALB and GGT. Compared with our original GB models, the new GB models slightly improved the AUROC (0.929 and 0.974 for S0–2 vs. S3–4 and S0–3 vs. S4, respectively, Fig. S2 upper panel) in the discovery set, and showed similar results in the HBV validation set-1 (0.91 and 0.874 for S0–2 vs. S3–4 and S0–3 vs. S4, respectively, Fig. S2 lower panel). When we compared the new FIB-4 models (FIB-4, ALB, and GGT) to the original FIB-4 score, we found that although the new FIB-4 models displayed similar AUROC in the discovery set (0.818 and 0.842 for S0–2 vs. S3–4 and S0–3 vs. S4, respectively, Fig. S3 upper panel), they showed much lower AUROC in the HBV validation set-1 (0.738 and 0.757 for S0–2 vs. S3–4 and S0–3 vs. S4, respectively, Fig. S3 lower panel). Thus, we did not include these two parameters in the following analyses.

Model prediction on HCV cohorts

To investigate potential differences in GB risk score and FIB-4 score between HBV-related and HCV-related CLD cohorts, we applied our prediction models on two independent HCV validation data sets. We found significant differences in both FIB-4 and GB scores between groups staged S0–2 and S3–4, with more significant differences in GB scores than FIB-4 scores in both HBV (p < 2.2e-16 vs. p = 1.8e-12 in the discovery set and p = 7.8e-14 vs. p = 2.5e-6 in the validation set-1) and HCV cohorts (p = 1.4e-10 vs. p = 3.6e-8 in the validation set-2 and p = 2.2e-9 vs. p = 7.5e-5 in the validation set-3) (Fig. 4). Similar results were observed when discriminating cirrhosis (S4) from HF (S0–3) (Fig. S4). The GB and FIB-4 scores in HCV-related cohorts performed with AUROC = 0.797 and 0.816, respectively, in validation set-2 and AUROC = 0.849 and 0.795, respectively, in validation set-3. Thus, classification performance with the GB model was improved relative to FIB-4 in validation set-3 (HCV), but not in validation set-2 (HCV) (Fig. S5A).

Fig. 4

FIB-4 and GB scores for four independent cohorts between S0–2 and S3–4. P-values were calculated using Student's t-tests.

When HCV-infected cohorts were compared to HBV-infected cohorts, higher mean FIB-4 scores and GB scores were noted in both the S0–2 and S0–3 groups (Fig. 4 and Fig. S4), a finding which suggested that cutoff values built on one cohort might not be ideal for other cohorts with different etiological causes for CLD. Measurement of the FIB-4 score in the HCV cohorts for staging advanced HF revealed that a cutoff value of 3.25 resulted in a better specificity and sensitivity trade-off point relative to a cutoff value of 1.45, which resulted in more false positive findings due to low specificity (Fig. S5B). In contrast, a FIB-4 cutoff value of 1.45 exhibited more balance between sensitivity and specificity values in the HBV cohorts while a cutoff value of 3.25 yielded false negative findings due to low sensitivity (Fig. 3B). Correspondingly, we assessed the sensitivity and specificity of a GB cutoff value of −0.93 (which produced optimal results on the HBV discovery set) for HCV in differentiating early and advanced HF. Applying this cutoff value to the validation-2 and 3 HCV datasets produced a classification performance associated with higher sensitivity, but at the expense of much lower specificity (Fig. S5B). A new GB cutoff value of −0.14 was calculated based on Youden's index on the ROC curve of the validation-2 (HCV) dataset. This higher cutoff value yielded improvements on the point and interval estimations of specificity in both HCV cohorts while maintaining a reasonable balance between specificity and sensitivity for discriminating S0–2 vs. S3–4. We did not assess cirrhosis detection performances on these two HCV datasets due to the small number of cirrhosis samples. Table S2 includes all point and interval estimations of AUROC, specificity, and sensitivity of both the FIB-4 and GB scores. 
To encourage further study of the clinical application of HF staging using cutting-edge machine learning approaches, we packaged our trained GB models into a free, accessible web-tool (LiveBoost: https://metabolomics.cc.hawaii.edu/software/LiveBoost/). Fig. 5 shows the GUI of the web-tool. To use this web-tool, one can simply input values for the four clinical indicators along with the disease etiology followed by a click of the “Predict” button. Two gauge plots will be generated at the right panel of the interface showing the GB risk score and the FIB-4 score for this subject. Corresponding descriptions with calculated disease probabilities will appear below the plots. The next step for prediction of liver cirrhosis is provided in the “Model-2” tab. One clicks the “Reset” button to bring values back to their default settings.
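Since the GB risk scores are reported on the log-odds scale and the text describes different cutoffs for HBV and HCV cohorts, the following Python sketch shows how a log-odds score maps to a probability via the logistic function and how an etiology-specific cutoff could be applied. The cutoff values are those quoted in the text for S0–2 vs. S3–4; the lookup table, the function names, and the rule that a score at or above the cutoff counts as positive are assumptions of this sketch and do not describe how LiveBoost is actually implemented.

```python
import math

# Log-odds cutoffs quoted in the text for early vs. advanced HF.
# Organising them in a per-etiology table is an assumption of this sketch.
GB_CUTOFFS = {"HBV": -0.93, "HCV": -0.14}


def log_odds_to_probability(log_odds):
    """Logistic transform: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def is_predicted_advanced_hf(gb_log_odds, etiology):
    """Return True if the score reaches the cutoff for the given etiology."""
    if etiology not in GB_CUTOFFS:
        raise KeyError(f"no cutoff defined for etiology {etiology!r}")
    return gb_log_odds >= GB_CUTOFFS[etiology]


# A hypothetical score of -0.5 falls between the two cutoffs, so the same
# score is classified differently depending on the etiology.
example_hbv = is_predicted_advanced_hf(-0.5, "HBV")
example_hcv = is_predicted_advanced_hf(-0.5, "HCV")
```

On the probability scale, the two cutoffs correspond to roughly 0.28 (HBV) and 0.47 (HCV), which gives a more intuitive sense of how much higher the HCV threshold is.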

Fig. 5

A screenshot of the web-tool (LiveBoost).

Discussion

An affordable, reproducible, objective and non-invasive method for predicting the severity of HF is needed to support longitudinal surveillance and clinical decision making. In this study, we aimed to reconstruct the current FIB-4 scoring system to improve the sensitivity and specificity of classifications between early and advanced HF and for the detection of cirrhosis. To our knowledge, this is the first time that machine learning algorithms have been employed to improve the staging of CLD by building on an existing clinical scoring system. Furthermore, this algorithm has been implemented into a user-friendly web-tool to support further independent explorations of its clinical utility. We compared the original FIB-4 score with three machine learning methods: DT, RF, and GB. The results showed that GB outperformed FIB-4 and other methods regarding AUPR and AUROC. Applying GB to an independent HBV dataset, we observed consistently superior performance to FIB-4 scoring. On two independent HCV validation sets, the trained GB model also showed good classification performance with more significant group-differences compared to FIB-4 scoring (Fig. 4, Fig. S4). Although the GB model produced similar AUROC values to the FIB-4 scores in validation set-2 (HCV) (Fig. S5), this might be due to group imbalance and the lack of an S0 group in this dataset. The GB model was associated with narrower CIs of specificity and sensitivity, supporting its potential for robust classification. Also, GB showed better classification performance than FIB-4 in validation set-3 (HCV) (Fig. S5). In this validation set, each patient underwent serial liver biopsies before and after antiviral therapy along with the corresponding blood tests. To avoid potential confounding factors caused by the therapy, we performed additional validation on the pre-treatment data and achieved consistent results with our previous analyses (Fig. S6). 
In addition to FIB-4 parameters, other clinical indicators (e.g., ALB, GGT) were discovered as significant diagnostic predictors of liver disease [21, 34]. To assess whether ALB and GGT values could augment the performance of GB models and FIB-4 for the staging of CLD, we rebuilt GB and FIB-4 models with these two indicators added to the existing panel. The inclusion of ALB and GGT values did not produce significant improvements over our current models. Notably, the new FIB-4 LR models displayed even worse classification performances than the original FIB-4 score. These results might suggest a potential problem of overfitting in LR models when we include redundant predictors, while GB models did not particularly suffer from overfitting issues. Moreover, high-order non-linear interactions between predictors may be better captured by innovative machine learning approaches [15], which might explain why the GB model outperformed FIB-4 for classifying CLD for the HBV cohorts. Thus, it will be worth trying this approach with other clinical predictors such as the AST/platelet ratio index (APRI) [35], the AST/ALT ratio [36], the AST/ALT ratio/platelet ratio index (AARPRI) [37], the age-platelet index [38] or the FibroScan score [39]. In addition to conventional clinical indicators, serum metabolomics could also serve as a potential source of biomarkers for assessing CLD. The liver is the principal organ for lipid metabolism in the manufacture of fatty acids from excess acetyl-CoA along with transportation and storage of lipid metabolites [40]. Bile acids are originally synthesized by liver cells and fibrosis-related changes in their enterohepatic circulation may be reflected as serum biomarkers [41]. Additionally, the liver performs a significant role in the degradation of amino acids [42]. 
CLD with progressive HF should therefore lead to alterations of various metabolites and indeed, previous studies demonstrated that changes in levels of amino acids [43], free fatty acids [44], and bile acids [45, 46] were highly correlated with progression of liver disease. Thus, metabolic alterations might provide information complementary to existing clinical indicators for HF staging, making it worth exploring whether including metabolic markers in our CLD prediction models will improve classification performance, especially for the stratification of early fibrosis.
A potential caveat worth discussing is the performance of different cutoff values for FIB-4 and GB scores in CLD cohorts of different etiologies (i.e., HBV and HCV cohorts). Compared to HBV cohorts, there was a trend toward higher FIB-4 and GB scores in early-stage fibrosis of HCV patients (Fig. 4), which prompted us to apply a higher cutoff value to achieve more reasonable classification performance. Age, one of the parameters used to calculate the FIB-4 score, was also on average higher in HCV patients than in HBV patients in our study (Fig. S7A). HCV infections, which are commonly acquired later in life than HBV infections, likely produce differences in age at exposure and in age-related prevalence between HBV- and HCV-induced liver injuries [47, 48]. Thus, etiologic and epidemiologic differences between HBV and HCV patients may both contribute to differences in the optimal FIB-4 and GB score cutoffs between these groups. AST and ALT were found at lower levels in the S0 and S4 groups than in the intermediate groups (Fig. S7B and C), which is consistent with previous studies [49]. PLT levels decreased with HF progression in both HBV and HCV cohorts (Fig. S7D). Thus, the natural progression of liver injuries induced by different hepatitis etiologies may be reflected in different cutoffs of FIB-4 and GB scores.
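To make the role of age concrete, the sketch below computes FIB-4 using the commonly published definition, age × AST / (PLT × √ALT), with age in years, AST and ALT in U/L and PLT in 10⁹/L. The two cutoffs (1.45 and 3.25) are the FIB-4 cutoffs discussed in this paper; the function names, the patient values and the at-or-above-cutoff convention are hypothetical and serve only as an illustration.

```python
import math

# FIB-4 cutoffs discussed in the paper.
FIB4_LOW_CUTOFF = 1.45
FIB4_HIGH_CUTOFF = 3.25


def fib4(age_years: float, ast_u_per_l: float,
         alt_u_per_l: float, platelets_1e9_per_l: float) -> float:
    """FIB-4 = (age * AST) / (PLT * sqrt(ALT)), standard published form."""
    if min(age_years, ast_u_per_l, alt_u_per_l, platelets_1e9_per_l) <= 0:
        raise ValueError("all inputs must be positive")
    return (age_years * ast_u_per_l) / (
        platelets_1e9_per_l * math.sqrt(alt_u_per_l)
    )


def at_or_above(score: float, cutoff: float) -> bool:
    """True when the score reaches the cutoff (>= is an assumption here)."""
    return score >= cutoff


# Two invented patients with identical laboratory values but different ages.
labs = dict(ast_u_per_l=40.0, alt_u_per_l=36.0, platelets_1e9_per_l=200.0)
younger = fib4(age_years=40.0, **labs)
older = fib4(age_years=60.0, **labs)

for name, score in (("age 40", younger), ("age 60", older)):
    print(
        f"{name}: FIB-4={score:.2f}, "
        f">=1.45: {at_or_above(score, FIB4_LOW_CUTOFF)}, "
        f">=3.25: {at_or_above(score, FIB4_HIGH_CUTOFF)}"
    )
```

With identical laboratory values, the older hypothetical patient crosses the 1.45 cutoff while the younger one does not, which illustrates how an older cohort can shift the score distribution independently of fibrosis stage.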
The first cutoff of FIB-4 (1.45) offered a better trade-off between specificity and sensitivity in HBV cohorts (Fig. 3), while the second cutoff (3.25) showed more consistent classification performance in HCV cohorts (Fig. S5). For GB scores, we found that the best cutoff (−0.93) for the HBV discovery set showed biased classifications in HCV cohorts (Fig. S5B). After changing the cutoff from −0.93 to −0.14, we observed better specificity without a drastic decrease in sensitivity (Fig. S5B). Thus, different etiologies of CLD may need to be directly factored into models, or etiology-specific cutoff values should be considered. However, in this study, the HCV datasets suffered from small sample sizes and a limited number of cirrhosis subjects. In the future, we propose re-training new machine learning models using GB with larger HCV cohorts to achieve better staging performance.
Certain limitations of this study and its results need to be discussed. First, the training sample size remains limited, and the individual cohorts were drawn only from Asian ethnic groups. We are planning to collect more samples from multiple sites in the future, which will be necessary to further establish the robustness of these predictive models. Second, we trained the GB models on HBV cohorts and recognize the need to recruit additional cohorts to examine models trained on groups affected by specific etiologies, such as HCV patients. Based on the differences in model performance that we have preliminarily observed between HBV- and HCV-related CLD patients, we believe it is possible to further refine the models by incorporating etiology-related parameters. Third, limitations in clinical data infrastructure and in mechanisms to support data sharing for biomedical research currently pose a severe bottleneck in validating cutting-edge machine learning techniques [50, 51].
We have implemented our GB prediction model as an online tool to support its dissemination for independent testing in other cohorts. We hope that independent researchers can share their results and data to help expedite this and other potential clinical applications of machine learning.
In conclusion, we employed a cutting-edge machine learning algorithm (GB) to reconstruct a well-studied clinical scoring system (FIB-4) for better detection of advanced HF and cirrhosis in HBV cohorts with CLD. We validated the prediction capacity of our models in multiple independent groups of HBV and HCV patients. Because of the etiological differences between HBV and HCV, we propose that different cutoff values for GB and FIB-4 scoring should be applied. Etiology-specific machine learning models could be trained on larger HCV cohorts in future studies. The idea of using machine learning to reconstruct existing clinical scoring systems could be applied to other indicators in other disease cohorts.
