Noor Atika Azit, Shahnorbanun Sahran, Voon Meng Leow, Manisekar Subramaniam, Suryati Mokhtar, Azmawati Mohammed Nawi.
Abstract
Background: Hepatocellular carcinoma (HCC) among type-2 diabetes (T2D) patients is an increasing burden on diabetes management. This study aims to develop machine learning (ML) classification models for predicting HCC in T2D patients, and to select the best one, to support early HCC detection.
Keywords: Diabetes; Hepatocellular carcinoma; Machine learning; Risk prediction; Support vector machine
Year: 2022 PMID: 36203910 PMCID: PMC9529545 DOI: 10.1016/j.heliyon.2022.e10772
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Figure 1. SPSS Modeler stream. The dataset file contains all the variables. During data processing, a Type node was used to select the variables and assign the appropriate categories. A Data Audit node was used to visualise the distribution and validity of each selected variable. A SetToFlag node was used for feature engineering, converting nominal variables into "yes or no" flag variables. The transformed data were re-analysed using the Data Audit node.
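The SetToFlag step described in the caption can be illustrated outside SPSS Modeler. The following is a minimal sketch in plain Python, not the authors' stream: the function name `set_to_flag`, the record layout, and the patient ids are assumptions made for illustration, and only the race categories come from the study.

```python
def set_to_flag(records, field):
    """Expand one nominal field into a yes/no flag per category.

    Illustrative stand-in for a SetToFlag-style transformation;
    it does not reproduce SPSS Modeler's own implementation.
    """
    # Collect the distinct categories observed for this field.
    categories = sorted({row[field] for row in records})
    flagged = []
    for row in records:
        # Keep every other field unchanged and drop the nominal one.
        new_row = {k: v for k, v in row.items() if k != field}
        for cat in categories:
            new_row[f"{field}_{cat}"] = "yes" if row[field] == cat else "no"
        flagged.append(new_row)
    return flagged


# Hypothetical records; ids are made up for the example.
patients = [
    {"id": 1, "race": "Malay"},
    {"id": 2, "race": "Chinese"},
    {"id": 3, "race": "Indian"},
]
flags = set_to_flag(patients, "race")
print(flags[0])
```

Each nominal value becomes its own flag column, so every model input ends up with the same yes/no measurement level.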
The build setting parameters for LR, ANN, SVM, CHAID, and ensembled models.
| Model | Build setting |
|---|---|
| LR | Multinomial method: Enter |
| | Singularity tolerance: 1.0E-8 |
| | Maximum iterations: 20 |
| | Maximum step-halving: 5 |
| | Log-likelihood convergence: 1.0E-1 |
| | Parameter convergence: 1.0E-6 |
| | Delta: 0.0 |
| | Confidence interval: 95.0 |
| ANN | Neural network model: Multilayer perceptron (MLP) |
| SVM | Stopping criteria: 1.0E-3 |
| CHAID | Levels below root: 5 |
| ENSEMBLE | Ensemble method: Confidence-weighted voting |
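The idea behind confidence-weighted voting is that each base model votes for its predicted class with a weight equal to its confidence, and the class with the largest summed weight wins. The sketch below shows that general idea only. The function name, the example confidences, and the tie handling are assumptions, and the exact way SPSS Modeler computes and combines confidences is not described here.

```python
from collections import defaultdict


def confidence_weighted_vote(predictions):
    """Combine (label, confidence) pairs, one per base model.

    Returns the winning label and the summed confidence per label.
    Ties are resolved by first-seen label order (an assumption).
    """
    totals = defaultdict(float)
    for label, confidence in predictions:
        totals[label] += confidence
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


# Hypothetical outputs of three base models (e.g. LR, CHAID, SVM)
# for a single patient; the numbers are invented for illustration.
votes = [("yes", 0.62), ("no", 0.90), ("yes", 0.55)]
label, totals = confidence_weighted_vote(votes)
print(label, totals)
```

Here two moderately confident "yes" votes outweigh one highly confident "no" vote because their summed confidence is larger.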
Figure 2. Characteristics of the target and input variables included in the models. HCC status is the target variable; the other 12 are input variables. All use the flag (yes/no) measurement. Red in the graphs indicates the proportion of each variable with HCC = yes (1). No variable had missing values. There was no significant difference between the training and testing sets (p-value > 0.05).
Figure 3. Predictor importance, showing the relative contribution of each variable to each model: a) LR: all input variables were included in the model, with viral hepatitis contributing the most; b) ANN: viral hepatitis contributed the most and ALP the least; c) SVM: all variables were included, with viral hepatitis contributing the most; d) CHAID: only six of the 12 input variables were selected in the final model, with viral hepatitis contributing the most.
Summary of the machine learning performance of the classification models.
| Models | Dataset | N | TP | TN | FP | FN | Accuracy (%) | Standard deviation | C. error (%) | AUC | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSEMBLED (LR, CHAID, SVM) | training | 248 | 103 | 110 | 17 | 18 | 85.9 | (±0.74) | 14.1 | 0.919 | 85.1 | 86.6 | 85.8 | 85.9 |
| | testing | 176 | 76 | 75 | 10 | 15 | 85.8 | (±1.19) | 14.2 | 0.917 | 83.5 | 88.2 | 88.4 | 83.3 |
| SVM | training | 248 | 104 | 112 | 15 | 17 | 87.1 | (±0.39) | 12.9 | 0.926 | 86.0 | 88.2 | 87.4 | 86.8 |
| | testing | 176 | 75 | 75 | 10 | 16 | 85.2 | (±0.78) | 14.8 | 0.914 | 82.4 | 88.2 | 88.2 | 82.4 |
| LR | training | 248 | 101 | 108 | 19 | 20 | 84.3 | (±1.03) | 15.7 | 0.909 | 83.5 | 85.0 | 84.2 | 84.4 |
| | testing | 176 | 76 | 73 | 12 | 15 | 84.7 | (±1.51) | 15.3 | 0.925 | 83.5 | 85.9 | 86.4 | 83.0 |
| ANN | training | 248 | 100 | 108 | 19 | 21 | 83.9 | (±0.81) | 16.1 | 0.915 | 82.6 | 85.0 | 84.0 | 83.7 |
| | testing | 176 | 75 | 72 | 13 | 16 | 83.5 | (±1.31) | 16.5 | 0.905 | 82.4 | 84.7 | 85.2 | 81.8 |
| CHAID | training | 248 | 97 | 108 | 19 | 24 | 82.7 | (±1.50) | 17.3 | 0.879 | 80.2 | 85.0 | 83.6 | 81.8 |
| | testing | 176 | 66 | 72 | 13 | 25 | 78.4 | (±1.96) | 21.6 | 0.862 | 72.5 | 84.7 | 83.5 | 74.2 |
Abbreviations: SVM = support vector machine, LR = logistic regression, ANN = artificial neural network, CHAID = chi-square automatic interaction detection, TP = true positive, TN = true negative, FP = false positive, FN = false negative, C. error = classification error, AUC = area under the ROC curve, PPV = positive predictive value, NPV = negative predictive value.
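The percentage columns in the performance table follow directly from the TP, TN, FP, and FN counts. As a minimal sketch (the function name and dictionary keys are illustrative assumptions, not part of the study's tooling), the standard confusion-matrix formulas can be applied to the SVM testing row:

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics, returned as percentages."""
    n = tp + tn + fp + fn
    return {
        "accuracy": 100 * (tp + tn) / n,
        "c_error": 100 * (fp + fn) / n,        # classification error
        "sensitivity": 100 * tp / (tp + fn),   # true positive rate
        "specificity": 100 * tn / (tn + fp),   # true negative rate
        "ppv": 100 * tp / (tp + fp),           # positive predictive value
        "npv": 100 * tn / (tn + fn),           # negative predictive value
    }


# Counts taken from the SVM testing row (N = 176).
svm_test = classification_metrics(tp=75, tn=75, fp=10, fn=16)
rounded = {name: round(value, 1) for name, value in svm_test.items()}
print(rounded)
```

The rounded values match the reported SVM testing figures (accuracy 85.2, C. error 14.8, sensitivity 82.4, specificity 88.2, PPV 88.2, NPV 82.4). AUC and the standard deviation cannot be recovered from the four counts alone, since they depend on the per-case scores and repeated runs.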
Figure 4. a) The web-based application, with an example of an Indian patient with no risk factors. b) The HCC risk estimation for an Indian patient with all the risk factors present.
The comparison of the predictive models for HCC in T2D.
| Authors | Country, race | Design | DM age-adjusted prev [ | Viral hepatitis incidence [ | Alcohol consumption [ | Sample Size (N), Sample Pop. | Model | Variable(s) | Performance | Strength | Limitation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Current study 2021 | Malaysia, Malay Indian Chinese | Case control | 16.7% | 1052.65/100 000 population | 0.9 L/person | (N = 424) Case- 212 T2DM with newly diagnosed HCC | SVM, ANN, LR, CHAID | Race, symptoms (weight loss, abdominal pain/discomfort), viral hepatitis, statin, alcohol consumption, Alkaline phosphatase (ALP), Alanine transaminase, fatty liver disease | Best model, SVM | use ML classification models to evaluate the best predictors | -retrospective study |
| Grecian et al. 2020 [ | Scotland, UK, | prospective cohort, 11 years follow up | 3.9% | 335.66/100 000 population | 11.4 L/person | (N = 1059) | the best prediction was the combination of USS screening and the fibrosis score | APRI | Best model: (APRI >0.5) | -Large sample size, | In a cohort with a moderately low cirrhosis/HCC existing risk scores did not reliably identify participants at high risk. |
| Chen et al. 2019 [ | China, | Case control | 9.2% | 3321.29/100 000 population | 7.2 L/person | Model 1 (N = 200): | LR | Gender, age, AST, direct bilirubin, GGT, triglyceride, total cholesterol, and hdl- cholesterol, uric acid, | Data coverage of 301 Hospitals | Missing data handling (impute with normal value) | |
| Li et al. 2018 [ | Taiwan, Chinese | retrospective cohort study | 6.3% | N/A | N/A | (N = 31723) T2DM patients | cox -proportional hazard regression models | age, gender, smoking, variation in hemoglobin, serum glutamic–pyruvic transaminase, liver cirrhosis, hepatitis B, hepatitis C, antidiabetic medications, antihyperlipidemic medications, and total/high-density lipoprotein cholesterol ratio | Validation set: | -a large population-based study with a long-term follow-up period, | - missing data may be a potential bias |
| Si et al. 2016 [ | Republic of Korea, | Retrospective cohort | 6.3% | 3832.50/100000 | 3.9 L/person | (N = 3544) | Cox proportional hazards model (DM-HCC risk score) | age >65 years, low triglyceride levels, | Validation set | Involved a large cohort of diabetic patients observed for a prolonged period of time. | Lack of anti-HBc data for most patients; high hepatitis B virus prevalence in Korea |
| Rau et al. 2016 [ | Taiwan, Not mentioned | matched case-control | 6.3% | 4927/100000 | N/A | 2060 (case 515, control 1545) | ANN and LR | sex, age, alcoholic cirrhosis, nonalcoholic cirrhosis, alcoholic hepatitis, viral hepatitis, other types of chronic hepatitis, alcoholic fatty liver disease, other types of fatty liver disease, and hyperlipidemia | The performance of the ANN was superior to that of LR, | web based application | did not use blood examinations as predictors |
Figure 5. Patients in the T2D clinic who undergo routine check-ups and blood investigations are assessed for HCC risk using the web-based HCC risk predictor. Patients predicted to have HCC should be referred for further assessment, including hepatobiliary imaging such as ultrasound. Those not predicted to have HCC are assessed again at the next routine blood investigation.