Masaya Sato, Kentaro Morimoto, Shigeki Kajihara, Ryosuke Tateishi, Shuichiro Shiina, Kazuhiko Koike, Yutaka Yatomi.
Abstract
Because cancer is multifactorial in nature, predicting its presence using a single biomarker is difficult. We aimed to establish a novel machine-learning model for predicting hepatocellular carcinoma (HCC) using real-world data obtained during clinical practice. To establish a predictive model, we built a machine-learning framework that developed optimized classifiers and their respective hyperparameters, depending on the nature of the data, using a grid-search method. We applied the framework to 539 patients with HCC and 1043 patients without HCC to develop a predictive model for the diagnosis of HCC. Using the optimal hyperparameters, gradient boosting provided the highest predictive accuracy for the presence of HCC (87.34%) and produced an area under the curve (AUC) of 0.940. Using cut-offs of 200 ng/mL for AFP, 40 mAU/mL for DCP, and 15% for AFP-L3, the accuracies of AFP, DCP, and AFP-L3 for predicting HCC were 70.67% (AUC, 0.766), 74.91% (AUC, 0.644), and 71.05% (AUC, 0.683), respectively. A novel predictive model using a machine-learning approach reduced the misclassification rate by about half compared with a single tumor marker. The framework used in the current study can be applied to various kinds of data and could thus become a translational mechanism between academic research and clinical practice.
Year: 2019 PMID: 31147560 PMCID: PMC6543030 DOI: 10.1038/s41598-019-44022-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The concept of the graphical user interface machine-learning framework. Comma-separated values (CSV) dataset files with a labeled variable were dragged and dropped onto a dashboard, and the framework automatically implemented supervised learning and developed optimized classifiers and their respective hyperparameters.
Classifiers and their respective hyperparameters and R packages used.
| Classifiers | Hyperparameters | R packages |
|---|---|---|
| Logistic regression model | — | stats |
| L1 penalized logistic regression model | lambda* | glmnet |
| L2 penalized logistic regression model | lambda* | glmnet |
| Elastic net penalized Logistic regression model | alpha†, lambda* | glmnet |
| RBF Support vector machine | C‡, sigma§ | kernlab |
| Gradient Boosting | eta\|\|, gamma¶, max_depth**, min_child_weight††, max_delta_step‡‡, subsample§§, colsample_bytree\|\|\|\| | xgboost |
| Random Forest | ntree¶¶, mtry*** | randomForest |
| Neural Network | size†††, decay‡‡‡ | nnet |
| Deep Learning\|\|\|\|\|\| | epochs¶¶¶, batch_size****, optimizer†††† | keras, tensorflow |
*Scalar value, specifying the relative importance of the regularization function.
†Elastic-net mixing parameter between the L2 (ridge, alpha = 0) and L1 (lasso, alpha = 1) penalties.
‡A parameter for the soft margin cost function, which specifies the allowance of a misclassification penalty for stability.
§A parameter to specify the complexity of the separation margin.
||A learning rate or step size shrinkage used in an update to prevent overfitting.
¶Minimal loss reduction required to make a further partition on a leaf node of the tree.
**Maximum depth of tree to control over-fitting; increasing this value makes the model more complex.
††Minimum sum of instance weight needed in a child node.
‡‡Maximum delta step allowed for each tree’s estimation.
§§Subsample ratio of training instance.
||||Subsample ratio of columns when constructing each tree.
¶¶Total number of trees included in the forest model.
***Number of features used in the construction of each tree.
†††Number of units in hidden layer (number of nodes in each hidden layer was set as 1).
‡‡‡A regularization parameter to avoid over-fitting.
||||||Fully connected neural network with 4 layers of neurons (16-64-64-2).
¶¶¶A single training iteration over the entire training data.
****Number of training samples processed at an iteration.
††††A device to adjust the deep learning model for optimal execution.
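The grid-search idea behind the table above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's implementation: the candidate values in `param_grid` are placeholders rather than the grid used in the study, and `toy_development_score` is a hypothetical stand-in for training a classifier and scoring it on a development split.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Three of the xgboost hyperparameter names listed in the table.
# The candidate values are placeholders chosen for illustration only.
param_grid = {
    "eta": [0.01, 0.1, 0.3],
    "max_depth": [2, 4, 6],
    "subsample": [0.5, 1.0],
}


def toy_development_score(params):
    # Hypothetical scoring function: a real one would fit a model with
    # these hyperparameters and return its development-set accuracy.
    return (
        -abs(params["eta"] - 0.1)
        - abs(params["max_depth"] - 4)
        - abs(params["subsample"] - 1.0)
    )


best_params, best_score = grid_search(param_grid, toy_development_score)
print(best_params, best_score)
```

The search cost grows with the product of the list lengths (18 combinations here), which is why the number of candidate values per hyperparameter is usually kept small.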
Patient characteristics (n = 1582).
| Parameters | HCC patients (n = 539) | non-HCC patients (n = 1043) | P value |
|---|---|---|---|
| Sex, n (%) | | | <0.001 |
| Female | 167 (31.0) | 483 (46.3) | |
| Male | 372 (69.0) | 560 (53.7) | |
| Age (years) | 68 (63–74) | 57 (48–66) | <0.001 |
| HCV antibody | | | <0.001 |
| Positive | 382 (71.0) | 630 (60.4) | |
| Negative | 157 (29.0) | 413 (39.6) | |
| HBs antigen | | | <0.001 |
| Positive | 78 (14.5) | 254 (24.4) | |
| Negative | 461 (85.5) | 789 (75.6) | |
| AFP (ng/mL) | 21 (7.8–91) | 5.0 (3.0–10) | <0.001 |
| AFP-L3 (%) | 0.5 (0.0–92) | 0.0 (0.0–0.5) | <0.001 |
| DCP (mAU/mL) | 24 (16–74) | 16 (12–20) | <0.001 |
| AST (U/L) | 53 (38–77) | 47 (29–77) | <0.001 |
| ALT (U/L) | 47 (29–74) | 56 (30–95) | <0.001 |
| Platelet count (×10⁴/μL) | 11.0 (7.9–15.7) | 16.8 (12.0–22.1) | <0.001 |
| GGT (IU/L) | 55 (36–97) | 49 (25–94) | <0.001 |
| ALP (IU/L) | 251 (193–323) | 195 (155–250) | <0.001 |
| Albumin (g/dL) | 3.6 (3.2–4.0) | 4.1 (3.8–4.3) | <0.001 |
| TB (mg/dL) | 0.8 (0.6–1.1) | 0.8 (0.6–1.0) | <0.001 |
| Height (cm) | 161 (154–167) | 162 (155–168) | 0.11 |
| Body weight (kg) | 60.7 (53.0–68.0) | 60.0 (52.0–68.0) | 0.34 |
*Data were expressed as the median values (1st–3rd quartiles).
Predictive accuracy for HCC presence of each classifier.
| Classifier | Accuracy* (%) | Area under the curve |
|---|---|---|
| Logistic regression model | 79.74 | 0.866 |
| L1 penalized logistic regression model | 80.38 | 0.867 |
| L2 penalized logistic regression model | 81.64 | 0.884 |
| Elastic net penalized logistic regression model | 80.38 | 0.884 |
| Support vector machine (RBF kernel) | 81.65 | 0.870 |
| Gradient boosting | 87.34 | 0.940 |
| Random forest | 86.08 | 0.923 |
| Neural network | 84.18 | 0.908 |
| Deep learning | 83.54 | 0.884 |
*A training/development/test split was used to evaluate the model.
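The statement in the abstract that the model roughly halves the misclassification rate can be checked directly from the reported accuracies. The Python sketch below converts each accuracy to an error rate (100 minus accuracy) and compares gradient boosting with each single tumor marker. The numbers come from the abstract and the table above; the variable names are illustrative.

```python
# Accuracies (%) as reported: gradient boosting from the table above,
# single tumor markers from the abstract.
accuracy_percent = {
    "Gradient boosting": 87.34,
    "AFP (cut-off 200 ng/mL)": 70.67,
    "DCP (cut-off 40 mAU/mL)": 74.91,
    "AFP-L3 (cut-off 15%)": 71.05,
}

# Misclassification rate is the complement of accuracy.
error_percent = {name: 100.0 - acc for name, acc in accuracy_percent.items()}

model_error = error_percent["Gradient boosting"]
error_ratio = {
    name: model_error / err
    for name, err in error_percent.items()
    if name != "Gradient boosting"
}

for name, ratio in error_ratio.items():
    print(f"{name}: model error is {ratio:.2f} times the marker error")
```

With these figures the model's error rate is 12.66%, which is between roughly 0.43 and 0.50 times the error rate of each single marker, consistent with "about half".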
Figure 2. Receiver-operating characteristic curve for predicting the presence of HCC based on the optimal predictive model developed by our framework. The area under the curve for the prediction of HCC was 0.943.
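For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The Python sketch below computes it that way. The labels and scores are hypothetical toy values, not study data, and the function name `roc_auc` is an illustrative choice.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for 3 positive and 4 negative cases.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of cases, which is fine for a sketch; rank-based formulations give the same value more efficiently on large datasets.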
Figure 3. Mean decrease in the Gini impurity of the attributes as assigned using the optimized model. Patient age followed by three tumor markers were the most important variables for HCC prediction.
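The quantity behind this importance measure can be illustrated for a single tree split. The Python sketch below computes the Gini impurity of a node from its class counts and the impurity decrease produced by one split; averaging such decreases over all splits that use a given attribute yields a mean-decrease importance. The root counts are the HCC and non-HCC totals from the patient table, while the child counts are invented for illustration and do not come from the study. How the framework itself aggregates these values is not described here.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = sum(parent)
    weighted_children = (
        (sum(left) / n) * gini_impurity(left)
        + (sum(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Root node: [HCC, non-HCC] counts from the patient characteristics table.
root = [539, 1043]

# Hypothetical split (for example on age); these child counts are invented.
left = [100, 700]
right = [439, 343]

print(f"root impurity: {gini_impurity(root):.4f}")
print(f"decrease from split: {gini_decrease(root, left, right):.4f}")
```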