Rui-Zhao Dong1, Xuan Yang1,2, Xin-Yu Zhang1,2, Ping-Ting Gao1,2, Ai-Wu Ke1,2, Hui-Chuan Sun1,2, Jian Zhou1,2,3, Jia Fan1,2,3, Jia-Bin Cai1,2, Guo-Ming Shi1,2.
Abstract
Hepatocellular carcinoma (HCC) is closely associated with abnormal DNA methylation. In this study, we analyzed 450K methylation chip data from 377 HCC samples and 50 adjacent normal samples in the TCGA database. We screened 47,099 differentially methylated sites using Cox regression as well as the SVM-RFE and FW-SVM algorithms, and constructed a model that uses 134 methylation sites to assign patients to one of three risk categories for overall survival. The model achieved a 10-fold cross-validation score of 0.95 and correctly classified 26 of 33 samples in a testing set obtained by stratified sampling from the high, intermediate and low risk groups.
Keywords: DNA methylation; hepatocellular carcinoma; machine learning
Year: 2019 PMID: 30784182 PMCID: PMC6484308 DOI: 10.1111/jcmm.14231
Source DB: PubMed Journal: J Cell Mol Med ISSN: 1582-1838 Impact factor: 5.310
Figure 1. Schematic of the study method. Raw DNA methylation data for 377 HCC samples and 50 adjacent normal tissue samples, generated on the Illumina Human Methylation 450 (450K) BeadChip, were downloaded from the TCGA database. Using the ChAMP tool in R, 47 099 sites that were differentially methylated between HCC tissue and adjacent normal tissue were identified. Cox regression was then used to assess the potential correlation between OS and each of these CpG sites, and the 2785 sites significantly related to OS (P < 0.05) were retained. Next, the SVM was used as the classifier in the SVM‐RFE algorithm, which ranks features (in our case, methylation sites) from most to least relevant for the training objective by iteratively removing the least relevant features. The best 243 sites were selected based on the 10‐fold cross‐validation score obtained for each number of retained features. The forward‐SVM (FW‐SVM) method was then used to screen feature subsets emerging from the SVM‐RFE analysis. In this process (shown in the right half of the figure), a model is constructed for each single feature and the one with the highest cross‐validation score is selected; this feature is then combined with each of the others to construct two‐feature models, the best of which is again selected based on the cross‐validation score. The process is iterated to build up multi‐feature models. Finally, we built a predictive model containing the best 134 features and evaluated it on the testing dataset. Of 33 cases, 26 were correctly classified (26/33 = 79%).
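The forward selection loop described in this caption can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the use of scikit-learn, the linear-kernel `SVC`, the helper name `forward_select`, and the synthetic matrix standing in for methylation values are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def forward_select(X, y, max_features, cv=10):
    """Greedy forward selection scored by mean cross-validated SVM accuracy.

    Hypothetical helper: at each round, every remaining feature is tried
    together with the already selected ones, and the best one is kept.
    """
    selected, history = [], []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        scores = {
            j: cross_val_score(
                SVC(kernel="linear"), X[:, selected + [j]], y, cv=cv
            ).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        history.append(scores[best])
    return selected, history


# Synthetic stand-in for a (samples x sites) matrix with three risk classes.
X, y = make_classification(
    n_samples=120, n_features=12, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
selected, history = forward_select(X, y, max_features=5)

# Keep the subset size whose cross-validation score is highest.
best_k = int(np.argmax(history)) + 1
print("selected features:", selected[:best_k])
print("cross-validation score:", round(history[best_k - 1], 3))
```

The greedy search fits one model per remaining candidate in every round, so its cost grows roughly quadratically with the number of candidate features, which is one reason to narrow the candidates with a cheaper ranking step first.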
Figure 2. (A) Using ChAMP, we identified 47 099 differentially methylated sites in the sample of 377 HCC samples and 50 adjacent normal tissues. (B) Results of applying the SVM‐RFE algorithm to the 2785 methylation sites significantly associated with overall survival based on Cox regression. The best 243 sites were selected based on the 10‐fold cross‐validation score obtained for each number of retained features; the corresponding 10‐fold cross‐validation score was 0.50. (C) Results of applying the FW‐SVM algorithm to the 243 methylation sites obtained with the SVM‐RFE method. The final predictive model contained the best 134 features and gave a mean 10‐fold cross‐validation score of 0.95.
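Recursive feature elimination with a cross-validated choice of subset size, as in panel (B), can be illustrated with scikit-learn's `RFECV`. The source does not state which implementation was used, so the library, the linear kernel, `step=1`, and the synthetic data below are assumptions for this sketch only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Synthetic stand-in for the OS-associated methylation sites (three classes).
X, y = make_classification(
    n_samples=120, n_features=20, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# A linear SVM exposes coef_, which RFE uses to rank features.
# One feature is dropped per iteration; each subset size is scored
# by 10-fold cross-validated accuracy.
selector = RFECV(SVC(kernel="linear"), step=1, cv=10, scoring="accuracy")
selector.fit(X, y)

print("features kept:", selector.n_features_)
print("ranking (1 = kept):", selector.ranking_)
```

A linear kernel is used here because RFE needs per-feature weights to decide what to eliminate; a non-linear kernel would not provide `coef_`.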
Stratified sampling of patients based on overall survival after surgery
| Risk group | Patients in training set (n) | Patients in testing set (n) |
|---|---|---|
| High | 46 | 12 |
| Intermediate | 51 | 13 |
| Low | 33 | 8 |
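A stratified split of this kind can be sketched in plain Python. The group totals (58, 64 and 41) are the sums of the training and testing counts in the table; the 20% testing fraction and the helper name `stratified_split` are assumptions, chosen because rounding 20% of each group reproduces the testing counts shown. The source does not state the fraction actually used.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction, seed=0):
    """Split sample indices so each group contributes the same fraction to testing."""
    rng = random.Random(seed)
    by_group = {}
    for idx, label in enumerate(labels):
        by_group.setdefault(label, []).append(idx)

    train, test = [], []
    for indices in by_group.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Group sizes taken from the table (training + testing counts).
labels = ["high"] * 58 + ["intermediate"] * 64 + ["low"] * 41
train, test = stratified_split(labels, test_fraction=0.2)  # 0.2 is an assumption

test_counts = Counter(labels[i] for i in test)
train_counts = Counter(labels[i] for i in train)
print(dict(train_counts), dict(test_counts))
```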
Model validation
| Predicted/Actual | High risk | Intermediate risk | Low risk |
|---|---|---|---|
| High risk | 12 | 0 | 0 |
| Intermediate risk | 2 | 9 | 2 |
| Low risk | 1 | 2 | 5 |
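The reported 26 of 33 correct classifications can be checked directly from the matrix above. The short sketch below sums the diagonal and also computes the fraction of correct cases per row. The row totals (12, 13 and 8) equal the testing-set sizes of the three risk groups, which suggests that rows correspond to the actual groups; that reading is an inference from the numbers, not something the table header states.

```python
labels = ["High", "Intermediate", "Low"]

# Values copied from the model validation table, row by row.
matrix = [
    [12, 0, 0],
    [2, 9, 2],
    [1, 2, 5],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Fraction of each row that falls on the diagonal.
row_totals = [sum(row) for row in matrix]
per_row = {
    label: matrix[i][i] / row_totals[i] for i, label in enumerate(labels)
}

print(correct, total, round(accuracy, 3))  # 26 33 0.788
for label, value in per_row.items():
    print(label, round(value, 3))
```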