
DNA Methylation Markers for Pan-Cancer Prediction by Deep Learning.

Biao Liu1, Yulu Liu2, Xingxin Pan3, Mengyao Li4, Shuang Yang5, Shuai Cheng Li6.   

Abstract

For cancer diagnosis, many DNA methylation markers have been identified. However, few studies have tried to identify DNA methylation markers that diagnose diverse cancer types simultaneously, i.e., pan-cancers. In this study, we tried to identify DNA methylation markers that differentiate cancer samples from the respective normal samples across pan-cancers. We collected whole genome methylation data of 27 cancer types containing 10,140 cancer samples and 3386 normal samples, and divided all samples into five data sets: one training data set, one validation data set and three test data sets. We applied machine learning to identify DNA methylation markers, and specifically, we constructed diagnostic prediction models by deep learning. We identified two categories of markers: 12 CpG markers and 13 promoter markers. Three of the 12 CpG markers and four of the 13 promoter markers are located at cancer-related genes. With the CpG markers, our model achieved an average sensitivity and specificity on the test data sets of 92.8% and 90.1%, respectively. For promoter markers, the average sensitivity and specificity on the test data sets were 89.8% and 81.1%, respectively. Furthermore, in cell-free DNA methylation data of 163 prostate cancer samples, the CpG markers achieved a sensitivity of 100%, and the promoter markers achieved 92%. For both marker types, the specificity on normal whole blood was 100%. To conclude, we identified methylation markers to diagnose pan-cancers, which might be applied to liquid biopsy of cancers.

Keywords:  biomarker, methylation, pan-cancer, deep learning, CpG, promoter

Year:  2019        PMID: 31590287      PMCID: PMC6826785          DOI: 10.3390/genes10100778

Source DB:  PubMed          Journal:  Genes (Basel)        ISSN: 2073-4425            Impact factor:   4.096


1. Introduction

DNA methylation, an important epigenetic modification, is associated with gene silencing, and the primary methylated sequence in vertebrates is CpG [1,2]. CpG methylation located at a promoter silences promoter activity and is thus negatively correlated with gene expression [3,4]. Furthermore, promoter methylation plays a major role in cancers by suppressing transcription of some vital genes, such as tumor suppressor genes [5,6]. Since DNA methylation plays an important role in cancers, many studies have used methylated DNA sequences as biomarkers for cancer detection, including CpG markers and promoter markers. Specifically, irregular methylation in promoters of cancer-related genes could serve as a biomarker for early cancer diagnosis and prognosis [7]. For example, adenomatous polyposis coli (APC) promoter methylation could be a biomarker for early diagnosis of prostate cancer [8], and O6-methylguanine-DNA-methyltransferase (MGMT) promoter methylation might be a predictive biomarker for cancer prognosis [9]. For CpG markers, ten diagnostic markers and eight prognostic markers in circulating tumor DNA of hepatocellular carcinoma have been screened [10]. Although quite a few DNA methylation biomarkers have been identified, and some of them have even been commercialized [11], a common limitation is that these markers apply to only one or a few cancer types. Studying the similarities and differences among diverse cancer types is known as pan-cancer analysis, which has revealed that some different cancer types can have similar methylation patterns, and biomarkers that cross boundaries among diverse cancer types are expected to be identified [12,13]. Although some pan-cancer differentially methylated CpG sites have been identified [14,15], effective and practical pan-cancer methylation biomarkers remain to be identified.
In this study, we focused on identifying DNA methylation biomarkers, including CpG markers and promoter markers, for diagnosing pan-cancers. We collected the whole genome methylation data of 27 cancer types containing 10,140 cancer samples and 3386 normal samples from TCGA (The Cancer Genome Atlas) [12] and GEO (Gene Expression Omnibus) [16]. Then, we used machine learning to analyze and identify cancer-specific CpG markers and promoter markers. Specifically, we constructed diagnostic prediction models by deep learning. Finally, we identified 12 CpG markers and 13 promoter markers, which can be used to predict pan-cancers precisely.

2. Materials and Methods

2.1. Datasets

In total, we collected whole genome methylation data of 10,140 cancer samples and 3386 normal samples from TCGA and GEO. Specifically, methylation data of 4840 cancer samples and 1742 matched normal samples (matched normal sample: healthy tissue adjacent to the tumor from the same patient) were divided randomly into a training data set (named Training data set), a validation data set (named Validation data set), and a test data set (named Test data set 1) (Table S1). To make the markers more adaptive to virtual normal samples (virtual normal sample: healthy tissue from healthy, unrelated individuals), we added 727 cancer samples and 836 normal samples (most of them virtual normal samples) to the Training data set. Therefore, the Training data set contained 4827 cancer samples and 2176 normal samples (Tables S2 and S3). The Validation data set and Test data set 1 each contained 370 cancer samples and 201 matched normal samples from eight cancer types (Tables S1 and S2). The other two test data sets are named Test data set 2 and Test data set 3. Test data set 2 contained 3041 cancer samples and 268 matched normal samples from 15 cancer types (Table S4); Test data set 3 contained 1532 cancer samples and 540 virtual normal samples from five cancer types (Table S5). The details of all samples are summarized in Tables S2–S6.
The methylation data of each sample came from Illumina’s Infinium HumanMethylation450 BeadChip, which contains more than 450,000 methylation sites. We calculated the methylation beta value of a promoter as the average methylation beta value of all CpG sites located in the promoter of the same gene. Specifically, a promoter is defined as the region from 1500 bp upstream of the TSS (transcription start site) to 500 bp downstream of the TSS [17,18]. To make the data sets stricter, we removed CpG sites and promoters for which at least one sample had a missing value. Finally, 139,422 of 485,000 CpG sites and 15,316 of 24,062 promoters were left for the following analysis, so all samples shared 139,422 common CpG sites and 15,316 common promoters. The CpG-site data and the promoter data were analyzed in parallel in the following steps.
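The promoter averaging and the missing-value filter described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the function names, the dictionary-based data layout, the strand handling, and the choice to return `None` for a promoter with any missing probe are all assumptions made for the example.

```python
from statistics import mean

def promoter_beta(cpg_beta, cpg_pos, tss, strand="+", upstream=1500, downstream=500):
    """Average the beta values of the CpG probes inside one promoter window.

    cpg_beta: dict probe_id -> beta value (None marks a missing value)
    cpg_pos:  dict probe_id -> coordinate on the gene's chromosome
    Returns None if no probe falls in the window or a value is missing
    (an assumption; the text does not say how such promoters were scored).
    """
    if strand == "+":
        lo, hi = tss - upstream, tss + downstream
    else:  # on the minus strand, "upstream" lies at higher coordinates
        lo, hi = tss - downstream, tss + upstream
    values = [cpg_beta[p] for p, pos in cpg_pos.items() if lo <= pos <= hi]
    if not values or any(v is None for v in values):
        return None
    return mean(values)

def drop_incomplete_features(matrix):
    """Keep only features that have a value in every sample.

    matrix: dict feature_name -> list of per-sample values (None = missing)
    """
    return {
        name: values
        for name, values in matrix.items()
        if all(v is not None for v in values)
    }
```

Applying `drop_incomplete_features` to both the CpG matrix and the promoter matrix corresponds to the filtering step that reduced the data to the common CpG sites and promoters.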

2.2. Identifying Markers

To mine markers, we first used the ‘moderated t-statistics’ method [19] as a prescreening procedure to obtain the most differentially methylated sites. This method uses Empirical Bayes to shrink the variance and the Benjamini–Hochberg procedure [20] to adjust p values. We sorted all candidate markers by adjusted p value from low to high (a lower adjusted p value indicates stronger evidence of differential methylation between cancer samples and normal samples), and we took the top 2000 markers as the next candidate markers. Next, we used two strategies to obtain fewer markers. The first machine learning strategy is LASSO (least absolute shrinkage and selection operator) [21] under a binomial distribution. We randomly subsampled 75 percent of the samples each time and ran the LASSO procedure to identify markers with the biggest methylation beta value difference. After repeating the sampling 1000 times, we selected the markers that were chosen by LASSO at least 750 times. In this process, we did not choose the minimum lambda but the “1-se” lambda, which is the largest lambda whose cross-validated error is within one standard error of the minimum, to make the model simpler. In addition, the cross-validation measure we chose was ‘auc’ to make our model more robust. 10-fold cross-validation was applied each time. The second machine learning strategy is a random forest. The number of trees was 5000 for the first forest and 2000 for all additional forests. The algorithm we applied used OOB (out-of-bag) error as the minimization criterion and removed the least important variables from the random forest [22]. At each iteration, we set the fraction of variables dropped to 0.3. Four main R packages (‘limma’, ‘glmnet’, ‘doParallel’, and ‘varSelRF’) were used in R version 3.5.0 to conduct these three machine learning strategies.
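Two bookkeeping steps in this procedure, the Benjamini–Hochberg adjustment used for ranking and the "chosen in at least 750 of 1000 runs" rule, can be illustrated in plain Python. This is a hedged sketch, not the authors' R code: the model fitting itself (limma, glmnet) is not reproduced, and the function names and the list-of-lists input for the per-run selections are assumptions.

```python
from collections import Counter

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def top_candidates(names, p_values, k):
    """Rank features by adjusted p value (low to high) and keep the top k."""
    adjusted = benjamini_hochberg(p_values)
    ranked = sorted(zip(adjusted, names))  # ties are broken by name
    return [name for _, name in ranked[:k]]

def stable_features(selections, min_fraction=0.75):
    """Keep features selected in at least `min_fraction` of the runs.

    selections: one collection of selected feature names per subsampling run.
    """
    counts = Counter(f for run in selections for f in set(run))
    threshold = min_fraction * len(selections)
    return sorted(f for f, c in counts.items() if c >= threshold)
```

With 1000 subsampling runs and `min_fraction=0.75`, `stable_features` applies the 750-run cutoff described in the text.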

2.3. Constructing Diagnostic Prediction Models

To construct diagnostic prediction models, we built two multi-layer feedforward neural networks, each of which contained one input layer, multiple hidden layers and one output layer. The source code we used for prediction is publicly available at https://github.com/BiaoLiu2017/Cancer-methylation. The input layer received the input data matrix (containing only the marker sites), and the output layer had a single unit with a sigmoid activation function, while the activation function of the hidden units was ReLU. Every hidden layer had the same number of hidden units. The cost function of the neural network was the standard logistic regression cost function. The optimization algorithm was Adam, with an exponential decay rate of 0.9 for the first moment estimates and 0.999 for the second moment estimates. The learning rate followed an exponential decay schedule, meaning that the learning rate was multiplied by a decay rate after a specified number of epochs. To prevent overfitting, we applied batch normalization after the activation function of every hidden layer. Another strategy to prevent overfitting is early stopping: we chose a suitable training point at which to stop so that the model fit the Validation data set better. We conducted a random search [23] for hyper-parameter optimization. Table S7 shows the hyper-parameters we tuned while training the neural network. In other words, we randomly initialized these hyper-parameters within the ranges given in Table S7. We trained 1000 neural networks in parallel and finally chose the hyper-parameter combination that performed best on the Validation data set. The final hyper-parameter combination is the best scheme shown in Table S7. We used the best hyper-parameters in the final deep learning models and trained them on the Training data set.
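The architecture described above (ReLU hidden layers, batch normalization applied after each activation, a single sigmoid output unit, logistic cost, and a stepwise exponential learning-rate decay) can be sketched in plain Python. This is an illustrative forward pass only, not the authors' TensorFlow implementation: all function names are hypothetical, the batch normalization omits the learned scale and shift, and no training loop or Adam update is shown.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer: weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def batch_norm(batch, eps=1e-5):
    """Normalize each unit across the batch (no learned scale or shift)."""
    n = len(batch)
    out = [row[:] for row in batch]
    for j in range(len(batch[0])):
        column = [row[j] for row in batch]
        mu = sum(column) / n
        var = sum((v - mu) ** 2 for v in column) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) / math.sqrt(var + eps)
    return out

def forward(batch, hidden_layers, out_weights, out_bias):
    """Return one cancer probability per sample in the batch."""
    acts = batch
    for weights, biases in hidden_layers:
        acts = [[max(0.0, z) for z in dense(x, weights, biases)] for x in acts]
        acts = batch_norm(acts)  # applied after the activation, as in the text
    logits = [dense(x, [out_weights], [out_bias])[0] for x in acts]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def logistic_cost(probs, labels, eps=1e-12):
    """Mean binary cross-entropy (the logistic regression cost)."""
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels)) / len(labels)

def decayed_learning_rate(initial, decay_rate, epoch, decay_epochs):
    """Multiply the learning rate by decay_rate every decay_epochs epochs."""
    return initial * decay_rate ** (epoch // decay_epochs)
```

The hyper-parameters that the random search would sample (number of layers, units per layer, initial learning rate, decay rate) are exactly the arguments left open here; their actual ranges are given in Table S7 and are not reproduced.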
We used the Validation data set to judge whether the model was overfitting. After we had trained two neural network models whose performance was good enough on the Validation data set, we tested our diagnostic prediction models on Test data set 1. Furthermore, to evaluate the performance of our models without bias, we tested them on the other two test data sets: Test data set 2 and Test data set 3. It should be emphasized that each of the three test data sets was tested just once. The reason we divided the samples in this way was to evaluate whether our model could predict untrained cancers. Before being fed into the deep learning models, all data were standardized to fit a standard normal distribution, i.e., a mean of 0 and a standard deviation of 1. The deep neural network models were based on the deep learning framework Tensorflow-GPU version 1.4.0 [24]. Logistic regression was performed with scikit-learn version 0.19.1. We obtained SHAP (SHapley Additive exPlanation) values [25] with the package ‘shap’ to interpret model predictions. To evaluate the robustness of the selected markers, random sampling was carried out 100 times. Each time, 6000 samples were randomly selected from all samples. Each sampled data set was required to have the same ratio of cancer to normal samples as the original data set. Additionally, <30% sample overlap among all 100 data sets was required. Each sampled data set was divided into one training data set and one test data set with the same ratio of cancer to normal samples.
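The standardization and the ratio-preserving random sampling can be sketched as below. This is a minimal sketch under stated assumptions: the text does not say which data set supplied the mean and standard deviation, so fitting them on the training rows is an assumption, and the function names are hypothetical. The overlap constraint between the 100 sampled data sets is not implemented.

```python
import random
from statistics import mean, pstdev

def fit_standardizer(rows):
    """Compute per-feature mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard constant features
    return means, stds

def standardize(rows, means, stds):
    """Scale each feature to mean 0 and standard deviation 1."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def stratified_sample(labels, n, rng):
    """Pick about n sample indices, preserving the cancer/normal ratio."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    picked = []
    for indices in by_class.values():
        k = round(n * len(indices) / len(labels))
        picked.extend(rng.sample(indices, k))
    return sorted(picked)
```

For example, `stratified_sample(labels, 6000, random.Random(seed))` called with 100 different seeds would give 100 sampled data sets with the original class ratio; because of rounding, the sample size can differ from `n` by one or two.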

3. Results

3.1. Identifying Cancer-Specific Methylation Markers by Machine Learning

We used the Training data set to analyze and identify methylation markers by three machine learning methods. Figure 1 shows the procedure for identifying methylation markers. We organized the Training data set into two data matrices: a CpG methylation matrix and a promoter matrix. The CpG methylation matrix consisted of beta values of 139,422 CpG methylation sites, and the promoter matrix consisted of beta values of 15,316 promoters. These two data matrices were used to identify the CpG markers and the promoter markers. First, a prescreening procedure was conducted by ‘moderated t-statistics’ to identify candidate markers with the most differential methylation beta value between cancer samples and normal samples. After that, we took the top 2000 markers as the candidate markers, giving 2000 CpG markers and 2000 promoter markers. Next, we used two strategies in parallel to reduce the number of markers. One machine learning strategy was LASSO (least absolute shrinkage and selection operator) under a binomial distribution. After repeating the sampling 1000 times, we selected the markers that were chosen by LASSO at least 750 times. By LASSO we eventually obtained 63 CpG markers and 68 promoter markers. The other machine learning strategy was random forest, which gave 115 CpG markers and 57 promoter markers. We took the 12 CpG markers (Table 1) and 13 promoter markers (Table 2) shared by these two machine learning methods as the final markers. Among the 12 CpG markers, the reference genes of three markers are involved in cancer-related pathways: SOX14 (cg04374393 is located at the promoter of the SOX14 gene) is involved in molecular mechanisms of cancer; TP73 (reference gene of cg17804348) is involved in the p53 signaling pathway; SND1 (cg26642667 is located at the promoter of the SND1 gene) is involved in viral carcinogenesis. Among the 13 promoter markers, the associated genes of four markers are involved in cancer-related pathways: ACVRL1 is involved in the TGF-beta signaling pathway; AURKB is involved in regulation of TP53 activity; RHOT2 is involved in mitophagy; WT1 is involved in transcriptional misregulation in cancer.

Figure 1

Workflow chart of identifying markers by machine learning. (a) Workflow of CpG methylation data. (b) Workflow of promoter methylation data. CpG methylation data contained 139,422 CpG sites, and promoter methylation data contained 15,316 promoters. We utilized the Training data set containing 4827 cancer samples and 2176 normal samples to identify markers applying three machine learning strategies (Moderated t-statistics, LASSO, and Random-forest) and obtained 12 markers for the CpG methylation data, and 13 markers for the promoter methylation data. Then, we trained two deep learning models for CpG markers and promoter markers respectively.

Table 1

Characteristics of 12 CpG markers in the Training data set.

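| Markers | Ref Gene | Coefficients | SE | z Value | p Value |
|---|---|---|---|---|---|
| (Intercept) | | 4.28017 | 0.12365 | 34.614 | <0.001 |
| cg01397449 | EXOC3L1 | −1.26195 | 0.0828 | −15.241 | <0.001 |
| cg04374393 | SOX14 | 0.44095 | 0.10759 | 4.098 | <0.001 |
| cg06575035 | PCDHGA1 | 1.0089 | 0.09321 | 10.823 | <0.001 |
| cg07333191 | Chr4:13 | 0.5435 | 0.11389 | 4.772 | <0.001 |
| cg16389386 | Chr7:154 | −0.38554 | 0.06408 | −6.016 | <0.001 |
| cg16508627 | HS3ST2 | −0.54732 | 0.11407 | −4.798 | <0.001 |
| cg16926102 | Chr10:23 | 0.8946 | 0.11951 | 7.486 | <0.001 |
| cg17804348 | TP73 | 1.09724 | 0.06442 | 17.033 | <0.001 |
| cg19710323 | Chr12:34 | −0.8628 | 0.10259 | −8.41 | <0.001 |
| cg22620090 | Chr6:104 | 0.36339 | 0.07759 | 4.683 | <0.001 |
| cg26642667 | SND1 | −0.85746 | 0.04911 | −17.461 | <0.001 |
| cg26733975 | RP11–760D2.1 | −0.97163 | 0.10248 | −9.481 | <0.001 |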

Note: SE indicates standard errors of coefficients; z value indicates Wald z-statistic value.

Table 2

Characteristics of 13 promoter markers in the Training data set.

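| Markers | Coefficients | SE | z Value | p Value |
|---|---|---|---|---|
| (Intercept) | 2.6472 | 0.5316 | 4.979 | <0.001 |
| ACVRL1 | 5.5848 | 0.7523 | 7.423 | <0.001 |
| AURKB | −3.9969 | 1.2242 | −3.265 | 0.001 |
| GRASPOS | −1.0094 | 0.3599 | −2.805 | 0.005 |
| MC3R | −12.2853 | 0.858 | −14.319 | <0.001 |
| OR10H2 | −6.8254 | 0.6101 | −11.188 | <0.001 |
| OTX2-AS1 | 3.664 | 0.6136 | 5.972 | <0.001 |
| PCDHGA12 | 0.6188 | 0.5294 | 1.169 | 0.242 |
| PCDHGA5 | 1.8653 | 0.704 | 2.649 | 0.008 |
| PCDHGA6 | 1.0961 | 0.6552 | 1.673 | 0.094 |
| PHC3 | −12.865 | 0.9678 | −13.293 | <0.001 |
| RHOT2 | 11.3143 | 0.8959 | 12.628 | <0.001 |
| TOX2 | 3.039 | 0.8061 | 3.77 | <0.001 |
| WT1 | 4.5058 | 0.4796 | 9.394 | <0.001 |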

Note: SE indicates standard errors of coefficients; z value indicates Wald z-statistic value.
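The Wald z-statistic in Tables 1 and 2 is the coefficient divided by its standard error, and the p value is the two-sided tail probability of a standard normal distribution. The short sketch below shows that arithmetic; the function name is hypothetical, and the normal approximation is the usual convention for logistic regression output rather than something the text states explicitly.

```python
import math

def wald_test(coefficient, standard_error):
    """Return the Wald z-statistic and its two-sided normal p value."""
    z = coefficient / standard_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

For the PCDHGA6 row of Table 2, `wald_test(1.0961, 0.6552)` gives a z value of about 1.673 and a p value of about 0.094, in line with the reported entries.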

3.2. Constructing Diagnostic Prediction Models by Deep Learning

The markers obtained by machine learning were used to classify and predict cancer and normal samples by deep learning. We constructed two multi-layer feedforward neural networks based on the deep learning framework Tensorflow and fed the Training data set into these two deep neural network models. We used a random search for hyper-parameter optimization, and Table S7 shows the best hyper-parameter combination. The two deep learning models were configured with the best hyper-parameters and trained again. Figure S1 shows the training curves. Using the early stopping strategy, we chose a suitable training point at which to stop so that the model fit the Validation data set better. After obtaining the best parameters, we tested our deep learning models on the three test data sets (Test data set 1, Test data set 2, and Test data set 3). Figure 2 shows the ROC (receiver operating characteristic) curves of both marker types. The AUC (area under the ROC curve) of Test data set 1 is 0.989 for CpG markers and 0.985 for promoter markers. Figure 3 shows the results of unsupervised hierarchical clustering for the Training data set, Validation data set, and Test data set 1, while Figure 4 shows the results for the other two test data sets (Test data set 2 and Test data set 3). These results indicate that cancer samples can be distinguished markedly from normal samples by both marker types. Table 3 shows a summary of all prediction results for both CpG markers and promoter markers (see Tables 4–7 for details). Figure 5 shows the distribution of predicted values in all samples. For CpG markers, the average sensitivity and specificity of the three test data sets were 92.8% and 90.1%, respectively (Table 3). For promoter markers, the average sensitivity and specificity of the three test data sets were 89.8% and 81.1%, respectively (Table 3).
Although sensitivity and specificity in most cancer types were higher than 0.7 for both marker types, the specificity of esophagus and stomach cancer for promoter markers was lower than 0.6, and the sensitivity of oral, thyroid, and nasopharynx cancer for promoter markers was lower than 0.6 (Tables 4–7). Therefore, all 27 cancer types could be diagnosed precisely by CpG markers, while only 22 of 27 cancer types could be diagnosed precisely by promoter markers. The two categories of markers gave the same prediction for 88.4% of the samples (i.e., 5262 samples) in the three test data sets (i.e., 5952 samples), and the average sensitivity and specificity for these 5262 samples rose to 96.4% and 91.6%. Therefore, if the CpG markers and the promoter markers give the same prediction for a sample, the prediction is more reliable. Average sensitivity and specificity in Test data set 1 were much higher than in Test data set 2 and Test data set 3 for both categories of markers, which means the models we trained are better adapted to the eight trained tissue types than to the other 20 untrained tissue types. Furthermore, for CpG markers, sensitivity and specificity of eight cancers (breast, kidney, liver, lung, bile duct, lymph nodes, cervix, and skin cancer) were both higher than 95% (Tables 4–7). Additionally, for promoter markers, sensitivity and specificity of nine cancers (breast, colorectal, liver, lung, adrenal gland, bile duct, soft tissue, cervix, and skin cancer) were both higher than 95% (Tables 4–7). In conclusion, the prediction results indicate that our deep learning models can correctly classify cancer samples and normal samples in pan-cancers.
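The agreement analysis between the two models (keeping only samples for which the CpG model and the promoter model give the same call, then recomputing sensitivity and specificity) can be sketched as follows. The function name and the 1 = cancer, 0 = normal encoding are assumptions for illustration.

```python
def concordant_subset_metrics(labels, pred_cpg, pred_promoter):
    """Evaluate only the samples on which both models agree.

    All three inputs are equal-length lists with 1 for cancer, 0 for normal.
    """
    kept = [(y, a) for y, a, b in zip(labels, pred_cpg, pred_promoter) if a == b]
    tp = sum(1 for y, p in kept if y == 1 and p == 1)
    fn = sum(1 for y, p in kept if y == 1 and p == 0)
    tn = sum(1 for y, p in kept if y == 0 and p == 0)
    fp = sum(1 for y, p in kept if y == 0 and p == 1)
    return {
        "concordant_fraction": len(kept) / len(labels),
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
    }
```

In the paper's terms, `concordant_fraction` corresponds to the 88.4% of test samples on which both marker types agreed.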
Figure 2

ROC curves of the three data sets. (a) ROC curves of CpG methylation data. (b) ROC curves of promoter methylation data. ROC curves: Receiver operating characteristic curves.
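For readers who want to recompute an AUC from predicted values, the area under the ROC curve equals the probability that a randomly chosen cancer sample receives a higher score than a randomly chosen normal sample, with ties counted as one half. The sketch below uses that pairwise definition directly; it is an illustration with a hypothetical function name, not the routine used to draw Figure 2, and its quadratic cost makes it suitable only for small data sets.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (1 = cancer, 0 = normal)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```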

Figure 3

Unsupervised hierarchical clustering of the three data sets. (a,c,e) come from 12 CpG markers and (b,d,f) come from 13 promoter markers. Methylation beta values range from 0 to 1.

Figure 4

Unsupervised hierarchical clustering of Test data set 2 and Test data set 3. (a,c) come from 12 CpG markers and (b,d) come from 13 promoter markers. Methylation beta values range from 0 to 1.

Table 3

The summary of all prediction results.

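| Marker | Data Set | Total | Cancer: Total | Cancer: Predict Cancer | Cancer: Predict Normal | Sensitivity | Normal: Total | Normal: Predict Cancer | Normal: Predict Normal | Specificity | Total Accuracy | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CpG | Training | 7003 | 4827 | 4734 | 93 | 0.981 | 2176 | 11 | 2165 | 0.995 | 0.985 | 0.966 |
| CpG | Validation | 571 | 370 | 352 | 18 | 0.951 | 201 | 10 | 191 | 0.95 | 0.951 | 0.894 |
| CpG | Test set 1 | 571 | 370 | 360 | 10 | 0.973 | 201 | 9 | 192 | 0.955 | 0.967 | 0.927 |
| CpG | Test set 2 | 3309 | 3041 | 2795 | 246 | 0.919 | 268 | 39 | 229 | 0.854 | 0.914 | 0.602 |
| CpG | Test set 3 | 2072 | 1532 | 1433 | 99 | 0.935 | 540 | 52 | 488 | 0.904 | 0.927 | 0.817 |
| CpG | All three test sets | 5952 | 4943 | 4588 | 355 | 0.928 | 1009 | 100 | 909 | 0.901 | 0.924 | 0.761 |
| Promoter | Training | 7003 | 4827 | 4676 | 151 | 0.969 | 2176 | 3 | 2173 | 0.999 | 0.978 | 0.951 |
| Promoter | Validation | 571 | 370 | 354 | 16 | 0.957 | 201 | 5 | 196 | 0.975 | 0.963 | 0.921 |
| Promoter | Test set 1 | 571 | 370 | 353 | 17 | 0.954 | 201 | 8 | 193 | 0.96 | 0.956 | 0.906 |
| Promoter | Test set 2 | 3309 | 3041 | 2641 | 400 | 0.868 | 268 | 28 | 240 | 0.9 | 0.871 | 0.528 |
| Promoter | Test set 3 | 2072 | 1532 | 1443 | 89 | 0.942 | 540 | 155 | 385 | 0.713 | 0.882 | 0.684 |
| Promoter | All three test sets | 5952 | 4943 | 4437 | 506 | 0.898 | 1009 | 191 | 818 | 0.811 | 0.883 | 0.639 |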

Note: ‘Predict cancer’ or ‘Predict normal’ indicates samples predicted as cancer or normal. Training, Validation, Test set 1, Test set 2 and Test set 3 respectively indicate Training data set, Validation data set, Test data set 1, Test data set 2 and Test data set 3. MCC indicates Matthews Correlation Coefficient [25].
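The columns of Table 3 all follow from the four confusion-matrix counts. The sketch below recomputes them; the function name is hypothetical, and returning an MCC of 0 when the denominator is zero is a common convention that is assumed here because it matches the 0 entries reported for tissue types that lack one of the two classes.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, accuracy and MCC from confusion counts.

    tp/fn: cancer samples predicted as cancer/normal
    fp/tn: normal samples predicted as cancer/normal
    """
    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tn / (tn + fp) if tn + fp else None
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For the CpG markers on Test set 1, `binary_metrics(360, 10, 9, 192)` reproduces the reported values after rounding to three decimals: sensitivity 0.973, specificity 0.955, accuracy 0.967 and MCC 0.927.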

Table 4

The prediction results of three data sets for 12 CpG markers.

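| Data Set | Tissue Types | Total | Cancer: Total | Cancer: Predict Cancer | Cancer: Predict Normal | Sensitivity | Normal: Total | Normal: Predict Cancer | Normal: Predict Normal | Specificity | Total Accuracy | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Training | Breast | 1122 | 1006 | 993 | 13 | 0.987 | 116 | 2 | 114 | 0.983 | 0.987 | 0.932 |
| Training | Colorectal | 390 | 371 | 367 | 4 | 0.989 | 19 | 0 | 19 | 1 | 0.99 | 0.904 |
| Training | Kidney | 794 | 593 | 573 | 20 | 0.966 | 201 | 0 | 201 | 1 | 0.975 | 0.937 |
| Training | Leukocyte | 576 | 0 | 0 | 0 | - | 576 | 1 | 575 | 0.998 | 0.998 | 0 |
| Training | Liver | 442 | 366 | 355 | 11 | 0.97 | 76 | 0 | 76 | 1 | 0.975 | 0.92 |
| Training | Lung | 1155 | 857 | 839 | 18 | 0.979 | 298 | 2 | 296 | 0.993 | 0.983 | 0.956 |
| Training | Prostate | 529 | 491 | 476 | 15 | 0.969 | 38 | 0 | 38 | 1 | 0.972 | 0.834 |
| Training | Uterus | 432 | 416 | 415 | 1 | 0.998 | 16 | 0 | 16 | 1 | 0.998 | 0.969 |
| Validation | Breast | 85 | 60 | 60 | 0 | 1 | 25 | 1 | 24 | 0.96 | 0.988 | 0.972 |
| Validation | Colorectal | 56 | 40 | 40 | 0 | 1 | 16 | 2 | 14 | 0.875 | 0.964 | 0.913 |
| Validation | Kidney | 85 | 60 | 56 | 4 | 0.933 | 25 | 0 | 25 | 1 | 0.953 | 0.897 |
| Validation | Leukocyte | 40 | 0 | 0 | 0 | - | 40 | 0 | 40 | 1 | 1 | 0 |
| Validation | Liver | 60 | 40 | 40 | 0 | 1 | 20 | 0 | 20 | 1 | 1 | 1 |
| Validation | Lung | 120 | 80 | 73 | 7 | 0.912 | 40 | 1 | 39 | 0.975 | 0.933 | 0.86 |
| Validation | Prostate | 70 | 50 | 44 | 6 | 0.88 | 20 | 5 | 15 | 0.75 | 0.843 | 0.621 |
| Validation | Uterus | 55 | 40 | 39 | 1 | 0.975 | 15 | 1 | 14 | 0.933 | 0.964 | 0.908 |
| Test 1 | Breast | 85 | 60 | 58 | 2 | 0.967 | 25 | 1 | 24 | 0.96 | 0.965 | 0.916 |
| Test 1 | Colorectal | 56 | 40 | 40 | 0 | 1 | 16 | 1 | 15 | 0.938 | 0.982 | 0.956 |
| Test 1 | Kidney | 85 | 60 | 58 | 2 | 0.967 | 25 | 0 | 25 | 1 | 0.976 | 0.946 |
| Test 1 | Leukocyte | 40 | 0 | 0 | 0 | - | 40 | 0 | 40 | 1 | 1 | 0 |
| Test 1 | Liver | 60 | 40 | 38 | 2 | 0.95 | 20 | 1 | 19 | 0.95 | 0.95 | 0.889 |
| Test 1 | Lung | 120 | 80 | 80 | 0 | 1 | 40 | 0 | 40 | 1 | 1 | 1 |
| Test 1 | Prostate | 70 | 50 | 46 | 4 | 0.92 | 20 | 5 | 15 | 0.75 | 0.871 | 0.681 |
| Test 1 | Uterus | 55 | 40 | 40 | 0 | 1 | 15 | 1 | 14 | 0.933 | 0.982 | 0.954 |

Table 5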

The prediction results of three data sets for 13 promoter markers.

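| Data Set | Tissue Types | Total | Cancer: Total | Cancer: Predict Cancer | Cancer: Predict Normal | Sensitivity | Normal: Total | Normal: Predict Cancer | Normal: Predict Normal | Specificity | Total Accuracy | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Training | Breast | 1122 | 1006 | 984 | 22 | 0.978 | 116 | 1 | 115 | 0.991 | 0.98 | 0.902 |
| Training | Colorectal | 390 | 371 | 370 | 1 | 0.997 | 19 | 0 | 19 | 1 | 0.997 | 0.973 |
| Training | Kidney | 794 | 593 | 545 | 48 | 0.919 | 201 | 0 | 201 | 1 | 0.94 | 0.861 |
| Training | Leukocyte | 576 | 0 | 0 | 0 | - | 576 | 0 | 576 | 1 | 1 | 0 |
| Training | Liver | 442 | 366 | 354 | 12 | 0.967 | 76 | 0 | 76 | 1 | 0.973 | 0.914 |
| Training | Lung | 1155 | 857 | 829 | 28 | 0.967 | 298 | 0 | 298 | 1 | 0.976 | 0.94 |
| Training | Prostate | 529 | 491 | 470 | 21 | 0.957 | 38 | 1 | 37 | 0.974 | 0.958 | 0.769 |
| Training | Uterus | 432 | 416 | 416 | 0 | 1 | 16 | 1 | 15 | 0.938 | 0.998 | 0.967 |
| Validation | Breast | 85 | 60 | 59 | 1 | 0.983 | 25 | 0 | 25 | 1 | 0.988 | 0.972 |
| Validation | Colorectal | 56 | 40 | 40 | 0 | 1 | 16 | 0 | 16 | 1 | 1 | 1 |
| Validation | Kidney | 85 | 60 | 57 | 3 | 0.95 | 25 | 0 | 25 | 1 | 0.965 | 0.921 |
| Validation | Leukocyte | 40 | 0 | 0 | 0 | - | 40 | 0 | 40 | 1 | 1 | 0 |
| Validation | Liver | 60 | 40 | 40 | 0 | 1 | 20 | 0 | 20 | 1 | 1 | 1 |
| Validation | Lung | 120 | 80 | 70 | 10 | 0.875 | 40 | 0 | 40 | 1 | 0.917 | 0.837 |
| Validation | Prostate | 70 | 50 | 48 | 2 | 0.96 | 20 | 3 | 17 | 0.85 | 0.929 | 0.823 |
| Validation | Uterus | 55 | 40 | 40 | 0 | 1 | 15 | 2 | 13 | 0.867 | 0.964 | 0.909 |
| Test set 1 | Breast | 85 | 60 | 57 | 3 | 0.95 | 25 | 0 | 25 | 1 | 0.965 | 0.921 |
| Test set 1 | Colorectal | 56 | 40 | 40 | 0 | 1 | 16 | 0 | 16 | 1 | 1 | 1 |
| Test set 1 | Kidney | 85 | 60 | 56 | 4 | 0.933 | 25 | 0 | 25 | 1 | 0.953 | 0.897 |
| Test set 1 | Leukocyte | 40 | 0 | 0 | 0 | - | 40 | 0 | 40 | 1 | 1 | 0 |
| Test set 1 | Liver | 60 | 40 | 38 | 2 | 0.95 | 20 | 1 | 19 | 0.95 | 0.95 | 0.889 |
| Test set 1 | Lung | 120 | 80 | 77 | 3 | 0.963 | 40 | 0 | 40 | 1 | 0.975 | 0.946 |
| Test set 1 | Prostate | 70 | 50 | 45 | 5 | 0.9 | 20 | 5 | 15 | 0.75 | 0.857 | 0.65 |
| Test set 1 | Uterus | 55 | 40 | 40 | 0 | 1 | 15 | 2 | 13 | 0.867 | 0.964 | 0.909 |

Table 6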

The prediction results of two test data sets for 12 CpG markers.

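| Data Set | Tissue Types | Total | Cancer: Total | Cancer: Predict Cancer | Cancer: Predict Normal | Sensitivity | Normal: Total | Normal: Predict Cancer | Normal: Predict Normal | Specificity | Total Accuracy | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test data set 2 | Adrenal gland | 267 | 264 | 213 | 51 | 0.807 | 3 | 0 | 3 | 1 | 0.809 | 0.212 |
| Test data set 2 | Bile duct | 45 | 36 | 36 | 0 | 1 | 9 | 0 | 9 | 1 | 1 | 1 |
| Test data set 2 | Bladder | 440 | 419 | 411 | 8 | 0.981 | 21 | 3 | 18 | 0.857 | 0.975 | 0.758 |
| Test data set 2 | Esophagus | 202 | 186 | 185 | 1 | 0.995 | 16 | 5 | 11 | 0.688 | 0.97 | 0.779 |
| Test data set 2 | Eyes | 80 | 80 | 74 | 6 | 0.925 | 0 | 0 | 0 | - | 0.925 | 0 |
| Test data set 2 | Head and neck | 580 | 530 | 529 | 1 | 0.998 | 50 | 10 | 40 | 0.8 | 0.981 | 0.874 |
| Test data set 2 | Lymph nodes | 51 | 48 | 46 | 2 | 0.958 | 3 | 0 | 3 | 1 | 0.961 | 0.758 |
| Test data set 2 | Oral | 104 | 65 | 46 | 19 | 0.708 | 39 | 2 | 37 | 0.949 | 0.798 | 0.637 |
| Test data set 2 | Ovary | 10 | 10 | 10 | 0 | 1 | 0 | 0 | 0 | - | 1 | 0 |
| Test data set 2 | Pancreas | 391 | 352 | 265 | 87 | 0.753 | 39 | 3 | 36 | 0.923 | 0.77 | 0.436 |
| Test data set 2 | Pleura | 87 | 87 | 81 | 6 | 0.931 | 0 | 0 | 0 | - | 0.931 | 0 |
| Test data set 2 | Small bowel | 56 | 28 | 27 | 1 | 0.964 | 28 | 4 | 24 | 0.857 | 0.911 | 0.826 |
| Test data set 2 | Soft tissue | 269 | 265 | 250 | 15 | 0.943 | 4 | 0 | 4 | 1 | 0.944 | 0.446 |
| Test data set 2 | Testis | 156 | 156 | 135 | 21 | 0.865 | 0 | 0 | 0 | - | 0.865 | 0 |
| Test data set 2 | Thyroid | 571 | 515 | 487 | 28 | 0.946 | 56 | 12 | 44 | 0.786 | 0.93 | 0.655 |
| Test data set 3 | Bone marrow | 386 | 325 | 257 | 68 | 0.791 | 61 | 0 | 61 | 1 | 0.824 | 0.611 |
| Test data set 3 | Cervix | 356 | 315 | 311 | 4 | 0.987 | 41 | 1 | 40 | 0.976 | 0.986 | 0.934 |
| Test data set 3 | Nasopharynx | 48 | 24 | 19 | 5 | 0.792 | 24 | 2 | 22 | 0.917 | 0.854 | 0.714 |
| Test data set 3 | Skin | 694 | 473 | 466 | 7 | 0.985 | 221 | 1 | 220 | 0.995 | 0.988 | 0.974 |
| Test data set 3 | Stomach | 588 | 395 | 380 | 15 | 0.962 | 193 | 48 | 145 | 0.751 | 0.893 | 0.753 |

Table 7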

The prediction results of two test data sets for 13 promoter markers.

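| Data Set | Tissue Types | Total | Cancer: Total | Cancer: Predict Cancer | Cancer: Predict Normal | Sensitivity | Normal: Total | Normal: Predict Cancer | Normal: Predict Normal | Specificity | Total Accuracy | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test data set 2 | Adrenal gland | 267 | 264 | 251 | 13 | 0.951 | 3 | 0 | 3 | 1 | 0.951 | 0.422 |
| Test data set 2 | Bile duct | 45 | 36 | 36 | 0 | 1 | 9 | 0 | 9 | 1 | 1 | 1 |
| Test data set 2 | Bladder | 440 | 419 | 414 | 5 | 0.988 | 21 | 3 | 18 | 0.857 | 0.982 | 0.81 |
| Test data set 2 | Esophagus | 202 | 186 | 186 | 0 | 1 | 16 | 7 | 9 | 0.562 | 0.965 | 0.736 |
| Test data set 2 | Eyes | 80 | 80 | 74 | 6 | 0.925 | 0 | 0 | 0 | - | 0.925 | 0 |
| Test data set 2 | Head and neck | 580 | 530 | 523 | 7 | 0.987 | 50 | 6 | 44 | 0.88 | 0.978 | 0.859 |
| Test data set 2 | Lymph nodes | 51 | 48 | 48 | 0 | 1 | 3 | 3 | 0 | 0 | 0.941 | 0 |
| Test data set 2 | Oral | 104 | 65 | 37 | 28 | 0.569 | 39 | 2 | 37 | 0.949 | 0.712 | 0.518 |
| Test data set 2 | Ovary | 10 | 10 | 10 | 0 | 1 | 0 | 0 | 0 | - | 1 | 0 |
| Test data set 2 | Pancreas | 391 | 352 | 297 | 55 | 0.844 | 39 | 3 | 36 | 0.923 | 0.852 | 0.544 |
| Test data set 2 | Pleura | 87 | 87 | 82 | 5 | 0.943 | 0 | 0 | 0 | - | 0.943 | 0 |
| Test data set 2 | Small bowel | 56 | 28 | 27 | 1 | 0.964 | 28 | 2 | 26 | 0.929 | 0.946 | 0.893 |
| Test data set 2 | Soft tissue | 269 | 265 | 263 | 2 | 0.992 | 4 | 0 | 4 | 1 | 0.993 | 0.813 |
| Test data set 2 | Testis | 156 | 156 | 134 | 22 | 0.859 | 0 | 0 | 0 | - | 0.859 | 0 |
| Test data set 2 | Thyroid | 571 | 515 | 259 | 256 | 0.503 | 56 | 2 | 54 | 0.964 | 0.548 | 0.279 |
| Test data set 3 | Bone marrow | 386 | 325 | 261 | 64 | 0.803 | 61 | 0 | 61 | 1 | 0.834 | 0.626 |
| Test data set 3 | Cervix | 356 | 315 | 310 | 5 | 0.984 | 41 | 0 | 41 | 1 | 0.986 | 0.937 |
| Test data set 3 | Nasopharynx | 48 | 24 | 9 | 15 | 0.375 | 24 | 0 | 24 | 1 | 0.688 | 0.48 |
| Test data set 3 | Skin | 694 | 473 | 470 | 3 | 0.994 | 221 | 1 | 220 | 0.995 | 0.994 | 0.987 |
| Test data set 3 | Stomach | 588 | 395 | 393 | 2 | 0.995 | 193 | 154 | 39 | 0.202 | 0.735 | 0.363 |

Figure 5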

The distribution of prediction values in all samples. (a,c,e) come from 12 CpG markers and (b,d,f) come from 13 promoter markers. Red indicates the status of the sample is cancer, and green indicates the status of the sample is normal.

Interpreting model predictions is becoming increasingly crucial in machine learning, especially for deep learning. One prominent approach uses SHAP (SHapley Additive exPlanation) values as a unified measure of feature importance [26]. Figure 6 shows the average absolute SHAP value of each marker. For CpG markers, cg07333191 has the biggest impact on model output, while cg04374393 has the least impact. For promoter markers, AURKB has the biggest impact on model output, while ACVRL1 has the least. Figure S4 shows the detailed impact of each marker on the model output in four samples.
Figure 6

The distribution of average absolute SHAP value for all markers. (a) comes from 12 CpG markers and (b) comes from 13 promoter markers. SHAP: SHapley Additive explanation.
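The quantity plotted in Figure 6, the average absolute SHAP value per marker, is a simple aggregation once per-sample SHAP values are available. The sketch below assumes those values have already been computed (for example with the ‘shap’ package) and are held as one list per sample; the function name and the marker names in the usage note are hypothetical.

```python
def mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, largest first.

    shap_values: one list per sample, with one SHAP value per feature.
    """
    n = len(shap_values)
    totals = [0.0] * len(feature_names)
    for row in shap_values:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    averages = [t / n for t in totals]
    return sorted(zip(feature_names, averages), key=lambda kv: kv[1], reverse=True)
```

Calling `mean_abs_shap([[0.2, -0.5], [-0.4, 0.3]], ["m1", "m2"])` ranks the placeholder marker `m2` ahead of `m1`, because sign is ignored and only the magnitude of each contribution counts.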

3.3. Evaluating Reliability of Markers and Diagnostic Prediction Models

To verify whether our deep learning models perform better than a traditional machine learning strategy such as logistic regression, we fitted our data with two logistic regression models. The results indicate that, on our data sets, the deep learning method predicts more precisely than the logistic regression method (Figure S2, Table S8). To test the reliability of the selected markers, we randomly partitioned all samples into 80% for training, 10% for validation, and 10% for testing. We constructed two further deep learning models for CpG markers and promoter markers and fed all these samples into the models. Figure S3 shows the ROC curves of the three data sets, and the AUCs (0.993 for CpG markers and 0.995 for promoter markers) demonstrate that the selected markers can classify all samples precisely. The robustness of biomarkers for cancer diagnosis or prognosis might be low due to tumor heterogeneity, and random sampling has been suggested to evaluate the robustness of markers [27]. We performed random sampling 100 times, dividing the data into one training data set and one test data set each time. The training data set was used to train the models and the test data set was used to evaluate their performance. Figure S5 shows the prediction accuracies for both CpG markers and promoter markers. The result indicates that our markers are robust, since the prediction accuracies are high across different data sets. To test the performance of the markers in liquid biopsy, we used the markers to predict cell-free DNA methylation data of 163 prostate cancer samples. Sensitivity was 100% for CpG markers and 92% for promoter markers. Additionally, another dataset, with GEO accession number GSE110185, contains six pooled cell-free DNA samples (two colorectal cancer, two advanced adenoma and two healthy control samples). All six of these samples were predicted as normal samples. Notably, for both marker types, the specificity on normal whole blood is 100%, and whole blood samples are the samples most similar to cell-free DNA samples.

4. Discussion

Most related studies identifying methylation markers focused on one or a few cancer types. The most important impact for our study is that we attempted to identify two categories of methylation markers, CpG markers and promoter markers, to classify and predict pan-cancers. The reliability of this study lies in the fact that all three test data sets were tested only once to avoid overfitting. Therefore, the predict results we show here can prove that pan-cancers can be predicted precisely by the selected methylation markers. Sensitivity and specificity in most cancer types are high enough for both markers. Nonetheless, for promoter markers, specificities of two cancer types (esophagus and stomach cancer) and sensitivities of three cancer types (oral, thyroid, and nasopharynx cancer) are too low to predict precisely. Sensitivity and specificity of these five cancer types are high enough for CpG markers, which means the samples are qualified. Therefore, a possible reason is that in these samples, CpG probes located at promoters are not enough to calculate promoter methylation values precisely. This is the potential defect of promoter markers that promoter methylation value calculating might be inaccurate since each promoter has different length definition actually. Another possible reason is that CpG markers may be more adapted to these cancer types than promoter markers. Nonetheless, identifying promoter methylation markers is worth attempting, since promoter has a close relation with the process of cancer developing. The advantage of pan-cancer methylation biomarkers is that diagnosis of diverse cancer types can be based on targeted measuring of these biomarkers. Therefore, these biomarkers could be applied in liquid biopsy effectively. The performance of the selected markers in cell-free DNA methylation data of 163 cancer samples was excellent. However, for GSE110185 dataset, all six pooled samples were predicted as normal samples. 
Two advanced adenomas samples should be regarded as non-cancer samples, thus the prediction accuracy is 0.667. However, because lack of abundant normal cell-free DNA samples, specificity remains to be verified in more normal samples. We have put arguments of the well-trained deep learning models online to let more researchers validate the reliability of our model. What should be emphasized is the dependability of cell-free DNA samples. Since in the process of cell-free DNA isolation, contamination could easily happen, such as ruptured blood cells [28]. Therefore, samples containing cell-free DNA are prone to be classified as normal samples. Comparing other studies to our study, Vrba et al. [15] attempted to identify CpG markers to predict pan-cancer. One difference between their strategy and ours is that they reduced the number of markers by comparing cancer samples to mix unrelated normal whole blood samples. While we identified markers by comparing cancer samples to mix matched normal samples. Another difference is that they identified markers in each cancer, and summarized all markers to one marker set. However, our strategy involves gathering all samples from the start, and identifying markers fitting all data. Due to a lack of cell-free DNA methylation data, one compromise in their research is that they treated whole blood samples as cell-free DNA samples simulation. Although whole blood samples mainly contained leukocytes, whole blood samples are the most similar samples to cell-free samples. Therefore, specificity for cell-free DNA samples in our study could be calculated by whole blood approximately, which means for both two marker types, specificity of whole blood is 100%. In our study, taking the intersection of markers from two machine learning strategies to reduce the number of markers is a compromising strategy. In the future, more convincing statistics, machine learning, and data dividing strategies for mining marker are necessary. 
Because cell-free DNA samples are scarce, further verification depends on more researchers using our published models to classify cell-free DNA samples. Additionally, the pipeline of this study can be applied to cell-free DNA samples to identify methylation markers better suited to them. The long-term goal is to identify one set of methylation markers for cell-free DNA samples and to apply it to early pan-cancer diagnosis, so that cancers are detected and treated early and cancer mortality is reduced. The models we have trained can only determine whether a sample is cancer or normal tissue; they cannot determine which cancer type the sample belongs to. Multi-class classification models need to be constructed in future studies to diagnose the exact cancer type of a sample.
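For readers who wish to check the published models on their own samples, the sensitivity, specificity, and accuracy figures quoted in this discussion follow the standard definitions for a binary classifier. The sketch below is a minimal illustration under the assumption that labels are coded 1 for cancer and 0 for normal. The function name and the toy labels are hypothetical and are not data from this study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity, accuracy) for binary labels.

    Labels are assumed to be 1 = cancer and 0 = normal.
    Sensitivity or specificity is None when its denominator is zero,
    for example when a data set contains no normal samples.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)

    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Toy labels, for illustration only.
truth = [1, 1, 1, 0, 0, 0]
calls = [1, 1, 0, 0, 0, 1]
sens, spec, acc = sensitivity_specificity(truth, calls)
print(sens, spec, acc)
```

Returning None for an undefined ratio makes explicit the situation noted above, in which specificity cannot be estimated for a data set that lacks normal samples.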

5. Conclusions

In our study, we collected whole genome methylation data of 10,140 cancer samples and 3386 normal samples, and divided them into five data sets. Using three machine learning methods, we identified two categories of markers: 12 CpG markers and 13 promoter markers. Three of the 12 CpG markers and four of the 13 promoter markers are located at cancer-related genes. These markers performed well in both solid tissue and cell-free DNA samples. Additionally, if the CpG markers and the promoter markers give the same prediction for a sample, the prediction is more reliable. To conclude, we found it possible to identify methylation markers that predict pan-cancer. The long-term goal is to identify one set of methylation markers for efficient and precise liquid biopsy of pan-cancers.
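The agreement rule between the two marker types can be written as a short sketch. It assumes each model outputs a binary call, 1 for cancer and 0 for normal. The function name `combine_calls` and the choice to return no call on disagreement are illustrative assumptions, not part of the published models.

```python
def combine_calls(cpg_call, promoter_call):
    """Combine the binary calls (1 = cancer, 0 = normal) of two models.

    Returns (call, concordant). When the two models agree, the shared
    call is returned and flagged as concordant. When they disagree,
    no call is made (None) so the sample can be reviewed separately.
    """
    if cpg_call not in (0, 1) or promoter_call not in (0, 1):
        raise ValueError("calls must be 0 (normal) or 1 (cancer)")
    if cpg_call == promoter_call:
        return cpg_call, True
    return None, False


# Hypothetical per-sample calls, for illustration only.
cpg_calls = [1, 0, 1, 0]
promoter_calls = [1, 0, 0, 1]
combined = [combine_calls(c, p) for c, p in zip(cpg_calls, promoter_calls)]
print(combined)
```

Declining to call discordant samples is only one option. A user could instead fall back on the CpG model, which showed the higher sensitivity and specificity on the test data sets.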
  22 in total

1.  Comparison of the predicted and observed secondary structure of T4 phage lysozyme.

Authors:  B W Matthews
Journal:  Biochim Biophys Acta       Date:  1975-10-20

Review 2.  DNA methylation and gene silencing in cancer.

Authors:  Stephen B Baylin
Journal:  Nat Clin Pract Oncol       Date:  2005-12

Review 3.  APC gene hypermethylation and prostate cancer: a systematic review and meta-analysis.

Authors:  Yang Chen; Jie Li; Xiaoxiang Yu; Shuai Li; Xuerong Zhang; Zengnan Mo; Yanling Hu
Journal:  Eur J Hum Genet       Date:  2013-01-09       Impact factor: 4.246

4.  Predominant hematopoietic origin of cell-free DNA in plasma and serum after sex-mismatched bone marrow transplantation.

Authors:  Yanni Y N Lui; Ki-Wai Chik; Rossa W K Chiu; Cheong-Yip Ho; Christopher W K Lam; Y M Dennis Lo
Journal:  Clin Chem       Date:  2002-03       Impact factor: 8.327

5.  A suite of DNA methylation markers that can detect most common human cancers.

Authors:  Lukas Vrba; Bernard W Futscher
Journal:  Epigenetics       Date:  2018-02-19       Impact factor: 4.528

Review 6.  Epigenetic interplay between histone modifications and DNA methylation in gene silencing.

Authors:  Thomas Vaissière; Carla Sawan; Zdenko Herceg
Journal:  Mutat Res       Date:  2008-02-29       Impact factor: 2.433

7.  NCBI GEO: archive for functional genomics data sets--update.

Authors:  Tanya Barrett; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Michelle Holko; Andrey Yefanov; Hyeseung Lee; Naigong Zhang; Cynthia L Robertson; Nadezhda Serova; Sean Davis; Alexandra Soboleva
Journal:  Nucleic Acids Res       Date:  2012-11-27       Impact factor: 16.971

8.  MethHC: a database of DNA methylation and gene expression in human cancer.

Authors:  Wei-Yun Huang; Sheng-Da Hsu; Hsi-Yuan Huang; Yi-Ming Sun; Chih-Hung Chou; Shun-Long Weng; Hsien-Da Huang
Journal:  Nucleic Acids Res       Date:  2014-11-14       Impact factor: 16.971

9.  Pan-cancer patterns of DNA methylation.

Authors:  Tania Witte; Christoph Plass; Clarissa Gerhauser
Journal:  Genome Med       Date:  2014-08-30       Impact factor: 11.117

10.  GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest.

Authors:  Ramón Diaz-Uriarte
Journal:  BMC Bioinformatics       Date:  2007-09-03       Impact factor: 3.169

