Maximilian Sprang, Claudia Paret, Joerg Faber.
Abstract
The analysis of tumours using biomarkers in blood is transforming cancer diagnosis and therapy. Cancers are characterised by evolving genetic alterations, which makes it difficult to develop reliable and broadly applicable DNA-based biomarkers for liquid biopsy. In contrast to the variability in gene mutations, the methylation pattern remains generally constant during carcinogenesis. Methylation analysis, more than mutation analysis, may therefore be exploited to recognise tumour features in the blood of patients. In this work, we investigated whether global CpG island methylation profiles can serve as a basis for predicting the cancer state of patients from liquid biopsy samples. (CpG denotes a CG motif in the context of methylation; the p represents the phosphate, and the notation distinguishes CG sites subject to methylation from other CG motifs or from mentions of CG content.) We retrieved existing GEO methylation datasets on hepatocellular carcinoma (HCC) and cell-free DNA (cfDNA) from HCC patients and healthy donors, as well as healthy whole blood and purified peripheral blood mononuclear cell (PBMC) samples, and used a random forest classifier as a predictor. Additionally, we tested three different feature selection techniques in combination with the classifier. When using cfDNA samples together with solid tumour samples and healthy blood samples of different origin, we achieved an average accuracy of 0.98 in a 10-fold cross-validation. In this setting, all the feature selection methods tested in this work showed promising results. We could also show that it is possible to use solid tumour samples and purified PBMCs as a training set and correctly predict a cfDNA sample as cancerous or healthy. In contrast to the complete set of samples, with this training set the feature selection methods led to varying results for the respective random forests. ANOVA feature selection worked well, and the selected features allowed the random forest to predict all cfDNA samples correctly. Feature selection based on mutual information could also lead to better-than-random results, but LASSO feature selection did not lead to a confident prediction. Our results show the relevance of CpG islands as tumour markers in blood.
Keywords: CpG islands; HCC; liquid biopsy
Year: 2020 PMID: 32752173 PMCID: PMC7465093 DOI: 10.3390/cells9081820
Source DB: PubMed Journal: Cells ISSN: 2073-4409 Impact factor: 6.600
Gene Expression Omnibus (GEO) datasets used in this work. cfDNA: cell-free DNA, PBMCs: peripheral blood mononuclear cells.
| Sample | GEO Identifier | Exemplary Identifier in This Work | Number of Samples |
|---|---|---|---|
| Healthy blood: cfDNA | GSE110185 | cf_Moss_x | 8 |
| Healthy blood: cfDNA | GSE122126 | cfDNA_NCF_pool_x | 2 |
| Healthy blood: PBMCs | GSE130748 | PBMC_x | 37 |
| Healthy blood: whole blood | GSE77056 | blood_x | 24 |
| Hepatocellular carcinoma: solid tumour | GSE77269 | HCC_2_x | 20 |
| Hepatocellular carcinoma: solid tumour | GSE99036 | Hepatocellular carcinoma YSHxxx | 15 |
| Hepatocellular carcinoma: cfDNA | GSE129374 | cfDNA_603xxxx_Cirrhosis_with_HCC | 22 |
| Healthy blood: whole blood | GSE40279 | GEO Accession (GSM989xxx) | 101 |
Figure 1. Principal component analysis (PCA) of the complete dataset. The features were CpG sites in (a) and computed CpG islands in (b). Blue marks purified peripheral blood mononuclear cell (PBMC) samples, and orange marks whole blood samples, for which the complete DNA found in a blood sample is extracted without further purification. Violet dots represent solid hepatocellular carcinoma (HCC) samples. Green and red dots are healthy and cancerous cfDNA samples extracted from plasma, respectively. The highest red dot in (b), located among the lower orange cluster, is the sample cfDNA_6032742_Cirrhosis_with_HCC, which was the sample most often mispredicted by all classifiers. In (a), it is the lowest red dot, close to the blood sample cluster.
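The caption distinguishes site-level features from computed island-level features, but this record does not state how the island values were derived. The following is a minimal sketch under one simple assumption: an island's value is the mean of the beta values of the CpG sites that fall inside its coordinates. The function names, the input layout, and the inclusive coordinate convention are hypothetical and not taken from the original work.

```python
from statistics import fmean


def parse_region(region):
    """Split a 'chr:start-end' string into (chromosome, start, end)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def island_methylation(site_betas, islands):
    """Average site-level beta values per island (assumed aggregation rule).

    site_betas: dict mapping (chromosome, position) to a beta value in [0, 1].
    islands:    list of 'chr:start-end' strings.
    Islands that contain no measured site are omitted from the result.
    """
    result = {}
    for region in islands:
        chrom, start, end = parse_region(region)
        inside = [
            beta
            for (site_chrom, pos), beta in site_betas.items()
            if site_chrom == chrom and start <= pos <= end
        ]
        if inside:
            result[region] = fmean(inside)
    return result
```

Called with one sample's site-level values and a list of island coordinates such as `chr19:47614409-47614661`, the function returns one value per covered island, which can then be used as a feature vector in place of the individual sites.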
Figure 2. cfDNA classification based on random forest trained with 60% of the complete data. (a) Average prediction accuracy of the random forests trained with, and tuned on, the features selected by the indicated methods. (b) Average F1 score of the same predictions. The y-axis indicates the score. Black bars show the standard deviation. The selectors were given the complete set of data as a randomly split training and test set with a test size of 0.4. Ten iterations were conducted with different random states. CpG island methylation (blue) or CpG sites (orange) were used for the analysis.
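The evaluation protocol in this caption (ten random 60/40 splits, with mean and standard deviation of accuracy and F1) can be sketched in plain Python. This is an illustration only, not the authors' code: `fit_predict` is a hypothetical callable standing in for the feature selection plus random forest step, the seeds 0 to 9 are an assumed choice of random states, and cancer is assumed to be encoded as label 1.

```python
import random
import statistics


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def split_indices(n_samples, test_size, seed):
    """Shuffle sample indices with a fixed seed and split off the test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


def repeated_holdout(X, y, fit_predict, test_size=0.4, n_repeats=10):
    """Return (mean, standard deviation) of accuracy and F1 over random splits."""
    accs, f1s = [], []
    for seed in range(n_repeats):  # one seed per "random state" (assumed)
        train_idx, test_idx = split_indices(len(y), test_size, seed)
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_test = [X[i] for i in test_idx]
        y_test = [y[i] for i in test_idx]
        y_pred = fit_predict(X_train, y_train, X_test)
        accs.append(accuracy(y_test, y_pred))
        f1s.append(f1_score(y_test, y_pred))
    return {
        "accuracy": (statistics.mean(accs), statistics.stdev(accs)),
        "f1": (statistics.mean(f1s), statistics.stdev(f1s)),
    }
```

Keeping feature selection inside `fit_predict`, so that it only sees the training part of each split, avoids leaking information from the held-out samples into the selected features.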
Features that recurred frequently across all feature selection methods and multiple random states, resulting in a high accuracy of prediction.
| CpG Islands (Selected Features) | Genes | Region | Status in PBMC | Status in HCC |
|---|---|---|---|---|
| chr19:47614409-47614661 | Zinc Finger CCCH-Type Containing 4, ZC3H4 | Intron 2 | Non-methylated | Semi-methylated |
| chr9:79073908-79074561 | Beta-1,3-galactosyl-O-glycosylglycoprotein beta-1,6-Nacetylglucosaminyltransferase, GCNT1 | Promoter Region, Exon 1, Intron 1 | Non-methylated | Varying |
| chr1:2979276-2980758 | PRDM16 Divergent Transcript | | Hypermethylated | Semi-methylated |
Minimum number of features selected by ANOVA leading to a high accuracy of prediction.
| CpG Islands (Selected Features) | Genes | Region | Status in PBMC | Status in HCC |
|---|---|---|---|---|
| chr12:50361368-50361652 | Aquaporin 6, AQP6 | Intron 1 | Non-methylated | Hypomethylated |
| chr8:103875223-103877084 | Antizyme inhibitor 1, AZIN1 | Promoter, Exon 1 | Hypomethylated | Non-methylated |
| chr22:32149763-32150064 | DEP domain-containing 5, DEPDC5 | Promoter, Exon 1 | Non-methylated | Hypomethylated |
Figure 3. cfDNA classification based on random forest trained with all solid tumour and PBMC samples. The training set was limited to the solid tumour and PBMC samples (35 and 37 samples, respectively). The selectors were trained only on solid tumour and PBMC samples as a randomly shuffled training set. The test set was composed of the healthy and cancerous cfDNA samples (10 and 22 samples, respectively). (a) Average prediction accuracy of the random forests trained with, and tuned on, the features selected by the indicated methods. (b) Average F1 score of the same predictions. Black bars show the standard deviation. Two PBMC samples were removed from the set because they were outliers in the PCA. CpG island methylation (blue) or CpG sites (orange) were used for the analysis.
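In this setting the selectors see only the solid tumour and PBMC samples, and the chosen features are then applied unchanged to the cfDNA test samples. The sketch below illustrates that separation, using the one-way ANOVA F statistic as the ranking score. It is a simplified illustration, not the authors' implementation: the function names, the row-per-sample data layout, and the choice of `k` are hypothetical.

```python
from statistics import fmean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across label groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n_total, n_groups = len(values), len(groups)
    grand_mean = fmean(values)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of the samples around their own group mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (n_groups - 1)) / (ss_within / (n_total - n_groups))


def top_k_features(X_train, y_train, k):
    """Indices of the k features with the highest F score on the training set."""
    n_features = len(X_train[0])
    scores = [
        anova_f([row[j] for row in X_train], y_train) for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def select_columns(X, feature_idx):
    """Keep only the selected feature columns, e.g. for the cfDNA test set."""
    return [[row[j] for j in feature_idx] for row in X]
```

With `X_train` holding the tumour and PBMC samples and `X_test` the cfDNA samples, `top_k_features(X_train, y_train, k)` is computed once, and `select_columns` is applied with the same indices to both sets before a classifier is trained and evaluated.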