K Babalyan1,2,3, R Sultanov1,2,3, E Generozov1, E Sharova1, E Kostryukova1, A Larin1, A Kanygina1,2, V Govorun1,2,3, G Arapidi1,2,3.
Abstract
Although modern methods of whole-genome DNA methylation analysis have a wide range of applications, they are not suitable for clinical diagnostics because of their high cost and complexity and the large amount of sample DNA they require. Therefore, it is crucial to identify a relatively small number of methylation sites that provide high precision and sensitivity for the diagnosis of pathological states. We propose an algorithm for constructing limited subsamples from high-dimensional data to form diagnostic panels. We have developed a tool that uses different selection methods to find an optimal, minimum necessary combination of factors, using the cross-entropy loss metric (LogLoss) to identify a subset of methylation sites. We show that the algorithm works effectively with different genome methylation patterns using ensemble-based machine learning methods. Algorithm efficiency, precision and robustness were evaluated using five genome-wide DNA methylation datasets (totaling 626 samples), and each dataset was classified into tumor and non-tumor samples. The algorithm produced an AUC of 0.97 (95% CI: 0.94-0.99, 9 sites) for prostate adenocarcinoma and an AUC of 1.0 (from 2 to 6 sites) for urothelial bladder carcinoma, two types of kidney carcinoma and colorectal carcinoma. For prostate adenocarcinoma, we showed that the identified differential variability methylation patterns distinguish a cluster of samples with a higher recurrence rate (hazard ratio for recurrence = 0.48, 95% CI: 0.05-0.92; log-rank test, p-value < 0.03). We also identified several clusters of correlated, interchangeable methylation sites that can be used to elaborate the biological interpretation of the resulting models and to further select the sites most suitable for designing diagnostic panels.
LogLoss-BERAF is implemented as standalone Python code; the open-source code is freely available from https://github.com/bioinformatics-IBCH/logloss-beraf along with the models described in this article.
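The idea of shrinking a feature set while monitoring cross-entropy loss can be illustrated with a minimal sketch. This is not the LogLoss-BERAF implementation: the abstract describes ensemble-based learners, whereas the sketch below swaps in a toy nearest-centroid scorer so that it runs with the standard library only. The function names (`log_loss`, `centroid_proba`, `backward_eliminate`), the tolerance `tol`, the scaling factor `scale`, and the site names and methylation values are all hypothetical.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def centroid_proba(train_X, train_y, X, features, scale=10.0):
    """Toy stand-in classifier (an assumption, NOT the paper's ensemble model).

    Each selected site votes by how far a sample lies from the midpoint
    between the tumor and non-tumor training means; the summed vote is
    squashed into a probability with a logistic function.
    """
    stats = {}
    for f in features:
        pos = [row[f] for row, y in zip(train_X, train_y) if y == 1]
        neg = [row[f] for row, y in zip(train_X, train_y) if y == 0]
        mu1, mu0 = sum(pos) / len(pos), sum(neg) / len(neg)
        stats[f] = ((mu1 + mu0) / 2, mu1 - mu0)
    probs = []
    for row in X:
        score = sum((row[f] - mid) * diff for f, (mid, diff) in stats.items())
        probs.append(1 / (1 + math.exp(-scale * score)))
    return probs

def backward_eliminate(train_X, train_y, val_X, val_y, tol=1e-3):
    """Greedily drop sites while validation LogLoss stays within tol of the best."""
    selected = list(range(len(train_X[0])))

    def loss(feats):
        return log_loss(val_y, centroid_proba(train_X, train_y, val_X, feats))

    best = loss(selected)
    while len(selected) > 1:
        # Try removing each remaining site and keep the cheapest removal.
        trial_loss, drop = min(
            (loss([g for g in selected if g != f]), f) for f in selected
        )
        if trial_loss > best + tol:
            break  # every removal hurts: stop shrinking the panel
        selected.remove(drop)
        best = min(best, trial_loss)
    return selected, best

# Hypothetical methylation (beta) values for three made-up sites.
sites = ["site_a", "site_b", "site_c"]
train_X = [[0.9, 0.6, 0.5], [0.8, 0.7, 0.4], [0.1, 0.4, 0.5], [0.2, 0.3, 0.4]]
train_y = [1, 1, 0, 0]  # 1 = tumor, 0 = non-tumor
val_X = [[0.85, 0.5, 0.1], [0.15, 0.5, 0.9]]
val_y = [1, 0]

kept, final_loss = backward_eliminate(train_X, train_y, val_X, val_y)
kept_names = [sites[i] for i in kept]
print(kept_names, round(final_loss, 4))
```

In this toy data only `site_a` separates the validation samples, so the other two sites can be dropped without increasing the loss beyond the tolerance. A real panel search would replace the centroid scorer with the chosen ensemble model and use cross-validation rather than a single held-out split.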
Year: 2018 PMID: 30388122 PMCID: PMC6214495 DOI: 10.1371/journal.pone.0204371
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
List of tumor (T) and non-tumor (N) samples and datasets used for model training and validation.
| Source | Training Set | Test Set | Total |
|---|---|---|---|
| | 8T, 11N | 13T, 16N | 48 |
| | 117T, 15N | 176T, 23N | 331 |
| | 0 | 143T, 0N | 143 |
| | 0 | 8T, 4N | 12 |
| | 0 | 77T, 15N | 92 |
| | 134T, 9N | 201T, 14N | 490 |
| | 133T, 15N | 200T, 22N | 370 |
| | 6T, 9N | 8T, 11N | 34 |
| | 116T, 52N | 174T, 78N | 420 |
| | 101T, 63N | 151T, 104N | 419 |
Fig 1. Pipeline of the proposed method.
Prostate adenocarcinoma classification model sites with their positions and corresponding gene names and groups.
| IlmnID | Chr | Position | Gene name | Group |
|---|---|---|---|---|
| Prostate adenocarcinoma | | | | |
| cg02361803 | chr1 | 2014371 | PRKCZ | Body |
| cg11448068 | chr2 | 191045026 | C2orf88 | TSS1500 |
| cg16100120 | chr2 | 56150475 | EFEMP1 | TSS200 |
| cg00817367 | chr12 | 52401214 | GRASP | Body |
| cg18844382 | chr14 | 23834977 | EFS | TSS200 |
| cg00402172 | chr16 | 68118754 | NFATC3 | TSS1500 |
| cg14621217 | chr17 | 80944134 | B3GNTL1 | Body |
| cg16849024 | chr19 | 41934210 | B3GNT8 | 5'UTR |
| cg22059073 | chr22 | 17602570 | CECR6 | TSS1500 |
Fig 2. A. Clustering of tumor samples by shared methylation patterns of the sites included in the prostate cancer classification model. The X axis shows indices of model sites (0: cg02361803; 1: cg16100120; 2: cg11448068; 3: cg00817367; 4: cg18844382; 5: cg00402172; 6: cg14621217; 7: cg16849024; and 8: cg22059073). B. AUC changes depending on the noise level, delta, introduced into methylation levels of the sites included in the prostate cancer classification model. C. PCA plot for the sites of the PRAD diagnostic model with methylation groups assigned according to the data for leukocyte blood fraction from nominally healthy people.
Fig 3. Kaplan-Meier curves for samples from the second and third clusters.
The clusters are those introduced in Fig 2A.
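For readers unfamiliar with Kaplan-Meier curves, the estimator behind such a plot can be sketched in a few lines. The follow-up times and event flags below are invented for illustration and are not taken from the study; the function name `kaplan_meier` is likewise hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  : follow-up time per sample
    events : 1 if recurrence was observed at that time, 0 if censored
    """
    order = sorted(zip(times, events))
    n_at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        n_events = 0   # recurrences observed exactly at time t
        n_leaving = 0  # everyone (event or censored) leaving the risk set at t
        while i < len(order) and order[i][0] == t:
            n_events += order[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve

# Hypothetical follow-up data (arbitrary time units) for one cluster.
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 0, 1]
km_curve = kaplan_meier(times, events)
print(km_curve)
```

Computing one such curve per cluster and comparing them (for example with a log-rank test, as reported in the abstract) is the usual way to check whether clusters differ in recurrence-free survival.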
Model classification efficacy metrics: precision, recall, F1-score and AUC for test sets and the number of sites per model obtained using LogLoss-BERAF for different types of oncological diseases.
| Cancer type | Sites num. | Precision | Recall | F1 score | AUC |
|---|---|---|---|---|---|
| Prostate adenocarcinoma | 9 | 0.95 | 0.95 | 0.95 | 0.97 |
| | 3 | 1.0 | 1.0 | 1.0 | 1.0 |
| | 6 | 0.98 | 0.98 | 0.98 | 1.0 |
| | 5 | 0.98 | 0.98 | 0.98 | 1.0 |
| | 2 | 0.99 | 0.99 | 0.99 | 1.0 |
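As a reminder of how the metrics in this table are defined, here is a minimal sketch that computes precision, recall, F1 score and AUC for a binary tumor/non-tumor task. The labels and scores are made up, and the source does not state which averaging scheme was used for its reported values, so this shows only the plain binary case with tumor as the positive class. The function names are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 with label 1 (tumor) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(y_true, scores):
    """AUC as the fraction of (tumor, non-tumor) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels, hard predictions, and predicted tumor probabilities.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(round(precision, 3), round(recall, 3), round(f1, 3), round(auc, 3))
```

Note that AUC depends only on how the scores rank the samples, whereas precision, recall and F1 depend on the threshold used to turn scores into hard predictions, which is why a model can reach an AUC of 1.0 with the other metrics slightly below 1.0.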
Co-localized and diagnostically similar cancers: classification model sites with their positions and corresponding gene names and groups.
| IlmnID | Chr | Position | Gene name | Group |
|---|---|---|---|---|
| cg01588438 | chr8 | 67344553 | ADHFE1 | TSS200 |
| cg04456219 | chr7 | 17274337 | - | - |
| cg09287864 | chr7 | 17274056 | - | - |
| cg06830167 | chr1 | 7600135 | CAMTA1 | Body |
| cg10671066 | chr1 | 160492861 | SLAMF6 | Body |
| cg14357535 | chr2 | 25389040 | POMC | 5'UTR |
| cg03487935 | chr7 | 51925284 | - | - |
| cg17202717 | chr7 | 1708823 | - | - |
| cg01090433 | chr16 | 82673506 | CDH13 | Body |
| cg22274117 | chr6 | 16713613 | ATXN1 | 5'UTR |
| cg00347746 | chr19 | 48970082 | - | - |
| cg04951371 | chr2 | 3317860 | TSSC1 | Body |
| cg22274117 | chr6 | 16713613 | ATXN1 | 5'UTR |
| cg13458609 | chr9 | 130608923 | ENG | Body |
| cg02921122 | chr10 | 126712074 | CTBP2 | Body |
| cg02766539 | chr17 | 57861641 | TMEM49 | Body |
Fig 4. ROC curves for prostate adenocarcinoma (red), colon adenocarcinoma (yellow), urothelial bladder carcinoma (green), kidney renal papillary cell carcinoma (purple) and kidney renal clear cell carcinoma (blue) models.
The lower-right inset shows a close-up of the upper-left parts of the ROC curves.