Abstract
BACKGROUND: DNA methylation (DNAm) has important regulatory roles in many biological processes and diseases. It is the only epigenetic mark with a clear mechanism of mitotic inheritance and the only one easily available on a genome scale. Aberrant cytosine-phosphate-guanine (CpG) methylation has been discussed in the context of disease aetiology, especially cancer. CpG hypermethylation of promoter regions is often associated with silencing of tumour suppressor genes and hypomethylation with activation of oncogenes. Supervised principal component analysis (SPCA) is a popular machine learning method. However, in a recent application to phenotype prediction from DNAm data SPCA was inferior to the specific method EVORA.
Year: 2014 PMID: 24934728 PMCID: PMC4073816 DOI: 10.1186/1471-2105-15-193
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Datasets used
| GEO accession | Dataset | Cases | Controls |
| GSE30758 | Normal | 75 | 77 |
| GSE30758 | Normal HPV+ | 44 | 48 |
| GSE30758 | Normal HPV- | 31 | 29 |
| GSE20080 | CIN2+(a) | 18 | 30 |
| GSE37020 | CIN2+(b) | 24 | 24 |
| GSE30759 | Cancer | 48 | 15 |
The four columns give the GEO [37] accession number, the dataset name, and the numbers of case and control samples in each dataset.
Numbers of significant CpGs (q-value < 0.05) according to five different tests
| Test | CpGs | Normal | Normal HPV+ | Normal HPV- | CIN2+(a) | CIN2+(b) | Cancer |
| t | all | 0 | 0 | 0 | 389 | 233 | 14811 |
| | hyper | 0 | 0 | 0 | 452 | 100 | 10383(7) |
| | hypo | 0 | 0 | 0 | 8 | 140 | 4753(1) |
| MWU | all | 0 | 0 | 0 | 403 | 1008 | 16990 |
| | hyper | 0 | 0 | 0 | 408 | 372 | 11320(97) |
| | hypo | 0 | 0 | 0 | 10 | 646 | 5885(46) |
| Bartlett | all | 2830 | 1837 | 1204 | 3035 | 3444 | 12023 |
| | hyper | 1614 | 1154 | 748 | 2208 | 1948 | 12209 |
| | hypo | 1194 | 707 | 468 | 847 | 1489 | 414 |
| Levene | all | 0 | 0 | 0 | 241 | 2 | 5881 |
| | hyper | 0 | 1 | 0 | 326 | 4 | 7178 |
| | hypo | 0 | 0 | 0 | 0 | 0 | 80 |
| Age-corr. | all | 16 | 385 | 68 | 13 | 89 | 473 |
| | hyper | 16 | 330 | 75 | 19 | 52 | 525 |
| | hypo | 1 | 49 | 13 | 5 | 31 | 69 |
Tests: t-test, Mann–Whitney U test (MWU), Bartlett’s test, Levene’s test and test for methylation-age correlation (Age-corr.). The three rows per test correspond to all, hyper- and hypo-CpGs. Numbers in brackets give how many of the most significant CpGs completely separate cases from controls.
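The counts above depend on converting per-CpG p-values into q-values and thresholding at 0.05. The excerpt does not say which q-value estimator was used, so the following is only a minimal sketch that assumes a Benjamini-Hochberg adjustment as a stand-in. The function names `bh_qvalues` and `count_significant` and the example p-values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: BH is used here as a stand-in; the source does not state
    which q-value estimator produced the reported counts.
    """
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def count_significant(pvalues, threshold=0.05):
    """Number of tests (CpGs) whose adjusted value falls below the threshold."""
    return sum(q < threshold for q in bh_qvalues(pvalues))


# Hypothetical p-values for five CpGs (illustration only).
p = [0.001, 0.008, 0.039, 0.041, 0.6]
q = bh_qvalues(p)
n_sig = count_significant(p)
```

Because the adjustment depends on the number of CpGs tested, running it separately on the hyper- and hypo-CpG subsets can give counts that do not add up to the count for all CpGs.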
Numbers of joint CpGs amongst the 500 most significant ones
| t | 10 | 15 | 12 | 9 | 12 | 14 | 16 | ||||||
| 15 | 11 | 17 | 9 | 7 | 16 | ||||||||
| 11 | 6 | 9 | 7 | 15 | 17 | 10 | |||||||
| MWU | 17 | 10 | 16 | 8 | 10 | 7 | 9 | 15 | |||||
| 13 | 17 | 7 | 13 | 9 | 8 | 11 | |||||||
| 12 | 10 | 7 | 8 | 7 | 9 | 13 | 11 | 16 | |||||
| Bartlett | |||||||||||||
| 53 | |||||||||||||
| 12 | 16 | 15 | 10 | 15 | |||||||||
| Levene | 12 | 17 | 12 | 6 | 8 | 17 | |||||||
| 14 | 12 | 14 | 17 | ||||||||||
| 11 | 2 | 9 | 9 | 16 | 16 | 14 | 6 | 14 | 17 | 13 | 7 | ||
| Age-corr. | 15 | 10 | 16 | 9 | 13 | 11 | 7 | 9 | |||||
| 12 | 8 | 12 | 10 | 10 | 5 | 8 | 15 | ||||||
| 16 | 13 | 12 |
CpGs were ranked according to five different tests (t-test, Mann–Whitney U test, Bartlett’s test, Levene’s test and test for methylation-age-correlation), and the overlap between the 500 top-ranked CpGs of each pair of datasets was counted. The three rows per cell correspond to all, hyper- and hypo-CpGs. Significant overlaps (p < 0.01) are shown in bold.
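One common way to judge whether an overlap between two top-500 lists exceeds chance is a hypergeometric tail probability. The excerpt does not name the test behind the p < 0.01 criterion, so this sketch is an assumption, as is the universe size of 27,578 CpGs (the probe count of a 27k methylation array). The function name `overlap_pvalue` is hypothetical.

```python
from math import comb


def overlap_pvalue(overlap, size_a, size_b, universe):
    """P(X >= overlap) when size_b items are drawn from `universe` items,
    of which size_a are marked (hypergeometric upper tail)."""
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    # Exact integer arithmetic; Python converts the ratio to a float at the end.
    return tail / total


# Assumed universe size (not stated in the excerpt).
N_CPGS = 27578
expected_by_chance = 500 * 500 / N_CPGS  # roughly 9 shared CpGs

p_small = overlap_pvalue(9, 500, 500, N_CPGS)   # overlap near the chance level
p_large = overlap_pvalue(53, 500, 500, N_CPGS)  # overlap far above the chance level
```

Under these assumptions, overlaps in the range of about 10 to 17 are close to what two unrelated lists would share, whereas an overlap of 53 is far in the tail.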
Numbers of genes overlapping with 538 known cervical cancer genes
| t | 31 | |||||
| 29 | 25 | |||||
| MWU | 37 | 38 | 35 | 32 | ||
| 38 | 30 | 26 | ||||
| Bartlett | 38 | 36 | ||||
| 36 | ||||||
| Levene | 37 | |||||
| 35 | 38 | |||||
| 35 | 38 | |||||
| Age-corr. | 36 | 31 | ||||
| 32 | 34 | |||||
| 35 | 29 | 35 | 37 |
Gene lists were derived from the 1000 most significant CpGs of each of the five tests (t-test, Mann–Whitney U test, Bartlett’s test, Levene’s test and test for methylation-age-correlation); the mean length of the gene lists was 931. The three rows per cell correspond to all, hyper- and hypo-CpGs. Significant overlaps (p < 0.01) are shown in bold.
Numbers of genes overlapping with 1,591 developmental genes
| t | 77 | 93 | ||||
| 81 | 82 | |||||
| 61 | 62 | 84 | 75 | 46 | ||
| MWU | 71 | 93 | ||||
| 91 | 76 | |||||
| 59 | 66 | 72 | 84 | 47 | ||
| Bartlett | ||||||
| 96 | 89 | 84 | 45 | |||
| Levene | 79 | |||||
| 96 | ||||||
| 71 | 74 | 83 | 80 | 80 | 40 | |
| Age-corr. | ||||||
| 71 | 86 | 62 | 81 | 59 |
Gene lists were derived from the 1000 most significant CpGs of each of the five tests (t-test, Mann–Whitney U test, Bartlett’s test, Levene’s test and test for methylation-age-correlation); the mean length of the gene lists was 931. The three rows per cell correspond to all, hyper- and hypo-CpGs. Significant overlaps (p < 0.01) are shown in bold.
Prediction performance (AUC) of MS-SPCA
| Normal | | | | |||
| Normal HPV+ | | | ||||
| Normal HPV- | | | ||||
| CIN2+(a) | 0.53/ | | ||||
| CIN2+(b) | 0.98/0.85 |
Rows correspond to training data and columns to test data. The first number shows the performance of MS-SPCA, the second the performance of EVORA (mean value of 8 runs). Numbers in brackets show the five EVORA results as presented in [10]. Bold numbers show best predictions.
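For readers less familiar with the metric, the AUC reported in these tables can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control. The sketch below computes it directly from that definition. The function name `auc` and the example scores are hypothetical and not taken from the paper.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison.

    Each (case, control) pair counts 1 if the case scores higher,
    0.5 if the scores tie, and 0 otherwise.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical risk scores (illustration only).
example = auc([0.9, 0.4], [0.5, 0.3])
```

A value of 0.5 corresponds to chance-level ranking, and values well below 0.5 indicate that the scores rank controls above cases.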
Figure 1. Two parameters used for final model selection. Each dot corresponds to one model that performs well in cross-validation on the training data. Each row corresponds to a given training dataset (name on the left), each column to the corresponding test dataset (name in header). For instance, the panel in row 1 (Normal), column 4 (CIN2+(a)) shows the two parameters (x-axis Eval1, y-axis EV1dist) for all >300 models selected from the training dataset Normal (LOO prediction accuracy > 0.65) when applied to the test data CIN2+(a). For better visualization, the 10% of models that predict the test data best are shown in red, the next 10% (between deciles 1 and 2) in green and the next 10% (between deciles 2 and 3) in blue. Black dots represent the remaining 70%. Eval1 is the normalized largest eigenvalue of the covariance matrix taken from the methylation matrix of the test data. EV1dist is the Euclidean distance between the leading eigenvectors of the model’s covariance matrix in the training data and in the test data.
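The two parameters defined in the caption can be sketched in a few lines. Several choices here are assumptions that the excerpt does not confirm: the covariance is taken between CpGs (so that training and test eigenvectors have the same length), "normalized" means division by the trace of the covariance matrix, and eigenvector signs are aligned before the distance is taken. All function names are hypothetical, and a plain power iteration stands in for a full eigendecomposition.

```python
import math


def covariance(rows):
    """Sample covariance between columns; rows are samples, columns are CpGs."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def leading_eigenpair(matrix, iterations=1000):
    """Power iteration for a symmetric positive semi-definite matrix.

    Caveat: the uniform start vector fails if it happens to be orthogonal
    to the leading eigenvector.
    """
    p = len(matrix)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the unit vector gives the eigenvalue estimate.
    value = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)
    )
    return value, vec


def eval1(test_rows):
    """Largest eigenvalue divided by the trace (assumed normalization)."""
    cov = covariance(test_rows)
    value, _ = leading_eigenpair(cov)
    return value / sum(cov[i][i] for i in range(len(cov)))


def ev1dist(train_rows, test_rows):
    """Euclidean distance between sign-aligned leading eigenvectors."""
    _, v_train = leading_eigenpair(covariance(train_rows))
    _, v_test = leading_eigenpair(covariance(test_rows))
    # Eigenvectors are defined up to sign; align them first (assumption).
    if sum(a * b for a, b in zip(v_train, v_test)) < 0:
        v_test = [-x for x in v_test]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_train, v_test)))


# Hypothetical toy matrices: 3 samples (rows) by 2 CpGs (columns).
train = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
test = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
e1 = eval1(test)
d1 = ev1dist(train, test)
```

Under this reading, a large Eval1 means one direction dominates the variation in the test data, and a small EV1dist means that direction resembles the dominant direction in the training data.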
Figure 2. Performance of prediction (AUC). Each row corresponds to a given training dataset, each column to a test dataset and each dot to one model. Models are ordered according to Eval1-EV1dist; rank 1 corresponds to the model with the largest value. Eval1 is the normalized largest eigenvalue of the covariance matrix taken from the methylation matrix of the test data. EV1dist is the Euclidean distance between the leading eigenvectors of the model’s covariance matrix in the training data and in the test data. The red line shows the AUC resulting from cumulative risk scores (see Methods). The values of the red lines at model rank 5 are given in Table 6.
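The ordering and the cumulative risk scores described in the caption might be sketched as follows. The Methods section is not part of this excerpt, so two points are assumptions: "Eval1-EV1dist" is read as the difference of the two parameters, and the cumulative risk score at rank k is taken to be the per-sample sum of the risk scores of the k top-ranked models. The dictionary layout, the function names, and the numbers are hypothetical.

```python
def rank_models(models):
    """Sort models by Eval1 minus EV1dist, largest first (rank 1)."""
    return sorted(models, key=lambda m: m["eval1"] - m["ev1dist"], reverse=True)


def cumulative_risk(models, top_k):
    """Per-sample sum of risk scores over the top_k ranked models.

    Assumption: a plain sum; the paper's Methods may combine scores differently.
    """
    ranked = rank_models(models)[:top_k]
    n_samples = len(ranked[0]["risk"])
    return [sum(m["risk"][s] for m in ranked) for s in range(n_samples)]


# Hypothetical models with risk scores for three test samples.
models = [
    {"name": "A", "eval1": 0.40, "ev1dist": 0.30, "risk": [0.2, 0.9, 0.1]},
    {"name": "B", "eval1": 0.70, "ev1dist": 0.10, "risk": [0.8, 0.7, 0.2]},
    {"name": "C", "eval1": 0.55, "ev1dist": 0.20, "risk": [0.6, 0.4, 0.3]},
]
order = [m["name"] for m in rank_models(models)]
combined = cumulative_risk(models, top_k=2)
```

Since the AUC depends only on how samples are ranked, summing or averaging the scores of the top k models gives the same AUC.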
Figure 3. Description of models used for predictions (weights and number of CpGs). Each row corresponds to a given training dataset, each column to a test dataset. Models are ordered according to Eval1-EV1dist; rank 1 corresponds to the model with the largest value. Eval1 is the normalized largest eigenvalue of the covariance matrix taken from the methylation matrix of the test data. EV1dist is the Euclidean distance between the leading eigenvectors of the model’s covariance matrix in the training data and in the test data. The black line shows the mean number of CpGs used in the models up to the indicated rank, normalized by the maximum number of CpGs considered (1500). The other lines correspond to the mean weights (see Methods) used in the models up to the indicated rank. Blue lines correspond to average methylation difference (t- or MWU test), red to methylation variation difference (Bartlett’s or Levene’s test) and green to methylation-age-correlation. Solid lines indicate models taking into account both hyper- and hypomethylated CpGs. Dashed lines represent models using only hypermethylated CpGs and dotted lines models using only hypomethylated CpGs.
Prediction performance (AUC) of MS-SPCA, using Normal data for training
| N.HPV+ 1 | | |||||
| N.HPV+ 2 | | |||||
| N.HPV+ 3 | | |||||
| N.HPV+ 4 | 0.47/ | | 0.50/ | |||
| N.HPV+ 5 | | 0.51/0.51 | ||||
| | | | | | | |
| N.HPV-1 | | |||||
| N.HPV-2 | 0.14/ | | ||||
| N.HPV-3 | 0.42/ | | ||||
| N.HPV-4 | 0.36/ | | ||||
| N.HPV-5 | 0.39/ |
Rows correspond to training datasets and columns to test datasets. The first number shows the performance of MS-SPCA, the second the performance of EVORA (mean value of 8 runs). Bold numbers show best predictions.