Ioannis Valavanis1, Eleftherios Pilalis2, Panagiotis Georgiadis3, Soterios Kyrtopoulos4, Aristotelis Chatziioannou5.
Abstract
DNA methylation profiling exploits microarray technologies, thus yielding a wealth of high-volume data. Here, an intelligent framework is applied to epidemiological genome-scale DNA methylation data produced by Illumina's Infinium Human Methylation 450K Bead Chip platform, in an effort to correlate interesting methylation patterns with cancer predisposition, in particular breast cancer and B-cell lymphoma. Feature selection and classification are employed in order to select, from an initial set of ~480,000 methylation measurements at CpG sites, predictive cancer epigenetic biomarkers and to assess their classification power for discriminating healthy versus cancer-related classes. Feature selection exploits either evolutionary algorithms or a graph-theoretic methodology that makes use of the semantic information included in the Gene Ontology (GO) tree. The selected features, corresponding to methylation of CpG sites, attained moderate-to-high classification accuracies when imported to a series of classifiers evaluated by resampling or blindfold validation. The semantics-driven selection revealed sets of CpG sites that performed similarly to those from evolutionary selection in the classification tasks. However, gene enrichment and pathway analysis showed that it additionally provides more descriptive sets of GO terms and KEGG pathways regarding the cancer phenotypes studied here. Results support the expediency of this methodology regarding its application in epidemiological studies.
Keywords: B-cell lymphoma; DNA methylation; breast cancer; classification; epigenetic biomarker; evolutionary algorithm; gene ontology tree; graph-theory
Year: 2015 PMID: 27600245 PMCID: PMC4996413 DOI: 10.3390/microarrays4040647
Source DB: PubMed Journal: Microarrays (Basel) ISSN: 2076-3905
Distribution of samples in the cohorts used in the study.
| Cohort | Control | Cases | Total |
|---|---|---|---|
| Breast Cancer Cohort | 48 | 48 | 96 |
| B-cell Lymphoma Cohort | 83 | 82 | 165 |
| Both Cohorts | 131 | 130 | 261 |
Genomic distribution of CpGs classified in different groups: promoter, body, 3′UTR and intergenic [26].
| CpG Location | CpGs | Subgroup | CpGs |
|---|---|---|---|
| Promoter | 200,339 | TSS200 | 62,625 |
| | | TSS1500 | 77,379 |
| | | 5′UTR | 49,525 |
| | | 1stExon | 10,810 |
| Body | 150,212 | | |
| 3′UTR | 15,383 | | |
| Intergenic | 119,830 | | |
Pre-selection followed by evolutionary feature selection (up to 150 CpG sites) for the three-class problem (controls vs. BCCA vs. LYCA) (142 selected CpG sites). Results using the independent set.
| Independent Set | 1-nn | 6-nn | 12-nn | Tree | ANN |
|---|---|---|---|---|---|
| Total Accuracy | 47.69 | 47.69 | 45.38 | 32.31 | 60.00 |
| Controls Sensitivity | 54.32 | 55.56 | 54.32 | 40.74 | 69.14 |
| BCCA Sensitivity | 53.85 | 46.15 | 38.46 | 23.08 | 61.54 |
| LYCA Sensitivity | 30.56 | 30.56 | 27.78 | 16.67 | 38.89 |
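The relationship between the per-class sensitivities and the total accuracy in the table above can be checked with a short calculation. The sketch below is illustrative only: the class sizes (81 controls, 13 BCCA, 36 LYCA) and the hit counts are not stated in this section; they are inferred here from the reported 1-nn percentages and should be treated as assumptions. The function name `summarize` is hypothetical.

```python
def summarize(correct, totals):
    """Return (total accuracy %, {class: sensitivity %}) from per-class counts."""
    # Sensitivity of a class: correctly classified samples / all samples of that class.
    sensitivity = {c: 100.0 * correct[c] / totals[c] for c in totals}
    # Total accuracy: all correctly classified samples / all samples.
    accuracy = 100.0 * sum(correct.values()) / sum(totals.values())
    return accuracy, sensitivity


# ASSUMPTION: class sizes and hit counts inferred from the 1-nn column,
# not taken from the source.
totals = {"Controls": 81, "BCCA": 13, "LYCA": 36}
correct = {"Controls": 44, "BCCA": 7, "LYCA": 11}

accuracy, sensitivity = summarize(correct, totals)
print(round(accuracy, 2))                                  # 47.69
print({c: round(s, 2) for c, s in sensitivity.items()})    # 54.32, 53.85, 30.56
```

Under these assumed counts the output matches the 1-nn column, which illustrates why a small class (13 BCCA samples) moves in steps of roughly 7.7 percentage points per sample.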
Pre-selection followed by evolutionary selection (up to 150 CpG sites) for the two-class problem (controls vs. BCCA) (129 selected CpG sites). Results using the independent set.
| Independent Set | 1-nn | 6-nn | 12-nn | Tree | ANN |
|---|---|---|---|---|---|
| Total Accuracy | 64.1 | 71.79 | 69.23 | 53.85 | 76.92 |
| Controls Sensitivity | 65.38 | 76.92 | 69.23 | 38.46 | 76.92 |
| BCCA Sensitivity | 61.54 | 61.54 | 69.23 | 84.62 | 76.92 |
Pre-selection followed by evolutionary selection (up to 150 CpG sites) for the two-class problem (controls vs. LYCA) (143 selected CpG sites). Results using the independent set.
| Independent Set | 1-nn | 6-nn | 12-nn | Tree | ANN |
|---|---|---|---|---|---|
| Total Accuracy | 52.75 | 59.34 | 54.95 | 53.85 | 64.84 |
| Controls Sensitivity | 43.64 | 50.91 | 41.82 | 47.27 | 72.73 |
| LYCA Sensitivity | 66.67 | 72.22 | 75 | 63.89 | 52.78 |
Pre-selection followed by evolutionary selection (up to 150 CpG sites) for the two-class problem (BCCA vs. LYCA) (146 selected CpG sites). Results using the independent set.
| Independent Set | 1-nn | 6-nn | 12-nn | Tree | ANN |
|---|---|---|---|---|---|
| Total Accuracy | 97.96 | 93.88 | 93.88 | 93.88 | 97.96 |
| BCCA Sensitivity | 100 | 84.62 | 84.62 | 100 | 100 |
| LYCA Sensitivity | 97.22 | 97.22 | 97.22 | 91.67 | 97.22 |
Pre-selection followed by GORevenge for the three-class problem (controls vs. BCCA vs. LYCA) (352 selected CpG sites). Results using the independent set.
| Independent Set | 1-nn | 6-nn | 12-nn | Tree | ANN |
|---|---|---|---|---|---|
| Total Accuracy | 48.46 | 44.62 | 47.69 | 41.54 | 65.38 |
| Controls Sensitivity | 48.15 | 35.80 | 40.74 | 37.04 | 70.37 |
| BCCA Sensitivity | 46.15 | 38.46 | 38.46 | 38.46 | 69.23 |
| LYCA Sensitivity | 50.00 | 66.67 | 66.67 | 52.78 | 52.78 |
Pre-selection followed by GORevenge for the two-class problem (controls vs. BCCA) (183 selected CpG sites). Results using the independent set.
| Independent Set | 1-nn | 6-nn | 12-nn | Tree | ANN |
|---|---|---|---|---|---|
| Total Accuracy | 84.62 | 82.05 | 74.36 | 48.72 | 84.62 |
| Controls Sensitivity | 84.62 | 84.62 | 84.62 | 42.31 | 84.62 |
| BCCA Sensitivity | 84.62 | 76.92 | 53.85 | 61.54 | 84.62 |
Pre-selection followed by GORevenge for the two-class problem (controls vs. LYCA) (35 selected CpG sites). Results using the independent set.
| Independent Set | 1-nn | 6-nn | 12-nn | Tree | ANN |
|---|---|---|---|---|---|
| Total Accuracy | 53.85 | 54.95 | 54.95 | 53.85 | 54.95 |
| Controls Sensitivity | 41.82 | 40 | 34.55 | 52.73 | 61.82 |
| LYCA Sensitivity | 72.22 | 77.78 | 86.11 | 55.56 | 44.44 |
Pre-selection followed by GORevenge and evolutionary selection (up to 150 CpG sites) for the three-class problem (controls vs. BCCA vs. LYCA) (141 selected CpG sites). Results using the independent set.
| Independent Set | 1-nn | 6-nn | 12-nn | Tree | ANN |
|---|---|---|---|---|---|
| Total Accuracy | 48.46 | 56.15 | 55.38 | 44.62 | 56.15 |
| Controls Sensitivity | 55.56 | 60.49 | 54.32 | 48.15 | 67.90 |
| BCCA Sensitivity | 23.08 | 53.85 | 38.46 | 38.46 | 53.85 |
| LYCA Sensitivity | 41.67 | 47.22 | 63.89 | 38.89 | 30.56 |
Pre-selection followed by evolutionary selection (up to 400 CpG sites) for the three-class problem (controls vs. BCCA vs. LYCA) (373 selected CpG sites). Results using the independent set.
| Independent Set | 1-nn | 6-nn | 12-nn | Tree | ANN |
|---|---|---|---|---|---|
| Total Accuracy | 59.23 | 58.46 | 55.38 | 45.38 | 69.23 |
| Controls Sensitivity | 55.56 | 53.09 | 45.68 | 54.32 | 80.25 |
| BCCA Sensitivity | 61.54 | 46.15 | 46.15 | 61.54 | 84.62 |
| LYCA Sensitivity | 66.67 | 75.00 | 80.56 | 19.44 | 38.89 |
Figure 1. Total accuracy (%) of the embedded 12-nn classifier during GA evolution (three-fold cross-validation) for the three-class problem (controls vs. BCCA vs. LYCA), plotted against the number of generations completed.
Figure 2. Performance (%) of the ANN on the totally unknown (independent) testing set in the three-class problem (controls vs. BCCA vs. LYCA), fed with the CpG site subsets selected by the selection schemes applied.
Figure 3. Performance (%) of the ANN on the totally unknown (independent) testing set in the two-class problem (controls vs. BCCA), fed with the CpG site subsets selected by the selection schemes applied.
Figure 4. Performance (%) of the ANN on the totally unknown (independent) testing set in the two-class problem (controls vs. LYCA), fed with the CpG site subsets selected by the selection schemes applied.
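The Figure 1 caption refers to a genetic algorithm (GA) whose fitness is the cross-validated accuracy of an embedded k-nn classifier. A minimal sketch of that wrapper idea is given below, on toy data. It is not the authors' implementation: the GA operators (truncation selection, one-point crossover, bit-flip mutation, elitism), all parameter values, the use of k = 3 instead of 12, and the names `knn_predict`, `cv_accuracy`, and `evolve` are assumptions made for illustration.

```python
import random


def knn_predict(train, labels, x, k, feats):
    """Majority vote among the k nearest training rows, using only `feats`."""
    order = sorted(range(len(train)),
                   key=lambda i: sum((train[i][f] - x[f]) ** 2 for f in feats))
    votes = [labels[i] for i in order[:k]]
    return max(sorted(set(votes)), key=votes.count)


def cv_accuracy(X, y, mask, k=3, folds=3):
    """Fitness of a feature mask: k-nn accuracy under `folds`-fold cross-validation."""
    feats = [j for j, bit in enumerate(mask) if bit]
    if not feats:
        return 0.0  # an empty feature subset cannot classify anything
    hits = 0
    for f in range(folds):
        train_idx = [i for i in range(len(X)) if i % folds != f]
        test_idx = [i for i in range(len(X)) if i % folds == f]
        train = [X[i] for i in train_idx]
        labels = [y[i] for i in train_idx]
        hits += sum(knn_predict(train, labels, X[i], k, feats) == y[i]
                    for i in test_idx)
    return hits / len(X)


def evolve(X, y, generations=15, pop_size=12, p_mut=0.1, seed=0):
    """Evolve binary feature masks; return the best mask and best fitness per generation."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scored = sorted(pop, key=lambda m: cv_accuracy(X, y, m), reverse=True)
        history.append(cv_accuracy(X, y, scored[0]))
        parents = scored[: pop_size // 2]      # truncation selection
        children = [scored[0][:]]              # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_feat)     # one-point crossover
            child = [bit ^ (rng.random() < p_mut) for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = children
    best = max(pop, key=lambda m: cv_accuracy(X, y, m))
    return best, history


# Toy data (ASSUMPTION): features 0 and 1 carry class signal, the other six are noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(30)]
X = [[label + data_rng.gauss(0, 0.3) if j < 2 else data_rng.gauss(0, 1)
      for j in range(8)] for label in y]

best, history = evolve(X, y)
print(best, history[-1])
```

Because the best mask of each generation is copied unchanged into the next, the recorded best fitness cannot decrease from one generation to the next, which is the kind of curve an accuracy-versus-generations plot would show.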
Top enriched GO terms (biological process category) for genes corresponding to CpG sites selected by (a) GORevenge-based selection and (b) evolutionary selection.
(a) GORevenge-based selection
| Rank | GO ID | GO Description | p-Value | Enrichment |
|---|---|---|---|---|
| 1 | GO:0048709 | oligodendrocyte differentiation | 5.52 × 10^−13 | 9/21 |
| 2 | GO:0044281 | small molecule metabolic process | 1.50 × 10^−12 | 66/1530 |
| 3 | GO:0007411 | axon guidance | 1.92 × 10^−12 | 29/345 |
| 4 | GO:0046777 | protein amino acid autophosphorylation | 1.93 × 10^−12 | 21/171 |
| 5 | GO:0051216 | cartilage development | 3.58 × 10^−12 | 14/90 |
| 6 | GO:0030900 | forebrain development | 3.86 × 10^−12 | 15/87 |
| 7 | GO:0009790 | embryo development | 3.97 × 10^−12 | 24/154 |
| 8 | GO:0007420 | brain development | 4.82 × 10^−12 | 27/225 |
| 9 | GO:0001701 | in utero embryonic development | 5.75 × 10^−12 | 36/256 |
| 10 | GO:0048011 | nerve growth factor receptor signaling pathway | 5.89 × 10^−12 | 36/286 |
(b) Evolutionary selection
| Rank | GO ID | GO Description | p-Value | Enrichment |
|---|---|---|---|---|
| 1 | GO:0033603 | positive regulation of dopamine secretion | 3.44 × 10^−7 | 3/6 |
| 2 | GO:0016458 | gene silencing | 7.95 × 10^−7 | 3/7 |
| 3 | GO:0007411 | axon guidance | 7.50 × 10^−6 | 15/345 |
| 4 | GO:0035249 | synaptic transmission, glutamatergic | 8.03 × 10^−6 | 4/23 |
| 5 | GO:0045662 | negative regulation of myoblast differentiation | 1.07 × 10^−5 | 3/12 |
| 6 | GO:0030900 | forebrain development | 1.29 × 10^−5 | 7/87 |
| 7 | GO:0043523 | regulation of neuron apoptosis | 1.85 × 10^−5 | 4/27 |
| 8 | GO:0001501 | skeletal system development | 2.14 × 10^−5 | 9/152 |
| 9 | GO:0031069 | hair follicle morphogenesis | 2.23 × 10^−5 | 4/28 |
| 10 | GO:0048663 | neuron fate commitment | 2.66 × 10^−5 | 4/29 |
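Enrichment tables of this kind are often scored with a one-sided hypergeometric test. This section does not state which statistic was used, so the sketch below is an assumption about the general technique rather than a reproduction of the reported values. It reads a ratio such as 9/21 as "selected genes annotated to the term / all genes annotated to the term", which is also an assumption. The selected-list size and background size are not given here, so the example uses toy numbers, and the function name `hypergeom_enrichment_p` is hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a list of n,
    drawn without replacement from N background genes of which K are annotated."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy numbers (ASSUMPTION): 20 background genes, 5 annotated to the term,
# 4 genes selected, at least 2 of them annotated.
p = hypergeom_enrichment_p(k=2, n=4, K=5, N=20)
print(round(p, 4))  # 1205/4845, about 0.2487
```

With a real gene list, `n` would be the number of genes mapped from the selected CpG sites and `N` the number of annotated background genes. A larger selected list tends to yield smaller p-values for the same ratio, so p-values from lists of different sizes are not directly comparable.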
Genomic distribution of 352 CpGs derived from the GORevenge-based selection, classified in different groups: promoter, body, 3′UTR, and intergenic.
| CpG Location | CpGs | Subgroup | CpGs |
|---|---|---|---|
| Promoter | 186 | TSS200 | 41 |
| | | TSS1500 | 82 |
| | | 5′UTR | 36 |
| | | 1stExon | 27 |
| Body | 147 | - | - |
| 3′UTR | 19 | - | - |
| Intergenic | - | - | - |
Enriched KEGG pathways for genes corresponding to CpG sites selected by (a) GORevenge-based selection and (b) evolutionary selection.
(a) GORevenge-based selection
| Rank | Pathway ID | Pathway Description | p-Value | Enrichment |
|---|---|---|---|---|
| 1 | hsa05205 | Proteoglycans in cancer | 1.37 × 10^−12 | 24/222 |
| 2 | hsa05215 | Prostate cancer | 2.58 × 10^−12 | 17/88 |
| 3 | hsa04390 | Hippo signaling pathway | 5.86 × 10^−12 | 20/154 |
| 4 | hsa04910 | Insulin signaling pathway | 9.91 × 10^−12 | 20/136 |
| 5 | hsa05200 | Pathways in cancer | 2.39 × 10^−11 | 44/326 |
| 6 | hsa05166 | HTLV-I infection | 3.52 × 10^−11 | 25/263 |
| 7 | hsa04020 | Calcium signaling pathway | 4.52 × 10^−11 | 25/181 |
| 8 | hsa04916 | Melanogenesis | 5.73 × 10^−11 | 15/99 |
| 9 | hsa04010 | MAPK signaling pathway | 6.98 × 10^−11 | 27/257 |
| 10 | hsa04722 | Neurotrophin signaling pathway | 1.18 × 10^−10 | 16/118 |
| 11 | hsa04151 | PI3K-Akt signaling pathway | 1.22 × 10^−10 | 27/341 |
| 12 | hsa04310 | Wnt signaling pathway | 2.82 × 10^−10 | 17/143 |
| 13 | hsa05217 | Basal cell carcinoma | 3.93 × 10^−10 | 11/55 |
| 14 | hsa04510 | Focal adhesion | 3.97 × 10^−10 | 20/204 |
| 15 | hsa04350 | TGF-beta signaling pathway | 4.18 × 10^−10 | 13/81 |
| 16 | hsa05202 | Transcriptional misregulation in cancer | 4.78 × 10^−10 | 18/165 |
| 17 | hsa00053 | Ascorbate and aldarate metabolism | 7.52 × 10^−10 | 8/26 |
| 18 | hsa05030 | Cocaine addiction | 2.11 × 10^−9 | 10/50 |
| 19 | hsa00500 | Starch and sucrose metabolism | 3.30 × 10^−9 | 10/52 |
(b) Evolutionary selection
| Rank | Pathway ID | Pathway Description | p-Value | Enrichment |
|---|---|---|---|---|
| 1 | hsa04260 | Cardiac muscle contraction | 9.81 × 10^−5 | 6/76 |
| 2 | hsa04974 | Protein digestion and absorption | 2.29 × 10^−4 | 6/87 |
| 3 | hsa05410 | Hypertrophic cardiomyopathy (HCM) | 1.28 × 10^−3 | 5/85 |
| 4 | hsa05414 | Dilated cardiomyopathy | 1.72 × 10^−3 | 5/90 |
| 5 | hsa00061 | Fatty acid biosynthesis | 2.86 × 10^−3 | 1/6 |
| 6 | hsa05412 | Arrhythmogenic right ventricular cardiomyopathy (ARVC) | 3.70 × 10^−3 | 4/73 |
| 7 | hsa00460 | Cyanoamino acid metabolism | 3.97 × 10^−3 | 1/7 |
| 8 | hsa04961 | Endocrine and other factor-regulated calcium reabsorption | 4.98 × 10^−3 | 3/49 |
| 9 | hsa05030 | Cocaine addiction | 5.36 × 10^−3 | 3/50 |
| 10 | hsa04512 | ECM-receptor interaction | 7.41 × 10^−3 | 4/86 |
| 11 | hsa04730 | Long-term depression | 0.010 | 3/60 |