Brittany Baur1, Serdar Bozdag1.
Abstract
DNA methylation is an important epigenetic event that affects gene expression during development and in various diseases such as cancer. Understanding the mechanism of action of DNA methylation is important for downstream analysis. In the Illumina Infinium HumanMethylation 450K array, there are tens of probes associated with each gene. Given the methylation intensities of all these probes, it is necessary to compute which of these probes are most representative of the gene-centric methylation level. In this study, we developed a feature selection algorithm based on sequential forward selection that utilized different classification methods to compute gene-centric DNA methylation using probe-level DNA methylation data. We compared our algorithm to other feature selection algorithms such as support vector machines with recursive feature elimination, genetic algorithms and ReliefF. We evaluated all methods based on the predictive power of selected probes on their mRNA expression levels and found that a K-Nearest Neighbors classification using the sequential forward selection algorithm performed better than other algorithms based on all metrics. We also observed that transcriptional activities of certain genes were more sensitive to DNA methylation changes than transcriptional activities of other genes. Our algorithm was able to predict the expression of those genes with high accuracy using only DNA methylation data. Our results also showed that those DNA methylation-sensitive genes were enriched in Gene Ontology terms related to the regulation of various biological processes.
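The sequential forward selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary expression labels (for example, low versus high expression), scores each candidate probe set by leave-one-out accuracy of a 1-nearest-neighbour classifier, and stops when no probe gives a strict improvement. The function names and the toy data are hypothetical.

```python
import math

def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the given feature (probe) indices."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = math.inf, None
        for j in range(len(X)):
            if i == j:
                continue  # hold out sample i
            d = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if d < best_dist:
                best_dist, best_label = d, y[j]
        correct += best_label == y[i]
    return correct / len(X)

def sequential_forward_selection(X, y, max_features=None):
    """Greedily add the probe that most improves the score; stop when
    no remaining probe yields a strict improvement (assumed stopping rule)."""
    n_features = len(X[0])
    max_features = max_features or n_features
    selected, best_score = [], 0.0
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        scored = [(loo_1nn_accuracy(X, y, selected + [f]), f) for f in candidates]
        # Highest score wins; ties go to the lowest probe index.
        score, feature = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break
        selected.append(feature)
        best_score = score
    return selected, best_score

# Toy data (made up): rows are samples, columns are methylation values of
# three probes for one gene; only probe 1 separates the two classes.
X = [
    [0.5, 0.10, 0.9],
    [0.1, 0.15, 0.2],
    [0.9, 0.20, 0.5],
    [0.4, 0.80, 0.8],
    [0.2, 0.85, 0.3],
    [0.8, 0.90, 0.6],
]
y = [0, 0, 0, 1, 1, 1]  # assumed binary expression labels

selected, score = sequential_forward_selection(X, y)
print(selected, score)
```

On this toy input the search keeps only the informative probe, which mirrors the goal of picking the few probes per gene that best predict expression.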
Year: 2016 PMID: 26872146 PMCID: PMC4752315 DOI: 10.1371/journal.pone.0148977
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Mean performance of SFS algorithms and controls on the breast cancer cell line data.
| Metric | 1NN | 3NN | 5NN | NB | DT | SVM | 1NN Random | 1NN Top Two |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.79 | 0.80 | 0.74 | 0.78 | 0.77 | 0.74 | 0.66 | 0.67 |
| Precision | 0.70 | 0.74 | 0.65 | 0.75 | 0.68 | 0.64 | 0.53 | 0.54 |
| Recall | 0.68 | 0.65 | 0.53 | 0.59 | 0.65 | 0.63 | 0.53 | 0.53 |
| Specificity | 0.70 | 0.67 | 0.56 | 0.62 | 0.66 | 0.63 | 0.55 | 0.55 |
| MCC | 0.40 | 0.40 | 0.16 | 0.35 | 0.33 | 0.26 | 0.08 | 0.08 |
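The metrics in the table above are all derived from a binary confusion matrix. As a reference, here is a small sketch using the standard definitions; the function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # Matthews correlation coefficient: ranges from -1 to 1,
        # with 0 corresponding to chance-level prediction.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
print(metrics)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which helps explain why the random and top-two controls score near 0 on MCC while still reaching accuracies around 0.66.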
Fig 1. Violin plots of performance metrics for the algorithm when utilizing different classification methods in the SFS algorithm and controls on the breast cancer cell line data.
A) MCC, B) Precision, C) Recall, D) Specificity. Green squares specify the median and the red pluses specify the mean. NB: Naive Bayes, DT: Decision tree, SVM: Support Vector Machine
Fig 2. Violin plots of performance metrics for 1NN-SFS algorithm against other algorithms on the breast cancer cell line data.
A) MCC, B) Precision, C) Recall, D) Specificity. Random: KNN random, Top 2: KNN top two (see Methods). GAK: GA-KNN algorithm with varying K-nearest neighbors. RFK: Relief-F algorithm with varying K nearest neighbors.
Fig 3. Violin plots of MCC for 1NN-SFS algorithm against other probe selection methods on the breast cancer cell line data.
A) All, B) Upstream CpG Island, C) TSS, D) Top SD.
Fig 4. Number of probes selected per gene by 1NN-SFS algorithm on the breast cancer cell line data.
Fig 5. Heatmap clustering of MCC values.
Heatmap clustering of MCC values for five executions of the algorithm on random halves of the breast cancer cell line data for A) the 1NN algorithm, B) random selection of probes, and C) the top two correlated approach.
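The repeated runs on random halves of the data could be set up along the following lines. This is only a sketch: the function name, the seed, and the sample count are hypothetical, and it assumes each half is drawn independently without replacement.

```python
import random

def random_halves(n_samples, n_runs=5, seed=0):
    """Draw n_runs independent random halves of the sample indices."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    return [
        sorted(rng.sample(range(n_samples), n_samples // 2))
        for _ in range(n_runs)
    ]

# Hypothetical example: 10 cell lines, five random halves.
halves = random_halves(n_samples=10)
for run, half in enumerate(halves, start=1):
    print(run, half)
```

Running the selection algorithm on each half and comparing per-gene MCC values across runs would then indicate how stable the results are under resampling.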
Top 30 GO terms for genes with MCC > 0.6 by 1NN-SFS algorithm on the breast cancer cell line data.
| Description | FDR q-value |
|---|---|
| regulation of multicellular organismal process | 4.43E-19 |
| regulation of developmental process | 2.51E-17 |
| regulation of multicellular organismal development | 9.31E-17 |
| positive regulation of biological process | 1.16E-16 |
| movement of cell or subcellular component | 1.23E-16 |
| positive regulation of cellular process | 1.41E-16 |
| negative regulation of biological process | 2.3E-16 |
| anatomical structure development | 1.38E-15 |
| negative regulation of cellular process | 2.72E-15 |
| regulation of cell differentiation | 2.85E-15 |
| cell migration | 6.81E-15 |
| negative regulation of metabolic process | 2.48E-14 |
| anatomical structure morphogenesis | 3.53E-14 |
| organ development | 5.3E-14 |
| transmembrane receptor protein tyrosine kinase signaling pathway | 6.02E-14 |
| cell motility | 7.21E-14 |
| locomotion | 1.7E-13 |
| developmental process | 1.71E-13 |
| enzyme linked receptor protein signaling pathway | 1.75E-13 |
| single-organism developmental process | 1.76E-13 |
| regulation of cell development | 2.88E-13 |
| regulation of anatomical structure morphogenesis | 4.5E-13 |
| negative regulation of macromolecule metabolic process | 6.04E-13 |
| intracellular signal transduction | 8.58E-13 |
| single-multicellular organism process | 2.36E-12 |
| multicellular organismal process | 5.86E-12 |
| regulation of localization | 1.06E-11 |
| positive regulation of multicellular organismal process | 1.07E-11 |
| signal transduction | 1.27E-11 |
| cellular component organization or biogenesis | 1.39E-11 |
| positive regulation of developmental process | 3.2E-11 |
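The FDR q-values reported in the table above are p-values adjusted for multiple testing. The record does not state which procedure produced them, so the sketch below assumes the widely used Benjamini-Hochberg adjustment; the function name and the input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Made-up p-values for four GO terms.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(q)
```

A term is then called enriched when its q-value falls below a chosen threshold, which bounds the expected fraction of false discoveries among the reported terms.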
GO terms for genes with MCC < 0.2 by 1NN-SFS algorithm on the breast cancer cell line data.
| Description | FDR q-value |
|---|---|
| detection of chemical stimulus involved in sensory perception of smell | 5.62E-11 |
| detection of chemical stimulus involved in sensory perception | 5.74E-11 |
| detection of chemical stimulus | 5.62E-8 |
| detection of stimulus involved in sensory perception | 1.07E-7 |
| detection of stimulus | 1.95E-3 |
| immune response | 1.23E-2 |
Functional clusters of genes with MCC > 0.6 with upstream probes selected by 1NN-SFS algorithm on the breast cancer cell line data.
| Cluster Number | Number of genes | Enrichment | Most significant terms (p-val) | Other representative terms (p-val) and notes |
|---|---|---|---|---|
| 1 | 40 | 4.39 | Atp-binding (4.4E-45), Nucleotide-binding (4.6E-38), adenyl ribonucleotide binding (1.7E-37) | Helicase (4E-12), kinase (5.8E-6), protein kinase activity (3.7E-4) |
| 2 | 4 | 3.67 | Repeat:ANK 1 (1.7E-6), Repeat:ANK 2 (1.8E-6), Ankyrin (2.9E-6) | Genes coding for ankyrin proteins |
| 3 | 45 | 3.46 | Kinase (1.8E-56), Protein Kinase–ATP binding site (2.0E-56), domain: protein kinase (2.1E-53) | Phosphorylation (1.7E-51), transferase (1.1E-47), nucleotide binding (2.1E-34) |
| 4 | 13 | 3.42 | Microtubule cytoskeleton (9.6E-15), cytoskeleton (9.1E-14), cytoskeletal part (4.1E-12) | Centrosome (2.3E-8), genes involved in regulation of cell motility |
| 7 | 4 | 2.91 | binding site:S-adenosyl-L-methionine (1.8E-8), s-adenosyl-l-methionine (1.5E-7), methyltransferase (4.3E-7) | Genes coding for methyltransferases |
| 8 | 5 | 2.83 | Microfilament motor activity (22.0E-12), actin filament-based movement (6.3E-12), domain:Myosin head-like (9.4E-12) | Genes coding for myosin proteins |
| 9 | 6 | 2.66 | Anti-apoptosis (7.8E-12), negative regulation of apoptosis (1.2E-8), negative regulation of programmed cell death (1.3E-8) | Genes predominately related to BCL2 (BAG3, BAG4, BCL2A1, BL210). Also includes MCL1 and TNFRSF10D |
| 10 | 16 | 2.54 | Nucleotide phosphate-binding region:GTP (4.7E-28), gtp-binding (2.3E-27), Ras (2.7E-16) | Genes predominately related to the RAS oncogene family |
| 13 | 59 | 2.29 | Transcription regulator activity (2.7E-50), transcription regulation (2.2E-47), regulation of transcription, DNA dependent (2.2E-47) | Sequence specific DNA-binding (3.1E-29), repressor (6.0E-22) |
| 19 | 10 | 1.72 | Ribosomal protein (6.7E-19), structural constituent of ribosome (8.2E-18), cytosolic ribosome (1.6E-17) | Genes coding for ribosomal proteins |
Functional clusters of genes with MCC > 0.6 with gene body probes selected by 1NN-SFS algorithm on the breast cancer cell line data.
| Cluster Number | Number of genes | Enrichment | Most significant terms (p-val) | Other representative terms (p-val) and notes |
|---|---|---|---|---|
| 1 | 48 | 2.84 | Atp-binding (1.1E-51), Nucleotide-binding (6.5E-47), adenyl ribonucleotide binding (4.2E-45) | phosphorylation (4.8E-33), kinase (7.6E-40), transferase (1.9E-29) |
| 2 | 12 | 2.36 | Nucleolus (1.2E-14), nuclear lumen (3.9E-11), intracellular organelle lumen (3.7E-10) | |
| 3 | 11 | 2.06 | Transcription regulation (1.6E-10), transcription (2.1E-10), regulation of transcription (6.8E-8) | |
| 4 | 9 | 1.83 | Ribosomal protein (7.2E-17), ribonucleoprotein (1.8E-15), ribosome (5.6E-15) | RNA binding (2.8E-4) |
| 8 | 6 | 1.39 | Negative regulation of ubiquitin-protein ligase activity during mitotic cell cycles (2.2E-12), negative regulation of ubiquitin-ligase activity (2.6E-12) | Genes coding for proteasomes and ubiquitin |
| 9 | 66 | 1.38 | Regulation of transcription (1.1E-34), transcription (2.4E-24), transcription regulation (5.0E-32) | |
Fig 6. Performance metrics of 1NN-SFS algorithm on TCGA data.
Top 30 GO terms for genes with MCC > 0.6 by 1NN-SFS algorithm on TCGA data.
| Description | FDR q-value |
|---|---|
| positive regulation of cellular process | 3.75E-8 |
| positive regulation of biological process | 2E-7 |
| RNA metabolic process | 3.6E-7 |
| regulation of metabolic process | 7.55E-7 |
| regulation of transcription from RNA polymerase II promoter | 8.6E-7 |
| cellular macromolecule metabolic process | 9.69E-7 |
| regulation of gene expression | 1.16E-6 |
| regulation of macromolecule metabolic process | 1.19E-6 |
| regulation of macromolecule biosynthetic process | 1.35E-6 |
| regulation of cellular macromolecule biosynthetic process | 1.36E-6 |
| RNA biosynthetic process | 1.45E-6 |
| regulation of primary metabolic process | 1.54E-6 |
| regulation of biosynthetic process | 1.56E-6 |
| macromolecule metabolic process | 2.36E-6 |
| aromatic compound biosynthetic process | 2.48E-6 |
| regulation of cellular biosynthetic process | 2.52E-6 |
| positive regulation of RNA biosynthetic process | 3.02E-6 |
| regulation of RNA biosynthetic process | 3.12E-6 |
| nucleobase-containing compound biosynthetic process | 3.4E-6 |
| nucleic acid metabolic process | 3.44E-6 |
| regulation of cellular metabolic process | 3.45E-6 |
| regulation of transcription, DNA-templated | 3.63E-6 |
| cellular process | 3.73E-6 |
| heterocycle biosynthetic process | 3.93E-6 |
| cellular nitrogen compound biosynthetic process | 4.29E-6 |
| positive regulation of macromolecule biosynthetic process | 4.35E-6 |
| regulation of nucleic acid-templated transcription | 5.11E-6 |
| nucleobase-containing compound metabolic process | 6.76E-6 |
| regulation of nucleobase-containing compound metabolic process | 1.04E-5 |
| positive regulation of RNA metabolic process | 1.07E-5 |
GO terms for genes with MCC < 0.2 by 1NN-SFS algorithm on TCGA data.
| Description | FDR q-value |
|---|---|
| detection of chemical stimulus involved in sensory perception | 1.27E-42 |
| detection of chemical stimulus | 6.29E-41 |
| detection of chemical stimulus involved in sensory perception of smell | 8.16E-41 |
| detection of stimulus involved in sensory perception | 3.18E-38 |
| detection of stimulus | 7.93E-31 |
| G-protein coupled receptor signaling pathway | 1.44E-21 |
| sensory perception of smell | 1.16E-19 |
| sensory perception of chemical stimulus | 4.86E-14 |
| cell surface receptor signaling pathway | 7.68E-7 |
| sensory perception | 7.02E-6 |
| response to stimulus | 5.47E-5 |
| drug metabolic process | 4.77E-3 |
| signal transduction | 1.12E-2 |
Functional clusters of genes with MCC > 0.6 with upstream probes selected by 1NN-SFS algorithm in TCGA data.
| Cluster number | Number of genes | Enrichment | Top terms (pval) | Other representative terms and notes |
|---|---|---|---|---|
| 1 | 5 | 4.73 | Nucleolus (8.8E-6), nuclear lumen (1.6E-4), intracellular organelle lumen (3.7E-4) | Transcription, DNA-dependent (4.3E-2) |
| 2 | 24 | 4.08 | RNA splicing (1.0E-29), RNA processing (8.0E-29), mRNA processing (1.1E-28) | Spliceosome (6.8E-23), rna-binding (2.3E-10) |
| 3 | 13 | 2.48 | Cytoskeleton (1.5E-18), cytoplasm (7.2E-10), microtubule cytoskeleton (4.7E-9) | |
| 4 | 11 | 2.25 | Ribosomal protein (6.3E-21), ribonucleoprotein (3.5E-19), ribosome (1.5E-18) | Group of genes coding for mitochondrial ribosomal proteins |
| 5 | 134 | 2.2 | Transcription regulation (1.9E-45), zinc (4.1E-45), transcription (1.3E-43) | Transcription regulation |
| 6 | 13 | 2.03 | Ubl conjugation pathway (1E-19), modification-dependent protein catabolic process (3E-17), modification-dependent macromolecule catabolic process (3E-17) | Ubiquitin proteins, proteolysis (4.7E-14) |
Functional clusters of genes with MCC > 0.6 with gene body probes selected by 1NN-SFS algorithm in TCGA data.
| Cluster Number | Number of genes | Enrichment | Most significant terms (p-val) | Other representative terms (p-val) and notes |
|---|---|---|---|---|
| 1 | 4 | 3.3 | GTPase activation (5.5E-7), domain:PH (1.9E-6), Pleckstrin homology (4.5E-6) | Rho GTPases |
| 2 | 5 | 2.5 | Atp-binding (2.2E-5), nucleotide-binding (5.9E-5), adenyl ribonucleotide binding (1.8E-4) | |
| 3 | 17 | 2.14 | Protein kinase–core (8.7E-23), kinase (2.7E-21), protein kinase–atp binding site (1.2E-20) | Phosphorylation (1.9E-20), nucleotide-binding (1.9E-15), transferase (7.3E-16) |
| 4 | 5 | 1.86 | Zinc (1.7E-4), metal-binding (5.7E-4), zinc ion binding (1E-3) | |