
MethylMix 2.0: an R package for identifying DNA methylation genes.

Pierre-Louis Cedoz¹, Marcos Prunello², Kevin Brennan¹, Olivier Gevaert¹.

Abstract

Summary: DNA methylation is an important mechanism regulating gene transcription, and its role in carcinogenesis has been extensively studied. Hyper- and hypomethylation of genes is a major mechanism of gene expression deregulation in a wide range of diseases. At the same time, high-throughput DNA methylation assays have been developed, generating vast amounts of genome-wide DNA methylation measurements. We developed MethylMix, an algorithm implemented in R to identify disease-specific hyper- and hypomethylated genes. Here we present a new version of MethylMix that automates the construction of DNA methylation and gene expression datasets from The Cancer Genome Atlas (TCGA). More precisely, MethylMix 2.0 incorporates two major updates: the automated downloading of DNA methylation and gene expression datasets from TCGA, and the automated preprocessing of such datasets (missing value imputation, batch correction and clustering of CpG sites within each gene). The resulting datasets can subsequently be analyzed with MethylMix to identify transcriptionally predictive methylation states. We show that the Differential Methylation values created by MethylMix can be used for cancer subtyping.

Availability and implementation: MethylMix 2.0 was implemented as an R package and is available in Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/MethylMix.html

Year: 2018 PMID: 29668835 PMCID: PMC6129298 DOI: 10.1093/bioinformatics/bty156

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

DNA methylation is the best studied epigenetic aberration underlying oncogenesis. Besides genetic mutations, hypermethylation and hypomethylation of genes (increased and decreased methylation in a disease relative to a normal state) are an alternative mechanism capable of altering the normal transcriptional state and driving a wide range of diseases. Prior studies have focused on the analysis of DNA methylation data, for example in breast cancer, or on identifying differentially methylated regions for specific DNA methylation platforms (Wang et al., 2012; Warden et al., 2013). A recent pancancer study of DNA methylation (Gevaert et al., 2015) revealed 10 pancancer clusters reflecting new similarities across malignantly transformed tissues. Furthermore, several computational tools have been developed that incorporate state-of-the-art statistical techniques for the analysis of DNA methylation data (Aryee et al., 2014). In Gevaert (2015), we introduced MethylMix, an algorithm that integrates DNA methylation data from normal and disease samples with matched gene expression data to identify likely DNA methylation driven genes in diseases. The main output of MethylMix is a novel metric called the ‘Differential Methylation value’ or ‘DM-value’, defined as the difference between an abnormal methylation state (hypermethylated or hypomethylated) and the normal methylation state. These methylation states are computed using a beta mixture model on the beta-values. Here, we present MethylMix 2.0, an updated version of MethylMix that features functions for downloading and preprocessing DNA methylation and gene expression datasets from all cancer sites in The Cancer Genome Atlas (TCGA). We also demonstrate an application of MethylMix to cancer subtyping and show how to use the DM-values for clustering with the R package ConsensusClusterPlus (Wilkerson and Hayes, 2010).

2 Algorithm

MethylMix 2.0 identifies DNA methylation driven genes by modeling DNA methylation data in cancer versus normal and looking for homogeneous subpopulations. In addition, matched gene expression data can be used to identify transcriptionally predictive DNA methylation events by requiring a negative correlation between the methylation and the expression of a particular gene. Therefore, MethylMix 2.0 requires DNA methylation data from normal and disease samples and matched disease gene expression data. In MethylMix 2.0, we have automated the construction of the methylation and gene expression datasets from TCGA via a three-step algorithm:

Step 1: Automated downloading from TCGA. DNA methylation and gene expression datasets are downloaded automatically from TCGA by supplying the TCGA cancer code. We have provided the functionality to study any of the 33 TCGA cancer sites that are currently available.

Step 2: Automated preprocessing. The preprocessing steps include eliminating samples and genes with too many missing values, imputing the remaining missing values and correcting for batch effects across technical batches.

Step 3: Clustering of the CpG probes. The methylation data produced by the Illumina 450k methylation array consist of multiple CpG probes for each gene. Since the probes are highly correlated, we cluster them prior to learning a mixture model. We apply complete linkage hierarchical clustering to all probes of a single gene to group them into CpG clusters, and then cut the hierarchical tree at a Pearson correlation threshold of 0.7.

The outputs of these functions are numeric matrices (METcancer, METnormal and GEcancer) with genes in rows and samples in columns, to be used as inputs to MethylMix for further analysis.
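The probe clustering of Step 3 can be sketched in base R. This is a minimal illustration on simulated data, not the package's implementation: the helper name `cluster_gene_probes`, the probe names and the simulated beta-values are all hypothetical, and using 1 minus the Pearson correlation as the dissimilarity is an assumption about how the correlation threshold maps onto the tree height.

```r
# Hypothetical helper: cluster the CpG probes of a single gene.
# `probe_betas` is a numeric matrix with probes in rows and samples in columns.
cluster_gene_probes <- function(probe_betas, cor_threshold = 0.7) {
  if (nrow(probe_betas) < 2) {
    return(setNames(rep(1L, nrow(probe_betas)), rownames(probe_betas)))
  }
  # Pearson correlation between probes (cor() correlates columns, hence t()).
  probe_cor <- cor(t(probe_betas), method = "pearson")
  # Assumed dissimilarity: highly correlated probes are close to each other.
  probe_dist <- as.dist(1 - probe_cor)
  tree <- hclust(probe_dist, method = "complete")
  # Cut at height 1 - 0.7 = 0.3: with complete linkage, every pair of probes
  # inside a cluster then has a correlation of at least 0.7.
  cutree(tree, h = 1 - cor_threshold)
}

# Simulated example: two groups of two probes, each group sharing one signal.
set.seed(1)
n_samples <- 50
clamp01 <- function(x) pmin(pmax(x, 0), 1)  # keep values in the beta range
shared_a <- runif(n_samples)
shared_b <- runif(n_samples)
probe_betas <- rbind(
  cg_a1 = clamp01(shared_a + rnorm(n_samples, sd = 0.02)),
  cg_a2 = clamp01(shared_a + rnorm(n_samples, sd = 0.02)),
  cg_b1 = clamp01(shared_b + rnorm(n_samples, sd = 0.02)),
  cg_b2 = clamp01(shared_b + rnorm(n_samples, sd = 0.02))
)
probe_clusters <- cluster_gene_probes(probe_betas)
print(probe_clusters)
```

In the package itself, this step is performed by ClusterProbes on the preprocessed TCGA matrices.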

3 Functions and examples

MethylMix 2.0 was implemented in the statistical language R and is provided as an R package available on Bioconductor. MethylMix 2.0 was designed to identify transcriptionally predictive DNA methylation events using a beta mixture modeling approach (Gevaert, 2015). MethylMix 2.0 requires three datasets as inputs: cancer DNA methylation data (METcancer), normal DNA methylation data (METnormal) and matched disease gene expression data (GEcancer). These datasets can be downloaded as follows using the appropriate TCGA cancer code (for example, OV for ovarian cancer):

> library(MethylMix)
> cancerSite <- "OV"
> targetDirectory <- paste0(getwd(), "/")
> METdirectories <- Download_DNAmethylation(cancerSite, targetDirectory, TRUE)
> GEdirectories <- Download_GeneExpression(cancerSite, targetDirectory, TRUE)

For DNA methylation and gene expression, we used the Broad Institute Firehose tool (Firehose, 2016), which includes several preprocessing steps such as removing problematic rows, removing redundant columns, reordering the columns and sorting the data by gene name. MethylMix’s contribution to the preprocessing consists of eliminating samples and genes with too many missing values, imputing the remaining missing values and performing batch correction across technical batches within each cancer type. We used an adjustable missing value threshold for removing samples or genes with too many missing values. The default threshold is a conservative one: genes with more than 20% missing data are removed, and a K-Nearest Neighbors approach with K = 15 is applied to estimate the remaining missing values, as proposed in Troyanskaya et al. Since TCGA data were generated in sample batches, we implemented batch correction to remove any systematic differences between technical batches. To this end, we used the ComBat algorithm introduced by Johnson et al. (2006), which removes known batch effects by implementing empirical Bayes methods for adjusting for additive, multiplicative and exponential batch effects.
These adjustment methods are robust to small sample sizes. Since DNA methylation data generally do not follow a normal distribution, we used the nonparametric version of ComBat to correct the DNA methylation data.

> METProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)
> GEProcessedData <- Preprocess_GeneExpression(cancerSite, GEdirectories)

The last step in preprocessing the methylation data is to assign each probe to a gene based on its closest transcription start site. Then, for each gene, we cluster all its CpG sites using hierarchical clustering with complete linkage and the Pearson correlation as distance metric. If data for normal samples are provided, some probes might be removed in the normal samples or in the cancer samples due to a high number of missing values. In this case, only probes present in both cancer and normal samples are retained, because MethylMix analyzes differential methylation between the two.

> res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])

Probes with SNPs are removed since SNPs within the probe binding sequence prevent methylation of that CpG site. Additionally, MethylMix 2.0 provides a function that wraps the functions for downloading and preprocessing DNA methylation and gene expression data, as well as for clustering CpG probes.

> cancerSite <- "OV"
> targetDirectory <- paste0(getwd(), "/")
> GetData(cancerSite, targetDirectory)

All of these functions run in parallel if the user registers a parallel backend; otherwise they run sequentially. Finally, MethylMix can also be used with data that are not from TCGA. To use custom data, the user has to provide DNA methylation beta-values of a cancer cohort and, optionally, normal DNA methylation data and matched gene expression data, each as a data.matrix object in R with rows corresponding to genes and columns to samples.
In addition, the user can provide batch information when multiple technical batches were used to generate the data. MethylMix can be applied to all Illumina DNA methylation arrays, including the newly released EPIC platform, and to any microarray that outputs beta-values. Similarly, sequencing-based methylation data can be modeled if the data are formatted as proportions; however, because mixture modeling is computationally demanding, MethylMix will require proportionally more time to finish as the number of CpG sites grows.
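For custom data, the input format and the missing value filter described in this section can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code: the helper name `filter_missing` and the simulated matrix are hypothetical, applying the same 20% threshold to samples as to genes is an assumption, and so is filtering samples before genes. The K-Nearest Neighbors imputation and batch correction that follow are not shown.

```r
# Hypothetical helper: basic checks and missing value filtering for a custom
# methylation matrix (genes or probes in rows, samples in columns).
filter_missing <- function(betas, max_missing = 0.2) {
  stopifnot(is.matrix(betas), is.numeric(betas))
  # Beta-values are proportions, so observed entries must lie in [0, 1].
  stopifnot(all(betas >= 0 & betas <= 1, na.rm = TRUE))
  # Assumed order: drop samples (columns) first, then genes (rows), whenever
  # their fraction of missing values exceeds the threshold.
  keep_samples <- colMeans(is.na(betas)) <= max_missing
  betas <- betas[, keep_samples, drop = FALSE]
  keep_rows <- rowMeans(is.na(betas)) <= max_missing
  betas[keep_rows, , drop = FALSE]
}

# Simulated 10 x 10 matrix of beta-values with some missing entries.
set.seed(2)
betas <- matrix(runif(100), nrow = 10,
                dimnames = list(paste0("gene", 1:10), paste0("sample", 1:10)))
betas["gene3", 1:4] <- NA       # 40% missing: this gene is dropped
betas["gene1", 5] <- NA         # 10% missing: this gene is kept
betas[6:10, "sample10"] <- NA   # 50% missing: this sample is dropped
filtered <- filter_missing(betas)
dim(filtered)
```

The entries that remain missing after this filter (here the single NA in gene1) are the ones that would be passed on to imputation.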

4 Applications

The main output of MethylMix is the ‘DM-values’, which reflect the homogeneous subpopulations of samples with a particular methylation state. One application of the DM-values is to identify DNA methylation subtypes. For instance, in lung squamous cell carcinoma (Fig. 1), a DNA hypomethylated subtype featuring genetic inactivation of NSD1 was identified (Brennan et al., 2017; Campbell et al., 2018). DNA methylation subtypes were discovered by applying consensus clustering (a widely used algorithm for clustering patients based on molecular data) to the ‘DM-values’ matrix output of MethylMix. Patients are thereby clustered into robust and homogeneous groups (putative subtypes) based on their abnormal methylation profiles. Consensus clustering was performed using the ConsensusClusterPlus R package (version 1.36.0) (Wilkerson and Hayes, 2010), with 1000 rounds of k-means clustering and a maximum of k = 10 clusters. The best number of clusters was selected by visual inspection of the ConsensusClusterPlus output plots.
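To make the notion of a DM-value concrete, the following base R sketch computes one for a single simulated gene. It is a simplified illustration, not the MethylMix algorithm: k-means with two centers is used here as a stand-in for the beta mixture model, taking the group mean as the methylation state is an assumption, and all variable names and simulated values are hypothetical.

```r
set.seed(3)
# Simulated beta-values for one gene (all values are made up for illustration).
normal_betas <- rbeta(30, 20, 80)            # normal tissue, mean near 0.2
cancer_betas <- c(rbeta(60, 20, 80),         # 60 tumors that look normal
                  rbeta(40, 70, 30))         # 40 hypermethylated tumors

# Stand-in for the beta mixture model: split the cancer samples into two
# groups with k-means. MethylMix itself fits a beta mixture model instead.
groups <- kmeans(cancer_betas, centers = 2, nstart = 10)

# Assumed definition: the methylation state of a sample is the mean of its
# group, and its DM-value is that state minus the mean of the normal samples.
state_means <- groups$centers[groups$cluster, 1]
dm_values <- unname(state_means - mean(normal_betas))

# One DM-value per group: close to zero for the normal-like tumors and
# clearly positive for the hypermethylated ones.
round(tapply(dm_values, groups$cluster, unique), 2)
```

Repeating this over all genes would give a genes-by-patients matrix of the kind that is passed to consensus clustering.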

Fig. 1.

Subtyping of lung squamous cell carcinoma patients based on MethylMix 2.0 analysis. This heatmap illustrates ‘DM-values’ for 638 MethylMix genes (rows) in 503 LUSC patient primary tumors (columns). Patients are ordered by DNA methylation subtype, indicated in the horizontal sidebar. DNA methylation subtypes represent patient groups with distinct DNA methylation profiles that are homogeneous within subtypes. The NSD1-inactivated subtype is highlighted in red, with other (NSD1-proficient) subtypes indicated in grey. Horizontal sidebars indicate the category of each patient with regard to key etiological variables, including NSD1 mutations (white space reflects patients without NSD1 mutation data), smoking status, sex and pathological stage. MethylMix genes (rows) are ordered by hierarchical clustering.

Exploration of somatic mutation and copy number alteration data revealed that one patient cluster (indicated by the red sidebar) represents a DNA hypomethylated subtype that is enriched for inactivating genetic mutations and deletions in NSD1, which encodes a histone lysine methyltransferase. Indeed, six of ten patients with NSD1 mutations in LUSC were within the NSD1 subtype (Chi-squared P = 0.005). This mirrors the phenotype of a similar hypomethylated, NSD1-inactivated subtype that was recently described in head and neck squamous cell carcinoma (Brennan et al., 2017).

> # biocLite was the Bioconductor installer at the time of writing; current releases use BiocManager::install()
> source("https://bioconductor.org/biocLite.R")
> biocLite("ConsensusClusterPlus")
> library("ConsensusClusterPlus")
> MethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)
> DMvalues <- MethylMixResults$MethylationStates
> cons_cluster <- ConsensusClusterPlus(d = DMvalues, maxK = 10, reps = 1000, pItem = 0.8, distance = "euclidean", clusterAlg = "km")
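The resampling idea behind consensus clustering can be sketched in base R without ConsensusClusterPlus. This is a minimal illustration of the technique on simulated data, not a reimplementation of that package: the helper name `consensus_matrix`, the simulated DM-value matrix and the use of average linkage for the final grouping are assumptions made for this example.

```r
# Hypothetical helper: consensus matrix from repeated k-means on subsamples.
# `d` has genes in rows and patients in columns, like the DM-value matrix.
consensus_matrix <- function(d, k, reps = 100, p_item = 0.8) {
  n <- ncol(d)
  together <- matrix(0, n, n)  # times two patients fell in the same cluster
  sampled  <- matrix(0, n, n)  # times two patients were drawn together
  for (r in seq_len(reps)) {
    # Draw a random 80% subset of the patients and cluster only those.
    idx <- sort(sample(n, size = floor(p_item * n)))
    cl <- kmeans(t(d[, idx, drop = FALSE]), centers = k, nstart = 5)$cluster
    sampled[idx, idx] <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  # Fraction of co-sampled runs in which each pair clustered together.
  ifelse(sampled > 0, together / sampled, 0)
}

# Simulated DM-values: 20 genes x 30 patients, where patients 16 to 30 carry
# a hypermethylation signal in the first 10 genes.
set.seed(4)
dm <- matrix(rnorm(20 * 30, sd = 0.05), nrow = 20)
dm[1:10, 16:30] <- dm[1:10, 16:30] + 0.5

cons <- consensus_matrix(dm, k = 2, reps = 50)
# Final grouping: hierarchical clustering on 1 - consensus.
final <- cutree(hclust(as.dist(1 - cons), method = "average"), k = 2)
table(final, rep(c("A", "B"), each = 15))
```

Pairs of patients that are consistently assigned to the same cluster across subsamples have consensus values near 1, which is what makes the resulting groups robust to the choice of patients.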

5 Conclusion

MethylMix 2.0 is an R package that provides automated functionalities that improve upon the original MethylMix algorithm. In MethylMix 2.0 we have implemented an automated process to download and preprocess DNA methylation and gene expression datasets directly from TCGA in a few lines of code. In addition, we have demonstrated a key application of MethylMix: identifying robust patient subgroups. In summary, MethylMix 2.0 offers a tool that facilitates the systematic analysis of methylation-driven genes in pan-cancer studies from TCGA.

References (11 in total)

1. IMA: an R package for high-throughput analysis of Illumina's 450K Infinium methylation data.

Authors: Dan Wang; Li Yan; Qiang Hu; Lara E Sucheston; Michael J Higgins; Christine B Ambrosone; Candace S Johnson; Dominic J Smiraglia; Song Liu
Journal: Bioinformatics Date: 2012-01-16 Impact factor: 6.937

2. Adjusting batch effects in microarray expression data using empirical Bayes methods.

Authors: W Evan Johnson; Cheng Li; Ariel Rabinovic
Journal: Biostatistics Date: 2006-04-21 Impact factor: 5.899

3. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays.

Authors: Martin J Aryee; Andrew E Jaffe; Hector Corrada-Bravo; Christine Ladd-Acosta; Andrew P Feinberg; Kasper D Hansen; Rafael A Irizarry
Journal: Bioinformatics Date: 2014-01-28 Impact factor: 6.937

4. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking.

Authors: Matthew D Wilkerson; D Neil Hayes
Journal: Bioinformatics Date: 2010-04-28 Impact factor: 6.937

5. MethylMix: an R package for identifying DNA methylation-driven genes.

Authors: Olivier Gevaert
Journal: Bioinformatics Date: 2015-01-20 Impact factor: 6.937

6. Pancancer analysis of DNA methylation-driven genes using MethylMix.

Authors: Olivier Gevaert; Robert Tibshirani; Sylvia K Plevritis
Journal: Genome Biol Date: 2015-01-29 Impact factor: 13.583

7. Identification of an atypical etiological head and neck squamous carcinoma subtype featuring the CpG island methylator phenotype.

Authors: K Brennan; J L Koenig; A J Gentles; J B Sunwoo; O Gevaert
Journal: EBioMedicine Date: 2017-03-01 Impact factor: 8.143

8. Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas.

Authors: Joshua D Campbell; Christina Yau; Reanne Bowlby; Yuexin Liu; Kevin Brennan; Huihui Fan; Alison M Taylor; Chen Wang; Vonn Walter; Rehan Akbani; Lauren Averett Byers; Chad J Creighton; Cristian Coarfa; Juliann Shih; Andrew D Cherniack; Olivier Gevaert; Marcos Prunello; Hui Shen; Pavana Anur; Jianhong Chen; Hui Cheng; D Neil Hayes; Susan Bullman; Chandra Sekhar Pedamallu; Akinyemi I Ojesina; Sara Sadeghi; Karen L Mungall; A Gordon Robertson; Christopher Benz; Andre Schultz; Rupa S Kanchi; Carl M Gay; Apurva Hegde; Lixia Diao; Jing Wang; Wencai Ma; Pavel Sumazin; Hua-Sheng Chiu; Ting-Wen Chen; Preethi Gunaratne; Larry Donehower; Janet S Rader; Rosemary Zuna; Hikmat Al-Ahmadie; Alexander J Lazar; Elsa R Flores; Kenneth Y Tsai; Jane H Zhou; Anil K Rustgi; Esther Drill; Ronglei Shen; Christopher K Wong; Joshua M Stuart; Peter W Laird; Katherine A Hoadley; John N Weinstein; Myron Peto; Curtis R Pickering; Zhong Chen; Carter Van Waes
Journal: Cell Rep Date: 2018-04-03 Impact factor: 9.423

9. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis.

Authors: Charles D Warden; Heehyoung Lee; Joshua D Tompkins; Xiaojin Li; Charles Wang; Arthur D Riggs; Hua Yu; Richard Jove; Yate-Ching Yuan
Journal: Nucleic Acids Res Date: 2013-04-17 Impact factor: 16.971

10. NSD1 inactivation defines an immune cold, DNA hypomethylated subtype in squamous cell carcinoma.

Authors: Kevin Brennan; June Ho Shin; Joshua K Tay; Marcos Prunello; Andrew J Gentles; John B Sunwoo; Olivier Gevaert
Journal: Sci Rep Date: 2017-12-06 Impact factor: 4.379
