Shuying Sun1, Peng Li2.
Abstract
DNA methylation (the addition of a methyl group to a cytosine) is an important epigenetic event in mammalian cells because it plays a key role in regulating gene expression. Most previous methylation studies assume that DNA methylation occurs on both the positive and negative strands. However, a few studies have reported that in some genes, methylation occurs on only one strand (ie, hemimethylation) and shows clustering patterns. These studies report hemimethylation only for individual genes. It is unclear whether hemimethylation occurs genome-wide and whether hemimethylation differs between cancerous and noncancerous cells. To address these questions, we have developed the first-ever pipeline, named hemimethylation pipeline (HMPL), to identify hemimethylation patterns. Using available software together with newly developed Perl and R scripts, HMPL can identify hemimethylation patterns for a single sample and can also compare two different samples.
Keywords: HMPL; NGS (next-generation sequencing); hemimethylation
Year: 2015 PMID: 26308520 PMCID: PMC4530977 DOI: 10.4137/CIN.S17286
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Examples of hemimethylation patterns. M and CmG represent a methylated site. U and CG represent an unmethylated site. A is an example of hemimethylation that occurs on the same strand. B is an example of polarity or reverse hemimethylation pattern with only two CpG sites. C is an example of hemimethylation on different strands with more than two CpG sites.
Figure 2. Workflow of the HMPL.
The command options of HMPL Part I (Pre.HMPL.pl).
| OPTIONS | EXPLANATION |
|---|---|
| [-1 <file>] | Required. FASTQ format single-end input file or paired-end input file 1, eg, -1 MCF7.fastq, which is the file name of a FASTQ dataset. |
| [-2 <file>] | FASTQ format paired-end input file 2. By default, when there is no input file 2, HMPL processes only input file 1 and treats it as a single-end file. |
| [-o <dir>] | The output directory. The default output directory is the user’s current directory. For example, if the current directory in which the user runs HMPL is ‘/home/user/check.folder/’, then when running HMPL command line without specifying ‘-o’, the user would have all the output files in ‘/home/user/check.folder’. |
| [-p <string>] | Required. The prefix written to the output file names. For example, with -p MCF7, the output files will have the prefix MCF7 (eg, MCF7.site or MCF7.cluster). |
| [-r <file>] | The name of the file that lists the genome reference sequence (ie, *.fa) files that users will use to do alignment. Please note that this “-r” option must be provided whether or not the “-I” (ie, alignment index) option is provided. Otherwise, the “ |
| [-f <sanger or illumina>] | FASTQ format: HMPL accepts sanger or illumina format FASTQ files as input data; default is sanger. |
| [-a <yes or no>] | Adapter trimming: Users can select whether or not to utilize |
| [-A <string>] | Adapter sequences: HMPL accepts two adapter sequence inputs, separated by a comma; the default is AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG, AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT. |
| [-T <fix or brat>] | Quality trim flag: Specifies whether to use BRAT dynamic trimming function (default is BRAT-trim) or the user can specify ‘fix’ to apply fixed quality trimming (ie, trim off a fixed number of bases). |
| [-N <int>] | Fixed quality trimming: Specifies the number of bases to be trimmed at the 5′ end (default is 5). |
| [-n <int>] | Fixed quality trimming: Specifies the number of bases to be trimmed at the 3′ end (default is 10). |
| [-Q <yes or no>] | Whether or not to do the quality assessment using FastQC (default is no). |
| [-I <dir>] | The index directory for BRAT-bw alignment. If the index folder is provided, it will be automatically used. Otherwise, it will build index, which is the default setting. |
| [-i <positive integer>] | To specify minimum insert size for paired-end mapping, the minimum distance allowed between the left-most ends of the mapped mates on the forward strand (default is 0). |
| [-m <positive integer>] | To specify maximum insert size for paired-end mapping, the maximum distance allowed between the left-most ends of the mapped mates on the forward strand (default is 1000). |
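
The fixed quality trimming described for “-T fix” with “-N” and “-n” can be illustrated with a minimal Python sketch. This is not HMPL's own code (HMPL uses Perl and R scripts); the function name `fixed_trim` and the choice to return an empty read when nothing is left after trimming are assumptions made for illustration only.

```python
def fixed_trim(seq, qual, n5=5, n3=10):
    """Trim n5 bases from the 5' end and n3 bases from the 3' end.

    The defaults mirror the documented defaults of -N (5) and -n (10).
    Returning an empty read when nothing remains is an assumption of
    this sketch, not documented HMPL behavior.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    end = len(seq) - n3
    if end <= n5:
        # The read is too short to survive trimming.
        return "", ""
    return seq[n5:end], qual[n5:end]


if __name__ == "__main__":
    read = "ACGTACGTACGTACGTACGT"  # 20 bases
    quality = "I" * len(read)
    print(fixed_trim(read, quality))
```

With the defaults, a 20-base read keeps only the 5 bases at positions 6 to 10, which shows how aggressive fixed trimming can be on short reads.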
The command options of HMPL Part II (Parse.HMPL.pl).
| OPTIONS | EXPLANATION |
|---|---|
| [−1 <file>] | Input file 1 is required. Note: For both Input 1 and Input 2 (see next row), the user can enter two kinds of inputs. One is the combined methylation level data (eg, “−1 MCF7.CG.combine”), and the other is the “ |
| [−2 <file>] | Input file 2, optional. If specified, the pipeline will process both inputs and compare their final results. Default is only to process the input file 1, and not to do the comparing. Note: For both Input 1 and Input 2, the user can enter two kinds of inputs as explained in the above row. |
| [-o <dir>] | The output directory where all the output files are created and written. Default is “<current_dir>/final.results/.” |
| [-c <int>] | The coverage threshold B: CpG sites are selected when their methylation coverage is greater than B (default: B = 0). On each strand there must be at least B reads to cover a specific CpG site in order for HMPL to check if it is hemimethylated. Changing the “-c” value from a smaller value (eg, -c 5) to a larger value (eg, -c 10) will give a shorter list of hemimethylated sites and a smaller false discovery rate. |
| [-l <real>] | The cutoff value for selecting low methylation level. (Default: 0.2, range: [0.05, 0.4]). This value corresponds to the “L0” mentioned in Step 4 of the pipeline. If the methylation level is less than this “-l” value, it will be claimed as unmethylated. Changing “-l” value from a smaller value (eg, -l 0.1) to a larger value (eg, -l 0.2) may give a longer list of hemimethylated sites, but there may be a larger false discovery rate. |
| [-h <real>] | The cutoff value for selecting high methylation level. (Default: 0.8, range: [0.6, 1]). This value corresponds to the “H0” mentioned in Step 4 of the pipeline. If the methylation level is greater than this “-h” value, it will be claimed as methylated. Changing “-h” value from a smaller value (eg, -h 0.7) to a larger value (eg, -h 0.9) may give a shorter list of hemimethylated sites, but there may be a smaller false discovery rate. |
| [-d <int>] | The maximum distance between two CpG sites to be selected as a cluster with default 50. If the maximum distance is changed from a smaller value (eg, -d 50) to larger value (eg, -d 100), the number of CpG sites in a cluster will be larger, but the total number of hemimethylation clusters will become smaller. |
| [-r <file>] | The reference gene file, not the genome reference sequence files. This file is used to provide genetic annotation (ie, gene names) to the hemimethylation sites. For example, we set it as “-r /home/reference/hg19/refGene.txt”. This “refGene.txt” file contains the gene names and gene information downloaded from the UCSC genome browser. |
| [-D <int>] | The distance of promoter region (Default: D = 1000). That is, if the transcript starting position is located at X = 5,000 bp on a chromosome, the promoter region of this gene is defined as from X-D = 4,000 to X = 5,000. |
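
How the “-c”, “-l”, “-h”, and “-d” options interact can be sketched in Python as follows. This is an illustrative sketch only, not the HMPL implementation. All function names are hypothetical, coverage is treated as strictly greater than the “-c” value (the option text says both “greater than B” and “at least B”), and the “-d” distance is assumed to apply between neighboring hemimethylated sites.

```python
def label(level, low=0.2, high=0.8):
    """Label a methylation level: U (unmethylated), M (methylated), P (partial)."""
    if level < low:
        return "U"
    if level > high:
        return "M"
    return "P"


def is_hemimethylated(fwd_level, fwd_cov, rev_level, rev_cov,
                      min_cov=0, low=0.2, high=0.8):
    """True if one strand is labelled M and the other U at a covered CpG site."""
    # Assumption: both strands need coverage strictly greater than min_cov.
    if fwd_cov <= min_cov or rev_cov <= min_cov:
        return False
    states = {label(fwd_level, low, high), label(rev_level, low, high)}
    return states == {"M", "U"}


def cluster_positions(positions, max_dist=50):
    """Group sorted positions whose neighboring gaps are at most max_dist."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_dist:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


if __name__ == "__main__":
    print(is_hemimethylated(0.95, 12, 0.05, 9))   # M on one strand, U on the other
    print(cluster_positions([100, 130, 400, 420, 460, 900]))
```

In this sketch, groups of size one would correspond to single hemimethylated sites rather than clusters. Raising `max_dist` merges neighboring groups, which matches the table's note that a larger “-d” gives larger but fewer clusters.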
The description of HMPL Part II (Parse.HMPL.pl) output files.
| FILE NAME | CONTENTS |
|---|---|
| *.grX | The CpG sites with coverage greater than X (X > 0). |
| *.all.HM.sites | The hemimethylated CpG sites identified by the high and low cutoff values. |
| *.all.HM.sites.annotated | The annotated hemimethylated sites (ie, gene names are provided). |
| *.all.labelled.CG | The CpG sites with coverage greater than X and with the labels of methylation states (P: partially methylated, M: methylated, U: unmethylated). |
| *.Summary | The summary file for all the methylation states of single hemimethylation sites and clusters. |
| *.all.HMClusters | All hemimethylated clusters. |
| *.all.Rev.Clusters | All of the polarity (or reverse) clusters, including both consecutive and non-consecutive polarity clusters. |
| *.non.Rev.Clusters | The hemimethylated clusters that are not polarity patterns. |
| *.Singleton | Single hemimethylated CpG sites. |
| *.consec.revs.Clusters | The consecutive polarity (or reverse) clusters (ie, with just two consecutive CpG sites). |
| *.non.consec.revs.Clusters | The non-consecutive polarity clusters (ie, with two CpG sites that are not consecutive). |
| *.compare | The results of comparing two samples. |
Figure 3. MCF10A and MCF7 hemimethylation pattern comparison results. In each Venn diagram, the “A” entry means the number of CpG sites or clusters that are hemimethylated in the MCF10A sample, but not in the MCF7 sample. The “B” entry shows the number of CpG sites or clusters that are hemimethylated in the MCF10A sample, but there are no sequencing reads for these CpG sites in the MCF7 sample. The “C” entry represents the number of CpG sites or clusters that are hemimethylated in both MCF10A and MCF7. The “D” entry indicates the number of CpG sites or clusters that are hemimethylated in the MCF7 sample, but not in the MCF10A sample. The “E” entry means the number of CpG sites that are hemimethylated in the MCF7 sample, but there are no sequencing reads for these CpG sites in the MCF10A sample.
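
The five Venn diagram entries can be expressed as set operations. The Python sketch below is a hedged illustration, not the code behind the “*.compare” output. The function name `compare_samples` and its inputs are hypothetical, and it assumes that entries “A” and “D” count only sites that do have reads in the other sample, so that they do not overlap with “B” and “E”.

```python
def compare_samples(hm_a, covered_a, hm_b, covered_b):
    """Split hemimethylated (HM) sites of two samples into entries A to E.

    hm_*      : sites called hemimethylated in each sample
    covered_* : sites with sequencing reads in each sample
    """
    hm_a, hm_b = set(hm_a), set(hm_b)
    # An HM site necessarily has reads, so fold HM sites into the covered sets.
    covered_a = set(covered_a) | hm_a
    covered_b = set(covered_b) | hm_b
    return {
        "A": hm_a & (covered_b - hm_b),  # HM in sample a, covered but not HM in b
        "B": hm_a - covered_b,           # HM in sample a, no reads in b
        "C": hm_a & hm_b,                # HM in both samples
        "D": hm_b & (covered_a - hm_a),  # HM in sample b, covered but not HM in a
        "E": hm_b - covered_a,           # HM in sample b, no reads in a
    }


if __name__ == "__main__":
    result = compare_samples(
        hm_a={1, 2, 3}, covered_a={1, 2, 3, 4},
        hm_b={3, 4, 5}, covered_b={2, 3, 4, 5},
    )
    for entry, sites in result.items():
        print(entry, sorted(sites))
```

Here sample a plays the role of MCF10A and sample b the role of MCF7; sites are plain integers for brevity, whereas real data would need chromosome and position.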
The summary of MCF10A and MCF7 hemimethylation patterns.
| MCF10A CLUSTER PATTERN | MCF10A FREQUENCY | MCF7 CLUSTER PATTERN | MCF7 FREQUENCY |
|---|---|---|---|
| MMMMMM-UUUUUU | 1 | MMMMM-UUUUU | 1 |
| MMMMM-UUUUU | 1 | MMMM-UUUU | 1 |
| MMM-UUU | 2 | MMM-UUU | 3 |
| MM-UU | 6 | MM-UU | 27 |
| MMU-UUM | 2 | MMU-UUM | 1 |
| MUM-UMU | 1 | MUM-UMU | 6 |
| MUMU-UMUM | 2 | MUMU-UMUM | 4 |
| MU-UM | 1643 | MU-UM | 2290 |
| UM-MU | 4 | MUU-UMM | 2 |
| UMU-MUM | 3 | UMM-MUU | 1 |
| UU-MM | 9 | UM-MU | 6 |
| UUU-MMM | 3 | UMU-MUM | 8 |
| UUUU-MMMM | 1 | UMUMU-MUMUM | 1 |
| UUUUU-MMMMM | 1 | UU-MM | 20 |
| | | UUU-MMM | 5 |
| | | UUUU-MMMM | 1 |
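
A frequency table of this kind can be produced by turning each cluster into a pattern string and counting. The Python sketch below assumes that the part before the hyphen lists the states of one strand and the part after it lists the states of the other strand, site by site; this reading is consistent with patterns such as MU-UM but is not confirmed by the text above. The function names are hypothetical.

```python
from collections import Counter


def cluster_pattern(sites):
    """Build a pattern such as 'MU-UM' from per-site (strand1, strand2) labels.

    sites is a list of (label_on_one_strand, label_on_other_strand) tuples,
    ordered by genomic position.
    """
    first = "".join(a for a, _ in sites)
    second = "".join(b for _, b in sites)
    return f"{first}-{second}"


def pattern_frequencies(clusters):
    """Count how often each cluster pattern occurs."""
    return Counter(cluster_pattern(c) for c in clusters)


if __name__ == "__main__":
    clusters = [
        [("M", "U"), ("U", "M")],  # polarity (reverse) pattern
        [("M", "U"), ("U", "M")],
        [("M", "U"), ("M", "U")],  # same-strand hemimethylation
    ]
    for pattern, count in pattern_frequencies(clusters).items():
        print(pattern, count)
```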
Ten significant cancer modules identified using GSEA. The first column is the gene symbol, and the other columns indicate if a gene belongs to a specific cancer module. “X” shows that a gene belongs to that module, and a blank cell means that a gene does not belong to a specific module.
| GENE SYMBOL | 38 | 334 | 100 | 137 | 66 | 11 | 55 | 88 | 41 | 37 |
|---|---|---|---|---|---|---|---|---|---|---|

[Table body not recoverable: the gene symbols and the column positions of the “X” marks were lost in extraction. The column headers above are the ten cancer module numbers (or IDs).]