Robert Lowe, Vardhman K Rakyan.
Abstract
BACKGROUND: DNA methylation is indispensable for normal human genome function. An increasingly large amount of DNA methylomic data is being released into the public domain, creating an opportunity to investigate the relationships between the DNA methylome, genome function, and human phenotypes. The Illumina450K is one of the most popular platforms for assessing DNA methylation, with over 10,000 samples available in the public domain. However, accessing all of these data requires downloading each experiment individually and, because annotation is inconsistent, finding the right data can be a challenge.
DESCRIPTION: Here we introduce 'Marmal-aid', the first standardised database for DNA methylation (freely available at http://marmal-aid.org). In Marmal-aid, the majority of publicly available Illumina HumanMethylation450 data is incorporated into a single repository, allowing the data to be re-processed, including normalisation and imputation of missing values. The database is accessible in two ways: (1) through an R package, which allows incorporation into existing analysis pipelines and can be easily queried to gain insight into the functionality of certain CpG sites; this is aimed at bioinformaticians with experience in R. (2) Through a graphical interface, which allows general biologists to query a pre-defined set of tissues (currently 15), providing a reference database of the methylation state in these tissues for the 450,000 CpG sites profiled by the Illumina HumanMethylation450.
Year: 2013 PMID: 24330312 PMCID: PMC3878775 DOI: 10.1186/1471-2105-14-359
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The growth of methylation data in the public domain for different assays. The number of samples contained within GEO was extracted for the Illumina 27K array (GPL8490) and the Illumina HumanMethylation450 array (GPL13534). “MEDIP”, “BS-Seq” and “RRBS” were used as search terms with a filter for Homo sapiens to extract the number of samples for MEDIP-Seq, RRBS-Seq and BS-Seq. The cumulative number of samples for each month from February 2008 to June 2013 is shown for all of the different assays.
A description of the column names contained in the annotation file in Marmal-aid
| ID used to identify sample. If from GEO this is the GEO accession number | |
| A long descriptive name of the sample | |
| The GSE number, which can be used to ascertain whether samples are from the same experiment. | |
| The lineage of the tissue | |
| Main tissue type, e.g. Blood, Brain, Liver, Kidney | |
| Further sub-categorisation of the tissue, e.g. for Blood this may be CD19, CD4, etc. | |
| If the sample is a transformed cell line, the name of the line is given in this column | |
| Indication of the disease state of the sample. NA here means no information was recorded; the sample is most likely healthy. | |
| A further characterisation of the disease state, e.g. if the DISEASE column contains Cancer then this column will indicate what type | |
| The sex of the sample if given as taken from the annotation file | |
| The prediction of the sex of the sample using autosomal probes | |
| The age of the sample if given. | |
| Any additional information that could not be described in the other columns. |
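
To illustrate how an annotation file with columns like these might be queried, the following is a minimal Python sketch. It does not use the Marmal-aid R package. Only the DISEASE column name and the NA convention come from the table above; the ID and TISSUE column names, the tab-separated layout, the sample identifiers and the helper `select_samples` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical annotation excerpt. Column names other than DISEASE,
# the sample identifiers and the tab-separated layout are assumptions.
ANNOTATION_TSV = (
    "ID\tTISSUE\tDISEASE\n"
    "sample_1\tBlood\tNA\n"
    "sample_2\tBlood\tCancer\n"
    "sample_3\tBrain\tNA\n"
    "sample_4\tLiver\tNA\n"
)


def select_samples(rows, tissue, disease="NA"):
    """Return the IDs of samples matching a tissue and a disease state.

    By default this keeps samples whose DISEASE entry is NA, i.e. samples
    with no recorded disease information.
    """
    return [
        row["ID"]
        for row in rows
        if row["TISSUE"] == tissue and row["DISEASE"] == disease
    ]


# Parse the annotation into a list of dictionaries keyed by column name.
rows = list(csv.DictReader(io.StringIO(ANNOTATION_TSV), delimiter="\t"))

blood_controls = select_samples(rows, "Blood")
blood_cancer = select_samples(rows, "Blood", disease="Cancer")
print(blood_controls, blood_cancer)
```

Selecting sample IDs from the annotation first, and only then pulling the matching beta values, keeps the query independent of how the methylation matrix itself is stored.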
Figure 2. Imputation and visualisation in Marmal-aid. (A) The left panel shows the percentage of probes that were imputed to within 0.2 of the actual beta value for over 100 different random samples. On average, over 96% of the probes were imputed to within 0.2 of the actual value. The right panel shows an example of 1 of the 100 random samples, in which the actual beta value is plotted on the x-axis and the imputed value on the y-axis. (B) An example of the GUI available at http://marmal-aid.org/visualise.html for chr1:67,217,505-67,219,505 around the TSS of TCTEX1D1.
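
The accuracy measure described in panel (A) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name `fraction_within`, the toy beta values, and the choice to count a difference of exactly 0.2 as "within" are all assumptions.

```python
def fraction_within(actual, imputed, tolerance=0.2):
    """Fraction of probes whose imputed beta value lies within
    `tolerance` of the actual beta value (boundary counted as within)."""
    if len(actual) != len(imputed):
        raise ValueError("actual and imputed must have the same length")
    if not actual:
        raise ValueError("at least one probe is required")
    hits = sum(1 for a, i in zip(actual, imputed) if abs(a - i) <= tolerance)
    return hits / len(actual)


# Toy beta values for five probes in one sample (illustrative only).
actual = [0.10, 0.85, 0.50, 0.95, 0.30]
imputed = [0.15, 0.80, 0.75, 0.90, 0.05]

accuracy = fraction_within(actual, imputed)
print(f"{accuracy:.0%} of probes imputed to within 0.2")
```

Repeating this calculation over many randomly chosen samples and averaging the result would give a summary figure of the kind quoted in the caption.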
Figure 3. Investigating batch effects. (A) Hierarchical clustering of 4 different tissues (CD19, Whole Blood, Brain and Lung) from 8 different studies. A modified Euclidean distance was used so that only those probes with a beta difference of 0-0.1 (left panel), 0.1-0.2 (middle panel) or 0.2-0.3 (right panel) contributed. (B) Breast cancer DMPs were called using an F-test with a corrected p-value < 0.01 for tumour samples against normal samples in a single dataset contained in Marmal-aid (GSE29290). The beta value difference for these DMPs was then plotted for another dataset of breast tumours against normal samples taken from TCGA. The red lines represent thresholds of an absolute beta value difference of >0.1 and >0.2.
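
One way to read the modified Euclidean distance in panel (A) is as an ordinary Euclidean distance in which a probe contributes only when its absolute beta difference between the two samples falls inside a given band. The Python sketch below follows that reading. The exact formula is not given in the caption, so the function `banded_euclidean`, the half-open band `[low, high)` and the toy values are assumptions.

```python
import math


def banded_euclidean(x, y, low, high):
    """Euclidean distance between two samples, using only probes whose
    absolute beta difference d satisfies low <= d < high."""
    if len(x) != len(y):
        raise ValueError("samples must cover the same probes")
    total = 0.0
    for a, b in zip(x, y):
        diff = abs(a - b)
        if low <= diff < high:
            total += diff * diff
    return math.sqrt(total)


# Toy beta values for four probes in two samples (illustrative only).
sample_a = [0.10, 0.50, 0.90, 0.20]
sample_b = [0.15, 0.65, 0.65, 0.20]

# One distance per band, mirroring the three panels of the figure.
bands = [(0.0, 0.1), (0.1, 0.2), (0.2, 0.3)]
distances = [banded_euclidean(sample_a, sample_b, lo, hi) for lo, hi in bands]
print(distances)
```

Computing this distance for every pair of samples yields one distance matrix per band, which can then be passed to any standard hierarchical clustering routine. If small differences cluster samples by study while larger differences cluster them by tissue, that would point to batch effects confined to small beta differences.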