
NGSmethDB 2017: enhanced methylomes and differential methylation.

Ricardo Lebrón^1,2, Cristina Gómez-Martín^1,2, Pedro Carpena^3, Pedro Bernaola-Galván^3, Guillermo Barturen^4, Michael Hackenberg^5,2, José L Oliver^6,2.

Abstract

The 2017 update of NGSmethDB stores whole genome methylomes generated from short-read data sets obtained by whole-genome bisulfite sequencing (WGBS) technology. To generate high-quality methylomes, stringent quality controls were integrated with third-party software, and a two-step mapping process was added to exploit the advantages of the new genome assembly models. The samples were all profiled under constant parameter settings, thus enabling comparative downstream analyses. Besides a significant increase in the number of samples, NGSmethDB now includes two additional data-types, which are a valuable resource for the discovery of methylation epigenetic biomarkers: (i) differentially methylated single-cytosines; and (ii) methylation segments (i.e. genome regions of homogeneous methylation). The NGSmethDB back-end is now based on MongoDB, a NoSQL hierarchical database using JSON-formatted documents and dynamic schemas, thus accelerating sample comparative analyses. Besides conventional database dumps, track hubs were implemented, which improved database access, visualization in genome browsers and comparative analyses against third-party annotations. In addition, the database can also be accessed through a RESTful API. Lastly, a Python client and a multiplatform virtual machine allow for program-driven access from the user's desktop. This way, private methylation data can be compared to NGSmethDB without the need to upload them to public servers. Database website: http://bioinfo2.ugr.es/NGSmethDB.


Year: 2016 PMID: 27794041 PMCID: PMC5210667 DOI: 10.1093/nar/gkw996

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

DNA methylation at the cytosine carbon 5 position (5mC) is the main epigenetic mark, being able to modify gene expression patterns without entailing DNA sequence changes (1,2). Furthermore, such modification is reversible during cell differentiation (3,4). CpG methylation is involved in cell differentiation in mammals, gene imprinting, the inactivation of the X chromosome and genome stability (1,5–11). Differential methylation at CpG islands, many of which overlap promoters, can control tissue specificity of gene expression (12). The short-read data sets from whole-genome shotgun bisulfite sequencing (WGBS) can be used (13,14) to generate whole-genome methylation maps or methylomes (15–17). When integrated with other genome maps (18,19), they may help to understand how the changes in DNA methylation (20–22) interact with other genetic and epigenetic marks (23,24) to control normal development or to provoke pathological dysregulations, such as cancer (25,26). Current problems with NGS methylation profiling (Barturen et al. ‘Error Correction in Methylation Profiling From NGS Bisulfite Protocols’, in ‘Algorithms for Next-Generation Sequencing Data: Techniques, Approaches and Applications’, Springer, in preparation) include: (i) the correct handling of the different error sources that might appear along the process (sequencing errors, clonal reads, sequence variation, bisulfite failure and misalignments) and that could lead to wrong methylation calling; and (ii) the diversity of protocols, methods and optional parameter values used in aligning the reads to a reference genome, as well as in reading out the methylation levels from the alignment (27,28). Although other methylation databases exist (17,29–38), the above problems are not always properly addressed, and therefore the methylation data stored in them are often unsuitable for comparative downstream analyses.
Another common problem is the restricted range of samples stored in some of the databases, which focus on specific gene loci (29–32), tissues (33,34) or diseases (35–38) and thus hamper large-scale comparative analyses. Several years ago we initiated the development of integrated methylation pipelines (28,39) to minimize all the potential errors while unifying the different protocols. Our most recent development is MethFlow (Lebrón et al. ‘MethFlowVM: a virtual machine for the integral analysis of bisulfite sequencing data’, in preparation), a pipeline integrating the stringent quality controls built into MethylExtract (28) with third-party software in order to produce enhanced methylomes. We use MethFlow to populate and update NGSmethDB, thus profiling the samples under uniform conditions, which enables comparative, large-scale downstream analyses. Furthermore, NGSmethDB now includes two additional data sets, which may be a valuable resource for the discovery of methylation epigenetic biomarkers: (i) differentially methylated single-cytosines; and (ii) methylation segments (i.e. genome regions of homogeneous methylation).

DATABASE CONTENT

Publicly available short-read data sets from WGBS projects for different cell lines, primary tissues, pathological biopsies and autopsies were downloaded mainly from NCBI GEO (40) and the ROADMAP project (41). An updated list of the available methylomes, with detailed information on the source cell lines or tissues, is maintained online on the database website. At the time of writing, NGSmethDB includes 667 methylomes generated for CG, CHG (H = A, C, T) and CHH sequence contexts (Table 1), a significant increase over the 87 methylomes in the previous release. CG is the most widespread methylation sequence context in mammals, while CHG and CHH methylation has recently been found in almost all human tissues (3,21,42). The information stored for each sampled single-cytosine is detailed in Table 2. Of particular interest are the samples derived from primary tissues of three individuals from the ROADMAP project: 11 samples from STL001 (3-year-old healthy male), 11 samples from STL002 (30-year-old female, iron deficiency, bipolar disease) and 13 samples from STL003 (34-year-old male, polysubstance abuse).

Table 1.

Number of methylomes by species and sequence context stored in NGSmethDB

Species	Reference genome assembly	Sequence context	No. of methylomes
Homo sapiens	hg19	CG	57
		CHG	54
Homo sapiens	hg38	CG	35
Pan troglodytes	panTro4	CG	5
		CHG	5
Macaca mulatta	rheMac3	CG	6
		CHG	6
Mus musculus	mm10	CG	41
		CHG	41
Solanum lycopersicum	sl2.50	CG	8
		CHG	8
		CHH	8
Solanum pimpinellifolium	sl2.50	CG	2
		CHG	2
		CHH	2
Arabidopsis thaliana	tair10	CG	129
		CHG	129
		CHH	129
		TOTAL	667

Table 2.

Information stored in NGSmethDB for each single-cytosine. All fields are described as shown in the results of the NGSmethDB API client. Each row corresponds to a field, with its description and an example value for a single cytosine. For more information, see the NGSmethDB manual.

Field	Description	Example
chrom	Chromosome	chr22
pos	Chromosome position	25174338
genotype	Genotype of methylation context	YG
methContext	Methylation context in which the cytosine lies	CG
w.methylatedReads	Number of reads in which this cytosine is methylated (Watson-strand only)	22
c.methylatedReads	Number of reads in which this cytosine is methylated (Crick-strand only)	27
methylatedReads	Number of reads in which this cytosine is methylated (both strands)	49
w.coverage	Number of reads mapped at this chromosome position (Watson-strand only)	26
c.coverage	Number of reads mapped at this chromosome position (Crick-strand only)	33
coverage	Number of reads mapped at this chromosome position (both strands)	59
w.methRatio	Methylated reads ratio at this chromosome position (Watson-strand only)	0.85
c.methRatio	Methylated reads ratio at this chromosome position (Crick-strand only)	0.82
methRatio	Methylated reads ratio at this chromosome position (both strands)	0.83
w.phredScore	Average sequencing quality score at this chromosome position (Watson-strand only)	39
c.phredScore	Average sequencing quality score at this chromosome position (Crick-strand only)	37
phredScore	Average sequencing quality score at this chromosome position (both strands)	38
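
The relation between the read counts and the ratios in the Table 2 example can be checked with a few lines of Python. This is only a sketch: it assumes that each methRatio is the number of methylated reads divided by the coverage, and that the both-strand value pools the reads of the two strands. Both assumptions are consistent with the example values, but the source does not state the formula. The helper name meth_ratio is illustrative.

```python
def meth_ratio(methylated_reads, coverage):
    """Fraction of reads supporting methylation; None when there is no coverage."""
    if coverage == 0:
        return None
    return methylated_reads / coverage


# Read counts taken from the Table 2 example (chr22:25174338).
watson = {"methylatedReads": 22, "coverage": 26}
crick = {"methylatedReads": 27, "coverage": 33}

# Assumption: the both-strand fields are the sums over the two strands.
both_methylated = watson["methylatedReads"] + crick["methylatedReads"]
both_coverage = watson["coverage"] + crick["coverage"]

w_ratio = meth_ratio(watson["methylatedReads"], watson["coverage"])
c_ratio = meth_ratio(crick["methylatedReads"], crick["coverage"])
ratio = meth_ratio(both_methylated, both_coverage)

print(round(w_ratio, 2), round(c_ratio, 2), round(ratio, 2))  # 0.85 0.82 0.83
```

The rounded results match the w.methRatio, c.methRatio and methRatio values listed in the table.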

Lastly, and as a novelty in this release of NGSmethDB, genome maps of differentially methylated cytosines (DMCs) and methylation segments for human (hg38) and tomato (sl2.50) were also included in the database. Both data sets are a valuable resource for the discovery of methylation epigenetic biomarkers. The information stored in NGSmethDB for each DMC is detailed in Supplementary Table S1, and that for methylation segments in Supplementary Table S2.

DATABASE BACK-END

Single-cytosine methylation, methylation segments and differential methylation data are stored hierarchically in MongoDB (https://www.mongodb.com/), a NoSQL database that avoids the traditional table-based relational database structure in favor of JSON-formatted documents with dynamic schemas. This makes the programmatic comparison of data from different samples easier and faster. However, when querying the database, CSV- or TSV-formatted files can optionally be generated, thus also allowing for downstream analyses by means of conventional spreadsheets. Each assembly is stored in a database, and inside every database there is a collection for each chromosome. Within the collection, each JSON-like document represents one cytosine and contains, hierarchically, all genotypes (i.e. the different individuals from which the samples were obtained), differential methylation and methylation data of all individuals and samples. The first level is the data type (genotype, methylation or differential methylation), the second is the individual, the third is the sample and the fourth holds the data values themselves (i.e. the DNA methylation levels, the alleles, the methylation differences, etc.).
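
To make the four-level hierarchy concrete, the following Python sketch builds one hypothetical per-cytosine document and walks it. The key names, individual identifiers, sample names and the second sample's values are invented for illustration and should not be read as the actual NGSmethDB schema; only the nesting order (data type, individual, sample, values) comes from the description above.

```python
import json

# Hypothetical document for one cytosine. Key names and identifiers are
# illustrative assumptions, not the real NGSmethDB schema.
doc = {
    "chrom": "chr22",
    "pos": 25174338,
    "genotype": {
        "individual_A": {"sample_1": {"genotype": "YG"}},
    },
    "methylation": {
        "individual_A": {
            "sample_1": {"methRatio": 0.83, "coverage": 59},
            "sample_2": {"methRatio": 0.10, "coverage": 41},  # made-up values
        },
    },
    "differentialMethylation": {
        "individual_A": {"sample_1_vs_sample_2": {"methDiff": 0.73}},  # made-up
    },
}


def get_value(document, data_type, individual, sample, field):
    """Walk the four levels: data type -> individual -> sample -> value."""
    return document[data_type][individual][sample][field]


def iter_samples(document, data_type):
    """Yield (individual, sample, values) for every sample of one data type."""
    for individual, samples in document[data_type].items():
        for sample, values in samples.items():
            yield individual, sample, values


print(json.dumps(doc, indent=2))
print(get_value(doc, "methylation", "individual_A", "sample_1", "methRatio"))
for individual, sample, values in iter_samples(doc, "methylation"):
    print(individual, sample, values["methRatio"])
```

Because all samples for a position live in one document, comparing samples at that position needs a single lookup rather than a join across tables.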

WHOLE-GENOME, SINGLE-CYTOSINE RESOLUTION METHYLOMES

The high-quality methylomes stored in NGSmethDB were produced by MethylExtract (28), a software tool for DNA methylation profiling and genotyping from the same sample. For the most recent genome assemblies, we used our improved methylation pipeline MethFlow (Lebrón et al. ‘MethFlowVM: a virtual machine for the integral analysis of bisulfite sequencing data’, in preparation), using optimized (default) values for all the samples. The core of MethFlow is MethylExtract, to which third-party software was added to improve its overall performance. The pre-processing improvement consists of the use of Trimmomatic (43) for adapter trimming and removal of low-quality 3′ ends. The alignment to a three-letter genome is then performed by means of Bismark (44), which uses Bowtie2 (45) as aligner. Next, BSeQC (46) is used for the elimination of known technical artefacts that may result in inaccurate methylation estimation. Finally, MethylExtract performs the combined methylation calling and genotyping step. This last tool minimizes several important error sources, such as sequencing errors, bisulfite failure, clonal reads and single nucleotide variants. The result of the entire process is a high-quality, whole-genome methylation map or methylome, as well as the genotypes at all single-cytosine positions. Another important novelty of MethFlow, as compared with previous pipelines, is the incorporation of a two-step mapping process in order to (i) exploit the advantages of the new genome assembly models (47) and (ii) recover the useful information of multiple-mapped reads for the analysis (see Supplementary Figure S1 for details). First, the reads are mapped against a decoy assembly (canonical chromosomes + alternative loci + decoy sequences). This increases the number of correctly mapped reads but also the number of multiple-mapped reads (which are discarded). A certain percentage of those ambiguous reads are then recovered in a second mapping step against the canonical chromosomes.
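
The bookkeeping behind the two-step mapping can be sketched as a toy Python function. This is not MethFlow code: the function name, the input dictionaries (number of alignments found per read in each step) and the rule that a read is recovered when it maps to exactly one place on the canonical chromosomes are assumptions made for illustration. No aligner is called, and the treatment of reads that map uniquely to alternative loci or decoy sequences is not modelled.

```python
def two_step_mapping(reads, hits_decoy, hits_canonical):
    """Classify reads as kept in step 1, recovered in step 2, or discarded.

    hits_decoy / hits_canonical map a read id to the number of alignments
    found against the decoy assembly and the canonical chromosomes.
    """
    kept, recovered, discarded = [], [], []
    for read in reads:
        n_decoy = hits_decoy.get(read, 0)
        if n_decoy == 1:
            # Unique hit against the decoy assembly: kept after step 1.
            kept.append(read)
        elif n_decoy > 1 and hits_canonical.get(read, 0) == 1:
            # Ambiguous in step 1, unique against canonical chromosomes only.
            recovered.append(read)
        else:
            # Unmapped, or still ambiguous after the second step.
            discarded.append(read)
    return kept, recovered, discarded


# Made-up alignment counts for four reads.
reads = ["r1", "r2", "r3", "r4"]
hits_decoy = {"r1": 1, "r2": 2, "r3": 3, "r4": 0}
hits_canonical = {"r2": 1, "r3": 2}

kept, recovered, discarded = two_step_mapping(reads, hits_decoy, hits_canonical)
print(kept, recovered, discarded)  # ['r1'] ['r2'] ['r3', 'r4']
```

In this toy run, r2 is ambiguous against the decoy assembly (for example, because it also matches an alternative locus) but unique on the canonical chromosomes, so it is recovered rather than lost.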

DIFFERENTIALLY METHYLATED CYTOSINES

Given the increasing biological relevance of differential methylation (48–50), a section of the database is now dedicated to precomputed DMCs. Variation in methylation levels can be caused by many different factors, such as cell type, genotype (individual), environment, age and sex, and therefore the naive comparison of arbitrary samples is not a reliable method to obtain biologically meaningful DMCs. Here, we restricted ourselves to computing differential methylation while controlling only for the two main factors: the variation found among tissues within the same individual (intra-individual DMCs), and that found in the same tissue from different individuals (inter-individual DMCs), a strategy allowing for a consistent comparison of these two important levels of epigenetic variability (Lebrón et al. ‘Intra- and inter-individual variability of single-cytosine methylation’, in preparation). Right now, the relevant samples for human intra- and inter-individual variability are all from the ROADMAP project (41), which includes three individuals and multiple tissue types with different replicas, but more comparisons might be added as soon as appropriate data become available. In tomatoes, we generated a catalogue of intra-individual DMCs by comparing two leaves of different age from the same plant of Solanum pimpinellifolium accession TO-937 (Gómez-Martín et al. ‘Differential methylation in tomato leaves’, in preparation), as well as another list of intra-cultivar DMCs in Solanum lycopersicum cv. Ailsa Craig on the basis of the whole-genome bisulfite sequencing of fruit at four stages of development carried out by Giovannoni and coworkers (51). To detect DMCs between sample pairs we used the CG sequence context in humans and the CG and CHG contexts in tomato. Three statistical tests were applied: Fisher's exact test as implemented in methylKit (52) and in MOABS (53), and the similarity test implemented in MOABS (53).
To ensure statistical consistency, only those cytosines reaching statistical significance in all three tests (consensus) were considered as DMCs. Pair-wise comparisons were made between all relevant samples; reaching consensus in any one of the pair-wise comparisons suffices for a cytosine position to be considered a DMC. See Supplementary Table S1 for details on the information stored in NGSmethDB on each detected DMC.
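
The consensus rule just described (all three tests significant within a pair, any one pair sufficient) reduces to an all/any combination. The Python sketch below assumes that the outcome of each test has already been turned into a significant/not-significant flag; the dictionary keys, tissue names and flag values are illustrative, and the tests themselves (methylKit, MOABS) are not run here.

```python
def reaches_consensus(test_results):
    """True only if every test of one pair-wise comparison is significant."""
    return bool(test_results) and all(test_results.values())


def is_dmc(pairwise_comparisons):
    """A position is a DMC if any pair-wise comparison reaches consensus."""
    return any(reaches_consensus(r) for r in pairwise_comparisons.values())


# Made-up significance flags for one cytosine across three sample pairs.
comparisons = {
    ("tissue_A", "tissue_B"): {
        "fisher_methylkit": True, "fisher_moabs": True, "similarity_moabs": False,
    },
    ("tissue_A", "tissue_C"): {
        "fisher_methylkit": True, "fisher_moabs": True, "similarity_moabs": True,
    },
    ("tissue_B", "tissue_C"): {
        "fisher_methylkit": False, "fisher_moabs": False, "similarity_moabs": False,
    },
}

print(is_dmc(comparisons))  # True
```

Here the position counts as a DMC because of the tissue_A versus tissue_C comparison alone; the A versus B pair fails consensus since one of its three tests is not significant.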

METHYLATION SEGMENTS

Segmentation algorithms (54–57) can divide a DNA sequence into segments of homogeneous nucleotide composition, at a given significance level. In the same way, an array of single-cytosine methylation levels (in the CpG sequence context) along a chromosome sequence can be decomposed into segments of homogeneous methylation, thus revealing the regional variation of methylation levels. We have adapted our recursive segmentation algorithm to handle the methylation levels obtained by MethFlow. The details of the method will be given elsewhere (Carpena et al., ‘Segmenting whole-genome methylation maps’, in preparation), but in essence the algorithm maximizes the difference of the mean values of adjacent segments by computing the t-statistic. Given the non-Gaussian nature of the distribution of methylation levels, the statistical significance of a given t-value was then obtained by a special randomization process, which takes into account both the methylation probability distribution in the sample (often bimodal) and the correlations among neighbour methylation values. See Supplementary Table S2 for details on the information stored in NGSmethDB on each methylation segment.
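
The core step of such a segmentation, finding the cut point that maximizes the t-statistic between the two sides, can be illustrated in Python. This is a generic sketch and not the authors' algorithm, whose details are not given here: it assumes a pooled-variance Student's t, uses made-up methylation levels, and omits both the randomization-based significance test and the recursion over the resulting segments.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(left, right):
    """Absolute pooled-variance t-statistic between two groups of levels."""
    n1, n2 = len(left), len(right)
    pooled = ((n1 - 1) * variance(left) + (n2 - 1) * variance(right)) / (n1 + n2 - 2)
    diff = abs(mean(left) - mean(right))
    if pooled == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / sqrt(pooled * (1 / n1 + 1 / n2))


def best_split(levels, min_size=2):
    """Return (index, t) of the cut that maximizes t; index is None if no cut."""
    best_index, best_t = None, 0.0
    for i in range(min_size, len(levels) - min_size + 1):
        t = t_statistic(levels[:i], levels[i:])
        if t > best_t:
            best_index, best_t = i, t
    return best_index, best_t


# Made-up CpG methylation levels: a highly methylated run, then a low one.
levels = [0.90, 0.85, 0.95, 0.88, 0.10, 0.05, 0.12, 0.08]
index, t_value = best_split(levels)
print(index)  # 4
```

In a recursive scheme, the same search would be repeated on each side of an accepted cut until no further cut is judged significant. As the text notes, a plain t-distribution is not appropriate for that judgement because methylation levels are non-Gaussian and correlated.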

DATA SHARING, PROGRAMMATIC ACCESS AND VISUALIZATION

The data stored in NGSmethDB can be accessed in a variety of ways, depending on the data type or the access method (Figure 1). Single-cytosine methylation, differentially methylated cytosines and methylation segments can be accessed by downloading database dumps, querying the database through a custom web-form, using track hubs and the Table Browser at UCSC, issuing simple HTTP queries in a web browser or through programmatic access (using either the NGSmethDB API Client or the NGSmethDB API Virtual Machine).

Figure 1.

Data flow diagram for NGSmethDB indicating the source of primary data, the different types of extracted data and the different ways of data access.

Database dumps are the easiest way to access NGSmethDB data. Complete zipped methylomes can be downloaded from the ‘Content and dumps’ page at the NGSmethDB website. Once unzipped, each dump is a tab-delimited file that can be opened directly in a spreadsheet for downstream analyses. Another way to access NGSmethDB data is through track hubs (58), which provide a standard and efficient mechanism for visualizing remotely hosted, Internet-accessible collections of genome annotations. Hub data sets can then be fully integrated into the University of California Santa Cruz (UCSC) Genome Browser (18). In this way, NGSmethDB data can be visualized and compared to a plethora of third-party annotations (Figure 2). In addition, UCSC tools, such as the Table Browser or the Data Integrator, provide ways to (i) retrieve detailed NGSmethDB tab-delimited data sets from any genome, chromosome, genome region, gene, SNP or any other genome marker; (ii) combine methylation data and any other third-party annotation into a single set of data based on a specific join criterion, e.g. to find the methylation state of cytosines that intersect with CpG islands; and (iii) directly upload NGSmethDB data sets to public bioinformatics platforms such as Galaxy (59), GenomeSpace (60) or GREAT (61) for further downstream, genome-wide analyses.
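
Besides a spreadsheet, an unzipped dump can be read with Python's standard csv module. The sketch below assumes that the dump has a header row using the Table 2 field names, which the source does not confirm. The inline sample stands in for a real file: its first data row reuses the Table 2 example and the other two rows are made up. The coverage threshold is an arbitrary illustration.

```python
import csv
import io

# Stand-in for an unzipped dump; header names are assumed from Table 2.
sample_dump = (
    "chrom\tpos\tmethContext\tmethylatedReads\tcoverage\tmethRatio\n"
    "chr22\t25174338\tCG\t49\t59\t0.83\n"
    "chr22\t25174400\tCG\t1\t4\t0.25\n"    # made-up row, low coverage
    "chr22\t25174512\tCG\t3\t30\t0.10\n"   # made-up row
)


def read_cytosines(handle, min_coverage=10):
    """Yield (chrom, pos, methRatio) for rows with enough coverage."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if int(row["coverage"]) >= min_coverage:
            yield row["chrom"], int(row["pos"]), float(row["methRatio"])


kept = list(read_cytosines(io.StringIO(sample_dump)))
for chrom, pos, ratio in kept:
    print(chrom, pos, ratio)
```

For a real dump, the StringIO object would be replaced by an open file handle; reading row by row keeps memory use low for whole-genome files.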

Figure 2.

NGSmethDB data shown at the UCSC Genome Browser. A genome region of chromosome 19 (chr19:2,248,838-2,256,966) encompassing three genes (AMH, encoding the anti-Mullerian hormone; MIR4321, encoding microRNA 4321; and JSRP1, encoding a junctional sarcoplasmic reticulum protein) is shown. The three main types of NGSmethDB data are shown for different tissues: (i) methylation levels at single-cytosines; (ii) differentially methylated cytosines; and (iii) methylation segments. Third-party annotations are also shown: genes from the RefSeq database (63), the strict set of CpG-islands predicted by CpGcluster (64,65) and gene expression levels from the NIH Genotype-Tissue Expression (GTEx) project (66). Online image: https://goo.gl/ElXE4t.

As a novelty for this release, and using Node.js (https://nodejs.org/en/), an NGSmethDB API server has been implemented on our server, which provides access to the entire MongoDB database via a RESTful API (62). This allows for additional web or programmatic ways to access NGSmethDB data:

(i) Through a custom web-form, e.g. to retrieve single-cytosine methylation levels from a specific genome region.

(ii) Through HTTP access, i.e. to retrieve single-cytosine methylation data by issuing simple HTTP queries directly in the address bar of the user's browser (see the NGSmethDB manual for details).

Web-form and HTTP access methods are recommended only to retrieve data on a single position, or on regions of moderate size such as exons or genes. Larger regions are better analyzed by means of track hubs or programmatic access.

(iii) Through program-driven access, in either of two ways. The NGSmethDB Standalone API Client is a multiplatform Python script (http://bioinfo2.ugr.es/NGSmethDB_API/NGSmethDB_API_client.py) running on Linux, Mac OS X and other UNIX systems. It allows the user to select the assembly and samples, download methylation and differential methylation data and compute statistics for different genomic regions based on the coordinates provided as a BED-formatted file (the BED format is described in the UCSC FAQ, https://genome.ucsc.edu/FAQ/FAQformat.html#format1). The NGSmethDB API Virtual Machine encapsulates the NGSmethDB API Client in a preconfigured VirtualBox (https://www.virtualbox.org/) virtual machine on which all the dependencies are already installed. The NGSmethDB API VM (http://bioinfo2.ugr.es/NGSmethDB_API/NGSmethDB_API.ova) is platform independent, being able to run on Windows, Linux or Mac desktops.
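
The kind of region-based statistic that a BED file drives can be sketched independently of the API client. The Python code below is not the client's implementation and does not query the server: the function names and the in-memory cytosine list with its values are invented, and it assumes that cytosine positions are 1-based. BED intervals are 0-based and half-open, as documented by UCSC, so a BED line with start s and end e covers 1-based positions s+1 to e.

```python
from bisect import bisect_left, bisect_right


def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines, skipping headers and blanks."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def region_mean(cytosines, chrom, start, end):
    """Mean methRatio of cytosines inside one BED interval, or None if empty.

    cytosines maps a chromosome to a position-sorted list of
    (pos, methRatio) pairs, with pos assumed to be 1-based.
    """
    sites = cytosines.get(chrom, [])
    positions = [pos for pos, _ in sites]
    lo = bisect_left(positions, start + 1)  # first 1-based position covered
    hi = bisect_right(positions, end)       # last 1-based position covered
    values = [ratio for _, ratio in sites[lo:hi]]
    return sum(values) / len(values) if values else None


# Made-up methylation levels and regions.
cytosines = {"chr22": [(100, 0.75), (150, 0.25), (300, 0.5)]}
bed_lines = ["track name=example", "chr22\t99\t150", "chr22\t200\t250"]

results = [
    (chrom, start, end, region_mean(cytosines, chrom, start, end))
    for chrom, start, end in parse_bed(bed_lines)
]
print(results)  # [('chr22', 99, 150, 0.5), ('chr22', 200, 250, None)]
```

The first region covers positions 100 and 150 and averages their levels; the second contains no cytosine in the toy data, so no statistic is reported for it.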

CONFIDENTIALITY ISSUES

The use of private data that should not leave the user's desktop is increasingly frequent, particularly in biomedical and biotechnological genome research; uploading such data to a public server is therefore often not an option. This is why we implemented the NGSmethDB API server and also developed a standalone Python client able to programmatically access all the NGSmethDB data while running on the user's desktop. The NGSmethDB API client has already been implemented within two multiplatform preconfigured virtual machines that come with all the dependencies installed: (i) the NGSmethDB API virtual machine (http://bioinfo2.ugr.es/NGSmethDB_API/NGSmethDB_API.ova), described in this paper, which allows downloading all the NGSmethDB data (single-cytosine methylation, methylation segments and differentially methylated cytosines); and (ii) MethFlow (http://bioinfo2.ugr.es:8080/MethFlow/), which can be used to obtain methylomes from private WGBS short-read data sets and compare them to the downloaded NGSmethDB methylomes. With these two tools, the user no longer needs to upload private data to any public server to carry out comparative analyses against NGSmethDB data.
