MethSMRT: an integrative database for DNA N6-methyladenine and N4-methylcytosine generated by single-molecular real-time sequencing.

Pohao Ye1, Yizhao Luan1, Kaining Chen1, Yizhi Liu1, Chuanle Xiao2, Zhi Xie3,4,5.   

Abstract

DNA methylation is an important type of epigenetic modification, of which 5-methylcytosine (5mC), 6-methyladenine (6mA) and 4-methylcytosine (4mC) are the most common forms. Previous efforts have largely focused on 5mC, providing invaluable insights into epigenetic regulation through DNA methylation. Recently developed single-molecule real-time (SMRT) sequencing technology provides a unique opportunity to detect the less studied DNA 6mA and 4mC modifications at single-nucleotide resolution. With the rapidly increasing amount of SMRT sequencing data, there is an emerging demand to systematically explore DNA 6mA and 4mC modifications from these data sets. MethSMRT is the first resource hosting DNA 6mA and 4mC methylomes. All the data sets were processed using the same analysis pipeline with the same quality control. The current version of the database provides a platform to store, browse, search and download epigenome-wide methylation profiles of 156 species, including seven eukaryotes such as Arabidopsis, C. elegans, Drosophila, mouse and yeast, as well as 149 prokaryotes. It also offers a genome browser to visualize the methylation sites and related information such as single nucleotide polymorphisms (SNPs) and genomic annotation. Furthermore, the database provides a quick summary of 6mA and 4mC methylome statistics and predicted methylation motifs for each species. MethSMRT is publicly available at http://sysbio.sysu.edu.cn/methsmrt/ without use restriction.
© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2016        PMID: 27924023      PMCID: PMC5210644          DOI: 10.1093/nar/gkw950

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

DNA methylation is an important type of epigenetic modification, which greatly expands the information content of DNA. The most common types of DNA methylation are 5-methylcytosine (5mC), 6-methyladenine (6mA) and 4-methylcytosine (4mC) (1). In eukaryotes, 5mC is the dominant type, playing an important role in gene regulation, transposon suppression and genomic imprinting (2). Aberrant 5mC patterns have been associated with many diseases and cancers (3). In retinoblastoma, for example, DNA hypermethylation silenced expression of the RAS-associated domain family 1A gene in tumor but not in normal tissue (4). In prokaryotes, 6mA and 4mC are the most prevalent DNA modifications and are primarily used for distinguishing host DNA from foreign pathogenic DNA (5). In eukaryotes, by contrast, 6mA and 4mC were thought to be present at minimal levels, detectable only by highly sensitive technologies (5). Only recently have several studies reported the epigenome-wide patterns of 6mA in eukaryotes, including Chlamydomonas, C. elegans and Drosophila, showing that 6mA is widespread in eukaryotes and has important functions in gene regulation and development (6–8).
To date, many DNA methylation databases have been constructed, providing invaluable resources for the epigenetic community. MethDB is the first database that stores DNA methylation profiles and associated gene expression information (9). NGSMethDB hosts DNA methylation profiles generated by bisulfite sequencing (10). MethBank focuses on methylome changes during embryonic development (11), while MethyCancer and MENT focus on cancers (12,13). PubMeth is another cancer methylation database, based on text-mining of published literature (14). All these databases host DNA 5mC profiles; to date, no database has provided DNA 6mA or 4mC information.
Recently developed single-molecule real-time (SMRT) sequencing technology, also called ‘third generation sequencing’, allows individual DNA molecules to be sequenced in real time without an amplification step (15). Monitoring the kinetic ‘signature’ of a base during the normal course of sequencing allows a base modification to be detected directly as an increased inter-pulse duration (IPD) compared to unmodified DNA bases, where IPD is defined as the time between fluorescence pulses (15). All three major types of DNA methylation can be detected from SMRT sequencing data. In particular, 6mA and 4mC produce strong kinetic signals and therefore require only 25-fold coverage for high-confidence detection (16). Because the present high-throughput techniques for DNA methylation mainly focus on the 5mC modification, SMRT sequencing technology provides a unique opportunity to detect the less studied DNA 6mA and 4mC modifications (16). With the rapidly increasing amount of SMRT sequencing data generated in the past several years, there is an emerging demand to systematically explore DNA 6mA and 4mC modifications from these data sets. Here, we present MethSMRT, the first database for DNA 6mA and 4mC methylomes, generated from publicly available SMRT sequencing data sets. The database provides a platform to host, analyze, browse, search and download 6mA and 4mC methylomes for 156 species, including seven eukaryotes and 149 prokaryotes. It also offers a genome browser to visualize the methylation profiles and related information such as single nucleotide polymorphisms (SNPs) and gene annotation. In addition, the database provides a quick summary of 6mA and 4mC methylome statistics and predicted methylation motifs for each species.
Because all the data sets in the database were processed using the same analysis pipeline with the same quality control, methylomes can be compared consistently between different species and data sets. MethSMRT is publicly available at http://sysbio.sysu.edu.cn/methsmrt/ without use restriction.

MATERIALS AND METHODS

Data sources

The SMRT sequencing data sets were downloaded from the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) (17) and Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) (18). DNA 6mA and 4mC sites were detected from the SMRT sequencing data sets using a unified pipeline as described in the next section. The current version of MethSMRT includes a total of 156 species, comprising seven eukaryotes such as Arabidopsis, C. elegans, Drosophila, mouse and yeast, as well as 149 prokaryotes. The associated SNP data were downloaded from dbSNP (19) or species-specific databases. The reference genomes were downloaded from the NCBI or IMG databases (see Supplementary Table S1). Available 5mC modification sites of Arabidopsis and mouse were downloaded from NGSMethDB (10).

DNA modification detection

We used the PacBio SMRT analysis platform (version: 2.3.0) for DNA 6mA and 4mC modification detection (http://www.pacb.com/products-and-services/analytical-software/smrt-analysis/analysis-applications/epigenetics/). Briefly, raw data files in the h5 format were downloaded. Raw reads were first filtered using SFilter to remove sequencing adapters, short reads (read length less than 50 nucleotides (nt)) and reads with a low-quality region (read score < 0.75 by default). The filtered reads were aligned to the reference genome using pbalign (version: 0.2.0.1). Kinetic analysis of the aligned DNA sequencing data was used to identify DNA 6mA and 4mC modifications using the default parameters of the SMRT analysis platform. Following PacBio's recommendations, 6mA or 4mC sites with less than 25-fold coverage per strand were removed from further analysis. The modification score is defined, according to the PacBio manual, as the phred-transformed P-value that a kinetic deviation exists at the position. To retain only reliable modification sites, sites with a modification score < 20 were filtered out.
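The thresholds described above can be illustrated with a minimal Python sketch. It is not part of the SMRT analysis platform or the MethSMRT pipeline: the function and field names are hypothetical, and the phred transform is assumed to take the usual form Q = -10 log10(P), under which a score of 20 corresponds to P = 0.01.

```python
import math
from dataclasses import dataclass

# Thresholds taken from the text above.
MIN_READ_LENGTH = 50    # reads shorter than 50 nt are removed
MIN_READ_SCORE = 0.75   # reads with read score < 0.75 are removed
MIN_COVERAGE = 25       # per-strand coverage required for a site
MIN_MOD_SCORE = 20      # sites with modification score < 20 are removed


def phred_score(p_value: float) -> float:
    """Phred-transform a P-value, assuming Q = -10 * log10(P)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    return -10.0 * math.log10(p_value)


@dataclass(frozen=True)
class Site:
    """A candidate modification site (hypothetical record layout)."""
    position: int
    strand: str
    mod_type: str    # e.g. "6mA" or "4mC"
    coverage: int    # per-strand coverage
    p_value: float   # P-value that a kinetic deviation exists


def keep_read(length: int, read_score: float) -> bool:
    """Read-level filter: drop short or low-quality reads."""
    return length >= MIN_READ_LENGTH and read_score >= MIN_READ_SCORE


def keep_site(site: Site) -> bool:
    """Site-level filter: require enough coverage and a high enough score."""
    return (site.coverage >= MIN_COVERAGE
            and phred_score(site.p_value) >= MIN_MOD_SCORE)
```

For example, a site with 40-fold coverage and P = 0.001 (score 30) passes both filters, whereas the same site with P = 0.05 (score of about 13) is discarded.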

Data analysis

To obtain the distributions of 4mC and 6mA sites on genomes, we classified the modification sites into exon, intron, intergenic, promoter and UTR regions for eukaryotes. For prokaryotes, the modification sites were classified into exon, intergenic and promoter regions. Exon, intron, UTR and intergenic regions were defined according to the annotation from NCBI RefSeq (20) or the JGI database (21). The genomic regions were extracted using BEDTools (v2.19.1) (22). For promoters of eukaryotic genes, we used the region from 500 nt upstream to 500 nt downstream of the transcription start site (23). For promoters of prokaryotic genes, we used the region from 100 nt upstream to 50 nt downstream of the start of the coding DNA sequence (CDS) (24). To infer DNA modification motifs, we first extracted the sequence from 4 nt upstream to 4 nt downstream of each modification site. Duplicated sequences were excluded and MEME (version: 4.11.1) was used to predict enriched motifs (25). For 5mC profiles of Arabidopsis or mouse, we combined all the available methylation sites of different samples downloaded from NGSMethDB. The methylation score was defined as the number of methylated reads at a given position divided by the total number of mapped reads at that position. When multiple 5mC events were reported at the same position, the maximum of the methylation scores was used.
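The promoter windows, the 9-nt motif contexts and the 5mC methylation score defined above can be sketched as follows. This is an illustrative sketch only: all function names are hypothetical, coordinates are assumed to be 1-based and inclusive, the strand handling of promoters is an assumption not stated in the text, and for simplicity the context extraction covers plus-strand sites only (no reverse complement).

```python
from __future__ import annotations


def promoter_window(start_site: int, strand: str,
                    upstream: int, downstream: int) -> tuple[int, int]:
    """Inclusive (start, end) promoter interval around a TSS or CDS start.

    Assumption: 'upstream' lies at lower coordinates on the plus strand
    and at higher coordinates on the minus strand.
    """
    if strand == "+":
        lo, hi = start_site - upstream, start_site + downstream
    elif strand == "-":
        lo, hi = start_site - downstream, start_site + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, lo), hi


def site_contexts(genome: str, positions: list[int], flank: int = 4) -> list[str]:
    """Unique (2 * flank + 1)-nt windows centred on 1-based site positions.

    Sites too close to a sequence end are skipped; duplicated windows are
    excluded, as in the text, before motif discovery.
    """
    seen: set[str] = set()
    windows: list[str] = []
    for pos in positions:
        start, end = pos - 1 - flank, pos + flank  # 0-based half-open slice
        if start < 0 or end > len(genome):
            continue
        window = genome[start:end].upper()
        if window not in seen:
            seen.add(window)
            windows.append(window)
    return windows


def methylation_score(methylated_reads: int, total_reads: int) -> float:
    """Methylated reads divided by all mapped reads at a position."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return methylated_reads / total_reads


def combined_5mc_score(events: list[tuple[int, int]]) -> float:
    """Maximum score over several (methylated, total) events at one position."""
    return max(methylation_score(m, t) for m, t in events)
```

With these definitions, `promoter_window(1000, "+", 500, 500)` gives the eukaryotic window (500, 1500), and `promoter_window(1000, "-", 100, 50)` gives the prokaryotic window (950, 1100) for a minus-strand gene. The unique windows returned by `site_contexts` would then be written out as input for a motif finder such as MEME.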

Database implementation

The database was organized using MySQL and queried using PHP scripts. The web pages were constructed using HTML5 with JavaScript. To provide a smooth and user-friendly interface, we used the Bootstrap front-end framework (http://getbootstrap.com/). The graphs in the Browse pages were produced with the D3.js library (https://d3js.org/). JBrowse was used to browse genomes and visualize modification sites (26).

RESULTS

Usage and access

The main functionality of MethSMRT is shown in Figure 1, including browsing, searching, visualizing and downloading DNA 6mA and 4mC modifications at single-nucleotide resolution for 156 species.
Figure 1.

Functionality of MethSMRT.

Search

Users can query the 6mA and 4mC profiles by gene name or genomic location. To search by gene name, users first select a species and enter a gene symbol or Ensembl ID in the search box. Alternatively, users can search the database by entering a genomic region on a chromosome. The output page returns the 6mA and 4mC profiles of the given gene or region. By default, the output page displays the genomic location of each methylation site, methylation type, associated gene symbol, read coverage, strand information and sequence context of the site (Supplementary Figure S1). Features of the search page include: (i) users can sort the table by clicking the column names; (ii) users can selectively display the methylation sites based on coverage or methylation type; (iii) users can selectively display the columns of the table; and (iv) users can export the output page in CSV format.
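The region search, the coverage and methylation-type filters and the CSV export described above can be mimicked offline with a short Python sketch. It does not reproduce the MethSMRT server code (which uses MySQL and PHP); the column names and the example records are made up for illustration.

```python
import csv
import io

# Hypothetical column names modelled on the default output table.
FIELDS = ["chrom", "position", "mod_type", "gene", "coverage", "strand", "context"]

# Made-up example records, not real MethSMRT data.
SITES = [
    {"chrom": "chr1", "position": 120, "mod_type": "6mA", "gene": "geneA",
     "coverage": 42, "strand": "+", "context": "ACGTAACGT"},
    {"chrom": "chr1", "position": 300, "mod_type": "4mC", "gene": "geneA",
     "coverage": 55, "strand": "-", "context": "TTGACGGAT"},
    {"chrom": "chr1", "position": 450, "mod_type": "6mA", "gene": "geneB",
     "coverage": 26, "strand": "+", "context": "GGCTATCCA"},
    {"chrom": "chr2", "position": 200, "mod_type": "6mA", "gene": "geneC",
     "coverage": 80, "strand": "+", "context": "CCGTAAGGT"},
]


def search_region(sites, chrom, start, end, min_coverage=0, mod_type=None):
    """Return sites in chrom:start-end (inclusive), optionally filtered
    by minimum coverage and methylation type, sorted by position."""
    hits = [
        s for s in sites
        if s["chrom"] == chrom
        and start <= s["position"] <= end
        and s["coverage"] >= min_coverage
        and (mod_type is None or s["mod_type"] == mod_type)
    ]
    return sorted(hits, key=lambda s: s["position"])


def to_csv(rows):
    """Serialize result rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

For instance, `to_csv(search_region(SITES, "chr1", 100, 500, min_coverage=30, mod_type="6mA"))` yields a header line followed by the single record at position 120.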

Genome browser

MethSMRT provides an interactive genome browser to query and visualize the 6mA and 4mC profiles and related information. Clicking the ‘view’ icon in the JBrowse column on the leftmost side of the output table starts the genome browser (Figure 2). The top left section of the browser provides checkboxes for six feature tracks: the 6mA/4mC track, reference genome, IPD ratio score, gene annotation, SNP and 5mC track, where the 5mC track is only available for Arabidopsis and mouse in the current version of MethSMRT. The 6mA/4mC track shows the methylation sites; by clicking a site, users can obtain its detailed attributes, such as coverage, sequence context, exact position and IPD score. The reference genome track shows the genomic sequence and the gene annotation track shows the genomic structure of genes. The SNP track displays the SNP information and the IPD ratio track indicates the kinetic signature of the methylation site. The 5mC track displays previously identified 5mC sites.
Figure 2.

Screenshot of the genome browser. (A) Six information tracks available in the browser: (1) 6mA/4mC track, (2) reference genome track, (3) IPD track, (4) gene annotation track, (5) 5mC track and (6) SNP track. Note that the 5mC track is only available for Arabidopsis and mouse. (B) Zoomed-in view of a methylation site, with reference genome sequence and IPD signal. (C) Detailed information on a methylation site is available when clicking the site in the browser. (D) Detailed information on the gene annotation is available when clicking the gene annotation track.

Browse

The browse page displays meta-information of each SMRT data set and a summary of the 6mA and 4mC profiles (Supplementary Figure S2). The meta-information of a given SMRT data set includes the size and number of runs of the data, SRA ID, PubMed ID and a quick link to the processed file in GFF format. The page also displays statistics on the genomic location of the methylation sites, a histogram of coverage in log2 scale and a histogram of modification score in log2 scale (see Materials and Methods section for details). In addition, the page displays the consensus sequence motif of the methylation sites.

Download

MethSMRT provides two ways to download the methylation profiles. First, the download page provides GFF files that contain epigenome-wide 6mA and 4mC modification sites. Because there are over 150 species in the database, we provide a search box so that users can quickly find a species by species name, SRA ID or taxon ID. Second, users can download the search results from the Search page in CSV format.
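A downloaded GFF file can be read with a few lines of Python. The sketch below relies only on the standard nine tab-separated GFF columns; the attribute keys in the example line (coverage, context, IPDRatio), the source name and the feature type are assumptions for illustration and should be checked against the actual files.

```python
GFF_COLUMNS = 9


def parse_gff_line(line):
    """Parse one tab-separated GFF feature line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != GFF_COLUMNS:
        raise ValueError(f"expected {GFF_COLUMNS} columns, got {len(cols)}")
    seqid, source, feature_type, start, end, score, strand, phase, attrs = cols
    # Attributes are 'key=value' pairs separated by semicolons.
    attributes = dict(
        item.split("=", 1) for item in attrs.split(";") if "=" in item
    )
    return {
        "seqid": seqid,
        "source": source,
        "type": feature_type,
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


def parse_gff(lines):
    """Yield parsed features, skipping blank lines and '#' comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        yield parse_gff_line(line)


# Made-up example content, not taken from a real MethSMRT file.
EXAMPLE = [
    "##gff-version 3\n",
    "chr1\tkinModCall\tm6A\t120\t120\t35\t+\t.\t"
    "coverage=42;context=ACGTAACGT;IPDRatio=3.1\n",
]
records = list(parse_gff(EXAMPLE))
```

Because `parse_gff` is a generator, an epigenome-wide file can be streamed line by line from an open file handle without loading it into memory.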

DISCUSSION

MethSMRT is the first resource of DNA 6mA and 4mC methylomes generated by SMRT sequencing technology. With the release of the new SMRT sequencing platform, the Sequel system from Pacific Biosciences, sequencing costs have fallen further, and we expect that more SMRT data sets will be generated at an increasing pace. MethSMRT will continue to be updated with new data sets. It should be noted that although SMRT data sets of the human and gorilla genomes have recently been reported (27,28), the current PacBio SMRT analysis platform requires a very large amount of RAM (estimated > 2 Tb) to process these data sets, and we are in the process of parallelizing the analysis platform. We will include the 6mA and 4mC methylomes for human and gorilla in the near future. To sum up, MethSMRT integrates and visualizes DNA 6mA and 4mC methylomes at single-nucleotide resolution. Together with many other useful DNA modification databases, MethSMRT will be an important epigenetic resource that facilitates the discovery of methylation events and the understanding of their biological functions.
  28 in total

Review 1.  An Adenine Code for DNA: A Second Life for N6-Methyladenine.

Authors:  Holger Heyn; Manel Esteller
Journal:  Cell       Date:  2015-04-30       Impact factor: 41.582

2.  N6-methyldeoxyadenosine marks active transcription start sites in Chlamydomonas.

Authors:  Ye Fu; Guan-Zheng Luo; Kai Chen; Xin Deng; Miao Yu; Dali Han; Ziyang Hao; Jianzhao Liu; Xingyu Lu; Louis C Dore; Xiaocheng Weng; Quanjiang Ji; Laurens Mets; Chuan He
Journal:  Cell       Date:  2015-04-30       Impact factor: 41.582

3.  N6-methyladenine DNA modification in Drosophila.

Authors:  Guoqiang Zhang; Hua Huang; Di Liu; Ying Cheng; Xiaoling Liu; Wenxin Zhang; Ruichuan Yin; Dapeng Zhang; Peng Zhang; Jianzhao Liu; Chaoyi Li; Baodong Liu; Yuewan Luo; Yuanxiang Zhu; Ning Zhang; Shunmin He; Chuan He; Hailin Wang; Dahua Chen
Journal:  Cell       Date:  2015-04-30       Impact factor: 41.582

Review 4.  Entering the era of bacterial epigenomics with single molecule real time DNA sequencing.

Authors:  Brigid M Davis; Michael C Chao; Matthew K Waldor
Journal:  Curr Opin Microbiol       Date:  2013-02-19       Impact factor: 7.934

5.  The Gene Expression Omnibus Database.

Authors:  Emily Clough; Tanya Barrett
Journal:  Methods Mol Biol       Date:  2016

6.  Aberrant promoter methylation and silencing of the RASSF1A gene in pediatric tumors and cell lines.

Authors:  Kenichi Harada; Shinichi Toyooka; Anirban Maitra; Riichiroh Maruyama; Kiyomi O Toyooka; Charles F Timmons; Gail E Tomlinson; Domenico Mastrangelo; Robert J Hay; John D Minna; Adi F Gazdar
Journal:  Oncogene       Date:  2002-06-20       Impact factor: 9.867

7.  The genome portal of the Department of Energy Joint Genome Institute: 2014 updates.

Authors:  Henrik Nordberg; Michael Cantor; Serge Dusheyko; Susan Hua; Alexander Poliakov; Igor Shabalov; Tatyana Smirnova; Igor V Grigoriev; Inna Dubchak
Journal:  Nucleic Acids Res       Date:  2013-11-12       Impact factor: 16.971

8.  The Epigenomic Landscape of Prokaryotes.

Authors:  Matthew J Blow; Tyson A Clark; Chris G Daum; Adam M Deutschbauer; Alexey Fomenkov; Roxanne Fries; Jeff Froula; Dongwan D Kang; Rex R Malmstrom; Richard D Morgan; Janos Posfai; Kanwar Singh; Axel Visel; Kelly Wetmore; Zhiying Zhao; Edward M Rubin; Jonas Korlach; Len A Pennacchio; Richard J Roberts
Journal:  PLoS Genet       Date:  2016-02-12       Impact factor: 5.917

9.  MEME SUITE: tools for motif discovery and searching.

Authors:  Timothy L Bailey; Mikael Boden; Fabian A Buske; Martin Frith; Charles E Grant; Luca Clementi; Jingyuan Ren; Wilfred W Li; William S Noble
Journal:  Nucleic Acids Res       Date:  2009-05-20       Impact factor: 16.971

10.  Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.

Authors:  Nuala A O'Leary; Mathew W Wright; J Rodney Brister; Stacy Ciufo; Diana Haddad; Rich McVeigh; Bhanu Rajput; Barbara Robbertse; Brian Smith-White; Danso Ako-Adjei; Alexander Astashyn; Azat Badretdin; Yiming Bao; Olga Blinkova; Vyacheslav Brover; Vyacheslav Chetvernin; Jinna Choi; Eric Cox; Olga Ermolaeva; Catherine M Farrell; Tamara Goldfarb; Tripti Gupta; Daniel Haft; Eneida Hatcher; Wratko Hlavina; Vinita S Joardar; Vamsi K Kodali; Wenjun Li; Donna Maglott; Patrick Masterson; Kelly M McGarvey; Michael R Murphy; Kathleen O'Neill; Shashikant Pujar; Sanjida H Rangwala; Daniel Rausch; Lillian D Riddick; Conrad Schoch; Andrei Shkeda; Susan S Storz; Hanzhen Sun; Francoise Thibaud-Nissen; Igor Tolstoy; Raymond E Tully; Anjana R Vatsan; Craig Wallin; David Webb; Wendy Wu; Melissa J Landrum; Avi Kimchi; Tatiana Tatusova; Michael DiCuccio; Paul Kitts; Terence D Murphy; Kim D Pruitt
Journal:  Nucleic Acids Res       Date:  2015-11-08       Impact factor: 16.971
