
Benchmark data for identifying N(6)-methyladenosine sites in the Saccharomyces cerevisiae genome.

Wei Chen, Pengmian Feng, Hui Ding, Hao Lin, Kuo-Chen Chou.

Abstract

This data article contains the benchmark dataset for training and testing iRNA-Methyl, a web-server predictor for identifying N(6)-methyladenosine sites in RNA (Chen et al., 2015 [15]). It can also be used to develop other predictors for identifying N(6)-methyladenosine sites in the Saccharomyces cerevisiae genome.


Keywords:  N6-methyladenosine sites; PseAAC; PseKNC

Year:  2015        PMID: 26958595      PMCID: PMC4773366          DOI: 10.1016/j.dib.2015.09.008

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Value of the data

N6-methyladenosine (m6A) is one of the most abundant RNA methylations and plays important roles in many biological processes [15]. To understand the regulatory mechanism of m6A in depth, it is indispensable to characterize its sites on a genome-wide scale. The data can be used to develop computational predictors or high-throughput tools for identifying m6A sites in RNA.

Background

A benchmark dataset for developing computational methods to identify methylation sites in DNA (see, e.g., [16]) is available [17], and the information obtained from it is very useful for both basic research and drug development. So far, however, no benchmark dataset has been available for developing computational methods to identify N6-methyladenosine sites in RNA. The present study was initiated to construct a benchmark dataset for the latter, based on the experimental observations recently reported by Schwartz et al. [18].

Data, experimental design, materials and methods

The data presented here are the benchmark dataset for training and testing iRNA-Methyl [15] (http://lin.uestc.edu.cn/server/iRNA-Methyl), a web-server predictor for identifying m6A sites in the S. cerevisiae genome. Using the m6A-seq technique, Schwartz et al. [18] first identified 1,307 methylated adenine (m6A) sites in the S. cerevisiae genome. They observed that most of the m6A sites share the consensus motif GAC, whose central base (A) may be methylated [18]. To construct the corresponding negative benchmark dataset, we used the flexible sliding window approach [19], [20] to search the S. cerevisiae genome and obtained 33,280 RNA segments that contain exactly the same GAC consensus motif but were not detected as methylated sites by the m6A-seq technique. Furthermore, preliminary tests showed that the outcomes were most promising when the RNA segments thus derived were 51 nt long [15]. Accordingly, the 1,307 and 33,280 RNA segments, each 51 nt long, were designated as positive and negative samples, respectively.

Because the negative samples thus obtained greatly outnumber the positive samples, and to minimize the false predictions that such a highly skewed benchmark dataset would cause, we randomly picked 1,307 RNA segments from the 33,280 negative samples to form a negative subset of the same size as the positive one. The final benchmark dataset therefore contains 1,307 positive samples and 1,307 negative samples. Their detailed sequences are given in Appendix A. They can also be downloaded at the web-site http://lin.uestc.edu.cn/server/iRNAMethyl/data.
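The construction described above can be illustrated with a minimal Python sketch. It is not the authors' code: it uses a fixed-width window rather than the flexible sliding window approach of [19], [20], and the function names, the `methylated_centers` set of experimentally detected positions, and the random seed are all hypothetical choices made for illustration.

```python
import random

def gac_windows(seq, flank=25):
    """Return (center, window) pairs for each GAC motif with a full window.

    `center` is the 0-based index of the A in GAC; each window spans
    `flank` bases on either side of it (51 bases when flank=25).
    """
    seq = seq.upper()
    found = []
    start = seq.find("GAC")
    while start != -1:
        center = start + 1  # position of the central A
        # Keep only motifs far enough from both ends for a full window.
        if center - flank >= 0 and center + flank < len(seq):
            found.append((center, seq[center - flank:center + flank + 1]))
        start = seq.find("GAC", start + 1)
    return found

def split_samples(windows, methylated_centers):
    """Split windows into positives (detected m6A sites) and negatives."""
    positives = [w for c, w in windows if c in methylated_centers]
    negatives = [w for c, w in windows if c not in methylated_centers]
    return positives, negatives

def undersample(negatives, n, seed=0):
    """Randomly pick n negatives without replacement (seed is an assumption)."""
    return random.Random(seed).sample(negatives, n)

# Toy usage on an artificial sequence with two GAC motifs.
toy = "U" * 25 + "GAC" + "U" * 30 + "GAC" + "U" * 25
pos, neg = split_samples(gac_windows(toy), methylated_centers={26})
balanced_neg = undersample(neg, len(pos))
```

Undersampling the negatives to the size of the positive set mirrors the 1,307 versus 1,307 balance of the final benchmark dataset; the actual subset drawn by the authors is the one distributed in Appendix A.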

Conflict of interest

The authors declare no conflict of interest.

Specifications table

Subject area: Biology
More specific subject area: Bioinformatics, computational biology, biomedicine
Type of data: Text file
How data was acquired: Using the flexible sliding window approach
Data format: Analyzed
Experimental factors: N/A
Experimental features: Each RNA sample was formulated by combining its dinucleotide composition (DNC) [1], [2] with the pseudo components [3], since nearly all machine-learning algorithms can only handle vectors [4]. The concept of pseudo components was originally introduced to reflect the sequence patterns of protein sequences via a series of vector components [5], [6] and has been widely used in computational proteomics [7]. Recently, it has been successfully extended to cover DNA [8], [9], [10], [11] and RNA sequences [12], [13] as well. For the detailed development process in this regard, see a recent review article [14].
Data source location: Chengdu 610054, China
Data accessibility: In Appendix A of this paper and at the web-site http://lin.uestc.edu.cn/server/iRNAMethy/data
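
As a rough illustration of the dinucleotide composition (DNC) mentioned under the experimental features, the sketch below turns one RNA segment into a 16-dimensional frequency vector. It covers only the DNC part; the pseudo components of [3] are not reproduced here. The function name, the A/C/G/U ordering of the vector, and the conversion of T to U are assumptions made for this example, not details taken from the article.

```python
from itertools import product

def dinucleotide_composition(seq, alphabet="ACGU"):
    """Return the 16 dinucleotide frequencies of an RNA segment.

    Frequencies are ordered AA, AC, AG, AU, CA, ... UU and are
    normalized by the number of valid overlapping dinucleotides.
    """
    seq = seq.upper().replace("T", "U")  # assumption: accept DNA-style input
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing ambiguous bases such as N
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in pairs]

# Toy usage: a short segment instead of a full 51-base sample.
vector = dinucleotide_composition("GACGAC")
```

A fixed-length vector of this kind is what lets segments be passed to standard machine-learning algorithms, which is the motivation given in the table above.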
  20 in total

1.  iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition.

Authors:  Hao Lin; En-Ze Deng; Hui Ding; Wei Chen; Kuo-Chen Chou
Journal:  Nucleic Acids Res       Date:  2014-10-31       Impact factor: 16.971

2.  Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences.

Authors:  Wei Chen; Hao Lin; Kuo-Chen Chou
Journal:  Mol Biosyst       Date:  2015-10

3.  repRNA: a web server for generating various feature vectors of RNA sequences.

Authors:  Bin Liu; Fule Liu; Longyun Fang; Xiaolong Wang; Kuo-Chen Chou
Journal:  Mol Genet Genomics       Date:  2015-06-18       Impact factor: 3.291

4.  iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition.

Authors:  Zi Liu; Xuan Xiao; Wang-Ren Qiu; Kuo-Chen Chou
Journal:  Anal Biochem       Date:  2015-01-14       Impact factor: 3.365

5.  repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects.

Authors:  Bin Liu; Fule Liu; Longyun Fang; Xiaolong Wang; Kuo-Chen Chou
Journal:  Bioinformatics       Date:  2014-12-10       Impact factor: 6.937

6.  iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition.

Authors:  Wei Chen; Peng-Mian Feng; En-Ze Deng; Hao Lin; Kuo-Chen Chou
Journal:  Anal Biochem       Date:  2014-07-10       Impact factor: 3.365

7.  PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition.

Authors:  Wei Chen; Tian-Yu Lei; Dian-Chuan Jin; Hao Lin; Kuo-Chen Chou
Journal:  Anal Biochem       Date:  2014-04-13       Impact factor: 3.365

8.  High-resolution mapping reveals a conserved, widespread, dynamic mRNA methylation program in yeast meiosis.

Authors:  Schraga Schwartz; Sudeep D Agarwala; Maxwell R Mumbach; Marko Jovanovic; Philipp Mertins; Alexander Shishkin; Yuval Tabach; Tarjei S Mikkelsen; Rahul Satija; Gary Ruvkun; Steven A Carr; Eric S Lander; Gerald R Fink; Aviv Regev
Journal:  Cell       Date:  2013-11-21       Impact factor: 41.582

9.  Benchmark data for identifying DNA methylation sites via pseudo trinucleotide composition.

Authors:  Zi Liu; Xuan Xiao; Wang-Ren Qiu; Kuo-Chen Chou
Journal:  Data Brief       Date:  2015-05-07

10.  iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition.

Authors:  Wei Chen; Peng-Mian Feng; Hao Lin; Kuo-Chen Chou
Journal:  Biomed Res Int       Date:  2014-05-21       Impact factor: 3.411

