
Benchmark data for identifying N(6)-methyladenosine sites in the Saccharomyces cerevisiae genome.

Wei Chen¹, Pengmian Feng², Hui Ding², Hao Lin³, Kuo-Chen Chou⁴.

Abstract

This data article contains the benchmark dataset for training and testing iRNA-Methyl, a web-server predictor for identifying N(6)-methyladenosine sites in RNA (Chen et al., 2015 [15]). It can also be used to develop other predictors for identifying N(6)-methyladenosine sites in the Saccharomyces cerevisiae genome.

Entities: Chemical Species

Keywords: N6-methyladenosine sites; PseAAC; PseKNC

Year: 2015 PMID: 26958595 PMCID: PMC4773366 DOI: 10.1016/j.dib.2015.09.008

Source DB: PubMed Journal: Data Brief ISSN: 2352-3409

Specifications table

Value of the data

N6-methyladenosine (m6A) is one of the most abundant RNA methylations and plays very important roles in many biological processes [15]. For an in-depth understanding of the regulatory mechanism of m6A, it is indispensable to characterize its sites on a genome-wide scale. The data can be used to develop computational predictors or high-throughput tools for identifying the m6A sites in RNA.

Background

The benchmark dataset for developing computational methods to identify methylation sites in DNA (see, e.g., [16]) is available [17], and the information thus obtained is very useful for both basic research and drug development. So far, however, no benchmark dataset has been available for developing computational methods to identify N6-methyladenosine sites in RNA. The present study was initiated to construct a benchmark dataset for the latter, based on the experimental observations recently reported by Schwartz et al. [18].

Data, experimental design, materials and methods

The data presented here are the benchmark dataset for training and testing iRNA-Methyl [15] (http://lin.uestc.edu.cn/server/iRNA-Methyl), a web-server predictor for identifying m6A sites in the S. cerevisiae genome. Using the m6A-seq technique, Schwartz et al. [18] first identified 1,307 methylated adenine (m6A) sites in the S. cerevisiae genome. They observed that most of the m6A sites share the consensus motif GAC, whose central base may be methylated [18].

To construct the corresponding negative benchmark dataset, we used the flexible sliding window approach [19], [20] to search the S. cerevisiae genome and obtained 33,280 RNA segments that contain exactly the same GAC consensus motif but were not detected as methylated sites by the m6A-seq technique. Furthermore, preliminary tests had shown that the outcomes were most promising when the RNA segments thus derived were 51 nt long [15]. Accordingly, the 1,307 and 33,280 RNA segments, each 51 nt long, were designated as positive and negative samples, respectively.

Because the negative samples thus obtained vastly outnumber the positive samples, we randomly picked 1,307 RNA segments from the 33,280 negative samples to form a negative subset of the same size as the positive one, so as to minimize the false predictions caused by such a highly skewed benchmark dataset. The final benchmark dataset thus contains 1,307 positive samples and 1,307 negative samples. Their detailed sequences are given in Appendix A. They can also be downloaded from the web-site http://lin.uestc.edu.cn/server/iRNAMethyl/data.
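
The construction of the candidate segments and the balanced negative subset described above can be sketched in Python as follows. This is a minimal illustration and not the authors' pipeline: the function names (`gac_windows`, `balance_negatives`), the choice to centre each 51-nucleotide window on the A of GAC, and the fixed random seed are assumptions made here, and the flexible sliding window approach of [19], [20] may differ in detail.

```python
import random

MOTIF = "GAC"   # consensus motif reported by Schwartz et al. [18]
WINDOW = 51     # segment length reported in the text


def gac_windows(sequence, window=WINDOW):
    """Yield (center_index, segment) for every GAC occurrence whose
    central A has a full flank on both sides (assumed centring)."""
    flank = window // 2          # 25 nucleotides on each side for window=51
    seq = sequence.upper()
    start = seq.find(MOTIF)
    while start != -1:
        center = start + 1       # index of the A in GAC
        if center - flank >= 0 and center + flank < len(seq):
            yield center, seq[center - flank:center + flank + 1]
        start = seq.find(MOTIF, start + 1)


def balance_negatives(negatives, n_positive, seed=0):
    """Randomly draw n_positive negatives without replacement so the
    negative subset matches the positive set in size."""
    rng = random.Random(seed)    # hypothetical seed, for reproducibility only
    return rng.sample(negatives, n_positive)
```

In such a sketch, segments whose centre coincides with an experimentally detected m6A site would be labelled positive and all remaining GAC-centred segments negative; `balance_negatives(negatives, 1307)` would then reduce the 33,280 negatives to a subset of 1,307.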

Conflict of interest

The authors declare no conflict of interest.

Subject area	Biology
More specific subject area	Bioinformatics, computational biology, biomedicine
Type of data	Text file
How data was acquired	Using the flexible sliding window approach
Data format	Analyzed
Experimental factors	N/A
Experimental features	Each RNA sample was formulated by combining its dinucleotide composition (DNC) [1], [2] with the pseudo components [3], since nearly all machine-learning algorithms can only handle vectors [4]. The concept of pseudo components was originally introduced to reflect the sequence patterns of protein sequences via a series of vector components [5], [6] and has been widely used in computational proteomics [7]. Recently, it has been successfully extended to cover DNA [8], [9], [10], [11] and RNA sequences [12], [13] as well. For the detailed development process in this regard, see a recent review article [14].
Data source location	Chengdu 610054, China
Data accessibility	In Appendix A of this paper and at the web-site http://lin.uestc.edu.cn/server/iRNAMethy/data
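
As a rough illustration of the dinucleotide composition (DNC) mentioned under "Experimental features", the sketch below computes the 16 normalized dinucleotide frequencies of an RNA segment. The function name `dinucleotide_composition`, the ordering of the components, and the normalization by the number of counted dinucleotides are assumptions made here. The pseudo components that are combined with DNC [3] are not reproduced.

```python
from itertools import product

ALPHABET = "ACGU"
# Fixed ordering of the 16 dinucleotides: AA, AC, AG, AU, CA, ..., UU
DINUCLEOTIDES = ["".join(pair) for pair in product(ALPHABET, repeat=2)]


def dinucleotide_composition(seq):
    """Return a 16-component vector of dinucleotide frequencies.

    T is treated as U so that genomic (DNA-alphabet) input also works;
    dinucleotides containing other symbols (e.g. N) are skipped.
    """
    seq = seq.upper().replace("T", "U")
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DINUCLEOTIDES)
    return [counts[d] / total for d in DINUCLEOTIDES]
```

A 51-nucleotide segment contains 50 overlapping dinucleotides, so each component of the resulting vector is a count divided by 50 when the segment contains only A, C, G and U.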

20 in total

1. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition.

Authors: Hao Lin; En-Ze Deng; Hui Ding; Wei Chen; Kuo-Chen Chou
Journal: Nucleic Acids Res Date: 2014-10-31 Impact factor: 16.971

2. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. (Review)

Authors: Wei Chen; Hao Lin; Kuo-Chen Chou
Journal: Mol Biosyst Date: 2015-10

3. repRNA: a web server for generating various feature vectors of RNA sequences.

Authors: Bin Liu; Fule Liu; Longyun Fang; Xiaolong Wang; Kuo-Chen Chou
Journal: Mol Genet Genomics Date: 2015-06-18 Impact factor: 3.291

4. iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition.

Authors: Zi Liu; Xuan Xiao; Wang-Ren Qiu; Kuo-Chen Chou
Journal: Anal Biochem Date: 2015-01-14 Impact factor: 3.365

5. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects.

Authors: Bin Liu; Fule Liu; Longyun Fang; Xiaolong Wang; Kuo-Chen Chou
Journal: Bioinformatics Date: 2014-12-10 Impact factor: 6.937

6. iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition.

Authors: Wei Chen; Peng-Mian Feng; En-Ze Deng; Hao Lin; Kuo-Chen Chou
Journal: Anal Biochem Date: 2014-07-10 Impact factor: 3.365

7. PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition.

Authors: Wei Chen; Tian-Yu Lei; Dian-Chuan Jin; Hao Lin; Kuo-Chen Chou
Journal: Anal Biochem Date: 2014-04-13 Impact factor: 3.365

8. High-resolution mapping reveals a conserved, widespread, dynamic mRNA methylation program in yeast meiosis.

Authors: Schraga Schwartz; Sudeep D Agarwala; Maxwell R Mumbach; Marko Jovanovic; Philipp Mertins; Alexander Shishkin; Yuval Tabach; Tarjei S Mikkelsen; Rahul Satija; Gary Ruvkun; Steven A Carr; Eric S Lander; Gerald R Fink; Aviv Regev
Journal: Cell Date: 2013-11-21 Impact factor: 41.582

9. Benchmark data for identifying DNA methylation sites via pseudo trinucleotide composition.

Authors: Zi Liu; Xuan Xiao; Wang-Ren Qiu; Kuo-Chen Chou
Journal: Data Brief Date: 2015-05-07

10. iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition.

Authors: Wei Chen; Peng-Mian Feng; Hao Lin; Kuo-Chen Chou
Journal: Biomed Res Int Date: 2014-05-21 Impact factor: 3.411

1 in total

1. M6A-BiNP: predicting N⁶-methyladenosine sites based on bidirectional position-specific propensities of polynucleotides and pointwise joint mutual information.

Authors: Mingzhao Wang; Juanying Xie; Shengquan Xu
Journal: RNA Biol Date: 2021-06-23 Impact factor: 4.652
