
Benchmark data for identifying DNA methylation sites via pseudo trinucleotide composition.

Zi Liu, Xuan Xiao, Wang-Ren Qiu, Kuo-Chen Chou.

Abstract

This data article contains three benchmark datasets for training and testing iDNA-Methyl, a web-server predictor for identifying DNA methylation sites [Liu et al. Anal. Biochem. 474 (2015) 69-79].


Year:  2015        PMID: 26217768      PMCID: PMC4510404          DOI: 10.1016/j.dib.2015.04.021

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications table

Value of the data

DNA methylation plays an important role in regulating a variety of biological processes and is important for both basic research and drug development. The datasets presented here are well suited to testing algorithms that identify DNA methylation sites because they are realistic and highly unbalanced. From the first dataset (Supplementary material, File 1), users can take the original sequences to construct their own benchmark dataset. The 2nd dataset (Supplementary material, File 2) and the 3rd dataset (Supplementary material, File 1) can be used to design new predictors for identifying methylation sites.

Data, experimental design, materials and methods

The data presented here are three benchmark datasets for training and testing iDNA-Methyl [1] (http://www.jci-bioinfo.cn/iDNA-Methyl), a web-server predictor for identifying DNA methylation sites. The samples were obtained by sliding a window of nucleotides along each of the DNA sequences taken from MethDB (http://www.methdb.de/). Each DNA sample was formulated by combining its trinucleotide composition (TNC) and the pseudo amino acid components (PseAAC) of the sequence translated from the DNA sample according to its genetic codons. In the real world, such data are very unbalanced. Target-jackknife was used to optimize the unbalanced benchmark dataset and to minimize the consequences of the mis-prediction that this imbalance causes.

The first dataset (Supplementary material, File 1) contains 2426 nucleotide segment samples, of which 787 are true methylation samples and 1639 are false methylation samples.

The 2nd dataset (Supplementary material, File 2) is the optimized benchmark dataset obtained after applying the NCR (Neighborhood Cleaning Rule) [13] treatment to the original benchmark dataset of the DNA methylation system. It contains 522 non-methylation samples that were removed from the negative subset, each of which corresponds to a vector with 72 components. For distinction, each real non-methylation sample starts with a line of “>Non-Methylation code”.

The 3rd dataset (Supplementary material, File 1) is the optimized benchmark dataset obtained after applying both the NCR (Neighborhood Cleaning Rule) [13] and SMOTE (Synthetic Minority Over-Sampling Technique) [14] treatments to the 1st benchmark dataset. It contains 1117 DNA methylation samples (including 330 hypothetical methylation samples created by SMOTE) and 1117 non-methylation samples, each of which corresponds to a vector with 72 components. For distinction, each real DNA methylation sample starts with a line of “>Methylation code”, while each hypothetical DNA methylation sample starts with a line of “Hypothetical” [6-8].
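
The sliding-window and trinucleotide-composition steps described above can be illustrated with a minimal Python sketch. It is not the iDNA-Methyl implementation: the function names, the alphabetical ordering of the 64 trinucleotides, the use of overlapping trinucleotides, and the window width passed by the caller are all assumptions made for illustration. It covers only the TNC part of the feature vector; the PseAAC part, and how the two are combined into the 72 components, are not reproduced here.

```python
from itertools import product

# All 64 trinucleotides in alphabetical order (the ordering is an assumption).
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_composition(seq):
    """Return the 64 normalized frequencies of overlapping trinucleotides."""
    seq = seq.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skip triplets with N or other ambiguous symbols
            counts[tri] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRINUCLEOTIDES)
    return [counts[t] / total for t in TRINUCLEOTIDES]


def sliding_segments(seq, width):
    """Yield (start, segment) for every window of `width` nucleotides."""
    for start in range(len(seq) - width + 1):
        yield start, seq[start:start + width]
```

For example, `trinucleotide_composition(segment)` could be applied to each segment produced by `sliding_segments(sequence, width)`, with `width` chosen by the user, since the window size is not stated in this article.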
Subject area: Biology
More specific subject area: Bioinformatics and Biomedicine
Type of data: Text file
How data was acquired: Using flexible sliding window approach [2–5]
Data format: Analyzed
Experimental factors: n/a
Experimental features: DNA sample was formulated by combining its trinucleotide composition (TNC) [6–8] and the pseudo amino acid components (PseAAC) [9–11] of the sequence translated from the DNA sample according to its genetic codons. Meanwhile, some novel techniques in statistical analysis were introduced to train and test the predictor, such as “Neighborhood Cleaning Rule”, “Synthetic Minority Over-Sampling Technique”, and “Target-Jackknife Test” [12].
Data source location: Jingdezhen 333403, China
Data accessibility: With this paper and at: http://www.jci-bioinfo.cn/DNAmethy/IDM_data.html
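
As a rough illustration of how the Synthetic Minority Over-Sampling Technique creates hypothetical samples, the sketch below interpolates between a minority-class vector and one of its nearest minority-class neighbours. It is a simplified SMOTE-style routine, not the procedure used to build the datasets: the function name `smote_like`, the choice of `k`, the random selection of base samples, and the fixed seed are assumptions. The dataset sizes reported in this article are consistent with this kind of balancing: 1639 − 522 = 1117 non-methylation samples after NCR, and 787 + 330 = 1117 methylation samples after SMOTE.

```python
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic vectors by interpolating between minority samples.

    `minority` is a list of equal-length numeric vectors (at least two).
    """
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance between two vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: dist2(base, m),
        )[:k]
        mate = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + u * (m - b) for b, m in zip(base, mate)])
    return synthetic
```

Each synthetic vector lies on the line segment between two real minority samples, which is why such samples are labelled as hypothetical rather than observed.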

References

1. K C Chou. Using subsite coupling to predict signal peptides. Protein Eng. 2001-02.
2. Kuo-Chen Chou. Prediction of protein signal sequences. Curr Protein Pept Sci. 2002-12.
3. Kuo-Chen Chou; Hong-Bin Shen. Signal-CF: a subsite-coupled and window-fusing approach for predicting signal peptides. Biochem Biophys Res Commun. 2007-04-05.
4. Hong-Bin Shen; Kuo-Chen Chou. Signal-3L: A 3-layer approach for predicting signal peptides. Biochem Biophys Res Commun. 2007-08-31.
5. Xuan Xiao; Jian-Liang Min; Wei-Zhong Lin; Zi Liu; Xiang Cheng; Kuo-Chen Chou. iDrug-Target: predicting the interactions between drug compounds and target proteins in cellular networking via benchmark dataset optimization approach. J Biomol Struct Dyn. 2015-01-14.
6. Zi Liu; Xuan Xiao; Wang-Ren Qiu; Kuo-Chen Chou. iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. Anal Biochem. 2015-01-14.
7. Wei Chen; Xitong Zhang; Jordan Brooker; Hao Lin; Liqing Zhang; Kuo-Chen Chou. PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics. 2014-09-16.
8. Bin Liu; Fule Liu; Xiaolong Wang; Junjie Chen; Longyun Fang; Kuo-Chen Chou. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015-05-09.
9. Kuo-Chen Chou. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2004-08-12.
10. Wei Chen; Tian-Yu Lei; Dian-Chuan Jin; Hao Lin; Kuo-Chen Chou. PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Anal Biochem. 2014-04-13.

