
Prediction of DNase I hypersensitive sites by using pseudo nucleotide compositions.

Abstract

DNase I hypersensitive sites (DHS) are associated with a wide variety of regulatory DNA elements. Knowledge about the locations of DHS is helpful for deciphering the function of noncoding genomic regions. With the rapid accumulation of genome sequences in the postgenomic age, it is highly desirable to develop cost-effective computational methods to identify DHS. In the present work, a support vector machine based model is proposed to identify DHS by using the pseudo dinucleotide composition. In the jackknife test, the proposed model obtained an accuracy of 83%, which is competitive with that of the existing method. This result suggests that the proposed model may become a useful tool for DHS identification.


Year: 2014 PMID: 25215331 PMCID: PMC4152949 DOI: 10.1155/2014/740506

Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X

1. Introduction

DNase I hypersensitive sites (DHS) are regions of chromatin that are sensitive to cleavage by the DNase I enzyme. Since the discovery of DHS in the 1980s [1], they have been used as markers of regulatory DNA regions. These regions are generally nucleosome-free and are associated with a wide variety of genomic regulatory elements, such as promoters, enhancers, insulators, silencers, and suppressors [2-4]. Therefore, mapping DHS has become an effective approach for discovering functional DNA elements in noncoding sequences.

Although the traditional Southern blotting technique is a gold-standard approach for identifying DHS, obtaining information from Southern blots is a tricky, time-consuming, and inaccurate task [5]. Recently, the DNase-seq technique (a combination of DNase I digestion and high-throughput sequencing) has been proposed [6], and it allows for an unprecedented increase in resolution. However, methodologies for the analysis of DNase-seq data are relatively immature [7]. Therefore, computational models will be an important complement to experimental techniques for identifying DHS.

Based on nucleotide compositions, a support vector machine model for identifying DHS in the K562 cell line was proposed [8]. This method yielded quite encouraging results and played a role in stimulating the development of this area. However, further work is needed for two reasons. First, the sequences in its dataset share high sequence similarity. Second, DNA structural properties were ignored. To address these problems, we propose a new model for identifying DHS, which is trained on a high-quality benchmark dataset. In the new model, each DNA sample is encoded by using the pseudo dinucleotide composition, into which DNA structural properties are incorporated.

2. Materials and Methods

2.1. Benchmark Dataset

The 280 experimentally confirmed DHS and 731 non-DHS sequences were obtained from http://noble.gs.washington.edu/proj/hs/; these sequences have been used to train DHS prediction models [8]. As elucidated in [9], a predictor that is trained and tested on a dataset containing redundant samples with high similarity might yield misleading results with an overestimated accuracy. To remove the redundancy and avoid bias, the CD-HIT software [10] was used to remove those DNA fragments that have ≥60% pairwise sequence identity to each other. Finally, we obtained 247 positive and 710 negative samples for the benchmark dataset S, which can be formulated as

$$S = S^{+} \cup S^{-}, \tag{1}$$

where the subset $S^{+}$ contains 247 DHS sequences and $S^{-}$ contains 710 non-DHS sequences, while $\cup$ represents the "union" in set theory. The detailed sequences in the benchmark dataset S are given in Supplementary Information S1 available online at http://dx.doi.org/10.1155/2014/740506.

2.2. DNA Sequence Representation

In order to integrate sequence-order effects and DNA physicochemical properties, the pseudo nucleotide composition was proposed in 2011 [11]. Since then, the concept of pseudo nucleotide composition has penetrated into many branches of computational genomics, such as predicting recombination spots [12], predicting promoters [13], predicting nucleosome positioning sequences [14], and identifying splice sites [15]. Because of its wide and increasing usage, a flexible web-server called "pseudo K-tuple nucleotide composition (PseKNC)" was recently developed [16], which can be used to generate various kinds of pseudo K-tuple nucleotide compositions.

Encouraged by the success of introducing pseudo nucleotide composition to computational genomics, the current study uses the pseudo dinucleotide composition to represent the DNA sequences in the benchmark dataset, which can be expressed as [12, 16]

$$\mathbf{D} = \left[d_1 \;\; d_2 \;\; \cdots \;\; d_{16} \;\; d_{16+1} \;\; \cdots \;\; d_{16+\lambda}\right]^{T}, \tag{2}$$

where

$$d_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{16} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 1 \le u \le 16, \\[2ex] \dfrac{w\,\theta_{u-16}}{\sum_{i=1}^{16} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 16 < u \le 16+\lambda. \end{cases} \tag{3}$$

In (3), $f_u$ ($u = 1, 2, \ldots, 16$) is the normalized occurrence frequency of the dinucleotides in the DNA sequence, $\lambda$ is the number of the total counted ranks (or tiers) of the correlations along a DNA sequence, and $w$ is the weight factor. The concrete values for $\lambda$ and $w$ as well as k will be further discussed in Section 3.1, while the correlation factor $\theta_j$ represents the j-tier structural correlation factor between all the jth most contiguous dinucleotides $R_iR_{i+1}$ at position $i$.
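As an illustration of the encoding described above, the following minimal Python sketch builds a pseudo dinucleotide composition vector. It assumes the form of the correlation factor commonly used in the pseudo dinucleotide literature (the mean squared difference between the structural property vectors of dinucleotides j steps apart), which is not spelled out in this section. The function names and the `TOY_PROPS` table are hypothetical; the property values are placeholders, not the real twist, tilt, roll, shift, slide, and rise parameters.

```python
from itertools import product

# The 16 dinucleotides in a fixed order: AA, AC, AG, ..., TT.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_frequencies(seq):
    """Normalized occurrence frequency f_u of each dinucleotide in seq."""
    counts = {d: 0 for d in DINUCLEOTIDES}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = len(seq) - 1
    return [counts[d] / total for d in DINUCLEOTIDES]


def correlation_factor(seq, j, props):
    """Assumed form of theta_j: average over positions i of the mean squared
    difference between the property vectors of the dinucleotide at i and
    the dinucleotide at i + j."""
    n_pairs = len(seq) - j - 1
    total = 0.0
    for i in range(n_pairs):
        a = props[seq[i:i + 2]]
        b = props[seq[i + j:i + j + 2]]
        total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return total / n_pairs


def pse_dnc(seq, props, lam, w):
    """Return the (16 + lam)-dimensional pseudo dinucleotide composition."""
    f = dinucleotide_frequencies(seq)
    theta = [correlation_factor(seq, j, props) for j in range(1, lam + 1)]
    denom = sum(f) + w * sum(theta)
    return [x / denom for x in f] + [w * t / denom for t in theta]


# Placeholder property table (six made-up values per dinucleotide).
TOY_PROPS = {
    d: [((k + 1) * (v + 2)) % 5 / 4.0 for v in range(6)]
    for k, d in enumerate(DINUCLEOTIDES)
}

vec = pse_dnc("ACGTACGTTGCAAGCT", TOY_PROPS, lam=6, w=0.2)
print(len(vec), round(sum(vec), 6))
```

The sketch expects uppercase sequences over A, C, G, and T that are longer than λ + 1 nucleotides. Because both kinds of component share one denominator, the 16 + λ components always sum to 1.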

2.3. Support Vector Machine (SVM)

SVM is a supervised learning algorithm and has been widely used in computational genomics and proteomics [17-23]. The basic principle of SVM is to transform the input vector into a high-dimensional space and then seek a separating hyperplane with the maximal margin in this space by using the decision function

$$f(\mathbf{X}) = \operatorname{sgn}\left(\sum_{i=1}^{N} y_i \alpha_i K(\mathbf{X}, \mathbf{X}_i) + b\right), \tag{4}$$

where $\alpha_i$ are the Lagrange multipliers, $b$ is the offset, $\mathbf{X}_i$ is the ith training vector, and $y_i$ represents the type of the ith training vector. $K(\mathbf{X}, \mathbf{X}_i)$ is a kernel function which defines an inner product in a high-dimensional feature space, and sgn is the sign function. Due to its effectiveness and speed in nonlinear classification, the radial basis function (RBF) kernel was used in the current study. The Libsvm 2.84 package [24], which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/, was used to implement the SVM. The regularization parameter C and the kernel width parameter γ were optimized using a grid search. The search spaces for C and γ are $[2^{15}, 2^{-5}]$ and $[2^{-5}, 2^{-15}]$ with steps of $2^{-1}$ and $2$, respectively.
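The decision function and the search grid can be sketched in plain Python as follows. This is only an illustration of the formula, not the Libsvm implementation used in the study: the support vectors, multipliers, and offset below are made-up values, the helper names are hypothetical, and the grid assumes that consecutive candidates differ by one power of two.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, labels, alphas, b, gamma):
    """sgn(sum_i y_i * alpha_i * K(x, x_i) + b); a score of 0 maps to +1."""
    score = sum(
        y * a * rbf_kernel(x, sv, gamma)
        for sv, y, a in zip(support_vectors, labels, alphas)
    )
    return 1 if score + b >= 0 else -1


def power_of_two_grid(max_exp, min_exp):
    """Candidates 2^max_exp, 2^(max_exp - 1), ..., 2^min_exp (assumed spacing)."""
    return [2.0 ** e for e in range(max_exp, min_exp - 1, -1)]


C_GRID = power_of_two_grid(15, -5)       # candidate values for C
GAMMA_GRID = power_of_two_grid(-5, -15)  # candidate values for gamma

# Toy model with two support vectors, one per class (made-up numbers).
toy_svs = [[0.0, 0.0], [1.0, 1.0]]
toy_labels = [-1, 1]
toy_alphas = [1.0, 1.0]
print(svm_decision([0.9, 0.9], toy_svs, toy_labels, toy_alphas, b=0.0, gamma=1.0))
print(svm_decision([0.1, 0.1], toy_svs, toy_labels, toy_alphas, b=0.0, gamma=1.0))
```

Under this assumed spacing the grid holds 21 candidates for C and 11 for γ, so a full search over both evaluates 231 pairs.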

2.4. Performance Evaluation

Three cross-validation methods, that is, the independent dataset test, the subsampling (or K-fold cross-validation) test, and the jackknife test, are often used to evaluate the anticipated success rate of a predictor. Among the three methods, the jackknife test is deemed the least arbitrary and most objective one [9, 25] and, hence, has been widely recognized and increasingly adopted by investigators to examine the quality of various predictors [26-30]. Accordingly, the jackknife test was used to examine the performance of the model proposed in the current study. In the jackknife test, each sequence in the training dataset is in turn singled out as an independent test sample and all the rule-parameters are calculated without including the one being identified.

A set of parameters, namely, sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC), and accuracy (Acc), are used to evaluate the performance of the proposed model and they are defined as follows:

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \tag{5}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}, \tag{6}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \tag{7}$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + FN + TN + FP}, \tag{8}$$

where TP, TN, FP, and FN represent the number of correctly recognized DHS, the number of correctly recognized non-DHS, the number of non-DHS recognized as DHS, and the number of DHS recognized as non-DHS, respectively.
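The jackknife procedure and the four metrics can be sketched as below. The `train` and `predict` callables stand in for any classifier; the nearest-mean classifier and the six one-dimensional samples are purely illustrative assumptions, not the SVM or the data used in the study.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Sn, Sp, Acc, and MCC from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        # MCC is reported as 0 when the denominator vanishes.
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def jackknife(samples, labels, train, predict):
    """Leave-one-out test: each sample is predicted by a model that was
    trained on all the other samples. Labels are +1 (DHS) or -1 (non-DHS)."""
    tp = tn = fp = fn = 0
    for i in range(len(samples)):
        model = train(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        guess = predict(model, samples[i])
        if labels[i] == 1:
            tp, fn = (tp + 1, fn) if guess == 1 else (tp, fn + 1)
        else:
            tn, fp = (tn + 1, fp) if guess == -1 else (tn, fp + 1)
    return evaluate(tp, tn, fp, fn)


def train_nearest_mean(xs, ys):
    """Toy classifier: store the mean of each class."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def predict_nearest_mean(model, x):
    """Assign x to the class whose mean is closer."""
    pos_mean, neg_mean = model
    return 1 if abs(x - pos_mean) <= abs(x - neg_mean) else -1


toy_x = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
toy_y = [-1, -1, -1, 1, 1, 1]
print(jackknife(toy_x, toy_y, train_nearest_mean, predict_nearest_mean))
```

The model is retrained once per sample, so the cost of the jackknife test grows with the size of the dataset.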

3. Results and Discussions

3.1. Parameter Optimization

By analyzing the dinucleotide composition of DHS and non-DHS sequences, we found that the frequencies of CC, CG, GC, and GG are higher in DHS sequences, while the frequencies of the remaining dinucleotides are higher in non-DHS sequences (Figure 1). This difference in dinucleotide usage between the two classes is why the pseudo dinucleotide composition was used in the current case.

Figure 1

Comparative frequencies of 16 dinucleotides in DHS and non-DHS sequences.

Several lines of evidence [12, 14, 31, 32] have demonstrated that DNA local structural properties, that is, angular parameters (twist, tilt, and roll) and translational parameters (shift, slide, and rise), are effective in identifying DNA attributes. Therefore, in the present work, these six structural parameters of dinucleotides were used to calculate the pseudo dinucleotide composition by using the PseKNC web-server, which is available at http://lin.uestc.edu.cn/pseknc/default.aspx.

As we can see from (2) and (3), the present model depends on the two parameters w and λ. Here w is the weight factor, usually within the range from 0 to 1, and λ reflects the global sequence-order effect. Generally speaking, the greater λ is, the more global sequence-order information the model contains. However, if λ is too large, it reduces the cluster-tolerant capacity and lowers the cross-validation accuracy because of overfitting or the "high dimension disaster" problem [33]. Therefore, we searched for the optimal values of the two parameters in the ranges w ∈ [0,1] and λ ∈ [1,10] with steps of 0.1 and 1, respectively. In order to reduce the computational time, the 5-fold cross-validation approach was used to optimize the two parameters together with the parameters C and γ of the SVM. We found that the Acc peaked when w = 0.2 and λ = 6 with C = 512 and γ = 0.0078125. Accordingly, these values of w and λ were used in the following analysis.
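The search over w and λ with 5-fold cross-validation can be outlined as follows. This is a sketch under stated assumptions: `score_fn` is a hypothetical callback that would encode the sequences with the given w and λ, train on `train_idx`, and return the accuracy on `test_idx`; the toy scoring function below merely places its peak at w = 0.2 and λ = 6 to exercise the loop, and the joint tuning of C and γ is omitted.

```python
import random


def parameter_grid():
    """All (w, lam) pairs: w = 0.0, 0.1, ..., 1.0 and lam = 1, 2, ..., 10."""
    ws = [round(0.1 * k, 1) for k in range(11)]
    return [(w, lam) for w in ws for lam in range(1, 11)]


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search(score_fn, n_samples, k=5):
    """Return (mean accuracy, w, lam) for the best pair on the grid."""
    folds = k_fold_indices(n_samples, k)
    best = None
    for w, lam in parameter_grid():
        accs = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(n_samples) if i not in held_out]
            accs.append(score_fn(w, lam, train_idx, test_idx))
        mean_acc = sum(accs) / k
        if best is None or mean_acc > best[0]:
            best = (mean_acc, w, lam)
    return best


def toy_score(w, lam, train_idx, test_idx):
    """Stand-in for a real train-and-test step; ignores the data."""
    return -((w - 0.2) ** 2) - ((lam - 6) ** 2) / 100.0


best = grid_search(toy_score, n_samples=957)
print(best[1], best[2])
```

The grid contains 110 (w, λ) pairs, each evaluated on five folds, which shows why a 5-fold test is cheaper for tuning than a jackknife test over all 957 samples.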

3.2. Prediction Quality

The prediction quality of the present model in identifying DHS in the benchmark dataset S, measured by the four metrics defined in (5)–(8) via the rigorous jackknife test, is listed in Table 1. To facilitate comparison, the corresponding results obtained by the previous predictor [8] on the same benchmark dataset are also given. As Table 1 shows, the current method outperformed the existing model on all four metrics, indicating that the proposed method may become a useful tool for identifying DHS sequences.

Table 1

Comparison of different methods for identifying DHS by the jackknife test on the same benchmark dataset.

Predictor	Sn (%)	Sp (%)	Acc (%)	MCC
Our method	72.12	86.78	83.00	0.57
Noble et al.^a	70.43	84.23	80.12	0.52

^a From Noble et al. [8].

4. Conclusions

Since DHS are associated with a wide variety of functional elements, knowledge about their locations is helpful for deciphering genomes. However, strong DNA sequence conservation is not observed among DHS sequences, suggesting that it is difficult to identify DHS computationally from the primary DNA sequence alone. A series of recent studies have demonstrated that the information encoded by DNA structural properties contributes to the identification of regulatory elements in genomes [12, 14, 31, 32]. Hence, in the present study, we proposed an SVM-based model for identifying DHS by using the pseudo dinucleotide composition, which integrates the dinucleotide composition with DNA structural properties. In the jackknife test on the benchmark dataset, the predictive results of our model are better than those of the existing method [8]. Therefore, it is anticipated that the proposed method may become a useful tool for identifying DHS sequences or, at the very least, play a complementary role to the existing methods in this area. The 247 DHS and 710 non-DHS sequences of the benchmark dataset are listed in Supplementary Information S1.

32 in total

1. Identification of voltage-gated potassium channel subfamilies from sequence information using support vector machine.

Authors: Wei Chen; Hao Lin
Journal: Comput Biol Med Date: 2012-01-31 Impact factor: 4.589

2. Identification of TATA and TATA-less promoters in plant genomes by integrating diversity measure, GC-Skew and DNA geometric flexibility.

Authors: Yong-Chun Zuo; Qian-Zhong Li
Journal: Genomics Date: 2010-11-26 Impact factor: 5.736

3. Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses.

Authors: Maryam Esmaeili; Hassan Mohabatkar; Sasan Mohsenzadeh
Journal: J Theor Biol Date: 2009-12-02 Impact factor: 2.691

Review 4. Nuclease hypersensitive sites in chromatin.

Authors: D S Gross; W T Garrard
Journal: Annu Rev Biochem Date: 1988 Impact factor: 23.643

5. Predicting methylation status of human DNA sequences by pseudo-trinucleotide composition.

Authors: Xuan Zhou; Zhanchao Li; Zong Dai; Xiaoyong Zou
Journal: Talanta Date: 2011-05-27 Impact factor: 6.057

6. iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins.

Authors: Kuo-Chen Chou; Zhi-Cheng Wu; Xuan Xiao
Journal: PLoS One Date: 2011-03-30 Impact factor: 3.240



7. Some remarks on protein attribute prediction and pseudo amino acid composition.

8. Prediction of protein binding sites in protein structures using hidden Markov support vector machine.

9. CD-HIT: accelerated for clustering the next-generation sequencing data.

10. iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition.
