Literature DB >> 21935457

iDNA-Prot: identification of DNA binding proteins using random forest with grey model.

Wei-Zhong Lin1, Jian-An Fang, Xuan Xiao, Kuo-Chen Chou.   

Abstract

DNA-binding proteins play crucial roles in various cellular processes. Developing high throughput tools for rapidly and effectively identifying DNA-binding proteins is one of the major challenges in the field of genome annotation. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power. By incorporating the features into the general form of pseudo amino acid composition that were extracted from protein sequences via the "grey model" and by adopting the random forest operation engine, we proposed a new predictor, called iDNA-Prot, for identifying uncharacterized proteins as DNA-binding proteins or non-DNA binding proteins based on their amino acid sequences information alone. The overall success rate by iDNA-Prot was 83.96% that was obtained via jackknife tests on a newly constructed stringent benchmark dataset in which none of the proteins included has ≥25% pairwise sequence identity to any other in a same subset. In addition to achieving high success rate, the computational time for iDNA-Prot is remarkably shorter in comparison with the relevant existing predictors. Hence it is anticipated that iDNA-Prot may become a useful high throughput tool for large-scale analysis of DNA-binding proteins. As a user-friendly web-server, iDNA-Prot is freely accessible to the public at the web-site on http://icpr.jci.edu.cn/bioinfo/iDNA-Prot or http://www.jci-bioinfo.cn/iDNA-Prot. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results.

Entities:  

Mesh:

Substances:

Year:  2011        PMID: 21935457      PMCID: PMC3174210          DOI: 10.1371/journal.pone.0024756

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

DNA-binding proteins play a vitally important role in many biological processes, such as recognition of specific nucleotide sequences, regulation of transcription, and regulation of gene expression. At present, several experimental techniques (such as filter binding assays, genetic analysis, chromatin immunoprecipitation on microarrays, and X-ray crystallography) have been used for identifying DNA-binding proteins. Although these techniques can provide a detailed picture about the binding, they are both time-consuming and expensive [1]. Particularly, the number of newly discovered protein sequences has been increasing extremely fast. For example, in 1986 the Swiss-Prot [2] database contained only 3,939 protein sequence entries, but now the number has jumped to 530,264 according to the release 2011_07 on 28-Jun-2011 by the UniProtKB/Swiss-Prot at http://web.expasy.org/docs/relnotes/relstat.html, meaning that the number of protein sequence entries now is more than 134 times the number from about 25 years ago. Facing the avalanche of new protein sequences generated in the postgenomic age, it is highly desired to develop automated methods for rapidly and effectively identifying and characterizing DNA-binding proteins based on the protein sequence information alone. Actually, numerous predictors were developed in this regard. For instance: Shanahan et al. [3] demonstrated how structural features were employed to determine whether a protein of known structure and unknown function was a DNA-binding proteins or not; Ahmad et al. [4] depicted how to distinguish DNA-binding and non DNA-binding proteins with net charge, net dipole moment and quadrupole moment, respectively; Nordhoff et al. [5] introduced identification and characterization of DNA-binding proteins by mass spectrometry, which was regarded as the most sensitive and specific analytical technique available for protein identification [6]. All the aforementioned methods were considerably relied on the results from biochemical experiments. Among the existing methods, those which are purely based on theoretical approaches are of using various classifying engines, such as support vector machine (SVM) [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], artificial neural network (ANN) [16], [17], [18], [19], [20], [21], random forest [22], [23], [24], nearest neighbor [25], and boosted decision trees [26]. In addition to using different prediction engines, a remarkable difference among the existing methods is in that different features were extracted to represent protein samples. For instance, Bhardwaj [13] used a 40-D (dimensional) feature vector to formulate a protein sample that contains positive potential surface patches, overall charge of the protein, and overall surface composition. Yu et al. [10] constructed a 132-D feature vector to represent a protein sequence by using the information of hydrophobicity, predicted secondary structure, solvent accessibility, normalized van der Waals volume, polarity, and polarizability. Bhardwaj and Lu [9] represented the sample of a protein by harnessing the 70 features of the DNA-binding residues, including the residue's identity, charge, solvent accessibility, average potential, secondary structure, neighboring residues, and location in a cationic patch. Kumar et al. [14] extracted the features from the PSSM (Position-Specific Scoring Matrix) profile obtained from PSI-BLAST [27] to represent the protein sample. Subsequently, a different approach was proposed [22] to encode each protein sequence with 116 features by incorporating various physic-chemical properties of amino acids. Meanwhile, Nanni and Lumini [15] proposed a method to represent a protein sample by combing ontologies and dipeptide composition. Later, the same authors [6] introduced the grouped weight to represent protein samples for predicting DNA-binding proteins. Langlois and Lu [1] represented a protein sample with 472 features, of which 240 were secondary structure features, 231 dipeptide composition features, and one for the total charge over its amino acid sequence. However, the existing predictors have the following shortcomings. (1) The extracting features are very complicated and their dimensions are too large. Particularly, during the prediction process, some of the existing predictors even need the informations of query proteins that were obtained from other experiments, such as their three-dimensional (3D) structures, functions, and the other relevant knowledge. (2) The computational time needed by these predictors is usually very long; for instance, the predictor iDBPs [23] would usually take about 30 minutes for predicting one query protein. (3) Most predictors did not provide a web-server for the public usage, while the others claimed they did but are currently not in working condition and hence their practical application value is quite limited. In view of this, the present study was initiated in an attempt to develop a new and more powerful predictor by addressing the aforementioned three problems. According to a recent comprehensive review [28], to establish a really useful statistical predictor for a protein system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us describe how to deal with these steps.

Materials and Methods

1. Benchmark datasets

DNA-binding protein sequences were collected from the Protein Data Bank (PDB) release 03-May-2011 at http://www.pdb.org/, in which there are 72,838 structures. By searching the keywords of “Protein-DNA complex” and “DNA binding” through the “Advanced Search Interface”, we extracted 3,689 structures from (PDB). To construct a high quality benchmark dataset with a wider coverage scope and lower homology bias, the data obtained above were screened strictly according to the following criteria. (1) Sequences with less than 50 amino acid (AA) residues were removed because they might just belong to fragments [29]. (2) Sequences with more than 10 consecutive character of “X” were taken away because they contained too many unknown amino acids. (3) To reduce redundancy and homology bias, the PISCES [30], [31] was utilized to cutoff those sequences that have pairwise sequence identity to any other in the dataset. Finally, we obtained 212 DNA-binding proteins. Similarly, 212 non DNA-binding protein domains were randomly picked from the data bank. Accordingly, the benchmark dataset thus obtained consists of 424 protein sequences of which half are DNA-binding protein sequences and the other half non-binding protein sequences. Their accession codes and sequences are given in Information S1.

2. A novel pseudo amino acid composition of grey model

To develop a powerful predictor for a protein system, one of the keys is to formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted [28]. To realize this, the concept of pseudo amino acid composition (PseAAC) was proposed [32] to replace the simple amino acid composition (AAC) for representing the sample of a protein. According to Eq. 6 of [28], the general form of PseAAC for a protein can be formulated aswhere is a transpose operator, while the subscript is an integer and its value as well as the components , , … will depend on how to extract the desired information from the amino acid sequence of . In this study, we are to use the grey model parameters to define the elements in Eq. 1. In 1989, Deng [33] proposed a grey system theory to investigate the uncertainty of a system. According to this theory, if the information of a system investigated is fully known, it is called a “white system”; if completely unknown, a “black system”; if partially known, a “grey system”. The model developed on the basis of such a theory is called “grey model”, which is a kind of nonlinear and dynamic model formulated by a differential equation. The grey model is particularly useful for solving complicated problems that are lack of sufficient information, or need to process uncertain information and reduce random effects of acquired data. In the grey system theory, an important and generally used model is called GM(1,1). It is quite effective for monotonic series, with good simulating effect and small error, as reflected by the fact that using the GM(1,1) model has remarkably improved the success rates in predicting protein structural classes [34]. However, if the series concerned are not monotonic, the simulating effect of GM(1,1) would not be good and its error might be quite large. To overcome the above problem, the grey system theory used in the current study is a special grey dynamic model called GM(2,1), which can be used to handle the oscillation series. In GM(2,1) the strategy of minimum squares will be adopted to determine the uncertain parameters, as can be briefly described below. One of the most commonly used approaches in the grey system theory is the accumulative generation operation (AGO), which can convert a series without any obvious regularity into a strict monotonic increasing series so as to reduce the randomness and enhance the smoothness of the series, and minimize interference from the random information. Let us assume that is the original series of real numbers with an irregular distribution, and it is a non-negative original data sequence. Then, is viewed as the first-order accumulative generation operation (1-AGO) series for; i.e., the components in are given byThe GM(2,1) model can be expressed by the following second-order grey differential equation with one variable:where: In Eq. 3, the coefficients and are the developing coefficients, and the influence coefficient. Then we haveThus, it follows by the least-squares method thatwhere The least-square estimator for the coefficients , and should carry some intrinsic information contained in the discrete data sequence sampled from the system investigated. In view of this, the incorporation of these coefficients into the general form of PseAAC (Eq. 1) will make the formulation of a protein sample better to reflect its intrinsic correlation with the attribute to be predicted. This is the key of the novel approach. The concrete procedures are as follows. A protein sequence is composed of 20 different types of native amino acids denoted by A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. Before using the grey dynamic model GM(2,1), we need to represent the protein sequence by a series of real numbers. Listed in is the numerical codes used in this study to represent the 20 amino acids.
Table 1

The numerical codes of 20 native amino acids.

Amino acidFactor score [35]
A 0.7330.325
C 0.8620.297
D 3.6560.025
E1.4770.814
F1.8910.869
G1.3300.791
H 1.6730.158
I2.1310.849
K0.5330.630
L 1.5050.182
M2.2190.902
N1.2990.720
P 1.6280.164
Q 3.0050.047
R1.5020.818
S 4.7600.008
T2.2130.901
V 0.5440.367
W0.6720.662
Y3.0970.957
The factor score for molecular volume are adopted [35] that is related to the molecular size or volume as well as side chain weight [35]. Because in the current study, only the non-negative sequences will be considered, we can adopt the following function for modelingThrough the above function, each of the 20 amino acids can be translated into numerical variable within the region of (0, 1) ( ). With the numerical codes thus obtained, we can convert a protein sequence to a series of real numbers. Thus, the three coefficients for any protein sequence can be derived with the grey model GM(2,1) by following Eqs.2–9.

3. Predicting algorithm

The three coefficients obtained in the above section, in addition to the 20 components in the classical amino acid composition [36], can be used to form a new mode of PseAAC, with components. Thus, according Eq. 1, the protein can be formulated with a new mode of PseAAC as given bywhere are the occurrence frequencies of the 20 different types of amino acids in the protein concerned, while represent the absolute value of coefficients, and , respectively. Now the Random Forest (RF) algorithm was adopted to perform the prediction. RF is a popular machine learning algorithm and recently it has been successfully employed in dealing with various biological prediction problems [37], [38], [39], [40]. RF is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. It has been shown that combining multiple trees produced in randomly selected subspaces can significantly improve the prediction accuracy. RF performs a type of cross-validation by using out-of-bag samples. During the training process, each tree is constructed using a different bootstrap sample from the original data. For the detailed description about of the RF algorithm, refer to the papers [41], [42], [43]. The RF algorithm is available via the link at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm. Recently, the RF tool for the MATLAB windows is also available at http://code.google.com/p/randomforest-matlab/that has two important functions: one is “classRF_train” for training given data and returning the prediction model, and the other is “classRF_predict” for predicting query input with the prediction model. The classifier in this paper was developed based on the RF tool for the MATLAB windows. The classifier thus established is called iDNA-Prot, which can be used to predict whether a protein can bind with DNA according to its sequence information alone. For practical applications, a web-server of iDNA-Prot was established at the web-site http://icpr.jci.edu.cn/bioinfo/iDNA-Prot. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide on how to use the web-server is given in Information S3, by which users can easily get their desired results without the need to follow the complicated mathematic equations involved in developing the iDNA-Prot predictor.

Results and Discussion

In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling test, and jackknife test [44]. However, of the three test methods, the jackknife test is deemed the most objective [45]. The reasons are as follows. (1) For the independent dataset test, although all the proteins used to test the predictor are outside the training dataset used to train it so as to exclude the “memory” effect or bias, the way of how to select the independent proteins to test the predictor could be quite arbitrary unless the number of independent proteins is sufficiently large. This kind of arbitrariness might result in completely different conclusions. For instance, a predictor achieving a higher success rate than the other predictor for a given independent testing dataset might fail to keep so when tested by another independent testing dataset [44]. (2) For the subsampling test, the concrete procedure usually used in literatures is the 5-fold, 7-fold or 10-fold cross-validation. The problem with this kind of subsampling test is that the number of possible selections in dividing a benchmark dataset is an astronomical figure even for a very simple dataset, as demonstrated by Eqs.28–30 in [28]. Therefore, in any actual subsampling cross-validation tests, only an extremely small fraction of the possible selections are taken into account. Since different selections will always lead to different results even for a same benchmark dataset and a same predictor, the subsampling test cannot avoid the arbitrariness either. A test method unable to yield a unique outcome cannot be deemed as a good one. (3) In the jackknife test, all the proteins in the benchmark dataset will be singled out one-by-one and tested by the predictor trained by the remaining protein samples. During the process of jackknifing, both the training dataset and testing dataset are actually open, and each protein sample will be in turn moved between the two. The jackknife test can exclude the “memory” effect. Also, the arbitrariness problem as mentioned above for the independent dataset test and subsampling test can be avoided because the outcome obtained by the jackknife cross-validation is always unique for a given benchmark dataset. Accordingly, the jackknife test has been increasingly and widely used by those investigators with strong math background to examine the quality of various predictors (see, e.g.,[46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57]). In view of this, here the jackknife cross-validation was also used to examine the prediction quality of the current predictor. The results thus obtained with iDNA-Prot on the benchmark dataset of Information S1 is given in , from which we can see that the overall success rate was 83.96% in identifying proteins as DNA-binding proteins and non-DNA-binding proteins.
Table 2

Results obtained by iDNA-Prot on the benchmark dataset of Information S1 through the jackknife testa.

Protein typeNumber of proteinsNumber of correct predictionSuccess rate
DNA-binding protein21217984.43%
Non DNA-binding protein21217783.49%
Overall42435683.96%

The following parameters were used for Random Forest algorithm: the number of tree grown was 560 and the number of predictors sampled for splitting at each node was 5.

The following parameters were used for Random Forest algorithm: the number of tree grown was 560 and the number of predictors sampled for splitting at each node was 5. Furthermore, as a demonstration to show that the current predictor iDNA-Prot is superior to the existing ones, let us compare iDNA-Prot with DNA-Prot [22]. The reason we chose DNA-Prot for comparison is because among the existing methods for predicting DNA-binding proteins, the reported success rate achieved by DNA-Prot [22] is the highest. The datasets used to train and test DNA-Prot [22] as well as its standalone version can be obtained from http://www3.ntu.edu.sg/home/EPNSugan/index_files/dnaprot.htm. The training dataset used for DNA-Prot [22] contains 146 DNA-binding proteins and 250 non-DNA-binding proteins. The data used to test DNA-Prot [22] contain the following three sets: (i) testing dataset-1, , consisting of 92 DNA-binding proteins and 100 non DNA-binding proteins; (ii) testing dataset-2, , consisting of 823 DNA-binding proteins and 823 non DNA-binding proteins; and (iii) testing dataset-3, , consisting of 88 DNA-binding proteins and 233 non DNA-binding proteins. All these datasets were elaborated in [22]. However, it was found (see Information S2) that there are 10 identical protein sequences between the 146 DNA-binding proteins in the training dataset and the 92 DNA-binding proteins in the 1st testing dataset , that 19 identical protein sequences between the 146 DNA-binding proteins in the training dataset and the 88 DNA-binding proteins in the 3rd testing dataset , and that 94 identical protein sequences between the 250 non-DNA-binding proteins in the training dataset and the 233 non-DNA-binding proteins in the 3rd testing dataset . In other words, the so-called independent datasets used by DNA-Prot [22] were actually not independent and hence would lead to over-estimated success rates. Accordingly, to perform an objective and fair comparison of iDNA-Prot with DNA-Prot [22], let us construct a real independent dataset by randomly picking some DNA-binding proteins and non DNA-binding proteins from PDB (Protein Data Bank) according to the following criteria. These proteins must not occur in the training dataset of iDNA-Prot nor in the training dataset for DNA-Prot [22], and that none of the proteins included has more than 40% sequence identity to any other in a same subset. By doing so, we obtained a real independent dataset , in which 122 proteins are DNA-binding proteins and 122 non DNA-binding proteins. The sequences and accession numbers for such 244 independent proteins are given in the Information S4. Listed in are the tested results obtained respectively by DNA-Prot [22] and iDNA-Prot on the 244 independent proteins in (Information S4). From the table we can see that the success rates by iDNA-Prot in identifying both DNA-binding and non-DNA-binding proteins are remarkably higher than those by DNA-Prot [22], and that the overall success rate achieved by iDNA-Prot is about 13% higher than that by DNA-Prot [22].
Table 3

A comparison of the predicted results by DNA-Prot [22] and iDNA-Prot on the independent dataset in the Information S3.

Protein typeDNA-Prot [22] iDNA-Prot
DNA-binding protein87/122 = 71.31%109/122 = 89.34%
Non DNA-binding protein101/122 = 82.79%111/122 = 90.98%
Overall188/244 = 77.05%220/244 = 90.16%
In addition to yielding higher success rates than those by the relevant existing predictors, the computational time needed by iDNA-Prot to complete a prediction is also significantly shorter than any of its counterparts, and hence iDNA-Prot may become a useful high throughput tool for large-scale investigation of DNA-binding proteins. Moreover, as a demonstration to show the efficiency of the current method, the hypothetical proteins that are annotated as DNA-binding proteins were used to test iDNA-Prot. The information about this kind of hypothetical proteins can be obtained at http://www.ncbi.nlm.nih.gov/protein/?term=DNAbindinghypothetical, from which we randomly picked 100 DNA-binding hypothetical proteins for test. The results predicted by iDNA-Prot on these proteins are given in , from which we can see the overall success rate is 90%.
Table 4

The predicted results by iDNA-Prot on the 100 DNA-binding hypothetical proteins from http://www.ncbi.nlm.nih.gov/protein/?term=DNAbindinghypothetical.

GI codePredicted resultGI codePredicted resultGI codePredicted result
21960164DBP29122980non DBP26832636DBP
21957418DBP21671920DBP26832400DBP
21961058DBP32880245DBP21835917DBP
21960858DBP90578605DBP21835539DBP
21960204DBP90410315DBP21843120DBP
21958545DBP89076244non DBP21836833DBP
21958841non DBP23326729non DBP14627522DBP
14828174DBP90439438DBP78363301DBP
52629876DBP90328556DBP30116886DBP
21958534DBP89048073non DBP20673954DBP
21958313DBP30724697DBP88595361DBP
21957779non DBP30726408DBP15769834DBP
21958822DBP30726252DBP68057023DBP
21957238DBP30725976DBP59480370non DBP
1552778DBP30725598DBP52004347DBP
21960397DBP30725306DBP11114707DBP
21960196DBP30725067DBP22984739DBP
21960777DBP30724845DBP22984549DBP
21960121DBP16882676DBP14564202DBP
21960008DBP30687056DBP14563738non DBP
21959358non DBP30686776DBP14563550DBP
21959322DBP30686615DBP90406920DBP
21959035DBP30686107DBP14563368DBP
21958991DBP30685947DBP29434364DBP
21957969DBP30685727DBP23327083non DBP
21957386DBP30685502DBP93211002DBP
21956952DBP30685211DBP14498544DBP
21877200DBP30977777DBP14526726DBP
26832542DBP18859138DBP14527329DBP
26832403DBP52002457DBP22981160DBP
33989345DBP17093877DBP46913396DBP
33989345DBP30891446DBP71382240DBP
33875610DBP26832638DBP
52004491DBP26832637DBP
The benchmark dataset includes 424 proteins, classified into 212 DNA-binding proteins and 212 non DNA-binding proteins. Both the accession identifier of PDB (Protein Data Bank) and sequences are given. None of the proteins has more than 25% sequence identity to any other in a same subset. See the text of the paper for further explanation. (PDF) Click here for additional data file. List of protein codes that occur in both the training and testing datasets for DNA-Prot (Kumar et al., 2009). See the main paper for further explanation. (PDF) Click here for additional data file. A step-by-step guide on how to use the web-server of iDNA-Prot to get the desired results. (PDF) Click here for additional data file. The independent dataset includes 244 proteins, classified into 122 DNA-binding proteins and 122 non DNA-binding proteins. Both the accession identifier of PDB (Protein Data Bank) and sequences are given. None of the proteins has more than 40% sequence identity to any other in a same subset. See the text of the paper for further explanation. (PDF) Click here for additional data file.
  50 in total

1.  Rapid identification of DNA-binding proteins by mass spectrometry.

Authors:  E Nordhoff; A M Krogsdam; H F Jorgensen; B H Kallipolitis; B F Clark; P Roepstorff; K Kristiansen
Journal:  Nat Biotechnol       Date:  1999-09       Impact factor: 54.908

2.  Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines.

Authors:  Xiaojing Yu; Jianping Cao; Yudong Cai; Tieliu Shi; Yixue Li
Journal:  J Theor Biol       Date:  2005-11-07       Impact factor: 2.691

3.  Solving the protein sequence metric problem.

Authors:  William R Atchley; Jieping Zhao; Andrew D Fernandes; Tanja Drüke
Journal:  Proc Natl Acad Sci U S A       Date:  2005-04-25       Impact factor: 11.205

4.  Residue-level prediction of DNA-binding sites and its application on DNA-binding protein predictions.

Authors:  Nitin Bhardwaj; Hui Lu
Journal:  FEBS Lett       Date:  2007-02-07       Impact factor: 4.124

5.  Predicting DNA-binding proteins: approached from Chou's pseudo amino acid composition and other specific sequence features.

Authors:  Y Fang; Y Guo; Y Feng; M Li
Journal:  Amino Acids       Date:  2007-07-12       Impact factor: 3.520

6.  Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms.

Authors:  Kuo-Chen Chou; Hong-Bin Shen
Journal:  Nat Protoc       Date:  2008       Impact factor: 13.491

Review 7.  Recent progress in protein subcellular location prediction.

Authors:  Kuo-Chen Chou; Hong-Bin Shen
Journal:  Anal Biochem       Date:  2007-07-12       Impact factor: 3.365

8.  Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes.

Authors:  Xi-Bin Zhou; Chao Chen; Zhan-Chao Li; Xiao-Yong Zou
Journal:  J Theor Biol       Date:  2007-06-09       Impact factor: 2.691

9.  Kernel-based machine learning protocol for predicting DNA-binding proteins.

Authors:  Nitin Bhardwaj; Robert E Langlois; Guijun Zhao; Hui Lu
Journal:  Nucleic Acids Res       Date:  2005-11-10       Impact factor: 16.971

10.  Identification of DNA-binding proteins using support vector machines and evolutionary profiles.

Authors:  Manish Kumar; Michael M Gromiha; Gajendra P S Raghava
Journal:  BMC Bioinformatics       Date:  2007-11-27       Impact factor: 3.169

View more
  83 in total

1.  SySAP: a system-level predictor of deleterious single amino acid polymorphisms.

Authors:  Tao Huang; Chuan Wang; Guoqing Zhang; Lu Xie; Yixue Li
Journal:  Protein Cell       Date:  2011-12-19       Impact factor: 14.870

2.  QSAR classification of metabolic activation of chemicals into covalently reactive species.

Authors:  Chin Yee Liew; Chuen Pan; Andre Tan; Ke Xin Magneline Ang; Chun Wei Yap
Journal:  Mol Divers       Date:  2012-02-28       Impact factor: 2.943

Review 3.  Applications of alignment-free methods in epigenomics.

Authors:  Luca Pinello; Giosuè Lo Bosco; Guo-Cheng Yuan
Journal:  Brief Bioinform       Date:  2013-11-06       Impact factor: 11.622

Review 4.  DNA-protein interaction: identification, prediction and data analysis.

Authors:  Abbasali Emamjomeh; Darush Choobineh; Behzad Hajieghrari; Nafiseh MahdiNezhad; Amir Khodavirdipour
Journal:  Mol Biol Rep       Date:  2019-03-26       Impact factor: 2.316

5.  DP-BINDER: machine learning model for prediction of DNA-binding proteins by fusing evolutionary and physicochemical information.

Authors:  Farman Ali; Saeed Ahmed; Zar Nawab Khan Swati; Shahid Akbar
Journal:  J Comput Aided Mol Des       Date:  2019-05-23       Impact factor: 3.686

6.  iAFP-Ense: An Ensemble Classifier for Identifying Antifreeze Protein by Incorporating Grey Model and PSSM into PseAAC.

Authors:  Xuan Xiao; Mengjuan Hui; Zi Liu
Journal:  J Membr Biol       Date:  2016-11-03       Impact factor: 1.843

7.  Non-Metabolic Role of PKM2 in Regulation of the HIV-1 LTR.

Authors:  Satarupa Sen; Satish L Deshmane; Rafal Kaminski; Shohreh Amini; Prasun K Datta
Journal:  J Cell Physiol       Date:  2016-06-10       Impact factor: 6.384

8.  Classifying Multifunctional Enzymes by Incorporating Three Different Models into Chou's General Pseudo Amino Acid Composition.

Authors:  Hong-Liang Zou; Xuan Xiao
Journal:  J Membr Biol       Date:  2016-04-25       Impact factor: 1.843

9.  Comparative analysis of housekeeping and tissue-selective genes in human based on network topologies and biological properties.

Authors:  Lei Yang; Shiyuan Wang; Meng Zhou; Xiaowen Chen; Yongchun Zuo; Dianjun Sun; Yingli Lv
Journal:  Mol Genet Genomics       Date:  2016-02-20       Impact factor: 3.291

10.  Comprehensive comparative analysis and identification of RNA-binding protein domains: multi-class classification and feature selection.

Authors:  Samad Jahandideh; Vinodh Srinivasasainagendra; Degui Zhi
Journal:  J Theor Biol       Date:  2012-08-03       Impact factor: 2.691

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.