Shandar Ahmad, Akinori Sarai.
Abstract
BACKGROUND: Detection of DNA-binding sites in proteins is of enormous interest for technologies targeting gene regulation and manipulation. We have previously shown that information about a residue and its sequence neighbors can be used to predict DNA-binding candidates in a protein sequence. This sequence-based prediction method is applicable even if no sequence homology with a previously known DNA-binding protein is observed. Here we implement a neural-network-based algorithm that uses the evolutionary information of amino acid sequences, in the form of their position-specific scoring matrices (PSSMs), for better prediction of DNA-binding sites.
Year: 2005 PMID: 15720719 PMCID: PMC550660 DOI: 10.1186/1471-2105-6-33
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Prediction results for binding sites in 62 proteins with different data sets used for generating the PSSMs.
| Reference data | Overall correct predictions (%) | Sensitivity, S1 (%) | Specificity, S2 (%) | Net prediction, (S1+S2)/2 (%) |
| --- | --- | --- | --- | --- |
| Sequence only (no PSSM) | 73.6 | 40.6 | 76.2 | 58.4 (2.5) |
| PDNA-NR90 (375 sequences) | 63.8 | 65.9 | 63.4 | 64.6 (2.1) |
| PDNA-RDN (1386 sequences) | 64.0 | 67.1 | 63.3 | 65.2 (2.1) |
| NCBI-NR (1,547,365 sequences) | 66.7 | 69.5 | 63.9 | 66.7 (1.4) |
| PDB-ALL (47,179 sequences) | 62.6 | 65.6 | 61.8 | 64.7 (1.8) |
| PIR (283,177 sequences) | 66.4 | 68.2 | 66.0 | 67.1 (2.7) |
PDNA refers to sequences from protein-DNA complexes in the Protein Data Bank; NR90 means non-redundant at 90% sequence identity; RDN means the data are redundant because similar proteins have not been removed. Values in parentheses show the standard deviation of the values obtained from six cross-validation sets. Note that the sensitivity and specificity values shown in this table are only the pair that gives the best net prediction. These two scores can be traded off against each other by changing the cutoff threshold, as described in the text; comparisons between the data sets should therefore be made only on the net prediction value (the last column), which is the score optimized during training.
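As a rough illustration of how the table columns relate to one another, the sketch below computes the four scores from confusion-matrix counts. It assumes the conventional definitions (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP)); the record above does not spell these out, and the function names and example counts are hypothetical.

```python
def binding_site_scores(tp, fn, tn, fp):
    """Return the four table scores (in percent) from per-residue counts.

    Assumption: sensitivity and specificity follow the conventional
    definitions; the source record does not state them explicitly.
    """
    sensitivity = 100.0 * tp / (tp + fn)   # S1: binding residues recovered
    specificity = 100.0 * tn / (tn + fp)   # S2: non-binding residues recovered
    overall = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    net = (sensitivity + specificity) / 2  # the score compared across data sets
    return {
        "overall": overall,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "net": net,
    }


def net_prediction(s1, s2):
    """Net prediction as defined in the table header: (S1 + S2) / 2."""
    return (s1 + s2) / 2


# Hypothetical counts, for illustration only.
scores = binding_site_scores(tp=50, fn=50, tn=80, fp=20)
```

Because non-binding residues usually far outnumber binding ones, the overall percentage can be high even when sensitivity is low, which is one reason to compare data sets on the net prediction column instead.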
Figure 2. ROC analysis of binding site prediction using PSSMs against the PDNA-RDN reference data set, compared with results obtained from sequence-based predictions. The sensitivity of the prediction can be adjusted by changing the threshold on the predicted probability above which a residue is annotated as DNA-binding. The area under the PSSM-based prediction curve is noticeably greater than that obtained from sequence-based predictions. In addition, the balance between sensitivity and specificity appears difficult to adjust for sequence-based predictions, as the points on that curve are very closely spaced. The PDNA-RDN curve also shows the levels of prediction scores expected from our web-based predictions.
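The threshold adjustment described in the caption can be sketched as a simple sweep over cutoffs on the predicted probabilities. This is a minimal illustration under stated assumptions, not the authors' procedure: the probabilities, labels, and threshold grid below are made up, and the helper names are hypothetical.

```python
def roc_points(probs, labels, thresholds):
    """Return (threshold, sensitivity, specificity) for each cutoff.

    A residue is called DNA-binding when its predicted probability is
    at or above the threshold. Labels are 1 (binding) or 0 (non-binding).
    """
    points = []
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < t and y == 0)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        points.append((t, tp / (tp + fn), tn / (tn + fp)))
    return points


def best_threshold(points):
    """Pick the point with the highest net prediction (S1 + S2) / 2."""
    return max(points, key=lambda point: (point[1] + point[2]) / 2)


# Made-up predictions for six residues, for illustration only.
probs = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
points = roc_points(probs, labels, thresholds=[0.25, 0.5, 0.75])
best = best_threshold(points)
```

Lowering the threshold raises sensitivity at the cost of specificity, so each threshold contributes one point to a curve like the one in the figure.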
Figure 1. Rows of position-specific scoring matrices selected for neural network input. Network inputs consist of the PSSM rows of the target residue and of the two neighboring residues on each of its N- and C-terminal sides. Each residue is represented by a 20-dimensional vector of integer values, which are the (logarithmic) effective frequencies of occurrence at the respective positions in a multiple alignment. The neural network input layer is therefore made of 20 × 5 = 100 units. With two units in the only hidden layer and one unit in the output layer, the fully connected network has a total of 202 connections (100 × 2 + 2 × 1) to be trained.
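The windowing described in the caption can be sketched as follows: the PSSM rows of the target residue and its two neighbors on each side are concatenated into one 100-value input vector. The caption does not say how residues near the chain termini are handled, so the zero padding used here is an assumption, and the function names and toy PSSM are hypothetical.

```python
def window_input(pssm, index, flank=2):
    """Concatenate PSSM rows for residue `index` and `flank` neighbors per side.

    `pssm` is a list of rows, one per residue, each with 20 integer scores.
    Assumption: positions beyond the termini are padded with zeros; the
    source does not specify the padding scheme.
    """
    n_cols = len(pssm[0])
    features = []
    for pos in range(index - flank, index + flank + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
        else:
            features.extend([0] * n_cols)
    return features


def count_connections(n_in, n_hidden, n_out):
    """Connections in a fully connected net with one hidden layer (no biases)."""
    return n_in * n_hidden + n_hidden * n_out


# Toy PSSM for a 7-residue chain: row i is filled with the value i + 1.
toy_pssm = [[i + 1] * 20 for i in range(7)]
center = window_input(toy_pssm, index=3)    # full window, residues 1..5
n_term = window_input(toy_pssm, index=0)    # two padded positions on the left
total = count_connections(100, 2, 1)
```

With a window of five residues and 20 scores per residue, every input vector has 100 values, which matches the input layer size given in the caption.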