Beisi Xu, Dustin E Schones, Yongmei Wang, Haojun Liang, Guohui Li.
Abstract
Scanning through genomes for potential transcription factor binding sites (TFBSs) is becoming increasingly important in this post-genomic era. The position weight matrix (PWM) is the standard representation of TFBSs utilized when scanning through sequences for potential binding sites. However, many transcription factor (TF) motifs are short and highly degenerate, and methods utilizing PWMs to scan for sites are plagued by false positives. Furthermore, many important TFs do not have well-characterized PWMs, making identification of potential binding sites even more difficult. One approach to the identification of sites for these TFs has been to use the 3D structure of the TF to predict the DNA structure around the TF and then to generate a PWM from the predicted 3D complex structure. However, this approach is dependent on the similarity of the predicted structure to the native structure. We introduce here a novel approach to identify TFBSs utilizing structure information that can be applied to TFs without characterized PWMs, as long as a 3D complex structure (TF/DNA) exists. This approach utilizes an energy function that is uniquely trained on each structure. Our approach leads to increased prediction accuracy and robustness compared with those using a more general energy function. The software is freely available upon request.
Year: 2013 PMID: 23320072 PMCID: PMC3540023 DOI: 10.1371/journal.pone.0052460
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Structure-based strategy for predicting transcription factor binding sites.
An illustration of the structure-based strategy for predicting transcription factor binding sites: a) A native structure of a TF bound to its TFBS (PDB id: 2ERE) is used as the structure template (image created with PyMOL [54]). Each base pair in the TFBS of length L (TFBS lengths are listed in Table 1) is replaced in turn by each of the four possible base pairs, and after each replacement only the binding energy contributed by the substituted base pair and the TF is calculated. An L×4 position energy matrix (PEM) is thus generated for each TF. b) The sequences of ORFs (here, for example, YMR108W upstream −470∼−455) are threaded through a specific TF's position energy matrix to obtain the binding energy of each sequence with the TF. For example, the binding energy of the sequence CTGCCGGTACCGGC would be given as $E = \mathrm{PEM}(1,\mathrm{C}) + \mathrm{PEM}(2,\mathrm{T}) + \dots + \mathrm{PEM}(14,\mathrm{C})$, while the binding energy of the sequence offset by one position, TGCCGGTACCGGCT, would be given as $E = \mathrm{PEM}(1,\mathrm{T}) + \mathrm{PEM}(2,\mathrm{G}) + \dots + \mathrm{PEM}(14,\mathrm{T})$. c) The binding energies of all sequences are sorted from lowest to highest (left), and the binding sites from the TRANSFAC and SCPD databases (right) are matched by overlapping position in the same ORF. A prediction whose overlap with a database binding site (Lo) exceeds 50% of that site's length (Ld) is considered a True Positive [55]. Note that some database binding sites are much longer than the site in the native complex structure (Ln); in these cases, we used Lo/Ln>50% as the criterion for classifying the prediction as a binding site. False Positive indicates a predicted TFBS not overlapping any TFBS in the databases. False Negative indicates a TFBS in the databases overlapping with a predicted result not classified as a TFBS. d) The position energy matrix derived from PDB 2ERE can be converted to a PWM by the Boltzmann formula [51]. The WebLogo [56] of the converted PWM with β = 0.05 is shown. Positions 3∼12 correspond to positions 1∼10 in part (e). e) MA0324.1 (NAME: LEU3) from the JASPAR CORE database [50] is shown for comparison. We successfully predicted the most probable base pair at 9 out of 10 positions.
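The threading step in panel (b) and the overlap rule in panel (c) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: the 3-position `PEM` values are made up, the function names are hypothetical, and the rule "use Ln whenever the database site is longer than the native site" is an assumption, since the caption gives no threshold for "much longer".

```python
from typing import Dict, List, Tuple

# Hypothetical 3-position PEM; the energies are invented for illustration.
# One dict per position, keyed by base; lower energy means stronger binding.
PEM: List[Dict[str, float]] = [
    {"A": 0.0, "C": -2.0, "G": 0.5, "T": 1.0},
    {"A": 1.0, "C": 0.0, "G": -1.5, "T": 0.5},
    {"A": -1.0, "C": 0.5, "G": 0.0, "T": 1.0},
]


def binding_energy(pem: List[Dict[str, float]], site: str) -> float:
    """Sum the PEM entry selected by the base at each position of the site."""
    if len(site) != len(pem):
        raise ValueError("site length must equal the number of PEM rows")
    return sum(row[base] for row, base in zip(pem, site))


def scan(pem: List[Dict[str, float]], sequence: str) -> List[Tuple[float, int, str]]:
    """Thread every window of the sequence through the PEM, lowest energy first."""
    width = len(pem)
    hits = [
        (binding_energy(pem, sequence[i:i + width]), i, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]
    return sorted(hits)


def is_true_positive(pred_start: int, site_start: int, site_len: int,
                     native_len: int) -> bool:
    """Overlap rule: Lo / Ld > 50%, or Lo / Ln > 50% when the database site
    is longer than the native site (assumed reading of 'much longer')."""
    pred_end = pred_start + native_len          # predicted window has length Ln
    site_end = site_start + site_len
    overlap = max(0, min(pred_end, site_end) - max(pred_start, site_start))
    reference = min(site_len, native_len)
    return overlap / reference > 0.5


ranked = scan(PEM, "TCGAC")
```

Here `ranked[0]` is the window with the lowest summed energy, which is the candidate a ranking such as the one in panel (c) would place first.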
Data Summary.
| TF | PDB chain id | Nsites | NORF | Nresidue | LTFBS |
| GAL4 | 3coq_A,B | 15 | 7 | 178 | 19 |
| GCN4 | 1ysa_C,D | 18 | 11 | 114 | 16 |
| HAP1 | 1hwt_C,D,G,H | 9 | 5 | 287 | 19 |
| LEU3 | 2ere_A,B | 5 | 5 | 120 | 14 |
| MATA1 | 1yrn_A | 1 | 1 | 49 | 13 |
| MATALPHA2 | 1apl_C | 10 | 7 | 59 | 12 |
| MCM1 | 1mnm_A,B | 26 | 21 | 166 | 20 |
| MCM1_MATALPHA2 | 1mnm_A,B,C,D | 1 | 1 | 320 | 25 |
| NDT80 | 1mnn_A | 1 | 1 | 290 | 13 |
| PHO4 | 1a0a_A,B | 4 | 1 | 126 | 16 |
| PPR1 | 1pyi_A,B | 1 | 1 | 158 | 14 |
| PUT3 | 1zme_C,D | 3 | 2 | 140 | 14 |
| RAP1 | 1ign_A | 24 | 18 | 189 | 18 |
| TBP | 1ytb_A | 7 | 4 | 180 | 12 |
| TFIIA | 1ytf_C | 1 | 1 | 192 | 13 |
| TFIIA_TBP | 1ytf_A,C | 1 | 1 | 372 | 15 |
The transcription factors (TFs), their structure identifications and the number of experimentally verified transcription factor binding sites (TFBS) in yeast.
TF: Transcription factor name; ‘_’ denotes transcription factors found in a complex.
PDB chain id: The structure used to represent the TF.
Nsites: The number of binding sites collected from TRANSFAC and SCPD.
NORF: The number of ORFs in which these binding sites reside.
Nresidue: The number of residues in the TF.
LTFBS: The length of the TFBS. Based on the related PDB structure file, base pairs whose atoms are 10 Å away from the nearest atoms on the TF are excluded from the count.
Figure 2. Prediction accuracy on different structure sets.
Prediction accuracy on different structure sets, where RMSD∼0 denotes use of the native structure from the PDB. Groups RMSD∼n (n = 1, 2, 3, 4) indicate structure templates that have the same protein structure as the native structure but in which the DNA has been displaced by basic translation and rotation. The RMSD between the displaced DNA and the native DNA lies in (n−1, n]. Different training sets are shown with different symbols. a) Yeast_tFIRE: each prediction uses the structure template itself as the training set. b) All_vcFIRE: uses all 212 TF/DNA structures in the PDB on which we tested our energy functions in a previous study [39]. c) Yeast_vcFIRE: uses all yeast TF/DNA structures as the training set, including the 16 structures shown in Table 1. d) Yeast_tFIRE_Mutant: trains each energy function with one structure, but each position of the DNA sequence is mutated to a random base pair with equal probability. e) Yeast_tFIRE_Reference: sets all non-lowest energy values from Yeast_tFIRE to zero.
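The grouping of perturbed templates into RMSD∼n bins can be illustrated as follows. This is a rough sketch under assumptions: the RMSD is computed directly on matched atom coordinates without any superposition step, the coordinates are toy values, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root-mean-square deviation between two equally ordered atom lists."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(squared / len(a))


def rmsd_group(value: float) -> int:
    """Return 0 for the native structure and n for a value in (n-1, n]."""
    return math.ceil(value)


# Toy DNA coordinates, translated by 1.5 along x (made-up values).
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(x + 1.5, y, z) for x, y, z in native]
group = rmsd_group(rmsd(native, shifted))
```

With the toy shift of 1.5, the perturbed template falls in the RMSD∼2 group, since 1.5 lies in (1, 2].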
Comparison to other methods.
| Transcription Factor | NS | NO | Our SE | Our SP | Our AUC | AlignACE SE | AlignACE SP | AlignACE AUC | BioProspector SE | BioProspector SP | BioProspector AUC | MEME SE | MEME SP | MEME AUC |
| MCM1_MATALPHA2 | 1 | 1 | 1.00 | 1.00 | 1.00 | - | - | - | - | - | - | - | - | - |
| PPR1 | 1 | 1 | 1.00 | 1.00 | 1.00 | - | - | - | - | - | - | - | - | - |
| NDT80 | 1 | 1 | 1.00 | 1.00 | 1.00 | - | - | - | - | - | - | - | - | - |
| LEU3 | 5 | 5 | 1.00 | 1.00 | 1.00 | 0.40 | 1.00 | 0.76 | 1.00 | 0.83 | 0.98 | 0.60 | 1.00 | 0.88 |
| MATA1 | 1 | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 0.07 | 0.42 | - | - | - | - | - | - |
| TFIIA_TBP | 1 | 1 | 1.00 | 0.50 | 0.91 | - | - | - | - | - | - | - | - | - |
| GAL4 | 15 | 7 | 0.80 | 1.00 | 0.90 | 1.00 | 1.00 | 1.00 | 0.60 | 1.00 | 0.91 | 0.67 | 0.62 | 0.80 |
| MCM1 | 26 | 21 | 0.96 | 0.49 | 0.86 | 0.65 | 0.85 | 0.85 | 0.65 | 0.94 | 0.83 | 0.85 | 0.16 | 0.61 |
| MATALPHA2 | 10 | 7 | 0.50 | 0.83 | 0.83 | 0.90 | 0.08 | 0.47 | 0.30 | 1.00 | 0.68 | 0.10 | 1.00 | 0.64 |
| TBP | 7 | 4 | 0.71 | 0.83 | 0.81 | 1.00 | 0.07 | 0.57 | 1.00 | 0.12 | 0.60 | 0.86 | 0.06 | 0.39 |
| PHO4 | 4 | 1 | 1.00 | 0.29 | 0.78 | 1.00 | 0.44 | 0.78 | - | - | - | - | - | - |
| PUT3 | 3 | 2 | 0.67 | 1.00 | 0.76 | 1.00 | 0.17 | 0.59 | 1.00 | 0.05 | 0.46 | 0.33 | 0.50 | 0.67 |
| GCN4 | 18 | 11 | 0.39 | 0.78 | 0.72 | 0.56 | 0.06 | 0.29 | 0.17 | 1.00 | 0.63 | 0.06 | 1.00 | 0.44 |
| HAP1 | 9 | 5 | 0.89 | 0.12 | 0.68 | 0.75 | 0.06 | 0.56 | 0.89 | 0.06 | 0.46 | 1.00 | 0.20 | 0.73 |
| RAP1 | 24 | 18 | 0.12 | 1.00 | 0.66 | 0.11 | 1.00 | 0.20 | 0.41 | 0.07 | 0.22 | 0.83 | 0.14 | 0.58 |
| TFIIA | 1 | 1 | 1.00 | 0.14 | 0.62 | - | - | - | - | - | - | - | - | - |
| Average | 7.94 | 5.44 | 0.67 | 0.78 | 0.80 | 0.71 | 0.48 | 0.59 | 0.67 | 0.56 | 0.64 | 0.59 | 0.52 | 0.64 |
Most tFIRE predictions are better than those of the other methods. The exceptions are GAL4, for which AlignACE and BioProspector outperform our structure-based approach, and HAP1, for which MEME does.
NS: Number of sites collected from TRANSFAC and SCPD.
NO: Number of ORFs in which these binding sites occur.
SE: Sensitivity [TP/(TP+FN)]. SP: Specificity [TP/(TP+FP)]. AUC: Area Under Receiver operating characteristic Curve.
AlignACE [48] v4.0 results were obtained with the parameter: number of columns = LTFBS.
BioProspector [49] results were obtained with the parameters: motif width = LTFBS; top motifs to report = NSites.
MEME [47] v4.3.0 results were obtained with the parameters: maximum motif width = LTFBS; maximum sites = NSites; maximum motif number = NSites.
Average value over LEU3, GAL4, MCM1, MATALPHA2, TBP, PUT3, GCN4, HAP1, RAP1.
Figure 3. Performance of three de novo methods and tFIRE using GCN4 as an example.
The prediction results of binding are shown in the ROC plot. The ROC curves were generated by plotting the true positive rate [TP/(TP+FN)] (y-axis) against the false positive rate [FP/(TN+FP)] (x-axis). The AUC values for the three methods are shown in parentheses.
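The ROC construction described in the caption can be sketched as below. It is a minimal illustration, assuming that candidates are ranked by binding energy with the lowest energy treated as the most likely site, and that the area is taken by the trapezoidal rule; the energies, labels, and function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(energies: Sequence[float],
               labels: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points, sweeping the
    energy threshold from the lowest energy upward."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    ranked = sorted(zip(energies, labels))
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Consume all candidates tied at this energy before emitting a point.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under the ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up energies; True marks a candidate overlapping a known site.
example = roc_points([-5.0, -4.0, -3.0, -2.0], [True, False, True, False])
example_auc = auc(example)
```

In this toy case three of the four positive/negative pairs are ordered correctly, so the area is 0.75.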
Prediction accuracy.
| Transcription Factor | TP | FN | FP | SE | SP | SE+SP | AUC | NTSites | NSites | NTORF | NORF |
| MCM1_MATALPHA2 | 1 | 0 | 0 | 1.00 | 1.00 | 2.00 | 1.00 | 1 | 1 | 1 | 1 |
| PPR1 | 1 | 0 | 0 | 1.00 | 1.00 | 2.00 | 1.00 | 1 | 1 | 1 | 1 |
| NDT80 | 1 | 0 | 0 | 1.00 | 1.00 | 2.00 | 1.00 | 1 | 1 | 1 | 1 |
| LEU3 | 5 | 0 | 0 | 1.00 | 1.00 | 2.00 | 1.00 | 5 | 5 | 5 | 5 |
| MATA1 | 1 | 0 | 0 | 1.00 | 1.00 | 2.00 | 1.00 | 1 | 1 | 1 | 1 |
| TFIIA_TBP | 1 | 0 | 1 | 1.00 | 0.50 | 1.50 | 0.91 | 0 | 1 | 0 | 1 |
| GAL4 | 12 | 3 | 0 | 0.80 | 1.00 | 1.80 | 0.90 | 12 | 15 | 7 | 7 |
| MCM1 | 25 | 1 | 26 | 0.96 | 0.49 | 1.45 | 0.86 | 22 | 26 | 18 | 21 |
| MATALPHA2 | 5 | 5 | 1 | 0.50 | 0.83 | 1.33 | 0.83 | 6 | 10 | 5 | 7 |
| TBP | 5 | 2 | 1 | 0.71 | 0.83 | 1.55 | 0.81 | 4 | 7 | 4 | 4 |
| PHO4 | 4 | 0 | 10 | 1.00 | 0.29 | 1.29 | 0.78 | 1 | 4 | 1 | 1 |
| PUT3 | 2 | 1 | 0 | 0.67 | 1.00 | 1.67 | 0.76 | 2 | 3 | 2 | 2 |
| GCN4 | 7 | 11 | 2 | 0.39 | 0.78 | 1.17 | 0.72 | 8 | 18 | 5 | 11 |
| HAP1 | 8 | 1 | 56 | 0.89 | 0.12 | 1.01 | 0.68 | 3 | 9 | 2 | 5 |
| RAP1 | 3 | 21 | 0 | 0.12 | 1.00 | 1.12 | 0.66 | 11 | 24 | 8 | 18 |
| TFIIA | 1 | 0 | 6 | 1.00 | 0.14 | 1.14 | 0.63 | 0 | 1 | 0 | 1 |
| Average | 5.13 | 2.81 | 6.44 | 0.81 | 0.75 | 1.56 | 0.85 | 4.88 | 7.94 | 3.81 | 5.44 |
| Standard Deviation | 6.17 | 5.64 | 14.81 | 0.27 | 0.33 | 0.37 | 0.13 | 5.93 | 8.48 | 4.53 | 6.26 |
tFIRE achieves an average AUC of 0.85±0.13, and many of the predictions are top-ranked TFBSs.
Transcription factor name; ‘_’ denotes a complex of two transcription factors.
TP: true positive. FN: false negative. FP: false positive. SE: Sensitivity [TP/(TP+FN)]. SP: Specificity [TP/(TP+FP)].
AUC: Area under the receiver operating characteristic curve.
NTSites: Number of sites ranked top; the higher the value, the better the discrimination ability within an ORF.
NSites: Number of sites collected from TRANSFAC and SCPD.
NTORF: Number of ORFs in which the prediction achieved a top-ranked TFBS.
NORF: Number of ORFs in which these binding sites occur.
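The SE and SP columns follow directly from the TP, FN, and FP counts using the formulas in the footnote. The short sketch below reproduces two rows of the table; the function names are hypothetical. Note that the quantity the table labels SP, TP/(TP+FP), is the one more commonly called precision or positive predictive value.

```python
def sensitivity(tp: int, fn: int) -> float:
    """SE = TP / (TP + FN), as defined in the table footnote."""
    return tp / (tp + fn)


def specificity_as_defined(tp: int, fp: int) -> float:
    """SP = TP / (TP + FP), following the table footnote's definition."""
    return tp / (tp + fp)


# Counts taken from the GCN4 and LEU3 rows of the table.
gcn4_se = round(sensitivity(7, 11), 2)
gcn4_sp = round(specificity_as_defined(7, 2), 2)
leu3_se = round(sensitivity(5, 0), 2)
leu3_sp = round(specificity_as_defined(5, 0), 2)
```

For GCN4 this gives 0.39 and 0.78, matching the tabulated SE and SP, and their sum matches the SE+SP entry of 1.17.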
PWM similarity to well-characterized PWM.
| Transcription Factor | Pre-PWM | Even-PWM |
| GAL4 | 0.15 | 0.32 |
| GCN4 | 0.12 | 0.32 |
| HAP1 | 0.33 | 0.40 |
| LEU3 | 0.22 | 0.48 |
| MATA1 | 0.73 | 0.70 |
| MATALPHA2 | 0.31 | 0.64 |
| MCM1 | 0.34 | 0.41 |
| NDT80 | 0.11 | 0.44 |
| PHO4 | 0.19 | 0.54 |
| PUT3 | 0.40 | 0.52 |
| RAP1 | 0.49 | 0.52 |
| TBP | 0.17 | 0.43 |
| Average | 0.30 | 0.48 |
| Standard Deviation | 0.18 | 0.12 |
The ψ-test of the predicted PWM against the experimental PWM indicates prediction accuracy. The smaller the ψ-test value relative to that of Even-PWM, the better.
Pre-PWM: the ψ-test [25] value of each TF's predicted PWM (Pre-PWM) against the experimental PWM collected from JASPAR (33); Pre-PWMs are converted from the PEM by the Boltzmann formula [51].
Even-PWM: the ψ-test value of a PWM with an equal frequency of 0.25 for A, C, G, and T at each position, compared with the experimental PWM.
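The conversion from a PEM to a PWM can be sketched with a standard Boltzmann weighting. This is an illustration under assumptions rather than the exact formula of reference [51]: each base's probability at a position is taken as exp(−βE) normalized over the four bases, so lower energy maps to higher probability. The default β = 0.05 is the value quoted in the Figure 1 caption; the energies and the function name are hypothetical.

```python
import math
from typing import Dict, List


def pem_to_pwm(pem: List[Dict[str, float]],
               beta: float = 0.05) -> List[Dict[str, float]]:
    """Convert each PEM row to base probabilities via exp(-beta * E),
    normalized so that every position sums to 1 (assumed Boltzmann form)."""
    pwm = []
    for row in pem:
        weights = {base: math.exp(-beta * energy) for base, energy in row.items()}
        total = sum(weights.values())
        pwm.append({base: weight / total for base, weight in weights.items()})
    return pwm


# Made-up energies: position 1 favors C, position 2 has no preference.
toy_pem = [
    {"A": 0.0, "C": -40.0, "G": 10.0, "T": 5.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0},
]
toy_pwm = pem_to_pwm(toy_pem)
```

A position with identical energies for all four bases yields 0.25 for each base, which is exactly a row of the Even-PWM used as the baseline in the table.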
Figure 4. Predicted PWMs of two TF-TF complexes and their subunits.
The PWM of MCM1/MATALPHA2 complex is most likely to be a superposition of the MCM1 and MATALPHA2 PWMs. This indicates that MCM1 and MATALPHA2 both have strong binding affinity to DNA. Their complex's binding pattern contains both MCM1 and MATALPHA2 binding patterns. Conversely, TFIIA, which does not bind DNA directly, shows a very weak PWM, as expected. The TFIIA/TBP complex leads to a very different PWM prediction as compared to TBP alone. This indicates that TFIIA may not bind DNA directly, but it may alter the TBP structure [52].