Wenwei Xiong, Tonghua Li, Kai Chen, Kailin Tang.
Abstract
Sequence-based motif prediction is of great interest and remains a challenge. In this work, we develop a local combinational variable approach for sequence-based helix-turn-helix (HTH) motif prediction. First, we choose a sequence data set for 88 proteins, each sequence 22 amino acids in length, and launch an optimized traversal to extract local combinational segments (LCS) from the data set. After LCS refinement, local combinational variables (LCV) are generated to construct prediction models for HTH motifs. The prediction ability of LCV sets at different thresholds is calculated to settle on a moderate threshold. The large data set we used comprises 13 HTH families, with 17 455 sequences in total. Our approach predicts HTH motifs more precisely using only primary protein sequence information, with 93.29% accuracy, 93.93% sensitivity and 92.66% specificity. Prediction results for newly reported HTH-containing proteins, compared with those of other prediction web services, indicate that the LCV approach yields a good prediction model. Comparisons with profile-HMM models from the Pfam protein families database show that the LCV approach maintains a good balance when dealing with HTH-containing proteins and non-HTH proteins at the same time. The LCV approach is to some extent complementary to the profile-HMM models because it better identifies false-positive data. Furthermore, genome-wide predictions detect new HTH proteins in both Homo sapiens and Escherichia coli, which broadens the applications of the LCV approach. Software for mining LCVs from a sequence data set can be obtained freely from the anonymous FTP site ftp://cheminfo.tongji.edu.cn/LCV/.
Year: 2009 PMID: 19651875 PMCID: PMC2761287 DOI: 10.1093/nar/gkp628
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
HTH families
| HTH family | Sequences | Description from SMART |
|---|---|---|
| HTH_ARAC | 5426 | Arabinose operon control protein |
| HTH_ARSR | 2111 | Arsenical resistance operon repressor |
| HTH_ASNC | 1571 | An autogenously regulated activator of asparagine synthetase |
| HTH_CRP | 810 | cAMP regulatory protein |
| HTH_DEOR | 1099 | Deoxyribose operon repressor |
| HTH_DTXR | 252 | Diphteria tox regulatory element |
| HTH_GNTR | 4402 | Gluconate operon transcriptional repressor |
| HTH_ICLR | 1058 | Isocitrate lyase regulation |
| HTH_LACI | 2137 | Lactose operon repressor |
| HTH_LUXR | 4068 | Lux regulon |
| HTH_MARR | 2777 | Multiple antibiotic resistance protein |
| HTH_MERR | 2046 | Mercury resistance |
| HTH_XRE | 6002 | XRE-family like proteins |
Results for traditional methods
| Method | Variable count | Ac (%) | Sn (%) | Sp (%) | CC |
|---|---|---|---|---|---|
| Single-residue | 20 | 74.50 | 70.59 | 78.41 | 0.49 |
| Double-residue | 400 | 80.65 | 76.91 | 84.39 | 0.61 |
| CTD | 104 | 76.72 | 73.92 | 79.51 | 0.54 |
Ac, Sn, Sp and CC denote the accuracy, sensitivity, specificity and correlation coefficient, respectively.
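The four measures reported in these tables can be computed from the counts of a binary confusion matrix. The following is a minimal sketch, not the authors' code: the function name and the example counts are hypothetical, and CC is assumed here to be the Matthews correlation coefficient, which the record does not state explicitly.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return (Ac, Sn, Sp, CC) for a binary confusion matrix.

    Ac, Sn and Sp are percentages. CC is assumed to be the Matthews
    correlation coefficient (an assumption, not stated in the source).
    """
    total = tp + tn + fp + fn
    ac = 100.0 * (tp + tn) / total   # accuracy: all correct / all cases
    sn = 100.0 * tp / (tp + fn)      # sensitivity: true-positive rate
    sp = 100.0 * tn / (tn + fp)      # specificity: true-negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ac, sn, sp, cc


# Hypothetical counts, for illustration only (not taken from the paper).
ac, sn, sp, cc = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(f"Ac={ac:.2f}%  Sn={sn:.2f}%  Sp={sp:.2f}%  CC={cc:.2f}")
```

Returning all four values from a single function keeps them tied to the same confusion matrix, which is how they are reported row by row in the tables.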
Prediction ability for different thresholds
| Threshold | Variable count | Ac (%) | Sn (%) | Sp (%) | CC |
|---|---|---|---|---|---|
| 7 | 567 | 93.84 | 94.85 | 92.82 | 0.88 |
| 8 | 354 | 93.29 | 93.93 | 92.66 | 0.87 |
| 13 | 78 | 88.20 | 89.59 | 86.80 | 0.77 |
| 17 | 27 | 83.21 | 84.80 | 79.63 | 0.65 |
Ac, Sn, Sp and CC denote the accuracy, sensitivity, specificity and correlation coefficient, respectively.
Figure 1. ROC curves for different thresholds (thresholds of 7, 8, 13 and 17 from top to bottom).
Prediction results compared with GYM2.0
| HTH family | Count | GYM detected | GYM Ac (%) | LCV detected | LCV Ac (%) |
|---|---|---|---|---|---|
| HTH_ARAC | 4463 | 3492 | 78.24 | 3618 | 81.07 |
| HTH_ARSR | 1597 | 756 | 47.34 | 1240 | 77.65 |
| HTH_ASNC | 1274 | 852 | 66.87 | 1211 | 95.05 |
| HTH_CRP | 561 | 544 | 96.97 | 522 | 93.05 |
| HTH_DEOR | 736 | 602 | 81.79 | 624 | 84.78 |
| HTH_DTXR | 195 | 126 | 64.61 | 156 | 80.00 |
| HTH_GNTR | 3454 | 2701 | 78.20 | 3154 | 91.31 |
| HTH_ICLR | 760 | 580 | 76.32 | 706 | 92.89 |
| HTH_LACI | 1485 | 1454 | 97.91 | 1455 | 97.98 |
| HTH_LUXR | 2909 | 2331 | 80.13 | 2314 | 73.36 |
| HTH_MARR | 2152 | 1156 | 53.72 | 1934 | 89.87 |
| HTH_MERR | 1516 | 1103 | 72.76 | 1347 | 88.85 |
| HTH_XRE | 4454 | 3828 | 85.95 | 4164 | 93.49 |
| Total | 25 556 | 19 525 | 76.40 | 22 445 | 87.83 |
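The Ac (%) columns in this comparison appear to be the number of detected sequences divided by the family count. A short sketch of that arithmetic, using three rows copied from the table, is given below; the helper name is hypothetical and the interpretation of the column is an assumption.

```python
def detection_rate(detected, count):
    """Percentage of sequences detected, rounded to two decimals."""
    return round(100.0 * detected / count, 2)


# (count, GYM detected, LCV detected), copied from the table above.
rows = {
    "HTH_ARAC": (4463, 3492, 3618),
    "HTH_ARSR": (1597, 756, 1240),
    "Total": (25556, 19525, 22445),
}

for family, (count, gym, lcv) in rows.items():
    print(family, detection_rate(gym, count), detection_rate(lcv, count))
```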
Prediction results of LCV and Pfam
| Method | Ac (%) | Sn (%) | Sp (%) | CC |
|---|---|---|---|---|
| LCVs (1) | 95.40 | 92.60 | 96.45 | 0.88 |
| HMM (1) | 87.05 | 100.00 | 81.87 | 0.75 |
| LCVs (2) | 93.12 | 90.91 | 93.83 | 0.82 |
| HMM (2) | 69.90 | 97.15 | 61.10 | 0.50 |
LCVs (1) denotes LCVs derived from the QuintessentialSet-88 set; HMM (1) denotes the HMM model of Crp clan in Pfam database; LCVs (2) denotes LCVs derived from SEEDs in Crp clan and HMM (2) denotes HMM model constructed by sequences in the QuintessentialSet-88 set. In both HMMs, models are calibrated to increase the sensitivity of search, and E-values are empirical estimates.
Statistics on newly detected HTH proteins of subcellular location and gene ontology
| Subcellular location | Count | Gene ontology | Count |
|---|---|---|---|
| H. sapiens | | | |
| Membrane | 76 | C:integral to membrane | 88 |
| Nucleus | 70 | C:nucleus | 72 |
| Secreted | 51 | F:protein binding | 59 |
| Cytoplasm | 48 | C:cytoplasm | 41 |
| Cell membrane | 21 | P:transcription | 36 |
| Endoplasmic reticulum membrane | 16 | P:regulation of transcription, DNA-dependent | 35 |
| Mitochondrion | 10 | C:extracellular region | 33 |
| Extracellular space | 9 | F:transcription factor activity | 28 |
| Cell junction | 5 | F:zinc ion binding | 23 |
| Total | 387 | Total | 1477 |
| E. coli | | | |
| Cytoplasm | 10 | F:DNA binding | 43 |
| Periplasm | 7 | P:regulation of transcription, DNA-dependent | 37 |
| Fimbrium | 1 | C:membrane | 28 |
| Secreted | 1 | P:signal transduction | 20 |
| | | P:chemotaxis | 19 |
| | | F:transcription factor activity | 18 |
| | | F:transmembrane receptor activity | 18 |
| | | P:pathogenesis | 14 |
| | | P:transposition, DNA-mediated | 13 |
| | | P:DNA recombination | 13 |
| Total | 19 | Total | 592 |
There are 350 and 861 new HTH proteins detected in H. sapiens and E. coli, respectively. The total count under each item is the number of annotation entries that can be found in the UniProtKB database. Some reviewed proteins may have more than one annotation entry; in contrast, unreviewed ones may have no annotations. Only the top 10 annotation entry groups are listed.
Figure 2. Long-range interactions between residues in LCVs within the HTH motif. The HTH motif (red for α-helix, green for turn) and its bound DNA (purple) are shown in the 3D structure of the molecule (PDB ID: 1BDI). Two potential interactions occur between the residues of the {A7,V12} (cyan) and {S13,V17} (blue) LCVs, at distances of 2.83 and 3.11 Å, respectively.
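The {A7,V12} notation in the Figure 2 caption suggests that an LCV pairs specific residues with specific positions inside the 22-residue motif window. The sketch below shows one way such variables could be turned into binary features for a prediction model. It is an illustration under stated assumptions, not the authors' software: positions are assumed to be 1-based within the window, the helper names `parse_lcv`, `matches` and `encode` are hypothetical, and the example segment is made up.

```python
from typing import FrozenSet, List, Tuple

# An LCV is modelled here as a set of (residue, 1-based position) pairs.
LCV = FrozenSet[Tuple[str, int]]


def parse_lcv(text: str) -> LCV:
    """Parse a string such as '{A7,V12}' into {('A', 7), ('V', 12)}."""
    items = [item.strip() for item in text.strip("{} ").split(",")]
    return frozenset((item[0], int(item[1:])) for item in items)


def matches(segment: str, lcv: LCV) -> bool:
    """True if every residue of the LCV occurs at its position in the segment."""
    return all(
        1 <= pos <= len(segment) and segment[pos - 1] == res
        for res, pos in lcv
    )


def encode(segment: str, lcvs: List[LCV]) -> List[int]:
    """Binary feature vector: one 0/1 entry per LCV."""
    return [1 if matches(segment, lcv) else 0 for lcv in lcvs]


# The two LCVs named in the Figure 2 caption.
lcvs = [parse_lcv("{A7,V12}"), parse_lcv("{S13,V17}")]

# Made-up 22-residue segment with A at position 7 and V at position 12.
segment = "G" * 6 + "A" + "G" * 4 + "V" + "G" * 10
print(encode(segment, lcvs))
```

A feature vector of this kind, with one entry per LCV retained at the chosen threshold, could then be passed to any standard classifier; the record does not say which model the authors used.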
Figure 3. LCV count distribution for each HTH family. The number of matches per sequence is shown for the top 10 LCV numbers.