Hui-Lin Huang, I-Che Lin, Yi-Fan Liou, Chia-Ta Tsai, Kai-Ti Hsu, Wen-Lin Huang, Shinn-Jang Ho, Shinn-Ying Ho.
Abstract
BACKGROUND: Existing methods for predicting DNA-binding proteins use valuable features derived from physicochemical properties to design support vector machine (SVM) based classifiers. The selection of physicochemical properties and the determination of their corresponding feature vectors generally rely on known properties of the binding mechanism and on the experience of the designers. However, designers face a troublesome problem: some different physicochemical properties have similar vectors representing the 20 amino acids, while some closely related physicochemical properties have dissimilar vectors.
Year: 2011 PMID: 21342579 PMCID: PMC3044304 DOI: 10.1186/1471-2105-12-S1-S47
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Related works of predicting DNA-binding domains/proteins from sequences
| Reference | Sequence type | Identity | Feature number | Representation | Feature type | Classifier |
|---|---|---|---|---|---|---|
| Shao et al. 2009 | protein | 25% | 343 | Seven class Conjoint triad | PCP | SVM |
| Fang et al. 2008 | protein | 35% | 40 | Pseudo-AA composition | PCP | SVM |
| Yu et al. 2006 | protein | 25% | 132 | Combined descriptors | PCP | SVM |
| Cai et al. 2003 | protein | 40% | 40 | Pseudo-AA composition | PCP | SVM |
| Kumar et al. 2007 [ | domain and protein | 25% | 400 | PSSM | PSSM | SVM |
| Auto-IDPCPs | domain and protein | 25% | m* | Mean value of sequence# | PCP | SVM |
PCP: physicochemical property and biochemical property
*: a small number (<30) of feature vectors selected from 531 vectors
#: The averaged value of amino acids in a sequence for one property
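The "mean value of sequence" representation in the table above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy property scale are hypothetical, and real values would come from AAindex entries.

```python
from statistics import mean

# Hypothetical property scale for illustration only (NOT taken from AAindex):
# one numeric value per amino acid, keyed by its one-letter code.
TOY_SCALE = {aa: float(i) for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}


def mean_property_feature(sequence, scale):
    """Average one property over all residues of a sequence.

    Residues missing from the scale (e.g. 'X') are skipped; this skipping
    rule is an assumption of the sketch.
    """
    values = [scale[aa] for aa in sequence.upper() if aa in scale]
    if not values:
        raise ValueError("sequence contains no residues covered by the scale")
    return mean(values)


def encode(sequence, scales):
    """Build an m-dimensional feature vector: one mean value per property."""
    return [mean_property_feature(sequence, scale) for scale in scales]
```

With m selected properties, `encode(sequence, scales)` yields an m-dimensional vector regardless of sequence length, which is what lets domains and full proteins share one classifier input format.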
Figure 1. An illustrative example. H88 and A392 are two different properties, yet their distance, 0.0178, is relatively small. By contrast, H88 and H178, which belong to the same group (Hydrophobicity) in AAindex, have a large distance of 0.0877. H88 and H151, also in the same group, have a distance of 0.0299, which is larger than that between H88 and A392.
Figure 2. The minimum spanning tree of the amino acid indices stored in the AAindex1 release 9.0 [10]. Each rectangle is an amino acid index. Coloured nodes represent the indices classified by Tomii and Kanehisa [11]. Red (A): alpha and turn propensities, Yellow (B): beta propensity, Green (C): composition, Blue (H): hydrophobicity, Cyan (P): physicochemical properties, Gray (O): other properties. White: the indices added to the AAindex after release 3.0 by Tomii and Kanehisa [11].
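A minimum spanning tree such as the one in Figure 2 can be built from any symmetric matrix of pairwise distances between indices. The sketch below uses Prim's algorithm; the distance measure behind Figure 2 is not stated here, so the toy matrix and the function name are assumptions for illustration only.

```python
import heapq


def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns a list of (i, j, d) edges connecting all n nodes with
    minimum total distance.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    # Candidate edges leaving the tree, ordered by distance.
    heap = [(dist[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while heap and len(edges) < n - 1:
        d, i, j = heapq.heappop(heap)
        if in_tree[j]:
            continue  # node j was already reached through a shorter edge
        in_tree[j] = True
        edges.append((i, j, d))
        for k in range(n):
            if not in_tree[k]:
                heapq.heappush(heap, (dist[j][k], j, k))
    return edges


# Toy 4-node distance matrix (hypothetical values).
TOY_DIST = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]
TOY_EDGES = minimum_spanning_tree(TOY_DIST)
```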
Figure 3. The flowchart of the proposed approach Auto-IDPCPs.
Statistics of the three data sets
| Datasets | Sequence | No. of DNA-binding | No. of non-DNA-binding |
|---|---|---|---|
| DNAset | domain | 146 | 250 |
| DNAaset | protein | 1153 | 1153 |
| DNAiset | domain / protein | 92 | 100 |
The 20 clusters and their corresponding physicochemical and biochemical properties in the AAindex database
| Cluster | No. | The label of 531 physicochemical and biochemical properties |
|---|---|---|
| C1 | 2 | P: 118 O: 156 |
| C2 | 2 | P: 504 505 |
| C3 | 6 | H: 10 11 446 447 448 449 |
| C4 | 3 | P: 9 112 150 |
| C5 | 4 | C: 116 H: 34 127 P: 117 |
| C6 | 6 | A: 313 H: 129 145 364 P: 177 O: 312 |
| C7 | 147 | A: 19 25 49 50 52 74 166 258 259 260 261 262 263 264 265 266 267 268 269 270 274 286 287 288 289 290 291 292 293 294 295 296 341 346 347 348 350 351 359 362 376 392 424 454 455 487 524 |
| C8 | 3 | P: 65 135 517 |
| C9 | 132 | A: 5 7 24 37 40 44 47 48 53 62 93 104 105 107 121 122 124 162 165 176 188 227 228 229 230 235 236 237 238 255 303 309 334 335 337 338 345 367 369 375 413 417 418 420 428 429 430 432 433 436 498 |
| C10 | 123 | A: 38 42 60 91 97 98 99 100 119 138 140 160 163 171 186 223 224 231 253 256 307 311 328 330 331 333 339 342 349 363 366 410 411 412 414 415 416 426 |
| C11 | 6 | P: 32 72 109 353 474 475 |
| C12 | 2 | H: 128 483 |
| C13 | 1 | C: 137 |
| C14 | 15 | H: 170 241 244 245 246 393 395 396 400 402 423 444 |
| C15 | 1 | H: 73 |
| C16 | 43 | A: 18 |
| C17 | 3 | H: 450 451 452 |
| C18 | 28 | A: 16 90 |
| C19 | 2 | H: 184 |
| C20 | 2 | P: 33 319 |
A: Alpha and turn propensities. B: Beta propensity. C: Composition. H: Hydrophobicity. P: Physicochemical properties. O: Other properties.
Distribution of the 531 physicochemical properties over the 20 clusters and six groups
| Cluster | A | B | C | H | P | O | TOTAL |
|---|---|---|---|---|---|---|---|
| C1 | | | | | 1 | 1 | 2 |
| C2 | | | | | 2 | | 2 |
| C3 | | | | 6 | | | 6 |
| C4 | | | | | 3 | | 3 |
| C5 | | | 1 | 2 | 1 | | 4 |
| C6 | 1 | | | 3 | 1 | 1 | 6 |
| C7 | 47 | 7 | 2 | 74 | 14 | 3 | 147 |
| C8 | | | | | 3 | | 3 |
| C9 | 51 | 1 | 3 | 50 | 6 | 21 | 132 |
| C10 | 38 | 30 | 2 | 42 | 9 | 2 | 123 |
| C11 | | | | | 6 | | 6 |
| C12 | | | | 2 | | | 2 |
| C13 | | | 1 | | | | 1 |
| C14 | | | | 12 | 2 | 1 | 15 |
| C15 | | | | 1 | | | 1 |
| C16 | 1 | | 38 | 4 | | | 43 |
| C17 | | | | 3 | | | 3 |
| C18 | 3 | | | 17 | 8 | | 28 |
| C19 | | | | 1 | | 1 | 2 |
| C20 | | | | | 2 | | 2 |
| TOTAL | 141 | 38 | 47 | 217 | 58 | 30 | 531 |
| RATE | 0.266 | 0.072 | 0.089 | 0.409 | 0.109 | 0.056 | |
A: Alpha and turn propensities. B: Beta propensity. C: Composition. H: Hydrophobicity. P: Physicochemical properties. O: Other properties.
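The RATE row is each group's share of the 531 properties. As a quick check, the snippet below recomputes it from the TOTAL row of the table; only the variable names are introduced here.

```python
# Group totals copied from the TOTAL row of the table above.
GROUP_TOTALS = {"A": 141, "B": 38, "C": 47, "H": 217, "P": 58, "O": 30}

total = sum(GROUP_TOTALS.values())

# Fraction of all properties that falls into each group.
rates = {group: count / total for group, count in GROUP_TOTALS.items()}

for group, rate in rates.items():
    print(f"{group}: {rate:.3f}")
```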
Figure 4. The statistical results of property distribution in the six groups: (a) 531 amino acid indices, (b) 402 amino acid indices.
Figure 5. The statistical result of St in selecting property sets from R = 30 independent runs on DNAset and DNAaset.
Figure 6. Prediction accuracies for various numbers of selected properties: (a) DNAset and (b) DNAaset.
The robust solution S18 with a set of m=22 features for DNAset
| Feature ID | AAindex ID | Description |
|---|---|---|
| | CHOP780216 | Normalized frequency of the 2nd and 3rd residues in turn (Chou-Fasman, 1978b) |
| | CIDH920103 | Normalized hydrophobicity scales for alpha+beta-proteins (Cid et al., 1992) |
| | DAYM780101 | Amino acid composition (Dayhoff et al., 1978a) |
| | FAUJ880109 | Number of hydrogen bond donors (Fauchere et al., 1988) |
| | FINA770101 | Helix-coil equilibrium constant (Finkelstein-Ptitsyn, 1977) |
| | NAGK730103 | Normalized frequency of coil (Nagano, 1973) |
| | NAKH920101 | AA composition of CYT of single-spanning proteins (Nakashima-Nishikawa, 1992) |
| | PALJ810105 | Normalized frequency of turn from LG (Palau et al., 1981) |
| | PALJ810106 | Normalized frequency of turn from CF (Palau et al., 1981) |
| | PRAM900104 | Relative frequency in reverse-turn (Prabhakaran, 1990) |
| | QIAN880105 | Weights for alpha-helix at the window position of -2 (Qian-Sejnowski, 1988) |
| | QIAN880117 | Weights for beta-sheet at the window position of -3 (Qian-Sejnowski, 1988) |
| | QIAN880129 | Weights for coil at the window position of -4 (Qian-Sejnowski, 1988) |
| | SUEM840101 | Zimm-Bragg parameter s at 20 C (Sueki et al., 1984) |
| | WEBA780101 | RF value in high salt chromatography (Weber-Lacey, 1978) |
| | WOEC730101 | Polar requirement (Woese, 1973) |
| | AURR980110 | Normalized positional residue frequency at helix termini N5 (Aurora-Rose, 1998) |
| | MUNV940102 | Free energy in alpha-helical region (Munoz-Serrano, 1994) |
| | WIMW960101 | Free energies of transfer of AcWl-X-LL peptides from bilayer interface to water (Wimley-White, 1996) |
| | KUMS000104 | Distribution of amino acid residues in the alpha-helices in mesophilic proteins (Kumar et al., 2000) |
| | BASU050102 | Interactivity scale obtained by maximizing the mean of correlation coefficient over single-domain globular proteins (Bastolla et al., 2005) |
| | JACR890101 | Weights from the IFH scale (Jacobs-White, 1989) |
The robust solution S6 with a set of m=28 features for DNAaset
| Feature ID | AAindex ID | Description |
|---|---|---|
| | CHOP780202 | Normalized frequency of beta-sheet (Chou-Fasman, 1978b) |
| | CIDH920103 | Normalized hydrophobicity scales for alpha+beta-proteins (Cid et al., 1992) |
| | CIDH920105 | Normalized average hydrophobicity scales (Cid et al., 1992) |
| | FAUJ880109 | Number of hydrogen bond donors (Fauchere et al., 1988) |
| | FAUJ880111 | Positive charge (Fauchere et al., 1988) |
| | FINA910104 | Helix termination parameter at position j+1 (Finkelstein et al., 1991) |
| | GEIM800104 | Alpha-helix indices for alpha/beta-proteins (Geisow-Roberts, 1980) |
| | GEIM800106 | Beta-strand indices for beta-proteins (Geisow-Roberts, 1980) |
| | KANM800102 | Average relative probability of beta-sheet (Kanehisa-Tsong, 1980) |
| | KLEP840101 | Net charge (Klein et al., 1984) |
| | KRIW710101 | Side chain interaction parameter (Krigbaum-Rubin, 1971) |
| | LIFS790101 | Conformational preference for all beta-strands (Lifson-Sander, 1979) |
| | MEEJ800101 | Retention coefficient in HPLC, pH7.4 (Meek, 1980) |
| | OOBM770102 | Short and medium range non-bonded energy per atom (Oobatake-Ooi, 1977) |
| | PALJ810107 | Normalized frequency of alpha-helix in all-alpha class (Palau et al., 1981) |
| | QIAN880123 | Weights for beta-sheet at the window position of 3 (Qian-Sejnowski, 1988) |
| | RACS770103 | Side chain orientational preference (Rackovsky-Scheraga, 1977) |
| | RADA880108 | Mean polarity (Radzicka-Wolfenden, 1988) |
| | ROSM880102 | Side chain hydropathy, corrected for solvation (Roseman, 1988) |
| | SWER830101 | Optimal matching hydrophobicity (Sweet-Eisenberg, 1983) |
| | ZIMJ680102 | Bulkiness (Zimmerman et al., 1968) |
| | ZIMJ680104 | Isoelectric point (Zimmerman et al., 1968) |
| | AURR980120 | Normalized positional residue frequency at helix termini C4' (Aurora-Rose, 1998) |
| | MUNV940103 | Free energy in beta-strand conformation (Munoz-Serrano, 1994) |
| | NADH010104 | Hydropathy scale based on self-information values in the two-state model (20% accessibility) (Naderi-Manesh et al., 2001) |
| | NADH010106 | Hydropathy scale based on self-information values in the two-state model (36% accessibility) (Naderi-Manesh et al., 2001) |
| | GUYH850105 | Apparent partition energies calculated from Chothia index (Guy, 1985) |
| | MIYS990104 | Optimized relative partition energies - method C (Miyazawa-Jernigan, 1999) |
Figure 7. The m properties ranked by using main effect difference (MED): (a) m=22 for DNAset and (b) m=28 for DNAaset.
The overall accuracies (%) of 5-fold cross-validation (5-CV) for three feature types and two hybrid feature types with SVM
| Dataset | Solution | Sen. | Spe. | MCC | PCPs | AAC | PSSM* | PCPs + AAC | PCPs + PSSM |
|---|---|---|---|---|---|---|---|---|---|
| DNAset | accurate | 88.89 | 91.20 | 0.76 | 88.89 | 80.30 | 86.62 | 81.57 | 83.59 |
| | robust | 82.19 | 90.00 | 0.53 | 87.12 | | | 81.82 | 86.62 |
| DNAaset | accurate | 82.74 | 70.08 | 0.72 | 76.41 | 72.46 | 74.22 | 74.20 | 79.88 |
| | robust | 81.96 | 69.04 | 0.51 | 75.50 | | | 73.59 | 80.27 |
Solution: accurate or robust solution, Sen.: sensitivity, Spe.: specificity, MCC: Matthews correlation coefficient, PCPs: the m informative properties, PSSM*: obtained from [7].
A small, high-performance feature set of size c selected from c clusters, with c = 5 for DNAset and c = 8 for DNAaset.
| DNAset | ACC 83.59% | Cluster | C | C | C | C | C | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | Feature ID | H88 | H86 | H67 | C209 | H178 | | | |
| DNAaset | ACC 73.24% | Cluster | C | C | C | C | C | C | C | C |
| | | Feature ID | P159 | H87 | A99 | C197 | P63 | H11 | H396 | H451 |
The m=5 features for DNAset, where IGA selects the single best feature from each of the five identified clusters.
| Feature ID | AAindex ID | Description |
|---|---|---|
| H88 | FAUJ880111 | Positive charge (Fauchere et al., 1988) |
| H86 | FAUJ880109 | Number of hydrogen bond donors (Fauchere et al., 1988) |
| H67 | DESM900102 | Average membrane preference: AMP07 (Degli Esposti et al., 1990) |
| C209 | NAKH920108 | AA composition of MEM of multi-spanning proteins (Nakashima-Nishikawa, 1992) |
| H178 | MEEJ800101 | Retention coefficient in HPLC, pH7.4 (Meek, 1980) |
The m=8 features for DNAaset, where IGA selects the single best feature from each of the eight identified clusters.
| Feature ID | AAindex ID | Description |
|---|---|---|
| P159 | LEVM760107 | van der Waals parameter epsilon (Levitt, 1976) |
| H87 | FAUJ880110 | Number of full nonbonding orbitals (Fauchere et al., 1988) |
| A99 | GEIM800103 | Alpha-helix indices for beta-proteins (Geisow-Roberts, 1980) |
| C197 | NAKH900109 | AA composition of membrane proteins (Nakashima et al., 1990) |
| P63 | DAWD720101 | Size (Dawson, 1972) |
| H11 | BIOV880102 | Information value for accessibility; average fraction 23% (Biou et al., 1988) |
| H396 | YUTK870104 | Activation Gibbs energy of unfolding, pH9.0 (Yutani et al., 1987) |
| H451 | NADH010106 | Hydropathy scale based on self-information values in the two-state model (36% accessibility) (Naderi-Manesh et al., 2001) |
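The two tables above each pick one feature per cluster. The sketch below illustrates that constraint with a plain exhaustive search over all one-per-cluster combinations. It is not IGA, which the paper uses for this step. The scoring function stands in for a cross-validated classifier accuracy, and all names and toy values are hypothetical.

```python
from itertools import product


def best_one_per_cluster(clusters, score):
    """Try every combination that takes exactly one feature from each cluster.

    `clusters` is a list of lists of feature IDs; `score` maps a tuple of
    feature IDs to a number (higher is better). Exhaustive search is only
    feasible for small clusters; it is used here for clarity.
    """
    best_combo, best_score = None, float("-inf")
    for combo in product(*clusters):
        s = score(combo)
        if s > best_score:
            best_combo, best_score = combo, s
    return best_combo, best_score


# Toy clusters and per-feature scores (placeholders, not real memberships).
TOY_CLUSTERS = [["a1", "a2"], ["b1"], ["c1", "c2"]]
TOY_WEIGHTS = {"a1": 3, "a2": 1, "b1": 2, "c1": 5, "c2": 4}

TOY_BEST = best_one_per_cluster(
    TOY_CLUSTERS, lambda combo: sum(TOY_WEIGHTS[f] for f in combo)
)
```

The number of combinations is the product of the cluster sizes, so with clusters of more than a hundred properties each a heuristic search becomes necessary.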
Figure 8. The appearance frequency of each identified cluster in the 30 runs. Clusters 7, 9, 10, 16 and 18 are more informative.
Figure 9. An illustrative example of exploring promising properties. H151 can be inferred from feature sets S1 and S2.
Some typical properties in the five identified clusters for analyzing DNA-binding domains
| Cid | AAindex ID | PCP | Cid | AAindex ID | PCP |
|---|---|---|---|---|---|
| 7 | BHAR880101 | Flexibility | 10 | FASG760105 | pK-C |
| 7 | BURA740101 | Secondary structure | 10 | JOND750102 | pk- (-COOH) |
| 7 | CHOC760103 | Solvent accessibility | 10 | RADA880108 | Polarity |
| 7 | HOPT810101 | Hydrophobicity | 16 | PRAM900101 | Hydrophobicity |
| 7 | FAUJ880111 | Charge | 16 | FUKS010104 | Solvent accessibility |
| 9 | KARP850101 | Flexibility | 16 | KUMS000103 | Secondary structure |
| 9 | PALJ810115 | Secondary structure | 18 | PONP800107 | Solvent accessibility |
| 9 | ROSM880101 | Hydrophobicity | 18 | GRAR740102 | Polarity |
| 9 | KUHL950101 | Solvent accessibility | 18 | FASG760104 | pK-N |
| 10 | ZIMJ680101 | Hydrophobicity | 18 | FAUJ880113 | pK-a(RCOOH) |
| 10 | EISD860101 | Solvent accessibility | 18 | FAUJ880103 | Normalized van der Waals volume |
| 10 | GEIM800101 | Secondary structure | | | |
Cid: Cluster ID. PCP: physicochemical and biochemical property