Ke Chen, Lukasz A Kurgan, Jishou Ruan.
Abstract
BACKGROUND: Traditionally, it is believed that the native structure of a protein corresponds to a global minimum of its free energy. However, with the growing number of known tertiary (3D) protein structures, researchers have discovered that some proteins can alter their structures in response to a change in their surroundings or with the help of other proteins or ligands. Such structural shifts play a crucial role in protein function. To this end, we propose a machine learning method for the prediction of the flexible/rigid regions of proteins (referred to as FlexRP); the method is based on a novel sequence representation and feature selection. Knowledge of the flexible/rigid regions may provide insights into the protein folding process and 3D structure prediction.
Year: 2007 PMID: 17437643 PMCID: PMC1863424 DOI: 10.1186/1472-6807-7-25
Source DB: PubMed Journal: BMC Struct Biol ISSN: 1472-6807
Figure 1. Examples of the three types of flexible regions. 1) Pair (a1) and (a2) is an example of rotating regions. (a1) is chain A of protein 1l5e from Leu1 to Tyr100 and (a2) is protein 2ezm from Leu1 to Tyr100. Both fragments share the same sequence and are built from two domains (colored gray and black) that also share the same structure, but the structures of the linkers (colored light gray) differ. 2) Pair (b1) and (b2) is an example of regions with missing secondary structure. (b1) is chain A of protein 1ikx from Glu224 to Leu279 and (b2) is chain B of protein 1ikx from Glu1224 to Leu1279. Both fragments share the same sequence. The segment from Phe227 to Leu234 forms a strand in (b1), while it forms a coil in (b2). 3) Pair (c1) and (c2) is an example of disarranging regions. (c1) is chain A of protein 1ffx from Ile171 to Cys200 and (c2) is chain A of protein 1jff from Ile171 to Cys200. The fragments share the same sequence and have similar overall 3D structure and secondary structure. However, the URMSD between the two structures is larger than 0.8 because the middle region between Ala180 and His192 is disarranged: the spatial packing of the corresponding AAs differs in this region.
Prediction accuracy for different protein sequence representations based on 10-fold cross validation tests.
| Feature representation | Feature selection2 (rows) / Classifier1 (columns) | FlexRP (Logistic Regression) | SVM | C4.5 | IB1 | Naïve Bayes |
| Composition vector | N/A | 67.37% | 68.74% | 57.70% | 57.33% | 65.20% |
| PSI-BLAST profile | N/A | 66.38% | 67.35% | 62.47% | 61.62% | 66.24% |
| Binary encoding | No selection | 66.38% | 66.06% | 58.82% | 59.92% | 61.84% |
| Binary encoding | Linear coefficient | 69.58% | 68.74% | 62.82% | 57.05% | 69.10% |
| Binary encoding | Entropy based | 69.19% | 68.74% | 63.24% | 58.21% | 69.00% |
| K-spaced AA pairs | Linear coefficient | 74.37% | 74.60% | 66.04% | 68.74% | 72.97% |
| K-spaced AA pairs | Entropy based | 78.46% | 66.25% | 66.93% | 76.01% |
1 The tested classifiers include the proposed FlexRP method, Support Vector Machine (SVM), decision tree (C4.5), instance-based learner (IB1), and Naïve Bayes.
2 The sequence representations based on binary codes and frequencies of the k-spaced amino acid pairs were processed using two feature selection methods.
3 The best result is shown in bold.
Features selected by the entropy based method.
| DF | AK | DI | AD | AI | DC | DI | ED | DP | AC |
| EF | FH | ED | AI | AV | HD | FF | GL | EN | EL |
| EL | KI | EK | AV | AY | IE | FG | PG | GG | KF |
| KE | KY | FK | GG | DG | NQ | HP | PS | KC | KG |
| LI | LL | GG | KQ | DS | PG | IL | TI | RI | |
| LL | LQ | GR | LI | EK | QP | VI | TV | ||
| PA | PM | GS | LS | ER | RV | TL | VN | VR | |
| QT | VH | KL | PW | HQ | VL | VR | |||
| VI | VL | KS | SG | LL | VV | YC | |||
| VP | LL | YH | LV | YL | |||||
| PS | MV | ||||||||
| PD | |||||||||
| VI | SQ | ||||||||
| VK | TK | ||||||||
| VL | |||||||||
1 k-spaced AA pairs represent the frequencies of AA pairs that are separated by k other residues in the sequence; for k = 0 the pairs are equivalent to dipeptides.
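The k-spaced AA pair representation described in the footnote can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `k_spaced_pair_frequencies`, the normalization by the total number of counted pairs, and the skipping of non-standard residue letters are all assumptions made for the example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def k_spaced_pair_frequencies(sequence, k):
    """Frequencies of the 400 ordered AA pairs separated by exactly k residues.

    Hypothetical helper: normalizing by the number of counted pairs is an
    assumption; the source only states that pair frequencies are used.
    """
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    # Position i is paired with position i + k + 1, so k residues lie between.
    for i in range(len(sequence) - k - 1):
        pair = sequence[i] + sequence[i + k + 1]
        if pair in counts:  # skip pairs containing non-standard letters (assumption)
            counts[pair] += 1
            total += 1
    if total == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


# For k = 0 the pairs are ordinary dipeptides; each k contributes 400 features.
dipeptides = k_spaced_pair_frequencies("ACAC", 0)
one_spaced = k_spaced_pair_frequencies("ACAC", 1)
```

Concatenating the vectors for several values of k gives one 400-feature block per k, which is why a feature selection step is needed to reduce the representation.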
Prediction accuracy after optimization.
| Method | Accuracy1 | Sensitivity (positive class) | Specificity (positive class) | Sensitivity (negative class) | Specificity (negative class) | MCC | TP | FP | FN | TN |
| FlexRP | 79.51% | 88.52% | 82.85% | 59.71% | 70.24% | 0.51 | 3478 | 720 | 451 | 1067 |
| SVM | 79.22% | 88.93% | 82.27% | 57.86% | 70.39% | 0.50 | 3494 | 753 | 435 | 1034 |
| Naïve Bayes | 78.41% | 80.15% | 87.40% | 74.59% | 63.09% | 0.53 | 3149 | 454 | 780 | 1333 |
1 The results were based on the best performing representation that includes 95 features selected using the entropy based selection method.
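The columns of the table above can be recomputed from the TP, FP, FN, and TN counts. The sketch below is inferred from the reported numbers rather than taken from the paper's text: the function name `confusion_metrics` and the "positive"/"negative" labels are assumptions, and the table's "specificity" columns match TP/(TP+FP) and TN/(TN+FN), quantities often called precision.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Recompute the table's columns from confusion-matrix counts.

    Hypothetical helper; formulas were chosen because they reproduce the
    reported percentages, not because the source states them.
    """
    total = tp + fp + fn + tn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity_pos": tp / (tp + fn),  # first sensitivity column
        "specificity_pos": tp / (tp + fp),  # first specificity column
        "sensitivity_neg": tn / (tn + fp),  # second sensitivity column
        "specificity_neg": tn / (tn + fn),  # second specificity column
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Counts from the FlexRP row of the table.
flexrp = confusion_metrics(tp=3478, fp=720, fn=451, tn=1067)
for name, value in flexrp.items():
    print(f"{name}: {value:.4f}")
```

With the FlexRP counts this reproduces the reported 79.51% accuracy, 88.52% and 59.71% sensitivities, and an MCC that rounds to 0.51.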
Figure 2. The prediction accuracy as a function of p. The number of features used to represent the sequence increases with increasing p.
Comparison of performances between FlexRP, IUPred, and Boden's methods.
| Method | Accuracy | Sensitivity (positive class) | Specificity (positive class) | Sensitivity (negative class) | Specificity (negative class) | MCC | TP | FP | FN | TN |
| FlexRP | 79.51% | 88.52% | 82.85% | 59.71% | 70.24% | 0.51 | 3478 | 720 | 451 | 1067 |
| IUPred | 65.64% | 88.88% | 69.58% | 14.55% | 37.30% | 0.05 | 3492 | 1527 | 437 | 260 |
| Boden's method | 56.21% | 56.71% | 73.53% | 55.12% | 36.67% | 0.11 | 2228 | 802 | 1701 | 985 |
Figure 3. The predictions obtained with Boden's method [29], the IUPred method [30], and the FlexRP method on the 11E to 216A segment in chain A of the 1EUL protein. In Boden's method, residues with entropy greater than 0.49 are considered to belong to regions undergoing conformational change; the IUPred method assigns every residue whose probabilistic score is greater than 0.5 to the disordered regions. FlexRP classifies a residue as belonging to a flexible region if its probabilistic score is greater than 0.5. The actual flexible regions are shown on a white background.
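The thresholding rule in the caption can be illustrated with a short sketch. The function name `call_regions`, the merging of consecutive above-threshold residues into index ranges, and the example scores are all hypothetical; only the cutoffs (0.49 for Boden's method, 0.5 for IUPred and FlexRP) come from the caption.

```python
def call_regions(scores, threshold=0.5):
    """Return (start, end) index ranges of residues whose score exceeds threshold.

    Hypothetical helper: merging consecutive residues into ranges is an
    illustration choice, not a step described in the source.
    """
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:  # strictly greater, as stated in the caption
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:  # close a region that runs to the end
        regions.append((start, len(scores) - 1))
    return regions


# Made-up per-residue scores, for illustration only.
example_scores = [0.1, 0.6, 0.7, 0.5, 0.9]
flexible = call_regions(example_scores, threshold=0.5)
```

A score exactly equal to the cutoff is not counted, because the caption specifies "greater than".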
List of 66 segments with multiple experimental structures.
| Protein ID1 | Start AA | End AA | Protein ID1 | Start AA | End AA | Protein ID1 | Start AA | End AA |
| 1eulA | 11E | 216A | 1c0mA | 199K | 268D | 1ic8A | 208P | 276A |
| 121p | 25Q | 166H | 1cdb | 24F | 105R | 1ihgA | 245K | 298E |
| 1a0h | 482V | 575D | 1cejA | 30C | 95S | 1iku | 104W | 189E |
| 1a7lA | 4E | 198L | 1cfpA | 2E | 80I | 1ilf | 7D | 140Q |
| 1a7xA | 31E | 106L | 1cpq | 7L | 128E | 1irf | 27L | 112L |
| 1a90 | 25L | 116Q | 1cto | 46R | 108M | 1jmvA | 68Q | 139R |
| 1ael | 41T | 111N | 1dem | 4R | 59R | 1k0tA | 11I | 80Y |
| 1akk | 34G | 103N | 1dhx | 339A | 430G | 1k9aA | 117T | 316A |
| 1al01 | 42V | 124T | 1dmzA | 613I | 706G | 1kmuR | 299E | 382Y |
| 1aonA | 218P | 371K | 1do0 | 42E | 165E | 1kvnA | 23I | 89R |
| 1ap9 | 104D | 155G | 1ei7A | 60V | 148S | 1l6kA | 8E | 61V |
| 1avfJ | 76G | 155L | 1ej6B | 723A | 928V | 1mfn | 3D | 184T |
| 1az0A | 191S | 244R | 1f2hA | 59L | 164C | 1mkmA | 10I | 215S |
| 1b4m | 42I | 134K | 1ffxA | 148G | 263P | 1o0vA | 265Q | 470M |
| 1b75A | 41E | 94A | 1fm6A | 266T | 430L | 1pbwA | 238L | 297E |
| 1b7eA | 133E | 239I | 1g3gA | 44P | 152K | 1qpmA | 28M | 81T |
| 1b8tA | 12V | 191S | 1gm0 | 15A | 122I | 1sw6A | 346Y | 429S |
| 1ba9 | 72G | 123A | 1go4G | 498F | 578M | 1uaaA | 88G | 537R |
| 1blr | 41V | 96E | 1hqmD | 1039L | 1116T | 1wtuA | 14T | 99K |
| 1boc | 6L | 75Q | 1hryA | 20R | 75R | 2btfA | 4D | 71I |
| 1bqmA | 276V | 400L | 1hstA | 26P | 79G | 2ezm | 1L | 100Y |
| 1bsh | 19L | 138M | 1i84S | 883E | 942E | 5gcn | 36M | 165G |
1 For each segment, one PDB ID together with the start and the end of the segment are listed.
Sizes of feature sets for the considered sequence representations.
| | | | | k-spaced AA pairs | | | | |
| Feature representation | Composition vector | PSI-BLAST profile | Binary encoding | Adjacent pairs (dipeptides) | 1-spaced pairs | ...... | | Total |
| Number of features | 20 | 315 | 380 | 400 | 400 | ...... | 400 | 400( |