Mireille Gomes, Rebecca Hamer, Gesine Reinert, Charlotte M Deane.
Abstract
BACKGROUND: Predicting protein contacts solely based on sequence information remains a challenging problem, despite the huge amount of sequence data at our disposal. Mutual Information (MI), an information theory measure, has been extensively employed and modified to identify residues within a protein (intra-protein) that are in contact. More recently, MI and its variants have also been used in the prediction of contacts between proteins (inter-protein).
Year: 2012 PMID: 23244412 PMCID: PMC3532072 DOI: 10.1186/1756-0500-5-472
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Figure 1. Standardised entropy medians of surface and buried residue columns for all domains in the dataset. The medians of the standardised entropy scores of each domain’s surface residue columns (blue) are compared against the medians of each domain’s buried residue columns (yellow). Residue columns containing one or more gaps, or having an entropy score of 0, are not included in the median calculation.
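The column filtering described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the standardisation step is not defined in this section, so the sketch uses raw Shannon entropy in bits, and the gap symbol `-`, the toy alignment and the function names are all assumptions.

```python
import math
from collections import Counter
from statistics import median

GAP = "-"  # assumed gap symbol

def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def median_entropy(msa, column_indices):
    """Median entropy over the given columns.

    Columns with one or more gaps, or with an entropy of 0, are skipped,
    as stated in the caption. Returns None if no column survives.
    """
    values = []
    for j in column_indices:
        column = [seq[j] for seq in msa]
        if GAP in column:
            continue
        h = column_entropy(column)
        if h > 0:
            values.append(h)
    return median(values) if values else None

# Toy alignment (hypothetical): column 0 is constant, column 2 has a gap.
msa = ["ACDA", "ACEA", "AC-G", "ADEG"]
surface_median = median_entropy(msa, [0, 1, 2, 3])
```

Calling `median_entropy` once with the surface column indices and once with the buried column indices of a domain would give the two medians compared in the figure.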
Figure 2. Contact versus non-contact prediction P-ROC curves for MI variants on the 40 test cases. A, B and C illustrate the performance of the MI, MIp and MIc variants, respectively, when distinguishing contact from non-contact surface residues. The solid green line in all plots depicts the chance of randomly selecting a contact residue, while the dashed green line indicates the probability of randomly selecting a contact residue when employing the reduced alphabet amino acid set.
Precision for detecting contact versus non-contact residues at 20% recall
| MI variant | Precision (%) |
| MIc | 44.9 |
| MIp | 42.3 |
| MIcRA | 36.9 |
| MIpRA | 35.8 |
| MIp3D | 31.8 |
| MIp3DRA | 29.4 |
| MIRA | 28.4 |
| Random | 24.5 |
| MI | 24.4 |
| RandomRA | 24.4 |
| MI3DRA | 23.5 |
| MI3D | 19.9 |
Results are given for the 40 test cases. MI variants are listed in descending order of contact versus non-contact precision, i.e. best to worst classifier of contact residues. The probability of randomly selecting a contact residue from all surface residues is 24.5%. This probability changes to 24.4% when using the reduced alphabet amino acid set because residues are lost as the entropy of their corresponding MSA column reduces to 0.
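To make the metric concrete, the sketch below shows one way precision at a fixed recall level could be read off a ranked list of residue scores. It assumes residues are ranked by descending score and that ties are broken by sort order; the source does not say how ties were handled, and the function name and toy data are hypothetical.

```python
def precision_at_recall(scores, is_contact, recall_level=0.2):
    """Precision at the first rank cut-off whose recall reaches recall_level.

    scores      -- one score per surface residue (higher = more likely contact)
    is_contact  -- 1 for true contact residues, 0 otherwise
    """
    if not 0 < recall_level <= 1:
        raise ValueError("recall_level must be in (0, 1]")
    total_contacts = sum(is_contact)
    if total_contacts == 0:
        raise ValueError("no contact residues to recall")
    ranked = sorted(zip(scores, is_contact), key=lambda p: p[0], reverse=True)
    found = 0
    for rank, (_, contact) in enumerate(ranked, start=1):
        found += contact
        if found / total_contacts >= recall_level:
            return found / rank

# Toy example (hypothetical scores): 5 of 10 residues are contacts.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
p20 = precision_at_recall(scores, labels)
```

Under random ranking the expected precision equals the fraction of contact residues among all scored residues, which is the baseline reported in the "Random" rows.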
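Precision for detecting contact versus non-contact residues at 20% recall, for sub-alignments containing 70% of the sequences
| MI variant | Mean precision over 100 sub-alignments (%) | Standard deviation | Precision with all sequences (%) |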
| MIc | 52.5* | 2.1 | 54.8 |
| MIp | 46.0* | 2.1 | 47.4 |
| MIcRA | 41.9* | 1.8 | 41.4 |
| MIpRA | 38.2* | 1.5 | 38.5 |
| MIp3D | 30.6 | 1.3 | 28.5 |
| MIRA | 30.2* | 1.2 | 31.4 |
| MIp3DRA | 28.0* | 1.2 | 30.9 |
| MI | 25.8* | 0.8 | 27.6 |
| MI3DRA | 23.2* | 1.0 | 25.5 |
| Random | - | - | 24.4 |
| RandomRA | - | - | 24.4 |
| MI3D | 20.0 | 0.6 | 21.8 |
For each test case, 70% of the sequences in its MSA were randomly selected, and the 10 MI variant scores were calculated on the resulting sub-alignment. This selection and calculation procedure was repeated 100 times for those test cases that had ≥200 sequences, to ensure ≥125 sequences in each sub-alignment [8]. Thus 24 test cases were used. For each of the 100 iterations, a P-ROC curve similar to Figure 2 was plotted for the 24 test cases (figures not shown), and the precision at 20% recall was recorded. The first two numeric columns contain, respectively, the averages and standard deviations of these 100 precision values. The third numeric column gives the precision attained at 20% recall when all sequences in the 24 original MSAs were used. For this set of 24 cases, the probability of randomly selecting a contact residue from all surface residues is 24.4%, both for the full alphabet and for the reduced alphabet MSAs. The MI variants are listed in descending order of the average precision over the 100 sub-alignments. An ‘*’ next to an MI variant indicates that the difference between the precision of this MI variant and the next lowest is significant at the 0.1% level, when using two-sample t-tests with a sample size of 24.
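The resampling procedure described above can be sketched as below. This is an illustration under stated assumptions: `score_fn` is a hypothetical callable standing in for "compute an MI variant on the sub-alignment and return its precision at 20% recall", and the seed and toy alignment are arbitrary.

```python
import random
from statistics import mean, stdev

def subsample_precision(msa, score_fn, fraction=0.7, repeats=100, seed=0):
    """Mean and standard deviation of score_fn over random sub-alignments.

    Each repeat draws `fraction` of the sequences without replacement.
    """
    rng = random.Random(seed)
    k = round(fraction * len(msa))
    results = [score_fn(rng.sample(msa, k)) for _ in range(repeats)]
    return mean(results), stdev(results)

# Toy alignment of 200 identical-length sequences (hypothetical data).
toy_msa = ["ACDE"] * 120 + ["ACDF"] * 80

# Placeholder score: the sub-alignment size, to show that each draw has 140 sequences.
avg_size, sd_size = subsample_precision(toy_msa, score_fn=len)
```

In a real analysis, `score_fn` would be replaced by the full scoring pipeline, and the returned pair would correspond to the mean and standard deviation columns of the table.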
Precision at 20% recall of contact prediction algorithms used within the Brown and Brown [11] pipeline
| Algorithm | Precision (%) |
| MIp - original, minus buried | 42.3 |
| SCA | 31 |
| ZNMI | 30.5 |
| ZRES | 28.9 |
| MIp | 27.1 |
| MI | 25.7 |
| OMES | 25.7 |
| MI - original, minus buried | 24.4 |
Results are given for the 40 test cases. The Brown and Brown [11] pipeline was applied to the contact residue prediction algorithms listed in column one, with the exceptions of MIp and MI “original, minus buried.” As in Table 1, these two algorithms were run independently of the pipeline, and buried residue columns and residue columns with one or more gaps or an entropy of 0 were filtered out. The table is arranged in descending order of precision.
Performance of MI and MIc on a histidine kinase (HK) - response regulator (RR) complex
| Method | DHp: true contacts in top 9 | DHp: p-value | RR: true contacts in top 24 | RR: p-value |
| MI | 3 | 0.2503 | 9 | 0.4864929 |
| MIc | 4 | 0.0800 | 9 | 0.4864929 |
MI and MIc are run on the HK-RR MSA provided by Hamer et al. [12]. Each surface, ungapped DHp residue column with a column entropy greater than 0 is assigned the maximum score it achieves when paired with the RR residue columns. These DHp residues are then ranked according to score, and the number of true contact residues amongst the top nine scores is recorded for MI and MIc respectively. The same steps are applied to residues in the RR, and the number of true contact residues in the top 24 scores is counted for MI and MIc. The p-value refers to the probability of seeing a number this large or larger under the corresponding Binomial model.
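The ranking and p-value steps described above can be sketched as follows. This is a minimal sketch: the success probability of the Binomial model is not given in this section, so `p` is left as a parameter, and the score matrix, contact labels and function names are hypothetical.

```python
from math import comb

def max_score_per_column(pair_scores):
    """pair_scores[i][j] is the score of column i in one protein paired with
    column j in the partner; each column i keeps its maximum score."""
    return [max(row) for row in pair_scores]

def contacts_in_top_k(column_scores, is_contact, k):
    """Number of true contact residues among the k highest-scoring columns."""
    order = sorted(range(len(column_scores)),
                   key=lambda i: column_scores[i], reverse=True)
    return sum(is_contact[i] for i in order[:k])

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, x) * p**x * (1 - p)**(trials - x)
               for x in range(successes, trials + 1))

# Toy data (hypothetical): 3 columns scored against 2 partner columns.
pair_scores = [[0.1, 0.5], [0.9, 0.2], [0.3, 0.4]]
is_contact = [1, 0, 1]
best = max_score_per_column(pair_scores)
hits = contacts_in_top_k(best, is_contact, k=2)
```

With the real data, `k` would be 9 for the DHp columns and 24 for the RR columns, and `binomial_tail(hits, k, p)` would give the tail probability once `p` is fixed.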
The dataset
| Protein | D1 (start end) | D2 (start end) | Sequences | Species distribution |
| 1A45 | 1 82 | 83 173 | 160 | E(146)N(14) |
| 1BIB | 67 270 | 271 317 | 236 | A(12)B(201)N(23) |
| 1BKS | 1 188 | 189 268 | 478 | A(21)B(401)E(10)N(46) |
| 1FNB | 19 152 | 153 314 | 58 | B(22)E(34)N(2) |
| 1G8A | 1 51 | 52 227 | 75 | A(47)E(20)N(8) |
| 1G8P | 18 216 | 261 350 | 230 | A(10)B(143)E(49)N(28) |
| 1I39 | 1 158 | 159 200 | 688 | A(32)B(538)E(7)V(1)U(1)N(109) |
| 1J5X | 2 169 | 170 319 | 252 | A(9)B(183)E(5)N(55) |
| 1LAP | 1 147 | 148 484 | 454 | A(2)B(331)E(84)N(37) |
| 1LLD | 7 148 | 149 319 | 709 | A(33)B(389)E(221)N(66) |
| 1MRI | 1 162 | 163 246 | 68 | B(2)E(65)N(1) |
| 1PII | 1 255 | 256 452 | 75 | B(65)N(10) |
| 1RHD | 1 156 | 157 293 | 505 | A(26)B(365)E(57)U(1)N(56) |
| 1THM | 1 127 | 128 208 | 106 | A(1)B(62)E(34)N(9) |
| 1W98 | 88 227 | 228 357 | 70 | E(64)N(6) |
| 1WRU | 3 176 | 177 346 | 64 | B(58)V(2)N(4) |
| 1X2G | 1 246 | 247 337 | 224 | A(2)B(155)E(42)N(25) |
| 2AAA | 1 376 | 377 484 | 245 | B(141)E(74)N(30) |
| 2AHE | 16 108 | 109 253 | 144 | B(25)E(100)N(19) |
| 2D3V | 3 95 | 96 195 | 77 | E(71)N(6) |
| 2D8N | 16 97 | 102 189 | 240 | E(195)N(45) |
| 2E64 | 1 188 | 189 235 | 294 | A(9)B(231)E(4)U(1)N(49) |
| 2I00 | 10 300 | 301 406 | 116 | A(2)B(80)N(34) |
| 2IU5 | 1 71 | 72 180 | 65 | B(56)N(9) |
| 2NPO | 3 76 | 77 188 | 224 | A(3)B(182)U(1)N(38) |
| 2NRC | 1 247 | 261 480 | 188 | A(9)B(96)E(68)N(15) |
| 2OF7 | 17 67 | 68 207 | 204 | B(135)N(69) |
| 2OI8 | 8 86 | 87 216 | 215 | B(151)N(64) |
| 2PGD | 1 172 | 178 433 | 317 | B(211)E(78)N(28) |
| 2PGE | 3 136 | 137 368 | 138 | A(6)B(102)E(1)N(29) |
| 2PGX | 2 56 | 57 250 | 102 | B(87)N(15) |
| 2PHZ | 20 142 | 143 296 | 420 | A(4)B(343)N(73) |
| 2QY9 | 201 284 | 285 495 | 471 | A(32)B(344)E(15)N(80) |
| 2REB | 23 268 | 269 328 | 482 | B(434)E(12)N(36) |
| 2TS1 | 1 220 | 248 319 | 598 | B(512)E(34)N(52) |
| 4ENL | 1 126 | 127 436 | 649 | A(32)B(448)E(122)N(47) |
| 4MDH | 1 154 | 155 333 | 339 | A(6)B(173)E(134)N(26) |
| 5FBP | 1 201 | 202 335 | 355 | A(3)B(213)E(112)N(27) |
| 6GST | 1 82 | 90 217 | 374 | B(10)E(312)N(52) |
| 8TLN | 1 135 | 136 316 | 44 | A(1)B(36)E(2)N(5) |
The “protein” column contains a list of PDB identifiers [40]. The D1 and D2 columns denote the start and end PDB residues of domains 1 and 2, respectively. For all PDB entries listed, the start and end residues are located in chain A of the structure, except for 1W98, where the domains are in chain B, and 8TLN, where they are in chain E. The “sequences” column indicates the number of sequences present in the multiple sequence alignment (MSA). The final column states the distribution of sequences in each MSA across the domains of life: eukaryotes (E); archaea (A); bacteria (B); viruses (V); unclassified (U); and not found (N), i.e. those sequences that could not be found in the NCBI Taxonomy Database. This dataset was taken from Hamer et al. [12].
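As a consistency check on the table, the species codes in the final column can be parsed and summed against the “sequences” column. The sketch below does this for one row copied from the table; the function name and the parsing approach are illustrative assumptions.

```python
import re

def parse_distribution(text):
    """Turn a string such as 'A(12)B(201)N(23)' into {'A': 12, 'B': 201, 'N': 23}."""
    return {code: int(count) for code, count in re.findall(r"([A-Z])\((\d+)\)", text)}

# One row of the dataset table, as it appears above.
row = "| 1BIB | 67 270 | 271 317 | 236 | A(12)B(201)N(23) |"
cells = [c.strip() for c in row.strip().strip("|").split("|")]
protein, d1, d2, n_sequences, distribution = cells

counts = parse_distribution(distribution)
d1_start, d1_end = (int(x) for x in d1.split())

# The per-group counts should add up to the number of sequences in the MSA.
consistent = sum(counts.values()) == int(n_sequences)
```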
Figure 3. Contact residues in a pair of interacting domains. Test case 1J5X.pdb [40]. The two structurally defined domains are depicted in orange (residues 2 to 169) and green (residues 170 to 319) respectively. In the magnified frame, residues in red denote contact residues. Dotted lines and corresponding numbers indicate the distance in Ångströms between a pair of atoms in the connected residues.
Figure 4. Effect of entropies of 0 on MI scores. The percentage of columns in an MSA that have an entropy of 0 is plotted against the percentage of all domain-domain residue pairs in the corresponding complex that have an MI value of 0. Only those columns in the MSA that correspond to a residue in the reference structure are used. Columns that have one or more gaps are ignored. Each point on the plot represents a single case study in our domain-domain dataset.
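The link between zero-entropy columns and zero MI values follows from the identity MI(A;B) = H(A) + H(B) − H(A,B): if column A is constant, then H(A) = 0 and H(A,B) = H(B), so MI is 0 whatever column B contains. A minimal sketch of plain MI (not the MIp or MIc variants), using hypothetical toy columns:

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(col_a, col_b):
    """MI(A;B) = H(A) + H(B) - H(A,B) for two equally long MSA columns."""
    return entropy(col_a) + entropy(col_b) - entropy(list(zip(col_a, col_b)))

constant = list("AAAA")   # fully conserved column, entropy 0
variable = list("CDCD")   # column with two equally frequent residues

mi_with_constant = mutual_information(constant, variable)
mi_with_itself = mutual_information(variable, variable)
```

Here `mi_with_constant` is 0 because one column is fully conserved, while `mi_with_itself` equals the column entropy of 1 bit.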