Abstract
Residue-residue interactions that fold a protein into a unique three-dimensional structure and enable it to perform a specific function impose structural and functional constraints, to varying degrees, on each residue site. Selective constraints on residue sites are recorded in the amino acid orders of homologous sequences and also in the evolutionary trace of amino acid substitutions. A challenge is to extract direct dependences between residue sites by removing phylogenetic correlations and indirect dependences mediated by other residues within a protein or even by other molecules. The rapid growth of protein families with unknown folds calls for an accurate de novo method of protein structure prediction. Recent attempts to disentangle direct from indirect dependences of amino acid types between residue positions in multiple sequence alignments have revealed that inferred residue-residue proximities can provide sufficient information to predict a protein fold without the use of known three-dimensional structures. Here, we propose an alternative method of inferring coevolving site pairs from concurrent and compensatory substitutions between sites in each branch of a phylogenetic tree. The substitution probability and the physico-chemical changes (volume, charge, hydrogen-bonding capability, and others) that accompany substitutions at each site in each branch of a phylogenetic tree are estimated with the likelihood of each substitution, and their direct correlations between sites are used to detect concurrent and compensatory substitutions. To extract direct dependences between sites, partial correlation coefficients of the characteristic changes along branches between sites, in which linear multiple dependences on feature vectors at other sites are removed, are calculated and used to rank coevolving site pairs.
The accuracy of contact prediction based on the present coevolution score is comparable to that achieved by a maximum entropy model of protein sequences for 15 protein families taken from Pfam release 26.0. Moreover, this excellent accuracy indicates that compensatory substitutions are significant in protein evolution.
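The abstract's central step, removing linear dependences on other sites by using partial correlation coefficients, can be illustrated with a minimal sketch. It assumes one scalar characteristic per site (the paper uses feature vectors per site and branch), and the function names `invert` and `partial_correlations` are hypothetical, introduced only for illustration. The partial correlation of each pair given all other variables is read off the inverse of the correlation matrix.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity on the right-hand side.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def partial_correlations(corr):
    """Partial correlation of every pair given all remaining variables,
    computed as -P_ij / sqrt(P_ii * P_jj) from the precision matrix P = corr^-1."""
    prec = invert(corr)
    n = len(corr)
    return [[1.0 if i == j else -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])
             for j in range(n)] for i in range(n)]

# Toy data (assumed, not from the paper): sites 0 and 2 correlate only through site 1.
corr = [[1.00, 0.80, 0.64],
        [0.80, 1.00, 0.80],
        [0.64, 0.80, 1.00]]
pcorr = partial_correlations(corr)
print(round(pcorr[0][2], 6), round(pcorr[0][1], 4))
```

In this toy case the plain correlation between sites 0 and 2 is 0.64, but their partial correlation vanishes because the dependence is entirely indirect, which is the kind of pair a ranking by partial correlation is meant to demote.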
Year: 2013 PMID: 23342110 PMCID: PMC3546969 DOI: 10.1371/journal.pone.0054252
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Protein families used.
| Pfam ID | Seed #seqs | Seed length | Full #seqs | Full length | Target protein domain (UniProt ID) | PDB ID | Fold type | #sites/Length |
|---|---|---|---|---|---|---|---|---|
| Trans_reg_C | 362 | 114 | 35180 | 269 | OMPR_ECOLI/156-232 | 1ODD-A:156-232 | | 76/77 |
| CH | 202 | 249 | 5756 | 650 | SPTB2_HUMAN/176-278 | 1BKR-A:5-107 | | 101/103 |
| 7tm_1 | 64 | 434 | 26656 | 2354 | OPSD_BOVIN/54-306 | 1GZM-A:54-306 | | 248/253 |
| SH3_1 | 61 | 56 | 8993 | 210 | YES_HUMAN/97-144 | 2HDA-A:97-144 | | 48/48 |
| Cadherin | 57 | 129 | 18808 | 494 | CADH1_HUMAN/267-366 | 2O72-A:113-212 | | 91/100 |
| Trypsin | 71 | 348 | 14720 | 1356 | TRY2_RAT/24-239 | 3TGI-E:16-238 | | 212/216 |
| Kunitz_BPTI | 151 | 81 | 3090 | 209 | BPT1_BOVIN/39-91 | 5PTI-A:4-56 | | 53/53 |
| KH_1 | 399 | 104 | 11484 | 280 | PCBP1_HUMAN/281-343 | 1WVN-A:7-69 | | 57/63 |
| RRM_1 | 79 | 79 | 31837 | 580 | ELAV4_HUMAN/48-118 | 1G2E-A:41-111 | | 70/71 |
| FKBP_C | 174 | 247 | 11034 | 845 | O45418_CAEEL/26-118 | 1R9H-A:26-118 | | 92/93 |
| Lectin_C | 44 | 136 | 6530 | 801 | CD209_HUMAN/273-379 | 1SL5-A:273-379 | | 103/107 |
| Thioredoxin | 50 | 123 | 16281 | 609 | THIO_ALIAC/1-103 | 1RQM-A:1-103 | | 99/103 |
| Response_reg | 57 | 157 | 103232 | 804 | CHEY_ECOLI/8-121 | 1E6K-A:8-121 | | 110/114 |
| RNase_H | 65 | 246 | 13801 | 574 | RNH_ECOLI/2-142 | 1F21-A:3-142 | | 128/140 |
| Ras | 61 | 229 | 13525 | 1461 | RASH_HUMAN/5-165 | 5P21-A:5-165 | | 159/161 |
Pfam release 26.0 (November 2011) was used.
The number of sequences and the length of alignment included in the Pfam seed alignment.
The number of sequences and the length of alignment included in the Pfam full alignment.
Target protein member in the Pfam family.
A protein structure corresponding to the target protein domain.
Site positions that are represented by the lower case of characters in Pfam alignments were excluded in the evaluation of prediction accuracy for comparison with the contact prediction published in [16].
‡Transmembrane .
Figure 1. Framework of the present model.
See text for details.
Correlation versus partial correlation coefficients of concurrent substitutions between sites.
| Pfam ID | Branch-length threshold | #OTUs | TP:FP | PPV | TP:FP | PPV | TP:FP | PPV | TP:FP | PPV |
|---|---|---|---|---|---|---|---|---|---|---|
| Trans_reg_C | 0.12 | 7720 | 102∶2282 | 0.04 | 1∶30 | 0.03 | 0∶0 | – | 0∶0 | – |
| CH | 0.01 | 2960 | 167∶4226 | 0.04 | 2∶73 | 0.03 | 0∶2 | 0.0 | 0∶0 | – |
| 7tm_1 | 0.1 | 6302 | 358∶28576 | 0.01 | 0∶0 | – | 0∶0 | – | 0∶0 | – |
| SH3_1 | 0.01 | 4160 | 74∶674 | 0.10 | 7∶60 | 0.10 | 0∶5 | 0.0 | 0∶0 | – |
| Cadherin | 0.06 | 7617 | 214∶3333 | 0.06 | 1∶46 | 0.02 | 0∶7 | 0.0 | 0∶0 | – |
| Trypsin | 0.1 | 6688 | 617∶20312 | 0.03 | 0∶0 | – | 0∶0 | – | 0∶0 | – |
| Kunitz_BPTI | 0.01 | 2130 | 86∶799 | 0.10 | 11∶48 | 0.19 | 0∶2 | 0.0 | 0∶0 | – |
| KH_1 | 0.01 | 5114 | 78∶1116 | 0.07 | 1∶41 | 0.02 | 0∶4 | 0.0 | 0∶0 | – |
| RRM_1 | 0.15 | 7684 | 119∶1839 | 0.06 | 0∶0 | – | 0∶0 | – | 0∶0 | – |
| FKBP_C | 0.01 | 5695 | 199∶3445 | 0.05 | 0∶10 | 0.0 | 0∶1 | 0.0 | 0∶0 | – |
| Lectin_C | 0.01 | 4479 | 234∶4319 | 0.05 | 1∶19 | 0.05 | 0∶0 | – | 0∶0 | – |
| Thioredoxin | 0.06 | 7483 | 188∶4180 | 0.04 | 0∶3 | 0.0 | 0∶0 | – | 0∶0 | – |
| Response_reg | 0.46 | 7613 | 202∶5266 | 0.04 | 0∶1 | 0.0 | 0∶0 | – | 0∶0 | – |
| RNase_H | 0.01 | 4782 | 271∶7152 | 0.04 | 0∶5 | 0.0 | 0∶0 | – | 0∶0 | – |
| Ras | 0.02 | 6390 | 329∶11304 | 0.03 | 0∶0 | – | 0∶0 | – | 0∶0 | – |
OTUs connected to their parent nodes with branches shorter than the threshold value are removed from each Pfam full alignment, and the number of remaining OTUs is listed.
The threshold for a correlation coefficient corresponds to a given E-value (and the associated P-value) in Student's t-distribution; the P-value and the degrees of freedom are expressed in terms of the number of site pairs and the number of OTUs.
TP and FP are the numbers of true and false positives, which are the number of contact site pairs and the number of non-contact site pairs predicted as contacts in each category, respectively.
PPV stands for positive predictive value; i.e., PPV = TP/(TP + FP).
‡The numbers of contacts and of sites, and their ratio are listed. Protein structures used to calculate contact residue pairs are listed in Table 1. Neighboring residue pairs within 5 residues along a peptide chain are excluded in the evaluation of prediction accuracy. Both terminal sites are also excluded from counting in this table.
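As a small worked check of the PPV definition in the footnotes above, the sketch below recomputes PPV = TP/(TP + FP) from a few TP∶FP cells of this table. The helper names `ppv` and `parse_tp_fp` are hypothetical, and returning `None` for an empty category is an assumption that mirrors the "–" entries.

```python
def ppv(tp, fp):
    """Positive predictive value TP / (TP + FP); None when nothing was predicted."""
    total = tp + fp
    return tp / total if total else None

def parse_tp_fp(cell):
    """Parse a 'TP∶FP' table cell written with the ratio sign (U+2236) or an ASCII colon."""
    tp, fp = cell.replace("\u2236", ":").split(":")
    return int(tp), int(fp)

# Cells copied from the table above.
cells = {"Kunitz_BPTI": "11∶48", "SH3_1": "7∶60", "Ras": "0∶0"}
for family, cell in cells.items():
    value = ppv(*parse_tp_fp(cell))
    print(family, "–" if value is None else f"{value:.2f}")
```

The printed values (0.19, 0.10, and a dash) agree with the PPV entries listed next to those cells.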
Effectiveness of partial correlation coefficients on contact prediction accuracy.
| Pfam ID | #contacts/#sites | TP+FP | PPV | PPV | PPV | PPV |
|---|---|---|---|---|---|---|
| Trans_reg_C | 103/75 | 27 | 0.222 | | | |
| | 1.4 | 37 | 0.189 | | | |
| CH | 169/100 | 43 | 0.047 | | | |
| | 1.7 | 57 | 0.053 | | | |
| 7tm_1 | 366/247 | 93 | 0.011 | | | |
| | 1.5 | 124 | 0.008 | | | |
| SH3_1 | 81/46 | 22 | 0.227 | | | |
| | 1.8 | 29 | 0.241 | | | |
| Cadherin | 215/90 | 55 | 0.291 | | | |
| | 2.4 | 73 | 0.274 | | | |
| Trypsin | 617/210 | 159 | 0.396 | | | |
| | 2.9 | 212 | 0.344 | | | |
| Kunitz_BPTI | 105/51 | 27 | 0.259 | | | |
| | 2.1 | 37 | 0.216 | | | |
| KH_1 | 79/55 | 22 | 0.455 | | | |
| | 1.4 | 30 | 0.367 | | | |
| RRM_1 | 119/68 | 33 | 0.273 | | | |
| | 1.8 | 44 | 0.295 | | | |
| FKBP_C | 199/91 | 50 | 0.220 | | | |
| | 2.2 | 66 | 0.197 | | | |
| Lectin_C | 243/102 | 61 | 0.197 | | | |
| | 2.4 | 82 | 0.171 | | | |
| Thioredoxin | 188/99 | 47 | 0.213 | | | |
| | 1.9 | 62 | 0.177 | | | |
| Response_reg | 202/110 | 50 | 0.000 | | | |
| | 1.8 | 67 | 0.015 | | | |
| RNase_H | 271/127 | 68 | 0.162 | | | |
| | 2.1 | 91 | 0.132 | | | |
| Ras | 329/158 | 83 | 0.229 | | | |
| | 2.1 | 111 | 0.207 | | | |
The threshold to remove OTUs with short branches and the number of remaining OTUs that are used for each protein here are listed in Table 2.
The numbers of contacts and of sites, and their ratio are listed. Protein structures used to calculate contact residue pairs are listed in Table 1. Neighboring residue pairs within 5 residues along a peptide chain are excluded in the evaluation of prediction accuracy. Both terminal sites are also excluded from counting in this table.
TP and FP are the numbers of true and false positives, and their sum is equal to the number of predicted contacts; only predictions in which the number of predicted contacts equals one fourth and one third of the number of true contacts are listed.
Correlation coefficients of co-substitution are used as a score.
Partial correlation coefficients of co-substitution are used as a score.
In Eq. 26 for an overall coevolution score, with is supposed instead of Eq. 25; in other words, correlation coefficients are used instead of partial correlation coefficients for characteristic changes except co-substitution.
‡The overall coevolution score defined by Eq. 26 is used.
Coevolution score based on each characteristic variable.
| Characteristic variable | TP | FP | PPV | TP | FP | PPV |
|---|---|---|---|---|---|---|
| over all protein families | | | | | | |
| Substitution | 687 | 642 | 0.52 | | | |
| Volume | 18 | 20 | 0.47 | 73 | 10 | 0.88‡ |
| Charge | 6 | 8 | 0.43 | 134 | 54 | 0.71‡ |
| Hydrogen bond | 4 | 11 | 0.27 | 125 | 51 | 0.71‡ |
| Hydrophobicity | 23 | 13 | 0.64‡ | 23 | 16 | 0.59‡ |
| | 14 | 20 | 0.41 | 9 | 10 | 0.47 |
| | 24 | 17 | 0.59‡ | 30 | 14 | 0.68‡ |
| Turn propensity | 21 | 18 | 0.54‡ | 17 | 15 | 0.53‡ |
| Aromatic interaction | 30 | 10 | 0.75‡ | 16 | 14 | 0.53‡ |
| Branched side-chain | 26 | 16 | 0.62‡ | 20 | 8 | 0.71‡ |
| Cross link | 23 | 12 | 0.66‡ | 5 | 9 | 0.36 |
| Ionic side-chain | 27 | 15 | 0.64‡ | 14 | 18 | 0.44 |
See Eqs. 24 and 25 for the definitions of the two coevolution scores compared here, respectively. The threshold for a correlation coefficient corresponds to a given E-value (and the associated P-value) in Student's t-distribution; the P-value and the degrees of freedom are expressed in terms of the number of site pairs and the number of OTUs.
TP and FP are the numbers of true and false contact residue pairs over all 15 protein families listed in Table 2; protein structures used to calculate contact residue pairs are listed in Table 1. Neighboring residue pairs within 5 residues along a peptide chain are excluded in the evaluation of prediction accuracy. Both terminal sites are also excluded from counting in this table.
PPV stands for positive predictive value; i.e., PPV = TP/(TP + FP).
‡These PPVs are larger than the PPV for concurrent substitutions (0.52).
Figure 2. Dependence of PPV on the number of characteristic variables used.
Average PPVs are plotted against the number of characteristic variables used to score co-substitutions between sites. The characteristic variables except propensity listed in Table 4 are added in the listed order to define an overall coevolution score; that is, (1) occurrence of amino acid substitution, (2) side-chain volume, (3) charge, (4) hydrogen-bonding capability, (5) hydrophobicity, (6) and (7) turn propensities, (8) aromatic interaction, (9) branched side-chain, (10) cross-link capability, and (11) ionic side-chain. The solid and dotted lines correspond to predictions in which the ratio of the predicted to the true contacts is equal to or , respectively. The plus marks and open circles show the averages of PPV over all 15 proteins and the values of , where the sum is taken over all 15 proteins.
Accuracy of contact prediction based on the overall coevolution score.
| Pfam ID | #contacts/#sites | TP+FP | PPV (DI) | PPV (present) | PPV (DI) | PPV (present) | MDPNT (DI) | MDPNT (present) | MDTNP (DI) | MDTNP (present) |
|---|---|---|---|---|---|---|---|---|---|---|
| Trans_reg_C | 111/76 | 27 | 0.556 | | 0.556 | | 1.30 | | 4.20 | |
| | 1.5 | 37 | 0.459 | | 0.432 | | 1.72 | | 3.64 | |
| CH | 172/101 | 43 | 0.535 | | | 0.465 | | 2.55 | 4.59 | |
| | 1.7 | 57 | 0.456 | | 0.439 | | | 2.44 | 3.70 | |
| 7tm_1 | 372/248 | 93 | 0.290 | | 0.194 | | 7.43 | | 12.68 | |
| | 1.5 | 124 | 0.282 | | 0.169 | | 7.30 | | 12.18 | |
| SH3_1 | 89/48 | 22 | 0.636 | | 0.636 | | 0.83 | | | 2.34 |
| | 1.9 | 29 | 0.552 | | 0.552 | | 1.15 | | 1.56 | |
| Cadherin | 220/91 | 55 | 0.836 | 0.836 | 0.818 | | 0.59 | | 1.98 | 1.98 |
| | 2.4 | 73 | 0.753 | | 0.753 | | 0.64 | | 1.60 | 1.60 |
| Trypsin | 636/212 | 159 | 0.642 | | 0.591 | | 1.75 | | 3.26 | |
| | 3.0 | 212 | 0.580 | | 0.533 | | 2.26 | | 2.83 | |
| Kunitz_BPTI | 111/53 | 27 | 0.593 | | 0.444 | | 1.40 | | 2.31 | |
| | 2.1 | 37 | | 0.486 | | 0.486 | | 1.46 | | 1.94 |
| KH_1 | 90/57 | 22 | 0.545 | | 0.500 | | 0.99 | | | 3.29 |
| | 1.6 | 30 | 0.533 | | 0.533 | | 1.07 | | | 3.05 |
| RRM_1 | 133/70 | 33 | 0.788 | | 0.758 | | | 0.55 | 2.86 | |
| | 1.9 | 44 | 0.750 | | 0.705 | | 0.83 | | 2.49 | |
| FKBP_C | 200/92 | 50 | 0.760 | | 0.760 | | | 0.69 | 1.97 | |
| | 2.2 | 66 | 0.712 | | 0.697 | | 0.94 | | 1.66 | |
| Lectin_C | 246/103 | 61 | | 0.721 | | 0.705 | | 0.94 | 2.93 | |
| | 2.4 | 82 | | 0.659 | | 0.646 | 1.19 | | 2.54 | |
| Thioredoxin | 188/99 | 47 | 0.532 | | 0.532 | | 0.98 | | 3.43 | |
| | 1.9 | 62 | 0.597 | | 0.565 | | 0.94 | | 3.16 | |
| Response_reg | 202/110 | 50 | 0.680 | | 0.660 | | | 0.88 | 3.39 | |
| | 1.8 | 67 | 0.657 | | 0.642 | | 1.01 | | 2.54 | |
| RNase_H | 273/128 | 68 | | 0.471 | | 0.471 | | 1.53 | | 5.44 |
| | 2.1 | 91 | | 0.407 | | 0.407 | | 2.19 | 3.27 | |
| Ras | 335/159 | 83 | 0.699 | 0.699 | 0.699 | 0.699 | | 1.05 | | 3.68 |
| | 2.1 | 111 | 0.640 | | 0.631 | | | 1.45 | | 2.51 |
The threshold to remove OTUs with short branches and the number of remaining OTUs that are used for each protein here are listed in Table 2.
The numbers of contacts and of sites, and their ratio are listed. Protein structures used to calculate contact residue pairs are listed in Table 1. Neighboring residue pairs within 5 residues along a peptide chain are excluded in the evaluation of prediction accuracy.
TP and FP are the numbers of true and false positives, and their sum is equal to the number of predicted contacts; only predictions in which the number of predicted contacts equals one fourth and one third of the number of true contacts are listed.
DI means the prediction based on the direct information (DI) score published in [16].
PPV stands for positive predictive value; i.e., PPV = TP/(TP + FP). Better values are typed in a bold font.
MDPNT stands for the mean Euclidean distance from predicted site pairs to the nearest true contact in the 2-dimensional sequence-position space [16]. Better values are typed in a bold font.
MDTNP stands for the mean Euclidean distance from every true contact to the nearest predicted site pair in the 2-dimensional sequence-position space [16]. Better values are typed in a bold font.
‡Filters based on a secondary structure prediction and on cysteine pairs, which were applied to DI in [16], are applied to both DI and the present coevolution score. For DI, an additional filter [16] based on residue conservation is also used.
‡‡Only the conservation filter is used for DI, and no filter is used for the present coevolution score.
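The two distance measures defined in the footnotes above, MDPNT and MDTNP, can be sketched as follows. This is a minimal illustration that treats each site pair (i, j) as a point in the two-dimensional sequence-position plane, as the footnotes state; the function names and the toy pairs are assumptions, and pairs are taken to be stored with i < j.

```python
import math

def mean_nearest_distance(sources, targets):
    """Mean Euclidean distance from each pair in `sources` to its nearest pair in `targets`."""
    if not sources or not targets:
        raise ValueError("both pair lists must be non-empty")
    total = 0.0
    for i, j in sources:
        total += min(math.hypot(i - k, j - l) for k, l in targets)
    return total / len(sources)

def mdpnt(predicted, true_contacts):
    """Mean distance from predicted site pairs to the nearest true contact."""
    return mean_nearest_distance(predicted, true_contacts)

def mdtnp(predicted, true_contacts):
    """Mean distance from every true contact to the nearest predicted site pair."""
    return mean_nearest_distance(true_contacts, predicted)

# Toy pairs (assumed for illustration only).
true_contacts = [(10, 20), (30, 45), (12, 50)]
predicted = [(10, 20), (13, 54)]
print(round(mdpnt(predicted, true_contacts), 2), round(mdtnp(predicted, true_contacts), 2))
```

MDPNT penalizes predictions that fall far from any true contact, whereas MDTNP penalizes true contacts that no prediction comes near, so the two together describe both the precision and the coverage of a predicted contact map.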
Figure 3. Dependence of PPV on the number of predicted contacts.
The dependences of the positive predictive values on the total number of predicted contacts are shown for each of the four protein fold types. The solid and dotted lines show the PPVs of the present method and of the method based on the DI score [16], respectively. Only the conservation filter [16] is applied for the DI score. The total number of predicted site pairs is shown as the ratio of the number of predicted site pairs to the number of true contacts. The number of predicted site pairs is increased in steps of 10, from 10 up to the sequence length; PPVs for numbers of predicted site pairs equal to one fourth or one third of the true contacts are also plotted. The filled marks indicate the points at which the number of predicted site pairs equals one third of the number of true contacts. The number of sequences used here for each protein family is the one listed in Table 1.
Figure 4. Coevolving site pairs versus DI residue pairs.
Residue pairs whose minimum atomic distances are shorter than 5 Å in a protein structure and predicted coevolving site pairs are shown by gray filled squares and by red or indigo filled circles, respectively, in the lower-left half of each figure. For comparison, the same residue-residue proximities and the predicted contact residue pairs with high DI scores in [16] are shown by gray filled squares and by red or indigo filled circles, respectively, in the upper-right half of each figure; only the conservation filter is applied to the DI scores, and the filters based on a secondary structure prediction and on cysteine pairs are not. Red and indigo filled circles correspond to true and false contact residue pairs, respectively. Residue pairs separated by five or fewer positions in a sequence may be shown with gray filled squares but, like nearest neighbors, are excluded from both predictions. The total numbers of coevolving site pairs and of DI residue pairs plotted for each protein are both equal to one third of the number of true contacts. The PPVs of both methods for each protein are listed in Table 5.
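The evaluation protocol described in this caption (contacts defined by a minimum atomic distance below 5 Å, pairs separated by five or fewer positions excluded, and as many predictions as one third of the true contacts) can be sketched as below. The data layout, the function names, and the toy coordinates and scores are assumptions for illustration; the use of floor division for "one third" is inferred from the counts in Table 5 (for example, 89 contacts and 29 predictions for SH3_1).

```python
import math

def contact_pairs(residues, cutoff=5.0, min_separation=6):
    """Residue pairs (i, j), i < j, whose minimum inter-atomic distance is below `cutoff` (in Å).
    `residues` maps a sequence position to a list of (x, y, z) atom coordinates.
    Pairs with j - i < min_separation (five or fewer positions apart) are skipped."""
    positions = sorted(residues)
    contacts = []
    for a, i in enumerate(positions):
        for j in positions[a + 1:]:
            if j - i < min_separation:
                continue
            dmin = min(math.dist(p, q) for p in residues[i] for q in residues[j])
            if dmin < cutoff:
                contacts.append((i, j))
    return contacts

def top_fraction_ppv(scores, contacts, denominator=3):
    """PPV of the top-scoring pairs, predicting len(contacts) // denominator pairs."""
    n_pred = len(contacts) // denominator
    if n_pred == 0:
        return None
    ranked = sorted(scores, key=scores.get, reverse=True)[:n_pred]
    true_set = set(contacts)
    return sum(pair in true_set for pair in ranked) / n_pred

# Toy structure: one or two atoms per residue, placed on a line (assumed data).
residues = {1: [(0, 0, 0)], 4: [(3, 0, 0)], 8: [(4, 0, 0)], 20: [(30, 0, 0)],
            27: [(33, 0, 0), (40, 0, 0)], 40: [(2, 0, 0)], 50: [(31, 0, 0)]}
contacts = contact_pairs(residues)
scores = {(1, 40): 0.9, (4, 20): 0.8, (20, 27): 0.7, (8, 40): 0.2}
print(len(contacts), top_fraction_ppv(scores, contacts))
```

Here the toy structure has seven contacts, so two pairs are predicted; one of the two top-scoring pairs is a true contact, giving a PPV of 0.5.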
Figure 5. Dependence of PPV on the number of sequences used.
The positive predictive values are plotted against the total number of homologous sequences used for each prediction. The total numbers of coevolving site pairs predicted for each protein are equal to one third of true contacts. The filled marks indicate the points corresponding to the number of used sequences listed for each protein family in Table 1. The values written near each data point indicate the threshold value; OTUs connected to their parent nodes with branches shorter than this threshold value are removed in the Pfam reference tree of the Pfam full sequences used for each prediction. Some data points correspond to datasets generated by using the same value of the threshold but by removing different OTUs.
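The thinning of sequences by branch length mentioned in this caption can be pictured with a deliberately simplified sketch. It assumes the length of the branch from each OTU to its parent node is already available in a dictionary; the name `filter_otus` and the toy values are hypothetical, and the sketch does a single pass without re-estimating the tree, which may differ from the procedure actually used.

```python
def filter_otus(branch_lengths, threshold):
    """Keep OTUs whose branch to the parent node is not shorter than `threshold`."""
    return [otu for otu, length in branch_lengths.items() if length >= threshold]

# Toy terminal branch lengths (assumed values, not taken from any Pfam tree).
branch_lengths = {"seqA": 0.30, "seqB": 0.005, "seqC": 0.01, "seqD": 0.12}
for threshold in (0.01, 0.1):
    kept = filter_otus(branch_lengths, threshold)
    print(threshold, len(kept), kept)
```

Raising the threshold removes more closely related sequences, which is how the different numbers of sequences along the horizontal axis of this figure arise.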