Andrew F Neuwald, Stephen F Altschul.
Abstract
Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at http://evaldca.igs.umaryland.edu.
Year: 2018 PMID: 30596639 PMCID: PMC6329532 DOI: 10.1371/journal.pcbi.1006237
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
List of variables.
| Symbol | Definition |
|---|---|
| L | Total number of column pairs in the ICA array |
| r | Maximum 3D distance used to define contacting residue pairs (default: 5 Å) |
| | Number of contacting pairs, i.e. distinguished elements |
| X | Optimum cut point (as defined by the ICA algorithm) for partitioning an array of length |
| | Number of left-distinguished elements, i.e. contacting pairs to the left of the cut point |
| m | Minimum sequence separation between residue pairs in a query protein of known structure |
| Estimated | |
| The number, among the | |
| The probability, based on the cumulative hypergeometric distribution, of | |
| Estimated joint | |
| -log10
| |
| x | Constant cut point (used instead of an optimized cut point |
| ℓ | The length of the input MSA |
| F | Numerical factor defining the constant cut point as x = F × ℓ |
| The probability, based on the cumulative hypergeometric distribution, of | |
| Estimated joint | |
| -log10
|
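As a minimal sketch of the hypergeometric quantity listed among the variables above, the following Python computes the probability of finding at least a given number of contacting pairs to the left of a fixed cut point in a score-ordered array, together with its -log10 value. The function and argument names are illustrative assumptions rather than names from the authors' software, the "at least" tail is assumed, and the sketch applies no correction for optimizing the cut point.

```python
from math import comb, log10


def _tail_counts(total_pairs, n_contacts, cut, n_left):
    """Return (numerator, denominator) of the hypergeometric upper tail as exact integers."""
    upper = min(n_contacts, cut)
    denom = comb(total_pairs, cut)
    # Count the ways to place k contacting pairs among the `cut` top-ranked pairs.
    numer = sum(
        comb(n_contacts, k) * comb(total_pairs - n_contacts, cut - k)
        for k in range(n_left, upper + 1)
    )
    return numer, denom


def hypergeom_tail(total_pairs, n_contacts, cut, n_left):
    """P(at least n_left contacting pairs among the top `cut` pairs) under random ordering."""
    numer, denom = _tail_counts(total_pairs, n_contacts, cut, n_left)
    return numer / denom


def neg_log10_tail(total_pairs, n_contacts, cut, n_left):
    """-log10 of the tail probability, computed from exact integers to avoid underflow."""
    numer, denom = _tail_counts(total_pairs, n_contacts, cut, n_left)
    if numer == 0:
        return float("inf")
    return log10(denom) - log10(numer)
```

Working with exact integer counts and taking logarithms only at the end keeps the score usable when the probability itself would underflow a floating-point number, which is likely for arrays with tens of thousands of column pairs.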
Table 2. S calculated with and without P for thirty superfamilies using residue pairwise 3D distances ≤ 5 Å and a minimum of 5 intervening residues.
| Query | Resolution (Å) | Query length | Description | Number of sequences | CCM | EVC | GSF | PCV | CCM | EVC | GSF | PCV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1ayaA | 2.05 | 101 | tyrosine phosphatase SH2 domain | 12,208 | 55 | 54 | 43 | 56 | 55 | 44 | ||
| 1b5oA | 2.20 | 382 | aspartate aminotransferase | 105,741 | 810 | 774 | 606 | 805 | 768 | 601 | ||
| 1el3A | 1.70 | 315 | aldose reductase | 67,824 | 588 | 514 | 502 | 582 | 509 | 499 | ||
| 1k30A | 1.90 | 234 | glycerol-3-phosphate acyltransferase | 12,225 | 110 | 90 | 77 | 110 | 90 | 78 | ||
| 1b23P | 2.60 | 184 | elongation factor Tu | 78,839 | 135 | 109 | 101 | 134 | 109 | 102 | ||
| 1olzA | 2.0 | 481 | Sema4D | 5,453 | 187 | 185 | 92 | 184 | 183 | 91 | ||
| 1wznA | 1.90 | 155 | SAM-dependent methyltransferase | 146,217 | 156 | 159 | 154 | 152 | 155 | 149 | ||
| 1z0kC | 1.92 | 164 | Rab4 GTPase | 64,211 | 181 | 202 | 178 | 180 | 201 | 175 | ||
| 1zp9A | 2.00 | 258 | Rio1 serine kinase | 24,076 | 105 | 91 | 86 | 106 | 92 | 87 | ||
| 2b61A | 1.65 | 357 | homoserine transacetylase | 47,508 | 290 | 284 | 274 | 290 | 284 | 275 | ||
| 3ex7H | 2.30 | 241 | DEAD-box ATPase eIF4AIII | 98,478 | 239 | 173 | 157 | 237 | 173 | 156 | ||
| 4ag9A | 1.76 | 165 | glucosamine-6-phosphate acetylase | 107,738 | 167 | 173 | 163 | 164 | 170 | 163 | ||
| 5dfiA | 1.63 | 318 | apurinic-apyrimidinic endonuclease | 36,297 | 293 | 244 | 193 | 291 | 242 | 193 | ||
| 5hf7A | 1.54 | 227 | thymine DNA glycosylase | 7,588 | 125 | 72 | 66 | 125 | 73 | 66 | ||
| 5m4pA | 2.30 | 164 | pyruvate dehydrogenase kinase | 1,651 | 33 | 34 | 23 | 34 | 36 | 24 | ||
| 1ijxA | 1.90 | 127 | cysteine-rich domain of sFRP-3 | 3,224 | 24 | 19 | 20 | 25 | 20 | 21 | ||
| 2nrlA | 0.91 | 147 | myoglobin | 9,514 | 95 | 68 | 58 | 94 | 69 | 58 | ||
| 4cmlA | 2.30 | 313 | INPP5B | 4,724 | 244 | 247 | 177 | 242 | 242 | 176 | ||
| 1jw9B | 1.70 | 249 | molybdopterin synthase MoeB | 23,170 | 318 | 272 | 243 | 312 | 269 | 242 | ||
| 3fhkF | 2.30 | 147 | disulfide isomerase | 1,042 | 61 | 64 | 49 | 61 | 64 | 49 | ||
| 3h7uA | 1.25 | 335 | plant stress-response enzyme Akr4c9 | 67,652 | 573 | 502 | 481 | 565 | 494 | 473 | ||
| 1g9rA | 2.00 | 311 | galactosyltransferase LgtC | 10,575 | 264 | 281 | 212 | 254 | 274 | 208 | ||
| 4em8A | 1.95 | 148 | ribose 5-phosphate isomerase B | 7,217 | 181 | 160 | 146 | 173 | 153 | 138 | ||
| 1i6mA | 1.72 | 328 | tryptophanyl-tRNA synthetase | 20,731 | 312 | 198 | 166 | 309 | 194 | 165 | ||
| 3f1lA | 0.95 | 252 | oxidoreductase, Ycik | 99,991 | 448 | 433 | 397 | 439 | 426 | 393 | ||
| 1jr3A | 2.2 | 373 | bacterial DNA clamp loader γ subunit | 24,739 | 373 | 258 | 245 | 365 | 258 | 245 | ||
| 1nnlA | 1.53 | 225 | human phosphoserine phosphatase | 130,332 | 133 | 111 | 91 | 131 | 111 | 93 | ||
| 1frwA | 1.75 | 194 | E. coli MobA | 79,445 | 193 | 179 | 168 | 194 | 180 | 169 | ||
| 1bqbA | 1.72 | 301 | Aureolysin metalloproteinase | 5,289 | 333 | 254 | 190 | 325 | 252 | 190 | ||
| 2ovdA | 1.8 | 182 | human complement protein C8γ | 6,874 | 62 | 53 | 42 | 63 | 53 | 41 | ||
| average: | 240 | 211 | 180 | 237 | 209 | 179 | ||||||
For each query the optimal S among competing methods is shown in bold. Shaded scores indicate the query for which the optimal method changes when P is excluded.
a. Hydrogen atoms were added using the Reduce program [21], except for 3f1lA, for which hydrogens were already present in the pdb coordinate file.
b. PSICOV version 2.4 using the recommended -p and -d 0.03 options.
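The contact criterion used for this table (pairwise 3D distance ≤ 5 Å and at least 5 intervening residues) can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that distance means the minimum distance between any two atoms of the two residues, and that the separation rule means more than m positions apart in sequence. The function name and input layout are hypothetical.

```python
from itertools import combinations
from math import dist


def contacting_pairs(residues, r=5.0, m=5):
    """Return the set of residue index pairs (i, j), i < j, that count as contacts.

    residues: list (in sequence order) of lists of (x, y, z) atom coordinates in Å.
    r: maximum interatomic distance defining a contact (assumed: closest atom pair).
    m: pairs with j - i <= m are skipped (assumed reading of the separation rule).
    """
    pairs = set()
    for i, j in combinations(range(len(residues)), 2):
        if j - i <= m:
            continue  # too close in sequence to be informative
        closest = min(dist(a, b) for a in residues[i] for b in residues[j])
        if closest <= r:
            pairs.add((i, j))
    return pairs
```

The all-against-all atom comparison is quadratic and is kept only for clarity; a spatial grid or k-d tree would be the usual choice for full-size structures.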
Fig 1. Empirical values of Ŝ as a function of S yielded by 100,000 randomly shuffled DCA arrays (blue dots connected by lines), and by 100,000 DCA arrays derived from column-permuted MSAs, in which the order of the columns and of the residues within each column was randomly permuted (red triangles connected by lines).
Solid straight lines represent agreement of Ŝ with S, and the dashed curves represent an error range of two standard deviations. Results are shown for six of the domains listed in Table 2, designated by their corresponding pdb identifiers 3fhkF, 1ijxA, 4cmlA, 1olzA, 1k30A, 1wznA, ordered by increasing numbers of sequences in their corresponding MSAs. For 1wznA, the additional data points (faint red triangles connected by a dashed line) correspond to an MSA of 5,117 sequences randomly drawn from the original MSA.
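The column-permuted MSAs described in this caption can be illustrated with a short sketch. It assumes the MSA is held as a list of equal-length strings and that the residues of each column are shuffled independently before the column order is shuffled. The function name is hypothetical and the authors' implementation may differ.

```python
import random


def permute_msa(msa, rng):
    """Return a copy of `msa` with residues shuffled within each column and column order shuffled.

    msa: list of equal-length aligned sequences (strings).
    rng: a random.Random instance, passed in so that results are reproducible.
    """
    columns = [list(col) for col in zip(*msa)]
    for col in columns:
        rng.shuffle(col)  # breaks the pairing of residues across columns
    rng.shuffle(columns)  # breaks any link between column position and structure
    return ["".join(row) for row in zip(*columns)]
```

Each column keeps its residue composition, so single-column conservation is preserved while any covariation between columns is destroyed, which is what makes such an alignment usable as a null model.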
Wilcoxon Signed Rank 2-tailed tests for the 30 analyses in Table 2.
| Method 1 | Method 2 | Z-value | p-value | Z-value | p-value | Z-value | p-value | Z-value | p-value |
|---|---|---|---|---|---|---|---|---|---|
| CCM | EVC | 1.70 | 9×10⁻² | 1.82 | 7×10⁻² | 2.76 | 6×10⁻³ | 4.00 | 6×10⁻⁵ |
| CCM | GSF | 3.71 | 2×10⁻⁴ | 3.67 | 2×10⁻⁴ | 3.98 | 7×10⁻⁵ | 4.56 | 5×10⁻⁶ |
| CCM | PCV | 4.78 | 2×10⁻⁶ | 4.78 | 2×10⁻⁶ | 4.70 | 3×10⁻⁶ | 4.72 | 2×10⁻⁶ |
| EVC | GSF | 3.22 | 1×10⁻³ | 3.16 | 2×10⁻³ | 3.28 | 1×10⁻³ | 3.10 | 2×10⁻³ |
| EVC | PCV | 4.76 | 2×10⁻⁶ | 4.78 | 2×10⁻⁶ | 4.47 | 8×10⁻⁶ | 4.10 | 4×10⁻⁵ |
| GSF | PCV | 4.76 | 2×10⁻⁶ | 4.70 | 3×10⁻⁶ | 3.53 | 4×10⁻⁴ | 2.34 | 2×10⁻² |
a. Note that p-value estimates below ~10⁻⁴ are unreliable.
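For readers who wish to reproduce this kind of comparison, here is a minimal sketch of a two-tailed Wilcoxon signed-rank test using the normal approximation. It is not the authors' code: it assumes no continuity correction and no tie correction to the variance, and the function name is illustrative.

```python
from math import erfc, sqrt


def wilcoxon_signed_rank(xs, ys):
    """Return (Z, two-tailed p) for paired samples xs and ys via the normal approximation."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # zero differences are dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Assign the average rank to each run of tied absolute differences.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = erfc(abs(z) / sqrt(2))  # two-tailed p-value from the standard normal
    return z, p
```

With 30 paired scores in which one method wins every comparison, this approximation gives Z ≈ 4.78, the largest Z-value appearing in the table above.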
Fig 3. S, S and PPV scores as a function of various 3D structural coordinates for each of five protein domains.
Structures are ordered by the average of their scores over four methods: CCM (black lines), EVC (cyan lines), GSF (red lines) and PCV (green lines). Below the name for each domain are shown both the mean value of F and of the optimal cut points X for the S-scores. The constant cut point values of x = F × ℓ are shown between the PPV and S plots. The value of r (the maximum 3D distance defining contacting pairs) is 5 Å.
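The PPV at a constant cut point x = F × ℓ can be sketched as the fraction of the x highest-scoring pairs that are contacts. The names below are illustrative, and rounding x to the nearest integer is an assumption, since the caption does not state how a non-integer F × ℓ is handled.

```python
def ppv_at_constant_cut(ranked_pairs, contacts, msa_length, factor):
    """Positive predictive value among the top x = factor * msa_length ranked pairs.

    ranked_pairs: residue pairs ordered by decreasing DC score.
    contacts: set of pairs considered to be in contact (for example, within r Å).
    """
    x = round(factor * msa_length)  # assumed rounding rule
    if x < 1:
        raise ValueError("cut point must include at least one pair")
    top = ranked_pairs[:x]
    hits = sum(1 for pair in top if pair in contacts)
    return hits / len(top)
```

Unlike a score based on an optimized cut point, this measure depends on the arbitrary choice of F, which is presumably why the figure reports it alongside the S plots rather than in place of them.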
S for Ran GTPase in the transition state complex and in the corresponding ground state complex.
| Input MSA | Number of seqs | Transition state A | Transition state D | Transition state J | Transition state G | Transition state avg | Ground state A | Ground state D | Ground state J | Ground state G | Ground state avg | Δ avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GTPase superfamily | 274,681 | 53.9 | 53.7 | 54.9 | 53.9 | 54.1 | 42.7 | 42.2 | 45.1 | 41.5 | 42.9 | 11.2 |
| R4 family | 27,571 | 95.0 | 96.4 | 94.5 | 93.2 | 94.8 | 70.0 | 71.1 | 72.7 | 69.3 | 70.8 | 24.0 |
| Ran subfamily | 507 | 5.4 | 5.3 | 5.3 | 5.6 | 5.4 | 6.9 | 6.9 | 6.9 | 6.9 | 6.9 | -1.5 |
a. Ran-GDP-AlFx-RanBP1-RanGAP; pdb: 1k5g; 3.1 Å.
b. Ran-GppNHp-RanBP1-RanGAP; pdb: 1k5d; 2.7 Å.
c. The S-scores are based on CCMpred DC scores with L = 12,090, on m = 5, and on r = 2.6 Å.
d. The letters A, D, J and G correspond to the chain designations for each of four Ran subunits within the crystal structure unit cell.
e. The R4 family is composed of multiple subfamilies; this includes Rab, Rho, Ras and Ran GTPases.
Table 7. S as a measure of structural biological relevance for the bacterial DNA clamp loader complex based on a maximum distance of 2.6 Å.
| Subunit | # aligned seqs | L | S (unbound) | S (bound to ψ + “ATP” + DNA, complex 1) | ΔS | S (bound to ψ + “ATP” + DNA, complex 2) | ΔS | S (see footnote c) | S (see footnote c) |
|---|---|---|---|---|---|---|---|---|---|
| | 8,765 | 47,914 | 135.4 | 153.8 | 18 | 157.5 | 22 | n.a. | n.a. |
| δ- | 24,739 | 43,694 | 148.0 | 172.1 | 24 | 172.6 | 25 | 168.3 | 165.5 |
| δ-γ- | ″ | ″ | 146.5 | 162.7 | 16 | 167.9 | 21 | 156.0 | 158.4 |
| δ-γ-γ- | ″ | ″ | 154.3 | 157.5 | 3 | 169.0 | 15 | 155.1 | 164.9 |
| δ-γ-γ-γ- | 23,512 | 32,439 | 65.2 | 122.8 | 58 | 118.9 | 54 | n.a. | n.a. |
These analyses are based on CCMpred DC scores with m = 5.
a. Based on a 2.7 Å structure (pdb: 1jr3), for which the subunits δ, γ1, γ2, γ3 and δ’ are labeled as chains D, C, A, B and E, respectively.
b. Based on a 3.5 Å structure (pdb: 3gli) that contains τ instead of γ, which is a shorter variant of τ. The two S and two ΔS columns correspond to two clamp loader complexes within the crystal structure unit cell. For the first complex the subunits δ, γ1, γ2, γ3 and δ’ are labeled as chains A, B, C, D and E, respectively, and for the second complex as chains F, G, H, I and J, respectively.
c. S based on contacts both internal to each γ and with adjacent γ subunit(s); n.a. = not applicable.
S as a measure of biological relevance for the N-acetyltransferase Gna1 bound either to coenzyme A (CoA) (pdb: 4ag7; 1.55 Å) or to both CoA and N-acetyl-D-glucosamine-6-phosphate (GlcNAc-6P) (pdb: 4ag9; 1.76 Å).
| Structural state (distance-based scores) | A | A:B | Δ | B | B:A | Δ |
|---|---|---|---|---|---|---|
| Gna1 + CoA | 48.6 | 57.3 | 8.7 | 46.5 | 58.1 | 11.6 |
| Gna1 + CoA + GlcNAc-6P | 43.4 | 54.9 | 11.5 | 45.0 | 58.1 | 13.1 |
| Δ | -5.2 | 2.4 | | -1.5 | 0.0 | |
The S-scores are based on CCMpred DC scores, using the corresponding MSA from Table 2, on r = 2.6 Å and on m = 5.
a. The letters A and B correspond to the chain designations for the individual subunits; S in these columns is based solely on internal contacts.
b. S in these columns is based on internal and homodimeric contacts; the letter to the right of the colon represents the subunit from which trans-homodimer pairwise distances were obtained.
S as a measure of the overlap between pairs of BPPS pattern residues and the highest DC scoring pairs for the N-acetyltransferase Gna1 bound to both CoA and N-acetyl-D-glucosamine-6-phosphate (GlcNAc-6P) (pdb: 4ag9; 1.76 Å).
| characteristic pattern | pattern-based | pattern residue positions | |
|---|---|---|---|
| GNAT superfamily | 2.2 | 153,116,83,118,154,103,31,121,113,24,106,82, | |
| Gna1 family | 27.9 | 135,90,92,104,68,95,93,44,141,136,43,102,105, | |
| Δ | 25.7 |
a. Pattern residues were determined in [39, 42]; positions are ordered by decreasing BPPS score.
S as a measure of structural biological relevance for the bacterial DNA clamp loader complex based on a minimum distance of 3 Å and a maximum of 5 Å.
| Subunit | # aligned seqs | L (unbound) | S (unbound) | L (bound to ψ + “ATP” + DNA, complex 1) | S (bound, complex 1) | ΔS | L (bound to ψ + “ATP” + DNA, complex 2) | S (bound, complex 2) | ΔS |
|---|---|---|---|---|---|---|---|---|---|
| | 8,765 | 47,669 | 120.4 | 47,647 | 107.9 | -12.5 | 47,653 | 110.4 | -10.0 |
| δ- | 24,739 | 43,447 | 160.1 | 43,428 | 128.4 | -31.7 | 43,429 | 130.5 | -29.6 |
| δ-γ- | ″ | 43,440 | 151.5 | 43,432 | 130.4 | -21.1 | 43,433 | 136.6 | -14.9 |
| δ-γ-γ- | ″ | 43,438 | 127.6 | 43,436 | 135.5 | 7.8 | 43,442 | 145.4 | 17.8 |
| δ-γ-γ-γ- | 23,512 | 32,240 | 109.7 | 32,198 | 91.0 | -18.7 | 32,201 | 92.1 | -17.6 |
This analysis is based on CCMpred DC scores with m = 5, and focuses on putative hydrophobic interactions, as opposed to the focus in Table 7 on putative hydrogen bond interactions. Note that, for these analyses, L decreases slightly by the number of pairs less than 3 Å apart in each structure; each S is therefore based on a different value of L.
a. See footnote a in Table 7.
b. See footnote b in Table 7.