Sandeep Chakraborty1, Ravindra Venkatramani2, Basuthkar J Rao1, Bjarni Asgeirsson3, Abhaya M Dandekar4. 1. Department of Biological Sciences, Tata Institute of Fundamental Research, Mumbai, 400 005, India. 2. Department of Chemical Sciences, Tata Institute of Fundamental Research, Mumbai, 400 005, India. 3. Science Institute, Department of Biochemistry, University of Iceland, IS-107 Reykjavik, Iceland. 4. Plant Sciences Department, University of California,, Davis, CA, 95616, USA.
Abstract
The structure of a protein provides insight into its physiological interactions with other components of the cellular soup. Methods that predict putative structures from sequences typically yield multiple, closely-ranked possibilities. A critical component in the process is the model quality assessing program (MQAP), which selects the best candidate from this pool of structures. Here, we present a novel MQAP based on the physical properties of sidechain atoms. We propose a method for assessing the quality of protein structures based on the electrostatic potential difference (EPD) of Cβ atoms in consecutive residues. We demonstrate that the EPDs of Cβ atoms on consecutive residues provide unique signatures of the amino acid types. The EPD of Cβ atoms are learnt from a set of 1000 non-homologous protein structures with a resolution cuto of 1.6 Å obtained from the PISCES database. Based on the Boltzmann hypothesis that lower energy conformations are proportionately sampled more, and on Annsen's thermodynamic hypothesis that the native structure of a protein is the minimum free energy state, we hypothesize that the deviation of observed EPD values from the mean values obtained in the learning phase is minimized in the native structure. We achieved an average specificity of 0.91, 0.94 and 0.93 on hg_structal, 4state_reduced and ig_structal decoy sets, respectively, taken from the Decoys `R' Us database. The source code and manual is made available at https://github.com/sanchak/mqap and permanently available on 10.5281/zenodo.7134.
The structure of a protein provides insight into its physiological interactions with other components of the cellular soup. Methods that predict putative structures from sequences typically yield multiple, closely-ranked possibilities. A critical component in the process is the model quality assessing program (MQAP), which selects the best candidate from this pool of structures. Here, we present a novel MQAP based on the physical properties of sidechain atoms. We propose a method for assessing the quality of protein structures based on the electrostatic potential difference (EPD) of Cβ atoms in consecutive residues. We demonstrate that the EPDs of Cβ atoms on consecutive residues provide unique signatures of the amino acid types. The EPD of Cβ atoms are learnt from a set of 1000 non-homologous protein structures with a resolution cuto of 1.6 Å obtained from the PISCES database. Based on the Boltzmann hypothesis that lower energy conformations are proportionately sampled more, and on Annsen's thermodynamic hypothesis that the native structure of a protein is the minimum free energy state, we hypothesize that the deviation of observed EPD values from the mean values obtained in the learning phase is minimized in the native structure. We achieved an average specificity of 0.91, 0.94 and 0.93 on hg_structal, 4state_reduced and ig_structal decoy sets, respectively, taken from the Decoys `R' Us database. The source code and manual is made available at https://github.com/sanchak/mqap and permanently available on 10.5281/zenodo.7134.
The challenge of deriving the native structure of a protein from its sequence has intrigued researchers for decades
[1]. Methods that predict putative structures from sequences are based either on features from databases of known structures (template-based methods)
[2–
4] or use first principles of atomic interactions (
ab initio or
de novo methods)
[5–
7]. Typically, these methods yield multiple, closely-ranked possibilities. Model quality assessment programs (MQAP) that validate accuracy of these predicted structures are used to select the best candidate from the set of predicted structures.MQAPs can be classified as energy, consensus or knowledge based. Two major sources of errors in energy based methods used for refining or discriminating protein structures are inaccuracies in the force field due to the inherent approximations in equations that model multi-atomic configurations, and inadequate sampling of the conformational space
[8–
12]. Consensus based methods are based on the principle that structural features that are frequently observed in a population of structures are more likely to be present in the native structure
[13–
16]. These clustering methods outperform other MQAP methods
[14] and are “very useful for structural meta-predictors
[17]”. However, they are prone to be computationally intensive due structure-to-structure comparison of all models
[16], and are of limited use when the number of possible structures is small
[18]. Knowledge based methods proceed by deriving an empirical potential (also known as statistical potential) from the frequency of residue contacts in the known structures of native proteins
[19,
20]. For a system in thermodynamic equilibrium, statistical physics hypothesizes that the accessible states are populated with a frequency which depends on the free energy of the state and is given by the Boltzmann distribution. The Boltzmann hypothesis states that if the database of known native protein structures is assumed to be a statistical system in thermodynamic equilibrium, specific structural features would be populated based on the free energy of the protein conformational state. Applying a converse logic, Sippl reasoned that the frequencies of occurrence of structural features such as interatomic distances in the database of known protein structures could be used to assign a free energy (potential of mean force) for a given protein conformation
[21,
22]. Furthermore, this statistical potential can be used to discriminate the native structure
[23–
27]. The proper characterization of the reference state is a critical aspect in applying statistical potentials
[23]. In spite of their popularity, the application of such empirical energy functions to predict and assess protein structures are vigorously debated
[28,
29]. Many MQAP programs perform better when multiple statistical metrics are combined
[30–
33]. The paramount importance of obtaining high quality protein structures from sequences using
in silico methods can be estimated by the effort invested by researchers every two years
[34] to evaluate both structure prediction tools
[35] and MQAPs
[17,
34,
36].Here, we propose a novel statistical potential to assess the quality of protein structures based on the electrostatic potential difference (EPD) of C
β atoms in consecutive residues - EPD profile of sidechain atoms used in assessment of protein structures (ESCAPIST). Previously, we have established that the EPD is conserved in cognate pairs of active site residues in proteins with the same function
[37–
40]. The ability of finite difference methods to quickly obtain consistent electrostatic properties from peptide structures provides an invaluable tool for investigating other innate properties of protein structures
[41]. We plot the EPD profiles for different atom types (C
α atoms, C
β atoms and the C-N bond) in consecutive residues from a set of non-homologous protein structures obtained from the PISCES database (
http://dunbrack.fccc.edu/PISCES.php)
[42]. We proceed to show that the EPD between C
β atoms in consecutive residues can be used to generate a scoring function that assesses the quality of protein structures. This EPD scoring function is then applied to standard decoy sets from the Decoys ‘R’ Us database (
http://dd.compbio.washington.edu) to establish the validity of our method
[43].
Results
Electrostatic potential difference (EPD) based discrimination
To extract feature values we chose a set of 1000 proteins from the PISCES database with percentage identity cutoff of 20%, resolution cutoff of 1.6 Å and a R-fn class="Chemical">actor cutoff of 0.25 (SI
Table 1).
Table 1.
Electrostatic potential differences (EPD) for consecutive residue pairs for C
α atoms for residue pairs that include proline.
While these pairs have for a low standard deviation (SD) like all other pairs, the absolute value of their mean is different (higher) than any pair that does not include a proline. This also highlights the unique nature of proline in protein structures.
Pair
Mean EPD
SD
Number of
samples
AP
-167.3
28.6
328
CP
-153
30.3
45
DP
-184.5
29.5
290
EP
-176.6
27.3
346
FP
-160.4
25.3
173
GP
-165.3
29.2
339
HP
-162.7
34.6
92
IP
-161.9
27.2
175
KP
-156.6
29.6
203
LP
-165.2
28.3
323
MP
-161.3
29.5
70
NP
-159.6
27.6
168
PQ
168.5
26.1
131
PR
156.2
31.5
184
PS
172.3
25.9
269
PT
170.8
27.6
218
PV
164.4
30.4
299
PW
158.5
29.7
70
PY
155.5
29
141
Electrostatic potential differences (EPD) for consecutive residue pairs for C
α atoms for residue pairs that include proline.
While these pairs have for a low standard deviation (SD) like all other pairs, the absolute value of their mean is different (higher) than any pair that does not include a proline. This also highlign class="Chemical">hts the unique nature of proline in protein structures.
Invariance of the EPD in the C-N peptide bond and between C
α atoms of consecutive residues
Adaptive Poisson-Boltzmann Solve (APBS) writes out the electrostatic potential in dimensionless units of kT/e where k is Boltzmann’s constant, T is the temperature in K and e is the charge of an electron. The units of EPD are same as that of the electrostatic potential. The EPD of the C-N peptide bond has a Gaussian distribution with mean = 420 EPD units and SD = 55 EPD units (
Figure 1). In the probability distribution for four pairs of amino acids the mean of all pairs of amino acids are the same (
Figure 1a).
Figure 1b shows the scatter plot for the mean and standard deviation (SD). Thus, the amino acids are indistinguishable using the profile of the EPD of the C-N peptide bond across all protein structures since they have identical mean values and a large variance (SD=
~50).
Figure 1.
Electrostatic potential differences (PD) for the C-N peptide bond.
AA: Alanine/Alanine, AC: Alanine/Cysteine, HS: Histidine/Serine and DF: Aspartic-acid/Phenylalanine. (
a) Probability distribution for four pairs of amino acids. (
b) Scatter plot for all pairs of amino acids. It can be seen that the mean and SD for all pairs of amino acids are the same. Further, the variance is large (SD=~50), indicating that this feature is not tightly constrained in peptide structures.
Electrostatic potential differences (PD) for the C-N peptide bond.
AA: Alanine/n class="Chemical">Alanine, AC: Alanine/Cysteine, HS: Histidine/Serine and DF: Aspartic-acid/Phenylalanine. (
a) Probability distribution for four pairs of amino acids. (
b) Scatter plot for all pairs of amino acids. It can be seen that the mean and SD for all pairs of amino acids are the same. Further, the variance is large (SD=~50), indicating that this feature is not tightly constrained in peptide structures.
The probability distribution for four pairs of amino acids for the EPD between the C
α atoms of consecutive residues (
Figure 2a) have means that are slightly more varied than those for the C-N bond (
Figure 1a). In the scatter plot for the mean and SD of all pairs (
Figure 2b) the outliers are pairs that include proline, which have a higher mean, although the magnitude of SD is the same (
Table 1).
Figure 2.
Electrostatic potential differences (PD) for consecutive residue pairs for C
α atoms.
A: Alanine/Alanine, AC: Alanine/Cysteine, HS: Histidine/Serine, DF: Aspartic-acid/Phenylalanine. (
a) Probability distribution for four pairs of amino acids. (
b) Scatter plot for all pairs of amino acids. It is seen that pairs of amino acids which include proline have a higher mean, although the magnitude of SD is the same.
Electrostatic potential differences (PD) for consecutive residue pairs for C
α atoms.
A: Alanine/n class="Chemical">Alanine, AC: Alanine/Cysteine, HS: Histidine/Serine, DF: Aspartic-acid/Phenylalanine. (
a) Probability distribution for four pairs of amino acids. (
b) Scatter plot for all pairs of amino acids. It is seen that pairs of amino acids which include proline have a higher mean, although the magnitude of SD is the same.
Distinctive EPD between C
β atoms of consecutive residues for certain amino acid pairs
In contrast to the results described above, the EPD between the C
β atoms in consecutive residues in the peptide structure can be used to discriminate different amino acid pairs in the protein structure. The mean EPD of all amino acid pairs are much more varied (
Figure 3a). These pairs do not include glycine, which lacks a sidechain. In the scatter plot for the mean and SD, the outliers are pairs that include cysteine (
Figure 3b), which have a much higher SD (=
~90) as compared to other pairs (SD=
~35) (
Table 2), and thus cannot be used for discriminatory purposes.
Figure 3.
Electrostatic potential differences (
PD) for consecutive residue pairs for C
β atoms.
AA: Alanine/Alanine, AD: Alanine/Aspartic-acid, AE: Alanine/Glutamic-acid, DF: Aspartic-acid/Phenylalanine, DY - Aspartic-acid/Tyrosine, HT: Histidine/Threonine, HS: Histidine/Serine. (
a) Probability distribution for seven pairs of amino acids. (
b) Scatter plot for all pairs of amino acids. The pairs which include cysteine have a high standard deviation. It is seen that the mean is much more varied than the electrostatic potential difference (EPD) for C
α and the C-N peptide bond.
Table 2.
Electrostatic potential differences (EPD) for consecutive residue pairs for C
β atoms for residue pairs that has one cysteine.
These pairs have a random values for the mean and a high standard deviation (SD), with the exception of the pair ‘CC’ (not the disulfide bond) which has a low mean value and SD. Consequently, these values can not discriminate between pairs of amino acids.
Pair
Mean EPD
SD
Number of
samples
AC
-53.7
86.9
178
CC
-7.1
30.4
36
CD
103.8
92.7
154
CE
96.8
94.7
121
CF
-21.4
84.2
85
CH
-12
93.3
97
CI
32.8
80.9
136
CK
50.2
93.9
131
CL
42.8
90.6
224
CM
61.9
100
39
CN
63.7
96.1
115
CP
24.9
88
45
CQ
66.7
92.1
95
CR
35.4
95.1
144
CS
106.3
98.1
184
CT
109.9
97.5
173
CV
54.9
90.7
183
CW
-0.2
85.9
43
CY
8.5
91.5
96
Electrostatic potential differences (
PD) for consecutive residue pairs for C
β atoms.
AA: Alanine/n class="Chemical">Alanine, AD: Alanine/Aspartic-acid, AE: Alanine/Glutamic-acid, DF: Aspartic-acid/Phenylalanine, DY - Aspartic-acid/Tyrosine, HT: Histidine/Threonine, HS: Histidine/Serine. (
a) Probability distribution for seven pairs of amino acids. (
b) Scatter plot for all pairs of amino acids. The pairs which include cysteine have a high standard deviation. It is seen that the mean is much more varied than the electrostatic potential difference (EPD) for C
α and the C-N peptide bond.
Electrostatic potential differences (EPD) for consecutive residue pairs for C
β atoms for residue pairs that has one cysteine.
These pairs have a random values for the mean and a high standard deviation (SD), with the exception of the pair ‘CC’ (not the disulfide bond) which has a low mean value and SD. Consequently, these values can not discriminate between pairs of amino acids.These values are used as a discriminator when choosing the native structure from a set of possible candidates (
Table 3). To establish the non-triviality of these values, we also show that the variance of the EPD between these pairs increases with increasing sequence distance. Thus, the EPD between the pairs ‘DF’ and ‘HS’ has lesser correlation as the sequence distance between them increases (sample size for each sequence distance is
> 30) (
Figure 4). The SD for distance 1 (i.e. consecutive residues) is 29.8 EPD units and 31.8 EPD units for ‘DF’ and ‘HS’, respectively - and rises to around 60 EPD units with increasing sequence distance.
Table 3.
Electrostatic potential differences (EPD) in a sample of consecutive residue pairs of C
β atoms.
These pairs are used for discriminating predicted structures in order to obtain the native structure. The complete set is available at
https://github.com/sanchak/mqap.
Pair
Mean EPD
SD
Number of
samples
DF
-108.9
29.5
481
DY
-107.4
30.7
442
DH
-105.2
33.5
242
DW
-104.1
27.7
209
EH
-98.5
28.5
200
EY
-96.5
28
378
EW
-94.2
29.8
184
SY
-93.5
27.5
403
EF
-93.1
27.6
439
TY
-93
28.6
384
TW
-90.8
28.7
144
SW
-89.2
27.7
169
FT
89.2
26.8
436
FS
92.3
28.4
453
HS
93.7
31.8
235
HT
95.1
31.5
235
Figure 4.
Standard deviation (SD) of the electrostatic potential difference between C
β atoms increases with increasing sequence distance for amino acid pairs.
Each sequence distance has at least 30 sample points. DF: Aspartic-acid/Phenylalanine, HS: Histidine/Serine. As expected, there is lesser correlation in the EPD values between the shown amino acid pairs ‘DF’ and ‘HS’ as the sequence distance between the residues increases. The SD for distance 1 (i.e. consecutive residues) is 29.8 EPD units and 31.8 EPD units for ‘DF’ and ‘HS’, respectively - and rises to around 60 EPD units with increasing sequence distance.
Electrostatic potential differences (EPD) in a sample of consecutive residue pairs of C
β atoms.
These pairs are used for discriminating predicted structures in order to obtain the native structure. The complete set is available at
https://github.com/sanchak/mqap.
Standard deviation (SD) of the electrostatic potential difference between C
β atoms increases with increasing sequence distance for amino acid pairs.
Each sequence distance has at least 30 sample points. DF: Aspartic-acid/Phenylalanine, HS: Histidine/Serine. As expected, there is lesser correlation in the EPD values between the shown amino acid pairs ‘DF’ and ‘HS’ as the sequence distance between the residues increases. The SD for distance 1 (i.e. consecutive residues) is 29.8 EPD units and 31.8 EPD units for ‘DF’ and ‘HS’, respectively - and rises to around 60 EPD units with increasing sequence distance.
Validating using decoy sets
We obtained the score (
PDScore) of any given protein structure by comparing the electrostatics of the C
β atoms based on
Table 3. To benchmark model quality assessment programs, we used decoy sets from the Decoys ‘R’ Us database
[43]. We detail our results from some of these decoy sets. Each set has several structures that are supposed to be ranked worse than the native structure.The misfold decoy set has
~20 protein structures, each of which has a correct and an incorrect structure specified (three structures have two incorrect structures: we randomly chose the first)
[44]. The PDScore of these proteins were computed (
Table 4). Barring three structures (PDBids: 1CBH, 1FDX and 2SSI), the PDScore of the incorrect structure is higher than that of the correct structures.
Table 4.
Misfold decoy set.
This decoy set has ~20 protein structures - each of which has a correct and an incorrect structure specified. The PDBs are sorted based on the number of residues in the structure (NRes). Three of the structures (1CBH, 1FDX and 2SSI) have a lower PDScore for the incorrect structure.
PDB
NRes
Correct
PDScore
Incorrect
PDScore
Specificity
1CBH
36
18.7
12.6
0
1PPT
36
18
33.5
1
1FDX
54
33
30.9
0
5RXN
54
25.1
35
1
1SN3
65
20.7
30.3
1
2CI2
65
19.9
35.2
1
2CRO
65
26.7
43.4
1
1HIP
85
19.1
36.8
1
2B5C
85
22.1
34.3
1
2CDV
107
17.4
40.9
1
2SSI
107
22.6
20
0
1BP2
123
21
44.1
1
2PAZ
123
19.3
27.3
1
1P2P
124
28.6
29
1
1RN3
124
20.7
28.8
1
1LH1
153
18
26.2
1
2I1B
153
19
27.6
1
1REI
212
16.5
21.8
1
5PAD
212
18.8
32
1
1RHD
293
23.2
31.9
1
2CYP
293
21.2
35.8
1
2TMN
316
27.2
32.2
1
2TS1
317
21.3
28.2
1
Misfold decoy set.
This decoy set has ~20 protein structures - each of which has a correct and an incorrect structure specified. The PDBs are sorted based on the number of residues in the structure (NRes). Three of the structures (1CBH, 1FDX and 2SSI) have a lower PDScore for the incorrect structure.The hg_structal set has about
~30 proteins. Each protein has 30 structures (including the native structure).
Table 5A shows specificity obtained for structures in this decoy set. The average specificity obtained for this decoy set is 0.91 (
Table 5A). The decoy set 4state_reduced has
~600 structures for each of the seven proteins. We obtain an average specificity of 0.94 for this decoy set (
Table 5B). Similarly, for the ig_structal decoy set we obtain a specificity of 0.93 (
Supplementary Table 1).
Table 5.
hg_structal and 4state_reduced decoy sets.
The PDBs are sorted based on specificity. (A) The hg_structal decoy set has ~30 protein structures - each of which has 30 structures. The average specificity obtained for the set is 0.91. (B) The 4state_reduced decoy set has 7 protein structures - each of which has
~600 structures. The average specificity obtained for the set is 0.94. (C) The fisa set has 4 protein structures - each of which has 500 structures. The electrostatic discriminator has low specificities in this case. We have previously demnostrated that this decoy set can be discriminated by a distance based criterion. It consists of physically nonviable structures, thus rendering an electrostatic analysis meaningless. NRes = number of residues, NStructures = number of structures in the decoy set.
PDB
NRes
NStructures
Specificity
(A)
hg_structal
2PGHA
141
30
0.2
1MBS
153
30
0.5
2DHBA
141
30
0.6
1HDAB
145
30
0.9
1MYT
146
30
0.9
1HLM
158
30
0.9
1HSY
153
30
0.9
1MBA
146
30
0.9
1MYGA
153
30
0.9
1MYJA
153
30
0.9
1ASH
147
30
1
1BABB
146
30
1
1COLA
197
30
1
1CPCA
162
30
1
1ECD
136
30
1
1EMY
153
30
1
1FLP
142
30
1
1GDM
153
30
1
1HBG
147
30
1
1HBHA
142
30
1
1HBHB
146
30
1
1HDAA
141
30
1
1HLB
157
30
1
1ITHA
141
30
1
1LHT
153
30
1
2DHBB
146
30
1
2LHB
149
30
1
2PGHB
146
30
1
4SDHA
145
30
1
(B)
4state_reduced
2CRO
65
675
0.8
3ICB
75
654
0.9
4RXN
54
677
0.9
4PTI
118
688
1
1CTF
131
631
1
1R69
97
676
1
1SN3
65
661
1
(C)
fisa
4ICB
76
501
0
1FC2
44
501
0.4
1HDDC
57
501
0.1
2CRO
65
501
0.7
Table S1.
Proteins from the PISCES database used for learning values.
Set of 1000 proteins from the PISCES database with percentage identity cutoff of 20%, resolution cutoff of 1.6 Å, R-factor cutoff of 0.25, and a RDCC cutoff of 0.012 Å used to learn feature values.
The PDBs are sorted based on specificity. (A) The hg_structal decoy set has ~30 protein structures - each of which has 30 structures. The average specificity obtained for the set is 0.91. (B) The 4state_reduced decoy set has 7 protein structures - each of which has
~600 structures. The average specificity obtained for the set is 0.94. (C) The fisa set has 4 protein structures - each of which has 500 structures. The electrostatic discriminator has low specificities in this case. We have previously demnostrated that this decoy set can be discriminated by a distance based criterion. It consists of physically nonviable structures, thus rendering an electrostatic analysis meaningless. NRes = number of residues, NStructures = number of structures in the decoy set.
Discussion
The functional characterization of a protein from its sequence using
in silico methods based on the ‘sequence to structure to function’ paradigm is contingent upon the availability of its 3D-structure. The rapidly developing field of next generation sequencing has exacerbated the bottleneck of obtaining structural data using crystallization techniques
[45]. This ever-widening gap has been filled by methods that predict structures from sequences
[46], based either on features from databases of known structures
[2–
4] or from first principles of atomic interactions
[5,
6].The various sources of error in protein structure prediction have been previously discussed in detail
[47]. An incorrect model of a protein structure will result in an inaccurate analysis of its properties
[48]. For example, continuum models
[49] that compute potential differences and pK
a values from charge interactions in proteins
[50] are sensitive to the spatial arrangement of the atoms in the structure. It must be pointed out that other detailed methods using quantum mechanical/molecular mechanical (QM/MM) techniques and and doing extensive conformational sampling have been able to determine the side chain pK
a values with high accuracy
[51]. Accurate structural information is indispensable for
in silico methods that extract the electrostatic profile of atoms in the peptide structure
[41,
52], and for other methods widely used in pharmacology for drug discovery
[53]. Model quality assessment programs (MQAP) that validate the accuracy of predicted structures are thus a critical aspect in the process of modeling a protein structure from its sequence. MQAPs can be classified as energy
[8–
12], consensus
[13–
16] or knowledge based (statistical potential)
[21–
27]. The state of the art methods for predicting structures
[35] and MQAPs
[17,
34,
36] are evaluated by researchers every two years.Previously, we hypothesized and demonstrated that the electrostatic potential difference (EPD) in cognate pairs in the active site are conserved in proteins with the same functionality
[37,
40,
54], even when evolution has converged to the same catalytic from completely different sequences
[55]. This similarity is observed in structures solved independently over many years and establishes the reliability of the APBS and PDB2PQR implementations
[41,
56]. We focused on unraveling other electrostatic features that are innate to protein structures. Here, we first demonstrate that the EPD between the C-N peptide bond and the C
α atoms of consecutive residues are independent of the amino acid type. This is expected, since the distance between these atoms are almost invariant across all structures. The EPD of the C-N bond has a high variance, implying that the backbone accommodates relatively large variations while seeking energetically minimized structures.The true source of the chemical and structural diversity in protein structures is the side chain atoms. Every amino acid, except glycine, has a C
β atom that hosts a unique moiety of atoms. Although the reactive groups are different for amino acids, we show that this difference is encapsulated in the backbone C
β atoms. We first show that different pairs of amino acids have significantly different mean EPD values in side chain C
β atoms (
Figure 3), unlike the EPD of the C-N peptide bond (
Figure 1) or the EPD between consecutive C
α atoms (
Figure 2). Further, the variance is much less than in the EPD of the C-N bond. These facts suggested that the EPD between C
β atoms of consecutive residues in the peptide structure mightact as a discriminator. Our hypothesis is based on the insightful Boltzmann law that lower energy conformations are disproportionately sampled, on the thermodynamic hypothesis
[57] that the native structure has minimal energy, and the hypothesis that statistical derived features in known protein structures have a Gaussian distribution
[21]. We apply our discriminator to standard decoy sets from the Decoys ‘R’ Us database to establish the validity of the method
[43].Our work also highlights the unique properties of proline in the protein structure
[58]. This is evident from the different magnitude of EPD in consecutive C
α atoms involving proline (
Table 1). Another noteworthy aspect is the large variation in EPD in consecutive C
β atoms involving cysteine (
Table 2), demonstrating the unique role cysteine plays in providing flexibility to protein structures, a critical element in the evolution of complex organisms
[59]. The discrimination of C
β atoms also provides a uniform basis for methods that require a single-atom representation of a residue. Such methods depend on a correct parameterization of the reactive atoms
[37], a task further complicated by amino acids such as histidine which has two reactive atoms. For example, the EPD between the negatively charged E and D with respect to the aromatic phenylalanine is -108 and -93 EPD units, in spite of the difference in their reactive atom. Similarly, the EPD between alanine and the three aromatic amino acids (F, W and Y) are -67, -66 and -63 EPD units respectively.We achieved an average specificity of 0.91, 0.94 and 0.93 on hg_structal, 4state_reduced and ig_structal decoy sets, respectively, taken from the Decoys ‘R’ Us database. We have previously demonstrated that the fisa decoy set can be discriminated by a distance based discriminator
[60]. ESCAPIST does not discriminate the native structure in this decoy set (
Table 5C). The physical implication of ESCAPIST results on the fisa decoy set, which has significant RMSD for backbone C
α atoms, needs further elaboration. The input to a finite difference Poisson-Boltzmann (FDPB) analysis is a charge distribution that might be unfeasible due to energy functions other than electrostatics. For example, van der Waals force or the elastic bond length force components might prevent two atoms from being in close proximity. However, if such a physically impossible configuration were presented to a FDPB-based analysis tool, such as APBS
[41], it would still generate an electrostatic potential landscape. Inferences based on this potential landscape would be incorrect due to its physical non-viability. Thus, before invoking the EPD constraints specified here, it is imperative that other spatial constraints that are rarely violated in structures are checked. Possibly for this same reason, MQAPs that combine many features in their scoring functions are superior. Moreover, it should be kept in mind that decoy sets, like most benchmarking sets, are prone to biases
[61] and possible errors
[31]. In fact, the fisa decoy set has been shown to violate the van der Waals term
[61]. To summarize, we present a novel discriminating feature in protein structures based on the electrostatic properties of the side chain atoms. We validated this discrimination in several decoy sets taken from the Decoys ‘R’ Us database, and achieved high specificities in most decoy sets.
Methods
Our proposed method has two phases. In the learning phase, we process multiple structures to extract the feature values - mean values of electrostatic potential difference (EPD) for each amino acid pair. These feature values are applied on query proteins to obtain a score (PDscore) that indicates the deviation of the feature values in the given structure from the ‘ideal’ values. Thus, a lower PDscore indicates a better candidate.
Learning phase
Algorithm 1 shows the procedure LearnFeatures() that extracts the feature values from a set of proteins Φ
(
Equation 1). We ignore the first x=
IgnoreNTerm and last y=
IgnoreCTerm pairs of residues in the protein structure to exclude the terminals. For every consecutive pair of residues in the structure, we calculate the EPD (see below for method) between two specified atoms (
atomP and
atomQ). Both
atomP and
atomQ are set to C
β to obtain EPD values for C
β atoms, while we set
atomP to ‘C’ and
atomQ to ‘N’ in order to obtain the C-N peptide bond EPD values. The mean (Mean learnt value - MLV) and standard deviation (SD) are computed for each amino acid type pair (AAType) in protein (
Equation 2), and the mean computed for the globals set of proteins (MLV(
AAType
)) for each pair of amino acid types (
Equation 3). Pairs whose EPD have a SD greater than a threshold value (
sdThresh, 50 by default) are ignored. Means for all significant pairs (
) are noted to a file, which is the input to the quality assessment phase. The EPD between a pair of amino acid is order-independent - for example, the EPD statistics for the pair ‘AC’ (alanine-cysteine) includes the EPD of both ‘AC’ and ‘CA’ (with the sign reversed).
Quality assessment phase
Algorithm 2 shows the function AssessEPDQuality() that generates the
PDscore for a given protein from the template file generated by the learning phase. The set of proteins
consists of the native structure
P
1 and N-1 decoys structures (
Equation 4). Once again, barring x=
IgnoreNTerm and y=
IgnoreCTerm number of residues from the N and C terminals, the pairwise EPD for consecutive residues are computed. The absolute value of the difference of these values from their corresponding means, if they exist, in the template file is added to generate the absolute score (
Equation 5). This score is normalized with the number of residues that have been compared to obtain the final
PDscore. In summary, the
PDscore encapsulates the average distance of the EPD for the given atom pairs (it may be C
β, C
α or the C-N bond) of consecutive residues from their mean values. We hypothesize that in the native or a near native structure, the
PDscore will be minimized for the EPD of C
β atoms of consecutive residues, i.e. given a set of proteins
Φ
consists of the native structure
P
1 and N-1 decoys structures,
P
1 will have the minimum PDscore (
Equation 6).The top level procedure ESCAPIST() is shown in
Algorithm 3. It invokes the function LearnFeatures() once, and applies the learnt values to assess the quality of structures based on the feature values obtained.
Computing electrostatic potentials
Adaptive Poisson-Boltzmann Solver
[41] (APBS) and the PDB2PQR package
[56] package was used to calculate the potential difference between the reactive atoms of the corresponding proteins. The APBS parameters are set as follows - solute dielectric: 2, solvent dielectric: 78, solvent probe radius: 1.4 Å, Temperature: 298 K and 0 ionic strength. APBS writes out the electrostatic potential in dimensionless units of kT/e where k is Boltzmann’s constant, T is the temperature in K and e is the charge of an electron.Input:
ϕ
= {
P
1 ···
P
} :
M Proteins in the learning setInput:
IgnoreNTerm: Ignore this number of residues in the N TerminalInput:
IgnoreCTerm: Ignore this number of residues in the C TerminalInput:
atomP: Atom type in first residueInput:
atomQ: Atom type in second residueInput:
sdThresh: Threshfold for standard deviation of the EPDOutput:
= {
meanPDC
1 ···
meann class="Disease">PDC
}: Mean values of EPD between specified atoms
of successive residues, there being K such significant pairsbegin/*K pairs of amino acid type (sorted: n class="Chemical">AC and CA are equivalent)*/
/*Each set is initialize to be the null set*/= {
1
} :foreach
P
doN = NumberOfResidues(
P
);for
p ← IgnoreNTerm
to (
N − IgnoreCTerm)
doq =
p + 1 ;/* Amino acid pairs are order independent */ResiduePairTypeString =
ResidueTypeString(
p) +
ResidueTypeString(
q);ResiduePairTypeStringSorted =
Sort(
ResiduePairTypeString;/* Reverse sign of potential difference accordingly */multfactor = 1 ;if
(ResiduePairTypeStringSorted != ResiduePairTypeString)
thenmultfactor = -1 ;endPD = PotentialDifference(
p, q, atomP, atomQ) *
multfn class="Chemical">actor ;
/* Let the amino acid pair be the kth in the set
*/InsertInSet(
PD,
);endend/* Compute Mean and SD of each set - ignore pairs with SD greater than sdThresh*/= ∅;foreach
do(
Mean
) = MeanAndSD(
);if
(SD
thenAdd(
Mean
);endendreturn (
);endInput:
P
1 : Protein under considerationInput:
IgnoreNTerm: Ignore this number of residues in the N TerminalInput:
IgnoreCTerm: Ignore this number of residues in the C TerminalInput:
atomP: Atom type in first residueInput:
atomQ: Atom type in second residueInput:
ϕ
= {
meanPDC
1 ···
meann class="Disease">PDC
}: Mean values of EPD between specified atoms of
successive residuesOutput:
PDscore: Score indicating the normalized distance of the observed values from the (mean)learnt values from native structuresbeginPDscore = 0 ;
NumberCompared = 0 ;
N = NumberOfResidues(
P
1);for
p ← IgnoreNTerm
to (
N − IgnoreCT erm)
doq =
p + 1 ;/* Amino acid pairs are order independent */ResiduePairTypeString =
ResidueTypeString(
p) +
ResidueTypeString(
q);ResiduePairTypeStringSorted =
Sort(
ResiduePairTypeString;/* Reverse sign of potential difference accordingly */multfactor = 1;if
(ResiduePairTypeStringSorted != ResiduePairTypeString)
thenmultfactor = -1 ;end/* Let the amino acid pair be the kth in the set
*/PD = PotentialDifference(
p, q, atomP, atomQ) *
multfn class="Chemical">actor ;
if
(∃
meanPDC
thenNumberCompared =
NumberCompared + 1 ;diff = absolute(
PD − meann class="Disease">PDC
);
PDscore =
n class="Disease">PDscore +
diff;
endend/* Normalize */PDscore =
n class="Disease">PDscore/NumberCompared;
return (
PDscore);endInput:
: Learning setInput:
P
1: Protein to be scoredInput:
IgnoreNTerm: Ignore this number of residues in the N TerminalInput:
IgnoreCTerm: Ignore this number of residues in the C TerminalInput:
atomP: Atom type in first residueInput:
atomQ: Atom type in second residueInput:
sdThresh: Threshfold for standard deviation of the EPDOutput:
PDscore: Score indicating the normalized distance of the observed values from the (mean)learnt values from native structuresbegin/* This is invoked once*/
=LearnFeatures(
);;PDscore =
AssessEn class="Disease">PDQuality(
P
1
, IgnoreNTerm, IgnoreCTerm, atomP, atomQ,
);
return (
PDscore);endThe quality of protein structure models is assessed by the geometry of adjacent C beta atoms. The approach successfully distinguishes properly folded proteins in most cases. It adds a new way to assess protein models and could be included in the protein model assessment toolbox.Just a few comments:a) Table 4 lists three of 20 structures where the incorrect one has a lower score. A few comments about structural features of those examples that lead unexpected scores would be useful.b) It might be preferable to note in the title that the method is applied to computational models of protein structure as a way to distinguish the manuscript from those that deal with quality assessment of experimentally determined structures.c) I did not test the available source code.I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.We would like to thank you for taking the time and reviewing our paper, and deeply appreciate your positive comments. We do not have the expertise to comment on the structural characteristics that lead to the 3 negative results in Table 4 - these require insights into crystal structures that we do not possess at the present time. Also, we believe that this method could be applied to any structure - computational or experimentally obtained - and thus are leaving the title unchanged.This is an interesting idea that uses the physical (electrostatic) properties of amino acid side chains in order to predict secondary structure from sequence, and thus assess (and rank) the quality of protein structures. The manuscript is well-written, and the authors provide comprehensive information to allow others to follow the study-design and methodology. The focus on electrostatics is important as this has been repeatedly shown by rigorous theoretical studies (work by Warshel and others) to be the primary driving force in determining protein function and most likely folding stability as well (whether directly or indirectly). As the specific methodology the authors use is slightly further from my area of expertise I cannot directly comment on this, however, importantly the source code has been made Open Access and freely available through Github.My only comment is on the second paragraph of the Discussion, which comments on the pitfalls when using an incorrect model of a protein structure, particularly when trying to calculate pK
as using continuum models which are dependent on the initial conformation. While the authors highlight a very important challenge, I would like to point out that it can to some extent be resolved by extensive conformational sampling (particularly the pK
a problem, as the pK
a is an average property over all possible protein conformations), which we discussed at length in a review a few years ago (
Kamerlin
). Otherwise this is a nice manuscript and a valuable contribution to the literature.I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Table S2.
ig_structal decoy set.
The PDBs are sorted based on specificity: The ig_structal decoy set has ~61 protein structures - each of which has 61 structures. The average specificity obtained for the set is 0.97. NRes =: number of residues, NStructures =: number of structures in the decoy set.