Gang Xu1, Tianqi Ma2,3, Tianwu Zang2,3, Qinghua Wang4, Jianpeng Ma1,2,3,4.
Abstract
We report a C-atom-based scoring function, named OPUS-CSF, for ranking protein structural models. Rather than using the traditional Boltzmann formula, we built a scoring function (CSF score) based on the native distributions (derived from the entire PDB) of coordinate components of mainchain C (carbonyl) atoms on selected residues of peptide segments of 5, 7, 9, and 11 residues in length. When tested on decoy recognition, OPUS-CSF recognized at most 257 native structures out of 278 targets in 11 commonly used decoy sets, significantly outperforming other popular all-atom empirical potentials. Its average correlation coefficient with TM-score was also comparable with those of other potentials. OPUS-CSF is a highly coarse-grained scoring function: it requires only partial mainchain information as input and is very fast. Thus, it is suitable for applications at the early stage of structure building.
Keywords: coarse-graining; decoy recognition; protein folding; protein structure modeling; scoring function
Year: 2017 PMID: 29047165 PMCID: PMC5734313 DOI: 10.1002/pro.3327
Source DB: PubMed Journal: Protein Sci ISSN: 0961-8368 Impact factor: 6.725
The results of OPUS‐CSF5 (5‐residue segment) and OPUS‐CSF (combined segment length) on 11 decoy sets compared with different potentials
| Decoy sets | Total # of targets | DFIRE | RWplus | dDFIRE | OPUS‐PSP | GOAP | OPUS‐CSF5 | OPUS‐CSF |
|---|---|---|---|---|---|---|---|---|
| 4state_reduced | 7 | 6 (–3.48) | 6 (3.51) | 7 (–4.15) | | 7 (–4.38) | 7 (–3.38) | 7 (–3.31) |
| fisa | 4 | | 3 (–4.79) | 3 (–3.80) | 3 (–4.24) | 3 (–3.97) | 2 (–2.31) | 2 (–2.55) |
| fisa_casp3 | 5 | 4 (–4.80) | 4 (–5.17) | 4 (–4.83) | | 5 (–5.27) | 4 (–4.38) | 4 (–6.72) |
| hg_structal | 29 | 12 (–1.97) | 12 (–1.74) | 16 (–1.33) | 18 (1.87) | 22 (–2.73) | | 23 (–2.06) |
| ig_structal | 61 | 0 (0.92) | 0 (1.11) | 26 (–1.02) | 20 (0.69) | 47 (–1.62) | 49 (–2.03) | |
| ig_structal_hires | 20 | 0 (0.17) | 0 (0.32) | 16 (–2.05) | 14 (–0.77) | 18 (–2.35) | 19 (–2.19) | |
| I-TASSER | 56 | 49 (–4.02) | 56 (–5.77) | 48 (–5.03) | 55 (–7.43) | 45 (–5.36) | 55 (–5.32) | |
| lattice_ssfit | 8 | 8 (–9.44) | 8 (–8.85) | 8 (–10.12) | 8 (–6.75) | 8 (–8.38) | 8 (–9.56) | |
| lmds | 10 | 7 (–0.88) | 7 (–1.03) | 6 (–2.44) | 8 (–5.63) | 7 (–4.07) | 8 (–5.47) | |
| MOULDER | 20 | 19 (–2.97) | 19 (–2.84) | 18 (–2.74) | 19 (–4.84) | 19 (–3.58) | | 20 (–3.16) |
| ROSETTA | 58 | 20 (–1.82) | 20 (–1.47) | 12 (–0.83) | 39 (–3.00) | 45 (–3.70) | 49 (–3.68) | |
| Total | 278 | 128 (–1.94) | 135 (–2.13) | 164 (–2.52) | 196 (–2.86) | 226 (–3.57) | 244 (–3.56) | |
The results of the other potentials are taken from the GOAP paper. The table lists, for each potential, the number of targets whose native structures were successfully recognized. The numbers in parentheses are the average Z‐scores of the native structures. The larger the absolute value of the Z‐score, the better. Out of the total 278 targets in 11 decoy sets, OPUS‐CSF5 (5‐residue segment) recognized 244 and OPUS‐CSF (combined segment length) recognized 257 native structures from their decoys. The bold number in each row indicates the best result among all the potential functions for that particular decoy set (if the numbers of targets are the same, the bold entries are those with the better Z‐scores).
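The two quantities reported per target, recognition of the native and its Z‐score, can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that a lower score is better, that the Z‐score uses the mean and population standard deviation of the decoy scores only, and that "recognized" means the native scores strictly below every decoy. The function names and the toy scores are hypothetical.

```python
from statistics import mean, pstdev

def native_z_score(native_score, decoy_scores):
    """Z-score of the native's score relative to the decoy score distribution.

    Assumption: mean and (population) standard deviation are taken over
    the decoys only; the source does not spell out this detail.
    """
    mu = mean(decoy_scores)
    sigma = pstdev(decoy_scores)
    return (native_score - mu) / sigma

def native_recognized(native_score, decoy_scores):
    """True if the native scores strictly lower (better) than every decoy."""
    return native_score < min(decoy_scores)

# Hypothetical scores for one target (lower = better); not data from the paper.
decoys = [-1.0, 0.0, 1.0, 2.0, 3.0]
native = -4.0

z = native_z_score(native, decoys)
ok = native_recognized(native, decoys)
print(f"Z-score = {z:.2f}, recognized = {ok}")
```

Under these assumptions, a more negative Z‐score means the native is separated further from the bulk of the decoys, which matches the convention in the table that a larger absolute value is better.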
Average Pearson correlation coefficients of CSF scores with TM‐scores
| Decoy sets | OPUS‐PSP | GOAP | OPUS‐CSF |
|---|---|---|---|
| 4state_reduced | −0.589 | – | −0.667 |
| fisa | −0.282 | −0.347 | – |
| fisa_casp3 | −0.095 | −0.221 | – |
| hg_structal | −0.752 | – | −0.803 |
| ig_structal | −0.779 | −0.865 | – |
| ig_structal_hires | −0.832 | −0.885 | – |
| I–TASSER | −0.284 | – | −0.452 |
| lattice_ssfit | −0.051 | −0.058 | – |
| lmds | −0.091 | −0.146 | – |
| MOULDER | −0.802 | – | −0.863 |
| ROSETTA | −0.343 | – | −0.391 |
| Average | −0.521 | −0.632 | −0.624 |
The correlation coefficient of a decoy set is the average of the coefficients of all targets in that decoy set. The native structure was excluded when calculating the correlation coefficients. The average correlation coefficient of OPUS‐CSF is comparable with those of the other two potentials. The bold number in each row indicates the best result among the three potential functions for that particular decoy set. For OPUS‐CSF, only the results for the combined segment case are listed.
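The averaging described in this note can be sketched as follows: a Pearson coefficient is computed per target over its decoys (native excluded), and the per‐target values are then averaged over the decoy set. The function names and the toy numbers are hypothetical and only illustrate the procedure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def decoy_set_correlation(targets):
    """Average the per-target coefficients over one decoy set.

    targets: list of (scores, tm_scores) pairs, each holding the decoys
    of one target with the native structure already removed.
    """
    per_target = [pearson(scores, tms) for scores, tms in targets]
    return sum(per_target) / len(per_target)

# Hypothetical decoy scores and TM-scores for two targets; not paper data.
targets = [
    ([1.0, 2.0, 3.0], [0.9, 0.6, 0.3]),
    ([1.0, 2.0, 3.0, 4.0], [0.5, 0.7, 0.4, 0.2]),
]
avg_r = decoy_set_correlation(targets)
print(f"average Pearson r = {avg_r:.3f}")
```

A negative coefficient is the desired outcome here: a lower score should go with a higher TM‐score.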
Figure 1. The histogram of standard deviations of the coordinate components in the CND lookup table for the 5‐residue segment case. The distribution peaks at a very small standard deviation, indicating that the coordinate components of the 1st and 5th mainchain C (carbonyl) atoms are clustered in a narrow distribution, that is, the configurational distributions of the 5‐residue peptide segments are narrow. The average value of the standard deviation is 1.20 Å.
Figure 2. The population distribution of CSF scores for 278 native structures in 11 decoy sets. The X‐axis is the CSF score (per independent coordinate component variable). The Y‐axis is the histogram of the population.
Figure 3. The distribution of sequence repeat frequency in the CND lookup table. The X‐axis is the repeat frequency, and the Y‐axis is the number of sequences with that repeat frequency. Sequences that repeat fewer than five times were omitted in our study. Analysis of this distribution indicates that half of the sequences repeat >26 times. The largest value on the X‐axis is 29,618, reached by one sequence, which is not shown for clarity.
The results of OPUS‐CSF built from residue segments of different lengths
| | Num_above5 | Num_all | Num_above5/Num_all |
|---|---|---|---|
| 5‐residues | 1766273 | 2350969 | 0.751 |
| 7‐residues | 3736778 | 9544858 | 0.391 |
| 9‐residues | 3713506 | 10262243 | 0.362 |
| 11‐residues | 3743204 | 10698802 | 0.350 |
Num_above5 is the number of sequence segments that occur at least five times in the PDB. Num_all is the total number of sequence segments in the PDB. The ratio decreases as the segment length increases.
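The counting behind this table can be sketched with a sliding window over chain sequences. This is an illustrative sketch only: it assumes that Num_all counts distinct segment sequences, which the note does not state explicitly, and the function names and toy sequences are hypothetical.

```python
from collections import Counter

def segment_counts(sequences, length):
    """Count how often each contiguous segment of `length` residues occurs."""
    counts = Counter()
    for seq in sequences:
        # Chains shorter than `length` yield no windows.
        for i in range(len(seq) - length + 1):
            counts[seq[i:i + length]] += 1
    return counts

def frequent_fraction(counts, min_occurrences=5):
    """Return (num_above, num_all, ratio) over distinct segment sequences."""
    num_all = len(counts)
    num_above = sum(1 for c in counts.values() if c >= min_occurrences)
    return num_above, num_all, num_above / num_all

# Toy chains, not PDB data: "AAAAA" occurs 5 times, the other two once each.
chains = ["AAAAAAAAA", "ACDEFG"]
counts = segment_counts(chains, length=5)
num_above5, num_all, ratio = frequent_fraction(counts)
print(num_above5, num_all, round(ratio, 3))
```

Longer segments have far more possible sequences, so each one is observed less often, which is consistent with the falling ratio in the table.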
The performance of OPUS‐CSF based on different lengths of residue segments on the 11 decoy sets
| | 5‐residues | 7‐residues | 9‐residues | 11‐residues |
|---|---|---|---|---|
| Success numbers | 244 (278) | 218 (278) | 220 (278) | 219 (278) |
| Z‐scores | −3.56 | −4.55 | −4.62 | −4.57 |
| Average Coverage | 0.971 | 0.749 | 0.712 | 0.683 |
| Unknowns | 0 | 41 | 45 | 46 |
Success numbers are the numbers of native structures that OPUS‐CSF correctly recognized among the decoys. Numbers in parentheses (278) are the total number of native structures (or targets) in the 11 decoy sets. The Z‐scores are calculated for the CSF scores of the native structures with respect to their decoys. Coverage is the ratio of the number of segments available in the CND lookup table to the total number of segments in a target sequence. The table shows the average coverage over the 278 targets in the 11 decoy sets. Unknowns are the numbers of target sequences with <20% coverage; OPUS‐CSF is not applicable to these sequences. Note that the 5‐residue case has no sequences classified as unknown, whereas the 7‐residue case, for example, has 41 out of 278 sequences to which OPUS‐CSF is not applicable. The number of unknowns increases slightly as the segment length increases. Note also that, in the combined segment case, the longer segments may make no contribution to the CSF score if they are unknowns. Since the 5‐residue segment case has no unknowns, OPUS‐CSF remains applicable to all target sequences, even the rare ones for which all longer segments are regarded as unknown.
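The coverage and "unknown" definitions above can be sketched as follows. The lookup table is modeled here as a plain set of segment sequences, which is an assumption about its layout; the function names and the toy data are hypothetical.

```python
def coverage(sequence, lookup, length):
    """Fraction of the sequence's segments that are present in the lookup table."""
    n_segments = len(sequence) - length + 1
    if n_segments <= 0:
        return 0.0  # sequence too short to contain any segment
    found = sum(
        1 for i in range(n_segments) if sequence[i:i + length] in lookup
    )
    return found / n_segments

def is_unknown(sequence, lookup, length, threshold=0.20):
    """A target is 'unknown' when its coverage falls below 20%."""
    return coverage(sequence, lookup, length) < threshold

# Toy target and toy lookup tables; not data from the CND lookup table.
target = "ACDEFGHIK"
lookup5 = {"ACDEF", "CDEFG", "DEFGH", "EFGHI"}  # 4 of the 5 segments known
lookup7 = set()                                 # no 7-residue segment known

cov5 = coverage(target, lookup5, 5)
cov7 = coverage(target, lookup7, 7)
print(cov5, is_unknown(target, lookup5, 5))
print(cov7, is_unknown(target, lookup7, 7))
```

In this toy case the target is usable at segment length 5 but unknown at length 7, mirroring how the 5‐residue table keeps the combined score applicable when longer segments drop out.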
Figure 4. Local molecular coordinate system in OPUS‐CSF defined by the mainchain atoms of the 3rd residue. The origin is on the Cα atom. The X‐axis is along the Cα–C line. The Y‐axis is in the plane of the Cα–C–O atoms, and parallel to the orthogonal projection of the C–O vector. The Z‐axis is defined accordingly.
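One way to build such a local frame is a Gram-Schmidt step, sketched below. This is an interpretation of the caption, not the authors' implementation: it assumes the "orthogonal projection" is the part of the C–O vector perpendicular to the X‐axis, and that Z is chosen as X × Y to give a right‐handed frame. All function names and coordinates are hypothetical.

```python
from math import sqrt

def sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def scale(a, s):
    return tuple(ai * s for ai in a)

def normalize(a):
    return scale(a, 1.0 / sqrt(dot(a, a)))

def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def local_frame(ca, c, o):
    """Unit axes (x, y, z) of a frame with origin at the C-alpha atom."""
    x = normalize(sub(c, ca))            # X along the C-alpha -> C line
    co = sub(o, c)
    # Assumed reading: drop the part of C->O along X, keep the in-plane rest.
    y = normalize(sub(co, scale(x, dot(co, x))))
    z = cross(x, y)                      # assumed right-handed convention
    return x, y, z

def to_local(point, ca, frame):
    """Coordinates of `point` expressed in the local frame."""
    d = sub(point, ca)
    return tuple(dot(d, axis) for axis in frame)

# Toy coordinates chosen so the local frame coincides with the global axes.
ca, c, o = (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.0, 0.0)
frame = local_frame(ca, c, o)
local = to_local((1.0, 2.0, 3.0), ca, frame)
print(frame, local)
```

Expressing the carbonyl C atoms of neighboring residues in such a frame removes the dependence on the global position and orientation of the structure.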