
Early-stage folding in proteins (in silico) sequence-to-structure relation.

Michał Brylinski¹, Leszek Konieczny, Patryk Czerwonko, Wiktor Jurkowski, Irena Roterman.

Abstract

A sequence-to-structure library has been created based on the complete PDB database. The tetrapeptide was selected as a unit representing a well-defined structural motif. Seven structural forms were introduced for structure classification. The early-stage folding conformations were used as the objects for structure analysis and classification. The degree of determinability was estimated for the sequence-to-structure and structure-to-sequence relations. Probability calculus and informational entropy were applied for quantitative estimation of the mutual relation between them. The structural motifs representing different forms of loops and bends were found to favor particular sequences in structure-to-sequence analysis.


Year: 2005 PMID: 16046811 PMCID: PMC1184056 DOI: 10.1155/JBB.2005.65

Source DB: PubMed Journal: J Biomed Biotechnol ISSN: 1110-7243

INTRODUCTION

Prediction of three-dimensional protein structures remains a major challenge to modern molecular biology. On the one hand, identical pentapeptide sequences exist in completely different tertiary structures in proteins [1]; on the other, different amino acid sequences can adopt approximately the same three-dimensional structure. However, the patterns of sequence conservation can be used for protein structure prediction [2, 3, 4]. Usually, secondary structure definition has been used for ab initio methods as a common starting conformation for protein structure prediction [5]. A large body of experiments and theoretical evidence suggests that local structure is frequently encoded in short segments of protein sequence. A definite relation between the amino acid sequences of a region folded into a supersecondary structure has been found. It was also found that they are independent of the remaining sequence of the molecule [6, 7]. Early studies of local sequence-structure relationships and secondary structure prediction were based on either simple physical principles [8] or statistics [9, 10, 11, 12]. Nearest-neighbor methods use a database of proteins with known three-dimensional structures to predict the conformational states of test protein [13, 14, 15, 16]. Some methods are based on nonlinear algorithms known as neural nets [17, 18, 19] or hidden Markov models [20, 21, 22, 23]. In addition to studies of sequence-to-structure relationships focused on determining the propensity of amino acids for predefined local structures [24, 25, 26, 27], others involve determining patterns of sequence-to-structure correlations [21, 22, 28, 29, 30]. The evolutionary information contained in multiple sequence alignments has been widely used for secondary structure prediction [31, 32, 33, 34, 35, 36, 37, 38]. 
Prediction of the percentage composition of α-helix, β-strand, and irregular structure based on the percentage of amino acid composition, without regard to sequence, permits proteins to be assigned to groups, as all α, all β, and mixed α/β [5, 39]. Structure representation is simplified in many models. Side chains are limited to one representative virtual atom; virtual Cα − Cα bonds are often introduced to decrease the number of atoms present in the peptide bond [40, 41]. The search for structure representation in a conformational space other than that of the φ, ψ angles continues [42]. Other models are based on limitation of the conformational space. One of them divided the Ramachandran map into four low-energy basins [43, 44]. In another study, all sterically allowed conformations for short polyalanine chains were enumerated using discrete bins called mesostates [45]. The need to limit the conformational space was also asserted [46, 47]. The model introduced in this paper is based on limitation of the conformational space to a particular part of the Ramachandran map. The structures created according to this limited conformational subspace are assumed to represent early-stage structural forms of protein folding in silico. In this paper, in contrast to the commonly used base of final native structures of proteins, the early-stage folding conformation of the polypeptide chain is the criterion for structure classification. Two approaches are the basis for the early-stage folding model presented in this paper. (1) The geometry of the polypeptide chain can be expressed using parameters other than the φ, ψ angles. These new parameters are the V-angle (the dihedral angle between two sequential peptide bond planes) and the R-radius (the radius of curvature), found to be dependent on the V-angle in the form of a second-degree polynomial. 
Details on the background of the geometric model based on V and R [48, 49] are recapitulated briefly in “appendix A.” (2) The structures satisfying the V-to-R relation appeared to distinguish a part of the Ramachandran map (the complete conformational space), delivering the limited conformational subspace (an ellipse path on the Ramachandran map). It was shown that the amount of information carried by the amino acid is significantly lower than the amount of information needed to predict the φ, ψ angles (a point on the Ramachandran map). These two amounts of information can be balanced by restricting the conformational space to the subspace distinguished by the simplified model presented above. Details on the background of the information-theory-based model [50] are reviewed briefly in “appendix B.” The conformational subspace found to satisfy the geometric characteristics (polypeptide chain limited to the peptide bond planes, with side chains ignored) and the condition of information balancing appeared to select the part of the Ramachandran map which can be treated as the early-stage conformational subspace. The introduced model of early-stage folding was extended to make it applicable to the creation of starting structural forms of proteins for an energy-minimization procedure oriented to protein structure prediction. Characterizing the sequence-to-structure and structure-to-sequence contingency tables and their possible applicability is the aim of this paper. The structures created according to the limited conformational subspace can be reached in two different ways: (1) as the partial unfolding (Figures 1a–1e) and (2) as the basis for the initial structure assumed to represent early-stage folding (Figures 1f–1j). 
The partial unfolding of the native structural form (called the “step-back” structure in this paper) is expressed by changing the φ, ψ angles to the φsb, ψsb angles (the φsb, ψsb angles belong to the ellipse path, and their values are obtained according to the criterion of the shortest distance between φ, ψ and the ellipse, as shown in Figure 1b). The second approach, in which the structure is created on the basis of the φes, ψes angles (φes, ψes denote the dihedral angles belonging to the ellipse and representing a particular probability maximum), is based on the library of sequence-to-structure relations for tetrapeptides.

Figure 1

(a–e) Step-back unfolding path: (a) native structure of 1APJ, (b) partial unfolding procedure, (c) step-back conformation according to the limited conformational subspace, (d) example of amino-acid-dependent probability profile (Glu) for complete PDB 2003 after moving φ, ψ angles to the nearest point on the ellipse path, (e) letter codes assigned according to probability profiles. (f–j) Folding simulation path: (f) early-stage structure prediction in terms of structural letters, (g) an example of a discrete profile (Glu) applied to early-stage structure creation, (h) predicted early-stage conformation of 1APJ, (i) late-stage folding simulation procedure (under consideration, not applied yet), (j) structure of 1APJ as a result of the energy-minimization procedure with proper disulphide bridge constraints.

A scheme summarizing the two procedures (partial unfolding and partial folding) is shown in Figure 1. The procedure called partial unfolding starts at the native structure of the protein (Figure 1a). The values of the φ, ψ angles present in the protein are changed (according to the shortest-distance criterion) to the values of the angles belonging to the ellipse (φsb, ψsb). When these dihedral angles are applied, the structure of the same protein looks as shown in Figure 1c. When this procedure is applied to all proteins present in the Protein Data Bank, a probability profile can be obtained which represents the distribution of φ, ψ angles in the limited conformational subspace. The distribution is different for each amino acid, although some characteristic maxima can be distinguished. The profile shown in Figure 1d represents Glu (the ellipse equation t-parameter = 0° represents the point φ = 90° and ψ = −90°, and then increases clockwise). Particular probability maxima can be recognized using the letter codes also shown in Figure 2. These letter codes are used to classify the structures of proteins in their early-stage folding (in silico) (Figure 1e).
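The shortest-distance criterion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the ellipse equation itself is given in the paper's appendix A and is not reproduced here, so `toy_ellipse` is a hypothetical stand-in (its centre, semi-axes, and the direction in which t increases are all invented), and the brute-force scan over t is one possible way to find the nearest point, not necessarily the authors' procedure.

```python
import math


def angular_difference(a_deg, b_deg):
    """Signed difference a - b between two angles in degrees, wrapped to [-180, 180)."""
    return (a_deg - b_deg + 180.0) % 360.0 - 180.0


def toy_ellipse(t_deg):
    """HYPOTHETICAL closed curve standing in for the ellipse path.

    Centre (0, 0), semi-axes 150 and 60 degrees, no tilt; t increases
    counterclockwise. The real equation and t convention are in appendix A.
    """
    t = math.radians(t_deg)
    return 150.0 * math.cos(t), 60.0 * math.sin(t)


def step_back(phi, psi, curve=toy_ellipse, step_deg=1.0):
    """Return (t, phi_sb, psi_sb): the sampled curve point closest to (phi, psi).

    The curve is sampled every step_deg degrees of t; distances are computed
    with angle wrap-around, since phi and psi are periodic.
    """
    best = None
    n_steps = int(round(360.0 / step_deg))
    for k in range(n_steps):
        t = k * step_deg
        c_phi, c_psi = curve(t)
        d = math.hypot(angular_difference(phi, c_phi),
                       angular_difference(psi, c_psi))
        if best is None or d < best[0]:
            best = (d, t, c_phi, c_psi)
    return best[1], best[2], best[3]
```

Calling `step_back(phi, psi)` for every residue and histogramming the returned t values per amino acid would give a profile of the kind shown in Figure 1d, provided the actual ellipse is substituted for the toy curve.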

Figure 2

Letter codes for structure classification. (a) The ellipse-path-limited conformational subspace in relation to φ, ψ angles as they appear in real proteins. Arrows denote the shortest-distance criterion for definition of φ, ψ angles belonging to the ellipse for arbitrarily selected points. (b) Probability maxima as they appear along the ellipse (starting t-point shown in (c)) and corresponding letter codes for structure identification. (c) Limited conformational subspace with fragments distinguished according to probability maxima shown in (b).

The opposite procedure, aimed at protein folding, is also shown in Figures 1f–1j. The starting point in this procedure is the amino acid sequence of a particular protein. After selecting four-amino-acid fragments (in an overlapping system), four different structural codes (for the same tetrapeptide) can be attributed on the basis of the contingency table (Figure 1f). Only a particular fragment of the probability profile (according to the letter code) can be recognized in this case. In consequence, the φes, ψes values representing the location of the probability maximum on the t-axis can be attributed to a particular sequence (Figure 1g). This is why the φes, ψes angles differ from φsb, ψsb. In consequence, the structure of the transforming growth factor β binding protein-like domain (protein selected as an example, PDB ID: 1APJ) created according to the φes, ψes angles shown in Figure 1h differs from the (φsb, ψsb)-based structure. The “sb” (step-back) and “es” (early-stage) structures differ due to the continuous form of the probability distribution in the “sb” procedure and the discrete one in the “es” procedure. The next step in the prediction procedure is energy minimization, which in some cases brings the structure closer to the native one (Figure 1j). The structures created according to the ellipse path, treated as the starting structures for the energy-minimization procedure, deliver forms that approach the native structure after one simple optimization procedure. BPTI [51], ribonuclease [50], and to some extent also human hemoglobin α and β chains [52] and lysozyme [53] were used as the model molecules. All these examples proved that the ellipse-path-limited conformational subspace helped define the initial structure for the energy-minimization procedure, leading to proper, native-like structures without any forms inconsistent with protein-like ones. 
When the energy-minimization procedure is not sufficient to deliver the proper native-like structure of the protein (which can be seen by comparing Figures 1a and 1j), an additional procedure is necessary (Figure 1i). It is under study now and will be published in the near future.

MATERIAL AND METHODS

Early-stage folding structure classification

All proteins present in PDB (release January 2003) were taken for analysis [54]. Letter codes have been used for sequence identification. A letter code system is introduced in this paper for structure representation in protein early-stage folding (in silico) based on the probability distribution of φ, ψ angles along the ellipse-path-limited conformational subspace (see “appendix B”). To easily distinguish the structure codes from sequence codes, the former are printed in bold and the latter in italics in this work. Comparison of distributions between three-state secondary structures indicated four-amino-acid fragments as the most common ones for α-helices, β-strands, and loops [21, 55]. The tetrapeptide was adopted as the unit for investigation of the sequence-structure relation. The probability distribution along the ellipse, which is assumed to represent the limited conformational subspace, is the basis for the structure classification introduced in this paper. The profile of the probability distribution (of all amino acids) along the ellipse path is shown in Figure 2. Figure 2a shows the usual distribution of φ, ψ angles as found in proteins together with the ellipse path. The procedure of moving particular φ, ψ angles to the ellipse path is also shown in Figure 2a. The shortest distance between particular φ, ψ angles (a point on the Ramachandran map) and a point belonging to the ellipse path locates the φe, ψe (e denotes ellipse belonging) dihedral angles determining the early stage for a particular amino acid of the polypeptide chain. After moving all φ, ψ angles to the ellipse path, the profile of the probability distribution can be obtained, as shown in Figure 2b. The t-parameter is the ellipse parameter present in the equation shown in “appendix A.” The t-parameter equal to zero represents the point φ = 90° and ψ = −90° on the Ramachandran map and increases clockwise, as is shown in Figure 2c. Seven probability maxima can be distinguished in this profile. 
Each of them is letter coded. This coding system was applied to classify the structures of all proteins analyzed. The codes introduced according to the probability distribution shown in Figure 2b are interpreted as follows: C (t-value range) represents right-handed helical structures, E represents β-structural forms, and G represents left-handed helices. The β-structural forms are differentiated (some amino acids like Ala, Ser, Asp reveal two probability maxima [50]); this is why code F also represents β-like structures. Although all other letters represent structural forms not identified in the traditional classification, the presence of probability maxima suggests the need to distinguish these categories (code A mostly for Pro and Gly, code B represented mostly by Asn and Asp, and code D characteristic for Tyr and Asn, to take a few examples).

The contingency table

A window size of four amino acids (analogous to the open reading frame in nucleotide identification) with one amino acid step (overlapping system) was applied to code the sequences and structures in proteins. Potentially 160 000 ($20^4$) different sequences for tetrapeptides can occur (columns). Taking seven different structural forms for each amino acid in a tetrapeptide, 2401 ($7^4$) structural forms can be distinguished for a tetrapeptide (rows). These numbers give an idea of the size of the contingency table under consideration. For all cells, probability values of $p^{t}_{ij}$, $p^{c}_{ij}$, and $p^{r}_{ij}$ were calculated as follows:

$$p^{t}_{ij} = \frac{n_{ij}}{N^{t}}, \tag{1}$$

$$p^{c}_{ij} = \frac{n_{ij}}{N^{c}_{j}}, \tag{2}$$

$$p^{r}_{ij} = \frac{n_{ij}}{N^{r}_{i}}, \tag{3}$$

where i denotes a particular structure (row), j denotes a particular sequence (column), $n_{ij}$ is the number of polypeptide chains belonging to the ith structure and representing the jth sequence, $N^{t}$ is the total number of ORFs, and $N^{r}_{i}$ and $N^{c}_{j}$ denote the number of ORFs belonging to a particular ith structure and jth sequence, respectively. The table expressing all probabilities ($p^{t}_{ij}$, $p^{c}_{ij}$, and $p^{r}_{ij}$) is available on request at http://www.bioinformatics.cm-uj.krakow.pl/earlystage/. All values are expressed on a logarithmic scale because of the very low probability values in the cells of the table.
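A minimal sketch of how such a table could be assembled is given below. It assumes that each chain is available as two equal-length strings (one-letter amino acid codes and per-residue structure letters A to G), and that the three probabilities are the cell count divided by the grand total, the column (sequence) total, and the row (structure) total, respectively. The function names and the sparse dictionary layout are illustrative choices, not the authors' implementation.

```python
from collections import Counter


def tetrapeptide_windows(sequence, structure, size=4):
    """Yield (structure_window, sequence_window) pairs with a one-residue step."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure strings must have equal length")
    for k in range(len(sequence) - size + 1):
        yield structure[k:k + size], sequence[k:k + size]


def build_contingency(chains):
    """Count cells n_ij and derive the total, column, and row probabilities.

    chains: iterable of (sequence, structure) string pairs.
    Returns (counts, probs) where probs[(i, j)] = (p_t, p_c, p_r);
    only nonzero cells are stored.
    """
    counts = Counter()
    for sequence, structure in chains:
        counts.update(tetrapeptide_windows(sequence, structure))

    row_totals = Counter()   # per structure i
    col_totals = Counter()   # per sequence j
    for (i, j), n in counts.items():
        row_totals[i] += n
        col_totals[j] += n
    grand_total = sum(counts.values())

    probs = {
        (i, j): (n / grand_total, n / col_totals[j], n / row_totals[i])
        for (i, j), n in counts.items()
    }
    return counts, probs
```

Storing only nonzero cells keeps the table manageable: the full grid would have 2401 × 160 000 entries, almost all of them empty.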

Information entropy as a measure of sequence-to-structure and structure-to-sequence predictability

High values of probability calculated as above (relative to potential probability values) can disclose highly coupled pairs of structure and sequence. Ranking the probability values can extract the highly determined relations for both sequence-to-structure and structure-to-sequence. Structural predictability can also be measured using informational entropy calculation. According to Shannon's definition [56], the amount of information can be calculated as follows:

$$I_i = -\log_2 p_i, \tag{4}$$

where $I_i$ expresses the amount of information (in bits) dependent on $p_i$, the probability of event i. This definition is very useful for measuring the amount of information carried by a particular simple (elementary) event. In the case of a complex event, for which several solutions are possible, informational entropy can be calculated, expressing the level of uncertainty in predicting the solution. Informational entropy according to Shannon's definition is as follows:

$$SE = -\sum_{i=1}^{n} p_i \log_2 p_i, \tag{5}$$

where n is the number of possible solutions for the event under consideration (the number of elementary events). $SE$ reaches its maximum value ($SE^{\max}$) for all $p_i$ equal to each other, that is, each ith solution is equally probable for the event under consideration and no solution is preferred. The maximum value depends on the number of possible solutions for the event (n). $SE$ equal to zero (one solution with $p_i = 1.0$) represents the determinate case in which only one solution is possible. The higher the difference between $SE^{\max}$ and $SE$, the higher the degree of determinability in the given case. A high value of this difference means that the case is realized by a few solutions and that some of them occur with higher probability, which can be interpreted as a case with higher determinability (biased event). $SE$, $SE^{\max}$, and the values of the differences between them can be calculated for all rows, $SE^{r}$ (structural preferences versus amino acid sequence), and for columns, $SE^{c}$ (sequence preference for a particular structural form), in the contingency table. 
$SE^{c}$ allowed extraction of structures highly determined by the sequence; $SE^{r}$ extracted structures highly attributed to a particular sequence. The $SE$ value for each column (particular sequence) in the contingency table was calculated as follows:

$$SE^{c}_{j} = -\sum_{i=1}^{N_0} p^{c}_{ij} \log_2 p^{c}_{ij}, \tag{6}$$

where $SE^{c}_{j}$ denotes informational entropy for the j-column, i denotes a particular row (structure), $N_0$ is the number of nonzero cells in the j-column, and $p^{c}_{ij}$ is calculated according to (2). The value $SE^{c}_{j}$ as calculated according to (6) measures the level of uncertainty in predicting structure for the jth sequence. The closer the $SE^{c}_{j}$ value is to zero, the higher the chance of a correct prediction. $SE^{c,\max}_{j}$ expresses quantitatively the level of uncertainty in the most difficult case for making a decision. For the j-column (sequence):

$$SE^{c,\max}_{j} = -\sum_{i=1}^{N_0} p^{c,\max}_{ij} \log_2 p^{c,\max}_{ij}, \tag{7}$$

where $SE^{c,\max}_{j}$ denotes maximum informational entropy for the j-column, i denotes a particular row (structure), $N_0$ is the number of nonzero cells in the j-column, and $p^{c,\max}_{ij}$ denotes the value of probability in a column under the assumption that all nonzero cells are equally represented (the principal condition for $SE^{c,\max}_{j}$). In other words, $p^{c,\max}_{ij}$ for all nonzero cells ($i = 1, \ldots, N_0$) in the j-column can be calculated as follows:

$$p^{c,\max}_{ij} = \frac{1}{N_0}. \tag{8}$$

Thus the difference between two quantities ((6) and (7)) can be used as the “distance” between the most difficult situation (all solutions equally possible, that is, a random solution) and the situation observed in the case under consideration. For the j-column:

$$\Delta SE^{c}_{j} = SE^{c,\max}_{j} - SE^{c}_{j}. \tag{9}$$

Analogous calculations for rows (structures) were performed. For each i-row, the values of $SE^{r}_{i}$, $SE^{r,\max}_{i}$, and $\Delta SE^{r}_{i}$ were calculated.
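The column-wise calculation described above can be illustrated with a short function. It is a sketch under the assumptions that the within-column probabilities are the cell counts divided by the column total and that the maximum entropy for equally probable nonzero cells equals log2 of their number; `entropy_summary` is a hypothetical name, and the same function applies to a row if row counts are passed in.

```python
import math


def entropy_summary(cell_counts):
    """Return (SE, SE_max, delta_SE) in bits for one column (or row) of counts.

    cell_counts: iterable of cell counts n_ij; zero cells are ignored.
    SE     = -sum(p * log2(p)) over nonzero cells, with p = n / sum(n)
    SE_max = log2(number of nonzero cells)   (all cells equally probable)
    delta  = SE_max - SE                     (larger means more determinate)
    """
    nonzero = [n for n in cell_counts if n > 0]
    if not nonzero:
        raise ValueError("at least one nonzero cell is required")
    total = sum(nonzero)
    se = -sum((n / total) * math.log2(n / total) for n in nonzero)
    se_max = math.log2(len(nonzero))
    return se, se_max, se_max - se
```

For example, a sequence observed 4, 2, 1, and 1 times in four different structures gives SE = 1.75 bit against a maximum of 2 bit, so ΔSE = 0.25 bit, whereas four equally populated structures give ΔSE = 0.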

RESULTS

Structures coded according to the introduced system

Structures of all proteins present in the PDB (release January 2003) [56] were analyzed. The φ, ψ angles were calculated for each amino acid. The φe, ψe angles were then determined according to the shortest distance to the ellipse. A letter code was assigned for each amino acid according to the ellipse path fragment. Since the tetrapeptide was used as the structural unit, four letters coded one structural unit. The overlapping reading frame system was applied, which means that one amino acid step was applied in structure classification. The maximum number of combinations of seven letter codes in a four-letter string is equal to 2401. This means that 2401 different four-letter strings were expected to be found. It turned out that only 2397 different strings were found in real proteins. Since there are 20 amino acids and four amino acids were taken for the unit, 160 000 different sequences of tetrapeptides were expected; 146 940 different sequences were found in the proteins under consideration.

Contingency table

Each tetrapeptide found in proteins was described by a four-letter string expressing the sequence and a four-letter string expressing the structure. Each tetrapeptide with a known sequence and known structure can be ordered in the form of a table. The rows of the table represent structures and the columns represent sequences. Finally a 2397 × 146 940 table was constructed. To distinguish the structure codes from sequence codes, sequence codes are in bold capital letters and structure codes in italics. The scheme of the contingency table is presented in Table 1. The total number of tetrapeptides in the analyzed database was found to be 1 529 987. Global analysis of the contingency table shows that the maximum number of different structures attributed to the same tetrapeptide is 144. This tetrapeptide has the sequence GSAA. The maximum number of different sequences was found for α-helix (CCCC: 90 587) and for β-structure (EEEE: 47 809). Four structures were not found in the library: ABAB, ABBD, ABFB, DBAB.

Table 1

Scheme of the sequence-structure contingency table. Symbols explained in text.

Structure	Sequence
	1	2	...	j	...	160 000
1	n_11, p^t_11, p^c_11, p^r_11	n_12, p^t_12, p^c_12, p^r_12	...	n_1j, p^t_1j, p^c_1j, p^r_1j	...	N^r_1
2	n_21, p^t_21, p^c_21, p^r_21	n_22, p^t_22, p^c_22, p^r_22	...	n_2j, p^t_2j, p^c_2j, p^r_2j	...	N^r_2
...	...	...	...	...	...	...
i	n_i1, p^t_i1, p^c_i1, p^r_i1	n_i2, p^t_i2, p^c_i2, p^r_i2	...	n_ij, p^t_ij, p^c_ij, p^r_ij	...	N^r_i
...	...	...	...	...	...	...
2 401	N^c_1	N^c_2	...	N^c_j	...	N^t

Information entropy calculation

$SE$, $SE^{\max}$, and the value of the difference between these two quantities ($\Delta SE$) were calculated according to the procedure presented in “material and methods.” They can be calculated for columns (sequences) and for rows (structures) separately. The calculation of $SE^{c}_{j}$ for the j-column expresses the information entropy related to the structural differentiation of a particular sequence. The calculation of $SE^{r}_{i}$ for the i-row in the contingency table expresses the sequential differentiation for a particular structure. $SE^{\max}$, according to information entropy characteristics, expresses the entropy for the case in which all nonzero cells have equal probability. For $SE^{c,\max}_{j}$, all structures for a particular sequence are equally probable. Equal probability for a set of elementary events (different structures) represents the random situation. The bigger the difference $SE^{\max} - SE$, the more deterministic the case. This is why the difference ($\Delta SE$) between $SE^{\max}$ and $SE$ was taken to measure the degree of structure-to-sequence (or vice versa) determination. Tables 2 and 3 are interpreted as follows: the structural predictability for a particular sequence can be estimated from Table 2, and the predictability of the sequence for a particular structure from Table 3. Only the top ten sequences and top ten structures are shown in Tables 2 and 3.

Table 2

Sequence-to-structure relation measured according to the value of the difference ($\Delta SE^{c}$) between entropy of information ($SE^{c}$) calculated for the probability values found in the contingency table (particular column) and maximum entropy of information ($SE^{c,\max}$), which (according to the characteristics of entropy of information) is reached for equal probability values in each nonzero cell in a particular column.


Sequence	Structure	SE^c (bit)	SE^c,max (bit)	ΔSE^c (bit)
AAAA	CCCC	2.29	6.44	4.15
GDSG	GCFG	1.57	5.49	3.92
AVRR	CCCC	1.04	4.95	3.91
LAAA	CCCC	1.77	5.61	3.84
EAEL	CCCC	1.37	5.21	3.83
LDKA	CCCC	1.30	5.09	3.78
DAAV	CCCC	0.69	4.46	3.77
AKLK	CCCC	0.76	4.52	3.77
DSGG	CFGF	1.97	5.73	3.76
ELAA	CCCC	1.30	5.04	3.75
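As a quick consistency check on the first row of Table 2 (a sketch, assuming that the maximum entropy for equally probable nonzero cells reduces to $\log_2 N_0$; the value of $N_0$ below is inferred from the table, not reported in the text):

```latex
\Delta SE^{c}_{\mathrm{AAAA}} = SE^{c,\max}_{\mathrm{AAAA}} - SE^{c}_{\mathrm{AAAA}}
  = 6.44 - 2.29 = 4.15\ \text{bit},
\qquad
N_0 = 2^{SE^{c,\max}_{\mathrm{AAAA}}} = 2^{6.44} \approx 87 .
```

Under that assumption, AAAA would be observed in roughly 87 distinct structural codes, while its entropy of 2.29 bit corresponds to only about $2^{2.29} \approx 5$ effectively equiprobable alternatives, which is what the large ΔSE expresses.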

Table 3

Structure-to-sequence relation measured according to the value of the difference ($\Delta SE^{r}$) between entropy of information ($SE^{r}$) calculated for the probability values found in the contingency table (particular row) and maximum entropy of information ($SE^{r,\max}$), which (according to the characteristics of entropy of information) is reached for equal probability values in each nonzero cell in a particular row.


Structure	Sequence	SE^r (bit)	SE^r,max (bit)	ΔSE^r (bit)
GCFG	GDSG	4.82	7.99	3.17
AEED	GLRL	3.86	6.81	2.95
BACE	GGAE	2.20	5.09	2.89
EAEG	IGIG	4.79	7.68	2.89
AEGE	GIGH	4.74	7.63	2.89
BFBE	PEPV	2.28	5.13	2.85
AEGD	GNES	2.09	4.91	2.82
EBCB	ELPD	3.68	6.38	2.70
EBFB	FBEP	2.57	5.17	2.60
AFFP	GFRN	2.03	4.58	2.55

The highest structural predictability among all sequences, found for AAAA, confirms polyalanine as a highly probable helical structure. Generally, the highly predictable structures for particular sequences are helical forms (Table 2). The sequence predictability for particular structural forms displayed a quite unexpected regularity. The structures representing irregular structural forms revealed the strongest entropy decrease relative to the random distribution of sequences. This can be seen by analyzing the letter codes for the structures (Table 3). The top ten structures presented in Table 3 are also shown in Figure 3. In summary, when a particular irregular structural form is expected in a protein, there are preferable sequences to build these irregular motifs; they are shown in Table 3. This seems to be of particular relevance for threading procedures oriented to the production of new proteins not observed in nature.

Figure 3

Structures of tetrapeptides with highest structure-to-sequence determinability as found using informational entropy calculation (see “material and methods” and Table 1). Gray terminal fragments represent the extended form of polyalanine (tetrapeptides) to emphasize the mutual spatial orientation of terminal fragments. Other colors distinguish ellipse fragments as follows: red (A), green (B), violet (C), sky-blue (D), yellow (E), dark blue (F), orange (G). The data for creation of these structures is given in Table 1 and Figure 2.

DISCUSSION

Particular classes of amino acid relations to particular structural forms in proteins were recently found to solve the problem of structure predictability [57]. All papers concerning this subject linked sequence with structure as it appears in the final native form of the protein. The model introduced in this paper represents an approach to the relation between sequence and structure in the early-stage folding structural form; the bases for the model are presented in detail elsewhere [48, 49, 50], and verified by BPTI [51], ribonuclease [50], hemoglobin [52], and lysozyme [53] folding. The (in silico) early-stage structures of these proteins can be found in the corresponding publications. Several algorithms for quantitatively assigning α-helix, β-strand, and loop regions for proteins with known structure have been developed [58, 59, 61]. The three-dimensional model presented in this paper shows that it is enough to select seven fragments of the ellipse with well-defined probability maxima to be able to predict the early-stage structural form. The high structure-to-sequence relation found for loops (Table 3, Figure 3) may be particularly important, since a recent survey of 31 genomes indicated that disordered segments longer than 50 residues are very prevalent [62]. Helices, sheets, and turns together account for only about 50%–55% of all protein structure on average [63]; the remaining structures are classified as several types of loops [63, 64]. Current estimations suggest that over 50% of proteins in eukaryotes may carry unstructured regions of more than 40 residues in length [65], while less than 1% of the proteins in the PDB contain such long disordered regions. These observations taken together imply that many proteins with disordered regions would be unlikely to form crystals [66]. 
Proteins containing long, disordered segments under physiological conditions are frequently involved in regulatory functions [67], and the structural disorder may be relieved upon binding of the protein to its target molecule [68, 69]. Intrinsically unstructured proteins and regions, which are also known as natively unfolded and intrinsically disordered, differ from structured globular proteins and domains with regard to many attributes, including amino acid composition, sequence complexity, hydrophobicity, charge, flexibility, and type and rate of amino acid substitutions over evolutionary time [66]. Compared to highly ordered secondary structure regions, the loops and turns are more difficult to identify due to the absence of hydrogen bonding and repeating backbone dihedral angle patterns [70]. The first computational tool indicating the predictability of disordered regions from protein sequence [71] was a neural network predictor (PONDR). Several other disorder predictors have been published since then [72, 73, 74]. A statistically based turn propensity used over a four-residue window has been described [75]. The inverse folding problem is the design of protein sequences that have a desired structure [76, 77]. It is impossible to mention even a small part of the papers dealing with the sequence-to-structure relation. Recently, it was concluded that the probability of any state (φ, ψ) is influenced by the full sequence and not only by the local structure [78]. A genome-scale fold recognition program exploring the knowledge-based structure-derived score function for a particular residue was proposed incorporating three terms: backbone torsion, buried surface, and contact energy [79]. Unlike many others, our model, dual in nature, incorporating sequential and structural information, predicts sequence-to-structure as well as structure-to-sequence. The contingency table was independently analyzed using another statistics-related method (Meus J, Stefaniak J. 
The Z coefficient as a measure of dependence in contingency tables (unpublished data), Meus J, Brylinski M, Piwowar P, et al. A tabular approach to the sequence-to-structure relation in proteins (unpublished data)). High accordance was found between the results presented in this paper and in the statistical analysis: the top ten sequences and structures presented in Table 1 were found to be among the most highly correlated, both in sequence-to-structure and in structure-to-sequence, on the ranking list created by the alternate calculation method. The order of the two ranking lists is very similar, additionally confirming the reliability of the model presented. Aside from early-stage structure prediction, the contingency table presented may contribute to conventional secondary structure prediction, local and supersecondary structure prediction, location of transmembrane regions in proteins, location of genes, or sequence design. The list of highly determinable tetrapeptides (in sequence-to-structure and structure-to-sequence relations) also allowed the SPI (structure predictability index) scale to be defined [80]. Applied to amino acid sequences, this scale helps to measure the degree of difficulty of structure prediction for a particular amino acid sequence without knowledge of the final, native structure of the protein. The sequence-to-structure and structure-to-sequence contingency tables, which is created on the basis of all proteins of known structure (step-back procedure), can be used to create the early-stage folding (in silico) structure. Applied to other (late-stage folding) procedures, it presumably can enable protein structure prediction. The early-stage form was used as the object for comparison to simplify the presentation of the structure (seven possibilities). The SPI (structure predictability index) parameter, attributed to any amino acid sequence, allows estimation of the degree of difficulty in structure prediction. 
The probability values taken from particular cells of the contingency table, whether high or low, indicate how often a particular structure has occurred in the protein database so far. The information entropy-based classification presented in this paper allows highly distributed structural forms to be distinguished for a particular tetrapeptide sequence.
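The role of the contingency table and of informational entropy can be illustrated with a minimal sketch. It is not the authors' implementation: the tetrapeptides, the structural code letters (used here only as placeholders for the seven structural forms), and all counts are invented for illustration, and the standard Shannon entropy is assumed as the measure of how widely the structural forms are distributed for one sequence.

```python
import math
from collections import Counter, defaultdict

# Hypothetical observations: (tetrapeptide, structural code) pairs.
# The sequences, code letters, and counts are invented for illustration only.
observations = [
    ("AAAA", "C"), ("AAAA", "C"), ("AAAA", "C"), ("AAAA", "E"),
    ("GPGS", "A"), ("GPGS", "C"), ("GPGS", "E"), ("GPGS", "G"),
]

def contingency_table(pairs):
    """Count how often each structural code is observed for each sequence."""
    table = defaultdict(Counter)
    for sequence, structure in pairs:
        table[sequence][structure] += 1
    return table

def conditional_probabilities(row):
    """Turn one row of counts into P(structure | sequence)."""
    total = sum(row.values())
    return {structure: count / total for structure, count in row.items()}

def shannon_entropy(probabilities):
    """Shannon entropy in bits; 0 means the structure is fully determined."""
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)

table = contingency_table(observations)
entropy = {
    sequence: shannon_entropy(conditional_probabilities(row))
    for sequence, row in table.items()
}

for sequence, bits in entropy.items():
    print(f"{sequence}: {bits:.3f} bit")
```

In this toy data the sequence whose observations concentrate on one code has the lower entropy, while the sequence spread evenly over four codes has the higher one, which is the sense in which a tetrapeptide can be called more or less determinable. The structure-to-sequence direction follows by swapping the roles of the two elements of each pair.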

Table 4

Amount of information (I_i (bit)) carried by a particular amino acid, calculated on the basis of its frequency, and amount of information (SE_i^{ϕeψe} (bit); ϕe, ψe denote ϕ, ψ angles belonging to the ellipse) necessary to predict the structure belonging to the ellipse path (early-stage folding conformational subspace) with a 10° step of t-angle precision (see ellipse equation in Appendix A). Detailed analysis of the data shown in this table can be found elsewhere [50].

Amino acid	I_i (bit): amount of information carried by the amino acid	SE_i^{ϕeψe} (bit): averaged amount of information necessary to predict the ellipse-belonging structure
Gly	3.805	7.806
Asp	4.117	7.073
Leu	3.492	6.438
Lys	3.908	6.789
Ala	3.662	6.409
Ser	4.095	6.975
Asn	4.545	7.267
Glu	3.833	6.520
Thr	4.196	6.720
Arg	4.249	6.677
Val	3.886	6.233
Gln	4.663	6.676
Ile	4.151	6.208
Phe	4.713	6.617
Tyr	4.941	6.685
Pro	4.442	6.124
His	5.477	6.965
Cys	5.544	6.937
Met	5.614	6.494
Trp	6.236	6.581
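The first column of Table 4 can be related to amino acid frequency with a short sketch. It assumes that the amount of information is the standard Shannon self-information, I_i = -log2(p_i), where p_i is the frequency of amino acid i; the caption states only that I_i is calculated on the basis of the frequency, so this exact form and the function names below are assumptions.

```python
import math

def information_bits(frequency):
    """Self-information in bits of an event with the given frequency (assumed form)."""
    return -math.log2(frequency)

def implied_frequency(bits):
    """Invert the assumed relation: frequency that would yield the given information."""
    return 2.0 ** (-bits)

# I_i values copied from Table 4 for the lowest and highest entries of the column.
table4_information = {"Leu": 3.492, "Gly": 3.805, "Trp": 6.236}

implied = {
    residue: implied_frequency(bits)
    for residue, bits in table4_information.items()
}

for residue, frequency in implied.items():
    print(f"{residue}: I = {table4_information[residue]} bit, implied frequency = {frequency:.4f}")
```

Under this assumption a frequent residue such as Leu carries little information per occurrence, whereas a rare residue such as Trp carries the most, which matches the ordering of the I_i column.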
