A sequence-to-structure library has been created based on the complete PDB database. The tetrapeptide was selected as a unit representing a well-defined structural motif. Seven structural forms were introduced for structure classification. The early-stage folding conformations were used as the objects for structure analysis and classification. The degree of determinability was estimated for the sequence-to-structure and structure-to-sequence relations. Probability calculus and informational entropy were applied for quantitative estimation of the mutual relation between them. The structural motifs representing different forms of loops and bends were found to favor particular sequences in structure-to-sequence analysis.
Prediction of
three-dimensional protein structures remains a major challenge to
modern molecular biology. On the one hand, identical pentapeptide
sequences exist in completely different tertiary structures in
proteins [1]; on the other, different amino acid sequences
can adopt approximately the same three-dimensional structure. However, the
patterns of sequence conservation can be used for protein
structure prediction [2, 3, 4]. Usually, secondary structure
definition has been used for ab initio methods as a
common starting conformation for protein structure prediction
[5]. A large body of experimental and theoretical evidence
suggests that local structure is frequently encoded in short
segments of protein sequence. A definite relation has been found
between the amino acid sequence of a region and the supersecondary
structure into which it folds; this relation was also found to be
independent of the remaining sequence of the molecule [6, 7].
Early studies of local sequence-structure relationships and
secondary structure prediction were based on either simple
physical principles [8] or statistics [9, 10, 11, 12].
Nearest-neighbor methods use a database of proteins with known
three-dimensional structures to predict the conformational states
of a test protein [13, 14, 15, 16]. Some methods are based on
nonlinear algorithms known as neural nets [17, 18, 19] or
hidden Markov models [20, 21, 22, 23]. In addition to
studies of sequence-to-structure relationships focused on
determining the propensity of amino acids for
predefined local
structures [24, 25, 26, 27], others involve determining
patterns of sequence-to-structure correlations [21, 22, 28, 29, 30]. The evolutionary information contained in multiple
sequence alignments has been widely used for secondary structure
prediction [31, 32, 33, 34, 35, 36, 37, 38]. Prediction of
the percentage composition of α-helix, β-strand, and
irregular structure based on the percentage of amino acid
composition, without regard to sequence, permits proteins to be
assigned to groups, as all α, all β, and mixed
α/β [5, 39].

Structure representation is simplified in many models. Side chains
are limited to one representative virtual atom; virtual
Cα − Cα bonds are often introduced to decrease the
number of atoms present in the peptide bond [40, 41]. The
search for a structure representation in a conformational space
other than that of the φ, ψ angles continues [42].

Other models are based on limitation of the conformational space.
One of them divided the Ramachandran map into four low-energy
basins [43, 44]. In another study, all sterically allowed
conformations for short polyalanine chains were enumerated using
discrete bins called mesostates [45]. The need to limit the
conformational space was also asserted [46, 47].

The model introduced in this paper is based on limitation of the
conformational space to the particular part of the Ramachandran
map. The structures created according to this limited
conformational subspace are assumed to represent early-stage
structural forms of protein folding
in silico.

In this paper, in contrast to the commonly used base of final native
structures of proteins, the early-stage folding conformation of
the polypeptide chain is the criterion for structure
classification.

Two approaches are the basis for the early-stage folding model
presented in this paper.

(1) The geometry of the polypeptide chain can be expressed using
parameters other than φ, ψ angles. These new
parameters are the V-angle—dihedral angle between two
sequential peptide bond planes—and the R-radius, radius of
curvature, found to be dependent on the V-angle in the form of a
second-degree polynomial. Details on the background of the
geometric model based on the V, R [48, 49] are
recapitulated briefly in “appendix A.”

(2) The structures satisfying the V-to-R relation appeared to
distinguish the part of the Ramachandran map (the complete
conformational space) delivering the limited conformational
subspace (ellipse path on the Ramachandran map). It was shown
that the amount of information carried by the amino acid is
significantly lower than the amount of information needed to
predict φ, ψ angles (a point on the Ramachandran map). These
two amounts of information can be balanced after restricting the
conformational space to the subspace
distinguished by the simplified model presented above. Details on
the background of the information-theory-based model [50] are
reviewed briefly in “appendix B.”

The conformational subspace found to satisfy the geometric
characteristics (polypeptide chain reduced to its peptide bond
planes, with side chains ignored) and the condition of information
balancing appeared to select the part of the Ramachandran map which can
be treated as the early-stage conformational subspace.

The introduced model of early-stage folding was extended to make it
applicable to the creation of starting structural forms of proteins for an
energy-minimization procedure oriented to protein structure prediction.
Characterizing the sequence-to-structure and
structure-to-sequence contingency tables and assessing their possible
applicability is the aim of this paper.

The structures created according to the limited conformational
subspace can be reached in two different ways: (1) by partial
unfolding (Figures 1a–1e) and (2) as the
basis for the initial structure assumed to represent early-stage
folding (Figures 1f–1j). The partial
unfolding of the native structural form (called the “step-back”
structure in this paper) is expressed by changing the φ,
ψ angles to the φ_sb, ψ_sb angles
(the φ_sb, ψ_sb angles belong to the ellipse path,
and their values are obtained according to the criterion of the
shortest distance between φ, ψ and the ellipse, as shown
in Figure 1b). The second approach, in which the
structure is created on the basis of the φ_es,
ψ_es angles (φ_es, ψ_es denote the
dihedral angles belonging to the ellipse and representing a
particular probability maximum), is based on the library of
sequence-to-structure relations for tetrapeptides.
Figure 1
(a–e) Step-back unfolding path: (a) native structure of 1APJ, (b)
partial unfolding procedure, (c) step-back conformation according
to the limited conformational subspace, (d) example of
amino-acid-dependent probability profile (Glu) for complete PDB
2003 after moving φ, ψ angles to the nearest point on
the ellipse path, (e) letter codes assigned according to
probability profiles. (f–j) Folding simulation path: (f)
early-stage structure prediction in terms of structural letters,
(g) an example of a discrete profile (Glu) applied to early-stage
structure creation, (h) predicted early-stage conformation of
1APJ, (i) late-stage folding simulation procedure (under
consideration—not applied yet), (j) structure of 1APJ as a
result of the energy-minimization procedure with proper disulphide
bridges constraints.
A scheme summarizing the two procedures, partial unfolding and
partial folding, is shown in Figure 1. The
procedure called partial unfolding starts at the native structure
of the protein (Figure 1a). The values of the φ,
ψ angles present in the protein are changed (according to
the shortest-distance criterion) to the values of the angles
belonging to the ellipse (φ_sb, ψ_sb). When these
dihedral angles are applied, the structure of the same protein
looks as shown in Figure 1c. When this procedure is
applied to all proteins present in the Protein Data Bank, a
probability profile can be obtained which represents the
distribution of φ, ψ angles in the limited
conformational subspace. The distribution is different for each
amino acid, although some characteristic maxima can be
distinguished. The profile shown in Figure 1d represents
Glu (the ellipse equation t-parameter = 0° represents the
point of φ = 90° and ψ = −90°, and then increases
clockwise). Particular probability maxima can be recognized using
the letter codes also shown in Figure 2. These letter
codes are used to classify the structures of proteins in their
early-stage folding (in silico) (Figure 1e).
Figure 2
Letter codes for
structure classification. (a) The ellipse-path-limited conformational
subspace in relation to φ, ψ angles as they appear in
real proteins. Arrows denote the shortest-distance criterion
for definition of φ, ψ angles belonging to the ellipse
for arbitrarily selected points. (b) Probability maxima as they appear
along the ellipse (starting t-point shown in (c)) and
corresponding letter codes for structure identification.
(c) Limited conformational subspace with fragments distinguished
according to probability maxima shown in (b).
The opposite procedure, aimed at protein folding, is also shown in
Figures 1f–1j. The starting point in this
procedure is the amino acid sequence of a particular protein.
After selecting four-amino-acid fragments (in an overlapping
system), four different structural codes (for the same
tetrapeptide) can be attributed on the basis of the
sequence-to-structure contingency table (Figure 1f). Only a particular
fragment of the probability profile (according to the letter code) can
be recognized in this case. In consequence, the φ_es,
ψ_es values representing the location of the probability
maximum on the t-axis can be attributed to a particular sequence
(Figure 1g). This is why the φ_es,
ψ_es angles differ from φ_sb, ψ_sb. In
consequence, the structure of the transforming growth factor
β binding protein-like domain (protein selected as an
example, PDB ID: 1APJ) created according to the φ_es,
ψ_es angles shown in Figure 1h differs from
the (φ_sb, ψ_sb)-based structure. The “sb”
(step-back) and “es” (early-stage) structures differ due to
the continuous form of the probability distribution in the “sb”
procedure and the discrete one in the “es” procedure. The
next step in the prediction procedure is energy minimization,
which in some cases causes approach toward the native structure
(Figure 1j).

The structures created according to the ellipse path, treated as
the starting structures for the energy-minimization procedure,
deliver forms that approach the native structure after one simple
optimization procedure. BPTI [51], ribonuclease [50], and to
some extent also human hemoglobin α and β chains
[52] and lysozyme [53] were used as the model molecules.
All these examples showed that the ellipse-path-limited
conformational subspace helped define the initial structure for
the energy-minimization procedure, leading to proper, native-like
structures without any forms inconsistent with protein-like ones.
When the energy-minimization procedure is not sufficient to
deliver the proper native-like structure of the protein (which can
be seen in Figures 1a and 1j), an
additional procedure is necessary (Figure 1i). It is
under study now and will be published in the near future.
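As an illustration of the folding path in Figures 1f and 1g, the following Python sketch reads a sequence in overlapping four-residue windows and collects, for every residue, the structural letters proposed by each window that covers it. The function name `collect_codes` and the `library` dictionary are hypothetical stand-ins for a lookup in the contingency table; keeping a single structure per tetrapeptide (for example the most probable one) is an assumption of this sketch, and no rule for combining the collected letters is implied.

```python
def collect_codes(sequence, library):
    """Collect candidate structural letters for each residue.

    `library` is a hypothetical dict mapping a tetrapeptide sequence to one
    four-letter structure code (assumed here to be its most probable one).
    Interior residues are covered by four windows and can receive four letters.
    """
    candidates = [[] for _ in sequence]
    for start in range(len(sequence) - 3):
        structure = library.get(sequence[start:start + 4])
        if structure is None:
            continue  # tetrapeptide absent from the library
        for offset, letter in enumerate(structure):
            candidates[start + offset].append(letter)
    return candidates


# Invented two-entry library, used only to exercise the function.
toy_library = {"AAAA": "CCCC", "AAAG": "CCCE"}
toy_candidates = collect_codes("AAAAG", toy_library)
```

In this toy case the second to fourth residues each receive two letters, one from each window that covers them; a real sequence would give up to four letters per residue.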
MATERIAL AND METHODS
Early-stage folding structure classification
All proteins present in PDB (release January 2003) were taken for
analysis [54]. Letter codes have been used for sequence
identification. A letter code system is introduced in this paper
for structure representation in protein early-stage folding
(in silico) based on the probability distribution of
φ, ψ angles along the ellipse-path-limited
conformational subspace (see “appendix B”). To easily
distinguish the structure codes versus sequence codes, the former
are printed in bold and the latter in italics in this work.

Comparison of distributions between three-state secondary
structures indicated four-amino-acid fragments as the most common
ones for α-helices, β-strands, and loops [21, 55]. The tetrapeptide was adopted as the unit for investigation of
the sequence-structure relation.

The probability distribution along the ellipse, which is assumed
to represent the limited conformational subspace, is the basis
for the structure classification introduced in this paper. The
profile of the probability distribution (of all amino acids) along
the ellipse path is shown in Figure 2.
Figure 2a shows the usual distribution of φ, ψ angles as found in proteins together with the ellipse path.

The procedure of moving particular φ, ψ angles to the
ellipse path is also shown in Figure 2a. The shortest
distance between particular φ, ψ angles (a point on the
Ramachandran map) and a point belonging to the ellipse path
locates the φ_e, ψ_e (e denotes belonging to the ellipse)
dihedral angles determining the early stage for a particular amino
acid of the polypeptide chain. After moving all φ, ψ
angles to the ellipse path, the profile of the probability
distribution can be obtained, as shown in Figure 2b.
The t-parameter is the ellipse parameter present in the equation
shown in “appendix A.” The t-parameter equal to zero
represents the point φ = 90° and ψ = −90°
on the Ramachandran map and increases clockwise, as shown in
Figure 2c. Seven probability maxima can be
distinguished in this profile. Each of them is letter coded.

This coding system was applied to classify the structures of all
proteins analyzed. The codes introduced according to the
probability distribution shown in Figure 2b are
interpreted as follows: C (t-value range) represents
right-handed helical structures, E represents β-structural
forms, and G represents left-handed helices.
The β-structural forms are differentiated
(some amino acids, such as Ala, Ser, and Asp, reveal two
probability maxima [50]); this is why code F also represents
β-like structures.

Although the remaining letters represent structural forms not
identified in the traditional classification, the presence of
probability maxima suggests the need to distinguish these
categories (code A mostly for Pro and Gly, code B represented
mostly by Asn and Asp, and code D characteristic for Tyr and Asn,
to take a few examples).
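The shortest-distance ("step-back") projection described above can be sketched in Python as follows. The ellipse used here is hypothetical: its centre, minor semi-axis, and tilt are placeholders and are not the values of the equation in “appendix A.” The major semi-axis and the sign convention are chosen only so that t = 0 falls at φ = 90°, ψ = −90° and t increases clockwise, as stated in the text. The grid search and the periodic wrapping of angle differences are likewise assumptions of this sketch, not the published procedure.

```python
import math

# Hypothetical ellipse in (phi, psi) space, in degrees. Placeholder values,
# NOT those of appendix A; see the assumptions stated in the lead-in.
CENTER = (0.0, 0.0)
SEMI_MAJOR = 90.0 * math.sqrt(2.0)  # puts t = 0 at (90, -90) for a -45 degree tilt
SEMI_MINOR = 40.0
TILT_DEG = -45.0


def ellipse_point(t_deg):
    """Return (phi, psi) on the illustrative ellipse for parameter t in degrees."""
    t = math.radians(t_deg)
    tilt = math.radians(TILT_DEG)
    x = SEMI_MAJOR * math.cos(t)
    y = -SEMI_MINOR * math.sin(t)  # minus sign makes t run clockwise
    phi = CENTER[0] + x * math.cos(tilt) - y * math.sin(tilt)
    psi = CENTER[1] + x * math.sin(tilt) + y * math.cos(tilt)
    return phi, psi


def wrapped_diff(a, b):
    """Signed difference a - b folded into [-180, 180) degrees."""
    return (a - b + 180.0) % 360.0 - 180.0


def step_back(phi, psi, step=1.0):
    """Grid search for the ellipse point closest to (phi, psi).

    Returns (t, phi_e, psi_e), with t in degrees on a grid of width `step`.
    """
    best = None
    for k in range(int(round(360.0 / step))):
        t = k * step
        phi_e, psi_e = ellipse_point(t)
        d2 = wrapped_diff(phi, phi_e) ** 2 + wrapped_diff(psi, psi_e) ** 2
        if best is None or d2 < best[0]:
            best = (d2, t, phi_e, psi_e)
    return best[1:]


# Example call for an arbitrary point of the map.
t_value, phi_e, psi_e = step_back(-60.0, -45.0)
```

Collecting the returned t-values over many residues and binning them would give a probability profile of the kind shown in Figure 2b; the bin boundaries that define the letter codes are not reproduced here.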
The contingency table
A window size of four amino acids (analogous to the open reading
frame in nucleotide identification) with one amino acid step
(overlapping system) was applied to code the sequences and
structures in proteins. Potentially 160 000 (20^4) different
sequences for tetrapeptides can occur (columns). Taking seven
different structural forms for each amino acid in a tetrapeptide,
2401 (7^4) structural forms can be distinguished for a
tetrapeptide (rows). These numbers give an idea of the size of the
contingency table under consideration. For all cells, probability
values of $p^{t}_{ij}$, $p^{c}_{ij}$, and $p^{r}_{ij}$ were calculated as follows:

$$p^{t}_{ij} = \frac{n_{ij}}{N^{t}}, \tag{1}$$

$$p^{c}_{ij} = \frac{n_{ij}}{N^{c}_{j}}, \tag{2}$$

$$p^{r}_{ij} = \frac{n_{ij}}{N^{r}_{i}}, \tag{3}$$

where i denotes a particular structure (row), j denotes a
particular sequence (column), $n_{ij}$ is the number of
polypeptide chains belonging to the ith structure and
representing the jth sequence, $N^{t}$ is the total number of
ORFs (four-residue reading frames), and $N^{r}_{i}$ and $N^{c}_{j}$ denote the number of ORFs
belonging to a particular ith structure and jth sequence,
respectively. The table expressing all probabilities ($p^{t}$, $p^{c}$, and $p^{r}$) is available on request at http://www.bioinformatics.cm-uj.krakow.pl/earlystage/.
All values are expressed on a
logarithmic scale because of the very low probability values in
the cells of the table.
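A minimal Python sketch of the table construction is given below, assuming that each chain is available as a pair of equal-length strings (sequence letters and structure letters). The function names and the toy chains are illustrative only; the three ratios are the cell count divided by the grand total, by the column total, and by the row total, respectively.

```python
from collections import Counter


def windows(codes, size=4):
    """Overlapping windows of `size` letters, shifted by one position."""
    return [codes[k:k + size] for k in range(len(codes) - size + 1)]


def contingency(chains):
    """Count tetrapeptides per (structure, sequence) cell.

    `chains` is an iterable of (sequence, structure) strings of equal length.
    Returns the cell counts, the row totals (per structure), the column
    totals (per sequence), and the grand total.
    """
    n = Counter()
    for sequence, structure in chains:
        for seq4, str4 in zip(windows(sequence), windows(structure)):
            n[(str4, seq4)] += 1
    rows, cols = Counter(), Counter()
    for (str4, seq4), count in n.items():
        rows[str4] += count
        cols[seq4] += count
    return n, rows, cols, sum(n.values())


def cell_probabilities(n, rows, cols, total):
    """Return {(structure, sequence): (p_t, p_c, p_r)} for every nonzero cell."""
    return {
        (str4, seq4): (count / total, count / cols[seq4], count / rows[str4])
        for (str4, seq4), count in n.items()
    }


# Invented toy chains, used only to exercise the functions.
toy = [("AAAAA", "CCCCC"), ("AAAAG", "CCCCE")]
n, rows, cols, total = contingency(toy)
probs = cell_probabilities(n, rows, cols, total)
```

Only nonzero cells are stored, which suits a table of this size: most of its cells are necessarily empty. A logarithm of each value can be taken afterwards; the base used for the published table is not stated, so none is fixed here.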
Information entropy as a measure of sequence-to-structure and structure-to-sequence predictability
High values of
probability calculated as above (relative to potential probability
values) can disclose highly coupled pairs of structure and
sequence. Ranking the probability values can extract the highly
determined relations for both sequence-to-structure and
structure-to-sequence.

Structural predictability can also be measured using informational
entropy calculation. According to Shannon's definition [56],
the amount of information can be calculated
as follows:

$$I_{i} = -\log_{2} p_{i}, \tag{4}$$

where $I_{i}$ expresses the amount of information (in bits)
dependent on $p_{i}$, the probability of event i. This definition
is very useful for measuring the amount of information carried by
a particular simple (elementary) event. In the case of a complex
event, for which a few solutions are possible, informational entropy
can be calculated, expressing the level of uncertainty in
predicting the solution. Informational entropy according to
Shannon's definition is as follows:

$$SE = -\sum_{i=1}^{n} p_{i} \log_{2} p_{i}, \tag{5}$$

where n is the number
of possible solutions for the event under consideration
(number of elementary events).

SE reaches its maximum value ($SE^{\max}$) for all $p_{i}$ equal to each
other, that is, each ith solution is equally probable for the
event under consideration and no solution is preferred. The
maximum value depends on the number of possible solutions for the
event (n).

SE equal to zero (one $p_{i}$ equal to 1.0) represents the
determinate case in which only one solution is possible. The
higher the difference between $SE^{\max}$ and SE, the
higher the degree of determinability in the given case. A high
value of this difference means that the case is realized by
a few solutions and that some of them occur with higher
probability, which can be interpreted as a case with higher
determinability (biased event).

SE, $SE^{\max}$, and the values of the differences
between them can be calculated for all rows, $SE^{r}$ (structural
preferences versus amino acid sequence), and for columns, $SE^{c}$
(sequence preference for a particular structural form), in the
contingency table. $SE^{c}$ allowed extraction of structures
highly determined by the sequence; $SE^{r}$ extracted structures
highly attributed to a particular sequence.

The $SE^{c}$ value for each column (particular
sequence) in the contingency table was calculated as follows:

$$SE^{c}_{j} = -\sum_{i=1}^{N_{0}} p^{c}_{ij} \log_{2} p^{c}_{ij}, \tag{6}$$

where $SE^{c}_{j}$ denotes informational entropy for the j-column,
i denotes a particular row (structure), $N_{0}$ is the number
of nonzero cells in the j-column, and $p^{c}_{ij}$ is calculated
according to (2).

The value $SE^{c}_{j}$ as calculated according to (6)
measures the level of uncertainty in predicting structure for the
jth sequence. The closer the SE value to zero, the higher the
chance of a correct prediction.

$SE^{c,\max}_{j}$ expresses quantitatively the level of
uncertainty in the most difficult case for making a decision. For
the j-column (sequence):

$$SE^{c,\max}_{j} = -\sum_{i=1}^{N_{0}} p^{c,\max}_{ij} \log_{2} p^{c,\max}_{ij}, \tag{7}$$

where $SE^{c,\max}_{j}$ denotes maximum informational
entropy for the j-column, i denotes a particular row
(structure), $N_{0}$ is the number of nonzero cells in the
j-column, and $p^{c,\max}_{ij}$ denotes the value of
probability in a column under the assumption that all nonzero
cells are equally represented (the principal condition for
$SE^{c,\max}_{j}$). In other words, for all nonzero cells
($i = 1, \ldots, N_{0}$) in the j-column, $p^{c,\max}_{ij}$
can be calculated as follows:

$$p^{c,\max}_{ij} = \frac{1}{N_{0}}. \tag{8}$$
Thus the difference between the two quantities ((6) and
(7)) can be used as the “distance” between the most
difficult situation (all solutions equally possible, that is, a random
solution) and the situation observed in the case under
consideration. For the j-column:

$$\Delta SE^{c}_{j} = SE^{c,\max}_{j} - SE^{c}_{j}. \tag{9}$$

Analogous calculations for rows (structures) were performed. For
each i-row, the values of $SE^{r}_{i}$, $SE^{r,\max}_{i}$,
and $\Delta SE^{r}_{i}$ were calculated.
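The entropy measures of this section can be sketched in Python as below. The function names are illustrative; the input is assumed to be the list of cell counts of one column (or one row) of the contingency table, and the maximum entropy is written as log2 of the number of nonzero cells, which is what the equal-probability condition reduces to.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def entropy_summary(counts):
    """Return (SE, SE_max, delta_SE) for one column or row of cell counts.

    Assumes at least one nonzero count. Probabilities are the counts divided
    by their sum, so they match the column-wise (or row-wise) probabilities.
    """
    nonzero = [c for c in counts if c > 0]
    total = sum(nonzero)
    se = shannon_entropy(c / total for c in nonzero)
    se_max = math.log2(len(nonzero))  # entropy of N_0 equally probable cells
    return se, se_max, se_max - se


# Invented counts: four equally populated cells versus one dominant cell.
uniform = entropy_summary([1, 1, 1, 1])
biased = entropy_summary([6, 1, 1, 0])
```

For the uniform counts the difference is zero (no solution is preferred), while the biased counts give a positive difference, which is the sense in which a larger difference indicates higher determinability.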
RESULTS
Structures coded according to the introduced system
Structures of all proteins present in the PDB (release January
2003) [56] were analyzed. The φ, ψ angles were
calculated for each amino acid. The φ_e, ψ_e angles
were calculated according to the shortest distance to the
ellipse. A letter code was assigned for each amino acid according
to the ellipse path fragment. Since the tetrapeptide was used as
the structural unit, four letters coded one structural unit. The
overlapping reading frame system was applied, which means that one
amino acid step was applied in structure classification. The
maximum number of combinations of seven letter codes in a four-letter string
is equal to 2401. This means that 2401 different four-letter
strings were expected to be found. It turned out that only 2397
different strings were found in real proteins. Since there are
20 amino acids and four amino acids were taken for the unit,
160 000 different sequences of tetrapeptides were expected;
146 940 different sequences were found in the proteins under
consideration.
Contingency table
Each tetrapeptide found in proteins was described by a four-letter
string expressing the sequence and a four-letter string expressing
the structure. Each tetrapeptide with a known sequence and known
structure can be ordered in the form of a table. The rows of the
table represent structures and the columns represent sequences.
Finally a 2397 × 146 940 table was constructed. To
distinguish the structure codes from sequence codes, sequence
codes are in bold capital letters and structure codes in italics.
The scheme of the contingency table is presented in
Table 1. The total number of tetrapeptides in the
analyzed database was found to be 1 529 987. Global analysis
of the contingency table shows that the maximum number of
different structures attributed to the same tetrapeptide is 144.
This tetrapeptide appeared to be of the sequence GSAA. The
maximum number of different sequences was found for α-helix
(CCCC: 90 587) and for β-structure (EEEE: 47 809).
Four structures were not found in the library: ABAB, ABBD,
ABFB, DBAB.
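The counts quoted above can be cross-checked with a few lines of Python. The sketch enumerates all four-letter strings over the seven structural letters, removes the four strings reported as absent, and derives two ratios from the numbers given in the text; the variable names are illustrative.

```python
from itertools import product

LETTERS = "ABCDEFG"                          # the seven structural codes
MISSING = {"ABAB", "ABBD", "ABFB", "DBAB"}   # strings reported as not found

all_structures = {"".join(p) for p in product(LETTERS, repeat=4)}
observed_structures = all_structures - MISSING

# Fraction of the 20^4 possible tetrapeptide sequences found in the database.
sequence_coverage = 146_940 / 20 ** 4

# Upper bound on the fraction of table cells that can be nonzero: each of the
# 1 529 987 tetrapeptides fills at most one cell of the 2397 x 146 940 table.
max_fill = 1_529_987 / (2397 * 146_940)

print(len(all_structures), len(observed_structures))  # 2401 2397
```

With these numbers, about 92% of the possible sequences are observed, while fewer than 0.5% of the table cells can be nonzero, so the table is very sparse.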
Table 1
Scheme of the sequence-structure contingency table. Symbols explained in text.

| Structure \ Sequence | 1 | 2 | ... | j | ... | 160 000 |
|---|---|---|---|---|---|---|
| 1 | n11, p11t, p11c, p11r | n12, p12t, p12c, p12r | ... | n1j, p1jt, p1jc, p1jr | ... | N1r |
| 2 | n21, p21t, p21c, p21r | n22, p22t, p22c, p22r | ... | n2j, p2jt, p2jc, p2jr | ... | N2r |
| ... | ... | ... | ... | ... | ... | ... |
| i | ni1, pi1t, pi1c, pi1r | ni2, pi2t, pi2c, pi2r | ... | nij, pijt, pijc, pijr | ... | Nir |
| ... | ... | ... | ... | ... | ... | ... |
| 2 401 | N1c | N2c | ... | Njc | ... | Nt |
Information entropy calculation
SE, $SE^{\max}$, and the value of the
difference between these two quantities (ΔSE) were
calculated according to the procedure presented in
“material and methods.” They can be calculated for columns (sequences) and for
rows (structures) separately. The calculation of $SE^{c}_{j}$
for
the j-column expresses the information entropy related to the
structural differentiation of a particular sequence. The
calculation of $SE^{r}_{i}$ for the i-row in the contingency table
expresses the sequential differentiation for a particular
structure. $SE^{\max}$, according to information entropy
characteristics, expresses the entropy for the case in which each
of all the nonzero cells represents equal probability. For
$SE^{c,\max}_{j}$, all structures for a
particular sequence are equally probable. Equal probability for a
set of elementary events (different structures) represents the
random situation. The bigger the difference
$SE^{\max} - SE$, the more deterministic the case. This
is why the difference (ΔSE) between SE and
$SE^{\max}$ was taken to measure the degree of
structure-to-sequence (or vice versa) determination.

The interpretation of Tables 2 and 3 is as
follows. The structural predictability for a particular sequence
can be estimated in the first case, and the predictability of the
sequence for a particular structure in the latter case. The
results for only the top ten structures and top ten sequences are
shown in Tables 2 and 3.
Table 2
Sequence-to-structure relation measured according to the value of the
difference ($\Delta SE^{c}$) between entropy of information
($SE^{c}$) calculated for the probability values found in the
contingency table (particular column) and maximum entropy of
information ($SE^{c,\max}$), which (according to the
characteristics of entropy of information) is reached for equal
probability values in each nonzero cell in a particular
column.

| Sequence | Structure | $SE^{c}$ (bit) | $SE^{c,\max}$ (bit) | $\Delta SE^{c}$ (bit) |
|---|---|---|---|---|
| AAAA | CCCC | 2.29 | 6.44 | 4.15 |
| GDSG | GCFG | 1.57 | 5.49 | 3.92 |
| AVRR | CCCC | 1.04 | 4.95 | 3.91 |
| LAAA | CCCC | 1.77 | 5.61 | 3.84 |
| EAEL | CCCC | 1.37 | 5.21 | 3.83 |
| LDKA | CCCC | 1.30 | 5.09 | 3.78 |
| DAAV | CCCC | 0.69 | 4.46 | 3.77 |
| AKLK | CCCC | 0.76 | 4.52 | 3.77 |
| DSGG | CFGF | 1.97 | 5.73 | 3.76 |
| ELAA | CCCC | 1.30 | 5.04 | 3.75 |
Table 3
Structure-to-sequence relation measured according to the value of the
difference ($\Delta SE^{r}$) between entropy of information
($SE^{r}$) calculated for the probability values found in the
contingency table (particular row) and maximum entropy of
information ($SE^{r,\max}$), which (according to the
characteristics of entropy of information) is reached for equal
probability values in each nonzero cell in a particular
row.

| Structure | Sequence | $SE^{r}$ (bit) | $SE^{r,\max}$ (bit) | $\Delta SE^{r}$ (bit) |
|---|---|---|---|---|
| GCFG | GDSG | 4.82 | 7.99 | 3.17 |
| AEED | GLRL | 3.86 | 6.81 | 2.95 |
| BACE | GGAE | 2.20 | 5.09 | 2.89 |
| EAEG | IGIG | 4.79 | 7.68 | 2.89 |
| AEGE | GIGH | 4.74 | 7.63 | 2.89 |
| BFBE | PEPV | 2.28 | 5.13 | 2.85 |
| AEGD | GNES | 2.09 | 4.91 | 2.82 |
| EBCB | ELPD | 3.68 | 6.38 | 2.70 |
| EBFB | FBEP | 2.57 | 5.17 | 2.60 |
| AFFP | GFRN | 2.03 | 4.58 | 2.55 |
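As a worked check of the tabulated values, the Python sketch below recomputes the difference between the maximum and the observed entropy for each row of Table 2, and estimates the number of nonzero cells behind each maximum. The estimate relies on the fact that the entropy of N equally probable cells is log2 N, so N is about 2 raised to the tabulated maximum; because the table is rounded to two decimals, the estimates are approximate. The function name is illustrative.

```python
# (sequence, SE_c, SE_c_max, delta_SE_c) in bits, copied from Table 2.
TABLE_2 = [
    ("AAAA", 2.29, 6.44, 4.15),
    ("GDSG", 1.57, 5.49, 3.92),
    ("AVRR", 1.04, 4.95, 3.91),
    ("LAAA", 1.77, 5.61, 3.84),
    ("EAEL", 1.37, 5.21, 3.83),
    ("LDKA", 1.30, 5.09, 3.78),
    ("DAAV", 0.69, 4.46, 3.77),
    ("AKLK", 0.76, 4.52, 3.77),
    ("DSGG", 1.97, 5.73, 3.76),
    ("ELAA", 1.30, 5.04, 3.75),
]


def implied_nonzero_cells(se_max_bits):
    """Approximate number of equally probable cells giving this maximum entropy."""
    return round(2 ** se_max_bits)


for sequence, se, se_max, delta in TABLE_2:
    # The tabulated difference agrees with SE_max - SE up to rounding.
    assert abs((se_max - se) - delta) < 0.015
    print(sequence, implied_nonzero_cells(se_max))
```

For AAAA, for instance, the tabulated maximum of 6.44 bits corresponds to roughly 87 distinct structures observed for that sequence.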
The highest structural predictability, found for the AAAA sequence,
confirms polyalanine as a highly probable helical structure.
Generally, the highly predictable structures for particular
sequences are helical forms (Table 2).

The sequence predictability for particular structural forms
displayed a quite unexpected regularity. The structures
representing irregular structural forms appeared to reveal the
strongest entropy decrease versus the random distribution of
sequences. This can be seen by analyzing the letter codes for the
structures (Table 3).

The top ten structures presented in Table 3 are also
shown in Figure 3. In summary, one can say that when a
particular irregular structural form is expected in a protein,
there are preferable sequences to build these irregular motifs;
they are shown in Table 3. This seems to be of
particular relevance for threading procedures oriented to the
production of new proteins not observed in nature.
Figure 3
Structures of
tetrapeptides with highest structure-to-sequence determinability
as found using informational entropy calculation (see “material and
methods” and Table 1). Gray terminal fragments
represent the extended form of polyalanine (tetrapeptides) to
emphasize the mutual spatial orientation of terminal fragments.
Other colors distinguish ellipse fragments as follows: red (A),
green (B), violet (C),
sky-blue (D), yellow (E), dark
blue (F), orange (G). The data for creation of these
structures is given in Table 1 and
Figure 2.
DISCUSSION
Particular classes of amino acid relations to particular
structural forms in proteins were recently found to help solve the
problem of structure predictability [57]. All papers
concerning this subject linked sequence with structure as it
appears in the final native form of the protein. The model
introduced in this paper represents an approach to the relation
between sequence and structure in the early-stage folding
structural form; the bases for the model are presented
in detail elsewhere [48, 49, 50], and verified by BPTI
[51], ribonuclease [50], hemoglobin [52], and
lysozyme [53] folding. The (in silico) early-stage
structures of these proteins can be found in the corresponding
publications.

Several algorithms for quantitatively assigning α-helix,
β-strand, and loop regions for proteins with known
structure have been developed [58, 59, 61]. The
three-dimensional model
presented in this paper shows that it is enough to select seven
fragments of the ellipse with well-defined probability maxima to
be able to predict the early-stage structural form.

The high structure-to-sequence relation found for loops
(Table 3, Figure 3) may be particularly
important, since a recent survey of 31 genomes indicated that
disordered segments longer than 50 residues are very prevalent
[62]. Helices, sheets, and turns together account for only
about 50%–55% of all protein structure on average [63];
the remaining structures are classified as several types of loops
[63, 64]. Current estimations suggest that over 50% of
proteins in eukaryotes may carry unstructured regions of
more than 40
residues in length [65], while less than 1% of the
proteins in the PDB contain such long disordered regions. These
observations taken together imply that many proteins with
disordered regions would be unlikely to form crystals [66].
Proteins containing long, disordered segments under physiological
conditions are frequently involved in regulatory functions
[67], and the structural disorder may be relieved upon
binding of the protein to its target molecule [68, 69].
Intrinsically unstructured proteins and regions, which are also
known as natively unfolded and intrinsically disordered, differ
from structured globular proteins and domains with regard to many
attributes, including amino acid composition, sequence complexity,
hydrophobicity, charge, flexibility, and type and rate of amino
acid substitutions over evolutionary time [66]. Compared to
highly ordered secondary structure regions, the loops and turns
are more difficult to identify due to the absence of hydrogen
bonding and repeating backbone dihedral angle patterns [70].
The first computational tool indicating the predictability of
disordered regions from protein sequence [71] was a neural
network predictor (PONDR). Several other disorder predictors have
been published since then [72, 73, 74]. A statistically based
turn propensity applied over a four-residue window has been described
[75]. The inverse folding problem is the design of protein
sequences that have a desired structure [76, 77]. It is
impossible to mention even a small part of the papers dealing with
the sequence-to-structure relation. Recently, it was concluded
that the probability of any state (φ, ψ) is influenced
by the full sequence and not only by the local structure
[78].

A genome-scale fold recognition program exploring the
knowledge-based structure-derived score function for a particular
residue was proposed incorporating three terms: backbone torsion,
buried surface, and contact energy [79].

Unlike many others, our model, dual in nature, incorporating sequential and
structural information, predicts sequence-to-structure as well as
structure-to-sequence.

The contingency table was independently analyzed
using another statistics-related method (Meus J, Stefaniak J. The
Z coefficient as a measure of dependence in contingency tables
(unpublished data), Meus J, Brylinski M, Piwowar P, et al. A
tabular approach to the sequence-to-structure relation in
proteins (unpublished data)). High accordance was found between
the results presented in this paper and in the statistical
analysis: the top ten sequences and structures presented in
Table 1 were found to be among the most highly
correlated, both in sequence-to-structure and in
structure-to-sequence, on the ranking list created by the
alternate calculation method. The order of the two ranking lists
is very similar, additionally confirming the reliability
of the model presented.

Aside from early-stage structure prediction, the contingency
table presented may contribute to conventional secondary
structure prediction, local and supersecondary structure
prediction, location of transmembrane regions in proteins,
location of genes, or sequence design.

The list of highly determinable tetrapeptides (in
sequence-to-structure and structure-to-sequence relations) also
allowed the SPI (structure predictability index) scale to be
defined [80]. Applied to amino acid sequences, this scale
helps to measure the degree of difficulty of structure prediction
for a particular amino acid sequence without knowledge of the
final, native structure of the protein.

The sequence-to-structure and structure-to-sequence contingency
table, which is created on the basis of all proteins of known
structure (step-back procedure), can be used to create the
early-stage folding (in silico) structure. Applied to
other (late-stage folding) procedures, it presumably can enable
protein structure prediction. The early-stage form was used as the
object for comparison to simplify the presentation of the
structure (seven possibilities). The SPI (structure predictability
index) parameter, attributed to any amino acid sequence, allows
estimation of the degree of difficulty in structure prediction.
The probability values (which can be higher or lower) taken from
particular cells of the contingency table can tell how often a
particular structure occurs in the protein database so far. The
information entropy-based classification presented in this paper
allows highly distributed structural forms to be distinguished for
a particular tetrapeptide sequence.
Table 4
Amount of information (I (bit)) carried by a
particular amino acid, calculated on the basis of its frequency,
and amount of information (in bits; φ_e,
ψ_e denote φ, ψ angles belonging to the ellipse)
necessary to predict the structure belonging to the ellipse path
(early-stage folding conformational subspace) with 10°
step of t-angle precision (see ellipse equation in “appendix A”).
Detailed analysis of the data shown in this table can be found
elsewhere [50].

| Amino acid | Amount of information carried by amino acid | Averaged amount of information necessary to predict the ellipse-belonging structure |