Mitsuaki Narita1, Masakuni Narita2, Yasuko Itsuno3, Shinichi Itsuno3. 1. Department of Biotechnology & Life Science, Tokyo University of Agriculture and Technology, Naka-machi 2-24-16, Koganei, Tokyo 183-8588, Japan. 2. Research Laboratory, Nihon Pharmaceutical Co., Ltd., Shinizumi 34, Narita 286-0825, Japan. 3. Department of Environmental and Life Sciences, Toyohashi University of Technology, 1-1 Hibarigaoka, Toyohashi 441-8580, Japan.
Abstract
To the best of our knowledge, this is the first study that shows that the X-ray structures of proteins can be dissected into their continuous folding structure units. Each folding structure unit was designed such that both the terminal di- or tri-peptide sequences shared common sequences with the two adjacent folding structure units. To encode the folding structure information of proteins into their amino acid sequences, we proposed 44 kinds of folding elements, which covered all of the amino acids in the protein chains, and defined all folding structure units. The folding element was defined to mean a minimum structural piece, which covered the frame of the main chain of each amino acid in a protein chain. A folding structure unit of a local sequence could be fully characterized by the sequential combination of individual folding elements assigned to each amino acid. The folding structure information showed amino acid preferences in various positions in folding structure units. Folding structure formation proceeded on the basis of probability theory. Strikingly, relative formation ability analysis clearly indicated that we can decode the types and the chain length of folding structure units from the amino acid sequence of a protein.
Anfinsen’s basic tenet of protein folding maintains that
the information determining the native structure of a protein is encoded
in its amino acid sequence.[1,2] Research conducted to
understand protein folding has strongly supported his proposal.[3−14] In kinetic folding, we consider that folding structures, rather than secondary structures, repeatedly unfold and refold. Secondary
structure is a local part of a tertiary structure or, in other words,
is the conformation of a local sequence. In aqueous solution of a
globular protein, a nucleus can grow rapidly by addition of peptide
chain segments to direct protein folding.[15] Globular proteins fold to create separately cooperative folding structures along folding pathways. However, they
are inherently unstable, and noncovalent tertiary interactions among
the folding structures are primarily responsible
for the formation of secondary and tertiary structures.[15−27] One of the important objectives of this article is to define the folding structure units (Figure 1), which can be verified by X-ray
structures of proteins. Folding structures of proteins having
X-ray structures could be statistically analyzed to yield the folding
structure information. Here, we confirmed for the first time that
folding structure formation can be derived from folding structure
information, on the basis of probability theory. Folding structure,
instead of secondary structure, enabled us to understand protein folding
on an amino acid level. The reason why secondary structure is not
suitable for decoding of secondary structure information is described
below.
Figure 1
Amino acid sequence and the notation of a folding structure unit.
Nearly 50% of amino acids in globular proteins are in either α-helix
or β-strand forms.[28,29] Namely, half of the
sequences form simple secondary structures, and residual secondary
structures are irregular structures. As three-dimensional structures
of globular proteins are available from PDB,[30] their secondary structures, except for irregular structures, can
be defined using dihedral angles (ϕ, ψ) of each amino
acid. However, the secondary structure information has never been
encoded in amino acid sequences precisely as many kinds of secondary
structures still have no definition. Although N-cap and C-cap residues
of α-helices have been defined,[31] those of other secondary structures are ambiguous. As a result,
many kinds of secondary structures cannot be assigned to local sequences
of proteins. Thus, the introduction of folding structures, which cover
all of the protein chains, is indispensable for the dissection of
folding structure information. Furthermore, the tertiary structure
of a protein is derived from its secondary structures. Thus, the relationship
between folding and secondary structures is critical for conformational
analysis of tertiary structures of proteins.

Protein folding pathways must be correctly described in the amino
acid sequences of proteins.[14] However,
recent examples of protein fold switching indicate that an amino acid
sequence of a polypeptide chain can encode a stable fold, while simultaneously
hiding latent propensities for alternative states with novel functions.[32] Thus, detailed encoding and decoding of the
folding structure information is the most important step to understand
protein folding. The objective of this study is to develop statistical
decoding of the folding structure information encoded in the 20 kinds
of amino acids.

In 1988, Richardson reported the precise amino acid preferences
for 17 individual positions relative to the α-helical ends.[33] Using Richardson’s analysis, we analyzed
the amino acid preferences in type II β-turn and type I α-turn.[34,35] Although there is no mention of decoding of the folding structure
information in these reports, these
treatments involve a study of the normalized folding structure information
encoded in the 20 kinds of amino acids. This is exactly the encoding
and decoding of folding structure information that we realized in
this article. Decoding of the folding structure information enables
us to understand the folding structure formation of a local sequence
on the basis of probability theory.

Detailed encoding of the folding structure information into the
amino acid sequences of proteins requires precise definition of the
folding structure units. Two-dimensional representations of native
structures of proteins, derived from the backbone dihedral angles
(ϕ, ψ) and the amino acid sequences, clearly displayed
their three-dimensional structures.[34] The
representations distinctly indicate that the precise definition of
cap residues of all of the folding structure units is possible, using
the relationship between dihedral angles and amino acid sequences.
Furthermore, the precise definition of cap residues strongly suggests
the existence of overlapping regions of folding structure units at
both the terminal regions and develops the design concept of folding
structure units.

Here, we propose 44 kinds of folding elements to
encode the folding structure information into amino acid sequences
of proteins. By use of folding elements, we designed the folding structure
units on the basis of the design concept that protein folding can
be derived from continuous folding structure units with overlapping
regions. Subsequently, the folding structure information for the amino
acid preferences in the 44 different folding elements was encoded
and decoded statistically. Finally, on the basis of probability theory,
folding structure formation was demonstrated by relative formation
ability (RFA) analysis. Folding structure, instead of secondary structure,
enabled us to evaluate the RFA value of a local sequence for any local
structure. The decoding of the folding structure information appears
to yield the initiation mechanism of protein folding.
Results and Discussion
Definition of Folding Structure Units
To clarify the
position of each amino acid of the folding structure units, relative
to their cap residues, we introduced symbols termed folding
elements. A folding element is a minimum structural
piece, which covers the frame of the main chain of each amino acid
in protein chains, and represents the position and single or multiple
conformational regions of each amino acid in the folding structure
units. Here, we proposed 44 kinds of folding elements and defined
folding structure units that cover all of the protein chains. As an
identical amino acid expresses folding elements of α-helices,
β-strands, and irregular structures in protein chains, each
of the 20 kinds of amino acids in protein sequences possessed characteristics
described by our 44 kinds of folding elements. For the statistical
treatment, the folding structure units were classified into four categories:
α-helix (H), α-turn and β-turn (T), β-strand
(S), and interconnecting folding structure units (HH, HS, SH, and
SS).

Folding structure units can be fully characterized by local
sequences, assigned with a single folding element for each amino acid
(Figure 1). As a result,
the notation of a folding structure unit can be represented by a sequential
combination of folding elements. The local sequence of a protein forms
a variety of dynamic folding structures in the denatured state. The
particular dynamic folding structure is fixed to the static folding
structure in its native structure by tertiary interactions.
Design of α-Helix, Turn, and β-Strand Folding
Structure Units
On the basis of two-dimensional representations
of native structures of proteins, each of the folding structure units
was designed such that both the terminal di- and tri-peptide sequences
shared common sequences with two adjacent folding structure units.
A common sequence forms a single irregular structure, which always
functions as an overlapped folding structure. It is statistically
treated as an overlapping sequence.

Primarily, we determined
the cap residues of the folding structure units on the basis of the
continuity of the backbone dihedral angles (ϕ, ψ) of the
α- and β-regions.[34,36] First, N-cap and C-cap
residues of H and T folding structure units were located at both ends
of the sequence having continuous α-helix dihedral angles. Dihedral
angles of the cap residues were out of the α-region, as shown
in Figure 2, for the
H19 folding structure unit of protein GB1, the immunoglobulin-binding
B1 domain of protein G.[37] The sequence
of protein GB1 is shown in Figure 9.
Figure 2
H19 (α-helix with 19 amino acid residues) structure unit.
Figure 9
Amino acid sequence of protein GB1, the notation of its continuous
folding structure units, and RFA values of local sequences for respective
folding structure units.
N-Cap and C-cap residues were assigned to folding elements and . For
the terminal amino acids positioned outside the caps,
the terminal folding elements of ′ and ′ were ascribed. The terminal amino acids were coincident with cap
or internal residues of two adjacent folding structure units and conserved
their characteristics. Both the terminal di- and tri-peptide sequences
of folding structure units were always common sequences (Figure ), which are signals
of termination and initiation of adjacent folding structure units;
therefore, continuous folding structure units can be derived from
the presence of these common sequences.

The notations of H folding structure units were as follows. For
example, when seven continuous α-region residues were found
in a protein chain, these residues were assigned to folding elements . In this example, the eleven
residues were assigned to folding elements ′′ and denoted an H11 folding structure unit.
The seven folding elements (, , , , , , and ) were
in a single conformational region (α-region), and the four folding
elements (′, , , and ′) were in any of the
multiple conformational regions ( and are in a region other than α-region).
When a sequence contained more than 12 amino acids, all of the central
residues were assigned to folding element (Figure ). Figure shows the notation
of an α-helix (H19) sequence of 19 amino acids V21–V39 in protein GB1 together with that of its secondary
structure. The notations of secondary structures can be represented
one-dimensionally using conformational regions, instead of folding
elements. Folding element was introduced
to designate H9 folding structure units, with folding elements ordered
as ′′. The shortest
H folding structure unit (H8) is formed with eight amino acids assigned
to the folding elements, ′′.

Hexa- and hepta-peptide sequences formed T6 and T7 folding structure
units. The notations of T6 and T7 units are shown in Figure 3. Folding elements and were
assigned to cap residues in a T6 folding structure unit. Hexa- and
hepta-peptide sequences were assigned to T6 (′′) and T7 (′′) folding structure units, respectively. Residues assigned to and were
out of the α-region. T6 unit comprised type I and III β-turns,[38] and T7 unit comprised type I α-turn.[35] The three folding elements, , , and , were in the α-region, and the four folding elements, ′, , , and ′, were in any of the multiple conformational
regions.
Figure 3
T7 (α-turn) and T6 (β-turn) structure units.
Second, the dihedral angles of cap residues in the S folding structure
unit were still associated with the β-region. To avoid the complication
of overlapped folding structures, which always grow at both terminal
sequences of folding structure units (Figure ), a different definition of the cap residues
of S folding structure units was introduced. The cap residues of S
folding structure units were located at both ends of the continuous
β-region residues, and the folding elements ′···′ were used for the longer S folding
structure units, where the central residues in β-strand sequences
longer than a heptapeptide sequence were assigned to folding element . N-Cap and C-cap residues were assigned to
strand folding elements and , respectively. For the terminal amino acids
positioned outside the caps, the terminal folding elements of ′ and ′ were ascribed. The terminal amino
acids were coincident with cap or internal residues of two adjacent
folding structure units and conserved their characteristics. The S5
folding structure units that consist of pentapeptide sequences were
expressed by the sequence of folding elements ′′, where N-cap and C-cap residues were assigned
to and (Figure 4). The eight
folding elements, , , , , , , , and ; were in the β-region; and the four folding elements, ′, ′, ′, and ′, were in any of the multiple conformational regions, other than
the β-region.
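The length rules described in this section can be condensed into a short sketch. It is an illustration under stated assumptions, not the procedure used to build the data set: a run of r continuous α-region residues is taken to be flanked by one cap and one terminal residue on each side (so the unit spans r + 4 residues, as in the example where seven α-region residues give H11), whereas a run of r continuous β-region residues already contains both caps and gains only one terminal residue on each side (r + 2 residues). The function name `unit_name`, the string labels for the regions, and the decision to return `None` for shorter runs are hypothetical choices; in particular, treating S5 as the shortest S unit is an assumption.

```python
from typing import Optional


def unit_name(region: str, run_length: int) -> Optional[str]:
    """Name the folding structure unit implied by a run of continuous residues.

    region     -- "alpha" or "beta" (illustrative labels for the two regions)
    run_length -- number of consecutive residues found in that region
    """
    if region == "alpha":
        # H and T caps lie outside the alpha-region, so the unit adds
        # two cap residues and two terminal residues to the run.
        if run_length == 2:
            return "T6"
        if run_length == 3:
            return "T7"
        if run_length >= 4:
            return f"H{run_length + 4}"
        return None  # shorter runs are not covered by this sketch
    if region == "beta":
        # S caps are the first and last beta-region residues of the run,
        # so only the two terminal residues are added.
        if run_length >= 3:  # assumption: S5 is the shortest S unit
            return f"S{run_length + 2}"
        return None
    raise ValueError(f"unknown region: {region!r}")


# Seven continuous alpha-region residues give the H11 unit from the text.
example_h = unit_name("alpha", 7)
# Three continuous beta-region residues give a pentapeptide S5 unit.
example_s = unit_name("beta", 3)
```

Under these assumptions, the shortest helix unit H8 corresponds to four α-region residues, and the S7 unit corresponds to five β-region residues.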
Figure 4
S7 (β-strand with seven amino acid residues) and S5 (β-strand
with five amino acid residues) structure units.
Design of Interconnecting Folding Structure Units
Interconnecting
folding structure units were located between the two types (H (T)
and S) of folding structure units, whose cap residues are determined
on the basis of the continuity of the backbone dihedral angles (ϕ,
ψ) of the α- and β-regions. Both the terminal dipeptide
sequences (cap and terminal amino acids) of interconnecting folding
structure units were always coincident with the terminal dipeptide
sequences of two adjacent folding structure units. The interconnecting
sequences were designed to conserve the characteristics of two adjacent
folding structure units.

When both sides of a pentapeptide sequence
were β-strands, the sequence was assigned to an SS5 folding
structure unit. The SH5, HS5, and HH5 folding structure units were
also assigned (Figure 5). Symbol H of these structure units included H and T folding structure
units. Cap
folding elements 1–4 were ascribed depending on
the adjacent folding structure unit. For the cap residues of the interconnecting
folding structure unit, the cap folding elements of 1 (adjacent to and ), 2 (adjacent to and ), 3 (adjacent to and ), and 4 (adjacent to and ) were defined. For the amino acids positioned
outside the caps, the terminal folding elements of 1′, 2′, 3′, and 4′ were
ascribed. Only the folding elements 1′ and 3′ were in the β-region. For the amino acids positioned inside
the caps, the internal folding elements of were ascribed. Figure 5 shows the notations of interconnecting folding structure units formed
by a pentapeptide sequence.
Figure 5
Interconnection structure unit (1).
A minimum interconnecting sequence was a tripeptide sequence, and
both the terminal dipeptide sequences conserved the characteristics
of the two adjacent folding structure units. In about 2% of the amino
acids in proteins, single amino acid residues can be assigned to between cap elements of the adjacent folding
structure units. In such cases, is
replaced by 13 (between
β-strand and β-strand), 14 (between β-strand and helix), 23 (between helix and β-strand),
and 24 (between helix
and helix) (Figure 6). A helix in parenthesis consisted of an α-helix and turn
folding structure units. The local sequence A20–D22 in protein GB1 (Figure ) was an example of this type of interconnecting sequence.
When a tripeptide sequence was found between S and H folding structure
units, the sequence was assigned as an SH3 folding structure unit.
The SS3, HS3, and HH3 folding structure units were also assigned.
All of the folding elements, except for 1′ and 3′, of the interconnecting folding structure units were in any of the
multiple conformational regions.
Figure 6
Interconnection structure unit (2).
Contrast between Folding and Secondary Structures
The
X-ray structures of globular proteins could be dissected into their
continuous folding structure units on the basis of the precise definition
of N-cap and C-cap amino acids of H, T, and S folding structure units.
Fundamentally, all of the folding structure units of globular protein
were determined on the basis of the continuity of the backbone dihedral
angles (ϕ, ψ) of the α- and β-region. The
continuity of folding structure units with overlapping regions at
both the terminal sequences was clearly displayed in Figure using protein GB1 chain. The
overlapping region shared a common sequence with the two adjacent
folding structure units. The common sequences were statistically treated
as overlapping sequences, which were assigned overlapped folding structures
independently. Any of the folding structures of protein GB1, as well
as globular proteins, could be represented by a local sequence of
individual folding element assigned to each amino acid. All of the
local sequences of globular proteins were covered with folding structures.
Thus, the folding structure information of all of the local sequences
for their respective folding structures could be analyzed.

The
notations of all of the folding structure units were represented by
sequential combinations of specific folding elements. Out of the 44
folding elements, 21 were in a single conformational region. They
are as follows: , , , , , , , , , , , , , , , , , , , 1′, and 3′. However, the remaining 23 folding elements
represented multiple conformational regions. The most significant
feature of a folding structure, in relation to a secondary structure,
is that the assignment of individual folding elements to each amino
acid of a local sequence enables the determination of precisely simple
secondary structures with continuous α- or β-region residues.
Concurrently, we can specify the terminal sequences whose secondary
structures cannot be determined. Most significantly, we may consider
dynamic local structures in the denatured state using folding structures,
as the folding structure information is expected to be derived from
Richardson’s analysis.[33]

The tertiary structure of a protein is derived from its secondary
structures. Thus, the relationship between folding and secondary structures
is critical for conformational analysis of tertiary structures of
proteins. The notation of secondary structure is represented by the
sequential combination of conformational regions assigned to each
amino acid, although the definition of conformational regions is generally
ambiguous. To differentiate between folding and secondary structures,
both the notations, assigned to the local sequence V21–V39 of protein GB1, are represented in Figure . Amino acids assigned to and were
in any of the multiple conformational regions, other than the α-region.[36] Secondary structures of proteins were primarily
classified into two categories: simple secondary structures, having
continuous α- or β-region residues, and irregular, undefined
structures. Simple secondary structures are internal folding structures,
and most of the irregular secondary structures are terminal folding
structures. Conformational regions could not be assigned to many portions
of local sequences of proteins, as many kinds of secondary structures
remain undefined.

Furthermore, a common sequence forms a single irregular secondary
structure, which always functions as an overlapped folding structure.
On the contrary, identical terminal folding structures of different
proteins contain multiple secondary structures. The secondary structure
information has never been decoded, as the relationship between secondary
structures and amino acid sequences is complicated. Thus, we cannot
discuss dynamic local structures in the denatured state by secondary
structures.

Originally, the precise definition of all of the folding structure
units enabled detailed sequence analysis for the formation of simple
secondary structures and irregular structures. Most importantly, sequences
forming simple secondary structures and irregular structures were
characterized by assignment of the 44 kinds of folding elements. Folding
structures of proteins included single or multiple secondary structures,
for example, terminal folding structures of proteins composed of multiple
irregular structures.
Internal folding structures, except for interconnecting folding structure
units, consisted of single simple secondary structures. The precise
definition of cap residues enabled the determination of single simple
secondary structures having continuous α- or β-region
residues and sequences forming irregular structures, on the basis
of the assignment of individual folding elements to each amino acid
of a local sequence.

For terminal regions of H folding structure units, we could determine
the terminal sequences precisely, but not the secondary structures.
Any of the simple secondary structures of α-helices could be
determined precisely as the H folding structure units are characterized
by 12 kinds of folding elements. Similarly, simple secondary structures
formed by central di- or tri-peptide sequences of T6 and T7 folding
structure units could be determined precisely as the T folding structure
information was encoded by 13 kinds of folding elements. For the S
folding structure units, any of the simple secondary structures having
continuous β-region residues could be determined precisely as
the S folding structure information was encoded by 12 kinds of folding
elements. The precise definition of interconnecting folding structure
units enabled determination of the interconnecting sequences that
form irregular structures, but not secondary structures. Statistical
analysis of the relationship between interconnecting sequences and
secondary structures was expected to provide the secondary structure
information.
Encoding the Folding Structure Information of Proteins
By the assignment of individual folding elements of folding structure
units to each amino acid of local sequences, the folding structure
information could be encoded into the amino acid sequences of proteins.
Folding structure units were determined on the basis of the continuity
of the backbone dihedral angles (ϕ, ψ) of the α-
and β-regions. The dihedral angles used for data set preparation
were as follows: α-region: −130 ≤ ϕ ≤
−30, −80 ≤ ψ ≤ +30; β-region:
−180 ≤ ϕ ≤ −45, +90 ≤ ψ
≤ +180. N-Cap and C-cap residues of the H, T, and S folding
structure units were located at both ends of the sequence. The dihedral
angles of the cap residues of H and T folding structure units were
out of the α-region, whereas the dihedral angles of the cap
residues in the S folding structure unit were in the β-region.
Both the terminal di- and tri-peptide sequences of H, T, and S folding
structure units always belonged to two adjacent folding structure
units in a protein chain. Each amino acid in an overlapping local
sequence was independently assigned to each of the overlapped folding
elements. Interconnecting folding structure units were located between
the α-region sequences and the β-region ones. The cap
residues of the interconnecting folding structure units were defined
depending on the adjacent folding structure unit. Both their terminal
dipeptide sequences always shared common sequences with the adjacent
folding structure units.

The folding structure information was
encoded into the protein GB1 chain as follows. Each amino acid in the protein
GB1 chain was assigned to one or more folding elements pertaining
to a folding structure unit. For example, the local sequence G41–D47 of protein GB1 formed S7 (G41–D47), and the terminal dipeptide sequence G41E42 and the terminal tripeptide sequence Y45–D47 were the common sequences in the adjacent
folding structure units (Figure ). In Figure , all of the amino acid residues were assigned to 77 folding
elements in total, along with the folding structure units. When a
tripeptide sequence formed an interconnecting folding structure unit,
the central amino acid residue expressed three folding elements (e.g.,
N8 and V21 in Figure ). The central amino acid was independently
assigned to two additional terminal folding elements of the adjacent
folding structure units. About 2% of the amino acids in the protein
data set expressed three folding elements. These data provided important
statistical values useful for analysis of the amino acid–folding
element relationship.
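The α- and β-region boundaries quoted in this section lend themselves to a short sketch of the first encoding step: labelling each residue by region and collecting the continuous runs from which H, T, and S units are located. This is a minimal illustration, assuming that every residue has both ϕ and ψ available and that angles are given in degrees; the function names `region` and `continuous_runs` and the example angles are hypothetical and are not taken from any PDB entry.

```python
from typing import List, Optional, Tuple


def region(phi: float, psi: float) -> Optional[str]:
    """Classify one residue using the region boundaries quoted in the text."""
    if -130 <= phi <= -30 and -80 <= psi <= 30:
        return "alpha"
    if -180 <= phi <= -45 and 90 <= psi <= 180:
        return "beta"
    return None  # any other conformational region


def continuous_runs(
    angles: List[Tuple[float, float]],
) -> List[Tuple[str, int, int]]:
    """Return (region, first_index, last_index) for each maximal run.

    Indices are 0-based and inclusive; residues outside both regions
    break a run and are not reported.
    """
    labels = [region(phi, psi) for phi, psi in angles]
    runs = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the chain or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] is not None:
                runs.append((labels[start], start, i - 1))
            start = i
    return runs


# Made-up angles: four alpha-region residues, one residue in neither
# region, then three beta-region residues.
example_angles = [(-60.0, -45.0)] * 4 + [(60.0, 30.0)] + [(-120.0, 130.0)] * 3
example_runs = continuous_runs(example_angles)
```

For the made-up angles above, `example_runs` contains one α-region run covering indices 0 to 3 and one β-region run covering indices 5 to 7. The cap and terminal residues would then be placed around each run as described for the H, T, and S units.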
Figure 7
Overlapped folding structure.
Decoding the Folding Structure Information
The protein
data set prepared above includes 1 000 666 (T10) amino
acid residues, which we assigned to the 44 folding elements. The common
sequence was statistically treated as overlapping local sequences.
The number of amino acids assigned to α-helix folding element in the data set was 41 752 (T0) in total. The number of Ala assigned to
α-helix folding element was
1383 (TA).

The number of amino
acid residues assigned to a particular folding element could be regarded
as the number of the folding element. As the number of amino acid
residues that were transformed into α-helix folding element in the protein chains used in this study
was 41 752 (T0), the formation
probability of all of the amino acid residues found in the protein
data set for α-helix folding element was 41 752/1 000 666 (Q0 = T0/T10). This value was
termed “folding element value” (Table 1).
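The arithmetic behind the folding element value, and the normalization named in the title of Table 1, can be reproduced from the counts reported in this section. The sketch below is illustrative only: the variable and function names are hypothetical, and the Ala total of 81 527 is the T1A count quoted in the text. NFP is computed as the formation probability of one amino acid for a folding element divided by the folding element value.

```python
# Counts quoted in the text for the worked alpha-helix folding element example.
T_ALL = 1_000_666      # all residues in the data set (T10)
T_ELEMENT = 41_752     # residues assigned to the folding element (T0)
T_ALA = 81_527         # Ala residues in the data set (T1A)
T_ALA_ELEMENT = 1_383  # Ala residues assigned to the folding element (TA)


def folding_element_value(n_element: int, n_all: int) -> float:
    """Q0: formation probability of all residues for one folding element."""
    return n_element / n_all


def nfp(n_aa_element: int, n_aa: int, n_element: int, n_all: int) -> float:
    """Normalized formation probability: (TA / T1A) / (T0 / T10)."""
    return (n_aa_element / n_aa) / folding_element_value(n_element, n_all)


q0 = folding_element_value(T_ELEMENT, T_ALL)
ala_composition = T_ALA / T_ALL
nfp_ala = nfp(T_ALA_ELEMENT, T_ALA, T_ELEMENT, T_ALL)

print(f"Q0 = {q0:.3f}")                            # Q0 = 0.042
print(f"Ala composition = {ala_composition:.1%}")  # Ala composition = 8.1%
print(f"NFP(Ala) = {nfp_ala:.2f}")                 # NFP(Ala) = 0.41
```

These rounded values agree with entries in Table 1: the Ala composition of 8.1%, a folding element value of 0.042, and the Ala entry of 0.41 in column a. The agreement is consistent with the worked example referring to that column, although the arithmetic alone does not establish it.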
Table 1. Normalized Formation Probability (NFP) Values of the 20 Kinds of Amino Acids for the 44 Kinds of Folding Elements

| X | P1X (a) | a′ | a | b | c | d | e | f | g | h | i | i′ | j | k | l′ | l | m | n | o | p | p′ | r′ | r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 8.1% | 0.75 | 0.41 | 1.1 | 1.3 | 1.2 | 1.5 | 1.3 | 1.4 | 1.1 | 0.68 | 0.76 | 1.0 | 1.4 | 0.47 | 0.95 | 0.77 | 0.82 | 0.76 | 0.74 | 0.77 | 0.60 | 0.97 |
| C | 1.4% | 1.0 | 1.1 | 0.56 | 0.51 | 0.67 | 0.91 | 0.77 | 0.70 | 1.1 | 1.2 | 0.81 | 0.54 | 1.2 | 0.52 | 1.2 | 1.2 | 1.3 | 1.3 | 1.4 | 0.75 | 0.63 | 1.2 |
| D | 5.8% | 0.77 | 3.0 | 0.83 | 1.6 | 1.6 | 0.65 | 0.61 | 0.92 | 0.86 | 0.90 | 1.1 | 2.0 | 1.2 | 1.7 | 0.72 | 0.48 | 0.38 | 0.35 | 1.6 | 1.4 | 1.4 | 0.49 |
| E | 6.8% | 0.72 | 0.56 | 1.2 | 2.5 | 2.1 | 1.1 | 1.2 | 1.6 | 1.0 | 0.51 | 0.87 | 2.3 | 1.3 | 0.73 | 0.96 | 0.78 | 0.68 | 0.69 | 0.71 | 0.82 | 0.71 | 0.94 |
| F | 4.1% | 1.3 | 0.38 | 0.92 | 0.66 | 0.89 | 1.1 | 0.98 | 0.67 | 1.2 | 1.1 | 0.75 | 0.54 | 1.5 | 0.42 | 1.2 | 1.1 | 1.4 | 1.5 | 0.93 | 0.57 | 0.63 | 1.5 |
| G | 7.3% | 1.1 | 1.6 | 0.46 | 0.76 | 0.65 | 0.42 | 0.30 | 0.39 | 0.37 | 3.5 | 1.1 | 0.66 | 0.46 | 4.1 | 0.45 | 0.29 | 0.40 | 0.38 | 0.44 | 2.6 | 3.6 | 0.44 |
| H | 2.3% | 0.98 | 1.3 | 0.72 | 1.0 | 1.0 | 0.82 | 0.90 | 0.91 | 1.5 | 1.3 | 0.86 | 1.3 | 0.79 | 1.1 | 1.2 | 0.88 | 0.95 | 0.88 | 1.2 | 0.91 | 1.1 | 1.1 |
| I | 5.9% | 1.3 | 0.18 | 0.86 | 0.47 | 0.69 | 1.2 | 1.2 | 0.61 | 0.57 | 0.76 | 0.87 | 0.31 | 0.86 | 0.29 | 1.1 | 1.5 | 1.7 | 1.8 | 0.96 | 0.54 | 0.43 | 1.3 |
| K | 5.9% | 0.83 | 0.55 | 0.98 | 1.1 | 0.81 | 1.1 | 1.6 | 1.8 | 1.3 | 0.84 | 1.2 | 1.3 | 0.67 | 1.0 | 1.2 | 1.1 | 0.75 | 0.87 | 0.73 | 0.83 | 0.89 | 1.0 |
| L | 9.4% | 1.2 | 0.28 | 1.0 | 0.58 | 0.88 | 1.5 | 1.6 | 1.1 | 1.4 | 0.86 | 0.77 | 0.42 | 1.1 | 0.39 | 0.86 | 0.96 | 1.2 | 1.2 | 0.70 | 0.66 | 0.59 | 1.1 |
| M | 1.9% | 1.4 | 0.40 | 0.85 | 0.66 | 0.94 | 1.4 | 1.4 | 1.2 | 1.1 | 0.88 | 0.74 | 0.48 | 1.0 | 0.52 | 1.0 | 0.91 | 1.2 | 1.2 | 0.69 | 0.59 | 0.60 | 1.2 |
| N | 4.2% | 0.86 | 2.4 | 0.53 | 0.88 | 0.73 | 0.71 | 0.70 | 0.91 | 1.5 | 1.5 | 0.91 | 1.3 | 0.76 | 2.0 | 0.76 | 0.58 | 0.50 | 0.47 | 1.3 | 1.6 | 1.8 | 0.68 |
| P | 4.6% | 1.2 | 1.5 | 3.6 | 0.51 | 0.36 | 0.21 | 0.15 | 0.28 | 0.059 | 0.031 | 3.0 | 0.27 | 0.34 | 0.62 | 0.58 | 1.9 | 0.96 | 1.2 | 1.9 | 1.8 | 0.69 | 0.71 |
| Q | 3.7% | 0.80 | 0.57 | 0.83 | 1.2 | 1.6 | 1.2 | 1.2 | 1.5 | 1.4 | 0.89 | 0.87 | 1.2 | 0.99 | 0.89 | 1.1 | 0.86 | 0.77 | 0.78 | 0.70 | 0.69 | 0.88 | 0.97 |
| R | 5.1% | 0.84 | 0.63 | 0.92 | 0.94 | 0.78 | 1.2 | 1.4 | 1.6 | 1.2 | 0.84 | 0.98 | 1.0 | 0.84 | 0.88 | 1.1 | 1.0 | 0.87 | 0.93 | 0.74 | 0.74 | 0.83 | 1.0 |
| S | 5.9% | 0.83 | 2.4 | 0.87 | 1.3 | 0.84 | 0.69 | 0.73 | 1.1 | 1.1 | 0.87 | 0.95 | 1.9 | 1.0 | 0.95 | 1.1 | 0.79 | 0.79 | 0.66 | 1.6 | 1.0 | 0.93 | 0.92 |
| T | 5.5% | 1.0 | 1.8 | 0.79 | 0.92 | 1.0 | 0.73 | 0.55 | 0.64 | 1.2 | 0.64 | 0.97 | 1.0 | 0.88 | 0.86 | 1.3 | 1.2 | 1.2 | 1.0 | 1.4 | 1.0 | 0.90 | 1.1 |
| V | 7.3% | 1.2 | 0.18 | 0.84 | 0.53 | 0.88 | 1.0 | 0.85 | 0.57 | 0.51 | 0.63 | 0.82 | 0.37 | 0.96 | 0.33 | 1.3 | 1.7 | 2.0 | 1.9 | 1.1 | 0.57 | 0.47 | 1.2 |
| W | 1.4% | 1.2 | 0.39 | 1.3 | 1.2 | 0.92 | 1.1 | 1.0 | 0.72 | 0.88 | 0.75 | 0.71 | 0.99 | 1.8 | 0.45 | 1.2 | 1.4 | 1.1 | 1.3 | 0.86 | 0.65 | 0.65 | 1.6 |
| Y | 3.5% | 1.2 | 0.50 | 0.85 | 0.75 | 0.88 | 1.0 | 0.93 | 0.69 | 1.4 | 0.99 | 0.76 | 0.71 | 1.2 | 0.51 | 1.3 | 1.2 | 1.4 | 1.4 | 0.96 | 0.59 | 0.70 | 1.4 |
| Qx0 (b) | | 0.042 | 0.042 | 0.041 | 0.032 | 0.027 | 0.20 | 0.026 | 0.031 | 0.040 | 0.040 | 0.042 | 0.0086 | 0.002 | 0.042 | 0.041 | 0.04 | 0.10 | 0.041 | 0.041 | 0.043 | 0.013 | 0.013 |

(a) Composition of amino acid (X) in all of the amino acid residues found in the data set.
(b) Folding element value Qx0 (x = folding element) is the formation probability of all of the amino acid residues found in the data set for that folding element. ∑Q0 = 1.39.

As the terminal and cap residues always had two or three folding
elements, the sum total of folding element values of the protein data
set was not equal to 1.0, but to 1.39. This value means that about
40% of protein sequences were common sequences. As an example, the
sum total of the folding element values of protein GB1 was calculated
as follows. Each amino acid of protein GB1 is treated as a structural
piece of a folding structure. Each amino acid was assigned to one
or two folding elements, except for the N8 and V21 residues, which were assigned to three folding elements, as shown
in Figure . Protein
GB1 consists of 56 amino acids, which expressed 77 folding elements
in the peptide sequence. As the common sequences were treated statistically
as overlapping sequences, a total of 77 amino acids were assigned
to 77 folding elements. The total folding element value for protein
GB1 was 1.38 (77/56).

Using this approach, we could calculate the formation probability
of each amino acid for a given folding element. The number of Ala residues (T1A) in the data set prepared above was 81 527. Among them, 1383 (TA) expressed one particular α-helix folding element. Thus, the formation probability of Ala for this folding element was QA = TA/T1A = 1383/81 527. This probability was then normalized by the folding element value Q0, the formation probability of all of the amino acid residues for the same folding element. The resulting normalized formation probability (NFP = QA/Q0 in the case of Ala) expresses the formation preference of an amino acid for a folding element. In this way, NFP values of the 20 kinds of amino acids for the 44 folding elements were calculated (Table ). The NFP values constitute the normalized folding structure information of the amino acids for their respective folding elements. Each amino acid in a protein sequence expressed single or multiple folding elements through a dihedral angle transition, with a particular formation probability.

Table shows the
NFP values of amino acids for all of the 44 folding elements. These
results indicate that identical amino acids in a protein chain can
express different folding elements to form different folding structures.
As a result, the folding elements of an identical amino acid revealed
the chameleon character of the native structure. NFP values of amino
acids depended on the structure of their side chains. Groups of similar amino
acids, such as D–N, E–Q, F–W–Y, I–V,
K–R, L–M, and S–T, showed similar NFP values
for the 44 kinds of folding elements. Figure 8 shows the similarity between the NFP values
of L and M. The other groups also showed a high level of similarity
between their NFP values. These results clearly indicate that amino
acid preferences in positions of folding structure units strongly
depend on the similarity of amino acid residues and that probability
theory can be used to evaluate the folding structure information appropriately.
The determined NFP values for folding elements of α-helices
are in good agreement with amino acid preferences for specific locations
at the ends of α-helices.[32] Statistical
decoding of the folding structure information encoded in local sequences
may disclose protein folding pathways.
Figure 8. Similarity of the NFP values of Leu and Met.

The probability of occurrence of an Ala residue at the same α-helix folding element (PA = TA/T0) was found to be 1383/41 752. Comparison of this value with the occurrence probability of all of the Ala residues found in the data set (P1A = T1A/T10 = 81 527/1 000 666) showed the preference of Ala at this folding element. We used the “normalized occurrence probability” (NOP = PA/P1A in the case of Ala) to express the preference of an amino acid residue for appearing in a particular folding element. This statistical analysis used Richardson’s analysis as a basis for the position-specific amino acid preferences in α-helices, and the NOP values correspond to the position-specific amino acid preference values found by Richardson.[32] The NOP values in the α-helix region (′–′) were in good agreement with Richardson’s preference values. These values are the normalized folding structure information of α-helices encoded in the 20 kinds of amino acids. Small differences between the NOP values and Richardson’s preference values are caused by the definition of the helix region.[32] From the discussion above, the NFP value is mathematically equivalent to the NOP value of an amino acid at a folding element (eq ). The NFP values of the 20 kinds of amino acids for the 44 folding elements were calculated for the peptide sequence design.

Statistical analysis of folding element and amino acid in our data
set gave the number of occurrences of each amino acid at each folding element. The amino acid preferences were normalized by the occurrence probability of each amino acid, that is, by its overall percentage in our data set. The result gave the NOP values. We can compare these values in any sequence to discuss the folding structure. The NFP values in Table show the normalized folding structure information of each amino acid in the peptide sequences. They can be used to evaluate designed peptide sequences for folding structure formation.

$$\mathrm{NFP} = \frac{Q_X}{Q_0} = \frac{T_X / T1_X}{T_0 / T1_0} = \frac{T_X / T_0}{T1_X / T1_0} = \frac{P_X}{P1_X} = \mathrm{NOP}$$

where, for a given folding element, NFP is the normalized formation probability; NOP is the normalized occurrence probability; X is the amino acid residue; TX is the number of times the folding element is assigned to amino acid X, which equals the number of residues of amino acid X at that folding element; T1X is the number of residues of amino acid X in the data set; T0 is the total number of assignments of the folding element, which equals the number of all of the amino acid residues at that folding element; T10 is the total number of amino acid residues in the data set; QX = TX/T1X is the formation probability of amino acid X for the folding element; Q0 = T0/T10 is the formation probability of all of the amino acid residues in the data set for the folding element (the folding element value); PX = TX/T0 is the occurrence probability of amino acid X at the folding element; and P1X = T1X/T10 is the occurrence probability of amino acid X in the data set.
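
As a worked check of the stated equivalence between NFP and NOP, the following Python sketch recomputes both quantities from the Ala counts quoted in the text (1383, 81 527, 41 752, and 1 000 666). The function and variable names are illustrative assumptions and are not taken from the authors' software; exact rational arithmetic is used only so that the two values can be compared without rounding error.

```python
from fractions import Fraction


def nfp(t_x, t1_x, t_0, t1_0):
    """Normalized formation probability: Q_X / Q_0."""
    q_x = Fraction(t_x, t1_x)   # formation probability of amino acid X
    q_0 = Fraction(t_0, t1_0)   # folding element value (all residues)
    return q_x / q_0


def nop(t_x, t1_x, t_0, t1_0):
    """Normalized occurrence probability: P_X / P1_X."""
    p_x = Fraction(t_x, t_0)     # occurrence of X at the folding element
    p1_x = Fraction(t1_x, t1_0)  # occurrence of X in the whole data set
    return p_x / p1_x


# Counts quoted in the text for Ala at one alpha-helix folding element.
T_A = 1383         # Ala residues expressing the folding element
T1_A = 81_527      # Ala residues in the data set
T_0 = 41_752       # all residues expressing the folding element
T1_0 = 1_000_666   # all residues in the data set

ala_nfp = nfp(T_A, T1_A, T_0, T1_0)
ala_nop = nop(T_A, T1_A, T_0, T1_0)

print(f"NFP = {float(ala_nfp):.3f}, NOP = {float(ala_nop):.3f}")
print("identical:", ala_nfp == ala_nop)
```

Both ratios reduce to the same fraction, TA·T10/(T1A·T0), which is about 0.41 for these counts. A value below 1.0 would indicate that Ala is expressed at this folding element less often than an average residue.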
RFA Analysis of the Continuous Folding Structure Units of Protein GB1 Sequence
The normalized folding structure information of a local sequence gives the relative formation ability (RFA) value of that sequence for any folding structure. An RFA value is the product of the NFP values of the amino acids in the local sequence, each taken for its respective folding element. As all of the NOP values of the 20 kinds of amino acids in a peptide sequence were 1.0, the relative occurrence ability value of any local sequence in a peptide was 1.0, regardless of its chain length and amino acid sequence. As the NFP value of each amino acid in a peptide sequence was the relative folding structure information, the normalized folding structure information can be regarded as the relative folding structure information in the peptide sequences.

As a protein chain example for RFA analysis,
we chose the protein GB1 sequence (Figure ). The notations of the folding structure units of protein GB1 were assigned to the sequence by referring to a two-dimensional representation of the native structure of protein GB1.[34] Figure shows all of the amino acid residues assigned to the 77 folding elements in total, along with the continuous folding structure units. In the protein GB1 chain, the terminal sequences of adjacent folding structure units always overlapped each other. All of the RFA values for the folding structure units in the protein GB1 sequence were greater than 1.4.

Figure. Amino acid sequence of protein GB1, the notation of its continuous folding structure units, and RFA values of local sequences for respective folding structure units.

The internal folding structures of H, T, and S folding structure
units consisted of simple secondary structures, and the interconnecting sequences expressed irregular structures. The sequences forming simple secondary structures and irregular structures could be determined precisely on the basis of the assignment of individual folding elements to each amino acid of the protein GB1 sequence. For the H folding structure unit of protein GB1, we precisely determined the terminal dipeptide sequences, V21D22 and G38V39, forming irregular structures and the 15-residue sequence, A23–N37, forming the simple secondary structure. For the T folding structure units, we determined the common sequences, N8G9, L12K13, Y45–D47, and K50T51, forming irregular structures and the central di- and tri-peptide sequences, K10T11 and D47–T49, forming simple secondary structures. For the S folding structure units, the hexa- and penta-peptide sequences forming simple secondary structures were determined precisely. The RFA analysis clearly indicated that any of the simple secondary structures could be decoded from the protein GB1 sequence on the basis of the tertiary structure information. Precise decoding of simple secondary structures from the protein GB1 sequence will be reported elsewhere.

The folding structure units could be formed by simple accumulation
of folding elements encoded in amino acids along the local sequences,
on the basis of probability theory. This simple accumulation could be the
general solution for folding structure formation in the denatured
state. Describing the local structure formation of a protein in terms of folding structures,
instead of secondary structures, could lead to the first general solution
for local structure formation on the basis of probability theory.
To fully understand the continuous folding structure units,
the most important issue is to dissect the continuity of folding structure
units using RFA analysis and tertiary structure information.
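
The definition of the RFA value given at the start of this section (a product of per-residue NFP values) can be illustrated with a minimal Python sketch. The NFP numbers and the folding element labels "e1" to "e3" below are hypothetical placeholders, not values from the NFP table, and the sketch assumes exactly one folding element per residue within the unit being scored; how residues in common sequences are scored is not specified here.

```python
from math import prod


def rfa(sequence, elements, nfp_table):
    """Relative formation ability of a local sequence.

    sequence  : one-letter amino acid codes of the local sequence
    elements  : folding element label assigned to each residue
    nfp_table : mapping (amino acid, folding element) -> NFP value
    """
    if len(sequence) != len(elements):
        raise ValueError("one folding element is required per residue")
    # Multiply the NFP value of each residue for its own folding element.
    return prod(nfp_table[(aa, el)] for aa, el in zip(sequence, elements))


# Hypothetical NFP values for illustration only.
toy_nfp = {
    ("A", "e1"): 1.5,
    ("L", "e2"): 1.2,
    ("K", "e3"): 0.8,
}

toy_rfa = rfa("ALK", ["e1", "e2", "e3"], toy_nfp)
print(f"RFA = {toy_rfa:.2f}")
```

Because every factor is normalized to 1.0 for a residue with no preference, a product above 1.0 would indicate that the local sequence favors the folding structure being tested, which is how the threshold of 1.4 reported for protein GB1 can be read.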
Methods
Preparation of the Protein Data Set
The Protein Data Bank, with its well-determined three-dimensional protein structures, provides statistically meaningful data for analyzing the relationship between the 20 kinds of amino acids and the 44 kinds of folding elements. First, we extracted protein chains with minimal sequence similarity to one another to enable statistical analysis of the data. Folding structure units were determined on the basis of the continuity of the backbone dihedral angles (ϕ and ψ) of the α- and β-regions. The dihedral angle regions used for data set preparation are as follows:
α-region: −130° ≤ ϕ ≤ −30°, −80° ≤ ψ ≤ +30°;
β-region: −180° ≤ ϕ ≤ −45°, +90° ≤ ψ ≤ +180°;
ω-region: all regions other than the α- and β-regions.
The dihedral angle data used in this article were obtained from the Database of Secondary Structure in Proteins (DSSP) provided by Kabsch and Sander.[29] The N-cap and C-cap residues of a folding structure unit are located at the two ends of its sequence. The dihedral angles of the cap residues of the H and T folding structure units are outside the α-region, whereas the dihedral angles of the cap residues of the S folding structure unit are still in the β-region. This definition may avoid the complication of overlapped folding structures. Interconnection folding structure units are located between sequences in the α- or β-regions. Depending on the sequences before and after the interconnection folding structure unit, these units are classified into HH (between α-region sequences), HS (between α-region and β-region sequences), SH (between β-region and α-region sequences), and SS (between β-region sequences).

We used the CATH classification database[39] to avoid sequence similarity. One protein chain
was picked from each “Sequence Family” in the CATH classification,
each containing domains characteristic of the different Sequence Families.
Some protein chains contained repeated sequences. Chains with multiple
repeated sequences longer than a heptapeptide were removed from the
data set. A total of 4636 protein chains were selected for the data
set, which included 1 000 666 amino acid residues. Information
regarding the folding structure units and the N-terminal and C-terminal folding structures
was extracted from all sequences of the selected protein chains.
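
The dihedral angle regions defined earlier in this subsection can be expressed as a small classifier. This is a minimal sketch that assumes angles in degrees and inclusive boundaries, as the inequalities are written; the function name and the region labels "alpha", "beta", and "omega" are illustrative choices, and the sketch covers only the per-residue region, not the cap or interconnection rules.

```python
def classify_region(phi, psi):
    """Assign a residue to the alpha-, beta-, or omega-region.

    phi, psi: backbone dihedral angles in degrees.
    """
    # alpha-region: -130 <= phi <= -30 and -80 <= psi <= +30
    if -130 <= phi <= -30 and -80 <= psi <= 30:
        return "alpha"
    # beta-region: -180 <= phi <= -45 and +90 <= psi <= +180
    if -180 <= phi <= -45 and 90 <= psi <= 180:
        return "beta"
    # omega-region: everything outside the two regions above
    return "omega"


# Illustrative angle pairs (not taken from the data set).
for phi, psi in [(-60, -45), (-120, 130), (60, 45)]:
    print(phi, psi, classify_region(phi, psi))
```

The ψ ranges of the α- and β-regions do not overlap, so the order of the two tests does not affect the result.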
Extraction of All of the Folding Structure Units
Folding structure units were determined on the basis of the continuity of the backbone dihedral angles (ϕ and ψ) of the α- and β-regions. The dihedral angle regions used for data set preparation were described above. The common sequences were statistically treated as overlapping local sequences, and each of their amino acids was independently assigned to each of the overlapped folding elements. In this way, all of the folding structure units could be extracted from the protein chains, and each of the 20 kinds of amino acids was assigned to its appropriate folding elements.
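
The bookkeeping described above, in which a residue in a common sequence is counted once for every folding element it expresses, can be sketched as follows. The toy chain and the folding element labels are hypothetical, and the data layout (a list of residue and element-list pairs) is an assumption made only for illustration.

```python
from collections import Counter


def count_assignments(residues):
    """Count (amino acid, folding element) assignments.

    residues: list of (amino_acid, [folding elements]) pairs; a residue in a
    common sequence carries two or three folding elements.
    """
    counts = Counter()
    for amino_acid, elements in residues:
        for element in elements:
            counts[(amino_acid, element)] += 1
    return counts


def assignments_per_residue(residues):
    """Total number of assignments divided by the number of residues."""
    total = sum(len(elements) for _, elements in residues)
    return total / len(residues)


# Hypothetical four-residue chain with overlapping units.
toy_chain = [
    ("V", ["e1"]),
    ("D", ["e1", "e2"]),
    ("A", ["e2"]),
    ("N", ["e2", "e3", "e4"]),
]

toy_counts = count_assignments(toy_chain)
toy_ratio = assignments_per_residue(toy_chain)
print(toy_counts[("N", "e3")], toy_ratio)
```

Under this counting, the ratio of assignments to residues is the quantity reported as the sum total of the folding element values: 77/56 (about 1.38) for protein GB1 and 1.39 for the whole data set.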
Conclusions
By use of folding elements, we could design the folding structure
units based on the concept that protein folding should be derived
from continuous folding structure units. Each of the folding structure
units was designed so that both the terminal di- or tri-peptide sequences
shared common sequences with the two adjacent folding structure units.
The folding structure information showed amino acid preferences at specific
positions in folding structure units. This information is simply described as a
one-dimensional sequence of folding elements assigned to an amino
acid sequence. In the protein GB1 sequence, all of the RFA values for the
folding structure units verified by the X-ray structure were greater
than 1.4, and the continuous folding structure units led to the native
structure through tertiary interactions. In the protein chain of GB1,
the terminal sequences of adjacent folding structure units always
overlapped each other. The
simple accumulation of folding elements encoded in amino acids along
local sequences, on the basis of probability theory, could be the
general solution for folding structure formation.