Christopher Prior1, Owen R Davies2, Daniel Bruce3,4, Ehmke Pohl3,4. 1. Department of Mathematical Sciences, Durham University, Durham DH1 3LE, United Kingdom. 2. Institute for Cell and Molecular Bioscience, Medical School, University of Newcastle, Newcastle upon Tyne, NE2 4HH, United Kingdom. 3. Department of Biosciences Durham University, Durham DH1 3LE, United Kingdom. 4. Department of Chemistry, Durham University, Durham DH1 3LE, United Kingdom.
Abstract
Small angle X-ray scattering (SAXS) is an important tool for investigating the structure of proteins in solution. We present a novel ab initio method representing polypeptide chains as discrete curves used to derive a meaningful three-dimensional model from only the primary sequence and SAXS data. High resolution structures were used to generate probability density functions for each common secondary structural element found in proteins, which are used to place realistic restraints on the model curve's geometry. This is coupled with a novel explicit hydration shell model in order to derive physically meaningful three-dimensional models by optimizing against experimental SAXS data. The efficacy of this model is verified on an established benchmark protein set, and then it is used to predict the lysozyme structure using only its primary sequence and SAXS data. The method is used to generate a biologically plausible model of the coiled-coil component of the human synaptonemal complex central element protein.
Small angle X-ray scattering (SAXS) is an important tool for investigating the structure of proteins in solution. We present a novel ab initio method representing polypeptide chains as discrete curves used to derive a meaningful three-dimensional model from only the primary sequence and SAXS data. High resolution structures were used to generate probability density functions for each common secondary structural element found in proteins, which are used to place realistic restraints on the model curve's geometry. This is coupled with a novel explicit hydration shell model in order to derive physically meaningful three-dimensional models by optimizing against experimental SAXS data. The efficacy of this model is verified on an established benchmark protein set, and then it is used to predict the lysozyme structure using only its primary sequence and SAXS data. The method is used to generate a biologically plausible model of the coiled-coil component of the human synaptonemal complex central element protein.
Biological
small angle X-ray scattering (BioSAXS) is an increasingly
important method for characterizing protein structures in solution.[1−3] Its primary advantage over techniques such as crystallography and
NMR is its ability to provide information under native conditions
about large protein molecules not accessible by complementary methods.
However, there is a price to pay for this advantage: the random motion
and orientation of molecules in solution leads to a loss of information
due to an effective averaging of the scattering, leaving only information
about the protein’s intramolecular distances, not their spatial
orientations.[4] Because of these challenges,
correctly interpreting BioSAXS data to realize meaningful results
remains a challenging task.[5]Two
main methods have been developed to interpret BioSAXS data.
The first assumes an accurate three-dimensional (3D) model of the
protein backbone, usually derived from X-ray crystallography.[6−9] This model is used to calculate the X-ray scattering curve once
the excluded solvent volume is taken into account. A major advance,
first presented in the CRYSOL algorithm,[6] was the inclusion of the solvation layer—the ordered water
molecules at the surface of the protein. CRYSOL and the FoXS package,
which was developed by Schneidman-Duhovny et al.,[7] adjust an implicit “shell” of scattering
(“implicit” meaning they do not model individual solvent
molecules). Other packages treat the shell explicitly using either
molecular dynamics (AquaSAXS)[8] or a geometric
filling approach (the SCT suite).[9] Allowing
for a shell that can have gaps and fill cavities in the protein model
gives a more reliable fit to the data.[5] An extension of this approach is to use all atomistic modeling with
Protein Data Bank (PDB) structures as a starting point.[10,11] The application of such techniques, however, can require significant
technical expertise. The second method does not assume an initial
structure (ab initio) but simplifies the protein model as either a
volume[12] or a chain[13] of scattering beads without explicit secondary structure.
These methods are hence applicable to de novo structural prediction,
but the lack of secondary structure means interpreting these predictions
is a difficult task.[5]Here we propose
an alternative ab initio technique that uses a
curve model of the 3D structure of the polypeptide chain. This description
has a much reduced number of parameters by comparison to all atomistic
models. Similar curve models have been previously proposed[14−16] but not applied to the interpretation of BioSAXS data. The model
is parametrized by consecutive discretized descriptions of the four
major secondary structural elements: α-helices, β-strands,
flexible sections, and random coils. The permissible geometry of these
curves is restricted by empirically determined constraints, which
are akin to Ramachandran constraints.[17] To use the model for the interpretation of BioSAXS data, the polypeptide
chain model is combined with a water model for the first hydration
shell and an empirically calibrated scattering model. The geometry
of the model can then be optimized against the experimental BioSAXS
data. A critical factor, novel to our curve representation of the
polypeptide chain, is the construction of empirical probability distributions
for the model parameters. These distributions serve the dual purpose
of preferencing commonly observed secondary structures in the set
of potential chain models, while simultaneously allowing for predictions
with rare/novel but physically permissible secondary structure. An
advantage of this method for ab initio interpretation of BioSAXS data,
by comparison to the established bead models,[12,13] is that by accurately characterizing the protein’s secondary
structure it can reliably incorporate additional structural information
in order to improve the results of the technique. In this study, contact
predictions, based on sequence alignments alone, are used to improve
the model predictions. A final advantage of the code developed is
that its only input requirements are the primary sequence and scattering
data, so it places only basic technical requirements on the user for
its use.We first applied this new methodology to data of well-characterized
model protein lysozyme before moving to the BioSAXS data of the structural
core of the humansynaptonemal complex central element protein 1 (SYCE1).
This protein represents an essential structural component of the synaptonemal
complex (SC) that binds together homologous chromosomes during meiosis
and provides the necessary three-dimensional environment for crossover
formation.[18−20] The SC is formed of oligomeric α-helical coiled-coil
proteins that undergo self-assembly to create a latticelike assembly.[21−23] In a recent biochemical and biophysical study, humanSYCE1 was shown
to adopt a homodimeric structure in which its structural core is provided
by residues 25–179 forming an antiparallel coiled coil.[24] Further, the structural core was expressed in
an engineered construct in which two SYCE1 25–179 sequences
were tethered together through a short linker sequence (GQTNPG). This
construct faithfully reproduced the native structure, and substantially
improved protein stability in solution[24] (by comparison to the unlinked core). In this study, using secondary
structure predictions and distance restraints purely based on the
sequence of the protein alone, an excellent model of an antiparallel
extended but bent coiled coil is derived, which is fully consistent
with biological data.
Methods
First we describe the reduced
parameter protein model we use to
interpret BioSAXS data. This is composed of a polypeptide chain curve
model with a surrounding explicit hydration shell. Empirically calibrated
structure factor functions for each constituent element of the model
are constructed to produce theoretical scattering curves for this
tertiary structure model.
Polypeptide Chain
The polypeptide
chain is represented
as a set of points in 3D space {c}, the positions of the Cα atoms in each amino acid. The geometry of four consecutive
points (c, c, c, c) can be characterized
by two parameters: the curvature κ and torsion τ. κ
is defined by the unique sphere made by the center of the joining
edges (see Figure a); the smaller the sphere the more tightly the curves joining the
points fold on themselves. τ is determined by how sharply the
curve bends away from its plane of curvature, and its sign denotes
the curve section’s chirality (it is positive for right-handed
coiling and negative for left-handed coiling). More precise definitions
are as follows.
Figure 1
Figures depicting elements of the backbone model. (a)
Curve subsections
(c, c, c, c) (red points) and
their midsection points (c, c, c) (blue). The first example is more tightly
wound and has a smaller sphere, hence a higher κ value. The
sphere defined by these midsection points is shown; the inverse of
its radius is the curvature κ. (b) α-Helical section with
uniformly similar (κ, τ) values. (c) Flexible (linker)
section with varying (κ, τ) values.
Figures depicting elements of the backbone model. (a)
Curve subsections
(c, c, c, c) (red points) and
their midsection points (c, c, c) (blue). The first example is more tightly
wound and has a smaller sphere, hence a higher κ value. The
sphere defined by these midsection points is shown; the inverse of
its radius is the curvature κ. (b) α-Helical section with
uniformly similar (κ, τ) values. (c) Flexible (linker)
section with varying (κ, τ) values.
Curvature
κ
A section of four residues defined
by the points (c, c, c, c) defines three edges with midpoints c = (c + c)/2, which in turn define the curvature
sphere.[1−3] The curvature, the inverse of its radius, iswhere θ123 is the angle between
the vectors c – c and c – c.
Torsion τ
Three points define a plane (with unit
normal vector n), and the four points (c, c, c, c) define two planes through their
unit normal vectors n2 and n2, respectively:The torsion is the (length
weighted) angle these planes make with each other:with θ is
the angle between n1 and n2; see, e.g., ref (25).The algorithm for generating a curve of length n from n – 3 pairs of values of
(κ, τ) is as follows: Consider a section of curve of length m and m – 3 pairs (κ, τ), whose three
initial points c1, c2, and c3 are randomly chosen (with fixed
separation distance R = 3.8). Since scattering expressions
are invariant under an arbitrary translation and rotation,[4] the exact values of the first two points do not
matter (as long as their separation is R). The third
point is a structural degree of freedom, but it is restricted such
that the Cα–Cα distance between c1 and c3 is greater than R. Once these points are specified the fourth point will
bewith θ
= ∈ [0, π], ϕ
∈ [0, 2π]. The set (c1, c2, c3, θ, ϕ)
defines four points and hence κ and τ values. Using values
of κ1 and τ1, eqs and 3 are solved for
θ and ϕ; this gives c4. The next
point c5 can similarly be found from the values
κ2 and τ3, and so on until all m – 3 (κ, τ) have been used to yield the m points c. Examples of
an α-helical and flexible linker sections (taken from the structure
of bovineserum albumin (PDB 3V03)[26]) are shown in Figure b,c.
Secondary
Structure Geometry Restraints
In order to
derive geometric constraints, Cα coordinates were
extracted from a set of over 60 protein structures for which high
resolution crystal structures are available in the PDB and the κ
and τ values calculated for all subsections (c, c, c, c). The κ–τ pairs are
shown in Figure a.
There are three main populations of values (preferential regions).
As shown in section 1.3 of the Supporting Information, these regions of (κ, τ) space correspond to the three
preferential domains of Ramachandran space.[17] Using the PDB’s secondary structure annotation, this data
was split into categories of β-strands, α-helices, and
the rest which are not identified (referred to here as linkers). To
account for random coils, the data was further divided into subsets
whose values remained in one preferential domain (as in Figure b) and those whose κ–τ
values belong to multiple domains (as in Figure c). For each set of data a representative
probability density function (PDF) was calculated using kernel smoothing
techniques[27] (for details see sections
1.4 and 1.5 of the Supporting Information); an example is shown in Figure b.
Figure 2
Illustrations of κ–τ spaces used to
impose realistic
geometry constraints on the polypeptide chain. (a) (κ, τ)
pairs obtained from crystal structures, plotted as points with κ
on the horizontal axis and τ on the vertical axis. (b) PDF,
created from the data in (a), which correspond to linker sections.
There are three distinct domains of high probability corresponding
to the preferred secondary structural elements.
Illustrations of κ–τ spaces used to
impose realistic
geometry constraints on the polypeptide chain. (a) (κ, τ)
pairs obtained from crystal structures, plotted as points with κ
on the horizontal axis and τ on the vertical axis. (b) PDF,
created from the data in (a), which correspond to linker sections.
There are three distinct domains of high probability corresponding
to the preferred secondary structural elements.
Generating Models from Secondary Structure Annotation
In
order to generate models based on secondary structure information
alone, a protein of n amino acids is split into l distinct subdomains of length m (∑ m = n). Each
section l is classified as α-helical, β-strand,
or linker; for the purpose of testing and calibration the PDB file’s
secondary structure assignment was used to perform this task. For
each section of length m, m – 3 (κ,
τ) pairs are drawn from an appropriate PDF and the section is
constructed. This process creates the l individual
sections of secondary structure elements, which must then be linked
together. Two neighboring sections with specified geometry (for example,
an α-helix and linker) still have a relative rotational degree
of freedom. To ensure this remains physically realistic, the geometry
of the last three and first Cα positions of neighboring
secondary sections were extracted from the PDB set and further PDFs
for the set of permissible (κ, τ) pairs of these joining
sections were generated for each type of join (i.e., α-helix
to linker or linker to β-strand). Therefore, the final step
of the process is to obtain all (κ, τ) values for the
joint geometry and then construct the whole backbone. A precise mathematical
description of this algorithm, constrained backbone algorithm (CB), is given in section 1.6 of the Supporting Information. One example of a structure generated using this
algorithm is shown in Figure b; this particular structure was used as a starting point
for an ab initio structure optimization in this study.
Figure 6
Secondary knot fingerprint analysis of the lysozyme structure.
(a) Cα trace of lysozyme (PDB 1LYZ(41)). The α-helices are shown in red, β-strand
structures are in green, and linker sections are in light blue. (b)
Random structure generated using the CB algorithm which has the same
secondary structural elements as lysozyme. This could be a starting
model for the fitting procedure. Panels c and d are secondary fingerprints
of two different crystal structures of lysozome (1LYZ and 1AKI, respectively).
The knot types are indicated (Rolfsen classification[42]); white spaces indicate no secondary knots (all knots were
of the primary type). (e) Secondary fingerprint for the random structure
shown in (b); it differs significantly from (c) and (d) and has a
larger range of knots present.
The Hydration
Layer
Once the curve representation is
obtained, it is crucial to include a model of the hydration layer
in order to generate realistic scattering curves. To this aim, solvent
molecules are placed between a pair of cylindrical surfaces surrounding
the axis of a section of the backbone (Figure a). This layer is then reduced by removing
all overlapping solvent molecules. This ensures the shell remains
in hollow sections between the fold and on the protein surface, while
the water molecules are removed from significantly folded regions.
This is a crucial aspect of our hydration layer model, as it has been
shown that one needs to allow for inhomogeneous hydration layers in
order to avoid inaccurate predictions from BioSAXS data.[28] This method is illustrated in Figure , where the two cylinders of
radius Rc (core) and Ro (outer), Ro > Rc, are centered on a section i’s
helical axis (Figure a). Consider a solvent molecule belonging to another section j whose nearest distance from the axis of section i is Rs. If Rs < Rc, the solvent is
too close to the backbone and removed. If Rc < Rs < Ro, the solvent is classed as being shared by the sections i and j and only counted once.
Figure 3
Visualizations
of the hydration layer model. (a) Initial solvent
layer, shown as silver spheres with the core Rc and outer Ro cylinders surrounding
the axis of the section (red curve). (b) Overlapping sections and
solvent layers. (c) In blue, the removed solvent molecules of the
pair of sections shown in (b).
Visualizations
of the hydration layer model. (a) Initial solvent
layer, shown as silver spheres with the core Rc and outer Ro cylinders surrounding
the axis of the section (red curve). (b) Overlapping sections and
solvent layers. (c) In blue, the removed solvent molecules of the
pair of sections shown in (b).This process is applied to all solvent molecules from sections i and j on each other; an example of the
outcome is shown in Figure b,c. Applying this process pairwise to all sections of a Cα backbone yields the final hydration layer. The exact
mathematical description of this hydration layer is detailed in sections
2.1–2.3 of the Supporting Information. The values of the radii (Rc, Ro) and a number of other parameters controlling
the solvent density were determined by fitting the model to high resolution
crystal structures which contained the first hydration shell. An example
model shell, generated with these parameters, is shown in comparison
to the model solvent positions from the subatomic resolution structure
of a phosphate binding protein from the PDB 4F1V(29) in Figure . It is shown the two distributions are statistically similar in
section 2.4 of the Supporting Information and hence that the model is a realistic representation of the average
positions of the inner hydration shell.
Figure 4
Comparisons of crystallographic
and model solvent positions from
the crystal structure of a phosphate binding protein PDB 4F1V, determined at ultrahigh
resolution of 0.88 Å.[29] (a) PDB backbone
and relevant solvent molecules. (b) Model solvent positions (surrounding
the same curve as in (a)) obtained with the experimentally determined
hydration shell model parameters.
Comparisons of crystallographic
and model solvent positions from
the crystal structure of a phosphate binding protein PDB 4F1V, determined at ultrahigh
resolution of 0.88 Å.[29] (a) PDB backbone
and relevant solvent molecules. (b) Model solvent positions (surrounding
the same curve as in (a)) obtained with the experimentally determined
hydration shell model parameters.
Scattering Formula
Once the polypeptide chain and hydration
layer models are determined, the Debye formula[30]is used to calculate the
scattered intensity I(q) as a function
of momentum transfer, q = π sin(θ)/λ.
Here N is total number of Cα atoms
and solvent molecules
and f(q) is the form factor for residue i. There are two
types, one for an amino acid with an excluded volume correction and
one for a solvent molecule, which are defined in the following section.
Amino Acid Form Factors
The form factor fam of an amino acid, centered on the Cα atom position, iswhere fb is the scattering
of the amino acid in a vacuum, fex is
the adjustment due to the excluded volume
of solvent, and ρex is a constant. Each amino acid
is assigned the same scattering function fb(q), a five-factor exponential representation:where {A, B}5 and C are empirically determined
constants (a standard
form used to fit molecular form factors[31]). The excluded volume effect is captured using an exponential model
in the formwhere rw is the
average atomic radius of the atom.[6,7,13] To calculate the excluded volume for amino acids,
coordinates for all 20 amino acids[32] and
values of rw for carbon, nitrogen, oxygen,
hydrogen, and sulfur (e.g., ref (33)) were used to compute the excluded volume scattering,
centered at the Cα, throughwhere rα is the
distance of atom i from the Cα molecule
and Nam is the number of atoms in the
amino acid. Since fb does not discriminate
individual amino acids, this value fexam was averaged
over all 20 amino acids, weighted by their abundance in globular proteins
(see ref (34)). This
averaged function, shown in Figure e, gives fex(q). Finally, eq includes
a constant ρex which modulates the effect of the
excluded volume scatter by comparison to fb; this value is constrained to lie within 0.75 and 1.25 (similar
constraints are used in refs (6, 7, and 13)). The scattering form for an
individual water molecule in the hydration layer iswhere fhy and fox are the vacuum scatterings of hydrogen and
oxygen, respectively.[31] The constant ρh was empirically determined (as in ref (7)). A detailed description
of the parameter determination method is given in section 3 of the Supporting Information.
Figure 5
Fits to scattering data
for various molecules using appropriate
Cα coordinates as a backbone model {c}. (See section 3 of the Supporting Information for details.) In panels a–d,
the data scattering data is shown overlaid by the smoothed data used
for fitting (blue curve) and the model fit (red curve). Panel e is
the averaged scattering function famex obtained by averaging the scattering
parameters obtained from fits like those shown in (a)–(d).
Fits to scattering data
for various molecules using appropriate
Cα coordinates as a backbone model {c}. (See section 3 of the Supporting Information for details.) In panels a–d,
the data scattering data is shown overlaid by the smoothed data used
for fitting (blue curve) and the model fit (red curve). Panel e is
the averaged scattering function famex obtained by averaging the scattering
parameters obtained from fits like those shown in (a)–(d).
Evaluating Structural Similarity
In the next step the
geometry of each model generated by the CB algorithm is optimized
by refinement against the scattering data. However, since the problem
is under-determined, many models will fit the experimental data, so
a method is required to compare structures and determine which predictions
are “essentially the same” in that they only differ
by small local conformational changes (as one should expect in solution).
The standard methods in protein crystallography for comparing similar
protein structures are based on root-mean-squared deviations (RMSDs)
where two structures are superimposed to minimize the sum of all distances
of equivalent paired atoms.[35,36] This measure and variants
on it are known to be overly sensitive to large deviations in single
loops (as discussed in ref (35)). Unlike homologous crystal structures, which will often
only differ by the change in a small subsection of the whole structure,
the comparison here will be made between structures generated by a
random algorithm, so the significant buildup of relatively small individual
RMSD errors is likely. In section 2.1 of the Supporting Information, a number of additional problems with using the
RMSD measure in this context are discussed in detail. To mitigate
these problems, a novel and more robust approach based on knot theoretical
techniques was developed.
Knot Fingerprints
Techniques from
knot theory have
previously been applied to identify specific (knotted) entanglements
in protein structures.[37] To compare two
protein structures using knot theory, the N- and C- termini need to
be joined.[38] As demonstrated in ref (37), a sensible procedure
for making this extension is to surround the backbone with a sphere,
then choose two random points on the sphere and join the end termini
to these points; finally, this extended curve is closed with a geodesic
arc. The knot is then classified (e.g., via Jones polynomials). This
procedure is repeated a significant number of times (10 000
in this study), and the most common knot (MCK) is chosen to indicate
the knotting of the curve. To obtain additional information, the MCK
is calculated for all subsets {c|i = k, k + 1, ..., j, j > k, j – k > 3} of the curve.
One can then plot this data on a “staircase” diagram
with j and k on the axes and each
square of the domain colored by its most common knot (e.g., ref (39)) (examples of staircase
diagrams are shown in Figure c–e). The fingerprint
is found to be preserved across protein families,[39] even when there is low sequence identity.[40]Secondary knot fingerprint analysis of the lysozyme structure.
(a) Cα trace of lysozyme (PDB 1LYZ(41)). The α-helices are shown in red, β-strand
structures are in green, and linker sections are in light blue. (b)
Random structure generated using the CB algorithm which has the same
secondary structural elements as lysozyme. This could be a starting
model for the fitting procedure. Panels c and d are secondary fingerprints
of two different crystal structures of lysozome (1LYZ and 1AKI, respectively).
The knot types are indicated (Rolfsen classification[42]); white spaces indicate no secondary knots (all knots were
of the primary type). (e) Secondary fingerprint for the random structure
shown in (b); it differs significantly from (c) and (d) and has a
larger range of knots present.
Secondary
Knot Fingerprints
Figure c is the knot fingerprint for one set of
lysozyme coordinates (shown in Figure a), of the second most common knot
identified during the random closure process. The secondary fingerprint
shown in Figure d
is from a second set of lysozyme coordinates; panels c and d of Figure are significantly
similar. The secondary fingerprint (Figure e) is derived from a CB generated backbone
model, shown in Figure b, which has the same secondary structure sequence as 1LYZ. The secondary fingerprint
differences between the correct structure (Figure c) and the randomly generated structure (Figure d) is immediately
obvious. All primary fingerprints in these cases are identical, and all have the unknot as the most common knot. It is clear secondary
(and possibly tertiary) knot fingerprints can differentiate unknotted
folds. A knot fingerprint statistic is defined in section 4.2 of the Supporting Information which quantifies the weighted
similarity of knot fingerprints at level l associated
with the curves K1 and K2 (l = 2 for Figures c–e); it yields a value between 0,
completely dissimilar, and 1, identically folded.In section
4.3 of the Supporting Information it is
demonstrated that the statistic has the following properties. First,
it quantifies crystal structures of the same molecule as highly similar, , and randomly generated structures
(with
the same secondary structure sequence) as significantly dissimilar,
generally (see Figure a). Second, it judges crystal monomer structures
of
different proteins with different 3D structures but with similar lengths
as being significantly different (typically ); i.e. it can differentiate folds. Third,
it is shown to have excellent properties under deformation. To demonstrate, n randomly distributed changes were applied to a lysozyme
crystal structure Kpdb using the CB algorithm.
For each n, 50 such structures K were generated and the values of the
statistic were calculated. The results are plotted
as a function of n in Figure b. The mean value drops off rapidly to the
same value as the average of the randomly generated structures (after
about 15 changes). The maximum value always remains significantly
higher than the mean; it drops below PDB quality after only two changes.
Therefore, a high value of >0.75 indicates the structure
is likely largely the same as the original structure.
Figure 7
Properties of the (secondary)
knot fingerprint statistic based on variations of the lysozyme
structure.
(a) Secondary knot statistics of various structures K compared to the curve shown in Figure a. The two distinct sets are lysozyme coordinates
from the PDB and random structures with secondary structure alignment
to lysozyme (generated using the CB algorithm). (b) Plots of the mean,
maximum, and minimum values of the 50 secondary knot statistics comparing
the 1LYZ structure
and the same structure subjected to n random changes
in its secondary structure. The dotted lines show one standard deviation
from the mean. The black line is the average of the PDB structure
secondary fingerprint statistics (see (a)), the purple line is the
random structure average (crossing the mean at about n = 15), and the yellow line is the average of secondary fingerprint
values for models which fit the experimental data (crossing the mean
at about n = 3).
Properties of the (secondary)
knot fingerprint statistic based on variations of the lysozyme
structure.
(a) Secondary knot statistics of various structures K compared to the curve shown in Figure a. The two distinct sets are lysozyme coordinates
from the PDB and random structures with secondary structure alignment
to lysozyme (generated using the CB algorithm). (b) Plots of the mean,
maximum, and minimum values of the 50 secondary knot statistics comparing
the 1LYZ structure
and the same structure subjected to n random changes
in its secondary structure. The dotted lines show one standard deviation
from the mean. The black line is the average of the PDB structure
secondary fingerprint statistics (see (a)), the purple line is the
random structure average (crossing the mean at about n = 15), and the yellow line is the average of secondary fingerprint
values for models which fit the experimental data (crossing the mean
at about n = 3).
Experimental Data Fitting
The following chi-square
statistic χf2 is used to to assess the fit quality of model predictionswhere ns is
the discrete number of points on the domain q ∈
[0, 0.4] on which the scattering is sampled (a commonly
used domain, e.g., ref (7)). Im is the model scattering calculated
using the Debye formula, eq , and Ies is the smoothed experimental data. The data
is smoothed using the procedure described in ref (43) which is designed to avoid
overfitting. The idea is that the maximum distance between two points
on the molecule Dmax can be used to identify
the number of independent data points one needs to recreate the scattering
data using the Shannon sampling theorem. This number is the nearest
integer to n = Dmaxqmaxπ–1. One then separates
the experimental data into n bins; as in ref (43) the bins are chosen to
be evenly spaced on the domain of q values used.
A data point is then selected from each bin; again we follow the established
procedure in ref (43) that this data point is the median of 1001 randomly selected samples
of the data from the given bin, akin to a robust least-trimmed-squares
method (see ref (43) for more details). The factor Ld, which
will superimpose identical curves that differ by a translation, is
used because the protein concentration can only be measured with relatively
low accuracy[6,7] (when taking a logarithm of the
data, a scaling factor becomes a vertical translation). In addition,
to prevent chemically unreasonable conformations, a penalty is applied
if the Cα–Cα distance of
≤3.8 Å occurs for any pair of nonadjacent Cα positions; this quantity is labeled χnl. The initial
model is optimized as described above until χf2 + χnl < 0.008. Values below this threshold represent an excellent
fit to the scattering data, as shown in Figure d. This value is based on a comparison to
other studies (see section 3.6 of the Supporting Information).
Figure 8
Illustrating the fitting process. (a) Initial configuration
of
the backbone based only on the secondary structure assignment of lysozyme
(PDB 1LYZ).
Also shown as spheres are the molecules of the hydration layer. (b)
Model scattering curve comparing BioSAXS data. (c) Final structure
(and hydration layer) obtained from the fitting process and its model
scattering curve now fitting the BioSAXS data well (d).
Illustrating the fitting process. (a) Initial configuration
of
the backbone based only on the secondary structure assignment of lysozyme
(PDB 1LYZ).
Also shown as spheres are the molecules of the hydration layer. (b)
Model scattering curve comparing BioSAXS data. (c) Final structure
(and hydration layer) obtained from the fitting process and its model
scattering curve now fitting the BioSAXS data well (d).
Results
Validation of the Backbone Curve and Water
Model
As
discussed under Methods, each part of the
model, the Cα backbone, the explicit hydration layer,
and the scattering model have been individually designed and verified
using actual structures from the PDB. However, it remains to demonstrate
the composite model’s efficacy. To test this, it was applied
to the benchmark set of proteins used to compare the set of atomistic
small angle scattering verification methods in ref (44) (this is in addition to
the cases shown in Figure ). This set includes monomeric and multimeric proteins both
globular and elongated. We allow the parameters of the scattering
model to vary for each structure but fix the geometric hydration layer
as described above. The scattering model is physically constrained
in the same manner as in the FoXS[7] and
CRYSOL[6] models, as discussed in detail
in section 3 of the Supporting Information. For the sake of brevity we also detail these results in section
3.6 of the Supporting Information; it suffices
to state here that the model performs comparably to the atomistic
structure techniques and hence can be used to correctly infer protein
structure from small angle scattering data.
Developing and Testing
an Averaged Scattering Model for ab Initio
Prediction
In an ab initio fitting it will be necessary to
fix all parameters of the scattering model so that the algorithm only
alters the protein backbone parameters (the pairs (κ, τ)); this will
allow the model to run in a reasonable time frame. In section 3.61
of the Supporting Information, we detail
the construction of an averaged scattering model based on the set
of parameters used for each successful fitting detailed in section
3.6 of the Supporting Information. In general,
if this average scattering model is then reapplied to the PDB structure
and explicit hydration shell, we do not obtain a sufficiently good
fit to the scattering data (although it is not too far off).The aim of this section is to show that we can use this averaged
model and distort an initial PDB curve representation model in order
obtain a high quality fit to the scattering data while still retaining
a sufficiently realistic structure (within a few angstroms on average).
This demonstrates the ab initio technique proposed here contains within
its potential prediction population a high quality representation
of the actual protein structure. It will also highlight some properties
of the knot fingerprint statistic, by comparison to the widely used
RMSD structural comparison statistic.To perform this test we
selected three pairs of proteins and known
crystal structures—lysozyme (PDB 1LYZ), ribonuclease (PDB 1C0B), and bovine serum
albumin (BSA; PDB 3V03, selecting a monomer unit)—and scattering data obtained from
the SAS database.[45] We used the PDB coordinates
and secondary structure assignment as an initial input into the algorithm;
then we altered each secondary section individually using Monte Carlo
sampling of the κ–τ distributions and the CB algorithm
to generate new structures. Using the hydration layer and scattering
model, scattering curves were generated for these models. This process
was run until a suitable fit to the scattering data was obtained.
Lysozyme
and Ribonuclease
Examples of the derived models
obtained for lysozyme are compared to subsections of the original
PDB in Figure ; we
compare subsections for visual clarity. Typically the structures are
nearly identical with only the occasional slight deviation in the
geometry of some of the linker sections. This similarity is reflected
in both the RMSD measures (calculated using the Biopython module[46]) and the knot fingerprint statistics, as shown
in Figure a. As
expected, both indicate excellent fits to the structure. There is
a (Pearson) correlation of −0.3 between the two measures, a
value on the edge of weak and reasonable. The results for ribonuclease
were very similar, and the fit statistics are shown in Figure b. Again there is also a clear
relationship between the knot fingerprint statistic and the RMSD measure;
in this case the correlation is very strong, −0.8. We see this
correlation between the two measures as further justification of the
knot statistic’s appropriateness as a measure of structure.
Figure 9
Sections
of the 1LYZ PDB
structure and example fits obtained by fitting our model to
the scattering data. Panels (a) and c are subsections of the PDB;
(a) has the sheet. Panels b and d are composite visualizations of
the predictions.
Figure 10
Comparison of RMSD measures
and knot fingerprint statistics for fittings of the model to scattering
data for lysozyme and ribonuclease.These results are obtained using
the PDB structure as the initial input to the algorithm and are by
comparison to that PDB. (a) Lysozyme; (b) ribonuclease.
Sections
of the 1LYZ PDB
structure and example fits obtained by fitting our model to
the scattering data. Panels (a) and c are subsections of the PDB;
(a) has the sheet. Panels b and d are composite visualizations of
the predictions.Comparison of RMSD measures
and knot fingerprint statistics for fittings of the model to scattering
data for lysozyme and ribonuclease.These results are obtained using
the PDB structure as the initial input to the algorithm and are by
comparison to that PDB. (a) Lysozyme; (b) ribonuclease.
BSA
Example fits to the (parts of the) larger BSA structure
are shown in Figure . We only display subsections as the full molecule is too complex
for a clear visual comparison; the sections were chosen at random
and are indicative of the general comparison. Once again it is clear
the structures are very similar.
Figure 11
Sections of the 3V03 PDB structure and
example fits obtained by fitting our model to
the scattering data. Panels a and c are subsections of the PDB. Panels
b and d are example predictions.
Sections of the 3V03 PDB structure and
example fits obtained by fitting our model to
the scattering data. Panels a and c are subsections of the PDB. Panels
b and d are example predictions.Hence, we have shown that the model and method have the potential
to correctly predict the tertiary structure of proteins accurately.
From a purely ab initio perspective the question now is, how easy
is it to get to the correct structure from a random initial guess?
This question proves to be more complicated, requiring multiple predictions,
so for this preliminary study we focus on a single structure: lysozyme.
Ab Initio Prediction
In the case where no crystal structure
is available, the secondary structure prediction based on the sequence
alone can be used as a starting point. In order to test this ab initio
method, we used the small angle scattering data of lysozyme to make
predictions of its structure. The process for obtaining a model is
summarized in Figure . First, an initial structure is randomly generated by the CB algorithm
and surrounded with an explicit hydration layer (Figure a). A model scattering curve
is calculated and compared to the experimental data (Figure b). The curve is then changed
by using a Monte Carlo algorithm to generate new secondary structure
units (along with a new hydration shell), thus altering the model’s
fold until it attains a sufficiently good fit to the scattering data
(Figure c,d).Once again we use the χf2 statistic (eq ), but this time with an additional constraint
on the potential search space, contact predictions, based on a large
number of homologous sequences. Data from the RaptorX web server[47] for the lysozyme primary sequence were obtained.
The Cα pairs with the 10 highest correlations were
selected. An extra potential χcon was added to the
optimization statistic to ensure the distance between these pairs
was restricted to be within 5 and 15 Å. If l = 1, ..., nc labels the nc pairs of constrained points with mutual distances dc, then the quality of contact match χcon is defined as follows:with C a constant
and dfc a reference distance (7 was used in this study). The value
of C controls the likely variation in the distances dc; a value of C = 0.01 in this
study was found to give good results. In the following a model was
considered a valid prediction when both χf2 + χnl + χcon < 0.008 and χf <
0.008 so that predictions had to simultaneously fit the scattering
data and minimize the geometric penalties of not overlapping and also
satisfying the contact predictions (to within a specified tolerance
dictated by the constant C).The results of
the ab initio fitting procedure are shown in Figure a. The RMSD and
knot fingerprint statistics, compared to the 1LYZ crystal structure,
are shown. The first observation is that the best knot fingerprint
statistics are comparable to the lower end of the from-PDB predictions
obtained in the previous section. The second is that these correspond
to the best RMSD measures. The apparent correlation between the two
measures seems to remain for knot fingerprint statistics above 0.6.
However, there is a gap between the best RMSD for the ab initio predictions
and those derived from the PDB structure. This is to be expected as
the knot statistic is more tolerant of differences which preserve
the entanglement (the general geometry of the fold). This difference
can be seen visually in Figure , panels b and c, which respectively represent the
first 10 secondary structure sections of the 1LYZ crystal structure
and the best fit ab initio prediction (the one closest to the PDB
predictions in Figure ). The same fold back of the two significant α-helical sections
is present in both cases, as is the fold back of the β-sheet
(although the variability in strand geometry allowed in the algorithm
means they are not identical). Further, the relative orientation of
this helical pair and the strand section present is the same in both
cases. Therefore, overall the basic fold geometry is correctly predicted,
which is why the knot statistic is so close to the PDB values. There
are, however, a number of sections with some reasonably significant
distance differences, for example, the linker section joining the
two helices; this means a bigger difference in the RMSD measure. Given
all the difficulties associated with interpreting small angle scattering
experiments, we suggest the knot statistic is a more appropriate measure
of the accuracy of the prediction. One can see a similar conclusion
can be applied to the rest of the molecule shown in Figure , panels d and e, for the
PDB and fit, respectively.
Figure 12
Ab initio predictions for lysozyme based on
sequence data alone.
Panel a depicts the RMSD and knot statistic values for the predictions K; these are indicated as blue circles
with the from-PDB data (Figure ) shown as brown squares for comparison. (b) Secondary
structure sections 1–10 (residues 1–63) of the 1LYZ crystal structure.
(c) Secondary structure sections 1–10 of the best ab initio
fit. (d) Secondary structure sections 11+ (the reminder of the structure)
of the 1LYZ crystal
structure (residues 64–129). (e) Secondary structure sections
11+ of the best ab initio fit.
Ab initio predictions for lysozyme based on
sequence data alone.
Panel a depicts the RMSD and knot statistic values for the predictions K; these are indicated as blue circles
with the from-PDB data (Figure ) shown as brown squares for comparison. (b) Secondary
structure sections 1–10 (residues 1–63) of the 1LYZ crystal structure.
(c) Secondary structure sections 1–10 of the best ab initio
fit. (d) Secondary structure sections 11+ (the reminder of the structure)
of the 1LYZ crystal
structure (residues 64–129). (e) Secondary structure sections
11+ of the best ab initio fit.
Objective
Prediction Comparisons
Using only the protein
sequence for secondary structure prediction and BioSAXS data, we have
been able to obtain tertiary structure models which can be observed
and quantified to have a fold geometry (topology) significantly similar
to the lysozyme structure. However, a large number of predictions
have knot statistics which suggest the structure’s fold topology
differs significantly from that of the crystal structure (Figure a). The target
applications for this method will be unknown structures, and it must
be established whether one could have identified these were “bad”
predictions without knowledge of the underlying structure.To
differentiate predictions, we should seek objective structure comparison
measures which do not depend on comparison of known structural information
(i.e., not to the PDB). One example would be the contact prediction
statistic χcon. This is objective in the sense that
it only relies on sequence data, and it would generally be available.
A scatter plot of the knot statistic indicates the high quality ab
initio predictions (high ) are less likely to have a high
χcon than the worse predictions; see Figure d. If we were to run a significant
number
of predictions and then say select only those below the mean χcon value and then most of the high χcon predictions
remain, this could be a first means of filtering the predictions,
although we see it will still leave “bad” predictions
and thus further analysis is required.
Figure 13
Comparisons of high and low lysozyme
predictions. (a) PDB 1LYZ crystal structure.
(b) High quality fit (). (c) Low quality fit . (d) Comparison of the contact constraint
χcon and the knot fingerprint; the blue points (with
larger values) are for the ab initio fits and the brown dots are from
the PDB fits. (e) Two (green) sections of a sheet from lysozyme model.
A plane and its normal bisecting the strand sections are shown; also
shown are two sections of the rest of the molecule which bisect the
plane between the two strands. (f) Fingerprint–RMSD comparison
plot with screened ab initio predictions.
Comparisons of high and low lysozyme
predictions. (a) PDB 1LYZ crystal structure.
(b) High quality fit (). (c) Low quality fit . (d) Comparison of the contact constraint
χcon and the knot fingerprint; the blue points (with
larger values) are for the ab initio fits and the brown dots are from
the PDB fits. (e) Two (green) sections of a sheet from lysozyme model.
A plane and its normal bisecting the strand sections are shown; also
shown are two sections of the rest of the molecule which bisect the
plane between the two strands. (f) Fingerprint–RMSD comparison
plot with screened ab initio predictions.
β-Sheet Model Variations and the Power of Knot Statistics
Figure shows
the full lysozyme crystal structure (a), a high model (0.73)
(b) with RMSD 9.21 (by comparison
to the crystal structure), and a low model (0.23)
(c) with RMSD = 10.8. Therefore,
there is a relatively small difference between the two predictions
RMSD measures, but a significant one as measured by the knot topological
method. One clear difference is the isolation of the β-sheet.
In Figure a,b the
sheet is at one edge of the structure, while in Figure c it is closer to the α-helical
secondary units of the structure. Furthermore, because the constituent
strands of the prediction shown in Figure c are not sufficiently closely related,
there appears to be a section of α-helix passing between them.
This is a significant difference in entanglement detected by the knot-based
measure for Figure c compared to Figure a,b An inspection of the structures indicated that the better performing
structures (in terms of their fingerprints) tended to have tighter
and more isolated β-sheets, consistent with the examples illustrated.
To try to quantify this, we created two mathematical measures. The
first measure is the mean distance between sequentially paired Cα atoms of the predicted sheet structure (this sequential
dependence can be determined by distance measures and does not need
a predetermined knowledge of the strand orientation). We calculate
this value for all predictions and choose those, say, less than the
median value. The second is a discrete test as to whether any other
section of the molecule passes “between the sheet”.
We approximate a plane for the sheet as indicated in Figure e and then determine if any
other arcs of the main Cα chain pierce this plane;
if this does occur we simply reject the structure as being physically
unrealistic (as is the case in Figure e). Both are objective measures.When
the combination of sheet measures and the contact prediction cutoffs
are applied, we are left with a significant proportion of the high
quality fits, including the one with the lowest RMSD (Figure d). Crucially all the lower
quality fits are filtered out. It should be noted that one of the
high quality predictions was lost during this filtering
process, on the basis that its mean sheet distance was too high. This
underlying selection mechanism should be generally applicable being
based on basic principles, so there is an indication it will be possible
to produce a general post hoc selection procedure. In future it might
be also be useful to use information such as sulfide bonding and hydrophobic
exposure to further classify predictions.
Application
to a Novel Protein with Unknown 3D Structure: The
Human SYCE1 Core
Based on the success of utilizing contact
predictions to constrain potential models, we applied the algorithm
on the structural core of the humanSYCE1 protein, a tethered construct
where the sequence is repeated to allow formation of extended antiparallel
coiled coils with two short additional helices at each end that could
fold back to form a small three-helix bundle. The secondary structure
of the tethered protein construct resulted in eight stretches of α-helices
where, based on the heptad repeats helices 2, 3, and 4, can be aligned
to helices 6, 7, and 8 corresponding to the same sequence, respectively,
in an antiparallel fashion. This resulted in 14 close contact predictions
between helices 2 and 8, and helices 4 and 6, respectively, as shown
in Figure .
Figure 14
Schematic
drawing of the SYCE1 construct with each box corresponding
to one predicted α-helix. The SYCE sequence of approximately
120 amino acids corresponding to helices 1–4 is duplicated
and linked by a tether to a repeat of the same sequence comprised
of helices 5–8.
Schematic
drawing of the SYCE1 construct with each box corresponding
to one predicted α-helix. The SYCE sequence of approximately
120 amino acids corresponding to helices 1–4 is duplicated
and linked by a tether to a repeat of the same sequence comprised
of helices 5–8.
Deriving the Models
Based on the sequence and secondary
structure predictions (a combination of those of RaptorX[47] and HHpred[48]), 40
initial configurations were generated using the CB algorithm. An example
is shown in Figure a along with its hydration layer; its scattering curve is compared
to the experimental data (from ref (24)) in Figure b. As shown, the fitting is limited to the domain q ∈ [0, 0.3] Å–1, which balances
the twin considerations of a sufficient resolution and reliable signal-to-noise
ratio. Using Monte Carlo optimization, the structure is altered until
a reliable fit χf2 + χnl + χcon < 0.008 is obtained, where the potential χcon is based on the contact predictions described above. One such model
is shown in Figure c along with its scattering curve in Figure d. The identical chains of the structure
have folded to lie (nearly) parallel with the end termini occupying
a local neighborhood. Two example models for which χf2 + χnl + χcon < 0.008 are shown in Figure a,b. Panels c and
d of Figure indicate
one of the coiled-coil structures and depict the pairwise distances
associated with the contact prediction terms χcon. All models share the elongated bend shape with a antiparallel coiled-coil
arrangement of helix 2–4 to helix 6–8, respectively.
The first helix in each helix (helices 1 and 5, respectively) show
different orientations which reflect the expected conformational flexibility
of the protein in solution. Importantly, the central coiled coil (made
of helices 3 and 7, respectively) is not based on the constraints
given a priori but is entirely based on the optimization against the
experimental data. Although a bead model results in a similar overall
shape,[24] our method is able to derive a
more detailed molecular model with distinct structural features such
as the central coiled coil.
Figure 15
Illustrations of the optimization process used
to obtain the model
predictions for structural core of human SYCE1. (a) Initial starting
configuration of the backbone based only on the sequence data shown
in Figure . Also
shown as spheres are the molecules of the hydration layer. Large black
and white spheres indicate the end termini. (b) Scattering curve of
the initial configuration (blue) overlaid on the scattering data (red).
(c) Model prediction for which χf2 + χnl + χcon < 0.008; the end termini are next to each other. (d)
Final scattering curve compared to experimental data.
Figure 16
Illustrations of the model predictions. (a, b) All model predictions.
(c) One of the coiled-coil units of (a) with black tubes representing
the contact prediction distances, as seen along the axis of the unit.
(d) Tilted helical structure of the coiled-coil unit. (e) Model obtained
by minimizing the chi-squared measure χnl + χcon only (i.e., without taking the scattering data into account).
Illustrations of the optimization process used
to obtain the model
predictions for structural core of humanSYCE1. (a) Initial starting
configuration of the backbone based only on the sequence data shown
in Figure . Also
shown as spheres are the molecules of the hydration layer. Large black
and white spheres indicate the end termini. (b) Scattering curve of
the initial configuration (blue) overlaid on the scattering data (red).
(c) Model prediction for which χf2 + χnl + χcon < 0.008; the end termini are next to each other. (d)
Final scattering curve compared to experimental data.Illustrations of the model predictions. (a, b) All model predictions.
(c) One of the coiled-coil units of (a) with black tubes representing
the contact prediction distances, as seen along the axis of the unit.
(d) Tilted helical structure of the coiled-coil unit. (e) Model obtained
by minimizing the chi-squared measure χnl + χcon only (i.e., without taking the scattering data into account).
Experimental Scattering Data Is Crucial to the Prediction Quality
One might ask if the contact predictions alone were sufficient
to predict the structure, since they are crucial to forming (some
of) the coiled-coil structure. To test this, we derived models by
minimizing the chi-squared measure χnl + χcon (i.e., ignoring the scattering data); a typical example
is shown in Figure e. The outer α-helices are present as the contact prediction
constraint χcon forces these structures to form.
However, the whole structure is significantly folded. This folding
was found to be a typical property of models obtained by minimizing
only χnl + χcon, and the degree
of folding was far from consistent. The clear effect of further enforcing
the model fit the scattering data is twofold: first straightening
out the whole structure and second, in doing so, developing a coiled-coil
geometry in the middle of the structure.
Fitting to the Scattering
Data and Contact Predictions Is Not
Straightforward
As a final note, we note that of the 40 initial
structures generated, only 5 obtained a suitably low combined chi-squared
statistic (χf2 + χnl + χcon < 0.008).
All five structures, two of which are shown in Figure , were basically identical in this case
(comparative values > 0.9), so there was
no need for
any post hoc structural comparison analysis. By comparison all 40
lead to models for which χnl + χcon < 0.008. As we have just seen, there is significant value in
the extra information provided by the scattering data. The difficulty
with obtaining suitable fits indicates that in the future more advanced
optimization techniques than a straightforward Monte Carlo search
may be needed.
Discussion
Summary of Results
This paper describes in-depth the
development of a tertiary structure model for BioSAXS data interpretation.
A number of key points have been demonstrated with regard to its potential
use for the structural biology community. First, if the method starts
with a known structure with a similar tertiary structure to the target
structure, e.g., a homology model or an incomplete (core) model for
a structure, then it will likely find a highly accurate fit to the
presumably correct structure, as verified on a benchmark set of proteins.
Second, given a near-complete absence of tertiary structural information,
save that available from sequence data such as secondary structure
predictions, the technique can generate realistic representations
of the structure’s fold. Further, in this ab initio scenario
there is the potential to reliably separate realistic predictions
from those which are not biologically plausible, by both constraining
the fitting procedure and applying optimization filtering. This final
result, demonstrated here on lysozyme, is a significant result; there
exists no purely ab initio SAXS technique so far which has achieved
such detailed predictions of the protein’s fold (a number of
techniques superimpose tertiary and secondary structures into ab initio
bead predictions but this requires extra information such as a valid
homologous structure). We now discuss various aspects of the predictions
and compare them to existing methods.
Data Requirements
The fundamental information required
to construct the backbone model is the sequence, used to identify
secondary structures (linker, strand, and α-helix). This information
can be obtained from the sequence using, for example, the publicly
available jPred server.[49] Then we only
need the scattering data to fit this model to. Therefore, in its most
basic form the algorithm can yield predictions from only two files:
a sequence file and a scattering data file.We can also use
the sequence file and existing spatial information to constrain the
potential search space. In the case of the lysozyme predictions, the
search space was additionally constrained using contact predictions.
Contact predictions can be obtained based on the sequence using, for
example, the RaptorX server. The server does include homology model
generation if applicable, but we have only used the list of amino
acid pairs which have a high sequence correlation. As discussed above,
we constrain their mutual distance. Similarly, the SYCE1 protein models
were additionally constrained using coiled-coil sequence alignments,
as discussed under Results.The technique
can also accommodate the use of additional information;
for example, PDB files could be used as initial homology models or
partial structure specifications. In this case the necessary pieces
information required are the secondary structure specifications and
the Cα coordinates. In general, any additional information
used to constrain the search space needs to provide information on
the Cα coordinates, in particular what secondary
structure type is it, and how does its constrain its spatial coordinates
(i.e., a fixed location or a fixed distances for pairs of Cα coordinates)? It is possible, for example, to fix the coordinates
of a subsection of the target structure and only vary the rest, if
a partial prediction is already present.
Comparisons to Existing
Techniques
With regard to comparisons
to existing techniques there are two categories to be discussed. The
first is the set of different experimental methods used to derive
structures in the Protein Data Bank. The predictions from our methods,
applied to small angle scattering data, can be near this level of
quality if a reliable initial structural model is
provided. This was demonstrated under Results when we used PDB structures as a starting model; the algorithm yielded
structures with RMSD measures (by comparison to the PDB structure)
highly comparable to experimentally obtained models (for α-carbon
positions). In a purely ab initio scenario our results indicate it
is currently difficult to obtain this level of accuracy on a reliable
basis (although one can get single angstrom RMSD measures). However,
as shown in Figure d, there is some indication that, if extra constraints such as contact
predictions from homologous sequences can be enforced to a high degree
of accuracy, there is the potential to reach similar levels of structural
resolution to these alternative experimental techniques.
Ab Initio Methods
The second comparison would be to
SAXS-specific ab initio techniques for interpreting BioSAXS data.
These include the bead based models such as GASBOR and DAMMIN.[13,50] However, a direct comparison is difficult because the natures of
predictions are different. Neither method makes explicit predictions
of the sequence-ordered tertiary structure of the molecule. Both are
composed of effective scattering beads. The DAMMIN model aims to predict
the volume occupied by the molecule by creating a cloud of beads,
while GASBOR aims for a structure with a chainlike nature constraining
bead–bead distances, but there is no explicit secondary structure
in the model. However, previously a technique for comparing bead models
to tertiary structures was suggested, using a measurement, the normalized
spatial discrepancy (NSD), derived from several metrics for comparing
the distributional similarity of two sets of points.[51] This calculation is implemented in the SUPCOMB routines,
part of the ATSAS package. The SUPCOMB algorithm rotates (and translates)
the bead clouds to minimize their NSD measure, as described in ref (51). This does not account
for the sequential nature of the chain in the models derived here.
Instead, it in effect measures the similarity of two point clouds,
but it is a sensible means of comparing the results of the results
obtained in this study to GASBOR predictions (the closest to the CB
model in that it develops chainlike structures). We compared the GASBOR
prediction for lysozyme (obtained from SASBDB,[45] code SASDAG2) for the same scattering data that was used
in Results to obtain the CB algorithm prediction;
note that both use the same q domain ([0, 0.4] Å–1). In both cases the structures were compared to the 1LYZ structure via their
NSD measures. An NSD value below 1 is generally considered a good
match.[52] A small subset of the results
are indicated in Table ; all values are below 1. The quality of this fit is illustrated
in Figure , where
the predictions are superimposed on the solvent-accessible surface
area of the 1LYZ structure. Therefore, in terms of point cloud comparison, both prediction
techniques can yield very similar predictions. This is not such a
surprise as both fit the low q portion of the scattering
curve, which contains the large scale structural information, very
well.
Table 1
Various Structural
Comparison Metrics
for Predictions of Lysozyme Predictions, All by Comparison to the 1LYZ Structure and All
Quoted to Three Significant Figures
prediction
type
SUPCOMB NSD
RMSD
GASBOR model
0.813
N/A
N/A
CB pred 1
0.880
0.729
9.77
CB pred 2
0.863
0.703
9.22
CB pred 3
0.889
0.708
10.3
CB pred 4
0.866
0.133
13.2
Figure 17
GASBOR (obtained from
SASBDB, code SASDAG2) and CB algorithm models
for lysozyme superimposed on the 1LYZ crystal structure. (a) Bead cloud (red)
of a GASBOR prediction for the same lysozyme scattering data as used
above, obtained from SASBDB (code SASDAG2). A lysozyme prediction
(blue curve) from Results for which the SUPCOMB
NSD is 0.813 and knot fingerprint statistic (both by comparison to the 1LYZ PDB structure).
(c) Lysozyme prediction (green curve) from Results for which the SUPCOMB NSD is 0.8664 and knot fingerprint statistic (both by comparison to the 1LYZ PDB structure).
(d) Secondary structure sections 1–10 of the CB lysozyme model
shown in (c).
GASBOR (obtained from
SASBDB, code SASDAG2) and CB algorithm models
for lysozyme superimposed on the 1LYZ crystal structure. (a) Bead cloud (red)
of a GASBOR prediction for the same lysozyme scattering data as used
above, obtained from SASBDB (code SASDAG2). A lysozyme prediction
(blue curve) from Results for which the SUPCOMB
NSD is 0.813 and knot fingerprint statistic (both by comparison to the 1LYZ PDB structure).
(c) Lysozyme prediction (green curve) from Results for which the SUPCOMB NSD is 0.8664 and knot fingerprint statistic (both by comparison to the 1LYZ PDB structure).
(d) Secondary structure sections 1–10 of the CB lysozyme model
shown in (c).What is of particular interest
is that structures with significantly
different knot fingerprint statistics yield very similar SUPCOMB and
similar RMSD measures. Two examples of CB predictions are shown in Figure , panels b and
c, respectively: and . Both structures fit very well inside the
lysozyme solvent accessible surface area, as indicated by the near
identical NSD measures. The low measure of the structure in Figure c is seen in Figure d to be a result
of the fact
that the subsection involving the strand structure has a significantly
different subfold compared to the PDB structure, shown in Figure b. In particular,
the β-sheet structure “folds back” in the direction
opposite to that of the PDB structure (and the high structure
shown in Figure c). This highlights the benefit of the additional
predictive information provided directly by the CB algorithm: one
can discern geometrically distinct predictions and thus make a more
precise prediction about the structure of the protein under consideration.The ATSAS package does allow for the interpretation of bead models
with tertiary structure through the use of the CORAL package.[50] The package attempts to fit a structure into
the bead model with a mixture of known (manually assigned) and unknown
secondary structure elements. This procedure was performed in ref (24) to provide evidence that
the SYCE1 core modeled in ref (24) was a coiled-coil domain; a rendering of this model is
shown in Figure a. Two coiled coils were superimposed on a bead model with CORAL
providing an additional linker section to join them. Our model simply
uses the sequence data to determine the secondary structural elements;
then it is able to try millions of differing (physically realistic)
folds and tests each time if they satisfy the scattering
data, a much more direct and exhaustive test, which relies on far
less user input. What is interesting is that this technique predicts
an additional coiled-coil domain at the structure’s center
(cf. panels a and b of Figure ), owing to the sequence interpretation splitting of
the helical units. The method presented in this paper offers more
flexibility in terms of using additional structural constraints and
is more amenable to automated structural evaluation. Its main comparative
advantage, over the coral package, is the potentially exhaustive automated
search of a space of potential tertiary folds with realistically constrained
secondary structure.
Figure 18
Comparison of the CORAL derived model of the SYCE1 core
obtained
in ref (24) (a) and
an example CB derived model (b).
Comparison of the CORAL derived model of the SYCE1 core
obtained
in ref (24) (a) and
an example CB derived model (b).
Computation Time
A single calculation comprising the
CB algorithm, the generation of the hydration layer, and calculation
of the scattering curve takes on the order of 0.05 s for lysozyme
(129 residues) and 0.5 s for BSA (433 residues), both based on calculations
performed on a single CPU, with the main cost coming from the Debye
formula, eq . As far
as the actual optimization goes, the timing can vary significantly;
this depends on the number of secondary units which can be changed,
the randomized initial condition, and the difficulty in satisfying
additional restrictions such as the contact predictions (and how tightly
they have been penalized). The ab initio lysozyme predictions generally
varied between 10 min and 1 h. For the SYCE1 chain (318 residues)
it was closer to 20 h (that said, as mentioned above the predictions
produced in this case were reliably accurate). In future we will look
to implement Bayesian learning techniques for the search, as a large
number of models suggested by the Monte Carlo sampling overlap themselves
and consistently trying such models wastes much time. This will be
crucial to ensuring it can be run for larger molecules in the future.
Number of Initial Models
One might ask how many predictions
are required in order to obtain a viable structure (ab initio). The
examples here present a contrasting picture. The lysozyme cases consistently
produced structure which fit the scattering data, but as discussed
only a relatively small percentage (about 8%) were considered a sufficiently
good fit to the scattering data (i.e., a sufficiently low RMSD with
respect to the PDB structure and high value).
By contrast, from 40 initial conditions
for the SYCE1 molecule only 12.5% were able to fit the scattering
data, but all were nearly identical () and excellent candidate structures. It
is likely this is because lysozyme is a globular protein while SYCE1
has a very flat, linear structure. It is relatively easy to distort
our model into a globular shape, but it allows for more structural
variance, while the more linear structure is harder to form but much
more constrained. The consistent evidence is that, currently, one
might need at least 10 optimization runs in order to obtain a good
quality prediction. A future aim will be to better enforce contact
predictions or other constraints during the fitting procedure in order
to bring down this ratio.
Conclusion
As
a solution-based technique, BioSAXS can provide structural information
for targets where crystallization and cryo-electron microscopic (cryo-EM)
techniques are challenging. Also, the method allows data collection
in a more natural environment than techniques such as crystallography
and cryo-EM. Additionally, SAXS is not limited by protein size, as
is the case for cryo-EM and NMR. Therefore, there is a clear need
to develop the techniques for interpretation of this data in an ab
initio setting which improve on the levels of structural detail provided
by the bead models currently popular.In this paper we have
shown that curve representation with hydration
shell provides a molecular model for BioSAXS data with fits as good
as or better than traditional bead and envelope models. Unlike these
models our model includes a complete secondary and tertiary model
description. Importantly, starting from random models that only take
secondary structure information and sequence-dependent distance constraints
into account, a physically meaningful 3D model can be obtained by
fitting models against the experimental data. That this is possible
is due to the fact that the model is described with far fewer parameters
compared to even a coarse-grained model that required three coordinates
for each amino acid, combined with use of geometric
constraints for regular secondary structural elements.In order
to show the potential of this ab initio technique, it
was applied to a tethered core component of the humanSYCE1 protein,
for which no high resolution structural data is available. The model
derived was based on sequence information alone and matches that of
a model that was previously reported in ref (24), where the model was based
on manual inspection of the sequences coupled with the fitting of
ideal coiled coil segments to experimental scattering data. Importantly,
while the previously modeled structure includes two coiled coil segments,
the model derived here recognized that this was the minimum number
of segments required to explain the curved structure and that the
true structure could consist of multiple coiled coils interrupted
by short linkers. Thus, our novel ab initio method has successfully
generated a highly plausible model from experimental scattering data
without the need for any more than minimal manual evaluation. This
facility will be crucial for ab initio structural determination (from
BioSAXS data) of larger molecules where it would not be practical
to generate structures manually.Further experimental information
such as distance information from
any other source can easily been added in the form of additional restraints
into the optimization algorithm. The model’s explicit description
of realistic secondary structure means additional information, such
as contact predictions, radius of gyration, hydrophobicity of the
chain, and disulfide bonding, can be employed as model constraints
in the future. This will further enhance the accuracy of all potential
models, and in particular help the end user to distinguish mathematically
correct but physically less likely models from correct solution. The
secondary knot fingerprint statistic developed shows significant potential
to evaluate structural similarities of models and hence to further
automate this vital validation step.The two future next steps
are (i) the application of this method
to multimeric structures where each known monomer structure can initially
be treated as rigid body and then refined in order to account for
local changes in solution and (ii) the application to larger, de novo
structures where the exact 3D structure remains elusive. The second
goal will require further refinements of the search space method of
the optimization algorithm. The application to homomultimers is straightforward
and requires only minor addition to the existing code; we expect this
to be the major initial application of our methods. Due to the limited
information content of small angle X-ray scattering data, the ab initio
fold determination will depend on the accuracy of secondary structure
prediction combined with appropriately weighted distance constraints
such as those discussed above.
Authors: Michal Jamroz; Wanda Niemyska; Eric J Rawdon; Andrzej Stasiak; Kenneth C Millett; Piotr Sułkowski; Joanna I Sulkowska Journal: Nucleic Acids Res Date: 2014-10-31 Impact factor: 16.971