Literature DB >> 32023061

Obtaining Tertiary Protein Structures by the ab Initio Interpretation of Small Angle X-ray Scattering Data.

Christopher Prior¹, Owen R Davies², Daniel Bruce^3,4, Ehmke Pohl^3,4.

Abstract

Small angle X-ray scattering (SAXS) is an important tool for investigating the structure of proteins in solution. We present a novel ab initio method representing polypeptide chains as discrete curves used to derive a meaningful three-dimensional model from only the primary sequence and SAXS data. High resolution structures were used to generate probability density functions for each common secondary structural element found in proteins, which are used to place realistic restraints on the model curve's geometry. This is coupled with a novel explicit hydration shell model in order to derive physically meaningful three-dimensional models by optimizing against experimental SAXS data. The efficacy of this model is verified on an established benchmark protein set, and then it is used to predict the lysozyme structure using only its primary sequence and SAXS data. The method is used to generate a biologically plausible model of the coiled-coil component of the human synaptonemal complex central element protein.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Proteins

Year: 2020 PMID： 32023061 PMCID： PMC7145352 DOI： 10.1021/acs.jctc.9b01010

Source DB: PubMed Journal: J Chem Theory Comput ISSN： 1549-9618 Impact factor: 6.006

Introduction

Biological small angle X-ray scattering (BioSAXS) is an increasingly important method for characterizing protein structures in solution.[1−3] Its primary advantage over techniques such as crystallography and NMR is its ability to provide information under native conditions about large protein molecules not accessible by complementary methods. However, there is a price to pay for this advantage: the random motion and orientation of molecules in solution leads to a loss of information due to an effective averaging of the scattering, leaving only information about the protein’s intramolecular distances, not their spatial orientations.[4] Because of these challenges, correctly interpreting BioSAXS data to realize meaningful results remains a challenging task.[5] Two main methods have been developed to interpret BioSAXS data. The first assumes an accurate three-dimensional (3D) model of the protein backbone, usually derived from X-ray crystallography.[6−9] This model is used to calculate the X-ray scattering curve once the excluded solvent volume is taken into account. A major advance, first presented in the CRYSOL algorithm,[6] was the inclusion of the solvation layer—the ordered water molecules at the surface of the protein. CRYSOL and the FoXS package, which was developed by Schneidman-Duhovny et al.,[7] adjust an implicit “shell” of scattering (“implicit” meaning they do not model individual solvent molecules). Other packages treat the shell explicitly using either molecular dynamics (AquaSAXS)[8] or a geometric filling approach (the SCT suite).[9] Allowing for a shell that can have gaps and fill cavities in the protein model gives a more reliable fit to the data.[5] An extension of this approach is to use all atomistic modeling with Protein Data Bank (PDB) structures as a starting point.[10,11] The application of such techniques, however, can require significant technical expertise. The second method does not assume an initial structure (ab initio) but simplifies the protein model as either a volume[12] or a chain[13] of scattering beads without explicit secondary structure. These methods are hence applicable to de novo structural prediction, but the lack of secondary structure means interpreting these predictions is a difficult task.[5] Here we propose an alternative ab initio technique that uses a curve model of the 3D structure of the polypeptide chain. This description has a much reduced number of parameters by comparison to all atomistic models. Similar curve models have been previously proposed[14−16] but not applied to the interpretation of BioSAXS data. The model is parametrized by consecutive discretized descriptions of the four major secondary structural elements: α-helices, β-strands, flexible sections, and random coils. The permissible geometry of these curves is restricted by empirically determined constraints, which are akin to Ramachandran constraints.[17] To use the model for the interpretation of BioSAXS data, the polypeptide chain model is combined with a water model for the first hydration shell and an empirically calibrated scattering model. The geometry of the model can then be optimized against the experimental BioSAXS data. A critical factor, novel to our curve representation of the polypeptide chain, is the construction of empirical probability distributions for the model parameters. These distributions serve the dual purpose of preferencing commonly observed secondary structures in the set of potential chain models, while simultaneously allowing for predictions with rare/novel but physically permissible secondary structure. An advantage of this method for ab initio interpretation of BioSAXS data, by comparison to the established bead models,[12,13] is that by accurately characterizing the protein’s secondary structure it can reliably incorporate additional structural information in order to improve the results of the technique. In this study, contact predictions, based on sequence alignments alone, are used to improve the model predictions. A final advantage of the code developed is that its only input requirements are the primary sequence and scattering data, so it places only basic technical requirements on the user for its use. We first applied this new methodology to data of well-characterized model protein lysozyme before moving to the BioSAXS data of the structural core of the human synaptonemal complex central element protein 1 (SYCE1). This protein represents an essential structural component of the synaptonemal complex (SC) that binds together homologous chromosomes during meiosis and provides the necessary three-dimensional environment for crossover formation.[18−20] The SC is formed of oligomeric α-helical coiled-coil proteins that undergo self-assembly to create a latticelike assembly.[21−23] In a recent biochemical and biophysical study, human SYCE1 was shown to adopt a homodimeric structure in which its structural core is provided by residues 25–179 forming an antiparallel coiled coil.[24] Further, the structural core was expressed in an engineered construct in which two SYCE1 25–179 sequences were tethered together through a short linker sequence (GQTNPG). This construct faithfully reproduced the native structure, and substantially improved protein stability in solution[24] (by comparison to the unlinked core). In this study, using secondary structure predictions and distance restraints purely based on the sequence of the protein alone, an excellent model of an antiparallel extended but bent coiled coil is derived, which is fully consistent with biological data.

Methods

First we describe the reduced parameter protein model we use to interpret BioSAXS data. This is composed of a polypeptide chain curve model with a surrounding explicit hydration shell. Empirically calibrated structure factor functions for each constituent element of the model are constructed to produce theoretical scattering curves for this tertiary structure model.

Polypeptide Chain

The polypeptide chain is represented as a set of points in 3D space {c}, the positions of the Cα atoms in each amino acid. The geometry of four consecutive points (c, c, c, c) can be characterized by two parameters: the curvature κ and torsion τ. κ is defined by the unique sphere made by the center of the joining edges (see Figure a); the smaller the sphere the more tightly the curves joining the points fold on themselves. τ is determined by how sharply the curve bends away from its plane of curvature, and its sign denotes the curve section’s chirality (it is positive for right-handed coiling and negative for left-handed coiling). More precise definitions are as follows.

Figure 1

Figures depicting elements of the backbone model. (a) Curve subsections (c, c, c, c) (red points) and their midsection points (c, c, c) (blue). The first example is more tightly wound and has a smaller sphere, hence a higher κ value. The sphere defined by these midsection points is shown; the inverse of its radius is the curvature κ. (b) α-Helical section with uniformly similar (κ, τ) values. (c) Flexible (linker) section with varying (κ, τ) values.

Curvature κ

A section of four residues defined by the points (c, c, c, c) defines three edges with midpoints c = (c + c)/2, which in turn define the curvature sphere.[1−3] The curvature, the inverse of its radius, iswhere θ123 is the angle between the vectors c – c and c – c.

Torsion τ

Three points define a plane (with unit normal vector n), and the four points (c, c, c, c) define two planes through their unit normal vectors n2 and n2, respectively:The torsion is the (length weighted) angle these planes make with each other:with θ is the angle between n1 and n2; see, e.g., ref (25). The algorithm for generating a curve of length n from n – 3 pairs of values of (κ, τ) is as follows: Consider a section of curve of length m and m – 3 pairs (κ, τ), whose three initial points c1, c2, and c3 are randomly chosen (with fixed separation distance R = 3.8). Since scattering expressions are invariant under an arbitrary translation and rotation,[4] the exact values of the first two points do not matter (as long as their separation is R). The third point is a structural degree of freedom, but it is restricted such that the Cα–Cα distance between c1 and c3 is greater than R. Once these points are specified the fourth point will bewith θ = ∈ [0, π], ϕ ∈ [0, 2π]. The set (c1, c2, c3, θ, ϕ) defines four points and hence κ and τ values. Using values of κ1 and τ1, eqs and 3 are solved for θ and ϕ; this gives c4. The next point c5 can similarly be found from the values κ2 and τ3, and so on until all m – 3 (κ, τ) have been used to yield the m points c. Examples of an α-helical and flexible linker sections (taken from the structure of bovine serum albumin (PDB 3V03)[26]) are shown in Figure b,c.

Secondary Structure Geometry Restraints

In order to derive geometric constraints, Cα coordinates were extracted from a set of over 60 protein structures for which high resolution crystal structures are available in the PDB and the κ and τ values calculated for all subsections (c, c, c, c). The κ–τ pairs are shown in Figure a. There are three main populations of values (preferential regions). As shown in section 1.3 of the Supporting Information, these regions of (κ, τ) space correspond to the three preferential domains of Ramachandran space.[17] Using the PDB’s secondary structure annotation, this data was split into categories of β-strands, α-helices, and the rest which are not identified (referred to here as linkers). To account for random coils, the data was further divided into subsets whose values remained in one preferential domain (as in Figure b) and those whose κ–τ values belong to multiple domains (as in Figure c). For each set of data a representative probability density function (PDF) was calculated using kernel smoothing techniques[27] (for details see sections 1.4 and 1.5 of the Supporting Information); an example is shown in Figure b.

Figure 2

Illustrations of κ–τ spaces used to impose realistic geometry constraints on the polypeptide chain. (a) (κ, τ) pairs obtained from crystal structures, plotted as points with κ on the horizontal axis and τ on the vertical axis. (b) PDF, created from the data in (a), which correspond to linker sections. There are three distinct domains of high probability corresponding to the preferred secondary structural elements.

Generating Models from Secondary Structure Annotation

In order to generate models based on secondary structure information alone, a protein of n amino acids is split into l distinct subdomains of length m (∑ m = n). Each section l is classified as α-helical, β-strand, or linker; for the purpose of testing and calibration the PDB file’s secondary structure assignment was used to perform this task. For each section of length m, m – 3 (κ, τ) pairs are drawn from an appropriate PDF and the section is constructed. This process creates the l individual sections of secondary structure elements, which must then be linked together. Two neighboring sections with specified geometry (for example, an α-helix and linker) still have a relative rotational degree of freedom. To ensure this remains physically realistic, the geometry of the last three and first Cα positions of neighboring secondary sections were extracted from the PDB set and further PDFs for the set of permissible (κ, τ) pairs of these joining sections were generated for each type of join (i.e., α-helix to linker or linker to β-strand). Therefore, the final step of the process is to obtain all (κ, τ) values for the joint geometry and then construct the whole backbone. A precise mathematical description of this algorithm, constrained backbone algorithm (CB), is given in section 1.6 of the Supporting Information. One example of a structure generated using this algorithm is shown in Figure b; this particular structure was used as a starting point for an ab initio structure optimization in this study.

Figure 6

Secondary knot fingerprint analysis of the lysozyme structure. (a) Cα trace of lysozyme (PDB 1LYZ(41)). The α-helices are shown in red, β-strand structures are in green, and linker sections are in light blue. (b) Random structure generated using the CB algorithm which has the same secondary structural elements as lysozyme. This could be a starting model for the fitting procedure. Panels c and d are secondary fingerprints of two different crystal structures of lysozome (1LYZ and 1AKI, respectively). The knot types are indicated (Rolfsen classification[42]); white spaces indicate no secondary knots (all knots were of the primary type). (e) Secondary fingerprint for the random structure shown in (b); it differs significantly from (c) and (d) and has a larger range of knots present.

The Hydration Layer

Once the curve representation is obtained, it is crucial to include a model of the hydration layer in order to generate realistic scattering curves. To this aim, solvent molecules are placed between a pair of cylindrical surfaces surrounding the axis of a section of the backbone (Figure a). This layer is then reduced by removing all overlapping solvent molecules. This ensures the shell remains in hollow sections between the fold and on the protein surface, while the water molecules are removed from significantly folded regions. This is a crucial aspect of our hydration layer model, as it has been shown that one needs to allow for inhomogeneous hydration layers in order to avoid inaccurate predictions from BioSAXS data.[28] This method is illustrated in Figure , where the two cylinders of radius Rc (core) and Ro (outer), Ro > Rc, are centered on a section i’s helical axis (Figure a). Consider a solvent molecule belonging to another section j whose nearest distance from the axis of section i is Rs. If Rs < Rc, the solvent is too close to the backbone and removed. If Rc < Rs < Ro, the solvent is classed as being shared by the sections i and j and only counted once.

Figure 3

Visualizations of the hydration layer model. (a) Initial solvent layer, shown as silver spheres with the core Rc and outer Ro cylinders surrounding the axis of the section (red curve). (b) Overlapping sections and solvent layers. (c) In blue, the removed solvent molecules of the pair of sections shown in (b). This process is applied to all solvent molecules from sections i and j on each other; an example of the outcome is shown in Figure b,c. Applying this process pairwise to all sections of a Cα backbone yields the final hydration layer. The exact mathematical description of this hydration layer is detailed in sections 2.1–2.3 of the Supporting Information. The values of the radii (Rc, Ro) and a number of other parameters controlling the solvent density were determined by fitting the model to high resolution crystal structures which contained the first hydration shell. An example model shell, generated with these parameters, is shown in comparison to the model solvent positions from the subatomic resolution structure of a phosphate binding protein from the PDB 4F1V(29) in Figure . It is shown the two distributions are statistically similar in section 2.4 of the Supporting Information and hence that the model is a realistic representation of the average positions of the inner hydration shell.

Figure 4

Comparisons of crystallographic and model solvent positions from the crystal structure of a phosphate binding protein PDB 4F1V, determined at ultrahigh resolution of 0.88 Å.[29] (a) PDB backbone and relevant solvent molecules. (b) Model solvent positions (surrounding the same curve as in (a)) obtained with the experimentally determined hydration shell model parameters.

Scattering Formula

Once the polypeptide chain and hydration layer models are determined, the Debye formula[30]is used to calculate the scattered intensity I(q) as a function of momentum transfer, q = π sin(θ)/λ. Here N is total number of Cα atoms and solvent molecules and f(q) is the form factor for residue i. There are two types, one for an amino acid with an excluded volume correction and one for a solvent molecule, which are defined in the following section.

Amino Acid Form Factors

The form factor fam of an amino acid, centered on the Cα atom position, iswhere fb is the scattering of the amino acid in a vacuum, fex is the adjustment due to the excluded volume of solvent, and ρex is a constant. Each amino acid is assigned the same scattering function fb(q), a five-factor exponential representation:where {A, B}5 and C are empirically determined constants (a standard form used to fit molecular form factors[31]). The excluded volume effect is captured using an exponential model in the formwhere rw is the average atomic radius of the atom.[6,7,13] To calculate the excluded volume for amino acids, coordinates for all 20 amino acids[32] and values of rw for carbon, nitrogen, oxygen, hydrogen, and sulfur (e.g., ref (33)) were used to compute the excluded volume scattering, centered at the Cα, throughwhere rα is the distance of atom i from the Cα molecule and Nam is the number of atoms in the amino acid. Since fb does not discriminate individual amino acids, this value fexam was averaged over all 20 amino acids, weighted by their abundance in globular proteins (see ref (34)). This averaged function, shown in Figure e, gives fex(q). Finally, eq includes a constant ρex which modulates the effect of the excluded volume scatter by comparison to fb; this value is constrained to lie within 0.75 and 1.25 (similar constraints are used in refs (6, 7, and 13)). The scattering form for an individual water molecule in the hydration layer iswhere fhy and fox are the vacuum scatterings of hydrogen and oxygen, respectively.[31] The constant ρh was empirically determined (as in ref (7)). A detailed description of the parameter determination method is given in section 3 of the Supporting Information.

Figure 5

Fits to scattering data for various molecules using appropriate Cα coordinates as a backbone model {c}. (See section 3 of the Supporting Information for details.) In panels a–d, the data scattering data is shown overlaid by the smoothed data used for fitting (blue curve) and the model fit (red curve). Panel e is the averaged scattering function famex obtained by averaging the scattering parameters obtained from fits like those shown in (a)–(d).

Evaluating Structural Similarity

In the next step the geometry of each model generated by the CB algorithm is optimized by refinement against the scattering data. However, since the problem is under-determined, many models will fit the experimental data, so a method is required to compare structures and determine which predictions are “essentially the same” in that they only differ by small local conformational changes (as one should expect in solution). The standard methods in protein crystallography for comparing similar protein structures are based on root-mean-squared deviations (RMSDs) where two structures are superimposed to minimize the sum of all distances of equivalent paired atoms.[35,36] This measure and variants on it are known to be overly sensitive to large deviations in single loops (as discussed in ref (35)). Unlike homologous crystal structures, which will often only differ by the change in a small subsection of the whole structure, the comparison here will be made between structures generated by a random algorithm, so the significant buildup of relatively small individual RMSD errors is likely. In section 2.1 of the Supporting Information, a number of additional problems with using the RMSD measure in this context are discussed in detail. To mitigate these problems, a novel and more robust approach based on knot theoretical techniques was developed.

Knot Fingerprints

Techniques from knot theory have previously been applied to identify specific (knotted) entanglements in protein structures.[37] To compare two protein structures using knot theory, the N- and C- termini need to be joined.[38] As demonstrated in ref (37), a sensible procedure for making this extension is to surround the backbone with a sphere, then choose two random points on the sphere and join the end termini to these points; finally, this extended curve is closed with a geodesic arc. The knot is then classified (e.g., via Jones polynomials). This procedure is repeated a significant number of times (10 000 in this study), and the most common knot (MCK) is chosen to indicate the knotting of the curve. To obtain additional information, the MCK is calculated for all subsets {c|i = k, k + 1, ..., j, j > k, j – k > 3} of the curve. One can then plot this data on a “staircase” diagram with j and k on the axes and each square of the domain colored by its most common knot (e.g., ref (39)) (examples of staircase diagrams are shown in Figure c–e). The fingerprint is found to be preserved across protein families,[39] even when there is low sequence identity.[40] Secondary knot fingerprint analysis of the lysozyme structure. (a) Cα trace of lysozyme (PDB 1LYZ(41)). The α-helices are shown in red, β-strand structures are in green, and linker sections are in light blue. (b) Random structure generated using the CB algorithm which has the same secondary structural elements as lysozyme. This could be a starting model for the fitting procedure. Panels c and d are secondary fingerprints of two different crystal structures of lysozome (1LYZ and 1AKI, respectively). The knot types are indicated (Rolfsen classification[42]); white spaces indicate no secondary knots (all knots were of the primary type). (e) Secondary fingerprint for the random structure shown in (b); it differs significantly from (c) and (d) and has a larger range of knots present.

Secondary Knot Fingerprints

Figure c is the knot fingerprint for one set of lysozyme coordinates (shown in Figure a), of the second most common knot identified during the random closure process. The secondary fingerprint shown in Figure d is from a second set of lysozyme coordinates; panels c and d of Figure are significantly similar. The secondary fingerprint (Figure e) is derived from a CB generated backbone model, shown in Figure b, which has the same secondary structure sequence as 1LYZ. The secondary fingerprint differences between the correct structure (Figure c) and the randomly generated structure (Figure d) is immediately obvious. All primary fingerprints in these cases are identical, and all have the unknot as the most common knot. It is clear secondary (and possibly tertiary) knot fingerprints can differentiate unknotted folds. A knot fingerprint statistic is defined in section 4.2 of the Supporting Information which quantifies the weighted similarity of knot fingerprints at level l associated with the curves K1 and K2 (l = 2 for Figures c–e); it yields a value between 0, completely dissimilar, and 1, identically folded. In section 4.3 of the Supporting Information it is demonstrated that the statistic has the following properties. First, it quantifies crystal structures of the same molecule as highly similar, , and randomly generated structures (with the same secondary structure sequence) as significantly dissimilar, generally (see Figure a). Second, it judges crystal monomer structures of different proteins with different 3D structures but with similar lengths as being significantly different (typically ); i.e. it can differentiate folds. Third, it is shown to have excellent properties under deformation. To demonstrate, n randomly distributed changes were applied to a lysozyme crystal structure Kpdb using the CB algorithm. For each n, 50 such structures K were generated and the values of the statistic were calculated. The results are plotted as a function of n in Figure b. The mean value drops off rapidly to the same value as the average of the randomly generated structures (after about 15 changes). The maximum value always remains significantly higher than the mean; it drops below PDB quality after only two changes. Therefore, a high value of >0.75 indicates the structure is likely largely the same as the original structure.

Figure 7

Properties of the (secondary) knot fingerprint statistic based on variations of the lysozyme structure. (a) Secondary knot statistics of various structures K compared to the curve shown in Figure a. The two distinct sets are lysozyme coordinates from the PDB and random structures with secondary structure alignment to lysozyme (generated using the CB algorithm). (b) Plots of the mean, maximum, and minimum values of the 50 secondary knot statistics comparing the 1LYZ structure and the same structure subjected to n random changes in its secondary structure. The dotted lines show one standard deviation from the mean. The black line is the average of the PDB structure secondary fingerprint statistics (see (a)), the purple line is the random structure average (crossing the mean at about n = 15), and the yellow line is the average of secondary fingerprint values for models which fit the experimental data (crossing the mean at about n = 3).

Experimental Data Fitting

The following chi-square statistic χf2 is used to to assess the fit quality of model predictionswhere ns is the discrete number of points on the domain q ∈ [0, 0.4] on which the scattering is sampled (a commonly used domain, e.g., ref (7)). Im is the model scattering calculated using the Debye formula, eq , and Ies is the smoothed experimental data. The data is smoothed using the procedure described in ref (43) which is designed to avoid overfitting. The idea is that the maximum distance between two points on the molecule Dmax can be used to identify the number of independent data points one needs to recreate the scattering data using the Shannon sampling theorem. This number is the nearest integer to n = Dmaxqmaxπ–1. One then separates the experimental data into n bins; as in ref (43) the bins are chosen to be evenly spaced on the domain of q values used. A data point is then selected from each bin; again we follow the established procedure in ref (43) that this data point is the median of 1001 randomly selected samples of the data from the given bin, akin to a robust least-trimmed-squares method (see ref (43) for more details). The factor Ld, which will superimpose identical curves that differ by a translation, is used because the protein concentration can only be measured with relatively low accuracy[6,7] (when taking a logarithm of the data, a scaling factor becomes a vertical translation). In addition, to prevent chemically unreasonable conformations, a penalty is applied if the Cα–Cα distance of ≤3.8 Å occurs for any pair of nonadjacent Cα positions; this quantity is labeled χnl. The initial model is optimized as described above until χf2 + χnl < 0.008. Values below this threshold represent an excellent fit to the scattering data, as shown in Figure d. This value is based on a comparison to other studies (see section 3.6 of the Supporting Information).

Figure 8

Illustrating the fitting process. (a) Initial configuration of the backbone based only on the secondary structure assignment of lysozyme (PDB 1LYZ). Also shown as spheres are the molecules of the hydration layer. (b) Model scattering curve comparing BioSAXS data. (c) Final structure (and hydration layer) obtained from the fitting process and its model scattering curve now fitting the BioSAXS data well (d).

Results

Validation of the Backbone Curve and Water Model

As discussed under Methods, each part of the model, the Cα backbone, the explicit hydration layer, and the scattering model have been individually designed and verified using actual structures from the PDB. However, it remains to demonstrate the composite model’s efficacy. To test this, it was applied to the benchmark set of proteins used to compare the set of atomistic small angle scattering verification methods in ref (44) (this is in addition to the cases shown in Figure ). This set includes monomeric and multimeric proteins both globular and elongated. We allow the parameters of the scattering model to vary for each structure but fix the geometric hydration layer as described above. The scattering model is physically constrained in the same manner as in the FoXS[7] and CRYSOL[6] models, as discussed in detail in section 3 of the Supporting Information. For the sake of brevity we also detail these results in section 3.6 of the Supporting Information; it suffices to state here that the model performs comparably to the atomistic structure techniques and hence can be used to correctly infer protein structure from small angle scattering data.

Developing and Testing an Averaged Scattering Model for ab Initio Prediction

In an ab initio fitting it will be necessary to fix all parameters of the scattering model so that the algorithm only alters the protein backbone parameters (the pairs (κ, τ)); this will allow the model to run in a reasonable time frame. In section 3.61 of the Supporting Information, we detail the construction of an averaged scattering model based on the set of parameters used for each successful fitting detailed in section 3.6 of the Supporting Information. In general, if this average scattering model is then reapplied to the PDB structure and explicit hydration shell, we do not obtain a sufficiently good fit to the scattering data (although it is not too far off). The aim of this section is to show that we can use this averaged model and distort an initial PDB curve representation model in order obtain a high quality fit to the scattering data while still retaining a sufficiently realistic structure (within a few angstroms on average). This demonstrates the ab initio technique proposed here contains within its potential prediction population a high quality representation of the actual protein structure. It will also highlight some properties of the knot fingerprint statistic, by comparison to the widely used RMSD structural comparison statistic. To perform this test we selected three pairs of proteins and known crystal structures—lysozyme (PDB 1LYZ), ribonuclease (PDB 1C0B), and bovine serum albumin (BSA; PDB 3V03, selecting a monomer unit)—and scattering data obtained from the SAS database.[45] We used the PDB coordinates and secondary structure assignment as an initial input into the algorithm; then we altered each secondary section individually using Monte Carlo sampling of the κ–τ distributions and the CB algorithm to generate new structures. Using the hydration layer and scattering model, scattering curves were generated for these models. This process was run until a suitable fit to the scattering data was obtained.

Lysozyme and Ribonuclease

Examples of the derived models obtained for lysozyme are compared to subsections of the original PDB in Figure ; we compare subsections for visual clarity. Typically the structures are nearly identical with only the occasional slight deviation in the geometry of some of the linker sections. This similarity is reflected in both the RMSD measures (calculated using the Biopython module[46]) and the knot fingerprint statistics, as shown in Figure a. As expected, both indicate excellent fits to the structure. There is a (Pearson) correlation of −0.3 between the two measures, a value on the edge of weak and reasonable. The results for ribonuclease were very similar, and the fit statistics are shown in Figure b. Again there is also a clear relationship between the knot fingerprint statistic and the RMSD measure; in this case the correlation is very strong, −0.8. We see this correlation between the two measures as further justification of the knot statistic’s appropriateness as a measure of structure.

Figure 9

Figure 10

Comparison of RMSD measures and knot fingerprint statistics for fittings of the model to scattering data for lysozyme and ribonuclease.These results are obtained using the PDB structure as the initial input to the algorithm and are by comparison to that PDB. (a) Lysozyme; (b) ribonuclease.

Sections of the 1LYZ PDB structure and example fits obtained by fitting our model to the scattering data. Panels (a) and c are subsections of the PDB; (a) has the sheet. Panels b and d are composite visualizations of the predictions. Comparison of RMSD measures and knot fingerprint statistics for fittings of the model to scattering data for lysozyme and ribonuclease.These results are obtained using the PDB structure as the initial input to the algorithm and are by comparison to that PDB. (a) Lysozyme; (b) ribonuclease.

BSA

Example fits to the (parts of the) larger BSA structure are shown in Figure . We only display subsections as the full molecule is too complex for a clear visual comparison; the sections were chosen at random and are indicative of the general comparison. Once again it is clear the structures are very similar.

Figure 11

Sections of the 3V03 PDB structure and example fits obtained by fitting our model to the scattering data. Panels a and c are subsections of the PDB. Panels b and d are example predictions.

Sections of the 3V03 PDB structure and example fits obtained by fitting our model to the scattering data. Panels a and c are subsections of the PDB. Panels b and d are example predictions. Hence, we have shown that the model and method have the potential to correctly predict the tertiary structure of proteins accurately. From a purely ab initio perspective the question now is, how easy is it to get to the correct structure from a random initial guess? This question proves to be more complicated, requiring multiple predictions, so for this preliminary study we focus on a single structure: lysozyme.

Ab Initio Prediction

In the case where no crystal structure is available, the secondary structure prediction based on the sequence alone can be used as a starting point. In order to test this ab initio method, we used the small angle scattering data of lysozyme to make predictions of its structure. The process for obtaining a model is summarized in Figure . First, an initial structure is randomly generated by the CB algorithm and surrounded with an explicit hydration layer (Figure a). A model scattering curve is calculated and compared to the experimental data (Figure b). The curve is then changed by using a Monte Carlo algorithm to generate new secondary structure units (along with a new hydration shell), thus altering the model’s fold until it attains a sufficiently good fit to the scattering data (Figure c,d). Once again we use the χf2 statistic (eq ), but this time with an additional constraint on the potential search space, contact predictions, based on a large number of homologous sequences. Data from the RaptorX web server[47] for the lysozyme primary sequence were obtained. The Cα pairs with the 10 highest correlations were selected. An extra potential χcon was added to the optimization statistic to ensure the distance between these pairs was restricted to be within 5 and 15 Å. If l = 1, ..., nc labels the nc pairs of constrained points with mutual distances dc, then the quality of contact match χcon is defined as follows:with C a constant and dfc a reference distance (7 was used in this study). The value of C controls the likely variation in the distances dc; a value of C = 0.01 in this study was found to give good results. In the following a model was considered a valid prediction when both χf2 + χnl + χcon < 0.008 and χf < 0.008 so that predictions had to simultaneously fit the scattering data and minimize the geometric penalties of not overlapping and also satisfying the contact predictions (to within a specified tolerance dictated by the constant C). The results of the ab initio fitting procedure are shown in Figure a. The RMSD and knot fingerprint statistics, compared to the 1LYZ crystal structure, are shown. The first observation is that the best knot fingerprint statistics are comparable to the lower end of the from-PDB predictions obtained in the previous section. The second is that these correspond to the best RMSD measures. The apparent correlation between the two measures seems to remain for knot fingerprint statistics above 0.6. However, there is a gap between the best RMSD for the ab initio predictions and those derived from the PDB structure. This is to be expected as the knot statistic is more tolerant of differences which preserve the entanglement (the general geometry of the fold). This difference can be seen visually in Figure , panels b and c, which respectively represent the first 10 secondary structure sections of the 1LYZ crystal structure and the best fit ab initio prediction (the one closest to the PDB predictions in Figure ). The same fold back of the two significant α-helical sections is present in both cases, as is the fold back of the β-sheet (although the variability in strand geometry allowed in the algorithm means they are not identical). Further, the relative orientation of this helical pair and the strand section present is the same in both cases. Therefore, overall the basic fold geometry is correctly predicted, which is why the knot statistic is so close to the PDB values. There are, however, a number of sections with some reasonably significant distance differences, for example, the linker section joining the two helices; this means a bigger difference in the RMSD measure. Given all the difficulties associated with interpreting small angle scattering experiments, we suggest the knot statistic is a more appropriate measure of the accuracy of the prediction. One can see a similar conclusion can be applied to the rest of the molecule shown in Figure , panels d and e, for the PDB and fit, respectively.

Figure 12

Ab initio predictions for lysozyme based on sequence data alone. Panel a depicts the RMSD and knot statistic values for the predictions K; these are indicated as blue circles with the from-PDB data (Figure ) shown as brown squares for comparison. (b) Secondary structure sections 1–10 (residues 1–63) of the 1LYZ crystal structure. (c) Secondary structure sections 1–10 of the best ab initio fit. (d) Secondary structure sections 11+ (the reminder of the structure) of the 1LYZ crystal structure (residues 64–129). (e) Secondary structure sections 11+ of the best ab initio fit.

Objective Prediction Comparisons

Using only the protein sequence for secondary structure prediction and BioSAXS data, we have been able to obtain tertiary structure models which can be observed and quantified to have a fold geometry (topology) significantly similar to the lysozyme structure. However, a large number of predictions have knot statistics which suggest the structure’s fold topology differs significantly from that of the crystal structure (Figure a). The target applications for this method will be unknown structures, and it must be established whether one could have identified these were “bad” predictions without knowledge of the underlying structure. To differentiate predictions, we should seek objective structure comparison measures which do not depend on comparison of known structural information (i.e., not to the PDB). One example would be the contact prediction statistic χcon. This is objective in the sense that it only relies on sequence data, and it would generally be available. A scatter plot of the knot statistic indicates the high quality ab initio predictions (high ) are less likely to have a high χcon than the worse predictions; see Figure d. If we were to run a significant number of predictions and then say select only those below the mean χcon value and then most of the high χcon predictions remain, this could be a first means of filtering the predictions, although we see it will still leave “bad” predictions and thus further analysis is required.

Figure 13

Comparisons of high and low lysozyme predictions. (a) PDB 1LYZ crystal structure. (b) High quality fit (). (c) Low quality fit . (d) Comparison of the contact constraint χcon and the knot fingerprint; the blue points (with larger values) are for the ab initio fits and the brown dots are from the PDB fits. (e) Two (green) sections of a sheet from lysozyme model. A plane and its normal bisecting the strand sections are shown; also shown are two sections of the rest of the molecule which bisect the plane between the two strands. (f) Fingerprint–RMSD comparison plot with screened ab initio predictions.

β-Sheet Model Variations and the Power of Knot Statistics

Figure shows the full lysozyme crystal structure (a), a high model (0.73) (b) with RMSD 9.21 (by comparison to the crystal structure), and a low model (0.23) (c) with RMSD = 10.8. Therefore, there is a relatively small difference between the two predictions RMSD measures, but a significant one as measured by the knot topological method. One clear difference is the isolation of the β-sheet. In Figure a,b the sheet is at one edge of the structure, while in Figure c it is closer to the α-helical secondary units of the structure. Furthermore, because the constituent strands of the prediction shown in Figure c are not sufficiently closely related, there appears to be a section of α-helix passing between them. This is a significant difference in entanglement detected by the knot-based measure for Figure c compared to Figure a,b An inspection of the structures indicated that the better performing structures (in terms of their fingerprints) tended to have tighter and more isolated β-sheets, consistent with the examples illustrated. To try to quantify this, we created two mathematical measures. The first measure is the mean distance between sequentially paired Cα atoms of the predicted sheet structure (this sequential dependence can be determined by distance measures and does not need a predetermined knowledge of the strand orientation). We calculate this value for all predictions and choose those, say, less than the median value. The second is a discrete test as to whether any other section of the molecule passes “between the sheet”. We approximate a plane for the sheet as indicated in Figure e and then determine if any other arcs of the main Cα chain pierce this plane; if this does occur we simply reject the structure as being physically unrealistic (as is the case in Figure e). Both are objective measures. When the combination of sheet measures and the contact prediction cutoffs are applied, we are left with a significant proportion of the high quality fits, including the one with the lowest RMSD (Figure d). Crucially all the lower quality fits are filtered out. It should be noted that one of the high quality predictions was lost during this filtering process, on the basis that its mean sheet distance was too high. This underlying selection mechanism should be generally applicable being based on basic principles, so there is an indication it will be possible to produce a general post hoc selection procedure. In future it might be also be useful to use information such as sulfide bonding and hydrophobic exposure to further classify predictions.

Application to a Novel Protein with Unknown 3D Structure: The Human SYCE1 Core

Based on the success of utilizing contact predictions to constrain potential models, we applied the algorithm on the structural core of the human SYCE1 protein, a tethered construct where the sequence is repeated to allow formation of extended antiparallel coiled coils with two short additional helices at each end that could fold back to form a small three-helix bundle. The secondary structure of the tethered protein construct resulted in eight stretches of α-helices where, based on the heptad repeats helices 2, 3, and 4, can be aligned to helices 6, 7, and 8 corresponding to the same sequence, respectively, in an antiparallel fashion. This resulted in 14 close contact predictions between helices 2 and 8, and helices 4 and 6, respectively, as shown in Figure .

Figure 14

Schematic drawing of the SYCE1 construct with each box corresponding to one predicted α-helix. The SYCE sequence of approximately 120 amino acids corresponding to helices 1–4 is duplicated and linked by a tether to a repeat of the same sequence comprised of helices 5–8.

Deriving the Models

Based on the sequence and secondary structure predictions (a combination of those of RaptorX[47] and HHpred[48]), 40 initial configurations were generated using the CB algorithm. An example is shown in Figure a along with its hydration layer; its scattering curve is compared to the experimental data (from ref (24)) in Figure b. As shown, the fitting is limited to the domain q ∈ [0, 0.3] Å–1, which balances the twin considerations of a sufficient resolution and reliable signal-to-noise ratio. Using Monte Carlo optimization, the structure is altered until a reliable fit χf2 + χnl + χcon < 0.008 is obtained, where the potential χcon is based on the contact predictions described above. One such model is shown in Figure c along with its scattering curve in Figure d. The identical chains of the structure have folded to lie (nearly) parallel with the end termini occupying a local neighborhood. Two example models for which χf2 + χnl + χcon < 0.008 are shown in Figure a,b. Panels c and d of Figure indicate one of the coiled-coil structures and depict the pairwise distances associated with the contact prediction terms χcon. All models share the elongated bend shape with a antiparallel coiled-coil arrangement of helix 2–4 to helix 6–8, respectively. The first helix in each helix (helices 1 and 5, respectively) show different orientations which reflect the expected conformational flexibility of the protein in solution. Importantly, the central coiled coil (made of helices 3 and 7, respectively) is not based on the constraints given a priori but is entirely based on the optimization against the experimental data. Although a bead model results in a similar overall shape,[24] our method is able to derive a more detailed molecular model with distinct structural features such as the central coiled coil.

Figure 15

Figure 16

Illustrations of the model predictions. (a, b) All model predictions. (c) One of the coiled-coil units of (a) with black tubes representing the contact prediction distances, as seen along the axis of the unit. (d) Tilted helical structure of the coiled-coil unit. (e) Model obtained by minimizing the chi-squared measure χnl + χcon only (i.e., without taking the scattering data into account).

Illustrations of the optimization process used to obtain the model predictions for structural core of human SYCE1. (a) Initial starting configuration of the backbone based only on the sequence data shown in Figure . Also shown as spheres are the molecules of the hydration layer. Large black and white spheres indicate the end termini. (b) Scattering curve of the initial configuration (blue) overlaid on the scattering data (red). (c) Model prediction for which χf2 + χnl + χcon < 0.008; the end termini are next to each other. (d) Final scattering curve compared to experimental data. Illustrations of the model predictions. (a, b) All model predictions. (c) One of the coiled-coil units of (a) with black tubes representing the contact prediction distances, as seen along the axis of the unit. (d) Tilted helical structure of the coiled-coil unit. (e) Model obtained by minimizing the chi-squared measure χnl + χcon only (i.e., without taking the scattering data into account).

Experimental Scattering Data Is Crucial to the Prediction Quality

One might ask if the contact predictions alone were sufficient to predict the structure, since they are crucial to forming (some of) the coiled-coil structure. To test this, we derived models by minimizing the chi-squared measure χnl + χcon (i.e., ignoring the scattering data); a typical example is shown in Figure e. The outer α-helices are present as the contact prediction constraint χcon forces these structures to form. However, the whole structure is significantly folded. This folding was found to be a typical property of models obtained by minimizing only χnl + χcon, and the degree of folding was far from consistent. The clear effect of further enforcing the model fit the scattering data is twofold: first straightening out the whole structure and second, in doing so, developing a coiled-coil geometry in the middle of the structure.

Fitting to the Scattering Data and Contact Predictions Is Not Straightforward

As a final note, we note that of the 40 initial structures generated, only 5 obtained a suitably low combined chi-squared statistic (χf2 + χnl + χcon < 0.008). All five structures, two of which are shown in Figure , were basically identical in this case (comparative values > 0.9), so there was no need for any post hoc structural comparison analysis. By comparison all 40 lead to models for which χnl + χcon < 0.008. As we have just seen, there is significant value in the extra information provided by the scattering data. The difficulty with obtaining suitable fits indicates that in the future more advanced optimization techniques than a straightforward Monte Carlo search may be needed.

Discussion

Summary of Results

This paper describes in-depth the development of a tertiary structure model for BioSAXS data interpretation. A number of key points have been demonstrated with regard to its potential use for the structural biology community. First, if the method starts with a known structure with a similar tertiary structure to the target structure, e.g., a homology model or an incomplete (core) model for a structure, then it will likely find a highly accurate fit to the presumably correct structure, as verified on a benchmark set of proteins. Second, given a near-complete absence of tertiary structural information, save that available from sequence data such as secondary structure predictions, the technique can generate realistic representations of the structure’s fold. Further, in this ab initio scenario there is the potential to reliably separate realistic predictions from those which are not biologically plausible, by both constraining the fitting procedure and applying optimization filtering. This final result, demonstrated here on lysozyme, is a significant result; there exists no purely ab initio SAXS technique so far which has achieved such detailed predictions of the protein’s fold (a number of techniques superimpose tertiary and secondary structures into ab initio bead predictions but this requires extra information such as a valid homologous structure). We now discuss various aspects of the predictions and compare them to existing methods.

Data Requirements

The fundamental information required to construct the backbone model is the sequence, used to identify secondary structures (linker, strand, and α-helix). This information can be obtained from the sequence using, for example, the publicly available jPred server.[49] Then we only need the scattering data to fit this model to. Therefore, in its most basic form the algorithm can yield predictions from only two files: a sequence file and a scattering data file. We can also use the sequence file and existing spatial information to constrain the potential search space. In the case of the lysozyme predictions, the search space was additionally constrained using contact predictions. Contact predictions can be obtained based on the sequence using, for example, the RaptorX server. The server does include homology model generation if applicable, but we have only used the list of amino acid pairs which have a high sequence correlation. As discussed above, we constrain their mutual distance. Similarly, the SYCE1 protein models were additionally constrained using coiled-coil sequence alignments, as discussed under Results. The technique can also accommodate the use of additional information; for example, PDB files could be used as initial homology models or partial structure specifications. In this case the necessary pieces information required are the secondary structure specifications and the Cα coordinates. In general, any additional information used to constrain the search space needs to provide information on the Cα coordinates, in particular what secondary structure type is it, and how does its constrain its spatial coordinates (i.e., a fixed location or a fixed distances for pairs of Cα coordinates)? It is possible, for example, to fix the coordinates of a subsection of the target structure and only vary the rest, if a partial prediction is already present.

Comparisons to Existing Techniques

With regard to comparisons to existing techniques there are two categories to be discussed. The first is the set of different experimental methods used to derive structures in the Protein Data Bank. The predictions from our methods, applied to small angle scattering data, can be near this level of quality if a reliable initial structural model is provided. This was demonstrated under Results when we used PDB structures as a starting model; the algorithm yielded structures with RMSD measures (by comparison to the PDB structure) highly comparable to experimentally obtained models (for α-carbon positions). In a purely ab initio scenario our results indicate it is currently difficult to obtain this level of accuracy on a reliable basis (although one can get single angstrom RMSD measures). However, as shown in Figure d, there is some indication that, if extra constraints such as contact predictions from homologous sequences can be enforced to a high degree of accuracy, there is the potential to reach similar levels of structural resolution to these alternative experimental techniques.

Ab Initio Methods

The second comparison would be to SAXS-specific ab initio techniques for interpreting BioSAXS data. These include the bead based models such as GASBOR and DAMMIN.[13,50] However, a direct comparison is difficult because the natures of predictions are different. Neither method makes explicit predictions of the sequence-ordered tertiary structure of the molecule. Both are composed of effective scattering beads. The DAMMIN model aims to predict the volume occupied by the molecule by creating a cloud of beads, while GASBOR aims for a structure with a chainlike nature constraining bead–bead distances, but there is no explicit secondary structure in the model. However, previously a technique for comparing bead models to tertiary structures was suggested, using a measurement, the normalized spatial discrepancy (NSD), derived from several metrics for comparing the distributional similarity of two sets of points.[51] This calculation is implemented in the SUPCOMB routines, part of the ATSAS package. The SUPCOMB algorithm rotates (and translates) the bead clouds to minimize their NSD measure, as described in ref (51). This does not account for the sequential nature of the chain in the models derived here. Instead, it in effect measures the similarity of two point clouds, but it is a sensible means of comparing the results of the results obtained in this study to GASBOR predictions (the closest to the CB model in that it develops chainlike structures). We compared the GASBOR prediction for lysozyme (obtained from SASBDB,[45] code SASDAG2) for the same scattering data that was used in Results to obtain the CB algorithm prediction; note that both use the same q domain ([0, 0.4] Å–1). In both cases the structures were compared to the 1LYZ structure via their NSD measures. An NSD value below 1 is generally considered a good match.[52] A small subset of the results are indicated in Table ; all values are below 1. The quality of this fit is illustrated in Figure , where the predictions are superimposed on the solvent-accessible surface area of the 1LYZ structure. Therefore, in terms of point cloud comparison, both prediction techniques can yield very similar predictions. This is not such a surprise as both fit the low q portion of the scattering curve, which contains the large scale structural information, very well.

Table 1

Various Structural Comparison Metrics for Predictions of Lysozyme Predictions, All by Comparison to the 1LYZ Structure and All Quoted to Three Significant Figures

prediction type	SUPCOMB NSD		RMSD
GASBOR model	0.813	N/A	N/A
CB pred 1	0.880	0.729	9.77
CB pred 2	0.863	0.703	9.22
CB pred 3	0.889	0.708	10.3
CB pred 4	0.866	0.133	13.2

Figure 17

GASBOR (obtained from SASBDB, code SASDAG2) and CB algorithm models for lysozyme superimposed on the 1LYZ crystal structure. (a) Bead cloud (red) of a GASBOR prediction for the same lysozyme scattering data as used above, obtained from SASBDB (code SASDAG2). A lysozyme prediction (blue curve) from Results for which the SUPCOMB NSD is 0.813 and knot fingerprint statistic (both by comparison to the 1LYZ PDB structure). (c) Lysozyme prediction (green curve) from Results for which the SUPCOMB NSD is 0.8664 and knot fingerprint statistic (both by comparison to the 1LYZ PDB structure). (d) Secondary structure sections 1–10 of the CB lysozyme model shown in (c). What is of particular interest is that structures with significantly different knot fingerprint statistics yield very similar SUPCOMB and similar RMSD measures. Two examples of CB predictions are shown in Figure , panels b and c, respectively: and . Both structures fit very well inside the lysozyme solvent accessible surface area, as indicated by the near identical NSD measures. The low measure of the structure in Figure c is seen in Figure d to be a result of the fact that the subsection involving the strand structure has a significantly different subfold compared to the PDB structure, shown in Figure b. In particular, the β-sheet structure “folds back” in the direction opposite to that of the PDB structure (and the high structure shown in Figure c). This highlights the benefit of the additional predictive information provided directly by the CB algorithm: one can discern geometrically distinct predictions and thus make a more precise prediction about the structure of the protein under consideration. The ATSAS package does allow for the interpretation of bead models with tertiary structure through the use of the CORAL package.[50] The package attempts to fit a structure into the bead model with a mixture of known (manually assigned) and unknown secondary structure elements. This procedure was performed in ref (24) to provide evidence that the SYCE1 core modeled in ref (24) was a coiled-coil domain; a rendering of this model is shown in Figure a. Two coiled coils were superimposed on a bead model with CORAL providing an additional linker section to join them. Our model simply uses the sequence data to determine the secondary structural elements; then it is able to try millions of differing (physically realistic) folds and tests each time if they satisfy the scattering data, a much more direct and exhaustive test, which relies on far less user input. What is interesting is that this technique predicts an additional coiled-coil domain at the structure’s center (cf. panels a and b of Figure ), owing to the sequence interpretation splitting of the helical units. The method presented in this paper offers more flexibility in terms of using additional structural constraints and is more amenable to automated structural evaluation. Its main comparative advantage, over the coral package, is the potentially exhaustive automated search of a space of potential tertiary folds with realistically constrained secondary structure.

Figure 18

Comparison of the CORAL derived model of the SYCE1 core obtained in ref (24) (a) and an example CB derived model (b).

Computation Time

A single calculation comprising the CB algorithm, the generation of the hydration layer, and calculation of the scattering curve takes on the order of 0.05 s for lysozyme (129 residues) and 0.5 s for BSA (433 residues), both based on calculations performed on a single CPU, with the main cost coming from the Debye formula, eq . As far as the actual optimization goes, the timing can vary significantly; this depends on the number of secondary units which can be changed, the randomized initial condition, and the difficulty in satisfying additional restrictions such as the contact predictions (and how tightly they have been penalized). The ab initio lysozyme predictions generally varied between 10 min and 1 h. For the SYCE1 chain (318 residues) it was closer to 20 h (that said, as mentioned above the predictions produced in this case were reliably accurate). In future we will look to implement Bayesian learning techniques for the search, as a large number of models suggested by the Monte Carlo sampling overlap themselves and consistently trying such models wastes much time. This will be crucial to ensuring it can be run for larger molecules in the future.

Number of Initial Models

One might ask how many predictions are required in order to obtain a viable structure (ab initio). The examples here present a contrasting picture. The lysozyme cases consistently produced structure which fit the scattering data, but as discussed only a relatively small percentage (about 8%) were considered a sufficiently good fit to the scattering data (i.e., a sufficiently low RMSD with respect to the PDB structure and high value). By contrast, from 40 initial conditions for the SYCE1 molecule only 12.5% were able to fit the scattering data, but all were nearly identical () and excellent candidate structures. It is likely this is because lysozyme is a globular protein while SYCE1 has a very flat, linear structure. It is relatively easy to distort our model into a globular shape, but it allows for more structural variance, while the more linear structure is harder to form but much more constrained. The consistent evidence is that, currently, one might need at least 10 optimization runs in order to obtain a good quality prediction. A future aim will be to better enforce contact predictions or other constraints during the fitting procedure in order to bring down this ratio.

Conclusion

As a solution-based technique, BioSAXS can provide structural information for targets where crystallization and cryo-electron microscopic (cryo-EM) techniques are challenging. Also, the method allows data collection in a more natural environment than techniques such as crystallography and cryo-EM. Additionally, SAXS is not limited by protein size, as is the case for cryo-EM and NMR. Therefore, there is a clear need to develop the techniques for interpretation of this data in an ab initio setting which improve on the levels of structural detail provided by the bead models currently popular. In this paper we have shown that curve representation with hydration shell provides a molecular model for BioSAXS data with fits as good as or better than traditional bead and envelope models. Unlike these models our model includes a complete secondary and tertiary model description. Importantly, starting from random models that only take secondary structure information and sequence-dependent distance constraints into account, a physically meaningful 3D model can be obtained by fitting models against the experimental data. That this is possible is due to the fact that the model is described with far fewer parameters compared to even a coarse-grained model that required three coordinates for each amino acid, combined with use of geometric constraints for regular secondary structural elements. In order to show the potential of this ab initio technique, it was applied to a tethered core component of the human SYCE1 protein, for which no high resolution structural data is available. The model derived was based on sequence information alone and matches that of a model that was previously reported in ref (24), where the model was based on manual inspection of the sequences coupled with the fitting of ideal coiled coil segments to experimental scattering data. Importantly, while the previously modeled structure includes two coiled coil segments, the model derived here recognized that this was the minimum number of segments required to explain the curved structure and that the true structure could consist of multiple coiled coils interrupted by short linkers. Thus, our novel ab initio method has successfully generated a highly plausible model from experimental scattering data without the need for any more than minimal manual evaluation. This facility will be crucial for ab initio structural determination (from BioSAXS data) of larger molecules where it would not be practical to generate structures manually. Further experimental information such as distance information from any other source can easily been added in the form of additional restraints into the optimization algorithm. The model’s explicit description of realistic secondary structure means additional information, such as contact predictions, radius of gyration, hydrophobicity of the chain, and disulfide bonding, can be employed as model constraints in the future. This will further enhance the accuracy of all potential models, and in particular help the end user to distinguish mathematically correct but physically less likely models from correct solution. The secondary knot fingerprint statistic developed shows significant potential to evaluate structural similarities of models and hence to further automate this vital validation step. The two future next steps are (i) the application of this method to multimeric structures where each known monomer structure can initially be treated as rigid body and then refined in order to account for local changes in solution and (ii) the application to larger, de novo structures where the exact 3D structure remains elusive. The second goal will require further refinements of the search space method of the optimization algorithm. The application to homomultimers is straightforward and requires only minor addition to the existing code; we expect this to be the major initial application of our methods. Due to the limited information content of small angle X-ray scattering data, the ab initio fold determination will depend on the accuracy of secondary structure prediction combined with appropriately weighted distance constraints such as those discussed above.

41 in total

1. Frequencies of amino acid strings in globular protein sequences indicate suppression of blocks of consecutive hydrophobic residues.

Authors: R Schwartz; S Istrail; J King
Journal: Protein Sci Date: 2001-05 Impact factor: 6.725