The power of structural information for informing biological mechanisms is clear for stable folded macromolecules, but similar structure-function insight is more difficult to obtain for highly dynamic systems such as intrinsically disordered proteins (IDPs) which must be described as structural ensembles. Here, we present IDPConformerGenerator, a flexible, modular open-source software platform for generating large and diverse ensembles of disordered protein states that builds conformers that obey geometric, steric, and other physical restraints on the input sequence. IDPConformerGenerator samples backbone phi (φ), psi (ψ), and omega (ω) torsion angles of relevant sequence fragments from loops and secondary structure elements extracted from folded protein structures in the RCSB Protein Data Bank and builds side chains from robust Monte Carlo algorithms using expanded rotamer libraries. IDPConformerGenerator has many user-defined options enabling variable fractional sampling of secondary structures, supports Bayesian models for assessing the agreement of IDP ensembles for consistency with experimental data, and introduces a machine learning approach to transform between internal and Cartesian coordinates with reduced error. IDPConformerGenerator will facilitate the characterization of disordered proteins to ultimately provide structural insights into these states that have key biological functions.
The power of structural information for informing biological mechanisms is clear for stable folded macromolecules, but similar structure-function insight is more difficult to obtain for highly dynamic systems such as intrinsically disordered proteins (IDPs) which must be described as structural ensembles. Here, we present IDPConformerGenerator, a flexible, modular open-source software platform for generating large and diverse ensembles of disordered protein states that builds conformers that obey geometric, steric, and other physical restraints on the input sequence. IDPConformerGenerator samples backbone phi (φ), psi (ψ), and omega (ω) torsion angles of relevant sequence fragments from loops and secondary structure elements extracted from folded protein structures in the RCSB Protein Data Bank and builds side chains from robust Monte Carlo algorithms using expanded rotamer libraries. IDPConformerGenerator has many user-defined options enabling variable fractional sampling of secondary structures, supports Bayesian models for assessing the agreement of IDP ensembles for consistency with experimental data, and introduces a machine learning approach to transform between internal and Cartesian coordinates with reduced error. IDPConformerGenerator will facilitate the characterization of disordered proteins to ultimately provide structural insights into these states that have key biological functions.
Disordered states of proteins, including
unfolded states, intrinsically
disordered regions (IDRs) of otherwise folded domains, and full intrinsically
disordered proteins (IDPs), are increasingly recognized for the roles
that they play in folding kinetics,[1] aggregation
propensity,[2,3] critical biological functions,[4] and pathological disease states.[5] Structural insights are needed to better understand these
disordered protein states, and a variety of solution experiments have
been developed to enable structural and dynamic descriptions of disordered
proteins using nuclear magnetic resonance (NMR), small-angle X-ray
scattering (SAXS), single-molecule fluorescence (SMF), and other data
types.[6−11] However, solution experimental data for disordered states are averaged
over a very large number of heterogeneous interconverting conformations,
leading to greater challenges in structural interpretation than for
folded proteins. Thus, specific computational approaches are required
to bridge the gap between experiment and structural ensembles for
disordered protein states.The overall general approach begins
with a large set of potential
fractionally populated conformations and then selects a subset of
these and/or assigns weights to conformational subpopulations that
best agree with the available, but highly averaged, experimental restraints.
These two components have typically been considered to be separate
problems, and a number of methods exist for each. TraDES,[12,13] Flexible-meccano,[14] FastFloppyTail,[15] and other methods[16] are available to generate conformer pools, based primarily on the
statistical distributions found in folded protein structures from
the RCSB Protein Data Bank (PDB).[17,18] TraDES builds
trajectories of three Cα positions at a time based on the probabilities
in a set of nonredundant structures from the PDB and then fills in
the rest of the chain. Flexible-meccano builds chains by selecting
φ/ψ torsion angles based on amino acid-specific conformational
potentials derived from the PDB. FastFloppyTail also utilizes a three-residue
fragment-based approach, with a bias toward the loop regions of the
PDB. An approach from the group of Bernadó[16] similarly uses data from the PDB extracted as tripeptide
segments. Another approach builds on Flexible-meccano but uses tripeptide
conformers derived from molecular dynamics (MD) simulations.[19]Because the structural ensembles provided
by these methods are
agnostic to experiment, a separate step is needed to select a subset
of the conformations or, more generally, reweight the conformers in
the pool to define ensembles that best represent information about
the disordered state from the solution experiments. These include,
among others, ENSEMBLE,[20] ASTEROIDS,[20,21] X-EISD,[22,23] BME,[24] and BW[25]/VBWSS,[26] with a number
utilizing Bayesian statistics[22−24,27] and/or maximum entropy[24,28] to address inherent
uncertainties in experiments and back-calculations from a disordered
structural ensemble. MD simulations are an alternative to the statistical
sampling of torsion angle pools, providing conformers that are biased
by the physicochemical interactions included in the force fields[29−32] and represent Boltzmann weighted states. Furthermore, some MD approaches
to generating IDP ensembles integrate experimental observables directly
during the sampling phase. For example, some have used maximum entropy[33,34] and Bayesian statistics with averaged NMR biases[2,35] to
guide the molecular dynamics sampling of IDPs toward conformations
in better agreement with experimental data. Another uses fragments
from MD simulations with a reweighted hierarchical chain growth algorithm[36] incorporating experimental data to generate
disordered ensembles.Together these approaches have all been
valuable for both creating
and ultimately characterizing a variety of disordered proteins; however,
a number of challenges remain. In particular, for disordered proteins
with a preferential sampling of fractional secondary structure and
tertiary contacts, especially for longer sequences, the starting pool
sample becomes the more limiting factor for the successful identification
of subsets for reweighted ensembles that can fit experimental data.
While a number of these tools can generate conformers biased by known
secondary structure distributions, most of these tools are not flexible
as to how users can generate disordered conformer ensembles as well
as evaluate them with respect to experimental data.Here, we
report the open-source software platform IDPConformerGenerator
to generate disordered protein conformations, utilizing a wide range
of new and novel methods and models within a single software suite.
IDPConformerGenerator begins with backbone builds based on torsion
angle distributions of phi (φ), psi (ψ), and omega (ω)
found in the PDB and then enables the building of side-chain ensembles
using Monte Carlo side-chain entropy (MC-SCE) that completes the all-atom
description by including hydrogens. IDPConformerGenerator has significant
flexibility in user-defined options for the size of peptide fragments
used to build the backbone, amino acid substitutions, secondary structure
biases, steric clash criteria, and energy biasing using force fields.
Additional stand-alone and integrated algorithms within IDPConformerGenerator
extend the fundamental internal coordinate conformer ensemble builds
with state-of-the-art transformations to Cartesian coordinates using
Int2Cart,[37] which yields more correct valence
geometries and reduces steric clashes. Finally, the generated ensembles
can be evaluated with stand-alone and integrated software modules
such as the X-EISD Bayesian model for assessing agreement with many
different experimental data types including NMR, SAXS, and single-molecule
fluorescence resonance energy transfer (smFRET).[22,23] What makes IDPConformerGenerator distinct from other tools is its
flexibility as a user-friendly tool kit to explore different computational
strategies and protocols for rationally defining conformational ensembles
of (intrinsically) disordered protein sequences.We demonstrate
that IDPConformerGenerator can efficiently calculate
ensembles of proteins of up to at least 440 residues in length with
a variety of secondary structure distributions and tertiary contact
patterns. Many of these have reasonable root-mean-squared deviations
(RMSDs) from experimental solution data, particularly some generated
with bias for fractional secondary structure based on NMR chemical
shifts. These results support the utility of IDPConformerGenerator
for the creation of initial conformer pools that are more physically
representative and more readily optimized by using experimental restraints
with X-EISD[22,23] or ENSEMBLE.[20]
Methods
Design of IDPConformerGenerator
We set out to design
a tool to efficiently generate conformers that realistically sample
the likely conformational space of intrinsically disordered sequences
from the statistical sampling of backbone torsion angles (φ,
ψ, and ω) of short protein segments in the PDB that are
identical or similar in sequence to the protein under investigation.
This led to our choice to exploit the PDB for the sampling of torsion
angle space to generate more physically meaningful conformers, a choice
also utilized by TraDES,[12,13] Flexible-meccano,[14] FastFloppyTail,[15] and others. Given these physically sound backbone conformations,
we also provide side-chain building algorithms such as MC-SCE that
can generate ensembles of different rotamer states that are Boltzmann
weighted and further all-atom representations by including hydrogens.
The resulting sets of conformations are intended to be utilized as
inputs to downstream approaches to define ensembles that best agree
with experimental data, such as those that select subsets (e.g., ENSEMBLE[20] and ASTEROIDS[20,21]), reweight
conformers (e.g., BME[24]), or both (X-EISD[22,23]).
Building Conformational Ensembles
IDPConformerGenerator
starts by creating a protein sequence database annotated with φ,
ψ, and ω torsion angles and secondary structure per residue.
We use nonredundant lists of structures such as those generated by
the Dunbrack PISCES database.[38] Hence,
IDPConformerGenerator builds structures by extracting φ, ψ,
and ω backbone torsion angles from the database, fragment-by-fragment
(with fragments being peptides of variable length), using torsion
angles matching the input sequence for each fragment or matching a
user-defined residue tolerance (or substitution) dictionary. While
other tools utilize rigid fragment sizes, IDPConformerGenerator allows
users to configure the size and probability of the peptide fragments
used to build the IDP chain stepwise, modulating the sampling strategy
to explore. IDPConformerGenerator uses DSSP nomenclature[39,40] to annotate residues by secondary structure elements. Because of
this, users can define the secondary structure classes that IDPConformerGenerator
will sample, either across the sequence or in a residue-specific manner,
based on knowledge such as from NMR chemical shifts for fractional
populations of secondary structures as a function of residue.[41−43] We provide several methods to sample bond geometries when converting
from internal to Cartesian coordinates. The most important one is
the Int2Cart methodology[37] that predicts
bond lengths and angles for a set of torsion angles and residue identities.
Instead of using hard spheres to model atom volumes, IDPConformerGenerator
computes the whole Lennard-Jones (L-J) potential to tolerate small
clashes that can be compensated for by favorable interactions, with
user-defined thresholds to direct the acceptance of a fragment or
backtracking to rebuild. To generate full side chains, IDPConformerGenerator
has integrated the MC-SCE algorithm, originally developed for the
more difficult case of folded proteins but which works easily for
disordered states.[44]
Associated and Integrated Tools
IDPConformerGenerator
is designed as a platform to facilitate the generation of disordered
protein conformations, including analysis of resulting ensembles and
scoring or reweighting with respect to experimental data. Tools for
the analysis of structure are integrated within IDPConformerGenerator,
including those for secondary structure, torsion angle distributions,
radius of gyration (Rg), end-to-end distances,
asphericity (deviation from the spherical shape of the conformers),
and Cα–Cα distance and distance-difference matrices.
The software easily enables the use of downstream tools for scoring,
reweighting, or subsetting to fit experimental data and will serve
as a future platform for integrating these tools, including the simple
ENSEMBLE approach and the X-EISD Bayesian model.
IDPConformerGenerator Software Platform
To facilitate
its development and use, IDPConformerGenerator is open source and
extensively documented and the architecture is modular to allow easy
extension with other modules and strategies (https://github.com/julie-forman-kay-lab/IDPConformerGenerator). IDPConformerGenerator is written in Python, and all of its automated
functionalities are available as command line commands. In addition,
all of IDPConformerGenerator’s functionalities are available
through the Python interpreter and can be imported and used independently
by more advanced users. Also, all IDPConformerGenerator pipelines
are distributable across multiple CPU cores. IDPConformerGenerator’s
software design facilitates a flexible approach to building conformers
with numerous user parameters that enable very different realistic
ensembles to be built, with the design philosophy and options discussed
here. The overall workflow of IDPConformerGenerator is described in
detail next and is schematized in Figure .
Figure 1
Schematic diagram of the IDPConformerGenerator
approach. Generating
conformers requires the creation of a reusable database of backbone
torsion angles and input of the primary sequence, with optional user-defined
parameters including those for amino acid substitutions, secondary
structure sampling, and fragment size probabilities. An example of
a peptide of 2 residues (fragment size 2) that is used to build inhibitor-2
(I-2) is shown with backbone torsion angles labeled, a helical secondary
structure with all-atom side chains of an I-2 conformer, and an illustrative
set of 100 generated conformations of I-2. Conformers that are generated
can then be scored or reweighted based on experimental data.
Schematic diagram of the IDPConformerGenerator
approach. Generating
conformers requires the creation of a reusable database of backbone
torsion angles and input of the primary sequence, with optional user-defined
parameters including those for amino acid substitutions, secondary
structure sampling, and fragment size probabilities. An example of
a peptide of 2 residues (fragment size 2) that is used to build inhibitor-2
(I-2) is shown with backbone torsion angles labeled, a helical secondary
structure with all-atom side chains of an I-2 conformer, and an illustrative
set of 100 generated conformations of I-2. Conformers that are generated
can then be scored or reweighted based on experimental data.
Backbone Conformer Generation
IDPConformerGenerator
requires a torsion angle database from which to build backbone conformers.
Users can provide custom-made lists of the PDB chains to assemble
the torsion angle database, thus allowing tuning for resolution and
diversity. IDPConformerGenerator considers only continuous chains
and selects the atoms with the highest occupancy for those with alternative
locations. The IDPConformerGenerator parser works with both the older
PDB format and newer mmCIF files. For the results reported here, we
employed a fixed nonredundant database of PDB structures from the
Dunbrack PISCES database[38] (Oct 15, 2020),
including high-resolution structures with resolution better than or
equal to 2.0 Å and an R factor of ≤0.25
with a maximum mutual sequence identity of 90%.Conformer backbones
are built fragment-by-fragment, where fragments can be configured
for different lengths (described below). Physical validation of the
conformers (for example, a steric clash check) is parametrizable via
force fields and threshold parameters. If a clash is found, then IDPConformerGenerator
will delete the last fragment and attempt a new one. If no new fragment
can be added without a steric clash, then the building process will
backtrack to delete additional fragments. This process is repeated
until the whole backbone is complete. With the default parameters,
some sequences can be built very fast while others require extensive
sampling times, also dependent on length (see below).
Including the Peptide Bond ω Torsion Angle
One
of the major differences between IDPConformerGenerator and previous
backbone sampling tools is that IDPConformerGenerator includes peptide
ω torsion angles in the sampling and building regime. The ω
torsion angle is considered to be part of the torsion angle set for
each residue in the order ω/φ/ψ. The decision to
include ω is fundamental to our strategy to explore the IDP
landscape by the addition of multiple residue-sized fragments, and
since ω angles can vary to up to 20° in loop regions of
high-resolution structures (Figure ), including ω in the generator increases the
accuracy of the extracted fragment. Note that this variation is not
dependent on the resolution of the structure but that helices have
a narrower ω torsion angle distribution (Supporting Information Figure 1). Accurate ω torsion
angles are also critical for the future incorporation of folded domains
within otherwise disordered chains.
Figure 2
Histogram of ω dihedral angle distributions
for structures
found within the IDPConformerGenerator database. PDBs from the 24 003
PDB structure database were used, with sets of this with resolutions
better or equal to 1.5 Å (∼5000 structures), better or
equal to 1.8 Å (∼16 000 structures), and better
or equal to 2.0 Å (full set). The deviations from an ω
torsion angle of 180° (trans) are plotted, centered around 0°,
to facilitate the visualization of the distribution rather than the
actual ω angles. Cis peptide bonds are ignored for visualization
purposes.
Histogram of ω dihedral angle distributions
for structures
found within the IDPConformerGenerator database. PDBs from the 24 003
PDB structure database were used, with sets of this with resolutions
better or equal to 1.5 Å (∼5000 structures), better or
equal to 1.8 Å (∼16 000 structures), and better
or equal to 2.0 Å (full set). The deviations from an ω
torsion angle of 180° (trans) are plotted, centered around 0°,
to facilitate the visualization of the distribution rather than the
actual ω angles. Cis peptide bonds are ignored for visualization
purposes.
Sequence Specificity of Chosen Torsion Angles
IDPConformerGenerator
builds structures based on extracting backbone torsion angles from
the torsion angle database, using torsion angles either (i) only from
residues that exactly match the input sequence (when possible; see
below) or (ii) from residues that match user-defined residue substitutions,
i.e., isoleucine matching either isoleucine or leucine. The ability
to explore a narrow or broader sequence space of the PDB with user-defined
flexibility is an important feature of IDPConformerGenerator. Utilizing
the exact sequence to find torsion angles in the PDB-derived database
will guide IDPConformerGenerator to choose torsion angles that reflect
the structural biases of that sequence. This capitalizes on the PDB-derived
database as an “empirical force field” and is expected
to generate local and secondary structure biases based on the sequence
dependence of these structures. Enabling the substitution of residues
for similar residues, with a user-defined substitution list, recognizes
the availability limits of exact sequence matches in the PDB and extends
the potential torsion angles possible to be sampled.IDPConformerGenerator
has the ability to extract substitution lists based on a table derived
from the EDSSMat50 IDP-specific substitution matrix[45] (Supporting Information Table S1A), with the user specifying the columns to be used to define how
conservative or liberal to make the substitutions. Users can also
provide specific substitution lists through the command line (with
an example shown) ′-subs {“A”: “AG”}′,
with substitutions described in the form of a Python dictionary where
keys are the residues to replace and values are the list of residues
to include.
Peptide Fragment Sizes
IDPConformerGenerator enables
users to define the size and probability of the peptide fragments
used to build the IDP chain stepwise, modulating the sampling strategy
to explore. By default, IDPConformerGenerator samples fragment sizes
of one, two, three, four, and five residues with 10, 10, 30, 30, and
20% probabilities, respectively. Users can freely configure these
values in any given manner. Therefore, IDPConformerGenerator can emulate
previously published algorithms that build conformer chains by single
residue or tripeptide additions while allowing selection using countless
other strategies. Shorter fragments (such as one residue at a time)
disregard the sequence context of residue torsion angle frequencies,
while with larger fragment sizes, information about cooperative structure
(helical, turn, or extended strand-like elements) from the PDB is
included in the growing disordered protein chain. In practice, fragment
sizes of up to pentapeptides are most valuable because it is hard
to find sequence matches for longer segments, and cooperative structures
found in disordered proteins generally do not extend beyond five residues
(although there can be regions of a longer cooperative helix). If
a sequence match for the requested fragment size is not found, then
IDPConformerGenerator reduces the fragment size, one residue at a
time, until a sequence match is found in the database, taking into
account residue substitutions. Fragments are reduced until single
residue additions are sampled. Regardless of the size of the fragment,
if a proline residue in the input sequence follows the selected fragment,
then IDPConformerGenerator tries to expand the fragment sequence to
include the proline as well because prolines impose severe torsion
restraints on preceding residues, a strategy used by other disordered
chain-generating tools.[16,46,47]
Secondary Structure Sampling
IDPConformerGenerator
uses DSSP nomenclature[39,40] and the IDPConformerGenerator-created
torsion angle database can be configured to annotate residues by DSSP
or any other third-party software (via the IDPConformerGenerator ′sscalc′
interface) that classifies secondary structure elements. Because of
this, the secondary structure classes of the PDB database that are
sampled can be defined by users. A logical approach for generating
disordered ensembles is to sample the loop regions of folded structures,
which dominate most IDPs/IDRs. FastFloppyTail focuses on these regions,
and others have noted an increased ability to fit local structural
experimental data to ensembles derived from the coil or loop regions
of PDBs,[48] even though these regions can
be poorly defined.[49] Using loops will bias
to irregular structural elements found in the PDB that most likely
represent disordered states and will also include short helical and
extended elements. However, significantly populated helical elements
and other secondary structure elements are found in disordered proteins
states, so other DSSP codes are valuable to include. Therefore, IDPConformerGenerator
allows users to sample any residues regardless of secondary structure
annotation (“ANY” flag) or to specifically sample loops,
helices, extended structures, poly proline II helices, or an individual
or combination of any DSSP secondary structure code.The secondary
structure bias can be performed uniformly across the sequence or in
a residue-specific manner, using knowledge such as from NMR chemical
shifts for fractional populations of secondary structures as a function
of residue.[41−43] Depending on user choice, IDPConformerGenerator can
prioritize building with the secondary structures of interest over
the full sequence, leading to sampled fragments consisting only of
residues matching a single secondary structure annotation, or can
completely disregard secondary structures while building to allow
the inherent secondary structure propensities observed in the PDB
database to emerge. In this way, for example, ABC could be a fragment
in which A and B residues are annotated as a loop and C is annotated
as a helix (with C being the first residue of a helix following a
loop). Moreover, custom secondary structure sampling based on experimental
knowledge of the fractional populations of secondary structures can
be used, which can in turn override the database bias to build conformers
to match known structural probabilities. For the bias of secondary
structure on a per-residue basis, building utilizes a custom secondary
structure sampling database file containing fractional propensities
for different secondary structures as a function of residue derived
from NMR chemical shift data (using csssconv with δ2D[42] or CheSPI[43]). If
the chemical shift data are not available but other knowledge of sampling
of helical or extended/β-strand regions exists (or if users
want to explicitly define these), then users can specify where significant
sampling of helical or extended/β-strand regions occurs and
sample the rest of the conformer without bias. Combining the rich
ability to sample torsion angles from specific secondary structure
annotations with the residue-type substitutions that increase the
matching tolerance and specification of fragment sizes, users can
sample highly restrictive or very broad conformational spaces.
Side-Chain Building
During the backbone building process
immediately after each backbone fragment is created, alanine side
chains are added to all residues, except for glycines and prolines
for which the full residue is added. These dummy alanines serve as
coarse-grained representations of the real side chain. They avoid
building backbone conformations that are too compact to fit side chains
without steric clashes, yet the volume of the alanine side chain is
small enough to allow the backbone to sample packed conformations,
enabling the exploration of side-chain packing.However, full
side chains must be added, and IDPConformerGenerator adds side chains
using the MC-SCE algorithm by sending backbone atom coordinates only
and excluding any alanine and proline side chains. The MC-SCE algorithm
is a Monte Carlo approach to building side-chain conformations on
a predefined backbone structure[44] that
utilizes a convergent Rosenbluth sampling scheme and an augmented
Dunbrack library for side-chain rotamer sampling.[44] The MC-SCE algorithm was originally written in Fortran
but was fully rebuilt in Python to interface with IDPConformerGenerator
to build side-chain structures. Given a backbone structure, MC-SCE
builds the side chains by aligning the backbone N, Cα and C′ atoms of the Dunbrack templates with the backbone
from IDPConformerGenerator. The side chains are then rotated according
to the sampled torsion angles, and this sampling procedure is the
key to the Monte Carlo nature of the algorithm.MC-SCE can be
used both as a stand-alone option (https://github.com/THGLab/MCSCE) and as two modes for working within IDPConformerGenerator. The
simple mode provides an option for rapidly adding side chains to a
backbone structure without introducing clashes, but the conformations
might be energetically suboptimal. Conversely, the exhaustive mode
generates side-chain conformations via the user-defined total number
of trials for the parallel execution of the building process, with
the all-atom structure having the lowest energy of these returned
to IDPConformerGenerator, but takes longer to run. (See the Supporting Information for more details.)The FASPR[44] algorithm used to build
side-chain structures is also an integrated option. This stand-alone
software for side-chain packing performs quickly for folded proteins.
Note that FASPR does not include hydrogens, leading to a need to identify
an optimal approach to build them afterward. We have opted for MC-SCE
as our preferred approach because it generates a complete all-atom
description of conformers, including hydrogens, which is an important
advantage over FASPR.
Internal to Cartesian Coordinate Transformations
The
design of IDPConformerGenerator as a builder based on torsion angle
sampling, rather than based on Cartesian coordinate space, has benefits
and drawbacks. One clear benefit is that building with secondary structure
biases, such as from NMR chemical shifts and backbone 3J-coupling data, is “native” to the builder. Building
with tertiary contact biases, such as from NMR 1H–1H NOE or PRE data, is not. Importantly, energy calculations
are made on Cartesian coordinates. In order to facilitate energy calculations,
the conformers based on internal coordinates must be back-transformed
to Cartesian coordinates.The original approach used for most
of the work reported here uses statistical sampling of bond angles
for the set of matched fragments and average values for bond lengths
based on the identity of the previous residue and the one following
the residue being built. The currently recommended approach that improves
upon the aforementioned strategy uses Int2Cart developed by Li and
co-workers,[37] which is a deep learning
model that better predicts the correlations among the whole sequence
and the bond lengths and bond angles for a given set of ω, φ,
and ψ torsion angles to yield more accurate Cartesian coordinates.
This very recent implementation can be used as a stand-alone option
(https://github.com/THGLab/int2cart) and is also integrated within IDPConformerGenerator directly.
Energy Considerations
Instead of using hard spheres
to model atom volumes, IDPConformerGenerator computes the whole Lennard-Jones
(L-J) potential to create conformers that are self-avoiding polymers.
The default L-J parameters are the Amber14SB force field, which were
used for the results generated here but are also user-definable. Computing
the whole L-J potential allows the building engine to tolerate small
clashes that can be compensated by favorable interactions and compensates
for the fact that rigid conformers are created; i.e., no flexible
minimization is performed at any stage. Severe clashes will, nonetheless,
have a profound impact on the energy term, and thus the energy threshold
for rejection can be defined by users, with higher values allowing
the exploration of broader conformational space. This feature is useful
when modeling sequences with reduced representations in the database.IDPConformerGenerator can build backbone-only or full side-chain-containing
conformers. For this reason, two energy threshold parameters are implemented:
one to control the tolerance for backbone atoms (“-etbb”)
and another to control the energy threshold in all-atom conformers
with side chains (“-etss”). The energy thresholds to
accept or reject a conformer can be calculated pairwise (atom-by-atom)
or over the full structure based on user choice. For each fragment
built, the energy is computed; if it is below the threshold, then
the fragment is accepted, otherwise it is rejected. If side chains
are being built, then once the backbone is complete, IDPConformerGenerator
attempts to place the side chains. If successful (the energy term
is below the threshold), then the conformer is saved. Otherwise, the
backbone conformation is considered to be too restrictive to build
side chains, the whole conformer is discarded, and the creation of
a new conformer starts. Since the energy threshold for acceptance
after side chain addition is distinct from the threshold controlling
backbone building, a user can accept all side-chain packing results
by providing a large number for the side-chain energy threshold.
X-EISD Bayesian Model
IDPConformerGenerator is designed
as a platform and supports the direct incorporation of the calculated
ensemble into downstream tools for scoring and reweighting based on
experimental data. The internal integration of these tools is envisioned
in the near future. Of particular interest is X-EISD,[22,23] a method which calculates the maximum log likelihood of a protein
structural ensemble by accounting for the uncertainties in a wide
range of experimental data and back-calculation models from structures.
These data include NMR chemical shifts, J couplings,
residual dipolar couplings (RDCs), hydrodynamic radii, nuclear Overhauser
effects (NOEs), and paramagnetic resonance enhancements (PREs); smFRET;
and SAXS curves.[22,23] We have also introduced new data
types, NMR R2 relaxation rates and S2 order parameters, for the selection of an
IDP ensemble consistent with NMR dynamics data.[50] Given the ensembles created with IDPConformerGenerator,
the X-EISD model can be used as a scoring function to reweight the
IDP ensembles for the best agreement with experimental data based
on the different experimental and back-calculation uncertainties.
Analysis Tools
There are a number of commands currently
integrated within IDPConformerGenerator that can enable the analysis
of resulting ensembles. These include the analysis of the resulting
fractional secondary structure. In addition, a set of complementary
analysis tools were utilized to ask specific research questions regarding
the ensembles (see below; available as stand-alone scripts). These
include the RMSDs from experimental data restraints and ENSEMBLE and
X-EISD scores; pairwise RMSDs of atomic coordinates; measures of local
structure, including secondary structure and φ/ψ/ω
distributions; measures of hydrodynamic properties, including Rg, end-to-end distances, and asphericity (deviation
from spherical shape of the conformers); and measures of tertiary
contacts, including the Cα–Cα distance and distance
difference matrices.
Additional User-Defined Parameters
Parameters are available
to control reproducibility, including random seeds. IDPConformerGenerator
runs are deterministic; i.e., the same results can be achieved by
providing the same initial database, the same input parameters, and
the same random seeds on the same machine. Users can also specify
the number of cores of a multiprocessor computer to use.
Results
In order to demonstrate the utility of IDPConformerGenerator
and
ask questions regarding the optimal parameters for building diverse
and physically meaningful ensembles, we have used a set of intrinsically
disordered proteins (IDPs) of various sizes: Sic1,[51] α-synuclein,[52] inhibitor-2,[53] and Tau,[54] as well
as the unfolded state of the N-terminal SH3 domain of the Drosophila signaling protein Drk (drkN SH3).[55]Sic1 is a yeast cell-cycle regulator that inhibits a
cyclin-dependent kinase and is degraded following ubiquitination due
to binding the ubiquitin ligase substrate-binding domain (Cdc4 WD40
domain) in a dynamic complex dependent on multisite phosphorylation.[51] The N-terminal 92 amino acids (aa) of Sic1 are
necessary and sufficient for binding and have been extensively characterized
by NMR, SAXS, and smFRET, and this fragment is therefore used here.[6,56−58]Human α-synuclein
(aSyn, 140 aa) is highly abundant
in the brain, where it is found largely in the axon terminals of presynaptic
neurons to regulate synaptic vesicle trafficking and subsequent neurotransmitter
release.[52] In the presence of membrane
vesicles (or other lipid environments), α-synuclein forms a
helical structure, but in the absence of lipid, it is highly disordered
with a minimal propensity for helical or other secondary structure.
It has been studied in both states, but for testing purposes, we utilize
NMR and SAXS data from the disordered state.[15,21,59−64]Inhibitor-2 (I-2, 159 aa) is an inhibitor
of protein
phosphatase 1 (PP1), forming a dynamic complex with PP1 that orders
only a limited portion of I-2 upon binding, based on crystallographic
data.[53] In the absence of PP1, I-2 is disordered
yet has a significant population of helices, based on characterization
by NMR.[65,66]Tau (microtubule-associated
protein tau) is a 758-residue
IDP with numerous functional annotations, including the promotion
of microtubule assembly and stability and roles in establishing and
maintaining the polarity of axons in neurons.[54] It is an RNA-binding protein that phase separates in vitro and is
found in cellular biomolecular condensates, consistent with its lower
complexity sequence.[67] We use the first
441 residues as a test system since a fragment encompassing these
residues has been studied using NMR.[68,69] Short Tau
peptides have also been studied,[15] and
we similarly utilize a Tau peptide as a test system.Finally, the drkN SH3 domain exists in dynamic equilibrium
between folded and unfolded states, with the unfolded state extensively
studied as a model disordered protein for the development of ensemble
calculation methods due to the large number of experimental NMR, SAXS,
and smFRET restraints available and its small size (59 aa).[22,50,66,70]Sequences for these proteins and fragments are given
in Supporting Information Table S1B. Note
that there
are some peptide sequences of aSyn and Tau in the PDB database we
use (Supporting Information Table S1C),
many in complex with antibodies. It may be valuable to include structures
of the protein of interest or homologous proteins, such as complexes
of folded proteins with the disordered protein of interest (or its
fragments), to provide conformations that are likely to be sampled
at some level. Alternatively, users may choose to exclude such structures
to avoid potential bias. Either approach is possible because IDPConformerGenerator
allows users to assemble custom-made databases of torsion angles from
user-defined input PDB lists. The number of sequence matches for different
fragment sizes of the drkN SH3 domain in our database for different
secondary structure sampling, including exact matches or with substitutions,
is given in Supporting Information Table S1D to provide concrete examples of how torsion angles are chosen in
IDPConformerGenerator. The ′stats′ subclient calculates
the sequence matches in the database for an input sequence and considering
the input parameters of the building process. In this way, users can
easily assess the number of angles available for each chunk and identify
possible bottlenecks where residue tolerance might be needed.Here we characterize multiple aspects of IDPConformerGenerator:
computational speed for generating ensembles, the diversity of conformer
sampling, the presence or absence of secondary structure (especially
helical fraction), how well these unoptimized disordered ensembles
recapitulate experimental data, and comparisons to other structural
ensemble generators including TraDES, FastFloppyTail, and MD simulations.
Computational Timings
The speed of conformational ensemble
generation is important, particularly for larger proteins, as large
and conformationally diverse input pools are valuable for further
reweighting or subsetting. We compared speeds for generating disordered
conformers using IDPConformerGenerator with different fragment sizes
and different secondary structure options for all of the test systems.
For these, we generated backbones for each protein with a 100 kJ backbone
energy threshold and used MC-SCE for side chains. The goal was to
yield 1000 successful full conformers for each protein such that timings
were normalized on a per successful conformer basis. Exact timings
and the percentage of successful conformers from the generated backbones
can be found in Supporting Information Table S2A.Figure A
demonstrates the general trend of faster conformer generation for
shorter chains, as expected, with a nonlinear dependence. Building
with only helices or extended strands is usually faster than building
with loops or mixtures of loops with helical or extended structures,
such as with CSSS or ANY, as loops increase the likelihood of steric
clashes and difficulties in side-chain packing, although helices and
strands are not as representative of disordered states (Supporting Information Table S2A). As shown in Figure B, increasing the
fragment size significantly increases the speed of conformer generation
for the proteins investigated, and varying the secondary structure
sampling method alters the speed for different fragment sizes. In
most cases, using substitutions was also found to be faster, likely
due to more fragment matches in the database. Overall, conformer generation
is reasonably efficient but strongly dependent on chain length, with
speeds of 400–500 conformers per hour per computer node for
the drkN SH3 domain unfolded state (59 aa), 200–275 for Sic1
(92 aa), 50–100 for aSyn (140 aa), 40–75 for I-2 (159
aa), and about 5 for Tau (441 aa) using one node on the Graham supercomputing
resource. For this and all other calculations unless otherwise specified,
we used one node of the Graham resource of Compute Canada (now Digital
Research Alliance of Canada) with 2x Intel E5-2683 v4 Broadwell @
2.1 GHz CPUs and with 125 GB of RAM per node.
Figure 3
Timings for IDPConformerGenerator
with MC-SCE and dependencies
on the protein test system (size), secondary structure sampling, and
fragment length. Speed is defined as the number of conformers per
hour. (A) Speeds for variable secondary structure sampling methods
on disordered proteins with different lengths, shown for sampling
with custom secondary structure propensities (CSSS, yellow), “ANY”
secondary structure (gray), combinations of loops, helices, and strands
(orange), and only loops (blue). Speeds are shown for selected ensembles
of the drkN SH3 domain unfolded state (59 amino acids, aa), the Sic1
N-terminal targeting region (92 aa), α-synuclein (aSyn, 140
aa), and inhibitor 2 (I-2, 159 aa). Speeds for all conformer ensembles
generated including for the Tau fragment (441 aa) are in Supporting Information Table S2A. (B) Speeds
for variable fragment sizes and secondary structure sampling methods
for Sic1 are shown for sampling with only loops with substitutions
(gray), only loops without substitutions (orange), and “ANY”
secondary structure (blue). Default fragment length probabilities
are 0.1, 0.1, 0.3, 0.3, and 0.2 for fragment lengths of 1, 2, 3, 4,
and 5, respectively.
Timings for IDPConformerGenerator
with MC-SCE and dependencies
on the protein test system (size), secondary structure sampling, and
fragment length. Speed is defined as the number of conformers per
hour. (A) Speeds for variable secondary structure sampling methods
on disordered proteins with different lengths, shown for sampling
with custom secondary structure propensities (CSSS, yellow), “ANY”
secondary structure (gray), combinations of loops, helices, and strands
(orange), and only loops (blue). Speeds are shown for selected ensembles
of the drkN SH3 domain unfolded state (59 amino acids, aa), the Sic1
N-terminal targeting region (92 aa), α-synuclein (aSyn, 140
aa), and inhibitor 2 (I-2, 159 aa). Speeds for all conformer ensembles
generated including for the Tau fragment (441 aa) are in Supporting Information Table S2A. (B) Speeds
for variable fragment sizes and secondary structure sampling methods
for Sic1 are shown for sampling with only loops with substitutions
(gray), only loops without substitutions (orange), and “ANY”
secondary structure (blue). Default fragment length probabilities
are 0.1, 0.1, 0.3, 0.3, and 0.2 for fragment lengths of 1, 2, 3, 4,
and 5, respectively.We also tested the intersection of the impact of
sequence length
and the diversity of amino acid residues in the sequence on the speed
of conformer generation. Low-complexity IDRs using fewer amino acids
are increasingly understood to play a functional role in facilitating
phase separation within biomolecular condensates or membraneless organelles.[71,72] Of our test proteins, the Tau fragment is the longest (441 aa) and
is known to phase separate.[67] It is also
lower in complexity than the other test proteins, with the first 300
residues annotated as having compositional bias by CAST and being
low complexity by SEG, respectively.[73,74] Such low-complexity
sequences are not found in the folded proteins in our database, and
we explored if they would take longer to build. We quantified the
speed of conformer generation in minutes per amino acid on the multiprocessor
server. Tau was segmented into 3 segments of 147 aa to compare with
I-2 (159 aa) and aSyn (140 aa) and 5 segments of ∼90 residues
in order to compare with Sic1 (92 aa). We found that the central 147-residue
segment of the Tau fragment was the fastest to build but that there
were no clear trends on the basis of complexity when comparing Tau
to aSyn or I-2 (Supporting Information Figure
S2).The side-chain addition step is much longer than
backbone generation,
with our preferred side-chain packing algorithm MC-SCE taking a larger
fraction of the time as the chain length increases. MC-SCE was initially
optimized for packing side chains onto the backbone of folded proteins.
Although the success rate decreases with longer backbone lengths,
we found that the settings in MC-SCE could be optimized for IDPs by
reducing the number of attempts/trials spent on packing side chains
onto backbones from 128 to 32. For Tau, using 32 trials increased
the speed per conformer by 3 to 4.4 times depending on secondary structure
sampling (Supporting Information Table S2B). Another observation based on these benchmarks is that the success
rate increases with an increased number of backbone conformers available
as input to MC-SCE.For methodological purposes, we also asked
about the optimal energy
flags for the speed of calculation of conformers that do not have
significant steric clashes in order to facilitate the rapid building
of structural ensembles. We built sets of 1000 backbone conformers
of I-2 with loops or other secondary structure sampling with either
100 or 250 kJ pairwise energy thresholds and used MC-SCE for side
chains, with average times of 68 and 38 min, respectively. The ∼80%
increase in time for the 100 kJ threshold led to only an ∼10%
gain in clash-free conformers. We observed similar results for aSyn,
with average times of 45 and 25 min, respectively, and an ∼80%
increase in time for the 100 kJ threshold and only an ∼15%
gain in clash-free conformers. Thus, increasing the energy threshold
can speed up the full conformer generation time for proteins at least
as long as I-2 (Supporting Information Table S2C).
Sampling Depth
Next, we interrogated the depth of the
torsion angle space in ensembles built from torsions derived from
the PDB data set. When building proteins with a specific sequence,
particularly for fragment sizes of 3, 4, and 5, the finite size of
the PDB-derived database leads to minimal torsion angle options as
only sequence matches of the defined fragment size can be used to
build. This leads to what we call torsion angle bottlenecks for specific
residues. Figure shows
the φ distributions for Sic1 generated using loops with various
fragment sizes, demonstrating decreasing numbers of distinct torsion
angles as the fragment size increases. For fragment sizes of 5 between
residues 20–30, essentially 1 set of backbone torsion angles
was used over all of these 1000 structures. Supporting
Information Figure S3 shows histograms of how many segments
of the drkN SH3 domain sequence for various fragment sizes are present
in the database, demonstrating the minimal data for fragment sizes
of 6 and 7, with values also provided in Supporting
Information Table 1D. In order to avoid such torsion angle
bottlenecks, mixtures of fragment sizes are optimal when requesting
exact sequence matches. The default of probabilistic sampling of fragment
sizes of 1, 2, 3, 4, and 5 in a ratio of 1:2:3:3:1 enables contributions
from larger fragment sizes with cooperative structural elements while
minimizing torsion angle bottlenecks, as seen in the bottom right
rows of Figure . Using
substitutions can help avoid bottlenecks, with Figure panels reporting on ensembles with substitutions
showing greater torsion angle coverage than for those without. Increasing
the number of DSSP codes utilized can also be beneficial (Supporting Information Figure S4), as using only
helices or only strands yields limited torsion angle sampling (and
is not realistic for disordered chains). Being agnostic to secondary
structure annotation is another approach, as seen for the difference
between using loops only or all possible annotations for the drkN
SH3 domain sequence (Supporting Information Figure
S5).
Figure 4
Phi (φ) torsion angles in Sic1 ensembles sampled using different
fragment sizes, with and without substitutions. Calculations were
for 1000-conformer ensembles generated by sampling loops only, with
fragment sizes of 1, 2, 3, 4, and 5 and default fragment size probabilities.
The left and right columns are for the Sic1 sequence without and with
substitutions, respectively. Substitutions are derived from columns
5, 3, and 2 of the EDSSMat50 amino acid substitution matrix. The plot
was generated with the ′--plots′ flag in the ′idpconfgen
torsions′ CLI.
Phi (φ) torsion angles in Sic1 ensembles sampled using different
fragment sizes, with and without substitutions. Calculations were
for 1000-conformer ensembles generated by sampling loops only, with
fragment sizes of 1, 2, 3, 4, and 5 and default fragment size probabilities.
The left and right columns are for the Sic1 sequence without and with
substitutions, respectively. Substitutions are derived from columns
5, 3, and 2 of the EDSSMat50 amino acid substitution matrix. The plot
was generated with the ′--plots′ flag in the ′idpconfgen
torsions′ CLI.We then looked for the optimal parameters (fragment
size, secondary
structure flags) for increasing diversity of calculated structures,
as measured by average pairwise RMSDs, hydrodynamic parameters, and
asphericity. Rg, the end-to-end distance
(Ree), and asphericity are all measures
of the shape of a conformation, with smaller Rg and Ree values and asphericity
approaching 0 implying more spherical, compact chains while large
values reflect irregular, less compact shapes. As expected, we note
that it is critical to incorporate loop regions to build diverse structural
ensembles of disordered protein states; with only helical or extended
DSSP flags, long helices or strands are built, which are not representative
of disordered conformations (Figure , Supporting Information Figure
S6, and Supporting Information Table S3). This is seen by the
much higher Ree and asphericity values,
such as for those built with helices only having asphericity values
of >0.8 and with strands only having Ree values 1.5 to 2 times as large as for those built with loops. To
further increase the diversity of calculated structures, the “ANY”
secondary structure flag is optimal as it will use the natural secondary
structure propensities of the entire PDB database and will not be
limited to user-defined secondary structures that restrict the sampling
of conformational space (see below).
Figure 5
Diversity analysis of conformational ensembles
of the drkN SH3
domain unfolded state and I-2. The radius of gyration (Rg), end-to-end distance (Ree), asphericity (A), and pairwise root-mean-squared
deviations of atomic positions (pwRMSDs) are shown as a function of
secondary structure sampling parameters for 1000-conformer ensembles
generated with different secondary structure sampling, including loops
(L+), loops and helices (L+H+), loops, helices, and extended strands
(L+H+E+), and all torsion angles agnostic to secondary structure (ANY)
and biased by δ2D chemical shifts (CSSS) or with FastFloppyTail
(FFT) for the drkN SH3 unfolded state (row 1) and I-2 (row 2). Standard
deviations for Rg, Ree, A, and pwRMSD are also shown as bars. Supporting Information Figure S6 shows similar
data for other protein systems. * is for the standard FFT protocol,
which for this case treats the protein as a mixture of ordered and
disordered, while the other is for a modified protocol in which the
protein is considered to be fully disordered.
Diversity analysis of conformational ensembles
of the drkN SH3
domain unfolded state and I-2. The radius of gyration (Rg), end-to-end distance (Ree), asphericity (A), and pairwise root-mean-squared
deviations of atomic positions (pwRMSDs) are shown as a function of
secondary structure sampling parameters for 1000-conformer ensembles
generated with different secondary structure sampling, including loops
(L+), loops and helices (L+H+), loops, helices, and extended strands
(L+H+E+), and all torsion angles agnostic to secondary structure (ANY)
and biased by δ2D chemical shifts (CSSS) or with FastFloppyTail
(FFT) for the drkN SH3 unfolded state (row 1) and I-2 (row 2). Standard
deviations for Rg, Ree, A, and pwRMSD are also shown as bars. Supporting Information Figure S6 shows similar
data for other protein systems. * is for the standard FFT protocol,
which for this case treats the protein as a mixture of ordered and
disordered, while the other is for a modified protocol in which the
protein is considered to be fully disordered.Plotting pairwise RMSDs as a distribution (Figure ) demonstrates that
the ensembles are smoothly
sampled, with no significant clusters of similar structures, consistent
with our goal of generating diverse conformers. Varying secondary
structure sampling approaches can also increase the variety of conformational
space explored, as the custom secondary structure sampling shifts
the RMSD histogram to larger values. As seen in plots of pairwise
RMSD distance matrices (Supporting Information
Figure S7), no regions of lower pairwise RMSDs are seen, indicating
that the generated conformers have a large variability in Cα
backbones. Pairwise RMSD is correlated to protein length, with RMSD
values ranging from above 5 Å to 30 Å for the shorter disordered
drkN SH3 domain unfolded state (59 aa) and from 15 Å to above
50 Å for I-2 (159 aa), indicating significant heterogeneity in
conformational sampling.
Figure 6
Pairwise RMSD distributions for ensembles of
the (A) drkN SH3 domain
unfolded state and (B) I-2. Calculations were for different ensembles
of 1000 conformers each, plotted with bin sizes of 5 Å. “ANY”
indicates sampling the database without biasing secondary structures,
“nosub” indicates no substitutions, “sub532”
indicates amino acid substitutions from columns 5, 3, and 2 of the
EDSSMat50 amino acid substitution matrix, and “CheSPI”
or “δ2D” indicates custom secondary structure
sampling (CSSS) pools biased by CheSPI or δ2D estimations of
secondary structure propensities.
Pairwise RMSD distributions for ensembles of
the (A) drkN SH3 domain
unfolded state and (B) I-2. Calculations were for different ensembles
of 1000 conformers each, plotted with bin sizes of 5 Å. “ANY”
indicates sampling the database without biasing secondary structures,
“nosub” indicates no substitutions, “sub532”
indicates amino acid substitutions from columns 5, 3, and 2 of the
EDSSMat50 amino acid substitution matrix, and “CheSPI”
or “δ2D” indicates custom secondary structure
sampling (CSSS) pools biased by CheSPI or δ2D estimations of
secondary structure propensities.We were also interested to see if IDPConformerGenerator
is able
to effectively capture local structural changes with amino acid sequence
changes. We used a 16-mer peptide from the Tau K18 fragment previously
studied by Stelzl and co-workers; reweighted hierarchical chain growth
was used to generate Tau ensembles recapitulating structural details
that were identified by NMR to have microtuble binding capacity and
that are lost upon mutation of position P301.[36] To investigate the conformational diversity explored by IDPConformerGenerator
and the variation in conformations for single-site mutations, we generated
sets of 10 000 conformers for wild type (WT), P301L, P301S,
and P301T for the Tau fragment sequence: DNIKHVP301GGGSVQIVY. We sampled by considering only sequence
matching, disregarding secondary structure annotations, and allowed
no residue substitutions for sequence matching.One of the structural
parameters explored in the Stelzl study is
the distance between V300 O (backbone carbonyl oxygen) and G303 N
(backbone amide nitrogen). Figure A shows distributions for this O–N distance
for the different variants. In agreement with the Stelzl study, we
observe a considerable fraction of conformers for the WT with distance
below 4 Å, reflecting a turn structure and likely hydrogen bond,
while for mutants these occurrences are much rarer. Each mutant reveals
different patterns of O–N distances, showing that IDPConformerGenerator
can capture local conformational diversity from single-point mutations
and that these will be incorporated into the larger disordered chain. Figure B shows the torsion
angles for residue 301 in the variants. Here, we also observe very
different profiles. Note the presence of conformers with a cis-prolyl peptide bond for P301 reflecting the natural
tendency of cis-Pro in the context of this sequence
but absent in the mutants lacking proline. These results demonstrate
that IDPConformerGenerator can effectively sample realistic local
conformations in a sequence-specific manner, consistent with its design.
Another approach to sampling particular turn types or other structures
that is available with IDPConformerGenerator is to utilize an amino
acid substitution dictionary to incorporate residues with known propensities
for these structures.
Figure 7
Local structural variations between the Tau K18 16-mer
WT and different
mutants. (A) Distribution of the distance between V300 O and G303
N atoms in 10 000-conformer ensembles generated with no substitutions.
(B) Torsion angle distributions for position 301 of the different
conformers in these WT and mutant ensembles, with ω representing
the torsion angle N-terminal to the φ, as is our convention
(typically denoted as ω of the preceding residue).
Local structural variations between the Tau K18 16-mer
WT and different
mutants. (A) Distribution of the distance between V300 O and G303
N atoms in 10 000-conformer ensembles generated with no substitutions.
(B) Torsion angle distributions for position 301 of the different
conformers in these WT and mutant ensembles, with ω representing
the torsion angle N-terminal to the φ, as is our convention
(typically denoted as ω of the preceding residue).
Fractional Secondary Structure
An obvious question
regards the impact of the secondary structure flags on the ultimate
fractional secondary structures in the ensembles built. We generated
1200-conformer ensembles of Sic1 using different combinations of secondary
structure sampling with loops (activated by default), helices, and
strands. IDPConformerGenerator can pool together DSSP codes T (hydrogen-bonded
turn), S (bend), B (β-bridge), P (PPII helix), I (pi-helix)
and ’ ’ (blank, loops/irregular) as loop (L), H (α-helix)
and G (3-10 helix) as helix (H), and only E (extended strand, participating
in β-ladder) as strand (E). For all of our calculations, we
utilized this pooled set of DSSP codes. We generated Sic1 ensembles
due to its lack of inherent significant biases in secondary structure
propensity[56,57] (Figure ). When restricting to loops only or loops
and strands, similar sampling is observed with greater sampling of
the β-region of the Ramachandran diagram, but since there are
no hydrogen-bonded interactions, DSSP catalogs these as loops. There
is also significant sampling of the α-region and some small
amounts of cooperative helix are observed. With strands only, sampling
of the β-region of the Ramachandran diagram is dominant, with
no strands defined by DSSP, again due to the lack of hydrogen bonds.
Restricting sampling to helices, however, leads to dominant sampling
of the α-region of the Ramachandran plot, and as expected, these
structures show significant α-helix as defined in the DSSP.
With loops and helices, there are also significant cooperative helices
generated. Similar results were seen for the drkN SH3 unfolded state
(Supporting Information Figure S8).
Figure 8
Fractional
secondary structure in Sic1 ensembles. Analyses were
performed on 1200-conformer pools of Sic1 generated with different
combinations of secondary structure sampling consisting of loops,
helix, and extended strands. Orange indicates α-helix detected
by DSSP (solid) and the α-region on the Ramachandran (Rama.)
diagram (dashed). Blue indicates extended strand for DSSP (solid)
and the β-region on the Ramachandran diagram (dashed). Black
indicates coil/loop for DSSP (solid) and other regions on the Ramachandran
diagram (dashed).
Fractional
secondary structure in Sic1 ensembles. Analyses were
performed on 1200-conformer pools of Sic1 generated with different
combinations of secondary structure sampling consisting of loops,
helix, and extended strands. Orange indicates α-helix detected
by DSSP (solid) and the α-region on the Ramachandran (Rama.)
diagram (dashed). Blue indicates extended strand for DSSP (solid)
and the β-region on the Ramachandran diagram (dashed). Black
indicates coil/loop for DSSP (solid) and other regions on the Ramachandran
diagram (dashed).Sampling with all three secondary structure options
in combination
(loop, helix, strand) is not the same as sampling with the ANY flag
(′--dany′), as the latter approach samples based solely
on the sequence matching patterns, disregarding secondary structure
annotations, thus reflecting the inherent structural propensities
of the input sequence fragments as present in the database. The explicit
listing of secondary structure codes limits to sampling fragments
with the same secondary structure code for all residues while with
the ANY flag this is not a requirement, so we wondered if there would
be significant differences in emerging structural patterns. An analysis
of ensembles of the drkN SH3 domain unfolded state demonstrates no
significant visual differences in torsion angle distributions, but
the ANY pools do have greater psi ranges between residues 22 and 29
and there are greater helical propensities for the LHE pool compared
to the ANY pool (Figure ). Although the general trends of secondary structure propensities
seem similar between the LHE and ANY pools, small differences demonstrate
that these options generate different conformer pools. We recommend
the ANY flag to build IDP conformer ensembles with the sampling of
torsion angles based on frequencies observed in the PDB. To maximize
the sampling of torsion angle space, we recommend sampling with both
ANY and LHE to minimize torsion angle bottlenecks.
Figure 9
Comparison of torsion
angle sampling for L+/H+/E+ and ANY. Ensembles
of 1000 conformers each of the drkN SH3 domain unfolded state were
generated with sampling a combination of loops (L), helix (H), and
extended strands (E) or sampling without biasing secondary structure
with the ANY flag. Phi and psi (φ and ψ) torsion angle
distributions for each conformer pool are shown as a scatter plot
in the first two rows. The third row depicts fractional secondary
structure based on DSSP (dark solid lines) or the Ramachandran (Rama.)
diagram (dashed lines), with orange indicating α-helix for DSSP
and the α-region of the Ramachandran diagram, blue indicating
extended strand for DSSP and the β-region of the Ramachandran
diagram, and black indicating coil/loop for DSSP and other regions
of the Ramachandran diagram.
Comparison of torsion
angle sampling for L+/H+/E+ and ANY. Ensembles
of 1000 conformers each of the drkN SH3 domain unfolded state were
generated with sampling a combination of loops (L), helix (H), and
extended strands (E) or sampling without biasing secondary structure
with the ANY flag. Phi and psi (φ and ψ) torsion angle
distributions for each conformer pool are shown as a scatter plot
in the first two rows. The third row depicts fractional secondary
structure based on DSSP (dark solid lines) or the Ramachandran (Rama.)
diagram (dashed lines), with orange indicating α-helix for DSSP
and the α-region of the Ramachandran diagram, blue indicating
extended strand for DSSP and the β-region of the Ramachandran
diagram, and black indicating coil/loop for DSSP and other regions
of the Ramachandran diagram.Importantly, we were interested in whether our
design of IDPConformerGenerator
to exploit the secondary structure propensities found in the PDB would
match experimentally measured propensities from NMR chemical shifts.
Two sets of 3000 conformers each of the drkN SH3 domain unfolded state
were generated using a backbone energy threshold of 100 kJ, with the
“ANY” flag and with the CSSS flag to do custom secondary
structure sampling based on δ2D[42] calculations from NMR chemical shift data[75] on a per-residue basis. As seen in Figure (left), there are natural secondary structure
propensities for α-helix, particularly for residues 16–29,
based on δ2D predictions for secondary structure propensities.[42] At residues 58 and 59, the predicted probabilities
of secondary structure are set to 1/3 as no chemical shift data are
available. Although extended β-strand regions have also been
predicted with δ2D, DSSP defines extended strands based on both
torsion angle ranges (the same ones as used for segmenting the Ramachandran
space) and hydrogen bonds, and there are minimal cases of tertiary
contacts satisfying β-hydrogen bonds in these disordered ensembles.
However, sampling in the β-region of the Ramachandran diagram
is plentiful. (Note that it may be valuable to utilize a different
definition of strand pairing besides DSSP that is more permissive
for local backbone geometries to characterize potential β-strands.)
This ensemble shows that helical structure is oversampled relative
to what is found experimentally. However, the regions where significant
α-helical secondary structure is sampled do overlap with the
observed secondary structure propensities identified by δ2D.
With custom secondary structure sampling, in contrast, there is very
good agreement for the α-helical and coil/loop structure on
a per-residue basis with that suggested by δ2D (Figure , left). Sampling in the β-region
is somewhat increased, with no observed β-strand H-bonding structure
seen using DSSP. Overall, biasing the sampling for torsion angles
in the PDB based on secondary structure yields an ensemble with an
overestimate of helical structure compared to the δ2D estimates,
while directing the sampling by NMR data, as expected, yields an ensemble
in much closer agreement to these data.
Figure 10
Custom secondary structure
sampling. (Left) For the drkN SH3 domain
unfolded state, two sets of 3000 conformers each were generated, and
(right) for inhibitor-2 (I-2), two sets of 1500 conformers each were
generated, with (A, B) the “ANY” flag or with (C, D)
the CSSS flag to do custom secondary structure sampling based on δ2D
calculations from NMR chemical shift data. (A, C) Plots of fractional
secondary structure based on DSSP (dark solid lines), the Ramachandran
(Rama.) diagram (dashed lines), or δ2D (light solid lines).
Orange indicates α-helix for δ2D and DSSP and the α-region
on the Ramachandran diagram. Blue indicates extended strand for δ2D
and DSSP and the β-region on the Ramachandran diagram. Black
indicates coil/loop for δ2D and DSSP and other regions on the
Ramachandran diagram. (B, D) Aligned conformers of the ensembles using
PyMOL.
Custom secondary structure
sampling. (Left) For the drkN SH3 domain
unfolded state, two sets of 3000 conformers each were generated, and
(right) for inhibitor-2 (I-2), two sets of 1500 conformers each were
generated, with (A, B) the “ANY” flag or with (C, D)
the CSSS flag to do custom secondary structure sampling based on δ2D
calculations from NMR chemical shift data. (A, C) Plots of fractional
secondary structure based on DSSP (dark solid lines), the Ramachandran
(Rama.) diagram (dashed lines), or δ2D (light solid lines).
Orange indicates α-helix for δ2D and DSSP and the α-region
on the Ramachandran diagram. Blue indicates extended strand for δ2D
and DSSP and the β-region on the Ramachandran diagram. Black
indicates coil/loop for δ2D and DSSP and other regions on the
Ramachandran diagram. (B, D) Aligned conformers of the ensembles using
PyMOL.Similar results were found for I-2, which has significantly
populated
helices around residues 85–99 and 121–145, based on
NMR chemical shifts[75] with δ2D assignments.
These peaks match with sampling from the α-region of the Ramachandran
plot in the ANY ensemble (Figure , right), but there is significant helical structure
throughout. Biasing by the NMR data, we can generate an ensemble with
nearly exact agreements of the secondary structure propensities calculated
by δ2D to the secondary structure propensities of the conformer
ensemble calculated by DSSP (Figure , right). While sampling torsion angles in the PDB
using the ANY flag may provide some insight into the natural propensities
for α- or β-regions of the Ramachandran diagram, biasing
the sampling based on experimental NMR data can yield conformer pools
that are more likely to be representative of the disordered protein.
Additional plots of residue-specific fractional secondary structures
for calculated ensembles are provided in Supporting
Information Figures S9–S12.
Comparison with Experimental Data
Beyond the chemical
shift-derived secondary structure, we were interested in the ability
of the generated ensembles to match experimental data. While the goal
is to build diverse conformer pools that have the potential to fit
experiment following a subsetting or reweighting procedure, such as
with ENSEMBLE[20] or X-EISD,[22,23] an initial match to experiment clearly demonstrates this potential.
Using RMSD from experimental data (Figure and Supporting Information
Figure S13) and ENSEMBLE and X-EISD scores as metrics (Supporting Information Table S4), we found that
IDPConformerGenerator ensembles are not in very close agreement with
the experimental data, as expected, but that the deviation is not
large for many restraint types, such as SAXS, chemical shifts, 3-bond 1HN–1Hα J couplings,
and RDCs. While there are larger deviations for those representing
tertiary contacts, PREs and 1H–1H NOEs,
it is difficult to directly compare the RMSD values for the different
experimental data types as the range of data varies considerably.
An RMSD of ∼4 Å for PRE, a measurement that goes out to
20 or 30 Å, may be closer than an RMSD of ∼0.8 ppm of
Cα chemical shifts, a measurement that varies less than 2 ppm.
CSSS generally provides ensembles in better agreement with Cα
and Cβ chemical shift restraints, as expected, particularly
for proteins with known significant sampling of secondary structure,
such as I-2.
Figure 11
Root-mean squared deviations (RMSDs) of back-calculated
values
from conformational ensembles to experimental data for the drkN SH3
domain unfolded state. Analyses of 1000-conformer ensembles generated
using various secondary structure sampling and using FastFloppyTail
(FFT). RMSDs are given for SAXS, chemical shifts (carbonyl, Cα,
Cβ, Hα), PRE, 3JHN-HA, and NOE if available. Sources of experimental data are provided
in Methods. * is for the standard FFT protocol,
which for this case treats the protein as a mixture of ordered and
disordered, while the other is for a modified protocol in which the
protein is considered to be fully disordered.
Root-mean squared deviations (RMSDs) of back-calculated
values
from conformational ensembles to experimental data for the drkN SH3
domain unfolded state. Analyses of 1000-conformer ensembles generated
using various secondary structure sampling and using FastFloppyTail
(FFT). RMSDs are given for SAXS, chemical shifts (carbonyl, Cα,
Cβ, Hα), PRE, 3JHN-HA, and NOE if available. Sources of experimental data are provided
in Methods. * is for the standard FFT protocol,
which for this case treats the protein as a mixture of ordered and
disordered, while the other is for a modified protocol in which the
protein is considered to be fully disordered.A significant measure of the ability to match experimental
restraints
is the effective sampling of various tertiary contacts. A comparison
of the Cα–Cα distance matrices for fragment sizes
of 1, 3, and 5 as well as the default fragment size sampling for Sic1
shows that there is a relatively smooth sampling of longer distances
(Figure , top row).
Significant differences are observed between ensembles generated with
the three different fragment sizes, as seen in the difference distance
matrices between ensembles, demonstrating that mixtures of fragment
sizes are valuable for sampling a diverse set of tertiary contacts
(Figure , bottom
row). In addition, there are significant differences in tertiary contacts
for Sic1 ensembles generated with loops and with “ANY”
(Supporting Information Figure S14). Together,
these results provide evidence that using a large, combined input
pool of conformations created with varying fragment sizes, secondary
structure sampling, and other parameters would enable reweighting
or subsetting to fit the distance restraint and other data types.
Figure 12
Analysis
of tertiary contacts for Sic1 ensembles. (Top row) Cα–Cα
distance matrices (lower) with deviations (upper) for 1000-conformer
ensembles of Sic1 generated with the loops-only flag for secondary
structure, with substitutions from columns 5, 3, and 2 of the EDSSMat50
amino acid substitution matrix and with variable fragment lengths.
(Bottom row) Significant differences between Cα–Cα
distance matrices (lower) and deviations (upper) (P < 0.05 from a Mann–Whitney U test).
Analysis
of tertiary contacts for Sic1 ensembles. (Top row) Cα–Cα
distance matrices (lower) with deviations (upper) for 1000-conformer
ensembles of Sic1 generated with the loops-only flag for secondary
structure, with substitutions from columns 5, 3, and 2 of the EDSSMat50
amino acid substitution matrix and with variable fragment lengths.
(Bottom row) Significant differences between Cα–Cα
distance matrices (lower) and deviations (upper) (P < 0.05 from a Mann–Whitney U test).
Comparison to Other Disordered Chain-Generating Tools
One of the early motivations for building IDPConformerGenerator is
the significant number of steric clashes found in TraDES[12,13] conformers. The definition of a clash depends on whether only the
repulsive portion of the L-J potential is considered or the whole
L-J potential is used, allowing closer distances which are compensated
for by favorable interactions. We generated a set of 1000 Sic1 conformers
using default TraDES parameters and used Chimera to check for steric
clashes[76] using a stringent criterion based
largely on distance (similar to the repulsive portion of the L-J).
With criteria of no backbone clashes and ≤5 side chain clashes,
this TraDES ensemble had no conformers meeting the criteria. In contrast,
1000-conformer IDPConformerGenerator ensembles of Sic1 built using
custom secondary structure sampling and default fragment sizes had
324/1000 conformers meeting these criteria. A similar drkN SH3 domain
unfolded state pool had 395/1000 conformers meeting these criteria.
Chimera’s clash definition is more stringent than the one we
utilize in IDPConformerGenerator and MC-SCE, which allows close contacts
if compensated for by favorable L-J energy, thus all of these IDPConformerGenerator
conformers are arguably physically realistic conformations. IPDConformerGenerator
does generate many more conformers with fewer steric clashes as defined
by Chimera than does TraDES.We also calculated a set of FastFloppyTail[15] ensembles to enable comparison with IDPConformerGenerator.
While there are different parameters for running FastFloppyTail, it
is not user customizable in terms of sampling specific secondary structures
or various fragment lengths, with the exception of using 3-mer and/or
9-mer fragment libraries. We therefore used the recommended protocol
for each system, with 3-mer fragment libraries and with an additional
run to force the drkN SH3 domain to be disordered throughout. (See
the Supporting Information for details.)
We measured the speed, diversity, sampling of secondary structure,
and match to experimental data (Figures and 11, Supporting Information Figures S6 and S9–S13,
and Supporting Information Tables S2, S3, and S4). For the
quantitative speed comparison, we considered only the time of the
calculation following setup with the initial files. For IDPConformerGenerator,
initial setup includes the generation of the initial torsion angle
database, which we needed to do only once for all of the systems,
and provides the specific protein sequence with input parameter files.
Generating the initial torsion angle database took 37 min on a desktop
computer using 63 of the 64 cores, 20–30 min for downloading,
while processing and generating the database were fast. The time is
very dependent on the Internet connection speed and number of cores
since the process is embarrassingly parallel. For FastFloppyTail,
there is no ability to define multiple processors, and there are a
number of steps and files required before the conformer generation
process, including the prediction of disordered regions and secondary
structure and creating a fragment library, which is required for each
protein. The predictions require multiple external websites or programs.
For the test systems, we could utilize the premade files on the FastFloppyTail
website, but for other proteins, this would not be the case. For the
drkN SH3 domain, Sic1, aSyn, and I-2, the fragment libraries took
between 11 and 16 min each to calculate on the HPC system that we
used for ensemble generation, while for Tau it was about 45 min (Supporting Information Table S2). Another issue
with FastFloppyTail is that disordered proteins that are not predicted
to be disordered can yield challenges in setup. In particular, the
unfolded state of the drkN SH3 domain has a sequence that is not predicted
to be disordered, leading to our testing both the recommended algorithm
and one defining it as disordered (Supporting
Information). α-Synuclein is known to fold into a long
α-helix in the presence of lipid or micelles and different predictive
algorithms have variable success in predicting its disordered state,[77] and the authors of FastFloppyTail noted a need
to find a predictor which correctly identified its disordered state.[15]The results demonstrate that IDPConformerGenerator
is faster than
FastFloppyTail for chains shorter than 200 residues with default parameters.
There is generally minimal difference in the diversity of the ensembles
between the two tools, although asphericity values are higher for
IDPConformerGenerator and FastFloppyTail ensembles are often more
compact. The secondary structure sampling for FastFloppyTail ensembles
often falls between IDPConformerGenerator ANY and CSSS biased by NMR
chemical shifts, with FastFloppyTail having higher populated secondary
structure than suggested by experiment. Matches to experimental data
are more variable, with IDPConformerGenerator run using different
secondary structure sampling approaches giving the lowest RMSD values
for different data types for different systems, often lower than the
FastFloppyTail ensemble, particularly for the CSSS with chemical shifts,
although not always. A clear distinction is that IDPConformerGenerator
enables users to flexibly define different approaches for the generation
of conformers, including for diversifying the resulting ensembles.To compare an alternative strategy of IDP conformer ensemble generation,
we have extracted conformers from MD trajectories generated using
an optimized force field from Robustelli et al.[78] (See Supporting Information, Additional
Methods, for details.) We compared the three overlapping protein
systems with ours (drkN SH3, Sic1, and aSyn) from this MD study, calculating
fractional secondary structure profiles (Supporting
Information Figures S15–S17) and agreement with experimental
data (Supporting Information Table S4A).
As seen in Supporting Information Figures S15–S17, ensembles from these MD simulations deviate from δ2D[42] predicted secondary structure propensities,
while IDPConformerGenerator’s CSSS approach can easily recreate
any secondary structure profile, including δ2D calculated propensities,
while maintaining diversity. The agreement of IDPConformerGenerator
conformers with experimental data is in general comparable to that
of these MD simulations using an optimized force field for disordered
proteins (a99SB-disp),[78] even showing better agreement in some cases (Supporting Information Table S4A). IDPConformerGenerator,
with its significantly lower computational cost than the MD, facilitates
the calculation of multiple ensembles with various secondary structure
sampling approaches. This includes CSSS, which usually leads to better
agreement with chemical shift data than the MD conformers. Different
IDPConformerGenerator secondary structure sampling approaches also
enable comparable or better fits than MD to other data. Interestingly,
the drkN SH3 IDPConformerGenerator ensembles have significantly better
agreement with the PRE and NOE data than the MD conformers, while
for Sic1 the MD fits better to PRE data. Overall, IDPConformerGenerator
performs very well compared to MD trajectories generated with the
a99SB-disp force field.
Discussion and Conclusions
A range of theoretical and
computational approaches for generating
disordered ensembles exist,[10,79] each of which has strengths
and unique features based on the design philosophy. Testing of IDPConformerGenerator
on our set of model disordered proteins demonstrates that this tool
is highly flexible and can function as a platform to enable the generation
of various initial conformational pools built with different biases
and parameters, which is valuable for addressing a range of scientific
needs. It is computationally efficient depending on the sequence length
and parameters and can enable sampling using the frequencies of secondary
structures within the PDB database provided or the experimental secondary
structure propensities from NMR experiments. The resulting ensembles
are not far from fitting experimental data, including those for local
structure, tertiary contacts, and overall hydrodynamic properties.
IDPConformerGenerator ensembles have agreement with experiment comparable
to those generated by MD simulations, at significantly less computational
cost. Ensembles generated with the various secondary structure sampling
approaches have distinct levels of agreement with the multiple experimental
restraint types, suggesting that each IDPConformerGenerator build
configuration samples unique regions of conformational space. Future
work will explore the optimal parameters for sampling structures to
facilitate the identification of subsets or reweighting to best fit
restraints, including tertiary contacts. However, the current results
strongly suggest that using an input pool with a combination of ensembles
generated with different approaches, including with and without substitutions,
varied fragment sizes and combinations, and varied secondary structure
sampling including bias with NMR chemical shift-derived probabilities,
can effectively sample a range of conformational space to facilitate
fitting experimental data with subsets or reweighting.Scientific
software is often created by scientists and not software
engineers, leading to tools that are not as user-friendly, generalizable,
easy to maintain, or thoroughly documented as desired. A larger goal
of developing IDPConformerGenerator was to design it to be easy to
use so that it would be widely used, not only to generate disordered
protein ensembles as starting pools for subsetting or reweighting
but also to enable it to function as a platform for adding existing
functionality or future approaches to define ensembles that best fit
experimental data, for computational experiments testing various ideas,
and for analyses of resulting ensembles. There are a number of straightforward
extensions of IDPConformerGenerator planned, including the ability
to build disordered regions around a folded domain, an important functionality
(and one that motivated the creation of FastFloppyTail). Due to IDPConformerGenerator’s
modularity, analysis tools can easily be built utilizing current functions.
Currently available functions include those to analyze ensembles for
fractional secondary structure and torsion angle distributions and
to analyze the database for the number of available sequence matches
and for identifying structures with select keywords. Further additions,
such as the analysis of tertiary contacts, could be implemented with
ease and are planned for a future release. More significantly, this
platform is being developed to include back-calculators for comparison
to experimental data and Bayesian scoring and reweighting approaches
such as X-EISD.[22,23] We envision that IDPConformerGenerator
will be the basis for an expanding platform of tools to facilitate
the structural characterization of IDPs and IDRs consistent with solution
experimental data. We hope that our goal of developing a user-friendly
and flexible platform will draw more researchers to the field of intrinsically
disordered protein structural characterization, which in one view
should be almost half the size of the folded protein structural biology
community, given the relative amounts of disorder and folded structure
in eukaryotic proteomes (∼0.35 to 0.65).[4] Ultimately, the resulting structural ensembles should provide
physical insights into how these abundant and dynamics states regulate
and carry out their critical biological functions and how disease
variants in IDPs/IDRs lead to pathology.
Authors: Thomas D Hurley; Jie Yang; Lili Zhang; Kristie D Goodwin; Qin Zou; Marc Cortese; A Keith Dunker; Anna A DePaoli-Roach Journal: J Biol Chem Date: 2007-07-18 Impact factor: 5.157
Authors: K Aurelia Ball; Aaron H Phillips; Paul S Nerenberg; Nicolas L Fawzi; David E Wemmer; Teresa Head-Gordon Journal: Biochemistry Date: 2011-08-15 Impact factor: 3.162