Ruth Nussinov1,2, Mingzhen Zhang1, Yonglan Liu3, Hyunbum Jang1. 1. Computational Structural Biology Section, Frederick National Laboratory for Cancer Research, Frederick, Maryland 21702, United States. 2. Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv 69978, Israel. 3. Cancer Innovation Laboratory, National Cancer Institute, Frederick, Maryland 21702, United States.
Abstract
AlphaFold has burst into our lives. A powerful algorithm that underscores the strength of biological sequence data and artificial intelligence (AI), AlphaFold has appended projects and research directions. The database it has been creating promises an untold number of applications with vast potential impacts that are still difficult to surmise. AI approaches can revolutionize personalized treatments and usher in better-informed clinical trials. They promise to make giant leaps toward reshaping and revamping drug discovery strategies, selecting and prioritizing combinations of drug targets. Here, we briefly overview AI in structural biology, including in molecular dynamics simulations and prediction of microbiota-human protein-protein interactions. We highlight the advancements accomplished by the deep-learning-powered AlphaFold in protein structure prediction and their powerful impact on the life sciences. At the same time, AlphaFold does not resolve the decades-long protein folding challenge, nor does it identify the folding pathways. The models that AlphaFold provides do not capture conformational mechanisms like frustration and allostery, which are rooted in ensembles and controlled by their dynamic distributions. Allostery and signaling are properties of populations. AlphaFold also does not generate ensembles of intrinsically disordered proteins and regions, instead describing them by their low structural probabilities. Since AlphaFold generates single ranked structures, rather than conformational ensembles, it cannot elucidate the mechanisms of allosteric activating driver hotspot mutations nor of allosteric drug resistance. However, by capturing key features, deep learning techniques can use the single predicted conformation as the basis for generating a diverse ensemble.
AlphaFold has overcome age-long bottlenecks
and forcefully bared
the power of artificial intelligence (AI) in biological research.[1−3] AlphaFold has combined numerous deep learning innovations to predict
the three-dimensional (3D) structures of proteins at or near experimental
scale resolution, inspiring the community (including us) to rethink
studies of function, evolution, and disease (e.g., refs (4−13)). The sheer volume of the rapidly generated accurate structures
argues that new, ambitious, frontier-pushing studies will emerge.
It also points to research projects that should be reconsidered. The
richness of high quality data that are being compiled in databases
(e.g., refs (5 and 14−25)) is already strengthening studies that require protein structures,
such as mapping binding sites and interactions in signaling pathways,
and identification of hot spots, including latent and rare cancer
driver mutations.[26−34] The most profound impact will likely be in accelerating and improving
production of new medications (e.g., ref (35)), and in generating data that can be used toward
this vital aim (e.g., refs (5, 17, 18, and 36−39)). AI developments and applications[40] may
further help foretell whether the signal propagating downstream will
be strong enough to reach its genomic target to activate (suppress)
gene expression,[41] and predict pathways.[42−49] Altogether, these powerful approaches and the databases that they
create revamp and transform traditional and ongoing research involving
the use of structures. They also embolden us to step back, rethink,
and innovate our projects.

AlphaFold’s achievements have
been made possible by the
Protein Data Bank (PDB), which currently holds nearly 200 000
experimentally determined structures. AlphaFold has been trained on protein
chains from the PDB and uses the input sequence to query databases
of protein sequences to construct a multiple sequence alignment.[4] However, its striking success has not led us
to a deeper mechanistic understanding of exactly how a protein sequence
folds, thus not assisting in the folding of a protein from its sequence.
Below, we first briefly describe the protein folding problem,[50,51] and strategies to predict protein structures. We describe key conceptual
and computational developments and the transformative AlphaFold advances.
We outline its strengths and some weaknesses. We emphasize what it
has accomplished, and what it has not, and the magnitude of the challenges,
underscoring the difference between the theoretical folding
problem, which was not solved,[51−57] and practical prediction that incorporates additional
evolutionary information, which generally has been.[58−65] We proceed to AI approaches to the complementary problem of protein–protein
interactions (PPIs) by these methods and others,[66−75] with the human–microbiome PPI as a relevant and topical example.[66] AI-powered prediction of human–microbe
PPIs can accelerate research into questions such as how microbiota
hijack cell signaling and provide drug targets.[76−80] We discuss how AI can reshape drug discovery, for
example by amplifying repurposing of FDA-approved drugs,[81−91] an area which is already thriving. AI can also select combinations
of drug targets, powerfully guiding and accelerating experiments by
providing specific testable hypotheses. Machine learning has already
proved its merit in the life and medical sciences.[1,92−98] Coupled with harnessed exascale computing,[99,100] advanced, AI-powered methods are set to revolutionize therapeutic
development, providing prioritized drug combinations for the attending
physicians.

Finally, we note that AlphaFold, which predicts
single ranked structures
for a protein sequence, is unable to address directly allosteric mechanisms,
which are based on the populations of conformational states
in the ensembles.[101−117] Allostery, where the signal propagates dynamically with the shifts
in the populations, underlies regulation and thus cell life.[118,119] Due to its higher specificity and consequently lower toxicity, which
results from targeting nonconserved allosteric sites, allostery also
increasingly features in allosteric drugs.[120−125]

Can we then foresee AlphaFold assisting in unraveling the mechanisms
of allosteric hotspot mutations and allosteric drug discovery? Indirectly
it can and does, even in our hands. The rigid structures that AlphaFold
predicts can be submitted to MD simulations that generate such ensembles
(Figure 1). At the
same time, as we discuss here, other AI-based strategies can assist
directly in such efforts, most effectively via accelerating and enhancing
MD simulations. Efforts are also likely to persist in exploiting AI
toward prediction of allosteric binding sites. Nevertheless, it behooves
us to recall that the effectiveness of allosteric sites is determined
by both stable interactions at the site, which is something that AI
can help with, and initiation of effective allosteric signals, which
would be more challenging. Current approaches to predict allosteric
binding sites address only the former. In that sense, they resemble
the characterization of orthosteric sites, except that their scoring
is based on statistics of allosteric sites.
Figure 1
Current strategy of allosteric
drug discovery in computational
structural biology employing the AlphaFold program with artificial
intelligence (AI)-powered methods (top panel). Experimental
instruments, such as X-ray crystallography, cryo-electron microscopy
(cryo-EM), and nuclear magnetic resonance (NMR) can resolve protein
structures, but often miss the coordinates of highly fluctuating regions
in the protein structure. AlphaFold can predict the missing coordinates
of these regions. The resulting structure can be subjected to molecular
dynamics (MD) simulations that provide conformational dynamics, conformational
changes, and folding characteristics of the protein. An example is
shown for Src homology region 2-containing protein tyrosine phosphatase
2 (SHP2) (bottom panel). The X-ray structure of SHP2
(PDB ID: 4DGP) misses residues in two flexible regions, which can be predicted
by AlphaFold. SHP2 contains two Src homology 2 (SH2) domains (nSH2
and cSH2) and a protein tyrosine phosphatase (PTP) domain.

Orthosteric drugs block the active site; allosteric
drugs alter
the population of the active state of the protein, including the active
site, through binding at a site far away.[126] We suggested that allosteric drugs can consist of “anchor”
and “driver” atoms, where the anchor atoms bind to
the allosteric pocket, without changing the conformation of the binding
site. The interactions of the anchor atoms stabilize ligand binding,
resembling protein–ligand binding at the orthosteric site.
The binding of driver atoms “pulls” or “pushes”
atoms in the protein pocket. This initiates the allosteric signal,
which shifts the receptor population from the inactive to the active
state. Driver atoms can trigger agonism and antagonism. AlphaFold
cannot handle population shifts. AI strategies can but will need to
go beyond prediction of stabilizing interactions.

Finally, not surprisingly, prediction of the structures of intrinsically
disordered proteins (IDPs) and regions (IDRs) is another problem where
AlphaFold falls short. Disordered proteins (regions) are characterized
by broad and heterogeneous ensembles where the differences in the
relative conformational stabilities are small, or even negligible, and the
barriers are low.[127−135] The conformations interconvert, leading to low probabilities of
AlphaFold’s reliably capturing those most favored, or the conformational
distribution. Nevertheless, leveraging, learning, and mining of the
conformations can exploit AI.[136−138]

AI-powered algorithms,
which are fed vast compiled data and enabled
by the emerging massive compute power, are propelling a revolution
in computational biology (Figure 1). Unlike quantum computing, in the case of AI and
data-driven computing, the technological innovations at the requisite
scales are already at hand.
Protein Folding versus Prediction of Protein Structure
Protein Folding
The protein folding problem embraces
two questions:[51] first, the conceptual
question of how a protein’s amino acid sequence dictates its
3D atomic structure, and second, how, starting from a single amino
acid sequence, to successfully predict the 3D structure, without using
information related to other available (homologous, same family) sequences
nor structures of any related sequences. Such computational prediction
methods are guided by the conceptual notion that this is how the protein
folds in nature. Single sequence-based prediction in solution considers
forces related to hydrogen bonds, ion pairs, van der Waals attractions,[139] and chiefly water-mediated hydrophobic interactions,
with the hydrophobic effect the driving force for protein folding.
This formal folding problem emerged six decades ago, alongside the
first atomic-resolution protein (globin) structures. These structures
led to thermodynamic questions of the balance of interatomic forces
that determines the structure of the protein, how the protein can
fold so quickly, that is the kinetics of the pathways, and the computational
problem of protein structure prediction. The landmark thermodynamic
hypothesis of Christian Anfinsen and his colleagues[140,141] stated that the native structure of a protein is its thermodynamically
most stable structure, and it is determined only by its amino acid
sequence and the conditions it is in, with kinetics playing no role.
No other considerations are at play, such as whether it is synthesized
in the lab or on the ribosome or undergoes chaperone-assisted folding.
The folding paradigm stipulated that unfolded molecules will always
spontaneously fold into the same shape; that is, the linear amino
acid sequence specifies a protein’s folded native state.[58,142−144] Anfinsen’s thermodynamic hypothesis
emphasized the shape of the energy landscape where the native state
is the one with the lowest free energy.[141,145] Computationally, that description posed the problem of prediction
of protein structure, forming the basis for approaches that dominated
the field for scores of years. If only the sequence matters, along
with the physicochemical forces, it should be possible for “good”
algorithms to fold it. Assuming that the crystal structure represents
the minimum energy state, the “goodness” of the predicted
structure can then be assessed by comparison with it. Anfinsen’s
description combines sampling of alternative conformations, ranking
them by energy and identifying the lowest energy state.[51,146−148] Subsequent efforts focused on prediction
of secondary structures, although the dominant role of the hydrophobic
interactions suggested that secondary structure is an outcome of the
3D structure and not its cause.[149,150] The small (5–10
kcal/mol) difference in the stability of the native structures as
compared to the denatured states[151] compounded
the challenge that prediction methods faced.

Already early on,
Cyrus Levinthal conceptualized the key problem facing the protein
and the prediction algorithms:[152] the vast
time scales for the protein to search the folding space and reach
its most stable native state under biological conditions.[58] For prediction algorithms sampling backbone
states, the search space size grows exponentially with chain length,
making an exhaustive search impossible. Levinthal argued that there is no need
to search this vast space since the energy landscape is funnel-like,
rather than flat, and thus can guide sampling toward the biological
conformational basin.[51,153] Packed hydrophobic cores optimize
their van der Waals (vdW) interactions, restrict torsion angles, and
abolish internal “holes”, with hydrogen bonds and salt
bridges balancing the loss of interactions with water. Harold Scheraga
employed physical chemistry to pioneer studies to decipher how amino
acid sequences influence the 3D folding pathways, thermodynamics,
and biological activity of proteins. Neither AlphaFold nor broadly
other protein structure prediction algorithms consider folding pathways.
Physical chemistry is accounted for implicitly; in the case of AlphaFold,
via AI.
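
Levinthal's argument about the exponential growth of the search space can be made concrete with a back-of-the-envelope calculation. The sketch below is only illustrative: the number of backbone states per residue (3) and the sampling rate (10^13 conformations per second) are assumptions chosen for the estimate, not values taken from this article, and the function name is hypothetical.

```python
import math

def levinthal_estimate(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Estimate the size of an exhaustive conformational search.

    Returns (log10 of the number of conformations,
             log10 of the years needed to visit each one once).
    Both default parameters are illustrative assumptions.
    """
    # states_per_residue ** n_residues, kept in log10 to avoid overflow
    log10_conformations = n_residues * math.log10(states_per_residue)
    seconds_per_year = 365.25 * 24 * 3600
    log10_years = (log10_conformations
                   - math.log10(samples_per_second)
                   - math.log10(seconds_per_year))
    return log10_conformations, log10_years

for n in (50, 100, 150):
    log_conf, log_years = levinthal_estimate(n)
    print(f"{n} residues: ~10^{log_conf:.0f} conformations, "
          f"~10^{log_years:.0f} years of exhaustive sampling")
```

Even under these generous assumptions, a 100-residue chain would require a search time many orders of magnitude longer than the age of the universe, which is why guided rather than exhaustive sampling is needed.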
Protein Structure Prediction
Prediction of protein
structure can be template-based or template-free; the latter does not use
global similarity to an experimental (Protein Data Bank, PDB) structure.[58] Template-free modeling exploits physics-based
energy functions. Both can exploit machine learning and AI to use
data in the PDB. Template-based modeling selects a structural template
and uses sequence alignment. Template-free modeling uses conformational
sampling and ranking. It may start with multiple-sequence alignment
to related sequences to predict local structural features, which will
guide the 3D modeling followed by refining and ranking.

Integrative
modeling[154−156] that assembles structures from individual
components may suffer from a high false-positive rate. Computational
integrative approaches can combine data from experimental methods, bioinformatics,
physics, and statistics for rapid and accurate structure determination
of protein complexes. The algorithms can integrate experimental data,
such as X-ray crystallography, NMR spectroscopy, 2D and 3D electron
microscopy (EM), small-angle X-ray scattering (SAXS), mass spectrometry
(MS), hydrogen–deuterium exchange (HDX), mutations, sequence
conservation and covariation, and statistical analysis of known structures.
Computationally, the algorithms can derive from computer vision, image
processing, computational geometry, machine learning, robotics, and
graph algorithms. Machine learning has, however, been used toward protein
structure prediction.[157−163]

AlphaFold is not the first machine-learning model. Its
remarkable success (with scores near 90 even for the difficult targets
in the 2020 Critical Assessment of Protein Structure Prediction, CASP)
was influenced by its training not only on all the PDB structures
but also on structures it predicted, and it uses the structure and
correlation data to predict the pairs of amino acids that are in contact
as well as all amino acid pairwise distances. It also ensures that
the distances between the amino acids satisfy the triangle inequality,
saving time at intermediate steps.[164] To
date, AlphaFold illuminates half of the dark human proteins.[10]

Still, questions remain, such as which
structural states exist
for a given protein, and what is the population of each state. Addressing
these questions is vital to relate protein structure to function.
This is where AlphaFold falls short. However, the models it produces
can serve as input to generate ensembles, for example by MD simulations,
which, if carried out in parallel at sufficiently long time scales,
should be able to produce them. Simulations can sample the relevant
states, can enumerate possible state combinations (multistate models),
and can determine the population sizes for the states.
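
As a minimal sketch of how state populations might be read off a simulation, the example below counts how often each state is visited and converts the populations into relative free energies via ΔG_i = −RT ln(p_i/p_ref). It assumes that every frame has already been assigned to a discrete state and that the sampling has reached equilibrium. The state names, frame counts, and function names are hypothetical and serve only as illustration.

```python
import math
from collections import Counter

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def state_populations(labels):
    """Fraction of frames assigned to each state."""
    counts = Counter(labels)
    total = len(labels)
    return {state: n / total for state, n in counts.items()}

def relative_free_energies(populations, temperature=300.0):
    """Free energy of each state (kcal/mol) relative to the most populated one."""
    p_ref = max(populations.values())
    rt = R_KCAL * temperature
    return {state: -rt * math.log(p / p_ref) for state, p in populations.items()}

# Hypothetical per-frame state assignments from a trajectory
labels = ["inactive"] * 900 + ["active"] * 90 + ["intermediate"] * 10

pops = state_populations(labels)
dG = relative_free_energies(pops)
for state in pops:
    print(f"{state}: population {pops[state]:.3f}, dG {dG[state]:.2f} kcal/mol")
```

In practice the assignment of frames to states (for example by clustering) and the convergence of the sampling are the difficult parts; the arithmetic above is only the final step.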
The Structure–Function Paradigm Overlooked Ensembles and Dynamic Energy Landscapes: AlphaFold Is Attuned but Is Unable to Address Them
The sequence–structure–function
dogma was the touchstone of a generation. It dominated molecular biology
for decades. It was introduced by physical chemists who explained
that biological macromolecules function when they are folded. Thus,
to understand how molecules function, one needs to consider their
3D structures, a transformative paradigm that became a tenet of modern
biology. Today, it is broadly recognized that rigid molecules cannot
perform a function, leading the way to the appreciation that to sustain
life, molecular flexibility is a necessity. That, however, has not fully
translated to the understanding of the powerful concept of the energy
landscape,[165] namely that biomolecules
are dynamical objects that are always interconverting between a variety
of structures with varying energies,[166] and that this is the origin of allosteric mechanisms.[167−169] This notion of flexibility as interconversion between conformations
is critical for understanding biological processes and their regulation,
such as protein activation as a shift of the ensemble from the inactive
to the active state, how allosteric drugs work, cell signaling, and
binding mechanisms through conformational selection rather than induced
fit. The conceptual evolution from the classic structure–function
paradigm to dynamic energy landscapes of biomolecular
function and allosteric mechanisms poses a challenge to AlphaFold’s
powerful predictions. To understand biological regulation, structure
should be linked to function through protein ensembles in terms of
populations and relative energies, which is the foundation of allostery
(Figure 2). Despite
its transformative power and vast, broad impact, the AlphaFold predictions
are unable to address it directly. It is only through their sampling
that this functional aim can be accomplished.
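
The link between relative energies and populations can be illustrated with a two-state Boltzmann model. This is a deliberately simplified sketch: the free energy values are hypothetical numbers chosen for illustration, not measurements for any protein, and a real ensemble contains many more than two states.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def boltzmann_populations(free_energies, temperature=300.0):
    """Equilibrium populations from state free energies (kcal/mol)."""
    rt = R_KCAL * temperature
    g_min = min(free_energies.values())  # shift for numerical stability
    weights = {s: math.exp(-(g - g_min) / rt) for s, g in free_energies.items()}
    z = sum(weights.values())            # partition function
    return {s: w / z for s, w in weights.items()}

# Hypothetical landscape before an activation event: ON lies 2 kcal/mol above OFF
before = boltzmann_populations({"OFF": 0.0, "ON": 2.0})
# Hypothetical landscape after the event: ON stabilized to 1 kcal/mol below OFF
after = boltzmann_populations({"OFF": 0.0, "ON": -1.0})

print(f"ON population before: {before['ON']:.2f}, after: {after['ON']:.2f}")
```

Under these assumed values, both states exist before and after the event; only their relative populations change. This is the sense in which activation is a shift of the ensemble rather than the creation of a new conformation.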
Figure 2
Structural ensembles
for B-Raf activation. The snapshots for B-Raf
kinase domains (top panels) are generated from the
protein databank (PDB). The representative inactive OFF-state conformation
(PDB ID: 3SKC) and active ON-state conformation (PDB ID: 6UAN) are highlighted
in blue and red, respectively. The free energy landscape of B-Raf
kinase domain depicting the population shift from OFF-state to ON-state
upon activation (middle panel). Highlighted activation
segments of αC-helix and A-loop representing the side by side
comparisons between the single structure predicted by AlphaFold and
the representative B-Raf conformations of both inactive OFF-state
and active ON-state (bottom panels). The AlphaFold
structure falls into neither the active ON-state nor the inactive
OFF-state.

Around their native states, protein landscapes
consist of rapidly
interconverting conformations. The ensembles are “fuzzy”.[170,171] Events associated with their environments and functions, such as
changes in pH, interactions with ions, water, and lipids, and binding
of small or macromolecules, promote conformational changes. These
are frustrated by their local restricted molecular environment.[172] The cooperative, accommodating structural changes
shift the ensemble. The shifted, now populated states are frustrated
by their current neighboring residues’ conformations. Binding and catalysis
involve making and breaking covalent and noncovalent interactions
at the interaction site. These propagate through frustration, influencing
the conformational states of the ensemble. The shifts in the ensemble
alter the relative stabilities, i.e., the populations of the states,
thus influencing the allosteric transitions. Importantly, frustration
does not create new conformations; instead, it alters the number of
molecules populating each of them.[173]

Frustration
is thus a powerful tool harnessed by evolution for function.[174]

Biomolecules must be described
statistically, not statically.[166,175] Static descriptions
were the norm for decades. Yet, a static description
cannot capture function. It cannot describe protein activation from
the inactive to the active state upon some activation event, such
as binding a hormone, or being covalently modified by a post-translational
modification, or the presence of oncogenic driver mutations. It is
also unable to describe how high affinity binding to an activator
shifts protein molecules to their active state.[176,177] It will further fail when attempting to describe how allosteric
“rescue mutations”[178,179] work (albeit
not other rescue mutations, e.g., refs (180−183)), how allosteric drugs are able to block the active site, and how
mutations countering them can be overcome. All these processes which
take place in the cell would not have been possible had the protein
existed in a single structure or flipped between only two states,
active and inactive. While there is a single conformation
that the active enzyme should adopt for productive catalysis, there
are multiple ways to inactivate it and thus many inactive states.
The notion of a single structure bred the concept of the “lock-and-key”
binding mechanism. This view was superseded by the “induced
fit” mechanism which considered the presence of only two states,
an active and an inactive state. In an induced fit scenario, the ligands
bind to the single “open” protein structure and the
interaction between a protein and a rigid binding partner induces
a conformational change in the protein.[184] In contrast, the conformational selection mechanism[167−169,185] theorizes that the energy surface
hosts a very large number of conformations, and the one that fits
best is selected, with subsequent minor induced fit optimization,
largely by side chains.

AlphaFold exploits AI to make template-free
predictions of protein
structures from their sequences, equipping biologists with structures
with good resolution. The predictions that it yields, like those obtained
by homology modeling, are rigid. Flexibility is implicitly captured
by the absence of, or low confidence levels of predicted structure
for certain regions, as in the case of intrinsically disordered proteins.
Thus, computational methods once relegated to the periphery of biology,
are now at the forefront, driving “the second molecular biology
revolution”. AlphaFold can drive breakthroughs in fundamental
problems in the life sciences, including precision medicine, with
promise to transform research and accelerate drug discovery. It is
driven by deep learning innovations, which appear poised to transform
MD simulations.
Applications of Artificial Intelligence and Machine Learning
AI and Machine Learning in Simulations
Machine learning
for molecular simulations, including its tools, strategies, and principles, has
been reviewed recently.[186] As can be seen
from this excellent review, machine learning has already been making
a significant impact on the development of approximate methods for
complex atomic systems. The innovation in the development and integration
of MD simulations with deep learning can reproduce, interpret, predict,
and generate data relating to the behavior of biological macromolecules.[187−192] Deep learning methods can help MD simulations excel in their efficiency
and scales, with AI bridging between deep learning technologies and
simulations. Challenges toward broad usage include smooth connection
of AI and MD and automation of workflows. These could popularize novel
deep learning tools in MD simulations toward efficiently exploiting
both powerful methods. The number of publications in this area has
been skyrocketing, emphasizing the recognition of the potency of AI
and machine learning in simulation. As an example, MD simulations
need to perform extensive sampling of the conformational space, which
requires long time scales. Deep learning approaches involving, e.g., variational
autoencoders have been shown to be useful. The learned latent space
in the variational autoencoders has been employed to generate unsampled
protein conformations, and simulations starting from these conformations
accelerated the sampling.[193] In another
example, a deep learning framework with mixed classical and machine
learning potentials (TorchMD) has been developed for molecular simulations.[188] The review cited above provides additional
diverse examples. Deep learning has also already been exploited in
structural modeling and design,[60,194−196] in analysis,[191] and in linking these to
function.[197]
AI and Machine Learning in Prediction of Pathogen–Human Host PPIs
AI and deep learning are also being developed and
applied to experimental determination and prediction of macromolecular
structures,[198] as well as to PPIs.[199,200]

Applications of AI approaches to human–microbiome protein–protein
interactions have also been reviewed recently.[66] These interactions play important roles in human health
and disease. There is a rapidly growing body of data showing that microbes, including bacteria
and viruses, impact human health. They can modulate human signaling
and immune response by interacting with the human proteins. To decipher
this modulation, it is important to identify the specific interactions,
the human host proteins that are involved, and the structure of the
complex. Identification of the interactions along with their structural
details at atomic resolution permits understanding the mechanisms involved
in pathogen survival and assist in drug discovery targeting these
interactions. The interactions help the pathogens to elude and bypass
the immune defense, with the pathogens hijacking host signaling. Mechanistically,
pathogen proteins can have surfaces which resemble those of the host,
allowing them to mimic and compete with host protein interactions
(Figure 3). They bind
to the host protein and rewire its physiological signaling. Data,
including structural details, are scarce and large-scale experimental
detection is challenging. Efficient and robust computational strategies
to predict the interactions are thus vital. We have developed an algorithm
and server to predict these human host–microbial PPIs (HMIs)
based on their protein structures, which can be experimental or modeled.
In large scale applications, AlphaFold can now be used toward this
aim. Machine learning permits both efficient, generalizable large-scale
application and treatment of the complex dynamics of such relationships,
which machine learning algorithms can decipher.
Figure 3
Human–microbiome
PPIs promote GTPase activation. Human cell
division control protein 42 homologue (Cdc42) is a small GTPase of
the Rho family, involved in cell cycle. In human cells, it is activated
by guanine-nucleotide exchange factors (GEFs), such as DOCK9 (PDB
ID: 2WMO), by
transforming the inactive GDP-bound to the active GTP-bound forms.[207] Bacterially secreted toxins or effectors mimicking
the GEF proteins, such as SopE (PDB: 1GZS) from Salmonella[208] and MAP (PDB: 3GCG) from Escherichia coli,[209] can interact with Cdc42 and activate
it. The interaction surfaces of these bacterial GEF mimicries resemble
the host protein, allowing them to mimic and compete with the host
protein interactions. PPIs, protein–protein interactions; HMIs,
host–(or human−) microbiome interactions. Ongoing work
incorporates AI into the HMI prediction algorithm. If the structures
of the human or microbe are unable, AlphaFold can generate them.
Challenges in machine learning
for PPI prediction relate to both
data and method. Because microbial data, unlike human data, are limited,
microbial sample sizes are small. In sequence-based algorithms, the
dimensionality problem can be pronounced: the difficulty grows exponentially
as the feature size increases. Principal component analysis (PCA),
uniform manifold approximation and projection (UMAP), or autoencoders
can be used to embed the samples into lower-dimensional spaces,[201,202] and preprocessing and postprocessing pipelines can be employed for
other data. In structure-based methods, the problems may relate to
the quantity and diversity of the representation. Data on
host–microbe PPIs with 3D structures are sparse, which hampers
training and evaluating the computational methods. Additional problems
include the lack of a gold-standard test data set, unclear evaluation
metrics, the sparsity of PPI networks, and more.[66] DeepMind’s AlphaFold2 success in sequence-based
protein structure prediction,[4] as well
as its open-source counterpart RoseTTAFold,[70]
and the publicly available AlphaFold2 predictions for all human proteins[203] are major steps that benefit the scientific
community.
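To make the dimensionality problem concrete, the following is a minimal sketch, not the method used in the cited work. It assumes a simple k-mer count representation of protein sequences, which the text does not specify. The function names and the toy sequence are hypothetical and chosen only for illustration. The sketch shows that the number of possible features grows as 20^k, while a single sequence fills only a vanishing fraction of that space.

```python
# Illustrative assumption: sequences are featurized as k-mer counts over
# the 20 standard amino acids. All names here are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_feature_dim(k):
    """Number of possible k-mers, i.e. the length of a dense feature vector."""
    return len(AMINO_ACIDS) ** k


def kmer_counts(sequence, k):
    """Sparse k-mer count vector: only k-mers that occur are stored."""
    counts = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Skip windows containing non-standard residue codes.
        if set(kmer) <= set(AMINO_ACIDS):
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def occupancy(sequence, k):
    """Fraction of the k-mer feature space populated by one sequence."""
    return len(kmer_counts(sequence, k)) / kmer_feature_dim(k)


# Arbitrary toy sequence, used only to exercise the functions.
TOY_SEQUENCE = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

for k in range(1, 5):
    print(k, kmer_feature_dim(k), occupancy(TOY_SEQUENCE, k))
```

With few microbial samples and a feature space of this size, most features are never observed, which is one reason embeddings into lower-dimensional spaces such as PCA, UMAP, or autoencoders are applied before training.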
Conclusions
AI and machine learning are appending projects.
They are applied
in diverse areas, including biological networks.[204] They impact disease biology, drug discovery,
microbiome research, and synthetic biology. Applications also include a machine
learning pipeline for molecular complex detection in protein-interaction
networks,[205] as well as assessing the relevance of
major signaling pathways in cancer survival.[46]

Here, we briefly overviewed the immense impact of AlphaFold,
and of AI in structural biology, with some examples. We highlighted what
AlphaFold can and cannot accomplish and why. Allosteric mechanisms
fall into the latter category. Nevertheless, through MD simulations
of models that AlphaFold produces, this aim can be accomplished as
well. However, even though simulations would address this dynamics
problem, at such scales the cost is prohibitive. A paradigm-shifting
machine learning method is needed to model protein dynamics.

AlphaFold and its underlying deep learning innovations have opened
up the next frontiers in protein science,[164] including precision medicine.[206] Protein
structures connect to cell biology, chemistry, biophysics, and medicine.
To date, over 180 000 protein structures are available, open
to all researchers across the world in the PDB database. Still, structures
of pathogens are not within them, and neither are many others, which
are essential to human health. The resource is now there, and with
growing computational power, eventually, these will be there as well.
Nonetheless, the availability of the structures is insufficient. For
us, as biophysicists, the key is identifying the significant questions
to ask. What should the focus of our research be, such that we do
not repeat what was so well done but instead exploit the new capabilities
to ask the really important questions?