Huan Yang,¹ Zhaoping Xiong,¹ Francesco Zonta¹. ¹Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, 393 Middle Huaxia Road, Shanghai 201210, China.
Abstract
The traditional approach of computational biology consists of calculating molecular properties using approximate classical potentials. Interactions between atoms are described by an energy function derived from physical principles or fitted to experimental data. Their functional form is usually limited to pairwise interactions between atoms and does not consider complex multibody effects. More recently, neural networks have emerged as an alternative way of describing the interactions between biomolecules. In this approach, the energy function does not have an explicit functional form and is learned bottom-up from simulations at the atomistic or quantum level. In this study, we attempt a top-down approach and use deep learning methods to obtain an energy function by exploiting the large amount of experimental data acquired over the years in the field of structural biology. The energy function is represented by a probability density model learned from a large repertoire of building blocks representing local clusters of amino acids paired with their sequence signature. We demonstrated the feasibility of this approach by generating a neural network energy function and testing its validity on several applications, such as discriminating decoys, assessing the quality of structural models, sampling structural conformations, and designing new protein sequences. We foresee that, in the future, our methodology could exploit the continuously increasing availability of experimental data and simulations and provide a new method for the parametrization of protein energy functions.
A realistic description of biological molecules should involve
Quantum Mechanics (QM); however, its computational cost strongly limits
feasible applications of such an approach. Models based on approximate classical potentials, on the other hand, have been more successful
in describing how proteins and other biological molecules interact.
Two main categories of simplified potentials have emerged: physics-based
potentials and knowledge-based potentials. Both methods rely on physical
intuition to map interactions between atoms into simple functional
forms that depend on a set of parameters. In the first case, the parameters
are derived by comparison with high-precision QM calculations or experiments.
These potentials often take the name of force fields and are mainly
used to perform molecular dynamics (MD) simulations.[1−5] In the second case, probability distributions of observables (e.g.,
distances, angles, native contacts, etc.) are directly obtained from
experimental information and transformed into a statistical potential.[6−10] These potentials are mainly used for applications that are too computationally
expensive for MD simulations. The two approaches are not antithetical, and many hybrid potentials have been developed by combining the two methodologies.[11,12]

Despite their wide usage and success, approximate classical potentials still present major limitations that go in opposite directions. For
some applications, their simple functional form cannot take into account
all the details required for an accurate description of the system.
At the same time, simulations based on classical potentials can be
too computationally expensive and cannot always reach time and size
scales directly comparable with experiments. Although QM calculations
and hybrid methods[13−16] can be used to improve the accuracy of the calculations, and enhanced
sampling methods[17] or coarse-grained force
fields[18−21] can be used to scale up in both time and size, the core problems still exist.

Deep learning has entered the field of structural biology with a number of different applications that have steadily increased over the years.[22,23] In most cases, deep learning has been used
to complement standard bioinformatics techniques based on sequence
analysis, for example, protein structure prediction using primary
sequences. This classic problem, the holy grail of computational biology,
has de facto been solved recently after the impressive results obtained
by AlphaFold2, RoseTTAFold, and others in the 14th Critical Assessment
of Techniques for Protein Structure Prediction (CASP14) competition.[24−27] Researchers have also applied unsupervised deep learning methods
to extract biological, biophysical, and evolutionary information from
protein sequences.[28−30] Despite these successful examples, deep learning
models are very data-hungry and, as a consequence, their field of
application is still limited.

In recent years, novel approaches that combine deep learning with physics-based methodologies have emerged. For example, in chemistry
and material science, neural network and quantum Monte Carlo methods
have been applied to solve the Schrödinger equation[31,32] or generate classical atomistic force fields by fitting calculations
based on density functional theory[33,34] or ab initio
molecular dynamics.[35,36] In protein science, some groups
have developed neural network energy functions for multiscale modeling
of proteins, following a bottom-up approach, for example, fitting
coarse-grained potentials to atomic MD simulations.[37] Even though these energy functions are not transferable
to different protein systems, they could lead to a paradigm shift
in how force fields will be parametrized in the future.

In this study, we propose a top-down approach and use unsupervised
deep learning to construct a statistical potential from experimental
data (protein structure and sequences). The statistical potential
does not have a simple functional form but is instead represented by
a deep neural network. We then treat this statistical potential as
a proper energy function (neural network energy function or NNEF)
that can be applied to various tasks in tandem with other standard methodologies
(for example, running MD simulations or discriminating decoys from
native configurations).

In strict terms, the NNEF is a free
energy function that takes
into account both enthalpic and entropic contributions. Because entropy
is not explicitly considered, hereafter, we use the term “energy
function” for simplicity.

The energy function should
depend only on the protein configuration,
be time-independent and differentiable, and be invariant under rotations,
translations, and permutations of the components to reproduce protein
physics correctly.
Results
Design of the Neural Network
Energy Function Based on Protein
Building Blocks
In constructing the data-driven energy function
for proteins, the choice of the training data sets is critical, and
a naïve training of the network with all the known experimental
protein structures and sequences will likely fail to reproduce protein
physics correctly. In an ideal case, we would like to have a large
sample of unbiased sequence-structure pairs. In reality, when one
looks at their physical properties, the set of proteins with known
experimental structures is small and redundant. The ensemble of known
folding domains[38,39] is likely incomplete as structures
that appear legitimate from an energetic point of view may not yet have been selected or explored by evolution.[40] However, we can hypothesize that evolution had enough time to extensively
explore the configuration space of smaller three-dimensional (3D)
local structures.[41−43] Such local structural patterns have been chosen not
only for their biological meaning but also because they have lower
energy than a random configuration of the same group of amino acids.
Under this hypothesis, all the information necessary to build a working
energy function is already contained in the available protein databases.

Following these guidelines, we decided to represent a protein as
a collection of small-scale structures (building blocks) paired with
their sequence information. Each building block describes the chemical
environment around a single amino acid. Building blocks can overlap
as a certain amino acid can belong to different building blocks. In
the training phase, the building blocks are selected from a nonredundant
set of protein structures and their homologous sequences (see Figure 1 and Methods). The neural network learns the probability P(X) of a given building block X from its
structural and sequence features as P(X) = p(Structure(X)) · p(Sequence(X) | Structure(X)), and we calculate the energy of the building block as the negative logarithm of the probability: E(X) = −log P(X). We assume
that the set is complete, that is, building blocks absent in this
set are improbable and thus have higher energies.
Figure 1
Overview of the NNEF
model. (A) We use a sample of nonredundant
protein structures and augment the training data with their homologous
sequences. (B) Local structure around each residue defines the building
block and is used as the input of the energy model. Each building
block X includes the residue itself, its four nearest
residues along the sequence, and the ten other nearest residues in
the 3D space. Each residue is represented as one bead. The number
of building block used as samples to train the network is about 2.5
billion. (C) The energy model is illustrated in (E), where we fit
a probability density function and calculate the energy as E(X) = – log P(X). The total energy of the protein is the sum of energies
of all building blocks. In the auto-regressive model, we separate
the structural and sequence features and calculate P(X) = p(Structure(X)) · p(Sequence(X) | Structure(X)) using the transformer network architecture. We use the softmax function to calculate the probabilities of discrete
features and use Gaussian mixtures to calculate the probabilities
of continuous features. The total number of parameters of the network
model is 2.1 million. (D) The learned energy function can be applied
for various tasks, such as decoy scoring, conformation sampling, and
sequence design, without being explicitly trained for any of those.
Each building block is formally independent of the others, so
that, by construction, the probability of a given protein configuration
(Y) is the product of the probabilities of the single
building blocks (X): P(Y) = ∏P(X), and its energy is the sum of the energies
of each building block: E(Y) = ∑E(X). Interactions
between different amino acids are treated in a self-consistent way, as
changing the coordinates or the sequence of a single amino acid will
affect all the building blocks in which the amino acid is present.

The energy function obtained in this way will be local and additive
by construction. If properly trained, the neural network will be able
to recognize alternative local minima, which correspond to alternative configurations such as active or intermediate states.
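To make this additive construction concrete, the following minimal sketch assembles the total energy from per-block probabilities; the `block_log_prob` callable is a hypothetical stand-in for the trained network, not the actual implementation:

```python
import math

def protein_energy(building_blocks, block_log_prob):
    """Total energy of a configuration Y as the sum over its building
    blocks: E(Y) = sum_X E(X), with E(X) = -log P(X).

    building_blocks: one feature object per residue of the protein
    block_log_prob:  callable returning log P(X) for a single block
                     (a stand-in here for the trained neural network)
    """
    return sum(-block_log_prob(x) for x in building_blocks)

# Toy check: a 100-residue protein whose blocks all have P(X) = 0.01
# gives E(Y) = 100 * (-log 0.01), about 460.5.
print(protein_energy(range(100), lambda x: math.log(0.01)))
```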
Protein Building Blocks
Definition and representation
of building blocks rely on human intuition. They can be pairs of
residues, peptide fragments, groups of residues, and so forth. They
can be represented by atoms or virtual beads paired with various geometric
and chemical features. Furthermore, the size of the building blocks
has to be small enough to minimize the risk of learning from undersampled
sets but large enough to represent complex tertiary structures.

In our current model, we define a building block as the local structure
of 15 amino acids, which includes a central residue, its four nearest neighbors along the sequence, and the ten other residues that are closest to the central one in 3D space (Figures 1B and S1). With
this definition, building blocks are typically composed of a few noncontiguous
segments of residues that can be far apart in the primary sequence. As we
will show, the energy function can be used to run MD simulations,
and in this particular case, the building blocks change dynamically
with time. To represent the structure of building blocks, we use very
simple low-resolution geometric features, that is, the coordinates
of beads at the Cβ positions (Cα for glycine) and the connectivity of
these beads based on the protein sequence. Other features, such as
the positions of backbone atoms, the side chain positions, partial
charges, hydrogen bonds, and so forth, could be added in future refined
versions of the energy function but have not been considered in the
current realization. The coordinates of all the beads composing each
building block are rotated to the same local internal coordinate system
(Figure S2) to guarantee rotational–translational
invariance of the energy function.
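As an illustration, here is a minimal numpy sketch of this building-block selection; tie-breaking and the handling of chain termini are not specified in the text, so those details are assumptions:

```python
import numpy as np

def building_block(coords, center, n_seq=2, n_space=10):
    """Select the indices of the building block around `center`:
    the central residue, its nearest sequence neighbours (two on each
    side), and the `n_space` spatially closest remaining residues.
    coords: (N, 3) array of C-beta bead positions (C-alpha for Gly)."""
    n = len(coords)
    segment = [center + d for d in range(-n_seq, n_seq + 1) if 0 <= center + d < n]
    dist = np.linalg.norm(coords - coords[center], axis=1)
    dist[segment] = np.inf                 # exclude the central segment
    spatial = np.argsort(dist)[:n_space].tolist()
    return segment + spatial               # 15 residues away from termini

coords = np.random.rand(60, 3) * 30.0      # toy 60-residue chain
print(building_block(coords, center=30))
```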
Understanding the Neural
Network Energy Function
To
understand whether the NNEF can correctly grasp the physical properties
of proteins, we test it in several tasks described hereafter.
Scoring Decoys
Generated by Modifications of the Native Structures
A good
energy function should be able to discriminate native-like
from non-native conformations. We tested the energy function against
the 3DRobot decoy set[44] and a second set
of decoys that we generated by sampling normal modes of protein structures.
The 3DRobot decoy set includes 200 nonhomologous single-domain proteins
(48 in the alpha class, 40 in the beta class, and 112 in the alpha/beta class) and 300
structural decoys for each of these proteins, with root-mean-square
deviation (RMSD) ranging from 0 to 12 Å. The second set was generated
from 18 small (<120 residues) proteins, comprising 4 alpha, 7 beta, and 7 alpha/beta proteins. For all the proteins we considered, the energy values tend to increase with the distance from the native structure,
with native-like decoys having low energies and decoys with a large
RMSD having higher energies (a typical result for each set is shown
in Figure 2).
Figure 2
Scoring decoys
generated by modifications of the native structures.
Left panel shows one typical example protein in the 3DRobot decoy
set. Right panel shows one typical example protein in the normal mode
decoy set. Red square is the native structure, and blue dots are decoys.
In all proteins of both decoy sets, the energy increases with the
distance from the native.
Scoring Structure Predictions
Another way in which
we tested the quality of the NNEF is to score predictions generated
in the CASP14 contest. This is a more significant challenge than scoring
the decoy sets generated by modifications of native structures because
such predicted structures cover more diverse conformations and are
optimized according to some other scoring functions. Each of the 97
protein domains in CASP14 has about 200–500 model predictions
submitted by different groups. We evaluated the NNEF for each of these
predictions and measured its correlation with the CASP14 Global Distance
Test Total Score (GDT_TS, a measure of the overall quality of the prediction). For about 70% of the sets, we obtain a Pearson
correlation coefficient |ρ| > 0.75 (Figure 3A,B). In the remaining cases (many of which
are proteins that belong to complexes and whose native tertiary structures could depend on the environment), the energy of
the best CASP prediction is always near the global minimum of the
evaluated NNEF for the whole set. In other words, the energy function
appears to be quite successful in detecting good configurations but
could be fooled by wrong but reasonable configurations (Figure 3C,D).
Figure 3
Scoring structure predictions
in CASP14. Panel (A) shows the histogram
of Pearson correlation coefficients ρ between the energy score
and CASP GDT_TS score for proteins in CASP14. About 70% of proteins
have Pearson correlation coefficients |ρ| > 0.75. Panels B, C, and D show three particular cases. 3D structures of some decoys
are shown with indicators of their positions in the plot of energy
vs GDT_TS. Panel (B) shows an example of good correlation between
the NNEF energy and CASP GDT_TS score. Panel (C) shows an example
of a protein with a simple alpha-helical fold. In
this case, some models with non-native helix organizations have comparable
energies to native-like models. Panel (D) shows an example of a protein
involved in a complex context. Some models with wrong folds have energies
lower than those of the native-like models.
Evaluating the Energy of Configurations within a MD Trajectory
The results above suggest that the learned energy function can generalize to non-native configurations, despite being trained only with native structures. To further explore this feature, we use
the NNEF to score a conformation ensemble of a small protein (Fip35)
sampled over a 100-microsecond long MD trajectory.[3] The simulation was performed at 395 K using the Amber99SB
force field,[45] and the trajectory is publicly
available for download. The temperature of the simulation approximates
the protein’s in silico melting temperature; therefore, Fip35
explores regions of the phase space far from the native configuration
and undergoes multiple folding and unfolding events within the simulated
time window. We computed the RMSD along the original MD trajectory
and compared it with the energy evaluated with the NNEF. As shown
in Figure 4, our results show that the folded states have low energy scores, while the unfolded states generally have higher energy scores. This indicates that the NNEF is able to distinguish out-of-equilibrium configurations of proteins from configurations that are more native-like.
Figure 4
Evaluating
the energy of configurations within a MD trajectory
for a small protein Fip35. The protein undergoes multiple folding
and unfolding events. The RMSD and the energy along the MD trajectory
correlate well, suggesting that the energy function can generalize
to non-native conformations.
Performing MD Simulations
The energy function can be
interpreted as a Hamiltonian function and can be used to study protein
dynamics with an implicit solvent (Langevin dynamics). In the dynamics,
the Cartesian coordinates (q⃗i) of the residue beads are updated at each step: q⃗i(t + 1) = q⃗i(t) − α·∇iE + βΓ(t), where q⃗i is the position vector of the i-th residue, E is the NNEF of the protein at time t, and Γ(t) is Gaussian noise with zero mean and unit variance. The coefficients α and β are related
to the physical friction coefficients, the integration time step,
the temperature of the system, and the physical units of energy values.
With a proper choice of their values (see Methods), we can observe that the dynamics generated using this method produce
fluctuations consistent with those obtained with classical MD based
on force fields. We generated a 30,000-step Langevin dynamics trajectory for each protein in a test set comprising 18 small proteins (using for
all simulations the same pair of values α and β) and compared
such trajectories to simulations obtained with the amber14SB force field[5] using OpenMM.[46] We can observe that for most proteins
in the sample, the two dynamics produce RMSFs that correlate well across the whole sequence (see Figures 5 and S4), and their difference
remains below 0.5 Å. A complete assessment of the validity of
the NNEF as a force field requires and deserves a much more exhaustive
analysis, which is out of the scope of this article. However, these
preliminary results indicate that the NNEF correctly describes a protein's behavior when it is excited by thermal fluctuations.
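For reference, the per-residue RMSF used in this comparison can be computed as in the generic numpy sketch below, which assumes the trajectory frames have already been superposed on a common reference:

```python
import numpy as np

def rmsf(traj):
    """Per-residue root-mean-square fluctuation.
    traj: (n_frames, n_residues, 3) array of aligned bead coordinates."""
    mean = traj.mean(axis=0)                        # average structure
    return np.sqrt(((traj - mean) ** 2).sum(-1).mean(0))

# Toy example: isotropic 0.3 A-per-axis noise gives RMSF of about 0.52 A.
traj = np.random.normal(scale=0.3, size=(1000, 20, 3))
print(rmsf(traj))
```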
Figure 5
RMSFs resulting from Langevin dynamics simulations using the NNEF.
We compare the RMSFs of the trajectories from the NNEF (blue squares)
with the RMSFs of the trajectories from classical MD simulations obtained
with the amber14SB force field (green dots). Two examples
with high correlation are shown here. For most proteins in the sample,
the two dynamics produce RMSFs in good correlation across the whole
sequence (see also Figure S5).
Importance of the Sequence in the Energy Function
The
next question we want to address is whether the neural network has
learned the chemical differences between amino acids and not only
to evaluate local structural patterns common to any generic protein
chain. This is not a given because it is possible to construct protein
models that correctly describe secondary structures without any sequence
information.[47] To investigate this point,
we evaluated the energy of various “decoy” sequences
assigned to the same 3D structure (100 different protein structures
were used as a test). The decoy sequences were generated in four different
ways: (A) substituting residues with chemically similar residues,
(B) shuffling the residues in the sequence, (C) mutating residues
to random ones, and (D) mutating all residues to the same residue
type. In most cases, the mutated sequences’ energies are higher
than the correct one, and random mutations are worse than mutations
to amino acids with similar chemical properties (Figure 6). However, the network appears
to prefer sequences rich in alanine as almost all poly-alanine decoys
have better energy than the natural sequence (Figure S3). To a lesser extent, this happens for poly-leucine and poly-histidine decoys, but not for the other amino acids.
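Here is a minimal sketch of the four decoy-generation schemes; the chemical grouping used for the similar-residue substitutions is our own illustrative choice, not necessarily the mapping used in the study:

```python
import random

# Assumed chemical grouping for "similar" substitutions (illustrative).
GROUPS = ["AVLIMFWC", "STNQYGP", "DEKRH"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def similar(seq):
    """(A) Replace each residue by one from the same chemical group."""
    return "".join(random.choice(next(g for g in GROUPS if aa in g)) for aa in seq)

def shuffled(seq):
    """(B) Shuffle the residues of the sequence."""
    s = list(seq)
    random.shuffle(s)
    return "".join(s)

def randomized(seq):
    """(C) Mutate every residue to a random amino acid."""
    return "".join(random.choice(ALPHABET) for _ in seq)

def homopolymer(seq, aa="A"):
    """(D) Mutate all residues to the same type (e.g., poly-alanine)."""
    return aa * len(seq)

native = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
for make in (similar, shuffled, randomized, homopolymer):
    print(make.__name__, make(native))
```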
Figure 6
Distributions
of energy differences E(decoy) − E(native) for three sequence decoy datasets. The decoy sequences were generated in three
different ways: (1) substituting residues with chemically similar
residues (green histogram), (2) shuffling the residues in the sequence
(yellow histogram), and (3) mutating residues to random ones (blue
histogram). The energies of the mutated sequences are higher than
those of the native ones, and random mutations produce higher energies
than mutations to similar amino acids.
Protein Sequence Design
To further explore the interplay
between the sequence and structure in the NNEF, we redesigned the
sequences of the 18 proteins in the test sample starting from the
3D conformation of their backbone. Given a random sequence, we run
simulated annealing in the sequence space and attempt to minimize
the energy while keeping the reference structure fixed. In this way,
we expect to obtain a sequence that will eventually fold to the desired
3D configuration. At each step of the annealing process, we propose
a random point mutation for the protein sequence. The mutation is
accepted or rejected according to the Metropolis criterion. In most
cases, the simulations converge to sequences having energies lower
than the native after a few thousand steps. We designed 100 sequences
for each target protein within the mentioned test sample. Amino acid
frequencies for these 1800 designed sequences are shown in Figure 7A. As we can observe,
the annealing process converges to sequences that favor ten amino
acids (Ala, Val, Leu, Gly, Pro, Ser, Thr, Arg, Glu, and Asp). It is
worth noting that these amino acids have relatively high frequencies
in natural protein sequences and could be the first that joined biological
proteins early in evolution, according to the theory on the temporal
order of amino acids in evolution.[48] Overall,
the average sequence recovery fraction is about 25% for the total
sample of 1800 designed sequences. We also analyzed the differences
between the core and surface residues of the designed sequences (Figure 7B). Core residues
are mostly hydrophobic (Ala, Val, and Leu), while the polar residues
are more likely to be exposed to the solvent. Moreover, an analysis
of the 100 sequences designed for each structure reveals that core
residues are more conserved than surface residues. Finally, to measure
the reliability of the protein design method, we randomly selected
two of the low energy designed sequences for each protein and predicted
their structures using TrRosetta.[25] As Figures 7C and S5 show, the predicted structures match the target structures well in two-thirds of the cases.
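A minimal sketch of this annealing loop is shown below; `energy` stands in for the NNEF evaluated on the fixed backbone, and the geometric cooling schedule is our own assumption:

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def design(energy, length, n_steps=5000, t_start=1.0, t_end=0.05):
    """Simulated annealing in sequence space with Metropolis acceptance.
    `energy(seq)` stands in for the NNEF scored on the fixed backbone."""
    seq = [random.choice(ALPHABET) for _ in range(length)]
    e = energy(seq)
    for step in range(n_steps):
        temp = t_start * (t_end / t_start) ** (step / n_steps)  # cooling
        pos = random.randrange(length)
        old, seq[pos] = seq[pos], random.choice(ALPHABET)  # point mutation
        e_new = energy(seq)
        if e_new < e or random.random() < math.exp((e - e_new) / temp):
            e = e_new                       # Metropolis: accept
        else:
            seq[pos] = old                  # Metropolis: reject, revert
    return "".join(seq), e

# Toy energy favouring a hydrophobic/polar alternation pattern.
toy = lambda s: sum((aa in "AVLIMFWC") != (i % 2 == 0) for i, aa in enumerate(s))
print(design(toy, 30))
```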
Figure 7
Results of
the protein sequence design given the backbone structure.
(A) Amino acid frequencies of the 1800 designed sequences and the
native sequences for the test sample of 18 proteins (100 sequences
for each protein). Designed sequences show a preference for ten amino
acids (Ala, Val, Leu, Gly, Pro, Ser, Thr, Arg, Glu, and Asp). (B)
Amino acid frequencies of the residues in the core and surface of
the designed proteins. Core residues are mostly hydrophobic (Ala,
Val, and Leu), while the polar residues are mainly on the surface.
(C) A few examples of the target structures and predicted structures
using TrRosetta. For about two-thirds of the designed
sequences, the predicted structures can match the target structures.
Discussion and Conclusions
In this
study, we used unsupervised deep learning methods to derive
a statistical potential that describes amino acid interactions within
proteins. We represented the protein as a collection of building blocks
comprising nearby residues in 3D space. The underlying ansatz
is that evolution extensively explored the configuration space of
such building blocks, and, for this reason, known experimental structures
contain a complete and nonredundant set of building blocks that can
be used to train the neural network correctly. In this way, we can
get a potential that well approximates protein physics, starting from
a small number of PDB structures, even though the network is defined
by a large set of parameters. The strength of our methodology largely
depends on the validity of our starting hypothesis. The fact that
we can obtain reasonable results confirms our intuition a posteriori.

By substituting the energy function with a neural network, we are
not explicitly fixing its functional form. Therefore, we can keep
track of nonobvious correlations and complex multibody interactions
and hopefully overcome some of the limitations of classical potentials,
which are usually defined by a few pairwise interaction parameters.
Indeed, the protein is not represented only by the Cartesian positions of its components (in our case, the Cβ atoms) but rather by a series of vectors that encode all the information on how that particular atom (in our case, the whole amino acid) behaves in the environment created by the other components.

For this reason, we are not restrained to a single resolution in
the choice of the degrees of freedom. The protein description could
be fully atomistic, coarse-grained, or mixed without any loss of generality.
The network will likely be able to learn how different resolutions
should merge. It is foreseeable, in the future, to use a neural network
energy function to build models in which some parts have the desired level of detail while other, less interesting, parts are represented at a lower resolution.

Most importantly, the NNEF is general: it can discriminate
decoys, assess qualities of structural models, sample structural conformations,
and design new protein sequences, without being trained for any of
these tasks specifically, indicating that it is possible to learn general physical and chemical properties of proteins from structural data alone.

The methodology we adopted is very flexible, and our current implementation is far from realizing its full potential. We can improve the method
by extensively testing different ways to combine structures and sequences,
testing other representations of the proteins and building blocks, and improving training methods and loss functions. It is also possible
to refine the energy function parameters after the unsupervised training
by applying supervised fitting to the various tasks we are interested
in. Furthermore, the constantly increasing availability of experimental data could be sufficient to improve future versions of the NNEF.
Methods
Structure and Sequence Dataset
In order to successfully train the network, we want to curate a nonredundant sample of structures and associate each structure with a family of homologous sequences.
To reduce the redundancy in the structures, we use a sample of PDB chains in a version of the PISCES CulledPDB,[49] in which the percentage identity cutoff
is 50%, the resolution cutoff is 3.0 Å, the R-factor cutoff is
1.0, and the total number of chains is about 29,000. In our work,
only structures solved by X-ray crystallography are included; however,
it is possible to augment the structural part of the data by including
also high-accuracy structure predictions.[27] Then, all the chains are matched to the HH-suite PDB70 database[50] to get the aligned sequence
data. The number of matched chains is about 19,000. We filter the
aligned sequences using hhfilter, requiring <50% sequence identity to the PDB sequence and >50% sequence coverage. After filtering, we generate homologous structures by simply
mapping the aligned sequences to the coordinates of the structure.
Given a sequence A in a PDB chain and an aligned
homologous sequence B, we ignore insertions in sequence B and substitute
gaps in sequence B with the aligned sites in sequence A. In this way,
the resulting chimeric sequence has the same length as sequence A
and can be mapped to the coordinates of A.

To test the transferability of the energy function, we use a radical partition of the dataset.
All chains are matched to the structural classifications in the CATH
4.2 database.[38] Each chain can include
more than one CATH domain. A chain is classified as one class (e.g., alpha/beta) if all the CATH domains in the chain are classified
as that class (alpha/beta). The training data include
only the alpha/beta chains. We obtain about 7500 alpha/beta chains and use 7000 chains as the training dataset
and 500 as the validation dataset. The trained energy function is
tested on all protein classes, including alpha/beta, mainly alpha, and mainly beta proteins.
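As a sketch of the insertion and gap handling described above (assuming the alignment is given as two equal-length gapped strings):

```python
def chimeric_sequence(aln_a, aln_b):
    """Map an aligned homologous sequence B onto the structure of A:
    insertions in B (gap columns in A) are ignored, and gaps in B are
    substituted with the aligned site of A, so the result has the same
    length as the ungapped sequence A and maps onto A's coordinates."""
    out = []
    for a, b in zip(aln_a, aln_b):
        if a == "-":            # insertion in B relative to A: skip
            continue
        out.append(a if b == "-" else b)
    return "".join(out)

# Toy alignment: A = MKTAYI (ungapped), B has one insertion and one gap.
print(chimeric_sequence("MK-TAYI", "MLSS-YV"))   # -> MLSAYV
```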
Neural Network Model
After decomposing the proteins
of the training set into building blocks, we characterize the likelihood
of a building block by fitting a probability density function and
setting for each building block X: E(X) = – log P(X). The total energy of a protein Y is the sum of
all building blocks composing Y: E(Y) = ∑E(X). The probability function P(X) is obtained with an autoregressive model and
maximum-likelihood training from a set of N building
blocks {X}. Each building block X can be viewed as a list of k variables x1, x2, ..., xk ordered in a given scheme, and its probability P(X) can be expanded by the chain rule as P(X) = ∏i p(xi | x<i). In the ordering scheme, we separate
the structural and sequence features so that P(X) = p(Structure(X)) · p(Sequence(X) | Structure(X)). With P(X) expanded as a chain
of conditional probability functions, each conditional probability p(xi | x<i) is represented as a neural network
that shares parameters with other conditional probability functions.
Because proteins can be represented as molecular graphs, we use transformer
graph neural networks as the autoregressive model. The neural network
architecture of the autoregressive model is shown in Figure 1E.
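The chain-rule factorization for the discrete features can be sketched as follows; the logits are assumed to come from the causally masked transformer, so that row i depends only on the variables preceding xi:

```python
import torch
import torch.nn.functional as F

def chain_log_prob(logits, targets):
    """Chain-rule log-likelihood log P(X) = sum_i log p(x_i | x_<i)
    for discrete variables.
    logits:  (k, n_classes), row i produced by a causally masked
             network that sees only x_1 ... x_{i-1}
    targets: (k,) integer codes of the k ordered variables"""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, targets.unsqueeze(-1)).sum()

# Toy example: 5 ordered discrete features with 20 classes each.
logits = torch.randn(5, 20)
targets = torch.randint(0, 20, (5,))
print(-chain_log_prob(logits, targets))   # E(X) = -log P(X)
```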
Ordering Scheme in the
Autoregressive Model
In the
autoregressive model, the structural and sequence feature variables
are ordered according to the following rule, schematically shown in Figure S1. Each building block X can be viewed as a graph with the central bead a as the root node. The central segment is visited first with the
order (a, a-1, a+1, a-2, a+2). Then, the other segments
are visited in the order of increasing distance to the central bead.
Within each surrounding segment, the beads are visited in the order
of increasing primary sequence numbers.
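Here is a sketch of this ordering rule; we interpret the distance of a segment as the distance of its closest bead to the central one, which is an assumption:

```python
import numpy as np

def visit_order(indices, center, coords):
    """Order the beads of a building block for the autoregressive model:
    central segment as (a, a-1, a+1, a-2, a+2), then the remaining
    contiguous segments by increasing distance to the central bead,
    beads within each segment by increasing sequence number."""
    central = [i for i in (center, center - 1, center + 1,
                           center - 2, center + 2) if i in indices]
    rest = sorted(set(indices) - set(central))
    segments = []
    for i in rest:                              # group contiguous runs
        if segments and i == segments[-1][-1] + 1:
            segments[-1].append(i)
        else:
            segments.append([i])
    key = lambda seg: min(np.linalg.norm(coords[i] - coords[center]) for i in seg)
    return central + [i for seg in sorted(segments, key=key) for i in seg]

coords = np.random.rand(60, 3) * 30.0
block = [28, 29, 30, 31, 32, 5, 6, 7, 10, 11, 12, 45, 46, 47, 48]
print(visit_order(block, 30, coords))
```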
Input Features
In our model, the features of a building block include the residue types, Cartesian coordinates, and bond connections of the 15 amino acids belonging to the block. To make the energies invariant under rotations and translations, the coordinates of the Cβ atoms are rotated to the same local
internal coordinate system of the central residue (Figure S2). The coordinate system is defined so that the residue a-1 is on the X-axis, and the residues a-1 and a+1 are on the X-Y plane. The Cartesian coordinates of all beads
are then converted to polar coordinates. The bond connections, residue
types, and positional labels 1–15 are converted to high dimensional
vectors through look-up tables. The components of each vector are
learned by the neural network and encode information about sequence
and structural features of the building block. After this step, the
coordinates, bond connections, and positional labels are concatenated
as structural features, while residue types and positional labels
are concatenated as sequence features.
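A numpy sketch of this frame construction and polar conversion, using a Gram-Schmidt step on the a−1 and a+1 directions; placing the origin on the central bead is our assumption:

```python
import numpy as np

def to_local_frame(coords, a):
    """Rotate block coordinates into the local frame of central residue a:
    bead a-1 defines the X-axis, and beads a-1 and a+1 span the X-Y
    plane (Gram-Schmidt); the origin is placed on the central bead."""
    origin = coords[a]
    x = coords[a - 1] - origin
    x = x / np.linalg.norm(x)
    v = coords[a + 1] - origin
    y = v - np.dot(v, x) * x               # remove the X component
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)
    R = np.stack([x, y, z])                # rows are the new axes
    return (coords - origin) @ R.T

def to_polar(xyz):
    """Cartesian -> spherical polar coordinates (r, theta, phi)."""
    r = np.linalg.norm(xyz, axis=-1)
    theta = np.arccos(np.clip(xyz[..., 2] / np.maximum(r, 1e-9), -1.0, 1.0))
    phi = np.arctan2(xyz[..., 1], xyz[..., 0])
    return np.stack([r, theta, phi], axis=-1)

coords = np.random.rand(15, 3) * 20.0      # toy block, central bead at index 7
print(to_polar(to_local_frame(coords, a=7))[:3])
```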
Transformer Encoder and
Decoder
The encoder for structural
features has four standard transformer encoder layers. The outputs
after two encoder layers are used as the latent codes of the structural
features and passed to the decoder. The decoder for sequence features
has four standard transformer decoder layers. Both the encoder and
decoder layers use the standard transformer layer.[51] In the decoder, structure–sequence attention is
used because we decompose the probability as P(X) = p(Structure(X)) · p(Sequence(X) | Structure(X)). The attention in both the encoder and decoder is masked by position-based
causal masks, that is, each position can only pay attention to positions
before it, so that the network cannot know the answers by looking
at the whole input data.
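Such a mask can be built as in the sketch below; whether a position may also attend to itself depends on how the inputs are shifted, so the diagonal choice here is an assumption:

```python
import torch

def causal_mask(k):
    """Boolean attention mask where True marks blocked positions:
    position i may attend only to positions up to i, never ahead,
    so the network cannot see the variable it must predict."""
    return torch.triu(torch.ones(k, k, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
# Passed as `attn_mask` to torch.nn.MultiheadAttention-style layers.
```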
Neural Network Training
We use the softmax function to get the probabilities for the predictions
of the discrete
labels, such as the bond connections and the residue types. For the
predictions of the continuous variables, such as the radius and angles
in the coordinates, we calculate the probabilities using sums of Gaussian
functions: p(x) = ∑i ci G(x, μi, σi), where G(x, μ, σ) is a Gaussian function of the variable x with mean μ and standard deviation σ, and the ci, μi, σi are outputs of the neural network. Thus,
the network calculates the conditional probabilities of the next residue
coordinate, given the coordinates of previous residues. It also calculates
the conditional probabilities of the next residue type, given previous
residue types and the coordinates of all residues. After getting the
conditional probabilities, we calculate the energy terms as the negative
log of the probabilities. The energy of the building block is the
weighted sum of all the energy terms. We train the network to minimize
the energies, that is, the loss function is simply the energy values.
We use minibatch training, the Adam optimizer[52] with starting learning rate 5 × 10−5 and betas = (0.9, 0.99), and L2 regularization with weight 10−6.
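Here is a sketch of this mixture-density term; parameterizing σ through its logarithm for positivity is our implementation choice:

```python
import math
import torch
import torch.nn.functional as F

def mixture_nll(x, logits, mu, log_sigma):
    """Energy term -log p(x) for a continuous feature under a Gaussian
    mixture p(x) = sum_i c_i G(x; mu_i, sigma_i), where the weights
    c_i = softmax(logits) and mu_i, sigma_i are network outputs."""
    log_c = F.log_softmax(logits, dim=-1)
    log_g = (-0.5 * ((x.unsqueeze(-1) - mu) / log_sigma.exp()) ** 2
             - log_sigma - 0.5 * math.log(2.0 * math.pi))
    return -torch.logsumexp(log_c + log_g, dim=-1)

# Toy example: one radius value under a 3-component mixture.
x = torch.tensor([5.2])
print(mixture_nll(x, torch.zeros(3), torch.tensor([4.0, 5.0, 6.0]), torch.zeros(3)))
```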
Langevin Dynamics
As shown in the Results section, we can use the NNEF to run Langevin dynamics according to the following equation: q⃗i(t + 1) = q⃗i(t) − α·∇iE + βΓ(t), where q⃗i is the position vector of the i-th residue, E is the NNEF of the protein at time t (with ∇i the gradient with respect to q⃗i), Γ(t) is Gaussian noise with zero mean and unit variance, and α and β are physical
parameters of the simulation. To decide proper values for these two
parameters, we run a grid search using short simulations (α = [1e-3, 3e-3, 5e-3, 7e-3, 0.01, 0.015, 0.02, 0.04], β = [0.01, 0.03, 0.06, 0.1, 0.15]). When α and β are small, the dynamics
are very slow, and the protein’s residues are almost locked
in the initial position. When β is large, the protein unfolds quickly after a small number of time steps. We decided to use fixed values in the middle of the ranges (α = 0.01 and β = 0.05) to run the simulations.
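A minimal PyTorch sketch of this update follows; `energy_fn` stands in for the trained NNEF, and the harmonic toy energy is purely illustrative:

```python
import torch

def langevin_step(q, energy_fn, alpha=0.01, beta=0.05):
    """One step of q(t+1) = q(t) - alpha * grad E + beta * Gamma(t),
    with Gamma(t) zero-mean, unit-variance Gaussian noise.
    q: (N, 3) residue bead coordinates; energy_fn: scalar energy of q."""
    q = q.detach().requires_grad_(True)
    grad = torch.autograd.grad(energy_fn(q), q)[0]
    return (q - alpha * grad + beta * torch.randn_like(q)).detach()

# Toy run with a harmonic tether of every bead to the origin.
energy = lambda q: 0.5 * (q ** 2).sum()
q = torch.randn(20, 3)
for _ in range(1000):
    q = langevin_step(q, energy)
print(q.norm(dim=-1).mean())   # fluctuates around a small equilibrium value
```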
Small Protein Test Set
For testing sequence reconstruction
and Langevin dynamics, we used a test set of 18 small proteins corresponding
to the following PDB IDs (the underscore identifies the protein chain):
1ZZK_A, 2MCM_A, 2VIM_A, 3FBL_A, 3IPZ_A, 3KXT_A, 3NGP_A, 3P0C_A, 3SNY_A,
3SOL_A, 3VI6_A, 4M1X_A, 4O7Q_A, 4QRL_A, 5CYB_A, 5JOE_A, 5ZGM_A, 6H8O_A.