
Accurate Machine Learning Prediction of Protein Circular Dichroism Spectra with Embedded Density Descriptors.

Luyuan Zhao¹, Jinxiao Zhang², Yaolong Zhang¹, Sheng Ye³, Guozhen Zhang¹, Xin Chen⁴, Bin Jiang¹, Jun Jiang¹.

Abstract

A data-driven approach to simulate circular dichroism (CD) spectra is appealing for fast protein secondary structure determination, yet the challenge of predicting electric and magnetic transition dipole moments poses a substantial barrier for the goal. To address this problem, we designed a new machine learning (ML) protocol in which ordinary pure geometry-based descriptors are replaced with alternative embedded density descriptors and electric and magnetic transition dipole moments are successfully predicted with an accuracy comparable to first-principle calculation. The ML model is able to not only simulate protein CD spectra nearly 4 orders of magnitude faster than conventional first-principle simulation but also obtain CD spectra in good agreement with experiments. Finally, we predicted a series of CD spectra of the Trp-cage protein associated with continuous changes of protein configuration along its folding path, showing the potential of our ML model for supporting real-time CD spectroscopy study of protein dynamics.

Entities: Chemical

Year: 2021 PMID: 34977905 PMCID: PMC8715543 DOI: 10.1021/jacsau.1c00449

Source DB: PubMed Journal: JACS Au ISSN: 2691-3704

Introduction

Protein structure determination is crucial for understanding many biological functions.[1−3] In particular, tracking protein structural variation in real time is desirable for exploring underlying mechanisms.[4−6] Very recently, AlphaFold, the artificial intelligence (AI) program developed by DeepMind, has achieved accurate prediction of the folded structure of a protein from its primary structure, i.e., its amino acid sequence.[7] Despite this tremendous progress in protein tertiary structure prediction, little is known about the path along which a protein evolves from one configuration to another. Yet structural information on protein dynamics is the key to understanding how proteins function and how their function can be modulated. Spectroscopic techniques may shed light on protein dynamics by directly probing protein structures and couplings along the dynamic process.[8−12] Among the many spectroscopic techniques, electronic circular dichroism (CD) offers ease of operation and high sensitivity to subtle structural changes.[13,14] Aided by an advanced light source, it could be a powerful probe for real-time protein structure determination. To interpret structural variations in real time, spectroscopic measurement must be coupled with a rapid means of theoretical interpretation. However, the huge computational cost of accurately simulating protein spectra at the quantum chemistry (QC) level has long been an obstacle to this goal. Rapidly developing machine learning (ML) techniques, which have been successfully applied in the physical and biological sciences to capture complicated structure–property relationships, offer an opportunity to address this longstanding problem.
Recently, ML has been applied in various spectroscopic simulations,[15−17] as well as in protein structure prediction.[7,18,19] Along this track, we have also developed ML tools for predicting protein infrared (IR) and ultraviolet (UV) spectra.[20−22] Proteins with different secondary structure profiles have distinctive signatures in CD spectra, making CD useful for studying protein dynamics such as folding and binding events.[13,23−25] The CD spectrum in the far UV is governed by the energies of electronic transitions, to which the peptide bonds contribute most. Two key parameters, the electric and magnetic transition dipole moments of peptide bonds, are necessary to simulate the CD spectra of different secondary structures.[26−28] However, accurate prediction of these two physical quantities by ML methods is difficult for two reasons: (1) they are vectors with multiple coordinate-dependent components, which are covariant with rotation or twisting of the system, and (2) their directions are essentially determined by the corresponding types of electronic transitions, which are not well described by regular ML models and descriptors extracted directly from structural parameters. This difficulty can be overcome by the recently proposed tensorial embedded atom neural network (EANN) model.[29,30] Accordingly, we have developed a set of embedded density descriptors to learn tensorial properties in a fully symmetry-adapted way. This descriptor treats each atom as an impurity embedded in the electron gas generated by the surrounding atoms and is constructed from the square of the linear combination of atomic orbitals of adjacent atoms. Consequently, the resulting descriptor is invariant with respect to overall translation, rotation, and permutation of identical atoms.
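The invariance claimed for the embedded density descriptor can be checked numerically. The sketch below is a simplified illustration and not the authors' implementation: it assumes Gaussian-type orbitals of the form x^lx · y^ly · z^lz · exp(−α(r − r_s)²), omits any cutoff function, and uses hypothetical values for the coefficients `coeffs`, `alpha`, and `r_s`.

```python
import math
from itertools import product

def embedded_density(center, neighbors, coeffs, L, alpha=1.0, r_s=0.0):
    """Density-like feature of angular order L for one central atom.

    neighbors: list of (x, y, z) positions; coeffs: one weight per neighbor.
    The squared linear combination is summed over all (lx, ly, lz) with
    lx + ly + lz = L, weighted by the multinomial factor L!/(lx! ly! lz!).
    """
    rho = 0.0
    for lx, ly, lz in product(range(L + 1), repeat=3):
        if lx + ly + lz != L:
            continue
        pref = math.factorial(L) / (
            math.factorial(lx) * math.factorial(ly) * math.factorial(lz)
        )
        s = 0.0
        for (x, y, z), c in zip(neighbors, coeffs):
            dx, dy, dz = x - center[0], y - center[1], z - center[2]
            r = math.sqrt(dx * dx + dy * dy + dz * dz)
            radial = math.exp(-alpha * (r - r_s) ** 2)
            s += c * dx**lx * dy**ly * dz**lz * radial
        rho += pref * s * s
    return rho

def rotate_z(p, angle):
    """Rotate a point about the z axis."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

# Hypothetical local environment: one central atom and three neighbors.
center = (0.1, -0.2, 0.3)
neighbors = [(1.0, 0.2, 0.0), (-0.4, 1.1, 0.5), (0.3, -0.9, 1.2)]
coeffs = [0.8, -0.5, 1.3]

original = [embedded_density(center, neighbors, coeffs, L) for L in (0, 1, 2)]
rotated = [
    embedded_density(rotate_z(center, 0.9),
                     [rotate_z(p, 0.9) for p in neighbors], coeffs, L)
    for L in (0, 1, 2)
]
```

Because the sum over (lx, ly, lz) collapses to pair terms that depend only on dot products between neighbor vectors, `original` and `rotated` agree to numerical precision; shifting all atoms by the same vector or reordering the neighbors together with their coefficients leaves the values unchanged as well.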
Multiplying the virtual ML output obtained from the embedded density descriptor by the atomic coordinate vectors yields symmetry-preserving tensors for describing the electric dipole moments of the N-methylacetamide (NMA) molecule.[29] In this study, we have constructed an ML protocol based on novel embedded density descriptors to predict the CD spectra of proteins with an accuracy comparable to that of density functional theory (DFT) calculations at a significantly lower computational cost. The embedded density descriptors can learn both the electric and magnetic transition dipole moments of peptide bonds well, which integrate complex information on molecular chirality, atomic structure, electron and spin density, wave function transition, and so on. The CD spectra simulated by our ML model not only are in good agreement with experimental results but also help to distinguish proteins with different secondary structure profiles. Moreover, we have successfully mapped the folding process of a protein with simulated CD spectra, showing the potential of our ML protocol for facilitating real-time observation of protein dynamics using CD spectra in the future.

Theory and Computation Detail

Proteins are composed of peptide bonds and amino acid residues (Figure 1a), and most CD responses in the far-UV region come from two electronic excitations of the peptide bond: the n → π* transition around 220 nm and the π → π* transition around 190 nm (Figure 1b). The exciton model Hamiltonian can be constructed based on the Frenkel exciton model.[31−33] It is necessary to calculate the excitation energy ε of the peptide bonds and the resonance coupling J between excited states within the dipole approximation,[34,35] where m (n) runs over peptide bonds, a and b denote the n → π* or π → π* transitions, ε0 denotes the excitation energy of the isolated peptide bond, μT and μG denote the electric transition dipole moment of the peptide bond and the ground state dipole moment of the surrounding amino acid residues k, respectively, and r denotes the distance vector between m and n. In addition, the magnetic transition dipole moment μM of peptide bonds is needed for calculating the rotatory strength (R = |μT|·|μM|·cos θ).
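The displayed expressions for the Hamiltonian, ε, and J are not reproduced in this text. As a reading aid only, the block below writes out one plausible form that is consistent with the symbol definitions above and with the standard point-dipole approximation. The superscript labels T, G, and M (electric transition, ground state, magnetic transition), the exciton operators B, and the bare 1/(4πϵ₀) prefactor are assumptions; the original may include, for example, a dielectric screening factor.

```latex
\begin{align}
\hat{H} &= \sum_{m,a} \varepsilon_{ma}\,\hat{B}^{\dagger}_{ma}\hat{B}_{ma}
         + \sum_{ma \neq nb} J_{ma,nb}\,\hat{B}^{\dagger}_{ma}\hat{B}_{nb} \\
\varepsilon_{ma} &= \varepsilon^{0}_{ma}
         + \sum_{k} \frac{1}{4\pi\epsilon_{0}}
           \left[ \frac{\boldsymbol{\mu}^{T}_{ma}\cdot\boldsymbol{\mu}^{G}_{k}}{|\mathbf{r}_{mk}|^{3}}
                - 3\,\frac{(\boldsymbol{\mu}^{T}_{ma}\cdot\mathbf{r}_{mk})
                           (\boldsymbol{\mu}^{G}_{k}\cdot\mathbf{r}_{mk})}{|\mathbf{r}_{mk}|^{5}} \right] \\
J_{ma,nb} &= \frac{1}{4\pi\epsilon_{0}}
           \left[ \frac{\boldsymbol{\mu}^{T}_{ma}\cdot\boldsymbol{\mu}^{T}_{nb}}{|\mathbf{r}_{mn}|^{3}}
                - 3\,\frac{(\boldsymbol{\mu}^{T}_{ma}\cdot\mathbf{r}_{mn})
                           (\boldsymbol{\mu}^{T}_{nb}\cdot\mathbf{r}_{mn})}{|\mathbf{r}_{mn}|^{5}} \right] \\
R_{ma} &= |\boldsymbol{\mu}^{T}_{ma}|\,|\boldsymbol{\mu}^{M}_{ma}|\cos\theta
\end{align}
```

In this form the diagonal elements carry the isolated-bond excitation energy shifted by the electrostatic field of the surrounding residues, and the off-diagonal elements couple transitions on different peptide bonds through their electric transition dipoles.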

Figure 1

(a) NMA structure and protein structure. (b) Valence molecular orbitals and the two electronic transitions of the peptide bond, n → π* and π → π*. (c) Machine learning protocol for predicting protein CD spectra.

The whole ML protocol for protein CD spectra is as follows (Figure 1c). A total of 1000 different proteins are downloaded from the RCSB Protein Data Bank.[36] The proteins are first sliced into peptide bonds and amino acid residues. We then randomly select 50 000 peptide bonds and 200 000 amino acid residues for data preparation. The 200 000 amino acid residues cover 20 kinds of amino acids, each with 10 000 structures. We employ the time-dependent DFT (TDDFT) method at the PBE0/cc-pVDZ level to calculate the excited state properties of peptide bonds (modeled by the N-methylacetamide molecule in Figure 1a) and the DFT method at the B3LYP/6-311++G** level to calculate the ground state properties of amino acid residues. All DFT and TDDFT simulations are performed with the Gaussian 16 package.[37] Internal coordinates and converted Cartesian coordinates are chosen as the molecular descriptors for ε0 of the peptide bond and the ground state dipole moment of the amino acid residue, respectively, while the embedded density descriptors[29] are used to better represent the electric and magnetic transition dipole moments of the peptide bond, as discussed below. Starting from the DFT/TDDFT data sets, we train the ML models to build the correlation between the descriptors and our prediction targets. For ε0 of peptide bonds and the ground state dipole moments of amino acid residues, we use a neural network model with three hidden layers (32, 64, and 128 neurons, respectively). The rectified linear unit activation function is used for each hidden layer to mitigate vanishing gradients and reduce the influence of noise.[38] L2 regularization is employed to limit overfitting.[39] In addition, we use the Adam optimizer[40] to help avoid local minima during NN training, and we adjust the learning rate every 500 steps, starting from an initial learning rate of 0.001. The electric and magnetic transition dipole moments of peptide bonds are predicted using embedded atom neural networks (EANN) with atom-wise embedded density descriptors.[29] Note that 36 descriptors are used as the input representing the local environment of each atom of the NMA molecule, and each atomic neural network consists of two hidden layers (30 neurons in each layer). With these parameters predicted by the ML model, ε and J can be calculated within the dipole approximation, yielding the effective exciton Hamiltonian, and the electric and magnetic transition dipole moments provide the rotatory strength for the CD spectrum. The SPECTRON[41] program is used to diagonalize the Hamiltonian and finally output the CD spectrum of the selected protein (details in the Supporting Information).
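The last step, building an exciton Hamiltonian from predicted parameters and diagonalizing it, can be illustrated on the smallest possible case. The sketch below is not the SPECTRON workflow: it assumes a point-dipole coupling in reduced units (the Coulomb prefactor is set to 1), a hypothetical pair of identical transitions, and the closed-form eigenvalues of a 2 × 2 symmetric matrix.

```python
import math

def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def dipole_coupling(mu1, mu2, r):
    """Point-dipole interaction in reduced units (prefactor set to 1)."""
    d = math.sqrt(dot(r, r))
    return dot(mu1, mu2) / d**3 - 3.0 * dot(mu1, r) * dot(mu2, r) / d**5

def two_site_levels(e1, e2, j):
    """Eigenvalues of the matrix [[e1, j], [j, e2]], lowest first."""
    mean = 0.5 * (e1 + e2)
    half_gap = math.sqrt((0.5 * (e1 - e2)) ** 2 + j * j)
    return (mean - half_gap, mean + half_gap)

# Hypothetical example: two identical transitions with parallel dipoles
# (along x) separated by 2 length units along z.
mu_a = (1.0, 0.0, 0.0)
mu_b = (1.0, 0.0, 0.0)
r_ab = (0.0, 0.0, 2.0)

j_side_by_side = dipole_coupling(mu_a, mu_b, r_ab)
levels = two_site_levels(6.5, 6.5, j_side_by_side)

# Head-to-tail arrangement: both dipoles along the separation vector.
j_head_to_tail = dipole_coupling((0.0, 0.0, 1.0), (0.0, 0.0, 1.0), r_ab)
```

The side-by-side pair gives a positive coupling and the head-to-tail pair a negative one, which is why the relative orientation of peptide bonds in different secondary structures changes the exciton splitting. A real protein involves one such matrix element per pair of transitions, so a numerical eigensolver replaces the closed form.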

Results and Discussion

The accuracy and robustness of the ML predictions are examined with the Pearson correlation coefficient (r) and the mean relative error (MRE) (details in the Supporting Information). Internal coordinates are chosen as the molecular descriptor to predict the excitation energies of peptide bonds; each peptide bond has nine internal coordinates. For both the n → π* and π → π* transitions, the ML-predicted ε0 values are in good agreement with the TDDFT-calculated ones (Figure S1b), with high Pearson coefficients (0.9616 and 0.9512) and low MREs (0.363% and 0.252%). Meanwhile, the Cartesian coordinates are reoriented to a common reference coordinate system for the prediction of the ground state dipole moments of amino acid residues. Cartesian coordinates directly encode the structural features, and they proved to be good descriptors for both the magnitude and direction of ground-state dipole moments in our previous work.[21] All correlation coefficients for the ground state dipole moment prediction are greater than 0.98, and most MRE values are below 5% (Figure S1c). Detailed results for ε0 of peptide bonds and the ground state dipole moments of amino acid residues can be found in our previous work.[22] These results show that simple molecular descriptors already give our ML models good accuracy and robustness for these two quantities. For the more challenging electric and magnetic transition dipole moments of peptide bonds, we use the tensorial EANN model[29] based on electron density-like atom-wise descriptors, called embedded density descriptors, which are the squares of a series of linear combinations of Gaussian atomic orbitals from neighboring atoms.[30] As detailed in the SI, the EANN model can fit the direction of these transition dipole moments automatically and preserve their symmetry-covariant properties. To validate this, we compare the predictions of the electric and magnetic transition dipole moments obtained with various molecular descriptors (details in the Supporting Information).
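For reference, the two figures of merit quoted in this section can be computed as in the sketch below. The exact MRE definition is given in the Supporting Information, which is not reproduced here; the form used (mean of |prediction − reference| / |reference|) is an assumption, and the sample numbers are illustrative rather than taken from the paper.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_relative_error(predicted, reference):
    """Mean of |predicted - reference| / |reference| (assumed definition)."""
    return sum(
        abs(p - r) / abs(r) for p, r in zip(predicted, reference)
    ) / len(reference)

# Illustrative values only (e.g., excitation energies in eV).
reference = [5.60, 5.72, 5.81, 5.95]
predicted = [5.62, 5.70, 5.84, 5.93]

r_value = pearson_r(predicted, reference)
mre_percent = 100.0 * mean_relative_error(predicted, reference)
```

The two measures are complementary: r captures whether predictions track the reference trend, while the MRE captures the typical size of the deviation relative to the reference magnitude.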
Figure 2a,b shows the electric and magnetic transition dipole moments predicted using the Coulomb matrix (CM) as the descriptor with a gradient boosting regression (GBR) algorithm. The results for both the n → π* and π → π* transitions are unsatisfactory (as seen from the poor Pearson coefficients), presumably because the three Cartesian components of each dipole moment are treated separately rather than as a whole for a given transition type (Figure S4 and Figure S5). In contrast, Figure 2c,d shows that the TDDFT-calculated electric and magnetic transition dipole moments are well reproduced by EANN in all directions, with high Pearson coefficients (r > 0.95) and low MREs (<1.5%) for both n → π* and π → π* transitions.
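The advantage of treating the vector as a whole can be made concrete. In the sketch below, a vector is assembled by weighting atomic position vectors with rotation-invariant scalars, in the spirit of the multiplication of ML outputs by coordinate vectors described in the Introduction. It is a minimal illustration under stated assumptions, not the authors' tensorial EANN: the weights are hypothetical stand-ins for network outputs, and positions are taken relative to the centroid so that the result does not depend on the choice of origin.

```python
import math

def rotate_z(p, angle):
    """Rotate a point about the z axis."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def covariant_vector(weights, coords):
    """Sum of invariant per-atom scalars times centroid-relative positions."""
    n = len(coords)
    centroid = tuple(sum(p[i] for p in coords) / n for i in range(3))
    return tuple(
        sum(w * (p[i] - centroid[i]) for w, p in zip(weights, coords))
        for i in range(3)
    )

# Hypothetical three-atom fragment and per-atom scalar outputs.
coords = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (1.8, 1.1, 0.3)]
weights = [0.4, -0.7, 0.3]
angle = 0.7

mu = covariant_vector(weights, coords)
mu_from_rotated = covariant_vector(weights, [rotate_z(p, angle) for p in coords])
mu_then_rotated = rotate_z(mu, angle)
```

Here `mu_from_rotated` and `mu_then_rotated` coincide: rotating the molecule rotates the predicted vector by the same amount, with no need for the model to learn each Cartesian component separately.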

Figure 2

ML prediction of the electric and magnetic transition dipole moments of peptide bonds. (a) Correlation plots of the TDDFT and ML predicted electric transition dipole moments of the n → π* and π → π* transitions using CM with GBR. (b) Correlation plots of the TDDFT and ML predicted magnetic transition dipole moments of the n → π* and π → π* transitions using CM with GBR. (c) Same as (a) but using EANN. (d) Same as (b) but using EANN.

To validate the advantage of the ML protocol, we compare DFT-based and ML-based CD spectra of four different types of proteins (Figure S7). The essential matrix elements of the model Hamiltonian, namely ε0 and the ground state, electric transition, and magnetic transition dipole moments, are calculated by DFT and by the ML model, respectively. CD spectra are then generated by diagonalizing the model Hamiltonian using the SPECTRON program. As Table S1 shows, the CD spectra simulated with the two computational protocols agree reasonably well, as indicated by high Spearman rank coefficients (ρ). This coefficient is widely used for measuring the agreement between spectra.[42−45] These results can be further improved in the future by increasing the prediction accuracy of the essential matrix elements of the model Hamiltonian, especially the nondiagonal matrix elements (Figure S8c). Moreover, the ML-based approach is, in most cases, about 4 orders of magnitude faster than the DFT-based one. Next, we compare the spectra simulated by our protocol with the corresponding experimental results. The proteins used in Figure 3 are randomly selected from the RCSB Protein Data Bank[36] and are not included in our training data. The experimental spectra are all from the SMP180 and SP175 data sets of the Protein Circular Dichroism Data Bank (PCDDB).[46,47] The simulated spectra for the different secondary structures are all in good agreement with experiment (Table 1). More importantly, the simulated spectra of different secondary structures have different peak positions and line shapes, which lays the foundation for the interpretation of experimental spectra. More results can be found in the Supporting Information (Figure S10). Overall, the results show that our ML model has high accuracy and good transferability.
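The Spearman rank coefficient used above to compare spectra can be sketched as follows. This is a generic implementation (average ranks for ties, then the Pearson coefficient of the ranks), assuming both spectra are sampled on the same wavelength grid; the intensity values are illustrative and not taken from the paper.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson coefficient of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Illustrative CD intensities on a shared wavelength grid.
experimental = [12.0, 3.5, -6.0, -9.5, -8.0, -2.0]
simulated = [10.0, 4.0, -5.0, -7.0, -9.0, -1.0]

rho = spearman_rho(experimental, simulated)
```

Because only ranks enter, ρ is insensitive to an overall intensity scaling of either spectrum, which suits comparisons where the curves are normalized to a common maximum.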

Figure 3

Experimental (black curves) and ML predicted (red curves) CD spectra of different types of proteins. Intensity is scaled to have the same maximum intensity for each panel.

Table 1

Comparison of the ML Simulated Protein CD Spectra with Experiments in Terms of Spearman Rank Correlation (ρ)

protein	PDB ID	secondary class	number of atoms	ρ
Peroxidase C1A	7ATJ	α	2944	0.74
Pectate lyase C	1AIR	β	2786	0.81
Pyruvate kinase	1A49	α + β	34001	0.79
Cytochrome bc1 complex	1BE3	α + β	16222	0.88
Aspartokinase III	2J0X	α + β	6915	0.88
DNase I	3DNI	α + β	2494	0.75
TraF protein	3JQO	α + β	38842	0.83
Carboxypeptidase A	5CPA	α + β	2753	0.97

To further demonstrate the applicability of our ML model, we run molecular dynamics (MD) simulations of four proteins with different secondary structures (α-helix, β-sheet, and α + β) to take environmental fluctuation into account and average the CD spectra over 1000 MD conformations (details in the Supporting Information). The CD spectra simulated by our ML model are in good agreement with the experimental spectra in terms of overall line shape and main peak position (Figure 4a). However, when environmental fluctuation is included through MD simulation, the ML-based spectra agree less well with experiment than the spectra shown in Figure 3. It is likely that a 2 ns trajectory is insufficient to cover the configuration fluctuation of proteins in an experimental measurement. In contrast, a single-frame structure from a PDB file represents the protein’s average configuration in the crystalline condition, which is a better representative of its real configuration during spectral measurement. Therefore, CD spectra derived from a single-frame structure using our ML model are in better agreement with experimental results than those derived from an MD trajectory (Figure 3). In particular, the simulated CD spectra of different secondary structures show unique characteristics, which help to distinguish one from another. For comparison, the averaged IR spectra of these four proteins, based on 1000 MD conformations, are obtained with the ML model proposed in our previous work.[20] The averaged IR spectra of the two proteins with α + β structure (PDB ID: 2RHE, 5DFR) are very similar (Figure S11). It is thus clear that CD spectra are more sensitive to differences in secondary structure. These results indicate that our ML model can be applied to obtain CD spectra of different structures under environmental fluctuations with high accuracy and good transferability.
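The conformational averaging described above amounts to a point-by-point mean over per-conformation spectra, optionally followed by the intensity scaling used for plotting. The sketch below assumes that every spectrum is sampled on the same wavelength grid and that scaling means dividing by the largest absolute intensity; both the helper names and the sample values are hypothetical.

```python
def average_spectra(spectra):
    """Point-by-point mean of spectra sampled on a common wavelength grid."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]

def scale_to_unit_max(spectrum):
    """Divide by the largest absolute intensity so the peak magnitude is 1."""
    peak = max(abs(v) for v in spectrum)
    return [v / peak for v in spectrum]

# Illustrative spectra for three conformations on a four-point grid.
per_conformation = [
    [4.0, -2.0, -6.0, 1.0],
    [5.0, -1.0, -8.0, 0.0],
    [3.0, -3.0, -10.0, 2.0],
]

mean_spectrum = average_spectra(per_conformation)
scaled_spectrum = scale_to_unit_max(mean_spectrum)
```

Averaging smooths out features that appear in only a few conformations, which is one reason a short trajectory that undersamples the ensemble can shift the averaged line shape away from experiment.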

Figure 4

(a) Experimental (black curves) and ML predicted (red curves) CD spectra. The ML predictions are based on 1000 MD configurations. (b) The ML predicted CD spectra of the Trp-cage protein along its folding path (S1 → S100, S1: the original unfolded structure, S25: slightly folded along with the decrease of coil content, S50: folding faster and helical elements appear, S75: a cage formed with the rapid increase of α-helix, S100: the final stably folded structure). All spectra are averaged over 100 MD conformations for each state.

Protein folding is a key process by which proteins form their unique three-dimensional structures and functions. Tracking structural changes in real time during the folding process can facilitate a mechanistic understanding of proteins. Trp-cage (PDB ID: 1L2Y) is a widely studied mini protein and a convenient system for studying folding dynamics.[48] We therefore use our ML model to monitor the folding process of the Trp-cage protein. All the CD spectra predicted by our ML model are based on 100 MD conformations retrieved from our previous study.[49] We have selected five representative states during the folding process (Figure 4b and Table S2). The initial state is entirely coil (S1) and starts to fold as the coil content decreases (S25). The structure then folds faster and helical elements appear (S50). Next, a cage is formed with an appreciable increase of α-helix (S75). Finally, the protein reaches the fully folded state (S100). Clearly, the simulated CD spectra differ between the stages of the folding process. The random coil, marked “RC”, has a positive band at 212 nm and a negative one around 195 nm. The α-helix, marked “α”, has negative bands at 222 and 208 nm and a positive one at 190 nm. During the folding process, the characteristics of “RC” decrease while those of “α” increase. Notably, the simulated CD spectrum of the completely folded state fits well with the experimental one[50] after the latter is blue-shifted as a whole by 6 nm (equivalent to ∼0.2 eV). These Trp-cage results show that our ML protocol can facilitate real-time CD spectroscopy studies of protein folding.
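The quoted equivalence between a 6 nm shift and about 0.2 eV can be checked with the photon energy relation E = hc/λ. The sketch below uses hc ≈ 1239.84 eV·nm; the reference wavelength of 196 nm is an assumption chosen to lie in the far-UV range discussed here, since the energy equivalent of a fixed wavelength shift depends on where in the spectrum it is applied.

```python
HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV·nm

def photon_energy_ev(wavelength_nm):
    """Photon energy in eV for a wavelength in nm."""
    return HC_EV_NM / wavelength_nm

def blue_shift_energy_ev(wavelength_nm, shift_nm):
    """Energy gained when a band moves to a shorter wavelength by shift_nm."""
    return photon_energy_ev(wavelength_nm - shift_nm) - photon_energy_ev(wavelength_nm)

# A band at 196 nm moved to 190 nm gains close to 0.2 eV.
shift_near_190 = blue_shift_energy_ev(196.0, 6.0)

# The same 6 nm shift applied at 222 nm corresponds to a smaller energy change.
shift_near_220 = blue_shift_energy_ev(222.0, 6.0)
```

The first value is close to 0.2 eV, in line with the statement above, while the second is noticeably smaller, so the ∼0.2 eV figure is best read as referring to the short-wavelength end of the spectrum.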

Summary

We have proposed a cost-effective machine learning protocol to simulate the electronic circular dichroism spectrum for proteins. This ML model benefits from the embedded density descriptors that give a robust and reliable prediction of the key tensorial parameters for the CD spectrum, including the electric and magnetic transition dipole moments of peptide bonds. Based on the parameters predicted by ML, we build the effective Hamiltonian and generate CD spectra of proteins. Our computational protocol not only significantly speeds up spectra simulation compared to the conventional first-principle calculation protocol but also obtains comparable results with experiments on a variety of different proteins, signifying the efficiency, accuracy, and transferability of our protocol. Further, the model is used for fast prediction of CD spectra of different conformations of Trp-cage along its folding pathway, demonstrating its power of efficiently mapping protein structures with their corresponding CD spectra for the study of protein dynamics. In summary, our ML protocol is a promising tool in the simulation of a protein CD spectrum and can be extended to other fields such as near-ultraviolet spectroscopy and two-dimensional spectroscopy.
