Wenhao Gao, Sai Pooja Mahajan, Jeremias Sulam, Jeffrey J. Gray.
Abstract
Deep learning is catalyzing a scientific revolution fueled by big data, accessible toolkits, and powerful computational resources, impacting many fields, including protein structural modeling. Protein structural modeling, such as predicting structure from amino acid sequence and evolutionary information, designing proteins toward desirable functionality, or predicting properties or behavior of a protein, is critical to understanding and engineering biological systems at the molecular level. In this review, we summarize the recent advances in applying deep learning techniques to tackle problems in protein structural modeling and design. We dissect the emerging approaches using deep learning techniques for protein structural modeling and discuss advances and challenges that must be addressed. We argue for the central importance of structure, following the "sequence → structure → function" paradigm. This review is intended to help computational biologists gain familiarity with the deep learning methods applied in protein modeling, and to give computer scientists a perspective on the biologically meaningful problems that may benefit from deep learning techniques.
Keywords: deep generative model; deep learning; protein design; protein folding; representation learning
Year: 2020 PMID: 33336200 PMCID: PMC7733882 DOI: 10.1016/j.patter.2020.100142
Source DB: PubMed Journal: Patterns (N Y) ISSN: 2666-3899
Figure 1. Striking Improvement in Model Accuracy in CASP13 Due to the Deployment of Deep Learning Methods
(A) Trend lines of backbone accuracy for the best models in each of the 13 CASP experiments. Individual target points are shown for the two most recent experiments. The accuracy metric, GDT_TS, is a multiscale indicator of the closeness of the Cα atoms in a model to those in the corresponding experimental structure (higher numbers are more accurate). Target difficulty is based on sequence and structure similarity to other proteins with known experimental structures (see Kryshtafovych et al. for details). Figure from Kryshtafovych et al. (2019).
(B) Number of FM + FM/TBM (FM, free modeling; TBM, template-based modeling) domains (out of 43) solved to a TM score threshold for all groups in CASP. AlphaFold ranked first among them, showing that the progress is mainly due to the development of DL-based methods. Figure from Senior et al. (2020).
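The GDT_TS metric described above averages the fraction of model Cα atoms within 1, 2, 4, and 8 Å of their positions in the experimental structure. A minimal sketch, assuming per-residue deviations computed after a fixed superposition (the official CASP metric additionally maximizes the score over superpositions):

```python
import numpy as np

def gdt_ts(deviations):
    """GDT_TS from per-residue Ca deviations (Å), for a model already
    superimposed on the experimental structure: the mean, over the
    cutoffs 1/2/4/8 Å, of the fraction of residues within each cutoff."""
    d = np.asarray(deviations, dtype=float)
    fractions = [np.mean(d <= t) for t in (1.0, 2.0, 4.0, 8.0)]
    return 100.0 * np.mean(fractions)
```

Higher is better; a perfect model scores 100, and the trend lines in (A) show best-model GDT_TS rising sharply in CASP13.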
Figure 2. Schematic Comparison of Three Major Tasks in Protein Modeling: Function Prediction, Structure Prediction, and Protein Design
In function prediction, the sequence and/or the structure is known and the functionality is the desired output of a neural network. In structure prediction, the sequence is the known input and the structure is the unknown output. Protein design starts from a desired functionality or, a step further, from a structure that can perform that functionality; the desired output is a sequence that folds into that structure or exhibits that functionality.
Figure 3. Schematic Representation of Several Architectures Used in Protein Modeling and Design
(A) CNNs are widely used in structure prediction.
(B) RNNs learn in an auto-regressive way and can be used for sequence generation.
(C) The VAE can be trained jointly on proteins and their properties to construct a latent space correlated with those properties.
(D) In the GAN setting, a mapping from a prior distribution to the design space is obtained via adversarial training.
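The auto-regressive generation of (B) can be illustrated with a toy sampler. The random transition table below is a stand-in for a trained RNN, which would condition each residue on a learned hidden state summarizing the whole prefix rather than only the previous residue; the stop token `*` is an assumption for illustration:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY*"  # 20 amino acids plus a stop token

rng = np.random.default_rng(0)
# Stand-in for a trained model: fixed random logits mapping the
# previous residue to a distribution over the next residue.
logits = rng.normal(size=(len(AA), len(AA)))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_sequence(max_len=30):
    """Auto-regressive sampling: draw one residue at a time,
    each conditioned on the previously emitted residue."""
    seq, prev = [], 0
    for _ in range(max_len):
        nxt = rng.choice(len(AA), p=softmax(logits[prev]))
        if AA[nxt] == "*":
            break
        seq.append(AA[nxt])
        prev = nxt
    return "".join(seq)
```

The same sampling loop underlies the LSTM-based sequence generators in the tables below; only the conditioning model changes.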
Figure 4. Different Types of Representation Schemes Applied to a Protein
Features Contained in the CUProtein Dataset
| Feature Name | Description | Dimensions | Type | IO |
|---|---|---|---|---|
| AA Sequence | sequence of amino acids | n | 21 chars | input |
| PSSM | position-specific scoring matrix; residue-wise scores for motif appearance | n | real [0, 1] | input |
| MSA covariance | covariance matrix across homologous sequences | n | real [0, 1] | input |
| SS | coarsely categorized secondary structure (Q3 or Q8) | n | 3 or 8 chars | input |
| Distance matrices | pairwise distances between residues (Cα) | n × n | positive real (Å) | output |
| Torsion angles | backbone dihedral angles (φ, ψ) for each residue | n | real [−π, +π] (radians) | output |
n, number of residues in one protein. Data from Drori et al.
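The input and output encodings in the table above can be sketched directly. The 21-character alphabet (20 amino acids plus an unknown token) and the choice of Cα atoms for the distance matrix are assumptions for illustration:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids + unknown = 21 chars

def one_hot(seq):
    """Encode an amino acid string as an n x 21 one-hot matrix."""
    idx = [ALPHABET.index(a) for a in seq]
    out = np.zeros((len(seq), len(ALPHABET)))
    out[np.arange(len(seq)), idx] = 1.0
    return out

def distance_matrix(ca):
    """Pairwise Ca-Ca distances (Å) from an n x 3 coordinate array,
    giving the n x n output representation in the table."""
    diff = ca[:, None, :] - ca[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))
```

PSSM and MSA-covariance features are derived analogously, as real-valued matrices aligned to the same n residue positions.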
A Summary of Publicly Available Molecular Biology Databases
| Dataset | Description | N | Website |
|---|---|---|---|
| European Bioinformatics Institute (EMBL-EBI) | a collection of a wide range of datasets | – | |
| National Center for Biotechnology Information (NCBI) | a collection of biomedical and genomic databases | – | |
| Protein Data Bank (PDB) | 3D structural data of biomolecules, such as proteins and nucleic acids | – | |
| Nucleic Acid Database (NDB) | structures of nucleic acids and complex assemblies | – | |
| Universal Protein Resource (UniProt) | protein sequence and function information | – | |
| Sequence Read Archive (SRA) | raw sequence data from “next-generation” sequencing technologies | – | NCBI database |
A Summary of Structure Prediction Models
| Model | Architecture | Dataset | N_train | Performance | Testset | Citation |
|---|---|---|---|---|---|---|
| / | MLP (2-layer) | proteases | 13 | 3.0 Å RMSD (1TRM), 1.2 Å RMSD (6PTI) | 1TRM, 6PTI | Bohr et al. |
| PSICOV | graphical Lasso | – | – | precision: Top-L 0.4, Top-L/2 0.53,Top-L/5 0.67, Top-L/10 0.73 | 150 Pfam | Jones et al. |
| CMAPpro | 2D biRNN + MLP | ASTRAL | 2,352 | precision: Top-L/5 0.31, Top-L/10 0.4 | ASTRAL 1.75 CASP8, 9 | Di Lena et al. |
| DNCON | RBM | PDB SVMcon | 1,230 | precision: Top-L 0.46, Top-L/2 0.55, Top-L/5 0.65 | SVMCON_TEST, D329, CASP9 | Eickholt et al. |
| CCMpred | LM | – | – | precision: Top-L 0.5, Top-L/2 0.6, Top-L/5 0.75, Top-L/10 0.8 | 150 Pfam | Seemayer et al. |
| PconsC2 | Stacked RF | PSICOV set | 150 | positive predictive value (PPV) 0.44 | set of 383 CASP10 (114) | Skwark et al. |
| MetaPSICOV | MLP | PDB | 624 | precision: Top-L 0.54, Top-L/2 0.70, Top-L/5 0.83, Top-L/10 0.88 | 150 Pfam | Jones et al. |
| RaptorX-Contact | ResNet | subset of PDB25 | 6,767 | TM score: 0.518 (CCMpred: 0.333, MetaPSICOV: 0.377) | Pfam, CASP11, CAMEO, MP | Wang et al, 2017 |
| RaptorX-Distance | ResNet | subset of PDB25 | 6,767 | TM score: 0.466 (CASP12), 0.551 (CAMEO), 0.474 (CASP13) | CASP12 + 13, CAMEO | Xu, 2018 |
| DeepCov | 2D CNN | PDB | 6,729 | precision: Top-L 0.406, Top-L/2 0.523, Top-L/5 0.611, Top-L/10 0.642 | CASP12 | Jones et al, 2018 |
| SPOT | ResNet, Res-bi-LSTM | PDB | 11,200 | AUC: 0.958 (RaptorX-contact ranked 2nd: 0.909) | 1,250 chains after June 2015 | Hanson et al. |
| DeepMetaPSICOV | ResNet | PDB | 6,729 | precision: Top-L/5 0.6618 | CASP13 | Kandathil et al, 2019 |
| MULTICOM | 2D CNN | CASP 8-11 | 425 | TM score: 0.69, GDT_TS: 63.54, SUM | CASP13 | Hou et al. |
| C-I-TASSER∗ | 2D CNN | – | – | TM score: 0.67, GDT_HA: 0.44, RMSD: 6.19, SUM | CASP13 | Zheng et al. |
| AlphaFold | ResNet | PDB | 31,247 | TM score: 0.70, GDT_TS: 61.4, SUM | CASP13 | Senior et al. |
| MapPred | ResNet | PISCES | 7,277 | precision: 78.94% in SPOT, 77.06% in CAMEO, 77.05% in CASP12 | SPOT, CAMEO, CASP12 | Wu et al, 2019 |
| trRosetta | ResNet | PDB | 15,051 | TM score: 0.625 (AlphaFold: 0.587) | CASP13, CAMEO | Yang et al, 2020 |
| RGN | bi-LSTM | ProteinNet 12 (before 2016)∗∗ | 104,059 | 10.7 Å dRMSD on FM, 6.9 Å on TBM | CASP12 | AlQuraishi, 2019 |
| / | biGRU, Res LSTM | CUProtein | 75,000 | exceeded the CASP12 winning team, comparable with AlphaFold in RMSD | CASP12 + 13 | Drori et al. |
FM, free modeling; GRU, gated recurrent unit; LM, pseudo-likelihood maximization; MLP, multi-layer perceptron; MP, membrane protein; RBM, restricted Boltzmann machine; RF, random forest; RMSD, root-mean-square deviation; TBM, template-based modeling.
∗Both C-I-TASSER and C-QUARK were reported; we report only one here.
∗∗RGN was trained on a different ProteinNet set for each CASP; we report the latest here.
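The Top-L/k precision reported throughout the table counts how many of the L/k highest-scoring predicted residue pairs are true contacts, where L is the sequence length; pairs close in sequence are conventionally excluded. A minimal sketch, with the minimum sequence separation and an externally supplied contact map (e.g., Cβ–Cβ distance < 8 Å) assumed:

```python
import numpy as np

def topk_precision(pred, true_contacts, L, k=5, min_sep=6):
    """Precision of the top L/k predicted contacts.

    pred: n x n matrix of predicted contact scores.
    true_contacts: n x n boolean matrix of observed contacts.
    Pairs separated by fewer than min_sep residues are excluded.
    """
    i, j = np.triu_indices(pred.shape[0], k=min_sep)
    order = np.argsort(pred[i, j])[::-1][: max(1, L // k)]
    return true_contacts[i[order], j[order]].mean()
```

Top-L, Top-L/2, and Top-L/10 are the same calculation with k = 1, 2, and 10.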
Figure 5. Two Representative DL Approaches to Protein Structure Prediction
(A) Residue distance prediction by RaptorX: the overall network architecture of the deep dilated ResNet used in CASP13. Inputs of the first-stage, 1D convolutional layers are a sequence profile, predicted secondary structure, and solvent accessibility. The output of the first stage is then converted into a 2D matrix by concatenation and fed into a deep ResNet along with pairwise features (co-evolution information, pairwise contact, and distance potential). A discretized inter-residue distance is the output. Additional network layers can be attached to predict torsion angles and secondary structures. Figure from Xu and Wang (2019).
(B) Direct structure prediction: overview of recurrent geometric networks (RGN) approach. The raw amino acid sequence along with a PSSM are fed as input features, one residue at a time, to a bidirectional LSTM net. Three torsion angles for each residue are predicted to directly construct the 3D structure. Figure from AlQuraishi (2019).
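Converting RGN's predicted torsions into Cartesian coordinates is typically done by sequential atom placement in internal coordinates, often called the natural extension reference frame (NeRF). A sketch of the core placement step, with bond length and bond angle supplied as fixed geometric parameters:

```python
import numpy as np

def place_atom(a, b, c, bond, angle, torsion):
    """Place a new atom d given the three preceding atoms a, b, c,
    the c-d bond length, the b-c-d bond angle, and the a-b-c-d
    dihedral (both in radians)."""
    bc = c - b
    bc /= np.linalg.norm(bc)
    n = np.cross(b - a, bc)           # normal of the a-b-c plane
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)               # completes the local frame
    d = np.array([-bond * np.cos(angle),
                  bond * np.sin(angle) * np.cos(torsion),
                  bond * np.sin(angle) * np.sin(torsion)])
    return c + d[0] * bc + d[1] * m + d[2] * n
```

Iterating this placement along the backbone (N, Cα, C atoms, with φ and ψ as the variable dihedrals) reconstructs the full 3D chain from the predicted torsions.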
Generative Models to Identify Sequence from Function (Design for Function)
| Model | Architecture | Output | Dataset | N_train | Performance | Citation |
|---|---|---|---|---|---|---|
| – | WGAN + AM | DNA | chromosome 1 of human genome hg38 | 4.6M | ~4 times stronger than training data in predicted TF binding | Killoran et al. |
| – | VAE | AA | 5 protein families | – | natural mutation probability prediction rho = 0.58 | Sinai et al. |
| – | LSTM | AA | ADAM, APD, DADP | 1,554 | predicted antimicrobial property 0.79 ± 0.25 (random: 0.63 ± 0.26) | Müller et al, 2018 |
| PepCVAE | CVAE | AA | – | 15K labeled, 1.7M unlabeled | generated predicted AMPs at 83% (random: 28%; length 30) | Das et al. |
| FBGAN | WGAN | DNA | UniProt (res., 50) | 3,655 | predicted antimicrobial property over 0.9 after 60 epochs | Gupta et al. |
| DeepSequence | VAE | AA | mutational scan data | 41 scans | aimed for mutation effect prediction, outperformed previous models | Riesselman et al. |
| DbAS-VAE | VAE+AS | DNA | simulated data | – | predicted protein expression surpassed FB-GAN/VAE | Brookes et al. |
| – | LSTM | musical scores | – | 56 betas + 38 alphas | generated proteins capture secondary structure features | Yu et al. |
| BioSeqVAE | VAE | AA | UniProt | 200,000 | 83.7% reconstruction accuracy,70.6% EC accuracy | Costello et al. |
| – | WGAN | AA | antibiotic resistance determinants | 6,023 | 29% similarity to training sequences (BLASTp) | Chhibbar et al. |
| PEVAE | VAE | AA | 3 protein families | 31,062 | latent space captures phylogenetic, ancestral relationship, and stability | Ding et al. |
| – | ResNet | AA | mutation data + llama immune repertoire | 1.2M (nano) | predicted mutation effect reached state-of-the-art, built a library of CDR3 seq | Riesselman et al. |
| Vampire | VAE | AA | immuneACCESS | – | generated sequences predicted to be similar to real CDR3 sequences | Davidson et al, 2019 |
| ProGAN | CGAN | AA | eSol | 2,833 | solubility prediction | Han et al, 2019 |
| ProteinGAN | GAN | AA | MDH from UniProt | 16,706 | 60 sequences were tested | Repecka et al. |
| CbAS-VAE | VAE+AS | AA | protein fluorescence dataset | 5,000 | predicted protein fluorescence surpassed FB-VAE/DbAS | Brookes et al. |
AA, amino acid sequence; AM, activation maximization; AS, adaptive sampling; CGAN, conditional generative adversarial network; CVAE, conditional variational autoencoder; DNA, DNA sequence; EC, enzyme commission.
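Most of the VAE-based models in the table share the same two ingredients: the reparameterization trick for sampling latent codes during training, and a KL penalty that regularizes the latent space so it can be sampled and interpolated. A minimal numpy sketch of both (shapes and the standard-normal prior are the usual choices, assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps with
    eps ~ N(0, I), so gradients can flow through mu and log_var
    while sampling stays stochastic."""
    eps = rng.normal(size=np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), the regularizer that keeps the
    VAE latent space smooth enough to sample new sequences from."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
```

After training, new candidate sequences are generated by drawing z from the prior and decoding, which is how models such as PepCVAE and BioSeqVAE produce novel designs.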
Generative Models for Protein Structure Design
| Model | Architecture | Representation | Dataset | N_train | Performance | Citation |
|---|---|---|---|---|---|---|
| – | DCGAN | Cα distance matrix | PDB (16-, 64-, 128-residue fragments) | 115,850 | meaningful secondary structure, reasonable Ramachandran plot | Anand et al. |
| RamaNet | GAN | torsion angles | ideal helical structures from PDB | 607 | generated torsions are concentrated around helical region | Sabban et al. |
| – | DCGAN | backbone distance | PDB (64-residue fragment) | 800,000 | smooth interpolations; recover from sequence design and folding | Anand et al. |
| Ig-VAE | VAE | coordinates and backbone distance | AbDb (antibody structure) | 10,768 | sampled 5,000 Igs screened for SARS-CoV-2 binders | Eguchi et al. |
| – | CNN (input design) | same as trRosetta | – | – | 27 out of 129 sequence-structure pairs experimentally validated | Anishchenko et al. |
CNN, convolutional neural network; DCGAN, deep convolutional generative adversarial network; GAN, generative adversarial network; VAE, variational autoencoder.
Generative Models to Identify Sequence from Structure (Protein Design)
| Model | Architecture | Input | Dataset | N_train | Performance | Citation |
|---|---|---|---|---|---|---|
| SPIN | MLP | sliding window with 136 features | PISCES | 1,532 | sequence recovery of 30.7% on 1,532 proteins (CV) | Li et al. |
| SPIN2 | MLP | sliding window with 190 features | PISCES | 1,532 | sequence recovery of 34.4% on 1,532 proteins (CV) | O’Connell et al. |
| – | MLP | target residue and its neighbor as pairs | PDB | 10,173 | sequence recovery of 34% on 10,173 proteins | Wang et al. |
| – | CVAE | string encoded structure or metal | PDB, MetalPDB | 3,785 | verified with structure prediction and dynamic simulation | Greener et al. |
| SPROF | Bi-LSTM + 2D ResNet | 112 1D features + Cα distance matrix | PDB | 11,200 | sequence recovery of 39.8% | Chen et al. |
| ProDCoNN | 3D CNN | gridded atomic coordinates | PDB | 17,044 | sequence recovery of 42.2% on 5,041 proteins | Zhang et al. |
| – | 3D CNN | gridded atomic coordinates | PDB-REDO | 19,436 | sequence recovery 70%, experimental validation of mutation | Shroff et al. |
| ProteinSolver | Graph NN | partial sequence, adjacency matrix | UniParc | – | sequence recovery of 35%, folding and MD test with 4 proteins | Strokach et al, 2019 |
| gcWGAN | CGAN | random noise + structure | SCOPe | 20,125 | diversity and TM score of prediction from designed sequence | Karimi et al. |
| – | Graph Transformer | backbone structure in graph | CATH based | 18,025 | perplexity: 6.56 (rigid), 11.13 (flexible) (random: 20.00) | Ingraham et al. |
| DenseCPD | ResNet | gridded backbone atomic density | PISCES | – | sequence recovery of 54.45% on 500 proteins | Qi et al. |
| – | 3D CNN | gridded atomic coordinates | PDB | 21,147 | sequence recovery from 33% to 87%, test with folding of TIM barrel | Anand et al. |
| – | CNN (input design) | same as trRosetta | – | – | – | Norn et al. |
Bi-LSTM, bidirectional long short-term memory; CV, cross-validation; MLP, multi-layer perceptron.
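Sequence recovery, the dominant metric in the table above, is simply the fraction of positions at which the designed sequence matches the native sequence of the input backbone:

```python
def sequence_recovery(designed, native):
    """Fraction of positions where the designed sequence matches the
    native sequence of the target backbone, as reported for SPIN,
    SPROF, ProDCoNN, DenseCPD, and others."""
    assert len(designed) == len(native)
    return sum(a == b for a, b in zip(designed, native)) / len(native)
```

Note that recovery is an imperfect proxy: many non-native sequences also fold to the target structure, which is why several entries supplement it with folding simulations or experimental validation.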