Wenze Ding, Kenta Nakai, Haipeng Gong.
Abstract
Proteins with desired functions and properties are important in fields such as nanotechnology and biomedicine. De novo protein design enables the production of previously unseen proteins from the ground up and is regarded as key to addressing real societal challenges. The recent introduction of deep learning into design methods has had a transformative influence and is expected to represent a promising and exciting future direction. In this review, we survey the major aspects of current advances in deep-learning-based design procedures and illustrate their novelty in comparison with conventional knowledge-based approaches through notable cases. We not only describe deep learning developments in structure-based protein design and direct sequence design, but also highlight recent applications of deep reinforcement learning in protein design. Future perspectives on design goals, challenges and opportunities are also comprehensively discussed.
Keywords: deep learning; deep reinforcement learning; protein design; protein sequence; protein structure
Year: 2022 PMID: 35348602 PMCID: PMC9116377 DOI: 10.1093/bib/bbac102
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 13.994
Figure 1. An illustration of the internal architectures of (A) VAEs and (B) GANs. Arrows represent the corresponding dataflow.
Figure 2. An illustration of two inverse processes, i.e. protein structure prediction (upper) and structure-based protein design (lower).
Brief summary of recent research on structure-based protein design
| Reference order | Research objective | Data resource | Network architecture |
|---|---|---|---|
| [ | Complete corrupted structures | Protein structures from PDB database | dcGAN |
| [ | Hallucinate novel proteins through protein structure prediction networks | Completely arbitrary protein sequences with a fixed length of 100 amino acids | trRosetta network within the residue substitution step of a simulated annealing trajectory |
| [ | Generate coordinates of immunoglobulin backbones | Antibody structures from AbDb database | VAE |
| [ | Generate protein sequence with given geometric and amino acid constraints | Proteins extracted from UniProt database, sequence repository Gene3D | GNN |
| [ | Optimize over protein sequences and structures simultaneously by backpropagating gradients through protein structure prediction networks | Proteins collected from a structure-refinement study (redundancy with the trRosetta training set was reduced) | trRosetta network |
| [ | Rate candidate predicted structures without explicit standards and answers | Known correct rankings | RankNet and LambdaRank |
To conserve space, we chose one representative study from each group of studies with similar objectives or procedures.
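The hallucination entry in the table above pairs residue substitution with a simulated annealing trajectory. A minimal sketch of that loop is given below, under loud assumptions: `toy_score` is a hypothetical stand-in for a score derived from a structure prediction network (it is not trRosetta), and the cooling schedule, step count and function names are illustrative choices rather than the settings of the cited work.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")


def toy_score(seq):
    """Hypothetical score (higher is better); a placeholder, NOT trRosetta.

    It counts neighbouring positions that alternate between hydrophobic
    and non-hydrophobic residues.
    """
    return sum(
        1 for a, b in zip(seq, seq[1:])
        if (a in HYDROPHOBIC) != (b in HYDROPHOBIC)
    )


def anneal(length=100, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Simulated annealing over sequences using single-residue substitutions."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]  # arbitrary start
    score = toy_score(seq)
    initial_score = score
    best_seq, best_score = list(seq), score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)  # propose a residue substitution
        delta = toy_score(seq) - score
        # Metropolis criterion: always accept improvements, sometimes accept losses.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            score += delta
            if score > best_score:
                best_seq, best_score = list(seq), score
        else:
            seq[pos] = old  # reject the move and restore the residue
    return "".join(best_seq), initial_score, best_score


designed, start_score, final_score = anneal()
```

In a real pipeline, the scoring function would be replaced by a quantity computed from the network's predicted structure; the acceptance logic would stay the same.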
Figure 3. GANs are used as an inpainting tool to repair the inter-residue distance map for a corrupted protein structure. The missing part of the original corrupted distance map (upper) is highlighted with green dashed squares and the corresponding structure is represented in cyan (dotted line for corruptions). The distance map is repaired (lower) and the structure translated from it is represented in violet.
Figure 4. An illustration of protein representation learning, direct protein sequence design and related downstream protein analysis applications. Protein representations with fundamental features are obtained through protein language models (bottom). In combination with different kinds of top models, these representation vectors could be used for either protein sequence design or other analysis tasks (top).
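To make the input of the inpainting task concrete, the sketch below builds an inter-residue distance map from a list of 3D coordinates and blanks out the rows and columns of a missing segment. The coordinates are a made-up straight chain with 3.8 Å spacing, and `distance_map` and `mask_block` are hypothetical helper names; the generative model that would fill the gaps is not shown.

```python
import math


def distance_map(coords):
    """Pairwise Euclidean distances between residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def mask_block(dmap, start, end):
    """Copy of dmap with every entry involving residues start..end-1 set to None."""
    n = len(dmap)
    missing = set(range(start, end))
    return [
        [None if (i in missing or j in missing) else dmap[i][j] for j in range(n)]
        for i in range(n)
    ]


# Toy backbone: six points on a line, 3.8 angstroms apart (illustrative only).
coords = [(3.8 * i, 0.0, 0.0) for i in range(6)]
dmap = distance_map(coords)

# Pretend residues 2 and 3 are missing from the structure.
corrupted = mask_block(dmap, 2, 4)
```

An inpainting model would take `corrupted` as input and predict values for the `None` entries; a structure would then have to be recovered from the repaired map in a separate step.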
Brief summary of recent research on direct protein sequence design
| Reference order | Research objective | Data resource | Network architecture |
|---|---|---|---|
| [ | Extract fundamental features of unlabeled protein sequences into a statistical representation | Protein sequences from UniRef50 database | mLSTM RNN |
| [ | Train a deep contextual protein language model to produce generalized features | Protein sequences from UniParc database | Transformer |
| [ | Build precise virtual protein fitness landscape based on protein sequence representation | A few mutants of natural target protein and their functional characterizations | Single-layer linear regression model on the top of UniRep |
| [ | Generate synthetic genes coding proteins with desirable functions or biophysical properties | Peptides with 5–50 residues from UniProt dataset | WGAN with an external feedback loop |
| [ | Generate functional protein sequences by learning natural sequence diversity | Bacterial MDH sequences from UniProt dataset | Tailored GAN with temporal convolution and self-attention |
To conserve space, we chose one representative study from each group of studies with similar objectives or procedures.
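The table above lists a single-layer linear regression model placed on top of a learned sequence representation. The sketch below illustrates that pattern only. The `embed` function is a hypothetical three-number stand-in for a real language-model embedding, the variant sequences are invented, and the "fitness" labels are synthetic values generated from a made-up linear rule, not measurements.

```python
HYDROPHOBIC = set("AVILMFWC")
CHARGED = set("DEKR")


def embed(seq):
    """Hypothetical stand-in for a learned representation vector."""
    n = len(seq)
    return [
        sum(c in HYDROPHOBIC for c in seq) / n,  # hydrophobic fraction
        sum(c in CHARGED for c in seq) / n,      # charged fraction
        1.0,                                     # bias term
    ]


def predict(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def mse(weights, X, y):
    return sum((predict(weights, x) - t) ** 2 for x, t in zip(X, y)) / len(y)


def fit_linear(X, y, lr=0.1, epochs=500):
    """Fit the linear top model by batch gradient descent on the squared error."""
    weights = [0.0] * len(X[0])
    n = len(y)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, t in zip(X, y):
            err = predict(weights, x) - t
            for k, xk in enumerate(x):
                grads[k] += 2 * err * xk / n
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Invented variants with synthetic labels from an arbitrary linear rule.
variants = ["AVILKDEG", "AAAAKKKK", "GGSSTTNN", "DEKRDEKR", "AVILMFWC", "GAVLSTDE"]
X = [embed(s) for s in variants]
y = [2.0 * x[0] - 1.0 * x[1] + 0.5 for x in X]

weights = fit_linear(X, y)
initial_loss = mse([0.0, 0.0, 0.0], X, y)
final_loss = mse(weights, X, y)
```

The point of the design is that the representation carries most of the information, so a very small top model can be fitted from only a few labelled variants.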
Figure 5. Deep-reinforcement-learning-based protein design is analogous to the natural protein synthesis process. (A) An illustration of the natural protein synthesis process. (B) Protein sequence generation from left to right by deep reinforcement learning. The agent takes an available action (which amino acid to add next) according to its policy conditioned on the current state.
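The left-to-right generation loop in panel (B) can be sketched as follows. This is a simplified illustration under stated assumptions: `TabularPolicy` is a hypothetical toy policy that conditions only on the previous residue and starts out uniform, whereas a deep reinforcement learning agent would use a neural network over the full state and update it from rewards. No training step is shown.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TabularPolicy:
    """Toy policy: one row of logits per previous residue ('^' marks the start)."""

    def __init__(self):
        self.logits = {s: [0.0] * len(AMINO_ACIDS) for s in "^" + AMINO_ACIDS}

    def action_probs(self, state):
        prev = state[-1] if state else "^"
        return softmax(self.logits[prev])


def generate(policy, length, rng):
    """Build a sequence one residue at a time, sampling each action from the policy."""
    state = ""  # the state is the partial sequence built so far
    for _ in range(length):
        probs = policy.action_probs(state)
        action = rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]
        state += action  # the chosen residue extends the chain
    return state


policy = TabularPolicy()
sequence = generate(policy, 30, random.Random(0))
```

In a full setup, a reward computed on the finished sequence would be used to adjust the policy so that high-reward residues become more likely in later episodes.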