Protein design via deep learning.

Wenze Ding^1,2,3,4, Kenta Nakai⁵, Haipeng Gong^3,4.

Abstract

Proteins with desired functions and properties are important in fields like nanotechnology and biomedicine. De novo protein design enables the production of previously unseen proteins from the ground up and is regarded as key to addressing real societal challenges. The recent introduction of deep learning into design methods has had a transformative influence and is expected to represent a promising and exciting future direction. In this review, we survey the major aspects of current advances in deep-learning-based design procedures and illustrate their novelty in comparison with conventional knowledge-based approaches through notable cases. We not only describe deep learning developments in structure-based protein design and direct sequence design, but also highlight recent applications of deep reinforcement learning in protein design. Future perspectives on design goals, challenges and opportunities are also comprehensively discussed.

Entities: Chemical

Keywords: deep learning; deep reinforcement learning; protein design; protein sequence; protein structure

Substances：
Proteins

Year: 2022 PMID: 35348602 PMCID: PMC9116377 DOI: 10.1093/bib/bbac102

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 13.994

Introduction

Among all molecules in our sophisticated and wonderful world, proteins that participate in most biochemical reactions have been under the spotlight of fundamental scientific researches as well as medical and industrial applications for decades. According to the ‘central dogma,’ the basic biological principle articulated by Francis Crick in 1958, proteins are the executive ends of information flow systems in living organisms, each performing one or a few specifically encoded functions that jointly define the corresponding organism in turn. A wide variety of native proteins such as nuclear proteins, membrane proteins, hemoproteins, lipoproteins, heat-shock proteins, contractile proteins, etc. manifest strikingly excellent properties compared with man-made machines, including extremely high efficiency, economy and precision in operation, self-assembly upon synthesis and so on. Considering their enormous quantity, fantastic quality and consequent pluripotency, protein materials have attracted extensive attentions since they could provide possible solutions for many serious social challenges. Due to the strictly limited working environment and relatively short operation life, native proteins, however, cannot meet the surging demands of human beings satisfactorily. Furthermore, since native proteins are optimized gradually through millions of years of evolution under the selective pressure of nature, they in principle are unlikely to handle challenges arising from human society within hundreds of years. Therefore, artificial protein modification, and even one step further, the design of brand-new proteins from scratch emerges as the times require. Fortunately, protein design becomes technically possible with the long-drawn accumulation of knowledge from past biochemical and biophysical studies of proteins [1]. 
Many impressive achievements have been made through protein design over the past decade, which intensively impacted and promoted synthetic biology in both academia and industry. Advances in immune signaling [2, 3], targeted therapeutics [4, 5], sense-response systems [6], protein switches [7, 8], self-assembly materials [9, 10] and other fields not mentioned here have shown the exciting potential of utilizing proteins as functional and reproducible materials. In addition, these breakthroughs in protein design also expand our exploration and understanding of protein sequence, structure and function spaces. Taking sequence space as an example, since all native protein sequences originated from a few ancient accidental events and gradually evolved with haphazard mutation and oriented selective pressure, they exist in the sequence space in the form of sprinkling clusters called protein families instead of even dispersion. The properties and functions of protein sequences located in the vast remaining space would never be sampled by natural evolution within a limited time scale, which thus endows the great significance of protein design. The earlier protein design approaches such as directed evolution [11, 12] and the following rational engineering [13, 14] mainly focus on the imitation and/or acceleration of natural evolutionary processes. Through rounds of mutation library construction and high-throughput screening, these methods could successfully obtain proteins with improved performance or even new functions by chance [15-18]. Nevertheless, these approaches always confront the tradeoff between assay fidelity and throughput, and more importantly, their explorations are still restricted around the corresponding initial native proteins. 
With the development of computational devices and algorithms, the shortcomings mentioned above are gradually being overcome by computer-assisted protein engineering, which avoids the relatively random mutation strategy and provides definite design blueprints based on biophysical and biochemical principles of proteins. Among the many computer-assisted protein engineering methodologies, de novo protein design, which aims to generate new proteins not existing in nature, has drawn the most attention [1]. With copious valuable achievements, de novo protein design was nominated as one of the top 10 annual breakthroughs by Science in 2016 [19]. Basically, the task of de novo protein design is to find new sequences targeted for desired functions. In practice, however, there are some impediments to constructing a direct mapping between the protein sequence and function spaces. For example, information encoded in a protein sequence is hard to extract from the target sequence alone, since it is simply a permutation or combination of 20 kinds of amino acid residues. Besides, protein functions can rarely be articulated quantitatively. Since proteins need to form particular tertiary structures to perform their specific functions and structures usually contain richer information, e.g. the Cartesian coordinates of atoms stored in PDB files, protein structures are a natural medium for the bidirectional mapping between sequences and functions. In addition, the massive protein structural data accumulated from previous research, such as protein fold classification, consequent clustering and reaction mechanism information described by binding interfaces, catalytic centers and allosteric regulations, are also extremely helpful. Thus, de novo protein design proceeds mainly in the structure-based manner.
Structure-based de novo protein design usually has three domains or stages, i.e. backbone generation, sequence fitness and candidate scoring, exemplified by Top7 [20], the first globular protein that was designed without natural homologs, as well as other famous related works. Generally, a specific folding topology with predefined secondary structural elements and/or geometric constraints (e.g. inter-residue distances and orientations) is designed at the first step. Then, compatible peptide fragments are picked under the evaluation of sequence-independent energy functions and several sequence-structure optimization iterations are executed. During these iterations, rotamers are substituted randomly based on the energy functions, following the Metropolis-Hastings algorithm. After that, candidates are scored, rated and selected to generate the final design outputs [21]. Despite the significant achievements [22-24], these conventional approaches are mainly knowledge based, relying on physical principles and statistical rules [25]. With the wealth of data accumulated for protein sequence, structure and function as well as their relationships [26-28], research interest in protein design has gradually shifted towards data-driven methods in recent years [29]. Among them, deep learning techniques, which have revolutionized many other fields like natural language processing and computer vision [30], have made the most significant impact. Deep learning offers the simplest and also the most general approximation and parameterization methodology for high-order statistics and potentials by enlarging the receptive field with the support of big data, and thus could be integrated into all domains of structure-based protein design for further improvements and even breakthroughs. Besides, deep learning also sheds light on direct protein sequence design for specific functions or properties without the medium of structures.
In this review, we focus our discussion on advanced protein design approaches based on deep learning techniques, the benefits they offer and the foreseeable trends. It is noteworthy that many other advances that have hugely promoted protein design, exemplified by DNA synthesis, protein structure prediction and protein manufacture, are not detailed here.

Briefings of deep learning techniques related to this review

In a nutshell, deep learning trains an artificial neural network or a combination of related networks to approximate complicated unknown functions in a high-dimensional abstract space. Artificial neurons or nodes with non-linear activations are connected by specific affine transformations with parameterized weights and biases, which are modified in each training step through the back propagation of gradients computed from losses, i.e. the differences between current network outputs and corresponding ground truths.
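The training step described above can be illustrated with a minimal sketch. It assumes a single sigmoid neuron, a squared-error loss and a toy logical-OR dataset; the names `train_neuron`, `predict` and `OR_DATA` are hypothetical and chosen only for this illustration, not taken from any method discussed in this review.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_neuron(samples, lr=0.5, epochs=2000, seed=0):
    """Fit one neuron y = sigmoid(w . x + b) by plain gradient descent."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b  # affine transformation
            y = sigmoid(z)                                 # non-linear activation
            # Loss = 0.5 * (y - target)^2; the chain rule gives dLoss/dz.
            grad_z = (y - target) * y * (1.0 - y)
            # Back-propagate the gradient to the weights and the bias.
            w = [wi - lr * grad_z * xi for wi, xi in zip(w, x)]
            b -= lr * grad_z
    return w, b

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy ground truth (an assumption for illustration): logical OR.
OR_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

if __name__ == "__main__":
    w, b = train_neuron(OR_DATA)
    for x, target in OR_DATA:
        print(x, target, round(predict(w, b, x), 3))
```

Practical networks stack many such units and rely on automatic differentiation, but the update rule per parameter follows the same pattern.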

Discriminative models

Convolutional neural network (CNN) is one of the most successful deep network architectures for data with a typical grid-structured topology, such as ordinary pictures and protein inter-residue distance maps. There are two major operators in an ordinary CNN: convolution and pooling, where convolution is a special linear operation built on element-wise multiplication while pooling is a proportional down-sampling operation. The unique mechanism of CNN empowers it to overcome the shortcomings of common deep feedforward networks through parameter sharing, sparse interaction and equivariant representation. Besides, convolution also makes it possible to handle input data of variable sizes. Modern practical implementations of CNN often involve huge networks containing architectural variants and millions of units. Among them, ResNet was one of the most prominent drivers of the development of protein bioinformatics in the last decade [31, 32].
Recurrent neural network (RNN) is another classic network architecture suitable for processing sequential data like natural languages and protein sequences. The underlying idea of unfolding recursive computation into a computational graph with repetitive structure naturally results in large-scale parameter sharing. Typically, RNN produces a single output according to the information of the entire sequence extracted and stored in hidden units with recurrent connections at the current time step. Variants of RNN like long short-term memory (LSTM) and gated recurrent unit (GRU) also played an important role in natural language processing and bioinformatics during the past decade [33-35]. Recently, a novel network called transformer, which combines an encoder–decoder architecture with an attention mechanism, exhibited superior capability for sequence processing [36].
Within the transformer network, multi-head attention module consisting of multiple self-attentions could capture correlations of amino acid residues among different dimensions, which makes it appropriate for representation learning of protein sequences [37]. Deep graph neural network (GNN) operates on graph, a non-Euclidean data structure, and focuses on problems like clustering, link prediction and node classification. GNN has been widely applied to knowledge graphs, social networks, drug discovery and protein bioinformatics [38-41]. There are many kinds of GNNs, such as convolutional GNN, recurrent GNN, graph autoencoder and so on, which mainly generalize the corresponding operations from Euclidean data with grid or sequential structure to graph data. For example, similar to CNN, convolutional GNN generates the representation of a node through aggregating the features of its neighbors within the graph to expand the receptive field of corresponding neuron.
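The neighbor aggregation performed by a convolutional GNN can be sketched as follows. This is a simplified illustration that assumes plain mean aggregation without learned weights or activations, which a real layer would add; the function name `aggregate_neighbors` is hypothetical.

```python
def aggregate_neighbors(features, edges, include_self=True):
    """One round of mean aggregation over an undirected graph.

    features: list of per-node feature vectors (lists of floats)
    edges:    list of (i, j) node index pairs
    """
    n = len(features)
    neighbors = {i: set() for i in range(n)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    updated = []
    for i in range(n):
        group = sorted(neighbors[i] | ({i} if include_self else set()))
        if not group:
            # Isolated node with no self-loop: keep its features unchanged.
            updated.append(list(features[i]))
            continue
        dim = len(features[i])
        updated.append([sum(features[k][d] for k in group) / len(group)
                        for d in range(dim)])
    return updated

if __name__ == "__main__":
    # Chain graph 0 - 1 - 2 with a signal placed on node 0 only.
    feats = [[1.0], [0.0], [0.0]]
    edges = [(0, 1), (1, 2)]
    once = aggregate_neighbors(feats, edges)
    twice = aggregate_neighbors(once, edges)
    print(once, twice)
```

After one round, node 2 has not yet seen the signal from node 0; after two rounds it has, which is the sense in which stacking such layers expands the receptive field.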

Generative models

Unlike discriminative models widely used in protein research, which construct mappings from the space of the input data to that of the output label by maximizing the respective likelihood of samples, generative models such as generative adversarial networks (GANs) [42] and variational auto-encoders (VAEs) [43] try to capture the underlying data distribution of the training set and sample brand new instances according to the learned distribution. It is noteworthy that the relationship between GANs and VAEs is complicated. Although these two frameworks have a large intersection, the VAE architecture could be trained for some models that GANs could not and vice versa. As shown in Figure 1, a GAN generally contains two main parts: a generator and a discriminator. The generator takes samples from a learned distribution while the discriminator distinguishes the generator’s outputs from real samples in the training dataset. Essentially, the joint training procedure of a GAN is a two-player game. Therefore, if both parts have sufficient model capacity and enough network training is implemented, the Nash equilibrium of this specific game would appear and the distribution learned by the generator would be identical to that of the training data. Meanwhile, the architecture of an ordinary VAE is similar to a classical encoder–decoder except that the encoder estimates the mean and variance of a normal distribution instead of producing a latent variable directly. Combining the advantages of the Bayesian method, VAE, with its elegant mathematical foundation, simple structure as well as satisfactory training cost and model performance, has gradually become one of the common options for generative models and has influenced bioinformatics considerably.

Figure 1

An illustration of the internal architectures of (A) VAEs and (B) GANs. Arrows represent corresponding dataflow.
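As a minimal sketch of the VAE encoder output described above, the snippet below draws a latent vector from the estimated normal distribution and computes its Kullback–Leibler divergence to a standard normal prior. The log-variance parameterization and the reparameterized sampling are common practice assumed here for illustration; the function names are hypothetical.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (reparameterized sample)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.3, -1.2], [0.0, -2.0]  # pretend encoder outputs
    print(sample_latent(mu, log_var, rng))
    print(kl_to_standard_normal(mu, log_var))
```

During training, the KL term is typically added to a reconstruction loss so that the latent space stays close to the prior and can later be sampled to generate new instances.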

Deep reinforcement learning

Combining the great fitting power of deep learning for high-dimensional functions and the ability of reinforcement learning to interact with surroundings in various situations, deep reinforcement learning techniques have contributed to many areas including protein design [44-52]. Basically, deep reinforcement learning divides the world into two parts, an environment and an agent. Within every training step, the agent chooses an available action according to its own policy, which slightly changes the environment, and then receives feedback called a reward from the environment. Positive rewards encourage the agent to strengthen its policy, i.e. to make the same choice in a similar situation next time, while negative ones spur the agent to change its policy.
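The agent–environment loop can be sketched with a deliberately tiny example. It assumes a stateless two-action environment and a softmax policy updated with a REINFORCE-style rule, with no neural network involved; `toy_environment` and `train_agent` are hypothetical names used only for this illustration.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_environment(action):
    """Hypothetical environment: action 1 is rewarded, action 0 is penalized."""
    return 1.0 if action == 1 else -1.0

def train_agent(steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]  # policy parameters, one preference per action
    for _ in range(steps):
        probs = softmax(prefs)
        action = 0 if rng.random() < probs[0] else 1  # act according to policy
        reward = toy_environment(action)              # feedback from environment
        # Policy-gradient update: a positive reward raises the probability
        # of the chosen action, a negative reward lowers it.
        for a in range(len(prefs)):
            indicator = 1.0 if a == action else 0.0
            prefs[a] += lr * reward * (indicator - probs[a])
    return softmax(prefs)

if __name__ == "__main__":
    print(train_agent())
```

In deep reinforcement learning the preference table is replaced by a neural network that maps the observed state to action probabilities, but the reward-driven update has the same direction.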

Deep learning in structure-based protein design

Structure-based protein design could be treated as the reverse process of protein structure prediction. For the latter, some potential structures should be modeled for a given sequence, while for the former, some feasible sequences should be optimized for a backbone with the designed topology (Figure 2). Protein homology plays an important role in protein structure prediction, providing massive evolutionary information for precise inferences. Recently, deep learning has revolutionized protein structure prediction in many ways, from early efforts in the protein inter-residue contact prediction and contact-assisted structure modeling [31, 53–57] to the subsequent accurate prediction of inter-residue geometric properties and geometric-constraint-based protein folding [32, 58–62]. Furthermore, attention networks with the most advanced end-to-end training procedure developed by Google DeepMind shocked the public in the 14th Critical Assessment of protein Structure Prediction (CASP) experiments by providing a wonderful solution for the structure prediction of single-domain proteins [63-65]. Deep learning techniques utilized in protein structure prediction like the convolutional neural networks could efficiently capture fold-level structural features from co-evolutionary information harbored in the multiple sequence alignment [66]. These successes deepened our understanding of the sequence–structure relationship for proteins, which is also the foundation of structure-based design, and provided a bunch of practical tools that could be directly used in design problems.

Figure 2

An illustration of two inverse processes, i.e. protein structure prediction (upper) and structure-based protein design (lower).

In addition to the indirect improvement of protein design through advances in structure prediction, customized deep learning approaches have also made considerable direct contributions to protein design. Novel network architectures, training procedures and data manipulations aiming to serve various design objectives in diverse design stages have sprung up continuously, vigorously promoting the exploration of proteins. We will detail these novelties, illustrate the differences between these approaches and conventional knowledge-based ones, and articulate their significance in the following sections.

Backbone sampling and generation

Functions and structures of proteins are closely correlated. A protein will perform its unique function only when its specific 3D structure is correctly folded. Hence, generating a backbone conformation under some particular design purposes becomes the first step in general protein design routines. Just like the immense space of protein sequence, the space of backbone structure is also extremely vast, with thousands of degrees of freedom even for small peptides. Nevertheless, designable backbones usually cluster into minute regions that disperse sparsely in the space [67], because protein domains stabilized by complicated atom-level forces like hydrogen bonds and hydrophobic interactions have to adopt exquisite shapes with well packed cores and properly exposed interfaces. The earliest routines redesigned existing native protein structures to get possible backbones with improved structural stability and perhaps new functions [68, 69] or systematically sampled helical bundles [23, 70, 71] under the constraints of Crick’s parameterization. Ensuing de novo design methods generated protein backbones mainly through the combination of fragment-assembly-based simulations and human intuition [20, 72–81], exemplified by the famous Top7 mentioned previously [20]. As shown in Table 1, modern deep-learning-based approaches trained generative models to either generate 2D inter-residue geometric feature maps of sampled backbones or directly output their atom coordinates.

Table 1

Brief summary of recent researches focused on structure-based protein design

Reference order	Research objective	Data resource	Network architecture
[82]	Complete corrupted structures	Protein structures from PDB database	dcGAN
[91]	Hallucinate novel proteins through protein structure prediction networks	Completely arbitrary protein sequences with fixed length of 100 amino acids	trRosetta network within residue substitution step of a simulated annealing trajectory
[93]	Generate coordinates of immunoglobulin backbones	Antibody structures from AbDb database	VAE
[39]	Generate protein sequence with given geometric and amino acid constraints	Proteins extracted from UniProt database, sequence repository Gene3D	GNN
[110]	Optimize over protein sequences and structures simultaneously by backpropagating gradients through protein structure prediction networks	Proteins collected from a structure-refinement research (redundancy with trRosetta training set were reduced)	trRosetta network
[61]	Rate candidate predicted structures without explicit standards and answers	Known correct rankings	RankNet and LambdaRank

To maximize the usage of limited exhibition space in this paper, we only choose one research as representative from a bunch of researches with similar objectives or procedures.

GANs were used to generate protein inter-residue distance maps for the completion of corrupted structures [82-84]. This task aimed to infer plausible backbones of missing residues for the target protein, analogous to an image-inpainting problem, i.e. inpainting a large distance map with small fixed-size patches (Figure 3). For example, deep convolutional GAN (dcGAN) [85] was chosen to learn a mapping from a low-dimensional standard normal distribution z to an unknown high-dimensional probability distribution in the space of protein inter-residue distance maps with a fixed size [82]. After inpainting, backbone structures were obtained using either the alternating direction method of multipliers (ADMM) algorithm [86] to trace Cα positions with concrete coordinates or Rosetta [21] to sample fragments according to the generated distance constraints. Although satisfactory outcomes have been achieved by these works, some limitations still exist. For example, the distance maps generated via the dcGAN method mentioned above [82] were restricted to 16-, 64- and 128-residue fragments instead of arbitrary lengths because of the intrinsic properties of dcGAN. This shortcoming, especially the inability to handle larger protein fragments, reduced its practicality. Meanwhile, VAEs that performed conditional generation through the introduction of a representative latent space were also shown to be very useful for protein backbone design [87-89]. With all these successful trials, the ability of generative models to produce protein backbones with multiple structural elements (e.g. secondary structures) has been validated, and further related research is likely to go deeper in the near future.
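The inter-residue distance map that serves as the image to be inpainted can be sketched as follows. The snippet assumes Cα coordinates given as 3D tuples and marks entries involving missing residues with `None`; both function names and the masking convention are hypothetical choices for this illustration and not the preprocessing of the cited works.

```python
import math

def distance_map(ca_coords):
    """Pairwise Euclidean distances between C-alpha atoms (symmetric matrix)."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]

def mask_residues(dmap, missing, fill=None):
    """Blank out every row and column that involves a missing residue."""
    missing = set(missing)
    return [[fill if (i in missing or j in missing) else d
             for j, d in enumerate(row)]
            for i, row in enumerate(dmap)]

if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
    dmap = distance_map(coords)
    corrupted = mask_residues(dmap, missing=[1])
    for row in corrupted:
        print(row)
```

A generative model would then be asked to fill the blanked entries with plausible distances, after which coordinates are recovered from the completed map.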

Figure 3

GANs are used as an inpainting tool to repair the inter-residue distance map for a corrupted protein structure. The missing part of the original corrupted distance map (upper) is highlighted with green dashed squares and the corresponding structure is represented in cyan (dotted line for corruptions). The distance map is repaired (lower) and the structure translated from it is represented in violet.

Deep neural networks originally trained for image recognition could be used to generate ‘hallucinations’ with a transformed style [90]. Similarly, information of protein sequence–structure relationships stored in billions of parameters in the powerful protein structure prediction networks could also be utilized inversely to generate new sequences and structures [91]. Completely random sequences of 100 residues were fed into trRosetta network [32], a well-performed predictor of protein inter-residue geometric properties based on sequence alignments, to derive the background inter-residue distance distributions. Then, a Monte Carlo simulated annealing trajectory was produced for each initial random sequence to iteratively optimize this sequence and get compatible structures. Within the trajectory, a random single residue substitution was initiated at an arbitrary position and the distance distribution map of this mutated sequence would be immediately predicted by trRosetta for every time step. This substitution would be accepted only if the Kullback–Leibler divergence between distance distributions of the new sequence and corresponding background satisfied the Metropolis criterion. Through this procedure, diverse sequences and designable structures not observed in nature were generated. Subsequent in vitro synthesis showed that these ‘protein hallucinations’ were monomeric and stable, possessing designed structural elements. 
Furthermore, although constructed through trRosetta [32], this hallucination approach could be easily extended to more advanced protein structure prediction networks like AlphaFold2 [63] and RoseTTAFold [65] to improve its ‘hallucinating power.’ The significance of this work is not limited in showing a feasible exploration for structure or sequence generation. More importantly, it also exhibits a more straightforward avenue to construct supporting scaffolds around predetermined activation sites for protein design, where structures are not required to be mapped out beforehand. The translation from protein inter-residue geometric matrices to backbone coordinates could also be undertaken by approaches related to deep learning [32, 61, 92]. Some of them incorporated energy based optimization [32, 93] while others employed self-adaptive data screening [61]. It is notable that some researches skipped two-dimensional structural representations and generated backbones with 3D atom coordinates directly. For example, a VAE-based architecture modeled backbone flexibility of immunoglobulin proteins via catching the related structure distribution, compressing it into a low-dimensional latent space and interpolating that space to sample structures with predefined complementarity determining regions (CDRs) [93].
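The hallucination trajectory described above can be sketched as a simulated annealing loop. This is only an illustration under strong assumptions: `toy_predictor` is a hypothetical stand-in for a structure prediction network such as trRosetta, the background is taken to be flat, and the objective is assumed to be an increase of the mean Kullback–Leibler divergence from that background. None of these choices reproduces the published implementation.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mean_kl(predicted, background):
    return sum(kl_divergence(p, q)
               for p, q in zip(predicted, background)) / len(predicted)

def toy_predictor(seq):
    """Hypothetical stand-in for a prediction network: one 2-bin 'distance
    distribution' per adjacent residue pair, sharper when a hydrophobic
    residue neighbors a polar one."""
    hydrophobic = set("AFILMVWY")
    out = []
    for a, b in zip(seq, seq[1:]):
        sharp = 0.9 if (a in hydrophobic) != (b in hydrophobic) else 0.6
        out.append([sharp, 1.0 - sharp])
    return out

def hallucinate(length=20, steps=2000, t_start=0.1, t_end=0.001,
                seed=0, predict=toy_predictor):
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    background = [[0.5, 0.5]] * (length - 1)  # assumed flat background
    score = mean_kl(predict(seq), background)
    start_score = score
    for step in range(steps):
        # Exponential cooling from t_start to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        trial = list(seq)
        trial[rng.randrange(length)] = rng.choice(AMINO_ACIDS)  # one substitution
        trial_score = mean_kl(predict(trial), background)
        delta = trial_score - score
        # Metropolis criterion: always accept improvements, sometimes accept
        # worse moves, with a probability that shrinks as the system cools.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            seq, score = trial, trial_score
    return "".join(seq), start_score, score

if __name__ == "__main__":
    print(hallucinate())
```

Replacing `toy_predictor` with a real network changes only the scoring call; the accept or reject logic of the trajectory stays the same.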

Sequence design in protein fitness landscape

Almost all information of a protein is encoded in its sequence. However, inferring possible sequences for a predefined structure with desired function from the vast multidimensional sequence space, termed the protein fitness landscape [94], is extremely difficult and cannot be handled by brute force, considering the countless permutations formed by the 20 usual proteinogenic amino acids [95]. Generally, protein fitness landscape searching methods cluster residue side-chain conformations as different rotamers [96], abstract the sequence optimization of a given backbone to a discrete energy minimization problem and then search combinations of rotamers around the global minimum [97]. The energy optimization process is analogous to mountain hiking (minimizing energy is equivalent to maximizing its opposite), during which a hiker tries to arrive at the global optimal point through a meandering route consisting of multiple tiny trail steps. Despite the previous achievements, traditional approaches confronted restrictions such as the inability to design multi-body interactions and the excessive homology of outputs. Although similar in general, the learning process of a deep neural network differs from conventional energy minimization in several ways. Thus, deep learning with its intrinsic advantages and training techniques accumulated in earlier research could substantially mitigate the limitations of regular procedures either by replacing the entire optimization routine or by introducing local improvements within their frameworks. Deterministic approaches could solve the fitness problem accurately for small backbones [98] but become impractical for large ones due to the exponential increase of computational complexity. Statistical sampling methods, exemplified by Monte Carlo simulations, have been used to solve this dilemma and could achieve acceptable approximations in practice [99]. 
Because the backbone energy evaluated by existing force fields is highly sensitive to conformational changes, backbone flexibility is usually considered in these methods by simultaneously optimizing rotamers and the corresponding local structures [100-103]. Besides, the hydrogen-bond network is also an important point that should be carefully attended to in sequence optimization procedures [104]. Deep learning approaches excel at optimizing the joint probability of residues under the given backbone constraints. Thus, applying them to sequence fitness problem could effectively alleviate or even address the challenges in conventional methods. In analogy to Sudoku puzzles, a deep GNN called ProteinSolver was proposed by converting the sequence fitness for a predetermined backbone into a constraint satisfaction problem, where amino acids were assigned such that the atom-level inter-residue forces could be compatible with the given fold [39]. Through training on more than 70 million sequences corresponding to over 80 thousand structures, GNN elucidated the rules governing these constraints by inferring hitherto hidden patterns. Unlike other works in this topic that mainly used computational metrics to validate the accuracy and quality of their designs, in vitro validations of ProteinSolver by circular dichroism experiments testified its capability to fit protein sequences. It is noteworthy that ProteinSolver was only trained and tested with the constraints derived from existing proteins, and thus, its ability to sample reasonable sequences of novel proteins still needs further validation. Another method based on conditional generative model and graph representation also improved the reliability and computational speed of sequence fitness compared with traditional methods like Rosetta [38]. More specifically, in this work, a spatial k-nearest neighbor graph was used in a multi-head self-attention encoder to develop the backbone representation independent of sequence. 
Then, conditioned on previously generated s amino acids and the given structure, the (s + 1) th residue was predicted autoregressively by a decoder, similar to common procedures in language modeling. Other deep-learning-based methods constructed their networks with various architectures including auto-encoders [105], 3D convolutional neural networks [106], DenseNets [107] and GANs [84] to predict sequence probability profiles from a given backbone structure. Since these data-driven approaches are capable of assimilating co-evolutionary information from protein sequence databases, integrating high-dimensional hints, catching the inconspicuous internal patterns and deducing the most possible solutions, protein sequence profiles generated by them usually exhibit better agreement with the natural molecular evolution than those profiles sampled by conventional knowledge-based methods lacking the help of deep learning. Deep learning also contributes to the energy evaluation process of protein fitness landscape searching. In comparison to traditional knowledge-based energy functions that are typically combinations of statistical and empirical potential terms [21, 97, 108], deep learning models could provide a more general and more accurate description of the multidimensional potential functions in the real world. A 3D convolutional neural network was trained in an autoregressive manner to learn the distribution of sequences conditioned on a predetermined backbone directly from the protein structure data [109]. In absence of any human-specified priors, potentials learned by this network could precisely predict side chain conformations without using any conventional forcefields. In vitro experimental data, especially the high-resolution crystal structures of two designed TM-barrel proteins, validated the design capability of this network and corresponding structural agreements. 
Compared with the classical molecular mechanics force fields with great complexity and cost, this data-driven method only needed a few hours for training, which exhibited its practical applicability and huge potentiality. In addition, networks originally constructed for protein structure prediction could also be repurposed for sequence design by energy landscape optimization. With gradients backpropagated from the predefined structures to input protein sequences through the trRosetta network [32], sequences and structures could be optimized simultaneously [110]. This research hints that future combination of the low-resolution trRosetta model that considers the full conformational landscape and the high-resolution Rosetta model that is good at single point energy estimation would further improve protein design methodologies.

Scoring function and candidate rating

Usually, iterations of sequence–structure optimization produce a set of candidate sequences. To lighten the burden of downstream laboratory synthesis, it is necessary to select a small subset of candidates that are most likely to have the intended protein properties and functions. A typical approach is to rank all candidates by scoring functions and retain only the top k. One of the most frequently used scoring functions is the potential energy mentioned above, since the chosen sequence should be able to fold into the correct topology with acceptable stability. Candidate rating is thus often simplified to identifying the sequence–structure pairs with the lowest energies. Some of this work has already been summarized in the last two sections, since this step is closely related to the previous steps and many studies integrate them all together. Scoring functions in the Rosetta program range from statistical potentials established using Bayesian methods [111] to complicated modern force fields [112]. Thus, the rating systems of many protein design routines are derived from Rosetta. Meanwhile, a distinct approach introduces deep ranking networks from recommendation systems, called RankNet and LambdaRank [113], for candidate rating [61]. Instead of directly optimizing potential terms for precise energy estimation, these networks update themselves according to the discrete ranking fitness, i.e. the difference between the current order ranked by the network and the expected one. Although this work was originally proposed to address the protein structure prediction problem, the underlying concept could be easily generalized to protein design.
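The two rating strategies described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited works: the function names are hypothetical, candidates are assumed to come with precomputed energy-like scores (lower is better), and the pairwise loss is the standard RankNet formulation, i.e. a cross-entropy on the sigmoid of the score difference between two candidates.

```python
import math

def top_k(candidates, scores, k):
    """Return the k candidates with the lowest scores (lowest energy first)."""
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0])
    return [candidate for _, candidate in ranked[:k]]

def ranknet_pair_loss(s_i, s_j, i_should_rank_higher=True):
    """Standard RankNet pairwise loss for two network outputs s_i and s_j.

    The network is penalized for the order it assigns to the pair,
    not for the absolute error of either score.
    """
    p_ij = 1.0 / (1.0 + math.exp(-(s_i - s_j)))  # P(i ranked above j)
    target = 1.0 if i_should_rank_higher else 0.0
    eps = 1e-12  # guards against log(0)
    return -(target * math.log(p_ij + eps)
             + (1.0 - target) * math.log(1.0 - p_ij + eps))

# Usage: keep the two lowest-energy designs out of three.
selected = top_k(["seqA", "seqB", "seqC"], [-1.0, -3.0, -2.0], k=2)
```

The ranking loss is small when the network orders a pair correctly and grows as the order is violated, which is why such networks can be trained on ordering information alone, without precise energy estimates.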

Deep learning in direct sequence design

As described above, the major task of protein design is to find sequences capable of stably exhibiting desired properties and performing expected functions. A longer information pathway with more transit points generally introduces unnecessary transformation and transmission of data, which might cause larger signal deviations. Thus, in principle, directly mapping between the spaces of protein sequence and function seems advantageous over design procedures that need predetermined structural topologies as intermediates. More importantly, due to advances in sequencing technology, protein sequence data accumulate much faster than structural data, especially since the introduction of metagenomics [114]. The tremendous number of unlabeled sequences, in combination with the powerful capability of deep learning for feature extraction, pattern recognition and objective generation, makes it possible and valuable to directly explore the sequence space and improve the protein design paradigm. Different from protein fitness landscape searching for a given backbone, direct sequence design learns a meaningful distribution of sequence representations in a latent space and generates sequences in real space according to speculative representations derived from the learned distribution (Figure 4). Therefore, generative models are more widely used in this area than discriminative ones (as shown in Table 2). In this section, we focus on two major aspects of direct protein sequence design, using concrete cases to review past achievements and anticipate future trends.

Figure 4

An illustration of protein representation learning, direct protein sequence design and related downstream protein analysis applications. Protein representations with fundamental features are obtained through protein language models (bottom). In combination with different kinds of top models, these representation vectors could be used for either protein sequence design or other analysis tasks (top).

Table 2

Brief summary of recent studies focused on direct protein sequence design

Reference	Research objective	Data resource	Network architecture
[33]	Extract fundamental features of unlabeled protein sequences into a statistical representation	Protein sequences from UniRef50 database	mLSTM RNN
[37]	Train a deep contextual protein language model to produce generalized features	Protein sequences from UniParc database	Transformer
[34]	Build precise virtual protein fitness landscape based on protein sequence representation	A few mutants of natural target protein and their functional characterizations	Single-layer linear regression model on top of UniRep
[127]	Generate synthetic genes coding proteins with desirable functions or biophysical properties	Peptides with 5–50 residues from UniProt dataset	WGAN with an external feedback loop
[121]	Generate functional protein sequences by learning natural sequence diversity	Bacterial MDH sequences from UniProt dataset	Tailored GAN with temporal convolution and self-attention

To make the best use of the limited space in this paper, we list only one representative study from each group of studies with similar objectives or procedures.

Representation learning

Although deep learning has shown huge success in many sub-fields of protein bioinformatics, two major obstacles still impede its further development. The first is the high cost of protein characterization, which leads to a scarcity of sequence-label pairs for training deep neural networks. The second is the lack of method generalization, since most domain-specific deep learning methods have not sufficiently exploited the fundamental features of protein sequences and are thus hard to transfer from one problem to another through simple fine-tuning. One possible solution to these obstacles is representation learning using protein language models. Protein sequences and natural language both have internal long-range dependencies between distant contexts. Thus, inspired by natural language processing [115], protein language models treat a complete sequence as a paragraph or a sentence and the amino acids within it as single words [116, 117]. Through supervised or unsupervised training, a dictionary of word vectors, i.e. amino acid embeddings, could be optimized and the representation of a protein sequence with its fundamental features could be inferred in a latent space. A method called unified representation (UniRep) trained a multiplicative long short-term memory RNN (mLSTM RNN) [35] with 1900 hidden units to learn the fundamental representation of protein sequences and encode arbitrary sequences into fixed-length vectors [33]. UniRep was trained with approximately 24 million sequences from the UniRef50 database [118], and its self-supervised training procedure [119] used the input sequences themselves as the corresponding labels. Specifically, it iterated through the amino acids of a sequence sequentially and compared the true next residue with the one predicted by the model based on its dynamic summary of all previously visited residues. 
With this training procedure, the UniRep model gradually maximized the conditional probability of the correct amino acid type for the next residue and learned a progressively better protein sequence representation by adjusting its parameters and revising the way it constructed its hidden state. Although no evolutionary, structural, physicochemical or other related data were provided explicitly, the representation vectors of protein sequences encoded by UniRep intrinsically contained such information and thus could be easily clustered by these properties. When evaluated on a comprehensive set of critical protein engineering problems, UniRep with simple linear or non-linear models trained on top of it showed generalizable and superior performance. Although the data mining ability of the RNN architecture used by UniRep might be inferior to that of architectures currently popular in natural language processing, such as the transformer, the basic concepts it introduced and the impressive extensions it showed strongly influenced subsequent studies. Trained on 250 million broad and diverse sequences from the UniParc database [120], a deep transformer called ESM-1b also learned protein sequence representations with fundamental features [37]. The model consisted of 33 layers with around 650 M parameters. It was trained with another self-supervised strategy, the masked language modeling objective. The ESM-1b Transformer integrated residue contexts across the entire input protein sequence through many stacked self-attention modules. It constructed a complicated representation space for protein sequences. Representation vectors derived from this space carried distinguishable protein features of the corresponding sequences. For example, secondary and tertiary structural properties could be identified from the generated sequence representations. 
Its superiority over other state-of-the-art input features across a wide range of applications, such as mutational effect prediction, further demonstrated its generalizability and advantages. Furthermore, with the rapid accumulation of protein sequence data and the use of network architectures with higher complexity and capability, future versions of ESM-1b are expected to bring additional improvements in protein sequence representation. However, ordinary small research groups cannot afford the training cost of such a huge protein language model, and it would be pointless for the whole academic community to rebuild these infrastructures repeatedly. Thus, the spirit of sharing shown in this work and in many other well-known studies should be advocated and maintained. Other works focusing on representation learning adopted deep generative model architectures such as GANs, VAEs and autoregressive models. They compressed discrete protein sequences into a continuous latent space by capturing contextual information within these sequences [121-123]. For example, trained on sequences from the Swiss-Prot database, a VAE model called BioSeqVAE learned good sequence representations, which could be used as input features for multiple downstream applications [124]. Since different studies of representation learning generally use self-built datasets and have no unified evaluation process or standard, it is difficult to compare them and weigh the accuracy, efficiency, advantages and disadvantages of each [125]. Hence, one work introduced a set of protein bioinformatics tasks with clear definitions, data and assessment metrics to construct a standard evaluation system for protein transfer learning [126]. This task set, called tasks assessing protein embeddings (TAPE), contained five concrete problems within three major aspects: protein structure prediction, remote protein homolog detection and protein design. 
The authors also benchmarked several representation learning methods, and their methodology could be easily generalized to the recent works mentioned above.
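The two self-supervised objectives discussed in this section, next-residue prediction (as used by UniRep) and masked language modeling (as used by ESM-1b), differ mainly in how training examples are built from an unlabeled sequence. The following Python sketch illustrates only that data preparation step and the per-residue loss; it is not the code of either model. The mask symbol, the 15% masking rate and all function names are assumptions made for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical mask symbol, not part of the amino acid alphabet

def next_residue_pairs(seq):
    """Autoregressive examples: each prefix is paired with its true next residue."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Masked-LM example: hide random residues and remember them as targets.

    The 15% rate is an assumed default, not a value taken from the text.
    """
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[i] = aa  # the model must recover this residue
        else:
            masked.append(aa)
    return "".join(masked), targets

def cross_entropy(predicted_probs, true_residue):
    """Loss for one position: negative log-probability of the true residue."""
    return -math.log(predicted_probs[true_residue])

# Usage: an uninformed model that predicts all 20 residues uniformly.
uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
pairs = next_residue_pairs("MKV")          # [("M", "K"), ("MK", "V")]
loss = cross_entropy(uniform, pairs[0][1])  # equals log(20)
```

In both cases the sequence supplies its own labels, so training reduces this loss below the uniform baseline without any experimental annotation.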

Sequence generation

Representation learning has laid a solid foundation for sequence generation. By condensing, integrating and extracting fundamental features within sequence statistics, learned representations embody protein properties such as function, structure, stability, dynamics, half-life and binding. Therefore, in combination with downstream generative models or methods, proteins with desired functions but previously unseen sequences could be generated in a high-throughput manner. For example, a low-N protein engineering method [34] was reported based on the representation learning of UniRep [33]. A simple supervised top model taking the sequence representations as input was trained on a limited number of functionally assayed random mutants of the target protein (as few as 24 sequences, hence the name ‘low-N’) to rate arbitrary sequences. Then, in silico directed evolution was executed through a Markov Chain Monte Carlo procedure on the surrogate fitness landscape provided by the sequence representations and the rating model. GANs also played an important role in direct protein sequence generation. A Wasserstein GAN (WGAN) [84] combined with a novel external feedback-loop mechanism (denoted as a function analyzer) was trained to generate DNA sequences encoding proteins [127]. The function analyzer could take any form, differentiable or non-differentiable, as long as it took a sequence as input and output a score. The training procedure of this so-called FBGAN system contained two parts. Firstly, the WGAN was pretrained to generate valid genes using general DNA sequences reverse-translated from protein sequences. Secondly, within every training step, sequences produced by the generator of the WGAN were fed into the function analyzer to evaluate their related properties, and those with scores exceeding a predetermined threshold were chosen to replace the oldest samples in the original training set of the discriminator. 
This feedback-loop mechanism fine-tuned the distribution mapping between the latent space and the real DNA sequence space for specific downstream optimization objectives. Successful applications to the generation of antimicrobial peptides [128] and helical proteins supported the good generalizability of this model. In addition, this unique network architecture and training procedure could be easily extended to other domains beyond genomics and protein sequences. However, this work also involved some compromises. For example, FBGAN focused on gene sequence generation although the research objective was protein sequence design, because gene sequences had clear codon structures to indicate start/stop positions and a much simpler vocabulary (only four nucleotides) than proteins. Thus, direct generation of longer and more complex protein sequences remains an important task for follow-up studies of FBGAN. ProteinGAN was another GAN architecture constructed to expand the functional protein sequence space [121]. Implemented with a customized temporal convolutional network [129] and a self-attention mechanism [130], ProteinGAN could not only learn useful sequence motifs and critical long-range inter-residue interactions simultaneously, but also concentrate on functional areas such as catalytic centers. To validate its contribution to real protein engineering, ProteinGAN was trained on a family of bacterial malate dehydrogenase (MDH) enzymes. By uniformly interpolating the latent space, the model successfully generated 20 thousand protein sequences exhibiting sequence properties highly correlated with the latent dimensions, which supported its ability to capture the intrinsic features of native sequences and their inter-relationships. 
Among 55 generated sequences tested experimentally, 24% were stable in physiological solutions and showed evident catalytic activity, which further demonstrated the potential of the model to generate new, diverse functional protein sequences. Other direct sequence generative models adopted different architectures suited to specific generation demands [87, 88, 122, 131]. For example, an attention-based transformer model was trained on the Swiss-Prot database to generate functional signal peptide sequences, and experimental tests confirmed its practicality [131]. Distinct networks with various customized training procedures all serve one similar goal: learning to sample diverse protein sequences that are previously unseen in nature and to increase the likelihood that they satisfy the desired criteria.
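The in silico directed evolution step of the low-N approach described above can be sketched as a Metropolis-style Markov Chain Monte Carlo search over single-residue mutations. The Python code below is a simplified illustration under stated assumptions, not the published procedure: the surrogate fitness function is a toy stand-in for a top model trained on sequence representations, and the temperature, step count and function names are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq, rng):
    """Propose a single-residue substitution at a random position."""
    pos = rng.randrange(len(seq))
    new_aa = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
    return seq[:pos] + new_aa + seq[pos + 1:]

def metropolis_search(start, fitness, steps=500, temperature=0.1, seed=0):
    """Metropolis sampling that favors sequences with higher surrogate fitness."""
    rng = random.Random(seed)
    current, current_fit = start, fitness(start)
    best, best_fit = current, current_fit
    for _ in range(steps):
        proposal = mutate(current, rng)
        proposal_fit = fitness(proposal)
        delta = proposal_fit - current_fit
        # Always accept improvements; accept worse proposals with
        # probability exp(delta / temperature) to escape local optima.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_fit = proposal, proposal_fit
            if current_fit > best_fit:
                best, best_fit = current, current_fit
    return best, best_fit

def toy_fitness(seq):
    """Hypothetical surrogate: fraction of alanine residues (illustration only)."""
    return seq.count("A") / len(seq)

# Usage: evolve a short starting sequence on the toy landscape.
best_seq, best_score = metropolis_search("MKVLG", toy_fitness)
```

In a real application, `toy_fitness` would be replaced by the supervised rating model, so that every proposal is scored computationally and only the final candidates are synthesized.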

Design with deep reinforcement learning

Protein design approaches based on deep reinforcement learning resemble in silico simulations of natural protein synthesis processes (Figure 5). With the application of more advanced technologies, these methods can help us uncover more intrinsic principles of proteins and obtain more high-quality functional protein materials. For example, DyNA PPO [132] is a deep reinforcement learning model for sequence design based on proximal policy optimization [133]. The model generated sequences from left to right, one amino acid after another, with the overall procedure regarded as a Markov decision process. Until sequence generation was complete, the reward to the agent remained 0. At the end of each round, the sequence fitness estimated by a panel of machine learning models serving as surrogate fitness functions was taken as the final reward. DyNA PPO balanced the tradeoff in reward estimation by using several models to learn different aspects of the sequence fitness landscape but only using the most suitable one with sufficient accuracy to update its policy. Although its superiority was shown in large-scale benchmarking across several methods, the report of DyNA PPO did not include any verification through wet lab experiments. Thus, its practicability still needs to be verified in future studies. Alternatively, reinforcement learning could be used to fine-tune pre-trained generative models for protein design. For example, an RNN was tuned by a policy-based reinforcement learning approach to generate desirable compounds [134]. The most important inspiration from this research is its successful attempt to reduce the risk of catastrophic forgetting [135], a common problem for protein generative models.
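The Markov decision process described above, with zero reward until the sequence is complete and a surrogate fitness estimate as the terminal reward, can be sketched in a few lines of Python. This is an illustrative sketch only, not the DyNA PPO implementation: the policy is a uniform random placeholder rather than a trained network, the policy update itself is omitted, and the model selection rule, threshold and all names are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def select_reward_model(models, validation_scores, threshold):
    """Pick the best-validated surrogate model, or None if none is accurate enough."""
    best = max(range(len(models)), key=lambda i: validation_scores[i])
    return models[best] if validation_scores[best] >= threshold else None

def rollout(policy, reward_model, length, rng):
    """Generate one sequence left to right; only the last step is rewarded."""
    state = ""  # the state is the partial sequence built so far
    trajectory = []
    for step in range(length):
        action = policy(state, rng)  # the action is the next amino acid
        next_state = state + action
        is_last = step == length - 1
        reward = reward_model(next_state) if is_last else 0.0
        trajectory.append((state, action, reward))
        state = next_state
    return state, trajectory

def uniform_policy(state, rng):
    """Placeholder policy: ignores the state and samples uniformly."""
    return rng.choice(AMINO_ACIDS)

def toy_reward(seq):
    """Hypothetical surrogate fitness: fraction of alanine residues."""
    return seq.count("A") / len(seq)

# Usage: one episode of length 6 with the toy surrogate as reward model.
sequence, steps = rollout(uniform_policy, toy_reward, 6, random.Random(0))
```

A policy-gradient method such as proximal policy optimization would then use the recorded (state, action, reward) tuples to increase the probability of actions that led to high terminal rewards.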

Figure 5

Deep-reinforcement-learning-based protein design is analogous to natural protein synthesis process. (A) An illustration of the natural protein synthesis process. (B) Protein sequence generation from left to right by deep reinforcement learning. The agent takes an available action (what kind of amino acid to pick in the next step) according to its policy conditioned on the current state.

Conclusions and future perspectives

In the last decade, protein design has achieved great successes, helping mankind deal with social challenges on multiple fronts. Examples include designed small-molecule binding proteins that are used in in vivo biosensors [136, 137], designed biomedical inhibitors that aim to prevent viral infections [138], designed enzymes that have attractive catalytic efficiencies [139-141] and designed highly symmetric self-assembling materials that endow vaccine applications with multivalent presentation of antigens [10, 142]. Recently, deep learning techniques have had a preliminary but impressive impact on the field of protein design. Through their power to extract and integrate statistical patterns within existing protein data, artificial deep neural networks learn fundamental protein features, store them in billions of parameters and generalize them for inference in different sub-fields. However, roadblocks still stand in the way of routinely designing arbitrary proteins using deep learning methods. For example, our understanding of the protein folding mechanism, one of the most important and essential problems in bioinformatics and the paramount theoretical principle behind all kinds of protein design methods, is far from sufficient. Many efforts combining deep learning, physical modeling and in silico simulations have been made in this area. Deep reinforcement learning that tries to build policies and find possible trajectories from extended protein chains to well-folded structures might also be helpful. Diverse and abundant well-annotated data are necessary for all fields adopting deep learning, as shown by the influence of the ImageNet database [143] on the development of computer vision. However, for protein design with a specific objective, related data on protein functions and properties are usually not only scarce but also collected without unified and standard experimental conditions. 
The scarcity of training data hinders accurate design and consequently leads to a demand for additional experimental optimization. Although some databases, exemplified by ProtaBank [144], have been constructed to alleviate this problem, much effort is still needed. Another important direction for overcoming this deficiency might be few-shot learning [145, 146]; to our knowledge, related exploration has not been tried yet. The scoring accuracy and computational speed of energy functions in protein design also need to be further improved, since they guide the optimization direction and are used repeatedly in every step. Compared with traditional potential terms, energy functions learned by deep neural networks evaluate designs more precisely but more slowly. The adoption of more advanced and lightweight network architectures, as well as knowledge distillation [147] and network pruning [148], may partially resolve this dilemma. Another difficulty for both protein design and its reverse procedure, protein structure prediction, is that current optimization approaches are usually adept at landscapes with only one minimum, while many proteins perform their functions through structural transformation among different conformations. This requires deep learning methods to design proteins with multiple and distinct energy minima. Future researchers should attend to such complexity. Another important and pressing task for deep-learning-based protein design is broadening its application scope. Many studies in this field focused on algorithm development and in silico evaluation, with very few experimental verifications and practical applications. Taking pharmacy and therapeutics as an example, although conventional drug discovery methodologies centered on molecular dynamics simulations and molecular docking [40] have made great achievements, protein design approaches are gradually showing their impressive capability and promising future. 
There are many roadmaps involving protein design in this field, which aim at various diseases afflicting human beings. One possible procedure is designing a modular protein sensor-actuator switch in which small ligands could directly change the downstream transduction of the corresponding cellular signaling pathway by binding to the designed targets [6, 73, 149]. Another approach might be designing mimetics of natural immune proteins with augmented therapeutic affinity and activity but diminished immunogenicity and toxicity [2, 150, 151]. Besides, by treating short peptides (usually fewer than 50 residues) as small molecules and utilizing knowledge about protein–protein interactions (PPI) [41] instead of drug–target interactions (DTI), high-throughput protein design methods could be constructed for therapeutics with specific targets [4, 152]. In the context of the current worldwide COVID-19 pandemic, protein design is especially important since designed mini-protein inhibitors of ACE2 receptor (coronavirus binder) have provided a good start for corresponding therapeutics [5, 153]. However, almost all of the above successes were achieved by traditional knowledge-based design methodologies. Moving beyond purely in silico work and putting advanced data-driven algorithms into practice should be another key focus of future studies on deep-learning-based protein design. Many challenges confronting protein design could be ameliorated or even resolved by combining deep learning efforts with complementary advances in conventional de novo methods, while others still await the development of new methodologies from the ground up. In either case, proteins are important gifts from nature to mankind, and with the blueprints glimpsed by deep learning, we could craft the tools we desire to make our world a better place through iterations of trial and error. 
Recently, the introduction of deep learning has had a preliminary but transformative influence on the field of protein design. Deep learning could provide fast, high-throughput and precise in silico protein design methodologies. We review current advances in deep-learning-based protein design procedures, mainly from the past 2 years, and illustrate their novelty, advantages and significance in comparison with traditional knowledge-based approaches through important milestones. We also comprehensively discuss the coming challenges and opportunities in the near future. This review could help readers become familiar with this field and promote relevant research.
