Fangping Wan, Daphne Kontogiorgos-Heintz, Cesar de la Fuente-Nunez.
Abstract
Computers can already be programmed for superhuman pattern recognition of images and text. For machines to discover novel molecules, they must first be trained to sort through the many characteristics of molecules and determine which properties should be retained, suppressed, or enhanced to optimize functions of interest. Machines need to be able to understand, read, write, and eventually create new molecules. Today, this creative process relies on deep generative models, which have gained popularity since powerful deep neural networks were introduced into generative model frameworks. In recent years, they have demonstrated an excellent ability to model the complex distributions of real-world data (e.g., images, audio, text, molecules, and biological sequences). Deep generative models can generate data beyond those provided in the training samples, thus yielding an efficient and rapid tool for exploring the massive search space of high-dimensional data such as DNA/protein sequences and facilitating the design of biomolecules with desired functions. Here, we review the emerging field of deep generative models applied to peptide science. In particular, we discuss several popular deep generative model frameworks as well as their applications to generating peptides with various kinds of properties (e.g., antimicrobial, anticancer, and cell-penetrating). We conclude our review with a discussion of current limitations and future perspectives in this emerging field. This journal is © The Royal Society of Chemistry.
Year: 2022 PMID: 35769205 PMCID: PMC9189861 DOI: 10.1039/d1dd00024a
Source DB: PubMed Journal: Digit Discov ISSN: 2635-098X
Peptide generation studies using deep generative models. Abbreviations: NLM, neural language model; VAE, variational autoencoder; GAN, generative adversarial network; AMP, antimicrobial peptide; ACP, anticancer peptide; CPP, cell-penetrating peptide; PMO, phosphorodiamidate morpholino oligomer
| Method | Feature Representation | Application | Citation | Year |
|---|---|---|---|---|
| NLM | One-hot | AMP generation | Müller | 2018 |
| NLM | Character sequence | AMP generation | Nagarajan | 2018 |
| NLM | Character sequence | ACP generation | Grisoni | 2018 |
| NLM | Learned representation using one-hot | Signal peptide generation | Wu | 2020 |
| NLM | Learned representation using structural and evolutionary data | AMP generation | Caceres-Delpiano | 2020 |
| NLM | One-hot | AMP generation | Wang | 2021 |
| NLM | Character sequence | CPP generation | Tran | 2021 |
| NLM | One-hot | AMP generation | Capecchi | 2021 |
| NLM | Fingerprint, one-hot | PMO delivery peptide generation | Schissel | 2021 |
| VAE | Learned representation using character sequence | AMP generation | Das | 2018 |
| VAE | Learned representation using one-hot | AMP generation | Dean | 2020 |
| VAE | Learned representation using character sequence | AMP generation | Das | 2021 |
| GAN | Character sequence | AMP generation | Tucs | 2020 |
| GAN | Character sequence/PDB structure | ACP generation | Rossetto | 2020 |
| GAN | Learned representation using character sequence | AMP generation | Ferrell | 2020 |
| GAN | Character sequence | AMP generation | Oort | 2021 |
| GAN | Sequence of amino acid property vectors | Immunogenic peptide generation | Li | 2021 |
| GAN | Character sequence | AMP generation | Surana | 2021 |
Fig. 1 Peptide design pipeline based on deep generative models. This data-driven approach starts with peptide data curation and conversion of peptides to machine-readable representations. Deep generative models generate novel peptides by taking the above representations and modeling the distribution of the training peptide data. The generated peptides are then examined and validated by wet-lab experimentation. Among the various deep generative models, neural language models (NLMs) either predict the next amino acid based on previously generated amino acids or map source peptides to target ones. Variational autoencoders (VAEs) use an encoder and a decoder to map peptides to latent variables and to generate peptides from latent variables, respectively. In generative adversarial networks (GANs), a generator produces synthetic data while a discriminator distinguishes generated samples from real ones.
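As a concrete illustration of the NLM branch of this pipeline, the sketch below trains a small character-level recurrent model to predict the next amino acid and then samples a new peptide. This is a minimal example written for this review's context, not code from any of the studies in the table above; the architecture, toy peptides, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (illustrative only): a character-level neural language model
# over the 20 canonical amino acids that predicts the next residue given the
# previous ones, as described for NLMs in Fig. 1.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
stoi = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # index 0 = start/end token

def encode(peptide: str) -> torch.Tensor:
    """Map a peptide string to integer indices, framed by start/end tokens (0)."""
    return torch.tensor([0] + [stoi[aa] for aa in peptide] + [0])

class PeptideNLM(nn.Module):
    def __init__(self, vocab_size: int = 21, emb_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, length)
        h, _ = self.rnn(self.embed(tokens))    # (batch, length, hidden)
        return self.head(h)                    # logits for the next residue

# Toy training loop on a few hypothetical peptide sequences.
peptides = ["GIGKFLKK", "KWKLFKKI", "FLPIIAKLL"]
model = PeptideNLM()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for seq in peptides:
    tokens = encode(seq).unsqueeze(0)          # (1, length)
    logits = model(tokens[:, :-1])             # predict each following residue
    loss = loss_fn(logits.reshape(-1, 21), tokens[:, 1:].reshape(-1))
    optim.zero_grad()
    loss.backward()                            # backpropagation (see glossary)
    optim.step()

# Sampling: start from the start token and draw residues until the end token.
with torch.no_grad():
    seq, tokens = [], torch.zeros(1, 1, dtype=torch.long)
    for _ in range(30):
        probs = torch.softmax(model(tokens)[0, -1], dim=-1)
        nxt = torch.multinomial(probs, 1).item()
        if nxt == 0:
            break
        seq.append(AMINO_ACIDS[nxt - 1])
        tokens = torch.cat([tokens, torch.tensor([[nxt]])], dim=1)
    print("".join(seq))
```

In practice, the reviewed studies vary the feature representation (one-hot, character sequence, learned embeddings; see the table above), the architecture, and the training corpus, but the generate-then-validate loop is the same.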
Glossary of machine learning terms that are not strictly defined in this review. All terms are listed in alphabetical order
| Term | Meaning |
|---|---|
| Active learning | Algorithms that select which data samples from a dataset to query for labelling and model training so that the trained model achieves the greatest performance gain |
| Backpropagation | An algorithm that trains neural networks. Gradients of the loss with respect to parameters in a neural net are first calculated by the chain rule. Then, gradient descent is performed to optimize the parameters |
| Bayesian optimization | A sequential search algorithm to optimize computationally expensive black-box functions |
| Bidirectional encoder representations from transformers | A transformer-based language model for feature/representation learning |
| Convolutional neural network | A class of neural networks that uses a series of convolution operations and nonlinear transformations to process structured inputs (e.g., images) |
| Data augmentation | Supplementing a dataset with modified copies of the data or with synthetic data, often used to prevent overfitting and improve prediction performance |
| Decoder | A neural network that converts compressed signals/features (usually represented as low-dimensional vectors) to raw signals |
| Embedding | A mapping from a high-dimensional input to a low-dimensional vector |
| Encoder | A neural network that converts raw inputs/signals to compressed signals/features (usually represented as low-dimensional vectors) |
| Gradient | In machine learning, it usually refers to the partial derivative of the loss with respect to machine learning model parameters (e.g., the weights of a neural network) |
| Gradient descent | An optimization algorithm to minimize a differentiable function by iteratively moving in the opposite direction of the gradient of the function |
| Label | The ground-truth answer that a machine learning model aims to predict |
| Loss | A measure of the difference between the label and the machine learning model's prediction, indicating how far the prediction is from the corresponding label |
| Low-shot learning | Machine learning approaches that train effective models using a small number of training samples. Also known as few-shot learning |
| Multitask learning | Machine learning approaches that train models to solve multiple tasks simultaneously |
| Neural network | A model inspired by the brain that uses a series of nonlinear transformations to process inputs and make predictions |
| Objective | A value (e.g., the loss) that machine learning models aim to minimize or maximize during training |
| One-hot encoding | A vector representation of categorical data wherein all entries are 0 except a single 1 marking the category (see the sketch after this table) |
| Recurrent neural network | A class of neural networks suitable for modelling time series and sequential data by feeding the output (hidden state) from the previous time step into the current time step |
| Training | A process to optimize parameters of machine learning models so that objective(s) are minimized/maximized |
| Training data/dataset/set | Data used to train machine learning models |
| Transfer learning | Machine learning approaches that transfer information/knowledge from one task to another in order to improve the prediction performance of models |
| Transformer | A neural network architecture based on attention for sequence-to-sequence learning |
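To make two of the entries above concrete (one-hot encoding and gradient descent), here is a minimal NumPy sketch; the peptide and the toy loss function are illustrative assumptions, not examples drawn from the reviewed papers.

```python
# Minimal sketch (illustrative only) of two glossary entries:
# one-hot encoding of a peptide and gradient descent on a simple loss.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(peptide: str) -> np.ndarray:
    """Return a (length x 20) matrix: each row is all zeros except a single 1."""
    mat = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for i, aa in enumerate(peptide):
        mat[i, AMINO_ACIDS.index(aa)] = 1.0
    return mat

x = one_hot("KWKLFKKI")              # shape (8, 20), one row per residue

# Gradient descent on a toy differentiable loss L(w) = ||w - target||^2:
# repeatedly step in the direction opposite to the gradient dL/dw = 2(w - target).
target = np.ones(20)
w = np.zeros(20)
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (w - target)          # gradient of the loss with respect to w
    w = w - learning_rate * grad     # move against the gradient
print(float(np.sum((w - target) ** 2)))  # loss approaches 0
```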
Peptide databases covered in this review
| Name | Citation | Labels | Data size | Application |
|---|---|---|---|---|
| UniProt | The UniProt Consortium | Sparse labels | 190 million sequences | Wu, Das, Capecchi, Das, Oort |
| CPPsite 2.0 | Agrawal | Cell-penetrating peptides | 1850 peptides, 1150 used in application | Schissel |
| Pfam | Bateman | Sparse labels | 47 million sequences, 21 million used in application | Caceres-Delpiano |
| DBAASP | Pirtskhalava | Antimicrobial, toxicity, anticancer and hemolytic activity | >15 700 peptides | Wang, Tran, Capecchi, Das, Tucs, Ferrell, Oort |
| ToxinPred's dataset | Gupta | Toxicity | 1805 toxic peptides | Das |
| AVPdb | Qureshi | Antiviral | 2683 peptides | Oort, Surana |
| LAMP | Zhao | Antimicrobial activity | 5547 peptides | Tran, Tucs |
| THPdb | Usmani | FDA-approved therapeutic peptides | 239 peptides | Rossetto |
| CAMP | Thomas | Antimicrobial activity | 3782 peptides | Wang, Tran, Tucs, Surana |
| DRAMP | Kang | Antimicrobial activity | 19 899 peptides | Wang, Surana |
| YADAMP | Piotto | Antimicrobial activity | 2133 peptides | Nagarajan, Wang |
| DADP | Novković | Broad defence activity | 2571 peptides | Müller |
| APD | Wang | Antimicrobial activity | 3273 peptides | Müller, Tran, Dean, Tucs |
| dbAMP | Jhong | Antimicrobial activity | 12 389 peptides | Surana |
| IEDB | Fleri | Immune epitope | >1 million peptides, 8971 used in application | Li |
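The databases above are typically distributed as plain sequence (e.g., FASTA) files; before any of the generative models in the first table can be trained, those sequences go through the curation step shown in Fig. 1. The sketch below is a minimal, hypothetical version of that step: the file name, length cut-offs, and canonical-residue filter are assumptions for illustration, not the settings used by the reviewed studies.

```python
# Minimal sketch of the data-curation step in Fig. 1, assuming peptide
# sequences have been downloaded as a FASTA file (e.g., from DRAMP or DBAASP;
# "amp_sequences.fasta" below is a placeholder file name).
from typing import Iterator

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")

def read_fasta(path: str) -> Iterator[str]:
    """Yield plain sequences from a FASTA file, ignoring header lines."""
    seq = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                    seq = []
            elif line:
                seq.append(line.upper())
    if seq:
        yield "".join(seq)

def curate(path: str, min_len: int = 5, max_len: int = 50) -> list[str]:
    """Keep unique peptides of moderate length made only of canonical residues."""
    kept = set()
    for seq in read_fasta(path):
        if min_len <= len(seq) <= max_len and set(seq) <= CANONICAL:
            kept.add(seq)
    return sorted(kept)

peptides = curate("amp_sequences.fasta")   # placeholder path
print(len(peptides), "peptides retained for model training")
```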