Wiktoria Wilman, Sonia Wróbel, Weronika Bielska, Piotr Deszynski, Paweł Dudzic, Igor Jaszczyszyn, Jędrzej Kaniewski, Jakub Młokosiewicz, Anahita Rouyan, Tadeusz Satława, Sandeep Kumar, Victor Greiff, Konrad Krawczyk.
Abstract
Antibodies are versatile molecular binders with an established and growing role as therapeutics. Computational approaches to developing and designing these molecules are being increasingly used to complement traditional lab-based processes. Nowadays, in silico methods fill multiple elements of the discovery stage, such as characterizing antibody-antigen interactions and identifying developability liabilities. Recently, computational methods tackling such problems have begun to follow machine learning paradigms, in many cases deep learning specifically. This paradigm shift offers improvements in established areas such as structure or binding prediction and opens up new possibilities such as language-based modeling of antibody repertoires or machine-learning-based generation of novel sequences. In this review, we critically examine the recent developments in (deep) machine learning approaches to therapeutic antibody design with implications for fully computational antibody design.
Keywords: antibody; artificial intelligence; deep learning; drug discovery; immunoinformatics; machine learning
Year: 2022 PMID: 35830864 PMCID: PMC9294429 DOI: 10.1093/bib/bbac267
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 13.994
Figure 1. Antibody encoding schemes. (A) One-hot encoding: a sparse vector representation for each residue, with a 1 at the amino acid present and 0s at the remaining positions. (B) Substitution matrix: rather than the 0/1 values of one-hot encoding, each amino acid present receives a score from an amino acid substitution matrix, e.g. BLOSUM. (C) Amino acid properties: similarly to substitution-matrix approaches, scores encapsulate knowledge-based properties, such as size, charge, etc. (D) Learned amino acid properties: embeddings for each amino acid are inferred during training of the network. (E) Encoding of supplementary attributes, such as organism, gene, etc., alongside the amino acid encoding. (F) Encoding of structural features: for invariant representations, structures can be represented by distance matrices or by orientation angles between consecutive amino acids.
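The sequence encodings in panels (A)–(C) of Figure 1 can be sketched in a few lines of NumPy. The snippet below is a minimal illustration of one-hot encoding only; the CDR-H3-like sequence `ARDYYGSSYFDY` is hypothetical, not taken from the review. A substitution-matrix encoding (panel B) would simply replace each one-hot row with the corresponding BLOSUM row for that amino acid.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(seq: str) -> np.ndarray:
    """Panel (A): an L x 20 sparse matrix with a single 1 per row."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x

cdrh3 = "ARDYYGSSYFDY"  # hypothetical CDR-H3 sequence, for illustration only
encoded = one_hot_encode(cdrh3)
print(encoded.shape)        # (12, 20)
print(encoded.sum(axis=1))  # every row sums to 1: one residue per position
```

Richer encodings (panels C–F) keep the same L x d layout but swap the sparse rows for property scores, learned embeddings or structural descriptors.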
Figure 2. Some common neural network architectures and concepts in the context of antibody-specific problems. Simplified examples are given to show potential applications on sequence or structural inputs; the networks are capable of operating on more complex inputs (e.g. entire variable region sequences rather than just the CDR-H3, or richer molecular descriptors than atomic coordinates alone). (A) Recurrent networks: information is read one element at a time while a hidden state is maintained. This architecture is often used for sequence-based inputs such as CDRs or variable region sequences. (B) Convolutional neural networks: predictions are constrained to portions of the input and are then pooled together. Such networks focus on local patterns and combine them into predictions, making them useful for identifying motifs in sequences or features of molecular surfaces. (C) Graph neural networks: abstract linkage between input elements can be reflected, allowing such networks to process abstract representations of molecules. (D) Residual neural networks: portions of the network can be circumvented, allowing deeper networks without the risk of exploding or vanishing gradients. Such networks were used with great success for structure prediction. (E) Encoder-decoder networks: the input is encoded into a lower-dimensional latent representation from which the network attempts to reconstruct the input. The resulting latent representation can reflect intrinsic features of the input, such as gene assignments and propensity towards similar targets.
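The recurrent-network concept in panel (A) of Figure 2 can be illustrated with a minimal NumPy sketch: a vanilla RNN cell reads a one-hot-encoded sequence one residue at a time and updates a hidden state. The weights here are random (untrained), the hidden size of 8 is arbitrary, and the CDR-H3-like input is hypothetical; a real model (e.g. an LSTM over variable region sequences) would learn these weights from data.

```python
import numpy as np

rng = np.random.default_rng(0)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

HIDDEN = 8  # illustrative hidden-state size
W_xh = rng.normal(scale=0.1, size=(HIDDEN, 20))      # input -> hidden weights
W_hh = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))  # hidden -> hidden weights

def rnn_encode(seq: str) -> np.ndarray:
    """Read the sequence one residue at a time, maintaining a hidden state."""
    h = np.zeros(HIDDEN)
    for aa in seq:
        x = np.zeros(20)
        x[AA_INDEX[aa]] = 1.0
        # the hidden state summarizes the prefix of the sequence read so far
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h

h = rnn_encode("ARDYYGSSYFDY")  # hypothetical CDR-H3
print(h.shape)  # (8,) -- a fixed-size summary of a variable-length sequence
```

The key property is that sequences of any length map to a fixed-size vector, which downstream layers can use for classification or regression.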
Recent examples of machine learning applications in antibodies
| Category | Method | Task | Training data | Architecture | Training details | Framework | Code | Paper |
|---|---|---|---|---|---|---|---|---|
| Structure prediction | DeepH3 | CDR-H3 structure prediction | 1388 structures | Series of 1D and 2D convolutions (three 1D + 25 2D blocks) | 30 epochs, batch size 4; 35 h on one NVIDIA Tesla K80 GPU | PyTorch | | [ |
| | DeepAb | Variable region structure prediction | 118 386 sequences and 1692 structures | A 1D ResNet (1D convolution followed by three 1D ResNet blocks) and a bi-LSTM encoder | 60 epochs, batch size 128; 60 h on an NVIDIA K80 GPU | PyTorch | | [ |
| | AbLooper | CDR prediction | 3438 structures | Five E(n)-equivariant graph neural networks (EGNNs), each with four layers | NVIDIA Tesla V100 GPU; predicts the CDRs of 100 structures in under 5 s | PyTorch | | [ |
| | NanoNet | Heavy-chain structure prediction | ~2000 structures | Two 1D ResNets with a 140 × 22 input tensor | Batch size 16, ~130 epochs; 10 min on a GeForce RTX 2080 Ti | Keras/TensorFlow | | [ |
| Humanization/deimmunization | Nativeness LSTM | Learn the distribution of amino acids at each position | 400 000 sequences | Bidirectional LSTM with dimensionality 64 | 10 epochs | PyTorch | | [ |
| | Sapiens | Antibody humanization | 20 million heavy chains and 19 million light chains | RoBERTa transformer, 4 layers, 8 attention heads, 568 857 parameters | 700 epochs for heavy chains, 300 epochs for light chains | PyTorch/Fairseq | | [ |
| | Hu-mAb | Discriminate between human and mouse sequences | 65 million sequences, of which 13 million non-human | Random forest | n/a | scikit-learn | | [ |
| Binding models | Parapred | Paratope residue prediction | 1662 sequences (277 antibody–antigen complexes × 6 CDRs each); tested on the same dataset via 10-fold cross-validation | Convolutional and recurrent neural networks | 16 epochs, batch size 32 | Keras | | [ |
| | Epitope3D | Conformational epitope prediction | 1351 antibody–antigen structures (covering 40 842 epitope residues) and 180 unbound antigen structures; tested on 20 unbound antigen structures; 45 unbound antigen structures used for an external blind test | Supervised learning algorithms: multi-layer perceptron, support vector machines, k-nearest neighbors, AdaBoost, Gaussian processes, random forest, gradient boosting, XGBoost, extra trees | n/a | scikit-learn (Python) | | [ |
| | mmCSM-AB | Predict the effect of multiple point mutations on antibody–antigen binding affinity | 1640 mutations with associated binding-affinity changes (905 single missense mutations and 735 modeled reverse mutations); tested on 242 multiple missense mutations with associated binding-affinity changes | Supervised learning algorithms, e.g. random forest, extra trees, gradient boosting, XGBoost, SVM and Gaussian processes | n/a | scikit-learn (Python) | | [ |
| | Phage display LSTM | Generate novel kynurenine-binding sequences | 959 sequences | LSTM, two layers | 269 epochs | Keras/TensorFlow | n/a | [ |
| | Phage display CNN | Predict phage enrichment and generate novel CDR-H3 sequences | 96 847 sequences (largest dataset on GitHub) | Ensemble of CNNs, the largest with two convolutional layers and 18 706 parameters | 20 epochs | Keras | | [ |
| | Image-based prediction | Distinguish between binding antibodies and lineages | 24 953 models with calculated fingerprints | ResNet-50 | Pre-trained model | Keras/TensorFlow | | [ |
| | PECAN (Paratope and Epitope prediction with graph Convolution Attention Network) | Epitope and paratope prediction | 162 structures for epitope prediction and 460 for paratope prediction | Graph convolutional attention network | Up to 250 epochs, batch size 32 (multiple parameter settings tested) | TensorFlow | | [ |
| | DLAB | Sorting of protein docking poses | 759 antibody–antigen complexes | Convolutional neural network | n/a | PyTorch | | [ |
| Embeddings/language methods | immune2vec | Embed CDR-H3 into 100 dimensions using skip-gram | 15.63 million sequences | Two dense layers | n/a | Gensim | | [ |
| | ProtVec CDRH3 | Embed CDR-H3 sequences | COVID-19 data | Based on ProtVec | Reused previous model | Reused previous model | | [ |
| | AntiBERTy | Masked language modeling, paratope prediction | 558 million sequences | BERT transformer encoder, 8 layers, 26 M trainable parameters | 8 epochs, 10 days | PyTorch | n/a | [ |
| | AntiBERTa | Masked language modeling, paratope prediction | 57 million sequences | Antibody-specific Bidirectional Encoder Representations from Transformers, 86 M parameters; 12-layer transformer pre-trained on 57 M human BCR sequences | 3 epochs, batch size 96 across 8 NVIDIA V100 GPUs | PyTorch | n/a | [ |
| | AbLang | Masked language modeling, reconstruct erroneous sequences | 14 million heavy chains and 200 000 light chains for training; evaluation sets of 100 k and 50 k for heavy and light chains, respectively | Based on RoBERTa (HuggingFace) | 20 epochs | PyTorch | | [ |
| Generative methods/antibody design | Mouse VAE | Model the latent space of CDR triples from antigen-challenged mice | 243 374 sequences | VAE with encoder | 200 epochs on a single GPU | TensorFlow | n/a (available after peer review) | [ |
| | Developability-controlled GAN | Learn a latent representation of human sequences and bias it towards biophysical properties | 400 000 sequences | Generative adversarial network (single chain), seven layers | 500 epochs, batch size 128 | Keras/TensorFlow | n/a | [ |
| | Nanobody generation | Autoregression on nanobody sequences to generate novel CDR-H3 | 1.2 million sequences | ResNet with nine blocks | 250 000 updates, batch size 30 | TensorFlow/PyTorch | | [ |
| | | | 70 000 murine CDR3 sequences | LSTM of dimensionality 1024 with an embedding layer and a dense output layer | 20 epochs | TensorFlow | | [ |
| | IgLM (Immunoglobulin Language Model) | Masked language modeling; generate synthetic antibody libraries by solving a masked language modeling task | 558 million sequences | Transformer decoder based on GPT-2, 512-dimensional embeddings, 12 million parameters | Batch size 512 | GPT-2 from HuggingFace | n/a | [ |
| | IG-VAE | Immunoglobulin structure generation | 10 768 immunoglobulin structures (including 4154 non-sequence-redundant ones), covering almost 100% of the antibody structure database (AbDb); tested on 5000 structures sampled from the IG-VAE latent space | VAE | n/a | PyTorch | n/a | [ |
| | Generative method benchmarking: AR (sequence-based autoregressive generative model), GVP (geometric vector perceptron, a structure-based graph neural network) and Fold2Seq (fold-based generative model) | Design of antibody CDR regions based on a portion of the sequence or structure | Sequences from a natural llama nanobody repertoire | AR: autoregressive causal dilated convolutions; GVP: encoder–decoder GNN; Fold2Seq: encoder–decoder transformer | n/a | n/a | n/a | [ |
| | GNN-based generation | CDR sequence and 3D structure design | ~5000 structures; train/validation/test splits of 4050/359/326 (CDR-H1), 3876/483/376 (CDR-H2) and 3896/403/437 (CDR-H3) | Message passing network (MPN): Iterative Refinement Graph Neural Network (RefineGNN) | Batch size 16, dropout 0.2, learning rate 0.0005 | n/a | n/a | [ |
| | AntBO | CDR-H3 region design | | Bayesian optimization | 87 cores, 12 GB GPU memory | GPyTorch, BoTorch | n/a | [ |
For each method, we present the basic reported parameters used for training the network and the approximate input. Wherever available, we report the architecture and libraries used, to offer a point of reference for the techniques currently used in the field. Some methods (e.g. Hu-mAb or mmCSM-AB) are not deep-learning-based but are included here for completeness. Non-peer-reviewed bioRxiv/arXiv papers are indicated by '*' in the Paper column.
Figure 3. Generative methods for computational antibody design. (A) Millions of natural NGS sequences can be used to learn general features of antibody sequences, such as positional frequencies, amino acid dependencies and gene groupings. (B) By feeding in antigen-specific sequences, one can bias the distribution to learn the features of sequences specific to a given antigen. (C) Sequences with known favorable biophysical properties (e.g. solubility, low immunogenicity) can be used to bias the latent space towards such features. (D) The latent space can then be sampled, randomly or in a directed fashion, to produce sequences that comply with certain specifications, such as specificity and biophysical properties.
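Panel (D) above, sampling sequences from a learned latent space, can be sketched minimally in NumPy. The decoder weights below are random stand-ins for a trained model, and the latent dimensionality (16) and sequence length (12) are arbitrary illustrative choices; in a real VAE or GAN these would be learned from repertoire data, and directed sampling would navigate the latent space rather than draw blindly.

```python
import numpy as np

rng = np.random.default_rng(1)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LATENT_DIM = 16   # illustrative latent dimensionality
SEQ_LEN = 12      # length of the generated CDR-H3-like sequence

# Stand-in for a trained decoder: maps a latent point to per-position logits.
W_dec = rng.normal(scale=0.5, size=(SEQ_LEN * 20, LATENT_DIM))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def sample_sequence(z: np.ndarray) -> str:
    """Decode a latent point into amino-acid probabilities and sample a sequence."""
    logits = (W_dec @ z).reshape(SEQ_LEN, 20)
    return "".join(
        AMINO_ACIDS[rng.choice(20, p=softmax(row))] for row in logits
    )

z = rng.standard_normal(LATENT_DIM)  # panel (D): a random point in the latent space
seq = sample_sequence(z)
print(len(seq))  # 12
```

Biasing towards biophysical properties (panel C) amounts to shaping which latent regions are sampled, e.g. by conditioning the decoder or filtering samples with a property predictor.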