Literature DB >> 35099535

ABlooper: Fast accurate antibody CDR loop structure prediction with accuracy estimation.

Brennan Abanades¹, Guy Georges², Alexander Bujotzek², Charlotte M Deane¹.

Abstract

MOTIVATION: Antibodies are a key component of the immune system and have been extensively used as biotherapeutics. Accurate knowledge of their structure is central to understanding their antigen binding function. The key area for antigen binding and the main area of structural variation in antibodies is concentrated in the six complementarity determining regions (CDRs), with the most important for binding and most variable being the CDR-H3 loop. The sequence and structural variability of CDR-H3 make it particularly challenging to model. Recently deep learning methods have offered a step change in our ability to predict protein structures.
RESULTS: In this work we present ABlooper, an end-to-end equivariant deep-learning based CDR loop structure prediction tool. ABlooper rapidly predicts the structure of CDR loops with high accuracy and provides a confidence estimate for each of its predictions. On the models of the Rosetta Antibody Benchmark, ABlooper makes predictions with an average CDR-H3 RMSD of 2.49 Å, which drops to 2.05 Å when considering only its 75% most confident predictions. AVAILABILITY: https://github.com/oxpig/ABlooper. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Year: 2022 PMID： 35099535 PMCID： PMC8963302 DOI： 10.1093/bioinformatics/btac016

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

1.1 Antibody structure

Antibodies are a class of protein produced by B cells during an immune response. Their ability to bind with high affinity and specificity to almost any antigen makes them attractive for use as therapeutics (Carter and Lazar, 2018). Knowledge of the structure of antibodies is becoming increasingly important in biotherapeutic development (Chiu ). However, experimental structure determination is time-consuming and expensive so it is not always practical or even possible to use routinely. Computational modelling tools have allowed researchers to bridge this gap by predicting large numbers of antibody structures to a high level of accuracy (Leem ; Ruffolo ). For example, models of antibody structures have recently been used for virtual screening (Schneider ) and to identify coronavirus-binding antibodies that bind the same epitope with very different sequences (Robinson ). The overall structure of all antibodies is similar and therefore can be accurately predicted using current methods (e.g. Leem ). The area of antibodies that it is hardest to model is the sequence variable regions that provide the structural diversity necessary to bind a wide range of antigens. This diversity is largely focussed on six loops known as the complementarity determining regions (CDRs). The most diverse of these CDRs and therefore the hardest to model is the third CDR loop of the heavy chain (CDR-H3) (Teplyakov ).

1.2 Deep learning for protein structure prediction

At CASP14 (Kryshtafovych ), DeepMind showcased AlphaFold2 (Jumper ), a neural network capable of accurately predicting many protein structures. The method relies on the use of equivariant neural networks and an attention mechanism. More recently, RoseTTAFold, a novel neural network based on equivariance and attention was shown to obtain results comparable to those of AlphaFold2 (Baek ). These methods both rely on the use of equivariant networks. For a network to be equivariant with respect to a group, it must be able to commute with the group action. For rotations, this means that rotating the input before feeding it into the network will have the same result as rotating the output. In the case of proteins, using a network equivariant to both translations and rotations in 3D space allows us to learn directly from atom coordinates. This is in contrast to previous methods like TrRosetta (Yang ) or the original version of AlphaFold (Senior ) that predicted invariant features, such as inter-residue distances and orientations which are then used to reconstruct the protein. A number of approaches for developing equivariant networks have been recently developed (e.g. Finzi ). In this article, we explore the use of an equivariant approach to CDR structure prediction. We chose to use E(n)-Equivariant Graph Neural Networks (E(n)-EGNNs; Satorras ) as our equivariant approach due to their speed and simplicity.

1.3 Deep learning for antibody structure prediction

Deep learning-based approaches have also been shown to improve structure prediction in antibodies, e.g. DeepH3 (Ruffolo ), an antibody-specific version of TrRosetta. Recently, DeepAb (Ruffolo ), an improved version of DeepH3, was shown to outperform all currently available antibody structure prediction methods. DeepAb and DeepH3 are similar to TrRosetta and the original version of AlphaFold in that deep learning is used to obtain inter-residue geometries that are then fed into an energy minimization method to produce the final structure. In this work, we present ABlooper, a fast and accurate tool for antibody CDR loop structure prediction. By leveraging E(n)-EGNNs, ABlooper directly predicts the structure of CDR loops. By simultaneously predicting multiple structures for each loop and comparing them amongst themselves, ABlooper is capable of estimating a confidence measure for each predicted loop.

2 Materials and Methods

2.1 Data

The data used to train, test and validate ABlooper were extracted from SAbDab (Dunbar ), a database of all antibody structures contained in the PDB (Berman ). Structures with a resolution better than 3.0 Å and no missing backbone atoms within any of the CDRs were selected. The CDRs were defined using the Chothia numbering scheme (Chothia ). For easy comparison with different pipelines, we used the 49 antibodies from the Rosetta Antibody Benchmark as our test set. For validation, 100 structures were selected at random. It was ensured that there were no structures with the same CDR sequences in the training, testing and validation sets. Sequence redundancy was allowed within the training set to expose the network to the existence of antibodies with identical sequences but different structural conformations. This resulted in a total of 3438 training structures. Additionally, we use a secondary test set composed of 114 antibodies (SAbDab Latest Structures) with a resolution of under 2.3 Å and a maximum CDR-H3 loop length of 20, which were added to SAbDab after the initial test, train and validation sets were extracted (November 8, 2020 to May 24, 2021). A list containing the PDB IDs of all the structures used in the train, test, and validation sets is given in the Supplementary Material. ABodyBuilder was used to build models of all the structures. Structural models were generated using the singularity version of ABodyBuilder (Leem ) (fragment database from July 8, 2021) excluding all templates with a 99% or higher sequence identity. ABlooper CDR models for the test sets were obtained by remodelling the CDR loops on ABodyBuilder models.

2.2 Deep learning

ABlooper is composed of five E(n)-EGNNs, each one with four layers, all trained in parallel. The model is trained on the position of the -N-C- backbone atoms for all six CDR loops plus two anchor residues at either end. E(n)-EGNNs require a starting geometry, so a non-descriptive input geometry is generated by evenly spacing each CDR loop residue on a straight line between its anchor residues (Fig. 1). The model is given four different types of features per node resulting in a 41-dimensional vector. These include a one-hot encoded vector describing the amino acid type, the atom type and which loop the residue belongs to. Additionally, sinusoidal positional embeddings are given to each residue describing how close it is to the anchors. An outline of how E(n)-EGNNs are used within ABlooper is shown in Figure 1.

Fig. 1.

Flowchart showing how E(n)-EGNN is used to predict CDR loops in ABlooper. The input geometry for each CDR loop is generated by aligning its residues between their anchors, while the node features are extracted from the loop sequence. Atom coordinates are then iteratively updated using a four-layer E(n)-EGNN resulting in a predicted set of conformations for each CDR Two different losses were used during training. To quantify the structural similarity between the predicted and true structures, RMSD was used. To encourage the conservation of distances between neighbouring atoms in the backbone chain, an L1-loss between the true and predicted inter-atom distances was used. This was composed of five terms between the following pairs of atoms: -----. Each of the five E(n)-EGNNs were trained to make predictions independently by minimizing the RMSD between their prediction and the crystal structure. The output from the five networks is then averaged to obtain a final prediction. To ensure that the final combined prediction of all E(n)-EGNNs was physically plausible, the L1-loss was used on the final averaged structure. The model was trained in two phases. First, it was trained until convergence without the L1-loss term using the RAdam (Liu ) optimizer with a learning rate of and a weight decay of . In the second stage, the L1-loss term was added with a weighting of 1.0. For this stage, the model was trained using the Adam (Kingma and Ba, 2014) optimizer with a learning rate of and early stopping. More details on the implementation of ABlooper can be found in the Supplementary Material.

2.3 Loop relaxation

During training, ABlooper is encouraged to predict physically plausible CDR loops via the intra-residue atom distance loss term. However, ABlooper occasionally produces loops with incorrect backbone geometries. To enforce correct backbone geometries we relax the predicted loops using a restrained energy minimization procedure. As our energy function, we use the AMBER14 (Maier ) protein force field with an additional harmonic potential term keeping the positions of backbone atoms close to their original predicted positions. The spring constant of the harmonic potential is set to 10 kcal/mol2. Energy minimization is done using the Langevin Integrator in the OpenMM python package (Eastman ). This relaxation step typically results in a small loss in accuracy, but ensures that predicted loops are physically plausible.

2.4 Deepab and AlphaFold2

DeepAb structural models were generated using the open-source version of the code (available at https://github.com/RosettaCommons/DeepAb). As suggested in their paper (Ruffolo ), we generated five decoys per structure. This took around 10 min per antibody on an 8-core Intel i7-10700 CPU. Antibody structures were generated using the open-source version of AlphaFold2 (available at https://github.com/deepmind/alphafold). We used the ‘full_dbs’ preset and allowed it to use templates from before May 14, 2020. As AlphaFold2 is intended to predict single chains (Jumper ), we predicted and aligned the heavy and light chain independently before comparing to other methods. On a 20-core Intel 6230 CPU this took around 3 h per antibody modelled.

3 Results

3.1 Using ABlooper to predict CDR loops on modelled antibody structures

We used ABlooper to predict the CDRs on ABodyBuilder models of the Rosetta Antibody Benchmark (RAB) and the SAbDab Latest Structures (SLS) sets. The RMSD between the -N-C- atoms in the backbone of the crystal structure and the predicted CDRs for both test sets is shown in Table 1.

Table 1.

Performance comparison between AlphaFold2, ABodyBuilder, DeepAb and ABlooper for both test sets

Method	CDR-H1	CDR-H2	CDR-H3	CDR-L1	CDR-L2	CDR-L3
Rosetta Antibody Benchmark
AlphaFold2^a	0.84	0.99	2.87	0.53	0.49	0.95
ABodyBuilder	1.08	0.99	2.77	0.69	0.50	1.12
DeepAb	0.83	0.93	2.44	0.50	0.44	0.85
ABlooper	0.92	1.01	2.49	0.62	0.52	0.97
ABlooper unrelaxed	0.90	1.03	2.45	0.61	0.51	0.93
SAbDab latest structures
ABodyBuilder	1.24	1.07	3.25	0.88	0.57	1.03
DeepAb^a	1.00	0.82	2.49	0.59	0.45	0.90
ABlooper	1.14	0.97	2.72	0.74	0.55	1.04
ABlooper Unrelaxed	1.14	0.99	2.66	0.73	0.54	1.01

The mean RMSD to the crystal structure across each test set for the six CDRs is shown. RMSDs for each CDR are calculated after superimposing their corresponding chain to the crystal structure. RMSDs are given in Angstroms (Å).

It is likely that AlphaFold2 used at least some of the structures in the benchmark set during training. Similarly, structures in the SAbDab Latest Structures set may have been used for training DeepAb.

Performance comparison between AlphaFold2, ABodyBuilder, DeepAb and ABlooper for both test sets The mean RMSD to the crystal structure across each test set for the six CDRs is shown. RMSDs for each CDR are calculated after superimposing their corresponding chain to the crystal structure. RMSDs are given in Angstroms (Å). It is likely that AlphaFold2 used at least some of the structures in the benchmark set during training. Similarly, structures in the SAbDab Latest Structures set may have been used for training DeepAb. ABlooper achieves lower mean RMSDs than AbodyBuilder for most CDRs (Table 1). By far, the largest improvement is for the CDR-H3 loop, where due to the large structural diversity, homology modelling performs worst (Leem ). ABlooper predicts loops of a similar accuracy to AlphaFold2 and DeepAb for all CDRs except CDR-H3, where ABlooper and DeepAb outperform AlphaFold2. One potential source of error for ABlooper is the model frameworks generated by ABodyBuilder, so we examined its resilience to the small deviations seen in these models and found little to no correlation between framework error and CDR prediction error (see Supplementary Material).

3.2 Prediction diversity as a measure of prediction quality

ABlooper predicts five structures for each loop. We found that the average RMSD between predictions can be used as a measure of certainty of the final averaged prediction. If all five models agree on the same conformation, then it is more likely that it will be the correct conformation, if they do not, then the final prediction is likely to be less accurate (Fig. 2). This allows ABlooper to give a confidence score for each predicted loop. As shown in Figure 2D, this score can be used as a filter, removing structures which are expected to be incorrectly modelled by ABlooper. For example, by setting a 1.5 Å inter-prediction RMSD cut-off on structures from the Rosetta Antibody Benchmark, the average CDR-H3 RMSD for the set can be reduced from 2.49 to 2.05 Å while keeping around three quarters of the predictions. As expected, accuracy filtering has a tendency to remove longer CDR-H3 predictions but it is not exclusively correlated to length (see Supplementary Material).

Fig. 2.

(A) CDR-H3 loop RMSD between final averaged prediction and crystal structure compared with average RMSD between the five ABlooper predictions for both the Rosetta Antibody Benchmark and the SAbDab Latest Structures set. (B) An example of a poorly predicted CDR-H3 loop. All five predictions are given in grey, with the final averaged prediction in blue and the crystal structure in green. The predictions from the five networks are very different, indicating an incorrect final prediction. (C) Example of correctly predicted CDR loops. All five predictions are similar, indicating a high confidence prediction. Colours are the same as in (B). (D) Effect of removing structures with a high CDR-H3 inter-prediction RMSD on the averaged RMSD for the set. The number of structures remaining after each quality cut-off is shown as a percentage. Data shown for the RAB and the SLS sets

4 Discussion

We present ABlooper, a fast and accurate tool for predicting the structures of the CDR loops in antibodies. It builds on recent advances in EGNNs to improve CDR loop structure prediction. On an NVIDIA Tesla V100 GPU, the unrelaxed version of ABlooper can predict the CDR backbone atoms for 100 structures in under 5 s. Loop relaxation and side-chain prediction are the most computationally expensive parts of the pipeline taking around 10 s per structure. ABlooper outperforms ABodyBuilder (a state of the art homology method) and produces antibody models of similar accuracy to both AlphaFold2 and DeepAb, but on a far faster timescale. By predicting each loop multiple times, ABlooper is capable of producing an accuracy estimate for each generated loop structure. It is not clear whether a high prediction diversity score is indicative of loops with multiple conformations or underrepresentation of the given loop sequence in SAbDab (Dunbar ). However, due to how ABlooper is trained (with the averaged prediction encouraged to be physically plausible), we would expect individual decoys from ABlooper to be unphysical for divergent predictions. With the arrival of B-cell receptor repertoire sequencing, the number of publicly available paired antibody sequence data is rapidly increasing (Kovaltsuk ; Olsen ). Fast accurate tools such as ABlooper provide the opportunity for structural studies (such as Robinson ) at previously infeasible scales. The model used for ABlooper is available at: https://github.com/oxpig/ABlooper.

Funding

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) with grant number (EP/S024093/1). Conflict of Interest: none declared. Click here for additional data file.

20 in total

1. Antibody modeling assessment II. Structures and models.

Authors: Alexey Teplyakov; Jinquan Luo; Galina Obmolova; Thomas J Malia; Raymond Sweet; Robyn L Stanfield; Sreekumar Kodangattil; Juan Carlos Almagro; Gary L Gilliland
Journal: Proteins Date: 2014-03-31

Review 2. Next generation antibody drugs: pursuit of the 'high-hanging fruit'.

Authors: Paul J Carter; Greg A Lazar
Journal: Nat Rev Drug Discov Date: 2017-12-01 Impact factor: 84.694

3. Improved protein structure prediction using predicted interresidue orientations.

Authors: Jianyi Yang; Ivan Anishchenko; Hahnbeom Park; Zhenling Peng; Sergey Ovchinnikov; David Baker
Journal: Proc Natl Acad Sci U S A Date: 2020-01-02 Impact factor: 11.205

4. Improved protein structure prediction using potentials from deep learning.

Authors: Andrew W Senior; Richard Evans; John Jumper; James Kirkpatrick; Laurent Sifre; Tim Green; Chongli Qin; Augustin Žídek; Alexander W R Nelson; Alex Bridgland; Hugo Penedones; Stig Petersen; Karen Simonyan; Steve Crossan; Pushmeet Kohli; David T Jones; David Silver; Koray Kavukcuoglu; Demis Hassabis
Journal: Nature Date: 2020-01-15 Impact factor: 49.962

Review 5. Antibody Structure and Function: The Basis for Engineering Therapeutics.

Authors: Mark L Chiu; Dennis R Goulet; Alexey Teplyakov; Gary L Gilliland
Journal: Antibodies (Basel) Date: 2019-12-03

6. ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from ff99SB.

Authors: James A Maier; Carmenza Martinez; Koushik Kasavajhala; Lauren Wickstrom; Kevin E Hauser; Carlos Simmerling
Journal: J Chem Theory Comput Date: 2015-07-23 Impact factor: 6.006

7. Accurate prediction of protein structures and interactions using a three-track neural network.

Authors: Minkyung Baek; Frank DiMaio; Ivan Anishchenko; Justas Dauparas; Sergey Ovchinnikov; Gyu Rie Lee; Jue Wang; Qian Cong; Lisa N Kinch; R Dustin Schaeffer; Claudia Millán; Hahnbeom Park; Carson Adams; Caleb R Glassman; Andy DeGiovanni; Jose H Pereira; Andria V Rodrigues; Alberdina A van Dijk; Ana C Ebrecht; Diederik J Opperman; Theo Sagmeister; Christoph Buhlheller; Tea Pavkov-Keller; Manoj K Rathinaswamy; Udit Dalwadi; Calvin K Yip; John E Burke; K Christopher Garcia; Nick V Grishin; Paul D Adams; Randy J Read; David Baker
Journal: Science Date: 2021-07-15 Impact factor: 47.728

8. Antibody structure prediction using interpretable deep learning.

Authors: Jeffrey A Ruffolo; Jeremias Sulam; Jeffrey J Gray
Journal: Patterns (N Y) Date: 2021-12-09

9. Geometric potentials from deep learning improve prediction of CDR H3 loop structures.

Authors: Jeffrey A Ruffolo; Carlos Guerra; Sai Pooja Mahajan; Jeremias Sulam; Jeffrey J Gray
Journal: Bioinformatics Date: 2020-07-01 Impact factor: 6.937

10. Highly accurate protein structure prediction with AlphaFold.

Authors: John Jumper; Richard Evans; Alexander Pritzel; Tim Green; Michael Figurnov; Olaf Ronneberger; Kathryn Tunyasuvunakool; Russ Bates; Augustin Žídek; Anna Potapenko; Alex Bridgland; Clemens Meyer; Simon A A Kohl; Andrew J Ballard; Andrew Cowie; Bernardino Romera-Paredes; Stanislav Nikolov; Rishub Jain; Demis Hassabis; Jonas Adler; Trevor Back; Stig Petersen; David Reiman; Ellen Clancy; Michal Zielinski; Martin Steinegger; Michalina Pacholska; Tamas Berghammer; Sebastian Bodenstein; David Silver; Oriol Vinyals; Andrew W Senior; Koray Kavukcuoglu; Pushmeet Kohli
Journal: Nature Date: 2021-07-15 Impact factor: 49.962

10 in total

1. Deciphering the language of antibodies using self-supervised learning.

Authors: Jinwoo Leem; Laura S Mitchell; James H R Farmery; Justin Barton; Jacob D Galson
Journal: Patterns (N Y) Date: 2022-05-18

2. Shape Complementarity Optimization of Antibody-Antigen Interfaces: The Application to SARS-CoV-2 Spike Protein.

Authors: Alfredo De Lauro; Lorenzo Di Rienzo; Mattia Miotto; Pier Paolo Olimpieri; Edoardo Milanetti; Giancarlo Ruocco
Journal: Front Mol Biosci Date: 2022-05-20

3. BioPhi: A platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning.

Authors: David Prihoda; Jad Maamary; Andrew Waight; Veronica Juan; Laurence Fayadat-Dilman; Daniel Svozil; Danny A Bitton
Journal: MAbs Date: 2022 Jan-Dec Impact factor: 5.857

Review 4. Machine-designed biotherapeutics: opportunities, feasibility and advantages of deep learning in computational antibody discovery.

Authors: Wiktoria Wilman; Sonia Wróbel; Weronika Bielska; Piotr Deszynski; Paweł Dudzic; Igor Jaszczyszyn; Jędrzej Kaniewski; Jakub Młokosiewicz; Anahita Rouyan; Tadeusz Satława; Sandeep Kumar; Victor Greiff; Konrad Krawczyk
Journal: Brief Bioinform Date: 2022-07-18 Impact factor: 13.994