SCALOP: sequence-based antibody canonical loop structure annotation.

Wing Ki Wong¹, Guy Georges², Francesca Ros², Sebastian Kelm³, Alan P Lewis⁴, Bruck Taddese⁵, Jinwoo Leem¹, Charlotte M Deane¹.

Abstract

MOTIVATION: Canonical forms of the antibody complementarity-determining regions (CDRs) were first described in 1987 and have been redefined on multiple occasions since. The canonical forms are often used to approximate the antibody binding site shape as they can be predicted from sequence. A rapid predictor would facilitate the annotation of CDR structures in the large amounts of repertoire data now becoming available from next generation sequencing experiments.
RESULTS: SCALOP annotates CDR canonical forms for antibody sequences, supported by an auto-updating database to capture the latest cluster information. Its accuracy is comparable to that of a standard structural predictor but it is 800 times faster. The auto-updating nature of SCALOP ensures that it always attains the best possible coverage.
AVAILABILITY AND IMPLEMENTATION: SCALOP is available as a web application and for download under a GPLv3 license at opig.stats.ox.ac.uk/webapps/scalop.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019; PMID: 30321295; PMCID: PMC6513161; DOI: 10.1093/bioinformatics/bty877

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 Introduction

Antibodies are proteins of the immune system that bind to foreign molecules. The binding site is largely formed of six complementarity-determining regions (CDRs): three on each of the heavy and light chains. Conformational clusters, known as ‘canonical forms’, have been observed in five of the six CDRs (e.g. Chothia and Lesk, 1987; North et al.; Nowak et al.). Canonical forms have been redefined in the literature many times, but each update has been a static snapshot of the available data. These repeated redefinitions illustrate how the growth of structural data continuously modifies our understanding of CDR loop structures, with 10 canonical forms in 1987 (Chothia and Lesk, 1987) and 26 by 2016 (Nowak et al.).

Several sequence-based canonical form prediction methods have been developed (e.g. Chothia and Lesk, 1987; Long et al.; North et al.; Nowak et al.). Chothia and Lesk (1987) suggested structurally-determining residues for canonical form assignment. Using a similar approach, Swindells et al. published a freely available web server that can handle bulk canonical form assignment, but some clusters lack a representative structure. Hidden Markov models have also been built for cluster assignment (North et al.; Nowak et al.). The most recently published method used a Gradient Boosting Machine to annotate CDR backbone conformations with up to 85.1% accuracy (Long et al.). However, none of these tools uses an auto-updating database, and none provides both a web interface and a freely available software package for large-scale sequence analysis.

Here we present SCALOP, which clusters the H1, H2, L1, L2 and L3 CDRs in an auto-updating database and builds a canonical form predictor on those clusters. SCALOP can be used to rapidly approximate an antibody binding site shape from sequence alone (Krawczyk et al.) with a minimum accuracy of 89.47% (Table 1) (Supplementary Material). The tool is available as a web server and as a Python package for bulk processing.

Table 1. Coverage and precision of SCALOP and FREAD on SAbDab

Metric	Method	H1	H2	L1	L2	L3
Coverage (%)	SCALOP	93.75	97.54	97.38	98.50	91.69
Coverage (%)	FREAD	96.79	93.38	98.76	98.89	98.02
Precision (%)	SCALOP	89.26	93.60	95.67	99.13	93.31
Precision (%)	FREAD	80.19	88.50	92.72	98.27	91.29

Note: A target structure with a root-mean-square deviation of <1.5 Å to the predicted structure is considered correct.
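
As a minimal illustration of how figures of this kind could be derived, the sketch below applies the 1.5 Å cutoff from the note. It assumes that the two loops are already superimposed in a common frame, that coverage is the share of targets receiving any prediction, and that precision is the share of predictions judged correct. These definitions, and the function names `rmsd` and `coverage_and_precision`, are assumptions made for illustration; the exact protocol is described in the Supplementary Material.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Assumes the two loops are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


def coverage_and_precision(rmsds, cutoff=1.5):
    """Return (coverage %, precision %) for a list of per-target RMSD values.

    An entry of None means that no prediction was made for that target.
    A prediction counts as correct when its RMSD is below the cutoff (in Å).
    """
    predicted = [r for r in rmsds if r is not None]
    correct = [r for r in predicted if r < cutoff]
    coverage = 100.0 * len(predicted) / len(rmsds)
    precision = 100.0 * len(correct) / len(predicted) if predicted else 0.0
    return coverage, precision


# Toy data: four targets, one of which received no prediction.
toy_rmsds = [0.5, 1.2, 2.0, None]
toy_coverage, toy_precision = coverage_and_precision(toy_rmsds)
```

Under these assumed definitions a method can trade coverage for precision by declining to predict difficult targets, which is why the table reports both.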

2 Algorithm

SCALOP takes one or more amino acid sequences of full antibody chains as input. It numbers each sequence with ANARCI (Dunbar and Deane, 2016) and scores the extracted CDR sequences against position-specific scoring matrices (PSSMs) of the appropriate clusters. The cluster nomenclature follows that of Nowak et al. (Supplementary Material). Each input CDR sequence is then assigned to the cluster with the maximum score, provided that score is above a scoring threshold (Supplementary Material). SCALOP returns the name of the assigned cluster, together with the PDB code and chain identifier of that cluster’s median structure. SCALOP can also return a structural model if a structure of the framework is given alongside the CDR sequence (Supplementary Material). The database is updated monthly; previous versions are available from the website.

2.1 Building the PSSM

We adopted the length-independent CDR clustering method developed by Nowak et al. Structures in SAbDab (Dunbar et al.) available as of July 10, 2017 were used (Supplementary Material). We built PSSMs for each cluster using their unique sequences only:

$$ s_{i,a} = \log \frac{p_{i,a}}{b_a} $$

where $s_{i,a}$ is the element score, $p_{i,a}$ is the probability of observing amino acid $a$ at the ANARCI-numbered position $i$ within the cluster and $b_a$ is the background probability of $a$ (Supplementary Material).
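
The PSSM construction described above can be sketched in a few lines of Python. This is an illustrative sketch, not the SCALOP implementation: the function name `build_pssm`, the position labels, the pseudocount used to avoid taking the logarithm of zero, the base-2 logarithm and the uniform background are all assumptions introduced here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(cluster_seqs, background, pseudocount=1.0):
    """Build a log-odds PSSM from the CDR sequences of one cluster.

    cluster_seqs: list of dicts mapping a numbered position label to the
                  residue observed at that position.
    background:   dict mapping each amino acid to its background probability.
    Returns {position: {amino_acid: log-odds score}}.
    """
    # Keep unique sequences only, as described in the text.
    unique = {tuple(sorted(seq.items())) for seq in cluster_seqs}

    counts = {}
    for seq in unique:
        for pos, aa in seq:
            counts.setdefault(pos, Counter())[aa] += 1

    pssm = {}
    for pos, counter in counts.items():
        # Pseudocount (an assumption) keeps unseen residues at a finite score.
        total = sum(counter.values()) + pseudocount * len(AMINO_ACIDS)
        pssm[pos] = {
            aa: math.log2((counter[aa] + pseudocount) / total / background[aa])
            for aa in AMINO_ACIDS
        }
    return pssm


# Toy example with a uniform background and made-up position labels.
uniform = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
toy_cluster = [
    {"H26": "G", "H27": "F"},
    {"H26": "G", "H27": "Y"},
    {"H26": "G", "H27": "F"},  # duplicate sequence, counted once
]
toy_pssm = build_pssm(toy_cluster, uniform)
```

Residues enriched in the cluster relative to the background receive positive scores and depleted residues receive negative scores, so a conserved position contributes strongly to discrimination between clusters.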

2.2 Cluster assignment

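To make a cluster prediction, we score the target sequence only against clusters of the respective CDR type (e.g. an H1 sequence is scored only against H1 clusters). The PSSM score for a target sequence $t$, for cluster $c$ is:

$$ S_c(t) = \sum_{i \in I} s^{c}_{i,t_i} $$

where $I$ is the set of positions in the target sequence, $t_i$ is the residue of $t$ at position $i$ and $s^{c}_{i,a}$ is the element score for amino acid $a$ at position $i$ in the PSSM of cluster $c$. Since L2 loop structures are often invariant, we assign L2 loops of the dominant sequence length to a single canonical form; L2 loops of any other length are not clustered.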
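
A minimal sketch of this assignment step is given below, assuming that the score is a plain sum of element scores over the target positions. The function names, the penalty for a residue or position absent from a PSSM (`missing`), the threshold value, the toy scores and the L2 parameters are hypothetical; the actual threshold is described in the Supplementary Material.

```python
def score_sequence(cdr, pssm, missing=-5.0):
    """Sum the PSSM element scores over the positions of the target CDR.

    cdr:     dict mapping a numbered position label to a residue.
    pssm:    {position: {amino_acid: score}} for one cluster.
    missing: assumed penalty for a position or residue absent from the PSSM.
    """
    return sum(pssm.get(pos, {}).get(aa, missing) for pos, aa in cdr.items())


def assign_cluster(cdr, cluster_pssms, threshold=0.0):
    """Return the best-scoring cluster name, or None if no score exceeds
    the threshold. cluster_pssms must hold clusters of one CDR type only."""
    scores = {name: score_sequence(cdr, p) for name, p in cluster_pssms.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None


def assign_l2(cdr_sequence, dominant_length, cluster_name):
    """L2 rule from the text: loops of the dominant length share one
    canonical form; all other lengths are left unclustered."""
    return cluster_name if len(cdr_sequence) == dominant_length else None


# Toy PSSMs for two made-up H1 clusters.
toy_h1_pssms = {
    "H1-toy-A": {"H26": {"G": 1.5, "A": -1.0}, "H27": {"F": 1.0, "Y": 0.5}},
    "H1-toy-B": {"H26": {"G": 0.2, "A": 1.2}, "H27": {"F": -0.5, "Y": 1.1}},
}
first = assign_cluster({"H26": "G", "H27": "F"}, toy_h1_pssms)
second = assign_cluster({"H26": "A", "H27": "Y"}, toy_h1_pssms)
unassigned = assign_cluster({"H26": "W", "H27": "W"}, toy_h1_pssms)
```

Returning `None` below the threshold is what allows a predictor of this kind to leave a sequence unassigned rather than force it into a poorly matching cluster.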

3 Benchmark

We evaluated the performance of SCALOP on our training set using a leave-one-out cross-validation protocol (Table 1) and on a blind test set (Supplementary Material). It achieved similar results on both. We also compared SCALOP to an adapted version of FREAD, an accurate database-search method for loop structure prediction (Deane and Blundell, 2001; Krawczyk et al.) (Supplementary Material). This version does not generate a structural model, but returns the PDB code of its prediction. The prediction coverage and precision of the two methods are comparable (Table 1) (Supplementary Material).

To assess the speed and the proportion of consistent predictions made by SCALOP and FREAD, we ran both predictors on a next-generation sequencing dataset with ∼8 million light chain and ∼5 million heavy chain sequences (Krawczyk et al.). About 98% of the predictions are consistent between the two methods (Supplementary Material). On a single core, predicting 100 sequences requires 227 s using FREAD, but 0.29 s using SCALOP. This speed suggests that SCALOP could be run as a fast and reliable first screen.

To ensure that SCALOP always offers the best possible prediction coverage, it uses an auto-updating database. Figure 1 demonstrates the advantage of this auto-updating approach using L3 as an example. We selected the representative years based on the publication dates of previous canonical form definitions (Al-Lazikani et al.; North et al.; Nowak et al.). Data until the end of each year were used, i.e. for 2016, all structures in SAbDab deposited before the end of 2016 were used. In 1997, there was only a single L3 cluster; by 2016 there were seven and the proportion of non-clustered data had more than halved. Using the 1997 dataset for prediction, we achieve precision similar to that obtained with 2017’s data (97.4% in 1997 and 94.0% in 2017), but ∼30% less coverage.
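
To put the reported timings in context, the short calculation below extrapolates them to the full dataset of roughly 13 million sequences. It assumes that run time scales linearly with the number of sequences on a single core, which the text does not state; the variable names are illustrative.

```python
# Reported single-core timings for 100 sequences (from the benchmark above).
FREAD_SECONDS_PER_100 = 227.0
SCALOP_SECONDS_PER_100 = 0.29

# Approximate dataset size: ~8 million light and ~5 million heavy chains.
N_SEQUENCES = 8_000_000 + 5_000_000

# Ratio of the two timings.
speedup = FREAD_SECONDS_PER_100 / SCALOP_SECONDS_PER_100

# Linear extrapolation (an assumption) to the whole dataset.
batches = N_SEQUENCES / 100
fread_days = batches * FREAD_SECONDS_PER_100 / 86_400
scalop_hours = batches * SCALOP_SECONDS_PER_100 / 3_600

print(f"speed-up: {speedup:.0f}x")
print(f"FREAD:  {fread_days:.0f} days on one core")
print(f"SCALOP: {scalop_hours:.1f} hours on one core")
```

Under this assumption the ratio comes to roughly 780, in line with the approximately 800-fold figure quoted in the abstract, and the whole dataset would take close to a year of single-core time with FREAD against about ten hours with SCALOP.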

Fig. 1. The changes in L3 clusters in the past 20 years. The radii of the pie charts are proportional to the log(number of sequences). In 1997, only one L3 cluster existed, and its members were all length-9 loops. In 2011, four clusters existed, covering different sequence lengths. Between 2011 and 2016, some length-10 sequences joined the 2011-L3-9-A cluster, which became the 2016-L3-9,10-A cluster. The enriched knowledge improves the prediction coverage of SCALOP while retaining the precision. The numbers below the pie charts are from a leave-one-out (if needed) cross-validation on all antibodies up to July 1, 2018 (Supplementary Material).

Funding

This work was supported by funding from the Engineering and Physical Sciences Research Council and Medical Research Council [grant number EP/L016044/1].

Conflict of Interest: none declared.

References

1. A new clustering of antibody CDR loop conformations.

Authors: Benjamin North; Andreas Lehmann; Roland L Dunbrack
Journal: J Mol Biol Date: 2010-10-28 Impact factor: 5.469

2. abYsis: Integrated Antibody Sequence and Structure-Management, Analysis, and Prediction.

Authors: Mark B Swindells; Craig T Porter; Matthew Couch; Jacob Hurst; K R Abhinandan; Jens H Nielsen; Gary Macindoe; James Hetherington; Andrew C R Martin
Journal: J Mol Biol Date: 2016-08-22 Impact factor: 5.469

3. Canonical structures for the hypervariable regions of immunoglobulins.

Authors: C Chothia; A M Lesk
Journal: J Mol Biol Date: 1987-08-20 Impact factor: 5.469

4. Standard conformations for the canonical structures of immunoglobulins.

Authors: B Al-Lazikani; A M Lesk; C Chothia
Journal: J Mol Biol Date: 1997-11-07 Impact factor: 5.469

5. CODA: a combined algorithm for predicting the structurally variable regions of protein models.

Authors: C M Deane; T L Blundell
Journal: Protein Sci Date: 2001-03 Impact factor: 6.725

6. ANARCI: antigen receptor numbering and receptor classification.

Authors: James Dunbar; Charlotte M Deane
Journal: Bioinformatics Date: 2015-09-30 Impact factor: 6.937

7. Length-independent structural similarities enrich the antibody CDR canonical class model.

Authors: Jaroslaw Nowak; Terry Baker; Guy Georges; Sebastian Kelm; Stefan Klostermann; Jiye Shi; Sudharsan Sridharan; Charlotte M Deane
Journal: MAbs Date: 2016 May-Jun Impact factor: 5.857

8. SAbDab: the structural antibody database.

Authors: James Dunbar; Konrad Krawczyk; Jinwoo Leem; Terry Baker; Angelika Fuchs; Guy Georges; Jiye Shi; Charlotte M Deane
Journal: Nucleic Acids Res Date: 2013-11-08 Impact factor: 16.971

9. Structurally Mapping Antibody Repertoires.

Authors: Konrad Krawczyk; Sebastian Kelm; Aleksandr Kovaltsuk; Jacob D Galson; Dominic Kelly; Johannes Trück; Cristian Regep; Jinwoo Leem; Wing K Wong; Jaroslaw Nowak; James Snowden; Michael Wright; Laura Starkie; Anthony Scott-Tucker; Jiye Shi; Charlotte M Deane
Journal: Front Immunol Date: 2018-07-23 Impact factor: 7.561

10. Non-H3 CDR template selection in antibody modeling through machine learning.

Authors: Xiyao Long; Jeliazko R Jeliazkov; Jeffrey J Gray
Journal: PeerJ Date: 2019-01-11 Impact factor: 2.984
