ANARCI: antigen receptor numbering and receptor classification.

Abstract

MOTIVATION: Antibody amino-acid sequences can be numbered to identify equivalent positions. Such annotations are valuable for antibody sequence comparison, protein structure modelling and engineering. Multiple numbering schemes exist; they vary in the nomenclature they use to annotate residue positions, in their definitions of position equivalence and in their popularity within different scientific disciplines. However, no publicly available software currently exists that can apply all of the most widely used schemes, or for which an executable can be obtained under an open license.
RESULTS: ANARCI is a tool to classify and number antibody and T-cell receptor amino-acid variable domain sequences. It can annotate sequences with the five most popular numbering schemes: Kabat, Chothia, Enhanced Chothia, IMGT and AHo.
AVAILABILITY AND IMPLEMENTATION: ANARCI is available for download under the GPLv3 license at opig.stats.ox.ac.uk/webapps/anarci. A web interface to the program is available at the same address.
CONTACT: deane@stats.ox.ac.uk.

Year: 2015 PMID: 26424857 PMCID: PMC4708101 DOI: 10.1093/bioinformatics/btv552

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The variable domains of antibodies and T-cell receptors (TCRs) contain these proteins’ major binding regions. Aligning these variable sequences to a numbering scheme allows equivalent residue positions to be annotated and different molecules to be compared. Numbering is fundamental for immunoinformatics analysis and for the rational engineering of therapeutic molecules (Shirai, 2014). Several numbering schemes have been proposed, each favoured by scientists in different immunological disciplines. The Kabat scheme (Kabat) was developed based on the location of regions of high sequence variation between sequences of the same domain type. It numbers antibody heavy (VH) and light (Vλ and Vκ) variable domains differently. Chothia’s scheme (Al-Lazikani, 1997) is the same as Kabat’s but corrects where an insertion is annotated around the first VH complementarity determining region (CDR) so that it corresponds to a structural loop. Similarly, the Enhanced Chothia scheme (Abhinandan and Martin, 2008) makes further structural corrections of indel positions. In contrast to these Kabat-like schemes, IMGT (Lefranc, 2003) and AHo (Honegger and Plückthun, 2001) each define a single scheme that covers both antibody and TCR (Vα and Vβ) variable domains. Thus, equivalent residue positions can easily be compared between domain types. IMGT and AHo differ in the number of positions they annotate (128 and 149, respectively) and in where they consider indels to occur. Separate online interfaces exist that can apply each numbering scheme: Kabat, Chothia and Enhanced Chothia through Abnum (Abhinandan and Martin, 2008); IMGT through DomainGapAlign (Ehrenmann, 2010); and AHo through PyIgClassify (Adolf-Bryfogle). No program currently exists that can apply all schemes, or for which an executable is available under an open license. We have developed ANARCI, a program that can annotate sequences with all five of the numbering schemes described above. We provide both a web interface and the software under an open license so that these fundamental annotations are easily available for further immunoinformatics analyses.

2 Algorithm

ANARCI takes single or multiple amino-acid protein sequences as input. The program aligns each sequence to a set of Hidden Markov Models (HMMs) using HMMER3 (Eddy, 2009). Each HMM describes the putative germ-line sequences for a domain type (VH, Vλ or Vκ, Vα or Vβ) of a particular species (Human, Mouse, Rat, Rabbit, Pig or Rhesus Monkey). The most significant alignment is then used to apply one of five numbering schemes.

2.1 Building Hidden Markov Models

The HMM for each domain type from each species was built as follows. The pre-aligned (gapped) germ-line sequences for the v-gene segment of each available species and domain type were downloaded from IMGT/GENE-DB (Giudicelli, 2005). The sequences of the j-gene segment were also downloaded and aligned to a single reference sequence using Muscle (Edgar, 2004) with a large (−10) gap-open penalty. All possible pairwise combinations of the relevant v-gene and j-gene segments were taken to form a set of putative germ-line domain sequences. For the VH domain, the d-gene segment was not included. Each position in the resulting alignment represents one of the 128 positions in the IMGT numbering scheme. From this alignment, an HMM was built using the hmmbuild tool with the ‘--hand’ option, which preserves the structure of the alignment. In total, 24 HMMs were built describing variable domain types from six different species. These HMMs were combined into a single HMM database using hmmpress.
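
The combination step can be illustrated with a minimal Python sketch. It is not taken from the ANARCI source: the helper names `pad_to_imgt` and `putative_germline_domains` are hypothetical, the segment names and sequences are toy data rather than real germ-line genes, and the layout (gapped v-gene segment in the left-most columns, j-gene segment in the right-most columns, gaps in between) is an assumption made for illustration.

```python
from itertools import product
from typing import Dict

N_COLUMNS = 128  # one alignment column per IMGT position, as stated in the text


def pad_to_imgt(v_aligned: str, j_aligned: str, n_columns: int = N_COLUMNS) -> str:
    """Join a gapped V segment and a J segment into one fixed-width row.

    Assumption: V occupies the left-most columns, J the right-most columns,
    and the columns in between are filled with gap characters.
    """
    v = v_aligned.replace(".", "-")  # normalise '.' gaps to '-'
    j = j_aligned.replace(".", "-")
    n_fill = n_columns - len(v) - len(j)
    if n_fill < 0:
        raise ValueError("V and J segments do not fit into the alignment width")
    return v + "-" * n_fill + j


def putative_germline_domains(
    v_segments: Dict[str, str], j_segments: Dict[str, str]
) -> Dict[str, str]:
    """Build every pairwise V/J combination as a fixed-width alignment row."""
    return {
        f"{v_name}+{j_name}": pad_to_imgt(v_seq, j_seq)
        for (v_name, v_seq), (j_name, j_seq) in product(
            v_segments.items(), j_segments.items()
        )
    }


# Toy segments (hypothetical, far shorter than real germ-line sequences).
toy_v = {"V1": "QVQL.VQSG", "V2": "EVQLLESGG"}
toy_j = {"J1": "WGQGT", "J2": "FGGGT", "J3": "WGRGT"}

domains = putative_germline_domains(toy_v, toy_j)
for name, row in domains.items():
    print(name, len(row))
```

With two toy v-gene segments and three toy j-gene segments, the sketch yields six rows of 128 columns each. In the workflow described above, an alignment of this kind is what would be passed to hmmbuild and then hmmpress.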

2.2 Numbering an input sequence

An input sequence is aligned to each HMM using hmmscan. If an alignment has a bit-score of less than 100, it is not considered further. This threshold proves effective at preventing the false recognition of other IG-like proteins. Otherwise, the most significant alignment classifies the domain type of the sequence, and the alignment is translated into the chosen numbering scheme. ANARCI can apply the Kabat, Chothia, Enhanced Chothia, IMGT or AHo schemes to VH, Vλ and Vκ domain sequences. The IMGT and AHo schemes can also be applied to Vα and Vβ domain sequences. Where possible, a position in the HMM alignment is annotated with the equivalent position in the numbering scheme. In regions where there is no direct equivalence between the alignment and the numbering scheme, the sequence is numbered according to the specification described in the corresponding publication. For example, HMM alignment position 40 for a VH sequence is equivalent to Kabat position 31-35X depending on the length of CDRH1. For each numbered domain, a header is written that describes the most significant alignment, including the species, chain type and alignment range. The numbering follows in a column-delimited format. Alternatively, users may import ANARCI as a Python module and use the API within their own scripts.
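
The thresholding and classification logic described above can be sketched as follows. This is an illustrative sketch only, not the ANARCI implementation or its API: the `Hit` record, the `classify` function, the chain-type labels and the toy scores are all hypothetical. Only the bit-score cut-off of 100 comes from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

BIT_SCORE_THRESHOLD = 100.0  # cut-off stated in the text


@dataclass
class Hit:
    """One HMM alignment for a query sequence (hypothetical record layout)."""
    species: str
    chain_type: str
    bit_score: float
    query_start: int
    query_end: int


def classify(hits: List[Hit], threshold: float = BIT_SCORE_THRESHOLD) -> Optional[Hit]:
    """Return the most significant hit, or None if no hit reaches the threshold."""
    # Alignments scoring below the threshold are not considered further.
    significant = [hit for hit in hits if hit.bit_score >= threshold]
    if not significant:
        return None
    # The highest-scoring remaining alignment decides species and chain type.
    return max(significant, key=lambda hit: hit.bit_score)


# Toy hits for a single query (scores are made up for illustration).
toy_hits = [
    Hit("human", "H", 182.4, 0, 118),
    Hit("mouse", "H", 160.1, 0, 118),
    Hit("human", "K", 61.3, 2, 104),
]

best = classify(toy_hits)
if best is None:
    print("no variable domain recognised")
else:
    print(best.species, best.chain_type, best.query_start, best.query_end)
```

Returning `None` for a query with no alignment at or above the threshold mirrors the behaviour described in the text, where such sequences are not numbered.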

3 Benchmark

With the rise of next-generation sequencing, annotating large numbers of antibody sequences is becoming a common task. We used ANARCI to number a set of 1 936 119 VH sequences taken from a vaccination response study performed at Oxford University. The algorithm took three hours of wall-clock time using 32 cores of AMD Opteron 6272 processors. All but 9560 sequences were successfully numbered. Where numbering failed, the sequences had very unusual insertions or deletions that may be a result of sequencing errors.
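
The success rate and throughput implied by these figures can be worked out with a short calculation. The sketch below treats the reported "three hours" as exactly 3 h, which is an assumption, so the throughput values are approximate.

```python
# Figures reported in the benchmark.
total_sequences = 1_936_119
failed = 9_560
wall_clock_hours = 3  # assumption: "three hours" taken as exactly 3 h
cores = 32

numbered = total_sequences - failed
success_rate = numbered / total_sequences

# Wall-clock throughput overall and per core.
seqs_per_second = total_sequences / (wall_clock_hours * 3600)
seqs_per_core_second = seqs_per_second / cores

print(f"numbered sequences: {numbered}")
print(f"success rate: {success_rate:.2%}")
print(f"throughput: {seqs_per_second:.1f} sequences/s")
print(f"per core: {seqs_per_core_second:.2f} sequences/s")
```

Under that assumption, about 99.5% of sequences were numbered, at roughly 179 sequences per second across the 32 cores.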

4 Webserver

In addition to the command-line tool, we provide a webserver interface to ANARCI (Fig. 1). Users may submit a single amino-acid sequence or a FASTA file of multiple sequences to apply their chosen scheme. The interface displays the assigned species and type of domain, the location of each domain in the sequence and, using the JSAV library (Martin, 2014), the annotated numbering scheme. Plain text or CSV formatted output files are available for download.

Fig. 1. The web-based interface to ANARCI. The species, domain type and numbering are reported for each sequence. The annotations can either be downloaded or visualized on the webpage.

5 Conclusion

We have developed ANARCI, a program for annotating antigen receptor variable domain amino-acid sequences with five commonly used numbering schemes. The program can be run as a command-line tool or imported as a Python module for incorporation in custom scripts. We also provide a public web-browser interface that can annotate small numbers of sequences. ANARCI is freely distributed under the GPLv3 license and available at opig.stats.ox.ac.uk/webapps/anarci.

References

1. A Honegger; A Plückthun. Yet another numbering scheme for immunoglobulin variable domains: an automatic modeling and analysis tool. J Mol Biol, 2001-06-08.

2. Robert C Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004-03-19.

3. K R Abhinandan; Andrew C R Martin. Analysis and improvements to Kabat and structurally correct numbering of antibody variable domains. Mol Immunol, 2008-07-09.

4. Sean R Eddy. A new generation of homology search tools based on probabilistic inference. Genome Inform, 2009-10.

5. Hiroki Shirai; Catherine Prades; Randi Vita; Paolo Marcatili; Bojana Popovic; Jianqing Xu; John P Overington; Kazunori Hirayama; Shinji Soga; Kazuhisa Tsunoyama; Dominic Clark; Marie-Paule Lefranc; Kazuyoshi Ikeda. Antibody informatics for drug discovery (review). Biochim Biophys Acta, 2014-08-08.

6. B Al-Lazikani; A M Lesk; C Chothia. Standard conformations for the canonical structures of immunoglobulins. J Mol Biol, 1997-11-07.

7. Jared Adolf-Bryfogle; Qifang Xu; Benjamin North; Andreas Lehmann; Roland L Dunbrack. PyIgClassify: a database of antibody CDR structural classifications. Nucleic Acids Res, 2014-11-11.

8. Véronique Giudicelli; Denys Chaume; Marie-Paule Lefranc. IMGT/GENE-DB: a comprehensive database for human and mouse immunoglobulin and T cell receptor genes. Nucleic Acids Res, 2005-01-01.

9. Andrew C R Martin. Viewing multiple sequence alignments with the JavaScript Sequence Alignment Viewer (JSAV). F1000Res, 2014-10-23.

10. François Ehrenmann; Quentin Kaas; Marie-Paule Lefranc. IMGT/3Dstructure-DB and IMGT/DomainGapAlign: a database and a tool for immunoglobulins or antibodies, T cell receptors, MHC, IgSF and MhcSF. Nucleic Acids Res, 2009-11-09.
