Konrad Krawczyk, Sebastian Kelm, Aleksandr Kovaltsuk, Jacob D Galson, Dominic Kelly, Johannes Trück, Cristian Regep, Jinwoo Leem, Wing K Wong, Jaroslaw Nowak, James Snowden, Michael Wright, Laura Starkie, Anthony Scott-Tucker, Jiye Shi, Charlotte M Deane.
Abstract
Every human possesses millions of distinct antibodies. It is now possible to analyze this diversity via next-generation sequencing of immunoglobulin genes (Ig-seq). This technique produces large-volume sequence snapshots of B-cell receptors that are indicative of the antibody repertoire. In this paper, we enrich these large-scale sequence datasets with structural information. Enriching a sequence with its structural data allows better approximation of many vital features, such as its binding site and specificity. Here, we describe the structural annotation of antibodies (SAAB) pipeline, which maps the outputs of large Ig-seq experiments to known antibody structures. We demonstrate the viability of our protocol on five separate Ig-seq datasets covering ca. 35 m unique amino acid sequences from ca. 600 individuals. Despite the great theoretical diversity of antibodies, we find that the majority of sequences coming from such studies can be reliably mapped to an existing structure.
Keywords: B-cell receptor; antibody specificity; bioinformatics tools; next-generation sequencing; protein; structural homology
Year: 2018 PMID: 30083160 PMCID: PMC6064724 DOI: 10.3389/fimmu.2018.01698
Source DB: PubMed Journal: Front Immunol ISSN: 1664-3224 Impact factor: 7.561
Figure 2. Chothia-aligning the 13.5 m unique baseline antibody variable sequences in datasets UCB_H and UCB_L to antibodies with known structures. (A) Full variable region sequence of the heavy chain. (B) Framework of the heavy chain. (C) Full variable region of the light chain. (D) Framework of the light chain. The pink bars indicate the number of sequences (right-hand y-axis) whose highest sequence identity structure match has the sequence identity given on the x-axis. The blue line (left-hand y-axis) indicates the expected root mean square deviation (RMSD) of a model built using a sequence identity match of that quality (with vertical SD error bars). For example, 80% sequence identity for the framework of the heavy chain translates to a 0.8 Å expected model RMSD.
Figure 1. The structural annotation of antibodies algorithm. The input consists of amino acid sequences in FASTA format. These sequences are Chothia-numbered using ANARCI (29). The Chothia-numbered sequences are then aligned to known antibody structures as defined by the structural antibody database (25). The best templates are identified both for the entire variable region and for the Chothia-delimited framework only. The full variable region templates are used to define the complementarity determining region (CDR) anchoring residues, which serve as input to FREAD; FREAD then determines whether a suitable template can be identified for each of the CDRs.
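The template-selection step in Figure 1 can be illustrated with a minimal sketch. It assumes sequences are already Chothia-numbered (the numbering step itself is not shown), uses commonly cited Chothia heavy-chain CDR boundaries (H1 26-32, H2 52-56, H3 95-102) as an assumption, and defines sequence identity as matches over the union of numbered positions, which is a simplification and not necessarily the scoring used by the pipeline. The template names and toy sequences are hypothetical.

```python
from typing import Dict, Optional, Tuple

# A Chothia-numbered chain: (position number, insertion code) -> residue.
Numbered = Dict[Tuple[int, str], str]

# Assumed Chothia CDR boundaries for the heavy chain (inclusive).
HEAVY_CDRS = {"H1": (26, 32), "H2": (52, 56), "H3": (95, 102)}


def is_framework(position: Tuple[int, str]) -> bool:
    """True if the numbered position lies outside every assumed CDR range."""
    number = position[0]
    return not any(lo <= number <= hi for lo, hi in HEAVY_CDRS.values())


def identity(query: Numbered, template: Numbered, framework_only: bool = False) -> float:
    """Fraction of numbered positions at which query and template agree."""
    positions = set(query) | set(template)
    if framework_only:
        positions = {p for p in positions if is_framework(p)}
    if not positions:
        return 0.0
    matches = sum(
        1 for p in positions
        if p in query and p in template and query[p] == template[p]
    )
    return matches / len(positions)


def best_template(
    query: Numbered,
    templates: Dict[str, Numbered],
    framework_only: bool = False,
) -> Optional[Tuple[str, float]]:
    """Return (template name, identity) for the highest-identity template."""
    best: Optional[Tuple[str, float]] = None
    for name, numbered in templates.items():
        score = identity(query, numbered, framework_only)
        if best is None or score > best[1]:
            best = (name, score)
    return best


def toy_numbering(sequence: str, start: int) -> Numbered:
    """Hypothetical helper: number residues consecutively, no insertions."""
    return {(start + i, ""): aa for i, aa in enumerate(sequence)}


# Toy fragment covering positions 22-32 (framework 22-25, CDR H1 26-32).
query = toy_numbering("CAASGFTFSSY", 22)
templates = {
    "tmplA": toy_numbering("CAASGYTFTSY", 22),  # differs inside H1 only
    "tmplB": toy_numbering("CKASGFTFSSY", 22),  # differs in the framework only
}

full_hit = best_template(query, templates)
framework_hit = best_template(query, templates, framework_only=True)
print(full_hit, framework_hit)
```

In this toy case the best full-region template and the best framework template differ, which is why the two searches are kept separate.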
Sequence datasets.
| Dataset name | Non-redundant sequences (H = heavy chain, L = light chain) | Individuals | Description |
|---|---|---|---|
| UCB_H | H: 4,925,532 | 494 (pooled) | Proprietary, non-immunized comprehensive diversity library |
| UCB_L | L: 8,380,540 | | |
| HBP | H: 7,685,149 | 15 | Hep B Primary vaccination ( |
| HBB | H: 4,718,120 | 10 | Hep B Booster ( |
| MEN | H: 6,036,457 | 10 | Meningococcal vaccination ( |
| FLU | H: 3,409,916 | 58 | Influenza vaccination ( |
We employed one dataset of baseline human antibody diversity (UCB_H and UCB_L) and four immunized datasets (HBP, HBB, MEN, and FLU). In total, the datasets comprised ca. 600 individuals and ca. 35 m sequences (the sum of the non-redundant counts in the table).
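The totals can be checked directly from the table. The short sketch below sums the listed counts; it assumes that UCB_L shares the pooled individuals of UCB_H (its cells are blank in the table) and that no individuals are shared between the other datasets.

```python
# Non-redundant sequence counts and individuals, copied from the table.
# UCB_L is given 0 individuals on the assumption that it shares the
# pooled donors already counted under UCB_H.
datasets = {
    "UCB_H": (4_925_532, 494),
    "UCB_L": (8_380_540, 0),
    "HBP": (7_685_149, 15),
    "HBB": (4_718_120, 10),
    "MEN": (6_036_457, 10),
    "FLU": (3_409_916, 58),
}

total_sequences = sum(count for count, _ in datasets.values())
total_individuals = sum(people for _, people in datasets.values())
baseline_sequences = datasets["UCB_H"][0] + datasets["UCB_L"][0]

print(f"{total_sequences:,} sequences")      # 35,155,714 sequences
print(f"{total_individuals} individuals")     # 587 individuals
print(f"{baseline_sequences:,} baseline")    # 13,306,072 baseline
```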
Structural mapping of the complementarity determining regions (CDRs) in the UCB_H and UCB_L datasets.
| Dataset and CDR subset | Total: redundant (non-redundant) | In the Protein Data Bank (PDB): redundant (non-redundant) | Can model: redundant (non-redundant) | Cannot model: redundant (non-redundant) |
|---|---|---|---|---|
| UCB_H | | | | |
| H1 | 4,718,716 (110,495) | 2,238,760 (159) | 2,479,674 (110,245) | 282 (91) |
| H2 | 4,718,717 (159,222) | 1,568,190 (305) | 3,150,527 (158,917) | 0 (0) |
| H3 | 4,714,545 (1,623,070) | 61 (25) | 3,614,289 (1,088,900) | 1,100,195 (534,145) |
| UCB_L | | | | |
| L1 | 8,127,157 (1,020,446) | 889,922 (135) | 7,206,821 (1,005,664) | 30,414 (14,647) |
| L2 | 8,127,157 (159,646) | 2,942,147 (189) | 5,137,392 (151,627) | 47,618 (7,830) |
| L3 | 8,120,282 (1,080,668) | 135,548 (130) | 7,876,402 (1,060,293) | 109,332 (20,245) |
Chothia-defined CDRs in each dataset were extracted from the full variable region. CDRs shorter than three residues were discarded. The redundant datasets were then reduced to unique sequences only, which we denote as "non-redundant," and the resulting numbers of loops are given in the "Total" column. "In the PDB" indicates the number of loops for which we could find direct sequence matches in the PDB. Of the loops that were not found directly in the PDB, the "Can model" column indicates the number for which FREAD found suitable templates. The "Cannot model" column shows the number of loops that were not in the PDB and for which FREAD could not find templates. In each case, the numbers of redundant loops are given without parentheses, whereas the numbers of non-redundant loops are given in parentheses.
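The bookkeeping behind the table columns can be sketched as follows. This is an illustrative sketch only: `can_model` is a hypothetical stand-in for the FREAD template search (it is not FREAD's interface), and the loop sequences and the toy length rule are made up.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Set, Tuple


def classify_loops(
    loops: Iterable[str],
    pdb_loops: Set[str],
    can_model: Callable[[str], bool],
    min_length: int = 3,
) -> Dict[str, Tuple[int, int]]:
    """Return {category: (redundant count, non-redundant count)}."""
    redundant: Counter = Counter()
    unique: Dict[str, Set[str]] = {
        "in_pdb": set(),
        "can_model": set(),
        "cannot_model": set(),
    }
    for loop in loops:
        if len(loop) < min_length:
            continue  # loops shorter than three residues are discarded
        if loop in pdb_loops:
            label = "in_pdb"  # direct sequence match in the PDB
        elif can_model(loop):
            label = "can_model"  # no direct match, but a template exists
        else:
            label = "cannot_model"
        redundant[label] += 1
        unique[label].add(loop)
    return {label: (redundant[label], len(seqs)) for label, seqs in unique.items()}


# Made-up loops; "AR" is dropped by the length filter.
loops = ["ARDY", "ARDY", "GFTFSSY", "AR", "ARWWWWDY", "GFTFSSY", "ARGGDY"]
pdb_loops = {"GFTFSSY"}


def toy_can_model(loop: str) -> bool:
    """Toy rule standing in for a real template search."""
    return len(loop) <= 6


summary = classify_loops(loops, pdb_loops, toy_can_model)
print(summary)
```

Each category reports the redundant count first and the count of unique sequences second, mirroring the "redundant (non-redundant)" layout of the table.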
Figure 3. Example of how structural mapping provides clues to antibody specificity. Structural annotation of antibodies (SAAB) outputs the Protein Data Bank (PDB) codes used to map the frameworks, the full variable sequence, and each of the complementarity determining regions of a sequence. The PDB codes are also mapped to the antigens recognized by the antibody structures (as stored in the structural antibody database). If sequences map to similar PDB structures, this could indicate similar binding sites and thus similar specificity. As an example, we examined the top 10 PDBs that were used to map H3 in the FLU dataset. More than 7k H3 sequences were mapped to 4m5z, a complex of an antibody with influenza hemagglutinin (this structure is not among the top 10 H3-mapped PDBs in our other datasets). We show several sequence-diverse H3 loops on the left, which are unlikely to be grouped together by sequence-only methods. However, SAAB identifies that they are all likely to share a similar structure to the H3 loop of 4m5z (right, in blue) and, therefore, perhaps similar specificity.
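The grouping described in the caption amounts to counting how many sequences map to each template and collecting the sequences per template. A minimal sketch follows, assuming the mapping is available as (H3 sequence, PDB code) pairs; the H3 sequences and the placeholder codes `tmpl1` and `tmpl2` are made up for illustration, and only 4m5z comes from the text.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Mapping = Tuple[str, str]  # (H3 loop sequence, PDB code of its template)


def top_templates(mappings: Iterable[Mapping], n: int = 10) -> List[Tuple[str, int]]:
    """Most frequently used template codes with their sequence counts."""
    return Counter(code for _, code in mappings).most_common(n)


def group_by_template(mappings: Iterable[Mapping]) -> Dict[str, List[str]]:
    """Collect the loop sequences that share a structural template."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sequence, code in mappings:
        groups[code].append(sequence)
    return dict(groups)


# Made-up H3 loops; three dissimilar sequences share the template 4m5z.
mappings = [
    ("ARDLGYSSGWYFDV", "4m5z"),
    ("AKEPTHYDILTGYY", "4m5z"),
    ("ARGQWLVRDAFDIW", "4m5z"),
    ("ARDYYGSSYFDY", "tmpl1"),
    ("ARDYYGSSYFDY", "tmpl1"),
    ("ASGGNWFDP", "tmpl2"),
]

ranking = top_templates(mappings)
groups = group_by_template(mappings)
print(ranking[0])
print(groups["4m5z"])
```

Sequences that fall in the same group are candidates for sharing a loop structure even when their sequences differ, which is the signal the figure highlights.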