
Immunoglobulin analysis tool: a novel tool for the analysis of human and mouse heavy and light chain transcripts.

Tobias Rogosch, Sebastian Kerzel, Kam Hon Hoi, Zhixin Zhang, Rolf F Maier, Gregory C Ippolito, Michael Zemlin.

Abstract

Sequence analysis of immunoglobulin (Ig) heavy and light chain transcripts can refine categorization of B cell subpopulations and can shed light on the selective forces that act during immune responses or immune dysregulation, such as autoimmunity, allergy, and B cell malignancy. High-throughput sequencing yields Ig transcript collections of unprecedented size. The authoritative web-based IMGT/HighV-QUEST program is capable of analyzing large collections of transcripts and provides annotated output files to describe many key properties of Ig transcripts. However, additional processing of these flat files is required to create figures, or to facilitate analysis of additional features and comparisons between sequence sets. We present an easy-to-use Microsoft® Excel® based software, named Immunoglobulin Analysis Tool (IgAT), for the summary, interrogation, and further processing of IMGT/HighV-QUEST output files. IgAT generates descriptive statistics and high-quality figures for collections of murine or human Ig heavy or light chain transcripts ranging from 1 to 150,000 sequences. In addition to traditionally studied properties of Ig transcripts (such as the usage of germline gene segments, or the length and composition of the CDR-3 region), IgAT also uses published algorithms to calculate the probability of antigen selection based on somatic mutational patterns, the average hydrophobicity of the antigen-binding sites, and predictable structural properties of the CDR-H3 loop according to Shirai's H3-rules. These refined analyses provide in-depth information about the selective forces acting upon Ig repertoires and allow the statistical and graphical comparison of two or more sequence sets. IgAT is easy to use on any computer running Excel® 2003 or later.
Thus, IgAT is a useful tool to gain insights into the selective forces and functional properties of small to extremely large collections of Ig transcripts, thereby assisting a researcher to mine a data set to its fullest.


Keywords: antibody repertoire; deep sequencing; high-throughput analysis; immunoglobulin heavy chain gene; immunoglobulin light chain gene; rearrangement; sequence analysis software; somatic mutation

Year: 2012 PMID: 22754554 PMCID: PMC3384897 DOI: 10.3389/fimmu.2012.00176

Source DB: PubMed Journal: Front Immunol ISSN: 1664-3224 Impact factor: 7.561

Introduction

The fate of a B cell largely depends on the B cell receptor, or immunoglobulin (Ig), which it expresses on its surface (Rajewsky, 1996; Kurosaki et al., 2010). Thus, the analysis of Ig gene transcripts can give important insights into the selective forces that act upon B cells during cellular maturation or during physiological or pathological immune reactions (Schroeder and Cavacini, 2010). For example, repertoire studies of Ig transcripts have revealed that the length and composition of the Ig heavy chain third complementarity determining region (CDR-H3) is strictly regulated during ontogeny, and somatic mutations are rare during the perinatal period even in secondary antibody repertoires (Schroeder et al., 1987, 2001; Cuisinier et al., 1993; Brezinschek et al., 1997; Zemlin et al., 2001, 2007; Kolar et al., 2004; Souto-Carneiro et al., 2005; Schelonka et al., 2007; Richl et al., 2008; Prabakaran et al., 2012). It has also been shown that the composition of the antigen-binding site plays a key role during B cell maturation and during the recruitment into various B cell subsets (Schelonka et al., 2007; Arnaout et al., 2011) and during protective immune responses (Rajewsky, 1996; Frolich et al., 2010). Moreover, studies of Ig repertoires can give valuable insights into the immune dysregulation that underlies the development of autoimmunity (Dorner and Lipsky, 2005; Vrolix et al., 2010; Zuckerman et al., 2010; Kalinina et al., 2011) and allergies (Snow et al., 1998; Takhar et al., 2007; Kerzel et al., 2010). The antigen-binding site of the antibody is endowed with an almost unlimited theoretical diversity due to the imprecise junction of Variable, Joining, and (in the case of the Ig heavy chain) Diversity gene segments (Tonegawa, 1983). 
The random exonucleolytic truncation of the rearranged gene segments and the insertion of non-encoded N-nucleotides and P-nucleotides, the shuffling of light and heavy chains, and the insertion of somatic mutations during the germinal center reaction further expand the potential diversity exponentially. Theoretically, these mechanisms allow the production of more than 10¹⁵ different antigen-binding sites (Schroeder and Cavacini, 2010). Although seemingly limitless in theoretical potential, the human antibody response probably does not exploit more than 1% of its potential diversity (Boyd et al., 2009; Glanville et al., 2009; Arnaout et al., 2011). Thus, it seems unlikely that the expressed antibody repertoire would represent merely a random selection of the theoretical diversity. In order to discover potential biases within repertoires that may have been shaped by selective forces, it is desirable to study large numbers of Ig gene transcripts. With the advent of next generation sequencing (NGS) technologies, such as Roche 454 pyrosequencing, the direct large-scale sampling of sequence collections of 10⁴, 10⁵, and even greater numbers is now possible within the span of a few days (Boyd et al., 2009; Reddy et al., 2010; Wu et al., 2010; Zuckerman et al., 2010; Jiang et al., 2011; Ippolito et al., 2012). Previously published semi-automated instruments cannot be used for such large collections or state-of-the-art characterizations due to significant quantitative and qualitative advances in Ig gene analysis (Shannon, 1997; Johnson and Wu, 2000; Zemlin et al., 2003). Thus, novel analysis tools are required which can handle extremely large sequence batches. 
The online repository “International ImMunoGeneTics Information System®” (IMGT®, founder and director: Marie-Paule Lefranc, Montpellier, France; Brochet et al., 2008; Lefranc et al., 2009) offers IMGT/HighV-QUEST, a free online tool to assign Variable, Diversity, and Joining gene segments to each individual full-length Ig transcript in batches of up to 150,000 sequences. In addition, IMGT/HighV-QUEST provides numerous descriptors for each individual sequence, such as assignment of N- and P-nucleotides, amino acid translation, position of somatic mutations, isoelectric point, and many others (Giudicelli et al., 2011; Alamyar et al., 2012). The output files of these analyses contain descriptions of each individual sequence and can be downloaded as text files in comma separated values (CSV) format for documentation and further analysis. Our aim was to create an easy-to-use software tool for the generation of informative statistics and publication-ready figures derived from the HighV-QUEST text-only output files. Moreover, we sought to include new and important analyses of higher order antibody features. For instance, although Shirai’s H3-rules have been formulated for the sequence-based prediction of CDR-H3 structural properties (Shirai et al., 1999), and although complex algorithms have been published to determine the probability by which a somatic mutation profile might arise non-randomly from antigen-driven selection (Chang and Casali, 1994; Lossos et al., 2000), there are at present no software tools available to the research community for high-throughput application of these rules and algorithms. Here we present Immunoglobulin Analysis Tool (IgAT), a novel and user-friendly software tool for the extensive analysis and graphical presentation of very large collections of Ig transcripts which have been pre-analyzed by IMGT/HighV-QUEST. 
IgAT additionally calculates the probability of antigen-driven selection within Ig repertoires and predicts structural properties of the antigen-binding site. IgAT can be used to analyze up to 150,000 human or murine heavy or light chain transcripts in a single run of the application and automatically generates 25 Microsoft® PowerPoint® graphics files illustrating key characteristics of the Ig repertoire, such as VDJ gene utilization, amino acid use, CDR-H3 junctional diversity, and average hydrophobicity, as well as the quantitation of somatic mutation among Ig heavy chain transcripts, to name but a few. IgAT is available free of charge. When applied to two or more sequence collections (e.g., samples from multiple individuals, different cell subsets, or identical cell subsets but under differing immunological conditions), IgAT readily yields the necessary data to allow statistical and graphical comparisons between various repertoires.

Methods

IgAT is a Microsoft® Excel® workbook containing the analysis functions as Visual Basic® for Applications (VBA) code. Each sheet is described in the results section. The workbook was created in Excel 2010 on Microsoft Windows® XP but should be compatible with Excel versions down to Excel 2003 with some limitations (Table 1). IgAT is not compatible with Excel for Mac®. The file can be found at: www.uni-marburg.de/neonat/igat

Table 1

Estimate of the maximum size of sequence collections that can be processed.

Excel version	Operating system	Max. memory	No. of sequences
Excel 2003	Windows XP	1 Gigabyte	∼40,000
Excel 2003	Windows 7 (32/64-bit)	1 Gigabyte	∼40,000
Excel 2007	Windows XP	2 Gigabyte	∼60,000
Excel 2010 (32-bit)	Windows 7 (32/64-bit)	2 Gigabyte	∼60,000
Excel 2010 (64-bit)	Windows 7 (64-bit)	8 Terabyte	150,000 (max. no. of IMGT/HighV-QUEST)

The restrictions are caused by the limited amount of memory that Excel can address. Excel versions prior to 2007 cannot address more than 1 GB of memory. The 32-bit versions of Excel 2007/2010 can use 2 GB of memory, while the 64-bit versions are virtually unrestricted.


Results

In the following, we present the features offered by IgAT, using as an example a previously published collection of 78,569 murine Ig heavy chain sequences that contained 18,403 functional sequences (Reddy et al., 2010). These sequences were obtained from CD138+ plasma-cell-enriched bone marrow mRNA of two BALB/c mice immunized with human complement serine protease (C1S; NCBI Entrez Gene ID: 716). The starting point is a text file of FASTA-formatted Ig DNA sequences, as can be obtained from a Roche 454 experimental run or by other techniques. When submitting the sequence batch to IMGT/HighV-QUEST, under the advanced parameters setting, “Nb of accepted D-GENE in JUNCTION” must be set to the default (1), as IgAT will only process IMGT output files that assign at most one D-gene to each V-DH-J junction. IMGT individual result files are not necessary for the analysis with IgAT.

Input

As input, IgAT takes the 11 CSV text output files standardly generated by IMGT/HighV-QUEST derived from its analysis of raw 454 sequence data uploaded by the researcher. IgAT imports the folder containing the IMGT/HighV-QUEST CSV text output files through the cell “C6” of the “input” worksheet. (Alternatively, the IgAT program may be copied and pasted into the folder, which already contains the IMGT files.) Optionally, sequences marked as “unproductive” by IMGT/HighV-QUEST can be deleted. Deleting unproductive sequences will improve performance but might discard functional transcripts as Roche 454 sequencing is prone to homopolymer errors due to technical reasons. The species (human or mouse), the Ig chain (heavy, lambda, or kappa), the minimum number of non-mutated nucleotides that are required to identify a diversity (D) gene, and the option to calculate the Taq-error must be chosen before starting the analysis. The Ig isotype is needed to calculate the Taq-error (Figure 1).

Figure 1

Screenshot of the “input” worksheet.

To start the analysis simply press the button “analyze data.” If “convert formulas to text” is checked, most formulas will be replaced by their values, resulting in reduced file size and recalculation time. In this case, however, additional changes will not have any effect on the analysis output. Once the sequence analysis is complete, the graphs can be exported as Microsoft PowerPoint® files (.ppt) by pressing “save graphs as ppt.” The workbook was created in Excel 2010 and tested in Excel 2003 and 2010. To determine if your Microsoft Office® software meets this requirement, press “check office version.” It might be compatible with other versions (not tested).

Summary

The number of total, non-functional, functional, and unique sequences, as well as the number of clonotypes is listed in the “summary” worksheet (Figure 2). Deep sequencing technologies usually yield a significant proportion of incomplete or otherwise defective sequences. IgAT counts the sequences which were labeled “unproductive,” “no result,” or “unknown” by IMGT/HighV-QUEST.

Figure 2

Screenshot of the “summary” worksheet of IgAT.

Sequences are considered clonally related if they (i) use the same V and J genes, (ii) have an identical CDR-3 length, and (iii) have a highly homologous CDR-3 region. The default definition of “highly homologous CDR-3 region” is ≤10% difference in nucleotide sequence. When defining clonotypic parameters, IgAT gives the user the flexibility to choose another percentage difference in nucleotide sequence, a total number of nucleotide matches, or a percentage or total number difference in amino acid sequence.
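The clonotype definition above can be illustrated with a short Python sketch. It is not the VBA code used by IgAT: the greedy comparison against the first member of each group, the function names, and the toy gene labels and CDR-3 sequences are all assumptions made for illustration. Only the three criteria (same V and J genes, identical CDR-3 length, ≤10% nucleotide difference) are taken from the text.

```python
from collections import defaultdict


def fraction_different(a: str, b: str) -> float:
    """Fraction of positions that differ between two equal-length strings."""
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def count_clonotypes(sequences, max_diff=0.10):
    """Count clonotypes with a greedy, single-pass grouping (an assumption).

    sequences: iterable of (v_gene, j_gene, cdr3_nucleotides) tuples.
    A sequence joins an existing clonotype if V gene, J gene, and CDR-3
    length match and its CDR-3 differs from that clonotype's first member
    by at most max_diff (fraction of nucleotides).
    """
    representatives = defaultdict(list)  # (V, J, CDR-3 length) -> first members
    for v_gene, j_gene, cdr3 in sequences:
        reps = representatives[(v_gene, j_gene, len(cdr3))]
        if not any(fraction_different(cdr3, r) <= max_diff for r in reps):
            reps.append(cdr3)
    return sum(len(reps) for reps in representatives.values())


# Hypothetical toy data, not taken from the example data set.
toy = [
    ("V1", "J2", "GCAAGAGGGGACTACTGGGGC"),
    ("V1", "J2", "GCAAGAGGGGATTACTGGGGC"),  # 1 of 21 nt differs: same clonotype
    ("V1", "J2", "GCAAGATCCCCCTACTGGGGC"),  # 5 of 21 nt differ: new clonotype
    ("V1", "J4", "GCAAGAGGGGACTACTGGGGC"),  # different J gene: new clonotype
]
n_clonotypes = count_clonotypes(toy)
clonotypic_diversity = n_clonotypes / len(toy)  # clonotypes per functional sequence
print(n_clonotypes, clonotypic_diversity)
```

A greedy grouping of this kind depends on the order of the input; a single-linkage clustering would be an alternative reading of the same definition.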

Data

The “Data” worksheet contains the imported data of the IMGT/HighV-QUEST output files. IgAT uses the taxonomy and numbering of the IMGT repository (Lefranc et al., 2009).

Sequence

In this worksheet, each nucleotide sequence occurs in an individual row and is split into framework regions (FR) 1–4 and complementarity determining regions 1–3. The sequences are ordered by functionality, which is defined by the existence of an open reading frame throughout the sequence, and by V gene segment utilization. Furthermore, the “Sequence” worksheet provides the length and amino acid translation for CDR-3, number of clonotypes, and identifies sequences with potential “VH-replacement footprints” (only human sequences) that can originate from VH replacement during receptor editing according to Zhang et al. (2003). In addition, sequences can be tagged with the sample ID. Based on sample IDs, the analysis can be confined to one or several samples or the transcripts can be divided into two groups for comparison.

VDJ

The “VDJ” worksheet contains absolute numbers, percentages, and graphs of the V-, DH-, and J-gene families and individual genes in the order of their localization in the germline (Figure 3).

Figure 3

The “VDJ” worksheet contains graphs displaying the relative utilization of V.

CDR-3_length

The “CDR-3_length” worksheet displays the nucleotide length distribution of CDR-3, N1-, and N2-nucleotides within the analyzed sequence collection (Figure 4). In addition, the average lengths of the components of CDR-3, namely V length, P-nucleotides 3′ of V, N1-nucleotides, P-nucleotides 5′ of D, D length, P-nucleotides 3′ of D, N2-nucleotides, P-nucleotides 3′ of J, and J length are displayed in a deconstruction graph. A separate graph displays the deconstruction of those sequences without an identifiable D-gene. As a default for the IgH chain, CDR-H3 is defined as amino acids 105–117, according to the IMGT unique numbering system. The descriptive statistics given in the “CDR-3_length” worksheet can be used for comparative statistics with other sequence collections.

Figure 4

The graphs in the “CDR-3_length” worksheet display the length distribution of CDR-H3 (A; positions 105–117), N1 (B), N2 (C), and deconstruction graphs for CDR-H3 with (D) or without (E) identifiable D. Lengths are given in nucleotides.

Somat_mut

This worksheet displays the somatic mutation rate of each transcript (mutations per 1,000 nt), as well as the average mutational frequency (Figure 5A). In addition, the probability of antigen selection is analyzed by assessing the distribution of replacement and silent mutations between FRs and CDRs (only available for heavy chains). Using the method of Lossos et al. (2000), we determined the replacement frequency and the relative length of FR and CDR of each germline VH gene. The average probability that a random mutation would fall in a CDR was calculated to be 0.23 ± 0.012, and the sequence-inherent probability that a mutation in the CDR would be a replacement mutation was estimated to be 0.79 ± 0.01. Therefore, the chance for a random mutation to introduce a replacement mutation into the CDR was 0.18 (0.23 × 0.79). The binomial distribution method of Chang and Casali (1994) was used to calculate the 90 and 95% confidence limits for the ratio of replacement mutations in the CDR (RCDR) to the number of total mutations in the V region (MV) as described by Dahlke et al. (2006). These confidence intervals are shown as dark (90%) and light gray (95%) shaded areas in Figure 5B. A data point falling outside these confidence limits represents a sequence that has a high proportion of replacement mutations in the CDR. Therefore, an allocation above the upper or below the lower confidence limit is considered indicative of Ag-driven selection. It should be mentioned that refined methods for calculation of antigen selection have been published and are available to the public (Hershberg et al., 2008; Uduman et al., 2011). However, at present IgAT cannot incorporate this type of analysis, because sequence alignments in large sequence collections would require a different software environment.

Figure 5

(A) Somatic mutation frequency of Ig transcripts (mutations per 1000 nt). Each data point represents the somatic mutation frequency of one sequence. (B) Inference of Ag selection in Ig transcripts. Shown is the ratio of replacement mutations in CDR-H1 and CDR-H2 (RCDR) to the total number of mutations in the V region (MV) plotted against MV. The dark shaded area represents the 90% confidence limits and the light gray shaded area the 95% confidence limits for the probability of random mutations. A data point falling outside these confidence limits represents a sequence that has a high proportion of replacement mutations in the CDR. The probability that such a sequence has accumulated as many replacement mutations in the CDR by mere random mutation is p = 0.1 and p = 0.05, respectively. An allocation above the upper confidence limit was considered indicative of Ag selection. Data points are accompanied by their observed frequency. 6.5% (α = 0.05) of the sequences were Ag selected (α = 0.1: 9.6%).
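The binomial reasoning behind Figure 5B can be sketched in Python as follows. The two probabilities (0.23 and 0.79) are taken from the text; everything else is an assumption made for illustration, in particular the one-sided upper-tail reading of the confidence limit and the function names. The limits plotted by IgAT may be computed differently.

```python
from math import comb

P_CDR = 0.23    # probability that a random mutation falls in a CDR (from the text)
P_REPL = 0.79   # probability that a CDR mutation is a replacement (from the text)
P_RCDR = P_CDR * P_REPL  # chance of a random replacement mutation in a CDR, about 0.18


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(k: int, n: int, p: float = P_RCDR) -> float:
    """Probability of observing k or more CDR replacement mutations by chance
    among n total V region mutations."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


def upper_limit_ratio(n: int, alpha: float, p: float = P_RCDR):
    """Smallest ratio RCDR/MV whose chance probability is at most alpha
    (assumed one-sided limit); returns None if no such ratio exists."""
    for k in range(n + 1):
        if upper_tail(k, n, p) <= alpha:
            return k / n
    return None


# Example: a sequence with MV = 10 mutations in the V region.
for alpha in (0.10, 0.05):
    print(alpha, upper_limit_ratio(10, alpha))
```

Under these assumptions, a sequence whose ratio RCDR/MV reaches the returned value would lie on or above the upper limit for the chosen α.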

AA

This worksheet shows the amino acid distribution and frequency within the CDR-3 loop for all sequences whose CDR-3 length equals the value entered in cell “G3,” together with the resulting amino acid variability plots: the Shannon entropy, a measure of amino acid variability at a given position of aligned protein sequences, and the Kabat–Wu plot, the number of different amino acids observed at a position divided by the frequency of the most common amino acid (Shannon, 1997; Johnson and Wu, 2000; Zemlin et al., 2003; Figure 6).

Figure 6

Graphic output of the analysis of amino acid frequencies and variability, using as an example the CDR-H3 sequences with the length of 12 amino acids (positions 105–117). (A) The Shannon entropy for each position in the CDR-H3 region (the higher the score the more variable the position in terms of amino acids). (B) Relative amino acid frequencies at the positions 105–117 for CDR-H3 region. Each bar represents 100% of the amino acid residues found at this specific position. The amino acid residues are stacked in the order of their hydrophobicity according to a normalized Kyte–Doolittle Index (Eisenberg, 1984). Charged amino acid residues are at the bottom, and hydrophobic amino acid residues at the top of each bar as presented previously (Zemlin et al., 2003). (C) The Kabat–Wu variability for each position in the CDR-3 region (the higher the score the more variable the position). (D) Overall amino acid frequencies within the CDR-3 loop (positions 107–114).
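The two variability measures can be computed per position as in the following Python sketch. The Kabat–Wu definition follows the text (number of different amino acids divided by the frequency of the most common one). The base-2 logarithm for the Shannon entropy, the function name, and the toy sequences are assumptions; IgAT itself performs these calculations in Excel.

```python
from collections import Counter
from math import log2


def position_variability(aligned_seqs):
    """Return a list of (shannon_entropy, kabat_wu) tuples, one per position.

    aligned_seqs: amino acid strings of identical length, for example all
    CDR-3 regions of one length class.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same length")
    n = len(aligned_seqs)
    result = []
    for pos in range(length):
        counts = Counter(s[pos] for s in aligned_seqs)
        # Shannon entropy in bits (log base 2 is an assumption of this sketch).
        entropy = 0.0 - sum((c / n) * log2(c / n) for c in counts.values())
        # Kabat-Wu: number of different residues / frequency of the most common one.
        most_common_freq = counts.most_common(1)[0][1] / n
        kabat_wu = len(counts) / most_common_freq
        result.append((entropy, kabat_wu))
    return result


# Hypothetical two-residue toy alignment.
print(position_variability(["AR", "AK", "AD", "AR"]))
```

In the toy alignment the first position is invariant (entropy 0, Kabat–Wu 1), whereas the second position carries three different residues.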

AA_frequency

This diagram shows the amino acid frequencies of the CDR-3 loop for all sequences (Figure 7). The frequency is given as percent of all amino acids encoded by CDR-3 from all unique sequences studied. As a default for the IgH chain, the CDR-H3 loop is defined as the amino acids 107–114, according to the IMGT unique numbering system, but the definition of the loop can be modified by the user by entering the limits into the worksheet “AA,” cells N5 and N6.

Figure 7

Amino acid frequencies of the CDR-H3 loop for all unique sequences (positions 107–114).

Kyte–Doolittle

The normalized Kyte–Doolittle scale assigns one value to each amino acid. Negative numbers represent polar/hydrophilic amino acids and positive values represent hydrophobic amino acids (Kyte and Doolittle, 1982; Eisenberg, 1984). Figure 8 displays the distribution of average CDR-3 hydrophobicities according to the normalized Kyte–Doolittle scale.

Figure 8

Distribution of average CDR-H3 loop hydrophobicities according to a normalized Kyte–Doolittle scale (positions 107–114; Eisenberg, 1984).
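The average hydrophobicity of a CDR-H3 loop is the mean of the per-residue scale values, as the following Python sketch shows. Two assumptions should be noted. First, the dictionary holds the original Kyte–Doolittle (1982) hydropathy values, not the normalized scale (Eisenberg, 1984) that IgAT uses, so the absolute numbers differ from those in Figure 8. Second, the helper that trims positions 105–106 and 115–117 from a CDR-H3 string assumes an ungapped amino acid sequence of at least five residues; the function names and the toy sequence are hypothetical.

```python
# Original Kyte-Doolittle (1982) hydropathy values. IgAT uses a normalized
# version of this scale; the raw values serve here for illustration only.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def loop_from_cdr3(cdr3_aa: str) -> str:
    """Trim a CDR-H3 (positions 105-117) to the loop (positions 107-114).

    Assumes an ungapped sequence: the first two residues (105-106) and the
    last three residues (115-117) are removed.
    """
    return cdr3_aa[2:-3]


def mean_hydropathy(residues: str) -> float:
    """Average scale value over all residues of the given sequence."""
    if not residues:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


# Hypothetical 13-residue CDR-H3.
loop = loop_from_cdr3("ARGYYGSSYFDYA")
print(loop, mean_hydropathy(loop))
```

Negative averages indicate a predominantly hydrophilic loop and positive averages a hydrophobic one, in line with the sign convention described above.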

IGHD

This worksheet displays the DH gene reading frame usage (Figure 9). For each DH segment there is one reading frame encoding predominantly hydrophilic residues (especially tyrosine and serine; RF1), followed by a hydrophobic reading frame (RF2), and lastly a third reading frame that often encodes a stop codon (RF3). Thus, the third reading frame can be used only if either somatic mutations or else nucleotide losses during VDJ recombination delete the germline stop codon.

Figure 9

Reading frame utilization given as percent of all unique sequences with identifiable D. The DH reading frames are defined according to the nomenclature of Ichihara et al. (1989).
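Why one reading frame of a DH segment is often unusable can be illustrated by translating a D segment in its three forward frames, as in the Python sketch below. The nucleotide sequence is a hypothetical toy example and not a germline DH gene. The frames are numbered by their offset only and are not mapped to the RF1–RF3 nomenclature of Ichihara et al. (1989). The codon table is the standard genetic code.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_frames(nt: str):
    """Translate a nucleotide sequence in its three forward reading frames.

    Returns three amino acid strings (offsets 0, 1, and 2); '*' marks a stop
    codon. Incomplete codons at the 3' end are ignored.
    """
    nt = nt.upper()
    frames = []
    for offset in range(3):
        codons = (nt[i:i + 3] for i in range(offset, len(nt) - 2, 3))
        frames.append("".join(CODON_TABLE[codon] for codon in codons))
    return frames


# Hypothetical toy D segment, for illustration only.
for offset, protein in enumerate(translate_frames("TTACTACGGTAGTAGCTAC")):
    print(offset, protein, "contains stop" if "*" in protein else "open")
```

In this toy sequence one frame is rich in tyrosine and serine, one encodes mainly hydrophobic residues, and one contains stop codons, which mirrors the pattern described for germline DH segments.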

Shirai

In this worksheet the predicted structural features of the CDR-H3 are displayed (Figure 10). The “H3-rules” by Shirai (Shirai et al., 1999; Kuroda et al., 2008) are used to predict the structure of the CDR-H3 loop and its base from the amino acid sequence, the position of the residues, and side chain characteristics such as hydrophobicity and size. The structure of the base can be either extended, kinked, or extra kinked. In case of the latter two, the H3-rules may predict whether an intact hydrogen bond ladder or a deformed hairpin is formed within the loop.

Figure 10

Predicted structural features of the CDR-3 according to the “H3-rules” by Shirai et al. (K−, kinked base; K+, extra kinked base; K−/+, kinked or extra kinked base; E, extended base; hp def K−, deformed hairpin in sequences with kinked base; hp def K+, deformed hairpin in sequences with extra kinked base; hp def K−/+, deformed hairpin in sequences with kinked and extra kinked base; H lad K−, intact hydrogen bond ladder in sequences with kinked base; H lad K+, intact hydrogen bond ladder in sequences with extra kinked base; H lad K−/+, intact hydrogen bond ladder in sequences with kinked and extra kinked base).

Taq-error

This worksheet calculates the Taq-error rate. To rule out a relevant bias of the somatic mutation frequency by Taq polymerase errors, IgAT calculates the Taq-error rate within the stretch of the Ig constant region, provided that this region is included in the PCR amplicons.

Discussion

Since the discovery of the Ig genes, as well as the fundamental mechanisms describing their combinatorial somatic rearrangement, numerous studies have been published with the goal of understanding the selective forces which might govern B cell and T cell development and the diversification of their lymphocyte receptor repertoires. Whereas B and T cells share a common mode of initial diversification (VDJ recombination), it is only B cells which include additional postrecombination diversification mechanisms such as VH replacement and somatic hypermutation. Furthermore, whereas the selective forces shaping the receptor repertoire of developing T cells have been well established (Morris and Allen, 2012), the same cannot be said for the antibody receptor repertoire of B cells. For instance, mechanisms of positive selection are not clearly defined for B cell antibody repertoires; however, on the contrary, there are clear examples of negative selective mechanisms (deletion, anergy, and follicular exclusion) as well as additional mechanisms (average amino acid hydrophobicity of CDR-H3, preferential V gene utilization, VH gene replacement) which act to constrain the diversity of the antibody repertoire. Early pioneering efforts involved laborious cloning and classic Sanger DNA/cDNA sequencing which yielded sequence collections of modest size on the order of tens to a few hundreds. Novel antibody repertoire studies employ high-throughput deep sequencing technologies which can yield collections of unprecedented sizes on the order of thousands to millions of raw sequence reads (reviewed in Benichou et al., 2011). To facilitate such studies, the web-based IMGT/HighV-QUEST™ program is capable of analyzing large collections of transcripts (up to 150,000 per analysis) by comparison with the known V, DH, and J germline gene segments. 
Here we present IgAT, a novel easy-to-use Microsoft Excel based Visual Basic code for the summary, interrogation, and further processing of IMGT/HighV-QUEST output files. IgAT presents the data as organized spreadsheets, yields ready-to-publish statistics and figures, and allows the standardized comparison of multiple sequence batches. Conventional and Roche 454 deep sequencing of Ig heavy chain transcripts have been used to better understand the maturation of B cells and their selection into various maturational subsets (Wu et al., 2010), to determine the degree to which the repertoire might be genetically predetermined (Glanville et al., 2009; Ippolito et al., 2012), to characterize protective antibody responses (e.g., tetanus-toxoid neutralizing antibodies; Frolich et al., 2010; or HIV neutralizing antibodies; Wu et al., 2011), autoimmunity (Dorner and Lipsky, 2005; Vrolix et al., 2010; Zuckerman et al., 2010; Kalinina et al., 2011), and allergies (Kerzel et al., 2010), and especially to monitor minimal residual disease in B cell neoplasias (Boyd et al., 2009; Logan et al., 2011). In such studies, IgAT could help indicate to what extent the repertoire has been influenced by antigen-driven selection. The detailed analyses provided by IgAT can be used to speculate about the nature of the antigen epitope(s) that evoked a biasing of the repertoire during an antibody response. In this report we have used as an example a previously published collection of >18,000 Ig heavy chain (IgH) sequences from mice immunized with the human complement serine protease C1S (Reddy et al., 2010). Although we have focused exclusively upon an analysis of heavy chain sequences in this example, IgAT is also capable of analyzing human and murine Ig kappa and lambda light chain (IgL) repertoires.

Clonotypic diversity as a measure of restriction of the expressed repertoire versus a random repertoire

In theory, a diversity of more than 1 × 10¹⁵ antibodies can be established from either the human or the murine Ig germline loci (Schroeder, 2006). However, several antigen-independent and antigen-dependent mechanisms restrict the expressed antibody repertoires to probably less than 1% of the theoretically available diversity. Current theory holds that during B cell development in the bone marrow, restrictions are required to avoid the production of harmful or unnecessary antibodies while focusing on potentially protective antibodies. Current data obtained from the deep sequencing of human and mouse IgH repertoires suggest that primary antibody repertoires, while highly diverse, are nonetheless constrained by genetic mechanisms imposed during antigen-independent B cell development (Arnaout et al., 2011; Glanville et al., 2011; Ippolito et al., 2012). A second shift imposed upon the antibody repertoire occurs during the response to antigen. As an indirect measure of divergence from a totally random repertoire, IgAT calculates the clonotypic diversity (clonotypes per functional sequences) and also the sequence diversity (unique sequences per functional sequences). In the example given here, the clonotypic diversity of the IgH chain repertoire after immunization against C1S amounts to 36.6%. In previous studies, we found clonotypic diversities ranging from 27% in extremely immature IgG repertoires from preterm neonates or 30% in IgE transcripts from allergic children up to 81% in peripheral blood IgM repertoires (Zemlin et al., 2007; Kerzel et al., 2010). 
Although the clonotypic diversity and sequence diversity are essential descriptors of a given sequence collection, the absolute values should only be compared between sequence collections that were obtained with the same method, because (a) increasing rounds of PCR increase the risk of overamplification of a non-representative set of sequences and (b) degenerate V primers may not recognize all V genes with equivalent affinity, leading to a possible underestimation of the true repertoire diversity. Thus, a low clonal diversity might reflect a focusing of the repertoire during oligoclonal or even monoclonal B cell proliferations, but could also be caused by a low number of PCR targets or by suboptimal PCR conditions. Ademokun et al. (2011) have suggested that the reduced clonal diversity observed in peripheral blood IgM, IgG, and IgA repertoires in the elderly might reflect a weaker response to vaccines when compared to young individuals. Moreover, changes of the clonal diversity of the antibody response can be studied longitudinally to characterize the maturation of the antibody repertoire during ontogeny (Zemlin et al., 2007; Kerzel et al., 2010).

IgAT helps identify biases in V, DH, and J gene utilization that can indicate superantigen-driven selection or frequent VH gene replacement

IgAT summarizes the frequency of V and DH gene families and individual V, DH, and J genes. The VH and VL gene segments encode four of the six complementarity determining regions and can thus have great influence on the recognition of classical antigens or superantigens. One reason for contradictory results regarding V gene utilization is the observation that Southern blot probes or oligonucleotide primers may not have equal affinity to all VH gene segments, in particular when somatic mutations affect the primer binding site. To overcome this limitation, Vale et al. (2012) have suggested a novel technique for a less biased analysis of VH gene usage. A true predominance of one V gene family or V gene segment can arise from the positive selection of the repertoire for a particular classical antigen or by a superantigen (Zouali, 1995) and has also been described in Ig transcripts of B cell neoplasias (Sasso et al., 1989; Coker et al., 2005; Steininger et al., 2012). The use of individual VH genes can depend on the position of the VH gene segments in the germline as well as on epigenetic influences and the multidimensional genomic architecture of the locus (Feeney, 2011). The Ig transcripts studied here as an example were highly polarized toward utilization of the VH1-4 gene segment. Previous analyses demonstrated that this bias reflected the preferred expansion of plasma cells that produced antibodies directed against the antigen used for immunization, C1S (Reddy et al., 2010). Studying V gene expression can also indicate “gaps” in the V gene repertoire that can be a putative cause for increased susceptibility to particular infections such as Haemophilus influenzae type B (Feeney et al., 1996). Interestingly, V gene utilization of the Ig heavy and light chains can shift during V gene replacement because only an upstream V gene can replace a downstream V gene (Radic and Zouali, 1996; Zhang et al., 2003). 
To give a visual impression of a potential 5′ shift of V gene usage, IgAT displays V gene segment usage according to each segment’s unique position in the germline.
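
The kind of tabulation described above can be illustrated with a minimal Python sketch. It is not IgAT's implementation (IgAT is Excel-based); the function names, the toy gene calls, and the placeholder germline order are all assumptions made for illustration. Only the gene name VH1-4 is taken from the text.

```python
from collections import Counter

def v_gene_usage(assignments, germline_order):
    """Return (gene, count, percent) tuples ordered by germline position.

    `germline_order` is a caller-supplied list of gene names; genes that
    are absent from it are not reported (they still count toward the total).
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    return [
        (gene, counts[gene], 100.0 * counts[gene] / total if total else 0.0)
        for gene in germline_order
    ]

def family_usage(assignments):
    """Collapse names such as 'VH1-4' to their family ('VH1') and count."""
    return Counter(name.split("-")[0] for name in assignments)

# Hypothetical toy data, not taken from the study.
germline_order = ["VH3-1", "VH1-4", "VH2-2"]  # placeholder order, not a real locus map
calls = ["VH1-4", "VH1-4", "VH3-1", "VH1-4"]

for gene, n, pct in v_gene_usage(calls, germline_order):
    print(f"{gene}\t{n}\t{pct:.1f}%")
```

Plotting the percentages in germline order, rather than alphabetically, is what makes a shift toward upstream segments visible at a glance.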

Biases in amino acid frequencies and average hydrophobicity of CDR-H3 calculated by IgAT reveal restrictions with potential relevance for antigen recognition

In the example presented here, IgAT calculated a slightly hydrophilic average hydrophobicity according to a normalized Kyte–Doolittle hydrophobicity scale for the CDR-H3 region, which is representative of a typical murine primary antibody repertoire (Zemlin et al., 2003). The hydrophobicity profile of the CDR-H3 region in mice has been shown to be crucial for conservation of global features of a normal antibody repertoire, for normal B cell differentiation, and for the maintenance of normal adaptive immunity to model antigens and pathogens (Ippolito et al., 2006). For example, the position of positively charged amino acids correlates with the specificity against (negatively charged) double-stranded DNA in pathogenic autoantibodies (Krishnan et al., 1996), and triplets of hydrophobic amino acids within the CDR-H3 have been implicated in disturbed B cell repertoire formation during a porcine viral infection (Butler et al., 2008). CDR-H3 hydrophobicity is mainly regulated by DH gene reading frame utilization (reviewed in: Schroeder et al., 2010). The DH gene segments frequently encode the core of the CDR-H3, which prototypically lies at the center of the classical antigen-binding site and which therefore can make direct contact with antigen and principally determines Ig specificity (Kabat and Wu, 1991; Padlan, 1994; Xu and Davis, 2000; Collis et al., 2003). Unlike the V and J genes of IgH and IgL loci, the DH genes are unique in their potential to be used in three forward and three reverse reading frames. The DH reading frames are characterized by differing hydrophobicity signatures: the first forward reading frame predominantly encodes hydrophilic amino acids, such as tyrosine, glycine, and serine, and is the most frequent across evolution among jawed vertebrate species, while the hydrophobic second and the often non-functional third reading frame are significantly under-represented (Gu et al., 1991). 
Shifts in reading frame usage can be identified by IgAT and may indicate a selective bias regarding the hydrophobicity profile of the antigen-binding site. Moreover, the overall amino acid frequencies of CDR-H3 regions and the frequency of each amino acid per position in CDR-H3 sequences of identical length are presented in bar diagrams by IgAT to characterize a given collection of Ig transcripts and to compare collections that were generated under differing selective pressure.
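
The two calculations discussed in this section, average CDR-H3 hydrophobicity and per-position amino acid frequencies, can be sketched in Python as follows. The raw values are the published Kyte–Doolittle hydropathy indices. The normalization used here (zero mean and unit standard deviation across the 20 amino acids) is an assumption and may differ from the normalized scale used by IgAT; the function names are likewise hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy indices (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumed normalization: z-scores over the 20 amino acids.
_mu = mean(KYTE_DOOLITTLE.values())
_sd = pstdev(KYTE_DOOLITTLE.values())
NORMALIZED_KD = {aa: (v - _mu) / _sd for aa, v in KYTE_DOOLITTLE.items()}

def average_hydrophobicity(cdr3, scale=NORMALIZED_KD):
    """Mean scale value over the residues of one CDR-H3 amino acid sequence."""
    residues = [aa for aa in cdr3.upper() if aa in scale]
    if not residues:
        raise ValueError("no scorable residues in sequence")
    return mean(scale[aa] for aa in residues)

def positional_frequencies(cdr3s):
    """Amino acid counts per position for CDR-H3 sequences of identical length."""
    if len({len(s) for s in cdr3s}) != 1:
        raise ValueError("sequences must be non-empty and share one length")
    return [Counter(column) for column in zip(*cdr3s)]
```

Under this sketch, a loop rich in tyrosine, glycine, and serine scores lower (more hydrophilic) than one rich in valine, isoleucine, and leucine, which mirrors the reading-frame signatures described above.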

IgAT analyzes the length of CDR-H3 and its components and calculates predictions for structural properties of CDR-H3

The CDR-H3 loop can assume an almost unlimited diversity of three-dimensional shapes, which are grouped into canonical structures (Morea et al., 1998). In general, a CDR-H3 region of more than 14 amino acids protrudes into the solvent, while shorter CDR-H3 regions form an antigen-binding groove together with the other CDRs (Ramsland et al., 2001). The three-dimensional structure of the antigen-binding site is of great significance for antigen recognition. For example, antibodies directed against virus antigens contain longer CDR-H3 regions on average than antibodies directed against haptens (Collis et al., 2003). Crystallization of antibodies has allowed the identification of rules for predicting several important structural properties of the H3 loop and of the H3-hairpin based on the deduced amino acid sequence (Shirai et al., 1996, 1999; Kuroda et al., 2008). IgAT applies Shirai’s “H3-rules” to predict a kinked, extra kinked or extended shape for the H3 base. The category of the H3 base is mainly determined by the 3′ nt of CDR-H3, which are often encoded by the JH gene. Moreover, for a subset of sequences, the Shirai rules allow the prediction of whether the H3 loop can establish an intact hydrogen bond ladder or a deformed hairpin. In previous studies, we found that intact hydrogen bond ladders were significantly more frequent in IgG heavy chains from preterm neonates than from adults (Zemlin et al., 2007). In both mouse and man, one reason for reduced CDR-H3 length during fetal development is a reduction in the average number of non-templated N-nucleotide additions (Schroeder et al., 1987). Thus, the structural diversity of the H3 loop is heavily restricted during early ontogeny, potentially contributing to the low affinity and poly-reactivity that characterize the cord blood antibody repertoire. 
Besides elucidating the ontogeny of antibody repertoires, the deconstruction of CDR-H3 components provided by IgAT can also give insights into the selective mechanisms during antigen responses. For example, Dorner et al. (1998a) have found that CDR-H3s are generally shorter in non-functional than in functional Ig transcripts, and Rosner et al. (2001) observed that mutated Ig transcripts contain shorter CDR-H3s than non-mutated Ig transcripts. Moreover, IgH receptor editing by the mechanism of VH replacement results in increased CDR-H3 length due to retention of a portion of the 3′ end of the original VH segment (Zhang et al., 2003). IgAT identifies these “VH footprints”, which tend to accumulate within the VH-DH junction during VH replacement and which typically encode highly charged amino acids (R, E, and D) at the 5′ end of CDR-H3 (Zhang et al., 2003). VH replacement seems to occur more frequently in autoimmunity (Dorner et al., 1998b).
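
Comparisons of CDR-H3 length between sequence sets, such as those cited in this section, reduce to a simple summary per set. The following Python sketch is illustrative only: the function name is hypothetical, the default threshold of 14 amino acids is taken from the text above, and how CDR-H3 boundaries are delimited (which affects every length) is assumed to have been settled upstream.

```python
from collections import Counter
from statistics import mean

def cdr3_length_summary(cdr3s, long_threshold=14):
    """Summarize CDR-H3 amino acid lengths for one sequence set.

    Returns the mean length, a length histogram, and the fraction of
    sequences longer than `long_threshold` residues.
    """
    lengths = [len(s) for s in cdr3s]
    if not lengths:
        raise ValueError("empty sequence set")
    return {
        "mean_length": mean(lengths),
        "histogram": dict(sorted(Counter(lengths).items())),
        "fraction_long": sum(n > long_threshold for n in lengths) / len(lengths),
    }
```

Running the same summary on two sets (for example functional versus non-functional transcripts) gives directly comparable numbers.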

The nature and distribution of somatic mutations indicate antigen-driven selection

An enrichment of replacement mutations within the CDRs compared to the FRs is indicative of antigen selection (Berek et al., 1985; Chang and Casali, 1994; Rajewsky, 1996; Lossos et al., 2000). IgAT uses the algorithms created by Chang and Casali (1994), and by Lossos et al. (2000), to identify sequences reflective of antigen-driven selection. In the example given here, 6.5% of the sequences were antigen-selected. This relatively low percentage is plausible because, in this experiment, the bone marrow plasma cells were harvested 1 week after immunization, thus before the cells could be expected to have undergone extensive class switch recombination and affinity-driven maturation. In previous studies we found that the percentage of antigen-selected transcripts in humans ranged from 9% (IgM) to 29% (IgE) in peripheral blood (Kerzel et al., 2010) and in mice from 0.6% (IgM) to 15% (IgE) in splenic B cells (Rogosch et al., 2010). With this analysis, IgAT quantitatively visualizes the extent to which antigen-mediated selection has impinged upon the B cell repertoire during the course of an immune response.
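
The underlying question, whether a sequence carries more replacement mutations in its CDRs than chance would predict, can be illustrated with a generic one-sided binomial test. This Python sketch is written in the spirit of the cited approaches but does not reproduce the published algorithms or IgAT's calculation in detail. The function names, the significance level, and the expected per-mutation probability `q_cdr` are assumptions that the user would have to supply.

```python
from math import comb

def binomial_tail(k, n, q):
    """P(X >= k) for X ~ Binomial(n, q)."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability")
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(k, n + 1))

def looks_antigen_selected(r_cdr, n_mut, q_cdr, alpha=0.05):
    """True if the replacement mutations in the CDRs exceed chance expectation.

    r_cdr : observed replacement mutations within the CDRs
    n_mut : total mutations observed in the V region
    q_cdr : assumed probability that a random mutation is a replacement
            mutation located in a CDR (hypothetical input, not from the text)
    """
    return binomial_tail(r_cdr, n_mut, q_cdr) < alpha
```

The percentage of antigen-selected transcripts in a collection is then the share of sequences for which such a test is positive.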

In conjunction with IMGT/HighV-QUEST, IgAT significantly accelerates the characterization of large collections of Ig transcripts

Fifteen years ago, a researcher needed ∼1 h to assign VH-, DH-, and JH-gene segments, N- and P-nucleotides, and somatic mutations to one single Ig heavy chain gene transcript (personal observation). Today, using the freely available IMGT/HighV-QUEST software and the immunoglobulin gene analysis tool, IgAT, which we present here, it is possible to perform much more detailed analyses on >10⁵ sequences within hours and >10⁶ sequences within one day. This requires only a few minutes of work from the researcher, while the remaining time is spent on automated data transfer and analyses. The sequence set used in this report consists of ∼18,000 functional sequences. Results from IMGT/HighV-QUEST were received after ∼2 h. The calculation time of IgAT depends on the hardware and software configuration of the computer. For example, the analysis takes only 20 min on an Intel® Pentium® 4 (3 GHz) and 4 GB memory machine running Windows XP (32-bit) and Excel 2010 (32-bit) and 15 min on an AMD® Athlon® 4850e (2.5 GHz) and 4 GB memory machine running Windows 7 (64-bit) and Excel 2010 (32-bit). In conclusion, IgAT can be used to summarize and further analyze large sequence collections that have been pre-analyzed with IMGT/HighV-QUEST. IgAT delivers publication-ready figures and descriptive statistics that can be used to compare multiple sequence collections. Thus, IgAT can be used to characterize selective forces that act upon Ig repertoires during B cell maturation, protective immune responses, and dysregulated immune responses, such as autoimmunity, allergies, and B cell neoplasias.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

79 in total

1. Kabat database and its applications: 30 years after the first variability plot.

Authors: G Johnson; T T Wu
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. The diversity of rearranged immunoglobulin heavy chain variable region genes in peripheral blood B cells of preterm infants is restricted by short third complementarity-determining regions but not by limited gene segment usage.

Authors: M Zemlin; K Bauer; M Hummel; S Pfeiffer; S Devers; C Zemlin; H Stein; H T Versmold
Journal: Blood Date: 2001-03-01 Impact factor: 22.113

3. Secondary immunization generates clonally related antigen-specific plasma cells and memory B cells.

Authors: Daniela Frölich; Claudia Giesecke; Henrik E Mei; Karin Reiter; Capucine Daridon; Peter E Lipsky; Thomas Dörner
Journal: J Immunol Date: 2010-08-06 Impact factor: 5.422

4. The lambda gene immunoglobulin repertoire of human neonatal B cells.

Authors: P Richl; U Stern; P E Lipsky; H J Girschick
Journal: Mol Immunol Date: 2007-08-06 Impact factor: 4.407

5. Identical V region amino acid sequences and segments of sequences in antibodies of different specificities. Relative contributions of VH and VL genes, minigenes, and complementarity-determining regions to binding of antibody-combining sites.

Authors: E A Kabat; T T Wu
Journal: J Immunol Date: 1991-09-01 Impact factor: 5.422

Review 6. B cell signaling and fate decision.

Authors: Tomohiro Kurosaki; Hisaaki Shinohara; Yoshihiro Baba
Journal: Annu Rev Immunol Date: 2010 Impact factor: 28.527

Review 7. B-cell superantigens: implications for selection of the human antibody repertoire.

Authors: M Zouali
Journal: Immunol Today Date: 1995-08

8. Molecular events during maturation of the immune response to oxazolone.

Authors: C Berek; G M Griffiths; C Milstein
Journal: Nature Date: 1985 Aug 1-7 Impact factor: 49.962

9. IgE transcripts in the circulation of allergic children reflect a classical antigen-driven B cell response and not a superantigen-like activation.

Authors: Sebastian Kerzel; Tobias Rogosch; Benjamin Struecker; Rolf F Maier; Michael Zemlin
Journal: J Immunol Date: 2010-07-21 Impact factor: 5.422

10. Human IgM molecules that bind staphylococcal protein A contain VHIII H chains.

Authors: E H Sasso; G J Silverman; M Mannik
Journal: J Immunol Date: 1989-04-15 Impact factor: 5.422
