Literature DB >> 25875140

Reconstructing and mining the B cell repertoire with ImmunediveRsity.

Bernardo Cortina-Ceballos¹, Elizabeth Ernestina Godoy-Lozano, Hugo Sámano-Sánchez, Andrés Aguilar-Salgado, Martín Del Castillo Velasco-Herrera, Carlos Vargas-Chávez, Daniel Velázquez-Ramírez, Guillermo Romero, José Moreno, Juan Téllez-Sosa, Jesús Martínez-Barnetche.

Abstract

The B cell antigen receptor repertoire is highly diverse and constantly modified by clonal selection. High-throughput DNA sequencing (HTS) of the lymphocyte repertoire (Rep-Seq) represents a promising technology to explore such diversity ex-vivo and assist in the identification of antigen-specific antibodies based on molecular signatures of clonal selection. Therefore, integrative tools for repertoire reconstruction and analysis from antibody sequences are needed. We developed ImmunediveRity, a stand-alone pipeline primarily based in R programming for the integral analysis of B cell repertoire data generated by HTS. The pipeline integrates GNU software and in house scripts to perform quality filtering, sequencing noise correction and repertoire reconstruction based on V, D and J segment assignment, clonal origin and unique heavy chain identification. Post-analysis scripts generate a wealth of repertoire metrics that in conjunction with a rich graphical output facilitates sample comparison and repertoire mining. Its performance was tested with raw and curated human and mouse 454-Roche sequencing benchmarks providing good approximations of repertoire structure. Furthermore, ImmunediveRsity was used to mine the B cell repertoire of immunized mice with a model antigen, allowing the identification of previously validated antigen-specific antibodies, and revealing different and unexpected clonal diversity patterns in the post-immunization IgM and IgG compartments. Although ImmunediveRsity is similar to other recently developed tools, it offers significant advantages that facilitate repertoire analysis and repertoire mining. ImmunediveRsity is open source and free for academic purposes and it runs on 64 bit GNU/Linux and MacOS. Available at: https://bitbucket.org/ImmunediveRsity/immunediversity/.

Entities: Chemical Disease Gene Species

Keywords: CDR3; CDRH3, heavy chain complementarity determining region 3; HEL, hen egg lysozyme; Ig repertoire; Rep-Seq, repertoire sequencing; SHM, somatic hypermutation.; data mining; high-throughput sequencing

Mesh：

Substances：

Year: 2015 PMID： 25875140 PMCID： PMC4622655 DOI： 10.1080/19420862.2015.1026502

Source DB: PubMed Journal: MAbs ISSN： 1942-0862 Impact factor: 5.857

hen egg lysozyme heavy chain complementarity determining region 3 repertoire sequencing somatic hypermutation

Introduction

Adaptive immunity relies on a highly diverse lymphocyte antigen receptor repertoire, which is dynamically shaped by endogenous and exogenous antigens by means of clonal selection. T and B lymphocyte receptor repertoires are generated by somatic recombination of germline V, D and J segments in an antigen-independent manner. Each lymphocyte bears a unique antigen receptor that can be clonally selected. In the case of B lymphocytes, antigen receptors can be further diversified in an antigen-dependent manner by somatic hypermutation of clonally expanded lymphocytes. Thus, the basic unit of the B cell repertoire is the idiotype, which refers to a unique antigen receptor structure (H and L chain pair) with unique epitope specificity. One or more idiotypes derived from the same VDJ recombination event and H+L pairing represents a clonotype. The degree of idiotypic diversification within a clonotype is indicative of antigen-mediated selection. High-throughput DNA sequencing (HTS) of lymphocyte antigen receptor repertoire (Rep-Seq) provides a technological solution to measure repertoire diversity, both to track the clonal responses to antigenic challenge and to infer antigen receptor specificity (repertoire mining), that in conjunction with other high-throughput approaches could affect our view of immunity to infection, vaccine efficacy, lympho-hematological malignancies, among others. Consequently, there is an increasing demand for automated data processing tools from raw data to a full virtual repertoire reconstruction and its comparative analysis within different experimental conditions. There are a number of excellent tools such as iHMMune-align, JoinSolver and IgBLAST for primary structural analysis of Ig sequences, as well as tools that provide detailed structural features of lymphocyte receptor sequences in massive datasets such as High-V_Quest and VDJfasta. However none of them integrate raw data processing, reconstruction of the ontogenic relations in terms of clonal origin and diversification by somatic hypermutation to a full qualitative and quantitative representation of the sampled repertoire. Accurate characterization of the B cell repertoire faces the challenge of correctly identifying sequences belonging to a common clonal origin, as well as the identification of the structural relationships of idiotypes within each clonal linage, accounting for sequencing errors. Clonal structure metrics, together with mutation analysis indicative of selection, represent the basis for a quantitative statistical description of repertoire diversity and could provide the means for comparative analysis between different immunological conditions, as well as for the identification of putative antigen-specific antibodies by repertoire mining.

Results

ImmunediveRsity is a flexible stand-alone pipeline based primarily in R programming language, originally designed for 454-Roche unpaired IgH sequence data derived from libraries prepared from total RNA by 5′ RACE-PCR; thus, H and L chain paring information is lost during the process. Different scripts and functions (R, Python and Perl) bundled with GNU software provide a full automated analysis per sequencing library. It uses Acacia for noise correction; IgBLAST for V and J segment assignment; HMMER3 (http://hmmer.janelia.org/) for complementarity-determining region 3 (CDRH3) Junction identification and USEARCH for clustering sequences into Heavy Chain Clonotypes, hereafter referred to as clonotypes, and their corresponding clonally related Heavy Chain lineages, referred hereafter as lineages (Fig. S1). The overall algorithm used by ImmunediveRsity is depicted in and described in detail in Figure S2. For Illumina MiSeq users, a pre-processing algorithm for paired end reads assembly using PANDAseq (Fig. S3) is provided ().

Figure 1.

The overall algorithm used by ImmunediveRsity. Input file is a *.fastq. Pre-processing consists on VDJ assignment and non-VDJ sequence trimming (5′ UTR, signal sequence and IGHC), homo-polymer correction (particularly required for 454-Roche or Ion Torrent reads), quality, size and IGH germline transcript (GLT) filters. Processing: ImmunediveRsity first identifies the CDRH3 with HMMER3 and the V and J segment rearrangement with IgBLAST. The CDRH3 reads belonging to the same V and J assignment are clustered iteratively according to sequence identity and length to define clonotypes. A second clustering step according to sequence identity is performed within the full V region of reads belonging to each clonotype to identify the lineages with different somatic hypermutation patterns. The output consists (1) Fasta files: containing the CDRH3 sequences for each read and clonotype, as well as the sequence for each consensus lineage with a unique identifier. (2) Text files, describing V, D and J assignments for each read and the relation of each read to a given clonotype and lineage. (3) Metrics files: for each clonotype is given frequency, the number of synonymous (Ks) and non-synonymous mutations (Ka) mutations, diversity indices, CDRH3 physico-chemical characteristics (P.Q.) and (4) Repertoire visualization: A series of predefined vectorized graphics: (1) Rarefaction curves, (2) Aminoacid composition per specific length of CDRH3, (3) Heat-map of VJ rearrengment frequencies, (4) CDRH3 spectratyping, (5) 3D cloud VDJ rearrangement frequency and (6) Network representation of the overall structure of the antibody repertoire. clonotypes (CG), lineages (Id).

Output

ImmunediveRsity generates 4 types of output: (1) Fasta files: containing the CDRH3 sequences for each read and clonotype, as well as the sequence for each lineage consensus. (2) Text files, describing V, D and J assignments for each read and the relation of each read to a given clonotype and lineage. (3) Metrics files: metrics of repertoire structure based on frequency, degree of diversification and somatic hypermutation that can aid the exploration of the effects of antigen-driven selection or repertoire alterations in a given disease. Such metrics include normalized clonal and lineages frequencies, global entropy measurements such as Shannon-Weaver index and Gini coefficient ( and Fig. S2). Such metrics can be calculated according to IGHV usage, potentially revealing hidden trends in antigen-driven clonal diversification that otherwise would not be detected only by a relative frequency analysis. Also, entropy is calculated to reveal the degree of lineage diversification within each clonotype, irrespectively of their IGHV segment usage (Fig. S2). Finally, the number of synonymous (Ks) and non-synonymous mutations (Ka) per lineage is calculated to indicate potential lineages under antigen-driven selection. (4) Repertoire visualization (Figs. S4–12): A series of predefined vectorized graphics providing frequency of V, D and J segment usage (Figs. S4–6), CDRH3 digital-spectratyping (Fig. S7), amino-acid composition at given CDRH3 length (Fig. S8), a heat-map of hierarchical clustering of V family usage (Fig. S5), rarefaction curves describing clonotype and lineage richness at a standardized sampling effort (Figs. S9 and S10) and read quality before and after filtering (Fig. S11). In an attempt to capture the B cell repertoire complexity, an integrative graph representing a network of clonotype with their respective lineages is generated in the context of a previously described HEL-immunization experiment in mice using iGraph (). These graphs can be customized to plot parameters other than hypermutation, such as diversity indices (see Methods and Fig. S12) or CDRH3 physicochemical properties. Finally, ImmunediveRsity provides a collection of scripts (Post-processing multi-library analysis toolbox) aimed to aid with comparisons within multiple library experiments (, Fig. S2). A tool for sampling equal number of reads or clonotypes is particularly useful for such task. A tool for searching convergent CDRH3 in different individuals is also provided ().

Figure 2.

iGraph network representation of the sampled antibody repertoire in mouse spleen 15 days after immunization with HEL. (A) Left: IgM compartment. Right: IgG compartment. Each clonotype is represented by the agglomeration of lineages (nodes; represented by circles), the diameter of the circle represents the relative frequency and the color code according to the number of non-synonymous mutations. The CDRH3 sequence (*3G1 ARGEGNYGY) of recombinant HEL-specific antibody as described is shown. Fading of certain clonotypes allows visualization of other clonotypes in the background. (B) Quantitative analysis of SHM in the IgM vs. IgG compartment in the same dataset as in A. (C) Statistical analysis of SHM in the IgM vs. IgG compartment. Median for each compartment is shown (dotted line). U-Mann-Whitney test. Frame shifted sequences were removed before the analysis. NSM, non-synonymous mutations.

Performance of ImmunediveRsity

To test ImmunediveRsity, we used 3 benchmark data sets: (1) A mouse benchmark composed by 5,359 reads generated by sequencing a PCR amplicon library generated from a cloned 5′ RACE-PCR product derived from the spleen of a MD4 transgenic mouse (see supplementary material), which bears a monoclonal (IGHV6-3*01-IGHD4-1*01-IGHJ2*01) B cell compartment. This benchmark was also used to assess ImmunediveRsity's clonotype and lineage assignment performance, such that the identification of a single clonotype and a single lineage was expected. The MD4 amplicon contains one G homopentamer and one A homotetramer within the CDRH3 region, and 3 additional homotetramers in FWR2, 3 and 4, respectively, providing a means to assess homopolymeric sequencing errors and Acacia correction performance. (2) A human benchmark composed by 1,044 sequences of a single clonotype (IGHV1-3*01-IGHD3-10*01-IGHJ3*02) manually identified from a human library (see supplementary material). To construct a reference dataset, sequences were aligned, indels were manually corrected and 10 lineages were identified according to their mutation patterns and based on the error pattern observed in the MD4 sequence data. As for the mouse benchmark, the manually curated human data set was used to evaluate ImmunediveRsity's lineage assignment accuracy using the corresponding human raw sequences as input. (3) The previously described Stanford 22 dataset was used to assess ImmunediveRsity's clonotype assignment capacity. It consists of 13,141 human IgH sequences referred to as being derived from independent V(D)J recombination events (non- identical V, D, and J segments and non-identical V-D and D-J junction sequences). Although, absolute certainty regarding the independence of a V(D)J recombination event cannot be ascertained, the Stanford 22 data set was used as a proxy of non-clonally related sequences. Using the MD4 mouse dataset as a proxy for sequence error calibrator, a major clonotype containing 99.5% of the reads was obtained. A second clonotype with 10 reads (0.18%) was found. Closer examination of the sequences in this clonotype determined that a G insertion in the G homopentamer located in the CDRH3 anchor created this artifact. The remaining 14 reads (0.26%) generated 8 clonotypes with low frequency and singletons (). At the lineage level, ImmunediveRsity identified 21 lineages, with a single lineage containing 99.3% reads, and the remaining were composed of ≤5 reads. Thus, based on the errors identified in the MD4 mouse data set, a minimum of 6 reads was defined as threshold to consider a true lineage (well supported). This corresponds roughly to 1 read per 1000 reads of coverage. The MD4 or other amplicons should be used as guides for calibration in terms of homopolymer content, but it does not represent all the possible sources of sequencing errors. Therefore, users are encouraged to optimize their own amplicons and means to calibrate sequencing runs.

Table 1.

Overview of the reference sequencing sets

Set	Sequenced reads	After filters¹	Observed clonotypes	Expected clonotypes	Well supported clonotypes²	Observed lineages (without singletons)	Expected lineages	Well supported lineages³
MD4	5,359	99.6%	10	1	1	21(7)	1	1
IGHV1-3	1,044	95.2%	1	1	1	469 (52)	10	7
Stanford22⁴	13,141	100%	11,779	13,141	NA	12,421	13,141	NA

1Percent of reads that pass the pre-processing filters.

2Number of clonotypes whose corresponding lineages are composed ≥ 6 reads.

3Lineages composed of ≥ 6 reads.

4The publicly available Stanford22 set was published as a set of non-clonally related immunoglobulin sequences; we removed one read with a duplicated identifier and 11 with duplicated sequences.

NA, not applicable.

Overview of the reference sequencing sets 1Percent of reads that pass the pre-processing filters. 2Number of clonotypes whose corresponding lineages are composed ≥ 6 reads. 3Lineages composed of ≥ 6 reads. 4The publicly available Stanford22 set was published as a set of non-clonally related immunoglobulin sequences; we removed one read with a duplicated identifier and 11 with duplicated sequences. NA, not applicable. Using this threshold, results for each dataset are shown in . For the human data set of 1,044 human immunoglobulin raw sequences from a single clonal origin with 10 lineages, ImmunediveRsity identified correctly a single clonotype. However, it identified 469 lineages as well, most of them singletons. Using the ≥6 read threshold, the number of lineages identified was 7. Therefore, if the experimental aim is to describe the clonal structure of the repertoire, we recommend not to use coverage filters. However, if the goal is to identify and analyze structural properties of an antibody subset, one should choose a minimal coverage to call a true lineage. For the Stanford22 dataset, after removal of one read with a duplicated identifier and 11 with duplicated sequences, ImmunediveRsity identified 11,779 different clonotypes and 12,421 lineages (), indicating that either the Stanford 22 data set contains clonally related sequences or that the CDRH3 clustering identity threshold is too low, allowing to merge non-clonally related sequences into clonotypes. As IgBLAST is time consuming and must be executed twice, ImmunediveRsity allows parallelization according to the number of cores, which can also be customized by the user. Processing of a raw dataset of 48,350 reads described in (HEL-immunized mice) takes 136 minutes or 92 minutes in 1/8 or 7/8 Intel core i7 CPU's of a 3.8 GHz and 8 Gb RAM PC computer; and 173 min in 4/4 Intel core i5 CPU's of a 2.3 GHz and 4 Gb RAM MacBook computer.

Figure 3.

Comparison of clonotype intra-clonal inequality (Gini coefficient) between control (PBS) and 2 HEL-immunized mice 15 days post-immunization. clonotypes derived from control mice are shown as green dots. Clonotypes from HEL-immunized mice are shown in red (m8) and purple (m9) dots. x axis; Gini coefficient per clonotype, y axis; clonotype relative frequency. The CDRH3 sequences of anti-HEL recombinant antibodies as described are shown. Fading dots allow the visualization of overlapping dots. To test ImmunediveRsity performance with data derived from MiSeq Illumina sequencing, we used the human data set described by Schanz et al. The whole dataset contained 2.88 × 106 paired reads, and was processed with the script, which uses PANDAseq for paired read assembly. Random subsampling of 1 × 104 and 1 × 105 assembled pairs were used as input for ImmunediveRsity taking 18 and 280 minutes, respectively, in a 7/8 Intel core i7 CPU's of a 3.8 GHz 8 Gb RAM PC computer.

Critical parameters for optimization according to user needs and data type

As described in the previous sections, it is highly recommended that users include as reference a monoclonal amplicon to optimize ImmunediveRsity. Critical parameters that may require tweaking by the user are CDRH3 identity (specified by the parameter. Default = 0.97), low frequency reads cutoff for clonotypes (specified by the parameter. Default = 6) and lineages (specified by the parameter. Default = 6). These filters are applied by default and the output is directed to different directories (WellSupportedCGs and WellSupportedIs, respectively), however unfiltered results are still available in the library output directory.

Differences in the B cell repertoire after an immune challenge

Immunization with protein antigens induces the germinal center reaction characterized for an extensive antigen-specific B cell proliferative response, somatic hypermutation and class switching. In mice, the germinal center reaction induced by immunization peaks after 10–12 days. As a proof of principle of ImmunediveRsity performance to digitally reconstruct the antibody repertoire with experimental data, we used our previously described IGHV sequence data derived from libraries (IgM and IgG class) from 2 BALB/c mice spleens at 3, 7 and 15 days post-immunization with hen egg lysozyme (HEL) and one mouse inoculated with PBS in each time point as controls. The sequencing metrics of the 18 sequenced libraries are shown on Table S1. As expected, the proportion of somatic hypermutation was higher in the IgG than in the IgM compartment (). Fifteen days after immunization, 2 HEL-specific IGHV B cell clones, functionally validated in a previous work, were identified as clones with high relative frequency, but lower Gini coefficients compared to the corresponding high relative frequency clones from control mice, (), suggesting that within the higher frequency range (y axis), antigen-specific clones are the ones having lower Gini coefficient values. Lower Gini coefficients make biological sense based on the assumption that affinity maturation would allow the selection of different lineages leading to a more even distribution than in the non-selected clonotypes. Further research is needed to clarify if Gini coefficient per clonotype is a useful measure to identify clonotypes undergoing antigen-mediated selection. The germinal center reaction is initiated by a small number (1-8) of antigen-selected IgM expressing-B cells. Due to the extensive oligoclonal proliferation of founder cells and their concomitant differentiation to IgG switched high affinity antibody secreting cells, a reduction in clonal diversity would be expected after immunization. We used ImmunediveRsity to sample a fixed number of reads per library (5,700) and to track clonal and idiotypic diversity induced by HEL immunization 3, 7 and 15 days post-immunization in the IgM and IgG compartment. Sampling was necessary to compare equal numbers of reads. The number of reads was based on the library with the lowest number of reads (Table S1). No changes in clonal and idiotypic entropy (Shannon-Weaver index), or in clonal and idiotypic inequality (Gini coefficient) were detected at day 3 and 15 post-immunization (). However, a reduction in clonal and idiotypic diversity was observed in the IgM compartment at day 7 post-immunization. In contrast, clonal and idiotypic diversity was markedly increased in the IgG compartment (). Consistently but in the opposite direction, clonal and idiotypic inequality in the IgM compartment increased at day 7 post-immunization, whereas, in the IgG compartment, they decreased (). The entropy measures suggest an increase in the circulation of non-clonally related IgG+ B cells at day 7.

Figure 4.

Clonal diversity and somatic diversification after immunization. A change in clonal (closed symbols) and lineage (open symbols) diversity measured by the average Shannon-Weaver index of HEL-immunized (n = 2) minus PBS-injected mouse (n = 1) at day 3, 7 and 15 post-immunization for the IgM (upper panel) and IgG (lower panel) compartments. (B) The corresponding change in clonal (closed symbols) and lineage (open symbols) inequality measured by the average Gini coefficient. For A and B, 5,700 reads per library were randomly sampled using the post-processing multi-library analysis toolbox. Sequencing metrics of the libraries used to estimate diversity measurements are described in .

Discussion

The lymphocyte antigen receptor repertoire is a fascinating biological system that represents the basis for acquired immunity. Hence, its analysis is critical to understand responses to vaccination and autoimmune diseases. Antibody repertoire diversity is generated by antigen-independent somatic rearrangements of germline V(D)J segments and antigen-dependent somatic hypermutation. Potentially, the repertoire diversity can reach astronomical proportions. However, it is clear that structural and functional constraints, as well as clonal selection operating at different developmental stages influence its ultimate size and shape. Owing to the capacity to screen a larger sample of the lymphocyte repertoire, HTS is offering the possibility to approach its complexity, and how it is affected during normal and pathological immune response. Additionally, the information generated by HTS on antibody repertoires can be exploited to address higher order statistical properties of biological structures in general. Aiming toward a faithful digital reconstruction of clonal diversification as a result of somatic hypermutation, repertoire sequencing imposes particular challenges regarding data analysis. Four such challenges are: (1) All HTS platforms possess certain degree of quality to throughput trade off that can potentially overestimate true diversity. To minimize the impact of sequencing errors, ImmunediveRsity attempts to correct or discard noisy reads. However, this process is not exhaustive, as exemplified by the MD4 experiment (). It is thus important that users tweak clustering parameters to obtain a reliable repertoire reconstruction. (2) Clonal relations for large data sets are difficult to assign due to the nature of V(D)J recombination process, which can be confounded by CDRH3 somatic hypermutation. (3) Incomplete knowledge of germline structure and population diversity at immunoglobulin loci, including allelic copy number variation. (4) The influence of sample size sequencing depth and BCR transcript expression variation in entropy measurements, particularly in the case of cDNA library sequencing. The development of ImmunediveRsity was motivated by the need for a stand-alone tool for digital reconstruction of the B cell repertoire and measurement of clonal diversity parameters describing functional aspects of clonal selection and to facilitate repertoire mining. There are open source tools that allow B cell repertoire analysis from HTS datasets, including clonotype reconstruction and somatic hypermutation analysis such as IgAT, AbMining ToolBox and Michaeli's algorithm (Table S2). IgAT is a basic and friendly tool that does not provide clonal assignments and runs only in Windows. AbMining ToolBox is a fast, powerful tool optimized for selection of phage displayed antibodies avoiding panning steps and increasing the yield of relevant recombinant antibodies. In many ways, the most similar tool to ImmunediveRsity is the Michaeli's algorithm, which also integrates a sequencing error correction step and allows lineage identification. However, Michaeli's algorithm is designed to run in 32-bit computers, limiting its use. Additional contributions of ImmunediveRsity are the calculation of different entropy measurements such as the Gini and Shannon-Weaver indexes per clonotype and lineage and their corresponding partitions according to IGHV gene usage. Its utilization in the analysis of the B cell repertoire in response to immunization suggests a potential application of the Gini coefficient in repertoire mining and in silico identification of antigen-specific clones (), but further experimentation is required. Moreover, entropy measurements provided the observation that clonal and somatic diversity in the IgG compartment increases at day 7 post-immunization instead of decreasing (), supporting the notion that germinal centers are open structures that can be colonized by unrelated B cells. The inverse relationship between Shannon entropy and Gini coefficient indicate that both measures are redundant. However, we have explored such correlation at the clonotype level in human peripheral blood of patients infected with dengue virus and found very weak correlation (unpublished data). The significance of the correlation between both measures warrants further research. An additional feature that contributes to an integrative analysis of the B cell repertoire is ImmunediveRsity's rich output in terms of a diverse and informative graphical output and a detailed description per clonotype sequence and structure. In conclusion, ImmunediveRsity is a highly modular and customizable tool for B cell repertoire analysis that can accelerate data integration and discovery of biologically relevant processes, as well as the identification of antibodies with potential biotechnological applications.

Methods

Input

ImmunediveRsity accepts *.fastq files in reverse complement; thus, in principle it is suitable for most HTS platforms. Currently, it can only process VH libraries from human and mouse. It should be noted that somatic hypermutation can occur along the whole variable region, which is 400 bp on average. Platforms that generate longer reads lengths can offer better representation of somatic diversity ().

Pre-processing

As an initial step, IgBLAST is performed for each individual read to map the V(D)J region and to trim the 5′ and 3′ flanks. This step is particularly important for the analysis of libraries generated by 5′ RACE-PCR to exclude germ-line transcripts (sterile transcripts) that may be present in libraries derived from total RNA, i.e., when using 5′ RACE-PCR (unpublished observations), and to trim the 5′ UTR and signal peptide sequences and possible IGHC nucleotides, which are irrelevant for antigen binding and thus for clonal selection. Moreover, the distal flank to the sequencing primer usually will have the lowest quality. As 454-Roche and Ion Torrent sequencing platforms are prone to indels in homopolymeric regions, ImmunediveRsity can call Acacia, an error-correction tool. Sequences below a median read quality of Q28 or below 200 bp long are discarded. These parameters can be easily customized ().

CDRH3 Junction identification

The CDRH3 Junction region is delimited by the 3′ end of the V gene and the 5′ end of the J gene, including the D gene and the N and P-nucleotides. The CDRH3 region is the most variable and defines the clonal origin. The CDRH3 region has conserved regions (anchors) in both flanks, hence ImmunediveRsity uses HMMER3 with a HMMER DNA profile trained to match either human or mouse CDRH3-coding sequence for each read. HMMER3 retrieves the CDRH3 by means of 2 anchors, which are trimmed on the basis of the conserved motifs Tyr-Phe/Tyr/His-Cys or Phe-Phe-Cys in the 3′ end of the IGHV segment (IMGT positions 102-104) and the motif Trp-Gly in the IGHJ segment. Additionally, it considers the potential presence of indels due to homopolymeric errors in the sequence coding for Tyr-Phe-Cys. The presence of D segments in tandem, which was previously described, are also considered since our second step returns the largest CDRH3 with the different possible motifs. Reads without an identifiable CDRH3 are discarded ().

V(D)J assignment

A critical step toward clonal origin identification is the determination of V, D and J segment usage. Different tools are publicly available to achieve this task. The main challenge relies on correctly identifying the D segment, because of its usually short length (from 11 to 37 nt) and mutations content by exonuclease activity during the recombination process. To assign the V, D and J segments, ImmunediveRsity uses IgBLAST by aligning each read to the current set of functional germ line sequences from ImMunoGeneTics database (50 V genes and 239 alleles, 23 D genes and 30 alleles, and 6 J genes and 13 alleles for human) (). However, this database can be upgraded as new genes and alleles are described.

Heavy Chain Clonotype (clonotype) identification

Although biologically a B cell clonotype corresponds to a group of B cells sharing a unique antigen receptor originated by a unique VDJ and VJ recombinatorial event leading to the respective H and L chain pair, ImmunediveRsity interprets the clonotypes as single chained (in this case IGHV) objects. The clonotypes are defined by CDRH3 clustering with USEARCH. USEARCH is fast, exhaustive and offers simplicity in setting the desired parameters. As the human D gene sequences can be as small as 11 bp, as in the case of IGHD7-27, accurate D gene assignation in mature B-lymphocytes is difficult to achieve. For that reason, we define that 2 reads belong to the same clonotype if the following is true: (1) The V and J gene assignment is the same in both sequences; (2) Junction regions of the 2 sequences have a nucleotide identity ≥ 97%. This parameter () can be adjusted according to the user needs and calibration results; and (3) The trimmed length of the shorter Junction sequence ≥ 97% of the length of the larger Junction sequence (). CDRH3 length identity and length parameters used for clustering were determined by estimating the accuracy obtained by sequencing an IgH amplicon derived from an anti-HEL transgenic mouse (MD4) that is virtually monoclonal and a manually-curated human immunoglobulin set (IGHV1-3) (see Results. Performance of ImmunediveRsity. ), but can be easily customized.

Unique heavy chain (lineage) identification

As for clonotypes, ImmunediveRsity obviates L chain pairing and interprets a lineage as an object derived from a unique VDJ recombinatorial event and a unique SHM pattern. Lineages are identified on a second iterative clustering step with USEARCH, using the complete VDJ region of each read within each clonotype. Reads corresponding to the same clonotype may have different lengths due to the sequence quality decay during the sequencing process. Shorter reads generated by early sequencing termination or shortened by quality filters would be still retained in the analysis. Two reads belong to the same lineage if: (1) Both reads have the same clonotype assignment; (2) The V(D)J regions of the 2 sequences have a nucleotide identity ≥ 99.5%; and (3) The length of the shorter V(D)J region is ≥ 60% of the larger read length (). These criteria are based on the mouse MD4 and the human IGHV1-3 sets (see Performance of ImmunediveRsity, benchmark data sets, ).

Entropy and Gini coefficient measurement

The Shannon-Weiner index was calculated by the following formula: , where p is the proportion of i elements (clonotypes or lineages). The Gini coefficient was calculated according to the following formula: where n is the number of elements (lineages per clonotype), xi is the sorted proportion of each lineage within clonotype.

Post-processing multi-library analysis toolbox

ImmunediveRsity provides a collection of scripts aimed to aid comparisons within multiple library experiments (, Fig. S1). A tool for sampling an equal number of reads or clonotypes is particularly useful for such task. A search tool for convergent CDRH3 in different individuals is also provided.

50 in total

Review 1. Rep-Seq: uncovering the immunological repertoire through next-generation sequencing.

Authors: Jennifer Benichou; Rotem Ben-Hamo; Yoram Louzoun; Sol Efroni
Journal: Immunology Date: 2012-03 Impact factor: 7.397

2. Fast, accurate error-correction of amplicon pyrosequences using Acacia.

Authors: Lauren Bragg; Glenn Stone; Michael Imelfort; Philip Hugenholtz; Gene W Tyson
Journal: Nat Methods Date: 2012-04-27 Impact factor: 28.547

3. IMGT(®) tools for the nucleotide analysis of immunoglobulin (IG) and T cell receptor (TR) V-(D)-J repertoires, polymorphisms, and IG mutations: IMGT/V-QUEST and IMGT/HighV-QUEST for NGS.

Authors: Eltaf Alamyar; Patrice Duroux; Marie-Paule Lefranc; Véronique Giudicelli
Journal: Methods Mol Biol Date: 2012

4. Amplification of cDNA ends based on template-switching effect and step-out PCR.

Authors: M Matz; D Shagin; E Bogdanova; O Britanova; S Lukyanov; L Diatchenko; A Chenchik
Journal: Nucleic Acids Res Date: 1999-03-15 Impact factor: 16.971

5. The application of rarefaction techniques to molecular inventories of microbial diversity.

Authors: Jennifer B Hughes; Jessica J Hellmann
Journal: Methods Enzymol Date: 2005 Impact factor: 1.600

Review 6. Reflections on the clonal-selection theory.

Authors: Melvin Cohn; N Av Mitchison; William E Paul; Arthur M Silverstein; David W Talmage; Martin Weigert
Journal: Nat Rev Immunol Date: 2007-10 Impact factor: 53.106

7. The mathematical theory of communication. 1963.

Authors: C E Shannon
Journal: MD Comput Date: 1997 Jul-Aug

Review 8. Recombination centres and the orchestration of V(D)J recombination.

Authors: David G Schatz; Yanhong Ji
Journal: Nat Rev Immunol Date: 2011-03-11 Impact factor: 53.106

9. High-throughput sequencing of the zebrafish antibody repertoire.

Authors: Joshua A Weinstein; Ning Jiang; Richard A White; Daniel S Fisher; Stephen R Quake
Journal: Science Date: 2009-05-08 Impact factor: 47.728

10. The extent of clonal structure in different lymphoid organs.

Authors: M H Hermans; A Wubbena; F G Kroese; S V Hunt; R Cowan; D Opstelten
Journal: J Exp Med Date: 1992-05-01 Impact factor: 14.307

16 in total

1. Clonify: unseeded antibody lineage assignment from next-generation sequencing data.

Authors: Bryan Briney; Khoa Le; Jiang Zhu; Dennis R Burton
Journal: Sci Rep Date: 2016-04-22 Impact factor: 4.379

2. Lower IgG somatic hypermutation rates during acute dengue virus infection is compatible with a germinal center-independent B cell response.

Authors: Elizabeth Ernestina Godoy-Lozano; Juan Téllez-Sosa; Gilberto Sánchez-González; Hugo Sámano-Sánchez; Andrés Aguilar-Salgado; Aarón Salinas-Rodríguez; Bernardo Cortina-Ceballos; Héctor Vivanco-Cid; Karina Hernández-Flores; Jennifer M Pfaff; Kristen M Kahle; Benjamin J Doranz; Rosa Elena Gómez-Barreto; Humberto Valdovinos-Torres; Irma López-Martínez; Mario H Rodriguez; Jesús Martínez-Barnetche
Journal: Genome Med Date: 2016-02-25 Impact factor: 11.117

3. SONAR: A High-Throughput Pipeline for Inferring Antibody Ontogenies from Longitudinal Sequencing of B Cell Transcripts.

Authors: Chaim A Schramm; Zizhang Sheng; Zhenhai Zhang; John R Mascola; Peter D Kwong; Lawrence Shapiro
Journal: Front Immunol Date: 2016-09-21 Impact factor: 7.561

4. 3D: diversity, dynamics, differential testing - a proposed pipeline for analysis of next-generation sequencing T cell repertoire data.

Authors: Li Zhang; Jason Cham; Alan Paciorek; James Trager; Nadeem Sheikh; Lawrence Fong
Journal: BMC Bioinformatics Date: 2017-02-27 Impact factor: 3.169

5. VDJML: a file format with tools for capturing the results of inferring immune receptor rearrangements.

Authors: Inimary T Toby; Mikhail K Levin; Edward A Salinas; Scott Christley; Sanchita Bhattacharya; Felix Breden; Adam Buntzman; Brian Corrie; John Fonner; Namita T Gupta; Uri Hershberg; Nishanth Marthandan; Aaron Rosenfeld; William Rounds; Florian Rubelt; Walter Scarborough; Jamie K Scott; Mohamed Uduman; Jason A Vander Heiden; Richard H Scheuermann; Nancy Monson; Steven H Kleinstein; Lindsay G Cowell
Journal: BMC Bioinformatics Date: 2016-10-06 Impact factor: 3.169

6. Strategies for Generating Diverse Antibody Repertoires Using Transgenic Animals Expressing Human Antibodies.

Authors: Weihsu C Chen; Christopher M Murawsky
Journal: Front Immunol Date: 2018-03-07 Impact factor: 7.561

7. Longitudinal analysis of the peripheral B cell repertoire reveals unique effects of immunization with a new influenza virus strain.

Authors: Bernardo Cortina-Ceballos; Elizabeth Ernestina Godoy-Lozano; Juan Téllez-Sosa; Marbella Ovilla-Muñoz; Hugo Sámano-Sánchez; Andrés Aguilar-Salgado; Rosa Elena Gómez-Barreto; Humberto Valdovinos-Torres; Irma López-Martínez; Rodrigo Aparicio-Antonio; Mario H Rodríguez; Jesús Martínez-Barnetche
Journal: Genome Med Date: 2015-11-25 Impact factor: 11.117

8. VDJPipe: a pipelined tool for pre-processing immune repertoire sequencing data.

Authors: Scott Christley; Mikhail K Levin; Inimary T Toby; John M Fonner; Nancy L Monson; William H Rounds; Florian Rubelt; Walter Scarborough; Richard H Scheuermann; Lindsay G Cowell
Journal: BMC Bioinformatics Date: 2017-10-11 Impact factor: 3.169

9. 5' Rapid Amplification of cDNA Ends and Illumina MiSeq Reveals B Cell Receptor Features in Healthy Adults, Adults With Chronic HIV-1 Infection, Cord Blood, and Humanized Mice.

Authors: Eric Waltari; Manxue Jia; Caroline S Jiang; Hong Lu; Jing Huang; Cristina Fernandez; Andrés Finzi; Daniel E Kaufmann; Martin Markowitz; Moriya Tsuji; Xueling Wu
Journal: Front Immunol Date: 2018-03-26 Impact factor: 7.561

Review 10. Next-Generation Sequencing of Antibody Display Repertoires.

Authors: Romain Rouet; Katherine J L Jackson; David B Langley; Daniel Christ
Journal: Front Immunol Date: 2018-02-02 Impact factor: 7.561