Yicheng Guo1, Kevin Chen2, Peter D Kwong3,4, Lawrence Shapiro1,3,4, Zizhang Sheng1. 1. Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, United States. 2. College of Arts and Science, Stony Brook University, Stony Brook, NY, United States. 3. Vaccine Research Center, National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health (NIH), Bethesda, MD, United States. 4. Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, United States.
Abstract
The diversity of B cell receptors provides a basis for recognizing numerous pathogens. Antibody repertoire sequencing has revealed relationships between B cell receptor sequences, their diversity, and their function in infection, vaccination, and disease. However, many repertoire datasets have been deposited without annotation or quality control, limiting their utility. To accelerate investigations of B cell immunoglobulin sequence repertoires and to facilitate development of algorithms for their analysis, we constructed a comprehensive public database of curated human B cell immunoglobulin sequence repertoires, cAb-Rep (https://cab-rep.c2b2.columbia.edu), which currently includes 306 immunoglobulin repertoires from 121 human donors, who were healthy, vaccinated, or had autoimmune disease. The database contains a total of 267.9 million V(D)J heavy chain and 72.9 million VJ light chain transcripts. These transcripts are full-length or near full-length, have been annotated with gene origin, antibody isotype, somatic hypermutations, and other biological characteristics, and are stored in FASTA format to facilitate their direct use by most current repertoire-analysis programs. We describe a website to search cAb-Rep for similar antibodies along with methods for analysis of the prevalence of antibodies with specific genetic signatures, for estimation of reproducibility of somatic hypermutation patterns of interest, and for delineating frequencies of somatically introduced N-glycosylation. cAb-Rep should be useful for investigating attributes of B cell sequence repertoires, for understanding characteristics of affinity maturation, and for identifying potential barriers to the elicitation of effective neutralizing antibodies in infection or by vaccination.
The diversity of B cell receptors provides a basis for recognizing numerous pathogens. Antibody repertoire sequencing has revealed relationships between B cell receptor sequences, their diversity, and their function in infection, vaccination, and disease. However, many repertoire datasets have been deposited without annotation or quality control, limiting their utility. To accelerate investigations of B cell immunoglobulin sequence repertoires and to facilitate development of algorithms for their analysis, we constructed a comprehensive public database of curated human B cell immunoglobulin sequence repertoires, cAb-Rep (https://cab-rep.c2b2.columbia.edu), which currently includes 306 immunoglobulin repertoires from 121 humandonors, who were healthy, vaccinated, or had autoimmune disease. The database contains a total of 267.9 million V(D)J heavy chain and 72.9 million VJ light chain transcripts. These transcripts are full-length or near full-length, have been annotated with gene origin, antibody isotype, somatic hypermutations, and other biological characteristics, and are stored in FASTA format to facilitate their direct use by most current repertoire-analysis programs. We describe a website to search cAb-Rep for similar antibodies along with methods for analysis of the prevalence of antibodies with specific genetic signatures, for estimation of reproducibility of somatic hypermutation patterns of interest, and for delineating frequencies of somatically introduced N-glycosylation. cAb-Rep should be useful for investigating attributes of B cell sequence repertoires, for understanding characteristics of affinity maturation, and for identifying potential barriers to the elicitation of effective neutralizing antibodies in infection or by vaccination.
B cells comprise a crucial component of the adaptive immune response (1). B cells recognize three-dimensional epitopes of antigens through the variable domains of the B cell receptor (BCR), or its various secreted forms of antibody. The variable domains of BCRs and antibodies are composed of immunoglobulin heavy and light chains, encoded by separate genes. Through V(D)J gene recombination and somatic hypermutation (SHM), a high level of sequence diversity is introduced to the variable domain (1–4), allowing B cells to recognize diverse antigens. Thus, an interrogation of BCR diversity and function is key to understanding the B cell immune response. The application of next-generation sequencing (NGS) to BCR repertoires provides snapshots of BCR diversity, and such studies in the past decade have characterized numerous features of B cell responses to infection, immunization, and autoimmune disease (3, 5–18). Databases, programs, and websites have been developed to store, process, and annotate BCR repertoire data (17, 19–24) [software reviewed in Chaudhary and Wesemann (25)]. Nonetheless, most repertoire data are deposited in public databases in the format of raw NGS reads, which contain both sequencing duplicates and sequencing errors and may be difficult to annotate. A comprehensive database of curated and well-annotated BCR transcripts, should therefore accelerate repertoire studies including but not limited to characterization of B cell receptor diversity, mechanisms of clonal expansion, development of BCR repertoire analysis algorithms, estimation of frequency of antigen-specific antibodies and their precursor-like cells, and effects of SHM. In addition, such a database should assist researchers in performing repertoire-related data mining.Designing vaccines that can elicit broadly neutralizing antibodies (bnAbs) is a long-term goal for preventing infections from fast evolving and/or highly diversified pathogens such as HIV-1 and influenza (26, 27). Studies have revealed that some epitopes could elicit bnAbs in different individuals with similar modes of recognition and shared genetic signatures [e.g., V(D)J gene origin, complementarity determining region 3 (CDR3) length, CDR3 motifs, SHM] (9, 28–33), defined as convergent, multidonor, or public bnAb classes. Such antibody classes often target conserved epitopes; those that can elicit the activation and maturation of naïve B cells with bnAb potential might provide templates for universal vaccines (30, 34–36). However, not all antibody classes appear with high frequencies in BCR repertoires; those that do not may be limited due to disfavored developmental steps or “roadblocks.” Thus, the identifications of bnAb classes with precursor-like cells prevalent in humans is a critical consideration for determining which template antibodies are good targets for elicitation, and thus for immunogen design (9, 37, 38). A comprehensive database of curated B cell repertoire transcripts, for which uncertainties and variations in B cell repertoires (e.g., sampling bias, infection history, aging, and genetic diversification) (12) have been minimized, would be helpful for predicting antibody class prevalences.A database of curated BCR repertoires could, moreover, be used to investigate SHM preference and mechanisms of affinity maturation. Antibodies accumulate mutations with high preference determined by both intrinsic gene mutability and functional selection (1, 39). We previously developed gene-specific substitution profiles (GSSPs) to characterize positional substitution types and frequencies in 69 human V genes (14). By incorporating additional BCR transcripts, GSSPs could be built for more genes, and this could have broad application; GSSPs or similar approaches have been applied to examine whether bnAbs mature with shared pathways (40), to identify highly frequent SHMs and common mechanisms of affinity modulation (41, 42), and to estimate whether rare SHMs (SHMs generated with very low frequency by the SHM machinery) in bnAbs could form barriers to re-elicitation by vaccination (34, 43).In this study, we constructed a database of curated human B cell immunoglobulin sequence repertoires (cAb-Rep) from 306 high quality human repertoires, and developed methods to search cAb-Rep using either sequence or sequence signature. We used the database to construct GSSPs for 102 human antibody V genes and developed a script to identify rare SHMs in input sequences. We evaluated the robustness of query results relative to the number of repertoires, using antibody prevalence estimates as a test case. In summary, the database developed here should help to investigate B cell repertoire features, to find antibody templates for vaccine design, and overall to understand mechanisms of antibody development.
Materials and Methods
BCR Repertoire Dataset
We assembled a total of 376 BCR repertoire datasets from 108 humandonors deposited in the NCBI short reads archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra). These repertoires were all sequenced using illumina MiSeq or HiSeq and were published in 11 studies (Table 1). In addition, annotated full length sequences of three donors (sequenced by AbHelix with high depth or HD repertoires, ~108 high quality sequences per donor) (4) and the near full length V(D)J sequences (sequenced with framework 1 region primers) from 10 HD repertoires (2) were downloaded from the links provided by the authors.
Table 1
Summary of donor type, library preparation and sequencing methods, BCR repertoire statistics, and references.
Type of donors
Experimental methods
Total repertoires
Unique donors
Average no. raw reads
Average no. unique reads
Average no. clones
No. of studies
Healthy
5′ RACE/FR1 primer MiSeq
59
31
1,247,695
13,432
7,142
5
High-depth healthy donor
5′ RACE/FR1 primer MiSeq/HiSeq
13
13
43,984,663
25,632,108
N.D.
2
Hepatitis B prime vaccination
FR1 primer MiSeq
88
9
291,249
7,677
5,648
1
Hepatitis B Boost vaccination
FR1 primer MiSeq
18
9
338,467
26,570
21,936
1
Influenza vaccination
5′ RACE/FR1 primer MiSeq
81
46
742,531
16,029
7,016
3
Multiple sclerosis brain
5′ RACE MiSeq
17
4
1,654,572
34,326
11,130
1
Myasthenia gravis
5′ RACE MiSeq
18
9
862,543
17,512
10,332
1
Systemic lupus erythematosus
5′ RACE MiSeq
8
7
3,741,944
410,286
41,705
1
Tetanus vaccination
5′ RACE MiSeq
4
4
1,999,230
58,787
17,309
1
Total
5′ RACE, FR1 primer MiSeq/HiSeq
306
121
818,868,037
340,841,950
2,655,416
11
Summary of donor type, library preparation and sequencing methods, BCR repertoire statistics, and references.
Next-Generation Sequencing Data Processing
The NGS data were analyzed with the SONAR pipeline, version 2.0 (https://github.com/scharch/sonar/) developed in our lab (23). Briefly, USEARCH was used to merge the 2 × 300 or 2 × 250 raw reads to single transcripts and to remove transcripts potentially containing more than 20 miscalls calculated from sequencing quality score (44). Merged transcripts shorter than 300 nucleotides were removed. BLAST (http://www.ncbi.nlm.nih.gov/blast/) was used to assign germline V, D, and J genes to each transcript with customized parameters (23, 45). CDR3 was identified from BLAST alignment using the conserved 2nd Cysteine in V region and WGXG (heavy chain) or FGXG (light chain) motifs in J region (X represents any of the 20 amino acids). To assign an isotype for each heavy chain transcript, we used BLAST with default parameters to search the 3′ terminus of each transcript against a database of human heavy chain constant domain 1 region obtained from the international ImMunoGeneTics information system (IMGT) database. A BLAST E-value threshold of 1E-6 was used to find significant isotype assignments. Then, sequences other than the V(D)J region of a transcript were removed and transcripts containing frame-shift and/or stop codon were excluded.We then applied two approaches to remove PCR duplicates and sequencing errors. For datasets sequenced with unique molecular identifier (UMI), we first grouped merged raw transcripts having identical UMI. Due to PCR crossover and UMI collisions, transcripts in a group may originate from different cells (2, 12). We therefore clustered the raw transcripts in each group using USEARCH with a 97% sequence identity. All transcripts in a USEARCH cluster were aligned using CLUSTALO (46) and a consensus sequence was generated. To further reduce redundant transcripts derived from multiple mRNA molecules of the same cell, we removed duplicates in the consensus sequences. Singletons, which have neither UMI duplicates nor consensus duplicates, were removed. SONAR was used to annotate the unique transcripts. For datasets sequenced without UMI, similar to our previous studies (14, 18, 23), we clustered transcripts of each repertoire using USEARCH with sequence identity of 0.99, and selected only one transcript with the highest sequencing depth or numbers of duplicates per cluster for later analyses. To further remove low quality reads, we excluded clusters with size <2. Finally, a unique dataset of transcripts was generated for each repertoire. To reduce artifacts and effect of small sample size on the comparisons of features of antibody repertoires, we excluded repertoires containing fewer than 1,700 unique transcripts.For each of the 13 HD repertoires curated in previous studies, we further removed duplicates, sequences containing frameshifts or/and stop codons, and corrected annotation errors to form a unique dataset. The gene annotation information was retrieved from the annotation files provided by the studies.
Germline Gene Database and Prediction of New Gene/Alleles
The human germline gene database from IMGT was used to assign germline V(D)J genes. To detect new germline genes or alleles, we combined unique reads from 108 repertoires, and used IgDiscover v0.9 with default parameters to identify potential new germline genes or alleles that are observed in multiple repertoires (47). While IgDiscover prefers to identify new genes or alleles from IgM repertoire, we used repertoires containing all isotypes. Nonetheless, identical unique reads were observed for each predicted gene/allele in at least two donors, suggesting that the predictions are still reliable. The predicted genes were submitted to European Nucleotide Archive (ENA) with project accession numbers: PRJEB31020. For each of the 13 HD repertoires, we randomly selected 1 million IgM transcripts for novel gene prediction (as recommended by IgDiscover manual) and no new gene or allele was found.
V and J Gene Usage and Antibody Position Numbering
The unique dataset for each repertoire was used to calculate the distributions of germline gene usage and SHM levels. For each transcript, we used ANARCI to assign each position according to the Kabat, Chothia, and IMGT numbering schemes (48).
Signature Prevalence and Rarefaction Analysis
We developed a python script, Ab_search.py, to search the BCR repertoire database in two modes: signature motif and full V(D)J sequence. Briefly, in the signature searching mode, an amino acid signature motif from either CDR3 region or other positions of interest (defined with the Kabat, Chothia, and IMGT numbering scheme) is converted to python regular expression format and searched against all transcripts in the database. For the sequence searching mode, BLAST+ is called to find similar transcripts with E-value < 1E-6 (49). Signature frequency in each repertoire is calculated by dividing the number of matched transcripts with the total number of unique transcripts originated from the same germline gene or allele.To evaluate the effect of number of repertoires on estimation of signature frequency, we performed rarefaction analysis by sampling i subset repertoires from PBMC samples (1 to 35). For each subset i, the random sampling was performed N times (we used N = 20 in the current study) and the mean signature frequency from each sampling was calculated. Then the coefficient of variance for each i, CV, was calculated as:
Identification of Antibody Clones and Construction of GSSPs
For each repertoire, we first sorted all transcripts to groups based on identical V and J gene usages. For each group, transcripts with 90% CDR3 sequence identity and the same CDR3 length were clustered into clones using USEARCH. One representative sequence was selected in each clone and representative transcripts from all repertoires were combined to build GSSPs for V genes using mGSSP (14). We exclude GSSPs built with <100 clones because of lack of information (14). A python script, SHM_freq.py, was developed to identify SHMs in an input sequence, to search the GSSP of an assigned V gene, and to output the rarity of each mutation, calculated as: Rarity = (1 – Frequency of mutation) * 100%.
Construction of Gene-Specific N-Glycosylation Profile
Glycosylation sites for sequences having SHMs >1% in the unique datasets of healthy and vaccination donors were predicted using an artificial neural network method NetNGlyc v1.0 (http://www.cbs.dtu.dk/services/NetNGlyc/). Predictions with high specificity (potential >0.5 and jury agreement of 9/9) and do not match Asn-Pro-Ser/Thr motifs were included. N-glycosylation sites encoded in germline genes were excluded.
Statistical Tests
ANOVA and T-tests were performed in R.
Results
cAb-Rep Contains a Diverse Compendium of Annotated Next-Generation Sequencing Datasets
We first assembled 376 BCR repertoire deep sequencing data sets from NCBI SRA database, each sequenced by Illumina MiSeq or HiSeq and with library preparation protocols to cover full length V(D)J region [5′ primers at leader regions or 5′ Rapid amplification of cDNA ends (RACE), 3′ primers targeting constant region 1] or near full length (5′ primers at N-terminus of framework 1 region). These datasets contain a total of 247 million raw reads (Table 1 and Table S1). We performed quality control including removing sequencing errors, PCR crossover, PCR duplicates, and annotated each sequence using SONAR (See Methods and Figure 1). A total of 293 repertoires from 108 donors were selected to construct the cAb-Rep database, with 6.6 and 0.9 million curated full length V(D)J heavy and light chain transcripts. Twenty-five thousand five hundred forty-one curated transcripts were obtained per repertoire. In addition, annotated sequences from 13 healthy donor repertoires sequenced with high depth (more than 20 million unique sequences from about 107-108 B cells per donor) (2, 4), were curated and incorporated.
Figure 1
Flowchart for the processing of repertoire data and developed tools. Next-generation sequencing data was processed and annotated using SONAR. Other published programs used were highlighted in bold font. Scripts developed in this study were in italic font.
Flowchart for the processing of repertoire data and developed tools. Next-generation sequencing data was processed and annotated using SONAR. Other published programs used were highlighted in bold font. Scripts developed in this study were in italic font.These repertoires were from studies including Hepatitis B vaccination, influenza vaccination, Tetanus vaccination, Multiple Sclerosis, Myasthenia Gravis, Systemic Lupus Erythematosus, and healthy donors [Table 1 and Table S1; (6–9, 12, 15–17, 50)]. Among all repertoires, 72 repertoires were from 43 healthy donors (35.5% of total donors). The database repertoires were obtained from three tissues: 2 from brain, 15 from cervical lymph node, and 289 from peripheral blood mononuclear cell (PBMC). Among PBMC samples, 16 were from antibody secreting B cells, 27 from Hepatitis B surface antigen (HBsAg)+ B cells, 8 from Human Leukocyte Antigen – DR isotype (HLA-DR)+ plasma cells, 139 from IgG+ B cells, 20 from memory B cells, 34 from IgM or naïve B cells, and 52 from whole PBMC (Table S1). Thus, cAb-Rep contains BCR repertoire data in different settings, which will enable cross-study comparisons.For each transcript in cAb-Rep, we numbered positions using Kabat, Chothia, and IMGT schemes (Figure 1), and annotated gene origin, CDR3 sequence and length, somatic hypermutation level, isotype, donor, and clonotype. These curated datasets can be used for downstream analyses. The transcripts are stored in FASTA format with annotation information in the header line of each transcript. Such a format can be easily altered to feed into other programs for repertoire analysis.Because the 13 HD repertoires contained over 330 million unique sequences, to avoid sampling bias for profile constructions and improve speed of searching cAb-Rep (see sections below), we randomly selected 20,000 unique sequences of each isotype of each repertoire and combined with the rest of repertoires for all analyses below. Nonetheless, in case rare events of BCR signatures are of interests, we provided an option in our scripts and at cAb-Rep website to search the 13 HD repertoires alone.
Ab_Search Identifies Similar Antibody Transcripts and Estimates Signature Frequency
To identify antibody transcripts of interest from cAb-Rep, we developed a python script, Ab_search.py. The script accepts sequences or amino acid motifs as input and output frequency of the signature and matched sequences (See Materials and Methods). Here, as examples, we searched transcripts having signature motifs of a selected malaria antibody class in PBMC BCR repertoires or of a selected HIV-1 bnAb classes in IgM and IgG repertoires (Figure 2).
Figure 2
Frequencies of malaria and HIV-1 broadly neutralizing antibody classes-like transcripts. (A) Frequencies of influenza IGHV3-33/IGKV1-5 class heavy and light chain signatures in 15 heavy and 17 light chain PBMC repertoires of healthy donors. Signatures are: IGHV3-33 gene and W52 for heavy chain and IGKV1-5 and 8 amino acids CDRL3 for light chain. (B,C) Showed the frequencies of VRC01 class heavy and light chain signatures, respectively. The searched signatures were IGHV1-2*02 origin for heavy chain and either IGKV1-33, IGKV3-15, IGKV3-20, or IGLV2-14 origin plus a CDR L3 of five amino acids matching motif X-X-[AFILMYWV]-[EQ]-X for light chain. (D) Rarefaction analysis to show the effect of number of repertoires on prediction of frequencies of bnAb signatures (malaria antibody heavy chain, light chain, and HIV bnAb VRC01 light chain). We used coefficient of variation to measure the variation of signature frequency at each sampling size. The analysis showed that the coefficient of variation decreases substantially when sampling more than 10 repertoires, suggesting that increasing sampling size helps to estimate signature frequency with higher confidence. The coefficient of variation for malaria VH3-33/KV1-5 class antibody heavy and light chain signatures were colored blue and orange, respectively, while those of the HIV bnAb VRC01 light signature were colored magenta.
Frequencies of malaria and HIV-1 broadly neutralizing antibody classes-like transcripts. (A) Frequencies of influenzaIGHV3-33/IGKV1-5 class heavy and light chain signatures in 15 heavy and 17 light chain PBMC repertoires of healthy donors. Signatures are: IGHV3-33 gene and W52 for heavy chain and IGKV1-5 and 8 amino acids CDRL3 for light chain. (B,C) Showed the frequencies of VRC01 class heavy and light chain signatures, respectively. The searched signatures were IGHV1-2*02 origin for heavy chain and either IGKV1-33, IGKV3-15, IGKV3-20, or IGLV2-14 origin plus a CDR L3 of five amino acids matching motif X-X-[AFILMYWV]-[EQ]-X for light chain. (D) Rarefaction analysis to show the effect of number of repertoires on prediction of frequencies of bnAb signatures (malaria antibody heavy chain, light chain, and HIV bnAb VRC01 light chain). We used coefficient of variation to measure the variation of signature frequency at each sampling size. The analysis showed that the coefficient of variation decreases substantially when sampling more than 10 repertoires, suggesting that increasing sampling size helps to estimate signature frequency with higher confidence. The coefficient of variation for malaria VH3-33/KV1-5 class antibody heavy and light chain signatures were colored blue and orange, respectively, while those of the HIV bnAb VRC01 light signature were colored magenta.The selected malaria antibody class contains both heavy (IGHV3-33 gene with a tryptophan at position 52) and light chain (IGKV1-5 gene with 8 amino acid CDRL3) signatures (51). The heavy chain signature was observed with a frequency of 3.2 ± 0.94% in 15 PBMC repertoires in healthy donors (Figure 2A), the light chain signature was found with a frequency of 1.24 ± 0.18% in 17 healthy donor PBMC repertoires. All healthy donors we searched contain both heavy and light sequences. By assuming random pairing (52), heavy-light paired antibodies similar to the malaria antibody class could be generated with a high frequency (~40 per million B cells). Thus, vaccine mediated elicitation of such antibodies are promising candidates to enable protection.The HIV-1 VRC01 class contains both heavy and light chain signatures (31–33, 53). Its heavy chain uses IGHV1-2*02 allele. The light chain is originated from a limited set of genes, including IGKV1-33, IGKV3-15, IGKV3-20, or IGLV2-14 and contains a CDR3 of five amino acids matching motif X-X-[AFILMYWV]-[EQ]-X. Searching cAb-Rep with these signatures showed that the heavy chain allele for VRC01 appears in IgM and IgG repertories with similar frequencies (2.7 ± 1.9% and 2.85 ± 2.13%, respectively; Figure 2B). Further, VRC01 light chain-like transcripts were found in ~33% of donors, with a mean frequency of 0.005 ± 0.0098% in PBMCs (Figure 2C). By assuming random pairing of heavy and light chains, our calculation of the mean frequency of VRC01 class-like antibodies was 1.4 per million B cells in healthy donors, close to estimates from previous studies (33, 38).
Impact of Repertoire Diversity on Signature Prevalence
Due to diversification in BCR repertoires by antigen selection and other factors, the predicted prevalence of transcripts of interest could vary. To evaluate the effect of the number of sampled repertoires on prevalence prediction, we performed rarefaction analysis to sample a subset of repertoires (ranging from 1 to 35) with each subset randomly sampling 20 times to calculate frequencies of the malariaIGHV3-33/IGKV1-5 class and VRC01 class light chain signatures. For each sampling size, we calculated the coefficient of variation (standard deviation/mean) to measure the degree of variation among sampling repeats (Figure 2D). For all signatures, our analysis revealed that the coefficient of variation decreases dramatically when the sampling size increases to 10 or more. This suggests that 10 or more repertoires will be optimal to have a consistent estimation of signature frequency.
Gene-Specific Substitution Profiles and Substitution Frequency Analysis
To investigate substitution preferences in V genes, we predicted new germline genes in the database, clustered transcripts in each repertoire into clones, and selected one representative sequence per clone to build gene-specific substitution profiles (GSSPs) (see Methods). Overall, we identified 5 novel heavy chain alleles, 2 novel lambda chain alleles, and 1 kappa chain allele, each of which was found in two or more donors. By incorporating these germline gene sequences in cAb-Rep sequence annotation, we built GSSPs for 102 human V genes, compared to GSSPs built for 69 genes using three repertoires in our previous study. Our analysis further showed that the GSSPs of the two studies were highly consistent (r = 0.983 for IGHV1-2 gene, Figure 3A).
Figure 3
Comparison of gene-specific substitution profiles and usage of a substitution profile for investigating substitution preference. (A) Comparison of substitution frequencies of all amino acid types at all IGHV1-2 positions estimated using cAb-Rep dataset and previous dataset. A Pearson correlation coefficient of 0.982 suggested that the substitution profiles of IGHV1-2 are highly consistent. (B) The gene-specific substitution profile of IGHV1-2 and rarity of somatic hypermutations in HIV-1 bnAbs and autoantibodies. Rare mutations, colored red, are observed frequently in HIV-1 bnAbs but not in autoantibodies, suggesting the mutation patterns in HIV-1 bnAbs may be generated with low frequency. For each antibody sequence, residues identical to IGHV1-2*02 germline gene were shown with dots. Missing residues were showed with minus sign. The disease and antigen were labeled on the right side of each sequence.
Comparison of gene-specific substitution profiles and usage of a substitution profile for investigating substitution preference. (A) Comparison of substitution frequencies of all amino acid types at all IGHV1-2 positions estimated using cAb-Rep dataset and previous dataset. A Pearson correlation coefficient of 0.982 suggested that the substitution profiles of IGHV1-2 are highly consistent. (B) The gene-specific substitution profile of IGHV1-2 and rarity of somatic hypermutations in HIV-1 bnAbs and autoantibodies. Rare mutations, colored red, are observed frequently in HIV-1 bnAbs but not in autoantibodies, suggesting the mutation patterns in HIV-1 bnAbs may be generated with low frequency. For each antibody sequence, residues identical to IGHV1-2*02 germline gene were shown with dots. Missing residues were showed with minus sign. The disease and antigen were labeled on the right side of each sequence.To facilitate exploring substitution preference, we developed a python script, SHM_freq.py, to identify mutations in an input sequence, call the GSSP of corresponding V gene, and find the frequency of the mutation being generated by the somatic hypermutation machinery. To demonstrate how this information can be helpful, we analyzed frequencies of substitutions observed in the heavy chain of VRC01 class bnAbs (Figure 3B). This analysis showed that all lineages in this class contain over 30% mutations, with ~30% of the mutations being low frequency or rare mutations (frequency <0.5% in IGHV1-2 GSSP). These mutations are generated with low frequency either because they require multiple nucleotide substitutions (14) or are from single substitutions in silent SHM positions (43). Functional studies have shown that some rare mutations are essential for potency and neutralization (54). However, the likelihood of immunogens maturing antibodies to have similar mutations could be low or require longer maturation times. In contrast, we observed that autoantibodies [e.g., collected from HIV, autoimmune thyroid disease, atherosclerosis, Hashimoto disease, and rheumatic carditis (55–60)] originated from IGHV1-2 genes contain very few rare mutations, suggesting somatic mutations may not provide a barrier to elicitation of these lineages.
Gene-Specific N-Glycosylation Profiles (GSNPs)
Post-translation modifications (PTM) (glycosylation, tyrosine sulfation, etc.), which affects antibody functions (42, 61), can be introduced to antibodies by V(D)J recombination and somatic hypermutation processes. To understand the frequency and preference of PTMs generated by somatic hypermutation, as an example, we predicted V-gene-specific frequency of N-glycosylation sequons at each position using healthy and vaccination donor unique sequences that having more than 1% SHM. Overall, consistent with previous study (42), the predicted N-glycosylation sites were enriched in CDR1, CDR2, and framework 3 regions, but different genes have different hotspots for glycosylation (Figure 4A). Structural analysis showed that the side chains of these hotspot positions to be surface-exposed (Figure 4B), suggesting these sites to be spatially accessible for modification. GSNPs should thus be able to provide information for further experimental validation and investigations of impact of N-glycosylations.
Figure 4
Predicted glycosylation sites generated by somatic hypermutation in V genes and their structural location. (A) SHM hotspots for glycosylation in IGHV1-69, IGHV3-11, and IGHV4-39 genes. (B) A structural demo (PDBID: 1dn0) shows the predicted glycosylation hotspots to be surface-exposed, indicating accessibly for post-translational modification.
Predicted glycosylation sites generated by somatic hypermutation in V genes and their structural location. (A) SHM hotspots for glycosylation in IGHV1-69, IGHV3-11, and IGHV4-39 genes. (B) A structural demo (PDBID: 1dn0) shows the predicted glycosylation hotspots to be surface-exposed, indicating accessibly for post-translational modification.
cAb-Rep Website to Search Frequencies of Signature Motif and SHM
While we developed scripts to search cAb-Rep, these may be of limited utility to users not familiar with programing. Therefore, we developed a website for searching cAb-Rep (https://cab-rep.c2b2.columbia.edu/). The website implements all functions of the scripts we developed above, including querying cAb-Rep using the three signature modes (CDR3, position, BLAST) with specified isotype, numbering scheme, and VJ recombinations, identifying rare SHMs for an input sequence, and showing the GSSP of a V gene (Figure 5). Users can also query GSNP of an input gene as well as download all the datasets in this study.
Figure 5
Querying frequencies of signature and somatic hypermutation at cAb-Rep website. (A) cAb-Rep interfaces to estimate frequency of an input signature, gene-specific rarity of somatic hypermutation, and gene-specific glycosylation frequency. (B) Result for search of VRC01 bnAb-like light chain sequences using the CDR3 mode. The full results including sequences and donor information can be obtained from the download link, and the frequency of the motif in each donor is shown online. The parameters used are: IGKV3-20 gene and a five amino acid CDR L3 matching [A-Z]{2}[AFILMYWV][EQ][A-Z] motif.
Querying frequencies of signature and somatic hypermutation at cAb-Rep website. (A) cAb-Rep interfaces to estimate frequency of an input signature, gene-specific rarity of somatic hypermutation, and gene-specific glycosylation frequency. (B) Result for search of VRC01 bnAb-like light chain sequences using the CDR3 mode. The full results including sequences and donor information can be obtained from the download link, and the frequency of the motif in each donor is shown online. The parameters used are: IGKV3-20 gene and a five amino acid CDR L3 matching [A-Z]{2}[AFILMYWV][EQ][A-Z] motif.
Discussion
To understand the mechanisms of BCR diversity generated by V(D)J recombination and somatic hypermutation, high quality BCR repertoire datasets are critical. In this study, we have constructed a database of curated BCR transcripts from previous deep sequencing studies of B cell immunoglobulin sequence repertoires and have demonstrated how this database can provide helpful information for repertoire studies both locally and online. Currently, heavy-light pairing information is not obtained in cAb-Rep datasets. As new technologies are applied to repertoire sequencing, we believe more paired heavy-light chain transcripts will be incorporated, which will greatly advance functional characterization of BCRs. We will continue to update this database to include more disease conditions as well as those from animal models.In cAb-Rep, we incorporated BCR datasets sequenced both with and without UMI. Besides filtering out transcripts with low sequencing quality, different approaches were used to identify high quality transcripts. For repertoires sequenced using UMI, we used the UMI information to generate consensus sequences, which have proven effective at removing sequencing errors (62). For BCR repertoires sequenced without UMI, we used a clustering approach, which we first sorted transcripts based on redundancy and selected transcripts with the most redundancy as seeds to cluster transcripts at a given identity cutoff and only used the seeds as high quality transcripts. The assumption was that each B cell contains multiple BCR mRNA molecules and the original BCR could be PCR amplified and sequenced with many copies. Sequencing errors and PCR crossover are rare and close to random events (63). Sequences that differ from the seed due to PCR crossover are likely to appear as singletons after clustering, whereas differences arising from sequencing errors are likely to be only a few hamming distances away from the seed transcripts, and cluster together. Thus, removing transcripts highly similar to the seed transcripts and singletons will remove many sequencing errors, although we may lose a portion of biological transcripts with low sequencing coverage. This approach is proven effective at finding functional transcripts in previous studies (64, 65).Compared to other BCR repertoire databases, cAb-Rep provides more flexibility to genetic signature search. Another public database, iReceptor, supported by the adaptive immune receptor repertoire community, has been developed to store, query, and analyze immune receptor repertoire data (66). While iReceptor includes BCR repertoires amplified with various library preparation protocols and sequenced in multiple platforms (Illumina, 454, ion torrent, etc.), cAb-Rep is limited to include full-length or near-full length sequences with enough sample size for repertoire related analyses. For repertoire analysis, iReceptor allows query with V(D)J recombination and a CDR3 peptide (no regular expression grammars allowed), while cAb-Rep provides BLAST mode and a more flexible CDR3 searching mode. cAb-Rep also allows to search motifs across the V(D)J region in the position mode. Although iReceptor includes curated repertoire data deposited by researchers, these curated datasets are processed using different bioinformatics pipeline and parameters (germline gene database, exclusion of redundancy, etc.), which may introduce bias when performing statistical analysis across studies. Another B cell repertoire database is available, but only includes repertoires from three donors (20). Moreover, the 13 HD curated BCR repertoires, which provide high depth BCR diversity information, have not been incorporated in other databases. These datasets increase the statistical power to study features as well as rare events of BCR diversity.A further goal of cAb-Rep is to advance the investigation of impact of somatic hypermutation, which, as far as we know, is not available in other B cell repertoire databases. We provide a query portal and GSSPs for human antibody V genes to understand gene- and positional-specific substitution preferences. We also predicted gene-specific frequencies of N-glycosylation in human antibody V genes. Such sequence-derived information, together with functional study, is critical to identify common mechanism of modulation of affinity and understand similarities in pathways of antibody affinity maturation (40–42). Nonetheless, our current knowledge on impact of SHM is still very limited. In the future, we will collect literatures to characterize how SHM affects antibody structure and function as well as develop new tools to predict effects of SHM. A portal in cAb-Rep will be created to query studied SHMs, which will elucidate the relations of antibody sequence, structure, and function and provide knowledge for antibody design.In summary, we believe cAb-Rep and the tools developed in this study to be complimentary to other B cell repertoire databases and to be helpful to researchers not familiar with repertoire annotation for exploring features of repertoires and compare datasets across studies and diseases.
Data Availability Statement
The raw next-generation sequencing data were downloaded from NCBI SRA database (https://www.ncbi.nlm.nih.gov/sra). The project accession numbers were listed in the ProjectID column of Table S1. The curated data can be found at our website: https://cab-rep.c2b2.columbia.edu/tools/.
Author Contributions
PK, LS, and ZS designed the research. YG, KC, and ZS analyzed data. YG, PK, LS, and ZS wrote the paper. All authors reviewed, commented on, and approved the manuscript.
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Authors: Joseph G Jardine; Daniel W Kulp; Colin Havenar-Daughton; Anita Sarkar; Bryan Briney; Devin Sok; Fabian Sesterhenn; June Ereño-Orbea; Oleksandr Kalyuzhniy; Isaiah Deresa; Xiaozhen Hu; Skye Spencer; Meaghan Jones; Erik Georgeson; Yumiko Adachi; Michael Kubitz; Allan C deCamp; Jean-Philippe Julien; Ian A Wilson; Dennis R Burton; Shane Crotty; William R Schief Journal: Science Date: 2016-03-25 Impact factor: 47.728
Authors: Damian C Ekiert; Gira Bhabha; Marc-André Elsliger; Robert H E Friesen; Mandy Jongeneelen; Mark Throsby; Jaap Goudsmit; Ian A Wilson Journal: Science Date: 2009-02-26 Impact factor: 47.728
Authors: Peter D Kwong; Gwo-Yu Chuang; Brandon J DeKosky; Tatyana Gindin; Ivelin S Georgiev; Thomas Lemmin; Chaim A Schramm; Zizhang Sheng; Cinque Soto; An-Suei Yang; John R Mascola; Lawrence Shapiro Journal: Immunol Rev Date: 2017-01 Impact factor: 12.988
Authors: Florian Rubelt; Christopher R Bolen; Helen M McGuire; Jason A Vander Heiden; Daniel Gadala-Maria; Mikhail Levin; Ghia M Euskirchen; Murad R Mamedov; Gary E Swan; Cornelia L Dekker; Lindsay G Cowell; Steven H Kleinstein; Mark M Davis Journal: Nat Commun Date: 2016-03-23 Impact factor: 14.919
Authors: Xuejun Chen; Tongqing Zhou; Stephen D Schmidt; Hongying Duan; Cheng Cheng; Gwo-Yu Chuang; Ying Gu; Mark K Louder; Bob C Lin; Chen-Hsiang Shen; Zizhang Sheng; Michelle X Zheng; Nicole A Doria-Rose; M Gordon Joyce; Lawrence Shapiro; Ming Tian; Frederick W Alt; Peter D Kwong; John R Mascola Journal: Immunity Date: 2021-01-15 Impact factor: 31.745
Authors: Chen-Hsiang Shen; Brandon J DeKosky; Yicheng Guo; Kai Xu; Ying Gu; Divya Kilam; Sung Hee Ko; Rui Kong; Kevin Liu; Mark K Louder; Li Ou; Baoshan Zhang; Cara W Chao; Martin M Corcoran; Eric Feng; Jesse Huang; Erica Normandin; Sijy O'Dell; Amy Ransier; Reda Rawi; Mallika Sastry; Stephen D Schmidt; Shuishu Wang; Yiran Wang; Gwo-Yu Chuang; Nicole A Doria-Rose; Bob Lin; Tongqing Zhou; Eli A Boritz; Mark Connors; Daniel C Douek; Gunilla B Karlsson Hedestam; Zizhang Sheng; Lawrence Shapiro; John R Mascola; Peter D Kwong Journal: Cell Host Microbe Date: 2020-03-03 Impact factor: 21.023
Authors: Rahmad Akbar; Habib Bashour; Puneet Rawat; Philippe A Robert; Eva Smorodina; Tudor-Stefan Cotet; Karine Flem-Karlsen; Robert Frank; Brij Bhushan Mehta; Mai Ha Vu; Talip Zengin; Jose Gutierrez-Marcos; Fridtjof Lund-Johansen; Jan Terje Andersen; Victor Greiff Journal: MAbs Date: 2022 Jan-Dec Impact factor: 5.857
Authors: Bailey B Banach; Prabhanshu Tripathi; Lais Da Silva Pereira; Jason Gorman; Thuy Duong Nguyen; Marlon Dillon; Ahmed S Fahad; Patience K Kiyuka; Bharat Madan; Jacy R Wolfe; Brian Bonilla; Barbara Flynn; Joseph R Francica; Nicholas K Hurlburt; Neville K Kisalu; Tracy Liu; Li Ou; Reda Rawi; Arne Schön; Chen-Hsiang Shen; I-Ting Teng; Baoshan Zhang; Marie Pancera; Azza H Idris; Robert A Seder; Peter D Kwong; Brandon J DeKosky Journal: J Exp Med Date: 2022-06-23 Impact factor: 17.579
Authors: Alice Cho; Frauke Muecksch; Zijun Wang; Tarek Ben Tanfous; Justin DaSilva; Raphael Raspe; Brianna Johnson; Eva Bednarski; Victor Ramos; Dennis Schaefer-Babajew; Irina Shimeliovich; Juan P Dizon; Kai-Hui Yao; Fabian Schmidt; Katrina G Millard; Martina Turroja; Mila Jankovic; Thiago Y Oliveira; Anna Gazumyan; Christian Gaebler; Marina Caskey; Theodora Hatziioannou; Paul D Bieniasz; Michel C Nussenzweig Journal: J Exp Med Date: 2022-07-01 Impact factor: 17.579
Authors: Marianna Agudelo; Frauke Muecksch; Dennis Schaefer-Babajew; Alice Cho; Justin DaSilva; Eva Bednarski; Victor Ramos; Thiago Y Oliveira; Melissa Cipolla; Anna Gazumyan; Shuai Zong; Danielle A S Rodrigues; Guilherme S Lira; Luciana Conde; Renato Santana Aguiar; Orlando C Ferreira; Amilcar Tanuri; Katia C Affonso; Rafael M Galliez; Terezinha Marta Pereira Pinto Castineiras; Juliana Echevarria-Lima; Marcelo Torres Bozza; Andre M Vale; Paul D Bieniasz; Theodora Hatziioannou; Michel C Nussenzweig Journal: J Exp Med Date: 2022-07-07 Impact factor: 17.579
Authors: Sven Kratochvil; Chen-Hsiang Shen; Ying-Cing Lin; Kai Xu; Usha Nair; Lais Da Silva Pereira; Prabhanshu Tripathi; Johan Arnold; Gwo-Yu Chuang; Eleonora Melzi; Arne Schön; Baoshan Zhang; Marlon Dillon; Brian Bonilla; Barbara J Flynn; Kathrin H Kirsch; Neville K Kisalu; Patience K Kiyuka; Tracy Liu; Li Ou; Marie Pancera; Reda Rawi; Mateo Reveiz; Kareen Seignon; Lawrence T Wang; Michael T Waring; John Warner; Yongping Yang; Joseph R Francica; Azza H Idris; Robert A Seder; Peter D Kwong; Facundo D Batista Journal: Immunity Date: 2021-11-16 Impact factor: 43.474
Authors: Christian Gaebler; Zijun Wang; Julio C C Lorenzi; Frauke Muecksch; Shlomo Finkin; Minami Tokuyama; Alice Cho; Mila Jankovic; Dennis Schaefer-Babajew; Thiago Y Oliveira; Melissa Cipolla; Charlotte Viant; Christopher O Barnes; Yaron Bram; Gaëlle Breton; Thomas Hägglöf; Pilar Mendoza; Arlene Hurley; Martina Turroja; Kristie Gordon; Katrina G Millard; Victor Ramos; Fabian Schmidt; Yiska Weisblum; Divya Jha; Michael Tankelevich; Gustavo Martinez-Delgado; Jim Yee; Roshni Patel; Juan Dizon; Cecille Unson-O'Brien; Irina Shimeliovich; Davide F Robbiani; Zhen Zhao; Anna Gazumyan; Robert E Schwartz; Theodora Hatziioannou; Pamela J Bjorkman; Saurabh Mehandru; Paul D Bieniasz; Marina Caskey; Michel C Nussenzweig Journal: Nature Date: 2021-01-18 Impact factor: 69.504
Authors: Zijun Wang; Fabian Schmidt; Yiska Weisblum; Frauke Muecksch; Christopher O Barnes; Shlomo Finkin; Dennis Schaefer-Babajew; Melissa Cipolla; Christian Gaebler; Jenna A Lieberman; Thiago Y Oliveira; Zhi Yang; Morgan E Abernathy; Kathryn E Huey-Tubman; Arlene Hurley; Martina Turroja; Kamille A West; Kristie Gordon; Katrina G Millard; Victor Ramos; Justin Da Silva; Jianliang Xu; Robert A Colbert; Roshni Patel; Juan Dizon; Cecille Unson-O'Brien; Irina Shimeliovich; Anna Gazumyan; Marina Caskey; Pamela J Bjorkman; Rafael Casellas; Theodora Hatziioannou; Paul D Bieniasz; Michel C Nussenzweig Journal: Nature Date: 2021-02-10 Impact factor: 69.504