
VDJbase: an adaptive immune receptor genotype and haplotype database.

Aviv Omer¹, Or Shemesh¹, Ayelet Peres¹, Pazit Polak¹, Adrian J Shepherd², Corey T Watson³, Scott D Boyd⁴, Andrew M Collins⁵, William Lees², Gur Yaari¹.

Abstract

VDJbase is a publicly available database that offers easy searching of data describing the complete sets of gene sequences (genotypes and haplotypes) inferred from adaptive immune receptor repertoire sequencing datasets. VDJbase is designed to act as a resource that will allow the scientific community to explore the genetic variability of the immunoglobulin (Ig) and T cell receptor (TR) gene loci. It can also assist in the investigation of Ig- and TR-related genetic predispositions to diseases. Our database includes web-based query and online tools to assist in visualization and analysis of the genotype and haplotype data. It enables users to detect those alleles and genes that are significantly over-represented in a particular population, in terms of genotype, haplotype and gene expression. The database website can be freely accessed at https://www.vdjbase.org/, and no login is required. The data and code are released under Creative Commons licenses and are freely downloadable from https://bitbucket.org/account/user/yaarilab/projects/GPHP.

Year: 2020 PMID: 31602484 PMCID: PMC6943044 DOI: 10.1093/nar/gkz872

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

An important application of the recent advances in high-throughput DNA sequencing is the exploration of adaptive immune receptor repertoires (AIRR). AIRR sequencing (AIRR-seq) enables exploration of the dynamics of the adaptive immune system (1), and has applications to the study of aging (2,3), cancer (4), autoimmune diseases (5–7), allergy (8), infectious diseases (9) and vaccine design (10). A crucial step in the analysis of AIRR-seq data is the correct identification of specific V, D and J germline genes that contribute to each antibody and T cell receptor gene sequence. It is the starting point for in-depth analyses such as the identification and quantification of somatic hypermutation (11), determination of gene usage distribution, and correlation of AIRR-seq data with clinical conditions (12). For example, it was recently demonstrated that the presence or absence of a specific allele greatly affects the response to influenza A and HIV infections (13–15). Other infectious diseases as well as cancer and allergy may also be sensitive to the germline repertoire. However, our knowledge of the genetic loci encoding Ig and TR is very incomplete, since the genomic regions encoding these receptors contain many duplications, deletions, and other complex events, which hinder their direct sequencing using short reads (16). This is true for all species studied to date, including humans. Genomic studies of the human loci have come from just a handful of individuals, and we therefore do not know the extent of population variation within these loci, though there is reason to believe the variation is significant (17–21). Recently, we and others have published several computational tools to help explore these regions, to infer previously unknown alleles, deletion polymorphisms, and complete sets of immunoglobulin genes that are expressed by different individuals (genotypes and haplotypes) from AIRR-seq data (21–28).
Germline sequences affirmed by the new tools are curated in the international ImMunoGeneTics (IMGT) information system (29) after review by the Inferred Allele Review Committee (IARC) of the AIRR Community (30). This process is facilitated by OGRDB (the Open Germline Receptor Database: https://ogrdb.airr-community.org), which provides supporting evidence for published alleles, including details of repertoires in which they have been observed. Currently, there are ∼60 alleles that are either under review or have recently been affirmed by the IARC and accepted into IMGT. However, data relating to genotypes, haplotypes, and general gene usage across the human population are currently beyond the scope of IMGT and OGRDB. There is a need to better understand the usage of germline alleles in different individuals, ethnic and clinical groups. A better picture of the set of alleles expressed by each individual should lead to important discoveries such as predispositions to disease and variable responses to vaccination and drug therapy. Currently, the prevalence within the human population of each allele curated in IMGT is unclear, and the very existence or functionality of many sequences has even been questioned (31). For this reason, we have developed VDJbase, a publicly available database that offers easy searching of antibody genotype and haplotype data inferred from AIRR-seq datasets. VDJbase stores information about genotypes and haplotypes inferred from individuals from diverse ethnic and clinical backgrounds, and produces summary statistics about sets of samples that are filtered according to their associated meta-data.

IMPLEMENTATION

The web interface of VDJbase offers researchers a fast and convenient way to browse for genotypes and haplotypes, compare published datasets, generate interactive visual analyses, and submit AIRR-seq data to foster continuous growth. To allow for unbiased comparisons, the database inputs are generated by an identical data processing pipeline. The pipeline’s input is pre-processed Ig and TR sequences. The pipeline begins with a preliminary V(D)J assignment, which includes an inference of previously unknown alleles, followed by inference of genotype and haplotype (see Materials and Methods section). Users can freely access the database using any browser. Results are displayed as a table, along with samples and their related metadata, and the associated files and figures can be downloaded to the user’s own computer. We exploit the HTML platform for interactive visualizations. Unlike static charts, interactive data visualizations encourage users to explore and even manipulate the data to uncover other factors. See Figure 1 for a schematic diagram of VDJbase. The entry page includes tutorials about the database service, with links to the user guide and to the content search.

Figure 1.

Schematic chart of VDJbase workflow.

Database search

VDJbase can be browsed by selecting samples of interest and output fields. Users can interactively interrogate genotypes and haplotypes using the ‘Database Search’ page. Searches can be restricted by various criteria, such as cell type (e.g. memory B cell), tissue type (e.g. blood), health status (e.g. celiac), Ig group (e.g. heavy chain), isotype (e.g. IgM), sex or specific genes and alleles. To give a clearer view of the information, the ‘Export Graphs’ menu creates visual analyses of the user’s selected samples. Using a set of drop-down tag lists, users can filter all visualizations according to genes, alleles, or the certainty level of inferences (Kdiff, see (21,26)) for each genotype/haplotype decision. All entries can be downloaded using the ‘Download Selected’ tab, which provides a .zip file with the selected data and related meta-data that can be viewed, for instance, in Microsoft Excel. This combination of the annotation and the availability of the underlying data for large sample sets is currently available nowhere else for AIRR-seq data, and is a step forward in the context of open data sharing. Graphs are downloadable in PDF file format. To allow users to quickly assess the complete information on the experimental set-up and materials used for their selected sample, the ‘Reference’ column contains clickable icons which open any manuscript describing the data in a new browser window. Each section contains help materials to ensure ease of use, without prerequisite knowledge or experience. Clicking the ‘?’ symbol located to the right of each item opens an explanation. On the ‘Explore Data’ page, users can view representative examples of interesting findings revealed by VDJbase, together with a summary of the number of studies, samples and related metadata currently stored in the database.
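The kind of metadata query described above can be illustrated with a minimal Python sketch. It is not the VDJbase implementation: the `Sample` record, its field names and the example values are hypothetical, chosen only to mirror the query types listed in the text (cell type, tissue, health status, Ig group, sex).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # Hypothetical metadata fields mirroring the query types named in the text.
    name: str
    cell_type: str
    tissue: str
    health_status: str
    chain: str
    sex: str


def filter_samples(samples, **criteria):
    """Return the samples whose metadata match every field=value pair given."""
    unknown = set(criteria) - set(Sample.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    return [
        s for s in samples
        if all(getattr(s, field) == value for field, value in criteria.items())
    ]


# Made-up example records, for illustration only.
samples = [
    Sample("S1", "memory B cell", "blood", "celiac", "heavy", "F"),
    Sample("S2", "naive B cell", "blood", "healthy", "heavy", "M"),
    Sample("S3", "memory B cell", "lymph node", "celiac", "heavy", "M"),
]

selected = filter_samples(samples, tissue="blood", health_status="celiac")
print([s.name for s in selected])
```

Combining criteria with a logical AND matches the way a set of drop-down filters narrows a selection; other combinations (for example OR across health states) would need a different predicate.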

Visualization and analysis

In addition to PDF export, VDJbase uses the JavaScript graphing library plotly.js to provide online, interactive data visualization designed to help users gain insights into the data. Genotypes stored in the database include a measure of certainty of the genotype call for each gene (Kdiff). Briefly, this measure is the ratio between the models’ likelihoods calculated from the posterior probability distribution. A ‘Genotype’ tab creates an interactive graph which allows users to modify parameters to explore the genotype data according to their interests. For instance, users can focus on specific alleles or screen the results by certainty level (Kdiff). The graph can visualize 1–20 genotypes on a single page, to facilitate comparison between individuals. Comparison of a larger number of genotypes and haplotypes is enabled through a heatmap graph (see Figure 2A and B).
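Screening genotype calls by certainty level can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that Kdiff is expressed as the log10 of the ratio between the two most likely genotype models, and the function names, the example calls and the threshold are all hypothetical.

```python
import math


def kdiff_from_likelihoods(likelihoods):
    """Log10 ratio between the two most likely models (assumed definition)."""
    if len(likelihoods) < 2:
        raise ValueError("need at least two candidate models")
    best, second = sorted(likelihoods, reverse=True)[:2]
    return math.log10(best / second)


def filter_calls(calls, min_kdiff):
    """Keep only the genotype calls whose certainty reaches min_kdiff."""
    return {
        gene: (alleles, kdiff)
        for gene, (alleles, kdiff) in calls.items()
        if kdiff >= min_kdiff
    }


# Hypothetical per-gene calls: gene -> (alleles in the genotype, Kdiff).
calls = {
    "IGHV1-69": (("01", "02"), 12.4),
    "IGHV3-53": (("01",), 0.8),
    "IGHJ6": (("02", "03"), 3.1),
}

confident = filter_calls(calls, min_kdiff=2.0)
print(sorted(confident))
```

A larger value means that the chosen model is better separated from the runner-up, so raising the threshold trades the number of retained calls for confidence in each one.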

Figure 2.

Data visualizations in VDJbase. (A) Comparison between 8 genotypes. Rows and columns represent individuals and genes, respectively. Colors correspond to alleles. (B) Comparison between two haplotypes. Haplotype is inferred using the IGHD2-21 gene as an anchor gene. The upper panel corresponds to the chromosome carrying IGHD2-21*01 and the lower panel to the chromosome carrying IGHD2-21*02. (C) Heterozygosity abundance for each gene in the samples of interest. (D) Gene usage. Each point represents an individual. Colors correspond to V family gene. (E) Allele appearance of IGHJ6. Left Y axis corresponds to the number of individuals, and the right Y axis corresponds to the frequency of the allele in the samples of interest. (F) Allele appearance of IGHJ4 and IGHJ5. Y axis is the same as in (E). Gene order in all graphs corresponds to the genes’ location on the chromosome.

Inference of personal genotypes also allows us to estimate the heterozygosity of genes in the population. We consider genes for which more than one allele is carried by an individual to be heterozygous. The ‘Heterozygous’ graph allows users to assess the level of hemizygosity/heterozygosity/homozygosity for each gene in different populations. To enable users to obtain more specific information, frequency values and raw counts appear as pop-ups when users hover above the corresponding bar (Figure 2C). The ‘Gene usage’ graph provides a view of gene expression in the population (Figure 2D). Each point in the graph represents a single individual, and colors represent the gene families. The order of genes is based on their chromosomal location (17). The ‘Allele distribution’ graph represents the allele distribution within a selected population, for each gene. Users can compare the distributions of alleles between different populations (Figure 2E and F). VDJbase utilizes two R packages for visualizations, ‘vdjbaseVis’ and ‘RAbHIT’ (28). These packages can be downloaded and used for free under the CC BY-SA 4.0 license from https://bitbucket.org/account/user/yaarilab/projects/GPHP.
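The quantities behind the ‘Heterozygous’ and ‘Allele distribution’ graphs can be illustrated with a short Python sketch. It is not taken from vdjbaseVis or RAbHIT: the data layout and function names are hypothetical, the genotypes are made up, and individuals without a call for a gene are simply left out of that gene's denominator (hemizygosity is not modeled).

```python
from collections import Counter

# Hypothetical genotypes: individual -> gene -> alleles carried.
genotypes = {
    "I1": {"IGHJ6": ("02", "03"), "IGHD2-21": ("01",)},
    "I2": {"IGHJ6": ("02",), "IGHD2-21": ("01", "02")},
    "I3": {"IGHJ6": ("02", "03"), "IGHD2-21": ("01", "02")},
    "I4": {"IGHJ6": ("03",)},
}


def heterozygous_fraction(genotypes, gene):
    """Fraction of genotyped individuals carrying more than one allele."""
    calls = [g[gene] for g in genotypes.values() if gene in g]
    if not calls:
        return None
    return sum(len(set(alleles)) > 1 for alleles in calls) / len(calls)


def allele_appearance(genotypes, gene):
    """Map allele -> (number of carriers, fraction of genotyped individuals)."""
    counts = Counter()
    n_individuals = 0
    for g in genotypes.values():
        if gene in g:
            n_individuals += 1
            counts.update(set(g[gene]))
    return {a: (c, c / n_individuals) for a, c in counts.items()}


print(heterozygous_fraction(genotypes, "IGHJ6"))
print(allele_appearance(genotypes, "IGHJ6"))
```

The pair returned for each allele corresponds to the two Y axes described for Figure 2E: a raw count of individuals and the matching frequency within the selected samples.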

Data submission

To enable users to submit their own published AIRR sequences, a straightforward submission form is available upon request via our ‘Discussion Forum’. Submitted forms are validated and the associated data processed by the site administrator. This will allow the repository to grow, and contributors to receive appropriate credit from database users.

DATABASE CONTENTS

To date we have populated the database with >500 samples, which are associated with various diseases (e.g. MS, celiac, HCV, influenza) and tissue types (e.g. brain lesions, lymph nodes, blood) originating from >15 studies. To facilitate standardization of VDJbase data integration, we use a standard pipeline for all genotype and haplotype inferences (see Materials and Methods section for details).

Use case examples

Initial use of this database and annotation system has enabled rigorous testing of human adaptive immune genetic findings. As one example, a number of putative allelic variants reported in older literature appear to be completely absent from Caucasian populations (e.g. IGHJ4*01, IGHJ5*01, IGHJ6*01). Another example is a mosaic deletion pattern that was validated for a much larger cohort (21) (see Figure 2D–F). Approximately one-third of the population is known to be heterozygous for gene IGHJ6 (21,24), and our database, in which 127 of 326 individuals are heterozygous, is consistent with this observation. We have additionally identified numerous IGH genes to be heterozygous in individuals with a defined genotype: nine IGHV genes observed to be heterozygous in over 50% of individuals (IGHV2-70D, IGHV1-69, IGHV3-53, IGHV3-48, IGHV3-49, IGHV4-30-2, IGHV3-30-3, IGHV3-30 and IGHV3-11); 12 IGHV genes observed to be heterozygous in over 30% of individuals (IGHV3-73, IGHV3-64, IGHV4-61, IGHV1-58, IGHV3-53, IGHV5-51, IGHV4-39, IGHV1-46, IGHV1-18, IGHV3-13, IGHV2-5 and IGHV4-4); and two IGHD genes observed to be heterozygous in 38–65% of individuals (IGHD2-8 and IGHD2-21) (see Figure 2C).

MATERIALS AND METHODS

Inferred genotypes and haplotypes

We align each of the pre-processed datasets using the most recent version of the IgBLAST (32) aligner and the current IMGT germline reference set. We infer previously unknown alleles using TIgGER’s inferGenotype function (22) with a modification to the position range input, which allows detection of novel alleles involving sequence variation at nucleotide positions beyond 312 (current default). The sequences are then aligned again, using IgBLAST with a germline reference set that is extended to include any novel IGHV alleles inferred by inferGenotype. The output of IgBLAST is converted to the Change-O format (33) that is compliant with the MiAIRR standard (34). To improve the subsequent quality of allele inference in samples containing highly mutated sequences, we infer clones using SCOPEr (35) and choose a single representative, with the lowest number of mutations, for each clone. A genotype is then inferred using TIgGER’s new inferGenotypeBayesian function (26), which can detect novel alleles at greater Hamming distance from sequences in the provided reference set than previous versions of TIgGER, and assigns a probability, Kgenotype, to each allele in the inferred genotype. The sequences are then aligned for a third time with IgBLAST, using a germline reference set that contains only sequences of those alleles included in the personalised genotype created by inferGenotypeBayesian. Lastly, haplotypes are inferred for heterozygous individuals for genes IGHJ6/IGHD2-8/IGHD2-21 using RAbHIT (28). In the current version of TIgGER, up to four distinct alleles are allowed in an individual’s genotype. This reflects the possibility of a gene duplication with both loci being heterozygous, as previously observed in this region (17). Samples with fewer than 2000 sequences, either after the first IgBLAST alignment or after the collapsing of clones, are excluded from further analysis, and their data are not incorporated into VDJbase. For datasets with partial V-region coverage, some modifications (described below) are made to the processing protocol. We consistently update the datasets according to the latest versions of the above-mentioned tools, and use version control for the website for reproducibility. Versions are displayed on the site.
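Two of the steps above, choosing the least-mutated sequence as the representative of each clone and excluding shallow samples, can be sketched in Python. This is an illustration only: the real pipeline relies on IgBLAST, SCOPEr and TIgGER, and the record fields `sequence_id`, `clone_id` and `mutations` as well as the function names are hypothetical. Only the 2000-sequence threshold comes from the text.

```python
MIN_SEQUENCES = 2000  # depth threshold stated in the text


def clone_representatives(records):
    """Keep, for each clone, the record with the fewest mutations."""
    best = {}
    for rec in records:
        current = best.get(rec["clone_id"])
        if current is None or rec["mutations"] < current["mutations"]:
            best[rec["clone_id"]] = rec
    return list(best.values())


def passes_depth_filter(n_aligned, n_representatives, minimum=MIN_SEQUENCES):
    """A sample is kept only if both counts reach the minimum."""
    return n_aligned >= minimum and n_representatives >= minimum


# Made-up records for illustration.
records = [
    {"sequence_id": "a", "clone_id": 1, "mutations": 7},
    {"sequence_id": "b", "clone_id": 1, "mutations": 2},
    {"sequence_id": "c", "clone_id": 2, "mutations": 0},
    {"sequence_id": "d", "clone_id": 2, "mutations": 5},
]

reps = clone_representatives(records)
print(sorted(r["sequence_id"] for r in reps))
print(passes_depth_filter(n_aligned=5400, n_representatives=1800))
```

When two sequences in a clone share the lowest mutation count, this sketch keeps the one seen first; the text does not say how such ties are resolved.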

IGHV annotation for datasets with partial V-region coverage

Aligners are more likely to make ambiguous calls (for example ‘IGHV3-23*01 or IGHV3-23*02’) when aligning sequences with partial V-region coverage, and this can influence downstream analyses. To resolve this situation, in these datasets, we collapse ambiguous allele assignments using the RAbHIT reliability scores. Each allele for which more than 60% of alignments are ambiguous calls is marked as a non-reliable allele (NRA), and later collapsed. Where ambiguous calls remain, such as ‘IGHV3-23*01 or IGHV3-23*02’, this is indicated by allele designations such as 01_02. We screen for reliable alleles prior to inferring novel alleles and genotypes. We also modify the numbering of the starting position of partial novel inferences to correspond to the implied nucleotide numbers of full length sequences.
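The 60% rule and the collapsed designation can be illustrated with the following Python sketch. It is a simplified reading of the description above, not the RAbHIT implementation: it assumes that ambiguous calls are stored as comma-separated allele names, and the function names and example calls are hypothetical.

```python
from collections import defaultdict

NRA_THRESHOLD = 0.60  # fraction of ambiguous alignments, as stated in the text


def non_reliable_alleles(v_calls, threshold=NRA_THRESHOLD):
    """Alleles for which more than `threshold` of alignments are ambiguous."""
    total = defaultdict(int)
    ambiguous = defaultdict(int)
    for call in v_calls:
        alleles = call.split(",")
        for allele in alleles:
            total[allele] += 1
            if len(alleles) > 1:
                ambiguous[allele] += 1
    return {a for a in total if ambiguous[a] / total[a] > threshold}


def collapsed_label(call):
    """Turn 'IGHV3-23*01,IGHV3-23*02' into 'IGHV3-23*01_02'."""
    alleles = sorted(call.split(","))
    genes = {a.split("*")[0] for a in alleles}
    if len(genes) != 1:
        raise ValueError("this sketch only collapses alleles of a single gene")
    suffixes = [a.split("*")[1] for a in alleles]
    return f"{genes.pop()}*{'_'.join(suffixes)}"


# Made-up V calls for illustration.
v_calls = (
    ["IGHV3-23*01"] * 3
    + ["IGHV3-23*01,IGHV3-23*02"] * 7
    + ["IGHV1-69*01"] * 5
)

print(sorted(non_reliable_alleles(v_calls)))
print(collapsed_label("IGHV3-23*01,IGHV3-23*02"))
```

In this example IGHV3-23*01 appears in ten alignments, seven of them ambiguous (70%), so it exceeds the threshold, whereas IGHV1-69*01 is never ambiguous and stays reliable.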

CONCLUSION AND FUTURE DEVELOPMENTS

VDJbase introduces a database for genotypes and haplotypes inferred from AIRR-seq data. Its user-friendly web-based queries open new opportunities to browse and extract valuable biological information from the rapidly accumulating AIRR-seq data. In its current form, the focus has been on human B cell receptors, though our database structure is broadly applicable also to T cell receptors, antibody light chains, and Ig and TR from other species. In addition, inference of previously unknown alleles is currently performed by TIgGER. We intend to include options to infer novel alleles using other available tools such as Partis (25) and IgDiscover (23) once there is community agreement on how to reconcile the discrepancies between the results produced by these tools. We hope that the interface between OGRDB and VDJbase will accelerate this process. VDJbase can shed new light on the variability of germline alleles within and across populations. We anticipate that this database will be extended to effective completeness of human Ig and TR gene variants in all human populations, with enhanced search and analysis features added. Extending the database to include further samples from less sequenced clinical cohorts and ethnic backgrounds should dramatically increase the possibility of discovery of new biomarkers for diseases and treatments.

35 in total

1. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data.

Authors: Namita T Gupta; Jason A Vander Heiden; Mohamed Uduman; Daniel Gadala-Maria; Gur Yaari; Steven H Kleinstein
Journal: Bioinformatics Date: 2015-06-10 Impact factor: 6.937

2. Neutralizing antibodies against West Nile virus identified directly from human B cells by single-cell analysis and next generation sequencing.

Authors: Konstantinos Tsioris; Namita T Gupta; Adebola O Ogunniyi; Ross M Zimnisky; Feng Qian; Yi Yao; Xiaomei Wang; Joel N H Stern; Raj Chari; Adrian W Briggs; Christopher R Clouser; Francois Vigneault; George M Church; Melissa N Garcia; Kristy O Murray; Ruth R Montgomery; Steven H Kleinstein; J Christopher Love
Journal: Integr Biol (Camb) Date: 2015-10-20 Impact factor: 2.192

3. Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation.

Authors: Corey T Watson; Karyn M Steinberg; John Huddleston; Rene L Warren; Maika Malig; Jacqueline Schein; A Jeremy Willsey; Jeffrey B Joy; Jamie K Scott; Tina A Graves; Richard K Wilson; Robert A Holt; Evan E Eichler; Felix Breden
Journal: Am J Hum Genet Date: 2013-03-28 Impact factor: 11.025

4. Differences in Allelic Frequency and CDRH3 Region Limit the Engagement of HIV Env Immunogens by Putative VRC01 Neutralizing Antibody Precursors.

Authors: Christina Yacoob; Marie Pancera; Vladimir Vigdorovich; Brian G Oliver; Jolene A Glenn; Junli Feng; D Noah Sather; Andrew T McGuire; Leonidas Stamatatos
Journal: Cell Rep Date: 2016-11-01 Impact factor: 9.423

5. B-cell-lineage immunogen design in vaccine development with HIV-1 as a case study.

Authors: Barton F Haynes; Garnett Kelsoe; Stephen C Harrison; Thomas B Kepler
Journal: Nat Biotechnol Date: 2012-05-07 Impact factor: 54.908

6. Ageing of the B-cell repertoire.

Authors: Victoria Martin; Yu-Chang Bryan Wu; David Kipling; Deborah Dunn-Walters
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2015-09-05 Impact factor: 6.237

Review 7. The promise and challenge of high-throughput sequencing of the antibody repertoire.

Authors: George Georgiou; Gregory C Ippolito; John Beausang; Christian E Busse; Hedda Wardemann; Stephen R Quake
Journal: Nat Biotechnol Date: 2014-01-19 Impact factor: 54.908

8. A spectral clustering-based method for identifying clones from high-throughput B cell repertoire sequencing data.

Authors: Nima Nouri; Steven H Kleinstein
Journal: Bioinformatics Date: 2018-07-01 Impact factor: 6.937

9. IMGT, the international ImMunoGeneTics information system.

Authors: Marie-Paule Lefranc; Véronique Giudicelli; Chantal Ginestoux; Joumana Jabado-Michaloud; Géraldine Folch; Fatena Bellahcene; Yan Wu; Elodie Gemrot; Xavier Brochet; Jérôme Lane; Laetitia Regnier; François Ehrenmann; Gérard Lefranc; Patrice Duroux
Journal: Nucleic Acids Res Date: 2008-10-31 Impact factor: 16.971

10. Single B-cell deconvolution of peanut-specific antibody responses in allergic patients.

Authors: Ramona A Hoh; Shilpa A Joshi; Yi Liu; Chen Wang; Krishna M Roskin; Ji-Yeun Lee; Tho Pham; Tim J Looney; Katherine J L Jackson; Vaishali P Dixit; Jasmine King; Shu-Chen Lyu; Jennifer Jenks; Robert G Hamilton; Kari C Nadeau; Scott D Boyd
Journal: J Allergy Clin Immunol Date: 2015-07-04 Impact factor: 10.793
