William Lees, Christian E Busse, Martin Corcoran, Mats Ohlin, Cathrine Scheepers, Frederick A Matsen, Gur Yaari, Corey T Watson, Andrew Collins, Adrian J Shepherd.
Abstract
High-throughput sequencing of the adaptive immune receptor repertoire (AIRR-seq) is providing unprecedented insights into the immune response to disease and into the development of immune disorders. The accurate interpretation of AIRR-seq data depends on the existence of comprehensive germline gene reference sets. Current sets are known to be incomplete and unrepresentative of the degree of polymorphism and diversity in human and animal populations. A key issue is the complexity of the genomic regions in which they lie, which, because of the presence of multiple repeats, insertions and deletions, have not proved tractable with short-read whole genome sequencing. Recently, tools and methods for inferring such gene sequences from AIRR-seq datasets have become available, and a community approach has been developed for the expert review and publication of such inferences. Here, we present OGRDB, the Open Germline Receptor Database (https://ogrdb.airr-community.org), a public resource for the submission, review and publication of previously unknown receptor germline sequences together with supporting evidence.
Year: 2020 PMID: 31566225 PMCID: PMC6943078 DOI: 10.1093/nar/gkz822
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Partial screenshot of a genotype panel showing the statistics provided for inferred alleles, and, beneath, the statistics provided for all alleles (see Table 1 for a description of the information contained in these tables).
Figure 2. Illustrative plots produced by OGRDB analysis scripts from supported inference tools. These can be provided as part of a novel allele submission, or used independently by researchers interested in exploring the usage characteristics of an AIRR-seq repertoire. (A) IMGT alignment of all sequences assigned to a novel allele (IMGT gaps shown as ‘.’, base calls with low confidence as ‘-’), showing, in this case, a high-quality underlying dataset. (B) For V-genes, a ‘zoom’ of the previous plot showing the final 8 nucleotides at the 3′ end, illustrating how the recombination process makes the final nucleotide of the IGHV gene difficult to infer from the underlying data (27) (for J-genes, a similar plot is shown of the 5′ end). (C) A histogram, similar to that produced by IgDiscover (21), showing the reads assigned to the allele, distributed by the number of nucleotide differences from the reference (or inferred reference) sequence. (D) For the analysis of novel V-genes, a plot showing the usage of J-alleles within each J-gene, which can be used to identify heterozygosity (28). In this case, the subject is heterozygous in the IGHJ6 gene. For the analysis of novel J-genes, a similar plot of V-gene allele usage is provided. (E) Where heterozygosity may be present, a plot is provided showing the number of reads of each V-gene, split by their usage of the heterozygous allele (in this case alleles *02 and *03 of IGHJ6). The novel allele IGHV_S1 is found exclusively in association with IGHJ6*03, and no other alleles of IGHV_S1 were identified in the genotype: the exclusive association with *03 therefore provides additional support for the novel inference.
Table 1. Information provided in the OGRDB standardized genotype
| Field | Description |
|---|---|
| sequence_id | Identifier of the allele (either IMGT, or the name assigned by the submitter to an inferred gene) |
| sequences | Overall number of sequences assigned to this allele |
| closest_reference | For inferred alleles, the closest reference gene and allele, as inferred by the tool |
| closest_host | For inferred alleles, the closest reference gene and allele that is in the subject's inferred genotype |
| nt_diff | For inferred alleles, the number of nucleotides that differ between this sequence and the closest reference gene and allele |
| nt_diff_host | For inferred alleles, the number of nucleotides that differ between this sequence and the closest reference gene and allele that is in the subject's inferred genotype |
| nt_substitutions | For inferred alleles, comma-separated list of nucleotide substitutions (e.g. G112A) between the sequence and the closest reference gene and allele. IMGT numbering is used for V-genes, and numbering from the start of the coding sequence for D- or J-genes. |
| aa_diff | For inferred alleles, the number of amino acids that differ between this sequence and the closest reference gene and allele |
| aa_substitutions | For inferred alleles, the list of amino acid substitutions (e.g. A96N) between the sequence and the closest reference gene and allele. IMGT numbering is used for V-genes, and numbering from the start of the coding sequence for D- or J-genes. |
| unmutated_sequences | The number of sequences exactly matching this unmutated sequence |
| assigned_unmutated_frequency | The number of sequences exactly matching this allele divided by the number of sequences assigned to this allele, × 100 |
| unmutated_umis | The number of molecules (identified by Unique Molecular Identifiers) exactly matching this unmutated sequence (if UMIs were used) |
| allelic_percentage | The number of sequences exactly matching the sequence of this allele divided by the number of sequences exactly matching any allele of this specific gene, × 100 |
| unmutated_frequency | The number of sequences exactly matching this sequence divided by the number of sequences exactly matching any allele of any gene, × 100 |
| unique_vs | The number of V allele calls (i.e. unique allelic sequences) found associated with this allele |
| unique_ds | The number of D allele calls (i.e. unique allelic sequences) found associated with this allele |
| unique_js | The number of J allele calls (i.e. unique allelic sequences) found associated with this allele |
| unique_cdr3s | The number of unique CDR3s found associated with this allele |
| unique_vs_unmutated | The number of V allele calls (i.e. unique allelic sequences) associated with unmutated sequences of this allele |
| unique_ds_unmutated | The number of D allele calls (i.e. unique allelic sequences) associated with unmutated sequences of this allele |
| unique_js_unmutated | The number of J allele calls (i.e. unique allelic sequences) associated with unmutated sequences of this allele |
| unique_cdr3s_unmutated | The number of unique CDR3s associated with unmutated sequences of this allele |
| haplotyping_gene | The gene or genes from which haplotyping was inferred, where haplotyping is possible (e.g. IGHJ6) |
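
As an illustration of the `nt_diff` and `nt_substitutions` fields described in the table, the following is a minimal sketch in Python and not the OGRDB implementation. It assumes the two sequences are already aligned to the same length, that `.` marks an alignment gap, and that positions are simply counted from 1 along the alignment. The function name `nt_substitutions` and the example sequences are hypothetical.

```python
def nt_substitutions(reference: str, inferred: str) -> list[str]:
    """List substitutions (e.g. 'G6A') between two pre-aligned sequences.

    Assumption: both sequences have the same length and use '.' for gaps,
    so the 1-based alignment column serves as the reported position.
    """
    if len(reference) != len(inferred):
        raise ValueError("sequences must be aligned to the same length")
    subs = []
    pairs = zip(reference.upper(), inferred.upper())
    for pos, (ref_nt, inf_nt) in enumerate(pairs, start=1):
        # Gap columns still advance the position but are never reported.
        if ref_nt == "." or inf_nt == ".":
            continue
        if ref_nt != inf_nt:
            subs.append(f"{ref_nt}{pos}{inf_nt}")
    return subs


# Hypothetical gapped sequences, for illustration only.
reference = "CAGGTG...CAGCTG"
inferred = "CAGGTA...CAGTTG"

subs = nt_substitutions(reference, inferred)
nt_diff = len(subs)
print(",".join(subs), nt_diff)  # G6A,C13T 2
```

Counting gap columns as positions mirrors the way a gapped alignment keeps numbering stable across sequences; for D- or J-genes, where numbering runs from the start of the coding sequence, ungapped input gives the same behavior.
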
Provision of statistics for each allele in the personalized genotype (both reference alleles and novel alleles) allows the novel inferences to be considered in the context of overall gene usage (usage frequency, exact unmutated matches, association with distinct CDR3s and so on), and also provides useful aggregate information on overall gene usage.
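
The three percentage fields in the table (`assigned_unmutated_frequency`, `allelic_percentage` and `unmutated_frequency`) can be sketched as follows. This is an illustrative calculation based only on the field descriptions, not the OGRDB scripts. The counts are invented, the function name `genotype_stats` is hypothetical, and the gene name is assumed to be the part of the allele name before `*`.

```python
from collections import defaultdict

# Hypothetical counts: allele -> (sequences assigned, exact unmutated matches)
counts = {
    "IGHV1-69*01": (400, 300),
    "IGHV1-69*02": (200, 100),
    "IGHV3-23*01": (800, 600),
}


def genotype_stats(counts):
    """Compute the percentage fields per allele (assumes non-zero counts)."""
    # Total exact matches per gene, and across all genes.
    unmutated_by_gene = defaultdict(int)
    for allele, (_, unmutated) in counts.items():
        unmutated_by_gene[allele.split("*")[0]] += unmutated
    total_unmutated = sum(unmutated_by_gene.values())

    stats = {}
    for allele, (sequences, unmutated) in counts.items():
        gene = allele.split("*")[0]
        stats[allele] = {
            # Exact matches / sequences assigned to this allele
            "assigned_unmutated_frequency": 100 * unmutated / sequences,
            # Exact matches / exact matches to any allele of this gene
            "allelic_percentage": 100 * unmutated / unmutated_by_gene[gene],
            # Exact matches / exact matches to any allele of any gene
            "unmutated_frequency": 100 * unmutated / total_unmutated,
        }
    return stats


stats = genotype_stats(counts)
for allele, row in stats.items():
    print(allele, row)
```

With these invented counts, IGHV1-69*01 accounts for 75% of the exact matches to IGHV1-69 and 30% of exact matches overall, which shows how the same unmutated count is normalized against three different denominators.
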