Yixun Huang, Linnea Thörnqvist, Mats Ohlin.
Abstract
Upstream and downstream sequences of immunoglobulin genes may affect the expression of such genes. However, these sequences are rarely studied or characterized in most studies of immunoglobulin repertoires. Inference from large, rearranged immunoglobulin transcriptome data sets offers an opportunity to define the upstream regions (5'-untranslated regions and leader sequences). We have now established a new data pre-processing procedure to eliminate artifacts caused by a 5'-RACE library generation process, reanalyzed a previously studied data set defining human immunoglobulin heavy chain genes, and identified novel upstream regions, as well as previously identified upstream regions that may have been identified in error. Upstream sequences were also identified for a set of previously uncharacterized germline gene alleles. Several novel upstream region variants were validated, for instance by their segregation to a single haplotype in heterozygotic subjects. SNPs representing several sequence variants were identified from population data. Finally, based on the outcomes of the analysis, we define a set of testable hypotheses with respect to the placement of particular alleles in complex IGHV locus haplotypes, and discuss the evolutionary relatedness of particular heavy chain variable genes based on sequences of their upstream regions.
Keywords: 5’-untranslated region; adaptive immune receptor repertoire (AIRR); germline gene inference; immunoglobulin germline gene; immunoglobulin heavy chain variable domain; leader sequence
Year: 2021; PMID: 34671351; PMCID: PMC8521166; DOI: 10.3389/fimmu.2021.730105
Source DB: PubMed; Journal: Front Immunol; ISSN: 1664-3224; Impact factor: 7.561
Figure 1. Schematic illustration of the pre-processing of Illumina MiSeq paired-end reads and of the pipeline for inference and validation of 5’UTR-leader sequences.
Figure 2. Overarching 5’UTR-leader sequence germline data set inferred in the present study. In addition, upstream regions of IGHV1-3*02 and IGHV4-4*01 have been identified in a separate study (23).
Figure 3. Distribution patterns of CDR3 length encoded by transcripts associated with 5’UTR-leader sequences of (A) IGHV4-4*02 and (B) IGHV4-4*07. For each 5’UTR-leader sequence of a specific allele, the number of filtered reads at each CDR3 length was counted to create the plots. Each line in the plots represents the 5’UTR-leader sequence from one subject (at most 8 subjects were included in each plot). Distribution patterns of CDR3 length for 5’UTR-leader sequences of other alleles are displayed in .
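The counting step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `cdr3_length_distribution`, the record layout (upstream variant label paired with a CDR3 string), and the toy reads are all assumptions made for the example, as is measuring length in characters of the CDR3 string.

```python
from collections import Counter, defaultdict


def cdr3_length_distribution(reads):
    """Count reads per CDR3 length for each upstream sequence variant.

    `reads` is an iterable of (upstream_variant, cdr3_sequence) pairs.
    Both the layout and the variant labels are hypothetical.
    """
    distribution = defaultdict(Counter)
    for upstream_variant, cdr3_sequence in reads:
        # One read contributes one count at the length of its CDR3.
        distribution[upstream_variant][len(cdr3_sequence)] += 1
    return distribution


# Toy input (invented sequences, for illustration only).
example_reads = [
    ("IGHV4-4*02-C", "ARDYGMDV"),
    ("IGHV4-4*02-C", "ARGGYFDY"),
    ("IGHV4-4*02-C", "ARDLYYYGMDV"),
    ("IGHV4-4*07-A", "ARDYGMDV"),
]

distribution = cdr3_length_distribution(example_reads)
for variant, length_counts in sorted(distribution.items()):
    print(variant, sorted(length_counts.items()))
```

Each per-variant counter corresponds to one line in a plot of read count against CDR3 length.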
Haplotyping to support the validity of diverse 5’UTR-leader sequences of alleles IGHV4-4*02 and IGHV4-4*07.
| Data set | IGHV gene and upstream sequence | Reads with IGHJ6*02 | Reads with IGHJ6*03 |
|---|---|---|---|
| ERR2567266 | IGHV4-4*02-C | 58 | 0 |
| ERR2567266 | IGHV4-4*07-A | 1 | 107 |
| ERR2567189 | IGHV4-4*02-F | 0 | 24 |
| ERR2567189 | IGHV4-4*07-E | 37 | 0 |
| ERR2567200 | IGHV4-4*02-C | 0 | 46 |
| ERR2567200 | IGHV4-4*07-B | 48 | 0 |
| ERR2567230 | IGHV4-4*02-A | 0 | 38 |
| ERR2567230 | IGHV4-4*07-D | 72 | 0 |
| ERR2567192 | IGHV4-4*02-C | 16 | 0 |
| ERR2567192 | IGHV4-4*02-A | 0 | 17 |
| ERR2567204 | IGHV4-4*02-C | 74 | 0 |
| ERR2567204 | IGHV4-4*02-D | 0 | 75 |
| ERR2567246 | IGHV4-4*02-F | 0 | 65 |
| ERR2567246 | IGHV4-4*02-C | 36 | 0 |
| ERR2567254 | IGHV4-4*02-C | 55 | 0 |
| ERR2567254 | IGHV4-4*02-F | 0 | 42 |
| ERR2567261 | IGHV4-4*02-C | 99 | 0 |
| ERR2567261 | IGHV4-4*02-F | 0 | 78 |
| ERR2567271 | IGHV4-4*02-D | 51 | 0 |
| ERR2567271 | IGHV4-4*02-F | 0 | 5 |
| ERR2567274 | IGHV4-4*02-E | 24 | 0 |
| ERR2567274 | IGHV4-4*02-C | 0 | 21 |
| ERR2567187 | IGHV4-4*02-C | 65 | 0 |
| ERR2567187 | IGHV4-4*01 | – | – |
| ERR2567201 | IGHV4-4*07-E | 27 | 0 |
| ERR2567201 | IGHV4-4*07-D | 0 | 35 |
| ERR2567263 | IGHV4-4*07-F | 0 | 94 |
| ERR2567263 | IGHV4-4*07-C | 84 | 1 |
The sequence counts of 5’UTR-leader sequences of alleles of IGHV4-4 associated with different alleles of IGHJ6 in rearranged sequences. Haplotyping data for other 5’UTR-leader sequences are available in .
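The segregation logic behind the table can be illustrated with a short sketch: in a subject heterozygous for IGHJ6, each upstream variant is assigned to the IGHJ6 allele that carries nearly all of its reads, and two variants are taken to segregate if they are assigned to different IGHJ6 alleles. The function names and the thresholds `min_reads` and `max_minor_fraction` are assumptions chosen for illustration; the source does not state the cut-offs used in the study. The read counts are taken from the ERR2567266 rows of the table.

```python
def assign_haplotype(counts, min_reads=10, max_minor_fraction=0.05):
    """Return the anchor allele carrying nearly all reads, or None.

    `counts` maps an anchor allele (here an IGHJ6 allele) to a read count.
    Both thresholds are illustrative assumptions, not values from the study.
    """
    total = sum(counts.values())
    if total == 0 or total < min_reads:
        return None  # too few reads to make a call
    major_allele, major_reads = max(counts.items(), key=lambda item: item[1])
    minor_fraction = (total - major_reads) / total
    return major_allele if minor_fraction <= max_minor_fraction else None


def segregates(variant_counts):
    """Check that every variant is called and that all calls are distinct."""
    calls = {
        variant: assign_haplotype(counts)
        for variant, counts in variant_counts.items()
    }
    assigned = list(calls.values())
    distinct = None not in assigned and len(set(assigned)) == len(assigned)
    return calls, distinct


# Read counts for data set ERR2567266, taken from the table above.
subject = {
    "IGHV4-4*02-C": {"IGHJ6*02": 58, "IGHJ6*03": 0},
    "IGHV4-4*07-A": {"IGHJ6*02": 1, "IGHJ6*03": 107},
}

calls, distinct = segregates(subject)
print(calls, distinct)
```

Allowing a small minor fraction tolerates stray reads such as the single IGHJ6*02 read observed for IGHV4-4*07-A, which could plausibly arise from sequencing or PCR artifacts. A stricter or looser tolerance would be an equally defensible choice.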