Rashedul Islam1,2,3, Misha Bilenky3, Andrew P Weng4,5, Joseph M Connors6, Martin Hirst1,2,3.
Abstract
MOTIVATION: B cells display remarkable diversity in producing B-cell receptors through recombination of immunoglobulin (Ig) V-D-J genes. Somatic hypermutation (SHM) of immunoglobulin heavy chain variable (IGHV) genes is used as a prognostic marker in B-cell malignancies. Clinically, IGHV mutation status is determined by targeted Sanger sequencing, which is a resource-intensive and low-throughput procedure. Here, we describe a bioinformatic pipeline, CRIS (Complete Reconstruction of Immunoglobulin IGHV-D-J Sequences), that uses RNA sequencing (RNA-seq) datasets to reconstruct IGHV-D-J sequences and determine IGHV SHM status.
Year: 2021 PMID: 34806017 PMCID: PMC8600631 DOI: 10.1093/bioadv/vbab021
Source DB: PubMed Journal: Bioinform Adv ISSN: 2635-0041
Fig. 1. IGHV-D-J recombination and SHM during B-cell development. B-cell receptors (BCRs) are generated by ordered assembly of the Ig heavy chain gene segments (V, D and J) during B-cell development. Addition and deletion of junctional nucleotides (N) contribute to the diversity of BCR repertoires. BCR sequences undergo affinity maturation upon antigen stimulation through SHMs in the variable domain (indicated by black arrows). SHMs of Ig are enriched at the complementarity-determining regions (CDRs).
Genomic coordinates of the putative Ig loci in the GRCh38 reference
| Chromosome/contig | Start | End | Length (bp) |
|---|---|---|---|
| Chr14 | 105 550 001 | 106 880 000 | 1 329 999 |
| Chr15 | 21 710 000 | 22 190 000 | 480 000 |
| Chr16 | 31 950 001 | 33 970 000 | 2 019 999 |
| chr14_KI270726v1_random | 1 | 43 739 | 43 739 |
| chr16_KI270728v1_random | 1 | 1 872 759 | 1 872 759 |
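The coordinates in the table above can be turned into region strings for read extraction. The following is a minimal sketch, not the CRIS implementation: it assumes the listed start and end positions are 1-based and inclusive, that the primary chromosomes use UCSC-style lowercase names (`chr14`, `chr15`, `chr16`), and that the downstream tool accepts `contig:start-end` regions, as `samtools view` does. The names `IG_LOCI` and `to_region` are hypothetical.

```python
# Putative Ig loci copied from the table above (GRCh38).
# Assumption: start/end are 1-based inclusive, as in "contig:start-end" region strings.
IG_LOCI = [
    ("chr14", 105_550_001, 106_880_000),
    ("chr15", 21_710_000, 22_190_000),
    ("chr16", 31_950_001, 33_970_000),
    ("chr14_KI270726v1_random", 1, 43_739),
    ("chr16_KI270728v1_random", 1, 1_872_759),
]


def to_region(contig: str, start: int, end: int) -> str:
    """Format one locus as a 'contig:start-end' region string."""
    if start < 1 or end < start:
        raise ValueError(f"invalid interval for {contig}: {start}-{end}")
    return f"{contig}:{start}-{end}"


REGIONS = [to_region(*locus) for locus in IG_LOCI]

if __name__ == "__main__":
    # The space-separated list could be appended to a read-extraction command.
    print(" ".join(REGIONS))
```

The resulting strings could, for example, be passed as positional region arguments to `samtools view` on an indexed BAM file aligned to GRCh38.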
Fig. 2. CRIS workflow. CRIS extracts reads from the putative Ig loci, assembles Ig transcripts from them and quantifies transcript abundances. The percentage of IGHV mutations in each Ig transcript is calculated by comparison with the germline sequences.
Fig. 3. Evaluation of CRIS to reconstruct IGHV-D-J sequences. (a) The most abundant Ig transcript from sample US-1422278 was aligned to the germline database using IgBLAST; the top-hit germline genes are shown. In the alignment, mismatches are represented as nucleotide bases and matches as dots. The alignment length, number of matches and number of mismatches are 296, 280 and 16, respectively. The number of matched nucleotides between the query and the germline IGHV sequence is used to calculate percent identity, e.g. 100*(280/296) = 94.6%, which corresponds to 5.4% mutation. N-junctional sequences are highlighted in gray boxes. (b) Fraction of the IGHV gene assembled in two CLL RNA-seq datasets with different sequence depths and lengths as indicated. An unpaired two-tailed t-test showed no significant difference (P = 0.15) between the two distributions (NS). (c) Scatter plot comparing the percent mutation of IGHV as predicted by CRIS and by clinical PCR-Sanger-based analysis for 16 CLL patient samples obtained from GSE66228.
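The percent identity calculation in panel (a) can be written out as a short sketch. The function names `percent_identity` and `percent_mutation` are hypothetical; the only inputs are the alignment length and match count quoted in the caption, and percent mutation is taken here as 100 minus percent identity.

```python
def percent_identity(matches: int, alignment_length: int) -> float:
    """Percent of aligned positions that match the germline IGHV sequence."""
    if alignment_length <= 0:
        raise ValueError("alignment_length must be positive")
    if not 0 <= matches <= alignment_length:
        raise ValueError("matches must be between 0 and alignment_length")
    return 100.0 * matches / alignment_length


def percent_mutation(matches: int, alignment_length: int) -> float:
    """Percent of aligned positions that differ from the germline sequence."""
    return 100.0 - percent_identity(matches, alignment_length)


if __name__ == "__main__":
    # Values from the caption: alignment length 296, 280 matches, 16 mismatches.
    print(round(percent_identity(280, 296), 1))
    print(round(percent_mutation(280, 296), 1))
```

For sample US-1422278 this gives a mutation percentage that agrees with the 5.4% reported for CRIS in the concordance table.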
Concordance of IGHV gene prediction and percent mutation between PCR-Sanger-based analysis and CRIS
| Sample ID | Sanger IGHV | Sanger mutation (%) | CRIS IGHV | CRIS mutation (%) | CRIS IGHD | CRIS IGHJ | No. of Ig transcripts | No. of clonotypes |
|---|---|---|---|---|---|---|---|---|
| US-1422282 | V1-69 | 0.4 | IGHV1-69*04 | 0.3 | IGHD6-19*01 | IGHJ4*02 | 7 | 4 |
| US-1422366 | V1-18 | 0.34 | IGHV1-18*04 | 0 | IGHD3-3*01 | IGHJ6*02 | 21 | 5 |
| US-1422311 | V3-11 | 2 | IGHV3-11*01 | 2 | IGHD4-17*01 | IGHJ4*02 | 5 | 4 |
| US-1422278 | V3-74 | 5.4 | IGHV3-74*01 | 5.4 | IGHD5-18*01 | IGHJ6*02 | 5 | 3 |
| US-1422335 | V4-59 | 10.2 | IGHV4-59*02 | 8.5 | IGHD3-10*01 | IGHJ4*02 | 3 | 2 |
| US-1422321 | V3-66 | 0.7 | IGHV3-66*02 | 0.7 | NA | IGHJ4*02 | 9 | 4 |
| US-1422333 | V4-34 | 0 | IGHV4-34*01 | 0 | IGHD3-3*01 | IGHJ6*02 | 6 | 3 |
| US-1422356 | V2-70 | 0.8 | IGHV2-70*01 | 0.3 | IGHD3-16*01 | IGHJ3*02 | 15 | 8 |
| US-1422368 | V3-74 | 6.1 | IGHV3-74*03 | 8.8 | IGHD1-1*01 | IGHJ5*02 | 2 | 2 |
| US-1422309 | V3-53 | 8.8 | IGHV3-53*01 | 6.1 | IGHD3-10*01 | IGHJ6*03 | 4 | 3 |
| US-1422302 | V2-70 | 0.3 | IGHV2-70*01 | 0.3 | IGHD2-15*01 | IGHJ4*02 | 20 | 4 |
| US-1422351 | V1-46 | 0 | IGHV1-46*01 | 0 | IGHD3-10*01 | IGHJ4*02 | 6 | 3 |
| US-1422314 | V1-3 | 0.7 | IGHV1-3*01 | 0 | IGHD6-19*01 | IGHJ4*02 | 5 | 3 |
| US-1422342 | V3-21 | 0 | IGHV3-21*01 | 0 | IGHD3-16*01 | IGHJ4*02 | 4 | 2 |
| US-1422350 | V3-48 | 2.8 | IGHV3-48*03 | 2.4 | IGHD3-22*01 | IGHJ4*02 | 3 | 2 |
| US-1422352 | V1-46 | 0 | IGHV1-46*01 | 0 | IGHD3-22*01 | IGHJ6*02 | 17 | 4 |
Notes: CRIS reconstructed V-D-J segments of Ig transcripts and identified multiple transcripts per sample that belong to different clonotypes. NA is used in cases where IGHD genes were absent.
Fig. 4. Comparison of CRIS with clinical data and existing tools. (a and b) Confusion matrices showing the classification accuracy of CRIS relative to Sanger-PCR data in two independent CLL cohorts. The P-value was calculated by a one-sided binomial test. (c) Comparison of the proportion of the IGHV sequence reconstructed by CRIS, V’DJer and TRUST in the GSE66228 (Blachly ) dataset. The average fraction of IGHV gene length for each tool is represented by dashed horizontal lines.
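As an illustration of the kind of comparison summarized in panels (a) and (b), the sketch below builds a confusion matrix from the 16 samples in the concordance table and computes a one-sided binomial P-value. Several choices here are assumptions rather than details stated in this text: a sample is called mutated when its mutation percentage exceeds 2% (the conventional 98% identity cutoff in CLL), the null hypothesis is agreement by chance with probability 0.5, and all names (`PAIRS`, `is_mutated`, `binomial_tail`) are hypothetical. It is not a reproduction of the published analysis.

```python
from collections import Counter
from math import comb

# (Sanger mutation %, CRIS mutation %) for the 16 samples in the concordance table.
PAIRS = [
    (0.4, 0.3), (0.34, 0.0), (2.0, 2.0), (5.4, 5.4),
    (10.2, 8.5), (0.7, 0.7), (0.0, 0.0), (0.8, 0.3),
    (6.1, 8.8), (8.8, 6.1), (0.3, 0.3), (0.0, 0.0),
    (0.7, 0.0), (0.0, 0.0), (2.8, 2.4), (0.0, 0.0),
]

CUTOFF = 2.0  # assumption: more than 2% mutation (under 98% identity) means "mutated"


def is_mutated(mutation_pct: float, cutoff: float = CUTOFF) -> bool:
    return mutation_pct > cutoff


def binomial_tail(successes: int, n: int, p: float = 0.5) -> float:
    """One-sided P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(successes, n + 1))


# Keys are (Sanger call, CRIS call); True means mutated.
confusion = Counter((is_mutated(s), is_mutated(c)) for s, c in PAIRS)
concordant = confusion[(True, True)] + confusion[(False, False)]
accuracy = concordant / len(PAIRS)
p_value = binomial_tail(concordant, len(PAIRS))

if __name__ == "__main__":
    print(dict(confusion))
    print(f"accuracy={accuracy:.2f}, one-sided binomial P={p_value:.3g}")
```

Under these assumptions every sample in the table receives the same call from both methods; a different cutoff or null probability would change the P-value but not the structure of the calculation.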