Andrew M Collins1, Ayelet Peres2,3, Martin M Corcoran4, Corey T Watson5, Gur Yaari2,3, William D Lees6, Mats Ohlin7.
Abstract
Year: 2021 PMID: 34667305 PMCID: PMC8674141 DOI: 10.1038/s41435-021-00152-6
Source DB: PubMed Journal: Genes Immun ISSN: 1466-4879 Impact factor: 2.676
Fig. 1 Problems caused by short-read mis-assignments.
Upper panel: Alignment of known and novel candidate germline sequences to the GRCh37 reference assembly using the BLAT genome browser. Alleles IGHV3-11*05 and IGHV3-11*06 are indicated with the vertical red bar; IGHV3-48*01, IGHV3-48*02, IGHV3-48*03 and IGHV3-48*04 with the green bar; and pmIG candidate alleles IGHV3-48_7, IGHV3-48_8, IGHV3-48_9 and IGHV3-48_10 with the blue bar. The position of the IGHV3-11-specific cluster of SNP variants is shown with the blue bracket. Lower panel: Assignment of low-coverage sequences from 1000 Genomes case HG00105, taken from the BAM file of GRCh37-assigned sequences (HG00105.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.bam) and visualized using the Broad Institute IGV viewer. Three short-read sequences, ERR229777.81393649, ERR229777.109101109 and ERR229777.100808222, are incorrectly assigned to the IGHV3-48 locus, as shown by the presence of an IGHV3-11*05/06-specific segment containing four SNP variations: rs199879022 T (A in IGHV3-48), rs200437959 G (A in IGHV3-48), rs200973953 T (G in IGHV3-48) and rs199815306 A (T in IGHV3-48). Four candidate IGHV3-48 germline sequences, IGHV3-48_7, IGHV3-48_8, IGHV3-48_9 and IGHV3-48_10, appear to be chimeric in origin, erroneously containing the IGHV3-11*05/06 segment.
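The diagnostic logic in this caption can be illustrated with a minimal sketch. It assumes the bases a read shows at the four SNPs have already been extracted into a dictionary keyed by rsID; that input format and the function name `classify_read` are hypothetical and not part of the original analysis. Only the rsIDs and alleles come from the caption.

```python
# Diagnostic alleles from the caption: rsID -> (IGHV3-11*05/06 base, IGHV3-48 base).
DIAGNOSTIC_SNPS = {
    "rs199879022": ("T", "A"),
    "rs200437959": ("G", "A"),
    "rs200973953": ("T", "G"),
    "rs199815306": ("A", "T"),
}


def classify_read(observed):
    """Classify a read from the bases it shows at the diagnostic SNPs.

    `observed` is a hypothetical mapping of rsID -> base for the SNPs the
    read covers; how those bases are pulled from a BAM file is out of scope.
    """
    n_3_11 = 0
    n_3_48 = 0
    for rsid, base in observed.items():
        if rsid not in DIAGNOSTIC_SNPS:
            continue  # ignore positions that are not diagnostic
        allele_3_11, allele_3_48 = DIAGNOSTIC_SNPS[rsid]
        if base == allele_3_11:
            n_3_11 += 1
        elif base == allele_3_48:
            n_3_48 += 1
    if n_3_11 and not n_3_48:
        return "IGHV3-11-like"
    if n_3_48 and not n_3_11:
        return "IGHV3-48-like"
    if n_3_11 and n_3_48:
        return "mixed"
    return "uninformative"
```

A read mapped to IGHV3-48 but classified as `IGHV3-11-like` would be a candidate mis-assignment of the kind shown in the lower panel.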
Fig. 2 pmIG reference database introduces erroneous mutations.
Repertoire analysis was performed on a naïve B-cell cohort (not expected to carry somatic mutations) of 98 individuals reported by Gidoni et al. (SRA: PRJEB26509) [6]. The repertoires were sequenced using the 5′ RACE protocol, and pre-processing was done as described in Gidoni et al. [6]. For the downstream analysis, repertoires were initially aligned (IgBLAST version 1.16.0) with the IMGT reference (March 29, 2021). Non-functional sequences were removed, as were functional sequences that were not full-length, were not assigned unambiguously to a single V-gene, or were assigned to a V-gene not present in the pmIG database. The remaining 70% of functional sequences were then aligned with the pmIG reference database (downloaded from https://pmtrig.lumc.nl/ on July 15th, 2021) and the repertoires were compared. (A) For each repertoire, the mean mutation count was calculated using each reference database. Each dot represents the mean mutation count of one repertoire, and each boxplot represents the variation within the cohort for one of the reference databases. (B.1) Each dot is the median of the mean individual mutation frequency per gene. The X axis is based on the IMGT reference and the Y axis is based on the pmIG reference. Red labels represent the genes with a duplicated copy in the chromosome (e.g., IGHV1-69/IGHV1-69D). (B.2) Each dot represents an individual with IGHV4-4*02 in their IMGT data. These calls were matched via sequence IDs to calls in the matching pmIG datasets, and different gene annotations are shown with different colors. Where multiple allele calls were made, these calls are separated in the legend by vertical bars. The X axis shows the annotations from the two datasets. The Y axis shows the mean mutation numbers for the sequences assigned to the IGHV4-4*02 allele and to their matching calls in the pmIG dataset. (C) The count of alleles that are represented in the cohort for each of the reference databases.
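The filtering and the per-repertoire mean in panel A can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the record fields (`repertoire`, `functional`, `full_length`, `v_call`, `mut_imgt`, `mut_pmig`), the comma-separated format for ambiguous calls, and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def gene_of(allele_call):
    """Strip the allele suffix, e.g. "IGHV4-4*02" -> "IGHV4-4"."""
    return allele_call.split("*")[0]


def keep_sequence(rec, pmig_genes):
    """Apply the caption's filter to one annotated sequence.

    Assumption: `v_call` lists ambiguous calls separated by commas, and a
    sequence is unambiguous when all of its calls belong to one gene.
    """
    genes = {gene_of(call) for call in rec["v_call"].split(",")}
    return (
        rec["functional"]
        and rec["full_length"]
        and len(genes) == 1
        and genes <= pmig_genes  # the gene must exist in the pmIG database
    )


def mean_mutations(records, pmig_genes):
    """Mean mutation count per repertoire under each reference database."""
    counts = defaultdict(lambda: {"imgt": [], "pmig": []})
    for rec in records:
        if keep_sequence(rec, pmig_genes):
            counts[rec["repertoire"]]["imgt"].append(rec["mut_imgt"])
            counts[rec["repertoire"]]["pmig"].append(rec["mut_pmig"])
    return {
        rep: {ref: mean(vals) for ref, vals in by_ref.items()}
        for rep, by_ref in counts.items()
    }
```

In a naïve repertoire both means should be close to zero, so a higher mean under one reference suggests mutations introduced by that reference rather than by the cells.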