Michael Ford, Ehsan Haghshenas, Corey T Watson, S Cenk Sahinalp.
Abstract
One of the remaining challenges to describing an individual's genetic variation lies in the highly heterogeneous and complex genomic regions that impede the use of classical reference-guided mapping and assembly approaches. One such region is the immunoglobulin heavy chain locus (IGH), which is critical for the development of antibodies and the adaptive immune system. We describe ImmunoTyper, the first PacBio-based genotyping and copy number calling tool specifically designed for IGH V genes (IGHV). We demonstrate that ImmunoTyper's multi-stage clustering and combinatorial optimization approach represents the most comprehensive IGHV genotyping approach published to date, through validation using gold-standard IGH reference sequence. This preliminary work establishes the feasibility of fine-grained genotype and copy number analysis using error-prone long reads in complex multi-gene loci and opens the door for in-depth investigation into IGHV heterogeneity using accessible and increasingly common whole-genome sequence. Published by Elsevier Inc.
Keywords: Bioinformatics; Biological Sciences; Computational Bioinformatics; Genomic Analysis
Year: 2020 PMID: 32109676 PMCID: PMC7044747 DOI: 10.1016/j.isci.2020.100883
Source DB: PubMed Journal: iScience ISSN: 2589-0042
Figure 1. Histogram of the Edit Distance between Each Allele from the IGHV (Pseudo)Gene Database and Its Most Similar Allele (with Respect to Edit Distance)
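The quantity plotted in Figure 1 can be illustrated with a minimal sketch: for each allele, compute the edit (Levenshtein) distance to every other allele and keep the smallest value, then tally those minima into histogram bins. The function names and the toy sequences below are hypothetical and chosen only for illustration; they are not taken from the IGHV database or from ImmunoTyper itself.

```python
from collections import Counter
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def nearest_allele_distances(alleles: Dict[str, str]) -> Dict[str, int]:
    """For each allele, the edit distance to its most similar other allele.

    Requires at least two alleles.
    """
    return {
        name: min(edit_distance(seq, other_seq)
                  for other_name, other_seq in alleles.items()
                  if other_name != name)
        for name, seq in alleles.items()
    }


# Hypothetical toy sequences, not real IGHV alleles.
toy_alleles = {
    "allele_a": "ACGTACGT",
    "allele_b": "ACGTACGA",
    "allele_c": "ACGTTCGA",
    "allele_d": "TTTTACGT",
}

nearest = nearest_allele_distances(toy_alleles)
histogram = Counter(nearest.values())  # distance -> number of alleles
for distance in sorted(histogram):
    print(distance, histogram[distance])
```

The all-pairs comparison is quadratic in the number of alleles, which is acceptable for a sketch but may warrant a faster alignment library on a full allele database.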
Figure 2. Read Depth of IGH Region for CHM1 WGS PacBio Reads Mapped to CHM1 Reference Using minimap2 with Default Parameters, Demonstrating Significant Deviation from the Expected Coverage, Including at Positions Containing IGHV Genes, Which Are Marked by Vertical Green Lines
Genotype Results for Simulated and CHM1 Real Data Samples
| Sample | # IGHV Occurrences in Reference | # IGHV Calls | Precision | Recall | True Positive | False Positive | False Negative |
|---|---|---|---|---|---|---|---|
| CHM1 (simulated) | 117 | 111 | 94.6% | 89.7% | 105 | 6 | 12 |
| GRCh37 (simulated) | 112 | 109 | 97.2% | 94.6% | 106 | 3 | 6 |
| CHM1 + GRCh37 (simulated) | 229 | 227 | 94.3% | 93.4% | 214 | 13 | 15 |
| CHM1 WGS | 117 | 110 | 87.3% | 82.1% | 96 | 14 | 21 |
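The precision and recall columns above follow from the true-positive, false-positive, and false-negative counts, assuming the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The short sketch below recomputes them from the table's counts; the function and variable names are illustrative only.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Return (precision, recall) from confusion counts."""
    precision = tp / (tp + fp)  # fraction of calls that are correct
    recall = tp / (tp + fn)     # fraction of reference occurrences recovered
    return precision, recall


# (TP, FP, FN) counts copied from the genotype results table.
samples = {
    "CHM1 (simulated)": (105, 6, 12),
    "GRCh37 (simulated)": (106, 3, 6),
    "CHM1 + GRCh37 (simulated)": (214, 13, 15),
    "CHM1 WGS": (96, 14, 21),
}

for name, (tp, fp, fn) in samples.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{name}: precision={p:.1%}, recall={r:.1%}")
```

Note that TP + FP equals the number of IGHV calls and TP + FN equals the number of IGHV occurrences in the reference for every row, which is a quick consistency check on the table.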
Allele Sequence Error Reduction Results
| Sample | Expected Read Error | Median Mapping Error |
|---|---|---|
| CHM1 (simulated) | 15.8% | 2.0% |
| GRCh37 (simulated) | 15.8% | 2.0% |
| CHM1 + GRCh37 (simulated) | 15.8% | 2.2% |
| CHM1 WGS | 16.19% | 2.3% |
Taken from Laehnemann et al., 2015.
Sequence Differences between CHM1 and GRCh37 References
| Insertion Identifier | Reference | Genes and Pseudogenes Present and Their Alleles |
|---|---|---|
| A | CHM1 | 1-69*06, 1-69-2*01, 2-70D*04, 3-69-1*01 |
| B | GRCh37 | 4-31*02, (II)-31-1*01 |
| C | CHM1 | (II)-30-21*01, 4-30-2*01 |
| D | CHM1 | 3-64D*06, 5-10-1*03 |
| E | GRCh37 | 3-9*01, 2-10*01, 1-8*01 |
| F | CHM1 | 7-4-1*01 |
IGHV Identification in Insertion Sequences Between GRCh37 and CHM1 in Diploid Sample
| Insertion | Reference | Number of Genes and Pseudogenes | Number of Matching Genes in Result | Number of Correct Allele Calls | Missing Allele Calls |
|---|---|---|---|---|---|
| A | CHM1 | 4 | 3 | 3 | 3-69-1*01 |
| B | GRCh37 | 2 | 2 | 2 | |
| C | CHM1 | 2 | 2 | 2 | |
| D | CHM1 | 2 | 2 | 2 | |
| E | GRCh37 | 3 | 2 | 2 | 1-8*01 |
| F | CHM1 | 1 | 1 | 1 | |
Calls for Known CNV Genes in the CHM1 + GRCh37 Sample
| Gene | Number of Copies in Sample | Number of Copies Called | Correct Allele Calls | False-Positive Calls | False-Negative Calls |
|---|---|---|---|---|---|
| 1-69 | 4 | 5 | 1-69-2*01, 1-69*06, 1-69*06 | 1-69*06, 1-69*06 | 1-69*01 |
| 2-70 | 3 | 3 | 2-70*01, 2-70D*04, 2-70*13 | | |
| 3-64 | 3 | 3 | 3-64*02, 3-64D*06, 3-64*02 | | |
| 4-31 | 2 | 2 | 4-30-2*01, 4-31*02 | | |