Martin L Buchkovich, Chad C Brown, Kimberly Robasky, Shengjie Chai, Sharon Westfall, Benjamin G Vincent, Eric T Weimer, Jason G Powers.
Abstract
BACKGROUND: The human leukocyte antigen (HLA) system is a genomic region involved in regulating the human immune system by encoding cell membrane major histocompatibility complex (MHC) proteins that are responsible for self-recognition. Understanding the variation in this region provides important insights into autoimmune disorders, disease susceptibility, oncological immunotherapy, regenerative medicine, transplant rejection, and toxicogenomics. Traditional approaches to HLA typing are low throughput, target only a few genes, are labor intensive and costly, or require specialized protocols. RNA sequencing promises a relatively inexpensive, high-throughput solution for HLA calling across all genes, with the bonus of complete transcriptome information and widespread availability of historical data. Existing tools have been limited in their ability to accurately and comprehensively call HLA genes from RNA-seq data.
Keywords: HLA; HSCT; Immunology; RNA-sequencing; Transplantation
Year: 2017 PMID: 28954626 PMCID: PMC5618726 DOI: 10.1186/s13073-017-0473-6
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Fig. 1 Overview of the HLAProfiler workflow. The HLAProfiler workflows for creating the reference k-mer profile database (green) and for calling HLA alleles in RNA-seq data (blue). Each step label in the workflow corresponds to the text (see “Implementation”). The workflows share the k-mer filtering and profile creation step (blue/green box).
Fig. 2 HLA calling accuracy. a The accuracy of HLA calling was evaluated for six algorithms. Datasets were simulated using GeT-RM alleles from 109 samples (left panels) and rare alleles for 100 samples (right panels) at two-field precision (upper panels) and exact precision (lower panels) when available. b Concordance of HLA calling in 358 lymphoblastoid cell lines compared with gold standard HLA allele calls generated by Sanger sequencing (left panel). Sequences were downsampled to five million reads, HLA alleles were called, and concordance was recalculated (middle panel). Discrepancies between HLAProfiler, OptiType, and Sanger sequencing were resolved using TruSight HLA for 38 samples, the gold standard calls were updated with the resolved genotype, and concordance was recalculated for all methods with the addition of the original Sanger sequencing calls (right panel).
Fig. 3 HLAProfiler correctly identifies the disease-associated B*27 allele incorrectly called by the gold standard. a Sequence coverage of RNA-seq data from NA11840 when aligned to B*27:03 (gold standard call), B*27:05:02 (identified by RNA-seq algorithms), and B*27:05:03 (full sequence predicted by HLAProfiler with allele refinement, and allele confirmed by TruSight HLA). Exon boundaries relative to the allele and differences between the alleles responsible for dips in coverage are also noted. b HLAProfiler-generated comparison statistics of the three alleles, indicating the proportion of observed reads accounted for by the profile, the proportion of the profile accounted for by observed reads, and the correlation between the observed reads and the profile.
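The three comparison statistics named in the Fig. 3b caption can be illustrated with a minimal Python sketch. This is not HLAProfiler's implementation: the function names `kmer_counts` and `compare_profiles`, the use of raw k-mer counts, and the choice of a Pearson correlation over the union of k-mers are all assumptions made here for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(sequences, k):
    """Count every substring of length k across the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def compare_profiles(observed, profile):
    """Compare observed k-mer counts with a reference allele profile.

    Hypothetical definitions (assumptions, not taken from the paper):
      observed_explained: share of observed k-mer counts whose k-mer
                          also occurs in the profile
      profile_explained:  share of profile k-mer counts whose k-mer
                          also occurs in the observed data
      correlation:        Pearson correlation of counts over the union
                          of k-mers (0.0 if either side has no variance)
    """
    shared = observed.keys() & profile.keys()
    obs_total = sum(observed.values())
    prof_total = sum(profile.values())
    obs_explained = sum(observed[m] for m in shared) / obs_total if obs_total else 0.0
    prof_explained = sum(profile[m] for m in shared) / prof_total if prof_total else 0.0

    kmers = sorted(observed.keys() | profile.keys())
    xs = [observed.get(m, 0) for m in kmers]
    ys = [profile.get(m, 0) for m in kmers]
    n = len(kmers)
    corr = 0.0
    if n:
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        if var_x > 0 and var_y > 0:
            corr = cov / sqrt(var_x * var_y)

    return {
        "observed_explained": obs_explained,
        "profile_explained": prof_explained,
        "correlation": corr,
    }
```

In a sketch like this, a candidate allele whose profile leaves many observed k-mers unexplained, or whose own k-mers are largely absent from the reads, would score lower than the true allele, which is the kind of contrast the figure draws between B*27:03, B*27:05:02, and B*27:05:03.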
Two-field accuracy and exact sequence matching for novel and partial alleles
| | HLAProfiler | OptiType | seq2hla | HLAForest | HLAminer | PHLAT |
|---|---|---|---|---|---|---|
| Novel alleles* | | | | | | |
| Sequence identified | 63% | - | - | - | - | - |
| Sequence identified or two-field accuracy | 68% | - | - | - | - | - |
| One-field accuracy | 97% | 15% | 17% | 19% | 37% | 96% |
| Two-field accuracy | 25% | 4% | 4% | 3% | 10% | 26% |
| Partial alleles | | | | | | |
| Sequence identified | 67% | - | - | - | - | - |
| Sequence identified or two-field accuracy | 85% | - | - | - | - | - |
| One-field accuracy | 97% | 95% | 94% | 19% | 72% | 97% |
| Two-field accuracy | 85% | 41% | 21% | 3% | 23% | 46% |
Results are based on two sets of simulated data for 100 samples, each having exactly one partial allele or one novel allele. Accuracy for novel alleles is defined as identification of the exact sequence, one-field accuracy, or two-field accuracy
*In the case of novel alleles, the correct protein sequence can be identified without correctly identifying one- or two-field precision, or one- or two-field precision can be identified while missing the exact protein sequence
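To make the one-field and two-field accuracy rows concrete, here is a minimal Python sketch of how such scores could be computed from allele names. It relies only on the standard HLA naming layout (gene, `*`, then colon-separated fields, as in B*27:05:02). The helper names and the per-sample, order-independent matching rule are assumptions for illustration; the scoring procedure actually used in the study is not described in this section.

```python
from collections import Counter


def truncate_allele(allele, fields):
    """Reduce an allele name to the first `fields` colon-separated fields,
    e.g. 'B*27:05:02' becomes 'B*27:05' at two-field precision."""
    gene, _, rest = allele.partition("*")
    return gene + "*" + ":".join(rest.split(":")[:fields])


def count_matches(called, truth, fields):
    """Count truth alleles matched by a called allele at the given
    precision. Each called allele can satisfy at most one truth allele,
    and allele order within a sample is ignored (an assumption)."""
    remaining = Counter(truncate_allele(a, fields) for a in called)
    hits = 0
    for allele in truth:
        key = truncate_allele(allele, fields)
        if remaining[key] > 0:
            remaining[key] -= 1
            hits += 1
    return hits


def accuracy(samples, fields):
    """Fraction of truth alleles recovered across (called, truth) pairs."""
    total = sum(len(truth) for _, truth in samples)
    hits = sum(count_matches(called, truth, fields) for called, truth in samples)
    return hits / total if total else 0.0


# Toy example: one sample, two B alleles, one of them wrong at the second field.
samples = [(["B*27:05:02", "B*07:02:01"], ["B*27:03", "B*07:02:01"])]
one_field = accuracy(samples, fields=1)
two_field = accuracy(samples, fields=2)
```

In this toy sample both alleles agree at the first field, but B*27:05 and B*27:03 differ at the second, so one-field accuracy is 1.0 and two-field accuracy is 0.5. A gap of this kind is what separates the one-field and two-field rows in the table.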