András Szolek, Benjamin Schubert, Christopher Mohr, Marc Sturm, Magdalena Feldhahn, Oliver Kohlbacher.
Abstract
MOTIVATION: The human leukocyte antigen (HLA) gene cluster plays a crucial role in adaptive immunity and is thus relevant in many biomedical applications. While next-generation sequencing (NGS) data are often available for a patient, deducing the HLA genotype is difficult because of substantial sequence similarity within the cluster and exceptionally high variability of the loci. Established approaches therefore rely on specific HLA enrichment and sequencing techniques, which come at additional cost and add turnaround time.
RESULTS: We present OptiType, a novel HLA genotyping algorithm based on integer linear programming, capable of producing accurate predictions from NGS data not specifically enriched for the HLA cluster. We also present a comprehensive benchmark dataset consisting of RNA, exome and whole-genome sequencing data. OptiType significantly outperformed previously published in silico approaches, with an overall accuracy of 97%, enabling its use in a broad range of applications.
Year: 2014 PMID: 25143287 PMCID: PMC4441069 DOI: 10.1093/bioinformatics/btu548
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. OptiType’s four-digit HLA typing pipeline. Reference libraries for genomic and CDS sequences are generated by extracting exons 2 and 3 from each known HLA-I allele. For genomic sequences, flanking intronic regions are also extracted. If some of these regions are missing, phylogenetic information is used to reconstruct the missing segments from the most closely related HLA-I allele. NGS reads are mapped against the resulting HLA allele reference (A). From the mapping result, a binary hit matrix $C$ is constructed for all reads $r$ mapping to at least one allele $a$ of the reference, with $C_{r,a} = 1$ if read $r$ could be mapped to allele $a$ and $C_{r,a} = 0$ otherwise (B). Based on this hit matrix, an ILP is formulated that maximizes the number of explainable reads by selecting up to two alleles (columns of the hit matrix) for each HLA-I locus (C). The selected alleles represent the most probable genotype.
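Steps (B) and (C) of the pipeline can be illustrated with a minimal Python sketch. It is not OptiType's implementation: instead of an ILP solver it enumerates every allowed allele selection by brute force, which optimizes the same objective (number of explained reads) but is feasible only for toy inputs. The read identifiers, the allele list in `toy_read_hits`, the function names, and the tie-breaking rule that prefers fewer alleles are all assumptions made for illustration.

```python
from itertools import combinations, product

# Hypothetical toy input: read id -> set of alleles the read could be mapped to.
toy_read_hits = {
    "r1": {"A*01:01"},
    "r2": {"A*01:01", "A*02:01"},
    "r3": {"A*02:01"},
    "r4": {"B*07:02"},
    "r5": {"B*07:02", "B*08:01"},
    "r6": {"A*01:01", "A*03:01"},
}


def build_hit_matrix(read_hits):
    """Return (reads, alleles, matrix); matrix[i][j] is 1 if read i maps to allele j."""
    reads = sorted(read_hits)
    alleles = sorted({a for hits in read_hits.values() for a in hits})
    matrix = [[1 if a in read_hits[r] else 0 for a in alleles] for r in reads]
    return reads, alleles, matrix


def select_genotype(read_hits):
    """Pick one or two alleles per locus so that the most reads are explained.

    Brute-force stand-in for the ILP: a read counts as explained if at least
    one selected column of the hit matrix holds a 1 in that read's row.
    Ties are broken in favour of fewer selected alleles (an assumption).
    """
    reads, alleles, matrix = build_hit_matrix(read_hits)

    # Group column indices by locus, taken here as the text before '*'.
    columns_by_locus = {}
    for j, allele in enumerate(alleles):
        columns_by_locus.setdefault(allele.split("*")[0], []).append(j)
    loci = sorted(columns_by_locus)

    # For each locus, every way of choosing one or two of its alleles.
    per_locus_options = [
        [c for k in (1, 2) for c in combinations(columns_by_locus[locus], k)]
        for locus in loci
    ]

    best_key, best_choice = None, None
    for choice in product(*per_locus_options):
        selected = [j for cols in choice for j in cols]
        explained = sum(1 for row in matrix if any(row[j] for j in selected))
        key = (explained, -len(selected))
        if best_key is None or key > best_key:
            best_key, best_choice = key, choice

    genotype = {
        locus: [alleles[j] for j in cols]
        for locus, cols in zip(loci, best_choice)
    }
    return genotype, best_key[0]


genotype, explained = select_genotype(toy_read_hits)
print(genotype, explained)
```

In this toy case a single B allele already explains every B read, so the sketch reports one allele for that locus, which would correspond to a homozygous call. The search space grows combinatorially with the number of alleles, which is one reason a real reference with thousands of alleles calls for an ILP solver rather than enumeration.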
Fig. 2. Performance comparison of HLA typing algorithms. OptiType’s average prediction accuracy for major HLA-I loci was compared with four other published HLA typing methods capable of four-digit typing on publicly available datasets previously used to evaluate these methods.
Fig. 3. Coverage and read length dependence of prediction accuracy. To determine the influence of coverage depth on HLA typing accuracy, reads of 253 exome sequencing runs of the 1000 Genomes Project were subsampled more than 4000 times to simulate different coverage depths. To investigate the impact of read length on performance, the original reads were trimmed to 37 bp and evaluated with the same subsampling procedure. Read length alone shows little effect on prediction accuracy, and an average coverage depth greater than 10× over the HLA-I loci was already found to yield maximal accuracy.
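The subsampling and trimming experiment can be sketched as follows. The caption does not say how the reads were drawn, so this is only one plausible reading: reads are sampled uniformly without replacement until the standard estimate of mean coverage (number of reads × read length / region length) reaches a target. The function names, the `region_length` value, and the synthetic reads are hypothetical.

```python
import random


def mean_coverage(n_reads, read_length, region_length):
    """Standard estimate of average coverage depth over a region."""
    return n_reads * read_length / region_length


def subsample_reads(reads, target_coverage, region_length, seed=0):
    """Draw reads uniformly without replacement to approximate a target coverage."""
    if not reads:
        return []
    avg_length = sum(len(r) for r in reads) / len(reads)
    n_target = round(target_coverage * region_length / avg_length)
    n_target = min(n_target, len(reads))  # cannot exceed the available reads
    rng = random.Random(seed)  # seeded so that a run can be repeated
    return rng.sample(reads, n_target)


def trim_reads(reads, length=37):
    """Keep only the first `length` bases of each read."""
    return [r[:length] for r in reads]


# Hypothetical input: 1000 synthetic reads of 100 bp over a 5000 bp region (20x).
all_reads = ["ACGT" * 25] * 1000
region_length = 5000

subset = subsample_reads(all_reads, target_coverage=10, region_length=region_length)
short = trim_reads(subset, length=37)
print(len(subset), mean_coverage(len(subset), 100, region_length))
print(len(short[0]), mean_coverage(len(short), 37, region_length))
```

Trimming reduces coverage along with read length, so separating the two effects requires comparing runs at matched coverage depth, which is what applying the same subsampling procedure to the trimmed reads allows.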