X Zheng, J Shen, C Cox, J C Wakefield, M G Ehm, M R Nelson, B S Weir.
Abstract
Genotyping of classical human leukocyte antigen (HLA) alleles is an essential tool in the analysis of diseases and adverse drug reactions with associations mapping to the major histocompatibility complex (MHC). However, deriving high-resolution HLA types subsequent to whole-genome single-nucleotide polymorphism (SNP) typing or sequencing is often cost prohibitive for large samples. An alternative approach takes advantage of the extended haplotype structure within the MHC to predict HLA alleles using dense SNP genotypes, such as those available from genome-wide SNP panels. Current methods for HLA imputation are difficult to apply or may require the user to have access to large training data sets with SNP and HLA types. We propose HIBAG, HLA Imputation using attribute BAGging, that makes predictions by averaging HLA-type posterior probabilities over an ensemble of classifiers built on bootstrap samples. We assess the performance of HIBAG using our study data (n=2668 subjects of European ancestry) as a training set and HLA data from the British 1958 birth cohort study (n≈1000 subjects) as independent validation samples. Prediction accuracies for HLA-A, B, C, DRB1 and DQB1 range from 92.2% to 98.1% using a set of SNP markers common to the Illumina 1M Duo, OmniQuad, OmniExpress, 660K and 550K platforms. HIBAG performed well compared with the other two leading methods, HLA*IMP and BEAGLE. This method is implemented in a freely available HIBAG R package that includes pre-fit classifiers for European, Asian, Hispanic and African ancestries, providing a readily available imputation approach without the need to have access to large training data sets.
Year: 2013 PMID: 23712092 PMCID: PMC3772955 DOI: 10.1038/tpj.2013.18
Source DB: PubMed Journal: Pharmacogenomics J ISSN: 1470-269X Impact factor: 3.550
Figure 1. Overview of the HIBAG prediction algorithm. HIBAG is an ensemble classifier consisting of individual classifiers (C) with human leukocyte antigen (HLA) and single-nucleotide polymorphism (SNP) haplotype probabilities estimated from bootstrapped samples (B) and SNP subsets (S). The SNP subsets are determined by a variable selection algorithm with a random component. HLA-type predictions are averaged over the posterior probabilities from all classifiers.
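The ensemble idea in Figure 1 can be illustrated with a minimal Python sketch. This is not the HIBAG implementation: a categorical naive Bayes model stands in for the haplotype-based classifiers, each SNP subset is drawn purely at random rather than by a variable selection algorithm, and all function names, allele labels and toy genotypes are hypothetical.

```python
import random
from collections import Counter


def fit_naive_bayes(X, y, features):
    """Fit a categorical naive Bayes model on the given SNP (column) indices.

    Stand-in base classifier (assumption): HIBAG itself estimates
    HLA/SNP haplotype probabilities instead.
    """
    classes = sorted(set(y))
    class_counts = Counter(y)
    # value_counts[c][j][v]: number of class-c rows with genotype v at SNP j
    value_counts = {c: {j: Counter() for j in features} for c in classes}
    for row, label in zip(X, y):
        for j in features:
            value_counts[label][j][row[j]] += 1
    return {"classes": classes, "class_counts": class_counts,
            "value_counts": value_counts, "features": features, "n": len(y)}


def posterior(model, row, n_values=3):
    """Return {class: P(class | row)} using Laplace smoothing.

    n_values=3 assumes genotypes coded as 0, 1 or 2 minor-allele copies.
    """
    scores = {}
    for c in model["classes"]:
        p = model["class_counts"][c] / model["n"]
        for j in model["features"]:
            count = model["value_counts"][c][j][row[j]]
            p *= (count + 1) / (model["class_counts"][c] + n_values)
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def fit_ensemble(X, y, n_classifiers=25, n_features=2, seed=0):
    """Build classifiers on bootstrap samples (B) and random SNP subsets (S)."""
    rng = random.Random(seed)
    n = len(y)
    models = []
    for _ in range(n_classifiers):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feats = sorted(rng.sample(range(len(X[0])), n_features))
        models.append(fit_naive_bayes([X[i] for i in idx],
                                      [y[i] for i in idx], feats))
    return models


def predict_proba(models, row, classes):
    """Average the posterior probabilities over all classifiers."""
    avg = {c: 0.0 for c in classes}
    for m in models:
        post = posterior(m, row)
        for c in classes:
            # A class missing from a bootstrap sample contributes probability 0.
            avg[c] += post.get(c, 0.0) / len(models)
    return avg


# Toy data: 4 SNPs coded 0/1/2 and two hypothetical allele labels.
X = [[0, 0, 0, 0]] * 10 + [[2, 2, 2, 2]] * 10
y = ["allele_1"] * 10 + ["allele_2"] * 10
models = fit_ensemble(X, y)
proba = predict_proba(models, [0, 0, 0, 0], sorted(set(y)))
best = max(proba, key=proba.get)
print(best, round(proba[best], 3))
```

Averaging posteriors, rather than taking a majority vote, keeps a probability attached to each prediction, which is what a call threshold can later be applied to.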
The numbers of individuals with four-digit HLA types and the observed number of HLA alleles for each locus
| 90 | 68 | 90 | 90 | 90 | 90 | 90 | |
| 90 | 88 | 88 | 89 | 90 | 88 | 12 | |
| 89 | 89 | 89 | 88 | 89 | 87 | 58 | |
| 884 | 1532 | 840 | 1129 | 0 | 1004 | 0 | |
| 1857 | 2572 | 1866 | 2436 | 1740 | 1924 | 1624 | |
| 517 | 624 | 522 | 608 | 495 | 525 | 469 | |
| 298 | 430 | 300 | 420 | 269 | 312 | 263 | |
| 80 | 112 | 80 | 102 | 74 | 78 | 69 | |
| 48 | 88 | 37 | 55 | 17 | 21 | 26 | |
| 43 | 72 | 34 | 49 | 17 | 19 | 29 | |
| 41 | 85 | 31 | 44 | 14 | 17 | 26 | |
| 36 | 45 | 24 | 30 | 13 | 17 | 23 | |
| 85 | 144 | 49 | 80 | 19 | 27 | 49 | |
Abbreviations: HLA, human leukocyte antigen; WTCCC, Wellcome Trust Case Control Consortium.
Summary of the four-digit prediction accuracies (call rates) for HLARES of European ancestry, using four-digit HLA data from the British 1958 birth cohort study as independent validation samples
| | HLA-A | HLA-B | HLA-C | HLA-DRB1 | HLA-DQB1 |
| No. of SNPs | 273 | 341 | 356 | 327 | 356 |
| No. of training samples | 1857 | 2572 | 1866 | 2436 | 1924 |
| No. of validation samples | 884 | 1532 | 840 | 1129 | 1004 |
| CT=0 | 98.1 (100) | 96.9 (100) | 96.5 (100) | 92.2 (100) | 97.8 (100) |
| CT=0.5 | 98.2 (99.4) | 97.4 (97.3) | 96.6 (99.5) | 94.0 (94.6) | 98.0 (99.0) |
Abbreviations: CT, call threshold; HLA, human leukocyte antigen; SNP, single-nucleotide polymorphism.
HIBAG CTs of 0 and 0.5 were used.
SNP markers common to the Illumina 1M Duo, OmniQuad, OmniExpress, 660K and 550K platforms within a flanking region of 500 kb were used.
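The pairing of accuracy with call rate in the table can be made concrete with a short Python sketch. It assumes that a prediction is "called" when its posterior probability is at least the call threshold, and it scores whole predictions rather than individual alleles; both choices, the function name and the toy data are illustrative assumptions.

```python
def accuracy_and_call_rate(predictions, call_threshold=0.0):
    """Return (accuracy among called predictions, call rate).

    predictions: list of (predicted, truth, posterior_probability) tuples.
    """
    called = [p for p in predictions if p[2] >= call_threshold]
    call_rate = len(called) / len(predictions)
    if not called:
        return float("nan"), call_rate
    correct = sum(pred == truth for pred, truth, _ in called)
    return correct / len(called), call_rate


# Hypothetical predictions: (predicted label, true label, posterior probability).
toy = [("x", "x", 0.9), ("y", "y", 0.8), ("x", "y", 0.4), ("y", "y", 0.6)]

acc_all, rate_all = accuracy_and_call_rate(toy, call_threshold=0.0)
acc_half, rate_half = accuracy_and_call_rate(toy, call_threshold=0.5)
```

In this toy case, raising the threshold from 0 to 0.5 drops the one low-confidence (and wrong) prediction, so accuracy rises while the call rate falls, the same trade-off reported in the CT=0 and CT=0.5 rows.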
The comparison of four-digit accuracies for HIBAG and HLA*IMP on HLARES data of European ancestry with no call threshold
| No. of validation samples | 1787 | 2471 | 1830 | 2383 | 1917 |
| No. of SNPs | 50 | 39 | 27 | 50 | 34 |
| HLA*IMP (%) | 91.0 | 94.4 | 98.4 | 87.9 | 96.2 |
| HIBAG (%) | 96.7 | 94.8 | 98.7 | 90.0 | 98.6 |
| No. of SNPs | 489 | 562 | 554 | 474 | 447 |
| HIBAG (%) | 97.7 | 95.1 | 98.7 | 91.8 | 98.4 |
Abbreviations: HLA, human leukocyte antigen; MHC, major histocompatibility complex; SNP, single-nucleotide polymorphism.
The full SNP list is shown in Supplementary Table S4.
The training samples are HapMap 30 CEU trios plus WTCCC samples.
The SNP markers within a flanking region of 250 kb were used.
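Restricting markers to a flanking region amounts to a positional filter, sketched below in Python. The gene coordinates, SNP identifiers and positions are made up for illustration, and the function name is hypothetical; the sketch assumes the flank extends the stated distance on both sides of the locus.

```python
def snps_in_flanking_region(snp_positions, gene_start, gene_end, flank_bp):
    """Return IDs of SNPs lying within flank_bp of the gene on either side."""
    lower = gene_start - flank_bp
    upper = gene_end + flank_bp
    return [snp_id for snp_id, pos in snp_positions.items()
            if lower <= pos <= upper]


# Illustrative positions only (not real MHC coordinates).
toy_positions = {
    "snp_a": 700_000,
    "snp_b": 800_000,
    "snp_c": 1_005_000,
    "snp_d": 1_260_000,
    "snp_e": 1_300_000,
}

selected = snps_in_flanking_region(
    toy_positions, gene_start=1_000_000, gene_end=1_010_000, flank_bp=250_000
)
```

A wider flank (for example 500 kb instead of 250 kb) admits more SNPs as candidate predictors, at the cost of a larger search space for each classifier.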
Summary of the four-digit prediction accuracies (call rates) stratified by ancestries and HLA loci
| No. of SNPs | 273 | 341 | 356 | 327 | 349 | 356 | 279 |
| CT=0.0 | 98.2 (100) | 96.6 (100) | 98.8 (100) | 92.1 (100) | 97.3 (100) | 98.8 (100) | 93.8 (100) |
| CT=0.5 | 98.7 (98.8) | 97.8 (94.2) | 99.2 (98.0) | 94.9 (90.1) | 97.8 (97.9) | 99.2 (97.9) | 94.8 (96.0) |
| 98.1 (100) | 95.5 (100) | 97.7 (100) | 92.9 (100) | 96.4 (100) | 97.9 (100) | 94.7 (100) | |
| No. of SNPs | 259 | 334 | 346 | 319 | 341 | 348 | 272 |
| CT=0.0 | 92.1 (100) | 87.5 (100) | 96.6 (100) | 88.7 (100) | 86.8 (100) | 96.0 (100) | 89.8 (100) |
| CT=0.5 | 93.8 (91.7) | 94.7 (71.0) | 97.8 (93.9) | 95.8 (71.5) | 90.0 (80.8) | 98.1 (96.3) | 95.3 (82.8) |
| 93.8 (100) | 83.7 (100) | 94.5 (100) | 87.7 (100) | 86.7 (100) | 97.3 (100) | 91.2 (100) | |
| No. of SNPs | 274 | 341 | 356 | 326 | 348 | 355 | 278 |
| CT=0.0 | 93.4 (100) | 75.0 (100) | 96.2 (100) | 82.0 (100) | 93.8 (100) | 95.7 (100) | 93.1 (100) |
| CT=0.5 | 96.0 (82.5) | 93.8 (37.5) | 98.4 (87.4) | 93.5 (50.8) | 95.8 (90.8) | 98.9 (90.0) | 97.5 (81.5) |
| 89.1 (100) | 75.0 (100) | 92.3 (100) | 78.7 (100) | 94.6 (100) | 96.3 (100) | 91.9 (100) | |
| No. of SNPs | 266 | 335 | 349 | 325 | 343 | 351 | 269 |
| CT=0.0 | 92.4 (100) | 76.8 (100) | 88.5 (100) | 77.1 (100) | 80.0 (100) | 79.4 (100) | 74.2 (100) |
| CT=0.5 | 100 (74.6) | 96.7 (21.1) | 96.5 (66.2) | 100 (22.2) | 97.2 (27.7) | 97.7 (34.9) | 75.0 (12.9) |
| 93.2 (100) | 71.1 (100) | 86.9 (100) | 81.2 (100) | 79.2 (100) | 76.2 (100) | 79.0 (100) | |
Abbreviations: HLA, human leukocyte antigen; SNP, single-nucleotide polymorphism.
STUDY data were divided into training and validation sets with equal sizes. HIBAG call thresholds (CTs) of 0 and 0.5 were used.
SNP markers common to the Illumina 1M Duo, OmniQuad, OmniExpress, 660K and 550K platforms within a flanking region of 500 kb were used.
No call threshold.
Figure 2. The relationship between accuracy and call rate when HLARES data for individuals of European ancestry are divided into training and validation sets with equal sizes. On the curve for each HLA (human leukocyte antigen) locus, the 0.5 call threshold is indicated by •.
Figure 3. The relationship between training sample size and accuracy. HLARES data of European ancestry were divided into training and validation sets with equal sizes, and random subsets of training samples (n=100, 250, 500, 750, 1000 and max) were used to build a HIBAG model, which was applied to the same validation samples. No call threshold was applied.