Nicolas Vince, Venceslas Douillard, Estelle Geffard, Diogo Meyer, Erick C Castelli, Steven J Mack, Sophie Limou, Pierre-Antoine Gourraud.
Abstract
Genome-wide association studies have repeatedly identified the major histocompatibility complex genomic region (6p21.3) as key in immune pathologies. Researchers have also aimed to extend the biological interpretation of associations by focusing directly on human leukocyte antigen (HLA) polymorphisms and their combination as haplotypes. To circumvent the effort and high costs of HLA typing, statistical solutions have been developed to infer HLA alleles from single-nucleotide polymorphism (SNP) genotyping data. Although several HLA imputation methods exist, no unified effort has yet been undertaken to share large and diverse imputation models, or to improve methods. By training the HIBAG software on SNP + HLA data generated by the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA) to create reference panels, we highlighted the importance of (a) the number of individuals in reference panels, with a twofold increase in accuracy (from 10 to 100 individuals) and (b) the number of SNPs, with a 1.5-fold increase in accuracy (from 500 to 24,504 SNPs). Results showed improved accuracy with CAAPA compared to the African American models available in HIBAG, highlighting the need for precise population-matching. The SNP-HLA Reference Consortium is an international endeavor to gather data, enhance HLA imputation and broaden access to highly accurate imputation models for the immunogenomics community.
Keywords: HLA; SNP; consortium; imputation
Year: 2020 PMID: 32681667 PMCID: PMC7540691 DOI: 10.1002/gepi.22334
Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.135
Figure 1. Influence of the number of individuals (a) and SNPs (b) used to build the HIBAG reference panel on the accuracy of HLA allele prediction. From the CAAPA data set (N = 880 and SNPs = 24,504), we produced 10 pairs of training (n training = 100) and test (n test = 780) sets to assess HLA imputation accuracy in different scenarios. Each model was validated by comparing the typed HLA alleles to the model-predicted HLA alleles across all individuals to provide an accuracy percentage (posterior probability call threshold = 0). (a) By randomly selecting individuals in the training data set, we created sub-datasets containing 10, 20, and 50 individuals. Custom HIBAG models were computed for these subsets as well as for all 100 training individuals, using every available SNP. (b) Subsets of the training data set with 500, 1,000, 5,000, and 10,000 randomly selected SNPs (out of the 24,504 available SNPs) were created and the corresponding models computed. The x-axis shows the number of SNPs in the data set. The number of SNPs actually kept to create the model, which varies depending on the gene studied and the subset, is five times lower on average (see Tables S1.1 and S1.2). The horizontal marks on each HLA gene curve indicate the accuracies obtained with the default African American HIBAG models. HIBAG, HLA Genotype Imputation with Attribute Bagging; HLA, human leukocyte antigen; SNP, single-nucleotide polymorphism; nS, number of SNPs in the model; nT, number of individuals in the model
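The evaluation design described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it does not call HIBAG, and the function names (split_train_test, matched_alleles, allele_accuracy), the sample identifiers, and the example genotypes are all hypothetical. It assumes accuracy is counted per allele (two alleles per individual and gene), ignoring allele order, with no call threshold applied.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Two HLA alleles carried by one individual at one gene (order is arbitrary).
Genotype = Tuple[str, str]


def split_train_test(
    sample_ids: Sequence[str], n_train: int, rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Randomly split individuals into a training set and a test set."""
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def matched_alleles(typed: Genotype, predicted: Genotype) -> int:
    """Count predicted alleles that match the typed ones (0, 1, or 2)."""
    remaining = list(predicted)
    hits = 0
    for allele in typed:
        if allele in remaining:
            remaining.remove(allele)  # each predicted allele can match once
            hits += 1
    return hits


def allele_accuracy(
    typed: Dict[str, Genotype], predicted: Dict[str, Genotype]
) -> float:
    """Percentage of correctly predicted alleles across all individuals."""
    hits = sum(matched_alleles(typed[s], predicted[s]) for s in typed)
    return 100.0 * hits / (2 * len(typed))


if __name__ == "__main__":
    rng = random.Random(0)
    ids = [f"ind{i}" for i in range(880)]  # hypothetical identifiers
    train, test = split_train_test(ids, n_train=100, rng=rng)
    # Smaller reference panels drawn from the 100 training individuals.
    panels = {n: rng.sample(train, n) for n in (10, 20, 50)}
    print(len(train), len(test), {n: len(p) for n, p in panels.items()})

    # Toy typed vs. predicted genotypes (made up for illustration only).
    typed = {"s1": ("A*02:01", "A*30:01"), "s2": ("A*01:01", "A*01:01")}
    predicted = {"s1": ("A*30:01", "A*02:01"), "s2": ("A*01:01", "A*03:01")}
    print(allele_accuracy(typed, predicted))  # 3 of 4 alleles match: 75.0
```

In the study itself, a HIBAG model would be trained on each panel and its predictions on the 780 test individuals would be compared with the typed alleles; the sketch only covers the splitting and scoring steps around that model.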
Figure 2. The SNP-HLA Reference Consortium (SHLARC) design. Aim 1: Increase the amount of SNP + HLA data available, in terms of both quantity and diversity. Aim 2: Optimize SNP-HLA imputation methods. Aim 3: The SHLARC website will allow users from the scientific community to benefit from the data and knowledge accumulated by the consortium on SNP-to-HLA allele imputation. From a list of SNPs and a selected ethnicity of interest, or alternatively from uploaded SNP genotype data sets, the best custom reference panel for HLA allele imputation will be built on our servers. HLA, human leukocyte antigen; SNP, single-nucleotide polymorphism