SNP-HLA Reference Consortium (SHLARC): HLA and SNP data sharing for promoting MHC-centric analyses in genomics.

Nicolas Vince¹, Venceslas Douillard¹, Estelle Geffard¹, Diogo Meyer², Erick C Castelli³, Steven J Mack⁴, Sophie Limou¹,⁵, Pierre-Antoine Gourraud¹.

Abstract

Genome-wide association studies have repeatedly identified the major histocompatibility complex genomic region (6p21.3) as key in immune pathologies. Researchers have also aimed to extend the biological interpretation of associations by focusing directly on human leukocyte antigen (HLA) polymorphisms and their combination as haplotypes. To circumvent the effort and high costs of HLA typing, statistical solutions have been developed to infer HLA alleles from single-nucleotide polymorphism (SNP) genotyping data. Though HLA imputation methods have been developed, no unified effort has yet been undertaken to share large and diverse imputation models, or to improve methods. By training the HIBAG software on SNP + HLA data generated by the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA) to create reference panels, we highlighted the importance of (a) the number of individuals in reference panels, with a twofold increase in accuracy (from 10 to 100 individuals) and (b) the number of SNPs, with a 1.5-fold increase in accuracy (from 500 to 24,504 SNPs). Results showed improved accuracy with CAAPA compared to the African American models available in HIBAG, highlighting the need for precise population-matching. The SNP-HLA Reference Consortium is an international endeavor to gather data, enhance HLA imputation and broaden access to highly accurate imputation models for the immunogenomics community.

Keywords: HLA; SNP; consortium; imputation

Substances: HLA Antigens

Year: 2020 PMID: 32681667 PMCID: PMC7540691 DOI: 10.1002/gepi.22334

Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.135

INTRODUCTION

Beginning with the discovery of the HLA system in the 1950s, the characterization of HLA polymorphism and of HLA disease associations has proceeded in parallel (Dausset, 1999; Trowsdale & Knight, 2013). In the genome‐wide association study (GWAS) era, the focus shifted to single‐nucleotide polymorphisms (SNPs) with little to no biological relevance. Even when located in the major histocompatibility complex (MHC) region (6p21.3), these SNP associations have largely supplanted the traditional study of HLA allele associations. GWASs have, however, confirmed the crucial role of the HLA loci in the genetic epidemiology of nearly a quarter of all diseases and traits (MacArthur et al., 2017; Trowsdale & Knight, 2013), but SNP associations do not convey the immune‐biological relevance that specific HLA alleles have. For example, GWASs of HIV disease identified the rs2395029 SNP near the HCP5 gene on chromosome 6 as the variant most strongly associated with viral control (Fellay et al., 2007; Limou & Zagury, 2013). This SNP, which is located 100 kb from HLA‐B, is in nearly complete linkage disequilibrium with HLA‐B*57:01, which can present HIV peptides crucial for HIV detection by the immune system (Chen et al., 2012; Limou & Zagury, 2013). Using novel bioinformatic approaches, we now have the ability to statistically infer HLA alleles from genotypic SNP data (imputation), returning HLA molecular functions to the forefront of disease‐association research (Meyer & Nunes, 2017; Pappas et al., 2018). Imputation methods are statistical methods that infer or predict missing information based on haplotypes. A haplotype is a combination of genetic variants on one chromosome: it can be a SNP haplotype (e.g., 011010, denoting the presence or absence of each SNP), a gene haplotype (e.g., HLA‐A*01:01~HLA‐B*08:01~HLA‐C*07:01~HLA‐DRB1*03:01~HLA‐DQB1*02:01), or a haplotype combining different types of genetic variants (SNPs, indels, substitutions; e.g., HLA alleles).
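To make the notion of linkage disequilibrium between a SNP and an HLA allele concrete, here is a minimal Python sketch that computes the squared correlation (r²) between two biallelic markers from phased haplotypes. The haplotype vectors are invented toy data and the function name `ld_r2` is ours; none of the values come from the studies cited above.

```python
from typing import Sequence


def ld_r2(marker_a: Sequence[int], marker_b: Sequence[int]) -> float:
    """Squared correlation (r^2) between two 0/1 markers over phased haplotypes."""
    n = len(marker_a)
    if n == 0 or n != len(marker_b):
        raise ValueError("markers must be non-empty and of equal length")
    if any(x not in (0, 1) for x in list(marker_a) + list(marker_b)):
        raise ValueError("markers must be coded as 0 (absent) or 1 (present)")
    p_a = sum(marker_a) / n                                   # frequency of allele 1 at marker A
    p_b = sum(marker_b) / n                                   # frequency of allele 1 at marker B
    p_ab = sum(a & b for a, b in zip(marker_a, marker_b)) / n  # haplotypes carrying both
    d = p_ab - p_a * p_b                                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic marker")
    return d * d / denom


# Toy data: 10 phased haplotypes (hypothetical, for illustration only).
# 1 = haplotype carries the SNP minor allele / the HLA allele of interest.
snp = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
allele = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(round(ld_r2(snp, allele), 3))     # partial LD: one discordant haplotype
print(round(ld_r2(allele, allele), 3))  # complete LD: the markers always co-occur
```

An r² close to 1 means that the SNP is a near-perfect proxy for the HLA allele, which is the situation described for rs2395029 and HLA‐B*57:01.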
In genomics, SNP imputation can infer the identity of missing SNPs that were not genotyped on GWAS arrays (Delaneau, Zagury, & Marchini, 2013; McCarthy et al., 2016) by comparing whole‐genome SNP genotypes to a large reference panel of SNP haplotypes (Delaneau et al., 2013). SNP imputation fills genotyping gaps, and its performance and accuracy increased significantly when new large reference haplotype panels became available (McCarthy et al., 2016), which has contributed to a large number of discoveries over the past decade (Visscher et al., 2017). Beyond SNPs, imputation also applies to HLA polymorphisms themselves, alone or in combination. It has revealed key associations in numerous diseases (Fellay et al., 2007; Limou & Zagury, 2013; MacArthur et al., 2017; Trowsdale & Knight, 2013; Vince et al., 2020) and can, as such, lead to the development of new drugs or patient‐care guidance. Efforts to impute HLA alleles from these GWAS should be pursued to empower the community to go beyond simple SNP associations and to discover new disease associations (Khor et al., 2015; Meyer & Nunes, 2017; Shen et al., 2018); for example, HLA alleles can bring new functional immunogenomics data such as predicted amino acids, haplotypes (five genes: A~B~C~DRB1~DQB1), or imputed HLA‐C expression, all easily obtained with Easy‐HLA (Geffard et al., 2019; Vince et al., 2016). HLA allele imputation is a time‐ and cost‐effective alternative to the laborious HLA typing of all GWAS subjects. However, to rely on HLA imputation we must consider its accuracy, which depends on the reference panel quality (e.g., matching ancestry background, matching SNP composition; Khor et al., 2015) and size (e.g., number of individuals with both SNP and HLA typing data, referred to as SNP + HLA data; Pappas et al., 2018; Zheng et al., 2014).
Successful HLA imputation, therefore, depends on the availability of large and diverse reference panels, which warrants a major collective effort in organizing community resources. Here, we advocate for the development of the SNP‐HLA Reference Consortium (SHLARC), a new international network focused on assembling a large collection of high‐quality HLA and SNP data, especially from ethnically diverse populations, with the goal of developing and sharing large reference panels and helping researchers worldwide explore HLA allelic information from their cohorts.
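The principle of imputing an HLA allele from a reference panel can be illustrated with a deliberately naive sketch in Python: the query SNP haplotype is compared to each reference haplotype, and the HLA allele carried by the closest one is returned. This nearest‐haplotype lookup is our simplification for illustration only; it is not the method used by HIBAG or by any other published tool, and the panel haplotypes and their HLA‐B assignments are invented.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length SNP haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same number of SNPs")
    return sum(x != y for x, y in zip(a, b))


def impute_allele(query: str, panel: List[Tuple[str, str]]) -> str:
    """Return the HLA allele of the reference haplotype closest to the query."""
    if not panel:
        raise ValueError("reference panel is empty")
    best_haplotype, best_allele = min(panel, key=lambda entry: hamming(query, entry[0]))
    return best_allele


# Hypothetical reference panel: (SNP haplotype, HLA-B allele on that haplotype).
panel = [
    ("011010", "HLA-B*57:01"),
    ("110001", "HLA-B*08:01"),
    ("000111", "HLA-B*07:02"),
]

# The query differs from the first reference haplotype at a single SNP.
print(impute_allele("011011", panel))  # HLA-B*57:01
```

Even this toy version shows why panel size and diversity matter: an allele that is absent from the panel can never be returned, and a sparse panel leaves many queries far from any reference haplotype.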

RESULTS

We had access to the CAAPA (Consortium on Asthma among African‐ancestry Populations in the Americas) data set (Daya et al., 2019; Vince et al., 2020) that consists of 880 whole‐genome sequenced African American subjects with associated SNP GWAS data and typed HLA alleles at a two‐field resolution (corresponding to the protein level). We chose the HLA Genotype Imputation with Attribute Bagging (HIBAG) R package (Zheng et al., 2014) to test the impact of the number of subjects and SNPs on HLA imputation accuracy. HIBAG demonstrates improved imputation accuracy over other available methods (Pappas et al., 2018) and allows the creation of custom reference panels, using the machine‐learning technique of attribute bagging. Building reference panels requires heavy computing power, which scales almost linearly with the number of subjects and the number of SNPs (Zheng et al., 2014). The development of machine‐learning algorithms relies heavily on the evolution of computational power. We used graphics processing units (GPUs) as they are architecturally better suited to handle these computationally intensive tasks. For this project, we took advantage of the upgraded HIBAG version (HIBAG v1.15.3, HIBAG.gpu v0.9.1; Zheng, 2018) and used GPUs to build and compare multiple reference panels, with a fivefold reduction in computation time relative to central processing units (CPUs). Starting with the complete data set (n = 880 individuals), we simulated scenarios of reference panel building by creating a collection of training and test sets.
Each condition was replicated 10 times to assess the variability in the frequency of SNPs and HLA types and to display confidence intervals for each prediction: (a) from a set of 100 samples (n training = 100), we created 40 different reference panels with either increasing numbers of individuals (10/20/50/100) or increasing numbers of SNPs (500/1,000/5,000/10,000/24,504; see Supporting Information Methods) and (b) a test set (n test = 780) used to assess the accuracy of HLA imputation from the 40 different reference panels (5 HLA genes × [4 different numbers of individuals + 4 different numbers of SNPs]; Figure 1). Accuracy is defined as the percentage of correct HLA allele predictions.
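The evaluation design and the accuracy measure described above can be sketched as follows in Python. This is a minimal illustration, not the authors' pipeline: the panel sizes and SNP counts are those given in the text and the Figure 1 caption, the list of five genes is an assumption based on the five loci named in the Introduction, all function names are ours, and the typed/predicted genotypes are toy data.

```python
import random
from collections import Counter

GENES = ["A", "B", "C", "DRB1", "DQB1"]   # assumed list of the five HLA genes
PANEL_SIZES = [10, 20, 50, 100]           # individuals per reference panel
SNP_COUNTS = [500, 1000, 5000, 10000]     # SNP subsets (the full set has 24,504)


def split_cohort(sample_ids, n_train, seed):
    """Randomly split sample IDs into a training set and a disjoint test set."""
    rng = random.Random(seed)
    train = set(rng.sample(sample_ids, n_train))
    test = [s for s in sample_ids if s not in train]
    return sorted(train), test


def allele_accuracy(typed, predicted):
    """Percentage of typed alleles that are correctly predicted.

    A genotype is an unordered pair of alleles, so matching uses multisets:
    ("01:01", "02:01") versus ("02:01", "01:01") counts as two correct alleles.
    """
    if not typed or len(typed) != len(predicted):
        raise ValueError("typed and predicted must be non-empty and aligned")
    correct = 0
    for t, p in zip(typed, predicted):
        correct += sum((Counter(t) & Counter(p)).values())
    return 100.0 * correct / (2 * len(typed))


# 5 genes x (4 panel sizes + 4 SNP counts) = 40 reference panels per replicate.
n_panels = len(GENES) * (len(PANEL_SIZES) + len(SNP_COUNTS))

# One replicate of the 100 / 780 split, using placeholder sample IDs.
ids = [f"S{i:03d}" for i in range(880)]
train_ids, test_ids = split_cohort(ids, n_train=100, seed=1)

# Toy genotypes for two individuals at one locus.
typed = [("01:01", "02:01"), ("03:01", "03:01")]
predicted = [("02:01", "01:01"), ("03:01", "11:01")]

print(n_panels, len(train_ids), len(test_ids))  # 40 100 780
print(allele_accuracy(typed, predicted))        # 75.0
```

Repeating the split with ten different seeds would give the ten replicates from which the variability of each accuracy estimate is assessed.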

Figure 1

Influence of the number of individuals (a) and SNPs (b) in the HIBAG reference panel building on the accuracy of HLA allele prediction. From the CAAPA data set (N = 880 and SNPs = 24,504), we produced 10 training (n training = 100) and test (n test = 780) sets to assess HLA imputation accuracy in different scenarios. Each model was validated by comparing the typed HLA alleles to the model‐predicted HLA alleles across all individuals to provide an accuracy percentage (posterior probability call threshold = 0). (a) By randomly selecting individuals in the training data set, we created sub‐datasets containing 10, 20, and 50 individuals. Custom HIBAG models were computed for these subsets as well as for the whole 100 training individuals, using every available SNP. (b) Subsets of the training data set with 500, 1,000, 5,000, and 10,000 randomly selected SNPs (out of the 24,504 available SNPs) were created and the corresponding models computed. The number of SNPs on the x‐axis is indicative of the number of SNPs in the data set. The number of SNPs kept to create the model, which varies depending on the gene studied and the subset, is five times lower on average (see Tables S1.1 and S1.2). Note that the horizontal marks on each HLA gene curve indicate the accuracies obtained with the default African American HIBAG models. HIBAG, HLA Genotype Imputation with Attribute Bagging; HLA, human leukocyte antigen; SNP, single‐nucleotide polymorphism; nS, number of SNPs in the model; nT, number of individuals in the model

We observed that increasing the number of individuals in the reference panel increased HLA imputation accuracy (two‐field resolution) for all loci (Figure 1a). As an example, accuracy rose from 60% with 10 individuals to 93% with 100 individuals for HLA‐DQB1, and from 27% with 10 individuals to 77% with 100 individuals for HLA‐B on average.
We then compared the HLA imputation accuracies obtained from our CAAPA‐based test set with pre‐existing reference panels available on the HIBAG website (http://www.biostat.washington.edu/~bsweir/HIBAG/). These precomputed reference panels were all created with more than 100 individuals of African American ancestry (from 137 for HLA‐DQB1 to 171 for HLA‐B) from the HLARES data and the HapMap Yoruba population. The accuracies using the precomputed HIBAG reference panels (represented as horizontal lines in Figure 1a) ranged from 70% (HLA‐DRB1) to 87% (HLA‐A) and were lower than those obtained using the CAAPA‐based reference panels, which were built from a smaller number of individuals. This illustrates the importance of close matching of ancestry between the reference panel and the genotyped subjects, even within a single ancestry group (here African ancestry). In addition, we reduced the number of SNPs in the training data set (500, 1,000, 5,000 and 10,000 out of the 24,504 available chromosome‐6 SNPs) and observed that increasing the number of SNPs in the reference panel increased the HLA imputation accuracy for all genes (Figure 1b). For example, accuracy rose from 86% with 500 SNPs to 91% with the full set of 24,504 SNPs for HLA‐A, and from 65% with 500 SNPs to 77% with the full set of SNPs for HLA‐B. The number of SNPs in the training data set differs from the number of SNPs in the statistical model (or bag) as HIBAG does not use all SNPs provided in the input to create the reference panels (see Tables S1.1 and S1.2 for exact numbers). Indeed, HIBAG only includes SNPs within a 500‐kb window around the gene of interest, and only keeps those improving the model after random selection (see Supporting Information Methods). For in‐depth analysis of HLA imputation, we have also plotted the sensitivity and frequency of each allele to be predicted in the validation data set, to identify alleles that decrease the overall accuracy (see Figures S1–S5 and Table S2).
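Two steps mentioned above lend themselves to a short Python sketch: restricting SNPs to a window around the gene, and computing per‐allele frequency and sensitivity in the validation set. This is an illustration under stated assumptions, not HIBAG's implementation: we read the 500‐kb window as a flank on each side of the gene, we define sensitivity as the fraction of typed copies of an allele that are correctly predicted, and the SNP identifiers, coordinates, and genotypes are invented.

```python
from collections import Counter


def snps_in_window(snp_positions, gene_start, gene_end, flank_bp=500_000):
    """Keep SNPs located within flank_bp of the gene (boundaries inclusive)."""
    lo, hi = gene_start - flank_bp, gene_end + flank_bp
    return {snp: pos for snp, pos in snp_positions.items() if lo <= pos <= hi}


def per_allele_stats(typed, predicted):
    """Frequency and sensitivity of each typed allele.

    frequency   = typed copies of the allele / all typed allele copies
    sensitivity = correctly predicted copies / typed copies of the allele
    """
    typed_copies = Counter()
    hits = Counter()
    for t, p in zip(typed, predicted):
        typed_copies.update(t)
        hits.update(Counter(t) & Counter(p))  # multiset overlap = correct copies
    total = sum(typed_copies.values())
    return {
        allele: {"frequency": n / total, "sensitivity": hits[allele] / n}
        for allele, n in typed_copies.items()
    }


# Hypothetical SNP coordinates and gene boundaries (not real genome positions).
snps = {"snp_a": 1_000_000, "snp_b": 1_400_000, "snp_c": 2_200_000}
kept = snps_in_window(snps, gene_start=1_500_000, gene_end=1_503_000)
print(sorted(kept))  # ['snp_a', 'snp_b']

# Toy validation genotypes for three individuals at one locus.
typed = [("07:02", "57:01"), ("07:02", "08:01"), ("57:01", "57:01")]
predicted = [("07:02", "57:01"), ("07:02", "07:02"), ("57:01", "58:01")]
stats = per_allele_stats(typed, predicted)
for allele in sorted(stats):
    print(allele, stats[allele])
```

In this toy example the rarest allele (08:01) is never predicted correctly, which is the kind of pattern such per‐allele plots are meant to expose.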

DISCUSSION

Our results illustrate the importance of matching large reference panels with high SNP coverage to the input data set for efficient and accurate HLA allele imputation (Dilthey et al., 2016; Jia et al., 2013; Khor et al., 2015; Pappas et al., 2018). The goal of the SHLARC is to combine international expertise with data and computational resources. It will bring data to a level of interpretation that is key to solving questions on immune‐related pathologies through the development of innovative algorithms and powerful computational tools. To achieve this goal, we determined three main objectives (Figure 2):

Figure 2

The SNP‐HLA Reference Consortium (SHLARC) design. Aim 1: Increase the amount of SNP + HLA data available both in terms of quantity and diversity. Aim 2: Optimize SNP‐HLA imputation methods. Aim 3: The SHLARC website will allow users from the scientific community to benefit from the data and knowledge accumulated by the consortium on SNP‐to‐HLA allele imputation. From a list of SNPs and a selected ethnicity of interest, or alternatively from uploaded SNP genotype data sets, the best custom reference panel for HLA allele imputation will be built on our servers. HLA, human leukocyte antigen; SNP, single‐nucleotide polymorphism

Data. By bringing together scientists from around the world, we will collectively increase the amount of SNP + HLA data available, both in terms of quantity and genetic diversity. Building new reference panels from these data will improve the performance of HLA allele imputation from SNPs, as large, diverse, well‐defined genomic data are the prima materia of successful collaborations and machine‐learning applications for dissecting the genetic determinants of disease association.
Applied mathematical and computer sciences. We will further optimize SNP‐HLA imputation methods using the HIBAG tool, particularly for genetically diverse and admixed populations, as (a) the higher complexity of their MHC region is a challenge for imputation and (b) these populations are still underrepresented in genomic studies (Sirugo, Williams, & Tishkoff, 2019). In addition, we will explore new machine‐learning approaches such as deep learning to develop new, more efficient methods of HLA imputation.
Accessibility and service to the scientific community. Following the Haplotype Reference Consortium initiative (McCarthy et al., 2016), our network envisions building a free, user‐friendly webserver where researchers can access our improved imputation protocols by simply uploading their data and obtaining the most accurate possible HLA imputation for their data set. This service will offer several solutions: (a) ready‐to‐use anonymized reference panels for researchers wishing to impute HLA alleles themselves, (b) on‐demand creation and sharing of tailored (customized) reference panels based on data available in our database, and (c) a full SNP‐to‐HLA imputation service from uploaded raw SNP genotypes. We will also explore how to create the reference panel with the best fit for ancestry and genotyping platforms, given the queried samples, without the need for the centralization of individual data.
Indeed, distributed calculation techniques may allow reference panels to be created from data hosted on different servers without collecting all the information in a single place.
Our objectives require access to extensive computation power, which is readily available through several GPU servers within the Université de Nantes. For each submission, we aim to design custom reference panels, for which SNPs, HLA, and reference panel data will be securely stored on the University's servers. Importantly, reference panels represent statistical models that do not allow individual re‐identification. The current SHLARC partners share complementary expertise including but not limited to bioinformatics, population genetics, and immunogenetics. Importantly, our network is designed around data sharing to facilitate open research, as we believe research can be accelerated by freely sharing knowledge and data. With this in mind, we have added this consortium as a component of the 18th International HLA and Immunogenetics Workshop (https://www.ihiw18.org/). HLA imputation is primarily intended for research applications, as clinical applications such as hematopoietic stem cell transplantation (HSCT) cannot tolerate statistical uncertainty, although imputation might also be used to accelerate the pre‐selection of HSCT patients (Meyer & Nunes, 2017; Pappas et al., 2018).
The 1000 Genomes project (1000 Genomes Project Consortium et al., 2015) generated a large collection of polymorphisms from 2,504 individuals of diverse ancestry (SNPs, indels, and copy number variants), along with HLA allele typings (Gourraud et al., 2014), providing an informative overview of genetic diversity among human populations. However, a recent study by Abi‐Rached et al. (2018) highlighted the absence of several common HLA alleles (>1% allele frequency) from the 1000 Genomes project, which shows how HLA imputation results could be biased by an insufficient reference panel. With the proper sampling and a shared effort in gathering diverse data, HLA imputation could bridge the gap between HLA allele diversity and the understanding of its impact on phenotypes by harnessing the latent information stored in GWAS data sets to upgrade genetic epidemiological knowledge of immune‐related diseases. As shown previously (Okada et al., 2015), predicting HLA alleles from population‐matched reference panels not only increases the confidence in the predicted HLA alleles but, above all, allows prediction of specific HLA alleles that could not be imputed otherwise. Therefore, the informed choice of the applied model would strengthen the relation between HLA, ancestry, and disease risk factors. By applying this customization at a general level, we would assess ancestry with SNP relatedness, a consistent marker of population, rather than using self‐reported ancestry, which can often be misleading (Sanchez‐Mazas et al., 2012). To develop this ambitious project, we encourage willing participants with available two‐field HLA allele + SNP data sets to join the SNP‐HLA Reference Consortium (https://www.ihiw18.org/component-bio-informatics/snp-hla-reference/) and help empower the immunogenetics community to move into the era of immunogenomic association.

CONFLICT OF INTERESTS

The authors declare that there are no conflicts of interest.

28 in total

1. HLA-IMPUTER: an easy to use web application for HLA imputation and association analysis using population-specific reference panels.

Authors: Jiangshan J Shen; Chao Yang; Yong-Fei Wang; Ting-You Wang; Mengbiao Guo; Yu Lung Lau; Xuejun Zhang; Yujun Sheng; Wanling Yang
Journal: Bioinformatics Date: 2019-04-01 Impact factor: 6.937

2. Construction of a population-specific HLA imputation reference panel and its application to Graves' disease risk in Japanese.

Authors: Yukinori Okada; Yukihide Momozawa; Kyota Ashikawa; Masahiro Kanai; Koichi Matsuda; Yoichiro Kamatani; Atsushi Takahashi; Michiaki Kubo
Journal: Nat Genet Date: 2015-06-01 Impact factor: 38.330

3. TCR clonotypes modulate the protective effect of HLA class I molecules in HIV-1 infection.

Authors: Huabiao Chen; Zaza M Ndhlovu; Dongfang Liu; Lindsay C Porter; Justin W Fang; Sam Darko; Mark A Brockman; Toshiyuki Miura; Zabrina L Brumme; Arne Schneidewind; Alicja Piechocka-Trocha; Kevin T Cesa; Jennifer Sela; Thai D Cung; Ildiko Toth; Florencia Pereyra; Xu G Yu; Daniel C Douek; Daniel E Kaufmann; Todd M Allen; Bruce D Walker
Journal: Nat Immunol Date: 2012-06-10 Impact factor: 25.606

4. HLA-C Level Is Regulated by a Polymorphic Oct1 Binding Site in the HLA-C Promoter Region.

Authors: Nicolas Vince; Hongchuan Li; Veron Ramsuran; Vivek Naranbhai; Fuh-Mei Duh; Benjamin P Fairfax; Bahara Saleh; Julian C Knight; Stephen K Anderson; Mary Carrington
Journal: Am J Hum Genet Date: 2016-11-03 Impact factor: 11.025

5. Major histocompatibility complex genomics and human disease.

Authors: John Trowsdale; Julian C Knight
Journal: Annu Rev Genomics Hum Genet Date: 2013-07-15 Impact factor: 8.929

6. Strategies to work with HLA data in human populations for histocompatibility, clinical transplantation, epidemiology and population genetics: HLA-NET methodological recommendations.

Authors: A Sanchez-Mazas; B Vidan-Jeras; J M Nunes; G Fischer; A-M Little; U Bekmane; S Buhler; S Buus; F H J Claas; A Dormoy; V Dubois; E Eglite; J F Eliaou; F Gonzalez-Galarza; Z Grubic; M Ivanova; B Lie; D Ligeiro; M L Lokki; B Martins da Silva; J Martorell; D Mendonça; D Middleton; D Papioannou Voniatis; C Papasteriades; F Poli; M E Riccio; M Spyropoulou Vlachou; G Sulcebe; S Tonks; M Toungouz Nevessignsky; C Vangenot; A-M van Walraven; J-M Tiercy
Journal: Int J Immunogenet Date: 2012-04-26 Impact factor: 1.466

7. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog).

Authors: Jacqueline MacArthur; Emily Bowler; Maria Cerezo; Laurent Gil; Peggy Hall; Emma Hastings; Heather Junkins; Aoife McMahon; Annalisa Milano; Joannella Morales; Zoe May Pendlington; Danielle Welter; Tony Burdett; Lucia Hindorff; Paul Flicek; Fiona Cunningham; Helen Parkinson
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

8. Association study in African-admixed populations across the Americas recapitulates asthma risk loci in non-African populations.

Authors: Michelle Daya; Nicholas Rafaels; Tonya M Brunetti; Sameer Chavan; Albert M Levin; Aniket Shetty; Christopher R Gignoux; Meher Preethi Boorgula; Genevieve Wojcik; Monica Campbell; Candelaria Vergara; Dara G Torgerson; Victor E Ortega; Ayo Doumatey; Henry Richard Johnston; Nathalie Acevedo; Maria Ilma Araujo; Pedro C Avila; Gillian Belbin; Eugene Bleecker; Carlos Bustamante; Luis Caraballo; Alvaro Cruz; Georgia M Dunston; Celeste Eng; Mezbah U Faruque; Trevor S Ferguson; Camila Figueiredo; Jean G Ford; Weiniu Gan; Pierre-Antoine Gourraud; Nadia N Hansel; Ryan D Hernandez; Edwin Francisco Herrera-Paz; Silvia Jiménez; Eimear E Kenny; Jennifer Knight-Madden; Rajesh Kumar; Leslie A Lange; Ethan M Lange; Antoine Lizee; Pissamai Maul; Trevor Maul; Alvaro Mayorga; Deborah Meyers; Dan L Nicolae; Timothy D O'Connor; Ricardo Riccio Oliveira; Christopher O Olopade; Olufunmilayo Olopade; Zhaohui S Qin; Charles Rotimi; Nicolas Vince; Harold Watson; Rainford J Wilks; James G Wilson; Steven Salzberg; Carole Ober; Esteban G Burchard; L Keoki Williams; Terri H Beaty; Margaret A Taub; Ingo Ruczinski; Rasika A Mathias; Kathleen C Barnes
Journal: Nat Commun Date: 2019-02-20 Impact factor: 14.919