Variation analysis and gene annotation of eight MHC haplotypes: the MHC Haplotype Project.

Roger Horton, Richard Gibson, Penny Coggill, Marcos Miretti, Richard J Allcock, Jeff Almeida, Simon Forbes, James G R Gilbert, Karen Halls, Jennifer L Harrow, Elizabeth Hart, Kevin Howe, David K Jackson, Sophie Palmer, Anne N Roberts, Sarah Sims, C Andrew Stewart, James A Traherne, Steve Trevanion, Laurens Wilming, Jane Rogers, Pieter J de Jong, John F Elliott, Stephen Sawcer, John A Todd, John Trowsdale, Stephan Beck.

Abstract

The human major histocompatibility complex (MHC) is contained within about 4 Mb on the short arm of chromosome 6 and is recognised as the most variable region in the human genome. The primary aim of the MHC Haplotype Project was to provide a comprehensively annotated reference sequence of a single, human leukocyte antigen-homozygous MHC haplotype and to use it as a basis against which variations could be assessed from seven other similarly homozygous cell lines, representative of the most common MHC haplotypes in the European population. Comparison of the haplotype sequences, including four haplotypes not previously analysed, resulted in the identification of >44,000 variations, both substitutions and indels (insertions and deletions), which have been submitted to the dbSNP database. The gene annotation uncovered haplotype-specific differences and confirmed the presence of more than 300 loci, including over 160 protein-coding genes. Combined analysis of the variation and annotation datasets revealed 122 gene loci with coding substitutions of which 97 were non-synonymous. The haplotype (A3-B7-DR15; PGF cell line) designated as the new MHC reference sequence, has been incorporated into the human genome assembly (NCBI35 and subsequent builds), and constitutes the largest single-haplotype sequence of the human genome to date. The extensive variation and annotation data derived from the analysis of seven further haplotypes have been made publicly available and provide a framework and resource for future association studies of all MHC-associated diseases and transplant medicine.

Year:  2008        PMID: 18193213      PMCID: PMC2206249          DOI: 10.1007/s00251-007-0262-2

Source DB:  PubMed          Journal:  Immunogenetics        ISSN: 0093-7711            Impact factor:   2.846


Introduction

The MHC has long been believed to be the most important region in the human genome with respect to infection, inflammation, autoimmunity and transplant medicine (Lechler and Warrens 2000). This was recently confirmed by the largest genome-wide association study carried out to date for seven common diseases, including two autoimmune diseases (type 1 diabetes and rheumatoid arthritis) and one inflammatory disease (Crohn’s disease). The highest associations were found between the MHC and these two autoimmune diseases (The Wellcome Trust Case Control Consortium 2007). The complex aetiology of MHC-associated disease coupled with high density, polymorphism, linkage disequilibrium (LD) and frequent non-Mendelian inheritance of gene loci have made it challenging to identify variations that cause or contribute to disease phenotypes. Additional limiting factors have been our incomplete knowledge of the allelic variation of genes and regions flanking the nine classical human leukocyte antigen (HLA) loci and the lack of a single haplotype reference sequence, the original reference sequence being a composite of multiple MHC haplotypes (Mungall et al. 2003; The MHC Sequencing Consortium 1999). Recognizing that the future identification of variants conferring susceptibility to common disease is critically dependent on fully informative polymorphism and haplotype maps, the MHC Haplotype Consortium formed in 2000 with the aim to generate these critical data and to make them publicly available as a general resource for MHC-linked disease studies. Similar efforts, but with different experimental approaches, were also carried out in Japan (Shiina et al. 2006) and the USA (Smith et al. 2006). To develop the resource, eight HLA-homozygous MHC haplotypes were selected on the basis of conferring either protection against or susceptibility to two autoimmune diseases, type 1 diabetes and multiple sclerosis, and that represented common haplotypes in European populations. 
In the subsequent years, incremental data, materials and tools comprising this resource have been released (Allcock et al. 2002; Horton et al. 2004; Stewart et al. 2004; Traherne et al. 2006) and have contributed towards the construction of a high-resolution LD map and a first generation of HLA tag single nucleotide polymorphisms (SNPs; de Bakker et al. 2006; Miretti et al. 2005) and the identification of a second MHC susceptibility locus for multiple sclerosis (The International Multiple Sclerosis Genetics Consortium; Yeo et al. 2007). In this paper, we report the final account of this international effort, including analysis of the last four of the eight haplotypes, up-to-date variation statistics, gene annotation, population-specific aspects and a detailed description of the databases and tools for viewing and accessing the data in the context of existing genome annotation.

Materials and methods

Variation analysis

The method previously reported for comparison of MHC haplotype sequences (Stewart et al. 2004; Traherne et al. 2006) was extended to cover all eight haplotypes. Briefly, the most suitable method proved to be a clone-by-clone comparison using the discrepancy-list option of the cross_match program (Green, unpublished; http://www.phrap.org/), an implementation of the Smith–Waterman sequence alignment algorithm (Smith and Waterman 1981), using the alignment of a haplotype clone sequence with the appropriate overlapping reference sequence from a PGF clone or clones. All variations were submitted to dbSNP using the submitter handle SI_MHC_SNP and user identifiers of the form [PGF BAC clone sequence version]_[position in PGF BAC clone sequence]_[variation change]. Thus, AL662890.3_6645_TC indicates a substitution in which the base T at base position 6645 in AL662890.3 (PGF BAC 308K3) was substituted by C in the other haplotype. In the case of indels, the ‘variation change’ consists of ‘i’ or ‘d’ (for insertion or deletion), followed by a numerical value for the length of the indel, in turn followed by the inserted or deleted sequence if this was of 12 or fewer bases. For longer indels, an X value is given, which refers to a look-up table (http://www.sanger.ac.uk/HGP/Chr6/MHC/Xfile). Thus, AL662890.3_7470_d8TACACACA indicates a deletion in AL662890.3 after base 7470 of the eight bases ‘TACACACA’. Further, AL662890.3_10559_i5ATATT indicates an insertion in AL662890.3 starting after base 10559 of the five bases ‘ATATT’. AL662890.3_7475_d14X1 indicates a 14-base deletion after base 7475 in AL662890.3 of a sequence coded as X1, which is ‘ATACACACACACAC’. Major indel sequences, appearing as breaks in the cross_match discrepancy lists between two clones from different haplotypes, were extracted and subjected to analysis by RepeatMasker to detect the presence of retrotransposable elements.
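The identifier scheme described above can be unpacked mechanically. The following Python sketch is illustrative only: the function name `parse_variation_id` and the keys of the returned dictionary are hypothetical and are not part of the project's own tooling. Resolving an X code against the look-up table is left to the caller.

```python
import re

# Hypothetical helper for identifiers of the form
# [clone.version]_[position]_[variation change], as described in the text.
_ID_RE = re.compile(r"^(?P<clone>[A-Z]+\d+\.\d+)_(?P<pos>\d+)_(?P<change>.+)$")
_SUBST_RE = re.compile(r"^(?P<ref>[ACGT])(?P<alt>[ACGT])$")
_INDEL_RE = re.compile(r"^(?P<kind>[id])(?P<length>\d+)(?P<payload>[ACGT]+|X\d+)$")


def parse_variation_id(identifier):
    """Split a variation identifier into its components."""
    m = _ID_RE.match(identifier)
    if m is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    record = {"clone": m.group("clone"), "position": int(m.group("pos"))}
    change = m.group("change")

    subst = _SUBST_RE.match(change)
    if subst:
        record.update(type="substitution",
                      ref=subst.group("ref"), alt=subst.group("alt"))
        return record

    indel = _INDEL_RE.match(change)
    if indel is None:
        raise ValueError(f"unrecognised variation change: {change!r}")
    length = int(indel.group("length"))
    payload = indel.group("payload")
    is_code = payload.startswith("X")  # long indels carry a look-up code
    if not is_code and len(payload) != length:
        raise ValueError("stated indel length does not match the sequence")
    record.update(
        type="insertion" if indel.group("kind") == "i" else "deletion",
        length=length,
        sequence=None if is_code else payload,
        lookup_code=payload if is_code else None,
    )
    return record
```

For example, `parse_variation_id("AL662890.3_7475_d14X1")` reports a 14-base deletion whose sequence must be fetched from the look-up table under the code `X1`.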

Gene annotation

The finished genomic sequence for each of the eight haplotypes was analysed using a modified Ensembl pipeline (Searle et al. 2004). CpG islands were predicted on unmasked sequence. Interspersed and tandem repeats were masked out by RepeatMasker (Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-3.0. 1996–2004, http://www.repeatmasker.org) and Tandem Repeats Finder (TRF; Benson 1999), respectively. The sequence was then BLAST searched (BLAST, basic local alignment search tool; Altschul et al. 1990) using a vertebrate set of complementary DNAs (cDNAs) and expressed sequence tags (ESTs) from the European Molecular Biology Laboratory (EMBL) nucleotide database (Kulikova et al. 2007), followed by the re-alignment of significant hits. Non-redundant proteins were aligned similarly. Protein domain matches were provided through alignment of Pfam to the genomic sequence using Genewise (Birney et al. 2004), thereby providing protein domain data to the annotator. Ab initio gene predictions were performed by Genscan (Burge and Karlin 1997) and Fgenesh (Salamov and Solovyev 2000), and potential transcriptional start sites were predicted by Eponine (Down and Hubbard 2002). Analysis results were displayed, and annotation was performed through an in-house annotation software system. Genes were manually annotated according to the human and vertebrate analysis and annotation (HAVANA) guidelines (http://www.sanger.ac.uk/HGP/havana/) using evidence based on comparison with external databases as of August 2005. All gene structures are supported by transcriptional evidence, either from cDNA, EST, or protein. In general, annotations are supported by best-in-genome evidence. Haplotype-specific evidence is assigned where possible. As with previous MHC annotation (Stewart et al. 2004; Traherne et al. 2006), some olfactory receptors have been built upon protein homology alone because of their restricted expression. 
Locus and variant types were annotated according to established standards (Harrow et al. 2006), with the modification that, within the MHC region, the artefact locus has been used to tag historically annotated structures that are no longer deemed valid. HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DQA1, and HLA-DQB1 allele types were assessed by comparison against the IMGT/HLA database (http://www.ebi.ac.uk/imgt/hla/; Marsh et al. 2005).

Annotation status of haplotypes

The PGF, COX, and QBL haplotypes have already been annotated in detail (Stewart et al. 2004; Traherne et al. 2006). It was decided, however, to re-annotate and update this annotation to maintain consistency between all eight haplotypes with the current supporting evidence and pipeline analyses. The SSTO haplotype was manually annotated de novo. The new annotation from the PGF haplotype was projected through a DNA–DNA alignment to each of the remaining haplotypes (APD, DBB, MANN and MCF) where possible. This projection was checked thoroughly and non-alignable regions were manually adjusted (including the C4 and HLA-DRB1 hypervariable regions). Polyadenylation sites and signals were not annotated for haplotypes APD, DBB, MANN and MCF because of time constraints. In the main, however, these features may be assumed to correspond to the same positions as in the first four haplotypes.

Combination of variation and annotation data

By employing a series of Perl scripts, the array of haplotype variation was combined with the annotation of gene loci, repeat elements and microsatellites, extracted from the Vertebrate Genome Annotation (VEGA) database in general feature format (GFF; http://www.sanger.ac.uk/Software/formats/GFF/), to determine the variation status of all loci.
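The project's Perl scripts are not reproduced here. As a rough illustration of the underlying idea, the Python sketch below assigns variant positions to annotated features read from GFF-style lines. The function names, the toy feature types and the `"intergenic"` fallback label are assumptions made for this example, and a real analysis would also need to handle strands, multiple sequences and nested features.

```python
from collections import Counter


def read_gff_features(lines):
    """Return (feature_type, start, end) tuples from tab-separated GFF lines.

    Assumes the conventional column order: seqname, source, feature,
    start, end, score, strand, frame, attributes.
    """
    features = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        features.append((cols[2], int(cols[3]), int(cols[4])))
    return features


def count_variants_by_feature(positions, features):
    """Count variants per overlapping feature type (1-based, inclusive)."""
    counts = Counter()
    for pos in positions:
        hits = {ftype for ftype, start, end in features if start <= pos <= end}
        if hits:
            counts.update(hits)
        else:
            counts["intergenic"] += 1  # label chosen for this sketch
    return counts
```

A variant falling inside two overlapping features (for example a repeat within an intron) is counted once for each feature type in this sketch.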

Distribution of sequenced HLA haplotypes in Europeans

To assess the distribution of sequenced haplotypes at the population level, 180 founder haplotypes were reconstructed using genotypic data from Centre d’Etude du Polymorphisme Humain (CEPH) trios (de Bakker et al. 2006). A ~214 kb segment spanning the HLA-DRB1-DQB1 genes was selected for the analyses. This segment, represented by 54 SNPs, is delimited by rs2187823 and rs2856691, with NCBI build 36 chromosome 6 coordinates 32547486 and 32761413, respectively. Phased haplotypes with known HLA-DRB1-DQB1 alleles were then used to construct a neighbor-joining tree (Kumar et al. 2001) and a phylogenetic network (Bandelt et al. 1999).
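As a small worked check, the coordinates quoted above give a segment of about 214 kb. The Python sketch below also shows one simple way to turn phased SNP haplotypes into a pairwise distance matrix of the kind a neighbor-joining method takes as input. The text does not state which distance measure was used, so the plain mismatch count (Hamming distance) and all function names here are assumptions for illustration.

```python
# Coordinates quoted in the text (NCBI build 36, chromosome 6).
SEGMENT_START = 32547486  # rs2187823
SEGMENT_END = 32761413    # rs2856691


def segment_length_kb(start=SEGMENT_START, end=SEGMENT_END):
    """Distance between the two delimiting SNPs, in kilobases."""
    return (end - start) / 1000


def hamming(a, b):
    """Number of SNP positions at which two phased haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same SNPs")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(haplotypes):
    """Return (names, matrix) of pairwise mismatch counts.

    `haplotypes` maps a label to a string of alleles, one per SNP.
    """
    names = sorted(haplotypes)
    matrix = [[hamming(haplotypes[p], haplotypes[q]) for q in names]
              for p in names]
    return names, matrix
```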

Resources

All sequences presented in this paper have been submitted to the EMBL/GenBank/DNA Data Bank of Japan (DDBJ) database and allocated accession numbers. For clarity, all bacterial artificial chromosome (BAC) clones are referred to using their accession numbers. The annotation of each haplotype has been entered in the VEGA database and is accessible through its browser (http://www.VEGA.sanger.ac.uk). All variations from the study were submitted to dbSNP (http://www.ncbi.nlm.nih.gov/SNP) using the submitter handle SI_MHC_SNP. BAC clones from the CHORI-501 (PGF) and CHORI-502 (COX) libraries can be requested from BACPAC resources (http://www.bacpac.chori.org/). Clones from the other libraries can be requested from john.elliott@ualberta.ca. The web site for the MHC Haplotype Project provides links to various data resources (http://www.sanger.ac.uk/HGP/Chr6/MHC/). DAS sources for all substitutions and indels are available from http://www.das.ensembl.org/das as follows: ens_35_COX_SNP, ens_35_COX_DIP, ens_35_QBL_SNP, ens_35_QBL_DIP, ens_35_SSTO_SNP, ens_35_SSTO_DIP, ens_35_APD_SNP, ens_35_APD_DIP, ens_35_DBB_SNP, ens_35_DBB_DIP, ens_35_MANN_SNP, ens_35_MANN_DIP, ens_35_MCF_SNP and ens_35_MCF_DIP. These can be accessed via the VEGA browser.
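The DAS source names listed above follow a regular pattern, which the short Python sketch below reproduces. The function name is hypothetical. Reading the `SNP` and `DIP` suffixes as the substitution and indel tracks, respectively, is an inference from the sentence above rather than something the text states outright.

```python
# The seven haplotypes compared against the PGF reference, in the
# order in which their DAS sources are listed in the text.
HAPLOTYPES = ["COX", "QBL", "SSTO", "APD", "DBB", "MANN", "MCF"]


def das_source_names(haplotypes=HAPLOTYPES, build="35"):
    """Build names such as 'ens_35_COX_SNP' for each haplotype and track."""
    return [f"ens_{build}_{hap}_{kind}"
            for hap in haplotypes
            for kind in ("SNP", "DIP")]
```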

Results and discussion

One of the main aims of the MHC Haplotype Project was to generate a comprehensive variation map of this most variable region of the human genome. To achieve this, eight haplotypes were sequenced and subjected to variation analysis. Table 1 details the lengths of the sequence contigs, the number of sequence gaps and the allelic types of major HLA loci for each haplotype. Of the eight haplotypes sequenced, three have already been described: PGF and COX (Stewart et al. 2004), both of which formed single contigs of approximately 4.7 Mb, and QBL (Traherne et al. 2006), of approximately 4.2 Mb but with five gaps. The remaining haplotypes sequenced all contained gaps, their coverage ranging from 2.33 Mb (DBB with 28 gaps) to 4.19 Mb (MANN with 10 gaps).
Table 1

Haplotype sequence contig length, number of gaps and HLA allele types

Haplotype | Length (bp) | Gaps | HLA-A | HLA-B | HLA-C | HLA-DQA1 | HLA-DQB1 | HLA-DRB1
PGF | 4754829 | 0 | A*03010101 | B*070201 | Cw*07020103 | DQA1*010201 | DQB1*0602 | DRB1*150101
COX | 4731878 | 0 | A*01010101 | B*080101 | Cw*070101 | DQA1*050101 | DQB1*020101 | DRB1*030101
QBL | 4249272 | 5 | A*260101 | B*180101 | Cw*050101 | DQA1*050101 | DQB1*020101 | DRB1*030101
APD | 4160965 | 16 | A*01010101 | - | - | - | - | -
DBB | 2330101 | 28 | A*02010101 | - | Cw*06020101 | DQA1*0201 | DQB1*030302 | DRB1*070101
MANN | 4191014 | 10 | A*290201 | B*440301 | Cw*160101 | DQA1*0201 | DQB1*0202 | DRB1*070101
MCF | 4087413 | 15 | [A*020101] | B*15010101 | Cw*030401 | DQA1*0303 | DQB1*030101 | -
SSTO | 3704249 | 22 | A*320101 | B*44020101 | Cw*050101 | DQA1*030101 | DQB1*030501 | DRB1*040301

Sequence length (bp) and number of gaps in each haplotype sequence, together with the HLA gene types obtained by BLAST against the IMGT/HLA database. Dashes or data in square brackets indicate the absence or the partial presence, respectively, of a gene owing to a sequence gap.

For the variation analysis, each of the above haplotypes was compared with the PGF reference sequence, resulting in the identification of 44,544 variations (37,451 substitutions and 7,093 indels, Table 2), which have all been submitted to dbSNP. The success of this exercise is illustrated by the fact that examination of this public database (NCBI dbSNP build 127, March 2007) showed that there were only a further 19,598 variations, submitted by other laboratories, in this region which were not identified by this project. In accordance with the annotation that we also generated for each haplotype (see below), the variations shown in Table 2 were further classified as untranslated region (UTR), exonic, intronic, intergenic and eight more sub-categories (Table 3). Coding substitutions, which are of particular interest with respect to altered functionality, were further classified as synonymous, non-synonymous conservative, or non-synonymous non-conservative and grouped depending on whether they affected HLA or other genes (Table 4). The actual variations and affected amino acids can be viewed using the VEGA browser as illustrated in Fig. 1 and described in the corresponding section later on. In addition, we have analysed all haplotype sequences for inversions, which represent another important variation category that has been linked to genomic disorders (Shaw and Lupski 2004). Using Ssaha2 (Ning et al. 2001), we found no evidence of any inversion polymorphism within the generated sequences but could not exclude large-scale (e.g. involving entire MHC) inversions with breakpoints outside the MHC regions sequenced here.
Table 2

Distribution of substitutions and indels amongst haplotypes

Haplotype | Substitutions | Indels | All
COX | 15,967 | 2,393 | 18,360
QBL | 15,282 | 2,360 | 17,642
SSTO | 14,982 | 2,300 | 17,282
APD | 4,230 | 683 | 4,913
DBB | 14,255 | 1,975 | 16,230
MANN | 12,102 | 1,654 | 13,756
MCF | 10,790 | 1,545 | 12,335
Overall | 37,451 | 7,093 | 44,544

Number of variations found by comparing the PGF haplotype sequence with each of the other haplotype sequences in turn.

Table 3

Distribution of substitutions and indels within different sequence regions amongst haplotypes

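Sequence region | Base pairs | COX S | COX ID | QBL S | QBL ID | SSTO S | SSTO ID | APD S | APD ID | DBB S | DBB ID | MANN S | MANN ID | MCF S | MCF ID
Coding | 247,505 | 353 | 8 | 503 | 19 | 380 | 2 | 74 | 0 | 351 | 6 | 401 | 9 | 348 | 2
UTR | 155,960 | 382 | 34 | 438 | 59 | 331 | 35 | 38 | 9 | 326 | 39 | 303 | 35 | 309 | 31
Intronic | 1,283,472 | 3,141 | 571 | 3,135 | 590 | 2,658 | 505 | 602 | 147 | 2,897 | 509 | 2,185 | 393 | 2,126 | 404
Total intragenic | 1,686,937 | 3,876 | 613 | 4,076 | 668 | 3,369 | 542 | 714 | 156 | 3,574 | 554 | 2,889 | 437 | 2,783 | 437
Pseudogenic | 57,223 | 235 | 15 | 226 | 21 | 227 | 19 | 101 | 8 | 191 | 10 | 109 | 6 | 113 | 10
Pseudogenic intron | 63,108 | 507 | 54 | 220 | 27 | 215 | 18 | 158 | 20 | 258 | 22 | 98 | 13 | 179 | 13
Transcript exon | 78,092 | 190 | 30 | 207 | 33 | 119 | 22 | 71 | 8 | 136 | 17 | 88 | 16 | 70 | 15
Transcript intron | 332,705 | 1,243 | 197 | 1,186 | 216 | 1,053 | 155 | 85 | 29 | 1,245 | 192 | 1,081 | 161 | 268 | 53
REPEATS:
LINEs | 608,429 | 2,110 | 221 | 2,015 | 240 | 2,388 | 255 | 755 | 93 | 2,097 | 217 | 2,084 | 193 | 1,530 | 164
SINEs | 428,567 | 1,381 | 428 | 1,316 | 401 | 1,311 | 385 | 346 | 134 | 1,229 | 318 | 928 | 241 | 936 | 271
Other repeats | 487,863 | 2,605 | 207 | 2,518 | 229 | 2,514 | 207 | 925 | 56 | 2,748 | 199 | 2,198 | 177 | 2,170 | 169
Total in repeats | 1,524,859 | 6,096 | 856 | 5,849 | 870 | 6,213 | 847 | 2,026 | 283 | 6,074 | 734 | 5,210 | 611 | 4,636 | 604
Microsatellite | 15,185 | 186 | 168 | 95 | 85 | 222 | 198 | 14 | 29 | 60 | 76 | 61 | 71 | 90 | 68
All above | 3,297,590 | 12,333 | 1,933 | 11,859 | 1,920 | 11,418 | 1,801 | 3,169 | 533 | 11,538 | 1,605 | 9,536 | 1,315 | 8,139 | 1,200
Other intergenic | 996,720 | 3,634 | 460 | 3,423 | 440 | 3,564 | 499 | 1,061 | 150 | 2,717 | 370 | 2,566 | 339 | 2,651 | 345
Total | 4,754,829 | 15,967 | 2,393 | 15,282 | 2,360 | 14,982 | 2,300 | 4,230 | 683 | 14,255 | 1,975 | 12,102 | 1,654 | 10,790 | 1,545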

Variations shown in Table 2 ascribed to sequence regions identified during annotation. These included exonic, UTR and intronic regions of coding; pseudogenic and transcript loci; repeat elements, microsatellites and other intergenic regions

S Substitution, ID indel

Table 4

Codon variation caused by substitutions in HLA and other gene loci

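Codon variation by virtue of substitutions | COX HLA | COX Other | COX Total | QBL HLA | QBL Other | QBL Total | SSTO HLA | SSTO Other | SSTO Total | APD HLA | APD Other | APD Total | DBB HLA | DBB Other | DBB Total | MANN HLA | MANN Other | MANN Total | MCF HLA | MCF Other | MCF Total
Synonymous | 49 | 81 | 130 | 71 | 106 | 177 | 72 | 57 | 129 | 1 | 24 | 25 | 66 | 69 | 135 | 59 | 79 | 138 | 80 | 52 | 132
Non-synonymous, total | 125 | 76 | 201 | 184 | 121 | 305 | 164 | 72 | 236 | 19 | 27 | 46 | 120 | 76 | 196 | 144 | 91 | 235 | 147 | 56 | 203
Non-synonymous, conservative | 68 | 42 | 110 | 102 | 72 | 174 | 92 | 39 | 131 | 11 | 18 | 29 | 67 | 40 | 107 | 77 | 60 | 137 | 82 | 35 | 117
Non-synonymous, non-conservative | 57 | 34 | 91 | 82 | 49 | 131 | 72 | 33 | 105 | 8 | 9 | 17 | 53 | 36 | 89 | 67 | 31 | 98 | 65 | 21 | 86
Total | 174 | 157 | 331 | 255 | 227 | 482 | 236 | 129 | 365 | 20 | 51 | 71 | 186 | 145 | 331 | 203 | 170 | 373 | 227 | 108 | 335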

Coding substitutions analysed for their effects on protein sequences and listed by haplotype for HLA genes (HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DRA, HLA-DQA1, HLA-DQB1, HLA-DPA1, HLA-DPB1) and for all other genes, according to the changes they induced in codons: synonymous, non-synonymous conservative, or non-synonymous non-conservative.

Fig. 1

Annotation and variation data in VEGA. VEGA ‘overview’ (a), ‘detailed view’ (b) and ‘basepair view’ (c) example of the variation in the OR2J1 locus in which a STOP codon is present in all haplotypes except MCF

There have been several previous annotations of the gene content of the MHC (Horton et al. 2004; Mungall et al. 2003; Stewart et al. 2004; The MHC Sequencing Consortium 1999; Traherne et al. 2006). The maximum region annotated in this study extends from the telomeric ZNF452 gene in the MHC extended class I region (COX haplotype) to the centromeric ZBTB9 gene just telomeric of the MHC extended class II region (PGF and SSTO haplotypes). The PGF haplotype (Stewart et al. 2004) remains the longest complete MHC haplotype, encompassing 320 annotated loci with 1,267 variants. The number of variants ascribed to each locus-type is listed in Table 5. A comparison of the statistics for loci in each haplotype is shown in Table 6.
Table 5

Splice-variant statistics for PGF annotation

Type | No.
Total splice variants | 1,267
Coding | 523
Unprocessed_pseudogene | 50
Processed_pseudogene | 41
Expressed_pseudogene | 7
Transcript | 271
Putative | 71
Retained_intron | 263
Nonsense_mediated_decay | 30
Artefact | 11
Total loci | 320

Splice variants annotated in the PGF haplotype.

Table 6

Gene annotation statistics for eight MHC haplotypes

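Locus type | PGF | COX | QBL | SSTO | APD | DBB | MANN | MCF
Coding | 165 | 159 | 150 | 131 | 82 | 146 | 129 | 150
Transcript | 28 | 28 | 26 | 26 | 19 | 26 | 27 | 22
Putative | 18 | 18 | 15 | 15 | 6 | 16 | 12 | 14
Pseudogenes total | 98 | 95 | 93 | 98 | 59 | 92 | 95 | 75
Unprocessed | 50 | 48 | 48 | 53 | 36 | 52 | 53 | 42
Processed | 41 | 42 | 40 | 39 | 19 | 34 | 37 | 28
Expressed | 7 | 5 | 5 | 6 | 4 | 6 | 5 | 5
Artefact | 11 | 11 | 10 | 11 | 0 | 0 | 0 | 0
Total loci | 320 | 311 | 294 | 281 | 166 | 281 | 264 | 261
Total variants | 1,267 | 1,191 | 1,155 | 1,058 | 568 | 1,138 | 960 | 1,115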

Annotation statistics for loci in each haplotype. For definitions of locus types see “Materials and methods”.

VEGA database and browser

The VEGA database provides access to gene annotation of the eight MHC haplotype sequences, a valuable public resource and a means of integrating annotation and variation data. The VEGA database also provides the facility to download nucleotide or peptide sequences for genes of interest, by selecting ‘export cDNA’ or ‘export peptide’ from the menu obtained by clicking on gene cartoons in the VEGA ‘detailed view’ or ‘basepair view’ window. From these, any desired alignments can be made. Variation data may be viewed in the browser linked to a distributed annotation system (DAS) source of any given variation (see “Materials and methods”). An example of the use of this browser to view a C to T substitution is illustrated for the OR2J1 locus (Fig. 1). An overview of the genomic environment is given in Fig. 1a, showing the gene within a cluster of olfactory gene loci on chromosome 6. The detailed view (Fig. 1b) shows OR2J1 with associated variations in all haplotypes. The basepair view (Fig. 1c) illustrates the presence of the C/T substitution in all haplotypes except MCF, and its positioning above the translated sequence, at the first position of a CAG codon, indicating the presence of a stop codon instead of glutamine.
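The effect of the OR2J1 substitution described above (C to T at the first position of a CAG codon) can be checked against the standard genetic code. The Python sketch below is illustrative: the function name and the three labels it returns are choices made for this example. It does not attempt the conservative versus non-conservative split used in Table 4, because the criterion for that split is not given in this section.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(codon, index, new_base):
    """Return (old_aa, new_aa, label) for a single-base change in a codon.

    `index` is the 0-based position within the codon; '*' denotes stop.
    """
    codon = codon.upper()
    mutated = codon[:index] + new_base.upper() + codon[index + 1:]
    old_aa, new_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if old_aa == new_aa:
        label = "synonymous"
    elif new_aa == "*":
        label = "stop gained"
    else:
        label = "non-synonymous"
    return old_aa, new_aa, label
```

Calling `classify_substitution("CAG", 0, "T")` gives glutamine (`Q`) changed to a stop (`*`), matching the description of the truncated OR2J1 allele.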

Annotation changes

In addition to loci annotated in the previous studies, newly recognised loci with official HUGO Gene Nomenclature Committee (HGNC) symbols have also been annotated. These included the mitochondrial coiled-coil domain protein 1 gene MCCD1 (Semple et al. 2003) and the related unprocessed pseudogenes MCCD1P1 and MCCD1P2, as well as the zinc-finger and BTB domain-containing protein gene ZBTB9, annotated at the very centromeric boundary of the sequenced region. The C6orf21 gene (De Vet et al. 2003; XXbac–BPG32J3.17-001) of the MHC class III region was annotated as a separate locus from the adjacent centromeric locus LY6G6D (splice variants XXbac–BPG32J3.4-001 and XXbac–BPG32J3.4-002). There was, however, a further coding splice variant of LY6G6D (XXbac–BPG32J3.4-004), which spanned not only the other LY6G6D splice variants but also C6orf21, suggesting that this is a possible so-called chimeric transcript (Parra et al. 2006).

HLA-DRB1 hypervariable region

Of the five newly annotated MHC haplotypes, APD alone exhibited the HLA-DRB DR52 antigenic specificity found on DRB1*03, DRB1*05 (DRB1*11 and DRB1*12) and DR6 (DRB1*13 and DRB1*14) haplotypes and encoded by HLA-DRB3, whereas the remainder (SSTO, DBB, MANN and MCF) exhibited the DR53 specificity, encoded by HLA-DRB4, here annotated for the first time in genomic sequence. The HLA-DRB DR53 sequences included three known loci (HLA-DRB4, HLA-DRB7 and HLA-DRB8), as well as three novel pseudogenes (DASS-218M11.1, DASS-23B5.1 and DASS-23B5.2). DASS-23B5.1 corresponds to a pseudogene derived from the gene for the protein kinase, interferon-inducible double-stranded RNA dependent activator (Chida et al. 2001) for which the symbol PRKRAP1 has now been recognised. A further processed pseudogene, FAM8A5P (Jamain et al. 2001), was also annotated in the DR53 specificity.

HLA-V and HLA-P

Our analysis showed that the two unprocessed class I pseudogenes HLA-V and HLA-P (previously HLA-75 and HLA-90; Geraghty et al. 1992) should in fact be merged; individually they merely represented the 5′ and 3′ portions of a single unprocessed pseudogene, separated by repeat elements. According to our annotation guidelines (see “Materials and methods”), the newly merged locus was assigned the symbol from the 3′ component, in this case, HLA-P. Best-in-genome nucleotide evidence was found to support five transcript variants at the 5′ end, which, together with evidence for continued transcription at the locus, led us to designate it as a transcribed pseudogene. A further six expressed pseudogenes were identified in the MHC region (HLA-DPB2, HLA-J, CYP21A1P, HLA-DRB6, HLA-L and PPP1R2P1).

RCCX hypervariable region

This module within the MHC class III region, named for its gene content (RP-C4A/B-CYP21-TNXB), may be duplicated or triplicated (Chung et al. 2002) and, alongside the pseudogenes CYP21A1P, TNXA and STK19P, contains the complement component gene, C4, in either or both of the two versions, C4A and C4B (Awdeh and Alper 1980). This gene may also be present in either long (C4AL, C4BL) or short (C4AS, C4BS) forms depending on the presence or absence of an inserted HERVC4 element in intron 9. Contrary to our previous annotation (Stewart et al. 2004; see also legend to Fig. 2), the PGF haplotype now appears to possess an arrangement in which C4AL precedes C4BL, whereas COX has a single module with C4BS and QBL has a single module with C4AS (Traherne et al. 2006). For the new haplotype sequences reported in this paper, SSTO was bimodular with two copies of C4BL, whereas DBB was bimodular with C4AL followed by C4BS. Although a sequence gap was present in MCF, this haplotype appeared to be bimodular in that, although the telomeric copy of the C4 gene could not be identified, there was evidence for the pseudogenes CYP21A1P, TNXA and STK19P in a telomeric module. The second centromeric module in MCF contained C4AL. The RCCX region in the APD and MANN haplotypes was incomplete because of sequence gaps.
Fig. 2

Variation and annotation map of eight MHC haplotypes. The map represents the complete reference sequence (orange bar split into three 1.6 Mb sections) labelled PGF and marked with a scale (Mb) and approximate megabase positions on the NCBI36 build of chromosome 6 (grey milestones). Below the reference sequence are arrows representing gene positions and orientations colour-coded for variation status (invariable, black; with synonymous variation only, green; with non-synonymous, conservative variation, red; with non-synonymous, non-conservative variation, purple; see Table 8) and their symbols on a band denoting MHC class (extended class I, green; class I, yellow; class III, pale orange; class II, light blue; extended class II, pink; outside MHC, pale grey). Above the reference sequence, coloured bands represent the sequences of the other seven haplotypes (COX, orange; QBL, mauve; APD, yellow; DBB, green; MANN, light blue; SSTO, dark blue; MCF, purple) with sequence gaps in dark grey; the RCCX hyper-variable region shown with green (C4A block) and/or red (C4B block) or black (block absent), and the HLA–DRB hyper-variable region in shades of blue-green. Above each haplotype bar, a bar-graph represents total variation between the haplotype and the reference sequence (total variations/10 kb) in dark red. Re-examination of the sequence AL645922 from the PGF haplotype, which contains the RCCX region, has shown that the original assembly was erroneous. Correction of these errors leads us now to the conclusion that the C4A gene precedes the C4B gene in this clone sequence. This new gene order is reflected in Fig. 2
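The bar-graphs in Fig. 2 report total variations per 10 kb. A minimal Python sketch of that kind of binning is given below. It assumes 1-based variant positions on the reference sequence, and the function name and the choice of window start coordinates as keys are illustrative rather than taken from the project's code.

```python
from collections import Counter


def variation_density(positions, window=10_000):
    """Count variants per fixed-size window along the reference.

    Returns a dict mapping the 1-based start coordinate of each
    non-empty window to the number of variants it contains.
    """
    counts = Counter((pos - 1) // window for pos in positions)
    return {idx * window + 1: n for idx, n in sorted(counts.items())}
```

Windows containing no variants are simply absent from the result, so a plotting routine would need to fill them in as zero.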

C6orf205

Variability in the C6orf205 gene has been reported to consist of extension of the minisatellite in exon 2 from 27 copies in PGF and COX to 31 copies in QBL (Traherne et al. 2006). In the newly annotated haplotypes, we found the minisatellite to extend to 29 in MANN. The APD, DBB and MCF possessed 27 copies. There was a sequence gap in this region in the SSTO haplotype.

MICA

The known allelic polymorphism of MICA reported for the DRB1*03 QBL cell line sequence, in which a four-base insertion (GCGT) extended the open reading frame in coding exon 5 (Traherne et al. 2006), was also present in the DRB1*07 MANN haplotype. The insertion was absent from PGF, COX and SSTO. No sequence was available in APD, DBB and MCF for this gene.

PPP1R2P1

The intronless pseudogene PPP1R2P1 reported to have a full-length open reading frame in the PGF, COX and QBL haplotypes (Stewart et al. 2004; Traherne et al. 2006) was found to have a similar open reading frame in the DBB and MANN haplotypes but to have the frameshift mutation seen in the original chromosome reference sequence (Mungall et al. 2003) in the SSTO, APD and MCF haplotypes.

PSORS1C1

The QBL haplotype remains the only one in which there was a single nucleotide deletion in a polyC tract of exon 5 (Traherne et al. 2006). DBB, MANN and MCF resembled PGF and COX. No sequence was available for this gene in SSTO or APD.

POU5F1

The PGF haplotype has been reported to have a disrupted start codon for an alternative splice variant of POU5F1 (Traherne et al. 2006). This disruption was not present in COX or QBL, nor was it present in the further haplotypes reported in this paper, namely SSTO, DBB, MANN and MCF. APD had no sequence in this region.

OR2J1

The olfactory receptor OR2J1 has been reported to have both functional and non-functional alleles (Ehlers et al. 2000), the latter the result of a premature stop codon at amino acid position 194 introduced by a substitution in the coding sequence. In our annotation, we found the PGF and MCF haplotypes to contain the full coding sequence, whereas the COX, QBL, SSTO, APD, DBB and MANN haplotypes contained the truncated sequence as an unprocessed pseudogene (see above and Fig. 1).

Other annotation differences

Other loci included in the current but not the previous PGF annotation were HCG4P11, HCG4P8, HCG4P7, HCG4P5, HCG4P3 and the loci without symbols listed in Table 7. Previously annotated loci not annotated in this study or considered artefacts because they did not reach our current standards of annotation included HLA-X, C6orf215, HCG2P7, HCG8, HCP5P2, HCP5P3, HCP5P6, HCP5P12, HCP5P13, HCP5P14, HCP5P15, HCG8 and HCG26.
Table 7

Other newly annotated loci

Locus                  Locus type
XXbac-BCX196D17.5      Transcript
XXbac-BPG116M5.14      Putative
XXbac-BPG116M5.15      Putative
XXbac-BPG116M5.16      Putative
XXbac-BPG118E17.9      Putative
XXbac-BPG126D10.10     Processed pseudogene
XXbac-BPG126D10.11     Processed pseudogene
XXbac-BPG13B8.10       Transcript
XXbac-BPG13B8.9        Unprocessed pseudogene
XXbac-BPG154L12.4      Putative
XXbac-BPG181B23.4      Transcript
XXbac-BPG181M17.4      Putative
XXbac-BPG246D15.8      Transcript
XXbac-BPG248L24.10     Unprocessed pseudogene
XXbac-BPG248L24.9      Processed pseudogene
XXbac-BPG249D20.9      Putative
XXbac-BPG250I8.13      Transcript
XXbac-BPG254F23.5      Putative
XXbac-BPG254F23.6      Putative
XXbac-BPG254F23.7      Transcript
XXbac-BPG254F23.7      Putative
XXbac-BPG27H4.7        Transcript
XXbac-BPG27H4.8        Transcript
XXbac-BPG294E21.7      Processed pseudogene
XXbac-BPG296P20.14     Putative
XXbac-BPG296P20.15     Putative
XXbac-BPG299F13.14     Putative
XXbac-BPG308J9.3       Transcript
XXbac-BPG308K3.5       Putative
XXbac-BPG308K3.6       Transcript
XXbac-BPG309N1.15      Unprocessed pseudogene
XXbac-BPG32J3.18       Putative
XXbac-BPG8G10.2        Unprocessed pseudogene
DAQB-12N14.5           Transcript
DAQB-331I12.5          Putative
DAQB-335A13.8          Transcript

Newly annotated loci without HGNC symbols.

Non-canonical splice sites

Eight variants within six loci were shown to exhibit haplotypic variation at their splice sites (canonical to non-canonical motif; Table 8). These variations may affect gene expression at the post-transcriptional level. Hoarau et al. (2004, 2005) have already described the differential splicing within the HLA-DQA1 locus, and this can clearly be seen by comparing the new HLA-DQA1 annotation through the VEGA genome browser.
Table 8

Haplotype variation at splice sites

Gene      Variant  Affected exons  Donor*  Acceptor*  dbSNP cluster ID  Best evidence  PGF  QBL  COX  SSTO  DBB  APD  MANN  MCF
TRIM31    2        3/4             ggt     tgg        rs28400887        cDNA           NC   NC   NC   C     ND   NC   NC    C
TRIM31    5        2/3             ggt     tgg        rs28400887        EST            NC   NC   NC   C     ND   NC   NC    C
C4B       7        3/4             ggt     cgg        -                 EST            NC   ND   NC   C     NC   ND   ND    ND
C4A       7        3/4             ggt     cgg        -                 EST            NC   NC   ND   C     NC   ND   ND    NC
HLA-DQA1  4        4/5             ggt     cgg        rs707947          cDNA           C    C    C    NC    NC   ND   NC    NC
HLA-DQA1  5        4/5             ggt     taa/caa    rs3667            cDNA           NC   NC   NC   C     C    ND   C     C
HLA-DRB1  2        2/3             gat     cag        rs9271083         EST            NC   C    C    C     C    ND   C     ND

Gene loci and variants that are affected by disruptive variations at splice sites. C canonical splice site (donor = ngt; acceptor = nag), NC non-canonical, and ND no data (gene absent or gap). Donor and acceptor variable nucleotides are in bold, with the equivalent dbSNP cluster ID number given in the column to the right. The C4A and C4B genes are, for these purposes, effective duplicates of each other. The two TRIM31 variants share the same splice site (but differ elsewhere in structure). The two HLA-DQA1 variants share the same donor but have alternative acceptors. Note the mutual exclusivity of these variants amongst the haplotypes (Hoarau et al. 2004; Hoarau et al. 2005).
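The C/NC/ND coding used in Table 8 can be illustrated with a minimal Python sketch. It assumes only the motif definitions given in the legend (donor = ngt; acceptor = nag); the function names `is_canonical` and `classify_splice_site` are hypothetical and are not part of the annotation pipeline used in the study.

```python
from typing import Optional


def is_canonical(donor: str, acceptor: str) -> bool:
    """True when both trinucleotides match the canonical motifs.

    Canonical donor = n-g-t, canonical acceptor = n-a-g (n = any base),
    as defined in the Table 8 legend.
    """
    donor, acceptor = donor.lower(), acceptor.lower()
    return (
        len(donor) == 3
        and len(acceptor) == 3
        and donor.endswith("gt")
        and acceptor.endswith("ag")
    )


def classify_splice_site(donor: Optional[str], acceptor: Optional[str]) -> str:
    """Return 'C', 'NC' or 'ND' for one haplotype at one splice junction.

    None stands for missing sequence (gene absent or gap), which is an
    assumption about how missing data would be represented.
    """
    if donor is None or acceptor is None:
        return "ND"
    return "C" if is_canonical(donor, acceptor) else "NC"


if __name__ == "__main__":
    print(classify_splice_site("ggt", "cag"))  # canonical donor and acceptor
    print(classify_splice_site("ggt", "tgg"))  # non-canonical acceptor
    print(classify_splice_site(None, "cag"))   # no sequence available
```

A site is scored canonical only when both the donor and the acceptor match, so a single variable nucleotide in either motif is enough to move a haplotype from C to NC.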

The data for sequence contig length, gaps, variation rate within haplotypes and PGF coding gene annotation have been combined in the map in Fig. 2. This illustrates the concentration of variation around the HLA gene loci, specifically in three areas: around HLA-F, HLA-G and HLA-A; around HLA-C and HLA-B; and around HLA-DRB1, HLA-DQA1, HLA-DQB1, HLA-DQA2 and HLA-DQB2. The variation status of genes of the PGF haplotype is shown in Table 9.
Table 9

Variation status of the main coding variant of each gene in the PGF haplotype annotation

Invariable: ABCF1, AGPAT1, AIF1, APOM, ATP6V1G2, B3GALT4, C6orf134, C6orf136 (a), C6orf26, C6orf48, CLIC1, CSNK2B, CUTA, CYP21A2, DDAH2, FLOT1, HLA-DPA1, HLA-DRB5, HSD17B8, KIFC1 (a), LSM2, LST1, LTB, LY6G5C, LY6G6E, MAS1L, MRPS18B, NCR3, NEU1, NRM, OR2B3, OR2H1, OR2W1, PFDN6, PPP1R10, PRRT1, PSMB8 (b), RDBP, RGL2, RPS18, SLC39A7, STK19, TNF, TUBB, ZFP57

Synonymous variation only: BAT1 (a), BAT5, C2, CREBL1, DAXX, DDR1 (a), GNL1, GPSM3, GTF2H4, HLA-DOA (a), HSPA1B, LY6G6C, MSH5, PBX2, POU5F1, PPP1R11, PRR3, RING1, RNF5, RXRB, SYNGAP1, TRIM10, TRIM26, TRIM27, TRIM39 (a), VPS52, ZBTB12, ZBTB9, ZNRD1

Non-synonymous variation, conservative: AGER, BRD2 (a), BTNL2, C6orf21, C6orf27, CFB, DOM3Z, DPCR1, EGFL8, EHMT2, FKBPL, GABBR1, HLA-DMA, HLA-DOB, HLA-DQB2, HLA-DRA, HSPA1A, LY6G6D, MCCD1, MOG (a), OR11A1, OR2H2, OR2J1, OR2J2, OR2J3, PHF1, PSMB9, RPP21, SFTPG, SKIV2L, SLC44A4, TAP2, TRIM15, WDR46, ZBTB22, ZNF311

Non-synonymous variation, non-conservative: BAT2, BAT3, BAT4, C4A, C4B, C6orf10, C6orf100, C6orf15, C6orf205, C6orf25, C6orf47, CCHCR1, CDSN, COL11A2, DHX16, HLA-A, HLA-B, HLA-C, HLA-DMB, HLA-DPB1, HLA-DPB2, HLA-DQA1, HLA-DQA2, HLA-DQB1, HLA-DRB1, HLA-E, HLA-F, HLA-G, HSPA1L, IER3, KIAA1949, LTA, LY6G5B, MDC1, MICA, MICB, NFKBIL1, NOTCH4, OR10C1, OR12D2, OR12D3, OR5U1, OR5V1, PPT2, PSORS1C1, PSORS1C2, RNF39, TAP1, TAPBP, TCF19, TNXB, TRIM31, TRIM40, UBD, VARS, VARSL

Gene coding sequences may be invariable (no recorded variation), have synonymous variation only (variation at the nucleotide but not the peptide level) or have non-synonymous variation (variation at both the nucleotide and peptide level), which, in turn, may be conservative or non-conservative according to the criterion of positive or negative values in the BLOSUM62 matrix. The main coding variant is that numbered 001 in the VEGA database, except for LY6G6E and HLA-DPB2, where the main variant is not coding. C4A and C4B were excluded from the calculation of variation because the order of these genes in the PGF sequence precluded alignment with other haplotype sequences. Nevertheless, alignment of the coding sequences for each gene separately showed that there were non-synonymous, non-conservative variations. HLA-DRB5 is present in this study only in the PGF haplotype and, therefore, appears invariable here.

(a) Coding genes where the main variant does not harbour non-conservative, non-synonymous variation but other variants do (BAT1, BRD2, DDR1, C6orf136, HLA-DOA, MOG, KIFC1 and TRIM39).

(b) Similarly, coding genes where the main variant does not harbour conservative non-synonymous variation but other variants do (PSMB8).
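The conservative versus non-conservative criterion in the Table 9 legend can be sketched in Python as follows. This is only an illustration, not the code used in the study: the dictionary holds a small excerpt of BLOSUM62 scores rather than the full 20 x 20 matrix, the names `BLOSUM62_EXCERPT` and `classify_substitution` are hypothetical, and the handling of a score of exactly zero is an assumption, because the legend mentions only positive and negative values.

```python
# Excerpt of BLOSUM62 scores for a few amino acid pairs (the matrix is
# symmetric, so each unordered pair is stored once). Illustration only.
BLOSUM62_EXCERPT = {
    frozenset("IL"): 2,
    frozenset("DE"): 2,
    frozenset("KR"): 2,
    frozenset("ST"): 1,
    frozenset("FY"): 3,
    frozenset("DL"): -4,
    frozenset("GW"): -2,
    frozenset("AG"): 0,
}


def classify_substitution(ref_aa: str, alt_aa: str, scores=BLOSUM62_EXCERPT) -> str:
    """Classify a coding substitution by its effect on the peptide.

    Identical residues mean the nucleotide change is synonymous. Otherwise
    a positive score is called conservative and a negative score
    non-conservative. A score of zero is reported separately (assumption).
    """
    if ref_aa == alt_aa:
        return "synonymous"
    score = scores[frozenset((ref_aa, alt_aa))]  # KeyError if pair not in excerpt
    if score > 0:
        return "conservative"
    if score < 0:
        return "non-conservative"
    return "undetermined (score 0)"


if __name__ == "__main__":
    print(classify_substitution("I", "L"))  # positive score
    print(classify_substitution("D", "L"))  # negative score
    print(classify_substitution("A", "A"))  # no peptide change
```

In practice the full matrix would be loaded from a sequence-analysis library or a matrix file; the excerpt keeps the sketch self-contained.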

Fig. 2

Variation and annotation map of eight MHC haplotypes. The map represents the complete reference sequence (orange bar split into three 1.6 Mb sections) labelled PGF and marked with a scale (Mb) and approximate megabase positions on the NCBI36 build of chromosome 6 (grey milestones). Below the reference sequence are arrows representing gene positions and orientations colour-coded for variation status (invariable, black; with synonymous variation only, green; with non-synonymous, conservative variation, red; with non-synonymous, non-conservative variation, purple; see Table 9) and their symbols on a band denoting MHC class (extended class I, green; class I, yellow; class III, pale orange; class II, light blue; extended class II, pink; outside MHC, pale grey). Above the reference sequence, coloured bands represent the sequences of the other seven haplotypes (COX, orange; QBL, mauve; APD, yellow; DBB, green; MANN, light blue; SSTO, dark blue; MCF, purple) with sequence gaps in dark grey; the RCCX hyper-variable region is shown with green (C4A block) and/or red (C4B block) or black (block absent), and the HLA-DRB hyper-variable region in shades of blue-green. Above each haplotype bar, a bar-graph represents total variation between the haplotype and the reference sequence (total variations/10 kb) in dark red. Re-examination of the sequence AL645922 from the PGF haplotype, which contains the RCCX region, has shown that the original assembly was erroneous. Correction of these errors leads us now to the conclusion that the C4A gene precedes the C4B gene in this clone sequence. This new gene order is reflected in Fig. 2.

As well as the variations reported above, major indels revealed as breaks in cross_match discrepancy lists and analysed by RepeatMasker are given in Table 10. Many of these have been previously reported (Dangel et al. 1994; Dunn et al. 2003; Dunn et al. 2002; Gaudieri et al. 1999; Horton et al. 1998; Kulski and Dunn 2005; Stewart et al. 2004). These indels were most frequently, but not exclusively, associated with AluY elements.
Table 10

Major indels in the form of retrotransposable elements

Chr6 pos’nFlanking lociPresence in haplotypeDetails
PGFCOXQBLSSTOAPDDBBMANNMCF
29002370TRIM27:C6orf100CCCC??CCComplex region (A)
29440424OR5V1:OR12D3???XXAluYa5
29784097C6orf40:HCP5P15X?XXXAluYa5/8 175..304
29788451Within HCP5P15XXX?XAluYa5/8 176..310
29794763HCP5P15:HLA-FXX?XXXSVA_E plus simple rpt.s
29922942HLA-G:MICFXL1ME3B 5940..6165
29954495MICF:HLA-HXXXXXXXHERVK9 inserted in MER9
30008633HLA-K:HLA-21XXXX?SVA E/F plus simple rpt.
30106475HCG8:ETF1P1XXXXXAluYb8
30547387SUCLA2P:RANP1XXX?XX?AluJb 1..283 and parts of MLT1D/L1PBa
31079582C6orf205:HCG22XXXXX?XAluYb8 37..297
31117638C6orf205:HCG22XXXAluY (whole & part) and MER63 1017..1062
31301931HCG27:HLA-CX?HERV3 part (6489...7339)
31320352HCG27:HLA-CXXX?XXXSVA_F 349..850 plus GC rich rpt.
31358220RPL3P2:WASF5PXXX?XXXAluY 35..306
31400900WASF5P:HLA-B?XXXAluSp plus L1PREC2 part (3205...4617)
31405648WASF5P:HLA-BX?XxxHERVIP10F (part) and AluSg (only cf CX DB)
31418854WASF5P:HLA-B?XL1PA5 part (5503..5876)
31530995MICA:HCP5X???X?SVA B/F plus simple rpt.s
32421915within C6orf10XXXXXAluYb8
32486228BTNL2:HLA-DRAXXXL1P1/L1HS parts
32655545HLA-DRB1 intron 5xxX??AluYa5 within more or less partial LTR12
32660731HLA-DRB1 intron 1X/X✓/XX/X✓/✓?✓/✓✓/✓?Tigger4/AluSx
32661119HLA-DRB1 intron 1CCCC?CC?Complex region (B)
32663167HLA-DRB1 intron 1X/✓✓/✓✓/✓✓/X?✓/X✓/X?AluSq/AluY
32669534HLA-DRB1:HLA-DQA1CCCC?CC?Complex region (C)
32679461HLA-DRB1:HLA-DQA1XXX?XX?AluY
32693271HLA-DRB1:HLA-DQA1?X?L1PA4 (parts)
32697545HLA-DRB1:HLA-DQA1XXXX??L1HS 7..6032
32701428HLA-DRB1:HLA-DQA1X?xXxL1PA2 part and from CX: MER2B and AluY
32728179HLA-DQA1: HLA-DQB1CCCC?CCCComplex region (D)
32739664within HLA-DQB1XXX?XXAluY
32743646HLA-DQB1: MTCO3P1XXXX?XXLTR13
32746780HLA-DQB1: MTCO3P1XXXX?XL1PA4 (parts)
32751442HLA-DQB1: MTCO3P1XXXX?XXLTR5_Hs
32753489HLA-DQB1: MTCO3P1?XXL1PA10 268..4888 around L1PA4 (part)
32756020HLA-DQB1: MTCO3P1XXXX?XXLTR5_Hs
32764047HLA-DQB1: MTCO3P1?XXAluSx
32765930HLA-DQB1: MTCO3P1XXXX?XXAluYa5
32785062MTCO3P1:HLA-DQB3?XXXTigger4 (Zombi)/L1HS (parts) and T-rich
32795150MTCO3P1:HLA-DQB3XXXXXXAluY
32796573MTCO3P1:HLA-DQB3XXXXXXAluY
32815974HLA-DQB3: HLA-DQA2XXXXXXAluYa5
32857369HLA-DQB2:HLA-DOBXXXXAluYg6
32881426HLA-DQB2:HLA-DOBXX?XX?AluYa5
32887265HLA-DQB2:HLA-DOBX?XXXLTR42 and parts of L1MC5 and AluSc 3..105
33201559within HLA-DPB2XXX?X?AluYb8
33234360HCG24:COL11A2??XAluY (1..293) AluJb (26..306)

Where there was a break in the cross_match discrepancy list match between two clones, the inserted sequence was extracted and subjected to analysis by RepeatMasker to assess the number of major indels that were a result of retrotransposable elements. The chromosome 6 position (NCBI35/36) of the inserted sequence was that of the midpoint where the sequence was an insertion in PGF or the position before the deletion in PGF. Flanking loci were retrieved from the annotation. Insertion in a haplotype is indicated by ‘✓’, deletion by ‘X’, complex regions by ‘C’. Where there is a sequence gap in a haplotype corresponding to the indel, this is shown by ‘?’. Four complex deletion/insertion events are listed: A, B, C and D; for details, see text.

Four of these major indels were complex and are designated complex regions A, B, C and D in Table 10. They include three regions known from the comparison of the PGF and COX haplotypes (Stewart et al. 2004). Complex region A (involving MIR, MER41B, MER115, AluSx, Flam_C, AluSg, AluY, AluSx, L2 and MER38 elements) maps between TRIM27 and C6orf100 and was found to be deleted in COX but present in PGF, QBL, SSTO, MANN and MCF. Complex region B (involving L2 and AluY elements) maps to intron 1 of HLA-DRB1 and was also found by comparing PGF with COX, QBL, DBB, MANN and SSTO. Complex region C (SVA and low-complexity repeat elements) maps between HLA-DRB1 and HLA-DQA1 and was noted in COX as a deletion of the SVA and low-complexity repeats. Whereas DBB, MANN and SSTO displayed the same deletion, as well as a telomeric deletion of AluSx/MIRb, QBL had both deletions plus that of an intervening 2.5 kb sequence containing Alu, L3 and MLT1A1 elements. Complex region D maps between HLA-DQA1 and HLA-DQB1 and is more complicated than previously reported. At the telomeric end, PGF lacks an L1PA4 fragment of >300 bp that is present in COX, QBL, SSTO and MCF and is also absent in DBB and MANN where it is interrupted by about 1.3 kb of SVA sequence. Centromeric to this, PGF contains an AluSx, an AluY and an AluYd2, flanked by long interspersed nuclear element repeats, all deleted in the other haplotypes. Further towards the centromere, there is an L1MA7 fragment into which, in PGF alone, there are insertions of an AluSx followed by an AluY; a subsequent AluSg present in all haplotypes contains an insertion of 795 bp of SVA sequence in just COX and QBL. Finally, at the centromeric end of this region, PGF uniquely contains intact MER11C and LTR5 elements.

Representation of haplotypes within European populations

The eight haplotypes analysed in this study were selected on the basis of their association with type 1 diabetes and multiple sclerosis and their high population frequencies. To determine how representative these haplotypes are with respect to SNP haplotypic diversity in a population, we determined their distribution in the haplotypic tree space in the European population. For this analysis, we selected a segment of ~214 kb, spanning the HLA-DRB1 and HLA-DQB1 genes, in a population of European ancestry with known HLA allelic data (de Bakker et al. 2006). Phylogenetic analysis of 180 founder haplotypes derived from genotypic data (54 substitutions) shows that the eight haplotypes selected as part of the MHC Haplotype Project share identical HLA alleles over most of the tree space (Fig. 3a), representing almost the entire variation observed in the population assayed with the exception of two branches (DRB1*1103–DQB1*0301 and DRB1*0101–DQB1*0501).
Fig. 3

Clusters of haplotypes in the European haplotypic diversity. Phylogenetic relationship of 180 founder SNP haplotypes from CEPH trios spanning a 214-kb segment of the MHC class II region, including the HLA-DRB1 and HLA-DQB1 genes (54 substitutions from rs2187823 to rs2856691). (a) Sequenced haplotypes are widely distributed in this NJ tree and represent the vast majority of the variation in the population sampled. Four-digit alleles are indicated for the corresponding DRB1 and DQB1 genes in each haplotype ID label to highlight the HLA haplotypic distribution based on the underlying nucleotide variation. The NJ tree was constructed using pairwise genetic distances under the Kimura 2-parameter model without correction for rate variation among sites, as implemented in the MEGA2 software (Kumar et al. 2001). (b) Each sequenced haplotype is associated with a single haplotype cluster. This phylogenetic network (Bandelt et al. 1999) also shows that clusters (shaded areas) are constituted by one central haplotype and its derivatives. Circles represent individual haplotypes, and the size of the circle is proportional to the haplotype frequency. The length of the lines connecting nodes is relative to the distance between them, e.g. distances within shaded areas (clusters) never exceed three mutation steps. Clusters of haplotypes sharing HLA alleles with sequenced cell lines are named accordingly: COX and QBL: DRB1*0301 DQB1*0201; PGF: DRB1*1501 DQB1*0602; APD: DRB1*1301 DQB1*0603; MCF: DRB1*0401 DQB1*0301; DBB: DRB1*0701 DQB1*0303; SSTO: DRB1*0403 DQB1*0302; MANN: DRB1*0701 DQB1*0202. HLA haplotypes DRB1*1103–DQB1*0301 and DRB1*0101–DQB1*0501 indicate the two major haplotype clusters not represented in the MHC Haplotype Project data.
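The pairwise distance named in the Fig. 3 legend can be illustrated with a short Python sketch of the standard Kimura 2-parameter formula, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitions and transversions between two aligned sequences. This is not the MEGA2 implementation: the function name `k2p_distance` is hypothetical, no correction for rate variation among sites is applied, and skipping positions with characters other than A, C, G or T is an assumption.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned nucleotide sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes (assumption)
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable positions")

    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # for the model (argument of the logarithm not positive).
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


if __name__ == "__main__":
    # Two haplotypes differing by one transition over ten compared sites.
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

A matrix of such distances over all haplotype pairs is the input that a neighbour-joining procedure would take to build a tree like the one in panel a.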

Haplotype diversity in this sub-population is restricted to relatively few haplotype clusters (Fig. 3b). Each cluster consists of a founder haplotype, depicted by the most frequent and centrally located haplotype within the cluster. Recently derived haplotypes show lower frequencies and are connected to the central haplotype by relatively few mutation steps (in this case, up to three). This phylogenetic network clearly shows that all the sequenced haplotypes occupy central positions in their respective haplotypic groups. Inferences about phylogenetic relationships between haplotype clusters are, however, only approximate as a consequence of recombination events. It should also be noted that SNP haplotypes derived from CEPH pedigrees of European ancestry by no means represent an exhaustive sampling of European diversity. Nevertheless, the sampling has been shown to represent the European population in the UK reasonably well (Ke et al. 2005). In conclusion, our analysis demonstrates that the HLA haplotypes selected for the MHC Haplotype Project are ancestral haplotypes, representative of MHC diversity in the European population.

Conclusion and outlook

The MHC Haplotype Project has succeeded in providing a new public resource for immune-linked disease and population genetic studies. First reports from studies using the resource indicate that it adds significant power to the identification and fine-mapping of disease-associated variations (Yeo et al. 2007). The data have also contributed to the recent identification of a first set of HLA tag SNPs, which hold great promise for future applications in clinical settings, e.g. to complement or replace classical HLA-typing in transplant medicine (de Bakker et al. 2006). While costs and other limitations of the current (capillary) sequencing technology have restricted our study to only a few (eight) MHC haplotypes, the number of new variations found, combined with the fact that no variation plateau has yet been reached, indicates that there are many more variations to be discovered. The recent introduction of several new, massively parallel sequencing platforms (for review, see Bentley 2006) has created the opportunity to do just that by re-sequencing haplotypes and, eventually, entire genomes at the population level and as an integral part of case-control studies. Because of its wide-ranging medical importance, the MHC can be expected to be among the first regions of the human genome to be sequenced in this way. Such sequencing will provide the critical, and until now missing, data to identify causal variations and their underlying mechanisms on an unprecedented scale.
  46 in total

1.  Median-joining networks for inferring intraspecific phylogenies.

Authors:  H J Bandelt; P Forster; A Röhl
Journal:  Mol Biol Evol       Date:  1999-01       Impact factor: 16.240

2.  A high-resolution linkage-disequilibrium map of the human major histocompatibility complex and first generation of tag single-nucleotide polymorphisms.

Authors:  Marcos M Miretti; Emily C Walsh; Xiayi Ke; Marcos Delgado; Mark Griffiths; Sarah Hunt; Jonathan Morrison; Pamela Whittaker; Eric S Lander; Lon R Cardon; David R Bentley; John D Rioux; Stephan Beck; Panos Deloukas
Journal:  Am J Hum Genet       Date:  2005-03-01       Impact factor: 11.025

3.  A new splicing acceptor site and poly(A)+ sequence signal within DQA1*0401 and DQA1*0501 mRNA 3'UTR contribute to increase the extraordinary diversity of mRNA isoforms.

Authors:  J J Hoarau; F Festy; M Cesari; M Pabion
Journal:  Immunogenetics       Date:  2005-04-05       Impact factor: 2.846

4.  A comparison of tagging methods and their tagging space.

Authors:  Xiayi Ke; Marcos M Miretti; John Broxholme; Sarah Hunt; Stephan Beck; David R Bentley; Panos Deloukas; Lon R Cardon
Journal:  Hum Mol Genet       Date:  2005-08-15       Impact factor: 6.150

Review 5.  Polymorphic Alu insertions within the Major Histocompatibility Complex class I genomic region: a brief review.

Authors:  J K Kulski; D S Dunn
Journal:  Cytogenet Genome Res       Date:  2005       Impact factor: 1.636

6.  Nomenclature for Factors of the HLA System, 2004.

Authors:  Steven G E Marsh; Ekkehard D Albert; Walter F Bodmer; Ronald E Bontrop; Bo Dupont; Henry A Erlich; Daniel E Geraghty; John A Hansen; Carolyn K Hurley; Bernard Mach; Wolfgang R Mayr; Peter Parham; Effie W Petersdorf; Takehiko Sasazuki; Geziena M Th Schreuder; Jack L Strominger; Arne Svejgaard; Paul I Terasaki; John Trowsdale
Journal:  Hum Immunol       Date:  2005-03-03       Impact factor: 2.850

7.  Prediction of complete gene structures in human genomic DNA.

Authors:  C Burge; S Karlin
Journal:  J Mol Biol       Date:  1997-04-25       Impact factor: 5.469

8.  Large-scale sequence comparisons reveal unusually high levels of variation in the HLA-DQB1 locus in the class II region of the human MHC.

Authors:  R Horton; D Niblett; S Milne; S Palmer; B Tubby; J Trowsdale; S Beck
Journal:  J Mol Biol       Date:  1998-09-11       Impact factor: 5.469

9.  Different evolutionary histories in two subgenomic regions of the major histocompatibility complex.

Authors:  S Gaudieri; J K Kulski; R L Dawkins; T Gojobori
Journal:  Genome Res       Date:  1999-06       Impact factor: 9.043

Review 10.  Gene map of the extended human MHC.

Authors:  Roger Horton; Laurens Wilming; Vikki Rand; Ruth C Lovering; Elspeth A Bruford; Varsha K Khodiyar; Michael J Lush; Sue Povey; C Conover Talbot; Mathew W Wright; Hester M Wain; John Trowsdale; Andreas Ziegler; Stephan Beck
Journal:  Nat Rev Genet       Date:  2004-12       Impact factor: 53.242
