Ching Ouyang, David D Smith, Theodore G Krontiris.
Abstract
Variation in gene expression may give rise to a significant fraction of inter-individual phenotypic variation. Studies searching for the underlying genetic controls for such variation have been conducted in model organisms and humans in recent years. In our previous effort to assess conserved underlying haplotype patterns across ethnic populations, we constructed common haplotypes using SNPs having conserved linkage disequilibrium (LD) across ethnic populations. These common haplotypes cluster into a simple evolutionary structure based on their frequencies, defining up to three conserved clusters termed 'haplotype frameworks'. One intriguing preliminary finding was that a significant portion of reported variants strongly associated with cis-regulation tags these globally conserved haplotype frameworks. Here we expand the investigation by collecting genes showing stringently determined cis-association between genotypes and expression phenotypes from major studies. We conducted phylogenetic analysis of current major haplotypes along with the corresponding haplotypes derived from chimpanzee reference sequences. Our analysis reveals that, for the vast majority of such cis-regulatory genes, the tagging SNPs showing the strongest association also tag the haplotype lineages directly separated from ancestry, inferred from either chimpanzee reference sequences or the allele frequency-derived haplotype frameworks, suggesting that the differentially expressed phenotypes evolved relatively early in human history. Such evolutionary signatures provide keys for a more effective identification of globally conserved candidate regulatory haplotypes across human genes in future epidemiologic and pharmacogenetic studies.
Year: 2008 PMID: 18846218 PMCID: PMC2557141 DOI: 10.1371/journal.pone.0003362
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Ancestry-based haplotype clusters and the association with expression phenotypes.
(A) In this hypothetical example, five extant haplotypes are observed (1–5) within a chromosome segment showing strong LD (low recombination rate). These haplotypes are derived through five mutation steps (resulting in five SNPs in current populations) from the inferred ancestral sequence (boxed in black) and can be grouped into two major haplotype clusters (boxed in green and red). Separating the ancestry-based haplotype clusters are earlier mutation steps (G3 → A; C2 → T). Alleles of these SNPs can be used to “tag” the clusters (typed in green and red). Currently, ancestry is commonly inferred from either the allele frequencies of SNPs or the corresponding nucleotides in non-human primate species. When the frequency of SNP alleles is used (preferably those of African populations), the haplotype clusters are referred to as “haplotype frameworks” [9]. The SNPs tagging the frameworks are termed “framework SNPs” or “fmSNPs”. (B) Tree structure of the five extant haplotypes and the expression phenotype clusters. Given a simple hypothesis that a historical mutation creates a variant altering the expression phenotype (either enhancing or suppressing expression), two alternative schemes of resulting phenotype clusters associated with the variant are illustrated. The left panel exemplifies an evolutionarily earlier expression alteration caused by a mutation tagging the ancestry-based haplotype clusters, and the right panel demonstrates the alteration caused by a more recent mutation (with the mutations boxed and the resulting expression phenotype clusters circled).
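The idea of a cluster-tagging SNP in panel (A) can be sketched in a few lines of Python. This is only an illustration: the five haplotype strings, the cluster labels, and the function name `tagging_snps` are invented here and are not the sequences drawn in the figure. A SNP is treated as tagging when each of its alleles occurs in exactly one cluster.

```python
# Toy data (hypothetical, not taken from the figure): five haplotypes over
# five SNPs, grouped into two ancestry-based clusters.
HAPLOTYPES = {
    "h1": "ACGTC",  # identical to the assumed ancestral sequence
    "h2": "GCGTC",  # recent mutation at SNP index 0
    "h3": "ATATC",  # carries the two early, cluster-defining mutations
    "h4": "ATACC",  # early mutations plus a recent one at index 3
    "h5": "ATATG",  # early mutations plus a recent one at index 4
}
CLUSTERS = {"h1": "green", "h2": "green", "h3": "red", "h4": "red", "h5": "red"}


def tagging_snps(haplotypes, clusters):
    """Return 0-based SNP indices whose alleles perfectly separate the clusters."""
    n_snps = len(next(iter(haplotypes.values())))
    n_clusters = len(set(clusters.values()))
    tags = []
    for pos in range(n_snps):
        allele_to_clusters = {}
        for name, hap in haplotypes.items():
            allele_to_clusters.setdefault(hap[pos], set()).add(clusters[name])
        # Each allele must be private to one cluster, and every cluster
        # must have its own allele.
        private = all(len(c) == 1 for c in allele_to_clusters.values())
        if private and len(allele_to_clusters) == n_clusters:
            tags.append(pos)
    return tags


print(tagging_snps(HAPLOTYPES, CLUSTERS))
```

In this toy set only the two early mutations (indices 1 and 2) separate the clusters; the three recent mutations each mark a single haplotype and so cannot tag a cluster.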
Correlation between SNPs showing strongest association with cis-regulation and SNPs tagging haplotype frameworks.
| | Gene | Genomic region investigated | Tagging SNP showing strongest association with | | | | | |
| | | | Reference number | Allele frequency | Relative position to gene | Tagging frequency-derived haplotype frameworks | Tagging lineages derived from chimpanzee reference | Nominal p-value significant across multiple populations |
| | | | | | | | | |
| 1 | HSD17B12 | chr11:43648880..43844743; 195.9 kb | rs10838162 | 26%/41%/20% | intragenic | | | ✓ |
| 2 | IRF5 | chr7:128161944..128194036; 32.1 kb | rs2280714 | 42%/44%/47% | dn 4.6 kb | | | ✓ |
| 3 | CD151 | chr11:812985..838833; 25.9 kb | rs4075289 | 26%/37%/5% | up 2.3 kb | | | ✓ |
| 4 | CCT8 | chr21:29340517..29377880; 37.4 kb | rs965951 | 14%/29%/12% | intragenic | | | ✓ |
| 5 | PPAT | chr4:57090458..57152772; 62.3 kb | rs9683679 | 29%/36%/29% | intragenic | | | ✓ |
| 6 | LOC388796 | chr20:36472655..36507865; 35.2 kb | rs3752278 | 9%/48%/13% | intragenic | | | ✓ |
| 7 | TMEM8 | chr16:351860..381907; 30.1 kb | rs3743888 | 39%/29%/62% | intragenic | | | ✓ |
| 8 | CTBP1 | chr4:1185057..1242737; 57.7 kb | rs3755920 | 46%/34%/70% | up 0.7 kb | | | ✓ |
| 9 | ATF5 | chr19:55114271..55139002; 24.73 kb | rs3826777 | 36%/18%/44% | up 1.1 kb | | | ✓ |
| 10 | ARTS-1 | chr5:96112276..96179397; 67.1 kb | rs30187 | 30%/39%/45% | intragenic | | | |
| 11 | IL16 | chr15:79252254..79402155; 149.9 kb | rs11638444 | 25%/43%/2% | intragenic | | | |
| 12 | CTSH | chr15:76991161..77034474; 43.3 kb | rs1036938 | 29%/86%/87% | intragenic | | No | ✓ |
| 13 | CHI3L2 | chr1:111472322..111508101; 35.8 kb | rs12048900 | 37%/27%/13% | up 4.2 kb | | No | ✓ |
| 14 | VAMP8 | chr2:85706374..85730810; 24.4 kb | rs3731828 | 38%/42%/33% | intragenic | | No | ✓ |
| | | | | | | | | |
| 1 | BTN3A2 | chr6:26453120..26496524; 43.4 kb | rs9393713 | 13%/3%/11% | intragenic | No | | ✓ |
| 2 | SERPINB10 | chr18:59723724..59763455; 39.7 kb | rs8085490 | 21%/81%/42% | intragenic | No | | ✓ |
| 3 | LRAP | chr5:96231023..96319053; 88.0 kb | rs2247650 | 49%/59%/56% | intragenic | No | | ✓ |
| 4 | CAV2 | chr7:115669574..115742544; 73.0 kb | rs17138767 | 10%/1%/20% | up 1.8 kb | No | | ✓ |
| 5 | PAX8 | chr2:113679805..113762727; 82.9 kb | rs11123170 | 39%/28%/32% | intragenic | No | | ✓ |
| 6 | CAT | chr11:34407053..34460178; 53.1 kb | rs10836244 | 13%/13%/54% | intragenic | No | | ✓ |
| 7 | OAS1 | chr12:111797458..111830430; 33 kb | rs1859336 | 40%/0%/27% | dn 9.6 kb | No | | |
| | | | | | | | | |
| 1 | RPS26 | chr12:54711952..54783960; 72.0 kb | rs11171739 | 38%/84%/27% | dn 32.6 kb | No | No | ✓ |
| 2 | CPNE1 | chr20:33667381..33726261; 58.9 kb | rs12480408 | 10%/8%/7% | intragenic | No | No | ✓ |
| 3 | CSTB | chr21:44008259..44068882; 60.6 kb | rs880987 | 18%/1%/48% | up 28.2 kb | No | No | ✓ |
| 4 | RAB7L1 | chr1:202397009..202475780; 78.8 kb | rs951366 | 42%/16%/40% | dn 52.3 kb | No | No | ✓ |
| 5 | SFRS6 | chr20:41509931..41558894; 49.0 kb | rs8124813 | 33%/1%/16% | dn 13.2 kb | No | No | |
Including 10 kb upstream/downstream sequences of initiation/termination sites or an extended area to cover local LD block. Based on HapMap Phase II data release #21 in July 2006 and NCBI B35 assembly.
Frequency of rare allele derived in HapMap CEU Population versus the frequency of the same allele in YRI or CHB+JPT. CEU: CEPH (Utah residents with ancestry from northern and western Europe); YRI: Yoruba in Ibadan, Nigeria; CHB: Han Chinese in Beijing, China; JPT: Japanese in Tokyo, Japan.
Relative position (upstream; up / downstream; dn) to initiation/termination sites.
Based on public data from GSE 6536 (Illumina platform) or GSE 2552 / GSE 5859 (Affymetrix platform).
Reported in Morley et al., Nature 430, 743–7 (2004).
Reported in Cheung et al., Nature 437, 1365–9 (2005).
Reported in Pastinen et al., Hum. Mol. Genet. 14, 3963–71 (2005). Association data provided online from authors' website.
Reported in Deutsch et al., Hum. Mol. Genet. 14, 3741–9 (2005).
Reported in Stranger et al., PLoS Genet. 1, e78 (2005). Association data provided by the authors.
The LD block extends into intragenic region.
The LD block extends to upstream 0.8 kb.
The LD block extends to upstream 2.3 kb.
Haplotype framework-tagging SNP shows significant association in at least one population measured by either platform.
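The "Genomic region investigated" entries pair a coordinate range with a length in kb, and the arithmetic can be checked with a small helper. This is a sketch under two assumptions: the function name `region_length_kb` is invented here, and coordinates are treated as inclusive, which the source does not state.

```python
import re


def region_length_kb(region):
    """Parse 'chrN:start..end' and return the span in kilobases."""
    match = re.fullmatch(r"chr\w+:(\d+)\.\.(\d+)", region)
    if match is None:
        raise ValueError(f"unrecognised region: {region!r}")
    start, end = int(match.group(1)), int(match.group(2))
    # Inclusive coordinates are an assumption of this sketch.
    return (end - start + 1) / 1000


# Two regions copied from the table above.
for gene, region in [
    ("HSD17B12", "chr11:43648880..43844743"),
    ("IRF5", "chr7:128161944..128194036"),
]:
    print(gene, round(region_length_kb(region), 1), "kb")
```

For these two rows the computed spans round to the tabulated 195.9 kb and 32.1 kb.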
Figure 2. Delineation of underlying haplotype framework structure encompassing the HSD17B12 gene (Class I cis-regulatory gene).
(A) Diagram depicting the HSD17B12 gene and its chromosomal position (reproduced from the HapMap graphical browser), aligned with the local LD structure determined in YRI (output from the Haploview program) using LD-selected SNPs. To simplify this presentation, we focused on common SNPs (frequency >0.2) in either the HapMap YRI or CEU populations. Pairwise standardized LD, r², was first calculated using YRI data. SNPs in strong LD (r²>0.8) with at least one other SNP and also exhibiting conserved LD in CEU and CHB/JPT were selected for the LD plot and haplotype analyses. The original SNP reported to show the strongest association with expression (peak SNP) is marked with a solid black triangle at its physical position and mapped to its corresponding position in the LD plot. The LD block containing the peak SNP is surrounded with black lines. (B) Haplotype frameworks within the block containing the peak SNP. The major haplotypes (>5% in either population) and their population frequencies were inferred using the Haploview program. Five major haplotypes in the YRI population clustered into two haplotype frameworks (A and B) that can be tagged by a set of SNPs (fmSNPs) in strong LD and having the highest allele frequency within the block. The common alleles of fmSNPs are colored green, and the rare alleles red. The rare alleles of other lower-frequency SNPs are colored purple. Comparison of major haplotypes delineated in CEU and CHB/JPT using the same sets of SNPs showed an identical haplotype structure with a different frequency distribution as shown to the right. (Four SNPs having no genotype information in CEU were left blank.) All SNP reference (rs) numbers are shown above, with the original reported peak SNP, rs4755741, outlined in black. The chimpanzee nucleotides corresponding to each SNP are shown below. The colors of SNP alleles used in CEU, CHB/JPT, and chimpanzee follow the convention defined in YRI. The stars below chimpanzee nucleotides indicate polymorphisms located at (C/T)pG positions on either strand.
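The r² selection step in panel (A) can be illustrated with a minimal sketch. It assumes phased haplotypes encoded as strings of '0' and '1' (one character per SNP) and covers only the single-population step; the additional requirement of conserved LD in CEU and CHB/JPT is not modeled. The function names and toy haplotypes are hypothetical, and the caption's own values came from Haploview.

```python
def r_squared(haplotypes, i, j):
    """Standardized LD (r^2) between SNPs i and j from phased haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[i] == "1" for h in haplotypes) / n
    p_b = sum(h[j] == "1" for h in haplotypes) / n
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # at least one SNP is monomorphic; r^2 is undefined
    d = p_ab - p_a * p_b  # the LD coefficient D
    return d * d / denom


def snps_in_strong_ld(haplotypes, threshold=0.8):
    """Indices of SNPs with r^2 above the threshold to at least one other SNP."""
    n_snps = len(haplotypes[0])
    return [
        i
        for i in range(n_snps)
        if any(r_squared(haplotypes, i, j) > threshold
               for j in range(n_snps) if j != i)
    ]


# Toy phased haplotypes (hypothetical): SNPs 0 and 1 always co-occur.
TOY = ["110", "110", "001", "001", "000", "111"]
print(snps_in_strong_ld(TOY))
```

In the toy data SNPs 0 and 1 are in complete LD with each other and are retained, while SNP 2 is only weakly correlated with either and is dropped.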
Figure 3. Delineation of underlying haplotype framework structure encompassing the BTN3A2 gene (Class II cis-regulatory gene).
(A) Diagram depicting the BTN3A2 gene, its chromosomal position, and the local LD structure. This panel follows the convention in Figure 2 except that, to simplify the presentation, we focused on common SNPs with frequency >0.1 in HapMap populations. (B) Haplotype frameworks within the block containing the peak SNP. This panel also follows the convention in Figure 2. The pairwise LD measure, r², between the peak SNP and fmSNP is shown in all three populations. Sets of SNPs in strong LD, determined using YRI genotypes and based on the criterion of r²>0.8, are depicted at the bottom. The number of SNPs in each bin is shown to the left. The SNP set marked in red, containing an extraordinarily large number of SNPs relative to other bins (tagging haplotype B4 within the block), shows the strongest association with expression phenotypes.
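One way to form the sets of SNPs in strong LD described in panel (B) is to link any two SNPs whose r² exceeds 0.8 and take the connected groups. The caption does not say which binning procedure was used, so the single-linkage grouping below is an assumption, and the function name `ld_bins` and the pairwise r² values are invented for illustration.

```python
def ld_bins(n_snps, r2, threshold=0.8):
    """Group SNP indices into bins linked by pairwise r^2 > threshold.

    r2 maps (i, j) index pairs to r^2 values; missing pairs count as unlinked.
    Bins are returned largest first.
    """
    parent = list(range(n_snps))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), value in r2.items():
        if value > threshold:
            parent[find(i)] = find(j)

    bins = {}
    for snp in range(n_snps):
        bins.setdefault(find(snp), []).append(snp)
    return sorted(bins.values(), key=lambda b: (-len(b), b))


# Hypothetical pairwise r^2 values for six SNPs.
TOY_R2 = {(0, 1): 0.95, (1, 2): 0.90, (0, 2): 0.85,
          (3, 4): 0.92, (2, 3): 0.30, (4, 5): 0.10}
print(ld_bins(6, TOY_R2))
```

Sorting by bin size puts an unusually large bin, such as the one marked in red in the figure, at the front of the list.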
Figure 4. Phylogenetic relationships among current major haplotypes and the association to expression phenotypes.
Median-joining (MJ) network analysis was conducted using the Network program for HSD17B12 (Class I; shown in panel A) and BTN3A2 (Class II; shown in panel B). The major haplotypes in HapMap populations shown in Figures 2 and 3 were entered, using their population haplotype frequencies, along with the corresponding chimpanzee haplotype. The SNPs located at (C/T)pG positions on either strand (marked with stars in panel B of the previous figures) were generally excluded from this analysis because of their potentially high mutation rates. Each haplotype is represented by a circle. The area of each circle, except for the chimpanzee reference (colored yellow), reflects the observed frequency of each haplotype in the total dataset (YRI, CEU, CHB/JPT). The proportion of YRI, CEU, and CHB/JPT chromosomes in each circle is denoted with green/red, white, and grey colors, respectively. The colors of haplotype frameworks A and B follow the green and red convention in previous figures. The length of lines between any two haplotype nodes is proportional to the number of mutation steps. The rs numbers of SNPs are labeled along the lines. The median vector (mv) is shown as a small circle and can be interpreted as a possibly extant unsampled haplotype. For each set of SNPs in strong LD (generally marking the branches in the lineage), a SNP with the most complete genotypes (underlined) was chosen for testing association. The nominal p-values for these SNPs in each major population, based on expression data sets GSE6536 (Illumina platform) and GSE2552/GSE5859 (Affymetrix platform), are shown in order. The coalescent-based maximum likelihood tree structure and the regression of expression phenotypes are plotted at the bottom of each panel. The set of SNPs in LD showing the strongest association with expression phenotypes is typed in red in the phylogenetic network and shown as red dots in the genealogical tree.
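A per-SNP association test of the kind summarized in this figure can be sketched as a regression of expression on genotype. The caption does not specify how the nominal p-values were computed, so this sketch makes two assumptions: genotypes are coded additively (0, 1, or 2 copies of one allele), and a permutation test stands in for whatever parametric test the authors used. The data and function names are hypothetical.

```python
import random


def slope(genotypes, expression):
    """Least-squares slope of expression on additively coded genotype."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    return sxy / sxx


def permutation_p_value(genotypes, expression, n_perm=999, seed=0):
    """Two-sided permutation p-value for the regression slope."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(slope(genotypes, expression))
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(slope(genotypes, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: nine individuals, expression rising with allele count.
GENOTYPES = [0, 0, 0, 1, 1, 1, 2, 2, 2]
EXPRESSION = [1.0, 1.1, 0.9, 2.0, 2.1, 1.9, 3.0, 3.1, 2.9]
print(slope(GENOTYPES, EXPRESSION))
print(permutation_p_value(GENOTYPES, EXPRESSION))
```

Running the same test once per LD bin, using one representative SNP from each, would give a list of nominal p-values comparable in layout to those shown along the network branches.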