Resolving the polymorphism-in-probe problem is critical for correct interpretation of expression QTL studies.

Adaikalavan Ramasamy, Daniah Trabzuni, J Raphael Gibbs, Allissa Dillman, Dena G Hernandez, Sampath Arepalli, Robert Walker, Colin Smith, Gigaloluwa Peter Ilori, Andrey A Shabalin, Yun Li, Andrew B Singleton, Mark R Cookson, John Hardy, Mina Ryten, Michael E Weale.

Abstract

Polymorphisms in the target mRNA sequence can greatly affect the binding affinity of microarray probe sequences, leading to false-positive and false-negative expression quantitative trait locus (QTL) signals with any other polymorphisms in linkage disequilibrium. We provide the most complete solution to this problem, by using the latest genome and exome sequence reference data to identify almost all common polymorphisms (frequency >1% in Europeans) in probe sequences for two commonly used microarray panels (the gene-based Illumina Human HT12 array, which uses 50-mer probes, and exon-based Affymetrix Human Exon 1.0 ST array, which uses 25-mer probes). We demonstrate the impact of this problem using cerebellum and frontal cortex tissues from 438 neuropathologically normal individuals. We find that although only a small proportion of the probes contain polymorphisms, they account for a large proportion of apparent expression QTL signals, and therefore result in many false signals being declared as real. We find that the polymorphism-in-probe problem is insufficiently controlled by previous protocols, and illustrate this using some notable false-positive and false-negative examples in MAPT and PRICKLE1 that can be found in many eQTL databases. We recommend that both new and existing eQTL data sets should be carefully checked in order to adequately address this issue.

Substances: Oligonucleotide Probes

Year: 2013 PMID: 23435227 PMCID: PMC3627570 DOI: 10.1093/nar/gkt069

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Expression quantitative trait locus (eQTL) studies look for association signals between genetic variation (typically single nucleotide polymorphisms, or SNPs) and gene expression. Here, we use ‘eQTL’ to refer to all kinds of expression quantitative trait loci, whether arising from association with gene-level or exon-level expression patterns. These eQTL studies have provided insights into the mode and regulatory action of gene-level expression and the differential expression of alternatively spliced transcripts, and have provided important insights into the causal mechanisms behind some genome-wide association study signals (1–3).

It is anticipated that RNA-seq will become the future platform of choice for these studies. However, the protocols for this technology are still immature, and the costs for assaying large numbers of individuals are still high. For now, older platforms relying on microarrays remain important, not least because large and valuable repositories of data exist based on this technology. Expression microarrays work through the binding of oligonucleotide probe sequences to expressed mRNA. Two widely used examples are the Illumina Human HT12 array, which uses 50-mer probes biased towards the 3′ end of mRNA transcripts to estimate whole-gene levels of expression, and the Affymetrix Human Exon 1.0 ST array, which uses 25-mer probes, typically grouped in sets of four per exon, to estimate exon-specific levels of expression.

All microarrays are susceptible to a polymorphism-in-probe problem, which arises because probes are typically designed to match one reference sequence only. Thus sequences which depart from this reference, either due to the presence of different nucleotides (i.e. SNPs) or (particularly) due to the presence or absence of nucleotides (i.e. indels), are likely to exhibit a weaker binding affinity for the probe in question (4). This results in an apparent association between genotype and expression, confounding eQTL studies that are looking for just such a signal. Furthermore, this problem will not only generate a false eQTL signal with the polymorphism in the probe sequence, but also localized linkage disequilibrium (LD) will ensure that all polymorphisms in high LD with the offending polymorphism will likewise display a false association signal.

We note that an analogous problem can arise in RNA sequence-based eQTL studies. Allele-specific biases may occur when aligning RNA sequence reads to a single reference genome. Addressing these biases is an active line of enquiry in the field of allele-specific expression studies (5–7). We anticipate that aligning reads to personal genomes (e.g. via exome sequencing) will provide the best solution to this problem in the context of RNA sequencing.

Previous microarray-based eQTL studies have dealt with the polymorphism-in-probe problem with varying degrees of thoroughness and with considerable differences both in how investigators sought to identify suspect probes and also in how they then chose to remove suspect eQTL signals based on this information (see Supplementary Table S1 for selected examples). Several studies have attempted to quantify this problem empirically, either explicitly or as part of a biological result paper (Table 1), but again they vary in protocol (8–14). Several factors are likely to have led investigators to underestimate the scale of this issue. These include incomplete reference information on the location of all common SNPs and indels, associated difficulties in applying high quality imputation techniques to enable the prediction of non-genotyped SNPs and a lack of appreciation for the possibility of false associations due not only to genotyped polymorphisms within the probe sequences, but also polymorphisms in LD located outside the probe sequence and other polymorphisms like indels.

Table 1.

Studies that have provided an empirical assessment of the polymorphism-in-probe problem

Article (PMID)	Tissues and sample size	Expression chip (probe length)	SNP set used to check SNP-in-probe (# SNPs in set)	Method of assessment and reported findings
Walter et al. (2007)^METHOD, Nature Methods, PMID: 17762873	Whole brain from six C57BL/6J strain mice and six DBA/2J strain mice	Affymetrix MOE430 2.0 chips (25-mer probes but only transcript-level was analysed)	NIEHS/Perlegen Mouse Resequencing Project & Mouse Phenome Database SNP Tool & Sanger resequencing (∼3.9 million SNPs)	Compared results before and after masking SNP-containing probes: 22% false-positive rate and 12% false-negative rate (RMA) or 36% false-positive rate and 13% false-negative rate (MAS 5.0)
Meyers et al. (2007), Nature Genetics, PMID: 17982457	193 neuropathologically normal human brains (pooled regions)	Illumina Human Refseq-8 Expression (50-mer probes)	Genotyped SNPs (366 140 SNPs)	Discarded associations if probe contained a SNP: 13% of significant cis-eQTLs discarded; 5% of significant trans-eQTLs discarded
Benovoy et al. (2008)^METHOD, Nucleic Acids Research, PMID: 18596082	57 CEU HapMap individuals, LCLs	Affymetrix Human Exon 1.0 ST (25-mer probes)	HapMap II release 21 (∼4 million SNPs)	Compared results before and after masking SNP-containing probes: 86.6% false-positive rate and 0.3% false-negative rate (exon-level); 8.1% false-positive rate and 0.05% false-negative rate (gene-level)
Heinzen et al. (2008), PLoS Biology, PMID: 19222302	93 frontal cortex; 80 blood cell	Affymetrix Human Exon 1.0 ST (25-mer probes)	Genotyped SNPs (<550 thousand SNPs)	Discarded associations if the hit SNP was inside the probe sequence or in high LD (r² > 0.50) with a SNP inside the probe sequence: 36.6% of significant cis-eQTLs (exon-level) discarded
Gamazon et al. (2010)^METHOD, PLoS One, PMID: 20186275	57 CEU HapMap individuals; 56 YRI HapMap individuals, LCLs	Affymetrix Human Exon 1.0 ST (25-mer probes)	1000 Genomes Pilot 1 (April 2009) + dbSNP v129 (unclear on number of SNPs)	Focused on 782 differentially spliced probesets from their previous published study and reports that ∼15% of these could be affected by novel SNPs in 1000 Genomes Pilot 1 (compared with dbSNP v129)
Stranger et al. (2012), PLoS Genetics, PMID: 22532805	726 individuals from 8 HapMap populations, LCLs	Illumina Sentrix Human-6 Expression BeadChip version 2 (50-mer probes)	1000 Genomes Pilot 1 (Aug 2010) with MAF > 5% (unclear on number of SNPs)	6.5% of probes contained SNP(s) within the probe sequence while 7.4% of the significant probes (i.e. those with at least one significant cis-eQTL) also contained SNP(s). Therefore, concluded no significant enrichment
Ramasamy et al. (2013)^METHOD (current article)	130 cerebellum; 127 frontal cortex	Affymetrix Human Exon 1.0 ST (25-mer probes)	1000 Genomes Integrated Phase 1 version 3 (March 2012) and NHLBI-ESP (∼9.3 million SNPs, ∼1 million indels)	Proportion of cis eQTLs discarded (depending on P-value): 60.2–90% in FCTX and 49.7–72.7% in CRBL
	301 cerebellum; 304 frontal cortex	Illumina HT12 (50-mer probes)	As above	Proportion of cis eQTLs discarded (depending on P-value): 31–52.6% in FCTX and 20–46.7% in CRBL

METHOD Indicates methodological articles that explicitly studied this problem in greater detail.

This article aims to deal with this problem comprehensively. First, using the most recent releases of the 1000 Genomes (March 2012) and NHLBI Exome Sequencing Project (NHLBI-ESP) together with data generated on two popular platforms, we provide the most comprehensive method to date for the identification and removal of suspect eQTL signals due to the polymorphism-in-probe problem. Second, we conduct a systematic investigation of the effect of the signal removal protocol on the quality of downstream eQTL signals. Third, we consider and evaluate the available solutions for this problem. Finally, we provide some guidance and a website for users on how to identify probes that may contain polymorphisms.

MATERIALS AND METHODS

Data source

To demonstrate the extent of this problem, we used data from two consortia that have genotyped and expression profiled human cerebellum (CRBL) and frontal cortex (FCTX) from neuropathologically normal individuals using two popular platforms. Details on the data set generation and characteristics are given in the Supplementary Methods and summarized in Supplementary Table S2. Briefly, the Illumina HT12 data set consists of 304 individuals profiled by the North American Brain Expression Consortium (NABEC) (15–17), whereas the Affymetrix Human Exon data set consists of 134 individuals profiled by the UK Brain Expression Consortium (UKBEC) (18). In terms of probe design, the Illumina HT12-v3 BeadChip Array uses 50-mer probes whereas the Affymetrix Human Exon 1.0 ST array uses sets of 25-mer probes designed to target individual exons (usually 4 probes per set), the basic unit of expression in this case. These two arrays are used in the large majority of expression QTL studies to date (Supplementary Table S1). The two consortia used different genome-wide genotyping arrays but both imputed additional markers (∼5.8 million SNPs) using the 1000 Genomes (March 2012) data, thereby improving the coverage in SNPs between both data sets for eQTL analysis. The eQTL analysis was restricted to the autosomal regions of the genome for expression and genotype data.

Expression QTL analysis and LD-resolved signal identification

We tested the association between each SNP and each expression profile assuming an additive genetic model for SNPs. The computation was done using MatrixEQTL software (19) and R (http://www.r-project.org/) on a high-performance Linux-based computer cluster. Imputation and natural LD across the genome, while useful for identifying causal variants, create a problem in that eQTLs from SNP-rich, high-LD regions would be represented several times by LD proxy. Therefore, we treated multiple associations for a given probe/probeset as a single signal if the associated SNPs were in pairwise LD of r² > 0.5 with each other, and took the SNP with the smallest P-value as the ‘LD-resolved’ eQTL. We consider an eQTL signal as cis-acting if the hit SNP is located within 1 Mb of the transcription start site of the associated transcript.
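The LD-resolution step described above can be illustrated with a short sketch. This is not the authors' code: the function names and data layout are hypothetical, LD is approximated by the squared Pearson correlation of genotype dosages, and the greedy strategy (seed each signal with the most significant remaining SNP and absorb every SNP in LD with that seed) is an assumption, since the text does not say how chains of pairwise LD were handled.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    s11 = sum((a - m1) ** 2 for a in g1)
    s22 = sum((b - m2) ** 2 for b in g2)
    if s11 == 0 or s22 == 0:
        return 0.0  # monomorphic SNP: LD is undefined, treated as 0 in this sketch
    return s12 * s12 / (s11 * s22)


def ld_resolve(hits, genotypes, r2_threshold=0.5):
    """Collapse the associations for one probe/probeset into LD-resolved signals.

    hits:      dict mapping SNP id -> association P-value
    genotypes: dict mapping SNP id -> list of dosages (0, 1, 2), one per individual
    Returns a list of (lead_snp, member_snps) tuples, most significant first.
    """
    remaining = sorted(hits, key=hits.get)  # smallest P-value first
    signals = []
    while remaining:
        lead = remaining[0]
        members = [s for s in remaining
                   if s == lead
                   or r_squared(genotypes[lead], genotypes[s]) > r2_threshold]
        signals.append((lead, members))
        remaining = [s for s in remaining if s not in members]
    return signals


def is_cis(snp_pos, tss_pos, window=1_000_000):
    """A signal is cis-acting if the hit SNP lies within 1 Mb of the TSS."""
    return abs(snp_pos - tss_pos) < window
```

Each tuple returned by `ld_resolve` corresponds to one LD-resolved signal, with the lead SNP being the one with the smallest P-value in its group.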

Polymorphism reference data sources and identification of polymorphism-containing probes

We define a ‘suspect cis-eQTL’ to be any cis-eQTL signal where the relevant probe contains a polymorphism with a minor allele frequency >1% in Europeans, regardless of the LD between the hit SNP and the polymorphism-in-probe. To identify probes containing polymorphisms, we considered several different genetic variation reference data sets, which differ in their completeness. The smallest is the set of SNPs available on the genotyping chip (Illumina HumanHap550 for the NABEC data set and Illumina Omni-1 M Quad for the UKBEC data set). We then considered the CEU panel of the final release of HapMap (release #28, merged Phase I + II + III data), although one should note that most of the studies listed in Supplementary Table S1 used earlier versions of HapMap. Next, we considered the SNP and indel data of the European panel (n = 381) of the latest version of the 1000 Genomes Project (March 2012: Integrated Phase I haplotype release version 3, based on the 2010–11 data freeze and 2012-03-14 haplotypes). Finally, we considered the SNPs (average read depth ≥ 10) from the Exome Variant Server, NHLBI-ESP, Seattle, WA (URL: http://evs.gs.washington.edu/EVS/) (accessed 11 May 2012), taken from 3510 European Americans drawn from multiple ESP cohorts. In all reference data sources, we restricted to polymorphisms that were identified with at least 1% allele frequency in European descent samples. The list of probes and probesets used in Affymetrix Exon 1.0 ST and Illumina HT12 in this article along with the positions of the polymorphism-in-probe (if any) is given in Supplementary Table S5.

Probe masking for Affymetrix Exon 1.0 ST Array data

Affymetrix probes are grouped into probesets of typically four probes, which measure the expression of a given exon. If one of the four probes contained a polymorphism and three good ones remained, we re-estimated the exon signal from the remaining three. We refer to these as ‘altered’ probesets. If fewer than three probes remained (either because more than one probe had a polymorphism or because of other QC-related drop-out of probes), the remaining information was considered insufficient and the probeset was discarded. Masking was done using Affymetrix Power Tools (see Supplementary Table S2 for codes). Probe masking has the advantage that both false-positive and false-negative eQTL signals can be recovered.
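The classification rule above can be written down as a small sketch. It only decides which probes to keep and how to label the probeset; the actual re-estimation of the exon signal was done with Affymetrix Power Tools and is not reproduced here. The function name and the data layout are hypothetical.

```python
def mask_probeset(probe_ids, polymorphic_probe_ids, min_clean=3):
    """Label a probeset and list the probes left after masking.

    probe_ids:             probes in the probeset that passed other QC steps
    polymorphic_probe_ids: set of probe ids overlapping a common polymorphism
    min_clean:             minimum number of polymorphism-free probes required
    Returns (status, clean_probes), where status is 'unaltered', 'altered'
    or 'discarded'.
    """
    clean = [p for p in probe_ids if p not in polymorphic_probe_ids]
    if len(clean) < min_clean:
        status = "discarded"   # too little information left to estimate the exon signal
    elif len(clean) == len(probe_ids):
        status = "unaltered"   # no probe contains a polymorphism
    else:
        status = "altered"     # signal is re-estimated from the clean probes only
    return status, clean
```

For a typical four-probe probeset, one polymorphic probe gives an 'altered' probeset and two or more give a 'discarded' one.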

Conditional analysis for rescuing suspect cis-eQTLs from discarded probes/probesets

We applied conditional analysis by including the genotype dosage (number of minor alleles in the genotype) of the polymorphism in probe as a covariate in the linear model regressing the expression of a discarded probe/probeset against the SNP of interest. Multiple covariates were used if more than one polymorphism in the probe or probeset was found. We note that this method can in principle correct both false-positive signals (where the only signal is from the polymorphism-in-probe) and false-negative signals (where the polymorphism-in-probe counteracts the true signal). However, unlike probe masking, true signals can only be recovered if the truly associated SNP is in low LD with the polymorphism in the probe. High-LD SNPs are irretrievably confounded and unrecoverable by this method. Indeed, any SNPs that are in perfect LD with the corresponding polymorphism-in-probe will fail to fit in the conditional model, and must be assigned a conditional association P-value of 1 regardless of whether they are a true hit or not. The method also requires that the polymorphism-in-probe genotype be known for all individuals in the eQTL study (either via imputation or more directly via sequencing).
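To make the conditional model concrete, the following is a minimal sketch for the case of a single polymorphism-in-probe covariate. It is not the authors' implementation: the function names are hypothetical, and the coefficient is obtained by residualising both expression and the SNP of interest on the covariate (which gives the same SNP coefficient as fitting the full linear model). SNPs in perfect LD with the polymorphism-in-probe cannot be fitted, so the sketch returns `None` for them.

```python
from math import sqrt


def _center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def _residualize(values, covariate):
    """Residuals of `values` after regressing on `covariate` with an intercept."""
    dv, dz = _center(values), _center(covariate)
    szz = sum(z * z for z in dz)
    slope = sum(v * z for v, z in zip(dv, dz)) / szz if szz > 0 else 0.0
    return [v - slope * z for v, z in zip(dv, dz)]


def marginal_effect(expr, snp):
    """Slope of expression on SNP dosage, ignoring the polymorphism-in-probe."""
    dx, dy = _center(snp), _center(expr)
    return sum(x * y for x, y in zip(dx, dy)) / sum(x * x for x in dx)


def conditional_effect(expr, snp, pip, tol=1e-12):
    """Slope and standard error of expression on SNP dosage, adjusting for
    the dosage of one polymorphism-in-probe (pip).

    Returns None when the SNP is collinear with the pip (perfect LD), in
    which case the conditional model cannot be fitted.
    """
    ry = _residualize(expr, pip)
    rx = _residualize(snp, pip)
    sxx = sum(x * x for x in rx)
    if sxx < tol:
        return None
    beta = sum(x * y for x, y in zip(rx, ry)) / sxx
    rss = sum((y - beta * x) ** 2 for x, y in zip(rx, ry))
    df = len(expr) - 3  # intercept, SNP and pip coefficients
    se = sqrt(rss / df / sxx)
    return beta, se
```

If expression depends only on the polymorphism-in-probe, `marginal_effect` reports a spurious slope for any SNP in LD with it, whereas the slope from `conditional_effect` shrinks to zero.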

LD filtering for rescuing suspect cis-eQTLs from discarded probes/probesets

We applied LD filtering by choosing an arbitrary threshold for pairwise LD between the SNP of interest and the polymorphism-in-probe, to rescue cis-eQTLs with low LD. In contrast to conditional analysis, LD values can be obtained directly from the reference data source and therefore knowledge of the polymorphism-in-probe genotype for individuals in the eQTL data set is not required. We note that this method can only rescue false-positive signals, not false-negative ones, and furthermore, the rescued signals still carry some probability of being false positives via LD (and indeed we shall show this probability remains high even for very stringent LD thresholds).
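For comparison, LD filtering can be sketched as follows. The function names, the data layout and the default threshold are illustrative assumptions, and LD is approximated by the squared correlation of genotype dosages in a reference panel. The sketch only shows the mechanics of the filter; it should not be read as an endorsement of the approach.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    s11 = sum((a - m1) ** 2 for a in g1)
    s22 = sum((b - m2) ** 2 for b in g2)
    if s11 == 0 or s22 == 0:
        return 0.0  # monomorphic in the reference panel: treated as no LD here
    return s12 * s12 / (s11 * s22)


def ld_filter(suspect_eqtls, ref_genotypes, r2_threshold=0.1):
    """Return the suspect cis-eQTLs that LD filtering would 'rescue'.

    suspect_eqtls: list of (hit_snp, [ids of polymorphisms in the probe])
    ref_genotypes: dict mapping variant id -> dosages in the reference panel
    A signal is rescued only if the hit SNP has r^2 below the threshold
    with every polymorphism in the probe.
    """
    rescued = []
    for hit_snp, pips in suspect_eqtls:
        max_r2 = max(r_squared(ref_genotypes[hit_snp], ref_genotypes[p])
                     for p in pips)
        if max_r2 < r2_threshold:
            rescued.append((hit_snp, pips))
    return rescued
```

Because the filter uses reference-panel LD only, it needs no genotype for the polymorphism-in-probe in the study individuals, which is its main practical attraction.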

An efficient approach to identifying probes containing polymorphism

The start and stop positions of probes from commercial arrays are generally available from the microarray chip manufacturers’ websites. The positions of the variants are available from the latest releases of public projects such as the 1000 Genomes or NHLBI-ESP, or from in-house sequencing projects. After obtaining these, one can scan for variants that fall between the start and stop positions of every probe. Although this can be coded in many ways, we found the intersectBED tool from the BEDtools suite (20), which uses the concept of an interval tree, to be efficient. For example, it took approximately 3 s to search through 6 million SNPs and indels for 5000 probes. The code for implementing this is given in Supplementary Methods. Special care is required when dealing with insertion polymorphisms. A user-friendly implementation (PiP Finder) is available at http://bit.ly/pipfinder using the final variation set defined here.
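The overlap scan itself is simple enough to sketch in a few lines. The example below is not the intersectBED implementation and not the code from the Supplementary Methods; it is a stand-in that uses sorted positions and binary search, with hypothetical function names and tuple layouts. It assumes 1-based inclusive probe coordinates and treats every variant as a single position, so multi-base indels (and insertions in particular) would need extra handling.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(variants):
    """Group variants by chromosome and sort them by position.

    variants: iterable of (chrom, pos, variant_id)
    Returns a dict: chrom -> (sorted positions, variant ids in the same order).
    """
    grouped = defaultdict(list)
    for chrom, pos, variant_id in variants:
        grouped[chrom].append((pos, variant_id))
    index = {}
    for chrom, entries in grouped.items():
        entries.sort()
        index[chrom] = ([p for p, _ in entries], [v for _, v in entries])
    return index


def find_polymorphic_probes(probes, variants):
    """Map each probe id to the variants lying inside its sequence.

    probes: iterable of (probe_id, chrom, start, end), 1-based inclusive
    Probes without any overlapping variant are omitted from the result.
    """
    index = build_index(variants)
    result = {}
    for probe_id, chrom, start, end in probes:
        positions, ids = index.get(chrom, ([], []))
        lo = bisect_left(positions, start)   # first variant at or after start
        hi = bisect_right(positions, end)    # one past the last variant at or before end
        if hi > lo:
            result[probe_id] = ids[lo:hi]
    return result
```

Sorting costs O(m log m) for m variants, after which each probe is resolved with two binary searches, so the scan scales well to millions of variants.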

RESULTS

Proportion of probes (and probesets) containing polymorphism(s) in probe sequence

Using different reference data sources for defining polymorphisms, we identified the number of probes/probesets affected by the polymorphism-in-probe problem in both datasets (Table 2). The difference between the two datasets in the proportion of probes / probesets affected is roughly proportional to the amount of mRNA sequence covered. As a result of probe drop out and overlapping probes, each Affymetrix Human Exon 1.0 ST Array probeset covers on average 72.3 unique nucleotides, compared with the 50 nucleotides of each Illumina HT12 Expression Array, which explains why the proportion of altered plus discarded Affymetrix probesets is higher than the proportion of discarded Illumina probes (i.e. 15.8% vs. 11.7% for the latest polymorphism reference data source) (see ‘Materials and Methods’ section for definitions of ‘altered probeset’ versus ‘discarded probeset’).

Table 2.

Classification of probes /probesets in both data sets with progressively more comprehensive polymorphism reference data source

		Affymetrix Human Exon 1.0 ST (∼1.2 million 25-mer probes grouped into 298 k probesets) based on the UK Human Brain Expression Consortium (UKBEC, N = 134)				Illumina Human HT12 (43 009 50-mer probes) based on the North American Brain Expression Consortium (NABEC, N = 304)
Polymorphism reference data source (restricted to autosomes and frequency > 1%)	No. of variants	No. of unique variants in probe sequence	No. of core probesets unaltered	No. of core probesets altered (%)	No. of core probesets discarded (%)	No. of unique variants in probe sequence	No. of probes unaltered	No. of probes discarded (%)
Illumina Infinium HumanHap550 (after QC)	512 771 SNPs					362 SNPs	42 638	371 (0.86%)

Illumina Omni 1M (after QC)	795 391 SNPs	20 926 SNPs	278 585	13 515 (4.5%)	6260 (2.1%)

CEU panel of HapMap release 28 (August 2010) [unrelated N = 60 (Phase I/II), 112 (Phase III)]	2 602 611 SNPs	24 911 SNPs	275 010	16 162 (5.4%)	7188 (2.4%)	1557 SNPs	41 448	1561 (3.6%)

SNPs from European panel of 1000 Genomes Integrated Phase 1 version 3 (March 2012) (N = 381)	9 013 135 SNPs	50 813 SNPs	254 277	28 932 (9.7%)	15 151 (5.1%)	5186 SNPs	38 356	4653 (10.8%)

+ SNPs info from European Americans from the NHLBI-ESP (N = 3,510)	9 025 738 SNPs	52 843 SNPs	252 692	29 808 (10.0%)	15 860 (5.3%)	5361 SNPs	38 243	4766 (11.1%)

+ indels from European panel of 1000 Genomes Integrated Phase 1 version 3 (March 2012) (N = 381)	9 025 738 SNPs + 927 779 indels	52 843 SNPs + 2097 indels	251 313	30 621 (10.3%)	16 426 (5.5%)	5361 SNPs + 332 indels	37 993	5016 (11.7%)

As one might expect, as the number of polymorphisms available in the reference data source increases, so the true extent of the polymorphism-in-probe problem becomes more evident. Since the majority of expression QTL studies listed in Supplementary Table S1 have attempted to identify SNP-containing probes using earlier HapMap information, it is important to note the considerable increase in the number of SNPs and the availability of data on indels between the final release of HapMap and the current release of 1000 Genomes (March 2012). Therefore, even findings from these studies have to be rigorously checked for any residual polymorphism-in-probe problem.

Proportion of LD-resolved cis-acting eQTLs arising from probes containing polymorphism(s) in probe sequence

We investigated the number of LD-resolved cis-acting eQTLs (<1 Mb from transcription start site of associated transcript, SNPs in a single LD block counted as one signal) that can be considered suspect because they are associated with polymorphism-containing probes/probesets. We considered a wide range of significance thresholds, polymorphism reference sources and different brain regions (Figure 1).

Figure 1.

The proportion of LD-resolved cis-eQTL signals discarded because of the polymorphism-in-probe sequence problem using different polymorphism reference data sources and P-value thresholds. Multiple significant associations with a probe/probeset caused by SNPs in high LD (r² ≥ 0.5) were treated as a single ‘LD-resolved’ signal. Also shown is the expected proportion that would be discarded if the rate was the same as the proportion of all probe/probesets (including ones without a cis-eQTL signal) discarded using the 1000 Genomes (March 2012) plus Exome Sequencing Project reference data source.

We found that the proportion of the LD-resolved cis-eQTLs affected by polymorphism-in-probe is much larger than the overall proportion of probes affected (Tables 2 and 3). This finding is consistent across the two brain regions and across data sets. In the frontal cortex of the Affymetrix Human Exon data set, we found that up to 90% of the declared eQTL signals involved polymorphism-containing probes when we should have expected only 6.1% based on the overall proportion of such probes. Table 3 illustrates this point by tabulating the number of suspect LD-resolved cis-eQTLs at P-value < 10⁻¹² when using the European ancestry panels of the 1000 Genomes Project (March 2012) and NHLBI-ESP.

Table 3.

Affymetrix Human Exon 1.0 (25-mer probe design) based on the UK Human Brain Expression Consortium (UKBEC, N = 134)			Illumina Human HT-12v3 (50-mer probe design) based on the North American Brain Expression Consortium (NABEC, N = 304)
	CRBL	FCTX		CRBL	FCTX
Total number of cis-eQTLs	1275	705	Total number of cis-eQTLs	1192	1018
Type of probeset giving rise to the cis-eQTL			Type of probe giving rise to the cis-eQTL
None of the corresponding probes contain a polymorphism (‘unaltered’)	517	227	Probe does not contain a polymorphism (‘unaltered’)	793	681
Only one corresponding probe contains polymorphism(s) (‘altered’)	119	54	Probe contains polymorphisms(s) (‘discarded’)	396	337
Two or more corresponding probes contain polymorphism(s) (‘discarded’)	639	424
Proportion of eQTLs discarded (excluding altered) = discarded / (discarded + unaltered)	55.2%	65.1%	Proportion of eQTLs discarded	33.2%	33.1%
Expected proportion of eQTLs to be discarded	6.1%		Expected proportion of eQTLs to be discarded	11.7%

The expected proportion to be discarded is the proportion of all probe/probesets discarded (including ones without a cis-eQTL signal).

Table 3. Number of LD-resolved cis-eQTLs (P < 10⁻¹²) for the two data sets, using polymorphisms (present with minor allele frequency >1% in Europeans) from the combined 1000 Genomes (March 2012) plus NHLBI Exome Sequence Project data sources.

The exon-level cis-eQTL results generated using the Affymetrix array (25-mer probe design) are much more affected by the polymorphism-in-probe problem than those results generated using the Illumina array (50-mer probe design). This is in agreement with previous studies showing that the presence of a polymorphism in a longer sequence has a less pronounced effect on the binding affinity than in a shorter sequence (21). However, the enrichment of false positives at gene-level by averaging exon-level data is comparable with the performance of the Illumina array (Supplementary Table S3). Finally, we note that the proportion of suspect cis-eQTLs generally increases with more stringent P-value cut-offs. Therefore, and somewhat counter-intuitively, the more significant a result is the more likely it is to be a false positive. When we repeated the analysis with trans-eQTL signals, we also saw a small, but noticeable, enrichment of false positives (Supplementary Figure S1). This affects some of the commonly presented statistics from eQTLs studies such as cis- to trans-eQTL ratios (Supplementary Figure S2).
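As a worked check on the frontal cortex figures in Table 3, the observed proportion of discarded cis-eQTLs can be recomputed from the tabulated counts and compared with the expected proportion. The 'fold enrichment' ratio below is an illustrative summary introduced here, not a statistic reported in the article, and the function name is hypothetical.

```python
def proportion_discarded(n_discarded, n_unaltered):
    """discarded / (discarded + unaltered), excluding 'altered' probesets."""
    return n_discarded / (n_discarded + n_unaltered)


# Frontal cortex (FCTX) counts taken from Table 3.
affy_fctx = proportion_discarded(n_discarded=424, n_unaltered=227)
illumina_fctx = proportion_discarded(n_discarded=337, n_unaltered=681)

# Expected proportions from Table 3 (share of all probes/probesets discarded).
affy_expected = 0.061
illumina_expected = 0.117

# Ratio of observed to expected proportion (illustrative summary only).
affy_fold = affy_fctx / affy_expected
illumina_fold = illumina_fctx / illumina_expected

print(f"Affymetrix FCTX: {affy_fctx:.1%} observed vs {affy_expected:.1%} expected")
print(f"Illumina FCTX:   {illumina_fctx:.1%} observed vs {illumina_expected:.1%} expected")
```

The observed proportions reproduce the 65.1% and 33.1% given in Table 3, both well above the proportions expected if polymorphism-containing probes were no more likely than other probes to yield a cis-eQTL.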

Approaches to dealing with suspect cis-eQTLs from discarded probes/probesets

For Affymetrix Exon 1.0 ST arrays, where a probeset expression value is typically estimated from four constituent probes, we apply probe masking (10) to exclude the polymorphism-containing probe and re-estimate the probeset expression value. If fewer than three probes remain free of polymorphism(s) for estimation, the probeset is still discarded. This solution recovers about two thirds of the polymorphism-containing probesets (to become ‘altered’ probesets). This still leaves a large number of suspect cis-eQTLs from discarded probesets, especially for Illumina array data where probe masking is not applicable.

Three alternatives for dealing with these suspect cis-eQTLs are (i) to remove all suspect cis-eQTLs; (ii) to apply conditional association analysis; and (iii) to apply LD filtering (see Materials and Methods section for details). Conditional association is a better motivated approach than LD filtering, but requires either imputation or sequencing to obtain the polymorphism-in-probe genotypes, a laborious step for existing eQTL data sets. We find that we can rescue <6% of the suspect cis-eQTLs from both discarded Affymetrix probesets and discarded Illumina probes via the conditional analysis method (Supplementary Figure S3), so there is little additional benefit to using this method over probe masking (which does not require genotype imputation). LD filtering does a poor job of identifying these true eQTLs, regardless of the r² threshold applied. In all conditions considered, even if a stringent LD threshold of r² < 0.1 is used, at least 80% of the cis-eQTL signals ‘rescued’ by LD filtering are in fact false according to the more proper conditional association method (Supplementary Table S4), and this percentage increases with the P-value stringency used to declare eQTL signals. This contraindicates the use of LD filtering to recover suspect cis-eQTLs.

Examples of false positives and false negatives due to polymorphism-in-probe

We selected three examples in two genes of relevance in brain disorders to demonstrate this problem in cerebellum (Figure 2). Common genetic variation at the MAPT gene has been associated with multiple neurodegenerative disorders (22) including Parkinson’s Disease, progressive supranuclear palsy and corticobasal degeneration while PRICKLE1 has been implicated in progressive myoclonic epilepsy (23).

Figure 2.

Illustrative examples of eQTLs with relevance to brain disorders. (A) Boxplots show the false-positive association between rs650927 genotypes and the measured expression of each of the four probes contained within the probeset 3723733 (exon 6 of MAPT) due to an SNP in the probe sequences. Two SNPs are present in the target sequence but only one is in high LD with the hit SNP. (B) Boxplots show the false-negative association between rs34725377 and probeset 3412103 (exon 8 of PRICKLE1) due to an SNP in one of the probe sequences. (C) Boxplots show the false-positive association between rs1751739 and the probe ILMN_1710903 (3′UTR region of the MAPT) due to a common 2-base pair deletion. The association between this SNP and ILMN_2310814, which also targets the 3′UTR of MAPT but is free of polymorphisms, is shown.

The first example (Figure 2A) shows a false-positive association between rs650927 and exon 6 of MAPT (probeset 3723733) in the Affymetrix Human Exon data set. The target sequence contains two SNPs that affect all four constituent probes, but the hit SNP is in high LD only with rs10445337 (r² = 0.98), which affects two of the probes. After excluding these two probes and re-calculating the probeset expression value via probe masking, the eQTL signal is no longer significant (P-value changes from 4.2 × 10⁻²⁰ to 4.6 × 10⁻⁵). Conditioning on both SNPs in the probe sequence also results in a non-significant result (P = 0.644).

The second example (Figure 2B) shows a false-negative association involving exon 8 of PRICKLE1 (probeset 3412103). One of the probes contains an SNP that is in high LD (r² = 0.96) with the hit SNP, which results in an opposite association compared with the other three probes. Excluding this probe results in the discovery of a significant eQTL signal (P-value changes from 1.4 × 10⁻⁵ to 4.3 × 10⁻¹⁸), which would have been missed otherwise. Conditional analysis, using the original probeset expression, is not useful here, as the truly associated SNP is in too high LD with the polymorphism-in-probe, resulting in too high a level of confounding.

The final example (Figure 2C) is the false-positive association between rs1751739 (which tags the H1/H2 haplotype) and the probe ILMN_1710903 in the 3′UTR region of the MAPT gene. This influential finding was first reported in 2007 (9) and has since been replicated in a number of other high profile studies (16,24). The hit SNP is in high LD with this common 2-base pair deletion (labelled as chr17:44102741:D in 1000 Genomes or as rs67759530 in dbSNP) within the probe (r² = 0.91, minor allele frequency = 23%), giving rise to a highly significant association in the Illumina HT12 data set (P-value = 8 × 10⁻³¹). Since there are no constituent probes like there are in the Human Exon array, we investigated the association of the hit SNP with ILMN_2310814, which is located 2738 base pairs away and also in the 3′UTR of the MAPT gene, and observed no significant associations (P-value = 0.76). We discuss more about the eQTLs in this gene elsewhere (25).

DISCUSSION

In this article, we show that the presence of a small proportion of probes binding to sequences containing common polymorphisms massively inflates the number of cis-eQTL signals. These false eQTL signals tend to generate large effect sizes in relation to true signals, and so the problem only becomes worse as one increases the stringency of the P-value threshold used to define significance. Furthermore, these false signals will appear to replicate across studies if one uses the same array platform. We show here that previous eQTL studies are likely to have failed to adequately correct for this problem. This is primarily because of incomplete reference data on the location of all common polymorphisms at the time the studies were performed, but other factors have also played a part. For example, we show here that LD filtering introduces a large number of false positives, even if very low r2 thresholds are used. Our study suggests we are close to reaching a ‘saturation point’ in cataloguing all common exonic polymorphisms in the human genome. Although the number of polymorphisms with minor allele frequencies >1% goes up with every new reference data source we considered, often doubling or tripling compared with previous definitions, the proportion of cis-eQTLs discarded showed smaller and smaller changes. For example, the proportion of cis-eQTLs discarded in the frontal cortex samples of UKBEC at P < 10−12 is 39.8% using genotype only data, 49.8% using HapMap release 28, 64.0% using 1000 Genomes SNPs, 64.4% using 1000 Genomes SNPs plus NHLBI-ESP exomes and 65.1% using 1000 Genomes (SNPs and indels) plus NHLBI-ESP exomes. This suggests we are close to having the full list of common polymorphisms, and that the ones we are missing are likely to be close to 1% in frequency and so with less of a tendency to generate false cis-eQTL signals. 
It is also worth noting that the number of probes/probesets discarded owing to the addition of nearly 1 million indels in the latest release of the 1000 Genomes Project is relatively low. One possible explanation is that, compared with SNPs, indels within exons are more likely to cause deleterious frameshift changes in peptide sequences and are thus under negative selection.
Although the present study is the most complete analysis of the polymorphism-in-probe problem to date, it has limitations. We have not considered all types of polymorphism: inversions and copy number variations, which are currently not as well characterized as SNPs or indels, were not included. We have also not attempted to model the binding affinity of probes as a function of the number of polymorphisms within a sequence; the position, nucleotide type and length of polymorphisms relative to the probe sequence; and the surrounding nucleotide types (4,21,26). We are, however, confident that the current size of reference data from the 1000 Genomes Project and the NHLBI Exome Sequencing Project means that we are close to a complete catalogue of all common point-mutation polymorphisms in the major human populations.
Although the technology for assaying gene expression is now moving away from microarray-based methods and towards RNA sequencing, properly addressing the polymorphism-in-probe problem remains important. Sample size remains the biggest driver for eQTL discovery, and while microarrays remain cheaper than RNA sequencing, they will continue to be used for large-scale studies. In particular, low-expressed genes may require prohibitively large amounts of RNA sequencing to capture, but remain cheaply detectable via microarrays. Indeed, we are aware of only two eQTL studies based on RNA sequencing conducted in humans (both published in 2010), which use 69 HapMap West Africans (27) and 60 HapMap Europeans (28). In contrast, there are numerous recent studies, some involving thousands of samples, using microarray technology [e.g. 2355 samples from Grundberg et al. (29); 1490 samples from Zeller et al. (30)].
There is also a large and important body of existing microarray-based eQTL data spanning multiple tissue types and using large sample sizes. We believe that these data should be reassessed for potential false signals caused by polymorphism-in-probe issues, especially given the widespread distribution of these data via catalogues such as the Phenotype–Genotype Integrator, eQTLbrowser, seeQTL (31), SNPexpress (11) and GeneVar (32). Ideally, these catalogues should automatically flag any suspect signals arising from polymorphism-containing probes. We have also written a web tool, PiP Finder (http://bit.ly/pipfinder), to provide researchers with an easy-to-use interface to identify this issue in any given eQTL signal. For an example of how we used this tool to check a recent publication for suspect eQTLs, please see our online comments to Zou et al. (33).
The polymorphism-in-probe problem is widely recognised in the eQTL literature, and various solutions have been proposed and implemented. Despite this, we show that false eQTL signals are likely to be widespread both in the literature and in extant eQTL databases. We note that the pervasive presence of false eQTL signals may have implications for, inter alia, the overlap of eQTL signals with genome-wide association study signals; the empirical distribution of eQTL signals relative to the transcription start site of genes; and apparent ratios of tissue-specific to cross-tissue eQTL signals. More generally, the findings of our study act as a cautionary tale for the interpretation of all types of genomic data, illustrating that even a relatively well-understood problem can be inadequately corrected.
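The kind of automatic flagging suggested above for eQTL catalogues amounts to an interval-overlap check between probe coordinates and a list of known common variants. The following is a minimal sketch under stated assumptions: coordinates are 1-based and inclusive, a deletion is represented by the span of deleted bases, and "common" means a frequency above 1%, as in the abstract. The `Probe` and `Variant` classes, the function name and all identifiers and coordinates are hypothetical. This is not the implementation of PiP Finder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    probe_id: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


@dataclass(frozen=True)
class Variant:
    variant_id: str
    chrom: str
    start: int  # first affected base
    end: int    # last affected base (equals start for a SNP)
    maf: float  # minor allele frequency in the reference population


def polymorphisms_in_probe(probe, variants, min_maf=0.01):
    """Return the common variants whose span overlaps the probe's target."""
    return [
        v for v in variants
        if v.chrom == probe.chrom
        and v.maf > min_maf
        and v.start <= probe.end
        and v.end >= probe.start
    ]


if __name__ == "__main__":
    # Hypothetical 50-mer probe and variants (not real annotation).
    probe = Probe("PROBE_A", "chr1", 1000, 1049)
    variants = [
        Variant("snp_inside", "chr1", 1010, 1010, 0.20),
        Variant("snp_outside", "chr1", 2000, 2000, 0.30),
        Variant("rare_inside", "chr1", 1020, 1020, 0.005),
        Variant("del_at_edge", "chr1", 999, 1000, 0.10),
    ]
    hits = polymorphisms_in_probe(probe, variants)
    print([v.variant_id for v in hits])
```

Any eQTL signal reported for a probe with a non-empty result would then be marked as suspect. A real catalogue would additionally need consistent genome builds between the probe annotation and the variant reference data.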
Although we show that a large proportion of published cis-eQTL signals could be false, we also show that this problem can now be identified and resolved. From our own experience, meaningful, exciting and valid insights into the regulation of gene expression emerge once the polymorphism-in-probe problem is properly addressed.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1–5, Supplementary Figures 1–3, Supplementary Methods and Supplementary references [34-39].

FUNDING

Medical Research Council, UK, through the MRC Sudden Death Brain Bank (to C.S.); Project Grant [G0901254 to J.H. and M.W.]; Training Fellowship [G0802462 to M.R.]; King Faisal Specialist Hospital and Research Centre, Saudi Arabia (to D.T.); Intramural Research Program of the National Institute on Aging, National Institutes of Health, part of the US Department of Health and Human Services [ZIA AG000932-04] (in part) (to work performed by the North American Brain Expression Consortium). Funding for open access charge: Medical Research Council, UK.
Conflict of interest statement. None declared.

38 in total

1. Tissue and organ donation for research in forensic pathology: the MRC Sudden Death Brain and Tissue Bank.

Authors: T Millar; R Walker; J-C Arango; J W Ironside; D J Harrison; D J MacIntyre; D Blackwood; C Smith; J E Bell
Journal: J Pathol Date: 2007-12 Impact factor: 7.996

2. A survey of genetic human cortical gene expression.

Authors: Amanda J Myers; J Raphael Gibbs; Jennifer A Webster; Kristen Rohrer; Alice Zhao; Lauren Marlowe; Mona Kaleem; Doris Leung; Leslie Bryden; Priti Nath; Victoria L Zismann; Keta Joshipura; Matthew J Huentelman; Diane Hu-Lince; Keith D Coon; David W Craig; John V Pearson; Peter Holmans; Christopher B Heward; Eric M Reiman; Dietrich Stephan; John Hardy
Journal: Nat Genet Date: 2007-11-04 Impact factor: 38.330

Review 3. Revealing the architecture of gene regulation: the promise of eQTL studies.

Authors: Yoav Gilad; Scott A Rifkin; Jonathan K Pritchard
Journal: Trends Genet Date: 2008-07-01 Impact factor: 11.639

4. The Sun Health Research Institute Brain Donation Program: description and experience, 1987-2007.

Authors: Thomas G Beach; Lucia I Sue; Douglas G Walker; Alex E Roher; LihFen Lue; Linda Vedders; Donald J Connor; Marwan N Sabbagh; Joseph Rogers
Journal: Cell Tissue Bank Date: 2008-03-18 Impact factor: 1.522

5. A homozygous mutation in human PRICKLE1 causes an autosomal-recessive progressive myoclonus epilepsy-ataxia syndrome.

Authors: Alexander G Bassuk; Robyn H Wallace; Aimee Buhr; Andrew R Buller; Zaid Afawi; Masahito Shimojo; Shingo Miyata; Shan Chen; Pedro Gonzalez-Alegre; Hilary L Griesbach; Shu Wu; Marcus Nashelsky; Eszter K Vladar; Dragana Antic; Polly J Ferguson; Sebahattin Cirak; Thomas Voit; Matthew P Scott; Jeffrey D Axelrod; Christina Gurnett; Azhar S Daoud; Sara Kivity; Miriam Y Neufeld; Aziz Mazarib; Rachel Straussberg; Simri Walid; Amos D Korczyn; Diane C Slusarski; Samuel F Berkovic; Hatem I El-Shanti
Journal: Am J Hum Genet Date: 2008-10-30 Impact factor: 11.025

6. Sequence polymorphisms cause many false cis eQTLs.

Authors: Rudi Alberts; Peter Terpstra; Yang Li; Rainer Breitling; Jan-Peter Nap; Ritsert C Jansen
Journal: PLoS One Date: 2007-07-18 Impact factor: 3.240

7. Strong position-dependent effects of sequence mismatches on signal ratios measured using long oligonucleotide microarrays.

Authors: Catriona Rennie; Harry A Noyes; Stephen J Kemp; Helen Hulme; Andy Brass; David C Hoyle
Journal: BMC Genomics Date: 2008-07-03 Impact factor: 3.969

8. Impact of point-mutations on the hybridization affinity of surface-bound DNA/DNA and RNA/DNA oligonucleotide-duplexes: comparison of single base mismatches and base bulges.

Authors: Thomas Naiser; Oliver Ehler; Jona Kayser; Timo Mai; Wolfgang Michel; Albrecht Ott
Journal: BMC Biotechnol Date: 2008-05-13 Impact factor: 2.563

9. Effect of polymorphisms within probe-target sequences on olignonucleotide microarray experiments.

Authors: David Benovoy; Tony Kwan; Jacek Majewski
Journal: Nucleic Acids Res Date: 2008-07-02 Impact factor: 16.971

10. Mapping cis- and trans-regulatory effects across multiple tissues in twins.

Authors: Elin Grundberg; Kerrin S Small; Åsa K Hedman; Alexandra C Nica; Alfonso Buil; Sarah Keildson; Jordana T Bell; Tsun-Po Yang; Eshwar Meduri; Amy Barrett; James Nisbett; Magdalena Sekowska; Alicja Wilk; So-Youn Shin; Daniel Glass; Mary Travers; Josine L Min; Sue Ring; Karen Ho; Gudmar Thorleifsson; Augustine Kong; Unnur Thorsteindottir; Chrysanthi Ainali; Antigone S Dimas; Neelam Hassanali; Catherine Ingle; David Knowles; Maria Krestyaninova; Christopher E Lowe; Paola Di Meglio; Stephen B Montgomery; Leopold Parts; Simon Potter; Gabriela Surdulescu; Loukia Tsaprouni; Sophia Tsoka; Veronique Bataille; Richard Durbin; Frank O Nestle; Stephen O'Rahilly; Nicole Soranzo; Cecilia M Lindgren; Krina T Zondervan; Kourosh R Ahmadi; Eric E Schadt; Kari Stefansson; George Davey Smith; Mark I McCarthy; Panos Deloukas; Emmanouil T Dermitzakis; Tim D Spector
Journal: Nat Genet Date: 2012-09-02 Impact factor: 38.330
