Literature DB >> 26366987

WASP: allele-specific software for robust molecular quantitative trait locus discovery.

Bryce van de Geijn^1,2, Graham McVicker³, Yoav Gilad¹, Jonathan K Pritchard^3,4,5.

Abstract

Allele-specific sequencing reads provide a powerful signal for identifying molecular quantitative trait loci (QTLs), but they are challenging to analyze and are prone to technical artifacts. Here we describe WASP, a suite of tools for unbiased allele-specific read mapping and discovery of molecular QTLs. Using simulated reads, RNA-seq reads and chromatin immunoprecipitation sequencing (ChIP-seq) reads, we demonstrate that WASP has a low error rate and is far more powerful than existing QTL-mapping approaches.

Entities: Chemical

Mesh：

Year: 2015 PMID： 26366987 PMCID： PMC4626402 DOI： 10.1038/nmeth.3582

Source DB: PubMed Journal: Nat Methods ISSN： 1548-7091 Impact factor: 28.547

Next generation sequencing data can be used to identify allele-specific signals because reads that overlap heterozygous sites can be assigned to one chromosome or the other. Molecular QTLs are associated with allelic imbalance[1-4], and thus allele-specific reads can potentially augment the power of statistical tests for QTL discovery[5,6]. However, use of allele-specific reads can introduce artefacts into many stages of analysis. Uncorrected mapping of allele-specific reads can be highly biased and can easily yield false signals of allelic imbalance[7,8]. Homozygous sites which are incorrectly called as heterozygous are another source of false positives, and allele-specific read counts are overdispersed compared to the theoretical expectation of a binomial distribution[9]. Here we describe a suite of tools called WASP that is designed to overcome these technical hurdles. WASP carefully maps allele-specific reads, corrects for incorrect heterozygous genotypes and other sources of bias, and models overdispersion of sequencing reads. Finally, by integrating allele-specific information into a QTL mapping framework WASP attains greater power than standard QTL mapping approaches. Mapping of reads to a reference genome is biased by sequence polymorphisms[7]. Reads which contain the non-reference allele may fail to map uniquely or map to a different (incorrect) location in the genome[7]. A common approach is to map to a ‘personalized’ genome where the reference sequence is replaced by non-reference alleles that are known to be present in the sample[10]. However, personalized genomes do not fully address the mapping problem because the genomic locations that are uniquely mappable in the reference and non-reference genome sequences differ (Fig. 1a). While these type of errors may only affect a small number of sites, they comprise a large fraction of the most significant results when tests of allelic imbalance are performed genome-wide. Genomic DNA sequencing reads can also be used to control for mapping bias, however this method reduces power to detect allelic imbalance[11].

Figure 1

Mapping of allele specific reads. (a) Mapping to ‘personalized’ genomes can result in allelic bias because reads from one allele may not map uniquely. (b) Schematic of mapping pipeline to remove allelic bias. (c) The percentage of simulated 100 bp reads at heterozygous sites where a read with one allele maps correctly and the corresponding read with the other allele does not. Reads were simulated with sequencing errors introduced at several different rates. (d) The fraction of false-positives as a function of the effect size using a nominal Benjamini-Hochberg false-discovery rate of 10%. We simulated 100 bp allele-specific reads under null (odds ratio = 1) and alternative models (odds-ratio > 1) of allelic imbalance at heterozygous sites in the genome. 90% and 10% of sites were assumed to be null and alternative sites respectively. We mapped reads using WASP, personal-genome (AlleleSeq10) or N-masked-genome mapping strategies and called allele-specific sites using a binomial test.

WASP uses a simple approach to overcome mapping bias that can be readily incorporated into any read mapping pipeline. First, reads are mapped normally using a mapping tool selected by the user; mapped reads that overlap single nucleotide polymorphisms (SNPs) are then identified. For each read that overlaps a SNP, its genotype is swapped with that of the other allele and it is re-mapped. If a re-mapped read fails to map to exactly the same location, it is discarded (Fig. 1b). Unknown polymorphisms in the sample are not considered but will typically have little effect since the tests of allelic imbalance are only performed at known heterozygous sites. We performed a simulation to assess the impact of unknown polymorphisms and found that the proportion of heterozygous sites with biased mapping is very small (Supplementary Fig. 1 and Supplementary Note 1). We evaluated the performance of WASP’s remapping method by simulating reads at heterozygous sites in a lymphoblastoid cell line (LCL) that has been completely genotyped and phased (GM12878). At each heterozygous SNP we simulated all possible overlapping reads from both haplotypes, additionally allowing reads to contain mismatches at a predefined sequencing error rate. We mapped the simulated reads using three approaches to account for mapping bias: mapping to a genome with N-masked SNPs, mapping to a personalized genome using AlleleSeq10, and mapping to the genome using WASP. While reads mapped to the N-masked and personalized genomes were substantially biased and gave rise to a large number of false positives, reads mapped using WASP were almost perfectly balanced (Fig. 1c,d). One disadvantage of WASP’s approach is that some reads are discarded, which can cause the overall expression level of a locus to be underestimated. Several statistical methods can recover ambiguously mapped reads[12,13], however, they are not designed for unbiased allele-specific mapping and incorporating them into WASP would be technically challenging. WASP employs a number of techniques to remove noise and biases from mapped reads. Amplification bias is a common feature of experiments that yield libraries with low complexity (e.g. ChIP-seq). To control for amplification bias it is common to remove ‘duplicate’ reads that map to the same location. However, existing tools that remove duplicate reads retain the one with the highest mapping score, which will usually match the reference[14]. WASP provides a tool to filter duplicate reads at random, thus eliminating reference bias from this step. GC content often affects read depth in a manner that is inconsistent between sequencing experiments[3,15]. In addition, the distribution of read depths across the genome differs from experiment to experiment. For example, ChIP-seq experiments with more efficient pull-downs tend to have more reads within peaks. WASP corrects for both of these issues by fitting polynomials to the genome-wide read counts and calculating a corrected read depth for each region (Supplementary Note 2). Both allele-specific and total read depth counts are more dispersed than expected under models of binomial and Poisson sampling[9,16]. To accommodate overdispersion in the data, WASP estimates separate overdispersion parameters for each individual and genomic region used in a study (Supplementary Note 3). Finally, to account for any remaining unknown covariates, WASP allows principal components to be included in the model fitting procedure (Supplementary Note 4). Following correction for biases described above, WASP uses a statistical test, the combined haplotype test (CHT), to identify cis-acting QTLs. The CHT tests whether the genotype of a ‘test SNP’ is associated with total read depth and allelic imbalance in a ‘target region’ (Fig. 2a). The CHT jointly models two components: the allelic imbalance at phased heterozygous SNPs and the total read depth in the target region. The two components of the test are linked together by shared parameters that define their effect sizes.

Figure 2

The combined haplotype test and its performance. (a) Schematic of the combined haplotype test. A ‘test SNP’ is tested for association with mapped reads within a ‘target region’. All reads are used by the read depth component of the test; allele-specific reads are used by the allelic imbalance component of the test. (b) Identification of novel QTLs using H3K27ac ChIP-seq data from 10 Yoruba lymphoblastoid cell lines. (c) Identifying European eQTLs from the GEUVADIS consortium using an independent dataset of RNA-seq from 69 Yoruba lymphoblastoid cell lines.

For a target region and test SNP pair, the CHT models the expected number of reads for an individual as a function of the individual’s genotype, the effect size, the GC content, additional covariates (such as principal component loadings), and the total number of mapped reads in the region (across all individuals). The probability of the observed number of reads in the target region is calculated using the expected number of reads and two overdispersion parameters. Allelic imbalance of reads overlapping heterozygous SNPs within a target region is modeled as a function of the shared effect size parameters. The probability of the observed allele-specific read counts is then defined by the effect size and a single overdispersion parameter. We also allow for the possibility of genotyping errors by assuming that allele-specific read counts are drawn from a mixture, with a small probability that a given individual is a mistyped homozygote. WASP combines information across multiple heterozygous sites and the current implementation assumes that haplotype phasing is correct. Incorrect phasing will decrease WASP’s power to detect associations (Supplementary Note 5) but will not increase false positives. To evaluate the performance of WASP on a small dataset, we used it to call novel QTLs genome-wide using data from H3K27ac ChIP-seq experiments that were performed in 10 LCLs[16]. Remarkably, WASP identifies 2426 H3K27ac QTLs (10% FDR), whereas a linear regression approach is unable to identify any (Fig. 2b and Supplementary Note 5). We also evaluated the ability of WASP to call gene expression QTLs (eQTLs) in a larger dataset (Fig 2c and Supplementary Note 5). We obtained a set of 2098 eQTLs identified in 52 LCLs derived from European individuals[17]. We tested whether we could identify these eQTLs, using an independent dataset of RNA-seq from 69 Yoruba LCLs[3]. WASP discovers 627 of the eQTLs at a false discovery rate (FDR) of 10%, which is impressive considering (1) our smaller sample size, (2) that some fraction of the original eQTLs are false positives, and (3) that some of the European eQTLs will be absent or at very low frequency in the Yoruba. This number increases to 673 when 5 principal components are included as covariates. By comparison, when we adopt a standard eQTL discovery method (linear regression on quantile normalized and GC-corrected data), we identify only 446 eQTLs (617 when 5 principal components are included as co-variates). P values obtained by running the CHT on the same dataset with permuted genotypes do not depart substantially from the null expectation, indicating that the test is well-calibrated (Supplementary Fig. 2). We compared the CHT to several other methods by simulating reads under null and alternative models of genetic association (Supplementary Fig. 3 and Supplementary Note 6). For small samples (10 or 20 individuals), the CHT outperforms all other tests, but for large samples (50 or 100 individuals) TReCASE[6] performs similarly well. Like the CHT, TreCASE uses both allelic imbalance and read depth information, however it does not account for overdispersion, genotyping errors, or biased mapping, which increase the false positive rate when using real data. WASP can only test for gene-level expression differences and does not consider the expression of individual transcript isoforms. Some QTLs detected by WASP may therefore be due to differences in isoform usage rather than differences in overall gene expression[18,19]. Our results demonstrate that WASP is a powerful approach for the identification of molecular QTLs, particularly when sample sizes are small. WASP accounts for numerous biases in allele-specific data and is flexible enough to work with different read mappers and multiple types of sequencing data such as ChIP-seq and RNA-seq. By modeling biases and dispersion differences directly, WASP eliminates the need for quantile normalization of the data, thereby making estimated effect sizes more interpretable. The source code and documentation for WASP are open source and can be downloaded from https://github.com/bmvdgeijn/WASP/.

Online Methods

Unbiased read mapping with WASP

To detect differences in molecular phenotypes from sequencing data it is essential to remove read mapping biases, which are a major source of false positives. The WASP read mapping pipeline accomplishes this task by ensuring that the mapping of each individual read is unbiased. The user first maps reads to the genome using any mapper that outputs BAM or SAM format (Supplementary Fig. 4). For example, ChIP-seq reads can be mapped by BWA or Bowtie 2, and RNA-seq reads can be mapped using tophat. WASP then identifies mapped reads that overlap known polymorphisms. For each read that overlaps a polymorphism, all possible allelic combinations that differ from the original read are generated and re-mapped to the genome. For example, when a read overlaps two bi-allelic SNPs, four allelic combinations are possible, three of which differ from the original read. The original read is discarded if any of the allelic combinations map non-uniquely or map to another location. Reads which overlap insertion or deletion polymorphisms are currently discarded by WASP. This simple method has the advantages that it works with almost any existing mapping pipeline and it handles reads with sequencing errors, which are a major source of biased mapping[7].

Discovery of quantitative trait loci with WASP

To discover molecular quantitative trait loci (QTLs) WASP uses a statistical test, which we call the combined haplotype test (CHT). As input, the CHT takes genotype probabilities at known SNPs as well as mapped reads from sequencing-based experiments such as ChIP-seq or RNA-seq. The CHT combines two types of information: the depth of mapped reads and the allelic imbalance of mapped reads that overlap heterozygous sites. The CHT models the overdispersion of read counts (both across regions and across individuals) and accounts for variability introduced by GC content and the fraction of reads that fall within peaks (Supplementary Fig. 5).

The combined haplotype test

The combined haplotype test (CHT) determines whether the genotype of a test SNP, m, is associated with read depth and allelic imbalance within a nearby target region, j, on the same chromosome (Figure 2a). Each test is performed on a test SNP and target region pair, h = {m, j}. A target region may be discontiguous and span multiple genomic loci. For example, the exons of a gene can be used as a target region when searching for expression QTLs using RNA-seq reads. The test SNP is not required to be within the target region, but is assumed to be nearby and cis-acting. This allows us to combine information from across phased heterozygous SNPs and assign reads to one haplotype or the other. Mathematical variables for the CHT are summarized in Supplementary Table 1.

The basic CHT model

The CHT is a likelihood ratio test with two components. One component models the depth of mapped reads within the target region, and the other component models the allelic-imbalance of reads that overlap heterozygous SNPs. Both components of the test are parameterized by α and β, which define the expected read depth from chromosomes with the reference and alternative alleles. Since variants are assumed to be additive and cis-acting, the expected allelic imbalance in heterozygotes is .

Modeling the read depths

The number of reads mapping to a target region is often modeled using a Poisson distribution. However, the Poisson assumption that the variance is equal to the mean is often violated because read counts from target regions are overdispersed. Part of this overdispersion can be accommodated by modeling the data with a negative-binomial distribution with a variance parameter for each test. However, the negative binomial distribution assumes that the mean and variance have a quadratic relationship that is consistent across individuals. We have found that this assumption is violated by sequencing data and causes poor calibration of the tests, particularly when sample sizes are small. The CHT therefore includes negative binomial overdispersion parameters for each individual, Ω, and for each target region, ϕ. After adding these additional dispersion parameters, the data are modeled with a beta-negative binomial (BNB) distribution. The expected number of read counts for an individual, λ, is defined as: where G is the genotype of individual i at test SNP m, and T is the total number of reads mapped genome-wide for individual i. The likelihood of the parameters is then given by the equation: where x is the number of reads for individual i in target region j. The CHT can additionally adjust the total reads for each target region and individual by taking into account the GC content and the fraction of reads found in target regions (Supplementary Note 2). To account for unknown co-variates the total reads can also be adjusted using principal component analysis (Supplementary Note 4).

Modeling the allelic imbalances

Allele-specific read counts are sometimes modeled using the binomial distribution, however, we have found that allele-specific read counts are overdispersed. We instead model allele-specific read counts with a beta-binomial (BB) distribution and include a parameter ϒ (estimated separately) that captures the overdispersion for each individual. The likelihood of the parameters given the data is then: where y is the number of allele-specific reads from the reference haplotype and n is the total number of allele-specific reads for individual i at target SNP k. The expected fraction of allele-specific reads from the reference allele is .

Correcting for incorrect genotype calls

SNP genotypes that are incorrectly called as heterozygous are a major source of false positives, since reads that overlap them appear to come from only one allele. To account for this issue, we assume that allele-specific reads are drawn from a mixture of two beta-binomials, with probabilities H and 1 − H, where H is the probability that individual i is heterozygous for SNP k. Reads from heterozygous individuals contain the reference allele with probability p. We assume that reads from homozygous individuals still have a small probability of coming from the other allele due to sequencing errors, which occur with probability, perr. The probability of observing y reads from the reference allele for individual i at SNP k then becomes: We found that even SNPs with heterozygous probabilities of 1.0 are occasionally miscalled so we set heterozygous probabilities to a maximum value of 0.99. We then update this heterozygous probability using sequencing data obtained from the same individual. Sequencing data may consist of DNA sequencing reads or reads aggregated across multiple types of experiments performed on the same individual (e.g. RNA-seq and ChIP-seq reads). For a SNP with heterozygous probability , we define the updated heterozygous probability, Ĥ as:

The combined likelihood ratio test

The combined likelihood of both components of the model is: The overdispersion parameters for the combined likelihood model can be estimated using a maximum likelihood approach that uses data from many genomic regions (Supplementary Note 3). To test for an association with genotype we perform a likelihood ratio test that compares the alternative hypothesis α ≠ β to the null hypothesis α = β. The CHT returns a likelihood ratio statistic where θ̂1 and θ̂0 are maximum likelihood estimates of the parameters under the alternative and null hypotheses. P values can be calculated from the test statistic under the asymptotic assumption that −2log(Λ) is χ2 distributed with one degree of freedom. The CHT is robust to non-additive allelic effects (Supplementary Note 7), and has a running time that is linear with the number of individuals in the study (Supplementary Note 8).

19 in total

1. A powerful and flexible statistical framework for testing hypotheses of allele-specific gene expression from RNA-seq data.

Authors: Daniel A Skelly; Marnie Johansson; Jennifer Madeoy; Jon Wakefield; Joshua M Akey
Journal: Genome Res Date: 2011-08-26 Impact factor: 9.043

2. Analysis and design of RNA sequencing experiments for identifying isoform regulation.

Authors: Yarden Katz; Eric T Wang; Edoardo M Airoldi; Christopher B Burge
Journal: Nat Methods Date: 2010-11-07 Impact factor: 28.547

3. Transcriptome genetics using second generation sequencing in a Caucasian population.

Authors: Stephen B Montgomery; Micha Sammeth; Maria Gutierrez-Arcelus; Radoslaw P Lach; Catherine Ingle; James Nisbett; Roderic Guigo; Emmanouil T Dermitzakis
Journal: Nature Date: 2010-03-10 Impact factor: 49.962

4. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

5. Understanding mechanisms underlying human gene expression variation with RNA sequencing.

Authors: Joseph K Pickrell; John C Marioni; Athma A Pai; Jacob F Degner; Barbara E Engelhardt; Everlyne Nkadori; Jean-Baptiste Veyrieras; Matthew Stephens; Yoav Gilad; Jonathan K Pritchard
Journal: Nature Date: 2010-03-10 Impact factor: 49.962

6. DNase I sensitivity QTLs are a major determinant of human expression variation.

Authors: Jacob F Degner; Athma A Pai; Roger Pique-Regi; Jean-Baptiste Veyrieras; Daniel J Gaffney; Joseph K Pickrell; Sherryl De Leon; Katelyn Michelini; Noah Lewellen; Gregory E Crawford; Matthew Stephens; Yoav Gilad; Jonathan K Pritchard
Journal: Nature Date: 2012-02-05 Impact factor: 49.962

7. Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads.

Authors: Ernest Turro; Shu-Yi Su; Ângela Gonçalves; Lachlan J M Coin; Sylvia Richardson; Alex Lewin
Journal: Genome Biol Date: 2011-02-10 Impact factor: 13.583

8. Differential expression analysis for sequence count data.

Authors: Simon Anders; Wolfgang Huber
Journal: Genome Biol Date: 2010-10-27 Impact factor: 13.583

9. AlleleSeq: analysis of allele-specific expression and binding in a network framework.

Authors: Joel Rozowsky; Alexej Abyzov; Jing Wang; Pedro Alves; Debasish Raha; Arif Harmanci; Jing Leng; Robert Bjornson; Yong Kong; Naoki Kitabayashi; Nitin Bhardwaj; Mark Rubin; Michael Snyder; Mark Gerstein
Journal: Mol Syst Biol Date: 2011-08-02 Impact factor: 11.429

10. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data.

Authors: Jacob F Degner; John C Marioni; Athma A Pai; Joseph K Pickrell; Everlyne Nkadori; Yoav Gilad; Jonathan K Pritchard
Journal: Bioinformatics Date: 2009-10-06 Impact factor: 6.937

183 in total

1. Fast and efficient QTL mapper for thousands of molecular phenotypes.

Authors: Halit Ongen; Alfonso Buil; Andrew Anand Brown; Emmanouil T Dermitzakis; Olivier Delaneau
Journal: Bioinformatics Date: 2015-12-26 Impact factor: 6.937

2. Tissue-Specific cis-Regulatory Divergence Implicates eloF in Inhibiting Interspecies Mating in Drosophila.

Authors: Peter A Combs; Joshua J Krupp; Neil M Khosla; Dennis Bua; Dmitri A Petrov; Joel D Levine; Hunter B Fraser
Journal: Curr Biol Date: 2018-11-29 Impact factor: 10.834

Review 3. Advances in Transcriptomics: Investigating Cardiovascular Disease at Unprecedented Resolution.

Authors: Robert C Wirka; Milos Pjanic; Thomas Quertermous
Journal: Circ Res Date: 2018-04-27 Impact factor: 17.367

4. Genetic regulatory signatures underlying islet gene expression and type 2 diabetes.

Authors: Arushi Varshney; Laura J Scott; Ryan P Welch; Michael R Erdos; Peter S Chines; Narisu Narisu; Ricardo D'O Albanus; Peter Orchard; Brooke N Wolford; Romy Kursawe; Swarooparani Vadlamudi; Maren E Cannon; John P Didion; John Hensley; Anthony Kirilusha; Lori L Bonnycastle; D Leland Taylor; Richard Watanabe; Karen L Mohlke; Michael Boehnke; Francis S Collins; Stephen C J Parker; Michael L Stitzel
Journal: Proc Natl Acad Sci U S A Date: 2017-02-13 Impact factor: 11.205

5. Bayesian estimation of genetic regulatory effects in high-throughput reporter assays.

Authors: William H Majoros; Young-Sook Kim; Alejandro Barrera; Fan Li; Xingyan Wang; Sarah J Cunningham; Graham D Johnson; Cong Guo; William L Lowe; Denise M Scholtens; M Geoffrey Hayes; Timothy E Reddy; Andrew S Allen
Journal: Bioinformatics Date: 2020-01-15 Impact factor: 6.937

Review 6. Characterization of noncoding regulatory DNA in the human genome.

Authors: Ran Elkon; Reuven Agami
Journal: Nat Biotechnol Date: 2017-08-08 Impact factor: 54.908

7. Promoter shape varies across populations and affects promoter evolution and expression noise.

Authors: Ignacio E Schor; Jacob F Degner; Dermot Harnett; Enrico Cannavò; Francesco P Casale; Heejung Shim; David A Garfield; Ewan Birney; Matthew Stephens; Oliver Stegle; Eileen E M Furlong
Journal: Nat Genet Date: 2017-02-13 Impact factor: 38.330

8. Type 2 Diabetes Variants Disrupt Function of SLC16A11 through Two Distinct Mechanisms.

Authors: Victor Rusu; Eitan Hoch; Josep M Mercader; Danielle E Tenen; Melissa Gymrek; Christina R Hartigan; Michael DeRan; Marcin von Grotthuss; Pierre Fontanillas; Alexandra Spooner; Gaelen Guzman; Amy A Deik; Kerry A Pierce; Courtney Dennis; Clary B Clish; Steven A Carr; Bridget K Wagner; Monica Schenone; Maggie C Y Ng; Brian H Chen; Federico Centeno-Cruz; Carlos Zerrweck; Lorena Orozco; David M Altshuler; Stuart L Schreiber; Jose C Florez; Suzanne B R Jacobs; Eric S Lander
Journal: Cell Date: 2017-06-29 Impact factor: 41.582

9. Genetic Regulatory Mechanisms of Smooth Muscle Cells Map to Coronary Artery Disease Risk Loci.

Authors: Boxiang Liu; Milos Pjanic; Ting Wang; Trieu Nguyen; Michael Gloudemans; Abhiram Rao; Victor G Castano; Sylvia Nurnberg; Daniel J Rader; Susannah Elwyn; Erik Ingelsson; Stephen B Montgomery; Clint L Miller; Thomas Quertermous
Journal: Am J Hum Genet Date: 2018-08-23 Impact factor: 11.025

10. Tools and best practices for data processing in allelic expression analysis.

Authors: Stephane E Castel; Ami Levy-Moonshine; Pejman Mohammadi; Eric Banks; Tuuli Lappalainen
Journal: Genome Biol Date: 2015-09-17 Impact factor: 13.583