The SNPMaP package for R: a framework for genome-wide association using DNA pooling on microarrays.

Oliver S P Davis, Robert Plomin, Leonard C Schalkwyk.

Abstract

SUMMARY: Large-scale genome-wide association (GWA) studies using thousands of high-density SNP microarrays are becoming an essential tool in the search for loci related to heritable variation in many phenotypes. However, the cost of GWA remains beyond the reach of many researchers. Fortunately, the majority of statistical power can still be obtained by estimating allele frequencies from DNA pools, reducing the cost to that of tens, rather than thousands of arrays. We present a set of software tools for processing SNPMaP (SNP microarrays and pooling) data from CEL files to Relative Allele Scores in the rich R statistical computing environment.

Year: 2008 PMID: 19008252 PMCID: PMC2639010 DOI: 10.1093/bioinformatics/btn587

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Genetic variation has been an important factor in nearly every aspect of human health and disease. This takes forms ranging from single-locus (Mendelian) traits to small effects from numerous loci. The early triumphs of human genetics and positional cloning involved rare, drastic, single mutations. These were amenable to recombination mapping techniques using large multi-generation pedigrees, and allowed the genome to be scanned for linkage using just a few hundred DNA markers. The extension of these techniques to quantitative traits offered the same systematic genome coverage but poor statistical power for detecting loci of small effect (Balding, 2006). Unrelated individuals from a population are much easier to recruit in large numbers than are families, and individual candidate mutations can be tested for association with a phenotypic trait with good statistical power. This association generally extends to nearby sequences as well, because of the relative rarity of new mutations and historical recombination in the population, giving rise to linkage disequilibrium (Slatkin, 2008). This indirect association allows scanning of candidate loci or regions. Given enough polymorphic markers, association scanning can be extended genome wide (McCarthy et al., 2008). The number of markers required for a truly comprehensive scan of the human genome is thought to be in the order of a million (Barrett and Cardon, 2006). SNP genotyping microarrays have made it possible to genotype individuals at a million loci quickly, and genome-wide association (GWA) studies are now being used to scan the human genome for loci related to heritable variation in many phenotypes (Wellcome Trust Case Control Consortium, 2007). 
Many population samples with associated phenotypic information could yield valuable insights through GWA analysis, but the funding and industrial-scale infrastructure required for comprehensive genotyping of thousands of individuals will not be available; nor will funding be available to genotype samples again as new microarrays emerge. Additionally, given the extreme degree of multiple testing, the current generation of GWA experiments is far from definitive, and many more studies will be needed to fully confirm findings. Fortunately, the majority of statistical power can be obtained by estimation of allele frequencies from DNA pools, in which small amounts of DNA from individuals in a group (e.g. cases or controls) are combined on the same microarray. This in effect averages allele frequencies biologically (rather than genotyping each individual and averaging their allele frequencies statistically). Importantly, the cost in time and effort for GWA studies using pooled DNA is well within the scope of a PhD project or post-doctoral contract even when several (usually 10 or more) independent pools are created to represent each group (Butcher et al., 2004; Kirov et al., 2006; Pearson et al., 2007). We present a set of software tools for processing SNPMaP (SNP microarrays and pooling) data in the increasingly popular R environment for statistical computing (R Development Core Team, 2008; http://www.r-project.org).
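The idea that a pool averages allele frequencies biologically can be illustrated with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the SNPMaP package; the function name and the dna_amounts parameter are hypothetical, and it assumes diploid individuals and a measurement that reflects the DNA mixture exactly.

```python
def pooled_allele_frequency(a_allele_counts, dna_amounts=None):
    """Expected frequency of allele A in a pool of diploid individuals.

    a_allele_counts: number of A alleles carried by each individual (0, 1 or 2).
    dna_amounts: relative amount of DNA contributed by each individual
                 (hypothetical parameter; equal amounts if omitted).
    """
    if dna_amounts is None:
        dna_amounts = [1.0] * len(a_allele_counts)
    total = sum(dna_amounts)
    # Each individual contributes its own allele A fraction (count / 2),
    # weighted by how much of its DNA went into the pool.
    return sum(w * g / 2 for g, w in zip(a_allele_counts, dna_amounts)) / total


genotypes = [2, 1, 1, 0, 0]  # AA, AB, AB, BB, BB

# Statistical averaging: genotype everyone, then count alleles.
individual_estimate = sum(genotypes) / (2 * len(genotypes))

# Biological averaging: equal DNA amounts give the same value.
pooled_equal = pooled_allele_frequency(genotypes)

# Unequal DNA amounts bias the pooled value towards over-represented individuals.
pooled_unequal = pooled_allele_frequency(genotypes, [3.0, 1.0, 1.0, 1.0, 1.0])
```

With equal contributions the pooled value matches the individually genotyped estimate, which is why careful quantification of each DNA sample before pooling matters in this sketch.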

2 SOFTWARE OVERVIEW

The SNPMaP package has been designed to handle the processing of SNPMaP data from the CEL files generated by the Affymetrix GeneChip Command Console (AGCC) or GeneChip Operating Software (GCOS), through to the RAS (Relative Allele Scores, the pooling equivalent of a relative allele frequency; Butcher et al., 2008) used in most analyses. This can be as simple as a single call to the snpmap() function at the R prompt. The package will identify and read in the CEL files from the current directory, extract the relevant probe intensities and calculate a mean RAS for each SNP on each chip, returning a SNPMaP S4 object containing the scores. Given the large amount of data generated by current SNP arrays, even with the relatively modest numbers of arrays (tens) typical of SNPMaP experiments, we have provided a lowMemory option that uses memory-mapping (with the R.huge package) to allow analysis to be done on a 32-bit desktop PC with 1 GB of memory (naturally there is a speed penalty). If memory limits are exceeded in the course of the analysis, SNPMaP attempts to automatically switch from storing objects in memory to storing objects on disk. Most SNPMaP analyses use 20 to 30 arrays, so, as a severe test, we generated RAS summaries from 50 Affymetrix 6.0 arrays simultaneously on a 2.4 GHz Pentium 4 system with 1 GB of RAM, running Windows XP and reading and writing the data to a remote server. This took 7 h using the lowMemory option. S4 methods for generic R functions such as summary(), plot() and boxplot() make it easy to query the SNPMaP object and visualize the data it contains. Accessors provide convenient access to the data. All functions are documented through the R help system. For example, typing ?snpmap will bring up a page describing the snpmap() function and its usage. Similarly, package?SNPMaP and class?SNPMaP will bring up help pages for the SNPMaP package and the SNPMaP class, respectively.
Although the SNPMaP object is intended to be useable for further analyses, the data can also easily be extracted to a matrix. A user who wants CEL files transformed into a spreadsheet of RAS in the simplest possible way need not use R interactively at all; example scripts that can be invoked from various shells are available from the web site, including a point-and-click front end for Windows. These steps comprise the simplest route from CEL files to the RAS used for association analysis. On the other hand, a user who wants to examine all steps of the analysis and experiment with new methods has access to the data in a straightforward and convenient form. This flexibility is one of the major strengths of an implementation in R because of the impressive array of cutting-edge statistical techniques already available in the R environment.
A more involved approach might begin by extracting the raw probe intensities from the CEL files, running the workflow function cel2raw rather than the default cel2rasS. This allows the user to plot the raw probe intensities and generate pseudoimages of the processed chips using the image() method, so the user can check for scanning artifacts, such as dust or fingerprints. The raw intensities can then be further processed to individual probe quartet RAS by a workflow function. Other options available at the snpmap() or workflow stage include normalize, which quantile normalizes the raw probe intensities across chips; log.intensities, which causes SNPMaP to use the natural logarithm of the probe intensities; and useMM, which causes SNPMaP to subtract mismatch probe intensities (where available) before calculating RAS. To calculate RAS, the package uses the method most commonly described in the literature: dividing the (possibly modified) intensity of allele A by the sum of the intensities of alleles A and B.
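The RAS calculation and the quantile normalization step described above can be sketched in a few lines. This is an illustrative Python sketch, not the package's R implementation: the function names are hypothetical, the keyword arguments only mirror the useMM and log.intensities options, and the order in which mismatch subtraction and the logarithm are applied is an assumption.

```python
import math


def relative_allele_score(a, b, mm_a=0.0, mm_b=0.0,
                          use_mm=False, log_intensities=False):
    """RAS = A / (A + B) for one pair of allele-specific probe intensities.

    Assumed order of the optional steps: subtract mismatch intensities
    first, then take natural logarithms.
    """
    if use_mm:
        a, b = a - mm_a, b - mm_b
    if log_intensities:
        a, b = math.log(a), math.log(b)
    return a / (a + b)


def quantile_normalize(chips):
    """Quantile normalize intensities across chips.

    chips: list of equal-length lists, one list of intensities per chip.
    Each value is replaced by the mean, across chips, of the values that
    share its rank. Ties are broken by position.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value on each chip.
    reference = [sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


ras_plain = relative_allele_score(300.0, 100.0)
ras_mm = relative_allele_score(300.0, 100.0, mm_a=100.0, mm_b=50.0, use_mm=True)
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization every chip shares the same set of intensity values, so differences in overall brightness between chips no longer shift the RAS. The sketch does not guard against non-positive intensities after mismatch subtraction, which a real implementation would need to handle.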
If summary RAS for each array are required, they are calculated by a user-defined function (defaulting to the mean) of the RAS for each pair of probes. However, there are other ways of calculating summary RAS, such as taking the mean of allele A intensities across an array and dividing that by the summed means of allele A and B. Should users wish to use this alternative method of calculating RAS, a modified version of the SNPMaP package is available from the authors. On recent Affymetrix arrays mismatch probes have been discarded in favor of greater perfect match probe density, so that mismatch intensities can no longer be subtracted before calculating RAS. Although this means that RAS based on the new arrays are no longer necessarily a good estimate of the absolute allele frequency in the pool, the critical comparison for an association analysis is the difference in allele frequencies between case and control pools. For this reason, the loss of the mismatch probes has relatively little effect on the resulting association analysis (Pearson et al., 2007). For a discussion of the value of mismatch probes, see Millenaar et al. (2006). Some authors have discussed the possibility of correcting pooled SNP assays for differential hybridization of the allele-specific probes, a process known as k-correction (Le Hellard et al., 2002; Simpson et al., 2005). The intention here is to produce more accurate estimates of absolute allele frequency in the pools. However, others have noted that again, since the critical comparison is between cases and controls rather than between absolute allele frequencies and a reference population, such correction has little effect on the results of the analysis (Macgregor et al., 2006). Nevertheless, because of the way SNPMaP is implemented, for those users with access to a large reference sample of individuals typed on individual arrays, it is straightforward to apply the correction.
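The two ways of summarizing RAS and the k-correction discussed above can be compared on a toy example. This is a hedged Python sketch rather than the package's code: all function names are hypothetical, and the k-correction uses the form commonly given in the pooling literature, A / (A + kB), with k taken as the mean A/B intensity ratio observed in heterozygous individuals.

```python
from statistics import mean


def summary_ras_mean_of_ratios(pairs, summarize=mean):
    """Summarize per-pair RAS with a user-supplied function (default: mean).

    pairs: list of (A, B) intensity pairs for one SNP on one array.
    """
    return summarize([a / (a + b) for a, b in pairs])


def summary_ras_ratio_of_means(pairs):
    """Alternative summary: mean A intensity over summed mean A and mean B."""
    mean_a = mean(a for a, _ in pairs)
    mean_b = mean(b for _, b in pairs)
    return mean_a / (mean_a + mean_b)


def k_corrected_frequency(a, b, k):
    """Allele A frequency estimate corrected for differential hybridization.

    k: mean A/B intensity ratio in heterozygotes (assumed to come from a
    reference sample typed on individual arrays). k = 1 leaves RAS unchanged.
    """
    return a / (a + k * b)


def case_control_difference(case_ras, control_ras):
    """Difference in mean RAS between case pools and control pools."""
    return mean(case_ras) - mean(control_ras)


pairs = [(300.0, 100.0), (100.0, 100.0)]
by_ratios = summary_ras_mean_of_ratios(pairs)
by_means = summary_ras_ratio_of_means(pairs)
corrected = k_corrected_frequency(300.0, 100.0, k=1.5)
difference = case_control_difference([0.62, 0.60, 0.64], [0.55, 0.57, 0.56])
```

The two summaries differ because the ratio of means gives brighter probe pairs more weight, whereas the mean of ratios weights every pair equally. The association test itself depends on the case-control difference, which is the quantity the text identifies as the critical comparison.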

3 DISCUSSION

The SNPMaP package represents an evolution of R scripts that have been used by us (e.g. Meaburn et al., 2006, 2008) and others (e.g. Wilkening et al., 2007) for processing SNPMaP data. Although a few other software applications are available, such as GenePool (Pearson et al., 2007) or MPDA (Yang et al., 2008), the strength of the SNPMaP package is in making the data readily available in the rich R environment, allowing easy access to the early stages of the analysis, effective visualization using R's powerful graphics system and great flexibility in constructing association tests, comparing methods and making modifications (Barratt et al., 2002; Sham et al., 2002): a hands-on, rather than a hands-off approach. The package currently supports the most recent Affymetrix arrays: both Mapping250K arrays and the 5.0 and 6.0 GenomeWideSNP arrays. Support for other arrays will be added in future versions. We also have plans for a supporting package aimed at providing implementations of common association analyses described in the literature, along with some novel methods and further visualization tools. We have kept the data structures straightforward to facilitate further development of analysis methods by users; although the current version is specifically aimed at SNPMaP using Affymetrix SNP arrays, it is readily extended for other platforms. For example, our tests have included using the package to normalize the raw intensities across 265 Affymetrix Mouse Exon (gene expression) arrays. Above all, the SNPMaP package affords users the flexibility to make best use of the sophisticated tools already available in the R environment to analyze their SNP microarrays and pooling studies.

REFERENCES (19 in total)

1. Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design.

Authors: B J Barratt; F Payne; H E Rance; S Nutland; J A Todd; D G Clayton
Journal: Ann Hum Genet Date: 2002-11 Impact factor: 1.670

2. Genotyping pooled DNA on microarrays: a systematic genome screen of thousands of SNPs in large samples to detect QTLs for complex traits.

Authors: Lee M Butcher; Emma Meaburn; Lin Liu; Cathy Fernandes; Linzy Hill; Ammar Al-Chalabi; Robert Plomin; Leo Schalkwyk; Ian W Craig
Journal: Behav Genet Date: 2004-09 Impact factor: 2.805

3. Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies.

Authors: John V Pearson; Matthew J Huentelman; Rebecca F Halperin; Waibhav D Tembe; Stacey Melquist; Nils Homer; Marcel Brun; Szabolcs Szelinger; Keith D Coon; Victoria L Zismann; Jennifer A Webster; Thomas Beach; Sigrid B Sando; Jan O Aasly; Reinhard Heun; Frank Jessen; Heike Kolsch; Magdalini Tsolaki; Makrina Daniilidou; Eric M Reiman; Andreas Papassotiropoulos; Michael L Hutton; Dietrich A Stephan; David W Craig
Journal: Am J Hum Genet Date: 2006-12-06 Impact factor: 11.025

4. Genome-wide association studies for complex traits: consensus, uncertainty and challenges.

Authors: Mark I McCarthy; Gonçalo R Abecasis; Lon R Cardon; David B Goldstein; Julian Little; John P A Ioannidis; Joel N Hirschhorn
Journal: Nat Rev Genet Date: 2008-05 Impact factor: 53.242

5. A tutorial on statistical methods for population association studies.

Authors: David J Balding
Journal: Nat Rev Genet Date: 2006-10 Impact factor: 53.242

6. Analysis of pooled DNA samples on high density arrays without prior knowledge of differential hybridization rates.

Authors: Stuart Macgregor; Peter M Visscher; Grant Montgomery
Journal: Nucleic Acids Res Date: 2006-04-20 Impact factor: 16.971

7. A central resource for accurate allele frequency estimation from pooled DNA genotyped on DNA microarrays.

Authors: Claire L Simpson; Joanne Knight; Lee M Butcher; Valerie K Hansen; Emma Meaburn; Leonard C Schalkwyk; Ian W Craig; John F Powell; Pak C Sham; Ammar Al-Chalabi
Journal: Nucleic Acids Res Date: 2005-02-08 Impact factor: 16.971

8. Genome-wide quantitative trait locus association scan of general cognitive ability using pooled DNA and 500K single nucleotide polymorphism microarrays.

Authors: L M Butcher; O S P Davis; I W Craig; R Plomin
Journal: Genes Brain Behav Date: 2008-01-22 Impact factor: 3.449

9. How to decide? Different methods of calculating gene expression from short oligonucleotide array data will give different results.

Authors: Frank F Millenaar; John Okyere; Sean T May; Martijn van Zanten; Laurentius A C J Voesenek; Anton J M Peeters
Journal: BMC Bioinformatics Date: 2006-03-15 Impact factor: 3.169

10. Allelotyping of pooled DNA with 250 K SNP microarrays.

Authors: Stefan Wilkening; Bowang Chen; Michael Wirtenberger; Barbara Burwinkel; Asta Försti; Kari Hemminki; Federico Canzian
Journal: BMC Genomics Date: 2007-03-16 Impact factor: 3.969
