QuickSNP: an automated web server for selection of tagSNPs.

Deepak Grover, Alonzo S Woodfield, Ranjana Verma, Peter P Zandi, Douglas F Levinson, James B Potash.

Abstract

Although large-scale genetic association studies involving hundreds to thousands of SNPs have become feasible, the associated cost is substantial. Even with the increased efficiency introduced by the use of tagSNPs, researchers are often seeking ways to maximize resource utilization given a set of SNP-based gene-mapping goals. We have developed a web server named QuickSNP in order to provide cost-effective selection of SNPs, and to fill in some of the gaps in existing SNP selection tools. One useful feature of QuickSNP is the option to select only gene-centric SNPs from a chromosomal region in an automated fashion. Other useful features include automated selection of coding non-synonymous SNPs, SNP filtering based on inter-SNP distances and information regarding the availability of genotyping assays for SNPs and whether they are present on whole genome chips. The program produces user-friendly summary tables and results, and a link to a UCSC Genome Browser track illustrating the position of the selected tagSNPs in relation to genes and other genomic features. We hope the unique combination of features of this server will be useful for researchers aiming to select markers for their genotyping studies. The server is freely available and can be accessed at the URL http://bioinformoodics.jhmi.edu/quickSNP.pl.

Year: 2007 PMID: 17517769 PMCID: PMC1933212 DOI: 10.1093/nar/gkm329

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The biggest challenge in human genetics currently is to identify the genes whose alleles confer susceptibility to disease. It is believed that there will be many loci that increase the risk for each common disease (1). Since each causative gene may make only a very modest contribution to disease risk, identification of particular susceptibility variants becomes quite difficult. While genetic association studies have been used in gene mapping, their efficiency has been limited because they have typically assessed only one or a few genes at a time. The development of new SNP genotyping technologies, which can handle from dozens to hundreds of thousands of SNPs, and large numbers of samples, promises to accelerate gene mapping. Current platforms include the Illumina BeadArray and BeadChip systems (2,3), the Affymetrix GeneChip Mapping Arrays (4) and Applied Biosystems' TaqMan SNP Genotyping Assays (5). The newest technologies, while powerful, bring with them substantial costs, as they can involve as many as hundreds of millions of genotypes. For this reason, researchers have been trying to devise ways to maximize efficiency of resource utilization given a set of SNP-based gene-mapping goals. It has been found that the pattern of linkage disequilibrium (LD) varies across the human genome and that there are discrete regions of high LD in the genome, called haplotype blocks (6). Most variation in populations can be characterized by a small number of common haplotypes. By selecting SNPs that uniquely identify or ‘tag’ these haplotypes, the number of markers and, hence, the cost of genotyping can be significantly reduced. The approach became more powerful with the availability of genetic data from the International HapMap Project (7), which contains genotype data for ∼4 million SNPs from each of four populations: Yoruba from Ibadan, Nigeria (YRI), Japanese from Tokyo (JPT), Chinese from Beijing (CHB) and United States residents with European ancestry (CEU).
Even with the increased efficiency introduced by tagSNPs, investigators are typically in the position of having to make strategic decisions about which set of tagSNPs to study. One strategy is to focus on those within genes, as these have the greatest likelihood of being functionally relevant or being in LD with those that are functional (8). Recently, a similar question was explored using empirical data from the HapMap-ENCODE project; tagSNPs chosen to capture common variation in exonic as well as evolutionarily conserved regions yielded genotype savings compared with a tagging approach that captured all common variation across the region (9). While the extent to which functionally important elements in the genome reside strictly within and near genes is not known, a gene-centric genotyping strategy may be a reasonable approach to searching for disease susceptibility alleles in the setting of limited resources. The choice of SNPs for genetic association testing, thus, is a crucial step that will directly affect both the cost and the outcome of studies. Since the number of SNPs can range into the thousands, manual selection can be extremely time-consuming. There are some useful internet-based tools available for selection and prioritization of SNPs for genotyping. These include SNPper (10) (http://snpper.chip.org/), TAMAL (11) (http://neoref.ils.unc.edu/tamal/index.jsp), SNPSelector (12) (http://snpselector.duhs.duke.edu/hqsnp36.html), SNPHunter (13) (http://www.hsph.harvard.edu/ppg/software.htm), PupasView (14) (http://pupasview.bioinfo.cipf.es/) and tagger (15) (http://www.broad.mit.edu/mpg/tagger/). These programs have a variety of strengths as well as limitations. Among the gaps: most of them do not allow for automated selection of gene-based SNPs in a region, and none examines SNP coverage on genome-wide microarray SNP genotyping platforms. 
We have developed a web server named QuickSNP to provide selection of tagSNPs in a chromosomal region, and to fill in some of the gaps in existing SNP selection tools. One useful feature of QuickSNP is the option to input the coordinates of a chromosomal region and have the program select SNPs, in an automated fashion, only from the genes within that region. Other useful features include automated selection of coding non-synonymous SNPs, SNP filtering based on inter-SNP distances, and reporting of whether SNPs have available assays or are present on whole genome chips. There are several situations where we believe this tool will be particularly useful, including: (i) planning an LD-mapping study of a region de novo, where one has decided for any of a number of reasons to focus on genes and (ii) following up data that one plans to obtain, or has obtained, from a genome-wide association chip, where one wants to ‘fill in’ a particular region, either because the chip scan produced a positive result or because of other information (e.g. a linkage peak or interest in a particular gene pathway), by finding additional tagSNPs as well as coding non-synonymous SNPs in genes in the region.

MATERIALS AND METHODS

Implementation

QuickSNP utilizes Apache as its web server, and CGI (Common Gateway Interface) scripts handle dataflow and validation to and from a dynamic HTML interface that utilizes cascading style sheet objects and integrated JavaScript. The data extraction and manipulation portion of the program is written as Perl (Practical Extraction and Report Language) modules and features two other programs embedded in the main code: Haploview, a freely available Java-based utility, and liftOver, a freely available Linux command-line application. QuickSNP is available at the URL http://bioinformoodics.jhmi.edu/quickSNP.pl. It is hosted on a cluster of processors running Linux at the Johns Hopkins McKusick-Nathans Institute for Genetic Medicine. All databases are downloaded and stored locally on the Linux cluster. Files uploaded by users are not copied to a file server; instead, data are extracted dynamically from these files using CGI file handles, so information uploaded by users is not retained on a file server.

Functionality

The basic function of QuickSNP is to generate a list of tagSNPs in a given chromosomal region, or for the genes in that region, or for any specified list of genes. For genomic position (whole region) searches, genotype data for SNPs lying in the region are extracted from the HapMap database (Figure 1). For genomic position (genes only) and gene name-based queries, gene coordinates are first extracted from the Entrez gene database and then SNP genotype data for those positions are extracted from HapMap. If the genomic position entered is not for NCBI build 35 (May 2004), it is first converted to build 35 by the liftOver program. The genomic position is also adjusted according to the length of flanking sequence used. The resulting SNP list is passed to the Haploview program, which generates tagSNPs based on the tagger algorithm (15) using the user-specified r², minimum minor allele frequency and include/exclude tag specifications (if any).
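The tagging step described above can be illustrated with a simplified greedy pairwise sketch. This is not Haploview's tagger code: the function name `greedy_tag_snps`, the dictionary layout for pairwise r² values and the example rs identifiers and values are all hypothetical, chosen only to show how an r² threshold and a forced-include list might shape the tag set.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


def greedy_tag_snps(
    snps: List[str],
    r2: Dict[FrozenSet[str], float],
    threshold: float = 0.8,
    force_include: Iterable[str] = (),
) -> List[str]:
    """Pick tags until every SNP is captured by some tag at r2 >= threshold.

    Simplified illustration only; assumes r2 holds pairwise values keyed by
    frozenset({snp_a, snp_b}); missing pairs are treated as r2 = 0.
    """

    def captures(tag: str) -> Set[str]:
        # A tag always captures itself, plus every SNP in strong LD with it.
        captured = {tag}
        for other in snps:
            if other != tag and r2.get(frozenset((tag, other)), 0.0) >= threshold:
                captured.add(other)
        return captured

    tags: List[str] = []
    uncovered: Set[str] = set(snps)

    # Forced tags go in first, so later picks only cover what they miss.
    for tag in force_include:
        if tag in snps and tag not in tags:
            tags.append(tag)
            uncovered -= captures(tag)

    # Greedy step: repeatedly take the SNP that captures the most uncovered SNPs.
    while uncovered:
        candidates = sorted(s for s in snps if s not in tags)
        best = max(candidates, key=lambda s: len(captures(s) & uncovered))
        tags.append(best)
        uncovered -= captures(best)
    return tags


# Hypothetical pairwise r2 values for four made-up SNPs.
example_snps = ["rs1", "rs2", "rs3", "rs4"]
example_r2 = {
    frozenset(("rs1", "rs2")): 0.90,
    frozenset(("rs2", "rs3")): 0.85,
    frozenset(("rs1", "rs3")): 0.50,
}
print(greedy_tag_snps(example_snps, example_r2))
print(greedy_tag_snps(example_snps, example_r2, force_include=["rs1"]))
```

In this toy input, forcing `rs1` into the tag set yields a larger set than the unconstrained run, which is the kind of trade-off the include option implies.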

Figure 1.

Schematic overview of the functioning of the QuickSNP web server.

There are categories of options that the user can select to obtain the best results from QuickSNP (see below). Input options like include/exclude tags and coding non-synonymous SNPs are used before tagSNP selection and they affect the list of SNPs that is used by Haploview for tagSNP selection. After selection of tagSNPs, the results can be further filtered by options such as removing SNPs lying too close (by a user-defined distance criterion). The user sees a results web page with two types of output. One is the core output consisting of a summary statistics table, a file displaying tagSNPs selected, and another file displaying pairwise LD tests as well as LD bins. The other type of output consists of additional information, including tables for genotype and allele frequencies, for the occurrence of SNPs in whole-genome chips and assays, and for the cost of genotyping. There is also a link to a graphical display of tagSNPs in the UCSC Genome Browser.
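The post-selection distance filter can be sketched as follows. The text does not say which SNP of a too-close pair QuickSNP discards, so this sketch assumes a simple left-to-right rule: walk the SNPs in positional order and keep one only if it lies at least the minimum distance from the last SNP kept. The function name `filter_close_snps` and the example positions are hypothetical.

```python
from typing import Dict, List, Optional


def filter_close_snps(positions: Dict[str, int], min_distance: int) -> List[str]:
    """Drop SNPs lying closer than min_distance (bp) to the previously kept SNP.

    Assumption: ties are resolved by keeping the leftmost SNP of a close pair.
    """
    kept: List[str] = []
    last_pos: Optional[int] = None
    for snp, pos in sorted(positions.items(), key=lambda item: item[1]):
        if last_pos is None or pos - last_pos >= min_distance:
            kept.append(snp)
            last_pos = pos
    return kept


# Hypothetical SNP positions on one chromosome.
example_positions = {"rsA": 1000, "rsB": 1030, "rsC": 1100, "rsD": 1150}
print(filter_close_snps(example_positions, min_distance=60))
```

A different tie-breaking rule (for example, preferring forced-include SNPs) would change which member of a close pair survives.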

User interface

We have attempted to create a simplified user interface for QuickSNP so that it can be employed without the need for sophisticated computational skills. The input screen is divided into three main sections: input method, search conditions and additional options. The user may enter either genomic positions or gene names in the search window of the input method section. Multiple gene names can be entered by either typing in the corresponding window or uploading a file with a list of gene names. For the genomic position-based searches, users can further specify whether they wish to consider the whole region or just the genes within the specified region for tagSNP selection. The user is then required, in the search conditions section, to enter the desired r², minor allele frequency and HapMap population. To access the basic functionality of the server, the user need not consider the section containing additional options. However, depending on the study design, these options can enable more judicious and efficient selection of tagSNPs. For example, the user may want to include or exclude certain SNPs (based on availability of PCR primers or on past performance of genotyping assays). The user has the option to include flanking sequence around genes, reject SNPs that are too close to each other (because they are less likely to work with certain genotyping platforms), and force include coding non-synonymous SNPs, which can be identified and included automatically by QuickSNP through a search of the whole-genome coding SNP database. There are other result-related options that display various kinds of information for the chosen tagSNPs. These include the cost of genotyping using some popular methods, allele and genotype frequencies of tagSNPs in four HapMap populations, and occurrence of tagSNPs in available whole-genome chips and assays.
For genomic position-based queries, the user also has the option to graphically visualize the tagSNPs in relation to genes, transcripts, conserved regions and other genomic features in that region using the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway). The results are generated in the form of a zipped archive containing multiple files (for multiple genes), as well as text files corresponding to an individual gene or region. A file containing a list of tagSNPs chosen and another file with details regarding LD tests and bins is generated for each gene/region. A summary table is also generated that displays the number of SNPs in the HapMap database for the gene/region queried and the eventual number of tagSNPs selected by QuickSNP for individual genes as well as the whole region. If include/exclude existing tags and/or coding non-synonymous SNPs were implemented in a QuickSNP search, an additional result table would be generated that lists the included or excluded SNPs, which of them were used in the tagSNP search (only those with genotype data in HapMap database can be used for tagging), and their type (user-specified include/exclude tags versus coding non-synonymous SNPs). Other results are also generated based on the additional options used for QuickSNP search (Figure 2).
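The additional result table for included or excluded SNPs can be pictured with a short sketch. It assumes the set of SNPs with HapMap genotype data for the chosen population is already available as a Python set; the function name `forced_snp_report`, the row layout and the example identifiers are hypothetical and do not reflect QuickSNP's actual output format.

```python
from typing import Dict, Iterable, List, Set, Union

Row = Dict[str, Union[str, bool]]


def forced_snp_report(
    user_tags: Iterable[str],
    coding_nonsyn: Iterable[str],
    hapmap_snps: Set[str],
) -> List[Row]:
    """List each requested SNP, its type, and whether it can be used for tagging.

    Only SNPs with genotype data in HapMap (membership in hapmap_snps) are
    marked as usable; the rest are reported but do not abort the search.
    """
    rows: List[Row] = []
    groups = (
        ("user-specified", user_tags),
        ("coding non-synonymous", coding_nonsyn),
    )
    for snp_type, snps in groups:
        for snp in snps:
            rows.append({"snp": snp, "type": snp_type, "used": snp in hapmap_snps})
    return rows


# Hypothetical inputs: rs20 has no HapMap genotype data in this toy example.
example_rows = forced_snp_report(
    user_tags=["rs10", "rs20"],
    coding_nonsyn=["rs30"],
    hapmap_snps={"rs10", "rs30", "rs40"},
)
for row in example_rows:
    print(row)
```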

Figure 2.

Snapshot of the results page generated by QuickSNP for a typical search with various components highlighted and explained.

There are three levels of help available to QuickSNP users: (a) QuickHelp, which can be accessed by clicking on the [?] symbol next to each option, and which briefly explains the purpose of that option; (b) frequently asked questions, which provides more detail and (c) direct contact with the authors, available by emailing us at QuickSNP@jhmi.edu.

VALIDATION AND USAGE EXAMPLE

We validated the core functionality of QuickSNP for various genes and genomic regions by comparing results from QuickSNP to those derived from a manual tagSNP selection using HapMap and the tagger algorithm in Haploview. Since many QuickSNP options and features are unique to this tool, they could not be compared to existing automated resources for tagSNP selection. For those cases, we manually performed steps of analyses for some of the options (for example, gene-based searches in a genomic region including coding non-synonymous SNPs), and compared the results with those generated in an automated manner by QuickSNP. The results generated by QuickSNP were always in agreement with those generated by the manual procedures. We extensively used QuickSNP to select tagSNPs for a 6 Mb region on chromosome 17 that produced evidence for linkage to major depressive disorder in our Genetics of Recurrent Early Onset Depression (GenRED) collaborative project (16). Our aim was to select SNPs for an initial LD mapping association study of this region using the Illumina BeadStation custom genotyping platform. The region contained a total of ∼8000 HapMapII SNPs. Using criteria of r² ≥ 0.8 and MAF ≥ 0.1, there were 1526 tagSNPs selected from across the full region and an additional 438 coding non-synonymous SNPs. Our project budget allowed us to study approximately 800 SNPs from the region in this initial experiment, so that excellent tagSNP coverage could be achieved if we focused on genes and their associated regulatory regions. We searched the region with QuickSNP using the genomic position, genes-only input method, force including coding non-synonymous and some previously genotyped SNPs, and rejecting SNPs that were closer than 60 bp. We performed various searches for different r², MAF values and lengths of flanking region around genes. Table 1 shows the number of tagSNPs selected by QuickSNP using different combinations of parameters. We elected to genotype the 809 SNPs that resulted from tagging with r² = 0.8, MAF ≥ 0.1 and a 5 kb flanking region on either side of each gene.

Table 1.

Number of tagSNPs selected for chromosome 17p 6.59–13 Mb region

	r²	0.8		0.9
	Minor allele freq.	0.05	0.1	0.05	0.1
Flanking region around genes	1 kb	875	739	912	773
	3 kb	922	777	960	812
	5 kb	960	809	999	845

Different lengths of flanking regions around genes and values of r² and minor allele frequency cutoff were selected to generate these data.
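One way to read Table 1 against a budget is sketched below. The counts are copied from the table; the selection rule (pick the parameter combination whose tagSNP count is closest to the budget) and the names `TABLE1` and `closest_to_budget` are illustrative assumptions, not a description of how the choice was actually made.

```python
from typing import Dict, Tuple

# Keys: (r2 threshold, minor allele frequency cutoff, flanking region in kb).
# Values: number of tagSNPs, copied from Table 1.
TABLE1: Dict[Tuple[float, float, int], int] = {
    (0.8, 0.05, 1): 875, (0.8, 0.1, 1): 739, (0.9, 0.05, 1): 912, (0.9, 0.1, 1): 773,
    (0.8, 0.05, 3): 922, (0.8, 0.1, 3): 777, (0.9, 0.05, 3): 960, (0.9, 0.1, 3): 812,
    (0.8, 0.05, 5): 960, (0.8, 0.1, 5): 809, (0.9, 0.05, 5): 999, (0.9, 0.1, 5): 845,
}


def closest_to_budget(
    table: Dict[Tuple[float, float, int], int], budget: int
) -> Tuple[Tuple[float, float, int], int]:
    """Return the (parameters, count) entry whose count is nearest the budget."""
    return min(table.items(), key=lambda item: abs(item[1] - budget))


params, count = closest_to_budget(TABLE1, budget=800)
print(params, count)
```

With a budget of 800, this rule returns the r² = 0.8, MAF 0.1, 5 kb row (809 tagSNPs), which matches the combination reported in the text.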

SUMMARY

QuickSNP offers many useful features (see Table 2 for a comparison with other available programs):

Table 2.

Comparison of QuickSNP features with those of comparable software programs

		SNPper	SNPSelector	SNPhunter	TAMAL	PUPASview	tagger	QuickSNP
INPUT- RELATED FEATURES	Gene name	Yes	Yes	Yes	Yes	Yes	No	Yes
	Chromosomal position	Yes	Yes	No	No	Yes	Yes	Yes
	Chromosomal band	Yes	No	No	No	Yes	No	No
	Batch query for gene names	Yes	Yes	No	Yes	Yes	No	Yes
	Conversion of coordinates between different genome assemblies	No	No	No	No	No	No	Yes
	Gene-centric tag selection in a chromosomal region	No	Yes	No	No	No	No	Yes
FILTERING OPTIONS	MAF	No	Yes	No	No	No	Yes	Yes
	r²	No	Yes	No	No	No	Yes	Yes
	Option to force include/ exclude SNPs	No	Yes	No	No	No	Yes	Yes
	Selection of only relevant include/exclude SNPs for tagging*	No	No	No	No	No	No	Yes
	Include flanking region around genes	Yes	Yes	Yes	Yes	Yes	No	Yes
	Automatic inclusion of coding non-synonymous SNPs for tagging	No	No	No	No	No	No	Yes
	Spacing between SNPs	Yes	Yes	Yes	No	No	Yes	Yes
OUTPUT-RELATED FEATURES	Allele and genotype frequencies for selected tagSNPs	No	No	No	No	Yes^†	No	Yes
	Financial cost of genotyping	No	No	No	No	No	No	Yes
	Occurrence of tagSNPs in popular whole genome chips and assays	No	No	No	No	No	No	Yes
	Representation of tag SNPs in UCSC genome browser	No	Yes	No	Yes	No	No	Yes

*This option predetermines which of the include/exclude SNPs are present in the HapMap database for the given population, and uses only those for tagging. If this criterion is not used, the whole search aborts if any one of the include/exclude tags is absent from the HapMap database.

†Only allele frequencies, but not genotype frequencies.

Specifically, QuickSNP:

allows for a gene-centric approach to tagSNP selection;
accepts multiple gene names as input;
allows for automatic conversion of coordinates between different genome assemblies;
provides the option to include flanking sequence around genes;
provides the option to reject SNPs that are too closely spaced, since they are less likely to work in some genotyping platforms;
calculates the cost for the genotyping study;
for the ‘include tag’ and ‘exclude tag’ options, predetermines which SNPs are present in the HapMap database for the given population, and implements inclusion or exclusion of only those (in other existing tools, the whole search aborts if any include/exclude tag is absent from the HapMap database);
automatically includes coding non-synonymous SNPs in the region, if specified by the user;
displays selected tagSNPs in the UCSC Genome Browser;
reports allele and genotype frequencies for tagSNPs in different populations;
reports the number of SNPs that have available assays or are present on whole genome chips provided by commercial genotyping platforms; and
provides a user-friendly summary table and downloadable results files.

In the last few years, millions of new SNPs have been identified, and SNP genotyping technologies have developed rapidly. Investigators need to determine how to select SNPs for study in a chromosomal region in a manner that is efficient while still preserving power. There is a need for new tools that can perform these functions in an automated manner. QuickSNP provides all of the basic SNP selection functions present in existing tools, while adding further features.

FUTURE DIRECTIONS

At present, QuickSNP can handle regions as large as 5 Mb (for a genomic position-based search) or 40 genes (for a gene-name based search). In the future, we will attempt to optimize the algorithm and/or upgrade the hardware in order to increase this search limit. Since QuickSNP uses many public domain datasets, we will download and integrate the new updates as soon as they are released. An additional feature we are developing is the ability to assess the SNP coverage within genome-wide platforms for any given gene or genomic region. We will always welcome suggestions and bug reports by users, and will try to respond to these promptly.
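The search limits stated above could be checked before a query is run, along the lines of this sketch. The constant and function names are hypothetical, and treating a region of exactly 5 Mb or a list of exactly 40 genes as acceptable is an assumption about how the limits are applied.

```python
from typing import List, Optional, Sequence

MAX_REGION_BP = 5_000_000  # 5 Mb limit for genomic position-based searches
MAX_GENES = 40             # limit for gene name-based searches


def check_query_size(
    start: Optional[int] = None,
    end: Optional[int] = None,
    genes: Optional[Sequence[str]] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the query fits."""
    problems: List[str] = []
    if start is not None and end is not None:
        if end < start:
            problems.append("end position precedes start position")
        elif end - start > MAX_REGION_BP:
            problems.append(f"region spans {end - start} bp, above the {MAX_REGION_BP} bp limit")
    if genes is not None and len(genes) > MAX_GENES:
        problems.append(f"{len(genes)} genes requested, above the {MAX_GENES}-gene limit")
    return problems


print(check_query_size(start=1_000_000, end=5_000_000))
print(check_query_size(start=1, end=7_000_000))
print(check_query_size(genes=[f"GENE{i}" for i in range(41)]))
```

A region that exceeds the limit would need to be split into smaller windows and queried separately.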

16 in total

1. Allelic discrimination using fluorogenic probes and the 5' nuclease assay.

Authors: K J Livak
Journal: Genet Anal Date: 1999-02

2. SNPper: retrieval and analysis of human SNPs.

Authors: A Riva; I S Kohane
Journal: Bioinformatics Date: 2002-12 Impact factor: 6.937

3. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays.

Authors: Hajime Matsuzaki; Shoulian Dong; Halina Loi; Xiaojun Di; Guoying Liu; Earl Hubbell; Jane Law; Tam Berntsen; Monica Chadha; Henry Hui; Geoffrey Yang; Giulia C Kennedy; Teresa A Webster; Simon Cawley; P Sean Walsh; Keith W Jones; Stephen P A Fodor; Rui Mei
Journal: Nat Methods Date: 2004-11 Impact factor: 28.547

Review 4. Genome-wide association studies: theoretical and practical concerns.

Authors: William Y S Wang; Bryan J Barratt; David G Clayton; John A Todd
Journal: Nat Rev Genet Date: 2005-02 Impact factor: 53.242

5. A genome-wide scalable SNP genotyping assay using microarray technology.

Authors: Kevin L Gunderson; Frank J Steemers; Grace Lee; Leo G Mendoza; Mark S Chee
Journal: Nat Genet Date: 2005-04-17 Impact factor: 38.330

Review 6. Searching for genetic determinants in the new millennium.

Authors: N J Risch
Journal: Nature Date: 2000-06-15 Impact factor: 49.962

7. BeadArray technology: enabling an accurate, cost-effective approach to high-throughput genotyping.

Authors: Arnold Oliphant; David L Barker; John R Stuelpnagel; Mark S Chee
Journal: Biotechniques Date: 2002-06 Impact factor: 1.993

8. Genetics of recurrent early-onset major depression (GenRED): final genome scan report.

Authors: Peter Holmans; Myrna M Weissman; George S Zubenko; William A Scheftner; Raymond R Crowe; J Raymond Depaulo; James A Knowles; Wendy N Zubenko; Kathleen Murphy-Eberenz; Diana H Marta; Sandra Boutelle; Melvin G McInnis; Philip Adams; Madeline Gladis; Jo Steele; Erin B Miller; James B Potash; Dean F Mackinnon; Douglas F Levinson
Journal: Am J Psychiatry Date: 2007-02 Impact factor: 18.112

9. SNPHunter: a bioinformatic software for single nucleotide polymorphism data acquisition and management.

Authors: Lin Wang; Simin Liu; Tianhua Niu; Xin Xu
Journal: BMC Bioinformatics Date: 2005-03-18 Impact factor: 3.169

10. PupasView: a visual tool for selecting suitable SNPs, with putative pathological effect in genes, for genotyping purposes.

Authors: Lucía Conde; Juan M Vaquerizas; Carles Ferrer-Costa; Xavier de la Cruz; Modesto Orozco; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971
