
PupaSuite: finding functional single nucleotide polymorphisms for large-scale genotyping purposes.

Lucía Conde, Juan M Vaquerizas, Hernán Dopazo, Leonardo Arbiza, Joke Reumers, Frederic Rousseau, Joost Schymkowitz, Joaquín Dopazo.

Abstract

We have developed a web tool, PupaSuite, for the selection of single nucleotide polymorphisms (SNPs) with potential phenotypic effect, specifically oriented to help in the design of large-scale genotyping projects. PupaSuite uses a collection of data on SNPs from heterogeneous sources and a large number of pre-calculated predictions to offer a flexible and intuitive interface for selecting an optimal set of SNPs. It improves the functionality of PupaSNP and PupasView programs and implements new facilities such as the analysis of user's data to derive haplotypes with functional information. A new estimator of putative effect of polymorphisms has been included that uses evolutionary information. Also SNPeffect database predictions have been included. The PupaSuite web interface is accessible through http://pupasuite.bioinfo.cipf.es and through http://www.pupasnp.org.



Year: 2006 PMID: 16845085 PMCID: PMC1538854 DOI: 10.1093/nar/gkl071

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Single nucleotide polymorphisms (SNPs) are the simplest and most frequent type of DNA sequence variation among individuals and constitute one of the most powerful tools in the search for disease susceptibility genes, drug response-determining genes and the like (1,2). With the introduction of large-scale genotyping techniques, the bottleneck in this type of experiment has moved towards the management and analysis of the data generated. In this context, one step that has become a problem is the selection of the optimal set of SNPs (among several thousands of candidates in some cases) for the genotyping experiment. Optimal SNPs must be the best possible markers for traits, which are often multigenic, usually reflecting disruptions in proteins that participate in a protein complex or in a pathway (3). Unfortunately, complex multigenic traits, for which markers display weak associations, still constitute a challenge. Factors such as linkage disequilibrium (LD) and minor allele frequency (MAF) are of major importance for selecting optimal candidate SNPs. More recently, the predicted functional effect of a SNP has been gaining importance as a selection criterion because it can significantly increase the sensitivity of association tests (3–6). The availability of information on LD from projects such as HapMap (7) and on MAFs (8), together with improved methods for predicting function (5,6,9), allows for a more sophisticated selection of candidate SNPs beyond the classical one-SNP-at-a-time approach. Thus, SNPs can be selected taking into account the evolutionary constraints of the region analysed along with its likelihood of being the causative agent of any type of damage. 
Algorithms that use information to facilitate the subsequent analysis of the results, such as the estimation of haplotype blocks (10), combined with functional prediction of the effect of the SNPs, are expected to have a major impact on the efficiency of a large-scale genotyping study. PupaSuite belongs to this new generation of tools. PupaSuite combines the facilities offered by PupaSNP (6) and PupasView (5) with new algorithms and visualisation procedures for functional haplotype prediction. The PupaSNP and PupasView programs are part of the genotyping pipeline of the Spanish National Genotyping Center (CeGen). Together, the two tools handle an average of 60 SNP designs per day.

OUTLINE OF THE PROGRAM

PupaSuite combines the functionality of PupaSNP (6) and PupasView (5) in a single, more integrated interface, and adds new modules to facilitate the selection of the optimal set of SNPs for a large-scale genotyping study. Following the philosophy of PupaSNP, the program allows the user to input either lists of genes or chromosomal regions, which correspond to two common types of analysis: genes probably related to a disease because they are functionally related (e.g. they belong to a pathway affected in the disease), or genes present in a chromosomal region linked to a disease. PupaSuite can also directly analyse lists of SNPs. In these three cases a list of SNPs with their putative functional effect is reported. In the case of chromosomal regions it is also possible to find haplotype blocks (10). For the list of SNPs, in addition to their putative functional effect, it is possible to retrieve information on MAF in different populations from dbSNP (8) [as annotated in Ensembl (11)], as well as LD parameters and haplotype blocks. In addition to the analysis of lists of SNPs there is another new option: Functional haplotypes. This option (see below) allows users to test their own SNP data and to find haplotypes (12) with the functional SNPs (5,6) and the tag SNPs (13) highlighted. Case-control studies can also be performed at this stage. The option Display and Filter SNPs for a single gene implements new functionalities in an environment similar to that of PupasView (5). More information is presented in an intuitive graphical format (Figure 1). This option allows the sequential and interactive application of filters based on functionality, conservation, MAF and the like (5), thus permitting an easy selection of a set of optimal SNPs for a particular gene.
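The sequential application of filters described above can be pictured with a minimal sketch. It is not PupaSuite's implementation: the `Snp` record, its field names, the toy identifiers and the 0.05 MAF cut-off are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Snp:
    """Hypothetical SNP record; the field names are illustrative assumptions."""
    snp_id: str
    maf: float          # minor allele frequency
    functional: bool    # has a predicted functional effect
    conserved: bool     # lies in a conserved region


def apply_filters(snps: Iterable[Snp],
                  filters: Iterable[Callable[[Snp], bool]]) -> List[Snp]:
    """Apply each filter in turn, keeping only the SNPs that pass all of them."""
    kept = list(snps)
    for keep in filters:
        kept = [s for s in kept if keep(s)]
    return kept


# Toy candidates with made-up identifiers and values.
candidates = [
    Snp("snp_a", 0.02, True, True),
    Snp("snp_b", 0.21, True, True),
    Snp("snp_c", 0.35, False, True),
    Snp("snp_d", 0.12, True, False),
]

selected = apply_filters(
    candidates,
    [
        lambda s: s.maf >= 0.05,   # assumed MAF cut-off
        lambda s: s.functional,    # keep SNPs with a predicted effect
        lambda s: s.conserved,     # keep SNPs in conserved regions
    ],
)
print([s.snp_id for s in selected])
```

Expressing each criterion as an independent predicate mirrors the interactive use described in the text, where filters can be added or dropped one at a time.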

Figure 1

Output with the graphic representation of SNPs with putative functional effect in the gene BRCA2, along with LD maps.

CRITERIA TO SELECT SNPS AS GOOD CANDIDATES FOR GENOTYPING

Three important features of a SNP have been taken into account in order for it to be considered an optimal candidate for genotyping purposes: MAF, LD with respect to other candidates (5) and putative functional effect. MAF values were taken from Ensembl (11), which maps dbSNP (8) data onto the corresponding chromosomal coordinates. LD is calculated as r² and D′ with the Haploview program (14). The putative functional effect has been estimated in both coding and non-coding regions as described in (5).

The following features have been used to report the putative functional effect of a polymorphism in non-coding nucleotides: transcription factor binding sites from the Transfac database (15); intron/exon border consensus sequences; exonic splicing enhancers (16); and triplex-forming oligonucleotide target sequences (17). The likelihood of the predictions can be reinforced by looking simultaneously for human-mouse conserved regions (22) as reported in Ensembl.

Regarding the putative impact of a cSNP, the following data and estimators are reported: SNPs in exons causing an amino acid change (purely a list of cSNPs); Pmut (18,19) predictions; selective strengths (ω parameter), an estimator that is new in this version of the program (see below); and SNPeffect (9,20,21) predictions, also new in this version of the program (see below).
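For readers unfamiliar with the two LD measures, the following minimal sketch shows how r² and D′ are defined for a pair of biallelic loci. It assumes the four haplotype frequencies are already known; Haploview itself works from genotype data, so this illustrates the definitions rather than that program's procedure. The function and parameter names are hypothetical.

```python
def ld_measures(p_AB: float, p_Ab: float, p_aB: float, p_ab: float):
    """Return (D, D_prime, r2) from the four haplotype frequencies.

    Locus 1 has alleles A/a and locus 2 has alleles B/b; p_Ab is the
    frequency of the haplotype carrying A at locus 1 and b at locus 2.
    """
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    p_A = p_AB + p_Ab            # allele frequency of A
    p_B = p_AB + p_aB            # allele frequency of B
    d = p_AB - p_A * p_B         # raw disequilibrium coefficient

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    r2 = d * d / denom

    # D' rescales |D| by its maximum attainable value given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    return d, d_prime, r2


# Two loci whose alleles always travel together are in complete LD.
print(ld_measures(p_AB=0.5, p_Ab=0.0, p_aB=0.0, p_ab=0.5))
```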

EVOLUTION AT WORK: THE SELECTIVE STRENGTHS ON CSNPS

The combined effect of all the selective pressures causes the preservation of the functionally relevant parts of the genes. From this perspective, comparative and evolutionary studies have been used to predict the putative functional effect of SNPs (19,23), although these have mainly ignored the underlying phylogeny. Here we present another, more accurate estimator of functional effect, based on sequence comparison but taking into account phylogenetic information (24). The selective pressures acting at the codon level where non-synonymous cSNPs are found were evaluated by means of two alternative approaches: codon-based maximum likelihood (ML) models (25) implemented in PAML (26), and the sitewise likelihood-ratio (SLR) method (27) for testing deviations from neutrality. Under the first approach, an a priori statistical distribution is assumed for the variation of ω = dN/dS among sites: sites fall into a number k of classes, each with its own ω value and each making up a proportion p of the sequence, representing the effects of purifying selection (0 < ω0 < 1), neutral evolution (ω1 = 1) and positive selection (ω2 > 1) (25). The method involves two main steps: first, the evolutionary parameters are fitted by maximum likelihood to the sequences of the species compared, considering two different models; and second, Bayes' theorem is used to compute the posterior probability that each site belongs to a specific site class ω defined under the a priori distribution (28). Two different models (M2a and M8) were evaluated by maximum likelihood on the sequences (29). The SLR method instead uses a site-by-site approach to test for neutrality. In contrast to similar approaches developed previously (30), SLR uses the entire alignment of the sequence to determine parameters common to all sites, such as evolutionary distances. With this approach there is no need to specify a model of how ω varies along the sequence. 
A correction for multiple testing is also performed in order to obtain statistical confidence for inferences on deviations from neutrality at each site.
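The second step of the codon-model approach, assigning each site to a class by Bayes' theorem, can be sketched as follows. The sketch assumes that the class proportions and the per-class likelihoods of the data at one site have already been obtained from the maximum-likelihood fit; the function name and the numbers are purely illustrative and do not reproduce PAML's code.

```python
from typing import List, Sequence


def site_class_posteriors(proportions: Sequence[float],
                          site_likelihoods: Sequence[float]) -> List[float]:
    """Posterior probability of each site class for a single codon site.

    proportions[i]      prior proportion p_i of sites in class i
    site_likelihoods[i] likelihood of the site's data given the class-i omega
    """
    if len(proportions) != len(site_likelihoods):
        raise ValueError("one likelihood is needed per site class")
    joint = [p * lik for p, lik in zip(proportions, site_likelihoods)]
    total = sum(joint)
    if total <= 0:
        raise ValueError("the site has zero likelihood under every class")
    # Bayes' theorem: P(class | data) = p_i * L_i / sum_j (p_j * L_j)
    return [j / total for j in joint]


# Illustrative values for three classes: purifying, neutral, positive selection.
posteriors = site_class_posteriors([0.7, 0.2, 0.1], [0.001, 0.002, 0.02])
print(posteriors)
```

In this toy case the site data are far more likely under the third class, so that class receives the highest posterior probability even though its prior proportion is the smallest.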

SNPEFFECT DATABASE

The SNPeffect database (9) describes the effect of coding non-synonymous SNPs on several phenotypic properties of human proteins using either sequence-based or structural bioinformatics tools. Molecular phenotypes are grouped in three categories: structure and dynamics, functional sites and cellular processing. Alongside various external tools, SNPeffect uses algorithms developed by the collaborating research groups, including Tango (20), which predicts β-aggregation regions in protein sequences, and FoldX (21), which predicts the stability change caused by a single amino acid variation.

FUNCTIONAL HAPLOTYPES

In addition to using already available data, users can input their own data to use the predictions on possible functional effects in combination with haplotype analysis. This possibility is available through the Functional haplotypes option. Data must be provided to the program in linkage pedigree format (pre-MAKEPED). PupaSuite estimates blocks by three methods, Confidence intervals (10), Four gamete rule (31) and Solid Spine of LD (14), and reconstructs haplotypes using the EM algorithm (12) as implemented in Haploview (14). The haplotypes found in this way are represented with the corresponding functional information on all the SNPs included in them and all the LD values. This representation provides a very intuitive picture of the possible functional impact of any of the haplotypes beyond the individual effect of each SNP. For case/control data a chi-square test is performed and the corresponding P-value for the allele frequencies in cases versus controls is reported. The combination of functional haplotype information with case/control tests makes it easy to ascribe cases to haplotypes with functional alterations.
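As an illustration of the case/control comparison, the sketch below runs a Pearson chi-square test on a 2×2 table of allele counts (one degree of freedom). The text does not say whether a continuity correction is applied, so none is used here; the function name and the counts are hypothetical.

```python
from math import erfc, sqrt


def allele_chi_square(case_a: int, case_b: int,
                      control_a: int, control_b: int):
    """Return (chi2, p_value) for allele counts in cases versus controls.

    case_a / case_b are the counts of the two alleles among case chromosomes,
    control_a / control_b the counts among control chromosomes.
    """
    row_case = case_a + case_b
    row_control = control_a + control_b
    col_a = case_a + control_a
    col_b = case_b + control_b
    if 0 in (row_case, row_control, col_a, col_b):
        raise ValueError("every row and column of the table must be non-empty")

    n = row_case + row_control
    # Shortcut formula for the Pearson statistic of a 2x2 table.
    chi2 = n * (case_a * control_b - case_b * control_a) ** 2 / (
        row_case * row_control * col_a * col_b
    )
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: allele a is more common among cases than among controls.
chi2, p_value = allele_chi_square(60, 40, 40, 60)
print(chi2, p_value)
```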

DISCUSSION

We have presented an integrated resource to help in the selection of optimal sets of SNPs for large-scale genotyping assays. The program merges the functionalities of two previous resources, PupaSNP (6) and PupasView (5), and expands their capabilities with new information and new facilities. The SNPeffect database (9) as well as a new, unpublished prediction method have been included to improve the estimation of the putative pathological effect of SNPs. Moreover, in addition to using publicly available data on SNPs, users can analyse their own experiments. What is novel and unique among tools of this type is the possibility of analysing haplotypes functionally, going beyond the classical one-SNP-at-a-time analysis, which ignores interactions between the mutations. The usefulness of this type of resource is demonstrated by its use in the CeGen genotyping pipeline. The previous tools, which have been running for more than two years, now handle an average of approximately 60 SNP designs per day.

31 in total

1. A method for detecting positive selection at single amino acid sites.

Authors: Y Suzuki; T Gojobori
Journal: Mol Biol Evol Date: 1999-10 Impact factor: 16.240

2. TRANSFAC: an integrated system for gene expression regulation.

Authors: E Wingender; X Chen; R Hehl; H Karas; I Liebich; V Matys; T Meinhardt; M Prüss; I Reuter; F Schacherer
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

3. Codon-substitution models for heterogeneous selection pressure at amino acid sites.

Authors: Z Yang; R Nielsen; N Goldman; A M Pedersen
Journal: Genetics Date: 2000-05 Impact factor: 4.562

4. A vision for the future of genomics research.

Authors: Francis S Collins; Eric D Green; Alan E Guttmacher; Mark S Guyer
Journal: Nature Date: 2003-04-14 Impact factor: 49.962

5. PupaSNP Finder: a web tool for finding SNPs with putative effect at transcriptional level.

Authors: Lucía Conde; Juan M Vaquerizas; Javier Santoyo; Fátima Al-Shahrour; Sergio Ruiz-Llorente; Mercedes Robledo; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

6. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins.

Authors: Ana-Maria Fernandez-Escamilla; Frederic Rousseau; Joost Schymkowitz; Luis Serrano
Journal: Nat Biotechnol Date: 2004-09-12 Impact factor: 54.908

7. PAML: a program package for phylogenetic analysis by maximum likelihood.

Authors: Z Yang
Journal: Comput Appl Biosci Date: 1997-10

Review 8. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease.

Authors: David Botstein; Neil Risch
Journal: Nat Genet Date: 2003-03 Impact factor: 38.330

9. Triplex-forming oligonucleotide target sequences in the human genome.

Authors: J Ramon Goñi; Xavier de la Cruz; Modesto Orozco
Journal: Nucleic Acids Res Date: 2004-01-15 Impact factor: 16.971

10. Ensembl 2005.

Authors: T Hubbard; D Andrews; M Caccamo; G Cameron; Y Chen; M Clamp; L Clarke; G Coates; T Cox; F Cunningham; V Curwen; T Cutts; T Down; R Durbin; X M Fernandez-Suarez; J Gilbert; M Hammond; J Herrero; H Hotz; K Howe; V Iyer; K Jekosch; A Kahari; A Kasprzyk; D Keefe; S Keenan; F Kokocinsci; D London; I Longden; G McVicker; C Melsopp; P Meidl; S Potter; G Proctor; M Rae; D Rios; M Schuster; S Searle; J Severin; G Slater; D Smedley; J Smith; W Spooner; A Stabenau; J Stalker; R Storey; S Trevanion; A Ureta-Vidal; J Vogel; S White; C Woodwark; E Birney
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
