
PupasView: a visual tool for selecting suitable SNPs, with putative pathological effect in genes, for genotyping purposes.

Lucía Conde, Juan M Vaquerizas, Carles Ferrer-Costa, Xavier de la Cruz, Modesto Orozco, Joaquín Dopazo.

Abstract

We have developed a web tool, PupasView, for the selection of single nucleotide polymorphisms (SNPs) with potential phenotypic effect. PupasView constitutes an interactive environment in which functional information and population frequency data can be used as sequential filters over linkage disequilibrium parameters to obtain a final list of SNPs optimal for genotyping purposes. PupasView is the first resource that integrates phenotypic effects caused by SNPs at both the translational and the transcriptional level. PupasView retrieves SNPs that could affect conserved regions that the cellular machinery uses for the correct processing of genes (intron/exon boundaries or exonic splicing enhancers), predicted transcription factor binding sites and changes in amino acids in the proteins for which a putative pathological effect is calculated. The program uses the mapping of SNPs in the genome provided by Ensembl. PupasView will be of much help in studies of multifactorial disorders, where the use of functional SNPs will increase the sensitivity of the identification of the genes responsible for the disease. The PupasView web interface is accessible through http://pupasview.ochoa.fib.es and through http://www.pupasnp.org.

Year: 2005 PMID: 15980522 PMCID: PMC1165690 DOI: 10.1093/nar/gki476

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Single nucleotide polymorphisms (SNPs) are the simplest and most frequent type of DNA sequence variation among individuals and, with the recent availability of high-throughput methodologies, are considered one of the most powerful tools in the search for, e.g., disease susceptibility genes and drug response-determining genes (1,2). However, complex diseases, for which markers display weak associations, still constitute a challenge. Most probably, advancement in the knowledge of such diseases will come from improved genotyping methods in combination with the proper bioinformatics design strategies (3). It is generally believed that multigenicity reflects disruptions in proteins that participate in a protein complex or in a pathway (4). Typically, SNPs have been used as markers; that is, the real determinant of the disease was not the SNP itself but some other mutation in linkage disequilibrium (LD) with it. Because of this, the use of functional SNPs could be an important factor in significantly increasing the sensitivity of association tests. In fact, several complex genetic disorders such as Alzheimer's disease (5) and Crohn's disease (6) have been associated with functional SNPs, lending weight to strategies giving priority to candidate markers based upon predictable function. Several estimations suggest that, on average, some 20% of SNPs could directly damage proteins (7). Much attention has been focused on modelling by different methods the possible phenotypic effect of SNPs that cause amino acid changes (7–13), and only recently has interest focused on functional SNPs affecting regulatory regions or the splicing process (14). However, there is increasing evidence that many human disease genes are the result of exonic or non-coding mutations affecting regulatory regions (15–17). 
A recent large-scale screening over a set of 16 chromosomes found SNPs in the promoter regions of 35% of the genes, and experimental evidence suggested that around a third of promoter variants may alter gene expression to a functionally relevant extent (18). Alternative splicing produced by mutations in intron/exon junctions, or in distinct binding motifs, such as exonic splicing enhancers (ESEs) (19), has also been related to different diseases (20). In fact, it has been estimated that 15% of point mutations that result in human genetic diseases cause RNA splicing defects (21). In addition to functional information, population frequency is another important factor to be taken into account when selecting SNPs. Thus, infrequent polymorphisms will be of scarce interest as markers. Also, LD is another interesting factor in selecting SNPs as markers since, if two SNPs are in strong LD, only one of them will provide enough information for any association or linkage test. With the idea of selecting optimal sets of SNPs using as much information as possible on putative phenotypic effect, population frequencies and LD, we have developed PupasView (Putative Phenotypic Alterations caused by SNPs Viewer), a server that can be used alone or in combination with PupaSNP (14). PupasView works not only as a viewer of where SNPs are located, but also as a selector in which different filters based on combinations of functionality and population frequencies can be interactively applied over the LD parameters in order to obtain an optimal selection of SNPs for genotyping studies, in such a way that with a minimum number of SNPs maximum information on the genic region is obtained.

Criteria to consider an SNP a good candidate for genotyping studies

There are three important properties for an SNP to be considered an optimal candidate for genotyping purposes: functional effect, minor allele frequency and LD with respect to other SNPs. Finding such optimal SNPs is not always possible, but the idea behind PupasView is to facilitate the selection process in order to achieve a final collection of SNPs bearing the maximum amount of information. PupasView works as an SNP selector. Different filters can be interactively applied to the LD information available based on distinct functional properties, cross-species conservation and population frequency. This permits a final selection of a minimum number of SNPs with optimal properties in terms of population frequencies and potential phenotypic effect.
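The sequential filtering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PupasView's implementation: the `Snp` record, the filter names, the effect labels, the example identifiers and the 0.05 frequency threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Snp:
    """Hypothetical SNP record; field names are illustrative only."""
    rs_id: str
    validated: bool
    maf: Optional[float]  # minor allele frequency, None if no frequency data
    effects: FrozenSet[str] = frozenset()  # e.g. {"tfbs", "ese", "nonsynonymous"}


SnpFilter = Callable[[Snp], bool]


def is_validated(snp: Snp) -> bool:
    return snp.validated


def min_maf(threshold: float) -> SnpFilter:
    """Build a filter keeping SNPs with known MAF at or above the threshold."""
    def keep(snp: Snp) -> bool:
        return snp.maf is not None and snp.maf >= threshold
    return keep


def has_any_effect(snp: Snp) -> bool:
    return bool(snp.effects)


def apply_filters(snps: Iterable[Snp], filters: Iterable[SnpFilter]) -> List[Snp]:
    """Apply each filter in turn, narrowing the selection step by step."""
    kept = list(snps)
    for snp_filter in filters:
        kept = [snp for snp in kept if snp_filter(snp)]
    return kept


# Made-up example records (identifiers are placeholders, not real rs numbers).
EXAMPLE_SNPS = [
    Snp("rs_a", True, 0.25, frozenset({"nonsynonymous"})),
    Snp("rs_b", True, 0.01, frozenset({"ese"})),
    Snp("rs_c", False, 0.30, frozenset({"tfbs"})),
    Snp("rs_d", True, None),
    Snp("rs_e", True, 0.12),
]

selected = apply_filters(EXAMPLE_SNPS, [is_validated, min_maf(0.05), has_any_effect])
print([snp.rs_id for snp in selected])
```

In this toy data set only `rs_a` passes all three filters; the remaining step, choosing among the survivors using LD, is left to the user, as in the tool.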

Finding SNPs with potential phenotypic effect

PupasView uses a precompiled database which contains a collection of dbSNP entries mapped to the Golden Path genome assembly, as implemented in the human section of Ensembl. Part of this database is common to the PupaSNP program (14). The SNPs have been labelled according to their potential effects on the phenotype. We have taken into account both transcriptional and gene product levels. Regions 10 000 bp upstream of the genes, taken as the promoter region of each gene in the list, have been scanned for the presence of different possible regulatory motifs. These include alterations in:

Transcription factor binding sites. Promoter regions were scanned for the presence of possible transcription factor binding sites. The program Match (22) was used for this purpose, using only high-quality matrices and with a cut-off to minimize false positives from the Transfac database (23). SNPs located within these motifs are considered to have a putative phenotypic effect in the expression of the gene. Almost four million such motifs were found, with 130 373 SNPs mapping onto them.

Intron/exon border consensus sequences. Ensembl APIs (24) were used to extract the intron/exon organization of the genes and the corresponding sequences. The two conserved nucleotides on each side of the splicing point, which constitute the splicing signal (21), were then located and all the SNPs altering these signals were recorded. More than 700 000 intron/exon boundaries could be defined in human genes with 1786 SNPs mapping onto them.

ESEs. Mutations that inactivate or activate an ESE sequence may result in exon skipping, errors in alternative splicing patterns, malformation and so on. Different classes of ESE consensus motifs have been described, but they are not always easily identified. Exon sequences were scanned to identify putative ESEs responsive to the human SR proteins SF2/ASF, SC35, SRp40 and SRp55, using the available weight matrices (20). A score was obtained that is related to the likelihood that the site found is a real ESE. Only ESE sites with scores over the threshold [see (20) for details] were taken into account in the analysis. More than 11 million ESEs were found, with 299 106 SNPs located in them.

Triplex-forming oligonucleotide target sequences (TTSs). It has been found that the population of TTSs is much more numerous than expected from simple random models (25). The population of TTSs is large in the whole genome, without major differences between chromosomes, but with a large concentration in regulatory regions, especially in promoter zones, which suggests a tremendous potential for triplex strategy in the control of gene expression (25). Although the role of TTSs in regulation is still a matter of speculation, the program also reports SNPs disrupting these structures. Some 5.4 million putative triplex-forming sequences were found, and 364 314 SNPs mapped onto them.

SNPs in exons that cause an amino acid change. Any SNP causing a change of amino acid, independent of any speculation on its possible phenotypic effect, is reported. There are 45 906 such SNPs.

SNPs in exons that cause an amino acid change with putative pathological effect. The putative pathological effect of an amino acid change can be predicted using neural networks (NNs) carefully trained to predict disease-associated amino acid polymorphisms (12,13). The server implements a small NN (1 hidden layer and 20 nodes) and three sequence-derived descriptors (PAM40, PSSM and variability), which are either retrieved from databases or determined internally from multiple alignments using two-iteration PSI-Blast (26) run over a non-redundant SwissProt/TrEMBL database. The trained method displays a success rate >80% in cross-validation experiments. According to the algorithm, 19 309 SNPs displayed a high probability of having pathological effect.

Human–mouse conserved regions. Untranslated whole genome comparisons by BLASTZ were performed for species pairs which are thought to be similar enough to be able to detect homology directly at the DNA level (27). Of particular interest is mouse (or rat) because of its phylogenetic position with respect to humans: distant enough to interpret conservation as important but not so distant as to lose most of the similarity. The phenotypic effect of a change in such regions is quite speculative, but cross-species conservation can be useful in cases in which no other information is available. It is also useful for reinforcing the likelihood of other predictions (e.g. an ESE in a conserved region is more likely to be real than one in a non-conserved region).
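The motif-based labels above (transcription factor binding sites, ESEs) rest on the same basic step: slide a weight matrix along a sequence, keep windows scoring at or above a threshold, and record the SNPs that fall inside them. The Python sketch below illustrates that step only. The three-column matrix, its values, the threshold and the example sequence are invented for illustration; they are not Transfac or SR-protein matrices, and the code does not reproduce Match or the ESE scoring method of (20). It also assumes upper-case A/C/G/T input and 0-based positions.

```python
from typing import Dict, List, Tuple

# Toy weight matrix (hypothetical values): one dict of base scores per motif position.
TOY_MATRIX: List[Dict[str, float]] = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]

Hit = Tuple[int, int, float]  # (start, end_exclusive, score)


def scan(sequence: str, matrix: List[Dict[str, float]], threshold: float) -> List[Hit]:
    """Return every window whose summed matrix score reaches the threshold."""
    width = len(matrix)
    hits: List[Hit] = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(matrix, window))
        if score >= threshold:
            hits.append((start, start + width, score))
    return hits


def snps_in_motifs(snp_positions: List[int], hits: List[Hit]) -> List[int]:
    """Keep the SNP positions that lie inside at least one motif hit."""
    return [
        pos for pos in snp_positions
        if any(start <= pos < end for start, end, _ in hits)
    ]


sequence = "TTACGTTACG"
hits = scan(sequence, TOY_MATRIX, threshold=3.0)
flagged = snps_in_motifs([0, 3, 6, 9], hits)
print(hits, flagged)
```

A stricter variant would rescore the window with the alternative allele and flag the SNP only if the score drops below the threshold; the text describes labelling by location within the motif, so the sketch follows that simpler rule.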

Frequency information and validation status

There are >10 million SNPs stored in the latest build of dbSNP (build 124), and more than half of these have been validated by different means. Validation status is annotated and is an important field in terms of trusting an SNP. But, in addition to being real, an SNP must exist in the population at frequencies which make it a suitable marker. Very infrequent SNPs are not suitable for association or linkage studies. Frequency data in different populations are available for almost half a million SNPs.
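For readers unfamiliar with the quantity being filtered on, the minor allele frequency can be derived from genotype counts as in the short Python sketch below. The function name and the example counts are hypothetical, and the sketch says nothing about how dbSNP or PupasView store frequencies.

```python
def minor_allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Minor allele frequency of a biallelic SNP from its three genotype counts."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("no genotypes observed")
    # Each homozygote carries two copies of an allele, each heterozygote one.
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1.0 - freq_a)


# Made-up sample: 60 AA, 30 AB and 10 BB individuals.
example_maf = minor_allele_frequency(60, 30, 10)
print(example_maf)
```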

Blocks and LD parameters

LD measures the correlation between two neighbouring genetic variants in a specific population. The program HaploView (28) is used to infer blocks using different procedures. In one of the most common procedures (29), 95% confidence bounds based on the D′ LD parameter are generated and each comparison is called ‘strong LD’, ‘inconclusive’ or ‘strong recombination’. A block is created if 95% of informative (i.e. non-inconclusive) comparisons are ‘strong LD’. A block can be considered a region with a low recombination rate. Ideally, a block could properly be described by a unique SNP. Two other methods are used: the four gamete rule (30) and the Solid Spine of LD (28). Blocks are displayed at the bottom of the PupasView window. Also, D′, r² and LOD parameters between adjacent SNPs can be visualized by placing the cursor between them. Only HapMap genotyped SNPs (31) are used to calculate blocks and LD parameters.
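As a rough illustration of the pairwise quantities mentioned above, the Python sketch below computes D′ and r² for two biallelic SNPs and applies a simple four-gamete check. It assumes phased haplotype counts are already available, whereas a program such as HaploView has to estimate haplotype frequencies from genotype data. The function names and the 1% minimum haplotype frequency used in the four-gamete check are assumptions of this sketch, and neither the LOD score nor the confidence bounds on D′ are computed.

```python
from typing import Tuple


def ld_stats(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> Tuple[float, float]:
    """Return (D', r^2) from counts of the four two-SNP haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes observed")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total  # frequency of allele A at the first SNP
    p_B = (n_AB + n_aB) / total  # frequency of allele B at the second SNP
    d = p_AB - p_A * p_B
    # D is normalised by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d_prime, r_squared


def four_gametes_observed(n_AB: int, n_Ab: int, n_aB: int, n_ab: int,
                          min_freq: float = 0.01) -> bool:
    """True if all four haplotypes reach min_freq, suggesting past recombination."""
    counts = (n_AB, n_Ab, n_aB, n_ab)
    total = sum(counts)
    return total > 0 and all(count / total >= min_freq for count in counts)


complete_ld = ld_stats(50, 0, 0, 50)
partial_ld = ld_stats(40, 10, 10, 40)
print(complete_ld, four_gametes_observed(50, 0, 0, 50))
print(partial_ld, four_gametes_observed(40, 10, 10, 40))
```

With the first set of counts only two haplotypes exist, so both statistics equal 1 and one SNP carries all the information of the pair; with the second set all four haplotypes are present and both statistics drop.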

The web interface of the SNPs selector

The main purpose of PupasView is to provide the user with an optimal set of SNPs for genotyping experiments by filtering the annotated SNPs using a series of filters related to their impact in protein functionality and pathology, their population frequency and LD. The input is a gene identifier (Ensembl IDs or external IDs, which include GenBank, Swissprot/TrEMBL and other gene IDs supported by Ensembl). The program can also be invoked from PupaSNP. The program presents a list of options that can be selected and applied as many times as desired. The options include:

Validation status obtained from dbSNP.

Type of SNP (coding, intron, untranslated region, local), according to its position in the gene.

Frequency and population, an option that allows the possibility of filtering by a range of frequencies of the minor allele in one or more populations (Europe; Europe, multinational; Europe, North America; North America; Central/South America; North/East Africa and Middle East; Central/South Africa; West Africa; Central Asia; East Asia; Pacific; multinational; unknown; HapMap).

Functional properties, as follows: non-synonymous SNPs [all or only those predicted as pathological by the pmut algorithm (12,13)]; SNPs disrupting predicted transcription factor binding sites (all or only those that are in regions conserved in the mouse genome); SNPs disrupting predicted ESEs (all or only those that are in regions conserved in the mouse genome); SNPs disrupting potential triplex-forming regions (all or only those that are in regions conserved in the mouse genome); SNPs disrupting intron/exon boundaries; regions conserved in mouse.

Options for the way in which blocks are constructed: confidence intervals (29); four gamete rule (30); Solid Spine of LD (28).

Figure 1 shows the view of the results. The viewer of PupasView has been constructed using Ensembl APIs (24). Figure 1A shows the result of running PupasView on the gene TP53 without applying any filter. 
All the SNPs in the gene and the neighbourhood are displayed. If the cursor is over an SNP, information on it is displayed by means of pop-up text. Figure 1B shows a subselection of these SNPs obtained after selecting only SNPs for which population frequency was available. Finally, Figure 1C shows the selection obtained if only SNPs with putative functional effect are chosen. This will constitute the final, reduced subset of optimal SNPs. The upper horizontal bar below the figure represents LD parameters (which can be individually obtained by placing the cursor over them). The lower horizontal bar represents the block found with the selected algorithm. The blocks are displayed graphically with brown rectangles going from the first to the last SNP within the block. When the cursor is over the rectangles, a tooltip text pops up in the block showing the SNPs and the haplotypes (with HapMap frequencies in parentheses). Tag SNPs are signalled with an exclamation mark (!).

Figure 1

Sequential application of filters in PupasView. (A) SNPs in gene TP53. (B) SNPs together with population frequencies. (C) SNPs with any functional characteristic. Depending on the versions of Ensembl and dbSNP, the appearance of the figure can change.

DISCUSSION

It is believed that improved genotyping methods in combination with the proper bioinformatics design strategies will offer better opportunities for the study of complex diseases (3). The use of functional SNPs could be an important factor in increasing the sensitivity of association tests. Different bioinformatics approaches have been focused mainly on the effect of coding SNPs, but also recently on SNPs affecting the regulation or the splicing of genes (14). PupasView is the first tool that integrates both transcriptional and translational phenotypic effects caused by polymorphisms. It provides an interactive environment in which functional information and population frequency data can be used over LD parameters as sequential filters to obtain a final list of SNPs optimal for genotyping purposes. PupasView is closely linked to our previous program PupaSNP (14), which is a tool for selecting SNPs with putative phenotypic effects. PupaSNP, designed for high-throughput experiments, has been used to design >9000 sets of SNPs, and has a daily average of 50 uses. PupasView assists in the last refinement step of gene-by-gene selection of SNPs. Figure 1 illustrates the effect of applying successive filter steps, which are, conceptually, first to select only those SNPs which are real (with reported population frequencies) and then to select only functional SNPs. In the last view (Figure 1C), LD parameters can be used to help in the final selection. More than 5000 SNPs have been selected using PupaSNP and PupasView in the first step of the pipeline for the study of polymorphisms at the Spanish National Genotyping Centre (CeGen).

31 in total

1. TRANSFAC: an integrated system for gene expression regulation.

Authors: E Wingender; X Chen; R Hehl; H Karas; I Liebich; V Matys; T Meinhardt; M Prüss; I Reuter; F Schacherer
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation.

Authors: D Chasman; R M Adams
Journal: J Mol Biol Date: 2001-03-23 Impact factor: 5.469

3. Prediction of deleterious human alleles.

Authors: S Sunyaev; V Ramensky; I Koch; W Lathe; A S Kondrashov; P Bork
Journal: Hum Mol Genet Date: 2001-03-15 Impact factor: 6.150

4. Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties.

Authors: Carles Ferrer-Costa; Modesto Orozco; Xavier de la Cruz
Journal: J Mol Biol Date: 2002-01-25 Impact factor: 5.469

5. Predicting deleterious amino acid substitutions.

Authors: P C Ng; S Henikoff
Journal: Genome Res Date: 2001-05 Impact factor: 9.043

6. The structure of haplotype blocks in the human genome.

Authors: Stacey B Gabriel; Stephen F Schaffner; Huy Nguyen; Jamie M Moore; Jessica Roy; Brendan Blumenstiel; John Higgins; Matthew DeFelice; Amy Lochner; Maura Faggart; Shau Neen Liu-Cordero; Charles Rotimi; Adebowale Adeyemo; Richard Cooper; Ryk Ward; Eric S Lander; Mark J Daly; David Altshuler
Journal: Science Date: 2002-05-23 Impact factor: 47.728

Review 7. Listening to silence and understanding nonsense: exonic mutations that affect splicing.

Authors: Luca Cartegni; Shern L Chew; Adrian R Krainer
Journal: Nat Rev Genet Date: 2002-04 Impact factor: 53.242

8. Sequence-based prediction of pathological mutations.

Authors: C Ferrer-Costa; M Orozco; X de la Cruz
Journal: Proteins Date: 2004-12-01

9. Understanding human disease mutations through the use of interspecific genetic variation.

Authors: M P Miller; S Kumar
Journal: Hum Mol Genet Date: 2001-10-01 Impact factor: 6.150

10. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn's disease.

Authors: J P Hugot; M Chamaillard; H Zouali; S Lesage; J P Cézard; J Belaiche; S Almer; C Tysk; C A O'Morain; M Gassull; V Binder; Y Finkel; A Cortot; R Modigliani; P Laurent-Puig; C Gower-Rousseau; J Macry; J F Colombel; M Sahbatou; G Thomas
Journal: Nature Date: 2001-05-31 Impact factor: 49.962
