
SNP@Promoter: a database of human SNPs (single nucleotide polymorphisms) within the putative promoter regions.

Byoung-Chul Kim, Woo-Yeon Kim, Daeui Park, Won-Hyong Chung, Kwang-Sik Shin, Jong Bhak.

Abstract

BACKGROUND: Analysis of single nucleotide polymorphisms (SNPs) is becoming a key research area in genomics. Many functional analyses of SNPs have been carried out for coding regions and splice sites, where SNPs can alter proteins and mRNA splicing. However, SNPs in non-coding regulatory regions can also influence important biological regulation. Presently, there are few databases for SNPs in non-coding regulatory regions. DESCRIPTION: We identified 488,452 human SNPs in the putative promoter regions, which extend from -5000 bp to +500 bp relative to the transcription start sites. Of these, 47,832 SNPs (9.8%) were predicted to occur in transcription factor (TF) binding sites. The results are stored in a database, SNP@Promoter. Users can search the SNP@Promoter database using three kinds of entries: 1) an SNP identifier (rs number from dbSNP), 2) a gene (gene name, gene symbol, refSeq ID), or 3) a disease term. The SNP@Promoter database provides extensive genetic information and graphical views of queried terms.
CONCLUSION: We present the SNP@Promoter database. It was created in order to predict functional SNPs in putative promoter regions and predicted transcription factor binding sites. SNP@Promoter will help researchers to identify functional SNPs in non-coding regions.

Year: 2008 PMID: 18315851 PMCID: PMC2259403 DOI: 10.1186/1471-2105-9-S1-S2

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

Since the completion of the Human Genome Project, biologists' interest has shifted to non-repetitive sequence variants in the genome, by far the most common of which are single nucleotide polymorphisms (SNPs). For a variation to be considered an SNP, it must occur in at least 1% of the population. SNPs, which make up about 90% of all human genetic variation, occur every 100 to 300 bases along the 3-billion-base human genome [1,2]. It is generally believed that the complete human sequence will reveal at least a million SNPs in or near coding regions, including introns and promoters. As a general rule, many SNPs have no effect on cell function, but some SNPs are reported to be highly related to diseases or to influence cells' response to a drug. Although more than 99% of human DNA sequences are the same across all populations, some SNPs can have a major impact on how humans respond to diseases; to environmental insults such as bacteria, viruses, toxins, and chemicals; and to drugs and other therapies. This makes SNPs of great value for biomedical research, for developing pharmaceutical products, and for medical diagnostics. New bioinformatics tools and public SNP resources for SNP studies, specifically for linkage disequilibrium and disease association studies, will form part of the new scientific landscape [3-9]. These public SNP resources are made possible by large-scale, high-throughput systems that screen SNPs in many individuals. The challenge is to accomplish this while reducing the cost per genotype and the required completion time. The public SNP resources are producing information about SNPs that are related to diseases or that modify biological function. Many functional studies of SNPs have focused on SNPs located in coding regions, which can influence phenotype by altering the encoded proteins [9,10]. Such SNPs can also introduce premature termination codons, which can trigger nonsense-mediated mRNA decay (NMD) [11].
SNPs can also affect splice sites, which results in alternative splicing [12]. Additionally, there are many SNPs in non-coding regulatory regions. The exact functions of these SNPs are not yet clear. However, some are predicted to affect gene expression by altering the binding affinity of transcription factors. For example, a G/C polymorphism in the promoter of FCGR2B regulates gene expression [13]. The -783A/G and -1438A/G polymorphisms in the promoter of the HTR2A gene also regulate gene expression: the -783 G allele and the -1438 G allele are known to reduce the binding activity of transcription factors [14]. However, there are few public resources that provide information on SNPs affecting non-coding regulatory regions, such as promoters, in the human genome. The rSNP_Guide system is the only one that has reported SNPs related to potential transcription factor candidates, covering 41 types of known transcription factor binding sites [15,16]. ORegAnno focuses not on SNP information for regulatory regions in the human genome but on the registration and validation of annotations for promoters, transcription factor binding sites, and regulatory variation [17]. SNP@Promoter is a large database that contains various types of information on the location and function of SNPs in putative promoter regions of the human genome, intended for the study of gene regulation. In particular, SNP@Promoter provides biologists with a platform that includes disease-associated genes, transcription factor binding sites, and a graphic viewer.

Methods and results

We developed an integrated computational system for identifying SNPs in non-coding regulatory regions (Fig. 1). In this system, we: 1) predicted TF binding sites in putative promoter regions, 2) identified SNPs in the putative promoter regions and selected SNPs within predicted TF binding sites, 3) examined evolutionary conservation of predicted TF binding sites, and 4) integrated a variety of gene annotation information.

Figure 1

Flow chart for identifying SNPs in putative promoter regions. Cylinders represent databases. Rectangles are computational applications. (a) Putative promoter regions are identified in the human genome sequence. (b) Transcription factor binding sites are predicted in the putative promoter regions using the TRANSFAC database. (c) SNPs are mapped. (d) Evolutionary conservation scores are calculated within transcription factor binding sites. (e) The disease association and functional annotation of target genes are carried out using an in-house functional annotation database.

Prediction of TF binding sites in putative promoter region

We identified TF binding sites in the putative promoter regions of the human genome. The promoter region is defined as the sequence from 5 kb upstream to 500 bp downstream of a transcription start site. The annotation information of genes, which is mapped to the genome, was obtained from the NCBI Gene database. To find TF binding sites in the putative promoter regions, we used the MATCH (Matrix Search for Transcription Factor Binding Sites) program from the TRANSFAC database (ver. 8.4) [18,19]. As a result, we predicted 1,497,317 TF binding sites from 28,644 human genes.
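The promoter definition above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the `Region` type, the 1-based inclusive coordinates, and the convention that the transcription start site (TSS) itself counts as the first downstream base are all assumptions made for illustration. Under those assumptions the window is 5,500 bp long on either strand.

```python
from typing import NamedTuple

UPSTREAM = 5000    # bases upstream of the TSS, as stated in the text
DOWNSTREAM = 500   # bases downstream of the TSS, as stated in the text


class Region(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive (assumed convention)
    strand: str


def promoter_region(chrom: str, tss: int, strand: str) -> Region:
    """Return the putative promoter window around a transcription start site.

    Assumption: the TSS base is counted as the first downstream base.
    """
    if strand == "+":
        start, end = tss - UPSTREAM, tss + DOWNSTREAM - 1
    elif strand == "-":
        # On the minus strand, "upstream" means higher genomic coordinates.
        start, end = tss - DOWNSTREAM + 1, tss + UPSTREAM
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    # Clamp at the chromosome start for genes close to position 1.
    return Region(chrom, max(start, 1), end, strand)


if __name__ == "__main__":
    print(promoter_region("chr1", 10000, "+"))
    print(promoter_region("chr1", 10000, "-"))
```

Handling the strand explicitly matters because the upstream side of a minus-strand gene lies at higher genomic coordinates than its TSS.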

Identification of SNPs on predicted TF binding sites

The SNP annotation information was derived from a public SNP database (dbSNP ver. 126). We identified SNPs in putative promoter regions and selected those predicted to lie within TF binding sites. As a result, we mapped 488,452 SNPs to putative promoter regions, of which 47,832 fall within putative TF binding sites.
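The selection step described here is, in essence, a point-in-interval lookup. The following Python sketch shows one way to do it for a single chromosome; the function name and input layout are hypothetical, and the source does not describe how the authors implemented the mapping.

```python
from bisect import bisect_right
from typing import Iterable, List, Tuple


def snps_in_sites(snp_positions: Iterable[int],
                  sites: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the SNP positions that fall inside at least one site.

    `sites` holds (start, end) pairs, 1-based and inclusive, all on the
    same chromosome as the SNPs (an assumption of this sketch).
    """
    # Merge overlapping sites so one binary search per SNP is enough.
    merged: List[List[int]] = []
    for start, end in sorted(sites):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [s for s, _ in merged]

    hits = []
    for pos in snp_positions:
        # Index of the last merged interval starting at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= merged[i][1]:
            hits.append(pos)
    return hits


if __name__ == "__main__":
    # Toy coordinates, not real data.
    print(snps_in_sites([5, 12, 20, 31], [(10, 15), (14, 22), (30, 30)]))
```

Applied to the counts reported above, the selected share is 47,832 / 488,452, or about 9.8% of the mapped promoter SNPs.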

Applying a conservation score

Computational prediction of TF binding sites (TFBS) is not optimal because of its high false positive rate. However, recent algorithms have become more reliable. Popular algorithms examine well-conserved regulatory sequences by comparing upstream sequences of orthologous genes across species [20-28]. Therefore, as an index of reliability, we calculated an evolutionary conservation score for every predicted TF binding site, so that users can see how reliable each predicted site is. We used the phastcons16way file derived from UCSC human genome data. This file contains conservation scores calculated by the phastCons program from multiple genome alignment data [29].
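As a rough illustration of attaching a conservation score to a predicted site, the sketch below averages per-base scores over the site's coordinates. The text does not state how per-base phastCons values are combined into a single score per site, so the use of the mean is an assumption, as are the function name and the dictionary-based input.

```python
from typing import Mapping, Optional


def site_conservation(scores: Mapping[int, float],
                      start: int, end: int) -> Optional[float]:
    """Mean per-base score over [start, end] (1-based, inclusive).

    Assumption: the site score is the arithmetic mean of the per-base
    scores. Positions without a score are skipped; None is returned
    when no position in the site has a score.
    """
    values = [scores[p] for p in range(start, end + 1) if p in scores]
    if not values:
        return None
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy per-base scores keyed by genomic position, not real data.
    toy_scores = {100: 0.9, 101: 0.7, 103: 0.2}
    print(site_conservation(toy_scores, 100, 103))
    print(site_conservation(toy_scores, 200, 210))
```

Returning `None` for sites with no aligned bases keeps "no data" distinct from a genuinely low conservation score.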

Integration with functional annotation

The SNP@Promoter database adopted various gene annotations including pathways (KEGG), gene ontology (GOA), and disease information such as GAD, HGMD, and OMIM. The raw data files were integrated into the SNP@Promoter database based on a gene synonym table from HGNC (HUGO). These annotations provide insight into the effects of SNPs within TF binding sites and help users to characterize target genes regulated by SNPs.
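Integration through a gene synonym table can be sketched as follows. The table layout, function names, and the gene and annotation values are hypothetical placeholders, not the HGNC file format or the database's actual schema.

```python
from typing import Dict, Iterable, List, Mapping, Tuple


def build_alias_index(synonym_table: Mapping[str, Iterable[str]]) -> Dict[str, str]:
    """Map every known name (upper-cased) to its official symbol."""
    index: Dict[str, str] = {}
    for symbol, aliases in synonym_table.items():
        for name in (symbol, *aliases):
            # First mapping wins if an alias is ambiguous (sketch-level choice).
            index.setdefault(name.upper(), symbol)
    return index


def merge_annotations(
    index: Mapping[str, str],
    sources: Mapping[str, Iterable[Tuple[str, str]]],
) -> Dict[str, Dict[str, List[str]]]:
    """Group (gene name, annotation) records from each source by symbol."""
    merged: Dict[str, Dict[str, List[str]]] = {}
    for source_name, records in sources.items():
        for gene_name, value in records:
            symbol = index.get(gene_name.upper())
            if symbol is None:
                continue  # unresolved names are skipped in this sketch
            merged.setdefault(symbol, {}).setdefault(source_name, []).append(value)
    return merged


if __name__ == "__main__":
    # Hypothetical gene names and annotation values.
    index = build_alias_index({"GENEA": ["ga1", "GeneA-alt"]})
    sources = {
        "KEGG": [("ga1", "pathway_x")],
        "OMIM": [("GENEA", "disease_y"), ("unknown", "z")],
    }
    print(merge_annotations(index, sources))
```

Resolving every source to one official symbol first means annotations filed under different aliases of the same gene end up in a single record.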

User interface

As shown in Fig. 2(A), a user can search the SNP@Promoter database using three kinds of entries: 1) an SNP identifier (rs number from dbSNP), 2) a gene (gene name, gene symbol, refSeq ID), or 3) a disease term. When the user submits a gene or a disease term, SNP@Promoter returns a list of genes related to the query. When the user opens the details of a gene, the database shows SNP information, gene information, and transcription factor binding site information for that gene, as shown in Fig. 2(B). SNP@Promoter also provides graphical views of the queried SNPs and genes. Fig. 3 shows the putative promoter region browser.

Figure 2

SNP@Promoter user interface. (A) SNP@Promoter main page. Users can search using three kinds of entries: 1) an SNP identifier (rs number from dbSNP), 2) a gene (gene name, gene symbol, refSeq ID), or 3) a disease term. (B) SNP@Promoter gene retrieval page. The SNP Information table shows identified SNPs within the putative promoter region and TF binding sites. The Gene Information table shows various gene annotations, including pathways (KEGG) and gene ontology (GOA). The Information of Transcription Factor Binding Sites table shows a variety of TF information, such as TF start position, upstream position, TF strand, match score, TF binding sequence, and conservation score.

Figure 3

A graphic viewer of the transcription regulatory region. The green bar represents a putative promoter region (5500 bp). The arrows in the green bar show the strand of transcription, the orange box is the transcription start region, yellow inverted triangles are SNP positions, and purple triangles are predicted transcription factor binding sites.

Conclusion

SNP@Promoter is a database for functional SNPs within putative promoter regions and predicted TF binding sites. The database provides genetic information and graphical views of queried terms. SNP@Promoter will help researchers to identify functional SNPs in non-coding regions. Users can access the SNP@Promoter at or directly at .

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BK constructed the database. WYK developed the website and assisted in constructing the database. WH and KS helped to develop the website. BK initiated this project and wrote the manuscript. DP assisted in writing the manuscript. JB directed the study and helped to draft the manuscript.

29 in total

1. rSNP_Guide, a database system for analysis of transcription factor binding to DNA with variations: application to genome annotation.

Authors: Julia V Ponomarenko; Tatyana I Merkulova; Galina V Orlova; Oleg N Fokin; Elena V Gorshkova; Anatoly S Frolov; Vadim P Valuev; Mikhail P Ponomarenko
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

2. MATCH: A tool for searching transcription factor binding sites in DNA sequences.

Authors: A E Kel; E Gössling; I Reuter; E Cheremushkin; O V Kel-Margoulis; E Wingender
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

3. SIFT: Predicting amino acid changes that affect protein function.

Authors: Pauline C Ng; Steven Henikoff
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

4. SNPper: retrieval and analysis of human SNPs.

Authors: A Riva; I S Kohane
Journal: Bioinformatics Date: 2002-12 Impact factor: 6.937

Review 5. rSNP_Guide: an integrated database-tools system for studying SNPs and site-directed mutations in transcription factor binding sites.

Authors: Julia V Ponomarenko; Galina V Orlova; Tatyana I Merkulova; Elena V Gorshkova; Oleg N Fokin; Gennady V Vasiliev; Anatoly S Frolov; Mikhail P Ponomarenko
Journal: Hum Mutat Date: 2002-10 Impact factor: 4.878

6. SNP2NMD: a database of human single nucleotide polymorphisms causing nonsense-mediated mRNA decay.

Authors: Areum Han; Woo-Yeon Kim; Seong-Min Park
Journal: Bioinformatics Date: 2006-11-22 Impact factor: 6.937

Review 7. Evolutionary strategies for the elucidation of cis and trans factors that regulate the developmental switching programs of the beta-like globin genes.

Authors: D L Gumucio; D A Shelton; W Zhu; D Millinoff; T Gray; J H Bock; J L Slightom; M Goodman
Journal: Mol Phylogenet Evol Date: 1996-02 Impact factor: 4.286

8. Enrichment of regulatory signals in conserved non-coding genomic sequence.

Authors: S Levy; S Hannenhalli; C Workman
Journal: Bioinformatics Date: 2001-10 Impact factor: 6.937

9. TRANSFAC: transcriptional regulation, from patterns to profiles.

Authors: V Matys; E Fricke; R Geffers; E Gössling; M Haubrock; R Hehl; K Hornischer; D Karas; A E Kel; O V Kel-Margoulis; D-U Kloos; S Land; B Lewicki-Potapov; H Michael; R Münch; I Reuter; S Rotert; H Saxel; M Scheer; S Thiele; E Wingender
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

10. Cross-species comparison significantly improves genome-wide prediction of cis-regulatory modules in Drosophila.

Authors: Saurabh Sinha; Mark D Schroeder; Ulrich Unnerstall; Ulrike Gaul; Eric D Siggia
Journal: BMC Bioinformatics Date: 2004-09-09 Impact factor: 3.169
