
New Onto-Tools: Promoter-Express, nsSNPCounter and Onto-Translate.

Purvesh Khatri, Valmik Desai, Adi L Tarca, Sivakumar Sellamuthu, Derek E Wildman, Roberto Romero, Sorin Draghici.

Abstract

The Onto-Tools suite is composed of an annotation database and eight complementary, web-accessible data mining tools: Onto-Express, Onto-Compare, Onto-Design, Onto-Translate, Onto-Miner, Pathway-Express, Promoter-Express and nsSNPCounter. Promoter-Express is a new tool added to the Onto-Tools ensemble that facilitates the identification of transcription factor binding sites active in specific conditions. nsSNPCounter is another new tool that allows computation and analysis of synonymous and non-synonymous codon substitutions for studying evolutionary rates of protein coding genes. Onto-Translate has also been enhanced to expand its scope and accuracy by fully utilizing the capabilities of the Onto-Tools database. Currently, Onto-Translate allows arbitrary mappings between 28 types of IDs for 53 organisms. Onto-Tools are freely available at http://vortex.cs.wayne.edu/Projects.html.


Year: 2006 PMID: 16845086 PMCID: PMC1538776 DOI: 10.1093/nar/gkl213

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

While high-throughput sequencing and microarray technologies have allowed the rapid collection of a staggering amount of data per experiment, they have also posed the challenges of translating such data into a better understanding of the underlying biological phenomena. First released in 2001, Onto-Tools is a freely available web-accessible software suite that addresses some of these challenges (1–6). This is achieved using a probabilistic functional analysis that bridges the gap between low-level, high-throughput gene expression data and high-level functional knowledge, as well as public annotations within the framework of the Gene Ontology (GO). This analysis approach has become the de facto standard in the second-stage analysis of microarray experiments (7). The Onto-Tools suite includes (i) Onto-Express, used to translate lists of differentially regulated genes into a better understanding of the underlying biological phenomena; (ii) Onto-Design, used to select the best set of genes to be included on a custom microarray designed for the study of a given biological phenomenon; (iii) Onto-Compare, used to analyze the functional bias of various focused commercial microarrays and select the one that is most appropriate for a given biological hypothesis; (iv) Onto-Translate, used to translate lists of genes from one reference system to another (e.g. from GenBank accession numbers to UniGene cluster IDs to Affymetrix probe IDs, etc.); (v) Onto-Miner, providing a unified access point and an application programming interface (API) allowing queries for various information such as the gene name, official symbol, reference accession number, coded protein, etc.; (vi) Pathway-Express, which helps the users find the most interesting pathway(s) involving their genes of interest; (vii) Promoter-Express, which allows the users to find condition-specific transcription factor binding sites (TFBSs) and (viii) nsSNPCounter, which allows analysis of synonymous and non-synonymous codon substitutions in protein coding genes. Previous publications have described in detail the motivation, implementation and validation of these tools (1–7). The logical work-flow between the Onto-Tools applications has been previously explained (1,4). This paper describes two new tools added to the ensemble and discusses various other additions and enhancements made to the existing tools.

PROMOTER-EXPRESS

Transcription initiation is accomplished by complex and tightly coordinated protein-DNA interactions between a number of transcription factors and the promoter region(s) of a gene. While Onto-Express and Pathway-Express in the Onto-Tools ensemble help identify the significant biological processes and pathways in the condition under study, developing a detailed mechanistic model of the regulatory mechanism(s) that control these processes requires the identification of the genetic elements involved in these mechanisms. Promoter-Express (PE) is a new tool in the Onto-Tools ensemble designed to help the user identify cis-regulatory elements on the DNA (8). The underlying hypothesis behind its approach is that similarly expressed genes involved in related biological phenomena are likely to be regulated by a common transcriptional mechanism (9–13). Given this hypothesis, PE accepts a list of genes that are known to be involved in the same or related biological processes. For example, this list could come from Onto-Express, and could contain genes involved in the same biological processes. Alternatively, the list could contain genes with similar expression profiles. For each input gene, PE queries the back-end Onto-Tools database and retrieves the nucleotide sequence of its upstream region. Currently, the organisms supported by PE are human and mouse. By default, PE retrieves a nucleotide sequence 1000 bp upstream and 200 bp downstream from the start of the mRNA of the target gene. Most TFBSs are likely to be within this region. However, PE also allows the user to expand or restrict the search boundaries. Note that in most cases, the start of an mRNA corresponds to its transcription start site (TSS). However, in the case of some mRNAs, the TSS may not be annotated precisely or may not be annotated at all. For such situations, PE also allows the user to submit an arbitrary list of FASTA-formatted sequences.
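As an illustration of the default search boundaries described above, the following minimal Python sketch computes the coordinates of a region spanning 1000 bp upstream and 200 bp downstream of a TSS. The function name, the 1-based inclusive coordinate convention, the inclusion of the TSS base itself and the strand handling are assumptions made for this example; they are not taken from the PE implementation.

```python
def promoter_window(tss, strand, upstream=1000, downstream=200):
    """Return (start, end), 1-based inclusive, of the region around a TSS.

    Hypothetical helper: 'upstream' and 'downstream' are measured relative to
    the direction of transcription, so they swap sides on the reverse strand.
    """
    if strand == "+":
        # Upstream lies at smaller coordinates on the forward strand.
        return max(1, tss - upstream), tss + downstream
    if strand == "-":
        # Upstream lies at larger coordinates on the reverse strand.
        return max(1, tss - downstream), tss + upstream
    raise ValueError("strand must be '+' or '-'")
```

For a forward-strand gene with its TSS at position 5000, this yields the interval (4000, 5200); widening or narrowing the search only requires changing the two keyword arguments.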
Next, PE performs pairwise sequence comparisons using a sliding window approach. By default, PE uses a window size of 9 bp since most of the known TFBSs are 6–20 bp long. However, PE allows the user to use a different window size. Furthermore, when a match is found, the program tries to expand the matching sequence in both directions. The result of the pairwise sequence comparisons is a number of exact matching subsequences found in both input sequences. We define each of these exact matching subsequences as an element. It has been shown previously that two genes involved in a similar biological process and regulated by the same transcription regulation mechanism may require that these elements appear in the same order and at approximately the same distance on both genes (14). Hence, after finding the exact matching elements in both genes, PE searches for the combinations of those elements that appear in the same order with approximately the same distance among the elements on both genes. A set of elements that satisfies these criteria is defined as a module (see Figure 1). Such a module represents a ‘footprint’ of the transcriptional regulatory mechanisms at work in a specific biological context.
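The sliding-window comparison with expansion can be sketched in Python as follows. This is only an illustrative sketch, not the PE code: the function name and the returned tuple layout are assumptions, only the forward strand is compared, and the sequences are assumed to use a consistent letter case.

```python
def find_elements(seq_a, seq_b, window=9):
    """Find maximal exact matches of at least `window` bases shared by two sequences.

    Returns a sorted list of (start_a, start_b, length) tuples with 0-based starts.
    """
    # Index every window-sized substring of seq_b by its start positions.
    index = {}
    for j in range(len(seq_b) - window + 1):
        index.setdefault(seq_b[j:j + window], []).append(j)

    elements = set()
    for i in range(len(seq_a) - window + 1):
        for j in index.get(seq_a[i:i + window], []):
            # Expand the seed match to the left...
            a, b = i, j
            while a > 0 and b > 0 and seq_a[a - 1] == seq_b[b - 1]:
                a -= 1
                b -= 1
            # ...and to the right.
            end_a, end_b = i + window, j + window
            while (end_a < len(seq_a) and end_b < len(seq_b)
                   and seq_a[end_a] == seq_b[end_b]):
                end_a += 1
                end_b += 1
            # Overlapping seeds expand to the same match; the set keeps one copy.
            elements.add((a, b, end_a - a))
    return sorted(elements)
```

Indexing the windows of one sequence in a dictionary avoids comparing every window pair directly; this is a design choice of the sketch, and the source does not say how PE organizes the comparison internally.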

Figure 1

Example of a typical module. The thick black lines are the two upstream sequences to be analyzed. The thin, shorter color-coded segments between them are the elements common to both, and the thick color-coded segments are the elements that together form a module. The gap X is approximately equal to gap ∼X and gap Y is approximately equal to gap ∼Y.
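The ordering and gap criteria illustrated in Figure 1 could be checked along the lines of the following Python sketch. It takes elements as (start_a, start_b, length) tuples with 0-based starts on the two sequences and returns the longest chain whose elements keep the same order and whose corresponding gaps differ by at most a tolerance. The function name, the dynamic-programming strategy and the 10 bp default tolerance are assumptions chosen for illustration; the source does not specify how PE quantifies "approximately the same distance".

```python
def best_module(elements, tolerance=10):
    """Return the longest chain of elements with preserved order and similar gaps.

    `tolerance` is the maximum allowed difference, in bp, between the gap that
    separates two consecutive elements on sequence A and the gap on sequence B.
    """
    elems = sorted(elements)
    if not elems:
        return []
    best_len = [1] * len(elems)   # length of the best chain ending at each element
    prev = [None] * len(elems)    # back-pointer to the previous element in that chain
    for k, (a2, b2, _) in enumerate(elems):
        for p in range(k):
            a1, b1, len1 = elems[p]
            gap_a = a2 - (a1 + len1)
            gap_b = b2 - (b1 + len1)
            same_order = gap_a >= 0 and gap_b >= 0
            if (same_order and abs(gap_a - gap_b) <= tolerance
                    and best_len[p] + 1 > best_len[k]):
                best_len[k] = best_len[p] + 1
                prev[k] = p
    # Walk back from the end of the longest chain.
    k = max(range(len(elems)), key=best_len.__getitem__)
    chain = []
    while k is not None:
        chain.append(elems[k])
        k = prev[k]
    return chain[::-1]
```

A chain containing a single element satisfies the criteria trivially, so a caller would probably want to require at least two elements before reporting a module.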

PE's output shows each input gene as a color-coded continuous line labeled with its Entrez Gene ID or gene name (Figure 2). Under each line, it displays the elements found as short color-coded line segments that quickly allow the user to find out in how many and in which genes an element is found. When a user moves the mouse over an element, PE shows its nucleotide sequence, its start and end positions on both genes, and the strand on which the element was found (i.e. forward or reverse strand). Selecting a gene and one of its modules displays all elements in the selected module in color, while the rest of the elements on all genes are shown in white (Figure 2). PE also allows the users to save the results on their own machine in a binary file, which can be opened at a later time for further analysis.

Figure 2

The output provided by Promoter-Express for a selected module. The figure also shows some of the possible data manipulations and interactions with the GUI.

nsSNPCOUNTER

Studying the evolutionary rates of different protein coding genes usually requires the computation of the number of synonymous and non-synonymous substitutions among genes (15–17). Recent studies have focused on evolutionary changes among single nucleotide polymorphisms (SNPs) (18). Such a change is considered synonymous if it leads to a synonymous codon, i.e. a change in the nucleotide sequence of a gene does not change the amino acid sequence of the protein translated from it. Alternatively, if the change in the nucleotide sequence does change the amino acid sequence of the protein, the change is non-synonymous. Clearly, non-synonymous mutations are much more important both from an evolutionary and from a clinical perspective. dbSNP is a SNP database provided by the NCBI that allows the retrieval of a list of known SNPs within the coding region of a given gene, identified for instance by a RefSeq ID. However, dbSNP does not provide the means to distinguish and automatically count the synonymous and non-synonymous SNP occurrences in the database (SynCounts and NonSynCounts) for a given RefSeq ID. The task is especially cumbersome when one needs to extract this information for thousands of genes simultaneously. The nsSNPCounter is a web-based tool that was designed to fulfil this need. Besides this main functionality, it also gives the user an estimate of the number of synonymous (SynSites) and non-synonymous (NonSynSites) sites available in the sequence of each gene. This supplementary information is needed to adjust the SynCounts and NonSynCounts due to their uneven proportions (19). The PAML software collection provides functionality to estimate the SynSites and NonSynSites, but this is done only for one gene at a time. The new nsSNPCounter brings together all this information (SynCounts, NonSynCounts, SynSites and NonSynSites) for thousands of genes at a time. nsSNPCounter requires a list of mRNA RefSeq IDs and the name of the organism as input. 
The user can also specify other optional search criteria to refine the search. These optional criteria include heterozygosity range, SNP validation method, etc. For each RefSeq ID, nsSNPCounter queries NCBI's dbSNP database (using the esearch tool provided by NCBI) in order to obtain its corresponding reference SNP cluster ID (RS ID). The RS ID of the SNP is then used to obtain its corresponding gene and Entrez Gene ID using the efetch tool provided by NCBI. The Entrez Gene ID is further used to query the dbSNP database again to retrieve all known SNPs (RS IDs) for the gene. The output of this query is further processed to retain only non-redundant SNPs, and to count the synonymous and non-synonymous substitutions relative to the reference contig. To compute the SynSites and NonSynSites we need the sequence of the coding region of each gene of interest. This information is obtained by querying the NCBI GenBank database for every RefSeq ID in the list to obtain a sequence GI ID. Then, the GI IDs are used to query the GenBank database again in order to retrieve the actual nucleotide sequences, and the start and end positions of the coding regions. The coding sequences are then used as an input to the PAML software, which calculates the number of synonymous and non-synonymous sites. The nsSNPCounter automatically processes the PAML output and integrates the results into a single output containing the RefSeq ID, the synonymous and non-synonymous SNP counts calculated by nsSNPCounter, and the number of synonymous and non-synonymous sites calculated by the PAML software (see right side of Figure 3).
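The distinction between synonymous and non-synonymous substitutions, and the resulting SynCounts and NonSynCounts, can be illustrated with a short Python sketch. It assumes the standard genetic code, an uppercase coding sequence that starts at the first base of the start codon, and SNPs given as (0-based position, variant base) pairs. All function names are hypothetical; the sketch neither queries dbSNP nor reproduces the SynSites and NonSynSites estimates that nsSNPCounter obtains from PAML.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(cds, position, alt_base):
    """Classify a single-base substitution in a coding sequence."""
    offset = position % 3                 # position of the SNP within its codon
    codon_start = position - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    same = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same else "non-synonymous"


def count_snps(cds, snps):
    """Count synonymous and non-synonymous SNPs given as (position, alt_base) pairs."""
    counts = {"synonymous": 0, "non-synonymous": 0}
    for position, alt_base in snps:
        counts[classify_snp(cds, position, alt_base)] += 1
    return counts
```

For the toy coding sequence ATGGCTAAA, a T-to-C change at position 5 turns GCT into GCC, both of which encode alanine, so it is counted as synonymous, whereas an A-to-G change at position 6 turns AAA (lysine) into GAA (glutamate) and is counted as non-synonymous.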

Figure 3

Input (left panel) and output (right panel) for nsSNPCounter.

ONTO-TRANSLATE

In order to correctly interpret the results of an experiment, researchers need to build as complete a picture as possible of the biological phenomenon under study, using the knowledge accumulated in various annotation databases. However, our current knowledge is spread over a number of different databases, many of which are rather specialized, and no single database contains all available data. Although the data within each database are consistent, coherent and non-redundant, most of these annotation databases are developed by independent groups. These groups use different designs and different sets of identifiers for the same biological entities. The result of these independent efforts is replication of the same information in multiple databases. Furthermore, these databases cross-reference some of the other databases to facilitate navigation from one resource to another. In order to build a complete picture of the biological phenomenon under study, a researcher is not only responsible for mapping various types of IDs to one another, but also for being aware of the relationships among these resources. Onto-Translate (OT) is designed to address these name-space issues and help the user with the problem of mapping various types of IDs to each other. The ultimate goal of OT is to provide the users with a non-redundant and complete mapping from any type of identification system to any other type. In order to achieve this goal, OT uses the custom design of the Onto-Tools database, which integrates 20 publicly available biological databases including dbEST (20), GenBank (21), UniGene (22), KEGG (23), WormBase, NetAffx, dbEST library, eVOC (24), Swiss-Prot (25), TrEMBL (25), PIR (26), UniProt (27), Eukaryotic Promoter Database (EPD) (28), Human Genome Nomenclature Committee (HGNC) (29), GenPept, Online Mendelian Inheritance in Man (OMIM) (30), Protein Data Bank (31), iProClass (26), HomoloGene, RefSeq (32) and Gene Ontology (GO) (33,34). 
In addition, the Onto-Tools database also integrates information about commercial microarrays from nine manufacturers including Affymetrix, Agilent Technologies, Amersham's codelink microarrays, SuperArray, Takara Biosystems, Perkin-Elmer, NIA, SigmaGenosys and Clontech. Over the past year, OT has been enhanced to allow arbitrary mappings among 28 types of IDs for 53 organisms. Currently, OT can translate thousands of IDs in a single batch run. It also provides a graphical user interface to select the desired input and output, which are hyperlinked with the corresponding online database resources. As an example of the capabilities of OT, Figure 4 shows a comparison between the translations performed by OT and MatchMiner, a similar tool from NCI (35). The figure shows the percentages of genes that are successfully translated from probe IDs to gene symbols for a number of popular Affymetrix arrays. Note that NetAffx performs such translations, from probe IDs to gene symbols, but the lists to be translated are limited to at most 5000 genes. Hence, none of the translations shown here can be performed on NetAffx.
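The kind of batch translation and success percentage compared in Figure 4 can be illustrated with a minimal Python sketch. The lookup table is represented here as a plain dictionary from source IDs to target IDs; the function name is hypothetical and nothing in the sketch reflects the actual schema of the Onto-Tools database or the way OT resolves mappings.

```python
def translate_ids(ids, mapping):
    """Translate a batch of IDs and report the percentage translated successfully.

    `mapping` is a hypothetical lookup table (source ID -> target ID); an ID
    counts as translated only if it maps to a non-empty target.
    """
    unique_ids = list(dict.fromkeys(ids))   # drop duplicates, keep input order
    translated = {i: mapping[i] for i in unique_ids if mapping.get(i)}
    percent = 100.0 * len(translated) / len(unique_ids) if unique_ids else 0.0
    return translated, percent
```

With four distinct input IDs of which three have an entry in the table, the function returns the three translations together with a success rate of 75.0 percent.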

Figure 4

A comparison between the performance of Onto-Translate (OT) and MatchMiner (MM). The figures show the percentage of successful translations from probe IDs to gene symbols, for a number of sets of genes corresponding to popular Affymetrix human (left) and mouse (right) arrays.

SUMMARY

The Onto-Tools suite is composed of a back-end database and eight integrated, web-accessible, free data mining tools: Onto-Express, Onto-Compare, Onto-Design, Onto-Translate, Onto-Miner, Pathway-Express, Promoter-Express and nsSNPCounter. Promoter-Express is a new tool that allows identification of condition-specific TFBSs for co-expressed genes that are involved in the same or related biological processes. nsSNPCounter is another new tool that allows analysis of synonymous and non-synonymous codon substitutions for studying evolutionary rates of protein coding genes. Over the past year, Onto-Translate was enhanced to improve its scope. The Onto-Tools are freely available at http://vortex.cs.wayne.edu/Projects.html.

33 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. RefSeq and LocusLink: NCBI gene-centered resources.

Authors: K D Pruitt; D R Maglott
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

3. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

4. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models.

Authors: Z Yang; R Nielsen
Journal: Mol Biol Evol Date: 2000-01 Impact factor: 16.240

5. The KEGG databases at GenomeNet.

Authors: Minoru Kanehisa; Susumu Goto; Shuichi Kawashima; Akihiro Nakaya
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

6. Profiling gene expression using onto-express.

Authors: Purvesh Khatri; Sorin Draghici; G Charles Ostermeier; Stephen A Krawetz
Journal: Genomics Date: 2002-02 Impact factor: 5.736

7. Creating the gene ontology resource: design and implementation.

Authors:
Journal: Genome Res Date: 2001-08 Impact factor: 9.043

8. Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors.

Authors: A E Kel; O V Kel-Margoulis; P J Farnham; S M Bartley; E Wingender; M Q Zhang
Journal: J Mol Biol Date: 2001-05-25 Impact factor: 5.469

9. Regulatory context is a crucial part of gene function.

Authors: Sabine Fessele; Holger Maier; Christian Zischek; Peter J Nelson; Thomas Werner
Journal: Trends Genet Date: 2002-02 Impact factor: 11.639

10. Natural selection on protein-coding genes in the human genome.

Authors: Carlos D Bustamante; Adi Fledel-Alon; Scott Williamson; Rasmus Nielsen; Melissa Todd Hubisz; Stephen Glanowski; David M Tanenbaum; Thomas J White; John J Sninsky; Ryan D Hernandez; Daniel Civello; Mark D Adams; Michele Cargill; Andrew G Clark
Journal: Nature Date: 2005-10-20 Impact factor: 49.962
