Literature DB >> 21346861

xFITOM: a generic GUI tool to search for transcription factor binding sites.

Abstract

UNLABELLED: Locating transcription factor binding sites in genomic sequences is a key step in deciphering transcription networks. Currently available software for site search is mostly server-based, limiting the range and flexibility of this type of analysis. xFITOM is a fully customizable program for locating binding sites in genomic sequences written in C++. Through an easy-to-use interface, xFITOM that allows users an unprecedented degree of flexibility in site search. Among other features,it enables users to define motifs by mixing real sites and IUPAC consensus sequences,to search the annotated sequences of unfinished genomes and to choose among 11 different search algorithms. AVAILABILITY: XFITOM IS AVAILABLE FOR DOWNLOAD AT: http://research.umbc.edu/˜erill.

Entities: Chemical Disease Species

Keywords: Transcription factor; binding site; genome analysis; motif prediction; site search

Year: 2010 PMID： 21346861 PMCID： PMC3039987 DOI： 10.6026/97320630005049

Source DB: PubMed Journal: Bioinformation ISSN： 0973-2063

Background

The discovery and analysis of transcriptional regulatory networks is a key step in elucidating the complex regulatory apparatus of living beings [1]. Transcriptional regulation is mediated mainly by transcription factors (TF) that bind DNA and can either hinder (repressors) or promote (activators) the formation of an open complex by the RNA-polymerase holoenzyme [2]. The semi-specific recognition of binding sites by their cognate transcription factors allows implementing computational tools for the discovery and detection of transcription binding motifs and sites [1]. Motif discovery methods focus on the identification of overrepresented patterns in groups of sequences [3]. Conversely, site search algorithms take in a motif description and use pattern matching techniques to search for sites on DNA sequences [4]. Scanning genomic sequences leads to a noisy but informative reconstruction of transcriptional regulatory networks, which can be later validated by in vitro and in vivo methods [5]. Site search techniques can be roughly divided into two categories. Methods based on based on regular expression syntaxes (regex) use IUPAC codes for degenerate bases and allow for complex motif descriptions including gaps, multiple repeats and variable spacers [3]. Their flexibility in motif description comes at the cost of reduced base frequency information and a combinatorial increase in complexity for elaborate patterns [3]. Position- Specific Scoring Matrices (PSSM) approaches implement rigid or semirigid feature locators based on the observed frequency of occurrence of each base at each motif position, as derived from a collection of aligned sites [6]. This matrix representation is commonly referred to as the binding motif or profile. Most PSSM approaches are based on the application of information theory to molecular biology [4, 7], which leads to the widespread representation of binding motifs as sequence logos [8]. Information theory can be applied to the information process that takes place when a transcription factor binds a site. Protein binding leads to a reduction in uncertainty that is formally defined as the difference in entropy at each motif position [4] ( supplementary material for equation). Following this approach, several methods have been devised to assess the fit of a candidate site to a particular binding motif. Nonweighted methods use only the frequency of observed bases at each position in their computation, while weighted methods apply also with information on positional conservation. Weighted methods show improved correlation with experimental binding affinities, but are more prone to false positives in genomic searches [4]. Alternative methods using relative entropy, instead of Shannon entropy, have been devised to take into account biases in genomic base composition [4]. Most software solutions developed to implement genome-wide searches of TF-binding sites using PSSM, like Virtual Footprint, are server-based applications linked to server-based databases [9] and, in some cases, integrated in web-based motif discovery suites [3]. A common problem with most web-based applications in bioinformatics is the limitation of sequence sizes in order to contain server load and/or traffic. This is normally addressed by relying on a server database of sequences, but this often limits severely the number and repertoire of target sequences available for analysis. Some solutions rely also on a limited set of precompiled matrices, while others restrict entry of motif data to either site collections or IUPAC codes. In addition, each solution relies on a single specific method for site search. This prevents users from choosing the method that they believe most appropriate and precludes the integration of results from different methods. xFITOM is a standalone, easy to use and extremely flexible GUI-based application for site search that integrates several PSSM-based methods in a single tool.

Methodology

xFITOM requires a collection of aligned sites and a set of target files to be searched. Using the site collection, xFITOM will compute the information content of the motif. It will also pre-process the target sequence and compute its a priori entropy. The program will then start a sequential search of the genome using a sliding window and one of the 11 available scoring methods. xFITOM uses a threshold, an arbitrary cut-off value or a value relative to the distribution of scores in the collection, to define putative binding sites. These are then sorted by genomic position or score. If gene annotation data is available, sites are classified as “operator”, “intragenic” or “intergenic” based on their location with respect to nearby genes. xFITOM can also apply local complexity factors to detect locally enriched motifs [10]. In this mode, xFITOM computes several mobile score averages and can rescore sites based on three different ratios between local and global score averages.

Implementation

xFITOM has been developed integrally in C++ as a standalone application for Microsoft Windows operating systems using the Microsoft Foundation Classes. The program operates on two main input files: a collection of aligned sites and a collection of target sequences. Aligned sites can be entered in FASTA and raw text format, with degenerate IUPAC characters allowing for an arbitrary degree of flexibility in the definition of the binding motif. Target sequences can be entered in FASTA and GenBank formats, with gene annotation data extracted automatically from the latter. Multiple sequences can be processed directly in FASTA format, while GenBank files can be merged into a compound GenBank file for analysis. This allows analysis of annotated unfinished genomes, giving immediate access to newly released genomic data. All xFITOM parameters can be specified by the user through the GUI (Figure 1). Basic parameters, like the search method, detection threshold and site-to-gene distances are defined in the left panel, while local complexity options are located on the right panel. The program generates comma-separated value output files to allow easy post-processing of results with spreadsheet software. Due to its standalone nature and configurability, xFITOM allows users to carry out useful non-standard analyses, such comparisons among different methods (Figure 1), affinity ranking of collection sites or annotated searches of unfinished genome assemblies (Figure 1).

Figure 1

[Left panel] GUI interface of xFITOM showing the input section and the advanced and local complexity options sections. [Top-right panel] Condensed results from an annotated search of the Pasteurella dagmatis ATCC_43325 unfinished genome sequence using a collection of LexA-binding sites from Escherichia coli [4]. For clarity, gene locus accession numbers have been abbreviated to Pdag_XXXX from the original HMPREF0621_XXXX format. Searching merged GenBank files can provide useful information on the composition regulatory networks for partly sequenced organisms. Here it shows the usual components of the SOS network in the _-Proteobacteria to be present in P. dagmatis [11]. [Bottom-right panel] ROC curve showing a comparison of search efficiencies when looking for CRP sites in the E. coli genome with different methods. Previous results [4] indicate that nonweighted methods like Ri outperform weighted ones (like Ri•Rsequence), but this does not hold true when the length of sites in a collection extends beyond the primary conserved region.

11 in total

1. In silico analysis reveals substantial variability in the gene contents of the gamma proteobacteria LexA-regulon.

Authors: Ivan Erill; Marcos Escribano; Susana Campoy; Jordi Barbé
Journal: Bioinformatics Date: 2003-11-22 Impact factor: 6.937

Review 2. How do site-specific DNA-binding proteins find their targets?

Authors: Stephen E Halford; John F Marko
Journal: Nucleic Acids Res Date: 2004-06-03 Impact factor: 16.971

3. Differences in LexA regulon structure among Proteobacteria through in vivo assisted comparative genomics.

Authors: Ivan Erill; Mónica Jara; Noelia Salvador; Marcos Escribano; Susana Campoy; Jordi Barbé
Journal: Nucleic Acids Res Date: 2004-12-16 Impact factor: 16.971

Review 4. Regulation of transcription: from lambda to eukaryotes.

Authors: Mark Ptashne
Journal: Trends Biochem Sci Date: 2005-06 Impact factor: 13.807

5. Virtual Footprint and PRODORIC: an integrative framework for regulon prediction in prokaryotes.

Authors: Richard Münch; Karsten Hiller; Andreas Grote; Maurice Scheer; Johannes Klein; Max Schobert; Dieter Jahn
Journal: Bioinformatics Date: 2005-08-18 Impact factor: 6.937

Review 6. Computational approaches to study transcriptional regulation.

Authors: M Madan Babu
Journal: Biochem Soc Trans Date: 2008-08 Impact factor: 5.407

Review 7. Finding sequence motifs in prokaryotic genomes--a brief practical guide for a microbiologist.

Authors: Jan Mrázek
Journal: Brief Bioinform Date: 2009-06-24 Impact factor: 11.622

Review 8. Eukaryotic transcription factor binding sites--modeling and integrative search methods.

Authors: Sridhar Hannenhalli
Journal: Bioinformatics Date: 2008-04-21 Impact factor: 6.937

9. Sequence logos: a new way to display consensus sequences.

Authors: T D Schneider; R M Stephens
Journal: Nucleic Acids Res Date: 1990-10-25 Impact factor: 16.971

10. A reexamination of information theory-based methods for DNA-binding site identification.

Authors: Ivan Erill; Michael C O'Neill
Journal: BMC Bioinformatics Date: 2009-02-11 Impact factor: 3.169

7 in total

1. An SOS Regulon under Control of a Noncanonical LexA-Binding Motif in the Betaproteobacteria.

Authors: Neus Sanchez-Alberola; Susana Campoy; David Emerson; Jordi Barbé; Ivan Erill
Journal: J Bacteriol Date: 2015-05-18 Impact factor: 3.490

2. Prevalence of SOS-mediated control of integron integrase expression as an adaptive trait of chromosomal and mobile integrons.

Authors: Guillaume Cambray; Neus Sanchez-Alberola; Susana Campoy; Émilie Guerin; Sandra Da Re; Bruno González-Zorn; Marie-Cécile Ploy; Jordi Barbé; Didier Mazel; Ivan Erill
Journal: Mob DNA Date: 2011-04-30

7. New insights on Pseudoalteromonas haloplanktis TAC125 genome organization and benchmarks of genome assembly applications using next and third generation sequencing technologies.

Authors: Weihong Qi; Andrea Colarusso; Miriam Olombrada; Ermenegilda Parrilli; Andrea Patrignani; Maria Luisa Tutino; Macarena Toll-Riera
Journal: Sci Rep Date: 2019-11-11 Impact factor: 4.379