
FIMO: scanning for occurrences of a given motif.

Charles E Grant, Timothy L Bailey, William Stafford Noble.

Abstract

SUMMARY: A motif is a short DNA or protein sequence that contributes to the biological function of the sequence in which it resides. Over the past several decades, many computational methods have been described for identifying, characterizing and searching with sequence motifs. Critical to nearly any motif-based sequence analysis pipeline is the ability to scan a sequence database for occurrences of a given motif described by a position-specific frequency matrix.
RESULTS: We describe Find Individual Motif Occurrences (FIMO), a software tool for scanning DNA or protein sequences with motifs described as position-specific scoring matrices. The program computes a log-likelihood ratio score for each position in a given sequence database, uses established dynamic programming methods to convert this score to a P-value and then applies false discovery rate analysis to estimate a q-value for each position in the given sequence. FIMO provides output in a variety of formats, including HTML, XML and several Santa Cruz Genome Browser formats. The program is efficient, allowing for the scanning of DNA sequences at a rate of 3.5 Mb/s on a single CPU.
AVAILABILITY AND IMPLEMENTATION: FIMO is part of the MEME Suite software toolkit. A web server and source code are available at http://meme.sdsc.edu.

Year:  2011        PMID: 21330290      PMCID: PMC3065696          DOI: 10.1093/bioinformatics/btr064

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

A DNA or protein sequence motif is a short pattern that is conserved by purifying selection. In DNA, a motif may correspond to a protein binding site; in proteins, a motif may correspond to the active site of an enzyme or a structural unit necessary for proper folding of the protein. Thus, sequence motifs are one of the basic functional units of molecular evolution. Consequently, identifying and understanding these motifs is fundamental to building models of cellular processes at the molecular scale and to understanding the mechanisms of human disease. We describe here a software tool, called FIMO (Find Individual Motif Occurrences, pronounced fēmō), that carries out in an efficient, statistically rigorous fashion one of the core functions required for any motif-based sequence analysis: scanning a collection of DNA or protein sequences for occurrences of one or more motifs. FIMO is by no means the first motif scanning method; however, many publicly available motif scanners are either not currently maintained or lack some of FIMO's features. Table 1 summarizes the differences between FIMO and eight currently available motif scanners. Furthermore, as part of the MEME Suite (Bailey et al., 2009), FIMO can be used seamlessly in conjunction with a variety of complementary motif-based sequence analysis tools.
Table 1. Comparison of motif search functionality

Method | Scans DNA | Scans proteins | Supports custom backgrounds | Reports P-values | Performs multiple testing correction | Source code freely available | Web accessible | GFF/WIG output | XML/HTML output
MotifScanner
MotifViz
STORM
TRED
RSAT
Patser
PoSSuMsearch
MATCH
FIMO

References for the motif scanning algorithms are provided in the supplement. Note that FIMO only supports zero-order custom backgrounds.

Note that the MEME Suite provides two other motif scanning algorithms that are useful in different scenarios. MAST (Bailey and Gribskov, 1998) searches with one or more DNA or protein motifs against a database composed of relatively short sequences, e.g. proteins or candidate regulatory regions, assigning a single score to each target sequence assuming that every motif occurs exactly once in the sequence. MCAST (Bailey and Noble, 2003), in contrast, uses a hidden Markov model to search DNA sequences for regions that are enriched with occurrences of one or more of the given motifs. Thus, MCAST is designed to scan chromosomes to detect cis-regulatory modules containing a known collection of cofactor motifs. Compared with MAST and MCAST, FIMO is simpler and more general. FIMO only assigns scores to individual motif occurrences; it makes no attempt to assign scores to joint occurrences of motifs, to sequence regions or to complete sequences. FIMO is thus a general-purpose tool for identifying individual candidate binding sites or protein motifs.

2 IMPLEMENTATION

FIMO takes as input one or more fixed-length motifs, represented as position-specific frequency matrices. These motifs can be generated from the MEME motif discovery algorithm, extracted from an existing motif database or created by hand using a simple text format. The program computes a log-likelihood ratio score (often referred to incorrectly as a ‘log-odds score’) for each motif with respect to each sequence position and converts these scores to P-values using dynamic programming (Staden, 1994), assuming a zero-order null model in which sequences are generated at random with user-specified per-letter background frequencies. Finally, FIMO employs a bootstrap method (Storey, 2002) to estimate false discovery rates (FDRs). Because the FDR is not monotonic relative to the P-value, FIMO instead reports for each P-value a corresponding q-value, which is defined as the minimal FDR threshold at which the P-value is deemed significant (Storey, 2003).

FIMO produces as output a ranked list of motif occurrences, each with an associated log-likelihood ratio score, P-value and q-value. This list is represented in multiple ways: as an HTML report, as an XML file in CisML format (Haverty and Weng, 2004), as a plain text file and as tab-delimited files in formats suitable for input to the UCSC Genome Browser (.gff and .wig).

The FIMO web server allows the user to upload one or more motifs and then search either a user-supplied sequence file or one of 3102 single and multiorganism DNA and protein databases from Ensembl and GenBank. Search results are stored online, and the user is notified of their availability via email.
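The scoring pipeline described in this section can be illustrated with a minimal Python sketch. It is not FIMO's implementation: the function names, the pseudocount, the integer `scale` factor and the three-column motif are all hypothetical, only one strand is scanned, and sequences are assumed to contain only letters present in the background. The q-values are computed with a Benjamini-Hochberg style step-up rule, which is a simpler stand-in for the bootstrap estimate of Storey (2002) that the text describes.

```python
import math

# Zero-order null model: independent letters with fixed frequencies (assumed uniform here).
DNA_BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def integer_llr_matrix(freqs, background, pseudocount=0.1, scale=100):
    """Per-position log2 likelihood ratios, scaled and rounded to integers."""
    matrix = []
    for column in freqs:
        row = {}
        for letter, bg in background.items():
            # Blend in the background so zero frequencies give finite scores.
            p = (column.get(letter, 0.0) + pseudocount * bg) / (1.0 + pseudocount)
            row[letter] = round(scale * math.log2(p / bg))
        matrix.append(row)
    return matrix


def scan(sequence, matrix):
    """Return (start, integer score) for every window of motif width."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = sum(matrix[i][sequence[start + i]] for i in range(width))
        hits.append((start, score))
    return hits


def null_distribution(matrix, background):
    """Dynamic programming over columns: P(integer score = s) under the null model."""
    dist = {0: 1.0}
    for row in matrix:
        nxt = {}
        for total, prob in dist.items():
            for letter, bg in background.items():
                key = total + row[letter]
                nxt[key] = nxt.get(key, 0.0) + prob * bg
        dist = nxt
    return dist


def p_value(score, dist):
    """Probability that a random window scores at least `score`."""
    return sum(prob for s, prob in dist.items() if s >= score)


def q_values(pvals):
    """Step-up q-values: smallest FDR threshold at which each P-value is accepted."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        q[i] = running
    return q


# Hypothetical 3-column motif; each column's frequencies sum to 1.
motif = [{"A": 0.9, "C": 0.1}, {"C": 1.0}, {"G": 0.8, "T": 0.2}]
matrix = integer_llr_matrix(motif, DNA_BACKGROUND)
dist = null_distribution(matrix, DNA_BACKGROUND)
hits = scan("TTACGACGTT", matrix)
pvals = [p_value(score, dist) for _, score in hits]
for (start, score), p, q in zip(hits, pvals, q_values(pvals)):
    print(start, score / 100, round(p, 4), round(q, 4))
```

Rounding the scores to integers keeps the number of distinct score totals small, so the null distribution can be built one motif column at a time instead of enumerating every possible window. The running minimum in `q_values` is what makes the reported values monotonic in the P-value, which is the property the text gives as the reason for reporting q-values rather than raw FDR estimates.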

3 EXAMPLE

To demonstrate FIMO's functionality, we searched the human genome with a motif for CTCF, a highly conserved zinc finger DNA-binding protein that exhibits diverse regulatory functions and that plays a major role in the global organization of the chromatin architecture of the human genome (Phillips and Corces, 2009). Figure 1 shows the FIMO HTML output for the top-scoring predicted occurrences of the motif, and a precision–recall curve comparing the predicted CTCF binding sites with a gold standard derived from a ChIP-seq experiment (see Supplementary Material for details). Overall, FIMO identified 8647 candidate binding sites with q < 0.05. The precision–recall curve suggests that the top of the list is enriched with sites that overlap ChIP-seq peaks. Note that the absolute precision is low, presumably for two reasons: first, a single motif lacks sufficient information to reliably scan an entire eukaryotic genome with high precision; second, FIMO identifies many bona fide CTCF binding sites that are not active in the particular cell type in which the ChIP-seq experiment was carried out. Scanning the entire human genome took 30 min 10 s of wall clock time on an Intel Xeon 2.2 GHz CPU, equivalent to scanning 3.5 Mb/s.
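A precision-recall curve of the kind described above can be sketched as follows. The exact evaluation procedure is given only in the Supplementary Material, so the definitions used here are assumptions: a predicted site counts as correct if it overlaps any gold-standard peak, and recall is the fraction of peaks overlapped by at least one site seen so far. The function name, the half-open `(chrom, start, end)` intervals and the toy coordinates are hypothetical.

```python
def precision_recall(ranked_sites, peaks):
    """Return (precision, recall) after each prefix of a ranked list of sites.

    ranked_sites: (chrom, start, end) tuples, best-scoring first.
    peaks: gold-standard (chrom, start, end) intervals, e.g. ChIP-seq peaks.
    Intervals are treated as half-open: [start, end).
    """
    correct_sites = 0
    found_peaks = set()
    curve = []
    for n, (chrom, start, end) in enumerate(ranked_sites, start=1):
        # Indices of all peaks on the same chromosome that overlap this site.
        matched = [
            i
            for i, (p_chrom, p_start, p_end) in enumerate(peaks)
            if p_chrom == chrom and start < p_end and p_start < end
        ]
        if matched:
            correct_sites += 1
            found_peaks.update(matched)
        curve.append((correct_sites / n, len(found_peaks) / len(peaks)))
    return curve


# Toy data, invented for illustration only.
toy_peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
toy_sites = [("chr1", 150, 169), ("chr1", 300, 319), ("chr1", 590, 609)]
toy_curve = precision_recall(toy_sites, toy_peaks)
for precision, recall in toy_curve:
    print(round(precision, 3), round(recall, 3))
```

The linear search over peaks is adequate for a toy example; a genome-scale comparison would normally sort the peaks or use an interval index.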
Fig. 1. Using FIMO to identify candidate CTCF binding sites in the human genome. (A) Sample FIMO HTML output, showing the locations of the top-scoring occurrences of the CTCF motif in the human genome. (B) A precision-recall curve created by comparing FIMO's ranked list of CTCF sites with a gold standard derived from a ChIP-seq experiment.

Funding: This work was supported by National Institutes of Health award 2 R01 RR021692.

Conflict of Interest: none declared.
REFERENCES

1. Timothy L Bailey; William Stafford Noble. Searching for statistically significant regulatory modules. Bioinformatics, 2003.

2. Peter M Haverty; Zhiping Weng. CisML: an XML-based format for sequence motif detection software. Bioinformatics, 2004.

3. Jennifer E Phillips; Victor G Corces. CTCF: master weaver of the genome. Cell, 2009.

4. T L Bailey; M Gribskov. Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 1998.

5. R Staden. Staden: searching for motifs in nucleic acid sequences. Methods Mol Biol, 1994.

6. Timothy L Bailey; Mikael Boden; Fabian A Buske; Martin Frith; Charles E Grant; Luca Clementi; Jingyuan Ren; Wilfred W Li; William S Noble. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res, 2009.
