PWMScan: a fast tool for scanning entire genomes with a position-specific weight matrix.

Giovanna Ambrosini^1,2, Romain Groux^1,2, Philipp Bucher^1,2.

Abstract

Summary: Transcription factors regulate gene expression by binding to specific short DNA sequences of 5-20 bp, thereby controlling the rate of transcription of genetic information from DNA to messenger RNA. We present PWMScan, a fast web-based tool to scan server-resident genomes for matches to a user-supplied position weight matrix (PWM) or transcription factor binding site model from a public database.

Availability and implementation: The web server and source code are available at http://ccg.vital-it.ch/pwmscan and https://sourceforge.net/projects/pwmscan, respectively.

Supplementary information: Supplementary data are available at Bioinformatics online.

Substances: Transcription Factors; DNA

Year: 2018 PMID: 29514181 PMCID: PMC6041753 DOI: 10.1093/bioinformatics/bty127

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Knowing where transcription factors (TFs) bind in the genome is key to understanding gene regulation. The binding specificity of a TF is commonly represented by a numerical matrix: a position weight matrix (PWM), a position frequency matrix (PFM) or a letter probability matrix (LPM). The three representations carry equivalent information and are inter-convertible. A PWM contains a weight for each base at each motif position. Summing the weights of the observed bases over all positions gives a binding score for any base sequence of the same length as the PWM. Large collections of TF specificity matrices are available from public libraries such as JASPAR (Khan et al., 2018) and HOCOMOCO (Kulakovskiy et al., 2018).

PWMScan is a web server for rapid scanning of large genomes for high-scoring matches to a user-supplied or server-resident PWM. Compared to other web-based PWM scanning tools, PWMScan is unique in that it scans server-resident whole genomes rather than user-uploaded DNA sequences. Other key features are: (i) menu-driven access to genomes of >30 model organisms; (ii) menu-driven access to >300 public PWM libraries; (iii) support of various PWM representations and formats; (iv) cut-off values can be specified as match scores or P-values; (v) output in BEDdetail format with match scores and P-values; (vi) links to the UCSC genome browser for visualization of results; and (vii) action buttons to transfer match lists to analysis tools.

A short description of the PWMScan server follows. Technical details about algorithms, programs and data are provided in the Supplementary Material.
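The scoring rule described in the Introduction (summing one weight per motif position) can be illustrated with a minimal Python sketch. The matrix values, the function names pwm_score and scan, and the test sequence are hypothetical and chosen only for illustration; the sketch scans the forward strand only and is not the PWMScan implementation.

```python
# Toy integer PWM (hypothetical values): one dict of base -> weight per motif position.
PWM = [
    {"A": 10, "C": -5, "G": -5, "T": -8},
    {"A": -6, "C": 12, "G": -7, "T": -4},
    {"A": -9, "C": -3, "G": 11, "T": -2},
]

def pwm_score(pwm, kmer):
    """Sum the weight of the observed base at each motif position."""
    if len(kmer) != len(pwm):
        raise ValueError("sequence must have the same length as the PWM")
    return sum(column[base] for column, base in zip(pwm, kmer))

def scan(pwm, sequence, cutoff):
    """Yield (start, kmer, score) for each forward-strand window scoring >= cutoff."""
    k = len(pwm)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in "ACGT" for base in kmer):
            continue
        score = pwm_score(pwm, kmer)
        if score >= cutoff:
            yield start, kmer, score

hits = list(scan(PWM, "TTACGACGNACG", cutoff=30))
print(hits)  # [(2, 'ACG', 33), (5, 'ACG', 33), (9, 'ACG', 33)]
```

A whole-genome scanner would also score the reverse complement of each window; that step is omitted here to keep the sketch short.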

2 Data and methods

Genome sequences were downloaded from the NCBI in FASTA format. Indexed versions for rapid scanning were generated for Bowtie (Langmead et al., 2009). The motif databases offered by PWMScan were downloaded from the MEME Suite website (Bailey et al., 2009). LPMs were converted to integer PWMs (see Supplementary Material).

The input form of the PWMScan server is shown in Figure 1. The user chooses a genome assembly from a menu. Optionally, a BED file may be uploaded to restrict the search to genomic regions of particular interest, e.g. open chromatin regions. The right side of the form offers several ways to specify a DNA motif. PWMs from a server-resident database are chosen from a pull-down menu. Alternatively, matrices can be entered into a text area or uploaded. Accepted motif types are PFMs, LPMs, real or integer PWMs and IUPAC consensus sequences. PFMs can be entered in several formats, including TRANSFAC and JASPAR.
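As an illustration of how the matrix types listed above relate to each other, the following Python sketch converts a PFM to an LPM and then to an integer PWM. The conversion actually used by PWMScan is described in the Supplementary Material and may differ; the pseudocount, the uniform background, the log2-odds formula and the scaling factor of 100 are assumptions made for this example, as are the function names.

```python
import math

BASES = "ACGT"

def pfm_to_lpm(pfm, pseudocount=1.0):
    """Turn per-position base counts into probabilities (assumed pseudocount, split evenly)."""
    lpm = []
    for counts in pfm:
        total = sum(counts[b] for b in BASES) + pseudocount
        lpm.append({b: (counts[b] + pseudocount / 4) / total for b in BASES})
    return lpm

def lpm_to_integer_pwm(lpm, background=None, scale=100):
    """Convert probabilities to scaled log2-odds weights rounded to integers."""
    if background is None:
        # Assumption: uniform base composition; a real genome would differ.
        background = {b: 0.25 for b in BASES}
    return [
        {b: round(scale * math.log2(col[b] / background[b])) for b in BASES}
        for col in lpm
    ]

# Hypothetical two-position count matrix.
pfm = [
    {"A": 18, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 19, "G": 1, "T": 0},
]
lpm = pfm_to_lpm(pfm)
pwm = lpm_to_integer_pwm(lpm)
print(pwm)
```

The scaling factor plays the role of the multiplication factor mentioned for real PWMs: a larger value preserves more resolution after rounding.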

Fig. 1.

Screen shot of the PWMScan graphical user interface

All motif types have to be converted into integer PWMs for input to the genome search engines (see Supplementary Material). Default conversion parameters are proposed and can be changed by the user. For instance, real PWMs can be rescaled on input by a multiplication factor to ensure sufficient resolution after integer conversion. IUPAC consensus sequences are converted into binary matrices consisting of 0 and 1.

For all matrix formats, the cut-off value can be specified as a PWM score, as a P-value or as a percentage of the score range (0% = minimal score and 100% = maximal score). For IUPAC consensus sequences, the cut-off value is specified as the maximal number of mismatches allowed. The P-value of a PWM score x is defined as the probability that a random k-mer of the same length as the PWM has a binding score ≥ x, given the base composition of the genome.

The whole genome scan takes as input an integer PWM and a corresponding cut-off. The output is a list of sequence regions that match the PWM with a match score greater than or equal to the cut-off value. Depending on the length of the PWM and the cut-off, one of two search strategies is chosen: (i) Bowtie, a fast, memory-efficient short read aligner using indexed genomes, or (ii) matrix_scan, a C program developed by our group using a conventional search algorithm. The first strategy is more efficient for short PWMs and high cut-off values. It requires as a first step the generation of a list of all k-mers that match the PWM with the given cut-off. The list of k-mers is then mapped to the genome using Bowtie. The second strategy takes genome sequences in FASTA format as input. Individual chromosomes are processed in parallel and distributed to multiple cores by a Python script. We empirically found that the second strategy becomes more efficient if the number of k-mers exceeds 10⁵.
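The first step of the Bowtie-based strategy, listing all k-mers that reach the cut-off, can be sketched as a depth-first enumeration with pruning. This is one possible way to build such a list, not necessarily the method used by PWMScan; the toy matrix and the function name enumerate_kmers are hypothetical.

```python
# Toy integer PWM (hypothetical values): one dict of base -> weight per motif position.
PWM = [
    {"A": 10, "C": -5, "G": -5, "T": -8},
    {"A": -6, "C": 12, "G": -7, "T": -4},
    {"A": -9, "C": -3, "G": 11, "T": -2},
]

def enumerate_kmers(pwm, cutoff):
    """Return (kmer, score) for every k-mer scoring >= cutoff, pruning hopeless prefixes."""
    k = len(pwm)
    # max_rest[i] is the best score obtainable from positions i..k-1.
    max_rest = [0] * (k + 1)
    for i in range(k - 1, -1, -1):
        max_rest[i] = max_rest[i + 1] + max(pwm[i].values())

    results = []

    def extend(prefix, score, pos):
        if pos == k:
            results.append(("".join(prefix), score))
            return
        for base in "ACGT":
            new_score = score + pwm[pos][base]
            # Prune: even the best possible suffix cannot reach the cut-off.
            if new_score + max_rest[pos + 1] < cutoff:
                continue
            prefix.append(base)
            extend(prefix, new_score, pos + 1)
            prefix.pop()

    extend([], 0, 0)
    return results

print(enumerate_kmers(PWM, 20))  # [('ACG', 33), ('ACT', 20)]
```

The length of the returned list is the quantity that the 10⁵ threshold above refers to: a lower cut-off or a longer matrix yields more k-mers to map.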
The matrix_scan program was benchmarked for speed together with five other matrix scanners and was found to be the fastest (see Supplementary Material).

The basic search step outputs a list of PWM matches, including the genomic coordinates, the DNA sequence and the match score. Post-processing of this list involves computation of the corresponding P-values, addition of the matrix name and, optionally, elimination of overlapping matches. The final match list is provided in BEDdetail format. The output page further shows the total number of PWM matches and a sequence logo reflecting the letter probabilities of the input matrix. Action buttons are provided for: (i) sending the match lists to analysis tools of the ChIP-Seq and SSA servers (Ambrosini et al., 2016), (ii) extracting DNA sequences around the matches, (iii) sending the output to the UCSC genome browser for visualization and (iv) liftover of the match coordinates to other assemblies of the same or related species.

PWMScan is meant to support many types of genomic data analysis and is designed to be interoperable with other tools from our group and elsewhere. An example of a typical workflow involving ChIP-seq data is presented in the Supplementary Material. PWMScan is also available as a command-line software package from SourceForge, including a master script that schedules all computational steps run during a web job.
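The optional elimination of overlapping matches in the post-processing step could be implemented in several ways, and the text does not say which rule PWMScan applies. The Python sketch below assumes a greedy rule that keeps the highest-scoring match among overlapping ones on the same chromosome. The record fields, the function name drop_overlaps and the example coordinates are hypothetical; coordinates follow the BED convention of 0-based, half-open intervals.

```python
def drop_overlaps(matches):
    """Greedy filter: visit matches by descending score and keep a match
    only if it overlaps none of the matches kept so far (assumed rule)."""
    kept = []
    for m in sorted(matches, key=lambda m: (-m["score"], m["chrom"], m["start"])):
        clash = any(
            m["chrom"] == other["chrom"]
            and m["start"] < other["end"]
            and other["start"] < m["end"]
            for other in kept
        )
        if not clash:
            kept.append(m)
    # Return the survivors in genomic order.
    return sorted(kept, key=lambda m: (m["chrom"], m["start"]))

# Hypothetical matches of an 11 bp motif.
matches = [
    {"chrom": "chr1", "start": 100, "end": 111, "strand": "+", "score": 900},
    {"chrom": "chr1", "start": 105, "end": 116, "strand": "-", "score": 950},
    {"chrom": "chr1", "start": 116, "end": 127, "strand": "+", "score": 800},
    {"chrom": "chr2", "start": 100, "end": 111, "strand": "+", "score": 700},
]
filtered = drop_overlaps(matches)
print([(m["chrom"], m["start"], m["score"]) for m in filtered])
```

The quadratic comparison against all kept matches is fine for a toy list; a genome-scale list would call for a sorted sweep or an interval index instead.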

3 Benchmark

The runtime of PWMScan was measured by scanning the human genome (UCSC assembly hg19) with two different PWMs from JASPAR, STAT1 with length 11 bp and CTCF with length 19 bp, and different cut-off values expressed as P-values. Results are shown in Table 1. Note that for longer motifs and higher P-values, the Bowtie-based approach becomes inefficient, whereas matrix_scan remains reasonably fast.
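To make the relation between a P-value cut-off and a score cut-off concrete, the following Python sketch computes the exact score distribution of a small integer PWM by dynamic programming and derives the score threshold for a given P-value, following the definition of the P-value in Section 2. It is an illustration under stated assumptions, not the PWMScan procedure: the toy matrix, the uniform background and the function names are hypothetical.

```python
from collections import defaultdict

# Toy integer PWM (hypothetical values): one dict of base -> weight per motif position.
PWM = [
    {"A": 10, "C": -5, "G": -5, "T": -8},
    {"A": -6, "C": 12, "G": -7, "T": -4},
    {"A": -9, "C": -3, "G": 11, "T": -2},
]
# Assumption: uniform base composition; a real genome would use observed frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def score_distribution(pwm, background):
    """Map each reachable score to its probability for random k-mers with independent bases."""
    dist = {0: 1.0}
    for column in pwm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base, weight in column.items():
                nxt[score + weight] += prob * background[base]
        dist = dict(nxt)
    return dist

def p_value(dist, x):
    """Probability that a random k-mer scores >= x."""
    return sum(p for s, p in dist.items() if s >= x)

def cutoff_for_p_value(dist, p):
    """Lowest score whose P-value does not exceed p, or None if no score qualifies."""
    tail = 0.0
    best = None
    for s in sorted(dist, reverse=True):
        tail += dist[s]
        if tail > p:
            break
        best = s
    return best

dist = score_distribution(PWM, BACKGROUND)
print(p_value(dist, 33))               # 0.015625, i.e. 1 of 64 k-mers
print(cutoff_for_p_value(dist, 0.04))  # 20
```

Integer weights keep the set of reachable scores finite and small, which is what makes this kind of exact tabulation practical.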

Table 1.

Benchmark results with different PWMs and P-values

PWM/P-value	Bowtie time (s)	matrix_scan time (s)
STAT1(len = 11 bp)/10⁻⁵	3	30/5^a
STAT1(len = 11 bp)/10⁻⁴	8	40/8^a
STAT1(len = 11 bp)/10⁻³	60	65/30^a
CTCF(len = 19 bp)/10⁻⁵	12	40/6^a
CTCF(len = 19 bp)/10⁻⁴	90	50/10^a
CTCF(len = 19 bp)/10⁻³	720	90/35^a

Note: Speed is expressed in seconds. The benchmarking tests have been run on a Linux/CentOS7/x86_64 workstation with 48 CPU-cores and 256 GB of DRAM.

^a Performance measurements using matrix_scan in parallel over 10 CPU-cores.

References

1. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis.

Authors: Ivan V Kulakovskiy; Ilya E Vorontsov; Ivan S Yevshin; Ruslan N Sharipov; Alla D Fedorova; Eugene I Rumynskiy; Yulia A Medvedeva; Arturo Magana-Mora; Vladimir B Bajic; Dmitry A Papatsenko; Fedor A Kolpakov; Vsevolod J Makeev
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

2. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors: Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-03-04 Impact factor: 13.583

3. The ChIP-Seq tools and web server: a resource for analyzing ChIP-seq and other types of genomic data.

Authors: Giovanna Ambrosini; René Dreos; Sunil Kumar; Philipp Bucher
Journal: BMC Genomics Date: 2016-11-18 Impact factor: 3.969

4. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework.

Authors: Aziz Khan; Oriol Fornes; Arnaud Stigliani; Marius Gheorghe; Jaime A Castro-Mondragon; Robin van der Lee; Adrien Bessy; Jeanne Chèneby; Shubhada R Kulkarni; Ge Tan; Damir Baranasic; David J Arenillas; Albin Sandelin; Klaas Vandepoele; Boris Lenhard; Benoît Ballester; Wyeth W Wasserman; François Parcy; Anthony Mathelier
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

5. MEME SUITE: tools for motif discovery and searching.

Authors: Timothy L Bailey; Mikael Boden; Fabian A Buske; Martin Frith; Charles E Grant; Luca Clementi; Jingyuan Ren; Wilfred W Li; William S Noble
Journal: Nucleic Acids Res Date: 2009-05-20 Impact factor: 16.971
