Janne H Korhonen (1,2,3), Kimmo Palin (4), Jussi Taipale (5), Esko Ukkonen (2,3). 1. School of Computer Science, Reykjavík University, Reykjavík, Iceland. 2. Helsinki Institute for Information Technology HIIT, Helsinki, Finland. 3. Department of Computer Science. 4. Genome-Scale Biology Research Program, Research Programs Unit. 5. Department of Biosciences and Nutrition, Karolinska Institutet, Genome Scale Biology Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland.
Abstract
Motivation: While the position weight matrix (PWM) is the most popular model for sequence motifs, there is growing evidence that more advanced models, such as first-order Markov representations, are useful, and such models are becoming available in well-known motif databases. Much research has addressed how to learn these models from training data, but the problem of predicting putative sites of the learned motifs by matching the model against new sequences has received less attention. Moreover, motif site analysis is often concerned with how different variants in the sequence affect the sites. So far, however, efficient software tools for these motif matching tasks have been lacking.
Results: We develop fast motif matching algorithms for the aforementioned tasks. First, we formalize a framework based on high-order position weight matrices for generic representation of motif models with dinucleotide or general q-mer dependencies, and adapt fast PWM matching algorithms to the high-order PWM framework. Second, we show how to incorporate different types of sequence variants, such as SNPs and indels, and their combined effects into efficient PWM matching workflows. Benchmark results show that our algorithms perform well in practice on genome-sized sequence sets and, for multiple-motif search, are much faster than the basic sliding window algorithm.
Availability and Implementation: Implementations are available as a part of the MOODS software package under the GNU General Public License v3.0 and the Biopython license (http://www.cs.helsinki.fi/group/pssmfind).
Contact: janne.h.korhonen@gmail.com.
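To make the terms in the abstract concrete, the following is a minimal sketch of the basic sliding window baseline applied to a high-order PWM, together with a naive rescan of the windows affected by a single SNP. It is not the MOODS implementation or its API. The matrix layout (one row per q-mer in base-4 order, one column per q-mer start position), the function names, and the toy scores are all assumptions made for illustration.

```python
ALPHABET = "ACGT"


def qmer_index(qmer):
    """Map a q-mer over ACGT to a row index using base-4 encoding."""
    idx = 0
    for ch in qmer:
        idx = idx * 4 + ALPHABET.index(ch)
    return idx


def score_window(matrix, q, window):
    """Add up the scores of all overlapping q-mers in one window.

    Assumed layout: matrix[r][i] is the score of the q-mer with row
    index r starting at offset i of the window.
    """
    return sum(
        matrix[qmer_index(window[i:i + q])][i]
        for i in range(len(window) - q + 1)
    )


def sliding_window_scan(matrix, q, seq, threshold):
    """Naive baseline: score every window and keep those >= threshold."""
    m = len(matrix[0]) + q - 1  # motif length covered by the columns
    hits = []
    for pos in range(len(seq) - m + 1):
        s = score_window(matrix, q, seq[pos:pos + m])
        if s >= threshold:
            hits.append((pos, s))
    return hits


def rescan_snp(matrix, q, seq, pos, alt, threshold):
    """Apply a substitution at `pos` and rescan only the windows covering it."""
    m = len(matrix[0]) + q - 1
    variant = seq[:pos] + alt + seq[pos + 1:]
    first = max(0, pos - m + 1)
    last = min(pos, len(variant) - m)
    hits = []
    for start in range(first, last + 1):
        s = score_window(matrix, q, variant[start:start + m])
        if s >= threshold:
            hits.append((start, s))
    return hits


# Toy first-order model (q = 2) for a motif of length 3:
# 16 dinucleotide rows, 2 columns. The scores are made up.
toy = [[-1, -1] for _ in range(16)]
toy[qmer_index("AC")][0] = 2
toy[qmer_index("CG")][1] = 3

reference_hits = sliding_window_scan(toy, 2, "TACGACGT", 5)
lost_hits = rescan_snp(toy, 2, "TACGACGT", 2, "T", 5)
gained_hits = rescan_snp(toy, 2, "TACTACGT", 3, "G", 5)
```

With these toy scores, the motif ACG is found twice in the reference sequence, and a single substitution can remove or create a site. The baseline costs time proportional to the sequence length times the motif length for each motif, which is the cost that the faster matching algorithms described in the abstract aim to reduce.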