Kai Ye (1), Walter A Kosters, Adriaan P Ijzerman. (1) Division of Medicinal Chemistry, Leiden/Amsterdam Center for Drug Research and Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands. k.ye@lacdr.leidenuniv.nl
Abstract
MOTIVATION: Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/), identify frequent patterns directly from unaligned biological sequences without attempting to align them. Here we propose a new algorithm that is more efficient and offers more functionality than both PRATT2 and TEIRESIAS, and we discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets.

RESULTS: In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, can process large datasets and can identify so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of type I patterns but offers additional functionality, such as mining type II and type III patterns and finding patterns that discriminate between two datasets.

AVAILABILITY: The source code for the pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.
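The general idea of pattern growth on unaligned sequences can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce their type I, II or III pattern definitions, which the abstract does not spell out. It assumes the simplest possible case: contiguous patterns without wildcards, with support counted as the number of sequences containing the pattern. The function name `mine_frequent_substrings`, the parameter `min_support` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def mine_frequent_substrings(sequences, min_support):
    """Return {pattern: support} for contiguous patterns that occur in at
    least `min_support` of the input sequences (simplified illustration)."""
    results = {}

    def grow(prefix, occurrences):
        # `occurrences` holds (sequence index, next position) pairs: the
        # places where `prefix` ends and could be extended by one residue.
        extensions = defaultdict(list)
        for seq_idx, pos in occurrences:
            seq = sequences[seq_idx]
            if pos < len(seq):
                extensions[seq[pos]].append((seq_idx, pos + 1))

        for residue, occ in extensions.items():
            # Support counts distinct sequences, not total occurrences.
            support = len({seq_idx for seq_idx, _ in occ})
            if support >= min_support:
                pattern = prefix + residue
                results[pattern] = support
                # Only frequent patterns are grown further, so infrequent
                # branches are pruned early.
                grow(pattern, occ)

    # The empty pattern "occurs" before every position of every sequence.
    start = [(i, j) for i, seq in enumerate(sequences) for j in range(len(seq))]
    grow("", start)
    return results


# Toy input (hypothetical sequences, not real receptor data).
toy = ["MKTAYIAK", "GKTAYL", "AAKTAV"]
frequent = mine_frequent_substrings(toy, min_support=3)
print(sorted(frequent, key=len, reverse=True))
```

Because each pattern is extended only at positions where it already occurs, no alignment step is needed and the search never revisits a pattern that failed the support threshold.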