Wei Yang (1), Saurabh Sinha (1)
(1) Department of Computer Science, University of Illinois, Urbana-Champaign, Urbana, IL, USA.
Abstract
MOTIVATION: With the rapid emergence of technologies for locating cis-regulatory modules (CRMs) genome-wide, the next pressing challenge is to assign precise functions to each CRM, i.e. to determine the spatiotemporal domains or cell types where it drives expression. A popular approach to this task is to model the typical k-mer composition of a set of CRMs known to drive a common expression pattern, and assign that pattern to other CRMs exhibiting a similar k-mer composition. This approach does not rely on prior knowledge of transcription factors relevant to the CRM or their binding motifs, and is thus more widely applicable than motif-based methods for predicting CRM activity, but is also prone to false positive predictions.

RESULTS: We present a novel strategy to improve the above-mentioned approach: to predict whether a CRM drives a specific gene expression pattern, assess not only how similar the CRM is to other CRMs with similar activity but also how similar it is to CRMs with distinct activities. We use a state-of-the-art statistical method to quantify a CRM's sequence similarity to many different training sets of CRMs, and employ a classification algorithm to integrate these similarity scores into a single prediction of the CRM's activity. This strategy is shown to significantly improve CRM activity prediction over current approaches.

AVAILABILITY AND IMPLEMENTATION: Our implementation of the new method, called IMMBoost, is freely available as source code at https://github.com/weiyangedward/IMMBoost

CONTACT: sinhas@illinois.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
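The strategy described under RESULTS, scoring a CRM against several training sets and passing the scores on to a classifier, can be illustrated with a minimal Python sketch. It is not the IMMBoost implementation: a smoothed k-mer log-odds score stands in here for the statistical similarity method, the classifier step is left out, and the names `kmer_profile`, `similarity` and `feature_vector` as well as the toy sequences are assumptions made for illustration only.

```python
from collections import Counter
from math import log


def kmer_profile(seqs, k):
    """Return a function giving add-one-smoothed k-mer frequencies pooled over seqs."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    vocab = 4 ** k  # number of possible DNA k-mers
    return lambda kmer: (counts[kmer] + 1) / (total + vocab)


def similarity(seq, profile, background, k):
    """Average log-odds of the k-mers in seq under a training profile vs. a background."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return sum(log(profile(m) / background(m)) for m in kmers) / len(kmers)


def feature_vector(seq, training_sets, background_seqs, k=3):
    """One similarity score per training set (sorted by name), for use as classifier input."""
    bg = kmer_profile(background_seqs, k)
    return [
        similarity(seq, kmer_profile(seqs, k), bg, k)
        for _, seqs in sorted(training_sets.items())
    ]


# Toy data (hypothetical): two expression patterns with distinct k-mer composition.
training_sets = {
    "patternA": ["ACACACACAC", "CACACACACA"],
    "patternB": ["GTGTGTGTGT", "TGTGTGTGTG"],
}
background = ["ACGTTGCAAGCTTAGC"]
scores = feature_vector("ACACACAC", training_sets, background, k=3)
```

In this sketch each CRM becomes a vector with one score per expression pattern, so a downstream classifier can weigh similarity to the target pattern against similarity to the other patterns, rather than thresholding a single score.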