No Promoter Left Behind (NPLB): learn de novo promoter architectures from genome-wide transcription start sites.

Abstract

Promoters have diverse regulatory architectures and thus activate genes differently. For example, some have a TATA-box, while many others do not. Even those that have one can differ in its position relative to the transcription start site (TSS). No Promoter Left Behind (NPLB) is an efficient, organism-independent method for characterizing such diverse architectures directly from experimentally identified genome-wide TSSs, without relying on known promoter elements. As a test case, we show its application in identifying novel architectures in the fly genome.
AVAILABILITY AND IMPLEMENTATION: Web-server at http://nplb.ncl.res.in . Standalone version also at https://github.com/computationalBiology/NPLB/ (Mac OSX/Linux).
CONTACT: l.narlikar@ncl.res.in
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2015 PMID: 26530723 PMCID: PMC4795619 DOI: 10.1093/bioinformatics/btv645

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Promoters play a key role in transcription initiation by harbouring specific DNA elements, which act as transcription factor recognition sites. But how these promoter elements (PEs) contribute to the diversity in transcriptional regulation is not yet clear. While high-throughput technologies are increasingly used to produce accurate maps of transcription start sites (TSSs) (Ohler and Wassarman, 2010), the subsequent step of characterizing promoters and their functions is still done using two rather dated approaches. The first involves classifying them based on known PEs such as the INR motif or TATA-box. Unfortunately, a majority of promoters and their activities cannot be explained by the presence or absence of these few PEs. Alternatively, de novo motif discovery methods are used to identify overrepresented elements directly from the sequences. These can miss PEs present only in a small fraction of promoters. Since promoters have diverse mechanisms of activation, most PEs fall in this category (Juven-Gershon et al., 2008). Even methods that identify cis-regulatory modules fail here: although they look for motif combinations, these combinations are still required to be common across the full set (Van Loo and Marynen, 2009).

No Promoter Left Behind (NPLB) is a new method, modelled along the lines of unsupervised learning with feature selection, that partitions TSS-aligned promoter sequences into distinct promoter architectures (PAs), each characterized by its own set of PEs, all learned de novo (Narlikar, 2014). Since it explicitly allows for diversity, NPLB can be applied to the full dataset, leaving out no promoter, in contrast to the standard approach of presorting/preselecting promoters on the basis of criteria such as presence of known PEs (Chen et al., 2014) or TSS peak characteristics (Ni et al., 2010). In this new parallel software, the numbers of PAs and PEs are determined automatically using a mix of Bayesian modelling and cross validation.

2 Methods

2.1 Overview of NPLB

Each promoter is characterized by one PA out of a finite set of PAs. Each PA is characterized by categorical distributions over nucleotides {A,C,G,T} at specific positions relative to the TSS. These positions and their distributions are expected to be unique to that PA. All other positions follow a background categorical distribution, common to all PAs. Parameters of models with various numbers of PAs are learned using Gibbs sampling and the best model is chosen by cross validation. Written in C and Python, NPLB requires a prior installation of gnuplot 4.6+. Weblogo 3.3 (Crooks et al., 2004) is modified to generate sequence logos.

Key advantages of NPLB are that it (i) identifies novel and possibly diverse architectures and elements, with the only input being the set of promoters; (ii) is organism and cell-type independent; (iii) can be applied directly to the full set; (iv) employs a likelihood-based approach, and can thus be used to predict new promoters as well as to classify promoters by architecture; and (v) uses multiprocessing, making it fast: it takes about 2 h for bacteria and 10 h for fly on an Intel i7-3770K desktop (Supplementary Fig. S1 shows how runtime scales with the number of promoters).
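The likelihood-based view of a PA described in this section can be illustrated with a minimal Python sketch. It is not NPLB's implementation: the uniform background, the equal prior over PAs, the two toy architectures and the function names log_likelihood and classify are all assumptions made for illustration. Each PA is represented as a map from sequence position to a categorical distribution; every position not listed falls back to the shared background.

```python
import math

# Assumed uniform background; NPLB learns its background from the data.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_likelihood(seq, pa_positions, background=BACKGROUND):
    """Log-likelihood of one TSS-aligned sequence under one PA.

    pa_positions maps a 0-based sequence index to a categorical
    distribution {nucleotide: probability}. Positions that are not
    listed use the background distribution shared by all PAs.
    """
    total = 0.0
    for i, base in enumerate(seq):
        dist = pa_positions.get(i, background)
        total += math.log(dist[base])
    return total


def classify(seq, architectures, background=BACKGROUND):
    """Name of the PA with the highest log-likelihood (equal priors assumed)."""
    return max(
        architectures,
        key=lambda name: log_likelihood(seq, architectures[name], background),
    )


# Two hypothetical architectures over 6-bp toy sequences.
architectures = {
    "PA1": {0: {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
            1: {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}},
    "PA2": {4: {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
            5: {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}},
}

print(classify("TAGGTT", architectures))  # favours PA1 (T, A at positions 0, 1)
print(classify("GGTTCA", architectures))  # favours PA2 (C, A at positions 4, 5)
```

Because every sequence is scored at every position, either by a PA-specific distribution or by the background, likelihoods of the same sequence under different PAs are directly comparable, which is what makes classification between architectures possible.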

2.2 NPLB input

NPLB can learn new PAs (promoterLearn) or categorize new promoters based on an input PA-model (promoterClassify). Both require a FASTA file of promoters, aligned according to the TSS. A typical eukaryotic file would contain DNA sequences ∼50 bp up- and downstream of the TSS. promoterClassify also needs a previously learned model. Various default settings, such as the number of PAs to be explored and the number of sampling iterations, can be overridden by the user. This is especially useful when the user wants to choose between a quick, approximate solution and a slower but more accurate characterization. An optional input is a tab-separated text file with one line per promoter, containing additional characteristics of each TSS such as UTR length, TSS spread, etc. When this file is supplied, NPLB creates plots that can give insights into functional differences between PAs.
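Preparing the TSS-aligned FASTA input is left to the user. The sketch below shows one way it might be done, assuming the chromosome sequence is already in memory as a string and each TSS is given as a 0-based coordinate with a strand. The helper names tss_window and write_fasta and the record names are hypothetical and are not part of NPLB.

```python
import io

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def tss_window(chrom_seq, tss, strand, flank=50):
    """Sequence from `flank` bp upstream to `flank` bp downstream of a TSS.

    tss is the 0-based genomic index of the TSS. The result is in
    transcript orientation, so the TSS base is always the central
    character. Returns None if the window runs off the chromosome.
    """
    start, end = tss - flank, tss + flank + 1
    if start < 0 or end > len(chrom_seq):
        return None
    window = chrom_seq[start:end].upper()
    return reverse_complement(window) if strand == "-" else window


def write_fasta(records, handle):
    """Write (name, sequence) pairs in FASTA format, skipping missing windows."""
    for name, seq in records:
        if seq is not None:
            handle.write(f">{name}\n{seq}\n")


# Toy example: a 10-bp "chromosome" and 2-bp flanks.
chrom = "ACGTACGTAC"
records = [
    ("tss_plus", tss_window(chrom, 4, "+", flank=2)),
    ("tss_minus", tss_window(chrom, 4, "-", flank=2)),
    ("tss_edge", tss_window(chrom, 1, "+", flank=2)),  # too close to the end
]
out = io.StringIO()
write_fasta(records, out)
print(out.getvalue())
```

Using a symmetric window keeps every sequence the same length and the TSS in the same column on both strands, which is what an alignment "according to the TSS" requires.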

2.3 NPLB output

A successful run of promoterLearn produces the following outputs. PAs are shown in two visual formats: image (PAimage.png; Fig. 1b) and logos (PAlogo.html; Supplementary Fig. S1). The input is stored as rawImage.png (Fig. 1a) for reference. An -eps option produces eps figures. More details about the PEs and PAs are reported in modelOut.txt and architectureDetails.txt.

Fig. 1.

(a) Original set of promoter sequences. (b) 30 PAs learned by NPLB, ordered here based on presence of known PEs. (c) Tags per million at TSSs in each PA. (d) Length of 5′ UTRs in each PA

If a characteristic file is supplied, box-plots (Fig. 1c and d) or pie charts are created for real-valued or categorical characteristics, respectively. The model itself is saved in a binary file, bestmodel.p, and can be used by NPLB to classify a new promoter. The best model is determined by cross validation. Likelihoods of all models are recorded in CVLikelihoods.txt. With the verbose option, the likelihoods of all sampling iterations are plotted in separate png files. The parameters of the execution are saved in settings.txt. A successful run of promoterClassify produces all the aforementioned files except CVLikelihoods.txt, settings.txt and the likelihood plots.
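Model selection by cross validation can be sketched as follows. The source states only that the best model is determined by cross validation and that likelihoods of all models are recorded. The interleaved fold assignment, the averaging of held-out log-likelihoods across folds, the function names and the numbers below are assumptions for illustration and do not reflect the format of CVLikelihoods.txt.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Split promoter indices 0..n-1 into k interleaved held-out folds."""
    return [list(range(i, n, k)) for i in range(k)]


def best_model(cv_likelihoods):
    """Pick the number of PAs with the highest mean held-out log-likelihood.

    cv_likelihoods maps a candidate number of PAs to a list of held-out
    log-likelihoods, one per fold (averaging across folds is an assumption).
    """
    return max(cv_likelihoods, key=lambda k: mean(cv_likelihoods[k]))


# Hypothetical held-out log-likelihoods for models with 2, 3 and 4 PAs.
cv = {
    2: [-1050.0, -1043.5, -1048.2],
    3: [-1012.3, -1009.8, -1015.1],
    4: [-1020.4, -1018.0, -1022.7],
}

print(k_fold_indices(5, 2))
print(best_model(cv))
```

Scoring on held-out promoters penalizes models with too many PAs, since architectures fitted to noise in the training folds do not raise the likelihood of unseen sequences.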

3 Case study: Drosophila

promoterLearn was applied to 90-bp neighbourhoods centred on 6635 TSSs (Fig. 1a) reported in adult Drosophila melanogaster carcasses (Chen et al., 2014). In the original study, four types of promoters were identified, based on known fly PEs (Ohler and Wassarman, 2010): TATA-box, INR, DPE, Dmv4 and Dmv5. These four types accounted for 2112 of the 6635 promoters (Supplementary Fig. S3a). Here, 12 PAs were identified (Supplementary Fig. S4a); promoterLearn was run again on each of them. Eight PAs were split further into a total of 23 PAs (Supplementary Fig. S4b), three of which were split again to give a final set of 30 PAs (Fig. 1b). A1–A6 contain the TATA-box, but differ in its distance from the TSS. Interestingly, the INR motif TCAGTY varies slightly with the TATA-box position in A3–A6. Standard analyses miss such variations, either because they rely on known PEs or because they look for elements overrepresented in the full set. For instance, in the sequences left out in the original study, NPLB finds PAs characterized by known as well as novel PEs (Supplementary Fig. S3b).

The characteristic file with the number of tags at each TSS and 5′ UTR length was used to construct two box-plots (Fig. 1c and d). A30 contains the ribosomal TCT motif (Parry et al., 2010) in place of the INR, which explains the significantly higher number of tags at those promoters (P < 10^−21). This PA was possibly missed in the original analysis because it contains <2% of all promoters. Interestingly, A7–A11, which contain variants of the DPE but no obvious upstream element, create transcripts with longer 5′ UTRs than other PAs (P < 10^−62). This has not been noted before. A more detailed description of the PAs is available in the Supplementary methods. PAs can be further analysed for function through conservation analysis (Karolchik et al., 2014; Supplementary Fig. S5) and GO term enrichment studies (Huang et al., 2007; Supplementary Table S1).
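The comparison of a TSS characteristic across PAs can be approximated with a short Python sketch that groups values by architecture and reports the group size and median. The labels and 5′ UTR lengths below are made-up toy values, the function name summarize_by_pa is hypothetical, and the statistical test behind the reported P-values is not reproduced here.

```python
from collections import defaultdict
from statistics import median


def summarize_by_pa(pa_labels, values):
    """Group one characteristic by PA; return {PA: (count, median)}.

    pa_labels and values are parallel lists with one entry per promoter.
    """
    groups = defaultdict(list)
    for label, value in zip(pa_labels, values):
        groups[label].append(value)
    return {label: (len(v), median(v)) for label, v in groups.items()}


# Toy data: PA assignment and 5' UTR length (bp) for five promoters.
labels = ["A1", "A1", "A7", "A7", "A7"]
utr_lengths = [80, 120, 300, 250, 410]

for pa, (count, med) in sorted(summarize_by_pa(labels, utr_lengths).items()):
    print(pa, count, med)
```

Reporting the count alongside the median matters here because small PAs, such as one holding under 2% of promoters, can otherwise look as well supported as large ones.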

4 Conclusion

Data from new and advanced high-throughput technologies are increasingly making it clear that cells employ diverse mechanisms for transcriptional regulation. NPLB seeks to fulfil the need for an efficient and unbiased method that can identify these mechanisms directly from such data. Although NPLB has been designed for TSS maps, it can be applied to any DNA sequences aligned on the basis of a common genomic event such as splicing, eRNA synthesis or protein–DNA binding and expected to have distinct sequence architectures in the immediate neighbourhood.

References

1. WebLogo: a sequence logo generator.

Authors: Gavin E Crooks; Gary Hon; John-Marc Chandonia; Steven E Brenner
Journal: Genome Res Date: 2004-06 Impact factor: 9.043

2. The TCT motif, a key component of an RNA polymerase II transcription system for the translational machinery.

Authors: Trevor J Parry; Joshua W M Theisen; Jer-Yuan Hsu; Yuan-Liang Wang; David L Corcoran; Moriah Eustice; Uwe Ohler; James T Kadonaga
Journal: Genes Dev Date: 2010-08-27 Impact factor: 11.361

3. Computational methods for the detection of cis-regulatory modules.

Authors: Peter Van Loo; Peter Marynen
Journal: Brief Bioinform Date: 2009-06-04 Impact factor: 11.622

4. The RNA polymerase II core promoter - the gateway to transcription.

Authors: Tamar Juven-Gershon; Jer-Yuan Hsu; Joshua Wm Theisen; James T Kadonaga
Journal: Curr Opin Cell Biol Date: 2008-04-22 Impact factor: 8.382

5. Promoting developmental transcription.

Authors: Uwe Ohler; David A Wassarman
Journal: Development Date: 2010-01 Impact factor: 6.868

6. Multiple novel promoter-architectures revealed by decoding the hidden heterogeneity within the genome.

Authors: Leelavati Narlikar
Journal: Nucleic Acids Res Date: 2014-10-17 Impact factor: 16.971

7. A paired-end sequencing strategy to map the complex landscape of transcription initiation.

Authors: Ting Ni; David L Corcoran; Elizabeth A Rach; Shen Song; Eric P Spana; Yuan Gao; Uwe Ohler; Jun Zhu
Journal: Nat Methods Date: 2010-05-23 Impact factor: 28.547

8. Comparative validation of the D. melanogaster modENCODE transcriptome annotation.

Authors: Zhen-Xia Chen; David Sturgill; Jiaxin Qu; Huaiyang Jiang; Soo Park; Nathan Boley; Ana Maria Suzuki; Anthony R Fletcher; David C Plachetzki; Peter C FitzGerald; Carlo G Artieri; Joel Atallah; Olga Barmina; James B Brown; Kerstin P Blankenburg; Emily Clough; Abhijit Dasgupta; Sai Gubbala; Yi Han; Joy C Jayaseelan; Divya Kalra; Yoo-Ah Kim; Christie L Kovar; Sandra L Lee; Mingmei Li; James D Malley; John H Malone; Tittu Mathew; Nicolas R Mattiuzzo; Mala Munidasa; Donna M Muzny; Fiona Ongeri; Lora Perales; Teresa M Przytycka; Ling-Ling Pu; Garrett Robinson; Rebecca L Thornton; Nehad Saada; Steven E Scherer; Harold E Smith; Charles Vinson; Crystal B Warner; Kim C Worley; Yuan-Qing Wu; Xiaoyan Zou; Peter Cherbas; Manolis Kellis; Michael B Eisen; Fabio Piano; Karin Kionte; David H Fitch; Paul W Sternberg; Asher D Cutter; Michael O Duff; Roger A Hoskins; Brenton R Graveley; Richard A Gibbs; Peter J Bickel; Artyom Kopp; Piero Carninci; Susan E Celniker; Brian Oliver; Stephen Richards
Journal: Genome Res Date: 2014-07 Impact factor: 9.043

9. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists.

Authors: Da Wei Huang; Brad T Sherman; Qina Tan; Joseph Kir; David Liu; David Bryant; Yongjian Guo; Robert Stephens; Michael W Baseler; H Clifford Lane; Richard A Lempicki
Journal: Nucleic Acids Res Date: 2007-06-18 Impact factor: 16.971

10. The UCSC Genome Browser database: 2014 update.

Authors: Donna Karolchik; Galt P Barber; Jonathan Casper; Hiram Clawson; Melissa S Cline; Mark Diekhans; Timothy R Dreszer; Pauline A Fujita; Luvina Guruvadoo; Maximilian Haeussler; Rachel A Harte; Steve Heitner; Angie S Hinrichs; Katrina Learned; Brian T Lee; Chin H Li; Brian J Raney; Brooke Rhead; Kate R Rosenbloom; Cricket A Sloan; Matthew L Speir; Ann S Zweig; David Haussler; Robert M Kuhn; W James Kent
Journal: Nucleic Acids Res Date: 2013-11-21 Impact factor: 16.971
