
POSTAR: a platform for exploring post-transcriptional regulation coordinated by RNA-binding proteins.

Boqin Hu¹, Yu-Cheng T Yang¹,², Yiming Huang¹, Yumin Zhu¹, Zhi John Lu³.

Abstract

We present POSTAR (http://POSTAR.ncrnalab.org), a resource of POST-trAnscriptional Regulation coordinated by RNA-binding proteins (RBPs). Precise characterization of post-transcriptional regulatory maps has accelerated dramatically in the past few years. Based on new studies and resources, POSTAR supplies the largest collection of experimentally probed (∼23 million) and computationally predicted (approximately 117 million) RBP binding sites in the human and mouse transcriptomes. POSTAR annotates every transcript and its RBP binding sites using extensive information regarding various molecular regulatory events (e.g., splicing, editing, and modification), RNA secondary structures, disease-associated variants, and gene expression and function. Moreover, POSTAR provides a friendly, multi-mode, integrated search interface, which helps users to connect multiple RBP binding sites with post-transcriptional regulatory events, phenotypes, and diseases. Based on our platform, we were able to obtain novel insights into post-transcriptional regulation, such as the putative association between CPSF6 binding, RNA structural domains, and Li-Fraumeni syndrome SNPs. In summary, POSTAR represents an early effort to systematically annotate post-transcriptional regulatory maps and explore the putative roles of RBPs in human diseases.


Year: 2016 PMID: 28053162 PMCID: PMC5210617 DOI: 10.1093/nar/gkw888

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The regulatory maps of genomes have been revealed by various high-throughput sequencing assays from individual groups and consortium efforts such as the ENCODE project (1) and Roadmap Epigenomics project (2). Most previous studies have focused on cis-regulation, the epigenome, the transcriptome, and the proteome, leaving post-transcriptional regulation not fully explored and connected with existing knowledge. Although RNA splicing has been well-studied, other post-regulatory signatures, such as RNA modification and RNA editing, have not been profiled until recently (3–5). Furthermore, most studies of genomic variants (e.g. GWAS SNPs (6) and cancer somatic mutations (7)) have mainly focused on the transcriptional level (8,9). Recently, there has been increasing interest in studying the association between post-transcriptional regulation and disease-associated variation (10). In addition, post-transcriptional regulation could regulate cell differentiation (11) and influence the scope of the druggable proteome (12). However, post-transcriptional interactions between cell state-associated genes and druggable genes and their impacts on cell differentiation, pathology and clinical treatment remain largely uncharacterized. To perform such studies, researchers require a platform that facilitates integration and association of multi-layer information to illuminate the mechanisms underlying post-transcriptional regulation.

RNA transcripts do not function as naked RNAs in eukaryotic cells from birth to death; instead, they are dynamically bound by various post-transcriptional regulatory factors, including RNA-binding proteins (RBPs) and microRNAs (miRNAs) (13,14). Recently developed high-throughput assays (e.g. CLIP-seq and RNAcompete technologies), together with computational tools, enabled researchers to obtain transcriptome-wide binding maps of RBPs and miRNAs at high resolution (15,16). Most RNA processing reactions (e.g. alternative splicing, alternative polyadenylation and nucleotide modification) and regulatory events (e.g. subcellular localization, RNA stability and translation efficiency) are mediated by miRNAs and RBPs. Constructing accurate and comprehensive RBP-RNA and miRNA-RNA interaction maps at high resolution is a necessary step toward interpreting their mechanistic roles in post-transcriptional regulation.

Although there are several databases for CLIP-seq-derived RBP binding sites, such as CLIPZ (17), starBase (18), DoRiNA (19) and CLIPdb (20), they merely provide repositories of transcriptome-wide RBP binding sites or focus on miRNA-mediated post-transcriptional regulation. Later, RBP-Var (10) was developed to incorporate SNV information for RBP binding sites, but it was limited to a few regulatory events in humans. Furthermore, experimental-data-constrained RNA secondary structures are not available for RBP binding sites in any current database. Finally, it is often desirable to know whether RBPs can interact with RNA molecules, especially novel lncRNAs; however, only a few online tools (21,22) are available for predicting RBP binding sites on given RNA sequences.

Previously, we developed CLIPdb (20), which simply provided RBP binding sites without further annotation and interpretation (http://CLIPdb.ncrnalab.org). Here, we constructed a new platform for CLIPdb version 2, POSTAR, which focuses on POST-trAnscriptional Regulation coordinated by RNA-binding proteins (RBPs), to facilitate searching, annotation, visualization, integration, connection, and interpretation of data regarding multiple post-transcriptional regulatory events in humans and mice. First and foremost, POSTAR provides a comprehensive repository of experimentally probed (i.e. derived from 498 CLIP-seq and 151 eCLIP-seq data sets) and computationally predicted RBP binding sites in humans and mice. Based on these binding sites, POSTAR annotates a gene/lncRNA and its RBP binding sites using extensive information: (i) two kinds of RBP binding motifs/preferences, (ii) six types of molecular regulation events, (iii) six types of genomic variants, (iv) four types of gene-function associations and (v) predicted RNA secondary structure around every RBP binding site based on whole-transcriptome RNA structural profiling data (Figure 1, Table 1). Furthermore, we designed a multi-mode usage interface: (a) ‘POSTAR’ search, (b) ‘RBP’ search, (c) ‘Structure’ visualization, (d) ‘Variation’ search, (e) ‘Functional gene’ search and (f) ‘Predict’ server for RBP binding prediction for given RNA sequences (Figure 2A). POSTAR presents the search/prediction results in many ways (Figure 2B). Moreover, binding sites from multiple RBPs and their associations with various post-transcriptional regulatory events can be visualized and explored in an integrative manner (Figure 3), which will allow users to connect different pieces of data from various resources and layers.

Figure 1.

Table 1.

Overview of data curated in POSTAR

	Category	Human	Mouse	Resource/calculation method^a
RBP binding sites	RBP binding sites from experiments	1 752 329	1 003 984	All CLIP-seq peaks called by Piranha (human: 65 RBPs; mouse: 30 RBPs)^b
		39 201	78 922	HITS-CLIP peaks called by CIMS (human: 17 RBPs; mouse: 23 RBPs)^b
		7 731 846	96 346	PAR-CLIP peaks called by PARalyzer (human: 44 RBPs; mouse: 4 RBPs)^b
		4 598 307	1 013 008	iCLIP peaks called by CITS (human: 9 RBPs; mouse: 8 RBPs)^b
		6 703 559	NA	eCLIP-seq peaks called by ENCODE (human: 56 RBPs)^c
		439 817	NA	PIP-seq peaks called by PMID24393486 (human: global RBPs)
	RBP binding sites from predictions	25 623 567	18 540 386	Peaks predicted by FIMO (human: 88 RBPs; mouse: 88 RBPs)^d
		19 447 967	24 621 203	Peaks predicted by TESS (human: 88 RBPs; mouse: 88 RBPs)^d
		16 586 127	11 905 150	Peaks predicted by DeepBind (human: 82 RBPs; mouse: 82 RBPs)^e
Data module I: Gene/RBP annotations	RBPs	132	104	Ensembl, PMID25365966
	Sequence motifs	726	180	MEME, HOMER
	Structural preferences	720	179	RNApromo, RNAcontext
	Gene Ontologies	15 677	13 849	GOBP, GOMF, GOCC^f
	Biological pathways	186	105	KEGG
	Gene expression	34 cell/tissue types	18 cell/tissue types	TopHat, Cufflinks^g
	Alternative splicing (skipped exon)	34 cell/tissue types	18 cell/tissue types	TopHat, MISO^g
Data module II: Molecular annotations	miRNA binding sites from experiments	3 906 955	1 588 861	AGO CLIP-seq peaks called by Piranha, the targeting miRNAs identified by miRanda^h
	miRNA binding sites from predictions	70 516 087	38 336 372	RNAhybrid, TargetScan, miRanda
	RNA modification sites	177 049	91 930	RMBase, PMID26863196
	RNA editing sites	2 583 302	8846	RADAR, DARNED
	Splicing elements	1 995 574	1 152 186	Anno. in GENCODE human v19, mouse vM7
	Conserved structural regions	725	691	EvoFam
Data module III: Genomic variants	SNPs	149 398 310	77 785 586	dbSNP v146
	Tissue-specific eQTL	19 530 607	NA	GTEx
	GWAS SNPs	278 473	NA	GWASdb2, RNAfold^i
	Clinically important SNPs	131 919	NA	ClinVar, RNAfold^i
	Cancer TCGA whole-exome SNVs	828 119	NA	PMID24390350, RNAfold^i
	Cancer TCGA whole-genome SNVs	4 745 891	NA	PMID23945592, RNAfold^i
	Cancer COSMIC SNVs	2 371 219	NA	COSMIC v76, RNAfold^i
Data module IV: Gene-Function associations	Tissue-specific genes	21 549	NA	TiGER, SpeCond
	Gene-Disease associations	419 906	NA	OMIM, DisGeNET
	Gene-Cancer associations	4485	NA	Manually curated from 60 publications^j
	Gene-Drug associations	35 201	NA	DGIdb 2.0
Data module V: RNA secondary structures	Predicted local structures	82 242 543	57 095 233	RNAfold with restraints from experimental structural probing data (human: DMS-seq, PARS; mouse: icSHAPE, Frag-seq, CIRS-seq)^k

^a Results and data first generated by POSTAR are in bold font.

^b We provide all CLIP-seq peaks called by Piranha with P < 0.01. For CIMS, CITS and PARalyzer, we provide peaks with default significance cutoffs.

^c See Supplementary File 2 for the full list of eCLIP-seq data. The peaks were called by ENCODE.

^d See Supplementary File 5 for the RBPs and motifs used for prediction.

^e See Supplementary File 6 for the RBPs in the DeepBind model.

^f BP, Biological Process; MF, Molecular Function; CC, Cellular Component.

^g See Supplementary File 4 for the full list of 230 RNA-seq data sets in human and mouse.

^h We used all AGO CLIP-seq peaks called by Piranha (P < 0.01). The targeting miRNAs of the peaks were identified using miRanda with default parameters.

^i We used RNAfold to calculate the minimal free energy changes of local RNA secondary structures that are induced by the mutations.

^j See Supplementary File 3 for the full list of manually curated cancer genes.

^k See Supplementary File 7 for the experimental structural probing datasets. We predicted one local structure centered on each RBP binding site (window size: 150 nt).

Figure 2.

Input and output search interface of POSTAR: multiple search modes and multiple result viewers. (A) POSTAR provides six usage modes: (i) ‘POSTAR’ search, (ii) ‘RBP’ search, (iii) ‘Structure’ visualization, (iv) ‘Variation’ search, (v) ‘Functional gene’ search, and (vi) ‘Predict’ server. (B) POSTAR presents the search results in multiple ways. A table layout is the basic output format (1). In the ‘POSTAR’ search mode, the interactions between the target gene and multiple RBPs are visualized in a network (2). The expression levels of the target gene and splicing scores of skipped exons across multiple cell and tissue types are shown in a bar chart (3). Clicking on the genomic positions will direct the user to the UCSC Genome Browser, which will display any associated binding sites and regulatory events (4). In ‘Structure’ visualization mode, we provide RNA structural profiling data (5) and predicted RNA secondary structures based on these data (6). In ‘RBP’ search mode, we provide the sequence motifs (7) and structural preferences (8) of the RBP.

Figure 3.

‘POSTAR’ search enables integrative viewing of multiple RBP binding sites and their potential to post-transcriptionally regulate a target gene (TP53 as an example). (A) In the PAR-CLIP Piranha data module, users may select ‘interaction network’, ‘all binding sites’, and multiple regulatory elements, including ‘miRNA binding (pred.)’, ‘RNA modification’, and ‘ClinVar SNPs’, to obtain detailed information in one page. (B) By clicking on the ‘Visualize in browser’ button (green), a user can select four RBPs among all bound RBPs to simultaneously visualize their binding sites (red tracks) and regulatory events (blue tracks) in an integrative manner via the UCSC genome browser.

Figure 1 caption: Multiple data modules in POSTAR can be used to annotate and interpret RBP binding sites at various levels. Experimentally probed and computationally predicted RBP binding sites were annotated with different genomic elements. The annotations and functions of RBPs and genes, as well as the predicted sequence motifs and structural preferences of RBPs, were provided (data module I). The RBP binding sites were annotated using extensive information at several levels, including molecular regulatory events (data module II), genomic variants (data module III), gene-function associations (data module IV), and RNA secondary structures (data module V).

DATA COLLECTION AND PROCESSING

Data source

POSTAR focuses on RBP binding sites in the human and mouse transcriptomes. We first obtained 338 processed data sets from CLIPdb (20). We also collected and processed 160 new CLIP-seq data sets using the same pipelines (Supplementary File 1). The data contain three CLIP-seq data types, including HITS-CLIP, PAR-CLIP and iCLIP. Moreover, we also incorporated 151 eCLIP-seq data sets (in HepG2 and K562 cells) that were released by the ENCODE consortium (23). The eCLIP-seq binding sites/peaks were directly downloaded from the ENCODE data portal (https://www.encodeproject.org, NOV 2015) (Supplementary File 2). In addition, we included genome-wide RBP binding sites profiled by PIP-seq technology (24). To annotate and interpret RBP binding sites, we retrieved conserved structural regions from EvoFam (25), RNA modification sites from RMBase (26) and a recent publication (27), RNA editing sites from RADAR (28) and DARNED (29), single nucleotide polymorphisms (SNPs) from dbSNP version 146 (30), human trait/disease-associated SNPs from GWASdb2 (31) and ClinVar (32), and cancer somatic mutations from whole-exome sequencing data (33), whole-genome sequencing data (34), and the COSMIC database (35). We calculated the mean conservation scores for the RBP binding sites using genome-wide phastCons (36) and phyloP (37) intensities. We obtained human tissue-specific eQTLs from GTEx (38). We did not include eQTL annotation for mice because systematic eQTL mapping across multiple tissue types was unavailable. To better annotate RBP targets at the gene level, we collected cell-specific genes from TiGER (39) and SpeCond (40), human disease-associated genes from OMIM (41) and DisGeNET (42), cancer-associated genes from 60 publications (4485 cancer-associated genes across 36 cancer types, see Supplementary File 3), and druggable genes from DGIdb (43). In addition, we provided basic annotations of RBPs, including gene symbol, gene ID and domain information (44). 
We also collected 230 RNA-seq samples from 34 human tissues/cell types and 18 mouse tissues/cell types (Supplementary File 4). Detailed descriptions and statistics for these data resources can be found in Table 1 and Figure 1.

Data re-annotation

The raw data used in POSTAR were highly heterogeneous because the data resources were collected from various publications and databases. We therefore processed and re-annotated the collected and computed data. First, the genomic coordinates of all data resources were converted to hg19 (human) and mm10 (mouse) using the LiftOver utility from the UCSC Genome Browser database (45). We also unified different IDs (e.g. RefSeq and UCSC gene IDs) from various data sets into Ensembl gene IDs (46) using BioMart (47). We used GENCODE (human v19 and mouse vM7) (48) for the annotation of regulatory elements (including validated and predicted RBP/miRNA binding sites, splicing cis-elements, and RNA modification and editing sites), trait/disease-associated variations, and various functional genes. The positions of splicing cis-elements were defined as: −3 to +8 nucleotides for 5′ splice sites and −12 to +2 nucleotides for 3′ splice sites (49). We annotated each regulatory element with its genomic strand, associated gene, genomic element, reference literature, etc. The annotation of genomic elements is based on the following priority: CDS, canonical ncRNA (including miRNA, snRNA, snoRNA, tRNA, rRNA and miscellaneous RNA), 3′ UTR, 5′ UTR, lncRNA exon, pseudogene, intron (mRNA and lncRNA), intergenic region, and others. Here, intergenic regions were defined as regions at least 2000 nt away from any genic region (coding genes, ncRNAs and pseudogenes).
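The priority rule for genomic elements can be illustrated with a short sketch. This is not the POSTAR pipeline itself; the function name `assign_element` and the plain-string labels are hypothetical, and the only thing taken from the text is the priority order.

```python
# Priority order as listed in the text: earlier entries win when a
# regulatory element overlaps more than one kind of genomic element.
ELEMENT_PRIORITY = [
    "CDS", "canonical ncRNA", "3' UTR", "5' UTR", "lncRNA exon",
    "pseudogene", "intron", "intergenic", "others",
]
_RANK = {name: rank for rank, name in enumerate(ELEMENT_PRIORITY)}


def assign_element(overlapping):
    """Return the highest-priority label among the overlapping elements.

    `overlapping` is an iterable of labels from ELEMENT_PRIORITY.
    Unknown labels are ignored; an empty input falls back to "others"
    (this fallback is an assumption, not stated in the text).
    """
    known = [label for label in overlapping if label in _RANK]
    if not known:
        return "others"
    return min(known, key=_RANK.__getitem__)
```

For example, a binding site that overlaps both an intron of one transcript and the 3′ UTR of another would be labelled `3' UTR` under this rule.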

RBP binding site identification and prediction

We followed the computational pipeline used in our CLIPdb to identify binding sites from CLIP-seq data sets (20). First, we used the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit) to pre-process the CLIP-seq data sets in a uniform procedure. Next, we identified the RBP binding sites of all CLIP-seq samples using Piranha (P-value < 0.01), which is applicable to all variations of CLIP-seq technology (50). We also provide binding sites called by specialized tools for different CLIP-seq technologies: PARalyzer for PAR-CLIP (51), CIMS for HITS-CLIP (52) and CITS (a module in the CIMS software) for iCLIP (53). To expand our RBP binding site repository, we used several computational tools to predict putative RBP binding sites across the human and mouse transcriptomes. RBPs interact with their RNA targets via specific RNA recognition motifs, which have been extensively determined using various technologies (54,55). Therefore, we used position weight matrix (PWM) motif matching to predict genome-wide RBP binding sites. All PWMs for 88 human/mouse RBPs collected from the literature (Supplementary File 5) were used to call motif matches in the human and mouse transcriptomes. We scanned for each RBP PWM within 50-nt genome intervals using FIMO (56) (a match was called if the P-value was <1e−4) and TESS (57) (a match was called if the log-odds score was >7.0). We also used DeepBind (58), a deep learning-based tool, with default parameters to predict the binding strength of 50-nt genome intervals for 82 human/mouse RBPs (Supplementary File 6). For each RBP model in DeepBind, the 50-nt genome intervals with the top 5‰ (i.e. 0.5%) binding strength were considered as binding sites.
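To make the two selection rules concrete, the following is a minimal sketch of log-odds PWM scanning and of keeping the top 5‰ of scored intervals. It is not FIMO, TESS or DeepBind: the function names, the uniform background of 0.25, the use of log2, and the toy 4-nt motif are all assumptions made for illustration.

```python
import math


def log_odds_pwm(prob_rows, background=0.25):
    """Convert per-position base probabilities into log2-odds scores."""
    return [{base: math.log2(p / background) for base, p in row.items()}
            for row in prob_rows]


def scan(sequence, pwm, threshold):
    """Return (start, score) for each window scoring above `threshold`."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score > threshold:
            hits.append((start, score))
    return hits


def top_per_mille(scores, per_mille=5):
    """Return sorted indices of the top `per_mille` per thousand scores.

    At least one interval is always kept (an assumption for short inputs).
    """
    keep = max(1, len(scores) * per_mille // 1000)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy 4-nt motif with hypothetical probabilities, not a curated RBP PWM.
toy_rows = [{b: (0.85 if b == pref else 0.05) for b in "ACGU"}
            for pref in "UGUA"]
toy_pwm = log_odds_pwm(toy_rows)
hits = scan("AAUGUAGG", toy_pwm, threshold=7.0)
```

In this toy case only the window starting at offset 2 (`UGUA`) clears the threshold. Whether the real tools report log2 or natural-log odds is not stated in the text, so the threshold value here is only illustrative.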

miRNA binding site identification and prediction within RBP binding sites

miRNA binding sites from experiments (i.e., derived from AGO CLIP-seq data sets) were annotated within the RBP binding sites. AGO CLIP-seq data sets experimentally identified miRNA–target interactions in a genome-wide manner. We used miRanda (59) to predict targeting miRNAs for AGO protein binding sites. In addition to AGO binding regions, we used RNAhybrid (60), TargetScan (61) and miRanda (59) to screen possible miRNA binding sites within the RBP binding sites for sequences of all gene regions, including 3′ UTRs, 5′ UTRs, lncRNAs and pseudogenes. For RNAhybrid and miRanda, human and mouse miRNA sequences were obtained from miRBase (version 20) (62). We estimated proper parameters for each miRNA and gene sequence pair when using RNAhybrid. For TargetScan, human and mouse miRNA families were obtained from the TargetScan website (http://www.targetscan.org). We downloaded a 100-way genome alignment for humans and a 60-way genome alignment for mice from the UCSC Genome Browser database (45) when using TargetScan.

Sequence motifs and structural preferences of RBP binding sites

To identify the sequence and structural motifs of RBP binding sites, we used RBP binding sites identified using Piranha (P-value < 0.01). We also used default RBP binding sites downloaded from the ENCODE data portal for the eCLIP-seq data sets. Briefly, the binding sites in each CLIP-seq data set were separated into independent training (top 500 binding sites) and testing (binding sites ranked 501–1000) sets to ensure the quality of sequence motif discovery. If one data set contained <1000 binding sites, we defined the top half of the binding sites as the training set and the remaining sites as the testing set. We used MEME (63) to identify enriched sequence motifs in the training set for each CLIP-seq data set. We set MEME to report up to five motif models per data set, with motif width between 4 and 10 nucleotides. Next, we calculated enrichment for the initially detected motif models within the testing set using FIMO (56) and selected the three most enriched sequence motifs for each data set. In addition, we used another sequence motif finding tool, HOMER (64), to identify the three most enriched sequence motifs for each data set using the pipeline described above. Sequence motifs were visualized using WebLogo (65). RBP binding sites were extended to at least 60 nt in length when structural preference was investigated. We used RNAcontext (66) to detect local structural motifs for each CLIP-seq data set using the pipeline described above. The structural annotation used in RNAcontext included paired (P), hairpin loop (L), bulge/internal/multi loop (M) and unstructured (U). In addition, we used another structural motif finding tool, RNApromo (67), to predict the three most enriched structural elements (P-value < 0.05) within the RBP binding sites for each CLIP-seq data set.
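The training/testing split described above can be sketched as follows. The function name `split_sites` is hypothetical, the input is assumed to be already sorted from strongest to weakest binding site, and the handling of an odd number of sites (floor division) is an assumption.

```python
def split_sites(ranked_sites):
    """Split best-first ranked binding sites into (training, testing).

    With at least 1000 sites: top 500 for training, ranks 501-1000 for
    testing. With fewer: top half for training, the rest for testing.
    """
    if len(ranked_sites) < 1000:
        half = len(ranked_sites) // 2  # rounding for odd counts is assumed
        return ranked_sites[:half], ranked_sites[half:]
    return ranked_sites[:500], ranked_sites[500:1000]
```

Keeping the two sets disjoint means that a motif found by MEME or HOMER in the training set is evaluated on sites that did not contribute to its discovery.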

RNA secondary structure prediction for RBP binding sites

To explore the RNA secondary structure around every RBP binding site, we used RNAfold with optimized parameters (option: -D) (68) to predict local structure in a manner restrained by experimental structural-probing data (Supplementary File 7) (69,70). Experimental profiling data were processed and normalized based on the same RNAex protocol (70). For sequences without sufficient probing data, we used RNAfold to predict local structure directly from the free energy model. The folding window, centered on each RBP binding site, was 150 nt.
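A 150-nt folding window centered on a binding site could be computed as in the sketch below. The function name, the 0-based half-open coordinates, and the way the window is shifted at transcript ends are assumptions; the text only specifies the window size and that it is centered on the binding site.

```python
def centered_window(site_start, site_end, transcript_length, size=150):
    """Return a 0-based half-open (start, end) window around a site.

    The window is centered on the midpoint of the site and shifted
    inward when it would run past either end of the transcript.
    """
    center = (site_start + site_end) // 2
    start = center - size // 2
    end = start + size
    if start < 0:
        # Window would start before the transcript: anchor it at 0.
        start, end = 0, min(size, transcript_length)
    elif end > transcript_length:
        # Window would run past the 3' end: anchor it at the end.
        end = transcript_length
        start = max(0, end - size)
    return start, end
```

The extracted subsequence would then be passed to the folding program together with any probing-derived restraints available for that region.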

Impact of SNVs on RNA secondary structure

SNPs (e.g. from ClinVar), as well as binding sites, can be visualized on the predicted secondary structure. To estimate the structural impact of each SNV, we started from the human reference genome (hg19), changed the reference allele to the corresponding alternative allele, and used RNAfold (71) with default parameters to calculate the minimal free energy (MFE, ΔG) of the local RNA secondary structure for both alleles. The folding free energy change was then computed as ΔΔG = ΔG_alt − ΔG_ref for each SNV.
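The ΔΔG calculation can be sketched independently of the folding program by passing the energy function in as an argument. The names `apply_snv`, `delta_delta_g` and `toy_mfe` are hypothetical, and `toy_mfe` is a deliberately crude stand-in, not a thermodynamic model; in practice the callable would wrap an MFE predictor such as RNAfold.

```python
def apply_snv(sequence, position, ref, alt):
    """Return `sequence` with the allele at 0-based `position` set to `alt`."""
    if sequence[position] != ref:
        raise ValueError("reference allele does not match the sequence")
    return sequence[:position] + alt + sequence[position + 1:]


def delta_delta_g(sequence, position, ref, alt, mfe):
    """Compute ddG = dG_alt - dG_ref using the callable `mfe`.

    `mfe` takes a sequence and returns its minimal free energy (kcal/mol).
    """
    mutated = apply_snv(sequence, position, ref, alt)
    return mfe(mutated) - mfe(sequence)


def toy_mfe(sequence):
    """Placeholder energy: -1 per G or C. For demonstration only."""
    return -1.0 * sum(base in "GC" for base in sequence)
```

A positive ΔΔG means the alternative allele yields a less stable predicted structure than the reference allele; a negative value means a more stable one.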

Expression pattern and alternative splicing events detected by RNA-seq data

In addition to the cell specificity information derived from TiGER (39) and SpeCond (40), we also calculated expression specificity scores using raw RNA-seq data. All RNA-seq data sets in POSTAR were generated from Illumina GAII or HiSeq2000/HiSeq2500 systems. We filtered out low-quality reads in each RNA-seq data set using PRINSEQ (72). Next, we aligned the RNA-seq data to the human genome (hg19) or mouse genome (mm10) using TopHat v2.0.10 (73,74). Cufflinks (v2.1.1) (73,75) was used to calculate gene expression levels. Furthermore, we calculated percentage spliced in (PSI) scores for skipped exons in the human and mouse transcriptomes using MISO (76). The PSI score denotes the fraction of mRNA that represents the inclusion isoform.
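As a rough illustration of what a PSI score expresses, the sketch below computes the naive read-count ratio. This is a simplification: MISO estimates PSI with a probabilistic model rather than a direct ratio, and the function name and the `None` return for zero coverage are assumptions.

```python
def psi(inclusion_reads, exclusion_reads):
    """Naive PSI: fraction of reads supporting the exon-inclusion isoform.

    Returns None when no reads support either isoform.
    """
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total
```

With 30 reads supporting inclusion and 10 supporting exclusion, this ratio is 0.75, meaning roughly three quarters of the transcripts would carry the exon.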

Enrichment analysis of gene ontology and biological pathways

For Gene Ontology (GO) analysis, we used topGO (77) to assess enrichment of biological process (BP), molecular function (MF) and cellular component (CC) terms for each CLIP-seq data set based on their target genes. For pathway analysis, we calculated the significance of pathway enrichment for every CLIP-seq data set using a hypergeometric test. We considered CDS, intron, 3′ UTR, and 5′ UTR separately in the analysis of GO terms and biological pathways.
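The hypergeometric test for pathway enrichment can be written with the standard library alone. This is a generic one-sided test, given as a sketch; the function and parameter names are hypothetical, and the choice of gene universe (`population`) is left to the caller because the text does not specify it.

```python
from math import comb


def hypergeom_pvalue(population, pathway_genes, targets, overlap):
    """One-sided enrichment P-value, P(X >= overlap).

    population    -- number of genes in the background universe
    pathway_genes -- number of background genes annotated to the pathway
    targets       -- number of target genes of the RBP
    overlap       -- number of target genes that are in the pathway
    """
    upper = min(pathway_genes, targets)
    total = comb(population, targets)
    tail = sum(
        comb(pathway_genes, k) * comb(population - pathway_genes, targets - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

Because the text treats CDS, intron, 3′ UTR and 5′ UTR separately, such a test would be run once per region type, each with its own set of target genes.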

Database architecture

All metadata in POSTAR are stored in a MySQL database. The web interface of POSTAR was implemented in Hyper Text Markup Language (HTML), Cascading Style Sheets (CSS) and Hypertext Preprocessor (PHP). Web design was based on the free templates of Bootstrap (http://getbootstrap.com). Visualization was implemented using the UCSC Genome Browser.

DATABASE FEATURES AND APPLICATIONS

Web interface

We provide a user-friendly web interface for users to query the database through multiple modes and predict RBP binding sites for a given RNA: (i) ‘POSTAR’ search, (ii) ‘RBP’ search, (iii) ‘Structure’ visualization, (iv) ‘Variation’ search, (v) ‘Functional gene’ search and (vi) ‘Predict’ server (Figure 2A). POSTAR presents the search/prediction results in many ways (table, bar plot, structure interface, network, etc.) (Figure 2B). We briefly introduce each query mode below. POSTAR provides three main usage modes.

(i) The ‘POSTAR’ search mode provides multiple RBP binding sites for a given protein coding gene or lncRNA, associated post-transcriptional regulatory events, trans-factors (i.e. miRNAs), and cis-elements (e.g. RNA modification/editing sites, splicing elements, and conserved structural regions), as well as functional SNPs (e.g. GWAS SNPs) and cancer somatic mutations. We applied the UCSC Genome Browser to visualize multiple RBP binding sites and their associated regulatory events or genomic variants. Users can search multiple RBPs and regulatory events for a gene/lncRNA to investigate potential crosstalk. Moreover, for a given gene/lncRNA, we also provide basic annotation, associated diseases, expression levels, and alternative splicing events across multiple cell and tissue types.

(ii) The ‘RBP’ search mode provides an overview of the binding sites for a given RBP, as well as GO terms enriched in its set of target genes. In addition, users can also find the sequence and structural motifs of the binding sites of a given RBP.

(iii) The ‘Structure’ visualization mode provides local RNA secondary structure visualization (78) for every RBP binding site. Previous studies suggest that RNA secondary structures can provide specific binding sites for RBPs and restrict protein binding by altering structural accessibility (79,80). Therefore, we provide the ‘Structure’ visualization mode so that users can visualize RBP binding sites and the effects of SNPs/SNVs (e.g., ClinVar) on local structure.

In addition to the main usage modes, POSTAR provides three additional usage modes.

(iv) The ‘Variation’ search mode provides information about the effects of SNPs and disease-associated variants on RBP binding sites.

(v) The ‘Functional gene’ search mode provides linkages between binding sites and cell state-associated genes, disease-associated genes, and druggable genes in a table layout. Although the downstream effects of these functional genes are understood relatively well, the manner in which they are regulated at the post-transcriptional level, and the impact of such regulation on their functionality, remains largely unclear. We believe that this mode could provide novel hypotheses regarding the physiological and pathological functions of RBP binding sites.

(vi) Finally, in the ‘Predict’ mode, we provide web-based computational tools for predicting RBP binding sites on given RNA sequences provided by users. FIMO and TESS predict binding sites using the PWMs of 88 human/mouse RBPs collected from literature (Supplementary File 5). DeepBind predicts RBP binding affinity on an RNA sequence (length between 20 nt and 50 nt) using the DeepBind models of 82 human/mouse RBPs (Supplementary File 6).

Example findings using POSTAR search mode

We illustrate some example applications of POSTAR here to demonstrate exploration of multiple RBP binding sites and various post-transcriptional regulatory events in an integrative manner. Assume that users are interested in the post-transcriptional regulation of TP53, an important tumor suppressor gene in humans. Users can go to the ‘POSTAR’ search mode and select all types of RBP binding sites. In the PAR-CLIP Piranha data module, users can simultaneously select multiple regulatory events to obtain the information shown in Figure 3A, including detailed information (e.g., genomic position, conservation score, and genomic context) about each binding site. TP53 contains 414 binding sites for 14 RBPs; for each of these RBP binding sites, users can discover associated RNA modification sites, predicted miRNA binding sites, and ClinVar SNPs. Users can associate RBP binding with other post-transcriptional regulatory events and disease-associated SNPs, obtaining novel insights into biological function and disease mechanisms. As an example, the post-transcriptional regulatory mechanism that underlies Li-Fraumeni syndrome, a condition characterized by early onset of several types of cancer and associated with germline mutations of TP53 (81), remains unexplored. Here, we identified nine Li-Fraumeni syndrome SNPs covered by five binding sites from six RBPs (CPSF6, ELAVL1, IGF2BP1, LIN28B, METTL3 and YTHDF2), which function as splicing factors, 3′ processors, and RNA modification (m6A) ‘writers’ and ‘readers’. Knowledge of these binding sites could help users generate new hypotheses about the molecular mechanisms of Li-Fraumeni syndrome at the post-transcriptional level. Furthermore, users can visualize the binding sites of multiple RBPs and their associated regulatory events via the UCSC genome browser (Figure 3B).
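The kind of association described here, disease SNPs covered by RBP binding sites, amounts to a point-in-interval lookup. The sketch below shows one simple way to do it; the function name, the tuple layouts and the 0-based half-open coordinates are assumptions, and it does not reflect how POSTAR stores or queries its data.

```python
from collections import defaultdict


def snps_in_sites(sites, snps):
    """Map each SNP id to the sorted list of RBPs whose sites cover it.

    sites -- iterable of (chrom, start, end, rbp), 0-based half-open
    snps  -- iterable of (chrom, position, snp_id), 0-based position
    SNPs not covered by any site are omitted from the result.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, rbp in sites:
        by_chrom[chrom].append((start, end, rbp))

    covered = {}
    for chrom, position, snp_id in snps:
        rbps = sorted({rbp for start, end, rbp in by_chrom[chrom]
                       if start <= position < end})
        if rbps:
            covered[snp_id] = rbps
    return covered
```

Because binding sites from different RBPs can overlap, one SNP may map to several RBPs, which is how a small number of sites can account for a larger number of SNP-RBP pairs. For transcriptome-scale data an interval tree or sorted sweep would be preferable to this linear scan.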

Example findings using structure and other usage modes

Users can explore the local RNA secondary structures of RBP binding sites using the ‘Structure’ visualization mode. To extend the example described above, users can further investigate the local secondary structures of RBP binding sites that contain Li-Fraumeni syndrome SNPs (Figure 4A). One Li-Fraumeni syndrome SNP at a site bound by CPSF6 could disrupt the stem in the local structure (Figure 4B). The SNP and binding site are located in the coding sequence region of TP53. CPSF6 is involved in 3′ processing (82), mRNA export (83), and RNA splicing (84). Therefore, our observation suggests that, in addition to encoding protein, the RNA sequence in the coding region of TP53 may influence protein expression via mechanisms that function at the post-transcriptional level. In another example, the ELAVL1 binding site contains a skin cancer-associated GWAS SNP, which is located in the 3′ UTR of TP53. ELAVL1 (also known as HuR) functions in diverse mRNA metabolic processes, including splicing, degradation, and translation (85–87). The SNP in the 3′ UTR of TP53 could destroy the local stem structure of the ELAVL1 binding site (Figure 4C), which might alter the influence of ELAVL1 on TP53. These examples provide novel insight into the putative functions of RBP binding sites on RNA structural domains, which are poorly understood.
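
The reasoning behind a SNP disrupting a stem can be illustrated with a small sketch. It assumes the standard Watson-Crick pairs plus the G-U wobble pair as the set of allowed base pairs, and it converts a plus-strand DNA allele to the base seen on the transcript, because TP53 is transcribed from the minus strand. The helper names `plus_to_transcript_allele` and `disrupts_pair` are hypothetical, and the check looks at a single base pair only; it does not refold the RNA as a structure prediction tool would.

```python
# Complement table for plus-strand DNA alleles.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Assumed set of allowed RNA base pairs: Watson-Crick plus G-U wobble.
CANONICAL_PAIRS = {
    ("G", "C"), ("C", "G"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def plus_to_transcript_allele(base, strand):
    """Convert a plus-strand DNA allele to the RNA base on the transcript.

    For a minus-strand gene the allele is complemented first; T is then
    written as U.
    """
    if strand == "-":
        base = COMPLEMENT[base]
    return "U" if base == "T" else base


def disrupts_pair(ref, alt, partner):
    """True if ref pairs with partner but alt does not."""
    pairs_before = (ref, partner) in CANONICAL_PAIRS
    pairs_after = (alt, partner) in CANONICAL_PAIRS
    return pairs_before and not pairs_after
```

For a plus-strand C-to-T variant in a minus-strand gene, the two alleles convert to G and A on the transcript, and `disrupts_pair("G", "A", "C")` evaluates to `True`: a G-C pair in a stem is lost because A and C do not pair. By contrast, the same G-to-A change opposite a U would replace a wobble pair with an A-U pair and would not be flagged.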

Figure 4.

Local structures of RBP binding sites on TP53. (A) Users can search for RBP binding sites on TP53 that are associated with disease SNPs by searching for the disease name (‘Li-Fraumeni syndrome’ as an example here) in the table on the server. (B) Predicted local secondary structure centered on a CPSF6 binding site. The local secondary structure around a Li-Fraumeni syndrome SNP (from the ClinVar database), which is a G-to-A mutation on the TP53 transcript (minus-strand), is magnified; it disrupts the base pair (G-C pair) in the hairpin's stem. Note that the mutation is annotated by ClinVar as a C-to-T mutation on the plus-strand (boxed in (A)). (C) Another predicted local secondary structure, centered on an ELAVL1 binding site, which contains a GWAS SNP that is an A-to-U mutation.

If a user desires information regarding the putative functions of RBP binding in heart diseases, the user could access the GWAS SNP interface in the ‘Variation’ search mode, choose the ‘eCLIP ENCODE’ data module, and select the phenotype term ‘heart failure’, at which point the database will return six heart failure-associated GWAS SNPs that overlap with 11 validated binding sites for nine RBPs. Notably, one heart failure-associated SNP located in the CDS of STXBP5 lies within binding sites of four different RBPs. STXBP5 has been identified as a novel candidate gene for cardiovascular disease via GWAS (88).

DISCUSSION

We present a comprehensive resource, POSTAR, for easily exploring RBP-target interactions and their putative functions and consequences. POSTAR is the largest collection of RBP binding sites in the human and mouse transcriptomes. We combined a large number of functional data sources to annotate RBP binding sites. POSTAR has a convenient interface, which provides multiple search modes and enables integrated navigation and visualization of RBP binding sites together with various post-transcriptional regulatory events, phenotypes, diseases and other factors. Investigating the relationships between RBP binding sites, regulatory events, phenotypes, and diseases should facilitate the development of novel hypotheses. As mentioned in the STXBP5 example described above, the mechanism underlying the association between genetic variation in STXBP5 and cardiovascular disease is unclear. The search results from POSTAR suggest that binding of several RBPs (e.g. FMR1, FXR2, IGF2BP2 and IGF2BP3) to STXBP5 may be associated with the development of cardiovascular disease.

Establishing the functional roles of genetic variants remains a significant challenge in the post-genomic era. Existing studies have revealed that systematic annotation of cis-regulation and the epigenome can reveal many of the functional consequences of a variant (9,89). However, similar efforts regarding post-transcriptional regulation remain limited; this limitation is partly due to the lack of systematic profiling data on post-transcriptional regulation from current genomic studies, such as the ENCODE project (1) and Roadmap Epigenomics project (2).
In comparison with methods used in previous studies (10) of functional variants involved in post-transcriptional interaction and regulation, our database has several notable advantages: (i) it includes RBP binding sites and genomic variants in mouse, (ii) it provides local RNA secondary structures around genomic variants, (iii) it enables integrated searching and visualization of RBP binding sites with other genomic variants and regulatory events and (iv) it provides more search and usage modes, such as the ‘RBP’ search, ‘Functional gene’ search and ‘Predict’ server. Continued accumulation of multiple data resources related to post-transcriptional regulation will enable us to systematically identify post-transcriptional regulatory networks. For example, integrative analysis of RBP binding and miRNA binding data will allow identification of cooperative and competitive combinatorial patterning of these regulatory factors (90,91). POSTAR represents an early step toward achieving these goals. We believe that POSTAR will facilitate the generation of novel hypotheses regarding the biological functions of RBP binding sites through systematic annotation with post-transcriptional regulatory events, trait/disease-associated variants and functional genes. In the future, we will maintain and update POSTAR to ensure that it remains a useful resource for the research community.

AVAILABILITY

POSTAR is freely available at http://POSTAR.ncrnalab.org (redirected to http://lulab.life.tsinghua.edu.cn/POSTAR). The POSTAR data files can be downloaded and used in accordance with the GNU Public License and the license of their primary data sources.

90 in total

Review 1. Competition between target sites of regulators shapes post-transcriptional gene regulation.

Authors: Marvin Jens; Nikolaus Rajewsky
Journal: Nat Rev Genet Date: 2014-12-09 Impact factor: 53.242

Review 2. A census of human RNA-binding proteins.

Authors: Stefanie Gerstberger; Markus Hafner; Thomas Tuschl
Journal: Nat Rev Genet Date: 2014-11-04 Impact factor: 53.242

3. Site identification in high-throughput RNA-protein interaction data.

Authors: Philip J Uren; Emad Bahrami-Samani; Suzanne C Burns; Mei Qiao; Fedor V Karginov; Emily Hodges; Gregory J Hannon; Jeremy R Sanford; Luiz O F Penalva; Andrew D Smith
Journal: Bioinformatics Date: 2012-09-28 Impact factor: 6.937

Review 4. Functions and regulation of RNA editing by ADAR deaminases.

Authors: Kazuko Nishikura
Journal: Annu Rev Biochem Date: 2010 Impact factor: 23.643

5. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes.

Authors: Janet Piñero; Núria Queralt-Rosinach; Àlex Bravo; Jordi Deu-Pons; Anna Bauer-Mehren; Martin Baron; Ferran Sanz; Laura I Furlong
Journal: Database (Oxford) Date: 2015-04-15 Impact factor: 3.451

6. COSMIC: exploring the world's knowledge of somatic mutations in human cancer.

Authors: Simon A Forbes; David Beare; Prasad Gunasekaran; Kenric Leung; Nidhi Bindal; Harry Boutselakis; Minjie Ding; Sally Bamford; Charlotte Cole; Sari Ward; Chai Yin Kok; Mingming Jia; Tisham De; Jon W Teague; Michael R Stratton; Ultan McDermott; Peter J Campbell
Journal: Nucleic Acids Res Date: 2014-10-29 Impact factor: 16.971

Review 7. Multi-disciplinary methods to define RNA-protein interactions and regulatory networks.

Authors: Manuel Ascano; Stefanie Gerstberger; Thomas Tuschl
Journal: Curr Opin Genet Dev Date: 2013-02-28 Impact factor: 5.578

8. Darned in 2013: inclusion of model organisms and linking with Wikipedia.

Authors: Anmol M Kiran; John J O'Mahony; Komal Sanjeev; Pavel V Baranov
Journal: Nucleic Acids Res Date: 2012-10-15 Impact factor: 16.971

9. Integrative analysis of 111 reference human epigenomes.

Authors: Anshul Kundaje; Wouter Meuleman; Jason Ernst; Misha Bilenky; Angela Yen; Alireza Heravi-Moussavi; Pouya Kheradpour; Zhizhuo Zhang; Jianrong Wang; Michael J Ziller; Viren Amin; John W Whitaker; Matthew D Schultz; Lucas D Ward; Abhishek Sarkar; Gerald Quon; Richard S Sandstrom; Matthew L Eaton; Yi-Chieh Wu; Andreas R Pfenning; Xinchen Wang; Melina Claussnitzer; Yaping Liu; Cristian Coarfa; R Alan Harris; Noam Shoresh; Charles B Epstein; Elizabeta Gjoneska; Danny Leung; Wei Xie; R David Hawkins; Ryan Lister; Chibo Hong; Philippe Gascard; Andrew J Mungall; Richard Moore; Eric Chuah; Angela Tam; Theresa K Canfield; R Scott Hansen; Rajinder Kaul; Peter J Sabo; Mukul S Bansal; Annaick Carles; Jesse R Dixon; Kai-How Farh; Soheil Feizi; Rosa Karlic; Ah-Ram Kim; Ashwinikumar Kulkarni; Daofeng Li; Rebecca Lowdon; GiNell Elliott; Tim R Mercer; Shane J Neph; Vitor Onuchic; Paz Polak; Nisha Rajagopal; Pradipta Ray; Richard C Sallari; Kyle T Siebenthall; Nicholas A Sinnott-Armstrong; Michael Stevens; Robert E Thurman; Jie Wu; Bo Zhang; Xin Zhou; Arthur E Beaudet; Laurie A Boyer; Philip L De Jager; Peggy J Farnham; Susan J Fisher; David Haussler; Steven J M Jones; Wei Li; Marco A Marra; Michael T McManus; Shamil Sunyaev; James A Thomson; Thea D Tlsty; Li-Huei Tsai; Wei Wang; Robert A Waterland; Michael Q Zhang; Lisa H Chadwick; Bradley E Bernstein; Joseph F Costello; Joseph R Ecker; Martin Hirst; Alexander Meissner; Aleksandar Milosavljevic; Bing Ren; John A Stamatoyannopoulos; Ting Wang; Manolis Kellis
Journal: Nature Date: 2015-02-19 Impact factor: 69.504

10. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data.

Authors: Jun-Hao Li; Shun Liu; Hui Zhou; Liang-Hu Qu; Jian-Hua Yang
Journal: Nucleic Acids Res Date: 2013-12-01 Impact factor: 16.971
