Recent computational developments on CLIP-seq data analysis and microRNA targeting implications.

Silvia Bottini¹, David Pratella¹, Valerie Grandjean¹, Emanuela Repetto¹, Michele Trabucchi¹.

Abstract

Cross-Linking Immunoprecipitation coupled with high-throughput sequencing (CLIP-seq) is a technique used to identify RNAs directly bound to RNA-binding proteins across the entire transcriptome in cell or tissue samples. Recent technological and computational advances permit the analysis of many CLIP-seq samples simultaneously, allowing us to reveal the comprehensive network of RNA–protein interactions and to integrate it with other genome-wide analyses. Therefore, the design and quality management of CLIP-seq analyses are of critical importance for extracting clean and biologically meaningful information from CLIP-seq experiments. The application of the CLIP-seq technique to the Argonaute 2 (Ago2) protein, the main component of the microRNA (miRNA)-induced silencing complex, reveals the direct binding sites of miRNAs, thus providing insight into the role played by miRNA(s). In this review, we summarize and discuss the most recent computational methods for CLIP-seq analysis, and discuss their impact on Ago2/miRNA-binding site identification and prediction with regard to human pathologies.

Substances: MicroRNAs

Year: 2018 PMID: 28605404 PMCID: PMC6291801 DOI: 10.1093/bib/bbx063

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

Introduction

RNA-binding proteins (RBPs) bind RNAs to regulate their fate, function, localization or secondary structure [1] to ultimately modulate many biological processes including cell apoptosis, growth, fate and differentiation [2-4]. RBPs possess a modular structure composed of at least one domain that directly binds either single- or double-stranded RNA, such as the RNA recognition motif, the zinc-finger domain, the KH domain and the double-stranded RNA-binding domain [5]. RNA-binding domains primarily recognize the RNA sequence, the RNA shape or both [6]. Because of the high versatility of the RNA-binding domains and the presence of some of them in each RBP [7], fully understanding the RBP mode of binding is a challenging quest. To comprehensively uncover the RNA–protein interaction network in a genome-wide manner, Cross-Linking ImmunoPrecipitation coupled with high-throughput sequencing (CLIP-seq) was recently developed [8]. CLIP-seq analysis has since become one of the mainstream methods to study RNA metabolism and has led to important discoveries in different fields of molecular and cellular biology [9-14]. Of particular interest for the community of RNA biologists and beyond is the application of CLIP-seq methods to the Argonaute 2 (Ago2) protein, which together with microRNAs (miRNAs) forms the miRNA-induced silencing complex (miRISC) [15]. miRISC targets mRNAs through partial sequence complementarity between the miRNA and the target sequence, promoting degradation or translation inhibition of the target mRNA [16]. miRNAs regulate many biological processes, including cell proliferation, differentiation and death in both physiological and pathological events [16]. Because CLIP-seq analysis of Ago2 is meant to identify the precise binding sites of miRNAs [9, 12, 17] rather than those of the protein itself, the downstream and validation steps of the analysis differ from those performed for other RBPs: the binding features of miRNAs follow different rules from those of RBPs. 
On the other hand, if we consider each miRNA loaded into Ago2 as a different RBP [18], Ago2 CLIP-seq may be taken as a universal example of the mode of binding to cognate RNAs. The CLIP-seq protocol has many steps involving sample preparation, sequencing and bioinformatics analysis. Briefly, RNA from ultraviolet cross-linked cells or tissue lysates is partially digested enzymatically into small fragments of about 50–100 nucleotides (nts), the RBP of interest is immunoprecipitated and the RBP-RNA complex is isolated by sodium dodecyl sulfate polyacrylamide gel electrophoresis. Afterward, RNA from the RBP-RNA complex is recovered by acid phenol/chloroform extraction, RNA adapters are ligated, and the RNA template is reverse transcribed and finally sequenced at high throughput. Several protocol variants for sample preparation have been proposed, the most popular being HIgh-Throughput Sequencing of RNA isolated by CLIP (HITS-CLIP) [9, 10], PhotoActivable-Ribonucleoside-enhanced-CLIP (PAR-CLIP) [11-13] and individual-nucleotide resolution CLIP (iCLIP) [19]. Detailed differences among these three variants and other more recent protocols have been previously discussed by others [20, 21]. An additional variant of CLIP-seq, named CLASH, has been developed by Helwak et al. [22], and it is particularly appropriate for Ago2 CLIP-seq. Briefly, CLASH allows high-throughput mapping of small RNA::RNA-binding interactions by adding an intramolecular RNA ligation step during the sample preparation. The CLASH approach and analysis have been previously reviewed by Broughton and Pasquinelli [23]. After high-throughput sequencing, the bioinformatics analysis workflow starts with preprocessing, which filters out low-quality and duplicate reads, followed by mapping of the remaining reads onto the reference genome or transcriptome. Afterward, to distinguish real signal from the noise background, the reads are processed by peak-calling programs. 
Called peaks are further analyzed for functional, structural and biochemical characterization of the RNA–protein interaction, including motif discovery, expression profiling and gene ontology. Importantly, because recent advances in sequencing technologies and bioinformatics analyses enable us to handle many CLIP-seq samples simultaneously, it is important to optimize a bioinformatic pipeline that can help researchers obtain unbiased and high-quality data. Despite great efforts from researchers to streamline CLIP-seq analysis, much remains to be improved in both experimental (e.g. the quality of the antibody used for the immunoprecipitation) and computational procedures. In fact, because of the complexity of the large data sets coming from high-throughput technologies such as CLIP-seq, a correct interpretation of the data requires proper and accurate data analysis with refined and optimized computational tools. In this review, we describe the computational protocol for CLIP-seq analysis, discuss the latest bioinformatics developments for data processing and mining and provide advice on interpreting the results of the analysis. We discuss the validity and the limitations of emerging programs for each step of the CLIP-seq analysis and the quality measurements currently available for specific tasks, providing concrete examples from a case study: an in-house Ago2 HITS-CLIP data set generated from stem cells [24] (GEO accession: GSE85219). For simplicity and because of limited space, we restrict the scope of this review to providing a computational guideline for the bioinformatics analysis of the three main variants of CLIP-seq, namely, HITS-CLIP, PAR-CLIP and iCLIP. To help the readers, we provide in Supplementary Table S1 the Web links to download all the programs cited in the present review. 
Finally, we discuss how Ago2 CLIP-seq analyses have improved the miRNA-binding site prediction and the understanding of miRNA function in human pathologies.

Bioinformatics workflow for CLIP-seq analysis

In Figure 1, we have summarized the main computational steps for CLIP-seq analysis. Recently, a few computational pipelines have been developed, such as CIMS [25], CLIPSeqTools [26], CLIPZ [27] and PARCLIPsuite [11], which provide useful resources to deal with the preprocessing steps and some of the main steps of the analysis, including the peak-calling procedure. In the following sections, we describe the bioinformatics workflow of the CLIP-seq analysis step by step, giving a qualitative overview of existing software and providing practical examples on an in-house data set for Ago2 HITS-CLIP experiments on P19 stem cells [24].

Figure 1

Main steps of the bioinformatics workflow to analyze CLIP-seq data with the main software or pipeline to use.

Preprocessing and read mapping onto the reference genome

The first step of the analysis is the preprocessing, which involves adapter removal, filtering of raw data according to read quality scores and collapsing of reads with identical sequences. While specific programs such as cutadapt [28] or Trimmomatic [29] have been developed for adapter removal, bioinformaticians usually write ad hoc scripts for quality filtering. However, a few programs have lately been developed for this purpose, such as FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and PRINSEQ [30]. To quantify the differences between the strategies used by different preprocessing programs to filter reads based on quality scores and collapse duplicated reads, we applied FASTX-Toolkit, PRINSEQ and the CIMS pipeline to a published in-house Ago2 CLIP-seq data set from mouse P19 stem cells [24] (GEO accession: GSE85219). This analysis was run with the default or the recommended parameters (see Supplemental Information). While some programs have tunable parameters, we did not optimize them, although this might have improved the results for some data sets, as this task may be beyond the expertise of most users. The highest number of reads surviving the preprocessing step in all three replicates was obtained using PRINSEQ (Table 1). To inspect the quality of the reads after the preprocessing, we used the program FastQC (developed by S. Andrews; http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). As shown in Supplementary Figure S1, although FASTX-Toolkit yielded reads with the best quality scores per sequence and per base, this program filtered out many more reads than the other two programs (Table 1). These data indicate that FASTX-Toolkit selects high-quality reads stringently. Altogether, PRINSEQ showed the best balance between quality and number of reads, suggesting that its strategy is better suited to CLIP-seq data preprocessing than those of the other two programs. 
However, the balance between stringency and sensitivity can be tuned by changing the parameters of the programs to meet the needs of the researcher, such as the minimal quality or the minimal length required for a read to be selected. Finally, the three methods achieved similar results in collapsing duplicates, which removes the redundant fraction of the data. High redundancy may be a consequence of low library complexity, which often occurs when samples are prepared from small amounts of starting material or when too many polymerase chain reaction cycles are performed during library construction.
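The two operations discussed above, quality filtering and duplicate collapsing, can be illustrated with a minimal Python sketch. It is not the implementation of FASTX-Toolkit, PRINSEQ or CIMS: the function names, the Phred+33 encoding and the thresholds (`min_mean_quality`, `min_length`) are assumptions chosen for illustration only.

```python
from collections import Counter


def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def preprocess(records, min_mean_quality=20, min_length=15):
    """Filter (sequence, quality) pairs, then collapse identical sequences.

    Returns a Counter mapping each surviving sequence to its copy number.
    The thresholds are illustrative, not recommended values.
    """
    kept = Counter()
    for seq, qual in records:
        if len(seq) < min_length:
            continue  # read too short to be mapped reliably
        if mean_phred(qual) < min_mean_quality:
            continue  # read fails the quality filter
        kept[seq] += 1  # identical sequences collapse onto one key
    return kept


records = [
    ("ACGTACGTACGTACGTACGT", "I" * 20),  # high quality (Phred 40)
    ("ACGTACGTACGTACGTACGT", "I" * 20),  # exact duplicate of the first read
    ("TTTTGGGGCCCCAAAATTTT", "#" * 20),  # low quality (Phred 2)
    ("ACGTACGT", "I" * 8),               # shorter than min_length
]
collapsed = preprocess(records)
print(collapsed)
```

Raising `min_mean_quality` or `min_length` makes the filter more stringent at the cost of discarding more reads, which is the trade-off observed between the three programs in Table 1.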

Table 1

Number of reads obtained using different preprocessing and mapping tools on the in-house Ago2 HITS-CLIP data set from P19 stem cells

# of reads before preprocessing	Preprocessing program	# of reads after preprocessing	Mapping tool	Unique mapping, # of reads (%)	Multiple mapping, # of reads (%)	No mapping, # of reads (%)
Replicate 1
38 865 698	PRINSEQ	3 783 764	Novoalign	2 120 447 (56.0)	1 300 950 (34.4)	34 565 (0.9)
	PRINSEQ	3 783 764	STAR	1 142 926 (30.21)	1 579 235 (41.74)	1 061 603 (28.05)
	FASTX-Toolkit	933 985	Novoalign	596 192 (63.8)	315 845 (33.8)	7 004 (0.7)
	FASTX-Toolkit	933 985	STAR	372 062 (39.84)	472 457 (50.59)	89 466 (9.58)
	CIMS	2 467 666	Novoalign	1 355 710 (54.9)	784 179 (31.8)	25 516 (1.0)
	CIMS	2 467 666	STAR	787 349 (31.91)	1 297 371 (52.57)	382 946 (15.51)
Replicate 2
34 094 384	PRINSEQ	3 924 075	Novoalign	2 196 430 (56.0)	1 321 658 (33.7)	34 607 (0.9)
	PRINSEQ	3 924 075	STAR	1 311 574 (33.42)	2 036 793 (51.91)	575 708 (14.67)
	FASTX-Toolkit	957 572	Novoalign	634 824 (65.1)	316 995 (32.1)	6 529 (0.7)
	FASTX-Toolkit	957 572	STAR	423 300 (43.39)	461 865 (47.34)	72 407 (7.42)
	CIMS	2 584 470	Novoalign	1 425 553 (55.2)	801 740 (31.0)	25 557 (1.0)
	CIMS	2 584 470	STAR	871 764 (33.73)	1 314 533 (50.86)	398 173 (15.41)
Replicate 3
32 904 107	PRINSEQ	3 668 439	Novoalign	2 001 844 (54.6)	1 264 702 (34.5)	35 261 (1.0)
	PRINSEQ	3 668 439	STAR	1 152 648 (31.42)	1 956 392 (53.33)	559 399 (15.25)
	FASTX-Toolkit	888 186	Novoalign	562 070 (63.3)	300 647 (33.8)	6 308 (0.7)
	FASTX-Toolkit	888 186	STAR	358 485 (40.36)	442 132 (49.78)	87 569 (9.86)
	CIMS	2 359 084	Novoalign	1 264 546 (53.6)	750 678 (31.8)	25 402 (1.1)
	CIMS	2 359 084	STAR	740 554 (31.39)	1 242 835 (52.68)	375 695 (15.92)

Reads that survive the preprocessing steps are mapped onto the reference sequences, which can be the complete genome, the transcriptome or sequences belonging to specific categories, such as 3ʹ untranslated regions (UTRs), long noncoding RNAs, small RNAs, etc. The most common algorithms used to perform this task are Novoalign (http://www.novocraft.com/products/novoalign/), STAR [31], Bowtie [32], RMAP [33], TopHat [34], Gsnap [35], SOAP [36] and BWA [37]. A few precautions have to be taken when setting the parameters of these programs to perform a gapped alignment onto the genome. In fact, these parameters make it possible to map reads that include deletions, mutations or insertions caused by enzymatic errors occurring during sample preparation. Depending on the sample, it might be important to use adequate parameters to map reads across exon junctions. In addition, an important issue concerns the handling of multiple mapped reads (reads that map to many loci). Although allowing for multiple read mapping would increase the number of usable reads and the sensitivity of peak detection, it may also increase the number of false-positive peaks, as was also suggested for ChIP-seq analysis [38]. To provide guidelines about the program to use, we ran two popular programs that achieved the best performances on RNA-seq data [39], namely, Novoalign and STAR, on the in-house Ago2 CLIP-seq data set from P19 stem cells after preprocessing with the three aforementioned programs. For this analysis, we set the recommended parameters for Novoalign and used similar ones for STAR (Supplemental Information). As shown in Table 1, regardless of the preprocessing program used, Novoalign uniquely mapped more reads than STAR, with a unique-mapping rate 20 to 30 percentage points higher. 
To test whether there is a qualitative difference in the genomic location of the reads mapped by Novoalign and STAR, we divided the genome into bins of 100 nucleotides and counted the number of bins to which reads were mapped by each of the two programs. As shown in Supplementary Table S2, we found a relatively small number of bins in common between the two programs. These data indicate that Novoalign is able to map reads to about 10% more genomic locations than STAR. Therefore, we concluded that both quantitative and qualitative differences in mapping performance exist between Novoalign and STAR. Given that higher numbers of correctly and uniquely mapped reads would help the peak-calling step to discriminate real peaks from the background, we would recommend using Novoalign. However, the full version of Novoalign is not freely downloadable, and the computational analysis can take much longer with the incomplete free version than with STAR, which is freely downloadable. An alternative strategy is to map the reads with one program and afterward run the unmapped reads through a second program. Because this strategy has not been benchmarked yet, it is unclear whether it is really advantageous. Finally, the mappability of multiple mapped reads can be addressed by tuning the minimal read length or by using paired-end sequencing [38].
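The bin-based comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used for Supplementary Table S2: the function names are hypothetical, alignments are assumed to be given as (chromosome, start, end) tuples in 0-based half-open coordinates, and a read is assumed to occupy every bin it overlaps.

```python
def occupied_bins(alignments, bin_size=100):
    """Return the set of (chromosome, bin index) pairs overlapped by any read."""
    bins = set()
    for chrom, start, end in alignments:
        first = start // bin_size
        last = (end - 1) // bin_size  # end is exclusive
        for index in range(first, last + 1):
            bins.add((chrom, index))
    return bins


def compare_bins(alignments_a, alignments_b, bin_size=100):
    """Count bins occupied by only one aligner or shared by both."""
    bins_a = occupied_bins(alignments_a, bin_size)
    bins_b = occupied_bins(alignments_b, bin_size)
    return {
        "only_a": len(bins_a - bins_b),
        "only_b": len(bins_b - bins_a),
        "shared": len(bins_a & bins_b),
    }


# Toy alignments standing in for the output of two different mappers
aligner_a = [("chr1", 10, 60), ("chr1", 180, 230), ("chr2", 500, 540)]
aligner_b = [("chr1", 20, 70), ("chr2", 900, 950)]
summary = compare_bins(aligner_a, aligner_b)
print(summary)
```

In practice the alignment tuples would be extracted from the BAM or SAM files produced by each mapper; that parsing step is deliberately left out of the sketch.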

Peak calling

Peak calling is a central step of the analysis: it separates specific signal from the noise background to identify real binding sites. The number of identified peaks increases with the sequencing depth because weaker sites become statistically significant with a greater number of reads [40]. However, the optimal sequencing depth can only be evaluated experimentally, as it depends on the noise background of the antibody [38]. Different programs use diverse peak-calling methods. The most common strategy is to analyze distribution profiles to find clusters of reads that belong to the same peak. This strategy is used by different programs, including PIPE-CLIP [41], Pyicoclip [42], Piranha [43] and CLIPper [44], for all CLIP-seq protocol variants, and wavClusteR [45] and PARalyzer [46] for PAR-CLIP data only. PIPE-CLIP and Pyicoclip group the reads based on positional overlap, while Piranha, MiCLIP [47] and HITS-CLIP data analyses (http://qbrc.swmed.edu/softwares.html) bin the genome into portions of fixed size. On the other hand, CIMS focuses on the identification of read clusters containing mutated cross-linked nucleotides [25]. To discriminate enriched read clusters from the background, the peak-calling programs use different statistical models. For instance, PIPE-CLIP and Piranha use zero-truncated negative binomial likelihoods, and also include additional covariates to refine the peak detection, such as cross-link-dependent mutations or transcript abundance. On the other hand, Pyicoclip performs a background estimation implementing the modified false discovery rate procedure to determine which clusters of reads are significantly enriched in a list of genomic regions, by randomly placing the same number of reads within the region and iterating the process many times. MiCLIP [47] and HITS-CLIP data analyses use hidden Markov models (HMMs) to model the spatial dependency of the reads that map in the cluster. 
Finally, CIMS assesses statistical significance using a permutation-based model. Specific statistical models are used for PAR-CLIP analysis. For instance, PARalyzer uses a nonparametric kernel-density estimate classifier to identify RNA-protein interaction sites using the T to C conversion rate and read density, while wavClusteR uses a two-step algorithm consisting of a nonparametric two-component mixture model and a wavelet-based procedure. Importantly, different methods can give different results; thus, to help researchers choose the right program for their peak-calling analysis, we recently performed a comprehensive, quantitative and qualitative comparative evaluation of four publicly available programs for the HITS-CLIP peak-calling step, namely CIMS, PIPE-CLIP, Piranha and Pyicoclip, on four published Ago2 HITS-CLIP data sets [9, 48, 49] and one in-house Ago2 HITS-CLIP data set generated from P19 stem cells [24]. Running the programs with default parameters, we found that Pyicoclip outperformed the other programs in terms of sensitivity, positional accuracy, agreement with the TargetScan miRNA-binding site prediction program, specificity and consistency in finding the same results on different data sets from the same tissue [24]. Nevertheless, depending on the biological question and sample conditions, scientists may need to tune the parameters of the different peak-calling programs, such as the P-value and the minimal number of reads required to select significant peaks, to find the best set for this task. Alternatively, we suggested ranking the detected peaks according to different statistics, such as number of reads, fold enrichment over the background or P-value, and applying arbitrary thresholds according to the desired stringency [24]. 
Finally, although not always possible, the addition of control conditions, such as knockout or knockdown, or stimulation versus nonstimulation, would ideally improve the accuracy and the quality of the peak-calling results significantly [50, 51]. Accordingly, a small number of programs have been developed to analyze differential CLIP-seq experiments. These programs include the HMM-based model dCLIP [52]; Piranha, which uses different statistics to model the read distribution and allows the comparison of two conditions through the addition of covariates; Pyicoenrich [42], which uses a strategy based on the MA plot as in expression profile analysis; and PARma [53], which was specifically designed for PAR-CLIP data and uses a probabilistic model for the identification of differential peaks.
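To make the two ingredients of overlap-based peak calling concrete, the following Python sketch first groups reads by positional overlap and then estimates an empirical P-value for each cluster by randomly placing the same reads within the region many times. It is a simplified illustration under stated assumptions, not the algorithm of Pyicoclip or of any other program cited above: the function names, the use of the largest random cluster as null statistic, and the values of `n_iter` and `seed` are all choices made for this example.

```python
import random


def cluster_reads(reads):
    """Group overlapping (start, end) intervals into clusters.

    Returns a list of (cluster_start, cluster_end, read_count) tuples.
    """
    clusters = []
    for start, end in sorted(reads):
        if clusters and start < clusters[-1][1]:
            last_start, last_end, count = clusters[-1]
            clusters[-1] = (last_start, max(last_end, end), count + 1)
        else:
            clusters.append((start, end, 1))
    return clusters


def empirical_p_values(reads, region_length, n_iter=1000, seed=0):
    """Empirical P-value of each cluster under random read placement.

    Null statistic (an assumption of this sketch): the size of the largest
    cluster obtained when the same reads are placed uniformly at random.
    """
    rng = random.Random(seed)
    lengths = [end - start for start, end in reads]
    null_max = []
    for _ in range(n_iter):
        placed = []
        for length in lengths:
            start = rng.randint(0, region_length - length)
            placed.append((start, start + length))
        null_max.append(max(count for _, _, count in cluster_reads(placed)))
    results = []
    for start, end, count in cluster_reads(reads):
        exceed = sum(1 for size in null_max if size >= count)
        results.append((start, end, count, (exceed + 1) / (n_iter + 1)))
    return results


# Eight piled-up reads and two isolated reads in a 10 000-nt region
reads = [(1000 + i, 1030 + i) for i in range(8)] + [(5000, 5030), (8000, 8030)]
peaks = empirical_p_values(reads, region_length=10_000)
for peak in peaks:
    print(peak)
```

Ranking the resulting clusters by read count or by empirical P-value, and then applying a threshold of the desired stringency, mirrors the ranking strategy suggested in the text.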

Motif discovery and other features

Following the peak calling, the analysis mainly focuses on the characterization of the RBP-RNA interactions, especially looking for possible binding sequence signature(s), using either a candidate screening or a de novo motif identification. For the candidate screening approach, programs like FIMO [54] can be used to screen peak sequences for known RNA-binding motifs, such as those from the database of Ray et al. [55]. If the user is looking for unknown RNA-binding motifs, a de novo motif identification can be performed. For this task, two main parameters should be calibrated before launching the analysis. The first parameter is the nucleotide length of the motif. Mitchell and Parker [56] recently showed that different RNA-binding domains bind on average a precise number of nucleotides. Thus, the length of the motif can be set according to the domain composition of the protein. The second parameter to take into account is the so-called ‘background sequences’, which serve as a negative set that is not expected to contain the enriched motif sequence(s). Two main strategies can be applied to select the appropriate background sequences: (i) to randomly scramble the CLIP-seq peak sequences; (ii) to define a set of sequences not bound by the protein of interest [57]. In the first strategy, a few constraints can be imposed on the background sequences, such as preserving the dinucleotide frequency to avoid underestimating false-positive rates in RNA prediction. For the second strategy, instead, it is recommended that the background sequences have the same size, dinucleotide frequency and GC content as the target sequences used to perform the analysis. Besides these general options, each program has its own parameters, which may depend on the statistical/mathematical model used by the algorithm, including the distribution to model motif sites (i.e. 
the number of motifs expected per sequence) and the threshold/score associated with the statistical model (i.e. P-value, enrichment score and probability). If successful, the final output of the de novo motif discovery analysis is a list of sequences enriched in the CLIP-seq data set. After motif clustering [58], these lists of enriched sequences can be represented as position weight matrices or position-specific affinity matrices that visualize the different affinity of the RBP for each sequence. Among the most popular programs for de novo motif discovery, we can cite MEME [59] and Homer [60]. Other programs that may be used are MatrixREDUCE [61], GLAM2 [62] and cERMIT [63]. In addition to the sequence motifs, some programs take into account other sequence parameters, including secondary structure prediction, thereby discovering a structural motif that combines sequence composition with secondary structure. For such analyses, one may consider using Zagros [64], RNAcontext [65], MEMEris [66], PhyloGibbs [67] and CapR [68]. The identification of the RBP-binding motif(s) permits the prediction of RBP binding in different transcriptome analyses. Because Ago2 binds to miRNAs that determine the sequence specificity of the binding to target RNAs, researchers often analyze Ago2 CLIP-seq peaks with prediction programs for miRNA-binding sites, including TargetScan, PITA and miRanda. However, such an approach can lead to high rates of false-positive and false-negative targets [69, 70]. Moreover, these programs only predict canonical miRNA-binding sites, which are defined by a perfect complementarity match between the miRNA seed sequence (between the second and eighth nts of the miRNA sequence) and the 3ʹ UTR [16], or seed-like motifs allowing one mismatch or a 1-nt bulge in the miRNA seed sequence [22, 71]. Lately, a few programs have been developed to search for binding sites of highly expressed miRNAs in Ago2 CLIP-seq peaks. 
These programs mainly look for canonical or seed-like binding sites; examples are miRTarCLIP, which handles all CLIP-seq techniques but is limited to 3ʹ UTRs [72], and microMUMMIE [73] and mEAT [46], which are limited to PAR-CLIP data sets. A similar approach was also adopted by Clark et al. [74]; however, only the precomputed results for miRNAs expressed in 34 CLIP-seq data sets are available, not the program. Finally, we have recently developed a novel method, called miRBShunter, that uses de novo motif search for an unbiased identification of miRNA-binding sites from Ago2 CLIP-seq data sets [24]. By searching for de novo motifs, miRBShunter identifies potential miRNA::RNA heteroduplexes for both canonical and noncanonical miRNA-binding sites, the latter involving portions of the miRNA sequence outside the seed or seed-like binding. Potential miRNA::RNA heteroduplexes are then ranked according to a heteroduplex score, which takes into account the following parameters of the heteroduplex: (i) free energy, (ii) the number of paired nucleotides, (iii) the number of paired nucleotides in the motif found, (iv) the number of paired nucleotides in the seed region and (v) the number of bulge nucleotides in the seed.
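The canonical seed match defined above (perfect complementarity to nts 2 to 8 of the miRNA) is simple enough to scan for directly. The following Python sketch illustrates that definition only; it is not the algorithm of TargetScan, miRBShunter or any other cited program, the function names are hypothetical, the peak sequence is a toy example, and the let-7a sequence is quoted from memory of the public miRNA annotation and should be checked against miRBase before any real use.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Reverse complement of an RNA sequence (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_site(mirna):
    """Target-side sequence that pairs perfectly with miRNA nts 2-8."""
    seed = mirna[1:8]  # 0-based slice covering positions 2 to 8
    return reverse_complement(seed)


def find_seed_matches(mirna, target):
    """Return 0-based start positions of perfect seed matches in a target."""
    target = target.upper().replace("T", "U")
    site = seed_site(mirna)
    positions = []
    start = target.find(site)
    while start != -1:
        positions.append(start)
        start = target.find(site, start + 1)  # allow overlapping matches
    return positions


let7a = "UGAGGUAGUAGGUUGUAUAGUU"          # assumed hsa-let-7a-5p sequence
peak = "AAGCUACCUCAGGGACUACCUCUU"         # toy peak sequence with two sites
print(seed_site(let7a))                   # CUACCUC
print(find_seed_matches(let7a, peak))     # [3, 15]
```

Seed-like sites with one mismatch or a 1-nt bulge, and noncanonical sites outside the seed, are not detected by this exact-match scan, which is precisely the limitation discussed in the text.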

Downstream analysis

The last step of CLIP-seq analysis involves the functional characterization of the identified target RNAs to provide clues about the molecular function of the RBP(s) or the miRNA(s) of interest. Many programs/databases address this task, including GoTermFinder [75] and topGO [76] to perform GO term enrichment; GeneMania [77] to predict the function of a set of genes; STRING [78] and Cytoscape [79] to predict and visualize protein interaction networks; GSEA [80] to determine whether a set of genes shows similar expression differences between two biological conditions; and RAIN, which integrates noncoding RNAs and protein–protein interaction networks [81]. Furthermore, the results from CLIP-seq analysis can be coupled with data sets from other genome-wide technologies, including RNA expression profiling or alternative splicing analysis. Although not always routinely updated, many resources have been developed for the functional analysis of RBPs and miRNAs. For instance, miRonTop is an online Java Web tool that integrates DNA microarray or high-throughput sequencing data to identify potential miRNA target mRNAs by complementarity between the seed and the 3ʹ UTR sequences. The list of potential miRNA targets can be used to assess specific biological functions of miRNAs by performing Gene Ontology enrichment [82]. DIANA-mirExTra performs a combined differential expression analysis of mRNAs and miRNAs to uncover miRNAs and transcription factors that play regulatory roles between two conditions [83]. Finally, miRGator is a portal collecting high-throughput sequencing miRNA data integrated with target expression profiles [84]. This portal includes 73 deep-sequencing data sets on human samples from Gene Expression Omnibus [85], the Short Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra/) and The Cancer Genome Atlas archives (http://cancergenome.nih.gov/), as well as several supporting programs. 
Among those programs, we mention miR-seq browser that provides short-read alignment with the predicted secondary structure of transcripts, read count and different features to study iso-miRs and miRNA posttranscriptional modifications.
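The statistical core of a GO term enrichment analysis can be illustrated with a one-sided hypergeometric test. The Python sketch below is a generic textbook formulation, not the specific procedure implemented in GoTermFinder or topGO (which add, for example, multiple-testing correction and handling of the ontology graph); the function name and the toy numbers are assumptions.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_targets, n_overlap):
    """One-sided hypergeometric P-value, P(X >= n_overlap).

    n_background: genes in the background set (e.g. all expressed genes)
    n_annotated:  background genes annotated with the GO term
    n_targets:    CLIP-seq target genes drawn from the background
    n_overlap:    target genes annotated with the GO term
    """
    upper = min(n_annotated, n_targets)
    total = comb(n_background, n_targets)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_targets - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 20 background genes, 5 carry the term, 4 targets, 3 of them annotated
p_value = hypergeom_enrichment_p(20, 5, 4, 3)
print(round(p_value, 4))
```

Because one such test is run per GO term, the resulting P-values would normally be corrected for multiple testing before interpretation.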

Validation of CLIP-seq analysis

CLIP-seq experiments can be validated with different techniques, using either candidate or genome-wide approaches [86]. For practical reasons, the candidate approach is feasible only for a limited number of targets, usually the top-scoring targets from statistical significance tests or those identified by machine learning algorithms, or for a subset of targets with a particular biological relevance. The candidate approach can be used to validate either the direct interaction between the protein of interest and the target RNA(s) or the function played by the protein (or miRNA) on the target RNA(s). The interaction can be validated by a plethora of wet laboratory techniques, such as the in vitro electrophoretic mobility shift assay or RNA immunoprecipitation followed by either northern blotting or reverse transcriptase quantitative polymerase chain reaction (RT-qPCR) from cell or tissue extracts. Functional validation may include knockdown/knockout or overexpression of the protein (or miRNA) of interest in cells or tissues, followed by RT-qPCR or northern blotting to check the expression levels of the target RNA(s), or by a minigene assay to check alternative splicing events. The same functional validation can also be performed with genome-wide assays, which may include RNA-seq or microarray experiments followed by an appropriate data analysis that depends on the function investigated. While the latter approach is more accurate and comprehensive, the cost can be higher.

Implication for miRNA-binding prediction and human pathogenesis

miRNAs are small noncoding RNAs of about 22 nts that associate with Ago2 to bind RNA for degradation and/or translation block [16]. About 1000 miRNAs, which regulate many biological processes during physiopathological events and development [3, 4], have been experimentally validated in humans [87]. In this part of the review, we discuss the latest developments in the field of miRNA target prediction and mode of action in human pathologies made possible by Ago2 CLIP-seq analysis.

Ago2 CLIP-seq data ameliorate miRNA target prediction

To date, the main miRNA target prediction programs take into account the following miRNA::RNA interaction features: (i) occurrence of a perfect complementarity match between the miRNA seed sequence and target mRNAs, (ii) sequence conservation of the target sequence across species, (iii) the free energy of the miRNA::RNA heteroduplex and (iv) the target site accessibility [88-91]. Even considering all these features, the rates of false positives and false negatives are still high [70], indicating that more is needed to better predict miRNA target sequences. The availability of >15 Ago2 CLIP-seq analyses performed in several cells or tissues and deposited in the Starbase database [92, 93] provides an important resource for a genome-wide investigation of the miRNA targeting features. Based on these studies, a second generation of miRNA target prediction programs has been developed. Although the second-generation programs seem to perform better than the first-generation ones, a major limitation for both is the lack of an accurate list of bona fide miRNA-binding sites with which to calculate true- and false-positive rates. Here, we briefly describe the recent improvements of the prediction programs for miRNA-binding sites, based on CLIP-seq data (Table 2). Recently developed second-generation programs propose new models/parameters for the implementation of new algorithms. For instance, TargetSpy was the first to use Ago2 CLIP-seq data to train a machine learning algorithm [100]. MIRZA develops a biophysical model through the parametrization of miRNA::mRNA target alignments and the free energy of the binding, optimized using CLIP-seq data [101]. STarMir implements logistic prediction models based on thermodynamic parameters of the miRNA::RNA heteroduplexes and the secondary structure features of the target mRNAs from CLIP-seq analyses [103]. 
miRTar2GO is a rule-based machine learning approach to predict cell type-specific miRNA target mRNAs, which are ranked using validated binding sites from luciferase assays or Ago CLASH data sets [105]. Lu and Leslie [106] developed the program chimiRic, which uses a discriminative machine learning approach on Ago2 CLIP-seq and CLASH data to train a novel miRNA target prediction model. On the other hand, some of the first-generation miRNA prediction programs have been refined and updated thanks to Ago2 CLIP-seq analyses. For example, DIANA-micro-T-CDS [102] is an extension of the first-generation algorithm DIANA-micro-T [107] that uses a machine learning approach to identify the most relevant features of miRNA targeting from CLIP-seq data sets. Finally, the latest version of miRDB contains miRNA target predictions based on an updated version of the MirTarget computational model, which includes CLIP ligation (cross-linking and immunoprecipitation followed by RNA ligation) data in the training data set [104].
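The true- and false-positive rates mentioned above can only be computed once a reference list of binding sites is fixed. As a minimal sketch, assuming that such a reference list is available (for example, sites supported by CLIP-seq peaks, which is itself an imperfect standard), the rates for one prediction program could be tallied as follows; the function name and the site identifiers are hypothetical.

```python
def confusion_rates(candidates, predicted, reference):
    """Compare predicted sites with a reference set over a common candidate pool.

    candidates: all sites that could have been predicted
    predicted:  sites reported by the prediction program
    reference:  sites taken as true (e.g. CLIP-supported), an assumption
    """
    candidates = set(candidates)
    predicted = set(predicted) & candidates
    reference = set(reference) & candidates
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    tn = len(candidates) - tp - fp - fn
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "TPR": tpr, "FPR": fpr}


candidates = [f"site{i}" for i in range(1, 11)]          # ten candidate sites
predicted = ["site1", "site2", "site3", "site4"]         # program output
reference = ["site1", "site2", "site5"]                  # reference sites
rates = confusion_rates(candidates, predicted, reference)
print(rates)
```

The quality of the resulting rates depends entirely on how complete and accurate the reference set is, which is the limitation pointed out in the text for both generations of programs.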

Table 2

Main characteristics of first- and second-generation algorithms to predict miRNA-binding sites

		Binding site position on mRNA			Type of binding site		Features
Program name	Type of resource	5ʹ UTR	CDS	3ʹ UTR	Seed	NCSL	Conservation	Free energy	Accessibility	CLIP data	Reference	Web site
First generation
TargetScan	WS	No	No	Yes	Yes	No	Yes	No	No	No	[93]	http://www.targetscan.org/vert_71/
PicTar	WS	No	No	Yes	Yes	SL	No	Yes	No	No	[94]	http://www.pictar.org/
PITA	WS/SA	No	No	Yes	Yes	No	Yes	Yes	Yes	No	[95]	https://genie.weizmann.ac.il/pubs/mir07/mir07_prediction.html
miRanda	D/SA	No	No	Yes	Yes	No	Yes	Yes	No	No	[96]	http://www.microrna.org/microrna/home.do
RNAhybrid	WS/SA	No	No	Yes	Yes	Yes	No	Yes	No	No	[97]	https://bibiserv2.cebitec.uni-bielefeld.de/rnahybrid?id=rnahybrid_view_download
RNA22	WS/SA	Yes	Yes	Yes	Yes	SL	No	Yes	No	No	[98]	https://cm.jefferson.edu/rna22/
Second generation
TargetSpy	WS/D	No	No	Yes	Yes	Yes	No	Yes	Yes	Yes	[99]	http://webclu.bio.wzw.tum.de/targetspy/index.php?search=true
MIRZA	SA/WS	No	No	Yes	Yes	SL	No	Yes	No	Yes	[100]	http://www.clipz.unibas.ch/downloads/mirza/
DIANA-micro-T-CDS	WS/SA	No	Yes	Yes	Yes	No	Yes	Yes	Yes	Yes	[101]	http://diana.imis.athena-innovation.gr/DianaTools/index.php?r=microT_CDS/index
STarMir	WS	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes	Yes	[102]	http://sfold.wadsworth.org/cgi-bin/starmir.pl
miRDB	D/WS	Yes	Yes	Yes	Yes	No	Yes	Yes	No	Yes	[103]	http://www.mirdb.org/miRDB/
miRTar2GO	D/WS	No	No	Yes	Yes	No	Yes	Yes	No	Yes	[104]	http://www.mirtar2go.org/
chimiRic	SA	No	No	Yes	Yes	Yes	No	Yes	No	Yes	[105]	https://bitbucket.org/leslielab/chimiric

WS, Web server; D, database; SA, stand-alone software; CDS, Protein-coding sequence; NC, Noncanonical; SL, seed-like.

In addition to the features within the miRNA::RNA heteroduplex identified by CLIP-seq analyses, a few reports have shown that the binding activity of miRNAs is also modulated by RBPs that bind the sequence surrounding miRNA-binding sites [108-110]. This has led to the concept of a sequence microenvironment surrounding miRNA-binding sites that can play an important role in regulating miRNA activity [89]. However, much remains to be explored about the use of RBP-binding motifs to improve the prediction of miRNA target sequences. A first step in this direction was made by Incarnato et al. [111]. Briefly, the authors used the Pumilio-binding motif to predict miRNA-binding sites within a distance of 100 nt. This analysis was validated by RNA expression profiling. We foresee that the full incorporation of RBP-binding motifs may pave the way to a third generation of miRNA target prediction programs.
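The idea of a sequence microenvironment can be sketched in a few lines of Python: given candidate miRNA-binding site positions, report which of them have an RBP motif within a fixed window. This is a hedged illustration rather than the procedure of Incarnato et al. The motif pattern UGUANAUA (a commonly described Pumilio recognition element), the function name, the way distance is measured (between start positions) and the toy sequence are all assumptions made for the example; only the 100-nt window comes from the text.

```python
import re

# Assumed motif for illustration: UGUANAUA, written as a lookahead so that
# overlapping occurrences are also reported.
PUMILIO_MOTIF = re.compile(r"(?=UGUA[ACGU]AUA)")


def motifs_near_sites(target, site_starts, motif=PUMILIO_MOTIF, window=100):
    """Map each candidate site start to the motif starts within `window` nt.

    Distance is measured between 0-based start positions, which is a
    simplification chosen for this sketch.
    """
    motif_starts = [m.start() for m in motif.finditer(target)]
    return {
        site: [p for p in motif_starts if abs(p - site) <= window]
        for site in site_starts
    }


# Toy transcript (hypothetical): one motif 58 nt upstream of a candidate site
# at position 68, and a second motif more than 100 nt downstream.
toy_target = "A" * 10 + "UGUAAAUA" + "C" * 50 + "UAAGCU" + "C" * 200 + "UGUACAUA"
print(motifs_near_sites(toy_target, [68]))  # prints {68: [10]}
```

In practice the candidate site positions would come from CLIP-seq peaks or a seed scan, and the window and motif definition would need to be chosen and validated for the RBP of interest.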

Application of Ago2 CLIP-seq in human pathologies

Several publications have reported the miRNA targetome in different cells and tissues [12, 48–50, 110, 112–116]. Ago2 CLIP-seq experiments have contributed significantly to our knowledge of the role of miRNAs in human pathogenesis. For instance, several recent reports have used Ago2 CLIP-seq analysis to study the role of either host or viral miRNAs during viral infection [117]. Notably, Kim et al. [118] combined Ago2 CLIP-seq and bioinformatics to identify the targetomes of human cytomegalovirus miRNAs during infection. This study revealed that viral miRNAs can regulate multiple pathways and function cooperatively with host miRNAs to promote viral replication [118]. Surprisingly, for some human pathologies, such as cardiovascular diseases, genome-wide studies of miRNA regulatory networks remain poorly developed, despite the vast literature and the well-established roles of small RNAs in their pathogenesis [2, 119–121]. Indeed, only two Ago2 CLIP-seq studies have been performed in cardiovascular diseases: one in the heart of transgenic mice overexpressing miR-133a and miR-499 [122], and the other in left ventricular cardiac tissue from six men with cardiomyopathy [123]. Interestingly, in the ventricular tissue of patients with cardiomyopathy, about 4000 Ago2-binding sites containing seed sequence complementarity to the most highly expressed cardiac miRNAs were identified. The authors characterized in depth the targetome of miR-133, known to be enriched in many pathological conditions of the heart, and described new roles for miR-29 [123]. In particular, they found that miR-29 targets several mRNAs, including Ryr2, Serca2 and Junctin, which are key regulators of sarcoplasmic reticulum Ca2+ in cardiomyocytes; PIK3R1 (p85-alpha) and Med13, which are involved in cardiomyocyte growth and metabolic signaling; and Lama2, which plays a more general role in the extracellular matrix composition of muscle cells.
The authors speculate that these target genes suggest new roles for miR-29 in cardiomyocyte growth and calcium handling, which may have significant clinical relevance to cardiac hypertrophy and contractile dysfunction in cardiomyopathy patients. These findings are also novel because prior studies focused on miR-29 function in cardiac fibroblasts and ignored cardiomyocytes [124]. Further preclinical and clinical investigations may strengthen these findings and eventually lead to a miR-29-based therapeutic approach to cure or prevent cardiopathies. We foresee that Ago2 CLIP-seq experiments on biopsies from cardiovascular case-control studies and on animal models will unravel the global role of miRNAs in the pathogenesis of cardiovascular disease and other human pathologies. A major limitation of this approach is the limited amount of primary tissue obtained from biopsies. Future technological advances that increase the depth of single-cell sequencing could overcome this limitation [125].

Concluding remarks

In this review, we discuss the recent computational developments of CLIP-seq analysis and highlight key points associated with each step, providing a useful guideline for nonexpert users. We stress that a quality check of the data at each step is important to perform the analysis properly. While our manuscript was in revision, Uhl et al. [126] published a review on CLIP-seq data analysis that mainly focuses on the peak-calling step, whereas our review provides a more general view of the computational workflow. In addition, by addressing recent applications of CLIP-seq analysis for Ago2/miRNA target identification and prediction in physiological conditions or in human pathologies, our review provides practical examples of direct applications of this technique in biomedical fields. We believe that this review will benefit scientists in RNA biology, gene expression and beyond. CLIP-seq analysis provides a huge amount of data that researchers often do not fully exploit because they lack the time or the tools to perform and integrate different analyses. Further effort should be made to provide these data in a user-friendly format and to build resource databases that collect them. Some databases, such as StarBase [92], CLIPdb [127] and CLIPZ [27], collect raw data from CLIP-seq studies and provide peak sequences and/or coordinates. However, they are not always updated, and they use standard pipelines that may not be suitable for every RBP. Therefore, the development of integrated platforms for CLIP-seq data analysis and databases could be a direction to take in the near future. In particular, these platforms should combine multiple software tools, such as pipelines covering multiple steps of data analysis, or multiple databases that include data from other high-throughput techniques such as RIP-seq, ChIP-seq, RNA-seq and quantitative proteomics, making the analysis faster and more comprehensive.
On the other hand, improvement of the experimental protocol or conditions, such as quantification of the benefits of using replicates and/or control experiments, is needed to improve the reproducibility of the data and increase the efficiency of data mining. Overall, we have shown that CLIP-seq experiments combined with sophisticated bioinformatics analysis have become an essential instrument to gain insight into the direct regulatory networks of RBP-RNA interactions and to address central questions of RNA biology and gene expression control in normal and pathological events of physiology or development. We foresee that progress in software development will take center stage in the near future, making CLIP-seq analysis more integrated with other genome-wide approaches and more accessible to nonexpert users.

Key Points

Recent developments in sequencing technologies and bioinformatics analyses enable us to handle many CLIP-seq samples simultaneously; thus, it is important to optimize bioinformatics pipelines that help researchers obtain unbiased and high-quality data. Despite great effort from researchers to streamline CLIP-seq analysis, much remains to be improved in the computational procedures. In this review, we discuss the validity and the limitations of emerging programs for CLIP-seq analysis and the quality measurements currently available for specific tasks, providing concrete examples on an in-house Ago2 HITS-CLIP data set generated in stem cells. We have focused this review on providing a valuable computational guideline for the bioinformatics analysis of the three main variants of CLIP-seq analysis, namely, HITS-CLIP, PAR-CLIP and iCLIP. We discuss how Ago2 CLIP-seq analyses have improved miRNA-binding site prediction and the understanding of miRNA function in human pathologies.

126 in total

Review 1. MicroRNAs and their targets: recognition, regulation and an emerging reciprocal relationship.

Authors: Amy E Pasquinelli
Journal: Nat Rev Genet Date: 2012-03-13 Impact factor: 53.242

2. Combinatorial microRNA target predictions.

Authors: Azra Krek; Dominic Grün; Matthew N Poy; Rachel Wolf; Lauren Rosenberg; Eric J Epstein; Philip MacMenamin; Isabelle da Piedade; Kristin C Gunsalus; Markus Stoffel; Nikolaus Rajewsky
Journal: Nat Genet Date: 2005-04-03 Impact factor: 38.330

Review 3. CLIP: viewing the RNA world from an RNA-protein interactome perspective.

Authors: Yin Zhang; ShuJuan Xie; Hui Xu; LiangHu Qu
Journal: Sci China Life Sci Date: 2015-01-10 Impact factor: 6.038

Review 4. Virus meets host microRNA: the destroyer, the booster, the hijacker.

Authors: Yang Eric Guo; Joan A Steitz
Journal: Mol Cell Biol Date: 2014-07-21 Impact factor: 4.272

Review 5. Optimization of PAR-CLIP for transcriptome-wide identification of binding sites of RNA-binding proteins.

Authors: Aitor Garzia; Cindy Meyer; Pavel Morozov; Marcin Sajek; Thomas Tuschl
Journal: Methods Date: 2016-10-17 Impact factor: 3.608

6. Quality control and preprocessing of metagenomic datasets.

Authors: Robert Schmieder; Robert Edwards
Journal: Bioinformatics Date: 2011-01-28 Impact factor: 6.937

7. GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments.

Authors: Simon J van Heeringen; Gert Jan C Veenstra
Journal: Bioinformatics Date: 2010-11-15 Impact factor: 6.937

8. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

9. Discovering sequence motifs with arbitrary insertions and deletions.

Authors: Martin C Frith; Neil F W Saunders; Bostjan Kobe; Timothy L Bailey
Journal: PLoS Comput Biol Date: 2008-05-09 Impact factor: 4.475

10. Comprehensive Identification of RNA-Binding Domains in Human Cells.

Authors: Alfredo Castello; Bernd Fischer; Christian K Frese; Rastislav Horos; Anne-Marie Alleaume; Sophia Foehr; Tomaz Curk; Jeroen Krijgsveld; Matthias W Hentze
Journal: Mol Cell Date: 2016-07-21 Impact factor: 17.970
