A Bayesian Framework to Improve MicroRNA Target Prediction by Incorporating External Information.

Zixing Wang¹, Wenlong Xu¹, Haifeng Zhu², Yin Liu³.

Abstract

MicroRNAs (miRNAs) are small regulatory RNAs that play key gene-regulatory roles in diverse biological processes, particularly in cancer development. Therefore, inferring miRNA targets is an essential step to fully understanding the functional properties of miRNA actions in regulating tumorigenesis. Bayesian linear regression modeling has been proposed for identifying the interactions between miRNAs and mRNAs on the basis of the integrated sequence information and matched miRNA and mRNA expression data; however, this approach does not use the full spectrum of available features of putative miRNA targets. In this study, we integrated four important sequence and structural features of miRNA targeting with paired miRNA and mRNA expression data to improve miRNA-target prediction in a Bayesian framework. We have applied this approach to a gene-expression study of liver cancer patients and examined the posterior probability of each miRNA-mRNA interaction being functional in the development of liver cancer. Our method achieved better performance, in terms of the number of true targets identified, than did other methods.

Keywords: gene expression; gene regulation; microRNA target prediction; prior information; sequence feature

Year: 2014 PMID: 25452690 PMCID: PMC4238384 DOI: 10.4137/CIN.S16348

Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351

Introduction

MicroRNAs (miRNAs) are highly conserved small RNAs that have diverse functions, including regulation of cellular differentiation, proliferation, and apoptosis.1,2 These RNA molecules exert their function by inhibiting translation or inducing degradation of their target messenger RNAs (mRNAs). A given miRNA can pair with hundreds of transcripts through its seed nucleotides, allowing it to regulate complex gene-expression programs and induce global physiological changes.3 Dysfunction of miRNAs has been linked to several human diseases, including different types of cancer. Virtually all examined tumor types have globally abnormal miRNA expression patterns, in which miRNAs play regulatory roles as potential oncogenes or oncosuppressor genes.4–6 Genome-wide profiling showed that about half of miRNA genes are localized in cancer-related genomic regions or fragile loci,7 where mutations, deletions, or amplifications occur in many human tumors. These observations indicate that miRNAs are candidate genes for tumorigenesis and cancer progression.
An essential step and major challenge in understanding the functions of miRNAs in cancer is the identification of their target genes. Many computational and experimental approaches have been used to improve the reliability of miRNA-target prediction. TargetScan,8–10 PicTar,11 DIANA-microT,12 miRanda,13 and TargetS14,15 are examples of computational approaches that are based on an analysis of miRNA and mRNA sequences. Generally, they use the following principles to predict miRNA targets:
Seed matches: the Watson-Crick pairing between the 5′ region of the miRNA centered on nucleotides 2–7 and the 3′ untranslated region (UTR) of the target mRNA.
Degree of conservation: a functional miRNA target is preferentially conserved across multiple species.
Thermodynamic stability: measured by the hybridization energy between the miRNA and its candidate target site. It is believed that the total free energy of a functional targeting interaction must be thermodynamically favorable, ie, negative.
Accessibility energy: the free energy required to unpair the nucleotides on the target site to make the target accessible to the miRNA.
Target site context: local AU content, the target position within the 3′ UTR, and the residue pairing at 3′ of the putative target site.9
These computational methods integrate multiple types of sequence and structural features, yet they have low specificity and produce a high number of false positives. More importantly, predictions based on sequence and structural features represent only static miRNA–mRNA interactions. It is not clear to what extent these predicted interactions align with functional miRNA regulation in a particular phenotype or pathological condition. Thus, expression profiling has been proposed as an important information resource for discovering miRNA targets under different conditions. On the basis of this idea, several approaches have been developed to predict miRNA targets by integrating expression data into sequence-based prediction. Among them are GenMiR++,16,17 TaLasso,18 HOCTAR,19 BLasso,20 MAGIA,21 and HCTarget.22 They mainly use paired miRNA and mRNA expression data from the same set of samples to refine the sequence-based prediction results and obtain more reliable miRNA targets. However, these approaches do not consider the full spectrum of available sequence and structural features of putative miRNA targets. Instead, they treat all potential targets in sequence-based prediction results as equally biologically meaningful. Recently, Xu et al systematically evaluated the effects of sequence and structural features on miRNA-target prediction using the pSILAC dataset as a benchmark.14,15 They found that all of these features were important for improving the accuracy of miRNA-target identification.
In this study, we combined the paired expression data of miRNAs and mRNAs from liver cancer patients, and the sequence and structural features of miRNA targeting to improve miRNA-target prediction. Our approach was based on a Bayesian linear regression model coupled with the Markov chain Monte Carlo (MCMC) algorithm. It uses both sequence and structural feature information to establish a prior probability of a miRNA-target interaction being functional, and paired miRNA and mRNA expression data to compute the likelihood of a putative miRNA-target interaction. By combining these two sources of information, our Bayesian method allows us to effectively sample from the large search space of putative miRNA–mRNA interactions and compute the posterior probability of each putative miRNA target. It represents a powerful means of reconstructing miRNA–mRNA interaction networks, specifically in liver cancer samples, and might help us uncover the mechanisms of tumorigenesis and progression in liver cancer.

Methods

Given a set of miRNA and mRNA expression data, we modeled the interaction between miRNAs and their target mRNAs using a linear model. Assuming normally distributed noise, the log-conditional likelihood of the data for the gth mRNA can be written in the following form:
$$\log p(y_g \mid X, \beta_g, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_g - X\beta_g)^{T}(y_g - X\beta_g),$$
where $y_g \in \mathbb{R}^{N}$ is the vector of expression values of the gth mRNA across the N samples, $X \in \mathbb{R}^{N \times M}$ is the collection of miRNA expression data, M is the number of miRNAs, $\sigma^2$ is the noise variance, and $\beta_g \in \mathbb{R}^{M}$ is the regression coefficient vector of the gth mRNA.
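As an illustration of the likelihood described above, the following minimal sketch evaluates the Gaussian log-likelihood of a linear model for a single mRNA. The function name, the toy expression values, and the fixed noise variance are assumptions made for this example; they are not taken from the study.

```python
import math

def gaussian_log_likelihood(y, X, beta, sigma2):
    """Log-likelihood of y under y = X @ beta + noise, noise ~ N(0, sigma2)."""
    n = len(y)
    rss = 0.0  # residual sum of squares
    for y_i, x_i in zip(y, X):
        fitted = sum(b * x for b, x in zip(beta, x_i))
        rss += (y_i - fitted) ** 2
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - rss / (2.0 * sigma2)

# Hypothetical toy data: 3 samples, 2 miRNAs, one mRNA.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [-0.5, 0.0, -0.5]

ll_fit = gaussian_log_likelihood(y, X, [-0.5, 0.0], 1.0)   # coefficients that fit exactly
ll_null = gaussian_log_likelihood(y, X, [0.0, 0.0], 1.0)   # no miRNA effect
print(ll_fit, ll_null)
```

A negative coefficient corresponds to the down-regulatory effect expected of a functional miRNA, and the coefficient vector that explains the data better receives the higher log-likelihood.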

Without additional sequence and structural feature information

The goal of this analysis is to identify a small subset of miRNA–mRNA interactions that are biologically meaningful. In the framework of variable selection,23,24 we define an indicator matrix $R = (r_{mg})$, where $r_{mg}$ is a binary indicator of whether the interaction between the mth miRNA and the gth mRNA is functional. In this model (without sequence and structural feature information), we incorporated only the computationally predicted, sequence-based miRNA-target information as prior information. To do so, we used an additional indicator matrix, C, whose entry $c_{mg}$ is 1 if the gth mRNA is a potential target of the mth miRNA in the database and 0 otherwise. We focused on the entries with $c_{mg} = 1$. We also assumed that the $r_{mg}$ are mutually independent and follow a Bernoulli distribution:
$$p(r_{mg} = 1 \mid c_{mg} = 1) = \pi.$$
Here π can be regarded as the proportion of true targets in the databases. We used a non-informative prior for the noise variance $\sigma^2$. To search the parameter space of r efficiently using MCMC sampling, we integrated β and $\sigma^2$ out of the joint posterior distribution; in the resulting marginal distribution of r, p denotes the total number of miRNAs in the model, which is equal to sum(c). Because the entries of r are independent a priori, we can infer an individual $r_{mg}$ conditional on $r_{-mg}$, where $r_{-mg}$ is the vector r without the mth element. Because $r_{mg}$ is binary, its conditional distribution is a Bernoulli distribution with a success probability λ. We implemented a Gibbs sampler to sample each $r_{mg}$: we initialized the vector r at random and then sampled each entry of r in turn from this Bernoulli distribution, with the other entries $r_{-mg}$ held fixed.
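The Gibbs update described above can be sketched generically: for each binary indicator, the success probability λ is the logistic function of the difference between the log-posterior with the entry switched on and with it switched off. The sketch below is not the authors' implementation. The `log_post` callback stands in for the marginal posterior of r, and the toy target with known inclusion probabilities `q` is a hypothetical placeholder chosen so that the result is easy to check.

```python
import math
import random

def gibbs_sweep(r, log_post, rng):
    """Resample every binary entry of r once, holding the other entries fixed."""
    for m in range(len(r)):
        r[m] = 1
        lp_on = log_post(r)
        r[m] = 0
        lp_off = log_post(r)
        # Conditional success probability p(r_m = 1 | r_-m); fine for moderate log-odds.
        lam = 1.0 / (1.0 + math.exp(lp_off - lp_on))
        r[m] = 1 if rng.random() < lam else 0
    return r

# Hypothetical toy target: independent indicators with known probabilities q.
q = [0.9, 0.5, 0.1]

def toy_log_post(r):
    return sum(math.log(qm) if rm else math.log(1.0 - qm) for rm, qm in zip(r, q))

rng = random.Random(1)
r = [rng.randint(0, 1) for _ in q]  # random initialisation
n_iter = 5000
counts = [0] * len(q)
for _ in range(n_iter):
    gibbs_sweep(r, toy_log_post, rng)
    for m, rm in enumerate(r):
        counts[m] += rm

# Posterior inclusion probability = proportion of iterations with r_m = 1.
posterior = [c / n_iter for c in counts]
print(posterior)
```

In a real application the callback would evaluate the marginal posterior obtained after integrating out the regression coefficients and the noise variance, and an initial burn-in portion of the chain would normally be discarded.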

With additional sequence and structural features as prior information

To incorporate the sequence and structural features of miRNA-target sites into the model, we introduced an F-dimensional vector $f_{mg}$ composed of the F features associated with each miRNA–mRNA pair (m, g). We denoted $\pi_{mg} = p(r_{mg} = 1 \mid c_{mg} = 1, f_{mg}, w)$ as the prior probability of $r_{mg} = 1$ given the F features. To simplify the model, we assumed that each of the F features contributes independently to $\pi_{mg}$ with a weight $w_f$, where f = 1, …, F. Here, w is an unknown parameter vector with positive entries, and the prior $\pi_{mg}$ is defined as a function of the weighted features. We further specified a gamma hyperprior on the weights, $w_f \sim \mathrm{Gamma}(a, b)$, which ensures the positivity of the parameter. In this work, we included four types of features that play crucial roles in miRNA target recognition (see the Results section for details of the features included in our model). Therefore, the feature vector $f_{mg}$ has four dimensions, the same as w. These features should be normalized to positive values that lie in the same range, with a larger value corresponding to a higher prior probability. Following the Gibbs sampling of r given the prior probability $\pi_{mg}$, we sampled w using Metropolis steps25 so that we can update $\pi_{mg}$ according to the values of the sequence features. The proposal is made via a truncated normal random walk kernel: the proposed w is drawn from a normal distribution centered at the current value and truncated at 0, given the positive nature of w, and is then accepted with the corresponding acceptance probability. The variance of this proposal distribution has to be set to give an appropriate acceptance rate during MCMC sampling. The sampling of r and w was iterated until the MCMC chain had converged. Using this MCMC sampling procedure, we could explore the search space and find the most relevant predictions using a stochastic search variable selection method. The posterior probability of an miRNA–mRNA interaction, that is, $p(r_{mg} = 1 \mid c_{mg} = 1, Y, X)$, can be estimated directly from the MCMC sampling results as the proportion of MCMC iterations for which $r_{mg} = 1$.
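A Metropolis update with a truncated normal random walk proposal, as described above, might look like the following sketch. Because the proposal is truncated at 0, it is not symmetric, so the acceptance ratio includes the ratio of the truncation normalizing constants. The target used here, a Gamma density with shape 2 and rate 1, and the step size are hypothetical choices for demonstration; they are not the hyperparameters or the posterior used in the study.

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def propose_positive(w, step_sd, rng):
    """Draw from N(w, step_sd^2) truncated to (0, inf) by rejection."""
    while True:
        candidate = rng.gauss(w, step_sd)
        if candidate > 0.0:
            return candidate

def metropolis_step(w, log_target, step_sd, rng):
    candidate = propose_positive(w, step_sd, rng)
    log_ratio = log_target(candidate) - log_target(w)
    # Hastings correction: q(w | cand) / q(cand | w) = Phi(w / s) / Phi(cand / s).
    log_ratio += math.log(norm_cdf(w / step_sd)) - math.log(norm_cdf(candidate / step_sd))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return candidate, True
    return w, False

# Hypothetical target: unnormalized log-density of Gamma(shape=2, rate=1).
shape, rate = 2.0, 1.0

def log_gamma_density(w):
    return (shape - 1.0) * math.log(w) - rate * w

rng = random.Random(0)
w, samples, accepted = 1.0, [], 0
for _ in range(20000):
    w, ok = metropolis_step(w, log_gamma_density, 1.0, rng)
    samples.append(w)
    accepted += ok

sample_mean = sum(samples) / len(samples)
acceptance_rate = accepted / len(samples)
print(sample_mean, acceptance_rate)
```

The step size controls the acceptance rate: a smaller proposal variance accepts more often but explores more slowly, which is why the variance has to be tuned.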

Results

We studied the regulatory roles of miRNAs in a dataset of matched miRNA and mRNA expression data for 125 patients with liver cancer from The Cancer Genome Atlas (TCGA). During data preprocessing, we log-transformed the expression data so that they approximately followed a normal distribution. The computationally predicted miRNA–mRNA interactions were extracted from TargetScanHuman (release 6.1)10 and mapped to the expression dataset. This yielded 67,798 interactions between 170 human miRNAs and 4973 mRNA transcripts, which were used as the prior information for our first model without the addition of miRNA features. To determine the effects of the additional sequence and structural features of miRNA on target prediction, we obtained context scores and aggregated probability of conserved targeting scores for each miRNA–mRNA pair from TargetScanHuman. The probability of conserved targeting score is a target site conservation score that has been used to measure the degree of miRNA target sequence conservation across multiple species. We also calculated the thermodynamic stability (ΔG) and the accessibility energy (ΔΔG) for each putative miRNA–mRNA interaction.14,15 In total, four sequence and structural features were therefore integrated into our model to establish the prior probability of a miRNA-target interaction being biologically functional. We then compared this algorithm to the method without the addition of these four features. The set of miRNA-target interactions with the highest scores from each approach was selected and compared in terms of its enrichment in experimentally validated interactions. In our Bayesian framework without additional miRNA features, the parameter π of the Bernoulli distribution reflects the prior belief about the proportion of true targets among the computationally predicted miRNA–mRNA interactions.
Since there were 67,798 putative miRNA–mRNA interactions included in the liver cancer dataset, we set π = 0.07, indicating that 7% of putative interactions are true; thus, the expected number of miRNA–mRNA interactions for each mRNA would be approximately equal to 1 (0.07 × 67,798 / 4973 ≈ 0.95). In our model with the miRNA sequence and structural features, we tuned the prior probability of each interaction according to the values of its corresponding features. We set the hyperparameters for the gamma distribution of weights as a = 1.5 and b = 0.05, and the variance of the truncated normal proposal distribution of w to 0.01 so that we could obtain an acceptance rate close to 25%. Figure 1 shows the summary trace plots for the number of selected interactions (r = 1) and the corresponding log-posterior probabilities for our two models (with and without the additional features). In this application, the MCMC chain was run for 10^6 iterations, starting from a randomly chosen set of 5000 miRNA–mRNA interactions, so that each gene was targeted by approximately one miRNA on average, which is consistent with our prior specification.
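The prior specification above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative.

```python
n_interactions = 67_798   # putative miRNA-mRNA interactions
n_mrnas = 4_973           # mRNA transcripts in the dataset
prior_pi = 0.07           # prior proportion of true targets

expected_true = prior_pi * n_interactions        # expected number of true interactions
expected_per_mrna = expected_true / n_mrnas      # expected true interactions per mRNA
initial_per_mrna = 5000 / n_mrnas                # average in the random starting state

print(round(expected_true), round(expected_per_mrna, 3), round(initial_per_mrna, 3))
```

Both the prior expectation and the starting state correspond to roughly one miRNA per gene, which is the consistency the text refers to.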

Figure 1

Trace plots for (A) the number of selected miRNA–mRNA interactions and (B) the log-posterior probability along the number of iterations.

To assess the performance of our approach, we evaluated the enrichment of the results in experimentally validated miRNA–mRNA interactions. If the top-ranked miRNA–mRNA interactions identified by an algorithm include more experimentally validated targets, the algorithm is considered to perform better because more of its predicted interactions can be validated. Here, we extracted the experimentally validated target information from TarBase 6.0, which includes more than 65,000 manually curated, experimentally validated miRNA-gene interactions from eight species.26 To examine the overlaps between the TarBase information and our prediction results, we mapped all miRNAs in our dataset to the miRNA families in TarBase using the annotations in miRBase. In the liver cancer expression dataset, 609 miRNA–mRNA interactions have been biologically verified. We found that 68 and 79 of these interactions, without and with the addition of miRNA features, respectively, overlapped with the top 5000 targets detected by our model; the well-known GenMiR++ method identified only 56 interactions. On the basis of this observation, we obtained the numbers of false positives and false negatives, and calculated the statistical significance of the number of true targets identified by each method using the hypergeometric distribution. The smaller the P-value, the more enriched the predicted set of targets is in experimentally validated interactions. The results demonstrate that our model with the addition of miRNA sequence and structural features produced the most significant P-value, compared to the non-feature model and the GenMiR++ method, as shown in Table 1. We also examined the top 500 targets and observed similar results. The experimentally validated targets that were predicted by at least two of the three methods are listed in Table 2.

Table 1

Enrichment values of experimentally validated targets obtained from our feature-dependent model (feature-MCMC), the model without additional features (non-feature), and the GenMiR++ method. In the top 5000 or 500 predicted interactions, the numbers of experimentally validated targets (true positives), false positives, false negatives, and enrichment significance (P-value) are given. P-values were calculated on the basis of the hypergeometric distribution.

MODEL	TOP	TRUE POSITIVE	FALSE POSITIVE	FALSE NEGATIVE	P-VALUE
Feature-MCMC	5000	79	4921	530	2.23E-06
	500	15	485	594	7.90E-05
Non-feature	5000	68	4932	541	1.51E-04
	500	15	485	594	7.90E-05
GenMiR++	5000	56	4944	553	2.31E-02
	500	10	490	599	5.41E-03
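The counts in Table 1 follow from simple bookkeeping, and the enrichment test can be sketched as an upper-tail hypergeometric probability. The sketch below assumes that the background population is the 67,798 putative interactions, of which 609 are validated. The text does not state the exact population used, so the P-values printed here may differ from those reported in the table.

```python
from math import exp, lgamma

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement."""
    log_total = log_comb(population, draws)
    lower = max(k, 0, draws - (population - successes))
    upper = min(successes, draws)
    return sum(
        exp(log_comb(successes, i)
            + log_comb(population - successes, draws - i)
            - log_total)
        for i in range(lower, upper + 1)
    )

POPULATION = 67_798   # assumed background: all putative interactions
VALIDATED = 609       # experimentally validated interactions in the dataset

results = {}
for label, top_n, true_pos in [("feature-MCMC", 5000, 79),
                               ("non-feature", 5000, 68),
                               ("GenMiR++", 5000, 56)]:
    false_pos = top_n - true_pos          # selected but not validated
    false_neg = VALIDATED - true_pos      # validated but not selected
    p_value = hypergeom_upper_tail(true_pos, POPULATION, VALIDATED, top_n)
    results[label] = (false_pos, false_neg, p_value)
    print(label, false_pos, false_neg, f"{p_value:.2e}")
```

Working in log space with `lgamma` avoids the very large integers that exact binomial coefficients would produce at this scale.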

Table 2

Experimentally validated targets predicted by at least two of the three methods: our feature-dependent model (feature-MCMC), the model without additional features (non-feature), and the GenMiR++ method.

miRNA	GENE	FEATURE-MCMC	NON-FEATURE	GenMiR++
hsa-miR-103a-3p	SMARCE1	×		×
hsa-miR-103a-3p	FKBP1A	×		×
hsa-miR-103a-3p	BCKDK	×	×
hsa-miR-103a-3p	CCNE1	×		×
hsa-miR-103a-3p	AADAT	×	×
hsa-miR-103a-3p	SCAF1	×		×
hsa-miR-10a-5p	NDUFB6	×	×
hsa-miR-145-5p	APH1A	×		×
hsa-miR-145-5p	MUC1	×	×
hsa-miR-16-5p	CCNE1	×	×	×
hsa-miR-16-5p	TPPP3	×	×
hsa-miR-185-5p	CCNE1	×		×
hsa-miR-186-5p	TMEM183A	×	×
hsa-miR-191-5p	MPST	×	×
hsa-miR-19b-3p	WBP2	×	×
hsa-miR-21-5p	TPM1	×	×	×
hsa-miR-22-3p	BTF3L1	×		×
hsa-miR-24-3p	VPS25	×	×
hsa-miR-24-3p	MARCKSL1	×	×
hsa-miR-29a-3p	DNMT3A	×		×
hsa-miR-32-5p	HIVEP1	×	×
hsa-miR-32-5p	BCAT2	×		×
hsa-miR-34a-5p	MAGEA12	×	×	×
hsa-miR-34a-5p	MAGEA6	×	×	×
hsa-miR-7-5p	TCOF1	×		×
hsa-miR-7-5p	POLE4	×	×
hsa-miR-7-5p	C18orf10	×	×
hsa-miR-7-5p	DTYMK	×	×	×
hsa-miR-93-5p	GRAMD1A	×		×

To further investigate the function of our predicted targets and the potential regulatory roles of miRNAs in patients with liver cancer, we analyzed the biological relevance of the target genes in a KEGG pathway enrichment study.27 We used the KEGG pathway annotation to measure the enrichment of the top 200 genes predicted by each method using the GeneCodis 3.0 tool. Several KEGG pathways were found to be significantly enriched in the results obtained from our models and from GenMiR++ (Table 3). Both of our models (with and without additional features) yielded significantly enriched pathways related to cancer and focal adhesion. For GenMiR++, the two most prominent pathways were related to cell cycle and focal adhesion.

Table 3

Top 10 enriched KEGG pathways from our feature-dependent model (feature-MCMC), the model without additional features (non-feature), and the GenMiR++ method.

FEATURE-MCMC MODEL	NUMBER OF GENES	P-VALUE
Pathways in cancer	14	5.32E-07
Focal adhesion	10	8.61E-06
Regulation of actin cytoskeleton	9	5.32E-05
Leukocyte transendothelial migration	6	0.0003
Pathways in cancer, focal adhesion	6	0.0001
Leukocyte transendothelial migration, adherens junction	4	4.04E-05
Adherens junction, bacterial invasion of epithelial cells	4	3.94E-05
Leukocyte transendothelial migration, adherens junction, tight junction	3	0.0002
Long-term depression, progesterone-mediated oocyte maturation	3	0.0002
Regulation of actin cytoskeleton, focal adhesion, leukocyte transendothelial migration, adherens junction	3	0.0001
NON-FEATURE MODEL	NUMBER OF GENES	P-VALUE
Pathways in cancer	8	0.0001
Focal adhesion	8	2.27E-05
Regulation of actin cytoskeleton	7	0.00022
Huntington’s disease	6	0.0002
Pathways in cancer, focal adhesion	5	0.0001
Regulation of actin cytoskeleton, focal adhesion	5	0.0001
Pathways in cancer, focal adhesion, small cell lung cancer	4	0.0001
Pathways in cancer, focal adhesion, ECM-receptor interaction	3	0.0003
Regulation of actin cytoskeleton, focal adhesion, leukocyte transendothelial migration, bacterial invasion of epithelial cells	3	0.0001
Adherens junction, bacterial invasion of epithelial cells		0.0001
GenMiR++	NUMBER OF GENES	P-VALUE
Cell cycle	10	6.41E-08
Focal adhesion	7	0.0003
Pyrimidine metabolism	6	8.34E-05
Focal adhesion, amoebiasis	6	1.83E-06
Pathways in cancer, small cell lung cancer	5	0.0003
Focal adhesion, ECM-receptor interaction	5	3.58E-05
DNA replication	4	0.0002
Pathways in cancer, focal adhesion, amoebiasis	4	9.02E-05
Focal adhesion, ECM-receptor interaction, amoebiasis	4	9.02E-05
DNA replication, cell cycle	3	5.44E-05

Among the miRNAs shown in Table 2, hsa-miR-145 and hsa-miR-21 are key regulators during hepatocellular carcinoma genesis.28,29 In particular, hsa-miR-145 functions as a tumor suppressor in liver cancer by targeting the chromatin modification enzyme, histone deacetylase.28 In this study, we identified many novel candidate functional targets of hsa-miR-145, including MUC1. We used the pair hsa-miR-145-MUC1 to illustrate the effectiveness of our model. We grouped liver cancer patients by their hsa-miR-145 expression level (higher or lower than average). The patients in the high hsa-miR-145 group had lower MUC1 expression than those in the other group (P = 0.06, one-sided Wilcoxon test), a difference that falls just short of the conventional 0.05 significance level. Their cumulative distributions displayed a negative shift of MUC1 (Fig. 2). This example is consistent with a down-regulatory effect of hsa-miR-145 and suggests that MUC1 is a plausible target gene. We expect that the hsa-miR-145-MUC1 pair will provide novel hypotheses for testing the roles of MUC1 in liver cancer development. If validated, this pair could serve as a biomarker for better directing the diagnosis and treatment of liver cancer patients.

Figure 2

Down-regulatory effect of hsa-miR-145 on MUC1. The liver cancer patients were grouped according to their hsa-miR-145 expression levels (higher or lower than average). The cumulative distributions of the MUC1 expression levels in these two groups of patients are plotted (high hsa-miR-145, red dashed line; low hsa-miR-145, blue solid line). The x-axis shows the MUC1 expression levels as Reads Per Kilobase of transcript per Million mapped reads (RPKM) values from the RNA-seq data.
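The grouping and one-sided test described for the hsa-miR-145 and MUC1 pair can be sketched as follows. The sketch splits samples at the mean miRNA expression and computes a one-sided Wilcoxon rank-sum (Mann-Whitney) P-value using the normal approximation, without tie or continuity correction. The expression values are hypothetical toy numbers, not TCGA data, and for samples this small an exact test would normally be preferred.

```python
from math import erf, sqrt
from statistics import mean

def split_by_mean(mirna, target):
    """Split target expression by whether miRNA expression is above its mean."""
    cutoff = mean(mirna)
    high = [t for m, t in zip(mirna, target) if m > cutoff]
    low = [t for m, t in zip(mirna, target) if m <= cutoff]
    return high, low

def rank_sum_p_less(x, y):
    """One-sided P-value for H1: values in x tend to be smaller than in y."""
    n1, n2 = len(x), len(y)
    # U counts pairs where x exceeds y; ties count one half.
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0 for xi in x for yj in y)
    mu = n1 * n2 / 2.0
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sd
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Hypothetical toy expression values for six samples.
mirna = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
target = [6.0, 5.0, 4.0, 1.0, 2.0, 3.0]

high, low = split_by_mean(mirna, target)
p_value = rank_sum_p_less(high, low)
print(high, low, round(p_value, 4))
```

A small P-value here indicates that target expression tends to be lower in the high-miRNA group, which is the direction expected for a repressed target.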

Conclusions

In this study, we integrated matched miRNA and mRNA expression data with the sequence and structural features of miRNA target sites to improve miRNA target prediction. Compared to previous approaches,25 our model restricts the search space to the putative miRNA targets obtained from a well-known miRNA target prediction database; thus, our model has substantially lower computational complexity and higher target prediction specificity. In addition, using a Bayesian linear regression model, we incorporated four key features of miRNA–mRNA interactions as prior knowledge, with each feature assigned a different weight by the MCMC sampling procedure. Our investigation of paired miRNA and mRNA expression profiles in liver cancer patients demonstrated the advantages of our feature-dependent model. Our results showed that the top interactions identified by our feature-dependent model are significantly more enriched in experimentally validated targets and are more biologically meaningful than those identified by the GenMiR++ method or the model without additional feature information. In addition, with the recent intensive research in this field, a large body of experimentally verified miRNA target information has accumulated in available databases, such as StarBase30 and miRWalk.31 There is strong interest in leveraging this information to improve target prediction sensitivity and accuracy, and this will be the focus of our future work. From a Bayesian perspective, we expect to be able to incorporate this information by assigning different prior distributions to the information sources according to their reliability, as in a proposed prior Lasso framework.32

31 in total

1. Gene selection: a Bayesian variable selection approach.

Authors: Kyeong Eun Lee; Naijun Sha; Edward R Dougherty; Marina Vannucci; Bani K Mallick
Journal: Bioinformatics Date: 2003-01 Impact factor: 6.937

2. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets.

Authors: Benjamin P Lewis; Christopher B Burge; David P Bartel
Journal: Cell Date: 2005-01-14 Impact factor: 41.582

3. Combinatorial microRNA target predictions.

Authors: Azra Krek; Dominic Grün; Matthew N Poy; Rachel Wolf; Lauren Rosenberg; Eric J Epstein; Philip MacMenamin; Isabelle da Piedade; Kristin C Gunsalus; Markus Stoffel; Nikolaus Rajewsky
Journal: Nat Genet Date: 2005-04-03 Impact factor: 38.330

4. MicroRNA targeting specificity in mammals: determinants beyond seed pairing.

Authors: Andrew Grimson; Kyle Kai-How Farh; Wendy K Johnston; Philip Garrett-Engele; Lee P Lim; David P Bartel
Journal: Mol Cell Date: 2007-07-06 Impact factor: 17.970

5. Expression of MUC1 and its significance in hepatocellular and cholangiocarcinoma tissue.

Authors: Shi-Fang Yuan; Kai-Zong Li; Ling Wang; Ke-Feng Dou; Zhen Yan; Wei Han; Ying-Qi Zhang
Journal: World J Gastroenterol Date: 2005-08-14 Impact factor: 5.742

6. Human microRNA genes are frequently located at fragile sites and genomic regions involved in cancers.

Authors: George Adrian Calin; Cinzia Sevignani; Calin Dan Dumitru; Terry Hyslop; Evan Noch; Sai Yendamuri; Masayoshi Shimizu; Sashi Rattan; Florencia Bullrich; Massimo Negrini; Carlo M Croce
Journal: Proc Natl Acad Sci U S A Date: 2004-02-18 Impact factor: 11.205

7. The characterization of microRNA-mediated gene regulation as impacted by both target site location and seed match type.

Authors: Wenlong Xu; Zixing Wang; Yin Liu
Journal: PLoS One Date: 2014-09-19 Impact factor: 3.240

8. Accurate microRNA target prediction correlates with protein repression levels.

Authors: Manolis Maragkakis; Panagiotis Alexiou; Giorgio L Papadopoulos; Martin Reczko; Theodore Dalamagas; George Giannopoulos; George Goumas; Evangelos Koukis; Kornilios Kourtis; Victor A Simossis; Praveen Sethupathy; Thanasis Vergoulis; Nectarios Koziris; Timos Sellis; Panagiotis Tsanakas; Artemis G Hatzigeorgiou
Journal: BMC Bioinformatics Date: 2009-09-18 Impact factor: 3.169

9. The microRNA.org resource: targets and expression.

Authors: Doron Betel; Manda Wilson; Aaron Gabow; Debora S Marks; Chris Sander
Journal: Nucleic Acids Res Date: 2007-12-23 Impact factor: 16.971

10. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data.

Authors: Jun-Hao Li; Shun Liu; Hui Zhou; Liang-Hu Qu; Jian-Hua Yang
Journal: Nucleic Acids Res Date: 2013-12-01 Impact factor: 16.971
