Yi-Wen Hsiao1, Lin Wang1, Tzu-Pin Lu1,2. 1. Institute of Epidemiology and Preventive Medicine, Department of Public Health, College of Public Health, National Taiwan University, Taipei, Taiwan. 2. Bioinformatics and Biostatistics Core, Center of Genomic and Precision Medicine, National Taiwan University, Taipei, Taiwan.
Abstract
Competitive endogenous RNA (ceRNA) represents a novel mechanism of gene regulation that controls several biological and pathological processes. Recently, an increasing number of in silico methods have been developed to accelerate the identification of such regulatory events. However, there is still a need for a tool supporting the hypothesis that ceRNA regulatory events only occur at specific miRNA expression levels. To this end, we present an R package, ceRNAR, which allows identification and analysis of ceRNA-miRNA triplets via integration of miRNA and RNA expression data. The ceRNAR package integrates three main steps: (i) identification of ceRNA pairs based on a rank-based correlation between pairs that considers the impact of miRNA and a running sum correlation statistic, (ii) sample clustering based on gene-gene correlation by circular binary segmentation, and (iii) peak merging to identify the most relevant sample patterns. In addition, ceRNAR also provides downstream analyses of identified ceRNA-miRNA triplets, including network analysis, functional annotation, survival analysis, external validation, and integration of different tools. The performance of our proposed approach was validated through simulation studies of different scenarios. Compared with several published tools, ceRNAR was able to identify true ceRNA triplets with high sensitivity, low false-positive rates, and acceptable running time. In real data applications, the ceRNAs common to two lung cancer datasets were identified in both datasets. The bridging miRNA for one of these, the ceRNA for MAP4K3, was identified by ceRNAR as hsa-let-7c-5p. Since similar cancer subtypes do share some biological patterns, these results demonstrated that our proposed algorithm was able to identify potential ceRNA targets in real patients. In summary, ceRNAR offers a novel algorithm and a comprehensive pipeline to identify and analyze ceRNA regulation. The package is implemented in R and is available on GitHub (https://github.com/ywhsiao/ceRNAR).
Competitive endogenous RNA (ceRNA) represents a novel mechanism of gene regulation that controls several biological and pathological processes. Recently, an increasing number of in silico methods have been developed to accelerate the identification of such regulatory events. However, there is still a need for a tool supporting the hypothesis that ceRNA regulatory events only occur at specific miRNA expression levels. To this end, we present an R package, ceRNAR, which allows identification and analysis of ceRNA-miRNA triplets via integration of miRNA and RNA expression data. The ceRNAR package integrates three main steps: (i) identification of ceRNA pairs based on a rank-based correlation between pairs that considers the impact of miRNA and a running sum correlation statistic, (ii) sample clustering based on gene-gene correlation by circular binary segmentation, and (iii) peak merging to identify the most relevant sample patterns. In addition, ceRNAR also provides downstream analyses of identified ceRNA-miRNA triplets, including network analysis, functional annotation, survival analysis, external validation, and integration of different tools. The performance of our proposed approach was validated through simulation studies of different scenarios. Compared with several published tools, ceRNAR was able to identify true ceRNA triplets with high sensitivity, low false-positive rates, and acceptable running time. In real data applications, the ceRNAs common to two lung cancer datasets were identified in both datasets. The bridging miRNA for one of these, the ceRNA for MAP4K3, was identified by ceRNAR as hsa-let-7c-5p. Since similar cancer subtypes do share some biological patterns, these results demonstrated that our proposed algorithm was able to identify potential ceRNA targets in real patients. In summary, ceRNAR offers a novel algorithm and a comprehensive pipeline to identify and analyze ceRNA regulation. The package is implemented in R and is available on GitHub (https://github.com/ywhsiao/ceRNAR).
Regulation of gene expression can occur at multiple levels via both transcriptional and post-transcriptional mechanisms [1]. Many non-coding RNAs have critical roles in post-transcriptional regulation of protein-coding genes [2]. MicroRNAs (miRNAs) are short, non-coding, single-stranded RNAs with ~22 nucleotides. They usually bind protein-coding genes via partial complementarity with many miRNA response elements (MREs) to repress gene expression by inhibiting translation. Previous studies have shown that miRNAs are involved in a broad range of cancer-associated biological processes, including apoptosis, proliferation, metastasis, and angiogenesis [3]. Similar to gene expression, miRNA expression has cancer-specific patterns that can be used to detect cancers. Therefore, the expression values of RNA can serve as diagnostic, prognostic, or therapeutic biomarkers in a diverse range of cancers [4].The concept of competing endogenous RNAs (ceRNAs), also called miRNA sponges or miRNA decoys, has revolutionized our knowledge of miRNA regulatory mechanisms. Such RNAs include canonical protein-coding messenger RNAs (mRNAs), long non-coding RNAs (lncRNAs), circular RNAs (circRNAs), and pseudogenes [5]. Their mechanism is to compete with miRNAs for binding their regulatory sequences. There are two primary hypotheses regarding the regulatory function of ceRNAs, based on their expression level or their number of MREs [6]. Taking miRNA-mRNA regulation for example (i.e., where two mRNAs act as ceRNAs that can bind to the same miRNA), the first hypothesis is that the miRNA tends to be sequestered by the mRNA with the higher expression level, leading to weakened inhibitory effects of the miRNA on the other mRNA and thus increasing the expression of the other mRNA under the assumption of equal MREs on the two RNAs. The second hypothesis is that the miRNA has a greater affinity for the mRNA with more MREs in its sequence.Some ceRNAs have been identified in multiple cancers; for instance, PTEN is an important tumor suppressor gene that was also reported to encode ceRNA in prostate cancer [7], glioblastoma [8], and melanoma [9]. This suggests that elucidating ceRNAs can improve the understanding of biological mechanisms in regulating cancer cells. However, using biological experiments to identify ceRNAs is time-consuming and labor-intensive. To address this issue, an increasing number of computational methods have been developed for identifying ceRNAs.The traditional algorithm is based on the probability theory that two mRNAs share miRNAs and their binding sites (i.e., MREs) [10, 11]. A hypergeometric test is applied to find out if the probability of binding of an mRNA to a given miRNA is larger than that which would occur by chance. However, such an approach usually uses a selected threshold to choose significant genes and takes only these genes into consideration; it may not extend to the whole-genome level and fails to consider the correlation among genes because this approach treats each gene independently [12].Recently, because of the popularity of and advances in genomic sequencing technology, more and more mRNA and miRNA gene expression profiles at the whole-genome level have been released publicly [13-15]. However, the analytical results of such continuous data may tend to be sensitive to the outliers in the sample and to the size of the dataset [16]. Therefore, a second approach has been developed based on the observation of linear correlations between pairs of mRNAs that suggest they have a higher chance of competing with a specific miRNA [17-19]. Unfortunately, such a method ignores the contribution of the expression of the miRNA in a ceRNA binding event. Additionally, it also uses a permutation test to estimate mRNA pairs with significant correlation results; sometimes, it has a higher computational cost. To overcome these drawbacks, in this study, we present a novel rank-based algorithm considering the contribution of miRNA expression in a ceRNA binding event and extending the pairwise correlation approach to identify ceRNA-miRNA triplets. All the steps in this algorithm have been incorporated into a user-friendly R package called ceRNAR, which also provides several downstream analyses to further interpret the biological meaning of identified ceRNA events for its users.
Results
Simulation results
To evaluate the performance of our proposed method, simulations of mRNA and miRNA expression data in 100 samples were performed in different scenarios (S1 Table). Notably, we focused on the sensitivity and the positive predictive value (PPV) because the number of ceRNA triplets was small among all possible combinations of triplets.In the beginning, we presumed the sample distribution of each gene follows normal distribution because about 98% of genes’ sample distributions passed the normality test based on 9,835 samples in The Cancer Genome Atlas (TCGA) pan-cancer atlas (S1 Fig). Therefore, we firstly presumed synthetic expression data were generated from a multivariate normal distribution with a mean value of 0 and a covariance matrix whose entries are 0.9. This ensured that the ceRNAR algorithm supports the hypothesis that pairs of ceRNA binding partners of a specific miRNA are highly correlated and that it works well to sensitively identify them among a pool of target pairs. However, it is unclear whether such an event between two target genes would occur at the lower or higher expression of a specific miRNA. Therefore, five scenarios were designed to capture how the molecular elements within a triplet interact with each other, and simulations of different parameters were performed. Notably, the number of identified ceRNAs dropped as the window size increased (S2 and S3 Tables, Fig 1). This is because more uncorrelated samples were included in the analysis when longer window sizes were used, especially larger than 30%. In other words, higher noise was included in the analysis, resulting in difficulty in identifying true ceRNA triplets. Next, we simulated different scenarios in which the ceRNA peak was located at different miRNA expression levels (scenarios 1 to 4). As shown in S2 Table and Fig 1A, the performance of our proposed algorithm was highly similar across the four scenarios with true ceRNA signals. Such performance also was shown when removing the permutation test (Fig 1B and S3 Table). Lastly, to reduce the calculation complexity and computation time, we evaluated the performance of the algorithm without the random walk step, which we called the “fast” version. Only minor differences were observed in these two versions (0.1 to 0.25 in terms of PPV value, S2 and S3 Tables), suggesting that both versions were able to identify true ceRNA triplets without reporting high false positive results (Fig 1).
Fig 1
Performance of our proposed method in four scenarios (1 to 4).
(A) The complete version. (B) The fast version. The numbers in blue represent the average number of identified ceRNA-miRNA triplets after 100 simulations.
Performance of our proposed method in four scenarios (1 to 4).
(A) The complete version. (B) The fast version. The numbers in blue represent the average number of identified ceRNA-miRNA triplets after 100 simulations.In addition to the analytical parameters for the algorithm, a simulation of different proportions of the correlated samples was performed. Because no major differences were observed in the simulated scenarios and running versions, only scenario 3 was evaluated in this round of testing. As shown in S4 Table, four settings for the proportion of correlated samples were analyzed (10%, 20%, 30%, and 40%). Notably, the proposed algorithm showed similar performance in the three highest settings (Fig 2A).
Fig 2
Performance of our proposed method in three extended cases.
(A) Different proportions of correlated samples in scenario 3 using the fast version. (B) Comparison between complete and fast versions in the null scenario (i.e., scenario 5). (C) Different distances for merging peaks when window size is equal to 40 using the complete version. The numbers in blue represent the number of identified ceRNA-miRNA triplets.
Performance of our proposed method in three extended cases.
(A) Different proportions of correlated samples in scenario 3 using the fast version. (B) Comparison between complete and fast versions in the null scenario (i.e., scenario 5). (C) Different distances for merging peaks when window size is equal to 40 using the complete version. The numbers in blue represent the number of identified ceRNA-miRNA triplets.To further examine whether more false positive ceRNA triplets are reported in the fast version, a null scenario without a true ceRNA signal was simulated (scenario 5). The results showed that the negative predictive values of these versions were all higher than 0.9 (S1 and S2 Tables), suggesting there is a low chance of identifying false positive signals in both versions of the algorithm (Fig 2B). Subsequently, we examined how much time can be saved by using the fast version of the proposed algorithm, which omitted the random walk step. On average, the fast version accelerated the algorithm by approximately 76 times (283.967 seconds versus 21,600 seconds), and only slightly higher false positive rates were reported (Fig 2B). Lastly, to evaluate whether different window sizes for merging peaks are critical to the performance of the proposed algorithm, four settings of the proportion of correlated samples and merging window sizes were analyzed. As shown in Fig 2C and S5 Table, the performance of the proposed algorithm was not sensitive to the window sizes for merging peaks.In addition, to ensure the algorithm only sensitively identified ceRNA events under higher correlation values between them, we also generated simulated data from a multivariate normal distribution with the same mean value but a covariance matrix whose entries are 0.6 (S2A Fig and S6 Table) or 0.3 (S2B and S7 Table). Although the ceRNAR algorithm was still able to detect ceRNA pairs with moderate sensitivity values (0.4 to 0.5) when the correlation values between target genes within a pair were 0.6, it did not work well (sensitivity values were all below 0.25) when the correlation values between target genes within a pair were 0.3. Together, these results support our correlation-based hypothesis that ceRNA events tend to occur when they are highly correlated. However, the above-mentioned results were based on simple and naïve simulations. We also mimicked the distribution of expression from pan-cancer data (9,835 samples), allowing the synthetic data to be generated from a multivariate normal distribution with a randomly selected mean value and a covariance matrix whose entries are also randomly selected. The simulation studies were also performed under different correlation levels. As shown in S3 Fig and S8–S11 Tables, the performance of the ceRNAR algorithm was determined by the level of the correlation values. Nevertheless, it still performed well (sensitivity values > 0.75) when the correlation between genes was high and when the window size was 10 throughout all scenarios (S3B Fig.), suggesting the ceRNAR algorithm is efficient to identify most potential ceRNA pairs when their correlation pattern is relatively high (0.8–0.9) to compete with a specific miRNA, and such a pattern exists in at least 20% of the sample. The optimal parameter settings for real data were also observed. For 100 samples, the best window size was 10, and the best cutoff correlation value for selecting the most significant event was 0.7.
Comparison with other tools
We have compared our tool with other state-of-the-art tools, including SPONGE, JAMI, GDCRNATools, and CERNIA, in terms of their performances using synthetic data and their running time using real data. Fig 3A and S12 Table illustrate that ceRNAR workflow generally outperformed the other tools in terms of sensitivity and PPV in all scenarios when the window size is set to 10 and the correlation cutoff is set to 0.7. It can be mainly observed when the correlation values among correlated genes are relatively high. However, all of the tools could identify valid ceRNA triplets without reporting high false-positive results except JAMI (Fig 3B and S12 Table). GDCRNATools generally had a high sensitivity compared to the other tools, and a lower specificity than ceRNAR, suggesting that it is good for catching ceRNAs but comes with a relatively high rate of false positives. Regarding running time, we used different sample numbers (250, 500, and 100) on a small subset of the pan-cancer dataset with 15 genes which form 105 triplets; we also used different triplet numbers (105, 1,225, and 4,950) on a small subset of the pan-cancer dataset with 250 samples (S4 Fig and S13 Table). Although the ceRNA algorithm was not fast with a large sample size and a large number of triplets compared with SPONGE, GDCRNATools and CERNIA, it was slightly faster than JAMI.
Fig 3
Performance comparisons between ceRNIA, JAMI, SPONGE, GDCRNATools and ceRNAR packages using synthetic data that mimic the real-world case.
(A) Sensitivity and PPV of various tools at different correlation levels for 100 samples of 105 pairs of target genes under four scenarios (1 to 4). (B) Specificity and NPV of various tools at different correlation levels for 100 sample of 105 pairs of target genes under null scenario (scenario 5). “cor_levels” represents four different correlation levels (all: 0.3–0.9; high: 0.8–0.9; low: 0.3–0.4; mid: 0.5–0.7).
Performance comparisons between ceRNIA, JAMI, SPONGE, GDCRNATools and ceRNAR packages using synthetic data that mimic the real-world case.
(A) Sensitivity and PPV of various tools at different correlation levels for 100 samples of 105 pairs of target genes under four scenarios (1 to 4). (B) Specificity and NPV of various tools at different correlation levels for 100 sample of 105 pairs of target genes under null scenario (scenario 5). “cor_levels” represents four different correlation levels (all: 0.3–0.9; high: 0.8–0.9; low: 0.3–0.4; mid: 0.5–0.7).
Application to TCGA cancer cohort datasets
To further validate the applicability and robustness of the ceRNAR algorithm, we also applied the algorithm to two TCGA-derived lung cancer cohorts–TCGA-LUAD and TCGA-LUSC (S14 Table). The top bridging miRNAs and the hub genes among ceRNA triplets are shown in S15 and S16 Tables. Intriguingly, the two cancer cohorts (LUAD and LUSC) shared some common triplets involving 53 miRNAs and 905 ceRNAs, which allowed construction of a miRNA-modulated ceRNA regulatory network (S5 Fig). Among them, PLEKHG6 had the largest number of co-expressed ceRNAs, and of the 53 common miRNAs, the top three miRNAs that bridged over 20 ceRNA pairs were hsa-miR-183-5p, hsa-miR-133a-3p, and hsa-miR-142-5p. MAP4K3 was another common hub gene in both datasets, and its bridging miRNA was hsa-let-7c-5p, around which a regulatory network of corresponding ceRNAs was built (S6A Fig), and the expression level related to the regulatory occurrence of its bridging ceRNAs is displayed in S6B Fig. Since lung cancers share some common molecular characteristics, such results demonstrate the applicability and robustness of the ceRNAR algorithm in multiple cancers or diseases.In addition, we compared our findings against experimentally validated miRNA-gene pairs in the miRSponge database to endorse the potential ceRNAs identified by ceRNAR. As shown in S7 Fig, around 1% (ceRNA pairs that can be validated among total ceRNA founded based on ceRNAR algorithm) of experimentally validated ceRNA triplets were identified in LUAD and LUSC. The low proportion may be attributed to the fact that only 158 ceRNA triplets were analyzed and that those ceRNA triplets may not be expressed in lung tissues. Furthermore, approximately 13–14% of ceRNA triplets with at least one experimentally validated miRNA-target interaction were identified by using the ceRNAR algorithm. A Chi-square test was performed to examine whether the findings from the ceRNAR package were significantly enriched in the TCGA data, and the results showed that the P-values obtained from the LUAD and LUSC datasets were both less than 2.2e-16, suggesting our ceRNAR package can successfully identify previously reported ceRNA triplets.
Discussion
ceRNA events are a newly discovered type of post-transcriptional regulation, and the identification of ceRNA-miRNA triplets using in silico methods is an emerging research area. Therefore, we developed a novel computational algorithm to explore such regulatory events for further biomedical interpretation and application. Our proposed method is based on a simple pairwise correlation approach that considers the miRNA-modulated ceRNA interaction. First, we ranked samples based on their miRNA expression value to include the contribution of miRNA expression and identify which miRNA expression intervals tend to have a higher correlation with pairs of mRNA targets. Secondly, we used a sliding window approach to form more correlation values in a triplet to improve the performance and outcomes of the subsequent statistical approach. Lastly, we applied a cumulation-like approach to sum up the slight changes in correlation values across samples. We used segment clustering to understand the sample clustering in terms of the gene-gene correlations and the miRNA expression intervals so that we could also use sample proportion to support our findings. Several simulations have been conducted for the optimization of the parameters subject to specific ranges of settings, and the robustness of our approach when it does not involve a permutation test has also been evaluated through a simulation study. Connecting with six downstream analyses, our R package may assist researchers to have a deeper understanding of the disease-specific biological regulation and prognostic application for each identified ceRNA-miRNA triplet.Recently, more and more tools have been developed to identify potential ceRNA events. It is important to systematically evaluate the ceRNAR package in comparison with the five other published ceRNA prediction tools—SPONGE [20], CERNIA [21], GDCRNATools [22], JAMI [23], and CUPID [24]—that are expression-based (rather than sequence-based, i.e., spongeScan [25]). We have compared them in terms of several features, such as miRNA-target data sources, study design, ceRNA classes, ceRNA prediction algorithm, and language for implementation (S17 Table). Noting that JAMI is the multi-threading version of CUPID, we decided to keep only JAMI for further analyses. Thus, we used four algorithms, including SPONGE, CERNIA, GDCRNATools, and JAMI, for the comparisons. Notably, the four algorithms and our ceRNAR all used a similar strategy, which is utilizing miRNA-target data sources to identify potential miRNA-gene/lncRNA/pseudogene pairs from other databases. All four of these algorithms are implemented in R. JAMI is based on conditional mutual information, which is particularly useful to capture non-linear associations by estimating the effect of a miRNA on its target pairs through a permutation test [26]. Excepting JAMI, the rest algorithms consider the correlations between miRNA-mediated genes/lncRNA. Although the majority of such algorithms are correlation-based, some differences still exist. For examples, GDCRNATools is based on sensitivity correlation computation through effectively estimating covariate matrices and also considering the impact of a miRNA on its target pairs. SPONGE also uses sensitivity correlation to quantify the impact of a miRNA on its target pairs (i.e., linear partial correlation), but further applies a null model-based p-value computation to estimate potential ceRNA pairs. It is worth mentioning that CERNIA and JAMI consider both MRE- and expression-based data, whereas SPONGE, GDCRNATools and ceRNAR only analyze genome-wide expression data. Since several studies [18, 27, 28] indicate that ceRNA triplets may be observed in a specific range of miRNA expression, such an approach can help to focus on the true positive region with high signal-to-noise ratio instead of missing the ceRNA triplets due to signal dilution by global noise. Our ceRNAR algorithm showed the highest sensitivity in identifying potential ceRNA triplets (Fig 3).Regarding the two online servers, miRTissue_ce [29] and Encori (i.e., starBase v2) [30] are two web servers that integrate ceRNA data sources, ceRNA prediction algorithms, and even some data analyses and visualizations that can be easily accessed by the users. Fiannaca et al. [29] have compared miRTissue_ce and Encori in terms of many features. Here, we further compared these web servers with our ceRNAR based on these features to see whether there is any add-on value that ceRNAR can provide these web services. First, the interactions between miRNAs and target genes in ceRNAR are supported by nine databases, including two experimentally validated miRNA-target databases and seven computationally predicted miRNA-target databases. But Encori and miRTissue_ce are supported by 4 and 8 computationally predicted miRNA-target databases, respectively. Second, Encori uses a hypergeometric test to predict ceRNA, and miRTissue_ce integrates that method with a global test and SPONGE. However, one major disadvantage of the hypergeometric test is that the test requires a predefined p-value threshold to select significant genes. When using differentially expressed genes, such an approach is not suitable to be applied to the whole genome, because the hypergeometric test fails to consider the interactions among genes due to its independence assumption about genes. This is why we present a novel rank-based algorithm considering the contribution of miRNA expression in a ceRNA event and extend the pair-wise correlation approach to identify ceRNA-miRNA triplets using whole genomic information. Lastly, these two web servers can predict many types of ceRNA events, but ceRNAR only focuses on one ceRNA event class (i.e., mRNA-miRNA).In real application, we utilized non-small cell lung cancer (NSCLC) data to evaluate the applicability of ceRNAR. NSCLC accounts for 85% of lung cancer and is one of the most common malignant tumors worldwide [26]. Although there has been progress in successful treatment for NSCLC patients these past several decades, the 5-year survival rate for NSCLC is still relatively low (25%) [31]. Also, the molecular networks involved in NSCLC remain incompletely described in terms of their roles in etiology, progression, and metastasis. Hence, we applied the ceRNAR algorithm to two NSCLC-related cancer datasets in TCGA lung cancer cohorts. Several common miRNAs and ceRNAs identified by the ceRNAR algorithm have also been previously reported by other studies. For example, hsa-miR-183-5p was found to inversely regulate PTPN4, serving as a therapeutic target to suppress the metastatic potential in NSCLC patients [32], and hsa-let-7c-5p was verified to prevent cancer metastasis by degrading its bridging hub ceRNA, MAP4K3 [33]. Although our simulation results suggest that the majority of performance indicators have only slight differences in the four scenarios using both complete and fast versions, and the best parameter setting for window size is 10 and for peak threshold is 0.7, and the fine-tuning of appropriate parameters for non-TCGA datasets still needs to be tested. Nevertheless, these results still demonstrated our proposed method is robust and potentially applicable, allowing it to be extended to studies of other diseases.However, some limitations still existed in our study. First, a smaller sample size of cancer cohorts (i.e., a smaller window size in our case) may lead to less statistical power of the findings. Second, we presumed a linear relationship between the two ceRNAs in each triplet, but in reality, they were not always linearly correlated. Although we have implemented a sliding window approach to capture such relationships, other methods such as mutual information [34] can also be applied. Moreover, the accuracy of the miRNA target prediction databases we used may have affected the definition of putative ceRNA-miRNA triplets and the outcomes of the ceRNAR algorithm because the mechanisms of some miRNA targeting systems have not been fully understood. It is also worth mentioning that our simulation results were based on a predefined covariance matrix. That is the true positive events were from the correlation-based approaches, and thus such events only showed linear relationships among the elements. Notably, such design may lead to the poor performance of JAMI because their algorithm was developed by using the mutual information strategy, which was able to capture non-linear relationships among ceRNA pairs in addition to the linear ones. Lastly, the majority of our findings from the real case study were novel compared to the miRSponge database, although some of the miRNA-targets contained an interaction that was previously experimentally validated. Perhaps further experimental validation of those triplets that contained one experimentally validated miRNA-target interaction should be prioritized to increase the robustness of our algorithm and the reliability of the novel findings. Therefore, the consideration of all types of miRNA sponges, the amount of MREs, multiple miRNAs that may compete for the same pair of target genes, nontrivial correlations which involve the comparison of pairwise correlation and pairwise partial correlation, and the minimization of computational time are important key areas for further optimization and extension of the ceRNAR algorithm.In summary, ceRNAR is a promising tool for the recognition of ceRNA-miRNA triplets and ceRNA-ceRNA interaction networks in many human diseases, and hence will speed up our knowledge of the regulatory mechanisms and functions of ceRNA-miRNA triplets in the pathogenesis of disease, including cancers.
Materials and methods
Pipeline of ceRNAR
The ceRNAR package is written in R (version 4.0.5) and is available in the Github repository. The main pipeline of ceRNAR is illustrated in Fig 4 and contains three major components for the identification and analysis of ceRNA-miRNA triplets:
Fig 4
An overview of our ceRNAR package.
This package includes three main parts. First, “Input data” has three functions: ceRNAputativePairs() for extracting curated miRNA-target lists, ceRNATCGA() for retrieving TCGA data, and ceRNAcustomize() for importing customized data. Second, “Identification of miRNA-ceRNA triplets” involves ceRNApairFiltering() and SegmentClusteringPlusPeakMerging() and is wrapped into a ceRNAmethod() function. Lastly, “Downstream analyses” contains six functions for different tasks, that is, ceRNAFunction(), ceRNAModule(), ceRNASurvival(), ceRNALocation(), ceRNAValidate() and ceRNAIntegrate().
Data preprocessingIdentification of ceRNA-miRNA tripletsDownstream analyses
An overview of our ceRNAR package.
This package includes three main parts. First, “Input data” has three functions: ceRNAputativePairs() for extracting curated miRNA-target lists, ceRNATCGA() for retrieving TCGA data, and ceRNAcustomize() for importing customized data. Second, “Identification of miRNA-ceRNA triplets” involves ceRNApairFiltering() and SegmentClusteringPlusPeakMerging() and is wrapped into a ceRNAmethod() function. Lastly, “Downstream analyses” contains six functions for different tasks, that is, ceRNAFunction(), ceRNAModule(), ceRNASurvival(), ceRNALocation(), ceRNAValidate() and ceRNAIntegrate().To reduce the computational complexity and time cost, the interactions between miRNAs and target genes were based on two experimentally validated miRNA-target databases (miRTarBase [35] and miRecords [36]) and seven computationally predicted miRNA-target databases (DIANA-micro T-CDS [37], EIMMO [38], miRDB [39], miRanda [40], PITA [41], RNA22 [42] and TargetScan [43]). In the default settings, only those interactions that were validated by experiments and/or predicted by more than half of the databases are retained as target miRNAs and target genes (S8A Fig). Conceptually, the ceRNAR algorithm iteratively goes over each miRNA-target list and runs through each mRNA pair in a list to evaluate the chance of the potential ceRNA event involved. For a specific triplet (i.e., a miRNA and its two targets), their expression vectors are extracted from the original expression matrix. Therefore, a miRNA expression vector, miRNAm = [mm1, mm2, …], and two mRNA expression vectors, mRNAi = [genei1, genei2, …] and mRNAj = [genej1, genej2, …], are used as inputs into the ceRNAR algorithm to iteratively evaluate whether each mRNA pair is a potential ceRNA event (S8B Fig).
Data preprocessing
To prepare expression data for further analyses, ceRNAR can automatically retrieve TCGA data, including mRNA expression, miRNA expression, and survival data, by entering the cancer acronym [44], but it also supports the use of customized miRNA and mRNA expression matrices that are pre-normalized and formatted according to the instructions. In ceRNAR, we implement two functions to fulfill these two approaches: ceRNATCGA and ceRNACustomize.
Identification of ceRNA-miRNA triplets
To identify miRNA-ceRNA triplets (defined here as a miRNA and two target genes) from expression profiles at a specific miRNA expression level, the ceRNAMethod function can be used, and it contains three modules sequentially: ceRNApairFiltering, SegmentClustering, and PeakMerging (Fig 5). We have two assumptions in this study: (1) the expression levels of two target genes tend to be highly correlated when a possible ceRNA event occurs; (2) such events between target genes of a certain miRNA occur at a specific expression interval of that miRNA. However, for each ceRNA triplet from the real data, it is difficult to know which levels of miRNA expression (low, middle, or high expression intervals) will lead to a high correlation between a pair of target genes. Notably, for one miRNA, the high correlation values between two target genes can only be observed in a specific range of the miRNA expression. Definitely, the expression level of one miRNA can be regulated by many factors, such as compensation and/or other interactors. However, with our current understanding of all miRNAs, it is not feasible to consider all potential regulators of one specific miRNA at the same time. Therefore, for one single miRNA, we adapted the approach of utilizing its expression level as the final output instead of considering all possible confounding factors, which can be regarded as the hidden layers of the miRNA expression value.
Fig 5
Three main modules involved in the identification of ceRNA-miRNA triplets.
First, ‘ceRNA pair filtering’ contains sample sorting based on a miRNA expression vector, data augmentation in terms of correlation calculation through a sliding window, and permutation test by random walk. Next, ‘Segment clustering’ is based on circular binary segmentation to identify certain miRNA expression intervals (i.e., segments) with the highest correlation between potential ceRNA pairs. Short segments defined by our criteria are further merged and whether a segment is a peak or a trough is defined. Lastly, ‘Adjacent peak merging’ checks whether a triplet is a candidate ceRNA through two filtering conditions.
Three main modules involved in the identification of ceRNA-miRNA triplets.
First, ‘ceRNA pair filtering’ contains sample sorting based on a miRNA expression vector, data augmentation in terms of correlation calculation through a sliding window, and permutation test by random walk. Next, ‘Segment clustering’ is based on circular binary segmentation to identify certain miRNA expression intervals (i.e., segments) with the highest correlation between potential ceRNA pairs. Short segments defined by our criteria are further merged and whether a segment is a peak or a trough is defined. Lastly, ‘Adjacent peak merging’ checks whether a triplet is a candidate ceRNA through two filtering conditions.
The ceRNApairFiltering method
We adopted the sliding window approach to identify the correlation patterns of the two target genes within a specific range of expression values for the miRNA. That is the reason why we ranked the samples based on their miRNA expression value from each triplet to identify the correlation patterns and provided the number of samples that meet the criteria. Therefore, the purpose of this function is to identify ceRNA-miRNA triplets based on the Pearson correlation coefficient through the sliding window approach (S9 Fig) and a running sum statistic for such values by the random walk approach (S10 Fig). First, the samples are sorted based on their miRNA expression levels. For a putative triplet (genei, miRNAm, genej) among N samples, the correlation coefficients of gene expression values between mRNA and miRNA are calculated within each window (i.e., a length of sample size that is always less than N) with predefined window size (w) as follows:
Here k is the number of windows and a predefined integer. Technically, the correlation value between two genes is calculated using the gene expression values of all samples. By using a sliding window approach [45], we can artificially create different varieties of a real dataset to increase its size, which is a sort of data augmentation technique [46].Because the accumulated changes in terms of correlation values between two genes among samples may tend to increase the chance that a gene pair is highly correlated and, further, will affect a specific phenotype, we borrowed the concept of gene set enrichment analysis (GSEA) [47], which captures the accumulated changes in the expression of all genes within a pathway through a random-walk method, to identify a significant triplet according to the number of the samples enriched in such an event. The main idea of the GSEA algorithm is to understand whether the differentially expressed genes are significantly enriched in the samples belonging to the same phenotype (case or control). First, the differentially expressed genes were ranked by using the differences in expression level between the two phenotypes. Next, the GSEA algorithm gave a positive score to the genes located in the pathway, whereas a negative score was assigned to the genes outside of the pathway. One possible scenario to obtain a high score is that the differentially expressed genes were clustered and enriched in one phenotype instead of being randomly distributed. This scenario is the same as what we want to identify for the ceRNA triplet. That is, the highest correlation values were observed in a specific range of the miRNA expression and thus we used the two statistics, Pobey(k) and Pviolate(k) to identify the range (S11 Fig). Conceptually, to calculate a score (S) to represent the enriched correlation levels among samples against a specific phenotype, we first rank ordered the k windows to form L = {miRNA1, miRNA2, …, miRNAk} based on the average miRNA expression within each window:
Next, the S is computed by walking down the window list to evaluated Pobey(k), which is the proportion of samples whose gene correlations are over 0.3 (i.e., ), weighted by their corresponding correlation and Pviolate(k), which is the proportion of the rest of the samples at a given ranking window position k across all samples. The formulas are as follows:
S is then defined as the maximum distance from zero of the substations of Pobey(k) from Pviolate(k). If the samples with similar correlation values are not enriched at a particular miRNA expression interval, it means those gene pairs are not biologically relevant to compete with miRNA in a miRNA-gene triplet; that is, there is no ceRNA event observed in this triplet (S11 Fig). Finally, we accessed the significance of an observed S by comparing it with the theoretical S computed by randomly permutating the expression of the candidate miRNAm 1,000 times to provide assessment of significance. When an entire sample of potential triplets is evaluated, these p-values will be adjusted by the false discovery rate, which provides the estimated probability of whether a triplet within an entire set of potential triplets is a false positive finding or not. After that, we report the ceRNA pairs with statistical significance of the observed S (e.g., adjusted p-value <0.05) as significant ceRNA interactions. For one miRNA, a low score S will be observed if the two target genes have no correlation within a specific range of miRNA expression. On the other hand, the score S will be high if a large proportion of the samples showed high correlation among the two target genes (Pobey(k) is high). Consequently, we can use the score S to identify a ceRNA triplet when S is high, and a statistical approach was performed to ensure that S cannot be identified randomly.
The SegmentClustering method
The motivation of the segment clustering is to group the samples showing high correlation between the expression levels of the two target genes into one single cluster. In our analysis, we divided the samples into small groups (i.e., a window) to calculate the correlation values of the two target genes. Accordingly, we may identify several different groups showing high correlations that actually can be clustered into one single cluster because of their similar correlation values. Therefore, in this method, the concept of a circular binary segment algorithm, which was originally designed for change-point problems such as the identification of copy number variation, is used [48] to evaluate whether those small windows showing high correlation values should be clustered into a larger group. This method has been widely used in the analysis of copy number variations (CNVs) by many algorithms [49, 50]. Since several previous studies [18, 27, 28] indicated that ceRNA triplets may be observed in a specific range, the purpose of such an approach is to explore the clustering patterns of samples in terms of their gene-gene correlation values per triplet and then group samples with similar correlation values so that a certain miRNA expression interval with the highest correlation within a ceRNA pair can be observed. This algorithm starts with all segments (i.e., several intervals of rank-ordered miRNA expression) identified from the whole dataset. Similar to the previous method, the samples are sorted by the miRNA expression values. A recursive test for the change-points is calculated based on the correlation values of each gene-miRNA pair between each set of two neighboring regions of rank-ordered miRNA expression, and it stops when no significant changes can be found in any two segments. The maximal t-statistic with a permutation reference distribution is chosen to obtain the corresponding p-value. Let X1, …,X be the correlation coefficients of two genes, which are indexed by the corresponding miRNA expression of the n samples being studied. The test statistic is given by , where Z is the two-sample t-statistic to compare the mean of the correlation of two genes with the index, which refers to the position within the rank-ordered miRNA expression from i+1 to j, to the mean of the correlation of the rest of the genes. That is,
where are the partial sums. If the p-values are smaller than a threshold level α (typically 0.01), a change is declared to be statistically significant. The locations of the change-points as the i and j that maximize the test statistic are also estimated by using either Monte Carlo simulations [51] or the approximation approach [52]. After performing the segment clustering approach, only a few groups of samples showing high correlation remain for further analyses. We have a basic assumption here for the ceRNA triplet: the correlation values should be stable across the sliding window approach and thus the correlation values should not be bouncing up and down within a small sample size. Following this assumption, we performed the peak merging step to ensure two nearby peaks were merged into one under certain criteria.
The PeakMerging method
This method is designed to prevent smashed segments resulting from noise. First, two segments are merged when the difference of their correlation values is smaller than a predefined value, and whether the finalized segments are peaks or troughs is defined compared to the baseline. Then, the Fisher transformation is performed to test the difference between two adjacent peaks by comparing their differences against the null hypothesis. Two adjacent peaks are further merged if there is no statistically significant difference between them. Because we presumed it is less likely that two genes are highly correlated to compete for a miRNA at more than two miRNA expression intervals, the triplet was abandoned when more than two peaks occurred in such a relationship. Lastly, candidate ceRNAs are finally selected when their correlation pattern contains a peak with a correlation value over a predefined threshold value (0.3, 0.5, or 0.7).The final output of the ceRNAMethod function provides information on each miRNA, its candidate ceRNA pairs, its peak position within the rank-ordered miRNA expression interval, and the number of samples involved in such ceRNA interactions (S12 Fig).
Downstream functional analyses
To characterize the biological functions and pathways of identified ceRNA pairs, we implemented the ceRNAModule function to generate ceRNA modules from their interaction networks. For the ceRNA modules, the ceRNAFunction function is used to perform functional enrichment analysis based on two ontology databases, the Gene Ontology database (GO, http://geneontology.org/) [53] and the Kyoto Encyclopedia of Genes and Genomes Pathway Database (KEGG, https://www.genome.jp/kegg/) [54]. Survival analysis has been widely used in biomedical fields to indicate whether the ceRNAs in the discovered interaction module are associated with the survival of cancer patients. Hence, in ceRNAR, we implemented the ceRNASurvival function to perform survival analysis. First, the risk score of each sample is calculated using a multivariate Cox model. All the samples (i.e., different patients) are divided into high or low risk groups based on their median risk scores. The Kaplan-Meier method with a log-rank test is used to test and visualize the difference between the high and low risk groups. Additionally, the ceRNALocation function allows users to further visualize the expression level of a specific miRNA when modulated by a specific ceRNA. We also implemented an integrated function (ceRNAIntegrate) to combine our results with other state-of-the-art tools, such as SPONGE [20] and JAMI [23], and a validation function (ceRNAValidate) based on the miRSponge database. The graphical outputs of the above-mentioned functions are presented in S13–S15 Figs.
Simulation study
In the simulation study, two types of expression data (ceRNA and miRNA) of 100 samples were generated. In each triplet, we presumed 10–40% of samples have correlated genes whereas the rest of the samples do not. The simulated expression profile of 100 samples of miRNA was distributed uniformly between 0 and 1 and was used to sort our samples. We simulated the gene expression profiles of these target gene pairs under two situations. The first is a naïve profile of the correlated genes simply generated from a multivariate normal distribution with a mean value of 0 and a covariance matrix whose entries are 0.9, while the expression distribution of the uncorrelated genes was specified by setting its mean and variance to zero. Another simulation scenario is mimicking a real-data profile (9,835 pan-cancer samples) generated from a multivariate normal distribution with a randomly selected mean value (±2, ±1 and 0), and a covariance matrix whose entries are also randomly selected from 0.3 to 0.9 (correlation level: all), while the expression distribution of the uncorrelated genes was specified by setting its mean (also randomly selected) at ±2, ±1, and 0, and its variance (randomly selected) at 0 to 0.2 (S16 Fig). We also specified three correlation levels (high = 0.8–0.9, mid = 0.5–0.7, and low = 0.3–0.4) and randomly selected values within them to construct the covariance matrix. The normality test for sample distribution of each gene was conducted by the Anderson-Darling method [55].Five scenarios were considered and are summarized in S1 Table. The first three scenarios were designed for a single peak (i.e., the highest correlation value of the target genes compared to the baseline) occurring at different locations (i.e., different miRNA expression intervals): lower miRNA expression (left, scenario 3), higher miRNA expression (right, scenario 2), and average miRNA expression (center, scenario 1), which represents specific correlation coefficient values existing at different expression levels of miRNA. The fourth scenario was designed for considering the occurrence of two peaks. A null scenario was also compared to test the validity of the other conditions. Two parameters were set under each scenario: window size (10%, 20%, 30%, and 40%) and the threshold at which to define a peak (0.3, 0.5, and 0.7). Each scenario was simulated 1,000 times. We conducted these simulations using either the complete or fast versions of the algorithm, the latter of which reduced computational complexity by omitting the random walk step. The proportion of correlated samples (10%, 20%, 30%, and 40%) was also analyzed.
Real data application for tools comparison and validation
In order to compare many aspects, including the computational cost and the quality of the results, of ceRNAR with the other tools (SPONGE [20], CERNIA [21] and JAMI [23]), which similarly used miRNA and paired target gene expression to identify gene-miRNA-gene triplets, we used subsets of the TCGA pan-cancer atlas with different sample sizes and gene/triplet numbers. We ran these tools with default parameter settings in parallel processing with 8 cores with varying sample sizes and numbers of genes.Regarding real data application, we retrieved two sequencing datasets (TCGA-LUAD and TCGA-LUSC) from TCGA [56] to demonstrate the potential application of our proposed algorithm. These datasets are all retrieved from public domains (GDG data portal: https://portal.gdc.cancer.gov/). The TCGA-LUAD dataset contained 510 samples from lung adenocarcinoma patients, and the TCGA-LUSC dataset contained 476 samples from squamous cell carcinoma patients. To validate the results, the potential ceRNAs identified by our tool were validated experimentally using the miRsponge database [57], and the ceRNAs shared by these two datasets were further analyzed.
Normality assessment through Anderson-Darling test on sample distribution based on TCGA pan-cancer atlas across genes.
(TIFF)Click here for additional data file.
Performance of our proposed method in four scenarios (I to IV) using naïve simulated data with the complete version.
(A) Synthetic data of correlated genes were generated from multivariate normal distribution with a mean value of 0 and a covariance matrix whose entries are 0.6. (B) Synthetic data of correlated genes were generated from multivariate normal distribution with a mean value of 0 and a covariance matrix whose entries are 0.3. The numbers in blue represent the average number of identified ceRNA-miRNA triplets after 100 simulations.(TIFF)Click here for additional data file.
The performance of our proposed method in four scenarios (I to IV) uses simulated data that mimics real-world data distribution with the complete version.
(A) From a covariance matrix with correlation values ranging from 0.3 to 0.9. (B) From a covariance matrix with correlation values ranging from 0.8 to 0.9. (C) From a covariance matrix with correlation values ranging from 0.5 to 0.7. (D) From a covariance matrix with correlation values ranging from 0.3 to 0.4.(TIFF)Click here for additional data file.
Run time comparisons between published tools and our package.
(A) Run time for different sample numbers on a fixed set of genes using real data. (B) Run time for different numbers of genes (i.e., different numbers of triplets) on a fixed number of samples using real data.(TIFF)Click here for additional data file.
ceRNA regulatory network observed in TCGA-LUAD.
The size of each dot represents the number of bridged miRNAs per ceRNA.(TIFF)Click here for additional data file.
The extended analyses of the observed ceRNA pairs that are related to miRNA hsa-let-17e-5p.
(A) The network of overlapping ceRNA pairs in both TCGA datasets. (B) The distribution of miRNA expression at which specific ceRNA interactions occur. The location of each rectangle indicates the miRNA expression value at which particular ceRNA pairs interact with miRNA has-let-17e-5p, and the depth of the color in each rectangle represents the number of samples that have such ceRNA-miRNA interaction. FPKM, fragments per kilobase million.(TIFF)Click here for additional data file.
Experimental validation of real case study based on miRSponge database.
(A) TCGA-LUAD (The Cancer Genome Atlas Lung Adenocarcinoma). (B) TCGA-LUSC (The Cancer Genome Atlas Lung Squamous Cell Carcinoma).(TIFF)Click here for additional data file.
The basic concept of our idea is illustrated graphically.
(A) Lists of curated miRNAs and their targets are verified from nine miRNA target databases based on the AnamiR package and selected by being either experimentally validated or present in over half of the prediction databases. (B) Our algorithm iteratively evaluates whether each mRNA pair is a potential ceRNA event in each miRNA-target list.(TIFF)Click here for additional data file.
A graphic overview illustrates sorting and correlation calculation in the ceRNApairFiltering method.
First, samples [S1, …SN] based on the expression vector of each miRNA per triplet (genei, miRNAm, genej) are sorted. Second, the window size is defined, and the correlation between gene pairs among samples for the first window is calculated. Next, the correlation between gene pairs among samples for the second window is calculated, and so on. Finally, a new correlation vector per triplet is created.(TIFF)Click here for additional data file.
A graphic overview illustrates the random walk and permutation test in the ceRNApairFiltering method.
First, the k correlation values per triplet are ordered according to the average miRNA expression, . The score (S) reflects the degree to which the correlation of a gene pair is overrepresented across all values (i.e., high/low/moderate miRNA expression) of the entire set of ranked miRNA expression levels. It is calculated by walking down the ranked value, increasing the score when encountering a specific miRNA expression value with the correlation of a paired gene over 0.3. If the correlation is less than 0.3 (threshold), the score will be negative. The magnitude of the score depends on the number of samples supported by such correlation values within a window. Subsequently, the permuted miRNA expression data are generated 1,000 times to create a null distribution of the S, and then the empirical P-value is calculated accordingly.(TIFF)Click here for additional data file.
The meanings of score S are based on the values of Pobey and Pviolate.
(A) Samples with the highest correlation between target genes are enriched at a higher miRNA expression value, supporting that these two target genes have a higher chance to compete with this miRNA. (B) These target gene pairs do not represent a biologically relevant correlation with the competition of this miRNA, suggesting that no ceRNA event occurs in this triplet.(TIFF)Click here for additional data file.
Main tabular output in CSV format.
Five columns are involved: miRNA name, candidate ceRNA pairs, the start and end of each miRNA expression interval, and the number of segments.(TIFF)Click here for additional data file.
The graphical bubble outputs of the ceRNAFunction function are based on over-representation analysis (ORA).
(A) Based on the KEGG database. (B) Based on the ‘cellular component’ (CC) categories in the gene ontology (GO) database. (C) Based on the ‘molecular function’ (MF) categories in the GO database. (D) Based on the ‘biological process’ (BP) categories in the GO database.(TIFF)Click here for additional data file.
The graphical bar outputs of the ceRNAFunction function based on over-representation analysis (ORA).
(A) Based on the KEGG database. (B) Based on the gene ontology (GO) database and stratified by three GO levels: cellular component (CC), molecular function (MF), and biological process (BP).(TIFF)Click here for additional data file.
The graphical outputs of the ceRNAModule, ceRNASurvial, and ceRNALocation functions.
(A) Network analysis among candidate ceRNAs for a specific miRNA. (B) Survival analysis for a candidate ceRNA pair targeted by a specific miRNA. (C) A mix of a box plot and bar plot represents the proportion of all candidate ceRNA pairs targeted by a specific miRNA at their corresponding miRNA expression range.(TIFF)Click here for additional data file.
Summary statistics of the gene expression profile of 9,835 samples in TCGA pan-cancer atlas.
(A) The density plots represent the sample distribution of observed z-scores of 10 genes. (B) The density plot and boxplot represent the sample distribution of observed z-scores of 16,323 genes. (C) The pie chart shows the proportion of the Pearson correlation values of 16,323 genes. All the expression values are log-transformed and z-transformed across cancers. The median expression values of samples across cancers are used for summary statistics.(TIFF)Click here for additional data file.
A summary of the five scenarios.
(XLS)Click here for additional data file.
Performance of the ceRNAR package in five scenarios under the complete version using synthetic data with a fixed correlation value (0.9).
(XLS)Click here for additional data file.
Performance of the ceRNAR package in five scenarios under the fast version using synthetic data with a fixed correlation value (0.9).
(XLS)Click here for additional data file.
Performance of the ceRNAR package with different proportions of correlated samples in scenario III under the fast version using synthetic data with a fixed correlation value (0.9).
(XLS)Click here for additional data file.
Performance of the ceRNAR package with different correlation cutoffs and merging sizes with a fixed correlation value (0.9) and a fixed window size (40).
(XLS)Click here for additional data file.
Performance of the ceRNAR package in five scenarios under the complete version using synthetic data with a fixed correlation value (0.6).
(XLS)Click here for additional data file.
Performance of the ceRNAR package in five scenarios under the complete version using synthetic data with a fixed correlation value (0.3).
(XLS)Click here for additional data file.
Performance of the ceRNAR package in five scenarios under the complete version using synthetic data with randomly selected correlation values from 0.3 to 0.9.
(XLS)Click here for additional data file.
Performance of the ceRNAR package using synthetic data with randomly selected correlation values from 0.8 to 0.9.
(XLS)Click here for additional data file.
Performance of the ceRNAR package in five scenarios under the complete version using synthetic data with randomly selected correlation values from 0.5 to 0.7.
(XLS)Click here for additional data file.
Performance of the ceRNAR package in five scenarios under the complete version using synthetic data with randomly selected correlation values from 0.3 to 0.4.
(XLS)Click here for additional data file.
Performance comparison among tools.
(XLS)Click here for additional data file.
Running time comparison among tools.
(XLSX)Click here for additional data file.
Summary of the application to the TCGA lung cancer datasets.
(XLS)Click here for additional data file.
Top bridging miRNAs and their corresponding hub gene among ceRNAs in the two TCGA datasets.
(DOCX)Click here for additional data file.
Hub genes among ceRNA triplets in the two TCGA datasets.
(DOCX)Click here for additional data file.
Comparisons with the state-of-art tools.
(DOCX)Click here for additional data file.16 Mar 2022Dear Prof. Lu,Thank you very much for submitting your manuscript "ceRNAR: an R package for identification and analysis of ceRNA-miRNA triplets" for consideration at PLOS Computational Biology.As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.The experts that have reviewed your manuscript has raised a number of convergent concerns, including major ones about the lackof a comparison with existing methods and the insufficient description of the algorithmic procedure.You are invited to review the manuscript taking all revewer's recommendation into account, including a systematic comparisonwith established method to demonstrate in objective terms the additional insight provided by ceRNAR.Upon resubmission, your manuscript will be returned to the three reviewers.C. MichelettiWe cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.When you are ready to resubmit, please upload the following:[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).Important additional instructions are given below your reviewer comments.Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.Sincerely,Cristian MichelettiGuest EditorPLOS Computational BiologyIlya IoshikhesDeputy EditorPLOS Computational Biology***********************The experts that have reviewed your manuscript has raised a number of convergent concerns, including major ones about the lackof a comparison with existing methods and the insufficient description of the algorithmic procedure.You are invited to review the manuscript taking all revewer's recommendation into account, including a systematic comparisonwith established method to demonstrate in objective terms the additional insight provided by ceRNAR.Upon resubmission, your manuscript will be returned to the three reviewers.C. MichelettiReviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: Authors present ceRNAR, an algorithm, written in R, that is able to predict ceRNA-miRNA triplets. Competitive endogenous RNA (ceRNA) are RNA molecules involved in the gene regulation related to several biological and pathological processes. In particular, they form with a miRNA a small interaction network called ceRNA-miRNA triplets, composed of two RNAs and one miRNA. ceRNAR is based on a correlation approach, and moreover it provides some software modules in order to make downstream analysis over the predicted ceRNA networks, including gene enrichment and survival analysis. Experimental tests were carried out using both simultated and real case datasets. The results, in terms of accuracy, sensitivity and specificity, on simulated studies are good.The paper is clear and well written. English is good. In my opinion, the experimental part of the paper needs to be improved.Major remarks:- It is not clear what is the ground truth in your simulation studies. You should better explain that. Moreover, why did you not consider repository of experimentally validate ceRNA interactions collected in databases like miRsponge (http://bio-bigdata.hrbmu.edu.cn/miRSponge/) in order to validate ceRNAR? The simulation study is not enough.- Authors did not compare ceRNAR with other existing ceRNA predictors. Although in the Introduction they cited several similar tools, no comparison has been done. For that reason, authors shoud add comparison with tools such as SPONGE [1], CERNIA [2], GDCRNATools [3], spongeScan (http://spongescan.rc.ufl.edu/), and specify the added value ceRNAR can provide with regards to web serivices such as miRTissue_ce [4] and Encori (https://starbase.sysu.edu.cn/).- The output of ceRNAR should be better described, especially the graphic outputs briefly seen in Fig.4Minor remarks:Table 1 is reffered to TCGA datasets, not to the simulation dataset. Please fix it[1] List M, Dehghani Amirabad A, Kostka D, Schulz MH. Large-scale inference of competing endogenous RNA networks with sparse partial correlation. Bioinformatics. 2019[2] Sardina DS, Alaimo S, Ferro A, Pulvirenti A, Giugno R. A novel computational method for inferring competing endogenous interactions. Brief Bioinforma. 2017[3] Li R et al. GDCRNATools: an R/Bioconductor package for integrative analysis of lncRNA, miRNA, and mRNA data in GDC. Bioinformatics 34(14):2515–2517, 2018[4] Fiannaca, A., La Paglia, L, La Rosa, M et al. miRTissue ce: extending miRTissue web service with the analysis of ceRNA-ceRNA interactions. BMC Bioinformatics 21, 199 (2020)Reviewer #2: The authors propose a computational pipeline to identify ceRNA-miRNA triplets. Specifically, the idea tested here is based on (a) sorting the datapoints based on the miRNA expression level and (b) computing the correlation between expression of genes using the same sorting. This implies a linear relationship between ceRNAs but does not make any assumption on the dependence of the ceRNAs expression level as a function of the miRNA expression level. At the second stage, a segment clustering method is used, which takes into account the ranked miRNA expression levels. Finally, a peak merging is performed to remove false positives. The method is tested on a synthetic dataset and then applied to a dataset taken from the cancer genome atlas.The discussed approach is interesting since it promises to be much faster than other approaches proposed in the literature. In addition, the authors provide an implementation. However, the manuscript is poorly readable and it is very difficult to appreciate the performance of this method compared with other approaches. I thus think publication is premature at this stage.What is crucially missing are (a) comparisons of the reported method with other methods, perhaps done on the synthetic dataset, and (b) a fair assessment of the accuracy of the method (PPV and sensitivity) on the real dataset. In the current form, it is difficult to see the advantage of this method with respect to competitors.Figure 2 reports results for the synthetic dataset. It is not clear from this figure how the critical parameters (windows size and thresholds) can be optimized in absence of a reference result. Indeed, as fairly written in the discussion, the choice of these parameters in real applications is still to be fine tuned.The comparison with other identified triplets is merely anecdotal, with a quick mention of two previously reported triplets, but there is no systematic analysis performed. In addition, the authors should compare their method with alternative methods on their synthetic datasets, providing a quantitative assessment on the performance difference (both in terms of computational cost and quality of the results).Finally, the text should be better proof read. There are a number of incorrect sentences that makes it difficult to read. E.g., "there is still needed for a tool" (abstract) "simulations of different parameters" (line 134; perhaps "simulations with different parameters"?). Some acronyms are defined after they are use and also makes the paper difficult to read.Reviewer #3: With ceRNAR, Hsiao et al. present a new R package for inferring competing endogenous RNA triplets from paired gene and miRNA expression data. In contrast to prior approaches that leverage conditional mutual information or partial correlation to assess the mutual influence that ceRNAs exert on each other via miRNA binding competition, ceRNAR considers a direct correlation approach for which miRNAs with similar expression are grouped in windows. The likely intention behind this is to fix the miRNA expression to then investigate the competition in absence of miRNA changes. I have the following comments:#Major:- The basic intuition of the method is not explained in sufficient detail. Only the sentence "we consider gene expression and known interactions among miRNA-gene pairs at a specific miRNA expression level" point to this but does not explain the rationale behind this approach. It is left to the reader to reverse engineer the underlying hypothesis from the method section.- While the idea is certainly interesting, I'm not convinced that fixing a miRNA expression level works the way the authors expect. If the idea is to remove the influence of miRNAs one has to consider that miRNA expression changes are dynamic. Several scenarios are plausible for observing a given miRNA expression level. For instance, the miRNA might have been upregulated together with a ceRNA which then compensates for the extra miRNA copies via miRNA binding / sponging. The authors should describe in detail in which scenarios their method is likely to be successful and where it is not.- The manuscript describes two hypotheses (i) miRNA tends to be sequestered by the miRNA with the higher expression level and (ii) miRNA has a greater affinity for the mRNA with more MRE. It should be highlighted that likely both effects are relevant and mixed.- The manuscript does not describe the state of the art sufficiently. Several tools such as CUPID / JAMI or SPONGE are missing here.- Furthermore, a performance comparison of ceRNAR to these tools which infer ceRNA-miRNA triplets is missing.- Most ceRNA pairs share more than one miRNA and should ideally be considered together as e.g. in the SPONGE method. This is not feasible here since we can sort only for one miRNA but the limitation should be discussed.- The method descriptions are not easy enough to understand, e.g. in "The ceRNApair Filtering method" it is not explaind what is meant by "window", i.e window of miRNA expression values in the sorted expression vector.- The whole intuition behind Pobey and Pviolate is lost to me. This makes it hard to understand what the method aims to show here. Please revise this section.- Similarly, I fail to understand the motivation for the segment clustering and peak merging. What peaks are we talking about?- Regarding the simulation: why is the mean and variance of uncorrelated genes set to zero? This would mean that no uncorrelated genes are expressed.- The simulation setting is way too simplistic. Here, the manuscript only assumes a correlation of 0.9 between ceRNA pairs but in reality this can take any value. More realistic would be to set an random correlation of 0.3-0.9 here for each pair. Realistically, gene pairs that are not ceRNAs will also have a meaningful positive correlation which is why typically conditional mutual information or partial correlation needs to be used to determine the miRNA contribution in the triplet. I don't think the simulation results are hence reflecting the performance on real-world data. A more realistic simulation would be needed for a fair method comparison though (see my comment above).- I could not find any open source license in the github repository. Please add one.#Minor:- The manuscript states that "the partial correlation approach considers the expression of the shared miRNAs of two ceRNAs when calculating their correlation and ignores the interaction strength of them". I disagree here. Sensitivity correlation which is computed as the corrlation of two ceRNAs minus the partial correlation accounting for the miRNA which is referenced here as [14] certainly captures the effect size but albeit not the significance of an interaction. A measure of significance for sensitivity correlation was later suggested in 10.1093/bioinformatics/btz314.- Figure 4: text is too small to read anything**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: YesReviewer #3: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: NoReviewer #3: Yes: Markus ListFigure Files:While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at .Data Requirements:Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.Reproducibility:To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols5 Jul 2022Submitted filename: 5.rebuttal_letter_cernar_plos edited.docxClick here for additional data file.7 Aug 2022Dear Prof. Lu,Thank you very much for submitting your manuscript "ceRNAR: an R package for identification and analysis of ceRNA-miRNA triplets" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.Dear Prof. Tzu-Pin Lu,The consulted experts have evaluated favourably the revised version of your manuscript and I am ready to recommend its acceptance into Plos. Comput. Biol.Towards that I wish to ask you to implement the minor changes that reviewer 3 has constructively recommended to improve the clarity and completeness of the manuscript.The recommended textual changes are very specific and I trust that they can be implemented quickly and with modest effort.Yours sincerely,C. MichelettiPlease prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.When you are ready to resubmit, please upload the following:[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).Important additional instructions are given below your reviewer comments.Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.Sincerely,Cristian MichelettiGuest EditorPLOS Computational BiologyIlya IoshikhesDeputy EditorPLOS Computational Biology***********************A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:[LINK]Dear Prof. Tzu-Pin Lu,The consulted experts have evaluated favourably the revised version of your manuscript and I am ready to recommend its acceptance into Plos. Comput. Biol.Towards that I wish to ask you to implement the minor changes that reviewer 3 has constructively recommended to improve the clarity and completeness of the manuscript.The recommended textual changes are very specific and I trust that they can be implemented quickly and with modest effort.Yours sincerely,C. MichelettiReviewer's Responses to QuestionsComments to the Authors:Please note here if the review is uploaded as an attachment.Reviewer #1: Authors succesfully answered my remarks.Reviewer #2: The manuscript is significantly improved. In my opinion it is suitable for publicationReviewer #3: The authors have done an admirable job improving the manuscript and in responding to the reviewer requests. Many of my previous questions have been answered. The underlying hypotheses are now much more clearly defined and the methods are better explained, especially through the figures that have been added. I would nevertheless recommend to include some of the explanations from the reviewer response into the manuscript, e.g. where the authors explained the intuition behind Pobey and Pviolate as well as the segment clustering since I believe this will make it much easier for readers to understand. The comparison with related tools that was now included offers a good overview of existing approaches. A few comments remain to be addressed:- Figure S7: the caption does not explain what the legend labels mean. What is the meaning of no_validated, validated_both, validated_each?- In a response to reviewer 1 ("Authors did not compare ceRNAR with other existing ceRNA predictors"), the authors claim that all five tested methods are correlation based. Just to clarify this I want to point out that mutual information and correlation are two distinct concepts, meaning that CUPID and JAMI which employ conditional mutual information can not be considered correlation-based methods.- Table S17: While I really appreciate the comparison table for the different tools, I'd like to point out some issues. Some of the tested tools have no built-in "miRNA-targets data sources", thus it appears unsuitable to list specific data sources here. One should specify that the tools are flexible. For example saying "user-provided, e.g. miRecords…". The table probably needs a caption since some rows are unclear. For example, what is meant by the row "TCGA expression". Most tools are agnostic to the source of the expression data. Similarly, it is not fully clear what is meant by "case-only".- The simulation experiment that is meant to show the performance of the various too-ls is based on a pre-built covariance matrix. This approach leads to linear relationships, giving correlation-based methods a decisive advantage here. The poor performance of JAMI can be attributed to this, as no non-linear relationships can be discovered here. However, simulating non-linear relationships is not realistically feasible and the authors should just discuss openly that this evaluation is unfair for JAMI.- Figure 3: The caption does not explain the different scenarios.**********Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: YesReviewer #3: Yes**********PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: NoReviewer #3: Yes: Markus ListFigure Files:While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.Data Requirements:Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.Reproducibility:To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocolsReferences:Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.10 Aug 2022Submitted filename: Reviewer_round2_final.docxClick here for additional data file.18 Aug 2022Dear Prof. Lu,We are pleased to inform you that your manuscript 'ceRNAR: an R package for identification and analysis of ceRNA-miRNA triplets' has been provisionally accepted for publication in PLOS Computational Biology.Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.Best regards,Cristian MichelettiGuest EditorPLOS Computational BiologyIlya IoshikhesSection EditorPLOS Computational Biology***********************************************************The authors have addressed the additional minor changes recommended by one of the reviewer.31 Aug 2022PCOMPBIOL-D-22-00061R2ceRNAR: an R package for identification and analysis of ceRNA-miRNA tripletsDear Dr Lu,I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!With kind regards,Zsofia FreundPLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
Authors: M A Harris; J Clark; A Ireland; J Lomax; M Ashburner; R Foulger; K Eilbeck; S Lewis; B Marshall; C Mungall; J Richter; G M Rubin; J A Blake; C Bult; M Dolan; H Drabkin; J T Eppig; D P Hill; L Ni; M Ringwald; R Balakrishnan; J M Cherry; K R Christie; M C Costanzo; S S Dwight; S Engel; D G Fisk; J E Hirschman; E L Hong; R S Nash; A Sethuraman; C L Theesfeld; D Botstein; K Dolinski; B Feierbach; T Berardini; S Mundodi; S Y Rhee; R Apweiler; D Barrell; E Camon; E Dimmer; V Lee; R Chisholm; P Gaudet; W Kibbe; R Kishore; E M Schwarz; P Sternberg; M Gwinn; L Hannick; J Wortman; M Berriman; V Wood; N de la Cruz; P Tonellato; P Jaiswal; T Seigfried; R White Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971
Authors: Florian A Karreth; Yvonne Tay; Daniele Perna; Ugo Ala; Shen Mynn Tan; Alistair G Rust; Gina DeNicola; Kaitlyn A Webster; Dror Weiss; Pedro A Perez-Mancera; Michael Krauthammer; Ruth Halaban; Paolo Provero; David J Adams; David A Tuveson; Pier Paolo Pandolfi Journal: Cell Date: 2011-10-14 Impact factor: 41.582
Authors: Kai Wang; Mingyao Li; Dexter Hadley; Rui Liu; Joseph Glessner; Struan F A Grant; Hakon Hakonarson; Maja Bucan Journal: Genome Res Date: 2007-10-05 Impact factor: 9.043