Literature DB >> 32241268

Prediction of circRNA-disease associations based on inductive matrix completion.

Menglu Li1, Mengya Liu2, Yannan Bin1,2, Junfeng Xia3,4.   

Abstract

BACKGROUND: Currently, numerous studies indicate that circular RNA (circRNA) is associated with various human complex diseases. While identifying disease-related circRNAs in vivo is time- and labor-consuming, a feasible and effective computational method to predict circRNA-disease associations is worthy of more studies.
RESULTS: Here, we present a new method called SIMCCDA (Speedup Inductive Matrix Completion for CircRNA-Disease Associations prediction) to predict circRNA-disease associations. Based on known circRNA-disease associations, circRNA sequence similarity, disease semantic similarity, and the computed Gaussian interaction profile kernel similarity, we used speedup inductive matrix completion to construct the model. The proposed SIMCCDA method obtains an area under ROC curve (AUC) of 0.8465 with leave-one-out cross validation in the dataset, which is obtained by the combination of the three databases (circRNA disease, circ2Disease and circR2Disease). Our method surpasses other state-of-art models in predicting circRNA-disease associations. Furthermore, we conducted case studies in breast cancer, stomach cancer and colorectal cancer for further performance evaluation.
CONCLUSION: All the results show reliable prediction ability of SIMCCDA. We anticipate that SIMCCDA could be utilized to facilitate further developments in the field and follow-up investigations by biomedical researchers.

Entities:  

Keywords:  CircRNA sequence similarity; CircRNA-disease associations; Disease semantic similarity; Inductive matrix completion

Mesh:

Substances:

Year:  2020        PMID: 32241268      PMCID: PMC7118830          DOI: 10.1186/s12920-020-0679-0

Source DB:  PubMed          Journal:  BMC Med Genomics        ISSN: 1755-8794            Impact factor:   3.063


Background

As endogenous noncoding RNA, circular RNA (circRNA) is extremely distinct from linear RNA. The largest difference is that the circRNA does not possess a terminal structure (i.e., 5′ caps and 3′ polyA tails) and is covalently closed to form a loop structure [1]. Such a loop structure facilitates the resistance of the circRNA to the degradation of RNA exonuclease and offers a stable biological effect compared with the corresponding linear structure [2, 3]. Although circRNA was discovered as early as the 1970s, it was considered ‘junk’ RNA [4]. Recently, circRNA has been re-recognized and has gradually gained attention. CircRNA is involved in numerous important biological functions, especially regulatory functions [5]. Accumulating evidence has clearly demonstrated that changes in circRNA plays an important role in developing various pathological conditions and exhibits a significant correlation with diseases, especially cancer. For example, the circRNA CDR1as is an inhibitor of miR-7, which is known to be involved in various diseases, such as neurodegenerative diseases, atherosclerosis and breast cancer [6]. Therefore, circRNA is thought to be a promising disease biomarker and treatment target [5]. Analysis of existing circRNA-disease associations is necessary to help predict other potential associations and help us understand the molecular mechanisms of human disease and identify biomarkers for disease diagnosis, treatment, and prevention at the circRNA level [7]. To date, an increasing number of experimentally verified or reported databases are available for the circRNA-disease associations, such as circR2Disease [7], circRNA disease [8], circ2Disease [9], and circ2Traits [10]. However, experimental methods are too expensive and time-consuming to obtain a large validated circRNA-disease association data. Developing computational methods to predict novel circRNA-disease associations has attracted considerable attention as they can effectively decrease the time and cost of biological experiments. In addition, few methods are available for predicting the circRNA-disease associations using computational methods. Lei et al. [11] developed the method of predicting circRNA-disease associations based on a path weighted model, and Fan et al. [12] proposed the KATZHCDA method using the KATZ model on heterogeneous networks. However, these methods predict potential associations using a single database, which is not enough to illustrate the stability of the model. Moreover, it remains challenging to achieve significant performance for predicting circRNA-disease associations. In this work, we proposed a new method called SIMCCDA (Speedup Inductive Matrix Completion for CircRNA-Disease Associations prediction), which considers the prediction of circRNA-disease associations as a recommendation system problem. To the best our knowledge, we are the first to apply the recommendation system approach inductive matrix completion (IMC) [13-15] to predict circRNA-disease associations. This method has been applied for various bioinformatics problems, such as drug-target interactions [16], drug repositioning [17], lncRNA (long non coding RNA)-disease [18] and miRNA (microRNA)-disease associations [19]. We model the circRNA-disease association prediction problem as a recommendation task and solve it using speedup IMC [20]. Three databases, including circRNA disease, circ2Disease and circR2Disease, are used as our raw data in this study. We then perform data screening, generate corresponding three sub-datasets (Dataset-1, Dataset-2 and Dataset-3), and combine them into a total dataset (named TotalCircRD-1). We first calculate circRNA sequence similarity and disease semantic similarity in these four datasets. Next, these two types of similarities are combined into a Gaussian interaction profile kernel to generate new circRNA similarity and disease similarity. Primary feature vectors of the similarity matrix are extracted by principal component analysis (PCA). The final model based on IMC is built for predicting circRNA-disease associations. Leave-one-out cross validation (LOOCV) is used to examine the performance of our method. The optimal AUC on TotalCircRD-1 is 0.8465. The AUC results on the three datasets are 0.8682 (Dataset-1), 0.8303 (Dataset-2) and 0.8509 (Dataset-3), respectively. To further evaluate the performance of the proposed method, we rank and select the top 30 predictions of each dataset to determine the number of results that existed in verified associations. We also conduct case studies in breast cancer, stomach cancer and colorectal cancer to support our predictions. Finally, we compare our method with KATZHCDA, and the prediction results indicate that our method outperforms the previous method in predicting circRNA-disease associations. In summary, the proposed SIMCCDA method has the ability to predict associations in circRNA-disease and offers a guiding significance for future biomedical clinical experiments.

Methods

Model overview

Here, we apply IMC with feature vectors to build the model called SIMCCDA. In addition, we add a linear Bregman iteration to speed up the process of calculating the final score matrix. The flowchart is presented in Fig. 1. A = 1 indicates that circRNA circ and disease d are associated, whereas A = 0 indicates that their association is currently in an unknown state. Given a known circRNA-disease association matrix A ∈ ℝ with circRNA sequences and disease DOIDs (disease ontology identities), we obtain circRNA and disease similarity, respectively. Then, PCA is employed to extract primary feature vectors from acquired similarity. Finally, we construct the model with IMC based on the above information to predict circRNA-disease associations.
Fig. 1

The overall procedure of SIMCCDA. Step 1: compute circRNA similarity and disease similarity. Step 2: extract primary feature vectors. Step 3: predict the circRNA-disease association matrix with speedup IMC. Lev: Levenshtein distance, Gkl: Gaussian interaction profile kernel

The overall procedure of SIMCCDA. Step 1: compute circRNA similarity and disease similarity. Step 2: extract primary feature vectors. Step 3: predict the circRNA-disease association matrix with speedup IMC. Lev: Levenshtein distance, Gkl: Gaussian interaction profile kernel

Human circRNA-disease associations data

We use three databases, including circRNA disease, circ2Disease and circR2Disease, all of which include known human circRNA-disease associations. All data were downloaded before September 2018. The initial information regarding each downloaded dataset is as follows: the first database circRNA disease contains 354 circRNA-disease associations (including 330 circRNAs and 48 diseases), the second database circ2Disease includes 273 circRNA-disease associations (including 249 circRNAs and 61 diseases) and the third database circR2Disease includes 739 associations (including 661 circRNAs and 100 diseases). The sequence information of circRNA and disease DOID matching are applied to the circBase [21] and Disease Ontology [22] (DO) databases. Based on the above data processing, we generate the final three datasets (Dataset-1, Dataset-2 and Dataset-3). These datasets are merged to obtain TotalCircRD-1 without duplicated redundancy. Table 1 lists the detailed statistics of the four datasets.
Table 1

Details of four datasets

DatasetNumber of circRNAsNumber of diseasesNumber of associationsMatrix density
Dataset-1223342410.032
Dataset-2215462400.024
Dataset-3389614450.019
TotalCircRD-1512716090.017
Details of four datasets The uncompleted associations in the datasets include circRNAs without sequences or diseases without DOIDs. Given that the calculation of circRNA sequence similarity requires the circRNA sequence and the disease similarity requires the disease DOID information, the preceding datasets exclude the uncompleted associations. We wanted to assess whether these uncompleted associations would influence the prediction performance, so we add several uncompleted associations to form four new datasets (Dataset-4, Dataset-5, Dataset-6 and TotalCircRD-2) based on Dataset-1, Dataset-2, Dataset-3 and TotalCircRD-1, respectively (Additional file 1: Table S1).

CircRNA sequence similarity

The sequence information of all the corresponding circRNAs in the aforementioned databases is obtained from circBase, and Levenshtein distance [23] is used to calculate the similarity between each two circRNA sequences. As a string metric for measuring the difference between two strings, the Levenshtein distance between two strings is the minimum cost of single-character edits (insertions, deletions or replacements) required to change one string into the other. In the present study, both editing costs of insertion and deletion are 1, and the replacement editing cost is 2. Formula (1) is the calculation of similarity for two circRNA sequences: where dist represents the minimum editing cost of converting the circRNA circ sequence to the circRNA circ sequence, and len(∙) represents the length of circRNA sequence.

Disease semantic similarity

We use DOSim [24] in DO-based DOSE (R package) to calculate the disease semantic similarity with Wang measure [25]. The detailed formula is displayed as follow: For a given disease d, is the ancestor term set of term d (including d itself). is defined as the contribution score of disease t () to disease d. It can be expressed by the following formula: Here, w is the semantic contribution factor of edge e, where e belongs to the set of edges connecting d and its ancestor . In DOSim, we set w = 0.7.

Gaussian interaction profile kernel similarity for circRNA and disease

By considering the assumption that similar circRNAs tend to be bound with similar diseases, Gaussian interaction profile kernel similarity is computed based on the known circRNA-disease association datasets. Inspired by van Laarhoven et al. [26], we calculate the circRNA and disease similarity using the Gaussian interaction profile kernel on four datasets. Equations (4) and (5) determine the similarity between circ and circ, where m is circRNA number, IP(circ) is the associated disease set corresponding to the circ, and γ is the regulation parameter of kernel bandwidth. The Gaussian interaction profile kernel similarity of diseases d and d is similar to the defined equations (6) and (7), where n is the number of diseases:

Integrated similarity for circRNA and disease

Based on the previously defined circRNA sequence similarity, disease semantic similarity and Gaussian interaction profile kernel similarities, the integrated circRNA similarity matrix CS and the disease similarity matrix DS are calculated using the following equations (8) and (9):

Extract primary feature vectors

To remove the similarity redundancy, we use principal component analysis (PCA) to extract the primary feature vectors from integrated similarity, CS and DS. In this method, based on the dominating energy strategy [27], we use singular value decomposition (SVD) to perform PCA and formulas (10) and (11) to obtain the primary feature vectors of circRNA and disease similarity. In the above formulas, S and S are the singular values of circRNA and the disease similarity matrix, respectively. α and α are adjusted parameters to obtain optimal results. In this study, Dataset-2, Dataset-3 and TotalCircRD-1 share the parameters α =0.6 and α =0.9, whereas the parameters of Dataset-1 are α =0.7 and α =0.9. Detailed adjustment work of α and α is discussed in the Results section.

Model construction

In this study, we formulate circRNA-disease association prediction as a recommendation system problem. Generally, a recommendation system is an information filtering system that seeks to predict the user’s preference of a certain item based on partial known preference information. We here use the recommendation system method IMC [15] to identify circRNAs for a disease that is dependent on validated circRNA-disease associations. Observing the matrix density of the last column in Table 1, we find that the association matrix is very sparse. As we know, there are a small amount of experimental data of associations due to the structural complexity of circRNAs and ignored biological functions. The available data scale is in the primary stage. As a result, we cover the unknown associations of circRNAs and diseases through IMC to enhance the quality of our data. The advantage is that IMC can solve matrix completion problems using a relatively small set of known information. The detailed process of IMC is described below. First, based on the assumption that the human circRNA-disease association matrix is A, the row vectors in A lie in the subspace spanned by the column vectors in D (disease feature vectors), and the column vectors in A lie in the subspace spanned by the column vectors in C (circRNA feature vectors). The problem can be defined as: where Z is the objective matrix to complete A, CZD is the final scoring matrix based on the association matrix and the similarity matrix, Ω represents known association sets, ‖∙‖∗ is the nuclear norm defined as the sum of the singular values, λ is the regularization parameter controlling the extent of the nuclear norm (here we set λ to 1), and ‖∙‖ is the Frobenius norm of the matrix. Representing f(Z) as , the formula (12) can be expressed as: For any given , the following quadratic approximation of f(Z) at Y can be considered as: where is the gradient of f(Z) at Y, 〈∙〉 represents matrix inner product, and τ is a proximal parameter for estimating the second-order gradient ∇2f(Y). Accordingly, the above formula (13) calculates the minimum model, which can be converted into the following formula: Then, we use the accelerated proximal gradient singular value thresholding algorithm [28] with iterate h times to obtain Z [29]. In order to see the relationship between the objective function value and the number of iterations, we divide the circRNAs into several categories according to their chromosomal location and then select randomly one from each class to view the trend of the curve. Additional file 1: Figure S1 shows that the value of the objective function decreases as the number of iterations increases. When the gap of objective function values between two iterations is particularly small, i.e. , the iterative process will end.

Results

LOOCV

To assess the predictive accuracy of SIMCCDA, we performed the following method using the leave-one-out cross validation (LOOCV) framework on the known circRNA-disease associations. The reason why LOOCV is used in this study is that the current common practice in this field (prediction of lncRNA/miRNA/circRNA-disease associations) [30-32] is to use LOOCV to measure the performance of the model. For a disease d, each known circRNA association corresponding to the disease was left as a test sample. Other known associations were used as training samples, and an initial non-association was regarded as a candidate sample. In the candidate samples and test sample set, the test sample was deemed as a positive sample, and the others were negative samples. After running the model, the probabilities of associations between candidate samples and disease d were calculated. We took the highest values as the final score of the candidate sample among probabilities. Finally, we calculated the sensitivity and specificity as follows: where TP indicate true positives, FP is false positives, TN refer to true negatives, and FN represent false negatives. A Receiver Operating Characteristics (ROC) curve is drawn based on the LOOCV result. The X-axis of the ROC graph is the 1-specificity, and the Y-axis is the sensitivity. From the ROC curve, the Area Under ROC Curve (AUC) can be calculated as an evaluation measure for the model.

The effect of adjusting parameters on the prediction result

In the PCA section of the Methods, two parameters α and α were included, which represent the percentage of singular values of circRNA and disease similarity matrix, respectively. We tried to take values between 0.1 and 1 for α and α, and the step size was 0.1. The results of the parameterization of TotalCircRD-1 are presented in Fig. 2, and results for Dataset-1, Dataset-2 and Dataset-3 are presented in Additional file 1: Figures S2-S4. As noted in Fig. 2, as α increases, AUC is initially stable and the generally declines. The results are consistent when α =0.1 or α =0.2. As α increases, the AUC gradually increases, but the growth rate is slow. The optimal parameters of the three datasets of Dataset-2, Dataset-3, TotalCircRD-1 are all α =0.6 and α =0.9, whereas the optimal parameters for Dataset-1 are α =0.7 and α =0.9. LOOCV-based AUC results for four datasets with optimal parameters are shown in Fig. 3b. The results of our model on the four datasets are at a solid level, and the gap between the maximum and minimum values is 3% in four datasets, which reveals that our model exhibits better robustness. Figure 3a shows the PR (Precision-Recall) curves on the four datasets, respectively, which have the same trend as the ROC curve. Figure 4 presents the number of experimental validated associations predicted by our model from the top 30 predicted associations from our four datasets. Additional file 1: Table S2 shows the predicted results of the top 10, 30, 50 and 100. It can be observed that whether it is top 10, top 30, top 50, or top 100, the ultimate trends are similar. For the sake of convenience, we only show the results of top 30 in this work. Based on the above optimal parameters, we predicted 30 known circRNA-disease associations from Dataset-2, Dataset-3 and TotalCircRD-1, and 26 known associations from the Dataset-1. This shows that our results are optimal under these parameters, and four unknown associations in Dataset-1 may be potential associations based on subsequent analysis.
Fig. 2

Adjusted parameters α and α with their impact on TotalCircRD-1 dataset

Fig. 3

ROC curve and PR curve using LOOCV of eight datasets under optimal parameters. a PR curve in four datasets (Dataset-1, Dataset-2, Dataset-3 and TotalCircRD-1). b ROC curve in four datasets (Dataset-1, Dataset-2, Dataset-3 and TotalCircRD-1). c PR curve in four datasets (Dataset-4, Dataset-5, Dataset-6 and TotalCircRD-2). d ROC curve in four datasets (Dataset-4, Dataset-5, Dataset-6 and TotalCircRD-2)

Fig. 4

The number of associations validated by our model for the top 30 on four datasets (Dataset-1, Dataset-2, Dataset-3, and TotalCircRD-1)

Adjusted parameters α and α with their impact on TotalCircRD-1 dataset ROC curve and PR curve using LOOCV of eight datasets under optimal parameters. a PR curve in four datasets (Dataset-1, Dataset-2, Dataset-3 and TotalCircRD-1). b ROC curve in four datasets (Dataset-1, Dataset-2, Dataset-3 and TotalCircRD-1). c PR curve in four datasets (Dataset-4, Dataset-5, Dataset-6 and TotalCircRD-2). d ROC curve in four datasets (Dataset-4, Dataset-5, Dataset-6 and TotalCircRD-2) The number of associations validated by our model for the top 30 on four datasets (Dataset-1, Dataset-2, Dataset-3, and TotalCircRD-1) In addition, we added weights to each part of the integration similarity to see how the performance could be impacted. We added weights (range from 0 to 1) to Sim(circ, circ) and Gkl(circ, circ) in equation (8) and (9), respectively. For different weights circRNAs and diseases similarity, the final results were obtained by combining the two pairs. The Additional file 1: Figure S5 shows that the combinations of different similarity weights have similar results for the models obtained on different datasets. So, in the end, our model used equation (8) and (9) to respectively calculate the circRNAs similarity and diseases similarity.

The effect of uncompleted associations

The α and α were adjusted in the same manner as described above, and the optimal parameters were selected to calculate the AUC in Dataset-4, Dataset-5, Dataset-6 and TotalCircRD-2 datasets, as presented in Fig. 3c and d. The AUC scores of new-added datasets (Fig. 3d) are slightly reduced compared with the initial datasets (Fig. 3b). Given that most of the newly added circRNA only involved in one disease, thus making the final association matrix sparser than previous one. For example, circ-BANP is only associated with colorectal cancer and is not associated with other diseases. Increasing association data are noted between circ-BANP and colorectal cancer, and the unknown associations of circ-BANP with other diseases also increase, as observed from the matrix density columns of Additional file 1: Table S1. In summary, uncompleted associations exhibit a minimal effect on the results and only slightly reduce the performance of predictions. The above results show that the sparseness of the data set has little effect on the prediction results. But if the correlation matrix is too sparse, it will still affect the final prediction results. So our method has a premise that the association matrix cannot be too sparse. We conducted the following experiment to explore how the varying sparsity of datasets affect the overall performance. Since the final result has a certain relationship with the dataset, we performed sparsity processing on each dataset (0.002 was the step size, and the sparsity was reduced by 0.002 each time), respectively. The Additional file 1: Figure S6 shows that the result is not much changed when the sparsity of Dataset-1 is 0.015. But when the sparsity is 0.013, the performance starts to drop significantly. Similarly, for the other three datasets (Dataset-2, Dataset-3, TotalCircRD-1), the performance starts to drop significantly when the sparsity is 0.013, 0.009, and 0.007, respectively.

Compared with the other method

Two methods are currently available for predicting circRNA-disease associations: PWCDA [11] and KATZHCDA [12]. Given that the PWCDA method needs to set the circRNA similarity and disease similarity < 0.5 part to 0 and most of the similarities on our datasets are less than 0.5, we only compared our method with KATZHCDA. KATZHCDA is a computational model of KATZ measures and constructs heterogeneous networks by employing the circRNA expression profiles, disease phenotype similarity and Gaussian interaction profile kernel similarity. Here, we used the same eight datasets in KATZHCDA model and obtain predicted results. The results of six datasets (Dataset-1, Dataset-2, Dataset-3, Dataset-4, Dataset-5 and Dataset-6) are presented in Additional file 1: Figure S7, and TotalCircRD-1 and TotalCircRD-2 results are presented in Fig. 5a and c. As shown in Additional file 1: Figure S7, both the PR curve and the ROC curve indicate that our model performance is superior to KATZHCDA. The AUC scores of four datasets (Dataset-1, Dataset-2, Dataset-3 and TotalCircRD-1) are 0.7604, 0.7458, 0.7442 and 0.7558, respectively. According to the comparison of two methods, our model obtains an average AUC of 0.8490, which is 9% higher than KATZHCDA. The resulting top 30 predicted associations are also analyzed, demonstrating that our predicted top 30 results are superior to KATZHCDA (Fig. 5b, d).
Fig. 5

Comparison of SIMCCDA with KATZHCDA on the TotalCircRD-1 and Total CircRD-2 dataset. a Performance of methods in terms of ROC curve using LOOCV in TotalCircRD-1 dataset. b The number of experimental validations of the top 30 predicted circRNA-disease associations from four datasets (Dataset-1, Dataset-2, Dataset-3 and TotalCircRD-1). c Performance of methods in terms of ROC curve using LOOCV in TotalCircRD-2 dataset. d The number of experimental validations of the top 30 predicted circRNA-disease associations from four datasets (Dataset-4, Dataset-5, Dataset-6 and TotalCircRD-2)

Comparison of SIMCCDA with KATZHCDA on the TotalCircRD-1 and Total CircRD-2 dataset. a Performance of methods in terms of ROC curve using LOOCV in TotalCircRD-1 dataset. b The number of experimental validations of the top 30 predicted circRNA-disease associations from four datasets (Dataset-1, Dataset-2, Dataset-3 and TotalCircRD-1). c Performance of methods in terms of ROC curve using LOOCV in TotalCircRD-2 dataset. d The number of experimental validations of the top 30 predicted circRNA-disease associations from four datasets (Dataset-4, Dataset-5, Dataset-6 and TotalCircRD-2) In addition, we compared our method with KATZHCDA by using Dataset-1 as the training set and Dataset-2, Dataset-3 as the test set. As can be seen from Additional file 1: Figure S8, our performance is slightly better than KATZHCDA. Specifically, the early stage of KATZHCDA prediction effect is better than ours, but its accuracy is reduced in the prediction of later stages. A comprehensive look at the above two results, our model is superior to KATZHCDA on the whole.

Case study

Analysis of predicted circRNA-disease associations with experimental evidence from the TotalCircRD-1 dataset

To further measure the performance of SIMCCDA, case studies of three diseases, including breast cancer, stomach cancer and colorectal cancer, from the TotalCircRD-1 dataset were analyzed in detail. The top 30 predicted disease-related circRNAs by SIMCCDA and supporting evidence from PubMed are presented in Tables 2, 3 and 4.
Table 2

Top 30 candidate circRNAs for breast cancer

RankcircRNAEvidence (PMID)RankcircRNAEvidence (PMID)
1hsa_circ_00018752848408616hsa_circ_000091128744405
2hsa_circ_00060542848408617hsa_circ_009227628803498
3hsa_circ_00000982874440518hsa_circ_000894528744405
4hsa_circ_01073272922116019hsa_circ_000383828803498
5hsa_circ_00017852904585820hsa_circ_000461928484086
6hsa_circ_01030382922116021hsa_circ_003314429221160
7hsa_circ_00028742880349822hsa_circ_000128328744405
8hsa_circ_00022202922116023hsa_circ_005712929221160
9hsa_circ_00065282880349824hsa_circ_000182428484086
10hsa_circ_00087172874440525hsa_circ_008549528803498
11hsa_circ_00008932874440526hsa_circ_000073228744405
12hsa_circ_00680332904585827hsa_circ_008624128803498
13hsa_circ_00119462959343228hsa_circ_0003221unconfirmeda
14hsa_circ_00019822893358429hsa_circ_001829328744405
15hsa_circ_00016672880349830hsa_circ_009385929593432

awithout the evidence reported in literatures

Table 3

Top 30 candidate circRNAs for stomach cancer

RankcircRNAEvidence (PMID)RankcircRNAEvidence (PMID)
1hsa_circ_00846062854460916hsa_circ_007630428831102
2hsa_circ_00001402568979517hsa_circ_005710428831102
3hsa_circ_000838328761361, 2820697218hsa_circ_013896028980874
4hsa_circ_007436228544609, 2924045919hsa_circ_001304828657541, 28206972
5hsa_circ_00031592861820520hsa_circ_000378928544609
6hsa_circ_00060222863990821hsa_circ_003544528544609
7has_circ_00310272820697222hsa_circ_005876628831102
8hsa_circ_00505472854460923hsa_circ_000189528443463
9hsa_circ_00015462854460924hsa_circ_000592728737829
10hsa_circ_00638092854460925hsa_circ_007630528831102
11hsa_circ_00847202883110226hsa_circ_000663328656881
12hsa_circ_00328212873782927hsa_circ_000015428544609
13hsa_circ_00015392818494028hsa_circ_000647028544609
14hsa_circ_00037072863990829hsa_circ_000101729098316
15hsa_circ_00061272897490030hsa_circ_000322228893265
Table 4

Top 30 candidate circRNAs for colorectal cancer

RankcircRNAEvidence (PMID)RankcircRNAEvidence (PMID)
1hsa_circ_00005232562406216hsa_circ_000617428656150
2hsa_circ_00005042865615017hsa_circ_000172429207676
3hsa_circ_00021382562406218hsa_circ_000145126884878
4hsa_circ_00005672933361519hsa_circ_007480628656150
5hsa_circ_00070062865615020hsa_circ_0003221unconfirmeda
6hsa_circ_00241692562406221hsa_circ_0001577unconfirmeda
7hsa_circ_00823332613867722hsa_circ_000850928656150
8hsa_circ_00878622865615023hsa_circ_002208028656150
9hsa_circ_00059492865615024hsa_circ_0002024unconfirmeda
10hsa_circ_00070312865615025hsa_circ_0000172unconfirmeda
11hsa_circ_00084942865615026hsa_circ_000067727058418
12hsa_circ_00000692800376127hsa_circ_0002768unconfirmeda
13hsa_circ_00482322865615028hsa_circ_0091017unconfirmeda
14hsa_circ_00030982810350729hsa_circ_000270227058418
15hsa_circ_00749302865615030hsa_circ_0128454unconfirmeda

awithout the evidence reported in literatures

Top 30 candidate circRNAs for breast cancer awithout the evidence reported in literatures Top 30 candidate circRNAs for stomach cancer Top 30 candidate circRNAs for colorectal cancer awithout the evidence reported in literatures Breast cancer is the most common cancer and remains the leading cause of cancer death among women worldwide [33]. Among top 30 predicted candidate circRNA for breast cancer, 29 are associated with breast cancer in related studies (Table 2). For instance, hsa_circ_0001875 (top 1) is upregulated in breast cancer tissues compared with the normal breast tissue [34]. In addition, circRNA hsa_circ_0006054 (top 2) expression is significantly downregulated in breast cancer tissues compared with non-breast cancer tissues [34]. Gastric cancer is the second disease to lead cancer-related mortality and the fourth most frequent cancer globally [35]. Using the SIMCCDA method, we successfully predicted 30 of top 30 candidate circRNAs for gastric cancer (Table 3). Among them, CircRNA hsa_circ_0084606 (top 1) is one of the top 10 upregulated circRNAs in stomach cancer tissues [36], whereas hsa_circ_0000140 (top 2), a typical circular RNA, is significantly increased in stomach cancer tissues compared with paired adjacent non-tumorous tissues [37]. Colorectal cancer is the third most common cancer worldwide with 1.36 million people diagnosed in 2012 [38]. The inferred results cover 23 experimental verified associations out of the top 30 ranked predictions (Table 4). The evidence in the literature reveals that circRNA hsa_circ_0000523 exhibits significantly reduced expression in cancer compared with normal colorectal tissues. In colorectal cancer cells, the well-validated circRNA hsa_circ_0000504 is upregulated [39].

Analysis of predicted circRNA-disease associations without experimental evidence from four datasets

Given that the top 30 well-validated associations were successfully investigated by our method using Dataset-2, Dataset-3 and TotalCircRD-1 dataset, here we concentrated on four new predicted potential circRNA-disease associations from Dataset-1 (as shown in Fig. 4). We employed circRNA-miRNA and miRNA-disease associations to construct corresponding circRNA-miRNA-mRNA networks for the four new circRNA-disease associations. We used the hsa_circ_0070963-stomach cancer association as an example for a detailed exposition. First, possible miRNA targets of hsa_circ_0070963 were predicted with the miRNA Target Sites tool of CircInteractome [40]. Their target genes with experimental verification were screened out from miRTarBase [41], and then, hsa_circ_0070963-miRNA-disease regulatory network was constructed using Cytoscape [42]. Finally, the corresponding experimentally verified miRNA-stomach cancer associations were obtained from HMDD [43] and added to the above network. As noted from the result (Fig. 6), hsa_circ_0070963 may be targeted by four miRNAs, including has-miR-223, has-miR-421, has-miR-610 and has-miR-526b. CircRNA can act as competing endogenous RNAs (ceRNAs) (also termed miRNA sponges) to buffer the target genes expression (i.e., mRNA) of miRNAs [36, 37], and miRNA has-miR-223 is linked the most number of targets. Thus, we hypothesize that hsa_circ_0070963 may function as a hsa-miR-223 sponge to interact with stomach carcinoma.
Fig. 6

hsa_circ_0070963-miRNA-mRNA regulatory network in stomach cancer

hsa_circ_0070963-miRNA-mRNA regulatory network in stomach cancer Other three predicted new associations (hsa_circ_0061893, hsa_circ_0071410, and hsa_circ_0054345 in stomach cancer) exhibit similar scenarios, which are presented in Additional file 1: Figures S9-S11.

Conclusions

Increasing evidence demonstrates that circRNA plays an important role in the development of various diseases. Understanding the underlying mechanisms of circRNA in disease is becoming an urgent problem worldwide. To date, the number of experimentally validated circRNA-disease associations is small, and few computational methods for predicting circRNA-disease associations are available. In this paper, we proposed a method called SIMCCDA for predicting circRNA-disease associations based on known circRNA-disease associations. Integrating data regarding circRNA similarity and disease similarity, we employed IMC to construct the model. LOOCV was applied to assess the accuracy of the SIMCCDA. We then compared our method with KATZHCDA. Further case studies were also performed on breast cancer, stomach cancer and colorectal cancer. Based on the prediction results, SIMCCDA performs well in cross validations on the four datasets we used. Simultaneously, the compared results indicate that our method can identify more associations between circRNA and disease. The prominent performances of SIMCCDA may have been facilitated by the following factors. First, SIMCCDA was constructed based on the integrated circRNA and disease similarities, which can make a full use of various similarity data to characterize potential circRNA-disease associations. Second, SIMCCDA transformed circRNA-disease associations into a recommendation system problem and applied the IMC algorithm of the recommendation system to predict potential circRNA-disease associations. A decisive advantage of IMC is that it can supplement the missing values in the circRNA-disease association matrix to improve the performance. Third, the datasets used in this study were derived from various validated databases. Observing the results obtained on the four datasets, we found that the prediction ability of our model was better than the previous method. However, our model also has some limitations. First, although we introduce the sequence similarity of circRNA and the semantic similarity of disease, the calculation of Gaussian interaction profile kernel similarity relies heavily on known circRNA-disease associations, thus causing inevitable bias towards well-investigated circRNAs and diseases. Second, SIMCCDA could not be applied to unknown circRNA and diseases. In our future work, we will extend our method to solve these limitations. Additional file 1. Supplementary file to this work (Table S1-S2 and Figures S1-S11).
  34 in total

1.  A new method to measure the semantic similarity of GO terms.

Authors:  James Z Wang; Zhidian Du; Rapeeporn Payattakool; Philip S Yu; Chin-Fu Chen
Journal:  Bioinformatics       Date:  2007-03-07       Impact factor: 6.937

2.  Novel human lncRNA-disease association inference based on lncRNA expression profiles.

Authors:  Xing Chen; Gui-Ying Yan
Journal:  Bioinformatics       Date:  2013-09-02       Impact factor: 6.937

3.  Gaussian interaction profile kernels for predicting drug-target interaction.

Authors:  Twan van Laarhoven; Sander B Nabuurs; Elena Marchiori
Journal:  Bioinformatics       Date:  2011-09-04       Impact factor: 6.937

Review 4.  Clinical epidemiology of gastric cancer.

Authors:  Tiing Leong Ang; Kwong Ming Fock
Journal:  Singapore Med J       Date:  2014-12       Impact factor: 1.858

5.  Circular RNAs are abundant, conserved, and associated with ALU repeats.

Authors:  William R Jeck; Jessica A Sorrentino; Kai Wang; Michael K Slevin; Christin E Burd; Jinze Liu; William F Marzluff; Norman E Sharpless
Journal:  RNA       Date:  2012-12-18       Impact factor: 4.942

6.  circBase: a database for circular RNAs.

Authors:  Petar Glažar; Panagiotis Papavasileiou; Nikolaus Rajewsky
Journal:  RNA       Date:  2014-09-18       Impact factor: 4.942

Review 7.  Circular RNAs: Promising Biomarkers for Human Diseases.

Authors:  Zhongrong Zhang; Tingting Yang; Junjie Xiao
Journal:  EBioMedicine       Date:  2018-08-02       Impact factor: 8.143

8.  CircR2Disease: a manually curated database for experimentally supported circular RNAs associated with various diseases.

Authors:  Chunyan Fan; Xiujuan Lei; Zengqiang Fang; Qinghua Jiang; Fang-Xiang Wu
Journal:  Database (Oxford)       Date:  2018-01-01       Impact factor: 3.451

9.  PWCDA: Path Weighted Method for Predicting circRNA-Disease Associations.

Authors:  Xiujuan Lei; Zengqiang Fang; Luonan Chen; Fang-Xiang Wu
Journal:  Int J Mol Sci       Date:  2018-10-31       Impact factor: 5.923

10.  Circ2Disease: a manually curated database of experimentally validated circRNAs in human disease.

Authors:  Dongxia Yao; Lei Zhang; Mengyue Zheng; Xiwei Sun; Yan Lu; Pengyuan Liu
Journal:  Sci Rep       Date:  2018-07-20       Impact factor: 4.379

View more
  7 in total

1.  MAGCNSE: predicting lncRNA-disease associations using multi-view attention graph convolutional network and stacking ensemble model.

Authors:  Ying Liang; Ze-Qun Zhang; Nian-Nian Liu; Ya-Nan Wu; Chang-Long Gu; Ying-Long Wang
Journal:  BMC Bioinformatics       Date:  2022-05-19       Impact factor: 3.307

Review 2.  Circular RNAs and complex diseases: from experimental results to computational models.

Authors:  Chun-Chun Wang; Chen-Di Han; Qi Zhao; Xing Chen
Journal:  Brief Bioinform       Date:  2021-11-05       Impact factor: 11.622

3.  Bioinformatics Analysis of the MicroRNA-Metabolic Gene Regulatory Network in Neuropathic Pain and Prediction of Corresponding Potential Therapeutics.

Authors:  Huai-Gen Zhang; Li Liu; Zhi-Ping Song; Da-Ying Zhang
Journal:  J Mol Neurosci       Date:  2021-09-27       Impact factor: 2.866

4.  Circular RNA circ_RANBP9 exacerbates polycystic ovary syndrome via microRNA-136-5p/XIAP axis.

Authors:  Xiaohui Lu; Haijie Gao; Bo Zhu; Guilan Lin
Journal:  Bioengineered       Date:  2021-12       Impact factor: 3.269

5.  Using Graph Attention Network and Graph Convolutional Network to Explore Human CircRNA-Disease Associations Based on Multi-Source Data.

Authors:  Guanghui Li; Diancheng Wang; Yuejin Zhang; Cheng Liang; Qiu Xiao; Jiawei Luo
Journal:  Front Genet       Date:  2022-02-07       Impact factor: 4.599

6.  Prediction of circRNA-Disease Associations Based on the Combination of Multi-Head Graph Attention Network and Graph Convolutional Network.

Authors:  Ruifen Cao; Chuan He; Pijing Wei; Yansen Su; Junfeng Xia; Chunhou Zheng
Journal:  Biomolecules       Date:  2022-07-02

7.  CircWalk: a novel approach to predict CircRNA-disease association based on heterogeneous network representation learning.

Authors:  Morteza Kouhsar; Esra Kashaninia; Behnam Mardani; Hamid R Rabiee
Journal:  BMC Bioinformatics       Date:  2022-08-11       Impact factor: 3.307

  7 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.