
SKF-LDA: Similarity Kernel Fusion for Predicting lncRNA-Disease Association.

Guobo Xie¹, Tengfei Meng¹, Yu Luo², Zhenguo Liu³.

Abstract

Recently, prediction of lncRNA-disease associations has attracted increasing attention. Various computational models have been proposed; however, there is still room to improve the prediction accuracy. In this paper, we propose a kernel fusion method that combines different types of similarities for lncRNAs and diseases. Expression similarity and cosine similarity are used for lncRNAs, and semantic similarity and cosine similarity are used for diseases. To reduce the effect of noise, a neighbor constraint is enforced to refine all the similarity matrices before fusion. Experimental results show that the proposed similarity kernel fusion (SKF)-LDA method outperforms the compared methods in terms of AUC values and other measurements. Under LOOCV and 5-fold CV, SKF-LDA achieves AUC values of 0.9049 and 0.8743 ± 0.0050, respectively. In addition, case studies of three diseases (hepatocellular carcinoma, lung cancer, and prostate cancer) show that SKF-LDA can predict related lncRNAs accurately.

Keywords: Laplacian regularized least squares; disease similarity; lncRNA similarity; lncRNA-disease association; similarity kernel fusion

Year: 2019 PMID: 31514111 PMCID: PMC6742806 DOI: 10.1016/j.omtn.2019.07.022

Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886

Introduction

In humans, only about 1.5% of the genome encodes proteins, while the rest is extensively transcribed into non-coding RNAs (ncRNAs).¹,²,³ Studies have shown that ncRNAs play an important role in human biological mechanisms,⁴,⁵,⁶ especially the long non-coding RNAs (lncRNAs), a significant class of ncRNA more than 200 nt in length.⁷,⁸,⁹,¹⁰ In recent years, lncRNAs have been found to influence transcription, translation, cell cycle, imprinting, splicing, and protein localization;⁵,¹¹,¹² for example, intergenic 10 regulates expression of ADAM12- and FANK1-flanking genes through modulation of the chromatin structure in cis. Additionally, it has been shown that the dysregulation and mutation of some lncRNAs are associated with human diseases,¹⁴,¹⁵,¹⁶ such as breast cancer,¹⁷,¹⁸ intracranial aneurysm,¹⁹,²⁰ and -thalassemia.²¹,²² Although the detailed mechanisms of lncRNAs remain unclear, computational models can still be used to predict the relationship between lncRNAs and diseases, which can be helpful for disease diagnosis. Moreover, computational methods can be a powerful complement to biological experiments and can even avoid some time-consuming experiments.

In the past few years, many effective computational models have been built to predict potential lncRNA-disease associations.²⁴,²⁵,²⁶,²⁷ These methods can be roughly classified into two categories according to the computational model.

The first category predicts lncRNA-disease associations with designed machine learning models. Chen et al. proposed Laplacian regularized least squares for lncRNA-disease association (LRLSLDA) in a semi-supervised learning framework, based on the prior that similar diseases are more likely to be associated with functionally similar lncRNAs. Lan et al. integrated multiple biological data resources to predict lncRNA-disease associations with a bagging support vector machine (SVM) classifier based on lncRNA similarity and disease similarity. The SIMCLDA method uses inductive matrix completion to predict lncRNA-disease associations. In this model, the Gaussian interaction profile kernel of lncRNAs is computed from known lncRNA-disease interactions, and the functional similarity of diseases is based on disease-gene and gene-gene ontology associations. The primary feature vectors extracted from the constructed feature matrices are then used to complete the association matrix within the inductive matrix completion framework.

The second category predicts associations based on constructed networks. In an earlier study, Sun et al. designed a computational network framework, random walk with restart for lncRNA-disease association (RWRlncD), to predict lncRNA-disease associations based on the lncRNA-lncRNA functional similarity network. RWRlncD integrates lncRNA-disease associations and disease functional similarity and then applies a random walk. Bi-random walks for lncRNA-disease association (BRWLDA) takes into account the structural differences between lncRNA similarity and disease similarity and builds two networks using lncRNA functional similarity and disease semantic similarity; random walks are then performed on both networks to predict potential lncRNA-disease associations. Paths with limited lengths for lncRNAs-diseases association (PLLDA) builds lncRNA similarity networks and disease similarity networks based on lncRNA functional similarity and disease semantic similarity. PLLDA scores a lncRNA-disease pair using the paths that connect them and the lengths of those paths in the respective similarity networks, and a depth-first search is used to calculate the probability of lncRNA-disease association. This method is suitable for predicting the relation between a new disease and lncRNAs or between a new lncRNA and diseases. However, since the length-based cost function is relatively simple, a machine learning model may be needed as a substitute. By combining lncRNA-disease associations and gene-disease associations, tripartite graph for lncRNAs-diseases association (TPGLDA) can effectively predict the association between lncRNAs and diseases, but this method relies on the structure among the three parts, and incomplete data will limit its performance.

Although the aforementioned methods achieve relatively good results, there is still room to improve the accuracy of association prediction. For example, some methods only consider the similarity of lncRNAs and diseases in one dimension (functional or semantic) and do not fully take multi-dimensional information into consideration. It is believed that, when more biological knowledge is applied with an elaborate fusion method, a more accurate prediction can be obtained, as shown by previous studies on the prediction of associations between microRNA (miRNA) and disease.³⁵,³⁶ Recently, an increasing number of state-of-the-art computational models³⁷,³⁸,³⁹,⁴⁰,⁴¹ have been published in well-known journals and have promoted the study of the association between lncRNA and disease. For example, the integration method we use is inspired by a method for predicting miRNA-disease associations. Also, the idea of using Laplacian regularization to explore the relationship between miRNA and disease inspires us to apply similar ideas to predict the association between lncRNA and disease.

In this study, we present a similarity kernel fusion (SKF) method to predict lncRNA-disease association (SKF-LDA, for short). In the proposed method, two different similarities, the functional similarity and semantic similarity, are utilized with a new fusion approach. The fusion step is built on similarity matrices refined by a neighbor-based constraint and iterates over the similarity matrices instead of using a simple weighted addition. The final lncRNA-disease association is computed by a standard Laplacian regularized least-squares method. To demonstrate the prediction performance of the proposed SKF-LDA, the leave-one-out cross-validation (LOOCV) and 5-fold cross-validation (5-fold CV) frameworks are implemented. The experimental results show that the proposed SKF-LDA achieves AUC values of 0.9049 and 0.8743 ± 0.0050 under LOOCV and 5-fold CV, respectively. Furthermore, in case studies, 10, 9, and 7 out of the top 10 predicted lncRNAs for hepatocellular carcinoma, lung cancer, and prostate cancer, respectively, are confirmed by recent research.

Results

Overview of Proposed Method

The proposed SKF-LDA method can be summarized in four steps, as shown in Figure 1. First, we construct the lncRNA-disease correlation matrix based on the known lncRNA-disease association. Second, we calculate lncRNA similarity (lncRNA expression similarity, lncRNA cosine similarity) and disease similarity (disease semantic similarity, disease cosine similarity). Third, SKF is used to integrate the two similarities of lncRNAs and diseases. Lastly, we obtain the predicted lncRNA-disease association matrix by the Laplacian regularized least-squares method.

Figure 1

Flow Chart of SKF-LDA Applied to lncRNA-Disease Association Prediction

(A–D) SKF-LDA consists of four steps: (A) constructing the lncRNA-disease correlation matrix; (B) calculating the two similarities for lncRNAs and the two similarities for diseases; (C) integrating the similarities with SKF; and (D) obtaining the prediction matrix by Laplacian regularized least squares.

We verify the performance of SKF-LDA by LOOCV and 5-fold CV. In LOOCV, each of the 540 known lncRNA-disease associations is used in turn as the test sample, with the rest as the training set. 5-fold CV randomly divides the 540 lncRNA-disease associations into 5 parts; each time, one part is used as the test set and the remaining parts as the training set. When an association is placed in the test set, its prior knowledge is removed before the similarities are calculated, and it is regarded as unknown; in other words, its entry of 1 in the association matrix is set to 0. A test association is counted as correctly predicted when its final predicted value is higher than the threshold. By varying the threshold, different false-positive rates (FPRs; 1 − specificity) and true-positive rates (TPRs; sensitivity) are obtained. Based on these data, we plot the receiver operating characteristic (ROC) curve and compute the area under the ROC curve (AUC). The prediction is perfect if AUC = 1, tends to be random if AUC = 0.5, and is systematically inverted if AUC = 0. In addition, we adopt the area under the precision-recall curve (AUPR) as another measurement. To further assess SKF-LDA, the precision (PRE), sensitivity (SEN), accuracy (ACC), F1-score, and Matthews correlation coefficient (MCC) are defined as follows:

$$\mathrm{PRE} = \frac{TP}{TP + FP}, \qquad \mathrm{SEN} = \frac{TP}{TP + FN}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{PRE} \times \mathrm{SEN}}{\mathrm{PRE} + \mathrm{SEN}},$$

and

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP represents true positive, TN represents true negative, FP represents false positive, and FN represents false negative.
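As a minimal sketch of the evaluation measures described above, the following Python computes the threshold-based measures from confusion-matrix counts and a rank-based AUC. It assumes the standard textbook definitions of PRE, SEN, ACC, F1-score, and MCC; the function names and the counts are hypothetical and chosen only to exercise the formulas, not taken from the experiments.

```python
import math

def threshold_metrics(tp, tn, fp, fn):
    """PRE, SEN, ACC, F1 and MCC from confusion-matrix counts (standard definitions)."""
    pre = tp / (tp + fp)
    sen = tp / (tp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * pre * sen / (pre + sen)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"PRE": pre, "SEN": sen, "ACC": acc, "F1": f1, "MCC": mcc}

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts and scores, for illustration only.
m = threshold_metrics(tp=40, tn=900, fp=10, fn=50)
toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise AUC is quadratic in the number of samples, which is fine for a toy check but would normally be replaced by a sort-based computation on a full association matrix.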

Parameter Selection

In this paper, there are five parameters: the number of iterations; the weight parameter α and the number of neighbors K in the SKF; and the two weight parameters in Laplacian regularized least squares. In the experiments, the number of iterations is fixed, and the value of α is set to 0.1 after parameter tuning. For the tuning, α is varied from 0.1 to 1 in steps of 0.1. The AUC values for different values of α are shown in Figure 2 for the LOOCV and 5-fold CV schemes. As shown in Figure 2, the AUC values barely change when α is in the range of 0.1 to 0.9 and decay quickly in the range of 0.9 to 1. Since there are 178 diseases and 115 lncRNAs in our data, K is varied from 10 to 110 in steps of 10. Figure 3 shows the values of K at which the highest AUC is achieved in the LOOCV scheme and in the 5-fold CV scheme. As for the weighting parameters in Laplacian regularized least squares, previous research has shown that the performance of Laplacian regularized least squares (LapRLS) is robust to these parameters, so we set the two weights to a common value β, which is varied up to 1,000 in the experiments. As shown in Figure 4, the AUC values in both the LOOCV scheme and the 5-fold CV scheme change only within a small interval, and the β giving the highest value can be read from Figure 4.
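The one-parameter-at-a-time tuning described above can be sketched as follows. The grids for α and K follow the ranges stated in the text; `best_value` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for "run cross-validation and return the AUC", not a result from the experiments.

```python
# Candidate grids taken from the ranges stated in the text.
alphas = [round(0.1 * i, 1) for i in range(1, 11)]   # 0.1, 0.2, ..., 1.0
ks = list(range(10, 111, 10))                        # 10, 20, ..., 110

def best_value(candidates, evaluate):
    """Return the candidate with the highest score under `evaluate`."""
    return max(candidates, key=evaluate)

def toy_score(k):
    """Hypothetical stand-in for a cross-validated AUC; the peak at 40 is arbitrary."""
    return -abs(k - 40)

best_k = best_value(ks, toy_score)
```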

Figure 2

The AUC Values with Different αs

Figure 3

The AUC Values with Different Values of the Number of Neighbors (K)

Figure 4

The AUC Values with Different βs

Comparison with Other Fusion Methods

To verify the superiority of the SKF, we compared SKF with two common fusion methods: average kernel fusion (AVG) and similarity network fusion (SNF). We plotted the ROC curve and the precision-recall (PR) curve of the three methods based on LOOCV. As shown in Figure 5, the AUC values of SKF, AVG, and SNF are 0.9049, 0.8511, and 0.8298, respectively. The AUPR values of SKF, AVG, and SNF are 0.4082, 0.3955, and 0.2752, respectively. In summary, SKF performs better than the other fusion methods in terms of the accuracy of lncRNA-disease association prediction.
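For orientation, the AVG baseline can be sketched in a few lines, assuming that "average kernel fusion" means the element-wise mean of the similarity kernels (the text does not spell this out). The function name and the toy kernels are hypothetical.

```python
def average_fusion(kernels):
    """Element-wise mean of equally sized similarity kernels (assumed meaning of AVG)."""
    n_rows, n_cols = len(kernels[0]), len(kernels[0][0])
    return [[sum(K[i][j] for K in kernels) / len(kernels) for j in range(n_cols)]
            for i in range(n_rows)]

# Two hypothetical 2 x 2 kernels for the same pair of lncRNAs.
k1 = [[1.0, 0.2], [0.2, 1.0]]
k2 = [[1.0, 0.6], [0.6, 1.0]]
avg = average_fusion([k1, k2])
```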

Figure 5

The ROC Curve and the PR Curve of Three Integration Methods

Comparison with Single Similarity

In this paper, we combined different types of similarity for both lncRNAs and diseases. To demonstrate the benefit of the combination, we performed a series of comparison experiments, including all combinations of one single similarity for the lncRNA and the disease. The AUC values of different combinations in LOOCV and the 5-fold CV scheme are shown in Table 1, from which we can see that the proposed SKF-LDA achieves the highest AUC values.

Table 1

The AUC Values of SKF-LDA and Other Single Similarity in LOOCV and 5-fold CV Scheme

lncRNA Similarity	Disease Similarity	LOOCV	5-fold CV
Expression	semantic	0.8512	0.8476 ± 0.0034
Expression	cosine	0.8630	0.8375 ± 0.0046
Cosine	semantic	0.8835	0.8502 ± 0.0073
Cosine	cosine	0.8754	0.8519 ± 0.0070
Expression + cosine	semantic + cosine	0.9049	0.8743 ± 0.0050

Comparison with and without Neighbor Constraint

In this paper, we add a neighbor constraint to reduce the effect of noise. To demonstrate its benefit, we tested SKF-LDA without the neighbor constraint for comparison, as shown in Figure 6. Without the constraint, the AUC values are 0.8915 and 0.8694 ± 0.0035 in LOOCV and 5-fold CV, respectively; with the constraint, they are 0.9049 and 0.8743 ± 0.0050, respectively, which validates the effect of the neighbor constraint.

Figure 6

The ROC Curve and the PR Curve with and without the Neighbor Constraint

Comparison with Other Methods

To further validate the advantage of SKF-LDA, we compared our method with four state-of-the-art methods, namely, RWRlncD, LRLSLDA, SIMCLDA, and BRWLDA. As shown in Figure 7, based on the LOOCV scheme, the AUC values of RWRlncD, LRLSLDA, SIMCLDA, and BRWLDA are 0.6448, 0.8349, 0.8298, and 0.8024, respectively, while the proposed SKF-LDA method achieves 0.9049, which is much better than the others. In the 5-fold CV scheme, the AUC values of RWRlncD, LRLSLDA, SIMCLDA, and BRWLDA are 0.6518, 0.8339, 0.8138, and 0.7907, respectively, while ours is 0.8743. The AUPR measurement is also used to evaluate the different methods. The AUPRs of SKF-LDA, RWRlncD, LRLSLDA, SIMCLDA, and BRWLDA are 0.4082, 0.0808, 0.3343, 0.2555, and 0.3068, respectively. Meanwhile, we set two stringency levels to evaluate predictive performance, as shown in Table 2. When the specificity (Sp) is set to 99%, the PRE, sensitivity, accuracy, F1-score, and MCC of SKF-LDA are 0.4884, 0.3519, 0.9732, 0.5206, and 0.4013. When Sp = 95%, the PRE, sensitivity, accuracy, F1-score, and MCC of SKF-LDA are 0.2407, 0.5852, 0.9404, 0.7383, and 0.3501. At both levels, these values are higher than those of RWRlncD, LRLSLDA, SIMCLDA, and BRWLDA.
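Evaluating at a fixed stringency level means choosing the score threshold that yields the target specificity and then reading off the remaining measures. The sketch below shows one way to do this; the function name, the tie handling, and the toy scores are assumptions for illustration, and the procedure actually used for Table 2 is not described in the text.

```python
import math

def sensitivity_at_specificity(scores, labels, sp):
    """Pick the threshold that allows at most (1 - sp) of negatives above it (0 < sp <= 1)."""
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    # Number of false positives tolerated at this specificity level.
    allowed_fp = math.floor((1 - sp) * len(neg) + 1e-9)
    # Largest negative score that is still classified as negative.
    threshold = neg[len(neg) - allowed_fp - 1]
    tp = sum(s > threshold for s in pos)
    return tp / len(pos), threshold

# Hypothetical scores: 100 negatives spread over [0, 1) and 4 positives.
scores = [i / 100 for i in range(100)] + [0.985, 0.5, 0.995, 0.2]
labels = [0] * 100 + [1] * 4
sen, thr = sensitivity_at_specificity(scores, labels, sp=0.99)
```

With one false positive allowed among 100 negatives, the threshold lands on the second-highest negative score, and two of the four toy positives exceed it.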

Figure 7

The ROC Curve and AUC Values of Different Methods in LOOCV and 5-fold CV Scheme: SKF-LDA, RWRlncD, LRLSLDA, SIMCLDA, and BRWLDA

Table 2

Results of Different Methods

Measurement	SKF-LDA	RWRlncD³¹	LRLSLDA²⁸	SIMCLDA³⁰	BRWLDA³²
AUC	0.9049	0.6448	0.8349	0.8298	0.8024
AUPR	0.4082	0.0808	0.3343	0.2555	0.3068

Sp = 99%

PRE	0.4884	0.1076	0.4472	0.3539	0.4413
Sensitivity	0.3519	0.0444	0.2982	0.2019	0.2926
Accuracy	0.9732	0.9651	0.9718	0.9692	0.9716
F1-score	0.5206	0.0851	0.4593	0.3359	0.4527
MCC	0.4013	0.0532	0.3513	0.2526	0.3455

Sp = 95%

PRE	0.2407	0.1293	0.2283	0.2037	0.2100
Sensitivity	0.5852	0.2741	0.5463	0.4722	0.4907
Accuracy	0.9404	0.9321	0.9393	0.9374	0.9379
F1-score	0.7383	0.4302	0.7066	0.6415	0.6584
MCC	0.3501	0.1563	0.3271	0.2823	0.2937

Case Studies

To validate the ability of SKF-LDA to predict lncRNA-disease associations, case studies were conducted for three human diseases: lung cancer, hepatocellular carcinoma, and prostate cancer. The top 10 predicted lncRNAs for each disease were checked against two other databases: Lnc2Cancer and MNDR. Lung cancer is one of the leading causes of cancer death, and its malignancy is the highest among all cancers. Therefore, it is necessary to study the biological mechanism and the cause of lung cancer. In our experiments, 9 of the top 10 lncRNAs predicted for lung cancer by SKF-LDA are confirmed by the known databases, as shown in Table 3. GAS5 is a novel lung cancer biomarker that is related to the diagnosis and prognosis of lung cancer patients.⁴⁸,⁴⁹ CCAT2 not only promotes non-small-cell lung cancer but is also a lncRNA specific to lung adenocarcinoma. UCA1 is overexpressed in lung cancer cells, because it induces resistance to T790M in the AKT/mTOR pathway of non-small-cell lung cancer.

Table 3

The Top 10 lncRNA Candidates Predicted for Lung Cancer

Rank	lncRNA	Disease	Evidence
1	GAS5	lung cancer	lnc2Cancer2, MNDR
2	CCAT2	lung cancer	MNDR
3	UCA1	lung cancer	lnc2Cancer2, MNDR
4	HULC	lung cancer	unconfirmed
5	SPRY4-IT1	lung cancer	MNDR
6	CCAT1	lung cancer	MNDR
7	PVT1	lung cancer	lnc2Cancer2, MNDR
8	NEAT1	lung cancer	lnc2Cancer2, MNDR
9	XIST	lung cancer	MNDR
10	HNF1A-AS1	lung cancer	MNDR

Hepatocellular carcinoma (HCC) is one of the most common types of cancer in the world. Since many HCC patients are already in advanced stages when they are diagnosed, it is urgent to understand the mechanism of HCC and improve early diagnosis. Studies have shown that lncRNAs have a vital effect on human HCC. In this study, all of the top 10 lncRNAs predicted for HCC by SKF-LDA are confirmed by known databases and related literature, as shown in Table 4. Studies have shown that GAS5 is downregulated in most hepatocellular cancer patients and can be regarded as an important prognostic factor for HCC.⁵⁴,⁵⁵ In addition, UCA1 promotes the development of HCC by inhibiting miR-216b and activating the FGFR1/ERK signaling pathway. PVT1 is upregulated during liver development and contributes to HCC by affecting the lncRNA-hPVT1/NOP2 pathway.

Table 4

The Top 10 lncRNA Candidates Predicted for Hepatocellular Carcinoma

Rank	lncRNA	Disease	Evidence
1	GAS5	hepatocellular carcinoma	lnc2Cancer2, MNDR
2	UCA1	hepatocellular carcinoma	lnc2Cancer2, MNDR
3	PVT1	hepatocellular carcinoma	lnc2Cancer2, MNDR
4	CCAT2	hepatocellular carcinoma	lnc2Cancer2, MNDR
5	CDKN2B-AS1	hepatocellular carcinoma	MNDR
6	CCAT1	hepatocellular carcinoma	lnc2Cancer2, MNDR
7	BANCR	hepatocellular carcinoma	lnc2Cancer2, MNDR
8	PTENP1	hepatocellular carcinoma	lnc2Cancer2, MNDR
9	SPRY4-IT1	hepatocellular carcinoma	lnc2Cancer2, MNDR
10	NEAT1	hepatocellular carcinoma	lnc2Cancer2, MNDR

Prostate cancer is also a common malignancy among males and is the second leading cause of cancer death. Explaining the mechanisms of prostate cancer from a genetic perspective will help us to diagnose and prevent it. Based on SKF-LDA, 4 of the top 5 and 7 of the top 10 predicted lncRNAs are found in the databases, as shown in Table 5. Different variants of CDKN2B-AS1 are associated with prostate cancer. CCAT2 is upregulated in prostate cancer patients and affects the development of prostate cancer by changing the epithelial-mesenchymal transition. Among prostate cancer patients, the XIST gene locus is hypomethylated. This phenomenon may contribute to a further understanding of the biological mechanism of prostate cancer.

Table 5

The Top 10 lncRNA Candidates Predicted for Prostate Cancer

Rank	lncRNA	Disease	Evidence
1	CDKN2B-AS1	prostate cancer	MNDR
2	CCAT2	prostate cancer	lnc2Cancer2, MNDR
3	XIST	prostate cancer	lnc2Cancer2, MNDR
4	PTENP1	prostate cancer	lnc2Cancer2, MNDR
5	LSINCT5	prostate cancer	unconfirmed
6	IGF2-AS	prostate cancer	unconfirmed
7	SPRY4-IT1	prostate cancer	lnc2Cancer2, MNDR
8	MINA	prostate cancer	unconfirmed
9	CCAT1	prostate cancer	lnc2Cancer2
10	BANCR	prostate cancer	MNDR

Discussion

Numerous studies have shown that lncRNAs are of great importance in disease. Studying the relationship between lncRNAs and diseases not only helps us to understand the fundamentals behind disease but also contributes to the prognosis and prevention of disease. Since the current biological experimental methods are time consuming, many lncRNA-disease predictive models have emerged. In this paper, the proposed SKF-LDA method combines expression similarity with cosine similarity for lncRNAs and semantic similarity with cosine similarity for diseases using an effective fusion method. Compared with the other four methods, SKF-LDA performs better in terms of AUC and AUPR in the LOOCV and 5-fold CV schemes. SKF-LDA also scores higher on the other reference indices. To further validate the accuracy of SKF-LDA, we examined the top 10 predictions for three diseases (lung cancer, HCC, and prostate cancer). The prediction success rates reached 90%, 100%, and 70%, respectively.

The excellent performance of SKF-LDA is mainly due to the following reasons. First, SKF-LDA integrates two lncRNA similarities and two disease similarities, which provide rich biological information. Second, we integrate the different similarities with the neighbor constraint, which reduces the noise in the known dataset. Finally, the final lncRNA-disease prediction matrix is obtained by solving an optimization model based on Laplacian regularization, which has been applied successfully to many related problems.

Still, the proposed method has some shortcomings. The original lncRNA-disease association matrix is sparse: there are only 540 associations for 115 lncRNAs and 178 diseases, that is, about three associations per disease on average, which is too few to make the forecast results stable. Meanwhile, there are only two similarities for each of lncRNAs and diseases in the current integration, and more biological knowledge can be applied in the future.
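The sparsity argument can be checked with simple arithmetic on the dataset sizes quoted in the text. The variable names below are illustrative.

```python
# Dataset sizes as stated in the text.
n_lncrnas, n_diseases, n_associations = 115, 178, 540

density = n_associations / (n_lncrnas * n_diseases)   # fraction of non-zero entries
per_disease = n_associations / n_diseases             # mean associations per disease
per_lncrna = n_associations / n_lncrnas               # mean associations per lncRNA

print(f"density: {density:.2%}")
print(f"associations per disease: {per_disease:.2f}")
print(f"associations per lncRNA: {per_lncrna:.2f}")
```

Roughly 2.6% of the entries of the association matrix are known positives, which is consistent with the "about three associations per disease" figure.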

Materials and Methods

Human lncRNA-Disease Association Dataset

The lncRNADisease database is used as the known lncRNA-disease association dataset, which contains 687 confirmed lncRNA-disease associations between 369 lncRNAs and 247 diseases. After eliminating lncRNAs without an expression profile and diseases without disease ontology, 540 known lncRNA-disease associations involving 115 lncRNAs and 178 diseases were obtained. From these known associations, we can get the lncRNA-disease adjacency matrix $Y \in \mathbb{R}^{N_l \times N_d}$, where $N_l$ and $N_d$ are the numbers of lncRNAs and diseases, respectively; each row of $Y$ represents one lncRNA, while each column denotes one disease. An entry of 0 in $Y$ indicates that the relationship between lncRNA $l_i$ and disease $d_j$ is still unknown, and an entry of 1 indicates that lncRNA $l_i$ has some relationship to disease $d_j$. The definition of matrix $Y$ is as follows:

$$Y(l_i, d_j) = \begin{cases} 1, & \text{if lncRNA } l_i \text{ is associated with disease } d_j, \\ 0, & \text{otherwise.} \end{cases}$$
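As a small illustration of this construction, the sketch below builds a binary adjacency matrix from a list of known pairs, with rows for lncRNAs and columns for diseases. The function name and the placeholder identifiers are hypothetical and do not come from the lncRNADisease database.

```python
def build_adjacency(lncrnas, diseases, associations):
    """Return a list-of-rows matrix with 1 where (lncRNA, disease) is a known pair."""
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    matrix = [[0] * len(diseases) for _ in lncrnas]
    for lnc, dis in associations:
        matrix[row[lnc]][col[dis]] = 1
    return matrix

# Placeholder names, for illustration only.
lncrnas = ["lnc_A", "lnc_B", "lnc_C"]
diseases = ["dis_X", "dis_Y"]
known = [("lnc_A", "dis_X"), ("lnc_B", "dis_X"), ("lnc_C", "dis_Y")]
adjacency = build_adjacency(lncrnas, diseases, known)
```

In the cross-validation schemes, hiding a test association amounts to setting its entry in this matrix back to 0 before the similarities are recomputed.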

Similarity Kernels for lncRNAs and Diseases

The proposed method is based on the currently accepted hypothesis that lncRNAs with similar functionality tend to be associated with diseases with semantic or phenotypic similarities, and vice versa. Therefore, it is very important to obtain good similarity kernels for both the lncRNAs and the diseases, since they determine how accurately lncRNA-disease associations can be predicted. In this paper, first, we compute the expression similarity and cosine similarity for the lncRNAs. Second, we compute the semantic similarity and cosine similarity for the diseases. Then, a kernel fusion method is applied to all similarity kernels. Finally, based on the integrated lncRNA similarity kernel matrix and the integrated disease similarity kernel matrix, the Laplacian regularized least-squares method is used to get the final lncRNA-disease associations.

lncRNA Expression Similarity

The expression profiles of the lncRNAs are downloaded from ArrayExpress (accession E-MEXP-3783), in which more than 1.5 million expression profiles are collected by high-throughput sequencing. The Spearman correlation is used to calculate the expression similarity between different lncRNAs.²⁸,⁶⁵ The matrix $S_{l,1}$ denotes the lncRNA expression similarity, where element $S_{l,1}(l_i, l_j)$ represents the degree of similarity between lncRNA $l_i$ and lncRNA $l_j$; values range from 0 to 1.
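A minimal sketch of a Spearman-based similarity matrix is given below, computed as the Pearson correlation of average ranks. The profiles are hypothetical, and the raw coefficient lies in [−1, 1]; how negative correlations are mapped into the stated 0 to 1 range is not described in the text, so the sketch leaves the coefficient unscaled.

```python
import math

def ranks(values):
    """1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for a constant vector."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def expression_similarity(profiles):
    """Pairwise Spearman coefficients between expression profiles."""
    return [[spearman(a, b) for b in profiles] for a in profiles]

# Hypothetical expression profiles of three lncRNAs across four samples.
profiles = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.1, 6.5, 9.0], [4.0, 3.0, 2.0, 1.0]]
similarity = expression_similarity(profiles)
```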

Disease Semantic Similarity

Disease semantics are important information for characterizing a disease. The directed acyclic graph (DAG) has been used to calculate the semantic similarity of diseases with great performance.⁶⁶,⁶⁷ In this paper, the semantic similarity is used as one dimension of disease similarity. The raw data for semantic similarity are downloaded from the U.S. National Library of Medicine. Based on medical subject heading (MeSH) descriptors, a DAG, $DAG(d) = (d, T(d), E(d))$, can be constructed for disease $d$, where $T(d)$ denotes the set of ancestor nodes of $d$, including $d$ itself, and $E(d)$ is the set of corresponding links. The semantic contribution of an ancestor disease $t$ to disease $d$ is calculated as follows:

$$D_d(t) = \begin{cases} 1, & t = d, \\ \max\{\Delta \cdot D_d(t') \mid t' \in \text{children of } t\}, & t \neq d, \end{cases}$$

where $t \in T(d)$ and $\Delta$ is the weight parameter of the semantic similarity of diseases, which takes its default value. The semantic value of each disease is defined as follows:

$$DV(d) = \sum_{t \in T(d)} D_d(t).$$

With the semantic contribution and the semantic value defined, the semantic similarity matrix $S_{d,1}$ can be calculated. The similarity between arbitrary diseases $d_i$ and $d_j$ is computed as follows:

$$S_{d,1}(d_i, d_j) = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left( D_{d_i}(t) + D_{d_j}(t) \right)}{DV(d_i) + DV(d_j)}.$$
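To make the DAG-based calculation concrete, here is a small sketch on a toy hierarchy. It assumes the widely used formulation in which a disease contributes 1 to itself and each step toward an ancestor multiplies the contribution by a decay factor; the value 0.5 for that factor, the function names, and the node names are assumptions for illustration and are not real MeSH terms.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of the disease itself and of each ancestor (delta=0.5 is assumed)."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            value = delta * contrib[node]
            # Keep the largest contribution reaching this ancestor over all paths.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared-ancestor contributions divided by the two semantic values."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))

# Toy hierarchy (child -> parents): d1 and d2 share the parent p, whose parent is root.
parents = {"d1": ["p"], "d2": ["p"], "p": ["root"]}
sim_12 = semantic_similarity("d1", "d2", parents)
sim_11 = semantic_similarity("d1", "d1", parents)
```

In the toy case each disease has semantic value 1 + 0.5 + 0.25 = 1.75, the shared ancestors contribute 1.5 in total, and the similarity is 1.5 / 3.5 = 3/7.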

Cosine Similarity for lncRNAs and Diseases

The expression profile similarity for lncRNAs and the semantic similarity for diseases are two commonly used similarity kernels.³⁰,⁶⁸,⁶⁹ To further improve the similarity kernels, one more dimension of similarity is used in the proposed method. Previous studies have shown that cosine similarity is successfully applied to collaborative filtering recommendation algorithms,⁷⁰,⁷¹ which inspired us to bring this similarity into lncRNA-disease association prediction. The lncRNA cosine similarity is based on the assumption that if lncRNA $l_i$ and lncRNA $l_j$ are similar to each other, then, in the lncRNA-disease association matrix $Y$, their association patterns $Y(l_i, \cdot)$ and $Y(l_j, \cdot)$ should also be similar. The same assumption should hold for diseases. The cosine similarity between lncRNA $l_i$ and lncRNA $l_j$ is calculated as follows:

$$S_{l,2}(l_i, l_j) = \frac{Y(l_i, \cdot)\, Y(l_j, \cdot)^{T}}{\lVert Y(l_i, \cdot) \rVert \, \lVert Y(l_j, \cdot) \rVert},$$

where $Y(l_i, \cdot)$ represents the $i$th row of the lncRNA-disease association matrix $Y$ and contains the relationships of all the diseases to lncRNA $l_i$. Similarly, the cosine similarity between disease $d_i$ and disease $d_j$ can be calculated as follows:

$$S_{d,2}(d_i, d_j) = \frac{Y(\cdot, d_i)^{T}\, Y(\cdot, d_j)}{\lVert Y(\cdot, d_i) \rVert \, \lVert Y(\cdot, d_j) \rVert},$$

where $S_{d,2}$ is the cosine similarity matrix for diseases, and $Y(\cdot, d_i)$ represents the $i$th column of the lncRNA-disease association matrix $Y$.
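The two cosine kernels can be sketched directly from the rows and columns of the association matrix. The toy matrix is hypothetical, and returning 0 for an all-zero profile is an assumption made here to avoid division by zero; the text does not say how such profiles are handled.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two association profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0   # assumed convention for all-zero profiles

def cosine_kernels(adjacency):
    """Return (lncRNA kernel from rows, disease kernel from columns)."""
    rows = adjacency
    cols = [list(c) for c in zip(*adjacency)]
    lnc_kernel = [[cosine(a, b) for b in rows] for a in rows]
    dis_kernel = [[cosine(a, b) for b in cols] for a in cols]
    return lnc_kernel, dis_kernel

# Hypothetical 3 lncRNAs x 3 diseases association matrix.
toy = [[1, 1, 0],
       [1, 0, 0],
       [0, 0, 1]]
lnc_kernel, dis_kernel = cosine_kernels(toy)
```

Because the profiles are taken from the association matrix itself, these kernels have to be recomputed inside each cross-validation round after the test associations are hidden.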

SKF for lncRNAs and Diseases

Now we have two lncRNA similarity kernels (lncRNA expression similarity and lncRNA cosine similarity) and two disease similarity kernels (disease semantic similarity and disease cosine similarity). Next, we use SKF to integrate the two lncRNA similarity kernels $S_{l,n}$, $n = 1, 2$.

In the first step, we normalize each lncRNA similarity kernel as follows:

$$P_{l,n}(l_i, l_j) = \frac{S_{l,n}(l_i, l_j)}{\sum_{l_k \in L} S_{l,n}(l_k, l_j)},$$

where $L$ represents the set of lncRNAs, and $P_{l,n}$ represents the normalized kernel and satisfies $\sum_{l_k \in L} P_{l,n}(l_k, l_j) = 1$.

In the second step, we create a neighbor-constraint kernel for each of the two lncRNA similarity kernels as follows:

$$F_{l,n}(l_i, l_j) = \begin{cases} \dfrac{S_{l,n}(l_i, l_j)}{\sum_{l_k \in N_i} S_{l,n}(l_i, l_k)}, & l_j \in N_i, \\ 0, & l_j \notin N_i, \end{cases}$$

where $F_{l,n}$ denotes a neighbor-constraint kernel and satisfies $\sum_{l_j \in L} F_{l,n}(l_i, l_j) = 1$. Here, the neighbor set $N_i$ of lncRNA $l_i$ consists of the $K$ lncRNAs most similar to $l_i$.

In the third step, we combine the normalized similarity kernel and the neighbor-constraint kernel iteratively. Let $P^{t}_{l,n}$ denote the $n$th kernel obtained after $t$ iterations, $P^{0}_{l,n}$ the initial value of $P_{l,n}$, and $\alpha$ the weight parameter. After $t$ iterations, the final kernel $S_l$ is obtained from the updated kernels.

In the fourth step, a weighted matrix $w_l$ is added to embed more neighbor information. Finally, the weighted matrix is applied to $S_l$ to give the integrated lncRNA similarity kernel matrix $S_l^{*}$. Similarly, we can get the integrated disease similarity kernel matrix $S_d^{*}$.
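The fusion steps can be sketched end to end as below. This is an illustration under stated assumptions, not the authors' implementation: the normalization is taken column-wise, each item may be its own neighbor, the iterative update is borrowed from similarity network fusion (each kernel is diffused through its neighbor-constraint kernel using the mean of the other kernels, then mixed with their initial values by α), the final kernel is the mean of the updated kernels, and the weight is 1 for mutual neighbors, 0.5 for one-sided neighbors, and 0 otherwise. All function names and toy kernels are hypothetical, and at least two kernels are required.

```python
def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def mean_of(mats):
    n = len(mats[0])
    return [[sum(M[i][j] for M in mats) / len(mats) for j in range(n)] for i in range(n)]

def normalize(S):
    """Step 1: divide each column by its sum, so every column sums to 1."""
    sums = [sum(col) for col in zip(*S)]
    return [[S[i][j] / sums[j] for j in range(len(S))] for i in range(len(S))]

def neighbors(S, i, k):
    """Indices of the k largest entries in row i (the item itself is not excluded)."""
    return set(sorted(range(len(S)), key=lambda j: S[i][j], reverse=True)[:k])

def neighbor_kernel(S, k):
    """Step 2: keep each row's k nearest neighbours, rescaled to sum to 1."""
    F = [[0.0] * len(S) for _ in S]
    for i in range(len(S)):
        nb = neighbors(S, i, k)
        total = sum(S[i][j] for j in nb)
        for j in nb:
            F[i][j] = S[i][j] / total if total else 0.0
    return F

def skf(kernels, k, alpha, iterations):
    n, m = len(kernels[0]), len(kernels)
    P0 = [normalize(S) for S in kernels]
    F = [neighbor_kernel(S, k) for S in kernels]
    P = P0
    for _ in range(iterations):                      # Step 3 (assumed SNF-style update)
        updated = []
        for a in range(m):
            others_t = mean_of([P[b] for b in range(m) if b != a])
            others_0 = mean_of([P0[b] for b in range(m) if b != a])
            Ft = [list(c) for c in zip(*F[a])]
            diffused = matmul(matmul(F[a], others_t), Ft)
            updated.append([[alpha * diffused[i][j] + (1 - alpha) * others_0[i][j]
                             for j in range(n)] for i in range(n)])
        P = updated
    fused = mean_of(P)
    nbs = [neighbors(fused, i, k) for i in range(n)]  # Step 4 (assumed weighting)
    def weight(i, j):
        return ((i in nbs[j]) + (j in nbs[i])) / 2    # 1 mutual, 0.5 one-sided, 0 neither
    weighted = [[weight(i, j) * fused[i][j] for j in range(n)] for i in range(n)]
    return fused, weighted

# Hypothetical expression and cosine kernels for three lncRNAs.
expr = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]]
cosn = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.0]]
fused, weighted = skf([expr, cosn], k=2, alpha=0.1, iterations=5)
```

Setting `alpha=0` makes the update return the initial normalized kernels unchanged, which is a convenient sanity check; larger values let more neighbor-diffused information into the fused kernel.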

Laplacian Regularized Least Squares for lncRNA-Disease Association

With the integrated lncRNA similarity matrix $S_l^{*}$ and disease similarity matrix $S_d^{*}$ obtained by the SKF method, LapRLS is used to predict potential lncRNA-disease associations. From the view of lncRNAs, we build a minimization model over $F_l$, the correlation matrix in the lncRNA space. In this model, $\lVert \cdot \rVert_F$ is the Frobenius norm; $Y$ is the initial known lncRNA-disease association matrix; $\beta_l$ is the weighting parameter of LapRLS; and $L_l$ is the normalized similarity matrix, built with the diagonal matrix $D_l$ obtained by summing the elements of each row of the lncRNA similarity matrix $S_l^{*}$. The first term of the objective function ensures that the new correlation matrix is similar to the known one. The second term ensures that the correlation matrix is smooth over the lncRNA space. The model is solved by setting the derivative of the objective function to zero, which gives the optimal correlation matrix $F_l^{*}$ in the lncRNA space. Similarly, we can obtain the optimal correlation matrix $F_d^{*}$ in the disease space. Finally, we integrate the prediction matrices from the lncRNA and disease spaces to obtain the final prediction association matrix $F^{*}$.
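A compact sketch of a LapRLS-style predictor follows. It assumes the closed form commonly used in Laplacian regularized least-squares association models, F = S (S + β L S)⁻¹ Y with L the symmetric normalized Laplacian of S, and it assumes that the two spaces are combined by a plain average. Both choices, the function names, and the toy matrices are assumptions for illustration; the exact objective and combination rule are not recoverable from this text.

```python
import math

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(c) for c in zip(*A)]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + list(map(float, B[i])) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def normalized_laplacian(S):
    """D^(-1/2) (D - S) D^(-1/2), with D the diagonal matrix of row sums of S."""
    d = [sum(row) for row in S]
    n = len(S)
    return [[((d[i] if i == j else 0.0) - S[i][j]) / math.sqrt(d[i] * d[j])
             for j in range(n)] for i in range(n)]

def laprls(S, Y, beta):
    """Assumed closed form: F = S (S + beta * L S)^(-1) Y."""
    LS = matmul(normalized_laplacian(S), S)
    n = len(S)
    A = [[S[i][j] + beta * LS[i][j] for j in range(n)] for i in range(n)]
    return matmul(S, solve(A, Y))

def predict(S_l, S_d, Y, beta):
    """Average the lncRNA-space and disease-space predictions (assumed combination)."""
    F_l = laprls(S_l, Y, beta)
    F_d = transpose(laprls(S_d, transpose(Y), beta))
    return [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(F_l, F_d)]

# Hypothetical fused kernels (3 lncRNAs, 2 diseases) and known associations.
S_l = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
S_d = [[1.0, 0.4], [0.4, 1.0]]
Y = [[1, 0], [0, 1], [0, 0]]
F = predict(S_l, S_d, Y, beta=1.0)
```

With `beta=0` the regularizer vanishes and the prediction reproduces `Y`, which gives a quick correctness check; increasing `beta` spreads scores to unlabeled pairs whose lncRNAs or diseases are similar to labeled ones.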

Author Contributions

Conceptualization, G.X. and T.M.; Formal Analysis, T.M.; Investigation and Methodology, T.M. and Y.L.; Resources, T.M.; Project Administration, G.X. and Y.L.; Supervision, G.X. and Z.L.; Visualization, T.M.; Writing – Original Draft, T.M.; Writing – Review & Editing, G.X., T.M., Y.L., and Z.L.

67 in total

Review 1. Long noncoding RNAs and human disease.

Authors: Orly Wapinski; Howard Y Chang
Journal: Trends Cell Biol Date: 2011-05-06 Impact factor: 20.808

2. Prediction of lncRNA-disease associations based on inductive matrix completion.

Authors: Chengqian Lu; Mengyun Yang; Feng Luo; Fang-Xiang Wu; Min Li; Yi Pan; Yaohang Li; Jianxin Wang
Journal: Bioinformatics Date: 2018-10-01 Impact factor: 6.937

3. Targeted disruption of Hotair leads to homeotic transformation and gene derepression.

Authors: Lingjie Li; Bo Liu; Orly L Wapinski; Miao-Chih Tsai; Kun Qu; Jiajing Zhang; Jeff C Carlson; Meihong Lin; Fengqin Fang; Rajnish A Gupta; Jill A Helms; Howard Y Chang
Journal: Cell Rep Date: 2013-09-26 Impact factor: 9.423

4. Hypomethylation of the XIST gene promoter in prostate cancer.

Authors: Thilo Laner; Wolfgang A Schulz; Rainer Engers; Mirko Müller; Andrea R Florl
Journal: Oncol Res Date: 2005 Impact factor: 5.574

5. Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces.

Authors: Zheng Xia; Ling-Yun Wu; Xiaobo Zhou; Stephen T C Wong
Journal: BMC Syst Biol Date: 2010-09-13

6. MNDR v2.0: an updated resource of ncRNA-disease associations in mammals.

Authors: Tianyu Cui; Lin Zhang; Yan Huang; Ying Yi; Puwen Tan; Yue Zhao; Yongfei Hu; Liyan Xu; Enmin Li; Dong Wang
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

7. BRWLDA: bi-random walks for predicting lncRNA-disease associations.

Authors: Guoxian Yu; Guangyuan Fu; Chang Lu; Yazhou Ren; Jun Wang
Journal: Oncotarget Date: 2017-07-26

8. LncRNADisease: a database for long-non-coding RNA-associated diseases.

Authors: Geng Chen; Ziyun Wang; Dongqing Wang; Chengxiang Qiu; Mingxi Liu; Xing Chen; Qipeng Zhang; Guiying Yan; Qinghua Cui
Journal: Nucleic Acids Res Date: 2012-11-21 Impact factor: 16.971

9. Semi-supervised learning for potential human microRNA-disease associations inference.

Authors: Xing Chen; Gui-Ying Yan
Journal: Sci Rep Date: 2014-06-30 Impact factor: 4.379

10. MDHGI: Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction.

Authors: Xing Chen; Jun Yin; Jia Qu; Li Huang
Journal: PLoS Comput Biol Date: 2018-08-24 Impact factor: 4.475
