DSCMF: prediction of LncRNA-disease associations based on dual sparse collaborative matrix factorization.

Jin-Xing Liu¹, Ming-Ming Gao¹, Zhen Cui¹, Ying-Lian Gao², Feng Li¹.

Abstract

BACKGROUND: There is increasing evidence that lncRNAs are associated with human diseases. Identifying these associations would greatly benefit the treatment and prevention of disease, but finding them experimentally is difficult and requires a great deal of time and effort. It is therefore important to develop effective methods for predicting lncRNA-disease associations (LDAs).
RESULTS: In this paper, we propose a method based on dual sparse collaborative matrix factorization (DSCMF) to predict LDAs. DSCMF improves on the traditional collaborative matrix factorization method. The L2,1-norm is added to increase sparsity, and the Gaussian interaction profile kernel is added to incorporate network similarity for lncRNAs and diseases. The quality of the method is evaluated by the AUC value obtained with ten-fold cross-validation.
CONCLUSIONS: The AUC value obtained by the DSCMF method is 0.8523. At the end of the paper, a simulation experiment is carried out, and the results for prostate cancer, breast cancer, ovarian cancer and colorectal cancer are analyzed in detail. The DSCMF method is expected to be helpful for research on lncRNA-disease associations. The code is available at https://github.com/Ming-0113/DSCMF.

Keywords: Collaborative matrix factorization; Gaussian interaction profile kernel; LncRNA-disease associations

Substances:
RNA, Long Noncoding

Year: 2021 PMID: 33980147 PMCID: PMC8114493 DOI: 10.1186/s12859-020-03868-w

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

In recent years, long non-coding RNAs (lncRNAs) have attracted growing attention from researchers. LncRNAs are transcripts longer than 200 nucleotides that do not encode proteins [1]. Many experiments have demonstrated that lncRNAs play important roles in processes such as epigenetic regulation, cell cycle control and the regulation of cell differentiation [2-4]. However, the current understanding of lncRNAs is still far from complete, and further research on them will contribute to human biology.

There is increasing evidence that lncRNAs are closely linked to many human diseases, such as cardiovascular diseases [5, 6], diabetes [7], Alzheimer's disease [8] and some cancers. For example, MALAT1 is a transcript that is overexpressed in many cancers [9] and is closely related to lung cancer [10], renal cancer [11] and esophageal cancer [12]. Another example is GAS5, which is related to head and neck cancer [13], colon cancer [14], thyroid cancer [15], etc. Although some LDA databases have been established, the number of known LDAs in them is far from sufficient, and many unknown associations remain to be discovered. An efficient and accurate method for LDA prediction is therefore needed.

Many methods have already been proposed for LDA prediction [16]. For example, Sun et al. proposed a computational model that uses random walk with restart on the lncRNA functional similarity network [17].
A lncRNA-lncRNA functional similarity network was constructed, and the relationship between phenotypically similar diseases and functionally similar lncRNAs was used to predict novel associations. Experiments showed that this method is feasible. Chen et al. improved the random walk with restart model by combining the disease semantic similarity matrix with the lncRNA expression similarity matrix and by setting the initial probability vector of the random walk [18], so that the model can be applied to diseases without known related lncRNAs. Chen et al. proposed a Laplacian regularized least squares method to predict novel associations, based on the assumption that similar diseases tend to be related to functionally similar lncRNAs [19]. This method was developed under the framework of semi-supervised learning and can be used to rank the candidate disease-lncRNA pairs for all diseases. Chen proposed a KATZ measure model to predict novel LDAs by combining lncRNA expression similarity and functional similarity with disease semantic similarity and GIP kernel similarity [20]. This method can make predictions for diseases with no known associated lncRNAs and for lncRNAs with no known associated diseases. Ding et al. combined the gene-disease association network with the lncRNA-disease association network into a lncRNA-disease-gene tripartite graph for prediction [21]. The advantage of this method is that it describes the heterogeneity of coding and non-coding gene-disease associations better than other methods. Ping et al. proposed constructing a bipartite network to predict novel LDAs [22]. This method uses the known topology of the lncRNA-disease network to identify potential LDAs, and leave-one-out cross-validation was used to evaluate its performance. Zhao et al. proposed a method for predicting novel LDAs that does not rely on any known lncRNA-disease association [23]. It is based on a distance correlation set that combines known lncRNA-miRNA associations and miRNA-disease associations, and the results show that it is effective. Ou-Yang et al. proposed the two-side sparse self-representation method for predicting LDAs [24]. The advantage of this approach is that it adaptively learns the self-representations of lncRNAs and of diseases from the known LDAs, and it also exploits the internal associations between diseases and lncRNAs. Fu et al. proposed a matrix factorization model that decomposes the data matrices of heterogeneous data sources into low-rank matrices [25].

In this paper, an improved matrix factorization model is proposed to predict LDAs. The method builds on collaborative matrix factorization and adds the Gaussian interaction profile kernel. The L2,1-norm is also added to prevent over-fitting [26-28]. Because some associations may be missing from the data, which would reduce prediction accuracy, a weight K nearest known neighbors (WKNKN) pre-processing step is added as well. Cross-validation is used to obtain the AUC value of the method, and a simulation experiment is carried out at the end of the paper. The results show that our method outperforms the compared methods. The specific improvements of our approach are as follows. In the DSCMF method, the L2,1-norm is applied to the two latent feature matrices to make them sparse, which reduces redundant data, improves the computational power of the model, improves the robustness of the algorithm, and reduces the influence of noise on these matrices.
In addition, network similarity is incorporated into the DSCMF method through an lncRNA network similarity matrix and a disease network similarity matrix.

The rest of this paper is organized as follows. The Results section presents the experimental results of the DSCMF method. The Discussion and Conclusion sections discuss and summarize the method and outline future work. The Methods section gives the algorithm and the detailed formulas of the DSCMF method.

Results

Human LncRNA-disease associations

The LncRNADisease database is a commonly used database for studying lncRNA-disease associations [29]. This database contains 247 diseases, 369 lncRNAs and their associations. These associations were previously verified by 687 experiments [21]. After removing diseases without Disease Ontology entries (https://disease-ontology.org/) and lncRNAs without expression profiles in ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) [30], 178 diseases and 115 lncRNAs remain. The resulting dataset contains 540 lncRNA-disease associations, as listed in Table 1. The associations are stored in a binary adjacency matrix with one row per lncRNA and one column per disease. An element equal to 1 means that the lncRNA is related to the disease; otherwise, no association between them is known. Ten-fold cross-validation is applied in this paper, and the above dataset is used as the gold standard dataset in the experiments to predict novel LDAs.
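
As an illustration of the data layout described above, the following minimal sketch builds a binary adjacency matrix from a list of known pairs and deals the pairs into cross-validation folds. The function names, the toy indices and the pair-wise way of splitting are assumptions made for this example; the paper does not spell out how its folds are drawn.

```python
import random

def build_adjacency(pairs, n_lncrnas, n_diseases):
    """Return a binary matrix with 1 where (lncRNA index, disease index) is a known pair."""
    matrix = [[0] * n_diseases for _ in range(n_lncrnas)]
    for i, j in pairs:
        matrix[i][j] = 1
    return matrix

def split_into_folds(pairs, n_folds=10, seed=0):
    """Shuffle the known pairs and deal them into n_folds nearly equal folds."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[k::n_folds] for k in range(n_folds)]

# Toy data (hypothetical indices): 4 lncRNAs, 3 diseases, 5 known associations.
toy_pairs = [(0, 0), (0, 2), (1, 1), (2, 2), (3, 0)]
Y = build_adjacency(toy_pairs, n_lncrnas=4, n_diseases=3)
folds = split_into_folds(toy_pairs, n_folds=5)
```

In each round, the pairs of one fold would be set to 0 in the matrix before training and then used as the positive test samples.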

Table 1

LncRNAs, diseases, and associations in Gold Standard Dataset

Datasets	LncRNAs	Diseases	Associations
Gold standard dataset	115	178	540

Cross validation

Cross-validation is used as the evaluation method in our experiments, and DSCMF is compared with the previously proposed LRLSLDA [19], ncPred [31], TPGLDA [21] and NTSHMDA [32] methods. The experiments use ten-fold cross-validation. To ensure that the results are stable and reliable, each method is run 30 times. Because some true associations may be missing from the dataset, WKNKN pre-processing is applied in our method. Each experiment yields an AUC value [33], which is the indicator used to evaluate the quality of a method. The AUC value is the area under the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), which are calculated as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$

where TP and TN represent the numbers of positive and negative samples that are identified correctly, and FP and FN represent the numbers of samples that are incorrectly identified as positive and negative, respectively. The area under the ROC curve cannot exceed 1, so the AUC value is a number between 0 and 1. In practice it usually lies between 0.5 and 1. An AUC of 0.5 corresponds to random guessing, so a value below 0.5 indicates that the method is not feasible.
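
The relationship between TPR, FPR and AUC can be made concrete with a short sketch. It sweeps a threshold over predicted scores, records one (FPR, TPR) point per threshold, and integrates with the trapezoidal rule. The scores and labels are made-up toy values, not results from the paper.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering a threshold over the scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))  # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
    return points

def auc(points):
    """Trapezoidal area under (FPR, TPR) points that are ordered by increasing FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]   # toy predicted scores
labels = [1, 1, 0, 1, 0, 0]                # 1 = true association, 0 = no association
value = auc(roc_points(scores, labels))
```

For this toy input, 8 of the 9 positive-negative pairs are ranked correctly, so the area equals 8/9.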

Comparison with other methods

The experimental results of the LRLSLDA, ncPred, TPGLDA, NTSHMDA and DSCMF methods are listed in Table 2. The DSCMF method has the highest AUC value (0.8523). The NTSHMDA method is second, and DSCMF exceeds it by 0.0585, that is, 5.85 percentage points. The LRLSLDA method has the lowest AUC value, 0.1898 (18.98 percentage points) below that of our method. The AUC values of the various methods are shown graphically in Fig. 1.

Table 2

AUC results of cross validation experiments

Methods	Gold standard dataset
LRLSLDA	0.6625 (0.0089)
ncPred	0.7566 (0.0218)
TPGLDA	0.7586 (0.0306)
NTSHMDA	0.7938 (0.0030)
DSCMF	0.8523 (0.0049)

Fig. 1

ROC curves and AUC values of the LRLSLDA, ncPred, TPGLDA, NTSHMDA and DSCMF methods under ten-fold cross-validation. The DSCMF method has the best performance

The above results show that the DSCMF method outperforms the previous methods and is therefore better suited to predicting LDAs. Two changes account for this. First, the DSCMF method adds a GIP kernel to the original CMF method, which brings the lncRNA network similarity matrix and the disease network similarity matrix into the model. Second, it adds the L2,1-norm, which increases sparsity.

Sensitivity analysis from WKNKN

Some missing associations may have an important influence on prediction, so WKNKN pre-processing is introduced in the DSCMF method to limit their negative effect on the experimental results. The parameter settings of this step also affect the results: different values may change the AUC value, so the choice of parameters is important. WKNKN has two parameters: K, the number of nearest known neighbors, and a decay (attenuation) parameter. According to previous experience, the AUC value tends to be stable when K is set to 5 and the decay parameter is set to 0.7. Figures 2 and 3 show the effect of K and of the decay parameter on AUC, respectively.

Fig. 2

The sensitivity analysis for K under CV-p

Fig. 3

The sensitivity analysis for the decay parameter under CV-p

Robust analysis of DSCMF

In this paper, we add the L2,1-norm, which can improve the robustness of our algorithm. To demonstrate the ability of the DSCMF method to learn a subspace, that is, its resistance to interference when recovering data, the method is applied to a synthetic dataset of 200 two-dimensional data points, all of which lie in a one-dimensional subspace (a line in the coordinate plane). For comparison, the original CMF method is applied to the same synthetic dataset. Different numbers of noise points are added to the synthetic dataset to compare the robustness of the CMF and DSCMF methods. Figure 4 shows the data distribution after adding one noise point; both methods remain relatively stable. Figures 5, 6, and 7 show the data distributions with 30, 60, and 90 noise points, respectively. As the number of noise points increases, the DSCMF method remains essentially stable and is largely unaffected by the noise, whereas the CMF method is affected more strongly. This indicates that the DSCMF method is more robust.

Fig. 4

The comparison of the robustness of the CMF and DSCMF methods when the noise point is 1

Fig. 5

The comparison of the robustness of the CMF and DSCMF methods when the noise point is 30

Fig. 6

The comparison of the robustness of the CMF and DSCMF methods when the noise point is 60

Fig. 7

The comparison of the robustness of the CMF and DSCMF methods when the noise point is 90

Case study

In this section, a simulation experiment is performed to predict novel LDAs. Four common diseases are selected for study: prostate cancer, breast cancer, ovarian cancer, and colorectal cancer. The experimental procedure is as follows. For each disease, the scores in the predicted score matrix are sorted from high to low. The lncRNAs with the highest scores are then selected for analysis and verified against the LncRNADisease and Lnc2cancer databases.

The first disease is prostate cancer. Prostate cancer is an epithelial malignancy of the prostate that is closely related to genetic factors. More detailed information on prostate cancer is available at https://www.omim.org/entry/176807. In the original gold standard dataset, 13 lncRNAs have been shown to be associated with prostate cancer. The top 20 lncRNAs in the prediction matrix are extracted and analyzed, and 12 of these 13 known lncRNAs are among them. They are marked as Known in Table 3. Among the remaining 8 lncRNAs, three (TUG1, IGF2-AS and CDKN2B-AS1) are found in the LncRNADisease database to be associated with prostate cancer. Their PMIDs are 26975529 [34], 19767753 [35] and 23660942 [36], respectively. XIST is confirmed to be associated with prostate cancer in the Lnc2cancer database, and its PMID is 29212233 [37]. PTENP1 is found to be associated with prostate cancer in both the LncRNADisease and Lnc2cancer databases; the PMIDs are 24373479 [38] and 20577206 [39], respectively. The specific information is shown in Table 3.
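
The ranking step of this procedure can be sketched as follows. The function name, the label strings and the scores are illustrative assumptions, not values from the paper; only the sort-and-select logic follows the text.

```python
def top_candidates(score_column, lncrna_names, known, k):
    """Rank lncRNAs for one disease by predicted score, highest first, and label each one."""
    ranked = sorted(zip(lncrna_names, score_column), key=lambda item: item[1], reverse=True)
    return [(rank, name, "Known" if name in known else "Candidate")
            for rank, (name, _) in enumerate(ranked[:k], start=1)]

names = ["XIST", "MALAT1", "ZFAS1", "TUG1"]
scores = [0.41, 0.97, 0.12, 0.38]   # hypothetical scores for a single disease
known = {"MALAT1"}                  # associations already present in the gold standard
top3 = top_candidates(scores, names, known, k=3)
```

Entries labelled "Candidate" are the ones that would then be checked against external databases such as LncRNADisease and Lnc2cancer.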

Table 3

Predicted LncRNAs for prostate cancer

Rank	lncRNA	Evidence	Rank	lncRNA	Evidence
1	MALAT1	Known	11	HOTTIP	Known
2	MEG3	Known	12	DANCR	Known
3	H19	Known	13	XIST	Lnc2cancer
4	HOTAIR	Known	14	PTENP1	LncRNADisease; Lnc2cancer
5	GAS5	Known	15	TUG1	LncRNADisease
6	PVT1	Known	16	IGF2-AS	LncRNADisease
7	UCA1	Known	17	ZFAS1	Unconfirmed
8	HULC	Known	18	CDKN2B-AS1	LncRNADisease
9	KCNQ1OT1	Known	19	CCAT1	Unconfirmed
10	NEAT1	Known	20	SNHG16	Unconfirmed

The second disease is breast cancer. Breast cancer has become a common disease that threatens women's physical and mental health. More detailed information about breast cancer is available at https://www.omim.org/entry/114480. In the gold standard dataset, 20 lncRNAs are related to breast cancer. Among the top 30 lncRNAs predicted in the simulation experiment, 17 are confirmed in the gold standard dataset. These 17 lncRNAs are marked as Known in Table 4. Of the remaining 13, two (CCAT1 and TUG1) are confirmed in the LncRNADisease database; their PMIDs are 26464701 [40] and 27791993 [41]. Three lncRNAs (PTENP1, SNHG16 and TUSC7) are confirmed to be associated with breast cancer in the Lnc2cancer database, with PMIDs 29085464 [42], 28232182 [43], and 23558749 [44], respectively. KCNQ1OT1 is confirmed to be associated with breast cancer in both the LncRNADisease and Lnc2cancer databases. The remaining seven lncRNAs are not confirmed by either database to be associated with breast cancer. The specific experimental results are listed in Table 4. As an example, previous studies have demonstrated that CCAT1 is overexpressed in breast cancer tissue compared with normal tissue.

Table 4

Predicted LncRNAs for breast cancer

Rank	lncRNA	Evidence	Rank	lncRNA	Evidence
1	HOTAIR	Known	16	ZFAS1	Known
2	MALAT1	Known	17	CDKN2B-AS1	Known
3	H19	Known	18	CCAT1	LncRNADisease
4	GAS5	Known	19	PTENP1	Lnc2cancer
5	UCA1	Known	20	HULC	Unconfirmed
6	PVT1	Known	21	BANCR	Unconfirmed
7	BC040587	Known	22	SNHG16	Lnc2cancer
8	XIST	Known	23	TUG1	LncRNADisease
9	MEG3	Known	24	MINA	Unconfirmed
10	SPRY4-IT1	Known	25	TUSC7	Lnc2cancer
11	CCAT2	Known	26	EPB41L4A-AS1	Unconfirmed
12	BCYRN1	Known	27	7SK	Unconfirmed

The third disease is ovarian cancer. Ovarian cancer is a common cancer of the female reproductive organs. Its incidence is second only to cervical cancer and endometrial cancer, and it poses a serious threat to women's health. More detailed information on ovarian cancer is available at https://www.omim.org/entry/167000. In the gold standard dataset, 12 lncRNAs are known to be associated with ovarian cancer. The top 22 lncRNAs in the prediction matrix are selected for analysis, and the results are listed in Table 5. We successfully predict 11 lncRNAs that are confirmed in the gold standard dataset; they are marked as Known in Table 5. Three lncRNAs (GAS5, NEAT1, and CCAT2) are confirmed in the LncRNADisease database, with PMIDs 27779700 [45], 27608895 [46] and 27558961 [47]. MEG3, SNHG16, MNX1-AS1, and ZFAS1 are confirmed to be associated with ovarian cancer in the Lnc2cancer database, with PMIDs 28175963 [48], 29461589 [49], 29271994 [50], and 28154416 [51], respectively. The remaining four lncRNAs are not confirmed to have any association with ovarian cancer in either LncRNADisease or Lnc2cancer.

Table 5

Predicted LncRNAs for ovarian cancer

Rank	lncRNA	Evidence	Rank	lncRNA	Evidence
1	HOTAIR	Known	12	GAS5	LncRNADisease
2	H19	Known	13	MEG3	Lnc2cancer
3	UCA1	Known	14	NEAT1	LncRNADisease
4	PVT1	Known	15	SPRY4-IT1	Unconfirmed
5	MALAT1	Known	16	CCAT2	LncRNADisease
6	XIST	Known	17	HULC	Unconfirmed
7	BCYRN1	Known	18	SNHG16	Lnc2cancer
8	CCAT1	Known	19	BANCR	Unconfirmed
9	SRA1	Known	20	ZFAS1	Lnc2cancer
10	LSINCT5	Known	21	PTENP1	Unconfirmed
11	MNX1-AS1	Known	22	TUSC7	Unconfirmed

The last disease is colorectal cancer. Colorectal cancer is a common malignant tumor in humans. China is a low-incidence area for colorectal cancer, but in recent years its incidence has increased in different regions. The original gold standard dataset contains 21 lncRNAs that are associated with colorectal cancer. Of these, 20 are successfully predicted by the DSCMF algorithm; they are marked as Known in Table 6. The remaining 10 of the top 30 predicted lncRNAs are checked against the LncRNADisease and Lnc2cancer databases. Four of them (SPRY4-IT1, CDKN2B-AS1, TUG1 and ZFAS1) are confirmed to be associated with colorectal cancer in the LncRNADisease database. Their PMIDs are 27621655 [52], 27286457 [53], 27421138 [54] and 27461828 [55], respectively. The other six lncRNAs are not confirmed to be associated with colorectal cancer, and further research is needed. Specific information on lncRNAs and colorectal cancer is shown in Table 6:

Table 6

Predicted LncRNAs for colorectal cancer

Rank	lncRNA	Evidence	Rank	lncRNA	Evidence
1	MALAT1	Known	16	TUSC7	Known
2	HOTAIR	Known	17	RPL34-AS1	Known
3	MEG3	Known	18	SNHG16	Known
4	GAS5	Known	19	MNX1-AS1	Known
5	PVT1	Known	20	NPTN-IT1	Known
6	UCA1	Known	21	SPRY4-IT1	LncRNADisease
7	H19	Known	22	PTENP1	Unconfirmed
8	XIST	Known	23	CDKN2B-AS1	LncRNADisease
9	CCAT1	Known	24	LINC00261	Unconfirmed
10	NEAT1	Known	25	BCYRN1	Unconfirmed
11	HULC	Known	26	TUG1	LncRNADisease
12	CCAT2	Known	27	MINA	Unconfirmed
13	BANCR	Known	28	7SK	Unconfirmed
14	LSINCT5	Known	29	BC040587	Unconfirmed
15	KCNQ1OT1	Known	30	ZFAS1	LncRNADisease

Discussion

Numerous studies have shown that lncRNAs are associated with certain human diseases, so effective methods for predicting these associations are an important contribution. Finding LDAs experimentally takes a long time and a great deal of effort, and new prediction methods can therefore be of great help to this research. The DSCMF method introduced in this paper adds the L2,1-norm to the traditional collaborative matrix factorization method to increase sparsity, and uses the GIP kernel to add network similarity. The cross-validation results indicate that our method is suitable for LDA prediction. Our method is not without disadvantages: it requires a long running time. Shortening the running time is therefore an important problem that we still need to solve.

Conclusion

Ten-fold cross-validation is used in the experimental part of this paper. WKNKN pre-processing is also used to handle unknown interactions, which improves the accuracy of prediction. In future work, we will continue this line of research, try to address the shortcomings of the present study, and look for new prediction methods. We will also try to apply our method to more datasets, such as miRNA-disease association datasets, to evaluate its performance more fully. We hope that the DSCMF method will be helpful for predicting lncRNA-disease associations, and we will remain committed to this research.

Methods

LncRNA expression similarity

ArrayExpress contains more than 60,000 expression profiles of 16 human tissues, generated by RNA-Seq technology. The lncRNA expression profiles used in this paper are obtained from ArrayExpress. The correlation between each pair of lncRNA expression profiles is measured by the Spearman correlation coefficient, which is taken as the expression similarity of the two lncRNAs. These coefficients form the lncRNA expression similarity matrix, whose entry for a pair of lncRNAs is the expression similarity between them.
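
A minimal sketch of the Spearman coefficient for two expression profiles is given below: each profile is converted to ranks (ties share the average rank) and the Pearson correlation of the ranks is returned. The toy profiles are invented. Note that the coefficient can be negative; how negative values are treated in the similarity matrix is not stated in the text, so the sketch leaves them unchanged.

```python
def ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            result[order[pos]] = average
        start = end + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed profiles."""
    return pearson(ranks(x), ranks(y))

profile_a = [1.0, 2.0, 3.0, 4.0]   # toy expression values across four tissues
profile_b = [2.0, 4.0, 6.0, 9.0]   # same ordering as profile_a
profile_c = [4.0, 3.0, 2.0, 1.0]   # reversed ordering
```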

Disease semantic similarity

Disease semantic similarity was first used in ncRNA-disease association studies, where the results demonstrated its validity [56]. In this paper, a directed acyclic graph (DAG) is used to describe the semantic relationships between diseases. The DAG of a disease consists of a set of nodes (the disease and its ancestor terms) and a set of edges between those nodes. The contribution of each node to the semantics of the disease is calculated with a formula that includes a semantic contribution factor. The pairwise similarities are collected in a disease semantic similarity matrix. The semantic similarity between two diseases is determined by the parts that their DAGs have in common: the larger the common part, the greater the semantic similarity.
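
The sketch below shows one widely used way to turn this description into numbers. It is an assumption that the paper follows exactly this scheme: the disease contributes 1 to its own semantics, each ancestor contributes the factor `delta` times the largest contribution among its children, and the similarity is the shared contribution divided by the total contribution of both DAGs. The value `delta=0.5` and the three-term toy ontology are illustrative only.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of the disease and each of its ancestors to the disease's semantics.

    `parents` maps a term to its direct parent terms in the DAG.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        term = frontier.pop()
        for parent in parents.get(term, []):
            value = delta * contrib[term]
            if value > contrib.get(parent, 0.0):   # keep the largest contribution per ancestor
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared semantic contribution of two DAGs relative to their total contribution."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    return sum(c1[t] + c2[t] for t in shared) / (sum(c1.values()) + sum(c2.values()))

# Hypothetical miniature ontology, not taken from Disease Ontology.
parents = {
    "breast cancer": ["cancer"],
    "ovarian cancer": ["cancer"],
    "cancer": ["disease"],
}
sim = semantic_similarity("breast cancer", "ovarian cancer", parents)
```

In this toy ontology each disease has total contribution 1 + 0.5 + 0.25 = 1.75, and the shared terms contribute 1.5, giving a similarity of 1.5 / 3.5.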

Weight K nearest known neighbors

To reduce the effect of missing associations and make our predictions more accurate, WKNKN pre-processing is added to the DSCMF method. In the lncRNA-disease association matrix, an element is 1 if the lncRNA is associated with the disease and 0 otherwise. The pre-processing step replaces these 0/1 entries with values between 0 and 1, forming a new matrix that increases the accuracy of the prediction.
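
A rough sketch of a WKNKN-style step is shown below, assuming NumPy is available. For every row (and, symmetrically, every column) it averages the association profiles of the K most similar other rows, with weights that decay geometrically with neighbor rank, and then keeps the larger of the original and the estimated value. The exact neighbor selection and normalisation used by the authors are not given in the text, so the details here, including the names `wknkn`, `S_row` and `S_col`, are assumptions.

```python
import numpy as np

def wknkn(Y, S_row, S_col, K=5, decay=0.7):
    """Estimate likely values for the zeros of Y from the K most similar rows and columns."""
    def smooth(M, S):
        S = np.asarray(S, dtype=float)
        out = np.zeros_like(M, dtype=float)
        for q in range(M.shape[0]):
            order = np.argsort(-S[q])              # most similar first
            nearest = order[order != q][:K]        # never use an item as its own neighbor
            weights = decay ** np.arange(len(nearest)) * S[q, nearest]
            norm = S[q, nearest].sum()
            if norm > 0:
                out[q] = weights @ M[nearest] / norm
        return out

    Y = np.asarray(Y, dtype=float)
    from_rows = smooth(Y, S_row)                   # estimate from similar lncRNAs
    from_cols = smooth(Y.T, S_col).T               # estimate from similar diseases
    return np.maximum(Y, (from_rows + from_cols) / 2.0)

# Toy example: 3 lncRNAs, 2 diseases; the second lncRNA has no known association.
Y = np.array([[1, 0], [0, 0], [0, 1]])
S_row = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]])
S_col = np.array([[1.0, 0.5], [0.5, 1.0]])
filled = wknkn(Y, S_row, S_col, K=2, decay=0.7)
```

Known associations keep the value 1, while the all-zero second row receives small positive scores inherited from its most similar neighbor.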

Gaussian interaction profile kernel similarity

LncRNAs that show a similar pattern of association and non-association with the diseases in the lncRNA-disease network are likely to show similar association behavior with a new disease, and the same holds for diseases. The Gaussian interaction profile (GIP) kernel similarity used in this method is based on this assumption [57]. The GIP kernel similarity represents the network topological structure of LDAs. The GIP kernel similarity between two lncRNAs, and likewise between two diseases, is a Gaussian function of the distance between their interaction profiles, with a parameter that adjusts the kernel bandwidth. The interaction profile of an lncRNA is a binary vector, the corresponding row of the association matrix, which records whether the lncRNA is associated with each disease. Next, the lncRNA expression similarity matrix and the lncRNA network similarity matrix are combined by using formula (8). Similarly, the disease semantic similarity matrix and the disease network similarity matrix are combined by using formula (9). Both formulas contain an adjustable weighting parameter. The results are the final lncRNA similarity matrix, which combines lncRNA expression similarity with network similarity, and the final disease similarity matrix, which combines disease semantic similarity with network similarity.
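
The following sketch, assuming NumPy is available, computes a GIP kernel in its usual form, exp(-gamma * squared distance between profiles), with the bandwidth normalised by the mean squared norm of the profiles. It then mixes the kernel with a second similarity matrix through a convex weight. The names `gamma_prime` and `alpha`, their default values, and the convex form of the combination are assumptions, since formulas (8) and (9) are not reproduced in the text.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of a binary profile matrix."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    gamma = gamma_prime / sq_norms.mean()          # bandwidth scaled by the mean profile norm
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def combine(similarity, kernel, alpha=0.5):
    """Weighted mix of a biological similarity matrix and a GIP kernel matrix."""
    return alpha * similarity + (1.0 - alpha) * kernel

Y = np.array([[1, 0, 1], [1, 0, 1], [0, 1, 0]])   # toy lncRNA-by-disease matrix
G_lnc = gip_kernel(Y)        # kernel between lncRNAs (rows of Y)
G_dis = gip_kernel(Y.T)      # kernel between diseases (columns of Y)
K_lnc = combine(np.eye(3), G_lnc, alpha=0.5)      # identity stands in for expression similarity
```

Two lncRNAs with identical profiles get a kernel value of 1, and the value decreases as their profiles differ in more diseases.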

DSCMF

Collaborative filtering is introduced in the traditional CMF method [58], which can accurately predict novel LDAs. The objective function of traditional CMF is built from Frobenius norms and three positive regularization parameters. In DSCMF, the lncRNA expression similarity matrix used by traditional CMF is replaced by the final lncRNA similarity matrix, and the disease semantic similarity matrix is replaced by the final disease similarity matrix, thereby adding network similarity for lncRNAs and diseases. To increase sparsity, the L2,1-norm is then applied to each of the two latent feature matrices produced by decomposing the association matrix. Handling the L2,1-norm terms involves two diagonal matrices, one for each latent feature matrix. The final objective function has six terms. The first term builds the approximation model, whose purpose is to find the two latent feature matrices. The second is the Tikhonov regularization term, which prevents overfitting. The third and fourth terms apply the L2,1-norm to the first and second latent feature matrix, respectively. The last two terms are the collaborative regularization terms for the lncRNA similarity matrix and the disease similarity matrix. A detailed flow chart of the DSCMF method is shown in Fig. 8.
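
To make the six terms concrete, the sketch below evaluates an objective with that structure, assuming NumPy is available. The symbol names (`A`, `B` for the latent feature matrices, `K_l`, `K_d` for the combined similarity matrices, `lam_l`, `lam_d`, `lam_t` for the three parameters) and the choice to weight the L2,1 terms with `lam_l` are assumptions made for illustration; the exact coefficients of the published objective are not recoverable from the text.

```python
import numpy as np

def l21_norm(M):
    """L2,1-norm: the sum of the Euclidean norms of the rows of M."""
    return np.sqrt((M ** 2).sum(axis=1)).sum()

def dscmf_objective(Y, A, B, K_l, K_d, lam_l, lam_d, lam_t):
    """Value of a six-term DSCMF-style objective for given latent matrices A and B."""
    fit = np.linalg.norm(Y - A @ B.T, "fro") ** 2                    # 1: approximation of Y
    tikhonov = lam_l * (np.linalg.norm(A, "fro") ** 2
                        + np.linalg.norm(B, "fro") ** 2)             # 2: Tikhonov regularization
    sparse_a = lam_l * l21_norm(A)                                   # 3: L2,1-norm on A
    sparse_b = lam_l * l21_norm(B)                                   # 4: L2,1-norm on B
    collab_l = lam_d * np.linalg.norm(K_l - A @ A.T, "fro") ** 2     # 5: lncRNA similarity term
    collab_d = lam_t * np.linalg.norm(K_d - B @ B.T, "fro") ** 2     # 6: disease similarity term
    return fit + tikhonov + sparse_a + sparse_b + collab_l + collab_d

# Toy check where the fit and similarity terms vanish by construction.
A = np.array([[1.0], [0.0]])
B = np.array([[1.0], [1.0]])
value = dscmf_objective(A @ B.T, A, B, A @ A.T, B @ B.T, lam_l=1.0, lam_d=1.0, lam_t=1.0)
```

Because the L2,1-norm sums row norms rather than squared entries, it tends to drive entire rows of a latent matrix toward zero, which is the sparsity effect the text refers to.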

Fig. 8

Method flow chart. The DSCMF method consists of two parts. First, the association matrix is decomposed into two latent feature matrices, and the L2,1-norm is applied to each of them. Second, the GIP kernel is incorporated into the CMF method

Optimization and algorithm of DSCMF method

In this paper, we use the least squares method to update the two latent feature matrices and thereby optimize the objective function. In the first step, the two matrices are initialized by singular value decomposition (SVD) of the association matrix, using the diagonal matrix that contains the largest singular values. Next, the partial derivatives of the objective function with respect to each latent feature matrix are computed and set to zero, which gives the iteration formulas. Updating stops once both matrices converge. The three regularization parameters are set to the best combination, selected automatically from a set of candidate values. In summary, the DSCMF algorithm pre-processes the association matrix with WKNKN, builds the combined similarity matrices, initializes the latent feature matrices by SVD, alternates the two updates until convergence, and outputs the predicted score matrix. The DSCMF method is convergent. The maximum number of iterations is set to 100 in the experiments in order to find a local optimal solution of the objective function. The convergence curve is shown in Fig. 9. The algorithm tends to converge after about 10 iterations, which shows that it converges quickly.
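
The optimization loop can be sketched as follows, assuming NumPy is available. The SVD initialization follows the description above. The update is one possible reading of "set the partial derivatives to zero": the L2,1 term is handled with a diagonal reweighting (one weight per row), so each row of a latent matrix is obtained from a small linear system. This is an illustrative scheme under the same assumed symbols and parameter sharing as before, not the authors' exact iteration formulas, and the default parameter values are arbitrary.

```python
import numpy as np

def svd_init(Y, k):
    """A = U_k S_k^(1/2), B = V_k S_k^(1/2), using the k largest singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    root = np.sqrt(s[:k])
    return U[:, :k] * root, Vt[:k].T * root

def update_factor(Y, F, G, K, lam_l, lam_sim, eps=1e-8):
    """One reweighted least-squares update of F with G fixed (Y ~ F G^T, K ~ F F^T)."""
    k = F.shape[1]
    M = G.T @ G + lam_l * np.eye(k) + lam_sim * F.T @ F
    rhs = Y @ G + lam_sim * K @ F
    d = 1.0 / (2.0 * np.sqrt((F ** 2).sum(axis=1)) + eps)   # diagonal weights of the L2,1 term
    new_F = np.empty_like(F)
    for i in range(F.shape[0]):
        new_F[i] = np.linalg.solve(M + lam_l * d[i] * np.eye(k), rhs[i])
    return new_F

def dscmf(Y, K_l, K_d, k=2, lam_l=0.5, lam_d=0.1, lam_t=0.1, max_iter=100, tol=1e-6):
    """Alternate the two updates until the factors stop changing or max_iter is reached."""
    A, B = svd_init(Y, k)
    for _ in range(max_iter):
        A_new = update_factor(Y, A, B, K_l, lam_l, lam_d)
        B_new = update_factor(Y.T, B, A_new, K_d, lam_l, lam_t)
        change = np.linalg.norm(A_new - A) + np.linalg.norm(B_new - B)
        A, B = A_new, B_new
        if change < tol:
            break
    return A @ B.T   # predicted score matrix

Y = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
scores = dscmf(Y, np.eye(4), np.eye(3), k=2)   # identity matrices stand in for the similarities
A0, B0 = svd_init(Y, 3)
```

With `k` equal to the rank of the matrix, the SVD initialization alone reproduces the input exactly; smaller `k` gives the low-rank starting point that the iterations then refine.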

Fig. 9

Convergence curve of the DSCMF method. When the number of iterations is about ten, our method tends to converge

1. The long noncoding RNA Wisper controls cardiac fibrosis and remodeling.

Authors: Rudi Micheletti; Isabelle Plaisance; Brian J Abraham; Alexandre Sarre; Ching-Chia Ting; Michael Alexanian; Daniel Maric; Damien Maison; Mohamed Nemir; Richard A Young; Blanche Schroen; Arantxa González; Samir Ounzain; Thierry Pedrazzini
Journal: Sci Transl Med Date: 2017-06-21 Impact factor: 17.956

2. Novel human lncRNA-disease association inference based on lncRNA expression profiles.

Authors: Xing Chen; Gui-Ying Yan
Journal: Bioinformatics Date: 2013-09-02 Impact factor: 6.937

3. Increased MALAT1 expression predicts poor prognosis in esophageal cancer patients.

Authors: Congxiu Huang; Zhilong Yu; Hao Yang; Yu Lin
Journal: Biomed Pharmacother Date: 2016-06-15 Impact factor: 6.529

4. Decreased expression of long non-coding RNA GAS5 promotes cell proliferation, migration and invasion, and indicates a poor prognosis in ovarian cancer.

Authors: Jie Li; He Huang; Yingguang Li; Li Li; Wenhui Hou; Zeshan You
Journal: Oncol Rep Date: 2016-10-24 Impact factor: 3.906

5. Role of the Long Non-Coding RNA Growth Arrest-Specific 5 in Glucocorticoid Response in Children with Inflammatory Bowel Disease.

Authors: Marianna Lucafò; Alessia Di Silvestre; Maurizio Romano; Alice Avian; Roberta Antonelli; Stefano Martelossi; Samuele Naviglio; Alberto Tommasini; Gabriele Stocco; Alessandro Ventura; Giuliana Decorti; Sara De Iudicibus
Journal: Basic Clin Pharmacol Toxicol Date: 2017-08-09 Impact factor: 4.080

Review 6. Long noncoding RNA in prostate, bladder, and kidney cancer.

Authors: Elena S Martens-Uzunova; René Böttcher; Carlo M Croce; Guido Jenster; Tapio Visakorpi; George A Calin
Journal: Eur Urol Date: 2013-12-14 Impact factor: 20.096

7. ArrayExpress--a public database of microarray experiments and gene expression profiles.

Authors: H Parkinson; M Kapushesky; M Shojatalab; N Abeygunawardena; R Coulson; A Farne; E Holloway; N Kolesnykov; P Lilja; M Lukk; R Mani; T Rayner; A Sharma; E William; U Sarkans; A Brazma
Journal: Nucleic Acids Res Date: 2006-11-28 Impact factor: 16.971

8. LncRNA MALAT1 functions as a competing endogenous RNA to regulate ZEB2 expression by sponging miR-200s in clear cell kidney carcinoma.

Authors: Haibing Xiao; Kun Tang; Peijun Liu; Ke Chen; Junhui Hu; Jin Zeng; Wei Xiao; Gan Yu; Weiming Yao; Hui Zhou; Heng Li; Yingtian Pan; Anping Li; Zhangqun Ye; Ji Wang; Hua Xu; Qihong Huang
Journal: Oncotarget Date: 2015-11-10

9. LncRNA XIST acts as a tumor suppressor in prostate cancer through sponging miR-23a to modulate RKIP expression.

Authors: Yang Du; Xiao-Dong Weng; Lei Wang; Xiu-Heng Liu; Heng-Cheng Zhu; Jia Guo; Jin-Zhuo Ning; Cheng-Cheng Xiao
Journal: Oncotarget Date: 2017-10-10

10. Downregulation of long non-coding RNA ANRIL suppresses lymphangiogenesis and lymphatic metastasis in colorectal cancer.

Authors: Zhenqiang Sun; Chunlin Ou; Weiguo Ren; Xiang Xie; Xiayu Li; Guiyuan Li
Journal: Oncotarget Date: 2016-07-26
