ICLRBBN: a tool for accurate prediction of potential lncRNA disease associations.

Yuqi Wang^1,2, Hao Li^1,2, Linai Kuang^2, Yihong Tan^1, Xueyong Li^1, Zhen Zhang^1, Lei Wang^1,2.

Abstract

Growing evidence has elucidated that long non-coding RNAs (lncRNAs) are involved in a variety of complex diseases in human bodies. In recent years, it has become a hot topic to develop effective computational models to identify potential lncRNA-disease associations. In this article, a novel method called ICLRBBN (Internal Confidence-Based Local Radial Basis Biological Network) is proposed to detect potential lncRNA-disease associations by adopting an internal confidence-based radial basis biological network. In ICLRBBN, a novel internal confidence-based collaborative filtering recommendation algorithm was designed first to mine hidden features between lncRNAs and diseases, which guarantees that ICLRBBN can be more effectively applied to predict new diseases. Then, a unique three-layer local radial basis function network consisting of diseases and lncRNAs was constructed, based on which the association probability between diseases and lncRNAs was calculated by combining different characteristics of lncRNAs with local information of diseases. Finally, we compared ICLRBBN with 6 state-of-the-art methods based on two different validation frameworks. Simulation results showed that area under the receiver operating characteristic curve (AUC) values achieved by ICLRBBN outperformed all competing methods. Furthermore, case studies illustrated that ICLRBBN has a promising future as a powerful tool in the practical application of lncRNA-disease association prediction. A web service for prediction of potential lncRNA-disease associations is available at http://leelab2997.cn/.

Keywords: association prediction; biological network; computational biology; lncRNA; radial basis function network

Year: 2020 PMID: 33510939 PMCID: PMC7806946 DOI: 10.1016/j.omtn.2020.12.002

Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886

Introduction

Historically, the hypothesis that genetic information is stored in protein-coding genes has been commonly accepted as the central principle of molecular biology. However, with the continuous development of sequencing technology and the completion of various sequencing projects, researchers have found that more than 98% of the human genome does not encode protein sequences but produces a large number of non-coding RNAs (ncRNAs). Among them, long non-coding RNAs (lncRNAs) are an important class of non-coding transcripts longer than 200 nt. Specifically, they are critical regulators involved in various important biological processes such as cell proliferation, cell apoptosis, transcription, translation, cell cycle control, and epigenetic regulation. Therefore, it is not surprising that a growing number of studies have corroborated the existence of a significant association between the mutations and abnormalities of lncRNAs and various complex human diseases [8, 9, 10]. Disease-related lncRNAs can not only provide valuable insights into the pathogenesis of complex diseases at the lncRNA level but also contribute to the diagnosis, treatment, and prognosis of diseases. However, considering that traditional biological experimental methods are time consuming and expensive, it is necessary to develop computational models to infer potential lncRNA-disease associations, since such models can reduce costs and save time in biological experiments. In the past few years, some advanced computational models have been proposed to predict potential lncRNA-disease associations successfully. According to the different strategies adopted, these models can be roughly divided into three major categories.
The first category is composed of the network-based approaches, in which different kinds of lncRNA-disease heterogeneous networks are constructed to discover potential associations between lncRNAs and diseases based on known lncRNA-disease associations. For instance, Li et al. presented a prediction model called LRWHLDA by implementing the local random walk method on a newly constructed lncRNA-disease heterogeneous network. However, because known lncRNA-disease associations are limited, these network-based methods cannot be used to predict correlations between new lncRNAs (lncRNAs without known associated diseases) and new diseases (diseases without known associated lncRNAs). Hence, in recent years, some computational models that do not rely on known lncRNA-disease associations have been proposed. For example, Wang et al. developed a tool called lncDisease to infer potential correlations between lncRNAs and diseases based on disease enrichment analysis of microRNAs (miRNAs) interacting with specific lncRNAs. These models constitute the second category. Although these models overcome the constraint of limited known lncRNA-disease association samples, they cannot be applied to infer potential associations between diseases and lncRNAs without any known related genes or miRNAs. Hence, a third category of computational models, based on machine learning schemes, has been proposed in recent years. For example, Fu et al. developed a generic data fusion model based on matrix decomposition called MFLDA, which can explore and utilize the intrinsic structure of heterogeneous data sources for correlation prediction between various types of entities. However, the performance of these machine learning-based models depends on the selection of optimal parameters, a problem that has not been solved efficiently up to now.
Inspired by the above models, in this article, a novel method called ICLRBBN (Internal Confidence-Based Local Radial Basis Biological Network) was proposed to uncover potential lncRNA-disease associations. In ICLRBBN, considering the limited known lncRNA-disease associations and the applicability of new diseases, a new measure for estimating the similarities between diseases was designed first based on the concept of internal confidence. Then, an internal confidence collaborative filtering recommendation algorithm was developed to extract features of diseases. Next, a novel three-layer complex radial basis biological network was further constructed, based on which the probability matrix of associations between lncRNAs and diseases was calculated by integrating different characteristics of lncRNAs with local information of diseases. Finally, in order to evaluate the prediction performance of ICLRBBN, two different kinds of frameworks, including the leave-one-out cross-validation (LOOCV) and the 5-fold cross-validation (5-fold CV), were implemented separately. Experimental results indicated that ICLRBBN achieved reliable area under the receiver operating characteristic (ROC) curve (AUC) values of 0.9510 and 0.9043 ± 0.0019, respectively. Furthermore, case studies of two specific diseases, breast cancer and osteosarcoma, demonstrated that ICLRBBN was an effective tool for predicting potential lncRNA-disease associations as well.

Results

Performance evaluation

Two common validation frameworks, LOOCV and 5-fold CV, were adopted to evaluate model performance of ICLRBBN. Both LOOCV and 5-fold CV were performed on ICLRBBN based on 1,695 known lncRNA-disease associations obtained from the lncRNADisease database. First, in LOOCV, each known lncRNA-disease association was taken as a test sample in turn, while the remaining 1,694 known associations were taken as training samples. In addition, all lncRNA-disease pairs with no known association in the dataset were considered as candidate samples. Subsequently, we ranked the test sample together with all the candidate samples based on the scores predicted by ICLRBBN. If the test sample ranked higher than a given threshold, the prediction of the test sample was considered successful; otherwise, the prediction was considered failed. Then, through the setting of different thresholds, we calculated the corresponding true-positive rates (TPRs or sensitivity) and false-positive rates (FPRs or 1-specificity). Here, TPR represented the ratio of test samples ranked above a given threshold, and FPR represented the ratio of candidate samples ranked above a given threshold. Finally, the ROC curve was drawn according to the TPRs and FPRs corresponding to these different thresholds. The AUC was used as an evaluation indicator to evaluate the prediction performance of the model, where an AUC value of 1 indicated an ideal perfect prediction, while an AUC value of 0.5 indicated a completely random prediction. The closer the AUC value was to 1, the better the prediction performance of the model. The simulation experiment result is illustrated in Figure 1. ICLRBBN obtained a reliable AUC of 0.9510 under the LOOCV framework, which showed that our model had outstanding prediction performance.
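The rank-threshold procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `loocv_roc` and `auc_from_roc` are hypothetical, and the sketch assumes that the rank of each held-out association among itself and the candidate samples has already been computed from the model scores, with the same number of candidates in every round.

```python
import numpy as np

def loocv_roc(test_ranks, n_candidates):
    """ROC points from LOOCV ranks (illustrative sketch).

    test_ranks[i] is the 1-based rank of the i-th held-out association
    among itself plus n_candidates unlabeled pairs (1 = highest score).
    """
    ranks = np.asarray(test_ranks)
    thresholds = np.arange(0, n_candidates + 2)  # top-k cutoffs, k = 0..n+1
    # hit[k, i] is True when held-out sample i falls inside the top-k list
    hit = ranks[None, :] <= thresholds[:, None]
    tpr = hit.mean(axis=1)
    # candidates inside the top-k list: k, minus the test sample if present
    fpr = ((thresholds[:, None] - hit) / n_candidates).mean(axis=1)
    return fpr, tpr

def auc_from_roc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

With this convention, a model that always ranks the held-out association first yields an AUC of 1, and one that ranks it first half of the time and last otherwise yields 0.5.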

Figure 1

ROC curves and AUCs achieved by ICLRBBN under the framework of LOOCV and the framework of 5-fold CV

Additionally, unlike in LOOCV, in 5-fold CV all 1,695 known lncRNA-disease associations were randomly and evenly divided into 5 groups, with each group taking a turn as the test set and the remaining 4 groups serving as the training set. Considering that the randomness of the grouping might bias the experimental results, we performed 5-fold CV 100 times and averaged the AUC values. As illustrated in Figure 1, ICLRBBN obtained a reliable AUC of 0.9043 ± 0.0019 under the 5-fold CV framework.
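The repeated 5-fold protocol can be sketched as follows. The helper names `k_fold_indices` and `repeated_cv` and the `evaluate` callback (which would train on one index set and return the AUC on the other) are hypothetical placeholders. The sketch also assumes that the reported spread is the standard deviation of the per-repetition mean AUCs, which the text does not state explicitly.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=5, seed=None):
    """Randomly split sample indices into n_folds groups of near-equal size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)

def repeated_cv(n_samples, evaluate, n_folds=5, n_repeats=100, seed=0):
    """Run k-fold CV n_repeats times; return mean and std of per-repeat AUCs.

    evaluate(train_idx, test_idx) is a user-supplied (hypothetical) callback
    that fits the model on train_idx and returns the AUC on test_idx.
    """
    repeat_means = []
    for r in range(n_repeats):
        folds = k_fold_indices(n_samples, n_folds, seed + r)
        fold_aucs = []
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate(
                [f for j, f in enumerate(folds) if j != i])
            fold_aucs.append(evaluate(train_idx, test_idx))
        repeat_means.append(np.mean(fold_aucs))
    return float(np.mean(repeat_means)), float(np.std(repeat_means))
```

For the 1,695 associations in DS1, each of the 5 groups contains 339 samples.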

Effects of parameters

In ICLRBBN, there are two main parameters: the parameter K and the overlap coefficient factor l. The parameter K determines the size of the chum set (the circle of K-closest friends) of each disease, while the overlap coefficient factor l controls the coverage of the basis function scope in the hidden layer. We considered K ∈ {5, 10, 15, 20} and l ∈ {0.1, 0.2, ..., 1.0} and ran ICLRBBN in LOOCV and 5-fold CV on the dataset DS1 to evaluate their effects. Since the division into training and test sets in 5-fold CV was random, 5-fold CV was performed 100 times under each combination of K and l to obtain the average AUC values. As illustrated in Figure 2, for all values of K, the AUC values in LOOCV increased slightly when l varied from 0.1 to 0.5 and decreased significantly when l varied from 0.6 to 1. Additionally, as shown in Table 1, ICLRBBN performed best in 5-fold CV when K = 15 and l = 0.4. Hence, K and l were set to 15 and 0.4 in ICLRBBN, respectively.

Figure 2

Effects of parameters K and l under the framework of LOOCV

Table 1

Effects of parameters K and l under the framework of 5-fold CV

K \ l (AUC)	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	1.0
5	0.8249	0.8561	0.8922	0.9025	0.8911	0.8177	0.7346	0.6656	0.6253	0.6046
10	0.8326	0.8562	0.8908	0.9036	0.8919	0.8196	0.7456	0.6784	0.6378	0.6175
15	0.8224	0.8566	0.8909	0.9043	0.8922	0.8208	0.7499	0.6837	0.6431	0.6232
20	0.8227	0.8565	0.8899	0.9040	0.8924	0.8206	0.7523	0.6872	0.6440	0.6246

Comparison with other state-of-the-art methods

In order to better demonstrate the performance of ICLRBBN, we compared it with 6 state-of-the-art methods, namely NBLDA, IIRWR, SIMCLDA, PMFILDA, KATZLDA, and LRLSLDA, on the two datasets DS1 and DS2 described in Materials and methods. The comparative results under the LOOCV framework and the 5-fold CV framework are illustrated in Figures 3 and 4 and Table 2. As shown in Figure 3, ICLRBBN achieved the highest AUC of 0.9510 in LOOCV on dataset DS1, considerably outperforming the other methods (NBLDA: 0.8845; IIRWR: 0.8745; PMFILDA: 0.8346; KATZLDA: 0.8257; and LRLSLDA: 0.7472). As shown in Table 2, ICLRBBN achieved the highest mean AUC of 0.9043 on dataset DS1 in 5-fold CV, well above the other five methods (PMFILDA: 0.8337; IIRWR: 0.8082; KATZLDA: 0.7994; LRLSLDA: 0.7154; and NBLDA: 0.5547). Among the additional metrics in Table 2, ICLRBBN achieved the highest F1 score (0.0016) and ranked second to NBLDA in the area under the precision-recall curve (AUPR; 0.1355 versus 0.1807) and in precision (PRE; 0.1268 versus 0.1816), while outperforming the remaining four methods on all three metrics. As shown in Figure 4, on dataset DS2, ICLRBBN also performed much better than the other methods, with an AUC of 0.9050 (NBLDA: 0.8271; SIMCLDA: 0.8257; IIRWR: 0.8026; KATZLDA: 0.7640; and PMFILDA: 0.5636).

Figure 3

ROC curves and AUCs achieved by ICLRBBN, NBLDA, IIRWR, PMFILDA, KATZLDA, and LRLSLDA under the framework of LOOCV based on DS1

Figure 4

ROC curves and AUCs achieved by ICLRBBN, NBLDA, SIMCLDA, IIRWR, KATZLDA, and PMFILDA under the framework of LOOCV based on DS2

Table 2

Performance of ICLRBBN, PMFILDA, IIRWR, KATZLDA, LRLSLDA and NBLDA under the framework of 5-fold CV

Metrics and methods	ICLRBBN	PMFILDA	IIRWR	KATZLDA	LRLSLDA	NBLDA
AUC	0.9043	0.8337	0.8082	0.7994	0.7154	0.5547
AUPR	0.1355	0.0641	0.0473	0.0868	0.0822	0.1807
F1	0.0016	0.0009	0.0007	0.0013	0.0010	0.0013
PRE	0.1268	0.0660	0.0483	0.0764	0.0742	0.1816

Moreover, in order to evaluate the performance of ICLRBBN in predicting lncRNAs related to new diseases, we further compared it with competing methods under the framework of leave-one-out verification. During the experiments, for any disease d, when calculating the association scores between d and each lncRNA, we excluded all known associations of d and relied only on the remaining known associations for prediction. As illustrated in Figure 5, ICLRBBN still achieved a reliable AUC of 0.9104 for new disease-related lncRNA prediction, which considerably outperformed the AUCs achieved by KATZLDA and NBLDA. That is to say, ICLRBBN performed much better in predicting new disease-related lncRNAs. Overall, ICLRBBN performed better than all the compared methods in terms of AUC.

Figure 5

The performance of ICLRBBN, KATZLDA and NBLDA on prediction of new disease-related lncRNAs

Case study

We selected two important cancers, breast cancer and osteosarcoma, for case studies on the dataset DS1 to further evaluate the prediction performance of ICLRBBN. During simulation, all known associations in the dataset were treated as training sets. In addition, for any given disease, all lncRNAs that have no known association with the disease were considered candidate-related lncRNAs for the disease. Then, according to the correlation probability score calculated by ICLRBBN, all candidate lncRNAs for the given disease were ranked. As a result, we list the top 15 candidate lncRNAs and some relevant evidence found in the PubMed literature. Breast cancer is a malignant tumor developed from the epithelial tissue cells of the breast. Its signs include breast lumps, changes in breast shape, nipple depression, nipple discharge, etc. Breast cancer is the most common malignant tumor in women, which seriously threatens the health of women worldwide. However, breast cancer is a very heterogeneous disease; thus, its pathogenesis is still unclear, and the treatment is still incomplete. In recent years, more and more studies have demonstrated that lncRNAs are involved in the biological process of breast cancer’s generation and development. Therefore, we implemented ICLRBBN to predict lncRNAs associated with breast cancer. As a result, as illustrated in Table 3 below, 8 of the top 10 candidate lncRNAs related to breast cancer predicted by ICLRBBN have been experimentally confirmed recently, while 13 of the top 15 candidate lncRNAs have been confirmed. For instance, Shi et al. utilized western blot to discover that HULC is highly expressed in triple-negative breast cancer (TNBC) tissues and is closely related to the poor prognosis of TNBC patients. 
Through a series of experiments in vivo and in vitro, it was found that the expression of HOTTIP was significantly upregulated in breast cancer cell lines and that HOTTIP could participate in breast cancer cell proliferation, migration, and apoptosis, to some extent by regulating HOXA11. BANCR was also shown to be overexpressed in breast cancer cell lines, and BANCR knockdown reduced the proliferation, invasion, and migration ability of breast cancer cells.

Table 3

The top 15 potential breast cancer-related lncRNAs predicted by ICLRBBN and relevant evidence for these predicted associations

Rank	lncRNA	Evidence	Expression pattern
1	HOTTIP	29415429	upregulated
2	BANCR	29565494, 29805676	upregulated
3	HULC	27986124	upregulated
4	AFAP1-AS1	29439313, 29974352	upregulated
5	MIAT	29100300, 29345338, 29792859	upregulated
6	DRAIC	25288503	regulation
7	HNF1A-AS1	unconfirmed	unconfirmed
8	PCAT1	28989584	upregulated
9	PCAT29	unconfirmed	unconfirmed
10	TUSC7	23558749	differential expression
11	CASC2	29523222	downregulated
12	CRNDE	28469804	upregulated
13	PTENP1	29085464, 29212574	downregulated
14	TINCR	29614984	upregulated
15	HIF1A-AS1	26339353	upregulated

Osteosarcoma is the most common type of bone malignancy, usually occurring during adolescence. Although chemotherapy can greatly improve the survival rate of patients with osteosarcoma, about 40% of patients will experience tumor recurrence and tumor metastasis. Hence, it is particularly important to thoroughly study the pathogenesis of osteosarcoma so as to explore more effective treatment strategies. As illustrated in Table 4, 9 of the top 10 candidate lncRNAs related to osteosarcoma predicted by ICLRBBN have been confirmed by recent experiments, while 13 of the top 15 candidate lncRNAs related to osteosarcoma have been confirmed. For example, it was found that the expression level of GAS5 was significantly reduced in osteosarcoma tissue cells and that GAS5 could act as an inhibitor of osteosarcoma, inhibiting its growth and migration by sponging miR-203a or regulating miR-22. Ruan et al. adopted Kaplan-Meier survival analysis and log rank testing to demonstrate that the expression of CCAT2 in osteosarcoma tissue was significantly increased compared to normal bone tissues and was related to the TNM stage of tumors in osteosarcoma patients. Some studies have indicated that NEAT1 is an oncogene and that its overexpression downregulates the osteosarcoma inhibitor miR-34c and enhances resistance to cisplatin (DDP)-based chemotherapy.

Table 4

The top 15 potential osteosarcoma-related lncRNAs predicted by ICLRBBN and relevant evidence for these predicted associations

Rank	lncRNA	Evidence	Expression pattern
1	GAS5	29414815, 28519068	downregulated
2	PVT1	28602700	upregulated
3	NEAT1	28295289, 29654165	upregulated
4	SPRY4-IT1	28078006	upregulated
5	CCAT1	28549102	upregulated
6	CCAT2	29863240	upregulated
7	XIST	29384226, 28409547, 28682435, 29254174	upregulated
8	PANDAR	28011477	upregulated
9	AFAP1-AS1	31002124	upregulated
10	LINC-ROR	unconfirmed	unconfirmed
11	BCYRN1	unconfirmed	unconfirmed
12	SOX2-OT	28960757	upregulated
13	MIAT	32196573	downregulated
14	PCAT1	29430187	upregulated
15	ATB	28469952	upregulated

Discussion

In recent years, it has become increasingly clear that lncRNAs are involved in various biological processes in the human body and have an inseparable connection with many major diseases. The identification of potential lncRNA-disease association pairs has become a hot research topic in bioinformatics, which can deepen people’s understanding of the pathogenesis of various diseases at the molecular level and promote the research progress of treatment and prognosis strategies for complex diseases. In this article, we developed a novel prediction model called ICLRBBN to infer potential lncRNA-disease associations. First, we designed an internal confidence collaborative filtering recommendation algorithm by introducing two confidence factors, which solved the problem that the known lncRNA-disease association information is too sparse and reduced the dependence of our model on the known association information. Second, by combining the radial basis function network with the information of lncRNAs and diseases and assigning biological significance to each node in the radial basis function network, we constructed a unique local radial basis function network, based on which we could predict the association probabilities between lncRNAs and the diseases according to the characteristics of lncRNAs and the local information of diseases. In addition, various experiments have been done, and experimental results have demonstrated the reliability and superiority of the prediction performance of ICLRBBN as well. Meanwhile, a web server that implements the method of ICLRBBN is available at http://leelab2997.cn/. Of course, there are still certain limitations in ICLRBBN that need to be improved and optimized in future work. For instance, ICLRBBN still has a certain dependence on known lncRNA-disease associations. 
However, we believe that integrating a variety of biological indicators may solve this problem to a great extent and can further improve the prediction performance of the model to make it better applicable to the case of sparse known associations. Therefore, this problem will be the focus of discussion and study in the future.

Materials and methods

We first downloaded two different datasets of known lncRNA-disease associations from two versions of the lncRNADisease database (http://www.cuilab.cn/lncrnadisease). After removing non-human data and redundant association records, we obtained a dataset containing 1,695 distinct experimentally verified human lncRNA-disease associations between 314 diseases and 828 lncRNAs, and a dataset containing 621 distinct lncRNA-disease associations between 226 diseases and 285 lncRNAs. For convenience, we denote these two datasets as DS1 and DS2, respectively. Next, for any given dataset of known lncRNA-disease associations, we converted it into an original incidence matrix A, where A(i,j) = 1 if and only if there is a known association between the i-th disease and the j-th lncRNA; otherwise A(i,j) = 0. In addition, for simplicity, we defined N_d and N_l as the number of diseases and the number of lncRNAs in the given dataset, respectively. The flow chart of ICLRBBN is illustrated in Figure 6. In ICLRBBN, based on known lncRNA-disease associations, the original incidence matrix A was obtained first. To overcome the effect of limited known lncRNA-disease associations, a novel internal confidence-based collaborative filtering recommendation algorithm was designed to mine potential associations between lncRNAs and diseases without known related lncRNAs, and thus a new target output matrix A∗ and a new feature matrix T were constructed based on the original incidence matrix A. Finally, a novel three-layered local radial basis biological network was designed to infer potential lncRNA-disease associations.
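As a minimal sketch of the conversion from an association list to the incidence matrix A, the following Python function builds a 0/1 matrix with diseases as rows and lncRNAs as columns. The function name `build_incidence_matrix`, the alphabetical index ordering, and the placeholder identifiers in the usage note are assumptions made for illustration; they are not taken from the article.

```python
import numpy as np

def build_incidence_matrix(pairs):
    """Build A with A[i, j] = 1 iff disease i and lncRNA j are associated.

    pairs is an iterable of (disease, lncRNA) tuples; duplicate records
    collapse to a single entry. Rows and columns follow sorted name order
    (an arbitrary but reproducible choice made for this sketch).
    """
    pairs = list(pairs)
    diseases = sorted({d for d, _ in pairs})
    lncrnas = sorted({l for _, l in pairs})
    d_idx = {d: i for i, d in enumerate(diseases)}
    l_idx = {l: j for j, l in enumerate(lncrnas)}
    A = np.zeros((len(diseases), len(lncrnas)), dtype=int)
    for d, l in pairs:
        A[d_idx[d], l_idx[l]] = 1
    return A, diseases, lncrnas
```

For example, the placeholder records `[("d2", "l1"), ("d1", "l1"), ("d1", "l2")]` give a 2 × 2 matrix whose first row (d1) is `[1, 1]` and whose second row (d2) is `[1, 0]`.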

Figure 6

The flowchart of ICLRBBN

Internal confidence-based collaborative filtering recommendation algorithm

Considering that known lncRNA-disease associations are quite limited, the association matrix A was very sparse. Hence, to make ICLRBBN applicable to detect potential associations between lncRNAs and new diseases (i.e., diseases without any known related lncRNAs), an internal confidence-based collaborative filtering recommendation algorithm was proposed first to excavate potential indirect features between lncRNAs and diseases. As shown below, the recommendation algorithm consists of three major parts.

Part 1: calculation of similarities between diseases

Internal confidence-based similarity for diseases

The concept of internal confidence and two factors of internal confidence, reliability and heat, were introduced first to measure the similarities between diseases. For any two different diseases d_i and d_j, let L(d_i) and L(d_j) represent the sets of all lncRNAs with known associations with d_i and d_j, respectively, and let |L(d_i)| denote the number of elements in L(d_i). We define two different diseases as “close friends” if and only if these two diseases have known associations with at least one common lncRNA. Based on the concept of close friends, it is reasonable to assume that: (1) the larger the number of elements in L(d_i) ∩ L(d_j), the more similar the two diseases d_i and d_j will be; and (2) supposing that L(d_i) ∩ L(d_j) = L(d_i) ∩ L(d_k), the similarity between d_i and d_j will be higher than the similarity between d_i and d_k if the number of elements in L(d_j) is less than that in L(d_k). Based on these two assumptions, the similarity between d_i and d_j is calculated according to Equation 1. Next, for a given lncRNA l_m, let D(l_m) denote the set of all diseases that have known associations with l_m. The heat score of lncRNA l_m is then defined according to Equation 2. According to Equation 2, lncRNAs with higher heat scores have known associations with fewer diseases. Moreover, based on the concept of heat score for lncRNAs, Equation 1 can be modified to take the heat scores of the associated lncRNAs into account, which gives Equation 3. Taking the situation in Figure 7 as an example, if the similarity calculation is performed according to Equations 2 and 3, the similarities of the two disease pairs shown there will both be 1. However, the lncRNA shared by the first pair is associated with two different diseases, while the lncRNA shared by the second pair is associated with four different diseases. It is therefore reasonable to assume that the similarity of the first pair should be higher than that of the second pair.
Hence, we further introduced another confidence factor called RELIAB for different lncRNAs, which represents the average heat reliability of those lncRNAs with known associations with both d_i and d_j and is obtained according to Equation 4. In Equation 4, the average heat score is taken over those lncRNAs with known associations with both d_i and d_j.

Figure 7

The example of similarity calculation

Based on Equation 4, the final internal confidence-based similarity between any two given diseases d_i and d_j is given by Equation 5.

Semantic similarity for diseases

For each disease in a given dataset, we downloaded its corresponding MeSH descriptor from the MeSH database of the National Library of Medicine (https://www.nlm.nih.gov/). According to the strict classification information provided by the MeSH descriptor and the concept of directed acyclic graph (DAG) between different diseases, we can obtain the semantic similarity between diseases as follows. First, any disease d can be represented in the form of a graph DAG(d) = (d, T(d), E(d)), where T(d) represents the set of nodes composed of d itself and its ancestor nodes, and E(d) represents the corresponding set of directed edges from the parent nodes to the child nodes. Then, for any node t in DAG(d), its semantic contribution to the disease d is calculated according to Equation 6, where Δ is a semantic contribution factor whose value is between 0 and 1; previous experimental results have indicated that Δ = 0.5 is optimal. Furthermore, the semantic value of any disease d is calculated according to Equation 7. Finally, based on the assumption that two diseases that share more structure in the DAG tend to have higher semantic similarity, the semantic similarity of any two different diseases d_i and d_j is calculated according to Equation 8.
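The DAG-based calculation can be illustrated with a short sketch. It assumes the widely used formulation in which a disease contributes 1 to itself, each ancestor receives Δ times the largest contribution among its children in the DAG, the semantic value is the sum of all contributions, and the similarity is the summed contributions over shared nodes divided by the sum of the two semantic values. Whether this matches the article's equations term by term is an assumption; the `parents` mapping (child term to list of parent terms) and the function names are hypothetical.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of every node in DAG(disease) to the disease itself."""
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # keep the largest contribution reachable through any child
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib

def semantic_similarity(a, b, parents, delta=0.5):
    """Shared-node contributions divided by the two semantic values."""
    ca = semantic_contributions(a, parents, delta)
    cb = semantic_contributions(b, parents, delta)
    shared = ca.keys() & cb.keys()
    numerator = sum(ca[t] + cb[t] for t in shared)
    denominator = sum(ca.values()) + sum(cb.values())
    return numerator / denominator
```

On a toy hierarchy where two terms `a` and `b` share a parent `x` whose parent is `root`, each term has semantic value 1 + 0.5 + 0.25 = 1.75, the shared nodes contribute 1.5 in total, and the similarity is 1.5 / 3.5 = 3/7.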

Integrated similarity for diseases

According to Equations 5 and 8, for any two given diseases d_i and d_j, the integrated similarity between them is given by Equation 9.

Part 2: construction of the feature matrix T

According to the integrated disease similarity matrix SIM obtained above, for any given disease d_i, we can obtain “the circle of K-closest friends” of d_i, that is, the set of K diseases with the highest integrated similarities to d_i. Based on the “circle of K-closest friends” of d_i, for any given lncRNA l_j, we can calculate a possible association score between d_i and l_j, even if there is no known association between d_i and l_j. The feature matrix T is then obtained according to Equation 10. Here, A(i,j) denotes the score of known association between the disease d_i and the lncRNA l_j, where A(i,j) = 1 if and only if there is a known association between d_i and l_j; otherwise, A(i,j) = 0.
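A minimal sketch of the neighbour selection follows. Picking the K most similar diseases is taken directly from the description above. The scoring rule in `neighbor_scores`, a similarity-weighted average of the friends' known associations, is only an assumed stand-in for Equation 10, which is not reproduced in the text. Both function names are hypothetical.

```python
import numpy as np

def k_closest_friends(sim, i, k):
    """Indices of the k diseases most similar to disease i (excluding i)."""
    order = np.argsort(-sim[i], kind="stable")
    return [int(j) for j in order if j != i][:k]

def neighbor_scores(A, sim, i, k):
    """Assumed scoring rule: similarity-weighted mean of friends' rows of A."""
    friends = k_closest_friends(sim, i, k)
    weights = sim[i, friends]
    total = weights.sum()
    if total == 0:
        return np.zeros(A.shape[1])
    return weights @ A[friends] / total
```

Because the score for disease i depends only on the rows of its friends, a disease with no known associations of its own can still receive non-zero scores, which is the property the text relies on for new diseases.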

Part 3: construction of the target output matrix A∗

For new diseases, since there are no known associations between these diseases and lncRNAs, we cannot extract any useful information about them from the original incidence matrix A to make effective predictions. Therefore, for any given new disease d_i, we first use the “circle of K-closest friends” of d_i to recommend the most likely associations for it and then construct a new target output matrix A∗ from the original incidence matrix A according to the following steps. Step 1: let A∗ = A; then, for any given new disease d_i, obtain a new association score vector based on the “circle of K-closest friends” of d_i (Equation 11). Step 2: take the set of lncRNAs that attain the maximum score in this vector and whose scores are greater than 0. If this set is not empty, the lncRNAs in it are recommended to d_i as the most likely candidates. Step 3: if this set is empty, traverse each disease in the “circle of K-closest friends” of d_i and obtain another new association score vector based on the “circle of K-closest friends” of each of these diseases (Equation 12). Similar to step 2, take the set of lncRNAs that attain the maximum score in this second vector and whose scores are greater than 0. If this set is not empty, all the lncRNAs in it are recommended to d_i as the most likely candidates as well. Following the above steps, we obtain the new matrix A∗ from the original incidence matrix A (Equation 13).

Construction of the local radial basis function network

The radial basis function network (RBF network) is an artificial neural network utilizing radial basis functions as activation functions. It was first proposed by Broomhead and Lowe in 1988. At present, the radial basis function network has been widely used in many fields, such as time series prediction, system control, and classification problems. Inspired by the superior performance of the radial basis function network, we designed a novel local radial basis function network for lncRNA-disease association prediction. As illustrated in Figure 6, the local radial basis function network is divided into three layers: the input layer, the hidden layer, and the output layer. There are N_l nodes in the input layer, representing the N_l lncRNAs, and each node accepts an N_d-dimensional vector (i.e., the feature vector between the lncRNA and the N_d diseases) as the input. The hidden layer consists of H nodes corresponding to H distinct feature vectors in the input matrix. The output layer consists of N_d nodes representing the N_d diseases, and the output of each node in the output layer is an N_l-dimensional vector, which consists of the probabilities of associations between this disease and the N_l lncRNAs. In ICLRBBN, the hidden layer adopts a nonlinear optimization strategy, which maps the low-dimensional feature vectors of the input layer to a high-dimensional space through a nonlinear function. In this high-dimensional space, the output layer adopts a linear optimization strategy, which applies a linear weighted adjustment to the hidden layer’s output to approximate the target output. In addition, the local radial basis function network receives two inputs: the original incidence matrix A is regarded as the initial feature matrix and used as the first input of the local radial basis biological network, and the feature matrix T obtained in Equation 10 is used as the second input.
Furthermore, the target output matrix A∗ is adopted as the target of the first output of the local radial basis biological network, while the second output matrix is regarded as the predicted lncRNA-disease association probability score matrix. Before adopting them as the inputs of the local radial basis biological network, we normalize the two feature matrices A and T with a cross-channel normalization scheme. Based on the above descriptions, the prediction process based on the newly constructed local radial basis function network can be divided into the following steps.

Step 1: determining the number of nodes for the hidden layer

The normalized incidence matrix A, in which each column is the feature vector of one lncRNA and contains one element per disease, is used as the initial input matrix of the local radial basis biological network. The nodes in the input layer represent lncRNAs, and each node accepts the corresponding feature vector as its input. Suppose that removing the duplicated columns of the normalized incidence matrix A yields a feature matrix with H columns, that is, H distinct feature vectors. We then assign H nodes to the hidden layer of the local radial basis biological network, one for each of these H distinct feature vectors.
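As a rough illustration of Step 1, the sketch below extracts the distinct columns of a small normalized matrix with NumPy and counts them to obtain H. The function name `hidden_layer_centers` and the toy matrix are hypothetical, and the layout (rows as diseases, columns as lncRNAs) is an assumption based on the description above rather than the authors' implementation.

```python
import numpy as np

def hidden_layer_centers(A_norm):
    """Return the distinct columns of A_norm as candidate hidden-layer centers.

    A_norm is assumed to have shape (n_diseases, n_lncRNAs), so each column
    is the feature vector of one lncRNA.
    """
    # np.unique with axis=1 drops duplicated columns; note that it also
    # returns the remaining columns in sorted order, not in input order.
    return np.unique(A_norm, axis=1)

# Toy matrix (hypothetical data): 3 diseases, 4 lncRNAs; columns 0 and 2 are identical.
A_norm = np.array([[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0],
                   [0.0, 0.0, 0.0, 1.0]])

centers = hidden_layer_centers(A_norm)
H = centers.shape[1]  # number of hidden nodes
print(H)
```

Here two lncRNAs share the same feature vector, so only three hidden nodes would be allocated for four input nodes.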

Step 2: calculating the first output of the hidden layer

After determining the number of nodes in the hidden layer, the output matrix of the hidden layer can be calculated. The role of the hidden layer is to take a kernel function as the basis function and map each feature vector of the input layer from a low-dimensional to a high-dimensional space to make it linearly separable. In ICLRBBN, we adopt the Gaussian kernel function as the basis function, and the output of each node in the hidden layer is a vector with one element per input node. For the k-th node in the hidden layer, each element of its output vector is calculated from the Euclidean distance between an input feature vector and the center of the k-th basis function (Equation 16). Here, the center of the k-th basis function is the feature vector corresponding to the k-th node in the hidden layer, and the input feature vector is the one fed in by the j-th node of the input layer for the first time. Additionally, σ is the bandwidth parameter of the basis function, which controls its radial range. The bandwidth of the k-th basis function is obtained from Equation 17, which involves the j-th column of the eigenmatrix T and an overlap coefficient whose value ranges from 0 to 1. The larger the overlap coefficient, the more the scopes of the basis functions overlap. According to Equations 16 and 17, an output vector can be obtained for each node in the hidden layer. Considering all the nodes of the hidden layer then yields the output matrix of the hidden layer for the first input.
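To make Step 2 concrete, the following minimal sketch computes a Gaussian RBF response for every pair of input vector and center. It assumes the common form exp(-d²/(2σ²)), where d is the Euclidean distance, and takes the bandwidth as a user-supplied value. The exact expressions of Equations 16 and 17, including the overlap coefficient, are not reproduced here, so `rbf_hidden_output` and its `sigma` argument are illustrative assumptions.

```python
import numpy as np

def rbf_hidden_output(X, centers, sigma):
    """Gaussian RBF responses of H hidden nodes to N input vectors.

    X       : (n_features, N) input feature vectors stored as columns
    centers : (n_features, H) basis-function centers stored as columns
    sigma   : scalar or (H,) bandwidths (assumed given, not derived here)
    Returns an (N, H) matrix whose (j, k) entry is the response of hidden
    node k to input vector j.
    """
    diff = X[:, :, None] - centers[:, None, :]   # (n_features, N, H)
    sq_dist = np.sum(diff ** 2, axis=0)          # squared Euclidean distances, (N, H)
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), (centers.shape[1],))
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Toy input (hypothetical data): 3 diseases, 4 lncRNAs.
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])
centers = np.unique(X, axis=1)                   # distinct columns as centers
Phi = rbf_hidden_output(X, centers, sigma=1.0)
print(Phi.shape)
```

Because every input column coincides with one of the centers in this toy case, each row of `Phi` contains a value of exactly 1 at its matching center and smaller values elsewhere.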

Step 3: obtaining the weight matrix W

From the above steps, we obtained the N × H-dimensional output matrix of the hidden layer, in which the j-th row represents the hidden-layer output vector of the j-th input node. Then, for any given disease, a system of equations can be generated locally from the lncRNAs that are known to be related to it (Equation 19), which can be rewritten in a more compact form (Equation 20). In Equation 20, the right side of the system of equations is a column vector. Therefore, we can solve the system of equations with the pseudoinverse to obtain the weight vector of the disease (Equation 21), where pinv denotes the function that computes the pseudoinverse. According to Equation 21, the corresponding weight vector can be calculated for any given disease. Collecting the weight vectors of all diseases then gives the weight matrix W.
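The pseudoinverse step can be sketched as follows. How the local system is assembled is an assumption made for illustration: for each disease, the rows of the hidden-layer output that belong to its known related lncRNAs are kept, and the matching entries of the target matrix form the right-hand side. The names `solve_disease_weights`, `weight_matrix`, `A_target`, and `A_known` are hypothetical and do not come from the paper.

```python
import numpy as np

def solve_disease_weights(Phi, target, known_mask):
    """Least-squares weight vector for one disease via the pseudoinverse.

    Phi        : (N, H) hidden-layer output matrix
    target     : (N,) target values of this disease for every lncRNA
    known_mask : (N,) boolean, True for lncRNAs known to be related to it
    """
    Phi_local = Phi[known_mask]        # keep only the known related lncRNAs
    y_local = target[known_mask]       # right-hand side of the local system
    return np.linalg.pinv(Phi_local) @ y_local   # (H,)

def weight_matrix(Phi, A_target, A_known):
    """Stack the per-disease weight vectors into W with shape (H, n_diseases)."""
    columns = [solve_disease_weights(Phi, A_target[i], A_known[i] > 0)
               for i in range(A_target.shape[0])]
    return np.stack(columns, axis=1)

# Toy example (hypothetical data): 3 lncRNAs, 3 hidden nodes, 2 diseases.
Phi = np.eye(3)
A_known = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0]])
A_target = A_known.copy()
W = weight_matrix(Phi, A_target, A_known)
print(W)
```

Using `np.linalg.pinv` gives the minimum-norm least-squares solution, which is useful here because the local system usually has fewer equations than unknowns.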

Step 4: calculating the second output of the hidden layer

In ICLRBBN, the feature matrix T obtained by the internal confidence-based collaborative filtering recommendation algorithm is used as the input matrix for the second input of the local radial basis biological network. As with the first input, the nodes in the input layer accept the feature vectors of the corresponding lncRNAs in the feature matrix as their inputs. Each node k in the hidden layer then produces a second output vector, and considering all the nodes of the hidden layer yields a second output matrix of the hidden layer.

Step 5: calculating the output matrix of the output layer

Let the j-th node in the input layer correspond to the j-th lncRNA, and consider the j-th row of the output matrix of the hidden layer. This row represents the output vector of that lncRNA in the hidden layer. For any given lncRNA, the association probabilities between it and the diseases can then be obtained from this row and the weight matrix W (Equation 24). Moreover, based on Equation 24, taking all lncRNAs into account gives a matrix F. Here, F is the final association probability matrix between lncRNAs and diseases, and each of its elements represents the association probability score between the i-th disease and the j-th lncRNA.
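Under the usual RBF-network reading of Step 5, the output layer is a linear combination of the hidden-layer responses. The sketch below assumes that the score matrix is the product of the second hidden-layer output and the weight matrix W, transposed so that rows index diseases and columns index lncRNAs. The function name `association_scores`, the array shapes, and the toy values are illustrative assumptions rather than the authors' exact Equations 24 and 25.

```python
import numpy as np

def association_scores(Phi_second, W):
    """Linear output layer of an RBF network.

    Phi_second : (N, H) hidden-layer output for the second input
    W          : (H, n_diseases) weight matrix, one column per disease
    Returns F with shape (n_diseases, N); F[i, j] scores disease i against lncRNA j.
    """
    return (Phi_second @ W).T

# Toy example (hypothetical values): 2 lncRNAs, 2 hidden nodes, 2 diseases.
Phi_second = np.array([[1.0, 0.0],
                       [0.5, 0.5]])
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
F = association_scores(Phi_second, W)
print(F)
```

Ranking the entries of one row of `F` in descending order would give the candidate lncRNAs for the corresponding disease.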
