
Multivariate Information Fusion With Fast Kernel Learning to Kernel Ridge Regression in Predicting LncRNA-Protein Interactions.

Cong Shen1, Yijie Ding2, Jijun Tang1,3, Fei Guo1.   

Abstract

Long non-coding RNAs (lncRNAs) constitute a large class of transcribed RNA molecules. They have a characteristic length of more than 200 nucleotides and do not encode proteins. They play an important role in regulating gene expression by interacting with their homologous RNA-binding proteins. Due to the laborious and time-consuming nature of wet-lab experimental methods, computational approaches for the prediction of lncRNA-protein interactions (LPIs) have attracted great attention. An in-depth review of the state-of-the-art in silico investigations leads to the conclusion that there is still room for improving accuracy and speed. This paper proposes a novel method for identifying LPIs that applies Kernel Ridge Regression based on Fast Kernel Learning (LPI-FKLKRR). The approach uses four distinct similarity measures in the lncRNA space and four in the protein space. Notably, we incorporate Gene Ontology (GO) annotations of proteins in order to improve the quality of information in the protein space. Heterogeneous kernels are integrated with Fast Kernel Learning (FastKL), which handles weight optimization. The final prediction model is then obtained with Kernel Ridge Regression (KRR). Experimental outcomes show that LPI-FKLKRR has extraordinary performance compared with existing LPI prediction schemes. On the benchmark dataset, the best Area Under the Precision-Recall Curve (AUPR) of 0.6950 is obtained by our proposed model LPI-FKLKRR, which outperforms the integrated LPLNP (AUPR: 0.4584), RWR (AUPR: 0.2827), CF (AUPR: 0.2357), LPIHN (AUPR: 0.2299), and LPBNI (AUPR: 0.3302). Combined with the experimental results of a case study on a novel dataset, it is anticipated that LPI-FKLKRR will be a useful tool for LPI prediction.

Keywords:  fast kernel learning; gene ontology; kernel ridge regression; lncRNA-protein interactions; multiple kernel learning

Year:  2019        PMID: 30697228      PMCID: PMC6340980          DOI: 10.3389/fgene.2018.00716

Source DB:  PubMed          Journal:  Front Genet        ISSN: 1664-8021            Impact factor:   4.599


1. Introduction

Long non-coding RNAs (lncRNAs) constitute a large class of transcribed molecules. They have a characteristic length of more than 200 nucleotides and do not encode proteins (St Laurent et al., 2015). Existing research has proven that lncRNAs can control gene expression at the transcriptional, post-transcriptional, and epigenetic levels by interacting with their homologous RNA-binding proteins (Guttman and Rinn, 2012; Quan et al., 2015; Tee et al., 2015). A recent study found that an lncRNA named lnc-Lsm3b can restrain the activity of the receptor RIG-I, induced by viruses during the regulation of the immune response (Jiang et al., 2018). This is consistent with previous studies which have shown that lncRNAs play potential roles in complex human diseases (Li et al., 2013). Due to the laborious and time-consuming nature of wet-lab experimental methods in molecular biology, many state-of-the-art computational studies have addressed this problem in an effort to enhance accuracy and time efficiency (Zou et al., 2012; Jalali et al., 2015; Han et al., 2018). Since it is very difficult to obtain detailed 3D structures of lncRNAs and their partner proteins, many sequence-based and secondary-structure-based approaches for the prediction of lncRNA-protein interactions (LPIs) have been published. Bellucci et al. established the well-known catRAPID (Bellucci et al., 2011) by leveraging both physicochemical properties and secondary structure information as compound features for predicting LPIs. Meanwhile, the hybrid scheme RPISeq was introduced by Muppirala et al. (2011), which employs both Support Vector Machines (SVM) and Random Forest (RF). Wang et al. proposed a classifier combining Naive Bayes (NB) and Extended NB (ENB) to extrapolate LPIs (Wang et al., 2012). Lu et al. established lncPro, which translates each LPI into numerical form and applies matrix multiplication (Lu et al., 2013). Suresh et al. developed RPI-Pred based on SVM, using the structure and sequence information of lncRNAs and proteins (Suresh et al., 2015). In contrast to the aforementioned works, Li et al. introduced LPIHN, which assembles a heterogeneous network and performs a random walk with restart (RWR) on the lncRNA-protein association profile (Li et al., 2015). Ge et al. applied a resource-allocation scheme on a bipartite network and published the algorithm as LPBNI (Ge et al., 2016). Lately, Hu et al. proposed a semi-supervised link prediction scheme entitled LPI-ETSLP (Hu et al., 2017), which was soon upgraded to IRWNRLPI, a method that integrates RWR and matrix factorization (Zhao et al., 2018). Zhang et al. divided state-of-the-art computational intelligence approaches into two classes (Zhang et al., 2017). The first includes supervised LPI binary classifiers, which do not require prior knowledge of interactions as negative instances (Bellucci et al., 2011; Muppirala et al., 2011; Wang et al., 2012; Lu et al., 2013; Suresh et al., 2015). The second category includes semi-supervised approaches, which combine known interactions to suggest unknown LPIs; characteristic cases of this class are LPIHN (Li et al., 2015), LPBNI (Ge et al., 2016), LPI-ETSLP (Hu et al., 2017), and IRWNRLPI (Zhao et al., 2018).

Transfer learning (Jonathan et al., 1995), which can recognize and leverage skills or knowledge learned in previous tasks for novel tasks, is viewed as a burgeoning branch of machine learning. Zero-shot learning in pairwise learning with two-step Kernel Ridge Regression (KRR) (Stock et al., 2016) is a special type of transfer learning that constructs predictors from a dataset containing both labeled and unlabeled samples. Hence, it is an effective mechanism for reducing the need for labeled data. To detect pairs of lncRNAs and proteins that can interact with each other, state-of-the-art statistical methods have been exploited, such as Recursive Least Squares (RLS), Kronecker RLS, Sparse Representation based Classification (SRC), and Multiple Kernel Learning (MKL). All of these techniques have already been applied, with comparable results, to predicting Protein-Protein Interactions (PPIs) (Ding et al., 2016; Liu X. et al., 2016), Drug-Target Interactions (DTIs) (Xia Z. et al., 2010; Laarhoven et al., 2011; Twan and Elena, 2013; Nascimento et al., 2016; Shen et al., 2017b), binding sites of biomolecules (Ding et al., 2017; Shen et al., 2017a), identification of disease-resistant genes (Xia J. et al., 2010), and microRNA-disease associations (Zou et al., 2015; Peng et al., 2017). Building on these studies, we have enriched the categories of similarity measures adopted for LPI prediction. Integration of the heterogeneous kinds of similarity information is achieved by applying Fast Kernel Learning (FastKL), which handles kernel weight optimization for weighting the heterogeneous kernels. This research then applies a two-step Kernel Ridge Regression (KRR) to the field of LPI prediction. LPI-FKLKRR has proven to be a more reliable and effective approach for LPI prediction than other competitive methods. The core of the proposed algorithm has been evaluated on the benchmark dataset of LPIs. It is especially encouraging that many of the LPI predictions made by our method have been confirmed, with a high degree of correlation. We have also conducted comparative testing on a novel dataset to illustrate the stable performance of LPI-FKLKRR.

2. Methods

In this section, we elaborate the architecture of our model. Its basic components are the following: the known LPI interaction matrix, and the multivariate information consisting of lncRNA expression profiles, the local network, sequence information, and Gene Ontology (GO) annotations. All of this similarity information must be combined using appropriate combination weights. Finally, we develop and employ the LPI identification strategy with Fast Kernel Learning based Kernel Ridge Regression (LPI-FKLKRR), which utilizes a two-stage Kernel Ridge Regression for LPI prediction.

2.1. Problem Specification

Suppose there are m lncRNAs and n proteins involved in LPIs. We formally define the two kinds of molecules as L = {l_1, l_2, ..., l_m} and P = {p_1, p_2, ..., p_n}, respectively. Hence, the interactions between lncRNAs and proteins can be intuitively and succinctly expressed as an m × n adjacency matrix F = (f_ij), where f_ij corresponds to the prediction value of the pair 〈l_i, p_j〉, 1 ≤ i ≤ m, 1 ≤ j ≤ n, and m, n ∈ ℕ*. If lncRNA l_i can interact with protein p_j, f_ij is set to 1; otherwise it is set to 0 (Equation 1). Obviously, the identification of new interactions between lncRNAs and proteins can be viewed as a task suitable for a recommender system (Koren et al., 2009) on a bipartite network, which can mine and detect the potentially associated individuals. To this end, we use Multiple Kernel Learning (MKL) to design the optimization for LPI prediction. In the following section, we will support the argument that a similarity matrix is equivalent to a kernel.
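As a minimal sketch of the formulation above (the dimensions and index pairs here are illustrative toys, not data from the paper), the adjacency matrix F can be built like this:

```python
import numpy as np

# Toy LPI adjacency matrix F (m lncRNAs x n proteins); values are illustrative.
m, n = 4, 3
known_pairs = [(0, 1), (2, 0), (3, 2)]  # (i, j) indices of known interactions

F = np.zeros((m, n))
for i, j in known_pairs:
    F[i, j] = 1.0  # f_ij = 1 when lncRNA l_i interacts with protein p_j
```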

2.2. lncRNA Kernels and Protein Kernels

In order to conduct MKL, we must construct similarity matrices of molecules in the lncRNA and protein kernel spaces, respectively. Specifically, lncRNA expression, protein GO, lncRNA sequence, protein sequence, and the known interactions between one lncRNA and all proteins are considered in our framework. In addition, the training adjacency matrix F is obtained by masking known pairs 〈l_i, p_j〉: the corresponding elements of the matrix are set to 0 for the validation set, as represented in Figure 1.
Figure 1

Technical flow chart of our LPI prediction model. (A) LncRNAs and proteins belong to two separated and independent spaces, respectively; (B) Fast kernel learning is applied to estimate the weight of each kernel in the corresponding space; (C) Kronecker Product is adopted in generating the final kernel matrix; (D) Kernel Ridge Regression (KRR) is applied in LPI prediction.

2.2.1. Gaussian Interaction Profile Kernel

Interactions can be reflected in the connectivity behavior of the underlying network (Laarhoven et al., 2011; Twan and Elena, 2013). For the lncRNAs, we extract the interaction profile of each lncRNA from the corresponding row of the training adjacency matrix F. Then we use the broadly applicable Gaussian Interaction Profile (GIP) kernel to derive an interaction kernel K_GIP^lnc(l_i, l_k) for lncRNAs l_i and l_k (i, k = 1, 2, ..., m). The GIP kernel K_GIP^pro(p_j, p_s) for proteins p_j and p_s (j, s = 1, 2, ..., n) is generated in a similar way. In summary, each element of the GIP kernel can be represented as

K_GIP^lnc(l_i, l_k) = exp(−‖F_i − F_k‖² / σ_l²),   K_GIP^pro(p_j, p_s) = exp(−‖F_j − F_s‖² / σ_p²),

where F_i, F_k are the interaction profiles of lncRNAs l_i, l_k, and F_j, F_s are the interaction profiles of proteins p_j, p_s. The Gaussian kernel bandwidths σ_l and σ_p are initialized to the value of 1 in the experiments. Practically, when employing 5-fold CV and LOOCV, the GIP kernel similarity is recalculated each time based only on the training samples.
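A short sketch of the GIP kernel computation follows. The normalization of the bandwidth by the mean squared profile norm follows the common convention of van Laarhoven et al. (2011) and is an assumption here, since the text only states that the bandwidth is initialized to 1:

```python
import numpy as np

def gip_kernel(profiles, sigma=1.0):
    """Gaussian Interaction Profile kernel over interaction profiles (rows of F).

    Bandwidth is normalized by the mean squared profile norm (an assumed
    convention); sigma = 1 matches the paper's initialization.
    """
    sq_norms = np.sum(profiles ** 2, axis=1)
    gamma = sigma / np.mean(sq_norms)
    # ||F_i - F_k||^2 = ||F_i||^2 + ||F_k||^2 - 2 <F_i, F_k>
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * d2)
```

The same function applies to proteins by passing the profiles of the protein side of F.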

2.2.2. Sequence Similarity Kernel

A sequence S of length d is an ordered list of characters, which can be written as S = c_1 c_2 ⋯ c_h ⋯ c_d (1 ≤ h ≤ d). Enlightened by state-of-the-art methods (Yamanishi et al., 2008; Nascimento et al., 2016), we use the normalized Smith-Waterman (SW) score (Smith and Waterman, 1981) to measure sequence similarity:

K_SW^lnc(l_i, l_k) = SW(S_i, S_k) / sqrt(SW(S_i, S_i) · SW(S_k, S_k)),

and analogously for proteins, where SW(·, ·) stands for the Smith-Waterman score; S_i and S_k are the sequences of lncRNAs l_i and l_k, and S_j and S_s denote the sequences of proteins p_j and p_s.
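The normalized SW kernel can be sketched as below; the match/mismatch/gap scores are illustrative placeholders, not the substitution scheme used in the paper:

```python
import numpy as np

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Plain Smith-Waterman local-alignment score (illustrative scoring values)."""
    H = np.zeros((len(a) + 1, len(b) + 1))
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1, j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i, j] = max(0, diag, H[i - 1, j] + gap, H[i, j - 1] + gap)
    return H.max()

def sw_kernel(seqs):
    """Normalized SW similarity: SW(x, y) / sqrt(SW(x, x) * SW(y, y))."""
    n = len(seqs)
    self_scores = [smith_waterman(s, s) for s in seqs]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = smith_waterman(seqs[i], seqs[j]) / np.sqrt(
                self_scores[i] * self_scores[j])
    return K
```

The normalization guarantees a unit diagonal, so the resulting matrix behaves like a kernel of self-similarity 1.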

2.2.3. Sequence Feature Kernel

We obtain the sequence feature kernel by extracting features from the sequences of lncRNAs and proteins. In practice, Conjoint Triad (CT) (Shen et al., 2007) and Pseudo Position-Specific Score Matrix (Pse-PSSM) (Chou and Shen, 2007) descriptors are adopted for lncRNA and protein sequences, respectively. Both sequence feature kernels, K_SF^lnc and K_SF^pro, are constructed with a Radial Basis Function (RBF) kernel with bandwidth equal to 1.

2.2.4. lncRNA Expression Kernel

It is interesting to identify genes with concordant behaviors, because different genes always show different behaviors (Lai et al., 2017). The expression profiles of lncRNAs refer to 24 cell types from the NONCODE database (Xie et al., 2014). After expressing each lncRNA as a 24-dimensional expression profile vector, the lncRNA expression kernel K_EXP^lnc is generated with the RBF kernel, with the bandwidth also set to 1.

2.2.5. GO Kernel

Inspired by former research (Zheng et al., 2012), proteins with similar Gene Ontology (GO) annotations are expected to act in similar biological processes, to reside in similar cellular compartments, or to have similar molecular functions. Therefore, GO annotations are employed in this paper to generate a similarity matrix in the protein space. The files of GO terms have been downloaded from the GOA database (Wan et al., 2013). Semantic similarity is usually based on the overlap of the terms associated with two proteins (Wu et al., 2013). The Jaccard value we exploit to measure the semantic similarity of the GO term sets t_j and t_s of proteins p_j and p_s is defined as

J(p_j, p_s) = |t_j ∩ t_s| / |t_j ∪ t_s|,

where t_j ∩ t_s denotes the common terms of p_j and p_s, and t_j ∪ t_s refers to the total set of terms of p_j and p_s. However, no formal definition of the GO common terms t_j ∩ t_s has been given before. We specify that two term sequences have common GO terms only if they are completely consistent. For example, given three sequences S1 = 〈3, 1, 5〉, S2 = 〈3, 2, 5〉, and S3 = 〈3, 2, 5〉, if we only required all corresponding locations of the three sequences to hold non-zero values, then all three sequences would have common terms. Under our definition, however, S2 has common terms with S3 but not with S1, because the second characters of S1 and S2 are different. Thus, we obtain a sparser GO similarity matrix, which facilitates the computation.
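The Jaccard similarity itself takes only a few lines; note that this sketch uses plain set intersection, not the stricter positional definition of common terms described above:

```python
def go_jaccard(terms_a, terms_b):
    """Jaccard similarity between two proteins' GO term sets."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # convention for two unannotated proteins
    return len(a & b) / len(a | b)
```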

2.3. Fast Kernel Learning

In MKL, we need to find an optimal mapping vector w, i.e., we need to choose an optimal weighting strategy so that the object similarity matrices can be appropriately combined. Concretely, the vectors of weight values for the lncRNA kernels and the protein kernels are denoted w_lnc and w_pro, respectively. As described above, there are four kernels in the lncRNA space (K_GIP^lnc, K_SW^lnc, K_SF^lnc, K_EXP^lnc) and four kernels in the protein space (K_GIP^pro, K_SW^pro, K_SF^pro, K_GO^pro). The optimal lncRNA and protein kernels are the weighted sums

K_lnc = Σ_i w_lnc,i K_lnc,i,   K_pro = Σ_i w_pro,i K_pro,i,   (4a, 4b)

where w_lnc,i and w_pro,i denote the elements of w_lnc and w_pro, and K_lnc,i and K_pro,i correspond to the normalized similarity matrices among the heterogeneous similarity kernels in the lncRNA and protein spaces. Following the description of Fast Kernel Learning (FastKL) (He et al., 2008), let w stand for either required optimal solution w_lnc or w_pro, and let K denote the corresponding kernel matrix K_lnc or K_pro. FastKL not only minimizes the distance between K and Y, where Y = y y^T and y collects all training-set labels, but also considers the regularization term ‖w‖², which is used to prevent overfitting. To this end, w is drawn from Equation (6):

min_w ‖ Σ_i w_i K_i − Y ‖_F² + λ ‖w‖²,   (6)

where ‖·‖_F represents the Frobenius norm and λ is the tradeoff parameter; in practice, we set λ = 10000 when selecting the optimal parameter value. To develop Equation (6) further, since the squared Frobenius norm of a matrix equals the trace of the product of the matrix with its transpose, i.e., ‖A‖_F² = tr(A Aᵀ), the objective function for the optimal solution w can be simplified into trace terms of the form tr(K_i K_j) and tr(K_i Y), where tr(·) is the trace operator and K_i, K_j denote two different kernel matrices (Equation 7). Recapitulating the above, by obtaining the final w_lnc and w_pro we achieve the goal of MKL: fusing all kinds of similarity matrices so that the input kernels for KRR can be generated.
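Because the objective above is quadratic in w, the weights admit a closed-form solution via a small linear system; a minimal sketch (any non-negativity or simplex constraints on w are omitted here as an assumption):

```python
import numpy as np

def fastkl_weights(kernels, Y, lam=1e4):
    """Solve min_w || sum_i w_i K_i - Y ||_F^2 + lam * ||w||^2 in closed form.

    For symmetric kernels the optimum satisfies (M + lam*I) w = b, with
    M_ij = tr(K_i K_j) and b_i = tr(K_i Y), obtained by setting the
    gradient of the quadratic objective to zero.
    """
    k = len(kernels)
    M = np.array([[np.trace(Ki @ Kj) for Kj in kernels] for Ki in kernels])
    b = np.array([np.trace(Ki @ Y) for Ki in kernels])
    return np.linalg.solve(M + lam * np.eye(k), b)
```

With λ large (the paper uses 10000), the system is well conditioned and the weights shrink toward zero; with λ → 0 the solution reproduces Y exactly when it lies in the span of the kernels.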

2.4. Kernel Ridge Regression

Stock et al. developed a pairwise-learning scheme based on two-step Kernel Ridge Regression (KRR) (Stock et al., 2016), which can be applied to binary classification. The basic idea of KRR is to minimize a suitable objective function with an L2 complexity penalty so that it fits the labeled dyads as well as possible. Specifically, the KRR prediction for the LPI pair 〈l_i, p_j〉 consists of two steps, which are shown in Figure 2.
Figure 2

Schematic diagram of two-step Kernel Ridge Regression. (A) An intermediate prediction of LPI is conducted using an lncRNA KRR model. (B) Protein KRR is trained using the last step information for predicting new proteins.

In the first step, a prediction for all intermediate LPI pairs involving a test molecule is obtained as a 1 × n vector f:

f = k_l^T (K_lnc + λ_l I)^{-1} F,   (8)

where k_l denotes the vector of lncRNA kernel evaluations between the lncRNAs in the training set and the test pair, and λ_l is the regularization parameter. In the second step, each element of the prediction matrix F* is obtained using another regularization parameter λ_p, as in Equation (9). Considering the optimal lncRNA and protein kernels K_lnc and K_pro, the general objective function of the two-step KRR is defined over vec(F*) (Equation 10), where vec(·) is the vectorization operator that rearranges the matrix elements into a single vector, and F* denotes the prediction of the original matrix F estimated by the application of LPI-KRR. The objective function in Equation (8) is minimized iteratively, and the iterations usually converge within about 5-10 iterations. The kernel matrix Ξ used in Equation (10) is the Kronecker product of the two kernels (Equation 11):

Ξ = K_pro ⊗ K_lnc.   (11)

Using the lncRNA and protein kernels and the two regularization parameters λ_l and λ_p, each element of matrix F* can be represented as in Equation (12):

f*_ij = k_{l,i}^T (K_lnc + λ_l I)^{-1} F (K_pro + λ_p I)^{-1} k_{p,j}.   (12)

The LPI-FKLKRR calculation framework is illustrated in Algorithm 1.
Algorithm 1

Fast Kernel Learning based on Kernel Ridge Regression (LPI-FKLKRR).

Input:  K_GIP^lnc, K_SW^lnc, K_SF^lnc, K_EXP^lnc ∈ ℝ^(m×m) and K_GIP^pro, K_SW^pro, K_SF^pro, K_GO^pro ∈ ℝ^(n×n); F ∈ ℝ^(m×n).
Output:  F*.
  1:  Calculate w_lnc and w_pro and adjust the parameter λ by Eq. (6);
  2:  Calculate K_lnc and K_pro by using Eqs. (4a) and (4b);
  3:  Calculate the prediction values in matrix F* by Eq. (10);
  4:  Adjust the parameters λ_l and λ_p by using Eqs. (8) and (9), and produce the optimal F*.
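Steps 2-4 of Algorithm 1 can be sketched with the closed-form two-step KRR prediction; the kernel matrices, regularizers, and toy inputs below are illustrative assumptions, not the paper's fitted values:

```python
import numpy as np

def two_step_krr(K_lnc, K_pro, F, lam_l=1.0, lam_p=1.0):
    """Closed-form two-step Kernel Ridge Regression (after Stock et al., 2016).

    Step 1 fits an lncRNA-side KRR model to the interaction matrix F;
    step 2 fits a protein-side KRR model to that intermediate prediction,
    giving F* = K_lnc (K_lnc + lam_l I)^-1 F (K_pro + lam_p I)^-1 K_pro.
    """
    m, n = F.shape
    F_mid = K_lnc @ np.linalg.solve(K_lnc + lam_l * np.eye(m), F)   # step 1, m x n
    B = np.linalg.solve(K_pro + lam_p * np.eye(n), F_mid.T)         # step 2 fit
    return (K_pro @ B).T                                            # F*, m x n
```

With λ_l, λ_p → 0 and invertible kernels, the prediction reproduces F; larger regularizers smooth the scores toward unseen pairs.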

3. Results

This section provides a quantitative evaluation of our approach on the benchmark dataset. We first show the results of 5-fold cross validation, then independently analyze the performance of each single kernel. Moreover, LPI-FKLKRR is compared both with the mean weighted model and, in a parallel comparison, with other outstanding methods. Furthermore, we use a case study to evaluate our method in predicting unknown lncRNA-protein interactions. Finally, we also compare LPI-FKLKRR with state-of-the-art work on a novel dataset.

3.1. Benchmark Dataset

Although there exists a high volume of web-based resources (Park et al., 2014), available datasets should be carefully selected. We have acquired the benchmark dataset following the state-of-the-art work by Zhang et al. (2017). They extracted experimentally determined lncRNA-protein interactions covering 1114 lncRNAs and 96 proteins from NPInter V2.0 (Yuan et al., 2014). Sequence information for non-coding RNAs and proteins was gleaned from the NONCODE (Xie et al., 2014) and SUPERFAMILY (Gough et al., 2001) databases, respectively. Zhang et al. also removed lncRNAs and proteins whose expression or sequence information was unavailable, in order to reduce the computational burden; lncRNAs and proteins with only one interaction were removed for the same reason. A dataset with 4158 lncRNA-protein interactions, containing 990 lncRNAs and 27 proteins, was finally collected.

3.2. Evaluation Measurements

To gauge the stability of our model, 5-fold Cross Validation (5-fold CV) has been employed. The Area Under ROC curve (AUC) and Area Under the Precision-Recall curve (AUPR) measures have been utilized to evaluate our approach. We would like to emphasize that AUPR is more significant than AUC as a quality measurement because of the sparsity of the true lncRNA-protein interactions.
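Since AUPR and AUC drive all of the comparisons below, here is a minimal, library-free sketch of both measures (assuming distinct prediction scores, so tie handling is omitted; any metrics library would do equally well):

```python
import numpy as np

def auc_score(y_true, y_score):
    """ROC AUC via the Mann-Whitney rank statistic (assumes distinct scores)."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def aupr_score(y_true, y_score):
    """Area under the precision-recall curve as average precision."""
    y = y_true[np.argsort(-y_score)]          # labels sorted by descending score
    tp = np.cumsum(y)                         # true positives at each cutoff
    precision = tp / np.arange(1, len(y) + 1)
    return np.sum(precision * y) / y_true.sum()
```

On a sparse label vector, a random ranker's AUPR approaches the positive rate while its AUC stays near 0.5, which is why AUPR is the stricter measure here.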

3.3. Experimental Environment

The proposed LPI-FKLKRR algorithm has been implemented using MATLAB as the development platform. All programs have been validated on a computer with a 3.7 GHz 4-core CPU, 20 GB of memory, and a 64-bit Windows operating system.

3.4. Parameter Optimization

A grid search has been adopted to obtain optimized values of the parameters λ_l and λ_p. The range of λ_l is from 20 to 980, while λ_p ranges from 2 to 27. The criteria used to select the optimal values of λ_l and λ_p were the highest AUPR value together with the smallest λ_l and λ_p, since smaller values of λ_l and λ_p reduce the running time of the algorithm. We have found that λ_l = 20.89 and λ_p = 0.02 are the best values for the two parameters (AUPR: 0.6950).
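The search with its tie-breaking rule (prefer the smallest λ among equal AUPRs) can be sketched as follows; the grids and the scoring function are placeholders, not the paper's actual ranges:

```python
import numpy as np

def grid_search(score_fn, lam_l_grid, lam_p_grid):
    """Return (best_score, lam_l, lam_p), preferring smaller lambdas on ties.

    score_fn(lam_l, lam_p) should return a validation AUPR (or any score
    to maximize). Iterating grids in ascending order with a strict '>'
    keeps the smallest tied lambdas, matching the selection criterion.
    """
    best = (-np.inf, None, None)
    for lam_l in sorted(lam_l_grid):
        for lam_p in sorted(lam_p_grid):
            s = score_fn(lam_l, lam_p)
            if s > best[0]:
                best = (s, lam_l, lam_p)
    return best
```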

3.5. Performance Analysis

After testing different kinds of kernels on the benchmark dataset, we find that the AUPRs of the GIP kernel, sequence feature kernel, sequence similarity kernel, and gene expression & protein GO kernel are 0.6429, 0.4885, 0.5024, and 0.2663, respectively. The detailed results are listed in Table 1. The GIP kernel clearly has the highest AUPR value among the single kernels. Multiple kernels with the FastKL weighted model achieve an AUPR of 0.6950, which is an outstanding performance. In Figure 3, we can see that FastKL performs better than the other models; it is clearly effective in improving the performance of LPI prediction.
Table 1

The AUPR and AUC of different kernels on benchmark dataset.

Kernel type                            | AUPR   | AUC
GIP kernel                             | 0.6429 | 0.8671
Sequence feature kernel                | 0.4885 | 0.8250
Sequence similarity kernel             | 0.5024 | 0.8342
Gene expression & protein GO           | 0.2663 | 0.6626
Multiple kernels with mean weighted    | 0.6433 | 0.8840
Multiple kernels with FastKL weighted  | 0.6950 | 0.9063

Bold values represent the best value in columns.

Figure 3

The ROC and PR curve of different models.

In addition, Figure 4 shows the weight of each kernel in the lncRNA space and the protein space in a 5-fold CV experiment. Conspicuously, the GIP kernel obtains the largest weight in the lncRNA space. However, the four kinds of protein similarity matrices divide the weights almost equally in the protein space. This can be explained by the four kinds of protein similarity having a low degree of overlap in the representation space, i.e., each kind of protein similarity presents a specific aspect of the protein features.
Figure 4

The kernel weights in the experiment of LPI-FKLKRR on benchmark dataset.


3.6. Comparing to Existing Predictors

The comparison between our approach and other existing methods is shown in Table 2. The highest AUPR of 0.6950 is achieved by our proposed approach, which is superior to all others. The AUPR values for the other established methods are the following: integrated LPLNP (AUPR: 0.4584) (Zhang et al., 2017), RWR (AUPR: 0.2827) (Gan, 2014), CF (AUPR: 0.2357) (Sarwar et al., 2001), LPIHN (AUPR: 0.2299) (Li et al., 2015), and LPBNI (AUPR: 0.3302) (Ge et al., 2016). There are two well-founded reasons for the improved performance of our method. First, FastKL effectively combines multivariate information by employing multiple kernel learning. Second, LPI-KRR is an effective prediction algorithm, employing two-step KRR to fuse the lncRNA and protein feature spaces. Because extrapolation is difficult on imbalanced datasets, the precision-recall curve is more informative than the ROC curve in this highly imbalanced setting; even so, we have obtained a competitive AUC value compared to the state-of-the-art algorithms. From all the above, we conclude that our approach can be a useful tool for LPI prediction.
Table 2

Comparison to existing methods via 5-fold CV on benchmark dataset.

Method              | AUPR   | AUC
LPI-FKLKRR          | 0.6950 | 0.9063
Integrated LPLNP*   | 0.4584 | 0.9104
RWR*                | 0.2827 | 0.8134
CF*                 | 0.2357 | 0.7686
LPIHN*              | 0.2299 | 0.8451
LPBNI*              | 0.3302 | 0.8569

*Results are derived from Zhang et al. (2017).


3.7. Case Study

We have also used local Leave-One-Out Cross-Validation (LOOCV) to evaluate the predictive performance. Local LOOCV masks the relationships between one protein and all lncRNAs. Our model is trained on the rest of the known information, whether interacting or not, and tested on the masked relationships. For a protein not appearing in training, our approach predicts the strength of interaction between this protein and all 990 lncRNAs in the experiment. We rank these interaction values in descending order, since a high rank corresponds to a high interaction possibility. Figure 5 compares the performance of single kernels, mean weighted kernels, and FastKL weighted kernels. The FastKL weighted model using multiple kernels gains the best performance, with an AUPR of 0.5506 and an AUC of 0.7937. The detailed results are listed in Table 3.
Figure 5

The ROC and PR curve by local LOOCV on benchmark dataset.

Table 3

The AUPR and AUC of different kernels by local LOOCV on benchmark dataset.

Kernel                                 | AUPR   | AUC
GIP kernel                             | 0.1690 | 0.5189
Sequence feature kernel                | 0.2814 | 0.6800
Sequence similarity kernel             | 0.3546 | 0.7333
Gene expression & protein GO           | 0.3101 | 0.7301
Multiple kernels with mean weighted    | 0.4956 | 0.7898
Multiple kernels with FastKL weighted  | 0.5506 | 0.7937

Bold values represent the best value in columns.

As shown in Table 4, the top 20 interactions extrapolated by LPI-FKLKRR for two proteins (ENSP00000309558 and ENSP00000401371) have been examined. Likewise, Table 5 lists the top 10 interactions extrapolated by LPI-FKLKRR for two lncRNAs (NONHSAT145960 and NONHSAT031708). We check them against the masked relationships between one protein and all lncRNAs, or one lncRNA and all proteins. Our approach achieves successful identification proportions of 11/20 and 12/20 for proteins ENSP00000309558 and ENSP00000401371, respectively, and 6/10 and 6/10 for lncRNAs NONHSAT145960 and NONHSAT031708.
Table 4

Top 20 interactions rank on protein ENSP00000309558 and ENSP00000401371.

lncRNA ID     | Protein ID      | Rank | Confirm?  | lncRNA ID     | Protein ID      | Rank | Confirm?
NONHSAT011652 | ENSP00000309558 | 1    | Confirmed | NONHSAT002344 | ENSP00000401371 | 1    | Confirmed
NONHSAT027070 | ENSP00000309558 | 2    | Confirmed | NONHSAT104639 | ENSP00000401371 | 2    |
NONHSAT104991 | ENSP00000309558 | 3    | Confirmed | NONHSAT027070 | ENSP00000401371 | 3    | Confirmed
NONHSAT001511 | ENSP00000309558 | 4    | Confirmed | NONHSAT104991 | ENSP00000401371 | 4    | Confirmed
NONHSAT079374 | ENSP00000309558 | 5    |           | NONHSAT101154 | ENSP00000401371 | 5    |
NONHSAT009703 | ENSP00000309558 | 6    | Confirmed | NONHSAT041921 | ENSP00000401371 | 6    | Confirmed
NONHSAT138142 | ENSP00000309558 | 7    | Confirmed | NONHSAT042032 | ENSP00000401371 | 7    |
NONHSAT104639 | ENSP00000309558 | 8    | Confirmed | NONHSAT131038 | ENSP00000401371 | 8    | Confirmed
NONHSAT135796 | ENSP00000309558 | 9    | Confirmed | NONHSAT084827 | ENSP00000401371 | 9    |
NONHSAT077129 | ENSP00000309558 | 10   |           | NONHSAT021830 | ENSP00000401371 | 10   | Confirmed
NONHSAT023404 | ENSP00000309558 | 11   |           | NONHSAT001953 | ENSP00000401371 | 11   | Confirmed
NONHSAT063901 | ENSP00000309558 | 12   | Confirmed | NONHSAT145923 | ENSP00000401371 | 12   | Confirmed
NONHSAT099046 | ENSP00000309558 | 13   |           | NONHSAT039675 | ENSP00000401371 | 13   |
NONHSAT031489 | ENSP00000309558 | 14   |           | NONHSAT135796 | ENSP00000401371 | 14   | Confirmed
NONHSAT041921 | ENSP00000309558 | 15   | Confirmed | NONHSAT011652 | ENSP00000401371 | 15   | Confirmed
NONHSAT013639 | ENSP00000309558 | 16   |           | NONHSAT044002 | ENSP00000401371 | 16   |
NONHSAT027206 | ENSP00000309558 | 17   |           | NONHSAT112849 | ENSP00000401371 | 17   |
NONHSAT134595 | ENSP00000309558 | 18   |           | NONHSAT114444 | ENSP00000401371 | 18   | Confirmed
NONHSAT054716 | ENSP00000309558 | 19   |           | NONHSAT007429 | ENSP00000401371 | 19   | Confirmed
NONHSAT122291 | ENSP00000309558 | 20   | Confirmed | NONHSAT123220 | ENSP00000401371 | 20   |
Table 5

Top 10 interactions rank on lncRNA NONHSAT145960 and NONHSAT031708.

lncRNA ID     | Protein ID      | Rank | Confirm?  | lncRNA ID     | Protein ID      | Rank | Confirm?
NONHSAT145960 | ENSP00000258962 | 1    |           | NONHSAT031708 | ENSP00000385269 | 1    | Confirmed
NONHSAT145960 | ENSP00000240185 | 2    | Confirmed | NONHSAT031708 | ENSP00000258962 | 2    |
NONHSAT145960 | ENSP00000385269 | 3    |           | NONHSAT031708 | ENSP00000240185 | 3    | Confirmed
NONHSAT145960 | ENSP00000349428 | 4    | Confirmed | NONHSAT031708 | ENSP00000349428 | 4    |
NONHSAT145960 | ENSP00000379144 | 5    | Confirmed | NONHSAT031708 | ENSP00000258729 | 5    | Confirmed
NONHSAT145960 | ENSP00000338371 | 6    | Confirmed | NONHSAT031708 | ENSP00000338371 | 6    |
NONHSAT145960 | ENSP00000401371 | 7    | Confirmed | NONHSAT031708 | ENSP00000379144 | 7    |
NONHSAT145960 | ENSP00000254108 | 8    |           | NONHSAT031708 | ENSP00000254108 | 8    | Confirmed
NONHSAT145960 | ENSP00000258729 | 9    | Confirmed | NONHSAT031708 | ENSP00000401371 | 9    | Confirmed
NONHSAT145960 | ENSP00000413035 | 10   |           | NONHSAT031708 | ENSP00000371634 | 10   | Confirmed

3.8. Speed Comparison on Benchmark Dataset

Practically, running speed also plays an important role in LPI prediction. State-of-the-art peer methods, such as LPLNP, can produce high-accuracy results, so the overall evaluation of each approach should also consider running time (RT). Thus, a comparison between the RT of LPLNP and LPI-FKLKRR has been performed. The comparative RT analysis, obtained after running the publicly available source code of LPLNP, is illustrated in Table 6.
Table 6

Comparison of running time between LPI-FKLKRR and LPLNP in 10 times.

Method      | Average running time (s) | Standard deviation (s)
LPI-FKLKRR  | 11.48                    | 0.2126
LPLNP^a     | 352.93                   | 2.6656

^a The address of the LPLNP source code is given by Zhang et al. (2017).

Although LPLNP and LPI-FKLKRR have competitive AUC values (according to the results shown in Table 2), it is clear that LPI-FKLKRR achieves a better average running time, needing only 11.48 s to accomplish the LPI prediction task, much faster than the 352.93 s of LPLNP (as shown in Table 6). Moreover, the standard deviation shows that LPI-FKLKRR is both fast and stable. Considering also the higher AUPR value of LPI-FKLKRR, we can strongly suggest that LPI-FKLKRR is both a time-saving and a useful tool for LPI prediction.

3.9. Evaluation on Novel Dataset

To support the results of the benchmark experiments, we have employed another dataset, published by Zheng et al. The novel dataset is larger than the benchmark dataset, as shown in Table 7.
Table 7

The information of two datasets in the experiment.

Dataset               Number of lncRNAs    Number of proteins    LPIs
benchmark dataset*    990                  27                    4,158
novel dataset*        1,050                84                    4,467

* The benchmark dataset and the novel dataset come from the paper of Zhang et al.

Originating from the same databases as the benchmark dataset, the novel dataset consists of 4,467 LPIs, including 1,050 unique lncRNAs and 84 unique proteins. We have compared LPI-FKLKRR and PPSNs (Zheng et al., 2017) by applying 5-fold CV on the novel dataset, and list the results in Table 8. The AUC value of LPI-FKLKRR is 0.9669, which is higher than that of PPSNs. Moreover, the AUPR value of 0.7062 on the novel dataset demonstrates the robust performance of LPI-FKLKRR on an imbalanced dataset.
Table 8

The AUPR and AUC of different methods on novel dataset.

Method         AUPR      AUC
LPI-FKLKRR     0.7062    0.9669
PPSNsa         —         0.9098
NRLMF          0.4010    0.8287
CF             0.4267    0.8103

a AUPR is not reported by Zheng et al.

Apart from the baseline methods tested in Figure 2, we make a new comparison on the dataset proposed by Zheng et al., with methods including NRLMF and CF. NRLMF, which is also capable of integrating various data sources, achieved good performance for both MDA prediction (Yan et al., 2017; He et al., 2018) and DTI prediction (Liu Y. et al., 2016). The CF method, proposed by Sarwar et al., is another state-of-the-art work. From Table 8, we notice that in terms of both AUPR and AUC, the values of LPI-FKLKRR are higher than those of NRLMF (AUPR: 0.4010, AUC: 0.8287) and CF (AUPR: 0.4267, AUC: 0.8103). Both 5-fold CV and local LOOCV are also performed in the novel dataset experiment. After testing different kinds of kernels on the novel dataset, we find that in the 5-fold CV, the AUPRs of the GIP kernel, sequence feature kernel, sequence similarity kernel, and gene expression & protein GO kernel are 0.6812, 0.4819, 0.4846, and 0.2379, respectively. Multiple kernels with the FastKL weighted model achieve an AUPR of 0.7076, which is an outstanding performance. In Figures 6, 7, we can see that FastKL performs better than the other models. This result is consistent with the results on the benchmark dataset.
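For readers unfamiliar with the two metrics, AUC and AUPR can be computed from ranked prediction scores as sketched below. This is a plain-Python illustration of the standard definitions, not the authors' evaluation pipeline; AUPR is approximated here by average precision.

```python
def auc(scores, labels):
    """ROC AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly, ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def aupr(scores, labels):
    """Area under the precision-recall curve, approximated by average
    precision over the score-sorted ranking."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap, n_pos = 0, 0.0, sum(labels)
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / k  # precision at each positive hit
    return ap / n_pos
```

On a heavily imbalanced interaction matrix, AUPR penalizes false positives among the top-ranked pairs far more than AUC does, which is why both values matter when comparing the methods in Table 8.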
Figure 6

The ROC and PR curve of different models with novel dataset by 5-fold CV.

Figure 7

The ROC and PR curve by local LOOCV on novel dataset.


Conclusions and Discussion

In this paper, we have proposed a novel method for the prediction of lncRNA-protein interactions using Kernel Ridge Regression combined with a multiple kernel learning approach (LPI-FKLKRR). LPI-FKLKRR employs fast kernel learning to fuse lncRNA and protein similarity matrices, respectively. A two-step Kernel Ridge Regression is adopted to forecast the interactions between lncRNAs and proteins. Under 5-fold cross validation (5-fold CV), the proposed LPI-FKLKRR algorithm achieved very reliable and promising results on the benchmark dataset (AUPR: 0.6950). Furthermore, LPI-FKLKRR achieves satisfactory prediction performance compared with state-of-the-art approaches, and a comparison on a novel dataset illustrates the stability of our model. From the viewpoint of classification, the problem setting of lncRNA-protein interaction prediction is analogous to miRNA-disease interaction prediction and drug-target interaction prediction (Ezzat et al., 2018). For instance, the CF method proposed by Sarwar et al. has a recent successor named MSCMF, which projects drugs and targets into a common low-rank feature space (Zheng et al., 2013); this method can be transferred to the area of LPI prediction. Ezzat et al. have suggested that chemogenomic methods can be categorized into five types: neighborhood models, bipartite local models, network diffusion models, matrix factorization models, and feature-based classification models. Consequently, in the future we will improve the prediction performance by adding information such as available 3D structure data, constructing more heterogeneous similarity matrices, changing the weighting strategy, or adopting other effective regression models.
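The core of the model, fusing several base kernels with learned weights and then solving Kernel Ridge Regression in closed form, can be illustrated as follows. This is a minimal numpy sketch under our own naming, with fixed example weights standing in for the ones FastKL would optimize; it is not the released LPI-FKLKRR implementation.

```python
import numpy as np

def fuse_kernels(kernels, weights):
    """Weighted combination of base similarity kernels: K* = sum_i w_i K_i."""
    return sum(w * K for w, K in zip(weights, kernels))

def krr_fit(K, y, lam=1.0):
    """Closed-form KRR dual solution: alpha = (K + lam*I)^(-1) y."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

# Toy example with two 3x3 base kernels (stand-ins for, e.g., sequence
# and GIP similarity matrices over three lncRNAs).
K1 = np.eye(3)
K2 = np.full((3, 3), 0.5) + 0.5 * np.eye(3)
y = np.array([1.0, 0.0, 1.0])  # known interaction labels

K = fuse_kernels([K1, K2], [0.6, 0.4])  # weights would come from FastKL
alpha = krr_fit(K, y, lam=1e-6)
scores = K @ alpha  # in-sample interaction scores
```

With a negligible regularizer the fitted scores recover the training labels exactly; in practice lam trades training fit against generalization, and the same fit-predict step is applied on both the lncRNA and the protein side of the two-step scheme.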

Data Availability Statement

The datasets and codes for this study can be found in website https://github.com/6gbluewind/LPI_FKLKRR.

Author Contributions

FG, YD, and CS conceived and designed the experiments. CS and YD performed the experiments and analyzed the data. FG and CS wrote the paper. FG and JT supervised the experiments and reviewed the manuscript.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.