
A Deep Learning Framework for Robust and Accurate Prediction of ncRNA-Protein Interactions Using Evolutionary Information.

Hai-Cheng Yi1, Zhu-Hong You2, De-Shuang Huang3, Xiao Li4, Tong-Hai Jiang4, Li-Ping Li4.   

Abstract

The interactions between non-coding RNAs (ncRNAs) and proteins play an important role in many biological processes, and the biological functions of ncRNAs are primarily achieved by binding with a variety of proteins. High-throughput biological techniques can identify the protein molecules bound by a specific ncRNA, but they are usually expensive and time-consuming. Deep learning provides a powerful solution for computationally predicting RNA-protein interactions. In this work, we propose the RPI-SAN model, which uses a deep-learning stacked auto-encoder network to mine hidden high-level features from RNA and protein sequences and feeds them into a random forest (RF) model to predict ncRNA-binding proteins. Stacked assembling is further used to improve the accuracy of the proposed method. Four benchmark datasets, RPI2241, RPI488, RPI1807, and NPInter v2.0, were employed for an unbiased evaluation of five prediction tools: RPI-Pred, IPMiner, RPISeq-RF, lncPro, and RPI-SAN. The experimental results show that our RPI-SAN model achieves much better performance than the other methods, with accuracies of 90.77%, 89.7%, 96.1%, and 99.33% on the four datasets, respectively. We anticipate that RPI-SAN can serve as an effective computational tool for future biomedical research and can accurately predict potential ncRNA-protein interacting pairs, providing reliable guidance for biological research.
Copyright © 2018 The Author(s). Published by Elsevier Inc. All rights reserved.

Keywords:  PSSM; RNA-protein interactions; Zernike moment; deep learning; non-coding RNA; stacked auto-encoder

Year:  2018        PMID: 29858068      PMCID: PMC5992449          DOI: 10.1016/j.omtn.2018.03.001

Source DB:  PubMed          Journal:  Mol Ther Nucleic Acids        ISSN: 2162-2531            Impact factor:   8.886


Introduction

In the human genome, 74.7% of the sequence can be transcribed into RNA, but the total exon sequence of mRNA accounts for only 2.94%.1, 2, 3 The remaining sequence information is expressed in the form of non-coding RNA (ncRNA), which can be divided into two types: constitutive and regulatory. Small-molecule ncRNAs make up only a small proportion of these non-coding sequences; most non-coding sequences are transcribed into long ncRNA (lncRNA). Compared with mRNA, lncRNA tends to be shorter, with fewer exons (biased toward two-exon transcripts), an average abundance of about one-tenth that of mRNA, and lower sequence conservation.5, 6, 7 It has been found that lncRNA can participate in all aspects of gene expression regulation by interacting with proteins such as chromatin modification complexes and transcription factors, thus playing a fundamental role in a variety of important biological processes such as X chromosome inactivation (Xist and Tsix), gene imprinting (H19 and Air), and developmental differentiation (HOTAIR and TINCR). Although the role of ncRNA-protein interactions (ncRPIs) in the regulation of gene expression is beyond doubt, the functions and mechanisms of action of only a small number of ncRNAs have been studied. Since ncRNA functions require the coordination of protein molecules, identifying the protein molecules bound by a specific ncRNA has become the main approach to revealing the function and mechanism of that ncRNA. Large-scale RNA-binding protein (RBP) detection experiments based on biological methods have made many important advances,14, 15, 16 such as RNAcompete, HITS-CLIP, and RNA-protein complex structures, which provide valuable information about RNA-protein interactions (RPIs); however, these experimental methods are still time-consuming and expensive (for example, determining a complex structure experimentally is costly).
These high-throughput technologies also require much laborious hand-tuning of putative binding sequences. Many studies suggest that sequences alone carry enough information to predict RPIs. Sequence-homology-based methods help to detect the binding domains of proteins and their possible functions,20, 21, 22, 23, 24 but they lack the ability to determine whether a given pair of RNA and protein actually interacts. There is thus an urgent need for an accurate computational approach to predicting RPIs. In recent years, computational prediction of the interaction partners between proteins and RNAs has attracted considerable research effort.15, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 Pancaldi et al.36, 37 trained a random forest (RF) and a support vector machine (SVM) to classify whether an RNA-protein pair interacts, using >100 different sources of features extracted from genomic context, structure, or localization. RPISeq was introduced by Muppirala et al., who also applied RF and SVM classifiers using simple 4-mer features of RNA and 3-mer features of proteins, respectively. Thereafter, lncPro trained on three types of physiochemical properties using a Fisher linear discriminant. Zhou et al. presented RPI-Pred, a new SVM-based approach that takes both sequence and structure information into consideration to predict ncRPIs. Some of the methods above rely on hand-crafted features of RNA-protein pairs, which may distort the real distribution of the data and require strong domain knowledge. Other researchers extracted weakly discriminative features from noisy sequences, although they mainly drew information from the extracted sequences.21, 25, 26 General machine learning methods may not mine the hidden regular patterns from such noise well. Thus, efficient features and advanced models play an important role in the computational prediction of RPIs.40, 41, 42, 43 In this study, we propose a powerful solution to these challenges.
Ours is a sequence-based approach to predicting ncRPIs that uses deep learning in conjunction with an RF classifier. More specifically, RNA sequences are first converted into a k-mers sparse matrix, which retains almost all nucleotide composition and order information. Then singular value decomposition (SVD) is used to extract a feature vector for each sequence. For protein sequences, a pseudo-Zernike moment (PZM) descriptor is used to extract evolutionary information from the position-specific scoring matrix (PSSM).42, 46 A stacked auto-encoder is then employed to automatically learn hidden high-level features from the above-mentioned features. Finally, these representative features are fed into RF classifiers to predict RPIs. To further improve the robustness and accuracy of our method, extra layers are employed to integrate the different predictors. In the experiments, the proposed method was evaluated on three benchmark datasets, RPI488, RPI1807, and RPI2241, and compared with other state-of-the-art methods, such as lncPro, RPISeq-RF, RPI-Pred, and IPMiner. The experimental results showed that our method achieves much better prediction performance on these datasets.

Results and Discussion

In this study, we propose a deep learning method named RPI-SAN, which conjoins the stacked auto-encoder network (SAN) with RF classifiers and uses the PSSM with the Zernike moment and a k-mers sparse matrix with SVD to predict ncRNA-protein interactions. First, we evaluate its ability to predict RPIs on the RPI2241 dataset. We then compare RPI-SAN with other state-of-the-art methods on different datasets to demonstrate the effectiveness and robustness of our approach, and predict ncRPIs on different datasets using the trained model. Furthermore, we present a case study showing, with specific examples, how RPI-SAN advances the study of potential RPIs. Finally, we summarize, analyze, and discuss our method.

Evaluation of RPI-SAN’s Capability to Predict RPIs

We first test our RPI-SAN approach to evaluate its capability to predict RPIs on the RPI2241 dataset. The details listed in Tables S2 and S3 are as follows: the mean accuracy of 5-fold cross-validation is 90.77%, the mean sensitivity is 86.17%, the mean specificity is 97.37%, the mean precision is 84.05%, and the Matthews correlation coefficient (MCC) is 82.27%, with respective SDs of 0.52%, 0.81%, 1.71%, 1.26%, and 1.25%. Table S2 shows the details of the 5-fold cross-validation performed by RPI-SAN on the RPI2241 dataset, with the area under the receiver operating characteristic curve (AUC) reaching 0.962, as shown in Figure 1. Our method achieves the best performance on the RPI2241 dataset among all compared methods.
Figure 1

Prediction Performance Comparison Between SA-FT-RF, SA-RF, RPISeq-RF, Average Assembling, and Stacked Assembling on ncRNA-Protein Dataset RPI2241

Our method stacks three separate predictors: a stacked auto-encoder with fine-tuning (SA-FT-RF), a stacked auto-encoder with RF (SA-RF), and RPISeq with RF (RPISeq-RF), each of which performs differently on different data. The stacked auto-encoder performs well in accuracy and specificity, while RPISeq-RF excels in precision and sensitivity. This indicates that individual predictors have limited adaptability, so it is necessary to integrate them to leverage their complementary strengths. On the RPI2241 dataset, our RPI-SAN method performs much better than the other predictors. As shown in Table S3, RPI-SAN achieves an accuracy of 90.77%, sensitivity of 86.17%, specificity of 97.37%, precision of 84.05%, MCC of 82.27%, and AUC of 0.962, making it the best of the four contrasted predictors. SA-RF achieves an accuracy of 63.71%, sensitivity of 64.75%, specificity of 61.72%, precision of 65.74%, and MCC of only 27.49%. The accuracy, sensitivity, specificity, precision, and MCC of RPISeq-RF are 63.96%, 64.83%, 62.59%, 65.37%, and 27.98%, respectively, and those of SA-FT-RF are 90.52%, 87.71%, 94.78%, 86.18%, and 81.56%. lncPro achieves an accuracy of 65.4%, sensitivity of 65.9%, specificity of 64.0%, precision of 66.9%, and MCC of 31.0%, performing considerably worse than RPI-SAN. lncPro also has some practical disadvantages: it can only handle protein sequences longer than 30 residues, failing on shorter ones, and because it uses RNAsubopt to predict RNA structure, which takes a long time, especially for long sequences, it processes only the first 4,095 nucleotides of longer RNA sequences. These are the reasons why lncPro is not included among our stacked predictors.

Comparison between Different Assembling Strategies

In our RPI-SAN method, we use stacked assembling to integrate the different classifiers. Here, we compare it with other common strategies, such as majority voting and averaging. As the results in Figure 1 show, stacked assembling attains an AUC of 0.962 on the RPI2241 dataset, better than the averaging method and each individual classifier. Logistic regression learns different weights for the stacked auto-encoder, the stacked auto-encoder with fine-tuning, and RPISeq-RF from the raw sequence features, which is more robust and flexible than simple averaging and lets different predictors play different roles in producing the final result. Stacked assembling improves the final prediction to differing degrees. On the RPI488 and RPI1807 datasets, the three predictors produce more similar outputs than on the RPI2241 dataset, meaning a stronger correlation between them; stacked assembling therefore still improves the AUC on RPI488 and RPI1807, but by less than on RPI2241, where the predictors are less correlated. In short, stacked assembling is effective for improving the final performance, and it is most significant on datasets where the individual predictors are less correlated.

Comparison with Other Methods

To verify the effectiveness and robustness of RPI-SAN, we compare it with other state-of-the-art methods on the same datasets. Here we selected RPI-Pred from the study by Suresh et al. and RPISeq-RF from the study by Muppirala et al. (because RPISeq-RF outperforms RPISeq-SVM in that study), as well as IPMiner from the study by Pan et al. and lncPro from the study by Lu et al. Since these methods were not all evaluated under the same criteria, we only compare results obtained with the same evaluation protocols on the same datasets. As shown in Table 1 and Figure S1, on the RPI488 dataset our method performs a little better than every other method, with an accuracy of 89.7%, sensitivity of 94.3%, specificity of 83.7%, precision of 95.2%, MCC of 79.3%, and AUC of 0.92, optimal on every measure. For the RPI1807 dataset, all methods except RPI-Pred perform very well, with accuracy and AUC greater than 95% (shown in Figure S2). Our method also performs well here; although its accuracy is not the best, it still attains a high accuracy of 96%, and in terms of specificity and the important AUC measure it is outstanding, achieving an AUC of 0.999. For the RPI2241 dataset, before our proposed RPI-SAN, most methods did not work very well, especially in terms of accuracy, MCC, and AUC. Compared with the best previously published methods, RPI-SAN improved the accuracy by almost 7%, specificity by more than 16%, MCC by over 17%, and AUC by more than 6%.
Table 1

Comparing RPI-SAN with Other Methods on the RPI488, RPI1807, and RPI2241 Datasets

Datasets | Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | MCC (%) | AUC
RPI488 | IPMiner | 89.1 | 93.9 | 83.1 | 94.5 | 78.4 | 0.914
RPI488 | RPISeq-RF | 88.0 | 92.6 | 82.2 | 93.2 | 76.2 | 0.903
RPI488 | lncPro | 87.0 | 90.0 | 82.7 | 91.0 | 74.0 | 0.901
RPI488 | RPI-SAN | 89.7^a | 94.3^a | 83.7 | 95.2^a | 79.3^a | 0.920^a
RPI1807 | RPI-Pred | 93.0 | 95.0 | N/A | 94.0 | N/A | 0.97
RPI1807 | IPMiner | 98.6^a | 98.2^a | 99.3 | 97.8^a | 97.2^a | 0.998
RPI1807 | RPISeq-RF | 97.3 | 96.8 | 98.4 | 96.0 | 94.6 | 0.996
RPI1807 | lncPro | 96.9 | 96.5 | 98.1 | 95.5 | 93.8 | 0.994
RPI1807 | RPI-SAN | 96.1 | 93.6 | 99.9 | 91.4 | 92.4 | 0.999^a
RPI2241 | RPI-Pred | 84.0 | 78.0 | N/A | 88.0^a | N/A | 0.89
RPI2241 | IPMiner | 82.4 | 83.3 | 81.2 | 83.6 | 65.0 | 0.906
RPI2241 | RPISeq-RF | 63.96 | 64.83 | 62.59 | 65.37 | 27.98 | 0.690
RPI2241 | lncPro | 65.4 | 65.9 | 64.0 | 66.9 | 31.0 | 0.722
RPI2241 | RPI-SAN | 90.77^a | 86.17^a | 97.37^a | 84.05 | 82.27^a | 0.962^a

^a This measure of performance is the best among the compared methods for the individual dataset.


Predicting ncRPIs Using RPI-SAN

To further validate the ability of RPI-SAN to predict interactions between ncRNAs and proteins, we use the RPI488 dataset to train the deep learning model and verify it on the NPInter v2.0 dataset; there is no overlap between the two datasets. The 10,412 interaction pairs in NPInter v2.0 span six organisms, and we conducted experiments on each separately. The results are shown in Table 2. RPI-SAN correctly predicts 6,928, 29, 90, 897, 2,153, and 177 interaction pairs for Homo sapiens, Caenorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisiae, Mus musculus, and Escherichia coli, with accuracies of 99.33%, 80.56%, 98.90%, 98.56%, 97.95%, and 87.62%, respectively. In total, RPI-SAN correctly predicts 10,274 ncRNA-protein pairs, an overall accuracy of 98.67% on the independent NPInter v2.0 dataset.
Table 2

Predicted Performance of the RPI488 Trained Model on NPInter v2.0 Dataset

Organism | Number of Interaction Pairs | Predicted Number of Interaction Pairs | Accuracy (%)
Homo sapiens | 6,975 | 6,928 | 99.33
Caenorhabditis elegans | 36 | 29 | 80.56
Drosophila melanogaster | 91 | 90 | 98.90
Saccharomyces cerevisiae | 910 | 897 | 98.56
Mus musculus | 2,198 | 2,153 | 97.95
Escherichia coli | 202 | 177 | 87.62
Total | 10,412 | 10,274 | 98.67

Case Study: Potential RPIs of the Top-15 Ranks Verified from Database

After evaluating the effectiveness and robustness of the proposed model, we calculate the interaction probability for potential RNA-protein pairs in the Homo sapiens dataset. The training data do not overlap with the testing data. Predicted RNA-protein pairs with high probability are considered potential interacting pairs and are further verified against Gene Ontology. As shown in Table 3, 15 interacting RNA-protein pairs were ultimately confirmed. Note that high-ranked interactions that have not yet been reported may also exist in reality. Based on these results, we anticipate that the proposed model is feasible for predicting new RPIs.
Table 3

Confirmed RNA-Protein Interactions with High Ranks in the Dataset of Homo sapiens

Protein ID | RNA ID | Probability
HNRNPA1 | EPB41 | 0.867
TARDBP | CFTR | 0.866
MBNL1 | DMPK | 0.863
PTBP1 | CD40LG | 0.859
SRP19 | RN7SL1 | 0.857
SRSF1 | TNNT2 | 0.856
ELAVL4 | MYCN | 0.853
ELAVL2 | ID1 | 0.851
HNRNPC | CSF2 | 0.848
HNRNPD | ADRB1 | 0.847
EIF5A | RNU6-1 | 0.845
HNRNPD | AGTR1 | 0.842
ELAVL3 | VEGFA | 0.838
YBX1 | CSF2 | 0.833
ZBP1 | ACTB | 0.831

Conclusions

In this study, we have proposed RPI-SAN, a computational method based on deep learning with efficient features and stacked assembling to predict RPIs. We use the PSSM and a k-mers sparse matrix to extract efficient features from proteins and RNAs, respectively; these features are then fed into the SAN with RF predictors. The presented method achieves high performance, with an accuracy of 90.77%, MCC of 82.27%, and an excellent AUC of 0.962 on the RPI2241 dataset, and RPI-SAN also performs well on other popular datasets. The experimental results show that the stacked auto-encoder can learn high-level features automatically from raw information, which is important when designing machine learning models. RPI-SAN performs well on both RNA-protein and ncRPI prediction, which demonstrates that RPI-SAN is superior to other state-of-the-art methods in several respects. Through the experiments, we also find that RPI-SAN works better on large-scale datasets than on small ones, which we will keep studying in further work. We investigated computational techniques for predicting ncRNA-protein interactions because they are more convenient and rapid than traditional laborious experiments and can accurately predict potential ncRNA-protein interacting pairs, providing reliable guidance for further biological research.

Materials and Methods

Construction of Datasets

To evaluate the effectiveness and robustness of our approach, we conducted experiments on four different benchmark datasets: RPI488, RPI1807, RPI2241, and NPInter v2.0. The RPI488 dataset is a non-redundant lncRPI dataset based on structure complexes,51, 52 containing 488 lncRNA-protein pairs: 245 non-interacting pairs and 243 interacting pairs. It is smaller than the other RNA-protein datasets, with only 243 lncRPIs, because there are far fewer lncRNA-protein complexes in the Protein Data Bank (PDB), from which the ncRNA-protein complexes were downloaded. The RPI1807 dataset contains 1,807 positive ncRPI pairs, comprising 1,078 RNA chains and 1,807 protein chains, and 1,436 negative ncRPI pairs, comprising 493 RNA chains and 1,436 protein chains. It was established by parsing the Nucleic Acid Database (NDB), which provides RNA-protein complex data and protein-RNA interface data. The RPI2241 dataset was constructed in a similar way and contains 2,241 interacting RNA-protein pairs. NPInter v2.0 is an ncRPI dataset from a non-structure-based source, containing 10,412 ncRNA-protein pairs, 449 protein chains, and 4,636 ncRNA chains. Table 4 shows the details of the datasets used in this study.
Table 4

The Details of the ncRNA-Protein Interaction Datasets

Dataset | Interaction Pairs | Number of Proteins | Number of RNAs
RPI488 | 243 | 25 | 247
RPI1807 | 1,807 | 1,807 | 1,078
RPI2241 | 2,241 | 2,043 | 332
NPInter v2.0 | 10,412 | 449 | 4,636

RPI488 comprises lncRNA-protein interactions based on structure complexes. RPI369, RPI2241, and RPI1807 are RNA-protein interaction datasets. NPInter v2.0 and RPI13254 are ncRNA-protein interaction datasets from non-structure-based sources.


Representation of the ncRNA and Protein Sequences

To obtain highly effective features for the deep learning models, each ncRNA-protein pair is represented as a 486-feature vector, in which 256 features encode the RNA sequence and 240 features encode the protein sequence. RNAs are encoded using the k-mers sparse matrix previously proposed by Zhu-Hong et al. In this method, each RNA sequence (over the alphabet A, C, G, U) is scanned from left to right, stepping one nucleotide at a time, and each window of k consecutive nucleotides is regarded as a unit (a k-mer). For an RNA sequence of length L, there are 4^k possible k-mers and L - k + 1 k-mers appearing in the sequence, so each input RNA sequence is processed into a 4^k × (L - k + 1) sparse matrix R: the element R(i,j) is set to 1 when the k-mer starting at position j equals the ith of the 4^k possible k-mers, and 0 otherwise. In this study, k is set to 4 to process the RNA sequences (see Table S1). The 4-mer sparse matrix R is a low-rank matrix, yet almost all of the information is retained, including the frequency, position, and order information of the 4-mers (AAAA, AAAC, ..., UUUU) hidden in the RNA sequence. We then use SVD to reduce the matrix R to a feature vector. Considering that RNA and protein sequences have different structures, for protein amino acid sequences we use a more biologically informed representation, the PSSM. The PSSM, which encodes biological evolutionary information, was first used to detect distantly related proteins and has achieved great success in the prediction of protein secondary structure, protein binding sites, and disordered regions. A PSSM is an L × 20 matrix, where L is the length of the input protein sequence and 20 is the number of native amino acids.
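As a concrete sketch of the k-mers sparse-matrix encoding described above (an illustrative reconstruction, not the authors' code; the exact SVD reduction used in the paper is not specified, so taking the leading scaled singular vector is an assumption):

```python
from itertools import product

import numpy as np

def kmer_sparse_matrix(seq, k=4):
    """Build the 4^k x (L - k + 1) k-mer indicator matrix R."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    cols = len(seq) - k + 1
    R = np.zeros((len(kmers), cols))
    for j in range(cols):
        R[index[seq[j:j + k]], j] = 1.0  # one 4-mer starts at position j
    return R

def svd_features(R):
    """Compress R with SVD; here, the first left singular vector scaled
    by its singular value, giving a 4^k-dimensional vector (assumed)."""
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    return U[:, 0] * s[0]

R = kmer_sparse_matrix("ACGUACGUACGUACGU")
features = svd_features(R)  # length 256 for k = 4
```

Each column of R is one-hot, so the matrix records which 4-mer occurs at each position, and the SVD step folds that positional information into a fixed-length vector.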
The PSSM is represented as

PSSM = [p_{i,j}], i = 1, ..., L, j = 1, ..., 20,

where p_{i,j} in the ith row represents the probability of the ith residue mutating into amino acid type j of the 20 native amino acids during evolution, as derived from multiple sequence alignments. In the experiments, we used the position-specific iterated BLAST (PSI-BLAST) tool to convert each raw protein sequence into a PSSM. We ran PSI-BLAST against the SwissProt database, with the number of iterations set to 3 and the e-value set to 0.001, to obtain the best results. Both the PSI-BLAST application and the SwissProt database can be freely downloaded from http://blast.ncbi.nlm.nih.gov/Blast.cgi. Then we extracted the PZM features from the PSSM. The PZM is widely used in the field of image processing, where it has achieved good results; it can extract features from a matrix robustly and with little information redundancy. We set the required PZM order parameters to n, m = 30. Finally, a feature vector is obtained for each protein sequence.
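A hedged sketch of the PSSM step: the assumed PSI-BLAST invocation is shown in the comments, and a small illustrative parser reads the resulting ASCII PSSM into an L × 20 array (the demo scores below are fabricated for illustration only):

```python
# Assumed command line for generating the PSSM with the settings above:
#   psiblast -query protein.fasta -db swissprot -num_iterations 3 \
#            -evalue 0.001 -out_ascii_pssm protein.pssm
import numpy as np

def parse_ascii_pssm(text):
    """Parse a PSI-BLAST ASCII PSSM into an L x 20 score matrix."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with the residue index, then the residue letter,
        # then at least 20 substitution scores.
        if len(parts) >= 22 and parts[0].isdigit():
            rows.append([int(v) for v in parts[2:22]])
    return np.array(rows)

# Tiny synthetic example (two residues, made-up scores):
demo = """
Last position-specific scoring matrix computed
           A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    1 M   -1  0  1 -2  0  1 -1  0  0  1  2  0  6 -1 -2  0  0 -1 -1  0
    2 K   -1  2  0 -1 -3  1  1 -2  0 -2 -2  5 -1 -3 -1  0  0 -3 -2 -2
"""
pssm = parse_ascii_pssm(demo)
```

The PZM descriptor would then be computed from this L × 20 matrix; that step is omitted here because its implementation details are not given in the text.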

SAN

Deep learning, as a powerful vehicle, has been widely used in different areas19, 22, 23, 43, 55, 56 and has received great attention in the field of ncRPI prediction. Among the several deep-learning architectures, the SAN is the most appropriate for our needs. The stacked auto-encoder has almost all of the advantages of a deep neural network (DNN) and has outstanding expressive ability; it is usually able to obtain "hierarchical grouping" and "partial-global decomposition" features of the raw data. Since the stacked auto-encoder tends to represent the original input data effectively, we use the auto-encoder as the component element of a DNN with multiple layers.44, 55, 58 The SAN is composed of multiple sparse auto-encoder layers, with the output of each layer serving as the input of the next, as shown in Figure 2. With hyperparameter optimization, we obtained the best parameters for the stacked auto-encoder network. The sparse auto-encoder network is constructed as in Figure 3: the reconstruction error measures the difference between the reconstructed data and the input, while the sparsity penalty imposes a regularity constraint on the hidden activations, forcing the majority of each layer's node activations to be 0 and only a few to be non-zero.
Figure 2

The Construction of Stacked Auto-Encoder Network

Figure 3

The Construction of Sparse Auto-Encoder

The input x is a d-dimensional vector, and the auto-encoder network maps the input X into the output h(X):

h(X) = f(WX + b),

where f is the activation function. When we select the sigmoid as the activation function,

f(z) = 1 / (1 + e^(-z)),

the loss function is

L(X, z) = ||X - z||^2,

where z is the reconstruction of X. Each layer of the neural network includes a certain number of neurons, and the multilayer network is built by stacking sequentially connected layers, with the output of each layer serving as the input of the next:

h^(l+1) = f(W^(l) h^(l) + b^(l)), with h^(1) = X.

Here h^(n), the activation value of the deepest hidden units, is a higher-order representation of the input. By using h^(n) as the input feature of a softmax classifier, the features learned by the deep auto-encoder network can be used for classification problems. We use stochastic gradient descent (SGD) to optimize the reconstruction error between X and z, measured by the squared error. Stacking multiple auto-encoders yields a stacked auto-encoder, a DNN that can learn high-level features automatically.60, 61 To obtain better performance, we use greedy layer-wise learning, which trains each layer individually to optimize its objective function when learning the stacked auto-encoder parameters. In our network, we use two types of layers, fully connected and dropout; the dropout layer sets some node activations to 0 with a certain probability to avoid over-fitting during model training.
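The encode-decode-reconstruct loop above can be sketched in numpy as a minimal single-layer auto-encoder trained by full-batch gradient descent (an illustration only; the paper's keras network is deeper and adds sparsity penalties, dropout, and fine-tuning):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data and one auto-encoder layer: encode h = f(W1 x + b1),
# decode z = f(W2 h + b2), minimize ||x - z||^2.
d, hidden, n = 16, 8, 64
X = rng.random((n, d))
W1, b1 = rng.normal(0, 0.1, (d, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(0, 0.1, (hidden, d)), np.zeros(d)

lr = 0.5
losses = []
for _ in range(200):
    H = sigmoid(X @ W1 + b1)            # encoder activations
    Z = sigmoid(H @ W2 + b2)            # reconstruction
    losses.append(float(np.mean((X - Z) ** 2)))
    dZ = 2 * (Z - X) / n * Z * (1 - Z)  # backprop through squared error
    dH = dZ @ W2.T * H * (1 - H)
    W2 -= lr * (H.T @ dZ); b2 -= lr * dZ.sum(axis=0)
    W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0)
```

Stacking simply feeds H into another such layer, which is the greedy layer-wise training the text describes.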
We also add an extra softmax layer for fine-tuning, with the ReLU function as the activation for the outputs of the conjoined multilayer networks of RNA and protein at the last hidden layer; this layer is trained with the real label information to update the weight and bias parameters of the SAN.63, 64 We then use SGD (with different learning rates and momentums for different datasets) to minimize the cross-entropy loss function, and Adam to minimize the mean squared error for each de-noising auto-encoder layer; the dropout probability is set to 0.5 during model training.65, 66 In this study, we use the keras library to implement the stacked auto-encoder and set both batch_size and nb_epoch to 100. Details about keras can be found at http://github.com/fchollet/keras.

Stacked Assembling

Ordinarily, different classifiers perform differently on different datasets; in fact, no single classifier can adapt to all kinds of datasets. An extra stacked-assembling layer is therefore used in our deep learning network to integrate the outputs of the individual classifiers and approximate the optimal target function. Previous works have proposed majority voting and averaging of individual classifier outputs. In our study, following the deep-learning intuition of multilayer neural networks, we define the operating mechanism as follows: the level-0 classifiers' outputs are fed into the level-1 classifier as training data, where level 0 is the original layer and level 1 the next sequential layer. In our network, the outputs of the level-0 classifiers are their predicted probability scores, and the level-1 classifier is a logistic regression:

P(y = 1 | p) = 1 / (1 + e^(-(w^T p + b))),

where p is the vector of probability scores output by the individual classifiers and w is the weight vector, with one weight per classifier. When the logistic regression weight for each individual classifier is the same, the scheme degenerates to simple averaging; when only one weight is non-zero, it behaves more like a majority voting method. The logistic regression implementation is from Scikit-learn.
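A minimal numpy sketch of this stacking step: level-0 probability scores from three synthetic base predictors are fed to a level-1 logistic regression trained by gradient descent (the scores and labels here are fabricated; the paper uses Scikit-learn's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic level-0 outputs: three predictors of varying quality, each
# emitting a probability-like score correlated with the true label.
n = 500
y = rng.integers(0, 2, n)
P = np.column_stack([np.clip(y + rng.normal(0, s, n), 0.0, 1.0)
                     for s in (0.3, 0.5, 0.9)])

# Level-1: logistic regression sigma(w^T p + b), one weight per predictor.
w, b = np.zeros(3), 0.0
lr = 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(P @ w + b)))
    grad = (p - y) / n                  # cross-entropy gradient
    w -= lr * (P.T @ grad)
    b -= lr * grad.sum()

pred = (1.0 / (1.0 + np.exp(-(P @ w + b))) > 0.5).astype(int)
accuracy = float((pred == y).mean())
```

The learned weights play the role described in the text: equal weights reduce to averaging, while unequal weights let the more reliable predictor dominate.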

Performance Evaluation

In this study, we trained the deep learning model to classify whether an ncRNA and a protein interact. The 5-fold cross-validation method is used to evaluate performance: all data are randomly divided into five equal parts; in each round, one part is taken as the testing set and the other four as the training set. The testing and training data do not overlap, guaranteeing an unprejudiced comparison. We take the averages and SDs of these results as the final validation result. We follow the widely used evaluation measures, including accuracy (Acc.), sensitivity (Sen.), specificity (Spec.), precision (Prec.), and MCC, defined as

Acc. = (TP + TN) / (TP + TN + FP + FN),
Sen. = TP / (TP + FN),
Spec. = TN / (TN + FP),
Prec. = TP / (TP + FP),
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TN indicates the number of correctly predicted negatives, TP the number of correctly predicted positives, FN the number of wrongly predicted negatives, and FP the number of wrongly predicted positives. The receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) are also used to evaluate classifier performance.
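The evaluation measures above can be written out directly (standard definitions; the confusion counts in the example are arbitrary):

```python
import math

def metrics(TP, TN, FP, FN):
    """Standard binary-classification measures from confusion counts."""
    acc = (TP + TN) / (TP + TN + FP + FN)
    sen = TP / (TP + FN)        # sensitivity (recall)
    spec = TN / (TN + FP)       # specificity
    prec = TP / (TP + FP)       # precision
    mcc = (TP * TN - FP * FN) / math.sqrt(
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return acc, sen, spec, prec, mcc

# Arbitrary example confusion counts:
acc, sen, spec, prec, mcc = metrics(TP=90, TN=85, FP=15, FN=10)
```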

Author Contributions

H-C.Y. and Z-H.Y. conceived the algorithm, carried out analyses, prepared the data sets, carried out experiments, and wrote the manuscript. D-S.H., X.L., T-H.J., and L-P.L. wrote the manuscript and analyzed experiments. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.
References

1.  The Protein Data Bank.

Authors:  H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  On position-specific scoring matrix for protein function prediction.

Authors:  Jong Cheol Jeong; Xiaotong Lin; Xue-Wen Chen
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2011 Mar-Apr       Impact factor: 3.710

3.  Towards better accuracy for missing value estimation of epistatic miniarray profiling data by a novel ensemble approach.

Authors:  Xiao-Yong Pan; Ye Tian; Yan Huang; Hong-Bin Shen
Journal:  Genomics       Date:  2011-03-21       Impact factor: 5.736

4.  Predicting protein-protein interactions from protein sequences by a stacked sparse autoencoder deep neural network.

Authors:  Yan-Bin Wang; Zhu-Hong You; Xiao Li; Tong-Hai Jiang; Xing Chen; Xi Zhou; Lei Wang
Journal:  Mol Biosyst       Date:  2017-06-27

5.  PSPEL: In Silico Prediction of Self-Interacting Proteins from Amino Acids Sequences Using Ensemble Learning.

Authors:  Jian-Qiang Li; Zhu-Hong You; Xiao Li; Zhong Ming; Xing Chen
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2017-01-10       Impact factor: 3.710

6.  Tsix, a gene antisense to Xist at the X-inactivation centre.

Authors:  J T Lee; L S Davidow; D Warshawsky
Journal:  Nat Genet       Date:  1999-04       Impact factor: 38.330

7.  The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression.

Authors:  Thomas Derrien; Rory Johnson; Giovanni Bussotti; Andrea Tanzer; Sarah Djebali; Hagen Tilgner; Gregory Guernec; David Martin; Angelika Merkel; David G Knowles; Julien Lagarde; Lavanya Veeravalli; Xiaoan Ruan; Yijun Ruan; Timo Lassmann; Piero Carninci; James B Brown; Leonard Lipovich; Jose M Gonzalez; Mark Thomas; Carrie A Davis; Ramin Shiekhattar; Thomas R Gingeras; Tim J Hubbard; Cedric Notredame; Jennifer Harrow; Roderic Guigó
Journal:  Genome Res       Date:  2012-09       Impact factor: 9.043

8.  In silico characterization and prediction of global protein-mRNA interactions in yeast.

Authors:  Vera Pancaldi; Jürg Bähler
Journal:  Nucleic Acids Res       Date:  2011-04-01       Impact factor: 16.971

9.  PCVMZM: Using the Probabilistic Classification Vector Machines Model Combined with a Zernike Moments Descriptor to Predict Protein-Protein Interactions from Protein Sequences.

Authors:  Yanbin Wang; Zhuhong You; Xiao Li; Xing Chen; Tonghai Jiang; Jingting Zhang
Journal:  Int J Mol Sci       Date:  2017-05-11       Impact factor: 5.923

10.  starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data.

Authors:  Jun-Hao Li; Shun Liu; Hui Zhou; Liang-Hu Qu; Jian-Hua Yang
Journal:  Nucleic Acids Res       Date:  2013-12-01       Impact factor: 16.971

