Hui Zhang, Yanchun Liang, Siyu Han, Cheng Peng, Ying Li.
Abstract
Long non-coding RNAs (lncRNAs) are non-coding RNAs longer than 200 nucleotides, and they have attracted considerable attention in recent decades. Many studies have confirmed that lncRNAs have an important influence on post-transcriptional gene regulation; for example, lncRNAs affect the stability and translation of splicing factor proteins. Mutations and malfunctions of lncRNAs are closely related to human disorders. Because lncRNAs interact with a variety of proteins, predicting lncRNA–protein interactions is an important way to explore the functions of lncRNAs in depth and to enrich their annotations. Experimental approaches for identifying lncRNA–protein interactions are expensive and time-consuming. Computational approaches to predicting lncRNA–protein interactions fall into two broad categories. The first category is based on sequence, structural information and physicochemical properties. The second category is based on network methods, which fuse heterogeneous data to construct an lncRNA-related heterogeneous network. Network-based methods can capture the implicit feature information in the topological structure of biological heterogeneous networks containing lncRNAs, which sequence-based methods often ignore. In this paper, we summarize and discuss the materials, the interaction score calculation algorithms, and the advantages and disadvantages of state-of-the-art network-based algorithms for lncRNA–protein interaction prediction, to help researchers select a suitable method and obtain more dependable results. All the related network data have also been collected and processed for the convenience of users and are available at https://github.com/HAN-Siyu/APINet/.
Keywords: biological network science; computational model; lncRNA–protein interaction prediction; machine learning
Year: 2019 PMID: 30875752 PMCID: PMC6471543 DOI: 10.3390/ijms20061284
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Description of lncRNA relevant databases.
| Database | Description | Availability |
|---|---|---|
| ncRNA databases (especially lncRNAs): | | |
| NONCODE | Comprehensive knowledge database of non-coding RNAs, including lncRNAs from 17 species, and predicted/validated lncRNA–disease relationships. | |
| MNDR | Database of ncRNA–disease associations in mammals. | |
| deepBase | Database for identification, expression, evolution and function of small RNAs, lncRNAs and circular RNAs from deep-sequencing data. | |
| NRED | Database integrating annotated human and mouse ncRNA expression data from various resources. | |
| ChIPBase | Database on the transcriptional regulation of ncRNAs based on ChIP-sequencing data. | |
| SomamiR | Database of cancer somatic mutations that alter microRNA–ceRNA interactions. | |
| LncRNA2Function | Functional annotations and expression profiles (RNA-seq) of human lncRNAs. | |
| LincSNP | A database linking disease-related SNPs to human lncRNAs. | |
| LncRNA-SNP | A database of SNPs in lncRNAs and their predicted effects in human and mouse. | |
| LNCipedia | A database of annotated human lncRNA transcript sequences and structures. | |
| ALDB | A farm livestock lncRNA database. | |
| lncRNAtor | A database for functional investigation of lncRNAs that encompasses annotation, sequence analysis, gene expression, protein binding and phylogenetic conservation. | |
| Co-LncRNA | A web server describing the effects of lncRNAs on GO functions and KEGG pathways based on co-expressed genes. | |
| Lnc2Cancer | A database of experimentally validated associations between lncRNAs and cancers. | |
| LncRNADisease | A database of experimentally validated lncRNA-associated diseases. | |
| lncRNAMap | A map of putative regulatory functions in the long non-coding transcriptome. | |
| TANRIC | A web resource for interactive exploration of lncRNAs in cancer. | |
| LncRNA ontology | A web resource for inferring lncRNA functions based on chromatin states and expression patterns. | |
| LNCediting | A database of the functional effects of RNA editing in lncRNAs. | |
| LncBase | A database of interactions between miRNAs and lncRNAs. | |
| TF2LncRNA | A web resource for identifying common transcription factors for a list of lncRNA genes. | |
| LncSubpathway | A web server for identifying dysfunctional subpathways associated with risk lncRNAs. | |
| LncRNA2Target | A database of differentially expressed genes after lncRNA knock-down or overexpression. | |
| LncReg | A reference resource for lncRNA-associated regulatory networks. | |
| lncRNAdb | An annotation database of eukaryotic lncRNAs. | |
| Databases of proteins or microRNAs that may be associated with lncRNAs: | | |
| NPInter | Database of noncoding RNA-associated interactions. | |
| PRIDB | Comprehensive database of RNA–protein interfaces extracted from complexes in the PDB. | |
| PDB | A database of experimentally determined three-dimensional structures of proteins, nucleic acids and other biomolecules. | |
| StarBase v2.0 | A database of experimentally supported interactions among RBPs, mRNAs, miRNAs, RNAs, proteins and so on. | |
| Nucleic acid database (NDB) | A database of three-dimensional nucleic acid structures and their complexes, with geometric data and structure information. | |
Details of interactions between biomolecules and the research of lncRNA functions.
| Name | Samples | Interactions | Source |
|---|---|---|---|
| LncRNA–Disease | 804 × 288 | 1454 | LncRNADisease |
| LncRNA–LncRNA | 1114 × 1114 | 1,179,256 | LFSCM |
| LncRNA–microRNA | 1127 × 277 | 10,198 | StarBase v2.0 |
| LncRNA–Gene | 240 × 15,527 | 6186 | LncRNA2Target |
| LncRNA–GO | 240 × 6428 | 3094 | GeneRIF |
| MicroRNA–MicroRNA | 271 × 271 | 24,062 | Zhong et al. |
| MicroRNA–Disease | 1080 × 592 | 11,835 | HMDD |
| MicroRNA–Gene | 495 × 15,527 | 135,852 | miRTarBase |
| MicroRNA–Target | 495 × 15,527 | 135,852 | miRTarBase |
| Gene–Gene | 16,785 × 16,785 | 1,515,370 | Yao et al. |
| Gene–Metabolite | 12,342 × 3278 | 192,763 | Yao et al. |
| Metabolite–Metabolite | 3764 × 3764 | 74,667 | Yao et al. |
| Gene–GO | 15,527 × 6428 | 1,191,503 | GO Annotation |
| Gene–Disease | 1715 × 1886 | 2603 | DisGeNET |
| Gene–Drug | 155,275 × 8283 | 3760 | DrugBank |
| Metabolite–Disease | 388 × 149 | 664 | HMDB |
| Drug–Disease | 15,527 × 412 | 115,317 | CTD |
| Drug–Drug | 8283 × 8283 | 453,436 | DrugBank |
| Drug–Side-effects | 1430 × 5880 | 140,064 | SIDER |
| Disease–Disease | 5080 × 5080 | 20,280,092 | Yao et al. |
Comparison of the methods by the intrinsic features and classifiers they use.
| | CatRAPID | RPISeq | De novo | LncPro | RPI-Pred | rpiCOOL | IPMiner | lncADeep | |
|---|---|---|---|---|---|---|---|---|---|
|
| RNA Sequence | √ | √ | √ | √ | √ | √ | ||
| Protein Sequence | √ | √ | √ | √ | √ | √ | |||
| 3D Structure(RNA) | √ | ||||||||
| 3D Structure (protein) | √ | ||||||||
| The secondary structure (RNA) | √ | √ | |||||||
| The secondary structure(protein) | √ | ||||||||
| Hydrogen-Bonding Propensities | √ | √ | |||||||
| van der Waals’ Propensities | √ | √ | |||||||
|
| Random Forest | √ | √ | √ | |||||
| Naive Bayesian | √ | ||||||||
| Extended NB | √ | ||||||||
| SVM | √ | √ | |||||||
| Fisher’s linear | √ | ||||||||
| automatic encoder | √ | ||||||||
| deep neural network | √ | ||||||||
| √ | √ | ||||||||
| Web server or offline package | √ | √ | √ | √ | √ | √ | √ | ||
1 http://s.tartaglialab.com/page/catrapid_group (web server); 2 http://pridb.gdcb.iastate.edu/RPISeq (web server); 3 http://bioinfo.bjmu.edu.cn/lncpro/ (offline package and web server); 4 http://ctsb.is.wfubmc.edu/projects/rpi-pred (web server); 5 http://biocool.ir/softs/rpicool.html (offline package); 6 https://github.com/xypan1232/IPMiner (offline package); 7 https://github.com/cyang235/LncADeep (offline package).
Figure 1. Overview of five computational models for network-based lncRNA–protein interaction prediction, including data collection and core algorithms. The algorithm implementation of each method is enclosed in a dotted rectangle of a distinct color, and the solid lines of the same color outside the rectangle mark the data sources used by that method. Each color matches the color of the corresponding method name. Inside the dotted rectangles, the color of the solid lines distinguishes lncRNA–lncRNA, protein–protein and lncRNA–protein interactions.
Differences among the network-based methods.
| Method | Type | Dataset | Algorithm | AUC |
|---|---|---|---|---|
| LPBNI | LPI | 4870 lncRNA–protein interactions from NPInter database (2380 lncRNAs and 106 proteins) | Bipartite Network | 0.8780 |
| | PPI | × | | |
| | LLI | × | | |
| Yang et al. | LPI | 4883 lncRNA–protein interactions from NPInter database (1116 lncRNAs and 99 proteins) | A random walk model HeteSim | 0.7972 |
| | PPI | 1608 protein–protein interactions from STRING database | | |
| | LLI | × | | |
| LPIHN | LPI | 10232 lncRNA–protein interactions from NPInter database (1113 lncRNAs and 99 proteins) | Random Walk with Restart | 0.8839 |
| | PPI | 804 protein–protein interactions from STRING database | | |
| | LLI | lncRNA expression similarity from NONCODE 4.0 database (1113 lncRNA expression profiles) | | |
| Zheng et al. | LPI | 4467 lncRNA–protein interactions from NPInter database (1050 lncRNAs and 84 proteins) | SNF; A random walk model HeteSim | 0.9068 |
| | PPI | Sequence similarity from UniProt database; | | |
| | LLI | × | | |
| PLPIHS | LPI | lncRNA–protein interactions from GENCODE Release 24 (15941 lncRNAs and 20284 proteins) | SVM; A random walk model HeteSim | |
| | PPI | Protein–protein interactions from STRING database | | |
| | LLI | lncRNA co-expression similarity from NONCODE database (lncRNA expression profiles) | | |
Bold indicates the best AUC value. We found that a method performs better when its heterogeneous network is composed of more data sources. When heterogeneous networks are constructed from the same sources, those built from weighted networks perform better. 1 https://github.com/USTC-HIlab/LPBNI (offline package); 2 https://github.com/cyang235/LncADeep (offline package); 3 lncRNA–protein interactions; 4 protein–protein interactions; 5 lncRNA–lncRNA interactions; 6 A relevance search based on a random walk in a heterogeneous network, used to evaluate the relevance between an lncRNA and a protein; a large relevance score means a high possibility that the lncRNA and the protein interact [94]. 7 Similarity Network Fusion: a nonlinear, message-passing-based method that iteratively updates each network and makes it increasingly similar to the others [95].
Figure 2. Framework of LPBNI, which comprises four modules: (1) Data collection: the lncRNA–protein interaction network is built from NPInter and NONCODE. (2) Bipartite network construction (a toy example is shown in Figure 1). (3) Two-step propagation on the bipartite network: (A) The initial information is propagated from proteins to their direct neighbor lncRNAs; for example, the initial information of the three proteins is 1, 1 and 0, respectively. (B) The score on each red circle is the information that the lncRNA received from proteins. (C) The information is propagated from lncRNAs back to proteins; the score on each blue hexagon is the final information of that protein after the two-step propagation. Red circles represent lncRNAs and blue hexagons represent proteins. (4) Model validation based on leave-one-out cross validation (LOOCV), the area under the receiver operating characteristic curve (AUC) and the Matthews correlation coefficient (MCC).
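The two-step propagation in module (3) can be sketched as follows. This is a minimal illustration, not the LPBNI implementation: it assumes that each node splits its information equally among its neighbors (the usual resource-allocation rule on bipartite networks), and the function name `two_step_propagation` and the toy adjacency matrix are hypothetical.

```python
def two_step_propagation(adjacency, initial):
    """Propagate protein information to lncRNAs and back on a bipartite network.

    adjacency[i][j] is 1 if lncRNA i is known to interact with protein j, else 0.
    initial[j] is the starting information placed on protein j.
    Assumption: every node splits its information equally among its neighbors.
    """
    n_lnc, n_prot = len(adjacency), len(adjacency[0])
    prot_degree = [sum(adjacency[i][j] for i in range(n_lnc)) for j in range(n_prot)]
    lnc_degree = [sum(row) for row in adjacency]

    # Step 1: each protein sends an equal share to each neighboring lncRNA.
    lnc_score = [
        sum(adjacency[i][j] * initial[j] / prot_degree[j]
            for j in range(n_prot) if prot_degree[j])
        for i in range(n_lnc)
    ]
    # Step 2: each lncRNA sends an equal share of what it received back to its proteins.
    final = [
        sum(adjacency[i][j] * lnc_score[i] / lnc_degree[i]
            for i in range(n_lnc) if lnc_degree[i])
        for j in range(n_prot)
    ]
    return lnc_score, final


# Hypothetical toy network: 3 lncRNAs (rows) x 3 proteins (columns).
toy_adjacency = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
]
# Initial protein information 1, 1 and 0, as in the caption's example.
lnc_score, final = two_step_propagation(toy_adjacency, [1, 1, 0])
print(lnc_score)  # information held by each lncRNA after step 1
print(final)      # information held by each protein after step 2
```

In this toy network the third protein starts with no information but ends with a positive score, which is how the propagation ranks candidate partners that were not initially marked.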
Figure 3. Framework of the method proposed by Zheng et al. [32], which comprises four modules. (1) (A) Data collection: the lncRNA–protein network is from NPInter and NONCODE; datasets from the UniProt, GO, Pfam and STRING databases are collected for constructing the protein–protein similarity network. (B) Protein–protein similarity network construction: multi-source information is integrated with the similarity network fusion (SNF) algorithm. (2) Heterogeneous network construction. (3) HeteSim computation on the heterogeneous network. (4) Model validation based on LOOCV and AUC.
Figure 4. Pipeline of the method proposed by Yang et al. [33]. (1) Data collection: lncRNA–protein interactions from NPInter and NONCODE and protein–protein interactions from the STRING database. (2) HeteSim computation based on relevance paths of the heterogeneous network for lncRNA–protein interaction prediction. (3) Model validation based on LOOCV and AUC.
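To make the HeteSim step concrete, the sketch below scores one lncRNA–protein pair along a single relevance path, lncRNA → protein → protein. It is an illustration under stated assumptions rather than the authors' code: it covers only this even-length path (the score is the cosine between the lncRNA's forward walk distribution and the protein's backward walk distribution over the middle proteins), and the function names and toy matrices are hypothetical.

```python
import math


def normalize(vector):
    """Scale a non-negative vector to sum to 1 (all zeros stay zeros)."""
    total = sum(vector)
    return [v / total if total else 0.0 for v in vector]


def hetesim_lpp(lnc_prot, prot_prot, lnc, prot):
    """HeteSim-style relevance of (lnc, prot) along the path lncRNA -> protein -> protein.

    lnc_prot[i][m]  : weight of the known interaction between lncRNA i and protein m.
    prot_prot[m][t] : weight of the protein-protein edge from protein m to protein t.
    """
    # Forward half: one random-walk step from the lncRNA to the middle proteins.
    left = normalize(lnc_prot[lnc])
    # Backward half: one random-walk step from the target protein to the middle proteins.
    right = normalize([prot_prot[m][prot] for m in range(len(prot_prot))])
    dot = sum(a * b for a, b in zip(left, right))
    norm = math.sqrt(sum(a * a for a in left)) * math.sqrt(sum(b * b for b in right))
    return dot / norm if norm else 0.0


# Hypothetical toy data: 2 lncRNAs, 3 proteins.
toy_lnc_prot = [
    [1, 0, 0],
    [0, 1, 1],
]
toy_prot_prot = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
for protein in range(3):
    print(protein, hetesim_lpp(toy_lnc_prot, toy_prot_prot, 0, protein))
```

A larger value means the two walks meet on the same intermediate proteins more often, which matches the idea that a large relevance score suggests a likely interaction.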
Figure 5. Pipeline of LPIHN, which comprises four modules: (1) Data collection: lncRNA–protein interactions from NPInter, protein–protein interactions from the STRING database, and an lncRNA–lncRNA similarity network computed from lncRNA expression profiles in NONCODE. (2) Heterogeneous network construction. (3) LncRNA–protein interaction prediction based on random walk with restart: a score is assigned to each candidate protein of a query lncRNA by the random walk with restart on the heterogeneous network, and the candidate proteins are ranked by these scores. (4) Model validation based on LOOCV and AUC. For LPIHN, the lncRNA–lncRNA similarity network is calculated from the lncRNA expression profiles as the Pearson correlation coefficient (PCC) of each pair of lncRNAs. The heterogeneous network is constructed by connecting the lncRNA–lncRNA similarity network and the PPI network through the known lncRNA–protein interaction network. Blue circles indicate lncRNAs, orange squares indicate proteins, blue edges indicate lncRNA–lncRNA similarities, orange edges indicate protein–protein interactions, and blue dotted edges indicate known lncRNA–protein interactions.
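The ranking step in module (3) can be illustrated with a generic random walk with restart. This sketch is a simplification and not the LPIHN implementation: it treats the heterogeneous network as one weighted graph with a column-normalized transition matrix, and the function name, the restart probability of 0.5 and the four-node toy network are assumptions made for illustration.

```python
def random_walk_with_restart(weights, seed, restart=0.5, tol=1e-12, max_iter=10000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until the scores stabilize.

    weights[i][j] is the edge weight between nodes i and j (0 if absent);
    W is the column-normalized version of this matrix; p0 puts all mass on `seed`.
    """
    n = len(weights)
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [weights[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    p0 = [0.0] * n
    p0[seed] = 1.0
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(transition[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy network with nodes L1, L2 (lncRNAs) and P1, P2 (proteins):
# L1-L2 similarity edge, L1-P1 known interaction, P1-P2 protein-protein interaction.
nodes = ["L1", "L2", "P1", "P2"]
toy_weights = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
# Query lncRNA L2 has no known partner; candidate proteins are ranked by score.
scores = random_walk_with_restart(toy_weights, seed=nodes.index("L2"))
ranked = sorted(["P1", "P2"], key=lambda name: scores[nodes.index(name)], reverse=True)
print(ranked)
```

Here the query lncRNA reaches P1 only through its similar lncRNA L1, so the similarity edges are what allow a lncRNA without known partners to receive protein rankings at all.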
Figure 6. Flowchart of PLPIHS, which comprises five modules: (1) Data collection. (2) Heterogeneous network construction, consisting of an lncRNA–lncRNA similarity network, an lncRNA–protein interaction network and a protein–protein interaction network. (3) The HeteSim measure is used to calculate a score for each lncRNA–protein pair along each path. (4) LncRNA–protein prediction based on an SVM classifier that combines the scores of the different paths. (5) Model validation based on LOOCV, AUC and MCC.
Differences in evaluation measures among the network-based methods.
| Method | Measure for the Evaluation | Test Dataset | Measurement or Illustration | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LOOCV | Precision Versus | Fold Enrichment | AUC | SPE | ACC | PRE | MCC | REC | F1-Score | SEN | ||
| LPBNI [ | √ | √ | 4870 lncRNA–protein interactions from NPInter v2.0 | 0.878 | 0.99 | 0.873 | 0.852 | 0.449 | − | − | 0.288 | |
| 0.95 | 0.880 | 0.681 | 0.534 | 0.532 | ||||||||
| Zheng et al. [ | √ | 4467 lncRPIs, including 1050 lncRNAs and 84 proteins | AUC values of 15 settings: Seqs-0.8565, Pfam-0.8459, GO-0.8584, STRING-0.7972; Seqs+Pfam-0.8689, Seqs+GO-0.8626, Seqs+STRING-0.8762, Pfam+GO-0.8677, Pfam+STRING-0.8977, and GO+STRING-0.8814; Seqs+Pfam+GO-0.8704, Seqs+Pfam+STRING-0.9023, Seqs+GO+STRING-0.8904, Pfam+GO+STRING-0.9066; Seqs+Pfam+GO+STRING-0.9068. | |||||||||
| Yang et al. [ | √ | MALAT1 with all 99 proteins | 0.955 | − | − | − | − | − | − | − | ||
| AK0951949 with all 99 proteins | 0.973 | |||||||||||
| LPIHN [ | √ | √ | √ | The test dataset is the interaction of each known lncRNA–protein, and the rest is used as training dataset. | 0.96 | √ | √ | √ | √ | √ | √ | |
| PLPIHS [ | √ | The remaining positive samples found in the 0.9 network had 2000 lncRNA–protein interactions and the same number of negative interactions in the 0.3 network | 0.879 | − | √ | √ | √ | √ | √ | |||
LOOCV, leave-one-out cross validation; AUC, area under the curve; SPE, specificity; ACC, accuracy; PRE, precision; MCC, Matthews correlation coefficient; REC, recall; SEN, sensitivity; OMIM, Online Mendelian Inheritance in Man compendium.
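For readers comparing the measures abbreviated above, the sketch below derives them from the four confusion-matrix counts using their standard definitions. The function name and the example counts are hypothetical and are not taken from any of the reviewed methods.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard threshold-based measures from confusion-matrix counts.

    Assumes both classes are present and at least one sample is predicted
    positive, so none of the ratio denominators is zero.
    """
    sen = tp / (tp + fn)                   # sensitivity, identical to recall (REC)
    spe = tn / (tn + fp)                   # specificity
    pre = tp / (tp + fp)                   # precision
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    f1 = 2 * pre * sen / (pre + sen)       # harmonic mean of precision and recall
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SEN": sen, "REC": sen, "SPE": spe, "PRE": pre,
            "ACC": acc, "F1": f1, "MCC": mcc}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because SEN and REC are the same quantity, a table that reports one of them and leaves the other blank loses no information.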
Figure 7. AUC values of the five methods at three different levels of heterogeneous networks. Colors distinguish the network cases, and bars of the same color show the validation results on the same dataset.
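Since AUC is the common yardstick in this comparison, a small sketch of how it can be computed from prediction scores may help. It uses the rank-based interpretation of AUC (the probability that a randomly chosen known interaction scores higher than a randomly chosen non-interaction, with ties counted as one half); the function name and the scores are hypothetical.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0   # positive ranked above negative
            elif pos == neg:
                wins += 0.5   # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical scores: known interactions versus unlabeled pairs.
held_out_positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]
print(auc_from_scores(held_out_positives, negatives))
```

Under leave-one-out cross validation, each known interaction would be hidden in turn and scored by the model trained on the rest; the scores collected this way can then be fed to a function like the one above.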