Xiaoyong Pan1,2, Yong-Xian Fan3, Junchi Yan4, Hong-Bin Shen5.
Abstract
BACKGROUND: Non-coding RNAs (ncRNAs) play crucial roles in many biological processes, such as post-transcriptional gene regulation. ncRNAs function mainly through interactions with RNA-binding proteins (RBPs). A fundamental step in understanding the function of an ncRNA is therefore to identify which proteins it interacts with. Computational prediction of these RBPs is promising, but the major challenge is that the interaction patterns or motifs are difficult to find.
Keywords: Deep learning; Stacked ensembling; ncRNA; ncRNA-protein
Year: 2016 PMID: 27506469 PMCID: PMC4979166 DOI: 10.1186/s12864-016-2931-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Flowchart of the proposed IPMiner, which proceeds in two main steps. a Train stacked autoencoder models for RNA and protein separately, then fine-tune them using label information from RNA-protein pairs. b Apply stacked ensembling to integrate SDA-RF, SDA-FT-RF and RPISeq-RF, which use high-level features before fine-tuning, high-level features after fine-tuning and raw k-mer frequency features, respectively. The stacked autoencoder architecture is 256-128-64, i.e. three hidden layers with 256, 128 and 64 neurons
Performance comparison between different layer architectures on RPI488
| Architecture | Method | Accuracy | Sensitivity | Specificity | Precision | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Sep-256-128-64 | IPMiner | | | 0.831 | | | |
| | SDA-RF | 0.880 | 0.922 | 0.827 | 0.928 | 0.762 | 0.904 |
| | SDA-FT-RF | 0.881 | 0.916 | 0.831 | 0.926 | 0.762 | 0.909 |
| Con-256-128-128 | IPMiner | 0.872 | 0.893 | | 0.894 | 0.743 | 0.903 |
| | SDA-RF | 0.884 | 0.924 | 0.831 | 0.934 | 0.770 | 0.911 |
| | SDA-FT-RF | 0.864 | 0.885 | 0.836 | 0.887 | 0.727 | 0.898 |
| Raw input | RPISeq-RF | 0.880 | 0.926 | 0.822 | 0.932 | 0.762 | 0.903 |
| Raw input | lncPro | 0.870 | 0.900 | 0.827 | 0.910 | 0.740 | 0.901 |
Raw input is the concatenation of the 3-mer frequency features of the protein and the 4-mer frequency features of the RNA
Boldface indicates the best performance for that measure among the compared methods on each dataset
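The raw input described in the footnote above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, and the plain 20-letter amino-acid alphabet is an assumption, since this section does not say whether residues are first grouped into a reduced alphabet (which would shrink the protein vector).

```python
from itertools import product

RNA_ALPHABET = "ACGU"
# Assumption: the 20 standard amino acids, with no reduced alphabet.
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(seq, k, alphabet):
    """Return normalized k-mer frequencies in a fixed order over `alphabet`."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing unknown symbols
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def raw_input_features(protein_seq, rna_seq):
    """Concatenate protein 3-mer and RNA 4-mer frequency vectors."""
    return (kmer_frequencies(protein_seq, 3, PROTEIN_ALPHABET)
            + kmer_frequencies(rna_seq, 4, RNA_ALPHABET))
```

With these alphabets the RNA part has 4^4 = 256 entries and the protein part 20^3 = 8000 entries; sequences shorter than k yield an all-zero vector.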
Fig. 2 Performance on RPI488. Performance comparison between IPMiner, SDA-FT-RF and SDA-RF on the lncRNA-protein dataset RPI488
Fig. 3 Ensembling strategy. Performance comparison between stacked ensembling and average ensembling on dataset RPI2241
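The two strategies compared in Fig. 3 differ in how base-classifier outputs are combined: average ensembling takes their mean, while stacked ensembling learns a combiner on top of them. The sketch below assumes a logistic-regression combiner fitted by plain gradient descent; this section does not name the combiner, and all function names and the toy probabilities are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def average_ensemble(probs):
    """Average ensembling: unweighted mean of base-classifier probabilities."""
    return sum(probs) / len(probs)


def fit_stacker(base_probs, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights on base-classifier probabilities."""
    n_models = len(base_probs[0])
    n = len(labels)
    w = [0.0] * n_models
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for x, y in zip(base_probs, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_models):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def stacked_ensemble(probs, w, b):
    """Stacked ensembling: learned weighted combination of base outputs."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, probs)) + b)


# Toy held-out predictions from three base classifiers (hypothetical values);
# only the first classifier is informative about the label.
base_probs = [[0.9, 0.6, 0.4], [0.8, 0.3, 0.6], [0.7, 0.7, 0.5],
              [0.2, 0.6, 0.5], [0.3, 0.4, 0.6], [0.1, 0.5, 0.4]]
labels = [1, 1, 1, 0, 0, 0]
w, b = fit_stacker(base_probs, labels)
```

In practice the combiner should be fitted on predictions for held-out pairs, so that it does not simply reward base classifiers that overfit the training data.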
Performance comparison on structure-based RPI369, RPI2241 and RPI1807
| Dataset | Method | Accuracy | Sensitivity | Specificity | Precision | MCC | AUC |
|---|---|---|---|---|---|---|---|
| RPI2241 | IPMiner | | 0.833 | | 0.836 | | |
| | SDA-RF | 0.648 | 0.653 | 0.630 | 0.665 | 0.296 | 0.687 |
| | SDA-FT-RF | 0.783 | | 0.645 | | 0.592 | 0.898 |
| | RPISeq-RF | 0.646 | 0.652 | 0.630 | 0.663 | 0.293 | 0.690 |
| | lncPro | 0.654 | 0.659 | 0.640 | 0.669 | 0.310 | 0.722 |
| RPI369 | IPMiner | | | | | | |
| | SDA-RF | 0.707 | 0.699 | 0.727 | 0.689 | 0.416 | 0.754 |
| | SDA-FT-RF | 0.693 | 0.664 | 0.784 | 0.602 | 0.396 | 0.728 |
| | RPISeq-RF | 0.704 | 0.705 | 0.702 | 0.707 | 0.409 | 0.767 |
| | lncPro | 0.704 | 0.708 | 0.696 | 0.713 | 0.409 | 0.740 |
| RPI1807 | IPMiner | | | 0.993 | | | |
| | SDA-RF | 0.972 | 0.970 | 0.981 | 0.962 | 0.944 | 0.995 |
| | SDA-FT-RF | 0.972 | 0.955 | | 0.940 | 0.944 | 0.995 |
| | RPISeq-RF | 0.973 | 0.968 | 0.984 | 0.960 | 0.946 | 0.996 |
| | lncPro | 0.969 | 0.965 | 0.981 | 0.955 | 0.938 | 0.994 |
The positive pairs are all taken from the original papers. The negative pairs for RPI1807 are also taken from the original paper
Boldface indicates the best performance for that measure among the compared methods on each dataset
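The threshold-based measures in these tables can all be derived from a confusion matrix. The sketch below uses the standard textbook definitions, which this section does not spell out, so it is an assumption about how the columns were computed; `binary_metrics` and its argument names are illustrative. AUC is omitted because it needs ranked scores and not just counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification measures from confusion counts."""
    def ratio(num, den):
        # Return 0.0 when the denominator is empty instead of raising.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts for a balanced test set of 200 pairs.
example = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

MCC is the most informative single number here when a dataset is imbalanced, because it uses all four cells of the confusion matrix.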
Performance comparison on non-structure-based NPInter2.0 and RPI13254
| Dataset | Method | Accuracy | Sensitivity | Specificity | Precision | MCC | AUC |
|---|---|---|---|---|---|---|---|
| NPInter2.0 | IPMiner | | 0.946 | | 0.945 | | |
| | SDA-RF | 0.937 | 0.940 | 0.935 | 0.941 | 0.876 | 0.975 |
| | SDA-FT-RF | 0.934 | | 0.912 | | 0.868 | 0.990 |
| | RPISeq-RF | 0.944 | 0.940 | 0.949 | 0.940 | 0.889 | 0.978 |
| | lncPro | 0.928 | 0.919 | 0.938 | 0.917 | 0.856 | 0.971 |
| RPI13254 | IPMiner | | | 0.995 | | | |
| | SDA-RF | 0.699 | 0.717 | 0.658 | 0.741 | 0.400 | 0.761 |
| | SDA-FT-RF | 0.813 | 0.728 | | 0.626 | 0.675 | 0.901 |
| | RPISeq-RF | 0.739 | 0.766 | 0.688 | 0.790 | 0.480 | 0.817 |
| | lncPro | 0.712 | 0.716 | 0.701 | 0.723 | 0.424 | 0.792 |
RPI13254 has 13254 positive pairs and 5172 negative pairs. We randomly sub-sampled the positive pairs from the original paper to create a balanced dataset, so the dataset used here consists of 5172 negative pairs and 5172 positive pairs
Boldface indicates the best performance for that measure among the compared methods on each dataset
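The balancing step described in the footnote for RPI13254 amounts to random down-sampling of the larger class. A minimal sketch follows, assuming pairs are stored as (protein ID, RNA ID) tuples; the function name, the seed handling and the toy IDs are all hypothetical.

```python
import random


def balance_by_subsampling(positives, negatives, seed=0):
    """Randomly down-sample the larger class to the size of the smaller one."""
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    n = min(len(positives), len(negatives))
    pos = rng.sample(positives, n) if len(positives) > n else list(positives)
    neg = rng.sample(negatives, n) if len(negatives) > n else list(negatives)
    return pos, neg


# Toy data: more positive than negative pairs, as in the footnote.
positives = [("P%d" % i, "R%d" % i) for i in range(100)]
negatives = [("P%d" % i, "R%d" % (i + 1)) for i in range(40)]
balanced_pos, balanced_neg = balance_by_subsampling(positives, negatives, seed=1)
```

Because sampling is without replacement, no positive pair appears twice in the balanced set; repeating the procedure with several seeds would show how sensitive the reported measures are to the particular subsample.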
Prediction performance of the model trained on RPI488 when applied to the NPInter2.0, RPI367 and RPIntDB datasets
| Dataset | Organism | Total # of ncRNA-protein pairs | Predicted # of ncRNA-protein pairs |
|---|---|---|---|
| NPInter2.0 | Homo sapiens | 6,975 | 6,809 (97.6 %) |
| | Caenorhabditis elegans | 36 | 22 (61.1 %) |
| | Mus musculus | 2,198 | 2,115 (96.2 %) |
| | Drosophila melanogaster | 91 | 88 (96.7 %) |
| | Saccharomyces cerevisiae | 910 | 860 (94.5 %) |
| | Escherichia coli | 202 | 176 (87.1 %) |
| | Total | 10,412 | 10,070 (96.7 %) |
| RPI367 | Homo sapiens | 148 | 132 (89.2 %) |
| | Caenorhabditis elegans | 2 | 2 (100.0 %) |
| | Mus musculus | 46 | 34 (73.9 %) |
| | Drosophila melanogaster | 26 | 24 (92.3 %) |
| | Saccharomyces cerevisiae | 119 | 117 (98.3 %) |
| | Escherichia coli | 25 | 21 (84.0 %) |
| | Total | 366 | 330 (90.1 %) |
| RPIntDB | Total | 44,586 | 38,522 (86.4 %) |
For NPInter2.0, RPI-Pred can predict 90 % of the total interactions [13]. Pairs in which the protein or RNA is obsolete were removed. For example, protein O16646 is obsolete in UniProtKB, so the pair formed by ncRNA u1136 and O16646 was removed from RPI367. In RPIntDB, organism information is missing for some interaction pairs, so we report only the total prediction accuracy
Fig. 4 Interaction network. Clusters obtained by MCL clustering of the ncRNA network constructed from the ncRNA-protein pairs predicted by IPMiner for Caenorhabditis elegans
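For readers unfamiliar with MCL (Markov clustering), the sketch below shows its core loop of expansion and inflation on a small adjacency matrix. It is a simplified pure-Python illustration and not the tool or settings used for Fig. 4: the inflation value, iteration count and pruning threshold are assumptions, and how the ncRNA network is built from the predicted pairs is not specified in this section.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (columns that sum to 0 are left as-is)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product for square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=50, threshold=1e-6):
    """Cluster nodes of an undirected graph with a basic MCL iteration."""
    n = len(adjacency)
    # Self-loops are added before normalizing, as is usual for MCL.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = [[v ** inflation for v in row] for row in m]   # inflation
        m = normalize_columns(m)
    # Each row that retains flow defines a cluster of the columns it attracts.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > threshold)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy network: two disconnected triangles of ncRNAs (hypothetical).
adjacency = [[0, 1, 1, 0, 0, 0],
             [1, 0, 1, 0, 0, 0],
             [1, 1, 0, 0, 0, 0],
             [0, 0, 0, 0, 1, 1],
             [0, 0, 0, 1, 0, 1],
             [0, 0, 0, 1, 1, 0]]
clusters = mcl(adjacency)
```

Higher inflation values tend to produce more, smaller clusters; on real networks this basic version can return overlapping clusters, which dedicated MCL implementations resolve.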
The number of RNA-protein interaction pairs in collected datasets
| Dataset | # of interaction pairs | # of RNAs | # of proteins | Reference |
|---|---|---|---|---|
| RPI1807 | 1807 | 1078 | 1807 | [ |
| RPI369 | 369 | 332 | 338 | [ |
| RPI2241 | 2241 | 842 | 2043 | [ |
| NPInter2.0 | 10412 | 4636 | 449 | [ |
| RPI13254 | 13254 | 4500 | 42 | [ |
| RPI488 | 243 | 25 | 247 | This study |
RPI488 contains lncRNA-protein interactions derived from structure complexes. RPI369, RPI2241 and RPI1807 contain RNA-protein interactions. NPInter2.0 and RPI13254 contain ncRNA-protein interactions from non-structure-based sources