Qiongshi Lu, Sijin Ren, Ming Lu, Yong Zhang, Dahai Zhu, Xuegong Zhang, Tingting Li.
Abstract
BACKGROUND: Though most of the transcripts are long non-coding RNAs (lncRNAs), little is known about their functions. lncRNAs usually function through interactions with proteins, so identifying the binding proteins of lncRNAs is important for understanding the molecular mechanisms underlying their functions. Only a few approaches are available for predicting interactions between lncRNAs and proteins. In this study, we introduce a new method, lncPro.
Year: 2013 PMID: 24063787 PMCID: PMC3827931 DOI: 10.1186/1471-2164-14-651
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Complexes used in the training set
| 1FFK, 1JJ2 | |
| 1GIY, 3HUW, 3I8I | |
| 1J5A, 2ZJP | |
| 1P85, 2GYA | |
| 2FTC | |
| 2R8S | |
| 2RKJ | |
| 2ZKQ, 2ZKR | |
| 3BBN, 3BBO | |
| 3CW1 | |
| 3JYV |
Figure 1. Procedure of encoding the RNA sequences and the amino acid sequences into feature vectors. a) Procedure of encoding the RNA sequences into feature vectors. For the secondary structure, RNAsubopt was used to obtain the six possible secondary structures with the lowest free energy. The dots and brackets were then replaced by 0s and 1s, respectively. The six vectors were added to obtain the secondary structure feature vector. For van der Waals interaction and hydrogen bonding, each base was replaced by a number representing its propensity. Finally, all three feature vectors were transformed by the Fourier series, and the first 10 terms of the Fourier series were used as the new feature vector. b) Procedure of encoding the amino acid sequences into feature vectors. For the feature vector of the secondary structure, the corresponding Chou-Fasman propensities were used to encode each amino acid according to the secondary structure predicted by Predator. For the two feature vectors of hydrogen bonding, each amino acid was replaced by its Grantham and Zimmerman scores, respectively. Kyte-Doolittle and Bull-Breese scores were used for the two feature vectors of van der Waals interaction, respectively. For all five feature vectors, the first 10 terms of the Fourier series were used as new feature vectors.
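The RNA secondary structure encoding in panel a) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dot-bracket strings are made-up toy inputs standing in for RNAsubopt output, the function names are hypothetical, and the cosine series used here (an unnormalized DCT-II form) is an assumption, since the caption does not state the exact form of the Fourier series.

```python
import math


def encode_dot_bracket(structure):
    """Replace dots by 0 and brackets by 1, as described in the caption."""
    return [0 if ch == "." else 1 for ch in structure]


def sum_structures(structures):
    """Add the 0/1 vectors of several equal-length structures position by position."""
    encoded = [encode_dot_bracket(s) for s in structures]
    return [sum(column) for column in zip(*encoded)]


def fourier_terms(profile, n_terms=10):
    """Return the first n_terms cosine coefficients of a numeric profile.

    Assumption: an unnormalized DCT-II style series is used here for
    illustration; the source does not specify the transform.
    """
    n = len(profile)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(profile))
        for k in range(n_terms)
    ]


# Toy structures (hypothetical); the paper uses the six lowest-energy ones.
toy_structures = ["((..))", "(....)", "......"]
profile = sum_structures(toy_structures)
features = fourier_terms(profile)
print(profile)
print(len(features))
```

Keeping only a fixed number of terms gives every sequence a feature vector of the same length regardless of how long the sequence is, which is what allows a single scoring matrix to be applied to all RNA-protein pairs.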
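Given the feature vector r of an RNA and the feature vector p of a protein, a matrix M is needed so that the score $r^{T} M p$ can be used to measure the interaction between r and p. M is a 10 × 10 matrix with 100 entries because the dimension of the vectors was set at 10. Searching for this matrix unsystematically in 100-D Euclidean space would be inefficient and inaccurate, and both efficiency and accuracy would degrade further at higher dimensions.
Without loss of generality, the situation of dimension two is used to clarify the idea:
$$r = (r_1, r_2)^{T}, \quad p = (p_1, p_2)^{T}, \quad M = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix}$$
$$r^{T} M p = m_{11} r_1 p_1 + m_{12} r_1 p_2 + m_{21} r_2 p_1 + m_{22} r_2 p_2$$
Define the vectors x and k as follows:
$$x = (r_1 p_1, r_1 p_2, r_2 p_1, r_2 p_2)^{T}, \quad k = (m_{11}, m_{12}, m_{21}, m_{22})^{T}$$
The score $r^{T} M p$ is then the inner product of vectors x and k. This inner product score is expected to discriminate the data into two groups. Thus, according to Fisher's linear discriminant method, the best vector k is the direction that optimizes the Fisher criterion function:
$$J(k) = \frac{k^{T} S_b k}{k^{T} S_w k}$$
where $S_b$ and $S_w$ are the between-class and within-class scatter matrices of the vectors x.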
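The two-dimensional argument can be checked numerically. The sketch below assumes the score is the bilinear form of r, M and p, flattens the pair (r, p) into x and the matrix M into k, and evaluates the Fisher criterion for a given k as the squared distance between the projected class means divided by the summed within-class scatter. All function names and numbers are hypothetical and chosen only for illustration.

```python
def bilinear_score(r, M, p):
    """Score of an RNA vector r and a protein vector p under matrix M."""
    return sum(r[i] * M[i][j] * p[j] for i in range(len(r)) for j in range(len(p)))


def flatten_pair(r, p):
    """x: all products r_i * p_j, in row-major order."""
    return [ri * pj for ri in r for pj in p]


def flatten_matrix(M):
    """k: the entries of M, in the same row-major order as flatten_pair."""
    return [m for row in M for m in row]


def dot(x, k):
    return sum(a * b for a, b in zip(x, k))


def fisher_criterion(k, positives, negatives):
    """Fisher criterion of direction k for two classes of x vectors."""
    def mean_and_scatter(xs):
        projected = [dot(x, k) for x in xs]
        mean = sum(projected) / len(projected)
        scatter = sum((y - mean) ** 2 for y in projected)
        return mean, scatter

    mean_pos, scatter_pos = mean_and_scatter(positives)
    mean_neg, scatter_neg = mean_and_scatter(negatives)
    return (mean_pos - mean_neg) ** 2 / (scatter_pos + scatter_neg)


# Toy 2-D example (made-up values).
r, p = [1, 2], [3, 4]
M = [[1, 0], [0, 1]]
print(bilinear_score(r, M, p), dot(flatten_pair(r, p), flatten_matrix(M)))
```

Because the score is linear in the entries of M, finding M reduces to finding one direction k in the flattened space, which is the problem Fisher's linear discriminant addresses.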
Assuming that p indicates the protein secondary structure information and that r indicates the RNA hydrogen bonding information, computing a score from r and p is nonsensical because combining different kinds of information is theoretically meaningless. Thus, we selected another combining method. We computed five scores, one for each of the following feature vector pairs: the protein and RNA secondary structures, protein Grantham propensities and RNA hydrogen bonding, protein Zimmerman propensities and RNA hydrogen bonding, protein Kyte-Doolittle propensities and RNA van der Waals interaction, and protein Bull-Breese propensities and RNA van der Waals interaction. Here the Grantham and Zimmerman scores both characterize protein hydrogen bonding, and the Kyte-Doolittle and Bull-Breese scores both characterize protein van der Waals interaction.
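The pairing of like with like can be written down as a small lookup. In this sketch the dictionary keys, the function name and the toy vectors are all hypothetical, each pair is assumed to have its own scoring matrix, and the score is assumed to be a bilinear form. The sketch stops at the five per-pair scores, since how they are merged into a final score is not described in this passage.

```python
# (protein feature, RNA feature) pairs that describe the same kind of information.
FEATURE_PAIRS = [
    ("secondary_structure", "secondary_structure"),
    ("grantham", "hydrogen_bonding"),
    ("zimmerman", "hydrogen_bonding"),
    ("kyte_doolittle", "van_der_waals"),
    ("bull_breese", "van_der_waals"),
]


def pair_scores(protein_features, rna_features, matrices):
    """Compute one bilinear score per feature pair.

    protein_features / rna_features: dicts mapping feature name -> vector.
    matrices: dict mapping (protein feature, RNA feature) -> scoring matrix.
    """
    scores = {}
    for protein_key, rna_key in FEATURE_PAIRS:
        r = rna_features[rna_key]
        p = protein_features[protein_key]
        M = matrices[(protein_key, rna_key)]
        scores[(protein_key, rna_key)] = sum(
            r[i] * M[i][j] * p[j] for i in range(len(r)) for j in range(len(p))
        )
    return scores


# Toy 2-D vectors and identity matrices (made-up values).
identity = [[1, 0], [0, 1]]
protein = {key: [1, 2] for key, _ in FEATURE_PAIRS}
rna = {key: [3, 4] for _, key in FEATURE_PAIRS}
matrices = {pair: identity for pair in FEATURE_PAIRS}
print(pair_scores(protein, rna, matrices))
```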
Discriminative power for each score (non-redundant set)
| DP | 88.1% | 88.1% | 89.1% | 87.8% | 65.6% | 90.3% |
Discriminative power for each score (redundant set)
| DP | 90.0% | 92.0% | 91.5% | 89.7% | 73.6% | 94.3% |
Interaction scores of MRP and RNase P
| Protein | MRP score | Known* | RNase P score | Known* |
| hPop1 | 60.1 | + | 86.7 | + |
| hPop5 | 45.2 | - | 25.2 | - |
| Rpp14 | 40.9 | - | 32.9 | - |
| Rpp20 | 31.2 | + | 44.3 | + |
| Rpp21 | 64.7 | + | 77.6 | + |
| Rpp25 | 39.8 | + | 70.0 | + |
| Rpp29 | 69.6 | + | 66.8 | + |
| Rpp30 | 46.5 | - | 55.7 | - |
| Rpp38 | 58.0 | + | 61.4 | + |
| Rpp40 | 60.7 | - | 65.6 | - |
*Known experimental result: “+” indicates interactive, “-” indicates non-interactive.
Interaction scores of PRC-2 with HOTAIR and MEG3
| Protein | HOTAIR | MEG3 |
| Ezh2 | 93.1 | 87.4 |
| Eed | 63.8 | 67.6 |
| Suz12 | 89.7 | 69.5 |
| RBBP4 | 59.9 | 88.8 |
Interaction scores of LSD1/CoREST/REST complex and HOTAIR
| Protein | HOTAIR |
| LSD1 | 90.4 |
| CoREST | 73.2 |
| REST | 88.8 |
Interaction scores with roX1 and roX2
| Protein | roX1 | roX2 |
| MSL1 | 75.02 | 90.76 |
| MSL2 | 67.39 | 89.28 |
| MSL3 | 82.50 | 66.56 |
| MLE | 67.71 | 57.05 |
| MOF | 74.83 | 59.33 |
Figure 2. Cumulative distribution functions (CDF) of different RNAs. The black curve is the CDF of scores between this RNA and human proteins, and the red curve is the CDF of scores between this RNA and human nuclear proteins. The x-axis indicates the score obtained by our method; the y-axis indicates the cumulative proportion of proteins.
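An empirical CDF of the kind plotted in Figure 2 can be computed from a list of scores as sketched below. The function names are hypothetical and the scores are made-up values used only to show the calculation, not results from the paper.

```python
def empirical_cdf(scores):
    """Return (score, cumulative proportion) points of the empirical CDF."""
    ordered = sorted(scores)
    n = len(ordered)
    return [(s, (i + 1) / n) for i, s in enumerate(ordered)]


def fraction_at_or_below(scores, threshold):
    """Proportion of scores that do not exceed the threshold."""
    return sum(1 for s in scores if s <= threshold) / len(scores)


# Made-up scores of one RNA against a handful of proteins.
toy_scores = [10, 50, 90, 70]
for score, proportion in empirical_cdf(toy_scores):
    print(score, proportion)
```

Comparing two such curves for the same RNA, one over all proteins and one over a subset, shows whether the subset tends to score higher: its curve lies to the right of, and below, the curve for all proteins.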