Jiaqi Wang, Zhufang Kuang, Zhihao Ma, Genwei Han.
Abstract
Interactions between genetic factors and environmental factors (EFs) play an important role in many diseases. Long non-coding RNA (lncRNA) is an important class of non-coding RNA that regulates life processes, so the ability to predict associations between lncRNAs and EFs is of practical significance. However, recent methods for predicting lncRNA-EF associations rarely use the topological information of heterogeneous biological networks, or simply treat all objects as the same type without considering the distinct and subtle semantic meanings of the various paths in a heterogeneous network. To address this issue, this paper proposes a method based on the Gradient Boosting Decision Tree (GBDT) to predict associations between lncRNAs and EFs (GBDTL2E). GBDTL2E integrates structural information with heterogeneous networks, combines HeteSim features and diffusion features through multi-feature fusion, and uses the GBDT machine learning algorithm to predict lncRNA-EF associations over the heterogeneous network. The experimental results demonstrate that the proposed algorithm achieves high performance.
Keywords: HeteSim score; environmental factor; gradient boosting decision tree; heterogeneous network; long non-coding RNA; random walk with restart
Year: 2020 PMID: 32351537 PMCID: PMC7174746 DOI: 10.3389/fgene.2020.00272
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1 Flowchart of our method: (A) Obtain the association matrix A and calculate the Gaussian interaction profile kernel similarity of lncRNAs and EFs, respectively. (B) Calculate the chemical structure similarity matrix E. (C) Obtain the lncRNA similarity matrix SL and construct the EF similarity matrix SE. (D) Integrate the three subnets A, SL, and SE into a global heterogeneous network. (E) Construct the adjacency matrix G and obtain the diffusion features. (F) Calculate the HeteSim score. (G) Combine the diffusion features and the HeteSim score. (H) Train the Gradient Boosting Decision Tree (GBDT) classifier.
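Step (A) relies on the Gaussian interaction profile (GIP) kernel, whose formula is not reproduced in this record. The sketch below follows the standard GIP definition; the `bandwidth` argument corresponds to the frequency-band value of 1 in the parameter table, and all names are assumptions rather than the authors' code:

```python
import numpy as np

def gip_kernel(A, bandwidth=1.0):
    """Gaussian interaction profile (GIP) kernel similarity.

    A: binary association matrix; each row is an entity's interaction
       profile (e.g. one lncRNA's associations with all EFs).
    bandwidth: frequency-band parameter (1 in the paper's parameter table).
    """
    # Normalize the bandwidth by the mean squared norm of the profiles
    sq_norms = np.sum(A ** 2, axis=1)
    gamma = bandwidth / np.mean(sq_norms)
    # Pairwise squared Euclidean distances between interaction profiles
    diff = A[:, None, :] - A[None, :, :]
    d2 = np.sum(diff ** 2, axis=2)
    return np.exp(-gamma * d2)

# Toy example: rows 0 and 1 have identical profiles, row 2 differs
A = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
K = gip_kernel(A)
```

Identical profiles get similarity 1, and the kernel is symmetric with ones on the diagonal, which is what makes it usable as a similarity subnet.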
Figure 2 Example illustrating the HeteSim measure. Circles of different colors denote three different kinds of objects in the heterogeneous network. (A–C) represent three different nodes in the heterogeneous network.
Paths of length less than 5 from an lncRNA to an environmental factor (EF) in our heterogeneous network.
| No. | Abbreviation | Path | Length |
| 1 | LLE | lncRNA-lncRNA-EF | 2 |
| 2 | LEE | lncRNA-EF-EF | 2 |
| 3 | LLLE | lncRNA-lncRNA-lncRNA-EF | 3 |
| 4 | LELE | lncRNA-EF-lncRNA-EF | 3 |
| 5 | LLEE | lncRNA-lncRNA-EF-EF | 3 |
| 6 | LEEE | lncRNA-EF-EF-EF | 3 |
| 7 | LLLLE | lncRNA-lncRNA-lncRNA-lncRNA-EF | 4 |
| 8 | LLLEE | lncRNA-lncRNA-lncRNA-EF-EF | 4 |
| 9 | LLELE | lncRNA-lncRNA-EF-lncRNA-EF | 4 |
| 10 | LLEEE | lncRNA-lncRNA-EF-EF-EF | 4 |
| 11 | LELLE | lncRNA-EF-lncRNA-lncRNA-EF | 4 |
| 12 | LELEE | lncRNA-EF-lncRNA-EF-EF | 4 |
| 13 | LEELE | lncRNA-EF-EF-lncRNA-EF | 4 |
| 14 | LEEEE | lncRNA-EF-EF-EF-EF | 4 |
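HeteSim scores each such path by splitting it at its midpoint and comparing the reachable-probability vectors of the two endpoints at the middle node type. As an illustration (not the paper's code), a sketch for the shortest path LLE, assuming SL is the lncRNA similarity subnet and A the lncRNA-EF association matrix:

```python
import numpy as np

def row_normalize(M):
    """Convert a nonnegative matrix into row-stochastic transition probabilities."""
    s = M.sum(axis=1, keepdims=True)
    s[s == 0] = 1.0
    return M / s

def hetesim_lle(SL, A):
    """HeteSim scores for the meta-path L-L-E, split at the middle L.

    SL: lncRNA-lncRNA similarity matrix (transitions along L-L)
    A:  lncRNA-EF association matrix (L x E)
    Returns an L x E matrix of HeteSim(l, e | LLE) scores.
    """
    left = row_normalize(SL)    # reachable probability: l -> middle L
    right = row_normalize(A.T)  # reachable probability along the reversed
                                # right half: e -> middle L
    # Cosine similarity between each left row and each right row
    ln = np.linalg.norm(left, axis=1, keepdims=True)
    rn = np.linalg.norm(right, axis=1, keepdims=True)
    ln[ln == 0] = 1.0
    rn[rn == 0] = 1.0
    return (left / ln) @ (right / rn).T
```

A pair connected through exactly the same middle lncRNAs scores 1, while a pair with no shared middle node scores 0, matching the intuition behind the measure.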
The experimental parameters of GBDTL2E.
| Value | Parameter |
| 475 | Number of lncRNAs |
| 152 | Number of EFs |
| 627 | Total number of EFs and lncRNAs |
| 1 | Frequency-band parameter of the Gaussian interaction profile kernel similarity for lncRNAs |
| 1 | Frequency-band parameter of the Gaussian interaction profile kernel similarity for EFs |
| 0.7 | Weight of the correlation information of two environmental factors in SE |
| 5 | Path length constraint in HeteSim |
| 50 | Dimension of the low-dimensional diffusion features |
| 0.5 | Restart probability in the random walk with restart |
| 600 | Number of training samples |
| 10 | Number of training iterations |
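The diffusion features come from a random walk with restart (RWR) on the global heterogeneous network, using the restart probability 0.5 listed above. A minimal sketch of RWR, with function and variable names of my own choosing rather than the authors':

```python
import numpy as np

def rwr(G, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart on an adjacency matrix G (N x N).

    seed: index of the start node
    restart: restart probability (0.5 in the paper's parameter table)
    Returns the stationary visiting-probability vector, which serves
    as that node's diffusion profile before dimensionality reduction.
    """
    N = G.shape[0]
    # Column-normalize G into a transition matrix
    col = G.sum(axis=0, keepdims=True)
    col[col == 0] = 1.0
    W = G / col
    e = np.zeros(N)
    e[seed] = 1.0
    p = e.copy()
    for _ in range(max_iter):
        # Walk one step with probability 1-restart, jump home otherwise
        p_new = (1 - restart) * W @ p + restart * e
        if np.abs(p_new - p).sum() < tol:
            return p_new
        p = p_new
    return p
```

Because W is column-stochastic, the vector stays a probability distribution, and the seed node retains the largest mass, so nearby nodes are ranked above distant ones.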
The performance comparison with other machine learning methods.
| KNN | 0.953 | 0.937 | 0.952 | 0.907 | 0.985 |
| RF | 0.863 | 0.827 | 0.849 | 0.739 | 0.912 |
| SVM | 0.966 | 0.967 | 0.966 | 0.933 | 0.988 |
| GBDTL2E | 0.975 | 0.967 | 0.976 | 0.949 | 0.997 |
Figure 3 ROC curve comparison with other machine learning methods. (A) The ROC curve using KNN. (B) The ROC curve using RF. (C) The ROC curve using SVM. (D) The ROC curve using GBDT.
Figure 4 ROC curve comparison with other machine learning methods on the independent dataset.
Figure 5 Performance comparison of different feature groups (diffusion, HeteSim, and combined features).
Figure 6 ROC curve comparison of different feature groups. (A) The ROC curve with the diffusion feature only. (B) The ROC curve with the HeteSim feature only. (C) The ROC curve with the combined feature.
Figure 7 ROC curve comparison with existing methods. (A) The ROC curve of KATZ. (B) The ROC curve of MPALERLS. (C) The ROC curve of BIRWAPALE. (D) The ROC curve of GBDTL2E.
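ROC comparisons like these are usually summarized by the area under the curve (AUC). For reference, AUC can be computed directly from prediction scores via the Mann-Whitney statistic; this is a generic sketch, not the evaluation code used in the paper:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs the classifier ranks correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

A perfect ranking gives 1.0 and a random one about 0.5, which is the scale on which the methods in Figures 3–7 are compared.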
The top 10 predicted lncRNAs related to cisplatin.
| Rank | lncRNA | Evidence (PMID) |
| 1 | AK12669 | 23741487 |
| 2 | AC015818.3 | 25250788 |
| 3 | ABCC6P1 | 25250788 |
| 4 | GABPB-AS1 | 24036268 |
| 5 | CASC2 | 28495512 |
| 6 | PSORS1C3 | 25250788 |
| 7 | H19 | 28189050 |
| 8 | AK125699 | 25250788 |
| 9 | SRGAP3-AS2 | 25250788 |
| 10 | XLOC_001406 | 25250788 |
GBDTL2E algorithm
| 1: Construct the adjacency matrix. |
| 2: Initialize the global transition probability matrix. |
| 3: Initialize the transition probability vector for each node. |
| 4: Obtain the updated probability vector by random walk with restart; iterate to convergence to get the diffusion features. |
| 5: Input L, P to calculate the HeteSim score. |
| 6: Divide the path into two parts. |
| 7: Combine the diffusion features and the HeteSim scores to get the data set Dtrain. |
| 8: Use Dtrain to train the Gradient Boosting Decision Tree (GBDT). |
| 9: Initialize the model as Θ0. |
| 10: Calculate the loss function L(y, Θ). |
| 11: Calculate the residuals. |
| 12: Construct the regression tree and get the corresponding leaf node areas. |
| 13: Update the weak model Θ. |
| 14: Get the strong model Θ. |
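The boosting steps in the listing (initialize Θ0, compute residuals as the negative gradient, fit a regression tree, update the weak model, return the strong model) can be sketched with squared loss and depth-1 trees. This is an illustrative toy version, not the GBDT implementation used in GBDTL2E:

```python
import numpy as np

def fit_stump(X, r):
    """Best depth-1 regression tree (stump) for residuals r under squared loss."""
    best = (np.inf, 0, 0.0, r.mean(), r.mean())
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue  # split must separate at least one sample
            lv, rv = r[left].mean(), r[~left].mean()
            err = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if err < best[0]:
                best = (err, j, t, lv, rv)
    _, j, t, lv, rv = best
    return lambda Z: np.where(Z[:, j] <= t, lv, rv)

def gbdt_fit(X, y, n_iter=10, lr=0.5):
    """Gradient boosting with squared loss, mirroring the listed steps."""
    f0 = y.mean()                      # Θ0: constant initial model
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_iter):
        resid = y - pred               # negative gradient of squared loss
        tree = fit_stump(X, resid)     # fit a regression tree to residuals
        trees.append(tree)
        pred = pred + lr * tree(X)     # update the weak model
    def model(Z):                      # the strong model Θ
        out = np.full(len(Z), f0)
        for tree in trees:
            out = out + lr * tree(Z)
        return out
    return model
```

Each round shrinks the residual by a constant factor on this toy data, which is why a handful of iterations (10 in the paper's parameter table) already fits well.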