Literature DB >> 33927752

Predicting Metabolite-Disease Associations Based on LightGBM Model.

Cheng Zhang1, Xiujuan Lei1, Lian Liu1.   

Abstract

Metabolites have been shown to be closely related to the occurrence and development of many complex human diseases by a large number of biological experiments; investigating their correlation mechanisms is thus an important topic, which attracts many researchers. In this work, we propose a computational method named LGBMMDA, which is based on the Light Gradient Boosting Machine (LightGBM) to predict potential metabolite-disease associations. This method extracts the features from statistical measures, graph theoretical measures, and matrix factorization results, utilizing the principal component analysis (PCA) process to remove noise or redundancy. We evaluated our method compared with other used methods and demonstrated the better areas under the curve (AUCs) of LGBMMDA. Additionally, three case studies deeply confirmed that LGBMMDA has obvious superiority in predicting metabolite-disease pairs and represents a powerful bioinformatics tool.
Copyright © 2021 Zhang, Lei and Liu.

Entities:  

Keywords:  computational method; features; light gradient boosting machine; metabolite-disease associations; performance evaluation

Year:  2021        PMID: 33927752      PMCID: PMC8078836          DOI: 10.3389/fgene.2021.660275

Source DB:  PubMed          Journal:  Front Genet        ISSN: 1664-8021            Impact factor:   4.599


Introduction

Metabolism is a series of ordered chemical reactions, which has a significant influence on biological life maintenance, such as the organism’s growth, reproduction, and reaction to the external environment (Dunn and Ellis, 2005). Metabolic processes are usually divided into two types. The first type is decomposing large molecules to acquire energy, such as cell respiration, while the other type is utilizing energy for the synthesis of each part inside the cells, such as nucleic acids and proteins (Cheng et al., 2017). In unhealthy conditions, relevant products in metabolism (metabolites) will be abnormal, which indicates that finding more disease-related metabolites is beneficial to disease prevention and treatment (Boja et al., 2014). Consequently, it is of high importance to identify the relationship among metabolites and diseases. Although some traditional techniques of metabolomics has prompted their analysis and development, such as nuclear magnetic resonance (NMR) spectroscopy or liquid/gas chromatography-mass spectrometry (LC/GC-MS) (Xianlin et al., 2011; Tang et al., 2014), the proportion of undiscovered associations between metabolites and diseases is still high. Moreover, some limitations exist, such as the time and funds required to mine disease-related metabolites in biological experiments. Therefore, effective computational methods for predicting disease-related metabolites are attracting more and more attention, which is beneficial to promoting the development to discover potential metabolite–disease associations. The idea of Random Walk with Restart for MiRNA-Disease Association (RWRMDA) (Hu et al., 2018) is to construct a metabolite–metabolite functional similarity network and implement RWR from known disease-related metabolite seed nodes to prioritize potential disease-related ones. However, this method uses less information for diseases or metabolites when calculating similarities, and its predictive performance needs to be improved. In this article, we present a computational method, LGBMMDA, based on Light Gradient Boosting Machine (LightGBM) (Ke et al., 2017), to identify metabolite–disease associations (Figure 1). Firstly, we extract the data of metabolite-related pathways as part of the integrated similarity network. Secondly, features are selected from the relevant similarity network and known metabolite–disease associations using the statistical measures, graph theoretical measures, and matrix factorization measures. Furthermore, the principal component analysis (PCA) (Deutsch, 2004) algorithm, which is a technique for analyzing and simplifying datasets, is utilized to extract deep features. Thirdly, the LightGBM classifier is utilized to obtain the potential association scores. In addition, the LOOCV and fivefold cross-validation are adopted to evaluate the performance of LGBMMDA, which achieves 0.9738 and 0.9715, respectively. Besides, three types of case studies for common diseases are carried out to evaluate the ability of the method to predict metabolites. These aforementioned experiments show that LGBMMDA is a reliable and excellent model to infer unknown metabolites–diseases associations.
FIGURE 1

The flowchart of LGBMMDA.

The flowchart of LGBMMDA.

Materials and Methods

Human Metabolite–Disease Associations

We extracted the experimentally confirmed human metabolite–disease associations from the last updated database (HMDB) (Wishart et al., 2017). Then, we performed the following steps on these associations: Firstly, the disease-related symptoms from the human symptom–disease network (HSDN) (Zhou et al., 2014; Ma et al., 2016) are used to calculate disease similarity after repeated associations; thus, the diseases that do not exist in the HSDN are removed. Secondly, the metabolite-related pathways from HMDB become part of the metabolite similarities, such that we keep the metabolites that are relevant to the diseases we selected. Finally, we obtain 127 diseases and 794 metabolites, which have 1,908 experimentally human metabolite–disease associations (see Figure 2). The parameters nm and nd represent the number of metabolites and diseases, respectively. Matrix M represents the adjacency matrix of metabolite–disease associations, such that the entity M(i,j) in row i and column j is 1 if disease i is associated with metabolite j and 0 otherwise.
FIGURE 2

A part of known metabolite–disease association network.

A part of known metabolite–disease association network.

Metabolite Functional Similarity

According to the hypothesis that metabolites with similar functions have a higher probability of possessing similar pathways, we utilize the Hamming similarity (Charikar, 2002) to measure the functional similarity of two metabolites by considering their related pathways. The metabolite functional similarity matrix is defined as MHS( × ), such that the element value is calculated as follows (Zhang et al., 2020) where MHS(m,m) represents the Hamming similarity between metabolite node m_i and m_j; np denotes the number of pathways. If there are existing associations between the metabolite i and pathway k, MP(k,i) is set to 1 in metabolite-pathway association network (MP).

Disease Functional Similarity

Considering the assumption that two diseases with similar functions usually have similar symptoms, the values of two disease-related symptom sets are used to obtain their functional similarities. Let the sets and sets describe the symptom sets of diseases a and b, where as and bs denote the number of symptoms associated with diseases a and b, respectively. Firstly, we calculate the information entropy of as follows (Gu et al., 2017) where Tn denotes the number of disease-symptom associations, is the number of the ith symptom related with disease a in the disease-symptom set,represents the frequency about the ith symptom associated with disease a, and is the information entropy of . The normalized mutual information (NMI) of and is used to measure the functional similarity between disease a and b as follows: where matrix DNF represents the functional similarity matrix; , , and denote the information entropy of , and the intersection set of and , respectively.

Gaussian Interaction Profile Kernel Similarity

Following literature (Gu et al., 2017) the GIP kernel for the similarities about diseases and metabolites captures the key features of the metabolite–disease association data. Calculating such kind of similarities is based on the assumption that similar diseases are more likely to contain functionally similar metabolites, and vice versa. Let the binary vector V(d), which is the row vector of the matrix M where the disease d_i is located, represent the interaction profiles of disease d_i. Then, the relevant similarities for diseases DGS(d_i,d) between the diseases d_i and d_j can be shown as follows: where ω is a parameter that controls the kernel bandwidth, acquired by normalizing the new bandwidth parameter . Similarly, the GIP kernel of the similaritiesMGS(m,m) between metabolites m_i and m_j is defined as follows: where ω is a parameter that controls the kernel bandwidth, acquired by normalizing the new bandwidth parameter .

Integrated Similarity for Metabolites and Diseases

In order to ensure that similarity information exists for every pair in metabolites or diseases, we integrated the disease functional similarities with GIP kernel similarities, which is shown as follows: where IDS(d,d) represents the integrated disease similarities. Similarly, the integrated metabolite similarity matrix (IMS) is given as follows:

Feature Extraction

Firstly, type 1 features (F1), which consist of the values of the sum, mean, and histogram distributions of metabolite/disease similarities, are calculated using the statistical measures for each disease/metabolite. We start by calculating the number of known associations in the relevant ith row/jth column of M. Then, the average of all similarity scores is computed according to the ith/jth row of IDS/IMS. Simultaneously, the similarity scores that ranges at [0, 1] are split into n parts (n = 5 in this work), and the proportion of similarity scores for d(j)/m(i) that fell into each part are counted as the histogram feature. Secondly, type 2 features (F2) are calculated, which include the information about graph theory-related statistics. Before obtaining this type of features, we construct the unweighted graph, in which two nodes have an edge if their similarity score is beyond the mean value of all entities in IDS/IMS. Then, we extract the relevant neighbors’ information, betweenness, closeness, eigenvector centrality, and PageRank (Franceschet, 2010) scores of the disease/metabolite similarity network in an unweighted graph. Thirdly, type 3 features (F3) are calculated. These features consist of the information about metabolite–disease pairs based on matrix factorization of M. The nonnegative matrix factorization (NMF) (Lee and Seung, 1999; Akbar et al., 2020), which was proposed by Lee and Seung, 1999, can help to solve the matrix sparsity problem. Thus, the metabolite–disease association matrix M can be factorized into two low-rank feature matrices A ∈ R and B ∈ R, where k denotes the dimension of the metabolite and disease features in the low-rank spaces (k = 20). Finally, the feature sets F(i,j) = [F1,F2,F3] for disease i and metabolite j is obtained. Meanwhile, PCA is applied to extract the more useful features.

Light Gradient Boosting Machine

Some boosting algorithms, such as the Gradient Boosting Decision Tree (GBDT) and eXtreme Gradient Boosting (XGBoost), have a common weakness that all the sample points for every feature are scanned when obtaining the best segmentation point; this is very time-consuming and computationally expensive to meet current needs. In order to reduce the cost of the experiment, we use LightGBM as the classifier (Friedman, 2001; Ke et al., 2017). LightGBM includes two main algorithms: Gradient-Based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). In the GOSS algorithm, the training instances are firstly ranked according to the absolute values of their gradients in descending order. Then, the top-a ×100% instances with the larger gradients are kept and combined into an instance subset A. Besides, the (1−a)×100% instances with the smaller gradients are integrated in the remaining set A, and a further subset B with the size b×|A| is randomly sampled. Finally, the instances are split according to the estimated variance gain over the subset A⋃B. The variance gain of splitting feature j at point d is shown as follows (Ke et al., 2017) whereA={x ∈ A:x≤d},A={x ∈ A:x > d},B={x ∈ B:x≤d},B={x ∈ B:x > d}, and is used to normalize the sum of the gradients over B back to the size of A. Each x_i is a vector with the dimension s in space X. In every gradient boosting iteration, the negative gradients of the loss function with respect to the output of the model are defined as {g1, ..., g}, where n is the number of vectors in space X. In the EFB algorithm, unnecessary computation for zero feature values is avoided by binding mutually exclusive features together in a histogram to form a feature. There are two main ideas for EFB. In algorithm 1, the function is to consider which features should be bundled together, while algorithm 2 determines how to construct the bundle as follows (Ke et al., 2017): Greedy bundling. Merge exclusive features.

Results

In this section, we utilize LOOCV and fivefold cross-validation to evaluate the performance of LGBMMDA. In LOOCV, each confirmed metabolite–disease pair is treated as the test set in turn, while the other confirmed pairs are regarded as training sets. Besides, the unconfirmed associations are regarded as potential candidates for true associations. We plot the ROCs curves and use the area under the ROC curve (AUC) as the evaluating indicator. Furthermore, we also use fivefold cross-validation as an evaluation tool to verify the performance of our method. In this method, the known information about metabolites and diseases is randomly divided into five equal parts. Then, each part is used as the test set in turn, while the other four parts represent the training set. This helps to avoid having the test and training data overlapping with each other and ensures unbiased comparisons. In this study, we compare our method with some state-of-the-art methods, including the label propagation algorithm (LP), which is a semi-supervised learning method based on graph (and its basic idea is to predict the label information of unlabeled nodes by using the label information of labeled nodes); random walk (RWR), which is close to Brownian motion and is the ideal mathematical state of Brownian motion; logistic regression (LR), which is a machine learning method solving binary (0 or 1) problems and estimating the possibility of something; and decision tree (DT), which is the process of classifying data through a series of rules. The results show that LGBMMDA achieved AUC values of 0.9738 and 0.9715 in LOOCV and fivefold cross-validation, respectively (see Figures 3, 4). In addition, we analyze the scores of known associations about LOOCV and count the number of known associations correctly identified by each algorithm (see Figure 5). It can be seen from Figure 6 that our proposed method is superior to other methods in terms of precision, recall, and F1-measure (0.898596, 0.90566, and 0.9021, respectively). Although the precision of LR is higher than our method, the recall of LR is significantly lower. Our method is steadier than LR.
FIGURE 3

The ROC about LOOCV.

FIGURE 4

The ROC about fivefold cross validation.

FIGURE 5

Comparison of the top k ranks with different methods.

FIGURE 6

Comparison of the precision, recall, and F1_measure with different methods.

The ROC about LOOCV. The ROC about fivefold cross validation. Comparison of the top k ranks with different methods. Comparison of the precision, recall, and F1_measure with different methods.

Parameter Analysis

In this section, we select some significant parameters to be adjusted in LightGBM. Firstly, we set the parameter n_estimators, which is related to the number of residual trees, from 100 to 500, while other important parameters are set to default. Figure 1 shows that we get better results when n_estimators is set to 300 (see Figure 7). In order to improve the accuracy, the values of the parameter max_depth, which limits the maximum depth of the tree model, is set from 3 to 8, and num_leaves, which controls the number of leaf nodes, is set from 5 to 100. As a result, max_depth = 7 and num_leaves = 15 achieve better performance (see Figure 8). Finally, the range of max_bin, which has an effect on overfitting, is set from 5 to 256, and min_data_in_leaf, which is the minimum number of samples contained on a leaf node, is set from 1 to 100. The results show that max_bin = 45 and min_data_in_leaf = 51 are better than other values (see Figure 9).
FIGURE 7

The AUC value of different n_estimators.

FIGURE 8

The AUC value of different max_depth and num_leaves. Different color represents different values of max_depth. The X axis represents the different values of num_leaves, and the Y axis represents relevant AUCs.

FIGURE 9

The AUC values of different max_bin and min_data_in_leaf.

The AUC value of different n_estimators. The AUC value of different max_depth and num_leaves. Different color represents different values of max_depth. The X axis represents the different values of num_leaves, and the Y axis represents relevant AUCs. The AUC values of different max_bin and min_data_in_leaf.

Case Study

In this section, we analyze three kinds of diseases, anemia, uremia, and asthma, in case studies to discover their pathogenic mechanisms from the perspective of metabolites. There are 10, 9, and 7 metabolites of these diseases that could be verified out of the top 10 predicted metabolites, respectively. Figure 10 shows anemia and its relevant metabolites.
FIGURE 10

The associations between anemia and some metabolites. The blue ellipses represent the known metabolites about anemia in this study. The yellow triangles represent the top 10 predicted metabolites relevant to anemia. The blue diamonds represent the top 10 neighbors about predicted or known metabolites.

The associations between anemia and some metabolites. The blue ellipses represent the known metabolites about anemia in this study. The yellow triangles represent the top 10 predicted metabolites relevant to anemia. The blue diamonds represent the top 10 neighbors about predicted or known metabolites. Anemia is caused by the inability of the body to produce enough hemoglobin, which is a protein that carries oxygen to blood cells and tissues. This disease has common symptoms, such as fatigue and dizziness. We conduct our method on a case study of anemia (see Table 1) to select the top 10 most likely associated metabolites, and all of them are associated with anemia according to literature in NCBI. For instance, L-histidine (Peterson et al., 1998) acts as a semi-essential amino acid, which is medically used in the treatment of anemia (Wang et al., 2020).
TABLE 1

Candidate metabolites of anemia.

Anemia
RankMetabolite nameEvidences
1L-HistidinePMID: 32498848
2L-ProlinePMID: 26821380
3GlycinePMID: 30853991
4L-ArgininePMID: 31355573
5L-ValinePMID: 30860750
6L-TryptophanPMID: 32153576
7L-GlutaminePMID: 32350885
8L-TyrosinePMID: 32764239
9L-Glutamic acidPMID: 30628549
10L-PhenylalaninePMID: 26956768
Candidate metabolites of anemia. Asthma is a common and frequent disease, which has the main symptoms of paroxysmal wheezing, chest tightness, and cough. The field of metabolomics has been used to explore the metabolic signatures of asthma, both for biomarker identification and pathophysiologic mechanisms research. We perform our method on a case study of asthma, and 7 of the top 10 predicted metabolites that are interrelated with asthma are verified to be correlative (see Table 2). For example, L-proline (Nadler et al., 1988) is one of metabolic characteristics of asthma, which is supported by experimental asthma models and clinical studies in children and adults (Pite et al., 2018). Another example is L-tryptophan (Hartzema et al., 1991), which has long been suggested to be relevant to the pathophysiology of asthma (Hu et al., 2020).
TABLE 2

Candidate metabolites of asthma.

Asthma
RankMetabolite nameEvidences
1L-HistidinePMID: 31206804
2L-ProlinePMID: 29059088
3L-TryptophanPMID: 31951781
4L-Glutamic acid
53-Hydroxybutyric acidPMID: 32213896
6Succinic acidPMID: 14846625
7L-MethioninePMID: 32778730
81-MethylhistidinePMID: 24783928
9L-Threonine
10PC(18:1(11Z)/22:1(13Z))
Candidate metabolites of asthma. Uremia is a serious kidney disease that is caused by a disorder in the internal biochemical process after renal function loss. We conduct our calculation method on a case study of uremia. As illustrated in Table 3, 9 of the top 10 predicted metabolites that are interrelated with uremia are verified to be correlative. For example, L-histidine is found to be significantly enhanced in the brain in uremia patients (Schmid et al., 1996). The L-proline in body fluids is a biological parameter for patients with renal insufficiency and chronic uremia (Hanwen, Sun et al., 2009).
TABLE 3

Candidate metabolites of uremia.

Uremia
RankMetabolite nameEvidences
1L-HistidinePMID: 8676800
2L-ProlinePMID: 20355181
33-Hydroxybutyric acid
4BiotinPMID: 6322032
5XanthinePMID: 19379356
6L-TryptophanPMID: 935125
7InosinePMID: 9607216
8Succinic acidPMID: 13837895
9L-Glutamic acidPMID: 6508956
10gamma-Aminobutyric acidPMID: 16797388
Candidate metabolites of uremia.

Discussion

Uncovering complex disease-related metabolites is a vital research topic in metabolomics. To this end, we proposed a computational model called LGBMMDA under the framework of LightGBM. The experimental results by cross-validation have proven that our method outperforms previously used methods. Furthermore, three case studies indicate that the metabolite–disease correlations predicted in our method can be effectively demonstrated by relevant experiments. The LGBMMDA method is expected to be a useful biomedical research tool for predicting potential metabolite–disease associations. There are three factors that contribute to the ideal predictive performance of LGBMMDA. Our method makes the following contributions for uncovering metabolite–disease associations: Firstly, the data of the metabolite–pathway associations are selected as metabolite functional similarities, which is a novel way to calculate similarities between metabolites. Secondly, three features are extracted by different angles, which keeps the diversity of features and contributes to a reliable performance. Thirdly, our method utilizes the reliable classifier of LightGBM, which ensures an effectively predictive accuracy. However, there are several limitations in our prediction model. On the one hand, many parameters of GBM need to be adjusted. In this work, parameter adjustment is only carried out by some experiments. In future work, some algorithms might be used to adjust those parameters. On the other hand, more useful methods for calculating relevant similarities could be beneficial to enhancing the performance of our model. In the future, more biologically relevant information is expected to be available, which can be used to refine the similarities.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data about metabolites can be found here: https://hmdb.ca/.

Author Contributions

CZ carried out the method IBNPLNSMDA to predict the potential associations of metabolites and diseases, participated in its design, and drafted the manuscript. XL and LL helped to draft the manuscript. All authors read and approved the final manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
ALGORITHM 1

Greedy bundling.

ALGORITHM 2

Merge exclusive features.

  1 in total

1.  Geometric complement heterogeneous information and random forest for predicting lncRNA-disease associations.

Authors:  Dengju Yao; Tao Zhang; Xiaojuan Zhan; Shuli Zhang; Xiaorong Zhan; Chao Zhang
Journal:  Front Genet       Date:  2022-08-24       Impact factor: 4.772

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.