Qiang Zhu1,2, Xingpeng Jiang2,3, Qing Zhu2,3, Min Pan2,3, Tingting He2,3.
Abstract
Microbiome-wide association studies aim to characterize the relationship between microorganisms and their human hosts, with the goal of discovering biomarkers that can guide disease diagnosis. However, microbiome data are complex, noisy, and high-dimensional. Traditional machine learning methods are limited by their representational capacity and cannot learn complex patterns from such data. Deep learning has recently been applied widely, in fields ranging from text processing to image recognition, because of its flexibility and high capacity. But deep learning models must be trained on large amounts of data to perform well, which is often impractical for microbiome studies. In addition, deep learning models are regarded as black boxes that are hard to interpret. These factors have kept deep learning from being widely used in microbiome-wide association studies. In this work, we construct a sparse microbial interaction network and embed this graph into a deep model to reduce the risk of overfitting and improve performance. Further, we explore a Graph Embedding Deep Feedforward Network (GEDFN) to conduct feature selection and guide the identification of meaningful microbial markers. Based on the experimental results, we verify the feasibility of combining the microbial graph model with the deep learning model, and demonstrate the feasibility of applying deep learning and feature selection to microbial data. Our main contributions are as follows. First, we use different methods to construct a variety of microbial interaction networks and combine these networks with deep learning via graph embedding. Second, we introduce a feature selection method based on graph embedding and validate the biological meaning of the selected microbial markers. The code is available at https://github.com/MicroAVA/GEDFN.git.
Year: 2019 PMID: 31824573 PMCID: PMC6883002 DOI: 10.3389/fgene.2019.01182
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. The workflow of the graph embedding deep network for feature selection. 1. Construct the microbial interaction network. The input is Operational Taxonomic Unit (OTU) abundance. Different approaches are adopted to obtain different interaction networks. The vertices are species and the edges are correlation coefficients. 2. Graph embedding and model training. The feature graph is embedded into the first hidden layer so that the connections between the input layer and the first hidden layer are sparse rather than fully connected. The first hidden layer (graph embedding layer) has the same number of neurons as the input layer. 3. Feature selection. The neurons (features) are ranked by an importance score calculated from each neuron's connection weights.
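Steps 1 and 3 of this workflow can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the Pearson correlation, the `threshold` cut-off, the function names, and the exact form of the importance score (masked incoming weights plus outgoing weights) are all assumptions, since the caption only states that edges are correlation coefficients and that scores come from connection weights.

```python
import numpy as np

def correlation_network(otu, threshold=0.3):
    """Binary feature graph from OTU abundances (samples x features).

    Assumption: an edge is kept when |Pearson correlation| >= threshold;
    the threshold value is hypothetical.
    """
    corr = np.corrcoef(otu, rowvar=False)
    adj = (np.abs(corr) >= threshold).astype(float)
    adj = np.maximum(adj, adj.T)   # undirected graph: enforce symmetry
    np.fill_diagonal(adj, 1.0)     # every feature stays connected to itself
    return adj

def importance_scores(w_in, w_out, adj):
    """Score feature j from the weights touching graph-layer neuron j.

    Assumed score: sum of |incoming weights| allowed by the graph
    plus sum of |outgoing weights| to the next hidden layer.
    w_in:  (p, p) input -> graph embedding layer
    w_out: (p, h) graph embedding layer -> next hidden layer
    """
    incoming = (np.abs(w_in) * adj).sum(axis=0)
    outgoing = np.abs(w_out).sum(axis=1)
    return incoming + outgoing

# Synthetic data stands in for real OTU abundances and trained weights.
rng = np.random.default_rng(0)
otu = rng.random((40, 6))                  # 40 samples, 6 OTUs
adj = correlation_network(otu, threshold=0.2)
w_in = rng.normal(size=(6, 6))
w_out = rng.normal(size=(6, 4))
scores = importance_scores(w_in, w_out, adj)
ranking = np.argsort(-scores)              # most important feature first
print(ranking)
```

In practice `w_in` and `w_out` would be taken from a trained network rather than drawn at random, and the top-ranked indices would be mapped back to OTU names.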
Figure 2. Graph embedding deep feedforward network (GEDFN). The graph embedding layer (first hidden layer) has the same number of neurons as the input layer. The sparse connections between the input layer and the first hidden layer are marked in black. The other hidden layers are fully connected.
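One way to picture the sparse first layer is as an ordinary dense layer whose weight matrix is multiplied element-wise by the feature graph's adjacency matrix. The sketch below assumes ReLU activations, a four-feature chain graph, and all-ones weights purely for illustration; none of these choices come from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gedfn_forward(x, adj, w_in, b_in, w_hid, b_hid):
    """Forward pass through a graph embedding layer and one dense layer.

    The mask `adj` zeroes every input-to-hidden weight that has no edge
    in the feature graph, so the first layer is sparsely connected.
    """
    h1 = relu(x @ (w_in * adj) + b_in)   # graph embedding layer (p neurons)
    h2 = relu(h1 @ w_hid + b_hid)        # ordinary fully connected layer
    return h1, h2

# Hypothetical chain graph 0-1-2-3 with self-loops.
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=float)

x = np.array([[1.0, 2.0, 3.0, 4.0]])     # one sample, four features
w_in, b_in = np.ones((4, 4)), np.zeros(4)
w_hid, b_hid = np.ones((4, 2)), np.zeros(2)

h1, h2 = gedfn_forward(x, adj, w_in, b_in, w_hid, b_hid)
print(h1)  # each neuron sums only its graph neighbours: [[3. 6. 9. 7.]]
print(h2)  # [[25. 25.]]
```

Because the mask removes most input-to-hidden weights, the number of trainable parameters in the first layer scales with the number of edges rather than with the square of the number of features.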
Figure 3. The area under the receiver operating characteristic curve (AUC, left) and classification accuracy (right) for GEDFN, Deep Forest (DF), Random Forest (RF) and Support Vector Machines (SVM). Left: the grey dashed line is the chance level, located on the diagonal (AUC = 0.5). The maximum AUC = 1 means the classifier discriminates diseased from non-diseased subjects perfectly, while AUC = 0 means the classifier classifies all diseased subjects as negative and all non-diseased subjects as positive. The AUC is averaged over five-fold cross-validation. Right: boxplot of the classifiers' classification accuracy.
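The three reference values of the AUC (1, 0.5 and 0) are easy to reproduce with a small pure-Python sketch. It uses the rank interpretation of the AUC, namely the probability that a randomly chosen diseased subject receives a higher score than a randomly chosen non-diseased one, with ties counted as one half. The function names and the toy labels and scores are illustrative assumptions, not the authors' evaluation code.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc(folds):
    """Average the AUC over cross-validation folds of (labels, scores)."""
    return sum(auc(y, s) for y, s in folds) / len(folds)

labels = [0, 0, 1, 1]
print(auc(labels, [0.1, 0.2, 0.8, 0.9]))  # perfect separation: 1.0
print(auc(labels, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores: 0.5
print(auc(labels, [0.9, 0.8, 0.2, 0.1]))  # fully inverted ranking: 0.0
```

With five folds, `mean_auc` would be called on a list of five `(labels, scores)` pairs, one per held-out fold.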
Figure 4. Feature selection based on the Graph Embedding Deep Feedforward Network (GEDFN). Venn diagram of the top 50 features selected via minimum Redundancy and Maximum Relevance (mRMR), Random Forest (RF), Deep Forest (DF) and GEDFN.
Performance of GEDFN + SVM, RF + SVM, GEDFN + DF and RF + DF.
| # | GEDFN + SVM P | GEDFN + SVM R | GEDFN + SVM F1 | RF + SVM P | RF + SVM R | RF + SVM F1 | GEDFN + DF P | GEDFN + DF R | GEDFN + DF F1 | RF + DF P | RF + DF R | RF + DF F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.733 | 1 |  | 0.675 | 1 | 0.806 | 0.733 | 1 |  | 0.785 | 0.871 | 0.825 |
| 15 | 0.745 | 1 |  | 0.675 | 1 | 0.806 | 0.745 | 1 |  | 0.722 | 0.909 | 0.800 |
| 20 | 0.752 | 1 |  | 0.675 | 1 | 0.806 | 0.750 | 0.991 | 0.854 | 0.717 | 0.927 | 0.805 |
| 25 | 0.706 | 1 | 0.828 | 0.675 | 1 | 0.806 | 0.705 | 0.991 | 0.824 | 0.765 | 0.907 |  |
| 30 | 0.707 | 1 |  | 0.675 | 1 | 0.806 | 0.707 | 0.983 | 0.823 | 0.718 | 0.957 | 0.821 |
| 35 | 0.698 | 1 |  | 0.675 | 1 | 0.806 | 0.698 | 1 |  | 0.692 | 0.977 | 0.810 |
| 40 | 0.704 | 1 |  | 0.675 | 1 | 0.806 | 0.709 | 0.985 | 0.824 | 0.706 | 0.962 | 0.813 |
| 45 | 0.707 | 1 |  | 0.675 | 1 | 0.806 | 0.707 | 1 |  | 0.687 | 0.991 | 0.811 |
| 50 | 0.697 | 1 |  | 0.675 | 1 | 0.806 | 0.697 | 1 |  | 0.695 | 0.974 | 0.810 |
#, number of top features; P, precision; R, recall; $F_1 = \frac{2PR}{P + R}$. The best F1 scores are marked in bold.
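F1 is conventionally the harmonic mean of precision and recall, so any F1 entry can be checked from its P and R columns. The short sketch below does this for two rows whose F1 values are reported in the table; the function name is illustrative, and because the tabulated P and R are already rounded, a recomputed value can differ from a reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# RF + SVM, any row: P = 0.675, R = 1
print(round(f1_score(0.675, 1.0), 3))  # 0.806

# GEDFN + SVM, top 25 features: P = 0.706, R = 1
print(round(f1_score(0.706, 1.0), 3))  # 0.828
```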
Figure 5. The cophenetic distance for the top 50 features selected via Random Forest (RF), Deep Forest (DF) and Graph Embedding Deep Feedforward Network (GEDFN), respectively (the value reported is the sum of the features' pairwise cophenetic distances). The cophenetic distance of two objects is a measure of how similar those two objects have to be in order to be grouped into the same cluster.
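To make the definition concrete, the sketch below computes pairwise cophenetic distances and their sum for a toy distance matrix. It assumes single-linkage clustering, where the cophenetic distance between two objects equals the smallest "largest step" on any path between them; the linkage choice, the function names, and the three-point example are assumptions for illustration, and the tree actually used for the figure may differ.

```python
def single_linkage_cophenetic(dist):
    """Pairwise cophenetic distances under single-linkage clustering.

    Uses a Floyd-Warshall style update: the height at which i and j
    merge is the minimum over paths of the largest edge on the path.
    """
    n = len(dist)
    coph = [row[:] for row in dist]      # copy so the input is untouched
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(coph[i][k], coph[k][j])
                if via_k < coph[i][j]:
                    coph[i][j] = via_k
    return coph

def total_cophenetic(dist):
    """Sum of cophenetic distances over all unordered pairs."""
    coph = single_linkage_cophenetic(dist)
    n = len(coph)
    return sum(coph[i][j] for i in range(n) for j in range(i + 1, n))

# Three features at positions 0, 1 and 3 on a line.
dist = [[0, 1, 3],
        [1, 0, 2],
        [3, 2, 0]]
print(single_linkage_cophenetic(dist))  # [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
print(total_cophenetic(dist))           # 5
```

Here the first two features merge at height 1 and the third joins them at height 2, so the summed cophenetic distance is 1 + 2 + 2 = 5. A smaller sum indicates that the selected features sit closer together in the tree.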
Figure 6. The top 20 species selected via the Graph Embedding Deep Feedforward Network (GEDFN). Species marked with a red circle have higher relative abundance in the diseased group, while species marked with a blue star have lower relative abundance. The species are visualized on the phylogenetic tree.