Rupesh Agrahari, Amir Foroushani, T. Roderick Docking, Linda Chang, Gerben Duns, Monika Hudoba, Aly Karsan, Habil Zare.
Abstract
Network analysis is the preferred approach for the detection of subtle but coordinated changes in expression of an interacting and related set of genes. We introduce a novel method based on the analyses of coexpression networks and Bayesian networks, and we use this new method to classify two types of hematological malignancies, namely acute myeloid leukemia (AML) and myelodysplastic syndrome (MDS). Our classifier has an accuracy of 93%, a precision of 98%, and a recall of 90% on the training dataset (n = 366), which outperforms the results reported by other scholars on the same dataset. Although our training dataset consists of microarray data, our model has a remarkable performance on the RNA-Seq test dataset (n = 74, accuracy = 89%, precision = 88%, recall = 98%), which confirms that eigengenes are robust with respect to expression profiling technology. These signatures are useful in classification and in correctly predicting the diagnosis. They might also provide valuable information about the underlying biology of diseases. Our network analysis approach is generalizable and can be useful for classifying other diseases based on gene expression profiles. Our previously published Pigengene package, which can be used to conveniently fit a Bayesian network to gene expression data, is publicly available through Bioconductor.
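The abstract's performance figures follow the standard definitions of accuracy, precision, and recall with AML as the positive class. A minimal sketch of those metrics (the label vectors below are made-up toy data, not the MILE or BCCA cohorts):

```python
# Toy sketch of accuracy/precision/recall with AML as the positive class.
# The label lists are illustrative, not real study data.
def metrics(y_true, y_pred, positive="AML"):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive != t)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive != p)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

y_true = ["AML", "AML", "MDS", "MDS", "AML"]
y_pred = ["AML", "MDS", "MDS", "MDS", "AML"]
m = metrics(y_true, y_pred)
```

Here one AML case is missed, so recall drops to 2/3 while precision stays at 1.0.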
Year: 2018 PMID: 29725024 PMCID: PMC5934387 DOI: 10.1038/s41598-018-24758-5
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Schematic view of the methodology. (A) The input is the gene expression profile (matrix). (B) We applied WGCNA to build the coexpression network and to identify gene modules (clusters). (C) PCA is used to summarize the biological information of each gene module into an eigengene. (D) A BN is fitted to the eigengenes to delineate the relationships between modules. We also used the fitted BN as a probabilistic predictive model. The tools used for each step are highlighted in red.
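The summarization step in panel (C) reduces each module to one value per sample: the projection onto the module's first principal component. A pure-Python sketch of that idea (this is not the authors' Pigengene/WGCNA code; the power-iteration routine and toy matrix are illustrative assumptions):

```python
# Sketch: an eigengene as the first principal component of a module's
# expression submatrix, found by power iteration on X^T X.
def eigengene(module, iters=200):
    """module: list of samples, each a list of expression values (genes)."""
    n, g = len(module), len(module[0])
    means = [sum(row[j] for row in module) / n for j in range(g)]
    x = [[row[j] - means[j] for j in range(g)] for row in module]  # center genes
    v = [1.0] * g
    for _ in range(iters):
        xv = [sum(x[i][j] * v[j] for j in range(g)) for i in range(n)]   # X v
        w = [sum(x[i][j] * xv[i] for i in range(n)) for j in range(g)]   # X^T X v
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    # Eigengene: one summary value per sample (projection onto the first PC).
    return [sum(x[i][j] * v[j] for j in range(g)) for i in range(n)]
```

For a module whose genes rise and fall together, the eigengene preserves the samples' ordering, which is why it can stand in for the whole module downstream.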
Figure 2. Expression of eigengenes in the MILE dataset. Each row corresponds to a sample. Modules (columns) are clustered based on the similarity of expression in the MILE dataset. The majority of eigengenes show a different pattern of expression in the two diseases. The green strip at the top shows the adjusted p-values of Welch's t-tests in logarithmic scale (base 10). The adjusted p-values are on the order of 10⁻⁶⁰ to 10⁻¹⁰, which indicates that the eigengenes are highly discriminative features.
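The per-eigengene tests behind the green strip compare one eigengene's values between the AML and MDS groups, allowing unequal variances. A sketch of the Welch t-statistic and its Welch-Satterthwaite degrees of freedom (converting t to a p-value needs the t-distribution CDF, which is omitted here; the input vectors are toy data):

```python
import math

# Sketch: Welch's t-statistic (unequal variances) for one eigengene,
# comparing its values in two groups, e.g. AML vs. MDS samples.
def welch_t(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)
    t = (ma - mb) / se
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
    )
    return t, df
```

Large |t| with hundreds of samples per group yields the tiny adjusted p-values reported in the figure.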
Figure 3. Consensus BN structures. Each yellow node represents an eigengene. The Effect node is a binary variable that models the disease type. Its parents are denoted by red circles. The directed edges (arcs) model the probabilistic dependencies between nodes. Although these consensus networks are obtained from 500 (A) and 5,000 (B) BNs, they have fairly similar structures.
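A consensus structure of the kind shown in Figure 3 keeps only the arcs that recur across many individually fitted BNs. A sketch of that aggregation (the module names `ME1`-`ME3`, the 0.5 threshold, and the tiny arc sets are illustrative assumptions, not the study's actual networks):

```python
from collections import Counter

# Sketch: pool the directed arcs of many fitted BN structures and keep
# those appearing in at least `threshold` of the networks.
def consensus_arcs(networks, threshold=0.5):
    counts = Counter(arc for net in networks for arc in net)
    n = len(networks)
    return {arc for arc, c in counts.items() if c / n >= threshold}

nets = [
    {("ME1", "Effect"), ("ME2", "ME1")},
    {("ME1", "Effect"), ("ME3", "Effect")},
    {("ME1", "Effect"), ("ME2", "ME1")},
]
kept = consensus_arcs(nets, threshold=0.5)
```

Arcs seen in most bootstrapped fits survive; rare arcs (here `ME3 -> Effect`, in 1 of 3 networks) are dropped, which is why the 500- and 5,000-network consensus structures look similar.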
Performance of predictions made by individual models on the training and validation partitions of the MILE dataset, which has 202 positive (AML) and 164 negative (MDS) samples.
| Model | Training accuracy | Training precision | Training recall | Validation accuracy | Validation precision | Validation recall |
|---|---|---|---|---|---|---|
| Model 1 | 96.9 | 95.7 | 98.7 | 78.4 | 71.8 | 84.8 |
| Model 2 | 92.8 | 89.6 | 97.3 | 89.2 | 84.2 | 94.1 |
| Model 3 | 91.5 | 88.8 | 87.4 | 86.5 | 80.1 | 94.5 |
| Model 4 | 88.0 | 83.1 | 94.3 | 94.6 | 95.2 | 95.2 |
| Model 5 | 96.6 | 96.9 | 96.9 | 91.5 | 95.1 | 90.1 |
Each model was trained using a subsample from the MILE dataset comprising four-fifths of the training cases (Supplementary File S2).
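The subsampling scheme above, drawing a random four-fifths of the training cases for each model, can be sketched as follows (sample indices, seeds, and the `subsample` helper are illustrative assumptions, not the authors' exact procedure):

```python
import random

# Sketch: each of the five models sees a random four-fifths of the
# 366 MILE training cases; integer indices stand in for samples.
def subsample(indices, fraction=0.8, seed=0):
    rng = random.Random(seed)
    k = round(fraction * len(indices))
    return sorted(rng.sample(indices, k))   # without replacement

train = list(range(366))                    # 366 MILE training samples
folds = [subsample(train, seed=s) for s in range(5)]
```

Training each model on a different subsample decorrelates their errors, which is what makes the later majority vote useful.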
Performance of predictions on the BCCA dataset, which has 52 positive (AML) and 22 negative (MDS) samples.
| Model | Accuracy | Precision | Recall |
|---|---|---|---|
| Model 1 | 90.5 | 92.5 | 94.2 |
| Model 2 | 71.6 | 84.4 | 73.1 |
| Model 3 | 86.5 | 86.2 | 96.2 |
| Model 4 | 87.8 | 85.3 | 100 |
| Model 5 | 83.8 | 85.7 | 92.3 |
The majority vote performs better than the individual BN models. Each model was trained using a subsample from the MILE dataset comprising four-fifths of the training cases (Supplementary File S2).
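The ensemble step is a simple per-sample majority vote over the five models' class calls; with an odd number of binary classifiers, ties cannot occur. A sketch (the three-model, three-sample prediction lists are toy data):

```python
# Sketch: combine several models' AML/MDS calls by per-sample majority vote.
def majority_vote(predictions):
    """predictions: list of per-model lists of 'AML'/'MDS' calls."""
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        calls = [model[i] for model in predictions]
        voted.append(max(set(calls), key=calls.count))  # most frequent call
    return voted

preds = [
    ["AML", "MDS", "AML"],
    ["AML", "AML", "MDS"],
    ["MDS", "MDS", "AML"],
]
combined = majority_vote(preds)   # -> ["AML", "MDS", "AML"]
```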
Performance of SVM classifiers.
| Classifier | Accuracy | Precision | Recall |
|---|---|---|---|
| Radial using 33 genes | 67.7 | 67.7 | 100 |
| Radial using 600 genes | 67.7 | 67.7 | 100 |
| Linear using eigengenes | 77 | 84.6 | 83 |
| Polynomial using eigengenes | 75.7 | 94.2 | 76.6 |
These SVM classifiers were trained using the 366 samples in the MILE dataset and were tested using the 52 AML and 22 MDS samples from the BCCA dataset. Among all kernels used on eigengenes, the Gaussian radial has the best performance, as expected [36], which is comparable with the majority vote of BN models. We used the polynomial kernel with degree 3 (e1071's default value).
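The three kernel families compared above differ only in how they score the similarity of two feature vectors. A sketch of the kernel functions themselves (not e1071's implementation; the `gamma` and `coef0` values are illustrative defaults):

```python
import math

# Sketch of the kernels compared in the SVM experiments.
def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))          # plain dot product

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=0.0):
    return (gamma * linear_kernel(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))     # squared distance
    return math.exp(-gamma * d2)                     # Gaussian radial basis
```

The linear kernel draws a flat separating hyperplane in eigengene space; the degree-3 polynomial and Gaussian radial kernels allow progressively more flexible boundaries.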
Figure 4. ROC curves. The predictions from the Bayesian network approach (red) lead to the highest AUC. The curve corresponding to the SVM predictions (green) is close to the best curve when eigengenes are used as features.
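The AUC summarized in Figure 4 equals the probability that a randomly chosen positive (AML) sample receives a higher score than a randomly chosen negative (MDS) sample. A rank-based sketch (the score and label vectors are toy data, not the study's predictions):

```python
# Sketch: AUC as the fraction of positive/negative pairs ranked correctly,
# with ties counted as half.
def auc(scores, labels, positive="AML"):
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3]
labels = ["AML", "MDS", "AML", "MDS"]
a = auc(scores, labels)   # 3 of 4 pos/neg pairs ranked correctly -> 0.75
```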
Parents of the Effect node in the 5 BNs that were fitted to the MILE dataset (Supplementary File S2).
| Module | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Frequency | 4 | 4 | 3 | 2 | 2 | 1 | 1 | 1 | 1 |