Mostafa Abbas1, John Matta2, Thanh Le3, Halima Bensmail1, Tayo Obafemi-Ajayi3, Vasant Honavar4, Yasser El-Manzalawy4,5.
Abstract
Reliable identification of inflammatory bowel disease (IBD) biomarkers from metagenomics data is a promising direction for developing non-invasive, cost-effective, and rapid clinical tests for early diagnosis of IBD. We present an integrative approach to Network-Based Biomarker Discovery (NBBD) that combines network analysis methods for prioritizing potential biomarkers with machine learning techniques for assessing the discriminative power of the prioritized biomarkers. Using a large dataset of metagenomics biopsy samples from new-onset pediatric IBD, we compare the performance of Random Forest (RF) classifiers trained on features selected by a representative set of traditional feature selection methods against the NBBD framework, configured using five different tools for inferring networks from metagenomics data and nine different methods for prioritizing biomarkers, as well as a hybrid approach combining the best traditional and NBBD-based feature selection. We also examine how the performance of the predictive models for IBD diagnosis varies as a function of the size of the data used for biomarker identification. Our results show that (i) NBBD is competitive with some of the state-of-the-art feature selection methods, including Random Forest Feature Importance (RFFI) scores; and (ii) NBBD is especially effective in reliably identifying IBD biomarkers when the number of data samples available for biomarker discovery is small.
Year: 2019 PMID: 31756219 PMCID: PMC6874333 DOI: 10.1371/journal.pone.0225382
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Performance of the top (in terms of highest AUC and smallest number of selected features) performing RF classifiers for different choices of feature selection data set and traditional feature selection methods.
| FSDS | FSM | # Features | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|---|
| DS50 | None | NA | 0.66 | 0.64 | 0.75 | 0.31 | 0.74 |
| DS50 | IG | 60 | 0.65 | 0.62 | 0.78 | 0.32 | 0.76 |
| DS50 | FStat | 60 | 0.63 | 0.64 | 0.62 | 0.21 | 0.69 |
| DS50 | RFE | 40 | 0.69 | 0.66 | 0.78 | 0.36 | 0.79 |
| DS50 | RFFI | 50 | 0.68 | 0.65 | 0.82 | 0.38 | 0.80 |
| DS100 | None | NA | 0.66 | 0.64 | 0.75 | 0.31 | 0.74 |
| DS100 | IG | 60 | 0.65 | 0.62 | 0.74 | 0.29 | 0.75 |
| DS100 | FStat | 20 | 0.68 | 0.66 | 0.72 | 0.32 | 0.74 |
| DS100 | RFE | 50 | 0.66 | 0.62 | 0.81 | 0.35 | 0.78 |
| DS100 | RFFI | 40 | 0.68 | 0.65 | 0.80 | 0.37 | 0.79 |
| DS200 | None | NA | 0.66 | 0.64 | 0.75 | 0.31 | 0.74 |
| DS200 | IG | 20 | 0.69 | 0.68 | 0.73 | 0.34 | 0.79 |
| DS200 | FStat | 50 | 0.68 | 0.67 | 0.72 | 0.31 | 0.75 |
| DS200 | RFE | 60 | 0.65 | 0.62 | 0.76 | 0.30 | 0.78 |
| DS200 | RFFI | 20 | 0.67 | 0.63 | 0.81 | 0.36 | 0.79 |
| DS300 | None | NA | 0.66 | 0.64 | 0.75 | 0.31 | 0.74 |
| DS300 | IG | 30 | 0.69 | 0.66 | 0.80 | 0.38 | 0.80 |
| DS300 | FStat | 60 | 0.68 | 0.67 | 0.75 | 0.34 | 0.76 |
| DS300 | RFE | 60 | 0.68 | 0.65 | 0.80 | 0.36 | 0.79 |
| DS300 | RFFI | 30 | 0.68 | 0.64 | 0.81 | 0.37 | 0.79 |
| DS400 | None | NA | 0.66 | 0.64 | 0.75 | 0.31 | 0.74 |
| DS400 | IG | 60 | 0.64 | 0.61 | 0.73 | 0.28 | 0.75 |
| DS400 | FStat | 40 | 0.70 | 0.69 | 0.72 | 0.34 | 0.76 |
| DS400 | RFE | 60 | 0.64 | 0.62 | 0.73 | 0.28 | 0.76 |
| DS400 | RFFI | 20 | 0.69 | 0.68 | 0.76 | 0.36 | 0.80 |
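The ACC, Sn, Sp, and MCC columns are standard binary-classification summaries: accuracy, sensitivity, specificity, and the Matthews correlation coefficient. As a minimal sketch of how they follow from confusion-matrix counts, the snippet below uses a hypothetical helper `binary_metrics` and made-up counts; neither comes from the study, and the sketch assumes both classes are present in the evaluation data.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return ACC, Sn, Sp and MCC from confusion-matrix counts.

    Hypothetical helper for illustration; assumes tp + fn > 0 and tn + fp > 0.
    """
    total = tp + fn + tn + fp
    acc = (tp + tn) / total      # accuracy
    sn = tp / (tp + fn)          # sensitivity (true positive rate)
    sp = tn / (tn + fp)          # specificity (true negative rate)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Made-up counts for illustration only (not data from the paper).
metrics = binary_metrics(tp=64, fn=36, tn=30, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because MCC uses all four cells of the confusion matrix, it is less flattering than ACC when the two classes are imbalanced, which may explain why the MCC values in these tables are much lower than the ACC values.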
Fig 1. Overview of the NBBD framework.
A Feature Selection Dataset (FSDS), which is a subset of, or the same as, the training data set, is supplied as two OTU tables corresponding to two groups of metagenomics samples and is first used to construct two networks. The node importance scoring module compares topological properties of the nodes shared by the two networks and outputs scores that prioritize the input features. The top-ranked features are then used to train and evaluate a classifier.
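The pipeline in this caption can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: thresholded Pearson correlation stands in for the network inference tools named in the tables, node degree stands in for the topological properties, and the function names and toy OTU tables are hypothetical. It is not the authors' implementation.

```python
from itertools import combinations
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_network(otu_table, threshold=0.8):
    """Build an adjacency map from an OTU table (OTU name -> abundances per sample).

    Assumption: an edge links two OTUs when |r| >= threshold.
    """
    adj = {otu: set() for otu in otu_table}
    for a, b in combinations(otu_table, 2):
        if abs(pearson(otu_table[a], otu_table[b])) >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def degree_difference_scores(net_a, net_b):
    """Score each shared node by how much its degree differs between the networks."""
    shared = net_a.keys() & net_b.keys()
    return {n: abs(len(net_a[n]) - len(net_b[n])) for n in shared}


def top_features(scores, k):
    """Return the k highest-scoring features; ties are broken by name."""
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


# Toy OTU tables for the two sample groups (made-up values).
control = {"otu1": [1, 2, 3, 4], "otu2": [2, 4, 6, 8], "otu3": [5, 1, 4, 2]}
disease = {"otu1": [1, 2, 3, 4], "otu2": [4, 1, 3, 2], "otu3": [2, 4, 6, 8]}

scores = degree_difference_scores(
    correlation_network(control), correlation_network(disease)
)
ranked = top_features(scores, k=2)
print(ranked)
```

In this toy example, `otu2` and `otu3` change their connectivity between the two groups while `otu1` does not, so they are the features that would be passed on to the classifier.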
Performance of top (in terms of highest AUC and smallest number of selected features) performing RF classifiers for combinations of different choices of Network Inference Method (NIM) and network-based feature selection using different properties for Node Topological Property Scoring.
All results were obtained using DS50 as the feature selection dataset.
| NIM | FSM | # Features | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|---|
| CoNet | and | 50 | 0.69 | 0.66 | 0.81 | 0.38 | 0.79 |
| CoNet | btw | 30 | 0.67 | 0.64 | 0.78 | 0.34 | 0.78 |
| CoNet | cc | 60 | 0.67 | 0.65 | 0.76 | 0.33 | 0.77 |
| CoNet | cls | 60 | 0.66 | 0.63 | 0.77 | 0.32 | 0.74 |
| CoNet | cn | 60 | 0.67 | 0.64 | 0.79 | 0.35 | 0.78 |
| CoNet | ncn | 60 | 0.66 | 0.64 | 0.73 | 0.31 | 0.75 |
| MB | and | 50 | 0.64 | 0.63 | 0.69 | 0.26 | 0.71 |
| MB | btw | 50 | 0.65 | 0.62 | 0.77 | 0.31 | 0.74 |
| MB | cc | 50 | 0.61 | 0.59 | 0.72 | 0.24 | 0.71 |
| MB | cls | 50 | 0.63 | 0.61 | 0.69 | 0.24 | 0.72 |
| MB | cn | 60 | 0.62 | 0.61 | 0.67 | 0.23 | 0.69 |
| MB | ncn | 60 | 0.62 | 0.57 | 0.80 | 0.30 | 0.75 |
| Proxi | and | 60 | 0.64 | 0.61 | 0.76 | 0.30 | 0.75 |
| Proxi | btw | 50 | 0.69 | 0.67 | 0.78 | 0.36 | 0.77 |
| Proxi | cc | 60 | 0.62 | 0.61 | 0.68 | 0.24 | 0.70 |
| Proxi | cls | 50 | 0.57 | 0.53 | 0.72 | 0.20 | 0.67 |
| Proxi | cn | 40 | 0.65 | 0.62 | 0.77 | 0.31 | 0.78 |
| Proxi | ncn | 60 | 0.67 | 0.65 | 0.75 | 0.33 | 0.77 |
| RMT | and | 50 | 0.66 | 0.64 | 0.75 | 0.32 | 0.78 |
| RMT | btw | 50 | 0.67 | 0.64 | 0.79 | 0.35 | 0.78 |
| RMT | cc | 20 | 0.68 | 0.66 | 0.78 | 0.35 | 0.78 |
| RMT | cls | 60 | 0.67 | 0.65 | 0.78 | 0.35 | 0.77 |
| RMT | cn | 60 | 0.68 | 0.66 | 0.75 | 0.33 | 0.77 |
| RMT | ncn | 60 | 0.68 | 0.66 | 0.76 | 0.34 | 0.78 |
| SparCC | and | 60 | 0.61 | 0.57 | 0.73 | 0.25 | 0.69 |
| SparCC | btw | 60 | 0.68 | 0.66 | 0.75 | 0.34 | 0.75 |
| SparCC | cc | 40 | 0.60 | 0.57 | 0.71 | 0.23 | 0.70 |
| SparCC | cls | 60 | 0.66 | 0.65 | 0.72 | 0.29 | 0.73 |
| SparCC | cn | 50 | 0.66 | 0.64 | 0.72 | 0.29 | 0.72 |
| SparCC | ncn | 60 | 0.63 | 0.60 | 0.74 | 0.28 | 0.71 |
Performance (highest AUC attained with the smallest number of features chosen) of the top performing RF classifiers for combinations of different choices of Network Inference Method (NIM) and network-based feature selection using three resilience measures for Critical Attack Set Scoring (CASS).
Results obtained using DS50 as the feature selection data set.
| NIM | FSM | # Features | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|---|
| CoNet | CASS_I | 21 | 0.66 | 0.64 | 0.75 | 0.31 | 0.77 |
| CoNet | CASS_T | 35 | 0.67 | 0.64 | 0.76 | 0.33 | 0.76 |
| CoNet | CASS_V | 21 | 0.66 | 0.64 | 0.75 | 0.31 | 0.77 |
| MB | CASS_I | 6 | 0.51 | 0.47 | 0.68 | 0.12 | 0.61 |
| MB | CASS_T | 33 | 0.57 | 0.53 | 0.73 | 0.21 | 0.65 |
| MB | CASS_V | NA | NA | NA | NA | NA | NA |
| Proxi | CASS_I | 11 | 0.65 | 0.63 | 0.72 | 0.28 | 0.72 |
| Proxi | CASS_T | 39 | 0.65 | 0.64 | 0.70 | 0.27 | 0.73 |
| Proxi | CASS_V | 1 | 0.25 | 0.07 | 0.96 | 0.04 | 0.51 |
| RMT | CASS_I | 8 | 0.49 | 0.46 | 0.58 | 0.03 | 0.52 |
| RMT | CASS_T | 12 | 0.56 | 0.52 | 0.72 | 0.19 | 0.64 |
| RMT | CASS_V | 3 | 0.64 | 0.64 | 0.61 | 0.21 | 0.62 |
| SparCC | CASS_I | 117 | 0.66 | 0.64 | 0.74 | 0.31 | 0.76 |
| SparCC | CASS_T | 125 | 0.66 | 0.64 | 0.72 | 0.30 | 0.76 |
| SparCC | CASS_V | NA | NA | NA | NA | NA | NA |
Performance of the top performing RF classifiers (with the highest AUC and using the smallest number of features) for combinations of different choices of Network Inference Method (NIM) and hybrid feature selection based on RFFI and different properties for Node Topological Property Scoring.
All results were obtained using DS50 as the feature selection dataset.
| NIM | FS Method | # Features | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|---|
| CoNet | and | 30 | 0.68 | 0.65 | 0.79 | 0.36 | 0.79 |
| CoNet | btw | 20 | 0.65 | 0.61 | 0.79 | 0.33 | 0.78 |
| CoNet | cc | 60 | 0.66 | 0.64 | 0.74 | 0.31 | 0.74 |
| CoNet | cls | 40 | 0.62 | 0.59 | 0.77 | 0.29 | 0.74 |
| CoNet | cn | 40 | 0.66 | 0.64 | 0.75 | 0.32 | 0.76 |
| CoNet | ncn | 40 | 0.67 | 0.65 | 0.75 | 0.32 | 0.76 |
| MB | and | 20 | 0.73 | 0.72 | 0.76 | 0.40 | 0.82 |
| MB | btw | 40 | 0.66 | 0.64 | 0.78 | 0.33 | 0.78 |
| MB | cc | 20 | 0.66 | 0.62 | 0.79 | 0.34 | 0.77 |
| MB | cls | 40 | 0.65 | 0.64 | 0.72 | 0.29 | 0.77 |
| MB | cn | 10 | 0.69 | 0.68 | 0.74 | 0.34 | 0.76 |
| MB | ncn | 20 | 0.65 | 0.61 | 0.80 | 0.34 | 0.79 |
| Proxi | and | 50 | 0.68 | 0.66 | 0.77 | 0.35 | 0.78 |
| Proxi | btw | 30 | 0.69 | 0.66 | 0.82 | 0.39 | 0.79 |
| Proxi | cc | 50 | 0.65 | 0.62 | 0.77 | 0.31 | 0.78 |
| Proxi | cls | 50 | 0.67 | 0.63 | 0.83 | 0.37 | 0.79 |
| Proxi | cn | 40 | 0.62 | 0.60 | 0.70 | 0.24 | 0.73 |
| Proxi | ncn | 40 | 0.68 | 0.65 | 0.80 | 0.36 | 0.79 |
| RMT | and | 60 | 0.68 | 0.64 | 0.80 | 0.36 | 0.79 |
| RMT | btw | 40 | 0.64 | 0.60 | 0.80 | 0.32 | 0.78 |
| RMT | cc | 40 | 0.69 | 0.65 | 0.81 | 0.38 | 0.82 |
| RMT | cls | 50 | 0.69 | 0.66 | 0.80 | 0.37 | 0.80 |
| RMT | cn | 40 | 0.64 | 0.60 | 0.80 | 0.32 | 0.76 |
| RMT | ncn | 50 | 0.68 | 0.65 | 0.81 | 0.37 | 0.80 |
| SparCC | and | 30 | 0.67 | 0.64 | 0.78 | 0.34 | 0.80 |
| SparCC | btw | 40 | 0.70 | 0.68 | 0.78 | 0.37 | 0.79 |
| SparCC | cc | 30 | 0.66 | 0.63 | 0.79 | 0.34 | 0.78 |
| SparCC | cls | 30 | 0.67 | 0.64 | 0.80 | 0.36 | 0.78 |
| SparCC | cn | 50 | 0.67 | 0.63 | 0.82 | 0.36 | 0.80 |
| SparCC | ncn | 40 | 0.66 | 0.62 | 0.81 | 0.34 | 0.79 |
Performance comparison of top three RF classifiers obtained using traditional feature selection and hybrid feature selection methods.
| NIM | FSDS | FS Method | # Features | ACC | Sn | Sp | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| NA | DS400 | RFFI | 20 | 0.69 | 0.68 | 0.76 | 0.36 | 0.80 |
| MB | DS50 | RFFI × and | 20 | 0.73 | 0.72 | 0.76 | 0.40 | 0.82 |
| RMT | DS50 | RFFI × cc | 40 | 0.69 | 0.65 | 0.81 | 0.38 | 0.82 |
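The "RFFI × and" and "RFFI × cc" labels suggest that the hybrid methods combine an RFFI score with a network-based node score. The exact combination rule is not stated in this excerpt, so the sketch below is only one plausible reading: it assumes both scores are min-max normalised and then multiplied per feature. The function names and the example scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {feature: score} map to [0, 1]; all zeros if scores are constant."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}


def hybrid_scores(rffi, network):
    """Assumed hybrid rule: product of normalised RFFI and network scores.

    Only features present in both score maps are kept.
    """
    shared = rffi.keys() & network.keys()
    r = min_max({k: rffi[k] for k in shared})
    n = min_max({k: network[k] for k in shared})
    return {k: r[k] * n[k] for k in shared}


# Made-up scores for three features (not values from the paper).
rffi = {"otu1": 0.50, "otu2": 0.30, "otu3": 0.10}
network = {"otu1": 1, "otu2": 4, "otu3": 3}

hybrid = hybrid_scores(rffi, network)
best = max(hybrid, key=hybrid.get)
print(best, round(hybrid[best], 2))
```

Under a product rule, a feature ranks highly only when both scores agree that it matters, so a feature that is top-ranked by one method but bottom-ranked by the other drops out.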
AUC scores for top three RF classifiers obtained using RFFI feature selection and two hybrid feature selection methods, MB_and and RMT_cc, using different feature selection datasets.
| FSDS | RFFI | MB_and | RMT_cc |
|---|---|---|---|
| DS50 | 0.76 | 0.82 | 0.82 |
| DS100 | 0.78 | 0.76 | 0.79 |
| DS200 | 0.79 | 0.75 | 0.77 |
| DS300 | 0.77 | 0.79 | 0.77 |
| DS400 | 0.80 | 0.78 | 0.75 |
Fig 2. Venn diagram of unique and shared features selected using RF Feature Importance (RFFI) and network-based feature selection applied to MB (RMT) networks using ‘and’ (‘cc’) for node importance scoring.