Lihua Cai, Honglong Wu, Dongfang Li, Ke Zhou, Fuhao Zou.
Abstract
Type 2 diabetes, a complex metabolic disease influenced by genetic and environmental factors, has become a worldwide problem. Previously published results focused on genetic components through genome-wide association studies, which interpret this disease only to some extent. Recently, two research groups published metagenome-wide association study (MGWAS) results that identified meta-biomarkers related to type 2 diabetes. However, one key problem in analyzing genomic data is how to deal with the ultra-high dimensionality of features. From a statistical viewpoint, it is challenging to filter out the true factors in high-dimensional data. Various methods and techniques have been proposed for this problem, but they achieve only limited prediction performance and poor interpretability. A new statistical procedure with higher performance and clear interpretability is appealing for analyzing high-dimensional data. To address this problem, we apply a powerful statistical variable selection procedure, iterative sure independence screening, to gene profiles obtained from metagenome sequencing; 48/24 meta-markers were selected in the Chinese/European cohorts as predictors, with accuracies of 0.97/0.99 in AUC (area under the curve), showing better performance than other model selection methods. These results demonstrate the power and utility of data mining technologies on large-scale, ultra-high-dimensional genomic datasets for identifying diagnostic and predictive markers.
Year: 2015 PMID: 26479726 PMCID: PMC4610706 DOI: 10.1371/journal.pone.0140827
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Data.
| Race | Male | Female | Age (mean/sd) | BMI (mean/sd) | T2D | Normal |
|---|---|---|---|---|---|---|
| Chinese | 190 | 154 | 47.58/14.51 | 27.11/4.55 | 170 | 174 |
| European | 0 | 145 | 70.41/0.71 | 23.35/3.48 | 102 | 43 |
Chinese and European gut microbiota datasets of type 2 diabetes (T2D) used in our work. Male/Female are sample counts by gender; T2D/Normal are sample counts by condition; sd, standard deviation; BMI, body mass index.
AUC obtained by mRMR and ensemble methods (Chinese).
| Classifier | Method | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM | mRMR | 0.75 | 0.73 | 0.74 | 0.76 | — | 0.72 | 0.72 | 0.73 | 0.73 | 0.71 |
| | Ensemble (lasso) | 0.79 | 0.84 | 0.85 | 0.89 | 0.90 | 0.89 | — | 0.91 | 0.88 | 0.90 |
| | Ensemble (Enet) | 0.79 | 0.85 | 0.86 | 0.90 | — | 0.91 | 0.88 | 0.89 | 0.90 | 0.90 |
| LR | mRMR | 0.75 | 0.75 | — | 0.75 | 0.75 | 0.75 | 0.72 | 0.70 | 0.71 | 0.72 |
| | Ensemble (lasso) | 0.80 | 0.83 | 0.86 | 0.88 | — | 0.88 | 0.85 | 0.82 | 0.77 | 0.79 |
| | Ensemble (Enet) | 0.79 | 0.85 | 0.86 | 0.90 | — | 0.89 | 0.86 | 0.79 | 0.79 | 0.82 |
| LDA | mRMR | 0.75 | 0.75 | — | 0.72 | 0.74 | 0.71 | 0.68 | 0.69 | 0.70 | 0.69 |
| | Ensemble (lasso) | 0.80 | 0.84 | 0.87 | 0.90 | 0.89 | 0.90 | 0.91 | — | 0.92 | 0.92 |
| | Ensemble (Enet) | 0.80 | 0.86 | 0.87 | 0.90 | 0.91 | 0.91 | 0.90 | 0.91 | — | 0.91 |
| NB | mRMR | 0.72 | 0.71 | 0.74 | 0.72 | 0.75 | 0.76 | 0.75 | 0.76 | 0.76 | — |
| | Ensemble (lasso) | 0.76 | 0.85 | 0.87 | 0.88 | 0.88 | 0.89 | 0.89 | — | 0.88 | 0.89 |
| | Ensemble (Enet) | 0.78 | 0.85 | 0.86 | 0.87 | — | 0.88 | 0.88 | 0.89 | 0.89 | 0.89 |
AUC of t-gene signatures (t from 10 to 100) obtained from each combination of the classification algorithms with mRMR, ensemble of lasso, and ensemble of elastic net, in 10-fold cross-validation on the Chinese dataset. For each combination of classifier and variable selection method, the best result with the minimum number of genes was highlighted; entries marked '—' (the highlighted cells) could not be recovered from the source record.
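All results in these tables are reported as AUC under 10-fold cross-validation. As a reference for how the metric behaves, here is a minimal, library-free sketch (not the authors' code) of AUC computed as the Mann-Whitney rank statistic: the probability that a randomly chosen case scores above a randomly chosen control, with ties counted as half.

```python
# AUC as the Mann-Whitney statistic: fraction of (case, control) pairs
# in which the case receives the higher classifier score (ties count 0.5).
def auc(scores, labels):
    """scores: classifier outputs; labels: 1 = T2D, 0 = normal."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # perfect ranking -> 1.0
print(auc([0.9, 0.3, 0.8, 0.2], [1, 0, 0, 1]))  # chance-level ranking -> 0.5
```

An AUC of 0.5 corresponds to random guessing, which is why the gains from roughly 0.75 (mRMR) to above 0.95 (ISIS-SCAD) in the tables are substantial.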
AUC obtained by ISIS-SCAD (Chinese).
| Classifier | 10 | 15 | 18 | 23 | 26 | 28 | 34 | 41 | 43 | 48 | 50 | 61 | 63 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM | 0.71 | 0.80 | 0.83 | 0.82 | 0.89 | 0.93 | 0.92 | 0.91 | 0.96 | **0.97** | 0.95 | 0.96 | 0.96 |
| LR | 0.73 | 0.81 | 0.85 | 0.83 | 0.90 | — | 0.91 | 0.92 | 0.92 | 0.89 | 0.86 | 0.89 | 0.88 |
| LDA | 0.68 | 0.79 | 0.84 | 0.80 | 0.88 | 0.92 | 0.92 | 0.92 | 0.95 | — | 0.95 | 0.96 | 0.95 |
| NB | 0.69 | 0.71 | 0.77 | 0.79 | 0.78 | 0.83 | 0.82 | 0.87 | 0.82 | 0.85 | 0.85 | — | 0.88 |
AUC for signature sizes in {10, 15, 18, 23, 26, 28, 34, 41, 43, 48, 50, 61, 63}, combined with four classification algorithms in 10-fold cross-validation on the Chinese dataset. For each classifier, the best result was highlighted (the 48-gene SVM value, 0.97, is restored from the abstract); entries marked '—' could not be recovered from the source record.
AUC obtained by mRMR and ensemble methods (European).
| Classifier | Method | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM | mRMR | 0.86 | — | 0.87 | 0.82 | 0.83 | 0.80 | 0.81 | 0.83 | 0.84 | 0.83 |
| | Ensemble (lasso) | 0.87 | 0.88 | 0.87 | 0.87 | 0.92 | — | 0.93 | 0.92 | 0.89 | 0.88 |
| | Ensemble (Enet) | 0.85 | 0.89 | — | 0.87 | 0.89 | 0.88 | 0.81 | 0.89 | 0.88 | 0.89 |
| LR | mRMR | 0.84 | — | 0.80 | 0.75 | 0.79 | 0.73 | 0.69 | 0.66 | 0.69 | 0.63 |
| | Ensemble (lasso) | — | 0.84 | 0.76 | 0.73 | 0.84 | 0.78 | 0.76 | 0.73 | 0.69 | 0.63 |
| | Ensemble (Enet) | 0.87 | — | 0.81 | 0.74 | 0.79 | 0.76 | 0.75 | 0.71 | 0.77 | 0.73 |
| LDA | mRMR | 0.80 | 0.84 | — | 0.80 | 0.78 | 0.76 | 0.75 | 0.69 | 0.67 | 0.64 |
| | Ensemble (lasso) | — | 0.85 | 0.83 | 0.79 | 0.77 | 0.74 | 0.74 | 0.74 | 0.71 | 0.67 |
| | Ensemble (Enet) | 0.86 | 0.88 | — | 0.79 | 0.76 | 0.79 | 0.75 | 0.76 | 0.77 | 0.69 |
| NB | mRMR | 0.79 | 0.76 | 0.78 | — | 0.78 | 0.77 | 0.76 | 0.77 | 0.77 | 0.76 |
| | Ensemble (lasso) | 0.85 | 0.87 | — | 0.87 | 0.84 | 0.83 | 0.85 | 0.81 | 0.77 | 0.75 |
| | Ensemble (Enet) | 0.88 | 0.89 | 0.88 | 0.88 | — | 0.89 | 0.89 | 0.88 | 0.87 | 0.90 |
AUC of t-gene signatures (t from 10 to 100) obtained from each combination of the classification algorithms with mRMR, ensemble of lasso, and ensemble of elastic net, in 10-fold cross-validation on the European dataset. For each combination of classifier and variable selection method, the best result with the minimum number of genes was highlighted; entries marked '—' (the highlighted cells) could not be recovered from the source record.
AUC obtained by ISIS-SCAD (European).
| Classifier | 4 | 11 | 15 | 22 | 24 | 26 | 27 | 28 | 29 | 32 | 34 | 35 | 36 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM | 0.77 | 0.91 | 0.87 | 0.97 | **0.99** | 0.97 | 0.98 | 0.97 | 0.94 | 0.96 | 0.94 | 0.92 | 0.94 |
| LR | 0.78 | 0.94 | 0.86 | 0.87 | — | 0.95 | 0.86 | 0.95 | 0.88 | 0.89 | 0.85 | 0.83 | 0.85 |
| LDA | 0.75 | 0.85 | 0.87 | 0.97 | — | 0.97 | 0.97 | 0.97 | 0.94 | 0.96 | 0.94 | 0.91 | 0.95 |
| NB | 0.74 | 0.81 | 0.86 | 0.90 | 0.92 | 0.83 | 0.89 | 0.83 | — | 0.92 | 0.93 | 0.88 | 0.90 |
AUC for signature sizes in {4, 11, 15, 22, 24, 26, 27, 28, 29, 32, 34, 35, 36}, combined with four classification algorithms in 10-fold cross-validation on the European dataset. For each classifier, the best result was highlighted (the 24-gene SVM value, 0.99, is restored from the abstract); entries marked '—' could not be recovered from the source record.
Fig 1. AUC.
AUC of the SVM classifier as a function of signature size, for mRMR, ensemble of lasso, and ensemble of elastic net, in a 10-fold cross-validation setting on the Chinese and European datasets respectively.
Fig 2. AUC obtained from the SVM classifier estimated on genes selected by ISIS-SCAD and ensemble feature selection.
Signature sizes in {10, 15, 18, 23, 26, 28, 34, 41, 43, 48, 50, 61, 63} on the Chinese dataset and in {4, 11, 15, 22, 24, 26, 27, 28, 29, 32, 34, 35, 36} on the European dataset, in a 10-fold cross-validation setting.
Fig 3. Averaged AUC obtained from the SVM classifier combined with three variable selection methods.
The SVM classifier was estimated as a function of sample size in a 50 × 10-fold cross-validation setting. We show the accuracy of the 60-gene ensemble feature selection signature and the 48-gene ISIS-SCAD signature on the Chinese dataset; for the European dataset, accuracy is computed on the 60-gene ensemble signature and the 24-gene ISIS-SCAD signature.
Results of simulated example I: accuracy of ISIS in including the true model {X1, X2, X3}.
| ρ \ p | — | — | — |
|---|---|---|---|
| — | 0.21 | 0.16 | 0.06 |
| — | 0.02 | 0.01 | 0 |
Accuracy of ISIS under different correlation ρ (rows) and dimensionality p (columns) settings with a nonlinear relationship. For each model, 100 datasets of 50 observations were simulated, and 20 variables were selected for computing the accuracy. The specific ρ and p values could not be recovered from the source record.
Results of simulated example II: accuracy of ISIS in including the true model {X1, X2, X3, X4}.
| ρ \ p | — | — | — |
|---|---|---|---|
| — | 1 | 1 | 0.97 |
| — | 0.98 | 0.97 | 0.81 |
Accuracy of ISIS under different correlation ρ (rows) and dimensionality p (columns) settings in the joint-contribution scenario. 100 datasets of 50 observations were simulated, and 20 variables were selected for computing the accuracy. The specific ρ and p values could not be recovered from the source record.