Suyan Tian, Howard H Chang, Chi Wang, Jing Jiang, Xiaomei Wang, Junqi Niu.
Abstract
BACKGROUND: Over the last decade, metabolomics has evolved into a mainstream enterprise used by many laboratories worldwide. Like other "omics" data, metabolomics data typically have far fewer samples than features. Selecting an optimal subset of features with a supervised classifier is therefore imperative. We extended an existing feature selection algorithm, threshold gradient descent regularization (TGDR), to handle multi-class classification of "omics" data, and proposed two such extensions, referred to as multi-TGDR. Both multi-TGDR frameworks were used to analyze a metabolomics dataset that compares the metabolic profiles of hepatocellular carcinoma (HCC) in patients infected with hepatitis B virus (HBV) or hepatitis C virus (HCV) with those of cirrhosis induced by HBV/HCV infection; the goal was to improve early-stage diagnosis of HCC.
Year: 2014 PMID: 24707821 PMCID: PMC4234477 DOI: 10.1186/1471-2105-15-97
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The flowchart of multi-TGDR. Global: multi-TGDR global; local: multi-TGDR local.
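The thresholded update at the core of TGDR-type methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it fits a multinomial logit model with class 0 as the reference, and at each iteration it moves only those coefficients whose gradient magnitude is at least `tau` times the largest gradient magnitude. The sketch assumes that the "global" variant takes that largest magnitude over all class comparisons, while the "local" variant takes it separately within each comparison. The names `multi_tgdr`, `tau`, `step`, `n_iter`, and `mode` are hypothetical.

```python
import math


def softmax_probs(x, beta):
    """Class probabilities with class 0 as the reference (its score is fixed at 0)."""
    scores = [0.0] + [sum(b * v for b, v in zip(row, x)) for row in beta]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_tgdr(X, y, n_classes, tau=0.8, step=0.5, n_iter=50, mode="global"):
    """Illustrative threshold gradient descent for a multinomial logit model.

    beta[c - 1][j] is the coefficient of feature j for class c versus class 0.
    mode="global": one reference gradient across all comparisons (assumption).
    mode="local":  one reference gradient per comparison (assumption).
    """
    n, p = len(X), len(X[0])
    beta = [[0.0] * p for _ in range(n_classes - 1)]
    for _ in range(n_iter):
        # Mean gradient of the log-likelihood with respect to each coefficient.
        grad = [[0.0] * p for _ in range(n_classes - 1)]
        for x, label in zip(X, y):
            probs = softmax_probs(x, beta)
            for c in range(1, n_classes):
                resid = (1.0 if label == c else 0.0) - probs[c]
                for j in range(p):
                    grad[c - 1][j] += resid * x[j] / n
        global_max = max(abs(g) for row in grad for g in row)
        if global_max == 0.0:
            break  # nothing left to update
        for c, row in enumerate(grad):
            ref = global_max if mode == "global" else max(abs(g) for g in row)
            for j, g in enumerate(row):
                # Threshold step: only near-maximal gradients move their coefficient.
                if ref > 0.0 and abs(g) >= tau * ref:
                    beta[c][j] += step * g
    return beta
```

In this sketch the iteration count `n_iter` plays the role of the tuning parameter k: stopping early leaves most coefficients at exactly zero, which is what yields feature selection.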
The comparison between Multi-TGDR frameworks and PLS-DA using simulated data
| Method | Number of metabolites selected | | | |
| A. First simulated data | | | | |
| Multi-TGDR global, no bagging | 105 | 0.76 | 0.0100 | 12.21 |
| Multi-TGDR global + bagging (freq > 30%) | 35 | 10.69 | 0.0773 | 12.21 |
| Multi-TGDR local, no bagging | 54 (14, 46)¹ | 2.29 | 0.0301 | 14.50 |
| Multi-TGDR local + bagging (freq > 40%) | 24 (14, 15) | 8.40 | 0.0539 | 12.98 |
| PLS-DA + naïve Bayes as a classifier | 89 | 14.50 | 0.1313 | 19.84 |
| B. Second simulated data | | | | |
| Multi-TGDR global, no bagging | 110 | 0 | 0.0165 | 11.45 |
| Multi-TGDR global + bagging (freq > 50%) | 21 | 3.82 | 0.0237 | 7.63 |
| Multi-TGDR local, no bagging | 106 (12, 95) | 0 | 0.0067 | 9.16 |
| Multi-TGDR local + bagging (freq > 40%) | 25 (9, 18) | 3.82 | 0.0254 | 8.40 |
| PLS-DA + naïve Bayes as a classifier | 97 | 6.87 | 0.1556 | 16.03 |
A. The performance of multi-TGDR frameworks and PLS-DA on the first simulated data. B. The performance of multi-TGDR and PLS-DA on the second simulated data.
¹ (No.1, No.2): No.1 is the number of metabolites selected in the first comparison (class 2 versus class 1) by multi-TGDR local; No.2 is the number selected in the second comparison (class 3 versus class 1).
Figure 2. The schema of the study.
Figure 3. Comparison of the cross-validation (CV)-determined tuning parameter k (the iteration number) between the two multi-TGDR frameworks. Global: multi-TGDR global; local: multi-TGDR local.
The predictive performance of Multi-TGDR frameworks and PLS-DA
| Method | Number of metabolites selected | | | |
| A. Whole data | | | | |
| Multi-TGDR global, no bagging | 45 | 0 | 4.24e-05 | 3.82 |
| Multi-TGDR global, bagging (freq > 40%) | 30 | 0 | 3.68e-05 | 3.82 |
| Multi-TGDR local, no bagging | 48 | 0 | 7.57e-05 | 5.34 |
| Multi-TGDR local, bagging (freq > 40%) | 29 | 0 | 5.97e-04 | 6.11 |
| B. Reduced data | | | | |
| Multi-TGDR global, no bagging | 42 | 0 | 1.03e-04 | 4.58 |
| Multi-TGDR global, bagging (freq > 25%) | 37 | 0 | 1.13e-04 | 4.58 |
| Multi-TGDR global, bagging (freq > 40%) | 26 | 0 | 3.58e-04 | 4.58 |
| Multi-TGDR local, no bagging | 42 | 0 | 6.18e-04 | 6.11 |
| Multi-TGDR local, bagging (freq > 25%) | 38 | 0 | 6.87e-04 | 5.34 |
| Multi-TGDR local, bagging (freq > 40%) | 25 | 0 | 2.24e-03 | 6.11 |
| C. PLS-DA | | | | |
| Naïve Bayes as the extra classifier | 42 | 4.58 | 4.63e-02 | 7.63 |
A. The performance of the multi-TGDR frameworks on the whole data (without moderated t-test filtering). B. The performance of the multi-TGDR frameworks on the reduced data (after t-test filtering, which removed 72 metabolites). C. The performance of PLS-DA with naïve Bayes as the classifier, using the 42 metabolites selected by the original analysis in Zhou's study [19].
Note: For the reduced data, the optimal bagging-frequency cutoff is 25%. However, to allow a fair comparison with the results from the whole data, we also analyzed the reduced data with a cutoff of 40%.
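The bagging-frequency cutoffs used above (for example, freq > 40%) can be illustrated with a short sketch. It assumes that the bagging frequency of a feature is the fraction of bootstrap resamples in which a feature selector picks that feature; the source does not spell out the exact procedure, so this is an assumption. `select_features` is a hypothetical callback that stands in for a fitted selector such as multi-TGDR, and `bagging_frequencies` and `stable_features` are illustrative names.

```python
import random
from collections import Counter


def bagging_frequencies(n_samples, select_features, n_boot=100, seed=0):
    """Fraction of bootstrap resamples in which each feature is selected.

    select_features(indices) is a hypothetical callback: given the resampled
    row indices, it returns the features chosen on that resample.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Draw n_samples row indices with replacement.
        indices = [rng.randrange(n_samples) for _ in range(n_samples)]
        # set() guards against a selector reporting a feature twice.
        counts.update(set(select_features(indices)))
    return {feature: count / n_boot for feature, count in counts.items()}


def stable_features(frequencies, cutoff=0.40):
    """Keep features whose bagging frequency strictly exceeds the cutoff."""
    return sorted(f for f, freq in frequencies.items() if freq > cutoff)
```

Raising the cutoff shrinks the final feature set, which matches the pattern in the table: the 40% cutoff keeps fewer metabolites than the 25% cutoff on the reduced data.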
Metabolites selected by the two multi-TGDR frameworks (the results of model 1_w and model 2_w)
| Category | m/z | | Model 1_w, first comparison | Model 1_w, second comparison | Model 2_w, first comparison | Model 2_w, second comparison | Metabolite |
| All | 191.04 | 0.64 | 0.6521 | -0.6011 | 0.7437 | -0.4971 | Beta-Lactose |
| | 240.08 | 14.79 | -0.2734 | 0.2258 | -0.2842 | 0.0267 | 1,1′-Ethylidenebistryptophan or 1-aminopyrene |
| | 582.24 | 22.42 | -0.4816 | 0.2596 | -0.4854 | 0.1428 | Glutaminyl-Methionine |
| Common | 91.36 | 8.47 | 0.1937 | -0.9035 | 0 | -0.4716 | Unknown |
| | 100.32 | 1.16 | 0.1965 | -0.6931 | 0 | -0.7513 | Unknown* |
| | 101.32 | 1.16 | 0.1209 | -0.4617 | 0 | -0.4617 | Unknown |
| | 139.1 | 8.65 | 0.1969 | -0.4631 | 0 | -0.4631 | Phosphorylcholine |
| | 218.08 | 0.89 | 0.2614 | -0.7598 | 0 | -0.9121 | Pregnenolone sulfate |
| | 255.96 | 1.07 | 0.041 | 0.2284 | 0 | 0.3122 | Isoxanthopterin |
| | 256.25 | 19.08 | -0.358 | 0.9489 | 0 | 0.9761 | Palmitic amide* |
| | 279.08 | 9.4 | 0.034 | -0.3269 | 0 | -0.4019 | Homocarnosine |
| | 361.18 | 19.64 | -0.0196 | 0.0538 | 0 | 0.1811 | Unknown |
| | 540.51 | 23.17 | 0.0211 | 0.2882 | 0 | 0.2898 | Unknown |
| | 599.25 | 9.78 | 0.5919 | 3.8635 | 0 | 4.1545 | Unknown |
| | 239.14 | 14.47 | -0.2198 | 0.0995 | -0.3349 | 0 | Phosphatidic acid |
| | 289.21 | 7.24 | -0.1335 | -0.0097 | -0.1347 | 0 | Neurosporene |
| | 356.37 | 15.91 | 0.3772 | -0.0728 | 0.2259 | 0 | Unknown |
| | 374.38 | 15.45 | 0.5133 | 0.0856 | 0.2179 | 0 | Unknown |
| | 375.39 | 15.46 | 1.4191 | 0.1173 | 1.3393 | 0 | Cholestanetriol or Unknown |
| | 402.42 | 17.55 | 0.2209 | 0.0424 | 0.4634 | 0 | 16(S)-hydroxy-18-oxo-18-CoA-LTE4 |
| | 585.27 | 9.09 | 1.435 | -0.1864 | 1.7132 | 0 | Conjugated bilirubin* |
| | 587.27 | 9.09 | 0.3402 | 0.1137 | 0.2372 | 0 | Conjugated bilirubin* |
| | 592.37 | 6.42 | 0.3208 | -0.0508 | 0.5229 | 0 | Unknown |
| | 633.25 | 10.31 | -0.7011 | 0.4138 | -0.7187 | 0 | Unknown |
| | 652.41 | 4.19 | 1.2071 | 0.4634 | 1.4441 | 0 | Ganglioside GM3 (d18:1/24:0) or Unknown |
| global | 181.08 | 8.6 | -0.1244 | 0.0115 | 0 | 0 | Alpha-Ketooctanoic acid |
| | 277.17 | 10.69 | -0.1585 | 0.1796 | 0 | 0 | Phosphatidylinositol or Lithocholate 3-O-glucuronide |
| | 312.37 | 18.9 | -9.00E-04 | -0.0819 | 0 | 0 | Unknown |
| | 315.19 | 8.7 | -0.1527 | 0.0978 | 0 | 0 | 3-Oxohexadecanoic acid |
| | 608.38 | 3.97 | 0.1159 | 0.0187 | 0 | 0 | Unknown |
| local | 159 | 0.62 | 0 | 0 | -0.158 | 0.1842 | Glycolaldehyde |
| | 330.35 | 15.37 | 0 | 0 | 0.1697 | 0 | Unknown* |
| | 634.26 | 10.31 | 0 | 0 | -0.0769 | 0 | Indoleacetyl glutamine |
| | 810.62 | 29.87 | 0 | 0 | 0 | -0.1817 | SM(d18:1/18:0) |
The normal controls serve as the reference. All: non-zero in both comparisons and in both versions; Common: selected by both versions, but zero in one comparison under the local version; global: selected only by the multi-TGDR global version; local: selected only by the multi-TGDR local version. Model 1_w: the results of multi-TGDR global after bagging (BF > 40%); Model 2_w: the results of multi-TGDR local after bagging (BF > 40%).
Note: * indicates overlap with the metabolites selected by PLS-DA.
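The four categories in the legend follow mechanically from which coefficients are non-zero. The sketch below encodes that rule as a small function. It assumes each metabolite carries two coefficients from the global model (Model 1_w) and two from the local model (Model 2_w), one per comparison against the normal controls. The name `categorize` and the "not selected" fallback are illustrative and do not come from the source.

```python
def categorize(global_coefs, local_coefs):
    """Assign a metabolite to a legend category from its coefficient pattern.

    global_coefs, local_coefs: the two per-comparison coefficients from
    Model 1_w (global) and Model 2_w (local), passed as lists or tuples.
    """
    in_global = any(c != 0 for c in global_coefs)
    in_local = any(c != 0 for c in local_coefs)
    if in_global and in_local:
        coefs = list(global_coefs) + list(local_coefs)
        # "All" requires a non-zero coefficient in every comparison and version.
        return "All" if all(c != 0 for c in coefs) else "Common"
    if in_global:
        return "global"
    if in_local:
        return "local"
    return "not selected"  # illustrative fallback, not a category in the table
```

For example, the Beta-Lactose row (0.6521, -0.6011, 0.7437, -0.4971) has no zero coefficient and falls under "All", while the Glycolaldehyde row (0, 0, -0.158, 0.1842) is non-zero only in the local model and falls under "local".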
Figure 4. Comparison of the metabolites selected by the multi-TGDR frameworks on the whole data and on the reduced data (BF > 40% for both). A. Venn diagram for multi-TGDR global. B. Venn diagram for multi-TGDR local. The whole data: without moderated t-test filtering. The reduced data: after t-test filtering, which removed 72 metabolites. Metabolites (indexed by m/z values) in red are those filtered out by the moderated t-tests. Metabolites (indexed by m/z values) in purple are those selected by the multi-TGDR framework in the whole-data analysis but excluded by bagging.