Christian Baumgartner, Melanie Osl, Michael Netzer, Daniela Baumgartner.
Abstract
The search for and validation of novel disease biomarkers requires the complementary power of professional study planning and execution, modern profiling technologies and related bioinformatics tools for data analysis and interpretation. Biomarkers have considerable impact on the care of patients and are urgently needed for advancing diagnostics, prognostics and treatment of disease. This survey article highlights emerging bioinformatics methods for biomarker discovery in clinical metabolomics, focusing on the problem of data preprocessing and consolidation, and on the data-driven search, verification, prioritization and biological interpretation of putative metabolic candidate biomarkers in disease. In particular, data mining tools suitable for application to omic data gathered from the most frequently used types of experimental design, such as case-control or longitudinal biomarker cohort studies, are reviewed, and case examples of selected discovery steps are delineated in more detail. This review demonstrates that clinical bioinformatics has evolved into an essential element of biomarker discovery, translating innovations and successes in profiling technologies and bioinformatics to clinical application.
Year: 2011 PMID: 21884622 PMCID: PMC3143899 DOI: 10.1186/2043-9113-1-2
Source DB: PubMed Journal: J Clin Bioinforma ISSN: 2043-9113
Figure 1. Biomarker discovery process in human disease using an MS-based metabolite profiling platform.
Commonly used supervised data mining methods for the search and prioritization of biomarker candidates in independent and dependent samples
| Sample type | Method | Basic principle and key features of the method | Reference |
|---|---|---|---|
| Independent samples | Unpaired null hypothesis testing (Two-sample t-test*, Mann-Whitney-U test°) | univariate filter method | Lehmann, |
| Independent samples | Principal component analysis (PCA)# | unsupervised projection method | Jolliffe, |
| Independent samples | Information gain (IG) | univariate filter method | Hall and Holmes, |
| Independent samples | ReliefF (RF) | multivariate filter method | Robnik-Sikonja & Kononenko, |
| Independent samples | Associative voting (AV) | multivariate filter method | Osl et al., |
| Independent samples | Unpaired Biomarker Identifier (uBI) | univariate filter method | Baumgartner et al., |
| Independent samples | Guilt-by-association feature selection (GBA-FS) | multivariate subset selection method | Shin et al., |
| Independent samples | Support vector machine-recursive feature elimination (SVM-RFE) | embedded selection method | Guyon et al., |
| Independent samples | Random forest models (RFM) | embedded selection method | Enot et al., |
| Independent samples | Aggregating feature selection (AFS) | ensemble selection method | Saeys et al., |
| Independent samples | Stacked feature ranking (SFR) | ensemble selection method | Netzer et al., |
| Independent samples | Wrapper approach | evaluating the merit of a feature subset by accuracy estimates using a classifier | Hall and Holmes, |
| Dependent samples | Paired null hypothesis testing (Paired t-test*, Wilcoxon signed-rank test°) | univariate filter method | Lehmann, |
| Dependent samples | Repeated measure analysis | univariate and multivariate approaches | Crowder & Hand, |
| Dependent samples | Paired Biomarker Identifier (pBI) | univariate filter method | Baumgartner et al., |
\* Normally distributed data; ° non-normally distributed data; # PCA is an unsupervised method that is also used for data containing class information. All algorithms are run on continuous data, as data generated in metabolomics are usually of metric nature. Data can represent absolute metabolite concentrations (given as intensity counts or, more specifically, in μmol/L if internal standards are available) or simple m/z values from raw or preprocessed mass spectra.
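To make the difference between the unpaired and paired univariate filters in the table concrete, the following is a minimal sketch rather than the implementation used in the reviewed studies. It assumes the Welch form of the two-sample t statistic; the function names and the toy concentration values are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t statistic for independent groups (Welch form, an assumption)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def paired_t(before, after):
    """Paired t statistic computed on within-subject differences."""
    d = [y - x for x, y in zip(before, after)]
    return mean(d) / sqrt(variance(d) / len(d))


def rank_unpaired(cases, controls):
    """Rank metabolites by |t|; inputs map metabolite name -> list of values."""
    scores = {m: abs(welch_t(cases[m], controls[m])) for m in cases}
    return sorted(scores, key=scores.get, reverse=True)


# Toy, made-up concentrations (not taken from any study in this review).
cases = {"met_A": [5.1, 5.4, 4.9, 5.6], "met_B": [1.0, 1.2, 0.9, 1.1]}
controls = {"met_A": [2.0, 2.2, 1.9, 2.1], "met_B": [1.1, 0.9, 1.0, 1.2]}
ranking = rank_unpaired(cases, controls)
print(ranking)
```

A filter of this kind scores each metabolite on its own, which is what "univariate" means in the table; the multivariate, embedded and ensemble methods listed there additionally account for interactions between metabolites.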
Figure 2. Kinetic map of metabolites on PMI data at 10, 60, 120, and 240 minutes after myocardial injury, using the pBI scoring model for prioritization of selected metabolites into groups of weak, moderate and strong predictors. Values indicate absolute pBI scores. The thresholds for prioritization are denoted below in the list of analytes. Red color increments indicate decreasing levels, blue increasing levels. In this study, a series of metabolites in pathways associated with myocardial infarction could be identified, some of which change as early as 10 minutes after injury, a time frame where no currently available clinical biomarkers are present [13,56].
Figure 3. AUC analysis on the entire metabolite set (bars on the left), and on a set of the top ten ranked metabolites using four common feature selection methods, i.e. two-sample t-test (P-value), the unpaired Biomarker Identifier (uBI), ReliefF, and Information gain (IG) on MCADD data (bars on the right). Red bars represent the predictive value expressed by the AUC of selected analyte sets, determined on a single derivation cohort with cross-validation, and blue bars without cross-validation. Interestingly, using the entire metabolite set (43 analytes) for distinguishing between the two groups, the discriminatory ability dropped from AUC = 1.0 (without cross-validation) to AUC = 0.51 after 10-fold cross-validation, thus indicating no discrimination between the cohorts. On the selected subset, the AUC dropped by 15% to 25% after cross-validation, demonstrating weak predictive value and thus low generalizability of the selected subset in this experiment.
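The gap between AUC values with and without cross-validation can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the MCADD analysis: the data are pure random noise with 43 features, a nearest-centroid classifier stands in for whatever model the study used, and all function names are hypothetical.

```python
import random


def auc(pos, neg):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid(rows):
    """Feature-wise mean of a list of samples."""
    return [sum(col) / len(col) for col in zip(*rows)]


def score(x, c_case, c_ctrl):
    """Higher score = closer to the case centroid than to the control centroid."""
    d_case = sum((a - b) ** 2 for a, b in zip(x, c_case))
    d_ctrl = sum((a - b) ** 2 for a, b in zip(x, c_ctrl))
    return d_ctrl - d_case


def split_auc(scores, y):
    """AUC of the scores, split by class label (1 = case, 0 = control)."""
    return auc([s for s, t in zip(scores, y) if t == 1],
               [s for s, t in zip(scores, y) if t == 0])


def resubstitution_auc(X, y):
    """Fit and evaluate on the same samples (no cross-validation)."""
    c_case = centroid([x for x, t in zip(X, y) if t == 1])
    c_ctrl = centroid([x for x, t in zip(X, y) if t == 0])
    return split_auc([score(x, c_case, c_ctrl) for x in X], y)


def cross_validated_auc(X, y, k=10):
    """Score each sample with centroids fitted on the other k-1 folds only."""
    scores = [0.0] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        c_case = centroid([X[i] for i in train if y[i] == 1])
        c_ctrl = centroid([X[i] for i in train if y[i] == 0])
        for i in range(fold, len(X), k):  # held-out samples of this fold
            scores[i] = score(X[i], c_case, c_ctrl)
    return split_auc(scores, y)


# Simulated noise: 20 "cases" and 20 "controls", 43 uninformative features.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(43)] for _ in range(40)]
y = [1] * 20 + [0] * 20
auc_resub = resubstitution_auc(X, y)
auc_cv = cross_validated_auc(X, y, k=10)
print(f"without CV: {auc_resub:.2f}, with 10-fold CV: {auc_cv:.2f}")
```

Because the features carry no class information, any apparent discrimination in the resubstitution estimate comes from fitting and evaluating on the same samples, which is the effect the figure reports for the full 43-analyte set.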
Figure 4. The high and low concentration levels of arginine (Arg) and ornithine (Orn), respectively, in patients afflicted with severe metabolic syndrome and cardiovascular disease (MS+) versus healthy controls implied that the enzyme arginase in the urea cycle is affected (left figure). The urea cycle and associated pathways from the KEGG database are depicted in the right figure. These findings could be confirmed by the literature [66,67].