Anouk Suppers, Alain J. van Gool, Hans J. C. T. Wessels.
Abstract
Protein biomarkers are of great benefit for clinical research and applications, as they are powerful means for diagnosing, monitoring and treatment prediction of different diseases. Even though numerous biomarkers have been reported, the translation to clinical practice is still limited. This is mainly due to: (i) incorrect biomarker selection, (ii) insufficient validation of potential biomarkers, and (iii) insufficient clinical use. In this review, we focus on the biomarker selection process and critically discuss the chemometrical and statistical decisions made in proteomics biomarker discovery to increase the selection of high-value biomarkers. The characteristics of the data, the computational resources, the type of biomarker that is searched for and the validation strategy all influence the choice of chemometrical and statistical methods, and a decision made for one component directly influences the choice for another. Incorrect decisions could increase the false positive and false negative rates of biomarkers, which requires independent confirmation of the outcome by other techniques and hampers comparison between related studies. There are few guidelines for authors regarding data analysis documentation in peer-reviewed journals, making it hard to reproduce successful data analysis strategies. Here we review multiple chemometrical and statistical methods for their value in proteomics-based biomarker discovery and propose key components to include in scientific documentation.
Keywords: biomarker; chemometrics; classification models; clinical proteomics; feature reduction; preprocessing; review; statistics
Year: 2018 PMID: 29701723 PMCID: PMC6027525 DOI: 10.3390/proteomes6020020
Source DB: PubMed Journal: Proteomes ISSN: 2227-7382
Figure 1. Biomarker discovery workflow. The encircled components highlight the focus of this review.
Figure 2. Graphical representation of the pre-treatment effects of data centering and autoscaling. This figure represents five samples; each vertical box is the feature value distribution of one sample, with the mean depicted as a horizontal bar inside the box. Centering removes the offset from the data so that the means become zero, and autoscaling additionally converts the data so that the standard deviation becomes one.
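The centering and autoscaling steps described above can be sketched in a few lines of numpy. This is a minimal illustration with hypothetical data values; scaling is applied column-wise (per feature), the common convention in chemometrics, whereas the figure visualizes the effect per sample.

```python
import numpy as np

# Hypothetical data matrix: 5 samples (rows) x 4 features (columns)
# with features on very different scales, as is typical for intensities.
X = np.array([[1.0, 200.0, 3.0, 40.0],
              [2.0, 180.0, 4.0, 42.0],
              [1.5, 210.0, 2.5, 38.0],
              [3.0, 190.0, 3.5, 41.0],
              [2.5, 205.0, 4.5, 39.0]])

# Mean centering: subtract each feature's mean so the means become zero.
X_centered = X - X.mean(axis=0)

# Autoscaling (unit variance scaling): additionally divide by each
# feature's standard deviation so the standard deviations become one.
X_auto = X_centered / X.std(axis=0)
```

After autoscaling, all features contribute on a comparable scale, which prevents high-abundance features from dominating distance- or variance-based methods.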
Figure 3(a) Filter, (b) wrapper, and (c) embedded feature selection methods. Filter methods perform the feature selection independently of construction of the classification model. Wrapper methods iteratively select or eliminate a set of features using the prediction accuracy of the classification model. In embedded methods the feature selection is an integral part of the classification model.
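A filter method, as in panel (a), ranks features without reference to any classifier. A minimal sketch, using a two-sample t-statistic on hypothetical, randomly generated data (the number of samples, features, and the informative-feature shift are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 30 samples x 100 features, two classes of 15 samples.
X = rng.normal(size=(30, 100))
y = np.array([0] * 15 + [1] * 15)
X[y == 1, :5] += 2.0  # make the first 5 features informative

# Filter: rank features by the absolute two-sample t-statistic,
# computed independently of any classification model.
m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
v0, v1 = X[y == 0].var(axis=0, ddof=1), X[y == 1].var(axis=0, ddof=1)
t = (m1 - m0) / np.sqrt(v0 / 15 + v1 / 15)

# Keep the 10 highest-ranked features for downstream modeling.
top10 = np.argsort(-np.abs(t))[:10]
```

Because the ranking never consults a classifier, filter methods are fast, but they may miss features that are only informative in combination, which is where wrapper and embedded methods come in.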
Figure 4. Schematic overview of a double cross-validation procedure. The samples are split into a training and test set to evaluate the prediction accuracy in the outer cross-validation (CV) loop. The training set is subsequently split into a training and validation set to optimize the classifier-specific parameter in the inner cross-validation loop.
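The double (nested) cross-validation procedure can be sketched with plain numpy. This is an illustrative implementation under assumed choices that are not from the review itself: a toy k-nearest-neighbour classifier, 5 outer and 4 inner folds, and synthetic data; in practice any classifier and its tuning parameter take the place of k.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 40 samples, 5 features, two classes of 20.
X = rng.normal(size=(40, 5))
y = np.array([0] * 20 + [1] * 20)
X[y == 1] += 1.5  # shift class 1 so the problem is learnable

def knn_predict(X_tr, y_tr, X_te, k):
    """Simple k-nearest-neighbour majority-vote classifier."""
    preds = []
    for x in X_te:
        d = np.linalg.norm(X_tr - x, axis=1)
        preds.append(np.bincount(y_tr[np.argsort(d)[:k]]).argmax())
    return np.array(preds)

def kfold(n, n_splits, rng):
    """Shuffle indices 0..n-1 and split them into n_splits folds."""
    return np.array_split(rng.permutation(n), n_splits)

outer_acc = []
for i, test_idx in enumerate(outer := kfold(len(y), 5, rng)):
    # Outer loop: this test fold is only used for the final accuracy.
    train_idx = np.concatenate([f for j, f in enumerate(outer) if j != i])

    # Inner loop: tune the parameter k using the training samples only.
    best_k, best_score = 1, -1.0
    inner = kfold(len(train_idx), 4, rng)
    for k in (1, 3, 5):
        scores = []
        for m, val_local in enumerate(inner):
            tr_local = np.concatenate([f for j, f in enumerate(inner) if j != m])
            tr, val = train_idx[tr_local], train_idx[val_local]
            scores.append(np.mean(knn_predict(X[tr], y[tr], X[val], k) == y[val]))
        if np.mean(scores) > best_score:
            best_k, best_score = k, float(np.mean(scores))

    # Refit with the tuned k and evaluate on the untouched outer test fold.
    preds = knn_predict(X[train_idx], y[train_idx], X[test_idx], best_k)
    outer_acc.append(np.mean(preds == y[test_idx]))

mean_accuracy = float(np.mean(outer_acc))
```

The key point of the nesting is that the outer test folds never influence parameter tuning, so `mean_accuracy` is an unbiased estimate of how the tuned classifier generalizes.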
Table 1. Confusion matrix for binary classification. The positive and negative class could be disease and control, two different types of diseases, etc.

| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
Table 2. Performance measures for binary classification based on the notation in Table 1.

| Performance Measure | Formula |
|---|---|
| Number of misclassifications (NMC) | FP + FN |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) |
| Sensitivity | TP / (TP + FN) |
| Specificity | TN / (TN + FP) |
| Area under the receiver operating characteristic curve (AUC) | Area under the curve of sensitivity versus 1 − specificity over all classification thresholds |
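The count-based measures in Table 2 follow directly from the four confusion-matrix cells. A minimal sketch with hypothetical counts (AUC is omitted here, since it requires the classifier's continuous scores rather than a single confusion matrix):

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, FP, FN, TN = 40, 5, 10, 45

nmc = FP + FN                                   # number of misclassifications
accuracy = (TP + TN) / (TP + TN + FP + FN)      # fraction classified correctly
sensitivity = TP / (TP + FN)                    # true positive rate
specificity = TN / (TN + FP)                    # true negative rate

print(nmc, accuracy, sensitivity, specificity)  # 15 0.85 0.8 0.9
```

Note that sensitivity and specificity are each conditioned on one actual class, so they are unaffected by class imbalance, whereas accuracy can look deceptively high when one class dominates.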