Michael Turewicz, Caroline May, Maike Ahrens, Dirk Woitalla, Ralf Gold, Swaantje Casjens, Beate Pesch, Thomas Brüning, Helmut E Meyer, Eckhard Nordhoff, Miriam Böckmann, Christian Stephan, Martin Eisenacher.
Abstract
Contemporary protein microarrays such as the ProtoArray® are used in autoimmune antibody screening studies to discover biomarker panels. For ProtoArray data analysis, the manufacturer suggests the software Prospector and a default workflow. While analyzing a large data set from a discovery study for diagnostic biomarkers of Parkinson's disease (ParkCHIP), we identified the need for distinct improvements to the suggested workflow concerning raw data acquisition, the availability of normalization and preselection methods, batch effects, feature selection, and feature validation. In this work, appropriate improvements to the default workflow are proposed. It is shown that fully automatic data acquisition as a batch, a re-implementation of Prospector's preselection method, multivariate or hybrid feature selection, and validation of the selected protein panel using an independent test set together define an improved workflow for large studies.
Keywords: Autoantibodies; Bioinformatics; Biological markers; Parkinson's disease; Protein array analysis
Year: 2013 PMID: 23616427 PMCID: PMC3810711 DOI: 10.1002/pmic.201200518
Source DB: PubMed Journal: Proteomics ISSN: 1615-9853 Impact factor: 3.984
Figure 1. The basic idea of the hybrid procedure for selecting candidate proteins, as performed in the ParkCHIP study. As a first step, several two-group comparisons (HC vs. PD, DC vs. PD, HC vs. DC sera, and lot1 vs. lot2) were performed using the software Prospector. After all proteins had been rated by manual and score voting, preselection rules were applied, resulting in 215 preliminary biomarker candidates. This set was narrowed down further by manual and automatic selection. Finally, the resulting lists, containing 22 and 18 proteins, respectively, were merged into the final biomarker candidate list of 36 distinct candidate proteins (so four proteins appear in both lists).
Figure 2. To validate the “M Score only” selection, ten new pairs of training and test sets were resampled for the discrimination of PD cases from nonaffected subjects (PD vs. “HC + DC”). For each training set, the 36 proteins with the best M Score were selected, and an RF classifier was trained on these 36 features and used to classify the corresponding test set. Finally, the average accuracy over the ten subruns was computed.
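The resampling scheme described for Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the data are synthetic, the feature ranking is a simple absolute difference of group means standing in for the M Score, and a nearest-centroid rule stands in for the RF classifier. All function names and parameters (`make_toy_data`, `top_k_features`, `resampled_accuracies`, `test_fraction`, and so on) are hypothetical. The point it shows is structural: the 36 features are re-selected inside each training set, and the test set is used only for classification.

```python
import random
from statistics import mean


def make_toy_data(n_samples=60, n_features=200, n_informative=5, seed=0):
    """Hypothetical data: rows are sera, columns are protein signals."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # 1 = case, 0 = nonaffected
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += 1.5 * label  # only the first few features carry signal
        X.append(row)
        y.append(label)
    return X, y


def group_means(X, y, label, features):
    """Mean signal of each listed feature within one class."""
    rows = [x for x, lab in zip(X, y) if lab == label]
    return [mean(r[f] for r in rows) for f in features]


def top_k_features(X, y, k):
    """Stand-in ranking (NOT the M Score): absolute difference of group means."""
    feats = list(range(len(X[0])))
    m0 = group_means(X, y, 0, feats)
    m1 = group_means(X, y, 1, feats)
    ranked = sorted(feats, key=lambda f: abs(m1[f] - m0[f]), reverse=True)
    return ranked[:k]


def predict(x, centroids, features):
    """Stand-in classifier (NOT a random forest): nearest class centroid."""
    def dist(c):
        return sum((x[f] - c[i]) ** 2 for i, f in enumerate(features))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def resampled_accuracies(X, y, k=36, n_subruns=10, test_fraction=0.3, seed=1):
    """Repeat: split, select k features on the training set, classify the test set."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_test = int(len(idx) * test_fraction)
    accuracies = []
    for _ in range(n_subruns):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        # Feature selection sees the training set only.
        features = top_k_features(X_tr, y_tr, k)
        centroids = {lab: group_means(X_tr, y_tr, lab, features) for lab in (0, 1)}
        hits = sum(predict(X[i], centroids, features) == y[i] for i in test)
        accuracies.append(hits / n_test)
    return accuracies


X, y = make_toy_data()
accuracies = resampled_accuracies(X, y)
print("per-subrun test accuracy:", [round(a, 2) for a in accuracies])
print("average test accuracy:", round(mean(accuracies), 3))
```

Selecting the features once on the full data set and only then splitting would let information from the test samples leak into the selection, which is why the selection step sits inside the loop.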
Classification accuracies (in %) on the training and test sets for Parkinson’s disease (ParkCHIP) and Alzheimer’s disease (GEO record GSE29676) data, shown for each of the ten subruns of “M Score only” selection and “hybrid” selection, and on average.
| Subrun | Parkinson’s: M Score only (train) | Parkinson’s: M Score only (test) | Parkinson’s: Hybrid (train) | Parkinson’s: Hybrid (test) | Alzheimer’s: M Score only (train) | Alzheimer’s: M Score only (test) | Alzheimer’s: Hybrid (train) | Alzheimer’s: Hybrid (test) |
|---|---|---|---|---|---|---|---|---|
| 1 | 100 | 60.1 | 74.5 | 74.9 | 100 | 75.6 | 83.9 | 84.8 |
| 2 | 100 | 55.3 | 74.6 | 77.2 | 100 | 79.6 | 82.6 | 83.9 |
| 3 | 100 | 54.9 | 74.5 | 71.2 | 100 | 81.6 | 83.9 | 78.9 |
| 4 | 100 | 51.4 | 74 | 72.4 | 100 | 77.6 | 83.9 | 79.9 |
| 5 | 100 | 55.7 | 74.5 | 73.5 | 100 | 75.5 | 83.9 | 75.7 |
| 6 | 100 | 56.7 | 74.8 | 73.2 | 100 | 87.6 | 83.2 | 77.3 |
| 7 | 100 | 58.2 | 74.5 | 73.8 | 100 | 80.8 | 83.2 | 91.1 |
| 8 | 100 | 62.1 | 74.4 | 69.5 | 100 | 79.2 | 83.9 | 83.8 |
| 9 | 100 | 65.3 | 74.7 | 74.5 | 100 | 74.4 | 83.9 | 83.4 |
| 10 | 100 | 62.8 | 74.7 | 74.5 | 100 | 81.0 | 83.9 | 87.2 |
| Average | 100 | 58.25 | 74.5 | 73.5 | 100 | 79.3 | 83.6 | 82.6 |
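The averages in the last row, and the gap between training and test accuracy that they imply, can be recomputed from the per-subrun values. The short sketch below copies the test-set columns and the reported training averages from the table; the variable names and the PD/AD keys are illustrative only.

```python
from statistics import mean

# Test-set accuracies (%) per subrun, copied from the table above.
test_accuracy = {
    ("PD", "M Score only"): [60.1, 55.3, 54.9, 51.4, 55.7, 56.7, 58.2, 62.1, 65.3, 62.8],
    ("PD", "Hybrid"): [74.9, 77.2, 71.2, 72.4, 73.5, 73.2, 73.8, 69.5, 74.5, 74.5],
    ("AD", "M Score only"): [75.6, 79.6, 81.6, 77.6, 75.5, 87.6, 80.8, 79.2, 74.4, 81.0],
    ("AD", "Hybrid"): [84.8, 83.9, 78.9, 79.9, 75.7, 77.3, 91.1, 83.8, 83.4, 87.2],
}

# Average training-set accuracies (%) as reported in the "Average" row.
train_average = {
    ("PD", "M Score only"): 100.0,
    ("PD", "Hybrid"): 74.5,
    ("AD", "M Score only"): 100.0,
    ("AD", "Hybrid"): 83.6,
}

# Average test accuracy and train-minus-test gap for each data set and method.
averages = {key: mean(values) for key, values in test_accuracy.items()}
gaps = {key: train_average[key] - averages[key] for key in averages}

for (disease, method), avg in averages.items():
    gap = gaps[(disease, method)]
    print(f"{disease} / {method}: test average {avg:.2f} %, train-test gap {gap:.2f} points")
```

With “M Score only” selection the training accuracy exceeds the average test accuracy by roughly 42 percentage points on the Parkinson’s data and roughly 21 points on the Alzheimer’s data, whereas with hybrid selection the gap is about one point in both cases.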