Chris Bauer, Frank Kleinjung, Celia J Smith, Mark W Towers, Ali Tiss, Alexandra Chadt, Tanja Dreja, Dieter Beule, Hadi Al-Hasani, Knut Reinert, Johannes Schuchhardt, Rainer Cramer.
Abstract
BACKGROUND: Diabetes, like many diseases and biological processes, is not mono-causal. On the one hand, multi-factorial studies with complex experimental designs are required for its comprehensive analysis. On the other hand, the data from these studies often include a substantial amount of redundancy, such as proteins that are typically represented by a multitude of peptides. Coping simultaneously with both complexities (experimental and technological) makes data analysis a challenge for bioinformatics.
Year: 2011 PMID: 21554713 PMCID: PMC3116487 DOI: 10.1186/1471-2105-12-140
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Work-flow. Complete work-flow of the cluster-based ANOVA approach with feature selection for multi-factorial MALDI MS profiling data in biomarker discovery.
MALDI Number of Samples.
| B6 | 36/5 | 31/4 | 12/2 | 36/5 | 40/5 | 37/5 | 38/5 | 38/5 | 32/5 | 39/5 | 34/5 | 28/5 |
| NZO | 35/5 | 35/5 | 32/4 | 40/5 | 36/5 | 40/5 | 37/5 | 38/5 | 40/5 | 28/5 | 34/5 | 34/5 |
| SJL | 4/1 | 0/0 | 16/3 | 12/2 | 0/0 | 40/5 | 32/4 | 40/5 | 32/5 | 36/5 | 40/5 | 40/5 |
Number of MALDI mass spectra and biological replicates for each factor combination. The first number indicates the number of spectra, the second states the number of biological replicates. In total there are 1122 spectra for 155 different biological samples derived from 31 different mouse individuals.
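The totals quoted in this caption can be checked directly against the table. The short sketch below copies the cell values from the three genotype rows above and sums them; the variable and function names are illustrative only.

```python
# Spectra/replicate counts per genotype, copied from the table above.
# Each cell reads "number of spectra/number of biological replicates".
counts = {
    "B6":  "36/5 31/4 12/2 36/5 40/5 37/5 38/5 38/5 32/5 39/5 34/5 28/5",
    "NZO": "35/5 35/5 32/4 40/5 36/5 40/5 37/5 38/5 40/5 28/5 34/5 34/5",
    "SJL": "4/1 0/0 16/3 12/2 0/0 40/5 32/4 40/5 32/5 36/5 40/5 40/5",
}


def totals(cells: str) -> tuple[int, int]:
    """Return (total spectra, total biological replicates) for one table row."""
    pairs = [tuple(map(int, cell.split("/"))) for cell in cells.split()]
    return sum(s for s, _ in pairs), sum(r for _, r in pairs)


per_genotype = {genotype: totals(cells) for genotype, cells in counts.items()}
total_spectra = sum(s for s, _ in per_genotype.values())
total_samples = sum(r for _, r in per_genotype.values())

print(per_genotype)                  # {'B6': (401, 56), 'NZO': (429, 59), 'SJL': (292, 40)}
print(total_spectra, total_samples)  # 1122 155
```

The row sums reproduce the 1122 spectra and 155 biological samples stated in the caption.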
Figure 2. Preprocessing. MALDI MS profiling raw data (top), log data (middle) and data after baseline correction and peak alignment (bottom). The left column shows the effect on the spectra themselves, while the right column shows the corresponding standard error plots including linear fit (orange line) and lowess fit (black line). The different colors reflect different genotypes (red: B6, green: NZO, blue: SJL).
Figure 3. Error plot to ensure homoscedasticity. Error plot after log transformation to ensure homoscedasticity including linear fit (orange line) and lowess fit (black line). The different colors reflect different genotypes (red: B6, green: NZO, blue: SJL).
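Why a log transformation helps can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes a purely multiplicative (log-normal) noise model with an arbitrary spread of 0.2, and the intensity levels are made up. Under that assumption, the standard deviation of raw intensities grows with the mean, while the standard deviation of log intensities stays roughly constant.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def replicate_sd(mean_level, transform, n=2000):
    """Standard deviation over n simulated replicates of one peak.

    Assumed noise model (hypothetical): intensity = mean_level * exp(e),
    with e drawn from a normal distribution with sigma = 0.2.
    """
    values = [
        transform(mean_level * math.exp(random.gauss(0.0, 0.2)))
        for _ in range(n)
    ]
    return statistics.stdev(values)


levels = [10.0, 100.0, 1000.0]  # made-up mean peak intensities
raw_sd = [replicate_sd(m, lambda x: x) for m in levels]
log_sd = [replicate_sd(m, math.log) for m in levels]

for m, r, l in zip(levels, raw_sd, log_sd):
    print(f"mean={m:7.1f}  sd(raw)={r:8.3f}  sd(log)={l:.3f}")
```

On the raw scale the error scales with intensity, so a fitted line through the error plot has a clear slope; on the log scale the fit is close to flat, which is the behavior an ANOVA assumes.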
Figure 4. Cluster dendrogram. Cluster dendrogram of all peaks identified in this dataset (see the Methods section for details). Every node is characterized by four ANOVA p-values shown as a color-coded box with four fields: diet (upper left), genotype (upper right), time (lower right) and combination of diet and genotype (lower left). The different -log10 p-value color scales for the four factors are shown at the bottom. Three clusters for further discussion (see text) are marked with red circles.
Figure 5. Dendrogram hemoglobin. Excerpt of the dendrogram in Figure 4 showing the three peaks identified as hemoglobin (colored red on the x-axis).
Table for clusters 1-3.
| 1 | 0.71 | 2262 | 1.3e-10 | 0.00015 | 2e-19 | 3.8e-11 | 0.0018 | 0.97 | 4.6e-14 | 1.2e-05 |
| 3618 | 6.7e-07 | 2e-08 | 0.004 | 8.8e-13 |
| 4075 | 2e-14 | 1.5e-29 | 3e-08 | 7.9e-14 |
| 2 | 0.94 | 9305 | 0.96 | 0.0019 | 3.3e-31 | 4.3e-36 | 2.8e-19 | 2.4e-17 | 0.12 | 0.013 |
| 8720 | 0.38 | 1.2e-23 | 2.8e-16 | 0.2 |
| 8735 | 0.57 | 8.2e-29 | 5.5e-22 | 0.21 |
| 3 | 0.95 | 6329 | 0.0012 | 0.022 | 7.5e-75 | 6.7e-50 | 0.34 | 0.24 | 0.18 | 0.19 |
| 4237 | 9.5e-05 | 1.1e-70 | 0.61 | 0.022 |
| 5029 | 0.00014 | 1.3e-91 | 0.82 | 0.14 |
| 5822 | 0.0023 | 5.7e-81 | 2.3e-07 | 0.82 |
Table for clusters 1-3 of Figure 4. For every cluster and the peaks aggregated within it, the correlation of the peaks and the ANOVA p-values for the three experimental factors and the factor combination of diet and genotype are given. P-values are given for every peak separately and for the complete cluster.
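The two quantities reported per cluster, peak correlation and an ANOVA test on the aggregated signal, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the intensities and diet labels are hypothetical, the cluster signal is taken as the plain mean of its peaks, and only a one-way F statistic for a single factor is computed (the study uses several factors plus an interaction term, and a p-value would additionally require the F distribution).

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def one_way_f(groups):
    """F statistic of a one-way ANOVA over a list of observation groups."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical log intensities of two peaks measured in the same six spectra.
peak_a = [1.0, 1.2, 0.9, 2.0, 2.1, 1.9]
peak_b = [2.0, 2.3, 1.9, 3.1, 3.0, 2.9]
diet = ["SD", "SD", "SD", "HF", "HF", "HF"]  # hypothetical factor labels

r = pearson(peak_a, peak_b)

# Assumed aggregation: the cluster signal is the mean of its member peaks.
cluster = [(a + b) / 2 for a, b in zip(peak_a, peak_b)]
groups = [[v for v, d in zip(cluster, diet) if d == level] for level in ("SD", "HF")]
f_stat = one_way_f(groups)

print(f"correlation={r:.3f}  F={f_stat:.1f}")
```

Highly correlated peaks carry largely redundant information, so testing the aggregated cluster signal once, instead of each peak separately, is one way to reduce that redundancy.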
Figure 6. Peak 4075. Normalized peak intensities for the peak at m/z 4075 representing cluster 1 of the dendrogram in Figure 4. Peak intensities for all 3 experimental factors are drawn as bar plots with error-of-mean error bars. Genotype and diet are given below the bars for each week. The missing values for the SJL-HFD week 3 and 4 samples are due to the sample collection problems described in the Methods section.
Confusion Matrices.
| CHF | HF | SD | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| nFeat | Method | CHF | HF | SD | CHF | HF | SD | CHF | HF | SD | Error | P-Value | ||||||
| 3 | ANOVA | 14 | 18 | 17 | 24 | 16 | 16 | 0.53 | 0.0028 | |||||||||
| ACO | 4 | 16 | 10 | 22 | 11 | 16 | 0.4 | 1e-08 | ||||||||||
| Cluster ANOVA | 12 | 13 | 15 | 22 | 9 | 16 | 0.44 | 1e-06 | ||||||||||
| 5 | ANOVA | 13 | 16 | 18 | 25 | 20 | 11 | 0.52 | 0.006 | |||||||||
| ACO | 3 | 14 | 12 | 20 | 10 | 15 | 0.37 | 2.7e-08 | ||||||||||
| Cluster ANOVA | 13 | 12 | 15 | 12 | 4 | 24 | 0.4 | 6.7e-07 | ||||||||||
| 8 | ANOVA | 12 | 12 | 16 | 15 | 6 | 22 | 0.42 | 9e-06 | |||||||||
| ACO | 5 | 15 | 12 | 23 | 4 | 18 | 0.39 | 5.5e-07 | ||||||||||
| Cluster ANOVA | 10 | 12 | 14 | 16 | 5 | 19 | 0.38 | 3.3e-07 | ||||||||||
Confusion matrix for 10-fold cross validation for the experimental factor diet using a random forest classifier. The feature selection was done by three different methods: ANOVA, ant colony optimization (ACO) and cluster-based ANOVA. The feature selection was performed three times with different numbers of features: 3, 5 and 8. Numbers in bold print represent true positives.
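The error column of a confusion matrix follows from its diagonal: the true positives sit on the diagonal, and every other cell is a misclassification. The sketch below shows that arithmetic for a three-class diet problem. The counts are hypothetical and are not taken from the table, and the convention that rows are true classes and columns are predicted classes is an assumption.

```python
def error_rate(confusion):
    """Fraction of misclassified samples in a square confusion matrix.

    Assumed layout: confusion[i][j] counts samples of true class i that
    were predicted as class j, so the diagonal holds the true positives.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 1 - correct / total


# Hypothetical counts for the classes CHF, HF and SD (rows: true, columns: predicted).
example = [
    [30, 5, 5],
    [4, 28, 8],
    [6, 7, 27],
]

print(f"error = {error_rate(example):.2f}")  # error = 0.29
```

With three balanced classes, random guessing gives an error of about 0.67, which is a useful baseline when reading the error values in the table.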
Comparison of feature selection methods.
| Method | Deterministic | Feature Selection | p-Values | Multi Dimensional | Combinations | Redundancy |
|---|---|---|---|---|---|---|
| t-Test | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ |
| F-Test | ✓ | ✓ | ✓ | ✓ | ✕ | ✕ |
| ANOVA | ✓ | ✓ | ✓ | ✓ | ✓ | ✕ |
| Swarm Intelligence | ✕ | ✓ | ✕ | ✓ | ✓ | ✓ |
| GA | ✕ | ✓ | ✕ | ✓ | ✓ | ✓ |
| This Work | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Comparison of different methods for biomarker candidate identification and feature selection.