Nikita I Lytkin, Lauren McVoy, Jörn-Hendrik Weitkamp, Constantin F Aliferis, Alexander Statnikov.
Abstract
BACKGROUND: The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease on an individual-patient basis. These functions are predominantly enabled by molecular signatures, which are computational models for predicting phenotypes and other responses of interest from high-throughput assay data. Data analytics is a central component of molecular signature development and can jeopardize the entire process if conducted incorrectly. While exploratory data analysis may tolerate suboptimal protocols, clinical-grade molecular signatures are subject to vastly stricter requirements. Closing the gap between the standards for exploratory and clinically successful molecular signatures entails thoroughly understanding the possible biases in the data analysis phase and developing strategies to avoid them. METHODOLOGY AND PRINCIPAL
Year: 2011 PMID: 21673802 PMCID: PMC3105991 DOI: 10.1371/journal.pone.0020662
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Number of gene expression profiles corresponding to each category of samples from the data of Zaas et al. [9] and Ramilo et al. [12].
| 10 | 9 | N/A |
| 11 | 9 | N/A |
| 9 | 8 | 18 |
| N/A | N/A | 73 |
| 56 | 6 | |
Genes that appeared in more than 20% of non-redundant and maximally predictive signatures identified by TIE* for discriminating between symptomatic and uninfected samples.
| Probe set | Gene | Frequency in signatures |
|---|---|---|
| 201065_s_at | general transcription factor IIi | 73% |
| 213674_x_at | immunoglobulin heavy constant delta | 73% |
| 214511_x_at | Fc fragment of IgG, high affinity Ib, receptor (CD64) | 72% |
| 207826_s_at | inhibitor of DNA binding 3, dominant negative helix-loop-helix protein | 71% |
| 213797_at | radical S-adenosyl methionine domain containing 2 | 71% |
| 217418_x_at | membrane-spanning 4-domains, subfamily A, member 1 | 70% |
| 219471_at | chromosome 13 open reading frame 18 | 69% |
| 219112_at | Rap guanine nucleotide exchange factor (GEF) 6 | 63% |
| 219073_s_at | oxysterol binding protein-like 10 | 59% |
| 219313_at | GRAM domain containing 1C | 56% |
| 204439_at | interferon-induced protein 44-like | 42% |
| 221234_s_at | BTB and CNC homology 1, basic leucine zipper transcription factor 2 | 29% |
| 216950_s_at | Fc fragment of IgG, high affinity Ia, receptor (CD64); Fc fragment of IgG, high affinity Ic, receptor (CD64) | 28% |
| 207431_s_at | degenerative spermatocyte homolog 1, lipid desaturase (Drosophila) | 25% |
| 205049_s_at | CD79a molecule, immunoglobulin-associated alpha | 24% |
| 202723_s_at | forkhead box O1 | 22% |
| 44790_s_at | chromosome 13 open reading frame 18 | 22% |
| 203413_at | NEL-like 2 (chicken) | 20% |
| 214059_at | interferon-induced protein 44 | 20% |
Genes highlighted in bold are those that also comprised the 12-gene panviral signature developed by applying GLL on the entire set of samples [10].
Effects of preprocessing methods on gene selection under the null hypothesis of no association between genes and the panviral phenotype in the acute respiratory viral infections dataset [9].
| No preprocessing | 0.3 | 9.1 | 0.0 | 0.0 |
| Center (subtract global mean) | 0.3 | 9.1 | 0.0 | 0.0 |
| Standardize (subtract global mean and divide by stdev) | 0.3 | 9.1 | 0.0 | 0.0 |
| Scale each probe to [0,1] | 0.3 | 9.1 | 0.0 | 0.0 |
| Batch correction from the supplementary software of Zaas et al. | 55.6 | 440.7 | 0.0 | 287.5 |
| ComBat | 71.3 | 505.1 | 0.0 | 707.0 |
The phenotype variable was randomly permuted 10,000 times. On each permutation, we applied a given preprocessing method and then performed gene selection using a two-sample t-test with the false discovery rate (FDR) correction at level 0.2 [32], [33].
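The permutation protocol in this note can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names (`bh_select`, `null_selection_counts`, `standardize`), the synthetic data, the reduced number of permutations, the equal-variance form of the t-test, and the choice of the Benjamini-Hochberg step-up procedure for FDR control are all assumptions made for the example. The preprocessing step is a hypothetical hook that also receives the permuted labels, so that methods taking the phenotype as input could be plugged in.

```python
import numpy as np
from scipy import stats


def bh_select(pvalues, level=0.2):
    """Boolean mask of hypotheses rejected by Benjamini-Hochberg (assumed FDR method)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Step-up thresholds: level * rank / m for rank = 1..m
    thresholds = level * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    selected = np.zeros(m, dtype=bool)
    if passed.any():
        # Reject every hypothesis up to the largest rank that passes
        k = np.max(np.nonzero(passed)[0])
        selected[order[: k + 1]] = True
    return selected


def null_selection_counts(X, y, preprocess, n_perm=1000, level=0.2, seed=0):
    """Number of genes selected on each random permutation of the phenotype.

    X: (samples, genes) expression matrix; y: binary labels coded 0/1.
    preprocess: callable (X, y_permuted) -> preprocessed X (hypothetical hook).
    """
    rng = np.random.default_rng(seed)
    counts = np.empty(n_perm, dtype=int)
    for i in range(n_perm):
        y_perm = rng.permutation(y)
        Xp = preprocess(X, y_perm)
        # Per-gene two-sample t-test (equal variances assumed here)
        _, pvals = stats.ttest_ind(Xp[y_perm == 0], Xp[y_perm == 1], axis=0)
        counts[i] = bh_select(pvals, level).sum()
    return counts


def standardize(X, y_perm):
    """Subtract the global mean and divide by the global stdev; ignores the labels."""
    return (X - X.mean()) / X.std()


# Synthetic data with no gene-phenotype association (illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 500))
y = np.array([0] * 20 + [1] * 20)
counts = null_selection_counts(X, y, standardize, n_perm=200)
print(counts.mean(), counts.std(), np.median(counts), counts.max())
```

Because the t statistic of a gene is unchanged by shifting or rescaling that gene's values, label-independent transformations such as centering, standardizing, or scaling each probe to [0,1] should not change which genes are selected, which is consistent with the identical values in the first four rows of the table.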
Figure 1. Effects of preprocessing by the supplementary software of Zaas et al. on real gene expression data.
Gene expression profiles of the uninfected subjects are shown in blue, staggered on top of the profiles of the infected subjects, which are shown in red. The blue and red vertical line segments mark the mean expression in the uninfected and infected groups, respectively. Likewise, the blue and red horizontal line segments extending in both directions from the means span one standard deviation within the uninfected and infected groups, respectively. P-values produced by a two-sample t-test with unequal variances are shown in parentheses.
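The per-gene quantities named in this caption (group means, one standard deviation per group, and the p-value of a two-sample t-test with unequal variances) could be computed along the following lines. This is a sketch under stated assumptions: the function name `summarize_gene`, the use of the sample standard deviation (`ddof=1`), and the example values are hypothetical and are not taken from the paper.

```python
import numpy as np
from scipy import stats


def summarize_gene(uninfected, infected):
    """Group means, sample standard deviations, and Welch t-test p-value for one gene."""
    u = np.asarray(uninfected, dtype=float)
    v = np.asarray(infected, dtype=float)
    # equal_var=False requests the unequal-variance (Welch) form of the t-test
    result = stats.ttest_ind(u, v, equal_var=False)
    return {
        "mean_uninfected": u.mean(),
        "sd_uninfected": u.std(ddof=1),
        "mean_infected": v.mean(),
        "sd_infected": v.std(ddof=1),
        "p_value": result.pvalue,
    }


# Made-up expression values for a single gene (illustration only)
summary = summarize_gene([7.1, 7.4, 6.9, 7.2], [8.0, 8.6, 7.9, 8.3, 8.8])
print(summary)
```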
Figure 2. Visualization of subjects in the dataset in the space of the first two principal components of the panviral signature of Zaas et al.
The solid line is an approximation of the molecular signature (classifier) of Zaas et al.; subjects to the left of this line are classified as uninfected (healthy) and subjects to the right are classified as virally infected (Influenza A). Blue and red gradient highlighting corresponds to the regions where the majority of bacterial and viral profiles belong, respectively. Green highlighting shows the area with uninfected (healthy) profiles.
Effects of preprocessing methods on gene selection under the null hypothesis of no association between genes and the bacterial phenotype in the Candidemia dataset [11].
| No preprocessing | 82.6 | 640.7 | 0.0 | 607.0 |
| Center (subtract global mean) | 82.6 | 640.7 | 0.0 | 607.0 |
| Standardize (subtract global mean and divide by stdev) | 82.6 | 640.7 | 0.0 | 607.0 |
| Scale each probe to [0,1] | 82.6 | 640.7 | 0.0 | 607.0 |
| Batch correction from the supplementary software of Zaas et al. | 221.8 | 1098.0 | 0.0 | 3543.5 |
| ComBat | 253.2 | 1174.3 | 0.0 | 3991.5 |
The phenotype variable was randomly permuted 10,000 times. On each permutation, we applied a given preprocessing method and then performed gene selection using a two-sample t-test with the false discovery rate (FDR) correction at level 0.2 [32], [33].