Hilary S Parker, Héctor Corrada Bravo, Jeffrey T Leek.
Abstract
Batch effects are responsible for the failure of promising genomic prognostic signatures, major ambiguities in published genomic results, and retractions of widely publicized findings. Batch effect corrections have been developed to remove these artifacts, but they are designed for population studies. Genomic technologies, however, are beginning to be used in clinical applications where samples are analyzed one at a time for diagnostic, prognostic, and predictive purposes. No existing batch correction method has been developed specifically for prediction. In this paper, we propose a new method called frozen surrogate variable analysis (fSVA) that borrows strength from a training set for individual sample batch correction. We show that fSVA improves prediction accuracy in simulations and in public genomic studies. fSVA is available as part of the sva Bioconductor package.
Keywords: Batch effects; Database; Genomics; Machine learning; Prediction; Statistics; Surrogate variable analysis
Year: 2014 PMID: 25332844 PMCID: PMC4179553 DOI: 10.7717/peerj.561
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Table 1. Specifications for the three simulation scenarios used to show the performance of fSVA.
We performed three simulations under slightly different parameterizations to show the effectiveness of fSVA in improving prediction accuracy. Parameters from Eq. (2) were simulated using the distributions specified in this table. Additionally, the percentage of features in the simulation affected by batch, outcome, or both are as indicated in this table. Results from these simulations can be found in Fig. 1.
| Scenario | Parameter distributions | Features affected |
|---|---|---|
| Scenario 1 | Γ ∼ | 50% batch-affected; 50% outcome-affected; 40% affected by both |
| Scenario 2 | Γ ∼ | 50% batch-affected; 50% outcome-affected; 40% affected by both |
| Scenario 3 | Γ ∼ | 80% batch-affected; 80% outcome-affected; 50% affected by both |
Figure 1. fSVA improves prediction accuracy of simulated datasets.
We created simulated datasets (each consisting of a database and new samples) using model (2) and tested their prediction accuracy in R. For each simulated dataset we performed exact fSVA correction, fast fSVA correction, SVA correction on the database only, or no correction. We performed 100 iterations of each simulation scenario described in Table 1, across a range of values for the correlation between the outcome being predicted and the batch effects (x-axis in each plot). The plots show the 100 iterations, as well as the average trend line for each of the four methods investigated.
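The simulations vary the correlation between the outcome and the batch effects, but this section does not say how that correlation was induced. One simple way to do it, offered here only as an illustrative sketch and not as the authors' procedure, is to draw a balanced binary outcome and let a binary batch label agree with it with probability (1 + ρ)/2. For two balanced binary variables this gives a correlation of ρ. The function names and the binary encoding of batch are assumptions made for this example.

```python
import random


def simulate_outcome_and_batch(n, rho, seed=0):
    """Draw n (outcome, batch) pairs of 0/1 labels with correlation near rho.

    Assumption for this sketch: both labels are balanced binary variables.
    If they agree with probability p, their correlation is 2p - 1,
    so we set p = (1 + rho) / 2.
    """
    rng = random.Random(seed)
    p_agree = (1.0 + rho) / 2.0
    outcome, batch = [], []
    for _ in range(n):
        y = rng.randint(0, 1)  # balanced outcome label (0 or 1)
        # Copy the outcome with probability p_agree, otherwise flip it.
        b = y if rng.random() < p_agree else 1 - y
        outcome.append(y)
        batch.append(b)
    return outcome, batch


def pearson(x, y):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5
```

Calling `simulate_outcome_and_batch(20000, 0.6)` and passing the result to `pearson` should return a value close to 0.6. Sweeping `rho` over a grid would produce the kind of x-axis described in the caption.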
fSVA improves prediction accuracy in 5 of the 9 studies examined.
The remaining 4 studies showed indeterminate results because the 95% confidence intervals for the improvement overlapped zero. To estimate prediction accuracy, each study was randomly divided into “database samples” and “new samples”. Exact fSVA correction was then performed as described above. We then built a predictive model (PAM) on the database and tested its prediction accuracy on the new samples.
| Study | No correction | Improvement |
|---|---|---|
| GSE10927 | | |
| GSE13041 | | |
| GSE13911 | | |
| GSE2034 | | |
| GSE2603 | | |
| GSE2990 | | |
| GSE4183 | | |
| GSE6764 | | |
| GSE7696 | | |
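The bookkeeping behind this table (a random split into database and new samples, an improvement in accuracy, and a 95% confidence interval checked against zero) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code. The split fraction, the repetition of splits, and the normal-approximation interval are all hypothetical choices, since this section does not specify how the interval was computed. The accuracies themselves would come from a classifier such as PAM, which is not implemented here.

```python
import random
import statistics


def split_database_new(sample_ids, frac_database, rng):
    """Randomly divide samples into 'database' and 'new' sets.

    frac_database is a hypothetical parameter; the source does not state it.
    """
    ids = list(sample_ids)
    rng.shuffle(ids)
    cut = int(round(frac_database * len(ids)))
    return ids[:cut], ids[cut:]


def improvement_interval(acc_corrected, acc_uncorrected, z=1.96):
    """Mean paired accuracy improvement with a normal-approximation 95% CI.

    Each list holds one accuracy per repeated random split (an assumption).
    Returns (mean, lower, upper).
    """
    diffs = [c - u for c, u in zip(acc_corrected, acc_uncorrected)]
    mean = statistics.mean(diffs)
    half_width = z * statistics.stdev(diffs) / len(diffs) ** 0.5
    return mean, mean - half_width, mean + half_width


def verdict(lower, upper):
    """Classify a result by whether its interval excludes zero."""
    if lower > 0:
        return "improved"
    if upper < 0:
        return "worsened"
    return "indeterminate"  # the interval overlaps zero
```

For one study, one would repeat `split_database_new` with a seeded `random.Random`, record the accuracy on the new samples with and without correction, and pass the two lists to `improvement_interval`. A study counts as improved only when the lower bound is above zero.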