Farnoosh Abbas-Aghababazadeh, Qian Li, Brooke L Fridley.
Abstract
Normalization of RNA-Seq data has proven essential to ensure accurate inferences and replication of findings. Hence, various normalization methods have been proposed for the technical artifacts that can be present in high-throughput sequencing transcriptomic studies. In this study, we set out to compare the widely used library size normalization methods (UQ, TMM, and RLE) and across-sample normalization methods (SVA, RUV, and PCA) for RNA-Seq data using publicly available data from The Cancer Genome Atlas (TCGA) cervical cancer study (CESC). Additionally, an extensive simulation study was completed to compare the performance of the across-sample normalization methods in estimating technical artifacts. Lastly, we investigated the reduction in degrees of freedom caused by normalization and its impact on downstream differential expression analysis results. Based on this study, the TMM and RLE library size normalization methods give similar results for the CESC dataset. In addition, the results from the simulated datasets show that the SVA ("BE") method outperforms the other methods (SVA "Leek" and PCA) by correctly estimating the number of latent artifacts. Moreover, ignoring the loss of degrees of freedom due to normalization results in inflated type I error rates. We recommend not only adjusting for library size differences but also assessing known and unknown technical artifacts in the data and, if needed, completing across-sample normalization. In addition, we suggest including the known and estimated latent artifacts in the design matrix to correctly account for the loss in degrees of freedom, as opposed to completing the analysis on the post-processed normalized data.
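
The point about degrees of freedom can be illustrated with a toy calculation. The following is a minimal sketch, not the authors' code: it assumes a single gene on a continuous scale, one known batch factor with three levels, and ordinary least squares in place of the count-based models normally used for RNA-Seq. All names (`ols`, `group`, `batch`, `y_adj`) are hypothetical. It fits the same data twice: once with the batch terms kept in the design matrix, and once after subtracting the fitted batch effect and treating the adjusted values as if they were raw data.

```python
import numpy as np

def ols(y, design):
    """Ordinary least squares: coefficients, standard errors, residual df."""
    beta, _, rank, _ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    df = design.shape[0] - rank            # residual degrees of freedom
    sigma2 = resid @ resid / df            # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.pinv(design.T @ design)))
    return beta, se, df

rng = np.random.default_rng(0)
n = 24
group = np.repeat([0.0, 1.0], n // 2)      # factor of interest (no true effect)
batch = np.tile([0, 1, 2], n // 3)         # known technical artifact, 3 levels

X = np.column_stack([np.ones(n), group])                      # intercept + group
Z = np.column_stack([batch == 1, batch == 2]).astype(float)   # batch dummies
y = 5.0 + np.array([0.0, 1.5, -1.0])[batch] + rng.normal(size=n)

# (a) Artifact terms stay in the design matrix.
beta_full, se_full, df_full = ols(y, np.column_stack([X, Z]))

# (b) Two-step analysis: remove the fitted batch effect, then test the
#     group effect on the adjusted data with a design that omits batch.
y_adj = y - Z @ beta_full[2:]
beta_naive, se_naive, df_naive = ols(y_adj, X)

t_full = beta_full[1] / se_full[1]
t_naive = beta_naive[1] / se_naive[1]
print("residual df:", df_full, "vs", df_naive)
print("|t| for group:", abs(t_full), "vs", abs(t_naive))
```

In this sketch both analyses return the same group coefficient and the same residual sum of squares, but the two-step version divides by 22 residual degrees of freedom instead of 20. Its standard error is therefore too small and its test statistic too large, which is the mechanism behind the inflated type I error rate described in the abstract.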
Year: 2018 PMID: 30379879 PMCID: PMC6209231 DOI: 10.1371/journal.pone.0206312
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 3. Plots of the first two PCs, with the proportion of total variation explained, for the CESC data, used to assess clustering by the primary factor of interest (keratin-low and keratin-high groups): (A, C) raw count data and (B, D) data after adjusting for library size (UQ) and removing the effect of batch ID and the SVs estimated by PCA through normalization. Each point is colored by either the two keratin groups (biological factor) (A, B) or the 14 levels of known batch ID (C, D).
Summary of current normalization methods to correct the technical biases for RNA-Seq data.
| Technical Bias | Normalization Method | Reference |
|---|---|---|
| Library size | Total count (TC) | Dillies M. A. et al. |
| Library size | Median | Dillies M. A. et al. |
| Library size | UQ | Bullard J. H. et al. |
| Library size | TMM | Robinson and Oshlack |
| Library size | RLE | Anders and Huber |
| Library size | Quantile (Q) | Smyth G. K. |
| Across sample | PCA | Price A. et al. |
| Across sample | RUV | Risso D. et al. |
| Across sample | SVA | Leek and Storey |
| Gene length | TPM | Mortazavi A. et al. |
| Gene length | FPKM / RPKM | Mortazavi A. et al. |
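
To make the library size methods in the table more concrete, here is a simplified sketch of three of them (TC, UQ, and RLE) written with NumPy. It is an illustration under stated assumptions, not the implementation used in the study or in any particular package: the function names are hypothetical, the count matrix is a toy example with genes in rows and samples in columns, and the factors are scaled by simple conventions that may differ from those of established tools. TMM is omitted because its trimming and weighting steps do not fit in a short example.

```python
import numpy as np

def total_count_factors(counts):
    """TC: each sample's total count relative to the mean total."""
    lib = counts.sum(axis=0)
    return lib / lib.mean()

def upper_quartile_factors(counts):
    """UQ: per-sample 75th percentile, ignoring genes with no reads at all."""
    expressed = counts[counts.sum(axis=1) > 0]
    uq = np.percentile(expressed, 75, axis=0)
    return uq / uq.mean()

def rle_factors(counts):
    """RLE (median of ratios): median ratio to a per-gene geometric mean."""
    positive = counts[(counts > 0).all(axis=1)]   # genes seen in every sample
    log_counts = np.log(positive)
    log_ref = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_ref, axis=0))

# Toy data: sample 2 is sample 1 sequenced twice as deeply.
counts = np.array([
    [10, 20],
    [0, 0],
    [5, 10],
    [100, 200],
    [30, 60],
], dtype=float)

tc = total_count_factors(counts)
uq = upper_quartile_factors(counts)
rle = rle_factors(counts)
normalized = counts / rle          # divide each sample by its factor
print(tc, uq, rle)
```

On this toy matrix all three methods assign the second sample a factor twice that of the first, so dividing by the factors makes the two columns identical. On real data the methods differ mainly in how sensitive they are to a few highly expressed genes.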