Zachary B Abrams, Travis S Johnson, Kun Huang, Philip R O Payne, Kevin Coombes.
Abstract
BACKGROUND: RNA sequencing technologies have allowed researchers to gain a better understanding of how the transcriptome affects disease. However, sequencing technologies often unintentionally introduce experimental error into RNA sequencing data. To counteract this, normalization methods are routinely applied with the intent of reducing the non-biological variability inherent in transcriptomic measurements. However, the comparative efficacy of the various normalization techniques has not been tested in a standardized manner. Here we propose tests that evaluate numerous normalization techniques and apply them to a large-scale standard data set. These tests comprise a protocol that allows researchers to measure the amount of non-biological variability present in any data set after normalization has been performed, a crucial step in assessing the biological validity of data following normalization.
Keywords: Biological variability; Normalization; RNASeq; Standardization
Year: 2019 PMID: 31861985 PMCID: PMC6923842 DOI: 10.1186/s12859-019-3247-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Percentage of total genes (reported as proportions) with a significant p-value after each normalization technique. Site p-values correspond to the association of site with gene expression, and sample p-values correspond to the association of sample with gene expression.
| | Raw | TMM | DESeq | Quant | RPKM | TPM | Log2 |
|---|---|---|---|---|---|---|---|
| % site | 0.95 | 0.95 | 0.94 | 0.92 | 0.87 | 0.90 | 1.00 |
| % sample | 0.37 | 0.27 | 0.35 | 0.34 | 0.36 | 0.49 | 0.69 |
| Variance site | 1.40E-02 | 1.30E-02 | 1.39E-02 | 2.00E-02 | 2.73E-02 | 2.28E-02 | 4.44E-77 |
| Variance sample | 8.73E-02 | 8.78E-02 | 8.61E-02 | 8.42E-02 | 8.62E-02 | 8.12E-02 | 6.37E-02 |
| Median site | 2.28E-59 | 1.11E-64 | 5.22E-51 | 2.60E-58 | 1.64E-20 | 1.85E-35 | 0.00E+00 |
| Median sample | 2.00E-01 | 2.98E-01 | 2.17E-01 | 2.30E-01 | 2.13E-01 | 5.71E-02 | 1.28E-05 |
Fig. 1 Bar plot of normalization methods and their relative errors from a two-way ANOVA. The MSE for each feature (site and biological condition) measures the amount of variance attributed to that feature. The top, narrow-striped bar is site-dependent variability (batch effects); the solid bar is biological variability; and the bottom, wide-striped bar is the residual variability.
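The variance split described in this caption can be illustrated for a single gene with a minimal sketch. It assumes a balanced design and an additive site + sample model, so any interaction falls into the residual. The function name `two_way_mean_squares` and the toy values are hypothetical and are not taken from the paper's data or code.

```python
import numpy as np

def two_way_mean_squares(y, site, sample):
    """Mean squares for an additive site + sample layout (balanced design assumed)."""
    y = np.asarray(y, dtype=float)
    factors = {"site": np.asarray(site), "sample": np.asarray(sample)}
    grand_mean = y.mean()

    ss, df = {}, {}
    for name, labels in factors.items():
        levels = np.unique(labels)
        # Between-level sum of squares: n_level * (level mean - grand mean)^2
        ss[name] = sum(
            np.sum(labels == lv) * (y[labels == lv].mean() - grand_mean) ** 2
            for lv in levels
        )
        df[name] = len(levels) - 1

    # In a balanced design the total sum of squares splits additively,
    # so the residual is whatever the two main effects leave unexplained.
    ss_total = np.sum((y - grand_mean) ** 2)
    ss["residual"] = ss_total - ss["site"] - ss["sample"]
    df["residual"] = len(y) - 1 - df["site"] - df["sample"]
    return {name: ss[name] / df[name] for name in ss}

# Toy data (made up for illustration): 3 sites x 4 samples x 2 replicates,
# built as a pure sum of a site effect and a sample effect, with no noise.
site_effect = {"s1": 0.0, "s2": 10.0, "s3": 20.0}
sample_effect = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0}
rows = [(s, t) for s in site_effect for t in sample_effect for _ in range(2)]
site = [s for s, _ in rows]
sample = [t for _, t in rows]
y = [site_effect[s] + sample_effect[t] for s, t in rows]

ms = two_way_mean_squares(y, site, sample)
print(ms)
```

In this toy case the site mean square dominates and the residual is essentially zero, which corresponds to a bar made up almost entirely of the site-dependent (batch) component. Running the same decomposition per gene after each normalization would give the kind of comparison the figure summarizes.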
Fig. 2 Raw read counts for the gene TP53 from the Australian Genome Research Facility site, arranged by sample type (A, C, D, and B). The Y axis shows the read counts. The blank space in the middle marks where a 50–50 mixture of A and B would be located if one had been created and measured. Leaving this space blank makes the linearity between A and B visible: the C and D mixtures should fall on the line joining A and B. If C or D does not fall on that line, the normalization method is imposing unwanted structure on the data. If all four samples (A, B, C, and D) form a clear linear relationship, the normalization method is representing the true biological structure of the data.
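The visual check described in this caption can also be stated numerically. The sketch below assumes that C and D are mixtures containing 25% and 75% of sample B respectively. These fractions are an assumption, since the caption does not give the mixing ratios, and should be replaced with the actual design values. The function names and the counts are hypothetical.

```python
import numpy as np

def expected_mixture(a, b, frac_b):
    """Expression expected on the A-B line for a mixture with fraction frac_b of B."""
    return (1.0 - frac_b) * np.mean(a) + frac_b * np.mean(b)

def relative_deviation(observed, a, b, frac_b):
    """Signed relative distance of an observed mixture mean from the A-B line."""
    expected = expected_mixture(a, b, frac_b)
    return (np.mean(observed) - expected) / expected

# Made-up replicate values for one gene (not taken from the paper).
a = [98.0, 102.0]     # pure sample A, mean 100
b = [198.0, 202.0]    # pure sample B, mean 200
c = [125.0, 125.0]    # assumed mixture with 25% B
d = [170.0, 180.0]    # assumed mixture with 75% B
c_off = [150.0, 150.0]  # a mixture that does not sit on the line

dev_c = relative_deviation(c, a, b, frac_b=0.25)
dev_d = relative_deviation(d, a, b, frac_b=0.75)
dev_off = relative_deviation(c_off, a, b, frac_b=0.25)
print(dev_c, dev_d, dev_off)
```

A deviation near zero for both mixtures corresponds to C and D sitting on the line between A and B. A large deviation after normalization, where the raw counts showed none, would suggest that the method has distorted the structure of the data.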
Fig. 3 a TP53. b POLR2A. c CD59. d GAPDH