Ravindra Naraine, Pavel Abaffy, Monika Sidova, Silvie Tomankova, Kseniia Pocherniaieva, Ondrej Smolik, Mikael Kubista, Martin Psenicka, Radek Sindelka.
Abstract
The merit of RNASeq data relies heavily on correct normalization. However, most methods assume that the majority of transcripts show no differential expression between conditions. This assumption may not always be correct, especially when one condition results in overexpression. We present a new method (NormQ) to normalize the RNASeq library size, using the relative proportion observed from RT-qPCR of selected marker genes. The method was compared against the popular median-of-ratios method, using simulated and real datasets. NormQ produced more matches to differentially expressed genes in the simulated dataset and more distribution profile matches for both the simulated and real datasets.
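The median-of-ratios approach that NormQ is compared against can be illustrated with a minimal sketch. This is a simplified NumPy version of the general technique, not DESeq2's own code; the function name and the toy count matrix are assumptions made for illustration.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Return one size factor per sample (column) of a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Keep only genes with non-zero counts in every sample, so the logs are finite.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # Per-gene reference: log of the geometric mean across samples.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor: median over genes of the count-to-reference ratio.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy data (hypothetical): sample 2 is sample 1 sequenced twice as deeply.
counts = np.array([[10, 20], [100, 200], [50, 100], [0, 3]])
size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors
```

Because the size factor is a median over genes, it reflects sequencing depth only when most genes are unchanged between samples. That is the assumption the abstract questions for asymmetric data.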
Keywords: DESeq; Median-of-ratios; Normalization; RNASeq; TOMOSeq; Transcriptomics
Year: 2020 PMID: 32514328 PMCID: PMC7264052 DOI: 10.1016/j.csbj.2020.05.010
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Schematic of the localization profile of selected mRNAs representing the four major profiles observed in the Xenopus laevis egg.
Fig. 2. degCheckFactor analysis of the proportion of the normalized counts for each gene relative to its mean count across the different sections (intra-section) or across the same section in all replicates (inter-section) from the Simulated-TOMOSeq data, and the resulting Principal Component Analysis for the 5000 genes showing the most variance. Replicate number is represented as r, a given egg section as s and gene as z. The intra-section analysis shows how well the normalization technique maintains separation between the different sections, while the inter-section analysis shows how well the median-of-ratios method can normalize between replicates.
Fig. 3. a) Differences in the number of significant (padj < 0.1) DEGs detected between sections when using different normalization techniques for the Simulated-TOMOSeq. b) Correlation between each section’s gene count proportion (relative to the egg) for the normalized data, versus the expected proportions in the Simulated-TOMOSeq. c) Distribution of the size factors obtained from each marker gene for each replicate and section for the Simulated-TOMOSeq. d) Number of marker genes detected within each profile after use of each normalization method. The localization profile comparison for the Simulated-TOMOSeq was assessed using genes that were commonly detected in all three normalization methods. The bottom axis shows the number of genes that were correctly identified within the given profile, while the top axis shows the number of genes that were incorrectly profiled. The y-axis represents the log10 of the number of detected genes. “Dm” represents DESeq2median, while “Ds” represents DESeq2spike.
Assessment of the area under the receiver operating characteristic (ROC) curve (AUC) and the number of profile matches for correctly identified DEGs after using each normalization method on the Simulated-TOMOSeq data.
| Normalization method | AUC-ROC of DEGs | Profile matches for DEGs shared with DESeq2expected | Profile matches for DEGs shared amongst all normalization methods |
|---|---|---|---|
| DESeq2spike | 0.995 | 90% (4667/5188) | 92% (1814/1978) |
| NormQ | 0.989 | 92% (4907/5334) | 94% (1853/1978) |
| DESeq2none | 0.952 | 77% (4325/5630) | 71% (1395/1978) |
| NormQ1 | 0.76 | 77% (4460/5818) | 71% (1394/1978) |
| DESeq2mean | 0.354 | 37% (771/2060) | 37% (739/1978) |
Profile matches for DEGs shared with DESeq2expected: (correct profile match ∩ correctly identified DEGs with no missing replicate data) / correctly identified DEGs with no missing replicate data.
Profile matches for DEGs shared amongst all normalization methods: (correct profile match ∩ correctly identified DEGs shared by all normalization methods with no missing replicate data) / correctly identified DEGs shared by all normalization methods with no missing replicate data.
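As a worked instance of the ratio defined in the two notes above, the short sketch below recomputes one table entry. The function name is hypothetical; the numbers are taken from the NormQ row of the table.

```python
def profile_match_rate(profile_matches, correct_degs):
    """Percentage of correctly identified DEGs whose localization profile also matches."""
    if correct_degs <= 0:
        raise ValueError("correct_degs must be positive")
    return 100.0 * profile_matches / correct_degs

# NormQ, DEGs shared with DESeq2expected: 4907 profile matches out of 5334 DEGs.
normq_rate = profile_match_rate(4907, 5334)
print(f"{normq_rate:.0f}%")  # prints "92%"
```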
Fig. 4. Schematic showing the normalization steps used for the NormQ method.
Recommendations for the selection of NormQ for RNASeq normalization.
| Step | Recommendation |
|---|---|
| 1 | Select well-established marker genes that have a known distribution. If no marker genes are known, use DESeq2median or DESeq2spike to select at least five DEGs from each derived cluster profile, so as to keep the probability of selecting outlier marker genes low (<0.005). |
| 2 | Ensure that the marker gene counts across all replicates and sample sections/conditions are adequate (for example, >100). |
| 3 | Assess the relative abundance of the marker genes within each sample section/condition using RT-qPCR. |
| 4 | Use NormQ to renormalize the data. |
| 5 | Use degCheckFactor to assess the effectiveness of the size factors used. If the distributions of the different sample sections/conditions are not well separated, then DESeq2median or DESeq2spike may be more appropriate, as this suggests there is no asymmetry in the data. |
| 6 | Compare the NormQ-, DESeq2median- or DESeq2spike-normalized data to the RT-qPCR-derived profile to determine which technique best fits the data. |
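The idea behind steps 2 to 4 can be sketched as follows. This is not the authors' implementation of NormQ: the function name, the input layout, the per-marker centering and the use of a median to combine the per-marker size factors are all assumptions chosen for illustration, and the counts and RT-qPCR proportions are made-up toy values.

```python
import numpy as np

def qpcr_guided_size_factors(marker_counts, qpcr_proportions, min_count=100):
    """Hypothetical sketch: size factors that make marker counts follow RT-qPCR proportions.

    marker_counts:    markers x samples raw RNASeq counts
    qpcr_proportions: markers x samples relative abundances from RT-qPCR
    """
    counts = np.asarray(marker_counts, dtype=float)
    props = np.asarray(qpcr_proportions, dtype=float)
    if counts.shape != props.shape:
        raise ValueError("counts and proportions must have the same shape")
    # Step 2 of the recommendations: require adequate marker counts everywhere.
    if np.any(counts <= min_count):
        raise ValueError("every marker count must exceed min_count")
    if np.any(props <= 0):
        raise ValueError("RT-qPCR proportions must be positive")
    # Per-marker size factor: observed count relative to the RT-qPCR proportion,
    # centered per marker so that markers of different abundance are comparable.
    log_ratio = np.log(counts) - np.log(props)
    log_ratio -= log_ratio.mean(axis=1, keepdims=True)
    # Assumption: combine the markers with a median to limit the effect of outliers.
    return np.exp(np.median(log_ratio, axis=0))

# Toy data (hypothetical): sample 2 was sequenced three times as deeply as sample 1.
qpcr = np.array([[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]])
counts = np.array([[800, 600], [1000, 3000], [300, 3600]])
sf = qpcr_guided_size_factors(counts, qpcr)
normalized = counts / sf
```

In this toy case the normalized marker counts recover the RT-qPCR proportions exactly. With real data the per-marker factors will disagree to some extent, which is why the spread of size factors across marker genes (as in Fig. 3c) is worth inspecting.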