Zhenfeng Wu1,2, Weixiang Liu3, Xiufeng Jin2, Haishuo Ji2, Hua Wang2, Gustavo Glusman4, Max Robinson4, Lin Liu2, Jishou Ruan1, Shan Gao2.
Abstract
Data normalization is a crucial step in gene expression analysis because it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate existing normalization methods, different metrics, or the same metric applied to different datasets, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst case, a method evaluated as the best by one metric is evaluated as the poorest by another metric, or a method evaluated as the best using one dataset is evaluated as the poorest using another dataset. This raises an open question: what principles should guide the evaluation of normalization methods? In this study, we propose the principle that a normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics), and that a method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). We then designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it together with another metric, mSCC, to evaluate 14 commonly used normalization methods using both scRNA-seq data and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings pave the way for future studies on the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study are included in an R package named NormExpression. NormExpression provides a framework and a fast, simple way for researchers to select the best method for normalizing their gene expression data by evaluating different methods (particularly data-driven methods or their own methods) under the principles of the consistency of metrics and the consistency of datasets.
Keywords: R package; evaluation; gene expression; normalization; scRNA-seq
Year: 2019 PMID: 31114611 PMCID: PMC6503164 DOI: 10.3389/fgene.2019.00400
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Basic concepts. (A) A raw gene expression matrix is transformed into a normalized gene expression matrix by multiplying each column by a global factor f_j. Each column represents the expression levels of all genes in one sample, and each row represents the expression levels of one gene across all samples. (B) The library size methods (TN, TC, CR, and NR) estimate a library size N_j and produce the global normalization factor f_j as the reciprocal of N_j. HG7, ERCC, DESeq, TU, NCS, and ES produce a pseudo library size N_j*, which represents the relative amount of total RNA. RLE, UQ, and TMM produce a normalization factor s_j that adjusts the library size N_j, so the global normalization factor for data normalization should be 10^6/(N_j s_j). Q75 represents the third quartile Q3. For all methods, log represents the natural logarithm.
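The column-wise scaling described in this caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the NormExpression API: the function names are hypothetical, the toy matrix is made up, and the library size N_j is assumed here to be a plain column sum. The factor follows the 10^6/(N_j s_j) form given in the caption, with s_j defaulting to 1.

```python
def library_sizes(counts):
    """Column sums of a genes-by-samples matrix (assumed library size N_j)."""
    n_samples = len(counts[0])
    return [sum(row[j] for row in counts) for j in range(n_samples)]


def global_factors(sizes, scaling=None, scale=1e6):
    """Global factor f_j = scale / (N_j * s_j); s_j defaults to 1."""
    if scaling is None:
        scaling = [1.0] * len(sizes)
    return [scale / (n * s) for n, s in zip(sizes, scaling)]


def normalize(counts, factors):
    """Multiply every column j of the genes-by-samples matrix by f_j."""
    return [[value * f for value, f in zip(row, factors)] for row in counts]


# Toy matrix: 3 genes (rows) x 2 samples (columns); values are made up.
raw = [
    [10, 40],
    [30, 120],
    [60, 240],
]
sizes = library_sizes(raw)        # [100, 400]
factors = global_factors(sizes)   # [10000.0, 2500.0]
normalized = normalize(raw, factors)
```

In this toy case the second sample is the first sequenced four times deeper, so after scaling both columns become identical and each sums to 10^6.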
FIGURE 2. Consistency of metrics and consistency of datasets. Non-zero ratio cutoffs from 0.2 to 0.9 for scRNA663 and from 0.7 to 1 for bkRNA18 were used to produce AUCVCs and mSCCs. All the normalization methods were classified into three groups based on their AUCVC values sorted in descending order (from the best to the poorest) using one scRNA-seq dataset, scRNA663 (A), and one bulk RNA-seq dataset, bkRNA18 (B). These methods were also classified into three groups based on their mSCC values sorted in ascending order (from the best to the poorest) using scRNA663 (C) and bkRNA18 (D). GAPDH is not applicable to scRNA-seq data because GAPDH has zero counts in many samples. All numbers are accurate to two decimal places; marginal differences are reflected by their order. The raw gene expression data (None) was also used to produce AUCVCs and mSCCs for comparison. After further normalization, RLE is identical to DESeq and is presented as DESeq (RLE) or DESeq* in this study.
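The non-zero ratio cutoff mentioned in this caption can be read as a gene filter applied before the metrics are computed. The sketch below assumes that a gene's non-zero ratio is the fraction of samples in which its count is non-zero, and that genes at or above the cutoff are kept. Both the definition and the ">=" comparison are assumptions, and the function names and toy data are hypothetical.

```python
def nonzero_ratio(row):
    """Fraction of samples in which a gene has a non-zero count."""
    return sum(1 for value in row if value != 0) / len(row)


def filter_genes(counts, cutoff):
    """Keep genes (rows) whose non-zero ratio is at least `cutoff` (assumed rule)."""
    return [row for row in counts if nonzero_ratio(row) >= cutoff]


# Toy matrix: 3 genes (rows) x 4 samples (columns); values are made up.
raw = [
    [0, 0, 0, 5],  # non-zero ratio 0.25
    [1, 0, 2, 3],  # non-zero ratio 0.75
    [4, 5, 6, 7],  # non-zero ratio 1.0
]

# Number of genes retained at a few cutoffs within the 0.2 to 0.9 range.
kept_by_cutoff = {c: len(filter_genes(raw, c)) for c in (0.2, 0.5, 0.9)}
```

Raising the cutoff keeps only genes detected in more samples, which is one way to see why sparse scRNA-seq data and bulk data use different cutoff ranges.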
FIGURE 3. Visualization of evaluation results. A normalization method with a higher AUCVC produced a lower median of Spearman's rank Correlation Coefficients (mSCC) between the normalized expression profiles of ubiquitous gene pairs using both scRNA-seq (A,C) and bulk RNA-seq data (B,D). Using 1-SCCs as distances, hierarchical clustering of 13 normalization factors classified the methods into the same groups (E,F) as those obtained by AUCVC and by mSCC (Figure 2). SCnorm could not be used to calculate SCCs because it produces a factor matrix rather than a factor vector, unlike the 13 other methods. GAPDH is not applicable to scRNA-seq data because GAPDH has zero counts in many samples. The raw gene expression data (None) was also used to produce AUCVCs and mSCCs for comparison. After further normalization, RLE is identical to DESeq and is presented as DESeq (RLE) or DESeq* in this study.
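As a rough illustration of the mSCC described in this caption, the sketch below computes Spearman's rank correlation for every pair of gene profiles and takes the median. It is a minimal standard-library sketch, not the NormExpression implementation: the function names and toy profiles are hypothetical, ties receive average ranks, and the selection of ubiquitous genes is assumed to have been done beforehand.

```python
from itertools import combinations
from statistics import median


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mscc(profiles):
    """Median Spearman correlation over all pairs of gene profiles."""
    return median(spearman(a, b) for a, b in combinations(profiles, 2))


# Toy normalized profiles: 3 genes across 4 samples; values are made up.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 9.0],
    [1.0, 3.0, 2.0, 4.0],
]
result = mscc(profiles)  # pairwise SCCs are 1.0, 0.8, 0.8, so the median is 0.8
```

The same `spearman` function could be applied to two normalization factor vectors, with 1 minus the result serving as the clustering distance mentioned in the caption.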