Literature DB >> 28468817

Assessment of Single Cell RNA-Seq Normalization Methods.

Abstract

We have assessed the performance of seven normalization methods for single cell RNA-seq using data generated from dilution of RNA samples. Our analyses showed that methods considering spike-in External RNA Control Consortium (ERCC) RNA molecules significantly outperformed those not considering ERCCs. This work provides a guidance of selecting normalization methods to remove technical noise in single cell RNA-seq data.

Entities: Chemical Disease Species

Keywords: normalization; scRNA; statistical index

Mesh：

Year: 2017 PMID： 28468817 PMCID： PMC5499114 DOI： 10.1534/g3.117.040683

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

Single cell RNA-seq (scRNA-seq) is becoming a powerful tool to study many biological problems, such as tissue heterogeneity (Kumar ; Patel ; Usoskin ), and cell differentiation during development (Biase ; Trapnell ; Treutlein ). Normalization is particularly critical for interpreting scRNA-seq data, including detection of differentially expressed gene, and identification of cell subtypes (Stegle ). The small amount of samples used in scRNA-seq often leads to higher technical noise compared to bulk RNA-seq, which needs to be removed. Various methods have been developed for normalization of scRNA-seq data, including fragments per kilobase of transcript per million mapped reads (FPKM) (Mortazavi ), upper quartile (UQ) (Bullard ), trimmed mean of M-values (TMM) (Robinson and Oshlack 2010), DESeq (Love ), removed unwanted variation (RUV) (Risso ), and gamma regression model (GRM) (Ding ). Most of these methods were developed for bulk RNA-seq, and then applied directly to scRNA-seq analysis. The performance of these methods was assessed on bulk (Dillies ; Risso ), but not on scRNA-seq, data because the ground truth is likely unknown for single cell assays. Recently, the NIH Single Cell Analysis Program—Transcriptome Project (SCAP-T) collected two well-characterized human reference RNA samples: Universal Human Reference RNA (UHR) and Human Brain Reference (HBR) (Dueck ). A set of RNA-seq data was generated using different amount of RNAs obtained from dilution (10 ng considered as bulk, 100 and 10 pg). These samples were prepared for sequencing using three protocols: antisense RNA IVT protocol (abbreviated as aRNA or A) (Morris ), a customized C1 SMARTer protocol performed on a Fluidigm C1 94-well chip (S) (Ramskold ), and a modified NuGen Ovation RNA sequencing protocol (N) (Kurn ) (Table 1). These data can be divided into six groups based on the sample source and amount: UHR, bulk; UHR, 100 pg; UHR, 10 pg; HBR, bulk; HBR, 100 pg, and HBR 10 pg. It is natural to assume that 100 pg samples would be more similar to bulk than 10 pg samples. Therefore, this set of data is invaluable to compare normalization methods because the ground truth is known. Furthermore, 51 of these samples were mixed with spike-in ERCC RNA molecules, which makes it possible to evaluate the impact of considering ERCCs in normalization of scRNA-seq data.

Table 1

Dilution RNA-seq data generated by SCAP-T

Single Cell Protocol	Human Brain Reference RNA Amount (HBR)			Universal Human Reference RNA Amount (UHR)			Total
Single Cell Protocol	10 pg	100 pg	Bulk	10 pg	100 pg	Bulk	Total
aRNA (A)	18(5)^a	3	—	12	7	—	40
C1 SMARTer (S)	15(15)	5(5)	—	15(15)	5(5)	—	40
Nugen Ovation RNASeq V2 (N)	4	4	—	15	11	—	34
None	—	—	3	—	—	3	6
Total	37	12	3	42	23	3	120

The number in the parenthesis is the number of samples containing spike-in ERCCs.

The number in the parenthesis is the number of samples containing spike-in ERCCs. Using this data set, we have assessed the performance of several commonly used normalization methods: FPKM, upper quartile (UQ), trimmed mean of M-values (TMM), DESeq, removed unwanted variation (RUV), and gamma regression model (GRM). This study thus provides a guidance of choosing normalization methods to remove technical noise in single cell RNA-seq data.

Materials and Methods

Selection of variable genes

Before applying different normalization methods on the data, we first selected variable genes using the following criteria: (1) across the 114 nonbulk samples, at least two samples with log(fpkm) >2; (2) across 114 nonbulk samples, variant of log(fpkm) >1. In this way, we found 13,375 genes for the following analyses.

Running normalization methods

FPKM (Garber ) normalizes the gene counts in consideration of library size and gene length. The UQ (Bullard ) and TMM (Robinson and Oshlack 2010) methods are implemented in the edgeR Bioconductor package. The DESeq (Anders and Huber 2010) method is implemented in the Bioconductor package. For all methods, we ran the R function using default parameters. Remove unwanted variants (RUV) (Risso ) is included in the RUVnormalize Bioconductor packages. The model sets up a generalized linear regression model between observed RNA-seq read counts and known covariates of interest, along with unknown unwanted variation factors. Different options of RUV, RUVr, and RUVg have different ways to estimate the unwanted variant factors. RUVr uses residuals from a first-pass regression of read counts (Risso ), and considers the least differentially expressed genes across samples. RUVg considers negative controls such as External RNA Control Consortium (ERCC) spike-ins, and assumes that they are not differentially expressed across samples. All the R functions were run using default parameters. For empirical undifferentially expressed genes in running RUVr, we chose genes not in the top 6000 differentially expressed genes. GRM (Ding ) fits a gamma regression model between FPKM values of reads and the concentration of ERCC spike-ins, and then make estimates of the molecular concentration of the genes from the reads.

Assessment of clustering performance

After normalization, we clustered the normalized gene reads using hierarchical clustering with the Ward method, and the metric was Pearson correlation between normalized gene reads. We assessed the clustering results using the following statistical indices:

Rand index:

The Rand index (Rand 1971) evaluates the correctness of clustering using prior labels. Given a set of n elements S = and two partitions needed to be compared, a partition of S into M clusters and a partition of S into N clusters for certain are defined as follows:Then Rand index can be computed as follows:where (a + b) is the number of agreements between X and Y, while (c + d) is the number of disagreements between X and Y. The higher the Rand index, the more similar the two partitions. As we compared the clustering partitions to the prior known labels, the higher Rand index indicates the better the clustering is.

Dunn index:

The Dunn index (Dunn 1974) evaluates the compactness and separation of the clustering. Specifically, given a set of n elements S = and the clustering partition as the Dunn index is computed as follows:By definition, the higher the Dunn index, the better the clustering, considering enough separation compared to diameter of individual clusters. Here, the distance between two samples is defined as 1 − the Pearson correlation coefficient between two samples.

Jaccard index:

The Jaccard index evaluates the stability of clustering that measures the similarity between two finite subsets (Jaccard 1912a; Tan ). Given two subsets, A and B, the Jaccard index is computed as:We used the Flexible Procedures for Clustering (fpc) package in R to calculate the Jaccard index, with random subsetting without replacement of samples. Each time a subset of samples was drawn, we clustered the samples, and then computed the maximum Jaccard index between the new cluster and the prior known cluster. Repeating B times for drawing B random subsets (here we chose B = 100), we then computed the mean of all the maximum Jaccard index (Hennig 2007). The higher the Jaccard index, the more robust the clustering.

Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

Results and Discussion

We selected seven methods to normalize the data of the 120 RNA-seq samples. These methods fall into two categories, depending on whether or not spike-in ERCC molecules were taken into consideration in normalization.

Evaluation of the methods not considering ERCC

We first evaluated the performance of five methods not considering ERCCs: FPKM (Mortazavi ), UQ (Bullard ), TMM (Robinson and Oshlack 2010), DESeq (Love ), and RUVr (Risso ) (nonconsidering ERCC version of RUV) (see Materials and Methods for the details of setup for running each method). A total of 13,375 genes was selected with log(fpkm) >2 in at least two of the 114 nonbulk samples (100 or 10 pg). Using Pearson correlation as the similarity metric, we clustered all 120 samples (114 nonbulk and six bulk samples) using hierarchical clustering (Figure 1 and Supplemental Material, Figure S1 in File S1). For all methods, the UHR and HBR samples were well separated as expected, and we thus focused on evaluating how well the HBR and UHR samples were further clustered. In either the UHR or HBR group, samples normalized by FPKM, UQ, TMM, and DESeq tended to cluster by sequencing protocol rather than RNA amount, which indicates that the differences introduced by the protocols were not completely removed by normalization. In contrast, the HBR samples normalized by RUVr were largely clustered by RNA amount rather than sequencing protocol, although a significant portion of 10 pg samples sequenced using aRNA were clustered separately (Figure 1 and Figure S1 in File S1). Qualitatively, RUVr normalization alleviates the difference between scRNA-seq protocols and the clustering results are closer to the ground truth than the other methods, i.e., the samples are clustered based on the source (HBR) and the RNA amount (bulk, 100 or 10 pg). However, the UHR samples normalized with RUVr were still clustered according to protocols rather than RNA amount. The other four methods showed worse clustering results than RUVr because 10 and 100 pg samples were mixed in each protocol group (Figure 1 and Figure S1 in File S1).

Figure 1

Hierarchical clustering of HBR and UHR RNA-seq samples with different normalization methods not considering ERCC. (A) DESeq with HBR samples; (B) RUVr with HBR samples; (C, D) DESeq and RUVr with UHR samples. The results of other methods are shown in Figure S1 in File S1. A, aRNA; N, Nugen Ovation RNASeq V2; S, C1 SMARTer. To quantify the similarity between the clusters generated from different normalization methods and the ground truth, we cut the clusters at different hierarchy levels, and calculated the Rand index between the clusters and the ground truth, which reflects the correctness of clustering. The Rand index is the ratio between the sample pairs that are correctly clustered and the total sample pairs (Rand 1971). Because all the methods successfully separated HBR and UHR samples, as discussed above, we calculated the Rand index on HBR and UHR samples separately. For either the HBR or UHR samples, we varied the cutoffs to cut the hierarchical trees into two to seven clusters, and calculated Rand index for each cutoff (Figure 2A and Figure S2 in File S1). RUVr clearly outperformed the other methods.

Figure 2

Comparison of five normalization methods not using ERCC on the HBR samples. (A) Rand index with different number of clusters; (B) Dunn index with different number of clusters; (C) Jaccard index with three clusters; (D) Rand index with the most variable genes; (E) Rand index with the least variable genes; (F) Dunn index with the most variable genes; and (G) Dunn index with the least variable genes. (D–G) Results using three clusters as the ground truth (bulk, 100 and 10 pg HBR samples). UHR results are shown in the supplementary materials (Figure S2 in File S1). Next, we evaluated the compactness and separation of clusters using the Dunn index, which is the ratio of the smallest distance between clusters to the largest intracluster distance, and the distance between the two samples is defined as 1 − the Pearson correlation between the samples (Dunn 1973). We computed the Dunn index on HBR and UHR samples separately. RUVr had a much higher Dunn index than the other methods when all the samples were clustered into two groups. For more clusters, all the methods had comparable Dunn index. This observation suggests that the separation between clusters is insensitive to normalization methods when the 120 samples were clustered into more than three clusters. It is important to examine whether the clustering results are robust to the selection of samples and genes. We first divided HBR or UHR samples into three clusters: bulk, 100 and 10 pg. Then, we randomly selected 50% of the total samples, clustered them into three groups using the 13,375 genes selected above. We computed the maximum Jaccard index (Jaccard 1912b) (see definition in Materials and Methods) between the new cluster and the ground truth. The Jaccard index reflects the correctly clustered samples using the subsets of samples, and thus the robustness of the clustering. All methods except RUVr showed similar Jaccard index values. Next, to evaluate the robustness of clustering to genes used to calculate Pearson correlation coefficient between samples, we ranked the 13,375 genes using their coefficients of variance (CV) (Brennecke ), which is the standard deviation of gene expression divided by the mean, reflecting the variation of a gene’s expression across samples. We selected 10 sets of genes: top 10%, top 30%, top 50%, top 70%, top 90%, and bottom 10%, bottom 30%, bottom 50%, bottom 70%, and bottom 90%. In general, using the most variable genes achieved the most correct clustering, i.e., high Rand index, and using a sufficient number of the most variable genes (top 10% for RUVr and FPKM, top 30% for UQ and TMM, and top 50% for DeSeq) gave the highest Dunn index, which represents the best separation of the clusters (Figure 2, D–G). Using the least variable genes (most invariable genes), such as the bottom 10% variable genes, normally showed lower Rand and Dunn index than using more variable genes, such as the bottom 90% variable genes (Figure 2, D–G).

Evaluation of methods considering ERCC

We next evaluated the performance of two methods considering ERCCs: RUVg (RUV model considering ERCC) (Risso ) and GRM (Ding ) (see Materials and Methods for the details of the setup for running each method). Among the 120 RNA-seq runs, 45 samples containing spike-in ERCCs were normalized using the two methods, and then clustered with six bulk samples together (Figure 3). The clustering results were similar to those obtained using the 13,375 selected genes. UHR and HBR samples were clearly separated, and the samples with the same amount of RNA were largely clustered together. One outlier in the RUVg cluster is a HBR 10 pg sample sequenced using the aRNA protocol that was clustered with HBR 100 pg samples.

Figure 3

Hierarchical clustering of 51 samples with different normalization methods using ERCC. (A) RUVg; (B) GRM. B, HBR; U, UHR; A, aRNA; S, C1 SMARTer.

Hierarchical clustering of 51 samples with different normalization methods using ERCC. (A) RUVg; (B) GRM. B, HBR; U, UHR; A, aRNA; S, C1 SMARTer. The GRM showed a slightly higher Rand index than RUVg when the number of clusters was the same as the ground truth (6) or more (Figure 4A). GRM also had a better Dunn index and comparable Jaccard index (Figure 4, B and C), which suggests that GRM achieves better separation of clusters (Dunn index). The robustness to sample selection is comparable for the two methods, as indicated by their similar Jaccard index. Furthermore, we assessed how sensitive the two methods are to gene selection. For the Dunn index, GRM outperformed RUVg, regardless of whether the most variable, or the least variable, genes were selected (Figure 4, F and G). For the Rand index, using the most variable genes, GRM consistently showed higher values than RUVg; using the least variable genes, GRM achieved higher Rand index using the bottom 90 and 70% genes, but lower values using the bottom 50, 30 or 10% genes than RUVg and RUVr (six clusters of bulk, 100 and 10 pg samples for HBR and UHR were considered as the ground truth) (Figure 4, D and E). We examined the clusters generated using the bottom 30% variable genes. GRM still correctly clustered all the samples, but RUVg incorrectly clustered some HBR samples with UHR, despite a higher Rand index (Figure 4E and Figure 5). We then used the bottom 10% variable genes (the most invariable genes) for a further comparison. Both methods misclustered UHR bulk with the HBR samples, but RUVg had additional HBR samples mistakenly clustered with UHR samples (Figure 5). For the Jaccard index, GRM performed better than RUVg, and RUVr with genes selected from top 10% through top 90% and bottom 90%, bottom 70% and bottom 50%. When the least variable genes were selected (bottom 10%), RUVg and RUVr significantly outperformed GRM (Figure S3 in File S1).

Figure 4

Figure 5

Comparison of two normalization methods using ERCC (RUVg and GRM) and one not using ERCC (RUVr) if the ground truth has two clusters (HBR and UHR samples). (A) Rand index of GRM, RUVg and RUVr with the most variable genes. (B) Rand index of GRM, RUVg and RUVr with the least variable genes. (C, D) Hierarchical clustering of GRM and RUVg using the bottom 30% variable genes. (E, F) Hierarchical clustering of GRM and RUVg using the bottom 10% variable (the most invariable) genes. Right brackets label the first branches. Red samples are wrongly clustered samples.

Comparison of two normalization methods using ERCC (RUVg and GRM) and one not using ERCC (RUVr). (A) Rand index with different number of clusters; (B) Dunn index with different number of clusters; (C) Jaccard index with six clusters; (D) Rand index with the most variable genes; (E) Rand index with the least variable genes; (F) Dunn index with the most variable genes; and (G) Dunn index with the least variable genes. (D–G) Results of using six clusters as the ground truth (bulk, 100 and 10 pg of HBR and UHR samples). Comparison of two normalization methods using ERCC (RUVg and GRM) and one not using ERCC (RUVr) if the ground truth has two clusters (HBR and UHR samples). (A) Rand index of GRM, RUVg and RUVr with the most variable genes. (B) Rand index of GRM, RUVg and RUVr with the least variable genes. (C, D) Hierarchical clustering of GRM and RUVg using the bottom 30% variable genes. (E, F) Hierarchical clustering of GRM and RUVg using the bottom 10% variable (the most invariable) genes. Right brackets label the first branches. Red samples are wrongly clustered samples. Notably, both methods considering ERCC outperformed the methods not considering ERCC on the 51 samples (45 10 or 100 pg and six bulk samples) containing spike-in ERCCs (Figure 4). RUVr, as a representative method, largely outperforms the other methods not considering ERCCs. Obviously, both GRM and RUVg showed higher Rand, Dunn and Jaccard index than RUVr, which indicates the value of considering ERCC in normalization and removing noise from scRNA-seq. Normalization of bulk RNA-seq using ERCCs can be useful, but may be less critical because the technical noise is smaller in bulk RNA-seq than in scRNA-seq.

Conclusions

Here, we evaluated the performance of seven normalization methods on 120 RNA-seq runs in terms of correctness (Rand index), compactness (Dunn index), and robustness (Jaccard index, robustness analysis using different sets of samples and genes) of clustering. The results showed that, for methods not considering spike-in ERCCs, RUVr showed higher Rand index and lower Jaccard index than FPKM, UQ, DESeq, and TMM; all methods showed similar Dunn index values. Considering ERCC, such as in the models of RUVg and GRM, significantly improved performance. Between RUVg and GRM, GRM is more robust in terms of selecting different sets of genes that generate similar clusters. Spike-in ERCCs would reduce the sequencing depth of mRNAs of interest, and there is also a concern about whether the synthesized ERCC molecules behave the same as mRNAs from the cell. Our analyses suggest that calibration of scRNA-seq to the spike-in ERCC is a powerful means of removing technical noise when the ERCCs are correctly modeled.

Supplementary Material

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.040683/-/DC1. Click here for additional data file.

21 in total

1. Normalization of RNA-seq data using factor analysis of control genes or samples.

Authors: Davide Risso; John Ngai; Terence P Speed; Sandrine Dudoit
Journal: Nat Biotechnol Date: 2014-08-24 Impact factor: 54.908

2. Novel isothermal, linear nucleic acid amplification systems for highly multiplexed applications.

Authors: Nurith Kurn; Pengchin Chen; Joe Don Heath; Anne Kopf-Sill; Kathryn M Stephens; Shenglong Wang
Journal: Clin Chem Date: 2005-08-25 Impact factor: 8.327

3. Mapping and quantifying mammalian transcriptomes by RNA-Seq.

Authors: Ali Mortazavi; Brian A Williams; Kenneth McCue; Lorian Schaeffer; Barbara Wold
Journal: Nat Methods Date: 2008-05-30 Impact factor: 28.547

Review 4. Computational methods for transcriptome annotation and quantification using RNA-seq.

Authors: Manuel Garber; Manfred G Grabherr; Mitchell Guttman; Cole Trapnell
Journal: Nat Methods Date: 2011-05-27 Impact factor: 28.547

5. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing.

Authors: Dmitry Usoskin; Alessandro Furlan; Saiful Islam; Hind Abdo; Peter Lönnerberg; Daohua Lou; Jens Hjerling-Leffler; Jesper Haeggström; Olga Kharchenko; Peter V Kharchenko; Sten Linnarsson; Patrik Ernfors
Journal: Nat Neurosci Date: 2014-11-24 Impact factor: 24.884

Review 6. Computational and analytical challenges in single-cell transcriptomics.

Authors: Oliver Stegle; Sarah A Teichmann; John C Marioni
Journal: Nat Rev Genet Date: 2015-01-28 Impact factor: 53.242

7. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma.

Authors: Anoop P Patel; Itay Tirosh; John J Trombetta; Alex K Shalek; Shawn M Gillespie; Hiroaki Wakimoto; Daniel P Cahill; Brian V Nahed; William T Curry; Robert L Martuza; David N Louis; Orit Rozenblatt-Rosen; Mario L Suvà; Aviv Regev; Bradley E Bernstein
Journal: Science Date: 2014-06-12 Impact factor: 47.728

8. Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells.

Authors: Daniel Ramsköld; Shujun Luo; Yu-Chieh Wang; Robin Li; Qiaolin Deng; Omid R Faridani; Gregory A Daniels; Irina Khrebtukova; Jeanne F Loring; Louise C Laurent; Gary P Schroth; Rickard Sandberg
Journal: Nat Biotechnol Date: 2012-08 Impact factor: 54.908

9. Differential expression analysis for sequence count data.

Authors: Simon Anders; Wolfgang Huber
Journal: Genome Biol Date: 2010-10-27 Impact factor: 13.583

10. Deconstructing transcriptional heterogeneity in pluripotent stem cells.

Authors: Roshan M Kumar; Patrick Cahan; Alex K Shalek; Rahul Satija; AJay DaleyKeyser; Hu Li; Jin Zhang; Keith Pardee; David Gennert; John J Trombetta; Thomas C Ferrante; Aviv Regev; George Q Daley; James J Collins
Journal: Nature Date: 2014-12-04 Impact factor: 49.962

3 in total

1. Interplay Between GH-regulated, Sex-biased Liver Transcriptome and Hepatic Zonation Revealed by Single-Nucleus RNA Sequencing.

Authors: Christine N Goldfarb; Kritika Karri; Maxim Pyatkov; David J Waxman
Journal: Endocrinology Date: 2022-07-01 Impact factor: 5.051

2. An Efficient and Flexible Method for Deconvoluting Bulk RNA-Seq Data with Single-Cell RNA-Seq Data.

Authors: Xifang Sun; Shiquan Sun; Sheng Yang
Journal: Cells Date: 2019-09-27 Impact factor: 6.600

3. Digital Cell Sorter (DCS): a cell type identification, anomaly detection, and Hopfield landscapes toolkit for single-cell transcriptomics.

Authors: Sergii Domanskyi; Alex Hakansson; Thomas J Bertus; Giovanni Paternostro; Carlo Piermarocchi
Journal: PeerJ Date: 2021-01-13 Impact factor: 2.984

3 in total