Shen Yin1,2, Xiaowei Zhan1, Bo Yao1, Guanghua Xiao1, Xinlei Wang2, Yang Xie1.
Abstract
RNA-sequencing (RNA-seq) provides a comprehensive quantification of transcriptomic activities in biological samples. Formalin-Fixed Paraffin-Embedded (FFPE) samples are collected as part of routine clinical procedures, and are the most widely available biological sample format in medical research and patient care. Normalization is an essential step in RNA-seq data analysis. A number of normalization methods, though developed for RNA-seq data from fresh frozen (FF) samples, can be used with FFPE samples as well. The only extant normalization method specifically designed for FFPE RNA-seq data, MIXnorm, has been shown to outperform these methods, but at the cost of a complex mixture model and a high computational burden. It is therefore important to adapt MIXnorm for simplicity and computational efficiency while maintaining its superior performance. Furthermore, it is critical to develop an integrated tool that performs commonly used normalization methods for both FF and FFPE RNA-seq data. We developed a new normalization method for FFPE RNA-seq data, named SMIXnorm, based on a two-component mixture model that is simplified relative to MIXnorm to facilitate computation. The expression levels of expressed genes are modeled by normal distributions without truncation, and those of non-expressed genes are modeled by zero-inflated Poisson distributions. The maximum likelihood estimates of the model parameters are obtained by a nested Expectation-Maximization algorithm with a less complicated latent variable structure, and closed-form updates are available within each iteration. Real data applications and simulation studies show that SMIXnorm greatly reduces computing time compared to MIXnorm, without sacrificing performance. More importantly, we developed a web-based tool, RNA-seq Normalization (RSeqNorm), that offers a simple workflow to compute normalized RNA-seq data for both FFPE and FF samples.
It includes SMIXnorm and MIXnorm for FFPE RNA-seq data, together with five commonly used normalization methods for FF RNA-seq data. Users can easily upload a raw RNA-seq count matrix and select one of the seven normalization methods to produce a downloadable normalized expression matrix for any downstream analysis. The R package is available at https://github.com/S-YIN/RSEQNORM. The web-based tool, RSeqNorm, is available at http://lce.biohpc.swmed.edu/rseqnorm with no restrictions on use or redistribution.
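As an illustration of the count-matrix-in, normalized-matrix-out workflow described in the abstract, the following is a minimal Python sketch of reads-per-million (RPM) scaling, one of the FF methods named in the tables of this record. It is not the RSeqNorm implementation (which is an R package); the function name `rpm_normalize` and the toy matrix are assumptions made for this example, and the genes-by-samples orientation is likewise assumed.

```python
import numpy as np

def rpm_normalize(counts):
    """Scale each sample (column) of a genes x samples count matrix to reads per million.

    Hypothetical helper for illustration; not part of RSeqNorm.
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0)  # total reads per sample
    if np.any(library_sizes == 0):
        raise ValueError("every sample needs at least one mapped read")
    # Broadcasting divides each column by its own library size.
    return counts / library_sizes * 1e6

# Toy matrix: 3 genes (rows) x 2 samples (columns); the values are made up.
raw = np.array([[10, 0],
                [30, 5],
                [60, 15]])
normalized = rpm_normalize(raw)
```

After scaling, every column of `normalized` sums to one million, so samples sequenced to different depths become comparable on a per-million basis.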
Keywords: FFPE; RNA-sequencing; archived samples; formalin-fixed paraffin-embedded samples; normalization; statistical methods
Year: 2021 PMID: 33841507 PMCID: PMC8024626 DOI: 10.3389/fgene.2021.650795
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Summary of RSeqNorm web-portal process.
Figure 2. RSeqNorm upload file requirements.
Figure 3. Diagnostic plot returned by RSeqNorm using SMIXnorm.
Figure 4. Simulation study. Average computing time of SMIXnorm and MIXnorm vs. sample size.
Correlations ρ between normalized FF and FFPE RAS pathway activation scores.
| ρ | 0.343 (0.012) | 0.343 (0.011) | 0.324 (0.017) | 0.298 (0.029) |
| | RPM | UQ | TMM | Raw |
| ρ | 0.286 (0.036) | 0.270 (0.049) | 0.153 (0.269) | 0.125 (0.366) |
p-values in parentheses are based on a two-sided permutation test for the hypothesis H.
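A two-sided permutation test of a correlation, as mentioned in the table note above, can be sketched as follows. This is a generic illustration rather than the authors' procedure: it assumes the Pearson correlation, assumes the null hypothesis is the usual one of no association, and uses the common add-one p-value estimate. The function name `permutation_corr_test` and the toy vectors are hypothetical.

```python
import numpy as np

def permutation_corr_test(x, y, n_perm=10000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y.

    Assumption: the null hypothesis is no association between x and y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    exceed = 0
    for _ in range(n_perm):
        # Shuffling y breaks the pairing, which simulates the null hypothesis.
        permuted_r = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(permuted_r) >= abs(observed):
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value

# Toy paired scores (made up): a perfectly linear relationship.
x_toy = np.arange(10.0)
y_toy = 2.0 * x_toy + 1.0
rho, p = permutation_corr_test(x_toy, y_toy, n_perm=2000)
```

Taking absolute values on both sides is what makes the test two-sided: a permuted correlation counts as extreme whether it is strongly positive or strongly negative.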
Figure 5. Gene-wise correlations between normalized FFPE and FF expression for soft tissue sarcomas data on all 20,242 protein-coding genes. The UQ method failed to normalize the data due to excess zero counts.
Gene-wise correlations between normalized FFPE and FF expression for soft tissue sarcomas data on the CINSARC gene signature.
| SMIXnorm | 0.344 | 0.465 | 0.529 |
| MIXnorm | 0.333 | 0.455 | 0.517 |
| DESeq | 0.165 | 0.260 | 0.354 |
| RPM | 0.146 | 0.243 | 0.350 |
| TMM | 0.010 | 0.098 | 0.161 |
| PS | −0.126 | 0.002 | 0.154 |
| UQ | – | – | – |
| Original | 0.020 | 0.107 | 0.181 |
The UQ method failed to normalize the data due to excess zero counts.
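The gene-wise correlations reported above compare, for each gene, its normalized expression across paired FFPE and FF samples. A minimal sketch of that computation is given below, assuming Pearson correlation and two genes-by-samples matrices whose columns are matched pairs; the source does not state which correlation coefficient was used, and the function name `genewise_correlation` and the toy matrices are hypothetical.

```python
import numpy as np

def genewise_correlation(ffpe, ff):
    """Pearson correlation per gene (row) between two paired genes x samples matrices."""
    ffpe = np.asarray(ffpe, dtype=float)
    ff = np.asarray(ff, dtype=float)
    if ffpe.shape != ff.shape:
        raise ValueError("matrices must have the same genes and paired samples")
    # Center each gene's expression within each sample type.
    a = ffpe - ffpe.mean(axis=1, keepdims=True)
    b = ff - ff.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    # Genes that are constant in either matrix have no defined correlation: return NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(den > 0, num / den, np.nan)

# Toy data (made up): 3 genes x 4 paired samples.
ffpe_toy = [[1, 2, 3, 4], [4, 3, 2, 1], [5, 5, 5, 5]]
ff_toy = [[2, 4, 6, 8], [1, 2, 3, 4], [1, 2, 3, 4]]
r = genewise_correlation(ffpe_toy, ff_toy)
median_r = np.nanmedian(r)
```

Returning NaN for constant genes, and summarizing with NaN-aware functions such as `np.nanmedian`, avoids silently treating undefined correlations as zero.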
Figure 6. Gene-wise correlations between normalized FFPE and RNAlater for ccRCC data on 18,458 protein-coding genes.
Summary of differential expression analysis based on different normalization methods from the ccRCC data.
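| Method | DE genes (FFPE) | DE genes (RNAlater) | Common DE genes | Common among top 20 DE genes |
| SMIXnorm | 1,490 | 1,486 | 1,023 | 13 |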
| MIXnorm | 1,488 | 1,482 | 1,036 | 13 |
| DESeq | 1,014 | 951 | 680 | 7 |
| RPM | 999 | 926 | 676 | 9 |
| TMM | 1,073 | 1,067 | 632 | 7 |
| PS | 1,001 | 1,300 | 652 | 8 |
| UQ | 1,002 | 943 | 679 | 8 |
| Original | 1,041 | 1,096 | 646 | 9 |
The second column is the number of DE genes identified from the FFPE data; the third column is the number of DE genes identified from the RNAlater data; the fourth column is the number of common genes between the two sets of DE genes; the last column is the number of common genes among the two sets of top 20 DE genes from FFPE and RNAlater.
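The four counts described in the note above can be derived from two ranked lists of differentially expressed (DE) genes. The sketch below shows one way to do this with set intersections; it assumes each list is already sorted from most to least significant, and the function name `de_overlap`, the dictionary keys, and the toy rankings are hypothetical (the gene symbols are borrowed from this record, but their order here is made up).

```python
def de_overlap(ffpe_ranked, rnalater_ranked, top_n=20):
    """Summarize agreement between two ranked lists of DE gene names.

    Assumption: each list is ordered from most to least significant.
    """
    common_all = set(ffpe_ranked) & set(rnalater_ranked)
    common_top = set(ffpe_ranked[:top_n]) & set(rnalater_ranked[:top_n])
    return {
        "n_ffpe": len(ffpe_ranked),
        "n_rnalater": len(rnalater_ranked),
        "n_common": len(common_all),
        "n_common_top": len(common_top),
    }

# Toy rankings (made up), with top_n reduced to 2 to keep the example small.
ffpe_toy_de = ["CA9", "UMOD", "GP2", "KNG1"]
rnalater_toy_de = ["UMOD", "CA9", "AQP2", "GP2"]
summary = de_overlap(ffpe_toy_de, rnalater_toy_de, top_n=2)
```

In this toy case three genes are shared overall and both top-2 genes coincide, which mirrors how the third and last count columns of the table are defined.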
Shared DE genes among the two sets of top 20 DE genes from FFPE and RNAlater samples in the ccRCC data, ordered by the absolute value of the SMIXnorm normalized RNAlater log2 FC.
| CA9 | 8.02 | 8.04 | 5.66 | 5.66 |
| SLC6A3 | 7.20 | 7.22 | 6.30 | 6.31 |
| NDUFA4L2 | 6.38 | 6.39 | 4.88 | 4.89 |
| UMOD | −6.17 | −6.15 | −5.63 | −5.62 |
| GP2 | −5.53 | −5.51 | −4.96 | −4.96 |
| CLCNKA | −5.29 | −5.28 | −5.70 | −5.69 |
| CDCA2 | 5.21 | 5.23 | 5.04 | 5.05 |
| TNFAIP6 | 5.16 | 5.17 | 5.45 | 5.45 |
| SLC4A11 | −5.10 | −5.08 | −5.23 | −5.22 |
| KNG1 | −5.04 | −5.02 | −5.04 | −5.03 |
| SLC12A1 | −4.96 | −4.95 | −4.90 | −4.89 |
| AQP2 | −4.94 | −4.92 | −4.90 | −4.89 |
| NELL1 | −4.79 | −4.77 | −5.03 | −5.02 |