Marcel Smid, Robert R J Coebergh van den Braak, Harmen J G van de Werken, Job van Riet, Anne van Galen, Vanja de Weerd, Michelle van der Vlugt-Daane, Sandra I Bril, Zarina S Lalmahomed, Wigard P Kloosterman, Saskia M Wilting, John A Foekens, Jan N M IJzermans, John W M Martens, Anieta M Sieuwerts.
Abstract
BACKGROUND: Current normalization methods for RNA-sequencing data allow either for intersample comparison to identify differentially expressed (DE) genes or for intrasample comparison for the discovery and validation of gene signatures. Most studies on optimization of normalization methods use simulated data to validate methodologies. We describe a new method, GeTMM, which allows for both inter- and intrasample analyses with the same normalized data set. We used actual (i.e. not simulated) RNA-seq data from 263 colon cancers (no biological replicates) and used the same read count data to compare GeTMM with the most commonly used normalization methods (i.e. TMM (used by edgeR), RLE (used by DESeq2) and TPM) with respect to distributions, effect of RNA quality, subtype-classification, recurrence score, recall of DE genes and correlation to RT-qPCR data.
Keywords: Colorectal Cancer; DESeq2; GeTMM; Normalization methods; RNA sequencing; TPM; edgeR
Year: 2018 PMID: 29929481 PMCID: PMC6013957 DOI: 10.1186/s12859-018-2246-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Normalization using the GeTMM method, with n = number of genes and i = a given gene i
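The formula shown in Fig. 1 is not reproduced in this record. As a minimal sketch, assuming GeTMM takes reads per kilobase (RPK) per gene and then applies an edgeR-style counts-per-million scaling in which the library size is the RPK total multiplied by a TMM factor, the calculation for one sample could look as follows. The function names and the toy numbers are hypothetical, and the TMM factor is assumed to be computed elsewhere (for example with edgeR); it is not derived here.

```python
def rpk(counts, lengths_bp):
    """Reads per kilobase: raw read count divided by gene length in kb."""
    return [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM for one sample: RPK rescaled so that the values sum to one million."""
    values = rpk(counts, lengths_bp)
    total = sum(values)
    return [v / total * 1e6 for v in values]


def getmm_like(counts, lengths_bp, tmm_factor):
    """GeTMM-style value for one sample (assumed form, see lead-in).

    tmm_factor is a precomputed TMM normalization factor for this sample;
    with tmm_factor == 1.0 the result is identical to TPM.
    """
    values = rpk(counts, lengths_bp)
    effective_size = sum(values) * tmm_factor
    return [v / effective_size * 1e6 for v in values]


# Toy data (illustrative only): three genes in one sample.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 3000]

tpm_values = tpm(counts, lengths_bp)
getmm_values = getmm_like(counts, lengths_bp, tmm_factor=2.0)
```

Under this assumed form, the only difference from TPM within a sample is a constant per-sample factor, which is why gene ranks inside a sample are unchanged while values become comparable between samples.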
Fig. 2 Density plot by normalization method. Each line corresponds to the distribution of expression levels in a sample. The x-axis shows log2 of read counts. Panels a-f show, respectively, the distribution without normalization and after normalization by each method, as indicated
Fig. 3 Correlation and RMSE to RT-qPCR data of 30 genes. a Correlation coefficients (x-axis) and b RMSE (x-axis) of 30 genes, comparing RNA-seq normalization methods to RT-qPCR generated data
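The two per-gene metrics named in this caption can be sketched as below. This is an illustration only: the caption does not state which correlation coefficient was used, so Pearson is an assumption here, and both inputs are assumed to be on a comparable (for example log-transformed and scaled) scale before RMSE is computed. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root-mean-square error between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Toy values (illustrative only): one gene measured in three samples.
rnaseq_gene = [1.0, 2.0, 3.0]
qpcr_gene = [1.0, 2.0, 5.0]

r = pearson(rnaseq_gene, qpcr_gene)
error = rmse(rnaseq_gene, qpcr_gene)
```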
Fig. 4 Boxplots of read counts per exon. a shows the expression levels in read counts per 100 bp for each exon in CDK1 (NB: no additional normalization was performed). The whiskers extend to 1.5 IQR (interquartile range) above the third, or below the first, quartile, with the median indicated by a horizontal line in the box. The notch indicates the 95% confidence interval of the median. b shows the same data for the MKI67 gene
Fig. 5 Violin plots of rank correlation by method. Spearman rank correlation coefficients of 263 samples, obtained by correlating each method with RT-qPCR generated data
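A per-sample Spearman coefficient of the kind plotted here can be sketched as the Pearson correlation of average ranks. The sketch assumes one coefficient per sample, computed across the genes measured by both platforms; the helper names and toy values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Toy values (illustrative only): four genes within a single sample.
rnaseq_sample = [5.0, 40.0, 12.0, 300.0]
qpcr_sample = [0.2, 3.1, 0.9, 8.5]
rho = spearman(rnaseq_sample, qpcr_sample)
```

Because only ranks matter, any normalization that rescales a whole sample by a constant leaves this intrasample coefficient unchanged, whereas gene length correction can change it.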
Fig. 6 Bland-Altman plots comparing samples with high and low RIN values. a-d: for each normalization method, a group of 76 samples with low RIN values (< 7) was used to correlate expression data of 30 genes to RT-qPCR generated data. The same was performed for an equally sized high RIN sample group (> 9) and the correlation coefficients were compared. The x-axis shows the mean correlation, the y-axis the difference (high RIN – low RIN). The blue line indicates the bias (mean of all differences), the dashed light-blue lines show the 95% limits of agreement, and the dashed black line at zero is the identity line (indicating no difference). The p-value is derived from a one-sample t-test
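The quantities drawn in a Bland-Altman plot can be sketched as follows. This is a minimal illustration, assuming the limits of agreement are bias ± 1.96 standard deviations of the differences and that the one-sample t-test asks whether the mean difference is zero. Only the t statistic is computed here; a p-value would additionally need the t distribution with n − 1 degrees of freedom (for example from a statistics library). The function name and toy values are hypothetical.

```python
import math
import statistics


def bland_altman(high, low):
    """Return plot coordinates and summary values for paired measurements.

    high, low: per-gene correlation coefficients in the high-RIN and
    low-RIN sample groups, in the same gene order.
    """
    diffs = [h - l for h, l in zip(high, low)]          # y-axis
    means = [(h + l) / 2.0 for h, l in zip(high, low)]  # x-axis
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)                        # sample SD (n - 1)
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    t_stat = bias / (sd / math.sqrt(len(diffs)))        # H0: mean diff = 0
    return means, diffs, bias, limits, t_stat


# Toy coefficients (illustrative only): four genes.
high_rin = [0.9, 0.8, 0.7, 0.6]
low_rin = [0.8, 0.8, 0.6, 0.6]
means, diffs, bias, limits, t_stat = bland_altman(high_rin, low_rin)
```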
Fig. 7 Number of DE genes between left- and right-sided tumors per normalization method. RT-qPCR generated data were used as the benchmark, showing 8 genes with FDR < 0.05 (dark-grey) and 22 genes with FDR > 0.05 (black). For the RNA-seq normalization methods, black indicates true negatives (FDR > 0.05, matching RT-qPCR), white indicates false positives (FDR < 0.05, not matching RT-qPCR), grey indicates true positives (FDR < 0.05, matching RT-qPCR) and light-grey indicates false negatives (FDR > 0.05, not matching RT-qPCR)
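The four categories in this caption amount to comparing each RNA-seq call with the RT-qPCR benchmark at an FDR threshold of 0.05. A minimal sketch of that bookkeeping is given below; the function names are hypothetical, the FDR values are made up for illustration and are not study results, and a strict `<` at the threshold is an assumption.

```python
def classify_calls(fdr_rnaseq, fdr_reference, alpha=0.05):
    """Tally TP/FP/TN/FN for RNA-seq DE calls against a reference method."""
    tally = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for gene, q_ref in fdr_reference.items():
        called = fdr_rnaseq[gene] < alpha   # DE according to RNA-seq
        expected = q_ref < alpha            # DE according to the benchmark
        if called and expected:
            tally["TP"] += 1
        elif called:
            tally["FP"] += 1
        elif expected:
            tally["FN"] += 1
        else:
            tally["TN"] += 1
    return tally


def recall(tally):
    """Fraction of benchmark-positive genes that were also called."""
    positives = tally["TP"] + tally["FN"]
    return tally["TP"] / positives if positives else float("nan")


# Made-up FDR values for four hypothetical genes.
reference = {"geneA": 0.01, "geneB": 0.20, "geneC": 0.03, "geneD": 0.60}
rnaseq = {"geneA": 0.02, "geneB": 0.04, "geneC": 0.30, "geneD": 0.90}

tally = classify_calls(rnaseq, reference)
```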
Fig. 8 Violin plots of the recurrence score. The Oncotype DX® Recurrence Score (RS) of 263 samples by method
Predicted CMS group by normalization method
| | GeTMM CMS1 | GeTMM CMS2 | GeTMM CMS3 | GeTMM CMS4 | GeTMM Mixed/indeterminate | Total |
|---|---|---|---|---|---|---|
| CMS1 | 46 | 0 | 0 | 0 | 7 | 53 |
| CMS2 | 0 | 127 | 0 | 0 | 5 | 132 |
| CMS3 | 0 | 0 | 23 | 0 | 0 | 23 |
| CMS4 | 0 | 1 | 0 | 5 | 4 | 10 |
| Mixed/indeterminate | 3 | 14 | 6 | 0 | 22 | 45 |
| Total | 49 | 142 | 29 | 5 | 38 | 263 |
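The row and column totals of this table determine the concordant (diagonal) counts, so they can be recomputed as a consistency check. The sketch below derives each diagonal cell from its row total, verifies the result against the column totals, and reports the overall agreement between the two classifications. Variable and function names are hypothetical; the numbers are the off-diagonal counts and totals from the table.

```python
labels = ["CMS1", "CMS2", "CMS3", "CMS4", "Mixed/indeterminate"]

# Off-diagonal counts from the table; None marks a diagonal cell that is
# derived from the totals. Columns are the GeTMM calls.
rows = [
    [None, 0, 0, 0, 7],
    [0, None, 0, 0, 5],
    [0, 0, None, 0, 0],
    [0, 1, 0, None, 4],
    [3, 14, 6, 0, None],
]
row_totals = [53, 132, 23, 10, 45]
col_totals = [49, 142, 29, 5, 38]


def fill_diagonal(rows, row_totals):
    """Set each diagonal cell to its row total minus the off-diagonal sum."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(filled):
        row[i] = row_totals[i] - sum(v for v in row if v is not None)
    return filled


table = fill_diagonal(rows, row_totals)
col_sums = [sum(row[j] for row in table) for j in range(len(labels))]
diagonal = [table[i][i] for i in range(len(labels))]

# Share of samples assigned to the same group by both methods.
agreement = sum(diagonal) / sum(row_totals)
```

The derived diagonal reproduces every column total, and the agreement works out to 223 of 263 samples (about 85%).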
Summary of results
| Normalization Method | Gene length correction | Distribution per sample | Influence of RIN on correlation | Intersample correlation | Intrasample correlation |
|---|---|---|---|---|---|
| RLE (DESeq2) | no | bimodal | no bias | ++ | + |
| TMM (edgeR) | no | bimodal | no bias | ++ | + |
| TPM | yes | normal | bias | – | ++ |
| GeTMM | yes | normal | no bias | ++ | ++ |
A ‘–’ indicates relatively poor performance for the given criterion; increasing performance is indicated by ‘+’ and ‘++’