Jianqiang Sun, Tomoaki Nishiyama, Kentaro Shimizu, Koji Kadota.
Abstract
BACKGROUND: Differential expression analysis based on "next-generation" sequencing technologies is a fundamental means of studying RNA expression. We recently developed a multi-step normalization method (called TbT) for two-group RNA-seq data with replicates and demonstrated that the statistical methods available in four R packages (edgeR, DESeq, baySeq, and NBPSeq) together with TbT can produce a well-ranked gene list in which true differentially expressed genes (DEGs) are top-ranked and non-DEGs are bottom-ranked. However, the advantages of the current TbT method come at the cost of a huge computation time. Moreover, the R packages did not have normalization methods based on such a multi-step strategy.
Year: 2013 PMID: 23837715 PMCID: PMC3716788 DOI: 10.1186/1471-2105-14-219
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. DEGES-based analysis pipelines in TCC. (a) Main functions for obtaining DE results from tag count data in individual packages (edgeR, DESeq, baySeq, and TCC). The analysis pipelines of the packages can be roughly divided into two steps after importing the input data (black squares): (i) calculating normalization factors (blue solid squares) and (ii) estimating degrees of DE for each gene (blue dashed squares). (b) Outline of the DEGES-based normalization methods implemented in TCC. The key concept of DEGES in our calcNormFactors function is to remove data flagged as potential DEGs in step 2 before calculating normalization factors in step 3. Note that steps 2 and 3 in DEGES can be performed repeatedly to obtain more robust normalization factors, and the function accepts the number of iterations n as an argument (i.e., n = 0 ~ 100).
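The three-step idea in panel (b) can be illustrated with a minimal Python sketch. It is not the TCC implementation: a plain library-size scaling stands in for TMM, a ranking by absolute log fold change stands in for the statistical tests of edgeR, DESeq, or baySeq, and the names `norm_factors`, `flag_potential_degs`, and `deges` are hypothetical.

```python
from math import log
from statistics import mean

def norm_factors(counts, keep):
    """Step 1/3: per-sample scaling factors from the genes in `keep`.
    Assumption: simple total-count scaling (mean factor = 1), not TMM."""
    n_samples = len(counts[0])
    totals = [sum(counts[g][s] for g in keep) for s in range(n_samples)]
    avg = mean(totals)
    return [t / avg for t in totals]

def flag_potential_degs(counts, groups, factors, fraction):
    """Step 2: flag the top `fraction` of genes ranked by |log fold change|
    of normalized group means (a stand-in for a real DE test)."""
    def abs_lfc(row):
        norm = [c / f for c, f in zip(row, factors)]
        g1 = mean(v for v, g in zip(norm, groups) if g == 1)
        g2 = mean(v for v, g in zip(norm, groups) if g == 2)
        return abs(log((g1 + 1) / (g2 + 1)))  # +1 avoids log(0)
    ranked = sorted(range(len(counts)), key=lambda g: abs_lfc(counts[g]),
                    reverse=True)
    return set(ranked[:round(fraction * len(counts))])

def deges(counts, groups, fraction=0.05, iterations=1):
    """Repeat steps 2 and 3 `iterations` times; 0 gives the plain factors."""
    all_genes = range(len(counts))
    factors = norm_factors(counts, all_genes)            # step 1
    for _ in range(iterations):
        flagged = flag_potential_degs(counts, groups, factors, fraction)
        kept = [g for g in all_genes if g not in flagged]
        factors = norm_factors(counts, kept)             # step 3
    return factors
```

On toy data in which 20% of the genes are up-regulated in G1, the plain factors are inflated for the G1 samples, whereas one round of flagging and re-normalizing brings all factors back to 1.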
Table 1. Average AUC values for simulation data with replicates
| (a) TMM | | | | | | | | | |
| | 90.30 | 90.30 | 90.18 | 89.09 | 89.08 | 88.92 | 89.43 | 89.42 | 89.22 |
| | 90.30 | 90.12 | 89.63 | 89.15 | 88.91 | 88.26 | 89.54 | 89.28 | 88.66 |
| | 90.40 | 89.89 | 88.26 | 89.27 | 88.62 | 86.82 | 89.63 | 89.07 | 87.25 |
| (b) DEGES/TbT | | | | | | | | | |
| | 90.30 | 90.31 | 90.28 | 89.09 | 89.12 | 89.07 | 89.43 | 89.47 | 89.41 |
| | 90.30 | 90.25 | 90.26 | 89.15 | 89.07 | 89.01 | 89.54 | 89.48 | 89.49 |
| | 90.27 | 89.89 | 89.27 | 89.07 | 88.61 | 89.63 | 89.56 | 89.22 | |
| (c) DEGES/edgeR | | | | | | | | | |
| | 90.29 | 90.32 | 90.28 | 89.09 | 89.13 | 89.08 | 89.43 | 89.48 | 89.41 |
| | 90.25 | 90.26 | 89.15 | 89.07 | 89.02 | 89.54 | 89.48 | 89.50 | |
| | 90.40 | 90.28 | 89.91* | 89.27 | 89.08 | 88.64 | 89.63 | 89.56 | 89.24 |
| (d) iDEGES/edgeR | | | | | | | | | |
| | 90.29 | 90.28 | 89.09 | 89.13 | 89.08 | 89.43 | 89.48 | 89.42 | |
| | 90.30 | 89.15 | 89.08 | 89.06 | 89.54 | 89.49 | 89.54 | ||
| | 90.40 | 89.27 | 89.11 | 88.80 | 89.63 | 89.60 | 89.42 | ||
| (e) iDEGES/TDT | | | | | | | | | |
| | 90.32 | 89.09 | 89.12 | 89.09 | 89.43 | 89.48 | 89.42 | ||
| | 90.30 | 90.23 | 90.20 | 89.15 | 89.05 | 88.93 | 89.54 | 89.45 | 89.40 |
| | 90.40 | 90.25 | 89.82 | 89.27 | 89.04 | 88.53 | 89.63 | 89.52 | 89.13 |
| (f) iDEGES/DESeq | | | | | | | | | |
| | 90.30 | 90.31 | 90.28 | 89.09 | 89.13 | 89.09 | 89.43 | 89.48 | 89.43 |
| | 90.30 | 90.24 | 90.21 | 89.15 | 89.05 | 88.95 | 89.54 | 89.45 | 89.42 |
| | 90.40 | 90.26 | 89.86 | 89.27 | 89.05 | 88.58 | 89.63 | 89.53 | 89.18 |
Average AUC values (%) of 100 trials for each simulation condition are shown. Simulation data contain a total of 10,000 genes, of which PDEG% are DEGs; PG1% of those DEGs are expressed at a higher level in G1 than in G2. Each group has three biological replicates (i.e., G1_rep1, G1_rep2, G1_rep3, G2_rep1, G2_rep2, and G2_rep3). A total of nine conditions (three PDEG values × three PG1 values) are shown. The highest AUC value for each condition is in bold. AUC values with asterisks indicate significant improvements (p-value < 0.01, paired t-test) compared with DEGES/TbT. We used a bootstrap resampling size of 10,000 in baySeq when performing the normalization (i.e., baySeq at step 2 in the DEGES/TbT) and 2,000 when performing the DEG identification after normalization (i.e., baySeq in the XXX-baySeq combination).
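For readers unfamiliar with the evaluation metric, the AUC of a ranked gene list can be computed from the rank-sum (Mann-Whitney) identity. The following is a minimal sketch, assuming that a higher score means stronger evidence of DE (so a p-value p would be passed as, e.g., 1 - p); the function name `auc` and its interface are illustrative and are not taken from any of the packages discussed here.

```python
def auc(scores, is_deg):
    """Area under the ROC curve for a scored gene list.
    scores[i]: DE evidence for gene i (higher = more likely DEG).
    is_deg[i]: True if gene i is a true DEG. Ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the block of tied scores
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average 1-based rank of the block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for d in is_deg if d)
    n_neg = len(is_deg) - n_pos
    rank_sum = sum(r for r, d in zip(ranks, is_deg) if d)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A value of 1.0 means every true DEG is ranked above every non-DEG, and 0.5 corresponds to a random ordering; the table reports this quantity as a percentage.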
Table 2. Average computation times for obtaining normalization factors
| (a) TMM | | | |
| | 0.13 | 0.18 | 0.22 |
| | 0.16 | 0.19 | 0.15 |
| | 0.17 | 0.14 | 0.15 |
| (b) DEGES/TbT | | | |
| | 1492.92 | 1423.55 | 1573.23 |
| | 1556.93 | 1510.17 | 1408.46 |
| | 1527.08 | 1483.09 | 1543.80 |
| (c) DEGES/edgeR | | | |
| | 3.04 | 3.08 | 3.24 |
| | 3.08 | 3.05 | 2.89 |
| | 2.99 | 2.92 | 3.05 |
| (d) iDEGES/edgeR | | | |
| | 8.80 | 8.95 | 9.53 |
| | 8.94 | 8.91 | 8.39 |
| | 8.75 | 8.39 | 8.92 |
| (e) iDEGES/TDT | | | |
| | 17.92 | 17.54 | 18.24 |
| | 19.85 | 19.45 | 18.74 |
| | 21.17 | 20.74 | 21.16 |
| (f) iDEGES/DESeq | | | |
| | 17.88 | 17.54 | 18.34 |
| | 19.96 | 19.43 | 18.72 |
| | 21.17 | 20.75 | 21.27 |
Average computation times (in seconds) of 100 trials for the six normalization methods in Table 1 are shown. The results of DEGES/TbT were obtained by using a suggested parameter setting for performing bootstrap resampling (i.e., samplesize = 10000).
Table 3. Results of the iDEGES approach do not necessarily converge
| (a) iDEGES/TbT | | | | |
| convergent | 69 | 70 | 83 | 222 |
| cyclic | 31 | 30 | 17 | 78 |
| (b) iDEGES/edgeR | | | | |
| convergent | 86 | 86 | 79 | 251 |
| cyclic | 14 | 14 | 21 | 49 |
| (c) iDEGES/TDT | | | | |
| convergent | 82 | 81 | 84 | 247 |
| cyclic | 18 | 19 | 16 | 53 |
| (d) iDEGES/DESeq | | | | |
| convergent | 99 | 97 | 100 | 296 |
| cyclic | 1 | 3 | 0 | 4 |
The numbers of convergent and non-convergent (cyclic) results of 100 trials under PDEG = 20% are shown: (a) iDEGES/TbT, (b) iDEGES/edgeR, (c) iDEGES/TDT, and (d) iDEGES/DESeq. We defined a trial as ‘convergent’ if the potential DEGs estimated in the (NC + 1)th iteration were the same as those in the (NC)th iteration, where NC (NC ≥ 1) is the number of iterations n required to obtain the convergent result. We defined a trial as ‘cyclic’ if the potential DEGs estimated in the (i + NP)th iteration were the same as those in the ith iteration, where NP (NP ≥ 2) is the cycle length.
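The two definitions can be made concrete with a short Python sketch that classifies a recorded sequence of potential-DEG sets. It assumes that the set produced at each iteration depends only on the set from the previous iteration, so the first repeated set determines the outcome; the function name `classify_iterations` and its return format are hypothetical.

```python
def classify_iterations(deg_sets):
    """deg_sets[k - 1] holds the potential DEGs estimated in iteration k.
    Returns ('convergent', NC), ('cyclic', NP), or ('undetermined', None)
    when no set repeats within the recorded iterations."""
    first_seen = {}                       # DEG set -> iteration where it first appeared
    for k, degs in enumerate(deg_sets, start=1):
        key = frozenset(degs)
        if key in first_seen:
            period = k - first_seen[key]
            if period == 1:               # same set in consecutive iterations
                return ("convergent", first_seen[key])
            return ("cyclic", period)     # set recurs after NP >= 2 iterations
        first_seen[key] = k
    return ("undetermined", None)
```

For example, a run whose second and third iterations flag the same genes is reported as convergent with NC = 2, while a run that alternates between two gene sets is reported as cyclic with NP = 2.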
Table 4. AUC values for an RDR6 knockout dataset
| TMM | 68.36 | 61.09 | 60.21 |
| DEGES/TbT | 66.70 | 61.40 | 60.09 |
| DEGES/edgeR | 69.28 | 61.27 | 60.31 |
| iDEGES/edgeR | 65.70 | 61.34 | 60.86 |
| iDEGES/TDT | 65.76 | 60.84 | 60.19 |
| iDEGES/DESeq | 64.88 | 60.45 | 59.20 |
The AUC values (%) for a total of 18 combinations with default setting (i.e., a floor PDEG of 5%) are shown.
Figure 2. Effect of iterations for an RDR6 knockout dataset. (a) AUC values for four XXX-edgeR combinations with different iteration numbers (n = 0 ~ 30) when using a default floor PDEG value (= 5%) are shown: XXX = iDEGES/TbT (red lines), iDEGES/edgeR (black lines), iDEGES/TDT (blue lines), and iDEGES/DESeq (light blue lines). The AUC values after convergence were 69.64% for iDEGES/TbT at the 3rd iteration and 64.88% for iDEGES/DESeq at the 2nd iteration. The maximum and minimum values among cycles for iDEGES/edgeR were 67.37% and 64.57%, respectively. Note that the AUC values for three pipelines (iDEGES/TbT, iDEGES/edgeR, and iDEGES/TDT) are the same (= 68.36%) when n = 0 because those pipelines correspond to XXX = TMM (i.e., the TMM-edgeR combination). The first cycle for the non-convergent (P) results is indicated by a brace. (b) AUC values when using a 10% floor PDEG value are shown. The NP values for iDEGES/TbT, iDEGES/edgeR, and iDEGES/TDT were 8, 3, and 3, respectively.
Figure 3. Dendrogram of average-linkage hierarchical clustering for a dataset. Dendrograms for a total of 26 ranked gene lists when using a default floor PDEG value setting (= 5%) are shown. The correlation coefficient is used as a similarity metric; the left-hand scale represents (1 - correlation coefficient). Each gene list is denoted as an “XXX-YYY” combination (the normalization method XXX followed by the DEG identification method YYY). Gene lists having iteration numbers on the left side correspond to cyclic results.
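The clustering behind such a dendrogram can be sketched in a few lines of Python. This is a naive illustration and not the code used to draw the figure: it assumes each gene list is given as a vector of per-gene ranks, uses the Pearson correlation of those ranks as the similarity, and merges clusters by average linkage on the (1 - correlation coefficient) distance. The names `pearson` and `average_linkage` are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_linkage(rankings):
    """rankings: dict mapping a list name (e.g. 'TMM-edgeR') to gene ranks.
    Returns the merge history [(cluster_a, cluster_b, height), ...],
    where clusters are tuples of names and height is the linkage distance."""
    names = list(rankings)
    dist = {frozenset((a, b)): 1 - pearson(rankings[a], rankings[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

    def linkage(c1, c2):
        # average of all pairwise distances between members of c1 and c2
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        pairs = [(linkage(c1, c2), c1, c2)
                 for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]]
        height, c1, c2 = min(pairs, key=lambda p: p[0])
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, height))
    return merges
```

The merge heights correspond to the left-hand scale of the dendrogram: two gene lists with nearly identical rankings join near 0, and lists with opposite rankings join near 2.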