Cyril Filloux, Meersseman Cédric, Philippe Romain, Forestier Lionel, Klopp Christophe, Rocha Dominique, Maftah Abderrahman, Petit Daniel.
Abstract
BACKGROUND: Transcriptome sequencing is a powerful tool for measuring gene expression but, as with other technologies, various artifacts and biases affect quantification. To correct some of them, several normalization approaches have emerged, differing both in the statistical strategy employed and in the type of bias corrected. However, there is no clear standard normalization method.
Year: 2014 PMID: 24929920 PMCID: PMC4067528 DOI: 10.1186/1471-2105-15-188
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Method implemented to correct the biases linked to transcript size. Size classes were built every 100 bp for transcripts < 5000 bp; too few transcripts with a size ≥ 5000 bp were observed, which led to scattered dots. The dotted line separates the red regression line, corresponding to transcripts < 600 bp, from the blue one, corresponding to transcripts ≥ 600 bp. The vertical axis corresponds to log-transformed values of read numbers. A) Cattle sample 1479. B) Drosophila sample SRR384925 (7132 genes with a single transcript). C) Human sample SRR1177729 (16,228 genes with a single transcript). D) Representation of sample 1479 after correction (green). E) RNA-Seq simulated data using rlsim, where Loess smoothing was applied to each series. The blue points correspond to a run where the sequenced fragments are in the range of 250–450 bp, the red ones to the range of 450–650 bp, and the violet ones to the range of 650–850 bp.
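The size-class binning and two-segment regression described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`size_class_means`, `fit_line`, `fit_two_segments`), the natural logarithm, the use of class midpoints as x values, and ordinary least squares are all assumptions. Only the 100 bp class width, the 5000 bp cutoff, and the 600 bp breakpoint come from the caption.

```python
import math
from collections import defaultdict

CLASS_WIDTH = 100   # bp, class width given in the caption
MAX_SIZE = 5000     # bp, transcripts at or above this size are left out
BREAKPOINT = 600    # bp, boundary between the two regression lines


def size_class_means(transcripts):
    """Average log read numbers within 100 bp size classes.

    `transcripts` is an iterable of (size_bp, read_count) pairs.
    Returns a sorted list of (class_midpoint_bp, mean_log_reads).
    """
    classes = defaultdict(list)
    for size, reads in transcripts:
        if size < MAX_SIZE and reads > 0:  # log is undefined for 0 reads
            classes[size // CLASS_WIDTH].append(math.log(reads))
    return sorted(
        (idx * CLASS_WIDTH + CLASS_WIDTH / 2, sum(v) / len(v))
        for idx, v in classes.items()
    )


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_two_segments(transcripts):
    """Fit one line below the 600 bp breakpoint and one at or above it."""
    means = size_class_means(transcripts)
    short = [p for p in means if p[0] < BREAKPOINT]
    long_ = [p for p in means if p[0] >= BREAKPOINT]
    return fit_line(short), fit_line(long_)
```

Because the class width divides the breakpoint evenly, no size class straddles 600 bp, so each class mean contributes to exactly one of the two lines.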
Comparison between FPKM and SGTR methods according to transcript size
| All genes | 159 | 1475 | 1.07E-15 | 7.96E-39 | 1.71E-39 | 4.80E-39 |
| Size < 1,000 bp | 9 | 1475 | 9.59E-03 | 2.15E-02 | 1.58E-02 | 2.62E-02 |
| 1,000 – 2,000 bp | 63 | 1475 | 1.86E-06 | 6.39E-18 | 4.71E-17 | 2.56E-17 |
| 2,000 – 3,000 bp | 49 | 1475 | 2.81E-07 | 3.55E-14 | 1.75E-14 | 2.47E-14 |
| 3,000 – 4,000 bp | 28 | 1475 | 3.25E-04 | 1.86E-06 | 1.86E-06 | 3.01E-06 |
| Size > 4,000 bp | 10 | 1475 | 1.56E-03 | 5.82E-03 | 6.95E-03 | 5.42E-03 |
| All genes | 155 | 1455 | 8.84E-16 | 2.95E-39 | 5.86E-39 | 3.40E-39 |
| Size < 1,000 bp | 9 | 1455 | 3.84E-02 | 1.03E-01 | 8.34E-02 | 1.05E-01 |
| 1,000 – 2,000 bp | 60 | 1455 | 9.81E-07 | 1.57E-18 | 2.00E-17 | 4.19E-18 |
| 2,000 – 3,000 bp | 50 | 1455 | 3.48E-09 | 9.07E-16 | 8.60E-16 | 1.43E-15 |
| 3,000 – 4,000 bp | 26 | 1455 | 5.60E-05 | 2.80E-07 | 2.99E-07 | 4.50E-07 |
| Size > 4,000 bp | 10 | 1455 | 4.44E-03 | 2.26E-03 | 2.99E-03 | 1.96E-03 |
| All genes | 162 | 1479 | 4.63E-14 | 8.24E-44 | 1.37E-43 | 2.02E-49 |
| Size < 1,000 bp | 9 | 1479 | 7.54E-02 | 1.55E-01 | 1.18E-01 | 9.34E-02 |
| 1,000 – 2,000 bp | 62 | 1479 | 1.75E-08 | 1.37E-16 | 6.87E-16 | 5.11E-18 |
| 2,000 – 3,000 bp | 53 | 1479 | 1.68E-05 | 1.64E-19 | 1.64E-19 | 9.64E-22 |
| 3,000 – 4,000 bp | 29 | 1479 | 1.67E-05 | 2.80E-09 | 3.08E-09 | 4.36E-10 |
| Size > 4,000 bp | 9 | 1479 | 1.33E-03 | 2.81E-05 | 5.39E-05 | 1.04E-04 |
| All genes | 152 | 1345 | 1.16E-14 | 1.57E-42 | 1.86E-42 | 6.11E-44 |
| Size < 1,000 bp | 9 | 1345 | 3.83E-02 | 9.84E-02 | 6.95E-02 | 7.98E-02 |
| 1,000 – 2,000 bp | 58 | 1345 | 7.93E-08 | 6.39E-18 | 6.85E-17 | 2.43E-18 |
| 2,000 – 3,000 bp | 50 | 1345 | 2.04E-05 | 8.39E-18 | 5.17E-18 | 1.74E-18 |
| 3,000 – 4,000 bp | 26 | 1345 | 5.08E-03 | 7.67E-07 | 7.87E-07 | 1.01E-06 |
| Size > 4,000 bp | 9 | 1345 | 6.83E-04 | 1.62E-04 | 2.59E-04 | 4.44E-04 |
| All genes | 162 | 1476 | 2.73E-15 | 1.73E-41 | 6.74E-41 | 1.51E-44 |
| Size < 1,000 bp | 9 | 1476 | 5.12E-02 | 6.26E-02 | 4.52E-02 | 5.10E-02 |
| 1,000 – 2,000 bp | 62 | 1476 | 3.38E-08 | 7.54E-17 | 4.72E-16 | 4.92E-18 |
| 2,000 – 3,000 bp | 53 | 1476 | 1.55E-05 | 2.56E-17 | 3.53E-17 | 1.89E-18 |
| 3,000 – 4,000 bp | 29 | 1476 | 3.44E-04 | 6.14E-07 | 7.13E-07 | 3.47E-07 |
| Size > 4,000 bp | 9 | 1476 | 9.18E-04 | 1.44E-04 | 2.76E-04 | 5.51E-04 |
N corresponds to the number of analyzed genes. The five samples (1475, 1455, 1479, 1345, and 1476) refer respectively to samples with a total read number around 10 × 10^6, 13 × 10^6, 20 × 10^6, 24 × 10^6, and 30 × 10^6 reads. Abbreviations: SGTR size: correction for transcript size; and SGTR Size and GC content: correction for transcript size and GC content. Only the p-values of Pearson correlation with qRT-PCR quantifications are indicated. p-values were calculated using the Past3 program [24].
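The tables report p-values of the Pearson correlation between normalized RNA-Seq values and qRT-PCR quantifications. The authors computed them with Past3; the sketch below only illustrates the underlying statistic in plain Python. The function names are hypothetical. It stops at the t statistic because converting it to a p-value requires the Student t distribution with n - 2 degrees of freedom, which the Python standard library does not provide.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length, n >= 3")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0 'no correlation' (n - 2 degrees of freedom).

    Undefined when |r| == 1, where the denominator is zero.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

In practice a statistics package would normally be used instead; for example, `scipy.stats.pearsonr` returns both the coefficient and a two-sided p-value.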
Figure 2. Relationships between RNA-Seq normalization methods and qRT-PCR quantifications (Cattle sample 1479). A) FPKM corrected values. B) log(FPKM) corrected values. C) SGTR corrected values including size and GC content bias correction.
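For reference, FPKM (fragments per kilobase of transcript per million mapped fragments) is conventionally defined as shown here. The symbols are chosen for illustration and do not come from the source: $C_i$ is the number of fragments mapped to transcript $i$, $L_i$ is the transcript length in bp, and $N$ is the total number of mapped fragments in the sample.

$$\mathrm{FPKM}_i = \frac{10^{9}\, C_i}{N \, L_i}$$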
Figure 3. Method implemented to correct GC content biases. A, D) Variations in size-corrected mean read numbers according to GC content; the polynomial equations are indicated above (A: Cattle sample 1479, and D: Drosophila sample SRR384925). B, E) Application of the previous equation (Eq.3) to differences between 50% GC content and each GC content value, giving the equation indicated above (Eq.4) (B: Cattle sample 1479, and E: Drosophila sample SRR384925). C, F) Effect of GC content bias correction on the whole dataset. No remaining dependence can be observed: the p-value for the third-order polynomial fit is 1.00 (C: Cattle sample 1479, and F: Drosophila sample SRR384925).
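One possible reading of the GC correction in the Figure 3 caption is sketched below: a third-order polynomial describes how the size-corrected mean log read number varies with GC content, and each value is shifted by the difference between the polynomial at its own GC content and at 50% GC. The function names and the subtraction on the log scale are assumptions. Eq.3 and Eq.4 are not reproduced in this text, so the sketch takes the polynomial coefficients as an input rather than supplying any.

```python
REFERENCE_GC = 50.0  # percent, reference GC content named in the caption


def cubic(coeffs, gc):
    """Evaluate a*gc^3 + b*gc^2 + c*gc + d for coeffs = (a, b, c, d)."""
    a, b, c, d = coeffs
    return ((a * gc + b) * gc + c) * gc + d


def gc_offset(coeffs, gc):
    """Difference between the fitted curve at `gc` and at 50% GC."""
    return cubic(coeffs, gc) - cubic(coeffs, REFERENCE_GC)


def correct_for_gc(log_reads, gc, coeffs):
    """Remove the fitted GC trend from a size-corrected log read number."""
    return log_reads - gc_offset(coeffs, gc)
```

With this convention a transcript at exactly 50% GC is left unchanged, and values that follow the fitted curve are all mapped to the curve's value at 50% GC.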
Figure 4. Correlation between regression parameters and total read numbers (Cattle sample 1479). A) Slope for transcripts < 600 bp. B) Slope for transcripts ≥ 600 bp. C) Constant for transcripts ≥ 600 bp. The equations are indicated below the regression lines.
Correction of the impact of total read numbers
| All genes | 159 | 1475 | 7.96E-39 | 1.71E-39 | 4.80E-39 | 1.08E-38 |
| All genes | 155 | 1455 | 2.95E-39 | 5.86E-39 | 3.40E-39 | 2.66E-39 |
| All genes | 162 | 1479 | 8.24E-44 | 1.37E-43 | 2.02E-49 | 1.21E-49 |
| All genes | 152 | 1345 | 1.57E-42 | 1.86E-42 | 6.11E-44 | 5.64E-44 |
| All genes | 162 | 1476 | 1.73E-41 | 6.74E-41 | 1.51E-44 | 2.28E-44 |
N corresponds to the number of analyzed genes. The five samples (1475, 1455, 1479, 1345, and 1476) refer respectively to samples with a total read number around 10 × 10^6, 13 × 10^6, 20 × 10^6, 24 × 10^6, and 30 × 10^6 reads. Abbreviations: SGTR size: correction for transcript size; SGTR Size and GC content: correction for transcript size and GC content; and Full SGTR: correction for transcript size, total read number, and GC content. Only the p-values of Pearson correlation with qRT-PCR quantifications are indicated.