| Literature DB >> 26919855 |
Seth Raithel1, Loretta Johnson2, Matthew Galliart3, Sue Brown4, Jennifer Shelton5, Nicolae Herndon6, Nora M Bello7.
Abstract
BACKGROUND: Differential expression (DE) analysis of RNA-seq data still poses inferential challenges, such as handling of transcripts characterized by low expression levels. In this study, we use a plasmode-based approach to assess the relative performance of alternative inferential strategies on RNA-seq transcripts, with special emphasis on transcripts characterized by a small number of read counts, so-called low-count transcripts, as motivated by an ecological application in prairie grasses. Big bluestem (Andropogon gerardii) is a wide-ranging dominant prairie grass of ecological and agricultural importance to the US Midwest while edaphic subspecies sand bluestem (A. gerardii ssp. Hallii) grows exclusively on sand dunes. Relative to big bluestem, sand bluestem exhibits qualitative phenotypic divergence consistent with enhanced drought tolerance, plausibly associated with transcripts of low expression levels. Our dataset consists of RNA-seq read counts for 25,582 transcripts (60% of which are classified as low-count) collected from leaf tissue of individual plants of big bluestem (n = 4) and sand bluestem (n = 4). Focused on low-count transcripts, we compare alternative ad-hoc data filtering techniques commonly used in RNA-seq pipelines and assess the inferential performance of recently developed statistical methods for DE analysis, namely DESeq2 and edgeR robust. These methods attempt to overcome the inherently noisy behavior of low-count transcripts by either shrinkage or differential weighting of observations, respectively.Entities:
Mesh:
Substances:
Year: 2016 PMID: 26919855 PMCID: PMC4769568 DOI: 10.1186/s12864-016-2442-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1Partitioning of high-count transcripts and low-count transcripts. Cumulative percentage of total read counts (y-axis) as a function of cumulative percentage of transcripts (x-axis), starting on the left with transcripts of highest read counts. Solid colored lines indicate the 3rd percentile most highly expressed transcripts (red line) and the 60th percentile least expressed transcripts (blue line), thereby defining high-count transcripts (light red to the left of the vertical red line) and low-count transcripts (light blue to the right of the vertical blue line), respectively
Number of transcripts in the dataset
| All transcripts | High-count transcripts | Low-count transcripts | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Total | SB-only | BB-only | Total | SB-only | BB-only | Total | SB-only | BB-only | |
| No filter | 25,582 | 323 | 132 | 831 | 0 | 0 | 14,588 | 323 | 132 |
| RP filtering | 25,453 | 323 | 132 | 831 | 0 | 0 | 14,588 | 323 | 132 |
| CPM filtering | 14,848 | 0 | 0 | 828 | 0 | 0 | 4308 | 0 | 0 |
The table contains the number of total transcripts, high-count transcripts and low-count transcripts available for differential expression analyses in the complete dataset (i.e. no data filtering applied) or following data filtering based on a reads-present (RP) criterion or a counts per million (CPM) criterion. Also listed are number of transcripts with expression levels present in sand bluestem and absent in big bluestem (SB-only transcripts), and transcripts with expression levels present in big bluestem and absent in sand bluestem (BB-only transcripts)
Classification rules to compute performance metrics
| Transcripts not DE | Transcripts “spiked” for DE | Total | |
|---|---|---|---|
| Transcripts not declared significantly DE | TN | FN | R0 |
| Transcripts declared significantly DE | FP | TP | R1 |
| Total | S0 | S1 | G |
FP number of false positives (transcripts in S0 set declared differentially expressed); TP number of true positives (transcripts in S1 set declared differentially expressed); TN number of true negatives; FN number of false negatives; FPR, false positive rate = FP/S0; TPR true positive rate or power = TP/S1; PPV positive predictive value or precision = TP/R1; NPV negative predictive value = TN/R0; accuracy = (TP + TN)/G
Estimated false positive rates (FPR) on null plasmodes
| edgeR robust DF = 50 | edgeR robust DF = 10 | edgeR robust DF = 4 | edgeR robust | edgeR classic | DESEQ2 |
|---|---|---|---|---|---|
| FPR - All transcripts | |||||
| 0.0177 a | 0.0093 b | 0.0063 c | 0.0061 c | 0.0042 d | 0.0031 e |
| (0.00040) | (0.00040) | (0.00040) | (0.00040) | (0.00040) | (0.00040) |
| FPR - Low-count transcripts | |||||
| 0.0176 a | 0.0094 b | 0.0064 c | 0.0063 c | 0.0043 d | 0.0032 e |
| (0.00037) | (0.00037) | (0.00037) | (0.00037) | (0.00037) | (0.00037) |
Least square mean estimates (and corresponding SEM, shown in parentheses) of FPR for differential expression at FDR = 0.05 on all transcripts and on low-count transcripts based on DESeq2, EdgeR classic and EdgeR robust, implemented on null plasmodes of RNA-seq data. a,b,c,d,e, indicate significant differences (Tukey-Kramer adjusted P < 0.05) within a row
Performance metrics on differentially expressed (DE) plasmodes
| edgeR robust DF = 50 | edgeR robust DF = 10 | edgeR robust DF = 4 | edgeR robust | edgeR classic | DESEQ2 |
|---|---|---|---|---|---|
| Power - All transcripts | |||||
| 0.6495 a | 0.6275 b | 0.6215 b | 0.6229 b | 0.5704 c | 0.5418 d |
| (0.00610) | (0.00610) | (0.00463) | (0.00463) | (0.00435) | (0.00435) |
| Power - Low-count transcripts | |||||
| 0.3922 a | 0.3692 b | 0.3657 b | 0.3677 b | 0.2647 c | 0.1785 d |
| (0.00120) | (0.00120) | (0.00010) | (0.00010) | (0.00008) | (0.00008) |
| Precision - All transcripts | |||||
| 0.3607 a | 0.4393 b | 0.5218 c | 0.5323 c | 0.6321 d | 0.6586 e |
| (0.00792) | (0.01153) | (0.01192) | (0.01114) | (0.00589) | (0.00371) |
| Precision - Low-count transcripts | |||||
| 0.1800 a | 0.2289 b | 0.2904 c | 0.2963 c | 0.3605 d | 0.3915 e |
| (0.00811) | (0.01305) | (0.01432) | (0.01378) | (0.01363) | (0.01727) |
| NPV - All transcripts | |||||
| 0.9917 a | 0.9915 a | 0.9915 a | 0.9915 a | 0.9902 b | 0.9891 c |
| (0.00016) | (0.00016) | (0.00016) | (0.00016) | (0.00016) | (0.00016) |
| NPV - Low-count transcripts | |||||
| 0.9917 a | 0.9915 a | 0.9914 a | 0.9915 a | 0.9902 b | 0.9891 c |
| (0.00016) | (0.00016) | (0.00016) | (0.00016) | (0.00016) | (0.00016) |
| Accuracy - All transcripts | |||||
| 0.9718 a | 0.9778 b | 0.9821 b | 0.9826 c | 0.9858 d | 0.9862 e |
| (0.00080) | (0.00080) | (0.00024) | (0.00024) | (0.00012) | (0.00014) |
| Accuracy - Low-count transcripts | |||||
| 0.9679 a | 0.9744 b | 0.9794 c | 0.9798 c | 0.9841 d | 0.9855 e |
| (0.00122) | (0.00122) | (0.00079) | (0.00079) | (0.00028) | (0.00028) |
| FPR - All transcripts | |||||
| 0.0221 a | 0.0155 b | 0.011 b | 0.0106 c | 0.0063 d | 0.0053 e |
| (0.00086) | (0.00086) | (0.00024) | (0.00024) | (0.00001) | (0.00010) |
| FPR - Low-count transcripts | |||||
| 0.0244 a | 0.0175 b | 0.0124 c | 0.0121 c | 0.0063 d | 0.0037 e |
| (0.00128) | (0.00128) | (0.00082) | (0.00082) | (0.00019) | (0.00019) |
Least square mean estimates (and corresponding SEM, shown in parentheses) for true positive rate (TPR; i.e. power), positive predictive value (PPV; i.e. precision), negative predictive value (NPV), accuracy and false positive rate (FPR) for differential expression at FDR = 0.05 on all transcripts and on low-count transcripts yielded by DESeq2, EdgeR classic or EdgeR robust, implemented on DE plasmodes of RNA-seq data. a,b,c,d,e, indicate significant differences (Tukey-Kramer adjusted P < 0.05) within a row
Fig. 2MA-Plots for edgeR robust and DESeq2 with and without data filtering. Estimated fold-change in expression of RNA-seq transcripts for SB relative to BB as a function of transcript abundance following differential expression analyses with DESeq2 or edgeR robust (DF = Classic =_edgeR) on data subjected to no filtering (a, d) or to filtering with CPM (c, f) or RP (b, e) methods. For DESeq2, fold-changes are plotted over mean transcript expression on a log scale. For edgeR robust, fold-changes are plotted against counts per million on a log scale. Transcripts declared DE at FDR = 0.05 are colored in red
Number of transcripts declared differentially expressed (DE) using edgeR robust
| All transcripts | High-count transcripts | Low-count transcripts | |||||
|---|---|---|---|---|---|---|---|
| Total | SB-only | BB-only | Total | Total | SB-only | BB-only | |
| No filter | 3173 | 248 | 126 | 40 | 2135 | 245 | 121 |
| RP filtering | 3177 | 248 | 126 | 40 | 2137 | 245 | 121 |
| CPM filtering | 1002 | 0 | 0 | 23 | 239 | 0 | 0 |
The table contains the number of total transcripts, high-count transcripts and low-count transcripts declared DE using edgeR robust (with degrees of freedom specified based on the corresponding estimate obtained using classical edgeR software) analyses on the complete dataset (i.e. no data filtering applied) or following data filtering based on reads-present (RP) criterion or counts per million (CPM) criterion. Also listed are transcripts with expression levels present in sand bluestem and absent in big bluestem (SB-only transcripts) and transcripts with expression levels present in big bluestem and absent in sand bluestem (BB-only transcripts)
Number of transcripts declared differentially expressed (DE) using DESeq2
| All transcripts | High-count transcripts | Low-count transcripts | |||||
|---|---|---|---|---|---|---|---|
| Total | SB-only | BB-only | Total | Total | SB-only | BB-only | |
| No filter | 2290 | 112 | 69 | 38 | 1325 | 112 | 69 |
| RP filtering | 2297 | 111 | 69 | 38 | 1327 | 111 | 69 |
| CPM filtering | 952 | 0 | 0 | 30 | 204 | 0 | 0 |
The table contains the number of total transcripts, high-count transcripts and low-count transcripts declared DE using DESeq2 analyses on the complete dataset (i.e. no data filtering applied) or following data filtering based on reads-present (RP) criterion or counts per million (CPM) criterion. Also listed are transcripts with expression levels present in sand bluestem and absent in big bluestem (SB-only transcripts) and transcripts with expression levels present in big bluestem and absent in sand bluestem (BB-only transcripts)
Fig. 3Frequency of transcripts declared differentially expressed (DE) using edgeR robust and DESeq2. Venn diagrams of all transcripts and of low-count transcripts declared DE using edgeR robust (with degrees of freedom specified based on the corresponding estimate obtained using classical edgeR software) and DESeq2 on the complete dataset (i.e. no data filtering) or following data filtering based on a reads-present (RP) criterion or a counts per million (CPM) criterion
Fig. 4Comparison of P-values for DESeq2 tests on differential expression. Scatterplot of P-values for differential expression obtained using DESeq2’s likelihood ratio test (LRT) and Wald test on low-count and high-count transcripts subjected to no filtering (a, d) or to filtering with CPM (c, f) or RP (b, e) methods. Diagonal identity line is indicated in red