| Literature DB >> 32859148 |
Brian E Vestal1, Camille M Moore2, Elizabeth Wynn3, Laura Saba4, Tasha Fingerlin1, Katerina Kechris3.
Abstract
BACKGROUND: As the barriers to incorporating RNA sequencing (RNA-Seq) into biomedical studies continue to decrease, the complexity and size of RNA-Seq experiments are rapidly growing. Paired, longitudinal, and other correlated designs are becoming commonplace, and these studies offer immense potential for understanding how transcriptional changes within an individual over time differ depending on treatment or environmental conditions. While several methods have been proposed for dealing with repeated measures within RNA-Seq analyses, they are either restricted to handling only paired measurements, can only test for differences between two groups, and/or have issues with maintaining nominal false positive and false discovery rates. In this work, we propose a Bayesian hierarchical negative binomial generalized linear mixed model framework that can flexibly model RNA-Seq counts from studies with arbitrarily many repeated observations, can include covariates, and also maintains nominal false positive and false discovery rates in its posterior inference.Entities:
Keywords: Correlated data; Longitudinal data; Markov chain Monte Carlo; RNA-Seq
Mesh:
Substances:
Year: 2020 PMID: 32859148 PMCID: PMC7455910 DOI: 10.1186/s12859-020-03715-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1Directed acyclic graph of the MCMSeq Model. Squares represent observed data, while circles represent model parameters to be estimated. Diamonds represent prior parameters and hyper-parameters. Solid arrows indicate stochastic dependence, while dashed arrows indicate deterministic dependence
Summary of simulated datasets
| 10 | |
| 15,000 | |
| 3, 5, and 10 per group | |
| (6, 10 and 20 total) | |
| 0 (all features) | |
| treatment and control at baseline | |
| 0 (all features) | |
| in the control group | |
| 0 (80% of features) | |
| over time between the treatment and | ES1 (10% of features), -ES (10% of features) |
| control groups | |
| Drawn from an empirical distribution based on human | |
| samples in real RNA-Seq data sets with repeated measures | |
| 0 (80% of features) | |
| between control and treatment at | ES (10% of features), -ES (10% of features) |
| follow-up | |
| 0 (80% of features) | |
| over time in the treatment group | ES (10% of features), -ES (10% of features) |
1ES is drawn from a Gamma distribution with mode of log(2) and a standard deviation of 0.5
Summary of methods compared
| MCMSeq | mcmseq | Bayesian negative binomial hierarchical model, assuming the following | |
| hyper-parameters: | |||
| Chains were run for 30,000 iterations and 10% was discarded as burn-in. | |||
| CPGLMM | cplm | Compound Poisson generalized linear mixed model fit using maximum | [ |
| likelihood. | |||
| DESeq2* | DESeq2 | DESeq2 analysis using default settings. To account for correlation, | [ |
| subjects are included as fixed effects in the model by replacing the | |||
| random effects in Equation 4 with fixed effects. | |||
| DESeq2 | DESeq2 | DESeq2 analysis using default settings, including only fixed effects in | [ |
| Equation 4. Correlation is ignored. | |||
| edgeR* | edgeR | edgeR analysis using default settings. To account for correlation, | [ |
| subjects are included as fixed effects in the model by replacing the | |||
| random effects in Equation 4 with fixed effects. | |||
| edgeR | edgeR | edgeR analysis using default settings, including only fixed effects in | [ |
| Equation 4. Correlation is ignored. | |||
| limma | limma | Count data were transformed for analysis with limma using voom. | [ |
| The duplicateCorrelation function using subject id as the blocking | [ | ||
| factor was used to account for correlation. | |||
| LMM | lmerTest | Data were first transformed using DESeq2’s variance stabilizing | [ |
| transformation. Linear mixed models, which assume a normally | |||
| distributed errors, were then fit to the transformed counts. | |||
| MACAU | MACAU2 | Poisson GLMM with random effects to account for over-dispersion and | [ |
| correlation among observations. A known covariance matrix must be supplied; | |||
| we used the covariance of the centered and scaled gene expression matrix as | |||
| shown in the third example of [ | |||
| NBGLMM | glmmADMB | Frequentist negative binomial generalized linear mixed model fit using | [ |
| maximum likelihood. | |||
| ShrinkBayes | ShrinkBayes | Bayesian hierarchical model using shrinkage priors to control false | [ |
| discovery rates. We used a negative binomial distribution for the count | |||
| data and shrinkage priors for all model parameters involved in statistical | |||
| tests. All other parameters were set to default values. | |||
| VarSeq | tcgsaseq | Model free method for identifying gene sets whose expression changes | [ |
| over time. Counts are transformed to log(CPM) for analysis. The | |||
| asymptotic test was used to generate |
Results presented in Supplementary materials, Section 2.3
Testing characteristics compared
| Characteristic | Description | Significance Thresholds | P-Value |
|---|---|---|---|
| Type 1 Error Rate | 0.0001, 0.01, 0.05, 0.1 | Unadjusted | |
| (False Positive Rate) | for unadjusted | ||
| False Discovery Rate | 0.01, 0.05, 0.1 | FDR Adjusted | |
| Power (Recall) | 0.01, 0.05, 0.1 | FDR Adjusted |
Number of models that failed to converge out of the 150,000 simulated genes in the null + mixed effect size simulation
| Method | N = 3 | N = 5 | N = 10 |
|---|---|---|---|
| MCMSeq | 184 | 335 | 1,526 |
| DESeq2* | 0 | 0 | 0 |
| DESeq2 | 1,295 | 979 | 0 |
| edgeR* | 0 | 0 | 0 |
| edgeR | 0 | 0 | 0 |
| limma | 0 | 0 | 0 |
| LMM | 32,060 | 23,300 | 14,150 |
| MACAU | 39 | 5 | 2 |
| NBGLMM | 6,113 | 9,992 | 11,995 |
| ShrinkBayes | 0 | 0 | 0 |
Fig. 2Relative type 1 error rates (false positive rates) for the 3 tests of interest. Relative rates are plotted on the log2 scale, for the 4 different testing thresholds at each of the 3 sample sizes examined. Values greater than 0 indicate type 1 error rates higher than nominal levels (e.g. a value of 1 means the observed rate was double what one would expect), while values less than 0 indicate conservative behavior (e.g. -1 means observed rate was half of what is expected). Values in the figure are based on the average rate calculated across the 10 data sets simulated at each sample size. Supplementary Fig. 5 shows the variability in observed rates for each of the methods. Results are consistent across the data sets, indicating the mean of the results is an accurate reflection of each method’s performance
Fig. 3Relative FDRs for the 3 tests of interest. Relative rates are plotted on the log2 scale, for the 3 different testing thresholds at each of the 3 sample sizes examined. Values greater than 0 indicate FDRs higher than nominal levels (e.g. a value of 1 means the observed rate was double what one would expect), while values less than 0 indicate conservative behavior (e.g. -1 means observed rate was half of what is expected). Values in the figure are based on the average rate calculated across the 10 data sets simulated at each sample size. Triangular plotting symbols indicate that no differentially expressed genes were identified at the given threshold
Fig. 4Statistical power (recall) for the 3 tests of interest. Values in the figure are based on the average power calculated across the 10 data sets simulated at each sample size
Fig. 5Scatter plot of power versus observed FDR for the 3 contrasts of interest in the n=5 simulated datasets. Significance was determined at the 0.05 FDR level. Points that lie to the left of the dashed vertical line represent methods that have an observed FDR less than the nominal rate of 5%, while points to the right represent methods with FDR inflation
Number of significant genes by contrast in the tuberculosis dataset. 15,993 genes were included in the analysis. Genes were called significant if the FDR adjusted p-value was <0.05
| edgeR | edgeR ∗ | limma | LMM | MCMSeq | NBGLMM | ShrinkBayes1 | |
|---|---|---|---|---|---|---|---|
| Number Failed to Converge | 0 | 0 | 0 | 392 | 132 | 1348 | 0 |
| Latent vs. Control | 279 | - | 11 | 2 | 3 | 9 | 416, 407 |
| Progressor vs. Control | 2221 | - | 1009 | 752 | 650 | 1130 | 1821, 1013 |
| Progressor vs. Latent | 1643 | - | 552 | 299 | 256 | 667 | 1369, 708 |
| Control | 1 | 0 | 0 | 0 | 0 | 0 | 2977 |
| Latent | 2 | 0 | 0 | 0 | 3 | 2 | 3171 |
| Progressor | 664 | 790 | 2460 | 1413 | 1463 | 1878 | 7848 |
| Latent vs. Control | 0 | 0 | 0 | 0 | 3 | 0 | 3331, 3230 |
| Progressor vs. Control | 16 | 272 | 310 | 537 | 635 | 794 | 7199, 6429 |
| Progressor vs. Latent | 170 | 278 | 1852 | 825 | 852 | 1328 | 6681, 7390 |
1In order to test all contrasts of interest, ShrinkBayes models were fit three times, using a different reference group each time. This resulted in some contrasts being evaluated twice. As ShrinkBayes is a stochastic method, the number of significant genes can differ between model fits. We present the results from both models when tests were performed multiple times
Functional enrichment analysis
| Description | N Genes | N Genes | N | Fold | |
|---|---|---|---|---|---|
| in Set | in DEG List | Expected | Enrichment | ||
| 0019221 | cytokine-mediated | 659 | 110 | 42.28 | 2.6 |
| signaling pathway | |||||
| 0043312 | neutrophil degranulation | 483 | 88 | 30.99 | 2.84 |
| 0045088 | regulation of innate | 434 | 75 | 27.84 | 2.69 |
| immune response | |||||
| 0001817 | regulation of cytokine | 676 | 98 | 43.37 | 2.26 |
| production | |||||
| 0002684 | positive regulation of | 1125 | 137 | 72.17 | 1.9 |
| immune system process | |||||
| 0010608 | post-transcriptional | 512 | 77 | 32.85 | 2.34 |
| regulation of gene | |||||
| expression | |||||
| 0016032 | viral process | 690 | 93 | 44.27 | 2.1 |
| 1903706 | regulation of hemopoiesis | 442 | 69 | 28.36 | 2.43 |
| 0051248 | negative regulation of | 1087 | 128 | 69.74 | 1.84 |
| protein metabolic process | |||||
| 0033554 | cellular response to stress | 1662 | 175 | 106.63 | 1.64 |
| 0046649 | lymphocyte activation | 369 | 60 | 23.67 | 2.53 |
| 0002683 | negative regulation of | 434 | 66 | 27.84 | 2.37 |
| immune system process | |||||
| 0034660 | ncRNA metabolic process | 542 | 76 | 34.77 | 2.19 |
| 0034341 | response to interferon- | 182 | 38 | 11.68 | 3.25 |
| gamma | |||||
| Blankley | 327 | 123 | 30.4 | 4.1 | |
| Kafouru | 18 | 13 | 1.7 | 7.8 | |
| Zak | 14 | 9 | 1.3 | 6.9 | |
Top GO terms were selected as follows: the top 50 GO terms were selected by p-value
The list was then reduced to include only the most specific subclass in an ontology
All results included had BH adjusted p-values <0.0001