
EBSeq-HMM: a Bayesian approach for identifying gene-expression changes in ordered RNA-seq experiments.

Ning Leng¹, Yuan Li², Brian E McIntosh³, Bao Kim Nguyen³, Bret Duffin³, Shulan Tian³, James A Thomson⁴, Colin N Dewey⁵, Ron Stewart³, Christina Kendziorski⁵.

Abstract

MOTIVATION: With improvements in next-generation sequencing technologies and reductions in price, ordered RNA-seq experiments are becoming common. Of primary interest in these experiments is identifying genes that are changing over time or space, for example, and then characterizing the specific expression changes. A number of robust statistical methods are available to identify genes showing differential expression among multiple conditions, but most assume conditions are exchangeable and thereby sacrifice power and precision when applied to ordered data.
RESULTS: We propose an empirical Bayes mixture modeling approach called EBSeq-HMM. In EBSeq-HMM, an auto-regressive hidden Markov model is implemented to accommodate dependence in gene expression across ordered conditions. As demonstrated in simulation and case studies, the output proves useful in identifying differentially expressed genes and in specifying gene-specific expression paths. EBSeq-HMM may also be used for inference regarding isoform expression.
AVAILABILITY AND IMPLEMENTATION: An R package containing examples and sample datasets is available at Bioconductor.
CONTACT: kendzior@biostat.wisc.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2015 PMID: 25847007 PMCID: PMC4528625 DOI: 10.1093/bioinformatics/btv193

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

With improvements in next-generation sequencing technologies and reductions in price, ordered RNA-seq experiments are becoming common. Of primary interest in these experiments is characterizing how genes are changing over some factor with ordered levels (for example, ordered in time, in space, along a gradient, etc). For simplicity, we refer to any ordered RNA-seq experiment as a time-course experiment, noting that other similar designs may be analyzed within this framework; and we restrict attention to time-course data collected within a single biological condition. In a time-course RNA-seq experiment, an investigator may be interested in genes that are monotonically increasing or decreasing, that increase initially then decrease, that increase initially then remain unchanged and so on. We refer to these types of changes in expression hereinafter as expression paths, and we consider three broad types: (i) constant paths: expression remains unchanged, or equally expressed (EE), over all time points; (ii) sporadic paths: expression shows some change between at least one pair of time points, but remains unchanged between at least one other pair and (iii) dynamic paths: expression changes continuously. With respect to the examples listed earlier, the first few (expression is monotonically increasing, monotonically decreasing, increasing then decreasing) are instances of dynamic paths. The last (increase initially then remain unchanged) is an example of a sporadic path. 
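The three broad path types can be stated compactly in code. The following is a minimal sketch, assuming a path is represented as a list of per-transition labels ("Up", "Down" or "EE"); the function name `path_type` and this representation are illustrative choices, not part of the EBSeq-HMM package.

```python
from typing import Sequence

ALLOWED_STATES = {"Up", "Down", "EE"}


def path_type(states: Sequence[str]) -> str:
    """Return 'constant', 'sporadic' or 'dynamic' for a sequence of states.

    states[i] describes the change between ordered condition i and i + 1.
    """
    if not states or any(s not in ALLOWED_STATES for s in states):
        raise ValueError("states must be a non-empty sequence of Up/Down/EE")
    n_ee = sum(s == "EE" for s in states)
    if n_ee == len(states):
        return "constant"   # unchanged at every transition
    if n_ee == 0:
        return "dynamic"    # changes at every transition
    return "sporadic"       # some transitions change, others do not


# Examples taken from the text: increase then decrease is dynamic,
# increase then remain unchanged is sporadic.
print(path_type(["Up", "Up", "Down", "Down"]))
print(path_type(["Up", "EE", "EE", "EE"]))
```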
A number of robust statistical methods are available for identifying differentially expressed (DE) genes [EBSeq (Leng ), DESeq2 (Love ), edgeR (Robinson ), voom (Law ), baySeq (Hardcastle and Kelly, 2010), Cuffdiff2 (Trapnell )] as well as isoforms [EBSeq, rSeqDiff (Shi and Jiang, 2013), Cuffdiff2, BitSeq (Glaus )] in a static RNA-seq experiment; and most of these methods accommodate time-course experiments by considering time as a factor with multiple, unordered, levels (Supplementary Section S7 provides details). The statistical tests employed are designed to identify a gene as DE if it shows a change at one or more time points; consequently, non-constant genes are detected collectively. Multiple steps of subsequent analyses are required if an investigator wants to distinguish sporadic paths from dynamic ones, to classify genes into distinct paths and to assess the associated classification uncertainty (for example, by performing model fitting multiple times with different design matrices and then adjusting for multiple testing). In addition, as these approaches were not designed specifically for time-course experiments, they do not accommodate dependence over time and consequently sacrifice power if applied in this setting. These same issues were addressed in the context of microarray time-course experiments, and a number of methods are available for analyzing (Conesa ; Filkov ; Ma ; Yuan and Kendziorski, 2006) and clustering (Ernst ; Luan and Li, 2003) time-course microarray data. These methods are not directly applicable to RNA-seq studies since they do not accommodate count data, the unequal variabilities in measurements or the dependence of isoforms within genes. To address this, the approach developed by Conesa , maSigPro, originally developed for microarray time-course analysis, was recently extended to accommodate ordered RNA-seq count data (Nueda ). 
Like DESeq2 and edgeR, maSigPro-GLM is based on a negative binomial (NB) generalized linear model (GLM); but unlike previous approaches, maSigPro-GLM defines gene-specific expected expression by a time-dependent polynomial to accommodate dependence over time. Once significant genes are selected, a second regression is conducted for each gene to identify the time points at which it shows expression differences. Clustering algorithms are then applied to the resulting regression coefficients and/or expression values to identify groups of genes with similar expression profiles. Although useful, the two-step procedure makes it challenging to determine appropriate thresholds for false discovery rate (FDR) control, and suggested thresholds are conservative in many settings (Nueda ). In addition, identified gene groups are subject to limitations inherent in clustering algorithms; namely, the number of groups as well as group membership are determined by user-defined cutoffs, there is no probabilistic information associated with a given gene’s membership within a group, and it is not clear how to classify gene groups into expression paths. To address these considerations, we have developed an empirical Bayes auto-regressive hidden Markov model (HMM) based approach called EBSeq-HMM. The model extends our previous work, EBSeq, for identifying DE genes and isoforms across two or more biological conditions (Leng ). As detailed in Methods, an auto-regressive process describes changes in expression over time, and a hidden Markov component is used to accommodate dependence. EBSeq-HMM allows users to identify genes with non-constant expression over multiple ordered conditions, and simultaneously classify them into expression paths. Results from a simulation study, detailed in Section 3.1, suggest that EBSeq-HMM has increased power over competing approaches for identifying genes following non-constant paths, especially for those genes showing subtle yet consistent changes over time. 
EBSeq-HMM also provides improved accuracy in classifying genes into expression paths. Similar results are demonstrated in a case study of the adult mouse limb presented in Section 3.2.

2 Methods

2.1 EBSeq-HMM: an empirical Bayes auto-regressive Hidden Markov model

EBSeq-HMM requires estimates of gene or isoform expression collected over three or more ordered levels of a factor. The general model is presented for gene-level analysis; the isoform-level model is discussed in Section 2.3. To simplify the presentation, we refer to ordered levels as time points denoted by $t = 1, \ldots, T$, noting that the method directly accommodates other ordered data structures (e.g. ordered in space, along a gradient, etc.). Let $X_t$ be a matrix of expression values for G genes in N samples at time t. The full set of observed expression values is then denoted by $X = (X_1, \ldots, X_T)$. With a slight abuse of notation, let $X_g$ denote one row of this matrix containing data for gene g over time; $X_{g,t,n}$ denotes expression values for gene g at time t in sample n. Of interest are changes in the latent mean expression levels for gene g: $\mu_{g,1}, \ldots, \mu_{g,T}$. We allow for three possibilities, or states, to describe such changes: Up, Down, EE. If $\mu_{g,t-1} < \mu_{g,t}$, we define state $s_{g,t}$ as Up; if $\mu_{g,t-1} > \mu_{g,t}$, $s_{g,t}$ is Down; and $\mu_{g,t-1} = \mu_{g,t}$ defines $s_{g,t}$ as EE. The main goals in an ordered RNA-seq experiment (identifying genes that change over time, and specifying each gene’s expression path) can be restated as questions about these underlying states. In short, for each gene g and each transition between t−1 and t, we would like to estimate the probability of each state. A gene is said to follow a non-constant path if at least one state is not EE. We would also like to estimate the most likely expression path, which is given by the configuration of expression states over time ($s_{g,2}, \ldots, s_{g,T}$), noting that the most likely configuration of states need not equal the collection of states that are most likely marginally at each t (an example is provided in Section 3.1). To make inference regarding these states, we propose a model for the set of expression measurements taken on a gene g. We make the common and well-supported assumption that gene expression in an RNA-seq experiment is well described by a NB distribution (Anders and Huber, 2010; Hardcastle and Kelly, 2010; Love ; Nueda ; Robinson ; Trapnell ). 
Were we to consider time t in isolation, this implies where the NB distribution may be parameterized such that . For simplicity of notation, we assume equal library sizes. Details on adjustments for unequal library sizes are given in Supplementary Section S2. Because our interest here is in quantifying changes in over time, we assume expression at time t depends on that at t−1 through parameters r and q. Specifically, where if s is Up; if s is Down and if s is EE. The data dependent parameter c specifies the expected change associated with each state. For example, if c = 2, then = Up refers to a 2-fold increase in expression between t−1 and t. Although c may be defined by a user, we suggest estimation by maximum likelihood (see the next section). We further model fluctuations in μ by defining a prior distribution for for all g and t > 1. Given this set-up, when t > 1, the marginal predictive conditional distribution describing expression (or emissions) for each state is Beta-NB: where . The expected mean is then defined as (Teerapabolarn, 2008). When t = 1, the prior distribution for is defined as for all g, and the marginal predictive distribution is . For genes with dynamic paths, each state is dependent on the prior state since these genes represent continuous changes over time. To accommodate this dependence, we assume that the state process is described by a Markov chain. The constant and sporadic genes do not show continuous changes over time, and consequently we assume that states are independent, although we note that dependence among expression levels is still accommodated via the auto-regressive component. In summary, the time-course for a dynamic gene is governed by two interrelated probabilistic mechanisms: the conditional distribution (emissions model) at each time and the process describing the evolution of states over time. 
Initially, we assume that the observed expression vector can be characterized by the Beta-NB model described earlier and that the state process can be described by a Markov chain. Were it the case that dependence among measurements is fully captured by the state process, the proposed model would be a standard HMM. However, this last assumption does not hold, given that X for dynamic genes depends not only on the state but also on through . Consequently, the model for dynamic genes is given by a Markov-switching auto-regressive model, as in Hamilton (1989) and Ailliot and Monbet (2012) (Fig. 1). For constant and sporadic genes, we assume the same emissions model, but do not assume the state process is Markov. Taken together, since we do not know the expression path type a priori, the model for the full set of expression measurements is a two-component mixture over the sporadic/constant and dynamic genes.

Fig. 1.

(a) An auto-regressive hidden Markov component models dynamic paths. (b) An auto-regressive non-hidden Markov component models constant and sporadic paths

2.2 Parameter estimation

In the emissions distributions, the unknown parameters (r’s, α and β) are estimated using the method of moments (r’s are estimated within time point while α and β are estimated using all samples); c is estimated via maximum likelihood. Recall that EBSeq-HMM assumes a mixture model with a Markov component m1 and a non-Markov component m2. We assume equal prior probabilities of being in each mixture component. In Markov chain m1, the Baum-Welch algorithm is used to estimate initial and state transition probabilities for . Here, we assume a non-homogeneous Markov chain for the hidden states so ’s are different for different t’s. Denote the vector of initial probabilities and the state transition matrices estimated from the last step as . Given parameter estimates , define and . The forward and backward steps of the Baum-Welch algorithm are then defined as follows: The initial and state transition probabilities are updated by: Parameters are estimated by fixing expected fold-change (FC) c at 1.2. The process is then repeated for c in (1.4, 1.6, … , 3); and the parameter set with maximum likelihood is used in the final model.
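As a rough illustration of the forward and backward steps for a non-homogeneous chain, the sketch below implements the textbook scaled forward-backward recursion. It is not a reconstruction of the paper's equations: the emission likelihoods are taken as given inputs (in EBSeq-HMM they would come from the Beta-NB emissions model), the function name and array layout are assumptions made here, and the re-estimation of initial and transition probabilities is omitted.

```python
import numpy as np


def forward_backward(emis, init, trans):
    """Scaled forward-backward pass for a non-homogeneous Markov chain.

    emis  : (T, K) array; emis[t, k] is the likelihood of the data at
            step t given hidden state k (assumed precomputed).
    init  : (K,) initial state probabilities.
    trans : (T-1, K, K) array; trans[t, i, j] is the probability of moving
            from state i at step t to state j at step t + 1.
    Returns the (T, K) posterior state probabilities and the log-likelihood.
    """
    emis = np.asarray(emis, dtype=float)
    init = np.asarray(init, dtype=float)
    trans = np.asarray(trans, dtype=float)
    T, K = emis.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    scale = np.zeros(T)

    # Forward step, rescaling at each t to avoid numerical underflow.
    alpha[0] = init * emis[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans[t - 1]) * emis[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward step, using the same scaling constants.
    for t in range(T - 2, -1, -1):
        beta[t] = trans[t] @ (emis[t + 1] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma, float(np.log(scale).sum())


# Illustrative two-state (Up, Down) example with made-up numbers.
emis_demo = np.array([[0.9, 0.1], [0.2, 0.8]])
init_demo = np.array([0.5, 0.5])
trans_demo = np.array([[[0.7, 0.3], [0.4, 0.6]]])
post_demo, loglik_demo = forward_backward(emis_demo, init_demo, trans_demo)
print(post_demo, loglik_demo)
```

Repeating such a pass for each candidate value of c and keeping the parameter set with the largest log-likelihood mirrors the grid search over c described above.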

2.3 Inference at the isoform level

The model detailed in the previous section applies to gene counts. To apply the approach to isoforms, the uncertainty inherent in isoform expression estimation should be accommodated. In short, estimating expression at the gene-level is a relatively easy task in RNA-seq as all reads mapping to a gene’s constituent exons may be used. The same holds true for estimating expression for an isoform unique to its parent gene. However, for genes with multiple isoforms, the problem is more challenging as reads mapping to overlapping exons (exons present in more than one isoform) must be allocated to isoforms in a way that is consistent with their expression. Consequently, there is increased uncertainty (on average) in expression estimates for isoforms with multiple overlapping exons, referred to as complex isoforms; and the uncertainty has been shown to have a substantial effect on downstream analysis methods (Leng ). Specifically, define an isoform of gene g as belonging to the I = k group, for example, where k = 1, 2 or 3, if the total number of isoforms from gene g is k (the I = 3 group contains all isoforms from genes having 3 or more isoforms). Leng demonstrated that there is decreased variability in the I = 1 group, but increased variability in the others, due to the relative increase in uncertainty inherent in estimating isoform expression when multiple isoforms of a given gene are present. This observation is not specific to the dataset and/or the method used for isoform expression estimation; it is also not specific to the particular method used for quantifying isoform complexity. To adjust for the increased uncertainty inherent in complex isoform expression estimates, we allow the Beta prior to depend on isoform group: . The hyperparameter α is shared across isoforms, but here β depends on I, accommodating the systematic differences in variability among the I groups. 
I quantifies a measure of isoform complexity and may be defined by the user as the number of isoforms from a gene, as described earlier. It could also be defined by an isoform’s mappability score or credibility interval as provided by Koehler , Li and Dewey (2011) or Derrien .
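The default grouping by number of isoforms per gene can be sketched as follows. This is a minimal illustration assuming an isoform-to-gene mapping held in a dictionary; the function name `isoform_groups` and the identifiers in the example are hypothetical.

```python
from collections import Counter
from typing import Dict


def isoform_groups(isoform_to_gene: Dict[str, str], max_group: int = 3) -> Dict[str, int]:
    """Assign each isoform to group I = k, where k is the number of isoforms
    of its parent gene, capped at max_group (I = 3 holds genes with 3 or more)."""
    n_isoforms = Counter(isoform_to_gene.values())
    return {
        iso: min(n_isoforms[gene], max_group)
        for iso, gene in isoform_to_gene.items()
    }


# Hypothetical identifiers: geneA has 1 isoform, geneB has 2, geneC has 4.
mapping = {
    "A.1": "geneA",
    "B.1": "geneB", "B.2": "geneB",
    "C.1": "geneC", "C.2": "geneC", "C.3": "geneC", "C.4": "geneC",
}
print(isoform_groups(mapping))
```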

2.4 Simulated data

We followed the simulation setup of Robinson and Smyth (2007) by defining counts as NB with gene-specific mean in sample n and time point t given by μ and variance . The (μ, )’s were sampled as pairs from the mouse limb case study data described in the next section. Paired sampling was done to preserve the mean-variance relationship observed in most RNA-seq datasets. Each simulated dataset contains 10 000 genes and 15 samples which represent three biological replicates at each of five time points. One hundred datasets were considered for each simulation scenario.
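A minimal sketch of drawing NB counts with a specified mean is given below. It assumes the variance takes the form μ(1 + μφ) with a gene-specific dispersion φ, which is the parameterization commonly associated with Robinson and Smyth (2007) and edgeR; that form, the function name and the array shapes are assumptions made for illustration, since the variance expression is not legible in the text above.

```python
import numpy as np


def simulate_nb_counts(mu, phi, n_reps, rng):
    """Draw NB counts with mean mu[g, t] and variance mu * (1 + mu * phi[g]).

    mu     : (G, T) array of gene-by-time means.
    phi    : (G,) array of gene-specific dispersions (assumed form).
    n_reps : number of replicates per time point.
    Returns an integer array of shape (G, T, n_reps).
    """
    mu = np.asarray(mu, dtype=float)
    phi = np.asarray(phi, dtype=float)
    size = (1.0 / phi)[:, None]        # NB "number of successes" parameter
    p = size / (size + mu)             # success probability giving mean mu
    G, T = mu.shape
    return rng.negative_binomial(size[:, :, None], p[:, :, None], size=(G, T, n_reps))


# Three replicates at each of five time points, as in the simulation design.
rng = np.random.default_rng(0)
counts = simulate_nb_counts(np.full((4, 5), 50.0), np.full(4, 0.1), 3, rng)
print(counts.shape)
```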

2.4.1 Sim I

Sim I considers dynamic changes over time for 60% of the genes, which matches the percentage in the case study data. For these genes, paths were generated from an HMM. With five conditions, there are four states in the hidden chain (as shown in Fig. 1), so three state transition matrices were used. We defined the initial probabilities as 0.5 and the state transition matrices as , and , which resulted in Up-Up-Down-Down and Down-Down-Up-Up being the two most frequent expression paths. Note that other paths were realized as well, although with fewer genes. Once a gene’s particular path (collection of states) was generated, was simulated as μ multiplied (divided) by δ if was Up (Down). For one-half of the dynamic genes, we simulated strong effects, with δ sampled from empirical FCs between 1.3 and 1.4 calculated using case study data. The other one-half represent weak effects with δ sampled from empirical FCs between 1.2 and 1.3. The remaining 40% of genes were simulated as constant meaning the latent level of expression remains unchanged across conditions. To simulate genes following constant paths, we only took the genes whose simulated empirical FC of medians between any two adjacent time points was within (1/1.2, 1.2).
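The path-generation step can be sketched as below. The transition matrices used in the paper are not legible above, so the matrices in this example are hypothetical values chosen only so that Up-Up-Down-Down and Down-Down-Up-Up are the most frequent paths; the function name and interface are likewise assumptions.

```python
import numpy as np

STATES = ("Up", "Down")


def simulate_dynamic_gene(mu1, delta, init, trans, rng):
    """Generate one dynamic path from a two-state Markov chain and the
    corresponding latent means.

    mu1   : latent mean at the first time point.
    delta : fold change applied at each transition.
    init  : (2,) initial probabilities for (Up, Down).
    trans : sequence of (2, 2) transition matrices, one per later transition.
    """
    idx = rng.choice(2, p=init)
    path = [idx]
    for A in trans:
        idx = rng.choice(2, p=np.asarray(A)[idx])
        path.append(idx)
    mu = [float(mu1)]
    for s in path:
        # Multiply by delta for Up, divide by delta for Down.
        mu.append(mu[-1] * delta if s == 0 else mu[-1] / delta)
    return [STATES[s] for s in path], np.array(mu)


# Hypothetical matrices: states tend to persist, except at the middle transition.
stay = np.array([[0.9, 0.1], [0.1, 0.9]])
switch = np.array([[0.1, 0.9], [0.9, 0.1]])
rng = np.random.default_rng(0)
print(simulate_dynamic_gene(100.0, 1.35, [0.5, 0.5], [stay, switch, stay], rng))
```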

2.4.2 Sim II

For this simulation scenario, 40% of the 10 000 genes were simulated as dynamic as in Sim I and another 20% were simulated as sporadic. For dynamic genes, paths were generated from an HMM as described in Sim I; half were simulated as strong effects and the other half were with weak effects. For the sporadic genes, a time point t was chosen at random and μ was defined as , where δ was sampled from empirical FCs between 1.3 and 1.4. The remaining 40% of genes were simulated as constant, again as described in Sim I.

2.5 Case study data

Of interest in our case study, detailed below, is RNA-seq data from the James Thomson Lab at the Morgridge Institute for Research. We evaluated gene expression from seven positions along the mouse limb: proximal stylopod, distal stylopod, elbow, proximal zeugopod, distal zeugopod, autopod and digit. Three 12-week-old C57BL/6J female mice were euthanized by cervical dislocation, followed by the extraction of the right forelimb. The tissues were treated with RNAlater (Sigma), per the manufacturer’s instructions, dissected using a SteREO Discovery.V8 microscope (Zeiss), and stored at −20°C. The tissues were homogenized and lysed using a variable speed rotor stator homogenizer and Qiazol (Qiagen). Total RNA was extracted from the homogenized tissue samples using Qiagen’s RNeasy Lipid Tissue Mini (digits) and Midi (all other) Kits. A total of 21 samples were sequenced using Illumina’s Directional mRNA-Seq protocol (Part # 15018460 Rev. A). The reads are single-end with a read length of 42 bp. Each sample was run on one lane of an Illumina GAII in a randomized order to reduce batch effects. Alignment was done using Bowtie (Langmead ) with the hg19 RefSeq annotation. Expression estimates were obtained from RSEM (Li and Dewey, 2011) and library size factors were obtained using median-of-ratios normalization (Anders and Huber, 2010). See Supplementary Section S7 for package versions and further details.

2.6 Identification of DE genes and classification

EBSeq-HMM is compared with EBSeq, DESeq2, edgeR, voom, maSigPro and a naive method based on FC. See Supplementary Section S7 for package versions and further details. Two tasks are of interest: identifying DE genes, defined as those showing any change across conditions; and assigning DE genes into their most likely expression path.

2.6.1 Identification of DE genes

To identify a list of DE genes with FDR α via EBSeq-HMM or EBSeq, we take those genes for which the posterior probability (PP) of being constant is less than or equal to α. Both DESeq2 and edgeR implement a generalized-linear model to test H0: data ∼ intercept versus H1: data ∼ intercept + condition with derived P-values adjusted for multiplicities using Benjamini and Hochberg (1995). To construct a list of DE genes with target FDR α, we consider those genes with adjusted P-values less than or equal to α. As detailed in Law , the voom approach first estimates the precision weights based on the inverse variance, then applies the limma empirical Bayes pipeline taking the precision weights as prior information to account for the unequal variabilities in RNA-seq data. A similar hypothesis test was performed as in DESeq2 and edgeR, and the P-values were adjusted using Benjamini-Hochberg as well. Genes with adjusted P-values less than or equal to α were considered. As suggested in the maSigPro user manual, we applied the GLM method in the maSigPro package with the NB family and default parameter settings. We also considered two additional settings. Specifically, maSigPro uses an R2 value to obtain a sorted gene list. However, it is not clear how to pick an R2 threshold that gives a gene list with FDR controlled at some target level. The authors suggest 0.7 as the default R2 value. In addition to this default setting, we also considered R2 thresholds of 0.5 and 0.3 to evaluate maSigPro more thoroughly. For the naive FC method, denote as the median expression of gene g at time point t. A gene g is called Up (Down) between t and t + 1 if is greater than (less than) K; otherwise, it is EE. We evaluate five values of K: 1.2, 1.3, 1.5, 2 and 2.5. A gene is defined as DE if it is non-EE at any transition.
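The naive FC method can be sketched as follows. The exact comparison used in the paper is not legible above, so this example assumes the statistic is the ratio of median expression at adjacent time points, with Up called when the ratio exceeds K and Down when it falls below 1/K; that symmetric rule, the function names and the requirement of strictly positive medians are assumptions made here.

```python
import numpy as np


def fc_calls(medians, K):
    """Call Up/Down/EE at each transition from per-time-point medians.

    medians : 1-D array of strictly positive median expression values.
    K       : fold-change threshold (assumed symmetric: Down if ratio < 1/K).
    """
    medians = np.asarray(medians, dtype=float)
    ratios = medians[1:] / medians[:-1]
    calls = np.where(ratios > K, "Up", np.where(ratios < 1.0 / K, "Down", "EE"))
    return calls.tolist()


def is_de(calls):
    """A gene is DE if it is non-EE at any transition."""
    return any(c != "EE" for c in calls)


calls = fc_calls([100.0, 150.0, 150.0, 90.0], K=1.3)
print(calls, is_de(calls))
```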

2.6.2 Classification of genes into expression paths

Recall that EBSeq-HMM provides gene-specific posterior probabilities associated with each expression path. For EBSeq-HMM, a DE gene is classified into a specific expression path if its PP of being in that path exceeds 0.5. Selecting genes with PP > 0.5 ensures that the posterior maximizing class always minimizes the Bayes risk regardless of choice of the metric loss function (Schlüter ), although we note that there may be reasons to consider different thresholds in some situations (Section 4). For EBSeq, DESeq2, edgeR, voom and maSigPro, classifying DE genes into expression paths is not of interest, and no clear guidelines on how to do so are provided. Consequently, these methods are not evaluated for expression path classification. Finally, since no uncertainty measure of assignment is available using FC, for the FC analysis a gene is classified into the path defined by the Up/Down/EE calls across transitions.
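The PP > 0.5 rule can be written in a few lines. This is a sketch assuming the posterior probabilities for one gene are available as a dictionary keyed by path label; the function name and the example values are hypothetical and do not reflect the package's output format.

```python
from typing import Dict, Optional


def classify_path(path_pp: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the most likely path if its posterior probability exceeds
    the threshold; otherwise leave the gene unclassified (None)."""
    best = max(path_pp, key=path_pp.get)
    return best if path_pp[best] > threshold else None


print(classify_path({"Up-Up-Down-Down": 0.72, "Up-Up-Up-Down": 0.28}))
print(classify_path({"Up-Up": 0.40, "Up-Down": 0.35, "Down-Down": 0.25}))
```

Because at most one path can have a PP above 0.5, a gene is never assigned to more than one path under this rule.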

3 Results

3.1 Simulation results

Simulation studies were conducted to investigate the operating characteristics of EBSeq-HMM and to assess how it compares with EBSeq, DESeq2, edgeR, voom, maSigPro and FC analysis. As detailed in Methods, each simulated dataset derives counts from a NB model. Like EBSeq-HMM, EBSeq, DESeq2, edgeR and maSigPro also assume that counts are distributed as NB, and consequently, this assumption should not provide advantage, or lack thereof, to any one method in particular. As the form of the variance is that assumed in edgeR, there may be a slight advantage given to that method. Parameter estimates were derived from case study data to help ensure that many features of real data are preserved in the simulation (e.g. mean/variance relationship and magnitude of FCs; Section 2.4 and the Supplement Section S3 for more details). Table 1 shows the power and FDR for identifying dynamic genes in Sim I, where the target FDR is controlled at 5%. In addition to showing power overall, it is also shown separately for strong and weak effects (FDR is not shown for each subgroup because false discoveries are discoveries of EE genes and therefore cannot be classified as strong or weak). EBSeq-HMM has higher power than EBSeq, DESeq2, edgeR and voom, which is largely due to its ability to identify genes showing subtle, yet consistent, changes over time. Specifically, the power of the five methods is comparable for genes with strong effects, but EBSeq-HMM shows advantage in identifying genes where changes between any two points are relatively small. An example of two genes identified exclusively by EBSeq-HMM is shown in Figure 2 [panels (a) and (b)]. It is clear from the figure that the change between any two points is small (FC < 1.3) and in some cases these changes would not be identified by a marginal analysis between adjacent time points [e.g. time points 1 and 2 in Fig. 2b], but EBSeq-HMM identifies the genes as dynamic given the consistent changes over time.

Table 1.

Operating characteristics for identifying changes in Sim I

	Power (%)	FDR (%)	F1 score (%)	Power (strong) (%)	Power (weak) (%)
EBSeqHMM	98.6	4.3	97.1	99.7	97.5
EBSeq	90.0	0.1	94.7	93.9	86.1
DESeq2	92.4	0	96.1	95.4	89.4
edgeR	92.5	0.1	96.1	96.1	89.4
voom	91.9	0	95.8	95.1	88.6
maSigPro (0.7)	46.8	0	63.8	56.1	37.5
maSigPro (0.5)	76.1	0.1	86.4	81.5	70.6
maSigPro (0.3)	86.9	0.5	92.8	90.6	83.2
FC (2.5)	0.6	0.2	1.2	0.8	0.5
FC (2)	3.4	1.4	6.6	4.3	2.6
FC (1.5)	42.1	3.5	58.7	55.7	28.6
FC (1.3)	90.0	8.5	90.7	97.5	82.4
FC (1.2)	98.6	19.7	88.6	99.8	97.9

The first three columns show the average power, FDR and F1 score for detecting DE genes in Sim I. Power within the strong and weak groups is further evaluated in columns 4 and 5. Averages are calculated over 100 Sim I simulations. The standard errors (not shown) for EBSeq-HMM, EBSeq, DESeq2, edgeR, voom and maSigPro (and in most cases FC) were .

Fig. 2.

Shown are two genes identified exclusively by EBSeq-HMM in Sim I data (upper) and in case study data (lower). The x-axis shows time points (upper) and positions on mouse limb (lower), and the y-axis shows median gene expression adjusted for library sizes

Note that although EBSeq-HMM has the highest empirical FDR among these five methods, it is still well-controlled under the 5% target FDR. In fact, among all approaches, the empirical FDR from EBSeq-HMM is closest to the target FDR. To better understand the overall performance of each method, the third column in Table 1 shows the F1 score. The F1 score measures a test’s accuracy accounting for both power and false discoveries, where an F1 score reaches its best value at 1 and worst at 0. EBSeq-HMM has the highest F1 score among all approaches. In addition, Table 1 shows that, consistent with other studies (Nueda ), the suggested threshold of maSigPro (R2 = 0.7) is conservative and provides lower power than EBSeq-HMM, EBSeq, DESeq2, edgeR and voom. The power is improved by relaxing the threshold, but is still lower than others. The FC analysis works best at threshold 1.3, but is still inferior to the other methods. Table 2 shows the power, FDR and F1 score for identifying DE genes (either dynamic or sporadic) in Sim II where, again, the target FDR is controlled at 5%. The increased power of EBSeq-HMM in identifying dynamic genes that was demonstrated in Sim I persists when sporadic genes are present, and EBSeq-HMM also shows advantage for identifying sporadic genes.
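For reference, the F1 score is the harmonic mean of precision and recall. The sketch below assumes recall equals power and precision equals 1 − FDR, which is the standard definition rather than a formula quoted from the paper. Because the tabulated values are averages over 100 simulations, a score computed from the averaged power and FDR need not match the table in the last digit.

```python
def f1_from_power_fdr(power: float, fdr: float) -> float:
    """Harmonic mean of precision (1 - FDR) and recall (power), both in [0, 1]."""
    precision = 1.0 - fdr
    recall = power
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# EBSeq-HMM row of Table 1: power 98.6%, FDR 4.3%.
print(round(100 * f1_from_power_fdr(0.986, 0.043), 1))
```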

Table 2.

Operating characteristics for identifying changes in Sim II

	Power (%)	FDR (%)	F1 score (%)	Power (strong) (%)	Power (weak) (%)	Power (sporadic) (%)
EBSeqHMM	94.5	4.5	95.0	99.7	97.4	86.4
EBSeq	81.4	0.1	89.7	93.9	86.1	64.2
DESeq2	84.1	0	91.4	95.2	89.3	67.9
edgeR	84.4	0	91.6	95.4	89.5	68.3
voom	83.2	0	90.8	95.0	88.7	65.9
maSigPro (0.7)	33.1	0	49.7	56.0	37.8	5.5
maSigPro (0.5)	56.8	0.1	72.4	81.6	70.6	18.2
maSigPro (0.3)	67.4	0.5	80.4	89.9	82.3	30.0
FC (2.5)	0.4	0.4	0.8	0.7	0.4	0.1
FC (2)	2.5	1.9	4.9	4.2	2.5	0.8
FC (1.5)	36.1	4.0	52.5	55.9	28.6	23.9
FC (1.3)	83.0	9.0	86.8	97.4	82.5	69.2
FC (1.2)	95.8	20.1	87.1	99.8	97.9	89.6

The first three columns show the average power, FDR and F1 score for detecting DE genes in Sim II. For dynamic genes, the power within the strong and weak groups is further evaluated in columns 4 and 5. Power within the sporadic group is evaluated in column 6. Averages are calculated over 100 Sim II simulations. The standard errors (not shown) for EBSeq-HMM, EBSeq, DESeq2, edgeR, voom and maSigPro (and in most cases FC) were .

In spite of this advantage, we note that all methods show reduced power for identifying sporadic genes. This is because in the simulation (and in our case study data upon which the simulation is based), the range of expression differences in sporadic genes is smaller, in general, than in dynamic genes. For example, consider dynamic genes having fold changes at each transition between 1.3 and 1.4. On average, for a dynamic gene that is monotonically increasing, the range in expression would be over all conditions (for a weak dynamic gene, the range would be ). However, in a sporadic gene, the range would be since only one condition differs from the others. In addition to identification of DE genes, we also evaluated the ability of EBSeq-HMM and FC to classify genes into distinct expression paths (EBSeq, DESeq2, edgeR, voom and maSigPro were not evaluated as they were not developed for this purpose; Section 2.6). Figure 3 shows results for eight dynamic paths simulated in Sim I; these eight were chosen as they contain the most genes among all simulated paths. The ground truth shows the number of genes simulated in each expression path. Also shown are the average number classified into each path by EBSeq-HMM and by FC analysis at FC threshold K = 1.2 and 1.3 (averages are calculated over 100 Sim I datasets). Correct classifications are shown in blue; incorrect are shown in red. For FC analysis, we chose 1.2 and 1.3 as they performed best among all thresholds considered. 
As shown, EBSeq-HMM identified more true positives than FC, while the FDR is well below 5%. Similar results were observed in Sim II data (Supplementary Fig. S2).
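To make the comparison of expression ranges between dynamic and sporadic genes concrete, the sketch below computes the cumulative fold change for a monotonically increasing gene. It assumes five time points (four transitions), as in the simulation design, and that per-transition fold changes compound multiplicatively; the resulting numbers are derived here for illustration and are not quoted from the paper.

```python
def cumulative_fc_range(fc_low: float, fc_high: float, n_changes: int):
    """Range of total fold change after n_changes compounding changes,
    each with a per-change fold change between fc_low and fc_high."""
    return fc_low ** n_changes, fc_high ** n_changes


strong_dynamic = cumulative_fc_range(1.3, 1.4, 4)   # about 2.86 to 3.84
weak_dynamic = cumulative_fc_range(1.2, 1.3, 4)     # about 2.07 to 2.86
sporadic = cumulative_fc_range(1.3, 1.4, 1)         # a single change: 1.3 to 1.4
print(strong_dynamic, weak_dynamic, sporadic)
```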

Fig. 3.

Shown are the number of genes (ground truth) simulated in Sim I as being in each of eight dynamic paths (these eight are shown as they contain the most genes among all simulated paths). Also shown are the average number classified into each path by EBSeq-HMM and by FC analysis at thresholds 1.2 and 1.3 (averages are calculated over 100 Sim I datasets). Correct classifications are shown in blue (first bar); incorrect are shown in red (second bar)

3.2 Case study results

An important problem in regenerative biology is understanding the connection between gene expression patterns and the positional identities of cells throughout development. Once humans and other mammals reach adulthood, they possess a very limited ability to regenerate body parts such as limb structures, and it has been hypothesized that a loss of positional identity information is at least partially responsible for the reduction in regenerative capacity. However, a few studies (Chang, 2009; Rinn ; Wang ) have demonstrated that some aspects of positional identity in mammals are retained into adulthood. Understanding the changes in gene expression across limb positions in mammals is an essential first step in gaining a better understanding of these processes. Toward this end, we conducted RNA-seq experiments to study gene expression changes over seven positions (proximal to distal) along the limbs of adult mice.

EBSeq-HMM, EBSeq, DESeq2, edgeR and voom identified 14 817, 12 825, 11 517, 9520 and 10 259 DE genes, respectively, at a 5% target FDR, and there is substantial overlap among the lists. Specifically, EBSeq-HMM identified over 90% of the genes identified by the other approaches. maSigPro identified 2479, 6919 and 10 727 DE genes using R² thresholds of 0.7, 0.5 and 0.3, respectively, and FC analyses identified 4225, 6500, 10 881, 14 016 and 15 877 genes at FC thresholds of 2.5, 2, 1.5, 1.3 and 1.2, respectively. These identifications showed substantially lower overlap with other methods. Given that the majority of genes identified by EBSeq, DESeq2, edgeR and voom are also identified by EBSeq-HMM, we focus initially on genes that are identified exclusively by EBSeq-HMM. Figure 2c and d shows two examples. As in the simulated data (Fig. 2a and b), these genes have subtle but consistent changes over the seven limb positions, again demonstrating that by accommodating dependence, EBSeq-HMM has increased power to identify genes showing relatively weak, but consistent, changes. Supplementary Figure S3 shows similar results for other genes identified exclusively by EBSeq-HMM.

Although the simulation and case study results suggest that EBSeq-HMM has increased power for identifying DE genes, the main advantage of EBSeq-HMM over other approaches is its ability to classify genes into particular expression paths. To illustrate, we consider Hox genes, a set of genes that are of primary interest here as they are well known to play an important role in maintaining positional identity in adult cells (Rinn ; Wang ). In our case study data, 33 out of 39 Hox genes were identified as DE by EBSeq-HMM. Figure 4 shows expression levels of the 33 genes along with their most likely expression paths. Although the positional changes for most Hox genes are not well known, it is known that Hoxb4 and Hoxb8 have up-regulated expression in proximal sites (Rinn ; Wang ). The EBSeq-HMM paths for these genes are consistent with these prior studies and provide further information as they characterize changes across the seven positions. In addition, the overall pattern of Hox gene expression found here demonstrates that, in general, higher numbered Hox genes are up-regulated distally and lower numbered Hox genes are up-regulated proximally. This is in agreement with existing data and models of proximal-distal patterning of the limb (Zakany and Duboule, 2007).

Fig. 4.

Shown are median expression levels of 33 Hox genes identified as DE by EBSeq-HMM. The expression values were adjusted for library size and further scaled to mean 0 and standard deviation 1 for each gene; median expression over three replicates is shown. Genes were clustered via hierarchical clustering using Euclidean distance and complete linkage. The x-axis shows seven positions over the mouse limb.

To explore other genes beyond the Hox family that may be involved in positional identity, we considered 2347 genes that are classified by EBSeq-HMM into one of 64 possible dynamic paths. Among the 64 clusters formed by these dynamic genes, the two largest are Up-Down-Up-Down-Down-Down (827) and Down-Up-Down-Up-Up-Up (218). Figure 5a and b shows median expression of each position for each of these genes. As these groups each contain Hox genes but also previously unknown genes showing similar dynamics across position, the novel identifications define candidates for further study.
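The count of 64 dynamic paths follows from the design: seven ordered positions give six adjacent transitions, and a dynamic path is Up or Down at every transition (2^6 = 64). The short Python sketch below enumerates all paths under that reading and labels each as constant, sporadic or dynamic. It is an illustration only; the function names are hypothetical and are not part of the EBSeqHMM package.

```python
from itertools import product

def enumerate_paths(n_conditions, states=("Up", "Down", "EE")):
    """Return every expression path as a tuple with one state per adjacent transition."""
    return list(product(states, repeat=n_conditions - 1))

def path_type(path):
    """Label a path using the three broad types defined in the Introduction."""
    if all(state == "EE" for state in path):
        return "constant"   # unchanged over all conditions
    if all(state != "EE" for state in path):
        return "dynamic"    # changes at every transition
    return "sporadic"       # some transitions change, others do not

paths = enumerate_paths(7)  # seven limb positions, six transitions
counts = {}
for path in paths:
    label = path_type(path)
    counts[label] = counts.get(label, 0) + 1

print(len(paths))           # 3**6 = 729 paths in total
print(counts["dynamic"])    # 2**6 = 64 dynamic paths
```

Of the 729 possible paths, one is constant, 64 are dynamic and the remaining 664 are sporadic.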

Fig. 5.

(a), (b) Shown are genes classified as following an Up-Down-Up-Down-Down-Down (left panel, 827 genes) or Down-Up-Down-Up-Up-Up (right panel, 218 genes) expression path in the case study data. Each line indicates one gene. The x-axis shows seven positions over the mouse limb; the y-axis shows median scaled expression within each position

4 Discussion

We have developed an approach called EBSeq-HMM for analysis of ordered RNA-seq experiments. EBSeq-HMM may be used to identify genes that are DE across a set of ordered conditions and to classify genes into their most likely expression paths. A number of methods are available for identifying DE genes when data from multiple conditions are available. EBSeq-HMM has two main advantages over these approaches. First, it accommodates dependence across ordered conditions and consequently has increased power to identify genes showing subtle, yet consistent, changes. Second, for every gene, EBSeq-HMM calculates the gene-specific PP associated with each possible expression path and in doing so allows genes to be classified into distinct expression paths with a pre-specified FDR. Put another way, EBSeq-HMM not only identifies genes that change across conditions, but can also be used to specify how they change.

Simulations demonstrated the power of EBSeq-HMM over other approaches to identify DE genes. In particular, results showed that DESeq2, edgeR and voom perform well when trends and/or changes are relatively strong, but that EBSeq-HMM has increased power to identify genes showing weaker changes. EBSeq-HMM also worked well for identifying genes showing sporadic changes (where there is no dependence across ordered conditions, as for some genes in Sim II). Applying maSigPro-GLM with its default cutoff for calling DE genes gave substantially lower power than the other approaches. Relaxing the cutoff improved its power, but it was still inferior to the others.

In addition to DE gene identification, EBSeq-HMM performed well for classifying genes into expression paths. We defined a gene as being in a particular path if the gene was classified as DE at FDR 5% (PP of EE was less than 0.05) and the PP of being in that path exceeded 0.5. Given the two-step process, observed mis-classification rates were conservatively controlled.
Note that in some cases, a DE gene may not be classified into any particular path. For example, if the last time point of a four-condition experiment is known to be noisy, a gene that is initially increasing may have equal PP, say one-third, of being Up-Up-Up, Up-Up-EE and Up-Up-Down. This gene would be called DE at 5% FDR because PP(EE-EE-EE) < 0.05, but it would not be assigned to a particular expression path if a threshold of 0.5 were used. In some cases, a user may want to modify these thresholds. If a false negative classification were considered more serious than a false positive, the path threshold could be adjusted. Motivation for doing so under varying loss functions is discussed in Berger (1985).
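The two-step rule described above can be sketched in a few lines of Python. This is a simplified illustration of the decision logic as stated in the text (call a gene DE when the PP of the all-EE path is below 0.05, then assign a path only if its PP exceeds 0.5). The function name, argument names and example PP values are hypothetical and do not reproduce the EBSeqHMM package interface.

```python
def classify_gene(path_pp, ee_path, de_cutoff=0.05, path_cutoff=0.5):
    """Apply the two-step rule to one gene.

    path_pp: dict mapping a path label to its posterior probability (PP).
    ee_path: label of the constant (all-EE) path.
    Returns (call, path), where path is None if no path passes path_cutoff.
    """
    # Step 1: the gene is DE only if the PP of the constant path is small.
    if path_pp.get(ee_path, 0.0) >= de_cutoff:
        return ("not DE", None)

    # Step 2: assign the most likely non-constant path if its PP is large enough.
    candidates = {p: pp for p, pp in path_pp.items() if p != ee_path}
    if not candidates:
        return ("DE", None)
    best_path = max(candidates, key=candidates.get)
    if candidates[best_path] > path_cutoff:
        return ("DE", best_path)
    return ("DE", None)

# Hypothetical PPs for the noisy-last-time-point example in the text.
noisy_gene = {"Up-Up-Up": 0.33, "Up-Up-EE": 0.33, "Up-Up-Down": 0.33, "EE-EE-EE": 0.01}
# Hypothetical PPs for a gene with one clearly dominant path.
clear_gene = {"Up-Up-Up": 0.90, "Up-Up-EE": 0.08, "EE-EE-EE": 0.02}

print(classify_gene(noisy_gene, "EE-EE-EE"))  # ('DE', None)
print(classify_gene(clear_gene, "EE-EE-EE"))  # ('DE', 'Up-Up-Up')
```

Lowering `path_cutoff` in this sketch would assign the noisy gene to a path, at the cost of more mis-classifications, which is the trade-off a user weighs when modifying the thresholds.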

5 Implementation

EBSeq-HMM is implemented as an R package (EBSeqHMM), currently available at Bioconductor: www.bioconductor.org/packages/devel/bioc/html/EBSeqHMM.html. EBSeq-HMM requires estimates of gene or isoform expression, but is not specific to any particular estimation method. To estimate library sizes, EBSeq-HMM defaults to median-of-ratios normalization (Anders and Huber, 2010); TMM (Robinson and Oshlack, 2010) and Upper Quartile Normalization (Bullard ) are also available in the package. Like most methods, EBSeq-HMM makes assumptions regarding the distribution governing expression measurements. Consequently, poor performance may result if there are strong departures from these assumptions. Model diagnostics are implemented in EBSeq-HMM to ensure that assumptions can be easily checked. They should be considered with each application and results should not be used if serious departures from model assumptions are observed. A typical diagnostic summary for the case study data is shown in Supplementary Figure S4.
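For readers unfamiliar with median-of-ratios normalization, the idea can be sketched as follows: for each sample, take the ratio of every gene's count to that gene's geometric mean across samples, and use the median of those ratios as the library size factor. The Python sketch below is a minimal, standard-library illustration under the assumption that genes with a zero count in any sample are skipped. It is not the package's own code, and the function name is hypothetical.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample from a genes-by-samples count matrix.

    counts: list of rows, one per gene, each a list of raw counts per sample.
    """
    n_samples = len(counts[0])

    # Keep genes with positive counts in all samples so the geometric mean is positive.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has positive counts in every sample")

    # Log of each usable gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]

    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count in sample j to its geometric mean.
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy data: sample 2 is sequenced twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],   # skipped: zero count in sample 1
]
size_factors = median_of_ratios_size_factors(toy_counts)
print(size_factors[1] / size_factors[0])  # approximately 2
```

Dividing each sample's counts by its size factor puts the samples on a common scale before expression is compared across conditions.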

25 in total

1. Analysis techniques for microarray time-series data.

Authors: Vladimir Filkov; Steven Skiena; Jizu Zhi
Journal: J Comput Biol Date: 2002 Impact factor: 1.479

2. Clustering of time-course gene expression data using a mixed-effects model with B-splines.

Authors: Yihui Luan; Hongzhe Li
Journal: Bioinformatics Date: 2003-03-01 Impact factor: 6.937

3. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.

Authors: Ning Leng; John A Dawson; James A Thomson; Victor Ruotti; Anna I Rissman; Bart M G Smits; Jill D Haag; Michael N Gould; Ron M Stewart; Christina Kendziorski
Journal: Bioinformatics Date: 2013-02-21 Impact factor: 6.937

4. Differential analysis of gene regulation at transcript resolution with RNA-seq.

Authors: Cole Trapnell; David G Hendrickson; Martin Sauvageau; Loyal Goff; John L Rinn; Lior Pachter
Journal: Nat Biotechnol Date: 2012-12-09 Impact factor: 54.908

5. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.

Authors: Bo Li; Colin N Dewey
Journal: BMC Bioinformatics Date: 2011-08-04 Impact factor: 3.307

6. Fast computation and applications of genome mappability.

Authors: Thomas Derrien; Jordi Estellé; Santiago Marco Sola; David G Knowles; Emanuele Raineri; Roderic Guigó; Paolo Ribeca
Journal: PLoS One Date: 2012-01-19 Impact factor: 3.240

7. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.

Authors: Charity W Law; Yunshun Chen; Wei Shi; Gordon K Smyth
Journal: Genome Biol Date: 2014-02-03 Impact factor: 13.583

8. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.

Authors: Michael I Love; Wolfgang Huber; Simon Anders
Journal: Genome Biol Date: 2014 Impact factor: 13.583

9. rSeqDiff: detecting differential isoform expression from RNA-Seq data using hierarchical likelihood ratio test.

Authors: Yang Shi; Hui Jiang
Journal: PLoS One Date: 2013-11-18 Impact factor: 3.240

10. Next maSigPro: updating maSigPro bioconductor package for RNA-seq time series.

Authors: María José Nueda; Sonia Tarazona; Ana Conesa
Journal: Bioinformatics Date: 2014-06-03 Impact factor: 6.937
