Literature DB >> 24049071

DEXUS: identifying differential expression in RNA-Seq studies with unknown conditions.

Günter Klambauer¹, Thomas Unterthiner, Sepp Hochreiter.

Abstract

Detection of differential expression in RNA-Seq data is currently limited to studies in which two or more sample conditions are known a priori. However, these biological conditions are typically unknown in cohort, cross-sectional and nonrandomized controlled studies such as the HapMap, the ENCODE or the 1000 Genomes project. We present DEXUS for detecting differential expression in RNA-Seq data for which the sample conditions are unknown. DEXUS models read counts as a finite mixture of negative binomial distributions in which each mixture component corresponds to a condition. A transcript is considered differentially expressed if modeling of its read counts requires more than one condition. DEXUS decomposes read count variation into variation due to noise and variation due to differential expression. Evidence of differential expression is measured by the informative/noninformative (I/NI) value, which allows differentially expressed transcripts to be extracted at a desired specificity (significance level) or sensitivity (power). DEXUS performed excellently in identifying differentially expressed transcripts in data with unknown conditions. On 2400 simulated data sets, I/NI value thresholds of 0.025, 0.05 and 0.1 yielded average specificities of 92, 97 and 99% at sensitivities of 76, 61 and 38%, respectively. On real-world data sets, DEXUS was able to detect differentially expressed transcripts related to sex, species, tissue, structural variants or quantitative trait loci. The DEXUS R package is publicly available from Bioconductor and the scripts for all experiments are available at http://www.bioinf.jku.at/software/dexus/.

Entities: Chemical Disease Gene Mutation Species

Mesh：

Year: 2013 PMID： 24049071 PMCID： PMC3834838 DOI： 10.1093/nar/gkt834

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

The advent of next-generation sequencing has greatly expanded our knowledge about transcriptomes. New transcripts and splice variants have been found and break points of known transcripts determined more accurately (1–6). However, in RNA-Seq experiments, quantification of the expression of transcripts can be difficult (7). Without biological variability, transcripts that are differentially expressed between two conditions can be detected reliably (8). In studies with biological variability, however, detection of differential expression between two conditions remains challenging (9). A transcript that is differentially expressed between many conditions is hard to detect because read count variation due to differential expression and due to high overdispersion can only be distinguished with many samples and high coverage. See Supplementary Section S2 for more details. To detect differentially expressed transcripts, we therefore assume that the number of conditions is small compared with the number of samples.

Identifying differential expression is currently limited to particular study designs

Current methods for analyzing RNA-Seq data can identify differential expression between two conditions. For example, in a case-control study, only transcripts that are differentially expressed between cases and controls can be identified. Similarly, in a randomized controlled study, differential expression between treated and untreated subjects can be detected. These study designs can be generalized to more case groups or more treatments, which leads to multiple (more than two) known conditions. For example, multiple conditions may be due to different tissue types, as in the ‘Allen Brain Atlas’ (10), the ‘Gene Expression Nervous System Atlas’ (11), and the ‘BioGPS’ (12). Identification of differential expression in RNA-Seq data requires a priori known conditions. In cohort, cross-sectional and nonrandomized controlled studies, the biological conditions are unknown or only partially known. Cohort and cross-sectional studies are observational studies in which the conditions of the subjects are unknown. Examples of observational studies include the HapMap (13), ENCODE (6) and the 1000 Genomes (14) project, for which RNA-Seq data are available (15,16). Nonrandomized controlled studies are treatment studies in which conditions such as genetic, environmental or treatment effects are not completely known. In nonrandomized controlled studies, unknown genetic variations such as single-nucleotide polymorphisms (SNPs), copy number variations and unknown environmental factors may result in differential expression between treated subjects. Furthermore, individual unknown treatment effects may cause variation in gene expression, for instance, responses of cell lines to the addition of compounds (17). Other examples are found in oncology, where unknown cancer subtypes or unknown cancer stages are characterized by a particular gene expression profile (18,19). In nonclinical studies, the conditions are also often unknown. During development, the transcriptome regulates and controls cell growth, differentiation, movement and morphogenesis. Genes are differentially expressed between different time points and between different tissues; even within one tissue, gene expression may vary spatially. For two samples taken at different times or from different locations it is often unknown whether the conditions differ. Another example is in vivo or in vitro gene expression in mice treated with drug candidates (20,21). Unknown factors such as individual responses or side effects lead to differentially expressed transcripts between the samples. The detection of differential expression in RNA-Seq studies with unknown conditions is important to obtain new biological knowledge. Current RNA-Seq methods, however, require the conditions to be known. For microarray data, a method for identifying unknown conditions in gene expression has been suggested (22). However, this method cannot be applied to RNA-Seq data with unknown conditions because a primary modeled factor is required and the noise is assumed to be Gaussian, which is not appropriate for RNA-Seq count data (23). We therefore present DEXUS, a method capable of detecting differential expression in RNA-Seq studies with unknown conditions. A summary of study designs and methods that can detect differential expression in them is shown in Table 1.

Table 1.

An overview of study designs and methods that can detect differential expression in them

Study design	DEXUS	DESeq	edgeR	baySeq	SAMSeq	DEGSeq	PoissonSeq	NOISeq	DSS
Two known conditions
Case-control study	✓	✓	✓	✓	✓	✓	✓	✓	✓
Randomized controlled study	✓	✓	✓	✓	✓	✓	✓	✓	✓
Multiple known conditions
Multiple case-control study	✓	✓	✓	✓	✓
Multiple treatment RCS	✓	✓	✓	✓	✓
Unknown conditions
Cross-sectional study	✓
Cohort study	✓
Nonrandomized controlled study	✓

Alongside DEXUS, we included DESeq (24), edgeR (25), baySeq (26), SAMSeq (27), DEGSeq (28), PoissonSeq (29), NOISeq (30) and DSS (31).

An overview of study designs and methods that can detect differential expression in them Alongside DEXUS, we included DESeq (24), edgeR (25), baySeq (26), SAMSeq (27), DEGSeq (28), PoissonSeq (29), NOISeq (30) and DSS (31).

Existing methods for detecting differential expression in RNA-Seq data

Methods that detect differential expression in RNA-Seq data are usually based on read counts, i.e. the number of reads mapping to a DNA region that is transcribed, such as a gene or an exon (32). These methods compare read counts for two conditions. If read counts show a large and consistent difference between the conditions, then the according transcript is differentially expressed. In this subsection, we review methods that detect differential expression in RNA-Seq data. Many methods model read counts by a negative binomial distribution because even after normalization the read counts have high variance. Therefore, we divide methods into two classes: those which do not use negative binomials (class A) and those which do (class B). The following methods belong to class A. DEGSeq (28) assumes that the log fold change of mean read counts between the two conditions follows a normal distribution given the log average expression. A differentially expressed gene is identified by a small P-value by means of this distribution. NOISeq (30) also considers the log fold change of read counts between two given conditions together with their absolute difference. Empirical distributions are calculated using all pairs of replicates from different conditions. NOISeq identifies a gene as differentially expressed if the log fold change of read counts and the absolute difference of read counts between the two conditions have both a small P-value for the empirical distributions. SAMSeq (27) performs a Wilcoxon test for each transcript testing the counts of one condition against the counts of the other. Because standard normalization techniques are not applicable, subsampling is used to normalize the read counts. SAMSeq requires a relatively high number of samples per condition to obtain significance for differential expression. PoissonSeq (29) fits a Poisson log-linear model to the read counts after transforming them. A score statistic on the model parameters determines the significance for differential expression. The following class B methods use negative binomial distributions to model the read counts. edgeR (25) uses a quantile-adjusted conditional maximum likelihood estimator for the overdispersion parameter of the negative binomial distribution. This estimator is more accurate than the standard maximum likelihood estimator when only few replicates per condition are available (33). Borrowing information across transcripts allows the dispersion parameter to be adjusted toward a consensus value using an empirical Bayes procedure (34). Finally, edgeR uses an exact test to determine whether the counts of the two conditions come from the same negative binomial distribution. DESeq (24) pools together transcripts with similar expressions values to improve the estimate of the overdispersion parameter. The overdispersion is assumed to be a function of the mean read count and is therefore estimated per condition. To determine whether a transcript is differentially expressed, the distribution parameters of the two conditions are tested by an exact test for equality of means. baySeq (26) determines the distribution of the overdispersion parameter by applying a quasi-likelihood method to the read counts of one condition. The resulting distribution is used as prior for estimating the overdispersion parameter when fitting the model to the read count data. DSS (31) is similar to baySeq. A negative binomial distribution is fitted to the read count data using a prior on the overdispersion parameter. This prior is a log-normal distribution, whose parameters are optimized using the dispersion parameters of each condition. Finally, a Wald test is used to determine differential expression. In summary, the class B methods, which use negative binomial distributions, i.e. DESeq, baySeq, DSS and edgeR, mainly differ in the way they estimate the overdispersion parameter. Estimating the overdispersion parameter is crucial for the performance and not trivial because the maximum likelihood estimator is biased and has high variance if the sample size is small (33). The subsequent statistical test has a smaller effect on the results than the parameter estimates (23,31).

Extensions to multiple known conditions

McCarthy et al. (32) extended the R package edgeR to more than two conditions. A generalized linear model is fitted to the data, and then coefficients are tested for being different from zero, which leads to the final P-values. Again, the estimation of the overdispersion parameter for a transcript borrows information from other transcripts. DESeq, baySeq and SAMSeq have also been extended to more than two conditions.

MATERIALS AND METHODS

Method overview

Our goal is to identify differentially expressed transcripts in studies with unknown conditions. A transcript is differentially expressed if the mean expression levels for different conditions are different and read counts are observed under more than one condition. Therefore we assume a small number of conditions because, as mentioned above, the detection of differential expression for many conditions is difficult. RNA-Seq expression data are usually represented as read counts per transcript, or alternatively by exon or gene. It was observed that read counts from a single condition follow a negative binomial distribution (24–26,31). DEXUS therefore models read counts as a finite mixture of negative binomial distributions. The model that best explains the observed read counts is selected from a set of models. In a Bayesian framework, model selection is based on finding the parameter which maximizes the posterior, the maximum a posteriori (MAP) parameter. The MAP model is found by an expectation maximization (EM) algorithm, where E-step and M-step are alternated repeatedly. The E-step estimates the unknown conditions based on actual model parameters, and the M-step optimizes the model parameters based on the E-step estimates. Models that use only one condition to explain the read counts are preferred by means of a prior distribution. One condition is the null hypothesis, which is rejected only if the data show strong evidence for more than one condition. Therefore, the parameters of the prior distribution determine how much DEXUS prefers to select models that explain the data without differential expression. Consequently, via the prior parameters, DEXUS can be adjusted to have a low false discovery rate at the detection of differential expression. In the following subsections, we first describe the model in more detail and then explain the EM algorithm for model selection. Model selection includes prior assumptions that lower the false discovery rate and lead to more accurate estimates. Finally, we show how to call differentially expressed transcripts on the basis of an informative/noninformative (I/NI) value.

The model

Read count x per transcript is explained by a mixture of n negative binomial distributions: where is the probability of being in condition i out of n possible conditions. In condition i, read counts are drawn from a negative binomial distribution with mean and size r, where the size parameter r is the inverse of the overdispersion . Note that we use the instead of the usual parametrization to locally accumulate parameters that are associated with large overdispersions. This accumulation is essential to define a prior within a Bayesian framework. A nondegenerate DEXUS model is identifiable (see Supplementary Section S3.1.3), as required for the maximum likelihood and the maximum a posterior estimator to be consistent. Consistency means that the estimator converges to the true parameter values with more data points, which is important for identifying differential expression. If the mean read count exceeds the variance, the maximum likelihood estimate of r tends to and the negative binomial converges to a Poisson distribution (see Supplementary Section S3.2.2).

Model selection

We perform model selection in a Bayesian framework by maximizing the posterior, i.e. by a MAP approach (35–37). Therefore, the parameters , and are considered to be random variables, and the likelihood p(x) in Equation (1) becomes the conditional probability . The objective of the model selection is to maximize the posterior of the parameters: where the priors on and are assumed to be independent of each other, and are defined in the following.

Dirichlet prior for probabilities of conditions

First we choose the prior on the probabilities of the conditions. Since the majority of transcripts in a data set are usually not differentially expressed, the model should favor explaining the read counts for a transcript with a single condition. The null hypothesis of one condition should only be rejected if the data contain strong evidence for more than one condition. The prior reduces the number of falsely discovered differentially expressed transcripts and therefore keeps the false discovery rate low. DEXUS uses a Dirichlet prior on with parameters to incorporate the preference for only one condition: where is an n-dimensional probability vector. Each component is distributed according to a beta distribution with . To express the prior knowledge that most transcripts are not differentially expressed and are generated under only one condition, we set (for ). This setting assumes that most read counts are generated under condition i = 1, which we call the major condition, while conditions are called minor conditions. The vector of hyperparameters can be simplified to one hyperparameter G (Supplementary Section S3.2.1). In Supplementary Section S3.4 we show that DEXUS is not sensitive to the choice of the hyperparameter G. Therefore DEXUS is easy to use as good results are obtained with the default setting of G = 1 (see Supplementary Section S3.4). Without having seen the data, we assume that only the major condition is present, which means that the transcript is not differentially expressed. Only when the data show strong evidence also for minor conditions, does the posterior assign nonzero probabilities to minor conditions and the transcript is called differentially expressed.

Truncated exponential priors for overdispersions

In DEXUS model selection, the second prior is on the size parameter r of the negative binomial distribution, which determines the overdispersion. A prior on r improves the estimation of r if the number of samples is small. The maximum likelihood estimator of r is biased for few samples and overestimates the true size parameter (38,39), as confirmed in Supplementary Section S3.2.5. In a Bayesian approach, the influence of the prior decreases with an increasing number of samples, and therefore the MAP estimator is asymptotically (number of samples tending to infinity) unbiased. To keep the estimate of r small, the prior pushes r toward zero. We choose an exponential distribution as prior: where η is a hyperparameter. Like DESeq (24), we truncate the size parameter at the right-hand tail by using the constraint . The upper bound on the size parameter is equivalent to a lower bound on the overdispersion and ensures minimal overdispersion for the read counts of each transcript. Further, this bound makes the parameter space compact, which is required for a consistent estimator. The same exponential prior is used for each component of . The hyperparameter η for the exponential prior on r is transformed to a hyperparameter θ (see Supplementary Section S3.2.5). Like the hyperparameter G, also θ is robust and supplies good results with .

Uniform priors for means

Finally, DEXUS model selection uses a prior on the mean μ of the negative binomial distribution. If in one condition all read counts were close to zero (transcripts are not present), the estimate of the mean of the negative binomial would not converge. Therefore, is lower bounded by . To ensure a compact parameter space as required for a consistent estimator, we use a uniform prior on on the compact interval , where can be set to the largest observed read count. In summary, DEXUS has only few parameters which in most applications need not be adjusted by the user, as their default values give good results.

EM algorithm

With the priors defined, the model with maximum parameter posterior can be selected. The EM algorithm (40) is used to minimize an upper bound on the negative log-posterior of the parameters. The E-step of the EM algorithm estimates the probability that a read count is generated under a particular condition. The M-step optimizes the model parameters based on the E-step estimates. In the DEXUS model, is the probability of condition i without observing any data. The model posterior estimates the probability that read count x is generated under condition i (the probability of condition i after observing data x): This equation is the E-step (expectation step) of the EM algorithm. Using the posterior estimates , we obtain following update rules for the M-step (maximization step): The complete derivation of the EM algorithm can be found in the Supplementary Section S3.2.1. estimate for : update: update: The new r is obtained by solving the following equation for r: where ψ is the digamma function. The equation is solved numerically for r by means of a ‘bisection’ procedure. update: and r are initialized by using the results of k-means clustering (see Supplementary Section S3.2.4). The values are simply initialized with the maximum entropy setting .

I/NI value: evidence of differential expression

The Bayesian framework allows definition of an I/NI call (36,37,41,42). The I/NI call serves to extract differentially expressed transcripts with a desired specificity (1 − significance level or 1 − type I error rate) or sensitivity (power or 1 – type II error rate). DEXUS first computes the I/NI value, which quantifies the contribution of differential expression to the read count variation. Transcripts are then called informative if the I/NI value exceeds a threshold. Unlike or r, which capture noise variation, captures variation arising from differentially expressed transcripts. The posterior of indicates differential expression in the data in the form of minor conditions with probabilities larger than zero. The larger the posterior value of a minor condition , the stronger the evidence for its presence. Further, evidence is also required that the minor condition is different from the major condition in terms of mean read counts. Although identifiability of the DEXUS model ensures that the negative binomials of different conditions are different, they may still be close to one another. The more the mean of the minor condition differs from the mean of the major condition, the stronger is the evidence that the minor condition is different from the major condition. In conclusion, evaluating the evidence of differential expression (the I/NI value) should consider two factors: (i) as the evidence for the presence of the minor condition ; (ii) the difference between the means of the major and minor conditions as evidence that they are indeed different. The difference between the means is expressed by the log difference . Factor (I) is incorporated into the I/NI value by weighting these differences by , which yields The I/NI value is the expected log fold change of read counts with respect to the mean read count of the major condition given a noise-free model. ‘Noise-free’ refers to the assumption that under each condition, only the mean read count is observed. For a mathematical interpretation of the I/NI value see Supplementary Section S3.3.2.

Experiments

We evaluated DEXUS on simulated and real-world data sets. The simulated data sets had various library sizes, different numbers of replicates and different ratios between mean read counts under the different conditions. DEXUS was tested on the following real-world RNA-Seq data sets: (i) ‘Nigerian HapMap’, based on 69 Nigerian HapMap individuals, (ii) ‘European HapMap’, based on 60 European HapMap individuals, (iii) ‘Primate Liver’, based on liver tissue samples from humans, chimpanzees and rhesus macaques, (iv) ‘Maize Leaves’, using samples from different locations of maize plant leaves, and (v) ‘Mice Strains’, based on different strains of mice (Supplementary Section S4.2.4). First we report the performance of DEXUS on 2400 simulated data sets for which the conditions were known but withheld from DEXUS. We then present tests on real-world data sets with either unknown conditions (‘Nigerian HapMap’, ‘European HapMap’) or partially known conditions (‘Primate Liver’, ‘Maize Leaves’). In the latter case the conditions were withheld from DEXUS to ascertain whether it can identify them.

DEXUS for known conditions

Before we tested DEXUS on data with unknown conditions, we assessed how well it performs if the conditions of interest are known. For known conditions, DEXUS estimates only the parameters of a negative binomial for each condition. Therefore, we compared the parameter estimates of DEXUS to previously suggested methods in terms of detecting differentially expressed transcripts, namely the following eight state-of-the-art methods: DSS (31), DESeq (24), baySeq (26), edgeR (25), DEGseq (28), NOISeq (30), PoissonSeq (29) and SAMseq (27). If only few samples per condition are available, the performance of DEXUS is below the best performing other methods. For medium and large sample numbers and small library size (1e6) DEXUS is second and third best method. For medium and large sample numbers and large library sizes (1e7 and 1e8) DEXUS outperforms all other methods. The experiments and the respective results are described in detail in Supplementary Section S4.2.

Simulated RNA-Seq data

Generating simulated RNA-Seq data

We simulated data sets from different experimental settings following the suggestions of Robinson et al. (34), Hardcaste and Kelly (26) and Wu et al. (31). For all samples of a data set, the library size was 1e6, 1e7 or 1e8 to cover a wide range of applications. Keeping the library size and the read quality constant for each sample in a data set avoids the need for normalization of the read counts, i.e. it avoids normalization biases. For each experiment, we choose a particular number of replicates per condition to evaluate DEXUS for different sample sizes and for unbalanced data. In case of two conditions, the numbers of replicates were 6/6, 9/3, 10/2, 11/1, 12/12, 18/6, 20/4, 22/2 (condition1/condition2). Each experiment consisted of 100 data sets with 10 000 transcripts each. The conditions were known and used for evaluation but withheld from DEXUS. For the simulation we assumed that under condition i the reads x for a transcript are distributed according to a negative binomial . After Wu et al. (31), we took the mean and the size r from the ‘Mice Strains’ benchmark RNA-Seq data set (43) using only data from one particular biological condition. For a randomly selected transcript, the value was obtained as the median read count of the condition. The overdispersion tends to decrease with increasing mean read counts (see Supplementary Figure S15). Therefore, we fitted a regression line to overdispersion values by least squares. After sampling values, the corresponding values were obtained by means of the regression line to which zero-one normally distributed noise was added. Thirty percent of the genes were chosen to be differentially expressed. Differential expression was modeled by adjusting the means of the negative binomials to obtain log fold changes of 0.5, 1 and 1.5 between the mean of the major and the minor condition. The fold change values are randomly chosen with equal probability, such that all 3-fold change categories have about the same number of genes in each data set.

Evaluation criteria for simulated RNA-Seq data

We formulate the detection of differential expression as a classification task: DEXUS must decide whether a transcript is differentially expressed (positive prediction) or not (negative prediction). For the simulated data, we knew which transcripts were differentially expressed (the positives) and which were not (the negatives). DEXUS ranks the transcripts by the I/NI value from Equation (10). For a given I/NI threshold (the I/NI call), we can determine true positives, false positives, true negatives and false negatives. Using these numbers, we computed the specificity and the sensitivity of DEXUS. The specificity corresponds to ‘1 − significance level’ or ‘1 − type I error rate’. The type I error rate is the ratio between false detections and all negatives. The sensitivity corresponds to the ‘power’ or ‘1 − type II error rate’. The type II error rate is the ratio between missed positives and all positives.

Results on simulated RNA-Seq data

We tested DEXUS on the simulated RNA-Seq data using its default parameters. Table 2 shows the results in terms of sensitivity and specificity for library size 1e8 at different thresholds for the I/NI value. Transcripts with an I/NI value above the threshold are called informative or (equivalently) differentially expressed. Results for other library sizes are presented in Supplementary Tables S12 and S13. The specificity of DEXUS is high across various numbers of replicates, whereas the sensitivity varies considerably. High specificity means that few transcripts are falsely identified as being differentially expressed. In highly unbalanced experiments, i.e. 11/1 and 22/2 replicates, differentially expressed transcripts are detected only at low I/NI thresholds of 0.025 and 0.05. Note that the minor condition i = 2 (smaller subgroup) leads to a small and therefore to a small I/NI value. For unbalanced data, the few minor condition samples must be distinguished from random outliers of the major condition.

Table 2.

The performance of DEXUS in terms of sensitivity and specificity in detecting differential expression with unknown conditions

I/NI threshold	0.025		0.05		0.1
C1/C2	Specificity	Sensitivity	Specificity	Sensitivity	Specificity	Sensitivity
6/6	0.893 ± 0.003	0.775 ± 0.009	0.951 ± 0.002	0.720 ± 0.009	0.985 ± 0.002	0.646 ± 0.009
9/3	0.893 ± 0.004	0.827 ± 0.006	0.951 ± 0.002	0.766 ± 0.007	0.985 ± 0.001	0.580 ± 0.008
10/2	0.893 ± 0.003	0.819 ± 0.008	0.950 ± 0.002	0.656 ± 0.009	0.985 ± 0.001	0.325 ± 0.009
11/1	0.893 ± 0.003	0.677 ± 0.009	0.951 ± 0.002	0.351 ± 0.008	0.985 ± 0.001	0.020 ± 0.003
12/12	0.945 ± 0.002	0.735 ± 0.008	0.982 ± 0.001	0.665 ± 0.008	0.996 ± 0.001	0.610 ± 0.009
18/6	0.945 ± 0.003	0.816 ± 0.008	0.982 ± 0.002	0.743 ± 0.009	0.996 ± 0.001	0.570 ± 0.011
20/4	0.945 ± 0.003	0.810 ± 0.008	0.982 ± 0.002	0.625 ± 0.009	0.996 ± 0.001	0.308 ± 0.009
22/2	0.946 ± 0.002	0.650 ± 0.009	0.982 ± 0.001	0.325 ± 0.008	0.996 ± 0.001	0.006 ± 0.002
Mean	0.919 ± 0.028	0.764 ± 0.069	0.966 ± 0.017	0.606 ± 0.172	0.991 ± 0.006	0.383 ± 0.261

The first column ‘C1/C2’ contains the numbers of replicates for the first and second condition. The other columns list sensitivity and specificity (with standard deviations) of DEXUS at different I/NI thresholds as the average across 100 data sets. The last row (‘Mean’) gives the average of the results in the columns. The library size was 1e8 for all experiments.

The performance of DEXUS in terms of sensitivity and specificity in detecting differential expression with unknown conditions The first column ‘C1/C2’ contains the numbers of replicates for the first and second condition. The other columns list sensitivity and specificity (with standard deviations) of DEXUS at different I/NI thresholds as the average across 100 data sets. The last row (‘Mean’) gives the average of the results in the columns. The library size was 1e8 for all experiments.

Real-world RNA-Seq data

‘Nigerian HapMap’

Pickrell et al. (16) sequenced RNA from 69 Nigerian HapMap individuals to study expression quantitative trait loci (eQTLs). The read count data were provided by the ReCount repository (44). As in previous experiments, DEXUS was applied to these data with its default parameters and ranked genes according to the I/NI value. The read counts of top-ranked genes and the conditions identified by DEXUS are visualized as a heatmap in Figure 1.

Figure 1.

Heatmap of the normalized read counts of the 12 genes with the largest I/NI values for the ‘Nigerian HapMap’ data set. Colors range from white for low expression to blue for high expression. Different individuals are denoted along the x-axis, while the top-ranked genes are denoted by their gene symbols along the y-axis. Red crosses indicate samples that belong to the minor condition. At the right side of the heatmap, each gene is annotated by the minimum (‘>’), the median of two conditions (‘m1’ and ‘m2’) and the maximum (‘<’) read count. Five out of the 12 top-ranked genes are located on the Y chromosome (RPS4Y1, CYorf15A, EIF1AY, TMSB4Y, RPS4Y2). For these genes, the identified conditions are related to the sex. For four of the 12 top-ranked genes, at least one eQTL is known. For ZFP57, the associated eQTL is the SNP rs1736924 with a minor allele frequency (MAF) of 0.14 (16). CDH1 has 6 eQTLs, one of which is SNP rs7196495 with a MAF of 0.22 (45). CLLU1OS possesses the eQTL SNP rs12580153 with a MAF of 0.19 (46). L1TD1 has 2 eQTLs, one of which is SNP rs12137088 with a MAF 0.30 (47). Because the MAFs are high, it is plausible that the minor alleles are observed in the HapMap data set and that they lead to differential expressions of the associated genes. The conditions that were found by DEXUS are related to the alleles of corresponding SNPs. Because the HapMap samples are lymphoblastoid cells, we confirmed that the genes detected by DEXUS are indeed expressed in lymphoblastoid cell lines. The gene NLRP2, ranked 11th by DEXUS, is expressed in lymphoblastoid cells but with large variability (48), as shown in Supplementary Figure S17. NLRP2 is expressed in the HapMap individuals, but in some at a low level. Schlattl et al. (49) identified a copy number variable region that partially covers NLRP2 and causes its differential expression. Therefore, the conditions that DEXUS identified for NLRP2 may be related to copy number states of the samples. Copy number states may also be responsible for differential expression of MKRN3, which was ranked 12th by DEXUS. Pinto et al. (50) and Redon et al. (51) identified a copy number variable region covering MKRN3. However, interpreting the MKRN3 conditions is difficult because only the paternal copy of MKRN3 is expressed. We analyzed the I/NI value ranking of transcripts: genes on the X chromosome were ranked significantly higher than other genes (P = 3.0e−12), which can be explained by sex-related transcripts. An analogous test for the Y chromosome was not significant because too few genes were expressed. However, as already mentioned, five out of the 12 top-ranked genes are located on the Y chromosome. At an I/NI threshold of 0.1, DEXUS called 366 differentially expressed genes. Gene set enrichment analysis showed that the called genes are associated with the extracellular region and the plasma membrane. In total, 20 significant GO terms were found, including ‘extracellular space’, ‘extracellular region part’ and ‘plasma membrane part’ with P = 2.5e−5, P = 8.8e−5 and P = 0.01, respectively. ‘Cell–cell signaling’, ‘chemokine receptor binding’ and ‘chemokine activity’ were also significant at P = 4.0e−3, P = 8.0e−4 and P = 9.8e−4 (P-values were corrected for multiple testing by means of the Benjamini–Hochberg procedure). These GO terms are in agreement with characteristics of lymphoblastoid cells. Supplementary Table S18 provides a complete list of all significant GO terms.

‘European HapMap’

We analyzed the RNA-Seq data of 60 individuals from the HapMap cohort from Montogmery et al. (15), which were provided by the ReCount repository (44). Again, DEXUS was applied to these data with its default parameters and ranked genes according to the I/NI value. The read counts of top-ranked genes and the identified conditions are visualized as a heatmap in Figure 2.

Figure 2.

Heatmap of the normalized read counts of the 12 genes with the largest I/NI values for the ‘European HapMap’ data set. Colors range from white for low expression to blue for high expression. Different individuals are denoted along the x-axis, while the top-ranked genes are denoted by their gene symbols along the y-axis. Red crosses indicate samples that belong to the minor condition. At the right hand side of the heatmap, each gene is annotated by the minimum (‘>’), the median of two conditions (‘m1’ and ‘m2’) and the maximum (‘<’) read count. RPS4Y1 is the gene with the largest I/NI value, differentially expressed between males and females, and located on the Y chromosome. The genes CYorf15A and TMSB4Y, ranked fourth and fifth according to the I/NI value, are also located on the Y chromosome. As in the ‘Nigerian HapMap’ data set, ZFP57 was detected as being differentially expressed. In addition to ZFP57, two other of the 12 top-ranked genes have eQTLs. CLLU1OS has as eQTL the SNP rs12580153 with a MAF of 0.19 (46). POU2F3 has as eQTL the SNP rs2847497 with a MAF of 0.14 (52). As in the ‘Nigerian HapMap’ data set, some top-ranked genes, such as NLRP2 (again rank 11), were differentially expressed owing to variable copy numbers (49). For the genes T, PRSS21 and RASSF10, DEXUS identified two conditions for which an explanation remains to be found and which may indicate a hitherto unknown source of variability in gene expression. The second-ranked gene T, the third-ranked gene PRSS21 and the 12th-ranked gene RASSF10 are known to be expressed in B-lymphoblastoid cells (6,12), the cell type of the HapMap samples. The high expression variability of T and PRSS21 in the B-lymphoblastoid cell line has already been reported by the ENCODE Project (6). The ENCODE Project expression values for the genes T, PRSS21 and RASSF10 are visualized in Supplementary Figures S19, S20 and S21. Analyzing the I/NI value ranking, we found that genes on the X chromosome are ranked significantly higher (P = 8.0e−6, Wilcoxon test). The analogous test for the Y chromosome yielded no significant results, as too few genes were expressed. However, three out of the 12 top-ranked genes with the largest I/NI value are located on the Y chromosome. At an I/NI threshold of 0.1, DEXUS called 680 differentially expressed genes. Gene set enrichment analysis showed that the called genes are associated with ion transport. Significant Gene Ontology (GO) terms were ‘ion transport’, ‘potassium ion transport’ with P = 0.04 and P = 4.3e−03, respectively. Again ‘plasma membrane part’ was significant at P = 0.027. Although 36 of the 680 genes were related to ‘cell–cell signaling’ and 6 to ‘chemokine activity’, these GO terms were not significant in this data set after correction for multiple testing by means of the Benjamini–Hochberg procedure. A table of all significant GO terms can be found in Supplementary Table S19.

‘Primate Liver’

Blekhman et al. (53) investigated the differences in alternative splicing in liver tissue between humans, chimpanzees and rhesus macaques. For this purpose, they performed RNA-Seq on three male and three female liver samples from each species. They focused on the expression values of exons that had reliably determined orthologs in all species. Read counts for exons were provided by Blekhman et al. (53), who used gene models from Ensemble (Release 50). After pooling technical replicates, DEXUS ranked genes according to the I/NI value using its default parameters. The 10 top-ranked genes are visualized in Figure 3, which shows clear differential expression between the species. For these genes, and without having been provided with this information, DEXUS determined one of the three species as minor condition. Interestingly, out of the 10 top-ranked genes, six are human pseudogenes, AC010591.10, AC105383.3, AC093874.3-1, AC105383.3, AL132855.4 and UOX, which are inactive in humans because of recent structural rearrangements (54). Because the rearrangements are recent, their orthologs can be identified reliably in other primates. Differential expression is detected because these orthologs are still transcribed in Pan troglodytes and in Macaca mulatta.

Figure 3.

Heatmap of the normalized read counts of the 10 genes with the largest I/NI values for the ‘Primate Liver’ data set. Colors range from white for low expression to blue for high expression. The x-axis shows female and male individuals from the three species human Homo sapiens (HS), chimpanzee P. troglodytes (PT) and rhesus macaques M. mulatta (MM). The y-axis displays top-ranked genes indicated by their gene symbols. Red crosses mark samples that were assigned to the minor condition. At the right side of the heatmap, each gene is annotated by the minimum (‘>’), the median of two conditions (‘m1’ and ‘m2’) and the maximum (‘<’) read count. Several of the 10 top-ranked genes are associated with liver pathways and are therefore expressed in liver tissues. Differential expression of these genes between species may have arisen from different diets. Examples of such genes are the human pseudogene UOX, which catalyze the oxidation of uric acid to allantoin in M. mulatta, ABP1 and GSTM5, which participate in degradation and detoxification pathways, VNN3, which helps to recycle vitamin B5, and CHFR2, which is associated with lipoproteins. Thresholding the I/NI call at 0.1, DEXUS called 3384 genes (16% of all genes) as differentially expressed. A gene set enrichment analysis found the GO terms ‘intrinsic to plasma membrane’ (P = 7.9e−7) and ‘integral to plasma membrane’ (P = 4.0e−6) to be significant. Thus, genes that encode membrane proteins seem to be differentially expressed between species more often than other genes. Interestingly, also ‘response to extracellular stimulus’, ‘response to nutrient’ and ‘response to nutrient levels’ were significant (all P-values <7.6e−5), which supports the hypothesis that some genes are differentially expressed owing to the different diets of the species. All P-values were corrected by means of the Benjamini–Hochberg procedure.

‘Maize Leaves’

Using RNA-Seq data from various locations on maize plant leaves, Li et al. (55) studied the developmental dynamics of the maize transcriptome. For each location, two biological replicates were sequenced with Illumina’s Genome Analyzer II. The reads were mapped to the TE-masked Zea maize ZmB73 reference genome version 2 (AGPv2), release 5a, using the GSNAP splicing short read mapper (56). We counted the overlaps between mapped reads and the Z. maize gene definitions from the Ensemble Plants database (Release 14). Reads that have multiple possible alignments or that overlap with more than one gene were discarded. DEXUS was applied to this data with its default parameters. Figure 4 shows the genes with the largest I/NI value and the conditions that were identified by DEXUS. DEXUS found differences in gene expressions between different locations on the leaf despite this information being withheld. Further, it almost always assigned the two replicates to the same condition without knowledge of replicates or leaf locations. Thus, DEXUS assigns conditions reliably.

Figure 4.

Heatmap of the normalized read counts of the 10 genes with the largest DEXUS I/NI values for the ‘Maize Leaves’ data set. Colors range from white for low expression to blue for high expression. The x-axis shows samples from different locations on the maize plant leaf. The y-axis displays different genes denoted by their gene symbols. Red crosses indicate that the according samples belong to the minor condition. At the right hand side of the heatmap, each gene is annotated by the minimum (‘>’), the median of two conditions (‘m1’ and ‘m2’) and the maximum (‘<’) read count. Eight of the 10 top-ranked genes were also measured by means of microarrays across different leaf locations of Z. mays (57). In this microarray experiment, all eight genes show an absolute log fold change of at least 1 between base and tip. Six of these eight genes show an absolute log fold change greater than four. The two remaining genes, GRMZM2G331518 and AC213612.3_FG001, were not annotated on the microarray. The function of the top-ranked gene GRMZM2G331518 is not known. However, the associated peptide is similar to the defensin-like protein 91 of Arabidopsis thaliana, which plays a role in immune response. The gene ranked ninth, AC213612.3_FG001, is a glycine-rich cell wall structural protein, which is compatible with cell walls at different locations having different structure. At a threshold of 0.1 for the I/NI call, DEXUS called 15 756 differentially expressed genes. Gene set enrichment analysis using the R package goseq (58) yielded to the significant GO terms ‘chloroplast’ (P = 1.8e−92) and ‘plasma membrane’ (P = 1.3e−34). Further, the GO terms ‘cytosolic ribosome’ (P = 9.8e−32), ‘chloroplast thylakoid membrane’ (P = 5.4e−31) and ‘chloroplast stroma’ (P = 1.8e−30) were significant. All P-values were corrected by means of the Benjamini–Hochberg procedure. It is plausible that that chloroplast also differs at different locations on the maize plant leaf. The GO term ‘cell wall’ was highly significant (P = 3.9e−18), which supports the above-mentioned hypothesis that the cell walls differ at different locations on the plant leaf.

CONCLUSION

We have introduced DEXUS, an algorithm that identifies differentially expressed transcripts in RNA-Seq data with unknown conditions. DEXUS is appropriate for use with data from cohort, cross-sectional and nonrandomized controlled studies, where conditions are often unknown. In experiments with simulated and real-world data with known conditions, DEXUS successfully found differential expressed transcripts and conditions, although the conditions were withheld from DEXUS. For HapMap individuals, DEXUS detected differentially expressed transcripts, the vast majority of which are related to sex, eQTLs or structural variants. We envisage that DEXUS will evolve into an important tool for analyzing RNA-Seq data in studies with unknown conditions and thus for obtaining new biological and medical knowledge.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: Funds from the Institute of Bioinformatics, Johannes Kepler University Linz. Conflict of interest statement. None declared.

57 in total

Review 1. The Japanese toxicogenomics project: application of toxicogenomics.

Authors: Takeki Uehara; Atsushi Ono; Toshiyuki Maruyama; Ikuo Kato; Hiroshi Yamada; Yasuo Ohno; Tetsuro Urushidani
Journal: Mol Nutr Food Res Date: 2010-02 Impact factor: 5.914

2. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data.

Authors: Likun Wang; Zhixing Feng; Xi Wang; Xiaowo Wang; Xuegong Zhang
Journal: Bioinformatics Date: 2009-10-24 Impact factor: 6.937

3. Transcriptome genetics using second generation sequencing in a Caucasian population.

Authors: Stephen B Montgomery; Micha Sammeth; Maria Gutierrez-Arcelus; Radoslaw P Lach; Catherine Ingle; James Nisbett; Roderic Guigo; Emmanouil T Dermitzakis
Journal: Nature Date: 2010-03-10 Impact factor: 49.962

4. Sex-specific and lineage-specific alternative splicing in primates.

Authors: Ran Blekhman; John C Marioni; Paul Zumbo; Matthew Stephens; Yoav Gilad
Journal: Genome Res Date: 2009-12-15 Impact factor: 9.043

5. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments.

Authors: James H Bullard; Elizabeth Purdom; Kasper D Hansen; Sandrine Dudoit
Journal: BMC Bioinformatics Date: 2010-02-18 Impact factor: 3.169

6. Fast and SNP-tolerant detection of complex variants and splicing in short reads.

Authors: Thomas D Wu; Serban Nacu
Journal: Bioinformatics Date: 2010-02-10 Impact factor: 6.937

7. Common regulatory variation impacts gene expression in a cell type-dependent manner.

Authors: Antigone S Dimas; Samuel Deutsch; Barbara E Stranger; Stephen B Montgomery; Christelle Borel; Homa Attar-Cohen; Catherine Ingle; Claude Beazley; Maria Gutierrez Arcelus; Magdalena Sekowska; Marilyne Gagnebin; James Nisbett; Panos Deloukas; Emmanouil T Dermitzakis; Stylianos E Antonarakis
Journal: Science Date: 2009-07-30 Impact factor: 47.728

8. Comparative analysis of processed ribosomal protein pseudogenes in four mammalian genomes.

Authors: Suganthi Balasubramanian; Deyou Zheng; Yuen-Jong Liu; Gang Fang; Adam Frankish; Nicholas Carriero; Rebecca Robilotto; Philip Cayting; Mark Gerstein
Journal: Genome Biol Date: 2009-01-05 Impact factor: 13.583

9. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937

10. BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources.

Authors: Chunlei Wu; Camilo Orozco; Jason Boyer; Marc Leglise; James Goodale; Serge Batalov; Christopher L Hodge; James Haase; Jeff Janes; Jon W Huss; Andrew I Su
Journal: Genome Biol Date: 2009-11-17 Impact factor: 13.583

9 in total

1. Complex Sources of Variation in Tissue Expression Data: Analysis of the GTEx Lung Transcriptome.

Authors: Matthew N McCall; Peter B Illei; Marc K Halushka
Journal: Am J Hum Genet Date: 2016-09-01 Impact factor: 11.025

2. Natural DNA Transformation Is Functional in Lactococcus lactis subsp. cremoris KW2.

Authors: Blandine David; Amandine Radziejwoski; Frédéric Toussaint; Laetitia Fontaine; Marie Henry de Frahan; Cédric Patout; Sabine van Dillen; Patrick Boyaval; Philippe Horvath; Christophe Fremaux; Pascal Hols
Journal: Appl Environ Microbiol Date: 2017-08-01 Impact factor: 4.792

3. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package.

Authors: Sonia Tarazona; Pedro Furió-Tarí; David Turrà; Antonio Di Pietro; María José Nueda; Alberto Ferrer; Ana Conesa
Journal: Nucleic Acids Res Date: 2015-07-16 Impact factor: 16.971

4. SigFuge: single gene clustering of RNA-seq reveals differential isoform usage among cancer samples.

Authors: Patrick K Kimes; Christopher R Cabanski; Matthew D Wilkerson; Ni Zhao; Amy R Johnson; Charles M Perou; Liza Makowski; Christopher A Maher; Yufeng Liu; J S Marron; D Neil Hayes
Journal: Nucleic Acids Res Date: 2014-07-16 Impact factor: 16.971

5. The transcriptome of NaCl-treated Limonium bicolor leaves reveals the genes controlling salt secretion of salt gland.

Authors: Fang Yuan; Ming-Ju Amy Lyu; Bing-Ying Leng; Xin-Guang Zhu; Bao-Shan Wang
Journal: Plant Mol Biol Date: 2016-03-03 Impact factor: 4.076

6. Benchmarking association analyses of continuous exposures with RNA-seq in observational studies.

Authors: Tamar Sofer; Nuzulul Kurniansyah; François Aguet; Kristin Ardlie; Peter Durda; Deborah A Nickerson; Joshua D Smith; Yongmei Liu; Sina A Gharib; Susan Redline; Stephen S Rich; Jerome I Rotter; Kent D Taylor
Journal: Brief Bioinform Date: 2021-11-05 Impact factor: 11.622

7. Toward the human cellular microRNAome.

Authors: Matthew N McCall; Min-Sik Kim; Mohammed Adil; Arun H Patil; Yin Lu; Christopher J Mitchell; Pamela Leal-Rojas; Jinchong Xu; Manoj Kumar; Valina L Dawson; Ted M Dawson; Alexander S Baras; Avi Z Rosenberg; Dan E Arking; Kathleen H Burns; Akhilesh Pandey; Marc K Halushka
Journal: Genome Res Date: 2017-09-06 Impact factor: 9.043

8. Mixture models reveal multiple positional bias types in RNA-Seq data and lead to accurate transcript concentration estimates.

Authors: Andreas Tuerk; Gregor Wiktorin; Serhat Güler
Journal: PLoS Comput Biol Date: 2017-05-15 Impact factor: 4.475

9. Impact of adaptive filtering on power and false discovery rate in RNA-seq experiments.

Authors: Sonja Zehetmayer; Martin Posch; Alexandra Graf
Journal: BMC Bioinformatics Date: 2022-09-24 Impact factor: 3.307

9 in total