
Phylogenomic distance method for analyzing transcriptome evolution based on RNA-seq data.

Xun Gu, Yangyun Zou, Wei Huang, Libing Shen, Zebulun Arendsee, Zhixi Su.

Abstract

Thanks to microarray technology, our understanding of transcriptome evolution at the genome level has advanced considerably in the past decade. Yet further investigation has been hampered by several technical limitations of this technology. Next-generation sequencing, particularly RNA-seq, offers a way to overcome these limitations. Though a number of statistical and computational methods have been developed to analyze RNA-seq data, an analytical framework specifically designed for evolutionary genomics is still lacking. In this article we develop a new method for estimating the genome expression distance from RNA-seq data, which has explicit interpretations under the model of gene expression evolution. Moreover, this distance measure takes data overdispersion, gene length variation, and sequencing depth variation into account so that it can be applied to multiple genomes from different species. Using mammalian RNA-seq data as an example, we demonstrate that this expression distance is useful in phylogenomic analysis.

Keywords:  RNA-seq; genome expression distance; transcriptome evolution

Year:  2013        PMID: 23940099      PMCID: PMC3787673          DOI: 10.1093/gbe/evt121

Source DB:  PubMed          Journal:  Genome Biol Evol        ISSN: 1759-6653            Impact factor:   3.416


Introduction

Despite exciting progress in understanding transcriptome changes during genome evolution, mainly based on microarrays (Enard et al. 2002; Caceres et al. 2003; Gu and Gu 2003; Makova and Li 2003; Rifkin et al. 2003; Huminiecki and Wolfe 2004; Khaitovich et al. 2004; Gu et al. 2005; Gu and Su 2007), further investigation has been limited by the lack of robust gene expression data across a broad range of species and tissues (Wang et al. 2009). Next-generation sequencing, particularly RNA-seq technology, which can generate tens of millions of short sequence reads, has begun to address this problem. These reads can be mapped to each gene through the reference genome or de novo assembly, enabling researchers to quantify transcription levels at ultra-high resolution (Cloonan et al. 2009; Morozova et al. 2009; Wang et al. 2009). Indeed, RNA-seq technology has already made unprecedented advances in revealing the complexity of transcriptional phenomena, ranging from expression profiling, dissection of isoforms, and allelic expression to the extension of 3′-UTR regions, novel splice junctions, modes of antisense regulation, and intragenic expression (Carninci et al. 2005; Eveland et al. 2008; Mortazavi et al. 2008; Nagalakshmi et al. 2008; Sultan et al. 2008; Graveley et al. 2010; Trapnell et al. 2010). The power of RNA-seq in the study of transcriptome evolution was well demonstrated by the recent work of Brawand et al. (2011), who reported a large-scale RNA-seq analysis of six mammalian tissues and showed the dynamics of transcriptome evolution that may underlie many phenotypic differences between species. However, despite many studies of RNA-seq data analysis (Lu et al. 2005; Robinson and Smyth 2007, 2008; Anders and Huber 2010; Di et al. 2011; Zhou et al. 2011; McCarthy et al. 2012), statistical methods designed specifically for evolutionary genomics have not been well developed.
In this article we report a new method for estimating the genome expression distance based on RNA-seq data, which has explicit interpretations under the model of gene expression evolution. Using mammalian RNA-seq data as an example, we show that this expression distance can be used in phylogenomic reconstruction and related phylogeny-based expression analysis.

Materials and Methods

New Methods

Statistical Framework of Transcriptome Evolution

Because RNA-seq technology provides read counts, its sampling properties are similar to those of earlier data types such as SAGE (Velculescu et al. 1995) or EST (Audic and Claverie 1997; Ewing and Claverie 2000). A variety of statistical methods have been proposed; see Zhou et al. (2011), McCarthy et al. (2012), and Di et al. (2011) for recent advances and references therein. In short, these methods account for RNA-seq overdispersion and normalize the data to remove nonbiological effects, both of which are also addressed in our method. Though a simple Poisson distribution model $p(x; \lambda)$, characterized by a variance equal to the mean ($\lambda$), can effectively handle substantial zero counts, many studies have shown that RNA-seq counts exhibit a greater variance across biological replicates than the Poisson model predicts (Di et al. 2011; McCarthy et al. 2012). This phenomenon is called overdispersion in statistics. Among the statistical models proposed to remedy this problem (Lu et al. 2005; Robinson and Smyth 2007, 2008; Anders and Huber 2010; Di et al. 2011; Zhou et al. 2011; McCarthy et al. 2012), our study adopts the widely used negative binomial distribution (NBD). We choose a special form denoted by $p(x; \lambda, \omega)$, characterized by the mean parameter ($\lambda$) and the overdispersion parameter ($\omega$) (see eq. 4 in Data Processing section). A large value of $\omega$ indicates strong overdispersion, and vice versa. When $\omega = 0$, $p(x; \lambda, \omega)$ reduces to the Poisson model. Next we model the mean parameter $\lambda$ as a random variable to describe the expression variability among genes. A typical RNA-seq sample may include many thousands of genes, showing a highly skewed distribution of read counts. For instance, in mammalian tissues (Brawand et al. 2011), the top 5% highly expressed genes received roughly $10^2$–$10^5$ RNA-seq counts, whereas the bottom 40% lowly expressed genes received roughly 0–10 counts.
We therefore implement a lognormal distribution, analogous to the log-transformation in microarray data analysis (Kerr and Churchill 2001; Irizarry et al. 2003); the log of $\lambda$ follows a normal distribution with mean $\mu$ and variance $\eta^2$. Together, the RNA-seq counts in a sample follow a negative binomial-lognormal distribution denoted by $f(x)$. Though the analytical form of $f(x)$ is not available, its mean and variance can be derived straightforwardly; see equation (5) in Data Processing section. In the case of two RNA-seq samples of the same tissue from two species (genomes) X and Y, the mean parameter $\lambda_X$ (or $\lambda_Y$) of genome X (or Y) follows a lognormal distribution accounting for the among-gene expression variability. Because $\lambda_X$ and $\lambda_Y$ are correlated by the evolutionary relatedness of genomes X and Y, without loss of generality the joint model of $\lambda_X$ and $\lambda_Y$ can be written as follows (Gu 2004):

$$\ln \lambda_X = \mu_X + \alpha + \beta_X, \qquad \ln \lambda_Y = \mu_Y + \alpha + \beta_Y \tag{1}$$

where $\alpha$ is the ancestral genetic component shared by X and Y, $\beta_X$ and $\beta_Y$ are the independent lineage-specific genetic effects, and $\mu_X$ and $\mu_Y$ are the ground means. Together, $\alpha$, $\beta_X$, and $\beta_Y$ describe the evolutionarily correlated structure of the underlying regulatory machinery. To implement this model, we further assume that $\alpha$, $\beta_X$, and $\beta_Y$ are mutually independent, each following a normal distribution with mean 0 and variance $\rho^2$, $v_X^2$, or $v_Y^2$, respectively. As shown in figure 1, the variance component $\rho^2$ measures the expression variability at the common ancestor of species X and Y. Meanwhile, the variance component $v_X^2$ (or $v_Y^2$) measures the expression variability generated during the evolution from the common ancestor to the current species X (or Y). For the current genome X, the marginal expression variability is given by $\gamma_X = \alpha + \beta_X$, so that the variance of among-gene variability is $\eta_X^2 = \rho^2 + v_X^2$. Similarly, for genome Y, we have $\gamma_Y = \alpha + \beta_Y$ and $\eta_Y^2 = \rho^2 + v_Y^2$.
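The variance structure of this two-species model can be checked with a short simulation. The following is a minimal sketch, not the authors' code: it assumes the additive form ln λ = μ + α + β for each genome, and the function names and parameter values are hypothetical. It draws the three independent normal components and confirms that the among-gene variance of each genome is close to ρ² plus its lineage-specific variance, while the variance of the log-ratio between genomes is close to the sum of the two lineage-specific variances.

```python
import math
import random

def simulate_log_means(n_genes, mu_x, mu_y, rho2, v2_x, v2_y, seed=0):
    """Draw (ln lambda_X, ln lambda_Y) per gene: shared alpha plus independent betas."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_genes):
        alpha = rng.gauss(0.0, math.sqrt(rho2))    # ancestral component shared by X and Y
        beta_x = rng.gauss(0.0, math.sqrt(v2_x))   # lineage-specific effect in X
        beta_y = rng.gauss(0.0, math.sqrt(v2_y))   # lineage-specific effect in Y
        pairs.append((mu_x + alpha + beta_x, mu_y + alpha + beta_y))
    return pairs

def variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Illustrative parameter values only (not estimates from any data set).
pairs = simulate_log_means(50_000, mu_x=3.0, mu_y=3.5, rho2=2.0, v2_x=0.3, v2_y=0.5)
eta2_x = variance([p[0] for p in pairs])   # should be near rho2 + v2_x
eta2_y = variance([p[1] for p in pairs])   # should be near rho2 + v2_y
u = variance([a - b for a, b in pairs])    # should be near v2_x + v2_y
```

The shared component α cancels in the log-ratio, which is why only the two lineage-specific variances remain in the last quantity.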
Fig. 1. Model of transcriptome evolution between two species. (A) A schematic illustration for a rooted two-gene tree: $\rho^2$ refers to among-gene expression variability at the common ancestor of species X and Y; $v_X^2$ and $v_Y^2$ measure the among-gene expression variability in lineages X and Y, respectively, since the split from the common ancestor. (B) The variance–covariance matrix of genome expression between the current genomes X and Y. (C) The expression distance U plotted against the evolutionary time t. Expression divergence is an accelerated process under the adaptive model, a constant-rate process under the neutral model, and a decelerated process under the stabilizing model. In particular, when W→0, we have U→2σ²t, i.e., the stabilizing selection model reduces to the neutral model; and when t→∞, U→1/W, i.e., the expression divergence approaches a saturated level.

Definition of Expression Distance

For a given tissue, the expression distance should measure the expression divergence between two species that diverged t time units ago, reflecting the underlying regulatory divergence. Because the two variance components $v_X^2$ and $v_Y^2$ characterize the expression divergence along the lineages from the common ancestor to species X and Y, respectively, following our previous work (Gu 2004; Gu and Su 2007) we define the expression distance between species X and Y as

$$U = v_X^2 + v_Y^2 \tag{2}$$

The biological interpretation of equation (2) can be briefly summarized as follows; also see figure 1C for numerical illustrations. (i) Under the simple Brownian model that represents selectively neutral expression evolution (Gu 2004), we have $U = 2\sigma^2 t$, where $\sigma^2$ is the rate of mutational variance. Hence, under the neutral expression model, the expression distance U increases proportionally with the evolutionary time t, and the rate (r) of expression divergence equals the rate of mutational variance, i.e., $r = \sigma^2$. (ii) Under the Ornstein–Uhlenbeck (OU) model (Gu and Su 2007), gene expression is maintained around its optimum by stabilizing selection, and any deviation of the expression profile may reduce organismal fitness. It has been shown (Gu and Su 2007) that the expression distance is expected to be $U = (1 - e^{-2\beta t})/W$, where W describes the selection strength and the decay rate is $\beta = W\sigma^2$. Importantly, when t→∞, U→1/W, which means that the expression distance approaches a saturated level determined by the strength of stabilizing selection. In the special case of W→0, i.e., very weak stabilizing selection, one can show $U \to 2\sigma^2 t$, i.e., the neutral Brownian model. Intuitively, expression divergence under the stabilizing model evolves more slowly than the neutral expectation. Indeed, the rate of expression divergence (r) under the stabilizing model can be symbolically written as $r = \sigma^2 f$, where the expression constraint f < 1 measures the effect of purifying selection. In short, the stabilizing selection model of expression divergence is consistent with the nearly neutral model. (iii) Despite many forms of adaptive expression divergence, the general pattern is that the rate of expression divergence can be accelerated by adaptive evolution, i.e., $r > \sigma^2$. For instance, gradual directional selection (Gu 2004) predicts that the expression distance is proportional to $t^2$.
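The neutral and stabilizing expectations can be compared numerically. The sketch below is illustrative rather than part of the original method: it assumes the stabilizing-selection form U = (1 − exp(−2βt))/W with β = Wσ², which is the form whose two limits match those stated in the text, and the function names are hypothetical.

```python
import math

def u_neutral(t, sigma2):
    """Expected expression distance under the neutral Brownian model: U = 2 * sigma^2 * t."""
    return 2.0 * sigma2 * t

def u_stabilizing(t, sigma2, w):
    """Expected distance under stabilizing selection (OU model), with decay rate beta = W * sigma^2."""
    beta = w * sigma2
    # -expm1(-z) equals 1 - exp(-z) and stays accurate when z is tiny.
    return -math.expm1(-2.0 * beta * t) / w

sigma2 = 0.1                                       # illustrative mutational variance rate
near_neutral = u_stabilizing(1.0, sigma2, 1e-8)    # W -> 0 approaches 2 * sigma^2 * t
saturated = u_stabilizing(1e6, sigma2, 2.0)        # t -> infinity approaches 1 / W
```

For any positive W and t, `u_stabilizing` stays below `u_neutral`, which is the numerical counterpart of the statement that divergence under stabilizing selection is slower than the neutral expectation.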

Estimation of Expression Distance from RNA-seq Data

Suppose that we have RNA-seq data of a tissue from genomes X and Y, both of which contain n orthologous genes with RNA-seq counts denoted by $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$. When RNA-seq data contain multiple biological replicates, we use the simple mean over replicates for each gene. It is thus straightforward to obtain the estimates of the first, second, and cross moments $E[x]$, $E[x^2]$, $E[y]$, $E[y^2]$, and $E[xy]$, respectively; for instance, the estimate of $E[xy]$ is given by $\sum_i x_i y_i / n$. On the other hand, the expectations of these moments under the NBD-lognormal model can be found in equations (5) and (6) in Data Processing section, allowing us to develop a simple method to estimate the expression distance $U = v_X^2 + v_Y^2$. To this end, we first define three basic quantities: $J_{XX} = E[x^2] - E[x]$, $J_{YY} = E[y^2] - E[y]$, and $J_{XY} = E[xy]$. The (mean-corrected) second moments $J_{XX}$ and $J_{YY}$ represent the expression variability in genomes X and Y, respectively, and the cross-product $J_{XY}$ measures the co-expression pattern. Putting equation (2) together with equations (5) and (6), one can derive the relationships of $J_{XX}$, $J_{YY}$, and $J_{XY}$ with the underlying model parameters (presented in the second column of table 1). It follows that the expression distance defined by equation (2) can be rewritten as follows:

$$U = \ln\frac{J_{XX}\,J_{YY}}{J_{XY}^2} - (\Omega_X + \Omega_Y) \tag{3}$$

where $\Omega_X = \ln(1 + \omega_X/L_X)$ and $\Omega_Y = \ln(1 + \omega_Y/L_Y)$ are the effects of overdispersion; $L_X$ and $L_Y$ are the numbers of biological replicates of genomes X and Y, respectively.
Table 1

Definitions, Theoretical Expectations, and Formulas of Statistical Estimation for Three Quantities $J_{XX}$, $J_{YY}$, and $J_{XY}$

Quantity (a)                     Expectation (b)    Estimation (c)
$J_{XX} = E[x^2] - E[x]$
$J_{YY} = E[y^2] - E[y]$
$J_{XY} = E[xy]$

(a) E[.] is short form for expectation.

(b) Derivation of each expectation can be found in Materials and Methods. See figure 1 and the text for the description of model parameters.

(c) $x_i$ (or $y_i$) is the mean RNA-seq count of gene i over its biological replicates in genome X (or Y); and n is the number of genes under study.

The flow chart in figure 2 shows the statistical procedure for the estimation of U; see Data Processing section for technical details. In the first step, data normalization, we introduce two correction constants for each genome (X) to remove the overestimation of expression distance: the constant $C_X$ accounts for the effect of sequence length variation among genes and $B_X$ for sequencing depth variation among genomes. After data normalization, one can compute $J_{XX}$, $J_{YY}$, and $J_{XY}$ by the formulas in the third column of table 1. When genes have the same sequence length and genomes have the same sequencing depth, we have $C_X = 1$ and $B_X = 1$; in this case, $J_{XX}$, $J_{YY}$, and $J_{XY}$ are simply calculated by the method of moments. The next issue is to estimate overdispersion. We implemented a simple method to estimate $\Omega_X$ and $\Omega_Y$ even when only two biological replicates are available. See Results and Discussion for a special treatment of the case of a single biological replicate. Finally, the sampling variance of the estimated U can be determined empirically by bootstrapping or by a simple approximate method.
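The estimation procedure described above can be sketched in a few lines. This is a hedged illustration, not the PhyExp implementation: the function and argument names are hypothetical, the normalized quantities follow the formulas given in RNA-seq Data Normalizations, and the final step assumes the distance takes the form U = ln[J_XX J_YY / J_XY²] − Ω_X − Ω_Y, which is what the moment relationships of the NBD-lognormal model imply.

```python
import math

def expression_distance(x, y, b_x=1.0, b_y=1.0, c_x=1.0, c_y=1.0, c_xy=1.0,
                        omega_eff_x=0.0, omega_eff_y=0.0):
    """Moment-based expression distance from per-gene mean counts x and y.

    b_*, c_*, c_xy are the depth and length correction constants (1.0 means
    no correction); omega_eff_* are the overdispersion effects ln(1 + omega / L).
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    e_x = sum(x) / n
    e_y = sum(y) / n
    e_x2 = sum(v * v for v in x) / n
    e_y2 = sum(v * v for v in y) / n
    e_xy = sum(a * b for a, b in zip(x, y)) / n
    # Normalized, mean-corrected second moments and cross moment.
    j_xx = e_x2 / (b_x ** 2 * c_x) - e_x / b_x
    j_yy = e_y2 / (b_y ** 2 * c_y) - e_y / b_y
    j_xy = e_xy / (b_x * b_y * c_xy)
    return math.log(j_xx * j_yy / j_xy ** 2) - omega_eff_x - omega_eff_y

# Toy counts for three orthologous genes with strongly divergent expression.
x = [1, 10, 100]
y = [100, 10, 1]
u_no_correction = expression_distance(x, y)
u_corrected = expression_distance(x, y, omega_eff_x=0.1, omega_eff_y=0.2)
```

Setting all correction constants to 1 and both overdispersion effects to 0 gives the uncorrected distance, so the same function covers the single-replicate case discussed in Results and Discussion.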
Fig. 2. Flow chart illustrating the statistical procedure of expression distance estimation.

Data Sets

We downloaded the mammalian RNA-seq data in five tissues (brain, cerebellum, liver, heart, and kidney) from Brawand et al. (2011). For simplicity, we used the total reads of all 5,636 1:1 orthologous genes, as suggested by the original authors. We also obtained the RNA-seq counts independently from the raw reads and found virtually the same results.

Data Processing

Calculation of Moments

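The specific form of NBD we used in our study is as follows:

$$p(x; \lambda, \omega) = \frac{\Gamma(x + \alpha)}{\Gamma(\alpha)\, x!} \left(\frac{\alpha}{\alpha + \lambda}\right)^{\alpha} \left(\frac{\lambda}{\alpha + \lambda}\right)^{x} \tag{4}$$

where $\alpha = 1/\omega$ (in this equation only; elsewhere $\alpha$ denotes the ancestral component of equation 1). Let $\varphi(\lambda)$ be a lognormal distribution such that the log of $\lambda$ follows a normal distribution with mean $\mu$ and variance $\eta^2$. Then, the negative binomial-lognormal distribution for RNA-seq counts (x) of genes is given by $f(x) = \int p(x; \lambda, \omega)\, \varphi(\lambda)\, d\lambda$. Next we derive the first and second moments. From the conditional expectation $E[x \mid \lambda] = \lambda$ according to equation (4), we have $E[x] = E[E[x \mid \lambda]] = E[\lambda]$. Similarly, we have $E[x^2] = E[E[x^2 \mid \lambda]] = E[\lambda + (\omega + 1)\lambda^2]$. With respect to the lognormal distribution $\varphi(\lambda)$, we obtain

$$E[x] = e^{\mu + \eta^2/2}, \qquad E[x^2] = e^{\mu + \eta^2/2} + (1 + \omega)\, e^{2\mu + 2\eta^2} \tag{5}$$

In the case of two genomes X and Y, the first and second moments of x or y are given by equation (5). For the cross-moment of x and y, from equation (4) we have $E[xy] = E[E[xy \mid \lambda_X, \lambda_Y]] = E[\lambda_X \lambda_Y]$. Together with the independence assumption for the three components in equation (1) and the lognormal distribution $\varphi(\lambda)$, we derive $E[\lambda_X \lambda_Y] = E[\exp(\mu_X + \alpha + \beta_X) \exp(\mu_Y + \alpha + \beta_Y)] = \exp(\mu_X + \mu_Y)\, E[\exp(2\alpha)]\, E[\exp(\beta_X)]\, E[\exp(\beta_Y)]$, resulting in

$$E[xy] = \exp\!\left(\mu_X + \mu_Y + 2\rho^2 + \frac{v_X^2 + v_Y^2}{2}\right) \tag{6}$$

When the mean RNA-seq count over L biological replicates is used to estimate the expression distance, the first, second, and cross-moments can be derived with a similar approach, except that the overdispersion parameter becomes $\bar{\omega} = \omega/L$ (omitting the subscripts X or Y).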
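The conditional moments used in this derivation can be verified numerically. The sketch below assumes the standard mean-overdispersion parameterization of the negative binomial (shape 1/ω, mean λ), which is the form consistent with E[x | λ] = λ and E[x² | λ] = λ + (ω + 1)λ²; the function names are hypothetical.

```python
import math

def nbd_log_pmf(x, lam, omega):
    """Log-probability of count x under an NBD with mean lam and overdispersion omega > 0."""
    a = 1.0 / omega  # shape parameter
    return (math.lgamma(x + a) - math.lgamma(a) - math.lgamma(x + 1)
            + a * math.log(a / (a + lam)) + x * math.log(lam / (a + lam)))

def lognormal_mixture_moments(mu, eta2, omega):
    """E[x] and E[x^2] after integrating lam over a lognormal(mu, eta2) distribution."""
    e_x = math.exp(mu + eta2 / 2.0)
    e_x2 = e_x + (1.0 + omega) * math.exp(2.0 * mu + 2.0 * eta2)
    return e_x, e_x2

# Numerical check of the conditional moments for one illustrative (lam, omega) pair.
lam, omega = 20.0, 0.5
probs = [math.exp(nbd_log_pmf(x, lam, omega)) for x in range(5000)]
total = sum(probs)                                              # should be ~1
mean = sum(x * p for x, p in enumerate(probs))                  # should be ~lam
second = sum(x * x * p for x, p in enumerate(probs))            # ~lam + (omega + 1) * lam^2
```

Truncating the sum at 5,000 is safe here because the tail probability beyond that point is negligible for these parameter values.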

RNA-seq Data Normalizations

Two main nonbiological effects inherent in RNA-seq data processing need to be removed to avoid potential biases in the estimation of expression distance: sequence length variation and sequencing depth variation. To this end, we assume that the RNA-seq count of any gene i in genome X ($x_i$) can be written as

$$x_i = c_i\, B_X\, z_i \tag{7}$$

where $c_i$ and $B_X$ are the normalization constants and the variable $z_i$ is the normalized count when all genes have the same length (equal to the genome mean) and the same sequencing depth (equal to the mean over the genomes under study). Similar to RPKM (reads per kilobase per million mapped reads), we set $c_i = l_i/\bar{l}$, where $l_i$ is the sequence length of gene i and $\bar{l}$ is the genome mean of sequence lengths. To correct sequencing depth variation, one has to consider the fact that the number of genes ($N_X$) may vary among genomes. Here we used a relative measure for any genome X by defining $R_X = \text{Total counts}/N_X$. That is, we normalize the data such that the mean count per gene is roughly the same among the genomes under study. Moreover, we choose $B_X = R_X/R_0$, where $R_0$ is the mean of $R_X$ over all genomes under study. Next we derive the formulas in the third column of table 1 to estimate the expression distance after data normalization. From equation (7), the expectation $E[x]$ is given by $E[x] = B_X E[z] \sum_i c_i/n = B_X E[z]$ because $\sum_i c_i/n = 1$. Similarly, we have $E[x^2] = B_X^2 E[z^2] \sum_i c_i^2/n = B_X^2 C_X E[z^2]$, where $C_X = \sum_i c_i^2/n$. Therefore, after data normalization, we have $J_{XX} = E[z^2] - E[z] = E[x^2]/(B_X^2 C_X) - E[x]/B_X$. In the same manner, we have $J_{YY} = E[y^2]/(B_Y^2 C_Y) - E[y]/B_Y$ and $J_{XY} = E[xy]/(B_X B_Y C_{XY})$, where $C_{XY} = \sum_i c_{X,i}\, c_{Y,i}/n$. After replacing these moments by their corresponding sample moments, we obtain the results shown in table 1.
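The normalization constants can be computed directly from gene lengths and total counts. The following is a minimal sketch under the definitions given above (relative gene length, its mean square, and the mean count per gene scaled by the across-genome mean); the function names and the toy numbers are hypothetical.

```python
def length_constants(lengths):
    """Return per-gene constants c_i = l_i / mean(l) and their mean square C."""
    mean_len = sum(lengths) / len(lengths)
    c = [l / mean_len for l in lengths]
    big_c = sum(ci * ci for ci in c) / len(c)
    return c, big_c

def cross_length_constant(c_x, c_y):
    """Return C_XY, the mean product of the per-gene length constants of two genomes."""
    return sum(a * b for a, b in zip(c_x, c_y)) / len(c_x)

def depth_constants(total_counts, n_genes):
    """Return B for each genome: mean count per gene divided by its mean over genomes."""
    r = {g: total_counts[g] / n_genes[g] for g in total_counts}
    r0 = sum(r.values()) / len(r)
    return {g: r[g] / r0 for g in r}

# Toy example: three genes, two genomes.
c_x, big_c_x = length_constants([1000, 2000, 3000])
c_y, big_c_y = length_constants([1500, 1500, 3000])
c_xy = cross_length_constant(c_x, c_y)
b = depth_constants({"X": 1_000_000, "Y": 3_000_000}, {"X": 1000, "Y": 1000})
```

By construction the per-gene constants average to 1 within a genome, so C is at least 1 and equals 1 only when all genes have the same length.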

Outlier Control

There are always a few outliers, i.e., extremely highly expressed genes. Their expression levels are very sensitive to the physiological or developmental condition under which the sample was obtained. Because the distribution of RNA-seq counts is highly skewed, estimation of expression distance could be distorted by these outliers. As a first attempt, we implemented a simple cutoff to alleviate this problem: for the top 2.5% of highly expressed genes, we reset their RNA-seq counts to the value of the 97.5% quantile. Our preliminary analysis indicates that this approach is efficient and not sensitive to the selected cutoff (not shown).
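The cutoff rule amounts to capping counts at an upper quantile. The sketch below illustrates the idea; the nearest-rank quantile definition is an assumption made here for simplicity (the text does not say which quantile convention was used), and the function name is hypothetical.

```python
import math

def cap_outliers(counts, upper=0.975):
    """Reset counts above the given upper quantile to the quantile value.

    Uses a nearest-rank quantile, which is an assumption of this sketch.
    """
    ordered = sorted(counts)
    # round() guards against floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(upper * len(ordered), 9)))
    cap = ordered[rank - 1]
    return [min(c, cap) for c in counts]

# Toy example: 200 genes with counts 1..200; the top 2.5% (5 genes) are capped.
counts = list(range(1, 201))
capped = cap_outliers(counts)
```

Genes at or below the quantile are left unchanged, so the ranking of the remaining genes is preserved.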

Estimation of Overdispersion

If the number of biological replicates in an RNA-seq data set is small, estimation of gene-specific overdispersion remains a difficult task. To deal with this problem, a number of statistical methods have been proposed that share a certain amount of information between genes. For practical reasons, we implemented a fast but robust method to estimate the genome-wide overdispersion parameter $\omega$ (for $\omega_X$ or $\omega_Y$) by maximizing the joint likelihood function of NBDs. We use genome X for illustration. Suppose that $x_{ik}$ is the RNA-seq count of the k-th biological replicate of gene i. The log-likelihood function of gene i, denoted by $lik(\lambda_i, \omega \mid x_{ik})$, is formulated according to the NBD, where the mean ($\lambda_i$) is gene-specific and $\omega$ is the common parameter. Thus, the overall log-likelihood function Lik over all genes is the sum of all $lik(\lambda_i, \omega \mid x_{ik})$. A standard numerical procedure can be applied to obtain the maximum likelihood estimate of $\omega$, which converges rapidly when the moment estimate is used as an initial value: let $\bar{x}_i$ and $V_i$ be the sample mean and variance of gene i. The initial estimate of $\omega$ can be calculated as $\sum_i (V_i - \bar{x}_i) / \sum_i \bar{x}_i^2$.
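The moment-based initial value follows from the NBD variance, λ + ωλ², summed over genes. The sketch below computes it from replicate counts and converts it to the overdispersion effect ln(1 + ω/L); it covers only the initial estimate, not the likelihood maximization, and the clamp at zero and the function names are assumptions of this illustration.

```python
import math

def moment_omega(replicate_counts):
    """Genome-wide moment estimate of omega: sum(V_i - mean_i) / sum(mean_i^2).

    replicate_counts is a list with one list of replicate counts per gene
    (at least two replicates each). Negative estimates are clamped to 0,
    which is an assumption of this sketch.
    """
    num = 0.0
    den = 0.0
    for counts in replicate_counts:
        k = len(counts)
        mean = sum(counts) / k
        var = sum((c - mean) ** 2 for c in counts) / (k - 1)
        num += var - mean
        den += mean * mean
    return max(num / den, 0.0)

def overdispersion_effect(omega, n_replicates):
    """Overdispersion effect for a mean over n_replicates: ln(1 + omega / L)."""
    return math.log1p(omega / n_replicates)

# Toy example: two genes, two biological replicates each.
genes = [[8, 12], [90, 110]]
omega_hat = moment_omega(genes)
effect = overdispersion_effect(omega_hat, 2)
```

In this toy example the first gene is slightly underdispersed and the second overdispersed; pooling across genes is what keeps the estimate stable with only two replicates.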

A Simple Method for Estimating Sampling Variance of U

The sampling variance of the estimated expression distance can be numerically calculated by the bootstrapping method. Nevertheless, by computer simulations we found that the following simple formula is close to the bootstrapping result: $\mathrm{Var}(U) = q/[(1 - q)n]$, where $q = J_{XY}^2/(J_{XX} J_{YY})$, and n is the number of genes.
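Both routes to the sampling variance can be sketched briefly. This is an illustration under simplifying assumptions, not the authors' procedure: genes are resampled with replacement, the distance is computed without normalization or overdispersion corrections, the toy counts are rounded lognormal values rather than real RNA-seq data, and all function names are hypothetical. The approximate formula takes q as an input so that it does not depend on how q is obtained.

```python
import math
import random

def uncorrected_distance(x, y):
    """ln(J_XX * J_YY / J_XY^2) from raw moments, with no corrections applied."""
    n = len(x)
    j_xx = sum(v * v for v in x) / n - sum(x) / n
    j_yy = sum(v * v for v in y) / n - sum(y) / n
    j_xy = sum(a * b for a, b in zip(x, y)) / n
    return math.log(j_xx * j_yy / j_xy ** 2)

def bootstrap_se(x, y, n_boot=100, seed=0):
    """Standard error of the distance from resampling genes with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(uncorrected_distance([x[i] for i in idx], [y[i] for i in idx]))
    mean = sum(estimates) / n_boot
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n_boot - 1))

def approximate_variance(q, n_genes):
    """Simple approximation Var(U) = q / ((1 - q) * n)."""
    return q / ((1.0 - q) * n_genes)

# Toy data: 500 genes with a shared log-scale component plus lineage noise.
rng = random.Random(1)
x, y = [], []
for _ in range(500):
    shared = rng.gauss(3.0, 1.0)
    x.append(round(math.exp(shared + rng.gauss(0.0, 0.3))))
    y.append(round(math.exp(shared + rng.gauss(0.0, 0.3))))
se = bootstrap_se(x, y)
```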

Results and Discussion

Mammalian Tissue Expression Evolution

We used mammalian RNA-seq data (Brawand et al. 2011) in brain, cerebellum, kidney, heart, and liver to demonstrate the application of our newly developed method. For simplicity, we used the RNA-seq counts of the 5,636 1:1 orthologous genes used by the original authors. We estimated $C_X$ for the effect of sequence length variation in each genome. Since it varies only slightly among genomes, we used the averaged correction constant C = 1.324 in the following analysis. By contrast, each tissue we studied shows a great deal of variation in $B_X$, suggesting that the sequencing depth variation among genomes should be corrected appropriately (see table 2 for examples). After estimating the effects of overdispersion, we calculated the pairwise expression distances between mammalian genomes (above the diagonal in table 3 for brain and below the diagonal for cerebellum); the sampling variances of expression distance are presented in the form of standard errors. As expected, the expression distance is small between phylogenetically closely related genomes and large between distantly related genomes. Based on the expression distance matrices, we reconstructed the genome expression phylogeny by the neighbor-joining method. For illustration, figure 3A shows the expression phylogeny for the mammalian brain. With some minor exceptions, the inferred tree is consistent with the known mammalian phylogeny: it correctly separates the placentals (eutherians) from marsupials and monotremes, and separates the two major eutherian lineages (primates and rodents). In addition, we mapped the expression distances onto the known mammalian phylogeny, as shown in figure 3B. We performed the same analyses in the other four tissues. All inferred expression phylogenies are roughly consistent with the known mammalian phylogeny. Similar to Brawand et al. (2011), we found that different tissues and lineages may show different expression distances. For instance, expression evolves more rapidly in testes than in the other tissues. Because of space limitations, we will show these results in detail elsewhere.
Table 2

Summary for the Estimates of Deep-Sequencing Parameters and Overdispersed Parameters in Mammalian Brains and Cerebellums

              $B_X$                    $\Omega_X$
Species       Brain    Cerebellum     Brain    Cerebellum
Human         0.619    1.183          0.165    0.034
Chimpanzee    0.660    0.831          0.102    0.049
Gorilla       1.215    1.063          0.051    0.034
Orangutan     1.462    0.970          0.039    0.033
Macaque       0.846    0.598          0.046    0.009
Mouse         1.439    0.876          0.162    0.054
Opossum       1.030    0.746          0.153    0.003
Platypus      1.093    0.999          0.034    0.013
Table 3

Pairwise Tissue Expression Distance (U) Matrix of Brain and Cerebellum in Mammals

            Human          Chimpanzee     Gorilla        Orangutan      Macaque        Mouse          Opossum        Platypus
Human       0              0.116 ± 0.038  0.174 ± 0.031  0.338 ± 0.021  0.247 ± 0.025  0.248 ± 0.025  0.473 ± 0.017  0.797 ± 0.012
Chimpanzee  0.304 ± 0.023  0              0.191 ± 0.029  0.300 ± 0.023  0.258 ± 0.025  0.333 ± 0.021  0.494 ± 0.017  0.799 ± 0.012
Gorilla     0.357 ± 0.021  0.329 ± 0.022  0              0.348 ± 0.021  0.299 ± 0.023  0.379 ± 0.020  0.512 ± 0.016  0.890 ± 0.011
Orangutan   0.523 ± 0.016  0.393 ± 0.019  0.511 ± 0.016  0              0.302 ± 0.023  0.426 ± 0.018  0.535 ± 0.016  0.912 ± 0.011
Macaque     0.468 ± 0.017  0.343 ± 0.021  0.456 ± 0.018  0.459 ± 0.018  0              0.306 ± 0.023  0.464 ± 0.018  0.852 ± 0.012
Mouse       0.493 ± 0.017  0.467 ± 0.017  0.549 ± 0.016  0.680 ± 0.014  0.518 ± 0.016  0              0.361 ± 0.020  0.704 ± 0.013
Opossum     0.810 ± 0.012  0.699 ± 0.013  0.785 ± 0.012  0.821 ± 0.012  0.672 ± 0.014  0.676 ± 0.014  0              0.512 ± 0.016
Platypus    1.010 ± 0.010  0.842 ± 0.012  0.976 ± 0.010  0.992 ± 0.010  0.823 ± 0.012  0.786 ± 0.012  0.777 ± 0.012  0

Note: Values above the diagonal are for brain and values below the diagonal are for cerebellum; the sampling variances of expression distance are presented in the form of standard errors.

Fig. 3. Mammalian brain expression phylogeny. (A) Expression phylogeny inferred by the neighbor-joining method based on the expression distance matrix of brains. Nodes marked * have bootstrap values >0.95, and nodes marked ** have values >0.99. (B) The result of mapping the expression distance to a given species tree, which is extracted from the tree of life (http://tolweb.org/, last accessed September 18, 2013).

Some Technical Comments

There are several ad hoc distance measures that have been used to analyze the divergence in expression. For instance, Brawand et al. (2011) used 1−R, where R is Spearman's correlation coefficient, and the Euclidean distance in their analyses. Although these measures are useful, our model-based expression distance has a unique strength for the study of transcriptome evolution because it provides a basis for generating testable hypotheses under the phylogenetic framework. In addition, our method considers the effects of sampling and data processing, so that the user can judge whether a conclusion is sensitive to the noise inherent in high-throughput data. Our model implements an NBD to account for data overdispersion. Though this is common practice in statistics, some studies have suggested that it may not be sufficient in RNA-seq analysis. Meanwhile, we use the lognormal distribution to account for the highly skewed RNA-seq variability. It remains for future work to evaluate whether the current model is the most appropriate for RNA-seq data, and how to improve the robustness of our method in the estimation of expression distance. In real data analysis, application of the new expression distance is difficult when there are no biological replicates, because $\Omega_X$ and $\Omega_Y$ cannot be estimated. To resolve this problem, we suggest a modified expression distance that omits the overdispersion effects, that is,

$$U^* = \ln\frac{J_{XX}\,J_{YY}}{J_{XY}^2}$$

Though U* tends to overestimate the expression distance, one can show that U* satisfies the “four-point condition” (Gu and Li 1996). In other words, U* is a paralinear distance to U, which has the following properties: 1) Under strict additivity, the phylogenetic topology inferred from U* is the same as that from U. 2) External branch lengths tend to be overestimated, whereas internal branch lengths are expected to be unbiased. Our software has the option of paralinear expression distance estimation.
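The two properties of the paralinear distance can be illustrated on a four-species tree. The sketch below assumes that omitting the overdispersion effects adds a species-specific constant to each pairwise distance (U* = U + Ω_X + Ω_Y), which follows when the distance has the log-ratio form with the Ω terms subtracted; the tree, branch lengths, and Ω values are invented for illustration and the function names are hypothetical.

```python
from itertools import combinations

def tree_distances(external, internal):
    """Additive distances on the unrooted tree ((A,B),(C,D)) with one internal branch."""
    left = {"A", "B"}
    d = {}
    for p, q in combinations(sorted(external), 2):
        same_side = (p in left) == (q in left)
        length = external[p] + external[q] + (0.0 if same_side else internal)
        d[p, q] = d[q, p] = length
    return d

def add_leaf_constants(d, omega_eff):
    """Add a species-specific constant to every pairwise distance (U -> U*)."""
    return {(p, q): v + omega_eff[p] + omega_eff[q] for (p, q), v in d.items()}

def three_sums(d):
    """The three pairings of four taxa, sorted in increasing order."""
    return sorted([d["A", "B"] + d["C", "D"],
                   d["A", "C"] + d["B", "D"],
                   d["A", "D"] + d["B", "C"]])

def four_point_holds(d, tol=1e-9):
    """Four-point condition: the two largest pair sums are equal."""
    s = three_sums(d)
    return abs(s[2] - s[1]) <= tol

def internal_branch(d):
    """Internal branch length implied by the pair sums."""
    s = three_sums(d)
    return (s[2] - s[0]) / 2.0

u = tree_distances({"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}, internal=0.5)
u_star = add_leaf_constants(u, {"A": 0.165, "B": 0.102, "C": 0.051, "D": 0.039})
```

Every pair sum grows by the same total of the four constants, so the pairing with the smallest sum (the topology) and the internal branch are unchanged, while each external branch absorbs its own constant.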

Software Availability

We have developed a software system, called PhyExp, short for phylogenomic analysis of expression profiles, to help the evolutionary analysis of RNA-seq data. There are several commercially available platforms such as Illumina, SOLiD, or 454 Genome Sequencer, but RNA-seq data processing and analysis are largely the same across them. Two R package distributions, compatible with Windows and Linux operating systems, respectively, are available at http://www.xungulab.com (last accessed September 23, 2013). The first version, PhyExp1.0, implements the following options: 1) After the input file (RNA-seq counts of genes) has been loaded, the expression distance matrix, including the paralinear distances, and the sampling variances are calculated. 2) The expression phylogeny is inferred by the neighbor-joining method; its statistical reliability can be examined by bootstrapping. 3) PhyExp1.0 has the option to input an amino acid sequence alignment, which allows the user to map the expression distances onto the inferred molecular phylogeny or onto a user-provided phylogeny. There are several directions for further improvement: 1) Implement a suite of phylogeny-based analysis tools, including testing asymmetry of expression divergence, ancestral expression inference, and phylogeny-dependent detection of differentially expressed genes (unpublished results). 2) Develop and implement advanced methods for dealing with data normalization and data overdispersion. 3) For practical purposes, implement the option of expression divergence analysis based on microarray data. Moreover, we are particularly interested in how expression divergence is correlated with sequence divergence as well as related phenotypes along the phylogeny (Lartillot and Poujol 2011).
  34 in total

1.  EST databases as multi-conditional gene expression datasets.

Authors:  R M Ewing; J M Claverie
Journal:  Pac Symp Biocomput       Date:  2000

2.  A phylogenetic model for investigating correlated evolution of substitution rates and continuous phenotypic characters.

Authors:  Nicolas Lartillot; Raphaël Poujol
Journal:  Mol Biol Evol       Date:  2010-10-06       Impact factor: 16.240

Review 3.  Applications of new sequencing technologies for transcriptome analysis.

Authors:  Olena Morozova; Martin Hirst; Marco A Marra
Journal:  Annu Rev Genomics Hum Genet       Date:  2009       Impact factor: 8.929

4.  The transcriptional landscape of the yeast genome defined by RNA sequencing.

Authors:  Ugrappa Nagalakshmi; Zhong Wang; Karl Waern; Chong Shou; Debasish Raha; Mark Gerstein; Michael Snyder
Journal:  Science       Date:  2008-05-01       Impact factor: 47.728

5.  Mapping and quantifying mammalian transcriptomes by RNA-Seq.

Authors:  Ali Mortazavi; Brian A Williams; Kenneth McCue; Lorian Schaeffer; Barbara Wold
Journal:  Nat Methods       Date:  2008-05-30       Impact factor: 28.547

Review 6.  RNA-Seq: a revolutionary tool for transcriptomics.

Authors:  Zhong Wang; Mark Gerstein; Michael Snyder
Journal:  Nat Rev Genet       Date:  2009-01       Impact factor: 53.242

7.  The developmental transcriptome of Drosophila melanogaster.

Authors:  Brenton R Graveley; Angela N Brooks; Joseph W Carlson; Michael O Duff; Jane M Landolin; Li Yang; Carlo G Artieri; Marijke J van Baren; Nathan Boley; Benjamin W Booth; James B Brown; Lucy Cherbas; Carrie A Davis; Alex Dobin; Renhua Li; Wei Lin; John H Malone; Nicolas R Mattiuzzo; David Miller; David Sturgill; Brian B Tuch; Chris Zaleski; Dayu Zhang; Marco Blanchette; Sandrine Dudoit; Brian Eads; Richard E Green; Ann Hammonds; Lichun Jiang; Phil Kapranov; Laura Langton; Norbert Perrimon; Jeremy E Sandler; Kenneth H Wan; Aarron Willingham; Yu Zhang; Yi Zou; Justen Andrews; Peter J Bickel; Steven E Brenner; Michael R Brent; Peter Cherbas; Thomas R Gingeras; Roger A Hoskins; Thomas C Kaufman; Brian Oliver; Susan E Celniker
Journal:  Nature       Date:  2010-12-22       Impact factor: 49.962

8.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

Authors:  Cole Trapnell; Brian A Williams; Geo Pertea; Ali Mortazavi; Gordon Kwan; Marijke J van Baren; Steven L Salzberg; Barbara J Wold; Lior Pachter
Journal:  Nat Biotechnol       Date:  2010-05-02       Impact factor: 54.908

9.  Differential expression analysis for sequence count data.

Authors:  Simon Anders; Wolfgang Huber
Journal:  Genome Biol       Date:  2010-10-27       Impact factor: 13.583

10.  RNA-MATE: a recursive mapping strategy for high-throughput RNA-sequencing data.

Authors:  Nicole Cloonan; Qinying Xu; Geoffrey J Faulkner; Darrin F Taylor; Dave T P Tang; Gabriel Kolle; Sean M Grimmond
Journal:  Bioinformatics       Date:  2009-07-30       Impact factor: 6.937

  5 in total

1.  Detecting cognizable trends of gene expression in a time series RNA-sequencing experiment: a bootstrap approach.

Authors:  Shatakshee Chatterjee; Partha P Majumder; Priyanka Pandey
Journal:  J Genet       Date:  2016-09       Impact factor: 1.166

2.  Gene expression of functionally-related genes coevolves across fungal species: detecting coevolution of gene expression using phylogenetic comparative methods.

Authors:  Alexander L Cope; Brian C O'Meara; Michael A Gilchrist
Journal:  BMC Genomics       Date:  2020-05-20       Impact factor: 3.969

3.  TreeExp2: An Integrated Framework for Phylogenetic Transcriptome Analysis.

Authors:  Jingwen Yang; Hang Ruan; Wenjie Xu; Xun Gu
Journal:  Genome Biol Evol       Date:  2019-11-01       Impact factor: 3.416

4.  Evolutionary conservation and divergence of the human brain transcriptome.

Authors:  William G Pembroke; Christopher L Hartl; Daniel H Geschwind
Journal:  Genome Biol       Date:  2021-01-29       Impact factor: 17.906

5.  RNA: An Expanding View of Function and Evolution.

Authors:  Xinwei Han; Yuan Chen; Liuyang Wang; Wenwen Fang; Ning Zhang; Qiyun Zhu
Journal:  Evol Bioinform Online       Date:  2016-01-14       Impact factor: 1.625
