MGMR: leveraging RNA-Seq population data to optimize expression estimation.

Abstract

BACKGROUND: RNA-Seq is a technique that uses Next Generation Sequencing to identify transcripts and estimate transcription levels. When applying this technique for quantification, one must contend with reads that align to multiple positions in the genome (multireads). Previous efforts to resolve multireads have shown that RNA-Seq expression estimation can be improved using probabilistic allocation of reads to genes. These methods use a probabilistic generative model for data generation and resolve ambiguity using likelihood-based approaches. In many instances, RNA-Seq experiments are performed in the context of a population. The generative models of current methods do not take into account such population information, and it is an open question whether this information can improve quantification of the individual samples.
RESULTS: In order to explore the contribution of population level information in RNA-seq quantification, we apply a hierarchical probabilistic generative model, which assumes that expression levels of different individuals are sampled from a Dirichlet distribution with parameters specific to the population, and reads are sampled from the distribution of expression levels. We introduce an optimization procedure for the estimation of the model parameters, and use HapMap data and simulated data to demonstrate that the model yields a significant improvement in the accuracy of expression levels of paralogous genes.
CONCLUSIONS: We provide a proof of principle of the benefit of drawing on population commonalities to estimate expression. The results of our experiments demonstrate that this approach can be beneficial, primarily for estimation at the gene level.

Year: 2012 PMID: 22537041 PMCID: PMC3358656 DOI: 10.1186/1471-2105-13-S6-S2

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Introduction

With the rapid decline in the cost of sequencing, RNA-Seq has emerged as a legitimate competitor to microarrays as a means of assessing global gene expression. Even as arrays currently enjoy a cost advantage, many new applications of information accessible only through sequencing further strengthen the case that sequencing may soon supplant arrays as the technology of choice for transcription analysis. One such application is fine-grained assessment of variation in expression and the sources for such variation, as exemplified by recent large-scale RNA-Seq studies [1,2] of two different HapMap [3] populations. Such studies complement genomic DNA sequencing by elucidating the link between SNPs and expression. Unfortunately, with any new technology come its limitations, and several studies have pointed out that RNA-Seq is not immune to bias [4-6]. Perhaps the most widely discussed hurdle to accurate estimation in the case of RNA-Seq is the problem of reads mapped to multiple locations in the target genome (or in the target transcript sequences). These reads, which are called multireads, can stem from either paralogous gene sequences or from different isoforms of the same gene that share exons. Several methods have emerged to address the multiread problem for paralog and isoform estimation [7-10]. These methods are all based on probabilistic modeling that is optimized by an expectation maximization procedure. It has been repeatedly shown that by using such methods one can get better quantification of the expression levels compared to quantification based on naive approaches of read assignment. In many applications, a set of samples is studied. For instance, one may be interested in comparing the expression levels in cases versus controls, or in tissues originating from different organs.
In such cases, it is plausible that the commonality of expression patterns within each of the defined groups of studied samples may be used to improve the quantification results in each of the samples. We demonstrate that by analyzing expression profiles of a population together, one gets expression estimates more accurate than those obtained by estimating individual sample expression levels independently. We describe and implement a probabilistic model of the sequencing process that incorporates population commonalities, and demonstrate its advantages over existing methods in the population setting.

Methods

RNA-Seq multiread expression estimation

As we seek to extend the prevalent generative model of RNA-Seq [7-11], we begin by reviewing the basic elements of that model. Let $G = (G_1, \ldots, G_M)$ be the set of $M$ transcribed regions considered and $P = (P_1, \ldots, P_M)$ be the proportions of RNA bases attributed to each transcript out of the total number of transcribed bases in a sequenced sample. Regions may be either genes or transcripts, depending on the level of transcription being investigated. We require $P$ to satisfy $\sum_{k=1}^{M} P_k = 1$ and $0 \le P_k \le 1$ for every $k$. The model describes an RNA sequencing experiment where regions in $G$ are randomly chosen according to the distribution $P$, start positions in these regions are chosen uniformly, and reads of length $\ell$ are generated by copying $\ell$ consecutive bases from each chosen region to produce a set of reads $R = (r_1, \ldots, r_n)$. Sequencing is assumed to be error prone, leading to a certain probability of error for each read base. Based on the repetitions present in the set of regions and errors in alignment, reads may fail to map to the region from which they originate or may map to additional locations. Thus, we assign a probability of obtaining read $r_j$ given that it originated from region $G_k$:

$$P(r_j \mid G_k) = \frac{1}{\ell_k}\,\epsilon^{\mathrm{error}_{jk}}\,(1-\epsilon)^{\ell - \mathrm{error}_{jk}}$$

In this case we rely on the alignment of $r_j$ to $G_k$ to afford us the best match position instead of summing over all possible starting positions. $\ell_k$ is the effective length of $G_k$ (i.e., the number of start positions from which a full length read can be derived) as defined in [11], $\epsilon$ is taken to be a constant per-base error rate, errors are assumed to be independent, and $\mathrm{error}_{jk}$ is the number of mismatches in the best alignment of $r_j$ to $G_k$. This formulation leads to the likelihood of observing the data:

$$L(P; R) = \prod_{j=1}^{n} \sum_{k=1}^{M} P_k\, P(r_j \mid G_k) \qquad (1)$$

This likelihood function is used to estimate $P$ given the read alignments. Typically, one will use expectation maximization to find the $P$ for which the likelihood is maximized. It is assumed that $P(r_j \mid G_k)$ is zero for all regions to which $r_j$ is not aligned.
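The single-sample estimation described above can be sketched in a few lines of Python. This is a minimal illustration, not the SEQEM or RSEM implementation: it assumes the read probability takes the form (1/ℓ_k) ε^error (1-ε)^(ℓ-error) implied by the quantities defined above, and the names `read_prob`, `em_proportions` and the toy read set are hypothetical.

```python
def read_prob(eff_len, n_err, read_len, eps=0.01):
    """Assumed form of P(r | G): uniform start position times
    independent per-base errors at rate eps."""
    return (1.0 / eff_len) * eps ** n_err * (1.0 - eps) ** (read_len - n_err)


def em_proportions(alignments, n_regions, n_iter=100):
    """Maximum-likelihood proportions for one sample.

    alignments: one dict per read, mapping region index -> P(r | G)
    for the regions the read aligns to (all other regions count as zero).
    """
    p = [1.0 / n_regions] * n_regions
    for _ in range(n_iter):
        counts = [0.0] * n_regions
        for q in alignments:
            total = sum(p[k] * q_k for k, q_k in q.items())
            for k, q_k in q.items():
                # E step: fractional assignment of the read to region k
                counts[k] += p[k] * q_k / total
        # M step: proportions are the normalized expected counts
        p = [c / len(alignments) for c in counts]
    return p


# Toy data (hypothetical): 6 reads unique to region 0, 2 unique to
# region 1, and 2 multireads matching both regions equally well.
q0 = read_prob(eff_len=100, n_err=0, read_len=35)
toy_reads = [{0: q0}] * 6 + [{1: q0}] * 2 + [{0: q0, 1: q0}] * 2
p_hat = em_proportions(toy_reads, n_regions=2)
```

In this toy case the multireads end up split in proportion to the evidence from the uniquely mapped reads, so the estimate settles at three quarters for region 0 and one quarter for region 1.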

Common population extension

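To estimate expression levels in $N$ individuals from a defined population, we modify the above model by assuming that samples are drawn from a common population. This is imposed by having $P = [(P_{11}, \ldots, P_{1M}), \ldots, (P_{N1}, \ldots, P_{NM})]$ be probability densities drawn from a common Dirichlet distribution, defined by a set of hyper-parameters specific to the population: $\forall i \in [1, N]$, $p_i = (P_{i1}, \ldots, P_{iM}) \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_M)$. For sample $i$, we denote the set of reads as $R_i = (r_{i1}, \ldots, r_{in_i})$, where each $r_{ij}$ is mapped to one or more regions in $G$. The output of a read alignment program defines the set of accepted regions for the read (in practice only alignments with up to 2 errors are accepted) and provides the number of errors in alignment for each read-region pair. This allows us to calculate $P(r_{ij} \mid G_k)$ as done above for one sample. For convenience we denote $P(r_{ij} \mid G_k) = q_{ijk}$ (taken to be zero for all regions not mapped to), which is independent of $P$ and $\alpha$. As before, our objective is to estimate $P$, but in this case we must optimize by estimating $P$ and $\alpha$ together. We begin by writing the likelihood function:

$$L(P, \alpha; R) = \prod_{i=1}^{N} P(p_i \mid \alpha)\, P(R_i \mid p_i) \qquad (2)$$

Since expression values are sampled from the Dirichlet distribution,

$$P(p_i \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{M} P_{ik}^{\alpha_k - 1} \qquad (3)$$

where

$$B(\alpha) = \frac{\prod_{k=1}^{M} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{M} \alpha_k\right)} \qquad (4)$$

and, similar to (1) above,

$$P(R_i \mid p_i) = \prod_{j=1}^{n_i} \sum_{k=1}^{M} P_{ik}\, q_{ijk} \qquad (5)$$

This leads to

$$L(P, \alpha; R) = \prod_{i=1}^{N} \left[ \frac{1}{B(\alpha)} \prod_{k=1}^{M} P_{ik}^{\alpha_k - 1} \prod_{j=1}^{n_i} \sum_{k=1}^{M} P_{ik}\, q_{ijk} \right] \qquad (6)$$

Taking the log, we get

$$\log L = N \left[ \log \Gamma\left(\sum_{k=1}^{M} \alpha_k\right) - \sum_{k=1}^{M} \log \Gamma(\alpha_k) \right] + \sum_{i=1}^{N} \sum_{k=1}^{M} (\alpha_k - 1) \log P_{ik} + \sum_{i=1}^{N} \sum_{j=1}^{n_i} \log \sum_{k=1}^{M} P_{ik}\, q_{ijk} \qquad (7)$$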
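To make the objective concrete, the following sketch evaluates the population log-likelihood for given proportions and hyper-parameters. It assumes the standard Dirichlet density for each sample's proportions and a per-read mixture over aligned regions, as the text describes; the function name `mgmr_log_likelihood` and the input layout are illustrative choices, not the authors' MATLAB code.

```python
import math


def mgmr_log_likelihood(P, alpha, Q):
    """Log-likelihood of all samples under the hierarchical model.

    P:     N lists of M proportions (one list per sample, all entries > 0)
    alpha: M Dirichlet hyper-parameters shared by the population
    Q:     Q[i] is the list of reads of sample i; each read is a dict
           mapping region index -> q_ijk for the regions it aligns to
    """
    # log(1 / B(alpha)), identical for every sample
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    total = 0.0
    for p_i, reads in zip(P, Q):
        # Dirichlet prior term for this sample's proportions
        total += log_norm
        total += sum((a - 1.0) * math.log(p) for a, p in zip(alpha, p_i))
        # Read term: each read is a mixture over its aligned regions
        for q in reads:
            total += math.log(sum(p_i[k] * q_k for k, q_k in q.items()))
    return total


# Tiny check (hypothetical values): one sample, two regions, one read
# that aligns only to region 0 with q = 1.
ll_flat = mgmr_log_likelihood([[0.5, 0.5]], [1.0, 1.0], [[{0: 1.0}]])
ll_peaked = mgmr_log_likelihood([[0.5, 0.5]], [2.0, 2.0], [[{0: 1.0}]])
```

With all hyper-parameters equal to one the prior is flat and only the read term remains, which is why a vector of ones is a neutral starting point for the hyper-parameters.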

Multi-Genome Multi-Read (MGMR) algorithm

We wish to estimate $\alpha$ and $p_1, \ldots, p_N$ to maximize equation (7) above. For this purpose, we adopt an alternating iterative procedure of estimating $\alpha$ given the current estimate of $p_1, \ldots, p_N$ and vice-versa until the total change in the estimates becomes sufficiently small (or until a pre-set number of iterations have been executed). Although for EM-based estimation methods convexity guarantees an optimal solution will be obtained, here (as shall be seen below) we have no such guarantee. Thus, we confine updates to be local by performing only one update for $P$ and one for $\alpha$. By one MGMR iteration, we refer to one EM-based $P$ update followed by one $\alpha$ update.

Estimating P given α

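If we assume $\alpha$ is given, we can write the EM steps to find $p_1, \ldots, p_N$:

E step. Letting $Z$ signify a matching between reads and regions, and Match(j) be the region from which read $j$ originates, we get for each sample $i$:

$$Q(p_i \mid p_i^{(t)}) = E_{Z \mid R_i,\, p_i^{(t)}} \left[ \log P(R_i, Z, p_i \mid \alpha) \right]$$

which leads to

$$Q(p_i \mid p_i^{(t)}) = -\log B(\alpha) + \sum_{k=1}^{M} (\alpha_k - 1) \log P_{ik} + \sum_{j=1}^{n_i} \sum_{k=1}^{M} a_{ijk} \left( \log P_{ik} + \log q_{ijk} \right)$$

where

$$a_{ijk} = P(\mathrm{Match}(j) = k \mid r_{ij}, p_i^{(t)}) = \frac{P_{ik}^{(t)}\, q_{ijk}}{\sum_{k'=1}^{M} P_{ik'}^{(t)}\, q_{ijk'}}$$

M step. Given that each $q_{ijk}$ is fixed, the above reduces to maximizing

$$\sum_{k=1}^{M} \left( \alpha_k - 1 + \sum_{j=1}^{n_i} a_{ijk} \right) \log P_{ik} \quad \text{subject to} \quad \sum_{k=1}^{M} P_{ik} = 1$$

Using Lagrange multipliers and differentiating, we see that this is maximized with

$$P_{ik} = \frac{\sum_{j=1}^{n_i} a_{ijk} + \alpha_k - 1}{\sum_{k'=1}^{M} \left( \sum_{j=1}^{n_i} a_{ijk'} + \alpha_{k'} - 1 \right)} \qquad (14)$$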
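One EM update of a sample's proportions under a fixed Dirichlet prior can be sketched as follows. The sketch assumes the usual maximum a posteriori form, in which each expected read count is augmented by the corresponding hyper-parameter minus one before normalizing; `map_em_update` and the toy reads are hypothetical names and data.

```python
def map_em_update(p_i, alpha, reads):
    """One EM update of a single sample's proportions with alpha fixed.

    p_i:   current proportions for the sample (length M)
    alpha: Dirichlet hyper-parameters (length M)
    reads: list of dicts, region index -> q for each aligned region
    """
    counts = [0.0] * len(p_i)
    for q in reads:
        total = sum(p_i[k] * q_k for k, q_k in q.items())
        for k, q_k in q.items():
            # E step: posterior probability that the read came from region k
            counts[k] += p_i[k] * q_k / total
    # M step: expected counts plus (alpha_k - 1), then normalize.
    # A numerator can become negative when alpha_k < 1 and the count is small.
    numer = [c + a - 1.0 for c, a in zip(counts, alpha)]
    denom = sum(numer)
    return [x / denom for x in numer]


# Toy data (hypothetical): 6 reads unique to region 0, 2 unique to
# region 1, 2 multireads with equal support for both regions.
toy_reads = [{0: 1.0}] * 6 + [{1: 1.0}] * 2 + [{0: 1.0, 1: 1.0}] * 2
p_flat = map_em_update([0.5, 0.5], [1.0, 1.0], toy_reads)   # flat prior
p_prior = map_em_update([0.5, 0.5], [3.0, 3.0], toy_reads)  # symmetric prior
```

With a flat prior the update coincides with the ordinary maximum-likelihood EM step, while larger symmetric hyper-parameters pull the estimate toward equal proportions, which is the mechanism by which population information enters each sample's estimate.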

Estimating α given P

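Given a new estimate for $P$, we can use a fixed point iteration [12] to get a new estimate of $\alpha$. Let $F(\alpha)$ denote the part of the log-likelihood that depends on $\alpha$, and let $\hat{\alpha}$ be the current estimate. By using the known bound $\Gamma(x) \ge \Gamma(\hat{x}) \exp\left((x - \hat{x}) \Psi(\hat{x})\right)$, where $\Psi$ is the digamma function, we can get a lower bound on $F(\alpha)$:

$$F(\alpha) \ge N \left[ \left( \sum_{k=1}^{M} \alpha_k \right) \Psi\left( \sum_{k=1}^{M} \hat{\alpha}_k \right) - \sum_{k=1}^{M} \log \Gamma(\alpha_k) + \sum_{k=1}^{M} (\alpha_k - 1) \log \bar{p}_k \right] + C \qquad (15)$$

where $\log \bar{p}_k = \frac{1}{N} \sum_{i=1}^{N} \log P_{ik}$ and $C$ is constant with respect to $\alpha$. We maximize this bound with a fixed point iteration similar to EM, noting that for fixed values of $P$ convergence is guaranteed, and that for the Dirichlet distribution the maximum is the only stationary point [12]. This leads to the update

$$\Psi(\alpha_k^{new}) = \Psi\left( \sum_{k'=1}^{M} \alpha_{k'}^{old} \right) + \log \bar{p}_k$$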
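A fixed-point update of the hyper-parameters can be sketched in pure Python. The sketch assumes the standard fixed-point scheme for Dirichlet maximum likelihood, in which the digamma of each new hyper-parameter equals the digamma of the current sum plus the mean log proportion for that region. The digamma approximation, the bisection-based inverse, and all function names are illustrative choices made here to avoid external dependencies, not the authors' implementation.

```python
import math


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def inv_digamma(y, lo=1e-8, hi=1e8, n_iter=200):
    """Invert digamma by bisection on a log scale (digamma is increasing)."""
    for _ in range(n_iter):
        mid = math.sqrt(lo * hi)
        if digamma(mid) < y:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def mean_log(P):
    """Per-region mean of log proportions across samples."""
    n = len(P)
    return [sum(math.log(p_i[k]) for p_i in P) / n for k in range(len(P[0]))]


def update_alpha(alpha, P, n_steps=1):
    """Fixed-point update(s) of alpha given the proportions P."""
    log_pbar = mean_log(P)
    for _ in range(n_steps):
        s = digamma(sum(alpha))
        alpha = [inv_digamma(s + lp) for lp in log_pbar]
    return alpha


def dirichlet_objective(alpha, P):
    """Alpha-dependent part of the log-likelihood for fixed P."""
    log_pbar = mean_log(P)
    return len(P) * (math.lgamma(sum(alpha))
                     - sum(math.lgamma(a) for a in alpha)
                     + sum((a - 1.0) * lp for a, lp in zip(alpha, log_pbar)))


# Toy proportions (hypothetical): three samples, two regions.
P_toy = [[0.6, 0.4], [0.7, 0.3], [0.5, 0.5]]
alpha0 = [1.0, 1.0]
alpha1 = update_alpha(alpha0, P_toy)
```

Because each step maximizes a lower bound that touches the objective at the current estimate, a single update should not decrease the objective for fixed proportions; comparing `dirichlet_objective` before and after a step is a quick sanity check.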

Heuristics/Implementation

As we have found that F(α) presented in equation (15) is non-convex even in 2 dimensions (Figure 1), we confine updates to be local by allowing only one update for each of the α and P estimation steps at each MGMR iteration. For genes with EM expression estimates equaling zero in all samples we substitute $10^{-20}$ for their values in MGMR to avoid taking the log of zero. For P updates (e.g., equation 14), we avoid potentially negative P values by adding one to each alpha (thus ignoring -1 in the numerator and denominator). We implemented the algorithm in MATLAB, where the inputs required are read-gene map files for each sample as in SEQEM [7], and an initial P estimate matrix. Alphas are initialized as an M-length vector of ones.

Figure 1

A mesh representation of F(α) [equation (15)] showing non-convex behavior. P is a 10 × 2 constant matrix and α is varied on [0:50,0:50]. The case shown is for N = 10, M = 2 (ten samples, two α parameters). Non-convex behavior is demonstrated by the values on the plane defined by α1 = .06 on the range [0,50] on the right.

Results

Experimental setup

As in [7,9,10], we examined MGMR's accuracy by comparing its estimates of known expression levels with those of existing methods, namely SEQEM [7] and RSEM [9,10]. The initial "known" expression levels were estimates obtained from RNA-Seq samples; how these estimates were obtained is described below. In our case, we had to simulate a population sharing similar expression levels and the same set of gene regions. Our experiment differed in that we sought to use additional information to improve on the estimates of these existing methods. These methods were designed to estimate expression of single samples, and each had specific advantages which we disregarded in our comparison. For example, we ignored both SEQEM's ability to utilize SNP information and RSEM's ability to allow estimation on assembled transcripts by using only reference sequences.

Simulating data

To derive artificial reads, we first estimated expression on real biological samples using one method and then used the resulting distribution of expression values to generate simulated datasets for testing. Real samples were drawn from lymphoblastoid cells of the Yoruba in Ibadan (YRI) population [2,3]. As MGMR requires initial expression estimates, the estimate derived from the method it was being compared with in each case was input to MGMR. Thus, the four initial estimates used were from SEQEM, MGMR(SEQEM) (namely, MGMR initialized by SEQEM's result), RSEM, and MGMR(RSEM) (namely, MGMR initialized by RSEM estimates). We denote these four estimates SEQEM-A, SEQEM-B, RSEM-A and RSEM-B, respectively. We simulated reads by randomly selecting start sites falling within gene boundaries and extracting sequences from those positions. Read lengths were defined for each experiment, and simulations were repeated multiple times to account for randomness in the sampling process. To derive the sequence set for the SEQEM comparison, we expanded upon the procedure used in [7]. There, SEQEM was shown to improve estimation of paralogous gene expression on a set of exon sequences from 51 Homo sapiens chromosome 1 paralogs from the HomoloGene [13] database. We extracted a larger set of sequences containing all HomoloGene paralogs in Homo sapiens that do not overlap in genomic coordinates and that have at least one exon longer than twice the read length used. We required this minimal length because sequences were sampled from exons, and we needed to ensure enough positions existed for full length reads to be sampled from these exons. 285 such genes remained (for reads of length 35 bp), and these were the genes on which expression was tested and from which read sequences were derived. The SEQEM-A and SEQEM-B read sets were generated based on randomly selected exons from these genes and the expression levels from the SEQEM-A and SEQEM-B estimates taken on 20 YRI individuals.
The read length of 35 bp corresponded to that of the YRI samples, and a coverage level of 20 was chosen, as this was the level at which SEQEM was shown to perform best in [7]. We performed a total of 30 repetitions of read simulations, where each repetition consisted of 20 samples (corresponding to the original 20 YRI samples used). For the RSEM-A and RSEM-B read sets, the transcript set used was also obtained by filtering the HomoloGene database to avoid gene overlaps, but no length filtering was required: reads were now sampled directly from transcripts which all had effective lengths greater than the read length used. 524 transcripts corresponding to 265 genes survived this filtering. For these read sets, we produced 30 repetitions of 74 samples, where each consisted of 100 bp reads at a coverage level of 20. In all other respects the sampling process and read generation steps were identical to those performed for the SEQEM-A and SEQEM-B read sets.
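The read sampling step described above can be illustrated with a short simulator. This is a sketch under stated assumptions: regions are chosen according to the given proportions, start sites are uniform over positions that yield a full-length read, and no sequencing errors are injected. The coverage helper assumes coverage means total read bases divided by total region bases. All names and the toy sequences are hypothetical.

```python
import random


def reads_for_coverage(total_bases, read_len, coverage):
    """Number of reads needed for a target coverage (assumed definition:
    read bases divided by region bases)."""
    return round(coverage * total_bases / read_len)


def simulate_reads(regions, proportions, read_len, n_reads, seed=0):
    """Sample error-free reads from named region sequences.

    regions:     dict name -> sequence string (each at least read_len long)
    proportions: sampling weight of each region, in dict order
    """
    rng = random.Random(seed)
    names = list(regions)
    reads = []
    for _ in range(n_reads):
        # Choose the source region according to the expression proportions
        name = rng.choices(names, weights=proportions, k=1)[0]
        seq = regions[name]
        # Choose a start site that leaves room for a full-length read
        start = rng.randrange(len(seq) - read_len + 1)
        reads.append((name, seq[start:start + read_len]))
    return reads


# Toy regions (hypothetical sequences, not real genes)
toy_regions = {"geneA": "ACGT" * 10, "geneB": "TTGCA" * 8}
n_toy = reads_for_coverage(total_bases=80, read_len=10, coverage=20)
toy_sim = simulate_reads(toy_regions, [0.7, 0.3], read_len=10, n_reads=n_toy)
```

Keeping the source region alongside each read, as done here, makes it straightforward to compare estimated proportions against the ones used for sampling.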

Error measures

Accuracy was assessed by three error measures, the first two of which were applied in [7]: the relative error rate (E), the chi-squared difference (χ²), and the Kullback-Leibler divergence (KL). Each measure compares P, the estimated distribution generated by the tested algorithm, with Q, the true distribution. Error measures were averaged over all repetitions per sample, and then over all samples.
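For orientation, the three kinds of measure can be sketched as below. The exact formulas used in the study are not given here, so the definitions in this sketch are assumptions based on common textbook forms: mean absolute deviation relative to the true value, squared deviation scaled by the true value, and Kullback-Leibler divergence taken from the estimate to the truth. Normalization and the direction of the divergence in the original work may differ.

```python
import math


def relative_error(p, q):
    """Mean of |p_k - q_k| / q_k over regions (assumed form; needs q_k > 0)."""
    return sum(abs(a - b) / b for a, b in zip(p, q)) / len(q)


def chi_squared_difference(p, q):
    """Sum of (p_k - q_k)^2 / q_k over regions (assumed form; needs q_k > 0)."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))


def kl_divergence(p, q):
    """Sum of p_k * log(p_k / q_k), skipping terms with p_k == 0
    (assumed direction: estimate relative to truth)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Toy distributions (hypothetical): estimate vs. truth over two regions
estimate = [0.5, 0.5]
truth = [0.25, 0.75]
toy_errors = (
    relative_error(estimate, truth),
    chi_squared_difference(estimate, truth),
    kl_divergence(estimate, truth),
)
```

All three functions return zero when the estimate equals the truth, and all weight deviations more heavily for regions with low true expression.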

Simulated data with priors based on real estimates - estimating paralogous gene expression

To test performance on paralogous gene estimates, we set out to compare independent sample SEQEM estimates with MGMR's common population estimates. Before applying SEQEM, we looked to address one criticism of it from [11], where it was suggested that SEQEM's estimation could be improved by incorporation of transcript length correction. Upon examination, we found the effect of this correction was an increase in accuracy, and thus we maintained it for subsequent tests. This improvement is documented in the appendix. With this correction in place, we estimated expression levels on the SEQEM-A and SEQEM-B read sets, applying SEQEM and MGMR to each. Outputs were recorded at 1-10, 20, 30, 40, 50 and 100 iterations for MGMR and at 100 iterations for SEQEM. The results are shown in Figure 2. We observed that both error and variance levels dropped sharply within just a few iterations for MGMR, and converged to significantly better estimates on average than SEQEM. These trends were consistent across all three error measures [Table 1]. Variance seemed to diminish more with MGMR over time, as might be expected for a method that shares information across samples. Notably, MGMR outperformed SEQEM on estimates for samples based on SEQEM-A, where sample estimates were obtained independently (and thus we expect the variation inherent in the real samples to be maintained).

Figure 2

Relative error measured on SEQEM-A and SEQEM-B data sets. MGMR outputs on SEQEM-A and SEQEM-B initializations were compared with SEQEM up to 100 iterations. MGMR outputs were recorded at 1-10, 20, 30, 40, 50 and 100 iterations. The first few iterations have been trimmed to allow a compact presentation.

Table 1

MGMR vs. SEQEM error at 100 iterations on SEQEM-A and SEQEM-B data sets

Sampling	SEQEM-A	SEQEM-A	SEQEM-A	SEQEM-A	SEQEM-B	SEQEM-B	SEQEM-B	SEQEM-B
Method	SEQEM	SEQEM	MGMR	MGMR	SEQEM	SEQEM	MGMR	MGMR
Measure	Error	SD	Error	SD	Error	SD	Error	SD
E	1.27	1 * 10^-2	1.03	0.14	1.50	0.70	0.82	6 * 10^-3
χ²	0.66	2 * 10^-3	0.22	4 * 10^-3	0.69	0.05	0.27	1 * 10^-4
KL	0.29	7 * 10^-4	0.14	1 * 10^-4	0.18	2 * 10^-4	0.17	1 * 10^-4

These data sets were derived from SEQEM and MGMR(SEQEM) estimates, respectively, on 20 YRI samples. (E: relative error rate; χ²: Chi-squared error; KL: Kullback-Leibler divergence; SD: standard deviation)

Simulated data with priors based on real samples - estimating transcript level expression

We also sought to examine whether MGMR can improve results in the more challenging setting of estimating transcript level expression. Here, we expect estimates to be noisier due to low expression values in the real samples, and we must contend with multiread mappings due to paralogous genes as well as to isoforms of particular genes sharing subsequences as a result of alternative splicing. In anticipation of this challenge, we used a larger set consisting of 74 single-end YRI samples as the real data source and simulated 100 bp reads instead of 35 bp. This was expected to be a difficult case for estimation, as all genes in the set are paralogs and many have multiple isoforms, as described in the section "Simulating data." Once expression estimation was performed on the YRI samples and read sets RSEM-A and RSEM-B were generated, we again performed expression estimates with RSEM and MGMR on each set. In this case, unfortunately, we found that the results did not exhibit a consistent trend as before and overall appeared inconclusive. These results are summarized in Table 2. It remains to be seen why the error results differ according to the level of estimation (gene vs. transcript) performed.

Table 2

MGMR vs. RSEM error at 100 iterations on RSEM-A and RSEM-B data sets

Sampling	RSEM-A	RSEM-A	RSEM-A	RSEM-A	RSEM-B	RSEM-B	RSEM-B	RSEM-B
Method	RSEM	RSEM	MGMR	MGMR	RSEM	RSEM	MGMR	MGMR
Measure	Error	SD	Error	SD	Error	SD	Error	SD
E	0.1	1 * 10^-3	0.69	1 * 10^-3	1.0	1 * 10^-4	0.61	1 * 10^-3
χ²	0.02	6 * 10^-4	1.25	0.01	0.02	9 * 10^-4	0.58	3 * 10^-4
KL	1.5	0.22	0.6	1 * 10^-3	0.8	0.11	0.38	6 * 10^-4

E: relative error rate; χ²: Chi-squared error; KL: Kullback-Leibler divergence; SD: standard deviation

Conclusion

As shown by the 1000 Genomes and HapMap projects, one of the drives of modern genetics and bioinformatics research is to characterize variation in populations. Because of cost and time constraints, such projects have only recently become feasible. In addition to such studies assessing genomic variation and its relation to disease phenotypes based on DNA, it is anticipated that RNA-Seq population studies will also grow in popularity to more directly assign functional significance to variant loci by means of transcription measures. Thus, it becomes essential to accurately measure the expression levels from each individual to characterize such variation. Here, we have shown that for one common study design an unexpected benefit can arise. When individuals in these studies are drawn from the same population, the estimates made on each can be made more accurate because of the commonalities among population members. A shortcoming of the MGMR approach is that since it assumes commonality among the samples, outlier samples will be attracted towards the common denominator, and thus appear more similar to the group profile than they really are. In particular, if the data are subject to differential expression analysis, MGMR may reduce the number of differentially expressed genes. We have investigated the efficacy of MGMR in tackling two typical experimental settings - inferring expression levels of paralogs at the gene level, and of isoforms (also drawn from a difficult set of paralogs). Although substantial gains were obtained in the first, more inquiry is required to demonstrate a benefit in the latter. It is worth noting that in each case at least a quarter of the regions considered showed improvement, as shown in Table 3. With these results, we submit a proof of concept that population structure can aid in estimation of expression levels for RNA-Seq samples.

Table 3

Proportion of genes for which MGMR improves estimates on different data sets

	SEQEM-A	SEQEM-B	RSEM-A	RSEM-B
Proportion	104/285	78/285	126/524	173/524
%	36.5	27.3	24.0	33.0

Proportions of regions (genes for SEQEM and transcripts for RSEM, respectively) for which MGMR has lower relative error on average than each method compared to.

List of abbreviations

E: relative error rate; χ²: Chi-squared error; KL: Kullback-Leibler divergence; SD: standard deviation; bp: base pair

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RR and EH developed the method. RS and EH designed the experiments. RR implemented the method and performed experiments. RR, EH, and RS analyzed results and wrote the manuscript. All authors read and approved the final manuscript.

10 in total

1. Accurate estimation of expression levels of homologous genes in RNA-seq experiments.

Authors: Bogdan Paşaniuc; Noah Zaitlen; Eran Halperin
Journal: J Comput Biol Date: 2011-03 Impact factor: 1.479

2. Transcriptome genetics using second generation sequencing in a Caucasian population.

Authors: Stephen B Montgomery; Micha Sammeth; Maria Gutierrez-Arcelus; Radoslaw P Lach; Catherine Ingle; James Nisbett; Roderic Guigo; Emmanouil T Dermitzakis
Journal: Nature Date: 2010-03-10 Impact factor: 49.962

3. A scaling normalization method for differential expression analysis of RNA-seq data.

Authors: Mark D Robinson; Alicia Oshlack
Journal: Genome Biol Date: 2010-03-02 Impact factor: 13.583

4. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments.

Authors: James H Bullard; Elizabeth Purdom; Kasper D Hansen; Sandrine Dudoit
Journal: BMC Bioinformatics Date: 2010-02-18 Impact factor: 3.169

5. RNA-Seq gene expression estimation with read mapping uncertainty.

Authors: Bo Li; Victor Ruotti; Ron M Stewart; James A Thomson; Colin N Dewey
Journal: Bioinformatics Date: 2009-12-18 Impact factor: 6.937

6. A second generation human haplotype map of over 3.1 million SNPs.

Authors: Kelly A Frazer; Dennis G Ballinger; David R Cox; David A Hinds; Laura L Stuve; Richard A Gibbs; John W Belmont; Andrew Boudreau; Paul Hardenbol; Suzanne M Leal; Shiran Pasternak; David A Wheeler; Thomas D Willis; Fuli Yu; Huanming Yang; Changqing Zeng; Yang Gao; Haoran Hu; Weitao Hu; Chaohua Li; Wei Lin; Siqi Liu; Hao Pan; Xiaoli Tang; Jian Wang; Wei Wang; Jun Yu; Bo Zhang; Qingrun Zhang; Hongbin Zhao; Hui Zhao; Jun Zhou; Stacey B Gabriel; Rachel Barry; Brendan Blumenstiel; Amy Camargo; Matthew Defelice; Maura Faggart; Mary Goyette; Supriya Gupta; Jamie Moore; Huy Nguyen; Robert C Onofrio; Melissa Parkin; Jessica Roy; Erich Stahl; Ellen Winchester; Liuda Ziaugra; David Altshuler; Yan Shen; Zhijian Yao; Wei Huang; Xun Chu; Yungang He; Li Jin; Yangfan Liu; Yayun Shen; Weiwei Sun; Haifeng Wang; Yi Wang; Ying Wang; Xiaoyan Xiong; Liang Xu; Mary M Y Waye; Stephen K W Tsui; Hong Xue; J Tze-Fei Wong; Luana M Galver; Jian-Bing Fan; Kevin Gunderson; Sarah S Murray; Arnold R Oliphant; Mark S Chee; Alexandre Montpetit; Fanny Chagnon; Vincent Ferretti; Martin Leboeuf; Jean-François Olivier; Michael S Phillips; Stéphanie Roumy; Clémentine Sallée; Andrei Verner; Thomas J Hudson; Pui-Yan Kwok; Dongmei Cai; Daniel C Koboldt; Raymond D Miller; Ludmila Pawlikowska; Patricia Taillon-Miller; Ming Xiao; Lap-Chee Tsui; William Mak; You Qiang Song; Paul K H Tam; Yusuke Nakamura; Takahisa Kawaguchi; Takuya Kitamoto; Takashi Morizono; Atsushi Nagashima; Yozo Ohnishi; Akihiro Sekine; Toshihiro Tanaka; Tatsuhiko Tsunoda; Panos Deloukas; Christine P Bird; Marcos Delgado; Emmanouil T Dermitzakis; Rhian Gwilliam; Sarah Hunt; Jonathan Morrison; Don Powell; Barbara E Stranger; Pamela Whittaker; David R Bentley; Mark J Daly; Paul I W de Bakker; Jeff Barrett; Yves R Chretien; Julian Maller; Steve McCarroll; Nick Patterson; Itsik Pe'er; Alkes Price; Shaun Purcell; Daniel J Richter; Pardis Sabeti; Richa Saxena; Stephen F Schaffner; Pak C Sham; Patrick Varilly; David Altshuler; Lincoln D 
Stein; Lalitha Krishnan; Albert Vernon Smith; Marcela K Tello-Ruiz; Gudmundur A Thorisson; Aravinda Chakravarti; Peter E Chen; David J Cutler; Carl S Kashuk; Shin Lin; Gonçalo R Abecasis; Weihua Guan; Yun Li; Heather M Munro; Zhaohui Steve Qin; Daryl J Thomas; Gilean McVean; Adam Auton; Leonardo Bottolo; Niall Cardin; Susana Eyheramendy; Colin Freeman; Jonathan Marchini; Simon Myers; Chris Spencer; Matthew Stephens; Peter Donnelly; Lon R Cardon; Geraldine Clarke; David M Evans; Andrew P Morris; Bruce S Weir; Tatsuhiko Tsunoda; James C Mullikin; Stephen T Sherry; Michael Feolo; Andrew Skol; Houcan Zhang; Changqing Zeng; Hui Zhao; Ichiro Matsuda; Yoshimitsu Fukushima; Darryl R Macer; Eiko Suda; Charles N Rotimi; Clement A Adebamowo; Ike Ajayi; Toyin Aniagwu; Patricia A Marshall; Chibuzor Nkwodimmah; Charmaine D M Royal; Mark F Leppert; Missy Dixon; Andy Peiffer; Renzong Qiu; Alastair Kent; Kazuto Kato; Norio Niikawa; Isaac F Adewole; Bartha M Knoppers; Morris W Foster; Ellen Wright Clayton; Jessica Watkin; Richard A Gibbs; John W Belmont; Donna Muzny; Lynne Nazareth; Erica Sodergren; George M Weinstock; David A Wheeler; Imtaz Yakub; Stacey B Gabriel; Robert C Onofrio; Daniel J Richter; Liuda Ziaugra; Bruce W Birren; Mark J Daly; David Altshuler; Richard K Wilson; Lucinda L Fulton; Jane Rogers; John Burton; Nigel P Carter; Christopher M Clee; Mark Griffiths; Matthew C Jones; Kirsten McLay; Robert W Plumb; Mark T Ross; Sarah K Sims; David L Willey; Zhu Chen; Hua Han; Le Kang; Martin Godbout; John C Wallenburg; Paul L'Archevêque; Guy Bellemare; Koji Saeki; Hongguang Wang; Daochang An; Hongbo Fu; Qing Li; Zhen Wang; Renwu Wang; Arthur L Holden; Lisa D Brooks; Jean E McEwen; Mark S Guyer; Vivian Ota Wang; Jane L Peterson; Michael Shi; Jack Spiegel; Lawrence M Sung; Lynn F Zacharia; Francis S Collins; Karen Kennedy; Ruth Jamieson; John Stewart
Journal: Nature Date: 2007-10-18 Impact factor: 49.962

7. Understanding mechanisms underlying human gene expression variation with RNA sequencing.

Authors: Joseph K Pickrell; John C Marioni; Athma A Pai; Jacob F Degner; Barbara E Engelhardt; Everlyne Nkadori; Jean-Baptiste Veyrieras; Matthew Stephens; Yoav Gilad; Jonathan K Pritchard
Journal: Nature Date: 2010-03-10 Impact factor: 49.962

8. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.

Authors: Bo Li; Colin N Dewey
Journal: BMC Bioinformatics Date: 2011-08-04 Impact factor: 3.307

9. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

Authors: Cole Trapnell; Brian A Williams; Geo Pertea; Ali Mortazavi; Gordon Kwan; Marijke J van Baren; Steven L Salzberg; Barbara J Wold; Lior Pachter
Journal: Nat Biotechnol Date: 2010-05-02 Impact factor: 54.908

10. Transcript length bias in RNA-seq data confounds systems biology.

Authors: Alicia Oshlack; Matthew J Wakefield
Journal: Biol Direct Date: 2009-04-16 Impact factor: 4.540
