Literature DB >> 27669167

Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation.

Michael I Love^1,2, John B Hogenesch³, Rafael A Irizarry^1,2.

Abstract

Entities: Chemical Disease Gene Species

Year: 2016 PMID： 27669167 PMCID： PMC5143225 DOI： 10.1038/nbt.3682

Source DB: PubMed Journal: Nat Biotechnol ISSN： 1087-0156 Impact factor: 54.908

We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to mis-identification of transcript isoforms. We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery. Obtaining transcript abundance information from RNA sequencing (RNA-seq) experiments relies on complex methods implemented in software such as Cufflinks [1] and RSEM [2]. However, RNA-seq data can suffer from sample-specific biases as a result of RNA extraction and library preparation steps[3-5]. Methods for estimating gene and transcript abundance attempt to mitigate the effect of technical biases by estimating sample-specific bias parameters. For gene-level expression, common normalization methods use the GC content and length of the gene[6-8], or identify batch effects by detecting structure in expression measurements that are common across genes and not associated with the experimental design[9-11]. At the transcript level, sample-specific biases that current methods correct for include the fragment length distribution induced by size selection, positional bias along the transcript due to RNA degradation and mRNA selection techniques, and sequence-based bias in read start positions arising from the differential binding efficiency of random hexamer primers[2,12-16] (Figure 1a and Supplementary Table 1). Even so, it is common to observe extreme variability in the coverage of RNA-seq fragments along transcripts that is purely technical and sample-specific[17] and not explained by current bias models, which confounds current methods designed to identify and quantify transcripts[18] (Figure 1b).

Figure 1

Quantification of transcript abundance from RNA-seq experiments. (a) RNA-seq biases. (b) Ignoring fragment sequence bias impairs transcript abundance estimation when isoform-specific regions have GC content or sequence features that lead to under-representation of fragments, resulting in false positives of predicted expression of isoforms that are lowly or not expressed.

To investigate the cause of systematic errors in transcript abundance estimates, we used existing data to evaluate currently available software. We downloaded 30 RNA-seq samples from the GEUVADIS Project of lymphoblastoid cell lines derived from the Toscani population, 15 of which were sequenced at one center and 15 at another[19] (Supplementary Table 2). We ran Cufflinks with its sequence bias removal option described in Roberts et al. [14] turned on. Performing a t-test on log(FPKM + 1) values from Cufflinks across centers, and filtering on Benjamini-Hochberg adjusted p values for a target 1% false discovery rate (FDR) resulted in 2,510 transcripts out of 25,588 (10%) with average FPKM greater than 0.1 reported as differentially expressed (Figure 2a) However, we expect that nearly all of the 2,510 transcripts reporting differential expression are false positives, as random permutations of samples resulted in no transcripts with differential expression. RSEM had similar rates of differentially expressed transcripts across center (Supplementary Figure 1). When comparing across center, 619 out of 6,761 genes with multiple isoforms and average FPKM greater than 0.1 for one or more isoform had changes in the reported major isoform across centers using FPKM values estimated by Cufflinks (Supplementary Table 3).

Figure 2

Problems with current transcript abundance estimation methods. (a) Volcano plot of a comparison of Cufflinks transcript estimates from center 1 against center 2, with 2,510 transcripts reported differentially expressed at a target FDR of 1% and 515 with family-wise error rate (FWER) of 1% using a more conservative Bonferroni correction. (b) Densities of GC content of isoform-specific regions from genes with two isoforms when one or more reported differential expression, compared to GC content of random exons. (c) Sashimi plot [30] for GEUVADIS samples in a region of the USF2 gene containing the alternative exon distinguishing two isoforms, with the GC content of each exon listed below the gene model. Samples from center 1 had a drop-out in coverage on the high GC exons, including the alternative exon. The curved lines in the Sashimi plot represent RNA-seq reads spanning exon-exon junctions with numbers indicating number of supporting reads. (d) Cufflinks FPKM estimates for the two isoforms of USF2 where technical artifacts in coverage lead to discordant estimates of abundance across center. (e) Curves fit by alpine estimating dependence of fragment rate on GC content after controlling for random hexamer priming bias of read starts (see main text and Supplementary Note).

We focused on genes with two isoforms to more easily identify what features underlie the difference in estimated expression. Out of 5,716 transcripts from genes with two isoforms in which at least one isoform had FPKM greater than 0.1, 566 transcripts reported differential expression across center according to Cufflinks estimated abundances, at a target FDR of 1%. For genes with one or more isoforms reported differentially expressed, the regions of the isoform that were exclusive to one or the other isoform had much higher GC-content (mode at 70% compared to 50%) than expected by chance (Figure 2b, Wilcoxon p < 0.0001). An example of a gene with differences in expression estimates across center is USF2 (Figure 2c). It is often short sequences that distinguish isoforms of a gene, which in some cases include stretches of high GC content. Because methods such as Cufflinks and RSEM employ a likelihood model that does not account for differences in coverage due to fragment sequence features like GC content, the drop in coverage for samples from center 1 results in a shift in expression estimates from NM003367 to NM207291, which does not include the high GC exon (Figure 2d). The samples from center 1 have dramatically reduced representation of high GC fragments compared to center 2, even after adjusting for differences due to random hexamer priming bias (Figure 2e). A GC content bias is often observed in high-throughput sequencing data, with fragments of certain GC content under-represented, and can be partially attributed to PCR amplification during library preparation[20]. For DNA-seq, this bias is best corrected by modeling at the scale of the fragment[21]. While one existing method for estimating percent of isoform expression assumed a positive linear relationship between exon counts and exon GC content[22], we confirmed other findings [21], that the GC content effect was often nonlinear and highly sample-specific, therefore requiring sample-specific bias estimation using smooth functions of fragment GC content. Additionally, stretches of high GC content sequences within a fragment can have an influence on whether a fragment will be amplified and sequenced [23]. To account for this bias, we built a bias modeling framework called alpine, using previously described bias features (fragment length, relative position, and read start sequence) as well as fragment GC content and the presence of long GC stretches within the fragment [23] to model the number of times a potential fragment of a transcript was observed (0,1,2,...). alpine takes as input a set of gene annotations and the RNA-seq reads aligned to the genome. We considered all potential fragments with lengths within the center of the fragment length distribution, at all possible positions consistent with the transcript's beginning and end. alpine employs a Poisson generalized linear model (GLM) for the count of each potential fragment within the transcript using the bias features described above, estimating bias parameters for each sample separately (Supplementary Note). The read start sequence bias parameters are estimated by alpine using the variable length Markov model (VLMM) proposed by Roberts et al. [14] and implemented in Cufflinks. After estimating bias parameters, alpine can predict transcript coverage or produce bias-corrected estimates of transcript abundance. We used a benchmarking dataset of 1,062 human in vitro transcribed (IVT) and sequenced cDNA clones mixed at various concentrations with mouse total RNA[17]. We focused our analysis on 64 of the IVT transcripts as exhibiting “high unpredictable coverage”[17]. We compared the predictive power of four bias models implemented in alpine: a read start model, a fragment GC content model, a fragment GC content and GC stretches model, and a model including all of these biases (Supplementary Figure 2). Additionally we compared the predictive power of the alpine bias models to the read start bias model mseq [12] that trains multiple additive regression trees (MART) on the local sequence context surrounding read start positions. The mseq method does not itself provide estimates of transcript abundance, but can be used to train a read start bias model and therefore to predict start locations of single-end reads for a test set of transcripts. A good bias model should be able to predict the pattern of fragment coverage along a transcript. For assessing predictions, we used 2-fold cross validation, such that two models were trained on two halves of the data, and always evaluated on transcripts that were not in the training set. For mseq, fragment coverage was extended from predicted start positions from both read pairs using the median fragment length. Predictive power was measured as the percent reduction of mean squared error in explaining raw fragment coverage, compared to a null model that predicted uniform coverage across the transcript. The models that included fragment GC content doubled the predictive power of the read start bias models, both the Cufflinks VLMM read start bias model implemented within alpine and mseq, which performed similarly (Figure 3a). The model that also included the information about GC stretches was more predictive than the model with just the fragment GC content, although only slightly so. The fragment sequence models accurately capture the drops in coverage that were not captured by the read start sequence models (Figure 3b, Supplementary Figures 3-4).

Figure 3

Modeling and correcting fragment sequence bias. (a) Boxplot comparing reduction in mean squared error (MSE) in predicting coverage for different bias models for all 8 samples and 64 transcripts (n=512). The models are “read start”: the Cufflinks VLMM for read starts, the mseq model for read starts, “GC”: a model using fragment GC content, “GC+str”: as in “GC” plus additional terms for stretches of high GC, “all”: the VLMM for read starts in addition to the terms in “GC+str”. The models including fragment GC doubled the reduction in MSE as the read start models. (b) Predicted coverage plots for bias models on GenBank BC011380 (raw coverage in blue, test-set predicted coverage in black). (c) and (d) Volcano plots of differential transcript expression across centers for genes with two isoforms (orange: Benjamini-Hochberg adjusted p values less than 1%; red: Bonferroni FWER rate less than 1%). (e) and (f) FPKM estimates for two isoforms of BASP1 across center. (g) ROC curves for a simulation of a confounded design, with fragment sequence bias drawn from 30 GEUVADIS samples. The “gold” ROC curve indicates the sensitivity and specificity using the true underlying fragment counts, without fragment sequence bias and with known transcript assignment for fragments (h) The partial area under the curve (AUC) for panel (g), considering false positive rate in [0, 0.2], scaled to take values between 0 and 1.

We then used our approach to compare the transcript abundance estimates from one center against the other in the 30 GEUVADIS samples. To more clearly show the performance with respect to differential isoform usage, we focused on 5,676 transcripts from genes with two isoforms, with average FPKM values estimated by Cufflinks greater than 0.1 for one of the two isoforms, and such that the two isoforms had at least one overlapping basepair. We compared log(FPKM + 1) estimates across center using a t-test. We found that including bias terms for fragment GC and GC stretches resulted in a fourfold decrease in the number of false positives at a target FDR of 1%: Cufflinks reported 562 differentially expressed transcripts, while alpine reported 141 (Figure 3c-d). Using a more conservative Bonferroni correction, Cufflinks reported 157 differentially expressed transcripts across center with FWER of 1%, while alpine reported only 37. In general, alpine greatly reduced across-center significant differences while within-center coefficient of variation of abundance estimates remained the same as for Cufflinks (Supplementary Figures 5-8). Likewise we observed reduced across-center differences for estimation of isoform percentages within the 2,838 genes with two isoforms. For each gene, we calculated the estimated percent expression of the major isoform for center 1 (a number ranging from 50% to 100% by definition), against the estimated percent expression of that same isoform in center 2. Correcting for fragment sequence bias using alpine reduced the number of extreme predicted changes in isoform percent when comparing across center (Supplementary Figure 9). An example of false positive for isoform switching is the two-isoform gene BASP1 (Figure 3e-f). We further compared the performance of alpine against new lightweight methods for estimating transcript abundance, Sailfish [24], kallisto [25], and Salmon [26]. Note that four of the methods evaluated, Cufflinks, kallisto, Salmon, and Sailfish all were run with read start sequence bias correction turned on (Supplementary Note) and obtained similar results (Supplementary Figure 10). When restricting positives to those transcripts with Benjamini-Hochberg adjusted p value less than 1% and additionally requiring that the log fold change be above a threshold (0.5, 1, or 2), alpine consistently had a lower percent of false positives out of the set of 5,676 transcripts than all other methods (Supplementary Figure 11). Using only read start bias terms in the alpine model did not provide visible improvements (Supplementary Figure 12). alpine bias estimates for the model including all bias terms (fragment length, read starts, fragment GC and GC stretches, relative position in transcript) are shown in Figure 2e and Supplementary Figure 13. The lightweight quantification methods generated the same kind of isoform-switching errors as Cufflinks and RSEM (Supplementary Figures 14-15), and as MISO [27], a statistical method for estimating isoform percentages within multi-isoform genes (Supplementary Figure 16). While MISO performed better than the existing transcript abundance estimators in consistency of isoform identification, alpine reported less than half of the large isoform switches across center reported by MISO (Supplementary Figure 9). Fitting the fragment sequence model does require more computational effort than other models, though our implementation ran in comparable time to the cuffquant step of the Cufflinks suite (Online Methods). To determine if recovery of true positives was maintained by alpine, we performed a sensitivity analysis using an RNA-seq fragment simulator, Polyester [28], to generate 30 samples with fragment GC content dependence estimated from the 30 GEUVADIS samples, after having removed read start sequence bias (Figure 2e). In one simulation, differential expression of 10% of transcripts was simulated across a condition confounded with sequencing center, while in another simulation the condition was balanced with sequencing center. We then ran alpine, Cufflinks, RSEM, kallisto, Salmon, and Sailfish on the simulated data and compared the true positive rate and false positive rate under the confounded and balanced designs (Online Methods). alpine had the highest sensitivity for a given specificity in the confounded design (Figure 3g-h), while methods performed similarly for the balanced design (Supplementary Figure 17). alpine had the highest accuracy in estimating the true expression values compared to the other methods (Supplementary Figures 18-19), with low median absolute error in estimating the percent of isoform abundance for genes with two isoforms, comparable to RSEM (Supplementary Figure 20). To confirm that library preparation contributes to the systematic errors we attributed to fragment GC bias, we downloaded a subset of samples from the SEQC Consortium[4], including libraries of the same samples that were prepared and sequenced at three different sites and libraries prepared at a separate site and only sequenced at the three sites (Supplementary Table 4). We found that fragment GC content dependence was strongly associated with the location of library preparation, and not with the location of sequencing (Supplementary Figure 21). To explore the extent to which different library preparation protocols affect the fragment GC dependence, we downloaded a subset of samples from the ABRF Next-Generation Sequencing Study[29] that were prepared using either ribosomal RNA depletion or poly-A selection (Supplementary Table 5). We observed little change in the shape of fragment GC dependence across protocol, and a strong effect in the positional bias, with poly-A selected samples having highest coverage at the 3’ end of transcripts, as reported by the ABRF study authors[29] (Supplementary Figure 22). Even though the fragment GC content dependence did not differ greatly across protocol in this dataset, we evaluated the extent to which alpine was able to remove systematic bias by modeling positional bias. In a comparison of expression estimates across protocol, alpine reported the fewest number of false positives, controlling at 1% FDR and at 1% FWER (Supplementary Figure 23). The large number of transcripts with false positive differences in abundance reported across protocol, in the range of 10-14,000 at 1% FDR for all methods out of 28,000, suggests that none of the existing methods evaluated including alpine can remove the bulk of systematic bias seen across protocol. The relationship of the samples in the ABRF dataset allowed an assessment of the measurement accuracy in terms of the methods’ recovery of expected mixing ratios (Online Methods). In Supplementary Figures 24-25 we assess the different methods’ ability to recover the expected mixing ratio of C/D given measurements of A/B for the transcripts in the top 25% of abundance (as suggested by Li et al. [29]). When quantifying the number of transcripts whose C/D ratio was within 10% of the expected value, RSEM had the highest recovery for both protocols, though alpine also had consistently high recovery (within 5% of the top method). Overall, the mixing ratio recovery for all methods was higher for poly-A selected samples (65-75%) compared to ribosomal RNA depleted samples (45-65%). Systematic errors and batch effects are a continuing cause of concern for RNA-seq experiments. Large-scale, block design, and well-documented transcriptome sequencing projects such as those performed by GEUVADIS[3,19], SEQC[4], and ABRF[29] allow the study of technical biases, such as fragment GC content bias, and the creation of computational methods that correct these biases. Indeed, ’t Hoen et al.[3] observed that slight differences in average GC content across samples lead to differences in quantification for transcripts. We note that our findings reflect general systematic errors and not just differences induced by batch effects. The problem we identify holds for within-sample transcript abundance estimates for samples from a single center and for samples in small-scale experiments. There are likely to be many incorrectly reported major isoforms and biased abundance estimates for experiments that show strong dependence of the fragment rate on fragment GC content (e.g. Figure 2e), unless these are explicitly corrected for using methods that model fragment sequence bias. Strong GC content bias was found for some samples for all public datasets examined. Fold change thresholds[4] are not an appropriate solution to the particular problem presented here, because fold changes induced by technical bias are often larger than those potentially of biological interest. Here we demonstrated specificity using data prepared by different sequencing centers, and sensitivity using simulation. New benchmarking experiments would be valuable for further sensitivity analysis, experiments where the true isoform or set of isoforms are known, and in which characteristically highly-variable profiles of transcript coverage are obtained by following as closely to the steps of a standard RNA-seq experiment as possible. While the sequence features we included in our model provided substantial improvements over existing methods, we hypothesize that more variability can be explained by discovering new predictive features. alpine provides a modular framework that facilitates further exploration, which will prove useful for optimization of protocols to reduce fragment GC content bias, preferable to computational corrections.

Online Methods

RNA-seq read alignment and quantification

IVT-seq FASTQ files made publicly available by Lahens et al.[17] were downloaded from the Sequence Read Archive. Paired-end reads were aligned to the human reference genome contained in the Illumina iGenomes UCSC hg19 build, using STAR version 2.5.0[31]. The exons of the GenBank transcripts were read from the feature_quant.txt files posted to GEO, and the list of transcripts with high unpredictable coverage was downloaded from the additional files of Lahens et al.[17]. For the GEUVADIS and ABRF datasets, the same computational pipeline was used. GEUVADIS FASTQ files made publicly available by Lappalainen et al.[19] were downloaded from the European Nucleotide Archive (Supplementary Table 2). ABRF FASTQ files made publicly available by Li et al.[29] were downloaded from the European Nucleotide Archive (Supplementary Table 5). Paired-end reads were aligned to the human reference genome contained in the Illumina iGenomes UCSC hg19 build, using TopHat version 2.0.11[32] (for Cufflinks) and STAR version 2.5.0 (for alpine). The genes.gtf file contained in the Illumina iGenomes build containing RefSeq transcripts was filtered to genes on chromosomes 1-22, X, Y and M, and provided to Cufflinks, RSEM and alpine as gene annotation. For the SEQC dataset, the bias estimation steps of alpine were run, using STAR version 2.5.0 and the same gene annotation as above. The SEQC FASTQ files made publicly available by Su et al.[4] were downloaded from the European Nucleotide Archive (Supplementary Table 4). Transcript quantification details, including the description of the alpine model and the commands used for running other software are provided in the Supplementary Note. Generating the bias coefficients for the 30 GEUVADIS samples using the alpine software required 40 minutes using 6 cores and 25 Gb of memory. Using alpine to estimating the transcript abundances for 5,676 transcripts from two-isoform genes for 30 samples required 4 hours using 30 cores and 55 Gb of memory.

Training and testing mseq on IVT-seq data

mseq version 1.2 Li et al.[12] was used to model read start bias on the IVT-seq dataset, using the same training- and test-set splits as used by alpine. martTrain was run with interaction depth of 10 and 2000 trees (the defaults) using 15 basepairs to the left and right of the position to be modeled. martPred and getPredCount were used to predict expected read start counts for positions in the test transcripts. As suggested in the mseq documentation, read starts from both ends of the fragment were modeled by providing forward and reverse strand data for each transcript. To generate predicted fragment coverage, the median fragment length was used to extend fragment coverage from mseq predicted read starts.

Simulation

Polyester[28] was used for the RNA-seq fragment simulator, which models variability in expression levels across biological replicates and which was easily extended to include the exact fragment sequence bias obtained from the GEUVADIS dataset. The default simulation parameters were used except the size parameter for the overdispersion was set to 100 (corresponding to a negative binomial dispersion parameter of 0.01). 300 genes with a single isoform, 300 genes with two isoforms, and 300 genes with three to five isoforms were simulated for 30 samples. Average gene-level FPKM estimates across the GEUVADIS samples as estimated by Cufflinks were used for the simulation. Expression levels for the isoforms of genes were determined by multiplying the gene-level FPKM value by a random vector from a flat Dirichlet distribution. Reads were assigned to the transcripts according to the FPKM values and assuming an experiment with a total of 60 million paired-end reads. Paired-end reads were generated and randomly kept or discarded according to a probability derived from the GC content bias curves for 30 GEUVADIS samples in Figure 2e. As the process of discarding pairs of reads would result in unequal final library size, the simulation process was repeated, after first scaling the initial target library size higher such that the final library size for the experiment would be equal to 60 million paired-end reads in expectation. Note that the GC content curves used in the simulation (Figure 2e) were estimated after removing the read start bias using the Cufflinks VLMM, and so the fragment sequence effect observed is not a proxy for read start bias. Simulated paired-end reads were shuffled before being used as input to quantification methods. 10% of transcripts were selected to be differentially expressed across condition, with equal chance of twofold up- or down-regulation. In one simulation, the condition was confounded with the sequencing center (15 samples against 15 samples), and in another simulation, condition was balanced across sequencing center. A t-test was performed on log(FPKM + 1) values. For kallisto, Salmon, and Sailfish, log(TPM + PC) was used, with a pseudocount corresponding to 1 on the FPKM scale. For the balanced design, sequencing center was added as a blocking term to a linear model for differential expression. Note that the balanced design with blocking factor represents the best-case scenario for the competing methods, as the batches were known exactly and the residual degrees of freedom was high, such that the batch effects on transcript abundance estimates could be precisely estimated and removed by the linear model. In contrast, scenarios with total or partial confounding, unknown batches, and sample-specific deviations within batches are typical in RNA-seq experiments and not represented by the balanced design simulation.

Specificity analysis in ABRF samples

Testing specificity on the ABRF dataset was performed in a similar manner as in the GEUVADIS dataset, by performing a t-test on log2(FPKM + 1) abundance estimates (or using TPM for kallisto, Salmon, and Sailfish with a pseudocount corresponding to 1 on the FPKM scale). Only the coding transcripts, as annotated by RefSeq, were used for the analysis, as the poly-A selection protocol is not designed to capture all the non-coding transcripts. The alpine software was run with bias correction terms for fragment length, relative position, and fragment GC content, and other methods were run with their bias correction arguments turned on, including a positional bias correction term for RSEM (Supplementary Note). A single scaling factor for FPKM or TPM values was calculated for each method to adjust for the removal of the non-coding transcripts. This factor was calculated by dividing, for each transcript, the average of abundance estimates for poly-A selected samples by the average for ribosomal RNA depleted samples. The median of this ratio over all coding transcripts was used to scale the ribosomal RNA depleted samples, similar to the median-ratio method for library normalization[33]. In order to test for differences due to protocol, for each transcript, a linear model was fit with the coefficients where Ys is a log2 transformed expression estimate for a single sample s. As is an indicator variable taking a value 0 if sample s is not reference sample A and 1 if sample s is A, and similarly for Bs, Cs, and Ds. Ps is an indicator variable indicating if sample s was produced using the poly-A selection protocol. The polyA term then provides the difference in log2 abundance values between poly-A selected and ribosomal RNA depleted samples. Null hypothesis tests for this coefficient being equal to zero were performed generating two-tailed Wald test p values. p values from each transcript were adjusted using the Benjamini-Hochberg method controlling the FDR and the more conservative Bonferroni method controlling the FWER.

Accuracy in mixing ratios for ABRF samples

As the C and D samples in the ABRF dataset were created by mixing 3:1 and 1:3 ratios of the A and B samples respectively, a comparison of C/D and A/B for the different methods can be used to generate a measure of accuracy in estimating transcript abundance. The same calculation was used as described by Su et al.[4] in the Methods section of the SEQC paper. The expected mixing ratio of C/D can be calculated as a ratio of polynomials given the value of A/B from the abundance estimates. The same adjustment as used by Su et al. [4] was used to account for the different ratios of mRNA to total RNA in A and B samples. Mixing ratios were evaluated for the transcripts in the top 25%, and for which both samples A and B had positive abundance estimates (> 0.1 on the FPKM scale).

Code availability

alpine version 0.1.2 was used in this manuscript. The alpine software is made available at: http://bioconductor.org/packages/alpine The modified branch of Polyester including fragment sequence bias used here is made available at: https://github.com/mikelove/polyesterAlpineMs Code for producing the simulation is made available at: https://github.com/mikelove/fragmentBiasSimulation alpine is implemented in the R language using core Bioconductor[34] packages. Information for all of the fragment features is generated using the packages GenomicAlignments, Biostrings, and GenomicRanges[35].

33 in total

1. Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study.

Authors: Sheng Li; Scott W Tighe; Charles M Nicolet; Deborah Grove; Shawn Levy; William Farmerie; Agnes Viale; Chris Wright; Peter A Schweitzer; Yuan Gao; Dewey Kim; Joe Boland; Belynda Hicks; Ryan Kim; Sagar Chhangawala; Nadereh Jafari; Nalini Raghavachari; Jorge Gandara; Natàlia Garcia-Reyero; Cynthia Hendrickson; David Roberson; Jeffrey Rosenfeld; Todd Smith; Jason G Underwood; May Wang; Paul Zumbo; Don A Baldwin; George S Grills; Christopher E Mason
Journal: Nat Biotechnol Date: 2014-08-24 Impact factor: 54.908

2. Normalization of RNA-seq data using factor analysis of control genes or samples.

Authors: Davide Risso; John Ngai; Terence P Speed; Sandrine Dudoit
Journal: Nat Biotechnol Date: 2014-08-24 Impact factor: 54.908

3. Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories.

Authors: Peter A C 't Hoen; Marc R Friedländer; Jonas Almlöf; Michael Sammeth; Irina Pulyakhina; Seyed Yahya Anvar; Jeroen F J Laros; Henk P J Buermans; Olof Karlberg; Mathias Brännvall; Johan T den Dunnen; Gert-Jan B van Ommen; Ivo G Gut; Roderic Guigó; Xavier Estivill; Ann-Christine Syvänen; Emmanouil T Dermitzakis; Tuuli Lappalainen
Journal: Nat Biotechnol Date: 2013-09-15 Impact factor: 54.908

4. Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads.

Authors: Wei Li; Tao Jiang
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937

5. Analysis and design of RNA sequencing experiments for identifying isoform regulation.

Authors: Yarden Katz; Eric T Wang; Edoardo M Airoldi; Christopher B Burge
Journal: Nat Methods Date: 2010-11-07 Impact factor: 28.547

6. Near-optimal probabilistic RNA-seq quantification.

Authors: Nicolas L Bray; Harold Pimentel; Páll Melsted; Lior Pachter
Journal: Nat Biotechnol Date: 2016-04-04 Impact factor: 54.908

7. Biases in Illumina transcriptome sequencing caused by random hexamer priming.

Authors: Kasper D Hansen; Steven E Brenner; Sandrine Dudoit
Journal: Nucleic Acids Res Date: 2010-04-14 Impact factor: 16.971

8. Detecting and correcting systematic variation in large-scale RNA sequencing data.

Authors: Sheng Li; Paweł P Łabaj; Paul Zumbo; Peter Sykacek; Wei Shi; Leming Shi; John Phan; Po-Yen Wu; May Wang; Charles Wang; Danielle Thierry-Mieg; Jean Thierry-Mieg; David P Kreil; Christopher E Mason
Journal: Nat Biotechnol Date: 2014-08-24 Impact factor: 54.908

9. Transcriptome and genome sequencing uncovers functional variation in humans.

Authors: Tuuli Lappalainen; Michael Sammeth; Marc R Friedländer; Peter A C 't Hoen; Jean Monlong; Manuel A Rivas; Mar Gonzàlez-Porta; Natalja Kurbatova; Thasso Griebel; Pedro G Ferreira; Matthias Barann; Thomas Wieland; Liliana Greger; Maarten van Iterson; Jonas Almlöf; Paolo Ribeca; Irina Pulyakhina; Daniela Esser; Thomas Giger; Andrew Tikhonov; Marc Sultan; Gabrielle Bertier; Daniel G MacArthur; Monkol Lek; Esther Lizano; Henk P J Buermans; Ismael Padioleau; Thomas Schwarzmayr; Olof Karlberg; Halit Ongen; Helena Kilpinen; Sergi Beltran; Marta Gut; Katja Kahlem; Vyacheslav Amstislavskiy; Oliver Stegle; Matti Pirinen; Stephen B Montgomery; Peter Donnelly; Mark I McCarthy; Paul Flicek; Tim M Strom; Hans Lehrach; Stefan Schreiber; Ralf Sudbrak; Angel Carracedo; Stylianos E Antonarakis; Robert Häsler; Ann-Christine Syvänen; Gert-Jan van Ommen; Alvis Brazma; Thomas Meitinger; Philip Rosenstiel; Roderic Guigó; Ivo G Gut; Xavier Estivill; Emmanouil T Dermitzakis
Journal: Nature Date: 2013-09-15 Impact factor: 49.962

Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation.

Online Methods

RNA-seq read alignment and quantification

Training and testing mseq on IVT-seq data

Simulation

Specificity analysis in ABRF samples

Accuracy in mixing ratios for ABRF samples

Code availability

1. Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study.

2. Normalization of RNA-seq data using factor analysis of control genes or samples.

3. Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories.

4. Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads.

5. Analysis and design of RNA sequencing experiments for identifying isoform regulation.

6. Near-optimal probabilistic RNA-seq quantification.

7. Biases in Illumina transcriptome sequencing caused by random hexamer priming.

8. Detecting and correcting systematic variation in large-scale RNA sequencing data.

9. Transcriptome and genome sequencing uncovers functional variation in humans.

10. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions.

1. Statistical Modeling of High Dimensional Counts.

2. Missing data and technical variability in single-cell RNA-sequencing experiments.

3. Polee: RNA-Seq analysis using approximate likelihood.

4. Modelling RNA-Seq data with a zero-inflated mixture Poisson linear model.

5. Transcriptome analysis of rat dorsal hippocampal CA1 after an early life seizure induced by kainic acid.

6. Nonparametric expression analysis using inferential replicate counts.

7. Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification.

8. Detecting, Categorizing, and Correcting Coverage Anomalies of RNA-Seq Quantification.

9. Modeling and analysis of RNA-seq data: a review from a statistical perspective.

10. Novel genetic features of human and mouse Purkinje cell differentiation defined by comparative transcriptomics.