| Literature DB >> 29116112 |
Yuguang Xiong1, Magali Soumillon2,3, Jie Wu4,5, Jens Hansen1, Bin Hu1, Johan G C van Hasselt1, Gomathi Jayaraman1, Ryan Lim4,5, Mehdi Bouhaddou1, Loren Ornelas6,7, Jim Bochicchio2, Lindsay Lenaeus6,7, Jennifer Stocksdale5, Jaehee Shim1, Emilda Gomez6,7, Dhruv Sareen6,7,8, Clive Svendsen6,7,8, Leslie M Thompson4,5,9, Milind Mahajan10, Ravi Iyengar1, Eric A Sobie1, Evren U Azeloglu11, Marc R Birtwistle12,13.
Abstract
Creating a cDNA library for deep mRNA sequencing (mRNAseq) is generally done by random priming, creating multiple sequencing fragments along each transcript. A 3'-end-focused library approach cannot detect differential splicing, but has potentially higher throughput at a lower cost, along with the ability to improve quantification by using transcript molecule counting with unique molecular identifiers (UMI) that correct PCR bias. Here, we compare an implementation of such a 3'-digital gene expression (3'-DGE) approach with "conventional" random primed mRNAseq. Given our particular datasets on cultured human cardiomyocyte cell lines, we find that, while conventional mRNAseq detects ~15% more genes and needs ~500,000 fewer reads per sample for equivalent statistical power, the resulting differentially expressed genes, biological conclusions, and gene signatures are highly concordant between two techniques. We also find good quantitative agreement at the level of individual genes between two techniques for both read counts and fold changes between given conditions. We conclude that, for high-throughput applications, the potential cost savings associated with 3'-DGE approach are likely a reasonable tradeoff for modest reduction in sensitivity and inability to observe alternative splicing, and should enable many larger scale studies focusing on not only differential expression analysis, but also quantitative transcriptome profiling.Entities:
Mesh:
Substances:
Year: 2017 PMID: 29116112 PMCID: PMC5676863 DOI: 10.1038/s41598-017-14892-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1Schematic of Library Construction Differences between Conventional and 3′-DGE mRNA Sequencing. BC: Barcode; UMI: Unique Molecular Identifier.
Figure 2Fidelity of Sequence Alignments in Conventional (Conv) and 3′-end Digital Gene Expression (3′-DGE) Method. For each of the ~14,000 genes with greater than four counts (summed across all 16 samples) measured by conventional method and 3′-DGE method, the proportion of reads aligned to only that one gene was calculated. This proportion is close to 1 for most genes, indicating reliable quantification.
Figure 3Sensitivity of Gene Detection by Conventional (Conv) and 3′-end Digital Gene Expression (3′-DGE) mRNA Sequencing Methods. (a) Gene-wise reads are removed from every sample in a probability proportional to the abundance of the gene in a sample, to generate a set of the number of identified genes over a range of simulated read depths. The curves for individual replicate samples are shown with the thinner points, showing in general low variability. The average is shown with the solid line. (b) Each replicate from both mRNAseq technologies were downsampled via read removal to a common read depth (2.8 million reads per sample), and the differences in identified genes were analyzed. Most genes identified in conventional but not 3′-DGE were shared across treatment conditions, and likewise for those identified by 3′-DGE but not conventional. Specific gene names are listed in Tables S3,S4.
The Best-Fit Parameter Values of a Michaelis-Menten Model to the Average Curves in Fig. 3a. The function drm in the R package drc, using the fit function mm, was used.
| Technology | Treatment | Km (reads) | Vmax (genes) |
|---|---|---|---|
| Conv | CTRL | 49,719 | 12,905 |
| Conv | SOR | 45,441 | 13,202 |
| Conv | SUN | 50,902 | 13,099 |
| 3′-DGE | CTRL | 75,277 | 11,117 |
| 3′-DGE | SOR | 60,979 | 10,884 |
| 3′-DGE | SUN | 70,673 | 11,006 |
Figure 4Quantitative Comparison between Conventional (Conv) and 3′-end Digital Gene Expression (3′-DGE) mRNA Sequencing Methods. (a) Correlations of the replicate samples from the Conv read counts and 3′-DGE read counts show that the replicate samples obtained by the same method correlate well with each other under each condition. Control (CTRL), Sorafenib (SOR), Sunitinib (SUN). Pearson correlation is used. (b,c) Quantitative gene-wise comparison between Conv read counts and 3′-DGE UMI counts. Datasets are downsampled to a common read depth of 2.8 million reads, and then gene-by-gene comparisons are made via scatter plots. To generate a reduced UMI count dataset, upon removal of a read count, UMI counts were removed with probability proportional to the ratio between UMI counts and read counts for that gene (accounting for PCR bias). Density of points in scatter plots is indicated by depth of color. Inset text box shows Pearson correlation. In all plots, data are scaled so units are comparable. (b) Scatterplots of UMI counts for 3′-DGE versus read counts for conventional, without normalization by average transcript length. There is a general trend of agreement but correlation is low for quantitative agreement. (c) Scatterplots of UMI counts for 3′-DGE versus transcript length-normalized read counts for conventional. Quantitative agreement is significantly improved upon this normalization. (d) Potential biases of 3′-DGE or Conv techniques. We averaged data from all 16 read depth-normalized samples and defined lines that flank the typical variance in the data to identify genes that have evidence of bias in quantification. Genes falling outside of this range are reported in Tables S5,S6.
Figure 5Differential Gene Expression Analysis between Conventional (Conv) and 3′-end Digital Gene Expression (3′-DGE) mRNA Sequencing Methods. (a) Control (CTRL) data are compared Sorafenib (SOR) or Sunitinib (SUN) to identify differentially expressed genes (DEGs) using EdgeR for both Conv and 3′-DGE datasets. A gene is defined as differentially expressed using a false discovery rate (FDR) cutoff of 0.1. (b) Comparison of statistical significance for the 3,136 shared differentially expressed genes (DEGs) from Sorafenib-treated samples in 3′-DGE and Conv methods with FDR <0.1. The negative base-10 logarithm of the p-value for differential expression is plotted for each technique, with depth of color indicating density of points. Pearson’s correlation coefficient is indicated with the inset text. (c) Identification of Differential Expression as a Function of Read Depth. The number of differentially expressed genes (DEGs, FDR < 0.1) was quantified after progressive downsampling of UMI counts from 3′-DGE datasets or read counts from conventional (Conv) datasets (for Sorafenib vs. CTRL). (d) Comparison of statistical significance for all genes identified from SOR-treated samples and SUN-treated samples by two methods. The negative base-10 logarithm of the p-value for differential expression is plotted for each technique, with depth of color indicating density of points. The Pearson correlation coefficient is calculated for each treatment. (e) Comparison of fold change for all genes identified from SOR-treated samples and SUN-treated samples by two methods. The log base two fold-change is plotted for each technique, with depth of color indicating density of points. The Pearson correlation coefficient is calculated for each treatment. (f,g) Rank-rank Hypergeometric tests for consistency of differential expression ranking and gene expression signatures. (f) All genes for which a p-value for differential expression was calculated were first sorted into up or down regulated genes (as compared to CTRL), and then ranked by statistical significance. The probability of overlap between two different such rank lists was calculated with Fisher’s Exact Test (aka hypergeometric test), and visualized with a heatmap, for all combinations of list cutoffs. (g) Pairwise comparisons of SUN- and SOR-treated data for 3′-DGE and conventional. SOR-treated samples show much higher relative statistical significance, as expected, because only SOR induced large changes in gene expression. Note the difference in p-value scales across the three panels, which indicate the relative statistical significance of the results.
Figure 6Comparison between Conventional (Conv) and 3′-end Digital Gene Expression (3′-DGE) mRNA Sequencing Methods with an Independent Dataset. (a) Sensitivity of gene detection by two methods. Gene-wise reads are removed from every sample in a probability proportional to the abundance of the gene in a sample, to generate a set of the number of identified genes over a range of simulated read depths. The curves for individual replicate samples are shown with the thinner points, showing in general low variability. The average is shown with the solid line. (b,c) Quantitative gene-wise comparison between two methods. Density of points in scatter plots is indicated by depth of color. Inset text box shows Pearson correlation. In all plots, data are scaled so units are comparable. B. Scatterplots of UMI counts for DToxS’ 3′-DGE versus read counts for NeuroLINCS’ conventional, without normalization by average transcript length. CTRL or SMA refer to the genetic status of the iPS cells (see Methods). (c) Scatterplots of UMI counts for DToxS’ 3′-DGE versus transcript length-normalized read counts for NeuroLINCS’ conventional. (d,e) Comparison of statistical significance (d) or fold-change (e) for all genes identified from SMA samples by DToxS’ 3′-DGE method and NeuroLINCS’ Conv method. (d) The negative base-10 logarithm of the p-value for differential expression is plotted for each technique, with depth of color indicating density of points. (e) The log base two fold-change is plotted for each technique, with depth of color indicating density of points. (f) Rank-rank Hypergeometric tests for consistency of differential expression ranking and gene expression signatures. All genes for which a p-value for differential expression was calculated were first sorted into up or down regulated genes (as compared to CTRL), and then ranked by statistical significance. The probability of overlap between two different such rank lists was calculated with Fisher’s Exact Test (aka hypergeometric test), and visualized with a heatmap, for all combinations of list cutoffs. Shown here are lists from SMA vs. control for 3′-DGE and conventional.