| Literature DB >> 29868123 |
Oswaldo A Lozoya1, Janine H Santos1, Richard P Woychik1.
Abstract
To life scientists, one important feature offered by RNAseq, a next-generation sequencing tool used to estimate changes in gene expression levels, lies in its unprecedented resolution. It can score countable differences in transcript numbers among thousands of genes and between experimental groups, all at once. However, its high cost limits experimental designs to very small sample sizes, usually N = 3, which often results in statistically underpowered analysis and poor reproducibility. All these issues are compounded by the presence of experimental noise, which is harder to distinguish from instrumental error when sample sizes are limiting (e.g., small-budget pilot tests), experimental populations exhibit biologically heterogeneous or diffuse expression phenotypes (e.g., patient samples), or when discriminating among transcriptional signatures of closely related experimental conditions (e.g., toxicological modes of action, or MOAs). Here, we present a leveraged signal-to-noise ratio (LSTNR) thresholding method, founded on generalized linear modeling (GLM) of aligned read detection limits to extract differentially expressed genes (DEGs) from noisy low-replication RNAseq data. The LSTNR method uses an agnostic independent filtering strategy to define the dynamic range of detected aggregate read counts per gene, and assigns statistical weights that prioritize genes with better sequencing resolution in differential expression analyses. To assess its performance, we implemented the LSTNR method to analyze three separate datasets: first, using a systematically noisy in silico dataset, we demonstrated that LSTNR can extract pre-designed patterns of expression and discriminate between "noise" and "true" differentially expressed pseudogenes at a 100% success rate; then, we illustrated how the LSTNR method can assign patient-derived breast cancer specimens correctly to one out of their four reported molecular subtypes (luminal A, luminal B, Her2-enriched and basal-like); and last, we showed the ability to retrieve five different modes of action (MOA) elicited in livers of rats exposed to three toxicants under three nutritional routes by using the LSTNR method. By combining differential measurements with resolving power to detect DEGs, the LSTNR method offers an alternative approach to interrogate noisy and low-replication RNAseq datasets, which handles multiple biological conditions at once, and defines benchmarks to validate RNAseq experiments with standard benchtop assays.Entities:
Keywords: DEG; LSTNR; RNAseq; biomarker discovery; expression patterns; noise
Year: 2018 PMID: 29868123 PMCID: PMC5964166 DOI: 10.3389/fgene.2018.00176
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1LSTNR method workflow. The schematic depicts the main steps involved in detection of statistically significant genes (SGs) starting from uniquely aligned reads for each gene in individual samples: coverage normalization as reads per million total sequenced reads (RPM); parametric distribution fitting; independent filtering based on fit parameters; generalized linear modeling (GLM), leading to gene resolution weights; log-fold expression measurements (Log2FC) vs. a fixed reference; and two-way resolution-weighed ANOVA of Log2FC. Top right: stratification criteria for SGs and their respective nomenclature: differentially expressed genes (DEGs), leveraged signal-to-noise-ratio genes (LSTNRs), and DEGs with reproducible expectation estimates (DEGREEs). The subset of Profiler DEGREEs corresponds to DEGREEs with a mean retrospective statistical power ≪π≫ > 90% which, when ranked by within-gene observed effect sizes (ΔLog2FC) first and by their 95% lower confidence limit of retrospective statistical power (πlow) next, show monotonically decreasing ΔLog2FC values. Profiler DEGREEs can then be used for benchtop validation or for additional statistical analysis to obtain a reductive set of prospective biomarkers through a variety of machine-learning analytics (e.g., partition trees, canonical factor analysis).
Figure 2Analysis of EPIG-seq in silico test data by the LSTNR method. (A) Quantile plot of pseudogene-averaged RPM across pseudoreplicates, overlaid onto their best-fit threshold lognormal distribution parametric model (purple); shading around the parametric fit represents the simultaneous 95% confidence interval of predicted means. (B) Linear fitting and distribution of relative expression metrics for pseudogene × group blocks with respect to individual pseudoreplicates. Left: average of Log2FC measurements vs. Log2FC from individual pseudoreplicates; right: values of the canonical link function log (RPM) of pseudoreplicates vs. pseudogene × group block averages of the linear predictor function, ≪(θ)≫, estimated by GLM. Colors of individual data points corresponds to the estimated quantile density of Log2FC values in mean vs. replicate space (left panel) as indicated by the adjacent color scale. (C) Gene resolution weights as a function of FDR-adjusted significance levels of pseudogenes as determined by pseudogene × group two-way ANOVA based on the linear predictor (θ). Point coloring corresponds to that in B; size of each data point is representative of pseudogene-averaged RPM across pseudoreplicates. (D) Distribution of range-scaled residuals around the average of pseudogene × group blocks before (left) and after (right) multiplying Log2FC measurements by gene resolution weights depicted in (C); point coloring corresponds to that in (B). (E) Heatmap plot for two-way unsupervised clustering of pseudoreplicates (horizontal) and 4,541 significant pseudogenes (FDR p < 0.05) detected by LSTNR analysis based on Log2FC vs. the mean RPM in the baseline group. Left: bird view of the entire heatmap; center: magnification to subset of 1,000 pseudogenes from five simulated co-expression patterns and their correspondence with inferred hierarchical clades; right: heatmap is colored on a green-black-red gradient scale of Log2FC values relative to baseline (green, downregulated; black, same; red, upregulated). (F) Distribution of net residuals around pseudogene × group Log2FC averages, based on 20,000 analyzed pseudogenes (before LSTNR testing) and 4,541 pseudogenes with statistically significant resolution-weighed differences (after LSTNR testing). Dotted blue lines to the left and right of the x-axis origin enclose the predicted 95% tolerance interval of residuals in each respective plot. (G) Average Log2FC expression ± s.d. of pseudogenes in simulated patterns vs. inferred clades of co-expression across groups. (H) Heatmap depicts Pearson's correlation coefficient r values between 4,541 statistically significant pseudogenes. Pseudogenes are displayed by groups of inferred co-expression clades; heatmap is colored on a blue-white-red gradient scale of Pearson's correlation coefficient r values (blue, negative correlation; white, not correlated; red, positive correlation). N = 35 pseudoreplicates in each of 4 simulated condition groups: baseline, and groups 1 to 3; number of simulated co-expression patterns: five, with 200 pseudogenes each. Simulation comprises 140 total pseudoreplicates with reads distributed among 20,000 total pseudogenes.
Step-by-step output as numbers of qualifying genes along the LSTNR analytical pipeline for a validation in silico dataset (courtesy of Li and Bushel, 2016; doi: 10.1186/s12864-016-2584-7).
| Simulated pseudogenes | 20,000 |
| Distribution of pseudogene-wise RPM means | P(y) ~ |
| σ′ = 25.4 | |
| μ′ = 25.4 | |
| γ = 2.5 × 10−3 RPM | |
| 20,000 | |
| Log[y] | |
| 9,492 | |
| 4,541 |
Confusion matrices for differential pseudo-gene pattern assignment by EPIG, EPIG-seq, and LSTNR implementation based on a validation in silico dataset (EPIG and EPIG-seq: courtesy of Li and Bushel, 2016; doi: 10.1186/s12864-016-2584-7.
| EPIG | Group A | 0 | 0 | 46 | 0 | 0 | 46 | 23 | 100 |
| Group B | 0 | 0 | 0 | 0 | 136 | 136 | 68 | 100 | |
| Group C | 0 | 39 | 0 | 0 | 0 | 39 | 19.5 | 100 | |
| Group D | 0 | 0 | 0 | 60 | 0 | 60 | 30 | 100 | |
| Group E | 35 | 0 | 0 | 0 | 0 | 35 | 17.5 | 100 | |
| EPIG-Seq | Group A | 7 | 132 | 11 | 0 | 0 | 150 | 66 | 98.2 |
| Group B | 0 | 0 | 169 | 0 | 0 | 169 | 84.5 | 100 | |
| Group C | 135 | 26 | 5 | 0 | 0 | 166 | 67.5 | 96.9 | |
| Group D | 0 | 0 | 0 | 15 | 166 | 181 | 83 | 98.5 | |
| Group E | 9 | 23 | 0 | 111 | 43 | 186 | 55.5 | 92.5 | |
| LSTNR | Clade A' | 177 | 0 | 12 | 0 | 4 | 193 | 88.5 | 98.4 |
| Clade B' | 0 | 200 | 0 | 0 | 0 | 200 | 100 | 100 | |
| Clade C' | 1 | 0 | 155 | 0 | 1 | 157 | 77.5 | 99.8 | |
| Clade D' | 0 | 0 | 0 | 200 | 0 | 200 | 100 | 100 | |
| Clade E' | 22 | 0 | 33 | 0 | 195 | 250 | 97.5 | 94.5 | |
Figure 3Molecular subtype discrimination of transcriptional signatures from patient-derived breast cancer specimens using the LSTNR method by split-pool and single-shot approaches. (A) Quantile plot of gene-averaged RPM across replicates, overlaid onto best-fit threshold Weibull distribution parametric models per individual realization; fit lines and points are colored by the realization they were tested under. (B) Gene resolution weights calculated within each realization, as a function of FDR-adjusted significance levels of genes, based on linear predictor (θ) gene × group two-way ANOVA. Point coloring indicates realization tested; size of each data point is representative of gene-averaged RPM across replicates within each realization; also, the top 10 genes found in all realizations with highest RPM scores, and their identities, are shown with dark outlines. (C) Venn diagram depicting the number of shared gene symbols among consensus SGs, DEGs, LSTNRs, and DEGREEs identified in all realizations. (D) Venn diagram depicting the concordance between DEGREEs detected by all-at-once single-shot analysis of the TCGA dataset vs. the set of consensus DEGREEs identified across separate realizations. (E) Heatmap plots for two-way unsupervised clustering of all specimens (horizontal) and DEGREEs detected by single-shot analysis (top) or consensus across realizations (bottom) based on differential expression levels. Coloring of row dendrograms and labels represent the phenotype as annotated in TCGA database for each specimen. Dot columns on the left of heatmap plots depict inferred specimen clades via unsupervised clustering of depicted genes in each heatmap separately. Right: heatmaps are colored on green-black-red gradient scales of Log2FC values relative to baseline (green, downregulated; black, same; red, upregulated). (F) Three-factor discriminant analysis and plots of tested specimens based on Log2FC measurements of single-shot DEGREEs (top) vs. consensus DEGREEs (bottom). Coloring of lines and points in depicted plots represent the phenotype as annotated in TCGA database for each specimen. ROC curves and their AUC values are also shown. (G) Volcano plots of 200 Profiler DEGs across all breast cancer subtype specimens relative to normal group; y-axis: post-hoc pairwise significance scores of Log2FC measurements; x-axis: mean Log2FC vs. mean RPM of normal specimens. (H) Heatmap depicts Pearson's correlation coefficient r-values between 200 Profiler DEGREEs. Order of Profiler DEGREEs corresponds to their arrangement by unsupervised clustering, as shown in (E). Heatmap is colored on a blue-white-red gradient scale of Pearson's correlation coefficient r-values (blue, negative correlation; white, not correlated; red: positive correlation). (I) Heatmap plots of all breast cancer specimens sorted by phenotype and replicate number (rows) vs. Profiler DEGREEs detected by consensus across realizations (columns) based on differential expression levels. Order of Profiler DEGREEs corresponds to their arrangement by unsupervised clustering, as shown in (E,H). Coloring of row labels represent the phenotype as annotated in TCGA database for each specimen. Right: heatmap is colored on a green-black-red gradient scale of Log2FC values relative to normal specimens (green, downregulated; black, same; red, upregulated). (J) Inferred diagnostic biomarkers of breast cancer subtypes by sequential partitioning tree analysis of Profiler DEGREEs. Coloring of lines and points in depicted plots represent the phenotype as annotated in TCGA database for each specimen. Partition analysis ROC curves and their AUC values are also shown. N = 40 independent specimens per breast cancer molecular subtype: normal, luminal A, luminal B, HER2+, and basal-like. Specimen classification as annotated in TCGA database. TCGA dataset comprises 200 total individual specimens with reads aligned onto 20,532 genes overall, analyzed as a whole (single-shot) or evenly split into four separate and mutually exclusive realizations (split-pool). Reference genome: hg19.
Step-by-step output as numbers of qualifying genes along the LSTNR analytical pipeline for RNAseq data deposited in TCGA from four realizations of patient-derived breast cancer transcriptomes across four molecular subtypes (courtesy of Li and Bushel, 2016; 10.1186/s12864-016-2584-7).
| Genes with uniquely aligned reads | 20,532 | ||||
| Distribution of gene-wise RPM means | P(y) ~ | ||||
| α = 25.4 RPM | α = 24.1 RPM | α = 25.0 RPM | α = 21.9 RPM | α = 22.2 RPM | |
| ß = 0.53 | ß = 0.53 | ß = 0.54 | ß = 0.49 | ß = 0.49 | |
| γ = 9.9 × 10−3 RPM | γ = 6.6 × 10−3 RPM | γ = 1.1 × 10−2 RPM | γ = 1.6 × 10−3 RPM | γ = 1.5 × 10−3 RPM | |
| 8,538 | |||||
| (y–γ)−1 | |||||
| 2,851 | |||||
| 6193 | |||||
resolution-weighed effect size above 5% of gene-wise variation (δLog2FC > 0.3 × σSSR); and | 6,093 | ||||
resolution-weighed effect size above 5% of gene-wise variation (δLog2FC > 0.3 × σSSR); and at least one group with Log2FC differences vs. baseline greater than 95% Tolerance Interval of gene × group residuals among SGs ( | 1,511 | ||||
| 1,511 | |||||
| CBX7, ESR1, FOXC1, and FOXM1 | |||||
Step-by-step output as numbers of qualifying genes along the LSTNR analytical pipeline for liver transcriptomes from male Sprague-Dawley rats after toxicant exposure based on the mode-of-action training RNAseq dataset by the MAQC phase III SEQC crowdsource toxicogenomics (TGxSEQC) effort (GEO accession number: GSE55347).
| Genes with uniquely aligned reads | 30,852 |
| Distribution of gene-wise RPM means | P(y)~ |
| α = 6.7 RPM | |
| ß = 0.38 | |
| γ = 2.5 × 10−3 RPM | |
| 9,593 | |
| (y–γ)−1 | |
| 3,975 | |
| 5,983 | |
resolution-weighed effect size above 5% of gene-wise variation (δLog2FC>0.3 × σSSR); and | 5,864 |
resolution-weighed effect size above 5% of gene-wise variation (δLog2FC>0.3 × σSSR); and at least one group with Log2FC differences vs. baseline greater than 95% Tolerance Interval of gene × group residuals among SGs ( | 1,953 |
| 1,510 | |
| Ucp3, Tmem86b, Sugct, Acaa1b, Hadhb, Tfam, Acaa1a, and Gsdmd |
Figure 4LSTNR method analysis of the MOA training RNAseq dataset by the TGxSEQC crowdsource effort. (A) Heatmap plots of expression differences (top) and Pearson's r correlation coefficients (bottom) for 45 Sprague-Dawley rat liver specimens exposed to toxic agents classified under 5 different MOAs (rows; N = 9 per MOA) and 1,510 DEGREEs (columns) based on Log2FC measurements vs. the average of 9 mock-treated controls. The expression heatmap of DEGREEs (top) is colored on a green-black-red gradient scale (left) of Log2FC values (green downregulated; black, same; red, upregulated); the correlation heatmap of DEGREEs (bottom) is colored on a blue-white-red gradient scale (left) of Pearson's correlation coefficient r-values (blue, negative correlation; white, not correlated; red, positive correlation). Order of DEGREEs in both heatmaps (columns) is illustrated by the column dendrogram in the expression heatmap, and was based on two-way unsupervised hierarchical clustering of DEGREEs and specimens based on Log2FC expression differences (expression heatmap, top). Dot plots on the right of the expression heatmap (top) depict experimental classifications of treated specimens based on: means of transcriptional activation of MIEs (left to right: non-receptor mediated, or NRM vs. receptor-mediated or RM); MOA (left to right: Ahr, CAR/PXR, Cytotoxic, DNA damage, PPARA); route of exposure (left to right: intraperitoneal injection, or oral gavage); and nutritional status of vehicle (left to right: not nutritive solution, or nutritive oil-based vehicle). Inferred specimen clades, represented in the expression heatmap (top) by row dendrograms and their respective MOA × Agent labels, are shown with alternating orange and blue coloring for clarity. (B) Expression heatmap for two-way unsupervised clustering of Log2FC differences, as in (A), using only 65 Profiler DEGREEs (columns). Profiler DEGREE columns are labeled by their respective Entrez gene symbol (top); the names of 8 biomarkers selected from the set of Profiler DEGREEs by sequential partition tree analysis are shown raised and in bold. Coloring of the row dendrogram and its labels (left) represent inferred grouping of treated specimens into 7 MOA clusters (I-VII). Inferred co-expression patterns represented by the column dendrogram (bottom; a–e) are shown with alternating gray and black coloring for clarity. Top left: heatmap is colored on a green-black-red gradient scale (top left) of Log2FC values relative to mock-treated controls (green, downregulated; black, same; red, upregulated). Dot plots on the right of the expression heatmap depict experimental classifications of treated specimens based on route of exposure (left to right: intraperitoneal injection, or oral gavage) and nutritional status of vehicle (left to right: not nutritive solution, or nutritive oil-based vehicle). (C) Average Log2FC expression ± s.e.m. based on Profiler DEGREEs, split into co-expression patterns from (B), and separated by experimental MOA classification. (D) Concordance of relative expression estimates normalized to house-keeping glyceraldehyde-3-phosphate dehydrogenase (Gapdh) gene between RNA-seq and qPCR experiments for eight annotated gene symbols across hepatotoxic agents with DNA damage MOA; all assays were performed from cDNA templates derived using the same total RNA extracts for both techniques (N = 3 per hepatotoxic agent); qPCR assays were performed in technical duplicates per reaction plate with matched untreated samples as controls (CTR). Overall linear regression (regression mean: solid black line; ±95 CI of regression: dashed black lines) corresponds to qPCR-based average normalized Log2FC expression levels of each gene (x-axis) vs. normalized Log2FC expression measurements per sample among mRNA transcripts with matching gene symbols detected by RNA-seq (y-axis). (E) Enriched signaling pathways and (F) inferred upstream regulators for Profiler DEGREEs with |Log2FC|>0.82 under each MOA cluster identified in (B), based on Ingenuity Knowledge Base ontologies; equal analyses are depicted in regards to (G) metabolic pathways and (H) inferred disease and biological functions, respectively. Pathways depicted in (E,G) heatmaps showed enriched representation p < 0.05 in at least one MOA cluster; upstream regulators and functions in (F,H) showed both enriched representation p < 0.05 and predictive |z| > 2.0 in at least one MOA cluster. Intensity of purple coloring in pathway heatmaps (E,G) represent increasing levels of significance; coloring of activation heatmaps (F,H) on a blue-white-orange gradient scale depicts prediction z-score values (blue, inhibited; white, inactive; orange, activated). (I) Expression heatmap for two-way unsupervised clustering of Log2FC differences based on the 8 biomarkers highlighted in (B) (columns). Coloring of the row dendrogram and its labels (right) matches the 7 MOA clusters (I-VII) depicted in (B). Top right: heatmap is colored on a green-black-red gradient scale (top left) of Log2FC values relative to mock-treated controls (green, downregulated; black, same; red: upregulated). Dot plots (left) depict experimental classifications of treated specimens based means of transcriptional activation of MIEs (left to right: non-receptor mediated, or NRM vs. receptor-mediated or RM). N = 9 independent specimens per MOA classification with three different hepatotoxic agents each (N = 3 per MOA × Agent combination): Ahr (3ME = 3-methylcholantrene, 300 mg/kg/day, 5 days; NAP = β-naphthoflavone, 1,500 mg/kg/day, 5 days; LEF = leflunomide, 60 mg/kg/day, 5 days); CAR/PXR (ECO = econazole, 334 mg/kg/day, 5 days; MET = methimazole, 100 mg/kg/day, 3 days; PHE = phenobarbital, 54 mg/kg/day, 5 days); Cytotoxic (CAR = carbon tetrachloride, 1,175 mg/kg/day, 7 days; CHL = chloroform, 600 mg/kg/day, 5 days; THI = thioacetamide, 200 mg/kg/day, 5 days); DNA damage (AFL = aflatoxin B1, 0.3 mg/kg/day, 5 days; IFO = ifosfamide, 143 mg/kg/day, 3 days; NIT = N-nitrosodimethylamine, 10 mg/kg/day, 5 days); and PPARA (BEZ = bezafibrate, 617 mg/kg/day, 7 days; NAF = nafenopin, 338 mg/kg/day, 5 days; PIR = pirinixic acid, 364 mg/kg/day, 5 days). Log2FC measurements were calculated relative to 9 vehicle-only mock-treated controls. The TGxSEQC training dataset comprises 54 total individual specimens with reads aligned onto 30,852 RefSeq-annotated transcripts overall. To designate the final list of Profiler DEGREEs, Log2FC of differentially expressed transcripts (resolution-weighed ANOVA FDR p < 0.05 and δLog2FC > 0.3 × σSSR, and both post-hoc p < 0.05 and Log2FC > 95%TISSR in at least one MOA vs. average of controls) annotated as mRNA-encoding (RefSeq NM_ suffix) were consolidated by the average of Log2FC values under matching and nonduplicate gene symbols from the Entrez database. Reference genome: rn6.
Partitioning contribution statistics by likelihood ratio tests of eight selected biomarkers via bootstrap forest modeling of sample classification under seven inferred MOA clusters with reference gene lists at different tiers of statistical stringency (mode-of-action training RNAseq dataset, TGxSEQC; GEO accession number: GSE55347).
| Ucp3 | 8.4178 | 14.2830 | 14.5505 | |
| (9.89%) | (17.12%) | (22.92%) | ||
| 1 | 1 | 1 | Rank by | |
| 0.2091 | 0.0266 | 0.0241 | ||
| Tmem86b | 0.0000 | 2.4276 | 8.5938 | |
| (0.00%) | (2.91%) | (13.54%) | ||
| 1363 | 6 | 2 | ||
| 1.0000 | 0.8765 | 0.1977 | ||
| Hadhb | 0.3465 | 1.4197 | 8.3181 | |
| (0.41%) | (1.70%) | (13.11%) | ||
| 83 | 20 | 3 | ||
| 0.9992 | 0.9647 | 0.2157 | ||
| Gsdmd | 0.0000 | 0.1950 | 7.5273 | |
| (0.00%) | (0.23%) | (11.86%) | ||
| 603 | 50 | 4 | ||
| 1.0000 | 0.9999 | 0.2748 | ||
| Acaa1b | 0.0000 | 2.3052 | 7.3439 | |
| (0.00%) | (2.76%) | (11.57%) | ||
| 142 | 9 | 5 | ||
| 1.0000 | 0.8896 | 0.2902 | ||
| Acaa1a | 0.4146 | 2.1847 | 6.5803 | |
| (0.49%) | (2.62%) | (10.37%) | ||
| 61 | 12 | 6 | ||
| 0.9987 | 0.9020 | 0.3614 | ||
| Tfam | 0.0000 | 2.2218 | 6.3974 | |
| (0.00%) | (2.66%) | (10.08%) | ||
| 1331 | 10 | 7 | ||
| 1.0000 | 0.8982 | 0.3802 | ||
| Sugct | 0.0000 | 0.0000 | 4.1599 | |
| (0.00%) | (0.00%) | (6.55%) | ||
| 1308 | 56 | 8 | ||
| 1.0000 | 1.0000 | 0.6551 |