| Literature DB >> 34099673 |
Barry Slaff1, Caleb M Radens2,3, Paul Jewell2, Anupama Jha1, Nicholas F Lahens4, Gregory R Grant2,4, Andrei Thomas-Tikhonenko5,6, Kristen W Lynch2,3,7, Yoseph Barash8,9,10.
Abstract
The effects of confounding factors on gene expression analysis have been extensively studied following the introduction of high-throughput microarrays and subsequently RNA sequencing. In contrast, there is a lack of equivalent analysis and tools for RNA splicing. Here we first assess the effect of confounders on both expression and splicing quantifications in two large public RNA-Seq datasets (TARGET, ENCODE). We show quantification of splicing variations are affected at least as much as those of gene expression, revealing unwanted sources of variations in both datasets. Next, we develop MOCCASIN, a method to correct the effect of both known and unknown confounders on RNA splicing quantification and demonstrate MOCCASIN's effectiveness on both synthetic and real data. Code, synthetic and corrected datasets are all made available as resources.Entities:
Mesh:
Year: 2021 PMID: 34099673 PMCID: PMC8184769 DOI: 10.1038/s41467-021-23608-9
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1Batch effects impact both gene expression and splicing analysis.
Uniform manifold approximation and projection (UMAP) of gene expression analyses (a, c) and splicing analysis (b, d) for TARGET (top, N = 870) and ENCODE (bottom, N = 489). Colors indicate batch identity. Numbers in red represent percent of total variation (R2) associated with batch in each dataset. Shapes mark samples from the same patient (TARGET, patient TARGET-10-PANKAK) or experiment type (ENCODE, U2AF2 KD) which cluster by batch. TPM, transcripts per million; FC, Fold change; dPSI, delta percent splice inclusion.
Fig. 2Removal of batch effects from simulated RNA-Seq data with MOCCASIN.
RNA-Seq samples from mouse Aorta and Cerebellum were simulated using BEERS while injecting G% of the genes in half the samples with a batch effect of C% expression change of the main isoform (see main text). a Cumulative distribution of the difference (|dPSI|) from simulated ground truth after batch signal injection (G = 20%, C = 60%) either before MOCCASIN (blue) or after correction with increasing numbers of samples for each of the four batch/condition combinations: 1 × 4 (4 total, purple), 2 × 4 (8 total, brown), 3 × 4 (12 total, yellow), and 4 × 4 (16 total, orange). All plots are derived from the same representative sample (SRR1158528) to maintain a fixed base for comparison, with similar plots observed for other perturbed samples (data not shown). Total number of LSVs: 21566. b, c The number of LSVs (Y-axis) detected as differential (|E(dPSI)| > 0.2) for the batch 1 (N = 4) versus batch 2 (N = 4) signal (b, left) and the aorta (N = 4) vs cerebellum (N = 4) signal (c, right) across a range of increasingly significant p-values (X-axis, Student’s t test, −log10 scale). Number of samples used is 4 per batch/tissue combination (same as in the orange line in a). The green points (“Ground Truth”) are from the simulated data with no batch signal injection and the blue points (“Before MOCCASIN”) are from the same data after batch signal injection (G = 20%, C = 60%). Both blue and green points serve as reference points for MOCCASIN correction of the batch signal. Orange and gray represent, respectively, the results after MOCCASIN correction when the batches are known or unknown. d, e Assessing false positive rate (FPR) for the batch signal (d, left) and false discovery rate (FDR) for the tissue signal (e, right) for a range of G values (2, 5, 20%) and C effect size (2, 10, 60%). Number of samples same as in (b, c). Here positive events where considered as those changing by at least 20% with high confidence by MAJIQ (P(|dPSI| > 0.2) > 0.95). Under these definitions small effect sizes (C = 2,10%) represent perturbations that are not expected to affect the positive event set much. f Heatmaps of E(PSI) from simulated data without batch effect (ground truth, left), with simulated batch effect (G = 20%, C = 60%) without correction (middle), and after applying MOCCASIN with 1 known confounding factor (right). Each column is a sample, and each row is an LSV (N = 4941). The colored bars above the samples denote the sample’s tissue (8 aorta samples in purple, and 8 cerebellum samples in green) and batch (8 batch 1 sample in red and 8 batch 2 samples blue).
Fig. 3Batch correction of TARGET and ENCODE datasets.
a UMAP plot for ENCODE (N = 489) as in Fig. 1 but after applying MOCCASIN. U2AF2 knockdowns (square and circle) now cluster together, percent of total variance (R2) associated with batch drops to 4.3% (red text) b Pearson correlation of significant splicing changes upon U2AF2 KD (dPSI > 0.2) between batches increases from R = 0.71 (top left) to R = 0.79 after MOCCASIN (top right). Similarly, the number of significantly changing events (P(|dPSI | > 0.2) > 0.95) in each batch and the overlap of these events between batches increases after MOCCASIN correction (c, compare bottom left and right). c UMAP plot for TARGET (N = 870) as in Fig. 1 but after applying MOCCASIN without specifying a cell type (top left), after adding an unknown confounder (bottom left), and when using a cell type specific model after inferring missing cell type labels (right). TARGET technical replicates P25a-c cluster together (square, circle, and triangle), and R2 drops to 0.3% (red text). d Pearson correlation of significant splicing changes between primary and relapse samples from the same patient (TARGET-10-PANKAK) increases from R = 0.87 (top left) to R = 0.97 (top right) after correcting for the sequencing platform. Note that since only the HiSeq2500 samples are corrected by MOCCASIN, only the set that includes those (cyan) is affected. Accordingly, the total number of significantly changing events (P(|dPSI| > 0.2) > 0.95) drops for the cyan set from 18908 (bottom left) to 16100 (bottom right) but the overlap with the other set increases (11360 to 13290).