| Literature DB >> 33722938 |
Young-Sook Kim1,2,3,4,5, Graham D Johnson1,2,3,4, Jungkyun Seo1,2,3,4,5, Alejandro Barrera1,2,3,4, Thomas N Cowart1,4, William H Majoros1,3,4,5, Alejandro Ochoa1,4,5, Andrew S Allen1,2,4,5, Timothy E Reddy1,2,3,4,5.
Abstract
High-throughput reporter assays such as self-transcribing active regulatory region sequencing (STARR-seq) have made it possible to measure regulatory element activity across the entire human genome at once. The resulting data, however, present substantial analytical challenges. Here, we identify technical biases that explain most of the variance in STARR-seq data. We then develop a statistical model to correct those biases and to improve detection of regulatory elements. This approach substantially improves precision and recall over current methods, improves detection of both activating and repressive regulatory elements, and controls for false discoveries despite strong local correlations in signal.Entities:
Mesh:
Year: 2021 PMID: 33722938 PMCID: PMC8092017 DOI: 10.1101/gr.269209.120
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1.Technical biases affect STARR-seq signal. (A) STARR-seq input libraries have higher signal variance than ChIP-seq input control libraries. Variance in per base signal in individual RPKM-normalized libraries are plotted for Chromosome 1. The error bars indicate variance between replicates. The number of replicates plotted is as follows: six replicates for STARR-seq A549 data; two replicates for STARR-seq HeLa-S3 data; three replicates for ChIP-seq A549 data; two replicates for ChIP-seq HeLa-S3 data; three replicates for ChIP-seq LNCaP data. (B) Representative browser signal tracks are shown for STARR-seq and ChIP-seq input libraries (Chr 1: 11,197,048–11,236,707). Signals are RPKM-normalized. (C) Pearson's correlations of STARR-seq input library signals in 1-bp windows along Chromosome 1. (D) DNA sequence biases impact STARR-seq signals. STARR-seq signals are plotted for 500-bp windows with varying degrees of bias for the following physical properties of DNA: fragment-end DNA structures, Gibbs free energy, G-quadruplex structure, and mappability. Whiskers extend 1.5 times the interquartile range. Center lines in the boxes show the medians. In plots of fragment-end bias, minor groove width (MGW) and propeller twist (ProT) are plotted and the ideal is log2(Freq in input/Freq in ref)=0. In plots of other biases, the ideal line is the median signal. (E) PCR amplification introduces bias into STARR-seq libraries. The impact of Gibbs free energy bias is shown for PER1 BAC libraries amplified with different numbers of PCR cycles (3, 6, 12, and 18 cycles). Each point represents the sum of signals in a 500-bp window from three technical replicates. The solid line is a lowess fit line. The dashed ideal line is the median signal across all windows.
Figure 2.The CRADLE GLM approach accurately predicts signal bias. (A) Equation of the GLM to predict the impact of technical biases and approach used by CRADLE to calculate bias covariates. To estimate bias effects for each position (blue), we used a window centered on that position that was twice the median fragment length, L. We assume L number of fragments (green) in a window and that each fragment is L-bp in length. To calculate each bias covariate for the position, we combined quantitative measures from L fragments. (pos) Single-bp position; (MGW) minor grove width; (ProT) propeller twist; (Anneal) annealing efficiency; (Denature) denaturation efficiency; (Gquad) G-quadruplex structure; (Map) mappability. (B)–(F) The results from the GLM fitted with Johnson et al. (2018) STARR-seq data (six input libraries and five 0-h dex-treated output libraries) and Muerdter et al. (2018) STARR-seq data (two input libraries and two no-inhibitor-treated output libraries). For C–F, the results were visualized for Chromosome 1. (B) Coefficients in input libraries for regions with signals above and below the 90th percentile (“Regions with high input signal” and “The rest of regions,” respectively). (C) Ratio of the sum of squared errors with structured sampling to the sum of squared errors with random sampling are plotted for regions with extremely high signals (above the 99th percentile). (D) Variance explained by CRADLE are plotted. The R2 values are from GLMs fitted with input and output STARR-seq libraries. The error bars indicate variance between replicates. (E) Distribution of GLM residuals and the STARR-seq effect sizes are shown after correction. (F) Squared semipartial correlations are shown for fragment-end, Gibbs free energy, G-quadruplex, and mappability covariates. The error bars indicate variance between replicates. (G) The R2 values of the GLMs are shown for PER1 BAC libraries amplified with different numbers of PCR cycles. (H) Coefficients of anneal and denature covariates are shown for the GLM fitted with PER1 BAC libraries. The error bars show a 95% confidence interval.
Figure 3.The CRADLE GLM approach corrects technical bias. (A) STARR-seq signals are plotted for 500-bp windows along Chromosome 1 after removing technical bias with CRADLE. Signal is balanced despite varying degrees of technical biases. The ideal line is the median corrected signal. Whiskers extend 1.5 times the interquartile range. Center lines in the boxes show the medians. (B) Variance in observed signals and CRADLE-corrected signals in 1-bp windows are shown along Chromosome 1. The error bars indicate variance between replicates; six input libraries and five 0-h dex-treated output libraries in Johnson et al. (2018) are plotted; two input libraries and two no-inhibitor-treated output libraries in Muerdter et al. (2018) are plotted. (C) STARR-seq signals are shown for PER1 BAC libraries amplified with different numbers of PCR after removing technical bias with CRADLE. Each point represents the sum of corrected signals in a 500-bp window from three technical replicates. The solid line is a lowess fit line. The dashed ideal line is the median signal across all windows. (D) Variance in observed signals and CRADLE-corrected signals in 1-bp windows are shown after correcting the PER1 BAC libraries. (E) Representative signal tracks are shown for STARR-seq input libraries before and after CRADLE correction (Chr 2: 29,772,197–29,791,543). (F) STARR-seq and ChIP-seq signal tracks are shown in the dex-responsive PER1 locus. Observed and corrected signal of Johnson et al. (2018) are presented for 0-h dex-treated (untreated) and 12-h dex-treated output libraries. ChIP-seq signal tracks are not corrected. The highlighted region (Chr 17: 8,151,204–8,152,809) is a known dex-responsive activating regulatory element. (G) STARR-seq signal tracks are shown for the TMEM63C locus. Observed and corrected signal of Johnson et al. (2018) are presented for input and 0-h dex-treated output libraries. The highlighted region (Chr 14: 77,207,895–77,210,261) contains a REST motif and is occupied by REST in multiple cell types.
Figure 4.Detection of regulatory elements with CRADLE. (A) CRADLE regulatory element pipeline is shown in diagram. Effect sizes are calculated in windows of uniform length. Contiguous windows with similar effect sizes are merged into regions before filtering regions with small variance. Regions are binned and a statistical test is performed on each bin to compare corrected input and output signals. Bin-level P-values are merged to generate a region-level P-value before performing a region-level Benjamini–Hochberg (BH) procedure. Regions selected by the first BH procedure were used to perform a bin-level BH procedure to identify regulatory elements. (B) The number of detected regulatory elements is dependent on the variance filter. (C,D) Precision recall curves, using corrected and uncorrected signals in the simulation study. To detect regulatory elements with uncorrected signals, two statistical approaches were used: (1) fitting uncorrected signals to Poisson GLM and performing Wald test (“Uncorrected 1”) and (2) using a Poisson distribution with the mean of uncorrected input signals as a null distribution and testing the significance of the mean of uncorrected output signals (“Uncorrected 2”). (C) Precision recall curve when signals are simulated with mixed fold change (2, 3, 4) and a mix of activating and repressive elements. (D) Precision recall curve when signals are simulated with a fixed fold change (FC) and with a fixed regulatory activity (either activating or repressive). (E–G) Comparison of inhibitor-responsive regulatory elements detected by CRADLE and Muerdter et al. (2018). (E) The Venn diagram shows the overlap of regulatory elements detected by both studies. (F) Transcription factor motif enrichment is shown for inhibitor-responsive repressive regulatory elements exclusively detected by each study. Rank* is the rank of motif in the other study. (G) The mean of IRE3 ChIP-seq effect size is plotted for inhibitor-responsive repressive regulatory elements exclusively detected by each study. (H) The Venn diagram shows the overlap of dex-responsive activating and repressive regulatory elements detected by CRADLE and Johnson et al. (2018). (I) Transcription factor motif enrichment in A549 steady-state repressive regulatory elements detected by CRADLE.