R-loopBase: a knowledgebase for genome-wide R-loop formation and regulation.

Ruoyao Lin¹, Xiaoming Zhong², Yongli Zhou¹, Huichao Geng³, Qingxi Hu¹, Zhihao Huang³, Jun Hu¹, Xiang-Dong Fu⁴, Liang Chen³, Jia-Yu Chen¹.

Abstract

R-loops play versatile roles in many physiological and pathological processes, and are of great interest to scientists in multiple fields. However, controversy about their genomic localization and incomplete understanding of their regulatory network raise great challenges for R-loop research. Here, we present R-loopBase (https://rloopbase.nju.edu.cn) to tackle these pressing issues by systematic integration of genomics and literature data. First, based on 107 high-quality genome-wide R-loop mapping datasets generated by 11 different technologies, we present a reference set of human R-loop zones for high-confidence R-loop localization, and identify conserved genomic features associated with R-loop formation. Second, through literature mining and multi-omics analyses, we curate the most comprehensive list to date of R-loop regulatory proteins and their targeted R-loops in multiple species. These efforts help reveal a global regulatory network of R-loop dynamics and its potential links to the development of cancers and neurological diseases. Finally, we integrate billions of functional genomic annotations, and develop interactive interfaces to search, visualize, download and analyze R-loops and R-loop regulators in a well-annotated genomic context. R-loopBase allows all users, including those with little bioinformatics background, to utilize these data for their own research. We anticipate that R-loopBase will become a one-stop resource for the R-loop community.

Substances:
RNA
DNA

Year: 2022 PMID: 34792163 PMCID: PMC8728142 DOI: 10.1093/nar/gkab1103

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

R-loops are three-stranded nucleic acid structures composed of an RNA:DNA hybrid and a displaced single-stranded DNA (1). Initially considered rare by-products of transcription, R-loops are now found widely distributed across genomes in species from bacteria to human (2). Excessive R-loops are critical sources of genome instability (3,4), and underlie many human diseases (5), such as cancers (6,7), neurodegenerative disorders (8) and autoimmune diseases (9). Intriguingly, R-loops have been increasingly appreciated as key cellular regulators in many physiological processes (1,4), including DNA replication (10), homologous recombination (11), DNA damage repair (12) and transcription (13). Collectively, the functional studies of R-loops have greatly advanced both basic and translational research.

Over the past decade, the recognition of the functional importance of R-loops has accelerated the development of more than ten different genome-wide R-loop detection technologies that are based on either the anti-RNA:DNA hybrid monoclonal antibody S9.6 (14–21) or (the hybrid binding domain of) catalytically-deficient RNase H1 (21–25). Although all these technologies are able to create a genome-wide R-loop map, they are not always consistent with each other regarding R-loop sizes and locations, and other associated genomic features (1,26–28). While an early electron microscopy study suggested that individual R-loop structures are ∼200 nt in length (29), sizes of mapped R-loops span a much broader range, from a few hundred bases to several kilobases. Although promoter regions are generally considered R-loop hotspots (21), a large fraction of R-loops mapped by some technologies lie within gene bodies (18), in the vicinity of transcription termination sites (18), or even in intergenic regions (15,19). Furthermore, many mapped R-loop regions do not seem to comply with the GC skew sequence property, the G4 formation propensity, and the topological requirement, all of which have been found tightly associated with R-loop formation (13,21,29,30). While limitations of different technologies that could largely explain these discrepancies have been extensively discussed (1,26–28), what is missing is a systematic effort to compare different technologies side by side in a well-annotated genomic context, and to integrate all available R-loop mapping data to generate a reference set of R-loops for future functional R-loop studies.

An increasing number of proteins are now thought to regulate R-loop dynamics. Certain RNA processing proteins and DNA damage-related factors may counteract R-loop formation, and some nucleases and helicases may resolve existing R-loops. However, the molecular mechanism and the global regulatory network involved are only partially understood. For example, considering their pervasive association with chromatin (31), RNA processing proteins may regulate R-loops in an RNA-independent way rather than through RNA binding activity as proposed before (4). Although helicases are in general considered to limit R-loops by resolving RNA:DNA hybrids, they may instead promote R-loop formation by resolving structured RNA to facilitate its invasion into DNA (32). R-loop regulatory proteins, termed R-loop regulators, have been systematically profiled by either S9.6 antibody or hybrid probe enrichment coupled with mass spectrometry analysis (33,34). However, due to the affinity of S9.6 for dsRNA (35), it remains unclear whether the identified proteins are truly involved in R-loop regulation. Furthermore, validated R-loop regulators are scattered in the literature and how they are functionally connected is unclear.

Here, we establish R-loopBase to tackle the above challenges. The massive amount of human genomics data enables us to perform integrative analysis. We thus generate human R-loop zones of different confidence levels and an R-loop regulome with comprehensive functional annotations. We also curate and annotate R-loop regulators in mouse, yeast (Saccharomyces cerevisiae) and Escherichia coli to support R-loop research in these model organisms. User-friendly interfaces are developed to allow users, even those with little bioinformatics background, to leverage these data for their own R-loop research. We will continuously update R-loopBase in the future to better serve the R-loop community.

MATERIALS AND METHODS

Collection and analyses of genome-wide R-loop mapping data in human

We collected meta-information on genome-wide R-loop mapping data from all PubMed literature matching the query “R-loop” OR “R-loops” OR “RNA DNA hybrid” OR “RNA DNA hybrids” OR “DNA RNA hybrid” OR “DNA RNA hybrids”. In the current release of R-loopBase, datasets generated from human cells under basal conditions and published on or before March 31st 2021 were downloaded (Supplementary Table S1). In total, 118 datasets generated by 11 different technologies were collected (9,13,14,16,18,19,21,23–25,28,34,36–55) (Table 1 and Supplementary Figure S1), and subjected to a standardized data analysis pipeline (Supplementary Figure S2). Briefly, technical replicates, if any, were merged first, and raw sequencing data were then mapped to the human genome (hg38) using Bowtie2 in local alignment mode (56). Uniquely mapped, non-redundant reads were kept as useful reads, and samples with >7M useful reads were considered to have sufficient read counts. To make maximal use of the sequencing data, biological replicates with <7M useful reads were merged to meet the minimal read count cutoff as long as they were highly correlated (Spearman correlation coefficient > 0.5). Finally, peak calling was done with MACS2 (57) for all useful reads (DRIP-seq, DRIVE-seq, MapR and R-loop CUT&Tag) or for useful reads from the Watson or Crick strand separately (DRIPc-seq, RDIP-seq, ssDRIP-seq, qDRIP-seq, R-ChIP and RR-ChIP), using a q-value cutoff of 0.01 for narrow peaks (R-ChIP and R-loop CUT&Tag) and 0.05 for broad peaks (DRIP-seq, DRIPc-seq, RDIP-seq, ssDRIP-seq, qDRIP-seq, DRIVE-seq, MapR and RR-ChIP). If multiple biological replicates existed, peaks with ≥50 bp overlap among ≥2 replicates were merged and taken as reproducible peaks. Only peaks with strong signal enrichment (fold change ≥ 2) and outside of ChIP-seq blacklisted regions were used for downstream analysis. Samples with <100 called peaks were discarded. Following ENCODE guidelines for ChIP-seq data analyses (58), we further calculated signal portion of tags (SPOT) and reads in blacklisted regions (RiBL) as part of the quality control metrics for users’ reference (Supplementary Table S2). For bisDRIP-seq data, rather than calling peaks, we uploaded the processed data to the R-loopBase genome browser for visualization and comparison with other data.
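
The replicate-reproducibility rule described above (peaks with ≥50 bp overlap among ≥2 replicates are merged) can be illustrated with a minimal Python sketch. This is not the authors' pipeline; the function names, the half-open (start, end) interval convention and the restriction to a single chromosome and strand are assumptions made for illustration only.

```python
from itertools import combinations

def overlap_bp(a, b):
    """Number of bases shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def merge_intervals(intervals):
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def reproducible_peaks(replicates, min_overlap=50):
    """Keep peaks supported by at least two replicates, then merge them.

    replicates: list of peak lists, one per biological replicate
                (hypothetical input layout, one chromosome and strand).
    A peak is kept if it overlaps a peak from another replicate
    by at least min_overlap bases.
    """
    supported = set()
    for rep_a, rep_b in combinations(replicates, 2):
        for peak_a in rep_a:
            for peak_b in rep_b:
                if overlap_bp(peak_a, peak_b) >= min_overlap:
                    supported.add(peak_a)
                    supported.add(peak_b)
    return merge_intervals(supported)
```

The pairwise comparison is quadratic in the number of peaks, which is acceptable for a toy example; a genome-scale analysis would normally rely on a sorted sweep or a dedicated interval tool.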

Table 1.

Genome-wide R-loop mapping data in human

Technology^a	Treatment	Samples	Datasets	References
DRIP-seq	control	B-cell (1/1^b), CHLA10 (1/1), EWS502(1/1), HeLa (4/4), HEK293 (2/2), SHSY5Y (2/2), TC32 (1/1), Stromal (4/4), Basal-epithelial (4/4), Luminal-progenitor (4/4), Mature-luminal-epithelial (4/4), MCF-7 (1/1), NT2 (6/6), K562 (2/1), Primary-fibroblast (2/2), U2OS (8/7), U87 (2/2), Jurkat (2/0), T-cells (2/0), IMR-90 (1/0), HEK293T (1/0)	55/47	(9,18,21,28,34,36–51)
	knock down	U2OS (8/6), U87 (2/2), HeLa (4/4), HEK293 (2/2), SHSY5Y (2/2)	18/16	(38,41–43,50,51)
RDIP-seq	control	HeLa (2/2), IMR-90 (1/1), HEK293T (1/0)	4/3	(19,52)
	knock down	HeLa (2/2)	2/2	(52)
DRIPc-seq	control	K562 (2/2), HEK293 (2/2), NT2 (2/2)	6/6	(18,41,48)
	knock down	K562 (2/2), HEK293 (2/2)	4/4	(41,48)
ssDRIP-seq	control	HeLa (3/3), hVECs (2/2), hESCs (2/2), hiPSCs (2/2), hMSCs (2/2), hNSCs (2/2), hVSMCs (2/2)	15/15	(53,54)
	knock down	HeLa (3/3)	3/3	(53)
bisDRIP-seq	control	MCF-7 (13/13)	13/13	(16)
qDRIP-seq	control	HeLa (3/2)	3/2	(14)
DRIVE-seq	control	NT2 (1/1)	1/1	(21)
R-ChIP	control	HEK293T (5/5), K562 (2/2), HeLa (1/0)	8/7	(13,55)
RR-ChIP	control	HeLa (2/2)	2/2	(23)
MapR	control	HEK293 (3/3), U87T (2/2)	5/5	(25)
R-loop CUT&Tag	control	HEK293T (6/6)	6/6	(24)
Sum		-	145/132

^a Refer to Supplementary Figure S1 for procedures of different technologies.

^b Number of datasets analyzed/high-quality datasets.

Generation of human R-loop zones of different confidence levels

Considering that R-loop peaks co-detected independently by multiple technologies are more likely to be conserved, bona fide R-loops, we performed integrative analysis of all R-loop peaks mapped in human cells by all technologies. First, stranded R-loop peaks were merged into R-loop zones on the Watson or Crick strand separately. Non-stranded R-loop peaks that overlapped with stranded R-loop peaks were assigned to the corresponding Watson or Crick R-loop zones. The remaining non-stranded R-loop peaks were merged into non-stranded R-loop zones. Second, the resulting R-loop zones were further partitioned into sub-regions of different confidence levels n, where n is the minimal number of technologies by which an individual sub-region was detected. For example, R-loop zones of level 3 were those co-detected by ≥3 technologies. Small R-loop zones (<50 bp) were not included.
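
The partitioning into confidence levels can be sketched as a sweep over peak boundaries that counts how many distinct technologies cover each segment. The sketch below is illustrative only and is not the R-loopBase implementation; the function name, the (start, end, technology) tuple layout and the restriction to one chromosome and strand are assumptions.

```python
from collections import Counter

def confidence_zones(peaks, level, min_size=50):
    """Return sub-regions covered by at least `level` distinct technologies.

    peaks: iterable of (start, end, technology) with half-open coordinates,
           all on the same chromosome and strand (assumed input layout).
    Zones shorter than min_size bases are dropped.
    """
    events = []
    for start, end, tech in peaks:
        events.append((start, 1, tech))   # peak opens
        events.append((end, -1, tech))    # peak closes
    events.sort(key=lambda e: e[0])

    active = Counter()   # open peaks per technology
    zones = []
    prev_pos = None
    i = 0
    while i < len(events):
        pos = events[i][0]
        if prev_pos is not None and pos > prev_pos:
            # Segment [prev_pos, pos) is covered by the currently open peaks.
            depth = sum(1 for count in active.values() if count > 0)
            if depth >= level:
                if zones and zones[-1][1] == prev_pos:
                    zones[-1] = (zones[-1][0], pos)   # extend adjacent zone
                else:
                    zones.append((prev_pos, pos))
        # Apply every event located at this position.
        while i < len(events) and events[i][0] == pos:
            _, delta, tech = events[i]
            active[tech] += delta
            i += 1
        prev_pos = pos
    return [(s, e) for s, e in zones if e - s >= min_size]
```

Counting distinct technologies rather than raw peak depth means that two overlapping peaks from the same technology contribute only once to the level, which matches the definition of n given above.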

Collection of known R-loop regulators

To collect known proteins involved in R-loop regulation from the scattered literature, we downloaded all PubMed publications matching the query “R-loop” OR “R-loops” OR “RNA DNA hybrid” OR “RNA DNA hybrids” OR “DNA RNA hybrid” OR “DNA RNA hybrids” for manual curation of R-loop regulators. A protein is considered an R-loop regulator if it binds to, stabilizes, resolves or degrades RNA:DNA hybrids, or if levels of R-loops or RNA:DNA hybrids change upon chemical or genetic manipulation of the protein. Validated R-loop regulators in human, mouse, yeast (Saccharomyces cerevisiae) and E. coli were collected and annotated (Supplementary Table S3), and putative R-loop regulators identified in high-throughput screening studies (33,34) were also included.

Protein–protein Interaction (PPI) network and gene list enrichment analysis of R-loop regulators

GO (gene ontology) (59) and DO (disease ontology) (60) enrichment analyses of validated or putative R-loop regulators were done with clusterProfiler (61), and only top-ranked GO or DO terms are shown. For validated R-loop regulators, high-confidence physical interactions (minimal required interaction score = 0.700) supported by experimental evidence and integrated information from databases were retrieved from the STRING database (62) and visualized with Cytoscape (63). Proteins were organized into different clusters using the AutoAnnotate plug-in (64). GO and KEGG enrichment analyses were then performed for clusters consisting of at least three genes.

Identification of R-loops targeted by R-loop regulators

R-loops have been mapped with replicates before and after gene knockdown for ten R-loop regulators (Supplementary Table S1), allowing us to identify their targeted R-loops. To do so, broad peaks were first called following the references in which these data were originally reported. Differential binding analyses were then done with the default settings of the DiffBind package (65), except for summits = FALSE and bUseSummarizeOverlaps = TRUE. DESeq2 (66) was called by DiffBind for differential analyses. Thirty-three validated human regulators have ChIP-seq data available, and 21 have CLIP-seq data available. We downloaded peak files of these data from the ENCODE project (67). R-loops intersecting with ChIP-seq or CLIP-seq peaks, as determined by BEDTools (68), were taken as potential targets of the corresponding regulator. For the above human R-loop regulators, we turned to the published literature to compile a list of their targeted R-loops validated by DNA:RNA hybrid immunoprecipitation qPCR (Supplementary Table S4). Primer-BLAST was used to locate the targeted R-loop region, which was kept only when there was exactly one perfectly matched locus and that locus was consistent with what was reported in the corresponding literature.

Integration of functional genomics data

For cell types with R-loop mapping data available, we integrated and analyzed other genomics data that might help to distinguish bona fide R-loops from false positives, or to interpret the molecular function of R-loops and R-loop regulators (Supplementary Table S5). G4 ChIP-seq (69–71), RPA ChIP-seq (43,72), GRO-seq (13,73–76) and PRO-seq (24,77) data were searched and downloaded from GEO (78). All datasets were processed following the original publications. ChIP-seq data for 29 different histone modifications, CLIP-seq and ChIP-seq data for validated R-loop regulators, and ATAC-seq, WGBS and Repli-seq data were downloaded from ENCODE (67). R-loop-related sequence features were prepared as follows. Putative G4 motifs were identified on a genome-wide scale as described previously (79). GC percent was directly downloaded from the UCSC genome browser (80). AT or GC skew values were computed for every 110 nt window with a step size of 10 nt. Predicted R-loop forming sequences were downloaded from R-loopDB (81). CrossMap was used for conversion between different genome builds whenever needed (82). Expression profiles of R-loop regulators were based on GTEx (83), TCGA (84) and CCLE (85) data.
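
The sliding-window skew computation (110 nt windows, 10 nt step) can be sketched as follows. The text does not give the formula, so the commonly used definition skew = (G − C)/(G + C), and likewise (A − T)/(A + T), is assumed here; the function name and the choice to report 0.0 for windows containing neither base are also illustrative assumptions.

```python
def skew_profile(seq, window=110, step=10, pair=("G", "C")):
    """Sliding-window skew for one strand of a sequence.

    For pair=("G", "C") each window gets (G - C) / (G + C);
    pair=("A", "T") gives AT skew in the same way.
    Returns a list of (window_start, skew) tuples. Windows with
    neither base are assigned 0.0 (an assumption of this sketch).
    """
    seq = seq.upper()
    first, second = pair
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        n_first = chunk.count(first)
        n_second = chunk.count(second)
        total = n_first + n_second
        skew = (n_first - n_second) / total if total else 0.0
        profile.append((start, skew))
    return profile
```

For example, `skew_profile(sequence)` yields the GC skew track and `skew_profile(sequence, pair=("A", "T"))` the AT skew track for the same windows; sequences shorter than one window produce an empty profile.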

Development of R-loopBase management system and interactive user interfaces

R-loopBase was developed using several web development technologies. Data were largely managed with MySQL. Web pages were built by HTML, CSS, JavaScript, AJAX, JSP, Java and Tomcat. The page contents were delivered by Apache. A local R-loopBase genome browser was developed on the basis of UCSC genome browser (80,86). R-loopBase is freely accessible using the URL https://rloopbase.nju.edu.cn.

RESULTS

Identification and characterization of a reference set of human R-loop zones

A major challenge in the R-loop field is to precisely locate R-loops across the genome (1). Different technologies usually give rise to distinct genome-wide R-loop maps in the same cells (13,14,23–25,41,43,52,53) (Figures 1A–C). Even when the same technology is applied to the same cell line, different labs (42,46,51) (Figures 1A and D) or the same lab (18,21,36) (Figures 1A and E) sometimes generate R-loop maps with large differences. These broad discrepancies are largely rooted in the different protocols of DNA fragmentation, R-loop enrichment and sequencing library preparation adopted by different (and even the same) technologies (Supplementary Figure S1). It is currently premature to conclude which technology or protocol is better than the others. Nevertheless, the high reproducibility between biological replicates generated in the same study indicates that each technology itself is robust if all experimental conditions are well controlled (54) (Figure 1F). On inspection, we noted that previously verified positive R-loop loci, such as the R-loop forming region at the ATRAID promoter (13) (Figure 2A), are usually well supported by multiple independent technologies. In contrast, as exemplified by the SNRPN locus (87) (Figure 2A), most negative R-loop loci are either not detected or detected by only a limited number of technologies. We therefore postulate that integrative analysis of all R-loop mapping data holds the promise of distinguishing conserved R-loop zones from technology-specific false positives or experimental variations.

Figure 1. Broad discrepancy among mapped R-loops. (A) Shown are R-loop peaks and signals at FUS gene locus. Different technologies (the top two panels), DRIP-seq datasets generated by different labs (the third panel) and in different experiments of the same lab (the fourth panel), and two biological replicates generated by ssDRIP-seq (the bottom panel) are coded in different colors. The same color scheme is used in (B–F). (B, C) Upset plots showing the technology-specific R-loop peaks or co-detected R-loop peaks in HEK293 (B) or Hela (C) cells. (D–F) Venn diagrams showing the overlap of DRIP-seq data generated by different labs (D), in different experiments from the same lab (E), or ssDRIP-seq data generated as biological replicates in one study (F).

Figure 2. Identification and characterization of human R-loop zones. (A) Shown are two representative genomic loci covering one positive (highlighted in red rectangle) and one negative (highlighted in grey rectangle) R-loop forming regions verified previously. The R-loop signals detected by different technologies are illustrated. (B) Workflow for generation of human R-loop zones of different confidence levels. (C) Number of R-loop zones of different confidence levels. R-loop zones of levels 6–9 are further zoomed in. (D–H) Shown are R-loop size distribution (D), genomic distribution (E), GC skew (F), and distribution of percentages of individual R-loops overlapped with G4 ChIP-seq peaks (G) or predicted R-loop regions by R-loopDB (H) for R-loop zones of different confidence levels. TSS: transcription start sites; TTS: transcription termination sites. Linear regression is done for (G) and the Pearson correlation coefficient and p-value are indicated.

As R-loop mapping technologies have been extensively applied to human cells in comparison with other species, we were therefore motivated to generate a reference set of human R-loop zones by integrative analysis. 
A total of 118 datasets generated in 28 human cell types under basal conditions by 11 different R-loop mapping technologies (Supplementary Figure S1) were collected and processed (Materials and Methods, Table 1 and Supplementary Table S1). Of them, 107 datasets from 26 cell types passed quality control, which took into consideration usable read count, reproducibility, signal enrichment and specificity, among other criteria (Materials and Methods, Figure 2B and Supplementary Table S2). We combined all mapping data of different cell origins, aiming to characterize all possible R-loop forming regions in any human cell type. We assumed that regions detected by multiple independent technologies are more likely to be high-confidence R-loop forming regions, because co-detection mitigates the intrinsic limitations of individual technologies. Following this principle, we partitioned the R-loop regions mapped by different technologies into different confidence levels, which correspond to the minimal number of technologies they were detected by (Materials and Methods and Figure 2B). For example, the resulting R-loop zones of level 1 are those detected by ≥1 technology, those co-detected by ≥2 different technologies are classified as level 2 R-loop zones, and so on. As the confidence level increased, the number and median length of R-loop zones decreased from ∼800 000 to ∼200 (Figure 2C), and from ∼800 nt to ∼200 nt (Figure 2D), respectively. As predicted, R-loop zones of higher confidence levels were more often located at promoter regions (Figure 2E), and more likely to be associated with well-known R-loop-associated features, such as GC skew (Figure 2F) and G4 formation as determined by G4 ChIP-seq (Figure 2G), confirming that bona fide R-loops were increasingly enriched in R-loop zones of higher confidence levels. In support of the role of R-loops in regulating transcription termination (88), percentages of R-loops at TTS regions also gradually increased except for level 9. When compared with R-loops predicted solely from sequence features by R-loopDB (81), a similar trend was observed (Figure 2H). Notably, a considerable fraction of high-confidence R-loops were not predicted by R-loopDB (Figure 2H), suggesting that sequence features are not the only molecular determinant of R-loop formation. In sum, we characterized a reference set of human R-loops of different confidence levels, allowing customized analysis by researchers who are interested in R-loop biology.

Compendium of human proteins regulating R-loop homeostasis

The regulatory mechanisms of R-loop dynamics have been studied and documented in scattered literature, and we were thus prompted to understand the global regulatory network by manually curating and systematically characterizing the most comprehensive list of R-loop regulatory proteins (Materials and Methods). A total of 1185 proteins in human, 24 in mouse, 63 in yeast (Saccharomyces cerevisiae) and 21 in E. coli were collected as R-loop regulators (Supplementary Table S3 and Figure 3A). We next focused on human R-loop regulators given the greater breadth of data available in human. Of 1185 human proteins, 186 (15.69%) were validated in multiple independent studies (53, 4.47%), or by multiple assays (64, 5.40%) or one assay (69, 5.82%) in only one study (Figure 3A). Among a variety of assays (Supplementary Table S3), the S9.6 antibody was most widely used (Figure 3B). However, given the specificity issue of S9.6 antibody, the sensitivity to RNase H treatment was often introduced as the gold standard (Figure 3B). It was also an important alternative to directly examine the binding, helicase or nuclease activity of a protein towards synthetic RNA:DNA hybrids in vitro (Figure 3B). Although not validated yet, the remaining 999 human proteins are potentially implicated in R-loop regulation as well. They were significant hits in one or two proteomics profiling studies for identification of RNA:DNA hybrid binding proteins (Figure 3A) (33,34). Similar to validated R-loop regulators, they were also enriched to be DNA, RNA or chromatin binding proteins (Supplementary Figure S3), suggesting that a considerable fraction of them may be authentic R-loop regulators. Collectively, we cataloged a comprehensive list of proteins regulating R-loops to date.

Figure 3.

Characterization of R-loop regulators collected in R-loopBase. (A) Number of R-loop regulators in four species, and number of human R-loop regulators with different levels of supporting evidence, i.e., being validated in at least two independent studies (≥2 studies), with at least two independent assays (≥2 assays) or only one assay (1 assay), or hit in at least two high-throughput screening studies (≥2 HTS) or only one HTS study (1 HTS). (B) Number of top-ranked validation assays based on S9.6 monoclonal antibody, RNase H or synthetic nucleotide oligos that are indicated by different colors. (C) PPI network and GO enrichment analysis. PPI clusters defined by Cytoscape are labelled with different colors, and GO terms (biological process) for clusters 1–6 are shown using corresponding color schemes. Solid and dash-dot edges represent interactions of high and low confidence as defined by STRING database, respectively, and node size is proportional to interaction degree. (D) Left: number of regulators with or without target information. Right: number of regulators with their targeted R-loops revealed with indicated methods. (E) Top 10 regulators with the highest number of targeted R-loops identified by KD (pink) or ChIP-seq (blue) assay. (F) Shown are 9 regulators with both CLIP-seq and ChIP-seq data, with bubble size and color corresponding to number and percentage of RNA and DNA binding sites located at R-loop regions. (G) Bubble chart showing top 10 disease ontology (DO) terms, with the bubble size corresponding to number of genes associated with individual DO terms.

To gain an insight into how R-loop regulators are functionally related to each other, we performed PPI network analysis focusing on validated human R-loop regulators (Materials and Methods). Seven distinct clusters were readily identified, and GO enrichment analysis revealed cluster-specific enrichment of biological processes (Materials and Methods and Figure 3C). 
Clusters 1, 5 and 6 mainly consisted of proteins involved in DNA replication or damage response processes, consistent with both the deleterious role of R-loops in inducing replication stress and even DNA damage, and the regulatory role of R-loops in facilitating efficient DNA damage repair (12). Proteins in RNA metabolic processes, including RNA splicing and export factors (cluster 2) and the RNA exosome complex (cluster 7), constituted the second largest group of R-loop regulators, supporting the proposed function of RNA binding proteins in counteracting R-loop formation (4). Interestingly, many RNA binding proteins (cluster 4), especially those involved in small RNA processing, are functionally connected with DNA damage repair factors, in favor of the notion that RBP re-localization upon DNA damage may coordinately regulate RNA and DNA metabolism (89). As R-loops are dynamically coupled with transcription (13), the other group of R-loop modulators mainly included transcription elongation factors (cluster 3). With the availability of multi-omics data for human R-loop regulators, we were prompted to establish regulatory connections between individual R-loops and their regulators. The differential R-loop peaks following knockdown are likely targets of the corresponding regulator. In addition, R-loops may be directly modulated through the chromatin or RNA binding activity of their regulators. Accordingly, we cataloged putative targeted R-loops for 52 validated R-loop regulators based on knockdown, ChIP-seq or CLIP-seq data (Materials and Methods and Figure 3D). Importantly, these regulatory connections were well aligned with existing experimental results. We collected 175 validated R-loop forming regions targeted by one of 52 R-loop regulators (Materials and Methods and Supplementary Table S4). Although the existing experimental data were mostly from cell lines different from ours, we observed good validation rates for many regulators. For example, half or more of the experimentally verified R-loop targets are supported for BRCA1 (69%, 9/13), SMN1 (57%, 8/14), SIN3A (50%, 4/8) and U2AF1 (50%, 2/4), among others. Therefore, the regulatory relationships we deduced here are a good starting point for mechanistic understanding of R-loop regulation. With our data, we discovered that a few proteins may be master R-loop regulators, as they targeted thousands of high-confidence R-loops (Figure 3E). Nine regulators have both ChIP-seq and CLIP-seq data available, allowing us to explore their regulatory mechanism. Interestingly, although these are typical RNA binding proteins, their chromatin binding rather than RNA binding activity is in general more associated with high-confidence R-loops (Figure 3F). This finding contradicts the view that RNA binding proteins counteract R-loop formation by binding to RNA (4), and instead suggests that RNA binding proteins may regulate gene expression and R-loop dynamics through direct association with chromatin (6,31). More follow-up mechanistic studies are thus needed to resolve the puzzle. We further investigated the relevance of R-loop regulators to human diseases, in the hope of opening new avenues for future disease mechanism research from the perspective of R-loops. Malfunction of R-loop regulators was significantly associated with human diseases. Of 186 validated human R-loop regulators, 128 are associated with human diseases, significantly more than expected from the background (Monte Carlo simulation, P-value < 0.001). Besides cancers, neurological diseases in a broad sense, including ataxia telangiectasia and lateral sclerosis, were among the most enriched disease ontology terms (Figure 3G; Materials and Methods). It will be interesting to interrogate why and how neural cells are specifically less tolerant of deficiency of R-loop regulators.
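
The Monte Carlo comparison against background mentioned above (128 of 186 validated regulators being disease-associated) can be illustrated with a generic empirical-P-value sketch. The text does not describe the simulation procedure, so everything below is an assumption: random gene sets of the same size are drawn from a background gene universe, and the P-value is the add-one-corrected fraction of draws with at least as many disease-associated genes as observed. The function name, the background flags and the default of 999 simulations are hypothetical.

```python
import random

def monte_carlo_enrichment(background_flags, sample_size, observed,
                           n_sim=999, seed=0):
    """Empirical one-sided P-value for disease-gene enrichment.

    background_flags: list of booleans, one per background gene, True if
                      the gene is disease-associated (hypothetical input).
    sample_size:      size of the gene set of interest (e.g. 186).
    observed:         disease-associated genes in that set (e.g. 128).
    Returns (k + 1) / (n_sim + 1), where k is the number of random gene
    sets with at least `observed` disease-associated genes.
    """
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_sim):
        drawn = rng.sample(background_flags, sample_size)
        if sum(drawn) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_sim + 1)
```

With 999 simulations the smallest attainable value is 0.001, so reporting a P-value strictly below 0.001 would require more draws than this default.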

R-loopBase development for R-loop studies in a well-annotated genomic context

To maximize the usefulness of R-loop zones and regulators described above, we characterized them by integrating multiple categories of functional genomic annotations (Materials and Methods and Figure 4A). First, previous studies have linked R-loop formation with specific genomic features, the integration of which may help evaluate independently whether R-loop peaks detected by individual technologies or R-loop zones defined by our integrative analysis are true or not. To this end, we integrated the predicted R-loop forming sequences (81), and computed GC content, GC skew and AT skew, and predicted G4 motifs. As not all G4 motifs permit G4 formation in vivo and G4 structures may involve atypical motifs (90), we thus integrated G4 structures mapped with ChIP-seq technology as well. R-loop detection can rely on the identification of the displaced ssDNA. The binding profile of ssDNA-binding protein RPA was previously suggested as an alternative method to locate R-loops (4), so we also collected public RPA ChIP-seq data. R-loop formation is clearly a consequence of transcription, and histone marks can be taken as a proxy of transcription or chromatin status especially when transcription data are not available. Therefore, GRO-seq and PRO-seq data, and 29 types of histone modifications were integrated from ENCODE project (67) and elsewhere. R-loop formation is coupled with local unmethylated status of the genomic DNA (21), prompting us to further integrate whole genome bisulfide sequencing data. Hybrids between template DNA and RNA primers during DNA replication might be captured, we therefore also integrated replication timing data (91). Second, for individual R-loop regulators, we collected and organized gene annotations from HUGO, NCBI and GeneCards. We also annotated each regulator with its supporting evidences, molecular function for R-loop regulation and putative R-loop targets (Supplementary Table S3). 
As R-loop regulators are in general associated with human diseases (Figure 3G), to better facilitate functional studies in a specific disease or cell model, we annotated their expression in normal and cancerous tissues, as well as in cell lines, using data from the GTEx (83), TCGA (84) and CCLE (85) projects. Overall, about 300 datasets from ten broad categories were integrated (Figure 4A and Supplementary Table S5), generating billions of functional annotations.
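As an illustration of the sequence-composition features mentioned above, the following Python sketch computes GC content, GC skew and AT skew in fixed windows along a DNA sequence. It uses the conventional definitions GC skew = (G - C)/(G + C) and AT skew = (A - T)/(A + T). The function names, the window size and the step are hypothetical choices made for this example and are not taken from the R-loopBase pipeline.

```python
from typing import Dict, List


def base_composition(seq: str) -> Dict[str, float]:
    """Return GC content, GC skew and AT skew for one sequence window.

    Only A, C, G and T are counted; ambiguous bases such as N are ignored.
    """
    s = seq.upper()
    a, c, g, t = (s.count(base) for base in "ACGT")
    total = a + c + g + t
    gc, at = g + c, a + t
    return {
        "gc_content": gc / total if total else 0.0,
        "gc_skew": (g - c) / gc if gc else 0.0,  # (G - C) / (G + C)
        "at_skew": (a - t) / at if at else 0.0,  # (A - T) / (A + T)
    }


def sliding_windows(seq: str, window: int, step: int) -> List[Dict[str, float]]:
    """Compute the three features for each full-length window along seq."""
    return [
        base_composition(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: a G-rich stretch followed by an A-rich stretch.
    for features in sliding_windows("GGGCAAAT", window=4, step=4):
        print(features)
```

A positive GC skew indicates an excess of G over C on the strand being read, so the value changes sign when the complementary strand is analyzed.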

Figure 4.

R-loopBase data integration and interface development. (A) A matrix showing the datasets integrated in R-loopBase, with columns corresponding to different cell types and rows corresponding to different types of genomics data. Data sources for expression profiles of R-loop regulators are shown in the dialog box. *Including 1,185 regulators in human, 24 in mouse, 63 in yeast and 21 in E. coli. (B) The overall framework of the R-loopBase database. Datasets in (A) are systematically processed, updated and managed in R-loopBase (top left) and can be directly downloaded for genome-wide analysis (top right). In addition, through the query system (middle), all datasets can be retrieved for display on the R-loop page, the regulator page or the genome browser (bottom).

To support data management and to better serve the R-loop community, we developed the R-loopBase platform with multiple user-friendly interactive interfaces (Figures 4B and 5). First, ID- and location-based query systems were developed for searching human R-loop forming regions (Figure 5A). For each query, statistics of R-loop zones are shown with an UpSet plot (Figure 5B). Displayed below the plot are R-loopBase R-loop zones of different confidence levels, each of which can be accessed through the drop-down menu (Figure 5C). For each R-loop zone, the supporting technologies, the cell lines in which it is detected and its known regulators are listed in detail (Figure 5C). Users are further directed to the R-loopBase genome browser via the hyperlinked genomic coordinate for visualization of R-loops in a well-annotated genomic context (Figure 5D). Alternatively, ID-, location- and sequence-based queries are supported by the R-loopBase genome browser for visual inspection of R-loop zones. As the confidence level increases, R-loop zones become progressively narrower and increasingly associated with well-known R-loop-related features, such as GC skew, G4 formation and local transcriptional activity (Figure 5D).
The R-loopBase genome browser also allows direct comparison among R-loops identified by individual technologies and other genomic features.
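The logic of a location-based query can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions, not the R-loopBase implementation: the `Zone` record, the `query_zones` function, the half-open coordinate convention and the toy coordinates are all hypothetical. The number of supporting technologies is used here only as a simple filter; how R-loopBase maps that number onto its confidence levels is not reproduced.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Zone:
    """A hypothetical R-loop zone record (0-based, half-open coordinates)."""
    chrom: str
    start: int
    end: int
    technologies: FrozenSet[str]

    @property
    def support(self) -> int:
        """Number of distinct mapping technologies supporting this zone."""
        return len(self.technologies)


def query_zones(zones: Iterable[Zone], chrom: str, start: int, end: int,
                min_support: int = 1) -> List[Zone]:
    """Return zones overlapping [start, end) on chrom with enough support."""
    hits = [
        z for z in zones
        if z.chrom == chrom
        and z.start < end and start < z.end  # half-open interval overlap
        and z.support >= min_support
    ]
    return sorted(hits, key=lambda z: (z.start, z.end))


# Toy data for illustration only; these are not real R-loopBase zones.
ZONES = [
    Zone("chr1", 100, 500, frozenset({"DRIP-seq", "DRIPc-seq"})),
    Zone("chr1", 450, 900, frozenset({"DRIP-seq"})),
    Zone("chr2", 100, 500, frozenset({"RDIP-seq"})),
]

if __name__ == "__main__":
    for zone in query_zones(ZONES, "chr1", 480, 600, min_support=2):
        print(zone.chrom, zone.start, zone.end, sorted(zone.technologies))
```

A linear scan is adequate for a handful of zones. A genome-wide collection would normally be served from an interval index or a database query instead.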

Figure 5.

Screenshots of R-loopBase interfaces. (A) Search box for R-loops. (B) UpSet plot showing numbers of R-loop zones detected by the indicated technologies. (C) Detailed information for R-loop zones. Hyperlinked texts are highlighted in purple. (D) A screenshot of the R-loopBase genome browser showing the R-loop zone displayed in (C). Displayed from top to bottom are R-loop zones of nine levels, individual R-loop peaks mapped by different technologies, R-loop-associated features and R-loop regulators. (E) Query systems for R-loop regulators. (F) Four categories of annotation for the R-loop regulator SIN3A. Cross-referenced contents are connected with arrows and shaded triangles.

Second, by following the hyperlink associated with a specific R-loop regulator, users can be directed from the R-loop page to the regulator page (Figures 4B and 5C), which provides a regulator-centric view of R-loopBase annotations. In addition, ID-, species- and expression-based query systems were developed for searching R-loop regulators in specific species or with specific gene expression profiles (Figure 5E). For a given R-loop regulator, four categories of gene annotations are provided (Figure 5F), i.e., basic information from public databases, supporting evidence and functions in R-loop regulation, putative R-loop targets and gene expression profiles. We also implemented an interface for R-loopBase team members and users to independently update annotations for R-loop regulators to ensure accuracy and completeness (Figure 4B, top left). Regulator pages are in turn cross-referenced with R-loop pages and the R-loopBase genome browser. In particular, a regulator track showing DNA or RNA binding sites and regulated R-loop regions is provided on the R-loopBase genome browser. Such information may guide future investigations of the regulatory mechanisms of individual R-loop loci.
Finally, to support data analysis in batches, all search results can be directly downloaded as tables (Figures 5C and F). Alternatively, to support genome-wide data analysis, all R-loop zones, the list of regulators and their targeted R-loops can be downloaded from the R-loopBase download page or genome browser.

DISCUSSION

Rapid advances in R-loop research have tremendously broadened our knowledge of R-loop regulation and function, but have also led to confusion because of large discrepancies among the discoveries made with different R-loop mapping technologies. However, the massive set of R-loop mapping data has not been systematically collected and analyzed in a way that allows the community to easily access and compare them and thus take full advantage of the information they contain. We attempted to address this by developing R-loopBase. We integrated R-loop sequencing data in human cells generated by all 11 different technologies published to date, together with hundreds of R-loop-related genomics datasets. As a result, R-loopBase supports direct comparison among different R-loop mapping technologies and bridges gaps between mapped R-loops at individual genomic loci and their related sequence and biochemical properties. We have further defined confidence levels for mapped R-loops according to the number of technologies by which they are supported. This unique function of R-loopBase helps users select high-confidence R-loops for functional studies. Furthermore, the regulatory mechanism for R-loop homeostasis remains elusive. Through our systematic effort to collect, annotate and categorize 1293 potential R-loop regulators from the literature, we provide a global view of the R-loop regulatory network for the first time. These efforts not only allow users to access regulators with various types of supporting evidence or expression profiles for their own research, but also link them to regulatory mechanisms and disease relevance.

To the best of our knowledge, there are two other R-loop databases, i.e., R-loopDB (81) and R-loop Atlas (http://bioinfor.kib.ac.cn/R-loopAtlas). R-loopDB is a collection of predicted R-loop forming sequences in eight species, and of human and mouse R-loops mapped via DRIP-seq, DRIPc-seq or RDIP-seq.
R-loop Atlas is a specialized database for R-loops in Arabidopsis thaliana and Oryza sativa, which contains R-loops mapped mainly via ssDRIP-seq. A few key features distinguish R-loopBase from these two databases. First, R-loopBase has the most comprehensive collection of R-loops in human. R-loopBase integrates the R-loops predicted by R-loopDB, and archives human R-loop mapping datasets generated by all 11 different technologies. Second, sequence-based R-loop prediction may miss R-loops with atypical sequence features (Figure 2H), and different R-loop mapping technologies usually generate inconsistent R-loop maps (Figure 1). It is thus not straightforward to judge the reliability of a predicted or experimental R-loop from R-loopDB or R-loop Atlas. In contrast, R-loopBase presents a reference set of human R-loops and assigns them 9 different confidence levels based on integrative analysis of all R-loop mapping datasets. Moreover, R-loopBase integrates hundreds of R-loop-related genomics datasets, which collectively enable customized evaluation of the likelihood of R-loop formation at a specific genomic locus. Third, the R-loop regulome is missing from both R-loopDB and R-loop Atlas. R-loopBase fills this gap by collecting and annotating a complete list of known R-loop regulatory proteins and their targeted R-loops. Users can easily access these resources for mechanistic and disease studies from the perspective of R-loops.

Lastly, there is still room for further development of R-loopBase. The current release of R-loopBase is largely built on human R-loop data, mainly because only human cells have enough data for integrative analysis. However, since R-loops are conserved from bacteria to human, it would be important to include other species to study R-loops from an evolutionary point of view.
While we collected R-loop regulators for three more species in addition to human, a reference set of R-loops for each species will be analyzed and presented when more genome-wide R-loop mapping data and functional genomics data are available. More importantly, as R-loops are dynamically regulated in a variety of physiological processes important for development and disease progression, our approach of defining high-confidence R-loops may miss cell- or condition-specific R-loops under dynamic control. It is thus necessary to include R-loop dynamics data and to develop novel methods for precise R-loop mapping in the future. Of note, only a small fraction of R-loop regulators have been functionally validated, and an even smaller number of proteins have genomic target information available. More efforts are clearly needed to fully dissect the complex and region-specific mechanisms of R-loop regulation. With continuous updates, R-loopBase will become increasingly powerful as a one-stop interface to serve the community in the future.

DATA AVAILABILITY

All R-loopBase data are freely accessible using the URL: https://rloopbase.nju.edu.cn.
