Hai Fang1, Kankan Wang2. 1. State Key Laboratory of Medical Genomics and Shanghai Institute of Hematology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China; Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK. Electronic address: hfang@well.ox.ac.uk. 2. State Key Laboratory of Medical Genomics and Shanghai Institute of Hematology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China. Electronic address: kankanwang@shsmu.edu.cn.
Abstract
A regularly shaped grid is useful for analyzing data particularly at multilayer levels, where patterns can be visually represented and analytically compared-conceptually similar to Picasso's cubism. Here we introduce ATLAS, featuring a suite of spatially ordered maps designed for representation and comparison of patterns seen in regulatory genomic data. It produces a landscape learned from input data and enables landscape-guided correlation with additional data. We illustrate its use for multilayer data comparison on the same cell type, and for comparisons involving different cell types, revealing information in a scientifically insightful and also visually intuitive way. The data-driven and visual-aided ability of ATLAS presents a general strategy for regulatory genomic data analysis.
A regularly shaped grid is useful for analyzing data particularly at multilayer levels, where patterns can be visually represented and analytically compared-conceptually similar to Picasso's cubism. Here we introduce ATLAS, featuring a suite of spatially ordered maps designed for representation and comparison of patterns seen in regulatory genomic data. It produces a landscape learned from input data and enables landscape-guided correlation with additional data. We illustrate its use for multilayer data comparison on the same cell type, and for comparisons involving different cell types, revealing information in a scientifically insightful and also visually intuitive way. The data-driven and visual-aided ability of ATLAS presents a general strategy for regulatory genomic data analysis.
Now we are in the era of doing genomic data science; an important element of this concept is how to represent and model genomic data in a scientifically sound and also visually arresting way. Identifying and comparing the occurrence of patterns hidden in genomic data is the very first challenge to data scientists (Marx, 2013). How to intuitively characterize regulatory genomic data? How to analytically compare data at the multilayer levels? Could such characterizations and comparisons be made at the target gene level? We address these challenges by considering routinely generated types of regulatory genomic data. Regulatory genomic data are essentially in the form of non-coding genomic regions associated with signals thereof, such as transcription factor (TF) binding and DNase I hypersensitivity sites (DHSs), that are profiled in the same cell type or the same data type produced across cell types (The ENCODE Project Consortium, 2012).Data-driven models are needed for achieving effective representation, comparison, and exploration of data. Picasso's cubist painting is a representation of the natural forms reduced into basic geometric regulars on the 2D map, known as “cubism”. The use of a regularly shaped grid to “see” (model) data is attractive. We human beings are good at perceiving and comparing regular grids such as hexagons, and rightly using them could provide an information-rich overview of data. Designing such grids, however, is not trivial; a priori knowledge of data structure is usually unknown or limited, and the need for rich regular shapes should be met.In addition to regular grids, models that are able to learn data are desirable. A self-organizing learning algorithm is a special type of artificial intelligence (AI), suitable for this purpose (Zhang and Fang, 2012). We previously described such model based on a supra-hexagonal map (Fang and Gough, 2014a), successfully applied to genomic data analysis (Allman et al., 2016, Jiang et al., 2016). The map grid of this shape, however, is not without limitations; the underlying structure of data is not necessarily distributed as a radial symmetry, and this limitation should be overcome.We have created a system to realize regulatory genomic data cubism, designed for the self-organized representation and comparison of the occurrence of the patterns seen in regulatory genomic data. We refer to such system as a taught landscape analytic system, or called the “ATLAS”: taught, because of the data-driven ability in a self-organizing manner; landscape, because of a global view in a visual-friendly way; and analytic, because of the support for quantitative analyses. This system is also highlighted by the support for linking regulatory elements to target genes and the support for correlation analysis involving data at multilayer levels. We demonstrate the use by showing how to correlate multilayer regulatory genomic data in a leukemia cell type, how to correlate the same layered genomic data involving two cell types, and how to link TFs to datasets generated from clustered regularly interspaced short palindromic repeats (CRISPR). In a wider context, ATLAS adds a new strategy to the repository of AI applied to the big regulatory genomic data analysis.
Results
Realizing Cubism by Creating a Suite of Spatially Ordered 2D Maps
We devised a range of spatially ordered 2D maps, all designed on the basis of architecture used in a supra-hexagonal map. Illustrated in Figure 1A is the supra-hexagonal map with nodes indexed in a way that they radiate circularly outward. Its map variants (Figures 1B–1I) are all indexed in the same way. We achieved this by designing, for example, the ladder-shaped map (Figure 1H) derived from the top half part of the supra-hexagonal map (Figure 1A). In addition to the same architectural design, we modeled maps of all shapes following the same principle: learned from input data using the self-organizing learning algorithm (Transparent Methods). In brief, the learning achieves the conversion from the input data matrix to the codebook matrix associated with the learned map, constrained by the map shape. The learning process consists of (1) the choice of the map shape, empirically determined according to the input data (that is, informed by visualizing the input data onto a 2D hyperspace by principle component analysis [PCA]) and (2) the learning of the map, automatically achieved (that is, iteratively identifying the winner node and updating its neighbors). The process is largely data-driven, although the user can explicitly choose map shapes, if a priori knowledge strongly suggests doing so. The learned map (the codebook matrix) is visualized as the landscape and is fused with additional data for correlation analysis (Figure 1J). In summary, we have provided wide choices of 2D regular maps (“cubism”), enabling input data to be represented on a case-by-case basis, and subsequently effective comparison with additional data (usually another layered data).
Figure 1
Realization of Cubism by Creating a Suite of Spatially Ordered 2D Maps
(A) A supra-hexagonal map. Uniquely determined by the radius, node indexed and expanding outward (colored alternatively in yellow and blue).
(B–I) Map variants sharing the basis of architectural design. They all have the same radius and nodes indexed in the same way, including a ring-shaped map (B), a diamond map (C), a trefoil map (D), a butterfly map (E), an hourglass map (F), a ladder map (G), a bridge map (H), and a triangle-shaped map (I).
(J) Backbone of analysis: a landscape learned from input data and then the landscape-guided correlation with additional data.
Realization of Cubism by Creating a Suite of Spatially Ordered 2D Maps(A) A supra-hexagonal map. Uniquely determined by the radius, node indexed and expanding outward (colored alternatively in yellow and blue).(B–I) Map variants sharing the basis of architectural design. They all have the same radius and nodes indexed in the same way, including a ring-shaped map (B), a diamond map (C), a trefoil map (D), a butterfly map (E), an hourglass map (F), a ladder map (G), a bridge map (H), and a triangle-shaped map (I).(J) Backbone of analysis: a landscape learned from input data and then the landscape-guided correlation with additional data.
Comparing Regulatory Genomics Involving Multiple Data Types and/or between Different Cell Types
In this section, we describe a typical procedure used in ATLAS to carry out integrated tasks analyzing regulatory genomic data (Figure 2), including (1) how to select a map tailored to and learned by the input data and (2) how to use the learned map (visualized as a landscape) as a scaffold/reference for correlating with additional data. We mainly focus on two leukemia cell types: K562 (a chronic myeloid leukemia cell line) and GM12878 (a lymphoblastoid cell line); a total of 51 TFs common to both cell types were assayed (The ENCODE Project Consortium, 2012). Together with TF binding data, we also include DHS data in K562 (Thurman et al., 2012) for comparisons involving multiple data types produced in the same cell type. Taking into account enhancer-target knowledge (Fishilevich et al., 2017) (Figure 2A), we first calculated gene-centric association scores from genome-wide TF binding sites or DHSs, with higher scores implying higher chance of a gene (in rows) being targeted by a TF or a DHS (in columns), and then used these gene-centric scores for the learning and fusion (Figure 2B; Transparent Methods). The correlation was estimated based on the learned and fused map (Figure 2C).
Figure 2
Overview of How to Correlate TFs and/or DHSs
(A) Association score for genes calculated from an input list of TF binding sites (or DHSs) by utilizing unified enhancer-target knowledge.
(B) The learning and fusing of regulatory genomic data.
(C) Correlation involving multilayer data types (TFs and DHSs) in the same cell type and/or involving two cell types (K562 and GM12878).
Overview of How to Correlate TFs and/or DHSs(A) Association score for genes calculated from an input list of TF binding sites (or DHSs) by utilizing unified enhancer-target knowledge.(B) The learning and fusing of regulatory genomic data.(C) Correlation involving multilayer data types (TFs and DHSs) in the same cell type and/or involving two cell types (K562 and GM12878).
A Ladder-Shaped Map Models TF Targeting in K562 and Defines Inter-TF Taxonomy
Based on the calculated gene-centric association scores for 51 TFs in K562 (Table S1), we observed a ladder-like distribution of target genes (Figure S1) and hence chose a ladder-shaped map (Figure 1H) for the learning. We visualized the learned map as the TF landscape (Figure 3A) and built a TF tree using a neighbor-joining algorithm (Paradis et al., 2004) (Figure S2). The basis of inter-TF taxonomy becomes much clearer when coupled with the TF landscape visualization, indicative of co-occupancy on putative target genes. For example, binding/targeting profiles of seven TFs (YY1, ELF1, EGR1, MAZ, POLR2A, MYC, and MAX) were similar both in the number and strength of target genes, being grouped together in the tree; they are all essential for the basic transcriptional regulation in K562.
Figure 3
Correlating TFs and/or DHSs in K562
(A) TF landscape in K562. The landscape visualized based on a ladder-shaped map learned from gene-centric TF association scores. The color bar represents gene association scores, with the red for the highest score and the green for the lowest.
(B) The DHS map produced by fusing the DHS data onto the learned TF map.
(C) The polar plot of TFs ranked by correlation to DHS.
See also Figures S1 and S2, and Tables S1 and S2.
Correlating TFs and/or DHSs in K562(A) TF landscape in K562. The landscape visualized based on a ladder-shaped map learned from gene-centric TF association scores. The color bar represents gene association scores, with the red for the highest score and the green for the lowest.(B) The DHS map produced by fusing the DHS data onto the learned TF map.(C) The polar plot of TFs ranked by correlation to DHS.See also Figures S1 and S2, and Tables S1 and S2.
K562 TF Map Is Fused with Additional DHS Targeting Data for Multilayer Data Comparison in the Same Cell Type
Next, we calculated the DHS-derived gene-centric association scores (Figure 3A and Table S2) and fused DHS targeting data onto the learned map, producing a fused DHS map (Figure 3B). Based on the degree of similarity in target gene profiles for DHSs, TFs were ranked with the top two, MAX and MYC (Pearson's correlation >0.75 in Figure 3C). Both factors form the dimers as a transcriptional activator binding to the E box; how they select target genes largely depends on the chromatin context (Sabò and Amati, 2014). The other top TFs are SPI1 (also known as PU.1), which acts as a chromatin accessibility factor (Marecki et al., 2004), and JUND, a functional component of the AP1 complex that has been shown to potentiate chromatin accessibility (Biddie et al., 2011). Multilayer data comparison revealed relationships between the transcription factor binding events and chromatin accessibility in terms of targeting potential.
K562 TF Map Is Used for Comparison Involving Two Cell Types
A more interesting case of comparison involves two cell types having the same data type. For this, we fused the K562 TF learned map with the TF binding/targeting data in GM12878 (that is, gene-centric association scores for TFs in GM12878; Table S3), producing a fused TF map in GM12878. The per-TF map comparisons between the two are illustrated in Figure 4A, with correlations shown in Figure 4B. Given a wide range of correlations observed across TFs, we hypothesized that the differences in the structural characteristics of TFs might explain the cellular basis of differential binding/targeting events. To test this hypothesis, we used the dnet package (Fang and Gough, 2014b) to perform enrichment analysis for 25 highly correlated TFs versus 25 lowly correlated ones using SCOP structural domains (de Lima Morais et al., 2011), revealing the distinctive protein domain compositions (Figure 4C). TFs with high correlation tend to contain winged helix DNA binding domains (false discover rate [FDR] = 1.5×10−9; E2F4, ELF1, ELK1, ETS1, RAD21, RFX5, and SPI1) and beta-beta-alpha zinc finger domains (FDR = 1.2×10−4; CTCF, EGR1, MAZ, SP1, ZBTB33, and ZNF143), with their domain architectures (obtained via the dcGO Predictor Batch Query [Fang and Gough, 2013]) illustrated in Figure 4D. By contrast, TFs with low correlation are unique in leucine zipper domains (FDR = 6.7×10−10; ATF3, CEBPB, FOS, JUND, and NFE2) and helix-loop-helix DNA binding domains (FDR = 3.9×10−8; BHLHE40, MAX, MYC, USF1, and USF2) (Figures 4C and 4E).
Figure 4
Correlating TFs between K562 and GM12878
(A) Side-by-side comparisons of TFs. Fusion of the learned K562 TF map with the TF data in GM12878 produces the GM12878 TF map.
(B) TFs ranked by correlation between K562 and GM12878; illustrated in the polar correlation plot of TFs.
(C) Protein domains enriched in top-ranked TFs and least-ranked TFs. Odds ratio (and 95% confidence interval) based on Fisher's exact test.
(D and E) Protein domain architectures for TFs. Domain architecture for a TF is based on the longest protein sequence (represented by UniProt ID). Highly correlated TFs (D) and lowly correlated TFs (E).
See also Table S3.
Correlating TFs between K562 and GM12878(A) Side-by-side comparisons of TFs. Fusion of the learned K562 TF map with the TF data in GM12878 produces the GM12878 TF map.(B) TFs ranked by correlation between K562 and GM12878; illustrated in the polar correlation plot of TFs.(C) Protein domains enriched in top-ranked TFs and least-ranked TFs. Odds ratio (and 95% confidence interval) based on Fisher's exact test.(D and E) Protein domain architectures for TFs. Domain architecture for a TF is based on the longest protein sequence (represented by UniProt ID). Highly correlated TFs (D) and lowly correlated TFs (E).See also Table S3.
Linking TFs to CRISPR, Both in K562
In this section, we consider all 100 TFs available in K562 (The ENCODE Project Consortium, 2012) and demonstrate that linking TFs to CRISPR reveals a greater number TF binding/targeting events for essential genes. The CRISPR-based screen identified genes required for survival in K562 (Wang et al., 2015). The CRISPR score measures the essential genes; the lower the score, the higher the fitness cost imposed by gene inactivation (that is, the more essential genes are). Following the process outline in Figure 5A, we first produced the binding/targeting landscape of TFs (Figure 5B) and ranked them according to correlation with the fused CRISPR map (Figures 5C and 5D). Next, we implemented a region-growing algorithm to partition the learned map into gene clusters, the number of gene clusters determined based on the distance matrix of the map nodes and each cluster covering continuous regions (Transparent Methods). In doing so, we obtained 11 gene clusters (C1-C11; Figure 5E and Table S4). For each cluster, we subsequently summarized the CRISPR gene essential scores (Figure 5F). We observed that essential genes with the lowest CRISPR scores (C1) have greater number of TF binding events (Figure 5G) and that the TF binding events are rarely seen for genes with the highest CRISPR scores (C6; Figure 5G). Through enrichment analysis using Reactome pathways (Fabregat et al., 2018), we also found enrichment of essential biological processes/pathways, such as translation and nonsense-mediated decay in C1, and no enrichment in C6 (Figure S4).
Figure 5
Linking TFs to CRISPR in K562
(A) Schematic overview for correlating TF binding/targeting with CRISPR screens in K562. Calculating gene-centric association scores per TF is illustrated in Figure 2A. Also illustrated are identification and enrichment analysis of gene clusters.
(B) TF landscape visualized based on a triangle-shaped map learned from gene-centric TF association scores in K562.
(C) The polar correlation plot of TFs ranked by correlation to CRISPR.
(D) CRISPR map produced by fusing the CRISPR data onto the learned TF map.
(E) Gene clusters identified from the learned TF map. Clusters are color coded and labeled.
(F) Summary of CRISPR gene essential scores per cluster; valued is the per-cluster average.
(G) Heatmap of gene clusters.
See also Figure S4 and Table S4.
Linking TFs to CRISPR in K562(A) Schematic overview for correlating TF binding/targeting with CRISPR screens in K562. Calculating gene-centric association scores per TF is illustrated in Figure 2A. Also illustrated are identification and enrichment analysis of gene clusters.(B) TF landscape visualized based on a triangle-shaped map learned from gene-centric TF association scores in K562.(C) The polar correlation plot of TFs ranked by correlation to CRISPR.(D) CRISPR map produced by fusing the CRISPR data onto the learned TF map.(E) Gene clusters identified from the learned TF map. Clusters are color coded and labeled.(F) Summary of CRISPR gene essential scores per cluster; valued is the per-cluster average.(G) Heatmap of gene clusters.See also Figure S4 and Table S4.
Discussion
When Cubism Meets Genomics
The timing is right in the era of doing genomic data science. A demanding issue in doing big genomic data science is how to identify and compare the occurrence of the patterns from multidimensional and multiparametric data. These data come in a heterogeneous form, making it difficult to integrate information of different types and sources. We address this issue following the Picasso's cubism philosophy. We devise a suite of maps that are able to capture a wide range of data shapes under a single framework. We suggest the mapping of the input data onto a 2D hyperspace oriented along the first two axes, for example, identified by PCA, and the resulting data point cloud directs the choice of the map shape (as illustrated in Figures S1 and S3). We recognize the possibility of such empirical observation resulting in no match with any shape supported currently (shown in Figures 1A–1I); in this situation, it is advisable to consider the supra-hexagonal map owing to its perfect symmetry. In addition to the shape, the maps also have AI allowing for data modeling in a self-organizing manner.
Challenges and Opportunities of Regulatory Genomics
In the human body, connections between gene promoters and regulatory elements control cell-type specificity. This context-specific control requires mapping of such connections for every cell type. Experimentally, it poses great challenges to data generators. When compared with identifying chromatin interactions (Javierre et al., 2016), technologies make it much easier to generate (have generated) regulatory elements such as TF binding sites and DHSs for most cell types. Computationally, it is achievable by integrating available context-specific regulatory elements and their target genes into a less context-specific scaffold; one such effort is to assimilate knowledge of enhancer-target connections (Fishilevich et al., 2017). Building on such computational opportunities and also as a proof of principle, we have utilized the enhancer-target knowledge combined with context-specific regulatory genomic data to estimate gene-centric targeting profiles of TFs and DHSs. The self-organized modeling and representation of gene targeting profiles between TFs and/or DHSs make it straightforward for effective comparisons.
Analytical Advantages of ATLAS
ATLAS is the first realization of analytical cubism in regulatory genomics. More importantly, it enables comparisons, both visually intuitive and scientifically insightful. We demonstrate the use of ATLAS to compare regulatory genomic data of different types and involving different cell types. In the illustrated use cases, we show that modeling TF binding/targeting within a cell type defines inter-TF taxonomy; comparing multilayer regulatory data in the same cell type reveals the targeting relationships between the TF binding events and chromatin accessibility; and linking TFs to CRISPR identifies many TF binding events for essential genes. All these are achieved in a transparent way, rather than in the black box. The added value of the landscape-guided correlation is that such correlation can be intuitively visualized, for example, the correlation shown in Figures 4A and 4B. This value is useful particularly at the exploratory stage of multilayer genomic data analysis. Furthermore, reproducible use cases with 2D visuals enable effective data-driven and visual-aided exploratory analysis.
Future Directions of ATLAS
Other than the analytical powers currently supported by ATLAS, we plan to build a resource connecting regulatory elements to putative target genes, under contexts of different levels (from the generic to the system-/organ-specific and to the lineage-specific level). Future efforts will also focus on a user-friendly web interface targeting users who are less familiar with the R environment. We anticipate that the self-organized representation and comparison offered by ATLAS will be of great use at the exploratory stage of regulatory genomic data analysis. We also anticipate that ATLAS, freely available to and reproducible by the scientific community, will aid in the use of big data produced from genomics consortia to address the big questions.
Methods
All methods can be found in the accompanying Transparent Methods supplemental file.
Authors: David A de Lima Morais; Hai Fang; Owen J L Rackham; Derek Wilson; Ralph Pethica; Cyrus Chothia; Julian Gough Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971
Authors: Lulu Jiang; Charles C T Hindmarch; Mark Rogers; Colin Campbell; Christy Waterfall; Jane Coghill; Peter W Mathieson; Gavin I Welsh Journal: Sci Rep Date: 2016-10-24 Impact factor: 4.379
Authors: Biola M Javierre; Oliver S Burren; Steven P Wilder; Roman Kreuzhuber; Steven M Hill; Sven Sewitz; Jonathan Cairns; Steven W Wingett; Csilla Várnai; Michiel J Thiecke; Frances Burden; Samantha Farrow; Antony J Cutler; Karola Rehnström; Kate Downes; Luigi Grassi; Myrto Kostadima; Paula Freire-Pritchett; Fan Wang; Hendrik G Stunnenberg; John A Todd; Daniel R Zerbino; Oliver Stegle; Willem H Ouwehand; Mattia Frontini; Chris Wallace; Mikhail Spivakov; Peter Fraser Journal: Cell Date: 2016-11-17 Impact factor: 41.582
Authors: Robert E Thurman; Eric Rynes; Richard Humbert; Jeff Vierstra; Matthew T Maurano; Eric Haugen; Nathan C Sheffield; Andrew B Stergachis; Hao Wang; Benjamin Vernot; Kavita Garg; Sam John; Richard Sandstrom; Daniel Bates; Lisa Boatman; Theresa K Canfield; Morgan Diegel; Douglas Dunn; Abigail K Ebersol; Tristan Frum; Erika Giste; Audra K Johnson; Ericka M Johnson; Tanya Kutyavin; Bryan Lajoie; Bum-Kyu Lee; Kristen Lee; Darin London; Dimitra Lotakis; Shane Neph; Fidencio Neri; Eric D Nguyen; Hongzhu Qu; Alex P Reynolds; Vaughn Roach; Alexias Safi; Minerva E Sanchez; Amartya Sanyal; Anthony Shafer; Jeremy M Simon; Lingyun Song; Shinny Vong; Molly Weaver; Yongqi Yan; Zhancheng Zhang; Zhuzhu Zhang; Boris Lenhard; Muneesh Tewari; Michael O Dorschner; R Scott Hansen; Patrick A Navas; George Stamatoyannopoulos; Vishwanath R Iyer; Jason D Lieb; Shamil R Sunyaev; Joshua M Akey; Peter J Sabo; Rajinder Kaul; Terrence S Furey; Job Dekker; Gregory E Crawford; John A Stamatoyannopoulos Journal: Nature Date: 2012-09-06 Impact factor: 49.962