Geir K Sandve, Sveinung Gundersen, Morten Johansen, Ingrid K Glad, Krishanthi Gunathasan, Lars Holden, Marit Holden, Knut Liestøl, Ståle Nygård, Vegard Nygaard, Jonas Paulsen, Halfdan Rydbeck, Kai Trengereid, Trevor Clancy, Finn Drabløs, Egil Ferkingstad, Matús Kalas, Tonje Lien, Morten B Rye, Arnoldo Frigessi, Eivind Hovig.
Abstract
The immense increase in availability of genomic scale datasets, such as those provided by the ENCODE and Roadmap Epigenomics projects, presents unprecedented opportunities for individual researchers to pose novel falsifiable biological questions. With this opportunity, however, researchers are faced with the challenge of how to best analyze and interpret their genome-scale datasets. A powerful way of representing genome-scale data is as feature-specific coordinates relative to reference genome assemblies, i.e. as genomic tracks. The Genomic HyperBrowser (http://hyperbrowser.uio.no) is an open-ended web server for the analysis of genomic track data. By providing several highly customizable components for processing and statistical analysis of genomic tracks, the HyperBrowser enables a range of genomic investigations related to, e.g., gene regulation, disease association or epigenetic modifications of the genome.
Year: 2013 PMID: 23632163 PMCID: PMC3692097 DOI: 10.1093/nar/gkt342
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Schematic overview of tool categories available at the Genomic HyperBrowser server. The figure indicates at which points of a typical analysis scenario the various tools may be of use, from the initial collection and preparation of data, through customization of data to match the analysis, to the statistical evaluation of a biological hypothesis. For boxes representing several tools, the precise list of tools can be found under the corresponding header in the table that is referred to (for instance, the two tools represented by the ‘Format and convert’ box can be found under the heading ‘Format and convert tracks’ of Table 3).
Tools for extracting genomic tracks from the HyperBrowser repository, customizing tracks into forms suitable for a subsequent analysis of interest, generating new tracks, and formatting and converting existing tracks
| Tool name | Description | Genomic example |
|---|---|---|
| HyperBrowser track repository | ||
| Extract track from HyperBrowser repository | Used to extract datasets from the track repository stored on the HyperBrowser server. Datasets can be extracted in a range of different formats, and from limited regions of the genome, if needed. Also, overlapping segments can be merged. | Extract the RefSeq gene track, in order to expand the gene segments with the ‘Expand BED segments’ tool. |
| Customize tracks | ||
| Expand BED segments | Allows extracting start-, mid- or endpoints of genomic intervals, as well as expanding either the original intervals or the extracted start-/end-/mid-points. This is useful in a variety of situations where an analysis of interest involves either proximity to or positioning relative to the original track elements, or where a size unification of track elements is desired (based on, e.g., taking midpoints and then expanding a certain distance). Also, if the expanded region crosses any chromosome borders, this is handled correctly. | An example of an analysis involving both proximity and relative positioning is the analysis of histone modification frequencies in bins of particular distances relative to the upstream end points of genes (transcription start sites). |
| Combine two BED files into single case–control track | Allows combining elements from two separate datasets into a single track where the elements are denoted as case (target) or control, depending on their source. This allows analyses of how other tracks preferentially interact with case elements as opposed to control elements. | An example is to combine chromatin states from two different cell types as case and control elements, in order to ask whether regions associated with MS susceptibility overlap more with case than control segments. See section ‘Full analysis scenario’. |
| Merge multiple BED files into single categorical track | Allows combining elements from multiple datasets into a single track, denoted with a category that reflects their source. | Merge segment tracks denoting, e.g., exons, introns and intergenic regions in order to create a category track spanning the whole genome. |
| Generate tracks | ||
| Generate bp-level track from DNA sequence | Supports a rich set of possibilities for constructing tracks based on the DNA sequence itself along a reference genome. | Construct a bp-level track of GC content in a sliding window of selectable size along the genome. |
| Generate bp-level track of distance to nearest segment | Allows the generation of tracks giving for each bp the distance (in bps) to the nearest element in any track. | Generate a bp-level track of distance to nearest gene. |
| Generate intensity track for confounder handling | Generates so-called ‘intensity tracks’ which are used in controlling for confounder tracks in particular analyses. The user selects a target track as well as a set of control tracks, i.e. a set of tracks whose influence on the target track one aims to control for. The generated intensity track defines, for each base pair, the probability that an element of the target track lands at that position during randomization. The intensity track can afterwards be selected as part of the null model specification when doing hypothesis testing through the ‘Analyze genomic tracks’ tool. | Can, e.g., be used to control for the influence of gene proximity when analyzing the relation between TF binding locations and active regions in a given cell type. |
| Generate k-mer occurrence track | Generates a global track of occurrence locations for a specified k-mer on a particular reference genome. | Generate a track of all occurrences of the 8-mer ‘ACGTTGCA’ in the human hg19 genome assembly. |
| Generate track of genes associated with literature terms (using Coremine) | Generates a track of gene segments along the human genome, where the genes are associated with one or more specified literature terms. The associations are provided by the CoreMine medical database, which is regularly updated with term-gene associations mined from published literature. | Find a set of genes associated with melanoma. Each gene will have an attached |
| Format and convert tracks | ||
| Convert between GTrack/BED/WIG/bedGraph/GFF/FASTA files | The most commonly used formats for genomic location data are (arguably) BED, bedGraph and WIG, defined by the UCSC Genome Browser, as well as GFF in various versions. The tool allows converting between these formats, to the degree they are able to represent the same information. The tool also allows converting data to and from GTrack, a recent, unified format that is capable of representing data of any track type, and thus data stemming from any of the other file formats ( | Convert a GTrack file to the BED format in order to use BED-specific Galaxy tools. |
| Create GTrack file from unstructured tabular data | The tool allows structuring unformatted tabular data into a GTrack file by specifying the necessary meta-data through simple selection boxes, inferring further properties of the data where possible. | Import virus integration sites of the Human Papilloma Virus (HPV) from an Excel spreadsheet into a GTrack file for further analysis by the ‘Analyze genomic tracks’ tool. |
Further descriptions are given at the web pages of the tools themselves, along with demo buttons and links to reproducible examples of how each tool can be used. The GTrack-related tools have previously been described (6).
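
As an illustration of the size unification described for the ‘Expand BED segments’ tool (taking midpoints and then expanding a certain distance), the following is a minimal Python sketch. It is not the HyperBrowser implementation: the names `Segment` and `expand_midpoints` are hypothetical, coordinates are assumed to follow the half-open BED convention, and clipping at chromosome ends is only one assumed way of handling expanded regions that cross chromosome borders.

```python
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # 0-based, exclusive (BED convention)


def expand_midpoints(segments: List[Segment], flank: int,
                     chrom_sizes: Dict[str, int]) -> List[Segment]:
    """Replace each segment by its midpoint, expanded by `flank` bp on both sides.

    Assumption: regions that would cross a chromosome border are clipped
    to the interval [0, chromosome length).
    """
    expanded = []
    for seg in segments:
        mid = (seg.start + seg.end) // 2
        start = max(0, mid - flank)
        end = min(chrom_sizes[seg.chrom], mid + flank + 1)
        expanded.append(Segment(seg.chrom, start, end))
    return expanded


# Toy input: two segments on a 1000 bp chromosome (made-up coordinates).
genes = [Segment("chr1", 100, 200), Segment("chr1", 950, 990)]
expanded = expand_midpoints(genes, flank=50, chrom_sizes={"chr1": 1000})
for seg in expanded:
    print(seg.chrom, seg.start, seg.end, sep="\t")
```

In this toy input, the second segment lies close to the chromosome end, so its expanded region is cut off at position 1000 rather than extending past the border.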
Selected descriptive statistics and hypothesis tests available through the ‘Analyze genomic tracks’ tool of the Genomic HyperBrowser
| Track1 type | Track2 type | Statistical investigation | Description |
|---|---|---|---|
| Descriptive statistics | |||
| P | | Counts | The number of track1-points |
| P | | Frequency | The frequency of track1-points |
| P | | Mean and variance of gaps | Mean and variance of gaps between track1-points |
| P | P | Frequency proportion | The proportion of all points (track1 and track2) arising from track1 |
| P | P | Point distances | The distribution of distances from each track1-point to the nearest track2-point |
| P | S | Count inside/outside | The number and proportion of track1-points inside and outside track2-segments |
| P | S | Matrix of count inside | The number of track1-points inside track2-segments, for all combinations of categories from both tracks |
| P | S | Relative position within segments | The average relative position of track1-points within track2-segments |
| P | S | Point to segment distances | The distribution of distances from each track1-point to the nearest track2-segment |
| S | | Bp coverage | The number of base pairs covered by track1 |
| S | | Proportional coverage | The proportion of total base pairs covered by track1 |
| S | | Avg. segment length | The average length of segments of track1 |
| S | | Segment lengths | The distribution of lengths of each track1-segment |
| S | S | Coverage | Base pair and proportional coverage by track1, track2 and by both |
| S | S | Enrichment | The enrichment of track1 inside track2 and vice versa, at the bp level |
| S | S | Segment distances | The distribution of distances from each track1-segment to the nearest track2-segment |
| F | | Mean | The mean value of track1 |
| F | | Sum | The sum of values of track1 |
| F | | Variance | The variance of values of track1 |
| F | | Min and max | The extreme values (min/max) of track1 |
| F | P | Mean at points | The mean value of track1 at positions of track2 |
| F | S | Mean inside and outside | The mean value of track1 inside track2 and outside track2 |
| F | F | CC | Pearson's correlation coefficient of track1 and track2 |
| VP | | Values | The distribution of values of track1-elements |
| VP | S | Values inside | The distribution of values of track1-elements inside track2-elements |
| VS (c/c) | P | Inside case versus control | The number of track2-points that fall inside track1-segments marked as case or control |
| VP (c/c) | VS (c/c) | Two-by-two table of inside | Two-by-two table of case/control track1-points that fall inside case/control track2-segments |
| VS (cat) | | Category bp coverage | The number of base pairs covered by each category of track1 |
| VS (cat) | | Category point count | The number of elements of each category of track1 |
| VP (cat) | VS (cat) | Contingency table of inside | Contingency table of categorical track1-points that fall inside categorical track2-segments |
| L | | Number of nodes and edges | The number of nodes and edges in track1 |
| L | | Number of neighbors | The distribution of the number of neighbors for each node in the graph (track1) |
| L (w) | | Edge weights | The distribution of weights for each edge of the graph (track1) |
| L (w) | | Clustered heatmap of graph | Clustered heatmap of weights of the graph (track1) |
| Hypothesis tests | |||
| P | P | Different frequencies? | Where is the relative frequency of points of track1 different from the relative frequency of points of track2, more than expected by chance? |
| P | P | Located nearby? | Are the points of track1 closer to the points of track2 than expected by chance? |
| P | S | Located inside? | Are the points of track1 falling inside the segments of track2, more than expected by chance? |
| P | S | Located non-uniformly inside? | Do the points of track1 tend to accumulate more toward the borders of the segments of track2? |
| P | S | Located nearby? | Are the points of track1 closer to the segments of track2 than expected by chance? |
| S | S | Similar segments? | Are track1-segments similar (in position and length) to track2-segments, more than expected by chance? |
| S | S | Overlap? | Are the segments of track1 overlapping the segments of track2, more than expected by chance? |
| S | S | Located nearby? | Are the segments of track1 closer to the segments of track2 than expected by chance? |
| F | F | Correlated? | Are the values of track1 and track2 more positively correlated than expected by chance? |
| P | F | Higher values at locations? | Are the values of track2 higher at the points of track1, than what is expected by chance? |
| S | F | Higher values inside? | Are the values of track2 higher inside the segments of track1, than what is expected by chance? |
| P | VS | Located in segments with high values? | Does the number of track1-points that fall in track2-segments depend on the value of track2-segments? |
| S | VP | Higher values inside segments? | Do the points of track2 that occur inside segments of track1 have higher values than points occurring outside the segments of track1? |
| VP | VP | Nearby values similar? | When track1-points and track2-points are near each other, are the values more similar than expected by chance? |
| P | VS (c/c) | Located in case segments? | Does the number of track1-points that fall in track2-segments depend on whether the track2-segments are marked as case or control? |
| VS (c/c) | S | Preferential overlap? | Are the segments of track1 marked as case overlapping unexpectedly more with the segments of track2 than the segments of track1 marked as control? |
| VP (cat) | VS (cat) | Category pairs differentially co-located? | Which categories of track1-points fall more inside which categories of track2-segments? |
| LGP | P | Co-localized in 3D? | Are the points of track2 closer in 3D (as defined by track1) than expected by chance? |
Each analysis is defined for either one or two tracks, with the corresponding track type denoted in the columns ‘Track1 type’ and ‘Track2 type’. The track type abbreviations, as defined in (6), are as follows: Points (P), Segments (S), Valued Points (VP), Valued Segments (VS), Function (F), Linked Genome Partition (LGP) and any Linked (L) track. In addition, attached values are: number (default), case/control (c/c), category (cat) and weighted edges (w). Most hypothesis tests are available in one- and two-sided versions. Looking at, e.g., overlap, the possible alternative hypotheses would then be whether the segments of track1 are overlapping the segments of track2, more, less or differently than expected by chance. Results of the analyses are given both at the global level and for local regions along the genome. A few of the hypothesis tests relating points and/or segments are also available in specific libraries (7,8), but only for certain null models. In addition, these libraries require low-level command-line access, API access or configuration file setup in order to start analyses.
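
To make the ‘Overlap?’ question for two segment tracks concrete, the sketch below computes the base-pair overlap and a one-sided Monte Carlo P-value in Python. It is an illustration under stated assumptions, not the HyperBrowser code: the function names are hypothetical, both tracks are assumed to lie on a single chromosome as sorted, non-overlapping half-open intervals, and the null model (keep track2 fixed, randomly reorder the segment lengths and the gap lengths of track1) is just one assumed choice among the possible null models.

```python
import random


def bp_overlap(a, b):
    """Base pairs covered by both tracks (sorted, non-overlapping (start, end) lists)."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def shuffle_track(track, chrom_len, rng):
    """Assumed null model: keep segment and gap lengths, randomize their order."""
    lengths = [end - start for start, end in track]
    # Gaps before the first segment, between segments, and after the last one.
    bounds = [0] + [x for seg in track for x in seg] + [chrom_len]
    gaps = [bounds[k + 1] - bounds[k] for k in range(0, len(bounds), 2)]
    rng.shuffle(lengths)
    rng.shuffle(gaps)
    shuffled, pos = [], 0
    for gap, length in zip(gaps, lengths):  # the last gap is the remainder
        pos += gap
        shuffled.append((pos, pos + length))
        pos += length
    return shuffled


def overlap_p_value(track1, track2, chrom_len, n_samples=1000, seed=0):
    """One-sided Monte Carlo P-value for 'more overlap than expected by chance'."""
    rng = random.Random(seed)
    observed = bp_overlap(track1, track2)
    hits = sum(
        bp_overlap(shuffle_track(track1, chrom_len, rng), track2) >= observed
        for _ in range(n_samples)
    )
    return observed, (hits + 1) / (n_samples + 1)


# Toy tracks on a 1000 bp chromosome (made-up coordinates).
track1 = [(100, 200), (400, 500)]
track2 = [(150, 250), (450, 600)]
observed, p_value = overlap_p_value(track1, track2, chrom_len=1000)
print(observed, p_value)
```

Replacing `>=` with `<=` in `overlap_p_value` would give the opposite one-sided alternative (less overlap than expected by chance).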
Figure 2. Screenshots of the web interface and results page for the ‘Analyze genomic tracks’ tool. (A) Input data, analyses of interest, and analysis parameters are precisely specified through a set of selection boxes. (B) The result page provides a main conclusion from the statistical test, as well as a range of details that can be inspected by following various links from the main results page.
Tools for statistical, visual and specialized analyses of genomic tracks
| Tool name | Description | Genomic example |
|---|---|---|
| Analyze genomic tracks | The main analysis interface of the Genomic HyperBrowser ( | Analyze cell-specificity of active chromatin in disease regions, as described in section ‘Full analysis scenario’. |
| Visualize track elements relative to anchor regions | Allows visualization of the distribution of track elements along chromosomes, or along custom-specified bins. The specified regions are displayed vertically, in order to simplify visual comparison. | Visualize the detailed positioning of histone modifications relative to the TSS of a selected set of gene regions. |
| Create high-resolution map of track distribution along genome | Visualizing track elements along a line, such as in the UCSC genome browser or the relative positioning visualization tool, can necessarily only offer a global overview at a very limited resolution. This tool instead uses a fractal layout of the genome line (similar to Hilbert curve ( | Visualize the genome-wide distribution of a densely populated track, such as repeating elements or a DNase accessibility experiment. |
| Create high-resolution map of multiple track distributions along genome | Similar to the one-track version above, but uses up to three separate color channels (red,green,blue) to visualize the presence of up to three different tracks in corresponding parts of the genome by combining their color channel values at individual pixels. | Visualize the comparative distribution of DNase accessibility in three different cell types to see patterns of similar and distinct accessibility. |
| Visualize relation between two tracks across genomic regions | Used to reveal complex relations between tracks along the genome. For each defined analysis region (bin), a score is calculated for both tracks, using the specified summarizing function. The resulting (x,y) scores are then visualized as a single point in a scatter plot. | Plot exon density versus average melting temperature in 10 Mbp bins along the genome. |
| Aggregation plot of track elements relative to anchor regions | Used to reveal trends of how track elements are distributed relative to a set of anchor regions (bins). All anchor regions are divided into the same number of sub-bins, and a summary statistic is calculated for each sub-bin and averaged across all anchor regions. The tool returns a plot of the average values with 95% confidence intervals. | Positions of histone modifications around TSS. |
| Analyze co-localization of input genomic regions | Analyze a selected track of genome locations for spatial co-localization with respect to the three-dimensional structure of the genome, as defined using results from recent Hi-C experiments. The Hi-C data have been corrected for bias using a method presented in a recent paper ( | Analyze whether somatic mutations in cancer are co-localized in 3D in a relevant cell type. |
| Perform clustering of genomic tracks | Used to investigate relations between multiple tracks in an unsupervised manner (manuscript submitted). This tool allows an essentially unlimited number of tracks to be selected, and further allows the distance measure to be used for the clustering to be precisely specified through selection among a varied set of notions of track similarity. | Analyze similarities between histone modifications in different cell types. |
| Analyze k-mer occurrences | Used to analyze a global track of occurrence locations for a specified k-mer from a particular reference genome. All relevant analyses in the ‘Analyze genomic tracks’ tool can be used. | Analyze correlation of a specific k-mer with other tracks, e.g. genes, in order to find functional significance. |
| Inspect k-mer frequency variation | Used to calculate and visualize the frequency distribution of a particular k-mer along a genome reference. Splits the selected analysis regions (e.g. chromosomes) into a suitable number of subregions (bins). For each bin, the number of occurrences of the selected k-mer is counted and plotted. | Inspect the frequency variation of a particular k-mer along the genome. |
Further descriptions are given at the web pages of the tools themselves, along with demo buttons and links to reproducible examples of how each tool can be used. The ‘Analyze genomic tracks’ tool has previously been described (4).
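
The computation behind an aggregation plot, as described in the table above, can be sketched in a few lines of Python. This is a simplified illustration, not the tool's implementation: the function name `aggregate_point_counts` is hypothetical, the summary statistic is assumed to be a plain count of points per sub-bin, all positions are assumed to be on one chromosome, strand orientation is ignored, and the 95% confidence intervals reported by the tool are left out.

```python
def aggregate_point_counts(points, anchors, n_subbins):
    """Mean number of points per sub-bin, averaged across all anchor regions.

    points  : positions of track elements (assumed to be single base pairs)
    anchors : half-open (start, end) anchor regions, possibly of unequal length
    """
    totals = [0] * n_subbins
    for start, end in anchors:
        width = end - start
        for p in points:
            if start <= p < end:
                # Relative position within the anchor, scaled to a sub-bin index.
                idx = (p - start) * n_subbins // width
                totals[idx] += 1
    return [t / len(anchors) for t in totals]


# Toy data: two anchor regions of different length, four sub-bins each.
anchors = [(0, 100), (200, 400)]
points = [10, 55, 210, 390]
profile = aggregate_point_counts(points, anchors, n_subbins=4)
print(profile)  # [1.0, 0.0, 0.5, 0.5]
```

Because every anchor region is divided into the same number of sub-bins, regions of different lengths contribute to the same relative positions in the averaged profile.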