| Literature DB >> 30307532 |
Chakravarthi Kanduri1,2, Christoph Bock3,4,5, Sveinung Gundersen1,6, Eivind Hovig1,6,7,8, Geir Kjetil Sandve1,2.
Abstract
MOTIVATION: Many high-throughput methods produce sets of genomic regions as one of their main outputs. Scientists often use genomic colocalization analysis to interpret such region sets, for example to identify interesting enrichments and to understand the interplay between the underlying biological processes. Although widely used, there is little standardization in how these analyses are performed. Different practices can substantially affect the conclusions of colocalization analyses.Entities:
Mesh:
Year: 2019 PMID: 30307532 PMCID: PMC6499241 DOI: 10.1093/bioinformatics/bty835
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1.(a) Examples of coordinates of a sequence of nucleotides on zero-based and one-based genome coordinate systems. The brackets represent closed while parentheses represent open intervals. Being closed at a position represents the inclusion of that position in the genomic interval, whereas being open represents the exclusion of that position. (b) Example of discretizing continuous value to call genomic intervals. Here, although both profile-A and profile-B look visually similar, one of the genomic intervals in profile-B was not called because of marginally falling below a chosen threshold or owing to algorithm parameters, resulting in the exclusion of that region from further analyses. (c) Example of variations when computing distances from start, midpoint or end coordinates. Here, although both the points are 1 kb upstream of the genomic intervals, because of the differences in the length of genomic intervals the distances largely differ (here almost 2-fold) when computed to the midpoint or end
Fig. 2.Examples of the distributional properties and dependence structure of genomic features along the genomic sequence. Genomic tracks contain genomic regions that are known to occur in clumps and with variable lengths. Also, genomic sequences could be characterized by the distributional and biological properties of genomic events, where stretches of sequence share similar biological properties (as homogeneous blocks). Furthermore, multiple genomic annotations colocalize with each other and thus any statistical association should disentangle the effect mediated by the colocalization of confounding features
Fig. 3.The statistical methods for colocalization analysis can broadly be categorized into two types depending upon whether they use (a) a general null model of colocalization (b) or a specifically selected set of universe regions to estimate a null model (upper and lower panels). Both methods are further categorized as either (i) analytical tests (ii) or tests based on Monte Carlo simulations (left and right panels). Upper left: analytical tests with a general null model is exemplified by a binomial test on whether Track 1 (T1) positions are located inside Track 2 (T2) regions more than expected by chance. Upper right: General Monte Carlo-based tests provide great flexibility in the choice of test statistics and randomization strategies (null model). Here, exemplified by a simple overlap statistic (bps overlap) and uniform randomization of T2. Lower left: A set of universe regions limits the analysis to (here) the ‘case’ regions of T1 and a set of ‘control’ regions that could have been part of T1. Simple counting of the overlapping regions in T1 and T2 provides the basis of a Fisher's exact test. Lower right: The combination of universe regions and Monte Carlo-based testing is not previously presented in literature, but might be designed to combine advantages of them both. For a detailed overview of the null model categories, see Box 1
Fig. 4.Examples of different permutation strategies in Monte Carlo Null models. (a) In this illustration, let us assume that the dependence relationship between track-A and track-B is being queried. Note the local heterogeneities within the genomic region, where blue and green segments represent blocks of locally homogenous regions. The red segment represents assembly gaps. Tracks A and B are comprised of points and genomic intervals respectively, and different colors are used to distinguish them. (b) The simplest form of permutation model assumes uniformity and independence of genomic locations and thus shuffles either of the track without any restrictions. Note here that one of the points was also shuffled to an assembly gap region. (c) Another null model preserves the sequence distance between the points or genomic intervals when shuffling and avoids gaps. (d) A more conservative strategy preserves the sequence distance, while also shuffling to regions matched by biological properties (blue and green colors) thus preserving local heterogeneity