Fast randomization of large genomic datasets while preserving alteration counts.

Andrea Gobbi¹, Francesco Iorio², Kevin J Dawson¹, David C Wedge¹, David Tamborero¹, Ludmil B Alexandrov¹, Nuria Lopez-Bigas¹, Mathew J Garnett¹, Giuseppe Jurman¹, Julio Saez-Rodriguez¹.

Abstract

MOTIVATION: Studying combinatorial patterns in cancer genomic datasets has recently emerged as a tool for identifying novel cancer driver networks. Approaches have been devised to quantify, for example, the tendency of a set of genes to be mutated in a 'mutually exclusive' manner. The significance of the proposed metrics is usually evaluated by computing P-values under appropriate null models. To this end, a Monte Carlo method (the switching-algorithm) is used to sample simulated datasets under a null model that preserves patient- and gene-wise mutation rates. In this method, a genomic dataset is represented as a bipartite network, to which Markov chain updates (switching-steps) are applied. These steps modify the network topology, and a minimal number of them must be executed to draw simulated datasets independently under the null model. This number has previously been deduced empirically to be a linear function of the total number of variants, making this process computationally expensive.
RESULTS: We present a novel approximate lower bound for the number of switching-steps, derived analytically. Additionally, we have developed the R package BiRewire, including new efficient implementations of the switching-algorithm. We illustrate the performance of BiRewire by applying it to large real cancer genomics datasets. We report vast reductions in time requirements with respect to existing implementations and bounds, with equivalent P-value computations. Thus, we propose BiRewire to study statistical properties in genomic datasets, and other data that can be modeled as bipartite networks.
AVAILABILITY AND IMPLEMENTATION: BiRewire is available on BioConductor at http://www.bioconductor.org/packages/2.13/bioc/html/BiRewire.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2014 PMID: 25161255 PMCID: PMC4147926 DOI: 10.1093/bioinformatics/btu474

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

In the past few years, next-generation sequencing (NGS) technologies have progressed swiftly, and currently hundreds of genomes can be simultaneously sequenced in a matter of weeks, at more affordable costs. This opens a wide range of new avenues in biological and biomedical research. In particular, because of the established impact of the genomic background on disease progression and response to drug treatment, cancer research has significantly benefited from these advances. Comprehensive catalogues of mutations in multiple cancer types have been assembled and fruitfully used to identify new diagnostic, prognostic and therapeutic targets (Barretina ; Garnett ; ICGC ; TCGA ). Existing large-scale projects, such as the Cancer Genome Atlas (TCGA; TCGA ), the International Cancer Genome Consortium data portal (ICGC ) and, recently, the Genomics of Drug Sensitivity in Cancer (Garnett ) and the Cancer Cell Line Encyclopedia (Barretina ), provide invaluable opportunities to explore molecular alterations that could potentially play a crucial role in a plethora of different cancer types and their response to therapy (Stratton ). A key task in these projects is to distinguish between driver mutations (i.e. those conferring selective clonal growth advantage) and functionally neutral passenger mutations (which do not contribute to tumour development) (Bignell ; Greenman ). Once key driver mutated genes are identified, a fruitful analysis is to consider them in the context of the pathways where they operate. This allows the identification of cancer driver biological networks, whose altered functionality results in the acquisition of a cancer hallmark (Hanahan and Weinberg, 2011; Vogelstein ). One of the ideas exploited to identify these networks is based on the assumption that sets of mutations exhibiting statistically significant levels of mutual exclusivity (ME) are likely to alter genes involved in a common biological process that drives cancer development. 
It has been noted that driver mutations in cancer occur in a limited number of pathways and that driver lesions in the same pathway tend not to occur in the same patient (Yeang ). A possible biological explanation is that if a crucial node is altered in an oncogenic pathway, a secondary mutation in the same pathway is unlikely to provide further selective advantage to the cancer cell, and thus tends not to be selected during somatic evolution. Hence, sets of mutations exhibiting statistically significant levels of ME are likely to alter genes involved in a common biological process that drives cancer development. On the other hand, mutations of genes participating in different biological pathways may exert a synergistic effect in conferring growth advantages to tumour cells. Therefore, investigations have been devoted to searching for groups of genes that are simultaneously mutated more often than expected by chance (Thomas ; Uren ). Based on these premises, the emergence of combinatorial properties among patterns of genomic events has been investigated in a number of recent studies, through the application of novel statistical measures quantifying, for example, the ‘mutual exclusivity’ or the ‘co-occurrence’ of different genomic lesions (Ciriello et al., 2012; Cui, 2010; Gu ; Miller ; Vandin ; Yeang ). Among these studies, those aimed at identifying groups of genes whose mutation patterns tend to ME are based on the same principle and are conceptually similar (Ciriello ; Miller ; Vandin ), although they differ in two crucial methodological aspects: (i) the way sets of genes to be tested for ME are selected and (ii) the way the ME of a gene set is assessed and its statistical significance is quantified. In (Ciriello ), for example, the authors designed MEMo, a computational framework in which gene sets to be tested for ME are derived from cliques (i.e. groups of genes with complete pair-wise connectivity) identified in functional networks, assembled from publicly available signalling and pathway maps. For the statistical assessment of ME, a variety of strategies have been followed.
Vandin et al. perform a significance test simulating a null model by independently permuting the mutations of each gene across patients, thus preserving the mutation frequency gene-wise (but not sample-wise). In (Miller ), the authors make use of tools from coding theory, and the ME significance for a set of genes is computed algorithmically. In contrast to these two methods, MEMo quantifies the sample coverage (SC) of a set of genes in terms of the number of samples in which at least one of them is mutated. The ME of the gene set under consideration is then computed as the divergence of its SC from expectation. To evaluate the statistical significance of this ME measure, P-values are computed under an appropriate null model. This can be achieved by randomly permuting the analysed dataset, while preserving the overall distribution of observed alterations across both genes and samples. This is crucial to preserve tumour-specific alterations and heterogeneity in mutation/copy-number alteration rates across patients, and to let the SC significance be proportional to the gene set ME. To generate this null model, the authors make use of a permutation strategy based on a random network generation model referred to as the switching-algorithm (Milo ). First, the relevant information in the data is represented as a binary event matrix (BEM) (Fig. 1A): a ‘0–1 table’ in which the generic entry (i, j) is equal to 1 if, in the i-th sample, the j-th gene is altered (by a non-synonymous somatic mutation, a homozygous deletion or an amplification), and is equal to 0 otherwise.
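As an illustration of the sample coverage described above, the following minimal Python sketch counts the samples in which at least one gene of a set is altered. The toy BEM, the function name `sample_coverage` and the convention that rows are samples and columns are genes are assumptions made for this example; it is not the MEMo implementation.

```python
def sample_coverage(bem, gene_idx):
    """Number of samples (rows) in which at least one of the genes
    (columns) listed in gene_idx is altered."""
    return sum(1 for row in bem if any(row[j] for j in gene_idx))


# Toy BEM (hypothetical data): 4 samples (rows) x 3 genes (columns).
bem = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]

# Genes 0 and 1 are never altered in the same sample, so their
# coverage equals their total number of alterations.
print(sample_coverage(bem, [0, 1]))  # 3
# Genes 1 and 2 co-occur in sample 2, so coverage is below their
# total number of alterations (3).
print(sample_coverage(bem, [1, 2]))  # 2
```

A gene set whose coverage is close to its total number of alterations is close to mutually exclusive; the significance of that coverage still has to be judged against a null model such as the one discussed in the text.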

Fig. 1.

BEM randomization through the switching-algorithm. A bipartite graph (B) is derived from the initial BEM by considering it as a graph incidence matrix (A). A sequence of switching-steps (C and D) is performed. In each of these steps, two edges (a,b) and (c,d) are randomly chosen (C) and, if the edges (a,d) and (c,b) do not exist yet, they are added to the network, while (a,b) and (c,d) removed (D). A rewired version of the BEM is derived by considering the incidence matrix of the resulting network after a sufficiently long sequence of switching-steps (E)

The uniform distribution on the set of 0–1 tables with fixed marginal totals (i.e. with prescribed row-wise and column-wise sums) is used as a null model in various contexts (Besag and Clifford, 1989; Ponocny, 2001; Rasch, 1993). In ecological research, 0–1 tables called ‘presence–absence’ matrices (PAMs) (Miklós and Podani, 2004) are randomized to evaluate the deviation of observed phenomena, such as the co-occurrence of different species in the same habitat, from random expectations (Connor and Simberloff, 1979; Gotelli, 2000; Wilson, 1987). Several algorithms exist to generate constrained and non-constrained null models depending on which basic features of the PAM are retained in the computations (Gotelli, 2000; Gotelli and Entsminger, 2001). Nevertheless, the randomization of moderately large matrices in a short space of time is still challenging. Ciriello et al. took advantage of tools from graph theory by considering a BEM as the incidence matrix of a bipartite graph (Gross and Yellen, 2006) (Fig. 1B). Then, they adapted the switching-algorithm for network randomization with node degree preservation to the problem of randomizing a BEM while preserving its row- and column-wise sums (Milo ).
If a BEM derived from a genomic dataset is considered as the incidence matrix of a bipartite network G, then nodes in the first set of G correspond to genes, and those in the second set correspond to samples. Additionally, a node i in the first set is connected to a node j in the second set if the gene mapped by node i is altered in the sample mapped by node j (Fig. 1A and B). Defining the degree of a node as the number of its incident links, row-wise sums of the BEM will correspond to degrees of the nodes in the first set, whereas column-wise sums of the BEM will correspond to degrees of the nodes in the second set. The problem of randomizing a BEM while preserving its row- and column-wise sums can then be reduced to the problem of shuffling the links in the corresponding network G while preserving its node degrees, or ‘network rewiring’, with the additional constraint that the shuffling should preserve bipartiteness (i.e. nodes in the same subset should never be connected). Based on these premises, in MEMo (Ciriello ), randomized versions of a BEM are generated by adapting the switching-algorithm to bipartite networks. This method proceeds through a series of Monte Carlo switching-steps to produce rewired networks, starting from the original one, and preserving its degree distribution, as summarized in Figure 1. For the Markov chain underlying this algorithm to ‘forget’ the initial network (thus to minimize the initial bias), a sufficiently large number of switching-steps should be performed. The presence of trends in the time series of network metrics along the sample path of a Markov chain simulation is evidence that the chain has not yet reached its stationary distribution (Ray ) (the required uniform distribution). If the Markov chain has a slow mixing time (Stanton and Pinar, 2012), then the number of iterations required to reach (approximate) stationarity (the so-called burn-in time) may be very long. 
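The switching-step summarized in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the optimized BiRewire implementation: the function name `switching_rewire`, the edge-list plus edge-set bookkeeping and the toy matrix are assumptions of this sketch, and rejected attempts are counted as steps.

```python
import random


def switching_rewire(bem, n_steps, seed=None):
    """Return a rewired copy of a 0-1 matrix after n_steps switching attempts.

    Each attempt picks two edges (a, b) and (c, d) at random and replaces
    them with (a, d) and (c, b) unless one of the new edges already exists.
    Row and column sums (node degrees) are preserved by construction.
    """
    rng = random.Random(seed)
    n_cols = len(bem[0]) if bem else 0
    edges = [(i, j) for i, row in enumerate(bem) for j, v in enumerate(row) if v]
    edge_set = set(edges)
    if len(edges) >= 2:
        for _ in range(n_steps):
            p, q = rng.sample(range(len(edges)), 2)
            (a, b), (c, d) = edges[p], edges[q]
            # Reject the switch if it would create a duplicate edge; this
            # also covers a == c and b == d, where nothing would change.
            if (a, d) in edge_set or (c, b) in edge_set:
                continue
            edge_set.difference_update(((a, b), (c, d)))
            edge_set.update(((a, d), (c, b)))
            edges[p], edges[q] = (a, d), (c, b)
    rewired = [[0] * n_cols for _ in bem]
    for i, j in edge_set:
        rewired[i][j] = 1
    return rewired


# Toy matrix (hypothetical data).
bem = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
]
rewired = switching_rewire(bem, n_steps=200, seed=1)
# Row sums and column sums are unchanged by the rewiring.
print([sum(r) for r in rewired], [sum(c) for c in zip(*rewired)])
```

Keeping both an edge list (for uniform sampling) and an edge set (for constant-time existence checks) is one simple way to avoid exploring the network at every step.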
In (Milo ), the authors propose on empirical grounds that 100 times the number of existing links ($|E|$) is an adequate burn-in time, and this lower bound is generally used. In what follows, we will refer to this bound as the ‘empirical bound’ (N′). The desired number of random networks needed to compute empirical P-values should then be multiplied by this number to estimate the total time requirements. When dealing with a large number of tests (as are often required in the identification of cancer network drivers, where the number of gene sets to test is potentially very large), to achieve significance after multiple hypothesis test correction, the number of random networks to be generated (hence of switching-algorithm runs) could be in the order of hundreds of thousands. Consequently, the amount of time required to accomplish this task could be very high. This would make routine analyses practical only on server clusters. Here we propose a novel, analytically derived, approximate lower bound to the number of switching-steps required by the switching-algorithm to generate randomized versions of a BEM, preserving genomic event distributions both across samples and genes. In addition, we have implemented BiRewire, an R package (Ihaka and Gentleman, 1996) allowing users (i) to study and visualize trends in metrics over different numbers of switching-steps for a given BEM; (ii) to determine the minimum number of switching-steps required to reach the approximate stationary distribution (here the uniform distribution on the set of allowed BEMs); and (iii) to generate randomized BEMs using the switching-algorithm with the number of switching-steps set to either this lower bound or a user-defined one. We illustrate the application of BiRewire with examples where the BEMs are derived from real datasets from the TCGA (TCGA ) and other studies, after the application of state-of-the-art filters for the identification of somatic mutations affecting protein function and cancer-specific driver genes. Finally, we compare the obtained execution times and P-value computations with those obtained with different implementations of the switching-algorithm and different bounds.

2 METHODS

We analytically derived an approximate lower bound for the number of switching-steps to be performed by the switching-algorithm, when applied to a bipartite network $G = (V, E)$ (where $V$ is the set of vertices, split into the two node classes $V_1$ and $V_2$, and $E$ is the set of links). This bound is equal to

$$N = \frac{|E|}{2(1-d)} \ln\big((1-d)\,|E|\big), \qquad (1)$$

where $d$ is the edge density of the original network, defined as the ratio between $|E|$ and the number of edges of a fully connected bipartite graph with the same number of nodes in the two classes: $d = |E| / (|V_1|\,|V_2|)$. With respect to the empirical bound proposed in (Milo ) (i.e. $N' = 100\,|E|$), our bound can be expressed as $N = \frac{N'}{200\,(1-d)} \ln\big((1-d)\,|E|\big)$, at least for bipartite graphs. In what follows, we will denote with $G^{(k)}$ a rewired version of the bipartite network G obtained with the switching-algorithm through k switching-steps. We assume intuitively that $G^{(k)}$ is a rewired version of G if

1. The average similarity between G and its rewired version $G^{(k)}$ tends to remain constant when k is further increased (i.e. performing additional switching-steps does not make $G^{(k)}$ more different from G, on average);
2. The average similarity between G and $G^{(k)}$ is sufficiently close to the expected similarity between any pair of random bipartite networks with the same size, edge density and node degrees of G (i.e. between any pair of rewired versions of G).

The first condition above is often used when monitoring convergence of the sampler, where trends within chains are studied to quantify the ‘forgetting’ of the initial state (Brooks and Gelman, 1998). Taken together, the two conditions are necessary and sufficient to claim that after k switching-steps the initialization bias of the underlying Markov chain reaches a minimum. When they are verified, performing additional switching-steps does not make $G^{(k)}$ any more different from G, on average. The second property guarantees that G and $G^{(k)}$ are indistinguishable from any pair of networks sampled independently from the null distribution.
As a consequence, $G^{(k)}$ can be considered as an approximate observation drawn from the uniform distribution of all the possible bipartite networks with the same number of nodes, links and degrees as G. By running the switching-algorithm on bipartite networks of different sizes and edge densities, we first verified that after a specified number k of switching-steps, which is much lower than N′, Conditions 1 and 2 are met. Then, we went on to empirically verify that the fulfilment of our convergence criteria provides a good estimation of the autocorrelation time (Stanton and Pinar, 2012): a standard tool for estimating the convergence of a Markov chain to its stationary distribution (Sokal, 1989). Finally, we present a novel approximate lower bound N (which was derived analytically) for the number of switching-steps k at which our two conditions hold. We show that after N switching-steps the distribution of the Jaccard index (JI; a measure of similarity) between $G^{(N)}$ and G reaches the same steady state as is reached at N′, at least on the tested networks. These networks were chosen to have topological features that make their incidence matrix comparable with a BEM derived from a typical large-scale NGS dataset. These results were obtained using an efficient implementation of the switching-algorithm, detailed in the Supplementary Materials, and the R package igraph (Csardi and Nepusz, 2006).
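To make the two bounds concrete, the sketch below evaluates them in Python, assuming the analytical bound has the form N = |E| / (2(1 − d)) · ln((1 − d)|E|) with d = |E| / (|V1||V2|), and the empirical bound is N′ = 100|E| as stated in the text. The function names are illustrative; the input numbers are those of the breast cancer BEM reported in Section 3.3 (757 samples, 9757 genes, 19 758 variants).

```python
import math


def edge_density(n_edges, n_rows, n_cols):
    """Fraction of the possible bipartite edges that are present."""
    return n_edges / (n_rows * n_cols)


def analytical_bound(n_edges, n_rows, n_cols):
    """Assumed form of the analytical bound: |E| / (2(1-d)) * ln((1-d)|E|)."""
    d = edge_density(n_edges, n_rows, n_cols)
    return n_edges / (2.0 * (1.0 - d)) * math.log((1.0 - d) * n_edges)


def empirical_bound(n_edges):
    """Empirical bound: 100 times the number of edges."""
    return 100 * n_edges


n = analytical_bound(19758, 757, 9757)
n_prime = empirical_bound(19758)
# Truncating n gives 97951, the value of N quoted in Section 3.3.
print(int(n), n_prime)
```

Under this assumed form, the analytical bound is roughly twenty times smaller than the empirical one for this dataset, and the gap widens as |E| grows because N scales as |E| ln |E| with a small constant while N′ carries the factor 100.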

3 RESULTS

3.1 Randomness convergence across switching-steps

Based on the same premises as the output-based method proposed in (Johnson, 1996), to show that after a specified number k of switching-steps the average similarity between G and $G^{(k)}$ converges (i.e. it tends to remain constant even when additional switching-steps are applied), we generated several random bipartite networks containing a total number of nodes (with and ), a fixed edge density equal to 15% (3000 edges) and different levels of squareness (i.e. ratio). By adopting an experimental setting similar to that described in (Stanton and Pinar, 2012), for a given level of squareness, each of the corresponding networks G was then given as input to 50 different instances of the switching-algorithm, each performing switching-steps. The output of each of these instances was then sampled every 100 switching-steps and collected, at the j-th sample time, into a set of rewired networks with and . Finally, at each sample time j, the average similarity between each rewired network in R and the original network was computed (to verify Condition 1), as well as the average pair-wise similarity between each pair of networks in R (to verify Condition 2). To quantify the extent of similarity between two networks, the Jaccard Index (JI) (Jaccard, 1901) between their incidence matrices was computed. If we denote with $B$ the incidence matrix of the network G and with $B^{(k)}$ the incidence matrix of its rewired version $G^{(k)}$, then it can be easily verified that the JI between G and $G^{(k)}$ is equal to

$$\mathrm{JI}\big(G, G^{(k)}\big) = \frac{x}{2\,|E| - x},$$

where $|E|$ is the number of links contained in G (equal to that of $G^{(k)}$) and $x = \sum_{i,j} \big(B \circ B^{(k)}\big)_{ij}$ is the bitwise sum of the Hadamard product between the two matrices (i.e. the number of ones in common between them, hence the number of common links across the two networks). Results of this simulation are depicted in Supplementary Figure S1A.
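The Jaccard index between two incidence matrices can be computed directly from its definition. The sketch below is a generic illustration (the function name and the toy matrices are assumptions of this example); when both matrices contain the same number of ones, |E|, it reduces to x / (2|E| − x), with x the number of shared ones.

```python
def jaccard_index(b1, b2):
    """Jaccard index between two 0-1 matrices of equal shape:
    shared ones divided by ones present in at least one matrix."""
    common = sum(x & y for r1, r2 in zip(b1, b2) for x, y in zip(r1, r2))
    total = sum(map(sum, b1)) + sum(map(sum, b2))
    union = total - common
    return common / union if union else 1.0


# Two toy matrices with identical row sums (2, 1) and column sums (1, 1, 1).
original = [[1, 1, 0],
            [0, 0, 1]]
rewired = [[1, 0, 1],
           [0, 1, 0]]

# One shared link out of |E| = 3 links each: 1 / (2 * 3 - 1) = 0.2
print(jaccard_index(original, rewired))
```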
After an adequate number of switching-steps (which is much lower than N′), both the average similarity between the rewired networks and the initial networks (indicated by the blue curves) and the average pair-wise similarity computed between each pair of rewired networks (red curves) plateau at the same level (consistently with Conditions 1 and 2). These results suggest that the true lower bound for the number of switching-steps required by the switching-algorithm to rewire bipartite networks, providing them with the maximal level of randomness, is much lower than N′. For reference, we include in Supplementary Figure S1A the expected similarity between any pair of random bipartite networks with the same number of nodes and edges of G (green line in Supplementary Fig. S1A) but with possibly different node degrees, derived as detailed in the Supplementary Materials. This gives an indication of how much the distribution of networks under the null model differs from the distribution under the alternative model in which node degrees are not preserved. Results from a similar simulation but starting from bipartite networks containing and nodes and different levels of edge densities are shown in Supplementary Fig. S1B. Also in this case, after an adequate number of switching-steps (which is again much lower than N′), the average similarity between the rewired networks and the initial ones (indicated by the blue curves) reaches a plateau level that is equal to the one reached by the average pair-wise similarity computed between pairs of rewired networks (consistently with Conditions 1 and 2). A final empirical study showing that the fulfilment of our convergence criteria provides a good estimation of the autocorrelation time (Stanton and Pinar, 2012), hence of the mixing of the underlying Markov chain, is detailed in the Supplementary Materials and Supplementary Figure S2.
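The kind of simulation described above can be mimicked at small scale. The following Python sketch is an assumption-laden toy, not the authors' experiment: network size, number of chains and sampling interval are arbitrary choices, and only the 15% edge density follows the text. It runs several independent switching chains from the same random bipartite network and records, at each sample time, the average JI to the original network (Condition 1) and the average pair-wise JI between chains (Condition 2).

```python
import random
from itertools import combinations


def switching_steps(edges, edge_set, n_steps, rng):
    """Apply n_steps switching attempts in place (rejections count as steps)."""
    for _ in range(n_steps):
        p, q = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[p], edges[q]
        if (a, d) in edge_set or (c, b) in edge_set:
            continue
        edge_set.difference_update(((a, b), (c, d)))
        edge_set.update(((a, d), (c, b)))
        edges[p], edges[q] = (a, d), (c, b)


def jaccard(s1, s2):
    """Jaccard index between two edge sets."""
    common = len(s1 & s2)
    return common / (len(s1) + len(s2) - common)


def similarity_traces(n_rows=20, n_cols=30, n_edges=90, n_chains=10,
                      n_samples=15, sample_every=20, seed=0):
    """Average JI to the original network and average pair-wise JI
    between chains, recorded every sample_every switching-steps."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    original = rng.sample(cells, n_edges)  # random network, density 15%
    original_set = frozenset(original)
    chains = [(list(original), set(original)) for _ in range(n_chains)]
    to_original, pairwise = [], []
    for _ in range(n_samples):
        for edges, edge_set in chains:
            switching_steps(edges, edge_set, sample_every, rng)
        sets = [edge_set for _, edge_set in chains]
        to_original.append(sum(jaccard(original_set, s) for s in sets) / n_chains)
        pairs = list(combinations(sets, 2))
        pairwise.append(sum(jaccard(s, t) for s, t in pairs) / len(pairs))
    return to_original, pairwise


to_original, pairwise = similarity_traces()
for j, (a, b) in enumerate(zip(to_original, pairwise), start=1):
    print(f"sample {j:2d}: JI to original = {a:.3f}, pair-wise JI = {b:.3f}")
```

In a run of this kind one would expect both traces to flatten and approach a common level, which is the behaviour the two conditions describe; the exact values depend on the seed and on the arbitrary sizes chosen here.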

3.2 A novel lower bound to the number of switching-steps required to rewire bipartite networks

In this section, we summarize the derivation of a lower bound N to the number of switching-steps that the switching-algorithm should perform to rewire a bipartite network, as a function of its number of nodes and edges. The starting point of our proof is the definition of similarity between a bipartite network G and its rewired version (defined in the previous section) based on the JI: In the first part of our proof (provided as Supplementary Material), we formulate the mean-field equation (Barabási ) for (see Lemma 1 of the proof) and consequently for Equation (4). Then from this mean-field equation, we derive a fixed point and a convergence time N, in terms of the number of switching-steps k (Lemma 2). Finally, we show that the switching-algorithm can be used to approximate null models for G through a minimum number of N switching-steps (Lemma 3). The mean-field equation for is equal to where the functions represent five possible values of given , depending on the switching step performing successfully or not, and are the probabilities associated with these values (see Propositions 1, 2, 3 in the proof). Specifying these probabilities allow the mean-field Equation (5) to be written as a second-order linear recursive sequence for which a closed form is provided in (Brousseau, 1971). This yields where m and q can be expressed as , and , with the number of possible edges preserving bipartiteness. For a fixed ε, where , we estimate N as the minimum value such that with , and is the fixed point of the recursion in Equation (6). As shown in our proof (Proposition 4), for the purpose of finding a lower bound N (rather than the exact value of required switching-steps), we can take as the unique fixed point of 6, (Proposition 5 in the proof). Fixing , from the asymptotical equivalence ln (1 + x) ∼ ln x and Lemma 2 of the proof it follows that where and d are defined as in the previous sections. 
With a similar procedure, a mean-field equation can also be estimated for the similarity between any pair of networks and derived from the original network G through two different instances of the switching-algorithm, performing k switching-steps (Lemma 3 of the proof). Briefly, we derived a recursive sequence for . As shown in the proof (Propositions 6 and 7) and similarly to Equation (6), this sequence can be expressed as a second-order linear sequence: but with parameters and . Comparing the two mean-field Equations (6) and (8), it follows that . This implies that the average similarity between any two rewired versions of a network G cannot be greater than the similarity between G and each of the two individual rewired versions. In conclusion, our proof shows that our novel bound guarantees a maximal level of edge mixing, and that the similarity between any pair of rewired versions of a given network cannot be greater than that between them and the original one. Finally, we conducted an empirical study to show that after N switching-steps, the initial bias of the Markov chain underlying the switching-algorithm, quantified by the residual similarity to the original network (i.e. ), is minimized at least as much as it is minimized after N′ switching-steps [i.e. the empirical bound proposed in (Milo )]; details are provided in the Supplementary Materials and Supplementary Figures S3 and S4. Taken together with our formal proof and the empirical study of equivalence between our convergence criteria and the auto-correlation time estimation criteria (detailed in the Supplementary Materials and Supplementary Fig. S2), these results suggest that N can be considered as a good ‘burn-in time’ (in terms of switching-steps) for the Markov chain underlying the switching-algorithm.
As a consequence, N switching-steps are enough to simulate samples from the uniform distributions of all the possible bipartite networks with prescribed node degree, through individual consecutive executions of the switching-algorithm, with an approximation power equal to the one attainable when performing N′ switching-steps.

3.3 Time requirements and statistics comparison for different bounds and implementations on real datasets

We compared the performance of the switching-algorithm when applied to a real, large cancer genomics dataset, in terms of execution time on a typical desktop computer, by using different software implementations and two numbers of switching-steps: our novel lower bound N and the empirical one suggested in (Milo ), N′. For the purpose of this comparison, we used breast cancer samples and their respective mutations downloaded from the TCGA (TCGA ) data portal. A BEM (provided as Supplementary Dataset) was constructed from the deleterious somatic mutations derived from this dataset (as detailed in the Supplementary Materials), yielding 757 rows (i.e. samples), 9757 columns (i.e. genes) and 19 758 non-null entries (i.e. variants), corresponding to an edge density equal to 0.27% in the corresponding bipartite network. For this dataset, the lower bound to the number of switching-steps computed with our method corresponds to N = 97 951, whereas the empirical one is N′ = 100 × 19 758 = 1 975 800 (Supplementary Fig. S5). The execution times required to generate 10 000 rewired versions of the resulting BEM are summarized in Table 1 for our implementation of the switching-algorithm, the rewire function of the igraph package (Csardi and Nepusz, 2006) and the commsimulator function of the vegan package (one of the most widely used packages for ecology research) (Dixon, 2003), each with the two numbers of switching-steps (N and N′, respectively).

Table 1.

Performance comparisons in terms of execution time and residual bias across different algorithms and bounds

Bound	BiRewire	igraph v0.6.1	igraph v0.6.5	vegan swap	vegan Patefield
(A) Execution time
N	53 min 20 s	5 h 58 s	43 days 6 h 21 min 28 s	154 days 21 h 36 min^a	5 h 21 min 29 s
N′	9 h 37 min 30 s	47 days 7 h 37 min 55 s	2 years 145 days 41 min 12 s^a	8 years 114 days 22 h 53 min 20 s^a	
(B) Residual average Jaccard similarity
N	0.006716	0.907788	0.006744	0.006762^a	0.006921
N′	0.006744	0.299971	0.006723^a	0.006879^a	0.006921

Note: ^a Estimations.

In Table 1, we also report the residual average Jaccard similarity scores of the rewired networks with respect to the original one. The first column of the table refers to our optimized implementation of the switching-algorithm, while the second and third columns refer to the rewire function provided in two different versions of the igraph package (respectively, v0.6.1 and the latest v0.6.5) (Csardi and Nepusz, 2006). In the fourth column, we report the time requirements of the commsimulator function contained in the vegan package (Dixon, 2003) when used with the ‘swap’ method parameter (i.e. the switching-algorithm). The rewire function contained in igraph v0.6.1 does not implement the switching-algorithm but proceeds through a series of rewiring steps (detailed in the Supplementary Materials) with a strategy that systematically biases the edge selection and requires, at each step, a local exploration of the network that is generally slower than storing and retrieving individual edges from an edge list (time complexity analysis provided in the Supplementary Materials). In the rewire function contained in the latest version of the igraph package (v0.6.5), the authors implemented the switching-algorithm. As a consequence, for this version of the package, executing N switching-steps guarantees that the residual similarity reaches its plateau (as shown in the third column of Table 1). However, the computational time requirements of this implementation (third column in Table 1) are vastly higher than those of the previous one, making this function practically unusable on large genomics datasets. A detailed analysis of its asymptotic time complexity (far from being trivial) has not yet been included in the documentation of the package.
In Table 1, the performance scores marked with (a) have been estimated from the execution times and the average residual similarities observed on a limited number of rewired versions of the original network. For reference, we also include in Table 1 the performance (in terms of time requirements and residual similarity to the original network) of the r2dtable function, included in the vegan package, when generating 10 000 random 0–1 tables with the same marginal totals as our BEM. This function makes use of Patefield's algorithm (Patefield, 1981). Also in this case, the time requirements were significantly higher than those of our implementation with our bound (5 h 21 min 29 s versus 53 min 20 s; Table 1), while the residual similarity is comparable. Finally, to investigate the consistency of ME significance when using null models generated with different numbers of switching-steps, we analysed ME patterns for the protein-affecting mutations of a colorectal cancer dataset assembled from the TCGA (TCGA ) and other studies, as described in the Supplementary Material. This yielded a small BEM (provided as Supplementary Dataset) composed of 206 rows (i.e. samples), 78 columns (i.e. genes) and 793 non-null entries (i.e. variants), corresponding to an edge density equal to 5% in the corresponding bipartite network. For this dataset, the lower bound to the number of switching-steps computed with our method corresponds to N = 2497, whereas the empirical one is N′ = 100 × 793 = 79 300. We tested the ME significance (as described in the MEMo approach and in the previous sections) for all the possible 3003 gene pairs by using two different null models, simulated by generating 10 000 randomized versions of the BEM through N and N′ switching-steps, respectively.
We observed an overall concordance of resulting coverage P-values across the two null models and a perfect match between the corresponding two sets of gene pairs with a significant ME (P < 0.05 and after Benjamini–Hochberg correction of the P-values for multiple hypothesis testing). Results for gene pair with coverage are provided as Supplementary Data and Figure 2.
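The two statistical ingredients mentioned above, empirical P-values from a simulated null model and the Benjamini–Hochberg correction, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the upper-tail test direction and the ‘+1’ estimator are choices made here, not necessarily those used in MEMo or in this study, and the function names and toy numbers are hypothetical.

```python
def empirical_p_value(observed, null_values):
    """Upper-tail empirical P-value: fraction of null statistics that are
    at least as large as the observed one, with the usual +1 correction
    so that the estimate is never exactly zero (an assumption here)."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: observed coverage 5 against four null coverages.
print(empirical_p_value(5, [1, 2, 5, 7]))  # 0.6

# Toy example: four raw P-values and their adjusted counterparts.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(p, 4) for p in adjusted])  # [0.04, 0.0533, 0.0533, 0.2]
```

In a comparison such as the one described in the text, the null statistics would be the coverages of a gene pair in each randomized BEM, computed once with N and once with N′ switching-steps, and the adjusted P-values of the two runs would then be compared pair by pair.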

Fig. 2.

ME P-value comparisons. ME P-values for 237 gene pairs, whose coverage is in the BEM derived from the colorectal cancer dataset. Positions on the two axes indicate P-values computed by using two different null models simulated by generating 10 000 randomized version of the original BEM, through the switching algorithm and different numbers of switching steps: our novel lower bound and the empirical one. An overall consistency of P-values can be observed and a set of 11 gene pairs has a significant level of ME (at a false discovery rate ) on both the null models

3.4 The BiRewire package

We have developed the R package BiRewire (available on Bioconductor; Gentleman ), which provides high-performing routines for generating random bipartite graphs with prescribed node degrees (using the switching-algorithm), for analysing convergence diagnostics across switching-steps and for estimating the minimal number of steps according to the formula described in Equation (1). BiRewire is vastly faster than other existing implementations, not only because it uses our new lower bound but also because it implements an optimal version of the switching-algorithm, as detailed in the Supplementary Materials. Specifically, with BiRewire, users can (i) create bipartite graphs starting from genomic binary event matrices (or, generally, from any kind of PAM), (ii) perform an analysis that consists of studying the sample path (time series) of the JI across switching-steps (with user-defined sampling times) and estimating the lower bound needed to achieve convergence to the uniform distribution on the set of allowed bipartite networks, (iii) generate rewired versions of a bipartite graph with the analytically derived bound of switching-steps or a user-defined one and (iv) derive projections of the starting network and its rewired version and perform different graph-theory analyses on them. All the functions of the package are written in C and R-wrapped.
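
To make steps (ii) and (iii) concrete, the following Python sketch rewires a bipartite network by switching-steps and records the JI sample path. It is an illustration under stated assumptions and does not reproduce the BiRewire API or its optimized C implementation; every function name is hypothetical. Equation (1) is not restated in this section, so the bound is written in an assumed form, N = e(1 − d)/2 · ln((1 − d)e), with e the number of edges and d the edge density; this form reproduces the value N = 2497 reported for the colorectal dataset. Rejected switches are counted as steps here, which is also an assumption.

```python
import math
import random


def edges_from_matrix(matrix):
    """Return the set of (row, col) positions holding a 1 in a binary matrix."""
    return {(i, j) for i, row in enumerate(matrix) for j, v in enumerate(row) if v}


def assumed_bound(n_rows, n_cols, n_edges):
    """Assumed form of the lower bound: e(1 - d)/2 * ln((1 - d) e)."""
    d = n_edges / (n_rows * n_cols)
    return n_edges * (1 - d) / 2 * math.log((1 - d) * n_edges)


def jaccard(edges_a, edges_b):
    """Jaccard index between two edge sets."""
    union = len(edges_a | edges_b)
    return len(edges_a & edges_b) / union if union else 1.0


def rewire(edges, n_steps, rng, sample_every=None):
    """Apply n_steps switching-steps; return (edge set, JI sample path)."""
    original = frozenset(edges)
    current = set(edges)
    edge_list = list(current)
    path = []
    for step in range(1, n_steps + 1):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Swap endpoints only if this yields two new edges, so that every
        # row and column keeps its degree and no edge is duplicated.
        if a != c and b != d and (a, d) not in current and (c, b) not in current:
            current -= {(a, b), (c, d)}
            current |= {(a, d), (c, b)}
            edge_list[i], edge_list[j] = (a, d), (c, b)
        if sample_every and step % sample_every == 0:
            path.append(jaccard(original, current))
    return current, path


# Bound for the 206 x 78 BEM with 793 variants described in the text.
print(round(assumed_bound(206, 78, 793)))  # 2497

# Toy run on a small binary event matrix.
toy = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
rewired, ji_path = rewire(edges_from_matrix(toy), 200, random.Random(0), sample_every=50)
```

Storing the network as an edge set keeps each switching-step at constant expected cost, since membership checks for the two candidate edges are hash lookups.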

4 CONCLUSIONS

We presented a novel approximate lower bound for the minimal number of steps required by the switching-algorithm to simulate genomic datasets from relevant null models. This new lower bound was derived analytically, and it considerably reduces the computational time for estimating the significance of combinatorial metrics such as mutation mutual exclusivity and co-occurrence under these null models. We showed that this novel bound strongly reduces computational time requirements, when tested on a real dataset and a typical desktop computer architecture paired with the R package BiRewire (which we have developed). Our methods can be readily adapted to the computation of P-values under similar null models, which are appropriate for other kinds of data that can be modelled as a presence-absence matrix (hence, a bipartite network) preserving the ‘presence-distributions’ both across rows and columns. We believe that its applicability range covers different fields of computational biology and will grow in the future, as increasingly more data for which bipartite graphs provide a natural representation become available.

20 in total

1. High-throughput oncogene mutation profiling in human cancer.

Authors: Roman K Thomas; Alissa C Baker; Ralph M Debiasi; Wendy Winckler; Thomas Laframboise; William M Lin; Meng Wang; Whei Feng; Thomas Zander; Laura MacConaill; Laura E Macconnaill; Jeffrey C Lee; Rick Nicoletti; Charlie Hatton; Mary Goyette; Luc Girard; Kuntal Majmudar; Liuda Ziaugra; Kwok-Kin Wong; Stacey Gabriel; Rameen Beroukhim; Michael Peyton; Jordi Barretina; Amit Dutt; Caroline Emery; Heidi Greulich; Kinjal Shah; Hidefumi Sasaki; Adi Gazdar; John Minna; Scott A Armstrong; Ingo K Mellinghoff; F Stephen Hodi; Glenn Dranoff; Paul S Mischel; Tim F Cloughesy; Stan F Nelson; Linda M Liau; Kirsten Mertz; Mark A Rubin; Holger Moch; Massimo Loda; William Catalona; Jonathan Fletcher; Sabina Signoretti; Frederic Kaye; Kenneth C Anderson; George D Demetri; Reinhard Dummer; Stephan Wagner; Meenhard Herlyn; William R Sellers; Matthew Meyerson; Levi A Garraway
Journal: Nat Genet Date: 2007-02-11 Impact factor: 38.330

2. Signatures of mutation and selection in the cancer genome.

Authors: Graham R Bignell; Chris D Greenman; Helen Davies; Adam P Butler; Sarah Edkins; Jenny M Andrews; Gemma Buck; Lina Chen; David Beare; Calli Latimer; Sara Widaa; Jonathon Hinton; Ciara Fahey; Beiyuan Fu; Sajani Swamy; Gillian L Dalgliesh; Bin T Teh; Panos Deloukas; Fengtang Yang; Peter J Campbell; P Andrew Futreal; Michael R Stratton
Journal: Nature Date: 2010-02-18 Impact factor: 49.962

3. Systematic interpretation of comutated genes in large-scale cancer mutation profiles.

Authors: Yunyan Gu; Da Yang; Jinfeng Zou; Wencai Ma; Ruihong Wu; Wenyuan Zhao; Yuannv Zhang; Hui Xiao; Xue Gong; Min Zhang; Jing Zhu; Zheng Guo
Journal: Mol Cancer Ther Date: 2010-07-27 Impact factor: 6.261

4. International network of cancer genome projects.

Authors: Thomas J Hudson; Warwick Anderson; Axel Artez; Anna D Barker; Cindy Bell; Rosa R Bernabé; M K Bhan; Fabien Calvo; Iiro Eerola; Daniela S Gerhard; Alan Guttmacher; Mark Guyer; Fiona M Hemsley; Jennifer L Jennings; David Kerr; Peter Klatt; Patrik Kolar; Jun Kusada; David P Lane; Frank Laplace; Lu Youyong; Gerd Nettekoven; Brad Ozenberger; Jane Peterson; T S Rao; Jacques Remacle; Alan J Schafer; Tatsuhiro Shibata; Michael R Stratton; Joseph G Vockley; Koichi Watanabe; Huanming Yang; Matthew M F Yuen; Bartha M Knoppers; Martin Bobrow; Anne Cambon-Thomsen; Lynn G Dressler; Stephanie O M Dyke; Yann Joly; Kazuto Kato; Karen L Kennedy; Pilar Nicolás; Michael J Parker; Emmanuelle Rial-Sebbag; Carlos M Romeo-Casabona; Kenna M Shaw; Susan Wallace; Georgia L Wiesner; Nikolajs Zeps; Peter Lichter; Andrew V Biankin; Christian Chabannon; Lynda Chin; Bruno Clément; Enrique de Alava; Françoise Degos; Martin L Ferguson; Peter Geary; D Neil Hayes; Thomas J Hudson; Amber L Johns; Arek Kasprzyk; Hidewaki Nakagawa; Robert Penny; Miguel A Piris; Rajiv Sarin; Aldo Scarpa; Tatsuhiro Shibata; Marc van de Vijver; P Andrew Futreal; Hiroyuki Aburatani; Mónica Bayés; David D L Botwell; Peter J Campbell; Xavier Estivill; Daniela S Gerhard; Sean M Grimmond; Ivo Gut; Martin Hirst; Carlos López-Otín; Partha Majumder; Marco Marra; John D McPherson; Hidewaki Nakagawa; Zemin Ning; Xose S Puente; Yijun Ruan; Tatsuhiro Shibata; Michael R Stratton; Hendrik G Stunnenberg; Harold Swerdlow; Victor E Velculescu; Richard K Wilson; Hong H Xue; Liu Yang; Paul T Spellman; Gary D Bader; Paul C Boutros; Peter J Campbell; Paul Flicek; Gad Getz; Roderic Guigó; Guangwu Guo; David Haussler; Simon Heath; Tim J Hubbard; Tao Jiang; Steven M Jones; Qibin Li; Nuria López-Bigas; Ruibang Luo; Lakshmi Muthuswamy; B F Francis Ouellette; John V Pearson; Xose S Puente; Victor Quesada; Benjamin J Raphael; Chris Sander; Tatsuhiro Shibata; Terence P Speed; Lincoln D Stein; Joshua M Stuart; Jon W Teague; Yasushi Totoki; 
Tatsuhiko Tsunoda; Alfonso Valencia; David A Wheeler; Honglong Wu; Shancen Zhao; Guangyu Zhou; Lincoln D Stein; Roderic Guigó; Tim J Hubbard; Yann Joly; Steven M Jones; Arek Kasprzyk; Mark Lathrop; Nuria López-Bigas; B F Francis Ouellette; Paul T Spellman; Jon W Teague; Gilles Thomas; Alfonso Valencia; Teruhiko Yoshida; Karen L Kennedy; Myles Axton; Stephanie O M Dyke; P Andrew Futreal; Daniela S Gerhard; Chris Gunter; Mark Guyer; Thomas J Hudson; John D McPherson; Linda J Miller; Brad Ozenberger; Kenna M Shaw; Arek Kasprzyk; Lincoln D Stein; Junjun Zhang; Syed A Haider; Jianxin Wang; Christina K Yung; Anthony Cros; Anthony Cross; Yong Liang; Saravanamuttu Gnaneshan; Jonathan Guberman; Jack Hsu; Martin Bobrow; Don R C Chalmers; Karl W Hasel; Yann Joly; Terry S H Kaan; Karen L Kennedy; Bartha M Knoppers; William W Lowrance; Tohru Masui; Pilar Nicolás; Emmanuelle Rial-Sebbag; Laura Lyman Rodriguez; Catherine Vergely; Teruhiko Yoshida; Sean M Grimmond; Andrew V Biankin; David D L Bowtell; Nicole Cloonan; Anna deFazio; James R Eshleman; Dariush Etemadmoghadam; Brooke B Gardiner; Brooke A Gardiner; James G Kench; Aldo Scarpa; Robert L Sutherland; Margaret A Tempero; Nicola J Waddell; Peter J Wilson; John D McPherson; Steve Gallinger; Ming-Sound Tsao; Patricia A Shaw; Gloria M Petersen; Debabrata Mukhopadhyay; Lynda Chin; Ronald A DePinho; Sarah Thayer; Lakshmi Muthuswamy; Kamran Shazand; Timothy Beck; Michelle Sam; Lee Timms; Vanessa Ballin; Youyong Lu; Jiafu Ji; Xiuqing Zhang; Feng Chen; Xueda Hu; Guangyu Zhou; Qi Yang; Geng Tian; Lianhai Zhang; Xiaofang Xing; Xianghong Li; Zhenggang Zhu; Yingyan Yu; Jun Yu; Huanming Yang; Mark Lathrop; Jörg Tost; Paul Brennan; Ivana Holcatova; David Zaridze; Alvis Brazma; Lars Egevard; Egor Prokhortchouk; Rosamonde Elizabeth Banks; Mathias Uhlén; Anne Cambon-Thomsen; Juris Viksna; Fredrik Ponten; Konstantin Skryabin; Michael R Stratton; P Andrew Futreal; Ewan Birney; Ake Borg; Anne-Lise Børresen-Dale; Carlos Caldas; John A Foekens; 
Sancha Martin; Jorge S Reis-Filho; Andrea L Richardson; Christos Sotiriou; Hendrik G Stunnenberg; Giles Thoms; Marc van de Vijver; Laura van't Veer; Fabien Calvo; Daniel Birnbaum; Hélène Blanche; Pascal Boucher; Sandrine Boyault; Christian Chabannon; Ivo Gut; Jocelyne D Masson-Jacquemier; Mark Lathrop; Iris Pauporté; Xavier Pivot; Anne Vincent-Salomon; Eric Tabone; Charles Theillet; Gilles Thomas; Jörg Tost; Isabelle Treilleux; Fabien Calvo; Paulette Bioulac-Sage; Bruno Clément; Thomas Decaens; Françoise Degos; Dominique Franco; Ivo Gut; Marta Gut; Simon Heath; Mark Lathrop; Didier Samuel; Gilles Thomas; Jessica Zucman-Rossi; Peter Lichter; Roland Eils; Benedikt Brors; Jan O Korbel; Andrey Korshunov; Pablo Landgraf; Hans Lehrach; Stefan Pfister; Bernhard Radlwimmer; Guido Reifenberger; Michael D Taylor; Christof von Kalle; Partha P Majumder; Rajiv Sarin; T S Rao; M K Bhan; Aldo Scarpa; Paolo Pederzoli; Rita A Lawlor; Massimo Delledonne; Alberto Bardelli; Andrew V Biankin; Sean M Grimmond; Thomas Gress; David Klimstra; Giuseppe Zamboni; Tatsuhiro Shibata; Yusuke Nakamura; Hidewaki Nakagawa; Jun Kusada; Tatsuhiko Tsunoda; Satoru Miyano; Hiroyuki Aburatani; Kazuto Kato; Akihiro Fujimoto; Teruhiko Yoshida; Elias Campo; Carlos López-Otín; Xavier Estivill; Roderic Guigó; Silvia de Sanjosé; Miguel A Piris; Emili Montserrat; Marcos González-Díaz; Xose S Puente; Pedro Jares; Alfonso Valencia; Heinz Himmelbauer; Heinz Himmelbaue; Victor Quesada; Silvia Bea; Michael R Stratton; P Andrew Futreal; Peter J Campbell; Anne Vincent-Salomon; Andrea L Richardson; Jorge S Reis-Filho; Marc van de Vijver; Gilles Thomas; Jocelyne D Masson-Jacquemier; Samuel Aparicio; Ake Borg; Anne-Lise Børresen-Dale; Carlos Caldas; John A Foekens; Hendrik G Stunnenberg; Laura van't Veer; Douglas F Easton; Paul T Spellman; Sancha Martin; Anna D Barker; Lynda Chin; Francis S Collins; Carolyn C Compton; Martin L Ferguson; Daniela S Gerhard; Gad Getz; Chris Gunter; Alan Guttmacher; Mark Guyer; D Neil Hayes; 
Eric S Lander; Brad Ozenberger; Robert Penny; Jane Peterson; Chris Sander; Kenna M Shaw; Terence P Speed; Paul T Spellman; Joseph G Vockley; David A Wheeler; Richard K Wilson; Thomas J Hudson; Lynda Chin; Bartha M Knoppers; Eric S Lander; Peter Lichter; Lincoln D Stein; Michael R Stratton; Warwick Anderson; Anna D Barker; Cindy Bell; Martin Bobrow; Wylie Burke; Francis S Collins; Carolyn C Compton; Ronald A DePinho; Douglas F Easton; P Andrew Futreal; Daniela S Gerhard; Anthony R Green; Mark Guyer; Stanley R Hamilton; Tim J Hubbard; Olli P Kallioniemi; Karen L Kennedy; Timothy J Ley; Edison T Liu; Youyong Lu; Partha Majumder; Marco Marra; Brad Ozenberger; Jane Peterson; Alan J Schafer; Paul T Spellman; Hendrik G Stunnenberg; Brandon J Wainwright; Richard K Wilson; Huanming Yang
Journal: Nature Date: 2010-04-15 Impact factor: 49.962

5. Bioconductor: open software development for computational biology and bioinformatics.

Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583

Review 6. Combinatorial patterns of somatic gene mutations in cancer.

Authors: Chen-Hsiang Yeang; Frank McCormick; Arnold Levine
Journal: FASEB J Date: 2008-04-23 Impact factor: 5.191

Review 7. The cancer genome.

Authors: Michael R Stratton; Peter J Campbell; P Andrew Futreal
Journal: Nature Date: 2009-04-09 Impact factor: 49.962

8. Patterns of somatic mutation in human cancer genomes.

Authors: Christopher Greenman; Philip Stephens; Raffaella Smith; Gillian L Dalgliesh; Christopher Hunter; Graham Bignell; Helen Davies; Jon Teague; Adam Butler; Claire Stevens; Sarah Edkins; Sarah O'Meara; Imre Vastrik; Esther E Schmidt; Tim Avis; Syd Barthorpe; Gurpreet Bhamra; Gemma Buck; Bhudipa Choudhury; Jody Clements; Jennifer Cole; Ed Dicks; Simon Forbes; Kris Gray; Kelly Halliday; Rachel Harrison; Katy Hills; Jon Hinton; Andy Jenkinson; David Jones; Andy Menzies; Tatiana Mironenko; Janet Perry; Keiran Raine; Dave Richardson; Rebecca Shepherd; Alexandra Small; Calli Tofts; Jennifer Varian; Tony Webb; Sofie West; Sara Widaa; Andy Yates; Daniel P Cahill; David N Louis; Peter Goldstraw; Andrew G Nicholson; Francis Brasseur; Leendert Looijenga; Barbara L Weber; Yoke-Eng Chiew; Anna DeFazio; Mel F Greaves; Anthony R Green; Peter Campbell; Ewan Birney; Douglas F Easton; Georgia Chenevix-Trench; Min-Han Tan; Sok Kean Khoo; Bin Tean Teh; Siu Tsan Yuen; Suet Yi Leung; Richard Wooster; P Andrew Futreal; Michael R Stratton
Journal: Nature Date: 2007-03-08 Impact factor: 49.962

9. Comprehensive genomic characterization defines human glioblastoma genes and core pathways.

Authors:
Journal: Nature Date: 2008-09-04 Impact factor: 49.962

10. Large-scale mutagenesis in p19(ARF)- and p53-deficient mice identifies cancer genes and their collaborative networks.

Authors: Anthony G Uren; Jaap Kool; Konstantin Matentzoglu; Jeroen de Ridder; Jenny Mattison; Miranda van Uitert; Wendy Lagcher; Daoud Sie; Ellen Tanger; Tony Cox; Marcel Reinders; Tim J Hubbard; Jane Rogers; Jos Jonkers; Lodewyk Wessels; David J Adams; Maarten van Lohuizen; Anton Berns
Journal: Cell Date: 2008-05-16 Impact factor: 41.582
