Literature DB >> 35758804

CALDERA: finding all significant de Bruijn subgraphs for bacterial GWAS.

Hector Roux de Bézieux¹, Leandro Lima², Fanny Perraudeau¹, Arnaud Mary³, Sandrine Dudoit⁴, Laurent Jacob³.

Abstract

MOTIVATION: Genome-wide association studies (GWAS), aiming to find genetic variants associated with a trait, have widely been used on bacteria to identify genetic determinants of drug resistance or hypervirulence. Recent bacterial GWAS methods usually rely on k-mers, whose presence in a genome can denote variants ranging from single-nucleotide polymorphisms to mobile genetic elements. This approach does not require a reference genome, making it easier to account for accessory genes. However, a same gene can exist in slightly different versions across different strains, leading to diluted effects.
RESULTS: Here, we overcome this issue by testing covariates built from closed connected subgraphs (CCSs) of the de Bruijn graph defined over genomic k-mers. These covariates capture polymorphic genes as a single entity, improving k-mer-based GWAS both in terms of power and interpretability. However, a method naively testing all possible subgraphs would be powerless due to multiple testing corrections, and the mere exploration of these subgraphs would quickly become computationally intractable. The concept of testable hypothesis has successfully been used to address both problems in similar contexts. We leverage this concept to test all CCSs by proposing a novel enumeration scheme for these objects which fully exploits the pruning opportunity offered by testability, resulting in drastic improvements in computational efficiency. Our method integrates with existing visual tools to facilitate interpretation.
AVAILABILITY AND IMPLEMENTATION: We provide an implementation of our method, as well as code to reproduce all results at https://github.com/HectorRDB/Caldera_ISMB. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Mesh：

Year: 2022 PMID： 35758804 PMCID： PMC9235473 DOI： 10.1093/bioinformatics/btac238

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.931

1 Introduction

Genome-wide association studies (GWAS) look for genetic variants whose presence or absence is associated with a trait of interest, such as the risk for a person to develop a disease, or the yield for a crop. They were originally used on human genomes using single-nucleotide polymorphisms (SNPs) as genetic variants (Visscher ). While SNPs do capture most of the genetic variation in genomes that are similar enough, they can miss essential variants in other situations. For example, some bacterial species are known to have large accessory genomes, i.e. sets of genes that are not present in every strain in the species. In spite of their name, some of these accessory genes play a central role for some traits of interest, such as antibiotic resistance. In Pseudomonas aeruginosa, for instance, accessory genes account for 70% of known genetic determinants of resistance to amikacin (Jaillard ). In this context, k-mers—defined as all words of length k found in the genomes—have emerged as a popular alternative to SNPs to describe genetic diversity (Earle ; Sheppard ). More specifically, bacterial GWAS often test the association between the trait of interest and the presence/absence of k-mers. A broad variety of genetic variants—ranging from SNPs to mobile genetic elements or translocations—cause the mutated strains to contain one or several specific k-mers. These GWAS are therefore able to capture any of these variants without requiring their prior identification or even definition. On the other hand, k-mer-based GWAS suffer from two important limitations. First, interpreting their result is notoriously tedious: any given k-mer can belong to several regions of the same genome, and conversely a gene causing the trait of interest can contain many specific k-mers. Second, because a resistance-causing gene often exists in slightly different version, the k-mers of each version are only present in a fraction of the resistant strains. As a consequence, these k-mers are less strongly associated with resistance than the presence of the polymorphic gene itself. Jaillard proposed to help interpret the result of k-mer-based GWAS using the de Bruijn graph (DBG, de Bruijn, 1946; Pevzner ), which connects overlapping k-mers. Several significant k-mers arising from a single polymorphic gene typically aggregate into a somewhat linear subgraph of the DBG (Fig. 1), making their interpretation easier. Similarly, (Lees ), a widely-used bacterial GWAS pipeline, now recommends using unitigs over k-mers. However, still tests the individual nodes of this subgraph separately, at the risk of missing causal genes whose presence is too diluted across different versions and therefore different k-mers. (Drouin ) uses conjunction and disjunction of patterns of presence/absence of k-mers to predict the phenotype. However, that approach does not directly allow performing inference and requires specifying the maximum number of allowed combinations.

Fig. 1.

Example of de Bruijn graphs. (a) A general example with two genes, each with some variability, resulting in a mostly linear sequence only at the coarse level. More details in Section 4. (b) A simpler setting with two samples and four nodes, leading to three CCSs: and Here, we propose to test the association between the phenotype and a single covariate capturing the presence of any version of a gene—or any other potential genetic determinant. Concretely, this covariate indicates the presence in each genome of any k-mer among those represented in a particular connected subgraph of the DBG. More specifically, we restrict ourselves to closed connected subgraphs (CCSs). A CCS is a connected subgraph such that adding any neighbor does not change the created covariate (i.e. the set of samples containing a k-mer that is also in the subgraph). Non-closed subgraphs are represented by the exact same covariate as their closure, and would therefore be redundant. As any CCS may represent a causal variant that exists in several versions in the dataset, we take an agnostic approach and test the association between the phenotype and one covariate for every CCS in the DBG. In contrast, relies on one covariate for each node of the DBG. This new approach has two potential issues: (i) the number of CCSs grows exponentially with the number of nodes in the DBG, making the task computationally intractable, and (ii) adjusting for multiple testing over this very large number of tests leaves little to no power to detect associations. Our method addresses these two issues by using the concept of testability introduced by Tarone (1990). Tarone’s procedure controls the family-wise error rate (FWER) while disregarding numerous non-testable hypotheses in its multiple testing correction. Intuitively, a covariate representing the presence of any k-mer among a growing set that corresponds to larger and larger CCS quickly becomes true for all samples. It thus cannot possibly be associated to any phenotype and can therefore be discarded without being tested or counted toward multiple testing correction. Testability provides a well-grounded and quantitative version of this intuition. Furthermore, since adding nodes to a connected subgraph can only increase the number of present k-mers in the corresponding covariate, we can develop a method that rapidly prunes non-testable CCSs, thereby solving the computational problem. Testability has been used in similar situations, but most existing procedures are restricted to complete (Minato ; Terada ) or linear graphs (Llinares-López , 2017). Sese described an algorithm to test all CCSs by combining the testability-based procedure of Terada with (Sese ), an enumeration method for CCSs. While no experiment was provided in Sese , we show that a version of this algorithm using an improved version of (Llinares-López ; Minato ) could find all significant CCSs in graphs with up to 20 000 nodes in less than a day in only the most favorable settings. However, the DBG built for typical bacterial GWAS involve millions of nodes, so a more scalable method is necessary to make CCSs testing amenable. Our contributions are the following: We introduce a novel, provably complete and non-redundant enumeration scheme for CCSs called . We also improve an existing pruning criterion for the Cochran–Mantel–Haenszel (CMH) test. We show that combining these contributions with Tarone’s testability-based procedure makes it possible to find all significant CCSs in a large graph, making it suited to bacterial GWAS. We provide the first implementation of a procedure finding all significant CCSs, along with a user-friendly visualization tool derived from . Finally, we demonstrate the advantages of over competing methods on both simulated and real examples in terms of computational speed, statistical power and biological interpretation. Notation and goal for : We consider a set of n samples, , where are p binary covariates describing sample i, denotes a binary phenotype, and assigns sample i to one population among J. We denote n1 and n2 the number of samples such that y = 0 and 1, respectively. Furthermore, we consider an undirected unweighted connected graph , where and each vertex is associated with one of the p binary covariates represented in x. We denote by the indices of samples containing v. Conversely, for , we note the set of covariates contained by the i-th sample. For any connected subgraph , such that and , we let be the indices of samples containing at least one of the covariate represented by the vertices of . Of note, this framework addresses both disjunctions and conjunctions, as the latter can simply be obtained by replacing each x by its complement. We now properly define the notion of CCS and the closure operation (proof in Supplementary Section S-1.1). Definition 1. A connected subgraph is closed if and only if there exists no edge such that , , and . We denote by the set of all closed connected subgraphs of . Lemma 1. For any connected subgraph of , there exists a unique subgraph such that and , which we note . Assuming that are n i.i.d. realizations of random variables , and C, our objective is to test null hypotheses of the form for all , while controlling the FWER (i.e. the chance of at least one Type I error or false positive) at level α. Translated in the context of GWAS, we want to test the association between the phenotype Y and the presence pattern of the covariate represented by each CCS , while controlling for the population structure C. We denote in the remainder of this manuscript, as X, Y and C are common for all CCS in .

2 Background on significant subgraph detection using testability

We now describe the important concept of minimal attainable p-value proposed by Tarone (1990), and how it can be used to (i) retain more power than the Bonferroni procedure while controlling the FWER and (ii) test more rapidly a large set of hypotheses. Both improvements come from the possibility to discard a large proportion of hypotheses without explicitly testing them, and will be exploited in Section 3 to propose our procedure testing all CCSs in . Minimal p-values are a property of discrete tests. For example, Fisher’s exact test (Fisher, 1922) relies on a 2 × 2 contingency table, whose margins would describe in our case the number of sensitive and resistant bacteria and the number of bacteria whose genome contains or not a genetic variant. Given the margins of this table, only a finite number of cell count assignment are possible and Fisher’s test can only lead to a finite number of values, the smallest of which is strictly positive (see Supplementary Fig. S1 for an example). Importantly, this minimal attainable p-value is entirely determined by the margins of the contingency table: given these margins, is the minimum over a finite number of possible partitions and is independent of the actual observed cell counts. Intuitively, strongly imbalanced margins (e.g. variants that are present in a very large proportion of samples) cannot possibly lead to small p-values, no matter how the table is filled (i.e. how the few samples that do not have the variant are distributed among resistant and sensitive phenotypes).

2.1 Using minimal attainable p-values for a tighter FWER control

The FWER is the probability to incorrectly reject at least one null hypothesis. When testing N of them and rejecting those whose p-value p is smaller than a threshold δ, , where is taken over the N null distributions . The Bonferroni correction (Bonferroni, 1936) is a common procedure to control the FWER at a level α. It is motivated by a simple union bound: as FWER is upper-bounded by and since by definition , controlling each individual tests at level makes the FWER upper-bounded by α. Tarone (1990) sharpens this bound, by using the fact that for some hypotheses. Since by definition , the corresponding term is exactly 0. Therefore, the FWER is actually controlled at level where m is the number of testable hypotheses, for which . This suggests that using a larger threshold δ than the Bonferroni could still control the FWER at level α—while rejecting more hypotheses and therefore increasing power. Choosing the largest such δ is not a trivial task, as increasing δ also increases the number m of non-testable hypotheses. Let m(k) be the number of testable hypotheses at level , i.e. such that . In the worst case, m(k) = N and we recover the Bonferroni procedure. More generally, Tarone’s analysis guarantees that for as long as . Letting k0 be the smallest k verifying this property, maximizes the number of rejections while controlling the at level α.

2.2 Using minimal p-values to efficiently explore

Provided that enough CCSs have sufficiently large , Tarone’s procedure could therefore address the loss of power incurred when exploring . However, naively finding k0 requires computing the minimal p-values for all CCSs and iterate through these minimal p-values to adjust the threshold, leaving the computational problem unsolved. A more efficient strategy has been introduced to compute k0 (Llinares-López ; Minato ): starting from k = 1 a set of testable hypotheses, i.e. of elements with is grown. When becomes larger than k, k is incremented to . All hypotheses that are not testable anymore under the new threshold—i.e. such that —are removed from , and the exploration continues until the point where all testable hypotheses are in and k = k0. This strategy finds k0 in a single enumeration of all tests, but still requires computing all minimal p-values, which would not be feasible in our case. However, this search algorithm is also well suited to pruning strategies—a fact already used in Llinares-López and Minato . Let be the minimal p-value associated with for a CCS . Assuming that for some pairs of subgraphs , we can stop exploring all subgraphs including as soon as itself is found non-testable. This monotonic property is verified when using Fisher’s exact test to test : provided that is strictly increasing in , and adding nodes to can only increase (see Supplementary Fig. S2 for an example). Our main contribution, presented in Section 3 will be an efficient exploration algorithm for , which is well suited to pruning.

2.3 Controlling for a categorical covariate: the CMH test

When testing for associations, controlling for confounders is essential to avoid spurious discoveries. This is particularly important in bacterial GWAS, where strong population structures can lead to large sets of clade-specific variants to be found associated with a phenotype. The CMH test can be used to test associations of two binary variables while controlling for a third categorical variable. It relies on J two-by-two association tables such as the one in Table 1, with and .

Table 1.

Association table in community j for subgraph , used for the CMH test

Variable	i∈I(S)	i∉I(S)	Rows totals
yi=1	aS,j	n1,j−aS,j	n1,j
yi=0	xS,j−aS,j	n2,j−xS,j+aS,j	n2,j
Cols totals	xS,j	nj−xS,j	n_j

Association table in community j for subgraph , used for the CMH test Like Fisher’s exact test, the CMH test is done conditional on all margins . Papaxanthos furthermore, demonstrated that its minimal p-value could be computed in O(J) (proof in Supplementary Section S-1.5) using the margins. However, the minimal p-value of the CMH test does not verify the monotonicity property which is required to prune while exploring . Papaxanthos introduced the envelope, a lower bound on , which verifies the monotonicity property. It can also be computed in for all such that, for all categories j, . This allows for a valid pruning strategy. The condition on is the CMH analogous of the condition of Fisher’s test, and can decrease the number of prunable subgraphs as it must be verified for all J groups.

3 Speeding up the detection of all significant CCSs with CALDERA

We are now ready to present our contributions for scalable detection of significant elements in : an efficient exploration algorithm and an improved envelope for the CMH test, allowing for more pruning in the presence of imbalanced populations.

3.1 Critical properties for a fast, Tarone-aware enumeration of

We exploit several factors to provide a fast exploration of . First, we ensure that it is non-redundant, i.e. that each element of is enumerated exactly once, by defining a tree whose nodes are the elements of and propose an algorithm to traverse this tree. Second, the tree is directly built over , as opposed to the set of connected subgraphs. The latter option, as proposed in Sese is more straightforward to define and to explore and still induces a tree over , but yields a much larger object and results in a more expensive traversal. Third, we avoid maintaining subgraph connectivity, such as a block-cut tree (Westbrook and Tarjan, 1992). Such a mechanism is efficient to build a tree over connected subgraphs, but is costly to compute. Finally, in order to exploit the pruning opportunity offered by the testing procedure, the exploration should be such that all explored from a given verify . Haraguchi and Okuno define a tree on , but the root of the tree corresponds to the entire graph : the inclusion relationship along edges of the tree is the opposite to the one we need, making their exploration unsuited to our problem. The COIN/COPINE algorithm described in Seki and Sese (2008); Sese builds a tree over the set of connected subgraphs, which induces a tree over but has two drawbacks. First, it maintains an itemtable to avoid enumerating the same element twice. This itemtable has an important memory footprint, and only guarantees a tree structure when exploring in depth first. Secondly, the enumeration of connected subgraphs requires maintaining a list of articulation points along each explored branch, a costly operation.

Algorithm 1 Children of

Input parent CCS , current CCS , largest index i, itemtable 1: procedure Children(, i, ) 2: children 3: for k, G in enumerate(EqGroups) do 4: is a candidate child 5: if verify () then is a child 6: if i is NULL then Exploring from the direct neighbors of 7: Add to children 8: Add Children to children 9: else Exploring from the neighbors of another child 10: if and then Check that was not enumerated earlier 11: 12: Add to children 13: Add Children to children 14: end if 15: end if 16: end if 17: end for 18: return children 19: end procedure

3.2 Defining and exploring the tree over

In order to build a tree over rooted on the empty CCS, we use a reverse search, a strategy introduced in Avis and Fukuda (1996). Reverse search relies on a reduction operation, which takes one element of the set to be enumerated, and returns a unique, strictly smaller element of the same set. This operation necessarily defines a tree over the elements of the set, by ensuring a unique path between any element and the empty one—the root of the tree. This reduction operation defines the unique parent of every element in the tree. In order to traverse the tree from the root, one needs to inverse the reduction operation, i.e. in our setting, given a CCS to recover all CCSs that lead to by reduction. Here, we introduce a reduction operation over , as well as its inversion. We consider the parent operation given by Definition 2 for any element of , and show that it defines a valid reduction as introduced above. We then show its inversion in Algorithm 1. All proofs are presented in the Supplementary Appendix. Definition 2. For a subgraph , we denote the indices of samples containing all covariates represented by nodes in . If , then the parent of is , i.e. . Else we note . The parent of is the connected subgraph of that contains . Here, the over a set of nodes is defined through some arbitrary numbering of the elements of . In layman’s terms, our reduction first finds the largest index among samples that contain at least one but not all covariates represented by nodes in . It then removes all nodes representing a covariate contained by this sample, and retains one of the connected components as . Our reduction therefore relies on a numbering of both the samples—to decide which nodes are removed—and the nodes—to define which connected component resulting from the removal is retained as the unique parent. Lemma 2. The function defines a valid reduction over . Note that we have for all so this structure allows pruning. Lemma 3 then provides necessary and sufficient conditions for to be a child of . The third condition involves the set of neighboring nodes of , defined as . Lemma 3. For such that , we have: if and only if the three following conditions are verified: (C1) (C2) (C3) , or written differently, . Using (C1–3) in Lemma 3 to check whether for any does not require identifying the connected components of , even though the reduction itself does rely on these connected components. This property of the inverse reduction is critical for the scalability of : repeatedly identifying or maintaining these components would be very costly. It results from the fact that the reduction operation does not maintain full connectivity, but only retains one of the connected components obtained by removing a subset of its nodes. Doing so comes at a price: finding all children of is not straightforward—Lemma 3 only provides a way to check if a candidate is a child of . We must therefore provide a non-redundant way to explore all potential children, after which Lemma 3 will guarantee a non-redundant exploration of . More precisely, reducing any CCS to its parent involves the removal of a subset of its nodes, breaking into several connected components—the one containing the largest vertex being retained as the unique parent. For this reason, the reverse search formalized in Algorithm 1 cannot just search for children of among all closures obtained after adding one of its neighbors (Lines 6–7): larger CCSs may also lead to by reduction if they involve other nodes that are not in its direct neighborhood. For example, the graph shown in Figure 1 and discussed in Section 3.3 contains two CCS ( and ) which both lead to by reduction but cannot be obtained from by just adding its (single) neighbor and taking the closure. For every identified child —e.g. in the example—we must therefore recursively search for other candidates among the closures obtained after adding one of its own neighbors —nodes v1 or v2 in the example. This procedure is necessary to reconnect all children that include but would leave it as a separated connected component after removing nodes —node v4 in the example. This recursive exploration is guaranteed to visit each candidate child, but does not ensure that each child is visited only once. A redundant exploration would lose the benefit of building a tree to explore efficiently. We therefore need an itemtable that keeps track of visited patterns : if a candidate child has a pattern that includes the pattern of an already enumerated child from the neighborhood of the same , we know that —and any child that could be obtained from it—has already been visited and the algorithm stops exploring from . In practice, we do not need to store the full table , and rely on a concept from Uno and further described in Supplementary Section S-2 to reduce memory footprint. Theorem 1. For any NULLreturns the set . Theorem 1 says that Algorithm 1 solves the problem of inverting the reduction, and therefore of building a tree structure on . Of note, Algorithm 1 effectively explores equivalence groups of neighbors, yielding the same pattern. Formally, an equivalence group verifies: . We name the pattern of the k-th equivalence group .

3.3 Example of exploration

To help provide a better intuition of Algorithm 1, we use a simple graph with four nodes and four samples in Figure 2 and will unfold how various CCS are explored. The algorithm technically starts from the empty CCS, whose children are in this case , each of them (i) being closed and (ii) leading to the empty set by reduction.

Fig. 2.

A short illustration of CALDERA’s exploration. A simple graph with four nodes and four samples. Nodes v1, v2 and v3 are linked to node v4. To construct the CCS from , we can either first add v1—and thus construct —and then add v2, or directly add v2 and then we get by closure. To avoid enumerating twice, we therefore need local itemtables Starting from v4, we can construct the connected subgraph whose pattern . Adding any of its neighbors will change that pattern, so is a CCS. Can we explore from ? We use Algorithm 1 with , i = NULL and . We enumerate over the neighbors of (Line 3), which are v1, v2 and v3. For example, with v3, (Line 4). is equal to so we do not verify (C1) (Line 5) and we stop there. The same happens with v1 and v2. This can be expected. When applying the reduction to any CCS that contain v4, we will remove all the nodes containing the biggest sample, that is 4 so we will always remove v4. So no CCS gives by reduction. Therefore, when inverting the reduction with Algorithm 1, we should find no children of . If we start from v3, we begin similarly. , whose pattern , is a CCS. If we add its (only) neighbor v4 (Line 4), we construct again . But this time we have (C1) ; (C2) ; (C3) while so . So we verify (C1–3), which ensures that is indeed a child of {3} (Line 5). Since i is NULL (Line 6), we add to the list of the children of (Line 7). We then call Algorithm 1 again (Line 8), with and . We have two possible neighbors (and corresponding equivalence groups), v1 and v2 (Line 3). We first add v2. However, is not closed. By the closure operation, we add v1 as well, so we have a new (Line 4), and verify (C1–3) (Line 5). i = 4 is not NULL anymore, so we move to Line 10. and . Since k = 1 (we are exploring the first equivalence group from ), we do not update (Line 11) but add to the children of (Line 12). has no neighbors since it is the full graph, so Line 13 returns no more values. We return to Line 3 to explore the second equivalence group from , with k = 2: we add v1. The new is a CSS (Line 4) and verify (C1–3) (Line 5). i = 4 is not NULL and (Line 10). We now update it to . We add to the list of children and call Algorithm 1 again (Line 13), with , i = 4 and . has a single neighbor (Line 3), v2. is a CCS (Line 4), verify (C1–3) (Line 5). i is not NULL. However, so . We stop the exploration here. This illustrates the importance of the local itemtables since without them, we would have enumerated twice. Algorithm 1 applied to returned . When doing the reduction of any of those CCS, we remove v4 which breaks the CCS into several components. Since v3 will always be the biggest remaining node, it will always be picked as the parent. So, when inverting the reduction, we find all subgraphs containing v3 and v4 as children. The remaining CCS will be found similarly by starting from v1 (for ) and v2 (for and ).

3.4 A Breadth-first-search (BFS) enumeration

We argue that exploring any tree structure on in breadth first will often allow for more pruning than in depth first. Pruning occurs among children, not siblings. At any level, even if the CCSs visited along a branch do increase k and therefore lower the testability threshold, all the other CCSs of the level will need to be visited regardless of their testability. In contrast, the increase of k gained by visiting all CCSs of the same level in the tree will lower the threshold for all CCSs at the next level, making more branches prunable. We demonstrate this in Section 4 and provide more intuitive examples in the Supplementary Appendix (Supplementary Sections S-5.1 and S-5.2). A search in breadth is also easily parallelized since the computation of the minimal p-value, the envelope and the children of every CCS of a given level can be done in parallel, before increasing k and updating . In contrast, a parallelized search in depth-first must share and regularly update k and , which negates the advantages of parallelization.

Algorithm 2 List significant closed connected subgraphs

1: procedure List_sig_closed_subgraphs() 2: NULL 3: 4: 5: whiledo 6: 7: ifthen 8: 9: end if 10: ifthen 11: 12: 13: end if 14: ifthen 15: forNULLdo 16: 17: end for 18: end if 19: end while 20: Solutions 21: fordo 22: ifthen 23: Add to Solutions 24: end if 25: end for 26: return Solutions 27: end procedure Algorithm 2 explores through a BFS traversal of the tree defined by the reduction , exploiting Algorithm 1 (L.15) to invert the reduction and using this exploration to apply the Tarone testing procedure described in Section 2.2 (L7–12, 14), before finally testing the testable CCSs (L21-25). However, BFS is more memory intensive than depth-first search (DFS, see Section 4). In order to control the trade-off between speed and memory, we implemented a hybrid exploration scheme in which we allow each stage of the tree to be split into several batches. The tree is explored in BFS until some user-defined maximal width is reached at any level, at which point we start a DFS from the visited nodes. We then restart the exploration of the level in BFS.

3.5 Pruning more CCSs when controlling for an imbalanced categorical covariate

The envelope introduced in Papaxanthos verifies the monotonicity for any subgraph because . However, the algorithm to compute this envelope only applies to the so called potentially prunable subgraphs which are such that for all subgroups defined by the categorical covariate adjusted for by the CMH test. Pruning can therefore not be done from subgraphs for which at least one of the J groups has few occurrences of the corresponding covariate. This limitation arises in Lemma 2 of Papaxanthos , which characterizes the argmin of the envelope of a subgraph . Lemma 4 lifts this restriction: Lemma 4. For any connected subgraph , the envelope is attained for an optimum such that . The proof is provided in Supplementary Section S-1.5. Lemma 4 exploits a cruder bound for groups that are not in the increasing regime of the minimal p-value. It recovers the Lemma 2 of Papaxanthos for potentially prunable subgraphs, while offering an additional pruning opportunity for the other ones. If a subgraph was not potentially prunable only because it was missing the condition for one small group j, it may still be actually prunable since small groups of samples only affect the CMH test statistic marginally. On the other hand, if the condition is not verified for a large group or several small ones, the resulting envelope will be very loose and will not allow for pruning in practice. We provide some intuition in the Supplementary Appendix (Supplementary Fig. S3).

4 Experiments

We demonstrate the superiority of in terms of computational speed, statistical power and biological interpretation. To do so, we rely on both simulated and real datasets.

4.1 Datasets and settings

To test the speed of the methods, we generate datasets with n samples represented by covariates, and a graph connecting these covariates. We vary the proportion prop of samples that are resistant, i.e. have a phenotype of 1, and the number of samples. We also perform exploration when changing the value of α, which impacts pruning. This leads to 4 scenarios to compare the runtimes of the methods, named Speed 1 to Speed 4. More details on implementations and parameters can be found in Supplementary Section S-6. In order to test the speed gains provided by the new lower bound, we also explore an Imbalance in which we add a binary confounding variable, fixing n = 100 and p = 3000 but varying the balance of samples across the confounding variable. To test the power of the different methods, we rely on a simulation where the ground truth is known, named Exploration. We generate a dataset with n = 100 samples, 50 of each phenotype, where two genes A and B are present. Gene A is present for all samples, while gene B is only present for resistant samples. We introduce heterogeneity such that the DBG of the two genes is only linear at a coarse level (Fig. 1b). More details for the setting of those simulations are provided in Supplementary Section S-7. We also rely on two real datasets, where we use compacted DBGs (de Bruijn, 1946; Pevzner ). In a DBG, k-mer are nodes and k-mer that overlap by k – 1 nucleotides are connected by an edge. The graph is then compacted by reducing all linear sequences to a single node. The first dataset, which we name Pseudomonas, consists of the n = 280 P.aeruginosa genomes along with their resistance phenotype to amikacin, used in (Jaillard ). The bacteria are partitioned based on k-mean into two distinct groups. The compacted DBG is constructed using the k-mers with k = 31 (default) using , leading to a graph with over 2.3 million nodes and average degree . The second, named Akkermansia, consists of the Akkermansia muciniphila genomes collected in Karcher . We use host information as covariates: we want to identify genetic sequences that are associated with a body mass index over 30. The compacted DBG constructed over those n = 401 strains has 1.3 million nodes with an average degree of . On these two real datasets, we rely on heuristics to choose the level α at which the FWER is controlled and the number of stages explored in the BFS search—a full exploration being too memory intensive. The level is fixed at the lowest value at which 10 CCSs at stage 1 of the BFS (i.e. unitig closures) are found significant. The stage is chosen by stopping when the number of unitigs covered by a significant CCS reaches a plateau—suggesting that further exploration would not bring much novelty.

4.2 Methods

On top of , we use the following methods. COIN (Sese ) is to our knowledge the only described algorithm to identify significant CCSs, combining the enumeration method of COPINE with the algorithm. Minato presented a provably superior version of , which we denote . Since no implementation was provided in Sese , we implemented as a baseline COIN+LAMP2. Since and both rely on the same statistical procedures (the identification of testable hypotheses with Fisher’s test), the set of significant CCSs found is the same regardless of the method. For this reason, we only use in the speed comparison, since the methods have the same power. tests individual unitigs for association with a phenotype, using a linear mixed-model (LMM). We also benchmark three k-mer-based methods, available via the pyseer pipeline (Lees ), that recapitulates usual methods: a fixed-effect model without population effect, a LMM similar to the one used in , which is recommended, and an elastic-net model. Note that, since those methods, as well as , do not rely on graph exploration, they will not be benchmarked on the speed simulation portion, which solely focuses on that task.

4.3 Speed gains of CALDERA

In addition to , we benchmark three versions of . The first one, closest to , is the DFS implementation. The second one is the BFS implementation, where we modify the enumeration order of the elements of to promote pruning. The last is a parallelized BFS implementation, using five cores. Benefit of : In Figure 3a, representing the results of Speed 2, we see that the ranking in speed is uniform over all value of p, with being the slowest, followed by the DFS and BFS implementation, and finally the parallelized version of . For p = 20 000, runs in 2h20 while the parallelized version of takes 5 min. The ranking is the same for Speed 1, Speed 3 and Speed 4 (see Supplementary Section S-6). For example, for Speed 1 and p = 20 000, times out (2-day threshold) before finishing, while the parallelized version of runs in 6 h. Over all parameter values, the average ratio of runtime for over with 5 cores is 76 and we tested on graphs with up to p = 100 000 nodes in 14 h. In terms of memory, and have near identical requirements, while the DFS implementation uses about 40% of the other methods. More details on memory usage can be found in Supplementary Section S-6.

Fig. 3.

Results of . (a) Run times for and on graphs with various values of p. In this setting, n = 100. (b) Proportion of all unitigs associated with the resistant phenotype that are found to be significant by , the procedure on all unitigs and , as the value of α changes On the larger Pseudomonas dataset, even is unable to explore the entire with the heuristic level . We stop the exploration from any CCS that reaches a size of 2000 nucleotides or more (-Lmax 2000 option). We also observe that the unitigs covered by the significant CCSs reaches a plateau after the first six stages of the BFS. took 3h20 on 4 cores (plus 2h30 to build the graph), using 200 Gb of RAM to complete these stages using batches of size 200 000, leading to potentially testable CCSs and only 39 significant ones. For comparison, after running for 24 h, was exploring the tree structure with a running —i.e. had achieved about 2% of the exploration. runs in 2h45 (15 mn for the statistical test). k-mer-based methods benefit from a faster initial step of k-mer counting (1h30) but the statistical test is much longer, even using four cores: 2h30 for the fixed-effect and LMM, 3h30 for the elastic net method. Overall, performs slower than DBGWAS but on par with the k-mer-based methods. We provide a more general analysis of the computational cost of against the number of BFS stages in Supplementary Section S-8 and recommend using a similar analysis and stop after a few stages in cases where a full exploration is no feasible. Benefit of : For extreme ratios—below 0.02—the new lower bound allows much more pruning and enumerates an order of magnitude fewer elements of . Up to a ratio of 0.1, the new lower bound leads to a decrease of at least 10% in the number of explored subgraphs (see Supplementary Fig. S7).

4.4 Power gains of CALDERA

As mentioned above, and all versions of rely on the same statistical procedures and therefore find an identical list of significant CCSs for a given level α. However, we can compare the power of with and k-mer-based methods. We also use the procedure when testing separately, using Fisher’s test—like . We run all three methods on the dataset Exploration and measure how many of the 367 unitigs of gene B are called significant, when controlling the FWER at a varying level α. For , a significant unitig is one that is contained in a significant CCS. Even when controlling the FWER at very low levels (), correctly recovers the entirety of the resistant gene. On the other hand, the other methods fail to ever recover the entire gene, even at . This clearly show the enhanced power of : because of variations along the genome, the association of any individual unitig with the phenotype is weak, while a covariate that jointly represents all 367 unitigs of the resistant gene is very strongly associated with that phenotype. The additional loss of power of k-mer methods stems from the larger multiple testing correction that they incur. Here, we also benchmarked : although this method focuses on prediction and thus does not return p-values, its conjunction—disjunction approach could potentially identify the full gene as the best predictor of the phenotype. However, that is not the case and only returns one k-mer as the best predictor. We also apply those three methods to the Pseudomonas dataset. While there is no ground truth, this dataset contains two confirmed genetic variants linked to resistance to amikacin: a SNP on the aac(6′) gene, represented by one unitig, and the pHS87b plasmid, represented by 476 unitigs. This allows us to see how the methods handle those different scales. At the default , find the aac(6′) mutation as one CCS, and finds significant CCSs that covers 96% of the plasmid. Those two components represent 59% of all significant unitigs. In contrast, and do recover the mutation but only 34% and 0% of the plasmid, respectively. Even at , and only recover respectively 72% and 8% of the plasmid. Moreover, while it is not possible to compute a false negative rate on a real dataset, we can see that, at this level, the two known sequences—the plasmid and the aac(6′) mutation—only represent 6% and 17% of all significant unitigs.

4.5 Simplified biological interpretation

Biological interpretation in or happens at the component levels: significant unitigs or CCSs separated by only a few non-significant unitigs are displayed as one component. Unitigs can also be annotated using various databases, to enhance interpretation. Components are ranked in order of decreasing p-values, choosing the smallest p-value among all unitigs/CCSs. As such, both the number of components and their rankings will impact the ease of interpretation. Figure 4 gives an example: we see the results of running on the Akkermansia dataset. Only one CCS is significant, while returns no significant unitig. All the unitigs in the significant CCS (colored in green) are annotated, using the RefSeq database of all known A.muciniphila proteins as a reference (Tatusova ), and map to a common gene. This gene is not well annotated (hypothetical protein) but partially map to a Tubulin/FtsZ_GTPase A.muciniphila protein.

Fig. 4.

Akkermansia dataset: Tubulin/FtsZ_GTPase gene. Screenshot from the output of CALDERA. We select the first component, which is the one which contains the most significant CCS. (a) Unitigs belonging to the most significant CCS are colored in green (darker gray in the black and white version of the document). Other unitigs linked to the CCS in the DBG are colored in gray. Size denotes overall frequency, while a black contour denotes that the sequence of the unitig has a match in the database, here RefSeq (Tatusova ). (b) All significant hits on that database are listed in a panel, usually on top of the subgraph. (c) User can click on nodes to display information, or right-click to select all nodes from the same component. This contains info on the node, such as the frequency, the pattern of the associated CCS, or any match to the database (A color version of this figure appears in the online version of this article) The plasmid returned as the first component of on the Pseudomonas dataset can be seen in Supplementary Fig. S13. Visually, we can see clearly a broad circular graph, with local genetic variations. returns eight components: the first is the entire plasmid, returned as one component. The second is the aac(6′) mutation. always ranks the aac(6′) mutation first but never returns the plasmid as one component, even when controlling the FWER at a level of 0.1 (three components, the first one ranked fourth). Moreover, at this level, DBWGAS returns 77 components, making the interpretation much harder.

5 Discussion

This article presented , an algorithm to enumerate all significant CCSs. is between one and two orders of magnitude faster than previously described exploration methods. It easily scales to large datasets, relying on an efficient structure on and an exploration scheme that leverages the pruning opportunity offered by discrete statistics. This increased computational speed allows to deploy this method to DBG-based bacterial GWAS, which we demonstrate on two real examples. Moreover, we show that considering the CCSs, as done by , leads to increased power and facilitates interpretation, compared to previous methods that perform statistical tests at the node level. can better detect low signal caused by variability in genetic elements. It also returns larger and more coherent outputs that are easier to interpret. We extensively discussed how performs on bacterial GWAS. However, can also be used for other tests of association involving a graph structure. We provide in Supplementary Section S-10 an example: we look at the association between SNPs on Arabidopsis thaliana genomes and a ‘date to flowering’ phenotype. In that setting, the graph is a regulatory network on the genes and the objective is to identify subnetworks whose disruption by at least one mutation is associated with the phenotype. In settings where the node is a more natural object than the CCS, discrete testing can still be used to take advantage of Tarone (1990)’s procedure and increase power. However, pruning will no longer be possible, unless some other order can be established between nodes that preserve the order of minimal p-values. For now, does not scale to datasets with hundreds of millions of nodes that are possible in metagenome-wide association studies. Future work that focuses on incorporating pre-processing schemes before would be needed to compact the graph to both reduce its size and facilitate pruning by increasing the average . Click here for additional data file.

14 in total

1. Genome-wide detection of intervals of genetic heterogeneity associated with complex traits.

Authors: Felipe Llinares-López; Dominik G Grimm; Dean A Bodenham; Udo Gieraths; Mahito Sugiyama; Beth Rowan; Karsten Borgwardt
Journal: Bioinformatics Date: 2015-06-15 Impact factor: 6.937

2. Correlation between phenotypic antibiotic susceptibility and the resistome in Pseudomonas aeruginosa.

Authors: Magali Jaillard; Alex van Belkum; Kyle C Cady; David Creely; Dee Shortridge; Bernadette Blanc; E Magda Barbu; W Michael Dunne; Gilles Zambardi; Mark Enright; Nathalie Mugnier; Christophe Le Priol; Stéphane Schicklin; Ghislaine Guigon; Jean-Baptiste Veyrieras
Journal: Int J Antimicrob Agents Date: 2017-05-26 Impact factor: 5.283

Review 3. 10 Years of GWAS Discovery: Biology, Function, and Translation.

Authors: Peter M Visscher; Naomi R Wray; Qian Zhang; Pamela Sklar; Mark I McCarthy; Matthew A Brown; Jian Yang
Journal: Am J Hum Genet Date: 2017-07-06 Impact factor: 11.025

4. Genome-wide genetic heterogeneity discovery with categorical covariates.

Authors: Felipe Llinares-López; Laetitia Papaxanthos; Dean Bodenham; Damian Roqueiro; Karsten Borgwardt
Journal: Bioinformatics Date: 2017-06-15 Impact factor: 6.937

5. pyseer: a comprehensive tool for microbial pangenome-wide association studies.

Authors: John A Lees; Marco Galardini; Stephen D Bentley; Jeffrey N Weiser; Jukka Corander
Journal: Bioinformatics Date: 2018-12-15 Impact factor: 6.937

6. A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events.

Authors: Magali Jaillard; Leandro Lima; Maud Tournoud; Pierre Mahé; Alex van Belkum; Vincent Lacroix; Laurent Jacob
Journal: PLoS Genet Date: 2018-11-12 Impact factor: 5.917

7. Identifying lineage effects when controlling for population structure improves power in bacterial association studies.

Authors: Sarah G Earle; Chieh-Hsi Wu; Jane Charlesworth; Nicole Stoesser; N Claire Gordon; Timothy M Walker; Chris C A Spencer; Zamin Iqbal; David A Clifton; Katie L Hopkins; Neil Woodford; E Grace Smith; Nazir Ismail; Martin J Llewelyn; Tim E Peto; Derrick W Crook; Gil McVean; A Sarah Walker; Daniel J Wilson
Journal: Nat Microbiol Date: 2016-04-04 Impact factor: 17.745

8. NCBI prokaryotic genome annotation pipeline.

Authors: Tatiana Tatusova; Michael DiCuccio; Azat Badretdin; Vyacheslav Chetvernin; Eric P Nawrocki; Leonid Zaslavsky; Alexandre Lomsadze; Kim D Pruitt; Mark Borodovsky; James Ostell
Journal: Nucleic Acids Res Date: 2016-06-24 Impact factor: 16.971

9. Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons.

Authors: Alexandre Drouin; Sébastien Giguère; Maxime Déraspe; Mario Marchand; Michael Tyers; Vivian G Loo; Anne-Marie Bourgault; François Laviolette; Jacques Corbeil
Journal: BMC Genomics Date: 2016-09-26 Impact factor: 3.969

10. Genomic diversity and ecology of human-associated Akkermansia species in the gut microbiome revealed by extensive metagenomic assembly.

Authors: Nicolai Karcher; Eleonora Nigro; Mireia Valles-Colomer; Willem M de Vos; Nicola Segata; Michal Punčochář; Aitor Blanco-Míguez; Matteo Ciciani; Paolo Manghi; Moreno Zolfo; Fabio Cumbo; Serena Manara; Davide Golzato; Anna Cereseto; Manimozhiyan Arumugam; Thi Phuong Nam Bui; Hanne L P Tytgat
Journal: Genome Biol Date: 2021-07-14 Impact factor: 13.583