Literature DB >> 22219725

Genetic co-occurrence network across sequenced microbes.

Pan-Jun Kim1, Nathan D Price.   

Abstract

The phenotype of any organism on earth is, in large part, the consequence of interplay between numerous gene products encoded in the genome, and such interplay between gene products affects the evolutionary fate of the genome itself through the resulting phenotype. In this regard, contemporary genomes can be used as molecular records that reveal associations of various genes working in their natural lifestyles. By analyzing thousands of orthologs across ∼600 bacterial species, we constructed a map of gene-gene co-occurrence across much of the sequenced biome. If genes preferentially co-occur in the same organisms, they were called herein correlogs; in the opposite case, called anti-correlogs. To quantify correlogy and anti-correlogy, we alleviated the contribution of indirect correlations between genes by adapting ideas developed for reverse engineering of transcriptional regulatory networks. Resultant correlogous associations are highly enriched for physically interacting proteins and for co-expressed transcripts, clearly differentiating a subgroup of functionally-obligatory protein interactions from conditional or transient interactions. Other biochemical and phylogenetic properties were also found to be reflected in correlogous and anti-correlogous relationships. Additionally, our study elucidates the global organization of the gene association map, in which various modules of correlogous genes are strikingly interconnected by anti-correlogous crosstalk between the modules. We then demonstrate the effectiveness of such associations along different domains of life and environmental microbial communities. These phylogenetic profiling approaches infer functional coupling of genes regardless of mechanistic details, and may be useful to guide exogenous gene import in synthetic biology.

Entities:  

Mesh:

Year:  2011        PMID: 22219725      PMCID: PMC3248385          DOI: 10.1371/journal.pcbi.1002340

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


Introduction

An important challenge in biology is to understand the relationships between an organism's genotype and its phenotype, involving dissection of myriad interdependencies among various cellular components, such as genes, proteins, and small molecules [1], [2]. Recent efforts to map protein-protein interactions [3], [4], together with efforts to reconstruct metabolic and regulatory networks at the genome scale [5], [6], [7], offer promising opportunities to investigate emergent biological phenomena of interacting biomolecules inside cells. Complex interdependencies among the products of individual genes do not necessarily imply molecular interactions by direct physical contacts but also include more macroscopic associations exhibited at e.g. biochemical pathway levels, as has been claimed in epistasis research [8]. These genetic interdependencies, regardless of their underlying mechanisms, may leave a trace on the composition of genomes through evolutionary processes. One important set of analyses that have been used successfully over the past decade is based on phylogenetic profiles as introduced in [9]. Here, we analyze a type of phylogenetic profile – the patterns of the presence or absence of orthologs across many organisms – to find genes with favored co-occurrence in the same genomes (called herein correlogs) or disfavored co-occurrence (called anti-correlogs) suggesting their putative functional coupling. Such analysis of correlogy and anti-correlogy can help uncover global gene associations conserved at the biome level, beyond those specific to any particular organism. This information is distinct from genetic interactions inferred from, for example, double-mutant data [10], [11], which relate to gene relationships within the specific organisms employed in the experiments. In particular, knowledge of the association of genes not coexisting inside considered specific organisms can be applied for heterologous gene expression in synthetic biology [12], and interest in anti-correlogy itself inevitably tends to target heterologous genes. It should also be noted that fitness of an organism in its natural habitat does not necessarily coincide with fitness in a laboratory [13], and ortholog profiles of organisms may thus encompass genetic relationships in environmental and ecological contexts not readily captured by laboratory experiments. The importance of co-occurring orthologs as a means to gain insight into gene relationships is well appreciated in many previous studies [9], [14], [15], [16], [17], [18], [19]; however, these studies have largely focused on improving the identification of molecular-level interactions rather than on systematic analysis of the global organization of gene associations, as is the focus herein. In addition, we have included the removals of indirect gene-gene correlations in analyzing co-occurrence patterns to reduce false positive associations. We here take a comprehensive and systematic approach to study from a global perspective the biome-wide associations of genes as manifest through patterns of correlogy and anti-correlogy across sequenced bacteria.

Results

Characterization of correlogy and anti-correlogy

We surveyed the presence or absence of gene orthologs across 588 different bacterial species (Table S1) on the basis of orthology data available from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [20]. Beyond simply measuring co-occurrence of these genes, we evaluated degrees of correlogy or anti-correlogy between the genes within the context of direct associations in biological activities, using methods to help avoid vertical co-inheritance effects, transitivity effects, and other spurious correlations. In particular, we attempted to reduce the contribution of indirect (anti-)correlogous relationships that result from transitivity effects: if genes i and j are each correlated with third common genes in terms of the presence or absence across species, genes i and j can also appear to be correlated with each other even in the absence of direct association between them. Therefore, simple correlations calculated from the co-occurrence pattern can suffer from these indirect correlations. Such removal of indirect correlations in this manner has not generally been taken into account in previous studies on ortholog profiles [9], [14], [15], [16], [17], [18], [19], or also with similar types of correlation calculations in other fields to infer disease comorbidity networks [21], [22], [23] or social networks [24]. Nevertheless, filtering out these transitivity effects has been of critical importance in e.g. reverse engineering of transcriptional regulatory networks [25], [26], and we employed such ideas to reduce transitivity effects when quantifying correlogy and anti-correlogy. As a result, for every pair of a total of 2085 genes, we assigned w of which magnitude increase away from zero measures the magnitude of correlogy (w>0) or anti-correlogy (w<0) between genes i and j in a pair (see Methods).

Biochemical and phylogenetic properties

It is worthwhile to address how much correlogous relationships overlap with molecular-level interactions. To test this idea, we compared the distribution of w for physically-binding proteins in Escherichia coli with that for arbitrary pairs of the proteins [27], and found that highly correlogous proteins are much likely to physically interact (Figure 1A). Indeed, the average of w for directly-binding proteins was 6.8 times larger than that for the second nearest proteins in the protein interaction network (P = 7.1×10−55), and 10.5 times larger than that for all the protein pairs (P = 8.8×10−57; Figure 1B and Methods). Instead of w, if we use the simple co-occurrence measure r that precedes w before the alleviation of transitivity effects (Methods), the overall distribution of r is heavily biased toward positive r 's (Figure 1C), and the enrichment of high r's in physically-binding protein pairs is relatively weak (P = 4.0×10−8 against the second nearest proteins) although still significant. Moreover, the network-topological signature appearing for w in Figure 1B becomes very distorted for r with a hump at the tenth nearest proteins in Figure 1D. Note that a hump at the tenth nearest proteins represents average r larger than averages at nearer proteins, possibly contributed to by indirect correlations from transitivity but unlikely to be biologically meaningful. Thus, the elimination of effects caused by transitivity is critical to identifying biologically meaningful correlogous and anti-correlogous relationships between genes. Another simple co-occurrence measure we tried, the mutual information I of genes i and j (Methods), reflects better the physical interactions than r but still less than w (P = 3.4×10−28 against the second nearest proteins; Figures 1E and 1F). Again, I leaves a small hump at the tenth nearest proteins in Figure 1F, and by its definition does not directly provide positive or negative signs of gene relationships themselves which are important in our study.
Figure 1

Biochemical properties of correlogous and anti-correlogous gene associations.

(A) The probability density of w for physically-interacting protein pairs (red) and that for all protein pairs (black) in the E. coli protein interaction network. The probability density means the fraction of protein pairs in a unit interval of w. Note that the red line is not only skewed into w>0 but is also rapidly truncated at w<0 compared with the black line, indicating that physically-interacting proteins are highly enriched for w>0 compared with arbitrary pairs of proteins. (B) The average w for each value of the shortest path length among pairs of proteins in the protein interaction network. (C and D) Plotted for r in the same ways as (A) and (B). (E and F) Plotted for I in the same ways as (A) and (B). Gray horizontal lines in (B), (D), and (F) indicate the global average, and error bars represent standard deviations.

Biochemical properties of correlogous and anti-correlogous gene associations.

(A) The probability density of w for physically-interacting protein pairs (red) and that for all protein pairs (black) in the E. coli protein interaction network. The probability density means the fraction of protein pairs in a unit interval of w. Note that the red line is not only skewed into w>0 but is also rapidly truncated at w<0 compared with the black line, indicating that physically-interacting proteins are highly enriched for w>0 compared with arbitrary pairs of proteins. (B) The average w for each value of the shortest path length among pairs of proteins in the protein interaction network. (C and D) Plotted for r in the same ways as (A) and (B). (E and F) Plotted for I in the same ways as (A) and (B). Gray horizontal lines in (B), (D), and (F) indicate the global average, and error bars represent standard deviations. The data in Figure 1A suggest a natural boundary for separating a regime enriched with functionally-obligatory protein interactions (w>0.045) from that with conditional or transient interactions [28], [29]. Specifically, the probability density of w for physically-binding proteins deviates strongly from that for arbitrary pairs of proteins at w>0.045, but almost overlaps with that for the arbitrary pairs at the rest w (except for the absence of strong anti-correlogy). In the former regime of w, there thus might be present functionally-obligatory relationships between physically interacting proteins to constrain them with large correlogous associations, as indicated by the rich presence of operonic genes whose transcriptions are precisely co-regulated in time (Text S1 and Figure S1). Even if we exclude these operonic gene pairs, the transcripts of the gene pairs with w>0.045 still tend to be more co-expressed than the others (P = 0.03), implying more obligatory interactions between them (Text S1 and Figure S1). Taken together, integration of protein-protein interaction data with the heterogeneous data of correlogous relationships provides a powerful means for identifying potentially functionally-obligatory interactions conserved across evolution from mere binding events. It is of course expected that other forms of molecular interactions than physically-binding interactions would also contribute to correlogy and anti-correlogy. Table 1 presents some cases of the highest |w|'s; rfbF encodes the enzyme to catalyze the reaction, CTP+α-d-glucose 1-phosphate→diphosphate+CDP-glucose, and the produced CDP-glucose is subsequently converted by the enzyme from the correlogous gene, rfbG, into CDP-4-dehydro-6-deoxy-d-glucose. The correlogous relationship between rfbF and rfbG thus appears to be a consequence of the need for the second reaction to proceed when the first occurs in order to achieve the relevant biological functions. On the other hand, two NAD+ synthases, one utilizing ammonia as an amide donor to produce NAD+ and the other utilizing glutamine as an amide donor instead, are highly anti-correlogous to each other, and the anti-correlogy reflects that both processes operating simultaneously in the same organism have been consistently selected against. This fact might be related to the distinct modes of regulating cellular nitrogen inside bacteria, as pleiotropic nitrogen utilization mutations can be found at NAD+ synthases [30]. In general, genes associated with similar functions were enriched with correlogous and anti-correlogous relationships (Figure S2 and Text S1). Indeed, the average of w>0 for isozymes based on the same Enzyme Commission numbers was 5.8 times larger than that for arbitrary pairs of enzymes (Z = 83.69), and the average of w<0 for the isozymes was 2.9 times larger than that for arbitrary enzyme pairs (Z = 24.11). Thus, isozymes exhibit high levels of correlogy and anti-correlogy compared to arbitrary enzyme pairs, although the effect is more pronounced for the correlogy side.
Table 1

Three most correlogous or anti-correlogous pairs of genes.

wij KEGG identifierNameDescription
Correlogous
0.4037K05878 dhaK Dihydroxyacetone kinase, N-terminal domain
K05879 dhaL Dihydroxyacetone kinase, C-terminal domain
0.3982K00978 rfbF Glucose-1-phosphate cytidylyltransferase
K01709 rfbG CDP-glucose 4,6-dehydratase
0.3975K0197716S rRNA, rrs 16S ribosomal RNA
K0198023S rRNA, rrl 23S ribosomal RNA
Anti-correlogous
−0.2188K03785 aroD 3-dehydroquinate dehydratase I
K03786 aroQ, qutE 3-dehydroquinate dehydratase II
−0.2107K01916 nadE NAD+ synthase
K01950 NADSYN1, QNS1, nadE NAD+ synthase (glutamine-hydrolysing)
−0.1817K00756 pdp Pyrimidine-nucleoside phosphorylase
K00758 deoA Thymidine phosphorylase
To evaluate the overall correlogous couplings around individual genes, we defined for a given gene i, , where the summation was taken over all other gene j's satisfying w>0. Likewise, we can define . S and S quantify how tightly gene i is associated to other genes through correlogy and anti-correlogy, respectively. Having both the largest S and S was gene rpmJ, which encodes ribosomal protein L36. The next largest were S and S of DNA-modifying genes, while the smallest ones were owned by flagella-related genes (Table S2). In general, S and S of each gene i are very positively correlated (Text S1 and Figure S3). How broadly genes are phylogenetically distributed would affect or be affected by the strength of interactions with other genes. Therefore, the degree of phylogenetic spread of any given gene is expected to be correlated with its S and S. Rather surprisingly, our analysis suggests that species-level spread of genes does not exhibit such correlations (Figure 2A), but spread at a higher taxonomic level – at the phylum level – reveals clear correlations with S and S; as genes inhabit diverse phyla, their S and S tend to increase continuously until saturated at a plateau (Z = 79.94; Figure 2B and Methods). Since different phyla (phylogenetically distant) may represent disparate cell types more clearly than different species (phylogenetically close), our result suggests that how many such disparate cellular conditions genes inhabit at the phylum level could substantially evolve or be influenced by the strength of the genetic interdependencies around the genes.
Figure 2

Correlation with phylogenetic dispersion.

(A) The average S of genes for each number of species where the genes are present, and (B) that for each number of phyla where the genes are present. For (A) and (B), S also behaves similarly to S (Text S1 and Figure S3). Gray horizontal lines indicate the global average, and error bars represent standard deviations.

Correlation with phylogenetic dispersion.

(A) The average S of genes for each number of species where the genes are present, and (B) that for each number of phyla where the genes are present. For (A) and (B), S also behaves similarly to S (Text S1 and Figure S3). Gray horizontal lines indicate the global average, and error bars represent standard deviations.

Global organizational properties

To systematically view the essential structure of correlogous and anti-correlogous relationships from a global architectural perspective, we constructed a maximum relatedness subnetwork (MRS) [24], in which each gene i points to two other genes j and j′ by different categories of edges that represent the most correlogous (max>0) and anti-correlogous (min<0) genes to gene i, respectively (Figures 3A–3C and Table S3). By definition, genes in MRS are not necessarily reciprocally linked. As such, one can easily recognize that following a series of genes in either of correlogy or anti-correlogy direction of links, the magnitude of w of each link increases until encountering reciprocal links. MRS provides the ‘backbone’ structure of a considered network, and has been demonstrated as particularly useful for detecting modular structures of complex networks [24]. One example is for two isocitrate dehydrogenases, IDH1 and IDH3, which encode similar enzymes depending on NADP+ and NAD+, respectively, and are anti-correlogously associated in the MRS. Consistent with our observations, their distinct phylogenetic profiles have been discussed in previous studies [31]. Also in the MRS, IDH1 is correlogously associated with gltA that encodes citrate synthase, and the enzyme activities from IDH1 and gltA are indeed known as elaborately coordinated in a cell for efficient growth on acetate [32]. For another example, β-glucosidase (bglB), which breaks down cellobiose into β-d-glucose and can be industrially useful for lignocellulose conversion for biofuel production [33], is correlogously associated in the MRS with l-rhamnose mutarotase (rhaM). We expect that this mutarotase may convert the product of β-glucosidase into α-d-glucose if the host cell prefers the α form to the β form, and thus expressing both the two genes, bglB and rhaM, may introduce a synergetic effect in cellobiose metabolic engineering. Furthermore, Figure 3C illustrates that, along correlogy direction, the MRS arranges sequentially xylose transport protein (xylF), xylose isomerase (xylA), and xylulokinase (xylB), the same order as their arrangement in the xylose metabolism pathway (Figure 3D). It is interesting to note that the MRS brings relatively less characterized genes such as xylF to attention by informing of their strong correlogous relationships with other genes. As such, Figure 3C does specify the identities of genes important for xylose metabolic engineering [34] that may benefit from simultaneous heterologous expression of xylF, xylA, and xylB. Also, by their link directionality the MRS provides the information that xylA and xylB are more associated than xylA and xylF.
Figure 3

MRS of gene associations.

(A–C) Large-scale to small-scale overviews of the MRS. A part of the whole MRS in (A) is magnified in (B) to reveal its clear modular structure. Circles represent genes, and each gene i is arrowed to gene j having the largest |w| with w>0 (dark arrow) or w<0 (light arrow). Size of each circle corresponds to S of the gene, and circles are colored the same if linked via w>0. In (B) and (C), different groups of genes linked via w>0 are shown to be seamlessly connected by links of w<0. In (C), gene names or KEGG identifiers appear for relevant circles. (D) Transport and biochemical conversions of xylose catalyzed by the proteins from genes in (C).

MRS of gene associations.

(A–C) Large-scale to small-scale overviews of the MRS. A part of the whole MRS in (A) is magnified in (B) to reveal its clear modular structure. Circles represent genes, and each gene i is arrowed to gene j having the largest |w| with w>0 (dark arrow) or w<0 (light arrow). Size of each circle corresponds to S of the gene, and circles are colored the same if linked via w>0. In (B) and (C), different groups of genes linked via w>0 are shown to be seamlessly connected by links of w<0. In (C), gene names or KEGG identifiers appear for relevant circles. (D) Transport and biochemical conversions of xylose catalyzed by the proteins from genes in (C). We found that all genes in the MRS are decomposed into 483 different small subgroups (Table S3), of which each includes correlogously associated genes yet not linked correlogously to any genes in the other subgroups. Nevertheless, the vast majority (99.5%) of anti-correlogy links bridge the gaps between different modules in the MRS, binding all of them as a single giant component (Figures 3B and 3C). Thus, the MRS results in clear modular structure among correlogs, with these modules interrelated through anti-correlogous relationships. Different genes in each correlog group are highly likely to perform biological tasks in a common functional category (Z = 43.16 for KEGG categories; see Methods). For every pair of possible functional categories, we also analyzed how likely genes of the functional categories in a pair would belong together to the same correlog groups (Methods). As expected, genes of the same KEGG functional categories tend to belong to the same correlog groups, but deviations from this behavior are also informative (Figure 4A and Table S4). For example, genes of functional category Folding, sorting and degradation and those of another category Metabolism of other amino acids significantly tend to be affiliated to the same correlog groups (P<10−4). The former category involves a number of molecular chaperones and RNA helicases, while the latter involves enzymes to synthesize glutathione and d-glutamate. Since glutathione serves as the major endogenous antioxidant and d-glutamate decorates bacterial cell walls, our results support the previous observation that the corresponding enzymes are likely to be actively in concert with chaperones and RNA helicases when subject to oxidative or cell-wall stress [35], [36]. Likewise, genes of functional category Xenobiotics biodegradation and metabolism and those of category Folding, sorting and degradation are highly likely to be together in the same correlog groups (Figure 4A and Table S4), which can be understood in the similar context of cellular stress response induced by xenobiotic compounds such as benzoate and bisphenol [37].
Figure 4

Characteristics of correlog groups in the MRS.

(A) Plotted in descending order is how likely genes in a given pair of functional categories belong to the same correlog groups (plotted for P<10−3). Data of intermediate values are omitted for visual clarity, and all of them can be found in Table S4. Blue is for the same functional categories in a pair, and red is for different functional categories in a pair. Functional categories are abbreviated as the followings. T: Membrane transport; M: Cell motility; C: Carbohydrate metabolism; A: Amino acid metabolism; G: Glycan biosynthesis and metabolism; K: Kinase and peptidase; S: Signal transduction; Oa: Metabolism of other amino acids; F: Folding, sorting and degradation; P: Biosynthesis of polyketides and terpenoids; Os: Biosynthesis of other secondary metabolites; X: Xenobiotics biodegradation and metabolism. (B) The probability density of the number of genes in each correlog group. (C–E) Clustering of genes from individual species and environmental samples. For each bacterial species, the number of genes mapped to the MRS (x axis) and the number of different correlog groups including the genes (y axis) are plotted in yellow. In the same way, for each archaeal species, plotted in red in (C), for each eukaryotic species, plotted in green in (D), and for each environmental sample, plotted in black in (E). In (C)–(E), for comparison with the null model, the number of different correlog groups including randomly mapped genes is plotted in blue with the error bars representing standard deviations. Although for some data points the error bars in blue look overlapped with the symbols in red and green, this is simply because of the large size of the symbols used to facilitate visual examination. Gray lines at the upper and lower sides in (C)–(E) denote the number of different correlog groups with the most sparsely-distributed mapped genes and the maximally-crowded mapped genes, respectively, while the latter allows for multiple configurations of gene mappings having standard deviations denoted by error bars. In other words, the gray lines set the absolute limits of the number of different correlog groups for each number of genes. Most data points in yellow, red, green, and black are located far below blue error bars, illustrating significant clustering of genes around correlog groups across different domains of life and environmental microbial communities.

Characteristics of correlog groups in the MRS.

(A) Plotted in descending order is how likely genes in a given pair of functional categories belong to the same correlog groups (plotted for P<10−3). Data of intermediate values are omitted for visual clarity, and all of them can be found in Table S4. Blue is for the same functional categories in a pair, and red is for different functional categories in a pair. Functional categories are abbreviated as the followings. T: Membrane transport; M: Cell motility; C: Carbohydrate metabolism; A: Amino acid metabolism; G: Glycan biosynthesis and metabolism; K: Kinase and peptidase; S: Signal transduction; Oa: Metabolism of other amino acids; F: Folding, sorting and degradation; P: Biosynthesis of polyketides and terpenoids; Os: Biosynthesis of other secondary metabolites; X: Xenobiotics biodegradation and metabolism. (B) The probability density of the number of genes in each correlog group. (C–E) Clustering of genes from individual species and environmental samples. For each bacterial species, the number of genes mapped to the MRS (x axis) and the number of different correlog groups including the genes (y axis) are plotted in yellow. In the same way, for each archaeal species, plotted in red in (C), for each eukaryotic species, plotted in green in (D), and for each environmental sample, plotted in black in (E). In (C)–(E), for comparison with the null model, the number of different correlog groups including randomly mapped genes is plotted in blue with the error bars representing standard deviations. Although for some data points the error bars in blue look overlapped with the symbols in red and green, this is simply because of the large size of the symbols used to facilitate visual examination. Gray lines at the upper and lower sides in (C)–(E) denote the number of different correlog groups with the most sparsely-distributed mapped genes and the maximally-crowded mapped genes, respectively, while the latter allows for multiple configurations of gene mappings having standard deviations denoted by error bars. In other words, the gray lines set the absolute limits of the number of different correlog groups for each number of genes. Most data points in yellow, red, green, and black are located far below blue error bars, illustrating significant clustering of genes around correlog groups across different domains of life and environmental microbial communities. We found that the number of genes in each correlog group approximately follows an exponential distribution (Figure 4B). We expect that each correlog group may serve as a toolbox for importing exogenous genes and functions into a cell, as xylF, xylA, and xylB above form a single correlog group themselves (Figures 3C and 3D). Furthermore, the prevalence of anti-correlogy links between correlog groups as shown in Figures 3B and 3C extends the concept of anti-correlogy from single genes to correlog groups and allows for evaluating the anti-correlogous associations between, rather than within, the correlog groups (Table S5 and Text S1). For example, a correlog group containing aldehyde-related dehydrogenases was anti-correlogously associated with another group containing perR, peroxide stress response regulator, and this anti-correlogy between the two groups might be involved in a problem that arises in controlling cellular redox state [38]. If a correlog group represents a repertoire of genes that tend to coexist in the same organisms, genes in individual organisms when mapped to MRS should be densely distributed around the correlog groups rather than distributed uniformly. This hypothesis can be straightforwardly checked by enumerating the number of different correlog groups that the genes are mapped to, and comparing this value with the number of different correlog groups harboring the same number of genes but mapped randomly. If the former is smaller than the latter, the genes can be regarded as more clustered around correlog groups than by chance. As would be expected, genes from each bacterial species show much more clustered behaviors than by chance (−13.20≤Z≤−3.68 for all bacteria; Figures 4C–4E and Methods). To the MRS, we can also map orthologs in each species from the other superkingdoms, archaea and eukaryotes. Interestingly, although the MRS in this study is based on bacterial data, orthologs from archaea and eukaryotes show somewhat weakened but still significantly clustered behaviors around the correlog groups, reflecting a significant conservation of gene associations across different domains of life [−5.97≤Z≤−2.64 for archaea except for one species with Z = −1.32 (Figure 4C) and −6.47≤Z≤−2.23 for eukaryotes except for five species with −1.93≤Z≤−1.39 (Figure 4D); see Methods]. Among these observations, less significantly clustered behaviors come from the species having small numbers of genes mapped to the MRS, and this result might be due to insufficient mappings or minimally-required diversity of genes for any biological systems. On the other hand, considering that correlogs tend to coexist in individual organisms, it might be challenging to examine the above clustering issues for environmental microbial communities [39], as each of the environmental samples typically contains a number of different species. Again, we mapped to the MRS the genes from twelve diverse environmental sources such as human and mouse guts, deep-sea whale fall carcasses, and uranium contaminated ground water [40], and found still significant clustering of those genes around the correlog groups (−9.26≤Z≤−2.10 except for one sample with less significant Z = −0.95; Figure 4E and Methods). Accordingly, this result encourages us to even conjecture the identities of undetected but existing genes in environmental samples based on those of detected genes by applying the knowledge of correlogous gene associations. It would also be interesting to identify correlog groups harboring cosmopolitan genes [41] in a given environment, as these groups can represent together environment-specific genetic contents rather than species-specific ones.

Discussion

Recent advances in understanding genome-scale intracellular networks have been enabled by the availability of high-throughput experimental techniques that permit network data to be collected on a scale far larger than previously possible [1], [2]. These networks help capture and quantify the functional interplay amongst genes that drives essentially all cellular phenotypes in the environment. Correlogy and anti-correlogy can be regarded as the natural outcome of such ubiquitous interactions between genes, molded through a long evolutionary process that tests beneficial effects of numerous gene combinations for a cell. The resulting ‘sociology’ of genes provides rich implications in rapidly growing fields such as function prediction of uncharacterized genes [27], gap-filling in genome-scale biochemical model reconstruction [42], and synthetic biology [12]. In synthetic biology, information on both correlogs and anti-correlogs may be useful to screen additional gene candidates to be expressed or silenced contingent on primary genes of interest, and such considerations should inform attempts to construct optimal recombinant strains. Intriguingly, the information on correlogy and anti-correlogy can be uncovered even when the underlying mechanisms of these gene associations are unknown. Finally, future extension of our work towards archaea and eukaryotes will allow comparative analysis of correlogy and anti-correlogy for different domains of life, offering a fascinating opportunity of understanding evolution of genetic associations.

Methods

All methods used in this study, including the methods of various data analysis, are provided in Text S1 with a detailed description. Below, we present an abbreviated version that describes the essence of our analysis.

Quantification of correlogy and anti-correlogy

We downloaded orthology data from the KEGG database [20], and surveyed the presence or absence of each ortholog across different bacterial species (see Table S1 for the full list of the bacterial species considered here). In order to capture functional interactions of genes reflected in their co-occurrence patterns across species, we take into account the genes (i.e., orthologs) not too lowly nor too highly prevalent across species; in the case of too lowly (highly) prevalent genes, there do not exist so many species with (without) the genes, making it hard to judge whether these few co-presences (co-absences) of the genes actually come from their functional interactions. In other words, without filtering, spurious correlations from non-functional origins may emerge, simply by vertical co-inheritance of genes or by chance. Specifically, if E denotes the number of species containing gene i and N denotes the total number of species, one can define X = min(E, N−E) for each gene i. The probability density of X approximately follows the power-law decay as long as X≥X = 80, and we chose the genes with X≥X to prevent spurious correlations that could occur at low X deviating from the power-law trend observed at large X. To evaluate direct gene associations while alleviating transitivity effects in charge of indirect correlations between genes, we applied the partial correlation method employed in graphical Gaussian models [26], of which superiority over many other methods was demonstrated in reverse engineering of transcriptional regulatory networks [25]. To implement this method, we start with calculating the Pearson correlation r for binary variables of presence and absence of genes i and j:where C is the number of species containing both genes i and j. Next, to reduce indirect correlations between genes i and j, we calculate the partial correlation w using r:where p is the (i, j)th component of an inverse matrix of r. However, in this study, the number of genes ( = 2085) is much larger than the number of species ( = 588), yielding an ill-conditioned problem for matrix r. To overcome this problem, we applied the shrinkage estimation derived by Schäfer and Strimmer [26]. Specifically, Schäfer and Strimmer obtained a regularized estimator of r combining analytic determination of shrinkage intensity from the Ledoit-Wolf theorem [43]. The following is the resultant estimator r that simply substitutes for r in the above calculation of w:where δ is the Kronecker delta symbol and λ is given by Here x = 1 if gene i is present in species k, otherwise, x = 0, <···> denotes the average over species k's, and σ is the standard deviation of x over species k's. As a result, the obtained w was used to quantify correlogy (w>0) or anti-correlogy (w<0) between genes i and j. For comparative analysis, the mutual information I of genes i and j between {x} and {x} across species k's [14] was also calculated.

Significance analysis of correlation between w (or r, I) and protein interaction

We calculated the average of w (w) from the pairs of physically-binding proteins in E. coli [27], and obtained its P value by generating the distribution of average w (w) from the same number of, but arbitrarily-mated pairs of the proteins as distant as the given shortest path length in the protein interaction network. The central limit theorem ensured that this null distribution converged well to the Gaussian distribution, providing the P value for how frequently w exceeds w. Similar analyses were also performed for r and I.

Evaluation of S and S

In order to characterize how tightly each gene i is correlogously associated to other genes, we calculated , where the summation was taken over all other gene j's satisfying w>0. In a similar way, we calculated for anti-correlogous couplings around gene i.

Significance analysis of correlation between S and phylum-level dispersion

Let n be the number of different phyla where genes are present. For genes with n<7 (Figure 2B), we obtained the slope of S against n by linear regression, and normalized it by multiplying . From surrogate data with randomly-permuted gene presences across species, we also generated an ensemble of such normalized slopes for n<7, and calculated the Z score of the actual value.

Characterization of the maximum relatedness subnetwork (MRS)

For any given weighted network, one can simplify its structure by constructing the MRS composed only of highly weighted edges [24]. Specifically, in the MRS of this study, each gene i points to only two genes j and j′ by different categories of edges that represent the most correlogous (max>0) and anti-correlogous (min<0) genes to gene i, respectively. We found that all genes in the MRS here are decomposed into 483 different small subgroups, of which each includes correlogously associated genes yet not linked correlogously to any genes in the other subgroups. These subgroups in the MRS were termed correlog groups.

Functional coherence of correlog groups in the MRS

For given correlog group g in the MRS and given functional category c of genes, we can calculate = Ñ/Ñ, where Ñ is the number of genes affiliated to both correlog group g and functional category c, and Ñ is the total number of genes affiliated at least to one functional category in correlog group g. Therefore, f represents the uniformity of gene functions in a correlog group. Majority (57.1%) of correlog groups with Ñ>1 were shown to have at least one functional category c satisfying f = 1 in each g. To calculate the corresponding Z score, we generated an ensemble of correlog groups with Ñ>1 by randomly exchanging genes of the same number of the affiliated functional categories. For a given pair of functional categories c1 and c2, we can also define their overlapping ratio (Figure 4A and Table S4) as:where i and j are indices of genes, a is 1 if gene i belongs to functional category c, otherwise 0, d is 1 if gene i belongs to correlog group g, otherwise 0, and H(x) is 1 if x>0, otherwise 0. Y 1, quantifies how likely genes in c1 and c2 belong to the same correlog groups (0≤Y 1,≤1). The corresponding P value was obtained by generating the null distribution in the same way as in the case of f above.

Effectiveness of correlog groups for different domains of life and environmental samples

For each species or environmental sample, we counted the number (n) of correlog groups harboring the genes mapped to the MRS. We also obtained the mean (η) and the standard deviation (σ) of such numbers of correlog groups when the same number of genes are randomly mapped to the MRS (Figure 4C–4E):where N is the number of genes mapped to the MRS, g is the index of each correlog group, N is the total number of genes in correlog group g, and is the total number of correlog groups in the MRS. Accordingly, we calculate the Z score of n [Z = (n−η)/σ]. Distinct regimes of protein interactions with different w's. In the left panel, plotted is the fraction of gene pairs belonging to the same operons among the gene pairs of given w, which encode physically binding (solid line) or non-binding (dashed line) protein pairs. In the central panel, plotted are the probability density of w for binding protein pairs (red) and that for arbitrary protein pairs (black), of which both were obtained by excluding the gene pairs belonging to the same operons. Considering only these non-operonic pairs, the right panel shows the averages and standard deviations of the Pearson correlation ρ's of the transcript profiles for every gene pair with w<0.045 or w>0.045, among binding protein pairs (red) and among arbitrary protein pairs (black). For more details, refer to Text S1. (TIF) Click here for additional data file. Overall strengths of correlogous (left panel) and anti-correlogous (right panel) associations between genes in given functional categories. Each grid is colored according to Ω 1, (left panel) or Ω 1, (right panel) following the scale bar on the rightmost side, where c1 and c2 denote the functional categories on the horizontal and vertical axes of the grid, and Ω 1, and Ω 1, correspond to overall strengths of correlogous and anti-correlogous associations between c1 and c2, respectively. For more details, refer to Text S1. (TIF) Click here for additional data file. For each gene, scatter plot between S and S (left panel) and that between S and w (right panel). For more details, refer to Text S1. (TIF) Click here for additional data file. List of bacterial species used for evaluating correlogy and anti-correlogy between genes. (PDF) Click here for additional data file. Genes of the largest or smallest S's and S's. (PDF) Click here for additional data file. Details of the MRS constructed in this study. Gene i in each row is arrowed to gene j (w>0) and gene j′ (w<0) in the MRS. The 1st through 3rd columns contain the annotation information of gene i, and the 5th through 8th columns contain information for genes j and j′ that receive incoming links from gene i. The annotation information of genes j and j′ is found in the rows in which the 1st column includes the same KEGG identifiers as those of genes j and j′. The 4th column contains the correlog group indices assigned to gene i. If gene i's belong to the same (different) correlog groups, they are assigned the same (different) indices in the 4th column. (PDF) Click here for additional data file. Pairs of functions highly likely to be assigned to the same correlog groups (P<10−3; see Methods). The 1st and 2nd columns contain the names of functions in each function pair, the 3rd column records the fraction of correlog groups having both functions in the pair among correlog groups having either of functions in the pair, and the 4th column is for the corresponding P values. (PDF) Click here for additional data file. Pairs of correlog groups associated via significant anti-correlogy with Z>10 (see Text S1). For each index of correlog groups, the corresponding genes can be found in Table S3. (PDF) Click here for additional data file. Supporting methods and results. (PDF) Click here for additional data file.
  42 in total

1.  KEGG: kyoto encyclopedia of genes and genomes.

Authors:  M Kanehisa; S Goto
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

Review 2.  Network biology: understanding the cell's functional organization.

Authors:  Albert-László Barabási; Zoltán N Oltvai
Journal:  Nat Rev Genet       Date:  2004-02       Impact factor: 53.242

3.  Integrating high-throughput and computational data elucidates bacterial networks.

Authors:  Markus W Covert; Eric M Knight; Jennifer L Reed; Markus J Herrgard; Bernhard O Palsson
Journal:  Nature       Date:  2004-05-06       Impact factor: 49.962

4.  Evidence for dynamically organized modularity in the yeast protein-protein interaction network.

Authors:  Jing-Dong J Han; Nicolas Bertin; Tong Hao; Debra S Goldberg; Gabriel F Berriz; Lan V Zhang; Denis Dupuy; Albertha J M Walhout; Michael E Cusick; Frederick P Roth; Marc Vidal
Journal:  Nature       Date:  2004-06-09       Impact factor: 49.962

5.  A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics.

Authors:  Juliane Schäfer; Korbinian Strimmer
Journal:  Stat Appl Genet Mol Biol       Date:  2005-11-14

6.  Comparing association network algorithms for reverse engineering of large-scale gene regulatory networks: synthetic versus real data.

Authors:  Nicola Soranzo; Ginestra Bianconi; Claudio Altafini
Journal:  Bioinformatics       Date:  2007-05-07       Impact factor: 6.937

7.  Modular epistasis in yeast metabolism.

Authors:  Daniel Segrè; Alexander Deluna; George M Church; Roy Kishony
Journal:  Nat Genet       Date:  2004-12-12       Impact factor: 38.330

8.  eSGA: E. coli synthetic genetic array analysis.

Authors:  Gareth Butland; Mohan Babu; J Javier Díaz-Mejía; Fedyshyn Bohdana; Sadhna Phanse; Barbara Gold; Wenhong Yang; Joyce Li; Alla G Gagarinova; Oxana Pogoutse; Hirotada Mori; Barry L Wanner; Henry Lo; Jas Wasniewski; Constantine Christopolous; Mehrab Ali; Pascal Venn; Anahita Safavi-Naini; Natalie Sourour; Simone Caron; Ja-Yeon Choi; Ludovic Laigle; Anaies Nazarians-Armavil; Avnish Deshpande; Sarah Joe; Kirill A Datsenko; Natsuko Yamamoto; Brenda J Andrews; Charles Boone; Huiming Ding; Bilal Sheikh; Gabriel Moreno-Hagelseib; Jack F Greenblatt; Andrew Emili
Journal:  Nat Methods       Date:  2008-09       Impact factor: 28.547

Review 9.  Applications of genome-scale metabolic reconstructions.

Authors:  Matthew A Oberhardt; Bernhard Ø Palsson; Jason A Papin
Journal:  Mol Syst Biol       Date:  2009-11-03       Impact factor: 11.429

10.  The impact of cellular networks on disease comorbidity.

Authors:  Juyong Park; Deok-Sun Lee; Nicholas A Christakis; Albert-László Barabási
Journal:  Mol Syst Biol       Date:  2009-04-07       Impact factor: 11.429

View more
  19 in total

1.  Phylogeny-corrected identification of microbial gene families relevant to human gut colonization.

Authors:  Patrick H Bradley; Stephen Nayfach; Katherine S Pollard
Journal:  PLoS Comput Biol       Date:  2018-08-09       Impact factor: 4.475

2.  A Two-Enzyme Adaptive Unit within Bacterial Folate Metabolism.

Authors:  Andrew F Schober; Andrew D Mathis; Christine Ingle; Junyoung O Park; Li Chen; Joshua D Rabinowitz; Ivan Junier; Olivier Rivoire; Kimberly A Reynolds
Journal:  Cell Rep       Date:  2019-06-11       Impact factor: 9.423

3.  Phage defence by deaminase-mediated depletion of deoxynucleotides in bacteria.

Authors:  Brian Y Hsueh; Geoffrey B Severin; Clinton A Elg; Evan J Waldron; Abhiruchi Kant; Alex J Wessel; John A Dover; Christopher R Rhoades; Benjamin J Ridenhour; Kristin N Parent; Matthew B Neiditch; Janani Ravi; Eva M Top; Christopher M Waters
Journal:  Nat Microbiol       Date:  2022-07-11       Impact factor: 30.964

4.  Bacterial regulon modeling and prediction based on systematic cis regulatory motif analyses.

Authors:  Bingqiang Liu; Chuan Zhou; Guojun Li; Hanyuan Zhang; Erliang Zeng; Qi Liu; Qin Ma
Journal:  Sci Rep       Date:  2016-03-15       Impact factor: 4.379

5.  A set of genes conserved in sequence and expression traces back the establishment of multicellularity in social amoebae.

Authors:  Christina Schilde; Hajara M Lawal; Angelika A Noegel; Ludwig Eichinger; Pauline Schaap; Gernot Glöckner
Journal:  BMC Genomics       Date:  2016-11-04       Impact factor: 3.969

6.  Revised computational metagenomic processing uncovers hidden and biologically meaningful functional variation in the human microbiome.

Authors:  Ohad Manor; Elhanan Borenstein
Journal:  Microbiome       Date:  2017-02-08       Impact factor: 14.650

7.  Taxa-function robustness in microbial communities.

Authors:  Alexander Eng; Elhanan Borenstein
Journal:  Microbiome       Date:  2018-03-02       Impact factor: 14.650

8.  The condition-dependent transcriptional landscape of Burkholderia pseudomallei.

Authors:  Wen Fong Ooi; Catherine Ong; Tannistha Nandi; Jason F Kreisberg; Hui Hoon Chua; Guangwen Sun; Yahua Chen; Claudia Mueller; Laura Conejero; Majid Eshaghi; Roy Moh Lik Ang; Jianhua Liu; Bruno W Sobral; Sunee Korbsrisate; Yunn Hwen Gan; Richard W Titball; Gregory J Bancroft; Eric Valade; Patrick Tan
Journal:  PLoS Genet       Date:  2013-09-12       Impact factor: 5.917

9.  Analysis of genes contributing to plant-beneficial functions in Plant Growth-Promoting Rhizobacteria and related Proteobacteria.

Authors:  Maxime Bruto; Claire Prigent-Combaret; Daniel Muller; Yvan Moënne-Loccoz
Journal:  Sci Rep       Date:  2014-09-02       Impact factor: 4.379

10.  Kernel Architecture of the Genetic Circuitry of the Arabidopsis Circadian System.

Authors:  Mathias Foo; David E Somers; Pan-Jun Kim
Journal:  PLoS Comput Biol       Date:  2016-02-01       Impact factor: 4.475

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.