Detecting gene subnetworks under selection in biological pathways.

Alexandre Gouy^1,2, Joséphine T Daub³, Laurent Excoffier^1,2.

Abstract

Advances in high throughput sequencing technologies have created a gap between data production and functional data analysis. Indeed, phenotypes result from interactions between numerous genes, but traditional methods treat loci independently, missing important knowledge brought by network-level emerging properties. Therefore, detecting selection acting on multiple genes affecting the evolution of complex traits remains challenging. In this context, gene network analysis provides a powerful framework to study the evolution of adaptive traits and facilitates the interpretation of genome-wide data. We developed a method to analyse gene networks that is suitable to evidence polygenic selection. The general idea is to search biological pathways for subnetworks of genes that directly interact with each other and that present unusual evolutionary features. Subnetwork search is a typical combinatorial optimization problem that we solve using a simulated annealing approach. We have applied our methodology to find signals of adaptation to high-altitude in human populations. We show that this adaptation has a clear polygenic basis and is influenced by many genetic components. Our approach, implemented in the R package signet, improves on gene-level classical tests for selection by identifying both new candidate genes and new biological processes involved in adaptation to altitude.

Year: 2017 PMID: 28934485 PMCID: PMC5766194 DOI: 10.1093/nar/gkx626

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Understanding the genetic basis of adaptation remains a central theme of evolutionary biology. Adaptation is typically viewed as involving selective sweeps that drive beneficial alleles from low to high frequencies in a population, lowering genetic diversity and increasing linkage disequilibrium near the selected region (1–3). Numerous statistical tests have been developed to detect selection from genomic data based on a simple selective sweep model (reviewed in (4)). Therefore, most work in humans and other species has focused on identifying signals of strong selection at individual loci (5). These methods have been quite successful in identifying human loci involved in several adaptations such as diet, altitude, disease resistance, and pigmentation (reviewed in (6)). However, examples of adaptation due to a selective sweep at a single locus remain relatively rare in human populations. Therefore, some authors have argued that adaptation events could occur by the evolution of polygenic traits rather than via the fixation of single beneficial mutations (7–9). Recent Genome-Wide Association Studies (GWAS) in various model organisms have confirmed that variation in many important traits is controlled by a large number of loci scattered throughout the genome, e.g. human height (10,11). Selection acting additively on this kind of trait could therefore lead to small shifts in allele frequencies (8). This verbal model has been studied analytically, showing that in some cases, polygenic selection may indeed lead to subtle shifts in allele frequencies (12–14). However, these small allele frequency changes may remain below the detection limit of most outlier detection methods (15). Therefore, the generality of conclusions drawn from significant tests can be seriously challenged because phenotypic traits exhibiting clear-cut molecular signatures of selection may represent a biased subset of all adaptive traits (16).
Another caveat of classical genome scans for selection is that lists of candidate genes are sometimes difficult to connect to a particular adaptive mechanism, since SNP-level results are unlikely to reveal complex mechanisms of adaptation, given the lack of signal of small-effect alleles. It seems therefore necessary to consider alternative approaches to study the genetic basis of adaptation of complex traits. Current approaches to detect selection acting on polygenic traits rely mostly on quantitative genetics models. Classical quantitative genetics approaches are not based on genetic data, but on an explicit description of continuous phenotypes (e.g. height, body mass index, fertility, etc.). These methods have strong theoretical foundations, and allow one to disentangle the genetic from the environmental variance by taking into account the heritability of the traits, and therefore to detect shifts in the distribution of the phenotype under selection (17). But these methods do not allow one to identify the genetic basis of adaptation, and other approaches must be considered to associate genetic data with quantitative traits responding to selection. Correlative approaches have emerged where associations between a genotype and various environmental variables are tested (e.g. (18–21)). Another test for selection acting on quantitative trait loci (QTL) has been developed by Orr (22). The idea is to test if some populations have accumulated more alleles than expected that change the value of a trait in a given direction. This approach has then been extended to expression QTLs (e.g. in (23)). Finally, recent approaches have tried to estimate selection coefficients from GWAS data (9,24,25), but all these methods need some phenotypic measures of the tested individuals or associated environmental data, which can sometimes be difficult to obtain.
In contrast to a gene-centric approach, some studies have considered testing if a set of genes as a whole is yielding signals of selection (7,26). Indeed, different genes within pathways (i.e. molecular networks leading to a given biological function) may interact to produce a given phenotype (27,28), and therefore be under the same selective pressure. Finding sets of outlier interacting genes can be achieved using gene-set enrichment methods (e.g. (29,30)). The idea is to assign a score (i.e. proxy for selection) to each gene within a biological pathway (i.e. gene-set) and to test if the distribution of scores within the pathway is significantly shifted towards extreme values (7). This approach has successfully identified candidate pathways involved in various human adaptations, such as response to pathogens (7), or adaptation to altitude (26). However, this gene-set enrichment approach mainly identifies pathways in which all members show a shift in the distribution of a given tested statistic. It might thus be underpowered to find more subtle signals, where only a subset of genes is under selection in a large pathway, which is a more likely situation than assuming that all the genes in a pathway have responded to selection. To address this problem, network analysis can provide new insights into the genetic basis of adaptation. In the last few years, network-based approaches have spread into a large number of research areas, and were successfully used to solve a wide range of biological problems; e.g. gene expression studies (31,32), GWAS (28) or evolutionary biology (33,34). Here, we present a new network-based method to detect polygenic selection in natural populations. The general idea is to search for subnetworks of interacting genes within biological pathways that present unusual features. This search is a typical combinatorial optimization problem that can be solved using a heuristic approach like simulated annealing (31,35).
We implemented such an algorithm to search for high-scoring subnetworks of genes in biological pathways, and we developed a testing procedure that explicitly takes into account the optimization process involved in this search. After studying the sensitivity and precision of our method with simulated data, we reanalysed data from a previous study looking for convergent adaptation to altitude in Tibetans and Andeans (26). As compared to the original study, we discover new genes and biochemical functions potentially related to adaptation in these human populations. Our method can thus complement classical genome scans by providing functional information and discovering new genes with weaker effects that are involved in complex selective processes. Finally, we discuss the limits and potential improvements, as well as other possible applications of our methodology.

MATERIALS AND METHODS

Pathway databases and conversion to gene networks

We considered biological pathways as gene networks. More formally, we define a gene network as a graph G(V,E), where V is a set of nodes (i.e. genes), and E is a set of edges (i.e. interactions between genes). In this study we used three signalling and metabolic pathway databases that are considered as references in systems biology: (i) KEGG, the Kyoto Encyclopaedia of Genes and Genomes Pathway database (36); (ii) NCI, the National Cancer Institute / Nature Pathway Interaction Database (37); and (iii) Reactome (38,39). We then used the R/Bioconductor graphite package to convert biological pathways into graphs of interacting genes (see (40) for more details on this procedure).

Computation of summary statistics on gene networks

To characterize the structure of networks and check for potential differences between databases, we generated the distributions of three standard summary statistics for each of the three databases. We thus computed for each network: (i) the number of nodes, (ii) the number of edges and (iii) the graph density. The graph density d is a measure of connectivity between the nodes of the network, and it is defined as the number of edges in a set E compared to the maximum number of possible edges between nodes in a set V, therefore d = 2 |E| / (|V| (|V| − 1)), where |X| represents the number of members of a set X. We also analysed the overlap between pathway databases by computing the number of genes they share. Finally, we quantified the redundancy between pathways within a database by computing Jaccard's similarity index. For a pair of networks A and B with sets of nodes V_A and V_B, Jaccard's index is defined as J_AB = |V_A ∩ V_B| / |V_A ∪ V_B|, i.e. the number of shared genes divided by the number of genes present in at least one of the two networks.
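As an illustration of the two summary statistics defined above, the following minimal Python sketch computes the graph density and Jaccard's index on toy inputs. The function names and the example gene sets are hypothetical and are not part of the signet package.

```python
def graph_density(n_nodes, n_edges):
    """Density d = 2|E| / (|V| (|V| - 1)) of an undirected simple graph."""
    if n_nodes < 2:
        return 0.0  # density is undefined for fewer than two nodes; 0 by convention here
    return 2.0 * n_edges / (n_nodes * (n_nodes - 1))


def jaccard(nodes_a, nodes_b):
    """Jaccard's index |A ∩ B| / |A ∪ B| between two sets of genes."""
    a, b = set(nodes_a), set(nodes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy network with 4 genes and 3 interactions
print(graph_density(4, 3))  # 0.5

# Two toy pathways sharing 2 genes out of 4 distinct genes
print(jaccard({"EPAS1", "ARNT", "VEGFA"}, {"EPAS1", "ARNT", "IL6"}))  # 0.5
```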

Workflow to detect outlier subnetworks

As the detection of outlier subnetworks includes several distinct steps, we describe here our analysis pipeline. The goal of our approach is to search within each gene network the subnetwork with the largest signal of interest (e.g. evidence of selection) using a simulated annealing approach (35). Our algorithm is globally similar to that used by Ideker et al. (31), but our method differs in two important ways, as described below. First, whereas Ideker et al.’s method aimed at finding the highest-scoring subset of nodes, we aim here at finding the highest-scoring subnetwork (i.e. a subset of genes that are directly connected by edges). Second, we consider a statistical testing procedure that explicitly considers the optimization procedure when computing the P-value of a given subnetwork. Indeed, the score of a given subnetwork identified by the simulated annealing algorithm cannot be compared to that of a random subnetwork, as simulated annealing would identify a high scoring subnetwork even in absence of any true signal (41).

Gene and subnetwork scores

We begin our testing procedure by assigning a score to each of the genes (nodes) in our network. In population genetics applications looking for subsets of selected genes, this score might be a measure of differentiation between populations (e.g. FST), the result of a selection test, or the difference in some measure between cases and controls. If this score is available for different SNPs in a given gene, we need to summarize their scores in some way, as our method assumes that each gene has a single score. For instance, the SNP with maximum score can be selected to represent a gene, or the average of the SNP-specific scores can be computed over all SNPs assigned to a gene. We then use an aggregate score for a subnetwork of size k following (31) as s = (∑_{i=1}^{k} x_i) / √k, where x_i is the score of the i-th node (gene).
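The two scoring steps above can be sketched as follows. This is a minimal Python illustration with hypothetical function names and made-up scores, not the signet code.

```python
import math


def gene_score(snp_scores, how="max"):
    """Summarize the SNP-level scores of one gene into a single gene score."""
    if how == "max":
        return max(snp_scores)
    return sum(snp_scores) / len(snp_scores)  # average over the gene's SNPs


def subnetwork_score(node_scores):
    """Aggregate score s = sum(x_i) / sqrt(k) of a subnetwork of k genes."""
    k = len(node_scores)
    return sum(node_scores) / math.sqrt(k)


print(gene_score([0.1, 0.9, 0.4]))             # 0.9
print(subnetwork_score([1.0, 1.0, 1.0, 1.0]))  # 2.0
```

Dividing by √k rather than by k means that, for independent gene scores with unit variance, the aggregate score keeps unit variance whatever the subnetwork size.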

Subnetwork score normalization

We then normalize the scores of subnetworks so as to be able to compare subnetworks of different sizes. Indeed, we expect to observe less variance in subnetwork aggregate scores in large than in small networks. The score of a given subnetwork of size k is thus normalized as z = (s − μ_k) / σ_k, where μ_k and σ_k are the mean and standard deviation of the score of a subnetwork of size k, computed empirically over 10 000 random subnetworks of size k, obtained for each database separately. The means and standard deviations of subnetworks of sizes kmin to kmax are computed once and stored in a lookup table. Random subnetworks of size k are obtained by (i) randomly selecting a network in the database with a probability depending on the network size, (ii) randomly selecting a gene from this network as an initial member of the subnetwork and (iii) iteratively adding k − 1 other randomly chosen genes that are connected to the growing subnetwork.
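A minimal sketch of this normalization is given below. It is a simplified assumption-laden illustration: it samples within a single toy network, so step (i) (choosing a network with a size-dependent probability) is omitted, and all function names and scores are hypothetical.

```python
import math
import random
import statistics


def random_connected_subnetwork(adj, k, rng):
    """Grow a connected set of k nodes from a random seed node (steps ii and iii)."""
    sub = {rng.choice(sorted(adj))}
    while len(sub) < k:
        # Nodes outside the subnetwork that touch it by at least one edge
        frontier = sorted({n for v in sub for n in adj[v]} - sub)
        if not frontier:
            raise ValueError("connected component smaller than k")
        sub.add(rng.choice(frontier))
    return sub


def null_moments(adj, scores, k, n_samples, rng):
    """Empirical mean and SD of the aggregate score of random subnetworks of size k."""
    values = []
    for _ in range(n_samples):
        sub = random_connected_subnetwork(adj, k, rng)
        values.append(sum(scores[g] for g in sub) / math.sqrt(k))
    return statistics.mean(values), statistics.stdev(values)


def normalize(s, mu, sigma):
    """Normalized score z = (s - mu) / sigma."""
    return (s - mu) / sigma


# Toy path network 0-1-2-3-4 with made-up gene scores
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = {0: 0.2, 1: 1.5, 2: -0.3, 3: 0.8, 4: -1.1}
rng = random.Random(1)
mu, sigma = null_moments(adj, scores, k=3, n_samples=1000, rng=rng)
s = sum(scores[g] for g in (1, 2, 3)) / math.sqrt(3)
print(normalize(s, mu, sigma))
```

In a full analysis, the pair (μ_k, σ_k) would be computed once per size k and stored in a lookup table, as described above.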

Searching for optimal subnetworks with simulated annealing

We have used a simulated annealing algorithm to detect the Highest Scoring Subnetwork (HSS) of each gene network. The general idea is to start with a random subnetwork, and modify it progressively by adding or removing one node at a time until we reach a subnetwork with the highest possible normalized score. The algorithm takes as initial parameters the number of iterations N to perform and the annealing parameter α, which determines a temperature function T(α) that decreases geometrically over time. In more detail, our search algorithm is as follows:

1. Select a starting subnetwork of arbitrary size kmin, defined at random.
2. Calculate its normalized score z_i.
3. Modify the current subnetwork. First, select a node at random from the following list: (i) nodes not belonging to the current subnetwork, but that are connected to it by a single edge, (ii) terminal nodes of the current subnetwork, (iii) internal nodes of the current subnetwork which are not articulation points (i.e. whose removal will not create two disjoint subnetworks). If the selected node is not part of the current subnetwork then add it, else remove it.
4. Calculate the new subnetwork's normalized score z_{i+1}.
5. Accept the new subnetwork with probability min(1, p), where the annealing probability p = exp([z_{i+1} − z_i] / T_{i+1}), and T_{i+1} is the annealing temperature for iteration i+1. This typical simulated annealing equation means that a new subnetwork is always accepted if its normalized score is larger than that of the previous subnetwork, and that less optimal subnetworks become more and more difficult to accept as the algorithm iterates.
6. Repeat steps 3–5 above for a given predefined number of iterations N.
7. Record the resulting subnetwork and its score.

This algorithm is expected to find the global optimum for a sufficient number of iterations (42), but as its performance could vary between networks, we have run it five times for each pathway, and recorded the subnetwork with the highest score.
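The search described in this section can be sketched in Python as follows. This is a simplified illustration rather than the signet implementation: the scoring function is supplied by the caller (it could return the normalized score z), and the initial temperature t0, the cooling factor alpha and all function names are assumptions made for the example.

```python
import math
import random


def is_connected(nodes, adj):
    """True if the induced subgraph on `nodes` is non-empty and connected."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for n in adj[v]:
            if n in nodes and n not in seen:
                seen.add(n)
                stack.append(n)
    return seen == nodes


def anneal(adj, score_fn, start, n_iter=5000, t0=1.0, alpha=0.999, k_min=1, seed=0):
    """Search for a high-scoring connected subnetwork by simulated annealing."""
    rng = random.Random(seed)
    current = set(start)
    z = score_fn(current)
    best, best_z = set(current), z
    temp = t0
    for _ in range(n_iter):
        temp *= alpha  # geometric cooling
        # (i) outside nodes connected to the current subnetwork
        boundary = {n for v in current for n in adj[v]} - current
        # (ii) + (iii) nodes whose removal keeps the subnetwork connected
        removable = [v for v in current
                     if len(current) > k_min and is_connected(current - {v}, adj)]
        candidates = sorted(boundary) + sorted(removable)
        if not candidates:
            break
        node = rng.choice(candidates)
        proposal = current ^ {node}  # add the node if absent, remove it if present
        z_new = score_fn(proposal)
        # Always accept improvements; accept worse moves with probability exp(dz / T)
        if z_new >= z or rng.random() < math.exp((z_new - z) / temp):
            current, z = proposal, z_new
            if z > best_z:
                best, best_z = set(current), z
    return best, best_z


# Toy path network with three high-scoring genes (made-up scores)
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
scores = {0: 3.0, 1: 3.0, 2: 3.0, 3: -2.0, 4: -2.0, 5: -2.0}


def aggregate(sub):
    return sum(scores[g] for g in sub) / math.sqrt(len(sub))


best, best_z = anneal(adj, aggregate, start={4})
print(sorted(best), best_z)
```

Unlike the description above, this sketch returns the best subnetwork visited during the run rather than the final state, which is a common practical variant.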

Statistical testing procedure

To test if the score of the estimated HSS is significantly larger than what would be expected by chance, we need to generate the null distribution of HSS for subnetworks of a given size. To do this, we cannot simply randomly sample subnetworks and compute their scores in the original dataset, as we need to take into account the fact that the optimization procedure will bias the subnetwork scores towards high values. To take this effect into account, we have generated a null distribution of optimised scores. To do this, we first permute gene scores across all networks of a given database. Then a network is randomly chosen with a probability proportional to its size, and the optimization algorithm is applied to obtain the HSS on the permuted dataset. The score of the resulting HSS is finally recorded. This procedure is repeated N times to generate the null distribution. The empirical P-value of a given observed HSS is then obtained as the proportion of random HSS of similar size that have a score larger than or equal to the observed HSS score (one-sided test). As many subnetworks are tested with our procedure, we have corrected the inferred P-values for multiple testing by computing q-values, which are false discovery rates (FDRs) that would be computed if the observed P-value was used as a threshold to declare significance. To do this, we used the FDR method (43) implemented in the R package qvalue.
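The empirical P-value defined above can be sketched as follows. For the multiple-testing step, the example uses the Benjamini-Hochberg adjustment as a simple stand-in; it is not the q-value method (43) used in the paper, and the function names and numbers are hypothetical.

```python
def empirical_p_value(observed, null_scores):
    """Proportion of null HSS scores larger than or equal to the observed score."""
    return sum(s >= observed for s in null_scores) / len(null_scores)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values (a stand-in for q-values)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up null scores from permuted data and one observed HSS score
print(empirical_p_value(2.5, [1.0, 2.0, 3.0, 4.0]))  # 0.5
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

In the procedure described above, each value in `null_scores` would itself come from a full simulated annealing run on permuted gene scores, which is what makes the test account for the optimization step.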

Modified simulated annealing for large networks

We have adapted our algorithm, which was initially developed for small networks, to analyse large networks by introducing several modifications. First, we now set the probability of adding or removing a gene to 0.5, instead of letting it depend on the structure of the network (see algorithm above). This modification allows one to remove genes from the subnetwork even when the number of potential neighbours (from the whole network) is large. Second, we now authorize the removal of articulation nodes (nodes that split a subnetwork into disjoint subnetworks if removed): when an articulation node is selected, we list all the resulting disconnected sub-networks and compute their scores. We keep the highest-scoring subnetwork, and discard the others, as a proposal. These two modifications allow one to explore a large network more easily, and prevent the subnetwork from growing continuously. Finally, we add a ‘quenching’ step during which the temperature is set to 0 for 1000 iterations. During these steps, the probability (P) of adding a gene to the current sub-network is equal to its number of direct neighbours divided by its size, and the probability of removing a gene (including articulation points) is 1 – P.
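The second modification (removal of an articulation node) can be illustrated with the minimal sketch below. It assumes the raw aggregate score is used to rank the resulting pieces; function names and scores are hypothetical.

```python
import math


def components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    remaining = set(nodes)
    comps = []
    while remaining:
        start = remaining.pop()
        comp, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for n in adj[v]:
                if n in remaining:
                    remaining.remove(n)
                    comp.add(n)
                    stack.append(n)
        comps.append(comp)
    return comps


def remove_and_keep_best(sub, node, adj, scores):
    """Remove `node` from `sub` and keep the highest-scoring remaining piece."""
    def aggregate(comp):
        return sum(scores[g] for g in comp) / math.sqrt(len(comp))
    # Assumes `sub` contains at least one node other than `node`
    return max(components(set(sub) - {node}, adj), key=aggregate)


# Toy path network 0-1-2-3-4: node 2 is an articulation point
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = {0: 1.0, 1: 1.0, 2: 0.0, 3: 5.0, 4: 5.0}
print(sorted(remove_and_keep_best({0, 1, 2, 3, 4}, 2, adj, scores)))  # [3, 4]
```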

Pipeline implementation

Our analysis pipeline has been implemented in R, and graphical representations of the networks and HSS were made using the software Cytoscape (44,45), called from R with the Bioconductor package RCytoscape (46).

Test of the method on simulated data

As simulated annealing is an approximate method, we studied its performance using a simulation-based approach. We simulated pseudo-observed data sets by building a random network of size N using a random edge model, i.e. where an edge is drawn with a given probability p. Then, a connected subnetwork of size k is randomly sampled within the network. The node scores from the subnetwork are drawn from a normal distribution N(μHSS,1), where μHSS is the average score of this subnetwork. The scores of the other nodes of the network are drawn from a standard normal distribution N(0,1). We then apply our simulated annealing algorithm to find the highest-scoring subnetwork using i iterations. Therefore, the outcome of our search depends on five parameters: the network size N, the HSS size k and its average expected score μHSS, the network connectivity p and the number of iterations i. In order to characterize the accuracy of our network search and to better understand which parameters have an impact on our estimation, we computed, for each simulation, the number of true positives (TP, the number of nodes from the true HSS that are correctly identified), the number of true negatives (TN, the number of nodes that are not in the HSS and that are not identified), false positives (FP, the number of nodes wrongly identified as part of the true HSS) and false negatives (FN, the number of nodes from the true HSS that have not been identified). We then computed two measures of performance: precision or positive predictive value: PPV = TP / (TP + FP); and sensitivity or true positive rate: TPR = TP / (TP + FN). To assess the impact of our five parameters on the precision and on the sensitivity of our estimation, we used a Generalised Linear Model (GLM) where the response variables are the counts of TP and FP for precision, and the counts of TP and FN for sensitivity.
The predictor variables are the five above-mentioned parameters, and the error follows a binomial distribution where a TP is considered as a ‘success’ and a FP or FN as a ‘failure’. In this way, GLM predictions correspond to model-based estimates of precision (TP successes among (TP + FP) trials) and of sensitivity (TP successes among (TP + FN) trials), which takes into account the joint influence of several variables on these measures. To test the performance of our significance testing procedure that explicitly takes the optimisation process into account, we computed P-values using the null distribution obtained from 10,000 runs of simulated annealing on data generated with μHSS = 0 (i.e. the null hypothesis).
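The two performance measures used in this simulation study can be computed from the true and estimated HSS as in the following minimal sketch (hypothetical function name and node labels).

```python
def precision_sensitivity(true_hss, estimated_hss):
    """Return (PPV, TPR) comparing an estimated HSS with the simulated true HSS."""
    true_hss, estimated_hss = set(true_hss), set(estimated_hss)
    tp = len(true_hss & estimated_hss)   # true HSS nodes that were recovered
    fp = len(estimated_hss - true_hss)   # nodes wrongly reported as HSS members
    fn = len(true_hss - estimated_hss)   # true HSS nodes that were missed
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    return ppv, tpr


# True HSS of 4 nodes; the search recovers 2 of them plus 1 wrong node
ppv, tpr = precision_sensitivity({"a", "b", "c", "d"}, {"c", "d", "e"})
print(ppv, tpr)  # 2/3 and 0.5
```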

Application to real data: detection of convergent adaptation to altitude in humans

We analysed a dataset published by Foll et al. (26) on convergent adaptation to altitude in Tibetans and Andeans. This data set consists of 632 344 SNPs genotyped in four populations (47,48): two populations living at high altitude in the Andes (n = 49 individuals) and in Tibet (n = 49), as well as two lowland related populations from Central America (n = 39) and East Asia (n = 90). For each SNP, a probability of convergent adaptation has been computed under a hierarchical Bayesian model (26). To get a unique score per gene, as required in our methodology, we used the P-value of the highest scoring SNP mapped within a gene or less than 50 kb away. We applied our methodology to detect subnetworks under selection on this dataset. The three pathway databases were analyzed separately (i.e. every step of the workflow has been done independently for each database). Pathways for which the largest connected subnetwork size was less than kmin = 5 nodes were removed from the analysis, since we wanted to avoid focusing on small subnetworks. Aggregate subnetwork score distributions have been generated by sampling 10,000 random subnetworks for each possible size k. The HSS search algorithm has been applied to every pathway with N = 10 000 simulated annealing iterations. The P-value of the obtained HSSs was inferred from the distribution of scores of 10 000 random HSS generated under the null hypothesis (i.e. permuted data).
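The gene-level scoring rule used here (highest-scoring SNP within a gene or less than 50 kb away) can be sketched as follows. The gene names, coordinates and SNP scores are invented for illustration, and the function name is hypothetical.

```python
def gene_scores_from_snps(genes, snps, flank=50_000):
    """Assign each gene the maximum score among SNPs within the gene +/- flank bp.

    genes: {name: (chrom, start, end)}; snps: list of (chrom, pos, score).
    Genes without any SNP in their window are left out of the result.
    """
    result = {}
    for name, (chrom, start, end) in genes.items():
        hits = [score for c, pos, score in snps
                if c == chrom and start - flank <= pos <= end + flank]
        if hits:
            result[name] = max(hits)
    return result


# Made-up coordinates: geneA has three SNPs in its window, geneB has none
genes = {"geneA": ("chr1", 100_000, 120_000), "geneB": ("chr1", 500_000, 510_000)}
snps = [("chr1", 60_000, 0.2), ("chr1", 110_000, 0.7),
        ("chr1", 165_000, 0.95), ("chr1", 700_000, 0.9)]
print(gene_scores_from_snps(genes, snps))  # {'geneA': 0.95}
```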

RESULTS

We first studied the performance of our signet approach in terms of precision (i.e. the fraction of selected genes in the estimated highest-scoring subnetworks (HSS)) and sensitivity (i.e. the proportion of selected genes that are identified as such) by analysing pseudo-observed data. We generated random networks and HSS based on five parameters (see Material and Methods). We ran signet on the simulated data and compared the estimated HSS to the true HSS. Using logistic regressions, we show that out of the five parameters tested, four have a significant impact on both the precision and sensitivity of the method (Table 1). Most of the model deviance is explained by the mean score of the selected genes, the network size, and the subnetwork size. As expected, precision and sensitivity increase with μHSS and are both larger than 95% when μHSS is >3 (Figure 1A). Note that μHSS corresponds to the true subnetwork z-score, with z being expressed in units of standard deviation (SD). Thus, our results imply that the HSS score must be more than three SD away from the mean z-score of a non-selected network of the same size to properly capture the signal in a dataset. Furthermore, even if network (N) and subnetwork (k) sizes influence our ability to correctly identify HSS, N and k have a negligible impact on the precision of our estimations when the true subnetwork score is sufficiently large. Indeed, in this case precision remains high for a broad range of N and k values (Figure 1B and C). Even though one would have thought that the number of iterations of the simulated annealing algorithm was an important parameter for the success of the algorithm, it has a limited impact and 5000 iterations appear enough to achieve high precision (Table 1). Finally, we find that network density has no real influence on the performance of our method.

Table 1.

Estimates of the effects of the five parameters on the precision and sensitivity obtained under a logistic regression framework. For each parameter, the coefficient, P-value and the percentage of explained total deviance (%TD) are indicated

Parameter	Precision			Sensitivity
	Estimate	P-value	%TD	Estimate	P-value	%TD
N ¹	-0.025	<2×10^-16	17.5	4.3×10^-3	<2×10^-16	<1
k ²	0.14	<2×10^-16	21.3	-0.03	1.4×10^-15	<1
μHSS ³	0.75	<2×10^-16	45.8	1.29	<2×10^-16	66
d ⁴	-0.029	0.29	<1	5×10^-2	0.31	<1
i ⁵	-1.8×10^-6	2×10^-5	<1	1.5×10^-6	0.05	<1

¹ Network size.
² HSS size.
³ HSS mean score.
⁴ Network density.
⁵ Number of iterations.

Figure 1.

GLM-based estimates of the precision (orange) and sensitivity (blue) of the estimation, as a function of μHSS (A), network size (B) and subnetwork size (C). The horizontal dashed lines indicate a 0.95 threshold.

Then, in order to verify that our statistical testing procedure behaves properly, we computed the P-value distribution under the null hypothesis of μHSS = 0. In that case, P-values do not depart significantly from a uniform distribution (Kolmogorov–Smirnov test, D = 0.03, P = 0.76; Supplementary Figure S1), which is the behavior expected when the null hypothesis is true.

We have performed a runtime analysis of signet on simulated networks (Supplementary Table S1) to examine the relation between computation time and network properties. As expected, computation time increases linearly with the number of iterations. However, it increases exponentially with network size, and it is not affected much by network density. A typical analysis of the KEGG database (225 pathways) takes about 6 min on a desktop computer (processor Intel Xeon E5-1650 3.50 GHz), and one needs about 30 min to generate a null distribution of size 1000, so as to infer sub-network P-values.
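The uniformity check on null P-values reported in this section relies on the Kolmogorov–Smirnov statistic D, which can be computed as in the sketch below. Only the statistic is shown; the associated P-value would require the Kolmogorov distribution (available in standard statistics libraries), and the function name and input values are hypothetical.

```python
def ks_uniform_statistic(p_values):
    """Kolmogorov-Smirnov distance D between P-values and the Uniform(0, 1) CDF."""
    p = sorted(p_values)
    n = len(p)
    # Largest gap between the empirical CDF (just after and just before each
    # observation) and the uniform CDF F(x) = x
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(p))


# Four evenly spread P-values are as close to uniform as a sample of size 4 can be
print(ks_uniform_statistic([0.125, 0.375, 0.625, 0.875]))  # 0.125
```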

Pathways databases characteristics

We used pathways defined in three databases: KEGG, NCI and Reactome (including respectively 225, 189 and 1095 pathways). To see whether we should treat these databases separately or not, we first computed statistics summarizing the main properties of these databases. First, we characterized the overlap between these databases, i.e. the number of genes shared between databases. We show that even if they substantially overlap in their gene content, the three databases have a large number of private genes (Supplementary Figure S2A). We also characterized the overlap between pathways within databases using Jaccard's index. We computed the redundancy within a database as the proportion of pathway pairs with an overlap higher than a given threshold as a function of this threshold (49). We find that pathways from the three databases have different levels of overlap, with Reactome having the largest fraction of non-overlapping pathways (Supplementary Figure S2B). Finally, we computed summary statistics to understand the structures of the networks in the different databases. The distributions of the number of nodes, the number of edges and the connectivity are also strikingly different between the three databases (Supplementary Figure S3). Since pathways of these three databases had different properties, we have analysed them separately, using genes from each database to build separate null distributions and perform statistical tests.

Adaptation to altitude in humans

We analysed the data from Foll et al. (26), who studied adaptation in two human populations living at high altitude. For each SNP, they computed the probability of convergent adaptation to altitude in Andeans and Tibetans under a hierarchical Bayesian model, and we used this probability as our score. We define the gene score as the highest-scoring SNP within the gene or in a 50 kb surrounding window. The distribution of gene scores appears slightly different between databases (Supplementary Figure S4), again justifying the separate analysis of the three databases. To search for high-scoring subnetworks in each pathway, we first generated the aggregate subnetwork score distributions for each database and for all possible subnetwork sizes. We then searched for the high-scoring subnetwork in each pathway using 10 000 simulated annealing iterations, and we assessed their significance from a null distribution of HSSs based on 10,000 permuted data sets (see Material and Methods). Interestingly, we find that subnetwork scores tend to be lower in denser pathways. Indeed, the estimated subnetwork score significantly decreases with the density of a pathway (linear regression, F(1,1339) = 42.11, P = 1.2 × 10^-10; R = –0.17; Supplementary Figure S5). This result is unlikely to be an artefact, as our simulation study shows that our procedure is not affected by network density (Table 1). Therefore, genes potentially involved in adaptive processes seem to be preferentially found in pathways with fewer gene-gene interactions. These results are in agreement with other empirical studies that showed that deleterious mutations tend to accumulate at the periphery of gene networks (50). Even though positive selection can also act on genes with more interactions (33,51), this result suggests that adaptation to altitude has mainly targeted genes with fewer pleiotropic effects since the number of interactions of a gene is clearly correlated to its pleiotropy level (52).
We then considered an HSS as significant if it showed a P-value <0.01 and a q-value <0.20. None of the pathways tested in the Reactome database remained significant after multiple test correction. We identified four pathways with a significant HSS in the NCI database and six such subnetworks in KEGG (Supplementary Table S2). The overall top-scoring pathway is the HIF-2-α transcription network (Figure 2), a pathway containing genes known to respond to hypoxic conditions. EPAS1 (HIF-2-α), the top-scoring gene, is a transcription factor active under hypoxic conditions. All the other significant genes within this pathway are directly interacting with EPAS1 and should thus play an important role in response to hypoxia. Some of these genes are inhibitors (CITED2) or cofactors (ARNT) of Hypoxia-Inducible Factors (HIF), others are regulated by HIF, such as VEGFA, a growth factor involved in angiogenesis.

Figure 2.

Most significant subnetwork among the three pathway databases. The HIF-2-α transcription pathway is represented as a graph (A), where each node is a gene, and the node size is proportional to the gene score. The highest scoring subnetwork (HSS) of the pathway is shown in red. The density distribution of gene scores in this pathway is shown in (B). The dashed line represents the density of gene scores within the whole KEGG database, the histogram shows the distribution of gene scores within this pathway, and the vertical red lines indicate the scores of the genes belonging to the HSS.

When top-scoring HSSs overlapped by one or more genes, we merged them into a single network (Figure 3). After this procedure, we observe four distinct clusters of genes. First, in the NCI database, we find a single cluster of genes within four pathways involved in vascular processes such as angiogenesis, response to hypoxia or blood coagulation (Figure 3A). Among these, the top-scoring genes are Endothelial PAS domain-containing protein 1 (EPAS1), Interleukin-6 (IL6), Angiopoietin 1 (ANGPT1), Pleiotrophin (PTN), Tyrosine-protein phosphatase non-receptor type 11 (PTPN11) and Epidermal Growth Factor Receptor (EGFR). We also observe many genes in these HSSs that present lower scores. Most of these are growth factors, such as genes in the Insulin-like Growth Factor (IGF), receptor tyrosine kinase (ErbB, EGFR), Neurotrophic Factor (NTF) or Interleukin (IL) families. We identified three other clusters of genes in the KEGG database that are involved in very different biological processes (Figure 3B). First, a large network of 32 genes involved in metabolic functions, where the top-scoring genes are Alcohol Dehydrogenase (ADH) subunits, most of the other genes being aerobic metabolism-related enzymes such as the Glucuronosyltransferase (UGT), Glutathione S-transferase (GST) or Glutamic-Oxaloacetic Transaminase (GOT) families. All of them present moderate probabilities of convergent adaptation (< 0.8). Second, an immunity-related cluster is observed, including 6 Human Leucocyte Antigen (HLA) genes. A last cluster consists of three genes related to neuronal cell growth, with Neuroligin 4 (NLGN4X) being the top-scoring gene in this database.
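The merging step described above (HSSs sharing one or more genes are pooled into a single network) can be illustrated with a minimal Python sketch. This is not the signet implementation, which is an R package; the function name `merge_overlapping` and the gene memberships in the example are hypothetical and chosen only for illustration.

```python
def merge_overlapping(subnetworks):
    """Merge gene sets that share at least one gene, transitively.

    Existing clusters are kept pairwise disjoint, so a new gene set only
    needs to be compared against each cluster once.
    """
    clusters = []
    for genes in subnetworks:
        merged = set(genes)
        remaining = []
        for cluster in clusters:
            if cluster & merged:
                merged |= cluster          # overlap: absorb this cluster
            else:
                remaining.append(cluster)  # no overlap: keep it untouched
        remaining.append(merged)
        clusters = remaining
    return clusters


# Hypothetical memberships, for illustration only (not the published HSSs).
example_hss = [
    {"EPAS1", "IL6"},
    {"IL6", "ANGPT1"},
    {"ADH1A", "ADH1B"},
    {"PTN", "ANGPT1"},
]
merged_clusters = merge_overlapping(example_hss)
```

In this toy input, the first, second and fourth sets are linked through IL6 and ANGPT1 and collapse into one cluster, while the ADH set stays separate.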

Figure 3.

Merged significant subnetworks. For each database, NCI (A) and KEGG (B), we merged the significant subnetworks of genes if they overlapped. The colour intensity and size of the nodes are proportional to the gene score. Red lines delimit the individual significant subnetwork and the names of pathways to which they belong are shown next to it.

Application to a large network

As large network databases are now relatively common, we tested our method on a network that is much larger than typical biological pathways. To do so, we merged all pathways from the NCI database into a single connected network (2,354 nodes and 21,537 edges). We first tested the convergence of the algorithm modified for large networks on simulated data. We took the same network topology and drew gene scores from a standard normal distribution. We selected a random subnetwork of size 50 and drew its gene scores from a normal distribution with mean 5 and variance 1. We generated 20 pseudo-observed datasets in this way and ran signet on each of them. In all 20 cases, we correctly identified at least 45/50 nodes from the simulated HSS and did not observe any false positives. We then applied our modified algorithm to the large merged network with observed scores. Since several sub-networks could exist in such large networks, we performed five sequential searches, each time removing the HSS identified in the previous iteration, so as to be able to identify five distinct sub-networks. We then generated a null distribution of HSS z-scores by permuting the scores among genes and running our simulated annealing approach 1,000 times. Two out of our five HSSs were found to be significant (both with P < 0.001, Supplementary Figure S6). These two HSSs are disjoint but connected to each other, and half of the genes (15/30) are top-scoring genes from the hypoxia pathway identified previously (Figure 3A). The first top-scoring subnetwork includes mostly growth factors involved in tissue development: seven of them have been identified in our previous pathway analysis (e.g. IL6, PTPN11 or EGFR), and 13 of them are new candidates (e.g. Ras-Related Protein Rab-5A (RAB5A) and Mitogen-Activated Protein 3-Kinase 2 (MAP3K2)). The second significant HSS contains ten genes and is structured around EPAS1. Only two of these genes were not identified in our previous pathway analysis: Vitronectin (VTN) and F-Box And WD Repeat Domain Containing 11 (FBXW11).
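The simulation design above (background scores from N(0, 1), a planted subnetwork of 50 genes with scores from N(5, 1), then a count of recovered nodes) can be sketched in Python as follows. Several parts are assumptions made for illustration: the toy ring-plus-chords topology stands in for the merged NCI network, the text does not say how the random subnetwork was drawn (here it is grown by adding random neighbours), and all function names are hypothetical.

```python
import random


def toy_network(n_nodes, n_extra_edges, rng):
    """Build a connected toy graph: a ring plus random chords (assumed topology)."""
    adjacency = {node: set() for node in range(n_nodes)}

    def add_edge(a, b):
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)

    for node in range(n_nodes):
        add_edge(node, (node + 1) % n_nodes)
    for _ in range(n_extra_edges):
        add_edge(rng.randrange(n_nodes), rng.randrange(n_nodes))
    return adjacency


def random_connected_subnetwork(adjacency, size, rng):
    """Grow a connected node set by repeatedly adding a random neighbouring node."""
    start = rng.choice(sorted(adjacency))
    chosen = {start}
    frontier = set(adjacency[start]) - chosen
    while len(chosen) < size and frontier:
        node = rng.choice(sorted(frontier))
        chosen.add(node)
        frontier |= adjacency[node]
        frontier -= chosen
    return chosen


def simulate_scores(adjacency, planted, rng, signal_mean=5.0):
    """Background scores ~ N(0, 1); planted scores ~ N(signal_mean, 1)."""
    return {
        gene: rng.gauss(signal_mean if gene in planted else 0.0, 1.0)
        for gene in adjacency
    }


def recovery(planted, found):
    """Return (true positives, false positives) for a reported subnetwork."""
    return len(planted & found), len(found - planted)


rng = random.Random(1)
network = toy_network(n_nodes=300, n_extra_edges=600, rng=rng)
planted = random_connected_subnetwork(network, size=50, rng=rng)
scores = simulate_scores(network, planted, rng)
```

A search result `found` would then be compared with `planted` through `recovery(planted, found)`, which mirrors the "at least 45/50 nodes, no false positives" summary reported in the text.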

DISCUSSION

New insights into human adaptation to altitude

The challenges of living at high altitude impose a very strong selective pressure on individuals, mainly due to low oxygen levels leading to hypoxia (53). Physiological changes have been identified in Tibetans and Andeans living at high altitude (54), and many studies have unveiled the genetic bases of these physiological changes (reviewed in (53)). Adaptation to altitude thus offered us a good positive control to test our new method on real data, and the fact that our top subnetwork is found in the HIF-2-α transcription pathway is therefore reassuring. This pathway is indeed a key component of the response to hypoxia, as it modulates or induces various physiological responses such as angiogenesis, haemoglobin concentration or erythropoiesis (55). Numerous genes within this pathway have already been proposed to be under selection in Tibetans and Andeans, such as EPAS1 and IL6 (55–58). In addition to these usual suspects, we identify many other genes with scores that remain below the detection threshold of the original genome scan (26), and which show a much more moderate signal of convergent adaptation. The identification of other candidate genes present in the HIF pathway is in line with the view that adaptation to altitude has a polygenic basis (55). For instance, we identified pleiotrophin (PTN), which acts as an angiogenic factor through multiple mechanisms (59), but which has to our knowledge never been identified as a major player in adaptation to altitude. Another gene, PTPN11, has a high score and a central position in one of the significant subnetworks. It encodes the protein tyrosine phosphatase SHP-2, which regulates heart and blood cell development during embryogenesis, as well as the development of other tissues (60). The Cell Adhesion Molecules (CAM) pathway also presents interesting signals, as we have identified a small cluster of 3 genes coding for neuroligins, which are neuronal proteins involved in the modulation of synaptic transmission (61). However, it has recently been shown that these genes are also involved in vascular processes (62,63). The NLGN1 gene thus seems to be a strong candidate for adaptation to high altitude in Tibetans and Andeans, and the mechanisms linking neuroligin action to angiogenesis at high altitude would deserve further investigation. Note that we have also identified a cluster of genes involved in separate metabolic processes. Signals of adaptation at ADH and ALDH genes have been observed in the original study as well as in Ethiopian populations living at high altitude (64). As suggested in the original study, these genes could be involved in fatty-acid degradation and energy production in the mitochondrion: in case of hypoxia, pathways such as omega-oxidation (including ADH genes) could be an alternative to beta-oxidation (26).

Advantages and limitations of the method

The search for high-scoring subnetworks is a combinatorial optimization problem for which several methods have been developed (31,41,65). Here, we describe a new method to detect selection in biological pathways based on a simulated annealing algorithm that extends a previous approach (31) by searching for the highest scoring subnetwork of interacting genes rather than for the highest scoring subset of nodes, i.e. we constrain the search to a single connected set of genes. Even though an exact algorithm has been developed to find the optimal subnetwork, it is not generally applicable, as it can only be applied to a list of P-values coming from a mixture of beta distributions (41). On the other hand, our simulated annealing method does not require any assumption on the distribution of gene scores, and it can therefore be applied to a wider range of problems. In addition, our statistical testing procedure explicitly takes into account the optimization procedure, by building a null distribution of high-scoring subnetworks in permuted data. The generation of this null distribution is a crucial step to prevent simulated annealing from identifying subnetworks in the absence of any signal (41), and we show here by simulation that our statistical procedure behaves properly in terms of type I error (Figure 1 and Supplementary Figure S1).

An interesting feature of our approach is the integration of functional information into the analysis by directly testing biologically relevant gene sets. This procedure allows one to better interpret the output of a genome scan and to find the potential functions that are involved in the adaptive process. This is in clear contrast with simple gene ontology enrichment, which is typically performed on a list of top-scoring candidate genes. For instance, we re-analysed the 72 candidate genes identified in the original study of convergent adaptation to altitude (26). We performed an enrichment analysis based on Fisher's exact test on the same gene sets that were analysed with signet (KEGG, Reactome, and NCI). Eight pathways remained significant after Bonferroni correction at a 5% level. Three of these pathways contained significant HSSs: ‘Chemical carcinogenesis’, ‘Drug metabolism – cytochrome P450’, and ‘Tyrosine metabolism’. The other 5 pathways were related to metabolism (‘Fatty acid degradation’, ‘Glycolysis’, ‘Metabolism of xenobiotics’, ‘Retinal metabolism’ and ‘Ethanol oxidation’). This enrichment analysis thus missed some important biological processes involved in adaptation to altitude, such as the HIF signalling pathway. Note that 14 of the 72 genes initially identified as candidates for adaptation to altitude are also present in our significant HSSs (Figure 3). Among the absent genes, 46 are not included in our current analysis, either because they are absent from the pathway databases (n = 31), or because no SNP could be associated with them (the closest SNP being more than 50 kb away) (n = 15). Therefore, only 26 genes were identifiable with signet. The fact that a large fraction (46%) of these genes are not present in our significant HSSs and that only 4 out of 9 HSSs include a top-scoring gene shows that our method is not just agglomerating less significant genes around top-scoring genes. In any case, our results seem biologically more relevant than a simple enrichment analysis, and in this case easier to interpret.

Our approach is conceptually close to the method developed by Daub et al. (7), which consisted in testing whether a whole pathway presented a shift in the gene score distribution. The main difference with this previous approach is that we aim here at finding high-scoring subnetworks within pathways. Indeed, it is more likely that polygenic adaptive events have focused on only a subset of genes rather than on a whole pathway. In addition to being able to identify small subsets of genes even in large pathways, our approach allows one to identify outlier functions and genes at the same time, whereas under the previous whole-pathway approach, pathways had to be manually inspected in order to know which genes were driving adaptation (7). However, Daub et al.’s approach (7) has some advantages, as it can be applied to any pathway and is not limited to pathways for which gene interaction networks are explicitly available. Therefore, the two approaches should be seen as complementary.

Whereas the present methodology overcomes some common problems associated with genome scans for selection, such as being able to identify genes with moderate selection scores (Figures 2 and 3) and to explicitly associate candidate gene sets with biological functions, it also presents some limitations as compared to other methods to detect selection. First, our approach is limited by the availability of pathway and network information. Therefore, some genes and biological functions cannot be tested, and the method is not easily applicable to non-model species for which no pathway databases are available. Second, one should be aware that only certain types of biological functions are tested, i.e. biochemical phenotypes, and we thus have no information about higher-order phenotypes, e.g. height or weight. Finally, this method does not allow one to identify isolated top-scoring genes. However, such isolated outlier genes are easily identified with a classical genome scan. One can thus check if outliers are represented among significant subnetworks and therefore determine if selection has only targeted these single genes or if higher-order processes have been the target of selection.
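To make the idea of a connectivity-constrained simulated annealing search concrete, here is a minimal Python sketch. It is not the signet algorithm: the subnetwork score (sum of gene scores divided by the square root of the subnetwork size), the geometric cooling schedule, the parameter values and all function names are assumptions chosen for illustration. Each iteration proposes toggling one gene (adding a neighbour or removing a member), rejects proposals that would disconnect the subnetwork, and accepts worse proposals with a probability that shrinks as the temperature decreases.

```python
import math
import random


def is_connected(nodes, adjacency):
    """Depth-first check that `nodes` induces a connected subgraph."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current]:
            if neighbour in nodes and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(nodes)


def subnetwork_score(nodes, scores):
    """Assumed aggregate score: sum of gene scores over sqrt(subnetwork size)."""
    return sum(scores[g] for g in nodes) / math.sqrt(len(nodes))


def anneal(adjacency, scores, rng, n_iter=5000, t_start=1.0, cooling=0.999):
    """Search for a high-scoring connected subnetwork (illustrative sketch)."""
    current = {rng.choice(sorted(adjacency))}
    current_score = subnetwork_score(current, scores)
    best, best_score = set(current), current_score
    temperature = t_start
    for _ in range(n_iter):
        # Candidate moves: remove a member or add a gene adjacent to the subnetwork.
        boundary = set().union(*(adjacency[g] for g in current)) - current
        gene = rng.choice(sorted(boundary | current))
        proposal = current ^ {gene}
        if proposal and is_connected(proposal, adjacency):
            new_score = subnetwork_score(proposal, scores)
            delta = new_score - current_score
            # Always accept improvements; accept worse moves with prob exp(delta / T).
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                current, current_score = proposal, new_score
                if current_score > best_score:
                    best, best_score = set(current), current_score
        temperature *= cooling
    return best, best_score


# Toy chain of 10 genes in which genes 3, 4 and 5 carry the signal.
chain = {i: {j for j in (i - 1, i + 1) if 0 <= j < 10} for i in range(10)}
chain_scores = {i: (4.0 if i in (3, 4, 5) else -1.0) for i in range(10)}
best, best_score = anneal(chain, chain_scores, random.Random(0))
```

On this toy chain the search is expected to settle on genes 3 to 5. To assess significance in the spirit of the text, the same search would be rerun on scores permuted among genes to build a null distribution of best scores.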

CONCLUSION AND PERSPECTIVES

Overall, the signet method allowed us to study an example of human adaptation from a gene network perspective. Based on information about gene interactions and a proxy for selection, we could identify potentially new targets of selection, like pleiotrophin or neuroligins. This method thus has the potential to detect new genetic bases of adaptation in humans, as well as in other species for which gene interaction databases exist or could be inferred. As our method takes as input a simple gene score, measures other than population genetic statistics could be used. For instance, one could analyse phenotype-association measures or gene expression levels (such as in QTL and eQTL analysis, e.g. (22,23)) by using a z-score computed from an association P-value. In addition, even if we have analysed networks of relatively moderate size, we have shown that signet can handle large gene networks. This latter feature opens up new perspectives, such as the analysis of large functional interaction networks (e.g. from the STRING (66) or BioGRID (67) databases). The main limitation for such studies would then be computation time, as the generation of null distributions in a large network (e.g. of size 5,000) takes substantial time (Supplementary Table S1), but this procedure could easily be parallelized on a computer cluster. Finally, even though we have applied signet to search for signals of adaptation in humans, the same workflow can be applied to other organisms for which pathway databases are available, and used in other fields, such as differential gene expression studies, GWAS or any kind of analysis in which a score can be obtained for any given gene. Therefore, the signet approach is very versatile and might be of interest to both evolutionary biologists and biomedical researchers.
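As an illustration of the P-value-to-z-score conversion mentioned above, the following Python sketch maps a one-sided association P-value to a standard normal quantile, so that smaller P-values give larger gene scores. The function name `p_to_z`, the one-sided convention and the clamping bounds are assumptions; the text does not specify how such a z-score should be computed.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()


def p_to_z(p_value, floor=1e-300, ceiling=1.0 - 1e-12):
    """Convert a one-sided P-value to a z-score: z = -Phi^{-1}(p).

    The P-value is clamped to (0, 1) because the inverse CDF is undefined
    at exactly 0 or 1 (the bounds used here are arbitrary assumptions).
    """
    clamped = min(max(p_value, floor), ceiling)
    return -_STANDARD_NORMAL.inv_cdf(clamped)


# Hypothetical per-gene association P-values, for illustration only.
gene_p_values = {"geneA": 0.5, "geneB": 0.025, "geneC": 1e-6}
gene_scores = {gene: p_to_z(p) for gene, p in gene_p_values.items()}
```

Using `-inv_cdf(p)` rather than `inv_cdf(1 - p)` avoids losing precision for very small P-values, where `1 - p` would round to exactly 1.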

AVAILABILITY

Our approach has been implemented as a fully automated R package. The source code and documentation are available on GitHub (http://www.github.com/CMPG/signet).
