Dries De Maeyer, Bram Weytjens, Luc De Raedt, Kathleen Marchal.
Abstract
In clonal systems, interpreting driver genes in terms of molecular networks helps to understand how these drivers elicit an adaptive phenotype. Obtaining such a network-based understanding depends on the correct identification of driver genes. In clonal systems, independently evolved lines can acquire a similar adaptive phenotype by affecting the same molecular pathways, a phenomenon referred to as parallelism at the molecular pathway level. This implies that successful driver identification depends on interpreting mutated genes in terms of molecular networks. Driver identification and obtaining a network-based understanding of the adaptive phenotype are thus confounded problems that ideally should be solved simultaneously. In this study, a network-based eQTL method is presented that solves both the driver identification and the network-based interpretation problem. As input, the method uses coupled genotype-expression phenotype data (eQTL data) of independently evolved lines with similar adaptive phenotypes and an organism-specific genome-wide interaction network. The search for mutational consistency at pathway level is defined as a subnetwork inference problem, which consists of inferring a subnetwork from the genome-wide interaction network that best connects the genes containing mutations to differentially expressed genes. Based on their connectivity with the differentially expressed genes, mutated genes are prioritized as driver genes. Based on semisynthetic data and two publicly available data sets, we illustrate the potential of the network-based eQTL method to prioritize driver genes and to gain insight into the molecular mechanisms underlying an adaptive phenotype. The method is available at http://bioinformatics.intec.ugent.be/phenetic_eqtl/index.html.
Keywords: experimental evolution, biological networks, gene prioritization, coexisting ecotypes, drug resistance
Year: 2016; PMID: 26802430; PMCID: PMC4825419; DOI: 10.1093/gbe/evw010
Source DB: PubMed; Journal: Genome Biol Evol; ISSN: 1759-6653; Impact factor: 3.416
Figure. Overview of the network-based eQTL method. The input of the method consists of coupled genotype and expression phenotype data for a set of evolved lines with the same phenotype, together with a genome-wide interaction network. Red and green indicate, respectively, over- and underexpression with respect to a reference. Genes that are considered to be significantly differentially expressed according to a test statistic are indicated by a specific symbol, as displayed in the figure legend. Mutated driver and passenger genes are indicated with two different symbols, as displayed in the legend. The numbering of each mutated gene indicates the evolved line in which this mutated gene occurred. (A) Construction of the end point specific probabilistic subnetworks: for each evolved line, the genome-wide interaction network is converted into a probabilistic subnetwork by assigning to each edge in the genome-wide interaction network a weight that is interpreted as the probability that the edge has an influence on the assessed phenotype. These weights depend on the level of differential expression of the terminal node of the edge. Genes that are more differentially expressed (darker red/green) give rise to higher weights on the edges (indicated by the width of the edge). (B) Pathfinding in each of the probabilistic subnetworks. The mutated and significantly differentially expressed genes occurring in each of the evolved lines are mapped to the corresponding end point specific probabilistic subnetworks. For each significantly differentially expressed gene, all possible paths from this gene to all mutated genes in the same end point are searched for (paths are shown as black curves). (C) Optimal subnetwork selection. Optimization is performed by integrating the paths found in all end point specific probabilistic networks according to a predefined cost function that positively scores the addition of paths connecting pairs of mutated and differentially expressed genes observed in any of the end points, but that penalizes the addition of edges. As a result, paths that are strongly connected to the expression phenotype and that overlap with each other are selected as the optimal subnetwork.
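The three steps in the caption above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the published implementation: the linear scaling of edge weights by the terminal node's absolute log fold change, the search direction (from mutated gene to differentially expressed gene along edge direction), the path-length limit, the scoring function, and all gene names (`m1`, `a`, `d1`, ...).

```python
from math import prod

def edge_probabilities(edges, abs_log_fc):
    """Step A (assumed scaling): weight each directed edge (u, v) by the
    differential expression of its terminal node v, normalized to [0, 1]."""
    top = max(abs_log_fc.values(), default=0.0) or 1.0
    return {(u, v): min(1.0, abs_log_fc.get(v, 0.0) / top) for u, v in edges}

def simple_paths(adj, source, target, max_edges=4):
    """Step B: enumerate loop-free paths from source to target (depth-first)."""
    stack = [[source]]
    while stack:
        path = stack.pop()
        if path[-1] == target and len(path) > 1:
            yield path
            continue
        if len(path) > max_edges:  # path already holds max_edges edges
            continue
        for nxt in adj.get(path[-1], ()):
            if nxt not in path:
                stack.append(path + [nxt])

def subnetwork_score(paths, prob, edge_cost):
    """Step C (hypothetical cost function): reward each connected
    (mutated gene, expressed gene) pair by its best path probability and
    charge edge_cost for every distinct edge used."""
    best, used = {}, set()
    for p in paths:
        e = list(zip(p, p[1:]))
        used.update(e)
        pair = (p[0], p[-1])
        best[pair] = max(best.get(pair, 0.0), prod(prob[x] for x in e))
    return sum(best.values()) - edge_cost * len(used)

# Toy network: m1 and m2 share the intermediate node a; m3 is isolated from them.
edges = [("m1", "a"), ("a", "d1"), ("m2", "a"), ("m3", "b"), ("b", "d2")]
abs_log_fc = {"a": 2.0, "d1": 4.0, "b": 0.5, "d2": 1.0}
prob = edge_probabilities(edges, abs_log_fc)

adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)

paths = [p for m in ("m1", "m2", "m3") for d in ("d1", "d2")
         for p in simple_paths(adj, m, d)]

overlapping = subnetwork_score(paths[:2], prob, edge_cost=0.25)
with_extra = subnetwork_score(paths, prob, edge_cost=0.25)
```

In this toy example the two overlapping paths through `a` score higher than the set that also includes the weakly supported path from `m3`, which mirrors the caption's statement that overlapping, strongly expression-connected paths are favored.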
Selected mutated genes prioritized as driver genes
| AMK Resistance | | | | Coexisting Ecotypes | | | |
|---|---|---|---|---|---|---|---|
| Gene name | Rank | Line | Type | Gene name | Rank | Line | Type |
| | 1 | 2,4 | frameshift | | 1 | S | missense |
| | 2 | 1,3 | missense, in-frame del | | 1 | S | missense |
| | 3 | 2 | nonsense | | 1 | S | missense |
| | 3 | 4 | nonsense | | 2 | S | intergenic |
| | 3 | 4 | in-frame del | | 3 | S | intergenic |
| | 4 | 4 | missense | | 4 | S | synonymous |
| | 5 | 1,2,3,4 | missense | | 5 | L | missense |
| | 6 | 1 | missense | | 5 | L | large del |
| | 7 | 3 | frameshift del | | 6 | S | missense |
| | 8 | 2 | missense | | 7 | L | intergenic |
| | 9 | 1 | missense | glk | 7 | S | intergenic |
| | 10 | 1 | missense | | | | |
Figure. Performance assessment of the network-based eQTL method on the semisynthetic data set. Data of all selected mutated genes at specific ranks are presented as Tukey boxplots. Note that multiple mutated genes can have identical ranks: ranks are assigned based on the maximal edge cost for which a mutation is present within the subnetwork, and several mutated genes can share the same maximal edge cost. The upper plot shows the positive predictive value (PPV; the fraction of the selected mutations that are true positives, i.e., driver mutations) as a function of the rank of the selected mutations. Low ranks have higher PPV values. Note that at rank 1 the variance is high. This is because inferred subnetworks for rank 1 are small and therefore more prone to random effects; that is, the selection of one additional false positive in a particular random set largely affects the PPV. Solutions are clearly less variable from rank 2 onwards. The lower plot shows the sensitivity (the fraction of all possible true positives that are selected) as a function of the rank of the selected mutations. Sensitivity increases with rank, implying a trade-off between PPV and sensitivity.
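The two quantities plotted in this figure can be computed as in the short Python sketch below. It assumes that the selection at rank r is cumulative (all mutated genes with rank at most r), which is consistent with sensitivity increasing with rank but is not stated explicitly in the caption; the gene names and driver set are made up.

```python
def ppv_and_sensitivity(ranks, true_drivers):
    """Return {rank: (PPV, sensitivity)} for cumulative selections.

    ranks: dict mapping each selected mutated gene to its rank (ties allowed).
    true_drivers: non-empty set of genes known to be drivers.
    """
    results = {}
    for r in sorted(set(ranks.values())):
        selected = {g for g, rk in ranks.items() if rk <= r}
        tp = len(selected & true_drivers)
        ppv = tp / len(selected)               # true positives among selected
        sensitivity = tp / len(true_drivers)   # true positives among all drivers
        results[r] = (ppv, sensitivity)
    return results

# Hypothetical ranking with ties and a known driver set.
ranks = {"g1": 1, "g2": 1, "g3": 2, "g4": 2, "g5": 3, "g6": 3}
true_drivers = {"g1", "g2", "g3", "g5"}
performance = ppv_and_sensitivity(ranks, true_drivers)
```

With these toy values, PPV falls from rank 1 to rank 3 while sensitivity rises, the same trade-off the caption describes.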
Data sets used to compile the Escherichia coli genome-wide interaction networks
| Interaction Type | | | |
|---|---|---|---|
| Protein–protein | 2,737 | 2,721 | 2,534 |
| Protein–DNA | 4,492 | 3,415 | 3,890 |
| Sigma | 727 | 1,225 | 592 |
| Metabolic | 2,798 | 5,136 | 2,462 |
| Phosphorylation and dephosphorylation | 44 | 38 | 40 |
| sRNA | 213 | 2 | 171 |
| Size (edges) | 11,011 | 12,537 | 9,689 |
| Size (nodes) | 2,732 | 2,650 | 2,418 |
The E. coli K12 MDS42 network was derived from the E. coli K12 MG1655 network by deleting all edges connecting genes that do not exist in E. coli K12 MDS42.
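The derivation described in this note amounts to a simple edge filter. The Python sketch below is illustrative only: the edge tuples, gene identifiers, and the retained-gene set are hypothetical placeholders, and an edge is dropped when either of its endpoints is absent from the reduced genome.

```python
def restrict_network(edges, retained_genes):
    """Keep only edges whose two endpoints both exist in the reduced genome.

    edges: iterable of (source, target, interaction_type) tuples.
    retained_genes: set of gene identifiers present in the derived strain.
    """
    return [(u, v, kind) for u, v, kind in edges
            if u in retained_genes and v in retained_genes]

# Hypothetical parent network and reduced gene set.
parent_edges = [
    ("geneA", "geneB", "protein-DNA"),
    ("geneB", "geneC", "metabolic"),
    ("geneC", "geneD", "protein-protein"),
]
reduced_genome = {"geneA", "geneB", "geneC"}
derived_edges = restrict_network(parent_edges, reduced_genome)
```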
Figure. Visualization of subnetworks inferred from the amikacin resistance data set based on data from 100 randomizations. The visualization was created by merging separately inferred subnetworks resulting from a parameter sweep of the edge cost from 0.25 to 1.75. The width of an edge displays the stringency at which the edge was selected: the wider the edge, the more stringent the condition, where more stringent conditions correspond to higher edge costs. Node borders are subdivided into four parts to visualize in which line a mutation occurred (evolved lines compared with ancestral line). The inner color of the nodes is also subdivided into four parts, where each part represents the degree of differential expression in the corresponding line. The colors of the edges represent the edge types.
Figure. Visualization of subnetworks inferred from the coexisting ecotypes data set. The visualization was created by merging separately inferred subnetworks resulting from a parameter sweep of the edge cost from 0.025 to 0.975. The width of the edges represents the maximal mutation cost for which these edges were selected. The width of an edge displays the stringency at which the edge was selected: the wider the edge, the more stringent the condition, where more stringent conditions correspond to higher edge costs. Node borders are subdivided into two parts to visualize in which strain a mutation occurred. The inner color of the nodes represents the degree of differential expression (L ecotype compared with S ecotype). The colors of the edges represent the edge types.
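The merging of subnetworks across an edge-cost sweep, and the rank assignment mentioned in the performance-assessment caption (rank determined by the maximal edge cost at which a mutated gene is still present), can be sketched in Python as follows. The sweep values, subnetworks, and gene names are invented for illustration, and the dense ranking of tied genes is an assumption.

```python
def merge_sweep(subnetworks):
    """Map each edge to the highest edge cost at which it was still selected.

    subnetworks: dict mapping edge cost -> set of (source, target) edges.
    """
    merged = {}
    for cost, edges in subnetworks.items():
        for e in edges:
            merged[e] = max(merged.get(e, cost), cost)
    return merged

def rank_mutated_genes(subnetworks, mutated):
    """Rank mutated genes by the maximal edge cost at which they appear;
    genes with the same maximal cost share a rank (rank 1 = most stringent)."""
    max_cost = {}
    for cost, edges in subnetworks.items():
        nodes = {n for e in edges for n in e}
        for g in mutated & nodes:
            max_cost[g] = max(max_cost.get(g, cost), cost)
    levels = sorted(set(max_cost.values()), reverse=True)
    return {g: levels.index(c) + 1 for g, c in max_cost.items()}

# Hypothetical sweep: higher edge costs yield smaller subnetworks.
sweep = {
    0.25: {("m1", "a"), ("a", "d1"), ("m2", "a"), ("m3", "b"), ("b", "d2")},
    0.75: {("m1", "a"), ("a", "d1"), ("m2", "a")},
    1.25: {("m1", "a"), ("a", "d1")},
}
edge_width = merge_sweep(sweep)
gene_rank = rank_mutated_genes(sweep, {"m1", "m2", "m3"})
```

Here `edge_width` plays the role of the edge width in the figures, and `gene_rank` gives the driver prioritization order for the toy mutated genes.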