Literature DB >> 23160177

Evolutionary rate and duplicability in the Arabidopsis thaliana protein-protein interaction network.

Abstract

Genes show a bewildering variation in their patterns of molecular evolution, as a result of the action of different levels and types of selective forces. The factors underlying this variation are, however, still poorly understood. In the last decade, the position of proteins in the protein-protein interaction network has been put forward as a determinant factor of the evolutionary rate and duplicability of their encoding genes. This conclusion, however, has been based on the analysis of the limited number of microbes and animals for which interactome-level data are available (essentially, Escherichia coli, yeast, worm, fly, and humans). Here, we study, for the first time, the relationship between the position of proteins in the high-density interactome of a plant (Arabidopsis thaliana) and the patterns of molecular evolution of their encoding genes. We found that genes whose encoded products act at the center of the network are more evolutionarily constrained than those acting at the network periphery. This trend remains significant when potential confounding factors (gene expression level and breadth, duplicability, function, and length of the encoded products) are controlled for. Even though the correlation between centrality measures and rates of evolution is generally weak, for some functional categories, it is comparable in strength to (or even stronger than) the correlation between evolutionary rates and expression levels or breadths. In addition, genes encoding interacting proteins in the network evolve at relatively similar rates. Finally, Arabidopsis proteins encoded by duplicated genes are more highly connected than those encoded by singleton genes. This observation is in agreement with the patterns observed in humans, but in contrast with those observed in E. coli, yeast, worm, and fly (whose duplicated genes tend to act at the periphery of the network), implying that the relationship between duplicability and centrality inverted at least twice during eukaryote evolution. Taken together, these results indicate that the structure of the A. thaliana network constrains the evolution of its components at multiple levels.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Arabidopsis Proteins

Year: 2012 PMID： 23160177 PMCID： PMC3542556 DOI： 10.1093/gbe/evs101

Source DB: PubMed Journal: Genome Biol Evol ISSN： 1759-6653 Impact factor: 3.416

Introduction

Genes and proteins rarely operate in isolation; on the contrary, they often act as parts of complex functional networks of interacting molecules. Understanding the function and evolution of molecular networks is not only a key step toward understanding organisms’ function and evolution, but it can also aid applications such as metabolic engineering and drug discovery and design (for review, see Butcher et al. 2004; Korcsmáros et al. 2007; Lee et al. 2011). Furthermore, considering the position of genes and proteins in the networks in which they participate may provide key insight into the evolutionary forces governing their evolution. Indeed, several lines of evidence point to a link between the position of proteins in metabolic and protein–protein interaction networks (PINs) and the patterns of molecular evolution of their encoding genes (reviewed by Cork and Purugganan 2004; Eanes 2011; Wagner 2012). In particular, proteins’ network position has an effect on their rate of evolution and on the duplicability of their encoding genes. Several questions, however, remain open. Proteins’ rates of evolution vary across orders of magnitude, as a result of the action of different levels and types of evolutionary forces (Zuckerkandl and Pauling 1965; King and Jukes 1969; Ohta and Kimura 1971; Li et al. 1985). Identifying and understanding the factors responsible for this variability is one of the main open questions in Evolutionary Biology. Purifying selection is expected to act more severely on genes performing functions that are more important for the organism’s biological fitness, thus resulting in lower rates of evolution (Wilson et al. 1977; Kimura 1983); however, the relative contribution of genes to fitness remains elusive. Over the last decade, the wealth of genomic and functional data has made feasible to pursue the formulation of a unifying theory that explains the variation of the rates of protein evolution (for review, see Herbeck and Wall 2005; Koonin and Wolf 2006; McInerney 2006; Pál et al. 2006; Rocha 2006; Wolf et al. 2006). Several factors have been shown to correlate with the strength of purifying selection acting on genes, with patterns of expression being the most prominent: more highly or widely expressed genes tend to be more constrained than those expressed at low levels or in a narrow range of tissues (Duret and Mouchiroud 2000; Pál et al. 2001; Krylov et al. 2003; Subramanian and Kumar 2004; Wright et al. 2004; Drummond et al. 2005, 2006; Ingvarsson 2007; Slotte et al. 2011; Yang and Gaut 2011). Other relevant determinants of genes’ evolutionary rates include the length (Subramanian and Kumar 2004; Ingvarsson 2007) and function (Castillo-Davis et al. 2004; Alvarez-Ponce and McInerney 2011) of the encoded products. However, these factors are poor predictors of evolutionary rates (e.g., Ingvarsson 2007; Larracuente et al. 2008). Remarkably, genes acting at the center of the PIN tend to be more selectively constrained than those acting at the network periphery, a pattern that has thus far been observed in Escherichia coli, yeasts, worms, flies, and mammals (Fraser et al. 2002; Jordan et al. 2003; Hahn and Kern 2005; Lemos et al. 2005; Davids and Zhang 2008; Alvarez-Ponce 2012). Consistently, proteins involved in complexes tend to be highly conserved (Teichmann 2002). Similar patterns have been observed in metabolic networks, whose most connected enzymes tend to evolve under stronger selective pressures (Vitkup et al. 2006). In addition, genes encoding physically interacting proteins tend to evolve at relatively similar rates (Fraser et al. 2002; Lemos et al. 2005; Alvarez-Ponce et al. 2009, 2011; Cui et al. 2009). Although these trends are significant and consistent across all organisms studied to date, they are often weak, and whether they reflect a direct effect of network position on genes’ rates of evolution has been the subject of debate. In particular, some authors have suggested that these trends might be a byproduct of the distribution of confounding factors across the network, such as genes acting at the center of the network being expressed at higher levels or to biases in interactomic data sets (Bloom and Adami 2003; Batada et al. 2006) (but see Fraser et al. 2003; Fraser and Hirsh 2004; Fraser 2005; Lemos et al. 2005). Genes also widely differ in their duplicability (i.e., the propensity to retain duplicated copies after a gene duplication event), with some genes remaining as single copies (singletons) over long evolutionary periods and others being recurrently duplicated. Genes’ duplicabilities are known to be affected by a number of factors, including the position of the encoded proteins in the PIN. However, the direction of the relationship between gene duplicability and network centrality is not universal. In the PINs of E. coli, yeast, worm, and fly, singleton genes tend to occupy more central positions than duplicated genes (Hughes and Friedman 2005; Prachumwat and Li 2006; Makino et al. 2009; D'Antonio and Ciccarelli 2011). This trend has been attributed to a fragility of the network to duplications affecting its more connected elements. Indeed, complexes and pathways are thought to perform better with balanced concentrations of their members, and the duplication of a given gene is expected to disrupt the dosage balance of the interactions in which its encoded product is involved (Birchler et al. 2001; Veitia 2002; Papp et al. 2003), which may have more deleterious effects for proteins with a higher number of interactors. Conversely, human proteins encoded by duplicated genes tend to be more central to the PIN than those encoded by singleton genes (Liang and Li 2007; Doherty et al. 2012). This is a derived character resulting from the high duplicability of human hubs originated after the emergence of Metazoans, implying that the relationship between centrality and duplicability underwent modification in the vertebrate lineage (D'Antonio and Ciccarelli 2011). The factors underlying this shift in the relationship between centrality and duplicability remain, however, unclear. The relationship between network position and genes’ patterns of molecular evolution has been hitherto investigated only in the few microorganisms (E. coli and yeast) and animals (worm, fly, and human) for which interactome-scale protein–protein interaction data are available. The recent availability of a relatively high-density interactome for Arabidopsis thaliana (Arabidopsis Interactome Mapping Consortium 2011; Stark et al. 2011), together with the availability of the genome sequences for this species (Arabidopsis Genome Initiative 2000) and its close relative A. lyrata (Hu et al. 2011), allowed us to investigate, for the first time, the relationship between genes’ patterns of molecular evolution and their position in a plant network. Consistent with patterns observed in other organisms, we found that genes encoding the most central proteins of the network tend to evolve under stronger levels of selective constraint and that genes encoding physically interacting proteins evolve at relatively similar rates, pointing to general trends across all of life. The relationship between centrality and evolutionary rate is independent of potential confounding factors (gene duplicability, expression level and breadth, and the length of the encoded products), suggesting a direct effect of the network structure on the patterns of molecular evolution of its components. Even though the correlation between the measures of centrality and rates of evolution is in general weak, for some functional categories, it is comparable in strength to (or even stronger than) the correlation between evolutionary rates and expression levels or breadths. Surprisingly, genes acting at the center of the A. thaliana PIN are more likely to be duplicated than those acting at the periphery, implying that the relationship between centrality and duplicability underwent modification not only in vertebrates but also in plants.

Materials and Methods

Protein–Protein Interaction Data

The PIN was obtained by merging the A. thaliana networks available from the BioGRID database v3.1.81 (Stark et al. 2011) and from Arabidopsis Interactome Mapping Consortium (2011). Only physical interactions among pairs of A. thaliana proteins were considered. All interactions used in the current analysis had been determined experimentally in A. thaliana (i.e., the data set does not contain interactions inferred computationally or derived from other species).

Impact of Natural Selection

For each gene in the network, we attempted to identify its 1:1 ortholog in the A. lyrata genome (Hu et al. 2011) using a best reciprocal Basic Local Alignment Search Tool (BLAST) approach (using BLASTP and an E-value cut-off of 10−10). Each pair of A. thaliana–A. lyrata orthologous protein sequences was aligned using ProbCons 1.12 (Do et al. 2005), and the resulting alignments were used to guide the alignment of the corresponding coding sequences. Estimates of dN, dS, and ω were obtained using the one-ratio model M0 from the PAML 4.4 package (Yang 2007).

Paralogs Identification

Each A. thaliana protein was used as query in a BLASTP (Altschul et al. 1997) search against the A. thaliana proteome. Genes were classified as singleton if no hit was obtained with an E value < 0.1 or as duplicated if at least a homolog was found that met the following criteria: 1) E value ≤ 10−10; 2) the aligned region length (L) was ≥80% of the length of the query sequence; and 3) amino acid identity was ≥30% if L > 150 amino acids or 0.06 + 4.8L−0.32[1 + exp(−/1,000)] otherwise (Rost 1999). Duplicated genes were further classified as whole-genome duplication (WGD) genes if classified as such in Blanc et al. (2003); otherwise, they were classified as resulting from small-scale genome duplication (SSD).

Gene Expression Level and Breadth

Expression data from 79 A. thaliana tissues were obtained from Schmid et al. (2005). For each gene and tissue, values were averaged across the three replicates, and the gene was considered to be expressed if it was annotated as “present” in at least two of the replicates. For each gene, expression level was computed as the median across the 79 tissues, and expression breadth was computed as the number of tissues in which the gene is expressed. For genes matching multiple probe sets, the set yielding a higher expression level was used. Probe sets matching multiple genes were discarded.

Functional Information

Each A. thaliana gene was assigned to one (or sometimes a few) eukaryotic orthologous group (KOG) categories using the eggNOG v2 database (Muller et al. 2010).

Ribosomal Proteins

Proteins were considered to be ribosomal if identified as such in Barakat et al. (2001), if their description contained the text “ribosomal protein” or if they were assigned to the Gene Ontology (Ashburner et al. 2000) terms “large ribosomal subunit” or “small ribosomal subunit.”

Age of Genes

Each of the 5,789 A. thaliana network proteins was used as query in a BLASTP (Altschul et al. 1997) search against the nr database (obtained from the National Center for Biotechnology Information on January 2012). An E-value cut-off of 10−6 was used. Only hits whose aligned region length was ≥80% the length of the query sequence were considered. Genes presenting hits in prokaryotes or in nonplant eukaryotes (non-Embryophyta) were deemed as “ancient.” The remaining genes were considered to be “land plant-specific.”

Results

Genes Acting at the Center of the A. thaliana PIN Are More Selectively Constrained Than Those Acting at the Network Periphery

We assembled a PIN for A. thaliana by merging all interactions available in the BioGRID database (Stark et al. 2011) and in Arabidopsis Interactome Mapping Consortium (2011). The resulting network consisted of 5,789 unique proteins connected by 14,368 physical binary interactions. For each protein in the network, we computed three centrality measures: 1) the number of interactors (degree); 2) the number of shortest paths between all pairs of proteins of which the protein is part (betweenness; Freeman 1977); and 3) the inverse of the average shortest distance to all the other proteins in the network (closeness). Using a best reciprocal BLAST approach, we found 1:1 A. lyrata orthologs for 3,868 of the 5,789 A. thaliana network genes. For each pair of orthologs, the impact of natural selection was inferred from the nonsynonymous to synonymous divergence ratio (ω = dN/dS). Values of ω = 1 are expected for genes evolving neutrally, whereas ω < 1 is indicative of the action of purifying selection preserving the sequence of the encoded proteins, and ω > 1 in a number of codons is indicative of the action of positive selection (adaptive evolution) driving the fixation of nonsynonymous substitutions. The estimated ω values exhibit a median of 0.128, indicating that these genes evolved under relatively high levels of selective constraint. We found that the 3,868 genes encoding proteins that are represented in the network (i.e., those with described interactors) exhibit lower rates of evolution than those not represented in the network (median ω values of 0.128 and 0.183, respectively; Mann–Whitney test, P < 10−15). In addition, among these 3,868 network genes, ω values negatively correlate with betweenness (Spearman’s rank correlation coefficient, ρ = −0.053, P = 0.001; fig. 1) and closeness (ρ = −0.034, P = 0.035): the larger the values of betweenness and closeness for a protein, the stronger are the selective constraints acting on that protein. Although the correlation is only marginally significant for degree (ρ = −0.030, P = 0.063), the 1,467 genes with degree >1 show significantly lower ω values than the 2,401 genes with degree = 1 (median ω values of 0.125 and 0.135, respectively; Mann–Whitney test, P = 0.005). Taken together, these results indicate that the most central genes in the A. thaliana network are subject to stronger levels of purifying selection than those acting at the network periphery. Similar results were obtained for dN but not for dS (table 1), indicating that the distribution across the network of selection at the amino acid level is the main responsible for the observed trend.

Correlation between ω and betweenness.

Table 1

Spearman’s Correlations among the Parameters Considered in the Study

	ω	d_N	d_S	Degree	Closeness	Betweenness	Expression Level	Expression Breadth
d_N	0.732***
d_S	−0.219***	0.310***
Degree	−0.030	−0.026	−0.019
Closeness	−0.034*	−0.032*	0.001	0.412***
Betweenness	−0.053**	−0.051**	−0.017	0.875***	0.436***
Expression level	−0.331***	−0.342***	−0.049**	−0.013	0.066**	0.046**
Expression breadth	−0.245***	−0.264***	−0.072**	−0.006	0.024	0.035*	0.692***
Protein length	0.050**	−0.009	−0.124***	−0.017	−0.042**	−0.023	−0.088***	0.012

*P < 0.05.

**P < 0.01.

***P < 10−6.

Correlation between ω and betweenness. Spearman’s Correlations among the Parameters Considered in the Study *P < 0.05. **P < 0.01. ***P < 10−6.

The Relationship between Centrality and Evolutionary Rate Is Independent of Gene Expression Level and Breadth, Gene Function, and the Length of the Encoded Products

Having established an association between proteins’ centralities and their rates of evolution, we sought to determine whether this association reflected a direct effect of network position or, on the contrary, it was the result of the distribution across the network of a number of factors that correlate both with rates of evolution and network centralities. Levels of selective constraint acting on a gene are known to depend on a number of factors, among which the level and breadth of gene expression seem to be the most important (Duret and Mouchiroud 2000; Pál et al. 2001; Krylov et al. 2003; Subramanian and Kumar 2004; Wright et al. 2004; Drummond et al. 2005, 2006; Ingvarsson 2007; Slotte et al. 2011; Yang and Gaut 2011). In agreement with previous reports, we observed that the ω and dN values exhibit a strong negative correlation with both expression level and breadth in our data set (table 1). Additionally, we observed that, similar to the PINs of other organisms (Bloom and Adami 2003; Lemos et al. 2005), the most central genes in the Arabidopsis PIN are more highly and broadly expressed than those acting at the periphery: expression levels positively correlate with betweenness and closeness and expression breadths with betweenness (table 1). Combined, these observations raise the possibility that the correlation between centrality measures and evolutionary rates could be driven by the higher levels and/or breadths of expression of genes acting at the center of the PIN. However, partial correlation analysis (supplementary table S1, Supplementary Material online) shows that the association between ω and betweenness is independent of expression level and breadth (ρ = −0.038, P = 0.026). Another potential confounding factor is protein length, as it correlates positively with ω (i.e., genes encoding shorter proteins tend to be more selectively constrained; Subramanian and Kumar 2004; Ingvarsson 2007; table 1) and negatively with closeness (i.e., genes acting at the center of the network tend to encode shorter proteins, in agreement with previous observations in Drosophila; Lemos et al. 2005; table 1). However, the correlation between ω and both betweenness (ρ = −0.052, P = 0.001) and closeness (ρ = −0.032, P = 0.048) remains significant when protein lengths are controlled for (supplementary table S1, Supplementary Material online), indicating that the tendency for central genes to evolve under strong levels of selective constraint is also independent of protein length. Furthermore, the correlation between ω and betweenness remains significant when protein length and expression level and breadth are simultaneously controlled for (ρ = −0.037, P = 0.028; supplementary table S1, Supplementary Material online). Genes performing different functions are subject to different levels of selective constraint (e.g., Castillo-Davis et al. 2004; Alvarez-Ponce and McInerney 2011) and encode proteins with different network centralities (Kunin et al. 2004; Cotton and McInerney 2010; Alvarez-Ponce and McInerney 2011). Consistently, we found that genes in different categories exhibit different ω and centrality values (Kruskal–Wallis test, ω: P < 10−15; degree: P < 10−15; betweenness: P = 1.17 × 10−11; closeness: P = 7.32 × 10−10). Therefore, the observed correlation between centrality and evolutionary rate (table 1) could conceivably be the result of these differences. To discard this possibility, the correlation between evolutionary rate and degree was evaluated separately for genes involved in each of the KOG functional categories (Tatusov et al. 2003) (similar results were obtained for betweenness and closeness; data not shown). The analysis was restricted to the 20 KOG categories comprising more than 25 A. thaliana network genes. Remarkably, the correlation coefficient was negative for 18 of the 20 categories (table 2). The correlation was significant for only five of these categories, probably as a result of the reduced statistical power resulting from partitioning the data set. These observations indicate that, in spite of the fact that genes with different functions exhibit different ω and centrality values, the negative correlation between centrality and ω is largely independent of gene function.

Table 2

Correlation between Degree and ω for Each Functional Category

							Correlation
KOG Category	Description	n	Average ω	Average Degree	Average Expression Level	Average Expression Breadth	ω–Degree			ω–Expression Level		ω–Expression Breadth
							ρ	P	Comparison^a	ρ	P	ρ	P
J	Translation, ribosomal structure, and biogenesis	138	0.122	2.36	1,240.4	76.3	0.086	0.316	0.149	−0.379	2.10 × 10⁻⁵**	−0.254	0.005**
A	RNA processing and modification	148	0.169	5.89	327.4	74.3	−0.162	0.049*	0.105	−0.360	2.06 × 10⁻⁵**	−0.211	0.015*
K	Transcription	313	0.204	6.43	178.8	61.0	−0.118	0.037*	0.143	−0.233	7.71 × 10⁻⁵**	−0.167	0.005**
L	Replication, recombination, and repair	73	0.162	3.85	147.9	62.7	−0.203	0.086	0.138	−0.223	0.074	−0.171	0.173
B	Chromatin structure and dynamics	45	0.167	3.91	357.6	73.2	−0.357	0.016*	0.025*	−0.436	0.003**	−0.198	0.202
D	Cell cycle control, cell division, and chromosome partitioning	107	0.189	10.24	244.6	63.5	−0.092	0.344	0.526	−0.444	5.09 × 10⁻⁶**	−0.487	4.26 × 10⁻⁷***
T	Signal transduction mechanisms	305	0.128	6.47	223.8	65.8	−0.168	0.003**	0.009**	−0.291	5.11 × 10⁻⁷***	−0.314	5.19 × 10⁻⁸***
Z	Cytoskeleton	68	0.138	4.99	472.7	66.3	−0.205	0.094	0.149	−0.312	0.014*	−0.252	0.050*
U	Intracellular trafficking, secretion, and vesicular transport	180	0.117	4.91	369.3	73.8	−0.040	0.592	0.876	−0.359	2.47 × 10⁻⁶**	−0.280	2.99 × 10⁻⁴**
O	Posttranslational modification, protein turnover, and chaperones	342	0.137	6.35	580.3	70.0	−0.069	0.203	0.450	−0.309	4.90 × 10⁻⁸***	−0.181	0.002**
C	Energy production and conversion	121	0.106	3.24	936.5	73.9	−0.197	0.031*	0.067	−0.342	2.72 × 10⁻⁴**	−0.418	5.99 × 10⁻⁶**
G	Carbohydrate transport and metabolism	129	0.099	4.45	803.4	71.0	−0.165	0.062	0.129	−0.461	1.67 × 10⁻⁷***	−0.296	0.001**
E	Amino acid transport and metabolism	91	0.128	2.45	720.2	73.3	−0.069	0.517	0.733	−0.389	5.14 × 10⁻⁴**	−0.332	0.003**
F	Nucleotide transport and metabolism	32	0.103	2.97	697.2	76.0	−0.063	0.733	0.861	−0.130	0.518	0.025	0.900
H	Coenzyme transport and metabolism	26	0.100	3.38	480.5	72.2	−0.030	0.883	0.998	−0.237	0.253	−0.250	0.228
I	Lipid transport and metabolism	63	0.141	2.49	431.0	71.9	−0.165	0.197	0.292	−0.443	4.41 × 10⁻⁴**	−0.249	0.055
P	Inorganic ion transport and metabolism	68	0.155	3.46	465.4	67.3	−0.134	0.275	0.392	−0.044	0.736	−0.047	0.722
Q	Secondary metabolites biosynthesis, transport, and catabolism	47	0.128	2.09	308.7	57.6	−0.190	0.200	0.284	−0.453	0.003**	−0.279	0.077
R	General function prediction only	425	0.160	4.12	297.5	69.4	0.013	0.783	0.364	−0.235	3.56 × 10⁻⁶**	−0.175	6.18 × 10⁻⁴**
S	Function unknown	156	0.158	3.35	306.1	72.1	−0.053	0.512	0.776	−0.185	0.030*	−0.031	0.718

Note.—Only functional categories comprising more than n = 25 A. thaliana network genes are presented.

aComparison of the correlation for genes within the category versus the correlation for all other network genes. Comparisons were conducted using Fisher’s z-transformation, followed by standard normal comparison. The suitability of this test for Spearman’s correlations is discussed in Myers and Sirois (2006).

*P < 0.05.

**P < 0.01.

***P < 10−6.

Correlation between Degree and ω for Each Functional Category Note.—Only functional categories comprising more than n = 25 A. thaliana network genes are presented. aComparison of the correlation for genes within the category versus the correlation for all other network genes. Comparisons were conducted using Fisher’s z-transformation, followed by standard normal comparison. The suitability of this test for Spearman’s correlations is discussed in Myers and Sirois (2006). *P < 0.05. **P < 0.01. ***P < 10−6. Remarkably, correlation coefficients are highly variable across the different KOG categories (table 2). First, pairwise comparison shows that 10 pairs of categories exhibit significantly different correlation coefficients (supplementary table S2, Supplementary Material online). Second, for each functional category, we compared the correlation coefficient for genes within that category with the correlation coefficient for all other genes (i.e., all network genes not belonging to that category); this analysis shows that categories “signal transduction mechanisms” (P = 0.009) and “chromatin structure and dynamics” (P = 0.025) exhibit significantly higher correlation coefficients than the rest of the network (table 2). Taken together, these results indicate that the extent to which sequence evolution is affected by protein centrality is dependent on gene function. In fact, some categories exhibit strong negative ω-degree correlations that are comparable in strength to (or even stronger than) the ω-expression level and the ω-expression breadth correlations (tables 1 and 2). Finally, because genes encoding ribosomal proteins are highly conserved, and presumably highly connected, the association between centrality and evolutionary rate observed here could potentially be due to these genes. However, when these genes are eliminated, the correlation between ω and betweenness remains significant (ρ = −0.062, P = 1.34 × 10−4) and the correlation between ω and degree—which was not significant in the complete data set (table 1)—reaches significance (ρ = −0.041, P = 0.011). Furthermore, these correlations remain significant when expression level, expression breadth, and protein length are simultaneously controlled for (degree: ρ = −0.038, P = 0.029; betweenness: ρ = −0.042, P = 0.014). Taken together, these observations point to a direct effect of the position of proteins in the A. thaliana PIN on the rates of evolution of their encoding genes.

Genes Encoding Physically Interacting Proteins Are Subject to Similar Levels of Selective Constraint

We considered whether interacting proteins in the A. thaliana network are subject to relatively similar levels of selective constraint compared with random protein pairs. For that purpose, we computed the normalized absolute difference between the evolutionary rates of all pairs of interacting genes in the network (X): Here, m is the number of interactions in the network, and ω1 and ω2 are the ω values of the two genes involved in interaction i. Self-interactions (i.e., interactions among proteins encoded by the same gene, a total of 383 in the data set) were not considered in this analysis, as PINs are enriched in such interactions, which can inflate the average similarity between interacting genes (Ispolatov et al. 2005; Alvarez-Ponce and McInerney 2011). The observed value (X = 0.892) was compared with a null distribution obtained from a collection of random networks with the same proteins, number of interactions, and degree for each node, which we generated from the original network by repeatedly switching pairs of edges (as in Luisi et al. 2012). Out of 10,000 random networks, only 426 showed an X value lower or equal to the observed one, indicating that interacting proteins exhibit rates of evolution that are more similar than expected from a random network (P = 0.0426; fig. 3). Random networks exhibit an average X value of 0.900, indicating that the ω values for interacting genes are 0.8% more similar than random pairs of proteins.

Normalized absolute difference between the ω values of pairs of interacting genes in the network (X). The arrow points to the observed value and the histogram represent the null distribution obtained from 10,000 random networks.

Ribosomal proteins might also represent a source of confounding bias, as they are subject to strong levels of purifying selection and are highly connected to each other. However, significant results were obtained when ribosomal proteins were also removed from the network (X = 0.889, P = 0.039).

Centrality and Duplicability in the A. thaliana PIN

Finally, we studied the effect of proteins’ network position on the duplicability of their encoding genes. Among the 5,789 genes encoding the A. thaliana PIN, 883 were found to be singleton, and 3,532 were deemed as duplicated based on similarity searches (the rest of the genes remained unclassified; see Materials and Methods). Among duplicated genes, 1,540 are the result of one of the WGD events that took place in the Arabidopsis lineage (Blanc et al. 2003; De Bodt et al. 2005), and 1,992 were deemed as the result of SSD events. Unexpectedly, proteins encoded by duplicated genes have more interactors in the network (average degree of 5.18) than those encoded by singleton genes (average degree of 4.10) (Mann–Whitney test, P = 0.008; fig. 2). These differences remain significant when self-interactions and interactions among proteins encoded by paralogs are removed from the analysis (P = 0.049), or when genes are classified as duplicated or singleton based on whether they present annotated paralogs in the Ensembl plants database release 14 (Kersey et al. 2012) (P = 1.62 × 10−4). Furthermore, degree positively correlates with the number of paralogs annotated in Ensembl plants (ρ = 0.089, P = 1.02 × 10−11), even when only duplicated genes are considered (ρ = 0.086, P = 5.85 × 10−9). This observation is in agreement with the patterns observed in the human interactome (Liang and Li 2007; D'Antonio and Ciccarelli 2011; Doherty et al. 2012) but in contrast with those observed in E. coli, yeast, worm, and fly, whose duplicated genes tend to encode lowly connected proteins compared with singleton genes (Hughes and Friedman 2005; Prachumwat and Li 2006; Makino et al. 2009; D'Antonio and Ciccarelli 2011). Among duplicated genes, those resulting from WGDs are more highly connected in the Arabidopsis network than those resulting from SSD events (average degree of 5.70 and 4.78, respectively; Mann–Whitney test, P = 5.63 × 10−6; fig. 2).

Average number of interactors for proteins encoded by singleton, whole-genome duplication (WGD), and small-scale duplication (SSD) genes. Error bars represent the standard error of the means.

Average number of interactors for proteins encoded by singleton, whole-genome duplication (WGD), and small-scale duplication (SSD) genes. Error bars represent the standard error of the means. D’Antonio and Ciccarelli (2011) found that the relationship between centrality and duplicability was different among ancient and metazoan-specific human genes: among genes of premetazoan origin, duplicated genes are less central than singleton genes, whereas among genes that are specific to metazoans, duplicated genes tend to occupy more central positions than singleton genes. We considered whether the relationship between centrality and duplicability observed in Arabidopsis also depended on the age of the genes. For that purpose, genes were classified as “ancient” (if they had homologs in prokaryotes or in nonplant eukaryotes) or as “land plant-specific” (otherwise). A total of 3,114 network genes were classified as ancient, and the remaining ones were deemed as plant specific. We found that, among plant-specific proteins, those encoded by duplicated genes are more highly connected than those encoded by singleton genes (average degrees of 5.39 and 3.85, respectively; Mann–Whitney test, P = 4.73 × 10−6). This difference, however, is not significant among ancient genes (average degrees for proteins encoded by duplicated and singleton genes: 5.06 and 4.41 interactions, respectively; P = 0.484). Therefore, the relationship between centrality and duplicability observed in Arabidopsis seems to be specific to plant-specific genes. Because duplicated Arabidopsis genes are more selectively constrained than singleton genes (Yang and Gaut 2011), the lower evolutionary rates of genes central to the A. thaliana PIN could potentially be a by-product of their enrichment in duplicated genes. However, the correlation between ω and both betweenness (ρ = −0.050, P = 0.015) and closeness (ρ = −0.049, P = 0.016) remains significant when only duplicated genes are considered. Although these correlations are not significant for singleton genes (ρ = −0.044, P = 0.288 for betweenness; ρ = 0.043, P = 0.293 for closeness), this might result from the small number of genes in this category (n = 883).

Analysis of a High-Quality Subnetwork Provides Consistent Results

Currently available interactomes are the result, to a high extent, of the application of high-throughput techniques of protein–protein interaction discovery. As a result, interactomic data sets are subject to high rates of false positives and negatives (Bader et al. 2004; Deeds et al. 2006). Given the potential that this could be affecting our observations, we repeated our analyses in a high-quality subset of the A. thaliana interactome, containing only reliable interactions. Similar to D'Antonio and Ciccarelli (2011), we filtered our data set, retaining only interactions that had been determined using low-throughput techniques (i.e., more accurate analysis of protein interactions on a one-by-one basis), and those that had been identified by two or more high-throughput analyses independently. The filtered network consisted of 4,808 interactions connecting 2,798 proteins, out of which 1,842 are encoded by genes with 1:1 orthologs in A. lyrata. This represents a dramatic decrease in the amount of data when compared with the full data set (only 48.3% of the proteins and 33.5% of the interactions were present in the high-quality subnetwork), which potentially involves a decrease in statistical power but also an increase in the signal-to-noise ratio. Although the correlation between degree and ω is not significant in this reduced data set (ρ = 0.003, P = 0.907), ω values significantly correlate with both betweenness and closeness, with correlation coefficients that are higher in magnitude than those observed in the entire data set (betweenness: ρ = −0.061, P = 0.008; closeness: ρ = −0.108, P = 3.22 × 10−6). The correlation between ω and closeness remains significant when the effects of expression level and breadth, and protein length, are simultaneously controlled for (ρ = −0.061, P = 0.013). Furthermore, genes encoding proteins that are represented in the network are more selectively constrained than those that are not represented in the network (median ω values of 0.120 and 0.176, respectively; Mann–Whitney test, P < 10−15). Consistent with the observations in the full interactome, duplicated genes are more highly connected than singleton genes (average degrees of 3.58 and 2.83, respectively) in the high-quality subnetwork. Although the degrees of both groups are not significantly different (Mann–Whitney test, P = 0.132), the duplicated/singleton degree ratio is equivalent to that observed in the full data set (1.27), suggesting that the lack of significance in the analysis of the high-quality subnetwork may result from the reduced statistical power resulting from trimming the data set.

Discussion

Results presented here provide multiple evidences linking the position of proteins in the A. thaliana PIN and evolutionary forces acting on their encoding genes. Remarkably, genes acting at the most central positions of the network (measured as degree, betweenness, or closeness) are subject to stronger levels of purifying selection than those acting at the network periphery. The trend is independent of the distribution across the network of potential confounding factors (patterns and levels of gene expression, gene duplicability, function, and protein length), suggesting that network position has a direct effect on levels of selective constraint. The observation that proteins with more interacting partners are subject to stronger levels of selective constraint suggests that direct protein–protein interactions impose constraints on protein sequence evolution. Nonetheless, closeness, and in particular betweenness, seem to be better predictors of evolutionary rates than degree (fig. 1 and table 1). These are global measures of network centrality that take into account not only the immediate network context of a protein (as degree does) but rather its position in relation to the entire network. Therefore, the evolutionary rate of a given protein does not only depend on the number of direct interactors but also on its broader network context. In particular, betweenness is a measure of how the flow of information through the network depends on each individual protein (Jeong et al. 2000; Wagner and Fell 2001). Proteins bridging the gap between parts of the network tend to exhibit a high betweenness (Ravasz et al. 2002). Therefore, our observations suggest that proteins exerting a high degree of control on information flow across the network are particularly relevant for the organism’s fitness. These results are in agreement with previous observations in organisms for which dense PINs are readily available (E. coli, yeast, worm, fly, and human; Fraser et al. 2002; Teichmann 2002; Jordan et al. 2003; Hahn and Kern 2005; Lemos et al. 2005; Davids and Zhang 2008; Alvarez-Ponce 2012), which show that genes acting at the center of the network, or those participating in protein complexes, are more selectively constrained, thereby suggesting a general trend across all of life. Furthermore, our results are in agreement with previous observations in yeast, worm, and fly that betweenness is a better predictor of rates of evolution than degree (Hahn and Kern 2005)—a pattern that has been also observed in the yeast metabolic network (Lu et al. 2007)—suggesting a general trend at least in Eukaryotes. Further evidence for the relationship between position of proteins in the A. thaliana PIN and rates of evolution is provided by the observation that genes encoding interacting proteins tend to be subject to similar levels of selective constraint (fig. 3), consistent with previous observations in other interactomes (Pazos and Valencia 2001; Fraser et al. 2002; Lemos et al. 2005; Alvarez-Ponce et al. 2009, 2011; Cui et al. 2009). This similarity might in part result from the mutational compensatory dynamics between amino acids involved in protein–protein interactions (Codoñer and Fares 2008; Fares et al. 2011). It should be noted, however, that covariation in the evolutionary rates of proteins can obey to other factors such as these proteins performing shared biological functions or exhibiting correlated patterns of expression (Clark et al. 2012). These alternative explanations—covariation due to mutational compensation, shared function, or coexpression—are not mutually exclusive. Normalized absolute difference between the ω values of pairs of interacting genes in the network (X). The arrow points to the observed value and the histogram represent the null distribution obtained from 10,000 random networks. Although our analyses reveal a clear association between network position and evolutionary rates, the trend is generally weak in comparison with other correlates of rates of evolution such as gene expression (Duret and Mouchiroud 2000; Pál et al. 2001; Krylov et al. 2003; Subramanian and Kumar 2004; Wright et al. 2004; Drummond et al. 2005, 2006; Ingvarsson 2007; Slotte et al. 2011; Yang and Gaut 2011)—compare, for instance, the ω-degree correlation (ρ = −0.030) with the ω-expression level (ρ = −0.331) and the ω-expression breadth (ρ = −0.245) correlations (table 1). This is in good agreement with previous analyses linking network position and rates of evolution, which often recover weak effects (for review, see Cork and Purugganan 2004; Rocha 2006). The weakness of these effects may be the result of the relatively low fraction of amino acids participating in protein–protein interactions. It is remarkable, however, that the strength of the correlation between degree and rates of evolution is highly variable across the different functional categories. Indeed, for some categories, the correlation coefficients of the ω-degree correlation attain values that are comparable with (or even surpass) those for the ω-expression level and/or the ω-expression breadth correlations (table 2). This suggests that for certain functional categories, protein–protein interactions strongly constrain protein evolution. The ω-degree correlation is significantly higher for proteins in functional categories “signal transduction mechanisms” (P = 0.009) and “chromatin structure and dynamics” (P = 0.025) than for the rest of the network (supplementary table S2, Supplementary Material online). Consistently, it has been recently observed that degree is a strong correlate of rates of evolution among genes involved in the human signal transduction network (Alvarez-Ponce 2012). Despite the weak effect of network position on rates of evolution, a much stronger effect is observed on gene duplicability. On average, proteins encoded by duplicated genes exhibit 26%–27% more interactors than those encoded by singleton genes (fig. 2). This higher connectivity of duplicated genes has been also observed in humans (Liang and Li 2007; D'Antonio and Ciccarelli 2011; Doherty et al. 2012), but the opposite pattern was observed in E. coli, yeast, worm, and fly—in which genes acting at the periphery of the network are the ones that tend to undergo duplication (Hughes and Friedman 2005; Prachumwat and Li 2006; Makino et al. 2009; D'Antonio and Ciccarelli 2011). The differential pattern observed in the human interactome has been shown to be the result of the high duplicability of hubs originated after the emergence of Metazoans (D'Antonio and Ciccarelli 2011). Human ancient genes (those of premetazoan origin), however, exhibit the same tendency as observed in E. coli, yeast, worm, and fly: duplications tend to occur in genes acting at the periphery of the network (D'Antonio and Ciccarelli 2011). This indicates that the particular trend observed in the human interactome is a derived character and hence that the relationship between centrality and duplicability shifted in the vertebrate lineage. Our observations that, also among A. thaliana novel (i.e., plant specific) genes, duplicated genes are more connected than singleton genes indicate that the relationship between duplicability and centrality shifted not only in vertebrates but also in the Arabidopsis lineage. The higher duplicability of proteins acting at the periphery of the E. coli, yeast, worm, and fly PINs has been attributed to the fragility of the network to duplication of its more connected elements. Indeed, the duplication of a given gene is expected to disrupt the dosage balance of the interactions in which it is involved (Birchler et al. 2001; Veitia 2002; Papp et al. 2003), which may have more deleterious effects for genes with a higher number of interactors. On the other hand, highly connected proteins may play a key role in maintaining the robustness of the network, thereby making networks fragile to mutation or loss of these genes (Albert et al. 2000; Jeong et al. 2001). Hence, duplication of highly connected proteins might be favored, as they increase the robustness of the system (Ekman et al. 2006). These and other competing forces probably act with different strengths in different organisms, resulting in the contrasting overall trends observed in available interactomes. Major evolutionary transitions (from prokaryotes to unicellular eukaryotes, to multicellular eukaryotes, and to land plants and vertebrates) were accompanied by reductions of orders of magnitude in effective population sizes. As a result, vertebrates and land plants exhibit effective population sizes that are smaller than those for E. coli, yeast, worm, and fly (Lynch 2007). In populations with a small effective size, natural selection is less efficient, and hence, the fate of mutations is largely determined by random genetic drift. In such populations, natural selection may have small power in removing duplicates of genes involved in a large number of interactions, in spite of the fact that duplication of such genes are expected to be deleterious according to the dosage balance hypothesis (Birchler et al. 2001; Veitia 2002; Papp et al. 2003). This can result in a (partial) suppression of the negative relationship between duplicability and centrality predicted by the dosage balance hypothesis. It is possible that, under such conditions, factors promoting a positive duplicability–centrality relationship can manifest, thereby resulting in the patterns observed in the human (Liang and Li 2007; D'Antonio and Ciccarelli 2011; Doherty et al. 2012) and Arabidopsis (current work) interactomes. The future availability of the interactomes of a broader range of organisms, with different effective population sizes, may enable a better understanding of the effect of this factor on the relationship between centrality and duplicability. Among A. thaliana duplicated genes, those resulting from the WGD events that took place in this lineage (Blanc et al. 2003; De Bodt et al. 2005) are more highly connected in the PIN than those resulting from SSD events (fig. 2). This result is in concert with the dosage-balance hypothesis, which predicts that duplication of a gene with interactors may be deleterious unless its interacting partners simultaneously coduplicate (Birchler et al. 2001; Veitia 2002; Papp et al. 2003). Under this scenario, highly connected genes are more likely to retain duplicated copies if they are the result of WGDs, as simultaneous duplication of the entire genome maintains the stoichiometry of all balanced sets (Veitia 2004, 2005). The current analysis represents, to our knowledge, the first interactome-level evaluation of the relationship between the structure of the A. thaliana PIN and the patterns of molecular evolution of its components. Taken together, results presented here indicate that the network imposes constraints on the patterns of molecular evolution of its components at multiple levels, from sequence evolution to gene duplication. Therefore, genes do not evolve independently but as pieces of a more complex system. Currently, the availability of protein–protein interaction data for A. thaliana is limited in comparison with other model organisms (Stark et al. 2011). As more data become available, the emergence of a more detailed map of the A. thaliana interactome will enable a more detailed understanding of its evolution.

Conclusions

Results presented here provide multiple evidences linking the position of proteins in the A. thaliana PIN and evolutionary forces acting on their encoding genes. In agreement with previous observations in other organisms, genes acting at the most central positions of the network are subject to increased levels of selective constraint, and genes encoding interacting proteins exhibit correlated rates of evolution. Duplicated genes tend to be more central than singleton genes, in agreement with the patterns observed in humans, but opposite to those observed in E. coli, yeast, worm, and fly, thereby indicating that the relationship between centrality and duplicability underwent modification not only in vertebrates but also in plants. Taken together, these results indicate that the A. thaliana PIN impose constraints in the patterns of molecular evolution of its encoding genes at multiple levels.

Supplementary Material

Supplementary tables S1 and S2 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

90 in total

Review 1. Systems biology in drug discovery.

Authors: Eugene C Butcher; Ellen L Berg; Eric J Kunkel
Journal: Nat Biotechnol Date: 2004-10 Impact factor: 54.908

2. Converging on a general model of protein evolution.

Authors: Joshua T Herbeck; Dennis P Wall
Journal: Trends Biotechnol Date: 2005-10 Impact factor: 19.536

Review 3. Evolutionary systems biology: links between gene evolution and function.

Authors: Eugene V Koonin; Yuri I Wolf
Journal: Curr Opin Biotechnol Date: 2006-09-08 Impact factor: 9.740

4. Network-level molecular evolutionary analysis of the insulin/TOR signal transduction pathway across 12 Drosophila genomes.

Authors: David Alvarez-Ponce; Montserrat Aguadé; Julio Rozas
Journal: Genome Res Date: 2009-01-13 Impact factor: 9.043

Review 5. Systems metabolic engineering for chemicals and materials.

Authors: Jeong Wook Lee; Tae Yong Kim; Yu-Sin Jang; Sol Choi; Sang Yup Lee
Journal: Trends Biotechnol Date: 2011-05-10 Impact factor: 19.536

6. Metabolic networks and their evolution.

Authors: Andreas Wagner
Journal: Adv Exp Med Biol Date: 2012 Impact factor: 2.622

7. On the constancy of the evolutionary rate of cistrons.

Authors: T Ota; M Kimura
Journal: J Mol Evol Date: 1971 Impact factor: 2.395

Review 8. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes.

Authors: W H Li; C I Wu; C C Luo
Journal: Mol Biol Evol Date: 1985-03 Impact factor: 16.240

9. Ensembl Genomes: an integrative resource for genome-scale data from non-vertebrate species.

Authors: Paul J Kersey; Daniel M Staines; Daniel Lawson; Eugene Kulesha; Paul Derwent; Jay C Humphrey; Daniel S T Hughes; Stephan Keenan; Arnaud Kerhornou; Gautier Koscielny; Nicholas Langridge; Mark D McDowall; Karine Megy; Uma Maheswari; Michael Nuhn; Michael Paulini; Helder Pedro; Iliana Toneva; Derek Wilson; Andrew Yates; Ewan Birney
Journal: Nucleic Acids Res Date: 2011-11-08 Impact factor: 16.971

10. A simple dependence between protein evolution rate and the number of protein-protein interactions.

Authors: Hunter B Fraser; Dennis P Wall; Aaron E Hirsh
Journal: BMC Evol Biol Date: 2003-05-23 Impact factor: 3.260

26 in total

1. Gene Duplicability of Core Genes Is Highly Consistent across All Angiosperms.

Authors: Zhen Li; Jonas Defoort; Setareh Tasdighian; Steven Maere; Yves Van de Peer; Riet De Smet
Journal: Plant Cell Date: 2016-01-07 Impact factor: 11.277

2. Secreted Proteins Defy the Expression Level-Evolutionary Rate Anticorrelation.

Authors: Felix Feyertag; Patricia M Berninsone; David Alvarez-Ponce
Journal: Mol Biol Evol Date: 2017-03-01 Impact factor: 16.240

3. Evolutionary Perspectives of Genotype-Phenotype Factors in Leishmania Metabolism.

Authors: Abhishek Subramanian; Ram Rup Sarkar
Journal: J Mol Evol Date: 2018-07-19 Impact factor: 2.395

Review 4. Evolution of Gene Duplication in Plants.

Authors: Nicholas Panchy; Melissa Lehti-Shiu; Shin-Han Shiu
Journal: Plant Physiol Date: 2016-06-10 Impact factor: 8.340

5. Recent positive selection has acted on genes encoding proteins with more interactions within the whole human interactome.

Authors: Pierre Luisi; David Alvarez-Ponce; Marc Pybus; Mario A Fares; Jaume Bertranpetit; Hafid Laayouni
Journal: Genome Biol Evol Date: 2015-04-02 Impact factor: 3.416