Literature DB >> 31590243

Patterns and Constraints in the Evolution of Sperm Individualization Genes in Insects, with an Emphasis on Beetles.

Helena I Vizán-Rico1, Christoph Mayer2, Malte Petersen2, Duane D McKenna3, Xin Zhou4, Jesús Gómez-Zurita5.   

Abstract

Gene expression profiles can change dramatically between sexes and sex bias may contribute specific macroevolutionary dynamics for sex-biased genes. However, these dynamics are poorly understood at large evolutionary scales due to the paucity of studies that have assessed orthology and functional homology for sex-biased genes and the pleiotropic effects possibly constraining their evolutionary potential. Here, we explore the correlation of sex-biased expression with macroevolutionary processes that are associated with sex-biased genes, including duplications and accelerated evolutionary rates. Specifically, we examined these traits in a group of 44 genes that orchestrate sperm individualization during spermatogenesis, with both unbiased and sex-biased expression. We studied these genes in the broad evolutionary framework of the Insecta, with a particular focus on beetles (order Coleoptera). We studied data mined from 119 insect genomes, including 6 beetle models, and from 19 additional beetle transcriptomes. For the subset of physically and/or genetically interacting proteins, we also analyzed how their network structure may condition the mode of gene evolution. The collection of genes was highly heterogeneous in duplication status, evolutionary rates, and rate stability, but there was statistical evidence for sex bias correlated with faster evolutionary rates, consistent with theoretical predictions. Faster rates were also correlated with clocklike (insect amino acids) and non-clocklike (beetle nucleotides) substitution patterns in these genes. Statistical associations (higher rates for central nodes) or lack thereof (centrality of duplicated genes) were in contrast to some current evolutionary hypotheses, highlighting the need for more research on these topics.

Entities:  

Keywords:  Coleoptera; Insecta; evolutionary rates; gene network; phylogenetic inference; sex-biased genes

Mesh:

Year:  2019        PMID: 31590243      PMCID: PMC6826512          DOI: 10.3390/genes10100776

Source DB:  PubMed          Journal:  Genes (Basel)        ISSN: 2073-4425            Impact factor:   4.096


1. Introduction

Phenotypic and physiological differences among closely related species with highly similar genomes are expected to be the result of differences in the expression profiles of key genes (e.g., [1]). In this regard, understanding the mechanisms underlying differences between males and females of the same species becomes of particular interest. Conspecific individuals of different sexes share most, if not all, of their genome and genetics but sometimes display striking anatomical and physiological differences. Studies using model organisms have demonstrated the existence of significant differences in gene expression profiles between sexes. For example, approximately 30% of genes in the vinegar fly (order Diptera), Drosophila melanogaster, show sex-biased expression, and most of these genes are specific to reproductive tissues [2,3,4]. In fact, it has been proposed that most gene expression in Drosophila is sex biased at some point, exhibiting this bias either throughout the life cycle or in specific developmental stages [5]. Similarly, 5–15% of the genes in the mosquito (order Diptera) Anopheles gambiae genome show differential expression between males and females [6], and approximately 20% of the X-chromosome genes of Tribolium castaneum (order Coleoptera) are regulated differently in each sex [7]. The existence and need for biases in gene expression imply several evolutionary mechanisms that, on the one hand, allow for the bias to occur and, on the other hand, condition the dynamics of changes in the affected genes through time [8]. Sex bias in gene expression can be achieved through linkage to sex chromosomes and dosage compensation, sex-specific alternative splicing, and other mechanisms [9,10,11,12]. However, these mechanisms primarily affect the expression of regulatory elements, which in turn condition the action of the genes themselves, e.g., following a particular sex-specific splicing or protein maturation pathway. Gene duplication is another mechanism that directly allows for new gene expression profiles, including sex-biased ones [13]. Gene duplication offers an immediate solution to differential expression needs by potentially allowing each copy of a gene to acquire unique functionality. It is now viewed as having played an important, if poorly understood, role in the evolution of sex-biased gene expression [14]. Moreover, gene duplication could also be related, in part, to the relaxation of evolutionary constraints on one of the resulting gene copies, which could, in turn, lead to more rapid gene evolution [15]. Rapid gene evolution has classically been proposed as a consequence of sex-biased and particularly male-biased genes [4,5,16,17,18]. However, it is not entirely clear whether it is the bias in expression that results in faster evolutionary rates or if it is because of other features of these genes, such as their frequent tissue specificity, which is also correlated with faster evolutionary rates [19]. It is generally accepted that gene duplication is a major force altering the diversity and characteristics of sex-biased genes, but the connection between sex-biased gene expression and evolutionary rates remains poorly understood [8]. So far, these associations have been studied in just a handful of model organisms, and even though it is theoretically plausible that evolutionary processes and functional patterns are related, it is too early to invoke a general rule. Working toward this generalization first requires determining the unequivocal orthology of sex-biased genes between model and non-model species [20]. Furthermore, it requires assuming that orthology and structural homology correlate with functional homology [21,22]. Another problem lies in the actual definition of sex-biased genes. The concept is intuitive and unambiguous: a sex-biased gene is one with different levels of expression between males and females [23]. However, it is also a quantitative one: how different do the expression levels have to be to elicit the activation of the particular evolutionary mechanisms mentioned above? Other non-trivial issues include the occurrence of pleiotropy, the fact that sex-biased genes may be expressed for alternative functions in different tissues and not necessarily related or restricted to one sex, and protein–protein interactions, so that a specific function takes place through physical and genetic modulation by other proteins. Pleiotropy and protein–protein interactions could modulate or limit the evolutionary dynamics of genes, obscuring or changing the expectations derived from the study of model species. In this study, we aimed to explore the correlation of sex-biased expression with gene duplications and accelerated evolutionary rates in a large evolutionary framework, using non-model organisms for which no gene expression analyses are available. Our work was informed by previous studies involving a model organism (D. melanogaster) and used phylogenetic approaches. The obvious candidates for sex-biased genes are those involved in processes that are exclusive to one sex, for example, spermatogenesis in males [18,24]. Thus, in order to test for these differences, we selected a male reproduction functional group, i.e., a coherent set of genes working together toward a specific reproductive function in males, including genes that are male biased in Drosophila spp. and genes that are expressed both in female and in male tissues or non-reproductive tissues. In particular, the present study focused on an integrated male reproductive function—sperm individualization—which is known to involve the action of both constitutive and sex-biased genes in D. melanogaster with different degrees of tissue specificity. Sperm individualization is one of the final stages in spermatogenesis that resolves spermatids as individual cells from the syncytial male germline cysts [25]. In a very simplified manner, this process involves a number of stages where (1) a syncytial cyst forms around all spermatids resulting from a primary spermatocyte, (2) an individualization complex formed through cytoskeletal mechanisms and membrane formation encapsulates each of the spermatids, and (3) the syncytial cytoplasm is discarded [26]. We investigated the phylogeny and evolution of these genes across the class Insecta, with particular emphasis on the species-rich order Coleoptera (beetles). The insects we studied included several model organisms for which both orthology assessment and expression studies were publicly available (e.g., modENCODE and OrthoDB projects; [27,28]). Given that beetles are proportionally underrepresented in the genomic and gene profiling literature, we mined relevant data from the 1KITE project (http://1kite.org/), thereby broadening representation of beetles in our study and facilitating orthology assessment via phylogenetic approaches [29].

2. Materials and Methods

2.1. Selection of Functional Group and Expression Profiles

The gene browser AmiGO2 [30] was used to search for genes belonging to the gene ontology category “sperm individualization” (GO:0007291), a category that comprises all genes recognized to participate in the aforementioned processes. With this query, we obtained 54 genes, of which 1 was reported only for mammals (Spem1) and was not further considered, and the remaining 53 genes had been previously characterized in Drosophila melanogaster. The DNA coding sequences (CDSs) of these genes were retrieved (in September 2017) from FlyBase [31]. A preliminary blastx default search was conducted using these CDSs as query sequences, revealing that nine of these genes lacked obvious putative homologs in organisms other than Diptera. These genes (dj, dud, fan, mst101(3), nkg, ntc, soti, TTLL3B, and yuri; named based on Drosophila gene nomenclature) were excluded from subsequent analyses. The remaining 44 genes (Table 1) were retained for use in our phylogenetic study and were functionally categorized as (i) unbiased or (ii) sex biased, according to their expression profiles in Drosophila using data publicly available in modENCODE [27]. These expression profiles were mined from Affymetrix tiling arrays (Figure 1), designed to study transcription levels in a large number of Drosophila cell lines and developmental stages, using modMINE [32]. When the expression profiles of males were less than twofold higher or not more than twofold lower than those measured in females, they were not considered indicative of being biased (a criterion applied in previous studies; e.g., [17]). Five of the genes of interest (Cul3, Dark, didum, mlt, and orb2) lacked data in the Affymetrix tiling array experiments, and we deduced their sex-based functional profile based on RNA-seq transcriptome profiles available in modENCODE [27].
Table 1

Genes belonging to the ontology category “sperm individualization” (GO:0007291) in insects. Genes are identified by their names and their corresponding FlyBase ID in the Drosophila melanogaster genome. Information on the general function of the gene and sex biases in expression profiles is also given.

GeneFlyBase IDFunctionExpression Profile
Act5C FBgn0000042cytoskeleton structureunbiased
Ance FBgn0012037peptidaseunbiased
aux FBgn0037218ATP binding cofactor of kinaseunbiased
blanks FBgn0035608siRNA bindingmale biased
Bug22 FBgn0032248cilium organization and assemblyunbiased
CdsA FBgn0010350enzyme (CDP diglyceride synthetase)unbiased
Chc FBgn0000319coated vesicles structureunbiased
ctp FBgn0011760dynein complex assemblyunbiased
Cul3 FBgn0261268protein bindingunbiased
Cyt-c-d FBgn0086907electron carriermale biased
Dark FBgn0263864apoptosome assemblyunbiased
didum FBgn0261397unconventional myosinunbiased
Dredd FBgn0020381enzyme (caspase)unbiased
Dronc FBgn0026404enzyme (caspase)unbiased
Duba FBgn0036180enzyme (deubiquitinase)unbiased
EcR FBgn0000546transcription factorunbiased
eIF3m FBgn0033902translation initiation factorunbiased
Fadd FBgn0038928protein bindingunbiased
gish FBgn0250823enzyme (protein kinase)unbiased
gudu FBgn0031905NAmale biased
heph FBgn0011224mRNA binding (translation repression)unbiased
hmw FBgn0038607motile cilium assemblymale biased
jar FBgn0011225myosinunbiased
klhl10 FBgn0040038substrate recruiting for ubiquitin ligase complexmale biased
Lasp FBgn0063485actin/myosin scaffoldingunbiased
Mer FBgn0086384cytoskeletal protein bindingunbiased
mlt FBgn0265512microtubule removalunbiased
nes FBgn0026630enzyme (lysophospholipid acyltransferase)unbiased
Npc1a FBgn0024320sterol metabolismunbiased
nsr FBgn0034740dynein complex assemblymale biased
orb2 FBgn0264307translation factorunbiased
Osbp FBgn0020626protein bindingunbiased
oys FBgn0033476enzyme (lysophospholipid acyltransferase)unbiased
Past1 FBgn0016693membrane assemblyunbiased
Pen FBgn0011823protein bindingunbiased
poe FBgn0011230calmodulin bindingunbiased
porin FBgn0004363membrane channel proteinunbiased
Prosalpha6T FBgn0032492enzyme (protease)male biased
scat FBgn0011232protein bindingfemale biased
shi FBgn0003392GTPase for microtubule motilityunbiased
skap FBgn0037643ATP binding enzymeunbiased
sw FBgn0003654dynein complex assemblyunbiased
Taz FBgn0026619enzyme (phospholipid transacylase)unbiased
Vps28 FBgn0021814vesicular traffickingunbiased
Figure 1

Heatmap visualization of gene expression scores (log2 of the actual value) of sperm individualization genes in Drosophila melanogaster as derived from RNA-seq data from different stages of adult male and female flies [27].

2.2. Retrieval of Sperm Individualization Gene Orthologs in Insects

The FlyBase IDs for the 44 genes of interest were used as queries to find putative orthologs and their corresponding eukaryotic orthologous group (EOG) identifiers in OrthoDB v9.1 [33]. We retrieved all insect amino acid sequences for each EOG from the database, together with descriptive information about the number of hits and taxonomic redundancy, as well as data on the relative amino acid sequence divergence of each orthologous group as a proxy for the evolutionary rate in each EOG [28]. The representation of Coleoptera in OrthoDB is currently restricted to six species of three infraorders of the suborder Polyphaga (Table 2). In order to increase the representation of Coleoptera in the sample, we mined the genes of interest from transcriptomic data from beetle species available from 1KITE. The species studied included representatives from all four suborders of Coleoptera (Table 2). Moreover, we also searched for these genes in published RNA-seq data from testis of Calligrapha multipunctata (Chrysomelidae), which we expected to be enriched in sperm individualization genes [22]. In order to identify the 44 genes of interest in the assembled beetle transcriptomes, we used the software pipeline Orthograph version 0.5.14 [34]. This software predicts the orthology of nucleotide sequences by mapping their amino acid translation to genes of known ortholog groups using a graph-based best reciprocal hit approach. The pipeline also performs an automatic correction for sequence orientation, frameshifts, and translation. For all Orthograph searches, we used the official gene sets (OGSs) of three reference species: D. melanogaster (dmel_r6.11; http://flybase.org/, [35]); the red flour beetle, Tribolium castaneum (v3.0; http://beetlebase.org/, [36]); and the leaf-cutting ant, Acromyrmex echinatior (v3.8; http://hymenopteragenome.org/acromyrmex/, [37,38]).
Table 2

Beetle species used in the current study and their current systematic placement. Unless specified otherwise, gene sequence data were obtained from 1KITE.

Suborder InfraorderSuperfamilyFamilySpeciesLibrary ID (1KITE)
Archostemata Micromalthidae Micromalthus debilis INSqzbTABRAAPEI-210
Adephaga Aspidytidae Sinaspidytes wrasei WHINSnuyTAAARAAPEI-47
Carabidae Cicindela hybrida INShauTBARAAPEI-21
Dytiscidae Cybister lateralimarginalis INSnfrTADRAAPEI-16
Gyrinidae Gyrinus marinus INSnfrTBERAAPEI-19
Noteridae Noterus clavicornis INShkeTALRAAPEI-37
Myxophaga Hydroscaphidae Hydroscapha redfordi INSntgTARRAAPEI-208
LepiceridaeLepicerus sp.INSytvTAJRAAPEI-19
Polyphaga
“basal Polyphaga”ScirtoideaScirtidae Cyphon laevipennis INSjdsTBDRAAPEI-47
BostrichiformiaBostrichoideaBostrichidae Xylobiops basilaris WHANIsrmTMCLRAAPEI-11
CucujiformiaChrysomeloideaCerambycidae Anoplophora glabripennis a -
Chrysomelidae Calligrapha multipunctata b -
Leptinotarsa decemlineata a -
CleroideaByturidae Byturus ochraceus INShkeTAORAAPEI-43
Cleridae Thanasimus formicarius INShkeTCERAAPEI-79
CoccinelloideaCoccinellidae Rhyzobius pseudopulcher WHANIsrmTMABRAAPEI-9
CurculionoideaCurculionidae Dendroctonus ponderosae a -
TenebrionoideaMeloidae Meloe violaceus INShauTAYRAAPEI-19
Tenebrionidae Tribolium castaneum a -
Zopheridae Bitoma cylindrica WHANIsrmTMAPRAAPEI-39
ElateriformiaBuprestoideaBuprestidae Agrilus planipennis a -
ElateroideaLampyridae Lamprohiza splendidula INShkeTCGRAAPEI-87
ScarabaeiformiaScarabaeoideaScarabaeidae Cetonia aurata pisana WHANIsrmTMAVRAAPEI-53
Onthophagus taurus a -
StaphyliniformiaHydrophiloideaHydrophilidae Hydrochara caraboides INShauTASRAAPEI-13
StaphylinoideaStaphylinidae Ocypus brunnipes INShkeTCMRAAPEI-45

a Beetle model species and data obtained from OrthoDB; b Data available from [22].

Each OGS included the 44 genes belonging to the EOGs of interest. Additionally, Orthograph required a tab-delimited file listing the name of the gene for each EOG and each reference species (obtained from OrthoDB). With this information, Orthograph retrieved from each OGS the genes of interest and aligned the amino acid sequences to create a profile hidden Markov model with which to conduct a forward search for respective candidate homologs in each of the beetle transcriptomes. The resulting hits were compared with a BLAST search against all genes in all OGSs (reverse search), and for each match between the best hit of the reverse search and the ortholog group of the original forward search, the corresponding transcript was assigned to that specific ortholog group [34]. Each Orthograph search produced the single best hit from each of the 1KITE transcriptomes mined for the study and generated separate files for each EOG, one with the original nucleotide data and one with their amino acid sequence translations, including the sequences of both the beetle targets and the reference species.

2.3. Phylogenetic Analyses of Amino Acid Sequences in Insects

Insect amino acid sequences from each EOG and those obtained from the output of Orthograph were aligned with the G-INS-i algorithm of MAFFT v7 [39]. Long autapomorphic insertions in these alignments, possibly corresponding to unrecognized introns, were trimmed manually, as were sequence ends of doubtful quality, typically showing as sequences unaligned beyond one point and longer than the remaining sequences in the alignment, suggesting that the reading frame had been lost and, therefore, the correct start or stop codons were not found either. In a few cases, the protein was retrieved from OrthoDB or the beetle transcripts as disjoint amino acid fragments coming from non-overlapping sequenced transcripts of the same gene. In these cases, the full protein length was reconstituted, and gaps between fragments were filled with missing data. Sequences were secondarily removed from the alignments if they (i) consisted of short fragments usually spanning less than 50% of the gene; (ii) were highly similar and monophyletic for a given species; and/or (iii) were highly divergent in the context of the variability of the alignment, the latter two features assessed based on preliminary phylogenetic analyses of the data. The resulting purged alignments (deposited in Zenodo.org: 10.5281/zenodo.3380181) were analyzed using SMS [40] to identify the models of amino acid sequence evolution best fitting the data. The resulting models were used in maximum likelihood (ML) tree searches executed using the program PhyML v3.0 [41]. Since some of the genes of interest are multi-copy (in principle, OrthoDB identifies duplicated genes from isoforms resulting from alternative splicing), several gene alignments included many more sequences than taxa, and phylogenetic analyses allowed us to easily recognize when these extra sequences represented gene duplications affecting particular taxa or entire clades. In the former case, one representative of an intraspecific duplication was retained, and in the latter, duplicated versions of the gene were separated into independent alignments, which we realigned with MAFFT. Of the gene variants studied, the one including the sperm individualization gene copy in Drosophila was analyzed, assessing the best-fitting evolutionary model again with SMS. ML gene trees were inferred using PhyML, and statistical measures of nodal support were estimated via 100 bootstrap pseudoreplicates.

2.4. Phylogenetic Analyses of Nucleotide Sequences in Beetles

Nucleotide sequence matrices of the genes of interest for Coleoptera were generated by combining the sequences retrieved using Orthograph with the corresponding orthologs of model beetle species (Table 2). Data from model beetle species and from a hemipteroid (to be used as an outgroup in the analyses) were obtained with blastn searches against the nucleotide collection (nr/nt) at NCBI. The match of the retrieved nucleotide sequences with the amino acid sequence obtained from OrthoDB for the same organisms was confirmed with a subsequent blastx search against the reference proteins (refseq_protein) database, also at NCBI. Nucleotide sequences were aligned using the G-INS-i algorithm implemented in the program MAFFT. Low-quality ends were trimmed and short sequences removed, as above. The aligned sequences were also translated into amino acid sequences to assist the alignment by finding reading frame problems and highly divergent regions, which were secondarily removed. ML phylogenetic analyses were implemented using these aligned datasets and the same methods described above for the amino acid data.

2.5. Estimation of Evolutionary Rates

With very few exceptions, the ML gene trees based on amino acid sequences recovered Hymenoptera and Diptera each as monophyletic and usually with strong (typically 98–100%) bootstrap support. These two clades have particularly well-established age estimates based on independent analyses. They were used as calibration points in Bayesian analyses of evolutionary rates and node dating for each gene tree using the software BEAST v1.8.4 [42]. The nodes for these two clades were consistently constrained as monophyletic in all analyses to avoid uninformative topologies, particularly for genes with low phylogenetic signal, and the calibration densities for the time to their most recent ancestors were modeled as follows. For Hymenoptera, we specified a crown age of 309 Ma (291–347 Ma) after [43], approximately modeled in BEAST as a normal distribution with mean = 309 and Stdev = 10; in turn, the crown age of Diptera was assumed to be 265 Ma (256–269 Ma) according to [44] and approximately modeled as a normal distribution with mean = 265 and Stdev = 5. The analyses used substitution models as determined with SMS, an uncorrelated lognormal relaxed clock [45], and a tree prior under the Yule process. The analyses were run initially for 100 million generations, sampling every 10,000th generation, but in most cases, they had to be replicated and results combined until there was good mixing of parameters and all produced stable estimates with acceptably high effective sample sizes (ESS >> 200). In a few cases, typically involving datasets that clearly deviated from a molecular clock (i.e., value of ucld.stdev > 3), the multiple analyses produced erratic results; here, stable results were obtained using an exponential relaxed distribution. Evolutionary rates, as well as node ages, were calculated using Tracer 1.6 [46] on the annotated maximum clade credibility trees obtained by summarizing the post burn-in trees with LogCombiner 1.8.4 and TreeAnnotator 1.8.4 [42]. Nucleotide substitution rates in beetles were assessed using a similar strategy but with constraining the age of Coleoptera using a normal distribution covering the age range based on the estimate for this order as deduced from the previous analyses. Specifically, we extracted this age as the concordant overlap of all confidence intervals for this parameter in the amino-acid-based trees where Coleoptera was monophyletic.

2.6. Statistical Analyses

We tested the hypothesis of no differences in the evolutionary rates of sex-biased genes relative to unbiased genes using a Mann–Whitney U test [47] at a 0.05 significance level, as implemented in the function “wilcox.test” of the R package Stats 3.6.0 [48]. The same test was used to investigate rate differences between genes found as single-copy and as members of multigene families, as well as between genes coordinated in the gene cascade for sperm individualization versus genes participating in this function but not implicated in this interaction network (see below). Finally, genes were tested for differences in absolute evolutionary rates between two main categories based on the overall constancy of those evolutionary rates: genes with relatively homogeneous rates (parameter ucld.stdev < 0.6) and genes with heterogeneous rates (ucld.stdev > 0.6). These tests were conducted using substitution rates estimated from the insect amino acid data and substitution rates for beetles estimated from nucleotide data. In order to recognize possible interactions of the explanatory variables used in these tests, chi-squared permutation contingency tests of independence were run for each pair of categorical variables used to rank all genes, including expression bias, paralogy, network interaction, and rate heterogeneity. These tests used the “perm.ind.test” function of the R package wPerm 1.0.1 [49] with 9999 randomization replicates. In all tests, sample sizes allowed for low type I error rates, between 5% and 10% (Power = 0.80).

2.7. Analyses Constrained by Gene Interactions

Public databases were used to define the subset of physically or genetically interacting genes among those sharing sperm individualization as a unifying function. Specifically, we established the interaction network of Drosophila melanogaster as an interaction model by extracting the information about specific protein–protein physical interactions from BioGRID version 3.4 [50] and that about genetic interactions in metabolic pathways from FlyBase [31]. The obtained graph included 21 nodes (i.e., genes) and 28 edges (i.e., interactions), and the architecture of interactions was used to explore correlations with the evolutionary properties of this subset of genes and with other gene characteristics, including evolutionary rates, patterns of gene duplication, and sex-biased gene expression. Since these genes are not isolated in their function and their interactions, we also considered the total number of known interactions per gene, as shown in BioGRID, as a measure to modulate node importance. We tentatively corrected node importance in every case, calculating the logarithm of the product between node centrality and the absolute number of known interactions per node. Statistical network analyses were carried out with the aid of R tools implemented in the “igraph” package [51] on the undirected connected graph representing all interacting genes. Measures of node centrality or node "importance" in the networks were obtained relative to the number of receiving edges (“closeness”) or their rank (“eigen_centrality”). We estimated the correlation between these variables and evolutionary rates and gene paralogy based on the Spearman rank-order correlation coefficients. Additionally, the network community structure was explored with several node modularity optimization algorithms in “igraph”, including the Clauset–Newman–Moore algorithm (command “cluster_fast_greedy”) and the Louvain method (command “cluster_louvain”; [52]), as well as exact modularity maximization (command “cluster_optimal”) using the algorithm published by [53]. Modularity was also estimated by considering edges instead of nodes and using the algorithm (command “cluster_edge_betweenness”) proposed by [54]. We tested for the existence of differences in evolutionary rates for each resulting group using the Kruskal–Wallis test [55]. Additionally, the homogeneity of rates between bipartitions of the network defined by each of the edges separating groups was investigated using a Mann–Whitney U test at a 0.05 significance level.

3. Results

3.1. Characteristics of Datasets: Composition of Sequence Alignments

The 44 investigated genes involved in sperm individualization (GO:0007291) were present in the subclass Pterygota (winged insects), both in Palaeoptera (mayflies and odonates), which were used as outgroups in all analyses, and Neoptera (the remaining orders of winged insects). Most of these genes showed unbiased patterns of gene expression in Drosophila, except for eight genes (Table 1). Given the lack of similar functional studies in most other insects, these eight genes represented our hypothesis for biased expression in the insect and beetle datasets. The median length of the associated proteins ranged from 89 amino acids in the case of ctp to 2949 amino acids in the case of poe, with an average of 638 ± 529 amino acids per protein. For most of the genes, OrthoDB contributed the amino acid sequences of the six beetle model species to the Coleoptera subset (Figure 2b). The only exceptions were Bug22, ctp, Dark, EcR, Fadd, jar, nsr, and Prosalpha6T, which lacked data for one of the species, Duba for two, and hmw for five. Mapping of orthologous genes using Orthograph from transcriptomes of a selection of 19 beetle species from the 1KITE Project and one testis-specific transcriptome from another beetle species resulted in positive hits in all cases, although with different success rates, possibly related to the quality or source of the transcriptomes. No single species yielded ortholog sequence data for all tested genes, with Rhyzobius pseudopulcher retrieving the highest number of genes (40 out of 44) and two water beetle species, Gyrinus marinus and Noterus clavicornis, retrieving the lowest (27 and 11 genes, respectively). For 70% of the beetle species, we retrieved at least 75% of the genes (Figure 2a). In turn, for all genes analyzed, we found orthologs in the beetle transcriptomes, but with different success rates (Figure 2b). A large proportion of genes (43.2%) were found in at least 19 out of 20 beetle species, and most of them (72.7%) were found in 15 or more beetle species. Conversely, eight genes could not be found in at least half of the species analyzed, with genes such as hmw, ctp, orb2, heph, and Bug22, showing the lowest recovery frequencies (n ≤ 5). The proteins encoded by these genes were shorter than the average but were also typically lacking recognized orthologs in some of the beetle model species (Figure 2b).
Figure 2

Performance of Orthograph searches of sperm individualization gene orthologs in beetle transcriptomes. (a) Number of genes retrieved from each of the non-model beetle species transcriptomes. (b) Number of beetle species yielding ortholog sequences for each of the sperm individualization genes analyzed, with information on the average protein length and showing genes absent in one or more OrthoDB beetle model species as white columns.

3.2. Characteristics of Datasets: Gene Duplications

At the time of this study, OrthoDB curated data for 119 insect species. When more than 119 sequences were retrieved for a particular gene, this informed in most cases of potential multi-copy genes (Table 3). Their actual presence was confirmed in the ML trees when including all the sequences retrieved from OrthoDB and the beetle sequences mined from 1KITE. In these cases, we used the annotation of the D. melanogaster sequence to recognize the sperm individualization paralog of interest. Figure 3 shows a diagram of gene duplications (and some secondary gene losses) as recognized in this study.
Table 3

Summary of sequence characteristics of genes retrieved from OrthoDB. The table lists the number of sequences (N) and species (Sp; with a maximum of 119 species), number of species in which the gene is single copy (Single), the median protein length (L), and relative evolutionary rate (r) as tabulated in OrthoDB. Furthermore, the number (n) of aligned sequences in this study and the alignment lengths (Length), as well as the inferred optimal evolutionary model, are given.

Gene N SpSingleLr n LengthModel
Act5C 51711573760.60---
Ance 238109306310.9290621LG + G + I
aux 1361129411641.211261755JTT + G + I + F
blanks 172107713481.55122719JTT + G + I + F
Bug22 201115362000.6183270LG + G + I
CdsA 1181091014450.76122574JTT + G + I + F
Chc 12311511016760.631301726JTT + G + I
ctp 1219879890.57---
Cul3 153116917800.73133858JTT + G + I
Cyt-c-d 148110741080.65---
Dark 1191069413781.991122193JTT + G + I + F
didum 12611310217931.101242175LG + G + I
Dredd 9181754931.7698676JTT + G + I + F
Dronc 12896784251.67119654WAG + G + I + F
Duba 10599936280.951131087JTT + G + I + F
EcR 1211131055150.83120576JTT + G + I
eIF3m 1161131113870.77130394JTT + G + I
Fadd 9693902461.77108353JTT + G + I + F
gish 129113994410.73124401JTT + G + I
gudu 1251151066891.01124658LG + G + I
heph 206114472850.78114597JTT + G + I + F
hmw 7876742601.3880925JTT + G + I + F
jar 1311119712380.861291375JTT + G + I + F
klhl10 214109546190.96113630LG + G + I
Lasp 113105973210.79121298JTT + G + I
Mer 1201121056050.87129686JTT + G + I
mlt 1241121034771.10130651LG + G + I + F
nes 1221121044741.21128472LG + G + I + F
Npc1a 2161162312560.981241435LG + G + I
nsr 317115383550.92129563JTT + G + I
orb2 115106983510.62105293JTT + G + I
Osbp 158112795970.911301094JTT + G + I
oys 121108975051.03115463LG + G + I
Past1 1241141045340.66120564LG + G + I
Pen 34211675190.83121593LG + G + I + F
poe 1831158429491.081293846JTT + G + I + F
porin 124108982820.86127286LG + G + I + F
Prosalpha6T 126108912770.77125312LG + G + I + F
scat 1301161039421.111331233JTT + G + I
shi 133113948570.671321005LG + G + I
skap 24711684240.80128476LG + G + I
sw 1251151096550.80133755JTT + G + I
Taz 105102992650.91118303LG + G + I + F
Vps28 135112942120.72131213LG + G + I
Figure 3

Schematic consensus phylogeny of insects (drawn from trees in [56,57,58]) showing the inferred evolutionary gains and losses of sperm individualization genes for major insect lineages.

For a total of 21 genes, we had no evidence for duplications or losses in the winged insect lineage. However, 18 genes showed duplications in Pterygota or parts of this evolutionary lineage. Two of these genes, Act5C and Cyt-c-d, were found as part of multigene families, and it was difficult to tell individual copies apart with the available data and as a result of their high similarity. Act5C is part of a gene complex with a deep split in all insects, including actin-related proteins (Arp1 in Drosophila) and several actins resulting from various duplications. We found evidence for at least six actin-like gene copies in Acalyptratae flies (fruit and peacock flies, among others), five in mosquitoes, four in Hymenoptera and Palaeoptera, at least three in Coleoptera and the hemipteroids, and at least two among Lepidoptera. Act5C, in particular, is highly conserved in the whole of Pterygota, and most of the available beetle sequences were retrieved close to this specific Drosophila paralog in the phylogeny. In turn, Cyt-c-d was revealed as a member of a multigene family in most insect groups, including odonates, some hemipteroids, beetles, and some dipterans. The proteins encoded by these genes are short and highly conserved, so paralogs could not be resolved easily, but most beetle sequences retrieved by Orthograph were more similar to the Cyt-c-p copy of the gene in Drosophila. Finally, ctp corresponded to a very short fragment, highly conserved and with evidence for paralogy, though it was not possible to discriminate gene copies. Given the difficulty of discerning orthologs, these three genes were not considered in downstream analyses. Seven of the duplicated genes—namely, Ance/Acer, Bug22, klhl10, Npc1a/Npc1b, nsr, Pen/Kap-α1/Kap-α3, and skap/Sucb—were duplicated in all studied insects and, in some cases, with one of the copies being subsequently lost or further multiplied in particular lineages. For example, orthologs of the Npc1b and Pen copies were lost in Lepidoptera, the sister copy of klhl10 was lost in Diptera, and one copy of nsr was lost in aculeate Hymenoptera. Acalyptratae (Diptera) had three additional copies of nsr (four in Bactrocera tephritid peacock flies); Acer was duplicated independently in Trichoptera, Lepidoptera, and some Diptera; and skap had an additional copy among the Hymenoptera. Overall, 10 genes had lineage-specific duplications. Dredd had several copies in Ephemera alone; orb2 and Osbp were duplicated in some hemipterans; and for Dronc, gish, and Past1, we found evidence for duplications in Coleoptera. Finally, the remaining four genes were duplicated in Diptera: blanks and Cul3, with fast-evolving copies in some dipterans; CdsA in some nematocerans (midges and moth-flies); and Prosalpha6 in Drosophila alone (wherein only the paralog Prosalpha6T, perhaps missing in all the other insects, is male biased). Apart from the lineage-specific losses found for Npc1b, Pen, and the sister copies of klhl10 and nsr, other gene losses detected in our data set affected Dredd (missing in mosquitoes [Diptera: Culicidae]), Duba (lacking in Trichoptera and Lepidoptera), Fadd (absent in some Hemiptera), and hmw (not recorded in Ephemera [Ephemeroptera] or Anopheles [Diptera]).

3.3. Evolutionary Rates of Sperm Individualization Genes in Insecta and Coleoptera

Amino acid sequence matrices of the orthologous sperm individualization genes of insects and nucleotide sequence data of Coleoptera were used to infer gene trees under ML and Bayesian inference and to estimate evolutionary rates (Supplementary Files S1–S4). In general, both methods produced similar gene trees, e.g., with respect to resolving the relationships of the insect orders and some infraordinal relationships (Figure 4a), usually with relatively strong nodal support, and consistent with the current systematic knowledge for insects [56]. However, most trees had relatively poorly resolved deep relationships, particularly within the hemimetabolous insect orders, which were represented by relatively few taxa. In turn, in most beetle trees, the suborders represented by several species were retrieved as monophyletic, but there was no consensus among trees on subordinal relationships (Figure 4b). However, in most cases the topologies were consistent with Polyphaga being sister to the other three suborders (Adephaga, Myxophaga, and Archostemata).
Figure 4

Frequency of maximum likelihood (ML) and Bayesian inference (BI) sperm individualization gene trees resolving particular higher taxa or relationships among them in insects (a) and beetles (b). Key: Ad, Adephaga; Ar, Archostemata; C, Coleoptera; D, Diptera; H, Hymenoptera; Hm, Hemiptera; HOM, Holometabola; Ht, Heteroptera; L, Lepidoptera; My, Myxophaga; Po, Polyphaga; S, Strepsiptera; T, Trichoptera.

Based on the previous phylogenies, the amino acid substitution rates for 41 proteins encoded by sperm individualization genes (excluding the 3 proteins for which orthology could not be confirmed) spanned nearly two orders of magnitude, from 0.000237 amino acid changes per lineage and million years (subs./l./Ma) in the protein orb2 to 0.009667 subs./l./Ma in the protein hmw (Table 4). The average substitution rate for the whole dataset was 0.00239 ± 0.003012 subs./l./Ma. Slightly over half (56%) of these proteins, typically those with lower overall substitution rates, exhibited evolutionary rates inconsistent with a molecular clock, i.e., rates on individual branches with more substantial departures from the mean (ucld.stdev ≥ 0.6).
Table 4

Characteristics of amino acid datasets of sperm individualization proteins of insects deduced from information in public databases (B: unbiased [0] and sex-biased [1] genes; N: non-interacting [0] and interacting [1]), as well as information deduced from their phylogenetic analyses, including duplications (D: single [0] and multicopy [1]), evolutionary rates, evolutionary rate heterogeneity (ucld.stdev), and the estimated age of the clade Coleoptera.

GeneB/D/NSubstitution Rate (×10−3)ucld.stdevAge Coleoptera
hmw 1/0/09.67 ± 1.3492.972-
Dark 0/0/15.27 ± 0.2820.418264.2 [214.3–314.2]
blanks 1/1/14.38 ± 0.3820.483310.6 [248.2–376.1]
Dredd 0/0/14.23 ± 0.2190.309278.4 [239.2–315.8]
Fadd 0/0/14.21 ± 0.2770.345-
Dronc 0/1/13.62 ± 0.2210.391379.8 [339.5–426.0] b
Duba 0/0/13.18 ± 0.2570.566348.2 [248.5–450.2] b
nsr 1/1/03.00 ± 0.2790.672194.1 [127.4–262.5]
Bug22 0/1/02.82 ± 0.2780.559-
scat 1/0/02.73 ± 0.259>3 a267.8 [185.8–359.0]
aux 0/0/12.45 ± 0.1540.386311.0 [252.3–372.7]
nes 0/0/02.27 ± 0.1300.423307.2 [253.7–365.2] b
poe 0/0/02.10 ± 0.174>3 a328.7 [227.3–433.1]
Osbp 0/1/02.05 ± 0.1800.601299.6 [230.3–375.4] b
Npc1a 0/1/01.91 ± 0.157>3 a334.2 [231.5–442.6]
didum 0/0/11.90 ± 0.1260.511337.0 [264.9–419.9]
Pen 0/1/01.78 ± 0.1200.531273.2 [208.2–340.5]
klhl10 1/1/11.69 ± 0.1240.816-
Prosalpha6T 1/1/01.68 ± 0.1510.559248.5 [185.6–315.1]
oys 0/0/01.57 ± 0.1270.558-
gudu 1/0/01.45 ± 0.1150.553347.7 [258.1–444.1]
Ance 0/1/01.42 ± 0.0990.396-
sw 0/0/11.38 ± 0.1713.712247.5 [133.2–374.3]
Taz 0/0/01.33 ± 0.1110.607312.2 [231.9–401.0]
Mer 0/0/11.30 ± 0.1300.936304.5 [208.4–409.9]
skap 0/1/11.15 ± 0.1160.557314.4 [214.9–424.3] b
CdsA 0/0/01.15 ± 0.1030.629350.5 [258.6–440.6] b
jar 0/0/11.06 ± 0.0860.482403.2 [315.0–494.6] b
porin 0/0/00.99 ± 0.1060.779-
Lasp 0/0/10.98 ± 0.1380.930451.5 [288.4–633.0] b
Cul3 0/1/10.98 ± 0.1303.798223.1 [112.6–351.7] b
EcR 0/0/00.93 ± 0.0910.789-
heph 0/0/00.90 ± 0.1183.919-
eIF3m 0/0/10.86 ± 0.0840.480363.0 [277.4–462.1]
shi 0/0/10.74 ± 0.0923.847253.2 [124.3–397.0]
Past1 0/1/10.71 ± 0.0690.713387.4 [273.0–513.2] b
Vps28 0/0/00.63 ± 0.0790.822-
gish 0/1/00.47 ± 0.0704.174262.9 [137.6–401.9]
mlt 0/0/00.42 ± 0.5233.268244.6 [121.4–368.1] b
Chc 0/0/10.32 ± 0.0350.700346.6 [242.1–464.4]
orb2 0/1/00.24 ± 0.0351.414180.4 [52.1–328.6]

a Data analyzed under exponential relaxed clock, with ucld.stdev estimated from inconclusive runs under an uncorrelated lognormal relaxed clock; b Coleoptera is rendered paraphyletic by the inclusion of Strepsiptera.

The analyses of evolutionary rates yielded age estimates for the clade Coleoptera with averages ranging between 180.4 Ma, in the case of orb2, and 451.5 Ma, in the case of Lasp, with broad confidence intervals of 186.2 ± 62.49 Ma on average (Table 4). Coleoptera was recovered as monophyletic in 18 of the analyses, and the overlap of the age confidence intervals obtained for each gene covered a period between 277.4 and 315.2 Ma (except in the case of nsr, which yielded an age much younger than the oldest known beetle fossils) (Figure 5). This time interval was used to restrict the age of Coleoptera in subsequent analyses, and it was consistent with most clade age estimates for Coleoptera obtained in analyses where the beetle clade also included Strepsiptera.
Figure 5

Inferred ages and 95% credibility intervals for the Coleoptera clade (top panel) or a Coleoptera + Strepsiptera clade (bottom panel) based on the molecular clock analyses of amino acid sequence data of sperm individualization genes. The full overlap of age estimates of monophyletic Coleoptera identifies an interval (shaded area) consistent with the proposed age of the group based on fossil data and used here as age prior for the evolutionary rate analyses in beetles.

The above time constraint for Coleoptera produced instantaneous nucleotide substitution rates ranging from 0.00208 subs./l./Ma in the case of the gene nes to 0.01190 subs./l./Ma in the case of Cul3, with an average substitution rate for the whole set of genes investigated of 0.00452 ± 0.002083 subs./l./Ma (Table 5). Slightly over half these genes had substitution rates relatively consistent with a molecular clock (ucld.stdev < 0.6), and in contrast to the case of the amino acid sequence analyses, the genes departing from the molecular clock were those with higher nucleotide substitution rates.
Table 5

Characteristics of the nucleotide phylogenetic data sets of sperm individualization genes in Coleoptera. The number of species (N), length of nucleotide sequence alignments (L), the determined evolutionary model, inferred evolutionary rates, and information on rate heterogeneity (ucld.stdev) are given for each gene.

GeneNLModelSubstitution Rate (×10−3)ucld.stdev
Cul3 252196TN93 + G + I11.90 ± 2.6032.707
gish 241197GTR + G + I9.34 ± 1.9812.821
Act5C 151128GTR + G + I8.91 ± 2.2322.839
scat 251974GTR + G + I7.26 ± 1.3972.958
eIF3m 251155GTR + G + I7.06 ± 1.3942.872
poe 233291GTR + G + I6.90 ± 1.3222.844
Dredd 23732GTR + G + I6.41 ± 1.2553.012
CdsA 211323GTR + G + I6.35 ± 1.2292.954
blanks 25672GTR + G + I5.82 ± 1.1722.994
shi 252610GTR + G + I5.81 ± 1.1612.918
skap 241302GTR + G + I5.09 ± 0.9843.033
Dark 141863GTR + G + I4.96 ± 0.9052.947
Duba 20858GTR + G + I4.93 ± 0.9972.916
Prosalpha6T 25822GTR + G + I4.90 ± 0.9593.041
Chc 224944GTR + G + I4.78 ± 0.4100.270
klhl10 141779GTR + G + I4.48 ± 0.8492.952
jar 253393GTR + G + I4.47 ± 0.8403.005
didum 215073GTR + G + I4.41 ± 0.8173.029
oys 141347GTR + G + I4.41 ± 0.8393.019
ctp 8267GTR + G3.79 ± 1.1070.232
mlt 241248GTR + G + I3.61 ± 0.4640.474
sw 251455GTR + G + I3.61 ± 0.3470.344
Taz 23774GTR + G + I3.52 ± 0.2780.340
Fadd 19228GTR + G + I3.41 ± 0.4950.463
nsr 24780GTR + G + I3.39 ± 0.4640.409
orb2 8834GTR + G + I3.36 ± 0.8340.477
Npc1a 233756GTR + G + I3.32 ± 0.2800.255
Dronc 24954GTR + G + I3.17 ± 0.2650.221
EcR 171278GTR + G + I3.13 ± 0.3960.382
Vps28 26573GTR + G + I3.12 ± 0.5090.169
Osbp 251881GTR + G + I3.09 ± 0.2850.365
Past1 131566GTR + G + I3.09 ± 0.3440.186
aux 182130GTR + G + I3.09 ± 0.2950.252
Lasp 25423GTR + G + I3.08 ± 0.4950.110
gudu 151848GTR + G + I2.93 ± 0.2800.450
Pen 241440GTR + G + I2.92 ± 0.2520.296
porin 25849GTR + G + I2.78 ± 0.3930.499
Mer 231701GTR + G + I2.78 ± 0.2600.329
Ance 181716GTR + G + I2.59 ± 0.1980.299
hmw 8201GTR + G2.57 ± 0.4570.118
nes 251317GTR + G + I2.08 ± 0.1710.538

3.4. Analysis of Rate Differences

Permutation tests of independence produced non-significant results for every pair of independent variables used in subsequent tests, suggesting that there were no interactions among them. The null hypothesis that sperm individualization genes with or without duplications in the insect lineage had the same evolutionary rates was not rejected (Mann–Whitney U = 183, p = 0.758; also treating hemipteroid orb2 and Osbp duplications as non-duplicated genes: U = 157, p = 0.497). Similarly, this hypothesis was not rejected in the case of genes working in coordination in a gene interaction network (like the one deduced for Drosophila) tested against genes dissociated from this network (Mann–Whitney U = 190, p = 0.632). However, when genes were split into two categories according to their predicted sex expression bias, or according to whether they evolved in a clocklike fashion, the null hypothesis of no differences in their evolutionary rates was rejected at the 0.05 significance level (Mann–Whitney U = 52, p = 0.019 and Mann–Whitney U = 323, p = 0.002, respectively). In these cases, sex-biased and clock-constrained genes would have slightly faster rates, except for the male-biased gene hmw, a fast-evolving protein departing nonetheless from a molecular clock. The same tests, when applied to nucleotide substitution rates of the genes of interest in beetles, produced non-significant results when rate differences were tested for predicted expression biases (U = 115, p = 0.817), gene duplications in the beetle lineage (U = 181, p = 0.551), or their predicted coordination in an interaction network (U = 156, p = 0.118). The test produced a clear significant result when rate differences were tested against the clocklike behavior of data (U = 5, p << 0.001), with the genes departing from the molecular clock having much higher rates (genes[ucld.stdev < 0.6]: 0.00314 ± 0.000533 versus genes[ucld.stdev ≥ 0.6]: 0.00620 ± 0.002030).

3.5. Evolutionary Patterns in the Sperm Individualization Interaction Network

The results of the Spearman’s rank correlation tests between amino acid substitution rates and measures of node importance based on the number of receiving edges (S = 956.6, rho = 0.1609, p = 0.5106), rank (S = 712.6, rho = 0.3749, p = 0.1138), or their respective corrections considering the total number of genetic interactions of the nodes of interest (S[edges] = 1340.0, rho = −0.1754, p = 0.4709; S[rank] = 1054.0, rho = 0.0754, p = 0.7592) were all non-significant (Figure 6). Similarly, the correlations between nucleotide substitution rates in beetles and node centrality measures based on the number of receiving edges (S = 1115.3, rho = 0.2758, p = 0.2263) or their tentative correction based on edges (S = 1520, rho = 0.0130, p = 0.9573) and rank (S = 1180, rho = 0.2338, p = 0.3063) were non-significant. However, when node centrality was assessed based on the first eigenvector of the adjacency matrix, their correlation with nucleotide substitution rates was significant (S = 770.5, rho = 0.4997, p = 0.0211), suggesting a slight effect of more densely connected regions of the network having significantly higher evolutionary rates. In turn, there was no evidence for a correlation between genes being single copy or duplicated and any centrality measure without (edges: S = 1772.9, rho = −0.1512, p = 0.5129; rank: S = 2127.7, rho = −0.3816, p = 0.0878) or with correction (edges: S = 1368.7, rho = 0.1112, p = 0.6312; rank: S = 1980.5, rho = −0.2860, p = 0.2088).
Figure 6

Biplots of the correlation between different measures of node importance in the interaction network of sperm individualization genes and their amino acid evolutionary rates in insects (top panels) and nucleotide substitution rates in beetles (bottom panels).

Network modularity measures split the gene interaction network into four groups when using edge-based partitioning or five groups when using node-based partitioning, with considerable agreement between strategies (Figure 7). Exact modularity and the Clauset–Newman–Moore algorithm produced identical groupings, differing from the edge-based solution in the transfer of one node (Dredd) to an adjacent group and the split of two nodes (Act5C and Lasp) as an additional group. The Louvain modularity produced groups identical to the other node-based methods, but transferring one node (Past1) into the adjacent group. None of these global partitioning strategies showed statistical differences in amino acid substitution rates (“edge_betweenness”, chi-sq = 2.8835, p = 0.4099; “optimal”, chi-sq = 2.5958, p = 0.6276; “louvain”, chi-sq = 2.4258, p = 0.6580) or nucleotide substitution rates (“edge_betweenness”, chi-sq = 4.9143, p = 0.1782; “optimal”, chi-sq = 5.4316, p = 0.2458; “louvain”, chi-sq = 4.9848, p = 0.2889). However, when different group bipartitions of the network were considered, the edge between Dronc and shi (ABC and DE clusters in Figure 7) delimited groups with different amino acid substitution rates (U = 72, p = 0.0279) and different nucleotide substitution rates when beetle data were considered both for edge (U = 83, p = 0.0409) and for node (U = 93, p = 0.0062) partitions. Nucleotide substitution rates were also statistically significantly different across the edges joining Dredd and Dronc (U = 80, p = 0.0200; AB and CDE clusters in Figure 7) and Chc and Past1 (U = 69, p = 0.0147; ABCD and E clusters in Figure 7).
Figure 7

Mutual interaction network of sperm individualization genes in Drosophila including protein–protein (lines) and genetic/regulatory interactions (arrows), the latter with information on the enhancing and/or repressing modulation effects. Nodes represent interacting proteins, and they are color coded according to their inferred amino acid evolutionary rates. Edges represent documented interactions between proteins, with their width being proportional to the evolutionary rate differences between interacting proteins (dashed lines are used when their evolutionary rate data are missing). Dashed-line and solid-background polygons show the edge-based and node-based partitions of the network, respectively. More details and alternative partitioning schemes are described in the main text.

4. Discussion

4.1. Data Mining Genomic and Transcriptomic Resources: Sequence Quality

The results of studies exploiting genomic and transcriptomic resources depend on their quality and curatorial status, regardless of how complex and efficient the bioinformatic approaches used to extract this information are [57,58]. Usually, the scale and complexity of studies using “big data” prevent end-user control of their quality [59], and data may include unnoticed errors (e.g., incorrect taxonomic assignments or shifts in reading frames) or may have escaped objective quality filters (e.g., low sequence quality or assembly problems). Here, we used several public databases of annotated sequence data, including GenBank, FlyBase, modENCODE, OrthoDB, and BioGRID, as well as the partially released 1KITE database. Each may have contributed particular biases to the results, but the amount of data was still amenable to manual control of the different analytical steps, allowing for the recognition of problems and for hopefully avoiding them by iterative analytical exploration and filtering of the data. The first challenge we had to address after mining the sequence data, and before all analyses, was filtering what we interpreted as noisy sequence data or suspicious annotations in the data sets. Sequence quality was a major concern when using data directly mined from sequence repositories. Thus, we identified, through iterative assessment, two main criteria for the total or partial removal of potentially noisy data. These were (i) long autapomorphic insertions in amino acid sequences which may result from unrecognized introns and (ii) highly divergent, unalignable regions, typically at the ends of sequences, due to compensated nucleotide gains/losses in that part of the sequence, locally affecting the reading frame. Reiterated multiple sequence alignments also allowed recomposing the proteins that appeared in OrthoDB as non-overlapping fragments for some taxa into a single sequence. However, when this situation affected duplicated genes, there was a risk of joining fragments of non-orthologous proteins, which we addressed by using phylogenetic trees to inform manual curation [29]. We gained additional insight into the aforementioned problems by merging annotated and curated amino acid sequence data from OrthoDB with translated nucleotide sequence data from 1KITE beetle transcriptomes. Some of the latter sequences showed precisely the same translation problems affecting homology as were found for the insect protein data, and they were filtered according to the same criteria specified above.

4.2. Data Mining Genomic and Transcriptomic Resources: Orthology Assessment

Orthology assessment was particularly crucial in the examination of beetle data mined directly from raw transcriptomes, and here, this assessment was particularly important because orthology provided our best hypothesis for conserved gene function. For most EOGs of interest for which we searched the transcriptomes, the pipeline yielded a phylogenetically cohesive group of potentially orthologous sequences with their paralogs when they were present in the transcriptome. The efficiency of Orthograph in this respect was demonstrated when mistakes were made. For example, a bad specification of the EOG corresponding to the gene Pen initially resulted in predicted beetle orthologs for one of the other importin-alpha genes in insects, which could be identified and corrected in our iterative phylogenetic approach. For six genes, however, the analyses picked up at least two paralogs. Two corresponded to Act5C and Cyt-c-d, which we already described as challenging to separate in the respective duplicated copies, even using phylogenies. The other four are more difficult to explain, and recognizing them required phylogenetically informed decisions; they were removed from the analyses a posteriori. Of these, two were genes for which we revealed duplications in beetles, Dronc and gish (for the latter, we found the beetle-specific paralog only in Xylobiops [Bostrichidae]). The other two were Npc1a, for which the correct sperm individualization ortholog was identified in 16 beetle transcriptomes and its paralog Npc1b in Lepicerus sp., and klhl10, for which the copy missing in Diptera was found in Micromalthus debilis and Lamprohiza splendidula. For all of these genes, we have strong evidence hinting at them being duplicated in the beetle genomes, yet we retrieved one of the copies in most species and the other copy in one or just a few transcriptomes. If these genes are indeed duplicated, the reason why both copies were not found consistently in all beetle transcriptomes may be related to how the program Orthograph works, i.e., retrieving a single best reciprocal hit, the putative ortholog. In these circumstances, and analogously to ranked results of BLAST searches, the correct, biologically meaningful sequence may be missed after yielding a suboptimal hit, perhaps because of sequence quality and/or length issues or the absence of the ortholog of interest in some of the transcriptomes.

4.3. Evolutionary Dynamics of Sperm Individualization Genes

All qualitative traits that were used to rank sperm individualization genes in insects were statistically independent. This implies that, at least for this subset of genes, some evolutionary predictions do not apply, including the association of sex-biased gene expression with an origin attributed to gene duplications [18]. Apart from duplications, we also recorded gene losses, because it has been hypothesized that the rate of turnover (i.e., lack of 1:1 orthology) for sex-biased—particularly male-biased—genes may be higher than for other genes [60,61]. Among sperm individualization genes, we found lineage-specific losses for both biased and unbiased genes without statistical differences between groups (chi-sq = 2.4529, p = 0.1450), and our phylogenetic analyses, in fact, show that preservation and genomic dosage of sperm individualization genes are generally highly conserved across the Insecta despite their long evolutionary history. In our analysis of the correlation between the different ways in which we ranked sperm individualization genes and their inferred evolutionary rates, only two instances of statistically significant differences were obtained. The first relates to the overall homogeneity of substitution rates both for insect amino acid and for beetle nucleotide sequence data (even if with opposite signs). The second and most interesting, considering the deep evolutionary time considered and the assumption of conservation of gene functionality across this time scale, was for sex-biased genes, which had different and significantly higher evolutionary rates than unbiased genes. The fact that sex-biased genes, and, more specifically, male-biased genes, evolve more rapidly than unbiased genes is a well-known general evolutionary pattern documented from a diversity of organisms [5,60,61,62,63,64,65,66,67,68,69]. However, it is surprising that this signature is still present across some 400 million years of evolution when it remains unclear whether gene functionality and sex bias in their expression have been conserved. If these features changed during the course of evolution, it is still possible that faster rates in this case could be related to other expression features, such as tissue specificity and narrow expression profiles [65]. Indeed, faster rates of evolution associated with sex-biased expression have been explained as the result of several potential causes, including participation in specific processes such as spermatogenesis [62,69], activation in reproductive tissues relative to genes expressed in several tissues [13,70], linkage to the homogametic sex chromosome [13,70], relatively low levels of expression [71], or circumscription to specific stages of development [5]. These correlations are far from universal, and there are exceptions to each of the proposed patterns [5,68,69], much depending on the organism under study but also on their life histories. For example, female mating behavior in different species of Anopheles [Diptera: Culicidae]—some species of which are polyandrous, while others mate once in their lifetime—may have different impacts on sperm competition and selection and, consequently, on the evolutionary dynamics of sperm-related genes [69]. Moreover, while these factors could potentially lead to faster rates of evolution in sex-biased genes, protein–protein interactions could effectively constrain them [72], a possibility that will be discussed below.

4.4. Evolutionary Dynamics of Interacting Sperm Individualization Genes

Genetic interactions act as a dominant force explaining evolutionary rates, and the nature and type of interaction may prevail over other factors, such as the characteristics of gene expression [73]. There are hypotheses on how these two features may interact, such as the expected negative correlation between the number of protein interactions and evolutionary rates, or the proposition that interacting proteins should evolve at similar rates [74]. The micro- and macroevolutionary analyses of the effect of these interactions have facilitated significant advances in our understanding of these processes. On the one hand, our knowledge on the structure of genetic interaction networks, also for non-model organisms, is more detailed. On the other hand, the development of explicit, quantitative methods allows us to evaluate the architecture and properties of the networks relative to the biological features of their elements, particularly in the case of metabolic networks [75,76,77,78]. Among typical macroevolutionary patterns related to the protein–protein interaction network structure, it has been proposed that duplicated genes tend to be more highly connected in such networks [77]. The sperm individualization network shows an area that concentrates duplicated and relatively highly connected proteins (e.g., Dredd, Dronc, and skap); however, there was no statistical support for a correlation between these features. Correlations were found, nonetheless, for evolutionary rates when the undirected network was bipartitioned, adding statistical support to the intuitive notion of faster-evolving genes and proteins (blanks, Dark, Dredd, Dronc, Duba, and Fadd) appearing concentrated in one region of the network. Furthermore, we found a positive correlation between rank-based centrality and nucleotide substitution rates for beetles. In general, the opposite trend tends to be the norm, and highly connected genes usually show slower rates of evolution, maybe because the protein function depends on more topological interactions with other proteins, which constrain the possibility of change [74,75]. However, this is a controversial topic, and other examples of faster-evolving core proteins in an interaction network exist, such as the analysis of transcriptional networks in yeast [79]. In any case, it is too early to draw conclusions about the evolutionary trends in sperm individualization genes. Significant results were only obtained for beetles, for which we lack empirical evidence of the same gene interactions known in Drosophila, and different evolutionary dynamics seem to operate depending on the overall function of the network. This highlights the necessity for further research in beetles beyond the model organism T. castaneum. The lack of significant or consistent results between analyses employing insect amino acid and beetle nucleotide sequence data may be explained, in part, by the partial view of the actual interactions in which sperm individualization genes participate. It is possible that the real nature and number of these interactions is not captured by the necessarily crude correction applied here (i.e., total number of receiving edges in the interactome). The structural measures obtained from the interaction network are intrinsic and the represented network of interactions is not isolated; therefore, these measures can show some biases [75]. A poorly connected node in the sperm individualization network can have many connections to other functional domains of the cell. For example, the proteins Fadd and Mer physically interact with the products of three and four other sperm individualization genes, respectively; however, in the complete interactome of Drosophila, they are known to interact physically or genetically with 100 and 165 other proteins, respectively. As already mentioned, another possibility is that the interaction network described for Drosophila is not universal for insects, totally or partially, and that the enforced topology is unable to capture evolutionary constraints for these genes in insects, or that the actual evolutionary dynamics of beetles are different from general trends in insects. Nevertheless, we tried to find intrinsic patterns that could be associated with the coordination of the genes of interest in a specific function, and at least in the case of beetles, there could be a signature worth exploring from a functional point of view. While we identified statistically significant differences in the rate of amino acid substitution in insects depending on hypothesized sex-biased expression, the study of nucleotide substitution rates in beetles for the same genes did not reveal any significant pattern. A somewhat reverse pattern was obtained in our exploration of evolutionary rates constrained by the architecture of a hypothesized network of interaction, wherein mainly nucleotide substitution rates of beetles showed some correlation with this architecture. This apparent contradiction and the complexity of the factors involved in explaining evolutionary rates make it difficult to fully explain these patterns satisfactorily. Before we can do that, we need more in-depth insight into the temporal and spatial expression profiles, effective function, genetic interactions, and pleiotropic effects of these genes in every single species, but also to incorporate information on their life history, which is likely to influence their evolutionary dynamics.
  66 in total

1.  Metabolic flux is a determinant of the evolutionary rates of enzyme-encoding genes.

Authors:  Martino Colombo; Hafid Laayouni; Brandon M Invergo; Jaume Bertranpetit; Ludovica Montanucci
Journal:  Evolution       Date:  2013-09-19       Impact factor: 3.694

Review 2.  The evolutionary causes and consequences of sex-biased gene expression.

Authors:  John Parsch; Hans Ellegren
Journal:  Nat Rev Genet       Date:  2013-02       Impact factor: 53.242

3.  Distinct regulatory programs establish widespread sex-specific alternative splicing in Drosophila melanogaster.

Authors:  Britta Hartmann; Robert Castelo; Belén Miñana; Erin Peden; Marco Blanchette; Donald C Rio; Ravinder Singh; Juan Valcárcel
Journal:  RNA       Date:  2011-01-13       Impact factor: 4.942

4.  Widespread adaptive evolution of Drosophila genes with sex-biased expression.

Authors:  Matthias Pröschel; Zhi Zhang; John Parsch
Journal:  Genetics       Date:  2006-09-01       Impact factor: 4.562

5.  Constraint and turnover in sex-biased gene expression in the genus Drosophila.

Authors:  Yu Zhang; David Sturgill; Michael Parisi; Sudhir Kumar; Brian Oliver
Journal:  Nature       Date:  2007-11-08       Impact factor: 49.962

6.  MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors:  Kazutaka Katoh; Daron M Standley
Journal:  Mol Biol Evol       Date:  2013-01-16       Impact factor: 16.240

7.  Hyperexpression of the X chromosome in both sexes results in extensive female bias of X-linked genes in the flour beetle.

Authors:  Eldon G Prince; Donna Kirkland; Jeffery P Demuth
Journal:  Genome Biol Evol       Date:  2010-07-12       Impact factor: 3.416

8.  AmiGO: online access to ontology and annotation data.

Authors:  Seth Carbon; Amelia Ireland; Christopher J Mungall; ShengQiang Shu; Brad Marshall; Suzanna Lewis
Journal:  Bioinformatics       Date:  2008-11-25       Impact factor: 6.937

9.  Genetic dissection of sperm individualization in Drosophila melanogaster.

Authors:  J J Fabrizio; G Hime; S K Lemmon; C Bazinet
Journal:  Development       Date:  1998-05       Impact factor: 6.868

10.  Evolutionary rates and centrality in the yeast gene regulatory network.

Authors:  Richard Jovelin; Patrick C Phillips
Journal:  Genome Biol       Date:  2009-04-09       Impact factor: 13.583

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.