Eduardo Corel, Philippe Lopez, Raphaël Méheust, Eric Bapteste.
Abstract
The tree model and tree-based methods have played a major, fruitful role in evolutionary studies. However, with the increasing realization of the quantitative and qualitative importance of reticulate evolutionary processes, affecting all levels of biological organization, complementary network-based models and methods are now flourishing, inviting evolutionary biology to experience a network-thinking era. We show how relatively recent comers in this field of study, that is, sequence-similarity networks, genome networks, and gene families-genomes bipartite graphs, already allow for a significantly enhanced usage of molecular datasets in comparative studies. Analyses of these networks provide tools for tackling a multitude of complex phenomena, including the evolution of gene transfer, composite genes and genomes, evolutionary transitions, and holobionts.
Keywords: bipartite graph; evolution; gene transfer; graph theory; introgression; symbiosis
Year: 2016 PMID: 26774999 PMCID: PMC4766943 DOI: 10.1016/j.tim.2015.12.003
Source DB: PubMed Journal: Trends Microbiol ISSN: 0966-842X Impact factor: 17.079
Figure 1. Key Figure: Different Graph Representations of the Same Gene Sharing among Genomes
(A) Sequence-similarity network (SSN): each node (circle) represents a protein-coding gene sequence; the color and the label of the node represent the genome where the gene is found. Two nodes are connected by an edge (a line linking two nodes) if the pair of sequences fulfills given similarity criteria such as a minimum percentage identity and coverage (i.e., the ratio between the length of the matching parts and the total length of any two sequences). Sequence-similarity networks are analyzed as a partition into connected components (CCs, highlighted as color halos). This partition defines groups of putative gene families, when reciprocal sequence coverage and identity percentage are high [68]: for instance, we can interpret CC1 as a gene family for which two copies are present both in genomes A and B. (B) Genome networks (GNs) can be obtained from SSNs: nodes are genomes (described by color and label); edges connect genomes that share at least one gene family; GNs can be weighted: weights count the number of gene families shared by the two genomes. In the example, A and B share three gene families, but the graph does not specify which ones. (C) Multiplexed networks (MNs) can be, in turn, obtained from GNs by labelling edges in order to identify what gene families are shared: nodes represent genomes; multi-edges represent distinct shared gene families (same color code as the CCs in the SSN); weights count the number of shared genes in each family: the blue edge between A and B corresponds to CC1 in (A) and has therefore weight 2. (D) Bipartite graphs can also be obtained from SSNs; top nodes are genomes; bottom nodes are gene families; edges connect a genome to a gene family if that genome contains at least one representative of the corresponding gene family; weights count the number of genes of that family present in that genome: in the example, node 1 corresponds to CC1 in (A), and has therefore edges incident to genomes A and B, each of weight 2.
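The chain of representations described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene identifiers, genome labels, and edges below are hypothetical toy data (not the data behind the figure), and the SSN edges are assumed to have already passed the identity and coverage criteria. The sketch derives connected components, then a weighted genome network and a weighted bipartite graph from them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: gene identifier -> genome label.
gene_genome = {
    "a1": "A", "a2": "A", "b1": "B", "b2": "B",  # one family, two copies in A and in B
    "a3": "A", "b3": "B",
    "a4": "A", "b4": "B", "c1": "C",
    "c2": "C", "d1": "D",
}
# Hypothetical SSN edges: gene pairs assumed to satisfy the similarity criteria.
ssn_edges = [
    ("a1", "a2"), ("a1", "b1"), ("b1", "b2"), ("a2", "b2"),
    ("a3", "b3"),
    ("a4", "b4"), ("b4", "c1"),
    ("c2", "d1"),
]

def connected_components(nodes, edges):
    """Partition the SSN into connected components (putative gene families)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components

families = connected_components(gene_genome, ssn_edges)

# Bipartite graph: (genome, family index) -> number of genes of that family in that genome.
bipartite = defaultdict(int)
for index, family in enumerate(families, start=1):
    for gene in family:
        bipartite[(gene_genome[gene], index)] += 1

# Genome network: a pair of genomes is linked if they share at least one family.
# Keeping the set of shared family indices also gives the edge labels of a multiplexed view.
shared_families = defaultdict(set)
for index, family in enumerate(families, start=1):
    genomes = sorted({gene_genome[gene] for gene in family})
    for pair in combinations(genomes, 2):
        shared_families[pair].add(index)
genome_network = {pair: len(ids) for pair, ids in shared_families.items()}

print(genome_network[("A", "B")])   # number of families shared by A and B
print(bipartite[("A", 1)])          # copies of family 1 in genome A
```

In this toy example genomes A and B share three families, and family 1 has two copies in each of them, mirroring the situation the caption describes for CC1. Per-family edge weights of the multiplexed network are left out because the caption does not pin down how they are computed when copy numbers differ between genomes.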
Statistics of the Prokaryote–Virus–Plasmid Gene Families–Genomes Bipartite Graphs (a)
| Minimal identity percentage to connect sequences | 30% | 60% | 90% |
|---|---|---|---|
| Number of connected components (CC) | 156 | 375 | 488 |
| Number of CC having only plasmids | 25 | 73 | 155 |
| Number of CC having only viruses | 130 | 299 | 297 |
| Size of the giant connected component (number of nodes) | 6362 | 5143 | 2769 |
(a) For a reciprocal 80% length cover and different identity thresholds.
The data consist of all protein sequences from all complete plasmidic, viral, and archaeal genomes from NCBI (as of November 2013), as well as one complete eubacterial genome for each family. The identity percentage describes the similarity, in terms of the conservation of primary sequences, between pairs of molecules. The higher this ‘identity threshold’, the more similar pairs of sequences must be to be directly connected in a sequence-similarity network. For a high ‘identity threshold’, connected components consist of highly conserved sequences. In a first molecular clock-like approximation, higher ‘identity thresholds’ define groups of sequences that diverged more recently from one another than groups defined with a lower ‘identity threshold’.
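The effect of the identity threshold on the number of connected components can be illustrated with a small sketch. Everything below is hypothetical: the sequence names, lengths, and pairwise hits are invented, and each hit is simplified to a single aligned length that is assumed to apply to both sequences. An edge is kept only when the identity reaches the threshold and the aligned length covers at least 80% of both sequences (reciprocal cover).

```python
# Hypothetical sequence lengths (in residues).
seq_length = {"s1": 100, "s2": 105, "s3": 98, "s4": 300, "s5": 110}

# Hypothetical pairwise hits: (sequence 1, sequence 2, identity %, aligned length).
hits = [
    ("s1", "s2", 95.0, 98),
    ("s2", "s3", 65.0, 95),
    ("s3", "s5", 35.0, 92),
    ("s1", "s4", 97.0, 90),  # high identity, but covers only 30% of s4
]

def passes(hit, min_identity, min_cover=0.8):
    """Keep a hit only if identity and reciprocal cover both reach their thresholds."""
    first, second, identity, aligned = hit
    reciprocal_cover = min(aligned / seq_length[first], aligned / seq_length[second])
    return identity >= min_identity and reciprocal_cover >= min_cover

def count_components(nodes, edges):
    """Count connected components with a simple union-find structure."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for first, second in edges:
        parent[find(first)] = find(second)
    return len({find(node) for node in nodes})

counts = {}
for threshold in (30, 60, 90):
    edges = [(hit[0], hit[1]) for hit in hits if passes(hit, threshold)]
    counts[threshold] = count_components(seq_length, edges)

print(counts)
```

On this toy input the number of components grows from 2 to 3 to 4 as the threshold rises from 30% to 60% to 90%, which is the same qualitative trend as the component counts in the table: stricter thresholds break large components into smaller groups of more conserved sequences.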
Figure 2. Twins and Articulation Points in a Bipartite Graph. (A) Top nodes in this bipartite graph are genomes and bottom nodes gene families. Nodes in each colored ellipse at the bottom form a twin class, since their sets of neighbors (supports encircled by similarly colored ellipses on the top level) are identical (as highlighted by the coloring of their incident edges). (B) Collapsing twin nodes into super-nodes yields a reduced graph, without further bottom twin nodes. The supported groups of host genomes are unchanged, and are now defined as the neighbors of a single super-node. Due to the graph reduction, the green super-node is now an articulation point, since its removal disconnects the nodes in the pink and brown supports.
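A minimal sketch of twin collapsing and articulation-point detection follows. The family names and supports are hypothetical and do not reproduce the figure; articulation points are found by brute force (remove a node, recount components), which is adequate for a toy graph but is not how one would treat a large dataset.

```python
from collections import defaultdict

# Hypothetical bipartite graph: gene family -> set of genomes containing it (its support).
family_support = {
    "p1": {"A", "B"}, "p2": {"A", "B"},                      # "pink" twins
    "g1": {"A", "B", "C", "D"}, "g2": {"A", "B", "C", "D"},  # "green" twins
    "r1": {"C", "D"}, "r2": {"C", "D"},                      # "brown" twins
}

def twin_classes(support):
    """Collapse families with identical neighbor sets into one super-node each."""
    classes = defaultdict(list)
    for family, genomes in support.items():
        classes[frozenset(genomes)].append(family)
    return {"+".join(sorted(members)): set(genomes)
            for genomes, members in classes.items()}

def adjacency(support):
    """Build an undirected adjacency map; node kinds are tagged to avoid name clashes."""
    adj = defaultdict(set)
    for family, genomes in support.items():
        for genome in genomes:
            adj[("family", family)].add(("genome", genome))
            adj[("genome", genome)].add(("family", family))
    return adj

def count_components(adj, removed=None):
    """Count connected components, optionally ignoring one removed node."""
    seen = {removed} if removed is not None else set()
    count = 0
    for start in list(adj):
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return count

def family_articulation_points(support):
    """Family-side nodes whose removal increases the number of components."""
    adj = adjacency(support)
    base = count_components(adj)
    return sorted(family for family in support
                  if count_components(adj, ("family", family)) > base)

reduced = twin_classes(family_support)
print(family_articulation_points(family_support))  # before collapsing twins
print(family_articulation_points(reduced))         # after collapsing twins
```

Before the reduction no family is an articulation point, because each one has a twin that keeps the same genomes connected. After the reduction the green super-node becomes one: removing it separates the genomes of the pink support from those of the brown support, as in panel (B).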
Figure 3. Typical Patterns for Candidate Endosymbiotic Gene Transfer (EGT) and Composite Genes in Sequence-Similarity Networks. (A) Sequence-similarity networks can be used for the detection of distant homologues in eukaryotic genomes. Complete (left) and partial (right) sequence similarity, and how they are translated as different types of edges in the sequence-similarity network (SSN). In black, the percentage of reciprocal cover is high; the sequences are homologous over their entire length. In purple, the cover percentage is low; the sequences are only partly similar, that is, they share a homologous domain. (B) Shortest-path analysis in a sequence-similarity graph can be used for detecting possible endosymbiotic gene transfer (EGT). Indeed, EGT results in a characteristic network pattern: an indirect short path along which all edges indicate homology, connecting two nodes corresponding to diverged sequences present in a given host organism. Green nodes represent eukaryotic sequences; red, bacterial sequences; and yellow, archaeal sequences. Black edges denote complete sequence similarity (>80% length). All shortest paths between eukaryotic sequences that pass through the bacterial and archaeal components are likely candidates for EGT, because this indicates that a first type of eukaryotic sequence has affinities to bacterial sequences while a second type has affinities to archaeal ones. (C) Sequence-similarity networks with edges for complete and partial coverage are also useful for the detection of composite genes. The figure shows a pattern associated with the detection of composite genes. Black edges denote complete (>80% cover) and purple edges denote partial (<80% cover) sequence similarity. The green family is a candidate symbiogenetic composite gene, derived from endosymbiotic lateral gene transfer, since it displays one part with similarity to host-related sequences (yellow) and another part with similarity to endosymbiont-related (blue) genes.
(D) A concrete example of a possible EGT: archaeal sequences are represented in blue, eubacterial in red, and eukaryotic genes in green (there is also a single plasmidic sequence in blue-green on the right). Eukaryotic sequences clearly form two groups, one closer to archaea, one more related to eubacteria. All the sequences have a generic annotation as RNA-pseudouridine synthase, but while the eubacterial (and related eukaryotic) sequences are exclusively tRNA synthases (thus putatively of mitochondrial origin), on the archaeal side (thus possibly of host origin) we find tRNA- as well as rRNA-pseudouridine synthases. It indeed turns out that this family contains two pseudouridine synthase genes that are both present in Saccharomyces cerevisiae, having a similar function but acting on a different substrate: one on the archaeal side, coding for Cbf5p that acts on large and small rRNA [100, 101], and the other on the eubacterial side, coding for Pus4, that acts on mitochondrial and cytoplasmic tRNA-uridine [102].
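The shortest-path pattern of panel (B) can be sketched as follows. The network is hypothetical toy data, all edges are assumed to denote complete similarity (>80% cover), and the search returns one shortest path per pair of eukaryotic sequences rather than enumerating all of them. A pair is flagged as an EGT candidate when the interior of its path contains both bacterial and archaeal sequences.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical node annotation: sequence identifier -> domain of life.
domain = {
    "e1": "eukaryote", "e2": "eukaryote", "e3": "eukaryote",
    "b1": "bacteria", "b2": "bacteria",
    "a1": "archaea",
}
# Hypothetical SSN edges, all assumed to indicate homology over the full length.
edges = [("e1", "b1"), ("b1", "b2"), ("b2", "a1"), ("a1", "e2"), ("e2", "e3")]

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

def shortest_path(adj, source, target):
    """Breadth-first search returning one shortest path, or None if unreachable."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adj[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

def egt_candidates(adj, annotation):
    """Pairs of eukaryotic sequences joined through bacterial and archaeal sequences."""
    eukaryotes = sorted(n for n, d in annotation.items() if d == "eukaryote")
    candidates = []
    for first, second in combinations(eukaryotes, 2):
        path = shortest_path(adj, first, second)
        if path is None:
            continue
        interior = {annotation[node] for node in path[1:-1]}
        if {"bacteria", "archaea"} <= interior:
            candidates.append((first, second))
    return candidates

candidates = egt_candidates(adjacency, domain)
print(candidates)
```

Here e1 sits next to the bacterial sequences while e2 and e3 sit next to the archaeal one, so the pairs (e1, e2) and (e1, e3) are flagged, whereas the directly connected pair (e2, e3) is not. Such hits are only candidates; as in panel (D), functional annotation is needed to interpret them.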