Roberto F S Andrade, Ivan C Rocha-Neto, Leonardo B L Santos, Charles N de Santana, Marcelo V C Diniz, Thierry Petit Lobão, Aristóteles Goés-Neto, Suani T R Pinho, Charbel N El-Hani.
Abstract
This paper proposes a new method to identify communities in generally weighted complex networks and applies it to phylogenetic analysis. In this case, weights correspond to the similarity indexes among protein sequences, which can be used for network construction so that the network structure can be analyzed to recover phylogenetically useful information from its properties. The analyses discussed here are mainly based on the modular character of protein similarity networks, explored through the Newman-Girvan algorithm, with the help of the neighborhood matrix. The most relevant networks are found when the network topology changes abruptly, revealing distinct modules related to the sets of organisms to which the proteins belong. Sound biological information can be retrieved by the computational routines used in the network approach, without using biological assumptions other than those incorporated by BLAST. Usually, all the main bacterial phyla and, in some cases, also some bacterial classes corresponded totally (100%) or to a great extent (>70%) to the modules. We checked the results for internal consistency and found close to 84% of matches in community membership when the results were compared pairwise. To illustrate how to use the network-based method, we employed data for enzymes involved in the chitin metabolic pathway that are present in more than 100 organisms from an original data set containing 1,695 organisms, downloaded from GenBank on May 19, 2007. A preliminary comparison between the outcomes of the network-based method and the results of methods based on Bayesian, distance, likelihood, and parsimony criteria suggests that the former is as reliable as these commonly used methods. We conclude that the network-based method can be used as a powerful tool for retrieving modularity information from weighted networks, which is useful for phylogenetic analysis.
Year: 2011 PMID: 21573202 PMCID: PMC3088654 DOI: 10.1371/journal.pcbi.1001131
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Enzymes associated with the chitin metabolic pathway that are present in more than 100 organisms from the original data set of 1,695 organisms, downloaded from GenBank on May 19, 2007.
| Protein | E.C. number | Domain (#) |
| Acetylglucosamine phosphate deacetylase | 3.5.1.25 | B(170), A(6) |
| Glucosaminephosphate isomerase | 2.6.1.16 | E(23), B(285), A(5) |
| Hexosaminidase | 3.2.1.52 | E(3), B(235) |
| Phosphoglucoisomerase | 5.3.1.9 | E(16), B(472), A(12) |
| UDP-acetylglucosamine pyrophosphorylase | 2.7.7.23 | E(2), B(324), A(2) |
Abbreviations: E = Eukarya; B = Bacteria; A = Archaea; E.C. = Enzyme Commission. The number in parentheses after each letter gives the total number of sequences from individual organisms in that domain for each protein.
Figure 1. The size of the largest connected component (N) versus the threshold similarity σ: a) Acetyl; b) UDP.
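The curve described in this caption can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `largest_component_size`, the toy edge list, and the rule that a link is kept when its similarity is greater than or equal to σ are all hypothetical choices made for the example.

```python
from collections import defaultdict


def largest_component_size(n_nodes, weighted_edges, sigma):
    """Size of the largest connected component when only links with
    similarity >= sigma are kept (assumed thresholding rule)."""
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, similarity in weighted_edges:
        if similarity >= sigma:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v

    sizes = defaultdict(int)
    for node in range(n_nodes):
        sizes[find(node)] += 1
    return max(sizes.values()) if sizes else 0


# Hypothetical toy data: (node, node, similarity in percent).
toy_edges = [(0, 1, 90), (1, 2, 80), (2, 3, 45), (3, 4, 85), (4, 5, 30)]
curve = {s: largest_component_size(6, toy_edges, s) for s in (20, 40, 50, 95)}
print(curve)
```

Scanning σ over a fine grid and plotting the resulting sizes would give a curve of the kind shown in the figure; a sharp drop marks a threshold at which the network breaks apart.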
Figure 2. The distance δ(σ, σ+Δσ) between networks at successive similarity thresholds, with Δσ = 1, around its maximal value: a) Acetyl, with the peak at σ = 42%; b) UDP, with the peak at σ = 51%.
Figure 3. The dendrogram produced by the successive elimination of the links with the largest betweenness in the case of Acetyl: a) for σ = 30% < 42%; b) for σ = 42%, which reveals the modular structure of the network.
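The link-elimination procedure behind such a dendrogram can be sketched as below. This is a minimal illustration, assuming an unweighted, undirected network without self-loops (for instance, one obtained after thresholding the similarities); the function names and the toy graph are hypothetical and this is not the authors' implementation. Edge betweenness is computed with a Brandes-style breadth-first accumulation, and each time a removal increases the number of connected components the new partition is recorded as one level of the hierarchy.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist, n_paths, preds = {source: 0}, {source: 1.0}, {source: []}
        order, queue = [], deque([source])
        while queue:  # breadth-first search counting shortest paths
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], n_paths[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    n_paths[v] += n_paths[u]
                    preds[v].append(u)
        dependency = {u: 0.0 for u in order}
        for v in reversed(order):  # accumulate from the farthest nodes back
            for u in preds[v]:
                share = n_paths[u] / n_paths[v] * (1.0 + dependency[v])
                score[frozenset((u, v))] += share
                dependency[u] += share
    # Every unordered node pair was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            part.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        parts.append(part)
    return parts


def girvan_newman_splits(edges):
    """Remove the highest-betweenness link repeatedly; record each split."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    splits, n_parts = [], len(components(adj))
    while any(adj.values()):
        scores = edge_betweenness(adj)
        u, v = tuple(max(scores, key=scores.get))
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        if len(parts) > n_parts:
            n_parts = len(parts)
            splits.append(sorted(sorted(p) for p in parts))
    return splits


# Hypothetical toy graph: two triangles joined by a single bridge (2, 3).
toy_graph = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
splits = girvan_newman_splits(toy_graph)
print(splits[0])
```

In the toy graph the bridge carries all shortest paths between the two triangles, so it is removed first and the first recorded split separates them. Betweenness is recomputed after every removal, which is what makes the procedure costly on large networks.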
Figure 4. The neighborhood matrix with the 11 modules for Acetyl at σ = 42%.
Figure 5. The standard network representation of Acetyl at σ = 42% (drawn with the Pajek package), with the communities indicated.
We label as C12 the small sub-graphs and isolated nodes that do not constitute a biologically meaningful community.
Summary of the results for each of the five enzyme networks: values of σ corresponding to the largest peaks in the graphs δ×σ; number of nodes; number of distinct organisms; and the number of distinct communities.
| Protein | σ (%) | # nodes | # organisms | # communities |
| Acetylglucosamine phosphate deacetylase | 42 | 176 | 88 | 12 |
| Glucosaminephosphate isomerase | 40 | 313 | 209 | 5 |
| Hexosaminidase | 37 | 238 | 67 | 10 |
| Phosphoglucoisomerase | 37 | 501 | 332 | 6 |
| UDP-acetylglucosamine pyrophosphorylase | 51 | 327 | 245 | 7 |
Values of congruence obtained after pair-wise comparison of the phylogenetic analysis provided by two different networks.
| | A | G | H | P | U |
| A | | 0.79 | 0.73 | 0.93 | 0.91 |
| G | 0.79 | | 0.69 | 0.83 | 0.87 |
| H | 0.73 | 0.69 | | 0.90 | 0.79 |
| P | 0.93 | 0.83 | 0.90 | | 0.95 |
| U | 0.91 | 0.87 | 0.79 | 0.95 | |
The average value of the entries in the table is 84%. Abbreviations: A, acetyl; G, gluco; H, hexo; P, phospho; U, UDP.
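The quoted average can be checked directly from the table. The sketch below simply averages the ten distinct off-diagonal entries (the table is symmetric); the variable names are illustrative only.

```python
from itertools import combinations

# Pair-wise congruence values copied from the table (upper triangle).
congruence = {
    ("A", "G"): 0.79, ("A", "H"): 0.73, ("A", "P"): 0.93, ("A", "U"): 0.91,
    ("G", "H"): 0.69, ("G", "P"): 0.83, ("G", "U"): 0.87,
    ("H", "P"): 0.90, ("H", "U"): 0.79,
    ("P", "U"): 0.95,
}

labels = ["A", "G", "H", "P", "U"]
pairs = list(combinations(labels, 2))  # the 10 distinct network pairs
mean_congruence = sum(congruence[pair] for pair in pairs) / len(pairs)
print(f"{mean_congruence:.0%}")  # 84%
```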
Figure 6. Series of spikes representing the 382 organisms present in each one of the 5 selected enzymes associated with the chitin metabolic route.
Along each series of spikes, color identifies the group the organisms belong to. There is no color correspondence between two network classifications.
Values of congruence obtained after pair-wise comparison of the phylogenetic analysis based on chitin synthase sequences provided by five different methods: Bayesian (B), distance (D), likelihood (L), parsimony (P), and the network method introduced herein (N).
| | B | D | L | P | N |
| B | | 0.74 | 0.82 | 0.51 | 0.82 |
| D | 0.74 | | 0.69 | 0.54 | 0.54 |
| L | 0.82 | 0.69 | | 0.59 | 0.82 |
| P | 0.51 | 0.54 | 0.59 | | 0.59 |
| N | 0.82 | 0.54 | 0.82 | 0.59 | |
Average congruence of N with the four other methods = 69%. Average taken over the six pair-wise comparisons among the four methods (B, D, L, P) = 60%.