Xiao Deng1, Xuechao Zhao1, Yuan Liang2, Liang Zhang3, Jianping Jiang4, Guoping Zhao2,3, Yan Zhou5,6.
Abstract
BACKGROUND: The genome topology network (GTN) is a new approach for studying the phylogenetics of bacterial genomes by analysing their gene order. The previous GTN tool produces a phylogenetic tree and calculates the different degrees (DD) of various adjacent gene families from complete genome data, but it is limited to the gene family level. RESULT: In this study, we collected 51 published complete and draft group B Streptococcus (GBS) genomes from the NCBI database as the case study data. The phylogenetic tree obtained from the GTN method assigned the genomes into six main clades. Compared with the single nucleotide polymorphism (SNP)-based method, the GTN method exhibited a higher resolution in two clades. The gene families located at unique node connections in these clades were associated with the clusters of orthologous groups (COG) functional categories "[G] Carbohydrate transport and metabolism", "[L] Replication, recombination and repair" and "[J] Translation, ribosomal structure and biogenesis". Thus, these genes were the major factors affecting the differentiation of these six clades in the phylogenetic tree obtained from the GTN.
Keywords: Clusters of orthologous groups (COG); Genome topology network; Genomes; Group B Streptococcus; Phylogenetics
Year: 2019 PMID: 31752672 PMCID: PMC6868693 DOI: 10.1186/s12864-019-6234-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Workflow of the new GTN version. The GTN calculates the whole genome region when only complete genomes are included or the common synteny block regions when draft data are included. Then, the GTN assigns genes to different gene families and calculates the relative DD value and the evolutionary distance on the basis of the gene family assignment. After obtaining the phylogenetic tree, the GTN determines all genes at unique node connections. The steps in red boxes are the modifications included in this GTN version. In f, D(G1, G2) represents the evolutionary distance between genomes 1 and 2, N represents the number of nonredundant families among the total gene families, C_i represents the number of adjacent gene families of ortho_i that are common to both genomes, τ_1i represents the number of adjacent gene families of ortho_i in genome 1 and can be regarded as the number of connections constituting the ortho_i network in genome 1, and τ_2i represents the number of adjacent gene families of ortho_i in genome 2. N in f represents the number of genes in the gene family.
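The formula in panel f is not reproduced in this text, so the following Python sketch only illustrates how the quantities named in the caption could be combined. It assumes a Jaccard-style form, D = 1 - (1/N) Σ C_i / (τ1i + τ2i - C_i), and treats genomes as circular; both are assumptions, and `neighbor_sets` and `gtn_like_distance` are hypothetical names, not functions of the GTN tool.

```python
def neighbor_sets(gene_order):
    """Adjacent gene families of each family in a circular gene order."""
    nbrs = {}
    n = len(gene_order)
    for i, fam in enumerate(gene_order):
        nbrs.setdefault(fam, set())
        for other in (gene_order[i - 1], gene_order[(i + 1) % n]):
            if other != fam:  # ignore self-adjacency of tandem copies
                nbrs[fam].add(other)
    return nbrs


def gtn_like_distance(order1, order2):
    """Assumed Jaccard-style distance over per-family neighbour sets."""
    net1, net2 = neighbor_sets(order1), neighbor_sets(order2)
    families = set(net1) | set(net2)      # N nonredundant families
    if not families:
        return 0.0
    total = 0.0
    for fam in families:
        adj1 = net1.get(fam, set())       # tau_1i = len(adj1)
        adj2 = net2.get(fam, set())       # tau_2i = len(adj2)
        common = len(adj1 & adj2)         # C_i
        union = len(adj1) + len(adj2) - common
        # A family with no neighbours in either genome counts as identical.
        total += common / union if union else 1.0
    return 1.0 - total / len(families)


# Toy gene orders (hypothetical family names).
g1 = ["a", "b", "c", "d"]
g2 = ["a", "c", "b", "d"]   # b and c swapped
d_same = gtn_like_distance(g1, g1)
d_swap = gtn_like_distance(g1, g2)
```

Under this assumed form, identical gene orders give a distance of 0, and every rearrangement that changes a family's neighbours raises the distance towards 1.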
Fig. 2 Example of the topology networks of two genomes. a Location information for genes in the hypothetical genome A on the basis of the GFF file. b Location information for genes in the hypothetical genome B on the basis of the GFF file. c Gene order of orthologous genes in genome A after clustering. d Gene order of orthologous genes in genome B after clustering. Compared with that in genome A, the gene order of ortho3 and ortho4 is changed, and an ortho2 gene is deleted. e Topology network of genome A. f Topology network of genome B.
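The construction described in this caption can be sketched in Python. The gene orders below are hypothetical stand-ins for panels c and d (the actual orders appear only in the figure), genomes are assumed to be circular, and `topology_network` and `edges` are illustrative names rather than part of the GTN tool.

```python
from collections import defaultdict


def topology_network(gene_order, circular=True):
    """Map each gene family to the set of families adjacent to it."""
    network = defaultdict(set)
    n = len(gene_order)
    last = n if circular else n - 1
    for i in range(last):
        a, b = gene_order[i], gene_order[(i + 1) % n]
        if a != b:  # ignore self-adjacency of tandem copies
            network[a].add(b)
            network[b].add(a)
    return dict(network)


def edges(network):
    """Undirected node connections of a topology network."""
    return {frozenset((a, b)) for a, nbrs in network.items() for b in nbrs}


# Hypothetical gene orders after clustering (not the figure's actual data).
genome_a = ["ortho1", "ortho2", "ortho3", "ortho4", "ortho5", "ortho2", "ortho6"]
# Genome B: ortho3 and ortho4 swapped, one ortho2 copy deleted.
genome_b = ["ortho1", "ortho2", "ortho4", "ortho3", "ortho5", "ortho6"]

net_a = topology_network(genome_a)
net_b = topology_network(genome_b)

only_a = edges(net_a) - edges(net_b)   # connections lost in genome B
only_b = edges(net_b) - edges(net_a)   # connections gained in genome B
```

Because both copies of ortho2 collapse into one node, the deletion in genome B shows up as missing connections around that node rather than as a missing node.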
Time required by the different methods
| method | time |
|---|---|
| GTN (BLAST+MCL) | 25.3 h |
| GTN (CD-HIT+DIAMOND) | 30 min |
| GTN (Roary input) | 110 min |
| SNP | 11.6 h |
| First version of GTN | 50 min |
The tools in brackets represent different methods for performing gene family assignment
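To put the timings in the table on a common scale, the short Python snippet below converts them to minutes and expresses each as a multiple of the fastest run. The values are copied from the table; the variable names are illustrative.

```python
# Run times from the table, converted to minutes.
minutes = {
    "GTN (BLAST+MCL)": 25.3 * 60,
    "GTN (CD-HIT+DIAMOND)": 30,
    "GTN (Roary input)": 110,
    "SNP": 11.6 * 60,
    "First version of GTN": 50,
}

fastest = min(minutes, key=minutes.get)
ratios = {name: t / minutes[fastest] for name, t in minutes.items()}

# List methods from fastest to slowest with their relative run time.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.1f}x the run time of {fastest}")
```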
Fig. 3 COG-based phylogenetic tree of the complete genome group. The number following each strain is the number of genes at unique node connections. The number following each of the six main clades is the number of genes at unique node connections that can be found in all genomes of the clade. The first red number before “|” in a cross is the length (kb) of the pieces that are connected based on the common node connections in the genomes of the clade. The second red number is the number of pieces.
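One way to read "unique node connections found in all genomes of the clade" is as a set operation on adjacency edges. The Python sketch below assumes that a connection is unique to a clade when it occurs in every genome of the clade and in no genome outside it. That definition, the circular-genome treatment, the toy gene orders and the function names are all assumptions made for illustration, not the GTN tool's documented behaviour.

```python
def adjacency_edges(gene_order):
    """Undirected connections between neighbouring families (circular genome)."""
    n = len(gene_order)
    return {
        frozenset((gene_order[i], gene_order[(i + 1) % n]))
        for i in range(n)
        if gene_order[i] != gene_order[(i + 1) % n]
    }


def clade_unique_edges(genomes, clade):
    """Edges shared by every genome in `clade` and absent from all others."""
    inside = [adjacency_edges(genomes[g]) for g in clade]
    outside = [adjacency_edges(o) for g, o in genomes.items() if g not in clade]
    shared = set.intersection(*inside) if inside else set()
    return shared - set().union(*outside)


# Toy strains with single-letter gene families (hypothetical data).
genomes = {
    "s1": ["a", "b", "c", "d", "e"],
    "s2": ["a", "b", "c", "d", "e"],
    "s3": ["a", "c", "b", "d", "e"],
    "s4": ["a", "c", "b", "d", "e"],
}

unique = clade_unique_edges(genomes, {"s1", "s2"})
genes_at_unique = set().union(*unique)   # families touching a unique connection
```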
COG functional classification of genes at the unique node connections of the six main clades
| Function classification | clade A | clade B | clade C | clade D | clade E | clade F | total |
|---|---|---|---|---|---|---|---|
| [S] Function unknown | 159 | 141 | 44 | 44 | 52 | 14 | 454 |
| [R] General function prediction only | 127 | 123 | 35 | 38 | 30 | 9 | 362 |
| [G] Carbohydrate transport and metabolism | 73 | 108 | 25 | 22 | 38 | 3 | 269 |
| [L] Replication, recombination and repair | 68 | 58 | 29 | 44 | 32 | 26 | 257 |
| [J] Translation, ribosomal structure and biogenesis | 82 | 58 | 16 | 28 | 18 | 6 | 208 |
| [M] Cell wall/membrane/envelope biogenesis | 58 | 62 | 34 | 32 | 12 | 6 | 204 |
| [K] Transcription | 65 | 54 | 22 | 33 | 19 | 5 | 198 |
| [E] Amino acid transport and metabolism | 66 | 70 | 13 | 16 | 16 | 5 | 186 |
| [H] Coenzyme transport and metabolism | 42 | 62 | 16 | 22 | 14 | 6 | 162 |
| [P] Inorganic ion transport and metabolism | 45 | 54 | 23 | 18 | 10 | 3 | 153 |
| [F] Nucleotide transport and metabolism | 53 | 46 | 9 | 4 | 12 | 3 | 127 |
| [O] Posttranslational modification, protein turnover, chaperones | 24 | 33 | 10 | 4 | 12 | 3 | 86 |
| [U] Intracellular trafficking, secretion, and vesicular transport | 20 | 33 | 14 | 4 | 4 | 3 | 78 |
| [V] Defense mechanisms | 26 | 20 | 5 | 10 | 9 | 5 | 75 |
| [D] Cell cycle control, cell division, chromosome partitioning | 14 | 21 | 17 | 8 | 4 | 4 | 68 |
| [T] Signal transduction mechanisms | 21 | 20 | 5 | 4 | 10 | 2 | 62 |
| [C] Energy production and conversion | 18 | 17 | 3 | 2 | 4 | 1 | 45 |
| [I] Lipid transport and metabolism | 14 | 18 | 5 | 2 | 2 | 0 | 41 |
| [Q] Secondary metabolites biosynthesis, transport and catabolism | 6 | 8 | 0 | 0 | 0 | 0 | 14 |
| [N] Cell motility | 0 | 3 | 2 | 0 | 0 | 0 | 5 |
| total | 981 | 1009 | 327 | 335 | 298 | 104 | |
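Raw counts are hard to compare across clades of very different size, so a short Python calculation can express the three categories highlighted in the abstract ([G], [L] and [J]) as a share of each clade's total. The counts and totals are copied from the table above; the variable names are illustrative.

```python
clades = ["A", "B", "C", "D", "E", "F"]

# Column totals from the last row of the table.
clade_totals = dict(zip(clades, [981, 1009, 327, 335, 298, 104]))

# Counts for the three COG categories named in the abstract.
counts = {
    "G": [73, 108, 25, 22, 38, 3],
    "L": [68, 58, 29, 44, 32, 26],
    "J": [82, 58, 16, 28, 18, 6],
}

# Share of each category within each clade.
shares = {
    cat: {c: n / clade_totals[c] for c, n in zip(clades, row)}
    for cat, row in counts.items()
}

for cat, per_clade in shares.items():
    line = ", ".join(f"{c}: {100 * s:.1f}%" for c, s in per_clade.items())
    print(f"[{cat}] {line}")
```

For example, [L] accounts for 26 of the 104 genes in clade F, a quarter of that clade's total.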
Pathway enrichment of the genes at the unique node connections of the six main clades
| clade | pathway | p-value | number of genes |
|---|---|---|---|
| A | sag01100: Metabolic pathways | 5.2E-06 | 197 |
| | sag01110: Biosynthesis of secondary metabolites | 0.0053 | 81 |
| | sag00230: Purine metabolism | 0.013 | 43 |
| | sag00564: Glycerophospholipid metabolism | 0.028 | 12 |
| | sag00550: Peptidoglycan biosynthesis | 0.031 | 22 |
| | sag00561: Glycerolipid metabolism | 0.045 | 13 |
| | sag00680: Methane metabolism | 0.045 | 14 |
| B | sag01100: Metabolic pathways | 0.0014 | 200 |
| | sag01110: Biosynthesis of secondary metabolites | 0.0073 | 89 |
| | sag00564: Glycerophospholipid metabolism | 0.0074 | 18 |
| | sag03060: Protein export | 0.015 | 18 |
| | sag00052: Galactose metabolism | 0.016 | 31 |
| C | sag01100: Metabolic pathways | 0.0014 | 48 |
| | sag01110: Biosynthesis of secondary metabolites | 0.0073 | 17 |
| | sag00564: Glycerophospholipid metabolism | 0.0074 | 1 |
| | sag03060: Protein export | 0.015 | 7 |
| | sag00052: Galactose metabolism | 0.016 | 5 |
| D | sag01100: Metabolic pathways | 0.067 | 44 |
| E | sag01100: Metabolic pathways | 0.015 | 50 |
| F | sag01100: Metabolic pathways | 0.021 | 13 |
COG families with an average DD value > 4 in the complete genome group. DD/str: average DD value. Para/str: average gene number for each genome. The COGs in red are included in both Tab. 4 and Tab. 5.
COG families with an average DD value > 2 in the complete and draft genome group. DD/str: average DD value. Para/str: average gene number for each genome. The COGs in red are included in both Tab. 4 and Tab. 5.
Difference between the GTN and SNP-based methods
| | GTN method | SNP method |
|---|---|---|
| Input file(s) | Fna, faa and gff format files | Gbk format file |
| Calculation region | Whole genome or common synteny block | Single-copy core genes |
| Evolutionary evidence | Gene order | SNP |
| Method for phylogenetic tree | Neighbour-joining | Maximum likelihood |
| What can be obtained | Neighbour-joining tree; genes at unique node connections; relative DD list; gene indel information; gene clusters (COG); common ancestor information | Maximum likelihood tree; core gene list; core gene alignment result; gene cluster |
The SNP-based methods refer to the methods used in this study (panX, MAFFT and RAxML). “Genes at unique node connections” includes all genes at unique node connections. All of these genes show an altered gene order, and they provide evidence of genomic evolutionary history (gene indels, duplications and recombination). The results shown in Tab. 2 are mainly based on this information. The “relative DD list” includes the relative DD values of every COG family. The “gene indel” information includes genes in unique COG families or in COG families with different copy numbers. “Common ancestor information” includes the average length and number of fragments of a common ancestor; the red numbers in Fig. 3 are based on these results.